15. Hyperparameter Tuning

Scikit-learn has many approaches to optimizing or tuning the hyperparameters of models. Let’s take a look at how we can use GridSearchCV to search over a space of possible hyperparameter combinations.

15.1. Create data

Let’s create a dummy binary classification dataset.

[1]:
import numpy as np
from sklearn.datasets import make_classification

np.random.seed(37)

X, y = make_classification(**{
    'n_samples': 2000,
    'n_features': 20,
    'n_informative': 2,
    'n_redundant': 2,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 2,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (2000, 20), y shape (2000,)
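
Since we will stratify the cross-validation folds later, it is worth confirming the class balance of the generated data. A quick check, using the numpy imported above:

print(np.bincount(y))  # number of samples per class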

15.2. Tuning Logistic Regression

Let’s try to tune a logistic regression model. The logistic regression model will be referred to as the estimator; it is this estimator’s hyperparameters that we want to optimize. When tuning hyperparameters, we also need a way to split the data, and here we will use StratifiedKFold. Another important input to the grid search is the param_grid argument, which is a dictionary specifying the search space of each hyperparameter. Here, our search space is simple: it is over the regularization strength C. Lastly, we need an optimization criterion, which we specify through the scoring argument; since we pass multiple scorers, the refit argument names the one used to select the final model.

[2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

p = {
    'solver': 'sag',
    'penalty': 'l2',
    'random_state': 37,
    'max_iter': 100
}
estimator = LogisticRegression(**p)

p = {
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
}
cv = StratifiedKFold(**p)

p = {
    'estimator': estimator,
    'cv': cv,
    'param_grid': {
        'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
}
model = GridSearchCV(**p)

model.fit(X, y)
Fitting 5 folds for each of 11 candidates, totalling 55 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of  55 | elapsed:    1.0s remaining:   12.5s
[Parallel(n_jobs=-1)]: Done  16 out of  55 | elapsed:    1.0s remaining:    2.5s
[Parallel(n_jobs=-1)]: Done  28 out of  55 | elapsed:    1.1s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done  40 out of  55 | elapsed:    1.1s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done  52 out of  55 | elapsed:    1.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  55 out of  55 | elapsed:    1.2s finished
[2]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
             estimator=LogisticRegression(random_state=37, solver='sag'),
             n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
                               0.9, 1.0]},
             refit='auc',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)

The best_params_ attribute gives the best combination of hyperparameters found by the search.

[3]:
model.best_params_
[3]:
{'C': 0.4}

The best_score_ attribute gives the mean cross-validated score of the best estimator, computed with the scorer named by refit (here, the AUC).

[4]:
model.best_score_
[4]:
0.9644498503712592

To retrieve the best estimator induced by the search and scoring criteria (refit on the full dataset, since refit is set), access best_estimator_.

[5]:
model.best_estimator_
[5]:
LogisticRegression(C=0.4, random_state=37, solver='sag')
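
Beyond the best_* attributes, the full search results live in cv_results_, with one entry per candidate and per scorer. A small sketch of inspecting it, assuming the fitted model from above:

import pandas as pd

# one row per candidate; multimetric scoring yields mean_test_<scorer> columns
res = pd.DataFrame(model.cv_results_)
print(res[['param_C', 'mean_test_auc', 'std_test_auc', 'mean_test_apr']]
      .sort_values('mean_test_auc', ascending=False)
      .head())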

15.3. Tuning Random Forest

Here, we tune a RandomForestClassifier over the number of trees and the split criterion.

[6]:
from sklearn.ensemble import RandomForestClassifier

p = {
    'random_state': 37
}
estimator = RandomForestClassifier(**p)

p = {
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
}
cv = StratifiedKFold(**p)

p = {
    'estimator': estimator,
    'cv': cv,
    'param_grid': {
        'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        'criterion': ['gini', 'entropy']
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
}
model = GridSearchCV(**p)

model.fit(X, y)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done  58 out of 100 | elapsed:    0.6s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done  79 out of 100 | elapsed:    0.9s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    1.2s finished
[6]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
             estimator=RandomForestClassifier(random_state=37), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90,
                                          100]},
             refit='auc',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)
[7]:
model.best_params_
[7]:
{'criterion': 'entropy', 'n_estimators': 50}
[8]:
model.best_score_
[8]:
0.9763199132478311
[9]:
model.best_estimator_
[9]:
RandomForestClassifier(criterion='entropy', n_estimators=50, random_state=37)
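
Since this grid spans two hyperparameters, pivoting cv_results_ into a table makes the interaction easier to read. A small sketch, assuming the fitted model above:

import pandas as pd

# mean AUC for every (n_estimators, criterion) cell of the grid
res = pd.DataFrame(model.cv_results_)
print(res.pivot_table(index='param_n_estimators',
                      columns='param_criterion',
                      values='mean_test_auc'))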

15.4. Tuning with a pipeline

Our estimator can also be a pipeline. For each step in the pipeline, we can specify its portion of the parameter grid using the step__parameter naming convention (for example, pca__n_components).
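
To discover the exact names the grid may reference, list the pipeline’s parameters. A minimal sketch with a throwaway pipeline (its components mirror the cell below):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

demo = Pipeline(steps=[('scaler', MinMaxScaler()), ('pca', PCA()),
                       ('rf', RandomForestClassifier())])

# every tunable key, in step__parameter form
print([k for k in demo.get_params() if '__' in k][:5])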

[10]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
pca = PCA()
rf = RandomForestClassifier(**{
    'random_state': 37
})
pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])

cv = StratifiedKFold(**{
    'n_splits': 5,
    'shuffle': True,
    'random_state': 37
})

model = GridSearchCV(**{
    'estimator': pipeline,
    'cv': cv,
    'param_grid': {
        'scaler__feature_range': [(0, 1), (0, 2)],
        'pca__n_components': [2, 3, 4, 5, 10, 11, 12, 15],
        'rf__n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        'rf__criterion': ['gini', 'entropy']
    },
    'scoring': {
        'auc': 'roc_auc',
        'apr': 'average_precision'
    },
    'verbose': 5,
    'refit': 'auc',
    'error_score': np.nan,
    'n_jobs': -1
})

model.fit(X, y)
Fitting 5 folds for each of 320 candidates, totalling 1600 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 132 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 384 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 708 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done 1104 tasks      | elapsed:    7.4s
[Parallel(n_jobs=-1)]: Done 1600 out of 1600 | elapsed:   12.3s finished
[10]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
             estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                       ('pca', PCA()),
                                       ('rf',
                                        RandomForestClassifier(random_state=37))]),
             n_jobs=-1,
             param_grid={'pca__n_components': [2, 3, 4, 5, 10, 11, 12, 15],
                         'rf__criterion': ['gini', 'entropy'],
                         'rf__n_estimators': [10, 20, 30, 40, 50, 60, 70, 80,
                                              90, 100],
                         'scaler__feature_range': [(0, 1), (0, 2)]},
             refit='auc',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)
[11]:
model.best_params_
[11]:
{'pca__n_components': 3,
 'rf__criterion': 'entropy',
 'rf__n_estimators': 70,
 'scaler__feature_range': (0, 1)}
[12]:
model.best_score_
[12]:
0.9710898858096453
[13]:
model.best_estimator_
[13]:
Pipeline(steps=[('scaler', MinMaxScaler()), ('pca', PCA(n_components=3)),
                ('rf',
                 RandomForestClassifier(criterion='entropy', n_estimators=70,
                                        random_state=37))])
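
The refit best_estimator_ is itself a pipeline, so its fitted steps can be inspected by name. A small sketch, assuming the fitted model above:

# e.g. the variance captured by the 3 retained principal components
best_pipeline = model.best_estimator_
print(best_pipeline.named_steps['pca'].explained_variance_ratio_)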

15.5. Validation with tuning

In some cases, you might want to validate the hyperparameter tuning itself as part of your learning process. In this example, the grid search is nested inside an outer cross-validation loop. Here are some things to note.

  • The data generated will be multiclass.

  • We will implement custom scorers. The average precision score does not natively handle multiclass labels, so we have to transform the ground-truth labels into a one-hot encoded matrix (see the short sketch after this list).
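
A minimal sketch of that one-hot transform, using label_binarize as an equivalent alternative to the OneHotEncoder used below:

import numpy as np
from sklearn.preprocessing import label_binarize

# average_precision_score expects binary indicator targets; a one-hot view
# of the multiclass labels makes the micro/macro averages well defined
y_true = np.array([0, 2, 1, 2])
print(label_binarize(y_true, classes=[0, 1, 2]))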

Now let’s generate some data.

[14]:
X, y = make_classification(**{
    'n_samples': 1000,
    'n_features': 10,
    'n_clusters_per_class': 1,
    'n_classes': 3,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (1000, 10), y shape (1000,)

Below, we create a model that is a grid search over a pipeline ending in a random forest. Note how we use the make_scorer() function to create custom scorers.

[15]:
from sklearn.metrics import roc_auc_score, average_precision_score, make_scorer
from sklearn.preprocessing import OneHotEncoder

def apr_score(y_true, y_pred, average='micro'):
    # average_precision_score expects binary indicator targets, so one-hot
    # encode the multiclass ground-truth labels before scoring
    encoder = OneHotEncoder()
    Y = encoder.fit_transform(y_true.reshape(-1, 1)).todense()

    return average_precision_score(Y, y_pred, average=average)

def get_model():
    scaler = MinMaxScaler()
    pca = PCA()
    rf = RandomForestClassifier(**{
        'random_state': 37
    })
    pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])

    cv = StratifiedKFold(**{
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    })

    auc_scorer = make_scorer(
        roc_auc_score,
        greater_is_better=True,
        needs_proba=True,
        multi_class='ovo')
    apr_scorer_macro = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='macro')
    apr_scorer_micro = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='micro')
    apr_scorer_weighted = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='weighted')

    model = GridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': {
            'scaler__feature_range': [(0, 1), (0, 2)],
            'pca__n_components': [2, 3, 4, 5, 10, 11, 12, 15],
            'rf__n_estimators': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
            'rf__criterion': ['gini', 'entropy']
        },
        'scoring': {
            'auc': auc_scorer,
            'apr_scorer_macro': apr_scorer_macro,
            'apr_scorer_micro': apr_scorer_micro,
            'apr_scorer_weighted': apr_scorer_weighted
        },
        'verbose': 5,
        'refit': 'apr_scorer_micro',
        'error_score': np.nan,
        'n_jobs': -1
    })
    return model

Now we can perform stratified, k-fold cross-validation while incorporating hyperparameter tuning as part of the validation process: the grid search is fit inside each outer training fold, and scoring is done on the corresponding held-out fold.

[16]:
import warnings
import pandas as pd

warnings.filterwarnings('ignore')

results = []

for tr, te in StratifiedKFold(random_state=37, shuffle=True, n_splits=10).split(X, y):
    X_tr, X_te = X[tr], X[te]
    y_tr, y_te = y[tr], y[te]

    model = get_model()
    model.fit(X_tr, y_tr)

    y_pred = model.predict_proba(X_te)

    auc_ovr = roc_auc_score(y_te, y_pred, multi_class='ovr')
    auc_ovo = roc_auc_score(y_te, y_pred, multi_class='ovo')
    apr_macro = apr_score(y_te, y_pred, average='macro')
    apr_micro = apr_score(y_te, y_pred, average='micro')
    apr_weighted = apr_score(y_te, y_pred, average='weighted')

    results.append({
        'auc_ovr': auc_ovr,
        'auc_ovo': auc_ovo,
        'apr_macro': apr_macro,
        'apr_micro': apr_micro,
        'apr_weighted': apr_weighted
    })

rdf = pd.DataFrame(results)
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 132 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 704 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed:    2.0s finished
... (the same fitting log repeats for each of the 10 outer folds)
[17]:
rdf.mean()
[17]:
auc_ovr         0.998931
auc_ovo         0.998932
apr_macro       0.997529
apr_micro       0.997535
apr_weighted    0.997533
dtype: float64
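
The mean alone hides fold-to-fold variation. A quick check of the spread, assuming the rdf computed above:

# mean and standard deviation across the 10 outer folds
print(rdf.agg(['mean', 'std']).T)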

15.6. tune-sklearn

tune-sklearn is a drop-in replacement for scikit-learn’s hyperparameter search classes, built on Ray Tune. It promises to find good hyperparameters faster through smarter search strategies and early stopping of unpromising candidates.

[18]:
from tune_sklearn import TuneGridSearchCV

def get_model():
    scaler = MinMaxScaler()
    pca = PCA()
    rf = RandomForestClassifier(**{
        'random_state': 37
    })
    pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])

    cv = StratifiedKFold(**{
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    })

    auc_scorer = make_scorer(
        roc_auc_score,
        greater_is_better=True,
        needs_proba=True,
        multi_class='ovo')
    apr_scorer_macro = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='macro')
    apr_scorer_micro = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='micro')
    apr_scorer_weighted = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='weighted')

    model = TuneGridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': {
            'scaler__feature_range': [(0, 1)],
            'pca__n_components': [2, 3, 4, 5],
            'rf__criterion': ['gini', 'entropy']
        },
        'scoring': {
            'auc': auc_scorer,
            'apr_scorer_macro': apr_scorer_macro,
            'apr_scorer_micro': apr_scorer_micro,
            'apr_scorer_weighted': apr_scorer_weighted
        },
        'verbose': 1,
        'refit': 'apr_scorer_micro',
        'error_score': np.nan,
        'n_jobs': -1,
        'early_stopping': 'MedianStoppingRule',
        'max_iters': 10
    })
    return model
[19]:
results = []

for tr, te in StratifiedKFold(random_state=37, shuffle=True, n_splits=5).split(X, y):
    X_tr, X_te = X[tr], X[te]
    y_tr, y_te = y[tr], y[te]

    model = get_model()
    model.fit(X_tr, y_tr)

    y_pred = model.predict_proba(X_te)

    auc_ovr = roc_auc_score(y_te, y_pred, multi_class='ovr')
    auc_ovo = roc_auc_score(y_te, y_pred, multi_class='ovo')
    apr_macro = apr_score(y_te, y_pred, average='macro')
    apr_micro = apr_score(y_te, y_pred, average='micro')
    apr_weighted = apr_score(y_te, y_pred, average='weighted')

    results.append({
        'auc_ovr': auc_ovr,
        'auc_ovo': auc_ovo,
        'apr_macro': apr_macro,
        'apr_micro': apr_micro,
        'apr_weighted': apr_weighted
    })

rdf = pd.DataFrame(results)
== Status ==
Memory usage on this node: 5.3/50.1 GiB
Using MedianStoppingRule: num_stopped=0.
Resources requested: 0/32 CPUs, 0/0 GPUs, 0.0/28.61 GiB heap, 0.0/9.86 GiB objects
Result logdir: /root/ray_results/_PipelineTrainable_2022-05-07_13-29-16
Number of trials: 8/8 (8 TERMINATED)

[20]:
rdf.mean()
[20]:
auc_ovr         0.998309
auc_ovo         0.998308
apr_macro       0.996224
apr_micro       0.996087
apr_weighted    0.996239
dtype: float64
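
tune-sklearn also provides TuneSearchCV, which samples candidates rather than exhaustively enumerating a grid. The sketch below is hypothetical and assumes tune-sklearn’s documented interface, where a (low, high) tuple defines a sampling range, n_trials caps the number of candidates, and search_optimization selects the search algorithm (Bayesian search requires scikit-optimize to be installed):

from tune_sklearn import TuneSearchCV
from sklearn.ensemble import RandomForestClassifier

# hypothetical sketch: Bayesian search over a sampled range instead of a grid
model = TuneSearchCV(
    estimator=RandomForestClassifier(random_state=37),
    param_distributions={
        'n_estimators': (10, 100)  # (low, high) sampling range
    },
    n_trials=20,                   # number of sampled candidates
    search_optimization='bayesian',
    scoring='accuracy',            # a multiclass-safe scorer
    cv=5,
    random_state=37
)
model.fit(X, y)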