15. Hyperparameter Tuning
Scikit-learn offers several approaches to tuning the hyperparameters of a model. Let’s take a look at how we can use GridSearchCV to search over a space of possible hyperparameter combinations.
15.1. Create data
Let’s create a dummy binary classification dataset.
[1]:
import numpy as np
from sklearn.datasets import make_classification
np.random.seed(37)
X, y = make_classification(**{
'n_samples': 2000,
'n_features': 20,
'n_informative': 2,
'n_redundant': 2,
'n_repeated': 0,
'n_classes': 2,
'n_clusters_per_class': 2,
'random_state': 37
})
print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (2000, 20), y shape (2000,)
15.2. Tuning Logistic Regression
Let’s tune a logistic regression model. The logistic regression model is referred to as the estimator; it is this estimator’s hyperparameters that we want to optimize. When tuning hyperparameters, we also need a way to split the data, and here we use StratifiedKFold. Another important input to the grid search is the param_grid argument, which is a dictionary specifying the search space of each hyperparameter. Here, our search space is simple: it covers only C, which in scikit-learn is the inverse of the regularization strength. Lastly, we need an optimization criterion, which we specify through the scoring argument. Because we supply two metrics, the refit argument names the one used to select the best model.
[2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
p = {
'solver': 'sag',
'penalty': 'l2',
'random_state': 37,
'max_iter': 100
}
estimator = LogisticRegression(**p)
p = {
'n_splits': 5,
'shuffle': True,
'random_state': 37
}
cv = StratifiedKFold(**p)
p = {
'estimator': estimator,
'cv': cv,
'param_grid': {
'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
},
'scoring': {
'auc': 'roc_auc',
'apr': 'average_precision'
},
'verbose': 5,
'refit': 'auc',
'error_score': np.nan,
'n_jobs': -1
}
model = GridSearchCV(**p)
model.fit(X, y)
Fitting 5 folds for each of 11 candidates, totalling 55 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 55 | elapsed: 1.0s remaining: 12.5s
[Parallel(n_jobs=-1)]: Done 16 out of 55 | elapsed: 1.0s remaining: 2.5s
[Parallel(n_jobs=-1)]: Done 28 out of 55 | elapsed: 1.1s remaining: 1.0s
[Parallel(n_jobs=-1)]: Done 40 out of 55 | elapsed: 1.1s remaining: 0.4s
[Parallel(n_jobs=-1)]: Done 52 out of 55 | elapsed: 1.1s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 55 out of 55 | elapsed: 1.2s finished
[2]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
estimator=LogisticRegression(random_state=37, solver='sag'),
n_jobs=-1,
param_grid={'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
0.9, 1.0]},
refit='auc',
scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)
The best_params_ attribute gives the best combination of hyperparameters.
[3]:
model.best_params_
[3]:
{'C': 0.4}
The best_score_ attribute gives the mean cross-validated score of the best combination, measured with the refit metric (here, auc).
[4]:
model.best_score_
[4]:
0.9644498503712592
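The best score is only a single number. As a minimal sketch of how to compare every candidate on both metrics, the fitted search object also exposes cv_results_, which can be loaded into a DataFrame. The dataset sizes, the shorter grid, and the names X_demo, y_demo, search, and summary are illustrative assumptions, chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small stand-in dataset (hypothetical sizes, not the tutorial's data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=37)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=37),
    param_grid={'C': [0.01, 0.1, 1.0]},
    scoring={'auc': 'roc_auc', 'apr': 'average_precision'},
    refit='auc',
)
search.fit(X_demo, y_demo)

# One row per candidate; with several metrics, each column name
# carries the scorer's key ('auc' or 'apr') as a suffix.
summary = pd.DataFrame(search.cv_results_)[
    ['param_C', 'mean_test_auc', 'rank_test_auc', 'mean_test_apr', 'rank_test_apr']
]
print(summary.sort_values('rank_test_auc'))
```

Sorting by rank_test_auc puts the candidate behind best_params_ first; the apr columns show whether the second metric would have chosen differently.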
To retrieve the best estimator found by the search and scoring criterion, access best_estimator_.
[5]:
model.best_estimator_
[5]:
LogisticRegression(C=0.4, random_state=37, solver='sag')
15.3. Tuning Random Forest
Here, we tune a RandomForestClassifier, searching over the number of trees and the split criterion.
[6]:
from sklearn.ensemble import RandomForestClassifier
p = {
'random_state': 37
}
estimator = RandomForestClassifier(**p)
p = {
'n_splits': 5,
'shuffle': True,
'random_state': 37
}
cv = StratifiedKFold(**p)
p = {
'estimator': estimator,
'cv': cv,
'param_grid': {
'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
'criterion': ['gini', 'entropy']
},
'scoring': {
'auc': 'roc_auc',
'apr': 'average_precision'
},
'verbose': 5,
'refit': 'auc',
'error_score': np.nan,
'n_jobs': -1
}
model = GridSearchCV(**p)
model.fit(X, y)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
[Parallel(n_jobs=-1)]: Done 58 out of 100 | elapsed: 0.6s remaining: 0.5s
[Parallel(n_jobs=-1)]: Done 79 out of 100 | elapsed: 0.9s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 1.2s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 1.2s finished
[6]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
estimator=RandomForestClassifier(random_state=37), n_jobs=-1,
param_grid={'criterion': ['gini', 'entropy'],
'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90,
100]},
refit='auc',
scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)
[7]:
model.best_params_
[7]:
{'criterion': 'entropy', 'n_estimators': 50}
[8]:
model.best_score_
[8]:
0.9763199132478311
[9]:
model.best_estimator_
[9]:
RandomForestClassifier(criterion='entropy', n_estimators=50, random_state=37)
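The log above reports 20 candidates and 100 fits. As a small sketch of where those numbers come from, ParameterGrid enumerates the Cartesian product of the grid, and each candidate is fitted once per fold. The variable names candidates and n_splits are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'criterion': ['gini', 'entropy'],
}

# Every combination of the listed values is one candidate.
candidates = list(ParameterGrid(param_grid))
n_splits = 5

print(len(candidates))             # 10 * 2 = 20 candidates
print(len(candidates) * n_splits)  # 20 * 5 = 100 cross-validation fits
```

Because the count is a product, adding one more hyperparameter with k values multiplies the number of fits by k, which is worth checking before launching a large search.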
15.4. Tuning with a pipeline
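Our estimator can also be a pipeline. The parameters of each step in the pipeline can be included in the parameter grid by prefixing the parameter name with the step name and a double underscore (for example, pca__n_components).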
[10]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
pca = PCA()
rf = RandomForestClassifier(**{
'random_state': 37
})
pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])
cv = StratifiedKFold(**{
'n_splits': 5,
'shuffle': True,
'random_state': 37
})
model = GridSearchCV(**{
'estimator': pipeline,
'cv': cv,
'param_grid': {
'scaler__feature_range': [(0, 1), (0, 2)],
'pca__n_components': [2, 3, 4, 5, 10, 11, 12, 15],
'rf__n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
'rf__criterion': ['gini', 'entropy']
},
'scoring': {
'auc': 'roc_auc',
'apr': 'average_precision'
},
'verbose': 5,
'refit': 'auc',
'error_score': np.nan,
'n_jobs': -1
})
model.fit(X, y)
Fitting 5 folds for each of 320 candidates, totalling 1600 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.8s
[Parallel(n_jobs=-1)]: Done 384 tasks | elapsed: 2.1s
[Parallel(n_jobs=-1)]: Done 708 tasks | elapsed: 4.2s
[Parallel(n_jobs=-1)]: Done 1104 tasks | elapsed: 7.4s
[Parallel(n_jobs=-1)]: Done 1600 out of 1600 | elapsed: 12.3s finished
[10]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
('pca', PCA()),
('rf',
RandomForestClassifier(random_state=37))]),
n_jobs=-1,
param_grid={'pca__n_components': [2, 3, 4, 5, 10, 11, 12, 15],
'rf__criterion': ['gini', 'entropy'],
'rf__n_estimators': [10, 20, 30, 40, 50, 60, 70, 80,
90, 100],
'scaler__feature_range': [(0, 1), (0, 2)]},
refit='auc',
scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)
[11]:
model.best_params_
[11]:
{'pca__n_components': 3,
'rf__criterion': 'entropy',
'rf__n_estimators': 70,
'scaler__feature_range': (0, 1)}
[12]:
model.best_score_
[12]:
0.9710898858096453
[13]:
model.best_estimator_
[13]:
Pipeline(steps=[('scaler', MinMaxScaler()), ('pca', PCA(n_components=3)),
('rf',
RandomForestClassifier(criterion='entropy', n_estimators=70,
random_state=37))])
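The step__parameter keys used in the grid above do not have to be guessed. As a minimal sketch, get_params() on a pipeline lists every name that a grid may refer to, and set_params() accepts the same names. The variable name tunable is illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('pca', PCA()),
    ('rf', RandomForestClassifier(random_state=37)),
])

# Nested parameters are exposed as '<step name>__<parameter name>'.
tunable = sorted(k for k in pipeline.get_params() if '__' in k)
print(tunable)

# The same keys work with set_params, which is what the search uses internally
# when it configures each candidate.
pipeline.set_params(pca__n_components=3, rf__criterion='entropy')
```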
15.5. Validation with tuning
In some cases, you might want to validate the hyperparameter tuning itself as a part of your learning process. This section shows how to do so. Here are some things to note in this example.
The data generated will be multiclass.
We will implement custom scorers. The average precision score does not natively handle multiclass labels, so we have to transform the ground-truth labels into one-hot encoded vectors.
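To make the one-hot step concrete, here is a minimal sketch on six hand-made samples. The labels and the probabilities in y_proba are invented for illustration and do not come from any model.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import OneHotEncoder

y_true = np.array([0, 1, 2, 1, 0, 2])
# Hypothetical class probabilities: one row per sample, one column per class.
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# Turn the label vector into a (n_samples, n_classes) indicator matrix,
# so each column is a binary problem that average precision can score.
Y = OneHotEncoder().fit_transform(y_true.reshape(-1, 1)).toarray()
print(Y.shape)

scores = {
    average: average_precision_score(Y, y_proba, average=average)
    for average in ('macro', 'micro', 'weighted')
}
print(scores)
```

The macro average treats every class equally, micro pools all sample-class pairs before scoring, and weighted averages the per-class scores by class frequency.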
Now let’s generate some data.
[14]:
X, y = make_classification(**{
'n_samples': 1000,
'n_features': 10,
'n_clusters_per_class': 1,
'n_classes': 3,
'random_state': 37
})
print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (1000, 10), y shape (1000,)
Below, we create a model that is a grid search over a random forest pipeline. Note how we use the make_scorer() function to create custom scorers.
[15]:
from sklearn.metrics import roc_auc_score, average_precision_score, make_scorer
from sklearn.preprocessing import OneHotEncoder
def apr_score(y_true, y_pred, average='micro'):
    # one-hot encode the ground truth so each class is scored as a binary problem
    encoder = OneHotEncoder()
    Y = encoder.fit_transform(y_true.reshape(-1, 1)).toarray()
    return average_precision_score(Y, y_pred, average=average)

def get_model():
    scaler = MinMaxScaler()
    pca = PCA()
    rf = RandomForestClassifier(**{
        'random_state': 37
    })
    pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])
    cv = StratifiedKFold(**{
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    })
    # custom scorers; needs_proba=True passes predict_proba output to the metric
    auc_scorer = make_scorer(
        roc_auc_score,
        greater_is_better=True,
        needs_proba=True,
        multi_class='ovo')
    apr_scorer_macro = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='macro')
    apr_scorer_micro = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='micro')
    apr_scorer_weighted = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='weighted')
    model = GridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': {
            'scaler__feature_range': [(0, 1), (0, 2)],
            'pca__n_components': [2, 3, 4, 5, 10, 11, 12, 15],
            'rf__n_estimators': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
            'rf__criterion': ['gini', 'entropy']
        },
        'scoring': {
            'auc': auc_scorer,
            'apr_scorer_macro': apr_scorer_macro,
            'apr_scorer_micro': apr_scorer_micro,
            'apr_scorer_weighted': apr_scorer_weighted
        },
        'verbose': 5,
        'refit': 'apr_scorer_micro',
        'error_score': np.nan,
        'n_jobs': -1
    })
    return model
Now we can perform stratified, k-fold cross-validation while incorporating hyperparameter tuning as a part of the validation process.
[16]:
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
results = []
# outer loop: each held-out fold is scored by a model tuned only on the other folds
for tr, te in StratifiedKFold(random_state=37, shuffle=True, n_splits=10).split(X, y):
    X_tr, X_te = X[tr], X[te]
    y_tr, y_te = y[tr], y[te]
    model = get_model()
    model.fit(X_tr, y_tr)
    y_pred = model.predict_proba(X_te)
    auc_ovr = roc_auc_score(y_te, y_pred, multi_class='ovr')
    auc_ovo = roc_auc_score(y_te, y_pred, multi_class='ovo')
    apr_macro = apr_score(y_te, y_pred, average='macro')
    apr_micro = apr_score(y_te, y_pred, average='micro')
    apr_weighted = apr_score(y_te, y_pred, average='weighted')
    results.append({
        'auc_ovr': auc_ovr,
        'auc_ovo': auc_ovo,
        'apr_macro': apr_macro,
        'apr_micro': apr_micro,
        'apr_weighted': apr_weighted
    })
rdf = pd.DataFrame(results)
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.3s
[Parallel(n_jobs=-1)]: Done 704 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed: 2.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.4s
[Parallel(n_jobs=-1)]: Done 704 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed: 2.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.3s
[Parallel(n_jobs=-1)]: Done 704 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed: 2.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.4s
[Parallel(n_jobs=-1)]: Done 704 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 1972 tasks | elapsed: 2.0s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed: 2.2s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.3s
[Parallel(n_jobs=-1)]: Done 704 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed: 2.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.3s
[Parallel(n_jobs=-1)]: Done 704 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed: 2.1s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.3s
[Parallel(n_jobs=-1)]: Done 704 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed: 2.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.3s
[Parallel(n_jobs=-1)]: Done 704 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed: 2.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.4s
[Parallel(n_jobs=-1)]: Done 704 tasks | elapsed: 1.2s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed: 2.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 0.1s
Fitting 5 folds for each of 448 candidates, totalling 2240 fits
[Parallel(n_jobs=-1)]: Done 132 tasks | elapsed: 0.3s
[Parallel(n_jobs=-1)]: Done 704 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 2240 out of 2240 | elapsed: 2.0s finished
[17]:
rdf.mean()
[17]:
auc_ovr 0.998931
auc_ovo 0.998932
apr_macro 0.997529
apr_micro 0.997535
apr_weighted 0.997533
dtype: float64
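The loop above is a form of nested cross-validation: an inner search picks hyperparameters, and an outer split estimates how well the whole tuning procedure generalizes. As a compact sketch of the same idea, a GridSearchCV object can be passed directly to cross_val_score. The dataset sizes, the tiny grid, and the single built-in roc_auc_ovo scorer are simplifying assumptions made to keep the example fast; this is not the tutorial's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Small multiclass stand-in dataset (hypothetical sizes).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    n_clusters_per_class=1, n_classes=3, random_state=37)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=37)

# Inner loop: hyperparameter search, refit on each outer training fold.
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=37),
    param_grid={'n_estimators': [5, 10], 'criterion': ['gini', 'entropy']},
    scoring='roc_auc_ovo',
    cv=inner_cv,
)

# Outer loop: each score comes from data the inner search never saw.
outer_scores = cross_val_score(search, X_demo, y_demo, scoring='roc_auc_ovo', cv=outer_cv)
print(outer_scores.mean(), outer_scores.std())
```

The explicit loop remains the better choice when several metrics must be computed on each held-out fold, as in the cell above.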
15.6. tune-sklearn
tune-sklearn is a drop-in replacement for scikit-learn’s hyperparameter tuning classes. The API promises to find good hyperparameters in less time and in a smarter way, for example by stopping unpromising trials early (see the early_stopping argument below).
[18]:
from tune_sklearn import TuneGridSearchCV
def get_model():
    scaler = MinMaxScaler()
    pca = PCA()
    rf = RandomForestClassifier(**{
        'random_state': 37
    })
    pipeline = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('rf', rf)])
    cv = StratifiedKFold(**{
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    })
    auc_scorer = make_scorer(
        roc_auc_score,
        greater_is_better=True,
        needs_proba=True,
        multi_class='ovo')
    apr_scorer_macro = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='macro')
    apr_scorer_micro = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='micro')
    apr_scorer_weighted = make_scorer(
        apr_score,
        greater_is_better=True,
        needs_proba=True,
        average='weighted')
    # same arguments as GridSearchCV, plus early stopping controls
    model = TuneGridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': {
            'scaler__feature_range': [(0, 1)],
            'pca__n_components': [2, 3, 4, 5],
            'rf__criterion': ['gini', 'entropy']
        },
        'scoring': {
            'auc': auc_scorer,
            'apr_scorer_macro': apr_scorer_macro,
            'apr_scorer_micro': apr_scorer_micro,
            'apr_scorer_weighted': apr_scorer_weighted
        },
        'verbose': 1,
        'refit': 'apr_scorer_micro',
        'error_score': np.nan,
        'n_jobs': -1,
        'early_stopping': 'MedianStoppingRule',
        'max_iters': 10
    })
    return model
[19]:
results = []
for tr, te in StratifiedKFold(random_state=37, shuffle=True, n_splits=5).split(X, y):
    X_tr, X_te = X[tr], X[te]
    y_tr, y_te = y[tr], y[te]
    model = get_model()
    model.fit(X_tr, y_tr)
    y_pred = model.predict_proba(X_te)
    auc_ovr = roc_auc_score(y_te, y_pred, multi_class='ovr')
    auc_ovo = roc_auc_score(y_te, y_pred, multi_class='ovo')
    apr_macro = apr_score(y_te, y_pred, average='macro')
    apr_micro = apr_score(y_te, y_pred, average='micro')
    apr_weighted = apr_score(y_te, y_pred, average='weighted')
    results.append({
        'auc_ovr': auc_ovr,
        'auc_ovo': auc_ovo,
        'apr_macro': apr_macro,
        'apr_micro': apr_micro,
        'apr_weighted': apr_weighted
    })
rdf = pd.DataFrame(results)
Memory usage on this node: 5.3/50.1 GiB
Using MedianStoppingRule: num_stopped=0.
Resources requested: 0/32 CPUs, 0/0 GPUs, 0.0/28.61 GiB heap, 0.0/9.86 GiB objects
Result logdir: /root/ray_results/_PipelineTrainable_2022-05-07_13-29-16
Number of trials: 8/8 (8 TERMINATED)
[20]:
rdf.mean()
[20]:
auc_ovr 0.998309
auc_ovo 0.998308
apr_macro 0.996224
apr_micro 0.996087
apr_weighted 0.996239
dtype: float64
15.7. Pipelines, column transformers, grid search
15.7.1. Simple
[21]:
df = pd.DataFrame({
'text': ['pizza apple orange', 'potato tomato greens pizza', 'computer monitor', 'mouse keyboard'],
'hand': ['left', 'right', np.nan, 'left'],
'gender': ['m', 'f', 'f', 'm'],
'age': [22.2, 32.3, 44.4, 55.5],
'y': [1, 1, 0, 0]
})
df
[21]:
| | text | hand | gender | age | y |
|---|---|---|---|---|---|
| 0 | pizza apple orange | left | m | 22.2 | 1 |
| 1 | potato tomato greens pizza | right | f | 32.3 | 1 |
| 2 | computer monitor | NaN | f | 44.4 | 0 |
| 3 | mouse keyboard | left | m | 55.5 | 0 |
[22]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer
p0 = Pipeline(steps=[
('impute', SimpleImputer(strategy='constant', fill_value='')),
('reshape', FunctionTransformer(np.reshape, kw_args={'newshape':-1})),
('vectorize', CountVectorizer())
])
p1 = Pipeline(steps=[
('impute', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(drop=['left']))
])
p2 = Pipeline(steps=[('ohe', OneHotEncoder(drop=['f']))])
p4 = Pipeline(steps=[
('impute', SimpleImputer()),
('scale', StandardScaler())
])
t = ColumnTransformer([
('text', p0, [0]),
('hand', p1, [1]),
('gender', p2, [2]),
('age', p4, [3])
], remainder='drop')
T = t.fit_transform(df)
[23]:
t_fields = t.named_transformers_['text'].named_steps['vectorize'].get_feature_names()
h_fields = list(t.named_transformers_['hand'].named_steps['ohe'].get_feature_names())
g_fields = list(t.named_transformers_['gender'].named_steps['ohe'].get_feature_names())
o_fields = ['age']
fields = t_fields + h_fields + g_fields + o_fields
fields
[23]:
['apple',
'computer',
'greens',
'keyboard',
'monitor',
'mouse',
'orange',
'pizza',
'potato',
'tomato',
'x0_right',
'x0_m',
'age']
[24]:
pd.DataFrame(T, columns=fields)
[24]:
| | apple | computer | greens | keyboard | monitor | mouse | orange | pizza | potato | tomato | x0_right | x0_m | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.308967 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | -0.502835 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.462927 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.348874 |
15.7.2. With model
[25]:
p0 = Pipeline(steps=[
('impute', SimpleImputer(strategy='constant', fill_value='')),
('reshape', FunctionTransformer(np.reshape, kw_args={'newshape':-1})),
('vectorize', CountVectorizer())
])
p1 = Pipeline(steps=[
('impute', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(drop=['left']))
])
p2 = Pipeline(steps=[('ohe', OneHotEncoder(drop=['f']))])
p4 = Pipeline(steps=[
('impute', SimpleImputer()),
('scale', StandardScaler())
])
t = ColumnTransformer([
('text', p0, [0]),
('hand', p1, [1]),
('gender', p2, [2]),
('age', p4, [3])
], remainder='drop')
m = Pipeline(steps=[
('preprocess', t),
('regressor', LogisticRegression())
])
X, y = df[[c for c in df.columns if c != 'y']], df['y']
m.fit(X, y);
[26]:
m.predict_proba(X)[:,1]
[26]:
array([0.80466472, 0.78462158, 0.24182963, 0.16886457])
[27]:
pd.concat([
pd.Series(m.named_steps['regressor'].intercept_, ['intercept']),
pd.Series(m.named_steps['regressor'].coef_[0], fields)
])
[27]:
intercept -0.333244
apple 0.195329
computer -0.241834
greens 0.215375
keyboard -0.168860
monitor -0.241834
mouse -0.168860
orange 0.195329
pizza 0.410704
potato 0.215375
tomato 0.215375
x0_right 0.215375
x0_m 0.026470
age -0.703700
dtype: float64
15.7.3. With grid search
[28]:
N = 10
df = pd.DataFrame({
'text': ['pizza apple orange', 'potato tomato greens pizza', 'computer monitor', 'mouse keyboard'] * N,
'hand': ['left', 'right', np.nan, 'left'] * N,
'gender': ['m', 'f', 'f', 'm'] * N,
'age': [22.2, 32.3, 44.4, 55.5] * N,
'y': [1, 1, 0, 0] * N
})
X, y = df[[c for c in df.columns if c != 'y']], df['y']
df.shape, X.shape, y.shape
[28]:
((40, 5), (40, 4), (40,))
[29]:
p0 = Pipeline(steps=[
('impute', SimpleImputer(strategy='constant', fill_value='')),
('reshape', FunctionTransformer(np.reshape, kw_args={'newshape':-1})),
('vectorize', CountVectorizer())
])
p1 = Pipeline(steps=[
('impute', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(drop=['left']))
])
p2 = Pipeline(steps=[('ohe', OneHotEncoder(drop=['f']))])
p4 = Pipeline(steps=[
('impute', SimpleImputer()),
('scale', StandardScaler())
])
t = ColumnTransformer([
('text', p0, [0]),
('hand', p1, [1]),
('gender', p2, [2]),
('age', p4, [3])
], remainder='drop')
e = Pipeline(steps=[
('preprocess', t),
('regressor', LogisticRegression())
])
cv = StratifiedKFold(**{
'n_splits': 5,
'shuffle': True,
'random_state': 37
})
m = GridSearchCV(**{
'estimator': e,
'cv': cv,
'param_grid': {
'regressor__random_state': [29, 37]
},
'scoring': {
'auc': 'roc_auc',
'apr': 'average_precision'
},
'verbose': 5,
'refit': 'auc',
'error_score': np.nan,
'n_jobs': -1
})
m.fit(X, y)
m.best_params_
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 4 out of 10 | elapsed: 0.2s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 7 out of 10 | elapsed: 0.2s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 0.2s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 0.2s finished
[29]:
{'regressor__random_state': 29}
[30]:
m.predict_proba(X)[:,1]
[30]:
array([0.95717104, 0.94940778, 0.06120815, 0.03221605, 0.95717104,
0.94940778, 0.06120815, 0.03221605, 0.95717104, 0.94940778,
0.06120815, 0.03221605, 0.95717104, 0.94940778, 0.06120815,
0.03221605, 0.95717104, 0.94940778, 0.06120815, 0.03221605,
0.95717104, 0.94940778, 0.06120815, 0.03221605, 0.95717104,
0.94940778, 0.06120815, 0.03221605, 0.95717104, 0.94940778,
0.06120815, 0.03221605, 0.95717104, 0.94940778, 0.06120815,
0.03221605, 0.95717104, 0.94940778, 0.06120815, 0.03221605])
[31]:
pd.concat([
pd.Series(m.best_estimator_.named_steps['regressor'].intercept_, ['intercept']),
pd.Series(m.best_estimator_.named_steps['regressor'].coef_[0], fields)
])
[31]:
intercept -0.796608
apple 0.428280
computer -0.612042
greens 0.505914
keyboard -0.322175
monitor -0.612042
mouse -0.322175
orange 0.428280
pizza 0.934194
potato 0.505914
tomato 0.505914
x0_right 0.505914
x0_m 0.106105
age -1.532901
dtype: float64
15.7.4. With random search
[44]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
p0 = Pipeline(steps=[
('impute', SimpleImputer(strategy='constant', fill_value='')),
('reshape', FunctionTransformer(np.reshape, kw_args={'newshape':-1})),
('vectorize', CountVectorizer())
])
p1 = Pipeline(steps=[
('impute', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(drop=['left']))
])
p2 = Pipeline(steps=[('ohe', OneHotEncoder(drop=['f']))])
p4 = Pipeline(steps=[
('impute', SimpleImputer()),
('scale', StandardScaler())
])
t = ColumnTransformer([
('text', p0, [0]),
('hand', p1, [1]),
('gender', p2, [2]),
('age', p4, [3])
], remainder='drop')
e = Pipeline(steps=[
('preprocess', t),
('regressor', LogisticRegression())
])
cv = StratifiedKFold(**{
'n_splits': 5,
'shuffle': True,
'random_state': 37
})
uniform.random_state = 37
randint.random_state = 37
regressor__C_dist = uniform()
regressor__l1_ratio_dist = uniform()
regressor__random_state_dist = randint(5, 40)
regressor__C_dist.random_state = 37
regressor__l1_ratio_dist.random_state = 37
regressor__random_state_dist.random_state = 37
m = RandomizedSearchCV(**{
'estimator': e,
'cv': cv,
'param_distributions': {
'regressor__random_state': regressor__random_state_dist,
'regressor__C': regressor__C_dist,
'regressor__l1_ratio': regressor__l1_ratio_dist
},
'scoring': {
'auc': 'roc_auc',
'apr': 'average_precision'
},
'verbose': 5,
'refit': 'auc',
'error_score': np.nan,
'n_jobs': -1
})
m.fit(X, y)
m.best_params_
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 9 out of 50 | elapsed: 0.1s remaining: 0.6s
[Parallel(n_jobs=-1)]: Done 20 out of 50 | elapsed: 0.1s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 31 out of 50 | elapsed: 0.1s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 42 out of 50 | elapsed: 0.2s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 50 out of 50 | elapsed: 0.2s finished
[44]:
{'regressor__C': 0.8861628575625455,
'regressor__l1_ratio': 0.44678402377577453,
'regressor__random_state': 11}
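The 10 candidates reported in the log are random draws from the supplied distributions (10 is the default n_iter). As a minimal sketch of how such draws can be previewed and made reproducible, ParameterSampler takes the same distributions together with a random_state; RandomizedSearchCV also accepts a random_state argument. The variable names first and second are illustrative.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

param_distributions = {
    'regressor__C': uniform(),                  # continuous on [0, 1]
    'regressor__l1_ratio': uniform(),           # continuous on [0, 1]
    'regressor__random_state': randint(5, 40),  # integers 5..39
}

# Two samplers seeded identically yield the same candidates.
first = list(ParameterSampler(param_distributions, n_iter=10, random_state=37))
second = list(ParameterSampler(param_distributions, n_iter=10, random_state=37))

for candidate in first[:3]:
    print(candidate)
print(first == second)
```

Seeding the sampler in one place avoids having to set random_state on each distribution object separately.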
[45]:
m.cv_results_
[45]:
{'mean_fit_time': array([0.01352501, 0.01620173, 0.01874933, 0.02089486, 0.02180014,
0.02144923, 0.02035055, 0.02146559, 0.02045355, 0.01744499]),
'std_fit_time': array([0.0004663 , 0.00136882, 0.00298622, 0.00060401, 0.00092841,
0.00107438, 0.00205119, 0.00129154, 0.00327625, 0.00285363]),
'mean_score_time': array([0.01998224, 0.01063399, 0.01161842, 0.0110549 , 0.01962199,
0.01082191, 0.00972252, 0.0092669 , 0.00763993, 0.00690565]),
'std_score_time': array([0.02239441, 0.00060964, 0.00199941, 0.0005795 , 0.01610719,
0.00037426, 0.00064848, 0.00127296, 0.00166517, 0.00075786]),
'param_regressor__C': masked_array(data=[0.8861628575625455, 0.2669851105048685,
0.18394258557069376, 0.07618735142456279,
0.5057860075182486, 0.20365586920932577,
0.5846689895186896, 0.7451596080317253,
0.4187299961837905, 0.7330989808813632],
mask=[False, False, False, False, False, False, False, False,
False, False],
fill_value='?',
dtype=object),
'param_regressor__l1_ratio': masked_array(data=[0.44678402377577453, 0.7932726296280042,
0.02419288751414972, 0.33254214738505705,
0.17149139198746954, 0.32804028992640133,
0.338890624803148, 0.04340148929509069,
0.46658684845827336, 0.6470581061933899],
mask=[False, False, False, False, False, False, False, False,
False, False],
fill_value='?',
dtype=object),
'param_regressor__random_state': masked_array(data=[11, 23, 25, 13, 20, 34, 26, 18, 38, 22],
mask=[False, False, False, False, False, False, False, False,
False, False],
fill_value='?',
dtype=object),
'params': [{'regressor__C': 0.8861628575625455,
'regressor__l1_ratio': 0.44678402377577453,
'regressor__random_state': 11},
{'regressor__C': 0.2669851105048685,
'regressor__l1_ratio': 0.7932726296280042,
'regressor__random_state': 23},
{'regressor__C': 0.18394258557069376,
'regressor__l1_ratio': 0.02419288751414972,
'regressor__random_state': 25},
{'regressor__C': 0.07618735142456279,
'regressor__l1_ratio': 0.33254214738505705,
'regressor__random_state': 13},
{'regressor__C': 0.5057860075182486,
'regressor__l1_ratio': 0.17149139198746954,
'regressor__random_state': 20},
{'regressor__C': 0.20365586920932577,
'regressor__l1_ratio': 0.32804028992640133,
'regressor__random_state': 34},
{'regressor__C': 0.5846689895186896,
'regressor__l1_ratio': 0.338890624803148,
'regressor__random_state': 26},
{'regressor__C': 0.7451596080317253,
'regressor__l1_ratio': 0.04340148929509069,
'regressor__random_state': 18},
{'regressor__C': 0.4187299961837905,
'regressor__l1_ratio': 0.46658684845827336,
'regressor__random_state': 38},
{'regressor__C': 0.7330989808813632,
'regressor__l1_ratio': 0.6470581061933899,
'regressor__random_state': 22}],
'split0_test_auc': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'split1_test_auc': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'split2_test_auc': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'split3_test_auc': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'split4_test_auc': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'mean_test_auc': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'std_test_auc': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
'rank_test_auc': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32),
'split0_test_apr': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'split1_test_apr': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'split2_test_apr': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'split3_test_apr': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'split4_test_apr': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'mean_test_apr': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'std_test_apr': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
'rank_test_apr': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)}