11. combo
combo is an API for combining machine learning (ML) models and scores. In particular, combo supports stacking
models. Let’s see how we can use combo as a part of our learning approach.
11.1. Data
First, let’s synthesize the data using make_classification() from Scikit-Learn. The data itself is not interesting, as we just need toy data to illustrate combo and model stacking.
[1]:
import numpy as np
import random
from sklearn.datasets import make_classification
np.random.seed(37)
random.seed(37)
X, y = make_classification(**{
    'n_samples': 2000,
    'n_features': 10,
    'n_informative': 8,
    'n_redundant': 0,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 1,
    'random_state': 37
})
print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (2000, 10), y shape (2000,)
11.2. Stacking
The Stacking class can be used to stack any number of models. The base_estimators argument to the Stacking constructor must contain at least 2 models, or an exception is raised. You can use Stacking as a drop-in replacement for any ordinary model. The example in this section is deliberately elaborate: we not only stack models but also perform grid search (hyper-parameter tuning), and the stacking is done as part of a pipeline. The API fits nicely with Scikit-Learn.
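To make the idea of stacking concrete, here is a minimal conceptual sketch that uses only Scikit-Learn. It is not combo’s implementation; it only illustrates the general technique, in which out-of-fold predictions from the base estimators become extra input features for a meta-classifier. The names X_demo, y_demo, oof and meta_X, the data size and the choice of base estimators are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small toy data set; the sizes are chosen only for illustration.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=8,
    n_redundant=0, random_state=37
)

base_estimators = [DecisionTreeClassifier(random_state=37), KNeighborsClassifier()]

# Out-of-fold predictions: every row is scored by a model that did not
# see that row during training (2 folds here).
oof = np.column_stack([
    cross_val_predict(est, X_demo, y_demo, cv=2, method='predict_proba')[:, 1]
    for est in base_estimators
])

# The meta-classifier is trained on the original features plus one
# column of predictions per base estimator.
meta_X = np.hstack([X_demo, oof])
meta_clf = LogisticRegression(max_iter=1000).fit(meta_X, y_demo)

print(meta_X.shape)  # (300, 12)
```

To predict on new data, a full implementation would also refit each base estimator on all the training rows; that step is left out to keep the sketch short.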
Additionally, if we want to perform grid search on one of the base estimators inside Stacking, we can do so; look at get_logistic_regression_model(), which itself returns a grid search wrapping a pipeline for LogisticRegression. That is why this model is so sophisticated: there is grid search within grid search, and each grid search has its own pipeline!
[10]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from combo.models.classifier_stacking import Stacking

def get_logistic_regression_model():
    # Inner grid search: tunes the penalty of a LogisticRegression pipeline.
    pipeline = Pipeline(steps=[('classifier', LogisticRegression(random_state=37))])

    cv = StratifiedKFold(**{
        'n_splits': 3,
        'shuffle': True,
        'random_state': 37
    })

    # Note: the default 'lbfgs' solver only supports the 'l2' penalty, so
    # fits with 'l1' fail and are scored with error_score (0.0).
    param_grid = {
        'classifier__penalty': ['l1', 'l2']
    }

    model = GridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': param_grid,
        'scoring': {
            'auc': 'roc_auc',
            'apr': 'average_precision'
        },
        'verbose': 5,
        'refit': 'apr',
        'error_score': 0.0,
        'n_jobs': -1
    })

    return model

def get_stacking_model():
    scaler = MinMaxScaler()
    classifier = Stacking(
        base_estimators=[DecisionTreeClassifier(), LogisticRegression()],
        random_state=37
    )

    pipeline = Pipeline(steps=[
        ('scaler', scaler),
        ('classifier', classifier)
    ])

    cv = StratifiedKFold(**{
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    })

    param_grid = {
        'scaler__feature_range': [(0, 1), (0, 2)],
        'classifier__base_estimators': [
            [
                DecisionTreeClassifier(),
                get_logistic_regression_model(),
                KNeighborsClassifier()
            ],
            [
                RandomForestClassifier(),
                GradientBoostingClassifier()
            ],
            [
                DecisionTreeClassifier(),
                get_logistic_regression_model(),
                KNeighborsClassifier(),
                RandomForestClassifier(),
                GradientBoostingClassifier()
            ]
        ],
        'classifier__n_folds': [2, 5],
        'classifier__use_proba': [False, True]
    }

    model = GridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': param_grid,
        'scoring': {
            'auc': 'roc_auc',
            'apr': 'average_precision'
        },
        'verbose': 5,
        'refit': 'apr',
        'error_score': 0.0,
        'n_jobs': -1
    })

    return model
Let’s get the stacking model and fit it.
[11]:
model = get_stacking_model()
model.fit(X, y)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[11]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
error_score=0.0,
estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
('classifier',
Stacking(base_estimators=[DecisionTreeClassifier(), LogisticRegression()],
keep_original=True, meta_clf=LogisticRegression(), n_folds=2,
pre_fitted=None, random_state=37, shuffle_data=False, threshold=No...
param_grid={'classifier__penalty': ['l1',
'l2']},
refit='apr',
scoring={'apr': 'average_precision',
'auc': 'roc_auc'},
verbose=5),
KNeighborsClassifier(),
RandomForestClassifier(),
GradientBoostingClassifier()]],
'classifier__n_folds': [2, 5],
'classifier__use_proba': [False, True],
'scaler__feature_range': [(0, 1), (0, 2)]},
refit='apr',
scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)
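The “24 candidates” and “120 fits” in the log follow directly from the outer parameter grid: 2 feature ranges × 3 lists of base estimators × 2 values of n_folds × 2 values of use_proba, each evaluated on 5 folds. The shorter “3 folds for each of 2 candidates” lines come from the nested logistic regression grid search (2 penalties, 3 folds). As a quick check, the sketch below counts the outer candidates with Scikit-Learn’s ParameterGrid; the strings are placeholders standing in for the actual lists of estimators.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder strings stand in for the three lists of base estimators.
grid = {
    'scaler__feature_range': [(0, 1), (0, 2)],
    'classifier__base_estimators': ['dt+lr+knn', 'rf+gb', 'all five'],
    'classifier__n_folds': [2, 5],
    'classifier__use_proba': [False, True],
}

n_candidates = len(ParameterGrid(grid))  # 2 * 3 * 2 * 2
n_splits = 5                             # folds in the outer StratifiedKFold

print(n_candidates, n_candidates * n_splits)  # 24 120
```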
Let’s look at the performance on the training data.
[12]:
from sklearn.metrics import roc_auc_score, average_precision_score
y_pred = model.predict_proba(X)[:,1]
roc_auc_score(y, y_pred), average_precision_score(y, y_pred)
[12]:
(0.999124, 0.9991472599492524)
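Both metrics take the true labels and a score for the positive class, which is why the second column of predict_proba() is used above. Keep in mind that these numbers are computed on the same data the model was fitted on, so they are optimistic. The tiny example below, with made-up labels and scores, shows how the two functions behave on values small enough to verify by hand.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Made-up labels and positive-class scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# ROC AUC: 3 of the 4 (positive, negative) pairs are ranked correctly.
auc = roc_auc_score(y_true, y_score)

# Average precision: precision 1.0 at the first positive and 2/3 at the
# second, each weighted by a recall step of 0.5.
apr = average_precision_score(y_true, y_score)

print(round(auc, 4), round(apr, 4))  # 0.75 0.8333
```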
Although there is much nesting and stacking of models, we can still navigate to the information we want. The stacking model’s best parameters are easy to look up with model.best_params_. Notice that the grid search selects the stacked classifier with base estimators DecisionTreeClassifier, GridSearchCV and KNeighborsClassifier as the best one. So we have to dig into GridSearchCV to see what the best hyper-parameters and model are for that object.
[13]:
model.best_params_
[13]:
{'classifier__base_estimators': [DecisionTreeClassifier(),
GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=37, shuffle=True),
error_score=0.0,
estimator=Pipeline(steps=[('classifier',
LogisticRegression(random_state=37))]),
n_jobs=-1, param_grid={'classifier__penalty': ['l1', 'l2']},
refit='apr',
scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5),
KNeighborsClassifier()],
'classifier__n_folds': 2,
'classifier__use_proba': False,
'scaler__feature_range': (0, 1)}
The best estimator is retrieved from model.best_estimator_.
[14]:
model.best_estimator_
[14]:
Pipeline(steps=[('scaler', MinMaxScaler()),
('classifier',
Stacking(base_estimators=[DecisionTreeClassifier(), GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=37, shuffle=True),
error_score=0.0,
estimator=Pipeline(steps=[('classifier',
LogisticRegression(random_state=37))]),
n_jobs=-1, param_grid={'classifier__penalty': ['l1', 'l2']},
refit='apr',
scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5), KNeighborsClassifier()],
keep_original=True, meta_clf=LogisticRegression(), n_folds=2,
pre_fitted=None, random_state=37, shuffle_data=False, threshold=None,
use_proba=False))])
Now we can navigate to each of the base estimators. Since there are 3 of them, we can access each by index.
[20]:
model.best_estimator_[1].base_estimators[0]
[20]:
DecisionTreeClassifier()
[21]:
model.best_estimator_[1].base_estimators[1]
[21]:
GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=37, shuffle=True),
error_score=0.0,
estimator=Pipeline(steps=[('classifier',
LogisticRegression(random_state=37))]),
n_jobs=-1, param_grid={'classifier__penalty': ['l1', 'l2']},
refit='apr',
scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)
[42]:
model.best_estimator_[1].base_estimators[2]
[42]:
KNeighborsClassifier()
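The expression model.best_estimator_[1] works because a Scikit-Learn Pipeline can be indexed by position, so [1] is the second step (the classifier). A step can also be fetched by name, which is often easier to read. The sketch below demonstrates this on a small stand-in pipeline; the LogisticRegression step is a placeholder for the Stacking classifier.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in pipeline with the same step names as the stacking pipeline.
pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('classifier', LogisticRegression()),
])

by_position = pipe[1]                         # second step, by index
by_name = pipe.named_steps['classifier']      # same step, by name
also_by_name = pipe['classifier']             # string indexing also works

print(by_position is by_name, by_name is also_by_name)  # True True
```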
The second base estimator is another GridSearchCV instance. Again, we can carefully navigate to retrieve the best parameters and estimator.
[26]:
model.best_estimator_[1].base_estimators[1].best_params_
[26]:
{'classifier__penalty': 'l2'}
[27]:
model.best_estimator_[1].base_estimators[1].best_estimator_
[27]:
Pipeline(steps=[('classifier', LogisticRegression(random_state=37))])
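Besides best_params_ and best_estimator_, a fitted GridSearchCV exposes cv_results_, which holds the scores of every candidate. With multi-metric scoring, the result keys are suffixed with the names given in the scoring dictionary (here auc and apr). The following sketch shows this on a small stand-alone search; the data, the estimator and the grid over C are assumptions chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small toy data set, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=8,
    n_redundant=0, random_state=37
)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0, 10.0]},  # hypothetical grid
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=37),
    scoring={'auc': 'roc_auc', 'apr': 'average_precision'},
    refit='apr',  # the best candidate is chosen by average precision
)
search.fit(X_demo, y_demo)

# One row per candidate; metric columns carry the scorer names.
results = pd.DataFrame(search.cv_results_)
print(results[['param_C', 'mean_test_auc', 'mean_test_apr', 'rank_test_apr']])
print(search.best_params_, search.best_score_)
```

The same attributes are available on the nested search, for example model.best_estimator_[1].base_estimators[1].cv_results_.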
11.3. Stacking with validation
Stacked models built with combo also work with Scikit-Learn’s validation framework. Here, we apply stratified k-fold cross-validation to a stacked model, so that each score is computed on data the model did not see during fitting.
[40]:
import pandas as pd

r_df = []
for tr_index, te_index in StratifiedKFold(n_splits=5, shuffle=True, random_state=37).split(X, y):
    X_tr, y_tr = X[tr_index,:], y[tr_index]
    X_te, y_te = X[te_index,:], y[te_index]
    print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)

    model = get_stacking_model()
    model.fit(X_tr, y_tr)

    y_pred = model.predict_proba(X_te)[:,1]
    roc = roc_auc_score(y_te, y_pred)
    aps = average_precision_score(y_te, y_pred)
    r_df.append({'roc': roc, 'aps': aps})

r_df = pd.DataFrame(r_df)
r_df
(1600, 10) (1600,) (400, 10) (400,)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
(1600, 10) (1600,) (400, 10) (400,)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
(1600, 10) (1600,) (400, 10) (400,)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
(1600, 10) (1600,) (400, 10) (400,)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
(1600, 10) (1600,) (400, 10) (400,)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[40]:
| | roc | aps |
|---|---|---|
| 0 | 0.993850 | 0.992000 |
| 1 | 0.992425 | 0.991328 |
| 2 | 0.988525 | 0.990718 |
| 3 | 0.995275 | 0.994601 |
| 4 | 0.996900 | 0.996020 |
Below is the mean performance of the stacking approach across the 5 folds.
[41]:
r_df.mean()
[41]:
roc 0.993395
aps 0.992933
dtype: float64
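The manual loop above can also be written with Scikit-Learn’s cross_validate(), which handles the splitting, fitting and scoring in one call. The sketch below uses a plain LogisticRegression on a small toy data set as a stand-in estimator so that it runs quickly; passing get_stacking_model() instead should work the same way, since it returns a Scikit-Learn-compatible estimator, but that is an assumption not tested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Small toy data set, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=8,
    n_redundant=0, random_state=37
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=37)

scores = cross_validate(
    LogisticRegression(max_iter=1000),  # stand-in for the stacking model
    X_demo, y_demo,
    cv=cv,
    scoring={'roc': 'roc_auc', 'aps': 'average_precision'},
)

# One test score per fold and per metric, keyed as 'test_<name>'.
print(scores['test_roc'].mean(), scores['test_aps'].mean())
```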