11. combo

combo is an API for combining machine learning (ML) models and scores. In particular, combo supports stacking models. Let's see how we can use combo as part of our learning approach.

11.1. Data

First, let’s synthesize the data using make_classification() from Scikit-Learn. The data itself is not interesting, as we just need toy data to illustrate combo and model stacking.

[1]:
import numpy as np
import random
from sklearn.datasets import make_classification

np.random.seed(37)
random.seed(37)

X, y = make_classification(**{
    'n_samples': 2000,
    'n_features': 10,
    'n_informative': 8,
    'n_redundant': 0,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 1,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape = {y.shape}')
X shape = (2000, 10), y shape = (2000,)
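
As a quick sanity check (a sketch; np is already imported above), we can confirm how the samples are split between the two classes:

# count how many samples fall into each class
np.unique(y, return_counts=True)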

11.2. Stacking

The class Stacking can be used to stack any number of models. The argument base_estimators to the Stacking constructor must contain at least 2 models, or an exception will be thrown. You can use Stacking as a drop-in replacement for any ordinary model. In this example, the modeling approach is deliberately elaborate: not only are we stacking, we are also performing grid search (hyper-parameter tuning), and the stacking itself is done as part of a pipeline. The API fits very nicely with Scikit-Learn.
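
Before the full example, here is a minimal sketch of Stacking used on its own as a drop-in classifier (the estimator choices are just illustrative; X and y are the data from above):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from combo.models.classifier_stacking import Stacking

# stack two base estimators; as the output further below shows,
# the meta-classifier defaults to LogisticRegression
clf = Stacking(
    base_estimators=[DecisionTreeClassifier(), LogisticRegression()],
    random_state=37
)
clf.fit(X, y)
y_proba = clf.predict_proba(X)[:, 1]  # same API as any Scikit-Learn classifier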

One additional thing: if we want to perform grid search on one of the base estimators inside Stacking, we can do so. Look at get_logistic_regression_model(), which itself returns a grid search wrapping a pipeline for LogisticRegression. That is why this model is so sophisticated: there is a grid search within a grid search, and each grid search has its own pipeline!

[10]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from combo.models.classifier_stacking import Stacking

def get_logistic_regression_model():
    pipeline = Pipeline(steps=[('classifier', LogisticRegression(random_state=37))])

    cv = StratifiedKFold(**{
        'n_splits': 3,
        'shuffle': True,
        'random_state': 37
    })

    # note: LogisticRegression's default lbfgs solver only supports 'l2';
    # the 'l1' fits fail and are scored with error_score=0.0 below
    param_grid = {
        'classifier__penalty': ['l1', 'l2']
    }

    model = GridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': param_grid,
        'scoring': {
            'auc': 'roc_auc',
            'apr': 'average_precision'
        },
        'verbose': 5,
        'refit': 'apr',
        'error_score': 0.0,
        'n_jobs': -1
    })

    return model

def get_stacking_model():
    scaler = MinMaxScaler()
    classifier = Stacking(
        base_estimators=[DecisionTreeClassifier(), LogisticRegression()],
        random_state=37
    )
    pipeline = Pipeline(steps=[
        ('scaler', scaler),
        ('classifier', classifier)
    ])

    cv = StratifiedKFold(**{
        'n_splits': 5,
        'shuffle': True,
        'random_state': 37
    })

    param_grid = {
        'scaler__feature_range': [(0, 1), (0, 2)],
        'classifier__base_estimators': [
            [
                DecisionTreeClassifier(),
                get_logistic_regression_model(),
                KNeighborsClassifier()
            ],
            [
                RandomForestClassifier(),
                GradientBoostingClassifier()
            ],
            [
                DecisionTreeClassifier(),
                get_logistic_regression_model(),
                KNeighborsClassifier(),
                RandomForestClassifier(),
                GradientBoostingClassifier()
            ]
        ],
        'classifier__n_folds': [2, 5],
        'classifier__use_proba': [False, True]
    }

    model = GridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': param_grid,
        'scoring': {
            'auc': 'roc_auc',
            'apr': 'average_precision'
        },
        'verbose': 5,
        'refit': 'apr',
        'error_score': 0.0,
        'n_jobs': -1
    })

    return model

Let’s get the stacking model and fit it. The grid has 2 scaler ranges × 3 base estimator lists × 2 n_folds values × 2 use_proba values = 24 candidates, which matches the fitting log below.

[11]:
model = get_stacking_model()
model.fit(X, y)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[11]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=37, shuffle=True),
             error_score=0.0,
             estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                       ('classifier',
                                        Stacking(base_estimators=[DecisionTreeClassifier(), LogisticRegression()],
     keep_original=True, meta_clf=LogisticRegression(), n_folds=2,
     pre_fitted=None, random_state=37, shuffle_data=False, threshold=No...
                                                                       param_grid={'classifier__penalty': ['l1',
                                                                                                           'l2']},
                                                                       refit='apr',
                                                                       scoring={'apr': 'average_precision',
                                                                                'auc': 'roc_auc'},
                                                                       verbose=5),
                                                          KNeighborsClassifier(),
                                                          RandomForestClassifier(),
                                                          GradientBoostingClassifier()]],
                         'classifier__n_folds': [2, 5],
                         'classifier__use_proba': [False, True],
                         'scaler__feature_range': [(0, 1), (0, 2)]},
             refit='apr',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)

Let’s look at the training performance. Since these scores are computed on the same data used for fitting, they are optimistic.

[12]:
from sklearn.metrics import roc_auc_score, average_precision_score

y_pred = model.predict_proba(X)[:,1]
roc_auc_score(y, y_pred), average_precision_score(y, y_pred)
[12]:
(0.999124, 0.9991472599492524)

Although there is much nesting and stacking of models, we can still navigate to the information we want. Looking at the stacking model’s best parameters is easy: model.best_params_. Notice that the grid search says the stacked classifier whose base estimators are DecisionTreeClassifier, GridSearchCV, and KNeighborsClassifier is the best one.

So we have to dig into that GridSearchCV to see what its best hyper-parameters and model are.

[13]:
model.best_params_
[13]:
{'classifier__base_estimators': [DecisionTreeClassifier(),
  GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=37, shuffle=True),
               error_score=0.0,
               estimator=Pipeline(steps=[('classifier',
                                          LogisticRegression(random_state=37))]),
               n_jobs=-1, param_grid={'classifier__penalty': ['l1', 'l2']},
               refit='apr',
               scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5),
  KNeighborsClassifier()],
 'classifier__n_folds': 2,
 'classifier__use_proba': False,
 'scaler__feature_range': (0, 1)}

The best estimator is retrieved from model.best_estimator_.

[14]:
model.best_estimator_
[14]:
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('classifier',
                 Stacking(base_estimators=[DecisionTreeClassifier(), GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=37, shuffle=True),
             error_score=0.0,
             estimator=Pipeline(steps=[('classifier',
                                        LogisticRegression(random_state=37))]),
             n_jobs=-1, param_grid={'classifier__penalty': ['l1', 'l2']},
             refit='apr',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5), KNeighborsClassifier()],
     keep_original=True, meta_clf=LogisticRegression(), n_folds=2,
     pre_fitted=None, random_state=37, shuffle_data=False, threshold=None,
     use_proba=False))])

Now we can navigate to each of the base estimators. Since there are 3 of them, we can access each by index.

[20]:
model.best_estimator_[1].base_estimators[0]
[20]:
DecisionTreeClassifier()
[21]:
model.best_estimator_[1].base_estimators[1]
[21]:
GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=37, shuffle=True),
             error_score=0.0,
             estimator=Pipeline(steps=[('classifier',
                                        LogisticRegression(random_state=37))]),
             n_jobs=-1, param_grid={'classifier__penalty': ['l1', 'l2']},
             refit='apr',
             scoring={'apr': 'average_precision', 'auc': 'roc_auc'}, verbose=5)
[42]:
model.best_estimator_[1].base_estimators[2]
[42]:
KNeighborsClassifier()

The second base estimator is another GridSearchCV instance. Again, we can carefully navigate to retrieve the best parameters and estimator.

[26]:
model.best_estimator_[1].base_estimators[1].best_params_
[26]:
{'classifier__penalty': 'l2'}
[27]:
model.best_estimator_[1].base_estimators[1].best_estimator_
[27]:
Pipeline(steps=[('classifier', LogisticRegression(random_state=37))])
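
If we want the fitted LogisticRegression itself, we can index the inner pipeline by its step name (a sketch; 'classifier' is the step name defined in get_logistic_regression_model() above):

# drill down to the refit LogisticRegression inside the inner pipeline
lr = model.best_estimator_[1].base_estimators[1].best_estimator_['classifier']
lr.coef_  # coefficients of the tuned model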

11.3. Stacking with validation

Stacking models with combo works perfectly with Scikit-Learn’s validation framework. Here, we apply stratified k-fold cross-validation to a stacked model.

[40]:
import pandas as pd

r_df = []

for tr_index, te_index in StratifiedKFold(n_splits=5, shuffle=True, random_state=37).split(X, y):
    X_tr, y_tr = X[tr_index,:], y[tr_index]
    X_te, y_te = X[te_index,:], y[te_index]

    print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)

    model = get_stacking_model()
    model.fit(X_tr, y_tr)

    y_pred = model.predict_proba(X_te)[:,1]

    roc = roc_auc_score(y_te, y_pred)
    aps = average_precision_score(y_te, y_pred)

    r_df.append({'roc': roc, 'aps': aps})

r_df = pd.DataFrame(r_df)
r_df
(1600, 10) (1600,) (400, 10) (400,)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
(1600, 10) (1600,) (400, 10) (400,)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
(1600, 10) (1600,) (400, 10) (400,)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
(1600, 10) (1600,) (400, 10) (400,)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
(1600, 10) (1600,) (400, 10) (400,)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[40]:
        roc       aps
0  0.993850  0.992000
1  0.992425  0.991328
2  0.988525  0.990718
3  0.995275  0.994601
4  0.996900  0.996020

Below is the average performance of the stacking approach across the five validation folds.

[41]:
r_df.mean()
[41]:
roc    0.993395
aps    0.992933
dtype: float64
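
If the spread across folds is also of interest, pandas can compute both statistics in one call (a sketch):

# mean and standard deviation of each metric across the five folds
r_df.agg(['mean', 'std'])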