7. imbalanced-learn

imbalanced-learn is a package for dealing with imbalanced data. Class imbalance typically manifests when you have data with class labels and one or more of the classes has too few examples to learn from. imbalanced-learn offers three broad categories of approaches to dealing with class imbalance.

  • oversampling: oversample the minority class

  • undersampling: undersample the majority class

  • combination: use a combination of oversampling and undersampling

Let’s investigate each of these approaches to dealing with the class imbalance problem.
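All samplers in imbalanced-learn share the same fit_resample() interface, so switching between the three families is a one-line change. Below is a minimal sketch showing one sampler from each family on toy data (the X_toy and y_toy names are illustrative); it assumes only that imbalanced-learn and Scikit-Learn are installed.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler     # oversampling
from imblearn.under_sampling import RandomUnderSampler   # undersampling
from imblearn.combine import SMOTEENN                    # combination

# toy data: 95% majority class, 5% minority class
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.95], random_state=37)

for sampler in [RandomOverSampler(random_state=37),
                RandomUnderSampler(random_state=37),
                SMOTEENN(random_state=37)]:
    # every sampler returns a rebalanced copy of the data
    X_res, y_res = sampler.fit_resample(X_toy, y_toy)
    print(type(sampler).__name__, sorted(Counter(y_res).items()))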

7.1. Data generation

Here, we will create a dataset using Scikit-Learn’s make_classification() method. There will be only 2 classes and, as you will see, roughly the same number of samples per class.

[1]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd

plt.style.use('ggplot')
np.random.seed(37)

X, y = make_classification(**{
    'n_samples': 5000,
    'n_features': 5,
    'n_classes': 2,
    'random_state': 37
})

columns = [f'x{i}' for i in range(X.shape[1])] + ['y']
df = pd.DataFrame(np.hstack([X, y.reshape(-1, 1)]), columns=columns)

print(df.shape)
(5000, 6)
[2]:
df.head()
[2]:
x0 x1 x2 x3 x4 y
0 -0.729402 0.390517 -0.603771 0.286312 -0.266412 0.0
1 0.030495 -0.970299 1.223902 -0.343972 -0.479884 0.0
2 -0.657696 -0.811643 -1.075159 0.405169 -0.022806 0.0
3 0.138540 2.012018 -1.825350 0.482964 0.845321 1.0
4 2.231350 -0.705512 -0.453736 -0.238611 1.757486 1.0
[3]:
ax = df.y.value_counts().plot(kind='bar')
_ = ax.set_title('Frequency of Classes, Balanced')
_images/imbalanced-learn_4_0.png

7.2. Class imbalance

We will then transform the data so that class 0 is the majority class and class 1 is the minority class. Class 1 will have only 1% of what was originally generated.

[4]:
df0 = df[df.y == 0].copy(deep=True).reset_index(drop=True)
df1 = df[df.y == 1].sample(frac=0.01).copy(deep=True).reset_index(drop=True)

df = pd.concat([df0, df1])
[5]:
df.shape
[5]:
(2508, 6)
[6]:
ax = df.y.value_counts().plot(kind='bar')
_ = ax.set_title('Frequency of Classes, Imbalanced')
_images/imbalanced-learn_8_0.png
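A quick count makes the skew explicit. Given df.shape above, we expect roughly 2,483 samples of class 0 and 25 of class 1 (a sanity check, not a new result).

# absolute and relative class frequencies after downsampling class 1
print(df.y.value_counts())
print(df.y.value_counts(normalize=True).round(4))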

7.3. Learning with class imbalance

We will use a random forest classifier to learn from the imbalanced data. The learning will be validated with stratified 10-fold cross-validation. We will also benchmark the performance of the random forest classifier using the Area Under the ROC Curve (AUC) and the Average Precision Score (APS).

[7]:
X = df[[c for c in df.columns if c != 'y']]
y = df.y

print(X.shape, y.shape)
(2508, 5) (2508,)
[8]:
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=37)

rdf = []
for fold, (tr, te) in enumerate(skf.split(X, y)):
    X_tr, X_te = X.iloc[tr], X.iloc[te]
    y_tr, y_te = y.iloc[tr], y.iloc[te]

    # model = LogisticRegression(penalty='l2', solver='liblinear', random_state=37)
    model = RandomForestClassifier(n_jobs=-1, random_state=37)
    model.fit(X_tr, y_tr)
    y_pr = model.predict_proba(X_te)[:,1]

    auc = roc_auc_score(y_te, y_pr)
    aps = average_precision_score(y_te, y_pr)

    rdf.append({'auc': auc, 'aps': aps})

rdf = pd.DataFrame(rdf)
[9]:
rdf[['auc', 'aps']].agg(['mean', 'std'])
[9]:
auc aps
mean 0.972218 0.750400
std 0.080609 0.305972
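For context, keep the chance-level baselines of the two metrics in mind: a random classifier scores an AUC of 0.5 regardless of imbalance, while its expected APS equals the positive-class prevalence. A quick sketch using the y defined above:

# chance baselines: AUC = 0.5; expected APS = prevalence of the positive class
prevalence = y.mean()
print(f'chance auc = 0.5, chance aps ~ {prevalence:.4f}')  # roughly 0.01 here

This is why APS separates the samplers more than AUC does in the comparisons below.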

7.4. Oversampling

Below are the results of applying multiple oversampling techniques with random forest classification validated by stratified, k-fold cross-validation.

[10]:
from imblearn.over_sampling import (
    RandomOverSampler, SMOTE, ADASYN,
    BorderlineSMOTE, SVMSMOTE, KMeansSMOTE)
from sklearn.cluster import KMeans
from collections import Counter
from itertools import chain

def get_oversampler(sampler):
    if 'adasyn' == sampler:
        p = {
            'random_state': 37,
            'n_neighbors': 5
        }
        return ADASYN(**p)
    elif 'borderlinesmote' == sampler:
        p = {
            'random_state': 37,
            'n_jobs': -1,
            'k_neighbors': 5,
            'm_neighbors': 10
        }
        return BorderlineSMOTE(**p)
    elif 'svmsmote' == sampler:
        p = {
            'random_state': 37,
            'n_jobs': -1,
            'k_neighbors': 5,
            'm_neighbors': 10
        }
        return SVMSMOTE(**p)
    elif 'kmeanssmote' == sampler:
        kmeans = KMeans(n_clusters=5, random_state=37)
        p = {
            'random_state': 37,
            'n_jobs': -1,
            'k_neighbors': 5,
            'kmeans_estimator': kmeans
        }
        return KMeansSMOTE(**p)
    elif 'random' == sampler:
        p = {
            'random_state': 37
        }
        return RandomOverSampler(**p)
    else:
        p = {
            'random_state': 37,
            'k_neighbors': 5
        }
        return SMOTE(**p)

def get_results(sampler, f):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=37)

    results = []
    for fold, (tr, te) in enumerate(skf.split(X, y)):
        X_tr, X_te = X.iloc[tr], X.iloc[te]
        y_tr, y_te = y.iloc[tr], y.iloc[te]

        counts = sorted(Counter(y_tr).items())
        n_0, n_1 = counts[0][1], counts[1][1]

        if sampler != 'none':
            sampling_approach = f(sampler)
            X_tr, y_tr = sampling_approach.fit_resample(X_tr, y_tr)

        # model = LogisticRegression(penalty='l2', solver='liblinear', random_state=37)
        model = RandomForestClassifier(n_jobs=-1, random_state=37)
        model.fit(X_tr, y_tr)
        y_pr = model.predict_proba(X_te)[:,1]

        auc = roc_auc_score(y_te, y_pr)
        aps = average_precision_score(y_te, y_pr)

        counts = sorted(Counter(y_tr).items())
        r_0, r_1 = counts[0][1], counts[1][1]

        results.append({
            'sampler': sampler,
            'auc': auc,
            'aps': aps,
            'n_maj': n_0,
            'r_maj': r_0,
            'n_min': n_1,
            'r_min': r_1
        })

    return results
[11]:
%%time

samplers = ['none', 'random', 'smote', 'adasyn', 'borderlinesmote', 'svmsmote']
odf = pd.DataFrame(list(chain(*[get_results(s, get_oversampler) for s in samplers])))
Wall time: 15 s
[12]:
odf[['sampler', 'auc', 'aps']].groupby('sampler').agg(['mean', 'std'])
[12]:
auc aps
mean std mean std
sampler
adasyn 0.971713 0.080083 0.748178 0.222613
borderlinesmote 0.971511 0.080007 0.734289 0.206186
none 0.972218 0.080609 0.750400 0.305972
random 0.972351 0.079962 0.778178 0.242363
smote 0.972014 0.079815 0.759844 0.204722
svmsmote 0.971881 0.080504 0.739844 0.278877
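get_results() also recorded the per-fold class counts before (n_maj, n_min) and after (r_maj, r_min) resampling, although the table above omits them. A quick aggregation shows how aggressively each technique inflates the minority class:

# mean per-fold class counts before (n_*) and after (r_*) resampling
odf[['sampler', 'n_maj', 'r_maj', 'n_min', 'r_min']]\
    .groupby('sampler')\
    .agg('mean')

The same columns are available in the udf and cdf frames built in the following sections.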

7.5. Undersampling

Below are the results of applying multiple undersampling techniques with random forest classification validated by stratified, k-fold cross-validation.

[13]:
from imblearn.under_sampling import (
    RandomUnderSampler, NearMiss,
    EditedNearestNeighbours, RepeatedEditedNearestNeighbours,
    CondensedNearestNeighbour, OneSidedSelection,
    NeighbourhoodCleaningRule, InstanceHardnessThreshold)

def get_undersampler(sampler):
    if 'random' == sampler:
        p = {
            'random_state': 37
        }
        return RandomUnderSampler(**p)
    elif 'nearmiss1' == sampler:
        p = {
            'version': 1,
            'n_jobs': -1
        }
        return NearMiss(**p)
    elif 'nearmiss2' == sampler:
        p = {
            'version': 2,
            'n_jobs': -1
        }
        return NearMiss(**p)
    elif 'nearmiss3' == sampler:
        p = {
            'version': 3,
            'n_jobs': -1
        }
        return NearMiss(**p)
    elif 'editednn' == sampler:
        p = {
            'n_jobs': -1
        }
        return EditedNearestNeighbours(**p)
    elif 'reditednn' == sampler:
        p = {
            'n_jobs': -1
        }
        return RepeatedEditedNearestNeighbours(**p)
    elif 'condensednn' == sampler:
        p = {
            'random_state': 37,
            'n_jobs': -1
        }
        return CondensedNearestNeighbour(**p)
    elif 'onesided' == sampler:
        p = {
            'random_state': 37,
            'n_jobs': -1
        }
        return OneSidedSelection(**p)
    elif 'neighcleanrule' == sampler:
        p = {
            'n_jobs': -1
        }
        return NeighbourhoodCleaningRule(**p)
    elif 'instancehardthresh' == sampler:
        estimator = LogisticRegression(solver='lbfgs', multi_class='auto')
        p = {
            'estimator': estimator,
            'random_state': 37,
            'n_jobs': -1
        }
        return InstanceHardnessThreshold(**p)
[14]:
%%time

samplers = ['random', 'nearmiss1', 'nearmiss2',
            'nearmiss3', 'editednn', 'reditednn', 'condensednn',
            'onesided', 'neighcleanrule']
udf = pd.DataFrame(list(chain(*[get_results(s, get_undersampler) for s in samplers])))
Wall time: 48.4 s
[15]:
udf[['sampler', 'auc', 'aps']].groupby('sampler').agg(['mean', 'std'])
[15]:
auc aps
mean std mean std
sampler
condensednn 0.994261 0.010101 0.746438 0.278687
editednn 0.971847 0.081873 0.745956 0.281920
nearmiss1 0.991982 0.007731 0.661993 0.202126
nearmiss2 0.992929 0.009534 0.678327 0.251700
nearmiss3 0.991876 0.015964 0.727037 0.274334
neighcleanrule 0.972518 0.081746 0.808733 0.265511
onesided 0.971445 0.082091 0.714289 0.278155
random 0.996172 0.006600 0.792222 0.225959
reditednn 0.972047 0.081932 0.767622 0.258529

7.6. Combination

Below are the results of applying multiple combination techniques with random forest classification validated by stratified, k-fold cross-validation. SMOTEENN first oversamples with SMOTE and then cleans noisy samples with edited nearest neighbours, while SMOTETomek first oversamples with SMOTE and then removes Tomek links.

[16]:
from imblearn.combine import SMOTEENN, SMOTETomek

def get_combine(sampler):
    if 'smoteenn' == sampler:
        p = {
            'random_state': 37,
            'k_neighbors': 5
        }
        smote = SMOTE(**p)

        p = {
            'n_jobs': -1
        }
        enn = EditedNearestNeighbours(**p)

        p = {
            'smote': smote,
            'enn': enn,
            'n_jobs': -1,
            'random_state': 37
        }
        return SMOTEENN(**p)
    elif 'smotetomek' == sampler:
        p = {
            'random_state': 37,
            'k_neighbors': 5
        }
        smote = SMOTE(**p)

        p = {
            'smote': smote,
            'random_state': 37,
            'n_jobs': -1
        }
        return SMOTETomek(**p)
[17]:
%%time

samplers = ['smoteenn', 'smotetomek']
cdf = pd.DataFrame(list(chain(*[get_results(s, get_combine) for s in samplers])))
Wall time: 5.35 s
[18]:
cdf[['sampler', 'auc', 'aps']].groupby('sampler').agg(['mean', 'std'])
[18]:
auc aps
mean std mean std
sampler
smoteenn 0.971678 0.080401 0.726511 0.241162
smotetomek 0.972014 0.079815 0.759844 0.204722

7.7. Comparisons

Here, we will compare the results of all sampling approaches.

[19]:
odf['type'] = odf.sampler.apply(lambda s: 'baseline' if s == 'none' else 'over')
udf['type'] = 'under'
cdf['type'] = 'combo'

rdf = pd.concat([odf, udf, cdf]).reset_index(drop=True)

As you can see below,

  • all sampling techniques do about the same in terms of AUC,

  • the differences show up in the APS performance,

  • undersampling with neighborhood cleaning rule seems to do the best, and

  • surprisingly, no sampling is still competitive.

[20]:
sort = [('aps', 'mean'), ('aps', 'std'), ('auc', 'mean'), ('auc', 'std')]
rdf[['sampler', 'type', 'auc', 'aps']]\
    .groupby(['type', 'sampler'])\
    .agg(['mean', 'std'])\
    .sort_values(sort, ascending=False)
[20]:
auc aps
mean std mean std
type sampler
under     neighcleanrule     0.972518 0.081746 0.808733 0.265511
under     random             0.996172 0.006600 0.792222 0.225959
over      random             0.972351 0.079962 0.778178 0.242363
under     reditednn          0.972047 0.081932 0.767622 0.258529
combo     smotetomek         0.972014 0.079815 0.759844 0.204722
over      smote              0.972014 0.079815 0.759844 0.204722
baseline  none               0.972218 0.080609 0.750400 0.305972
over      adasyn             0.971713 0.080083 0.748178 0.222613
under     condensednn        0.994261 0.010101 0.746438 0.278687
under     editednn           0.971847 0.081873 0.745956 0.281920
over      svmsmote           0.971881 0.080504 0.739844 0.278877
over      borderlinesmote    0.971511 0.080007 0.734289 0.206186
under     nearmiss3          0.991876 0.015964 0.727037 0.274334
combo     smoteenn           0.971678 0.080401 0.726511 0.241162
under     onesided           0.971445 0.082091 0.714289 0.278155
under     nearmiss2          0.992929 0.009534 0.678327 0.251700
under     nearmiss1          0.991982 0.007731 0.661993 0.202126
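If a chart is easier to scan than the table, a quick horizontal bar plot of mean APS per sampler (a sketch over the rdf frame assembled above) works as well.

# mean APS per (type, sampler), sorted for readability
ax = rdf.groupby(['type', 'sampler']).aps.mean().sort_values().plot(kind='barh')
_ = ax.set_title('Mean APS by Sampler')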

7.8. Pipeline

If you need to use one of the samplers in a pipeline, do not use sklearn.pipeline.Pipeline; instead, use imblearn.pipeline.Pipeline, which is a drop-in replacement. Its pipeline applies a sampler’s fit_resample() only during fitting and skips the sampler at prediction time. Here’s an example of a learning pipeline with hyperparameter tuning.

[21]:
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score, make_scorer
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.preprocessing import MinMaxScaler

def get_model(n_splits=5):
    cv = StratifiedKFold(**{
        'n_splits': n_splits,
        'shuffle': True,
        'random_state': 37
    })

    auc_scorer = make_scorer(
        roc_auc_score,
        greater_is_better=True,
        needs_proba=True,
        multi_class='ovo')

    scoring = {
      'auc': auc_scorer
    }

    scaler = MinMaxScaler()
    sampler = ADASYN(**{
        'random_state': 37,
        'n_jobs': -1
    })
    classifier = RandomForestClassifier(**{
        'random_state': 37,
        'n_jobs': -1,
        'verbose': 0
    })

    pipeline = Pipeline([
        ('scaler', scaler),
        ('sampler', sampler),
        ('classifier', classifier)
    ])

    param_grid = {
      'sampler__sampling_strategy': ['all', 'auto'],
      'sampler__n_neighbors': [3, 5],
      'classifier__n_estimators': [50, 100]
    }

    model = GridSearchCV(**{
        'estimator': pipeline,
        'cv': cv,
        'param_grid': param_grid,
        'verbose': 1,
        'scoring': scoring,
        'refit': 'auc',
        'error_score': np.NaN,
        'n_jobs': -1
    })
    return model
[22]:
model = get_model(n_splits=2)
model.fit(X, y)
Fitting 2 folds for each of 8 candidates, totalling 16 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of  16 | elapsed:    3.8s remaining:   27.4s
[Parallel(n_jobs=-1)]: Done  16 out of  16 | elapsed:    4.2s finished
[22]:
GridSearchCV(cv=StratifiedKFold(n_splits=2, random_state=37, shuffle=True),
             estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                       ('sampler',
                                        ADASYN(n_jobs=-1, random_state=37)),
                                       ('classifier',
                                        RandomForestClassifier(n_jobs=-1,
                                                               random_state=37))]),
             n_jobs=-1,
             param_grid={'classifier__n_estimators': [50, 100],
                         'sampler__n_neighbors': [3, 5],
                         'sampler__sampling_strategy': ['all', 'auto']},
             refit='auc',
             scoring={'auc': make_scorer(roc_auc_score, needs_proba=True, multi_class=ovo)},
             verbose=1)
[23]:
model.best_params_
[23]:
{'classifier__n_estimators': 100,
 'sampler__n_neighbors': 3,
 'sampler__sampling_strategy': 'all'}
[24]:
y_pr = model.predict_proba(X)[:,1]

auc = roc_auc_score(y, y_pr)
aps = average_precision_score(y, y_pr)

print(f'auc = {auc:.5f}, aps = {aps:.5f}')
auc = 1.00000, aps = 1.00000
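Note that these perfect scores are computed on the same data used to fit the final model, so they are optimistic. The grid search’s cross-validated score for the refit metric is a less biased estimate.

# mean held-out AUC (the refit='auc' metric) of the best parameter combination
print(f'cv auc = {model.best_score_:.5f}')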