18. Customized Estimators

You can roll your own estimator, regressor, classifier, or transformer. Below are some templates adapted from the project-template GitHub repository that work with scikit-learn v0.24.1 (the project-template repository is independent from, and not synchronized with, the scikit-learn repository). There is also official documentation on developing your own estimators.

Note that you should (though you are not strictly required to) inherit from BaseEstimator and use the appropriate mixin. After you write your estimator, run the check_estimator() function to verify (test) that your estimator is valid.
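
If you prefer to run these checks under pytest, scikit-learn also provides parametrize_with_checks, which turns each individual check into its own test case. A minimal sketch, assuming SpecialEstimator (defined below) lives in a module your test file can import:

# hypothetical test module, e.g. test_special_estimator.py
from sklearn.utils.estimator_checks import parametrize_with_checks

@parametrize_with_checks([SpecialEstimator()])
def test_sklearn_compatible_estimator(estimator, check):
    # pytest runs this once per check, so failures are reported individually
    check(estimator)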

18.1. Basic estimator

Here is a barebones, dummy estimator. You need to implement two methods with the following signatures.

  • fit(self, X, y, **kwargs)

  • predict(self, X)

When you implement fit(), the first thing you should do is check whether y is None. Calling check_X_y() is also required, and the attributes is_fitted_ and n_features_in_ must be set inside fit(). At the end of fit(), self must always be returned.

The predict() method must return a prediction for every row. Before making any predictions, check_is_fitted() and check_array() must be called.

[1]:
from sklearn.utils.estimator_checks import check_estimator
from sklearn.base import BaseEstimator, RegressorMixin, ClassifierMixin, TransformerMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import euclidean_distances
import numpy as np

class SpecialEstimator(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y, **kwargs):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')

        # validate and convert the inputs
        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]

        # fit() must always return self
        return self

    def predict(self, X):
        # raise an error if fit() has not been called yet
        check_is_fitted(self, 'is_fitted_')
        X = check_array(X)
        # dummy prediction: predict 1 for every row
        return np.ones(X.shape[0], dtype=np.int64)

check_estimator(SpecialEstimator())
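
As a quick, illustrative sanity check (the data below is made up), the estimator behaves like any other scikit-learn estimator:

X_demo = np.random.RandomState(37).normal(size=(5, 3))
y_demo = np.ones(5)

model = SpecialEstimator().fit(X_demo, y_demo)
model.predict(X_demo)  # array([1, 1, 1, 1, 1])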

18.2. Basic regressor

If your estimator is indeed a regressor, use RegressorMixin. The fit() and predict() implementations follow the same pattern as before. However, notice the _more_tags() method? This method is used to override or supply additional estimator tags. As of v0.24.1, the documentation states that tags are experimental and subject to change. But what are these tags? They are essentially hints about the capabilities of the estimator. The poor_score tag hints whether the regressor fails (True) or does not fail (False, the default) to provide a reasonable test-set score. Since this dummy regressor always predicts 1, we implement _more_tags() to override the default to True (otherwise, a warning is generated).

[2]:
class SpecialRegressor(RegressorMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y, **kwargs):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')

        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]

        return self

    def predict(self, X):
        check_is_fitted(self, 'is_fitted_')
        X = check_array(X)
        return np.ones(X.shape[0], dtype=np.int64)

    def _more_tags(self):
        # signal that this dummy regressor cannot produce a reasonable score
        return {
            'poor_score': True
        }

check_estimator(SpecialRegressor())
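
You can peek at the effective tags of an estimator through _get_tags(), a private method (and thus subject to change) that merges the defaults with _more_tags():

tags = SpecialRegressor()._get_tags()
tags['poor_score']  # True, due to the override above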

18.3. Basic classifier

Classifiers should use ClassifierMixin and also follow the fit() and predict() contracts. One caveat here is that the fit() method must also store the class labels seen during training in classes_. Be careful with the predict() method, as it should return label values that are consistent with the class values seen during fit().

[3]:
from random import choice

class SpecialClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y, **kwargs):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')

        X, y = check_X_y(X, y)

        self.n_features_in_ = X.shape[1]
        self.classes_ = unique_labels(y)
        self.is_fitted_ = True

        # store the training data; predict() does a nearest-neighbor lookup
        self.X_ = X
        self.y_ = y

        return self

    def predict(self, X):
        check_is_fitted(self, ['is_fitted_', 'X_', 'y_'])
        X = check_array(X)

        closest = np.argmin(euclidean_distances(X, self.X_), axis=1)
        return self.y_[closest]

check_estimator(SpecialClassifier())
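
This classifier predicts the label of the nearest training row. A tiny illustration with made-up data:

X_demo = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
y_demo = np.array([0, 0, 1, 1])

clf = SpecialClassifier().fit(X_demo, y_demo)
clf.predict(np.array([[0.2, 0.2], [4.9, 4.9]]))  # array([0, 1])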

18.4. Basic transformer

Transformers should use TransformerMixin and implement two methods.

  • fit(self, X, y=None)

  • transform(self, X)

The checks and the attributes saved inside fit() and transform(), shown below, are all required to pass check_estimator().

[4]:
class SpecialTransformer(TransformerMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        X = check_array(X, accept_sparse=False)

        self.n_features_in_ = X.shape[1]
        self.n_features_ = X.shape[1]
        self.is_fitted_ = True

        return self

    def transform(self, X):
        check_is_fitted(self, ['is_fitted_'])

        X = check_array(X, accept_sparse=False)

        if X.shape[1] != self.n_features_:
            raise ValueError('Shape of input is different from what was seen in `fit`')

        # example transformation: element-wise square root
        return np.sqrt(X)

check_estimator(SpecialTransformer())
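
A quick illustration on a small, non-negative array (non-negative, since the transform takes a square root). Note that fit_transform() comes for free from TransformerMixin:

transformer = SpecialTransformer()
transformer.fit_transform(np.array([[1.0, 4.0], [9.0, 16.0]]))
# array([[1., 2.],
#        [3., 4.]])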

18.5. Custom estimator with pipeline

Below, we illustrate how to create a custom estimator that wraps a pipeline. The pipeline is very simple: we first rescale the data and then apply a regression. We will use the California housing data.

[12]:
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
[17]:
X.head()
[17]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
[18]:
y.head()
[18]:
0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64

Now, we will define our AwesomeEstimator. Notice how we pass in the hyperparameters to tune as a part of the fit() method and not the constructor? It is very expensive and tricky to validate these models with check_estimator() since there is searching involved.
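
For contrast, the conventional scikit-learn pattern keeps hyperparameters in the constructor, where get_params() and set_params() can see them; a hypothetical sketch (not part of the code below):

class ConventionalEstimator(RegressorMixin, BaseEstimator):
    def __init__(self, n_estimators=100):
        # hyperparameters are stored verbatim, as check_estimator() expects
        self.n_estimators = n_estimators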

[56]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

class AwesomeEstimator(RegressorMixin, BaseEstimator):
    def __init__(self):
        pass

    def __get_pipeline(self):
        scaler = MinMaxScaler()
        regressor = RandomForestRegressor(**{
            'random_state': 37
        })

        steps=[
            ('scaler', scaler),
            ('regressor', regressor)]

        pipeline = Pipeline(steps=steps)
        return pipeline

    def __get_model(self, feature_range, n_estimators):
        model = GridSearchCV(**{
            'estimator': self.__get_pipeline(),
            'cv': 5,
            'param_grid': {
                'scaler__feature_range': feature_range,
                'regressor__n_estimators': n_estimators
            },
            'scoring': 'neg_mean_absolute_error',
            'verbose': 5,
            'refit': True,  # single-metric scoring, so a boolean suffices
            'error_score': np.NaN,
            'n_jobs': -1
        })
        return model


    def fit(self, X, y, feature_range=[(0, 1)], n_estimators=[100]):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')

        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]

        self.model_ = self.__get_model(feature_range, n_estimators)
        self.model_.fit(X, y)

        return self

    def predict(self, X):
        check_is_fitted(self, ['is_fitted_', 'model_'])
        return self.model_.predict(X)

check_estimator(AwesomeEstimator())
Fitting 5 folds for each of 1 candidates, totalling 5 fits
... (the line above repeats for each fit run internally by check_estimator) ...

Let’s run the AwesomeEstimator using the default grid search values.

[59]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

estimator = AwesomeEstimator()
estimator.fit(X, y)
y_pred = estimator.predict(X)

mae = mean_absolute_error(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))
rsq = r2_score(y, y_pred)

print(f'mae {mae:.5f}, rmse {rmse:.5f}, rsq {rsq:.5f}')
Fitting 5 folds for each of 1 candidates, totalling 5 fits
mae 0.11960, rmse 0.18588, rsq 0.97405

We can also expand the search with additional hyperparameters.

[60]:
estimator = AwesomeEstimator()
estimator.fit(X, y, feature_range=[(0, 1), (0, 5)], n_estimators=[100, 200])
y_pred = estimator.predict(X)

mae = mean_absolute_error(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))
rsq = r2_score(y, y_pred)

print(f'mae {mae:.5f}, rmse {rmse:.5f}, rsq {rsq:.5f}')
Fitting 5 folds for each of 4 candidates, totalling 20 fits
mae 0.11846, rmse 0.18360, rsq 0.97468
[62]:
estimator.model_.best_params_
[62]:
{'regressor__n_estimators': 200, 'scaler__feature_range': (0, 1)}
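
Since model_ is a fitted GridSearchCV, the other search results are available as well; for example, the best cross-validated score (negative mean absolute error, given the scoring above):

estimator.model_.best_score_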