18. Customized Estimators
You are able to roll your own estimator, regressor, classifier, or transformer. Below are some templates adapted from the project-template GitHub repository that work with scikit-learn v0.24.1 (the project-template repository is independent from, and not synchronized with, the scikit-learn repository). There is also official scikit-learn documentation on developing your own estimators. Note that you should (though are not strictly required to) inherit from BaseEstimator and use the appropriate mixin. After you write your estimator, apply the check_estimator() function to check (test) whether your estimator is valid.
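As a quick illustration of what inheriting from BaseEstimator buys you, here is a minimal sketch (TinyEstimator is a made-up class for illustration). Because BaseEstimator derives get_params() and set_params() from the constructor signature, any estimator whose __init__ simply stores its arguments gets both methods, plus a readable repr, for free.
[ ]:
from sklearn.base import BaseEstimator

class TinyEstimator(BaseEstimator):
    def __init__(self, alpha=1.0, fit_intercept=True):
        # __init__ must only store its arguments verbatim;
        # get_params()/set_params() are derived by introspecting
        # the constructor signature
        self.alpha = alpha
        self.fit_intercept = fit_intercept

print(TinyEstimator().get_params())
# {'alpha': 1.0, 'fit_intercept': True}
print(TinyEstimator().set_params(alpha=0.5))
# TinyEstimator(alpha=0.5)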
18.1. Basic estimator
Here is a barebones, dummy estimator. You need to implement two methods with the following signatures.
fit(self, X, y, **kwargs)
predict(self, X)
When you implement fit(), make sure the first thing you do is check whether y is None. Calling check_X_y() is also required, and the attributes is_fitted_ and n_features_in_ must be set inside fit(). At the end of fit(), self must always be returned.
The predict() method must return a prediction for every row. Likewise, before making any predictions, check_is_fitted() and check_array() must be called.
[1]:
from sklearn.utils.estimator_checks import check_estimator
from sklearn.base import BaseEstimator, RegressorMixin, ClassifierMixin, TransformerMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import euclidean_distances
import numpy as np

class SpecialEstimator(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y, **kwargs):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')

        # validate the inputs and record the fitted state
        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]
        return self

    def predict(self, X):
        check_is_fitted(self, 'is_fitted_')
        X = check_array(X)

        # a dummy prediction: always predict 1 for every row
        return np.ones(X.shape[0], dtype=np.int64)

check_estimator(SpecialEstimator())
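As a quick sanity check with made-up data, we can fit the estimator and confirm the contract: n_features_in_ is set, and every row receives a (dummy) prediction.
[ ]:
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_demo = np.array([0, 1, 0])

est = SpecialEstimator().fit(X_demo, y_demo)
print(est.n_features_in_)   # 2
print(est.predict(X_demo))  # [1 1 1]; the dummy always predicts 1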
18.2. Basic regressor
If your estimator is indeed a regressor, use RegressorMixin. The fit() and predict() implementations follow the same pattern as before. However, notice the _more_tags() method? This method is used to override or supply additional tags. As of v0.24.1, the documentation states that tags are experimental and subject to change. But what are these tags? Tags are essentially hints about the capabilities of the estimator. The poor_score tag hints at whether the regressor fails (True) or does not fail (False) to provide a reasonable test-set score. By default, this tag is set to False; here, we implement _more_tags() to override that value to True (otherwise, a warning is generated).
[2]:
class SpecialRegressor(RegressorMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y, **kwargs):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')

        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]
        return self

    def predict(self, X):
        check_is_fitted(self, 'is_fitted_')
        X = check_array(X)
        return np.ones(X.shape[0], dtype=np.int64)

    def _more_tags(self):
        # this dummy regressor cannot score reasonably, so override
        # the default poor_score=False
        return {
            'poor_score': True
        }

check_estimator(SpecialRegressor())
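We can verify that the override took effect with _get_tags(), a private scikit-learn helper (present in v0.24) that merges the default tags with whatever _more_tags() returns. Being private, it may change in future releases.
[ ]:
# poor_score defaults to False; our _more_tags() flips it to True
print(SpecialRegressor()._get_tags()['poor_score'])  # True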
18.3. Basic classifier
Classifiers should use ClassifierMixin and also follow the fit() and predict() contracts. One caveat here is that in the fit() method, we must also store the unique class labels in classes_. Be careful with the predict() method, as it should return label values that are consistent with the class values seen during fit().
[3]:
class SpecialClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y, **kwargs):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')

        X, y = check_X_y(X, y)
        self.n_features_in_ = X.shape[1]
        self.classes_ = unique_labels(y)
        self.is_fitted_ = True
        self.X_ = X
        self.y_ = y
        return self

    def predict(self, X):
        check_is_fitted(self, ['is_fitted_', 'X_', 'y_'])
        X = check_array(X)

        # nearest-neighbor prediction: each row gets the label of
        # the closest training example
        closest = np.argmin(euclidean_distances(X, self.X_), axis=1)
        return self.y_[closest]

check_estimator(SpecialClassifier())
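A small sanity check with made-up points shows the nearest-neighbor behavior: each query point receives the label of its closest training example, drawn from the values seen during fit().
[ ]:
X_demo = np.array([[0.0, 0.0], [10.0, 10.0]])
y_demo = np.array(['a', 'b'])

clf = SpecialClassifier().fit(X_demo, y_demo)
print(clf.classes_)                           # ['a' 'b']
print(clf.predict([[1.0, 1.0], [9.0, 9.0]]))  # ['a' 'b']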
18.4. Basic transformer
Transformers should use TransformerMixin and implement two methods.
fit(self, X, y=None)
transform(self, X)
The checks and saved attributes shown below inside fit() and transform() are all required to pass check_estimator().
[4]:
class SpecialTransformer(TransformerMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        X = check_array(X, accept_sparse=False)
        self.n_features_in_ = X.shape[1]
        self.n_features_ = X.shape[1]
        self.is_fitted_ = True
        return self

    def transform(self, X):
        check_is_fitted(self, ['is_fitted_'])
        X = check_array(X, accept_sparse=False)

        if X.shape[1] != self.n_features_:
            raise ValueError('Shape of input is different from what was seen in `fit`')

        return np.sqrt(X)

check_estimator(SpecialTransformer())
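Note that TransformerMixin supplies fit_transform() for free by chaining fit() and transform(). A quick sketch with made-up data:
[ ]:
X_demo = np.array([[1.0, 4.0], [9.0, 16.0]])

print(SpecialTransformer().fit_transform(X_demo))
# [[1. 2.]
#  [3. 4.]]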
18.5. Custom estimator with pipeline
Below, we illustrate creating a custom estimator that wraps a pipeline. The pipeline is very simple: we first rescale the data, then apply a regression. We will use the California housing data.
[12]:
from sklearn.datasets import fetch_california_housing
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
[17]:
X.head()
[17]:
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
[18]:
y.head()
[18]:
0 4.526
1 3.585
2 3.521
3 3.413
4 3.422
Name: MedHouseVal, dtype: float64
Now, we will define our AwesomeEstimator. Notice how we pass the hyperparameters to tune as arguments to the fit() function and not the constructor? It is very expensive and tricky to validate such models with check_estimator() since there is searching involved.
[56]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

class AwesomeEstimator(RegressorMixin, BaseEstimator):
    def __init__(self):
        pass

    def __get_pipeline(self):
        scaler = MinMaxScaler()
        regressor = RandomForestRegressor(**{
            'random_state': 37
        })

        steps = [
            ('scaler', scaler),
            ('regressor', regressor)]
        pipeline = Pipeline(steps=steps)
        return pipeline

    def __get_model(self, feature_range, n_estimators):
        model = GridSearchCV(**{
            'estimator': self.__get_pipeline(),
            'cv': 5,
            'param_grid': {
                'scaler__feature_range': feature_range,
                'regressor__n_estimators': n_estimators
            },
            'scoring': 'neg_mean_absolute_error',
            'verbose': 5,
            'refit': True,
            'error_score': np.NaN,
            'n_jobs': -1
        })
        return model

    def fit(self, X, y, feature_range=[(0, 1)], n_estimators=[100]):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')

        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]

        # delegate the actual learning to a grid-searched pipeline
        self.model_ = self.__get_model(feature_range, n_estimators)
        self.model_.fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self, ['is_fitted_', 'model_'])
        return self.model_.predict(X)

check_estimator(AwesomeEstimator())
Fitting 5 folds for each of 1 candidates, totalling 5 fits
...
Let’s run the AwesomeEstimator using the default grid search values.
[59]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

estimator = AwesomeEstimator()
estimator.fit(X, y)

y_pred = estimator.predict(X)

mae = mean_absolute_error(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))
rsq = r2_score(y, y_pred)

print(f'mae {mae:.5f}, rmse {rmse:.5f}, rsq {rsq:.5f}')
Fitting 5 folds for each of 1 candidates, totalling 5 fits
mae 0.11960, rmse 0.18588, rsq 0.97405
We can also expand the search with additional hyperparameters.
[60]:
estimator = AwesomeEstimator()
estimator.fit(X, y, feature_range=[(0, 1), (0, 5)], n_estimators=[100, 200])

y_pred = estimator.predict(X)

mae = mean_absolute_error(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))
rsq = r2_score(y, y_pred)

print(f'mae {mae:.5f}, rmse {rmse:.5f}, rsq {rsq:.5f}')
Fitting 5 folds for each of 4 candidates, totalling 20 fits
mae 0.11846, rmse 0.18360, rsq 0.97468
[62]:
estimator.model_.best_params_
[62]:
{'regressor__n_estimators': 200, 'scaler__feature_range': (0, 1)}
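Beyond best_params_, the fitted GridSearchCV object stored in model_ exposes the full search history. Here is a sketch of how one might inspect it (output not shown):
[ ]:
import pandas as pd

# one row per hyperparameter candidate
results = pd.DataFrame(estimator.model_.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']])

# best mean cross-validated score (negative MAE)
print(estimator.model_.best_score_)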