# 18. Customized Estimators

You are able to roll your own `estimator`, `regressor`, `classifier` or `transformer`. Below are some templates adapted from the project-template GitHub repository; they work with scikit-learn v0.24.1 (the project-template repository is independent from, and not synchronized with, the scikit-learn repository). The official scikit-learn documentation also covers developing custom estimators.

Note that you should (though are not required to) inherit from `BaseEstimator` and use the appropriate `mixin`. After you write your estimator, apply the `check_estimator()` function to check (test) whether your estimator is valid.

## 18.1. Basic estimator

Here is a barebones, dummy estimator. You need to implement two methods with the following signatures.

`fit(self, X, y, **kwargs)`

`predict(self, X)`

When you run `fit()`, make sure the first thing you do is check if `y` is `None`. Calling `check_X_y()` is also required, and the attributes `is_fitted_` and `n_features_in_` must be set inside `fit()`. At the end of `fit()`, `self` must always be returned.

The `predict()` method must return a prediction for every row. Likewise, before making any predictions, `check_is_fitted()` and `check_array()` must be called.

```
[1]:
```

```
from sklearn.utils.estimator_checks import check_estimator
from sklearn.base import BaseEstimator, RegressorMixin, ClassifierMixin, TransformerMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import euclidean_distances
import numpy as np

class SpecialEstimator(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y, **kwargs):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')
        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]
        return self

    def predict(self, X):
        check_is_fitted(self, 'is_fitted_')
        X = check_array(X)
        return np.ones(X.shape[0], dtype=np.int64)

check_estimator(SpecialEstimator())
```
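Although `SpecialEstimator` takes no hyperparameters, one reason to inherit from `BaseEstimator` is that it derives `get_params()` and `set_params()` from the constructor signature, which is what `clone()` and grid search rely on. Below is a sketch with a hypothetical `ScaledEstimator` and its `factor` hyperparameter (both invented for illustration); note that the constructor must store its arguments unmodified.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class ScaledEstimator(BaseEstimator):
    """Hypothetical estimator with one hyperparameter, `factor`."""
    def __init__(self, factor=1.0):
        self.factor = factor  # store constructor args as-is; no validation here

    def fit(self, X, y, **kwargs):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')
        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]
        return self

    def predict(self, X):
        check_is_fitted(self, 'is_fitted_')
        X = check_array(X)
        return self.factor * np.ones(X.shape[0])

est = ScaledEstimator(factor=2.0).fit([[1, 2], [3, 4]], [0, 1])
print(est.get_params())       # {'factor': 2.0}, derived from __init__
print(clone(est).factor)      # clone() rebuilds the estimator from get_params()
print(est.predict([[5, 6]]))  # [2.]
```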

## 18.2. Basic regressor

If your estimator is indeed a regressor, use `RegressorMixin`. The `fit()` and `predict()` implementations follow the same pattern as before. However, notice the `_more_tags()` method? This method is used to override or supply additional `tags`. As of v0.24.1, the documentation states that tags are experimental and subject to change. But what are these tags? Tags are essentially hints about the capabilities of the estimator. The `poor_score` tag hints that the regressor either fails (`True`) or does not fail (`False`, the default) to provide a *reasonable* test-set score. Here, we implement `_more_tags()` to override that value to `True` (otherwise, a warning is generated).

```
[2]:
```

```
class SpecialRegressor(RegressorMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y, **kwargs):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')
        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]
        return self

    def predict(self, X):
        check_is_fitted(self, 'is_fitted_')
        X = check_array(X)
        return np.ones(X.shape[0], dtype=np.int64)

    def _more_tags(self):
        return {
            'poor_score': True
        }

check_estimator(SpecialRegressor())
```
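Beyond the tags machinery, `RegressorMixin` also supplies a default `score()` method that computes R². The sketch below uses a hypothetical `ConstantRegressor` (the same `fit()`/`predict()` contract as `SpecialRegressor`, redefined here so the snippet is self-contained) to show the inherited scoring; a constant prediction scores poorly, which is exactly why the `poor_score` tag was needed above.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class ConstantRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical regressor that always predicts 1."""
    def fit(self, X, y, **kwargs):
        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]
        return self

    def predict(self, X):
        check_is_fitted(self, 'is_fitted_')
        X = check_array(X)
        return np.ones(X.shape[0])

X, y = [[1], [2], [3]], [1.0, 2.0, 3.0]
reg = ConstantRegressor().fit(X, y)
# score() comes from RegressorMixin: R^2 = 1 - SS_res/SS_tot = 1 - 5/2
print(reg.score(X, y))  # -1.5
```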

## 18.3. Basic classifier

Classifiers should use `ClassifierMixin` and also follow the `fit()` and `predict()` contracts. One caveat here is that the `fit()` method must additionally store the classes seen during training in `classes_`. Be careful with the `predict()` method, as it should return label values that are consistent with the class values seen during `fit()`.

```
[3]:
```

```
class SpecialClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y, **kwargs):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')
        X, y = check_X_y(X, y)
        self.n_features_in_ = X.shape[1]
        self.classes_ = unique_labels(y)
        self.is_fitted_ = True
        self.X_ = X
        self.y_ = y
        return self

    def predict(self, X):
        check_is_fitted(self, ['is_fitted_', 'X_', 'y_'])
        X = check_array(X)
        closest = np.argmin(euclidean_distances(X, self.X_), axis=1)
        return self.y_[closest]

check_estimator(SpecialClassifier())
```
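To see the label-consistency requirement in action, here is a hypothetical `NearestClassifier` (the same nearest-neighbor logic as `SpecialClassifier`, redefined so the snippet runs standalone). Because predictions are looked up directly from the stored `y_`, `predict()` can only ever emit values that appeared during `fit()`, and `classes_` records exactly those values.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import euclidean_distances
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class NearestClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical 1-nearest-neighbor classifier."""
    def fit(self, X, y, **kwargs):
        X, y = check_X_y(X, y)
        self.classes_ = unique_labels(y)   # the class values seen during fit
        self.n_features_in_ = X.shape[1]
        self.is_fitted_ = True
        self.X_, self.y_ = X, y
        return self

    def predict(self, X):
        check_is_fitted(self, ['is_fitted_', 'X_', 'y_'])
        X = check_array(X)
        closest = np.argmin(euclidean_distances(X, self.X_), axis=1)
        return self.y_[closest]            # labels come only from fit's y

clf = NearestClassifier().fit([[0.0], [10.0]], [3, 7])
print(clf.classes_)                 # [3 7]
print(clf.predict([[1.0], [9.0]]))  # [3 7]
```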

## 18.4. Basic transformer

Transformers should use `TransformerMixin` and implement two methods.

`fit(self, X, y=None)`

`transform(self, X)`

The checks and saved attributes shown below inside `fit()` and `transform()` are all required to pass `check_estimator()`.

```
[4]:
```

```
class SpecialTransformer(TransformerMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        X = check_array(X, accept_sparse=False)
        self.n_features_in_ = X.shape[1]
        self.n_features_ = X.shape[1]
        self.is_fitted_ = True
        return self

    def transform(self, X):
        check_is_fitted(self, ['is_fitted_'])
        X = check_array(X, accept_sparse=False)
        if X.shape[1] != self.n_features_:
            raise ValueError('Shape of input is different from what was seen in `fit`')
        return np.sqrt(X)

check_estimator(SpecialTransformer())
```
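A bonus of `TransformerMixin` is that it derives `fit_transform()` from your `fit()` and `transform()`. The sketch below uses a hypothetical `SqrtTransformer` (same behavior as `SpecialTransformer`, redefined so the snippet is self-contained) to show the inherited method.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class SqrtTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that square-roots its input."""
    def fit(self, X, y=None):
        X = check_array(X, accept_sparse=False)
        self.n_features_in_ = X.shape[1]
        self.n_features_ = X.shape[1]
        self.is_fitted_ = True
        return self

    def transform(self, X):
        check_is_fitted(self, 'is_fitted_')
        X = check_array(X, accept_sparse=False)
        if X.shape[1] != self.n_features_:
            raise ValueError('Shape of input is different from what was seen in `fit`')
        return np.sqrt(X)

# fit_transform() is supplied by TransformerMixin: fit(X) then transform(X)
Xt = SqrtTransformer().fit_transform([[4.0, 9.0], [16.0, 25.0]])
print(Xt)  # [[2. 3.] [4. 5.]]
```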

## 18.5. Custom estimator with pipeline

Below, we illustrate creating a custom estimator that wraps a pipeline. The pipeline is very simple: we first rescale the data, then apply a regressor. We will use the California housing dataset.

```
[12]:
```

```
from sklearn.datasets import fetch_california_housing
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
```

```
[17]:
```

```
X.head()
```

```
[17]:
```

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |

```
[18]:
```

```
y.head()
```

```
[18]:
```

```
0 4.526
1 3.585
2 3.521
3 3.413
4 3.422
Name: MedHouseVal, dtype: float64
```

Now, we will define our `AwesomeEstimator`. Notice how we pass the hyperparameters to tune into the `fit()` method and not the constructor? It is very expensive and tricky to validate such models with `check_estimator()` since grid searching is involved.

```
[56]:
```

```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

class AwesomeEstimator(RegressorMixin, BaseEstimator):
    def __init__(self):
        pass

    def __get_pipeline(self):
        scaler = MinMaxScaler()
        regressor = RandomForestRegressor(**{
            'random_state': 37
        })
        steps = [
            ('scaler', scaler),
            ('regressor', regressor)]
        pipeline = Pipeline(steps=steps)
        return pipeline

    def __get_model(self, feature_range, n_estimators):
        model = GridSearchCV(**{
            'estimator': self.__get_pipeline(),
            'cv': 5,
            'param_grid': {
                'scaler__feature_range': feature_range,
                'regressor__n_estimators': n_estimators
            },
            'scoring': 'neg_mean_absolute_error',
            'verbose': 5,
            'refit': 'neg_mean_absolute_error',
            'error_score': np.NaN,
            'n_jobs': -1
        })
        return model

    def fit(self, X, y, feature_range=[(0, 1)], n_estimators=[100]):
        if y is None:
            raise ValueError('requires y to be passed, but the target y is None')
        X, y = check_X_y(X, y)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]
        self.model_ = self.__get_model(feature_range, n_estimators)
        self.model_.fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self, ['is_fitted_', 'model_'])
        return self.model_.predict(X)

check_estimator(AwesomeEstimator())
```

```
Fitting 5 folds for each of 1 candidates, totalling 5 fits
...
```

Let’s run the `AwesomeEstimator` using the default grid search values.

```
[59]:
```

```
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
estimator = AwesomeEstimator()
estimator.fit(X, y)
y_pred = estimator.predict(X)
mae = mean_absolute_error(y, y_pred)
mse = np.sqrt(mean_squared_error(y, y_pred))
rsq = r2_score(y, y_pred)
print(f'mae {mae:.5f}, rmse {mse:.5f}, rsq {rsq:.5f}')
```

```
Fitting 5 folds for each of 1 candidates, totalling 5 fits
mae 0.11960, rmse 0.18588, rsq 0.97405
```

We can also expand the search with additional hyperparameters.

```
[60]:
```

```
estimator = AwesomeEstimator()
estimator.fit(X, y, feature_range=[(0, 1), (0, 5)], n_estimators=[100, 200])
y_pred = estimator.predict(X)
mae = mean_absolute_error(y, y_pred)
mse = np.sqrt(mean_squared_error(y, y_pred))
rsq = r2_score(y, y_pred)
print(f'mae {mae:.5f}, rmse {mse:.5f}, rsq {rsq:.5f}')
```

```
Fitting 5 folds for each of 4 candidates, totalling 20 fits
mae 0.11846, rmse 0.18360, rsq 0.97468
```

```
[62]:
```

```
estimator.model_.best_params_
```

```
[62]:
```

```
{'regressor__n_estimators': 200, 'scaler__feature_range': (0, 1)}
```