1. Tips
These are some tips for the data science practitioner.
1.1. Control ticks
It might be annoying that the default x- and y-axis tick labels are coarse-grained. Take the example below; the x- and y-axis ticks show only even numbers. What if we want more granularity and want to show all whole numbers?
[1]:
from scipy.special import expit as logistic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(37)
plt.style.use('ggplot')
x = np.arange(-6, 6.1, 0.1)
y = logistic(x)
s = pd.Series(y, x)
fig, ax = plt.subplots(figsize=(10, 3))
_ = s.plot.line(ax=ax)  # the series already carries x (its index) and y (its values)
_ = ax.set_title('Basic line plot')
Use the set_xticks() and set_yticks() functions to control the ticks on the x- and y-axis.
[2]:
fig, ax = plt.subplots(figsize=(10, 3))
_ = s.plot.line(ax=ax)
_ = ax.set_title('Basic line plot')
_ = ax.set_xticks(np.arange(-6, 6.1, 1))
_ = ax.set_yticks(np.arange(0, 1.1, 0.1))
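Alternatively, if you would rather not enumerate the ticks by hand, matplotlib's MultipleLocator can place them at a fixed interval. Here is a minimal sketch using the same intervals as above.

from matplotlib.ticker import MultipleLocator

fig, ax = plt.subplots(figsize=(10, 3))
_ = s.plot.line(ax=ax)
_ = ax.set_title('Basic line plot')

# place a tick at every whole number on the x-axis and every 0.1 on the y-axis
ax.xaxis.set_major_locator(MultipleLocator(1))
ax.yaxis.set_major_locator(MultipleLocator(0.1))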
1.2. Multi-class, average precision score
In multi-class classification, your y_true (truth labels) might be a 1-dimensional vector, but your predictions y_pred (especially if you use predict_proba()) will be multi-dimensional. The average_precision_score(y_true, y_pred) function expects both y_true and y_pred to be multi-dimensional. For example, the following will fail.
y_true = [1, 1, 0, 0, 2, 2]
y_pred = [
[0.0, 1.0, 0.0],
[0.0, 1.0, 0.0],
[1.0, 0.0, 0.0],
[1.0, 0.0, 0.0],
[0.0, 0.0, 1.0],
[0.0, 0.0, 1.0]
]
average_precision_score(y_true, y_pred) # fails
You will need to one-hot encode y_true.
[3]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import average_precision_score
y_true = np.array([1, 1, 0, 0, 2, 2])
encoder = OneHotEncoder()
Y = encoder.fit_transform(y_true.reshape(-1, 1)).toarray()  # toarray() returns an ndarray; todense() returns the deprecated np.matrix
y_pred = np.array([
[0.0, 1.0, 0.0],
[0.0, 1.0, 0.0],
[1.0, 0.0, 0.0],
[1.0, 0.0, 0.0],
[0.0, 0.0, 1.0],
[0.0, 0.0, 1.0]
])
average_precision_score(Y, y_pred)
[3]:
1.0
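As an alternative to OneHotEncoder, scikit-learn's label_binarize() performs the same one-hot encoding of y_true in a single call and returns a dense array directly.

from sklearn.preprocessing import label_binarize

# binarize the truth labels against the full set of classes
Y = label_binarize(y_true, classes=[0, 1, 2])
average_precision_score(Y, y_pred)  # also 1.0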
1.3. Sorting multi-index
You might have a dataframe that has a multi-index on the rows and the columns. How do you sort by columns or rows?
[4]:
df = pd.DataFrame({
'height': ['tall', 'tall', 'tall', 'tall', 'short', 'short', 'short', 'short'],
'weight': ['heavy', 'heavy', 'light', 'light', 'heavy', 'heavy', 'light', 'light'],
'value': [9.9, 10.0, 7.7, 6.7, 5.5, 6.6, 3.3, 2.3]
})
stats = df.groupby(['height', 'weight']).agg(['mean', 'std'])
stats
[4]:
| height | weight | (value, mean) | (value, std) |
|---|---|---|---|
| short | heavy | 6.05 | 0.777817 |
| short | light | 2.80 | 0.707107 |
| tall | heavy | 9.95 | 0.070711 |
| tall | light | 7.20 | 0.707107 |
To sort based on the column multi-index, you have to use sort_values() and supply a list of tuples. Below, we sort by the mean and then the std columns.
[5]:
stats.sort_values([('value', 'mean'), ('value', 'std')])
[5]:
| height | weight | (value, mean) | (value, std) |
|---|---|---|---|
| short | light | 2.80 | 0.707107 |
| short | heavy | 6.05 | 0.777817 |
| tall | light | 7.20 | 0.707107 |
| tall | heavy | 9.95 | 0.070711 |
Sorting based on the row multi-index is simpler: just supply a list of index names to sort_values().
[6]:
stats.sort_values(['weight'], axis=0)
[6]:
| height | weight | (value, mean) | (value, std) |
|---|---|---|---|
| short | heavy | 6.05 | 0.777817 |
| tall | heavy | 9.95 | 0.070711 |
| short | light | 2.80 | 0.707107 |
| tall | light | 7.20 | 0.707107 |
If you need to sort in descending order, pass in ascending=False.
[7]:
stats.sort_values(['weight'], axis=0, ascending=False)
[7]:
| height | weight | (value, mean) | (value, std) |
|---|---|---|---|
| short | light | 2.80 | 0.707107 |
| tall | light | 7.20 | 0.707107 |
| short | heavy | 6.05 | 0.777817 |
| tall | heavy | 9.95 | 0.070711 |
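Since height and weight are row index levels, sort_index() offers an equivalent route; the sketch below sorts by the weight level and then the height level.

# sort rows by the 'weight' level, then the 'height' level
stats.sort_index(level=['weight', 'height'])

# descending order works the same way
stats.sort_index(level=['weight', 'height'], ascending=False)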
1.4. One-Hot Encoding Pipeline
When you use one-hot encoding (OHE) in a pipeline, you will have to deal with missing values. If you use SimpleImputer and fill missing values with a constant value, you will always end up with an extra column. What's worse, the rows that had missing categorical values will NOT be missing in the output and will instead have 0's. Additionally, your output is a matrix, and you lose the field names. Here's an example dataframe with two numeric fields (height and age) and two categorical fields (pet and color).
[8]:
df = pd.DataFrame({
'height': [2.3, 3.3, 2.4, 5.5, np.nan],
'pet': ['cat', 'dog', 'cat', np.nan, 'dog'],
'color': ['blue', 'black', np.nan, 'brown', 'black'],
'age': [10, 5, np.nan, 4, 8]
})
df
[8]:
|   | height | pet | color | age |
|---|---|---|---|---|
| 0 | 2.3 | cat | blue | 10.0 |
| 1 | 3.3 | dog | black | 5.0 |
| 2 | 2.4 | cat | NaN | NaN |
| 3 | 5.5 | NaN | brown | 4.0 |
| 4 | NaN | dog | black | 8.0 |
Let’s say we want to OHE the pet and color fields. See the problems in the output below?

- What happened to the headers?
- The output is a matrix and not a dataframe.
- We have an additional column.
[9]:
from sklearn.impute import SimpleImputer
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
cat_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value=np.nan)),
('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
transformer = ColumnTransformer(transformers=[('cat', cat_pipeline, ['pet', 'color'])])
pipeline = Pipeline(steps=[
('preprocessing', transformer)
])
pipeline.fit_transform(df)
[9]:
array([[1., 0., 0., 0., 1., 0., 0.],
[0., 1., 0., 1., 0., 0., 0.],
[1., 0., 0., 0., 0., 0., 1.],
[0., 0., 1., 0., 0., 1., 0.],
[0., 1., 0., 1., 0., 0., 0.]])
We can try to recover the field names as follows. If we had specified the transformer to pass through the numeric fields (e.g., remainder='passthrough'), then we would have to do even more work to align the field names. It should be apparent that the _nan fields are a nuisance and that the generic field prefixes x0_ and x1_ are not very helpful.
[10]:
columns = transformer.named_transformers_['cat'] \
.named_steps['ohe'] \
.get_feature_names_out()
pd.DataFrame(pipeline.transform(df), columns=columns)
[10]:
|   | x0_cat | x0_dog | x0_nan | x1_black | x1_blue | x1_brown | x1_nan |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
So, what can we do? We can create a custom transformer to fix the problem. Take note of the pattern below. The OheAdjustmentTransformer is the custom transformer that takes in the ColumnTransformer; the former needs the latter to retrieve the OHE fields. Also, notice the dizzying nesting of structures:

- pipelines within pipelines,
- transformers within pipelines, and
- pipelines within transformers.

The final output is a little bit better since

- the field names are more meaningful (they have the original field names as prefixes instead of x0_),
- the _nan columns are dropped,
- the records that should have missing values (instead of 0's) are corrected, and
- if we choose to pass through the numeric fields, then they are also available.
[11]:
class OheAdjustmentTransformer:
    def __init__(self, transformer, num_columns, cat_columns, missing_suffix='nan'):
        self.transformer = transformer
        self.num_columns = num_columns
        self.cat_columns = cat_columns
        self.missing_suffix = missing_suffix

    def adjust(self, df):
        def make_null(d, cat_field, nan_field):
            # rebuild each OHE column with the original field name as prefix,
            # dropping the _nan column and restoring NaN where the _nan flag is set
            n_df = pd.DataFrame()
            for field in d.columns:
                if field == nan_field:
                    continue
                u_index = field.find('_')
                val = field[u_index + 1:]
                n_field = f'{cat_field}_{val}'
                n_df[n_field] = np.select([d[nan_field] == 1], [np.nan], default=d[field])
            n_df.index = d.index
            return n_df

        # get_feature_names_out() emits generic x0_, x1_, ... prefixes
        prefixes = [f'x{i}_' for i in range(len(self.cat_columns))]
        c2n = {c: f'{p}nan' for c, p in zip(self.cat_columns, prefixes)}
        df_cols = {c: [f for f in df.columns if f.startswith(p)]
                   for c, p in zip(self.cat_columns, prefixes)}
        dfs = {c: df[df_cols[c]] for c in df_cols}
        n_dfs = {c: make_null(d, c, c2n[c]) for c, d in dfs.items()}
        f_df = pd.concat([d for d in n_dfs.values()], axis=1)
        return f_df

    def fit(self, X=None, y=None):
        return self

    def transform(self, X):
        cat_columns = self.transformer.named_transformers_['cat'] \
            .named_steps['ohe'] \
            .get_feature_names_out()
        cat_columns = list(cat_columns)
        num_columns = self.num_columns

        # with remainder='passthrough', the numeric columns trail the OHE columns
        if X.shape[1] == len(cat_columns):
            columns = cat_columns
        else:
            columns = cat_columns + num_columns

        df = pd.DataFrame(X, columns=columns)

        if X.shape[1] == len(cat_columns):
            return self.adjust(df)
        else:
            return self.adjust(df).join(df[num_columns])

cat_columns = ['pet', 'color']
num_columns = [c for c in df.columns if c not in cat_columns]

cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=np.nan)),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

transformer = ColumnTransformer(
    transformers=[('cat', cat_pipeline, cat_columns)],
    remainder='passthrough')

df_pipeline = Pipeline(steps=[
    ('df', OheAdjustmentTransformer(transformer, num_columns, cat_columns))
])

pipeline = Pipeline(steps=[
    ('preprocessing', transformer),
    ('postprocessing', df_pipeline)
])

pipeline.fit_transform(df)
[11]:
|   | pet_cat | pet_dog | color_black | color_blue | color_brown | height | age |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.3 | 10.0 |
| 1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 3.3 | 5.0 |
| 2 | 1.0 | 0.0 | NaN | NaN | NaN | 2.4 | NaN |
| 3 | NaN | NaN | 0.0 | 0.0 | 1.0 | 5.5 | 4.0 |
| 4 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | NaN | 8.0 |
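As an aside, if you are on scikit-learn 1.2 or later, set_output(transform='pandas') lets transformers emit dataframes with real column names directly, which removes much of the bookkeeping above (though not the NaN re-insertion that the custom transformer performs). A hedged sketch follows; note that sparse_output replaces the older sparse argument, and the 'missing' fill value here is just an illustrative choice.

# assumes scikit-learn >= 1.2 (set_output and sparse_output)
cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
transformer = ColumnTransformer(
    transformers=[('cat', cat_pipeline, cat_columns)],
    remainder='passthrough',
    verbose_feature_names_out=False)  # keep 'pet_cat' instead of 'cat__pet_cat'
transformer.set_output(transform='pandas')  # output a dataframe, not a matrix
transformer.fit_transform(df)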
1.5. Frequency Encoding
Frequency encoding is a way to turn categorical variables into numeric ones without expanding the dimensionality of the data (as one-hot encoding does). Every categorical value is replaced either with its count or its probability. Here's how to implement frequency encoding with Scikit-Learn. Other implementations are available but break with missing values; the approach here does not. First, let's create sample data.
[12]:
df = pd.DataFrame({
'x1': [1, 2, 3, 4, np.nan, 6],
'x2': ['cat', 'dog', 'cat', 'cat', 'dog', np.nan],
'y': [1, 3, 4, 5, 2, 6]
})
df
[12]:
|   | x1 | x2 | y |
|---|---|---|---|
| 0 | 1.0 | cat | 1 |
| 1 | 2.0 | dog | 3 |
| 2 | 3.0 | cat | 4 |
| 3 | 4.0 | cat | 5 |
| 4 | NaN | dog | 2 |
| 5 | 6.0 | NaN | 6 |
The frequency encoder is implemented as follows. This implementation will handle the input X whether it is a Pandas dataframe or a numpy array.
[13]:
class FreqEncoder:
    def __init__(self, variables):
        self.variables_ = variables
        self.v2e_ = None  # forward mapping: value -> frequency
        self.e2v_ = None  # inverse mapping: frequency -> value

    def fit(self, X=None, y=None):
        def get_df_freq(c):
            s = X[c].value_counts()
            return s / s.sum()

        def get_np_freq(i):
            s = pd.Series(X[:, i]).value_counts()
            return s / s.sum()

        if isinstance(X, pd.DataFrame):
            self.v2e_ = {c: get_df_freq(c).to_dict() for c in self.variables_}
            self.e2v_ = {c: {val: key for key, val in d.items()} for c, d in self.v2e_.items()}
        elif isinstance(X, np.ndarray):
            self.v2e_ = {i: get_np_freq(i).to_dict() for i in self.variables_}
            self.e2v_ = {i: {val: key for key, val in d.items()} for i, d in self.v2e_.items()}
        else:
            raise Exception(f'X type is not handled: {type(X)}')
        return self

    def transform(self, X):
        # note: the c=c / i=i defaults bind the loop variable eagerly; a bare
        # lambda would capture only the last variable when there are several
        if isinstance(X, pd.DataFrame):
            return X.assign(**{c: lambda d, c=c: d[c].map(self.v2e_[c]) for c in self.variables_})
        elif isinstance(X, np.ndarray):
            df = pd.DataFrame(X, columns=[f'{i}' for i in range(X.shape[1])])
            df = df.assign(**{f'{i}': lambda d, i=i: d[f'{i}'].map(self.v2e_[i]) for i in self.variables_})
            return df.values
        else:
            raise Exception(f'X type is not handled: {type(X)}')

    def inverse_transform(self, X):
        if isinstance(X, pd.DataFrame):
            return X.assign(**{c: lambda d, c=c: d[c].map(self.e2v_[c]) for c in self.variables_})
        elif isinstance(X, np.ndarray):
            df = pd.DataFrame(X, columns=[f'{i}' for i in range(X.shape[1])])
            df = df.assign(**{f'{i}': lambda d, i=i: d[f'{i}'].map(self.e2v_[i]) for i in self.variables_})
            return df.values
        else:
            raise Exception(f'X type is not handled: {type(X)}')
Now, let’s use the encoder and inspect its properties. The v2e_ field is the forward mapping of categorical variable values to frequencies, and the e2v_ field is the inverse/backward mapping of frequencies to categorical variable values. Obviously, a concern is with ties; we leave it up to the user to add a random fudge factor and renormalize to fix this problem, as sketched after the output below.
[14]:
encoder = FreqEncoder(list(df.select_dtypes(include=['object']).columns))
encoder.fit(df)
encoder.v2e_, encoder.e2v_
[14]:
({'x2': {'cat': 0.6, 'dog': 0.4}}, {'x2': {0.6: 'cat', 0.4: 'dog'}})
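As a minimal sketch of the fudge-factor idea (hypothetical, not part of the encoder above): jitter the frequencies by a tiny random amount and renormalize, so tied values map to distinct keys in the inverse mapping.

# hypothetical tie-breaking sketch: 'cat' and 'dog' are tied at 0.5
freq = pd.Series({'cat': 0.5, 'dog': 0.5})
freq = freq + np.random.uniform(0, 1e-6, size=len(freq))  # tiny random fudge factor
freq = freq / freq.sum()  # renormalize so the frequencies still sum to 1
freq.to_dict()  # values are now unique, so the inverse mapping is well-defined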
Here is the transform.
[15]:
encoder.transform(df)
[15]:
|   | x1 | x2 | y |
|---|---|---|---|
| 0 | 1.0 | 0.6 | 1 |
| 1 | 2.0 | 0.4 | 3 |
| 2 | 3.0 | 0.6 | 4 |
| 3 | 4.0 | 0.6 | 5 |
| 4 | NaN | 0.4 | 2 |
| 5 | 6.0 | NaN | 6 |
Here is the inverse transform.
[16]:
encoder.inverse_transform(encoder.transform(df))
[16]:
|   | x1 | x2 | y |
|---|---|---|---|
| 0 | 1.0 | cat | 1 |
| 1 | 2.0 | dog | 3 |
| 2 | 3.0 | cat | 4 |
| 3 | 4.0 | cat | 5 |
| 4 | NaN | dog | 2 |
| 5 | 6.0 | NaN | 6 |
Let’s use the encoder in a pipeline.
[17]:
from sklearn.linear_model import LinearRegression
X, y = df[['x1', 'x2']], df['y']
model = Pipeline([
('encoder', FreqEncoder(['x2'])),
('imputer', SimpleImputer()),
('estimator', LinearRegression())
])
model.fit(X, y)
model.predict(X)
[17]:
array([1.70620043, 1.89185292, 3.61679885, 4.57209805, 3.03821197,
6.17483778])
We may access the fitted encoder from the pipeline.
[18]:
model.steps[0][1].transform(X)
[18]:
|   | x1 | x2 |
|---|---|---|
| 0 | 1.0 | 0.6 |
| 1 | 2.0 | 0.4 |
| 2 | 3.0 | 0.6 |
| 3 | 4.0 | 0.6 |
| 4 | NaN | 0.4 |
| 5 | 6.0 | NaN |
And we can see the fitted encoder from the pipeline in action with the inverse transformation.
[19]:
model.steps[0][1].inverse_transform(model.steps[0][1].transform(X))
[19]:
|   | x1 | x2 |
|---|---|---|
| 0 | 1.0 | cat |
| 1 | 2.0 | dog |
| 2 | 3.0 | cat |
| 3 | 4.0 | cat |
| 4 | NaN | dog |
| 5 | 6.0 | NaN |
For the sake of completeness, let’s work with numpy arrays.
[20]:
X, y = df[['x1', 'x2']].values, df['y'].values
X, y
[20]:
(array([[1.0, 'cat'],
[2.0, 'dog'],
[3.0, 'cat'],
[4.0, 'cat'],
[nan, 'dog'],
[6.0, nan]], dtype=object),
array([1, 3, 4, 5, 2, 6]))
Notice that since we are using a numpy array for X, we cannot pass in the column names; we must instead pass in the indices of the columns that need to be frequency encoded.
[21]:
model = Pipeline([
('encoder', FreqEncoder([1])),
('imputer', SimpleImputer()),
('estimator', LinearRegression())
])
model.fit(X, y)
model.predict(X)
[21]:
array([1.70620043, 1.89185292, 3.61679885, 4.57209805, 3.03821197,
6.17483778])
Here is the fitted frequency encoder used to transform the data.
[22]:
model.steps[0][1].transform(X[:,:2])
[22]:
array([[1.0, 0.6],
[2.0, 0.4],
[3.0, 0.6],
[4.0, 0.6],
[nan, 0.4],
[6.0, nan]], dtype=object)
Here is the fitted frequency encoder used to inverse transform the data.
[23]:
model.steps[0][1].inverse_transform(model.steps[0][1].transform(X[:,:2]))
[23]:
array([[1.0, 'cat'],
[2.0, 'dog'],
[3.0, 'cat'],
[4.0, 'cat'],
[nan, 'dog'],
[6.0, nan]], dtype=object)
1.6. Displaying Pandas Series and DataFrames
Oftentimes, we want to display ALL rows and columns of a Pandas Series or DataFrame. We can set the options in a notebook cell as follows.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
But these settings then apply to all subsequent displays. Later, when we need to revert, we can do the following.
# one at a time
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
# or all at once
pd.reset_option('all')
It’s better to use the Pandas option context manager to temporarily apply the settings just for a single cell.
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
display(some_df)
display(some_series)
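For example, assuming a wide dataframe named wide_df (a hypothetical stand-in), every column prints inside the with block, and the global options are untouched afterwards.

wide_df = pd.DataFrame(np.random.random((3, 50)))  # 50 columns would normally be truncated

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(wide_df)  # all rows and columns are shown only within this block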