4. Patsy
Patsy is a neat API to transform your data into experimentation model form. For regression and classification problems, you often want your data in the Xy
form where X
is a matrix (independent variable) and y
is a column vector (dependent variable). In regression, let’s say you have \(x_1, x_2\) as your independent variable and \(y\) as your dependent variable. You might want to express the possible models as follows.
\(y = b_0 + x_1 + x_2\)
\(y = b_0 + x_1 + x_2 + x_1 x_2\)
\(y = b_0 + x_1 + x_2 + x_1^2 + x_2^2\)
\(y = b_0 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2\)
If your data is stored in a dataframe, it’s not terribly hard (but not so easy either) to transform your dataframe into one with the polynomial and interaction effects. Let’s look at a made up dataframe below, and let’s say that we want to model our regression problem with \(y = b_0 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2\). How would we transform the dataframe?
[1]:
import pandas as pd
df = pd.DataFrame({
'height': [10, 20, 30, 40, 50],
'weight': [88, 99, 125, 155, 120],
'y': [50, 60, 70, 80, 90]
})
df
[1]:
height | weight | y | |
---|---|---|---|
0 | 10 | 88 | 50 |
1 | 20 | 99 | 60 |
2 | 30 | 125 | 70 |
3 | 40 | 155 | 80 |
4 | 50 | 120 | 90 |
Here’s one possible way to transform the dataframe into the model we desire. There are 6 lines of code!
[2]:
X = df[['height', 'weight']]
X['intercept'] = 1
X['height**2'] = df.height**2
X['weight**2'] = df.weight**2
X['height*weight'] = df.height * df.weight
X = X[['intercept', 'height', 'weight', 'height**2', 'weight**2', 'height*weight']]
X
[2]:
intercept | height | weight | height**2 | weight**2 | height*weight | |
---|---|---|---|---|---|---|
0 | 1 | 10 | 88 | 100 | 7744 | 880 |
1 | 1 | 20 | 99 | 400 | 9801 | 1980 |
2 | 1 | 30 | 125 | 900 | 15625 | 3750 |
3 | 1 | 40 | 155 | 1600 | 24025 | 6200 |
4 | 1 | 50 | 120 | 2500 | 14400 | 6000 |
4.1. Formulas
There’s an easier way using patsy
. Patsy uses R
-style formulas to define the model and takes care of transforming the data for you. There are only 2 lines of code in using patsy! Really, just 1 line of code if the formula was not so long, then we would have passed the string literal to the dmatrice()
function.
[3]:
from patsy import dmatrices
formula = 'y ~ height + weight + I(height**2) + I(weight**2) + height:weight'
y, X = dmatrices(formula, df, return_type='dataframe')
X
[3]:
Intercept | height | weight | I(height ** 2) | I(weight ** 2) | height:weight | |
---|---|---|---|---|---|---|
0 | 1.0 | 10.0 | 88.0 | 100.0 | 7744.0 | 880.0 |
1 | 1.0 | 20.0 | 99.0 | 400.0 | 9801.0 | 1980.0 |
2 | 1.0 | 30.0 | 125.0 | 900.0 | 15625.0 | 3750.0 |
3 | 1.0 | 40.0 | 155.0 | 1600.0 | 24025.0 | 6200.0 |
4 | 1.0 | 50.0 | 120.0 | 2500.0 | 14400.0 | 6000.0 |
Notice how we had to use I(height**2)
and I(weight**2)
? The **
operator already has a special meaning in patsy (which is to produce interaction effects), and not the Python meaning (which is raising a number to the power of another). The I()
is called the identity operator and preserves the expected Python expression. Also, we got the Intercept
column by default. If we
did not want the Intercept
column, then we would subtract 1.
[4]:
formula = 'y ~ height + weight + I(height**2) + I(weight**2) + height:weight - 1'
y, X = dmatrices(formula, df, return_type='dataframe')
X
[4]:
height | weight | I(height ** 2) | I(weight ** 2) | height:weight | |
---|---|---|---|---|---|
0 | 10.0 | 88.0 | 100.0 | 7744.0 | 880.0 |
1 | 20.0 | 99.0 | 400.0 | 9801.0 | 1980.0 |
2 | 30.0 | 125.0 | 900.0 | 15625.0 | 3750.0 |
3 | 40.0 | 155.0 | 1600.0 | 24025.0 | 6200.0 |
4 | 50.0 | 120.0 | 2500.0 | 14400.0 | 6000.0 |
The return_type
argument is set to dataframe
so that a tuple of pandas
dataframes are returned; otherwise, a tuple of DesignMatrix
is returned instead. A DesignMatrix
is just a wrapper around numpy
arrays with metadata.
[5]:
formula = 'y ~ height + weight + I(height**2) + I(weight**2) + height:weight - 1'
y, X = dmatrices(formula, df)
[6]:
X
[6]:
DesignMatrix with shape (5, 5)
height weight I(height ** 2) I(weight ** 2) height:weight
10 88 100 7744 880
20 99 400 9801 1980
30 125 900 15625 3750
40 155 1600 24025 6200
50 120 2500 14400 6000
Terms:
'height' (column 0)
'weight' (column 1)
'I(height ** 2)' (column 2)
'I(weight ** 2)' (column 3)
'height:weight' (column 4)
[7]:
y
[7]:
DesignMatrix with shape (5, 1)
y
50
60
70
80
90
Terms:
'y' (column 0)
If we did not want y
, we simply omit the y ~
part and use the dmatrix()
function instead.
[8]:
from patsy import dmatrix
formula = 'height + weight + I(height**2) + I(weight**2) + height:weight - 1'
X = dmatrix(formula, df, return_type='dataframe')
X
[8]:
height | weight | I(height ** 2) | I(weight ** 2) | height:weight | |
---|---|---|---|---|---|
0 | 10.0 | 88.0 | 100.0 | 7744.0 | 880.0 |
1 | 20.0 | 99.0 | 400.0 | 9801.0 | 1980.0 |
2 | 30.0 | 125.0 | 900.0 | 15625.0 | 3750.0 |
3 | 40.0 | 155.0 | 1600.0 | 24025.0 | 6200.0 |
4 | 50.0 | 120.0 | 2500.0 | 14400.0 | 6000.0 |
We already saw the builtin I()
function, but we can also use other functions to transform our data. Below, we do a log transform on height and weight.
[9]:
import numpy as np
formula = 'y ~ np.log(height) + np.log(weight) + I(height**2) + I(weight**2) + height:weight'
y, X = dmatrices(formula, df, return_type='dataframe')
X
[9]:
Intercept | np.log(height) | np.log(weight) | I(height ** 2) | I(weight ** 2) | height:weight | |
---|---|---|---|---|---|---|
0 | 1.0 | 2.302585 | 4.477337 | 100.0 | 7744.0 | 880.0 |
1 | 1.0 | 2.995732 | 4.595120 | 400.0 | 9801.0 | 1980.0 |
2 | 1.0 | 3.401197 | 4.828314 | 900.0 | 15625.0 | 3750.0 |
3 | 1.0 | 3.688879 | 5.043425 | 1600.0 | 24025.0 | 6200.0 |
4 | 1.0 | 3.912023 | 4.787492 | 2500.0 | 14400.0 | 6000.0 |
We can also center or standardize values using using the builtin functions center()
and standardize()
.
[10]:
formula = 'y ~ center(height) + standardize(weight) + I(height**2) + I(weight**2) + center(height):standardize(weight)'
y, X = dmatrices(formula, df, return_type='dataframe')
X
[10]:
Intercept | center(height) | standardize(weight) | I(height ** 2) | I(weight ** 2) | center(height):standardize(weight) | |
---|---|---|---|---|---|---|
0 | 1.0 | -20.0 | -1.269602 | 100.0 | 7744.0 | 25.392048 |
1 | 1.0 | -10.0 | -0.794581 | 400.0 | 9801.0 | 7.945811 |
2 | 1.0 | 0.0 | 0.328197 | 900.0 | 15625.0 | 0.000000 |
3 | 1.0 | 10.0 | 1.623709 | 1600.0 | 24025.0 | 16.237092 |
4 | 1.0 | 20.0 | 0.112278 | 2500.0 | 14400.0 | 2.245555 |
4.2. Categorical data
How does patsy handle categorical data? Below, we create a dummy data set with two categorical fields; handedness
has the values left and right; school has the values elementary, middle, high (high for highschool).
[11]:
df = pd.DataFrame({
'height': [10, 20, 30, 40, 50],
'weight': [88, 99, 125, 155, 120],
'handedness': ['left', 'right', 'left', 'right', 'left'],
'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
'y': [50, 60, 70, 80, 90]
})
df
[11]:
height | weight | handedness | school | y | |
---|---|---|---|---|---|
0 | 10 | 88 | left | elementary | 50 |
1 | 20 | 99 | right | middle | 60 |
2 | 30 | 125 | left | high | 70 |
3 | 40 | 155 | right | elementary | 80 |
4 | 50 | 120 | left | middle | 90 |
The formulas work the same way as before, and patsy magically creates one-hot encoded variables for the categorical variables. Notice how one of the categorical variables is automatically dropped? For example, we might expect to see handedness[T.left]
and handedness[T.right]
as bew columns, but we only see handedness[T.right]
. Likewise, school[T.elementary]
is dropped. The resulting column names of the pattern name[T.value]
is to signal that these fields are the result of
one-hot encoding against categorical variables.
[12]:
formula = 'y ~ height + weight + handedness + school'
y, X = dmatrices(formula, df, return_type='dataframe')
X
[12]:
Intercept | handedness[T.right] | school[T.high] | school[T.middle] | height | weight | |
---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 0.0 | 10.0 | 88.0 |
1 | 1.0 | 1.0 | 0.0 | 1.0 | 20.0 | 99.0 |
2 | 1.0 | 0.0 | 1.0 | 0.0 | 30.0 | 125.0 |
3 | 1.0 | 1.0 | 0.0 | 0.0 | 40.0 | 155.0 |
4 | 1.0 | 0.0 | 0.0 | 1.0 | 50.0 | 120.0 |
If we use the C()
builtin function, we can specify the ordering or levels of the categorical variable. Notice how the first level is the one dropped.
[13]:
handedness_levels = ['right', 'left']
school_levels = ['elementary', 'middle', 'high']
formula = 'y ~ height + weight + C(handedness, levels=handedness_levels) + C(school, levels=school_levels)'
y, X = dmatrices(formula, df, return_type='dataframe')
X
[13]:
Intercept | C(handedness, levels=handedness_levels)[T.left] | C(school, levels=school_levels)[T.middle] | C(school, levels=school_levels)[T.high] | height | weight | |
---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 0.0 | 0.0 | 10.0 | 88.0 |
1 | 1.0 | 0.0 | 1.0 | 0.0 | 20.0 | 99.0 |
2 | 1.0 | 1.0 | 0.0 | 1.0 | 30.0 | 125.0 |
3 | 1.0 | 0.0 | 0.0 | 0.0 | 40.0 | 155.0 |
4 | 1.0 | 1.0 | 1.0 | 0.0 | 50.0 | 120.0 |
4.3. Interaction
To specify interaction effects, we have a couple of options. We can use the following operators or symbols.
:
*
**
Below, we specify interaction effect using :
. You will notice that handedness
is not part of the model.
[14]:
formula = 'y ~ handedness:school'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[14]:
['Intercept',
'school[T.high]',
'school[T.middle]',
'handedness[T.right]:school[elementary]',
'handedness[T.right]:school[high]',
'handedness[T.right]:school[middle]']
Below, we use *
, and you will notice all two-way and single effects are included. The result is typically what we want. Now, handedness
is part of the model.
[15]:
formula = 'y ~ handedness*school'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[15]:
['Intercept',
'handedness[T.right]',
'school[T.high]',
'school[T.middle]',
'handedness[T.right]:school[T.high]',
'handedness[T.right]:school[T.middle]']
The **
is probably the best way to specify 1 and 2-way interactions between two variables. Additionally, patsy just knows what to do with dropping columns to prevent multicollinearity.
[16]:
formula = 'y ~ (handedness + school)**2'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[16]:
['Intercept',
'handedness[T.right]',
'school[T.high]',
'school[T.middle]',
'handedness[T.right]:school[T.high]',
'handedness[T.right]:school[T.middle]']
Here is a model specified up to all 2-way interactions for all the variables.
[17]:
formula = 'y ~ (height + weight + handedness + school)**2'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[17]:
['Intercept',
'handedness[T.right]',
'school[T.high]',
'school[T.middle]',
'handedness[T.right]:school[T.high]',
'handedness[T.right]:school[T.middle]',
'height',
'height:handedness[T.right]',
'height:school[T.high]',
'height:school[T.middle]',
'weight',
'weight:handedness[T.right]',
'weight:school[T.high]',
'weight:school[T.middle]',
'height:weight']
And if we wanted up to all 3-way interactions?
[18]:
formula = 'y ~ (height + weight + handedness + school)**3'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[18]:
['Intercept',
'handedness[T.right]',
'school[T.high]',
'school[T.middle]',
'handedness[T.right]:school[T.high]',
'handedness[T.right]:school[T.middle]',
'height',
'height:handedness[T.right]',
'height:school[T.high]',
'height:school[T.middle]',
'height:handedness[T.right]:school[T.high]',
'height:handedness[T.right]:school[T.middle]',
'weight',
'weight:handedness[T.right]',
'weight:school[T.high]',
'weight:school[T.middle]',
'weight:handedness[T.right]:school[T.high]',
'weight:handedness[T.right]:school[T.middle]',
'height:weight',
'height:weight:handedness[T.right]',
'height:weight:school[T.high]',
'height:weight:school[T.middle]']
And if we wanted up to all 4-way interactions?
[19]:
formula = 'y ~ (height + weight + handedness + school)**4'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[19]:
['Intercept',
'handedness[T.right]',
'school[T.high]',
'school[T.middle]',
'handedness[T.right]:school[T.high]',
'handedness[T.right]:school[T.middle]',
'height',
'height:handedness[T.right]',
'height:school[T.high]',
'height:school[T.middle]',
'height:handedness[T.right]:school[T.high]',
'height:handedness[T.right]:school[T.middle]',
'weight',
'weight:handedness[T.right]',
'weight:school[T.high]',
'weight:school[T.middle]',
'weight:handedness[T.right]:school[T.high]',
'weight:handedness[T.right]:school[T.middle]',
'height:weight',
'height:weight:handedness[T.right]',
'height:weight:school[T.high]',
'height:weight:school[T.middle]',
'height:weight:handedness[T.right]:school[T.high]',
'height:weight:handedness[T.right]:school[T.middle]']
4.4. Null values
When creating your design matrices from formulas, patsy, for better or worse, will drop records that have null values. Below, we have 5 records and there is one record with a null value for handedness
.
[20]:
df = pd.DataFrame({
'height': [10, 20, 30, 40, 50],
'weight': [88, 99, 125, 155, 120],
'handedness': ['left', 'right', 'left', None, 'left'],
'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
'y': [50, 60, 70, 80, 90]
})
df
[20]:
height | weight | handedness | school | y | |
---|---|---|---|---|---|
0 | 10 | 88 | left | elementary | 50 |
1 | 20 | 99 | right | middle | 60 |
2 | 30 | 125 | left | high | 70 |
3 | 40 | 155 | None | elementary | 80 |
4 | 50 | 120 | left | middle | 90 |
If we defined a model, then observe how the record with the null value is removed!
[21]:
formula = 'y ~ handedness*school'
y, X = dmatrices(formula, df, return_type='dataframe')
X
[21]:
Intercept | handedness[T.right] | school[T.high] | school[T.middle] | handedness[T.right]:school[T.high] | handedness[T.right]:school[T.middle] | |
---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
2 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
4 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
To prevent records with null values from being remove, we have to specify what to do with null values using the NA_action
argument. By default, NA_action
is set to drop
to remove records with null values. The alternative is to set this argument to raise
, which will raise an exception. Neither of these string values help to retain records with null values. We have to use the object NAAction
and specify the null types to empty (e.g. NA_types=[]
). The example is below.
Observe how both levels of handedness
is part of the data? You will have to drop one of these now. Yikes! Always make sure that you check your resulting design matrices!
It’s a bit awkward how records with null values are dropped by default, but, if you decide to keep them, then columns are not dropped as expected. What’s being dropped and not being dropped? Again, check your resulting design matrices to make sure that your row and column counts/outputs are as expected when dealing with null values.
[22]:
from patsy import NAAction
formula = 'y ~ handedness*school'
y, X = dmatrices(formula, df, return_type='dataframe', NA_action=NAAction(NA_types=[]))
X
[22]:
Intercept | handedness[T.left] | handedness[T.right] | school[T.high] | school[T.middle] | handedness[T.left]:school[T.high] | handedness[T.right]:school[T.high] | handedness[T.left]:school[T.middle] | handedness[T.right]:school[T.middle] | |
---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
If you have null values for continuous variables, that is fine. In the example below, one of the weight
value is None
.
[23]:
df = pd.DataFrame({
'height': [10, 20, 30, 40, 50],
'weight': [88, 99, 125, None, 120],
'handedness': ['left', 'right', 'left', None, 'left'],
'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
'y': [50, 60, 70, 80, 90]
})
formula = 'y ~ height + weight + handedness + school'
y, X = dmatrices(formula, df, return_type='dataframe', NA_action=NAAction(NA_types=[]))
X
[23]:
Intercept | handedness[T.left] | handedness[T.right] | school[T.high] | school[T.middle] | height | weight | |
---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 10.0 | 88.0 |
1 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 20.0 | 99.0 |
2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 30.0 | 125.0 |
3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 40.0 | NaN |
4 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 50.0 | 120.0 |
And just for completeness, if we do not have null values for the categorical variables, then patsy drops one of them as expected.
[24]:
df = pd.DataFrame({
'height': [10, 20, 30, 40, 50],
'weight': [88, 99, 125, None, 120],
'handedness': ['left', 'right', 'left', 'right', 'left'],
'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
'y': [50, 60, 70, 80, 90]
})
formula = 'y ~ height + weight + handedness + school'
y, X = dmatrices(formula, df, return_type='dataframe', NA_action=NAAction(NA_types=[]))
X
[24]:
Intercept | handedness[T.right] | school[T.high] | school[T.middle] | height | weight | |
---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 0.0 | 10.0 | 88.0 |
1 | 1.0 | 1.0 | 0.0 | 1.0 | 20.0 | 99.0 |
2 | 1.0 | 0.0 | 1.0 | 0.0 | 30.0 | 125.0 |
3 | 1.0 | 1.0 | 0.0 | 0.0 | 40.0 | NaN |
4 | 1.0 | 0.0 | 0.0 | 1.0 | 50.0 | 120.0 |