4. Patsy

Patsy is a neat API to transform your data into experimentation model form. For regression and classification problems, you often want your data in the Xy form where X is a matrix (independent variable) and y is a column vector (dependent variable). In regression, let’s say you have \(x_1, x_2\) as your independent variable and \(y\) as your dependent variable. You might want to express the possible models as follows.

  • \(y = b_0 + x_1 + x_2\)

  • \(y = b_0 + x_1 + x_2 + x_1 x_2\)

  • \(y = b_0 + x_1 + x_2 + x_1^2 + x_2^2\)

  • \(y = b_0 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2\)

If your data is stored in a dataframe, it’s not terribly hard (but not so easy either) to transform your dataframe into one with the polynomial and interaction effects. Let’s look at a made up dataframe below, and let’s say that we want to model our regression problem with \(y = b_0 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2\). How would we transform the dataframe?

[1]:
import pandas as pd

df = pd.DataFrame({
    'height': [10, 20, 30, 40, 50],
    'weight': [88, 99, 125, 155, 120],
    'y': [50, 60, 70, 80, 90]
})
df
[1]:
height weight y
0 10 88 50
1 20 99 60
2 30 125 70
3 40 155 80
4 50 120 90

Here’s one possible way to transform the dataframe into the model we desire. There are 6 lines of code!

[2]:
X = df[['height', 'weight']]
X['intercept'] = 1
X['height**2'] = df.height**2
X['weight**2'] = df.weight**2
X['height*weight'] = df.height * df.weight
X = X[['intercept', 'height', 'weight', 'height**2', 'weight**2', 'height*weight']]

X
[2]:
intercept height weight height**2 weight**2 height*weight
0 1 10 88 100 7744 880
1 1 20 99 400 9801 1980
2 1 30 125 900 15625 3750
3 1 40 155 1600 24025 6200
4 1 50 120 2500 14400 6000

4.1. Formulas

There’s an easier way using patsy. Patsy uses R-style formulas to define the model and takes care of transforming the data for you. There are only 2 lines of code in using patsy! Really, just 1 line of code if the formula was not so long, then we would have passed the string literal to the dmatrice() function.

[3]:
from patsy import dmatrices

formula = 'y ~ height + weight + I(height**2) + I(weight**2) + height:weight'
y, X = dmatrices(formula, df, return_type='dataframe')

X
[3]:
Intercept height weight I(height ** 2) I(weight ** 2) height:weight
0 1.0 10.0 88.0 100.0 7744.0 880.0
1 1.0 20.0 99.0 400.0 9801.0 1980.0
2 1.0 30.0 125.0 900.0 15625.0 3750.0
3 1.0 40.0 155.0 1600.0 24025.0 6200.0
4 1.0 50.0 120.0 2500.0 14400.0 6000.0

Notice how we had to use I(height**2) and I(weight**2)? The ** operator already has a special meaning in patsy (which is to produce interaction effects), and not the Python meaning (which is raising a number to the power of another). The I() is called the identity operator and preserves the expected Python expression. Also, we got the Intercept column by default. If we did not want the Intercept column, then we would subtract 1.

[4]:
formula = 'y ~ height + weight + I(height**2) + I(weight**2) + height:weight - 1'
y, X = dmatrices(formula, df, return_type='dataframe')

X
[4]:
height weight I(height ** 2) I(weight ** 2) height:weight
0 10.0 88.0 100.0 7744.0 880.0
1 20.0 99.0 400.0 9801.0 1980.0
2 30.0 125.0 900.0 15625.0 3750.0
3 40.0 155.0 1600.0 24025.0 6200.0
4 50.0 120.0 2500.0 14400.0 6000.0

The return_type argument is set to dataframe so that a tuple of pandas dataframes are returned; otherwise, a tuple of DesignMatrix is returned instead. A DesignMatrix is just a wrapper around numpy arrays with metadata.

[5]:
formula = 'y ~ height + weight + I(height**2) + I(weight**2) + height:weight - 1'
y, X = dmatrices(formula, df)
[6]:
X
[6]:
DesignMatrix with shape (5, 5)
  height  weight  I(height ** 2)  I(weight ** 2)  height:weight
      10      88             100            7744            880
      20      99             400            9801           1980
      30     125             900           15625           3750
      40     155            1600           24025           6200
      50     120            2500           14400           6000
  Terms:
    'height' (column 0)
    'weight' (column 1)
    'I(height ** 2)' (column 2)
    'I(weight ** 2)' (column 3)
    'height:weight' (column 4)
[7]:
y
[7]:
DesignMatrix with shape (5, 1)
   y
  50
  60
  70
  80
  90
  Terms:
    'y' (column 0)

If we did not want y, we simply omit the y ~ part and use the dmatrix() function instead.

[8]:
from patsy import dmatrix

formula = 'height + weight + I(height**2) + I(weight**2) + height:weight - 1'
X = dmatrix(formula, df, return_type='dataframe')

X
[8]:
height weight I(height ** 2) I(weight ** 2) height:weight
0 10.0 88.0 100.0 7744.0 880.0
1 20.0 99.0 400.0 9801.0 1980.0
2 30.0 125.0 900.0 15625.0 3750.0
3 40.0 155.0 1600.0 24025.0 6200.0
4 50.0 120.0 2500.0 14400.0 6000.0

We already saw the builtin I() function, but we can also use other functions to transform our data. Below, we do a log transform on height and weight.

[9]:
import numpy as np

formula = 'y ~ np.log(height) + np.log(weight) + I(height**2) + I(weight**2) + height:weight'
y, X = dmatrices(formula, df, return_type='dataframe')

X
[9]:
Intercept np.log(height) np.log(weight) I(height ** 2) I(weight ** 2) height:weight
0 1.0 2.302585 4.477337 100.0 7744.0 880.0
1 1.0 2.995732 4.595120 400.0 9801.0 1980.0
2 1.0 3.401197 4.828314 900.0 15625.0 3750.0
3 1.0 3.688879 5.043425 1600.0 24025.0 6200.0
4 1.0 3.912023 4.787492 2500.0 14400.0 6000.0

We can also center or standardize values using using the builtin functions center() and standardize().

[10]:
formula = 'y ~ center(height) + standardize(weight) + I(height**2) + I(weight**2) + center(height):standardize(weight)'
y, X = dmatrices(formula, df, return_type='dataframe')

X
[10]:
Intercept center(height) standardize(weight) I(height ** 2) I(weight ** 2) center(height):standardize(weight)
0 1.0 -20.0 -1.269602 100.0 7744.0 25.392048
1 1.0 -10.0 -0.794581 400.0 9801.0 7.945811
2 1.0 0.0 0.328197 900.0 15625.0 0.000000
3 1.0 10.0 1.623709 1600.0 24025.0 16.237092
4 1.0 20.0 0.112278 2500.0 14400.0 2.245555

4.2. Categorical data

How does patsy handle categorical data? Below, we create a dummy data set with two categorical fields; handedness has the values left and right; school has the values elementary, middle, high (high for highschool).

[11]:
df = pd.DataFrame({
    'height': [10, 20, 30, 40, 50],
    'weight': [88, 99, 125, 155, 120],
    'handedness': ['left', 'right', 'left', 'right', 'left'],
    'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
    'y': [50, 60, 70, 80, 90]
})
df
[11]:
height weight handedness school y
0 10 88 left elementary 50
1 20 99 right middle 60
2 30 125 left high 70
3 40 155 right elementary 80
4 50 120 left middle 90

The formulas work the same way as before, and patsy magically creates one-hot encoded variables for the categorical variables. Notice how one of the categorical variables is automatically dropped? For example, we might expect to see handedness[T.left] and handedness[T.right] as bew columns, but we only see handedness[T.right]. Likewise, school[T.elementary] is dropped. The resulting column names of the pattern name[T.value] is to signal that these fields are the result of one-hot encoding against categorical variables.

[12]:
formula = 'y ~ height + weight + handedness + school'
y, X = dmatrices(formula, df, return_type='dataframe')

X
[12]:
Intercept handedness[T.right] school[T.high] school[T.middle] height weight
0 1.0 0.0 0.0 0.0 10.0 88.0
1 1.0 1.0 0.0 1.0 20.0 99.0
2 1.0 0.0 1.0 0.0 30.0 125.0
3 1.0 1.0 0.0 0.0 40.0 155.0
4 1.0 0.0 0.0 1.0 50.0 120.0

If we use the C() builtin function, we can specify the ordering or levels of the categorical variable. Notice how the first level is the one dropped.

[13]:
handedness_levels = ['right', 'left']
school_levels = ['elementary', 'middle', 'high']

formula = 'y ~ height + weight + C(handedness, levels=handedness_levels) + C(school, levels=school_levels)'
y, X = dmatrices(formula, df, return_type='dataframe')

X
[13]:
Intercept C(handedness, levels=handedness_levels)[T.left] C(school, levels=school_levels)[T.middle] C(school, levels=school_levels)[T.high] height weight
0 1.0 1.0 0.0 0.0 10.0 88.0
1 1.0 0.0 1.0 0.0 20.0 99.0
2 1.0 1.0 0.0 1.0 30.0 125.0
3 1.0 0.0 0.0 0.0 40.0 155.0
4 1.0 1.0 1.0 0.0 50.0 120.0

4.3. Interaction

To specify interaction effects, we have a couple of options. We can use the following operators or symbols.

  • :

  • *

  • **

Below, we specify interaction effect using :. You will notice that handedness is not part of the model.

[14]:
formula = 'y ~ handedness:school'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[14]:
['Intercept',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[elementary]',
 'handedness[T.right]:school[high]',
 'handedness[T.right]:school[middle]']

Below, we use *, and you will notice all two-way and single effects are included. The result is typically what we want. Now, handedness is part of the model.

[15]:
formula = 'y ~ handedness*school'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[15]:
['Intercept',
 'handedness[T.right]',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[T.high]',
 'handedness[T.right]:school[T.middle]']

The ** is probably the best way to specify 1 and 2-way interactions between two variables. Additionally, patsy just knows what to do with dropping columns to prevent multicollinearity.

[16]:
formula = 'y ~ (handedness + school)**2'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[16]:
['Intercept',
 'handedness[T.right]',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[T.high]',
 'handedness[T.right]:school[T.middle]']

Here is a model specified up to all 2-way interactions for all the variables.

[17]:
formula = 'y ~ (height + weight + handedness + school)**2'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[17]:
['Intercept',
 'handedness[T.right]',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[T.high]',
 'handedness[T.right]:school[T.middle]',
 'height',
 'height:handedness[T.right]',
 'height:school[T.high]',
 'height:school[T.middle]',
 'weight',
 'weight:handedness[T.right]',
 'weight:school[T.high]',
 'weight:school[T.middle]',
 'height:weight']

And if we wanted up to all 3-way interactions?

[18]:
formula = 'y ~ (height + weight + handedness + school)**3'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[18]:
['Intercept',
 'handedness[T.right]',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[T.high]',
 'handedness[T.right]:school[T.middle]',
 'height',
 'height:handedness[T.right]',
 'height:school[T.high]',
 'height:school[T.middle]',
 'height:handedness[T.right]:school[T.high]',
 'height:handedness[T.right]:school[T.middle]',
 'weight',
 'weight:handedness[T.right]',
 'weight:school[T.high]',
 'weight:school[T.middle]',
 'weight:handedness[T.right]:school[T.high]',
 'weight:handedness[T.right]:school[T.middle]',
 'height:weight',
 'height:weight:handedness[T.right]',
 'height:weight:school[T.high]',
 'height:weight:school[T.middle]']

And if we wanted up to all 4-way interactions?

[19]:
formula = 'y ~ (height + weight + handedness + school)**4'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)
[19]:
['Intercept',
 'handedness[T.right]',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[T.high]',
 'handedness[T.right]:school[T.middle]',
 'height',
 'height:handedness[T.right]',
 'height:school[T.high]',
 'height:school[T.middle]',
 'height:handedness[T.right]:school[T.high]',
 'height:handedness[T.right]:school[T.middle]',
 'weight',
 'weight:handedness[T.right]',
 'weight:school[T.high]',
 'weight:school[T.middle]',
 'weight:handedness[T.right]:school[T.high]',
 'weight:handedness[T.right]:school[T.middle]',
 'height:weight',
 'height:weight:handedness[T.right]',
 'height:weight:school[T.high]',
 'height:weight:school[T.middle]',
 'height:weight:handedness[T.right]:school[T.high]',
 'height:weight:handedness[T.right]:school[T.middle]']

4.4. Null values

When creating your design matrices from formulas, patsy, for better or worse, will drop records that have null values. Below, we have 5 records and there is one record with a null value for handedness.

[20]:
df = pd.DataFrame({
    'height': [10, 20, 30, 40, 50],
    'weight': [88, 99, 125, 155, 120],
    'handedness': ['left', 'right', 'left', None, 'left'],
    'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
    'y': [50, 60, 70, 80, 90]
})
df
[20]:
height weight handedness school y
0 10 88 left elementary 50
1 20 99 right middle 60
2 30 125 left high 70
3 40 155 None elementary 80
4 50 120 left middle 90

If we defined a model, then observe how the record with the null value is removed!

[21]:
formula = 'y ~ handedness*school'
y, X = dmatrices(formula, df, return_type='dataframe')
X
[21]:
Intercept handedness[T.right] school[T.high] school[T.middle] handedness[T.right]:school[T.high] handedness[T.right]:school[T.middle]
0 1.0 0.0 0.0 0.0 0.0 0.0
1 1.0 1.0 0.0 1.0 0.0 1.0
2 1.0 0.0 1.0 0.0 0.0 0.0
4 1.0 0.0 0.0 1.0 0.0 0.0

To prevent records with null values from being remove, we have to specify what to do with null values using the NA_action argument. By default, NA_action is set to drop to remove records with null values. The alternative is to set this argument to raise, which will raise an exception. Neither of these string values help to retain records with null values. We have to use the object NAAction and specify the null types to empty (e.g. NA_types=[]). The example is below. Observe how both levels of handedness is part of the data? You will have to drop one of these now. Yikes! Always make sure that you check your resulting design matrices!

It’s a bit awkward how records with null values are dropped by default, but, if you decide to keep them, then columns are not dropped as expected. What’s being dropped and not being dropped? Again, check your resulting design matrices to make sure that your row and column counts/outputs are as expected when dealing with null values.

[22]:
from patsy import NAAction

formula = 'y ~ handedness*school'
y, X = dmatrices(formula, df, return_type='dataframe', NA_action=NAAction(NA_types=[]))
X
[22]:
Intercept handedness[T.left] handedness[T.right] school[T.high] school[T.middle] handedness[T.left]:school[T.high] handedness[T.right]:school[T.high] handedness[T.left]:school[T.middle] handedness[T.right]:school[T.middle]
0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0
2 1.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
3 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0

If you have null values for continuous variables, that is fine. In the example below, one of the weight value is None.

[23]:
df = pd.DataFrame({
    'height': [10, 20, 30, 40, 50],
    'weight': [88, 99, 125, None, 120],
    'handedness': ['left', 'right', 'left', None, 'left'],
    'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
    'y': [50, 60, 70, 80, 90]
})

formula = 'y ~ height + weight + handedness + school'
y, X = dmatrices(formula, df, return_type='dataframe', NA_action=NAAction(NA_types=[]))
X
[23]:
Intercept handedness[T.left] handedness[T.right] school[T.high] school[T.middle] height weight
0 1.0 1.0 0.0 0.0 0.0 10.0 88.0
1 1.0 0.0 1.0 0.0 1.0 20.0 99.0
2 1.0 1.0 0.0 1.0 0.0 30.0 125.0
3 1.0 0.0 0.0 0.0 0.0 40.0 NaN
4 1.0 1.0 0.0 0.0 1.0 50.0 120.0

And just for completeness, if we do not have null values for the categorical variables, then patsy drops one of them as expected.

[24]:
df = pd.DataFrame({
    'height': [10, 20, 30, 40, 50],
    'weight': [88, 99, 125, None, 120],
    'handedness': ['left', 'right', 'left', 'right', 'left'],
    'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
    'y': [50, 60, 70, 80, 90]
})

formula = 'y ~ height + weight + handedness + school'
y, X = dmatrices(formula, df, return_type='dataframe', NA_action=NAAction(NA_types=[]))
X
[24]:
Intercept handedness[T.right] school[T.high] school[T.middle] height weight
0 1.0 0.0 0.0 0.0 10.0 88.0
1 1.0 1.0 0.0 1.0 20.0 99.0
2 1.0 0.0 1.0 0.0 30.0 125.0
3 1.0 1.0 0.0 0.0 40.0 NaN
4 1.0 0.0 0.0 1.0 50.0 120.0