4. Patsy

Patsy is a neat API to transform your data into experimentation model form. For regression and classification problems, you often want your data in the Xy form where X is a matrix (independent variable) and y is a column vector (dependent variable). In regression, let’s say you have \(x_1, x_2\) as your independent variable and \(y\) as your dependent variable. You might want to express the possible models as follows.

\(y = b_0 + x_1 + x_2\)
\(y = b_0 + x_1 + x_2 + x_1 x_2\)
\(y = b_0 + x_1 + x_2 + x_1^2 + x_2^2\)
\(y = b_0 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2\)

If your data is stored in a dataframe, it’s not terribly hard (but not so easy either) to transform your dataframe into one with the polynomial and interaction effects. Let’s look at a made up dataframe below, and let’s say that we want to model our regression problem with \(y = b_0 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2\). How would we transform the dataframe?

[1]:

import pandas as pd

df = pd.DataFrame({
    'height': [10, 20, 30, 40, 50],
    'weight': [88, 99, 125, 155, 120],
    'y': [50, 60, 70, 80, 90]
})
df

[1]:

	height	weight	y
0	10	88	50
1	20	99	60
2	30	125	70
3	40	155	80
4	50	120	90

Here’s one possible way to transform the dataframe into the model we desire. There are 6 lines of code!

[2]:

X = df[['height', 'weight']]
X['intercept'] = 1
X['height**2'] = df.height**2
X['weight**2'] = df.weight**2
X['height*weight'] = df.height * df.weight
X = X[['intercept', 'height', 'weight', 'height**2', 'weight**2', 'height*weight']]

X

[2]:

	intercept	height	weight	height**2	weight**2	height*weight
0	1	10	88	100	7744	880
1	1	20	99	400	9801	1980
2	1	30	125	900	15625	3750
3	1	40	155	1600	24025	6200
4	1	50	120	2500	14400	6000

4.1. Formulas

There’s an easier way using patsy. Patsy uses R-style formulas to define the model and takes care of transforming the data for you. There are only 2 lines of code in using patsy! Really, just 1 line of code if the formula was not so long, then we would have passed the string literal to the dmatrice() function.

[3]:

from patsy import dmatrices

formula = 'y ~ height + weight + I(height**2) + I(weight**2) + height:weight'
y, X = dmatrices(formula, df, return_type='dataframe')

X

[3]:

	Intercept	height	weight	I(height ** 2)	I(weight ** 2)	height:weight
0	1.0	10.0	88.0	100.0	7744.0	880.0
1	1.0	20.0	99.0	400.0	9801.0	1980.0
2	1.0	30.0	125.0	900.0	15625.0	3750.0
3	1.0	40.0	155.0	1600.0	24025.0	6200.0
4	1.0	50.0	120.0	2500.0	14400.0	6000.0

Notice how we had to use I(height**2) and I(weight**2)? The ** operator already has a special meaning in patsy (which is to produce interaction effects), and not the Python meaning (which is raising a number to the power of another). The I() is called the identity operator and preserves the expected Python expression. Also, we got the Intercept column by default. If we did not want the Intercept column, then we would subtract 1.

[4]:

formula = 'y ~ height + weight + I(height**2) + I(weight**2) + height:weight - 1'
y, X = dmatrices(formula, df, return_type='dataframe')

X

[4]:

	height	weight	I(height ** 2)	I(weight ** 2)	height:weight
0	10.0	88.0	100.0	7744.0	880.0
1	20.0	99.0	400.0	9801.0	1980.0
2	30.0	125.0	900.0	15625.0	3750.0
3	40.0	155.0	1600.0	24025.0	6200.0
4	50.0	120.0	2500.0	14400.0	6000.0

The return_type argument is set to dataframe so that a tuple of pandas dataframes are returned; otherwise, a tuple of DesignMatrix is returned instead. A DesignMatrix is just a wrapper around numpy arrays with metadata.

[5]:

formula = 'y ~ height + weight + I(height**2) + I(weight**2) + height:weight - 1'
y, X = dmatrices(formula, df)

[6]:

[6]:

DesignMatrix with shape (5, 5)
  height  weight  I(height ** 2)  I(weight ** 2)  height:weight
      10      88             100            7744            880
      20      99             400            9801           1980
      30     125             900           15625           3750
      40     155            1600           24025           6200
      50     120            2500           14400           6000
  Terms:
    'height' (column 0)
    'weight' (column 1)
    'I(height ** 2)' (column 2)
    'I(weight ** 2)' (column 3)
    'height:weight' (column 4)

[7]:

[7]:

DesignMatrix with shape (5, 1)
   y
  50
  60
  70
  80
  90
  Terms:
    'y' (column 0)

If we did not want y, we simply omit the y ~ part and use the dmatrix() function instead.

[8]:

from patsy import dmatrix

formula = 'height + weight + I(height**2) + I(weight**2) + height:weight - 1'
X = dmatrix(formula, df, return_type='dataframe')

X

[8]:

	height	weight	I(height ** 2)	I(weight ** 2)	height:weight
0	10.0	88.0	100.0	7744.0	880.0
1	20.0	99.0	400.0	9801.0	1980.0
2	30.0	125.0	900.0	15625.0	3750.0
3	40.0	155.0	1600.0	24025.0	6200.0
4	50.0	120.0	2500.0	14400.0	6000.0

We already saw the builtin I() function, but we can also use other functions to transform our data. Below, we do a log transform on height and weight.

[9]:

import numpy as np

formula = 'y ~ np.log(height) + np.log(weight) + I(height**2) + I(weight**2) + height:weight'
y, X = dmatrices(formula, df, return_type='dataframe')

X

[9]:

	Intercept	np.log(height)	np.log(weight)	I(height ** 2)	I(weight ** 2)	height:weight
0	1.0	2.302585	4.477337	100.0	7744.0	880.0
1	1.0	2.995732	4.595120	400.0	9801.0	1980.0
2	1.0	3.401197	4.828314	900.0	15625.0	3750.0
3	1.0	3.688879	5.043425	1600.0	24025.0	6200.0
4	1.0	3.912023	4.787492	2500.0	14400.0	6000.0

We can also center or standardize values using using the builtin functions center() and standardize().

[10]:

formula = 'y ~ center(height) + standardize(weight) + I(height**2) + I(weight**2) + center(height):standardize(weight)'
y, X = dmatrices(formula, df, return_type='dataframe')

X

[10]:

	Intercept	center(height)	standardize(weight)	I(height ** 2)	I(weight ** 2)	center(height):standardize(weight)
0	1.0	-20.0	-1.269602	100.0	7744.0	25.392048
1	1.0	-10.0	-0.794581	400.0	9801.0	7.945811
2	1.0	0.0	0.328197	900.0	15625.0	0.000000
3	1.0	10.0	1.623709	1600.0	24025.0	16.237092
4	1.0	20.0	0.112278	2500.0	14400.0	2.245555

4.2. Categorical data

How does patsy handle categorical data? Below, we create a dummy data set with two categorical fields; handedness has the values left and right; school has the values elementary, middle, high (high for highschool).

[11]:

df = pd.DataFrame({
    'height': [10, 20, 30, 40, 50],
    'weight': [88, 99, 125, 155, 120],
    'handedness': ['left', 'right', 'left', 'right', 'left'],
    'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
    'y': [50, 60, 70, 80, 90]
})
df

[11]:

	height	weight	handedness	school	y
0	10	88	left	elementary	50
1	20	99	right	middle	60
2	30	125	left	high	70
3	40	155	right	elementary	80
4	50	120	left	middle	90

The formulas work the same way as before, and patsy magically creates one-hot encoded variables for the categorical variables. Notice how one of the categorical variables is automatically dropped? For example, we might expect to see handedness[T.left] and handedness[T.right] as bew columns, but we only see handedness[T.right]. Likewise, school[T.elementary] is dropped. The resulting column names of the pattern name[T.value] is to signal that these fields are the result of one-hot encoding against categorical variables.

[12]:

formula = 'y ~ height + weight + handedness + school'
y, X = dmatrices(formula, df, return_type='dataframe')

X

[12]:

	Intercept	handedness[T.right]	school[T.high]	school[T.middle]	height	weight
0	1.0	0.0	0.0	0.0	10.0	88.0
1	1.0	1.0	0.0	1.0	20.0	99.0
2	1.0	0.0	1.0	0.0	30.0	125.0
3	1.0	1.0	0.0	0.0	40.0	155.0
4	1.0	0.0	0.0	1.0	50.0	120.0

If we use the C() builtin function, we can specify the ordering or levels of the categorical variable. Notice how the first level is the one dropped.

[13]:

handedness_levels = ['right', 'left']
school_levels = ['elementary', 'middle', 'high']

formula = 'y ~ height + weight + C(handedness, levels=handedness_levels) + C(school, levels=school_levels)'
y, X = dmatrices(formula, df, return_type='dataframe')

X

[13]:

	Intercept	C(handedness, levels=handedness_levels)[T.left]	C(school, levels=school_levels)[T.middle]	C(school, levels=school_levels)[T.high]	height	weight
0	1.0	1.0	0.0	0.0	10.0	88.0
1	1.0	0.0	1.0	0.0	20.0	99.0
2	1.0	1.0	0.0	1.0	30.0	125.0
3	1.0	0.0	0.0	0.0	40.0	155.0
4	1.0	1.0	1.0	0.0	50.0	120.0

4.3. Interaction

To specify interaction effects, we have a couple of options. We can use the following operators or symbols.

:
*
**

Below, we specify interaction effect using :. You will notice that handedness is not part of the model.

[14]:

formula = 'y ~ handedness:school'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)

[14]:

['Intercept',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[elementary]',
 'handedness[T.right]:school[high]',
 'handedness[T.right]:school[middle]']

Below, we use *, and you will notice all two-way and single effects are included. The result is typically what we want. Now, handedness is part of the model.

[15]:

formula = 'y ~ handedness*school'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)

[15]:

['Intercept',
 'handedness[T.right]',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[T.high]',
 'handedness[T.right]:school[T.middle]']

The ** is probably the best way to specify 1 and 2-way interactions between two variables. Additionally, patsy just knows what to do with dropping columns to prevent multicollinearity.

[16]:

formula = 'y ~ (handedness + school)**2'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)

[16]:

['Intercept',
 'handedness[T.right]',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[T.high]',
 'handedness[T.right]:school[T.middle]']

Here is a model specified up to all 2-way interactions for all the variables.

[17]:

formula = 'y ~ (height + weight + handedness + school)**2'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)

[17]:

['Intercept',
 'handedness[T.right]',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[T.high]',
 'handedness[T.right]:school[T.middle]',
 'height',
 'height:handedness[T.right]',
 'height:school[T.high]',
 'height:school[T.middle]',
 'weight',
 'weight:handedness[T.right]',
 'weight:school[T.high]',
 'weight:school[T.middle]',
 'height:weight']

And if we wanted up to all 3-way interactions?

[18]:

formula = 'y ~ (height + weight + handedness + school)**3'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)

[18]:

['Intercept',
 'handedness[T.right]',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[T.high]',
 'handedness[T.right]:school[T.middle]',
 'height',
 'height:handedness[T.right]',
 'height:school[T.high]',
 'height:school[T.middle]',
 'height:handedness[T.right]:school[T.high]',
 'height:handedness[T.right]:school[T.middle]',
 'weight',
 'weight:handedness[T.right]',
 'weight:school[T.high]',
 'weight:school[T.middle]',
 'weight:handedness[T.right]:school[T.high]',
 'weight:handedness[T.right]:school[T.middle]',
 'height:weight',
 'height:weight:handedness[T.right]',
 'height:weight:school[T.high]',
 'height:weight:school[T.middle]']

And if we wanted up to all 4-way interactions?

[19]:

formula = 'y ~ (height + weight + handedness + school)**4'
y, X = dmatrices(formula, df, return_type='dataframe')
list(X.columns)

[19]:

['Intercept',
 'handedness[T.right]',
 'school[T.high]',
 'school[T.middle]',
 'handedness[T.right]:school[T.high]',
 'handedness[T.right]:school[T.middle]',
 'height',
 'height:handedness[T.right]',
 'height:school[T.high]',
 'height:school[T.middle]',
 'height:handedness[T.right]:school[T.high]',
 'height:handedness[T.right]:school[T.middle]',
 'weight',
 'weight:handedness[T.right]',
 'weight:school[T.high]',
 'weight:school[T.middle]',
 'weight:handedness[T.right]:school[T.high]',
 'weight:handedness[T.right]:school[T.middle]',
 'height:weight',
 'height:weight:handedness[T.right]',
 'height:weight:school[T.high]',
 'height:weight:school[T.middle]',
 'height:weight:handedness[T.right]:school[T.high]',
 'height:weight:handedness[T.right]:school[T.middle]']

4.4. Null values

When creating your design matrices from formulas, patsy, for better or worse, will drop records that have null values. Below, we have 5 records and there is one record with a null value for handedness.

[20]:

df = pd.DataFrame({
    'height': [10, 20, 30, 40, 50],
    'weight': [88, 99, 125, 155, 120],
    'handedness': ['left', 'right', 'left', None, 'left'],
    'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
    'y': [50, 60, 70, 80, 90]
})
df

[20]:

	height	weight	handedness	school	y
0	10	88	left	elementary	50
1	20	99	right	middle	60
2	30	125	left	high	70
3	40	155	None	elementary	80
4	50	120	left	middle	90

If we defined a model, then observe how the record with the null value is removed!

[21]:

formula = 'y ~ handedness*school'
y, X = dmatrices(formula, df, return_type='dataframe')
X

[21]:

	Intercept	handedness[T.right]	school[T.high]	school[T.middle]	handedness[T.right]:school[T.middle]
0	1.0	0.0	0.0	0.0	0.0
1	1.0	1.0	0.0	1.0	1.0
2	1.0	0.0	1.0	0.0	0.0
4	1.0	0.0	0.0	1.0	0.0

To prevent records with null values from being remove, we have to specify what to do with null values using the NA_action argument. By default, NA_action is set to drop to remove records with null values. The alternative is to set this argument to raise, which will raise an exception. Neither of these string values help to retain records with null values. We have to use the object NAAction and specify the null types to empty (e.g. NA_types=[]). The example is below. Observe how both levels of handedness is part of the data? You will have to drop one of these now. Yikes! Always make sure that you check your resulting design matrices!

It’s a bit awkward how records with null values are dropped by default, but, if you decide to keep them, then columns are not dropped as expected. What’s being dropped and not being dropped? Again, check your resulting design matrices to make sure that your row and column counts/outputs are as expected when dealing with null values.

[22]:

from patsy import NAAction

formula = 'y ~ handedness*school'
y, X = dmatrices(formula, df, return_type='dataframe', NA_action=NAAction(NA_types=[]))
X

[22]:

	Intercept	handedness[T.left]	handedness[T.right]	school[T.high]	school[T.middle]	handedness[T.left]:school[T.high]	handedness[T.left]:school[T.middle]	handedness[T.right]:school[T.middle]
0	1.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0
1	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0
2	1.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0
3	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	1.0	1.0	0.0	0.0	1.0	0.0	1.0	0.0

If you have null values for continuous variables, that is fine. In the example below, one of the weight value is None.

[23]:

df = pd.DataFrame({
    'height': [10, 20, 30, 40, 50],
    'weight': [88, 99, 125, None, 120],
    'handedness': ['left', 'right', 'left', None, 'left'],
    'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
    'y': [50, 60, 70, 80, 90]
})

formula = 'y ~ height + weight + handedness + school'
y, X = dmatrices(formula, df, return_type='dataframe', NA_action=NAAction(NA_types=[]))
X

[23]:

	Intercept	handedness[T.left]	handedness[T.right]	school[T.high]	school[T.middle]	height	weight
0	1.0	1.0	0.0	0.0	0.0	10.0	88.0
1	1.0	0.0	1.0	0.0	1.0	20.0	99.0
2	1.0	1.0	0.0	1.0	0.0	30.0	125.0
3	1.0	0.0	0.0	0.0	0.0	40.0	NaN
4	1.0	1.0	0.0	0.0	1.0	50.0	120.0

And just for completeness, if we do not have null values for the categorical variables, then patsy drops one of them as expected.

[24]:

df = pd.DataFrame({
    'height': [10, 20, 30, 40, 50],
    'weight': [88, 99, 125, None, 120],
    'handedness': ['left', 'right', 'left', 'right', 'left'],
    'school': ['elementary', 'middle', 'high', 'elementary', 'middle'],
    'y': [50, 60, 70, 80, 90]
})

formula = 'y ~ height + weight + handedness + school'
y, X = dmatrices(formula, df, return_type='dataframe', NA_action=NAAction(NA_types=[]))
X

[24]:

	Intercept	handedness[T.right]	school[T.high]	school[T.middle]	height	weight
0	1.0	0.0	0.0	0.0	10.0	88.0
1	1.0	1.0	0.0	1.0	20.0	99.0
2	1.0	0.0	1.0	0.0	30.0	125.0
3	1.0	1.0	0.0	0.0	40.0	NaN
4	1.0	0.0	0.0	1.0	50.0	120.0