# 5. statsmodels

statsmodels is another statistical library you can use to get more information about your regression models. Let’s take a look at how to do some simple things with its API. Below, we create functions to generate data for regression and classification.

[1]:

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression, make_classification
import numpy as np
import pandas as pd

np.random.seed(37)
plt.style.use('ggplot')

def get_regression_data(n_features=5, n_samples=1000):
    X, y = make_regression(**{
        'n_samples': n_samples,
        'n_features': n_features,
        'n_informative': n_features,
        'n_targets': 1,
        'bias': 5.3,
        'random_state': 37
    })

    data = np.hstack([X, y.reshape(-1, 1)])
    cols = [f'x{i}' for i in range(n_features)] + ['y']
    return pd.DataFrame(data, columns=cols)

def get_classification_data(n_features=5, n_samples=1000):
    X, y = make_classification(**{
        'n_samples': n_samples,
        'n_features': n_features,
        'n_informative': n_features,
        'n_redundant': 0,
        'n_repeated': 0,
        'n_classes': 2,
        'n_clusters_per_class': 1,
        'random_state': 37
    })

    data = np.hstack([X, y.reshape(-1, 1)])
    cols = [f'x{i}' for i in range(n_features)] + ['y']
    return pd.DataFrame(data, columns=cols)
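
To sanity-check these helpers, you can peek at the first few rows of the generated data (this call is purely illustrative and not part of the examples below).

get_regression_data().head()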


## 5.1. Ordinary least squares

An ordinary least squares (OLS) model is created using the OLS() function. Below, the patsy API is used to split the dataframe into a response vector and a design matrix using R-style formula syntax.

[2]:

import statsmodels.api as sm
from patsy import dmatrices

df = get_regression_data()
y, X = dmatrices('y ~ x0 + x1 + x2 + x3 + x4', data=df, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
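
Note that dmatrices returns the response and the design matrix as dataframes, and patsy adds an Intercept column automatically. If you were to build X by hand instead, you would need add_constant(); a minimal sketch (the X_manual and res_manual names are just for illustration):

X_manual = sm.add_constant(df[['x0', 'x1', 'x2', 'x3', 'x4']])  # prepend a constant column for the intercept
res_manual = sm.OLS(df['y'], X_manual).fit()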


You can also bypass patsy and use the formula API to define the model.

[3]:

import statsmodels.formula.api as smf

mod = smf.ols(formula='y ~ x0 + x1 + x2 + x3 + x4', data=df)
res = mod.fit()


The summary of the fitted model is available through summary().

[4]:

res.summary()

[4]:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 7.399e+32
Date:                Tue, 01 Dec 2020   Prob (F-statistic):               0.00
Time:                        17:09:02   Log-Likelihood:                 29657.
No. Observations:                1000   AIC:                        -5.930e+04
Df Residuals:                     994   BIC:                        -5.927e+04
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3000   1.01e-15   5.23e+15      0.000       5.300       5.300
x0            41.8134   1.02e-15    4.1e+16      0.000      41.813      41.813
x1            25.0388   1.08e-15   2.33e+16      0.000      25.039      25.039
x2             3.2108   1.02e-15   3.14e+15      0.000       3.211       3.211
x3            29.9942   1.01e-15   2.98e+16      0.000      29.994      29.994
x4            25.3606      1e-15   2.53e+16      0.000      25.361      25.361
==============================================================================
Omnibus:                        7.219   Durbin-Watson:                   1.895
Prob(Omnibus):                  0.027   Jarque-Bera (JB):                8.934
Skew:                           0.086   Prob(JB):                       0.0115
Kurtosis:                        3.43    Cond. No.                        1.16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The params property of the results retrieves the coefficients of the model.

[5]:

res.params

[5]:

Intercept     5.300000
x0           41.813417
x1           25.038808
x2            3.210812
x3           29.994205
x4           25.360629
dtype: float64
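
Other results attributes expose more of the fit; for example (a brief sketch using the res fitted above):

res.bse         # standard errors of the coefficients
res.pvalues     # two-sided p-values for the t-statistics
res.conf_int()  # 95% confidence intervals by default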


The rsquared property will retrieve the $R^2$ value.

[6]:

res.rsquared

[6]:

1.0
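
The adjusted $R^2$, which penalizes the number of predictors, is available through rsquared_adj.

res.rsquared_adj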


There are a lot of model diagnostic functions available. Below, we apply the rainbow test for linearity.

[7]:

f_stats, p_value = sm.stats.linear_rainbow(res)
print(f'f-statistic: {f_stats:.5f}, p-value: {p_value:.5f}')

f-statistic: 0.24804, p-value: 1.00000
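
As another example, the Breusch-Pagan test checks the residuals for heteroscedasticity. A minimal sketch; it returns the Lagrange multiplier statistic and its p-value, followed by the corresponding F-statistic and p-value:

lm_stat, lm_pvalue, f_stat, f_pvalue = sm.stats.het_breuschpagan(res.resid, res.model.exog)
print(f'LM statistic: {lm_stat:.5f}, p-value: {lm_pvalue:.5f}')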


## 5.2. Logistic regression

Logistic regression is a type of generalized linear model. A logistic regression model is created using the GLM() function with a binomial family. Note that, just as with OLS(), y comes before X, which is the opposite of scikit-learn’s fit(X, y) convention.

[8]:

df = get_classification_data()
y, X = dmatrices('y ~ x0 + x1 + x2 + x3 + x4', data=df, return_type='dataframe')
binomial_model = sm.GLM(y, X, family=sm.families.Binomial())
res = binomial_model.fit()
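
As with OLS, there is a formula-API shortcut: smf.logit() fits the same logistic regression without constructing the matrices by hand (a sketch; logit_mod and logit_res are illustrative names):

logit_mod = smf.logit('y ~ x0 + x1 + x2 + x3 + x4', data=df)
logit_res = logit_mod.fit()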

[9]:

res.summary()

[9]:

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      994
Model Family:                Binomial   Df Model:                            5
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -76.314
Date:                Tue, 01 Dec 2020   Deviance:                       152.63
Time:                        17:09:02   Pearson chi2:                 9.15e+05
No. Iterations:                     9   Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.5205      0.636     -2.389      0.017      -2.768      -0.273
x0             2.2090      0.259      8.522      0.000       1.701       2.717
x1            -3.1632      0.348     -9.083      0.000      -3.846      -2.481
x2             2.0077      0.248      8.095      0.000       1.522       2.494
x3            -1.0259      0.301     -3.408      0.001      -1.616      -0.436
x4             2.6748      0.382      7.010      0.000       1.927       3.423
==============================================================================
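
The fitted model returns class probabilities through predict(); thresholding at 0.5 turns them into hard labels (a minimal sketch):

probs = res.predict(X)              # P(y=1) for each sample
labels = (probs > 0.5).astype(int)  # hard class assignments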

## 5.3. ANOVA

Analysis of Variance (ANOVA) may be conducted. The data below is taken from here.

[10]:

df = pd.DataFrame({
    'A': [25, 30, 28, 36, 29],
    'B': [45, 55, 29, 56, 40],
    'C': [30, 29, 33, 37, 27],
    'D': [54, 60, 51, 62, 73]
})
df

[10]:

A B C D
0 25 45 30 54
1 30 55 29 60
2 28 29 33 51
3 36 56 37 62
4 29 40 27 73

Let’s do some box plots.

[11]:

fig, ax = plt.subplots(figsize=(10, 5))

_ = df.plot(kind='box', ax=ax)
_ = ax.set_title('Box plots')


scipy can do the one-way ANOVA.

[12]:

import scipy.stats as stats

f, p = stats.f_oneway(*[df[c] for c in df.columns])
print(f'f-statistics: {f:.5f}, p-value: {p:.5f}')

f-statistics: 17.49281, p-value: 0.00003


Before statsmodels can do the ANOVA, we have to melt the dataframe into long format.

[13]:

ldf = pd.melt(df.reset_index(), id_vars=['index'], value_vars=['A', 'B', 'C', 'D'])\
    .rename(columns={'variable': 'treatment'})
ldf.head()

[13]:

index treatment value
0 0 A 25
1 1 A 30
2 2 A 28
3 3 A 36
4 4 A 29

Now we specify the ANOVA model and acquire the table.

[14]:

anova_model = smf.ols('value ~ C(treatment)', data=ldf).fit()
anova_table = sm.stats.anova_lm(anova_model, typ=2)
anova_table

[14]:

sum_sq df F PR(>F)
C(treatment) 3010.95 3.0 17.49281 0.000026
Residual 918.00 16.0 NaN NaN
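
An effect size such as eta squared can be computed directly from this table as the treatment sum of squares divided by the total sum of squares (sketch):

eta_sq = anova_table.loc['C(treatment)', 'sum_sq'] / anova_table['sum_sq'].sum()
# 3010.95 / (3010.95 + 918.00) ≈ 0.77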

Post-hoc testing is also possible.

[15]:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

res = pairwise_tukeyhsd(ldf.value, ldf.treatment)
res.summary()

[15]:

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
A B 15.4 0.0251 1.6929 29.1071 True
A C 1.6 0.9 -12.1071 15.3071 False
A D 30.4 0.001 16.6929 44.1071 True
B C -13.8 0.0482 -27.5071 -0.0929 True
B D 15.0 0.0296 1.2929 28.7071 True
C D 28.8 0.001 15.0929 42.5071 True
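
The Tukey results object can also plot the simultaneous confidence intervals, which makes the rejected comparisons easy to spot (sketch):

_ = res.plot_simultaneous(figsize=(10, 5))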