5. statsmodels

statsmodels is another statistical library you can use to get richer diagnostic information about your regression models. Let’s take a look at how to do some simple things with this API. Below, we create functions to generate data for regression and classification.

[1]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression, make_classification
import numpy as np
import pandas as pd

np.random.seed(37)
plt.style.use('ggplot')

def get_regression_data(n_features=5, n_samples=1000):
    X, y = make_regression(**{
        'n_samples': n_samples,
        'n_features': n_features,
        'n_informative': n_features,
        'n_targets': 1,
        'bias': 5.3,
        'random_state': 37
    })

    data = np.hstack([X, y.reshape(-1, 1)])
    cols = [f'x{i}' for i in range(n_features)] + ['y']
    return pd.DataFrame(data, columns=cols)

def get_classification_data(n_features=5, n_samples=1000):
    X, y = make_classification(**{
        'n_samples': n_samples,
        'n_features': n_features,
        'n_informative': n_features,
        'n_redundant': 0,
        'n_repeated': 0,
        'n_classes': 2,
        'n_clusters_per_class': 1,
        'random_state': 37
    })

    data = np.hstack([X, y.reshape(-1, 1)])
    cols = [f'x{i}' for i in range(n_features)] + ['y']
    return pd.DataFrame(data, columns=cols)

5.1. Ordinary least squares

An ordinary least squares (OLS) model is created using the OLS() function. Below, the patsy API is used to split the dataframe into design matrices using R-style formula syntax.

[2]:
import statsmodels.api as sm
from patsy import dmatrices

df = get_regression_data()
y, X = dmatrices('y ~ x0 + x1 + x2 + x3 + x4', data=df, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()

You can also bypass patsy and use the formula API to define the model.

[3]:
import statsmodels.formula.api as smf

mod = smf.ols(formula='y ~ x0 + x1 + x2 + x3 + x4', data=df)
res = mod.fit()

The summary of the fitted model is available through summary().

[4]:
res.summary()
[4]:
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 7.399e+32
Date:                Tue, 01 Dec 2020   Prob (F-statistic):               0.00
Time:                        17:09:02   Log-Likelihood:                 29657.
No. Observations:                1000   AIC:                        -5.930e+04
Df Residuals:                     994   BIC:                        -5.927e+04
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3000   1.01e-15   5.23e+15      0.000       5.300       5.300
x0            41.8134   1.02e-15    4.1e+16      0.000      41.813      41.813
x1            25.0388   1.08e-15   2.33e+16      0.000      25.039      25.039
x2             3.2108   1.02e-15   3.14e+15      0.000       3.211       3.211
x3            29.9942   1.01e-15   2.98e+16      0.000      29.994      29.994
x4            25.3606      1e-15   2.53e+16      0.000      25.361      25.361
==============================================================================
Omnibus:                        7.219   Durbin-Watson:                   1.895
Prob(Omnibus):                  0.027   Jarque-Bera (JB):                8.934
Skew:                           0.086   Prob(JB):                       0.0115
Kurtosis:                       3.430   Cond. No.                         1.16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The params property of the results retrieves the coefficients of the model.

[5]:
res.params
[5]:
Intercept     5.300000
x0           41.813417
x1           25.038808
x2            3.210812
x3           29.994205
x4           25.360629
dtype: float64

The rsquared property will retrieve the \(R^2\) value.

[6]:
res.rsquared
[6]:
1.0
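
The fitted results can also score new data. Below is a minimal sketch, assuming res is the formula-API fit from above; predict() re-applies the formula transform to a dataframe.

# minimal sketch: predict with the formula-API fit on the first few rows
preds = res.predict(df.head())       # predicted y for the first five rows
print(preds.values)
print(df['y'].head().values)         # actual y, for comparison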

There are a lot of model diagnostic functions available. Below, we apply the rainbow test for linearity.

[7]:
f_stats, p_value = sm.stats.linear_rainbow(res)
print(f'f-statistic: {f_stats:.5f}, p-value: {p_value:.5f}')
f-statistic: 0.24804, p-value: 1.00000
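
Other diagnostics follow the same pattern. For example, the Breusch-Pagan test (sketched below, not an executed cell from this notebook) checks the OLS residuals for heteroskedasticity.

# sketch: Breusch-Pagan test for heteroskedasticity of the residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = sm.stats.het_breuschpagan(res.resid, res.model.exog)
print(f'LM statistic: {lm_stat:.5f}, p-value: {lm_pvalue:.5f}')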

5.2. Logistic regression

Logistic regression is a type of generalized linear model. A logistic regression model is created using the GLM() function with a binomial family. As with OLS(), the endogenous variable y comes before the exogenous matrix X.

[8]:
df = get_classification_data()
y, X = dmatrices('y ~ x0 + x1 + x2 + x3 + x4', data=df, return_type='dataframe')
binomial_model = sm.GLM(y, X, family=sm.families.Binomial())
res = binomial_model.fit()
[9]:
res.summary()
[9]:
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      994
Model Family:                Binomial   Df Model:                            5
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -76.314
Date:                Tue, 01 Dec 2020   Deviance:                       152.63
Time:                        17:09:02   Pearson chi2:                 9.15e+05
No. Iterations:                     9
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.5205      0.636     -2.389      0.017      -2.768      -0.273
x0             2.2090      0.259      8.522      0.000       1.701       2.717
x1            -3.1632      0.348     -9.083      0.000      -3.846      -2.481
x2             2.0077      0.248      8.095      0.000       1.522       2.494
x3            -1.0259      0.301     -3.408      0.001      -1.616      -0.436
x4             2.6748      0.382      7.010      0.000       1.927       3.423
==============================================================================
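
For a binomial GLM, predict() returns fitted probabilities rather than class labels. The sketch below (assuming the fit above) thresholds the probabilities at 0.5 and computes training accuracy against y.

# sketch: fitted probabilities from the binomial GLM, thresholded at 0.5
probs = res.predict(X)                              # P(y=1) for each row
labels = (probs > 0.5).astype(int)                  # hard class labels
accuracy = (labels.values == y['y'].values).mean()  # fraction of correct labels
print(f'training accuracy: {accuracy:.5f}')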

5.3. ANOVA

Analysis of Variance (ANOVA) may also be conducted. The small dataset below has four treatment groups, A through D, with five observations each.

[10]:
df = pd.DataFrame({
    'A': [25, 30, 28, 36, 29],
    'B': [45, 55, 29, 56, 40],
    'C': [30, 29, 33, 37, 27],
    'D': [54, 60, 51, 62, 73]
})
df
[10]:
    A   B   C   D
0  25  45  30  54
1  30  55  29  60
2  28  29  33  51
3  36  56  37  62
4  29  40  27  73

Let’s do some box plots.

[11]:
fig, ax = plt.subplots(figsize=(10, 5))

_ = df.plot(kind='box', ax=ax)
_ = ax.set_title('Box plots')
[Figure: box plots of the four treatment columns A, B, C, and D]

scipy can do the one-way ANOVA.

[12]:
import scipy.stats as stats

f, p = stats.f_oneway(*[df[c] for c in df.columns])
print(f'f-statistics: {f:.5f}, p-value: {p:.5f}')
f-statistics: 17.49281, p-value: 0.00003

Before statsmodels can do the ANOVA, we have to melt the dataframe into long format.

[13]:
ldf = pd.melt(df.reset_index(), id_vars=['index'], value_vars=['A', 'B', 'C', 'D'])\
    .rename(columns={'variable': 'treatment'})
ldf.head()
[13]:
   index treatment  value
0      0         A     25
1      1         A     30
2      2         A     28
3      3         A     36
4      4         A     29

Now we specify the ANOVA model and acquire the table.

[14]:
anova_model = smf.ols('value ~ C(treatment)', data=ldf).fit()
anova_table = sm.stats.anova_lm(anova_model, typ=2)
anova_table
[14]:
               sum_sq    df         F    PR(>F)
C(treatment)  3010.95   3.0  17.49281  0.000026
Residual       918.00  16.0       NaN       NaN

Post-hoc testing is also possible.

[15]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

res = pairwise_tukeyhsd(ldf.value, ldf.treatment)
res.summary()
[15]:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
========================================================
group1 group2 meandiff   p-adj     lower    upper  reject
--------------------------------------------------------
     A      B     15.4  0.0251    1.6929  29.1071    True
     A      C      1.6  0.9000  -12.1071  15.3071   False
     A      D     30.4  0.0010   16.6929  44.1071    True
     B      C    -13.8  0.0482  -27.5071  -0.0929    True
     B      D     15.0  0.0296    1.2929  28.7071    True
     C      D     28.8  0.0010   15.0929  42.5071    True
--------------------------------------------------------
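
The Tukey HSD result object can also be visualized; plot_simultaneous() draws each group mean with its simultaneous confidence interval. Below is a sketch using the result object above.

# sketch: plot simultaneous confidence intervals for the group means
fig = res.plot_simultaneous(figsize=(10, 5))
_ = fig.gca().set_title('Tukey HSD simultaneous confidence intervals')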