2. Visualizing data

Before starting a data analysis, it is typical to visualize the data. Visualization serves as a quick check on data quality and validity. Additionally, since many techniques make strong assumptions about the characteristics of the data, plotting the data helps reveal the ways in which it deviates from those assumptions. In this notebook, we show a few ways to visualize data using univariate, bivariate and multivariate techniques.

2.1. Loading data

Let’s load the iris data. This data set contains measurements of the sepal length and width and the petal length and width (4 variables) of 3 flower species. It is bundled with many data analysis packages for testing machine learning algorithms. Here, we load the iris data set from Scikit-Learn. Note that we load the data into a matrix \(X\) and vector \(y\), where \(X\) is a matrix of the 4 variables over 150 observations. In machine learning, it is typical to denote \(X\) as a matrix of observations and independent variables with \(y\) as a vector of the target (or dependent) variable. Later, we also convert \(X\) and \(y\) into a Pandas dataframe for easier use with Seaborn.

[1]:

from sklearn import datasets

iris = datasets.load_iris()

[2]:

type(iris)

[2]:

sklearn.utils.Bunch

[3]:

iris.keys()

[3]:

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

[4]:

X, y = iris.data, iris.target

[5]:

X.shape

[5]:

(150, 4)

[6]:

y.shape

[6]:

(150,)

[7]:

iris.feature_names

[7]:

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

[8]:

iris.target_names

[8]:

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

[9]:

import pandas as pd
import numpy as np

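# Note: the columns follow the order of iris.feature_names (sepal length, sepal width,
# petal length, petal width); the short names below are the labels used throughout this notebook.
# Stacking X with the species names yields strings, so the numeric columns are cast back to float.
df = pd.DataFrame(
    np.hstack([X, np.array([iris.target_names[v] for v in y]).reshape(-1, 1)]),
    columns=['sepal_width', 'sepal_height', 'petal_width', 'petal_height', 'species'])
# np.float was removed in NumPy 1.24; the built-in float is equivalent
df = df.astype({k: float for k in df.columns if k != 'species'})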

2.2. Univariate visualization

Univariate visualizations center on displaying one variable at a time.

2.2.1. Univariate distribution

It is common to start by simply observing the distribution of a variable. Typically, you want to see its shape (modes and spread). Many techniques work best when a variable is approximately Gaussian. For continuous variables, the function describing the distribution is called the probability density function (PDF); for categorical (discrete) variables, it is the probability mass function (PMF).
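As a small illustration of what the density scale on such a plot means, the sketch below builds a normalized histogram with NumPy and checks that the bar areas sum to 1. The synthetic Gaussian sample, the seed and the bin count are assumptions made for this example; they are not part of the iris data.

```python
import numpy as np

# Synthetic sample (an assumption for illustration): 1000 draws from N(5, 1).
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=1.0, size=1000)

# density=True rescales the bar heights so that the total bar area is 1,
# which makes the histogram an estimate of the PDF rather than raw counts.
density, edges = np.histogram(sample, bins=20, density=True)
bin_widths = np.diff(edges)
total_area = float(np.sum(density * bin_widths))

print(round(total_area, 6))  # 1.0
```

Because the heights are densities, an individual bar can exceed 1 when the bins are narrow; only the total area is constrained.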

[10]:

%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle

colors = cycle(['r', 'g', 'b', 'm'])
columns=['sepal_width', 'sepal_height', 'petal_width', 'petal_height']
n_vars = len(columns)

n_cols = 4
n_rows = int(np.ceil(n_vars / n_cols))  # whole number of rows; add_subplot expects integers
n_plots = int(n_cols * n_rows)

fig = plt.figure(figsize=(20, 3))

for i in range(n_plots):
    ax = fig.add_subplot(n_rows, n_cols, i + 1)

    if i < n_vars:
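        # histplot (seaborn >= 0.11) replaces the deprecated distplot;
        # stat='density' normalizes the histogram so it is comparable to the KDE curve
        sns.histplot(x=X[:, i], ax=ax, color=next(colors), kde=True, stat='density')
        ax.set_title(columns[i])
        ax.set_xlabel('value')
        ax.set_ylabel('density')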
    else:
        ax.axis('off')

plt.tight_layout()

2.2.2. Box plots

While a density plot gives you clues about the PDF (or PMF), a box plot (or box-and-whisker plot) gives you clues about outliers in your data. The middle line represents the median over all the data points (the global median); the left edge of the box is the first quartile, Q1, which is the median of all data points to the left of the global median; the right edge of the box is the third quartile, Q3, which is the median of all data points to the right of the global median. The distance from Q1 to Q3 is called the inter-quartile range, IQR. The left whisker extends to the smallest data point that is no lower than Q1 - 1.5 x IQR, and the right whisker extends to the largest data point that is no higher than Q3 + 1.5 x IQR. Anything outside of the whiskers is considered an outlier. The left and right whiskers are often referred to as the min and max.
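The quantities in a box plot can be computed by hand, as in the minimal sketch below. The helper `box_plot_stats` and the sample values are hypothetical and exist only for this example. Note also that `np.percentile` interpolates linearly between data points by default, so its quartiles can differ slightly from the "median of each half" definition.

```python
import numpy as np

def box_plot_stats(values, whis=1.5):
    """Hypothetical helper: quartiles, whisker ends and outliers of a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1

    # Fences: the furthest a whisker is allowed to reach.
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr

    # Whiskers stop at the most extreme data points inside the fences.
    is_inside = (values >= lower_fence) & (values <= upper_fence)
    return {
        'q1': q1, 'median': median, 'q3': q3, 'iqr': iqr,
        'whisker_low': values[is_inside].min(),
        'whisker_high': values[is_inside].max(),
        'outliers': values[~is_inside],
    }

# Made-up sample with one obvious outlier (30.0).
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])
stats = box_plot_stats(sample)
print(stats['q1'], stats['median'], stats['q3'], stats['outliers'])
```

For this sample, Q1 = 3, Q3 = 7 and IQR = 4, so the fences sit at -3 and 13 and only the value 30 falls outside them.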

[11]:

colors = cycle(['r', 'g', 'b', 'm'])
n_cols = 4
n_rows = int(np.ceil(n_vars / n_cols))  # whole number of rows; add_subplot expects integers
n_plots = int(n_cols * n_rows)

fig = plt.figure(figsize=(20, 3))

for i in range(n_plots):
    ax = fig.add_subplot(n_rows, n_cols, i + 1)

    if i < n_vars:
        # pass the values as x= explicitly so the box is drawn horizontally
        sns.boxplot(x=X[:, i], ax=ax, color=next(colors))
        ax.set_title(columns[i])
        ax.set_xlabel('value')
    else:
        ax.axis('off')

plt.tight_layout()

2.2.3. Pie and bar graph

Since we have a categorical variable, the species of the flower, we may use a pie chart and a bar graph to plot the proportion and the count of each class.

[12]:

fig, ax = plt.subplots(1, 2, figsize=(20, 5))

species = df['species'].value_counts()
proportions = species / species.sum()
# proportions are fractions in [0, 1]; the '%' format type multiplies by 100
labels = ['{}, {:.1%}'.format(i, v) for i, v in proportions.items()]

species.plot.pie(ax=ax[0], colors=['r', 'g', 'b'], labels=labels)
species.plot.bar(ax=ax[1], color=['r', 'g', 'b'])

ax[0].set_ylabel('')
ax[1].set_ylabel('count')
ax[1].set_xlabel('species')
plt.tight_layout()

2.3. Bivariate visualization

A bivariate visualization displays the relationship between two variables at a time.

2.3.1. Density plot

Here we plot a joint density plot for every pair of variables (there are 6 such pairs).
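The count of 6 follows from choosing 2 variables out of 4 without regard to order. A quick check, using placeholder variable names that are assumptions for this example:

```python
from itertools import combinations
from math import comb

# Placeholder names (hypothetical); any 4 distinct labels behave the same way.
variables = ['a', 'b', 'c', 'd']

# combinations() yields each unordered pair once and never pairs an item with itself.
pairs = list(combinations(variables, 2))

print(len(pairs), comb(len(variables), 2))  # 6 6
```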

[13]:

from itertools import combinations

colors = cycle(['Reds', 'Greens', 'Blues', 'Purples'])
columns=['sepal_width', 'sepal_height', 'petal_width', 'petal_height']
pairs = [comb for comb in combinations(columns, 2) if comb[0] != comb[1]]
n_pairs = len(pairs)

n_cols = 6
n_rows = int(np.ceil(n_pairs / n_cols))  # whole number of rows; add_subplot expects integers
n_plots = int(n_cols * n_rows)

fig = plt.figure(figsize=(20, 3))

for i in range(n_plots):
    ax = fig.add_subplot(n_rows, n_cols, i + 1)

    if i < n_pairs:
        label_x = pairs[i][0]
        label_y = pairs[i][1]

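        # keyword arguments and fill= are required by seaborn >= 0.11 (shade= is deprecated);
        # indexing df directly also avoids overwriting the target vector y
        sns.kdeplot(x=df[label_x], y=df[label_y], fill=True, cmap=next(colors), ax=ax)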
    else:
        ax.axis('off')

plt.tight_layout()

2.3.2. Regression plot

We may also plot two variables against one another using a regression plot. The assumption behind a regression plot is that the relationship between the two variables is linear (the solid line in each plot), which may not hold.
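To see why the linearity assumption matters, the sketch below fits a straight line by ordinary least squares with `np.polyfit` to two made-up data sets, one truly linear and one curved. The data are assumptions for this example, and the use of least squares here mirrors what a regression plot typically draws rather than reproducing Seaborn's internals.

```python
import numpy as np

# Made-up data: one exactly linear relationship and one quadratic relationship.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_linear = 2.0 * x + 1.0
y_curved = x ** 2

# Degree-1 least-squares fit; polyfit returns the highest-order coefficient first.
slope_lin, intercept_lin = np.polyfit(x, y_linear, deg=1)
slope_cur, intercept_cur = np.polyfit(x, y_curved, deg=1)

# Residuals show how well a straight line describes each data set.
residuals_linear = y_linear - (slope_lin * x + intercept_lin)
residuals_curved = y_curved - (slope_cur * x + intercept_cur)

print(np.abs(residuals_linear).max())  # essentially zero
print(np.abs(residuals_curved).max())  # clearly non-zero: the line misses the curve
```

Residuals that are systematically positive at the ends and negative in the middle, as in the curved case, are a typical sign that a straight line is the wrong model.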

[14]:

colors = cycle(['r', 'g', 'b', 'm'])
columns=['sepal_width', 'sepal_height', 'petal_width', 'petal_height']
pairs = [comb for comb in combinations(columns, 2) if comb[0] != comb[1]]
n_pairs = len(pairs)

n_cols = 6
n_rows = int(np.ceil(n_pairs / n_cols))  # whole number of rows; add_subplot expects integers
n_plots = int(n_cols * n_rows)

fig = plt.figure(figsize=(20, 3))

for i in range(n_plots):
    ax = fig.add_subplot(n_rows, n_cols, i + 1)

    if i < n_pairs:
        label_x = pairs[i][0]
        label_y = pairs[i][1]

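        # pass x and y by keyword (required by recent seaborn) and draw on this subplot;
        # indexing df directly also avoids overwriting the target vector y
        sns.regplot(x=df[label_x], y=df[label_y], color=next(colors), ax=ax)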
    else:
        ax.axis('off')

plt.tight_layout()

2.3.3. Heatmap

The heatmap is another way to display the pair-wise correlations. Note that the Pearson correlation coefficient computed here only measures linear association, so a strong non-linear relationship may still produce a value near zero.
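The following sketch illustrates this limitation of linear correlation on made-up data: a perfectly deterministic quadratic relationship yields a Pearson coefficient of essentially zero, while a linear one yields 1. The arrays are assumptions for this example.

```python
import numpy as np

# Made-up data on a grid that is symmetric around zero.
x = np.linspace(-1.0, 1.0, 101)
y_linear = 3.0 * x - 2.0   # perfectly linear in x
y_quadratic = x ** 2       # fully determined by x, but not linearly

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is the coefficient.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(round(r_linear, 6))
print(abs(r_quadratic) < 1e-9)  # True
```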

[15]:

fig, ax = plt.subplots(figsize=(5, 5))
g = sns.heatmap(np.corrcoef(X.T), yticklabels=columns, xticklabels=columns,
                linewidths=.5, annot=True, square=True, ax=ax, cbar=True)
g.set_yticklabels(g.get_yticklabels(), rotation=0)
g.set_xticklabels(g.get_xticklabels(), rotation=90)

[15]:

[Text(0.5, 0, 'sepal_width'),
 Text(1.5, 0, 'sepal_height'),
 Text(2.5, 0, 'petal_width'),
 Text(3.5, 0, 'petal_height')]

2.4. Categorical plots

We may plot the continuous variables partitioned by a categorical variable. This type of visualization allows us to see how the data distributions change with the value of the categorical variable.
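The same partitioning idea can be expressed numerically with a Pandas `groupby`, which summarizes a continuous variable within each category. The toy dataframe and its column names below are hypothetical and serve only to illustrate the pattern.

```python
import pandas as pd

# Toy data (hypothetical): one categorical column and one continuous column.
toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, 10.0, 14.0],
})

# One row per category, one column per summary statistic.
summary = toy.groupby('group')['value'].agg(['mean', 'min', 'max'])
print(summary)
```

A large difference between the per-category summaries is the numerical counterpart of visibly separated distributions in the plots.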

2.4.1. Pair plot

The pair plot shows, on the diagonal, the distribution of each variable partitioned by the values of a categorical variable. The off-diagonal plots are scatter plots of each pair of variables.

[16]:

palette = {'setosa': 'g', 'versicolor': 'r', 'virginica': 'b'}

sns.pairplot(df, hue='species', palette=palette)

[16]:

<seaborn.axisgrid.PairGrid at 0x7fc7200a6790>

2.4.2. Box plots by category (species)

We may revisit box plots by plotting the distribution of each variable by species.

[17]:

fig, ax = plt.subplots(1, 4, figsize=(20, 5))
axes = np.ravel(ax)

for a, c in zip(axes, ['sepal_width', 'sepal_height', 'petal_width', 'petal_height']):
    sns.boxplot(x='species', y=c, data=df, palette={'setosa': 'g', 'versicolor': 'r', 'virginica': 'b'}, ax=a)
    a.set_xlabel('')

plt.tight_layout()

2.5. Multivariate visualization

Multivariate visualizations are the most difficult visualizations since you are dealing with 3 or more variables. Three variables are not terribly difficult to visualize since we live in a 3-dimensional world. When the number of variables (or dimensions) exceeds 3, it becomes very difficult to picture what the data looks like in such a high-dimensional space.

2.5.1. Three-dimensional plot

Here, we plot every combination of 3 variables (there are 4 such combinations) in 3-dimensional space.

[18]:

from itertools import combinations
from mpl_toolkits.mplot3d import Axes3D

combos = list(combinations(['sepal_width', 'sepal_height', 'petal_width', 'petal_height'], 3))
n_combos = len(combos)

fig = plt.figure(figsize=(20, 15))

for i, combo in enumerate(combos):
    ax = fig.add_subplot(2, 2, i + 1, projection='3d')

    for s, c in palette.items():
        d = df[df['species'] == s]
        x, y, z = d[combo[0]], d[combo[1]], d[combo[2]]
        ax.scatter3D(x, y, z, color=c, label=s)
        ax.set_xlabel(combo[0])
        ax.set_ylabel(combo[1])
        ax.set_zlabel(combo[2])
        ax.legend()

plt.tight_layout()

2.5.2. Parallel coordinate plot

One of the best ways to visualize high-dimensional data is through parallel coordinates. There is much theory developed behind parallel coordinates (see Wegman 2002). The idea is that a point in n-dimensional space becomes a polyline (a connected sequence of line segments) in parallel coordinates. For n-dimensional space, there are correspondingly n parallel axes.
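The mapping from a point to a line can be sketched in a few lines. The helper `to_polyline` below is hypothetical and not part of Pandas. It places the k-th coordinate of a point on the k-th vertical axis and returns the vertices that would be joined by line segments.

```python
import numpy as np

def to_polyline(point):
    """Hypothetical helper: vertices (axis index, value) of a point's polyline."""
    point = np.asarray(point, dtype=float)
    axis_positions = np.arange(point.size)  # axes sit at x = 0, 1, ..., n - 1
    return np.column_stack([axis_positions, point])

# One 4-dimensional observation (made-up values).
vertices = to_polyline([5.1, 3.5, 1.4, 0.2])
print(vertices)
```

Each observation yields n vertices, so a data set with many rows becomes a bundle of polylines; observations that are close in the original space trace similar paths across the axes.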

[19]:

fig, ax = plt.subplots(figsize=(15, 5))
pd.plotting.parallel_coordinates(df, 'species', ax=ax, color=['g', 'r', 'b'])

[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fc7134fb510>

2.6. Visualizing missing data

Here, we purposefully make some data missing at random (and some not missing at random). We then use the missingno package to visualize the missingness of the data in a few different ways.
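Before plotting, missingness can also be summarized numerically with plain Pandas. The sketch below counts missing values per column on a toy dataframe; the data and column names are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with missing values in numeric and string columns.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 3.0, 4.0],
    'c': ['x', 'y', None, 'z'],
})

# isna() marks both NaN and None as missing.
missing_counts = toy.isna().sum()     # number of missing cells per column
missing_fraction = toy.isna().mean()  # share of missing cells per column

print(missing_counts)
print(missing_fraction)
```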

[20]:

from numpy.random import randint

np.random.seed(37)

missing_df = df.copy()
n_data_points = missing_df.shape[0] * missing_df.shape[1]
n_missing = int(0.10 * n_data_points)

for i in range(n_missing):
    i_loc = randint(missing_df.shape[0])
    j_loc = randint(missing_df.shape[1])
    missing_df.iloc[i_loc, j_loc] = np.nan

j1 = missing_df.shape[1] - 1
j2 = missing_df.shape[1] - 2
for i in range(n_missing):
    i_loc = randint(missing_df.shape[0])
    missing_df.iloc[i_loc, j1] = np.nan
    missing_df.iloc[i_loc, j2] = np.nan

[21]:

import missingno as msno

msno.matrix(missing_df)

[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fc713566490>

[22]:

msno.bar(missing_df)

[22]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fc712c589d0>

[23]:

msno.heatmap(missing_df)

[23]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fc71353df10>

[24]:

msno.dendrogram(missing_df)

[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fc713654d10>