Introduction

PCA is a dimensionality reduction technique that takes a set of possibly correlated variables and transforms them into linearly uncorrelated principal components. It is used to emphasize variation and bring out strong patterns in a dataset.

In simple words, principal component analysis is a method of extracting the important variables from the large set available in a data set. It derives a low-dimensional set of features from a high-dimensional data set, with the aim of capturing as much information as possible.

This post is about visualizing principal components using Python. You can find mathematical explanations in the links given at the bottom.

Let’s start!

Import basic packages

import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd  
import seaborn as sns 

Load Dataset

We can use the Boston housing dataset for PCA. It has 13 features, which we can reduce using PCA.

from sklearn.datasets import load_boston
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
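
Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippet above only works on older versions. On newer versions you can fetch the same data from its original source; a minimal sketch, with the feature names hard-coded from the table above:

# Each record in the raw file spans two lines, so re-interleave the rows
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
boston = pd.DataFrame(data, columns=feature_names)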

Standardize data

PCA is strongly affected by scale, and different features may have very different scales. So it is better to standardize the data before extracting PCA components. scikit-learn's StandardScaler rescales each feature to zero mean and unit variance. This is an important step in many machine learning algorithms.

from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(boston)
x = pd.DataFrame(x, columns=boston.columns)

Get PCA Components

Calculating PCA involves the following steps (a NumPy sketch follows the list):

  1. Calculating the covariance matrix
  2. Calculating the eigenvalues and eigenvectors
  3. Forming the principal components
  4. Projecting the data into the new feature space
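
For reference, here is a minimal sketch of those four steps in NumPy, assuming x is the standardized data frame from above (component signs may differ from sklearn's):

# 1. Covariance matrix of the standardized data
cov = np.cov(x.T)
# 2. Eigenvalues and eigenvectors (eigh suits symmetric matrices)
eig_vals, eig_vecs = np.linalg.eigh(cov)
# 3. Principal components: eigenvectors sorted by decreasing eigenvalue
order = np.argsort(eig_vals)[::-1]
components = eig_vecs[:, order]
# 4. Project the data onto the first 5 components
scores = x.values @ components[:, :5]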

This post is more about visualizing PCA components than computing them by hand, and fortunately sklearn provides a PCA module that does the work for us.

In PCA(), n_components specifies how many components are returned after fitting and transformation.

from sklearn.decomposition import PCA
pcamodel = PCA(n_components=5)
pca = pcamodel.fit_transform(x)
pca.shape
(506, 5)
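
As a quick sanity check (this snippet is mine, not part of the original flow), the component scores should be linearly uncorrelated, which is the defining property mentioned in the introduction:

# Off-diagonal entries of the correlation matrix should be ~0
print(np.round(np.corrcoef(pca.T), 3))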

PCA model attribute plots

PCA components and their significance can be explained using the following attributes:

Explained variance is the amount of variance explained by each of the selected components. It is exposed on the fitted sklearn PCA model as explained_variance_.

Explained variance ratio is the percentage of variance explained by each of the selected components. Its attribute is explained_variance_ratio_.

pcamodel.explained_variance_ 
array([6.1389812 , 1.43611329, 1.2450773 , 0.85927328, 0.83646904])
pcamodel.explained_variance_ratio_
array([0.47129606, 0.11025193, 0.0955859 , 0.06596732, 0.06421661])
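
A common follow-up question is how many components are needed to reach a given share of the variance; here is a small sketch (the 80% threshold is an arbitrary choice):

# Cumulative share of variance explained by the first k components
cum_ratio = np.cumsum(pcamodel.explained_variance_ratio_)
print(np.round(cum_ratio, 3))   # [0.471 0.582 0.677 0.743 0.807]
k = np.argmax(cum_ratio >= 0.80) + 1
print(k)                        # 5 components reach 80% here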

explained_variance_ plot

plt.bar(range(1, len(pcamodel.explained_variance_) + 1),
        pcamodel.explained_variance_)
plt.ylabel('Explained variance')
plt.xlabel('Components')
plt.plot(range(1, len(pcamodel.explained_variance_) + 1),
         np.cumsum(pcamodel.explained_variance_),
         c='red',
         label='Cumulative Explained Variance')
plt.legend(loc='upper left')
plt.show()

explained_variance_ratio_ plot

plt.plot(pcamodel.explained_variance_ratio_)
plt.xlabel('number of components')
plt.ylabel('explained variance ratio')
plt.show()

# Note: PC1 sits at 0 on the x axis

Scree plot

A scree plot is simply a plot of the eigenvalues (explained_variance_) for each of the components.

plt.plot(pcamodel.explained_variance_)
plt.xlabel('number of components')
plt.ylabel('explained variance (eigenvalue)')
plt.show()

It can be seen from the plots that PC1 explains far more of the variance than the subsequent components. In other words, a large share of the information in the 13 features is captured by PC1 alone.

Scatter plot of PCA1 and PCA2

pca holds the scores of every sample on each principal component. The first two of them can be visualized using a scatter plot.

plt.scatter(pca[:, 0], pca[:, 1])

3D Scatter plot of PCA1, PCA2 and PCA3

We can use the Scatter3d trace from Plotly to plot the first 3 components in 3D space.


#Make Plotly figure
import plotly
import plotly.plotly as py
import plotly.graph_objs as go

plotly.tools.set_credentials_file(username='prasadostwal', api_key='yourapikey')
# api key hidden

fig1 = go.Scatter3d(x=pca[:, 0],
                    y=pca[:, 1],
                    z=pca[:, 2],
                    marker=dict(opacity=0.9,
                                reversescale=True,
                                colorscale='Blues',
                                size=5),
                    line=dict(width=0.02),
                    mode='markers')

#Make Plotly layout
mylayout = go.Layout(scene=dict(xaxis=dict(title="PCA1"),
                                yaxis=dict(title="PCA2"),
                                zaxis=dict(title="PCA3")))

#Plot inline (uploads the figure to your Plotly account)
py.iplot({"data": [fig1], "layout": mylayout},
         filename="3DPlot")

Effect of variables on each component

The components_ attribute provides the principal axes in feature space, representing the directions of maximum variance in the data. In other words, it lets us see how strongly each original feature influences each of the components.

ax = sns.heatmap(pcamodel.components_,
                 cmap='YlGnBu',
                 yticklabels=["PCA" + str(i) for i in range(1, pcamodel.n_components_ + 1)],
                 xticklabels=list(x.columns),
                 cbar_kws={"orientation": "horizontal"})
ax.set_aspect("equal")

PCA Biplot

A biplot is an interesting plot that packs a lot of useful information into a single figure.

It contains two plots:

  1. The PCA scatter plot, which shows the first two components (we already plotted this above)
  2. The PCA loading plot, which shows how strongly each feature influences a principal component.

PCA loading plot: all vectors start at the origin, and their projected values on the components show how much weight they have on each component. Also, the angles between individual vectors indicate the correlation between the corresponding features.
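
The loading vectors can also be inspected numerically. One common convention (an assumption here, not something the plot below depends on) scales each principal axis by the square root of its eigenvalue, which for standardized data gives the correlation between each feature and each component:

# Loadings: feature-component correlations (for standardized inputs)
loadings = pcamodel.components_.T * np.sqrt(pcamodel.explained_variance_)
loadings_df = pd.DataFrame(loadings,
                           index=x.columns,
                           columns=["PC" + str(i + 1) for i in range(pcamodel.n_components_)])
print(loadings_df.round(2))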

More about biplots here

Let’s plot for our data.

def myplot(score, coeff, labels=None):
    # score: PCA scores of the samples; coeff: loading vectors (one row per feature)
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # Rescale the scores so they share roughly the same range as the loadings
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, s=5)
    for i in range(n):
        # Draw one arrow per original feature and label it at the tip
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1),
                     color='green', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i],
                     color='green', ha='center', va='center')

    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.grid()

myplot(pca[:, 0:2], np.transpose(pcamodel.components_[0:2, :]), list(x.columns))
plt.show()

Source:

You can find the Jupyter notebook for this post here

Interesting reads about PCA: