Principal Component Analysis (PCA)

In this notebook we will go over Principal Component Analysis without much focus on the theory, mentioning it only lightly; for the full theory and the linear algebra interpretation of PCA, check the Dimension Reduction folder in the GitHub repo.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

PCA steps:

  1. Center the data by subtracting the mean from each observation: $X-\bar{X}$
  2. Compute the SVD, or the eigenvectors of the covariance matrix, which give the principal components (handled by sklearn)
  3. Compute the principal component scores by projecting the data onto the new axes
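The three steps above can be sketched by hand with NumPy (a minimal sketch of what sklearn's PCA does internally):

```python
import numpy as np

# Toy data: n=6 observations, p=3 features
X = np.array([[1, 3, 2], [5, 1, 3], [22, 3, 5],
              [13, 21, 0], [2, 4, 10], [9, 11, 1]], dtype=float)

# Step 1: center by subtracting the column means (X - X_bar)
Xc = X - X.mean(axis=0)

# Step 2: SVD of the centered data; the rows of Vt are the principal components
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Step 3: project the data onto the new axes to get the PC scores
scores = Xc @ Vt.T

# Equivalently, the scores are U * S
assert np.allclose(scores, U * S)
```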

Feature Scaling

X = np.array([[1,3,2],[5,1,3],[22,3,5],[13,21,0],[2,4,10],[9,11,1]])
X
X_centered = StandardScaler().fit_transform(X)
X_centered
  • Now our data is standardized with mean $\mu=0$ and standard deviation $\sigma=1$
  • Standardization centers the data around the mean
  • It preserves the relationships between the data points
  • It does not change the shape of the original distribution, only its location and scale
X_centered.shape
  • Our data set has $n=6$ observations and $p=3$ features; keep this in mind, since PCA will reduce the dimensionality of this matrix to make it more interpretable and easier to visualize

Fitting PCA

pca_one_dim = PCA(n_components=1,svd_solver='full')
pca_two_dim = PCA(n_components=2,svd_solver='full')
pca_one_dim.fit(X_centered)
pca_two_dim.fit(X_centered);
pca_one_dim
pca_two_dim
  • Here we fit two PCA instances, one for a single dimension and one for two dimensions; n_components sets the number of axes the data will be projected onto
  • We use svd_solver='full' since PCA is simply a statistical interpretation of the Singular Value Decomposition
print(pca_one_dim.explained_variance_ratio_)
print(pca_two_dim.explained_variance_ratio_)
  • The explained variance is very important to watch while performing PCA, since it shows:
    • How much of the variance in the original data each component explains
    • The first component explains around $56\%$ of the variance
    • While the second component explains only $28.7\%$
pca_three_dim = PCA(n_components=3,svd_solver='full')
pca_three_dim.fit(X_centered);
print(pca_three_dim.explained_variance_ratio_)
  • As a naive experiment, but good for intuition, we set n_components=3, which is the original dimension of our data matrix X
  • We can notice that the explained variance adds up to $100\%$ across all principal components
  • They are ordered from most to least important, the first being the one that explains the maximum variance in the original data X
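These two properties are easy to verify directly on the toy matrix (a quick sketch):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 3, 2], [5, 1, 3], [22, 3, 5],
              [13, 21, 0], [2, 4, 10], [9, 11, 1]], dtype=float)
X_centered = StandardScaler().fit_transform(X)

pca_three_dim = PCA(n_components=3, svd_solver='full').fit(X_centered)
ratios = pca_three_dim.explained_variance_ratio_

# With all p components retained, the ratios sum to 1...
assert np.allclose(ratios.sum(), 1.0)
# ...and are sorted in decreasing order
assert np.all(np.diff(ratios) <= 1e-12)
```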

Projecting the Data

After finding the principal components of our data X, it is time to project (transform) the data onto the new axes (the PCs).

X_pca1 = pca_one_dim.transform(X_centered)
X_pca2 = pca_two_dim.transform(X_centered)
X_centered
X_pca1
X_pca2
  • These values are the projected coordinates, also called the principal component scores
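The transform step is nothing more than a matrix product with the component directions; a sketch checking the equivalence:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 3, 2], [5, 1, 3], [22, 3, 5],
              [13, 21, 0], [2, 4, 10], [9, 11, 1]], dtype=float)
X_centered = StandardScaler().fit_transform(X)

pca_two_dim = PCA(n_components=2, svd_solver='full').fit(X_centered)
X_pca2 = pca_two_dim.transform(X_centered)

# transform(X) is equivalent to (X - mean_) @ components_.T
manual = (X_centered - pca_two_dim.mean_) @ pca_two_dim.components_.T
assert np.allclose(X_pca2, manual)
```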

Visualization

One of the most important uses of PCA is visualization: we can’t plot data with more than 3 dimensions (or 4, if we want to stretch it), so PCA provides a principled way to represent it in fewer dimensions.

Here we plot our 3-dimensional data on a 2D graph, which is easy to work with and interpret; this helps us decide which statistical learning method to pick and understand the nature of the data we are dealing with, not to mention the unsupervised uses of PCA.

plt.scatter(X_pca2[:,0],X_pca2[:,1], alpha=0.7 ,edgecolors='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PC Scores')
plt.grid(True)
plt.show();

For more interesting data and results, let’s import the Breast Cancer dataset from sklearn, which includes 30 features.

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data.data
y = data.target
df = pd.DataFrame(X,columns= data.feature_names)
df['target'] = y
target_names = data.target_names
df
target_names
  • malignant: a cancerous tumour that can spread to other parts of the body
  • benign: a non-cancerous tumour that does not spread to other parts of the body

Applying PCA on Breast Cancer Data

X_scaled = StandardScaler().fit_transform(X)
X_scaled
  • As usual, we scale the data before performing PCA
  • so each feature has mean $\mu=0$ and standard deviation $\sigma=1$
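A quick sanity check that every scaled column really ends up with mean 0 and unit standard deviation (sketch):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# Per-feature mean ~ 0 and (population) std ~ 1 after standardization
assert np.allclose(X_scaled.mean(axis=0), 0, atol=1e-8)
assert np.allclose(X_scaled.std(axis=0), 1)
```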
# Scaled Data frame for our data set 
df_scaled = pd.DataFrame(X_scaled , columns=[f"{name}_scaled" for name in data.feature_names])

Explained Variance Plot

Let’s plot the cumulative sum of the explained variance ratio as we increase the number of components; this will help us decide how many dimensions to keep.

pca = PCA()
pca_full = pca.fit_transform(X_scaled)
# Explained Variance 
n_pcs = len(pca.explained_variance_ratio_)
exp_var_cumul= np.cumsum(pca.explained_variance_ratio_)
# Plotting 
plt.fill_between(range(1,n_pcs+1),exp_var_cumul,color="skyblue", alpha=0.6)
plt.plot(range(1,n_pcs+1),exp_var_cumul,color='Slateblue',alpha=0.8,linewidth=1)
plt.tick_params(labelsize=8)
plt.xticks(np.arange(31,step=5))
plt.xlabel("Number of PCs")
plt.ylabel("Explained Variance Ratio")
plt.title("Cumulative Explained Variance")
plt.ylim(bottom=0)
plt.xlim(1,n_pcs)
plt.show();
var = pca.explained_variance_ratio_
plt.plot(var,marker='o',color='darkorange',ls="--")
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.title('Score for Each PC')
plt.grid(True)
plt.show();
  • As this plot shows, selecting $3-4$ components explains most of the variance in the data set
  • For simple plotting, let’s select only $2$ principal components
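Instead of reading the number of components off the plot, sklearn's PCA also accepts a fraction for n_components and keeps just enough components to reach that share of the variance (a sketch using a 0.95 threshold):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# n_components in (0, 1) means: keep enough PCs to explain that fraction of variance
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print(pca_95.n_components_, pca_95.explained_variance_ratio_.sum())
```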
pca= PCA(n_components=2)
X_pca = pca.fit(X_scaled).transform(X_scaled)
pca.explained_variance_ratio_
for label,target_name in zip(np.unique(y),target_names):
   plt.scatter(X_pca[y==label,0],X_pca[y==label,1],alpha=0.7,edgecolors='k',label=target_name)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Breast Cancer Data PCs')
plt.legend()
plt.grid(True)
plt.show();
  • Plotting the first two principal components against each other reveals some interesting patterns
  • It looks like this data can be separated using a linear decision boundary
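To put the "linearly separable" impression to a rough test, one can fit a linear classifier on just the two PC scores (a sketch, not part of the original analysis; training accuracy only, no held-out split):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# A linear decision boundary on the first two PCs already separates most points
clf = LogisticRegression().fit(X_pca, data.target)
print(clf.score(X_pca, data.target))
```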
# Eigen vectors of the covariance matrix 
print(pca.components_.shape)
eigen_vectors =pca.components_
# singular values 
explained_variance = pca.explained_variance_
print(explained_variance.shape)
  • pca.components_ represents the eigenvectors of the covariance matrix $\frac{1}{n-1}X^TX$ (for centered $X$)
  • pca.explained_variance_ holds the eigenvalues of the covariance matrix, i.e. the squared singular values from the diagonal matrix $\Sigma$ of the Singular Value Decomposition, divided by $n-1$
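This relationship between the singular values and the eigenvalues can be checked directly, since sklearn also exposes pca.singular_values_ (sketch):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)
n = X_scaled.shape[0]

pca = PCA(n_components=2).fit(X_scaled)

# Eigenvalues of the covariance matrix = singular values squared / (n - 1)
assert np.allclose(pca.explained_variance_,
                   pca.singular_values_**2 / (n - 1))
```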
loadings = eigen_vectors.T * np.sqrt(explained_variance)
loadings.shape
  • The loadings represent the role each feature plays in a principal component
  • They are a scaled version of the eigenvectors
  • They show the direction and the magnitude of the variance
  • They help detect the important features in the data set and the PCs
importance = np.abs(loadings[:,0]) + np.abs(loadings[:,1])

print(importance)
pd.DataFrame({"PC1":loadings[:,0],"PC2":loadings[:,1],"importance":importance}, index=data.feature_names)
  • The loadings lie in $[-1,1]$: for standardized data they are the correlations between the original features and the principal components
# Loadings Plot (Bar Plot for loadings)
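The comment above suggests a bar plot of the loadings; here is a minimal sketch that plots the PC1 and PC2 loadings for the ten features ranked most important by the summed absolute loadings computed above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_scaled)

# Loadings: eigenvectors scaled by the square root of the eigenvalues
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
importance = np.abs(loadings[:, 0]) + np.abs(loadings[:, 1])

# Top 10 features by combined loading magnitude
top = np.argsort(importance)[::-1][:10]
x = np.arange(len(top))
plt.bar(x - 0.2, loadings[top, 0], width=0.4, label='PC1')
plt.bar(x + 0.2, loadings[top, 1], width=0.4, label='PC2')
plt.xticks(x, np.array(data.feature_names)[top], rotation=45, ha='right')
plt.ylabel('Loading')
plt.title('Loadings for the Top 10 Features')
plt.legend()
plt.tight_layout()
plt.show()
```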