Principal Component Analysis (PCA)

In this notebook we will go over Principal Component Analysis without much focus on the theory, mentioning it only lightly; for the full theory and the linear algebra interpretation of PCA, check the Dimension Reduction folder in the GitHub repo.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

PCA steps:

  1. Center the data by subtracting the mean from each observation: $X-\bar{X}$
  2. Compute the SVD, or the eigenvectors of the covariance matrix, which give the principal components (handled by sklearn)
  3. Compute the principal component scores by projecting the data onto the new axes
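The three steps above can be sketched by hand with NumPy (a minimal sketch of what sklearn's PCA does internally):

```python
import numpy as np

# Toy data: n=6 observations, p=3 features
X = np.array([[1, 3, 2], [5, 1, 3], [22, 3, 5],
              [13, 21, 0], [2, 4, 10], [9, 11, 1]], dtype=float)

# Step 1: center by subtracting the column means (X - X_bar)
Xc = X - X.mean(axis=0)

# Step 2: SVD of the centered data; the rows of Vt are the principal components
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Step 3: project the data onto the new axes to get the PC scores
scores = Xc @ Vt.T

# Equivalently, the scores are U * S
assert np.allclose(scores, U * S)
```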

Feature Scaling

X = np.array([[1,3,2],[5,1,3],[22,3,5],[13,21,0],[2,4,10],[9,11,1]])
X
X_centered = StandardScaler().fit_transform(X)
X_centered
  • Now our data is standardized with mean $\mu=0$ and standard deviation $\sigma=1$
  • Standardization centers the data around the mean
  • It preserves the relationships between the data points
  • It does not change the shape of the original distribution, only its location and scale
X_centered.shape
  • Our data set has $n=6$ observations and $p=3$ features; keep this in mind, since PCA will reduce the dimensionality of this matrix to make it more interpretable and easier to visualize

Fitting PCA

pca_one_dim = PCA(n_components=1,svd_solver='full')
pca_two_dim = PCA(n_components=2,svd_solver='full')
pca_one_dim.fit(X_centered)
pca_two_dim.fit(X_centered);
pca_one_dim
pca_two_dim
  • Here we fit two PCA instances, one for a single dimension and one for two dimensions; n_components sets the number of axes the data will be projected onto
  • We use svd_solver='full' since PCA is simply a statistical interpretation of the Singular Value Decomposition
print(pca_one_dim.explained_variance_ratio_)
print(pca_two_dim.explained_variance_ratio_)
  • The explained variance is very important to watch while performing PCA, since it shows:
    • How much of the variance in the original data each component explains
    • The first component explains around $56\%$ of the variance
    • While the second component explains only $28.7\%$
pca_three_dim = PCA(n_components=3,svd_solver='full')
pca_three_dim.fit(X_centered);
print(pca_three_dim.explained_variance_ratio_)
  • As a naive experiment, but good for intuition, we set n_components=3, which is the original dimension of our data matrix X
  • We can notice that the explained variance adds up to $100\%$ across all principal components
  • They are ordered from most to least important, the first being the one that explains the maximum variance in the original data X
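These two properties are easy to verify directly on the toy matrix (a quick sketch):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 3, 2], [5, 1, 3], [22, 3, 5],
              [13, 21, 0], [2, 4, 10], [9, 11, 1]], dtype=float)
X_centered = StandardScaler().fit_transform(X)

pca_three_dim = PCA(n_components=3, svd_solver='full').fit(X_centered)
ratios = pca_three_dim.explained_variance_ratio_

# With all p components retained, the ratios sum to 1...
assert np.allclose(ratios.sum(), 1.0)
# ...and are sorted in decreasing order
assert np.all(np.diff(ratios) <= 1e-12)
```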

Projecting the Data

After finding the principal components of our data X, it is time to project (transform) the data onto the new axes (the PCs).

X_pca1 = pca_one_dim.transform(X_centered)
X_pca2 = pca_two_dim.transform(X_centered)
X_centered
X_pca1
X_pca2
  • These values are the projected coordinates, also called the principal component scores
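The transform step is nothing more than a matrix product with the component directions; a sketch checking the equivalence:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 3, 2], [5, 1, 3], [22, 3, 5],
              [13, 21, 0], [2, 4, 10], [9, 11, 1]], dtype=float)
X_centered = StandardScaler().fit_transform(X)

pca_two_dim = PCA(n_components=2, svd_solver='full').fit(X_centered)
X_pca2 = pca_two_dim.transform(X_centered)

# transform(X) is equivalent to (X - mean_) @ components_.T
manual = (X_centered - pca_two_dim.mean_) @ pca_two_dim.components_.T
assert np.allclose(X_pca2, manual)
```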

Visualization

One of the most important uses of PCA is visualization: we can’t plot data with more than 3 dimensions (or 4, if we want to stretch it), so PCA provides a principled way to represent it in fewer dimensions.

Here we plot our 3-dimensional data on a 2D graph, which is easy to work with and interpret; this helps us decide which statistical learning method to pick and understand the nature of the data we are dealing with, not to mention the unsupervised uses of PCA.

plt.scatter(X_pca2[:,0],X_pca2[:,1], alpha=0.7 ,edgecolors='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PC Scores')
plt.grid(True)
plt.show();

For more interesting data and results, let’s import the Breast Cancer dataset from sklearn, which includes 30 features.

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data.data
y = data.target
df = pd.DataFrame(X,columns= data.feature_names)
df['target'] = y
target_names = data.target_names
df
target_names
  • malignant: a cancerous tumour that can spread to other parts of the body
  • benign: a non-cancerous tumour that does not spread to other parts of the body

Applying PCA on Breast Cancer Data

X_scaled = StandardScaler().fit_transform(X)
X_scaled
  • As usual, we scale the data before performing PCA
  • so each feature has mean $\mu=0$ and standard deviation $\sigma=1$
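A quick sanity check that every scaled column really ends up with mean 0 and unit standard deviation (sketch):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# Per-feature mean ~ 0 and (population) std ~ 1 after standardization
assert np.allclose(X_scaled.mean(axis=0), 0, atol=1e-8)
assert np.allclose(X_scaled.std(axis=0), 1)
```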
# Scaled Data frame for our data set 
df_scaled = pd.DataFrame(X_scaled , columns=[f"{name}_scaled" for name in data.feature_names])

Explained Variance Plot

Let’s plot the cumulative sum of the explained variance ratio as we increase the number of components; this will help us decide how many dimensions to keep.

pca = PCA()
pca_full = pca.fit_transform(X_scaled)
# Explained Variance 
n_pcs = len(pca.explained_variance_ratio_)
exp_var_cumul= np.cumsum(pca.explained_variance_ratio_)
# Plotting 
plt.fill_between(range(1,n_pcs+1),exp_var_cumul,color="skyblue", alpha=0.6)
plt.plot(range(1,n_pcs+1),exp_var_cumul,color='Slateblue',alpha=0.8,linewidth=1)
plt.tick_params(labelsize=8)
plt.xticks(np.arange(31,step=5))
plt.xlabel("Number of PCs")
plt.ylabel("Explained Variance Ratio")
plt.title("Cumulative Explained Variance")
plt.ylim(bottom=0)
plt.xlim(1,n_pcs)
plt.show();
var = pca.explained_variance_ratio_
plt.plot(var,marker='o',color='darkorange',ls="--")
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.title('Score for Each PC')
plt.grid(True)
plt.show();
  • As this plot shows, selecting $3-4$ components explains most of the variance in the data set
  • For simple plotting, let’s select only $2$ principal components
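Instead of reading the number of components off the plot, sklearn's PCA also accepts a fraction for n_components and keeps just enough components to reach that share of the variance (a sketch using a 0.95 threshold):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# n_components in (0, 1) means: keep enough PCs to explain that fraction of variance
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print(pca_95.n_components_, pca_95.explained_variance_ratio_.sum())
```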
pca= PCA(n_components=2)
X_pca = pca.fit(X_scaled).transform(X_scaled)
pca.explained_variance_ratio_
for label,target_name in zip(np.unique(y),target_names):
   plt.scatter(X_pca[y==label,0],X_pca[y==label,1],alpha=0.7,edgecolors='k',label=target_name)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Breast Cancer Data PCs')
plt.legend()
plt.grid(True)
plt.show();
  • Plotting the first two principal components against each other reveals some interesting patterns
  • It looks like this data can be separated using a linear decision boundary
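To put the "linearly separable" impression to a rough test, one can fit a linear classifier on just the two PC scores (a sketch, not part of the original analysis; training accuracy only, no held-out split):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# A linear decision boundary on the first two PCs already separates most points
clf = LogisticRegression().fit(X_pca, data.target)
print(clf.score(X_pca, data.target))
```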
# Eigen vectors of the covariance matrix 
print(pca.components_.shape)
eigen_vectors =pca.components_
# singular values 
explained_variance = pca.explained_variance_
print(explained_variance.shape)
  • pca.components_ represents the eigenvectors of the covariance matrix $\frac{1}{n-1}X^TX$ (for centered $X$)
  • pca.explained_variance_ holds the eigenvalues of the covariance matrix, i.e. the squared singular values from the diagonal matrix $\Sigma$ of the Singular Value Decomposition, divided by $n-1$
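This relationship between the singular values and the eigenvalues can be checked directly, since sklearn also exposes pca.singular_values_ (sketch):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)
n = X_scaled.shape[0]

pca = PCA(n_components=2).fit(X_scaled)

# Eigenvalues of the covariance matrix = singular values squared / (n - 1)
assert np.allclose(pca.explained_variance_,
                   pca.singular_values_**2 / (n - 1))
```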
loadings = eigen_vectors.T * np.sqrt(explained_variance)
loadings.shape
  • The loadings represent the role each feature plays in a principal component
  • They are a scaled version of the eigenvectors
  • They show the direction and the magnitude of the variance
  • They help detect the important features in the data set and the PCs
importance = np.abs(loadings[:,0]) + np.abs(loadings[:,1])

print(importance)
pd.DataFrame({"PC1":loadings[:,0],"PC2":loadings[:,1],"importance":importance}, index=data.feature_names)
  • The loadings lie in $[-1,1]$: for standardized data they are the correlations between the original features and the principal components
# Loadings Plot (Bar Plot for loadings)
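The comment above suggests a bar plot of the loadings; here is a minimal sketch that plots the PC1 and PC2 loadings for the ten features ranked most important by the summed absolute loadings computed above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_scaled)

# Loadings: eigenvectors scaled by the square root of the eigenvalues
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
importance = np.abs(loadings[:, 0]) + np.abs(loadings[:, 1])

# Top 10 features by combined loading magnitude
top = np.argsort(importance)[::-1][:10]
x = np.arange(len(top))
plt.bar(x - 0.2, loadings[top, 0], width=0.4, label='PC1')
plt.bar(x + 0.2, loadings[top, 1], width=0.4, label='PC2')
plt.xticks(x, np.array(data.feature_names)[top], rotation=45, ha='right')
plt.ylabel('Loading')
plt.title('Loadings for the Top 10 Features')
plt.legend()
plt.tight_layout()
plt.show()
```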