Logistic Regression

Labs : Logistic Regression, LDA, QDA and KNN

This is the Chapter $4$ lab, in which we examine the Smarket data, part of the book's ISLP library.

import numpy as np
import pandas as pd 
from matplotlib.pyplot import subplots 
import statsmodels.api as sm 

from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize)
from ISLP import confusion_table
from ISLP.models import contrast

from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis as LDA, QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Exploring Data

Smarket = load_data('Smarket')
Smarket
Smarket.columns
  • We can't compute the correlation matrix over all the columns directly, since Direction is qualitative; we restrict the computation to the numeric columns
Smarket.corr(numeric_only=True)
Smarket.Volume
Smarket.plot(y='Volume')

Logistic Regression

  • Fitting a Logistic Regression model to predict Direction; we first drop the Today, Direction and Year columns, since Direction is the response, Today determines it directly, and Year is not a useful predictor
allvars = Smarket.columns.drop(['Today','Direction','Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)
y = Smarket.Direction == 'Up'
glm = sm.GLM(y, X, family=sm.families.Binomial())
results = glm.fit()
summarize(results)
results.params
results.pvalues
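Each fitted coefficient is an additive change in the log-odds of an Up day, so exponentiating turns it into an odds ratio. A minimal sketch of this reading (the value below is the approximate Lag1 coefficient from the fit above):

```python
import numpy as np

# A logistic regression coefficient is an additive change in log-odds;
# exp(beta) is the multiplicative change in the odds of Direction == 'Up'.
beta_lag1 = -0.0731  # approximate Lag1 coefficient from the fit above
odds_ratio = np.exp(beta_lag1)
# odds_ratio is just below 1: a unit increase in Lag1 slightly lowers the
# odds of an Up day (though the p-value shows this effect is not significant)
```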
  • The predict() method returns predicted probabilities; when called with no arguments, it uses the training data to compute them
  • We set the threshold to $50\%$ when converting the probabilities into class labels
probs = results.predict()
probs.size
labels = np.array(['Down']*1250)
labels[probs>0.5]='Up'
labels
confusion_table(labels,Smarket.Direction)
np.mean(labels== Smarket.Direction)
  • This indicates that our Logistic Regression model predicts the market correctly $52.2\%$ of the time
  • That means the Training Error is $47.8\%$, which is usually optimistic and underestimates the Test Error
  • Applying the holdout (validation-set) approach, we hold out the observations from the year $2005$ and train the model on the years $2001$ to $2004$
train = (Smarket.Year < 2005)
Smarket_train = Smarket.loc[train]
Smarket_test =Smarket.loc[~train]
Smarket_test.shape
print(Smarket_test)
print(Smarket_train)
X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]

glm_train = sm.GLM(y_train,X_train,family=sm.families.Binomial())

results = glm_train.fit()

probs = results.predict(exog=X_test)
probs
  • In this example the model is trained and tested on two completely separate data sets, which gives a more accurate measure of the model's performance
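The imports at the top also include train_test_split, which is the generic way to build such a split by random sampling; the lab splits by Year instead because the data are a time series. A minimal sketch of the random version on hypothetical data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random 75/25 holdout on synthetic data (the lab's split above is by Year)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
```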
D = Smarket.Direction
D
L_train, L_test = D.loc[train], D.loc[~train]
labels =np.array(['Down']*252)
labels[probs>0.5]='Up'
confusion_table(labels,L_test)
  • This shows that the Logistic Regression predicted $121$ observations correctly and $131$ incorrectly, a test error rate of about $52\%$, worse than random guessing
  • This is expected, since it is very unlikely that past returns can be used to predict the market
# Refitting the model using only the two predictors with the smallest p-values
model = MS(['Lag1','Lag2']).fit(Smarket)
X = model.transform(Smarket) 
X_train, X_test = X.loc[train], X.loc[~train]

glm_train = sm.GLM(y_train,X_train, family= sm.families.Binomial())
results = glm_train.fit()

probs = results.predict(exog=X_test)

labels=np.array(['Down']*252)
labels[probs>0.5]= 'Up'

confusion_table(labels,L_test)

  • This Logistic Regression model was fitted using only the two predictors with the smallest p-values, Lag1 and Lag2
  • The new prediction results are improved: in total $141$ observations are predicted correctly
(35+106)/252, (106/(106+76))
  • This improves the model: it now predicts $56\%$ of the test observations correctly
  • The Test Error is still high, however, and the overall accuracy is no better than the naive approach of always predicting Up
  • But when it comes to Up days, the model predicts $58\%$ correctly, which suggests a possible strategy of trading only on days when the model predicts an increase
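The arithmetic above can be read straight off the confusion table; a small sketch using the counts just computed (rows are predicted labels, columns are true Direction):

```python
import numpy as np
import pandas as pd

# Test-set confusion table of the Lag1 + Lag2 model (counts from above)
ct = pd.DataFrame([[35, 35], [76, 106]],
                  index=['Down', 'Up'], columns=['Down', 'Up'])
accuracy = np.trace(ct.values) / ct.values.sum()        # (35 + 106) / 252
up_precision = ct.loc['Up', 'Up'] / ct.loc['Up'].sum()  # 106 / (106 + 76)
```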
single_newdata  = pd.DataFrame({'Lag1':[1.2,1.5],'Lag2':[1.1,-0.8]})

newX= model.transform(single_newdata)
results.predict(newX)
  • Here we make predictions for days where the Lag1 and Lag2 values are set to specific values
  • Calling the predict() method on the transformed new data gives the predicted probabilities of an Up market for those two days
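For comparison, scikit-learn's LogisticRegression (imported at the top but unused so far) fits the same kind of model; note that it regularizes by default, so a large C makes it behave like the unpenalized sm.GLM fit. A minimal sketch on hypothetical data, not the Smarket set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the two lag features
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 2))
y = np.where(X[:, 0] - X[:, 1] + rng.normal(size=250) > 0, 'Up', 'Down')

clf = LogisticRegression(C=1e10).fit(X, y)   # large C => essentially no penalty
up_col = list(clf.classes_).index('Up')      # column of P(Direction == 'Up')
probs = clf.predict_proba(X)[:, up_col]
labels = np.where(probs > 0.5, 'Up', 'Down')
```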
