Logistic Regression

Labs : Logistic Regression, LDA, QDA and KNN

This is the Chapter $4$ lab, in which we examine the Smarket data, part of the book's ISLP library.

import numpy as np
import pandas as pd 
from matplotlib.pyplot import subplots 
import statsmodels.api as sm 

from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize)
from ISLP import confusion_table
from ISLP.models import contrast

from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis as LDA, QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Exploring Data

Smarket = load_data('Smarket')
Smarket
Smarket.columns
  • We can't compute the correlation matrix over all the columns directly, since Direction is qualitative; we restrict the computation to the numeric columns
Smarket.corr(numeric_only=True)
Smarket.Volume
Smarket.plot(y='Volume')

Logistic Regression

  • Fitting a Logistic Regression model to predict Direction; we first drop the Today, Direction and Year columns, since Direction is the response, Today determines it directly, and Year is not a useful predictor
allvars = Smarket.columns.drop(['Today','Direction','Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)
y = Smarket.Direction == 'Up'
glm = sm.GLM(y, X, family=sm.families.Binomial())
results = glm.fit()
summarize(results)
results.params
results.pvalues
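Each fitted coefficient is an additive change in the log-odds of an Up day, so exponentiating turns it into an odds ratio. A minimal sketch of this reading (the value below is the approximate Lag1 coefficient from the fit above):

```python
import numpy as np

# A logistic regression coefficient is an additive change in log-odds;
# exp(beta) is the multiplicative change in the odds of Direction == 'Up'.
beta_lag1 = -0.0731  # approximate Lag1 coefficient from the fit above
odds_ratio = np.exp(beta_lag1)
# odds_ratio is just below 1: a unit increase in Lag1 slightly lowers the
# odds of an Up day (though the p-value shows this effect is not significant)
```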
  • The predict() method returns predicted probabilities; when called with no arguments, it uses the training data to compute them
  • We set the threshold to $50\%$ when converting the probabilities into class labels
probs = results.predict()
probs.size
labels = np.array(['Down']*1250)
labels[probs>0.5]='Up'
labels
confusion_table(labels,Smarket.Direction)
np.mean(labels== Smarket.Direction)
  • This indicates that our Logistic Regression model predicts the market correctly $52.2\%$ of the time
  • That means the Training Error is $47.8\%$, which is usually optimistic and underestimates the Test Error
  • Applying the holdout (validation-set) approach, we hold out the observations from the year $2005$ and train the model on the years $2001$ to $2004$
train = (Smarket.Year < 2005)
Smarket_train = Smarket.loc[train]
Smarket_test =Smarket.loc[~train]
Smarket_test.shape
print(Smarket_test)
print(Smarket_train)
X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]

glm_train = sm.GLM(y_train,X_train,family=sm.families.Binomial())

results = glm_train.fit()

probs = results.predict(exog=X_test)
probs
  • In this example the model is trained and tested on two completely separate data sets, which gives a more accurate measure of the model's performance
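The imports at the top also include train_test_split, which is the generic way to build such a split by random sampling; the lab splits by Year instead because the data are a time series. A minimal sketch of the random version on hypothetical data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random 75/25 holdout on synthetic data (the lab's split above is by Year)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
```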
D = Smarket.Direction
D
L_train, L_test = D.loc[train], D.loc[~train]
labels =np.array(['Down']*252)
labels[probs>0.5]='Up'
confusion_table(labels,L_test)
  • This shows that the Logistic Regression predicted $121$ observations correctly and $131$ incorrectly, a test error rate of about $52\%$, worse than random guessing
  • This is expected, since it is very unlikely that past returns can be used to predict the market
# Refitting the model using only the two predictors with the smallest p-values
model = MS(['Lag1','Lag2']).fit(Smarket)
X = model.transform(Smarket) 
X_train, X_test = X.loc[train], X.loc[~train]

glm_train = sm.GLM(y_train,X_train, family= sm.families.Binomial())
results = glm_train.fit()

probs = results.predict(exog=X_test)

labels=np.array(['Down']*252)
labels[probs>0.5]= 'Up'

confusion_table(labels,L_test)

  • This Logistic Regression model was fitted using only the two predictors with the smallest p-values, Lag1 and Lag2
  • The new prediction results are improved: in total $141$ observations are predicted correctly
(35+106)/252, (106/(106+76))
  • This improves the model: it now predicts $56\%$ of the test observations correctly
  • The Test Error is still high, however, and the overall accuracy is no better than the naive approach of always predicting Up
  • But when it comes to Up days, the model predicts $58\%$ correctly, which suggests a possible strategy of trading only on days when the model predicts an increase
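The arithmetic above can be read straight off the confusion table; a small sketch using the counts just computed (rows are predicted labels, columns are true Direction):

```python
import numpy as np
import pandas as pd

# Test-set confusion table of the Lag1 + Lag2 model (counts from above)
ct = pd.DataFrame([[35, 35], [76, 106]],
                  index=['Down', 'Up'], columns=['Down', 'Up'])
accuracy = np.trace(ct.values) / ct.values.sum()        # (35 + 106) / 252
up_precision = ct.loc['Up', 'Up'] / ct.loc['Up'].sum()  # 106 / (106 + 76)
```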
single_newdata  = pd.DataFrame({'Lag1':[1.2,1.5],'Lag2':[1.1,-0.8]})

newX= model.transform(single_newdata)
results.predict(newX)
  • Here we make predictions for days where the Lag1 and Lag2 values are set to specific values
  • Calling the predict() method on the transformed new data gives the predicted probabilities of an Up market for those two days
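For comparison, scikit-learn's LogisticRegression (imported at the top but unused so far) fits the same kind of model; note that it regularizes by default, so a large C makes it behave like the unpenalized sm.GLM fit. A minimal sketch on hypothetical data, not the Smarket set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the two lag features
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 2))
y = np.where(X[:, 0] - X[:, 1] + rng.normal(size=250) > 0, 'Up', 'Down')

clf = LogisticRegression(C=1e10).fit(X, y)   # large C => essentially no penalty
up_col = list(clf.classes_).index('Up')      # column of P(Direction == 'Up')
probs = clf.predict_proba(X)[:, up_col]
labels = np.where(probs > 0.5, 'Up', 'Down')
```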
