Multiple Linear Regression

Multiple Regression

import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import ModelSpec as MS, summarize

Boston = load_data('Boston')
Y = Boston['medv']  # response: median home value

new_X = MS(['lstat', 'age']).fit_transform(Boston)
model1 = sm.OLS(Y, new_X)
results1 = model1.fit()
summarize(results1)
variables = Boston.columns.drop('medv')
variables
new_X = MS(variables).fit_transform(Boston)  # design matrix with all predictors
new_X
model = sm.OLS(Y, new_X)
results = model.fit()
summarize(results)
variables_no_age = Boston.columns.drop(['medv', 'age'])
Xma = MS(variables_no_age).fit_transform(Boston)  # all predictors except age
model1 = sm.OLS(Y, Xma)
results2 = model1.fit()
summarize(results2)

Multivariate Goodness of Fit

print(new_X.shape)
dir(results)
results.rsquared
new_X.shape[1]  # number of columns in the design matrix (including the intercept)
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF

vals = [VIF(new_X, i)
        for i in range(1, new_X.shape[1])]  # skip the intercept column
vif = pd.DataFrame({'variance_inflation_factor': vals}, index=new_X.columns[1:])
vif
vals = []
for i in range(1, new_X.shape[1]):  # same as the list comprehension above
    vals.append(VIF(new_X, i))
vif = pd.DataFrame({'variance_inflation_factor': vals}, index=new_X.columns[1:])
vif

Interaction Terms

  • Including interaction terms in a linear model relaxes the additive assumption of linear regression
  • Combining predictors through interactions can improve predictions when one predictor's effect depends on another
X_inter = MS(['lstat', 'age', ('lstat', 'age')]).fit_transform(Boston)  # tuple adds the lstat:age interaction
X_inter
model3 = sm.OLS(Y, X_inter)
results3 = model3.fit()
summarize(results3)

Non-Linear Transformations of the Predictors

  • Extending the linear regression model using Polynomial Regression
from ISLP.models import poly

X_poly = MS([poly('lstat', degree=2), 'age']).fit_transform(Boston)
X_poly
model4 = sm.OLS(Y, X_poly)
results4 = model4.fit()
summarize(results4)
  • The near-zero p-value on the quadratic term suggests that it leads to an improved model

Using anova_lm() to quantify and compare the fits of the two models

from statsmodels.stats.anova import anova_lm

anova_lm(results1, results4)  # smaller (linear) model first, larger (quadratic) model second
  • anova_lm() performs a hypothesis test comparing the two models
  • The null hypothesis: the quadratic term in the results4 model is not needed
  • The alternative hypothesis: the model with the quadratic term is superior
  • The NaN entries in the first row mean there is no previous model to compare to
  • anova_lm() can compare more than two models at once
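The F-statistic that anova_lm() reports can be reproduced by hand from the residual sums of squares of the two nested models. A minimal sketch with synthetic data and hypothetical variable names:

```python
# F = ((RSS_small - RSS_big) / q) / (RSS_big / df_resid_big),
# where q is the number of extra terms in the bigger model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(size=n)
df = pd.DataFrame({'x': x, 'y': y})

small = smf.ols('y ~ x', data=df).fit()            # linear term only
big = smf.ols('y ~ x + I(x**2)', data=df).fit()    # adds the quadratic term

table = anova_lm(small, big)
q = big.df_model - small.df_model
F_manual = ((small.ssr - big.ssr) / q) / (big.ssr / big.df_resid)
print(table)
print(F_manual)  # matches the F value in the second row of the table
```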
from matplotlib.pyplot import subplots

ax = subplots(figsize=(8, 8))[1]
ax.scatter(results4.fittedvalues, results4.resid)
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
ax.axhline(0, c='k', ls='--');
  • Notice that the polynomial fit reduces the pattern in the residuals, indicating a more accurate model

Qualitative Predictors

Carseats = load_data('Carseats')
Carseats.columns
all_variables = list(Carseats.columns.drop('Sales'))
Y = Carseats['Sales']
model_variables = all_variables + [('Income', 'Advertising'), ('Price', 'Age')]
X = MS(model_variables).fit_transform(Carseats)
X
  • MS automatically transforms qualitative features using one-hot encoding
  • Each category gets a vector where one element is set to 1 and all others 0
  • Example: Red(1,0,0) Blue(0,1,0) Green(0,0,1)
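A minimal sketch of this encoding using pandas' get_dummies on a toy column (MS does the equivalent internally; note that with an intercept in the model, one baseline category is dropped, which is why the Carseats output shows ShelveLoc[Good] and ShelveLoc[Medium] but not ShelveLoc[Bad]):

```python
# One-hot encoding of a toy categorical column (not the Carseats data).
import pandas as pd

toy = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']})
dummies = pd.get_dummies(toy['Color'])
print(dummies)  # one column per category; exactly one "hot" entry per row
```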
model = sm.OLS(Y,X)
results = model.fit()
summarize(results)
  • The large positive coefficient for ShelveLoc[Good] indicates that a good shelving location increases sales