Multiple Linear Regression
Multiple Regression
new_X = MS(['lstat', 'age']).fit_transform(Boston)  # design matrix with an intercept column
model1 = sm.OLS(Y, new_X)
results1 = model1.fit()
summarize(results1)
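Under the hood, MS() builds a design matrix whose first column is an intercept, and sm.OLS solves an ordinary least-squares problem. A minimal numpy sketch on synthetic data (the column names and coefficients are invented for illustration, not the actual Boston fit):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 150
lstat = rng.uniform(2, 35, n)
age = rng.uniform(5, 100, n)
# Hypothetical noiseless response; coefficients chosen for illustration only
medv = 34.5 - 1.0 * lstat + 0.03 * age

# MS(['lstat', 'age']) builds a design matrix shaped like this: intercept first
X = np.column_stack([np.ones(n), lstat, age])
beta, *_ = np.linalg.lstsq(X, medv, rcond=None)  # recovers [34.5, -1.0, 0.03]
```

With noiseless data the least-squares solution reproduces the generating coefficients exactly, which is a quick way to sanity-check a hand-built design matrix.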
variables = Boston.columns.drop('medv')
variables
new_X = MS(variables).fit_transform(Boston)  # design matrix with all predictors except medv
new_X
model = sm.OLS(Y, new_X)
results = model.fit()
summarize(results)
variables_no_age = Boston.columns.drop(['medv', 'age'])
Xma = MS(variables_no_age).fit_transform(Boston)
model1 = sm.OLS(Y, Xma)
results2 = model1.fit()
summarize(results2)
Multivariate Goodness of Fit
print(new_X.shape)
dir(results)
results.rsquared
new_X.shape[1]  # number of model parameters, including the intercept
vals = [VIF(new_X, i)
        for i in range(1, new_X.shape[1])]  # skip the intercept column
variance_inflation_factor = pd.DataFrame({'variance_inflation_factor': vals}, index=new_X.columns[1:])
variance_inflation_factor
vals = []
for i in range(1, new_X.shape[1]):  # same as the list comprehension above
    vals.append(VIF(new_X, i))
variance_inflation_factor = pd.DataFrame({'variance_inflation_factor': vals}, index=new_X.columns[1:])
variance_inflation_factor
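The VIF of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing column j on all the other columns. A minimal numpy sketch of this definition, with made-up toy matrices (not the Boston design matrix):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j of design matrix X
    (X is assumed to include an intercept column)."""
    y = X[:, j]
    Z = np.delete(X, j, axis=1)          # all other columns
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    tss = ((y - y.mean()) ** 2).sum()
    r2 = 1 - (resid @ resid) / tss
    return 1.0 / (1.0 - r2)

# Orthogonal predictors: no inflation, VIF is exactly 1
X_ortho = np.column_stack([np.ones(4), [1, -1, 1, -1], [1, 1, -1, -1]])

# Nearly collinear predictors: VIF blows up
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = x1 + np.array([0.01, -0.01, 0.01, -0.01])
X_collinear = np.column_stack([np.ones(4), x1, x2])
```

This mirrors why the loop above starts at index 1: the intercept column has no meaningful VIF of its own.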
Interaction Terms
- Including interaction terms relaxes the additive assumption of linear regression
- An interaction term lets the effect of one predictor depend on the value of another, which can improve predictions
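An interaction term is simply a product column in the design matrix, fitting y = b0 + b1·lstat + b2·age + b3·(lstat·age). A numpy sketch on synthetic data (the coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
lstat = rng.uniform(1, 40, n)
age = rng.uniform(10, 100, n)
# True model with an interaction: the slope on lstat changes with age
y = 30 - 1.0 * lstat - 0.05 * age + 0.004 * lstat * age

# Design matrix: intercept, main effects, and the product column
X = np.column_stack([np.ones(n), lstat, age, lstat * age])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Passing the tuple ('lstat', 'age') to MS() below builds exactly this kind of product column.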
X_inter = MS(['lstat', 'age', ('lstat', 'age')]).fit_transform(Boston)  # the tuple adds the interaction column
X_inter
model3 = sm.OLS(Y, X_inter)
results3 = model3.fit()
summarize(results3)
Non-Linear Transformations of the Predictors
- Extending the linear regression model using Polynomial Regression
X_poly = MS([poly('lstat', degree=2), 'age']).fit_transform(Boston)
X_poly
model4 = sm.OLS(Y, X_poly)
results4 = model4.fit()
summarize(results4)
- The near-zero p-value for the quadratic term suggests that it leads to an improved model
Using anova_lm() to quantify the difference in fit between the two models
anova_lm(results4, results2)
- anova_lm() performs a hypothesis test comparing the two models
- The null hypothesis: the quadratic term in the results4 model is not needed
- The alternative hypothesis: the quadratic model is superior to the simpler linear model
- The NaN in the first row means there is no previous model to compare against
- anova_lm() can compare more than two models at once
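The F-statistic that anova_lm() reports for two nested models can be computed by hand from their residual sums of squares: F = ((RSS_reduced - RSS_full)/q) / (RSS_full/(n - p_full)), where q is the number of extra parameters. A sketch on synthetic data (not the Boston fit above):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x + 3 * x**2 + rng.normal(0, 0.5, n)  # truly quadratic

X1 = np.column_stack([np.ones(n), x])         # reduced (linear) model
X2 = np.column_stack([np.ones(n), x, x**2])   # full (quadratic) model
rss1, rss2 = rss(X1, y), rss(X2, y)

q = 1                       # one extra parameter in the full model
p2 = X2.shape[1]
F = ((rss1 - rss2) / q) / (rss2 / (n - p2))   # large F => quadratic term matters
```

Since the data are truly quadratic, the F-statistic is very large, matching the near-zero p-value seen in the ANOVA table above.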
ax = subplots(figsize=(8, 8))[1]
ax.scatter(results4.fittedvalues, results4.resid)
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
ax.axhline(0, c='k', ls='--');
- Notice that the polynomial regression reduced the pattern in the residuals, which indicates a better-fitting model
Qualitative Predictors
Carseats = load_data('Carseats')
Carseats.columns
all_variables = list(Carseats.columns.drop('Sales'))
Y = Carseats['Sales']
model_variables = all_variables + [('Income', 'Advertising'), ('Price', 'Age')]
X = MS(model_variables).fit_transform(Carseats)
X
- The MS() transformer automatically encodes qualitative features using one-hot encoding
- Each category gets a vector where one element is set to 1 and all others are 0
- Example: Red (1,0,0), Blue (0,1,0), Green (0,0,1)
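The same one-hot coding can be seen with plain pandas. A sketch with a hypothetical column of shelving locations (MS() drops one baseline level to avoid collinearity with the intercept, which `drop_first=True` mimics):

```python
import pandas as pd

# Hypothetical qualitative column, analogous to ShelveLoc in Carseats
df = pd.DataFrame({'ShelveLoc': ['Bad', 'Good', 'Medium', 'Good']})

# One-hot encode; drop_first=True drops the first (alphabetical) level, 'Bad',
# so it becomes the baseline category absorbed by the intercept
dummies = pd.get_dummies(df['ShelveLoc'], drop_first=True)
```

This is why the summary below shows coefficients for ShelveLoc[Good] and ShelveLoc[Medium] but not for ShelveLoc[Bad]: they measure the effect relative to the baseline level.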
model = sm.OLS(Y,X)
results = model.fit()
summarize(results)
- The large positive coefficient for ShelveLoc[Good] indicates that a good shelving location increases sales
