K-Nearest Neighbors
- Now we apply the non-parametric K-Nearest Neighbors approach; the fitting process is the same as for LDA and QDA
# K=1 nearest-neighbor classifier on the Smarket training split
from sklearn.neighbors import KNeighborsClassifier
from ISLP import load_data, confusion_table

knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(X_train, L_train)
knn1_pred = knn1.predict(X_test)
confusion_table(knn1_pred, L_test)
- As expected, KNN with $K=1$ results in very poor predictions
- It predicted $126$ of the $252$ test observations correctly, around $50\%$
# accuracy from the confusion table vs. computed directly
(43+83)/252, np.mean(knn1_pred == L_test)
# refit with K=3
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X_train, L_train)
knn3_pred = knn3.predict(X_test)
confusion_table(knn3_pred, L_test)
np.mean(knn3_pred == L_test)
- KNN is not well suited to the Smarket data and does not perform well on it; a quick check over several values of $K$ is sketched below
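As a quick sanity check (a sketch, assuming the Smarket split `X_train`, `L_train`, `X_test`, `L_test` is still in scope), we can loop over a few values of $K$ and print the test accuracy, which stays near the $50\%$ base rate:
# test accuracy for a few values of K on the Smarket split
for K in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=K)
    acc = np.mean(knn.fit(X_train, L_train).predict(X_test) == L_test)
    print(f'K={K}: accuracy {acc:.1%}')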
- Loading the Caravan data set from the ISLP book library
Caravan = load_data('Caravan')
Caravan
Purchase = Caravan.Purchase
Purchase.value_counts()
len(Purchase)
feature_df = Caravan.drop(columns=['Purchase'])
feature_df
- Since the KNN classifier measures distances between observations, features measured on different scales (money and age, for example) contribute very unequally to those distances
- The two common feature scaling methods are normalization and standardization
- Here standardization is used, transforming each feature to have a mean of $0$ and a standard deviation of $1$; the raw scale differences are illustrated just below
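Before standardizing, a quick look at the raw standard deviations (a small sketch using the `feature_df` built above) makes the scale problem concrete:
# the largest-scale columns would dominate a raw Euclidean distance
feature_df.std().sort_values(ascending=False).head()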
# standardize every feature to mean 0 and standard deviation 1
from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler(with_mean=True, with_std=True, copy=True)
scaler.fit(feature_df)
x_std = scaler.transform(feature_df)
feature_std = pd.DataFrame(x_std, columns=feature_df.columns)
feature_std
feature_std.std()  # every column now has standard deviation close to 1
# hold out 1000 observations as a test set
from sklearn.model_selection import train_test_split

(X_train, X_test, y_train, y_test) = train_test_split(
    feature_std, Purchase, test_size=1000, random_state=0)
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1_pred = knn1.fit(X_train, y_train).predict(X_test)
# test error rate, fraction of actual buyers, and test accuracy
np.mean(y_test != knn1_pred), np.mean(y_test != "No"), np.mean(y_test == knn1_pred)
confusion_table(knn1_pred, y_test)
9/(53+9)  # fraction of predicted buyers who actually bought
- Among the $62$ customers predicted to buy, $9$ (about $14.5\%$) actually do, which is more than double the success rate of random guessing
- Only about $6\%$ of customers purchase insurance, so randomly guessing 'Yes' would be correct only about $6\%$ of the time (and always predicting 'No' would already achieve a $6\%$ error rate)
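To make the comparison explicit, both rates can be computed directly from the test labels (a small check reusing `knn1_pred` and `y_test` from above):
base_rate = np.mean(y_test == 'Yes')                     # fraction who actually buy
hit_rate = np.mean(y_test[knn1_pred == 'Yes'] == 'Yes')  # success among predicted buyers
base_rate, hit_rate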
Tuning Parameters
# try several values of K and report the hit rate among predicted buyers
for K in range(1, 8):
    knn = KNeighborsClassifier(n_neighbors=K)
    knn_pred = knn.fit(X_train, y_train).predict(X_test)
    C = confusion_table(knn_pred, y_test)
    tmpl = ('K={0:d}: # predicted to purchase: {1:>2}, '
            '# who did purchase: {2:d}, accuracy {3:.1%}')
    pred = C.loc['Yes'].sum()
    did_purchase = C.loc['Yes', 'Yes']
    print(tmpl.format(K, pred, did_purchase, did_purchase / pred))
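As an alternative to eyeballing the loop output, scikit-learn can pick $K$ by cross-validation. The sketch below (not from the book lab) scores each $K$ by the precision on the 'Yes' class, which is exactly the "accuracy among predicted purchasers" printed above:
# hedged sketch: cross-validated choice of K, scored by 'Yes'-class precision
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, make_scorer

yes_precision = make_scorer(precision_score, pos_label='Yes', zero_division=0)
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': range(1, 8)},
                    scoring=yes_precision,
                    cv=5)
grid.fit(X_train, y_train)
grid.best_params_, grid.best_score_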
# logistic regression with essentially no regularization (large C)
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(C=1e10, solver='liblinear')
logit.fit(X_train, y_train)
logit_pred = logit.predict_proba(X_test)
# default 0.5 cutoff on the probability of 'Yes'
logit_labels = np.where(logit_pred[:, 1] > 0.5, 'Yes', 'No')
confusion_table(logit_labels, y_test)
# lower the cutoff to 0.25 to flag more potential buyers
logit_labels = np.where(logit_pred[:, 1] > 0.25, 'Yes', 'No')
confusion_table(logit_labels, y_test)
9/(20+9)  # hit rate among predicted buyers at the 0.25 cutoff
- Logistic regression with the $0.25$ cutoff performs far better than random guessing
- About $31\%$ of the customers it flags actually buy, roughly five times the $6\%$ base rate
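The cutoff itself is a tuning knob worth scanning. A short sketch (not in the original lab, reusing `logit_pred` and `y_test` from above) reports, for a few cutoffs, how many customers are flagged and how often the flag is right:
for cutoff in (0.5, 0.25, 0.1):
    labels = np.where(logit_pred[:, 1] > cutoff, 'Yes', 'No')
    flagged = (labels == 'Yes').sum()
    hit_rate = np.mean(y_test[labels == 'Yes'] == 'Yes') if flagged else float('nan')
    print(f'cutoff={cutoff:.2f}: flagged {flagged}, hit rate {hit_rate:.1%}')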
