# K-Nearest Neighbors Classification

KNN is one of the simplest machine learning models. It is a **non-parametric** model, which means it makes no assumption about the underlying data distribution. It is also a *lazy learning algorithm*: it does not learn from the training set up front; instead, the learning is performed at prediction time. It can be used for both regression and classification problems.

The underlying logic behind KNN is a similarity measure. It measures the similarity between a new, uncategorized observation and the historical, categorized data, and assigns the new observation to the category of the observations it most resembles.

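KNN follows a very basic rule to classify observations: a new observation is assigned to the class that is most common among its *k* nearest neighbors in the training data.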

In order to measure similarity, distance metrics such as Euclidean distance, Minkowski distance, Jaccard coefficient and Gower’s distance are used, with Euclidean distance being the most popular one.
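
To make the idea of a distance metric concrete, here is a minimal NumPy sketch of the Minkowski distance, which reduces to the Euclidean distance for p = 2 and to the Manhattan distance for p = 1. The helper name `minkowski_distance` and the two sample vectors are illustrative assumptions, not part of the original walkthrough.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two feature vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# two made-up observations with three features each
o1 = [1.0, 2.0, 3.0]
o2 = [4.0, 6.0, 3.0]

d_euclid = minkowski_distance(o1, o2, p=2)     # sqrt(3^2 + 4^2 + 0^2)
d_manhattan = minkowski_distance(o1, o2, p=1)  # 3 + 4 + 0
print(d_euclid, d_manhattan)
```

This is also why `metric='minkowski'` in scikit-learn's `KNeighborsClassifier`, with its default `p=2`, behaves like the Euclidean metric.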

Consider two observations, *O₁* and *O₂*, with features *X₁₁, X₁₂, …, X₁ₙ* and *X₂₁, X₂₂, …, X₂ₙ* respectively. The Euclidean distance between them is

$$d(O_1, O_2) = \sqrt{\sum_{i=1}^{n} \left(X_{1i} - X_{2i}\right)^2}$$
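
Before turning to scikit-learn, the classification rule can be sketched from scratch: compute the Euclidean distance from the new point to every training point, take the *k* closest, and let them vote. The function `knn_predict` and the toy data below are made up for illustration; this is a simplified sketch, not how scikit-learn implements KNN internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points (illustrative helper)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training observation
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # majority vote over the labels of those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy data: two small clusters (made up for illustration)
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # class 0
         [6.0, 6.0], [7.0, 7.5], [6.5, 6.0]]   # class 1
y_toy = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(X_toy, y_toy, [1.2, 1.4], k=3)
pred_b = knn_predict(X_toy, y_toy, [6.4, 6.8], k=3)
print(pred_a, pred_b)
```

Note that no model is fitted anywhere: all the work happens inside `knn_predict`, which is what makes KNN a lazy learner.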

Let us write some code to see how it works. We’ll be using sklearn’s breast_cancer dataset for our purpose.

    # importing libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline  # Jupyter notebook magic; remove when running as a plain script

    # importing data from sklearn
    from sklearn.datasets import load_breast_cancer
    data = load_breast_cancer()

    # converting data into pandas dataframe
    canc_data = pd.DataFrame(data.data, columns=data.feature_names)

    # adding target field to the dataset
    canc_data['target'] = pd.Series(data.target)

    # creating X and y
    X_feature = list(canc_data.columns)
    X_feature.remove('target')
    X = canc_data[X_feature]
    y = canc_data['target']

    # splitting data for training and testing
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)

Now that we have the dataset ready, we can apply the KNN algorithm to it.

    # importing KNN from sklearn
    from sklearn.neighbors import KNeighborsClassifier
    knn_clf = KNeighborsClassifier(n_neighbors=5, metric='minkowski').fit(X_train, y_train)

    # predicting for our test data
    y_pred = knn_clf.predict(X_test)

    # generating classification report
    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred))

    # importing libraries to check model performance
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import log_loss
    from sklearn import metrics

    print("Accuracy score on test: ", round(knn_clf.score(X_test, y_test), 3))
    print("Accuracy score on train: ", round(knn_clf.score(X_train, y_train), 3))

    # printing log loss for the model
    # (log loss is computed from predicted probabilities, not from hard class labels)
    print('log_loss : ', log_loss(y_test, knn_clf.predict_proba(X_test)))

    # let's find the ROC AUC score
    # before we calculate roc_auc_score(), we need the predicted probabilities for test data
    pred_prob = pd.DataFrame(knn_clf.predict_proba(X_test))

    # we'll also add the actual label
    test_result = pd.DataFrame({'actual': y_test})
    test_result = test_result.reset_index()
    test_result['prob_0'] = pred_prob.iloc[:, 0]
    test_result['prob_1'] = pred_prob.iloc[:, 1]

    # to calculate ROC AUC score we pass the actual class labels and the predicted probability of class 1
    auc_score = round(metrics.roc_auc_score(test_result.actual, test_result.prob_1), 3)
    print("AUC Score : ", auc_score)

    # generating confusion matrix
    cf_matrix = confusion_matrix(y_test, y_pred)
    sns.heatmap(cf_matrix, annot=True, cmap='Blues')
    plt.ylabel("True Label")
    plt.xlabel("Predicted Label")
    plt.show()

We can also perform hyperparameter tuning to search for optimal parameters. Using *GridSearchCV* from sklearn, we can search over a list of candidate hyperparameter values and pick the combination that scores best.

*GridSearchCV* takes the following parameters:

- **estimator**: the ML model to tune.
- **param_grid**: a dictionary with parameter names as keys and lists of candidate values as values.
- **scoring**: the evaluation metric, e.g. *'r2'* for regression, or *'recall'*, *'precision'*, etc. for classification.
- **cv**: the number of folds for k-fold cross-validation. Each combination of values is evaluated with k-fold cross-validation, so its final score is the average of the scores across all folds.

    # hyperparameter tuning using grid search
    from sklearn.model_selection import GridSearchCV

    # we'll create a dictionary with possible hyperparameter values
    param_val = [{'n_neighbors' : range(3,10), 'metric' : ['euclidean', 'minkowski', 'canberra']}]

    # grid search configuration
    clfr = GridSearchCV(KNeighborsClassifier(), param_val, cv = 10, scoring = 'roc_auc')

    # fitting into our data
    clfr.fit(X_train, y_train)

Once we have run *GridSearchCV*, we can look at the best score and the best parameters as follows:

    # we'll look at the best score and parameters
    print(clfr.best_score_)
    print(clfr.best_params_)
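
The best cross-validation score is computed on the training folds only, so it can be worth checking the tuned model on the held-out test set as well. The sketch below is self-contained and rebuilds the same split as the walkthrough; the smaller parameter grid and the variable names `best_knn` and `test_auc` are assumptions made to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# same dataset, test size and random_state as in the walkthrough
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

# a smaller grid than the one in the walkthrough, to keep the sketch quick (assumption)
param_val = {'n_neighbors': list(range(3, 10)), 'metric': ['euclidean', 'canberra']}
clfr = GridSearchCV(KNeighborsClassifier(), param_val, cv=10, scoring='roc_auc')
clfr.fit(X_train, y_train)

# with the default refit=True, best_estimator_ is refit on all of X_train
best_knn = clfr.best_estimator_
test_auc = roc_auc_score(y_test, best_knn.predict_proba(X_test)[:, 1])

print("Best params :", clfr.best_params_)
print("CV AUC      :", round(clfr.best_score_, 3))
print("Test AUC    :", round(test_auc, 3))
```

If the test AUC is far below the cross-validated AUC, that is a hint the chosen hyperparameters may be overfitting the training folds.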

We can also plot the accuracy to see how the model's performance varies with different values of *k*.

    # creating empty list variable
    acc = []

    # running KNN algorithm for 3 to 50 nearest neighbours (odd numbers) and storing the accuracy values
    for i in range(3, 50, 2):
        neigh = KNeighborsClassifier(n_neighbors=i)
        neigh.fit(X_train, y_train)
        train_acc = np.mean(neigh.predict(X_train) == y_train)
        test_acc = np.mean(neigh.predict(X_test) == y_test)
        acc.append([train_acc, test_acc])

    import matplotlib.pyplot as plt  # library to do visualizations

    # train accuracy plot (red) and test accuracy plot (blue)
    plt.plot(np.arange(3, 50, 2), [i[0] for i in acc], "ro-", label="train")
    plt.plot(np.arange(3, 50, 2), [i[1] for i in acc], "bo-", label="test")
    plt.xlabel("k")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()

For further information about KNN, please refer to the below links.