K-Nearest Neighbors Classification

KNN is one of the simplest machine learning models. It is non-parametric, which means it makes no assumptions about the underlying data distribution. It is also a lazy learning algorithm: it does not build a model from the training set up front, and instead defers the work to prediction time. It can be used for both regression and classification problems.

The logic behind KNN is a similarity measure. The algorithm measures how similar a new, uncategorized observation is to the historical, categorized observations, and assigns the new observation to the category of the observations it is most similar to.

Fig 1 : Pictorial Representation of KNN

KNN follows a very basic rule to classify observations.

Fig 2 : Steps of KNN classification
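Since the figure is not reproduced here, the following is a minimal sketch of the usual KNN procedure: compute the distance from the new point to every training point, keep the k closest, and take a majority vote. The function name knn_predict and the toy data are made up for illustration and are not part of sklearn.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote over the labels of those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups labelled "A" and "B"
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 5.5], [6.5, 7.0]]
y_toy = ["A", "A", "A", "B", "B", "B"]

pred_low = knn_predict(X_toy, y_toy, [1.2, 1.4], k=3)
pred_high = knn_predict(X_toy, y_toy, [6.4, 6.1], k=3)
print(pred_low, pred_high)
```

Note that nothing is "trained" here: all the work happens inside knn_predict at prediction time, which is what makes KNN a lazy learner.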

To measure similarity, distance metrics such as Euclidean distance, Minkowski distance, the Jaccard coefficient and Gower’s distance are used, with Euclidean distance being the most popular.
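As a small illustration of how two of these metrics relate, the sketch below implements the Minkowski distance with NumPy. The helper minkowski_distance and the sample vectors are assumptions made for this example. With order p=2 it reduces to Euclidean distance, and with p=1 to Manhattan distance.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two 1-D feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


# Two hypothetical observations with three features each
o1 = [1.0, 2.0, 3.0]
o2 = [4.0, 6.0, 3.0]

euclidean = minkowski_distance(o1, o2, p=2)  # sqrt(3^2 + 4^2 + 0^2)
manhattan = minkowski_distance(o1, o2, p=1)  # 3 + 4 + 0
print(euclidean, manhattan)
```

This is also why sklearn's KNeighborsClassifier, whose defaults are metric='minkowski' and p=2, behaves like a Euclidean-distance KNN out of the box.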

Consider two observations, O1 and O2, with features X₁₁, X₁₂, …, X₁ₙ and X₂₁, X₂₂, …, X₂ₙ respectively. The Euclidean distance between them is calculated as:

Fig 3 : Euclidean distance between two observations
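The figure itself is not reproduced here. Written out in the feature notation used above, the standard Euclidean distance between the two observations is:

```latex
d(O_1, O_2) = \sqrt{\sum_{i=1}^{n} \left( X_{1i} - X_{2i} \right)^{2}}
```

Here n is the number of features, and X_{1i} and X_{2i} are the values of the i-th feature for O1 and O2.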

Let us write some code to see how it works. We’ll be using sklearn’s breast_cancer dataset for our purpose.

#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#notebook magic to render plots inline; remove this line when running as a plain Python script
%matplotlib inline
#importing data from sklearn
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
#converting data into pandas dataframe
canc_data = pd.DataFrame(data.data, columns=data.feature_names)
#adding target field to the dataset
canc_data['target'] = pd.Series(data.target)
#creating X and y
X_feature = list(canc_data.columns)
X_feature.remove('target')
X = canc_data[X_feature]
y = canc_data['target']
#splitting data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)

Now that we have the dataset ready, we can apply the KNN algorithm to it.

#importing KNN from sklearn
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=5, metric='minkowski').fit(X_train, y_train)
#predicting for our test data
y_pred = knn_clf.predict(X_test)
#generating classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
#importing libraries to check model performance
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
from sklearn import metrics
print("Accuracy score on test: " , round((knn_clf.score(X_test, y_test)),3))
print("Accuracy score on train: ", round((knn_clf.score(X_train, y_train)),3))
#printing log loss for the model
#log loss is defined on predicted probabilities, so we pass predict_proba output rather than hard labels
print('log_loss : ', log_loss(y_test, knn_clf.predict_proba(X_test)))
#let us find the ROC AUC score
#before we calculate roc_auc_score(), we need the predicted probabilities for the test data
pred_prob = pd.DataFrame(knn_clf.predict_proba(X_test))
#we'll also add the actual label
test_result = pd.DataFrame({'actual' : y_test})
test_result = test_result.reset_index()
test_result['prob_0'] = pred_prob.iloc[:, 0]
test_result['prob_1'] = pred_prob.iloc[:, 1]
#to calculate ROC AUC score we pass the actual class labels and the predicted probability of class 1
auc_score = round(metrics.roc_auc_score(test_result.actual, test_result.prob_1), 3)
print("AUC Score : ", auc_score)
#generating confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.show()
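It may help to see where KNN's predicted probabilities come from. With the default uniform weights, the probability of a class is simply the fraction of the k nearest neighbours that belong to it. The sketch below checks this against predict_proba. It reloads the data so that it runs on its own, and the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same dataset and split settings as above, loaded as NumPy arrays
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

k = 5
knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

# Indices (into X_train) of the k nearest training points for each test row
_, neighbor_idx = knn.kneighbors(X_test)

# Labels are 0/1, so the row-wise mean is the fraction of neighbours in class 1
manual_prob_1 = y_train[neighbor_idx].mean(axis=1)

sklearn_prob_1 = knn.predict_proba(X_test)[:, 1]
probs_match = np.allclose(manual_prob_1, sklearn_prob_1)
print(probs_match)
```

One consequence is that with k=5 the probabilities can only take the values 0, 0.2, 0.4, 0.6, 0.8 and 1, so probability-based scores such as log loss and ROC AUC are fairly coarse for small k.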

We can also perform hyperparameter tuning to search for optimal parameters. Using GridSearchCV from sklearn, we can search over a list of hyperparameter values for the combination that gives the best model performance.

GridSearchCV takes the following parameters:

  1. estimator : the ML model to tune
  2. param_grid : a dictionary with parameter names as keys and lists of parameter values as values
  3. scoring : the evaluation metric, e.g. ‘r2’ for regression; ‘recall’, ‘precision’, etc. for classification
  4. cv : the number of folds for k-fold cross-validation (each combination of values is evaluated by k-fold cross-validation, so its final score is the average of the scores from all folds)

#hyperparameter tuning using grid search
from sklearn.model_selection import GridSearchCV
#we'll create a dictionary with possible hyperparameter values
param_val = [{'n_neighbors' : range(3,10), 'metric' : ['euclidean', 'minkowski', 'canberra']}]
#grid search configuration
clfr = GridSearchCV(KNeighborsClassifier(), param_val, cv = 10, scoring = 'roc_auc')
#fitting into our data
clfr.fit(X_train, y_train)

Once we have run GridSearchCV, we can look at the best score and parameters as follows:

#printing the best cross-validated score and the parameters that produced it
print(clfr.best_score_)
print(clfr.best_params_)
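To make the "average over all folds" idea concrete, here is a sketch that scores a single parameter combination by hand with cross_val_score and compares the mean with what GridSearchCV reports for the same combination. It reloads the data so that it runs on its own. The chosen combination (n_neighbors=5, metric='euclidean') is just an example and not necessarily the best one.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# Score one candidate by hand: 10 folds, one ROC AUC value per fold
candidate = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
fold_scores = cross_val_score(candidate, X_train, y_train, cv=10, scoring="roc_auc")
manual_mean = fold_scores.mean()

# Let GridSearchCV evaluate the same single candidate
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [5], "metric": ["euclidean"]},
    cv=10,
    scoring="roc_auc",
)
search.fit(X_train, y_train)
grid_mean = search.cv_results_["mean_test_score"][0]

print(round(manual_mean, 4), round(grid_mean, 4))
```

GridSearchCV repeats this for every combination in param_grid and keeps the one with the highest mean score, which is what best_score_ and best_params_ report.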

We can also use an accuracy plot to see how the accuracy of the model varies with different values of k.

# creating empty list variable
acc = []
# running KNN algorithm for 3 to 50 nearest neighbours (odd numbers) and storing the accuracy values
for i in range(3,50,2):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    train_acc = np.mean(neigh.predict(X_train) == y_train)
    test_acc = np.mean(neigh.predict(X_test) == y_test)
    acc.append([train_acc, test_acc])
import matplotlib.pyplot as plt # library to do visualizations
# train accuracy plot
plt.plot(np.arange(3,50,2),[i[0] for i in acc],"ro-")
# test accuracy plot (computed in the loop above)
plt.plot(np.arange(3,50,2),[i[1] for i in acc],"bo-")
plt.legend(["train", "test"])
plt.show()

Fig 4 : Accuracy Plot


Data Scientist (Learning everyday to learn more) https://www.linkedin.com/in/ankitgupta005/

Ankit Gupta
