Binary Classification Through Logistic Regression — Analytics Mag
The primary objective of a classification model is to predict the probability that an observation belongs to a particular class, also called the class probability. For instance, a credit card company would like to classify its list of potential customers into those who will repay on time and those who will not.
In any scenario where we want to determine whether an observation falls into one class or another, our approach is to generate a class probability and compare it against a threshold. Anything above the threshold is assigned to one class, and anything below it to the other.
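The thresholding step can be sketched in a few lines of Python. This is a minimal illustration, not part of the sklearn workflow: the `classify` helper and the probability values are made up for the example, and treating a probability exactly equal to the threshold as positive is an assumption.

```python
def classify(probabilities, threshold=0.5):
    """Map each class probability to 1 (positive) or 0 (negative).

    Assumption: a probability exactly equal to the threshold counts as positive.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical repayment probabilities for four customers (illustrative values only).
repay_probabilities = [0.92, 0.35, 0.50, 0.08]

labels_default = classify(repay_probabilities)       # default threshold of 0.5
labels_strict = classify(repay_probabilities, 0.9)   # stricter threshold

print(labels_default)
print(labels_strict)
```

Raising the threshold makes the positive class harder to reach, so the same probabilities can yield different labels depending on the cut-off chosen.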
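Logistic regression predicts the class probability for each observation and maps it to a category. For binary logistic regression, assume we have a positive outcome (Y = 1) and a negative outcome (Y = 0). The probability that an observation belongs to the positive class, P(Y = 1), is given by
$$P(Y = 1) = \frac{e^{z}}{1 + e^{z}}$$
where,
$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
Here, x₁, x₂, …, xₙ are the independent variables and β₀, β₁, …, βₙ are the coefficients to be estimated.
Combining the two, the model can also be written as,
$$\ln\left(\frac{P(Y = 1)}{1 - P(Y = 1)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$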
The left side of the equation is the logit function, also known as the natural log of the odds, whereas the right-hand side is a linear function of the independent variables.
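A short sketch can make the relationship between the two forms concrete: applying the logistic function to the linear part gives a probability, and taking the log of the odds of that probability recovers the linear part. The coefficient names and all numeric values below are assumptions chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Natural log of the odds p / (1 - p); the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients (intercept first) and one observation with two features.
beta = [-1.0, 0.8, 0.5]
x = [2.0, 1.0]

# Linear part: intercept plus the weighted sum of the features.
z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

# Class probability P(Y = 1) for this observation.
p = sigmoid(z)

print("linear part z:", z)
print("P(Y = 1):", round(p, 4))
print("logit of P(Y = 1):", round(logit(p), 4))
```

The last printed value equals the linear part again, which is why the model is linear in the log-odds even though it is not linear in the probability.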
The logistic function generates an S-shaped curve, also known as the sigmoid function. Consider the example of predicting whether a student passes an exam from the number of hours studied. Our objective is to find the probability of passing as a function of hours studied, and a logistic function can be fit to describe how that probability changes with hours of study.
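To illustrate the hours-studied example, here is a minimal sketch that fits a one-variable logistic function with plain gradient descent on the log loss. The data is invented for the illustration, the function names are hypothetical, and gradient descent is used only because it is easy to read; it is not the solver sklearn uses.

```python
import math

# Made-up toy data: hours studied and whether the student passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, learning_rate=0.1, n_iter=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by gradient descent on the mean log loss."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        # Prediction error for every observation under the current coefficients.
        errors = [sigmoid(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        grad_b0 = sum(errors) / n
        grad_b1 = sum(e * xi for e, xi in zip(errors, x)) / n
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


b0, b1 = fit_logistic(hours, passed)


def pass_probability(h):
    """Estimated probability of passing after h hours of study."""
    return sigmoid(b0 + b1 * h)


for h in (1.0, 3.0, 5.0):
    print(f"{h} hours -> P(pass) = {pass_probability(h):.2f}")
```

The fitted slope is positive, so the estimated probability of passing rises along the S-shaped curve as hours of study increase.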
We’ll quickly perform some analysis to see how sklearn's LogisticRegression works. For a detailed understanding, please refer to the link below.
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#importing data from sklearn
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
#converting data into pandas dataframe
canc_data = pd.DataFrame(data.data, columns=data.feature_names)
#adding target field to the dataset
canc_data['target'] = pd.Series(data.target)
#creating X and y
X_feature = list(canc_data.columns)
X_feature.remove('target')
X = canc_data[X_feature]
y = canc_data['target']
#splitting data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)
#logistic regression model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, max_iter=3000).fit(X_train, y_train)
#predicting for our test data
y_pred = clf.predict(X_test)
#generating classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
#importing libraries to check model performance
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
from sklearn import metrics
print("Accuracy score on test: " , round((clf.score(X_test, y_test)),3))
print("Accuracy score on train: ", round((clf.score(X_train, y_train)),3))
#printing log loss for the model (log loss expects predicted probabilities, not hard 0/1 labels)
print('log_loss : ', log_loss(y_test, clf.predict_proba(X_test)))
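Log loss is defined on predicted probabilities rather than on hard 0/1 labels. The sketch below computes binary log loss by hand to show how the two inputs differ; the `binary_log_loss` helper, the labels, and the probabilities are all hypothetical, and the clipping constant is an assumption made to avoid taking the log of zero.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so that log(0) never occurs (eps is an assumed safety margin).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.6, 0.4]

# Hard labels obtained by thresholding the same probabilities at 0.5.
hard_labels = [1 if p >= 0.5 else 0 for p in probs]

loss_from_probs = binary_log_loss(y_true, probs)
loss_from_labels = binary_log_loss(y_true, hard_labels)

print("log loss from probabilities:", round(loss_from_probs, 4))
print("log loss from hard labels:  ", round(loss_from_labels, 4))
```

With hard labels, every misclassified observation is treated as a fully confident mistake, so the loss is dominated by the clipping constant instead of reflecting how well calibrated the probabilities are.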
#generating confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.show()
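The numbers in the classification report can be derived from the confusion matrix. The sketch below assumes sklearn's layout, where rows are true labels and columns are predicted labels, so a binary matrix reads [[TN, FP], [FN, TP]]. The counts are made up for the illustration and are not the results of the model above.

```python
# Hypothetical confusion matrix in sklearn's layout: [[TN, FP], [FN, TP]].
example_matrix = [[40, 10],
                  [5, 45]]

tn, fp = example_matrix[0]
fn, tp = example_matrix[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", round(accuracy, 3))
print("precision:", round(precision, 3))
print("recall   :", round(recall, 3))
print("f1-score :", round(f1, 3))
```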
#let's find the ROC AUC score
#before we calculate roc_auc_score(), we need the predicted probabilities for the test data
pred_prob = pd.DataFrame(clf.predict_proba(X_test))
pred_prob.head()
#we'll also add the actual label
test_result = pd.DataFrame( { 'actual' : y_test})
test_result = test_result.reset_index()
test_result['prob_0'] = pred_prob.iloc[:,0]
test_result['prob_1'] = pred_prob.iloc[:,1]
test_result.head()
#to calculate the ROC AUC score we pass the actual class labels and the predicted probability of class 1
auc_score = round(metrics.roc_auc_score(test_result.actual, test_result.prob_1),3)
auc_score
So we have an AUC of 0.996. With a score this high, it is possible that our model is overfitting, so as a next step we'll apply regularization to aim for an optimally fitted model.
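To make the AUC figure easier to interpret: it can be read as the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative one. The sketch below computes it that way on a tiny made-up example; the `pairwise_auc` helper is hypothetical and far slower than `roc_auc_score`, so it is meant only as an illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities of the positive class.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]

example_auc = pairwise_auc(example_labels, example_scores)
print("AUC:", example_auc)
```

Three of the four positive-negative pairs are ordered correctly here, so the example AUC is 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.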
Originally published at https://analyticsmag.com on January 8, 2022.