Linear Regression: A Linear Quest for Relationships

Ankit Gupta

Linear regression is a supervised machine learning technique used to determine the relationship between dependent and independent variables. It is useful for predicting continuous numerical values, for example the price of a house or the sales value of a product. Linear regression tries to discover the strength and form of the relationship between the dependent and independent variables.

Consider linear regression as a baking recipe. When we bake a one-pound cake, we have the exact quantities of the ingredients required. In that case, the ingredients are the features and the quantities are the coefficients. Similarly, in linear regression we’re given the ingredients (features) and details about the final output (target), so we just need to figure out the quantities (the regression coefficients).

First, let’s see the (LINE) assumptions for a linear regression model:

Linearity : The model is linear in the regression parameters; in other words, the relationship between X and the mean of Y is linear.

Independence : The independent variables are not highly correlated (multicollinear), and the observations are independent of each other. The errors should also be independent of each other; otherwise we face the problem of autocorrelation.

Normality : The errors (residuals) follow a normal distribution, and the expected value (mean) of the errors is zero.

Equal Variance : The errors have constant variance. This is known as homoscedasticity.

If the model violates any of these assumptions, it would be prudent to either transform the data or select a model better suited to non-linear data.

We’ll quickly see how to perform linear regression using sklearn. The data we are going to use is the California Housing dataset, which can be imported directly from sklearn. For the complete code, please follow this link:

Fig 1 : California Housing Dataset Description
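The exact loading code isn’t reproduced here; a minimal sketch of pulling the dataset from sklearn could look like the snippet below. Renaming sklearn’s MedHouseVal target column to median_house_value is my own choice, made only to match the column name used later in this article.

#loading the California Housing dataset as a pandas DataFrame
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing(as_frame=True)
california_housing_data = california.frame
#sklearn names the target 'MedHouseVal'; renaming it to match this article
california_housing_data = california_housing_data.rename(columns={'MedHouseVal': 'median_house_value'})
print(california_housing_data.describe())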

We’ll split the data into training and test sets before performing EDA. It’s always wise not to look into the test set, or to perform feature engineering with the test data included.

#importing train test split
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(california_housing_data, test_size=0.25, random_state=42)
Fig 2 : Function to check outliers
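The outlier-checking function appears only as a screenshot in Fig 2. A minimal sketch of such a helper, counting values outside the 1.5 × IQR fences for each numeric column (the name count_outliers is hypothetical), could look like this:

#counting values outside the 1.5*IQR fences for each numeric column
import pandas as pd
def count_outliers(df):
    counts = {}
    for col in df.select_dtypes(include='number').columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        counts[col] = int(((df[col] < lower) | (df[col] > upper)).sum())
    return pd.DataFrame.from_dict(counts, orient='index', columns=['outlier_count'])
outlier_table = count_outliers(train_set)
print(outlier_table)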

Once we call this function on our training data, we get a table representing the outlier count for each feature.

Fig 3 : Table representing outlier count

After some further pre-processing of our data, we’ll fit it to our regression model.
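The pre-processing steps themselves aren’t shown here. At a minimum, the features and target need to be separated before fitting; a minimal sketch, assuming median_house_value is the target column (the name used later in this article):

#separating features and target (assumed to be 'median_house_value')
X_train = train_set.drop('median_house_value', axis=1)
y_train = train_set['median_house_value']
X_test = test_set.drop('median_house_value', axis=1)
y_test = test_set['median_house_value']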

#importing linear regression
from sklearn.linear_model import LinearRegression
#importing evaluation metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
#fitting the model and generating predictions for train and test sets
linreg = LinearRegression().fit(X_train, y_train)
y_pred_train = linreg.predict(X_train)
y_pred_test = linreg.predict(X_test)

Now that we have our model and prediction, let us evaluate the output.

#importing matplotlib for the plot below
import matplotlib.pyplot as plt
#evaluation metrics for training data
print('Training Data Evaluation Metrics')
print("MAE (Mean Absolute Error) :", mean_absolute_error(y_train, y_pred_train))
print("MSE (Mean Squared Error) :", mean_squared_error(y_train, y_pred_train))
print("RMSE (Root Mean Squared Error) :", mean_squared_error(y_train, y_pred_train, squared=False))
print("R2 (R-squared Score) :", r2_score(y_train, y_pred_train))
print("")
#evaluation metrics for test data
print('Test Data Evaluation Metrics')
print("MAE (Mean Absolute Error) :", mean_absolute_error(y_test, y_pred_test))
print("MSE (Mean Squared Error) :", mean_squared_error(y_test, y_pred_test))
print("RMSE (Root Mean Squared Error) :", mean_squared_error(y_test, y_pred_test, squared=False))
print("R2 (R-squared Score) :", r2_score(y_test, y_pred_test))
print("")
#plotting predicted vs actual values with the identity (perfect-prediction) line
plt.figure(figsize=(8, 5))
plt.scatter(y_test, y_pred_test, c='lightblue')
p1 = max(max(y_pred_test), max(y_test))
p2 = min(min(y_pred_test), min(y_test))
plt.plot([p1, p2], [p1, p2], 'k-')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.show()
Fig 4 : Output for Linear Regression Model

Using sklearn’s linreg.coef_ we can get the coefficient values that our model has learned. We’ll visualize them in the plot below.
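A rough sketch of one way to produce such a plot from the coefficients, assuming X_train is a pandas DataFrame, might be:

#plotting the learned coefficient for each feature
import pandas as pd
import matplotlib.pyplot as plt
coefs = pd.Series(linreg.coef_, index=X_train.columns).sort_values()
coefs.plot(kind='barh', figsize=(8, 5))
plt.xlabel('Coefficient value')
plt.title('Effect of different features')
plt.show()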

Fig 5 : Effect of different features

Our output is not great. There could be multiple reasons for this, and it is our job to figure them out. In this article, we’ll limit ourselves to validating the linear regression model assumptions.

Validation of Model Assumptions

Assumption 1: The independent variables and the target (dependent) variable must have a linear relationship.

#importing library
import seaborn as sns
#setting theme
sns.set_theme(style="darkgrid")
#plotting pairplot
sns.pairplot(train_set)
plt.show()
Fig 6 : Pair Plot to visually analyze the relationship between features and target variable

As we can see, the features don’t have a linear relationship with the target (median_house_value), which violates the assumption that the dependent and independent variables must be linearly related.

In order to fix non-linearity, we need to apply some transformation to the independent variables (such as a log transformation or another non-linear transformation), as sketched below.
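As a rough illustration (the choice of column is just an example; the right transformation depends on each feature’s actual distribution):

#illustration: log-transforming a right-skewed feature
import numpy as np
#Population is used here purely as an example of a skewed column
train_set['log_population'] = np.log1p(train_set['Population'])
test_set['log_population'] = np.log1p(test_set['Population'])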

Assumption 2 : Multicollinearity

When the dataset has multiple independent variables, there is a possibility that some of those features are highly correlated. The presence of high correlation between independent variables is known as multicollinearity, and it can destabilize the model.

In order to identify the existence of multicollinearity, we use a measurement known as the Variance Inflation Factor (VIF). The VIF of a feature is the ratio of the variance of its coefficient in the model with all variables to the variance of its coefficient in a model containing that variable alone.
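One way to compute VIF values, sketched here with statsmodels (adding a constant column so the intercept is accounted for):

#computing VIF for each feature with statsmodels
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_vif = sm.add_constant(X_train)
vif = pd.DataFrame({'feature': X_vif.columns,
                    'VIF': [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]})
print(vif)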

In our case the VIF values are extremely high, indicating very strong correlation among the feature variables.

Assumption 3: Normality of Residuals

Now we’ll validate the assumption of normality of residuals. This can be checked using a probability-probability plot (P-P plot), which compares the cumulative distribution functions of two probability distributions against each other. In our case, we’ll check whether the distribution of the residuals matches a normal distribution.

#finding residual
res = (y_test - y_pred_test)
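A P-P plot of the residuals against a fitted normal distribution can be drawn with statsmodels; this is just one sketch (a histogram or Q-Q plot of the residuals works as well):

#P-P plot of residuals against a fitted normal distribution
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
pp = ProbPlot(np.asarray(res), fit=True)
pp.ppplot(line='45')
plt.title('P-P plot of residuals')
plt.show()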

Assumption 4: Homoscedasticity

Homoscedasticity, or constant variance of the errors, can be checked with a residual plot (a plot of the standardized residuals against the standardized predicted values).
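A quick sketch of such a residual plot, using the raw (unstandardized) test-set residuals computed above for simplicity:

#residuals vs fitted values to check for constant variance
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.scatter(y_pred_test, res, c='lightblue')
plt.axhline(y=0, color='k', linestyle='--')
plt.xlabel('Fitted (predicted) values')
plt.ylabel('Residuals')
plt.show()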

We can see a funnel shape in the residual vs fitted plot, which indicates heteroscedasticity, so we may need to perform non-linear transformations.

This was a brief overview of linear regression models. There’s a lot more to them, from feature transformation to selecting the correct parameters. And sometimes a linear regression model simply isn’t a great fit for our requirement, in which case we have other regression models to choose from.

