SVM : The Trickery of Kernel
SVM is one of the simplest and coolest ML algorithms. For non-linear problems, it quite literally follows the advice to "look at a problem from a different perspective".
"Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection." : sklearn documentation
If we think back to our school geometry class, we studied the coordinate system. Any point in two dimensions can be represented with two coordinates, x & y. Similarly, in three dimensions it can be represented by x, y & z.
Continuing with this intuition, we can plot an n-dimensional point using its values along each of the n dimensions.
Imagine our dataset as rows and columns. Each row represents one data point, or vector, with the value of each feature representing a particular coordinate. The data points closest to the hyperplane, which determine its position, are called Support Vectors.
The idea behind SVM is to generate the best decision boundary that segregates our data points into classes or categories. This best decision boundary is called a hyperplane. The goal is to maximize the distance between the hyperplane and the nearest data points; this distance is known as the Margin.
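To make this concrete, here is a minimal sketch (the toy blobs dataset and the parameter values are assumptions for illustration, not part of the original example) that fits a linear SVM and reads back both the support vectors and the margin width, which for a linear kernel equals 2 divided by the length of the weight vector w :
# a minimal sketch : fit a linear SVM on toy 2D data and inspect the margin and support vectors
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
# two well-separated blobs of points (hypothetical toy data)
X, y = make_blobs(n_samples=60, centers=2, random_state=7)
clf = SVC(kernel="linear", C=1).fit(X, y)
w = clf.coef_[0]                # weights of the hyperplane w.x + b = 0
print("margin width :", 2 / np.linalg.norm(w))
print("support vectors :\n", clf.support_vectors_)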
Consider the example of a classification problem as represented in the image below :
If we are asked to select the line that classifies the green stars and blue circles with the greatest accuracy, intuitively we will select the blue line. Why? Because our mind tells us that the blue line is close to neither the green stars nor the blue circles, so it is the line at an optimum distance from both groups. The blue line represents our hyperplane, and the two-headed arrow represents the margin.
Okay, so it was quite easy to get a linear hyperplane to separate two classes. But what about the following distribution?
As we can see, we cannot just draw a line in the image to segregate the two classes. Or, in terms of hyperplanes, we can't place a plane between these two classes to separate them.
So how should we classify?
Imagine you're lost in a maze, stuck for hours with no way out in sight. Now what if you could get a top view of the maze? You could easily figure out a way out. Something similar happens with SVM for non-linearly separable classes.
And here comes the Kernel trick. A trick so intuitive and smart that it makes us appreciate the ingenuity of mathematicians and researchers.
A kernel is a transformation function, i.e. it transforms the training data so that a non-linear decision surface becomes a linearly separable problem in a higher dimension. So SVM, through the kernel trick, effectively adds additional features to increase the dimension and find the separating boundary (in fact, the kernel lets SVM work in that higher-dimensional space without ever computing the new features explicitly).
Now that we have an additional dimension, the decision plane separating the classes can easily be seen.
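Here is a rough sketch of that idea in code (the concentric-circles data and the particular extra feature z = x² + y² are illustrative assumptions): a linear SVM fails on the raw 2D points, succeeds once the extra dimension is added, and the rbf kernel achieves a similar lift without us adding the feature by hand :
# a minimal sketch : an explicit extra feature makes concentric circles linearly separable
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
# a linear SVM on the raw 2D points struggles
print("linear, 2D :", SVC(kernel="linear").fit(X, y).score(X, y))
# add a third dimension z = x^2 + y^2 (one hand-crafted feature map)
X3 = np.c_[X, (X ** 2).sum(axis=1)]
print("linear, 3D :", SVC(kernel="linear").fit(X3, y).score(X3, y))
# the rbf kernel performs a similar lift implicitly
print("rbf, 2D    :", SVC(kernel="rbf").fit(X, y).score(X, y))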
It's very simple to model our data using sklearn. We'll try to classify the breast cancer dataset, which can be imported directly from sklearn. The full code can be seen at :
#importing data from sklearn
import pandas as pd
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
#converting data into a pandas dataframe
canc_data = pd.DataFrame(data.data, columns=data.feature_names)
#adding the target field to the dataset
canc_data['target'] = pd.Series(data.target)
#splitting into train and test sets so that X_train and y_train exist
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    canc_data.drop('target', axis=1), canc_data['target'], random_state=0)
#support vector classification model
from sklearn.svm import SVC
model_svclinear = SVC(kernel="linear", C=1, gamma='auto',
                      probability=False).fit(X_train, y_train)
As can be seen from the code above, we have four parameters, viz. kernel, C, gamma and probability. We'll discuss each to see its importance.
kernel : It specifies the kernel type to be used in the algorithm. By default it takes the value "rbf". We can select the kernel as per our requirement, since the kind of kernel decides the transformation applied when searching for the decision boundary plane.
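As a quick, illustrative check (reusing X_train and y_train from the code above; the list of kernels and the use of 5-fold cross-validation are just assumptions for the sketch), we can compare a few kernel choices and see that the choice matters :
# a minimal sketch : comparing a few kernel choices with cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
for k in ["linear", "poly", "rbf", "sigmoid"]:
    scores = cross_val_score(SVC(kernel=k, gamma="auto"), X_train, y_train, cv=5)
    print(k, ": mean CV accuracy =", round(scores.mean(), 3))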
C : It is the regularization parameter, i.e. it penalizes the model and helps avoid overfitting. The strength of the regularization is inversely proportional to C.
gamma : It defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'. It can be seen as the inverse of the radius of influence of the samples selected by the model as support vectors.
Learn more about C and gamma in the official sklearn documentation:
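Since the right values of C and gamma are rarely obvious up front, one common approach is a small cross-validated grid search. A hedged sketch, reusing the X_train / y_train split from above with an arbitrarily chosen grid :
# a minimal sketch : cross-validated search over C and gamma (grid values are arbitrary)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print("best parameters :", search.best_params_)
print("best CV accuracy :", search.best_score_)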
probability : Now, probability is an interesting parameter. It does not affect the accuracy of the model per se; instead it is used to enable probability estimates. By default, probability = False.
For more information about the application of the probability parameter :
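A short sketch of what enabling it looks like (the choice of five test rows is just for illustration); with probability=True, sklearn fits an extra calibration step internally, so training is slower but predict_proba becomes available :
# a minimal sketch : enabling probability estimates (adds an internal calibration step, so training is slower)
from sklearn.svm import SVC
model_prob = SVC(kernel="linear", C=1, probability=True).fit(X_train, y_train)
print(model_prob.predict_proba(X_test[:5]))   # class probabilities for five test rows
print(model_prob.predict(X_test[:5]))         # hard class labels for the same rows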
Let’s check the performance through evaluation metrics.
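A minimal sketch of that check, scoring model_svclinear on the held-out X_test / y_test from the split above (accuracy plus a per-class report is just one reasonable choice of metrics) :
# a minimal sketch : evaluating the linear SVC on the held-out test set
from sklearn.metrics import accuracy_score, classification_report
y_pred = model_svclinear.predict(X_test)
print("test accuracy :", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=data.target_names))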
Depending upon the requirement, we can select the kernel type and other hyperparameter values.
References :
- Fig 4 : By marsroverdriver — originally posted to Flickr as The hedge maze at Traquair House, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=11774682