Notes for Reviewing SVM
A mixture from multiple textbooks and online resources
A typical way of solving a classification problem is to find a separating hyperplane in the feature space. Algorithms that use this approach include SVM and logistic regression (the hyperplane of logistic regression is the one where the predicted probability equals 0.5; logistic regression finds it by fitting the logistic function to the data points).
Functional margin and geometric margin
Given a labeled point (x0, y0) with y0 ∈ {-1, 1} and a hyperplane wT*x + b = 0,
functional margin = y0*(wT*x0 + b)
geometric margin = y0*(wT*x0 + b) / ||w||
The geometric margin is the distance from the point to the hyperplane (when the point is classified correctly); dividing by ||w|| makes it invariant to rescaling w and b, which the functional margin is not.
Find the maximum margin (the separating hyperplane sits in the middle of it), which leads to the primal problem
min 1/2*||w||^2
s.t. yi(wT*xi + b) >= 1, i = 1,2,...m
This problem can be solved by using Quadratic Programming software. But a more efficient way is to use Lagrange Multipliers and solve its dual problem.
SVM model parameters are affected only by the support vectors (alpha > 0)
Let’s look at the result first. For the above convex quadratic optimization with inequality constraints, the dual problem is
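max over alpha: SUM_i(alpha_i) - 1/2*SUM_i SUM_j(alpha_i * alpha_j * yi * yj * <xi, xj>)
s.t. alpha_i >= 0, SUM_i(alpha_i * yi) = 0, i = 1,2,...m
(the standard hard-margin dual, as written in CS229 and Zhou's book)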
The new parameters, alpha_i, are the Lagrange multipliers. The above optimization problem can be solved by sequential minimal optimization (SMO). Both w and b can then be computed from the alphas and certain data points. Only the support vectors (the data points on the margin boundaries in the feature space) have non-zero alpha.
Thus, the final model is
f(x) = wT*x + b = SUM(alpha_i * yi * <xi, x>) + b
Only the support vectors' alphas are non-zero.
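As a sanity check (a minimal sketch with toy data, not part of the original derivation), scikit-learn's SVC exposes exactly these pieces after fitting: support_vectors_, dual_coef_ (which stores alpha_i * yi for the support vectors only), and intercept_ (b). Recomputing f(x) from them matches decision_function:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs, used purely for illustration.
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1000.0).fit(X, y)

x_new = X[0]
# f(x) = SUM(alpha_i * yi * <xi, x>) + b, summed over support vectors only.
f_manual = np.sum(clf.dual_coef_[0] * (clf.support_vectors_ @ x_new)) + clf.intercept_[0]
f_sklearn = clf.decision_function(x_new.reshape(1, -1))[0]
print(np.isclose(f_manual, f_sklearn))                      # True
print(len(clf.support_vectors_), "support vectors out of", len(X))
```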
This was not intuitive to me at first, because the model parameters w and b seem to depend only on the few training points whose alpha_i > 0. I expected the relationship between w and (xi, yi) to be more complicated than a weighted sum. But I realized I had never looked into the relationship between w and (xi, yi), even for logistic regression. Now I wonder: is it possible to relate the w of logistic regression to (xi, yi) in a similar way? Later in this note, we shall see the relationship between soft-margin classifiers with different surrogate losses and logistic regression. In addition to the probabilistic formulation with a Bernoulli distribution and cross entropy, logistic regression can also be formulated as a soft-margin classifier with logistic loss plus regularization and solved with Lagrange multipliers, so will its w have a similar form as a weighted sum of (xi, yi)? Does a different surrogate loss generate different alphas, or a different relationship between w and (xi, yi)?
Lagrange Multipliers, Dual Problem, and KKT Conditions
The above is the final form of the SVM model, based on a series of derivations. Let's take a step back and look at how to get the dual problem. To convert the primal problem to the dual problem, a Lagrange multiplier alpha_i is added for each constraint to form the Lagrange function:
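L(w, b, alpha) = 1/2*||w||^2 + SUM_i(alpha_i * (1 - yi*(wT*xi + b))), with alpha_i >= 0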
The relationship between w and alpha is found by taking the derivatives of L.
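Setting the derivatives of L with respect to w and b to zero gives
w = SUM_i(alpha_i * yi * xi)
0 = SUM_i(alpha_i * yi)
Substituting these back into L eliminates w and b and yields the dual problem above.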
The solutions must also satisfy the KKT conditions. When alpha_i = 0, the constraint yi*f(xi) - 1 >= 0 is inactive: the data point is not on the margin boundary and does not affect the model. When alpha_i > 0, complementary slackness forces yi*f(xi) - 1 = 0: these points are the support vectors.
Kernel Functions
The optimal margin classifier can only handle problems that are linearly separable in the given features. To classify non-linearly separable problems, the original features need to be mapped to a higher-dimensional space in which a separating hyperplane exists. One common example is XOR.
One way is to map the original features to a new feature space with some mapping g(x), then train an SVM on the new features. In the dual problem, <xi, xj> is simply replaced by <g(xi), g(xj)>, and the same replacement happens in w.
However, explicitly computing g(x) and then <g(xi), g(xj)> may be computationally expensive. An alternative is to use a kernel function that computes the same result with less complexity. One example is the Gaussian kernel, which is equivalent to computing the inner product of infinitely many features:
k(x1, x2) = exp(-||x1 - x2||^2 / (2*sigma^2))
equivalent to
<g(x1), g(x2)>, where g(x) maps x to an infinite-dimensional feature space
From the dual problem and the final model, we can see that, interestingly, the key quantity is the inner product <xi, x>, so we can replace it with any valid kernel function.
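To make the kernel trick concrete, here is a tiny numpy check (my own illustration; phi is a hypothetical explicit map) showing that the degree-2 polynomial kernel k(a, b) = <a, b>^2 equals the inner product after mapping 2-D inputs to 3 features:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel k(a, b) = <a, b>^2
    on 2-D inputs: phi(x) = [x1^2, x2^2, sqrt(2)*x1*x2]."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(np.dot(a, b) ** 2)        # kernel computed in the original 2-D space -> 121.0
print(np.dot(phi(a), phi(b)))   # inner product in the mapped 3-D space     -> 121.0
```

The kernel gives the same number without ever building the higher-dimensional features, which is the whole point when the mapped space is huge or infinite (as with the Gaussian kernel).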
In theory, if the original feature space is finite-dimensional, then there must exist a higher-dimensional space in which the samples are separable. In practice, however, we do not know which feature space to map to, so we do not know which kernel function to choose. Even if we explore a bunch of kernels and eventually find one that separates the samples, it is also possible that we are overfitting the data rather than finding the right kernel that maps to the right space.
Soft Margin, Regularization, Surrogate Loss (hinge, exponential, logistic)
For this reason, some problems cannot be classified perfectly with a hyperplane, so the soft margin is introduced to tolerate some errors. The optimization is
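min 1/2*||w||^2 + C*SUM_i(l_0/1(yi*(wT*xi + b) - 1))
(the standard soft-margin objective with the 0/1 loss, as in Zhou's book)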
Here z = y*f(x) - 1. When z < 0, the data point is on the wrong side of the margin, so l(z) is 1; when z >= 0, the data point is classified correctly with enough margin, so l(z) is 0. This is the general form of Regularization + Loss, in which ||w||^2 is the regularization term and l(z) is the loss. However, this 0/1 loss is non-convex and discontinuous, so surrogate loss functions are used to replace it. The surrogate losses below take z = y*(wT*x + b).
Hinge loss is the most common surrogate loss for SVM:
hinge(z) = max(0, 1 - z), where z = y*f(x)
Note that the "1" is interpreted as the margin in the hinge loss. (1) If y*f(x) > 1, the prediction f(x) and the ground truth y not only have the same sign, but the margin is also large enough (> 1), so the hinge loss is zero. (2) If y*f(x) < 0, y and f(x) have opposite signs: the prediction is wrong and the loss is larger than 1. (3) If 0 < y*f(x) < 1, y and f(x) have the same sign but the hinge loss is still positive because the margin is not large enough (the point falls inside the margin although it is classified correctly). By replacing l(z) with the hinge loss, we have
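min 1/2*||w||^2 + C*SUM_i(max(0, 1 - yi*(wT*xi + b)))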
It can be re-written by introducing slack variables:
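min 1/2*||w||^2 + C*SUM_i(ξ_i)
s.t. yi*(wT*xi + b) >= 1 - ξ_i, ξ_i >= 0, i = 1,2,...m
where ξ_i is the slack (error) variable for each data point.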
The re-written form means that we want to minimize the slack (error) terms, whose minimum is zero. For some data points the slack can be positive: when it is larger than 1, y and f(x) have opposite signs; when it is between 0 and 1, the margin is not large enough. The optimization decides which data points get positive slack so that the overall objective is minimized.
Following the above procedure, the Lagrange function and the dual problem are
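L = 1/2*||w||^2 + C*SUM_i(ξ_i) + SUM_i(alpha_i * (1 - ξ_i - yi*(wT*xi + b))) - SUM_i(mu_i * ξ_i)
with alpha_i >= 0 and mu_i >= 0. Eliminating w, b, and ξ gives the dual:
max over alpha: SUM_i(alpha_i) - 1/2*SUM_i SUM_j(alpha_i * alpha_j * yi * yj * <xi, xj>)
s.t. 0 <= alpha_i <= C, SUM_i(alpha_i * yi) = 0, i = 1,2,...m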
The only difference in the dual problem is that the constraint becomes 0 <= alpha_i <= C. Alpha determines the support vectors; the slack variable determines where a support vector lies (on the margin, inside the margin, or on the wrong side).
With a different surrogate loss, the optimization will of course be different. Although we can use the same approach (Lagrange multipliers, the dual problem, SMO), the alphas can be different. For example, I now believe the w of logistic regression has a similar form as a weighted sum of (xi, yi), but the alphas must be different from those of the soft-margin classifier with hinge loss. SVM puts all the weight on the support vectors, while the other data points do not affect the hyperplane or margin at all. In contrast, logistic regression puts some weight on every data point and does not emphasize the role of support vectors. Thus, SVM is more sensitive to overlapping data than logistic regression, but SVM may outperform it when the data points are well separated.
Don't forget the important hyper-parameter C! Controlling C can result in a low-variance/high-bias or a high-variance/low-bias classifier, as the sketch below illustrates. When C is large, the slack is pushed toward zero: high variance but low bias. When C is small, the slack can be large: low variance but high bias. Note that C in this context (and in Zhihua Zhou's book) is different from the C term in Introduction to Statistical Learning, where C is a budget on the total slack so its effect is reversed. (See "Relationship to Logistic Regression" in the SVM chapter of Introduction to Statistical Learning.)
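A quick way to see the effect of C is to fit scikit-learn's SVC with a very small and a very large C on slightly overlapping toy data and count the support vectors (a rough sketch; the exact numbers depend on the data):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes so the slack variables actually matter.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates more slack: wider margin, more support vectors (lower variance, higher bias).
    # Large C penalizes slack heavily: narrower margin, fewer support vectors (higher variance, lower bias).
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```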
Relation between SVM and Logistic Regression
Let’s recall the derivation of the loss function of Logistic Regression in Stanford CS229 Lecture notes using a probabilistic approach based on maximum likelihood:
The last equation in the above note maximizes the log likelihood. That is equivalent to minimizing the negative of the log likelihood, which turns out to be exactly the binary cross-entropy -(y*log(h(x)) + (1-y)*log(1-h(x))). This cross-entropy form is probably the most common expression of the loss function of Logistic Regression.
Someone asked an interesting question on StackOverflow about the different expressions of the loss function in different textbooks. For example, the one below is another formulation of the loss function of Logistic Regression. Can you see the connection between it and the cross-entropy one? They are mathematically equivalent!
The trick is to substitute h(x) with the full expression of the logistic function w.r.t. z, which is h(x) = 1/(1 + exp(-z)) = 1/(1 + exp(-(wT*x + b))). Then the likelihood/cross-entropy form w.r.t. h(x) is equivalent to the above form w.r.t. z, where z = wT*x + b. The details can be found in the second reply from Manuel in this derivation. (Note that in some discussions z may be defined differently, e.g., z = y*(wT*x + b).)
Another example is the following equation in Zhihua Zhou's book. It is also equivalent to the above equation w.r.t. z, just with z replaced by beta*x.
Note that all the above formulations of the loss function of Logistic Regression assume y ∈ {0, 1}.
However, the SVM derivations use y ∈ {-1, 1}, and the hinge loss also assumes y ∈ {-1, 1}. Can we formulate the loss function of Logistic Regression using y ∈ {-1, 1}? The short answer is yes, and we should use the logistic loss:
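logistic_loss(z) = log(1 + exp(-z)), where z = y*(wT*x + b) and y ∈ {-1, 1}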
(One thing needs to be clarified here. Logistic Regression is a model whose formula is h(x) = 1/(1 + exp(-(wT*x + b))). The loss function, on the other hand, describes the relationship between y and h(x) and serves as the objective function for fitting any h(x); different loss functions are based on different assumptions about P(y|x). As discussed before, the cross-entropy loss is based on the assumption that P(y|x) is binomial, MSE assumes P(y|x) is Gaussian, the binary loss assumes P(y|x) is Bernoulli, and the logistic loss here should also be based on a Bernoulli assumption for P(y|x); it only has a different expression from the binary loss because y is encoded differently ({0, 1} or {-1, 1}). The optimization idea is the same: starting from maximum likelihood, derive the form of the loss function under each assumption. So the loss function is not tied to the definition of Logistic Regression: the loss comes from the assumptions made for the optimization, while logistic regression is a way of modeling, and in that sense they are independent. Shouldn't the comparison between logistic regression and SVM then be made from the modeling point of view? Why compare loss functions here, one being the hinge loss and the other the logistic loss? That said, the hinge loss does express the idea of a margin and the logistic loss expresses a probabilistic idea, so viewed this way they are related to the models after all. For example, modeling with logistic regression but fitting it with the hinge loss is probably unreasonable. So when we say "logistic regression", it should refer not only to the modeling part but also to the choice of loss function. This needs more study.
Addendum: looking again at the optimization problem derived above for SVM, it is actually very close to the modeling of logistic regression; both are based on wT*x + b. Presumably this is the main reason for comparing the loss functions: the models are very similar, but the different loss functions reflect different optimization ideas, one based on the margin and the other probabilistic. The book says the two often perform similarly, but because the optimization ideas differ, there are still differences depending on the data. For example, SVM is more sensitive to noise and does not work as well when the classes overlap a lot, but it performs better on well-separated data. In the end, the difference lies in the optimization idea, which in implementation is the difference in loss function.)
Then Logistic Regression minimizes the sum of the logistic loss:
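min over w, b: SUM_i(log(1 + exp(-yi*(wT*xi + b))))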
Thus, depending on whether y ∈ {0, 1} or {-1, 1}, the loss function of Logistic Regression has two forms. The likelihood/cross-entropy form provides a "probability", while the logistic loss form is easier to compare with the hinge loss (SVM). Although the notations are different, they are mathematically equivalent. I like this expression:
Thus, to understand the relationship between Logistic Regression and SVM, let's use the logistic loss form for Logistic Regression. As mentioned above, the logistic loss and the hinge loss are both surrogate losses for the soft-margin classifier. More commonly, SVM uses the hinge loss to create a soft margin. If the logistic loss is used instead, the result is Logistic Regression + L2 regularization:
min 1/2*||w||^2 + C*SUM(log(1 + exp(-y*(wT*x + b))))
where the first term is the L2 regularization and the second term is the sum of the logistic loss. C is the penalty weight on the loss: when C is large, the errors are pushed toward small values; when C is small, the errors can be larger.
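As a sketch, the objective above can be written directly in numpy (the function name and the {-1, +1} encoding of y are my own choices). Up to constants, this is also the objective that scikit-learn's LogisticRegression with the default L2 penalty minimizes, with its own C playing the same role:

```python
import numpy as np

def l2_regularized_logistic_objective(w, b, X, y, C=1.0):
    """0.5*||w||^2 + C * SUM_i log(1 + exp(-y_i*(w^T x_i + b))), with y_i in {-1, +1}."""
    z = y * (X @ w + b)                                   # margins y_i * (w^T x_i + b)
    return 0.5 * np.dot(w, w) + C * np.sum(np.log1p(np.exp(-z)))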
To summarize the differences between SVM and Logistic Regression:
- hinge loss vs. logistic loss
- if the logistic loss is used in the primal problem of SVM, the only difference from plain Logistic Regression is the regularization term.
- if the logistic loss (with y ∈ {-1, 1}) is used to find the weights w, the prediction sign(wT*x + b) ∈ {-1, 1} cannot be interpreted as a probability. Thus, for SVM (with hinge or logistic loss), extra steps are needed to output a probability. If the likelihood/cross-entropy form is used, the output itself is a probability.
- Logistic Regression (the likelihood/cross-entropy form) can be extended to Softmax Regression for multiple classes, while SVM is inherently binary and needs special treatment (e.g., one-vs-rest) for multiple classes.
- Fundamentally, these are not differences between Logistic Regression and SVM + logistic loss; they are differences between the logistic-loss form and the likelihood/cross-entropy form of the same optimization goal.
- "When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred." Because of the KKT conditions, the alpha_i's in SVM are sparser than those in Logistic Regression; that is, only the support vectors impact the decision boundary of SVM.
Again, looking carefully at the primal problems, they all have the general structure of loss + regularization, where ||w||² is the regularization term and the second term is the loss.
Kernelize
Because the optimization goal is basically regularization + loss, you may wonder whether, if other methods like logistic regression or LDA are put into the same framework, their parameters also take the form of SUM(alpha_i * xi). That is indeed the case. As discussed above, using the logistic loss is equivalent to logistic regression + L2 regularization. Thus, Kernelized Logistic Regression represents Logistic Regression in the dual form (with an alpha_i for each data point, though probably not as sparse as in SVM) and replaces the inner product with a kernel function. Here may be a good inspiration. This kernelization process also applies to other methods.
The same process can be applied to LDA. The original optimization goal of KLDA is
After a series of transformations, here is the equivalent representation with alphas:
Don't worry too much about M & N and Sb & Sw; they are complicated. Let's focus on the general framework: solve for the alphas, and the final prediction is
Support Vector Regression (SVR)
For a regression problem, it is easier to compare the prediction and the ground truth in an x-y plot.
For SVC, the data points live in the feature space (an x1-x2 plot) and the decision boundary is a hyperplane. For SVR, the plot above is an x-y plot and the red line is the prediction (the fitted line), so comparing SVC and SVR this way may not be entirely fair. Unlike the Support Vector Classifier (SVC), where the support vectors are the data points falling inside or on the margin, the support vectors for SVR are those outside the margin. The data points falling inside the margin do not contribute to the loss function.
Why is SVM sensitive to noise?
The Lagrange multipliers are sparse: only the support vectors determine the hyperplane, so a few noisy points near the boundary can shift it significantly.
References:
1. Stanford CS229 lecture notes for SVM
2. Machine Learning by Zhihua Zhou, the SVM chapter
3. Introduction to Statistical Learning, "Relationship to Logistic Regression" in the SVM chapter
4. StackOverflow posts to understand the two forms of Logistic Regression: post 1 and post 2