
Support Vector Machine Math for people who don’t care about optimization

11 min read · Mar 23, 2021

I long avoided the math behind Support Vector Machines because I knew it was an optimization problem, and unless I am minimizing something with gradient descent, I don’t know much about optimization, linear programming, or any of that.

However, I finally spent some time reading up on optimization, and I hope to make the concepts behind SVM math more accessible for people with little optimization background.

First, let's discuss what it is we are trying to do. Consider the following plot:

Image 1: Example Hyperplane Solved For Using an SVM. Figure Credit: Andrew Ng

The line separating the X’s from the O’s is called the separating hyperplane, where a hyperplane is just an (n−1)-dimensional plane in n-dimensional space (in the example above, n=2). Anyway, when we use SVMs for classification, we hope to find the hyperplane that linearly separates our data. However, unlike other methods of generating separating hyperplanes such as neural nets, SVMs do not take into account points that are far away from the line when generating the plane. In the example image above, point A is so far from the plane that it doesn’t really have much impact on the optimal position of the line. However, if we move point C down, we would clearly need to readjust the line to keep the data linearly separable.

This is one of the main benefits of SVMs: they only take into account the points that would alter the position of the line. These points are known as support vectors and are the only points considered when formulating the hyperplane.

As we discuss how SVMs solve for this hyperplane, we will do so in the context of a binary classification problem where we map some input x to our target space y ∈ {−1, 1}. Our goal is to construct the following classifier:

h(x) = g(w^T x + b)

In the equation above:

g is our activation function which returns 1 if (w^Tx + b) is positive, and -1 if it is negative.

w and b are the weights and biases of our model, which we wish to learn.

x is our input vector.
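To make this concrete, here is a minimal sketch of that classifier in code; the values of w, b, and x are purely illustrative:

import numpy as np

def g(z):
    # activation: +1 for non-negative input, -1 otherwise
    return 1 if z >= 0 else -1

def h(x, w, b):
    # the classifier h(x) = g(w^T x + b)
    return g(np.dot(w, x) + b)

# hypothetical example values for illustration
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([3.0, 1.0])
print(h(x, w, b))  # prints 1, since w^T x + b = 1.5 > 0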

Example SVM “margin” γ. Figure Credit: Andrew Ng

Most mathy tutorials on SVMs that I have read like to first discuss the maximum margin classifier and then get into SVMs. A margin is the distance between the separating hyperplane and a data point. In the image above, γ is the margin from the line to point A. A maximum margin classifier is basically just an SVM for perfectly linearly separable data. Its goal is to fit a hyperplane that maximizes the distance (margin) between the hyperplane and the closest data points on each side, while still linearly separating the data, as shown in the plot below.

Maximum Margin Classifier. Figure Credit: Andrew Ng
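To make the margin concrete, here is a quick sketch of how you could compute the (geometric) margin of a single labeled point; the function and variable names are just illustrative:

import numpy as np

def margin(x, y, w, b):
    # geometric margin of point x with label y in {-1, 1}:
    # positive when the point is correctly classified, and
    # larger the farther the point sits from the hyperplane
    return y * (np.dot(w, x) + b) / np.linalg.norm(w)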

Lagrange Multipliers and Optimization Stuff

Here comes the scary optimization part. Don’t worry, I promise to explain exactly what this means. So, a maximum margin classifier wishes to solve the following problem:

max_{γ,w,b} γ
s.t. y(i)(w^T x(i) + b) ≥ γ for all i
∥w∥ = 1

So, what does this mean? For the sake of completeness, let's point out that optimization problems follow this general layout. First, we state what we want to optimize (maximize/minimize), then we state what constraints must be met, typically with the such that or s.t. notation.

Let's walk through this optimization problem. Recall that the closest data point on each side of the line will be γ away from the hyperplane, right? So we just want to make sure all data points are as far away from the line as possible (maximize γ); that is what the objective is saying.

Furthermore, the y(i)(w^T x(i) + b) ≥ γ part asks us to also ensure all points are classified correctly, since γ must be positive and y(i)(w^T x(i) + b) is positive for any correctly classified data point. I am not going to fully explain the ∥w∥ = 1 constraint because I would have to explain the difference between functional and geometric margins and… I don’t think we need to do that. For more details, refer to section 4 of the Stanford CS229 notes (https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf). For now, we’ll just say we get some nice properties by using the ∥w∥ = 1 constraint.

Now, this is actually not the final formulation of the optimization required for the maximum margin classifier. What I just described above provides some intuition, but is actually quite a tricky optimization.

Note: When I say the optimization is “tricky” I just mean that the function as written will be really hard to optimize. There are certain ways of setting up the equations that allow easier use of existing optimization algorithms to minimize/maximize a function.

To make it a little easier, we can get the same solution by optimizing for:

max_{γ,w,b} γ/∥w∥
s.t. y(i)(w^T x(i) + b) ≥ γ for all i

You might notice that we got rid of the ∥w∥=1 constraint and sort of added ∥w∥ to the denominator of the objective function. Without getting too much into the details, I will just say that optimizing for what is basically a normalized version of the maximum margin (which we get when dividing by ∥w∥) gets us the same result as maximizing the margin itself. Unfortunately, we still have a bit of a tricky optimization, but there is one more thing we can do.

There is a trick to restructure this optimization into something more doable, and when I say doable, I mean formatted so that it can be plugged into some optimization software (e.g. scipy.optimize). The way we will do this, at a super high level, is as follows.

Recall our classifier h(x) = g(w^T x + b). Something interesting about this is that we could arbitrarily scale w and b and get the same classification result, since our output function g only cares about the sign of w^T x + b. That is, g(w^T x + b) = g(2w^T x + 2b). Since scaling w and b doesn’t affect the outcome, we can rescale them such that the margin γ = 1 without changing what we’re doing. With this notion of scale not affecting classification, let’s rewrite the optimization as:

min_{w,b} (1/2)∥w∥²
s.t. y(i)(w^T x(i) + b) ≥ 1 for all i

This works because maximizing γ/∥w∥ when γ = 1 is the same as maximizing 1/∥w∥, which in turn is the same as minimizing ∥w∥ (squaring it and adding the factor of 1/2 doesn’t change the solution; it just makes the math nicer). Thus, we are now enforcing that the margin is equal to 1. Finally, we have something we can plug into some black-box optimization software. Also, I’ll quickly note that this is known as the primal formulation of our optimization problem. More on this later…

Okay, here comes the tricky part, but I mean… this is the core of how SVMs work.

Suppose you have the following optimization problem:

min_w f(w)
s.t. g_i(w) ≤ 0, i = 1, …, k
h_i(w) = 0, i = 1, …, l

So, basically just a function we want to minimize with two sets of constraints. We can solve this using Lagrange multipliers (I will assume you remember what these are). We define the generalized Lagrangian as follows:

L(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_i β_i h_i(w)

Why do we need the Lagrangian? We do this because we can solve for the Lagrange multipliers, which allow us to analytically minimize some function f(w). What you need to know: the Lagrange multipliers are the α’s and β’s that allow us to set the partial derivatives of L to 0 and minimize the equation given some w. So, if we have the multipliers, we can minimize the function… which is our goal.

Okay now comes what I find to be the weird part. There are these things called the ‘dual’ and ‘primal’ formulations of an optimization problem. Under certain conditions (which are met in our case), they produce the same solution. For SVMs, the dual formulation, for which I will be skipping all the mathy details, allows us to rewrite our maximum margin classifier optimization in a more useful way.

First, note that if we plug our maximum margin classifier equations into the generalized Lagrangian, we get:

L(w, b, α) = (1/2)∥w∥² − Σ_i α_i [y(i)(w^T x(i) + b) − 1]

After doing a bunch of math and getting to the dual form of this Lagrangian, we find a new optimization problem that gives us the same result as our primal form from earlier:

max_α Σ_i α_i − (1/2) Σ_i Σ_j y(i) y(j) α_i α_j ⟨x(i), x(j)⟩
s.t. α_i ≥ 0 for all i
Σ_i α_i y(i) = 0

Quick Recap: We took the general formula for a Lagrangian, plugged in our maximum margin classifier equations, did some simplifications, and rewrote the problem as an optimization over the Lagrange multipliers. That's all you need to know.

The question I am sure you have at this point is … “WHY DO WE NEED TO DO THIS?”

First of all, we have removed the dependence on w and b. All we have left are the dot products ⟨x(i), x(j)⟩, which we can compute once before performing the optimization. This makes the setup much more efficient, and we never have to touch w directly. Do you see how cool this is? All we need to do is solve for the α’s. Additionally, if we wanted to learn a non-linear decision boundary, we could replace the dot product with some non-linear kernel function (see the sketch below). In summary, we now only rely on the α’s for everything! But how do we use the α’s to get our hyperplane?
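As a sketch of that kernel idea (not part of the derivation above), here is an RBF kernel you could drop in wherever the dot product ⟨x(i), x(j)⟩ appears; rbf_kernel and its width parameter gamma are illustrative names, unrelated to the margin γ:

import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    # stands in for the plain dot product <x_i, x_j>; implicitly
    # computes a dot product in a higher-dimensional feature space
    diff = x_i - x_j
    return np.exp(-gamma * np.dot(diff, diff))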

For the weight vector, we can express it as a linear combination of the training data, training labels, and α’s:

w = Σ_i α_i y(i) x(i)

When we want to make a prediction on a new test data point x′, we simply need to compute:

w^T x′ + b = Σ_i α_i y(i) ⟨x(i), x′⟩ + b

and take the sign of the result.

Remember that most of the α’s will be 0, because the only data points that impact the decision boundary are the support vectors. This is another benefit of this new formulation: the points with nonzero α’s are our support vectors. Thus, we can get our best separating hyperplane looking only at the nonzero α’s… pretty cool!

The Case of Non-Linearly Separable Data

What we have discussed so far will break if we add data points that make our data set no longer linearly separable. This is because, if you remember my discussion of what the original primal form of the optimization function meant, we required that all data points be correctly classified. To account for this, we can allow some “slack” and reformulate the optimization with L1 regularization over slack variables ξ(i), which basically allows for some misclassification.

Here is the primal form of the regularized SVM:

min_{w,b,ξ} (1/2)∥w∥² + C Σ_i ξ(i)
s.t. y(i)(w^T x(i) + b) ≥ 1 − ξ(i) for all i
ξ(i) ≥ 0 for all i

Now, each training point x(i) is allowed to have a margin of less than 1 from the separating hyperplane (remember, before we required every point to be at least γ = 1 away). For a sample that does have a margin of less than 1, we pay a price proportional to C in the objective. That is, C controls how much we care about making mistakes. If C is large, mistakes are expensive, so we try to make as few as possible and approach our original optimization problem. If C is small, we tolerate more misclassification in exchange for a wider margin; in other words, small C means more regularization.
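To see what the slack is measuring, here is a small sketch (assuming some fixed w and b; this is not part of the optimization itself) that computes each point’s slack ξ:

import numpy as np

def slacks(X, t, w, b):
    # xi_i = max(0, 1 - t_i * (w^T x_i + b)):
    # 0 for points on the correct side with margin >= 1,
    # positive for points inside the margin or misclassified
    return np.maximum(0, 1 - t * (X @ w + b))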

The dual form, whose derivation I am skipping, looks like this:

max_α Σ_i α_i − (1/2) Σ_i Σ_j y(i) y(j) α_i α_j ⟨x(i), x(j)⟩
s.t. 0 ≤ α_i ≤ C for all i
Σ_i α_i y(i) = 0

As you can see, this is very similar to the linearly separable approach which we already discussed. Also, note that THIS is what we are going to want to code up to do SVM from scratch. Let's see how this looks.

First, let's generate some data. Note that we change the default (0,1) labels to (-1,1). This is required because we want our hyperplane equation to be equal to 0 and the margins to be equal to -1,1 respectively.

from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
from scipy.optimize import minimize
import numpy as np
import matplotlib.pyplot as plt

def gen_data(n_samples):
    X, t = make_blobs(n_samples=n_samples, centers=2, cluster_std=2.0)
    # scale features into [0, 1]
    minmax = MinMaxScaler()
    X = minmax.fit_transform(X)
    # remap the default {0, 1} labels to {-1, 1}
    t[t == 0] = -1
    return X, t

This function will compute all the dot products we discussed. After doing this, all that's left to do is compute the α’s.

def H(X):
    # Gram matrix of pairwise dot products: H[i, j] = <x_i, x_j>
    N = len(X)
    H = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            H[i, j] = np.dot(X[i], X[j])
    return H
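As a side note, the same matrix (the Gram matrix) can be computed in one vectorized line, which is much faster for large N; this is equivalent to the double loop above:

def H_vectorized(X):
    # X @ X.T computes every pairwise dot product at once
    X = np.asarray(X)
    return X @ X.T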

Here are the objective function and constraints we will pass to the optimizer. A few notes about scipy.optimize:

1) SciPy minimizes your function by default. Thus, to compute a maximum, we return the negative of the function.

2) Inequality ('ineq') constraints ensure whatever you return is greater than or equal to 0.

3) Equality ('eq') constraints ensure whatever you return is equal to 0.

def loss(a):
    '''
    a: vector of alpha values
    t: (accessed globally) vector of labels in {-1, 1}
    K: (accessed globally) matrix of pairwise dot products
    '''
    at = a * t
    # negate the dual objective because scipy minimizes
    return -(a.sum() - 0.5 * np.dot(at.T, np.dot(K, at)))

# This is the inequality constraint.
# For N data points, this produces a matrix with 2N rows,
# where the first N rows ensure alpha >= 0 and the second N
# ensure alpha <= C. This is just a clever formulation
# I found on GitHub! (see sources)
def constraint1(x):
    # A, b accessed globally
    return b - np.dot(A, x)

# This is the equality constraint sum_i alpha_i * t_i = 0
def constraint2(x):
    # t accessed globally
    return np.dot(x, t)

Here is the main logic of the program

# Get data and labels with 100 data points
X, t = gen_data(100)
N = len(X)
# Set C value (error tolerance)
C = 1
# Generate the dot product matrix
K = H(X)

# See constraint1 comments to understand what happens here
A = np.vstack((-np.eye(N), np.eye(N)))
b = np.concatenate((np.zeros(N), C * np.ones(N)))

# Define constraints for scipy
constraints = ({'type': 'ineq', 'fun': constraint1},
               {'type': 'eq', 'fun': constraint2})


# Initialize alpha values (doesn't matter too much how you do this)
a0 = np.random.rand(N)
res = minimize(loss, a0, constraints=constraints)

a = res.x  # optimal Lagrange multipliers
a[np.isclose(a, 0)] = 0  # zero out the nearly-zero alphas
a[np.isclose(a, C)] = C  # round the alphas that are nearly C

# Points with a == 0 do not contribute to the prediction
support_idx = np.where(a > 0)[0]  # indices of the support vectors
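Before moving on, it’s worth a quick sanity check on the solution; here is a sketch using the variables defined above:

# most alphas should be exactly 0 after the cleanup above
print(f"{len(support_idx)} support vectors out of {N} points")
# the equality constraint sum_i alpha_i * t_i = 0 should roughly hold
print("sum(alpha * t) =", np.dot(a, t))
# every alpha should lie in [0, C]
print("alphas in range?", np.all((a >= 0) & (a <= C)))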

Now that we have performed our optimization, we solve for the intercept term b by averaging over the support vectors:

b = (1/|S|) Σ_{n∈S} ( t(n) − Σ_{m∈S} α_m t(m) ⟨x(m), x(n)⟩ )

where S is the set of support vectors.

# Solve for the b intercept term
a_times_t = a * t
all_b = []
for n in support_idx:
    # sum over the support vectors only (alpha is 0 everywhere else)
    eval_ = np.array([np.dot(x_m, X[n]) if a_m > 0 else 0
                      for x_m, a_m in zip(X, a)])
    all_b.append(t[n] - a_times_t.dot(eval_))
b = np.mean(all_b)

Now we define a function to “predict” on test data, aka generate the hyperplane over an evenly spaced array of numbers.

def predict(test, X, t, k, a, b):
    '''
    test: points at which to evaluate the decision function
    X, t: training data and labels
    k: unused placeholder (e.g. for a kernel function)
    a, b: learned alpha values and intercept
    '''
    a_times_t = a * t
    y = np.empty(len(test))  # array of decision values
    for i, s in enumerate(test):
        # dot product between the new point and each support vector;
        # 0 if not a support vector
        eval_ = np.array([np.dot(s, x_m) if a_m > 0 else 0
                          for x_m, a_m in zip(X, a)])
        y[i] = a_times_t.dot(eval_) + b
    return y
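To turn these raw decision values into class labels, you would just take the sign; for example, to check training accuracy:

# evaluate the decision function on the training points themselves
scores = predict(X, X, t, None, a, b)
labels = np.sign(scores)
print("training accuracy:", np.mean(labels == t))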

Now we plot!

plt.scatter(X[:, 0], X[:, 1], c=t)
# Plot the decision boundary and margins in the input space
grid = np.arange(X.min(), X.max(), 0.05)
xx, yy = np.meshgrid(grid, grid)
zs = predict(np.array(list(zip(np.ravel(xx), np.ravel(yy)))), X, t, None, a, b)
zz = zs.reshape(xx.shape)
CS = plt.contour(xx, yy, zz, levels=[-1, 0, 1])
plt.show()
Example SVM Result

Sources:

Stanford CS229 lecture notes on SVMs: https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf
