# SVMs¶

Note

On the homework:

How do we estimate \(\mu\) and \(\sigma\) from the data?

- \(\arg\max_{\mu, \sigma} P(GPA_1, GPA_2, \ldots, GPA_6 \mid \mu, \sigma)\)
- \(\hat{\mu}_N = \text{avg}(GPA)\)
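
Under a Gaussian model, the maximum-likelihood estimates (stated here in their standard form) are

\[
\hat{\mu}_N = \frac{1}{N} \sum_{n=1}^{N} GPA_n,
\qquad
\hat{\sigma}^2_N = \frac{1}{N} \sum_{n=1}^{N} \left(GPA_n - \hat{\mu}_N\right)^2
\]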

Support Vector Machines

## Max-Margin Classification¶

For a linearly separable dataset, many separating hyperplanes are valid, but which one is *best*?

- The blue one has the largest *margin*
- **Margin**: distance between the hyperplane and the nearest point
  - defined for a given dataset \(\mathbf{D}\) and hyperplane \((\mathbf{w}, b)\)
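
Written out, one common way to define the margin (assuming the notation above) is

\[
\gamma(\mathbf{w}, b; \mathbf{D}) = \min_{n} \frac{\left|\mathbf{w} \cdot x_n + b\right|}{\|\mathbf{w}\|}
\]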

SVM is a classification algorithm that tries to find the *maximum margin* separating hyperplane.

## Hard SVM¶

### Setup¶

- Input: training set of pairs \((x_n, y_n)\)
- \(x_n\) is the D-dimensional feature vector
- \(y_n\) is the label - assume binary \(\{+1, -1\}\)

- Hypothesis class: set of all hyperplanes H
- Output: \(w\) and \(b\) of the maximum-margin hypothesis \(h \in H\)
- \(w\) is a D-dimensional vector (1 for each feature)
- \(b\) is a scalar

### Prediction¶

- learned boundary is the maximum-margin hyperplane specified by \(w, b\)
- given a test instance \(x'\), prediction \(\hat{y} = sign(w \cdot x' + b)\)
- if the prediction is correct, \(y'(w \cdot x' + b) > 0\), where \(y'\) is the true label

### Intuition¶

But \(y(w \cdot x + b) > 0\) is a weak condition; let's strengthen it so the product must be "sufficiently" positive.

**Final Goal**: find \(w, b\) that minimize \(1/\text{margin}\) subject to \(y_n(w \cdot x_n + b) \geq \mathbf{1}\) for all points

### Optimization¶
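
In the standard formulation, the problem is

\[
\max_{w, b} \; \gamma(w, b; \mathbf{D}) \quad \text{s.t.} \quad y_n (w \cdot x_n + b) \geq 1 \quad \forall n
\]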

Where \(\gamma\) is the distance from the hyperplane to the nearest point

- maximizing \(\gamma\) = minimizing \(1/\gamma\)
- constraints: *all* training instances are correctly classified
  - we have a 1 instead of 0 in the condition to ensure a non-trivial margin
  - this is a hard constraint, and so this is called a hard-margin SVM
- what about non-linearly-separable data?
  - infeasible solution (feasible set is empty): no hyperplane yielded
  - let's loosen the constraint slightly

## Soft-Margin SVMs¶

- introduce one slack variable \(\xi_n\) for each training instance
- if a training instance is classified correctly with a large enough margin, \(\xi_n\) is 0 since it needs no slack
- but \(\xi_n\) can even be >1 for incorrectly classified instances
- if \(\xi_n\) is 0, classification is correct
- if \(0 < \xi_n < 1\), classification is correct but margin is not large enough
- if \(\xi_n > 1\), classification is incorrect
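
Putting this together, the standard soft-margin objective is

\[
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{n} \xi_n
\quad \text{s.t.} \quad y_n(w \cdot x_n + b) \geq 1 - \xi_n, \;\; \xi_n \geq 0 \quad \forall n
\]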

- where \(C\) is a hyperparameter (how much to care about slack)
- if the slack component of the objective function is 0, it’s the same goal as a hard-margin SVM

TLDR: maximize margin while minimizing total cost the model has to pay for misclassification that happens while obtaining this margin
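
As a concrete illustration, here is a minimal sketch using scikit-learn (assumed available; the dataset and parameter values are made up):

```python
# Soft-margin linear SVM sketch: C controls how much the model pays for slack.
# Large C approaches a hard margin; small C tolerates more margin violations.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=+2.0, scale=1.0, size=(50, 2)),   # positive class
    rng.normal(loc=-2.0, scale=1.0, size=(50, 2)),   # negative class
])
y = np.array([+1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]               # the learned hyperplane
print("w =", w, "b =", b)
print("number of support vectors:", len(clf.support_vectors_))

# Prediction is sign(w . x' + b)
print("prediction:", clf.predict(np.array([[0.5, -0.3]])))
```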

## Discussion¶

Note that the max-margin hyperplane lies in the middle between the positive and negative points

- So the margin is determined by only 2 data points, which lie on the lines \(w \cdot x + b = 1\) and \(w \cdot x + b = -1\)
- these points, \(x_+\) and \(x_-\), are called support vectors

Note

\(w \cdot x_1 + b\) is 0 since \(x_1\) is on the decision boundary

\(w \cdot x_\gamma = 1 \Rightarrow \|w\| \, \|x_\gamma\| = 1\) since \(w\) and \(x_\gamma\) are parallel, so the margin is \(\gamma = \|x_\gamma\| = 1/\|w\|\)

Therefore, we can modify the objective:
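
Since \(\gamma = 1/\|w\|\), maximizing the margin is equivalent to (in the standard form)

\[
\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_n(w \cdot x_n + b) \geq 1 \quad \forall n
\]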

Or, intuitively, finding the smallest weights possible.

## Hinge Loss¶

We can write the slack variables in terms of \((w, b)\):
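
The standard rewriting is

\[
\xi_n = \max\!\left(0,\; 1 - y_n(w \cdot x_n + b)\right)
\]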

which is hinge loss! Now, the SVM objective becomes:
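
In the standard unconstrained form, this is

\[
\min_{w, b} \; \frac{1}{2}\|w\|^2 + C \sum_{n} \max\!\left(0,\; 1 - y_n(w \cdot x_n + b)\right)
\]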

## Solving¶

### Hard-Margin SVM¶

- convex optimization problem
- specifically a *quadratic programming* problem
  - minimizing a function that is quadratic in the variables
  - constraints are linear
- this is called the *primal form*, but most people solve the *dual form*

We can encode the primal form algebraically:
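
A standard encoding introduces a Lagrange multiplier \(\alpha_n \geq 0\) for each constraint:

\[
\min_{w, b} \; \max_{\alpha \geq 0} \; L(w, b, \alpha),
\qquad
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]
\]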

### Dual Form¶

- does not change the solution
- introduces new variables \(\alpha_n\) for each training instance
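
The resulting dual problem, in its standard form, is

\[
\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
\quad \text{s.t.} \quad \alpha_i \geq 0 \;\; \forall i, \quad \sum_{i} \alpha_i y_i = 0
\]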

Once the \(\alpha_n\) are computed, **w** and b can be computed as:
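
In the standard formulation, with \(x_k\) denoting any support vector:

\[
w = \sum_{n} \alpha_n y_n x_n, \qquad b = y_k - w \cdot x_k
\]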

As it turns out, most \(\alpha_i\)’s are 0 - only the support vectors are not

**For Soft-Margin SVM**

For soft-margin SVMs, support vectors are:

- points on the margin boundaries (\(\xi = 0\))
- points in the margin region (\(0 < \xi < 1\))
- points on the wrong side of the hyperplane (\(\xi \geq 1\))

**Conclusion**: w and b only depend on the support vectors

#### Derivation¶

Given the algebraically encoded primal form from above, \(\min_{w, b} \max_{\alpha \geq 0} L(w, b, \alpha)\):

We can switch the order of the min and max:
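
For this convex problem the swap does not change the optimum (strong duality), so the problem becomes

\[
\max_{\alpha \geq 0} \; \min_{w, b} \; L(w, b, \alpha)
\]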

To solve the inner min, differentiate \(L\) with respect to \(w\) and \(b\) and set the derivatives to zero:
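
In the standard derivation this gives

\[
\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0
\]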

- \(w = \sum_i \alpha_i y_i x_i\) means **w** is a weighted sum of examples
- \(\sum_i \alpha_i y_i = 0\) means positive and negative examples have the same weight
- \(\alpha_i > 0\) only when \(x_i\) is a support vector, so **w** is a sum of signed *support vectors*

**Conclusion**: substituting \(w = \sum_i \alpha_i y_i x_i\) back into \(L\) and simplifying yields the dual problem over \(\alpha\) alone, as stated above.

**Soft-Margin**

The dual objective is the same, but subject to \(0 \leq \alpha_i \leq C\) \((\forall i)\)

## Non-Linearly-Separable¶

What if our data is not linearly separable?

- use a non-linear classifier
- transform our data so that it is linearly separable, somehow
- e.g. adding a dummy dimension computed as a quadratic function of the real dimension

### Feature Mapping¶

We can map the original feature vector \(x\) to a higher-dimensional space via \(\phi(x)\)

e.g. quadratic feature mapping:
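
For example, in two dimensions a common quadratic mapping is

\[
\phi(x) = \left(1,\; x_1,\; x_2,\; x_1^2,\; x_2^2,\; x_1 x_2\right)
\]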

Pros: this improves separability, so you can apply a linear model more confidently

Cons: there are many more features now, and many redundant ones, so there is a lot more computation and it is easier to overfit