SVMs
====

.. note::

    On the homework:

    .. math::

        & P(GPA=x|type=N) \\
        & = \frac{1}{\sqrt{2\pi \sigma_N^2}} \exp\left(\frac{-(x-\mu_N)^2}{2\sigma_N^2}\right)

    How do we estimate :math:`\mu` and :math:`\sigma` from the data?

    1. :math:`\arg \max_{\mu, \sigma} P(GPA_1, GPA_2, ..., GPA_6|\mu, \sigma)`
    2. :math:`\hat{\mu}_N = avg(GPA)`

Support Vector Machines

Max-Margin Classification
-------------------------

This is a linearly separable dataset, and all of these hyperplanes are valid - but which one is *best*?

.. image:: _static/svm/ex1.png
    :width: 300

- The blue one has the largest *margin*
- **Margin**: distance between the hyperplane and the nearest point

  - defined for a given dataset :math:`\mathbf{D}` and hyperplane :math:`(\mathbf{w}, b)`

.. math::

    margin(\mathbf{D}, \mathbf{w}, b) = & \min_{(x, y)\in \mathbf{D}} y(\mathbf{w \cdot x} + b) & \text{ if w separates D} \\
    & -\infty & \text{ otherwise}

SVM is a classification algorithm that tries to find the *maximum-margin* separating hyperplane.

Hard SVM
--------

Setup
^^^^^

- Input: training set of pairs :math:`(x_n, y_n)`

  - :math:`x_n` is the D-dimensional feature vector
  - :math:`y_n` is the label - assume binary :math:`\{+1, -1\}`

- Hypothesis class: the set of all hyperplanes :math:`H`
- Output: :math:`w` and :math:`b` of the maximum-margin hypothesis :math:`h\in H`

  - :math:`w` is a D-dimensional vector (1 entry for each feature)
  - :math:`b` is a scalar

Prediction
^^^^^^^^^^

- the learned boundary is the maximum-margin hyperplane specified by :math:`w, b`
- given a test instance :math:`x'`, the prediction is :math:`\hat{y} = sign(w \cdot x' + b)`
- if the prediction is correct, :math:`y'(w \cdot x' + b) > 0`, where :math:`y'` is the true label

Intuition
^^^^^^^^^

.. image:: _static/svm/ex2.png
    :width: 500

But :math:`y * activation > 0` is a weak condition - let's require the activation to be "sufficiently" positive.

**Final Goal**: Find w, b that minimize 1/margin s.t. y * activation >= **1** for all points

Optimization
^^^^^^^^^^^^

.. math::

    \min_{w, b} & \frac{1}{\gamma(w, b)} \\
    \text{subj. to } & y_n(w \cdot x_n + b) \geq 1 \; (\forall n)

Where :math:`\gamma` is the distance from the hyperplane to the nearest point

- maximizing :math:`\gamma` = minimizing :math:`1/\gamma`
- constraints: *all* training instances are correctly classified

  - we use 1 instead of 0 in the constraint to ensure a non-trivial margin

- this is a hard constraint, so this is called a hard-margin SVM
- what about non-linearly-separable data?

  - infeasible (the feasible set is empty): no hyperplane is yielded
  - let's loosen the constraint slightly
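To make the definitions above concrete, here is a minimal NumPy sketch of the margin formula, the prediction rule, and the hard-margin constraint check (function names like ``hard_margin_feasible`` and the toy data are illustrative, not from the notes):

.. code-block:: python

    import numpy as np

    def margin(X, y, w, b):
        """Margin as defined above: min_n y_n (w . x_n + b) if (w, b) separates the data, else -inf."""
        activations = y * (X @ w + b)        # y_n * (w . x_n + b) for every training instance
        return activations.min() if np.all(activations > 0) else -np.inf

    def hard_margin_feasible(X, y, w, b):
        """Check the hard-margin constraint y_n (w . x_n + b) >= 1 for all n."""
        return bool(np.all(y * (X @ w + b) >= 1))

    # Toy linearly separable dataset with labels in {+1, -1}
    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    w, b = np.array([0.5, 0.5]), 0.0

    print(margin(X, y, w, b))                # 1.5
    print(hard_margin_feasible(X, y, w, b))  # True
    print(np.sign(X @ w + b))                # predictions y_hat = sign(w . x + b)

For a dataset that is not linearly separable, no ``(w, b)`` makes ``hard_margin_feasible`` return ``True``, which is exactly the infeasibility problem noted above.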
Soft-Margin SVMs
----------------

.. image:: _static/svm/ex3.png

- introduce one slack variable :math:`\xi_n` for each training instance
- if a training instance is classified correctly, :math:`\xi_n` is 0 since it needs no slack, but :math:`\xi_n` can even be > 1 for incorrectly classified instances

  - if :math:`\xi_n` is 0, classification is correct
  - if :math:`0 < \xi_n < 1`, classification is correct but the margin is not large enough
  - if :math:`\xi_n > 1`, classification is incorrect

- :math:`C` is a hyperparameter (how much to care about slack)
- if the slack component of the objective function is 0, it's the same goal as a hard-margin SVM

TLDR: maximize the margin while minimizing the total cost the model has to pay for the misclassifications it makes while obtaining that margin.

Discussion
----------

Note that the max-margin hyperplane lies in the middle between the positive and negative points

- So the margin is determined by only 2 data points, which lie on the lines :math:`w \cdot x + b = 1` and :math:`w \cdot x + b = -1`
- these points, :math:`x_+` and :math:`x_-`, are called support vectors

.. image:: _static/svm/ex4.png

.. note::

    :math:`w \cdot x_1 + b` is 0 since :math:`x_1` is on the decision boundary

    :math:`w \cdot x_\gamma = 1` -> :math:`||w||*||x_\gamma|| = 1` since :math:`w, x_\gamma` are parallel

Therefore, we can modify the objective:

.. math::

    \min_{w, b, \xi} & \frac{1}{2}||w||^2 + C\sum_n \xi_n & \\
    \text{subj. to } & y_n(w \cdot x_n + b) \geq 1 - \xi_n & (\forall n) \\
    & \xi_n \geq 0 & (\forall n)

Or, intuitively, finding the smallest weights possible.

Hinge Loss
----------

We can write the slack variables in terms of :math:`(w, b)`:

.. math::

    \xi_n = & 0 & \text{ if } y_n(w\cdot x_n + b) \geq 1 \\
    & 1 - y_n(w\cdot x_n + b) & \text{ otherwise}

which is hinge loss! Now, the SVM objective becomes:

.. math::

    \min_{w, b} \frac{1}{2}||w||^2 + C\sum_n l^{(hin)}(y_n, w\cdot x_n + b)

Solving
-------

Hard-Margin SVM
^^^^^^^^^^^^^^^

.. math::

    \min_{w, b} & \frac{1}{2}||w||^2 \\
    \text{subj. to } & y_n(w \cdot x_n + b) \geq 1 \; (\forall n)

- convex optimization problem

  - specifically, a *quadratic programming* problem: minimizing a function that is quadratic in the variables, subject to linear constraints

- this is called the *primal form*, but most people solve the *dual form*

We can encode the primal form algebraically:

.. math::

    \min_{w, b} \max_{\alpha \geq 0} \frac{1}{2}(w \cdot w) + \sum_i \alpha_i (1-y_i(w \cdot x_i + b))

Dual Form
^^^^^^^^^

- does not change the solution
- introduces a new variable :math:`\alpha_n` for each training instance

.. math::

    \max & \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m,n=1}^N \alpha_m \alpha_n y_m y_n (x_m^T x_n) \\
    \text{subject to } & \sum_{n=1}^N \alpha_n y_n = 0, \; \alpha_n \geq 0; \; n = 1..N

Once the :math:`\alpha_n` are computed, **w** and b can be computed as:

.. math::

    w = \sum_{n=1}^N \alpha_n y_n x_n \\
    b = \text{something...}

As it turns out, most :math:`\alpha_i`'s are 0 - only those of the support vectors are not.

**For Soft-Margin SVM**

.. math::

    \max & \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m,n=1}^N \alpha_m \alpha_n y_m y_n (x_m^T x_n) \\
    \text{subject to } & \sum_{n=1}^N \alpha_n y_n = 0, \; 0 \leq \alpha_n \leq C; \; n = 1..N

For soft-margin SVMs, the support vectors are:

- points on the margin boundaries (:math:`\xi = 0`)
- points in the margin region (:math:`0 < \xi < 1`)
- points on the wrong side of the hyperplane (:math:`\xi \geq 1`)

**Conclusion**: **w** and b only depend on the support vectors
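Here is a quick sketch of how the dual solution is used, assuming some QP solver has already produced the :math:`\alpha_n` (the arrays and values below are made up for illustration, not real solver output): **w** is the weighted sum above, and the support vectors are exactly the points with :math:`\alpha_n > 0`. One common way to recover b, not spelled out in these notes, is to use a support vector on the margin, where :math:`y_s(w \cdot x_s + b) = 1`:

.. code-block:: python

    import numpy as np

    # Illustrative dual variables for a toy dataset (not a real solver output);
    # they do satisfy sum_n alpha_n y_n = 0 and alpha_n >= 0.
    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    alpha = np.array([0.125, 0.0, 0.125, 0.0])    # most alphas are 0

    # w = sum_n alpha_n y_n x_n  -- a weighted sum of the signed support vectors
    w = (alpha * y) @ X

    # Support vectors are exactly the points with alpha_n > 0
    support = alpha > 1e-8
    print(X[support])

    # One common way to get b (an assumption here, not stated in the notes):
    # pick a support vector on the margin, where y_s (w . x_s + b) = 1
    s = np.argmax(support)                        # index of the first support vector
    b = y[s] - X[s] @ w
    print(w, b)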
Derivation
""""""""""

Given the algebraically encoded primal form:

.. math::

    \min_{w, b} \max_{\alpha \geq 0} \frac{1}{2}(w \cdot w) + \sum_i \alpha_i (1-y_i(w \cdot x_i + b))

We can switch the order of the min and max:

.. math::

    \max_{\alpha \geq 0} \min_{w, b} \frac{1}{2}(w \cdot w) + \sum_i \alpha_i (1-y_i(w \cdot x_i + b)) \\
    = \max_{\alpha \geq 0} \min_{w, b} L(w, b, \alpha)

To solve the inner min, differentiate :math:`L` with respect to :math:`w` and :math:`b`:

.. math::

    \frac{\partial L(w, b, \alpha)}{\partial w_k} & = w_k - \sum_i \alpha_i y_i x_{i,k} & \\
    \frac{\partial L(w, b, \alpha)}{\partial w} & = w - \sum_i \alpha_i y_i x_i & \to w = \sum_i \alpha_i y_i x_i \\
    \frac{\partial L(w, b, \alpha)}{\partial b} & = - \sum_i \alpha_i y_i & \to \sum_i \alpha_i y_i = 0

- :math:`w = \sum_i \alpha_i y_i x_i` means **w** is a weighted sum of examples
- :math:`\sum_i \alpha_i y_i = 0` means positive and negative examples have the same total weight
- :math:`\alpha_i > 0` only when :math:`x_i` is a support vector, so **w** is a sum of signed *support vectors*

.. image:: _static/svm/ex5.png

.. image:: _static/svm/ex6.png

**Conclusion**

.. image:: _static/svm/ex7.png

**Soft-Margin**

.. image:: _static/svm/ex8.png

subject to :math:`0 \leq \alpha_i \leq C` :math:`(\forall i)`

Non-Linearly-Separable
----------------------

What if our data is not linearly separable?

- use a non-linear classifier
- transform our data so that it is, somehow

  - e.g. adding a dummy dimension based on a quadratic function of the real dimension

.. image:: _static/svm/ex9.png
    :width: 500

Feature Mapping
^^^^^^^^^^^^^^^

We can map the original feature vector to a higher-dimensional space :math:`\phi(x)`, e.g. quadratic feature mapping:

.. math::

    \phi(x) = < & 1, 2x_1, 2x_2, ..., 2x_D, \\
    & x_1^2, x_1x_2, ..., x_1x_D, \\
    & x_2x_1, x_2^2, ..., x_2x_D, \\
    & ... >

Pros: this improves separability, so you can apply a linear model more confidently.

Cons: there are a lot more features now, and a lot of repeated ones - more computation, and it's easier to overfit.
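As a small illustration of the cons, here is the quadratic feature map written out in code, following the expansion above (``quadratic_phi`` is just an illustrative name; the constant factors mirror the notes). The mapped vector has :math:`1 + D + D^2` entries, so the dimensionality blows up quadratically:

.. code-block:: python

    import numpy as np

    def quadratic_phi(x):
        """Quadratic feature map: <1, 2x_1, ..., 2x_D, x_1^2, x_1 x_2, ..., x_D^2>."""
        x = np.asarray(x, dtype=float)
        linear = 2.0 * x                   # 2x_1, ..., 2x_D
        pairwise = np.outer(x, x).ravel()  # every product x_i x_j, including repeats like x_1 x_2 and x_2 x_1
        return np.concatenate(([1.0], linear, pairwise))

    x = np.array([3.0, -1.0, 2.0])         # D = 3 original features
    print(quadratic_phi(x).shape)          # (13,)  ->  1 + D + D^2 mapped features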