# Linear Models

If your data is linearly separable, the perceptron will find you a separating hyperplane.

But what if my data isn’t linearly separable?

• the perceptron will find a hyperplane that makes some errors
• what about a hyperplane that makes a minimal number of errors?

## Minimum Error Hyperplane

The error of a linear model $$(\mathbf{w}, b)$$ on an instance $$(\mathbf{x}_n, y_n)$$ is:

$\mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x}_n + b) \leq 0]$

where $$\mathbf{1}[\cdot]$$ is an indicator function that returns 1 on an incorrect prediction and 0 on a correct one (the 0-1 loss).

Based on this, we can write an objective for finding the minimum error hyperplane: minimize the total number of errors over the training set:

$\min_{\mathbf{w}, b} \sum_n \mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x}_n + b) \leq 0]$

This is ERM: empirical risk minimization.
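
To make this concrete, here is a minimal NumPy sketch of the 0-1 empirical risk; the function name `zero_one_risk` and the array layout are my own choices for illustration:

```python
import numpy as np

def zero_one_risk(w, b, X, y):
    """Empirical 0-1 risk of the linear model (w, b).

    X is an (n, d) array of instances and y an (n,) array of labels
    in {-1, +1}; an instance counts as an error when
    y_n * (w . x_n + b) <= 0.
    """
    margins = y * (X @ w + b)         # y_n (w . x_n + b) for every n
    return int(np.sum(margins <= 0))  # sum of the 0-1 indicators
```

Evaluating this count is easy; minimizing it over $$(\mathbf{w}, b)$$ is what runs into the problems below.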

But there are problems:

• the loss function is not convex
• it is not differentiable: the objective is piecewise constant, so there is no gradient to follow

### Alternatives to 0-1 Loss

We want an upper bound on the 0-1 loss that is convex, so that minimizing it is easy; each of the surrogates below lies on or above the 0-1 loss everywhere. Because they are upper bounds, pushing a surrogate down also pushes the real objective down.

Given a label $$y$$ and an activation $$a = \mathbf{w} \cdot \mathbf{x} + b$$:

• 0/1: $$l^{(0/1)}(y, a) = 1[ya \leq 0]$$
• hinge: $$l^{(hin)}(y, a) = \max\{0, 1-ya\}$$
• logistic: $$l^{(log)}(y, a) = \frac{1}{\log 2} \log(1 + \exp[-ya])$$
• exponential: $$l^{(exp)}(y, a) = \exp[-ya]$$
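
As a sketch, these losses might be written in NumPy as follows (the helper names are mine, not from the notes):

```python
import numpy as np

# Each helper takes a label y in {-1, +1} and an activation a = w . x + b;
# all are vectorized over arrays.

def loss_01(y, a):
    return np.where(y * a <= 0, 1.0, 0.0)  # indicator of a mistake

def loss_hinge(y, a):
    return np.maximum(0.0, 1.0 - y * a)    # max{0, 1 - ya}

def loss_logistic(y, a):
    # logaddexp(0, -ya) is a numerically stable log(1 + exp(-ya))
    return np.logaddexp(0.0, -y * a) / np.log(2.0)

def loss_exponential(y, a):
    return np.exp(-y * a)                   # exp(-ya)
```

The $$1 / \log 2$$ scaling makes the logistic loss equal 1 at $$ya = 0$$, so it, like the others, stays on or above the 0-1 loss.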

These are all convex, so they can be minimized with SGD. The one wrinkle is the hinge loss, which is not differentiable at the point $$ya = 1$$; there we fall back on a sub-gradient:

• a sub-gradient of $$f(z)$$ at a point $$z'$$ is the slope of any line that touches $$f$$ at $$z'$$ but lies on or below $$f$$ everywhere; where $$f$$ is differentiable, the gradient is the only sub-gradient
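
For the hinge loss, one convenient sub-gradient with respect to $$\mathbf{w}$$ is $$-y_n \mathbf{x}_n$$ whenever $$y_n (\mathbf{w} \cdot \mathbf{x}_n + b) < 1$$, and $$\mathbf{0}$$ otherwise ($$0$$ is a valid sub-gradient at the kink). A minimal sub-gradient SGD sketch built on that choice; the function name and hyperparameters are illustrative:

```python
import numpy as np

def sgd_hinge(X, y, epochs=10, lr=0.1, seed=0):
    """Stochastic sub-gradient descent on the hinge loss.

    X: (n, d) instances, y: (n,) labels in {-1, +1}.
    Steps opposite the sub-gradient: when the margin y_n (w . x_n + b)
    is below 1, the sub-gradient w.r.t. w is -y_n x_n (and -y_n for b);
    otherwise it is 0 and no update is made.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:  # misclassified or inside the margin
                w += lr * y[i] * X[i]      # w <- w - lr * (-y_n x_n)
                b += lr * y[i]
    return w, b
```

Note the resemblance to the perceptron update: the only difference is that updates fire whenever the margin falls below 1, not only on outright mistakes.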