Linear Models

If your data is linearly separable, perceptron will find you a separating hyperplane.

But what if my data isn’t linearly separable?

  • perceptron will find a hyperplane that makes some errors
  • what about a hyperplane that makes the minimal number of errors?

Minimum Error Hyperplane

The error of a linear model \((\mathbf{w}, b)\) for an instance \((\mathbf{x_n}, y_n)\) is:

\[\mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x_n} + b) \leq 0]\]

where \(\mathbf{1}\) is an indicator function that returns 1 on incorrect prediction and 0 on correct (0-1 loss)

Based on this, we can write an objective for finding the minimum-error hyperplane:

\[\min_{\mathbf{w}, b} \sum_n \mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x_n} + b) \leq 0]\]

This is ERM: empirical risk minimization.
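
As a minimal sketch of this objective (assuming NumPy, labels in \(\{-1, +1\}\), and hypothetical variable names), the empirical 0-1 risk of a hyperplane \((\mathbf{w}, b)\) is just a count of misclassified instances:

```python
import numpy as np

def zero_one_risk(w, b, X, y):
    """Empirical 0-1 risk: number of instances with y_n (w . x_n + b) <= 0.

    X: (n, d) array of instances, y: (n,) array of labels in {-1, +1}.
    """
    activations = X @ w + b              # a_n = w . x_n + b
    return int(np.sum(y * activations <= 0))

# hypothetical toy data for illustration
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([+1, -1, -1])
print(zero_one_risk(np.array([0.5, 0.5]), 0.0, X, y))  # number of errors
```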

But there are problems:

  • the loss function is not convex
  • it is not differentiable, and where it is, the gradient is zero (the loss is piecewise constant)

Alternatives to 0-1 Loss

We want an upper bound on the 0-1 loss that is convex, so that minimization is easy. Since it is an upper bound, pushing it down also pushes down the real objective.

Given \(y, a\) (label, activation):

  • 0/1: \(l^{(0/1)}(y, a) = 1[ya \leq 0]\)
  • hinge: \(l^{(hin)}(y, a) = \max\{0, 1-ya\}\)
  • logistic: \(l^{(log)}(y, a) = \frac{1}{\log 2} \log(1 + \exp[-ya])\)
  • exponential: \(l^{(exp)}(y, a) = \exp[-ya]\)

The hinge, logistic, and exponential losses are all convex upper bounds on the 0/1 loss and can be minimized using SGD; the only complication is that the hinge loss is not differentiable at the point \(ya = 1\).
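
As a quick sketch (assuming NumPy; the function names here are mine, not from any particular library), the four losses as functions of the label \(y\) and activation \(a\):

```python
import numpy as np

def zero_one_loss(y, a):
    return float(y * a <= 0)

def hinge_loss(y, a):
    return max(0.0, 1.0 - y * a)

def logistic_loss(y, a):
    return np.log1p(np.exp(-y * a)) / np.log(2)

def exponential_loss(y, a):
    return np.exp(-y * a)

# at a confident mistake (y = +1, a = -2) each surrogate upper-bounds the 0/1 loss:
y, a = +1, -2.0
print(zero_one_loss(y, a),      # 1.0
      hinge_loss(y, a),         # 3.0
      logistic_loss(y, a),      # ~3.07
      exponential_loss(y, a))   # ~7.39
```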

Sub-gradient Descent

How do we minimize a non-differentiable function?

  • apply gradient descent anyway, wherever the gradient exists
  • at non-differentiable points, use a sub-gradient
  • a sub-gradient of \(f(z)\) at a point \(z'\) is the slope of any line that touches \(f\) at \(z'\) but lies below \(f\) everywhere else
  • at differentiable points, the sub-gradient is just the gradient
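
Below is a minimal sketch (assuming NumPy and the \(\{-1, +1\}\) label convention used above) of sub-gradient descent on the summed hinge loss. Inside the margin the sub-gradient contribution of an instance is \((-y_n \mathbf{x_n}, -y_n)\); at the kink \(y_n a_n = 1\), choosing 0 is a valid sub-gradient:

```python
import numpy as np

def hinge_subgradient_step(w, b, X, y, lr=0.01):
    """One sub-gradient descent step on sum_n max(0, 1 - y_n (w . x_n + b))."""
    margins = y * (X @ w + b)
    active = margins < 1                 # instances with a nonzero sub-gradient
    grad_w = -(y[active][:, None] * X[active]).sum(axis=0)
    grad_b = -y[active].sum()
    return w - lr * grad_w, b - lr * grad_b

# hypothetical toy run on linearly separable data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w, b = np.zeros(2), 0.0
for _ in range(100):
    w, b = hinge_subgradient_step(w, b, X, y)
print(w, b, int(np.sum(y * (X @ w + b) <= 0)))  # learned hyperplane and its 0/1 errors
```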