Linear Models
=============

If your data is linearly separable, the perceptron will find you a separating hyperplane. But what if my data isn't linearly separable?

- the perceptron will still find a hyperplane, but one that makes some errors
- what about a hyperplane that makes a *minimal* number of errors?

Minimum Error Hyperplane
------------------------

The error of a linear model :math:`(\mathbf{w}, b)` on an instance :math:`(\mathbf{x_n}, y_n)` is:

.. math::

   \mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x_n} + b) \leq 0]

where :math:`\mathbf{1}` is an indicator function that returns 1 on an incorrect prediction and 0 on a correct one (the 0-1 loss).

Based on this, we can write an objective whose minimizer is the minimum error hyperplane:

.. math::

   \min_{\mathbf{w}, b} \sum_n \mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x_n} + b) \leq 0]

This is ERM: **empirical risk minimization**. But there are problems:

- the loss function is not convex
- it is not differentiable

Alternatives to 0-1 Loss
^^^^^^^^^^^^^^^^^^^^^^^^

We need an upper bound on the 0-1 loss that is convex, so that minimization is easy. Because it is an upper bound, pushing down the surrogate objective also pushes down the real objective.

Given :math:`y, a` (label, activation):

- 0/1: :math:`l^{(0/1)}(y, a) = \mathbf{1}[ya \leq 0]`
- hinge: :math:`l^{(hin)}(y, a) = \max\{0, 1-ya\}`
- logistic: :math:`l^{(log)}(y, a) = \frac{1}{\log 2} \log(1 + \exp[-ya])`
- exponential: :math:`l^{(exp)}(y, a) = \exp[-ya]`

.. image:: _static/linearmodels/ex1.png
   :width: 450

These are all convex functions and can be minimized using SGD. The only wrinkle is the hinge loss, which is not differentiable at the point :math:`ya = 1` (see the code sketches at the end of these notes).

Sub-gradient Descent
^^^^^^^^^^^^^^^^^^^^

How do we minimize a non-differentiable function?

- apply gradient descent anyway, wherever the gradient exists
- at non-differentiable points, use a sub-gradient
- a sub-gradient of :math:`f(z)` at a point :math:`z'` is the slope of any line that touches :math:`f(z)` at :math:`z'` but lies below :math:`f(z)`
- at differentiable points, the sub-gradient is just the gradient

.. image:: _static/linearmodels/ex2.png
   :width: 450
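
To make the losses above concrete, here is a minimal NumPy sketch (not from the original notes; the function names are my own). Each function takes a label :math:`y \in \{-1, +1\}` and an activation :math:`a = \mathbf{w} \cdot \mathbf{x} + b`, and ``empirical_risk`` evaluates the ERM objective for a chosen loss:

.. code-block:: python

   import numpy as np

   def loss_01(y, a):
       """0-1 loss: 1 when the prediction is wrong (ya <= 0), else 0."""
       return np.where(y * a <= 0, 1.0, 0.0)

   def loss_hinge(y, a):
       """Hinge loss: max{0, 1 - ya}. Convex upper bound with a kink at ya = 1."""
       return np.maximum(0.0, 1.0 - y * a)

   def loss_logistic(y, a):
       """Logistic loss: (1 / log 2) * log(1 + exp(-ya))."""
       return np.log1p(np.exp(-y * a)) / np.log(2.0)

   def loss_exponential(y, a):
       """Exponential loss: exp(-ya)."""
       return np.exp(-y * a)

   def empirical_risk(loss, w, b, X, y):
       """Sum of the given loss over all instances (the ERM objective)."""
       return loss(y, X @ w + b).sum()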
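
And a minimal sketch of sub-gradient descent on the summed hinge loss, assuming ``X`` is an N x D array of instances and ``y`` holds labels in {-1, +1}; the step size, epoch count, and names are illustrative, not from the notes:

.. code-block:: python

   import numpy as np

   def subgradient_descent_hinge(X, y, lr=0.01, epochs=100):
       """Minimize sum_n max{0, 1 - y_n (w . x_n + b)} by sub-gradient descent.

       Where the margin y_n (w . x_n + b) exceeds 1 the hinge loss is flat,
       so those points contribute nothing; at the kink (margin exactly 1)
       we pick the sub-gradient 0, which is a valid choice.
       """
       n, d = X.shape
       w, b = np.zeros(d), 0.0
       for _ in range(epochs):
           margins = y * (X @ w + b)
           active = margins < 1  # points with a non-zero sub-gradient
           # sub-gradient of the summed hinge loss w.r.t. w and b
           grad_w = -(y[active][:, None] * X[active]).sum(axis=0)
           grad_b = -y[active].sum()
           w -= lr * grad_w
           b -= lr * grad_b
       return w, b

   # toy usage: two overlapping Gaussian blobs (not linearly separable)
   rng = np.random.default_rng(0)
   X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
   y = np.hstack([-np.ones(50), np.ones(50)])
   w, b = subgradient_descent_hinge(X, y)

Averaging the loss over :math:`n` or adding a regularizer would be the usual next refinements, but this bare version matches the summed objective above.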