Perceptron
==========
Perceptron is a linear, online classification model.
Given a training set of (instance, label) pairs, it learns a hyperplane that acts as a linear decision boundary - we assume labels are binary (:math:`\pm 1`) for now.
It's inspired by neurons: the activation is a function of the inputs and weights. For example, the weighted-sum activation:

.. math::

    a = \sum_{i=1}^D w_i x_i

Then, prediction can be something like ``a > 0 ? 1 : -1``.
Additionally, we can add a bias term to account for a non-zero intercept:

.. math::

    a = \left[ \sum_{i=1}^D w_i x_i \right] + b
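As a quick numeric sketch of the biased activation (the weights, input, and bias here are made-up example values):

```python
import numpy as np

# made-up example values
w = np.array([0.5, -0.2, 0.1])  # one weight per feature
x = np.array([1.0, 2.0, 3.0])   # a single input instance
b = 0.25                        # bias term

# a = [sum_i w_i * x_i] + b
a = w @ x + b
print(a)  # 0.5*1 - 0.2*2 + 0.1*3 + 0.25 = 0.65
```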
Linear Boundary
---------------
- a ``D-1`` dimensional hyperplane separates a ``D`` dimensional space into two half-spaces: positive and negative
- this linear boundary has the form :math:`\mathbf{w} \cdot \mathbf{x} = 0`
- defined by **w**: a vector (often normalized to unit length) that is normal to the hyperplane, i.e. orthogonal to every vector lying in it
- :math:`\text{proj}_{\mathbf{w}} \mathbf{x}` is the signed distance from *x* to the decision boundary
- when **w** is normalized to a unit vector, :math:`\mathbf{w} \cdot \mathbf{x} = \text{proj}_{\mathbf{w}} \mathbf{x}`
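A small check of that last claim, with made-up values: once **w** is normalized, the dot product gives the signed distance to the boundary.

```python
import numpy as np

w = np.array([3.0, 4.0])        # normal to the hyperplane w . x = 0
w_unit = w / np.linalg.norm(w)  # normalize to unit length

x = np.array([2.0, 1.0])        # a made-up point

# signed distance from x to the boundary; positive half-space here
signed_dist = w_unit @ x
print(signed_dist)  # (3*2 + 4*1) / 5 = 2.0
```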
.. image:: _static/perceptron/ex1.png
    :width: 450
**With Bias**
- When a bias is added, the linear boundary becomes :math:`\mathbf{w} \cdot \mathbf{x} + b = 0`
- this can be converted to the more general form :math:`\mathbf{w} \cdot \mathbf{x} = 0` by appending *b* to **w** as an extra weight and appending an always-1 feature to **x**
Prediction
----------
Pretty simple:

.. code:: py

    import numpy as np

    def prediction(w, x, b):
        return np.sign(w @ x + b)
Training
--------
This is an error-driven model:
1. initialize the model to some weights and biases
2. for each instance in the training set:

   1. use the current **w** and *b* to predict a label :math:`\hat{y}`
   2. if :math:`\hat{y} = y`, do nothing
   3. otherwise, update **w** and *b* to do better

3. goto 2
.. image:: _static/perceptron/ex2.png
    :width: 500
The update (applied only on a mistake), in slightly simpler notation:

.. math::

    \mathbf{w} & = \mathbf{w} + y \mathbf{x} \\
    b & = b + y
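The training procedure above can be sketched as a minimal implementation (function and variable names here are my own):

```python
import numpy as np

def train_perceptron(X, y, epochs=10, rng=None):
    """Error-driven perceptron training.

    X: (n, D) array of instances; y: length-n array of +1/-1 labels.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):   # shuffle each pass (see Caveats)
            a = w @ X[i] + b
            if y[i] * a <= 0:          # mistake (or exactly on the boundary)
                w = w + y[i] * X[i]    # update toward the true label
                b = b + y[i]
    return w, b
```

On a linearly separable set, enough epochs drive the number of mistakes per pass to zero.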
So what does it do? Let's look at the new activation after an update where a positive instance was incorrectly predicted as negative:
.. image:: _static/perceptron/ex3.png
So for the given example, the activation is increased by the positive quantity :math:`\sum_{i=1}^D x_i^2 + 1`, bringing
the prediction closer to correct for that one sample.
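A quick numeric check of that claim, with made-up values (:math:`y = +1`, i.e. a false negative):

```python
import numpy as np

w = np.array([-1.0, 0.5])
b = 0.0
x = np.array([2.0, 1.0])
y = 1  # true label is positive

a_before = w @ x + b   # -1.5: wrongly predicted negative
w_new = w + y * x      # perceptron update
b_new = b + y
a_after = w_new @ x + b_new

# the activation rises by exactly sum(x_i^2) + 1 = 4 + 1 + 1 = 6
print(a_before, a_after)  # -1.5 4.5
```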
We can also control the learning rate easily using a term :math:`\eta`:

.. math::

    \mathbf{w} = \mathbf{w} + \eta y \mathbf{x}
Caveats
^^^^^^^
- the order of the training instances is important!

  - e.g. all positives followed by all negatives is bad

- it's recommended to permute the training data after each iteration
Example
-------
.. code:: text

    x1  x2  y    wx   w (after update, if any)
    ------------------------------------------
                      <0, 0>
     1   3  +     0   <1, 3>
     2   3  -    11   <-1, 0>
    -3   1  +     3   <-1, 0>
     1  -1  -    -1   <-1, 0>
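The trace in this table can be reproduced with a short script (no bias term, matching the table; an activation of exactly 0 counts as a mistake):

```python
import numpy as np

data = [((1, 3), +1), ((2, 3), -1), ((-3, 1), +1), ((1, -1), -1)]

w = np.zeros(2)
for x, y in data:
    x = np.array(x, dtype=float)
    a = w @ x
    if y * a <= 0:        # wrong side (or exactly on the boundary)
        w = w + y * x
    print(f"x={x} y={y:+d} wx={a} w={w}")
# final w is <-1, 0>, matching the last row of the table
```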
Convergence
-----------
We can define convergence as making one full pass through the training data without any updates.
If the training data is linearly separable, perceptron will converge - if not, it will never converge.
How long perceptron takes to converge depends on how *easy* the dataset is - roughly, how separated the two classes
are from each other (i.e. the higher the *margin*, the easier the dataset, where the *margin* is the distance
from the hyperplane to the closest datapoint).
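That stopping criterion - a full pass with no updates - can be sketched as follows (names are my own):

```python
import numpy as np

def train_until_converged(X, y, max_epochs=1000):
    """Run perceptron passes until one full pass makes no updates.

    Returns (w, b, converged); converged can only become True
    when the data is linearly separable.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        updates = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # mistake: update
                w = w + yi * xi
                b = b + yi
                updates += 1
        if updates == 0:
            return w, b, True            # converged
    return w, b, False                   # epoch cap hit - likely not separable
```

On a non-separable set (e.g. XOR), every pass makes at least one update, so the loop only stops at the epoch cap.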
Proof
^^^^^
**Overview**
.. image:: _static/perceptron/proof1.png
**Steps**
.. image:: _static/perceptron/proof2.png
**Simplification 1**
.. image:: _static/perceptron/proof3.png
**Simplification 2**
.. image:: _static/perceptron/proof4.png
**Simplification 3**
.. image:: _static/perceptron/proof5.png
**Analysis Setup**
.. image:: _static/perceptron/proof6.png
.. image:: _static/perceptron/proof7.png
**Finishing Up**
.. image:: _static/perceptron/proof8.png