Probability Review
==================

Useful notes: http://cs229.stanford.edu/section/cs229-prob.pdf

Let's define some important things.

- **Outcome Space**: :math:`\Omega`

  - contains all possible atomic outcomes
  - each outcome (atom) has a probability *mass* or *density* (discrete v. continuous spaces)
  - an *event* is a subset of :math:`\Omega`
  - :math:`P(event)` is a sum (or integral) over the event's outcomes
  - a *random variable* :math:`V` maps :math:`\Omega` to (usually) :math:`\mathbb{R}`
  - :math:`V = value` is an event, and :math:`P(V)` is a distribution

.. note::
    **Example**: roll a fair 6-sided die, then flip that many fair coins.

    - :math:`\Omega = \{(1, H), (1, T), (2, HH), (2, HT), ...\}`
    - let the number of heads be a random variable :math:`V`
    - so what's the expected number of heads?
    - :math:`E(V) = \sum_{\text{atoms } a} P(a)V(a)`

Let's look at some other properties to figure this out.

- Events :math:`A` and :math:`B` are *independent* iff:

  - :math:`P(A \text{ and } B) = P(A) * P(B)`

- Conditional probability

  - :math:`P(A | B) = \frac{P(A, B)}{P(B)}`

- Product rule

  - :math:`P(A, B) = P(A|B) * P(B)`
  - :math:`P(B, A) = P(B|A) * P(A)`

- **Bayes' Rule**

  - :math:`P(A|B) = P(B|A) \frac{P(A)}{P(B)}`

- Expectations add

  - :math:`E(V_1 + V_2) = E(V_1) + E(V_2)`

- Rule of conditioning (sum rule)

  - if events :math:`e_1, e_2, ..., e_k` partition :math:`\Omega`, then:

    - :math:`P(event) = \sum_i P(e_i) P(event | e_i)`
    - :math:`E(rv) = \sum_i P(e_i) E(rv | e_i)`

.. note::
    Back to the expected number of heads:

    .. math::

        E(\text{# heads}) & = \sum_{r=1}^6 P(roll = r) E(\text{# heads} | roll = r) \\
        & = \frac{1}{6}\left(\frac{1+2+3+4+5+6}{2}\right) \\
        & = \frac{21}{12} = 1.75

- Joint distributions factor

  - if :math:`\Omega = S \times T \times U`, then :math:`P(S=s, T=t, U=u)` is :math:`P(S=s)P(T=t|S=s)P(U=u|S=s, T=t)`

- Conditional distributions are also distributions

  - :math:`P(A|B) = \frac{P(A, B)}{P(B)}`, so :math:`P(A|B, C) = \frac{P(A, B|C)}{P(B|C)}`

Bayes' Rule for Learning
------------------------

- Assume a joint distribution :math:`P(\mathbf{X}=\mathbf{x}, Y=y)`
- We want :math:`P(Y=y|\mathbf{X}=\mathbf{x})` for each label :math:`y` on a new instance :math:`\mathbf{x}`
- So, using Bayes' Rule, :math:`P(y|\mathbf{x}) = P(\mathbf{x}|y) \frac{P(y)}{P(\mathbf{x})}`
- :math:`P(\mathbf{x})` doesn't depend on :math:`y`, so :math:`P(y|\mathbf{x})` is proportional to :math:`P(\mathbf{x}|y) P(y)`
- From the data, we can learn :math:`P(\mathbf{x}|y)` and :math:`P(y)`
- Predict the label :math:`y` with the largest product

So how do we learn :math:`P(\mathbf{x}|y)`?

.. note::
    Take for example a coin flip. You observe the sequence HTH; what is the probability that the next flip is H?

    The maximum-likelihood answer is 2/3: the likelihood function is :math:`L(\theta) = P(HTH|\theta) = \theta^2 (1-\theta)`, and setting its derivative to zero gives :math:`\theta = 2/3`.

.. note::
    But what if we have a prior belief :math:`P(\theta)`, where :math:`\theta = P(H)`? Now the posterior on :math:`\theta` becomes :math:`P(\theta | HTH)`:

    .. math::

        P(\theta | HTH) = P(HTH | \theta) \frac{P(\theta)}{P(HTH)}

    or in this case

    .. math::

        \frac{\theta^2 (1-\theta) P(\theta)}{normalization}

**Discrete Prior**

Taking :math:`P(\theta=0) = P(\theta=1/2) = P(\theta=1) = 1/3`, :math:`\theta^2 (1-\theta) P(\theta)` is 0, 1/24, and 0 for the three cases respectively. Thus, the posterior is :math:`P(\theta = 1/2 | HTH) = 1`.
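Both coin-flip calculations are easy to check numerically. Below is a minimal Python sketch (an illustration, not part of the notes) assuming NumPy and a simple grid search over :math:`\theta`; the ``likelihood`` helper is a made-up name for :math:`P(HTH|\theta)`.

.. code-block:: python

    import numpy as np

    def likelihood(theta):
        """P(HTH | theta) for a coin with P(H) = theta."""
        return theta**2 * (1 - theta)

    # Maximum likelihood: evaluate the likelihood on a fine grid over [0, 1].
    grid = np.linspace(0, 1, 10001)
    theta_ml = grid[np.argmax(likelihood(grid))]
    print(theta_ml)        # ~0.6667, i.e. theta_ML = 2/3

    # Discrete prior: P(theta) uniform over {0, 1/2, 1}.
    thetas = np.array([0.0, 0.5, 1.0])
    prior = np.full(3, 1 / 3)
    unnormalized = likelihood(thetas) * prior       # [0, 1/24, 0]
    posterior = unnormalized / unnormalized.sum()
    print(posterior)       # [0., 1., 0.] -> P(theta = 1/2 | HTH) = 1

The grid maximizer recovers :math:`\theta_{ML} = 2/3`, and normalizing the prior-weighted likelihoods puts all posterior mass on :math:`\theta = 1/2`, matching the discrete-prior example above.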
**Prior Density**

- :math:`P(\theta) = 1` for :math:`0 \leq \theta \leq 1`
- so :math:`\theta^2 (1-\theta) P(\theta)` is just :math:`\theta^2 (1-\theta)`
- the normalization constant is :math:`\int_0^1 \theta^2 (1-\theta) d\theta = \frac{1}{12}`, so the posterior is :math:`12\, \theta^2 (1-\theta)`
- if we plot this, the max is at :math:`\theta = 2/3`

In general:

- Treat the parameter :math:`\theta` as a random variable with the prior distribution :math:`P(\theta)`, and observe training data :math:`Z`

  - :math:`posterior = \frac{prior * likelihood}{constant}`
  - :math:`P(\theta | Z) = \frac{P(\theta) P(Z | \theta)}{P(Z)}`

Bayes' Estimation
-----------------

Treat the parameter :math:`\theta'` as an RV with the prior distribution :math:`P(\theta)`; use fixed data :math:`Z = (\mathbf{x}, y)` (RV :math:`S`).

Maximum Likelihood
^^^^^^^^^^^^^^^^^^

.. math::

    \theta_{ML} = \arg \max_{\theta'} P(S=Z | \theta = \theta')

Maximum a Posteriori
^^^^^^^^^^^^^^^^^^^^

.. math::

    \theta_{MAP} & = \arg \max_{\theta'} P(\theta = \theta' | S=Z) \\
    & = \arg \max_{\theta'} P(S=Z | \theta = \theta') \frac{P(\theta = \theta')}{P(S=Z)}

Predictive Distribution
^^^^^^^^^^^^^^^^^^^^^^^

aka Full Bayes

.. math::

    P(Y=y | S=Z) = \int P(Y=y | \theta=\theta') P(\theta=\theta' | S=Z) d\theta'

Mean a'Post
^^^^^^^^^^^

.. math::

    \theta_{mean} = E[\theta | S=Z] = \int \theta' P(\theta=\theta' | S=Z) d\theta'

Use
^^^

- one approach: draw enough data so that :math:`P(Y=y | X=\mathbf{x})` is estimated for every possible pair

  - this takes a lot of data

- another approach: use a class of models

  - think of each model :math:`m` as a way of generating the training set :math:`Z` of :math:`(\mathbf{x}, y)` pairs

Compound Experiment
^^^^^^^^^^^^^^^^^^^

- prior :math:`P(M=m)` on the model space
- models give :math:`P(X=x | M=m)` (where :math:`x` is a pair :math:`(\mathbf{x}, y)`)
- the joint experiment (if the data is iid given :math:`m`) is:

.. math::

    P(\{(\mathbf{x_i}, y_i)\}, m) = P(m) \prod_i P(\mathbf{x_i} | m) P(y_i | \mathbf{x_i}, m)

Generative and Discriminative Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Generative model: :math:`P((\mathbf{x}, y) | m)`

  - tells how to generate examples (both instance and label)
  - learn :math:`P(\mathbf{x} | y, m)` and use Bayes' rule (a small sketch follows this list)
  - common assumptions:

    - :math:`P(\mathbf{x} | y, m)` is Gaussian
    - :math:`P(y | m)` is Bernoulli

- Discriminative model: :math:`P(y | h, \mathbf{x})`

  - tells how to create labels from instances
  - often :math:`f(\mathbf{x}) = \arg \max_y f_y(\mathbf{x})`
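To make the generative recipe concrete, here is a minimal sketch (not from the notes) assuming a single real-valued feature, a Gaussian class-conditional :math:`P(x | y)` per label, and a Bernoulli label prior :math:`P(y)`; the toy data and the ``predict`` helper are made up for illustration.

.. code-block:: python

    import numpy as np
    from scipy.stats import norm

    # Toy 1-D training data (made up for illustration): instances x, labels y in {0, 1}.
    x = np.array([1.0, 1.2, 0.8, 3.1, 2.9, 3.3])
    y = np.array([0, 0, 0, 1, 1, 1])

    # Generative learning: P(y) is Bernoulli, P(x | y) is Gaussian for each class.
    p_y1 = y.mean()                                        # Bernoulli parameter P(y = 1)
    params = {c: (x[y == c].mean(), x[y == c].std(ddof=1)) for c in (0, 1)}

    def predict(x_new):
        """Return the label maximizing P(x | y) * P(y); P(x) cancels by Bayes' rule."""
        scores = {}
        for c, (mu, sigma) in params.items():
            prior = p_y1 if c == 1 else 1 - p_y1
            scores[c] = norm.pdf(x_new, mu, sigma) * prior
        return max(scores, key=scores.get)

    print(predict(1.1))   # -> 0
    print(predict(3.0))   # -> 1

Prediction is exactly the "largest product" rule from the Bayes' Rule for Learning section; a discriminative model would instead fit :math:`P(y | \mathbf{x})` (or scoring functions :math:`f_y(\mathbf{x})`) directly, without modeling :math:`P(\mathbf{x} | y)` at all.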