Probability Review

Useful notes: http://cs229.stanford.edu/section/cs229-prob.pdf

Let’s define some important things.

  • Outcome Space: \(\Omega\) - contains all possible atomic outcomes
  • each outcome (atom) has probability density or mass (discrete v. continuous spaces)
  • an event is a subset of \(\Omega\)
  • \(P(event)\) is a sum (or integral) over event’s outcomes
  • a random variable \(V\) maps \(\Omega\) to (usually) \(\mathbb{R}\)
  • \(V = \text{value}\) is an event, and \(P(V)\) denotes the distribution of \(V\)

Note

Example: rolling a fair 6-sided die and then flipping that many fair coins

  • \(\Omega = \{(1, H), (1, T), (2, HH), (2, HT), ...\}\)
  • let number of heads be a random variable
  • so what’s the expected number of heads?
    • \(E(V) = \sum_{\text{atoms } a} P(a)V(a)\)
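The atom-level formula above can be checked directly by enumerating the whole outcome space of the die-then-coins experiment; a short sketch:

```python
from itertools import product

# E[V] = sum over atoms a of P(a) * V(a) for the die-then-coins experiment.
# An atom is (roll, coin sequence); its probability is (1/6) * (1/2)**roll.
expected_heads = 0.0
for roll in range(1, 7):
    for coins in product("HT", repeat=roll):
        p_atom = (1 / 6) * (1 / 2) ** roll
        expected_heads += p_atom * coins.count("H")

print(expected_heads)  # ~ 1.75
```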

Let’s look at some other properties to figure this out.

  • Events A and B are independent iff:
    • \(P(A \text{ and } B) = P(A) * P(B)\)
  • Conditional probability
    • \(P(A | B) = \frac{P(A, B)}{P(B)}\)
  • Product Rule
    • \(P(A, B) = P(A|B) * P(B)\)
    • \(P(B, A) = P(B|A) * P(A)\)
  • Bayes Rule
    • \(P(A|B) = P(B|A) \frac{P(A)}{P(B)}\)
  • Expectations add
    • \(E(V_1 + V_2) = E(V_1) + E(V_2)\)
  • rule of conditioning (sum rule)
    • if events \(e_1, e_2, ... , e_k\) partition \(\Omega\) then:
    • \(P(event) = \sum P(e_i) P(event | e_i)\)
    • \(E(rv) = \sum P(e_i) E(rv | e_i)\)

Note

Back to the expected number of heads.

\[\begin{split}E(\text{# heads}) & = \sum_{r=1}^6 P(roll = r) E(\text{# heads} | roll = r) \\ & = \sum_{r=1}^6 \frac{1}{6} \cdot \frac{r}{2} \\ & = \frac{1}{6}\left(\frac{1+2+3+4+5+6}{2}\right) \\ & = \frac{21}{12} = 1.75\end{split}\]
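The same number falls out of a one-liner applying the rule of conditioning, since \(E(\text{# heads} | roll = r) = r/2\) for \(r\) fair coins:

```python
# Rule of conditioning: E[# heads] = sum_r P(roll = r) * E[# heads | roll = r],
# with P(roll = r) = 1/6 and E[# heads | roll = r] = r / 2.
e_heads = sum((1 / 6) * (r / 2) for r in range(1, 7))
print(e_heads)  # ~ 1.75
```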
  • Joint distributions factor
    • if \(\Omega = S \times T \times U\), then \(P(S=s,T=t,U=u)\) is \(P(S=s)P(T=t|S=s)P(U=u|S=s,T=t)\)
  • Conditional distributions are also distributions
    • \(P(A|B) = \frac{P(A, B)}{P(B)}\), so \(P(A|B, C)=\frac{P(A,B|C)}{P(B|C)}\)
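The identity \(P(A|B,C) = \frac{P(A,B|C)}{P(B|C)}\) can be verified numerically on a small joint table; the probabilities below are hypothetical, chosen only to sum to 1:

```python
from itertools import product

# Numeric check that P(A | B, C) = P(A, B | C) / P(B | C) on a small joint
# table over three binary events A, B, C (hypothetical probabilities).
probs = [0.10, 0.05, 0.15, 0.20, 0.05, 0.10, 0.25, 0.10]  # sums to 1
joint = dict(zip(product([0, 1], repeat=3), probs))       # keys are (a, b, c)

def P(pred):
    """Probability of the event defined by predicate pred over atoms."""
    return sum(p for atom, p in joint.items() if pred(atom))

p_a_given_bc = P(lambda o: o == (1, 1, 1)) / P(lambda o: o[1] == 1 and o[2] == 1)
p_ab_given_c = P(lambda o: o == (1, 1, 1)) / P(lambda o: o[2] == 1)
p_b_given_c = P(lambda o: o[1] == 1 and o[2] == 1) / P(lambda o: o[2] == 1)
print(abs(p_a_given_bc - p_ab_given_c / p_b_given_c) < 1e-12)  # True
```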

Bayes Rule for Learning

  • Assume joint distribution \(P(\mathbf{X=x}, Y=y)\)
  • We want \(P(Y=y|\mathbf{X=x})\) for each label \(y\) on a new instance \(\mathbf{x}\)
  • So, using Bayes’ Rule, \(P(y|\mathbf{x}) = P(\mathbf{x}|y) \frac{P(y)}{P(\mathbf{x})}\)
  • \(P(\mathbf{x})\) is the same for every label, so it suffices that \(P(y|\mathbf{x})\) is proportional to \(P(\mathbf{x}|y) P(y)\)
  • From the data, we can learn \(P(\mathbf{x}|y)\) and \(P(y)\)
  • Predict label \(y\) with largest product
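The predict-by-largest-product step can be sketched in a few lines; the prior and likelihood values below are hypothetical stand-ins for quantities learned from data:

```python
# Minimal sketch of Bayes-rule prediction: choose the label y that
# maximizes P(x | y) * P(y). All numbers here are hypothetical.
prior = {"spam": 0.4, "ham": 0.6}              # learned P(y)
likelihood_of_x = {"spam": 0.05, "ham": 0.01}  # learned P(x | y) at one instance x
prediction = max(prior, key=lambda y: likelihood_of_x[y] * prior[y])
print(prediction)  # spam, since 0.05 * 0.4 > 0.01 * 0.6
```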

So how do we learn \(P(\mathbf{x}|y)\)?

Note

Take for example a coin flip. You observe the sequence HTH; what is the probability that the next flip is H?

The maximum-likelihood answer is 2/3: the likelihood function is \(L(\theta) = P(HTH|\theta) = \theta^2 (1-\theta)\).

Setting the derivative of \(L(\theta)\) to zero and solving gives \(\theta = 2/3\).
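Both routes to the estimate agree: counting heads gives \(\theta_{ML}\) in closed form, and a grid search over the likelihood finds the same maximizer:

```python
# ML estimate for the coin: counts give theta_ML = #H / #flips; a grid
# search over the likelihood theta**H * (1 - theta)**T agrees.
flips = "HTH"
theta_ml = flips.count("H") / len(flips)  # 2/3

grid = [i / 100000 for i in range(100001)]
theta_grid = max(grid, key=lambda t: t ** flips.count("H") * (1 - t) ** flips.count("T"))
print(theta_ml, theta_grid)  # both ~ 0.6667
```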

Note

But what if we have a prior belief \(P(\theta)\) where \(\theta = P(H)\)?

Now, the posterior on \(\theta\) becomes \(P(\theta | HTH)\):

\[P(\theta | HTH) = P(HTH | \theta) \frac{P(\theta)}{P(HTH)}\]

Or in this case:

\[\frac{\theta^2 (1-\theta) P(\theta)}{\text{normalization}}\]

Discrete Prior

Taking \(P(\theta=0) = P(\theta=1/2) = P(\theta=1) = 1/3\), \(\theta^2 (1-\theta) P(\theta)\) is 0, 1/24, and 0 for the 3 cases respectively. Thus, the posterior \(P(\theta = 1/2 | HTH) = 1\).
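The three-point posterior works out the same way in code:

```python
# Posterior over the three-point prior after observing HTH.
prior = {0.0: 1 / 3, 0.5: 1 / 3, 1.0: 1 / 3}
unnorm = {t: t**2 * (1 - t) * p for t, p in prior.items()}  # theta^2 (1-theta) P(theta)
z = sum(unnorm.values())                                    # normalization = 1/24
posterior = {t: u / z for t, u in unnorm.items()}
print(posterior)  # {0.0: 0.0, 0.5: 1.0, 1.0: 0.0}
```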

Prior Density

  • \(P(\theta) = 1\) for \(0 \leq \theta \leq 1\)
  • So \(\theta^2 (1-\theta) P(\theta)\) is just \(\theta^2 (1-\theta)\)
  • and the posterior is \(12\, \theta^2 (1-\theta)\), since \(\int_0^1 \theta^2 (1-\theta)\, d\theta = \frac{1}{12}\)
  • If we plot this, the max is at \(\theta = 2/3\)
  • Treat parameter \(\theta\) as a random var with the prior distribution \(P(\theta)\), see training data \(Z\)
  • \(posterior = \frac{prior * data likelihood}{constant}\)
  • \(P(\theta | Z) = \frac{P(\theta) P(Z | \theta)}{P(Z)}\)
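A grid approximation confirms both the normalizer and the location of the maximum for the uniform-prior case; a sketch:

```python
# Grid approximation of the posterior under the uniform prior P(theta) = 1:
# the normalizer is int_0^1 theta^2 (1 - theta) dtheta = 1/12, so the
# posterior density is 12 * theta^2 * (1 - theta), peaking at theta = 2/3.
n = 100000
thetas = [(i + 0.5) / n for i in range(n)]  # midpoint grid on [0, 1]
unnorm = [t**2 * (1 - t) for t in thetas]
z = sum(unnorm) / n                         # Riemann sum, ~ 1/12
mode = thetas[max(range(n), key=lambda i: unnorm[i])]
print(round(z, 6), round(mode, 4))  # ~ 0.083333, ~ 0.6667
```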

Bayes’ Estimation

Treat the parameter \(\theta\) as a random variable with prior distribution \(P(\theta)\); the fixed training data \(Z\) is the observed value of a random variable \(S\), and \(\theta'\) denotes a candidate parameter value.

Maximum Likelihood

\[\theta_{ML} = \arg \max_{\theta'} P(S=Z|\theta = \theta')\]

Maximum a Posteriori

\[\begin{split}\theta_{MAP} & = \arg \max_{\theta'} P(\theta = \theta' | S=Z) \\ & = \arg \max_{\theta'} P(S=Z | \theta = \theta')\frac{P(\theta = \theta')}{P(S=Z)}\end{split}\]
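To see ML and MAP disagree, assume a (hypothetical) Beta(2, 2) prior \(P(\theta) \propto \theta(1-\theta)\) on the coin: the MAP objective for HTH becomes \(\theta^3 (1-\theta)^2\), maximized at 3/5, while ML still gives 2/3. A grid-search sketch:

```python
# MAP vs ML on the HTH data, assuming a hypothetical Beta(2, 2) prior
# p(theta) proportional to theta * (1 - theta). MAP then maximizes
# theta**3 * (1 - theta)**2 (mode 3/5); ML maximizes theta**2 * (1 - theta) (mode 2/3).
grid = [i / 100000 for i in range(100001)]
theta_ml = max(grid, key=lambda t: t**2 * (1 - t))
theta_map = max(grid, key=lambda t: t**3 * (1 - t) ** 2)
print(theta_ml, theta_map)  # ~ 0.6667, 0.6
```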

Predictive Distribution

aka Full Bayes

\[P(Y=y | S=Z) = \int P(Y=y | \theta=\theta') P(\theta=\theta' | S=Z) d\theta'\]

Posterior Mean (Mean a’Post)

\[\theta_{mean} = E[\theta | S=Z] = \int \theta' P(\theta=\theta' | S=Z) d\theta'\]
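For the coin model, \(P(Y=H | \theta) = \theta\), so the predictive distribution and the posterior mean coincide; with the uniform prior and \(Z = HTH\) both equal \(\frac{\int \theta^3(1-\theta)\,d\theta}{\int \theta^2(1-\theta)\,d\theta} = \frac{1/20}{1/12} = \frac{3}{5}\). A numerical sketch:

```python
# Predictive P(next = H | Z) = E[theta | Z] for the coin with uniform prior
# and Z = HTH, approximated by Riemann sums on a midpoint grid.
n = 100000
thetas = [(i + 0.5) / n for i in range(n)]
num = sum(t**3 * (1 - t) for t in thetas) / n  # ~ 1/20
den = sum(t**2 * (1 - t) for t in thetas) / n  # ~ 1/12
print(num / den)  # ~ 0.6
```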

Use

  • draw enough data that \(P(Y=y | X=\mathbf{x})\) can be estimated for every possible \((\mathbf{x}, y)\) pair
    • this takes a lot of data
  • another approach: use a class of models
  • think of each model \(m\) as a way of generating the training set Z of \((\mathbf{x}, y)\) pairs

Compound Experiment

  • prior \(P(M=m)\) on model space
  • models give \(P(X=x | M=m)\) (where \(x\) is a pair \((\mathbf{x}, y)\))
  • The joint experiment (if data is iid given m) is:
\[P(\{(\mathbf{x_i}, y_i)\}, m) = P(m) \prod_i (P(\mathbf{x_i} | m) P(y_i | \mathbf{x_i}, m))\]
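The iid factorization is just a product of per-example terms; a sketch with a tiny hypothetical training set of two examples:

```python
# Joint probability of a tiny (hypothetical) training set under one model m,
# using P({(x_i, y_i)}, m) = P(m) * prod_i P(x_i | m) * P(y_i | x_i, m).
p_model = 0.5  # prior P(M = m)
examples = [
    (0.2, 0.9),  # (P(x_1 | m), P(y_1 | x_1, m)) -- illustrative numbers
    (0.1, 0.8),  # (P(x_2 | m), P(y_2 | x_2, m))
]
joint = p_model
for p_x, p_y_given_x in examples:
    joint *= p_x * p_y_given_x
print(joint)  # ~ 0.0072 = 0.5 * (0.2 * 0.9) * (0.1 * 0.8)
```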

Generative and Discriminative Models

  • Generative model: \(P((\mathbf{x}, y) | m)\)
    • tells how to generate examples (both instance and label)
    • learn \(P(\mathbf{x} | y, m)\) and use Bayes’ rule
    • common assumptions:
      • \(P(\mathbf{x} | y, m)\) is Gaussian
      • \(P(y | m)\) is Bernoulli
  • Discriminative model: \(P(y | h, \mathbf{x})\)
    • tells how to create labels from instances
    • often \(f(\mathbf{x}) = \arg \max_y f_y(\mathbf{x})\)
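A generative classifier under the common assumptions above (Gaussian class-conditionals, Bernoulli label prior) can be sketched as follows; every parameter value is hypothetical:

```python
import math

# Sketch of a generative classifier: Gaussian P(x | y, m) per class and a
# Bernoulli label prior P(y | m). Parameter values are hypothetical.
params = {
    0: {"prior": 0.5, "mean": -1.0, "std": 1.0},
    1: {"prior": 0.5, "mean": 2.0, "std": 1.0},
}

def gaussian_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std**2)) / (std * math.sqrt(2 * math.pi))

def predict(x):
    # Bayes rule: argmax_y P(x | y) * P(y); P(x) is common to all labels.
    return max(params, key=lambda y: gaussian_pdf(x, params[y]["mean"], params[y]["std"]) * params[y]["prior"])

print(predict(-0.5), predict(1.8))  # 0 1
```

The same skeleton becomes a discriminative predictor if `predict` instead evaluates learned score functions \(f_y(\mathbf{x})\) directly, without modeling \(P(\mathbf{x}|y)\).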