Probability Review

Useful notes: http://cs229.stanford.edu/section/cs229-prob.pdf

Let’s define some important things.

  • Outcome Space: \(\Omega\) - contains all possible atomic outcomes
  • each outcome (atom) has probability density or mass (discrete v. continuous spaces)
  • an event is a subset of \(\Omega\)
  • \(P(event)\) is a sum (or integral) over event’s outcomes
  • a random variable \(V\) maps \(\Omega\) to (usually) \(\mathbb{R}\)
  • \(V = \text{value}\) is an event, and \(P(V)\) denotes the distribution of \(V\)

Note

Example: rolling a fair 6-sided die and then flipping that many fair coins

  • \(\Omega = \{(1, H), (1, T), (2, HH), (2, HT), ...\}\)
  • let number of heads be a random variable
  • so what’s the expected number of heads?
    • \(E(V) = \sum_{\text{atoms } a} P(a)V(a)\)
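The atom-level formula above can be checked directly by enumerating the whole outcome space of the die-then-coins experiment; a short sketch:

```python
from itertools import product

# E[V] = sum over atoms a of P(a) * V(a) for the die-then-coins experiment.
# An atom is (roll, coin sequence); its probability is (1/6) * (1/2)**roll.
expected_heads = 0.0
for roll in range(1, 7):
    for coins in product("HT", repeat=roll):
        p_atom = (1 / 6) * (1 / 2) ** roll
        expected_heads += p_atom * coins.count("H")

print(expected_heads)  # ~ 1.75
```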

Let’s look at some other properties to figure this out.

  • Events A and B are independent iff:
    • \(P(A \text{ and } B) = P(A) * P(B)\)
  • Conditional probability
    • \(P(A | B) = \frac{P(A, B)}{P(B)}\)
  • Product Rule
    • \(P(A, B) = P(A|B) * P(B)\)
    • \(P(B, A) = P(B|A) * P(A)\)
  • Bayes Rule
    • \(P(A|B) = P(B|A) \frac{P(A)}{P(B)}\)
  • Expectations add
    • \(E(V_1 + V_2) = E(V_1) + E(V_2)\)
  • rule of conditioning (sum rule)
    • if events \(e_1, e_2, ... , e_k\) partition \(\Omega\) then:
    • \(P(event) = \sum P(e_i) P(event | e_i)\)
    • \(E(rv) = \sum P(e_i) E(rv | e_i)\)

Note

Back to the expected number of heads.

\[\begin{split}E(\text{# heads}) & = \sum_{r=1}^6 P(roll = r) E(\text{# heads} | roll = r) \\ & = \sum_{r=1}^6 \frac{1}{6} \cdot \frac{r}{2} \\ & = \frac{1}{6}\left(\frac{1+2+3+4+5+6}{2}\right) \\ & = \frac{21}{12} = 1.75\end{split}\]
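The same number falls out of a one-liner applying the rule of conditioning, since \(E(\text{# heads} | roll = r) = r/2\) for \(r\) fair coins:

```python
# Rule of conditioning: E[# heads] = sum_r P(roll = r) * E[# heads | roll = r],
# with P(roll = r) = 1/6 and E[# heads | roll = r] = r / 2.
e_heads = sum((1 / 6) * (r / 2) for r in range(1, 7))
print(e_heads)  # ~ 1.75
```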
  • Joint distributions factor
    • if \(\Omega = S \times T \times U\), then \(P(S=s,T=t,U=u)\) is \(P(S=s)P(T=t|S=s)P(U=u|S=s,T=t)\)
  • Conditional distributions are also distributions
    • \(P(A|B) = \frac{P(A, B)}{P(B)}\), so \(P(A|B, C)=\frac{P(A,B|C)}{P(B|C)}\)
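The identity \(P(A|B,C) = \frac{P(A,B|C)}{P(B|C)}\) can be verified numerically on a small joint table; the probabilities below are hypothetical, chosen only to sum to 1:

```python
from itertools import product

# Numeric check that P(A | B, C) = P(A, B | C) / P(B | C) on a small joint
# table over three binary events A, B, C (hypothetical probabilities).
probs = [0.10, 0.05, 0.15, 0.20, 0.05, 0.10, 0.25, 0.10]  # sums to 1
joint = dict(zip(product([0, 1], repeat=3), probs))       # keys are (a, b, c)

def P(pred):
    """Probability of the event defined by predicate pred over atoms."""
    return sum(p for atom, p in joint.items() if pred(atom))

p_a_given_bc = P(lambda o: o == (1, 1, 1)) / P(lambda o: o[1] == 1 and o[2] == 1)
p_ab_given_c = P(lambda o: o == (1, 1, 1)) / P(lambda o: o[2] == 1)
p_b_given_c = P(lambda o: o[1] == 1 and o[2] == 1) / P(lambda o: o[2] == 1)
print(abs(p_a_given_bc - p_ab_given_c / p_b_given_c) < 1e-12)  # True
```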

Bayes Rule for Learning

  • Assume joint distribution \(P(\mathbf{X=x}, Y=y)\)
  • We want \(P(Y=y|\mathbf{X=x})\) for each label \(y\) on a new instance \(\mathbf{x}\)
  • So, using Bayes’ Rule, \(P(y|\mathbf{x}) = P(\mathbf{x}|y) \frac{P(y)}{P(\mathbf{x})}\)
  • \(P(\mathbf{x})\) is the same for every label, so it suffices that \(P(y|\mathbf{x})\) is proportional to \(P(\mathbf{x}|y) P(y)\)
  • From the data, we can learn \(P(\mathbf{x}|y)\) and \(P(y)\)
  • Predict label \(y\) with largest product
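The predict-by-largest-product step can be sketched in a few lines; the prior and likelihood values below are hypothetical stand-ins for quantities learned from data:

```python
# Minimal sketch of Bayes-rule prediction: choose the label y that
# maximizes P(x | y) * P(y). All numbers here are hypothetical.
prior = {"spam": 0.4, "ham": 0.6}              # learned P(y)
likelihood_of_x = {"spam": 0.05, "ham": 0.01}  # learned P(x | y) at one instance x
prediction = max(prior, key=lambda y: likelihood_of_x[y] * prior[y])
print(prediction)  # spam, since 0.05 * 0.4 > 0.01 * 0.6
```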

So how do we learn \(P(\mathbf{x}|y)\)?

Note

Take for example a coin flip. You observe the sequence HTH; what is the probability that the next flip is H?

The maximum-likelihood answer is 2/3: the likelihood function is \(L(\theta) = P(HTH|\theta) = \theta^2 (1-\theta)\).

Setting the derivative of \(L(\theta)\) to zero and solving gives \(\theta = 2/3\).
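Both routes to the estimate agree: counting heads gives \(\theta_{ML}\) in closed form, and a grid search over the likelihood finds the same maximizer:

```python
# ML estimate for the coin: counts give theta_ML = #H / #flips; a grid
# search over the likelihood theta**H * (1 - theta)**T agrees.
flips = "HTH"
theta_ml = flips.count("H") / len(flips)  # 2/3

grid = [i / 100000 for i in range(100001)]
theta_grid = max(grid, key=lambda t: t ** flips.count("H") * (1 - t) ** flips.count("T"))
print(theta_ml, theta_grid)  # both ~ 0.6667
```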

Note

But what if we have a prior belief \(P(\theta)\) where \(\theta = P(H)\)?

Now, the posterior on \(\theta\) becomes \(P(\theta | HTH)\):

\[P(\theta | HTH) = P(HTH | \theta) \frac{P(\theta)}{P(HTH)}\]

Or in this case:

\[\frac{\theta^2 (1-\theta) P(\theta)}{\text{normalization}}\]

Discrete Prior

Taking \(P(\theta=0) = P(\theta=1/2) = P(\theta=1) = 1/3\), \(\theta^2 (1-\theta) P(\theta)\) is 0, 1/24, and 0 for the 3 cases respectively. Thus, the posterior \(P(\theta = 1/2 | HTH) = 1\).
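The three-point posterior works out the same way in code:

```python
# Posterior over the three-point prior after observing HTH.
prior = {0.0: 1 / 3, 0.5: 1 / 3, 1.0: 1 / 3}
unnorm = {t: t**2 * (1 - t) * p for t, p in prior.items()}  # theta^2 (1-theta) P(theta)
z = sum(unnorm.values())                                    # normalization = 1/24
posterior = {t: u / z for t, u in unnorm.items()}
print(posterior)  # {0.0: 0.0, 0.5: 1.0, 1.0: 0.0}
```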

Prior Density

  • \(P(\theta) = 1\) for \(0 \leq \theta \leq 1\)
  • So \(\theta^2 (1-\theta) P(\theta)\) is just \(\theta^2 (1-\theta)\)
  • and the posterior is \(12\, \theta^2 (1-\theta)\), since \(\int_0^1 \theta^2 (1-\theta)\, d\theta = \frac{1}{12}\)
  • If we plot this, the max is at \(\theta = 2/3\)
  • Treat parameter \(\theta\) as a random var with the prior distribution \(P(\theta)\), see training data \(Z\)
  • \(posterior = \frac{prior * data likelihood}{constant}\)
  • \(P(\theta | Z) = \frac{P(\theta) P(Z | \theta)}{P(Z)}\)
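A grid approximation confirms both the normalizer and the location of the maximum for the uniform-prior case; a sketch:

```python
# Grid approximation of the posterior under the uniform prior P(theta) = 1:
# the normalizer is int_0^1 theta^2 (1 - theta) dtheta = 1/12, so the
# posterior density is 12 * theta^2 * (1 - theta), peaking at theta = 2/3.
n = 100000
thetas = [(i + 0.5) / n for i in range(n)]  # midpoint grid on [0, 1]
unnorm = [t**2 * (1 - t) for t in thetas]
z = sum(unnorm) / n                         # Riemann sum, ~ 1/12
mode = thetas[max(range(n), key=lambda i: unnorm[i])]
print(round(z, 6), round(mode, 4))  # ~ 0.083333, ~ 0.6667
```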

Bayes’ Estimation

Treat the parameter \(\theta\) as a random variable with prior distribution \(P(\theta)\); the fixed training data \(Z\) is the observed value of a random variable \(S\), and \(\theta'\) denotes a candidate parameter value.

Maximum Likelihood

\[\theta_{ML} = \arg \max_{\theta'} P(S=Z|\theta = \theta')\]

Maximum a Posteriori

\[\begin{split}\theta_{MAP} & = \arg \max_{\theta'} P(\theta = \theta' | S=Z) \\ & = \arg \max_{\theta'} P(S=Z | \theta = \theta')\frac{P(\theta = \theta')}{P(S=Z)}\end{split}\]
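To see ML and MAP disagree, assume a (hypothetical) Beta(2, 2) prior \(P(\theta) \propto \theta(1-\theta)\) on the coin: the MAP objective for HTH becomes \(\theta^3 (1-\theta)^2\), maximized at 3/5, while ML still gives 2/3. A grid-search sketch:

```python
# MAP vs ML on the HTH data, assuming a hypothetical Beta(2, 2) prior
# p(theta) proportional to theta * (1 - theta). MAP then maximizes
# theta**3 * (1 - theta)**2 (mode 3/5); ML maximizes theta**2 * (1 - theta) (mode 2/3).
grid = [i / 100000 for i in range(100001)]
theta_ml = max(grid, key=lambda t: t**2 * (1 - t))
theta_map = max(grid, key=lambda t: t**3 * (1 - t) ** 2)
print(theta_ml, theta_map)  # ~ 0.6667, 0.6
```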

Predictive Distribution

aka Full Bayes

\[P(Y=y | S=Z) = \int P(Y=y | \theta=\theta') P(\theta=\theta' | S=Z) d\theta'\]

Posterior Mean (Mean a’Post)

\[\theta_{mean} = E[\theta | S=Z] = \int \theta' P(\theta=\theta' | S=Z) d\theta'\]
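For the coin model, \(P(Y=H | \theta) = \theta\), so the predictive distribution and the posterior mean coincide; with the uniform prior and \(Z = HTH\) both equal \(\frac{\int \theta^3(1-\theta)\,d\theta}{\int \theta^2(1-\theta)\,d\theta} = \frac{1/20}{1/12} = \frac{3}{5}\). A numerical sketch:

```python
# Predictive P(next = H | Z) = E[theta | Z] for the coin with uniform prior
# and Z = HTH, approximated by Riemann sums on a midpoint grid.
n = 100000
thetas = [(i + 0.5) / n for i in range(n)]
num = sum(t**3 * (1 - t) for t in thetas) / n  # ~ 1/20
den = sum(t**2 * (1 - t) for t in thetas) / n  # ~ 1/12
print(num / den)  # ~ 0.6
```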

Use

  • draw enough data that \(P(Y=y | X=\mathbf{x})\) can be estimated for every possible \((\mathbf{x}, y)\) pair
    • this takes a lot of data
  • another approach: use a class of models
  • think of each model \(m\) as a way of generating the training set Z of \((\mathbf{x}, y)\) pairs

Compound Experiment

  • prior \(P(M=m)\) on model space
  • models give \(P(X=x | M=m)\) (where \(x\) is a pair \((\mathbf{x}, y)\))
  • The joint experiment (if data is iid given m) is:
\[P(\{(\mathbf{x_i}, y_i)\}, m) = P(m) \prod_i (P(\mathbf{x_i} | m) P(y_i | \mathbf{x_i}, m))\]
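The iid factorization is just a product of per-example terms; a sketch with a tiny hypothetical training set of two examples:

```python
# Joint probability of a tiny (hypothetical) training set under one model m,
# using P({(x_i, y_i)}, m) = P(m) * prod_i P(x_i | m) * P(y_i | x_i, m).
p_model = 0.5  # prior P(M = m)
examples = [
    (0.2, 0.9),  # (P(x_1 | m), P(y_1 | x_1, m)) -- illustrative numbers
    (0.1, 0.8),  # (P(x_2 | m), P(y_2 | x_2, m))
]
joint = p_model
for p_x, p_y_given_x in examples:
    joint *= p_x * p_y_given_x
print(joint)  # ~ 0.0072 = 0.5 * (0.2 * 0.9) * (0.1 * 0.8)
```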

Generative and Discriminative Models

  • Generative model: \(P((\mathbf{x}, y) | m)\)
    • tells how to generate examples (both instance and label)
    • learn \(P(\mathbf{x} | y, m)\) and use Bayes’ rule
    • common assumptions:
      • \(P(\mathbf{x} | y, m)\) is Gaussian
      • \(P(y | m)\) is Bernoulli
  • Discriminative model: \(P(y | h, \mathbf{x})\)
    • tells how to create labels from instances
    • often \(f(\mathbf{x}) = \arg \max_y f_y(\mathbf{x})\)
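A generative classifier under the common assumptions above (Gaussian class-conditionals, Bernoulli label prior) can be sketched as follows; every parameter value is hypothetical:

```python
import math

# Sketch of a generative classifier: Gaussian P(x | y, m) per class and a
# Bernoulli label prior P(y | m). Parameter values are hypothetical.
params = {
    0: {"prior": 0.5, "mean": -1.0, "std": 1.0},
    1: {"prior": 0.5, "mean": 2.0, "std": 1.0},
}

def gaussian_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std**2)) / (std * math.sqrt(2 * math.pi))

def predict(x):
    # Bayes rule: argmax_y P(x | y) * P(y); P(x) is common to all labels.
    return max(params, key=lambda y: gaussian_pdf(x, params[y]["mean"], params[y]["std"]) * params[y]["prior"])

print(predict(-0.5), predict(1.8))  # 0 1
```

The same skeleton becomes a discriminative predictor if `predict` instead evaluates learned score functions \(f_y(\mathbf{x})\) directly, without modeling \(P(\mathbf{x}|y)\).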