# Probability Review

Useful notes: http://cs229.stanford.edu/section/cs229-prob.pdf

Let’s define some important things.

**Outcome Space**: \(\Omega\) contains all possible atomic outcomes

- each outcome (atom) has a probability *density* or *mass* (continuous v. discrete spaces)
- an *event* is a subset of \(\Omega\); \(P(event)\) is a sum (or integral) over the event's outcomes
- a *random variable* \(V\) maps \(\Omega\) to (usually) \(\mathbb{R}\); \(V = value\) is an event, so \(P(V)\) is a distribution

**Note**

**Example**: rolling a fair 6-sided die and then flipping that many fair coins

- \(\Omega = \{(1, H), (1, T), (2, HH), (2, HT), ...\}\)
- let the number of heads be the random variable \(V\)
- so what’s the expected number of heads?
- \(E(V) = \sum_{\text{atoms } a} P(a)V(a)\)

Let’s look at some other properties to figure this out.

- Events A and B are *independent* iff:
  - \(P(A \text{ and } B) = P(A) * P(B)\)

- Conditional probability:
  - \(P(A | B) = \frac{P(A, B)}{P(B)}\)

- Product rule:
  - \(P(A, B) = P(A|B) * P(B)\)
  - \(P(B, A) = P(B|A) * P(A)\)

- **Bayes Rule**:
  - \(P(A|B) = P(B|A) \frac{P(A)}{P(B)}\)

- Expectations add:
  - \(E(V_1 + V_2) = E(V_1) + E(V_2)\)

- Rule of conditioning (sum rule):
  - if events \(e_1, e_2, ..., e_k\) partition \(\Omega\), then:
  - \(P(event) = \sum_i P(e_i) P(event | e_i)\)
  - \(E(rv) = \sum_i P(e_i) E(rv | e_i)\)

**Note**

Back to the expected number of heads: apply the rule of conditioning, conditioning on the die roll \(D\).
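Each die value \(i\) has probability \(1/6\), and \(i\) fair coins give \(E(V | D=i) = i/2\), so:

\(E(V) = \sum_{i=1}^{6} P(D=i)\, E(V | D=i) = \sum_{i=1}^{6} \frac{1}{6} \cdot \frac{i}{2} = \frac{21}{12} = \frac{7}{4}\)

A quick simulation to sanity-check the \(7/4\) answer (a minimal sketch, not part of the original notes):

```python
import random

def heads_count() -> int:
    """Roll a fair 6-sided die, then flip that many fair coins; count heads."""
    roll = random.randint(1, 6)
    return sum(random.random() < 0.5 for _ in range(roll))

# Average over many trials; should come out near 7/4 = 1.75
trials = 100_000
print(sum(heads_count() for _ in range(trials)) / trials)
```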

- Joint distributions factor:
  - if \(\Omega = S \times T \times U\), then \(P(S=s, T=t, U=u) = P(S=s) P(T=t|S=s) P(U=u|S=s, T=t)\)

- Conditional distributions are also distributions:
  - \(P(A|B) = \frac{P(A, B)}{P(B)}\), so \(P(A|B, C) = \frac{P(A, B|C)}{P(B|C)}\)

## Bayes Rule for Learning

- Assume a joint distribution \(P(\mathbf{X}=\mathbf{x}, Y=y)\)
- We want \(P(Y=y|\mathbf{X}=\mathbf{x})\) for each label \(y\) on a new instance \(\mathbf{x}\)
- So, using Bayes’ Rule, \(P(y|\mathbf{x}) = P(\mathbf{x}|y) \frac{P(y)}{P(\mathbf{x})}\)
- \(P(\mathbf{x})\) is the same for every \(y\), so all that matters is that \(P(y|\mathbf{x})\) is proportional to \(P(\mathbf{x}|y) P(y)\)
- From the data, we can learn \(P(\mathbf{x}|y)\) and \(P(y)\)
- Predict the label \(y\) with the largest product \(P(\mathbf{x}|y) P(y)\)

So how do we learn \(P(\mathbf{x}|y)\)?
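For discrete instances, the simplest answer is counting; a minimal sketch (the data layout and function names here are illustrative assumptions, not from the notes):

```python
from collections import Counter, defaultdict

def fit_counts(data):
    """Estimate P(y) and P(x|y) from (x, y) pairs by counting."""
    label_counts = Counter(y for _, y in data)
    cond_counts = defaultdict(Counter)      # cond_counts[y][x] = count of (x, y)
    for x, y in data:
        cond_counts[y][x] += 1
    n = len(data)
    prior = {y: c / n for y, c in label_counts.items()}                  # P(y)
    likelihood = {y: {x: c / label_counts[y] for x, c in counts.items()}
                  for y, counts in cond_counts.items()}                  # P(x|y)
    return prior, likelihood

def predict(x, prior, likelihood):
    """Predict the label with the largest product P(x|y) * P(y)."""
    # Unseen x gets probability 0 for every y (no smoothing in this sketch)
    return max(prior, key=lambda y: likelihood[y].get(x, 0.0) * prior[y])
```

Counting only works when every value is seen often enough; estimating probabilities from limited data is where maximum likelihood and priors come in.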

**Note**

Take for example a coin flip. You observe the sequence HTH; what is the probability that the next flip is H?

Under maximum likelihood, the answer is 2/3: writing \(\theta = P(H)\), the likelihood of the observed sequence is \(L(\theta) = P(HTH|\theta) = \theta^2 (1-\theta)\).

Setting the derivative to zero and solving gives \(\theta = 2/3\):
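\(\frac{dL}{d\theta} = 2\theta - 3\theta^2 = \theta(2 - 3\theta) = 0 \implies \theta = \frac{2}{3}\)

(the other critical point, \(\theta = 0\), gives zero likelihood)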

**Note**

But what if we have a prior belief \(P(\theta)\), where \(\theta = P(H)\)?

Now the posterior on \(\theta\) becomes \(P(\theta | HTH) = \frac{P(\theta) P(HTH | \theta)}{P(HTH)}\), which in this case is proportional to \(\theta^2 (1-\theta) P(\theta)\).

Consider two choices of prior:

**Discrete Prior**

Taking \(P(\theta=0) = P(\theta=1/2) = P(\theta=1) = 1/3\), the unnormalized posterior \(\theta^2 (1-\theta) P(\theta)\) is 0, 1/24, and 0 for the three cases respectively. After normalizing, all the weight falls on the middle value: \(P(\theta = 1/2 | HTH) = 1\).

**Prior Density**

- \(P(\theta) = 1\) for \(0 \leq \theta \leq 1\) (a uniform prior)
- so the unnormalized posterior \(\theta^2 (1-\theta) P(\theta)\) is just \(\theta^2 (1-\theta)\)
- the normalizing constant is \(P(HTH) = \int_0^1 \theta^2 (1-\theta)\, d\theta = \frac{1}{12}\), so the posterior is \(12\, \theta^2 (1-\theta)\)
- if we plot this, the max is at \(\theta = 2/3\)
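A quick numerical check of the normalizer and the posterior mode (a minimal sketch, not part of the original notes):

```python
import numpy as np

# Grid over theta in [0, 1]
theta = np.linspace(0.0, 1.0, 100_001)
unnorm = theta**2 * (1 - theta)     # likelihood * uniform prior

# Normalizing constant via a Riemann sum: should be close to 1/12
z = unnorm.sum() * (theta[1] - theta[0])
print(z)                            # ~0.08333

# Posterior mode: should sit at theta = 2/3
print(theta[np.argmax(unnorm)])     # ~0.66667
```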

- Treat the parameter \(\theta\) as a random variable with prior distribution \(P(\theta)\), then observe training data \(Z\)
- \(\text{posterior} = \frac{\text{prior} \times \text{data likelihood}}{\text{constant}}\)
- \(P(\theta | Z) = \frac{P(\theta) P(Z | \theta)}{P(Z)}\)

## Bayes’ Estimation

Treat the parameter \(\theta\) as a RV with the prior distribution \(P(\theta)\), and use fixed training data \(Z = (\mathbf{x}, y)\) pairs (RV \(S\)).

### Maximum Likelihood
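Pick the parameter value under which the observed data are most probable:

\(\hat{\theta}_{ML} = \arg \max_\theta P(Z | \theta)\)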

### Maximum a Posteriori
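Weight the likelihood by the prior and maximize:

\(\hat{\theta}_{MAP} = \arg \max_\theta P(\theta) P(Z | \theta)\)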

### Predictive Distribution

aka Full Bayes
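Rather than committing to a single \(\theta\), average the predictions over the whole posterior:

\(P(y | \mathbf{x}, Z) = \int P(y | \mathbf{x}, \theta)\, P(\theta | Z)\, d\theta\)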

### Mean a’Post
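Use the posterior mean of the parameter:

\(\hat{\theta} = E(\theta | Z) = \int \theta\, P(\theta | Z)\, d\theta\)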

### Use

- ideally, draw enough data that \(P(Y=y | \mathbf{X}=\mathbf{x})\) can be estimated for every possible pair
- this takes a lot of data

- another approach: use a class of models
- think of each model \(m\) as a way of generating the training set \(Z\) of \((\mathbf{x}, y)\) pairs

### Compound Experiment

- prior \(P(M=m)\) on the model space
- models give \(P(X=x | M=m)\) (where \(x\) is a pair \((\mathbf{x}, y)\))
- the joint experiment (if the data are iid given \(m\)) is:
- \(P(M=m, Z) = P(M=m) \prod_i P(X=x_i | M=m)\)

### Generative and Discriminative Models

- Generative model: \(P((\mathbf{x}, y) | m)\)
  - tells how to generate examples (both instance and label)
  - learn \(P(\mathbf{x} | y, m)\) and use Bayes’ rule (see the sketch after this list)
  - common assumptions:
    - \(P(\mathbf{x} | y, m)\) is Gaussian
    - \(P(y | m)\) is Bernoulli

- Discriminative model: \(P(y | h, \mathbf{x})\)
  - tells how to create labels from instances
  - often \(f(\mathbf{x}) = \arg \max_y f_y(\mathbf{x})\)
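As a concrete instance of the common generative assumptions above, a minimal sketch with a single real-valued feature, Gaussian class-conditionals, and a Bernoulli label prior (the function names and data layout are illustrative assumptions, not from the notes):

```python
import math

def fit_generative(xs: list[float], ys: list[int]):
    """Learn P(y) (Bernoulli) and P(x|y) (Gaussian) for labels y in {0, 1}.

    Assumes both labels appear and each class has nonzero variance.
    """
    params = {}
    for label in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == label]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        params[label] = (len(vals) / len(xs), mean, var)   # (P(y), mean, var)
    return params

def gaussian_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict_label(x: float, params) -> int:
    """Bayes’ rule: pick the y maximizing P(x|y) * P(y)."""
    return max(params, key=lambda y: params[y][0] * gaussian_pdf(x, params[y][1], params[y][2]))
```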