# Probability Review

Useful notes: http://cs229.stanford.edu/section/cs229-prob.pdf

Let’s define some important things.

**Outcome Space**: \(\Omega\) contains all possible atomic outcomes

- each outcome (atom) has a probability *density* or *mass* (continuous v. discrete spaces)
- an *event* is a subset of \(\Omega\); \(P(event)\) is a sum (or integral) over the event's outcomes
- a *random variable* \(V\) maps \(\Omega\) to (usually) \(\mathbb{R}\); \(V = value\) is an event, so \(P(V)\) is a distribution

**Note**

**Example**: rolling a fair 6-sided die and then flipping that many fair coins

- \(\Omega = \{(1, H), (1, T), (2, HH), (2, HT), ...\}\)
- let the number of heads be the random variable \(V\)
- so what’s the expected number of heads?
- \(E(V) = \sum_{\text{atoms } a} P(a)V(a)\)

Let’s look at some other properties to figure this out.

- Events A and B are *independent* iff:
  - \(P(A \text{ and } B) = P(A) * P(B)\)

- Conditional probability:
  - \(P(A | B) = \frac{P(A, B)}{P(B)}\)

- Product rule:
  - \(P(A, B) = P(A|B) * P(B)\)
  - \(P(B, A) = P(B|A) * P(A)\)

- **Bayes Rule**:
  - \(P(A|B) = P(B|A) \frac{P(A)}{P(B)}\)

- Expectations add:
  - \(E(V_1 + V_2) = E(V_1) + E(V_2)\)

- Rule of conditioning (sum rule):
  - if events \(e_1, e_2, ..., e_k\) partition \(\Omega\), then:
  - \(P(event) = \sum_i P(e_i) P(event | e_i)\)
  - \(E(rv) = \sum_i P(e_i) E(rv | e_i)\)

**Note**

Back to the expected number of heads: apply the rule of conditioning, conditioning on the die roll \(D\).
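Each die value \(i\) has probability \(1/6\), and \(i\) fair coins give \(E(V | D=i) = i/2\), so:

\(E(V) = \sum_{i=1}^{6} P(D=i)\, E(V | D=i) = \sum_{i=1}^{6} \frac{1}{6} \cdot \frac{i}{2} = \frac{21}{12} = \frac{7}{4}\)

A quick simulation to sanity-check the \(7/4\) answer (a minimal sketch, not part of the original notes):

```python
import random

def heads_count() -> int:
    """Roll a fair 6-sided die, then flip that many fair coins; count heads."""
    roll = random.randint(1, 6)
    return sum(random.random() < 0.5 for _ in range(roll))

# Average over many trials; should come out near 7/4 = 1.75
trials = 100_000
print(sum(heads_count() for _ in range(trials)) / trials)
```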

- Joint distributions factor:
  - if \(\Omega = S \times T \times U\), then \(P(S=s, T=t, U=u) = P(S=s) P(T=t|S=s) P(U=u|S=s, T=t)\)

- Conditional distributions are also distributions:
  - \(P(A|B) = \frac{P(A, B)}{P(B)}\), so \(P(A|B, C) = \frac{P(A, B|C)}{P(B|C)}\)

## Bayes Rule for Learning

- Assume a joint distribution \(P(\mathbf{X}=\mathbf{x}, Y=y)\)
- We want \(P(Y=y|\mathbf{X}=\mathbf{x})\) for each label \(y\) on a new instance \(\mathbf{x}\)
- So, using Bayes’ Rule, \(P(y|\mathbf{x}) = P(\mathbf{x}|y) \frac{P(y)}{P(\mathbf{x})}\)
- \(P(\mathbf{x})\) is the same for every \(y\), so all that matters is that \(P(y|\mathbf{x})\) is proportional to \(P(\mathbf{x}|y) P(y)\)
- From the data, we can learn \(P(\mathbf{x}|y)\) and \(P(y)\)
- Predict the label \(y\) with the largest product \(P(\mathbf{x}|y) P(y)\)

So how do we learn \(P(\mathbf{x}|y)\)?
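For discrete instances, the simplest answer is counting; a minimal sketch (the data layout and function names here are illustrative assumptions, not from the notes):

```python
from collections import Counter, defaultdict

def fit_counts(data):
    """Estimate P(y) and P(x|y) from (x, y) pairs by counting."""
    label_counts = Counter(y for _, y in data)
    cond_counts = defaultdict(Counter)      # cond_counts[y][x] = count of (x, y)
    for x, y in data:
        cond_counts[y][x] += 1
    n = len(data)
    prior = {y: c / n for y, c in label_counts.items()}                  # P(y)
    likelihood = {y: {x: c / label_counts[y] for x, c in counts.items()}
                  for y, counts in cond_counts.items()}                  # P(x|y)
    return prior, likelihood

def predict(x, prior, likelihood):
    """Predict the label with the largest product P(x|y) * P(y)."""
    # Unseen x gets probability 0 for every y (no smoothing in this sketch)
    return max(prior, key=lambda y: likelihood[y].get(x, 0.0) * prior[y])
```

Counting only works when every value is seen often enough; estimating probabilities from limited data is where maximum likelihood and priors come in.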

**Note**

Take for example a coin flip. You observe the sequence HTH; what is the probability that the next flip is H?

Under maximum likelihood, the answer is 2/3: writing \(\theta = P(H)\), the likelihood of the observed sequence is \(L(\theta) = P(HTH|\theta) = \theta^2 (1-\theta)\).

Setting the derivative to zero and solving gives \(\theta = 2/3\):
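\(\frac{dL}{d\theta} = 2\theta - 3\theta^2 = \theta(2 - 3\theta) = 0 \implies \theta = \frac{2}{3}\)

(the other critical point, \(\theta = 0\), gives zero likelihood)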

**Note**

But what if we have a prior belief \(P(\theta)\), where \(\theta = P(H)\)?

Now the posterior on \(\theta\) becomes \(P(\theta | HTH) = \frac{P(\theta) P(HTH | \theta)}{P(HTH)}\), which in this case is proportional to \(\theta^2 (1-\theta) P(\theta)\).

Consider two choices of prior:

**Discrete Prior**

Taking \(P(\theta=0) = P(\theta=1/2) = P(\theta=1) = 1/3\), the unnormalized posterior \(\theta^2 (1-\theta) P(\theta)\) is 0, 1/24, and 0 for the three cases respectively. After normalizing, all the weight falls on the middle value: \(P(\theta = 1/2 | HTH) = 1\).

**Prior Density**

- \(P(\theta) = 1\) for \(0 \leq \theta \leq 1\) (a uniform prior)
- so the unnormalized posterior \(\theta^2 (1-\theta) P(\theta)\) is just \(\theta^2 (1-\theta)\)
- the normalizing constant is \(P(HTH) = \int_0^1 \theta^2 (1-\theta)\, d\theta = \frac{1}{12}\), so the posterior is \(12\, \theta^2 (1-\theta)\)
- if we plot this, the max is at \(\theta = 2/3\)
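A quick numerical check of the normalizer and the posterior mode (a minimal sketch, not part of the original notes):

```python
import numpy as np

# Grid over theta in [0, 1]
theta = np.linspace(0.0, 1.0, 100_001)
unnorm = theta**2 * (1 - theta)     # likelihood * uniform prior

# Normalizing constant via a Riemann sum: should be close to 1/12
z = unnorm.sum() * (theta[1] - theta[0])
print(z)                            # ~0.08333

# Posterior mode: should sit at theta = 2/3
print(theta[np.argmax(unnorm)])     # ~0.66667
```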

- Treat the parameter \(\theta\) as a random variable with prior distribution \(P(\theta)\), then observe training data \(Z\)
- \(\text{posterior} = \frac{\text{prior} \times \text{data likelihood}}{\text{constant}}\)
- \(P(\theta | Z) = \frac{P(\theta) P(Z | \theta)}{P(Z)}\)

## Bayes’ Estimation

Treat the parameter \(\theta\) as a RV with the prior distribution \(P(\theta)\), and use fixed training data \(Z = (\mathbf{x}, y)\) pairs (RV \(S\)).

### Maximum Likelihood
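Pick the parameter value under which the observed data are most probable:

\(\hat{\theta}_{ML} = \arg \max_\theta P(Z | \theta)\)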

### Maximum a Posteriori
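Weight the likelihood by the prior and maximize:

\(\hat{\theta}_{MAP} = \arg \max_\theta P(\theta) P(Z | \theta)\)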

### Predictive Distribution

aka Full Bayes
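Rather than committing to a single \(\theta\), average the predictions over the whole posterior:

\(P(y | \mathbf{x}, Z) = \int P(y | \mathbf{x}, \theta)\, P(\theta | Z)\, d\theta\)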

### Mean a’Post
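Use the posterior mean of the parameter:

\(\hat{\theta} = E(\theta | Z) = \int \theta\, P(\theta | Z)\, d\theta\)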

### Use

- ideally, draw enough data that \(P(Y=y | \mathbf{X}=\mathbf{x})\) can be estimated for every possible pair
- this takes a lot of data

- another approach: use a class of models
- think of each model \(m\) as a way of generating the training set \(Z\) of \((\mathbf{x}, y)\) pairs

### Compound Experiment

- prior \(P(M=m)\) on the model space
- models give \(P(X=x | M=m)\) (where \(x\) is a pair \((\mathbf{x}, y)\))
- the joint experiment (if the data are iid given \(m\)) is:
- \(P(M=m, Z) = P(M=m) \prod_i P(X=x_i | M=m)\)

### Generative and Discriminative Models

- Generative model: \(P((\mathbf{x}, y) | m)\)
  - tells how to generate examples (both instance and label)
  - learn \(P(\mathbf{x} | y, m)\) and use Bayes’ rule (see the sketch after this list)
  - common assumptions:
    - \(P(\mathbf{x} | y, m)\) is Gaussian
    - \(P(y | m)\) is Bernoulli

- Discriminative model: \(P(y | h, \mathbf{x})\)
  - tells how to create labels from instances
  - often \(f(\mathbf{x}) = \arg \max_y f_y(\mathbf{x})\)
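As a concrete instance of the common generative assumptions above, a minimal sketch with a single real-valued feature, Gaussian class-conditionals, and a Bernoulli label prior (the function names and data layout are illustrative assumptions, not from the notes):

```python
import math

def fit_generative(xs: list[float], ys: list[int]):
    """Learn P(y) (Bernoulli) and P(x|y) (Gaussian) for labels y in {0, 1}.

    Assumes both labels appear and each class has nonzero variance.
    """
    params = {}
    for label in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == label]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        params[label] = (len(vals) / len(xs), mean, var)   # (P(y), mean, var)
    return params

def gaussian_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict_label(x: float, params) -> int:
    """Bayes’ rule: pick the y maximizing P(x|y) * P(y)."""
    return max(params, key=lambda y: params[y][0] * gaussian_pdf(x, params[y][1], params[y][2]))
```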