# Probability Review

Useful notes: http://cs229.stanford.edu/section/cs229-prob.pdf

Let’s define some important things.

• Outcome Space: $$\Omega$$ - contains all possible atomic outcomes
• each outcome (atom) has a probability mass (discrete spaces) or density (continuous spaces)
• an event is a subset of $$\Omega$$
• $$P(event)$$ is a sum (or integral) over event’s outcomes
• a random variable $$V$$ maps $$\Omega$$ to (usually) $$\mathbb{R}$$
• $$V = v$$ is an event for each value $$v$$, and $$P(V)$$ denotes a distribution over the values of $$V$$

Note

Example: rolling a fair 6-sided die and then flipping that many fair coins

• $$\Omega = \{(1, H), (1, T), (2, HH), (2, HT), ...\}$$
• let number of heads be a random variable
• so what’s the expected number of heads?
• $$E(V) = \sum_{\text{atoms } a} P(a)V(a)$$
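The compound experiment is small enough to enumerate directly. A sketch of the atom-sum formula above (illustrative Python, not part of the original notes):

```python
from fractions import Fraction
from itertools import product

# Enumerate the atoms of "roll a fair d6, then flip that many fair coins".
# Each atom is (roll, flip sequence); a length-r flip sequence has
# probability (1/6) * (1/2)^r.
atoms = {}
for roll in range(1, 7):
    for flips in product("HT", repeat=roll):
        atoms[(roll, flips)] = Fraction(1, 6) * Fraction(1, 2) ** roll

assert sum(atoms.values()) == 1  # the atoms partition Omega

# E(V) = sum over atoms a of P(a) * V(a), with V(a) = number of heads
expected_heads = sum(p * flips.count("H") for (roll, flips), p in atoms.items())
print(expected_heads)  # 7/4
```

Exact arithmetic with `Fraction` makes it easy to confirm the probabilities sum to 1.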

Let’s look at some other properties to figure this out.

• Events A and B are independent iff:
• $$P(A \text{ and } B) = P(A) * P(B)$$
• Conditional probability
• $$P(A | B) = \frac{P(A, B)}{P(B)}$$
• Product Rule
• $$P(A, B) = P(A|B) * P(B)$$
• $$P(B, A) = P(B|A) * P(A)$$
• Bayes Rule
• $$P(A|B) = P(B|A) \frac{P(A)}{P(B)}$$
• $$E(V_1 + V_2) = E(V_1) + E(V_2)$$
• rule of conditioning (sum rule)
• if events $$e_1, e_2, \ldots, e_k$$ partition $$\Omega$$ then:
• $$P(event) = \sum P(e_i) P(event | e_i)$$
• $$E(rv) = \sum P(e_i) E(rv | e_i)$$
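These identities are easy to sanity-check numerically on a small joint table; a minimal sketch (illustrative Python, with a made-up joint distribution):

```python
# A tiny joint distribution over binary A and B, chosen arbitrarily.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def P_A(a): return sum(p for (x, b), p in joint.items() if x == a)
def P_B(b): return sum(p for (a, y), p in joint.items() if y == b)
def P_A_given_B(a, b): return joint[(a, b)] / P_B(b)
def P_B_given_A(b, a): return joint[(a, b)] / P_A(a)

# Product rule: P(A, B) = P(A | B) P(B)
assert abs(joint[(1, 1)] - P_A_given_B(1, 1) * P_B(1)) < 1e-12
# Bayes rule: P(A | B) = P(B | A) P(A) / P(B)
assert abs(P_A_given_B(1, 1) - P_B_given_A(1, 1) * P_A(1) / P_B(1)) < 1e-12
# Sum rule: P(A = 1) = sum_b P(B = b) P(A = 1 | B = b)
assert abs(P_A(1) - sum(P_B(b) * P_A_given_B(1, b) for b in (0, 1))) < 1e-12
```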

Note

Back to the expected number of heads.

$\begin{split}E(\text{# heads}) & = \sum_{r=1}^6 P(roll = r) E(\text{# heads} | roll = r) \\ & = \sum_{r=1}^6 \frac{1}{6} \cdot \frac{r}{2} = \frac{1}{6}\left(\frac{1+2+3+4+5+6}{2}\right) \\ & = \frac{21}{12} = 1.75\end{split}$
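The conditioning computation above is a one-liner, since $$E(\text{# heads} \mid roll = r) = r/2$$ for fair coins (illustrative Python):

```python
from fractions import Fraction

# Rule of conditioning: E(# heads) = sum_r P(roll = r) * E(# heads | roll = r),
# where E(# heads | roll = r) = r / 2 for r fair coins.
expected_heads = sum(Fraction(1, 6) * Fraction(r, 2) for r in range(1, 7))
print(expected_heads)  # 7/4
```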
• Joint distributions factor
• if $$\Omega = S \times T \times U$$, then $$P(S=s,T=t,U=u)$$ is $$P(S=s)P(T=t|S=s)P(U=u|S=s,T=t)$$
• Conditional distributions are also distributions
• $$P(A|B) = \frac{P(A, B)}{P(B)}$$, so $$P(A|B, C)=\frac{P(A,B|C)}{P(B|C)}$$
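A quick numerical check of the chain-rule factorization on a randomly chosen joint over three binary variables (illustrative Python):

```python
import itertools
import random

random.seed(0)
# A random joint distribution P over (S, T, U), each binary.
weights = {stu: random.random() for stu in itertools.product((0, 1), repeat=3)}
Z = sum(weights.values())
P = {stu: w / Z for stu, w in weights.items()}

def marg(fixed):
    """Probability that the coordinates listed in `fixed` take the given values."""
    return sum(p for stu, p in P.items()
               if all(stu[i] == v for i, v in fixed.items()))

# Chain rule: P(s, t, u) = P(s) * P(t | s) * P(u | s, t)
for (s, t, u), p in P.items():
    chain = (marg({0: s})
             * marg({0: s, 1: t}) / marg({0: s})
             * marg({0: s, 1: t, 2: u}) / marg({0: s, 1: t}))
    assert abs(p - chain) < 1e-12
```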

## Bayes Rule for Learning

• Assume joint distribution $$P(\mathbf{X=x}, Y=y)$$
• We want $$P(Y=y|\mathbf{X=x})$$ for each label $$y$$ on a new instance $$\mathbf{x}$$
• So, using Bayes’ Rule, $$P(y|\mathbf{x}) = P(\mathbf{x}|y) \frac{P(y)}{P(\mathbf{x})}$$
• $$P(\mathbf{x})$$ is the same for every label, so it is enough that $$P(y|\mathbf{x})$$ is proportional to $$P(\mathbf{x}|y) P(y)$$
• From the data, we can learn $$P(\mathbf{x}|y)$$ and $$P(y)$$
• Predict label $$y$$ with largest product
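The prediction rule is just an argmax over unnormalized posteriors. A minimal sketch with made-up learned estimates (the labels and numbers below are hypothetical, not from the notes):

```python
# Hypothetical learned quantities for a single new instance x:
P_y = {"spam": 0.4, "ham": 0.6}          # learned class prior P(y)
P_x_given_y = {"spam": 0.8, "ham": 0.1}  # learned likelihood P(x | y)

# P(y | x) is proportional to P(x | y) P(y); P(x) cancels in the argmax.
scores = {y: P_x_given_y[y] * P_y[y] for y in P_y}
prediction = max(scores, key=scores.get)
print(prediction)  # spam  (0.32 vs 0.06)
```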

So how do we learn $$P(\mathbf{x}|y)$$?

Note

Take for example a coin flip. You observe the sequence HTH; what is the probability that the next flip is H?

Mathematically, the answer is 2/3: the likelihood of the observed sequence is $$L(\theta) = P(HTH|\theta) = \theta^2 (1-\theta)$$, where $$\theta = P(H)$$.

Setting the derivative of $$L(\theta)$$ to zero and solving gives $$\theta = 2/3$$.
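The maximizer can also be found numerically; a sketch via grid search (illustrative Python):

```python
# Likelihood of observing HTH when P(H) = theta.
def likelihood(theta):
    return theta**2 * (1 - theta)

# Grid search over [0, 1]; the maximum-likelihood estimate should be 2/3.
grid = [i / 10000 for i in range(10001)]
theta_ml = max(grid, key=likelihood)
print(theta_ml)
```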

Note

But what if we have a prior belief $$P(\theta)$$ where $$\theta = P(H)$$?

Now, the posterior on $$\theta$$ becomes $$P(\theta | HTH)$$:

$P(\theta | HTH) = P(HTH | \theta) \frac{P(\theta)}{P(HTH)}$

Or in this case:

$\frac{\theta^2 (1-\theta) P(\theta)}{\text{normalization}}$

Discrete Prior

Taking $$P(\theta=0) = P(\theta=1/2) = P(\theta=1) = 1/3$$, $$\theta^2 (1-\theta) P(\theta)$$ is 0, 1/24, and 0 for the 3 cases respectively. Thus, the posterior $$P(\theta = 1/2 | HTH) = 1$$.
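The discrete-prior posterior can be computed exactly; a sketch (illustrative Python):

```python
from fractions import Fraction

# Uniform discrete prior on theta = P(H) over {0, 1/2, 1}; data is HTH.
prior = {Fraction(0): Fraction(1, 3),
         Fraction(1, 2): Fraction(1, 3),
         Fraction(1): Fraction(1, 3)}

# Unnormalized posterior: likelihood theta^2 (1 - theta) times the prior.
unnorm = {th: th**2 * (1 - th) * p for th, p in prior.items()}
Z = sum(unnorm.values())                 # P(HTH) = 0 + 1/24 + 0 = 1/24
posterior = {th: u / Z for th, u in unnorm.items()}
print(posterior[Fraction(1, 2)])  # 1
```

The extreme values $$\theta = 0$$ and $$\theta = 1$$ each assign zero probability to the observed sequence, so all posterior mass lands on $$\theta = 1/2$$.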

Prior Density

• $$P(\theta) = 1$$ for $$0 \leq \theta \leq 1$$
• So $$\theta^2 (1-\theta) P(\theta)$$ is just $$\theta^2 (1-\theta)$$
• and, since $$\int_0^1 \theta^2 (1-\theta)\, d\theta = \frac{1}{12}$$, the posterior is $$12\, \theta^2 (1-\theta)$$
• If we plot this, the max is at $$\theta = 2/3$$
• Treat parameter $$\theta$$ as a random var with the prior distribution $$P(\theta)$$, see training data $$Z$$
• $$\text{posterior} = \frac{\text{prior} \times \text{data likelihood}}{\text{constant}}$$
• $$P(\theta | Z) = \frac{P(\theta) P(Z | \theta)}{P(Z)}$$
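The uniform-prior case is easy to verify on a grid; a sketch that recovers the normalizer and the posterior mode (illustrative Python):

```python
# Uniform prior density P(theta) = 1 on [0, 1]; data is HTH.
# Unnormalized posterior: theta^2 (1 - theta).
n = 100000
grid = [(i + 0.5) / n for i in range(n)]          # midpoints of [0, 1]
unnorm = [th**2 * (1 - th) for th in grid]

Z = sum(unnorm) / n                               # Riemann sum; should be ~1/12
mode = grid[max(range(n), key=lambda i: unnorm[i])]
print(Z, mode)  # ~0.08333 (= 1/12), ~0.6667 (= 2/3)
```

Dividing by $$Z$$ gives the density $$12\,\theta^2(1-\theta)$$, whose maximum sits at $$\theta = 2/3$$, matching the plot described above.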

## Bayes’ Estimation

Treat the parameter $$\theta$$ as a RV with the prior distribution $$P(\theta)$$, and condition on the fixed observed data $$Z = (\mathbf{x}, y)$$, the value taken by the RV $$S$$; below, $$\theta'$$ ranges over candidate parameter values.

### Maximum Likelihood

$\theta_{ML} = \arg \max_{\theta'} P(S=Z|\theta = \theta')$

### Maximum a Posteriori

$\begin{split}\theta_{MAP} & = \arg \max_{\theta'} P(\theta = \theta' | S=Z) \\ & = \arg \max_{\theta'} P(S=Z | \theta = \theta')\frac{P(\theta = \theta')}{P(S=Z)}\end{split}$
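For the coin example, a sketch of the MAP estimate under an assumed Beta(2, 2) prior (the prior choice here is illustrative, not from the notes; $$P(S=Z)$$ is dropped since it does not depend on $$\theta'$$):

```python
# MAP estimate for theta after observing HTH, under an assumed Beta(2, 2) prior.
# Unnormalized posterior: theta^2 (1-theta) * theta (1-theta) = theta^3 (1-theta)^2,
# whose mode is 3/5 (set the derivative to zero).
def unnorm_posterior(theta):
    likelihood = theta**2 * (1 - theta)   # P(HTH | theta)
    prior = theta * (1 - theta)           # Beta(2, 2) density, up to a constant
    return likelihood * prior

grid = [i / 100000 for i in range(100001)]
theta_map = max(grid, key=unnorm_posterior)
print(theta_map)  # 0.6
```

Note how the prior pulls the estimate from the maximum-likelihood value 2/3 toward 1/2.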

### Predictive Distribution

aka Full Bayes

$P(Y=y | S=Z) = \int P(Y=y | \theta=\theta') P(\theta=\theta' | S=Z) d\theta'$

### Mean a’Post

$\theta_{mean} = E[\theta | S=Z] = \int \theta' P(\theta=\theta' | S=Z) d\theta'$
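For the coin with a uniform prior, the posterior mean and the predictive probability of heads coincide, since $$P(Y=H|\theta=\theta') = \theta'$$. A numerical sketch (illustrative Python):

```python
# Posterior after HTH with uniform prior is proportional to theta^2 (1 - theta).
# Posterior mean = predictive P(next = H | HTH)
#               = integral of theta * 12 theta^2 (1-theta) = 12 (1/4 - 1/5) = 3/5.
n = 100000
grid = [(i + 0.5) / n for i in range(n)]
unnorm = [th**2 * (1 - th) for th in grid]
Z = sum(unnorm)
theta_mean = sum(th * u for th, u in zip(grid, unnorm)) / Z
print(theta_mean)  # ~0.6
```

So Full Bayes predicts heads with probability 3/5, while the MAP/ML point estimate under the uniform prior is 2/3.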

### Use

• one approach: draw enough data that $$P(Y=y | X=\mathbf{x})$$ can be estimated for every possible pair $$(\mathbf{x}, y)$$
• this takes a lot of data
• another approach: posit a class of models
• think of each model $$m$$ as a way of generating the training set $$Z$$ of $$(\mathbf{x}, y)$$ pairs

### Compound Experiment

• prior $$P(M=m)$$ on model space
• models give $$P(X=x | M=m)$$ (where $$x$$ is a pair $$(\mathbf{x}, y)$$)
• The joint experiment (if data is iid given m) is:
$P(\{(\mathbf{x_i}, y_i)\}, m) = P(m) \prod_i (P(\mathbf{x_i} | m) P(y_i | \mathbf{x_i}, m))$

### Generative and Discriminative Models

• Generative model: $$P((\mathbf{x}, y) | m)$$
• tells how to generate examples (both instance and label)
• learn $$P(\mathbf{x} | y, m)$$ and use Bayes’ rule
• common assumptions:
• $$P(\mathbf{x} | y, m)$$ is Gaussian
• $$P(y | m)$$ is Bernoulli
• Discriminative model: $$P(y | m, \mathbf{x})$$
• tells how to create labels from instances
• often $$f(\mathbf{x}) = \arg \max_y f_y(\mathbf{x})$$
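A minimal generative classifier under the common assumptions listed above, with Gaussian class-conditionals; the class parameters are illustrative, not learned from real data:

```python
import math

# Gaussian P(x | y, m) per class and a Bernoulli class prior P(y | m).
params = {0: {"mean": -1.0, "std": 1.0}, 1: {"mean": 2.0, "std": 1.0}}
P_y = {0: 0.5, 1: 0.5}

def gaussian_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std**2)) / (std * math.sqrt(2 * math.pi))

def predict(x):
    # Bayes' rule: argmax_y P(x | y) P(y); P(x) cancels in the argmax.
    return max(P_y, key=lambda y: gaussian_pdf(x, **params[y]) * P_y[y])

print(predict(-0.5), predict(1.8))  # 0 1
```

A discriminative model would instead fit $$P(y | m, \mathbf{x})$$ directly (e.g. logistic regression), never modeling how the instances $$\mathbf{x}$$ themselves are generated.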