# Naive Bayes

TL;DR: predict the label that is most likely given the features

$\begin{split}& \arg \max_y P(y | \mathbf{x}) \\ & = \arg \max_y P(\mathbf{x} | y) \frac{P(y)}{P(\mathbf{x})} \\ & = \arg \max_y P(\mathbf{x} | y) P(y)\end{split}$

(the denominator $P(\mathbf{x})$ can be dropped because it does not depend on $y$)

Naive independence assumption: the attributes are conditionally independent given y, i.e.

$P(\mathbf{x} | y) = \prod_j P(x_j | y)$

So, we predict the label y that maximizes

$P(y) \prod_j P(x_j | y)$

This corresponds to a generative model: pick y, then generate x based on y.
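A minimal sketch of the resulting decision rule in Python (the container names `prior` and `cond` are hypothetical, and the probabilities are assumed to be already estimated); summing log-probabilities instead of multiplying avoids floating-point underflow when there are many features:

```python
import math

def predict(x, classes, prior, cond):
    """Return the label y maximizing log P(y) + sum_j log P(x_j | y).

    prior[y] and cond[y][j][v] are assumed to hold already-estimated
    probabilities; every value in x is assumed to have been seen in
    training (smoothing for unseen values is discussed below).
    """
    best_y, best_score = None, float("-inf")
    for y in classes:
        score = math.log(prior[y])
        for j, v in enumerate(x):
            score += math.log(cond[y][j][v])
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```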

To implement naive Bayes, we need to estimate:

- the class prior distribution $P(y)$
- for each class $y$ and each feature $x_j$, the conditional distribution $P(x_j | y)$

Each of these distributions is one-dimensional; the model is just their combination.
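A sketch of the estimation step for categorical features, assuming training data `X` (rows of attribute values) and labels `Y`; it produces the `prior` and `cond` containers used in the prediction sketch above via simple maximum-likelihood counting (no smoothing yet):

```python
from collections import Counter, defaultdict

def fit(X, Y):
    """Estimate P(y) and each P(x_j | y) by counting (maximum likelihood)."""
    n = len(Y)
    prior = {y: c / n for y, c in Counter(Y).items()}
    cond = {y: defaultdict(Counter) for y in prior}  # cond[y][j][v]
    for x, y in zip(X, Y):
        for j, v in enumerate(x):
            cond[y][j][v] += 1
    # normalize counts into conditional probabilities
    for y in cond:
        for j in cond[y]:
            total = sum(cond[y][j].values())
            for v in cond[y][j]:
                cond[y][j][v] /= total
    return prior, cond
```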

## Issues

- conditional independence is an optimistic assumption
- what if an attribute-value pair never appears in the training set?
  - Laplace smoothing / dummy data
- continuous features: use a Gaussian or other density? (see the sketch after this list)
- attributes for text classification?
  - bag-of-words model
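For a continuous feature, one common choice (an assumption about the feature's distribution, not something these notes prescribe) is to fit a per-class Gaussian and use its density in place of the $P(x_j | y)$ table; a minimal sketch:

```python
import math

def fit_gaussian(values):
    """Per-class mean and variance for one continuous feature (MLE)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var  # assumes var > 0

def gaussian_density(v, mean, var):
    """Density of N(mean, var) at v, used in place of P(x_j | y)."""
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```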

## NB for Text

- let $V$ be the vocabulary (all words/symbols in the training docs)
- for each class $y$, let $Docs_y$ be the concatenation of all docs labelled $y$
- for each word $w$ in $V$, let $\#w(Docs_y)$ be the number of times $w$ occurs in $Docs_y$
- set $P(w | y) = \frac{\#w(Docs_y) + 1}{|V| + \sum_{w'} \#w'(Docs_y)}$ (Laplace smoothing; see the sketch below)