Naive Bayes
===========

TLDR: predict the most likely label given the features

.. math::

    & \arg \max_y P(y | \mathbf{x}) \\
    & = \arg \max_y P(\mathbf{x} | y) \frac{P(y)}{P(\mathbf{x})} \\
    & = \arg \max_y P(\mathbf{x} | y) P(y)

Naive independence assumption: the attributes are conditionally independent given *y*, i.e.

.. math::

    P(\mathbf{x} | y) = \prod_j P(x_j | y)

So, we predict the label *y* that maximizes

.. math::

    P(y) \prod_j P(x_j | y)

This uses a *generative* model: pick *y*, then generate **x** based on *y*.

To implement Naive Bayes, we need to **estimate**:

- the :math:`P(y)` distribution
- for each class *y* and each feature :math:`x_j`, the :math:`P(x_j | y)` distribution

All of these distributions are over a single feature (1-dimensional); their combination is the model.

.. image:: _static/naivebayes/ex1.png

Issues
^^^^^^

- conditional independence is an optimistic assumption
- what if an attribute-value pair is not in the training set?

  - Laplace smoothing / dummy data

- continuous features: use a Gaussian or other density?
- attributes for text classification?

  - bag of words model

NB for Text
^^^^^^^^^^^

- let :math:`V` be the vocabulary (all words/symbols in the training docs)
- for each class :math:`y`, let :math:`Docs_y` be the concatenation of all docs labelled *y*
- for each word :math:`w` in :math:`V`, let :math:`\#w(Docs_y)` be the number of times :math:`w` occurs in :math:`Docs_y`
- set :math:`P(w | y) = \frac{\#w(Docs_y) + 1}{|V| + \sum_w \#w(Docs_y)}` (Laplace smoothing)
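
As a concrete sketch of the estimation and prediction steps above (and of the "continuous features: use a Gaussian" option), the snippet below fits one 1-D Gaussian per class per feature for :math:`P(x_j | y)`, estimates :math:`P(y)` from class frequencies, and predicts by maximizing the log of :math:`P(y) \prod_j P(x_j | y)`. The function names and toy data are invented for illustration; this is a minimal example, not the notes' reference implementation.

.. code-block:: python

    # Minimal Gaussian Naive Bayes sketch (hypothetical names, toy data).
    import math
    from collections import defaultdict

    def fit(X, y):
        """Estimate P(y) and a 1-D Gaussian per class per feature for P(x_j | y)."""
        by_class = defaultdict(list)
        for xi, yi in zip(X, y):
            by_class[yi].append(xi)
        model = {}
        for label, rows in by_class.items():
            prior = len(rows) / len(X)                      # P(y)
            means, variances = [], []
            for j in range(len(rows[0])):
                col = [r[j] for r in rows]
                mu = sum(col) / len(col)
                var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9  # avoid zero variance
                means.append(mu)
                variances.append(var)
            model[label] = (prior, means, variances)
        return model

    def log_gaussian(x, mu, var):
        return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

    def predict(model, x):
        """Return argmax_y of log P(y) + sum_j log P(x_j | y)."""
        best_label, best_score = None, -math.inf
        for label, (prior, means, variances) in model.items():
            score = math.log(prior)
            for xj, mu, var in zip(x, means, variances):
                score += log_gaussian(xj, mu, var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    # toy usage
    X = [[1.0, 2.1], [0.9, 1.9], [3.2, 3.8], [3.0, 4.1]]
    y = ["a", "a", "b", "b"]
    print(predict(fit(X, y), [1.1, 2.0]))                   # -> "a"

Working in log space avoids underflow when multiplying many small per-feature probabilities.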
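
The text formulas can likewise be turned into a small bag-of-words classifier. The sketch below assumes documents are already tokenised into word lists and implements the Laplace-smoothed :math:`P(w | y)` estimate above; the helper names and toy documents are made up for the example.

.. code-block:: python

    # Minimal bag-of-words Naive Bayes with Laplace smoothing (hypothetical names, toy data).
    import math
    from collections import Counter, defaultdict

    def fit_text(docs, labels):
        """docs: list of token lists; labels: list of class labels."""
        vocab = {w for doc in docs for w in doc}            # V
        class_counts = Counter(labels)                      # for P(y)
        word_counts = defaultdict(Counter)                  # #w(Docs_y)
        for doc, label in zip(docs, labels):
            word_counts[label].update(doc)
        return vocab, class_counts, word_counts, len(docs)

    def log_p_word(w, label, vocab, word_counts):
        """log P(w | y) = log((#w(Docs_y) + 1) / (|V| + sum_w #w(Docs_y)))"""
        numer = word_counts[label][w] + 1
        denom = len(vocab) + sum(word_counts[label].values())
        return math.log(numer / denom)

    def predict_text(doc, vocab, class_counts, word_counts, n_docs):
        best_label, best_score = None, -math.inf
        for label, count in class_counts.items():
            score = math.log(count / n_docs)                # log P(y)
            for w in doc:
                if w in vocab:                              # skip words outside V
                    score += log_p_word(w, label, vocab, word_counts)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    # toy usage
    docs = [["great", "movie"], ["terrible", "movie"], ["great", "fun"], ["boring", "terrible"]]
    labels = ["pos", "neg", "pos", "neg"]
    model = fit_text(docs, labels)
    print(predict_text(["great", "boring"], *model))        # -> "pos"

Because of the "+1" in the numerator, a word that never appears in :math:`Docs_y` still gets non-zero probability, so a single unseen attribute-value pair cannot zero out the whole product.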