Ensemble/Boosting
=================

Ensemble Methods
----------------

- use an ensemble/group of hypotheses
- diversity is important; an ensemble of "yes-men" is useless
- get diverse hypotheses by using

  - different data
  - different algorithms
  - different hyperparameters

Why?

- averaging reduces variance

  - an ensemble is more stable than an individual hypothesis
  - consider a biased coin with p(head) = 1/3

    - variance of one flip = 1/3 - 1/9 = 2/9
    - variance of the average of 2 flips = (2/9)/2 = 1/9

- averaging makes fewer mistakes

  - consider 3 independent classifiers with accuracies 0.8, 0.7, and 0.7
  - P(majority vote correct) = P(f1 correct, f2 correct, f3 wrong) + P(f1 correct, f2 wrong, f3 correct) + P(f1 wrong, f2 correct, f3 correct) + P(f1, f2, f3 all correct)
  - :math:`= (.8)(.7)(.3) + (.8)(.3)(.7) + (.2)(.7)(.7) + (.8)(.7)(.7) \approx 0.83`

Creation
^^^^^^^^

- use different training sets

  - bootstrap example: pick *m* examples from the labeled data with replacement
  - cross-validation sampling
  - reweight the data (boosting, later)

- use different features

  - random forests "hide" some features

Prediction
^^^^^^^^^^

- unweighted vote
- weighted vote

  - pay more attention to better predictors

- cascade

  - filter examples; each level either predicts or passes the example on

Boosting
--------

- "boosts" the performance of another learning algorithm
- creates a set of classifiers and predicts with a weighted vote
- uses different distributions over the sample to get diversity

  - up-weight hard examples, down-weight easy examples

- note: previously, ensembles were used to reduce variance

.. image:: _static/boosting/ex1.png
   :width: 450

If boosting is possible, then:

- we can use fairly wild guesses to produce highly accurate predictions
- if you can learn "part way", you can learn "all the way"
- we should be able to improve any learning algorithm
- for any learning problem:

  - either we can always learn with nearly perfect accuracy
  - or there exist cases where we cannot learn even slightly better than random guessing

AdaBoost
^^^^^^^^

Given a training sample :math:`S` with labels :math:`\pm 1` and a learning algorithm :math:`L`:

1. for :math:`t` from 1 to :math:`T`:

   1. create a distribution :math:`D_t` on :math:`S`
   2. call :math:`L` with :math:`D_t` on :math:`S` to get hypothesis :math:`h_t`

      - i.e. :math:`\min_f \sum_n D_t(n)\, l(f(x_n), y_n)`, where :math:`D_t(n)` is the weight of example :math:`n`

   3. calculate a weight :math:`\alpha_t` for :math:`h_t`

2. the final hypothesis is :math:`F(x) = \sum_t \alpha_t h_t(x)`, or :math:`H(x)` = the value with the most weight

Formally:

.. image:: _static/boosting/ex2.png
   :width: 450

So how do we pick :math:`D_t` and :math:`\alpha_t`?

- :math:`D_1(i) = 1/m`

  - the weight assigned to :math:`(x_i, y_i)` at :math:`t = 1`

- given :math:`D_t` and :math:`h_t`:

  - :math:`D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(x_i))`

    - if :math:`h_t` misclassifies :math:`x_i`, the weight is increased by a factor > 1 (positive exponent)
    - otherwise it is decreased by a factor < 1 (negative exponent)
    - where :math:`Z_t` is a normalization factor

  - :math:`\alpha_t = \frac{1}{2} \ln \left( \frac{1-\epsilon_t}{\epsilon_t} \right)`, where :math:`\epsilon_t = \sum_{i: h_t(x_i) \neq y_i} D_t(i)` is the weighted error of :math:`h_t` (so :math:`\alpha_t > 0` whenever :math:`\epsilon_t < 1/2`)

- :math:`H_{final}(x) = \text{sign}(\sum_t \alpha_t h_t(x))`

**Example**:

.. image:: _static/boosting/ex3.png
   :width: 300

Now the weights become :math:`\frac{1}{10} e^{0.42}` for the misclassified examples and :math:`\frac{1}{10} e^{-0.42}` for the correctly classified ones (before renormalizing by :math:`Z_1`).

.. image:: _static/boosting/ex4.png
   :width: 400

.. image:: _static/boosting/ex5.png
   :width: 400

.. image:: _static/boosting/ex6.png
   :width: 400
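The notes contain no code; the following is a minimal NumPy sketch of the update rules above, using decision stumps as the weak learner :math:`L`. The helper names (``fit_stump``, ``adaboost``) are illustrative, not part of any library.

.. code-block:: python

    import numpy as np

    def fit_stump(X, y, D):
        """Pick the single-feature threshold rule with the lowest weighted error."""
        best = None
        for j in range(X.shape[1]):
            for thresh in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(X[:, j] >= thresh, sign, -sign)
                    err = np.sum(D[pred != y])          # weighted error under D_t
                    if best is None or err < best[0]:
                        best = (err, j, thresh, sign)
        return best                                     # (epsilon_t, feature, threshold, sign)

    def stump_predict(stump, X):
        _, j, thresh, sign = stump
        return np.where(X[:, j] >= thresh, sign, -sign)

    def adaboost(X, y, T=20):
        """y must be labeled with +1 / -1."""
        m = len(y)
        D = np.full(m, 1.0 / m)                         # D_1(i) = 1/m
        stumps, alphas = [], []
        for t in range(T):
            stump = fit_stump(X, y, D)
            eps = max(stump[0], 1e-10)                  # weighted error epsilon_t
            alpha = 0.5 * np.log((1 - eps) / eps)       # alpha_t
            pred = stump_predict(stump, X)
            D = D * np.exp(-alpha * y * pred)           # up-weight mistakes, down-weight correct
            D = D / D.sum()                             # normalize by Z_t
            stumps.append(stump)
            alphas.append(alpha)

        def H(X_new):                                   # H_final(x) = sign(sum_t alpha_t h_t(x))
            F = sum(a * stump_predict(s, X_new) for a, s in zip(alphas, stumps))
            return np.sign(F)
        return H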
Analyzing Error
^^^^^^^^^^^^^^^

Thm: write :math:`\epsilon_t` as :math:`1/2 - \gamma_t`

- :math:`\gamma_t` = the "edge" = how much better than random guessing

Then:

.. math::

   \text{training error}(H_{final}) & \leq \prod_t \left[ 2 \sqrt{\epsilon_t (1-\epsilon_t)} \right] \\
   & = \prod_t \sqrt{1 - 4\gamma_t^2} \\
   & \leq \exp \left( -2 \sum_t \gamma_t^2 \right)

So if :math:`\forall t: \gamma_t \geq \gamma > 0`, then :math:`\text{training error}(H_{final}) \leq e^{-2\gamma^2 T}`.

Therefore, as :math:`T \to \infty`, the training error :math:`\to 0`.

Proof
^^^^^

Let :math:`F(x) = \sum_t \alpha_t h_t(x)`, so that :math:`H_{final}(x) = \text{sign}(F(x))`.

Step 1: unwrapping the recurrence

.. math::

   D_{final}(i) & = \frac{1}{m} \frac{\exp(-y_i \sum_t \alpha_t h_t(x_i))}{\prod_t Z_t} \\
   & = \frac{1}{m} \frac{\exp(-y_i F(x_i))}{\prod_t Z_t}

Step 2: :math:`\text{training error}(H_{final}) \leq \prod_t Z_t`

.. math::

   \text{training error}(H_{final}) & = \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i \neq H_{final}(x_i) \\ 0 & \text{otherwise} \end{cases} \\
   & = \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i F(x_i) \leq 0 \\ 0 & \text{otherwise} \end{cases} \\
   & \leq \frac{1}{m} \sum_i \exp(-y_i F(x_i)) \\
   & = \sum_i D_{final}(i) \prod_t Z_t \\
   & = \prod_t Z_t

Step 3: :math:`Z_t = 2 \sqrt{\epsilon_t (1-\epsilon_t)}`

.. math::

   Z_t & = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) \\
   & = \sum_{i: y_i \neq h_t(x_i)} D_t(i) e^{\alpha_t} + \sum_{i: y_i = h_t(x_i)} D_t(i) e^{-\alpha_t} \\
   & = \epsilon_t e^{\alpha_t} + (1-\epsilon_t) e^{-\alpha_t} \\
   & = 2 \sqrt{\epsilon_t (1-\epsilon_t)}

where the last step substitutes :math:`e^{\alpha_t} = \sqrt{(1-\epsilon_t)/\epsilon_t}`, making both terms equal to :math:`\sqrt{\epsilon_t (1-\epsilon_t)}`.

Discussion
^^^^^^^^^^

We would expect that, even though the training error approaches 0 as :math:`T` increases, the test error will not: overfitting!

We can actually bound the "generalization error" (basically the test error):

.. math::

   \text{generalization error} \leq \text{training error} + \tilde{O}\left(\sqrt{\frac{dT}{m}}\right)

where :math:`m` = # of training samples, :math:`d` = "complexity" of the weak classifiers, and :math:`T` = # of rounds.

But in practice there is not always such a tradeoff between training and test error; AdaBoost's test error often keeps improving even after the training error reaches 0.

Margin Approach
"""""""""""""""

- training error only measures whether classifications are right or wrong
- we should also consider the confidence of the classifications
- :math:`H_{final}` is a weighted majority vote of weak classifiers
- measure confidence by the *margin* = the strength of the vote

  - = (weighted fraction voting correctly) - (weighted fraction voting incorrectly)

- so as we train more, we increase the margin, which leads to a decrease in test loss
- both AdaBoost and SVMs

  - work by maximizing margins
  - find a linear threshold function in a high-dimensional space
  - but they use different norms

AdaBoost is:

- fast
- simple and easy to program
- free of hyperparameters (except :math:`T`)
- flexible: it can be combined with any learning algorithm
- usable without prior knowledge about the weak learner
- provably effective (provided the weak learner can find a rough rule of thumb)
- versatile

But:

- performance depends on the data and the weak learner
- consistent with the theory, AdaBoost can fail if:

  - the weak classifiers are too complex (overfitting)
  - the weak classifiers are too weak (basically random guessing)

    - underfitting, or low margins -> overfitting

- it is susceptible to uniform noise
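As a rough way to see the training-error / test-error behavior discussed above, here is a small experiment sketch assuming scikit-learn is available. It uses ``AdaBoostClassifier`` (whose default weak learner is a depth-1 decision tree, i.e. a decision stump) and ``staged_score`` to track accuracy after each boosting round; the dataset and parameter values are arbitrary illustrations.

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    # synthetic binary classification problem with a little label noise
    X, y = make_classification(n_samples=2000, n_features=20,
                               flip_y=0.05, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # boost decision stumps for T = 200 rounds
    clf = AdaBoostClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)

    # staged_score yields the accuracy of the partial ensemble after each round t
    staged = zip(clf.staged_score(X_train, y_train),
                 clf.staged_score(X_test, y_test))
    for t, (train_acc, test_acc) in enumerate(staged, start=1):
        if t % 50 == 0:
            print(f"T={t:3d}  train acc={train_acc:.3f}  test acc={test_acc:.3f}")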