# Decision Trees

Let’s take the example of deciding whether or not to play tennis given 4 features - a binary classification problem based on discrete features.

To construct a tree, pick a feature and split on it, then recursively build the tree top-down

## Algorithm

ID3 = Iterative Dichotomiser 3

How do we choose the “best” feature?

- Our goal is to build a tree that is as small as possible - Occam’s Razor
  - i.e. select the hypothesis that makes minimal assumptions
- finding the minimal tree is NP-hard
- this algorithm is a greedy heuristic and cannot guarantee optimality (a construction sketch follows this list)
- the main heuristic is the definition of the “best” feature to split on
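
A minimal sketch of the top-down construction, assuming each example is a dict of discrete feature values plus a "label" key; `build_tree` and `best_feature` are illustrative names, not from a library (the notes define “best” via information gain below):

```python
from collections import Counter

def build_tree(examples, features, best_feature):
    labels = [ex["label"] for ex in examples]
    if len(set(labels)) == 1:             # pure node -> leaf with that label
        return labels[0]
    if not features:                      # no features left -> majority-label leaf
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(examples, features)  # greedy choice of the split feature
    branches = {}
    for v in {ex[f] for ex in examples}:  # one branch per observed value of f
        subset = [ex for ex in examples if ex[f] == v]
        rest = [g for g in features if g != f]
        branches[v] = build_tree(subset, rest, best_feature)
    return (f, branches)                  # internal node: (feature, {value: subtree})
```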

## “Best”

### Example

Take the dataset:

```
(A=0,B=0), -: 50 examples
(A=0,B=1), -: 50 examples
(A=1,B=0), -: 0 examples
(A=1,B=1), +: 100 examples
```

In this case, it’s easy to tell we should split on A.

But what about a case where the choice is less obvious?

We still prefer the tree that splits on A first because, statistically, it requires fewer decisions to reach a prediction. But how do we quantify this?

We should choose features that split the examples into “relatively pure” subsets. This is quantified by information gain.

### Entropy

The entropy of a set of examples \(S\) relative to a binary classification task, also denoted \(H(S)\), is:

\[ H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_- \]

where \(p_+\) is the proportion of examples in \(S\) that belong to the positive class, and \(p_-\) the proportion that belong to the negative class.

Generalized to \(K\) classes:

\[ H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k \]

where \(p_k\) is the proportion of examples in \(S\) belonging to class \(k\).

The lower the entropy, the less uncertainty there is in a dataset.

**Ex.** When all examples belong to one class, entropy is 0. When the examples are split evenly across the classes, entropy is maximal (1 in the binary case).
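
A quick numerical check of these two cases (a sketch; `entropy` is an illustrative helper that takes the class proportions \(p_k\) directly):

```python
import math

def entropy(proportions):
    # H(S) = -sum_k p_k log2 p_k, using the convention 0 * log 0 = 0
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0]))       # all examples in one class -> 0.0
print(entropy([0.5, 0.5]))  # even binary split         -> 1.0
```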

### Information Gain

The information gain of an attribute \(a\) is the expected reduction in entropy caused by partitioning on this attribute:

\[ \mathrm{Gain}(S, a) = H(S) - \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|} \, H(S_v) \]

where \(S_v\) is the subset of \(S\) for which the value of attribute \(a\) is \(v\).

This subtracts the averaged entropy of the partitions from \(H(S)\), where each partition’s entropy is weighted by its share of the size of \(S\).

Partitions with low entropy (low uncertainty, i.e. nearly pure class distributions) result in high gain

So this algorithm (ID3) makes statistics-based decisions that use *all the data*
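
For example, a sketch computing the gain of A and B on the dataset from the example above (`entropy` and `gain` are illustrative helper names):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)  # weight by |S_v| / |S|
    return entropy(labels) - remainder

# the dataset from the example above, as (features, label) pairs
data = ([({"A": 0, "B": 0}, "-")] * 50 +
        [({"A": 0, "B": 1}, "-")] * 50 +
        [({"A": 1, "B": 1}, "+")] * 100)

print(gain(data, "A"))  # 1.0   -> splitting on A removes all uncertainty
print(gain(data, "B"))  # ~0.31 -> splitting on B leaves a mixed partition
```

The higher gain for A matches the earlier intuition that splitting on A gives purer partitions.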

## Overfitting

- Learning a tree that perfectly classifies the training data may not produce the tree that generalizes best to unseen data
  - there may be noise in the data
  - we might be making decisions based on very little data

- causes:
  - too much variance in the training data
  - splitting on features that are irrelevant

### Pruning

Pruning is a way to reduce overfitting:

- remove leaves and assign the majority label of the parent to all items
- prune the children of a node S if all of the children are leaves
- how much to prune? prune if the accuracy on the *validation set* does not decrease upon assigning the most frequent class label to all items at S

2 basic approaches:

- post-pruning: grow the full tree then remove nodes that don’t have sufficient evidence
- pre-pruning: stop growing the tree at some point during construction when it is determined that there is not enough data to make reliable choices

Note that not all leaves in a pruned tree have to be pure. How do we decide the label at a non-pure leaf? Typically by taking the most frequent class among the examples that reach it.
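
A sketch of this reduced-error style of pruning, using the same representation as the construction sketch above (a leaf is a label string, an internal node a `(feature, branches)` pair); it assumes every branch has training examples and that validation examples only take feature values present in the tree:

```python
from collections import Counter

def majority_label(examples):
    return Counter(ex["label"] for ex in examples).most_common(1)[0][0]

def predict(tree, example):
    while not isinstance(tree, str):       # descend until we reach a leaf label
        feature, branches = tree
        tree = branches[example[feature]]
    return tree

def accuracy(tree, examples):
    return sum(predict(tree, ex) == ex["label"] for ex in examples) / len(examples)

def prune(tree, train_examples, val_examples):
    # leaves stay; if no validation examples reach this node, keep the subtree as-is
    if isinstance(tree, str) or not val_examples:
        return tree
    feature, branches = tree
    pruned = {}
    for v, sub in branches.items():        # prune bottom-up, routing examples down each branch
        pruned[v] = prune(sub,
                          [ex for ex in train_examples if ex[feature] == v],
                          [ex for ex in val_examples if ex[feature] == v])
    candidate = (feature, pruned)
    leaf = majority_label(train_examples)
    # prune the children if they are all leaves and replacing the subtree with the
    # most frequent label does not decrease accuracy on the validation examples
    if all(isinstance(s, str) for s in pruned.values()) and \
       accuracy(leaf, val_examples) >= accuracy(candidate, val_examples):
        return leaf
    return candidate
```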

### Random Forests

Another way to avoid overfitting:

- learn a collection of decision trees instead of a single one
- predict the label by aggregating the predictions of all the trees in the forest
  - voting for classification, averaging for regression (see the sketch below)
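
A short sketch of the idea with scikit-learn (assumed to be available; the dataset here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic binary classification data, just for illustration
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample; prediction is a majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # test accuracy of the aggregated forest
```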

## Continuous Attributes

TLDR:

- for continuous targets (regression), split on the feature that gives the largest drop in MSE
- discretize continuous features into discrete buckets (small, medium, large, etc.)
- what thresholds should be used for these splits?
  - e.g. sort the examples by the feature value; for each adjacent pair with different labels, consider placing a threshold between them (see the sketch after this list)
  - should it go at the midpoint or at an endpoint?
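
A sketch of the threshold-finding idea (`candidate_thresholds` is an illustrative helper that places candidates at midpoints):

```python
def candidate_thresholds(values, labels):
    # sort examples by feature value; a candidate threshold is the midpoint
    # between adjacent examples whose labels differ
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 != l2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)
    return thresholds

print(candidate_thresholds([48, 60, 72, 80, 90], ["-", "-", "+", "+", "-"]))
# -> [66.0, 85.0]
```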

## Missing Values

Options:

- throw away the instance
- fill in the most common value of the feature (see the sketch after this list)
- fill in with all possible values of the feature
- guess based on correlations with other features/labels
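
A sketch of the “most common value” option, assuming missing entries are marked with `None` (the helper name is illustrative):

```python
from collections import Counter

def fill_most_common(examples, feature):
    observed = [ex[feature] for ex in examples if ex[feature] is not None]
    mode = Counter(observed).most_common(1)[0][0]   # most common observed value
    return [{**ex, feature: mode if ex[feature] is None else ex[feature]}
            for ex in examples]

data = [{"outlook": "sunny"}, {"outlook": None},
        {"outlook": "sunny"}, {"outlook": "rain"}]
print(fill_most_common(data, "outlook"))  # the missing entry becomes "sunny"
```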

## Other Issues

- noisy labels
  - e.g. two identical instances with conflicting labels
- attributes with different costs
  - change the information gain criterion so that low-cost attributes are preferred (one common variant is noted below)
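
As a hedged illustration (one common variant, not necessarily the one intended here): replace the split criterion \(\mathrm{Gain}(S, a)\) with \(\mathrm{Gain}(S, a)^2 / \mathrm{Cost}(a)\), so that among attributes with similar gain the cheaper ones are preferred.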