# Regressions¶

Given training data $$\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1..m}$$, find a linear approximation to the ys.

I.e., find a vector $$\mathbf{w}$$ (or $$\theta$$) of weights/parameters such that $$\theta \cdot \mathbf{x} \approx y$$.

Topics:

• Least squares as maximum likelihood
• Finding ML weights
• Bias-variance error decomposition
• Basis functions for normalizing features
• 1-norm and 2-norm regularization
• Bayesian linear regression

## Least Squares¶

Example:

Data:

| $$x_0$$ | $$x_1$$ | $$y$$ |
|---------|---------|-------|
| 1       | 2       | 5     |
| 1       | 3       | 7     |
| 1       | 4       | 9     |


Note the add-a-dimension trick of adding $$\mathbf{x_0} = \vec{1}$$.

By eye, $$\theta = (1, 2)$$ has zero error ($$\theta = \mathbf{w}$$)

We can represent the data matrix $$X$$ as a stack of feature vectors, one example per row:

$\begin{split}X = \begin{bmatrix} 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} = \begin{bmatrix} \mathbf{x}^{(1)^T} \\ \mathbf{x}^{(2)^T} \\ \mathbf{x}^{(3)^T} \end{bmatrix}\end{split}$

and the label vector $$\mathbf{y}$$ / $$\vec{y}$$:

$\begin{split}\vec{y} = \begin{bmatrix} 5 \\ 7 \\ 9 \end{bmatrix}\end{split}$

Note

To simplify our notation, $$\mathbf{x}^{(i)} \cdot \theta = \mathbf{x}^{(i)T} \theta$$.

### Criterion¶

$J(\theta) = \frac{1}{2} \sum_{i=1}^3 (\mathbf{x}^{(i)} \cdot \theta - y^{(i)})^2$

Note

$\begin{split}X \theta - \vec{y} = \begin{bmatrix} \mathbf{x}^{(1)} \cdot \theta \\ \mathbf{x}^{(2)} \cdot \theta \\ \mathbf{x}^{(3)} \cdot \theta \end{bmatrix} - \vec{y} = \begin{bmatrix} \mathbf{x}^{(1)} \cdot \theta - y^{(1)} \\ \mathbf{x}^{(2)} \cdot \theta - y^{(2)} \\ \mathbf{x}^{(3)} \cdot \theta - y^{(3)} \end{bmatrix}\end{split}$

so we can simplify this with matrix operations.

$J(\theta) = \frac{1}{2} (X \theta - \vec{y})^T (X \theta - \vec{y})$
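As a quick numeric check, here is a minimal NumPy sketch of this criterion on the toy data above (the function name `J` just mirrors the math):

```python
import numpy as np

# Toy data from above; x0 = 1 is the added bias dimension.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

def J(theta):
    """Least-squares criterion J(theta) = 1/2 * ||X theta - y||^2."""
    r = X @ theta - y        # residual vector, one entry per example
    return 0.5 * (r @ r)

print(J(np.array([1.0, 2.0])))  # 0.0 -- the zero-error weights found by eye
print(J(np.array([1.0, 1.0])))  # 14.5 = (4 + 9 + 16) / 2
```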

### SGD¶

Consider SGD: take the example where $$\mathbf{x}^{(1)} = (1, 2)$$, $$y^{(1)} = 5$$, and $$\theta$$ is initialized to $$(1, 1)$$.

The squared error on this example is $$((1 * \theta_0) + (2 * \theta_1) - 5)^2 = 4$$, and its contribution to $$J(\theta)$$ is 2 (the squared error divided by 2).

At $$\theta = (1, 1)$$:

$\begin{split}& \frac{\partial}{\partial \theta_0} \frac{1}{2} ((1 * \theta_0) + (2 * \theta_1) - 5)^2 \\ & = \frac{2}{2} ((1 * \theta_0) + (2 * \theta_1) - 5) * 1 \\ & = -2\end{split}$

and similarly,

$\begin{split}& \frac{\partial J}{\partial \theta_1} = \frac{2}{2} ((1 * \theta_0) + (2 * \theta_1) - 5) * 2 \\ & = -4\end{split}$

With step size $$\frac{1}{20}$$, update $$\theta' = \theta - \frac{1}{20} \nabla_\theta J$$,

so $$\theta' = (1, 1) - (-0.1, -0.2) = (1.1, 1.2)$$. If we keep doing this, eventually $$\theta$$ becomes $$(1, 2)$$.
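A minimal sketch of that SGD loop in NumPy, assuming we cycle through the three examples in order with the same fixed step size:

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

theta = np.array([1.0, 1.0])   # initialization from the example
eta = 1.0 / 20                 # step size

for epoch in range(200):
    for x_i, y_i in zip(X, y):
        grad = (x_i @ theta - y_i) * x_i   # gradient of (1/2)(x . theta - y)^2
        theta = theta - eta * grad         # first update: (1, 1) -> (1.1, 1.2)

print(theta)  # approaches [1. 2.]
```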

### Closed Form¶

However, there’s a way to find the minimizing $$\theta$$ without running SGD: the closed-form (normal equations) solution

$\theta = (X^T X)^{-1} X^T \vec{y}$

Example

\begin{align}\begin{aligned}\begin{split}X^T X & = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} = \begin{bmatrix} 3 & 9 \\ 9 & 29 \end{bmatrix} \\\end{split}\\\begin{split}(X^T X)^{-1} & = \begin{bmatrix} 29/6 & -9/6 \\ -9/6 & 3/6 \end{bmatrix} \\\end{split}\\\begin{split}(X^T X)^{-1} X^T \vec{y} & = \begin{bmatrix} 29/6 & -9/6 \\ -9/6 & 3/6 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 5 \\ 7 \\ 9 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\end{split}\end{aligned}\end{align}
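The same answer in NumPy; solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly:

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

# Solve (X^T X) theta = X^T y instead of computing the inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1. 2.]
```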

## Maximum Likelihood¶

This is a more general solution concept. Under some assumptions, it lets us derive the least-squares criterion:

• learn a function f in a given class of functions (not necessarily linear)
• have m examples $$\{(\mathbf{x}^{(i)}, y^{(i)})\}$$ where $$y^{(i)} = f(\mathbf{x}^{(i)}) + \epsilon^{(i)}$$
• where $$\epsilon^{(i)}$$ is some noise
• assume x’s are fixed, concentrate on y’s like a discriminative model
• assume the $$\epsilon^{(i)}$$ s are iid draws from some mean-0 Gaussian distribution with variance $$\sigma^2$$
• then, the probability of getting $$y^{(i)}$$ for $$\mathbf{x}^{(i)}$$ with f is:
$\begin{split}p(y^{(i)} | \mathbf{x}^{(i)}, f) & = p(\epsilon^{(i)} = y^{(i)} - f(\mathbf{x}^{(i)})) \\ & = \frac{1}{\sqrt{2 \pi} \sigma} \exp\left( -\frac{(y^{(i)} - f(\mathbf{x}^{(i)}))^2}{2 \sigma^2} \right)\end{split}$

with this, we can compute the likelihood of all the datapoints over the entire set:

$L(f) = P(\text{all labels } | f \text{, all } \mathbf{x}) = \prod_{i=1}^m p(y^{(i)} | \mathbf{x}^{(i)}, f)$

taking logs turns the product into a sum:

$\begin{split}\ln L(f) & = \sum_{i=1}^m \ln p(y^{(i)} | \mathbf{x}^{(i)}, f) \\ & = m \ln \left(\frac{1}{\sqrt{2 \pi} \sigma}\right) - \frac{1}{2 \sigma^2} \sum_{i=1}^m (y^{(i)} - f(\mathbf{x}^{(i)}))^2\end{split}$

where the second line plugs in the Gaussian form of $$p$$ from above.

So to maximize the likelihood of f, we’re just minimizing the squared error!

We know that the squared error is convex - we want to find the bottom of the bowl, where the gradient is 0.

Note

For review: minimizing $$f(x) = (x - 5)^2$$

• $$f'(x) = 2(x - 5)$$
• $$f'(x) = 0$$
• $$x = 5$$ is the minimizer

So, we can derive the closed form from the ML formula:

$\begin{split}\nabla_\mathbf{w} \ln L(f) & = \nabla_\mathbf{w} \left( m \ln \left(\frac{1}{\sqrt{2 \pi} \sigma}\right) - \frac{1}{2 \sigma^2} \sum_{i=1}^m (y^{(i)} - \mathbf{w} \cdot \mathbf{x}^{(i)})^2 \right) \\ & = \frac{1}{\sigma^2} \sum_{i=1}^m (y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}) \mathbf{x}^{(i)T} \\ \mathbf{0}^T & = \sum_{i=1}^m y^{(i)} \mathbf{x}^{(i)T} - \mathbf{w}^T \sum_{i=1}^m \mathbf{x}^{(i)} \mathbf{x}^{(i)T} \\ ... & \text{ some matrix magic...} \\ \mathbf{w}_{ML} & = (X^T X)^{-1} X^T \vec{y}\end{split}$

Note

Some of the matrix magic:

$\sum_{i=1}^m y^{(i)} \mathbf{x}^{(i)} = X^T \vec{y} \qquad \text{and} \qquad \sum_{i=1}^m \mathbf{x}^{(i)} \mathbf{x}^{(i)T} = X^T X$

We can also use SGD to learn instead of finding the minimum directly:

Cycle through samples, taking a step in the negative gradient direction for each example.

$\begin{split}\omega_{new} & = \omega_{old} - \eta \nabla Error(\mathbf{x}^{(i)}, y^{(i)}) \\ & = \omega_{old} - \eta \nabla \frac{1}{2} (y^{(i)} - \omega_{old} \cdot \mathbf{x}^{(i)})^2 \\ & = \omega_{old} + \eta (y^{(i)} - \omega_{old} \cdot \mathbf{x}^{(i)}) \mathbf{x}^{(i)}\end{split}$

In this formula, $$\eta$$ is the learning rate. This is the LMS algorithm.
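As a sketch, the update on the last line is one line of NumPy (the helper name `lms_step` is just for illustration):

```python
import numpy as np

def lms_step(w, x, y, eta):
    """One LMS update: w <- w + eta * (y - w . x) * x."""
    return w + eta * (y - w @ x) * x
```

Cycling `lms_step` over the examples reproduces the SGD run from the least-squares section.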

## Bias-Variance Decomposition¶

Suppose we have training instances $$\mathbf{x}^{(1)}..\mathbf{x}^{(m)}$$ and a true target function $$f$$.

Let the labels $$y^{(i)}$$ in the sample be $$f(\mathbf{x}^{(i)}) + \epsilon^{(i)}$$, where the $$\epsilon^{(i)}$$ are iid noise terms.

Assume $$E[\epsilon^{(i)}] = 0$$, so $$E[y^{(i)}] = f(\mathbf{x}^{(i)})$$.

Let’s examine the expected squared error between the regression function $$g$$ learned from the sample and the “true” function $$f$$ at a particular test point $$\mathbf{x}$$:

$E_{noise} [(g(\mathbf{x}) - f(\mathbf{x}))^2]$

where $$E_{noise}$$ is the expectation over training label noise.

Let $$\bar{g}(\mathbf{x})$$ be $$E_{noise}[g(\mathbf{x})]$$. We can rewrite $$E_{noise}[(g(\mathbf{x}) - f(\mathbf{x}))^2]$$ as

$\begin{split}& = E_{noise} [(g(\mathbf{x}) - \bar{g} + \bar{g} - f(\mathbf{x}))^2] \\ & = E_{noise} [(g(\mathbf{x}) - \bar{g})^2 + (\bar{g} - f(\mathbf{x}))^2 + 2(g(\mathbf{x}) - \bar{g})(\bar{g} - f(\mathbf{x}))] \\ & = E_{noise} [(g(\mathbf{x}) - \bar{g})^2] + E_{noise} [(\bar{g} - f(\mathbf{x}))^2] + 2E_{noise} [(g(\mathbf{x}) - \bar{g})(\bar{g} - f(\mathbf{x}))] \\ & = E_{noise} [(g(\mathbf{x}) - \bar{g})^2] + (\bar{g} - f(\mathbf{x}))^2 + 2E_{noise} [(g(\mathbf{x}) - \bar{g})(\bar{g} - f(\mathbf{x}))] \\ & = \text{variance of } g(\mathbf{x}) + (\text{bias of } g(\mathbf{x}))^2 + 2E_{noise} [(g(\mathbf{x}) - \bar{g})(\bar{g} - f(\mathbf{x}))]\end{split}$

Since $$f(\mathbf{x})$$ and $$\bar{g}$$ are constant with respect to noise,

$\begin{split}E_{noise} [(g(\mathbf{x}) - \bar{g})(\bar{g} - f(\mathbf{x}))] & = (\bar{g} - f(\mathbf{x})) E_{noise}[(g(\mathbf{x}) - \bar{g})] \\ & = (\bar{g} - f(\mathbf{x})) (E_{noise}[g(\mathbf{x})] - E_{noise}[\bar{g}]) \\ & = (\bar{g} - f(\mathbf{x})) (\bar{g} - \bar{g}) \\ & = 0.\end{split}$

ergo,

$E_{noise} [(g(\mathbf{x}) - f(\mathbf{x}))^2] = \text{variance of } g(\mathbf{x}) + (\text{bias of } g(\mathbf{x}))^2$
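A Monte Carlo sanity check of this identity; the target function, straight-line fit, and noise level below are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.sin                              # assumed "true" target function
xs = np.linspace(0.0, 3.0, 10)          # fixed training inputs
x_test, sigma, trials = 1.5, 0.3, 20000

A = np.column_stack([np.ones_like(xs), xs])  # design matrix for a line fit
preds = np.empty(trials)
for t in range(trials):
    y = f(xs) + sigma * rng.standard_normal(xs.shape)  # fresh label noise
    w, *_ = np.linalg.lstsq(A, y, rcond=None)          # g = least-squares line
    preds[t] = w[0] + w[1] * x_test                    # g(x_test)

mse = np.mean((preds - f(x_test)) ** 2)   # E[(g(x) - f(x))^2] over noise
variance = np.var(preds)                  # variance of g(x)
bias_sq = (preds.mean() - f(x_test)) ** 2 # (bias of g(x))^2
print(mse, variance + bias_sq)            # agree up to Monte Carlo error
```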

## Regularized Least Squares¶

• Regularization penalizes complexity and reduces variance (but increases bias)
• it adds a penalty term, weighted by a regularization coefficient $$\lambda$$, to the squared error
$\frac{1}{2} \sum_{i=1}^m (y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)})^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}$

which is minimized by

$\mathbf{w} = (\lambda \mathbf{I} + X^T X)^{-1} X^T \vec{y}$
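A minimal sketch of this solution (the helper name `ridge_fit` is hypothetical):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularized least squares: w = (lam * I + X^T X)^{-1} X^T y."""
    return np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
print(ridge_fit(X, y, 0.0))  # [1. 2.] -- plain least squares
print(ridge_fit(X, y, 1.0))  # the penalized solution (smaller norm)
```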

More generally, we can use a different regularizer:

$\frac{1}{2} \sum_{i=1}^m (y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)})^2 + \frac{\lambda}{2} \sum_j |w_j|^q$

Here $$q = 2$$ recovers the quadratic penalty above, and $$q = 1$$ gives the 1-norm (lasso) penalty.

## Logistic Regression¶

Logistic Regression learns a parameter vector $$\mathbf{\theta}$$ (or weights $$\mathbf{w}$$)

Idea behind 2-class logistic regression:

• two labels, so $$y \in \{0, 1\}$$ (binary classification)
• discriminative model gives $$p(y = 1 | \mathbf{x}; \mathbf{\theta})$$
• labels softly separated by hyperplane
• maximum confusion at hyperplane: when $$\theta^T \mathbf{x} = 0$$ then $$p(y=1| \mathbf{x}; \theta) = 1/2$$
• use the add-a-dimension trick to shift the hyperplane off $$\mathbf{0}$$
• assume $$p(y = 1| \mathbf{x}; \theta)$$ is some $$g(\theta \cdot \mathbf{x})$$

Properties of $$g(\theta \cdot \mathbf{x}) = g(\theta^T \mathbf{x}) = p(y = 1| \mathbf{x}; \theta)$$:

• $$g(-\infty) = 0$$
• $$g(\infty) = 1$$
• $$g(0) = 1/2$$
• confidence of the label increases as you move away from the boundary: $$g(\theta^T \mathbf{x})$$ is monotonically increasing
• $$g(-a) = 1 - g(a)$$ (symmetry)

For example, one function that satisfies these properties is the logistic sigmoid.

$h_\theta(\mathbf{x}) = g(\theta \cdot \mathbf{x}) = \frac{1}{1 + \exp(-\theta \cdot \mathbf{x})} = \frac{\exp(\theta \cdot \mathbf{x})}{1 + \exp(\theta \cdot \mathbf{x})}$

And $$g()$$ has a simple derivative: $$g'(a) = g(a)(1-g(a))$$.
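A small numeric check of these properties and of the derivative identity:

```python
import numpy as np

def g(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

a = 0.7
print(g(0.0))                                 # 0.5
print(g(-a) + g(a))                           # 1.0  (symmetry: g(-a) = 1 - g(a))
numeric = (g(a + 1e-6) - g(a - 1e-6)) / 2e-6  # centered finite difference
print(numeric, g(a) * (1.0 - g(a)))           # the two values agree
```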

### Likelihood¶

$\begin{split}L(\theta) & = p(\vec{y} | X; \theta) \\ & = \prod_{i = 1}^m p(y ^{(i)} | \mathbf{x} ^{(i)}; \theta) \\ & = \prod_{i = 1}^m p(y = 1 | \mathbf{x} ^{(i)}; \theta) ^{y ^{(i)}} p(y = 0 | \mathbf{x} ^{(i)}; \theta) ^{1 - y ^{(i)}} \\ & \text{^ this encodes the if-test on y} \\ & = \prod_{i=1}^m h_\theta (\mathbf{x} ^{(i)}) ^{y ^{(i)}} (1 - h_\theta (\mathbf{x} ^{(i)})) ^{1-y ^{(i)}}\end{split}$

this gives a convex optimization problem (the negative log-likelihood is convex)!

As before, log-likelihood is easier:

$\begin{split}l(\theta) & = \log(L(\theta)) \\ & = \sum_{i=1}^m y ^{(i)} \log (h_\theta (\mathbf{x} ^{(i)})) + (1-y ^{(i)}) \log (1 - h_\theta (\mathbf{x} ^{(i)}))\end{split}$

Taking the derivative for one sample with respect to one component $$\theta_j$$,

$\frac{\partial}{\partial \theta_j} l(\theta) = (y ^{(i)} - h_\theta (\mathbf{x} ^{(i)})) x_j ^{(i)}$

so we can plug that, generalized to all dimensions, into SGD:

$\theta := \theta + \alpha (y ^{(i)} - h_\theta (\mathbf{x} ^{(i)})) \mathbf{x} ^{(i)}$

Looks pretty similar to LMS, but with $$h_\theta(\mathbf{x})$$ in place of $$\theta^T \mathbf{x}$$.
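A minimal sketch of this update rule; the toy data and step size below are assumptions for illustration, not from the notes:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up 1-D data with the add-a-dimension trick (x0 = 1).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
alpha = 0.5
for epoch in range(1000):
    for x_i, y_i in zip(X, y):
        h = sigmoid(theta @ x_i)                 # h_theta(x)
        theta = theta + alpha * (y_i - h) * x_i  # gradient *ascent* on l(theta)

print(sigmoid(X @ theta))  # close to [0, 0, 1, 1]
```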

### Multiclass¶

We can extend logistic regression to multiple classes:

• learn weights $$\theta_k$$ for each class $$k \in \{1..K\}$$
• class-k-ness of instance $$\mathbf{x}$$ is estimated by $$\theta_k \cdot \mathbf{x}$$
• estimate $$p(Class = k | \mathbf{x}; \theta_1 .. \theta_K)$$ for instance $$\mathbf{x}$$ using the softmax function (see the sketch after this list):
$h_k(\mathbf{x}; \theta_1 .. \theta_K) = \frac{\exp(\theta_k \cdot \mathbf{x})}{\sum_{r=1}^K \exp(\theta_r \cdot \mathbf{x})}$
• want weights that maximize likelihood of the sample
• use one-of-K encoding for labels
• make each label $$\mathbf{y} ^{(i)}$$ a K-vector with $$y ^{(i)}_k = 1$$ iff class = k
• likelihood of $$m$$ labels in sample is:
$\begin{split}L(\theta_1 .. \theta_K) & = p(\mathbf{y} ^{(1)} .. \mathbf{y} ^{(m)} | X; \theta_1 .. \theta_K) \\ & = \prod_{i=1}^m p(\mathbf{y} ^{(i)} | \mathbf{x} ^{(i)}; \theta_1 .. \theta_K) \\ & = \prod_{i=1}^m \prod_{k=1}^K h_k(\mathbf{x} ^{(i)}; \theta_1 .. \theta_K) ^{y_k ^{(i)}}\end{split}$
• the log likelihood is then maximized with iterative methods
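A minimal sketch of the softmax estimate (subtracting the max score is a standard numerical-stability trick, not part of the formula; the weights below are made up):

```python
import numpy as np

def softmax_probs(thetas, x):
    """h_k(x) = exp(theta_k . x) / sum_r exp(theta_r . x), for each class k."""
    scores = thetas @ x             # theta_k . x, one score per class
    scores = scores - scores.max()  # stability trick; cancels in the ratio
    e = np.exp(scores)
    return e / e.sum()

# Made-up weights for K = 3 classes over 2 features.
thetas = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
p = softmax_probs(thetas, np.array([2.0, 0.5]))
print(p, p.sum())  # probabilities for the 3 classes; sums to 1
```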

Note

The weight vector $$\theta_K$$ for the last class is actually redundant, since $$p(class = K | \mathbf{x}) = 1 - \sum_{k=1}^{K-1} p(class = k | \mathbf{x})$$.