Regressions¶
Given training data \(\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1..m}\), find a linear approximation to the \(y\)'s.
I.e., find a vector \(\mathbf{w}\) (or \(\theta\)) of weights/parameters such that \(\theta \cdot \mathbf{x} \approx y\).
Topics:
- Least Squares as Maximum Likelihood
- Finding ML weights
- bias-variance error decomposition
- basis functions for nonlinear features
- 1-norm, 2-norm regularization
- Bayesian linear regression
Least Squares¶
Example:
Data
x0 x1 | y
------+--
1 2 | 5
1 3 | 7
1 4 | 9
Note the add-a-dimension trick of adding \(\mathbf{x_0} = \vec{1}\).
By eye, \(\theta = (1, 2)\) has zero error (\(\theta = \mathbf{w}\)).
We can represent the data matrix \(X\) as an array of feature vectors:
\[X = \begin{bmatrix} 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}\]
and the label vector \(\mathbf{y}\) / \(\vec{y}\):
\[\mathbf{y} = \begin{bmatrix} 5 \\ 7 \\ 9 \end{bmatrix}\]
Note
To simplify our notation, \(\mathbf{x}^{(i)} \cdot \theta = {\mathbf{x}^{(i)}}^T \theta\).
Criterion¶
The least-squares criterion is
\[J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (\theta \cdot \mathbf{x}^{(i)} - y^{(i)})^2\]
Note
The per-example errors stack into the residual vector \(X\theta - \mathbf{y}\), so we can simplify this with matrix operations:
\[J(\theta) = \frac{1}{2} \|X\theta - \mathbf{y}\|^2\]
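A minimal NumPy sketch, using the example data above, that evaluates \(J(\theta)\) in both the sum and matrix forms:

```python
import numpy as np

# Example data with the add-a-dimension trick (x0 = 1).
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

def J(theta):
    """Least-squares criterion, sum-over-examples form."""
    return 0.5 * sum((x_i @ theta - y_i) ** 2 for x_i, y_i in zip(X, y))

theta = np.array([1.0, 2.0])
print(J(theta))                            # 0.0 -- zero error, as found by eye
print(0.5 * np.sum((X @ theta - y) ** 2))  # matrix form gives the same value
```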
SGD¶
Consider SGD: take the example where \(\mathbf{x}^{(1)} = (1, 2)\), \(y^{(1)} = 5\), and \(\theta\) is initialized to \((1, 1)\).
The squared error on this example is \(((1 \cdot \theta_0) + (2 \cdot \theta_1) - 5)^2 = (3 - 5)^2 = 4\), and its contribution to \(J(\theta)\) is 2 (the squared error divided by 2).
At \(\theta = (1, 1)\):
\[\frac{\partial J}{\partial \theta_0} = (\theta \cdot \mathbf{x}^{(1)} - y^{(1)}) \, x^{(1)}_0 = (3 - 5) \cdot 1 = -2\]
and similarly,
\[\frac{\partial J}{\partial \theta_1} = (\theta \cdot \mathbf{x}^{(1)} - y^{(1)}) \, x^{(1)}_1 = (3 - 5) \cdot 2 = -4\]
With step size \(\frac{1}{20}\), update \(\theta' = \theta - \frac{1}{20} \nabla_\theta J\),
so \(\theta' = (1, 1) - (-0.1, -0.2) = (1.1, 1.2)\). If we keep doing this, cycling through the examples, eventually \(\theta\) becomes \((1, 2)\).
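A minimal NumPy sketch of these updates (step size and initialization taken from the worked example):

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
theta = np.array([1.0, 1.0])   # initial guess from the example
eta = 1.0 / 20                 # step size from the example

# One step on the first example reproduces the numbers above.
grad = (X[0] @ theta - y[0]) * X[0]   # (3 - 5) * (1, 2) = (-2, -4)
print(theta - eta * grad)             # [1.1 1.2]

# Cycling through all the examples converges to theta = (1, 2).
for _ in range(1000):
    for x_i, y_i in zip(X, y):
        theta -= eta * (x_i @ theta - y_i) * x_i
print(theta.round(4))                 # [1. 2.]
```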
Closed Form¶
However, there’s a way to minimize \(J(\theta)\) without having to do SGD: the formula is
\[\theta = (X^T X)^{-1} X^T \mathbf{y}\]
Example
For the data above, \(X^T X = \begin{bmatrix} 3 & 9 \\ 9 & 29 \end{bmatrix}\) and \(X^T \mathbf{y} = \begin{bmatrix} 21 \\ 67 \end{bmatrix}\); solving \(X^T X \theta = X^T \mathbf{y}\) gives \(\theta = (1, 2)\).
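In NumPy this is a one-liner (a sketch on the example data; `np.linalg.lstsq` is the numerically safer route):

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

# Solve the normal equations X^T X theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # [1. 2.]

# Equivalent, but better conditioned when X^T X is near-singular.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
```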
Maximum Likelihood¶
This is a more generic solution concept. Making some assumptions, we can derive the least-squared criterion:
- learn a function f in a given class of functions (not necessarily linear)
- have m examples \(\{(\mathbf{x}^{(i)}, y^{(i)})\}\) where \(y^{(i)} = f(\mathbf{x}^{(i)}) + \epsilon^{(i)}\)
- where \(\epsilon^{(i)}\) is some noise
- assume x’s are fixed, concentrate on y’s like a discriminative model
- assume \(\epsilon^{(i)}\) s are iid draws from some mean 0 Gaussian distribution
- then, the probability of getting \(y^{(i)}\) for \(\mathbf{x}^{(i)}\) with \(f\) is:
\[p(y^{(i)} | \mathbf{x}^{(i)}; f) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - f(\mathbf{x}^{(i)}))^2}{2\sigma^2}\right)\]
With this, we can check the overall likelihood of the datapoints over the entire set:
\[L(f) = \prod_{i=1}^{m} p(y^{(i)} | \mathbf{x}^{(i)}; f)\]
Using some log properties (the log of a product is the sum of the logs, and \(\log\) is monotonic, so maximizing \(\log L\) also maximizes \(L\)), we can work with the log-likelihood instead:
\[\log L(f) = \sum_{i=1}^{m} \log p(y^{(i)} | \mathbf{x}^{(i)}; f) = -m \log(\sqrt{2\pi}\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y^{(i)} - f(\mathbf{x}^{(i)}))^2\]
The first term and the factor \(\frac{1}{2\sigma^2}\) don't depend on \(f\), so to maximize the likelihood of \(f\), we're just minimizing the squared error!
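A small numerical sketch of this equivalence (the grid of candidate weights and \(\sigma = 1\) are illustrative choices): the \(\theta\) that minimizes squared error is exactly the one that maximizes the Gaussian log-likelihood.

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
sigma = 1.0  # assumed noise standard deviation

# Candidate weight vectors: theta_0 fixed at 1, theta_1 on a grid.
thetas = [np.array([1.0, t]) for t in np.linspace(0.0, 4.0, 81)]

sq_err = [np.sum((X @ th - y) ** 2) for th in thetas]
log_lik = [np.sum(-np.log(np.sqrt(2 * np.pi) * sigma)
                  - (y - X @ th) ** 2 / (2 * sigma ** 2)) for th in thetas]

# Same winner under both criteria.
print(np.argmin(sq_err) == np.argmax(log_lik))   # True
```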
Convexity and Gradients¶
We know that the squared error is convex - we want to find the bottom of the bowl, where the gradient is 0.
Note
For review: minimizing \(f(x) = (x - 5)^2\)
- \(f'(x) = 2(x - 5)\)
- \(f'(x) = 0\)
- \(x = 5\) is the minimizer
So, we can derive the closed form from the ML formula: setting the gradient of the squared error to zero,
\[\nabla_\theta J(\theta) = X^T (X\theta - \mathbf{y}) = \mathbf{0}
\quad\Longrightarrow\quad
X^T X \theta = X^T \mathbf{y}
\quad\Longrightarrow\quad
\theta = (X^T X)^{-1} X^T \mathbf{y}\]
Note
Some of the matrix magic: expand \(J(\theta) = \frac{1}{2}(X\theta - \mathbf{y})^T (X\theta - \mathbf{y})\) and use \(\nabla_\theta(\theta^T A \theta) = 2A\theta\) (for symmetric \(A\)) and \(\nabla_\theta(\mathbf{b}^T \theta) = \mathbf{b}\) to get the gradient above.
We can also use SGD to learn \(\theta\) instead of finding the minimum directly:
cycle through the samples, taking a step in the negative gradient direction for each example:
\[\theta \leftarrow \theta + \eta \, (y^{(i)} - \theta \cdot \mathbf{x}^{(i)}) \, \mathbf{x}^{(i)}\]
In this formula, \(\eta\) is the learning rate. This is the LMS algorithm.
Bias-Variance Decomposition¶
Suppose we have training instances \(\mathbf{x}^{(1)}..\mathbf{x}^{(m)}\) and true target function \(f\)
Let the labels \(y^{(i)}\) in the sample be \(f(\mathbf{x}^{(i)}) + \epsilon^{(i)}\), where the \(\epsilon^{(i)}\) are iid noise.
Assume \(E[\epsilon^{(i)}] = 0\), so \(E[y^{(i)}] = f(\mathbf{x}^{(i)})\).
Let’s examine the expected squared error between regression function \(g\) learned from sample and “true” function \(f\) at a particular test point \(\mathbf{x}\)
where \(E_{noise}\) is the expectation over training label noise.
Let \(\bar{g}(\mathbf{x})\) be \(E_{noise}[g(\mathbf{x})]\). We can rewrite \(E_{noise}[(g(\mathbf{x}) - f(\mathbf{x}))^2]\) as
\[E_{noise}\left[\left((g(\mathbf{x}) - \bar{g}(\mathbf{x})) + (\bar{g}(\mathbf{x}) - f(\mathbf{x}))\right)^2\right]\]
\[= E_{noise}[(g(\mathbf{x}) - \bar{g}(\mathbf{x}))^2] + 2 E_{noise}[(g(\mathbf{x}) - \bar{g}(\mathbf{x}))(\bar{g}(\mathbf{x}) - f(\mathbf{x}))] + (\bar{g}(\mathbf{x}) - f(\mathbf{x}))^2\]
Since \(f(\mathbf{x})\) and \(\bar{g}(\mathbf{x})\) are constant with respect to the noise, the middle term is \(2(\bar{g}(\mathbf{x}) - f(\mathbf{x})) \, E_{noise}[g(\mathbf{x}) - \bar{g}(\mathbf{x})] = 0\);
ergo,
\[E_{noise}[(g(\mathbf{x}) - f(\mathbf{x}))^2] = \underbrace{E_{noise}[(g(\mathbf{x}) - \bar{g}(\mathbf{x}))^2]}_{\text{variance}} + \underbrace{(\bar{g}(\mathbf{x}) - f(\mathbf{x}))^2}_{\text{bias}^2}\]
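A small Monte Carlo sketch of the decomposition (the true function, noise level, and test point here are illustrative assumptions): refit least squares under many redraws of the label noise, then compare the empirical MSE at a test point to bias² plus variance.

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: 1 + 2 * x                 # "true" target function (assumed)
xs = np.array([2.0, 3.0, 4.0])          # fixed training inputs
X = np.column_stack([np.ones_like(xs), xs])
x_test, f_test = np.array([1.0, 3.5]), f(3.5)   # test point

preds = []
for _ in range(5000):
    y = f(xs) + rng.normal(0.0, 1.0, size=xs.shape)   # fresh label noise
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    preds.append(x_test @ theta)                      # g(x) for this sample
preds = np.array(preds)

g_bar = preds.mean()                          # E_noise[g(x)]
variance = preds.var()
bias_sq = (g_bar - f_test) ** 2
print(((preds - f_test) ** 2).mean())         # MSE ...
print(bias_sq + variance)                     # ... equals bias^2 + variance
```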
Regularized Least Squares¶
- Regularization penalizes complexity and reduces variance (but increases bias)
- adds a penalty term, weighted by \(\lambda\) (the regularization coefficient), to the squared error:
\[J(\theta) = \frac{1}{2} \|X\theta - \mathbf{y}\|^2 + \frac{\lambda}{2} \|\theta\|^2\]
which is minimized by
\[\theta = (X^T X + \lambda I)^{-1} X^T \mathbf{y}\]
More generally, we can use a different regularizer:
\[\frac{\lambda}{2} \sum_{j} |\theta_j|^q\]
where \(q = 2\) gives the 2-norm (ridge) penalty above and \(q = 1\) gives the 1-norm (lasso) penalty.
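Since the 2-norm (ridge) solution just adds \(\lambda I\) to the normal equations, it's a one-line change (a sketch reusing the example data; \(\lambda = 0.1\) is an arbitrary choice):

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
lam = 0.1   # regularization coefficient lambda (arbitrary here)

# Ridge regression: the penalty shrinks the weights toward 0,
# trading a little bias for lower variance.
theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(theta)   # slightly shrunk relative to the unregularized (1, 2)
```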
Logistic Regression¶
Logistic Regression learns a parameter vector \(\mathbf{\theta}\) (or weights \(\mathbf{w}\))
Idea behind 2-class logistic regression:
- two labels, so \(y \in \{0, 1\}\) (binary classification)
- discriminative model gives \(p(y = 1 | \mathbf{x}; \mathbf{\theta})\)
- labels softly separated by hyperplane
- maximum confusion at hyperplane: when \(\theta^T \mathbf{x} = 0\) then \(p(y=1| \mathbf{x}; \theta) = 1/2\)
- use the add-a-dimension trick to shift the hyperplane off \(\mathbf{0}\)
- assume \(p(y = 1| \mathbf{x}; \theta)\) is some \(g(\theta \cdot \mathbf{x})\)
Properties of \(g(\theta \cdot \mathbf{x}) = g(\theta^T \mathbf{x}) = p(y = 1| \mathbf{x}; \theta)\):
- \(g(-\infty) = 0\)
- \(g(\infty) = 1\)
- \(g(0) = 1/2\)
- confidence of label increases as you move away from the boundary - \(g(\theta^T \mathbf{x})\) is monotonically increasing
- \(g(-a) = 1 - g(a)\) (symmetry)
For example, one function that satisfies these properties is the logistic sigmoid:
\[g(a) = \frac{1}{1 + e^{-a}}\]
And \(g()\) has a simple derivative: \(g'(a) = g(a)(1-g(a))\).
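A quick numerical check of the sigmoid and its derivative identity (the evaluation points are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)   # finite differences
analytic = sigmoid(a) * (1 - sigmoid(a))                # g'(a) = g(a)(1 - g(a))
print(np.allclose(numeric, analytic))                   # True
```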
Likelihood¶
Writing \(h_\theta(\mathbf{x}) = g(\theta \cdot \mathbf{x})\), the likelihood of the sample is
\[L(\theta) = \prod_{i=1}^{m} h_\theta(\mathbf{x}^{(i)})^{y^{(i)}} \left(1 - h_\theta(\mathbf{x}^{(i)})\right)^{1 - y^{(i)}}\]
and this is a convex problem (the negative log-likelihood is convex)!
As before, log-likelihood is easier:
\[\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(\mathbf{x}^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(\mathbf{x}^{(i)})\right) \right]\]
Taking the derivatives for one sample on one dimension of \(\theta\) (using \(g'(a) = g(a)(1 - g(a))\)),
\[\frac{\partial \ell}{\partial \theta_j} = \left(y^{(i)} - h_\theta(\mathbf{x}^{(i)})\right) x^{(i)}_j\]
so we can then plug that, generalized to all dimensions, into SGD:
\[\theta \leftarrow \theta + \eta \left(y^{(i)} - h_\theta(\mathbf{x}^{(i)})\right) \mathbf{x}^{(i)}\]
Looks pretty similar to LMS, but \(h_\theta()\) replaces \(\theta^T \mathbf{x}\).
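A minimal sketch of this update on a made-up one-feature binary dataset (the data, step size, and epoch count are all illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up separable data, with the add-a-dimension trick.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
eta = 0.1
for _ in range(1000):
    for x_i, y_i in zip(X, y):
        # Same shape as the LMS update, with h(x) = g(theta . x).
        theta += eta * (y_i - sigmoid(x_i @ theta)) * x_i
print(sigmoid(X @ theta))   # predicted probabilities move toward the labels
```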
Multiclass¶
We can extend logistic regression to multiple classes:
- learn weights \(\theta_k\) for each class \(k \in \{1..K\}\)
- class-k-ness of instance \(\mathbf{x}\) is estimated by \(\theta_k \cdot \mathbf{x}\)
- estimate \(p(Class = k | \mathbf{x}; \theta_1 .. \theta_K)\) for instance \(\mathbf{x}\) using the softmax fcn (see the sketch at the end of this section):
\[p(Class = k | \mathbf{x}) = \frac{e^{\theta_k \cdot \mathbf{x}}}{\sum_{j=1}^{K} e^{\theta_j \cdot \mathbf{x}}}\]
- want weights that maximize likelihood of the sample
- use one-of-K encoding for labels
- make each label \(\mathbf{y} ^{(i)}\) a K-vector with \(y ^{(i)}_k = 1\) iff class = k
- likelihood of the \(m\) labels in the sample is:
\[\prod_{i=1}^{m} \prod_{k=1}^{K} p(Class = k | \mathbf{x}^{(i)})^{y^{(i)}_k}\]
- iterative methods maximize log likelihood
Note
The weight vector \(\theta_K\) is actually redundant, since \(p(class = K | \mathbf{x}) = 1 - \sum_{k=1}^{K-1} p(class = k | \mathbf{x})\).
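A minimal sketch of the softmax estimate, with made-up weight vectors \(\theta_k\):

```python
import numpy as np

def softmax(scores):
    # Subtracting the max is for numerical stability; it cancels out.
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Rows are hypothetical theta_k for K = 3 classes.
Theta = np.array([[1.0, 0.5], [0.0, -1.0], [-0.5, 0.2]])
x = np.array([1.0, 2.0])    # instance (add-a-dimension trick)
p = softmax(Theta @ x)      # class-k-ness scores -> probabilities
print(p, p.sum())           # the probabilities sum to 1
```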