PCA¶
PCA = Principal Component Analysis
The goal of PCA is to reduce redudndancy (collapse redundant features), or to encode examples using a new set of features
A technique to project a high-dimensional data matrix into a lower dimension.
- First normalize old features to mean \(\bar{x} = 0\), variance 1 unit
- Uses a linear transformation:
- new features are projections, \(u \cdot x\) for unit length u
- projecting onto vectors \(u_1.. u_k\) gives k new features - break each x into a sum of k vectors \(\mu_1 .. \mu_k\)
- low spread in certain directions implies redundancy
- find projection directions that preserve as much variance as possible
- e.g. multiple different 3s can be viewed as a prototype rotated and shifted by latent variables
- digits lie in noisy low-dimensional subspace of all images
First View: Preserve Variance¶
- project onto u, but preserve variation
- length of \(x\) in direction \(u\) is \(u \cdot x\) when \(u\) has length 1
- maximize variance of projected instance lengths
- variance of projections onto \(u\) over N training examples is:
Where \(\mathbf{S}\) is the data covariance matrix \(\mathbf{X}^T \mathbf{X}\). Note that the term \(u^T \bar{x}\) is 0 if the data is centered.
- Constrain \(u^Tu = 1\), add lagrange multiplier (penalizes violation of constraint), and maximize:
- \(\max u^T Su+\lambda (1-u^Tu)\)
- take matrix derivatives wrt \(u\) and set equal to 0:
- \(Su=\lambda u\)
- so \(u\) is an eigenvector of \(S\).
Note
Remember if \(M\mathbf{v} = \lambda \mathbf{v}\) then:
- \(\lambda\) is an eigenvalue of matrix M
- \(\mathbf{v}\) is an eigenvector
- if \(\mathbf{v}\) is an eigenvector, then so is \(c\mathbf{v}\) (with same eigenvalue)
- usually take length-1 eigenvectors
- \(d*d\) matrices have at most \(d\) orthogonal eigenvectors
Since we want to maximize variance \(\mathbf{u}^T \mathbf{Su}\), and \(\mathbf{u}\) is an eigenvector,
so the first principal component is the eigenvector with largest eigenvalue! The second is the next, and so on
- eigenvector with largest eigenvalue is first principle component
- find other principle components iteratively by considering directions perpendicular to previous ones
- first k principle components will be first k eigenvectors of \(S\)
- k eigenvectors can be found in \(O(kD^2)\) time for a \(D*D\) matrix
Second View: Data Compression¶
Let’s start over and look at this new view.
- goal: find a compressed set of \(k < D\) features that approximates data
- use set of \(k\) projections onto orthogonal unit vectors \(\mathbf{u}_1 .. \mathbf{u}_k\), which form an ortho-normal basis for the subspace projected onto
- projection of an \(\mathbf{x}_n\) is \((\alpha_{n, 1}, \alpha_{n, 2}, ..., \alpha_{n, k})\) in the u-coordinates in the original space
- \(\tilde{\mathbf{x}} = \sum_{i=1}^k \alpha_{n,i}\mathbf{u}_i\)
We want to minimize average projection distance
- for 1-dimensional case, projection \(\tilde{\mathbf{x}} = (\mathbf{x}^T_n \mathbf{u})\mathbf{u}\)
- goal: \(\min_{\mathbf{u}} J = \sum_{n=1}^N ||\mathbf{x}_n -\tilde{\mathbf{x}}||^2\)
- \(J\) is roughly a measure of how much information is lost during this compression
Assuming centered data and \(\mathbf{u}\) unit length:
So minimizing J wrt u means maximizing \(\sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \mathbf{x}^T_n \mathbf{u} = N \mathbf{u}^T \mathbf{Su}\).
And this is the same objective function as earlier!
Uses¶
- compression: approximate data with fewer features
- visualization: compress to 2 or 3 dimensions
- preprocess features to remove redundancy - improve speed, simplify hypothesis
- possible noise reduction
- plot eigenvalues - gives idea of dimensionality data lies in