PCA

PCA = Principal Component Analysis

The goal of PCA is to reduce redundancy (collapse redundant features), or to encode examples using a new set of features

A technique for projecting a high-dimensional data matrix into a lower-dimensional space.

  • First normalize the original features to mean \(\bar{x} = 0\) and unit variance (see the sketch after this list)
  • Uses a linear transformation:
    • new features are projections, \(u \cdot x\) for unit length u
    • projecting onto vectors \(\mathbf{u}_1, \ldots, \mathbf{u}_k\) gives \(k\) new features - each \(x\) decomposes into a sum of \(k\) component vectors along \(\mathbf{u}_1, \ldots, \mathbf{u}_k\)
  • low spread in certain directions implies redundancy
  • find projection directions that preserve as much variance as possible
  • e.g. multiple different 3s can be viewed as a prototype rotated and shifted by latent variables
    • digits lie in noisy low-dimensional subspace of all images
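
As a concrete sketch of the first two steps (normalize, then project onto a unit vector), assuming numpy and a made-up data matrix:

```python
# Minimal sketch: standardize features, then project onto a unit-length direction.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # hypothetical data: N=100 examples, D=5 features

# normalize each feature to mean 0, variance 1
X = (X - X.mean(axis=0)) / X.std(axis=0)

# project every example onto a unit-length direction u
u = rng.normal(size=5)
u /= np.linalg.norm(u)                   # make u unit length
new_feature = X @ u                      # one new feature per example: u . x_n
print(new_feature.shape)                 # (100,)
```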

First View: Preserve Variance

  • project onto u, but preserve variation
  • length of \(x\) in direction \(u\) is \(u \cdot x\) when \(u\) has length 1
  • maximize variance of projected instance lengths
  • variance of projections onto \(u\) over N training examples is:
\[\begin{split}& \frac{1}{N} \sum_{n=1}^N (\mathbf{u}^T \mathbf{x}_n - \mathbf{u}^T \mathbf{\bar{x}})^2 \\ = & \frac{1}{N} \sum_{n=1}^N (\mathbf{u}^T \mathbf{x}_n)^2 \\ = & \frac{1}{N} \sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \cdot \mathbf{x}_n^T \mathbf{u} \\ = & \mathbf{u}^T (\frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \mathbf{x}_n^T) \mathbf{u} \\ = & \mathbf{u}^T \mathbf{S} \mathbf{u}\end{split}\]

Where \(\mathbf{S} = \frac{1}{N} \mathbf{X}^T \mathbf{X}\) is the data covariance matrix (for centered \(\mathbf{X}\)). Note that the term \(\mathbf{u}^T \bar{\mathbf{x}}\) is 0 because the data is centered.
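
This identity is easy to check numerically; a minimal sketch with toy data and an arbitrary unit vector \(u\):

```python
# Sketch: the variance of the projected lengths equals u^T S u (data centered).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)                   # center the data so u^T x_bar = 0

S = (X.T @ X) / len(X)                   # covariance matrix S = (1/N) X^T X
u = rng.normal(size=3)
u /= np.linalg.norm(u)                   # unit-length direction

proj_variance = np.mean((X @ u) ** 2)    # (1/N) sum_n (u^T x_n)^2
print(np.isclose(proj_variance, u @ S @ u))  # True
```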

  • Constrain \(u^Tu = 1\), add a Lagrange multiplier (penalizes violation of the constraint), and maximize:
    • \(\max u^T Su+\lambda (1-u^Tu)\)
  • take matrix derivatives wrt \(u\) and set equal to 0:
    • \(Su=\lambda u\)
    • so \(u\) is an eigenvector of \(S\) (verified numerically in the sketch below).
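
A quick numerical check of this result, using `np.linalg.eigh` on the covariance of some made-up correlated data:

```python
# Sketch: the maximizer of u^T S u over unit vectors is the top eigenvector of S,
# and it satisfies S u = lambda u.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated toy data
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
u = eigvecs[:, -1]                       # eigenvector with the largest eigenvalue
lam = eigvals[-1]

print(np.allclose(S @ u, lam * u))       # the eigenvector equation S u = lambda u holds
print(np.isclose(u @ S @ u, lam))        # achieved variance equals the eigenvalue
```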

Note

Remember if \(M\mathbf{v} = \lambda \mathbf{v}\) then:

  • \(\lambda\) is an eigenvalue of matrix M
  • \(\mathbf{v}\) is an eigenvector
  • if \(\mathbf{v}\) is an eigenvector, then so is \(c\mathbf{v}\) (with same eigenvalue)
  • usually take length-1 eigenvectors
  • \(d \times d\) matrices have at most \(d\) orthogonal eigenvectors (checked in the snippet below)
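
These facts are easy to verify on a small symmetric matrix, e.g.:

```python
# Quick check of the eigenvector facts above on a 2x2 symmetric matrix.
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, V = np.linalg.eigh(M)               # length-1 eigenvectors in the columns of V
v = V[:, 0]

print(np.allclose(M @ v, lam[0] * v))               # M v = lambda v
print(np.allclose(M @ (3 * v), lam[0] * (3 * v)))   # c v is also an eigenvector
print(np.allclose(V.T @ V, np.eye(2)))              # the d eigenvectors are orthonormal
```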

Since we want to maximize variance \(\mathbf{u}^T \mathbf{Su}\), and \(\mathbf{u}\) is an eigenvector,

\[\mathbf{u}^T \mathbf{Su} = \mathbf{u}^T \lambda \mathbf{u} = \lambda \mathbf{u}^T \mathbf{u} = \lambda\]

so the first principal component is the eigenvector with the largest eigenvalue! The second is the eigenvector with the next-largest eigenvalue, and so on.

  • eigenvector with the largest eigenvalue is the first principal component
  • find the other principal components iteratively by considering directions perpendicular to the previous ones
  • the first k principal components are the first k eigenvectors of \(S\) (see the sketch after this list)
  • k eigenvectors can be found in \(O(kD^2)\) time for a \(D \times D\) matrix
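
Putting the first view together, a minimal sketch of PCA via the eigendecomposition of \(S\) (the helper name `pca_components` and the toy data are just illustrative):

```python
# Sketch: the first k principal components are the k eigenvectors of S with the
# largest eigenvalues.
import numpy as np

def pca_components(X, k):
    Xc = X - X.mean(axis=0)                    # center the data
    S = (Xc.T @ Xc) / len(Xc)                  # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    return eigvecs[:, order], eigvals[order]   # (D, k) components, (k,) variances

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
U, variances = pca_components(X, k=3)
Z = (X - X.mean(axis=0)) @ U                   # k new features per example
print(U.shape, Z.shape)                        # (10, 3) (300, 3)
```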

Second View: Data Compression

Let’s start over and look at this new view.

  • goal: find a compressed set of \(k < D\) features that approximates data
  • use set of \(k\) projections onto orthogonal unit vectors \(\mathbf{u}_1 .. \mathbf{u}_k\), which form an ortho-normal basis for the subspace projected onto
  • the projection of \(\mathbf{x}_n\) has coordinates \((\alpha_{n, 1}, \alpha_{n, 2}, ..., \alpha_{n, k})\) in the \(\mathbf{u}\)-coordinates; in the original space it is
    • \(\tilde{\mathbf{x}}_n = \sum_{i=1}^k \alpha_{n,i}\mathbf{u}_i\)

We want to minimize average projection distance

  • for the 1-dimensional case, the projection is \(\tilde{\mathbf{x}}_n = (\mathbf{x}^T_n \mathbf{u})\mathbf{u}\)
  • goal: \(\min_{\mathbf{u}} J = \sum_{n=1}^N ||\mathbf{x}_n -\tilde{\mathbf{x}}_n||^2\)
  • \(J\) is roughly a measure of how much information is lost during this compression

Assuming centered data and \(\mathbf{u}\) unit length:

\[\begin{split}J & = \sum_{n=1}^N ||\mathbf{x}_n -\tilde{\mathbf{x}}_n||^2 \\ & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - 2 \mathbf{x}_n^T (\mathbf{x}^T_n \mathbf{u})\mathbf{u} + (\mathbf{x}^T_n \mathbf{u})^2 \mathbf{u}^T \mathbf{u} \\ & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - 2 (\mathbf{u}^T \mathbf{x}_n ) \mathbf{x}_n^T \mathbf{u} + (\mathbf{u}^T \mathbf{x}_n) (\mathbf{x}^T_n \mathbf{u}) \\ & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - \sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \mathbf{x}^T_n \mathbf{u}\end{split}\]

So minimizing J wrt u means maximizing \(\sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \mathbf{x}^T_n \mathbf{u} = N \mathbf{u}^T \mathbf{Su}\).

And this is the same objective function as earlier!
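
A sketch of the compression view: reconstruct from the top \(k\) eigenvectors and measure the lost information \(J\); as \(k\) grows toward \(D\), \(J\) shrinks toward 0 (toy data, for illustration only):

```python
# Sketch: project onto the top-k components, reconstruct, and measure the total
# squared projection distance J.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))   # correlated toy data
Xc = X - X.mean(axis=0)
S = (Xc.T @ Xc) / len(Xc)
eigvals, eigvecs = np.linalg.eigh(S)

for k in (1, 3, 10):
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k eigenvectors
    X_tilde = (Xc @ U) @ U.T                        # project, then reconstruct
    J = np.sum((Xc - X_tilde) ** 2)                 # information lost by compression
    print(k, round(J, 2))                           # J shrinks as k grows; ~0 at k = D
```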

Uses

  • compression: approximate data with fewer features
  • visualization: compress to 2 or 3 dimensions
  • preprocess features to remove redundancy - improve speed, simplify hypothesis
  • possible noise reduction
  • plot the eigenvalues - gives an idea of the intrinsic dimensionality the data lies in (see the sketch below)
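
For the last point, a sketch of such an eigenvalue (scree) plot, assuming matplotlib; the toy data is built to secretly live in a 3-dimensional subspace, so the eigenvalues drop sharply after the third:

```python
# Sketch: plot the covariance eigenvalues to estimate intrinsic dimensionality.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
latent = rng.normal(size=(500, 3))                  # data secretly lives in 3 dims
X = latent @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(500, 20))
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh((Xc.T @ Xc) / len(Xc))[::-1]   # descending order

plt.plot(eigvals, marker="o")
plt.xlabel("component index")
plt.ylabel("eigenvalue (variance explained)")
plt.show()                                          # sharp drop after the 3rd eigenvalue
```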