PCA

PCA = Principal Component Analysis

The goal of PCA is to reduce redundancy (collapse redundant features), or to encode examples using a new set of features

A technique for projecting a high-dimensional data matrix into a lower-dimensional space.

  • First normalize the original features to mean \(\bar{x} = 0\) and unit variance (see the sketch after this list)
  • Uses a linear transformation:
    • new features are projections, \(u \cdot x\) for unit length u
    • projecting onto vectors \(\mathbf{u}_1, \ldots, \mathbf{u}_k\) gives \(k\) new features - each \(x\) decomposes into a sum of \(k\) component vectors along \(\mathbf{u}_1, \ldots, \mathbf{u}_k\)
  • low spread in certain directions implies redundancy
  • find projection directions that preserve as much variance as possible
  • e.g. multiple different 3s can be viewed as a prototype rotated and shifted by latent variables
    • digits lie in noisy low-dimensional subspace of all images
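
As a concrete sketch of the first two steps (normalize, then project onto a unit vector), assuming numpy and a made-up data matrix:

```python
# Minimal sketch: standardize features, then project onto a unit-length direction.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # hypothetical data: N=100 examples, D=5 features

# normalize each feature to mean 0, variance 1
X = (X - X.mean(axis=0)) / X.std(axis=0)

# project every example onto a unit-length direction u
u = rng.normal(size=5)
u /= np.linalg.norm(u)                   # make u unit length
new_feature = X @ u                      # one new feature per example: u . x_n
print(new_feature.shape)                 # (100,)
```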

First View: Preserve Variance

  • project onto u, but preserve variation
  • length of \(x\) in direction \(u\) is \(u \cdot x\) when \(u\) has length 1
  • maximize variance of projected instance lengths
  • variance of projections onto \(u\) over N training examples is:
\[\begin{split}& \frac{1}{N} \sum_{n=1}^N (\mathbf{u}^T \mathbf{x}_n - \mathbf{u}^T \mathbf{\bar{x}})^2 \\ = & \frac{1}{N} \sum_{n=1}^N (\mathbf{u}^T \mathbf{x}_n)^2 \\ = & \frac{1}{N} \sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \cdot \mathbf{x}_n^T \mathbf{u} \\ = & \mathbf{u}^T (\frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \mathbf{x}_n^T) \mathbf{u} \\ = & \mathbf{u}^T \mathbf{S} \mathbf{u}\end{split}\]

Where \(\mathbf{S} = \frac{1}{N} \mathbf{X}^T \mathbf{X}\) is the data covariance matrix (for centered \(\mathbf{X}\)). Note that the term \(\mathbf{u}^T \bar{\mathbf{x}}\) is 0 because the data is centered.
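
This identity is easy to check numerically; a minimal sketch with toy data and an arbitrary unit vector \(u\):

```python
# Sketch: the variance of the projected lengths equals u^T S u (data centered).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)                   # center the data so u^T x_bar = 0

S = (X.T @ X) / len(X)                   # covariance matrix S = (1/N) X^T X
u = rng.normal(size=3)
u /= np.linalg.norm(u)                   # unit-length direction

proj_variance = np.mean((X @ u) ** 2)    # (1/N) sum_n (u^T x_n)^2
print(np.isclose(proj_variance, u @ S @ u))  # True
```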

  • Constrain \(u^Tu = 1\), add a Lagrange multiplier (penalizes violation of the constraint), and maximize:
    • \(\max u^T Su+\lambda (1-u^Tu)\)
  • take matrix derivatives wrt \(u\) and set equal to 0:
    • \(Su=\lambda u\)
    • so \(u\) is an eigenvector of \(S\) (verified numerically in the sketch below).
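
A quick numerical check of this result, using `np.linalg.eigh` on the covariance of some made-up correlated data:

```python
# Sketch: the maximizer of u^T S u over unit vectors is the top eigenvector of S,
# and it satisfies S u = lambda u.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated toy data
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
u = eigvecs[:, -1]                       # eigenvector with the largest eigenvalue
lam = eigvals[-1]

print(np.allclose(S @ u, lam * u))       # the eigenvector equation S u = lambda u holds
print(np.isclose(u @ S @ u, lam))        # achieved variance equals the eigenvalue
```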

Note

Remember if \(M\mathbf{v} = \lambda \mathbf{v}\) then:

  • \(\lambda\) is an eigenvalue of matrix M
  • \(\mathbf{v}\) is an eigenvector
  • if \(\mathbf{v}\) is an eigenvector, then so is \(c\mathbf{v}\) (with same eigenvalue)
  • usually take length-1 eigenvectors
  • \(d \times d\) matrices have at most \(d\) orthogonal eigenvectors (checked in the snippet below)
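
These facts are easy to verify on a small symmetric matrix, e.g.:

```python
# Quick check of the eigenvector facts above on a 2x2 symmetric matrix.
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, V = np.linalg.eigh(M)               # length-1 eigenvectors in the columns of V
v = V[:, 0]

print(np.allclose(M @ v, lam[0] * v))               # M v = lambda v
print(np.allclose(M @ (3 * v), lam[0] * (3 * v)))   # c v is also an eigenvector
print(np.allclose(V.T @ V, np.eye(2)))              # the d eigenvectors are orthonormal
```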

Since we want to maximize variance \(\mathbf{u}^T \mathbf{Su}\), and \(\mathbf{u}\) is an eigenvector,

\[\mathbf{u}^T \mathbf{Su} = \mathbf{u}^T \lambda \mathbf{u} = \lambda \mathbf{u}^T \mathbf{u} = \lambda\]

so the first principal component is the eigenvector with the largest eigenvalue! The second is the eigenvector with the next-largest eigenvalue, and so on.

  • eigenvector with the largest eigenvalue is the first principal component
  • find the other principal components iteratively by considering directions perpendicular to the previous ones
  • the first k principal components are the first k eigenvectors of \(S\) (see the sketch after this list)
  • k eigenvectors can be found in \(O(kD^2)\) time for a \(D \times D\) matrix
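
Putting the first view together, a minimal sketch of PCA via the eigendecomposition of \(S\) (the helper name `pca_components` and the toy data are just illustrative):

```python
# Sketch: the first k principal components are the k eigenvectors of S with the
# largest eigenvalues.
import numpy as np

def pca_components(X, k):
    Xc = X - X.mean(axis=0)                    # center the data
    S = (Xc.T @ Xc) / len(Xc)                  # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    return eigvecs[:, order], eigvals[order]   # (D, k) components, (k,) variances

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
U, variances = pca_components(X, k=3)
Z = (X - X.mean(axis=0)) @ U                   # k new features per example
print(U.shape, Z.shape)                        # (10, 3) (300, 3)
```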

Second View: Data Compression

Let’s start over and look at this new view.

  • goal: find a compressed set of \(k < D\) features that approximates data
  • use set of \(k\) projections onto orthogonal unit vectors \(\mathbf{u}_1 .. \mathbf{u}_k\), which form an ortho-normal basis for the subspace projected onto
  • the projection of \(\mathbf{x}_n\) has coordinates \((\alpha_{n, 1}, \alpha_{n, 2}, ..., \alpha_{n, k})\) in the \(\mathbf{u}\)-coordinates; in the original space it is
    • \(\tilde{\mathbf{x}}_n = \sum_{i=1}^k \alpha_{n,i}\mathbf{u}_i\)

We want to minimize average projection distance

  • for the 1-dimensional case, the projection is \(\tilde{\mathbf{x}}_n = (\mathbf{x}^T_n \mathbf{u})\mathbf{u}\)
  • goal: \(\min_{\mathbf{u}} J = \sum_{n=1}^N ||\mathbf{x}_n -\tilde{\mathbf{x}}_n||^2\)
  • \(J\) is roughly a measure of how much information is lost during this compression

Assuming centered data and \(\mathbf{u}\) unit length:

\[\begin{split}J & = \sum_{n=1}^N ||\mathbf{x}_n -\tilde{\mathbf{x}}_n||^2 \\ & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - 2 \mathbf{x}_n^T (\mathbf{x}^T_n \mathbf{u})\mathbf{u} + (\mathbf{x}^T_n \mathbf{u})^2 \mathbf{u}^T \mathbf{u} \\ & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - 2 (\mathbf{u}^T \mathbf{x}_n ) \mathbf{x}_n^T \mathbf{u} + (\mathbf{u}^T \mathbf{x}_n) (\mathbf{x}^T_n \mathbf{u}) \\ & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - \sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \mathbf{x}^T_n \mathbf{u}\end{split}\]

So minimizing J wrt u means maximizing \(\sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \mathbf{x}^T_n \mathbf{u} = N \mathbf{u}^T \mathbf{Su}\).

And this is the same objective function as earlier!
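
A sketch of the compression view: reconstruct from the top \(k\) eigenvectors and measure the lost information \(J\); as \(k\) grows toward \(D\), \(J\) shrinks toward 0 (toy data, for illustration only):

```python
# Sketch: project onto the top-k components, reconstruct, and measure the total
# squared projection distance J.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))   # correlated toy data
Xc = X - X.mean(axis=0)
S = (Xc.T @ Xc) / len(Xc)
eigvals, eigvecs = np.linalg.eigh(S)

for k in (1, 3, 10):
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k eigenvectors
    X_tilde = (Xc @ U) @ U.T                        # project, then reconstruct
    J = np.sum((Xc - X_tilde) ** 2)                 # information lost by compression
    print(k, round(J, 2))                           # J shrinks as k grows; ~0 at k = D
```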

Uses

  • compression: approximate data with fewer features
  • visualization: compress to 2 or 3 dimensions
  • preprocess features to remove redundancy - improve speed, simplify hypothesis
  • possible noise reduction
  • plot the eigenvalues - gives an idea of the intrinsic dimensionality the data lies in (see the sketch below)
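
For the last point, a sketch of such an eigenvalue (scree) plot, assuming matplotlib; the toy data is built to secretly live in a 3-dimensional subspace, so the eigenvalues drop sharply after the third:

```python
# Sketch: plot the covariance eigenvalues to estimate intrinsic dimensionality.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
latent = rng.normal(size=(500, 3))                  # data secretly lives in 3 dims
X = latent @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(500, 20))
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh((Xc.T @ Xc) / len(Xc))[::-1]   # descending order

plt.plot(eigvals, marker="o")
plt.xlabel("component index")
plt.ylabel("eigenvalue (variance explained)")
plt.show()                                          # sharp drop after the 3rd eigenvalue
```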