# PCA

PCA = Principal Component Analysis

The goal of PCA is to reduce redundancy (collapse redundant features), or to encode examples using a new set of features.

A technique to project a high-dimensional data matrix into a lower-dimensional space.

• First normalize the old features to mean $$\bar{x} = 0$$ and unit variance
• Uses a linear transformation:
• new features are projections $$u \cdot x$$ onto unit-length vectors $$u$$ (see the sketch after this list)
• projecting onto vectors $$u_1 .. u_k$$ gives k new features - break each $$x$$ into a sum of its k component vectors along $$u_1 .. u_k$$
• low spread in certain directions implies redundancy
• find projection directions that preserve as much variance as possible
• e.g. multiple different 3s can be viewed as a prototype rotated and shifted by latent variables
• digits lie in noisy low-dimensional subspace of all images
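
As a rough illustration of the normalization and projection steps above, here is a minimal numpy sketch on synthetic data (the shapes and variable names are just for illustration):

```python
import numpy as np

# synthetic data: N examples, D original features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# normalize each feature to zero mean and unit variance
X = (X - X.mean(axis=0)) / X.std(axis=0)

# an arbitrary unit-length direction u; the new feature is the projection u . x
u = np.ones(5) / np.sqrt(5)
new_feature = X @ u   # shape (100,): one projected value per example
```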

## First View: Preserve Variance

• project onto u, but preserve variation
• length of $$x$$ in direction $$u$$ is $$u \cdot x$$ when $$u$$ has length 1
• maximize variance of projected instance lengths
• variance of projections onto $$u$$ over N training examples is:
$\begin{split}& \frac{1}{N} \sum_{n=1}^N (\mathbf{u}^T \mathbf{x}_n - \mathbf{u}^T \mathbf{\bar{x}})^2 \\ = & \frac{1}{N} \sum_{n=1}^N (\mathbf{u}^T \mathbf{x}_n)^2 \\ = & \frac{1}{N} \sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \cdot \mathbf{x}_n^T \mathbf{u} \\ = & \mathbf{u}^T (\frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \mathbf{x}_n^T) \mathbf{u} \\ = & \mathbf{u}^T \mathbf{S} \mathbf{u}\end{split}$

Here $$\mathbf{S}$$ is the data covariance matrix $$\frac{1}{N}\mathbf{X}^T \mathbf{X}$$. Note that the term $$\mathbf{u}^T \mathbf{\bar{x}}$$ is 0 if the data is centered.
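
As a quick sanity check of this identity (a hypothetical numpy example, not from the notes), the variance of the projected lengths really does equal $$\mathbf{u}^T \mathbf{S} \mathbf{u}$$ for centered data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
X = X - X.mean(axis=0)            # center the data

S = (X.T @ X) / len(X)            # covariance matrix S = (1/N) X^T X

u = np.array([1.0, 2.0, -1.0])
u = u / np.linalg.norm(u)         # unit-length direction

proj = X @ u                      # projected lengths u . x_n
print(np.allclose(proj.var(), u @ S @ u))   # True: variance of projections = u^T S u
```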

• Constrain $$u^Tu = 1$$, add a Lagrange multiplier (which penalizes violation of the constraint), and maximize:
• $$\max u^T Su+\lambda (1-u^Tu)$$
• take the derivative with respect to $$u$$ and set it equal to 0: $$2Su - 2\lambda u = 0$$, so
• $$Su=\lambda u$$
• so $$u$$ is an eigenvector of $$S$$ (see the numeric check after this list).
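
A minimal numeric check of this conclusion (synthetic data, using numpy's `eigh` for the symmetric matrix $$\mathbf{S}$$):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

# eigh is for symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(S)
u, lam = eigvecs[:, -1], eigvals[-1]    # eigenvector with the largest eigenvalue

print(np.allclose(S @ u, lam * u))      # True: S u = lambda u
```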

Note

Remember if $$M\mathbf{v} = \lambda \mathbf{v}$$ then:

• $$\lambda$$ is an eigenvalue of matrix M
• $$\mathbf{v}$$ is an eigenvector
• if $$\mathbf{v}$$ is an eigenvector, then so is $$c\mathbf{v}$$ (with same eigenvalue)
• usually take length-1 eigenvectors
• a $$d \times d$$ matrix has at most $$d$$ orthogonal eigenvectors (a symmetric matrix like $$\mathbf{S}$$ has exactly $$d$$)

Since we want to maximize variance $$\mathbf{u}^T \mathbf{Su}$$, and $$\mathbf{u}$$ is an eigenvector,

$\mathbf{u}^T \mathbf{Su} = \mathbf{u}^T \lambda \mathbf{u} = \lambda \mathbf{u}^T \mathbf{u} = \lambda$

so the first principal component is the eigenvector with the largest eigenvalue! The second principal component is the eigenvector with the next largest eigenvalue, and so on.

• eigenvector with the largest eigenvalue is the first principal component
• find the other principal components iteratively by considering directions perpendicular to the previous ones
• the first k principal components are the k eigenvectors of $$S$$ with the largest eigenvalues (see the sketch after this list)
• k eigenvectors can be found in $$O(kD^2)$$ time for a $$D \times D$$ matrix
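
A compact sketch of the whole procedure (hypothetical helper name `pca_components`, not a library API): center the data, eigendecompose $$\mathbf{S}$$, and keep the k eigenvectors with the largest eigenvalues.

```python
import numpy as np

def pca_components(X, k):
    """Top-k principal components (columns of U) and the projected data Z."""
    Xc = X - X.mean(axis=0)                  # center the data
    S = (Xc.T @ Xc) / len(Xc)                # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    U = eigvecs[:, order]                    # D x k matrix of principal components
    Z = Xc @ U                               # N x k matrix of new features
    return U, Z

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
U, Z = pca_components(X, k=2)
```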

## Second View: Data Compression

Let’s start over and look at PCA from a second point of view.

• goal: find a compressed set of $$k < D$$ features that approximates data
• use set of $$k$$ projections onto orthogonal unit vectors $$\mathbf{u}_1 .. \mathbf{u}_k$$, which form an ortho-normal basis for the subspace projected onto
• the projection of an $$\mathbf{x}_n$$ has coordinates $$(\alpha_{n, 1}, \alpha_{n, 2}, ..., \alpha_{n, k})$$ in the u-basis, where $$\alpha_{n,i} = \mathbf{u}_i^T \mathbf{x}_n$$
• in the original space, the reconstruction is $$\tilde{\mathbf{x}} = \sum_{i=1}^k \alpha_{n,i}\mathbf{u}_i$$ (see the sketch after this list)
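
A sketch of the compression and reconstruction in code (assumed helper names, continuing the numpy examples above):

```python
import numpy as np

# U: D x k matrix whose columns are the orthonormal basis u_1 .. u_k
def compress(x, U):
    return U.T @ x        # the alphas: k coordinates of x in the u-basis

def reconstruct(alphas, U):
    return U @ alphas     # x_tilde = sum_i alpha_i * u_i

# toy check with a random orthonormal basis for a 2-D subspace of R^5
rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # columns are orthonormal
x = rng.normal(size=5)
x_tilde = reconstruct(compress(x, U), U)       # approximation of x in the subspace
```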

We want to minimize the total squared projection distance (how far each example is from its reconstruction).

• for 1-dimensional case, projection $$\tilde{\mathbf{x}} = (\mathbf{x}^T_n \mathbf{u})\mathbf{u}$$
• goal: $$\min_{\mathbf{u}} J = \sum_{n=1}^N ||\mathbf{x}_n -\tilde{\mathbf{x}}||^2$$
• $$J$$ is roughly a measure of how much information is lost during this compression

Assuming centered data and $$\mathbf{u}$$ unit length:

$\begin{split}J & = \sum_{n=1}^N ||\mathbf{x}_n -\tilde{\mathbf{x}}||^2 \\ & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - 2 \mathbf{x}_n^T (\mathbf{x}^T_n \mathbf{u})\mathbf{u} + (\mathbf{x}^T_n \mathbf{u})^2 \mathbf{u}^T \mathbf{u} \\ & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - 2 (\mathbf{u}^T \mathbf{x}_n ) \mathbf{x}_n^T \mathbf{u} + (\mathbf{u}^T \mathbf{x}_n) (\mathbf{x}^T_n \mathbf{u}) \\ & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - \sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \mathbf{x}^T_n \mathbf{u}\end{split}$

So minimizing J wrt u means maximizing $$\sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \mathbf{x}^T_n \mathbf{u} = N \mathbf{u}^T \mathbf{Su}$$.

And this is the same objective function as earlier!
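
A numeric way to see the equivalence (hypothetical check on synthetic data): $$J + N\,\mathbf{u}^T\mathbf{S}\mathbf{u} = \sum_n \|\mathbf{x}_n\|^2$$ is constant in $$\mathbf{u}$$, so minimizing one term maximizes the other.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 6))
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

u = rng.normal(size=6)
u = u / np.linalg.norm(u)                 # any unit-length direction

X_tilde = np.outer(X @ u, u)              # 1-D reconstructions (x_n^T u) u
J = np.sum((X - X_tilde) ** 2)            # total reconstruction error

# J + N * u^T S u equals the fixed quantity sum_n ||x_n||^2
print(np.allclose(J + len(X) * (u @ S @ u), np.sum(X ** 2)))   # True
```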

## Uses

• compression: approximate data with fewer features
• visualization: compress to 2 or 3 dimensions
• preprocess features to remove redundancy - improve speed, simplify hypothesis
• possible noise reduction
• plot the eigenvalues - gives an idea of the effective dimensionality the data lies in (see the sketch after this list)
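
A minimal scree-plot sketch for that last point (synthetic data; assumes matplotlib is available):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 20))
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

eigvals = np.linalg.eigvalsh(S)[::-1]     # eigenvalues of S, largest first

# a sharp drop suggests the data mostly lies in the first few directions
plt.plot(np.arange(1, len(eigvals) + 1), eigvals, marker="o")
plt.xlabel("component")
plt.ylabel("eigenvalue (variance in that direction)")
plt.show()
```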