PCA
===
PCA (Principal Component Analysis) is a technique for projecting a high-dimensional data matrix into a lower-dimensional space.
The goal of PCA is to reduce redundancy (collapse redundant features), or equivalently, to encode examples using a new,
smaller set of features.
- First normalize the old features to mean :math:`\bar{x} = 0` and unit variance
- Uses a linear transformation:
- new features are projections, :math:`u \cdot x` for unit length u
- projecting onto vectors :math:`\mathbf{u}_1 .. \mathbf{u}_k` gives :math:`k` new features; each :math:`\mathbf{x}` is broken into a sum of :math:`k` vectors along those directions
- low spread in certain directions implies redundancy
- find projection directions that preserve as much variance as possible
- e.g. many different handwritten 3s can be viewed as one prototype rotated and shifted by latent variables
- digits lie in noisy low-dimensional subspace of all images
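As a minimal sketch of the normalization step in NumPy (the data values here are made up purely for illustration):

```python
import numpy as np

# Toy data: 5 examples, 3 features (values made up for illustration).
X = np.array([[ 2.0, 10.0, 1.0],
              [ 4.0, 12.0, 0.0],
              [ 6.0, 14.0, 1.0],
              [ 8.0, 16.0, 0.0],
              [10.0, 18.0, 1.0]])

# Center each feature to mean 0 and rescale to unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each feature now has mean ~0
print(X_std.std(axis=0))   # and standard deviation 1
```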
First View: Preserve Variance
-----------------------------
- project onto :math:`u`, while preserving as much variation as possible
- length of :math:`x` in direction :math:`u` is :math:`u \cdot x` when :math:`u` has length 1
- maximize variance of projected instance lengths
- variance of projections onto :math:`u` over N training examples is:
.. math::

   & \frac{1}{N} \sum_{n=1}^N (\mathbf{u}^T \mathbf{x}_n - \mathbf{u}^T \mathbf{\bar{x}})^2 \\
   = & \frac{1}{N} \sum_{n=1}^N (\mathbf{u}^T \mathbf{x}_n)^2 \\
   = & \frac{1}{N} \sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \cdot \mathbf{x}_n^T \mathbf{u} \\
   = & \mathbf{u}^T \left(\frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \mathbf{x}_n^T\right) \mathbf{u} \\
   = & \mathbf{u}^T \mathbf{S} \mathbf{u}
Where :math:`\mathbf{S} = \frac{1}{N}\mathbf{X}^T \mathbf{X}` is the data covariance matrix (for centered data). Note that the term :math:`\mathbf{u}^T \bar{\mathbf{x}}` vanishes because the data is centered.
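The identity above can be checked numerically. A small NumPy sketch, using randomly generated data purely for illustration: the variance of the projected lengths equals the quadratic form :math:`\mathbf{u}^T \mathbf{S} \mathbf{u}` for any unit direction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)        # center, so the u . x-bar term vanishes

S = X.T @ X / len(X)          # covariance matrix (1/N) X^T X

u = rng.normal(size=4)
u = u / np.linalg.norm(u)     # an arbitrary unit-length direction

proj = X @ u                  # projected lengths u . x_n
print(proj.var(), u @ S @ u)  # the two quantities agree
```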
- Constrain :math:`u^Tu = 1`, add a Lagrange multiplier (which penalizes violation of the constraint), and maximize:
- :math:`\max_u u^T Su+\lambda (1-u^Tu)`
- take the matrix derivative with respect to :math:`u` and set it equal to 0:
- :math:`Su=\lambda u`
- so :math:`u` is an eigenvector of :math:`S`.
.. note::

   Remember, if :math:`M\mathbf{v} = \lambda \mathbf{v}` then:

   - :math:`\lambda` is an eigenvalue of the matrix :math:`M`
   - :math:`\mathbf{v}` is an eigenvector
   - if :math:`\mathbf{v}` is an eigenvector, then so is :math:`c\mathbf{v}` (with the same eigenvalue)
   - we usually take length-1 eigenvectors
   - a :math:`d \times d` matrix has at most :math:`d` orthogonal eigenvectors (exactly :math:`d` when the matrix is symmetric, as :math:`S` is)
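These facts are easy to verify with NumPy's eigendecomposition (the matrix here is an arbitrary symmetric example):

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])         # an arbitrary symmetric matrix

lams, V = np.linalg.eigh(M)        # eigenvalues ascending; unit eigenvectors as columns
lam, v = lams[-1], V[:, -1]        # largest eigenvalue and its eigenvector

print(M @ v, lam * v)              # M v = lambda v
print(M @ (3 * v), lam * (3 * v))  # a scaled eigenvector is still an eigenvector
print(np.linalg.norm(v))           # eigh returns length-1 eigenvectors
```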
Since we want to maximize variance :math:`\mathbf{u}^T \mathbf{Su}`, and :math:`\mathbf{u}` is an eigenvector,
.. math::

   \mathbf{u}^T \mathbf{Su} = \mathbf{u}^T \lambda \mathbf{u} = \lambda \mathbf{u}^T \mathbf{u} = \lambda
so the first principal component is the eigenvector with the largest eigenvalue! The second is the eigenvector with the next-largest, and so on.

- the eigenvector with the largest eigenvalue is the first principal component
- find the other principal components iteratively by considering only directions perpendicular to the previous ones
- the first :math:`k` principal components are the first :math:`k` eigenvectors of :math:`S`
- :math:`k` eigenvectors can be found in :math:`O(kD^2)` time for a :math:`D \times D` matrix
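Putting the first view together, a sketch of eigenvector-based PCA in NumPy (the synthetic data and the choice :math:`k = 2` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data lying near a 2-D subspace of R^5, plus a little noise.
Z = rng.normal(size=(200, 2))
W = rng.normal(size=(2, 5))
X = Z @ W + 0.01 * rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                     # center

S = X.T @ X / len(X)                       # covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # re-sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
U = eigvecs[:, :k]   # columns = first k principal components
scores = X @ U       # the k new features for each example
```

The variance of the first new feature equals the largest eigenvalue, matching the derivation above.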
Second View: Data Compression
-----------------------------
Let's start over and look at this new view.
- goal: find a compressed set of :math:`k < D` features that approximates data
- use set of :math:`k` projections onto orthogonal unit vectors :math:`\mathbf{u}_1 .. \mathbf{u}_k`, which form an ortho-normal basis for the subspace projected onto
- the projection of :math:`\mathbf{x}_n` has coordinates :math:`(\alpha_{n, 1}, \alpha_{n, 2}, ..., \alpha_{n, k})` in the :math:`\mathbf{u}`-basis, where :math:`\alpha_{n,i} = \mathbf{x}_n^T \mathbf{u}_i`
- in the original space, :math:`\tilde{\mathbf{x}}_n = \sum_{i=1}^k \alpha_{n,i}\mathbf{u}_i`
We want to minimize the average projection distance:

- for the 1-dimensional case, the projection is :math:`\tilde{\mathbf{x}}_n = (\mathbf{x}^T_n \mathbf{u})\mathbf{u}`
- goal: :math:`\min_{\mathbf{u}} J = \sum_{n=1}^N ||\mathbf{x}_n -\tilde{\mathbf{x}}_n||^2`
- :math:`J` is roughly a measure of how much information is lost during this compression
Assuming centered data and :math:`\mathbf{u}` unit length:
.. math::

   J & = \sum_{n=1}^N ||\mathbf{x}_n -\tilde{\mathbf{x}}_n||^2 \\
   & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - 2 \mathbf{x}_n^T (\mathbf{x}^T_n \mathbf{u})\mathbf{u} + (\mathbf{x}^T_n \mathbf{u})^2 \mathbf{u}^T \mathbf{u} \\
   & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - 2 (\mathbf{u}^T \mathbf{x}_n ) \mathbf{x}_n^T \mathbf{u} + (\mathbf{u}^T \mathbf{x}_n) (\mathbf{x}^T_n \mathbf{u}) \\
   & = \sum_{n=1}^N \mathbf{x}_n^T \mathbf{x}_n - \sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \mathbf{x}^T_n \mathbf{u}
So minimizing :math:`J` with respect to :math:`\mathbf{u}` means maximizing :math:`\sum_{n=1}^N \mathbf{u}^T \mathbf{x}_n \mathbf{x}^T_n \mathbf{u} = N \mathbf{u}^T \mathbf{Su}`,
which is exactly the same objective function as in the variance view!
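This equivalence can be checked numerically. A sketch with synthetic centered data: the top eigenvector of :math:`S` achieves a 1-D reconstruction error no worse than an arbitrary unit direction, and :math:`J` equals :math:`\sum_n \mathbf{x}_n^T \mathbf{x}_n - N \mathbf{u}^T \mathbf{Su}` as derived above.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic centered data with one clearly dominant direction.
X = rng.normal(size=(50, 3)) @ np.diag([3.0, 1.0, 0.2])
X = X - X.mean(axis=0)

S = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(S)
u_best = eigvecs[:, -1]                   # eigenvector with the largest eigenvalue

def recon_error(u):
    """J = sum_n ||x_n - (x_n . u) u||^2 for a unit-length u."""
    X_tilde = (X @ u)[:, None] * u        # project each example onto u
    return np.sum((X - X_tilde) ** 2)

u_rand = rng.normal(size=3)
u_rand = u_rand / np.linalg.norm(u_rand)  # arbitrary competing direction

print(recon_error(u_best), recon_error(u_rand))  # the first is no larger
```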
Uses
----
- compression: approximate the data with fewer features
- visualization: compress to 2 or 3 dimensions for plotting
- preprocessing: remove redundancy among features to improve speed and simplify the hypothesis
- possible noise reduction
- plotting the eigenvalues gives an idea of the intrinsic dimensionality the data lies in
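For the last point, a sketch of inspecting the eigenvalue spectrum, using synthetic data that (by construction, as an assumption for illustration) lies near a 2-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data lying near a 2-D subspace of R^10, plus small noise.
Z = rng.normal(size=(300, 2)) * np.array([5.0, 2.0])
W = rng.normal(size=(2, 10))
X = Z @ W + 0.1 * rng.normal(size=(300, 10))
X = X - X.mean(axis=0)

S = X.T @ X / len(X)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues, descending

ratio = eigvals / eigvals.sum()   # fraction of total variance per component
print(np.cumsum(ratio))           # flattens out after the first two components
```

The cumulative sum jumping to nearly 1 after two components reveals that the data is effectively 2-dimensional.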