Unsupervised Learning

Getting labeled data is expensive and time-consuming; let’s try to make sense of data without labels

Today: K-means clustering

  • Algorithm
  • Convergence
  • Initialization
  • Hyperparameter selection
    • how many clusters?

Different clustering algorithms can produce different clusterings of the same data (e.g. given multicolored squares and circles, should we split on shape or on color?)

Algorithm

In a clustering problem, each instance is just a data vector \(\mathbf{x}\) without a label.

Let’s use k-means clustering.

Pseudocode:

assign cluster centers randomly
while centers are changing:
    assign each example to the closest center (smallest squared distance)
    compute the new centers according to the assignments

Note that there are several possible stopping conditions: a fixed number of iterations, the centers no longer changing, or some other rule

 1  def k_means(D, K):
 2      for k = 1 to K:  # random init
 3          mu_k = some random location
 4
 5      while the mus are changing:
 6          for n = 1 to N:  # assign each example to the closest center
 7              z_n = argmin_k ||mu_k - x_n||^2
 8
 9          for k = 1 to K:  # compute new centers
10              X_k = {x_n | z_n = k}
11              mu_k = Mean(X_k)
12
13      return z
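The pseudocode maps almost directly onto NumPy. Here is a minimal, illustrative implementation (function and variable names are my own, and it assumes no cluster ever becomes empty):

import numpy as np

def kmeans(X, K):
    """Minimal k-means. X: (N, d) data array. Returns (assignments, centers)."""
    rng = np.random.default_rng()
    mu = X[rng.choice(len(X), size=K, replace=False)]   # init centers at K random examples
    z = np.full(len(X), -1)
    while True:
        # assign each example to the closest center (squared distance), as in line 7
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        z_new = dists.argmin(axis=1)
        if np.array_equal(z, z_new):                     # assignments stopped changing
            return z, mu
        z = z_new
        # recompute each center as the mean of its assigned examples, as in line 11
        mu = np.stack([X[z == k].mean(axis=0) for k in range(K)])     # assumes no empty cluster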

Convergence

K-means is guaranteed to converge - i.e. there is a loss function that it monotonically decreases, and that loss stops changing after a finite number of iterations.

So what does it optimize? The sum of the squared distance of points from their cluster means!

\[L(z,\mu; D) = \sum_n ||x_n - \mu_{z_n}||^2 = \sum_k \sum_{n:z_n=k} ||x_n - \mu_k||^2\]

where \(z\) is the assignments of each point and \(\mu\) is the location of each cluster center.
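For reference, this loss is a one-liner in NumPy (a sketch, with X, mu, z as in the k-means code above):

import numpy as np

def kmeans_loss(X, mu, z):
    """Sum of squared distances from each point to its assigned cluster center."""
    return float(((X - mu[z]) ** 2).sum())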

Lemma 1: Given a set of points, the single point m that minimizes the sum of squared distances from the points to m is their mean.
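This follows by setting the gradient of the (convex) objective to zero:

\[\frac{\partial}{\partial m} \sum_{n=1}^{N} ||x_n - m||^2 = -2 \sum_{n=1}^{N} (x_n - m) = 0 \quad\Rightarrow\quad m = \frac{1}{N} \sum_{n=1}^{N} x_n\]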

Theorem: For any dataset and any number of clusters, the K-means algorithm converges in a finite number of iterations.

How to prove:

  • focus on where in the algorithm the loss is decreasing
    • lines 7, 11
  • prove that loss can only decrease at those points
    • at line 7, \(||x_n - \mu_b||^2 \leq ||x_n - \mu_a||^2\), where \(\mu_a\) is the old assigned center and \(\mu_b\) is the new one - a point is only reassigned if the new center is at least as close
    • at line 11, by Lemma 1, the sum of squared distances within each cluster can only decrease, since the mean is the single point that minimizes that sum
  • prove that the loss cannot decrease forever
    • the loss is bounded below by 0
    • each \(\mu_k\) is always the mean of some subset of the data, and there are only finitely many such subsets (and finitely many possible assignments), so the loss can only take finitely many values and must stop decreasing after a finite number of iterations

This proves convergence.

But… what does it converge to?

  • it’s not guaranteed to converge to the “right answer”
  • not even guaranteed to converge to the same answer each time (depends on initialization)

Initialization

How do we find the best answer given that it depends on initialization?

  1. run the algorithm many times (with different random initializations) and keep the solution with the minimum loss
  2. pick the initialization points more carefully

Furthest-First

  • pick initial means as far from each other as possible
  • pick a random example and set it as the mean of the first cluster
  • for k = 2..K, pick the example whose distance to the closest already-picked mean is largest

Note: this initialization is sensitive to outliers, but this effect is usually reduced after the first few iterations
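A minimal NumPy sketch of this heuristic (names are my own; squared distance is used for the comparison, which gives the same argmax):

import numpy as np

def furthest_first_init(X, K):
    """Pick K initial means: one random example, then repeatedly the example
    furthest from its nearest already-chosen mean."""
    rng = np.random.default_rng()
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # squared distance from every example to its closest chosen center
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.stack(centers)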

Probabilistic Tweak

  • pick a random example and set it as the mean of the first cluster
  • for k = 2..K, \(\mu_k\) is sampled from points in the dataset s.t. the probability of choosing a point is proportional to its distance from the closest already-chosen mean
  • has better theoretical guarantees
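A sketch of this tweak (here I use the squared distance as the sampling weight, as in the k-means++ algorithm; treat that choice as an assumption):

import numpy as np

def plusplus_init(X, K):
    """Sample each new mean from the data, with probability proportional to its
    squared distance from the closest already-chosen mean (k-means++ style)."""
    rng = np.random.default_rng()
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.stack(centers)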

How to choose K

How do we choose K when there are an unknown number of clusters?

  • if you pick K with the smallest loss, K=n has loss = 0
  • “regularize” K via the BIC/AIC (Bayesian Information Criterion / Akaike Information Criterion)
    • increasing K decreases the first term (the training loss \(\hat{L}_K\)) but increases the second term (the penalty)

BIC: (where D is the # of training instances)

\[\arg \min_K \hat{L}_K + K \log D\]

AIC:

\[\arg \min_K \hat{L}_K + 2KD\]
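For illustration, a sketch of choosing K with the BIC-style rule above, reusing the kmeans and kmeans_loss sketches from earlier (so it inherits their assumptions, and reading D as the number of training instances per the note above):

import numpy as np

def choose_K(X, K_max):
    """Fit k-means for each K and pick the K minimizing loss + K * log(D)."""
    best_K, best_score = None, np.inf
    for K in range(1, K_max + 1):
        z, mu = kmeans(X, K)
        score = kmeans_loss(X, mu, z) + K * np.log(len(X))
        if score < best_score:
            best_K, best_score = K, score
    return best_K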

Probabilistic Clustering

Unobserved Variables

  • also called hidden/latent variables
  • not missing values; just not observed in the data
  • e.g.:
    • imaginary quantity meant to provide simplification
    • a real-world object/phenomenon that is difficult or impossible to measure
    • an object/phenomenon that was not measured (e.g. due to a faulty sensor)
  • can be discrete or continuous

On to the clustering

Let’s assume that each cluster is generated by a Gaussian distribution:

(figure: three clusters of points, each generated by its own Gaussian)

Each of the three clusters \(c \in \{R, G, B\}\) is defined by its own \(<\mu_c, \sigma^2_c>\).

Additionally, we add a parameter \(\theta = <\theta_R, \theta_G, \theta_B>\) which is the probability of being assigned to a given cluster.

So now our objective is to find \((\theta, <\mu_R, \sigma^2_R>, <\mu_G, \sigma^2_G>, <\mu_B, \sigma^2_B>)\) - but to find these parameters we need to know the cluster assignments, and to find the assignments we need the parameters!

Setup

Assume that samples are drawn from K Gaussian distributions.

  1. If we knew cluster assignments, could we find the parameters?
  2. If we knew the parameters, could we find the cluster assignments?

Just as in non-probabilistic k-means, we initialize parameters randomly, then solve 1 and 2 iteratively.

Params from Assignments

Let the cluster assignment for a point \(x_n\) be \(y_n\), and let \(N\) denote the normal (Gaussian) density.

\[\begin{split}P(D|\theta, \mu s, \sigma s) & = P(X, Y | \theta, \mu s, \sigma s) \\ & = \prod_n P(x_n, y_n | \theta, \mu s, \sigma s) \\ & = \prod_n P(y_n | \theta, \mu s, \sigma s) P(x_n | y_n, \theta, \mu s, \sigma s) \\ & = \prod_n P(y_n | \theta) P(x_n| \mu_{y_n}, \sigma_{y_n}) \\ & = \prod_n \theta_{y_n} N(x_n | \mu_{y_n}, \sigma_{y_n})\end{split}\]

To find the parameters, we take the argmax of this likelihood. Since it is a product, we can take the log (turning it into a sum) and set the derivatives to 0!

\[\log P(D|\theta, \mu s, \sigma s) = \sum_n \log \theta_{y_n} + \sum_n \log N(x_n | \mu_{y_n}, \sigma_{y_n})\]
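Setting those derivatives to zero (with a Lagrange multiplier for the constraint \(\sum_k \theta_k = 1\)) gives the usual closed-form estimates, where \(N_k\) is the number of points assigned to cluster k:

\[\begin{split}\theta_k & = \frac{N_k}{N} \\ \mu_k & = \frac{1}{N_k} \sum_{n : y_n = k} x_n \\ \sigma^2_k & = \frac{1}{N_k} \sum_{n : y_n = k} (x_n - \mu_k)^2\end{split}\]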

Assignments from Params

Assign a point to the most likely cluster:

\[\begin{split}y_n & = \arg \max_k P(y_n = k | x_n) \\ & = \arg \max_k \frac{P(y_n = k, x_n)}{P(x_n)} \\ & = \arg \max_k P(y_n = k, x_n) \\ & = \arg \max_k P(y_n = k) P(x_n | y_n = k) \\ & = \arg \max_k \theta_k N(x_n|\mu_k, \sigma^2_k)\end{split}\]

Note, though, that with this rule each point gets exactly one cluster assignment. But we live in a probabilistic world - we can instead assign each point to every cluster with some probability!

Soft Assignment

\(z_{n,k}\) is a real number in the range \([0, 1]\), and \(Z_n\) is a normalization constant.

\[\begin{split}z_{n,k} & = P(y_n = k | x_n) \\ & = \frac{P(y_n = k, x_n)}{P(x_n)} \\ & = \frac{1}{Z_n} P(y_n = k, x_n) \\ & = \frac{1}{Z_n} P(y_n = k) P(x_n | y_n = k) \\ & = \frac{1}{Z_n} \theta_k N(x_n|\mu_k, \sigma^2_k)\end{split}\]

and \(\sum_k z_{n,k} = 1\).
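A NumPy sketch of this soft assignment for 1-D data (the function name and the use of scipy.stats.norm are my choices, not from the notes):

import numpy as np
from scipy.stats import norm

def soft_assignments(x, theta, mu, sigma2):
    """z[n, k] = P(y_n = k | x_n). x: (N,) data; theta, mu, sigma2: (K,) parameters."""
    # unnormalized responsibilities: theta_k * N(x_n | mu_k, sigma_k^2)
    z = theta * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(sigma2))   # (N, K)
    return z / z.sum(axis=1, keepdims=True)                           # rows sum to 1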

What about assignments -> params?

  • find the parameters that maximize the likelihood of the data
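With soft assignments, the same maximization uses fractional (“soft”) counts instead of hard ones; the standard weighted updates are:

\[\begin{split}\theta_k & = \frac{1}{N} \sum_n z_{n,k} \\ \mu_k & = \frac{\sum_n z_{n,k}\, x_n}{\sum_n z_{n,k}} \\ \sigma^2_k & = \frac{\sum_n z_{n,k}\, (x_n - \mu_k)^2}{\sum_n z_{n,k}}\end{split}\]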

This procedure is called Expectation Maximization (EM), where step 1 (params from assignments) is the Maximization step and step 2 (assignments from params) is the Expectation step. This particular model is the Gaussian Mixture Model (GMM).

In conclusion, EM:

  1. E-step: compute the expected values of the latent variables (here, the soft cluster assignments \(z_{n,k}\)) given the current parameters
  2. M-step: maximize the expected complete log likelihood of the data to re-estimate the parameters (for Gaussians, \(\theta, \mu, \sigma^2\))
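Putting the two steps together for the 1-D Gaussian case, a compact illustrative EM loop (reusing the soft_assignments sketch above; all names are my own):

import numpy as np

def gmm_em(x, K, iters=100):
    """Fit a 1-D Gaussian mixture with EM. Returns (theta, mu, sigma2, z)."""
    rng = np.random.default_rng()
    theta = np.full(K, 1.0 / K)                     # uniform mixture weights
    mu = rng.choice(x, size=K, replace=False)       # means start at random data points
    sigma2 = np.full(K, x.var())                    # variances start at the data variance
    for _ in range(iters):
        z = soft_assignments(x, theta, mu, sigma2)  # E-step: responsibilities
        Nk = z.sum(axis=0)                          # soft counts per cluster
        theta = Nk / len(x)                         # M-step: weighted updates
        mu = (z * x[:, None]).sum(axis=0) / Nk
        sigma2 = (z * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return theta, mu, sigma2, z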

Note

K-means can be seen as a special case of the GMM (hard assignments with a shared, fixed variance), and fitting a GMM is one instance of EM-based clustering.