# Kernel Methods

- step 1: use a special type of mapping function
  - still map the feature space to a higher-dimensional space
  - but computing \(\phi(x) \cdot \phi(z)\) is easy (recall the dot products between examples in SVM optimization)

- step 2: rewrite the model s.t.
  - the mapping \(\phi(x)\) never needs to be explicitly computed, i.e. never compute terms like \(w \cdot \phi(x) + b\)
  - only work with \(\phi(x) \cdot \phi(z)\) - we call this the **kernel** function

**Ex.**

For the mapping \(\phi(x) = \langle x_1^2, \sqrt{2}x_1x_2, x_2^2 \rangle\) (the example reused in the Formal Definition section below), we can compute \(\phi(x) \cdot \phi(z)\) easily:
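Writing out the dot product for this \(\phi\) (a reconstruction of the step the notes reference, using the mapping above):

\[
\phi(x) \cdot \phi(z) = x_1^2 z_1^2 + 2x_1x_2z_1z_2 + x_2^2 z_2^2 = (x_1z_1 + x_2z_2)^2 = (x \cdot z)^2
\]

So the dot product in the 3-dimensional mapped space costs only one dot product in the original 2-dimensional space.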

We call this a **kernel** function

What about the quadratic feature mapping?
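One standard answer (not spelled out in the notes): the full quadratic mapping, which also includes the linear and constant terms, e.g. \(\phi(x) = \langle 1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1x_2, x_2^2 \rangle\) in 2 dimensions, also has an easy kernel:

\[
K(x, z) = (1 + x \cdot z)^2
\]

which is again a single dot product plus a squaring.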

## Kernel in SVM

Rewriting the primal form of the SVM to use the mapped space doesn’t seem easy: the primal optimizes \(w\) directly, and \(w\) lives in the (possibly very high-dimensional) mapped space \(F\).
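For reference, the hard-margin primal in the mapped space (standard form, sketched here to make the difficulty concrete):

\[
\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_m\,(w \cdot \phi(x_m) + b) \geq 1 \ \ \forall m
\]

Both the objective and the constraints involve \(w \in F\) explicitly, so there is no obvious way to reduce everything to \(\phi(x) \cdot \phi(z)\) terms.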

But we can rewrite the dual form easily!

**Kernelized SVM**
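The standard dual with \(x_m \cdot x_n\) replaced by \(K(x_m, x_n)\) (a reconstruction of the formula the notes refer to; \(\alpha_m\) are the dual variables, \(y_m \in \{-1, +1\}\) the labels):

\[
\max_{\alpha} \ \sum_m \alpha_m - \frac{1}{2} \sum_m \sum_n \alpha_m \alpha_n y_m y_n K(x_m, x_n)
\quad \text{s.t.} \quad \alpha_m \geq 0, \ \sum_m \alpha_m y_m = 0
\]

The training data enters only through \(K(x_m, x_n)\), so \(\phi\) never has to be computed.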

Now the SVM computes a linear boundary in the higher-dimensional space without ever transforming the vectors explicitly.

Predictions:
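In kernelized form (the standard decision rule, reconstructing the missing formula), a new point \(x\) is classified as:

\[
\hat{y} = \operatorname{sign}\Big( \sum_m \alpha_m y_m K(x_m, x) + b \Big)
\]

Only the support vectors have \(\alpha_m > 0\), so the sum is typically sparse.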

## Formal Definition

Let’s use the example kernel \(K(x, z) = (x \cdot z)^2\) for \(\phi(x) = \langle x_1^2, \sqrt{2}x_1x_2, x_2^2 \rangle\)

- a kernel function is implicitly associated with some \(\phi\)
  - \(\phi\) maps input \(x \in X\) to a higher-dimensional space \(F\): \(\phi: X \to F\)

- a kernel takes 2 inputs from \(X\) and outputs their similarity in \(F\)
  - \(K: X \times X \to \mathbb{R}\)

- once you have a kernel, you **don’t need to know** \(\phi\)

**Mercer’s Condition**

- a function \(K\) can be a kernel function if a suitable \(\phi\) exists
  - \(\exists \phi\) s.t. \(K(x, z) = \phi(x) \cdot \phi(z)\)

- mathematically: \(K\) should be positive semi-definite; i.e. for all square-integrable \(f\) (s.t. \(\int f(x)^2 dx < \infty\)):
  - \(\int \int f(x)K(x, z)f(z)\,dx\,dz \geq 0\)
- this condition is both necessary and sufficient for \(K\) to be a valid kernel (see the finite-sample check sketched below)
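A finite-sample consequence of Mercer’s condition is that the Gram matrix \(G_{ij} = K(x_i, x_j)\) of any dataset is positive semi-definite. A minimal Python sketch (illustration only; the random data and seed are arbitrary choices) checking this for the example kernel \((x \cdot z)^2\):

```python
import numpy as np

# Mercer's condition implies: for any finite sample, the Gram matrix
# G[i, j] = K(x_i, x_j) is positive semi-definite (all eigenvalues >= 0).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))   # 20 random 2-d points (arbitrary choice)

G = (X @ X.T) ** 2                 # Gram matrix of K(x, z) = (x . z)^2
eigvals = np.linalg.eigvalsh(G)    # G is symmetric, so eigvalsh applies

print(eigvals.min())               # >= 0 up to floating-point rounding
```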

**Constructing Kernels**

We already know some proven basic kernel functions; given Mercer’s condition, we can construct new kernels from them!

- \(k(x, z) = k_1(x, z) + k_2(x, z)\) (direct sum)
- \(k(x, z) = \alpha k_1(x, z)\) (\(\forall \alpha > 0\) - scalar multiplication)

**Note**

Example: given that \(k_1\) and \(k_2\) are kernels, prove that \(k(x, z) = k_1(x, z) + k_2(x, z)\) is a kernel (a proof sketch follows below)

(for the linear kernel, \(\phi(x) = x\))
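A proof sketch (the standard concatenation argument, filling in what the note asks for): stack the two feature maps into one,

\[
\phi(x) = \langle \phi_1(x), \phi_2(x) \rangle
\quad \Rightarrow \quad
\phi(x) \cdot \phi(z) = \phi_1(x) \cdot \phi_1(z) + \phi_2(x) \cdot \phi_2(z) = k_1(x, z) + k_2(x, z)
\]

so a suitable \(\phi\) exists and \(k_1 + k_2\) satisfies Mercer’s condition.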

## Perceptron

We can also apply kernels to the perceptron!

Naively, we can just replace \(\mathbf{x}\) with \(\phi(x)\) in the algorithm - but that requires knowledge of \(\phi\)

**Prediction**

since every perceptron update adds a multiple of some \(\phi(x_m)\) to the weights, we can always write \(w = \sum_m \alpha_m \phi(x_m)\); prediction on \(x_n\) is then easy:
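Expanding the decision rule (reconstructing the formula the notes end on):

\[
\hat{y}_n = \operatorname{sign}(w \cdot \phi(x_n)) = \operatorname{sign}\Big( \sum_m \alpha_m \, \phi(x_m) \cdot \phi(x_n) \Big) = \operatorname{sign}\Big( \sum_m \alpha_m K(x_m, x_n) \Big)
\]

A minimal Python sketch of the resulting kernel perceptron (assumptions: the \(\alpha_m\) absorb the labels as in \(w = \sum_m \alpha_m \phi(x_m)\) above, a fixed number of passes over the data, and function names of my own choosing):

```python
import numpy as np

def quadratic_kernel(x, z):
    # The example kernel from these notes: K(x, z) = (x . z)^2
    return np.dot(x, z) ** 2

def kernel_perceptron(X, y, kernel=quadratic_kernel, epochs=10):
    """Train a kernelized perceptron: w = sum_m alpha_m * phi(x_m) is
    never formed explicitly; only kernel evaluations are used."""
    n = len(X)
    alpha = np.zeros(n)
    # Precompute the Gram matrix G[m, i] = K(x_m, x_i)
    G = np.array([[kernel(xm, xi) for xi in X] for xm in X])
    for _ in range(epochs):
        for i in range(n):
            # score = w . phi(x_i) = sum_m alpha_m K(x_m, x_i)
            if y[i] * (alpha @ G[:, i]) <= 0:  # mistake (or on the boundary)
                alpha[i] += y[i]               # update: w += y_i * phi(x_i)
    return alpha
```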