When I first started learning about support vector machines, the intuition behind kernel functions lifting the data to an effectively infinite feature space was extremely hard to grasp.


Each data point becomes a basis (in the linear algebra sense) in the new space. Bases are linearly independent by definition. Once you have a set of N linearly independent vectors and N data points, you can just solve a system of equations to do linear regression - you don't have to do a least squares approximation since the line can fit the points exactly. Back in the original data space the line between the points will be nonlinear.

I say effectively infinite rather than infinite (as it usually appears in the literature) because it turns out having a feature for each data point is enough to fit anything, so actually going all the way to infinite features would be only aesthetically different.

Abstract text explanations never seemed to make the idea crystal clear so I wrote some code. It was much easier than I thought. It's in this file: svr.m.

Here are pictures of two underlying functions/patterns it was able to learn using only linear regression.






































It doesn't look very smooth because it's not based on Vapnik's e-insensitive loss function. So it's not really support vector regression because every point is basically a support vector. The goal is to explore and illustrate the kernel feature space mapping though.

Try looking at the code if you are somewhat familiar with kernel methods (it's really only a few lines besides extensive comments). To generate the pictures above, I used the Matlab commands -

x=[1:10]'; y=x.^2+randn(10,1); xt=[10:100]'/10; yt=xt.^2;
svr(y,x,yt,xt);
title('Noisy y = x^2'); xlabel('x'); ylabel('y'); legend('true','learned')

x=[1:10]'; y=sin(x)+randn(10,1)/5; xt=[10:100]'/10; yt=sin(xt);
svr(y,x,yt,xt);
title('Noisy y = sin(x)'); xlabel('x'); ylabel('y'); legend('true','learned')

You will also need to change the kernel width (sigma on line 2 of the code) to get the above results. First try setting sigma to something small like .05 to see how the Gaussian radial basis function kernel looks. It places a Normal/Gaussian curve at each point. It's a pretty cool idea. There are other kernel functions but the Gaussian RBF is the most common in the literature.

Support vector machines and the Bayesian version, the Relevance vector machine, are both based on these kernel feature space mappings. So are many other modern machine learning algorithms. They're very useful and practical to understand.

Feel free to leave comments and thoughts. I like finding out about new things.

Here's the beginning of the note, click on it for the full pdf write-up (it has some math which doesn't display well here):

There are a couple different approaches to determining the best leverage to use for an overall portfolio. In finance 101 they teach Markowitz's mean-variance optimization where the efficient portfolios are along an optimal frontier and the best one among those is at the point where a line drawn from the risk free porfolio/asset is tangent to the frontier. In the 1950s Kelly derived a new optimal leverage criteria inspired by information theory (which had been established by Shannon a few years earlier). The criteria being optimized in these two cases is typically called an `objective function'- a function of possible asset weights/allocations that outputs a number giving the relative estimated quality of the portfolio weights. However, the two objective functions look quite different (peek ahead if you're unfamiliar). In this note I show the two are approximately equivalent, with the approximation being very close in realistic risk-return scenarios.

This is an equivalence that I haven't seen mentioned very often so I thought a formal version would be appropriate.