I've previously explained the academic Kelly criterion results and explored its relation to Markowitz mean-variance portfolio optimization. Make sure to look back at those links to previous notes if you've never heard of the Kelly criterion because I'm not re-explaining it here. This academic work assumes you know the distribution of your strategy's returns. However you will usually only have a sample of its historical performance. This note shows how you can implement these theoretical results using the data you have about your strategy's returns.

The most naive way (and sadly the only way I have ever seen it given in tutorials) to implement Kelly's results is to assume a certain return distribution and then estimate the parameters of that distribution based on the observed data. This bootstrap estimate of the distribution can then be plugged into the Kelly formula. If Kelly has been analytically solved for that distribution, for example the Bernoulli (coin flip) or normal, then the optimal betting fraction can be found extremely efficiently and with no trouble.

If the distribution of returns is unusual, such as multi-modal (it has multiple peaks), skewed, or fat-tailed, then that approach will give you either too high or too low of an amount to bet. At best you leave money on the table (betting/investing too little) and at worst you unduly increase drawdowns (betting too much). Even if you have a strategy that generates normally-distributed returns, the general solution will work well so there's no big disadvantage. The general solution to the problem is to estimate a full probability distribution over returns based on past observations.

There are multiple ways to estimate a distribution. The simplest is to put the data in a histogram. A histogram is a lot like a probability distribution but obviously depending on the number of bins it can be a loose approximation and it can be biased if the bin edges are chosen unluckily. Biased means that it will always over or underestimate the correct answer even if it is based on tons of good clean data.

The following example python code calculates the growth-maximizing betting fraction based on the histogram estimate. To run it, you need to use ipython -pylab. First the function places a histogram on the data, then it generates a set of many possible betting fraction (f) values, and finally evaluates the expected log growth function at each of those values and chooses the one that maximizes it:

def kelly(returns,precision=10):
(freq_bin, ret_bin, junk) = hist(returns,bins=precision) # generate an empirical distribution of the data
ret_bin_pessimistic = ret_bin[hstack([r_[:ret_bin.size/2],r_[ret_bin.size/2+1:ret_bin.size]])].copy() # the histogram has #bin+1 bin edges so we have to throw one away. I chose the one that will result in overbetting
prob_est_bin = freq_bin/double(size(returns)) # frequentist MLE estimate of the probability of each bin
f_candidates = arange(0,10,.01) # grid optimization candidates of the betting fraction f
growths = sum(prob_est_bin * log(1 + f_candidates[:,newaxis]*ret_bin_pessimistic), axis=1) # evaluate the growth function i.e expected log return. Python does some nice replication of the vectors to make them the right sizes
return f_candidates[growths.argmax()] # return the optimal f

Here are some results from this function compared to analytic solutions to Kelly's formula based on taking derivatives and finding critical points:
In a simple case where there are only two outcomes (i.e. Bernoulli) and payoffs are +1 and -1, the optimal fraction is f=2p-1 where p = the probability of +1. This is derived in a previous note. When we set p=.75, which would ideally result in 3 samples +1 and one -1, the function is correct and says to bet .5 i.e. 50% on each event:
kelly(array([1,1,1,-1.])) # i.e. p = .75 = (#1/total outcomes)
output: 0.5 # Correct!

In our second case we generate 10 million returns from a normal distribution with 1% expected return and a standard deviation of 10%. The analytic optimal fraction is f=E(x)/(SD(X)^2)=0.01/(0.10^2)=0.01/0.01=1. The function gets this example correct too, although it's not perfect. Using more bins for the histogram improves the estimate. This is expected when we have plenty of data like 10 million observations:
ret = randn(10000000)/10+.01 # generate samples with 1% edge (E(r)) and 10% risk (SD(r)) (=> Var(r)=.001 *units are not %; maybe %^2 if it made sense)
kelly(ret,precision=100) # precision = empirical pdf histogram bins
output: 1.1100000000000001 # Should be 1. Sample ret gives results between (.85,1.15)
kelly(ret,precision=1000)
output: 0.97999999999999998
kelly(ret,precision=10000)
output: 0.98999999999999999 # Higher precision works better with more samples

This histogram approach works well, but a better way is to do what's termed kernel density estimation. It creates a distribution by essentially smearing the data out so choppy bins aren't needed. A slight shift in the bin edges in a histogram will result in significantly different Kelly estimates, especially if the number of bins is small, as you saw above. Kernel density estimates are less fragile.

Notice that all of these are bootstrap estimates. This means that we estimate one thing which is actually random and has some error, but we ignore that error and assume it's correct so that we can use it to calculate something else that depends on it. The estimated distribution is random since it's based on a finite sample of random data, but then it's used as if we're 100% sure about it to estimate the Kelly betting fraction. Be aware of the potential risk-hiding dangers in these kinds of procedures.

The problem with Kelly betting based on an empirical distribution is that the estimate of the distribution is worst at the part that affects the Kelly fraction the most. Specifically, in an empirical distribution based on a histogram the tails are chopped off and even with a kernel density estimate the distribution's tails are essentially arbitrary. Remember that a good heuristic is to bet/leverage less rather than more.

Hopefully this note makes the Kelly criterion results more concrete and practical. It should be simple to rewrite the code above in whichever language you may prefer - it's essentially just optimization of th function: prob_est_bin * log(1 + f_candidates[:,newaxis]*ret_bin_pessimistic), with respect to f; which in English is just E[ln(1+f*r)].

I'd like to know your thoughts on the topic. If you notice an error please mention it too.

I watched a video lecture, as I often do, on data analysis. here's the video: the Hilbert Spectrum. Here are the notes I took while watching it:


The idea is appealing- to decompose a time series into underlying trends of different periodicities. In the trading world this would correspond to maybe a long term macroeconomic trend, a monthly pattern occurring around announcement of the federal funds rate, and a short term pattern caused by supply and demand and liquidity constraints. The researcher in the video was trying to study ocean waves with satellite data. Obviously there may be a difference in the two processes.

I implemented the Hilbert spectrum algorithm because I was excited about it. Here's the R script. For example, here's what the spectrum looks like for GOOG & TYP share prices:

At the top is the actual price series and below that are the series with the high frequency patterns removed one by one. They look nice.

Here's the code, hspect.r, in the language R. R is basically an advanced calculator that's also programmable.

The problem is that this is a type of smoother, useful for summarizing and exploring data, but useless for extrapolation or prediction. Among this family is cubic spline interpolation and LOESS. At the edges, if you extend these curves to make predictions the estimates will have extremely high variance. Making predictions with one of these smoothers is equivalent to throwing away almost all your data except the bit at the very end, and then either fitting a 3rd degree polynomial to it (in cubic spline interpolation) or a straight line (in LOESS).

Cubic spline interpolation is especially insidious because most people don't understand it and a confusing name doesn't help. Everyone knows how to interpret two derivatives: velocity and acceleration. The third derivative is interpretable, in two different contexts, as curvature or as burst. Burst is like if you're standing in an elevator and it goes up, how much you feel it. If the elevator is designed will, burst
will be a constant and you will barely feel it. It's also important in roller coaster design to ensure you have a smooth ride. In terms of curvature, if the third derivative is constant, it will be pleasing to the eye as if it were drawn by sweeping hand motions. That's the qualitative explanation. This latter interpretation of curvature is what cubic spline interpolation is based on. The cubic spline
interpolation fits a nice-looking piecewise (between each two points) polynomial which matches 1st and 2nd derivatives at each knot.

Unfortunately you have to understand these methods to know not to use them and not to trust systems based on them. I've had people contact me about using cubic spline interpolation for prediction but it's just not applicable.

Feel free to add your own thoughts.

I'm going to explain how I go about reading an academic paper because a lot of people I talk to don't read them, although that's where the best machine learning ideas are.

Michael Jordan is a Berkeley professor I follow and try to read as much as possible of. On his website is the paper, "A Flexible and Efficient Algorithm for Regularized Fisher Discriminant Analysis".

Fisher/linear discriminant analysis (LDA) is an algorithm for doing multi-class classification. A three "class" problem, for example, would be predicting whether to buy, short, or do nothing.

At least skim the paper before reading the rest of this note. I like this one because they don't leave out any steps and the matrix algebra makes it easy to write in Matlab.

So first I read over the paper about 1.5 times taking notes. Then I walked over to one of my friends at work to ask what he thought of the paper. He didn't have any opinion on it at first.

Then I opened up Matlab to implement Algorithm 1 on page 8. Here's the code. It's as straightforward as possible; First I initialize all variables, then calculate the algorithm step-by-step; Where a line is based on an equation, the equation number is given to the right in a comment, ex. %(15); And all variables have the exact same names as in the paper, so you can easily match them back and forth.

Next I wrote Algorithm 2 on page 9. Here's the code. As in the previous, the comments in the code should make it extremely clear how it relates to the paper.

The code seemed to work, so I emailed the main author, Zhihua Zhang, for his version to check mine against. Here's his code. It turns out both of his algorithms were the same, with one variable renamed. His is harder to connect to the paper itself because he uses different variable names.

My friend at work wrote the best version. It's the fastest and clearer than mine. Here's his code. He's a Matlab hacker and math Ph.D so he recognized that centering X and calculating psuedo-inverses could be sped up a lot, among other things.

In the end, all of our versions gave the same results. The third version is the most efficient. Multiclass kernel LDA is pretty good algorithm and implementing the paper directly was a fun exercise. Hopefully you can look at my version to see how to go directly from the paper to code, and then my friend's to see how to make it much more efficient.

I'll be in NYC from Sunday to early Wednesday. If anyone is interested in meeting for coffee/lunch/dinner, leave a comment or email me- maxdama at berkeley.edu

Just found this, http://opentrader.codeplex.com/Wiki/View.aspx?title=features%20list

Looks interesting but I probably won't have a chance to test it soon- tell me what you think if you do.

Marketcetera, another open source platform that I mentioned previously, is still making progress: http://www.marketcetera.com/site/
I'll probably hop on and try Marketcetera once their strategy backtesting module is finished.