I've previously explained the academic Kelly criterion results and explored its relation to Markowitz mean-variance portfolio optimization. Make sure to look back at those links to previous notes if you've never heard of the Kelly criterion because I'm not re-explaining it here. This academic work assumes you know the distribution of your strategy's returns. However you will usually only have a sample of its historical performance. This note shows how you can implement these theoretical results using the data you have about your strategy's returns.
The most naive way (and sadly the only way I have ever seen it given in tutorials) to implement Kelly's results is to assume a certain return distribution and then estimate the parameters of that distribution based on the observed data. This bootstrap estimate of the distribution can then be plugged into the Kelly formula. If Kelly has been analytically solved for that distribution, for example the Bernoulli (coin flip) or normal, then the optimal betting fraction can be found extremely efficiently and with no trouble.
If the distribution of returns is unusual, such as multi-modal (it has multiple peaks), skewed, or fat-tailed, then that approach will give you either too high or too low of an amount to bet. At best you leave money on the table (betting/investing too little) and at worst you unduly increase drawdowns (betting too much). Even if you have a strategy that generates normally-distributed returns, the general solution will work well so there's no big disadvantage. The general solution to the problem is to estimate a full probability distribution over returns based on past observations.
There are multiple ways to estimate a distribution. The simplest is to put the data in a histogram. A histogram is a lot like a probability distribution but obviously depending on the number of bins it can be a loose approximation and it can be biased if the bin edges are chosen unluckily. Biased means that it will always over or underestimate the correct answer even if it is based on tons of good clean data.
The following example python code calculates the growth-maximizing betting fraction based on the histogram estimate. To run it, you need to use ipython -pylab. First the function places a histogram on the data, then it generates a set of many possible betting fraction (f) values, and finally evaluates the expected log growth function at each of those values and chooses the one that maximizes it:
def kelly(returns,precision=10):
(freq_bin, ret_bin, junk) = hist(returns,bins=precision) # generate an empirical distribution of the data
ret_bin_pessimistic = ret_bin[hstack([r_[:ret_bin.size/2],r_[ret_bin.size/2+1:ret_bin.size]])].copy() # the histogram has #bin+1 bin edges so we have to throw one away. I chose the one that will result in overbetting
prob_est_bin = freq_bin/double(size(returns)) # frequentist MLE estimate of the probability of each bin
f_candidates = arange(0,10,.01) # grid optimization candidates of the betting fraction f
growths = sum(prob_est_bin * log(1 + f_candidates[:,newaxis]*ret_bin_pessimistic), axis=1) # evaluate the growth function i.e expected log return. Python does some nice replication of the vectors to make them the right sizes
return f_candidates[growths.argmax()] # return the optimal f
Here are some results from this function compared to analytic solutions to Kelly's formula based on taking derivatives and finding critical points:
In a simple case where there are only two outcomes (i.e. Bernoulli) and payoffs are +1 and -1, the optimal fraction is f=2p-1 where p = the probability of +1. This is derived in a previous note. When we set p=.75, which would ideally result in 3 samples +1 and one -1, the function is correct and says to bet .5 i.e. 50% on each event:
kelly(array([1,1,1,-1.])) # i.e. p = .75 = (#1/total outcomes)
output: 0.5 # Correct!
In our second case we generate 10 million returns from a normal distribution with 1% expected return and a standard deviation of 10%. The analytic optimal fraction is f=E(x)/(SD(X)^2)=0.01/(0.10^2)=0.01/0.01=1. The function gets this example correct too, although it's not perfect. Using more bins for the histogram improves the estimate. This is expected when we have plenty of data like 10 million observations:
ret = randn(10000000)/10+.01 # generate samples with 1% edge (E(r)) and 10% risk (SD(r)) (=> Var(r)=.001 *units are not %; maybe %^2 if it made sense)
kelly(ret,precision=100) # precision = empirical pdf histogram bins
output: 1.1100000000000001 # Should be 1. Sample ret gives results between (.85,1.15)
kelly(ret,precision=1000)
output: 0.97999999999999998
kelly(ret,precision=10000)
output: 0.98999999999999999 # Higher precision works better with more samples
This histogram approach works well, but a better way is to do what's termed kernel density estimation. It creates a distribution by essentially smearing the data out so choppy bins aren't needed. A slight shift in the bin edges in a histogram will result in significantly different Kelly estimates, especially if the number of bins is small, as you saw above. Kernel density estimates are less fragile.
Notice that all of these are bootstrap estimates. This means that we estimate one thing which is actually random and has some error, but we ignore that error and assume it's correct so that we can use it to calculate something else that depends on it. The estimated distribution is random since it's based on a finite sample of random data, but then it's used as if we're 100% sure about it to estimate the Kelly betting fraction. Be aware of the potential risk-hiding dangers in these kinds of procedures.
The problem with Kelly betting based on an empirical distribution is that the estimate of the distribution is worst at the part that affects the Kelly fraction the most. Specifically, in an empirical distribution based on a histogram the tails are chopped off and even with a kernel density estimate the distribution's tails are essentially arbitrary. Remember that a good heuristic is to bet/leverage less rather than more.
Hopefully this note makes the Kelly criterion results more concrete and practical. It should be simple to rewrite the code above in whichever language you may prefer - it's essentially just optimization of th function: prob_est_bin * log(1 + f_candidates[:,newaxis]*ret_bin_pessimistic), with respect to f; which in English is just E[ln(1+f*r)].
I'd like to know your thoughts on the topic. If you notice an error please mention it too.


