Here's a narrated video walkthrough of the code in the previous note. It's high resolution so I couldn't successfully embed it. You should read the previous note before watching this. 

UPDATE:
It looks like the service I'm using as my public server broke all their links today. The new link are of this form:
http://dl.dropbox.com/u/39904/Blog%20Datasets/Sys3/sys3.m
instead of:
http://dl-client.getdropbox.com/u/39904/Blog%20Datasets/Sys3/sys3.m
The pattern to get the new links from the old ones is to just delete "-client" and "get". Hopefully they will restore the old links since I had posted files on many pages.

I previously posted a note on decision trees, then explained how they could be improved by model averaging using ensembles of trees trained on bootstrap samples. Then I implemented it in Matlab, and now finally I'm sharing it here coded in R, with an example to walk through. This should be the simplest way to learn how a trading system like this works and it's open source.

The code is concise at about 100 lines. Here's the main system, sample data used in the example below, and a small harness to load the data and configure the workspace. You will need to download the randomForest library.

As I've mentioned multiple times, machine learning systems can take in basically any data and then automatically harvest as much alpha as possible from it. The differences between an advanced (tree bagging, SVM, etc) and a primitive algorithm (linear regression, nearest neighbors, LDA, etc) usually translate [in trading] to finding more complex nonlinear patterns, controlling overfitting, and, of course, slower runtimes.

In this example we're going to try to squeeze some alpha out of GLD, an actively traded gold ETF. After a little bit of thought, we decide on some inputs that might have some predictive value because of what we know about macroeconomics and the market. We decide to feed our system data on the movements of two big gold miners, Freeport-McMoRan (FCX) and Rio Tinto's ADR (RTP), bonds (DHY) the performance of the financial sector (XLF) and the S&P500 (SPY). I recommend using factors such as bond prices, the overall market, and the price of relevant commodities in any machine learning system, because I've found they often
improve performance. If you look at the sample data which you should have downloaded above, I've compiled all this data for you. We will use weekly periods and backtest as far back as 2001.

To run the system from the code I've provided, open R (on Windows it's RGui, download the installer here) and first enter the command > setwd('C:\\[whatever folder you downloaded the files into]'). This sets the default search path directory. Next copy and paste or type > source('rungoldtreesys.r'). This will load the data and the tree bagging system code. Now you can backtest the system using whatever parameters you'd like. In the continuation of the example below, I ran it with this command and parameters, > factormodel.tree(data, targets, returns, btsamples=130, horizon=1, trainperiods=8, leverage = 'kelly', keepNFeatures = 10,
treesInBag = 40, endPd = 150).

Now I'll explain how to interpret the results. While the system is backtesting it outputs the predictions at each period so you can see how fast it's running. The final text results give you a summary of all the predictions and confidence values, and the overall accuracy as the fraction correct. There are also three plots, showing the estimated importance of each variable and decrease in out-of-bootstrap-sample error rate as trees are added to the ensemble (only on the first backtest period, just to give an idea - you don't want hundreds of charts). Here's one I got estimating variable importance:
We find that the previous period's return is the most useful followed by the previous 4 weeks' return of FCX and bond yield levels. FCXHighLow is the difference between FCX's weekly high and low and FCXVolNorm is the volume of shares traded in the week for FCX. Both were found to be useless, as we might expect. Read more about tree bagging to learn how exactly importance is measured. Next we look at the error rate of the ensemble as more decision trees were "grown":

During feature selection the error rate falls and then rises since the ensemble gets "confused" by the useless variables, which we found above. Then in actual model building the accuracy finishes at about 57.5%. This is just the model build to predict one period into the future by the backtester. The real power of ensemble/bagging learners is that as more components are added the error gets lower, to a point.

Finally, let's look at the equity curve using the parameters above. There is another parameter, the random seed, which controls how bootstrap samples are chosen so results vary.
Over 120 weeks (about 2 years), the system made about 90%. Two things to keep in mind are that this ignores trading costs (which should be negligible in this case because it's weekly trading of just a single security), and more importantly that this is based on full Kelly betting, which is probably too volatile for a human to tolerate - above we see a 40% drawdown. However, when searching for alpha it's good to have sensitive tools.

If you give this system data and buy/short targets, it will pull as much alpha from the data as is possible for the underlying algorithm.

Finally, I'll explain the system's parameters so you can experiment with and modify the code yourself.

data : All the data you're giving to the system by columns
targets : Either 1 or -1 for a long or short position. Aligned in time with corresponding data
returns : Similar to targets but used for making equity curve
backtest : Don't mess with this one, possible future functionality
verbose : "
btsamples : The number of periods to evaluate the system on
skip : Ignore every (skip-1) data point. Used for testing over long period faster
horizon : Number of periods out to predict (equivalently, number of lags for data)
dataperiods : Don't mess with this one, possible future functionality
speedUpFactor : Whether or not to train a model every backtest period, not tested, likely broken
trainperiods : Number of periods of data to train on. More = focus on the past, Less = shorter memory
leverage : Either 'kelly' or a positive decimal number. E.g 2 means 2X returns
keepNFeatures : Number of features to retain after feature selection
treesInBag : Number of trees to grow. Smooths confidence values but takes longer
startPd & endPd : Used to test over a specific interval. No date functionality - kept simple for now.

Please leave a message if you have any suggestions, questions, or ideas.


Here's the code for a basic decision tree system in Matlab. I'm sharing it in response to a reader comment on the previous note. I re-wrote this system in R, which is what I was planning on sharing, and which I will be later. This code is not as clean as I would typically like to share and Matlab is not open source. 


When you unzip it, you'll see it's split up into two folders, production and release. The production version has a backtesting script wrapped around it so it's probably more interesting, but also slightly more complex. To run the production version use the function "sim2" and for the release version use the function "run". It will download the data from Yahoo, train and run the algorithm, and then give some output. 

The code is actually relatively short and clear so don't get intimidated at first. 

A few months ago I started looking for a new trading system idea following the same machine learning philosophy as the last one. 

The previous system was based on support vector regression, and used sliding window crossvalidation to set the kernel width and SVM cost parameters. It had a few problems:

1. MLE (maximum likelihood estimation) models are always uncertain because they may not be robust. Look up “model averaging” or “Bayesian methods”
2. Crossvalidation required retraining the SVM thousands of times which was extremely slow.
3. Support Vector Regression can only take numeric/ordered data as inputs, not categorical.
4. Regression estimates can be hard to interpret. e.g. if during training the system never saw a day where the price rose 20%, then it predicts 20% should you interpret it as a strong or a broken signal?
5. Crossvalidation over two parameters required the creation of a 4 dimensional matrix to store performance. This was very hard to visualize, especially after not working on the system for a while. 
6. Coming up with a loss function for regression is hard for the application of trading. MSE is not perfect, nor is correlation. The loss function should penalize false positives because of transaction costs.
7. No natural confidence values.
8. Hard to interpret SVM. Infinite features w/ Gaussian RBF? The effect of changing C & kernel width are not easy to anticipate or interpret.
9. Non-linear but biased toward linearity (for ex., bad at learning XOR)

A few months ago I settled on a new learning algorithm to build a system on: the random forest. Random forest is a clever name for decision tree bagging (ensemble). And bagging is a clever conjunction of “bootstrap aggregating”. I especially liked that it could accept any data, numerical or categorical, gives confidence values, and is easier to interpret. Then I started printing and reading papers on decision trees, random forests, and the bootstrap. I read about four papers a week. The random forest also improves on the standard decision tree, which I wrote about previously, on problems 1 and 7 above. 

I'll post the code in a few days once it has been tested.

I've previously written notes about an information theory betting "paradox" and Kelly's optimal betting fraction.


Information theory formalized the duality between information transmission and probability. Many interesting results come from thinking of probability in terms of information, such as using Kelly betting to determine the maximum profit from private information about a stock or how to correctly price an option.

The following are the best three papers I've found on information theory for someone who has just started learning. (One other, Elements of Information Theory, is also good but not free except at a library; I have the solution manual in electronic format if anyone would like it)

1) Shannon's original 1948 paper. Page 3-7 or so is the good bit if you don't like math, it explains the information rate of English (also good if you do like math of course).

2) Information Theory, Inference, and Learning Algorithms by David MacKay. I really like how this one is written, with exercises imbedded in the text to work as you go. This is exactly how I teach myself things when I'm reading independently. For anyone interested in machine learning, this will be a fascinating book. Start from page 1. (this book is also available online from the author's site)

3) Lecture notes: "An Introduction to Information Theory and Entropy" by Tom Carter. This is the best one of the three for anyone reading about info theory for the first time. It's intuitively explained and goes relatively slowly- it's the first one I would recommend to a friend.

Please feel free to leave a comment, ask a question, or email me about anything, even unrelated to the above.

Machine learning is about automatically finding patterns in data. In trading systems we are interested in finding all the patterns and then choosing the ones that usually precede price increases (or anything else you can find a way to profit from).

Innovation in the field comes from developing new pattern identification algorithms. Typically these algorithms function similarly so you can compare the pros and cons of each and create a modular system in which you can swap them out. Each algorithm makes certain assumptions which influence the applications they are appropriate for. Financial market data is highly noisy, nonstationary, and even the noise is heterogeneous. This is different than say, medical diagnostic data, where the goal is to automatically diagnose a patient based on their symptoms and health history.

In the past, I looked at neural networks and support vector machines. These are the algorithms that seemed applicable to trading. A few others which seem less powerful: linear regression, naïve bayes, and nearest neighbors. (Sadly, most big institutional factor models are based on linear regression)



Now take a step back. Consider how a stock screener works. First you say, “keep a stock if its P/E is under 15”. Then you add another condition, “only keep those with debt/equity less than 1.0”. Then you might add one more filter, “of those, only keep the stocks with market cap under $500 million”. Every ticker is passed through this list of conditions and some are rejected at each step. You hope the remaining stocks are good values and will outperform the market. The problem is that you don’t really know when you make a screen by hand. Factor model software, used by many institutional investors, will automatically determine how useful a certain criteria is by using simple linear regression and backtesting.

What’s the problem with linear regression? This is a linear relationship: y=mx+b. It’s great for modeling something like, how much will a company’s taxes increase if the profit goes up by $100? m will equal the tax rate, .35. But a linear relationship can’t precisely model all sorts of things we regularly see. Let’s say you want to model Facebook and other social networks’ profitability based on the number of hits their sites get each day. There is a well-documented network effect where going from second most popular to first is much more significant than going from fifth to fourth. A linear model could only capture part of this effect. Another situation would be estimating someone’s weight by looking at how tall they are. Weight is roughly proportional to volume which grows as the cube of height, so linear regression would fail here too.



Now I’ll explain the decision tree algorithm. I’ll build on these two thought experiments and compare it to support vector machines and neural nets.

A decision tree is just like the screening example, but more sophisticated. The screener is not a tree in the sense that it has only two outcomes: keep or reject. The decision tree keeps analyzing the “rejected” ones. So instead of saying, “throw away the ones with P/E over 15”, the decision tree would then analyze those further, possibly adding in some typical growth criteria, like “keep those with analysts’ earnings estimates of at least 20%”. But intuitively you should think of it as a sophisticated automatic screener. Wikipedia has a good little example; the graphic below is copied from there to make the note more interesting.



You might wonder how it can work automatically. It’s a product of how incredibly fast computers are. Let’s say you have three variables, P/E, debt/equity, and market cap.

1) The algorithm randomly selects 1, say P/E.
2) The algorithm loops through many values for the P/E, say {1,2,3,4,5,6,7,8,9,10,…,30,31,…,max}, where max = the maximum P/E of all stocks being considered, and determines the P/E number that best splits the data into outperformers and underperformers. (Other “loss functions” are also common) This is very computationally demanding.
3) Then the algorithm repeats (recursively) on each of the subsets determined by the P/E split. It may divide them again by P/E or maybe by debt/equity or market cap.
4) Once the tree has many branches and the data is reasonably well classified (this is an important parameter which I won’t go into depth about here), you’re done training and you can start using it for actual prediction on new data.

Here’s the simplest example of how a decision tree can introduce nonlinearity: First it says companies with a P/E <> 3 are good. So now on a number line of P/Es, it goes bad-good-bad. It turns out that linear models cannot learn this relationship, and support vector machines can, with some care by the programmer. (Our analysis can be justified by imagining that P/E < 15 means good value, but P/E < 3 means screwy accounting and bad data, so you eliminate those.)

Trees have other nice properties including they handle missing data and outliers well, they can mix categorical and numeric inputs, and the trained learner is easier to interpret.

In another note soon, I’ll explain why I decided to build a system based on a further enhancement of this decision tree concept: the “random forest” or tree bagging. Obviously decision trees already have advantages, but I’ll go into the weaknesses and further improvements and explain how I built and backtested the system.

A long time ago I wrote a note on position sizing, before I knew that the problem had been analyzed already. This note explains in very simple terms the analytic solution to the problem I solved earlier with Monte Carlo estimation. 


Once again, I had to copy and paste screenshots to keep the formulas so it's a little hard to read. Here's this note is in pdf format, if that's easier.



There has been a lot of research on generalizations of the Kelly Criterion, which should be easy to find.