UPDATE:
It looks like the service I'm using as my public server broke all their links today. The new links are of this form:
http://dl.dropbox.com/u/39904/Blog%20Datasets/Sys3/sys3.m
instead of:
http://dl-client.getdropbox.com/u/39904/Blog%20Datasets/Sys3/sys3.m
The pattern to get the new links from the old ones is to just delete "-client" and "get". Hopefully they will restore the old links since I had posted files on many pages.
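If you have a lot of old links saved, the rewrite can be scripted. This is a minimal sketch in R; `fix_link` is a name I made up for illustration, and it assumes every old link uses exactly the host shown above.

```r
# Old-style link, copied from the example above.
old_link <- "http://dl-client.getdropbox.com/u/39904/Blog%20Datasets/Sys3/sys3.m"

# Hypothetical helper: swap the old host for the new one.
# fixed = TRUE treats the pattern as plain text, so the dots are literal.
fix_link <- function(url) {
  sub("dl-client.getdropbox.com", "dl.dropbox.com", url, fixed = TRUE)
}

new_link <- fix_link(old_link)
print(new_link)
```

Because `sub` is vectorized, the same helper works on a whole character vector of links.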

I previously posted a note on decision trees, then explained how they could be improved by model averaging using ensembles of trees trained on bootstrap samples. Then I implemented it in Matlab, and now finally I'm sharing it here coded in R, with an example to walk through. This should be the simplest way to learn how a trading system like this works and it's open source.

The code is concise at about 100 lines. Here's the main system, sample data used in the example below, and a small harness to load the data and configure the workspace. You will need to download the randomForest library.
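One common way to get the library is straight from CRAN inside R. This is only a sketch of the usual setup step; it assumes you have an internet connection and a CRAN mirror configured.

```r
# Install randomForest from CRAN only if it is not already available.
if (!requireNamespace("randomForest", quietly = TRUE)) {
  install.packages("randomForest")
}

# Load it so that randomForest() can be found.
library(randomForest)
```

If this step is skipped, R stops with a "could not find function" error when the system tries to train its first ensemble.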

As I've mentioned multiple times, machine learning systems can take in basically any data and then automatically harvest as much alpha as possible from it. The differences between an advanced (tree bagging, SVM, etc) and a primitive algorithm (linear regression, nearest neighbors, LDA, etc) usually translate [in trading] to finding more complex nonlinear patterns, controlling overfitting, and, of course, slower runtimes.

In this example we're going to try to squeeze some alpha out of GLD, an actively traded gold ETF. After a little bit of thought, we decide on some inputs that might have some predictive value because of what we know about macroeconomics and the market. We decide to feed our system data on the movements of two big gold miners, Freeport-McMoRan (FCX) and Rio Tinto's ADR (RTP), bonds (DHY), the performance of the financial sector (XLF), and the S&P500 (SPY). I recommend using factors such as bond prices, the overall market, and the price of relevant commodities in any machine learning system, because I've found they often improve performance. If you look at the sample data, which you should have downloaded above, I've compiled all this data for you. We will use weekly periods and backtest as far back as 2001.

To run the system from the code I've provided, open R (on Windows it's RGui, download the installer here) and first enter the command > setwd('C:\\[whatever folder you downloaded the files into]'). This sets R's working directory, which is where it looks for files. Next copy and paste or type > source('rungoldtreesys.r'). This will load the data and the tree bagging system code. Now you can backtest the system using whatever parameters you'd like. In the continuation of the example below, I ran it with this command and parameters: > factormodel.tree(data, targets, returns, btsamples=130, horizon=1, trainperiods=8, leverage = 'kelly', keepNFeatures = 10, treesInBag = 40, endPd = 150).

Now I'll explain how to interpret the results. While the system is backtesting it outputs the predictions at each period so you can see how fast it's running. The final text results give you a summary of all the predictions and confidence values, and the overall accuracy as the fraction correct. There are also three plots, showing the estimated importance of each variable and decrease in out-of-bootstrap-sample error rate as trees are added to the ensemble (only on the first backtest period, just to give an idea - you don't want hundreds of charts). Here's one I got estimating variable importance:
We find that the previous period's return is the most useful followed by the previous 4 weeks' return of FCX and bond yield levels. FCXHighLow is the difference between FCX's weekly high and low and FCXVolNorm is the volume of shares traded in the week for FCX. Both were found to be useless, as we might expect. Read more about tree bagging to learn how exactly importance is measured. Next we look at the error rate of the ensemble as more decision trees were "grown":

During feature selection the error rate falls and then rises since the ensemble gets "confused" by the useless variables, which we found above. Then in actual model building the accuracy finishes at about 57.5%. This is just the model built by the backtester to predict one period into the future. The real power of ensemble/bagging learners is that as more components are added the error gets lower, to a point.
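To give a feel for what happens inside the bag, here is a small base-R sketch of bootstrap sampling and voting. It is not the randomForest internals: the seed, the five votes, and the idea of reading "confidence" as the share of trees that agree are all assumptions made for illustration.

```r
set.seed(42)   # arbitrary seed so the sketch is repeatable
n_obs   <- 8   # e.g. trainperiods = 8
n_trees <- 40  # e.g. treesInBag = 40

# Each tree trains on n_obs rows drawn with replacement (a bootstrap sample).
bags <- lapply(seq_len(n_trees), function(i) sample(n_obs, n_obs, replace = TRUE))

# Rows a tree never saw are its out-of-bag rows, which act as a free test set.
oob <- lapply(bags, function(b) setdiff(seq_len(n_obs), b))

# Hypothetical votes from five trees for one period (+1 long, -1 short).
votes      <- c(1, 1, -1, 1, 1)
prediction <- sign(sum(votes))           # majority vote
confidence <- mean(votes == prediction)  # share of trees that agree
```

With more trees the vote share moves in smaller steps, which is one way to see why a bigger bag gives smoother confidence values.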

Finally, let's look at the equity curve using the parameters above. There is another parameter, the random seed, which controls how bootstrap samples are chosen so results vary.
Over 120 weeks (about 2 years), the system made about 90%. Two things to keep in mind are that this ignores trading costs (which should be negligible in this case because it's weekly trading of just a single security), and more importantly that this is based on full Kelly betting, which is probably too volatile for a human to tolerate - above we see a 40% drawdown. However, when searching for alpha it's good to have sensitive tools.
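To see why full Kelly is so volatile, here is a minimal sketch. It assumes the common continuous approximation of the Kelly fraction, mean return divided by variance; the sizing rule inside factormodel.tree may differ, and the returns are made-up numbers, not backtest output.

```r
# Hypothetical per-period strategy returns (made up for illustration).
strategy_returns <- c(0.02, -0.01, 0.03, -0.02, 0.01)

# Assumed formula: continuous-approximation Kelly fraction = mean / variance.
kelly_fraction <- function(r) mean(r) / var(r)

# Equity curve at a given leverage, and its worst peak-to-trough loss.
equity_curve <- function(r, leverage) cumprod(1 + leverage * r)
max_drawdown <- function(equity) max(1 - equity / cummax(equity))

full_kelly <- kelly_fraction(strategy_returns)
half_kelly <- full_kelly / 2

dd_full <- max_drawdown(equity_curve(strategy_returns, full_kelly))
dd_half <- max_drawdown(equity_curve(strategy_returns, half_kelly))
```

On these numbers the drawdown at half the fraction is smaller than at the full fraction, which is the trade-off to weigh when passing a fixed number instead of 'kelly' as the leverage parameter.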

If you give this system data and buy/short targets, it will pull as much alpha from the data as is possible for the underlying algorithm.

Finally, I'll explain the system's parameters so you can experiment with and modify the code yourself.

data : All the data you're giving to the system by columns
targets : Either 1 or -1 for a long or short position. Aligned in time with corresponding data
returns : Similar to targets but used for making equity curve
backtest : Don't mess with this one, possible future functionality
verbose : Don't mess with this one, possible future functionality
btsamples : The number of periods to evaluate the system on
skip : Ignore every (skip-1) data point. Used for testing over long period faster
horizon : Number of periods out to predict (equivalently, number of lags for data)
dataperiods : Don't mess with this one, possible future functionality
speedUpFactor : Whether or not to train a model every backtest period, not tested, likely broken
trainperiods : Number of periods of data to train on. More = focus on the past, Less = shorter memory
leverage : Either 'kelly' or a positive decimal number. E.g. 2 means 2X returns
keepNFeatures : Number of features to retain after feature selection
treesInBag : Number of trees to grow. Smooths confidence values but takes longer
startPd & endPd : Used to test over a specific interval. No date functionality - kept simple for now.
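Here is a rough sketch of how targets and horizon fit together. It assumes rows run newest first and uses made-up returns; the variable names other than targets, returns, and horizon are hypothetical and not taken from the system code.

```r
# Made-up weekly returns, newest first (assumed ordering).
returns <- c(0.012, -0.004, 0.020, -0.015, 0.007, 0.001)

# Targets: +1 (long) when the return is non-negative, -1 (short) otherwise.
targets <- ifelse(returns >= 0, 1, -1)

# With horizon = 1, each target is paired with data from one period earlier,
# which sits one row further down when rows are newest first.
horizon      <- 1
n            <- length(returns)
target_rows  <- 1:(n - horizon)
feature_rows <- target_rows + horizon

pairs <- data.frame(
  target        = targets[target_rows],
  lagged_return = returns[feature_rows]
)
print(pairs)
```

The lag is what keeps the model from seeing the return it is trying to predict: the input on each row is the previous period's value, never the current one.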

Please leave a message if you have any suggestions, questions, or ideas.

47 comments:

chintan shah said...

Good Job!

Anonymous said...

I don't think it makes sense to use full kelly criterion. Even with a high quality program like this the return could be overestimated. This increases the chances of ruin.

A half kelly criterion is better.

This is the view that e.p. chan has espoused in his blog entry. It is also recommended in Fortune's Formula.

Max Dama said...

Anon,

Half is somewhat arbitrary. But it's a good idea to modify this code, because it's only a very basic prototype.

Regards,
Max

Anonymous said...

hi max, is it suspicious that it would return 90% profit?

Joshua Ulrich said...

Max,

Thank you for putting this together and sharing!

I'm having several issues replicating the results in your post.

You say you generated the results in the post with this command:
factormodel.tree(data, targets, returns, btsamples=130, horizon=1, trainperiods=8, leverage = 'kelly', keepNFeatures = 10, treesInBag = 40, endPd = 150),
but the error rate graphs suggest you created 200 trees, not 40.

When endPd is less than 300 or so, I get an error about NA values in the predictors in the call to randomForest and the program stops.

If there's not enough data for the requested number of backtest periods, you call the "error()" function, which doesn't exist. You want "stop()".

Best,
Josh

Joshua Ulrich said...

Max,

I'm also a bit confused regarding your GLDr1 values. I'm able to replicate all the other weekly returns, but I'm nowhere close with GLD.

Max Dama said...

Josh,

Cool I hadn't noticed that error wasn't the error raising function because it always caused an error!

GLD was daily data, it should be replaced with weekly.

The error rate graphs were from forests trained with 200 trees, not the same from the example. Because it's only training it on 8 periods, more than about 20 trees give almost the same results. 200 tree forest error plots look better.

I use Emacs on a Sun terminal for all my class code and now I have XEmacs with some R plugin on this Windows machine. With R I don't notice a difference because it's so straightforward in the first place. At work I use Vim. I didn't use Vim much at all before this but I'm liking it more than Emacs.

Thanks for catching each of these things.

Regards,
Max

Bas said...

Max,

Interesting to notice that I'm not the only one who's using the RF algorithm within the financial domain (perform a Google search and you understand what I mean ;-). You could consider the e1071 package (cran.r-project.org/package=e1071) which is capable of fine-tuning the nodesize, mtry & ntree parameters based on sampling techniques like bootstrapping and cross-validation. Keep us posted!

Bas

Anonymous said...

Hi Max,

Nice example! Thank you for putting yourself out there dude ;-)

Do you plan to continue to develop this example?
Isn't there something like the R package Caret that will help "tune"?

What program did you use to create your test data? If it was R why not share it too?

As I understand it, Kelly was not envisioned for use in the market. It does serve your example well though ;-) If you are looking for a fun way to do "bet" sizing for the market see what Ralph Vince (several books) has to say ;-)

Keep up the great work dude!!!

Cordially,

-Digital Dude-

"One of the tricks here is to get away from thinking that programs have to be composed with only a simple text editor." -Alan Kay-

Max Dama said...

Thanks Bas and Digital.

Digital - I had Vince's book from the library for a while but I need to take a second look. Thanks for the reminder.

I didn't use R for scraping the data. It was manual from Google (which is probably why there was the error). I would prefer quantmod's data functions return it in a data frame rather than invisibly putting it in the workspace individually with specific names.

I don't think I'll pursue this example because in the past few weeks I've made great progress writing random forest code from scratch. Plus it's not very clean code in the first place. My current opinion is that rapid matrix computations are necessary for large scale self optimizing systems that backtest themselves.

Wow I just looked up and read about Caret before pressing publish comment. That looks great. Well I'm not sure what I'll do after all- thanks!

Regards,
Max

chintan shah said...

Max,

since 2001 Gold has been on long term extreme bullish trend.
Here parameters have correlation with input data not co-integration or leading/lagging effect which is necessary for machine learning.

As an example
series 1: 2 4 6 8 10 8......
series 2: 1 2 3 4 5 4 .........
series 3: 4 8 12 16 20 16....

Here you can't take series 1 & 2 for forecasting series 3 as they are markov processes.

but if it has some sort of leading effect like this

series 1: 3 4 5 6 5 4 3 2...
series 2 : 1 1 2 4 5 6 8 6 4 3 1..

here series 1 can be useful to forecast value of series 2.

So, I guess it's not about choosing the right algorithm but understanding the market from an intuitive perspective and then replicating your market feelings in some CODE.


If this system can generate alpha by keeping a short bias then it's working, otherwise it's getting lucky due to the long term bullish trend of gold.

Regards,
Chintan

Max Dama said...

Chintan,

No. The data is correlated with gold prices but definitely not 100%. Obviously I'm not confused that correlation is a leading indicator.

Also the system is long short so no underlying trend should influence it.

Regards,
Max

Bas said...

IMO, especially advanced algorithms like random forest and SVM find relationships between the given inputs and the output(s) you can't measure directly using a correlation coefficient. In my case the highest individual correlation is about 0.15 while the actual model performs a 0.95-0.99 score.

Bas

Daniel said...

Max,

Thanks for posting your code. I am a little confused about a few things ... you mentioned that the gld returns in the dataset are daily. I assume they are the forward 1-day returns calculated by closing price, is that right (and the other returns are 1 week historical returns)? Not sure what I'm doing wrong but it doesn't match up with the quotes I'm looking at. Secondly from the code it looks like the gld returns are being kept in the training data ... I'm curious why you are keeping that column when it is used to formulate the targets.

Thanks,
Daniel

Max Dama said...

Daniel,

The returns are weekly, not daily. Also, I kept in the GLD returns, but lagged by one period. Perhaps the last week's price is useful in predicting this week's, when combined with other features.

Regards,
Max

Daniel said...

Hm ... did you update the data set? In an above comment you mentioned the gld returns are based on daily data, but either way I'm still a little confused about where the numbers came from.

Unrelated to that, do you know if input normalization (eg transformation to 0 mean/unit std) has any usefulness with decision trees? I would guess not since it seems like learning is not distance based but I've never really used them very much.

Daniel

Max Dama said...

Daniel,

No I haven't fixed the data set. Sorry about the data, I had no idea it would be such an issue for people, trying to verify where it came from or whatever.

Input normalization/linear transformation should not affect the random forest. However it could be fruitful to change the distribution by featurizing it in some way.

Regards,
Max

manaau said...

Awesome, thanks. I question the use of kelly - wouldn't it overemphasize price interactions later in the data?

Max Dama said...

Manaau,

What do you mean by later and what do you mean by interactions?

Regards,
Max

manaau said...

(sheepishly) um..

interactions being the several tickers thrown in the hopper for testing; later being the time series - which was the idea, that a strong set of relationships (miners go up, gold goes up) later in the series would be overemphasized if not betting a flat amount from beginning to end, leading to less than optimum rules. my (novice) testing has used the same bet size for rules, then let the best methods (smoothest strong performers) fight it out in monte carlo (for compounded returns vs drawdown).

Enjoying the site a lot, I think I need to enroll in school somewhere and get me a student price on some matlab :)

Max Dama said...

Manaau,

Kelly overemphasizes everything in the sense you're referring to. But it's the optimal fraction for maximizing compounded growth so if that's your only goal, then it's good. Heedlessly maximizing compounded growth isn't what most humans are after it seems though.

I like Monte Carlo too. For some reason people seem to find it difficult- I think it's perfectly natural.

Regards,
Max

Anonymous said...

Hi Max,

Try:

apple <- getSymbols('AAPL',auto.assign=FALSE)

for your workspace environment issue ;-)

Add: return.class='arg' if you want to coerce to something other than an xts returned ;-)

Cordially,

-Digital Dude-

“Ideas make a difference, the difference.” -Bill Bradley-

Max Dama said...

That'll do it.

Anonymous said...

Max,

Interesting article.

I'm trying to reproduce your results on my machine. I've loaded R, rtreesystem.r, rungoldtreesys.r, and gold01-09.csv.

When I run it, R comes back with:


Loading required package: randomForest

Error in trainEnsemble(data[(t + horizon + 1):(t + trainperiods + horizon + :
could not find function "randomForest"

In addition: Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called 'randomForest'


Apparently, it can't find "randomforest". Can you post the code for "randomforest"?



Thanks,
Bill S

Max Dama said...

Bill,

You can find the code here: http://cran.r-project.org/web/packages/randomForest/index.html

Regards,
Max

Anonymous said...

Max,

Please forgive me for the dumb questions, but I'm new to R and randomForest.

From files gold01-09.csv and rungoldtreesys.R:



#setwd("C:\\Documents and Settings\\Max\\My Documents\\Quant\\R")
data = read.csv('gold00-09.csv', header=TRUE, stringsAsFactors=FALSE)
returns = data[[9]]
dates = data[[1]]
data = data[-c(1)]
targets = returns
targets[targets>=0]=1
targets[targets<0]=-1

source('rtreesystem.r')
# factormodel.tree(data, targets, returns, btsamples=250, horizon=1, trainperiods=150, leverage = 'kelly', keepNFeatures = 10, treesInBag = 200)




In the above code, the +1 -1 "targets" values are the ones you're trying to classify/predict. They are "digitized" from the 9th variable in "data" (which is GLDr1 in the data file, and "returns" in the above code). When you pass "data" and "returns" into "factormodel.tree" how do you tell it to ignore "returns" and/or the 9th variable for modeling/prediction?



Thanks,
Bill S

Max Dama said...

Bill,

The dumber the questions you have to ask, the dumber the teacher.

The 9th variable/returns are not ignored, they're included in the inputs.

Are you thinking that this is essentially giving the model exactly the same data it's trying to predict and therefore cheating? Actually it's not because the input data is lagged by one period compared to the targets. So the returns from yesterday are used to predict the returns today.

If you're asking something else, please clarify.

Regards,
Max

Anonymous said...

Max,

You answered a couple of questions.

First, it appears that I am interpreting rungoldtreesys.R and the data file correctly.

Second, yes, I was looking for a lag between "targets" and "returns".

This brings up some new questions:

1. In file rtreesystem.R, is the following line where the "targets" "data" offset occurs?

savedEnsemble = trainEnsemble(data[(t + horizon+1):(t + trainperiods+horizon+1), ], targets[(t + 1):(t + trainperiods+1)], verbose = TRUE, treesInBag, keepNFeatures)

If not, where does it occur?

2. Running the same data 10 times, I get 10 different answers (typically with a different order of variable importance). Does this mean that no single variable dominates the data, or something else?

3. Can you recommend a good book on R, hopefully with a lot of examples? I'm an engineer (a mediocre programmer in various languages).



Thanks,
Bill S

Max Dama said...

Bill,

Yes, that's where the offsetting occurs- anywhere you see the variable "horizon".

RandomForest selects random bootstrap samples to train each tree. That's why each run gives different results.

I haven't read a book on R. I took this class at UC Berkeley: http://www.stat.berkeley.edu/classes/s133/ The professor has all the notes online. The "schedule" link at the top left has the notes and the "resources" link has some interesting sites. It was a good class and the notes are clear.

R has a great community, just subscribe to the email lists: http://www.r-project.org/mail.html I'm on the SIG-Finance mailer.

Regards,
Max

Anonymous said...

Max,

Thanks for the answers and the links. That course information will keep me busy for a while. I also subscribed to the Finance mailing list.


Thanks again,
Bill S

Max Dama said...

You're welcome Bill

Anonymous said...

Max,

I was testing the randomforest and svm methods to see what the trade-off/advantages are. While I was pushing "gamma", waaaaay overfitting the svm test, I went looking for results from the tests of others. I bumped into the following and thought you might be interested.

http://home.comcast.net/~tom.fawcett/public_html/ML-gallery/pages/index.html

Notice the similarities of svm and randomforest.


Bill S

Max Dama said...

Thanks Bill. Surprising that Bayes Nets did so badly.

Max

Anonymous said...

Max,

I agree. It looks like it needs a lot of points.

The Logistic Regression surprised me. I was fooling around with a couple of logistic models before I saw this. That's now on the back burner.

The Nearest-Neighbor scheme seems to pick up a lot, even with just a few points. I've only read about these models, I haven't had time to test them out. I'll probably look a little closer now.

The sad part about most financial problems is they don't have nice little "islands" to discover. It's just loud or louder noise.


Bill S

Anonymous said...

Hi Max
Very interesting!! Thank you for sharing!!!
How do I configure it to run on 1 minute data?

Max Dama said...

Anon,

You need a csv file with 1 minute data to do that. Then just replace the filename in the code with your new filename.

Regards,
Max

Anonymous said...

Hi Max,
Thanks for sharing!!!
Are there certain relations (or ranges) between: data (number of posts), btsamples, trainperiods and treesInBag (and other inputs)that is preferable?
For example: 2 * trainperiods = treesInBag.

Thank you
Noob

Anonymous said...

Hi Max,

Is it possible somehow to add some trading rules into the evaluation? For example a stop-loss where the long position is kept until it turns down.

Noob

Anonymous said...

Hi again Max,

If I want to papertrade gold with the DTB as a predictor how can I see what the prediction is for the next week/day?

Regards,
Noob

Max Dama said...

Noob,

There are heuristics like that but you figure them out by playing around with it, there haven't been results proved about it, and I think it's impossible due to the random nature of the forest.

Feel free to modify the code, it's only a few lines actually.

Regards,
Max

henner's Blog said...

Hi Max,
love your blog. Really great content...

Just a tiny question.
Isn't this decision tree bagging system data snooping par excellence? It fits parameters in the best possible way to explain the sample but what about out-of-sample performance? Maybe I'll just have to paper trade it :)

Cheers,
Henning

Max Dama said...

Henning,

Definitely. Hopefully we've done enough regularization that we haven't allowed ourselves to overfit too much, but only out of sample testing can really show how it works.

Thanks for the compliment, I'm glad it's useful.

Regards,
Max

Anonymous said...

Hi Max,

thanks for sharing your insights on your blog. I've been working with randomForest and other classification algos for many years (having a background in biostatistics, where these methods have been used for quite some time before they became famous in finance).

I compared my code with yours (my financial time-series are ordered past-to-present), using your gold data, and I think your R code doesn't do what you want it to do (I use R 2.9.1):

in the following lines in your code:

idx = rev(data.frame(targets = seq(1, btsamples*skip, skip)))
idx = data.frame(idx, data = idx$targets+horizon)

because the rev() command doesn't work as you want it (it does not reverse the rows in the data.frame), later on in your code the time-axis of 'preds' is wrong (still ordered from present to past, instead from past to present), yielding problems during the creation of the equities time-series (using cumprod()), where you do reverse the order of 'returns'.

so I suggest to replace those two lines above with something like (the idx$data you never use in your code, so I drop it):

idx = data.frame(targets = rev(seq(1, btsamples*skip, skip)))

Tell me, if I'm wrong, but I think, this is the way, it's supposed to be.

Cheers, Chris
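
Chris's point about rev() can be checked in base R with a small sketch. The values of btsamples and skip here are arbitrary, and the names idx_original and idx_fixed are made up for the comparison.

```r
btsamples <- 5
skip      <- 1

# rev() on a data frame reverses its columns, not its rows, so a
# one-column data frame comes back in the same row order.
idx_original <- rev(data.frame(targets = seq(1, btsamples * skip, skip)))

# Reversing the vector before building the data frame flips the rows.
idx_fixed <- data.frame(targets = rev(seq(1, btsamples * skip, skip)))

print(idx_original$targets)
print(idx_fixed$targets)
```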

Max Dama said...

Thanks Chris.

Anonymous said...

Max,

Thanks for posting this interesting example. I would like to play with it using R (which I'm still learning) but I noticed the links to the R code, data and harness file are broken. Would you have the sample files available via other links?

warm regards,

Andrew

Max Dama said...

Andrew,

Please see the update I just posted at the top of the page.

Regards,
Max