Machine learning is about automatically finding patterns in data. In trading systems we are interested in finding all the patterns and then choosing the ones that usually precede price increases (or anything else you can find a way to profit from).

Innovation in the field comes from developing new pattern identification algorithms. Typically these algorithms function similarly so you can compare the pros and cons of each and create a modular system in which you can swap them out. Each algorithm makes certain assumptions which influence the applications they are appropriate for. Financial market data is highly noisy, nonstationary, and even the noise is heterogeneous. This is different than say, medical diagnostic data, where the goal is to automatically diagnose a patient based on their symptoms and health history.

In the past, I looked at neural networks and support vector machines. These are the algorithms that seemed applicable to trading. A few others which seem less powerful: linear regression, naïve bayes, and nearest neighbors. (Sadly, most big institutional factor models are based on linear regression)



Now take a step back. Consider how a stock screener works. First you say, “keep a stock if its P/E is under 15”. Then you add another condition, “only keep those with debt/equity less than 1.0”. Then you might add one more filter, “of those, only keep the stocks with market cap under $500 million”. Every ticker is passed through this list of conditions and some are rejected at each step. You hope the remaining stocks are good values and will outperform the market. The problem is that you don’t really know when you make a screen by hand. Factor model software, used by many institutional investors, will automatically determine how useful a certain criteria is by using simple linear regression and backtesting.

What’s the problem with linear regression? This is a linear relationship: y=mx+b. It’s great for modeling something like, how much will a company’s taxes increase if the profit goes up by $100? m will equal the tax rate, .35. But a linear relationship can’t precisely model all sorts of things we regularly see. Let’s say you want to model Facebook and other social networks’ profitability based on the number of hits their sites get each day. There is a well-documented network effect where going from second most popular to first is much more significant than going from fifth to fourth. A linear model could only capture part of this effect. Another situation would be estimating someone’s weight by looking at how tall they are. Weight is roughly proportional to volume which grows as the cube of height, so linear regression would fail here too.



Now I’ll explain the decision tree algorithm. I’ll build on these two thought experiments and compare it to support vector machines and neural nets.

A decision tree is just like the screening example, but more sophisticated. The screener is not a tree in the sense that it has only two outcomes: keep or reject. The decision tree keeps analyzing the “rejected” ones. So instead of saying, “throw away the ones with P/E over 15”, the decision tree would then analyze those further, possibly adding in some typical growth criteria, like “keep those with analysts’ earnings estimates of at least 20%”. But intuitively you should think of it as a sophisticated automatic screener. Wikipedia has a good little example; the graphic below is copied from there to make the note more interesting.



You might wonder how it can work automatically. It’s a product of how incredibly fast computers are. Let’s say you have three variables, P/E, debt/equity, and market cap.

1) The algorithm randomly selects 1, say P/E.
2) The algorithm loops through many values for the P/E, say {1,2,3,4,5,6,7,8,9,10,…,30,31,…,max}, where max = the maximum P/E of all stocks being considered, and determines the P/E number that best splits the data into outperformers and underperformers. (Other “loss functions” are also common) This is very computationally demanding.
3) Then the algorithm repeats (recursively) on each of the subsets determined by the P/E split. It may divide them again by P/E or maybe by debt/equity or market cap.
4) Once the tree has many branches and the data is reasonably well classified (this is an important parameter which I won’t go into depth about here), you’re done training and you can start using it for actual prediction on new data.

Here’s the simplest example of how a decision tree can introduce nonlinearity: First it says companies with a P/E <> 3 are good. So now on a number line of P/Es, it goes bad-good-bad. It turns out that linear models cannot learn this relationship, and support vector machines can, with some care by the programmer. (Our analysis can be justified by imagining that P/E < 15 means good value, but P/E < 3 means screwy accounting and bad data, so you eliminate those.)

Trees have other nice properties including they handle missing data and outliers well, they can mix categorical and numeric inputs, and the trained learner is easier to interpret.

In another note soon, I’ll explain why I decided to build a system based on a further enhancement of this decision tree concept: the “random forest” or tree bagging. Obviously decision trees already have advantages, but I’ll go into the weaknesses and further improvements and explain how I built and backtested the system.

18 comments:

Josh Ulrich said...

Max,

This is a nice and quick introduction to decision trees. I've used Breiman and Cutler's random forests to test if my ideas are predictive of the next day's returns. It's quite easy to use in R.

I look forward to your next post on this topic.

Best,
Josh

Max Dama said...

Josh,

Thanks, I've come across the random forest library for R, and I was going to use it for the system I mentioned above. In the end I decided to use Matlab because there's a good function to download the SP500. My next system will likely be in R because of the IBrokers package. I was wondering how easy it would be to just use IB's Java API in either R or Matlab.

Regards,
Max

Joshua Ulrich said...

When you say "download the S&P500" are you referring to the index or all its components?

If it's just the index, take a look at getSymbols() in the quantmod package. If it downloads all the components, I could quickly write a function to do that in R. I would just need to know how to determine all 500 symbols. :)

I don't know how easy it is to use IB's TWS with Matlab. I've had success downloading real-time ES book data with IBrokers in R. If you don't have an IB account, you can test the IBrokers package with the IB demo TWS.

Max Dama said...

Josh,

It downloads all the data over a specified period for all the symbols, taken from S&P's website I believe, and filters out the ones that have missing data. Here's the file: http://www.mathworks.com/matlabcentral/fileexchange/23569

I have a funded IB account.

Regards,
Max

Anonymous said...

Max,

It would be nice to see your examples in a free open source solution (FOSS) so your readers can all follow along...

Cordially,

-Digital Dude-

"We have very few inferior people in the world. We have lots of inferior environments. Try to enrich your environment." -Frank Loyd Wright-

Max Dama said...

Digital,

I was wondering if you were still around.

I agree, also I'm teaching a class at Berkeley on quantitative trading which I'll probably need to use R for since I can't require everyone buys Matlab. However, I've been hesitant to switch because this summer I have to use Matlab at a job so I want to stay in that mindset until it starts. I would like to recode the decision tree system in R anyways. In fact, I've started on it.

Regards,
Max

Anonymous said...

Max,

I'm still around and looking forward to more questions and answers and examples from you and your readers ;-)

Josh,

Any chance you would share some or all of your realtime ES book data with IBrokers and R as an example???

Cordially,

-Digital Dude-

“Weed out the bureaucracies that stifle cooperation and collaboration.” -Jack Welch, ex-Chairman, General Electric-

Joshua Ulrich said...

Max,

Do those Matlab functions download all the S&P 500 constituents as-of a particular day, or does it download the data for each constituent as the index was defined for each day (i.e. accounting for index changes)?


Digital Dude,

I will post an example of using this code in an actual trading rule in the next few weeks.

Make sure you have the latest version of R and IBrokers, and the TWS needs to be running.

require(IBrokers)
tws <- twsConnect()
fh <- file('ES_depth_0509.dat',open='a')
reqMktDepth(tws, twsFUT("ES","GLOBEX","200905"), file=fh )
close(fh)

Max Dama said...

Josh,

It downloads data only for the current composition.

Max

Joshua Ulrich said...

Max,

In that case, it's pretty easy. This should do the trick:

require(quantmod)
url <- "http://www2.standardandpoors.com/servlet/Satellite?pagename=spcom/page/download&sectorid=%20%3E%20%2700%27&itemname=%3E=%20%271%27&dt=07-MAY-2009&indexcode=500"

# Read in CSV from S&P website
x <- read.csv(url)

# Pull the first 10 symbols from Yahoo
symbols <- paste(x[1:10,'Symbol'],collapse=";")
getSymbols(symbols, env=.GlobalEnv)

Max Dama said...

Thanks Josh. Everything's easy in R.

The main reason was what I mentioned to Digital but slipped my mind when I first replied,
"I've been hesitant to switch because this summer I have to use Matlab at a job so I want to stay in that mindset until it starts."

One question: in the quantmod package, are the model building and testing functions completed? I've tried to use them but got an error somewhere and the examples on the website haven't been finished.

Regards,
Max

Joshua Ulrich said...

You're welcome Max. The only reason it's easy is because of Jeff Ryan's work on the quantmod package.

Please keep me in mind when you do start using R more. I'm more-than-happy to help. :-)

Sadly, the model building/testing functionality of quantmod hasn't progressed much. One big reason for this is because Jeff isn't certain how they should be implemented. Think of them more as "concepts" than firm ideas that simply need to be coded. If you have ideas for how they should be implemented, please contact Jeff.

Max Dama said...

Josh,

As I've been building more models I've noticed that there's not one perfect template for all systems, so I see how that could be an impediment. I think about it often too so I'll get in touch with him soon.

Regards,
Max

Anonymous said...

Max & Josh,

Thank you both very much! Its nice to see the collaboration ;-)

I'll be looking for your upcoming R work Max...

I've already seen this example with IBrokers Josh... It hangs for me collecting data forever... I was looking for a more useful example... As for the model building in quantmod... Just hack something together to get a working start... I'm sure you and Jeff will get plenty of feedback ;-) Rattle seems to have hung some stuff together...

Cordially,

-Digital Dude-

“We keep moving forward, opening new doors and doing new things, because we’re curious.” -Walt Disney-

Joshua Ulrich said...

DD,

I just tried it myself without issue. R "hangs" because it isn't multi-threaded (i.e. it can only do one thing at a time).

So, it's a bit tricky to gather an analyze data at the same time. That said, Jeff and I are working on an example of how to run a real-time trading strategy using IBrokers.

Anonymous said...

Josh,

My bad... It works as expected / documented then ;-) Just not what I had expected ;-(

I'll be looking forward to seeing the acquire and analyze example that you and Jeff are working on.

Again, thanks to all 8^)

Cordially,

-Digital Dude-

"Simple things should be simple. Complex things should be possible." -Alan Kay-

Ken said...

Max, check out what the R project has proposed for the Google Summer of Code:

http://www.r-project.org/soc09/ideas.html#p3

I'm anxious to read your post on the weaknesses of decision trees and your improvements. I use regression trees quite successfully in trading, and I'm curious whether you'll say anything that's new to me.

Max Dama said...

Ken,

Just an update: I'll post the next note by Thursday.

Regards,
Max