Overfitting means that you are fitting both the predictable component of the system and the noisy part. It is useful to think of the data as being generated from a predictable system with some element of noise added in. In a chemistry application this might mean that you can explain a chemical reaction with 90% accuracy, and the 10% of unexplainable variation is due to human measurement error, imprecise tools, and microscopic imperfections or dirt on the test tube. No matter how well we understand chemical processes, we cannot predict with 100% accuracy every time.

An equity trend prediction system might break a price series down the same way, into a predictable trend component and a noise component. Here we are assuming that our system is only fundamentally capable of predicting big trends and not small price movements. For example, it might be trained on EOD data, so intraday movement is unexplainable.
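
As a rough sketch of this decomposition, a synthetic price series can be built as a trend plus Gaussian noise. The starting price, drift, noise level, and the function name make_price_series are all made up for illustration; real prices are not this tidy.

```python
import random

def make_price_series(n_days, start=100.0, drift=0.5, noise_sd=2.0, seed=0):
    """Return (trend, noise, price) where price[t] = trend[t] + noise[t]."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    # Predictable part: a straight-line trend.
    trend = [start + drift * t for t in range(n_days)]
    # Unpredictable part: independent Gaussian noise around the trend.
    noise = [rng.gauss(0.0, noise_sd) for _ in range(n_days)]
    price = [tr + nz for tr, nz in zip(trend, noise)]
    return trend, noise, price

trend, noise, price = make_price_series(250)
for t in range(3):
    print(f"day {t}: trend={trend[t]:.2f} noise={noise[t]:+.2f} price={price[t]:.2f}")
```

A model that only sees price has to recover something close to trend without chasing noise.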

Overfitting occurs when our learning algorithm fits the noisy portion of the data rather than the trend. It's especially hard to deal with in a trading system because price movements seem to be primarily noise. The algorithm is basically distracted from the trend by all the random movement.

Here are two models an algorithm might learn:
We do not trust the model on the right because in the future it may wiggle anywhere.

There are a few ways to avoid using an overfitting model. The first is to realize that overfitting occurs when the algorithm "memorizes" the training sample in its parameters, so limiting the number of parameters, or more generally the model's flexibility, helps prevent the problem. In an SVM this means keeping C lower or, in the case of a polynomial kernel, keeping the degree to something reasonable (which depends on the system being modelled). In a neural network this means limiting the number of hidden layers and nodes. Each neuron's weights help learn the training data; with too many neurons, the data is essentially memorized. Ernie Chan recently wrote about a parameterless system.
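
To make the "memorizing" idea concrete without pulling in an SVM or neural net library, here is a minimal sketch that swaps in k-nearest-neighbor regression as a stand-in. In this toy the flexibility knob is k: a small k lets the model follow every data point, a large k forces it to smooth. The data, the every-other-day split, and the function names are assumptions for illustration only.

```python
import random

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_squared_error(train_x, train_y, xs, ys, k):
    """MSE of the k-NN model (fit on train_x/train_y) evaluated on xs/ys."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)
xs = list(range(200))
ys = [0.5 * x + rng.gauss(0.0, 10.0) for x in xs]  # linear trend + noise

# Toy split: even days for training, odd days held out.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

for k in (1, 5, 25):
    train_mse = mean_squared_error(train_x, train_y, train_x, train_y, k)
    test_mse = mean_squared_error(train_x, train_y, test_x, test_y, k)
    print(f"k={k:2d}  train MSE={train_mse:8.2f}  test MSE={test_mse:8.2f}")
```

With k=1 each training point is its own nearest neighbor, so the training error is exactly zero: the model has memorized the sample, noise included. The number to watch is the error on the held-out points, not the training error.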

Another way to avoid overfitting is to cross validate the model. Train it with part of your historical data and then test it on the unused part. Repeat so that all the data is used as test data in one of the iterations of cross validation. Using cross validation you can get a good idea of what to expect when you apply your model to unseen data. If the model is overfitting, performance during cross validation will be poor and possibly inconsistent. The disadvantage is that cross validation takes computation time, since it basically re-learns the model the number of times you cross-validate (e.g. 4X in the diagram above). With neural nets it can take a very long time.
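
The procedure can be sketched in plain Python as follows. This is only an illustration: the model is a simple least-squares line rather than an SVM or neural net, the folds are contiguous blocks, and the names k_fold_splits, fit_line, and cross_validate are hypothetical helpers, not from any library.

```python
import random

def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices); each sample is tested exactly once."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def cross_validate(xs, ys, n_folds=4):
    """Re-fit the model once per fold and return the test MSE of each fold."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(xs), n_folds):
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        mse = sum((a + b * xs[i] - ys[i]) ** 2 for i in test_idx) / len(test_idx)
        scores.append(mse)
    return scores

rng = random.Random(0)
xs = list(range(100))
ys = [0.5 * x + rng.gauss(0.0, 5.0) for x in xs]  # made-up trend + noise
print([round(s, 2) for s in cross_validate(xs, ys, n_folds=4)])
```

Note that the model is fit n_folds times, which is where the extra computation comes from. Fold scores that are poor, or that differ wildly from one fold to the next, are the warning sign.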

There are multiple ways to think about overfitting and they each help intuition. I feel like I should say more about splitting the data into noisy and explainable components but I hope it will be clear enough already. Just realize that what is noise to one model may be explainable by another and vice-versa. Combine two models and you may be able to expand the explainable portion and shrink the noisy part. But that's another topic.

Anyway please comment if something is unclear or you have something to add. School resumes in a week so my schedule will be impacted soon.

6 comments:

ehsan said...

Hi,
Thanks for this explanation. I just wonder what should be done if overfitting occurs during cross-validation. As you suggested, we can divide the data set into, say, 10 sets and do the cross validation. However, it is not clear to me what to do if overfitting occurs each time the model is trained within the cross-validation procedure.

Max Dama said...

Ehsan,

If you are getting overfitting during cross validation then you need to decrease the model's ability to memorize training data. For example with a neural net you would decrease the number of training generations and/or the number of hidden nodes. And with an SVM you would lower the value of the parameter C (here's more info if you aren't familiar). Tell me if this answers your question and maybe I can give more info if you tell me which learner you are testing.

Regards,
Max

ehsan said...

Hi Max,
I am using a feed-forward backpropagation neural network with 10-fold cross validation. My question is:
I see overfitting during cross-validation, especially when the number of hidden layer neurons increases.
Should I use a separate set for validation? If yes, how should I implement this, as I believe I should use 9 sets for training and 1 set for testing?

Best regards,
Ehsan

Max Dama said...

Ehsan,

I get the impression you are not familiar with the lingo so my answer may not exactly answer your question.

Cross validation cannot result in overfitting; it will reveal it by showing poor performance on the test data. So when you ask if you should use a separate set for validation, in fact this is what cross validation already does.

If anything you should decrease the number of hidden layer neurons until cross validation results in good performance. If cross validation gives good results, then the model is not overfitting.

Regards,
Max

ehsan said...

Hey Max,
Do you know anything about the bootstrap technique for neural networks? If so, I would like to ask some questions.

Cheers,
Ehsan

Max Dama said...

Ehsan,

Sorry I know very little about bootstrapping.

Regards,
Max