The following is a very interesting problem copied from p.132 of "Elements of Information Theory" by Cover and Thomas. I've been reading information theory recently, and I'm finding it very fascinating.

Example 6.3.1 (Red and Black):
In this example, cards replace horses and the outcomes become more predictable as time goes on.

Consider the case of betting on the color of the next card in a deck of
26 red and 26 black cards. Bets are placed on whether the next card will
be red or black, as we go through the deck. We also assume the game
pays a-for-l, that is, the gambler gets back twice what he bets on the
right color. These are fair odds if red and black are equally probable.

We consider two alternative betting schemes:
1. If we bet sequentially, we can calculate the conditional probability of the next card and bet proportionally. Thus we should bet on (red, black) for the first card, and for the second card, if the first card is black, etc.
2. Alternatively, we can bet on the entire sequence of 52 cards at once. There are possible sequences of 26 red and 26 black cards, all of them equally likely. Thus proportional betting implies that we put of our money on each of these sequences and let each bet “ride.”

We will argue that these procedures are equivalent. For example, half the sequences of 52 cards start with red, and so the proportion of money bet on sequences that start with red in scheme 2 is also one half, agreeing with the proportion used in the first scheme. In general, we can verify that betting of the money on each of the possible outcomes will at each stage give bets that are proportional to the probability of red and black at that stage. Since we bet of the wealth on each possible outtut sequence, and a bet on a sequence increases wealth by a factor of 2^52 on the observed sequence and 0 on all the others, the resulting wealth is:


Rather interestingly, the return does not depend on the actual sequence. This is like the AEP in that the return is the same for all sequences. All sequences are typical in this sense.

It's amazing that the strategy which bets everything at the start can beat one which bets after seeing which cards are no longer in the deck. The calculation of resulting wealth of the 2nd betting scheme, which results in 9.08, is transparent but I didn't see how it could be exactly equal to the resulting wealth of the other betting scheme.

Here is an excel spreadsheet I made to simulate the first betting scheme: excel 2007 | older 2003, (functions in 2003 version may not work). On the first sheet, it randomly generates a sequence of draws from the deck, according to actual probabilities, and calculates the wealth you would have after the deck runs out. You can see the resulting wealth at the bottom (cell H54) of the "wealth" column- it's always 9.08. It's very surprising to me and I don't fully understand how the two results can match. I'm new to the field though.

Tell me if you have any insight or a similar example that makes the paradox less paradoxical.

I was included in a recent article on the future of quantitative finance:

Future of Wall Street: Rise of the Machines


Sam Gustin, the author, was easy to work with.

Before publication I only knew about my interview but the article digs up some interesting things that I had never heard of, one example:

Legend Advisory, a Florida-based division of mutual fund giant Wadell and Reed, manages over $1 billion using a computer model called the Asset Allocation Neural Network, or AANN. (The company pronounces the model "Anne" and refers to it as "she").
Of course there are the Frankensteinesque mad scientist quotes too-
"AANN is a human being without a heart," said Shashi Mehrotra, Legend Advisory's chief investment officer.
Anyway it's a short read. I was in part II, there's also a part I. It also appears here on Wired.com since it's some sort of cross-promoting tech/business section.

Survivorship bias is a simple idea but it's a good time to bring it up after the previous notes on overfitting.

Survivorship bias exaggerates long-only strategies' performance during backtesting, especially those with a long holding period (years).

Most data sets will not include companies that went bankrupt or were delisted because their share prices went too low. Survivorship bias describes how the companies that we see today are those that did the best in the past. The failures have disappeared.

For ex. if you take ten years of data for all stocks currently included in the S&P 500 the performance will be better than the actual ten year return of the S&P 500.

Another ex. is if your strategy only applies to one stock, then the model you build of it based on historical data will not include the possibility of bankruptcy. Bankruptcy from the perspective of a time series is a floor on the semi-random walk. Maybe the model predicts the stock will drop by $1 when it is at $0.25, not understanding that it can't go below $0.00. This is not the usual usage of survivorship bias as I explained above but it can be generalized.

I know some professional system development & backtesting products such as MarketQA use complete data sets to eliminate potential survivorship bias. For a retail trader all you can do is keep it in mind. Fortunately survivorship bias's influence is not too high on most algorithmic systems which are usually high-frequency or at least hold for <1 week. Please share if you have a simple solution to this problem or if you have a quantitative example of this bias from previous work, preferably illustrated by a chart. Unfortunately this note was mostly generalizations and words.

Previously I wrote a note on overfitting during training. Now after reading that, let's imagine a normal scenario-

You're trying to find a strategy with an edge and you're considering a 3 types: a moving average crossover momentum strategy, an RSI-threshold strategy, and a buy-after-gap-down strategy. Being a modern quant trader, you know that regular, automatic parameter optimization is the only way to make an adaptive, fully automated system. The goal of system development, of course, is to determine which strategy is best.

After reading the previous note on overfitting you're smart enough to have split your data into two sets, one for training and one for testing.

The training set is used with crossvalidation to find the best parameters for the strategy. You are [separately] having it automatically optimize the two moving average lengths, the RSI period, and the minimum downward gap threshold. Those are the obvious parameters. Then the out-of-sample test set is used to measure the performance of each strategy, generating PnL, max drawdown, sharpe etc.

Following this, you compare the results and based on the PnL curve and careful scrutiny, you pick the best system.


What was the problem in the above? Considering three strategies introduced a hidden parameter that slipped past crossvalidation. Go back and imagine a bigger system that has a portfolio of strategies, MA, RSI, and gap-based. These are numbered 1,2,3. So this system has an extra parameter s={1,2,3}. It also has the parameters for each strategy as mentioned above. When this system reaches the crossvalidation loop, 1 final result pops out. Previously we had 3 results and then we chose the best.

This is equivalent to overfitting on the training data. Convince yourself of this fact. They appear different because of the different purposes/names we have assigned the 'training' and 'test' sets. In fact, picking a model at the end was equivalent to training. Now generalize how we showed the equivalence of overfitting on the training and test sets to cases where the system follows a more complex adaptive strategy, with layers on layers of auto-optimization validation loops.

Test-set overfitting is typically worse than the above because in most cases you will be considering more than 3 strategies. First example: you are haphazardly searching for some edge by trying any kind of strategy you can imagine. Second example (more insidious): you are testing different kernels on an SVM. You will think that you have found that one kernel is more applicable to the domain of financial forcasting, but actually it's an illusion. Ignore 'intrinsic' meaning and just conceptualize any options as a parameter list (unfortunately combinatorally large).

---
This part is just me thinking of ideas and writing. It's a bit off the deep end: you should stop here unless the top part sounded like old news and was 100% intuitive on the first read-through. ---

Hypothetically speaking, if the system had been trained and tested on an infinite amount of data overfitting would not be a problem (as long as the number of parameters is finite (??)). And I don't mean including all time periods (ex. take every other period- still infinite but not including all time and overfitting would not be a problem). Unless you test on all the data that happens in the future, and not just your out-of-sample set (obviously impossible), you risk fitting the expression of noise that is specific to that set. You will think you have found a pattern in the stock market, when really you have found a pattern in the noise. All finite sets of numbers have patterns, for example the list of all the numbers repeated once. If this is the only pattern, and no sequence repeats more than once, then you will not suffer from too much overfitting even if you follow a flawed procedure as described above. The noise will only truly become noisy once it is infinitely long and there are no more persistent patterns. 'Until that point' it will not be perfect noise and you must beware around it.

When you test on anything less than infinite data, you risk selecting the fateful subset of the data that your system happens to predict perfectly. Fortunately your odds of selecting a highly patterned set from the noise decrease exponentially as you use a larger test set ( 1 / k^n ). Just remember that the possibility exists in the universe that this was all by chance. [Maybe the laws of physics are false and actually every human observation till now has simply happened be perfectly correlated with some perfectly meaningless, unrelated formulas Newton happened upon.]
------

If you can't recognize all incarnations of overfitting, you will not be able to accurately test a self-adapting system. You can't even get to the point of looking for an edge of this type because you don't know how to see.

I would like to see research going more in depth on overfitting, beyond what I've mentioned so please leave a comment if you know of a source.

I've been in Mumbai since January 3rd and I'm leaving on the 16th. If anyone is interested in meeting for lunch or dinner, leave a comment or email me- maxdama at berkeley.edu

When you are trying to find the correct equation to model and predict financial data, you will always have some error. If you are using regression to predict the next period's return, you will probably measure the accuracy by mean squared error (MSE).

Error can be broken down into two components, and these two components can be interpreted as the sources of the error. Error = Bias + Variance

1) Bias is the error incurred by the expected prediction relative to the optimal/true prediction (Bias = E[y]-f(x), where f(x) is the true prediction and y is the approximation).
For example, using a 1st degree polynomial (a line) to approximate a 2nd degree polynomial (a parabola) will intrinsically have some bias error because a line cannot match a polynomial at all points.

2) Variance is the average error compared to expected prediction (Var = E[(y-E[y])^2]).
For example, if you only have 2 two sample data points, the function class of all 1st degree polynomials (ax+b) containing those two points will have no variance because only one line can go through the two points. However, the function class of all 2nd degree polynomials (ax2+bx+c) will have higher variance because there are infinite parabolas that can be strung through two points. Therefore you will have higher generalization error when you test on out-of-sample data. Here's a picture of both examples, focus on the 1st order and 50th order, clearly both will have high prediction error:

Now that I've covered the intuition, here's the derivation of Bias and Variance from MSE, working backwards, with justifications for each step (click to enlarge):

The bias-variance tradeoff is a fundamental, intrinsic challenge for machine learning. If you are using a neural network, you will have to deal with very high variance; more nodes = more variance + less bias. If you are using linear regression, you will have to accept very high bias.

I have glossed over noisy data, which makes the decomposition MSE = Bias + Var + Noise. However, I think it's more interesting to imagine that noise doesn't exist, and actually we just don't have a good enough model yet. For example, you could call a coin flip random, but I think it's deterministic based on launch velocity, air resistance, wind, etc and we just don't have the capability to measure and predict these complicating factors. That's a philosophical question. Of course treating un-model-able factors as noise is a very useful simplifying assumption. Google "bias variance" for further info and more on adding in a noise factor if you're interested.

I think next I will do a short series on the three sources of overfitting/data snooping procedural flaws: training on the test data, survivorship bias, and overfitting the out-of sample test set. The last is the most challenging to watch out for and the least well-known.

Please leave comments or corrections on bias/variance or anything else.