September 6, 2013

Momentum Back Testing

In 1937, Cowles and Jones published the first study showing that relative strength price momentum leads to abnormally high future returns. These findings are just as valid today as they were 75 years ago. Academics have been very diligent in studying momentum further, since it flies in the face of the efficient market hypothesis (EMH). EMH says you cannot beat the market using publicly available information. Hundreds of subsequent tests over the past 20 years have confirmed the veracity of momentum investing. Momentum is slowly gaining the attention it deserves as the investment world's "premier market anomaly" that is "beyond suspicion" (in the words of Fama & French).

Last week the MyPlanIQ blog published an interview with me. They asked about my work with dual momentum. I did not know at the time that MyPlanIQ intended to use my interview to promote their Tactical Asset Allocation (TAA) model. Since the details of TAA are unknown and proprietary, I cannot comment on the worthiness of their model. What I can say is that I have nothing to do with any of MyPlanIQ's models and do not endorse them.

I have also noticed other advisory services, as well as some managed investment programs, that look like they have been inspired by my momentum research. I would like to make it clear that I am not involved with, nor do I endorse, any outside services. 

There is a natural tendency to take others' research, make a few changes to it, and hope you have created a better mousetrap. This often does not work out as expected. Here is why. High quality research is rigorous. Serious researchers subject their work to peer review and statistical significance testing. They disclose data sources and testing logic so other researchers can replicate their results. For high quality research, data is king. (I have come up with a saying: "One can never have too much money, good looks, or data.") Conscientious researchers are always trying to get as much data as they can for testing purposes. This reduces the chance of overfitting the data.

Fortunately, there has been a large amount of data available for back testing momentum. Academic researchers have consistently shown that momentum works across most markets and on out-of-sample data. Absolute momentum (trend following) has worked back to the turn of the century[i]. Relative strength momentum has worked all the way back to 1801, the beginning of the previous century![ii] This is important for two reasons. First, it leads to greater confidence in the results. Worst-case scenarios, in particular, are highly dependent on the amount of past data that is available. Second, with plenty of data, one can look at segments of the data to see how consistent and stable the results have been over time. We want to see that our overall results cover a wide range of market conditions and are not dependent on just a few good periods of short-term performance. We also want to make sure our results have held up well over time and are still strong. This kind of robustness testing can reduce the chance of data snooping bias.
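Here is a minimal sketch of that kind of segment check, in Python. The function names and the sample returns are made up for illustration; a real test would use decades of monthly index returns. It splits a return series into consecutive segments and reports the average return and the worst drawdown of each one, so you can see whether the results lean on a single good stretch.

```python
from statistics import mean

def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst

def subperiod_report(returns, n_segments):
    """Split monthly returns into consecutive segments and summarize each one."""
    size = len(returns) // n_segments
    report = []
    for i in range(n_segments):
        # The last segment absorbs any leftover months.
        end = (i + 1) * size if i < n_segments - 1 else len(returns)
        chunk = returns[i * size:end]
        report.append({"mean": mean(chunk), "max_drawdown": max_drawdown(chunk)})
    return report

# Made-up monthly returns, purely for illustration.
monthly = [0.02, -0.01, 0.03, 0.01, -0.04, 0.02, 0.01, 0.00, -0.02, 0.03, 0.02, -0.01]
for i, seg in enumerate(subperiod_report(monthly, 3), start=1):
    print(f"segment {i}: mean={seg['mean']:.4f}, max drawdown={seg['max_drawdown']:.4f}")
```

With only a handful of months per segment, as in this toy input, the per-segment numbers are mostly noise. That is the point of wanting forty years of data rather than ten.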

Another test of robustness is to look at other markets and see if your results hold up there as well. To do this in a meaningful way, you also need plenty of past data. This is why I go to the trouble of using indices instead of ETFs for my back testing. Whenever possible, I test my strategies using index data back to 1972, which is the beginning of fixed income index data. Data on a reasonable number of ETFs only goes back to 2003. There is a big difference between using forty years and using ten years of data when you are testing strategies based on monthly price changes. In fact, one should be suspicious of any conclusions derived from using only ten or fewer years of data when evaluating intermediate-term strategies like momentum. Yet that is precisely how most practitioners try to tweak and "improve" on my results, or on what they find in other momentum research papers. When working with monthly returns, ten or fifteen years is not much time. Results can easily be influenced by chance or happenstance, especially if there is not a convincing logical basis for your conclusions. What we can count on is that simple momentum works well across many different markets using a 3- to 12-month formation period. Anything else should be subject to rigorous and thorough evaluation that includes as many years as possible of past performance data, confirmation of your results in additional markets, parameter sensitivity and other robustness tests, drawdown analysis, etc.
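To make "formation period" and "parameter sensitivity" concrete, here is a rough Python sketch. It is not my model or anyone else's. The asset names, the prices, and the rule of moving to a safe asset when the best trailing return is not positive are all simplifying assumptions made for illustration. It computes each asset's return over the formation period, picks the strongest, and repeats that for several lookbacks.

```python
def formation_return(prices, lookback):
    """Total return over the last `lookback` months of a monthly price series."""
    if len(prices) <= lookback:
        raise ValueError("need more than `lookback` monthly prices")
    return prices[-1] / prices[-1 - lookback] - 1.0

def momentum_pick(price_history, safe_asset, lookback):
    """Pick the asset with the strongest trailing return (relative momentum),
    falling back to `safe_asset` if that return is not positive
    (a simplified absolute-momentum filter, assumed for this sketch)."""
    trailing = {name: formation_return(p, lookback) for name, p in price_history.items()}
    best = max(trailing, key=trailing.get)
    return best if trailing[best] > 0 else safe_asset

# Hypothetical month-end index levels (13 points = 12 monthly changes).
price_history = {
    "index_a": [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],
    "index_b": [100, 104, 108, 112, 116, 120, 122, 123, 124, 124, 123, 122, 121],
}

# Parameter sensitivity: does the choice change with the formation period?
for lookback in (3, 6, 9, 12):
    print(lookback, momentum_pick(price_history, "bonds", lookback))
```

In this toy data the pick depends on the lookback. A rule whose backtested results swing sharply when the formation period moves by a month or two deserves extra scrutiny before anyone trusts the best-looking setting.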

There is another problem related to paucity of data, and that is data snooping (data dredging, data fitting) bias. Data snooping is pervasive among practitioners, and not just with respect to momentum. It can happen when you add a new parameter to a model or re-optimize existing parameters. Extensive data dredging and model overfitting can lead to spurious results and regression to the mean. A statistician friend calls this the Grim Reaper, because it can take away all or most of your expected future returns.

Data snooping often uses the same data more than once. Every data set contains patterns due entirely to chance. When you perform a large number of tests, some of them may produce false results that appear to be good. When the data itself suggests your hypotheses, it is impossible to tell whether the results are just chance patterns. If you do extensive data snooping, your evaluation criteria need to be much more stringent.
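A small simulation shows how this happens. The Python sketch below generates 500 "strategies" whose monthly returns are pure noise with no edge at all, then reports the best ten-year Sharpe ratio among them. Everything in it is an assumption chosen for illustration: the 4% monthly volatility, the number of strategies, and the random seed.

```python
import random
from statistics import mean, stdev

def random_strategy(rng, n_months):
    """Monthly returns with zero true edge: mean 0, standard deviation 4%."""
    return [rng.gauss(0.0, 0.04) for _ in range(n_months)]

def annualized_sharpe(returns):
    """Annualized mean/volatility ratio of monthly returns (risk-free rate ignored)."""
    return mean(returns) / stdev(returns) * 12 ** 0.5

rng = random.Random(0)               # fixed seed so the run is repeatable
n_strategies, n_months = 500, 120    # 500 candidates, ten years of months each

sharpes = [annualized_sharpe(random_strategy(rng, n_months))
           for _ in range(n_strategies)]

print(f"average Sharpe across all strategies: {mean(sharpes):.2f}")
print(f"best Sharpe found by searching:       {max(sharpes):.2f}")
```

The average comes out near zero, as it should, while the best of the 500 looks like a respectable strategy. Nothing about that winner is real. It is simply the luckiest draw, which is what a large parameter search on ten years of data tends to hand you.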

Some people think that splitting a modest amount of data into a testing set and a holdout set for cross validation will take care of this problem. However, that is not necessarily true. For example, you could split your data in half, then rank your strategies based on looking at only the first half of the data. Going down your list of strategies, you might find one that looks decent in both halves of the data. But it is still likely this is just due to chance, and you only have half as much data to use for back testing. Your odds go up if you use carefully constructed randomized out-of-sample tests. Otherwise, as the saying goes, "If you torture your data long enough, it will confess to anything."
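The split-half example can be simulated too. In this Python sketch every strategy is pure noise, so any strategy that "passes" in both halves does so by luck alone. The thresholds, sample sizes, and seed are assumptions picked for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)          # fixed seed so the run is repeatable
N_STRATEGIES, HALF = 500, 60    # 500 candidates, 60 months in each half
DECENT = 0.005                  # "decent" = mean monthly return above 0.5%

def noise(n):
    """Returns with no true edge (mean 0, 4% monthly standard deviation)."""
    return [rng.gauss(0.0, 0.04) for _ in range(n)]

# Each strategy is a (first half, second half) pair of pure noise.
strategies = [(noise(HALF), noise(HALF)) for _ in range(N_STRATEGIES)]

# Rank on the first half only, then go down the list until one also
# looks decent on the second half.
ranked = sorted(strategies, key=lambda s: mean(s[0]), reverse=True)
survivor = next(
    (s for s in ranked if mean(s[0]) > DECENT and mean(s[1]) > DECENT),
    None,
)

if survivor is not None:
    print(f"first half mean:  {mean(survivor[0]):.4f}")
    print(f"second half mean: {mean(survivor[1]):.4f}")
    # Only possible in a simulation: draw genuinely fresh data for the winner.
    print(f"fresh data mean:  {mean(noise(HALF)):.4f}")
```

The survivor clears the bar in both halves even though its true expected return is zero by construction, so the holdout half has confirmed nothing. Once you search through enough candidates, the second half stops being out-of-sample in any meaningful sense.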

About 20 years ago there was an infamous study that showed 99% of the return of the S&P 500 index could be explained by a multiple regression on butter production in Bangladesh, US cheese production, and the number of sheep in the US and Bangladesh. The author of that paper still gets inquiries asking where to get data on Bangladesh butter production! More recently, a serious research paper (believe it or not) called "Exact Prediction of S&P 500 Returns" links future stock returns to the number of nine-year-old children in the US.

I recently came across someone offering momentum signals based on the same methodology and a very similar portfolio to the one in my first momentum paper. He waterboarded the formation period parameters until the model showed an annual return of 41% over the past (guess how long) ten years. Further torturing the model's portfolio composition, he was able to come up with, and now promotes, annual returns of 73% over the past three years! If anyone thinks momentum (or anything else) can realistically provide annual returns of 73%, then I have a lovely bridge I would like to sell you.

If you cannot avoid significant data snooping bias, there is a False Discovery Rate test you can perform that will tell you if you have efficient criteria for model selection. Without something like this, you may be data snooping your way to nowhere.  
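As one illustration of the idea, here is a Python sketch of the Benjamini-Hochberg procedure, a widely used way to control the false discovery rate across many tests. It is an assumption that this particular procedure fits your situation. It is not necessarily the test referred to above, and it assumes the tests are independent or positively dependent. The p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected while controlling the
    false discovery rate at level q (Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:cutoff])

# Hypothetical p-values from testing ten strategy variations.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, q=0.05))  # -> [0, 1]
```

Five of these ten variations have p-values below 0.05 when judged one at a time, but only the first two survive once the number of tests is taken into account.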

[Image: Data Snoopy]

[i] Moskowitz, Tobias J., Yao Hua Ooi, and Lasse Heje Pedersen, 2012, "Time Series Momentum," Journal of Financial Economics 104, 228-250
[ii] Geczy, Christopher and Mikhail Samonov, 2013, "212 Years of Price Momentum (The World's Longest Backtest: 1801-2012)," working paper