Hello, I am Darwin, and today I want to talk about the limitations that Walk Forward Analysis suffers from. This is my third article, so if you do not know how WFA works, please read the other two first (you can find them here on the forum).
The target audience of this article is everyone who deals with Expert Advisors and backtesting / Walk Forward Analysis.
Some of you might already have seen a few of my posts where I talk about the research I am doing in the field of trading system analysis (in the course of writing a meta-algorithm that can build, analyse and trade strategies on its own). The goal is to write an algorithm so powerful that it can take any EA and, through in-depth analysis, tell you how and when to trade it in order to make a profit, no matter how good or bad the underlying EA is.
So here is a new article in which I would like to share some insights I gained in the process of writing this algorithm (DATFRA, Darwin's Algorithmic Trading Framework).
Well, let's begin. My first concern is that the design of Walk Forward Analysis is, by its nature, unrewarding and not the kind of analysis a trader actually wants.
I also claim that the results of a WFA are more or less random, and that if a system works well after a successful WFA, it is not because the test was successful, but because the trader designing the system did a good job.
In this article I do not yet want to show how these problems can be solved; I just want to demonstrate that they exist. In my next article I will explain how I think all of this can be solved in an elegant way.
The fundamental design problem
Walk Forward Analysis is designed to evaluate a trading construct you give to it.
This construct consists of the following parts (see the code sketch after this list):
- Trading system (e.g. an Expert Advisor)
- Market/timeframe (e.g. “EURUSD / H4”)
- The system’s parameter ranges (e.g. “Moving Average period from 50-150”)
- Optimisation (in-sample) timespan (e.g. “optimise on 2 years of data”)
- Forward trading (out-of-sample) timespan (e.g. “forward trade for one month”)
- Preferred characteristic (e.g. “forward trade the candidate with the highest profit”)
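To make this concrete, here is a minimal sketch of such a construct as a plain data structure (Python; all field names are mine and purely illustrative, not taken from any real WFA tool):

```python
from dataclasses import dataclass

@dataclass
class TradingConstruct:
    """Everything a trader must pre-determine before a WFA can even start.
    All names here are illustrative, not part of any real WFA tool."""
    system: str               # the trading system, e.g. an Expert Advisor
    market_timeframe: str     # e.g. "EURUSD / H4"
    parameter_ranges: dict    # e.g. {"ma_period": range(50, 151)}
    in_sample_span: str       # optimisation window, e.g. "2 years"
    out_of_sample_span: str   # forward trading window, e.g. "1 month"
    selection_criterion: str  # e.g. "highest profit"

# Example instance matching the list above:
construct = TradingConstruct(
    system="MyExpertAdvisor",
    market_timeframe="EURUSD / H4",
    parameter_ranges={"ma_period": range(50, 151)},
    in_sample_span="2 years",
    out_of_sample_span="1 month",
    selection_criterion="highest profit",
)
```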
All of this has to be pre-determined by the trader, out of intuition and not based on actual facts and data. But these are the most important decisions of all; how is one supposed to “guess” them?!
And then, WFA will only be able to tell you whether this construct would have worked in the past or not; that's it.
So in order to find the best trading construct, you have to use trial and error and repeat the WFA multiple times. Step by step, this leads to the worst case: your “unseen” out-of-sample data slowly becomes “known” in-sample data, and the whole advantage of WFA over backtesting fades away completely.
These design-related problems already show that WFA cannot be the end of the road in terms of system analysis.
In a perfect world, you would give the analysis algorithm only the trading system and the market/timeframe, and no other parameters. The algorithm should then tell you the best choices for all the other parts of the trading construct, based on data and facts, not the other way round.
Side note: it should NOT just tell you how to trade your systems; it should also give you the possibility to look into a system's characteristics on your own. You should never be forced to trust any algorithm without the possibility to check its findings!
This is very, very important. Evaluating a single trading construct is of little value, but it is a gamechanger if you can look into your strategies in a way that lets you just “see” how they work and which trading construct will work best (more on this in my next article).
Even worse: unreliable results because of missing data
OK, so even if a trader could come up with a good trading construct out of intuition and knowledge, WFA would still be a more or less random affair. But first, let's make a rough calculation:
An example trading system and a rough estimate of its parameter-space size
Consider a system that enters trades based on a Moving Average crossover and an RSI indicator, and exits them using a different Moving Average crossover. It has at least 5 parameters (2x2 MA periods + the RSI threshold), and 6 if you also count the StopLoss.
Let's say the “fast” Moving Average periods can be 10-50 and the “slow” ones 50-250, the RSI threshold can be 1-100 and the StopLoss 50-150 pips (this is not a real system, just an example!).
So this system can already be traded in 40 * 200 * 40 * 200 * 100 * 100 different ways. That is 640 billion (640,000,000,000), which is quite a huge number.
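If you want to verify that arithmetic yourself, here is a quick sketch using the (rounded) number of values per parameter from the example above:

```python
# Rounded count of values per parameter, as in the text above:
fast_ma = 40      # fast MA period 10-50
slow_ma = 200     # slow MA period 50-250
rsi = 100         # RSI threshold 1-100
stop_loss = 100   # StopLoss 50-150 pips

# Two fast MAs and two slow MAs (entry crossover + exit crossover):
total = fast_ma * slow_ma * fast_ma * slow_ma * rsi * stop_loss
print(f"{total:,}")  # 640,000,000,000
```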
One might question my exact example strategy, but one cannot question the millions or billions of possible parameter combinations, even for small systems.
Thankfully, since a lot of these parameter combinations behave very similarly, we do not need to evaluate them all; but we do need at least a meaningful sample, like a few hundred thousand or a few million.
So keep this huge amount in mind, even for small systems, because with every new dimension of our optimisation problem's solution space (every new parameter), the number of possible parameter combinations grows exponentially.
Walk Forward Analysis - missed data during optimisation
OK, now let's look at the first step of WFA, and the first problem: missed data because of inefficient algorithm design and computing-time concerns.
During the optimisation step of WFA, the algorithm should, in a perfect world, evaluate all 640 billion combinations in order to determine which of them work best. Of course this is not possible, but a “meaningful” sample (let's say 500,000) would be both feasible and necessary if we want to look at the “real” picture.
The problem is that, due to the limitations of WFA algorithms, the optimisation has to be done in every single Walk Forward window.
Let's say we do a WFA on 10 years of data and our forward trading timespan is 2 weeks: that makes roughly 240 Walk Forward windows. 500,000 tested parameter combinations per window would then require 120,000,000 single simulations.
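As a quick sanity check of these numbers (a sketch; I am assuming roughly 24 two-week windows per year here, which is the rounding behind the 240 figure):

```python
# Rough cost of a "meaningful" WFA, as estimated above.
years = 10
windows_per_year = 24   # assumption: ~ two 2-week forward windows per month
sample_size = 500_000   # parameter combinations tested per window

wf_windows = years * windows_per_year          # 240
total_simulations = wf_windows * sample_size   # 120,000,000
print(wf_windows, f"{total_simulations:,}")
```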
And then remember that WFA relies on a trial-and-error principle, so you will most likely have to do all of this several times.
You see? Evaluating the “real” picture would take very, very long. Therefore most WFA implementations are forced to evaluate only a heavily cropped fraction of the actual parameter space, because it is simply not possible to evaluate the whole space (or a meaningful sample of it) in a reasonably short timespan when the optimisation has to be repeated in every single WF window.
This means WFA most likely does not evaluate 500,000 parameter combinations per window, but only 10,000 or 50,000 or thereabouts. So we already lose something like 90% of all data in this step.
This is a problem that could be solved if the trader had unlimited time for his or her analysis (which is not likely, especially given the trial-and-error method), or with a more efficient design of these algorithms. Nevertheless, in practice, this problem is ever-present.
For comparison: DATFRA, my private research project, only has to do one single simulation per parameter combination, no matter how many WF windows it analyses. In the above example, that alone would decrease the computing time by a factor of 240.
Parenthesis: what kind of data do we look at when analysing trading systems? What is a “datapoint”?
I will talk about “datapoints” and “data” quite frequently in this article and in my posts, so here is an explanation.
When analysing systems, it is always about a trinity of information. Remember how WFA works:
So a datapoint, of which one is generated per Walk Forward window, consists of (see the code sketch after this list):
- The performance in the RED optimisation window
- The performance in the GREEN forward trading window
- The parameter combination used for this specific test
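Expressed in code, a datapoint is nothing more than this (a minimal sketch; the field names are mine, just for illustration):

```python
from dataclasses import dataclass

@dataclass
class Datapoint:
    """One WFA datapoint: the trinity described above."""
    parameters: dict            # the parameter combination used for this test
    in_sample_perf: float       # performance in the (red) optimisation window
    out_of_sample_perf: float   # performance in the (green) forward window
```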
So, in our example, a WFA would generate 240 of them, whereas 120 million (500,000 * 240) would be possible for our example system. That alone should already give you a headache.
Walk Forward Analysis - tons of missed data during forward trading
OK, now let's look at the second step of WFA, and the second problem: missed data because of flawed algorithm design and computing-time concerns.
Now remember: a meaningful sample of our trading system's parameter space would be 500,000 combinations, and we have 240 WF windows. That makes a total of 120,000,000 optimisation candidates. And out of this huge amount, a WFA algorithm takes only the very best per window: 240 in this example.
That is 0.0002% (240 / 120,000,000) of the total number of datapoints we could use to describe and analyse this system and its ability to produce good forward trading results based on good optimisation results.
And then WFA takes these few datapoints and claims to give a somewhat realistic view of a trading system's performance and robustness.
That's nonsense! You would not judge a picture's colour by looking at one pixel either, would you?
A word about fluctuations, and why the “very best” parameter combination is not meaningful
You could argue that it is not important whether we forward trade all 500,000 candidates per window, because we are only interested in the top performers, as they are the ones we trade in reality.
Well, this argument would only work if:
- We ignored the ~90% of data lost in the optimisation step
- The very best candidate were meaningful, i.e. if all the candidates following it (the next 10 or 20 or 50, which is not much compared to 500,000) behaved in much the same way.
But reality is different: the performance of the top candidates per window fluctuates considerably, and taking the “very best” therefore leads to more or less random outcomes.
Experiment 1
Here are some examples. I plotted the forward trading performance of the best candidate (left) and the next 4 candidates of some random strategies I created and evaluated with DATFRA. Most of the analysed WF windows looked like these:
These were just a few examples to illustrate my point; I could show hundreds or thousands of them.
So, for the real picture, you would AT LEAST need to evaluate a few hundred of the top candidates, not just one, as a single candidate does not show the “real” picture. Its performance is more or less random!
A perfect analysis algorithm would evaluate every single candidate that made at least $1 of profit during optimisation. That would give the real picture and most likely 1,000 or 10,000 times as many datapoints as a WFA provides.
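The difference between the two approaches fits in a few lines. Here is a minimal sketch (hypothetical helper functions, assuming the in-sample and out-of-sample performance per candidate is already known):

```python
def classic_wfa_pick(candidates):
    """Classic WFA: keep only the single best in-sample candidate per window."""
    return max(candidates, key=lambda c: c["in_sample_perf"])

def broad_evaluation(candidates):
    """What I argue for: keep every candidate that was profitable in-sample,
    so the whole out-of-sample distribution becomes visible."""
    return [c for c in candidates if c["in_sample_perf"] > 0]

# Example: three candidates in one WF window
window = [
    {"params": {"ma": 60}, "in_sample_perf": 1200.0, "out_of_sample_perf": -80.0},
    {"params": {"ma": 75}, "in_sample_perf": 1150.0, "out_of_sample_perf": 310.0},
    {"params": {"ma": 90}, "in_sample_perf": -40.0,  "out_of_sample_perf": 50.0},
]
print(classic_wfa_pick(window)["out_of_sample_perf"])                # -80.0, one noisy point
print([c["out_of_sample_perf"] for c in broad_evaluation(window)])  # [-80.0, 310.0]
```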
Experiment 2
Here are some more examples. This time I plotted the overall WFE (red) and the WFE of the single windows (green) of some random strategies I created and evaluated with DATFRA.
WFE (Walk Forward Efficiency) is a measure that compares in-sample and out-of-sample performance and is used as THE statistic for system robustness in WFA (Google it if you want to know more about it).
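For reference, here is how the per-window WFE is usually computed. Note that I am assuming the common annualised-ratio definition here, so treat the exact formula as an assumption:

```python
def wfe(in_sample_profit, in_sample_days, out_of_sample_profit, out_of_sample_days):
    """Walk Forward Efficiency: annualised out-of-sample profit divided by
    annualised in-sample profit (the common definition; a WFE near 1.0 means
    the forward results kept pace with the optimisation results)."""
    annualised_is = in_sample_profit / in_sample_days * 365
    annualised_oos = out_of_sample_profit / out_of_sample_days * 365
    return annualised_oos / annualised_is

# Example: 2-year optimisation window, 2-week forward window
print(wfe(in_sample_profit=5000, in_sample_days=730,
          out_of_sample_profit=70, out_of_sample_days=14))  # 0.73
```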
This clearly shows the fluctuating nature of the results a WFA generates, and that the end result does not really tell you much about your expected live trading performance.
Btw: to keep the plot scale within limits, I mapped all points > 2.5 to 2.5 and all points < -2.5 to -2.5, so reality is even a lot worse. That is also the reason why the second image in the second row does not look “right”.
A word about feasibility
Please do not think I am only talking grey theory here, along the lines of “it is not possible to do this kind of simulation in a short enough amount of time anyway”.
If the algorithm is designed well, not a single additional simulation is needed to determine the forward trading profit, and no new optimisation procedure is needed for each WF window.
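One way to achieve this (a minimal sketch of the idea, not DATFRA's actual code): simulate each parameter combination once over the full history, store its return series, and then every WF window, in-sample or out-of-sample, is just an array slice of that series:

```python
import numpy as np

def window_stats(daily_returns, is_len, oos_len):
    """Slice one full-history return series into rolling
    (in-sample, out-of-sample) windows and sum the profit per slice.
    Each parameter combination is simulated ONCE to get `daily_returns`;
    everything after that is just array slicing."""
    datapoints = []
    start = 0
    while start + is_len + oos_len <= len(daily_returns):
        is_perf = daily_returns[start:start + is_len].sum()
        oos_perf = daily_returns[start + is_len:start + is_len + oos_len].sum()
        datapoints.append((is_perf, oos_perf))
        start += oos_len  # roll forward by one out-of-sample window
    return datapoints

# One simulated 10-year daily series -> a couple of hundred
# (optimisation, forward trade) datapoints from a single simulation
returns = np.random.normal(0.0, 1.0, 3650)
print(len(window_stats(returns, is_len=730, oos_len=14)))
```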
With the example above, DATFRA can generate 34,000,000 “optimisation => forward trade” datapoints in ~24 hours on a mid-range PC (8 GB RAM, quad-core 3 GHz).
Still not 120 million, sure, but compared to 240 I think it is a very good result.
So it IS feasible to analyse a system with this level of insight, even on today's hardware.