The Weird and Wonderful World of Backtesting

I have spent a large amount of my working life backtesting strategies. I have found from this experience that it is extremely difficult to write a backtest that gives you any confidence you are not just going to get burnt when you start putting real money on. From reading around, I believe this could be of use to a lot of people here.

I am talking about topics like:

Common backtesting pitfalls.
How many trades you need before you can trust you are not curve fitting.
Walk forward analysis: what it is and why it matters.
Measures for testing strategy quality: beyond Sharpe, looking at drawdown, risk of ruin etc.
Execution, and understanding the pitfalls of MetaTrader's modelling quality.
Understanding the complexities of look-ahead bias and how to avoid it.
Understanding the complexities of data snooping and overfitting.

Anyway, I am just throwing this out to see if anyone would be interested in a thread like this, and if so we can take it from there.

Already done several times recently, perhaps you should have joined in. I for one will not repeat my comments.

That was my reason for dipping my toe in the water first. All I can see in the EA section are threads like “I MAKE 500% A WEEK WITH NO EFFORT” and little that looks substantial. It was mainly going to be around simple-ish stats and the practicalities of execution.

I have seen you comment on topics like data quality and walk forward analysis, and I agree with your points. The real question is why they suck, what the real reason is, and whether they can be fixed. Are there ways to compute the sensitivity of your results to these assumptions? I actually think you will like this thread once/if it gets going.

Here you go.
http://forums.babypips.com/expert-advisors-automated-trading/59078-why-backtests-useless-eas-flawed-their-parameters-bad-discuss.html
http://forums.babypips.com/expert-advisors-automated-trading/52287-tutorial-complete-backtesting-analysis-setup-100-free.html

Thanks for the links. I think the angle I would like to approach this from is slightly different to what is already on here. A lot of what I have read here so far is quite dependent upon MetaTrader and/or other tooling. I wanted to discuss the conceptual statistical issues associated with backtesting.

Unfortunately, with the kind of viewers this forum grabs, you won’t be getting those kinds of discussions anytime soon. There’s only a small handful of these guys on here.

At the same time, many of the more technical traders do not like to reveal “specifics”. What ends up happening is several pages of ‘I MAKE 500% A WEEK WITH NO EFFORT’, which makes me go :20:

Perhaps if you wanted to get the ball rolling and create a thread like that, you might be able to get a group going. :slight_smile:

Thanks, I am planning on doing that here. I was inspired by the effort you put into your MetaTrader thread, so I'm planning on doing something similar from the point of view of why stats/backtesting will bite you in the ass. This is likely to go over a lot of people's heads, but as you said to me, if it helps a few people then it's worth it. The topic I'm planning on starting with is data, since I figure it's controversial and should get some eyes over here.

[B]DATA[/B]

There has been a lot written about the pitfalls of backtesting, and one of the main offenders is data. I would like to go into more detail about some of the common problems that can occur and what, if anything, can be done about them.

These are the main topics I will cover here:

  1. Spikes and Gaps
  2. Bid-offer spreads
  3. Data Granularity

It should be noted that these are potential issues, and how much they affect you depends upon the type of strategy you are testing. Weekly strategies are less susceptible to data issues than, say, a short-term scalping strategy. It is important that you understand the assumptions of the strategy you are testing, and you will need to explicitly test those assumptions accordingly. With this in mind, let's continue.

[B]Spikes and Gaps[/B]

There is an old adage that you get what you pay for. Market data is a great example of this. There are free sources of data, but if you expect a free source to be as reliable as one you have to pay for, you are being quite naive. A good set of tick data will cost around $300 per symbol per year. This is not to say that you cannot do a lot of good with free data, but you need to start with the assumption that what you are looking at is not that great. There are a couple of good free sources, including MetaTrader directly, Dukascopy, and the GAIN archives. The common problems that occur in market data are either gaps where data is missing, or the inclusion of ticks that are off market and as such may or may not have really happened.

The common issue with spikes in the data is either that you are in a position and get stopped out, or that a trade gets triggered and you pick up a lot of pnl that should not be there. So what can you do about this? There are a couple of solutions, and the common one suggested is, in my opinion, not the correct one. The common advice is to use a voting mechanism: take 3 data sources and choose the 2 prices that agree with each other. If there is a spike in one data source, you can then spot it and remove it. This sounds smart, but it has one specific issue: this cleaned data feed is not tradeable. There is no way, when you start trading, that you can use a data feed like this (unless you're being very fancy). This is where I would like to introduce a rule: “Test what you trade, and trade what you test.” This means you should only test with data that comes from the broker you will be running the strategy with. If there is a spike in their data, then they may have stopped you out even if that spike is localised to their books only.

So what should you do? Well, you have two problems here. You can collect your broker's market data via MetaTrader or some other source, which is ideal, but if you want to test over many, many years then this might prove difficult. My suggestion is, when testing, to try to work out how sensitive your strategy is to data spikes. There are some measures that can be used for this; if you would like some homework, read Robert E. Pardo's book Evaluation and Optimization of Trading Strategies. It is sensible to look at the difference between the raw pnl and the pnl with the top 5% of winners removed. This gives you an initial indication of how sensitive your strategy is to a few trades. It doesn't necessarily mean that you have a problem; that depends entirely upon the type of strategy you are testing. If it's a scalping strategy with an expected RR of 1:1 then you might have a problem; if you are testing a trend rider it's less likely to be an issue. What you should do is then look at this 5% of trades and ask: did you make money because of some sustained, sensible move, or because of a spike? If it was a spike, was it around numbers, and was it reflected on more than one data source? If not, you will need to go back to the drawing board.

As I said, it's not that spikes will or won't give you problems, but looking in detail at the top 5% of winners or losers gives you a tool to at least evaluate whether you have a problem.
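To make the top-5% check concrete, here is a minimal Python sketch (the trade pnls are invented for illustration):

```python
def pnl_without_top_winners(trade_pnls, fraction=0.05):
    """Compare raw pnl with pnl after removing the top fraction of winners.

    A large gap suggests the results lean heavily on a few trades,
    which should then be inspected by hand for data spikes.
    """
    n_drop = max(1, int(len(trade_pnls) * fraction))
    trimmed = sorted(trade_pnls)[:-n_drop]  # drop the biggest winners
    return sum(trade_pnls), sum(trimmed)

raw, trimmed = pnl_without_top_winners(
    [120, -40, 15, 900, 30, -25, 10, 22, -18, 35])
# Here one trade carries nearly all the pnl, so that trade needs checking
# against a second data source before you believe the backtest.
```

If stripping one or two trades wipes out most of the edge, go and look at those specific trades by hand.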

[B]Bid-Offer Spreads[/B]

I would like to start with a common misconception. There are a few common models for working out whether an order was filled in a simulation and for including transaction costs.

  1. Fill order when a limit crosses the mid price and then subtract from the pnl the volume multiplied by the avg spread
  2. Use a mid-price with a fixed bid offer spread and then trade when the market crosses the limit price
  3. Use actual bid offer prices and fill orders when the market crosses the limit price

The only one that is usable is number 3. There are a couple of reasons for this.

  1. You cannot trade at mid; the offer needs to cross your buy price for you to be filled.
  2. When there is anything interesting going on, the first thing that happens is the spread blows out.

Spreads are variable throughout the day. Strategies that are very susceptible to this are ones that use solely limit orders to enter/exit positions. The problem with using mid prices manifests itself in a couple of ways. It allows you to place stops closer, as the price has to go further before you are stopped out. It means you can get into a trade earlier: in a breakout the price is probably a couple of pips higher or lower than the mid data says, but a naive backtest will just fill you at mid and subtract a pip of transaction costs. So I go back to my point: test what you trade and trade what you test. Again, this doesn't mean that your test definitely has problems, but it's something to be aware of and to test for. Short-term strategies that use limit orders are susceptible to this, as are strategies that trade breakouts. The best thing to do in this case is look at the pnl per trade. This should be comfortably higher than the spread if you have any hope of making the strategy work in reality. If you are trading short term and are only making 1-2 pips per round trip, then you are in danger of only making money because the spread is not being modelled properly.
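As a rough sketch of the pnl-per-trade check (the numbers are invented, and the 2x margin is just my own rule of thumb, not a standard):

```python
def spread_sanity_check(trade_pnls_pips, avg_spread_pips, margin=2.0):
    """Flag strategies whose average pnl per trade is too close to the spread.

    If the average round-trip edge is not comfortably above the spread,
    the backtest may only be 'profitable' because it fills at mid.
    """
    avg_pnl = sum(trade_pnls_pips) / len(trade_pnls_pips)
    return avg_pnl, avg_pnl > margin * avg_spread_pips

# 2 pips/trade edge vs a 1.5 pip spread: too close for comfort
avg, ok = spread_sanity_check([3, -2, 4, 1, 2, 5, -1, 4], avg_spread_pips=1.5)
```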

[B]Data Granularity[/B]

MetaTrader has a concept of modelling quality. This is just a measure of the time frame the execution model uses vs. the time frame of the signal model. Again, if you are trading daily bars then this is less likely to be an issue, but if you are testing a scalping strategy that is aiming for 10-20 pips then it is something to be aware of. The problem is this. Imagine you have a setup in which you bought the bottom of a range. It would be a good idea to have a take profit at the top of the range and a stop just below the range. If the next bar you see is a large outside bar, meaning it went through both the bottom and the top of the range, two things could have happened: you got stopped out and then the market went in your favour, or you made your money and after that who cares. Again, without knowing the strategy you are testing, it is impossible to know if there is a problem. Strategies with a wide gap between TP and SL are less susceptible on 1-minute bars, and strategies that trade lower-volatility pairs are less susceptible too. The question is how you test whether you have a problem, and the answer is similar to that in point 1. If you test a strategy in which a single bar hits both the TP and the SL, then always assume the SL is hit first, and compare this pnl with the pnl in which you always assume the TP is hit first. Ideally these will be the same, and thus modelling quality is less likely to be an issue. Only after the strategy has been specifically tested in this way can you say whether modelling quality is or isn't a problem with the test results.
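Here is a hedged sketch of that SL-first vs TP-first bracketing for a long trade; the prices are invented, and real code would also need to handle entries and multi-bar trades:

```python
def fill_pnl(entry, tp, sl, bar_high, bar_low, sl_first):
    """Pnl for a long trade on a bar that may touch TP, SL, or both.

    When one bar spans both levels we cannot know the intra-bar order,
    so we bracket the truth with a pessimistic (SL first) and an
    optimistic (TP first) assumption.
    """
    hit_tp, hit_sl = bar_high >= tp, bar_low <= sl
    if hit_tp and hit_sl:
        return (sl - entry) if sl_first else (tp - entry)
    if hit_sl:
        return sl - entry
    if hit_tp:
        return tp - entry
    return 0.0

# Outside bar spanning both levels: the two assumptions disagree,
# which is exactly the modelling-quality problem described above.
pessimistic = fill_pnl(1.2000, 1.2050, 1.1980,
                       bar_high=1.2060, bar_low=1.1970, sl_first=True)
optimistic = fill_pnl(1.2000, 1.2050, 1.1980,
                      bar_high=1.2060, bar_low=1.1970, sl_first=False)
```

If the total pnl under the two assumptions is materially different, the bar data is too coarse for the strategy.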

To summarise there are a few things to take away.

  1. All back tests make assumptions.
  2. There are specific techniques/measures to test the impact of each assumption.
  3. Different strategy types are sensitive to different assumptions.
  4. Test what you trade, trade what you test.
  5. Do not rely only on MetaTrader backtest results.

From the reading I have been doing, I think there is a misunderstanding as to how to develop a trading strategy. The common process seems to be:

  1. Create Strategy Idea
  2. Build strategy in simulation environment
  3. Tune strategy to be profitable in simulation
  4. Trade and make a boat load of money

Now if you think this is how it works, you really, really need to read on. The real process is similar, but the difference is very important. A strategy is tuned to make money in a simulation, so the biggest issue comes from the assumption that the simulation is the same as reality. This is often not the case, for various subtle reasons you can read through on this thread.

Ok, so what is the reality?

  1. Create Strategy Idea
  2. Build strategy in simulation environment
  3. Tune strategy to be profitable in simulation
  4. Trade
  5. Run sim on the days that were traded
  6. Did you get hosed? If so go to 7, else go to 8.
  7. Work out difference between sim and reality, fix simulation environment, then go to 3.
  8. Trade and make a boat load of money

The interesting thing is that, because of the assumptions made by each strategy type, this fixing of the simulation environment has to be done for every strategy. The other thing is that if you are going through this cycle many, many times, then you're unlikely to have started with a robust strategy idea.

People post strategies that have really good backtest results but have never been traded, and assume they are only one step from making a lot of money. In reality they are just at the beginning of a lengthy backtest-fixing cycle. Once you have enough experience it is possible to future-proof this, by realising from the start that a strategy is flawed because of some subtle implied assumption made in the simulation. Unfortunately, the only way of getting to this stage is a lengthy, painful process of finding a lot of killer strategies only to discover that they actually suck. I hope this last statement makes those of you who have been through the wash cycle of disappointment smile. For those of you just beginning the journey, stick at it and you will get there, but you need to trade as well as backtest or you'll never be able to see the differences.

I would like to summarise this post in a mathematical way, so please stick with me.

The market can be thought of as a function that takes a strategy s and a time range r1 and outputs a pnl w.

So for a simulated market f we say f(s, r1) = w, where w is the sim pnl.
For the real market g we expect that g(s, r1) = w.
What this means is that the simulation and the real market, given the same strategy and the same data, should output the same pnl. If not, you are already in trouble.
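In code, checking that sim and reality agree over the same period amounts to replaying the sim over the days you actually traded and diffing the two fill logs. A minimal sketch (the trade ids and prices are invented for illustration):

```python
def diff_fills(sim_fills, real_fills, tolerance=0.0002):
    """Compare simulated and real fills trade by trade.

    Returns the trade ids whose fill prices differ by more than
    `tolerance` (or that the broker never filled at all); any entries
    here mean sim and reality disagree, and the simulation environment
    needs fixing before its results mean anything.
    """
    mismatches = []
    for trade_id, sim_price in sim_fills.items():
        real_price = real_fills.get(trade_id)
        if real_price is None or abs(sim_price - real_price) > tolerance:
            mismatches.append(trade_id)
    return mismatches

sim = {"t1": 1.3050, "t2": 1.3071, "t3": 1.3099}
real = {"t1": 1.3050, "t2": 1.3080, "t3": 1.3099}  # t2 slipped 0.9 pips
bad = diff_fills(sim, real)
```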

The second stage I will talk about is backtest stability and the following assumption.

f(s, r1) > 0 => g(s, r2) > 0

This is where walk forward analysis is useful. To put this into English: making money in sim over yesterday's data implies you will make money in the market tomorrow.

[B]“There are lies, damned lies, and statistics”[/B]
- Mark Twain

Before getting into walk forward analysis and why it's not a panacea, I need to talk a little about statistical testing. A statistical test has two error rates that are very important. One is the chance of a false positive (the test says a trade will make money and it doesn't), and the other is the chance of a false negative (the test says there is no trade and the market rockets up). The latter is not talked about often, but it's something I will cover in a different post.

The false positive rate is important for backtesting, as it says that if you test 100 strategies, you should expect x% of them to appear to make money irrespective of whether they are any good or not.
If you are testing a strategy with 10 RSI settings and 10 MACD settings and find that 1 or 2 combinations make money, you might be falling foul of this. It is not that there is definitely a problem, but it is worth keeping these issues in mind.

So what can you do about it?

  1. Don’t use strategies with a lot of indicators/parameter settings
  2. Make the back test a lot harder

Using longer testing periods, testing out of sample, and walk forward analysis are examples of fixing point 2. In each case, all you are doing is making the test harder and thus reducing the false positive rate. This in turn gives you a higher probability that if you backtest something that makes money, you have actually found something that works. This is, however, subject to point 1.
No matter how good your simulation, there is always a non-zero probability that it will give positive test results for something that will not make money. I have seen people who, because they are using walk forward analysis, will test a strategy with 10 indicators and 10 configurations each. Doing this fixes one problem but introduces another. No matter how good the backtest is, if you abuse it and test millions of configurations, then the results will always be useless. This is also the reason why simple strategies that are coarsely optimised tend to do better in the long run.
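You can see the false positive problem directly by simulating strategies with no edge at all; the thresholds here are arbitrary, but the point survives any reasonable choice:

```python
import random


def false_positive_demo(n_strategies=1000, n_trades=100, seed=42):
    """Count zero-edge coin-flip 'strategies' that look profitable anyway.

    Each strategy wins or loses 1 unit per trade with 50/50 odds, so none
    has any real edge; a meaningful fraction still ends up nicely positive,
    which is exactly the multiple-testing trap described above.
    """
    rng = random.Random(seed)
    lucky = 0
    for _ in range(n_strategies):
        pnl = sum(rng.choice((1, -1)) for _ in range(n_trades))
        if pnl > 10:  # looks like a decent edge over 100 trades
            lucky += 1
    return lucky

print(false_positive_demo())  # a non-trivial number of 'winners' with no edge
```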

There is a second advantage to walk forward analysis, which is that it is more realistic in regards to how strategies are managed in reality, and so adheres to the “test what you trade and trade what you test” rule. This I will save for a longer discussion of walk forward analysis.

Walk Forward Analysis is quite a hot topic on here but I would like to highlight some of the benefits that are less readily covered.

I personally like walk forward analysis, or at least I prefer it to the alternatives. As a general rule, it is a good idea to make the way a strategy is tested the same as the way it is traded. If you do not respect this rule then you can run into a lot of problems with backtest results not matching reality, and it is best to just remove this as a degree of freedom if you can. The points made in this post assume that the backtesting framework you use is reflective of production. For more information on this, please see the post about data.

I personally see walk forward analysis as separate from backtesting: a layer on top which is used for strategy optimisation and selection. The selection process depends on individual backtest results, but the power of walk forward analysis is its ability to give you a probability distribution for a strategy's forward performance. Without this it is very difficult to know whether a strategy is broken or you are just unlucky. It is very important to manage your losers, and WFA gives you an effective way to do this.

I would say there are 3 common ways people test strategies.

  1. Take all of the data available, optimise the strategy in sample, and then start trading it.
  2. Split the data into an in-sample and an out-of-sample set, fit in sample, then test out of sample and make sure you make money in both.
  3. Split the data into slices, use testing strategy 2, but then repeat the process as you walk forward through the data. This is WFA.

Most people should know that testing strategy 1 is like juggling hand grenades. What is less well known is that testing strategy 2 is just a specific example of walk forward analysis in which the in-sample and out-of-sample periods are so long that 1 cycle of the walk forward takes up your whole dataset. Succinctly, 2 is a special case of 3, and as such it cannot tell you more than 3 can.
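Generating the walk-forward slices is simple enough. A sketch using day indices (the window lengths are just examples):

```python
def walk_forward_windows(n_days, in_sample, out_sample):
    """Yield (fit_start, fit_end, test_start, test_end) index windows.

    Testing strategy 2 falls out as the special case where one cycle
    covers the whole dataset; here we roll the window forward by the
    out-of-sample length each cycle.
    """
    windows = []
    start = 0
    while start + in_sample + out_sample <= n_days:
        fit_end = start + in_sample
        windows.append((start, fit_end, fit_end, fit_end + out_sample))
        start += out_sample
    return windows

# e.g. 2 years of daily data, fit on ~1 year, trade the next quarter
wins = walk_forward_windows(n_days=500, in_sample=250, out_sample=63)
```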

Say you use testing strategy 2 to fit and test an EA you have written, and say, for instance, that you use a 1-year out-of-sample period. What will you do after you have traded it for a year?

  1. Leave it alone
  2. Re-calibrate it
  3. Bin it

If the EA made money, then I can see the argument for not fixing what isn't broken. But if it didn't, how do you know whether to re-fit it or to put it in the bin? How badly does it have to suck before you give up? This is a difficult question to answer if you have not traded that strategy for multiple years or have not used walk forward analysis.

You can lose money for many reasons, but for simplicity I would like to say there are two main ones.

  1. God hates you and it’s just not your day.
  2. Your strategy is fitted to a specific market regime which is no longer valid due to structural changes in the market

Before you stop trading a strategy, you need to know which of these two is the reason you are not making money. Walk forward analysis provides a handy way to do this.

The most important property of a selection procedure is that it gives you predictability in regards to what will happen in the future. If the backtest says you would have made money in sample, then when you trade the strategy there should be a good chance you will make money out of sample, i.e. when you are live trading. One of the most important things when selecting a calibration of a strategy is that if it is in the top x performers in sample, then it should consistently be in the top x performers out of sample. If this is the case, then you are likely to have found something robust. This is where walk forward analysis is useful. Say that for a specific calibration of a strategy you use a walk forward with an in-sample period of 1 year and an out-of-sample period of 3 months. Over a 10-year period you then have 40 in-sample and 40 out-of-sample results. The question is: if the fitted strategy makes $10,000 in sample, what is the expectation of the out-of-sample results? This can be worked out using a simple linear regression, and there are a few outcomes that can occur when doing this.

  1. There is no relation => your fundamental idea/strategy is crap.
  2. There is a relation but the variance is really high => you should trade it, but only when your expected pnl is very high.
  3. There is a good relationship with low variance => you just found your own personal ATM.
  4. There is a good relationship with low variance => you threw a million calibrations at it, so you were always going to find one calibration like this.
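The regression itself is just ordinary least squares of out-of-sample pnl on in-sample pnl across the walk-forward cycles. A sketch with invented numbers:

```python
def is_oos_regression(in_sample_pnls, out_sample_pnls):
    """Least-squares fit oos = intercept + slope * is across WFA cycles.

    A slope near zero (or a very noisy fit) is case 1/2 above; a stable
    positive slope with small residuals is what you are hoping for.
    """
    n = len(in_sample_pnls)
    mean_x = sum(in_sample_pnls) / n
    mean_y = sum(out_sample_pnls) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(in_sample_pnls, out_sample_pnls))
    var = sum((x - mean_x) ** 2 for x in in_sample_pnls)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# invented example: oos pnl tracks is pnl reasonably well
slope, intercept = is_oos_regression([8000, 10000, 12000, 9000],
                                     [1900, 2600, 3100, 2200])
```

In practice you would also look at the residual variance, not just the slope, since case 2 vs case 3 above is entirely about the spread around the fit.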

This is also useful as it answers questions like: if we fit this strategy over 2013 and trade it for Q1 2014, and it made $10,000 in the fit over 2013, what can we expect the results for Q1 to be? If the walk forward says the annualised std. deviation is 2.5k, then pro rata you can expect the Q1 pnl to be 2.5k with a std. deviation of 1.25k, which leaves only about a 2.28% chance of an outright losing quarter (1-tail test, assumes a Gaussian pnl distribution and a sufficient number of trades). If you land way outside of that, it's worth taking a look at some of the assumptions made in the selection process.
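The arithmetic behind those numbers, as a sketch (iid Gaussian quarters is itself a big assumption, so treat the probability as a rough guide only):

```python
import math


def quarter_expectation(annual_pnl, annual_sigma, realised_q_pnl):
    """Pro-rata an annual in-sample result to one quarter and z-score
    the realised quarter against it (assumes iid Gaussian quarters).

    With annual_pnl=10000 and annual_sigma=2500, the quarterly mean is
    2500 and the quarterly sigma 1250, so a break-even quarter sits
    2 sigma below expectation: roughly a 2.28% one-tailed event.
    """
    q_mean = annual_pnl / 4
    q_sigma = annual_sigma / math.sqrt(4)  # sigma scales with sqrt(time)
    z = (realised_q_pnl - q_mean) / q_sigma
    one_tail_prob = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # normal CDF
    return q_mean, q_sigma, z, one_tail_prob

q_mean, q_sigma, z, p = quarter_expectation(10000, 2500, realised_q_pnl=0)
```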

This is also an important point in regards to trading automated strategies. I see a lot of people expecting to trade a strategy continuously and for it to continue to work forever. This assumption often comes from a lack of understanding of the markets and of where trading opportunities come from. Very few last forever; most are fleeting. You should only trade an EA when you have confidence in its forward performance. There are a couple of ways of gaining that confidence: one is to trade it for years, the other is WFA. Either way, you need to know when the opportunity you are making money from is gone.

This leads me neatly into the next thing I wanted to discuss: in a walk forward analysis, what is the correct length of the in- and out-of-sample periods? Good question, pip. I'm glad you asked!

It depends upon the market conditions you would like to exploit and how long you think that regime will last. We are currently in a state of low vol; strategies that require high vol (you know who you are) are currently having a hard time. This is where the trade-off lies: a short out-of-sample period runs the risk of not locking into a regime, giving bad out-of-sample results, while a really long out-of-sample period means you cannot adapt to changes of regime very quickly and can suffer long drawdowns. So what to do? My answer is not to care about the specific length of the in- or out-of-sample periods, but to look at which combination gives you the best predictability between the in- and out-of-sample results. There is a balance to be struck, and it is specific to the individual strategy, but this is how it can be struck and tested for. A long in-sample period can give a good mean but a bad variance of out-of-sample performance; a short one can give a bad mean but a low variance. Really you are looking for the best expected pnl out of sample with the smallest expected variance, so that you can be confident of being successful when you are trading live.

I am hoping that from the last paragraph you have noticed something weird. I am suggesting that there should be an optimisation of the selection procedure used to optimise the overall strategy (insert Inception reference). This is exactly what I am suggesting; the alternative is to use the same in- and out-of-sample periods for all strategies. It is not possible to say that it always makes sense to fit a strategy on 2 years' worth of data. That is a magic number, and magic numbers should always be questioned. The right answer is: I should select the calibration fitted over x months/weeks/days of data because it gives me the highest expected return over the next y months/weeks/days.

The other thing that should be highlighted is that WFA is like a seat belt. When seatbelts were introduced in the US, the death rate increased for a short time because everyone started driving like d1cks, thinking “I have a seatbelt on so I can't die”. If you throw enough indicators and calibrations at a good WFA strategy selector, you will always find a few specific instances that pass every test you throw at them, even though they will not make you money. This is where I think a lot of the grievances with walk forward testing come from. My argument is the same as the argument for the seat belt: I realise it will not save my life in all eventualities, but if I drive carefully it will help me survive most of the unexpected ones.

I hope this helps.

I have extremely good hindsight. It's an important skill when wailing on other people's trades and just generally being a know-it-all. I also find it useful for other things.

I have seen a lot of conversations around signal accuracy, with people asking whether it is better to have a signal that is 60% accurate or 80% accurate. I would like to say that accuracy is only half of the battle. The other half is signal efficiency.

Say you have a signal that tells you the market will move 100 pips, and you know it to be 60% accurate. Is this signal better or worse than one that finds twice the opportunities but is only 56% accurate? The latter has the higher expected nominal return and thus should be preferable.
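The arithmetic, assuming a symmetric 100-pip win/loss for simplicity:

```python
def expected_pips(accuracy, n_opportunities, win_pips=100, loss_pips=100):
    """Expected nominal pips: (accuracy * win - (1 - accuracy) * loss) per trade."""
    per_trade = accuracy * win_pips - (1 - accuracy) * loss_pips
    return per_trade * n_opportunities

signal_a = expected_pips(0.60, n_opportunities=10)  # 20 pips/trade over 10 trades
signal_b = expected_pips(0.56, n_opportunities=20)  # 12 pips/trade over 20 trades
# signal_b wins on total expected pips despite the lower accuracy
```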

People in some of the PA forums ask questions like “I have only found 4 trades this month, is that good?” The answer is to use (if you have them) your incredible powers of hindsight to work out: in an ideal world, if I were awesome, how many opportunities were there to make 100 pips? If the answer was really 100 and you spotted 4, then no, realistically you suck. Even if you had an accuracy of 80%, your efficiency is 4% and there is a lot more money on the table.

This is the difference between type 1 and type 2 errors. Taking a trade and hitting your SL is an example of a false positive, or type 1 error. Less well known is the type 2 error, the false negative: not seeing an opportunity and missing out. This can cost you just as much as the former, if not more.

The reason people don't tend to talk about type 2 errors is that they are harder to measure. It is, however, possible with, as I said, the great powers of hindsight. The other thing of note is that there is a direct relationship between the type 1 and type 2 error rates: the more trades you find, the more the accuracy is likely to drop. There are a few statistical techniques for measuring this relationship, but a good one is the area under a receiver operating characteristic (ROC) curve. In an ideal world you would like a signal that is a perfect classifier, in which the accuracy is independent of the number of opportunities found.
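The AUC can be computed without any libraries as the probability that a randomly chosen true opportunity outscores a randomly chosen non-opportunity. A sketch with invented scores:

```python
def roc_auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison.

    `scores` are signal strengths, `outcomes` are 1 for a real
    opportunity and 0 otherwise. AUC is the probability a randomly
    chosen opportunity outscores a randomly chosen non-opportunity:
    0.5 means the signal is noise, 1.0 a perfect classifier.
    """
    positives = [s for s, o in zip(scores, outcomes) if o == 1]
    negatives = [s for s, o in zip(scores, outcomes) if o == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

auc = roc_auc([0.9, 0.3, 0.8, 0.4, 0.2], [1, 1, 0, 1, 0])
```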

Total pnl is a proxy for finding a balance between accuracy and efficiency, but it misses part of the picture, as it is limited to the maximum possible conditional upon the strategy being tested, rather than the unconditional opportunity in the market. Before you even start, it is worth knowing what that unconditional number is and what proportion of it you are capturing while evaluating a strategy.

Before anyone flames me for saying you should always go for an opportunity of 100 pips: I know that is not correct. But if not 100 pips, then what is the optimum? Is it even a fixed number? All I am trying to say is that the number of opportunities, efficiency, and accuracy should all be taken into account when trying to work this out.