Why backtests are useless, EAs are flawed and their parameters are bad [DISCUSS]

Are you taking the average profit or the sum of profits in each bucket?

The trend that the diagram shows is visible in both cases.

But that's the whole point: algotrading should not be a fixed process. It should show all the information, depending on what the trader using it wants to see.
And I guess that is the same for all working algo-trading implementations.

So you can look at the average, the sum, the member count, moving averages; you can do statistical evaluations like the Gini coefficient or skewness, or other stuff. You name it.

You can also decide not to use bucketing at all and take the average of trades, on a per-day basis, to get an equity curve.
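
For illustration, a minimal sketch of that per-day equity curve in C++, assuming the trades are already exported as (day, profit) records (the field names are made up):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical trade record; these field names are assumptions, not a real export format.
struct Trade {
    std::string day;   // e.g. "2013-05-17"
    double profit;     // closed profit of the trade in account currency
};

// Build an equity curve from the per-day average trade profit.
std::vector<double> equityCurve(const std::vector<Trade>& trades) {
    std::map<std::string, std::pair<double, int>> perDay; // day -> (sum, count)
    for (const Trade& t : trades) {
        perDay[t.day].first  += t.profit;
        perDay[t.day].second += 1;
    }
    std::vector<double> curve;
    double equity = 0.0;
    for (const auto& kv : perDay) {                   // std::map iterates days in order
        equity += kv.second.first / kv.second.second; // add the day's average profit
        curve.push_back(equity);                      // days without trades are absent here;
    }                                                 // against calendar time they appear flat
    return curve;
}
```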

Note:
1.) As I said, this is based on the default "Moving Averages" EA that comes with every MT4 installation, so of course it is not making a profit.
2.) The flat regions are days without trades.

What I want to say is that backtests are useless because they only show you one datapoint, one experiment, one simulation, one parameter set, whereas there are half a million of them "within" a proper evaluation method.
So a good "backtest" should harvest and save all of them, so you can then evaluate and view them in every way you can imagine :slight_smile:
For example, processing them with C++ (as I do), Java, or just importing them into Excel.
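
A minimal sketch of that harvest-and-save step (the record layout and CSV format here are just assumptions):

```cpp
#include <cstdio>
#include <vector>

// One walk-forward experiment: a parameter set with its in-sample and
// out-of-sample profit. This is an assumed minimal layout.
struct Result {
    int    id;          // index of the parameter combination
    double inSample;    // in-sample profit
    double outOfSample; // out-of-sample profit
};

// Dump every result to CSV so it can be post-processed in C++, Java, Excel, ...
void saveResults(const char* path, const std::vector<Result>& results) {
    FILE* f = std::fopen(path, "w");
    if (!f) return;
    std::fprintf(f, "id,in_sample,out_of_sample\n");
    for (const Result& r : results)
        std::fprintf(f, "%d,%.2f,%.2f\n", r.id, r.inSample, r.outOfSample);
    std::fclose(f);
}
```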

-Darwin

One backtest is not useful, I agree, but this still smells like data mining. Doing things like only looking at strategies that are profitable in or out of sample is just going to give you spurious correlation. Do you actually do anything statistically rigorous, like Pardo's book suggests?

What you suspect is: generate many systems, evaluate them all, take the ones that look good, right? :slight_smile:
What I say algotrading should be: generate a few systems and then run many evaluations on them.

So: do not increase the number of tested systems, but the number of tests per system… more or less.

That way, there is no "data mining" in the way you think.
Also, you can still develop the system manually, to avoid fitting towards unsound, overfitted ones.
Then only evaluate them with algos, making all decisions manually. :slight_smile:

No fitting -> no overfitting, just a broader view of the system! That's the theory, at least :wink:

I have not read this book, and I am a programmer, not a statistician. Can you please define "statistically rigorous" tests?
Then I can answer that question.

-Darwin

I like you’re enthusiasm but you’re starting a post about writing a walk forward testing framework and i find it odd you havent read about where it comes from. This book introduces the concept of WFA.

This is the bible/koran/torah of backtesting.

One of the main sections deals with statistical techniques to know whether a strategy calibration that works in sample will work out of sample. He talks about stochastic dominance, which basically says that if a single calibration is in the top x% in sample, then it should be in the top x% out of sample.

Which statistical method you use is not that important, but the main thing your framework should produce is something that says: this is the likelihood that your out-of-sample strategy performance is nothing more than chance. I know you have kind of done that, but it needs to take into account the whole sample space, i.e. every param configuration you threw at it. This is what I mean by statistically rigorous.

I think you are on the right track, however, and I appreciate you building something for other people. I think you could really benefit from reading Pardo's book and enhancing what you have done a little bit.

It’s a great book, definitely recommend! :slight_smile:

I agree, but it's no "Multifractal Volatility", which is a book so boring Amazon wouldn't even let me sell it back to them.

@pipwhip:

Well, I did think quite a bit about your suggestions over the last few days; sorry for my late answer.

First of all, I am sure reading that book would be a good thing for me, though I totally lack the time for it at the moment.

One of the main sections deals with statistical techniques to know whether a strategy calibration that works in sample will work out of sample. He talks about stochastic dominance, which basically says that if a single calibration is in the top x% in sample, then it should be in the top x% out of sample.

I have a few problems with that.

1.) How would you determine x? Sure, this can be done using expert knowledge or something like that, but relying on people's knowledge is dangerous, especially in trading.

So, to determine x (as an exclusion criterion for systems), we would need to find a value for x through computationally heavy and time-consuming experiments covering many strategies.

Everything else would just be a good guess, which is nothing I am comfortable with.

2.) At the moment I have a 100% parameterless evaluation approach, where the user does not input ANY knowledge/parameters.
All you need is an EA and the market/timeframe. That's it.

Every single other variable is not pre-set but extracted from the data (not automatically, as that only leads to bad decisions; the tool just helps the user to extract these variables).

So, setting an "x" (based on knowledge from a book) would kind of invalidate the idea of 100% data-driven trading.

3.) What this method actually does is compare the in-sample performance to the out-of-sample performance, right?
But who says that every strategy behaves this way?
Sure, it seems logical at first thought, but I have a problem taking any "expert knowledge" for granted.
So, without heavy experiments and reliable data, it's sort of a bet.

Did this guy release any experiments/data/numbers that prove the validity of his statement?

Because my (limited) experience tells me that it's more or less random how well a single candidate performs out of sample. You can only see a clear "trend" when you look at all the (~100,000-500,000) in-sample/out-of-sample pairs.

4.)

the main thing your framework should produce is something that says: this is the likelihood that your out-of-sample strategy performance is nothing more than chance. I know you have kind of done that, but it needs to take into account the whole sample space, i.e. every param configuration you threw at it.

My previous post shows a diagram that basically does exactly this.

It directly shows whether the in-sample profit and the out-of-sample profit are correlated, so it should basically be the same thing, or did I misunderstand something?

Also, it does show “the whole sample space i.e. every param configuration you threw at it”, as that is the whole point of that framework.

Would like to hear your opinion on that :slight_smile:

-Darwin

Hey,

I appreciate your post, thanks for taking the time to think about what I wrote. I actually like discussing these things, as it helps to clarify them in my own head. From reading your post I think you are on the right track to doing what you want to do. I think doing it as an unsupervised learning method where there is little user input is a smart way of setting this up. If anything, I would like to add to your method to make it more statistically robust.

In regards to the x% thing, I think this was badly explained by me. There are a few validation techniques for probabilistic classifiers; I was trying to explain one called stochastic dominance, which is in the book. The details don't matter, but what you are trying to say is that if you have a signal that says there is a 60% chance that a trade will work when you fit it, it should be right 60% of the time when you use it. This is actually similar to what you did with correlating in- and out-of-sample results, but in stochastic dominance the axes are the percentiles of each configuration in and out of sample. This is how you would do it in practice:

  1. For each strategy configuration you have an in sample and out of sample pnl.
  2. Sort by in sample pnl and assign an index number based on this index order
  3. Then sort by out of sample pnl and assign an index number based on this index order
  4. Scatter plot the in sample index by the out of sample index and you should get a straight line if the strategy is any good.

The straight line is simply saying that, in general, the crap ones in sample are the crap ones out of sample, which is what you were saying, but there were two things missing.

  1. Using percentiles is a way of normalizing the data to adjust for the fact that the in- and out-of-sample data sets are different (vol, number of bars, etc.).
  2. You need to look at all of the configurations you tested, not just the ones that made money. This is important for working out whether the correlation is significant.

So, in regards to x%: all you are doing is making sure that, for all values of x in [0,100], if a configuration is in the lower x% in sample, it's in the lower x% out of sample.
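
A minimal sketch of steps 1-3 in C++, assuming the per-configuration in-sample and out-of-sample PnLs sit in two parallel vectors:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Rank positions of v: result[i] is the rank of v[i] after sorting ascending.
std::vector<int> ranks(const std::vector<double>& v) {
    std::vector<int> order(v.size());
    std::iota(order.begin(), order.end(), 0);             // 0,1,2,...
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return v[a] < v[b]; }); // sort indices by value
    std::vector<int> r(v.size());
    for (int pos = 0; pos < (int)order.size(); ++pos)
        r[order[pos]] = pos;                              // invert the permutation
    return r;
}

// For each configuration i, (isRank[i], oosRank[i]) is one point of the
// rank-rank scatter plot. A roughly straight diagonal means the good (and bad)
// configurations keep their ranking out of sample:
//   std::vector<int> isRank  = ranks(inSamplePnl);
//   std::vector<int> oosRank = ranks(outOfSamplePnl);
```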

I agree that correlation is important, but I'm saying that showing the statistical significance of that correlation is also important. This is what you are doing when you say you should test a lot of configurations and look for a clear trend. I am saying there are good statistical ways of defining the term "clear trend". If someone said there is a relationship between oil and copper and there is a correlation of 60%, I would immediately be inclined to ask: what is the strength of the relationship, how significant is it? i.e. what is the R^2 and the p value for the correlation.

Thinking about it as I have typed this, I have decided that you are probably already about 95% there. I think you just need to add, for instance, a simple least-squares regression to your in-sample/out-of-sample scatter plot and output the R^2 of the fit and the p value of the beta coefficient. You will find it is more robust if you do the percentile normalisation I suggested, since you will remove some of the noise caused by the non-linear differences between the in-sample and out-of-sample periods.
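
A sketch of that regression output, assuming Boost.Math is available for the Student's t distribution (x and y would be the in-sample and out-of-sample values, or their percentile ranks):

```cpp
#include <cmath>
#include <vector>
#include <boost/math/distributions/students_t.hpp>

struct Fit { double beta, r2, pValue; };

// Simple least-squares fit y = alpha + beta*x, with R^2 and the two-sided
// p-value of beta (H0: beta == 0), via a t-test with n-2 degrees of freedom.
Fit regress(const std::vector<double>& x, const std::vector<double>& y) {
    const int n = (int)x.size();
    double mx = 0, my = 0;
    for (int i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double sxx = 0, sxy = 0, syy = 0;
    for (int i = 0; i < n; ++i) {
        sxx += (x[i] - mx) * (x[i] - mx);
        sxy += (x[i] - mx) * (y[i] - my);
        syy += (y[i] - my) * (y[i] - my);
    }
    double beta  = sxy / sxx;
    double alpha = my - beta * mx;
    double ssRes = 0;                                  // residual sum of squares
    for (int i = 0; i < n; ++i) {
        double e = y[i] - (alpha + beta * x[i]);
        ssRes += e * e;
    }
    double r2 = 1.0 - ssRes / syy;
    double se = std::sqrt(ssRes / (n - 2) / sxx);      // standard error of beta
    double t  = beta / se;
    boost::math::students_t dist(n - 2);
    double p = 2.0 * boost::math::cdf(boost::math::complement(dist, std::fabs(t)));
    return { beta, r2, p };
}
```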

Does that make sense?

I would also expect your tool to say something like this. Say I have a strategy, and your tool told me it's good and that over the last 6 months I would have made 25 pips per day. My question would be: if I trade this for the next 3 months, what per-day performance can I expect? Would -50 pips be a bad day? Would 30 pips be a bad day? From the walk forward you can build a histogram of comparative percentages between in-sample and out-of-sample performance. This might say that normally the out-of-sample performance was within ±80% of the in-sample performance. This at least gives the user confidence, when they start trading the strategy forward, that "normal" is anywhere within ±20 pips of that 25.

I agree you will never know specifically how things will perform live (you never know when the next earthquake is going to happen), but it would be awesome for a simple histogram to say: here is your normal operating range, don't worry too much if you are within it. It's also useful for a whole array of stop-loss and portfolio optimization techniques, but that is off topic.
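
A sketch of that "normal operating range" histogram, assuming the per-window in-sample and out-of-sample performance figures are already at hand (the bin layout is arbitrary):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Histogram of out-of-sample performance as a percentage of in-sample
// performance, one ratio per walk-forward window.
void ratioHistogram(const std::vector<double>& inSample,
                    const std::vector<double>& outOfSample) {
    const double lo = -200.0, hi = 200.0, width = 20.0;   // 20%-wide bins
    std::vector<int> bins((size_t)((hi - lo) / width), 0);
    for (size_t i = 0; i < inSample.size(); ++i) {
        if (inSample[i] == 0.0) continue;                 // skip degenerate windows
        double pct = 100.0 * outOfSample[i] / inSample[i];
        int b = (int)std::floor((pct - lo) / width);
        b = std::max(0, std::min((int)bins.size() - 1, b)); // clamp outliers to edges
        ++bins[b];
    }
    for (size_t b = 0; b < bins.size(); ++b)
        std::printf("%+6.0f%% .. %+6.0f%%: %d\n",
                    lo + b * width, lo + (b + 1) * width, bins[b]);
}
```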

-pip

This is worth a read if you get a second.

True Fact: The Lack of Pirates Is Causing Global Warming - Forbes

and then read this

Correlation does not imply causation - Wikipedia, the free encyclopedia

Sorry for the late answer; this time it was due to lack of time :wink:

I think doing it as an unsupervised learning method where there is little user input is a smart way of setting this up

Actually, it's the opposite: a heavily supervised learning method, though without pre-defined user input :slight_smile:

  1. For each strategy configuration you have an in sample and out of sample pnl.
  2. Sort by in sample pnl and assign an index number based on this index order
  3. Then sort by out of sample pnl and assign an index number based on this index order
  4. Scatter plot the in sample index by the out of sample index and you should get a straight line if the strategy is any good.

Ah, sounds great :slight_smile: Will implement this for sure (especially because it's just a few lines of code thanks to OOP :))

Thanks for the input!

  1. You need to look at all of the configurations you tested, not just the ones that made money. This is important for working out whether the correlation is significant.

Well, I just hope that a few hundred thousand samples are enough to ensure significant correlations.
The thing is, if I have a strategy with 10,000 different parameter combinations, a single evaluation would take 10 days or something like that…
Taking just 1k samples out of the ~1-2k >$0 candidates reduces the time significantly.

Do you think this is a problem?

i.e. what is the R^2 and the p value for the correlation.

Thinking about it as I have typed this, I have decided that you are probably already about 95% there. I think you just need to add, for instance, a simple least-squares regression to your in-sample/out-of-sample scatter plot and output the R^2 of the fit and the p value of the beta coefficient. You will find it is more robust if you do the percentile normalisation I suggested, since you will remove some of the noise caused by the non-linear differences between the in-sample and out-of-sample periods.

I am not so much into statistics, so I cannot judge whether this makes sense. Though I would like to discuss it with you. Skype? Add "darwin-fx" :slight_smile:

I would also expect your tool to say something like this. Say I have a strategy, and your tool told me it's good and that over the last 6 months I would have made 25 pips per day. My question would be: if I trade this for the next 3 months, what per-day performance can I expect? Would -50 pips be a bad day? Would 30 pips be a bad day? From the walk forward you can build a histogram of comparative percentages between in-sample and out-of-sample performance. This might say that normally the out-of-sample performance was within ±80% of the in-sample performance. This at least gives the user confidence, when they start trading the strategy forward, that "normal" is anywhere within ±20 pips of that 25.

Also a very nice feature suggestion which I will add for sure :slight_smile:

-Darwin

PS: Add me on skype :wink:

There is another book in that context: David Aronson, Evidence-Based Technical Analysis. It goes a little beyond Pardo. Aronson describes an algorithm that calculates a result expectancy for in-sample tests. The in-sample test result must be better than this expectancy to produce a significant out-of-sample result. So he can anticipate the out-of-sample result without actually doing an out-of-sample test.

Hey jcl365 :slight_smile:

Would you mind explaining the concept a bit further?

If I understand you right, he makes a lot of in-sample/out-of-sample tests, so he can then determine a threshold for in-sample fitness that indicates a good out-of-sample result?

So, for live trading, he can then run an in-sample optimisation and, based on the threshold, determine whether it is worth trading the system?

If so, it's basically the same thing as in this diagram:

[diagram: expected out-of-sample profit (y-axis) per in-sample profit range (x-axis)]

In this case, the first 3 bars cover an in-sample profit of >$4000 (or something like that).
So, the threshold I would use for this system would be something like ~$4k in-sample; otherwise it's not worth trading.

This is basically a way to measure current market conditions and determine whether they suit the system (e.g., a trend-following system would not reach its threshold during ranging markets).
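
If that reading is right, such a threshold rule could look roughly like this. A hypothetical sketch over the bucketed data; the Bucket layout and the rule itself are my assumptions, not Aronson's method:

```cpp
#include <vector>

struct Bucket {
    double isLow, isHigh;   // in-sample profit range covered by the bucket
    double avgOosProfit;    // average out-of-sample profit of its members
};

// Hypothetical rule: the threshold is the lower edge of the lowest bucket,
// scanning down from the highest in-sample profits, whose members still
// made money out of sample on average.
double threshold(const std::vector<Bucket>& sortedAsc) {
    double th = sortedAsc.back().isHigh;               // pessimistic default
    for (int i = (int)sortedAsc.size() - 1; i >= 0; --i) {
        if (sortedAsc[i].avgOosProfit <= 0.0) break;   // stop at first losing bucket
        th = sortedAsc[i].isLow;
    }
    return th;
}
// If the latest in-sample optimisation lands below this threshold,
// skip live trading until conditions improve.
```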

Did I understand this correctly? :slight_smile:

-Darwin

He does not make out-of-sample tests at all. All tests are in sample, thus using all available data. The algorithm calculates a threshold from the result distribution of the in-sample tests. In your diagram above, if I understand it right, the average in-sample profit is about 15%. Thus a threshold above 15% would mean that the system is not profitable at all.

I also use all the data (minus one in-sample timespan) for analysis. :wink:

But I fear I do not understand his approach.
OK, he gets a distribution of in-sample results. But if you do not make any out-of-sample tests, how can you say anything about the possible out-of-sample/live-trading performance of a system, based on in-sample data alone, when you do not even have any out-of-sample/live-trading data to learn from?
I would really like to understand this and hopefully learn from his approach :slight_smile:

-Darwin

PS: The diagram above just shows how much out-of-sample profit (y-axis) one can expect if the in-sample profit (x-axis) was in a given range.

The reason why in-sample tests are bad is selection bias. The best parameter combination is selected, and its out-of-sample result will of course be worse than its in-sample result.

You can determine the selection bias without doing any out-of-sample tests. For instance, it's clear that the selection bias gets worse the more parameter combinations you have, and the more their results differ. People have developed algorithms for determining the selection bias from the distribution of the in-sample results. There is no closed formula, but there is a computer algorithm, which is patented by the way. Two such algorithms are described in Aronson's book.
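
Those patented algorithms aside, the effect itself is easy to demonstrate with a toy Monte Carlo. This is only an illustration of selection bias, not Aronson's method:

```cpp
#include <algorithm>
#include <cstdio>
#include <random>

// Toy illustration of selection bias: N parameter combinations, each with
// ZERO true edge (pure noise backtest results). The best of N still looks
// better and better as N grows, even though nothing is actually profitable.
int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 1.0); // zero-mean results
    for (int n : {10, 100, 1000, 10000}) {
        const int trials = 200;
        double avgBest = 0.0;
        for (int t = 0; t < trials; ++t) {
            double best = -1e9;
            for (int i = 0; i < n; ++i)
                best = std::max(best, noise(rng));    // pick the "winner"
            avgBest += best / trials;
        }
        std::printf("%5d combinations -> average best result: %.2f sigma\n",
                    n, avgBest);
    }
}
```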

Sorry, I have been out of play; I've been busy not trading and just getting drunk in general.

Looking only at the samples that make money, or taking any subset, is an issue because it can bias your results. The question is not whether you can find a few strategies that make money among 100,000 configurations; the question is how many out of the 100,000 need to be profitable before you can be confident that you have not just overfit. This is where statistical significance is important.

You might expect that 5% will be profitable just by random chance. Thus, if you find 5000 configurations that make money, this is no big deal. Looking only at the ~1-2k > $0 candidates is dangerous, as you are only seeing a small part of the picture.
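
That chance level is testable: under the null hypothesis that each configuration is profitable with some probability p purely by chance, the count of profitable ones is binomial, and you can check how surprising the observed count is. A rough sketch using the normal approximation (the 5% and 100,000 are just the example above):

```cpp
#include <cmath>
#include <cstdio>

// One-sided p-value for "k of n configurations profitable" under the null
// hypothesis that each is profitable with probability p purely by chance.
// Normal approximation to the binomial; fine for n in the thousands.
double pValue(int n, int k, double p) {
    double mean = n * p;
    double sd   = std::sqrt(n * p * (1.0 - p));
    double z    = (k - mean) / sd;
    return 0.5 * std::erfc(z / std::sqrt(2.0));   // P(X >= k), approximately
}

int main() {
    // With p = 0.05, ~5000 of 100,000 profitable is no surprise at all,
    // while e.g. 6000 would be extremely unlikely by chance.
    std::printf("k = 5000: p-value %.3f\n", pValue(100000, 5000, 0.05));
    std::printf("k = 6000: p-value %.3g\n", pValue(100000, 6000, 0.05));
}
```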

New algorithms, new luck :smiley: Now I can freely decide whether to analyse candidates with negative in-sample results or not, without an increase in runtime.

Though it generates an awful lot of data if I do: around 160,000 in-sample/out-of-sample datapoints without the negative in-sample candidates, and millions with them.

Not very practicable with today's computational power; perhaps in a few years :slight_smile:

-Darwin

When reading this thread, I'm impressed with the time and knowledge that went in here, and I found it a very informative read. It seems to have started with Darwin releasing a WFT tool as open source. Is this available somewhere? I think the more people start experimenting with this tool, the more feedback will come in…

http://forums.babypips.com/expert-advisors-automated-trading/60069-darwins-walk-forward-analyzer-v0-1-free-open-source.html#post577330

Though I still did not have time to clean up and release the source, so if you want it, add me on Skype: Darwin-FX :wink:

-Darwin