Statistical Arb/Pairs trading strategy!

I suggest learning the urca and vars packages as they have very good implementations of the functions you will need to perform Johansen cointegration and also run the augmented Dickey-Fuller (ADF) tests for validation.

I also highly suggest Bernhard Pfaff’s book Analysis of Integrated and Cointegrated Time Series with R. It is more of a workbook than a textbook: it shows how to use the functions within R and how to interpret the results, but not how to apply them to pairs trading. So you will need to do some supplemental learning of R (there are some time series introductions for R) and of the concepts of cointegration (Carol Alexander). For the pairs trading concept you could look at something like chapter 2 from: Pairs Trading, Convergence Trading, Cointegration

Then you need a basic methodology for putting the concepts together. Something like this: (you might find other resources through searching)
http://numericalmethod.com/papers/course1/lecture3.pdf

I see, the r=0 line is rejecting a zero rank at the 1% level so there is significant statistical evidence that there must be one (or more) relations. The r<=1 line then tells us there is only one. I also found these notes useful (for anyone else reading):
http://www.ualberta.ca/~sfossati/e509/files/class/Lec11.pdf

Out of curiosity, what was the time period you used? Hourly? And are you trading any of these relations? I’m skeptical of simple two-factor cointegration in FX but still undecided about multi-factor.

Time to learn RPy…

I was using 1-minute data. No, I’m not trading any of them. They tend to be longer term in nature and my bent is toward shorter term strategies that can produce a smoother equity curve. In my experience, when cointegration is applied to smaller amounts of data it tends to fall apart out of sample. My strategies are more similar to the ideas discussed in this thread based on empirical pairs trading. But I must admit I’m spending much more time in development than in trading. My takeaway is the idea that you can apply coefficients to time series and create a more normalized spread line. If any are interested in trading using cointegration based methods I suggest taking a look at Ernie Chan’s blog:
Quantitative Trading

It sounds like you’ve got the interpretation figured out, but in case anyone else is needing a bit more step by step examples, see these two on interpretation of the Johansen Cointegration results:
FAQs on Johansen’s Cointegration test
How do I interpret Johansens’ test results?

r	Test stat	90%	95%	99%
0	118.2650173	31.2379	33.8777	39.3693
1	114.8008395	25.1236	27.5858	32.7172
2	102.5224266	18.8928	21.1314	25.865
3	101.2230094	12.2971	14.2639	18.52
4	79.41884747	2.7055	3.8415	6.6349

So, what do we conclude by looking at the output? We go down one row at a time and compare the test statistic with the critical values. 118.26 is more than the critical values 31.2379, 33.8777, 39.3693, which means we reject the hypothesis that there are 0 cointegrating vectors; there is at least 1. So let’s move on to the 2nd row. 114.8 is above the critical values 25.1236, 27.5858, 32.7172, so we also reject the hypothesis that there is at most 1 cointegrating vector; there could be 2 or more. We move on to the 3rd row, and the hypothesis is rejected again.

When the test statistic exceeds the critical value, the null hypothesis is rejected. As I understand it, the null hypothesis in the r = 0 row is that there are zero cointegrating relations. When the test statistic exceeds the critical value we can no longer say that there are zero cointegrating relations, so we conclude that there is at least 1 and continue with the next row until we reach one where the null hypothesis cannot be rejected.
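For anyone who prefers to see the row-by-row reading as code, here is a minimal sketch in plain Python (not R, just to keep it dependency-free). The function name and the dictionary layout are made up for illustration; the numbers are the two-series trace output quoted later in this thread, reordered so that r = 0 comes first.

```python
def johansen_rank(rows, level="5pct"):
    """Walk down the Johansen rows (r = 0, r <= 1, ...) and return the first r
    whose null hypothesis is NOT rejected; that r is the estimated rank."""
    for r, (stat, crit) in enumerate(rows):
        if stat <= crit[level]:   # fail to reject "at most r relations"
            return r
    return len(rows)              # every null rejected: full rank

# Two-series output from later in the thread (trace test), r = 0 row first.
eu_gu = [
    (35.75, {"10pct": 22.76, "5pct": 25.32, "1pct": 30.45}),  # r = 0
    (9.21,  {"10pct": 10.49, "5pct": 12.25, "1pct": 16.26}),  # r <= 1
]
print(johansen_rank(eu_gu, "1pct"))
```

If every row is rejected, as in the five-row table above, the sketch returns the number of series, which is usually read as the series already being stationary on their own.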

I wanted to reiterate a couple of points in this post that might have been overlooked by some.

I’ve been thinking about this post a lot. This is an important concept and I thank richardtannermassingill for making it. It’s sort of the inconvenient stat arb truth! But knowing it can be a means for building more stable strategies that work over the long term through time series engineering.

When A/N was not trending you could count on pretty consistent profits. Now I pay attention to A/N.

I’ve always felt that the most important thing is first creating a deterministic system - one that is predictable a priori. For instance, a sine wave is predictable. In system development having a static mean and being bounded (static standard deviation) is more than enough to make the outcome predictable. But prices don’t come that way so they must be manipulated through a bit of engineering to create a better spread upon which a simple mean reversion strategy can be applied.

Of course, some simply follow Kelton’s strategy, and if they size and manage their positions properly they can do very well.

If you analyse the issue of the trending/ranging cross using the cointegration pair trading technique it’s easy to see what’s going on. Standard pair trading (Engle-Granger, linear regression, no y-intercept) fits the equation, y = beta*x,

e.g. with y = EUR-USD, x = GBP-USD, we use linear regression to fit, EU = beta*GU.

Or we can rearrange this to get, beta = EU/GU = EUR-GBP. So from the outset we expect the beta from the linear regression to be similar to EUR-GBP.

Let’s do the calculation… Here I use minute data, and do a linear regression over the past 1440 bars (~1 day). I’ve plotted EUR-GBP (blue), the value of beta from the regression (green), and for comparison the SMA of EUR-GBP over the past 1440 bars as well.
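A rough way to reproduce the comparison without the chart, as a sketch in plain Python. The price series are synthetic and the function names are made up; the only thing taken from the post is the no-intercept regression y = beta*x over a 1440-bar window.

```python
import math

def beta_no_intercept(x, y):
    """OLS slope of y = beta * x with no intercept: sum(x*y) / sum(x*x)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def sma(values):
    """Simple moving average over the whole window passed in."""
    return sum(values) / len(values)

# Hypothetical, synthetic minute data: GU wobbles around 1.55 and the
# EUR-GBP cross wobbles around 0.80; EU is implied by EU = cross * GU.
n = 1440
gu = [1.55 + 0.01 * math.sin(i / 50) for i in range(n)]
cross = [0.80 + 0.005 * math.sin(i / 35) for i in range(n)]
eu = [c * g for c, g in zip(cross, gu)]

beta = beta_no_intercept(gu, eu)
print(round(beta, 4), round(sma(cross), 4))
```

The regression beta is a weighted average of the cross (weights GU squared), so as long as GU does not move much inside the window it lands very close to the plain SMA of EUR-GBP.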


Standard cointegration pair trading says to trade when the spread, S = y - beta*x, widens. We can rewrite this equation as S/x = y/x - beta, or using currency notation, S/GU = EUR-GBP - beta.

Therefore we trade when the current EUR-GBP is a long way from the current beta. But the plot shows that beta is a proxy for the SMA of EUR-GBP. So the signals from cointegration pair trading are equivalent to just looking at a long run SMA of the cross-rate and trading when there is a deviation. When the cross trends, the strategy will break down because you’re always waiting for the cross to return to its long-term SMA.

This is why I don’t have great hope for pair trading in FX. Just trade a long-term SMA of the cross and be done with it. But if we jazz things up with multivariate cointegration then things could get interesting.
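The "just trade a long-term SMA of the cross" idea can be sketched in a few lines of plain Python. The window and threshold below are placeholders, not values anyone in the thread has tested, and the function name is made up.

```python
def sma_deviation_signal(cross, window, threshold):
    """Return -1 (sell the cross), +1 (buy the cross) or 0 (no trade),
    based on how far the latest value sits from its simple moving average."""
    if len(cross) < window:
        return 0
    recent = cross[-window:]
    deviation = cross[-1] - sum(recent) / window
    if deviation > threshold:
        return -1   # cross is above its average: bet on a fall
    if deviation < -threshold:
        return 1    # cross is below its average: bet on a rise
    return 0

# Hypothetical EUR-GBP values: flat at 0.8000, then a jump to 0.8100.
history = [0.8000] * 9 + [0.8100]
print(sma_deviation_signal(history, window=10, threshold=0.0050))
```

Note that nothing here protects against a trending cross; the signal simply keeps pointing back at the average, which is exactly the failure mode described above.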

Excellent post!

Very interesting analysis and presentation! I can confirm the results you show in the picture. This is essentially a visual representation of what I posted previously:

In system development having a static mean and being bounded (static standard deviation) is more than enough to make the outcome predictable. But prices don’t come that way so they must be manipulated through a bit of engineering to create a better spread upon which a simple mean reversion strategy can be applied.
In this case a better spread is one that is more stationary, bounded, and predictable. It’s funny really, that so many spend so much time developing and testing systems, when they should be spending their time with the data, and then just applying simple systems to that engineered data.

I am interested in the best way to make an indicator for an EA that can be backtested over 10 years to test this stuff. The best I can come up with is based upon an MA, where we just measure the distance of price from it and use a long period. Here are the results of the Euro with a 1000 and 6000 EMA:




The code I used in thinkorswim is given. The 6000 EMA is better since it doesn’t produce as much divergence. But since there will probably be some, it may be better to buy one then average in a certain amount of pips rather than to buy one at -6 then average in at -7 (since again the indicator may show divergence and not get to -7). When price goes back to 0 on the indicator it means it has touched the 6000 EMA. That is the normal exit of the strategy, a mean reversion.

So you would get a line for GBPUSD as well and then measure the difference between the Euro’s line and it. I’ll show this in my next post. The timeframes change the readings, so you would have to develop OB/OS levels for each timeframe you trade. Still, this way the method can be backtested.

Here the green line is the 6000 EMA and the blue line is the 1000 EMA on the upper chart. The first indicator is based on the 1000EMA and the second the 6000 EMA. EURUSD is the purple line, GBPUSD is the red line, and the yellow line is the difference between them. This yellow line is the important one, as it tells the difference between EURUSD and GBPUSD. The entry point would be at some OB/OS level, and the general exit point is when the yellow line reverts to 0.
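The thinkorswim script itself is not reproduced in the thread, so the following is only a guess at a comparable calculation, written in plain Python: distance of price from a long EMA in pips for each pair, and the difference between the two lines (the yellow line). The EMA seeding, the pip size and the sample closes are all assumptions.

```python
def ema(prices, period):
    """Exponential moving average with alpha = 2 / (period + 1),
    seeded with the first price."""
    alpha = 2.0 / (period + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def distance_from_ema(prices, period, pip=0.0001):
    """Price minus its EMA, in pips; 0 means price is sitting on the EMA."""
    return [(p - e) / pip for p, e in zip(prices, ema(prices, period))]

def line_difference(a, b):
    """Element-wise difference between two indicator lines."""
    return [x - y for x, y in zip(a, b)]

# Hypothetical closes; a short period keeps the example readable.
eurusd = [1.2400, 1.2410, 1.2420, 1.2405, 1.2390]
gbpusd = [1.5500, 1.5505, 1.5510, 1.5520, 1.5530]
eu_line = distance_from_ema(eurusd, period=3)
gu_line = distance_from_ema(gbpusd, period=3)
yellow = line_difference(eu_line, gu_line)
print([round(v, 2) for v in yellow])
```

With period set to 1000 or 6000 and real data, the same functions would give something like the two indicator panes described here, though the scale will depend on the timeframe.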


In this second chart I just dropped the data of the two lines and only kept the difference between them. In the first indicator graph EURUSD is the 0 line, the red line is the GBPUSD difference from EURUSD, and the green line is the AUDUSD difference from EURUSD.

The second indicator graph is the difference between GBPUSD as the 0 line and AUDUSD.

The third indicator graph is the difference between USDCHF as the 0 line and USDCAD.


The larger the EMA input, the less frequently the line reverts to 0, but the less divergence there will be on the charts.

Lastly, has anyone thought about inverting USDCHF and trading the EURUSD USDCHF difference? This might work well. I made an indicator of the difference below. You would have made 44 pips on the trade I detailed, where both EURUSD and USDCHF would have been sold. A 31 pip loss on the latter and a 75 pip gain on the former.

It might be naive, but it seems to me since these pairs are so correlated there is a better chance to achieve the desired mean reversion. In the trade we sold EUR bought USD, sold USD bought CHF. This seems like it is simply selling EURCHF, but I tallied the trade and only came to 25 pips profit. Perhaps my calculations are off, but it doesn’t seem they would be off by that much.
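One possible reason the tallies disagree: EURUSD pips are counted in USD and USDCHF pips in CHF, and equal lot sizes on both legs leave some USD exposure over, so adding pips across the two legs is not the same as measuring the move in EURCHF. A small Python sketch with made-up prices (chosen only to reproduce the 75 pip gain and 31 pip loss) shows the gap:

```python
PIP = 0.0001

def short_pips(entry, exit_price):
    """Pips gained on a short position (positive when price falls)."""
    return (entry - exit_price) / PIP

# Hypothetical prices, not the actual trade.
eurusd_entry, eurusd_exit = 1.2400, 1.2325   # short gains 75 pips
usdchf_entry, usdchf_exit = 0.9700, 0.9731   # short loses 31 pips

leg_sum = (short_pips(eurusd_entry, eurusd_exit)
           + short_pips(usdchf_entry, usdchf_exit))

# Synthetic cross: EURCHF = EURUSD * USDCHF.
eurchf_entry = eurusd_entry * usdchf_entry
eurchf_exit = eurusd_exit * usdchf_exit
cross = short_pips(eurchf_entry, eurchf_exit)

print(round(leg_sum, 1), round(cross, 1))
```

With these invented prices the two numbers differ by several pips, so a mismatch between the leg-by-leg tally and the EURCHF tally is expected rather than a calculation mistake. The exact size of the gap depends on the actual entry and exit prices.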


I installed R, and loaded the library urca and others, I’m able to do the lm <- … but I’m still very newbie on R heheheh

May I ask you how do you create the “p” object/variable/array/whatever you are using in your call for ca.jo?

Thanks.

Hey Guys

I’ve been using this system for a while, but I have been losing on almost all the trades I’ve made since trying it out. I must be doing something wrong. I think it’s because I’m not overlaying EU and GU properly and am making bad decisions on when to enter (using Kelton’s method). After I enter on the gap between EU and GU, it seems to widen immediately, increasing my losses and most of the time taking me to a margin call. I know this thread has advanced by leaps and bounds and it’s a bother for you regulars for me to ask this, but could any kind soul help? You don’t have to answer if you don’t want to :slight_smile: but a pointer to the right post or entry would be really great :slight_smile: Thanks for taking the time to read this and for understanding :slight_smile:

Medisoft, I’m using StatConnector to interface through COM to R from both VB6 and C#. You can also use 7Bit’s mt4r.dll / mt4R.mqh if you’re connecting through C# or VB6 (I translated the mt4r.dll interface for C# and VB6 and VBA ).

From MT4 your only option is mt4r.dll due to MT4’s lack of COM support. The sample code for sending data to R with mt4R.mqh is located in arbomat in the onOpen method.

   int i, ii, j;
   int ishift;
   
   // if any pair has less than back bars in the history
   // then adjust back accordingly.
   back = back_bars;
   now = now_bars;
   
   if (ObjectGet("back", OBJPROP_TIME1) != 0){
      back = iBarShift(NULL, 0, ObjectGet("back", OBJPROP_TIME1));
   }
   
   if (ObjectGet("now", OBJPROP_TIME1) != 0){
      now = iBarShift(NULL, 0, ObjectGet("now", OBJPROP_TIME1));
   }
   
   if (now >= back){
      now = 0;
   }
   
   for (i=0; i<pairs; i++){
      if (iBars(symb[i], 0) < back){
         back = iBars(symb[i], 0) - 2; // use the third last bar.
         Print(symb[i], " has only ", back);
      }
   }
   
   ArrayResize(coef, pairs);
   ArrayResize(prices, pairs);
   ArrayResize(regressors, back * pairs);
   ArrayResize(pred, back);
   Ri("back", back);
   Ri("now", now);
   Ri("pairs", pairs);

The code above sets up the basic variables back, now and pairs in both MT4 and R:
back is the lookback in bars; if any pair has less history than that, it is cut to that pair’s bar count minus 2.
now is where the in-sample period ends and the out-of-sample period starts.
pairs is simply the number of pairs being analyzed, used for index counting.

Medisoft, continued…

Rm("regressors", regressors, back, pairs);

Right after the main looping routine (this forum won’t let me post the code block) is the line above which populates the prices from MT4 to the regressors matrix. In MT4 the multi dimensional array is simply done as a single dimensional array but with index counting (i * pairs) to keep track of the columns. In R the data format is A[rows, columns] where rows = data points and columns = symbols. Think of an Excel sheet with rows and columns for a visual with a new symbol in each column.

The Rm command sends the entire matrix to R in the “regressors” variable where it is split into pairs number of columns like this: regressors[data, pairs].
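To make the index counting concrete, here is a small sketch in plain Python of a flat array that stands in for a rows x columns matrix. It assumes the bar-by-bar layout described above (position i * pairs + j for bar i, symbol j); the helper names and prices are invented and this is not the arbomat loop itself.

```python
def flat_index(row, col, n_cols):
    """Position of element [row, col] in a row-by-row flattened matrix."""
    return row * n_cols + col

def unflatten(flat, n_rows, n_cols):
    """Rebuild the rows x columns layout from the flat array."""
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

# Hypothetical closes for 3 bars of 2 symbols (EU, GU), stored bar by bar.
back, pairs = 3, 2
regressors = [0.0] * (back * pairs)
closes = {"EU": [1.24, 1.25, 1.26], "GU": [1.55, 1.56, 1.57]}
for i in range(back):
    for j, name in enumerate(["EU", "GU"]):
        regressors[flat_index(i, j, pairs)] = closes[name][i]

print(unflatten(regressors, back, pairs))
```

Each inner list is one bar (one row of the R matrix) and each position within it is one symbol (one column).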

When you do ca.jo it requires that there are column names for your matrix. You can add column names as below (assume 3 pairs in the regressors variable):

colnames(regressors)=c('EU','AU','EG')

If you don’t have column names on your matrix, ca.jo will throw an error. p is the same as regressors in this case.

Also see this line of code within the arbomat plot method. It saves the R workspace so you can load the data manually and work in the R GUI. This makes exploring the functions much easier, and it lets you use MT4’s data.

//Rx("save.image(\"" + SNAPSHOTS + "arbomat.R\")");

Give an example of what positions you took (long/short) with EU and GU, when you took them, and the size vs the account size.

Size your positions down significantly if you’re getting margin calls. Plan on several hundred pips of drawdown per pair. If you add to initial positions it will use up your margin faster so you must size down even further.
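As a back-of-the-envelope illustration of sizing for a few hundred pips of drawdown, here is a plain Python sketch. Every number in it is hypothetical (account size, risk budget, pip value), and the formula is just one simple way to budget for the worst case, not a rule from this thread.

```python
def max_lots(account, risk_fraction, drawdown_pips, pip_value_per_lot, n_legs):
    """Largest lot size per leg such that `drawdown_pips` of adverse movement
    on every leg at once costs no more than `risk_fraction` of the account."""
    budget = account * risk_fraction
    worst_case_per_lot = drawdown_pips * pip_value_per_lot * n_legs
    return budget / worst_case_per_lot

# Hypothetical: $10,000 account, 20% risk budget, 300 pips of drawdown per
# leg, $10 per pip per standard lot, two legs (EU and GU).
print(round(max_lots(10_000, 0.20, 300, 10.0, 2), 2))
```

If you plan to add to the initial position, divide the result again by the number of entries you expect to make.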

Thanks!

I found this method (get.hist.quote, from the tseries package):

eu <- get.hist.quote(instrument="EUR/USD",start="2011-06-01",end="2012-08-08",quote="C",provider="oanda");

It gets daily quotes from Oanda. It also supports Yahoo Finance, which is pretty good for looking at stocks with cointegration.

:slight_smile:

I did this test with eu and gu, and this formula

m = ca.jo(ecb.data, type=c('trace'), ecdet=c('trend'), K=2, spec=c('transitory'))

and obtained these results


###################### 
# Johansen-Procedure # 
###################### 


Test type: trace statistic , with linear trend in cointegration 


Eigenvalues (lambda):
[1] 5.945249e-02 2.104065e-02 2.081668e-17


Values of teststatistic and critical values of test:


          test 10pct  5pct  1pct
r <= 1 |  9.21 10.49 12.25 16.26
r = 0  | 35.75 22.76 25.32 30.45


Eigenvectors, normalised to first column:
(These are the cointegration relations)


                 eu.l1        gu.l1     trend.l1
eu.l1     1.0000000000 1.0000000000  1.000000000
gu.l1    -0.9736060242 1.7544577315 -2.044253855
trend.l1  0.0003800921 0.0006580745 -0.001881093


Weights W:
(This is the loading matrix)


           eu.l1        gu.l1      trend.l1
eu.d -0.03582426 -0.011908037  3.078237e-17
gu.d  0.03627223 -0.009927627 -1.123751e-17



I don’t yet know how to interpret these results hehehehe, but at least I obtained something like yours. I suppose the documents you posted before explain how to understand them.

Can you also tell me what you are looking for in this output?

Thanks.

I’m seeing a lot of references to I(1), but none of them tell me what it is.

Can you help by telling me what it means?

Thanks

It means integrated of order 1: the series has to be differenced once to become stationary. I(0) is integrated of order 0, i.e. already stationary. Generally price series are I(1), so they need to be manipulated (differenced, or combined with other series via cointegration) to get something stationary. You validate this with the ADF test.
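A toy illustration of I(1) versus I(0), in plain Python with simulated data (no real prices): a random walk is built by adding up random shocks, and differencing it once hands back the shocks, which are stationary.

```python
import random

def difference(series):
    """First difference: x[t] - x[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]

random.seed(0)                     # fixed seed so the run is repeatable
shocks = [random.gauss(0, 1) for _ in range(500)]

walk = [0.0]                       # random walk: each value adds one shock
for s in shocks:
    walk.append(walk[-1] + s)

recovered = difference(walk)       # differencing once gives the shocks back
print(max(abs(r - s) for r, s in zip(recovered, shocks)) < 1e-9)
```

The walk itself wanders without a fixed mean (I(1)); its first difference has a constant mean and variance (I(0)). The ADF test is the formal check for which case you are in.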

I did this test with eu and gu, and this formula

m = ca.jo(ecb.data, type=c('trace'), ecdet=c('trend'), K=2, spec=c('transitory'))

and obtained these results


###################### 
# Johansen-Procedure # 
###################### 


Test type: trace statistic , with linear trend in cointegration 


Eigenvalues (lambda):
[1] 5.945249e-02 2.104065e-02 2.081668e-17


Values of teststatistic and critical values of test:


          test 10pct  5pct  1pct
r <= 1 |  9.21 10.49 12.25 16.26
r = 0  | 35.75 22.76 25.32 30.45

You have one cointegrating relation. The r = 0 null is rejected with 99% confidence because the test statistic (35.75) exceeds the 1% critical value (30.45), while r <= 1 cannot be rejected (9.21 is below even the 10% critical value of 10.49). Granger causation can flow both from EU to GU and from GU to EU. In this case it only flows one way. I’m not sure if you can tell which way from this printout.


Eigenvectors, normalised to first column:
(These are the cointegration relations)

                 eu.l1        gu.l1     trend.l1
eu.l1     1.0000000000 1.0000000000  1.000000000
gu.l1    -0.9736060242 1.7544577315 -2.044253855
trend.l1  0.0003800921 0.0006580745 -0.001881093

To plot what this looks like take 1.0 * EU - 0.9736…*GU. To remove the linear trend as well, add trend.l1 times the time index t; with ecdet='trend' the trend.l1 value is the coefficient on a linear trend inside the relation, not an intercept.

You can access the values through the V slot of your result, m@V. It is a matrix just as it appears above, 3 rows by 3 columns, so it is indexed up to m@V[3,3]. To get the relevant values it will be

ecb.data[,1] * m@V[1,1] + ecb.data[,2] * m@V[2,1], optionally + m@V[3,1] * t
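The same calculation spelled out in plain Python, as a sketch. The weights are the first column of the eigenvector matrix printed above; the closes are invented, and treating trend.l1 as a coefficient on a time index starting at 1 is an assumption based on ecdet='trend'.

```python
# First column of the eigenvector matrix printed above (m@V[,1] in R).
W_EU, W_GU, W_TREND = 1.0, -0.9736060242, 0.0003800921

def johansen_spread(eu, gu, t_start=1):
    """Spread = 1.0*EU - 0.9736*GU + 0.00038*t for each bar.
    t_start is an assumption about how the time index is counted."""
    return [W_EU * e + W_GU * g + W_TREND * (t_start + i)
            for i, (e, g) in enumerate(zip(eu, gu))]

# Hypothetical daily closes, for illustration only.
eu = [1.2400, 1.2450, 1.2380]
gu = [1.5500, 1.5560, 1.5490]
print([round(s, 4) for s in johansen_spread(eu, gu)])
```

If the relation holds, this spread should look roughly stationary when plotted, and running the ADF test on it is the usual validation step.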

If I understand correctly, if I have 2 or more series that are each integrated of order 1, I(1), and I find an equation (a vector of coefficients) that combines them into a new series that is I(0), then the original series are cointegrated.

Am I right?

If yes, then all I need to do is look for instruments that are I(1), then solve for the coefficients that form a new I(0) series.

That seems “easy” hhahheee
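A toy version of that idea in plain Python, using simulated data only: two series share the same random walk, so each one wanders, but the right combination cancels the walk and leaves bounded noise. The coefficient 2.0 and the noise sizes are made up for the example.

```python
import random

random.seed(1)                                  # repeatable run
n = 500
common = [0.0]                                  # shared random walk, I(1)
for _ in range(n - 1):
    common.append(common[-1] + random.gauss(0, 1))

noise_x = [random.gauss(0, 0.1) for _ in range(n)]
noise_y = [random.gauss(0, 0.1) for _ in range(n)]
x = [c + e for c, e in zip(common, noise_x)]            # I(1)
y = [2.0 * c + e for c, e in zip(common, noise_y)]      # I(1)

# The vector (1, -2) cancels the shared walk, leaving only bounded noise.
spread = [b - 2.0 * a for a, b in zip(x, y)]
print(round(max(abs(s) for s in spread), 3))
```

In real data the coefficients are unknown and have to be estimated (Engle-Granger regression or Johansen), and the hard part is that they rarely stay fixed.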

Yeah easy :stuck_out_tongue:

Make sure you try walk forward testing too. I haven’t had much luck finding stable relationships that persist over time.
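A bare-bones walk-forward loop, sketched in plain Python: fit the coefficient on one window, then look at the spread only on the bars that follow. The window sizes, the no-intercept fit and the data are placeholders; the series is built so that the true ratio shifts halfway through, to show what an unstable relationship looks like.

```python
def fit_beta(x, y):
    """No-intercept OLS slope of y on x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def walk_forward(x, y, train, test):
    """Fit beta on each `train`-bar window, then measure the spread
    y - beta*x on the following `test` bars only (out of sample)."""
    results = []
    start = 0
    while start + train + test <= len(x):
        mid, end = start + train, start + train + test
        beta = fit_beta(x[start:mid], y[start:mid])
        oos = [b - beta * a for a, b in zip(x[mid:end], y[mid:end])]
        results.append((beta, sum(oos) / len(oos)))
        start += test            # roll the window forward
    return results

# Hypothetical series where the true ratio shifts from 0.80 to 0.90 halfway.
x = [1.5] * 40
y = [0.80 * 1.5] * 20 + [0.90 * 1.5] * 20
for beta, mean_oos in walk_forward(x, y, train=10, test=10):
    print(round(beta, 3), round(mean_oos, 3))
```

In the window that straddles the shift, the out-of-sample spread no longer averages zero, which is the kind of breakdown to watch for with real pairs.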