Statistical Arb/Pairs trading strategy!

I installed R and loaded the urca library along with some others. I'm able to do the lm <- … part, but I'm still very much a newbie with R heheheh

May I ask how you create the “p” object/variable/array/whatever you are using in your call to ca.jo?

Thanks.

Hey Guys

I've been using this system for a while, but I have been losing on almost all the trades I've made since trying it out. I must be doing something wrong. I think it's because I'm not overlaying EU GU properly and making bad decisions on when to enter (using Kelton's method). After I enter, the gap between EU and GU seems to immediately widen, piling up my losses and most of the time pushing me over my margin. I know this thread has advanced by leaps and bounds, and it's a real bother for you regulars here for me to ask this, but would any kind soul help? You don't have to answer if you don't want to :slight_smile: but a pointer to the right post or entry would be really great :slight_smile: Thanks for taking the time to read this and for understanding :slight_smile:

Medisoft, I'm using StatConnector to interface with R through COM from both VB6 and C#. You can also use 7Bit's mt4r.dll / mt4R.mqh if you're connecting from C# or VB6 (I translated the mt4r.dll interface for C#, VB6 and VBA).

From MT4 your only option is mt4r.dll due to MT4’s lack of COM support. The sample code for sending data to R with mt4R.mqh is located in arbomat in the onOpen method.

int i, ii, j;
   int ishift;
   
   // if any pair has less than back bars in the history
   // then adjust back accordingly.
   back = back_bars;
   now = now_bars;
   
   if (ObjectGet("back", OBJPROP_TIME1) != 0){
      back = iBarShift(NULL, 0, ObjectGet("back", OBJPROP_TIME1));
   }
   
   if (ObjectGet("now", OBJPROP_TIME1) != 0){
      now = iBarShift(NULL, 0, ObjectGet("now", OBJPROP_TIME1));
   }
   
   // "now" must lie inside the look-back window;
   // otherwise fall back to the current bar (no out-of-sample period).
   if (now >= back){
      now = 0;
   }
   
   for (i=0; i<pairs; i++){
      if (iBars(symb[i], 0) < back){
         back = iBars(symb[i], 0) - 2; // use the third last bar.
         Print(symb[i], " has only ", back);
      }
   }
   
   ArrayResize(coef, pairs);
   ArrayResize(prices, pairs);
   ArrayResize(regressors, back * pairs);
   ArrayResize(pred, back);
   Ri("back", back);
   Ri("now", now);
   Ri("pairs", pairs);

The code above sets up the basic variables back, now and pairs in both MT4 and R:
back is the number of bars of the shortest series (minus 2).
now is when the in-sample period ends and the out-of-sample period starts.
pairs is simply the number of pairs being analyzed, used for proper index counting.

Medisoft, continued…

Rm("regressors", regressors, back, pairs);

Right after the main looping routine (this forum won't let me post the code block) is the line above, which populates the regressors matrix with the prices from MT4. In MT4 the multi-dimensional array is simply done as a single-dimensional array, with index counting (i * pairs) to keep track of the columns. In R the data format is A[rows, columns], where rows = data points and columns = symbols. Think of an Excel sheet with rows and columns, with a new symbol in each column.

The Rm command sends the entire matrix to R as the “regressors” variable, where it is split into pairs columns, like this: regressors[data, pairs].
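As a purely illustrative side note (this is not the actual mt4R internals, and the symbol names and the i * pairs + j layout are assumptions), this is how such a flat, row-major array maps onto an R matrix with one symbol per column:

pairs <- 3
back  <- 5
flat  <- 1:(back * pairs)                   # dummy data standing in for prices

# byrow = TRUE reproduces the (i * pairs) index counting done in MT4
regressors <- matrix(flat, ncol = pairs, byrow = TRUE)
colnames(regressors) <- c('EU', 'AU', 'EG') # one symbol per column
dim(regressors)                             # back rows x pairs columns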

When you do ca.jo it requires that your matrix has column names. You can add column names as below (assuming 3 pairs in the regressors variable):

colnames(regressors)=c('EU','AU','EG')

If you don't have column names on your matrix, ca.jo will throw an error. p is the same as regressors in this case.
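For reference, here is a minimal, self-contained sketch of building the input matrix (the “p” asked about earlier, “regressors” here) and feeding it to ca.jo. The two series are simulated stand-ins, so treat the numbers as placeholders:

library(urca)

set.seed(1)
eu <- cumsum(rnorm(500, sd = 0.001)) + 1.25    # simulated stand-in for EUR/USD closes
gu <- 0.9 * eu + rnorm(500, sd = 0.002) + 0.45 # simulated stand-in for GBP/USD closes

p <- cbind(EU = eu, GU = gu)   # cbind() with names gives ca.jo the column names it needs
m <- ca.jo(p, type = 'trace', ecdet = 'trend', K = 2, spec = 'transitory')
summary(m)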

Also see this line of code within the arbomat plot method. It lets you save the R workspace so you can manually load the data and work within the R environment using the GUI. That makes exploring the functions much easier, with the ability to use MT4's data.

//Rx("save.image(\"" + SNAPSHOTS + "arbomat.R\")");

Give an example of what positions you took (long/short) with EU GU, when you took them, and the size versus the account size.

Size your positions down significantly if you're getting margin calls. Plan on several hundred pips of drawdown per pair. If you add to your initial positions it will use up your margin faster, so you must size down even further.

Thanks!

I found this method

library(tseries)
eu <- get.hist.quote(instrument="EUR/USD",start="2011-06-01",end="2012-08-08",quote="C",provider="oanda");

that gets the daily quotes from Oanda. It also supports Yahoo Finance, which is pretty good for looking at cointegration in stocks.

:slight_smile:

I did this test with eu and gu, using this formula

m = ca.jo(ecb.data, type=c('trace'), ecdet=c('trend'), K=2, spec=c('transitory'))

and obtained these results


###################### 
# Johansen-Procedure # 
###################### 


Test type: trace statistic , with linear trend in cointegration 


Eigenvalues (lambda):
[1] 5.945249e-02 2.104065e-02 2.081668e-17


Values of teststatistic and critical values of test:


          test 10pct  5pct  1pct
r <= 1 |  9.21 10.49 12.25 16.26
r = 0  | 35.75 22.76 25.32 30.45


Eigenvectors, normalised to first column:
(These are the cointegration relations)


                 eu.l1        gu.l1     trend.l1
eu.l1     1.0000000000 1.0000000000  1.000000000
gu.l1    -0.9736060242 1.7544577315 -2.044253855
trend.l1  0.0003800921 0.0006580745 -0.001881093


Weights W:
(This is the loading matrix)


           eu.l1        gu.l1      trend.l1
eu.d -0.03582426 -0.011908037  3.078237e-17
gu.d  0.03627223 -0.009927627 -1.123751e-17



I don't yet know how to interpret these results hehehehe, but at least I obtained something like you did. I suppose the documents you posted earlier contain the information on how to understand these results.

Can you also tell me what you are looking at in these results?

Thanks.

I'm seeing a lot of references to I(1), but none of them tell me what it is.

Can you help me by telling me what that is?

Thanks

It means integrated of order 1; I(0) is integrated of order 0. Generally price series are I(1), meaning the prices need to be differenced once, or combined via cointegration, in order to make them stationary. You validate this with the ADF test.
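If you want to see the I(1) vs I(0) distinction in R, here is a small sketch using the ADF test from the tseries package; the series is simulated, purely for illustration:

library(tseries)

set.seed(42)
price <- cumsum(rnorm(1000))   # a random walk, the typical I(1) case

adf.test(price)        # large p-value: the unit root is not rejected (non-stationary)
adf.test(diff(price))  # small p-value: the first difference is stationary, so price is I(1)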

I did this test with eu and gu, using this formula

m = ca.jo(ecb.data, type=c('trace'), ecdet=c('trend'), K=2, spec=c('transitory'))

and obtained these results


###################### 
# Johansen-Procedure # 
###################### 


Test type: trace statistic , with linear trend in cointegration 


Eigenvalues (lambda):
[1] 5.945249e-02 2.104065e-02 2.081668e-17


Values of teststatistic and critical values of test:


          test 10pct  5pct  1pct
r <= 1 |  9.21 10.49 12.25 16.26
r = 0  | [b]35.75[/b] 22.76 25.32 [b]30.45[/b]

You have one cointegrating relation, with 99% confidence, because the r = 0 test statistic (35.75) exceeds the 1% critical value (30.45), while the r <= 1 statistic does not. Granger causation can flow both from EU to GU and from GU to EU; in this case it only flows one way. I'm not sure if you can tell which way from this printout.
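If you prefer to read those numbers programmatically rather than from the printout, the fitted ca.jo object exposes them as slots (m being the object from the ca.jo call above):

summary(m)   # the full Johansen printout shown above
m@lambda     # eigenvalues
m@teststat   # trace statistics for r <= 1 and r = 0
m@cval       # 10pct / 5pct / 1pct critical values
m@V          # eigenvectors (the cointegration relations)
m@W          # the loading matrix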


Eigenvectors, normalised to first column:
(These are the cointegration relations)

                 eu.l1        gu.l1     trend.l1
eu.l1     1.0000000000 1.0000000000  1.000000000
gu.l1    -0.9736060242 1.7544577315 -2.044253855
trend.l1  0.0003800921 0.0006580745 -0.001881093

To plot what this looks like, take 1.0 * EU - 0.9736…* GU. To zero-center it, add the trend.l1 value (intercept).

You can access those values by referencing m@V. It is a matrix just as it appears above, 3 rows and 3 columns, so m@V[3,3] is the last element. To get the relevant values it will be

ecb.data[,1] * m@V[1,1] + ecb.data[,2] * m@V[2,1], with an optional + m@V[3,1]
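Putting that together, a small sketch of plotting the spread (assuming ecb.data is the two-column price matrix and m the ca.jo result from above):

# first cointegration relation: weight each series by the first eigenvector column
spread <- ecb.data[, 1] * m@V[1, 1] + ecb.data[, 2] * m@V[2, 1]

plot(spread, type = 'l', main = 'Cointegration spread (EU/GU)')
abline(h = mean(spread), col = 'red')   # eyeball the mean reversion around the centre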

If I understand well: if I have 2 or more series that are I(1), and I also find an equation (a vector of coefficients) that forms a new series that is I(0), then the original series are cointegrated.

Am I right?

If yes, then all I need to do is look for instruments that are I(1), then solve an equation that gives me the coefficients forming a new series that is I(0).
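To make that concrete, here is a toy sketch along those lines, with simulated data and adf.test from the tseries package standing in for a formal validation:

library(tseries)

set.seed(7)
x <- cumsum(rnorm(1000))   # I(1): a random walk
y <- 2 * x + rnorm(1000)   # also I(1), but tied to x by the vector (1, -2)

adf.test(x)          # unit root typically not rejected: x is I(1)
adf.test(y)          # same for y
adf.test(y - 2 * x)  # the combination is stationary (I(0)), so x and y are cointegrated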

That seems “easy” hhahheee

Yeah easy :stuck_out_tongue:

Make sure you try walk forward testing too. I haven’t had much luck finding stable relationships that persist over time.

You are right, I can't say which way the Granger causation goes hehehehe.

I ran the tests for NU and UCAD.


###################### 
# Johansen-Procedure # 
###################### 


Test type: trace statistic , with linear trend in cointegration 


Eigenvalues (lambda):
[1] 2.720459e-02 1.317239e-02 1.040834e-17


Values of teststatistic and critical values of test:


          test 10pct  5pct  1pct
r <= 1 |  5.74 10.49 12.25 16.26
r = 0  | 17.68 22.76 25.32 30.45


Eigenvectors, normalised to first column:
(These are the cointegration relations)


                 nu.l1        uc.l1     trend.l1
nu.l1     1.000000e+00 1.000000e+00  1.000000000
uc.l1     1.586876e+00 1.453832e-01  2.968044915
trend.l1 -3.856874e-05 4.016451e-05 -0.001635521


Weights W:
(This is the loading matrix)


           nu.l1       uc.l1      trend.l1
nu.d  0.01035114 -0.02225317  8.888060e-17
uc.d -0.03190065  0.01140776 -3.187419e-16



If I understand well, these are not cointegrated, because the r = 0 test statistic must exceed the r = 0 critical value at 1pct. Am I right?

EDIT ----------

I think I can now tell in which direction the Granger causality flows: from EU to GU


grangertest(eu,gu)
Granger causality test


Model 1: gu ~ Lags(gu, 1:1) + Lags(eu, 1:1)
Model 2: gu ~ Lags(gu, 1:1)
  Res.Df Df      F Pr(>F)
1    432                 
2    433 -1 1.7288 0.1893



grangertest(gu,eu)
Granger causality test


Model 1: eu ~ Lags(eu, 1:1) + Lags(gu, 1:1)
Model 2: eu ~ Lags(eu, 1:1)
  Res.Df Df      F  Pr(>F)  
1    432                    
2    433 -1 3.2275 0.07311 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Don’t know what it means yet hahahahaah.

If I understand well, these are not cointegrated, because the r = 0 test statistic must exceed the r = 0 critical value at 1pct. Am I right?

Yes that’s right but you could also get cointegration at the 95% or 90% confidence level. Also note shamanix’s comment regarding forward testing. It’s important to split up the test period between in sample and out of sample to get a feel for how stable the relationship is.

Also note the bit I posted earlier regarding correlation. Highly correlated pairs can trigger spurious regressions. They may pass the test and seem like they are cointegrated but they will generally fall apart. In this case you want to take the correlation of returns, not of the prices.
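For the correlation check, a quick sketch of the difference between correlating prices and correlating returns (eu and gu being the price vectors as before):

cor(eu, gu)              # correlation of raw prices, often spuriously high for trending series

ret_eu <- diff(log(eu))  # log returns
ret_gu <- diff(log(gu))
cor(ret_eu, ret_gu)      # correlation of returns, the number to actually look at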

And what about the Granger causality, did I get it right? hehehe

If I first check correlation, and find something between -0.69 and 0.69 that is also cointegrated at the 99% confidence level, that should be good, shouldn't it?

It looks like it, but you may want to test different values for the lag in grangertest, or my preference, granger.test (package MSBVAR). Check ?granger.test and note the following caveat:

Note also that these tests are highly sensitive to lag length (p) and the presence of unit roots.

I just ran a test where, with lag p=1, neither direction was significant but GU -> EU had the lower p-value. With higher lags, both are significant, with EU -> GU being more significant. I think you simply take this test as a feel for which way the Granger causation might be flowing, rather than thinking that x literally causes y, when in fact they may simply be moving together.
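Here is a rough sketch of checking that lag sensitivity yourself, with eu and gu as the price series used earlier (the MSBVAR call is left commented out in case the package isn't installed):

library(lmtest)   # grangertest()

for (p in 1:5) {
  gt <- grangertest(gu, eu, order = p)   # does GU Granger-cause EU?
  cat('lag', p, 'p-value', gt[['Pr(>F)']][2], '\n')
}

# library(MSBVAR)
# granger.test(cbind(eu, gu), p = 4)     # tests both directions at once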

I don't think 0.69 is a good cutoff for low correlation. That's still fairly high. 0.3 is a more conservative level.

And just because you find high confidence doesn't mean it will hold up out of sample. Don't forget the in/out-of-sample tests as a reality check, and use lots of data points in your tests so the quality of your cointegration estimate is better.

Can 2,000 data points be considered a lot? Or maybe 10,000?

I'm now testing cointegration on 15M bars, because I don't have enough daily data.

I think the amount of time is one aspect and probably the major one. How much time does your data cover?

The other aspect is the number of data points - more is generally better. But how much time and how many bars really depends on the extent of the relationship.

If a cointegrating relationship occurs over a period of several months, then at least that much data (whatever the number of bars) must be included in the test in order for the cointegration eigenvectors to properly reflect that long-term relationship.

10,000 bars may sound like a lot, but on the 1-minute timeframe it really isn't (just over a week).

Just keep in mind that cointegration is a long-term relationship and let that principle guide you. And don't forget to do in/out-of-sample testing to validate that you found a stable relationship. This is really easy to do in R.
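As a rough sketch of what that in/out-of-sample split could look like (ecb.data and the ca.jo settings as in the earlier posts; the 70/30 split is arbitrary):

library(urca)

n     <- nrow(ecb.data)
split <- floor(0.7 * n)                  # 70% in-sample, 30% out-of-sample

insample  <- ecb.data[1:split, ]
outsample <- ecb.data[(split + 1):n, ]

# estimate the cointegration relation on the in-sample data only
m <- ca.jo(insample, type = 'trace', ecdet = 'trend', K = 2, spec = 'transitory')

# apply the same eigenvector to both periods and compare the spreads
spread_in  <- insample[, 1]  * m@V[1, 1] + insample[, 2]  * m@V[2, 1]
spread_out <- outsample[, 1] * m@V[1, 1] + outsample[, 2] * m@V[2, 1]

plot(c(spread_in, spread_out), type = 'l', main = 'Spread: in-sample then out-of-sample')
abline(v = split, col = 'red')           # boundary between the two periods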

and from this point on, I don't understand a single thing!!! And to think that I thought I was quite accomplished at being able to code in MQL4!!

[B]medisoft[/B] and [B]FXEZ[/B], any chance of a simple explanation of what you are attempting to discover? And will it have any bearing on Kelton’s statistical arb/pairs or is this cointegration something completely different?

Stephen, I'm not sure there is a simple explanation of cointegration, any more than there is a simple PhD program. But I think everyone ought to at least learn the basic concepts of the two main schools of thought in the stat arb area: empirical and model based.

This thread's concept of pairs trading (Kelton's method) represents one way to interpret the empirical school of thought. Basically, instead of starting with a model before the fact and applying it to the market, the idea behind empirical stat arb is simply to adapt to the observed conditions, much as Medisoft earlier in the thread came up with the idea to use a 40-pip deviation and then a 200-pip SL as part of his strategy.

The model-based approach (cointegration) applies a mathematical model or models to the market, and strategies are developed based on the interpretation of the model's results. In contrast to Kelton's method, the arbomat program takes a model-centric approach using cointegration.

The basic idea behind regression (which is used in the cointegration model) is to apply weights to each term (pair), solving for the least sum of squared differences. In other words, regression attempts to find the most efficient model that minimizes variation. The net effect of performing cointegration from the trader’s point of view is to create a combined spread or new series that is more normalized, and should therefore be more predictable.
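As a bare-bones illustration of that least-squares idea (an Engle-Granger style regression rather than the Johansen procedure, with eu and gu as the price vectors used earlier):

fit <- lm(eu ~ gu)            # regress one series on the other to get the hedge ratio
summary(fit)$coefficients

# the residuals are the combined spread; if they are stationary, the weighted
# combination is the 'more normalized' series described above
spread <- residuals(fit)
plot(spread, type = 'l', main = 'Regression residual spread')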

Can you explain to me what this “in/out of sample” testing is?

Thanks.

By the way, I'm doing some research on stocks with R, looking for almost equal betas with low correlations and good cointegration, using stocks from the S&P 500, Russell 2000 and other indexes.

I can’t find any relation yet hehehehe .