Running correlations are a useful technique to explore how correlations between two variables vary in time or space. The correlation is calculated in a window of the first *n* observations, then the window is moved by one position, and the correlation recalculated. This is repeated until correlations are calculated for the whole data series. This procedure is analogous to the calculation of a running mean (also known as a moving average).

A running correlation can be calculated in R using the function running() in the gtools library. Here I look at the running correlation between 100 years of sunspot data and a random variable (the relevance of this will become apparent later) with an eleven year window.

library(gtools) x<-ts(sunspot.year[time(sunspot.year)>1888], start=1889)#100 years of sunspot data y=rnorm(100)# 100 random values from a normal distribution x11(height=5,width=7) par(mar=c(3,3,1,3),mgp=c(1.5,.5,0), mfrow=c(1,2)) plot(x, type="l", ylab="Sunspot number", xlab="Year CE") par(new=T)# treat the graph window as a new window, so a new graph can be plotted without erasing the old plot(1889:1988,y, col=2, ann=F, type="l", axes=F) axis(4) mtext(side=4, text="Random variable", line=1.5) rc<-running(x,y,fun=cor, width=11)#calculate the running correlation plot(1894:1983,rc, ylab="Correlation", type="l", xlab="Year CE")

The left-hand figure shows the sunspot number in black, and the random variable in red. The right-hand plot shows the running correlation between sunspot number and the random variable. As expected, there is no consistent correlation between the sunspot number and the random variable, but some windows have strong correlations of greater than 0.7.

It gets more complicated if significance levels are required. For a single correlation, the statistical significance of a Pearson correlation can be calculate using the t-distribution with n-2 degrees of freedom. This is a standard procedure, done in R with the function cor.test().

x<-rnorm(100)#100 random values from a normal distribution y<-rnorm(100)#more random data cor.test(x,y)

By default, cor.test() calculates the Pearson correlation, and the alternative hypothesis is two sided (i.e. large correlations, either positive or negative, will be significant). The output is like this

Pearson’s product-moment correlation

data: x and y

t = -0.8354, df = 98, p-value = 0.4055

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

-0.2759484 0.1142166

sample estimates:

cor

-0.084088

From the bottom, the output of cor.test() tells us the estimate of the correlation between x and y, the 95% confidence interval on that correlation, the alternative hypothesis, in this case a two sided test, and the p-value of this correlation. As the data here are random, the null hypothesis is true and there is a 5% chance that it will be incorrectly rejected – a Type I error.

Unfortunately, this procedure will not give the correct significance levels for a running correlation because of the multiple comparison problem. By testing for significance at the p=0.05 level, we have to be prepared to accept a 5% Type I error rate, a 5% chance of rejecting the null hypothesis even when it is true. If we run this test many times, the probability that at least one test returns a significant result gets very high very quickly and the Type I error rate is much higher than the expected 5%.

Before getting worked up about this, we need to determine how big a problem this is – how often is the null hypothesis being rejected. This can be tested by calculating running correlations for different datasets where the null hypothesis is true, and counting what proportion of times the naive significance levels are exceeded.

c2<-function(x,y){cor.test(x,y)$p.value} res<-replicate(1000, {#run the code in the {} brackets 1000 times y<-rnorm(100) z<-running(x,y, fun=c2, width=11)#function c2 calculates the p value of the correlation in each window sum(z<0.05)#tests if the p value is less that 0.05 }) mean(res>0)#find the proportion of trials where at least some windows appeared to have a significant correlation

This finds that over 80% of trials appears to have a statistically significant correlation in at least one window, even though the null hypothesis was true. This is clearly much larger than the expected 5% Type I error rate. The problem would be even larger if a longer data series was used, as there would be more correlations made.

There are procedures for controlling for the effects of multiple comparisons, such as the Bonferroni correction, where the critical p value is divided by the number of trials. The Bonferroni correction is usually conservative, and would be especially so for a running correlation as two overlapping windows are not independent.

The simplest solution is probably to run a Monte Carlo simulation (or perhaps by permuting the data) to determine the null distribution. We can calculate a running correlation for many different random variables (or permutations of one of the observed variables), and find the highest correlation for each trial.

res<-replicate(1000, { y<-rnorm(100) z<-max(running(x,y, fun=cor, width=11))# find maximum correlation }) quantile(res, 0.95) #critical of maximum correlation at p=0.05 one-sided test.

With 100 data points and a window width of eleven, this simulation gives a critical value of 0.83. Only if the maximum correlation in a running correlation is greater than this value is it statistically significant. Obviously this critical value will depend on the length of the data series (longer = higher) and the width of the window (wider = lower).

If a two sided test is needed, we could find the maximum absolute correlation, replacing max(…) with max(abs(…)) in the code above. This will increase the critical value slightly.

This exploration of the properties of running correlations was stimulated by a paper by Kokfelt and Muscheler that recently came online in The Holocene and linked to excitedly by WUWT. Kokfelt and Muscheler calculate

An 11 yr running correlation between summer precipitation

and sunspot numbers [that] reveals an overall positive correlation

after 1960 that, however, is statistically significant at the 95%

level only from 1975 to 1977 and from 1997 to 1998.

However they appear to use the standard significance levels applicable to a single correlation rather than a running correlation.

In view of the higher critical values shown by simulation above for conditions that approximate those of Kokfelt and Muscheler, their conclusion of a solar influence on instrumental record of summer precipitation at Abisko cannot be sustained. This does not affect the conclusions they draw from their palaeodata.

Pingback: On the NIPCC, the sun, moths and flames | Musings on Quantitative Palaeoecology

Very interesting — thanks!

Pingback: The Sun on the Nile: how many degrees of freedom? | Musings on Quantitative Palaeoecology

Pingback: Tibetan tree-rings and the sun | Musings on Quantitative Palaeoecology

Pingback: Diatoms, running correlations and solar variability | Musings on Quantitative Palaeoecology

I am very interested to know what you think about the methods used in Polyakova et al., 2006. Can you comment?

Changing relationship between the North Atlantic Oscillation and key North Atlantic climate parameters. doi:10.1029/2005GL024573

The methods look fine to me. They are using a running correlation to show that the correlation between the climatic variables is not constant. This is a good use of running correlation.

The papers I have been criticising have been using running correlations to find periods where the correlation between two variables appears to be significant but have not corrected for multiple testing.

This is a great introduction. I am using this method for my publication. However, one thing I got lost is that it appears the Figure 4 from Kokfelt and Muscheler has more than 100 data points, but their significant level (95) appears to be ~ 0.5, rather than 0.82 calculated by you and me. Do you have some ideas about the reason. Thanks.

I think the significance levels on figure 4 are based on n = 11, the number of observations in each segment of the running correlation.

Got it. I should read your blog more carefully. Thanks for your explanation.

Sorry for bringing this old thread up again but it has been very useful for me. Regarding the issues with calculating significance for a running correlation, do you have any references I could have a look at that explain the problem? I’m sure you understand that referencing a blog post is often not good enough for an academic journal…

I don’t know of any papers discussing problems with running correlations. I would have hoped the general problem I discuss here could be filed under “obvious” and not need citing, but given the number of papers that do it so badly, perhaps not. At least one paper has used the method I show here without citing it (I asked the author at a conference how they had done their running correlation, they replied that they had lifted the code from my blog).

I suppose at some stage I could write a note on problems with interpreting running correlation. Not entirely sure which journal would be appropriate. Meanwhile, try teaching journal editors about this new fangled thing call blogs…

Obvious to a statistician maybe but I have read many papers in hydrology/climate science using running correlation incorrectly. Indeed, I am a statistician by training but did not recognise the problem outlined here until I came across this blog… Anyway thank you for your help!

Hello Richard,

thanks for mentioning the multiple comparison problem, which is critically under-appreciated in paleoscience, leading to all kinds of spurious detections. Another issue that I am surprised you did not mention here is autocorrelation, which is often sizable for paleo timeseries (in part, due to persistence in climate/ecological systems and in part to archive-specific processes, like bioturbation in sedimentary archives). That often dramatically reduces the number of degrees of freedom available to a t-test, making results that assume $\nu = n-2$ woefully optimistic. Running correlations add to this issue in that neighboring windows share a large fraction of data, so adjacent “tests” could hardly be less independent. I thus fear that the problem is even more severe than you make it out to be.

You may be interested in this paper, which touches on some of those issues:

Thank you for the link to the paper – very useful.

I been asked a couple of time to write up this post (one of the most popular on my blog) as a paper. If I find time do this, I will certainly include autocorrelation as an additional problem. Autocorrelation is discussed in many other posts.

Hello Richard,

Recently I have been wondering about sliding correlations to apply them in the field of lifespan cognitive neuroscience and I came up with this great (old) post. It has been very useful given the lack of information about how to properly correct (for mult. comp and dependency) sliding window correlations.

I’d like to ask you a couple of questions: 1) would you be able to recommend some papers associated with this topic? and 2) all the corrections I’ve seen in sliding windows refer to a single comparison. Would it be sound to apply a “cluster-based” approach in which first, we detect a consecutive number of comparisons reaching a significant value (arbitrary; let’s say p = .05) and then testing whether the number of consecutive comparisons reaching significance is p < 0.05 by permuting the data and obtaining a distribution of the max. number of consecutive significant comparisons? I’d love to hear your thoughts about it

I’ve seen a paper discussing how to use running correlations to test if a correlation is stable, but I am not aware of any that discuss how to interpret running correlations that are occasionally significant (there will be equivalent tests in the spectral domain with wavelets that might have more coverage). Several people have asked if I plan on writing up this blog post as a paper – I do, once I have worked out what it all means – but it is not my highest priority.

Your proposed test would be very sensitive to autocorrelation in the data. That can of course be corrected for by changing the null test to use autocorrelated time series (with the same properties as the original data) rather than white noise. If this can be solved, then I think your test is reasonable, but it is quite remote from asking if two series are related.