It is time, as the walrus said, to talk about another paper reporting an effect of solar variability on the Earth’s climate. Laurenz et al. (2019) correlate European country-scale precipitation data from CRU with the number of sunspots, a proxy for solar activity. After allowing for a lagged response, they find some reasonably high and significant correlations in at least some months and some countries.

I do expect solar variability to affect climate, but I expect the effect to be quite small and difficult to detect above a background of internal variability and other forcings such as greenhouse gases and aerosols. Statistical traps abound, however, which makes me wary of papers that report strong and highly significant relationships. Especially when the authors write

The probability that the correlation r = 0.54 is by chance is less than 0.1% (p < 0.001).

This reflects a common misunderstanding of p-values. The co-authors, Horst-Joachim Lüdecke and Sebastian Lüning, the reviewers, and the editors don’t look good for letting this slip past them.
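A p-value is a statement about how often data at least as extreme would arise under the null hypothesis, not the probability that the observed correlation arose by chance. A quick simulation (my illustration, not from the paper) makes the distinction concrete: under the null, "p < 0.05" happens 5% of the time by construction.

```r
# Under the null of no relationship, p-values are uniformly distributed,
# so p < 0.05 occurs in ~5% of null datasets by construction. The p-value
# describes the frequency of extreme data under the null, not P(null | data).
set.seed(1)
p_null <- replicate(2000, cor.test(rnorm(30), rnorm(30))$p.value)
mean(p_null < 0.05)  # close to 0.05
```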

As the paper reports, monthly precipitation time series have very little autocorrelation and are almost white noise, so it seems unlikely that the strongly autocorrelated sunspot time series would be a good predictor.
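The contrast in persistence is easy to illustrate with simulated series (a sketch, with an AR(1) process standing in for the smooth sunspot cycle, not the actual data):

```r
# White noise (like monthly precipitation anomalies) has near-zero
# autocorrelation at all non-zero lags; a persistent AR(1) series
# (a stand-in for the sunspot cycle) retains a high lag-1 autocorrelation.
set.seed(1)
wn <- rnorm(1380)                                  # ~115 years of monthly values
ar <- arima.sim(model = list(ar = 0.9), n = 1380)  # strongly autocorrelated
acf(wn, plot = FALSE)$acf[2]  # lag-1 autocorrelation, near 0
acf(ar, plot = FALSE)$acf[2]  # lag-1 autocorrelation, near 0.9
```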

This is easily tested by looking at some data. Not surprisingly, the correlation between the raw Danish February precipitation and sunspots is weak (r = 0.07) and not remotely close to being significant (p = 0.43).
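With the series in hand, the raw correlation is a one-liner with `cor.test`. A minimal sketch with simulated stand-ins (the actual CRU and sunspot data are read in below; with the real series this gives the r = 0.07, p = 0.43 quoted above):

```r
# Sketch with simulated stand-ins for the 1901-2015 February series;
# the values below are placeholders, not the real data.
set.seed(42)
feb_precip   <- rnorm(115, mean = 40, sd = 20)  # stand-in for Danish Feb precipitation
feb_sunspots <- rnorm(115, mean = 80, sd = 60)  # stand-in for Feb sunspot numbers
ct <- cor.test(feb_precip, feb_sunspots)
ct$estimate  # correlation r
ct$p.value   # its p-value
```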

So how do Laurenz et al. get to report this correlation as 0.42 (p < 0.01)? Predictably, they smooth the data and allow for lagged responses. Using their methods, I get approximately the same correlation. But what about the significance? Many papers that smooth the data or allow for lagged responses (a form of multiple testing) fail to correct the p-value and then find spuriously significant results. Laurenz et al. don't make that mistake – they use a Monte Carlo procedure, applying their method to 10,000 surrogate time series to find the distribution of p-values under the null hypothesis. They report that for |r| > 0.45, p < 0.001; for |r| > 0.35, p < 0.01; and for |r| > 0.25, p < 0.05. (I'm not exactly sure why they report p-values for absolute correlations when a one-sided test is needed.)

It only takes a few lines of code to re-implement their Monte Carlo procedure and check these results.

library("tidyverse")
library("furrr")
library("magrittr")

#load data
ss0 <- read.table("data/SN_m_tot_V2.0.txt", nrow = 3237)
ss <- ts(ss0$V4, start = ss0$V1[1], frequency = 12) %>%
  window(start = 1901, end = c(2015, 12))

denmark <- read.table("data/crucy.v3.26.1901.2017.Denmark.pre.per",
                      skip = 3, header = TRUE) %>%
  filter(YEAR <= 2015)

set.seed(101)

#make function
lags.cor <- function(l, m, ss, clim){
  ss.lag <- stats::lag(ss, l)
  ss.lag <- window(ss.lag, start = c(1901, m), end = c(2015, m),
                   extend = TRUE, deltat = 1)
  cor(ss.lag, clim, use = "pair")
}

#make null distribution
null_dist <- future_map(1:10000, ~{
  y  <- rnorm(nrow(denmark))
  ys <- signal::sgolayfilt(x = y, p = 5, n = 11)
  tibble(lag = -110:24) %>%
    mutate(cor = map_dbl(lag, lags.cor, m = 1, clim = ys, ss = ss))
}) %>%
  bind_rows(.id = "n") %>%
  mutate(n = as.numeric(n))

I am using white noise as I cannot find a function to simulate autocorrelated time series with a Hurst exponent of 0.6, which the authors use. This will make my significance thresholds slightly lower, as will my use of a one-sided rather than a two-sided test. Despite this, I find that the significance thresholds are much higher than those the authors report.

null_dist %>%
  group_by(n) %>%
  slice(which.max(cor)) %$%
  quantile(cor, probs = c(0.95, 0.99))

##  95%  99%
## 0.40 0.47

Only if I ignore the lagged responses do I get results similar to the authors.

null_dist %>%
  filter(lag == 0) %$%
  quantile(cor, c(0.95, .99))

##  95%  99%
## 0.25 0.34

It appears that the authors neglected to test for lagged responses in their Monte Carlo procedure. As such, their p-values are overoptimistic, and many of their apparently significant results are not significant at all.
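The inflation from maximising over lags can be demonstrated with a stripped-down simulation (plain white noise, no smoothing – not the authors' exact setup):

```r
# Null distribution of the correlation at a single fixed lag vs the maximum
# over 31 candidate lags. Searching over lags shifts the upper tail of the
# null distribution upwards, so a fixed-lag significance threshold is too
# lenient for a lag-searched correlation.
set.seed(1)
n <- 115
r_single <- replicate(2000, cor(rnorm(n), rnorm(n)))
r_maxlag <- replicate(2000, {
  x <- rnorm(n + 30)  # predictor, long enough to shift by up to 30 steps
  y <- rnorm(n)
  max(sapply(0:30, function(l) cor(x[(1 + l):(n + l)], y)))
})
quantile(r_single, 0.95)
quantile(r_maxlag, 0.95)  # clearly higher
```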

Still, there are ~45 monthly series with a correlation that exceeds the 95th percentile of the null distribution. Is this impressive? Given that there are 39 countries × 12 months = 468 tests, we should expect ~23 correlations to be "significant" just by chance: a correction for multiple testing is required. Getting 45 significant correlations is more than expected. However, precipitation has strong spatial patterns (perhaps enhanced by the gridding process), so the time series from neighbouring countries (e.g., Belgium and Luxembourg) can be very highly correlated: if one country has a significant correlation, several of its neighbours probably will too. This increases the probability of getting 45 "significant" correlations by chance.
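The arithmetic, assuming (unrealistically, given the spatial correlation just described) that the tests were independent:

```r
# Expected false positives among 39 countries x 12 months at the 5% level,
# and the chance of 45 or more "hits" under independence. Spatial
# correlation between countries makes the tests far from independent, so
# this calculation overstates how surprising 45 hits is.
n_tests  <- 39 * 12                      # 468 tests
expected <- n_tests * 0.05               # ~23.4 expected by chance
p_45plus <- pbinom(44, size = n_tests, prob = 0.05, lower.tail = FALSE)
expected
p_45plus  # small under independence, but independence does not hold
```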

Given the apparent error in calculating the p-values in Laurenz et al. (2019), the authors need to archive their code, and if the problem is confirmed, should write a correction to their paper.

Really great review. I will share. Thanks.

Can this be described as correlation hacking, something that Lüdecke appears to specialize in?

There’s a response to this criticism on PubPeer. Sounds a bit like “others claim the same, so we’re right”:

https://pubpeer.com/publications/34269B14670E738F57DC96C4D846E6

It is a thing of beauty, isn't it? I love the way it gracefully dances around the issues in the first comment. And now the authors have published a new comment. I am so surprised they elect not to share their code, which would instantly resolve this problem.

I just read it – another dance around the elephant in the room, and a boohoo about not using your actual name.

Somewhat presciently, Steven Mosher recently commented on another blog: "after climategate the white hats started sharing data and code and the only people refusing to share code and data have been skeptics."

https://andthentheresphysics.wordpress.com/2019/02/20/survivor-bias/#comment-145374

Yup.