Sunspots and raindrops

It is time, as the walrus said, to talk about another paper reporting an effect of solar variability on the Earth’s climate. Laurenz et al. (2019) correlate European country-scale precipitation data from CRU with the number of sunspots, a proxy for solar activity. After allowing for a lagged response, they find some reasonably high and significant correlations in at least some months and some countries.

I do expect solar variability to affect climate, but I expect the effect to be quite small and difficult to detect above a background of internal variability and other forcings such as greenhouse gas and aerosols. However, statistical traps abound. This makes me wary of papers that report strong and highly significant relationships. Especially when the authors write

The probability that the correlation r = 0.54 is by chance is less than 0.1% (p < 0.001).

This reflects a common misunderstanding of p-values. The co-authors, Horst-Joachim Lüdecke and Sebastian Lüning, the reviewers, and the editors don’t look good for letting this slip past them.

As the paper reports, monthly precipitation time series have very little autocorrelation and are almost white noise, so it seems unlikely that the strongly autocorrelated sunspot timeseries would be an good predictor.

This is easily tested by looking at some data. Not surprisingly, the correlation between the raw Danish February precipitation and sunspots is weak (r = 0.07) and not remotely close to being significant (p = 0.43).

So how do Laurenz et al. get to report this correlation as 0.42 (p < 0.01)? Predictably, they smooth the data, and allow for lagged responses. Using their methods, I get approximately the same correlation. But what about the significance? Many papers that smooth the data or allow for lagged response (a form of multiple testing) fail to correct the p-value and then find spuriously significant results. Laurenz et al. don’t make that mistake – they use a Monte Carlo procedure, applying their method to 10,000 surrogate time series to find the distribution of p-values under the null hypothesis. They report that for values of |r| > 0.45 p < 0.001; for |r| > 0.35 p < 0.01; and for |r| > 0.25 p < 0.05 (I’m not exactly sure why they are reporting p-values for absolute correlations when they need a one-sided test.)

It only takes a few lines of code to re-implement their Monte Carlo procedure and check these results.


#load data
ss0 <- read.table("data/SN_m_tot_V2.0.txt", nrow = 3237)
ss <- ts(ss0$V4, start = ss0$V1[1], frequency = 12) %>%
window(start = 1901, end = c(2015, 12))

denmark <- read.table("data/crucy.v3.26.1901.2017.Denmark.pre.per", skip = 3, header = TRUE) %>%
filter(YEAR <= 2015)

#make function
lags.cor <- function(l, m, ss, clim){
ss.lag <- stats::lag(ss, l)
ss.lag <- window(ss.lag, start = c(1901, m), end = c(2015, m), extend = TRUE, deltat = 1)
cor(ss.lag, clim, use = "pair")

#make null distribution
null_dist <- future_map(1:10000, ~{ y = rnorm(nrow(denmark)) ys = signal::sgolayfilt(x = y, p = 5, n = 11) tibble(lag = -110:24) %>%
mutate(cor = map_dbl(lag, lags.cor, m = 1, clim = ys, ss = ss))
}) %>%
bind_rows(.id = "n") %>%
mutate(n = as.numeric(n))

I am using white noise as I cannot find a function to simulate autocorrelated time series with a Hurst exponent of 0.6 which the authors use. This will make my significance thresholds slightly lower, as will my use of a one-sided rather than two-sided test. Despite this, I find that the significance thresholds are much higher than those the authors report.

null_dist %>%
group_by(n) %>%
slice(which.max(cor)) %$%
quantile(cor, probs = c(0.95, 0.99))
## 95% 99%
## 0.40 0.47

Only if I ignore the lagged responses do I get results similar to the authors.

null_dist %>%
filter(lag == 0) %$%
quantile(cor, c(0.95, .99))
## 95% 99%
## 0.25 0.34

It appears that the authors neglected to test for lagged responses in their Monte Carlo procedure. As such their p-values are over optimistic and many of their apparently significant results are actually not significant.

Still, there are ~45 monthly series with a correlation that exceeds the 95th percentile of the null distribution. Is this impressive? Given that there are 39 countries x 12 months = 468 tests, we should expect ~23 correlations to be “significant” just by chance because of multiple testing. A correction for multiple testing is required. Getting 45 significant correlations is more than expected. However, because the timeseries from neighbouring countries (e.g., Belgium and Luxembourg) can be very highly correlated because of the spatial patterns in precipitation (perhaps enhanced by the gridding process), if one has a significant correlation, several more will as well. This will increase the probability of getting 45 “significant” correlations by chance.

Given the apparent error in calculating the p-values in Laurenz et al. (2019), the authors need to archive their code, and if the problem is confirmed, should write a correction to their paper.

Posted in Peer reviewed literature, Uncategorized | Tagged , | 5 Comments

Autocorrelation in the testate amoeba calibration set

Amesbury et al examine the autocorrelation in their huge calibration set. I thought I would do the same, increasing the resolution of the analysis to get a better handle on what is going on.


An RNE plot for the weighted averaging with tolerance downweighting plotted with ggpalaeo. I have used the pruned calibration set for comparison with the original paper.

This is an RNE plot. It shows the cross-validation performance of the transfer function when samples are removed by deleting samples from the calibration set

  1. at random (open circles)
  2. which are geographic neighbours of the focal sample in the cross-validation (filled circles)
  3. which have similar values of the environmental variable being reconstructed to the focal sample in the cross-validation (red dashed line).

If there is no autocorrelation in the calibration set, deleting samples by geography or at random should yield similar changes in performance. If there is strong autocorrelation, deleting samples by geographic proximity is worse than deleting samples by environmental similarity.

With the testate amoeba WAtol calibration set, a substantial fraction of the loss in performance occurs when samples within 1km of the focal site in cross-validation are deleted. At this distance, deleting samples by geographic proximity is worse than deleting the same number of samples by environmental similarity. At greater distances, the decline in performance is less step (the equivalent plot in Amesbury et al is a little different because they have accidentally plotted the WAinv instead of the WAtol result – I need to improve the rne function to make the selection of model variants clearer).

I interpret the initial steep decline as a bog-specific effect, the type of issue that leave-one-site-out cross-validation is designed to fix. The subsequent decline is probably partly due to some climate effects on testate amoeba (either directly or via bog vegetation composition) and partly due to analyst effects (for example in how water table depths were calculated).

It might be possible to get some (imperfect) handle on the relative importance of these terms by cross-validating the WAtol model in different ways. Leave-one-out cross-validation of the pruned calibration set has an r2 of 0.722; with leave-one-site-out cross-validation, the r2 falls to 0.697; leaving out data from each reference in turn gives an r2 of 0.683. This is equivalent to a geographic exclusion of 500 km. This suggests there might be some analyst effect, but there will still be an element of climate effect in this change in performance. A better analysis would use information on who the taxonomist was for each calibration set.

For this testate amoeba calibration set with WAtol, spatial autocorrelation is not a major problem (see Telford and Birks (2009) for some transfer functions where autocorrelation is a much bigger problem).


Posted in Palaeohydrology, Peer reviewed literature, transfer function | Tagged , , | Leave a comment

Testing testate amoeba: some comments on Amesbury et al (2018)

Today, I have been reading and thinking about a new paper presenting a huge testate-amoeba calibration set for reconstructing water table depth in bogs (Amesbury et al 2018). This calibration set, with almost 2000 samples, is the synthesis of many smaller calibrations sets after the inevitable taxonomic harmonisation. In general, I like the paper, but I have some comment.


All the raw testate amoeba counts are available from Neotoma, and the harmonised data are available from Mendeley. This is excellent: I will be using these data and the large amount of fossil testate amoeba data in Neotoma in a forthcoming manuscript. Plus one citation to the authors.  Now if only the authors had archived the code as well.


I have not attempted to reproduce all aspects of the paper. I can replicate the weighted-average partial least squares and the weighted average with tolerance downweighting model  performances with the full dataset, but get a slightly higher root mean squared error of prediction (RMSEP) with the pruned dataset.

A large calibration set

The calibration set is about an order of magnitude larger than what might usually be thought of as a large calibration set. This has some interesting consequences. First, the apparent performance and the cross-validated performance are very similar (WA-tol RMSE = 10.04 cm, RMSEP = 10.20 cm). Large differences between the apparent and cross-validated performance hint at over-fitting. Second, WA-tol is the best model. Theoretically WA-tol should be a better model than ordinary WA as taxa with wide niches are given less weight than taxa with narrow niches, but in practice the niche width cannot be estimated well enough and performance is worse. Third, with such a large dataset, and with no evidence of over-fitting, leave-one-out and leave-one-site-out cross-validation have similar performance (except for the modern analogue technique).

Model selection

Trying several transfer functions model, and multiple versions of each model is fairly standard practice in palaeoecology. It is a type of cherry picking, that will can biased performance – see Telford et al (2004) for an test of this problem and a solution.

An unreported statistic

One useful diagnostic is the ratio of the constrained eigenvalue to the first unconstrained eigenvalue in an ordination constrained just by the environmental variable of interest. Ideally this ratio should exceed one, suggesting that the variable of interest is, or is correlated with, the most important gradient in the data. Using a CCA constrained by water table depth on square-root transformed data, the ratio \lambda_1/\lambda_2 is 0.6. This is not very impressive, and will make reconstructions vulnerable to changes in variables other than the variable of interest. I’m not sure which environmental variables are correlated with the first unconstrained axis.


Many data sets have outliers that can degrade model performance. There can be good reasons for deleting them, for example, in the Norwegian chironomid calibration set, a few lakes were cooled by long lasting snow beds, so the lake water was colder than expected for the air temperature. As is common practice with testate amoeba transfer functions, the Amesbury et al adopt a much more aggressive approach to outliers. They prune any observations that have a transfer function residual of more than 20% of the environmental gradient. This will inevitably give a boost to the model performance statistics. What is less clear is whether this gives better reconstructions.

One possibility is that testate amoeba have a rather weak relationship with water depth, and that the pruning artificially strengthens it. Many of the samples pruned are at the ends of the water depth gradient, so pruning, in part, addresses edge effects. Another possibility is that the water depth measurements are not very reliable (often they are spot measurements and could change after  storm), and pruning addresses this.

Significance tests

Amesbury et al use the reconstruction significance test I developed. They apply it to testate amoeba stratigraphies from two bogs. The reconstruction from Lac Le Caron is not significant with any version of their transfer function. The authors conclude

Given that the efficacy of the ‘randomTF’ method has been recently reviewed and questioned (Amesbury et al., 2016; Payne et al., 2016), these results further call into question the usefulness of this test.

I downloaded the Lac Le Caron data from Neotoma. The samples in the stratigraphy mostly have a low diversity (median N2 = 2.3), a problem known since my original paper to cause a high type II error rate. It does not seem to be fair to criticise the test for failing in such circumstances. The main signal in the stratigraphy from Lac Le Caron are switches from Difflugia pulex to Archerella flavum dominated assemblages. These taxa have almost identical optima (11.1 vs 10.2 cm), so I would argue that changes in water table depth are unlikely to have caused this assemblage change. As such, I believe that my significance test is correct in reporting a non-significant result.

Posted in Palaeohydrology, Peer reviewed literature, transfer function | Tagged , | 1 Comment

Introducing ggpalaeo

I’ve put some code I used for plotting figures for my soon-to-be-resubmitted manuscript into a package because I thought it might be useful to others.

The main use of ggpalaeo is to make ggplot2 plots of transfer function diagnostics from the analogue and rioja packages.


It will plot distances to nearest analogues and goodness-of-fit statistics by time or depth, and also plot transfer function performance plots. The diagnostic plots are perhaps better suited to supplementary material than the main text of a manuscript, but they are very pretty.

The other thing ggpalaeo does is make a couple of additions to strat.plot in rioja, allowing secondary axes (for example age in addition to depth) and summary plots (for example percent planktonic and benthic diatoms). Have a look at the examples.

You can install  ggpalaeo with


You might need to install devtools first. Please let me know if you have any problems or suggestions for additions or improvements.

Posted in R, transfer function | Tagged | 3 Comments

The elevation of Lake Tilo

For my PhD, I studied the palaeolimnology of two lakes in the Ethiopian rift valley, using diatoms to reconstruct changes in the water chemistry of Lake Awassa, an oligosaline caldera lake which retains its low salinity despite having no effluent rivers, and Lake Tilo, one of three alkaline maar lakes close to the Bilate River toward the western edge of the rift valley.

At the time, crater lakes were widely thought of as giant rain gauges, but, of course, groundwater plays a large role in controlling lake depth and salinity, complicating the relationship these variables and climate. Given the importance of groundwater, I wanted to know the elevation of Lake Tilo, the neighbouring lakes Budemeda and Mechefera, and the Bilate River to better understand how groundwater would flow between them. Unfortunately, the 10m contour resolution on the available map was insufficiently precise and I was hopeless at using stereo-photographs to estimate elevations, so this never got done. But now I don’t need to use stereo-photographs, I can download a high-resolution digital elevation model — SRTM data — from and plot it.

Here, I am using the 1′ resolution data. Each tile is about 25MB and covers a 1°x1° area.


tilo = raster::raster(x = "n07_e038_1arc_v3.tif")

lakes = data_frame(
lake = factor(c("B", "T", "M"), levels = c("B", "T", "M")),
x = c(38.09417, 38.09417, 38.08500),
y = c(7.096667, 7.065000, 7.042500)

tilo2 =, xy = TRUE) %>%
rename(elevation = 3)

tilo2 %>%
filter(between(x, 38.04, 38.15), between(y, 7.01, 7.12)) %>%
ggplot(aes(x = x, y = y, fill = elevation)) +
geom_raster() +
scale_fill_continuous(type = "viridis") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
coord_equal() +
geom_text(aes(x, y, label = lake), lakes, colour = "white", inherit.aes = FALSE) +
labs(fill = "Elevation, m", x = "°E", y = "°N")

Budemeda is the northernmost lake in a deep, steep-sided crater. Tilo is the middle lake. Mechefera is the southernmost lake in a shallow crater. When I visited, Mechefera supported a large population of flamingoes. Wading past the dead and dying flamingoes on the shores of the small cider-cone island was not pleasant. They kept the resident hyaena fed though.


DEM of the three lakes.

To my surprise, there is a smaller fourth crater south-west of Lake Mechefera – I thought this was another cinder cone in the east-west line of cones. Google maps shows the crater floor to be vegetated – I suspect it is swampy and may be seasonally flooded.

Now I can plot some E-W profiles through the lakes.

`%nearest%` <- function(a, b){#find nearest value#slow
a1 <- unique(a)
res <- sapply(b, function(i){ a == a1[which.min(abs(a1 - i))] }) apply(res, 1, any) } tilo2 %>%
filter(between(x, 38.04, 38.15)) %>%
filter(y %nearest% lakes$y) %>%
mutate(lake = factor(y, labels = c("M", "T", "B"))) %>%
ggplot(aes(x = x, y = elevation, colour = lake, label = lake)) +
geom_path() +
scale_colour_discrete(limits = c("B", "T", "M")) +
directlabels::geom_dl(method = "first.points") +
directlabels::geom_dl(method = "last.points") +
labs(y = "Elevation, m", colour = "Lake", x = "°E") +
theme(legend.position = "none")

East-west profiles through the three lakes.

Lake Budemeda appears to be 13 m below the elevation of the river immediately to the west. Lake Tilo is a couple of metres lower than Lake Budemeda, but less that five metres below the river to the west. Lake Mechefera is about 15 m below Lake Tilo and 8 m below the river immediately to the south. This means that river water will tend to percolate from the river into all three lakes, but especially Budemeda and Mechefera, and that water will flow from Budemeda to the lakes to the south. This probably partially explains the high salinity of the thermal springs along the northern shore of Lake Tilo (I should have realised that when I was doing my thesis).

Lacustrine sediment high on the crater walls show that during the early Holocene the water level in Tilo was about 30m higher that at present, far above the river level. At this time, the lake was fresh and there was rapid accumulation of diatomite dominated by Aulacoseira granulata.

SRTM data are a great resource, especially for closed basin lakes.

Posted in Data manipulation, Palaeohydrology, R, Uncategorized | Tagged , , | Leave a comment

Conference abstract deadlines

“I love deadlines. I love the whooshing noise they make as they go by.”
Douglas Adams

When I go to a conference, I want to present my most recent relevant work. But often the deadline for abstract submission is months before the conference, so I need to decide which work is likely to be completed in time, and write a convincing abstract before all (sometimes any) results are available.

Take for example the next INQUA conference, the deadline for abstract submission, 9th January, is 6.5 months in advance of the conference.

I wondered if this lead time was typical, so I trawled through some conferences to find their lead times. I’ve looked for conferences in ecology and geology, avoiding the plague of predatory conferences by only listing those I know or found on the PAGES and Nature Ecology calendars. These gave a mix of 2018 and 2019 conferences. I stopped once I had 30 conferences. If the submission deadline was extended, I took the original date as I cannot tell if the deadlines will be extended for 2019 conferences.


Histogram of conference lead times in ecology and geology

The median lead time is 110 days, much shorter than for INQUA. Of course, I want to know why there is a 5.8 fold difference between the longest (Polar2018) and shortest (AQUA) lead time.

I have a few ideas, some of which are probably testable.

  • Some of the conferences probably plan to extend the abstract submission deadline, so the published deadline is artificially early, whereas for others the deadline is fixed. As extensions are usually only by a week or two, this should explain only a little of the variance.

  • I only looked for conferences in ecology or geology, so I suspect research field explains little of the variance. I predict that some faster moving fields, e.g. biomedicine, lead times should typically be short.

  • I wonder whether conference size is associated with lead times. Large conferences are obviously take more effort to organise, but on the other hand, have more resources. It is difficult to work out how large conferences are, but two of the largest conferences, the EGU and ESA have very different lead times (87 and 171 days, respectively). Small, relatively informal conferences should be the easiest to organise, and therefore short lead times.

I prefer short lead times. I am more likely to submit an abstract to a conference with a three month lead time than one with a six month delay.

Posted in meta science | Tagged | 1 Comment

No new data no comment at Nature Communications

Of all the modes of post-publication peer review, comments published in the same journal as the original article are the most visible, and because they have survived editorial and reviewer scrutiny, carry at least modicum of credibility. Unfortunately, comments are much rarer than expected given the number of papers our journal club savages. For example, QSR has published just a couple of dozen in the last decade. Part of the reason for the sparsity of comments must be the hurdles placed in the path to publication.

These are the requirement from Science

Technical Comments (up to 1000 words, 2 figures or tables, 15 references, and no Supplementary Materials), are published online and critique the core conclusions and/or methodology of research published in Science within the previous 3 months.  Technical Comments should not present new data or other previously unpublished work nor be based on new findings/concepts that would not have been accessible to the authors when the paper was written.

The word limit is tight and the deadline restrictive (confession: I don’t always read Science papers the day they are published), but otherwise this is reasonable.

How about the requirement from Nature Communications?

Submissions should challenge the main conclusions of the Nature Communications paper and contain new, unpublished data to support the arguments.

•    They should not exceed 1200 words (main text).
•    Contributions should start with a brief paragraph that summarizes the message of the article without specialized terminology, for a non-specialist readership. This paragraph should be used as the abstract for submission purposes.
•    Contributions should have a simple message that requires only one or two small figures or tables. Contributions with more than two figures and/or tables will not be considered.
•    As a guideline, contributions allow up to 15 references.
•    Supplementary Information is permitted at the editor’s discretion.

The requirement for “new, unpublished data” is the most important difference: Science forbids it, Nature Communications demands it.

I’ve helped write a few comments showing that:

  • The method used by Pither and Aarssen (2005) to show that most diatoms taxa are pH generalists had a very high type II error rate which voided their conclusions.
  • That changes of the sedimentation rate in Lake Erhai caused artifacts that Wang et al (2012) misinterpreted as flickering prior to a critical transition.
  • The methods used by Lyons et al (2015) caused a spurious breakpoint in the mid-Holocene.

The common feature of these comments is that none of them contain new unpublished data.

Nature Communications doesn’t require articles to contain new unpublished data. This is good, many of my papers don’t contain new data; new methods and ideas are as important as new data. However, it potentially leads to a bizarre situation that an article presenting a novel but flawed analysis of previously published data extracted from Neotoma or some other database could not be critiqued in a comment unless the authors were able to find some new unpublished data (and describe it in 1200 words).





Posted in Peer reviewed literature | Tagged , | 3 Comments