Double diatoms

I am in awe of Dr Elisabeth Bik and her amazing ability and dedication to spotting duplications in images.

Follow the thread to see the duplications she finds in this photograph. I usually struggle to find duplications until they are pointed out.

But not always.

Take, for example, this diatom calibration set.

[Figure: diatom calibration-set assemblages from the PDF of “Diatom-based reconstruction of trophic status changes recorded in varved sediments of Lake[...]”]

The lower two assemblages are, within the resolution of the PDF, identical. Having spotted that, I wanted to know whether any of the other assemblages were identical, so I scraped the data out of the PDF. All the other assemblages are distinct.
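With the scraped data, the duplicate check is simple. This is a minimal sketch, assuming the scraped percentages are in a samples × taxa matrix called `assemblages` (a hypothetical name):

```r
# Identical assemblages have zero pairwise distance; a small tolerance
# allows for digitising error when scraping from the PDF.
d <- as.matrix(dist(assemblages, method = "manhattan"))
diag(d) <- NA # ignore each sample's distance to itself
which(d < 0.1, arr.ind = TRUE) # each duplicate pair is reported twice
```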

But what about the next figure which shows the down-core stratigraphy?

[Figure: down-core diatom stratigraphy from the same paper]

While the upper part of the diagram is spiky as expected, the lower part of the diagram is step like with nine pairs of apparently identical assemblages.

These identical assemblages in the calibration set and the down-core stratigraphy are unexpected. I asked the authors if they could explain what happened, but they haven’t replied.

The paper also includes a transfer function that I cannot reproduce with the assemblage data I scraped from this paper and the environmental data I scraped from another paper on the same calibration-set lakes. The reported cross-validation performance is better than the apparent performance I get, though it is possible that the environmental data had more problems than just using the wrong unit prefix. If my model is correct and the actual cross-validation performance really is very weak (R2 = 0.03), then the reconstruction shown in the paper is bogus and the paper should be corrected (or retracted).

I’m not holding my breath.


Funky ordination plots with ggvegan

Yesterday, I tweeted a photo of an ordination I plotted with ggvegan, and thought I should show how I made it.

Ordinations can be plotted with base R (see ?plot.cca). This is fine for simple plots, but complex plots take a lot of effort. The ggvegan package, written by Gavin Simpson, lets you harness the power of ggplot2, making complex plots much easier to produce.

Start by loading some packages. ggvegan will need to be installed from GitHub if you don’t already have it.

#load packages
library("vegan")
library("tidyverse")
library("ggvegan") #not on CRAN; install from GitHub if needed

I’m going to use the pyrifos dataset, which is included in vegan. The data are the abundances of aquatic invertebrates from an experiment in which twelve ditches were studied several times before and after insecticide treatment. We need to make a tibble (or a data.frame) of the predictors (dose, time and ditch ID).

#load data
data(pyrifos, package = "vegan")

env <- tibble(
  week = rep(c(-4, -1, 0.1, 1, 2, 4, 8, 12, 15, 19, 24), each = 12),
  dose = factor(rep(c(0.1, 0, 0, 0.9, 0, 44, 6, 0.1, 44, 0.9, 0, 6), 11)),
  ditch = gl(12, 1, length = 132))

We can run a principal components analysis with rda.

pca <- rda(pyrifos)

Simple ordination plots can be made with the autoplot function. This plot shows all the samples. This is fine, but we want to see dose and time information.

autoplot(pca, layers = "sites", arrows = FALSE)


At the moment, it is difficult to do much to this plot. But we can use the fortify function from ggvegan to extract the scores from the ordination object and then combine these with the predictor data before plotting with ggplot. I’m going to colour the samples by dose, and draw a line through the samples in each ditch, with a point on the first value.

pca_fort <- fortify(pca, display = "sites") %>%
  bind_cols(env)

ggplot(pca_fort, aes(x = PC1, y = PC2, colour = dose, group = ditch)) +
  geom_path() + #use geom_path not geom_line: it joins points in data order
  geom_point(aes(size = if_else(week == min(week), 1, NA_real_)), show.legend = FALSE) +
  scale_color_viridis_d() +
  scale_size(range = c(2, 2)) #fixed size for the first point; NA sizes are not drawn


Don’t forget to add coord_equal() to ensure that the plot is correctly scaled.

Happy plotting.


Sunspots and raindrops

It is time, as the walrus said, to talk about another paper reporting an effect of solar variability on the Earth’s climate. Laurenz et al. (2019) correlate European country-scale precipitation data from CRU with the number of sunspots, a proxy for solar activity. After allowing for a lagged response, they find some reasonably high and significant correlations in at least some months and some countries.

I do expect solar variability to affect climate, but I expect the effect to be quite small and difficult to detect above a background of internal variability and other forcings such as greenhouse gas and aerosols. However, statistical traps abound. This makes me wary of papers that report strong and highly significant relationships. Especially when the authors write

The probability that the correlation r = 0.54 is by chance is less than 0.1% (p < 0.001).

This reflects a common misunderstanding of p-values. The co-authors, Horst-Joachim Lüdecke and Sebastian Lüning, the reviewers, and the editors don’t look good for letting this slip past them.

As the paper reports, monthly precipitation time series have very little autocorrelation and are almost white noise, so it seems unlikely that the strongly autocorrelated sunspot time series would be a good predictor.

This is easily tested by looking at some data. Not surprisingly, the correlation between the raw Danish February precipitation and sunspots is weak (r = 0.07) and not remotely close to being significant (p = 0.43).
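The raw-data check is a one-liner; this sketch assumes the `ss` sunspot series and `denmark` precipitation table loaded as in the code below:

```r
#correlate February sunspot numbers with February precipitation, 1901-2015
ss_feb <- window(ss, start = c(1901, 2), end = c(2015, 2), deltat = 1)
cor.test(as.numeric(ss_feb), denmark$FEB, alternative = "greater")
```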

So how do Laurenz et al. get to report this correlation as 0.42 (p < 0.01)? Predictably, they smooth the data, and allow for lagged responses. Using their methods, I get approximately the same correlation. But what about the significance? Many papers that smooth the data or allow for lagged response (a form of multiple testing) fail to correct the p-value and then find spuriously significant results. Laurenz et al. don’t make that mistake – they use a Monte Carlo procedure, applying their method to 10,000 surrogate time series to find the distribution of p-values under the null hypothesis. They report that for values of |r| > 0.45 p < 0.001; for |r| > 0.35 p < 0.01; and for |r| > 0.25 p < 0.05 (I’m not exactly sure why they are reporting p-values for absolute correlations when they need a one-sided test.)

It only takes a few lines of code to re-implement their Monte Carlo procedure and check these results.


#load packages
library("tidyverse")
library("furrr") #for future_map
library("magrittr") #for %$%

#load data
ss0 <- read.table("data/SN_m_tot_V2.0.txt", nrows = 3237)
ss <- ts(ss0$V4, start = ss0$V1[1], frequency = 12) %>%
  window(start = 1901, end = c(2015, 12))

denmark <- read.table("data/crucy.v3.26.1901.2017.Denmark.pre.per", skip = 3, header = TRUE) %>%
  filter(YEAR <= 2015)

#make function to correlate lagged sunspots with climate for month m
lags.cor <- function(l, m, ss, clim){
  ss.lag <- stats::lag(ss, l)
  ss.lag <- window(ss.lag, start = c(1901, m), end = c(2015, m), extend = TRUE, deltat = 1)
  cor(ss.lag, clim, use = "pair")
}

#make null distribution
null_dist <- future_map(1:10000, ~{
  y <- rnorm(nrow(denmark))
  ys <- signal::sgolayfilt(x = y, p = 5, n = 11)
  tibble(lag = -110:24) %>%
    mutate(cor = map_dbl(lag, lags.cor, m = 1, clim = ys, ss = ss))
}) %>%
  bind_rows(.id = "n") %>%
  mutate(n = as.numeric(n))

I am using white noise as I cannot find a function to simulate autocorrelated time series with a Hurst exponent of 0.6, which the authors use. This will make my significance thresholds slightly lower, as will my use of a one-sided rather than two-sided test. Despite this, I find that the significance thresholds are much higher than those the authors report.

null_dist %>%
  group_by(n) %>%
  slice(which.max(cor)) %$%
  quantile(cor, probs = c(0.95, 0.99))
##  95%  99%
## 0.40 0.47

Only if I ignore the lagged responses do I get results similar to the authors.

null_dist %>%
  filter(lag == 0) %$%
  quantile(cor, c(0.95, 0.99))
##  95%  99%
## 0.25 0.34

It appears that the authors neglected to allow for the lagged responses in their Monte Carlo procedure. As such, their p-values are over-optimistic and many of their apparently significant results are actually not significant.

Still, there are ~45 monthly series with a correlation that exceeds the 95th percentile of the null distribution. Is this impressive? With 39 countries × 12 months = 468 tests, we should expect ~23 correlations to be “significant” just by chance, so a correction for multiple testing is required. Getting 45 significant correlations is more than expected by chance alone. However, because the time series from neighbouring countries (e.g., Belgium and Luxembourg) can be very highly correlated due to the spatial patterns in precipitation (perhaps enhanced by the gridding process), if one country has a significant correlation, several more probably will too. This dependence increases the probability of getting 45 “significant” correlations by chance.
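The expected count is easy to check. The tail probability below assumes the 468 tests are independent, which they are not, so it overstates how surprising 45 is:

```r
#expected number of "significant" tests under the null hypothesis
n_tests <- 39 * 12 #468 country-month series
n_tests * 0.05     #~23 expected by chance at the 5% level

#P(45 or more significant) if the tests were independent
pbinom(44, size = n_tests, prob = 0.05, lower.tail = FALSE)
```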

Given the apparent error in calculating the p-values in Laurenz et al. (2019), the authors need to archive their code, and if the problem is confirmed, should write a correction to their paper.


Autocorrelation in the testate amoeba calibration set

Amesbury et al examine the autocorrelation in their huge calibration set. I thought I would do the same, increasing the resolution of the analysis to get a better handle on what is going on.


An RNE plot for the weighted averaging with tolerance downweighting plotted with ggpalaeo. I have used the pruned calibration set for comparison with the original paper.

This is an RNE plot. It shows the cross-validation performance of the transfer function when samples are deleted from the calibration set:

  1. at random (open circles)
  2. which are geographic neighbours of the focal sample in the cross-validation (filled circles)
  3. which have similar values of the environmental variable being reconstructed to the focal sample in the cross-validation (red dashed line).

If there is no autocorrelation in the calibration set, deleting samples by geography or at random should yield similar changes in performance. If there is strong autocorrelation, deleting samples by geographic proximity is worse than deleting samples by environmental similarity.

With the testate amoeba WAtol calibration set, a substantial fraction of the loss in performance occurs when samples within 1 km of the focal site in the cross-validation are deleted. At this distance, deleting samples by geographic proximity is worse than deleting the same number of samples by environmental similarity. At greater distances, the decline in performance is less steep (the equivalent plot in Amesbury et al is a little different because they have accidentally plotted the WAinv instead of the WAtol result – I need to improve the rne function to make the selection of model variants clearer).

I interpret the initial steep decline as a bog-specific effect, the type of issue that leave-one-site-out cross-validation is designed to fix. The subsequent decline is probably partly due to some climate effects on testate amoeba (either directly or via bog vegetation composition) and partly due to analyst effects (for example in how water table depths were calculated).

It might be possible to get some (imperfect) handle on the relative importance of these terms by cross-validating the WAtol model in different ways. Leave-one-out cross-validation of the pruned calibration set has an r2 of 0.722; with leave-one-site-out cross-validation, the r2 falls to 0.697; leaving out data from each reference in turn gives an r2 of 0.683. This is equivalent to a geographic exclusion of 500 km. This suggests there might be some analyst effect, but there will still be an element of climate effect in this change in performance. A better analysis would use information on who the taxonomist was for each calibration set.
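Leave-one-site-out cross-validation can be sketched as below. Here `spp`, `wtd`, and `site` are assumed to hold the pruned species data, water-table depths, and site IDs (hypothetical names), and the `WA.inv.tol` column name follows rioja's conventions:

```r
library("rioja")

#leave out all samples from one site at a time
pred <- numeric(length(wtd))
for (s in unique(site)) {
  out <- site == s
  mod <- WA(spp[!out, ], wtd[!out], tolDW = TRUE)
  pred[out] <- predict(mod, newdata = spp[out, ])$fit[, "WA.inv.tol"]
}
cor(pred, wtd)^2 #leave-one-site-out r2
```

Leaving out each reference in turn works the same way, with `site` replaced by a vector of reference IDs.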

For this testate amoeba calibration set with WAtol, spatial autocorrelation is not a major problem (see Telford and Birks (2009) for some transfer functions where autocorrelation is a much bigger problem).



Testing testate amoeba: some comments on Amesbury et al (2018)

Today, I have been reading and thinking about a new paper presenting a huge testate-amoeba calibration set for reconstructing water table depth in bogs (Amesbury et al 2018). This calibration set, with almost 2000 samples, is the synthesis of many smaller calibration sets after the inevitable taxonomic harmonisation. In general, I like the paper, but I have some comments.


All the raw testate amoeba counts are available from Neotoma, and the harmonised data are available from Mendeley. This is excellent: I will be using these data and the large amount of fossil testate amoeba data in Neotoma in a forthcoming manuscript. Plus one citation to the authors.  Now if only the authors had archived the code as well.


I have not attempted to reproduce all aspects of the paper. I can replicate the weighted-average partial least squares and the weighted average with tolerance downweighting model performances with the full dataset, but get a slightly higher root mean squared error of prediction (RMSEP) with the pruned dataset.

A large calibration set

The calibration set is about an order of magnitude larger than what might usually be thought of as a large calibration set. This has some interesting consequences. First, the apparent performance and the cross-validated performance are very similar (WA-tol RMSE = 10.04 cm, RMSEP = 10.20 cm). Large differences between the apparent and cross-validated performance hint at over-fitting. Second, WA-tol is the best model. Theoretically, WA-tol should be a better model than ordinary WA, as taxa with wide niches are given less weight than taxa with narrow niches, but in practice, with smaller calibration sets, the niche widths cannot be estimated well enough and performance is worse. Third, with such a large dataset, and with no evidence of over-fitting, leave-one-out and leave-one-site-out cross-validation have similar performance (except for the modern analogue technique).

Model selection

Trying several transfer function models, and multiple versions of each model, is fairly standard practice in palaeoecology. It is a type of cherry picking that can bias performance estimates – see Telford et al (2004) for a test of this problem and a solution.

An unreported statistic

One useful diagnostic is the ratio of the constrained eigenvalue to the first unconstrained eigenvalue in an ordination constrained just by the environmental variable of interest. Ideally this ratio should exceed one, suggesting that the variable of interest is, or is correlated with, the most important gradient in the data. Using a CCA constrained by water table depth on square-root transformed data, the ratio λ1/λ2 is 0.6. This is not very impressive, and will make reconstructions vulnerable to changes in variables other than the variable of interest. I’m not sure which environmental variables are correlated with the first unconstrained axis.
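This sketch shows the calculation with vegan, assuming the harmonised counts are in `spp` and the water-table depths in `wtd` (hypothetical names):

```r
library("vegan")

#CCA constrained by water table depth on square-root transformed data
mod <- cca(sqrt(spp) ~ wtd)

#ratio of the constrained eigenvalue to the first unconstrained eigenvalue
mod$CCA$eig[1] / mod$CA$eig[1]
```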


Many data sets have outliers that can degrade model performance. There can be good reasons for deleting them; for example, in the Norwegian chironomid calibration set, a few lakes were cooled by long-lasting snow beds, so the lake water was colder than expected for the air temperature. As is common practice with testate amoeba transfer functions, Amesbury et al adopt a much more aggressive approach to outliers. They prune any observations that have a transfer function residual of more than 20% of the environmental gradient. This will inevitably give a boost to the model performance statistics. What is less clear is whether this gives better reconstructions.
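The pruning rule is simple to express in code. In this sketch, `pred` holds the model's predictions, and `spp` and `wtd` the species data and observed water-table depths (all hypothetical names):

```r
#prune observations with a residual of more than 20% of the gradient
threshold <- 0.2 * diff(range(wtd))
keep <- abs(pred - wtd) <= threshold

spp_pruned <- spp[keep, ]
wtd_pruned <- wtd[keep]
```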

One possibility is that testate amoeba have a rather weak relationship with water depth, and that the pruning artificially strengthens it. Many of the pruned samples are at the ends of the water depth gradient, so pruning, in part, addresses edge effects. Another possibility is that the water depth measurements are not very reliable (often they are spot measurements and could change after a storm), and pruning addresses this.

Significance tests

Amesbury et al use the reconstruction significance test I developed. They apply it to testate amoeba stratigraphies from two bogs. The reconstruction from Lac Le Caron is not significant with any version of their transfer function. The authors conclude

Given that the efficacy of the ‘randomTF’ method has been recently reviewed and questioned (Amesbury et al., 2016; Payne et al., 2016), these results further call into question the usefulness of this test.

I downloaded the Lac Le Caron data from Neotoma. The samples in the stratigraphy mostly have low diversity (median N2 = 2.3), a problem known since my original paper to cause a high type II error rate. It does not seem fair to criticise the test for failing in such circumstances. The main signal in the stratigraphy from Lac Le Caron is the switches between Difflugia pulex- and Archerella flavum-dominated assemblages. These taxa have almost identical optima (11.1 vs 10.2 cm), so I would argue that changes in water table depth are unlikely to have caused this assemblage change. As such, I believe that my significance test is correct in reporting a non-significant result.


Introducing ggpalaeo

I’ve put some code I used for plotting figures for my soon-to-be-resubmitted manuscript into a package because I thought it might be useful to others.

The main use of ggpalaeo is to make ggplot2 plots of transfer function diagnostics from the analogue and rioja packages.


It will plot distances to nearest analogues and goodness-of-fit statistics by time or depth, and also plot transfer function performance plots. The diagnostic plots are perhaps better suited to supplementary material than the main text of a manuscript, but they are very pretty.

The other thing ggpalaeo does is make a couple of additions to strat.plot in rioja, allowing secondary axes (for example age in addition to depth) and summary plots (for example percent planktonic and benthic diatoms). Have a look at the examples.

You can install ggpalaeo with
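```r
#ggpalaeo is not on CRAN; the repository path is assumed here
devtools::install_github("richardjtelford/ggpalaeo")
```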


You might need to install devtools first. Please let me know if you have any problems or suggestions for additions or improvements.


The elevation of Lake Tilo

For my PhD, I studied the palaeolimnology of two lakes in the Ethiopian rift valley, using diatoms to reconstruct changes in the water chemistry of Lake Awassa, an oligosaline caldera lake which retains its low salinity despite having no effluent rivers, and Lake Tilo, one of three alkaline maar lakes close to the Bilate River toward the western edge of the rift valley.

At the time, crater lakes were widely thought of as giant rain gauges, but, of course, groundwater plays a large role in controlling lake depth and salinity, complicating the relationship between these variables and climate. Given the importance of groundwater, I wanted to know the elevations of Lake Tilo, the neighbouring lakes Budemeda and Mechefera, and the Bilate River, to better understand how groundwater would flow between them. Unfortunately, the 10 m contour resolution on the available map was insufficiently precise and I was hopeless at using stereo-photographs to estimate elevations, so this never got done. But now I don’t need stereo-photographs: I can download a high-resolution digital elevation model — SRTM data — and plot it.

Here, I am using the 1″ (1 arc-second) resolution data. Each tile is about 25 MB and covers a 1° × 1° area.


#load packages
library("tidyverse")

#load the DEM tile
tilo <- raster::raster(x = "n07_e038_1arc_v3.tif")

#lake coordinates
lakes <- tibble(
  lake = factor(c("B", "T", "M"), levels = c("B", "T", "M")),
  x = c(38.09417, 38.09417, 38.08500),
  y = c(7.096667, 7.065000, 7.042500))

#convert the raster to a data frame for ggplot
tilo2 <-, xy = TRUE) %>%
  rename(elevation = 3)

tilo2 %>%
  filter(between(x, 38.04, 38.15), between(y, 7.01, 7.12)) %>%
  ggplot(aes(x = x, y = y, fill = elevation)) +
  geom_raster() +
  scale_fill_continuous(type = "viridis") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  coord_equal() +
  geom_text(aes(x, y, label = lake), data = lakes, colour = "white", inherit.aes = FALSE) +
  labs(fill = "Elevation, m", x = "°E", y = "°N")
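As a quick check, the lake-surface elevations can be read straight off the raster using the objects above:

```r
#elevation at each lake's coordinates
raster::extract(tilo, lakes[, c("x", "y")])
```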

Budemeda is the northernmost lake in a deep, steep-sided crater. Tilo is the middle lake. Mechefera is the southernmost lake in a shallow crater. When I visited, Mechefera supported a large population of flamingoes. Wading past the dead and dying flamingoes on the shores of the small cinder-cone island was not pleasant. They kept the resident hyaena fed though.


DEM of the three lakes.

To my surprise, there is a smaller fourth crater south-west of Lake Mechefera – I thought this was another cinder cone in the east-west line of cones. Google maps shows the crater floor to be vegetated – I suspect it is swampy and may be seasonally flooded.

Now I can plot some E-W profiles through the lakes.

`%nearest%` <- function(a, b){
  #find nearest value - slow
  a1 <- unique(a)
  res <- sapply(b, function(i){
    a == a1[which.min(abs(a1 - i))]
  })
  apply(res, 1, any)
}

tilo2 %>%
  filter(between(x, 38.04, 38.15)) %>%
  filter(y %nearest% lakes$y) %>%
  mutate(lake = factor(y, labels = c("M", "T", "B"))) %>%
  ggplot(aes(x = x, y = elevation, colour = lake, label = lake)) +
  geom_path() +
  scale_colour_discrete(limits = c("B", "T", "M")) +
  directlabels::geom_dl(method = "first.points") +
  directlabels::geom_dl(method = "last.points") +
  labs(y = "Elevation, m", colour = "Lake", x = "°E") +
  theme(legend.position = "none")

East-west profiles through the three lakes.

Lake Budemeda appears to be 13 m below the elevation of the river immediately to the west. Lake Tilo is a couple of metres lower than Lake Budemeda, but less than five metres below the river to the west. Lake Mechefera is about 15 m below Lake Tilo and 8 m below the river immediately to the south. This means that river water will tend to percolate from the river into all three lakes, but especially Budemeda and Mechefera, and that water will flow from Budemeda to the lakes to the south. This probably partially explains the high salinity of the thermal springs along the northern shore of Lake Tilo (I should have realised that when I was doing my thesis).

Lacustrine sediments high on the crater walls show that during the early Holocene the water level in Tilo was about 30 m higher than at present, far above the river level. At this time, the lake was fresh and there was rapid accumulation of diatomite dominated by Aulacoseira granulata.

SRTM data are a great resource, especially for closed basin lakes.
