Count sums – the preprint

I wandered lonely as a cloud
That floats on high o’er vales and hills,
When all at once I saw a crowd,
A host, of golden chironomids;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.

Continuous as the stars that shine
And twinkle on the milky way,
They stretched in never-ending swarm
Along the margin of a bay:
Ten thousand saw I at a glance,
Tossing their heads in sprightly dance.

The waves beside them danced; but they
Out-did the sparkling waves in glee:
A poet could not but be gay,
In such a jocund company:
I gazed—and gazed—but little thought
What wealth the show to me had brought:

For oft, when on my couch I lie
In vacant or in pensive mood,
They flash upon that inward eye
Which is the bliss of solitude;
And then my heart with pleasure fills,
And dances with the chironomids.

Sincere apologies to W. Wordsworth: nothing rhymes with chironomids

But were there really ten thousand? Next time you walk by a lake, try to count the chironomids in a swarm. It is impossible, they just move about too much, and it doesn’t seem to get much easier when they are dead.

Just look at the midge stratigraphy from the eutrophic Hiidenvesi in Finland. The methods section of the paper report that “A minimum count size of 50 individuals was set” in 5 cm3 sediment samples.


Midge stratigraphy from Hiidenvesi including the absolute (black bars) and relative abundances (grey bars) of chironomids and chaoborids. The chaoborid/chironomid ratio indicates the severity of anoxia and N2 diversity index represents the effective number of occurrences

Do these assemblages really look like the minimum count sum is 50?

The lowest sample has about 45% Chaoborus flavicans and just over 10% of five chironomid taxa. This looks suspiciously like a count of nine midges: four Chaoborus  (44.4%) and one each for five chironomid taxa (11.1%). This interpretation is supported by the concentration data on the left of the figure. This sample has 0.8 Chaoborus and 1 chironomid per cubic centimetre of sediment. Since 5 cm3 of sediment were sampled and 5 * (0.8 + 1) = 9, this also implies that the count sum is nine midges. Last time I checked, nine was somewhat less than fifty. The second sample from the top appears to have the largest count sum with 1.2 Chaoborus and 3 chironomids per cubic centimetre of sediment, implying a count sum of 21, less than half that reported. All the analyses based on the chironomid stratigraphy are much more uncertain than the reader would expect and may be biased.

Given the number of examples of papers that appear to have counted fewer, often far fewer, microfossils than reported, I decided to write up and submit a manuscript describing the problem and showing how the count sum can be estimated from percent data. If you don’t have at least infinite patience for the peer review process to finish, you can read a preprint on Paleorxiv. Some of the examples I report in the preprint (none of which have yet featured on this blog) appear to be due to lazy descriptions of the data, some due to assemblages being scheduled for deletion but not actually deleted, and some due to what might be described as falsification of the count sums. I also accidentally found some fabricated data – maybe more on that in another post.

Posted in Peer reviewed literature | Tagged , , | Leave a comment

Count-less chironomids?

Most papers that present microfossil assemblages report (not always accurately) the minimum number of microfossils that were counted in each sample, an important indicator of the precision of the data. Some do not. For these papers, the reader should be able to assume that the counts conform to typical practice for that microfossil. For example, I would expect a pollen or diatom stratigraphy to have counts sums of a few hundred, and a chironomid stratigraphy to have a minimum count sum of fifty (perhaps thirty). If the count sums are materially lower than this I would start to wonder if the omission was deliberate and recall Fanelli’s (2013) definition of misconduct

‘any omission or misrepresentation of the information necessary and sufficient to evaluate the validity and significance of research, at the level appropriate to the context in which the research is communicated’.

The figure below is from a paper that reconstructs July temperature for four intervals in the period ca. 570 BCE to 120 CE (Iron Age–Roman Period) using chironomids from the varved sediments of Lake Silvaplana. The paper does not report the minimum count sum; can it be estimated from the stratigraphy?


(a) Chironomid assemblages, (b) inferred mean July air temperatures and (c) % organic matter. For part (a), percentages for each taxon are represented by black bars. Thin black lines represent the zones created using the ZONE program (Juggins, 1991).

I think the minimum count sum can be estimated. The lowest assemblage is 100% Orthocladius. This could represent one Orthocladius, a hundred, or a thousand, but I would suggest that it is unlikely to be a large count given the chironomid diversity in the lake (in contrast to polar foram assemblages which are commonly 100% Neogloboquadrina pachyderma), would make it unlikely to find just one taxon, even if that taxon is fairly dominant. The next assemblage is 100% Cricotopus, the third and fourth 100% Corynoneura, and the fifth 100 % Procladius: there are two possibilities, either the chironomid community in the lake repeatedly changed abruptly and taphonomic processes did not mix the assemblages, or the count sum for these assemblages is very low, perhaps only one. Fifty five of the 208 assemblages in the stratigraphy are monospecific.

The eight and ninth assemblages are 50% Tanytarsus and 50% another taxon. These are conceivably counts of two chironomids. The eleventh assemblage is more diverse with five taxa each with 20%, perhaps a count of five. Only a few assemblages, mainly near the top, appear to have a count sum of more than ten chironomids.

These are the lowest estimated count sums I have found, even lower than in Lake Muzzano.
Despite these extremely low count sums, a reconstruction of July temperature is made with a transfer function. Perhaps not surprisingly, the reconstruction is very noisy, with abrupt changes of over 10°C. The reconstruction is meaningless junk.

A correction is needed to report the omitted count sums, and if they are mainly in low single figures, I think the chironomid part of the paper should be retracted as no reviewer would ever have recommended publication had they known that.

Posted in Peer reviewed literature, transfer function | Tagged , , | Leave a comment

Tools for a reproducible manuscript

It is all too easy to make a mistake putting a manuscript together: mistranscribe a number, forget to rerun some analyses after the data are revised, or simply bodge the round some numbers. Fortunately it is possible to avoid these problems with some technology that inserts numbers and graphs into the manuscript, keeps them all up to date, and lets you see the calculations behind everything. The technology does not prevent code errors, but by making the analysis transparent, they are hopefully easier to find.

There are numerous guides to writing dynamic documents with Rmarkdown. What I want to do here is show some of the things I needed to write a manuscript with Rmarkdown. There may be better solutions to some of the problems I faced – tell me about them. If you want to see any of this in action, take a look at the GitHub repo for my recently accepted manuscript.

To get started with Rmarkdown, in RStudio, go to File >> New File >> R markdown and you get a sample file to play with. Remember, no spaces in the filename.

Too much code

Rmarkdown lets you have chunks of R code embedded in the text, so that when the document is rendered, the code is evaluated following the chunk’s options  and the results printed. This is great for teaching and simple documents, but when you have several hundred lines of code to clean data it becomes unwieldy to have as one document, and prevents reuse if you want to make a manuscript and a presentation with the same code.

The first solution I found for this is the knitr::read_chunk function which lets you put the code in a separate file. I later found the drake package which lets you put most of the code in the drake_plan and then readd or loadd the generated objects (targets) into rmarkdown. When you make the drake_plan only the targets that are out of date (because of code or data changes) get updated. This nicely solves the problem of how to cache the results of slow analyses (there is a cache built into Rmarkdown but it is a bit fragile).

Article format

If you want to submit your manuscript to a journal as a PDF, things will go a lot smoother if you use the journal’s LaTeX template. I have no idea how to do that either, but the rticles package takes away most of the pain. You make a draft manuscript with

rmarkdown::draft("MyArticle.Rmd", template = "acm_article", package = "rticles")

This generates the YAML, the metadata at the top of the Rmarkdown file between two sets of ---, required. This works well provided you don’t do anything daft such as submit to Wiley, one of the few major publishers without a template in rticles.


Obviously, a manuscript is going to need citations, and formatting citations is a pain. rticles will put a couple of lines in the YAML to say which bibtex file to use and which citation style.

bibliography: chironomid2.bib
csl: elsevier-harvard_rjt.csl

Any decent citation manager will be able to export a bibtex file. I use JabRef.

CSL files for many different journals are available. I had to edit the Elsevier style slightly so that it showed R packages correctly.

In the text, write [@Smith1990] to cite the reference with the key Smith1990 with name and year, or [-@Smith1990] to get just the year. The formatted reference list gets added to the end of the file. The citr addin for Rstudio lets you search through the bibtex file and select references to add to the manuscript.

Line numbers

I hate reviewing manuscripts without line numbers. So I wanted line numbers on my manuscript. This turned out to be easy – Rmarkdown documents are rendered into PDF via LaTeX, so LaTeX commands can be used.

In the YAML, add

  - \usepackage{lineno}
  - \linenumbers

And that’s it.

Cross-referencing figures

One of my favourite parts of writing a manuscript is when I decide to add another figure and need to renumber all the other figures and make sure that the text refers to the correct figure.

Rmarkdown can cross-reference figures (also tables etc). This feature was added relatively recently, and needs bookdown rather than rmarkdown, which can be set in the YAML (NB spacing is important in YAML )

    number_sections: true

In the text, to refer to the figure from the chunk with the label weather-climate-plot, use (Fig. \@ref(fig:weather-climate-plot)). Important, this does not work if you have underscores in the chunk name. This took me an entire evening to work out.

You can also cross-reference figures in the supplementary material. In the header-includes: section of the YAML, add

  - \usepackage{xr}
  - \externaldocument[sm-]{supplementary_data}

Where supplementary_data is the name of the rmarkdown file you want to reference. In the text you can then use Fig. \@ref(sm-fig:zab-dist-analogue). The sm prevents any clash between the labels in the manuscript and the supplementary.

To force the figures in the supplementary to be prepended with an “S”, to the YAML in the supplementary, add

  - \renewcommand{\thefigure}{S\arabic{figure}}

This magic works on the *.tex files made as an intermediate stage of rendering the PDF. You need to keep these by running rmarkdown::render with clean = FALSE, then you can run tinytex::pdflatex

tinytex::pdflatex("manuscript.tex", clean=TRUE)
tinytex::pdflatex("supplementary_data.tex", clean=TRUE)

Package management

One of the great things about R is that the packages are often improved. This is also a problem as the code that worked fine may fail after you update a package. One solution to this is to use packrat to keep track of which packages your project uses. packrat will automatically download all the packages you need to reproduce my manuscript.

I found packrat to be a bit of a pain, and am looking forward to trying renv.

Rmarkdown and git

As rmarkdown files are text files they work well with version control using git and, for example, GitHub. Since git tracks changes in each line of text, it would show a long paragraph as having changed if you add one comma. A trick I learnt somewhere is to break the paragraph up, one sentence per line, so the version control now works at the sentence level. Rmarkdown ignores single line returns in a document, you need two to get a new paragraph.


It is not difficult

If you can write code, and you can write text, you can write a manuscript with Rmarkdown. Your coauthors can probably cope too.

Posted in reproducible research | Tagged | 2 Comments

Double diatoms

I am in awe of Dr Elisabeth Bik and her amazing ability and dedication to spotting duplications in images.

Follow the thread to see the duplications she finds in this photograph. I usually struggle to find duplications until they are pointed out.

But not always.

Take, for example, this diatom calibration set.

Screenshot_2019-05-16 (11) (PDF) Diatom-based reconstruction of trophic status changes recorded in varved sediments of Lake[...]

The lower two assemblages are, within the resolution of the PDF, identical. Having spotted that, I wanted to know whether any of the other assemblages were identical, so I scraped the data out of the PDF. All the other assemblages are distinct.

But what about the next figure which shows the down-core stratigraphy?

Screenshot_2019-05-16 (11) (PDF) Diatom-based reconstruction of trophic status changes recorded in varved sediments of Lake[...](1)

While the upper part of the diagram is spiky as expected, the lower part of the diagram is step like with nine pairs of apparently identical assemblages.

These identical assemblages in the calibration set and the down-core stratigraphy are unexpected. I asked the authors if they could explain what happened, but they haven’t replied.

The paper also includes a transfer function that I cannot reproduce with the assemblage data I scraped from this paper and the environmental data I scraped from another paper on the same calibration set lakes. The reported cross-validation performance is better than the apparent performance I get, but it is possible that the environmental data had more problems than just using the wrong unit prefix. If my model is correct, then the actual cross-validation performance is very weak (R2 = 0.03), then the reconstruction shown in the paper is bogus and the paper should be corrected (or retracted).

I’m not holding my breath.

Posted in Peer reviewed literature, Uncategorized | Tagged , | Leave a comment

Funky ordination plots with ggvegan

Yesterday, I tweeted a photo of a ordination I plotted with ggvegan, and thought I should show how I made it.

Ordinations can be plotted with base R (see ?plot.cca). This is fine for simple plots, but it is a lot of effort to make a complex plot. The ggvegan package, written by Gavin Simpson, lets you use the power of ggplot2 and can make complex plots easier to make.

Start by loading some packages. ggvegan will need to be installed from GitHub if you don’t already have it.

#load packages

I’m going to use the pyrifos dataset, which is included in vegan. The data are the abundances of aquatic invertebrates from an experiment in which twelve ditches were studies several times before and after insecticide treatment. We need to make a tibble (or a data.frame) of the predictors (dose, time and ditch ID).

#load data

env <- tibble(
week = rep(c(-4, -1, 0.1, 1, 2, 4, 8, 12, 15, 19, 24), each = 12),
dose = factor(rep(c(0.1, 0, 0, 0.9, 0, 44, 6, 0.1, 44, 0.9, 0, 6), 11)),
ditch = gl(12, 1, length=132))

We can run a principal components analysis with rda.

pca <- rda(pyrifos)

Simple ordination plots can be made with the autoplot function. This plot shows all the samples. This is fine, but we want to see dose and time information.

autoplot(pca, layers = "sites", arrows = FALSE)


At the moment, it is difficult to do much to this plot. But we can use the fortify function from ggvegan to extract the scores from the ordination object and then combine these with the predictor data before plotting with ggplot. I’m going to colour the samples by dose, and draw a line through the samples in each ditch, with a point on the first value.

pca_fort <- fortify(pca, display = "sites") %>%

ggplot(pca_fort, aes(x = PC1, y = PC2, colour = dose, group = ditch)) +
geom_path() + #use geom_path not geom_line
geom_point(aes(size = if_else(week == min(week), 1, NA_real_)), show.legend = FALSE) +
scale_color_viridis_d() +
scale_size(range = 2) +


Don’t forget to add coord_equal() to ensure that the plot is correctly scaled.

Happy plotting.

Posted in R, Uncategorized | Tagged , | Leave a comment

Sunspots and raindrops

It is time, as the walrus said, to talk about another paper reporting an effect of solar variability on the Earth’s climate. Laurenz et al. (2019) correlate European country-scale precipitation data from CRU with the number of sunspots, a proxy for solar activity. After allowing for a lagged response, they find some reasonably high and significant correlations in at least some months and some countries.

I do expect solar variability to affect climate, but I expect the effect to be quite small and difficult to detect above a background of internal variability and other forcings such as greenhouse gas and aerosols. However, statistical traps abound. This makes me wary of papers that report strong and highly significant relationships. Especially when the authors write

The probability that the correlation r = 0.54 is by chance is less than 0.1% (p < 0.001).

This reflects a common misunderstanding of p-values. The co-authors, Horst-Joachim Lüdecke and Sebastian Lüning, the reviewers, and the editors don’t look good for letting this slip past them.

As the paper reports, monthly precipitation time series have very little autocorrelation and are almost white noise, so it seems unlikely that the strongly autocorrelated sunspot timeseries would be an good predictor.

This is easily tested by looking at some data. Not surprisingly, the correlation between the raw Danish February precipitation and sunspots is weak (r = 0.07) and not remotely close to being significant (p = 0.43).

So how do Laurenz et al. get to report this correlation as 0.42 (p < 0.01)? Predictably, they smooth the data, and allow for lagged responses. Using their methods, I get approximately the same correlation. But what about the significance? Many papers that smooth the data or allow for lagged response (a form of multiple testing) fail to correct the p-value and then find spuriously significant results. Laurenz et al. don’t make that mistake – they use a Monte Carlo procedure, applying their method to 10,000 surrogate time series to find the distribution of p-values under the null hypothesis. They report that for values of |r| > 0.45 p < 0.001; for |r| > 0.35 p < 0.01; and for |r| > 0.25 p < 0.05 (I’m not exactly sure why they are reporting p-values for absolute correlations when they need a one-sided test.)

It only takes a few lines of code to re-implement their Monte Carlo procedure and check these results.


#load data
ss0 <- read.table("data/SN_m_tot_V2.0.txt", nrow = 3237)
ss <- ts(ss0$V4, start = ss0$V1[1], frequency = 12) %>%
window(start = 1901, end = c(2015, 12))

denmark <- read.table("data/crucy.v3.26.1901.2017.Denmark.pre.per", skip = 3, header = TRUE) %>%
filter(YEAR <= 2015)

#make function
lags.cor <- function(l, m, ss, clim){
ss.lag <- stats::lag(ss, l)
ss.lag <- window(ss.lag, start = c(1901, m), end = c(2015, m), extend = TRUE, deltat = 1)
cor(ss.lag, clim, use = "pair")

#make null distribution
null_dist <- future_map(1:10000, ~{ y = rnorm(nrow(denmark)) ys = signal::sgolayfilt(x = y, p = 5, n = 11) tibble(lag = -110:24) %>%
mutate(cor = map_dbl(lag, lags.cor, m = 1, clim = ys, ss = ss))
}) %>%
bind_rows(.id = "n") %>%
mutate(n = as.numeric(n))

I am using white noise as I cannot find a function to simulate autocorrelated time series with a Hurst exponent of 0.6 which the authors use. This will make my significance thresholds slightly lower, as will my use of a one-sided rather than two-sided test. Despite this, I find that the significance thresholds are much higher than those the authors report.

null_dist %>%
group_by(n) %>%
slice(which.max(cor)) %$%
quantile(cor, probs = c(0.95, 0.99))
## 95% 99%
## 0.40 0.47

Only if I ignore the lagged responses do I get results similar to the authors.

null_dist %>%
filter(lag == 0) %$%
quantile(cor, c(0.95, .99))
## 95% 99%
## 0.25 0.34

It appears that the authors neglected to test for lagged responses in their Monte Carlo procedure. As such their p-values are over optimistic and many of their apparently significant results are actually not significant.

Still, there are ~45 monthly series with a correlation that exceeds the 95th percentile of the null distribution. Is this impressive? Given that there are 39 countries x 12 months = 468 tests, we should expect ~23 correlations to be “significant” just by chance because of multiple testing. A correction for multiple testing is required. Getting 45 significant correlations is more than expected. However, because the timeseries from neighbouring countries (e.g., Belgium and Luxembourg) can be very highly correlated because of the spatial patterns in precipitation (perhaps enhanced by the gridding process), if one has a significant correlation, several more will as well. This will increase the probability of getting 45 “significant” correlations by chance.

Given the apparent error in calculating the p-values in Laurenz et al. (2019), the authors need to archive their code, and if the problem is confirmed, should write a correction to their paper.

Posted in Peer reviewed literature, Uncategorized | Tagged , | 5 Comments

Autocorrelation in the testate amoeba calibration set

Amesbury et al examine the autocorrelation in their huge calibration set. I thought I would do the same, increasing the resolution of the analysis to get a better handle on what is going on.


An RNE plot for the weighted averaging with tolerance downweighting plotted with ggpalaeo. I have used the pruned calibration set for comparison with the original paper.

This is an RNE plot. It shows the cross-validation performance of the transfer function when samples are removed by deleting samples from the calibration set

  1. at random (open circles)
  2. which are geographic neighbours of the focal sample in the cross-validation (filled circles)
  3. which have similar values of the environmental variable being reconstructed to the focal sample in the cross-validation (red dashed line).

If there is no autocorrelation in the calibration set, deleting samples by geography or at random should yield similar changes in performance. If there is strong autocorrelation, deleting samples by geographic proximity is worse than deleting samples by environmental similarity.

With the testate amoeba WAtol calibration set, a substantial fraction of the loss in performance occurs when samples within 1km of the focal site in cross-validation are deleted. At this distance, deleting samples by geographic proximity is worse than deleting the same number of samples by environmental similarity. At greater distances, the decline in performance is less step (the equivalent plot in Amesbury et al is a little different because they have accidentally plotted the WAinv instead of the WAtol result – I need to improve the rne function to make the selection of model variants clearer).

I interpret the initial steep decline as a bog-specific effect, the type of issue that leave-one-site-out cross-validation is designed to fix. The subsequent decline is probably partly due to some climate effects on testate amoeba (either directly or via bog vegetation composition) and partly due to analyst effects (for example in how water table depths were calculated).

It might be possible to get some (imperfect) handle on the relative importance of these terms by cross-validating the WAtol model in different ways. Leave-one-out cross-validation of the pruned calibration set has an r2 of 0.722; with leave-one-site-out cross-validation, the r2 falls to 0.697; leaving out data from each reference in turn gives an r2 of 0.683. This is equivalent to a geographic exclusion of 500 km. This suggests there might be some analyst effect, but there will still be an element of climate effect in this change in performance. A better analysis would use information on who the taxonomist was for each calibration set.

For this testate amoeba calibration set with WAtol, spatial autocorrelation is not a major problem (see Telford and Birks (2009) for some transfer functions where autocorrelation is a much bigger problem).


Posted in Palaeohydrology, Peer reviewed literature, transfer function | Tagged , , | Leave a comment