Simplistic and Dangerous Models

A few weeks ago there were none. Three weeks ago, with an entirely inadequate search strategy, ten cases were found. Last Saturday there were 43! With three inaccurate data points, there is enough information to fit an exponential curve: the prevalence is doubling every seven days. Armchair epidemiologists should start worrying that by Christmas there will be 10^12 preprints relating COVID-19 to weather and climate unless an antidote is found.
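The curve fitting is simple enough to check. A back-of-the-envelope sketch, assuming roughly two weeks between the two usable counts:

```r
#back-of-the-envelope doubling time, assuming ~14 days between
#the two usable counts (10 preprints, then 43)
doubling_time <- 14 * log(2) / log(43 / 10)
doubling_time #about 6.7 days
```

Which rounds comfortably to the seven days quoted above.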

Fortunately, a first version of an antidote has been published to one form of the preprint plague that is sweeping the planet, the form known as the SDM (apparently this is not an acronym for Simplistic and Dangerous Model). A second version is due to be published soon.

So why are SDMs such a bad tool for trying to model the spread of COVID-19?

1) The system is not at equilibrium

The first case of what has become known as COVID-19 was reported on 17th November 2019 in Wuhan. The weather was a pleasant 17°C. An SDM fitted to the occurrences of COVID-19 at this time would show it was an extreme specialist, with a preferred temperature of 17°C. Of course it would be absurd to fit an SDM.


By mid-January, the virus had spread across China and an SDM would find that Chinese winter temperatures were ideal. By mid-March, the outbreaks were developing in Europe and America – places with good transport links to China and each other – and an SDM would tell a different story. A month later, and there are COVID-19 outbreaks in almost all countries. An SDM fitted now to presence-absence data would show no evidence of a niche.

We can visualise this by fitting a GAM to the presence of >5 cases of COVID-19 in each country/region for each date in the Johns Hopkins dataset. This is similar to the data Araujo & Naimi (2020) used in the first version of their preprint.
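A minimal sketch of such a model, assuming a hypothetical data frame `covid_date` with one row per country/region on a given date, a logical `present` column (>5 cases) and a February mean temperature column:

```r
library(mgcv)
#sketch: `covid_date` is a hypothetical data frame with one row per
#country/region on a given date, a logical `present` column (>5 cases)
#and a February mean `temperature` column
mod <- gam(present ~ s(temperature), family = binomial, data = covid_date)
plot(mod) #fitted response curve - the apparent "niche"
```

Refitting the same model date by date shows how the apparent niche appears and then dissolves as the virus spreads.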


Only for a short time period is there an apparent climatic niche in the distribution of COVID-19 cases. It would be foolish to assert that this brief window gives us insight into the climatic niche of COVID-19 rather than into the volume of passenger transport from China and, subsequently, from other countries with outbreaks as the disease spreads.

Almost 40,000 cases of COVID-19 in Brazil testify to the utter failure of the prediction in Araujo & Naimi (2020) that tropical climates are less favourable for the spread of COVID-19.

For simplicity, I fitted the GAM using February mean temperatures from Chelsa. If you think it would make any difference to use observed temperatures rather than a climatology you are missing the point.

Problems with fitting SDM to invasive species, where the range is not in approximate equilibrium with climate, are well known. Anyone working in the field should be aware of these issues.

2) Microclimate

Temperature and humidity can affect the lifetime of droplets and the time that virus particles on surfaces remain viable. But the 2 m outside air temperature is a poor predictor of the actual conditions where many people are exposed to infection. It’s all about microclimate: many people are inside most of the time, especially when it is cold outside.

Qian et al (2020) examined outbreaks of COVID-19 in China (excluding Hubei) that could be traced to their source. They found that only one of 318 outbreaks was transmitted outdoors. Most outbreaks occurred at home. One proviso about this study: it might be more difficult to trace transmissions that occurred outside, causing an undercount.

3) Ignores other factors

Climate and seasonality probably will play some role in COVID-19 dynamics, as they do, for reasons that are not entirely clear, in influenza, where suggested mechanisms include the influence of seasonality on human behaviour (more time inside in cold weather) and on our immune systems. But climate and seasonality are unlikely to be the dominant factors controlling COVID-19 transmission. The initial spread around the world reflects patterns of human movement. The number of infections in a location depends on human behaviour, including policies to minimise spread, applied successfully (South Korea) or less so (United Kingdom). A standard SDM takes none of these factors into consideration, nor any other component of an epidemiological model.


So, certainly simplistic; what about dangerous?

If the effect of climate and seasonality on COVID-19 is small, and policy makers influenced by these preprints prematurely relax (or belatedly apply) restrictions on movement and businesses, people will die.


Araujo & Naimi (2020) is not the only preprint using an SDM to try to find a relationship between COVID-19 and climate. Harbert et al (2020) use a somewhat more cautious approach, concluding that while they cannot exclude an influence of climate, it is less important than population size in explaining county-level COVID-19 statistics in the United States. Bariotakis et al (2020) use the full suite of 19 WorldClim bioclimatic variables as predictors in MaxEnt. They find… actually, I don’t really care what they find: the premise is nonsense, so any result can only be an artefact.

Just because you have a bag of hammers does not make every problem a nail.

Posted in Uncategorized | Tagged , | 3 Comments

COVID-19, climate and the plague of preprints

Many diseases have geographical variability in prevalence or seasonal variability in epidemics, which may, directly or indirectly, be causally related to climate. Unfortunately, the nature of the relationship with climate is not always clear.

With the recent outbreak of COVID-19, scientists (including a depressingly large number of ecologists who ought to know better) have wasted no time in applying their favourite methods to the available case data and climate data to produce a pack of preprints.

Please let me know if you find any more.

I know ecologists are trying to be useful, but please consider


Most of these preprints argue for reduced transmission at warmer temperatures. If true, this would be good news as spring progresses to summer in the northern hemisphere, allowing time for health services to prepare for increased transmission later in the year.

If false, then these preprints could influence how societies react to the ongoing crisis, perhaps convincing people or policy makers that the risk of transmission of COVID-19 is low. Certainly, there are policy makers who are open to this idea:

“Looks like by April, you know, in theory, when it gets a little warmer, it miraculously goes away,”  Donald Trump 10 February 2020

I haven’t carefully read all the preprints, but several of those that I have read are not at all persuasive.

I’ve helped write a response to one of these. In some forthcoming posts I will try to reproduce some of the others as I suspect some are already disproved by the new data available since they were published.

But for now, if a preprint assumes that the climate at 60°N 90°E represents the climate of Russia, it is junk.

Posted in Uncategorized | Tagged | 4 Comments

Erroneous information … was given

By coincidence, days after I wrote about the apparently very low midge counts in Hiidenvesi, the authors published a correction.

Erroneous information considering Chironomidae and Chaoboridae accumulation was given in Figure 4 published in Luoto et al. (2017). Therefore, it cannot be stated as in the results (page 5) that the chironomid flux increased towards the present. This data is not interpreted or discussed in the manuscript any further, but as a result from the incorrect accumulation unit, the midge accumulation data cannot be used for reference in other research. The volumetric midge sample size given in the methods (page 3) is the maximum sample size used. Also the midge percentage information in Figure 4 is incorrect. However, the correct percentages were otherwise used throughout the manuscript, including the midge-based reconstructions.


This is figure 4 from the original paper


Apparently, the accumulation rates in the figure are wrong (the figure shows concentrations, not accumulation rates), the amount of sediment used differed from that reported, and, added almost as an afterthought, the midge assemblage data are also wrong.

Fortunately, however, the correction assures the reader that all the rest of the paper, with the exception of a sentence about the increasing flux, was based on the correct data. Obviously, it is unreasonable to expect the correction to include a figure showing the correct data or an explanation of why the wrong data were published: there is just not enough space on the internet.

Curiously, the description of the stratigraphy on page five of the original paper, which the correction assures the reader is for the correct data, perfectly matches the incorrect data.

The incorrect data in the stratigraphy are easy to digitise and are entirely consistent with count sums of between 7 and 21 midges. Counts so low that no honest scientist would knowingly try to portray them as counts of at least 50 midges.

The Chaoborus:Chironomidae ratio calculated from the digitised data matches the published curve; the N2 curve can also be reproduced (if the data are first square root transformed). Even the ordination can be perfectly reproduced.
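The published curves can be rebuilt from the digitised data along these lines. A sketch, in which `midge` is an assumed samples x taxa matrix of percentages with a `Chaoborus` column and the remaining columns chironomid taxa:

```r
library(vegan)
#sketch: `midge` is the digitised samples x taxa matrix (percent),
#with a Chaoborus column and the remaining columns chironomid taxa
chironomid <- midge[, colnames(midge) != "Chaoborus"]
ratio <- midge[, "Chaoborus"] / rowSums(chironomid)
#Hill's N2 (inverse Simpson) on square-root transformed data
n2 <- diversity(sqrt(midge), index = "invsimpson")
```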


Published ordination


Reproduction of the ordination

What are the odds of an ordination on the incorrect data being identical to an ordination on the correct data?

I cannot reproduce the reconstructions of dissolved oxygen or total phosphorus as the calibration set data have not been archived. (The total phosphorus reconstruction uses a four-component WAPLS model which apparently has an R2 of 0.92. Without the raw data, I am disinclined to believe this; it could be another transfer function where the apparent performance is reported as the cross-validated performance.)
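The apparent/cross-validated distinction is easy to check in rioja when the data are available. A sketch, where `spp` (calibration-set species data) and `tp` (total phosphorus) are placeholders for the unarchived data:

```r
library(rioja)
#sketch: `spp` (calibration-set species data) and `tp` (total
#phosphorus) are placeholders - the real data are not archived
mod <- WAPLS(sqrt(spp), tp)
performance(mod)                       #apparent performance - always optimistic
mod_cv <- crossval(mod, cv.method = "loo")
performance(mod_cv)                    #cross-validated performance - the one to report
```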

While it is good to see the authors taking responsibility for problems in the Hiidenvesi paper, I am not persuaded that the correction is a full account of its problems. The authors need to archive the raw data, explain how the incorrect data came to be published, and explain how the results they report as based on the correct data can be reproduced from the incorrect data.

Posted in Peer reviewed literature, transfer function | Tagged , | Leave a comment

Making a pollen diagram from Neotoma

Last week I gave a course on R for palaeoecologists covering data handling using tidyverse, reproducibility, and some ordinations and transfer functions. One of the exercises was to download some pollen data from Neotoma and make a pollen diagram. This is my solution. It uses a lot of tidyverse. Everything could be done with base R, but I find tidyverse more expressive.

#load packages
#devtools::install_github("richardjtelford/ggpalaeo") #NEW VERSION
library("ggpalaeo") #Need new version

I’ve chosen to plot the pollen data from Lake Bambili. First we need to download the data with the neotoma package

#download data
get_dataset(40940)#check correct site
## A dataset_list containing 1 objects:
## Accessed from 2019-10-10 16:41h to 2019-10-10 16:41h.
## Datasets:
## long lat type
## 40940 Bambili 1 10.24383 5.93487 pollen
bambili_raw <- get_download(40940)
## API call was successful. Returned record for Bambili 1
eco_types <- get_table("EcolGroupTypes")

Then we need to extract the components we need from the downloaded data. get_download returns a list with one element per dataset which we can extract with the $ notation.

#extract components
cnts = counts(bambili_raw$`40940`) #table of counts
meta = bambili_raw$`40940`$sample.meta #sample depths/ages
taxa = bambili_raw$`40940`$taxon.list %>% #taxon info
  mutate_all(as.character) #convert everything from a factor to prevent warnings later

I want to add the age/depth information to the counts. Here I am using bind_cols to do this, which assumes the rows are in the same order. The paranoid might want to convert the row names of the counts into a column (rownames_to_column) and then join on that instead. Next I use pivot_longer (the replacement for tidyr::gather) to convert the data from a species x samples data.frame to a long format, and join to the taxon information. Note that some datasets have an alias column in taxon.list and you need to join by that instead.

bambili <- meta %>% select(age, depth) %>%
  bind_cols(cnts) %>%
  pivot_longer(cols = -c("age", "depth"), names_to = "species", values_to = "count") %>%
  left_join(taxa, by = c("species" = "taxon.name"))
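The paranoid alternative might look like this. A sketch, assuming sample.meta has a sample.id column matching the row names of the counts:

```r
#sketch of the paranoid alternative: join by sample id rather than
#relying on row order (assumes sample.meta has a sample.id column
#matching the row names of the counts)
bambili_safe <- cnts %>%
  as.data.frame() %>%
  rownames_to_column(var = "sample.id") %>%
  mutate(sample.id = as.numeric(sample.id)) %>%
  left_join(meta %>% select(age, depth, sample.id), by = "sample.id")
```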

Now is a good time to check what we have. It’s useful to have the Neotoma database manual open to check what terms mean.

bambili %>% count(variable.element)
## # A tibble: 3 x 2
## variable.element n
## 1 pollen 36750
## 2 pollen/spore 175
## 3 spore 875

variable.element contains just pollen and spores. It can include a variety of other things (such as stomata) that we might want to filter out.

eco_types %>%
  semi_join(bambili, by = c("EcolGroupID" = "ecological.group")) %>%
  select(EcolGroupID, EcolGroup)
## EcolGroupID EcolGroup
## 1 AQVP Aquatic Vascular Plants
## 2 MANG Mangroves
## 3 PALM Palms
## 4 SEED Spermatophyte rank or clade above order
## 5 TRSH Trees and Shrubs
## 6 UNID Unknown and Indeterminable Palynomorphs
## 7 UPHE Upland Herbs
## 8 VACR Terrestrial Vascular Cryptogams

I don’t want to include the AQVP or UNID in the diagram. I’m not sure about SEED.

bambili %>% filter(ecological.group == "SEED", count > 0) %>%
  select(species, count) %>%
  group_by(species) %>%
  summarise(n = n(), max = max(count))
## # A tibble: 1 x 3
## species n max
## 1 Monocotyledoneae undiff. 33 6

So SEED is an unidentified monocot present at low abundances. I’m going to delete it.

#remove unwanted variable.element/ecological.group

bambili = bambili %>% 
  filter(!ecological.group %in% c("AQVP", "UNID", "SEED"))
#use `%in%` not `==`

This is a good time to check the count sums. It might be prudent to delete any samples with very low counts.

#check count sums
bambili %>%
  group_by(depth) %>%
  summarise(s = sum(count)) %>%
  arrange(s) %>%
  slice(1:5)
## # A tibble: 5 x 2
## depth s
## 1 510. 72
## 2 220. 157
## 3 200. 176
## 4 240. 177
## 5 280. 233

Seventy-two isn't great, but I'm going to keep it.

Now we can calculate percent and remove the rare taxa

#calculate percent
bambili = bambili %>% 
  group_by(depth) %>% 
  mutate(percent = count/sum(count) * 100) 

#remove rare taxa
bambili1 = bambili %>% 
  group_by(species) %>% 
  filter(
    sum(percent > 0) >= 3, #must be in at least three samples
    max(percent) > 3) #must have a max percent > 3

Now we can use pivot_wider to reshape the data back into a species x samples data.frame that we can plot with rioja::strat.plot. For convenience, I’m separating the species data from the age/depth data.

bambili2 <- bambili1 %>%
select(age, depth, species, percent) %>%
pivot_wider(names_from = "species", values_from = "percent")

bambili_spp <- bambili2 %>% select(-age, -depth) %>% #strat.plot does not like tibbles - pull request to fix this submitted

Now we can plot the stratigraphy.

strat.plot(bambili_spp, yvar = bambili2$depth)


There are a variety of aspects of this plot that need improving. We need to:

  • plot on constant scale for all taxa
  • reverse y axis so deeper levels are lower
  • arrange the taxa in some logical order
  • rotate the species names and set the figure margins
  • add a cluster diagram and zones
  • add a secondary scale

Some of these can be fixed by setting an argument in strat.plot (there are a lot of arguments – see ?strat.plot), but to reorder the species, we need to reprocess the data.

bambili2 <- bambili1 %>%
  #make ecological.group a factor with TRSH first
  mutate(ecological.group = factor(ecological.group, levels = c("TRSH", "UPHE", "VACR")),
         mean_percent = mean(percent)) %>%
  #arrange by ecological.group and mean_percent (largest first)
  arrange(ecological.group, desc(mean_percent)) %>%
  ungroup() %>%
  #make species into a factor so we can preserve the order
  mutate(species = factor(species, levels = unique(species)))

#reshape using tidyr::spread as pivot_wider (currently?) ignores factor order
bambili3 <- bambili2 %>%
select(age, depth, species, percent) %>%
spread(key = "species", value = "percent")

bambili_spp <- bambili3 %>%
  select(-age, -depth) %>%

#set up for ecological group colours
ecological_groups <- bambili2 %>%
  distinct(species, ecological.group) %>%
ecological_colours <- c("red", "green", "orange")

And we run a constrained cluster analysis

bambili_dist <- dist(sqrt(bambili_spp/100))#chord distance
clust <- chclust(bambili_dist, method = "coniss")
#bstick(clust)#five groups

Now we can make a better plot.

#set up mgp (see ?par)
mgp <- c(2, 0.25, 0)
par(tcl = -0.15, mgp = mgp)#shorter axis ticks - see ?par
pt <- strat.plot(
  d = bambili_spp,
  yvar = bambili3$depth,
  y.rev = TRUE, #reverse direction of y-axis
  scale.percent = TRUE, #use constant scale for all taxa
  srt.xlabel = 45, #rotate x-label by 45 degrees
  cex.xlabel = 0.8, #smaller font
  mgp = mgp,
  xRight = 0.98, #right margin
  xLeft = 0.21, #left margin with space for 2nd axis
  yTop = 0.60, #top margin
  yBottom = 0.1, #bottom margin
  col.line = ecological_colours[ecological_groups], #colours
  ylabel = "Depth cm",
  clust = clust
)

#add zone boundaries
addClustZone(pt, clust = clust, nZone = 5)

#add a secondary scale
secondary_scale(pt, yvar = bambili3$depth, 
                yvar2 = bambili3$age, 
                ylabel2 = "Date yr BP",
                n = 10)


It’s beginning to get there. There are probably too many taxa plotted; merging the various Polypodiophyta would help, as the names are very long and the ecological interpretation of the different types is unclear. I also want to reduce the space between the y-axis and the ylabel. Unfortunately this is hard-coded in strat.plot, but it would only take a minute to make it into an argument. I’d also like to add a % sign to the axis, but I cannot see an argument that will do that (again, it shouldn’t be hard to code – adding yet another argument).

I would like to have some text on top describing the ecological groups, but I have the feeling that that would be very difficult to do.

Posted in Data manipulation, R, reproducible research | Tagged | 1 Comment

Count sums – the preprint

I wandered lonely as a cloud
That floats on high o’er vales and hills,
When all at once I saw a crowd,
A host, of golden chironomids;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.

Continuous as the stars that shine
And twinkle on the milky way,
They stretched in never-ending swarm
Along the margin of a bay:
Ten thousand saw I at a glance,
Tossing their heads in sprightly dance.

The waves beside them danced; but they
Out-did the sparkling waves in glee:
A poet could not but be gay,
In such a jocund company:
I gazed—and gazed—but little thought
What wealth the show to me had brought:

For oft, when on my couch I lie
In vacant or in pensive mood,
They flash upon that inward eye
Which is the bliss of solitude;
And then my heart with pleasure fills,
And dances with the chironomids.

Sincere apologies to W. Wordsworth: nothing rhymes with chironomids

But were there really ten thousand? Next time you walk by a lake, try to count the chironomids in a swarm. It is impossible, they just move about too much, and it doesn’t seem to get much easier when they are dead.

Just look at the midge stratigraphy from the eutrophic Hiidenvesi in Finland. The methods section of the paper reports that “A minimum count size of 50 individuals was set” in 5 cm3 sediment samples.


Midge stratigraphy from Hiidenvesi including the absolute (black bars) and relative abundances (grey bars) of chironomids and chaoborids. The chaoborid/chironomid ratio indicates the severity of anoxia and N2 diversity index represents the effective number of occurrences

Do these assemblages really look like the minimum count sum is 50?

The lowest sample has about 45% Chaoborus flavicans and just over 10% of five chironomid taxa. This looks suspiciously like a count of nine midges: four Chaoborus  (44.4%) and one each for five chironomid taxa (11.1%). This interpretation is supported by the concentration data on the left of the figure. This sample has 0.8 Chaoborus and 1 chironomid per cubic centimetre of sediment. Since 5 cm3 of sediment were sampled and 5 * (0.8 + 1) = 9, this also implies that the count sum is nine midges. Last time I checked, nine was somewhat less than fifty. The second sample from the top appears to have the largest count sum with 1.2 Chaoborus and 3 chironomids per cubic centimetre of sediment, implying a count sum of 21, less than half that reported. All the analyses based on the chironomid stratigraphy are much more uncertain than the reader would expect and may be biased.
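The arithmetic here generalises: the count sum behind a set of percentages is, roughly, the smallest integer n that makes every percentage a near-multiple of 100/n. A crude sketch of this idea (not the exact method in the preprint below):

```r
#smallest count sum consistent with a vector of percentages:
#the true count n must make every percent a near-multiple of 100/n
estimate_count <- function(percent, max_n = 100, tol = 0.5) {
  for (n in seq_len(max_n)) {
    if (all(abs(percent - round(percent * n / 100) * 100 / n) < tol)) {
      return(n)
    }
  }
  NA #no count up to max_n fits
}
estimate_count(c(44.4, 11.1, 11.1, 11.1, 11.1, 11.1)) #9
```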

Given the number of examples of papers that appear to have counted fewer, often far fewer, microfossils than reported, I decided to write up and submit a manuscript describing the problem and showing how the count sum can be estimated from percent data. If you don’t have at least infinite patience for the peer review process to finish, you can read a preprint on Paleorxiv. Some of the examples I report in the preprint (none of which have yet featured on this blog) appear to be due to lazy descriptions of the data, some due to assemblages being scheduled for deletion but not actually deleted, and some due to what might be described as falsification of the count sums. I also accidentally found some fabricated data – maybe more on that in another post.

Posted in Peer reviewed literature | Tagged , , | 1 Comment

Count-less chironomids?

Most papers that present microfossil assemblages report (not always accurately) the minimum number of microfossils that were counted in each sample, an important indicator of the precision of the data. Some do not. For these papers, the reader should be able to assume that the counts conform to typical practice for that microfossil. For example, I would expect a pollen or diatom stratigraphy to have count sums of a few hundred, and a chironomid stratigraphy to have a minimum count sum of fifty (perhaps thirty). If the count sums are materially lower than this, I would start to wonder if the omission was deliberate and recall Fanelli’s (2013) definition of misconduct:

‘any omission or misrepresentation of the information necessary and sufficient to evaluate the validity and significance of research, at the level appropriate to the context in which the research is communicated’.

The figure below is from a paper that reconstructs July temperature for four intervals in the period ca. 570 BCE to 120 CE (Iron Age–Roman Period) using chironomids from the varved sediments of Lake Silvaplana. The paper does not report the minimum count sum; can it be estimated from the stratigraphy?


(a) Chironomid assemblages, (b) inferred mean July air temperatures and (c) % organic matter. For part (a), percentages for each taxon are represented by black bars. Thin black lines represent the zones created using the ZONE program (Juggins, 1991).

I think the minimum count sum can be estimated. The lowest assemblage is 100% Orthocladius. This could represent one Orthocladius, a hundred, or a thousand, but given the chironomid diversity in the lake (in contrast to polar foram assemblages, which are commonly 100% Neogloboquadrina pachyderma), a large count would be unlikely to contain just one taxon, even if that taxon is fairly dominant. The next assemblage is 100% Cricotopus, the third and fourth 100% Corynoneura, and the fifth 100% Procladius: there are two possibilities. Either the chironomid community in the lake repeatedly changed abruptly and taphonomic processes did not mix the assemblages, or the count sum for these assemblages is very low, perhaps only one. Fifty-five of the 208 assemblages in the stratigraphy are monospecific.

The eighth and ninth assemblages are 50% Tanytarsus and 50% another taxon. These are conceivably counts of two chironomids. The eleventh assemblage is more diverse, with five taxa each at 20%, perhaps a count of five. Only a few assemblages, mainly near the top, appear to have a count sum of more than ten chironomids.

These are the lowest estimated count sums I have found, even lower than in Lake Muzzano.

Despite these extremely low count sums, a reconstruction of July temperature is made with a transfer function. Perhaps not surprisingly, the reconstruction is very noisy, with abrupt changes of over 10°C. It is meaningless junk.

A correction is needed to report the omitted count sums, and if they are mainly in low single figures, I think the chironomid part of the paper should be retracted as no reviewer would ever have recommended publication had they known that.

Posted in Peer reviewed literature, transfer function | Tagged , , | Leave a comment

Tools for a reproducible manuscript

It is all too easy to make a mistake putting a manuscript together: mistranscribe a number, forget to rerun some analyses after the data are revised, or simply bodge the rounding of some numbers. Fortunately, it is possible to avoid these problems with some technology that inserts numbers and graphs into the manuscript, keeps them all up to date, and lets you see the calculations behind everything. The technology does not prevent code errors, but by making the analysis transparent, they are hopefully easier to find.

There are numerous guides to writing dynamic documents with Rmarkdown. What I want to do here is show some of the things I needed to write a manuscript with Rmarkdown. There may be better solutions to some of the problems I faced – tell me about them. If you want to see any of this in action, take a look at the GitHub repo for my recently accepted manuscript.

To get started with Rmarkdown, in RStudio, go to File >> New File >> R markdown and you get a sample file to play with. Remember, no spaces in the filename.

Too much code

Rmarkdown lets you have chunks of R code embedded in the text, so that when the document is rendered, the code is evaluated following the chunk’s options and the results are printed. This is great for teaching and simple documents, but when you have several hundred lines of code to clean data, it becomes unwieldy as a single document, and prevents reuse if you want to make a manuscript and a presentation from the same code.

The first solution I found for this is the knitr::read_chunk function which lets you put the code in a separate file. I later found the drake package which lets you put most of the code in the drake_plan and then readd or loadd the generated objects (targets) into rmarkdown. When you make the drake_plan only the targets that are out of date (because of code or data changes) get updated. This nicely solves the problem of how to cache the results of slow analyses (there is a cache built into Rmarkdown but it is a bit fragile).
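A drake plan might look something like this. A sketch, with hypothetical data file and helper functions (tidy_counts, fit_model) standing in for the real analysis:

```r
library(drake)
#sketch of a drake plan: only out-of-date targets are rebuilt
plan <- drake_plan(
  raw = read.csv(file_in("data/counts.csv")), #hypothetical data file
  clean = tidy_counts(raw),                   #hypothetical helper
  model = fit_model(clean),                   #hypothetical helper
  manuscript = rmarkdown::render(
    knitr_in("manuscript.Rmd"),
    output_file = file_out("manuscript.pdf")
  )
)
make(plan)
#inside manuscript.Rmd, use loadd(model) or readd(clean)
```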

Article format

If you want to submit your manuscript to a journal as a PDF, things will go a lot more smoothly if you use the journal’s LaTeX template. I have no idea how to do that either, but the rticles package takes away most of the pain. You make a draft manuscript with

rmarkdown::draft("MyArticle.Rmd", template = "acm_article", package = "rticles")

This generates the required YAML, the metadata at the top of the Rmarkdown file between two sets of ---. This works well provided you don’t do anything daft, such as submit to Wiley, one of the few major publishers without a template in rticles.


Citations

Obviously, a manuscript is going to need citations, and formatting citations is a pain. rticles will put a couple of lines in the YAML to say which bibtex file to use and which citation style.

bibliography: chironomid2.bib
csl: elsevier-harvard_rjt.csl

Any decent citation manager will be able to export a bibtex file. I use JabRef.

CSL files for many different journals are available. I had to edit the Elsevier style slightly so that it showed R packages correctly.

In the text, write [@Smith1990] to cite the reference with the key Smith1990 with name and year, or [-@Smith1990] to get just the year. The formatted reference list is added to the end of the file. The citr addin for RStudio lets you search through the bibtex file and select references to add to the manuscript.

Line numbers

I hate reviewing manuscripts without line numbers. So I wanted line numbers on my manuscript. This turned out to be easy – Rmarkdown documents are rendered into PDF via LaTeX, so LaTeX commands can be used.

In the YAML, add

header-includes:
  - \usepackage{lineno}
  - \linenumbers

And that’s it.

Cross-referencing figures

One of my favourite parts of writing a manuscript is when I decide to add another figure and need to renumber all the other figures and make sure that the text refers to the correct figure.

Rmarkdown can cross-reference figures (also tables etc.). This feature was added relatively recently, and needs bookdown rather than rmarkdown, which can be set in the YAML (NB spacing is important in YAML):

output:
  bookdown::pdf_document2:
    number_sections: true

In the text, to refer to the figure from the chunk with the label weather-climate-plot, use (Fig. \@ref(fig:weather-climate-plot)). Important: this does not work if you have underscores in the chunk name. That took me an entire evening to work out.

You can also cross-reference figures in the supplementary material. In the header-includes: section of the YAML, add

  - \usepackage{xr}
  - \externaldocument[sm-]{supplementary_data}

Where supplementary_data is the name of the rmarkdown file you want to reference. In the text you can then use Fig. \@ref(sm-fig:zab-dist-analogue). The sm- prefix prevents any clash between the labels in the manuscript and the supplementary material.

To force the figures in the supplementary to be prepended with an “S”, to the YAML in the supplementary, add

  - \renewcommand{\thefigure}{S\arabic{figure}}

This magic works on the *.tex files made as an intermediate stage of rendering the PDF. You need to keep these by running rmarkdown::render with clean = FALSE; then you can run tinytex::pdflatex.

tinytex::pdflatex("manuscript.tex", clean=TRUE)
tinytex::pdflatex("supplementary_data.tex", clean=TRUE)

Package management

One of the great things about R is that the packages are often improved. This is also a problem, as code that worked fine may fail after you update a package. One solution is to use packrat to keep track of which packages your project uses. packrat will automatically download all the packages you need to reproduce my manuscript.

I found packrat to be a bit of a pain, and am looking forward to trying renv.
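For reference, the advertised renv workflow is compact. A sketch (untested here, as I have not switched yet):

```r
#sketch of the renv workflow (untested here)
renv::init()     #create a project-local package library
renv::snapshot() #record package versions in renv.lock
renv::restore()  #reinstall the recorded versions on another machine
```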

Rmarkdown and git

As rmarkdown files are text files, they work well with version control using git and, for example, GitHub. Since git tracks changes in each line of text, it would show a long paragraph as having changed if you add one comma. A trick I learnt somewhere is to break the paragraph up, one sentence per line, so that version control works at the sentence level. Rmarkdown ignores single line breaks in a document; you need two to get a new paragraph.


It is not difficult

If you can write code, and you can write text, you can write a manuscript with Rmarkdown. Your coauthors can probably cope too.

Posted in reproducible research | Tagged | 2 Comments

Double diatoms

I am in awe of Dr Elisabeth Bik and her amazing ability and dedication to spotting duplications in images.

Follow the thread to see the duplications she finds in this photograph. I usually struggle to find duplications until they are pointed out.

But not always.

Take, for example, this diatom calibration set.

[Figure: calibration-set diatom assemblages, from “Diatom-based reconstruction of trophic status changes recorded in varved sediments of Lake […]”]

The lower two assemblages are, within the resolution of the PDF, identical. Having spotted that, I wanted to know whether any of the other assemblages were identical, so I scraped the data out of the PDF. All the other assemblages are distinct.

But what about the next figure which shows the down-core stratigraphy?

[Figure: down-core diatom stratigraphy from the same paper]

While the upper part of the diagram is spiky, as expected, the lower part is step-like, with nine pairs of apparently identical assemblages.

These identical assemblages in the calibration set and the down-core stratigraphy are unexpected. I asked the authors if they could explain what happened, but they haven’t replied.

The paper also includes a transfer function that I cannot reproduce with the assemblage data I scraped from this paper and the environmental data I scraped from another paper on the same calibration-set lakes. The reported cross-validation performance is better than the apparent performance I get, though it is possible that the environmental data had more problems than just using the wrong unit prefix. If my model is correct, the actual cross-validation performance is very weak (R2 = 0.03), the reconstruction shown in the paper is bogus, and the paper should be corrected (or retracted).

I’m not holding my breath.

Posted in Peer reviewed literature, Uncategorized | Tagged , | Leave a comment

Funky ordination plots with ggvegan

Yesterday, I tweeted a photo of an ordination I plotted with ggvegan, and thought I should show how I made it.

Ordinations can be plotted with base R (see ?plot.cca). This is fine for simple plots, but making a complex plot takes a lot of effort. The ggvegan package, written by Gavin Simpson, lets you use the power of ggplot2, making complex plots much easier.

Start by loading some packages. ggvegan will need to be installed from GitHub if you don’t already have it.

#load packages
#remotes::install_github("gavinsimpson/ggvegan") #if not already installed
library("vegan")
library("ggvegan")
library("tidyverse")

I’m going to use the pyrifos dataset, which is included in vegan. The data are the abundances of aquatic invertebrates from an experiment in which twelve ditches were studied several times before and after insecticide treatment. We need to make a tibble (or a data.frame) of the predictors (dose, time, and ditch ID).

#load data
data(pyrifos)

env <- tibble(
  week = rep(c(-4, -1, 0.1, 1, 2, 4, 8, 12, 15, 19, 24), each = 12),
  dose = factor(rep(c(0.1, 0, 0, 0.9, 0, 44, 6, 0.1, 44, 0.9, 0, 6), 11)),
  ditch = gl(12, 1, length = 132))

We can run a principal components analysis with rda.

pca <- rda(pyrifos)

Simple ordination plots can be made with the autoplot function. This plot shows all the samples. This is fine, but we want to see dose and time information.

autoplot(pca, layers = "sites", arrows = FALSE)


At the moment, it is difficult to do much to this plot. But we can use the fortify function from ggvegan to extract the scores from the ordination object and then combine these with the predictor data before plotting with ggplot. I’m going to colour the samples by dose, and draw a line through the samples in each ditch, with a point on the first value.

pca_fort <- fortify(pca, display = "sites") %>%
  bind_cols(env) #combine scores with the predictor data

ggplot(pca_fort, aes(x = PC1, y = PC2, colour = dose, group = ditch)) +
  geom_path() + #use geom_path not geom_line, so points join in data order
  geom_point(aes(size = if_else(week == min(week), 1, NA_real_)), show.legend = FALSE) +
  scale_color_viridis_d() +
  scale_size(range = c(2, 2)) #range must be a length-two vector

Don’t forget to add coord_equal() to ensure that the plot is correctly scaled.

Happy plotting.

Posted in R, Uncategorized | Tagged , | Leave a comment

Sunspots and raindrops

It is time, as the walrus said, to talk about another paper reporting an effect of solar variability on the Earth’s climate. Laurenz et al. (2019) correlate European country-scale precipitation data from CRU with the number of sunspots, a proxy for solar activity. After allowing for a lagged response, they find some reasonably high and significant correlations in at least some months and some countries.

I do expect solar variability to affect climate, but I expect the effect to be quite small and difficult to detect above the background of internal variability and other forcings such as greenhouse gases and aerosols. However, statistical traps abound, which makes me wary of papers that report strong and highly significant relationships. Especially when the authors write

The probability that the correlation r = 0.54 is by chance is less than 0.1% (p < 0.001).

This reflects a common misunderstanding of p-values: a p-value is the probability of a result at least this extreme if the null hypothesis is true, not the probability that the correlation arose by chance. The co-authors, Horst-Joachim Lüdecke and Sebastian Lüning, the reviewers, and the editors don’t look good for letting this slip past them.

As the paper reports, monthly precipitation time series have very little autocorrelation and are almost white noise, so it seems unlikely that the strongly autocorrelated sunspot time series would be a good predictor.

This is easily tested by looking at some data. Not surprisingly, the correlation between the raw Danish February precipitation and sunspots is weak (r = 0.07) and not remotely close to being significant (p = 0.43).

So how do Laurenz et al. get to report this correlation as 0.42 (p < 0.01)? Predictably, they smooth the data, and allow for lagged responses. Using their methods, I get approximately the same correlation. But what about the significance? Many papers that smooth the data or allow for lagged response (a form of multiple testing) fail to correct the p-value and then find spuriously significant results. Laurenz et al. don’t make that mistake – they use a Monte Carlo procedure, applying their method to 10,000 surrogate time series to find the distribution of p-values under the null hypothesis. They report that for values of |r| > 0.45 p < 0.001; for |r| > 0.35 p < 0.01; and for |r| > 0.25 p < 0.05 (I’m not exactly sure why they are reporting p-values for absolute correlations when they need a one-sided test.)
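Before re-implementing their procedure, it is worth seeing why the correction matters. A minimal sketch (using a simple moving average rather than the Savitzky-Golay filter the paper uses): smooth one white-noise series, then take the best correlation with another white-noise series over a range of lags.

```r
set.seed(42)
n <- 115  # years of data, 1901-2015

x <- rnorm(n)                                     # unsmoothed "precipitation"
y_smooth <- stats::filter(rnorm(n), rep(1/5, 5))  # smoothed "sunspots"

# best absolute correlation over lags 0 to 20
best <- max(sapply(0:20, function(l) {
  abs(cor(x[1:(n - l)], y_smooth[(1 + l):n], use = "pair"))
}))
# `best` is typically far larger than the correlation of the raw,
# unlagged series, even though everything here is pure noise
```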

It only takes a few lines of code to re-implement their Monte Carlo procedure and check these results.


#load packages
library("tidyverse")
library("furrr")    #for future_map
library("magrittr") #for %$%

#load data
ss0 <- read.table("data/SN_m_tot_V2.0.txt", nrow = 3237)
ss <- ts(ss0$V4, start = ss0$V1[1], frequency = 12) %>%
window(start = 1901, end = c(2015, 12))

denmark <- read.table("data/crucy.v3.26.1901.2017.Denmark.pre.per", skip = 3, header = TRUE) %>%
filter(YEAR <= 2015)

#make function
lags.cor <- function(l, m, ss, clim){
  ss.lag <- stats::lag(ss, l)
  ss.lag <- window(ss.lag, start = c(1901, m), end = c(2015, m), extend = TRUE, deltat = 1)
  cor(ss.lag, clim, use = "pair")
}

#make null distribution
null_dist <- future_map(1:10000, ~{
  y <- rnorm(nrow(denmark))
  ys <- signal::sgolayfilt(x = y, p = 5, n = 11)
  tibble(lag = -110:24) %>%
    mutate(cor = map_dbl(lag, lags.cor, m = 1, clim = ys, ss = ss))
}) %>%
  bind_rows(.id = "n") %>%
  mutate(n = as.numeric(n))

I am using white noise as I cannot find a function to simulate autocorrelated time series with a Hurst exponent of 0.6, which the authors use. This will make my significance thresholds slightly lower, as will my use of a one-sided rather than two-sided test. Despite this, I find that the significance thresholds are much higher than those the authors report.
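As a rough stand-in (not true fractional noise with H = 0.6), base R can at least simulate a weakly autocorrelated AR(1) series with arima.sim:

```r
# A weakly autocorrelated AR(1) series as a rough stand-in for fractional
# noise with Hurst exponent 0.6 (which base R cannot simulate directly)
set.seed(42)
y <- as.numeric(arima.sim(model = list(ar = 0.2), n = 1380))  # 115 years x 12 months
a1 <- acf(y, plot = FALSE)$acf[2]  # lag-1 autocorrelation, close to 0.2
```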

null_dist %>%
group_by(n) %>%
slice(which.max(cor)) %$%
quantile(cor, probs = c(0.95, 0.99))
## 95% 99%
## 0.40 0.47

Only if I ignore the lagged responses do I get results similar to the authors.

null_dist %>%
filter(lag == 0) %$%
quantile(cor, c(0.95, .99))
## 95% 99%
## 0.25 0.34

It appears that the authors neglected to allow for the lagged responses in their Monte Carlo procedure. As such, their p-values are over-optimistic, and many of their apparently significant results are not actually significant.

Still, there are ~45 monthly series with a correlation that exceeds the 95th percentile of the null distribution. Is this impressive? With 39 countries x 12 months = 468 tests, we should expect ~23 correlations to be “significant” at the 5% level just by chance, so a correction for multiple testing is required. Forty-five significant correlations is more than expected, but the tests are not independent: precipitation time series from neighbouring countries (e.g., Belgium and Luxembourg) can be very highly correlated because of spatial patterns in precipitation (perhaps enhanced by the gridding process), so if one country has a significant correlation, several of its neighbours probably will too. This raises the probability of getting 45 “significant” correlations by chance.
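The arithmetic behind the expected count, and how surprising 45 would be if the 468 tests really were independent:

```r
n_tests <- 39 * 12           # countries x months = 468 tests
expected <- 0.05 * n_tests   # ~23 "significant" by chance at p < 0.05

# If the tests were independent, the chance of 45 or more significant
# results would come from a binomial tail:
p_45 <- pbinom(44, size = n_tests, prob = 0.05, lower.tail = FALSE)
# p_45 is tiny, but the spatial correlation between countries means the
# effective number of independent tests is much smaller than 468
```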

Given the apparent error in calculating the p-values in Laurenz et al. (2019), the authors need to archive their code, and if the problem is confirmed, should write a correction to their paper.

Posted in Peer reviewed literature, Uncategorized | Tagged , | 5 Comments