Resampling Assemblage Counts

Imagine that one diatom assemblage has 20 species and another has 25, but the more species rich assemblage also has more diatoms counted.
A fair comparison of the richness of each assemblage can only be made when the number of individuals is constant.

In this case, we can use the function rarefy from the vegan package. Other analyses might not be possible to run directly, so you need re-sampled counts with a specified count sum. I’ve been doing this recently to find the how the prevalence of singletons (species only occurring once) varies with count sum.

There are several ways to resample counts. A key question is whether you want to sample with or without replacement.

Sampling without replacement asks what the assemblage you currently have might have been when the count sum was smaller. If you sample as many individuals as you actually have, your resampled assemblage is identical to your current assemblage and you cannot sample more individuals than you have.

Sampling with replacement asks what the assemblage you currently have might look like if you counted another assemblage from the same material.
In principle, you can simulate larger assemblages than you currently have, though this is probably always a bad idea. Species richness will be biased low, as rare taxa present in the community behind the veil line cannot occur in the resampled assemblages but could occur if another diatom slide was counted.

I’m going to resample the vegan::BCI dataset of tree counts in plots on Barro Colorado Island.

data(BCI, package = "vegan")

First a tidyverse solution.

BCI %>% 
  # put the rownames into a column called "plot"
  rownames_to_column(var = "plot") %>% 
  # convert the data from a wide to a long format
  pivot_longer(-plot, names_to = "species", values_to = "n") %>% 
  # uncount does the opposite of count - giving 2 rows for a species that had a count of 2.
  uncount(n) %>% 
  group_by(plot) %>%
  # randomly sample, with replacement, 100 individuals (rows) per plot
  slice_sample(n = 100, replace = TRUE) %>% 
  # count the number of individuals per species per plot
## # A tibble: 2,128 × 3
## # Groups:   plot [50]
##    plot  species                     n
##    <chr> <chr>                   <int>
##  1 1     Alseis.blackiana            6
##  2 1     Annona.spraguei             1
##  3 1     Apeiba.glabra               2
##  4 1     Astronium.graveolens        3
##  5 1     Brosimum.alicastrum         1
##  6 1     Casearia.sylvestris         2
##  7 1     Cassipourea.guianensis      1
##  8 1     Cecropia.insignis           1
##  9 1     Chrysophyllum.argenteum     1
## 10 1     Cordia.bicolor              3
## # … with 2,118 more rows

The tidyverse solution might not be the fastest (see below to find out) as it is doing a lot of extra work. So let’s write a function that has the same essential steps of uncounting the counts, sampling them, and the recounting them.

resample_1 <- function(counts, size, replace = FALSE){
  #counts is species count data to resample,
  #size is target count sum
  #replace or not

  #grab species names (if present)
  id <- names(counts)
    id <- 1:length(count)
  count2 <- rep(id, times = counts)
  # sample
  samp <- sample(count2, size = size, replace = replace)
  # count again
  tab <- table(samp)
  # fill missing taxa with zero
  result <- rep(0, length(counts))
  names(result) <- id
  result[names(tab)] <- tab
  #return result - a named vector

#first BCI plot
bci1 <- unlist(BCI[1, ])

resample_1(count = bci1, size = 100, replace = TRUE)[1:20]
##       Abarema.macradenia    Vachellia.melanoceras    Acalypha.diversifolia 
##                        0                        0                        0 
##    Acalypha.macrostachya           Adelia.triloba     Aegiphila.panamensis 
##                        0                        0                        0 
##  Alchornea.costaricensis      Alchornea.latifolia         Alibertia.edulis 
##                        0                        0                        0 
##  Allophylus.psilospermus         Alseis.blackiana        Amaioua.corymbosa 
##                        0                        9                        0 
##      Anacardium.excelsum           Andira.inermis          Annona.spraguei 
##                        0                        0                        0 
##            Apeiba.glabra         Apeiba.tibourbou  Aspidosperma.desmanthum 
##                        1                        0                        0 
## Astrocaryum.standleyanum     Astronium.graveolens 
##                        0                        1

There are also R functions that do the same jobs.

Resampling with replacement gives a multinomial distribution; for which we can also use stats::rmultinom. This takes the argument prob for the species data. The data are standardised to sum to one, so counts, percent or proportions could be used. n is the number of resamples to make. The return object is a matrix with n columns and one species per row.

rmultinom(n = 1, size = 100, prob =  bci1)[1:20, ]
##       Abarema.macradenia    Vachellia.melanoceras    Acalypha.diversifolia 
##                        0                        0                        0 
##    Acalypha.macrostachya           Adelia.triloba     Aegiphila.panamensis 
##                        0                        0                        0 
##  Alchornea.costaricensis      Alchornea.latifolia         Alibertia.edulis 
##                        0                        0                        0 
##  Allophylus.psilospermus         Alseis.blackiana        Amaioua.corymbosa 
##                        0                        5                        0 
##      Anacardium.excelsum           Andira.inermis          Annona.spraguei 
##                        0                        0                        0 
##            Apeiba.glabra         Apeiba.tibourbou  Aspidosperma.desmanthum 
##                        3                        0                        0 
## Astrocaryum.standleyanum     Astronium.graveolens 
##                        0                        1

Resampling without replacement follows the multivariate hypergeometric distribution, which can be fitted with rmvhyper from package extraDistr. The arguments have different names, nn is the number of resamples, k is size and n is the species data to resample. The output is a matrix with nn rows and one species per column – the transpose of rmultinom.

rmvhyper(nn = 1, k = 50, n =  bci1)[, 1:20]
##  [1] 0 0 0 0 0 0 1 0 0 0 3 0 0 0 0 3 0 0 0 0

As I often resample many assemblages many times, it is useful to know which solution is fastest.


size <- 100
# preprocess for tidyverse
BCI1 <- BCI %>% 
  rownames_to_column(var = "plot") %>% 
  slice(1) %>% 
  pivot_longer(-plot, names_to = "species", values_to = "n")

mb1 <- microbenchmark(
  tidyverse = BCI1 %>% 
    uncount(n) %>% 
    group_by(plot) %>%
    slice_sample(n = 100, replace = TRUE) %>% 
    resample_wo = resample_1(count = bci1, size = size, replace = FALSE),
  resample_w = resample_1(count = bci1, size = size, replace = TRUE),
  rmultinom = rmultinom(n = 1, size = size, prob =  bci1),
  rmvhyper = rmvhyper(nn = 1, k = size, n =  bci1)
microbench mark plot of time taken by the different method

As expected, the tidyverse solution is much slower than the other solutions. mvhyper and rmultinom are much faster than the home-made solution. I don’t know why mvhyper is slower than rmultinom.

I often need to resample an assemblage many times. mvhyper and rmultinom have this functionality built-in. The home-grown function can be extended to make multiple resamples.

resample_n <- function(count, size, n = 1, replace = FALSE){
  id <- names(count)
    id <- 1:length(count)
  count2 <- rep(id, times = count)
  replicate(n = n, {
    samp <- sample(count2, size = size, replace = replace)
    tab <- table(samp)
    result <- rep(0, length(count))
    names(result) <- id
    result[names(tab)] <- tab

And again, we can test how fast the functions are.

mb_nrep <- microbenchmark(
  resample_wo_1 = resample_n(count = bci1, n = 1, size = size, replace = FALSE),
  resample_wo_10 = resample_n(count = bci1, n = 10, size = size, replace = FALSE),
  resample_wo_100 = resample_n(count = bci1, n = 100, size = size, replace = FALSE),

  rmultinom_1 = rmultinom(n = 1, size = size, prob =  bci1),
  rmultinom_10 = rmultinom(n = 10, size = size, prob =  bci1),
  rmultinom_100 = rmultinom(n = 100, size = size, prob =  bci1),

  resample_w_1 = resample_n(count = bci1, n = 1, size = size, replace = TRUE),
  resample_w_10 = resample_n(count = bci1, n = 10, size = size, replace = TRUE),
  resample_w_100 = resample_n(count = bci1, n = 100, size = size, replace = TRUE),

  rmvhyper_1 = rmvhyper(nn = 1, k = size, n =  bci1),
  rmvhyper_10 = rmvhyper(nn = 10, k = size, n =  bci1),
  rmvhyper_100 = rmvhyper(nn = 100, k = size, n =  bci1)

mb_nrep %>% 
  print() %>% 
  mutate(n = rep(c(1, 10, 100), times = 4),
         expr = str_remove(expr, "_\\d{1,3}")) %>%   
  ggplot(aes(x = n, y = mean/n, colour = expr)) +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Number of draws", y = "Mean time per resample,  microseconds", colour = "Method")

The home-made function has less good economy-of-scale than the other methods. I think I have some projects that need their code updating.

One last problem is how to resample all the diatom assemblages in a core. We need to iterate over each row in the data set (assuming it is in wide format – also possible to do this in long format). If we want a single resample, we can use apply.

m1 <- apply(BCI, 1, resample_1, size = 100)
m2 <- apply(BCI, 1, rmultinom, size = 100, n = 1)
m3 <- apply(BCI, 1, rmvhyper, k = 100, nn = 1)

Of course, it is also possible to run multiple resamples at once, but as this returns a matrix, apply won’t work well. Instead, we can use plyr::aaply (don’t load plyr as it will conflict with dplyr). This will return an array (the rmvhyper result is transposed).

#convert the data to a matrix or problems 
mm1 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = resample_n, n = 2, size = 100 )
## [1]  50 225   2
mm2 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = rmultinom, n = 2, size = 100 )
## [1]  50 225   2
mm3 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = rmvhyper, nn = 2, k = 100 )
## [1]  50   2 225

Coming soon, some analyses using lots of resampled assemblages.

Posted in R | Tagged , | Leave a comment

A demo targets plan for reproducible pipelines for Neotoma data

In any analysis, there are a series of steps: data importing, cleaning, analysis, creating figures, and then putting it all together in an manuscript. Very often, some of these steps are time consuming, and you don’t want to rerun everything every time you modify the code, but keeping track of which parts of the code need to be re-run is challenging.

This is where a pipeline toolkit can help. It keeps track of which of your objects are up-to-date, and which need updating.

For some years, I have been using the drake package. But drake has been superseded, targets is the new pipeline toolkit for R. I’ve been planning on migrating from drake to targets for a while.

Pipeline toolkits work very well in conjunction with the Neotoma database as downloading data is slow.

I have written a demonstration project that uses targets to

  • Download data from Neotoma
  • Process and (minimally) clean the data
  • Run a simple analysis
  • Make some figures
  • Render a manuscript containing the results and the figures using R markdown

If the site inclusion rules are changed, only new sites will be downloaded, making the download process much faster after the first run through.

The project is available on GitHub as a template. Make a copy and give it a try and give me some feedback.

Posted in reproducible research | Tagged , | Leave a comment

Reproducibility of high resolution reconstruction – one year on

It is about a year since my paper discussing the reproducibility of high resolution reconstructions was finally published, so I thought I should give a full account of what has happened since.


None of the authors of the papers I critiqued have tried to prove that I am mistaken. Many of the assertions I made should be extremely easy to disprove. For example, to prove that the calibration-in-time models are cross-validated, the authors only need to publicly archive their C2 file. That they haven’t, suggests to me that they cannot disprove my assertions. Nor have the authors corrected or retracted their papers.

What about the editors? Surely they are concerned about potentially problematic papers in their journals. Alas, I gave up expecting robust action from editors a long time ago. Anyone who follows @MicrobiomDigest will understand.

Against this pessimism, the editor of The American Naturalist did an exceptional job in dealing with dubious spider data (the original blog posts have been deleted; lawyers have been sending letters)

What about institutions?

The Integrity Officer at the University of Bern assured me that they were still investigating the Silvaplanasee and Seebergsee papers. One of the concerns with the Seebergsee paper is that the chironomid counts might be as low as three rather than the reported thirty. To prove this one only needs to be able to count to three. I have no idea why it has taken the University of Bern more than two years to do this. Perhaps they employed a two-toed sloth and it ran out of fingers.

Previously, the University of Bern had determined that, other than some regrettable confusion about whether nineteen is larger or smaller than fifty, there were no critical issues with the Zabinskie paper.
Apparently is it quite normal for the data to change half a dozen times after publication; for redundancy analyses to contain more sites than the underlying data; and for sites in a principal component analysis to bounce around and disappear like the contents of a box of frogs watched over by a heron.

The University of Bern may have discovered mountains of evidence that exonerates the authors, but their policy of maintaining complete secrecy about integrity proceedings conflicts with the need for the transparency required for trust.

One of the rationales for keeping integrity proceedings secret is to protect the reputation of anyone who is mistakenly accused.
This policy might have been viable some decades ago, but now that any fool can raise concerns on twitter or post anonymous comments on it is at best a quaint relic of a more patrician era.

Posted in Peer reviewed literature, reproducible research, transfer function | Tagged | Leave a comment

Simplistic and Dangerous Models

A few weeks ago there were none. Three weeks ago, with an entirely inadequate search strategy, ten cases were found. Last Saturday there were 43! With three inaccurate data points, there is enough information to fit an exponential curve: the prevalence is doubling every seven days. Armchair epidemiologists should start worrying that by Christmas there will be 1012 preprints relating COVID-19 to weather and climate unless an antidote is found.

Fortunately, a first version of an antidote to one form of the preprint plague that is sweeping the planet known as the SDM (apparently this not an acronym for Simplistic and Dangerous Model). A second version is due to be published soon.

So why are SDMs such a bad tool for trying to model the spread of COVID-19?

1) The system is not at equilibrium

The first case of what has become known as COVID-19 was reported on 17th November 2019 in Wuhan. The weather was a pleasant 17°C. An SDM fitted to the occurrences of COVID-19 at this time would show it was an extreme specialist, with a preferred temperature of 17°C. Of course it would be absurd to fit an SDM.


By mid-January, the virus had spread across China and an SDM would find that Chinese winter temperatures were ideal. By mid-March, the outbreaks were developing in Europe and America – places with good transport links to China and each other, and an SDM would tell a different story. A month later, and there are COVID-19 outbreaks in almost all countries. An SDM fitted now to presence absence data would show no evidence of a niche.

We can visualise this by fitting a GAM to the presence of >5 cases of COVID-19 in each country/region for each date in the John Hopkins dataset. This is similar to the data Araujo & Naimi (2020) used in the first version of their preprint.


Only for a short time period is there an apparent climatic niche in the distribution of COVID-19 cases. It would be foolish to assert that that brief window gives us insight into the climatic niche of COVID-19 rather than the volume of passenger transport from China and subsequently other countries with outbreaks as the disease spreads.

Almost 40,000 cases of COVID-19 in Brazil testify to the utter failure of the prediction in Araujo & Naimi (2020) that tropical climates are less favourable for the spread of COVID-19.

For simplicity, I fitted the GAM using February mean temperatures from Chelsa. If you think it would make any difference to use observed temperatures rather than a climatology you are missing the point.

Problems with fitting SDM to invasive species, where the range is not in approximate equilibrium with climate, are well known. Anyone working in the field should be aware of these issues.

2) Microclimate

Temperature and humidity can affect the life-time of droplets and the time that virus particles on surfaces are viable. But the 2m outside air temperature is a poor predictor of the actual conditions where many people are exposed to infection. It’s all about microclimate: many people are inside most of the time, especially when it is cold outside.

Qian et al (2020) examined outbreaks of COVID-19 in China (excluding Hubei) that could be traced to their source. They found only one of 318 outbreaks in was transmitted outdoors. Most outbreaks occurred at home. One proviso about this study, it might be more difficult to trace transmissions that occurred outside, causing an undercount.

3) Ignores other factors

Climate and seasonality probably will play some role in COVID-19 dynamics, as it does, for reasons that are not entirely clear, in influenza, for which several hypotheses have been suggested including the influence of seasonality on human behaviour (inside more in cold weather) and our immune systems. While they may play some role, climate and seasonality are unlikely to be the dominant factors controlling COVID-19 transmission. The initial spread around the world reflects patterns of human movement. The number of infections in a location depend on human behaviour including policy to minimise spread taken successfully (South Korea) or less so (United Kingdom). A standard SDM takes none of these factors into consideration, nor any other component of an epidemiological model.


So certainly simplistic, what about dangerous?

If the effect of climate and seasonality on COVID-19 is small and policy makers are influenced by these preprints to premature relax (or apply belatedly) restrictions on movement and businesses, people will die.


Araujo & Naimi (2020) is not the only preprint using an SDM to try to find a relationship between COVID-19 and climate. Harbert et al (2020) use a somewhat more cautious approach, concluding that while they cannot exclude an influence of climate, it is less important than population size in explaining county-level COVID-19 statistics in the United States. Bariotakis et al (2020) use the full suite of 19 WorldClim bioclimatic variables as predictors in MaxEnt. They find …. Actually, I don’t really care what they find as the premise is nonsense so any result can only be an artefact.

Just because you have a bag of hammers, does not make every problem a nail.

Posted in Uncategorized | Tagged , | 3 Comments

COVID-19, climate and the plague of preprints

Many diseases have geographically variability in prevalence or seasonal variability in epidemics, which may, directly or indirectly, be causally related to climate. Unfortunately, the nature of the relationship with climate is not always clear.

With the recent outbreak of COVID-19, scientists (including a depressingly large number of ecologists who ought to know better) have wasted no time in applying their favourite methods to the available case data and climate data to produced a pack of preprints.

Please let me know if you find any more.

I know ecologists are trying to be useful, but please consider


Most of these preprints argue for reduced transmission at warmer temperatures. If true, this would be good news as spring progresses to summer in the northern hemisphere, allowing time for health services to prepare for increased transmission later in the year.

If false, then these preprints could influence how societies react to the ongoing crisis, perhaps convincing people or policy makers that the risk of transmission of COVID-19 is low. Certainly, there are policy makers who are open to this idea:

“Looks like by April, you know, in theory, when it gets a little warmer, it miraculously goes away,”  Donald Trump 10 February 2020

I haven’t carefully read all the preprints, but for several of those that I have read are not at all persuasive.

I’ve helped write a response to one of these. In some forthcoming posts I will try to reproduce some of the others as I suspect some are already disproved by the new data available since they were published.

But for now, if a preprint assumes that the climate at 60°N 90°E represents the climate of Russia, it is junk.

Posted in Uncategorized | Tagged | 4 Comments

Erroneous information … was given

By coincidence, days after I wrote about the apparently very low midge counts in Hiidenvesi, the authors published a correction.

Erroneous information considering Chironomidae and Chaoboridae accumulation was given in Figure 4 published in Luoto et al. (2017). Therefore, it cannot be stated as in the results (page 5) that the chironomid flux increased towards the present. This data is not interpreted or discussed in the manuscript any further, but as a result from the incorrect accumulation unit, the midge accumulation data cannot be used for reference in other research. The volumetric midge sample size given in the methods (page 3) is the maximum sample size used. Also the midge percentage information in Figure 4 is incorrect. However, the correct percentages were otherwise used throughout the manuscript, including the midge-based reconstructions.


This is figure 4 from the original paper


Apparently, the accumulation rates in figure are wrong (the figure shows concentrations not accumulation rates), the amount of sediment used differed from that reported, and, added almost as an after thought, the midge assemblage data are also wrong.

Fortunately however, the correction assures the reader that all the rest of the paper, with the exception of a sentence about the increasing flux, was based on the correct data. Obviously, it unreasonable to expect the correction to include a figure showing the correct data or an explanation of why the wrong data were published: there is just not enough space on the internet.

Curiously, the description of the stratigraphy on page five of the original paper, which the correction assures the reader is for the the correct data, perfectly matches the incorrect data.

The incorrect data in the stratigraphy are easy to digitise and are entirely consistent with count sums of between 7 and 21 midges. Counts so low that no honest scientist would knowingly try to portray them as counts of at least 50 midges.

The Chaoborus:Chironomidae ratio calculated from the digitised data matches the published curve; the N2 curve can also be reproduced (if the data are first square root transformed). Even the ordination can be perfectly reproduced.


Published ordination


Reproduction of the ordination

What are the odds of an ordination on the incorrect data being identical to an ordination on the correct data?

I cannot reproduce the reconstructions of dissolved oxygen or total phosphorous as the calibration set data have not been archived. (The total phosphorous reconstruction uses a four component WAPLS model which apparently has an R2 of 0.92. Without the raw data, I am disinclined to believe this; it could be another transfer function where the apparent performance is reported as the cross-validated performance.)

While it is good to see the authors taking responsibility for problems in the Hiidenvesi paper, I am not persuaded that the correction is a full account of its problems. The authors need to archive the raw data, explain how the incorrect data came to be published, and how the results that they report to have used the correct data can be reproduced with the incorrect data.

Posted in Peer reviewed literature, transfer function | Tagged , | Leave a comment

Making a pollen diagram from Neotoma

Last week I gave a course on R for palaeoecologists covering data handling using tidyverse, reproducibility and some some ordinations and transfer functions. One of the exercises was to download some pollen data from Neotoma and make a pollen diagram. This is my solution. It uses a lot of tidyverse. Everything could be done with base R, but I find tidyverse more expressive.

#load packages
#devtools::install_github("richardjtelford/ggpalaeo") #NEW VERSION
library("ggpalaeo") #Need new version

I’ve chosen to plot the pollen data from Lake Bambili. First we need to download the data with the neotoma package

#download data
get_dataset(40940)#check correct site
## A dataset_list containing 1 objects:
## Accessed from 2019-10-10 16:41h to 2019-10-10 16:41h.
## Datasets:
## long lat type
## 40940 Bambili 1 10.24383 5.93487 pollen
bambili_raw <- get_download(40940)
## API call was successful. Returned record for Bambili 1
eco_types <- get_table("EcolGroupTypes")

Then we need to extract the components we need from the downloaded data. get_download returns a list with one element per dataset which we can extract with the $ notation.

#extract components
cnts = counts(bambili_raw$`40940`) #table of counts
meta = bambili_raw$`40940`$sample.meta #sample depths/ages
taxa = bambili_raw$`40940`$taxon.list %>% #taxon info
  mutate_all(as.character) #convert everything from a factor to prevent warnings later

I want to add the age/depth information to the counts. Here I am using bind_cols to do this, which assumes the rows are in the same order. The paranoid might want to convert the row names of the counts into a column (rownames_to_column) and then join on Next I use pivot_longer (the replacement of tidyr::gather) to convert the data from a species x samples data.frame to a long format, and join to the taxon information. Note that some datasets have an alias column in taxon.list and you need to join by that instead.

bambili <- meta %>% select(age, depth) %>%
bind_cols(cnts) %>%
pivot_longer(cols = -c("age", "depth"), names_to = "species", values_to = "count") %>%
left_join(taxa, by = c("species" = ""))

Now is a good time to check what we have. It’s useful to have the Neotoma database manual open to check what terms mean.

bambili %>% count(variable.element)
## # A tibble: 3 x 2
## variable.element n
## 1 pollen 36750
## 2 pollen/spore 175
## 3 spore 875

variable.element contains just pollen and spores. It can include a variety of other things (such as stomata) that we might want to filter out.

eco_types %>%
semi_join(bambili, by = c("EcolGroupID" = "")) %>%
select(EcolGroupID, EcolGroup)
## EcolGroupID EcolGroup
## 1 AQVP Aquatic Vascular Plants
## 2 MANG Mangroves
## 3 PALM Palms
## 4 SEED Spermatophyte rank or clade above order
## 5 TRSH Trees and Shrubs
## 6 UNID Unknown and Indeterminable Palynomorphs
## 7 UPHE Upland Herbs
## 8 VACR Terrestrial Vascular Cryptogams

I don’t want to include the AQVP or UNID in the diagram. I’m not sure about SEED.

bambili %>% filter( == "SEED", count > 0) %>%
select(species, count) %>%
group_by(species) %>%
summarise(n = n(), max = max(count))
## # A tibble: 1 x 3
## species n max
## 1 Monocotyledoneae undiff. 33 6

So SEED is an unidentified monocot present at low abundances. I’m going to delete it.

#remove unwanted variable.element/

bambili = bambili %>% 
  filter(! %in% c("AQVP", "UNID", "SEED"))
#use `%in%` not `==`

This is a good time to check the count sums. It might be prudent to delete any samples with very low counts.

#check count sums
bambili %>%
group_by(depth) %>%
summarise(s = sum(count)) %>%
arrange(s) %>%
## # A tibble: 5 x 2
## depth s
## 1 510. 72
## 2 220. 157
## 3 200. 176
## 4 240. 177
## 5 280. 233

Seventy four isn't great, but I'm going to keep it.

Now we can calculate percent and remove the rare taxa

#calculate percent
bambili = bambili %>% 
  group_by(depth) %>% 
  mutate(percent = count/sum(count) * 100) 

#remove rare taxa
bambili1 = bambili %>% 
  group_by(species) %>% 
    sum(percent > 0) >= 3, #must be in at least three samples
    max(percent) > 3) #must have a max percent > 3

Now we can use pivot_wider to reshape the data back into a species x samples data.frame that we can plot with rioja::strat.plot. For convenience, I’m separating the species data from the age/depth data.

bambili2 <- bambili1 %>%
select(age, depth, species, percent) %>%
pivot_wider(names_from = "species", values_from = "percent")

bambili_spp <- bambili2 %>% select(-age, -depth) %>% as strat.plot does not like tibble - pull request to fix this submitted.

Now we can plot the stratigraphy.

strat.plot(bambili_spp, yvar = bambili2$depth)


There are a variety of aspects of this plot that need improving. We need to:

  • plot on constant scale for all taxa
  • reverse y axis so deeper levels are lower
  • arrange the taxa in some logical order
  • rotate the species names and set the figure margins
  • add a cluster diagram and zones
  • add a secondary scale

Some of these can be fixed by setting an argument in strat.plot (there are a lot of arguments – see ?strat.plot), but to reorder the species, we need to reprocess the data.

bambili2 <- bambili1 %>%
#make a factor with TRSH first = factor(, levels = c("TRSH", "UPHE", "VACR")),
mean_percent = mean(percent)) %>%
#arrange by and mean_percent (largest first)
arrange(, desc(mean_percent)) %>%
ungroup() %>%
#make species into a factor so we can perserve the order
mutate(species = factor(species, levels = unique(species)))

#reshape using tidyr::spread as pivot_wider (currently?) ignores factor order
bambili3 <- bambili2 %>%
select(age, depth, species, percent) %>%
spread(key = "species", value = "percent")

bambili_spp <- bambili3 %>%
select(-age, -depth) %>%

#set up for ecological group colours
ecological_groups <- bambili2 %>%
  distinct(species, %>% 
ecological_colours <- c("red", "green", "orange")

And we run a constrained cluster analysis

bambili_dist <- dist(sqrt(bambili_spp/100))#chord distance
clust <- chclust(bambili_dist, method = "coniss")
#bstick(clust)#five groups

Now we can make a better plot.

#set up mgp (see ?par)
mgp <- c(2, 0.25, 0)
par(tcl = -0.15, mgp = mgp)#shorter axis ticks - see ?par
pt <- strat.plot(
d = bambili_spp,
yvar = bambili3$depth,
y.rev = TRUE, #reverse direction of y-axis
scale.percent = TRUE, #use constant scale for all taxa
srt.xlabel = 45, #rotate x-label by 45 degrees
cex.xlabel = 0.8, #smaller font
mgp = mgp,
xRight = 0.98, #right margin
xLeft = 0.21, #left margin with space for 2nd axis
yTop = 0.60, #top margin
yBottom = 0.1, #bottom margin
col.line = ecological_colours[ecological_groups],#colours = ecological_colours[ecological_groups], #colours
ylabel = "Depth cm",
clust = clust

#add zone boundaries
addClustZone(pt, clust = clust, nZone = 5)

#add a secondary scale
secondary_scale(pt, yvar = bambili3$depth, 
                yvar2 = bambili3$age, 
                ylabel2 = "Date yr BP",
                n = 10)


It’s beginning to get there. There are probably too many taxa plotted. Merging the various Polypodiophyta as the names are very long and the ecological interpretation of the different types is unclear. I also want to reduce the space between the y-axis and the ylabel. Unfortunately this is hard coded in strat.plot but it would only take a minute to make it into an argument. I’d also like to add a % sign to to the axis, but I cannot see an argument that will do that (again it shouldn’t be hard to code – adding yet another argument).

I would like to have some text on top describing the ecological groups, but I have the feeling that that would be very difficult to do.

Posted in Data manipulation, R, reproducible research | Tagged | 1 Comment

Count sums – the preprint

I wandered lonely as a cloud
That floats on high o’er vales and hills,
When all at once I saw a crowd,
A host, of golden chironomids;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.

Continuous as the stars that shine
And twinkle on the milky way,
They stretched in never-ending swarm
Along the margin of a bay:
Ten thousand saw I at a glance,
Tossing their heads in sprightly dance.

The waves beside them danced; but they
Out-did the sparkling waves in glee:
A poet could not but be gay,
In such a jocund company:
I gazed—and gazed—but little thought
What wealth the show to me had brought:

For oft, when on my couch I lie
In vacant or in pensive mood,
They flash upon that inward eye
Which is the bliss of solitude;
And then my heart with pleasure fills,
And dances with the chironomids.

Sincere apologies to W. Wordsworth: nothing rhymes with chironomids

But were there really ten thousand? Next time you walk by a lake, try to count the chironomids in a swarm. It is impossible, they just move about too much, and it doesn’t seem to get much easier when they are dead.

Just look at the midge stratigraphy from the eutrophic Hiidenvesi in Finland. The methods section of the paper report that “A minimum count size of 50 individuals was set” in 5 cm3 sediment samples.


Midge stratigraphy from Hiidenvesi including the absolute (black bars) and relative abundances (grey bars) of chironomids and chaoborids. The chaoborid/chironomid ratio indicates the severity of anoxia and N2 diversity index represents the effective number of occurrences

Do these assemblages really look like the minimum count sum is 50?

The lowest sample has about 45% Chaoborus flavicans and just over 10% of five chironomid taxa. This looks suspiciously like a count of nine midges: four Chaoborus  (44.4%) and one each for five chironomid taxa (11.1%). This interpretation is supported by the concentration data on the left of the figure. This sample has 0.8 Chaoborus and 1 chironomid per cubic centimetre of sediment. Since 5 cm3 of sediment were sampled and 5 * (0.8 + 1) = 9, this also implies that the count sum is nine midges. Last time I checked, nine was somewhat less than fifty. The second sample from the top appears to have the largest count sum with 1.2 Chaoborus and 3 chironomids per cubic centimetre of sediment, implying a count sum of 21, less than half that reported. All the analyses based on the chironomid stratigraphy are much more uncertain than the reader would expect and may be biased.

Given the number of examples of papers that appear to have counted fewer, often far fewer, microfossils than reported, I decided to write up and submit a manuscript describing the problem and showing how the count sum can be estimated from percent data. If you don’t have at least infinite patience for the peer review process to finish, you can read a preprint on Paleorxiv. Some of the examples I report in the preprint (none of which have yet featured on this blog) appear to be due to lazy descriptions of the data, some due to assemblages being scheduled for deletion but not actually deleted, and some due to what might be described as falsification of the count sums. I also accidentally found some fabricated data – maybe more on that in another post.

Posted in Peer reviewed literature | Tagged , , | 1 Comment

Count-less chironomids?

Most papers that present microfossil assemblages report (not always accurately) the minimum number of microfossils that were counted in each sample, an important indicator of the precision of the data. Some do not. For these papers, the reader should be able to assume that the counts conform to typical practice for that microfossil. For example, I would expect a pollen or diatom stratigraphy to have counts sums of a few hundred, and a chironomid stratigraphy to have a minimum count sum of fifty (perhaps thirty). If the count sums are materially lower than this I would start to wonder if the omission was deliberate and recall Fanelli’s (2013) definition of misconduct

‘any omission or misrepresentation of the information necessary and sufficient to evaluate the validity and significance of research, at the level appropriate to the context in which the research is communicated’.

The figure below is from a paper that reconstructs July temperature for four intervals in the period ca. 570 BCE to 120 CE (Iron Age–Roman Period) using chironomids from the varved sediments of Lake Silvaplana. The paper does not report the minimum count sum; can it be estimated from the stratigraphy?


(a) Chironomid assemblages, (b) inferred mean July air temperatures and (c) % organic matter. For part (a), percentages for each taxon are represented by black bars. Thin black lines represent the zones created using the ZONE program (Juggins, 1991).

I think the minimum count sum can be estimated. The lowest assemblage is 100% Orthocladius. This could represent one Orthocladius, a hundred, or a thousand, but I would suggest that it is unlikely to be a large count given the chironomid diversity in the lake (in contrast to polar foram assemblages which are commonly 100% Neogloboquadrina pachyderma), would make it unlikely to find just one taxon, even if that taxon is fairly dominant. The next assemblage is 100% Cricotopus, the third and fourth 100% Corynoneura, and the fifth 100 % Procladius: there are two possibilities, either the chironomid community in the lake repeatedly changed abruptly and taphonomic processes did not mix the assemblages, or the count sum for these assemblages is very low, perhaps only one. Fifty five of the 208 assemblages in the stratigraphy are monospecific.

The eight and ninth assemblages are 50% Tanytarsus and 50% another taxon. These are conceivably counts of two chironomids. The eleventh assemblage is more diverse with five taxa each with 20%, perhaps a count of five. Only a few assemblages, mainly near the top, appear to have a count sum of more than ten chironomids.

These are the lowest estimated count sums I have found, even lower than in Lake Muzzano.
Despite these extremely low count sums, a reconstruction of July temperature is made with a transfer function. Perhaps not surprisingly, the reconstruction is very noisy, with abrupt changes of over 10°C. The reconstruction is meaningless junk.

A correction is needed to report the omitted count sums, and if they are mainly in low single figures, I think the chironomid part of the paper should be retracted as no reviewer would ever have recommended publication had they known that.

Posted in Peer reviewed literature, transfer function | Tagged , , | Leave a comment

Tools for a reproducible manuscript

It is all too easy to make a mistake putting a manuscript together: mistranscribe a number, forget to rerun some analyses after the data are revised, or simply bodge the round some numbers. Fortunately it is possible to avoid these problems with some technology that inserts numbers and graphs into the manuscript, keeps them all up to date, and lets you see the calculations behind everything. The technology does not prevent code errors, but by making the analysis transparent, they are hopefully easier to find.

There are numerous guides to writing dynamic documents with Rmarkdown. What I want to do here is show some of the things I needed to write a manuscript with Rmarkdown. There may be better solutions to some of the problems I faced – tell me about them. If you want to see any of this in action, take a look at the GitHub repo for my recently accepted manuscript.

To get started with Rmarkdown, in RStudio, go to File >> New File >> R markdown and you get a sample file to play with. Remember, no spaces in the filename.

Too much code

Rmarkdown lets you have chunks of R code embedded in the text, so that when the document is rendered, the code is evaluated following the chunk’s options  and the results printed. This is great for teaching and simple documents, but when you have several hundred lines of code to clean data it becomes unwieldy to have as one document, and prevents reuse if you want to make a manuscript and a presentation with the same code.

The first solution I found for this is the knitr::read_chunk function which lets you put the code in a separate file. I later found the drake package which lets you put most of the code in the drake_plan and then readd or loadd the generated objects (targets) into rmarkdown. When you make the drake_plan only the targets that are out of date (because of code or data changes) get updated. This nicely solves the problem of how to cache the results of slow analyses (there is a cache built into Rmarkdown but it is a bit fragile).

Article format

If you want to submit your manuscript to a journal as a PDF, things will go a lot smoother if you use the journal’s LaTeX template. I have no idea how to do that either, but the rticles package takes away most of the pain. You make a draft manuscript with

rmarkdown::draft("MyArticle.Rmd", template = "acm_article", package = "rticles")

This generates the YAML, the metadata at the top of the Rmarkdown file between two sets of ---, required. This works well provided you don’t do anything daft such as submit to Wiley, one of the few major publishers without a template in rticles.


Obviously, a manuscript is going to need citations, and formatting citations is a pain. rticles will put a couple of lines in the YAML to say which bibtex file to use and which citation style.

bibliography: chironomid2.bib
csl: elsevier-harvard_rjt.csl

Any decent citation manager will be able to export a bibtex file. I use JabRef.

CSL files for many different journals are available. I had to edit the Elsevier style slightly so that it showed R packages correctly.

In the text, write [@Smith1990] to cite the reference with the key Smith1990 with name and year, or [-@Smith1990] to get just the year. The formatted reference list gets added to the end of the file. The citr addin for Rstudio lets you search through the bibtex file and select references to add to the manuscript.

Line numbers

I hate reviewing manuscripts without line numbers. So I wanted line numbers on my manuscript. This turned out to be easy – Rmarkdown documents are rendered into PDF via LaTeX, so LaTeX commands can be used.

In the YAML, add

  - \usepackage{lineno}
  - \linenumbers

And that’s it.

Cross-referencing figures

One of my favourite parts of writing a manuscript is when I decide to add another figure and need to renumber all the other figures and make sure that the text refers to the correct figure.

Rmarkdown can cross-reference figures (also tables etc). This feature was added relatively recently, and needs bookdown rather than rmarkdown, which can be set in the YAML (NB spacing is important in YAML )

    number_sections: true

In the text, to refer to the figure from the chunk with the label weather-climate-plot, use (Fig. \@ref(fig:weather-climate-plot)). Important, this does not work if you have underscores in the chunk name. This took me an entire evening to work out.

You can also cross-reference figures in the supplementary material. In the header-includes: section of the YAML, add

  - \usepackage{xr}
  - \externaldocument[sm-]{supplementary_data}

Where supplementary_data is the name of the rmarkdown file you want to reference. In the text you can then use Fig. \@ref(sm-fig:zab-dist-analogue). The sm prevents any clash between the labels in the manuscript and the supplementary.

To force the figures in the supplementary to be prepended with an “S”, to the YAML in the supplementary, add

  - \renewcommand{\thefigure}{S\arabic{figure}}

This magic works on the *.tex files made as an intermediate stage of rendering the PDF. You need to keep these by running rmarkdown::render with clean = FALSE, then you can run tinytex::pdflatex

tinytex::pdflatex("manuscript.tex", clean=TRUE)
tinytex::pdflatex("supplementary_data.tex", clean=TRUE)

Package management

One of the great things about R is that the packages are often improved. This is also a problem as the code that worked fine may fail after you update a package. One solution to this is to use packrat to keep track of which packages your project uses. packrat will automatically download all the packages you need to reproduce my manuscript.

I found packrat to be a bit of a pain, and am looking forward to trying renv.

Rmarkdown and git

As rmarkdown files are text files they work well with version control using git and, for example, GitHub. Since git tracks changes in each line of text, it would show a long paragraph as having changed if you add one comma. A trick I learnt somewhere is to break the paragraph up, one sentence per line, so the version control now works at the sentence level. Rmarkdown ignores single line returns in a document, you need two to get a new paragraph.


It is not difficult

If you can write code, and you can write text, you can write a manuscript with Rmarkdown. Your coauthors can probably cope too.

Posted in reproducible research | Tagged | 3 Comments