Citing R and packages automatically

Almost every manuscript I write has a paragraph that looks something
like this:

All analyses were done in R version 3.4.4 (R Core Team 2017). Ordinations were fitted with vegan version 2.4-6 (Oksanen et al. 2017) with square-root transformed assemblage data. Transfer functions were fitted with rioja version 0.9-15.1 (Juggins 2017) using square-root transformed species data. Some diagnostics were performed using analogue version 0.17-0 (Simpson and Oksanen 2016).

Keeping this paragraph up to date as new versions of R are installed and
packages get updated is a pain and apt to go wrong. For example, there
are nine citations for ‘R core development team
on Google Scholar even though the citation changed to ‘R core team’ in

So this is my solution. I am using a couple of functions,
getRversion() which returns the version of R used and packageVersion which returns
the package version used. My Rmarkdown file looks like this

All analyses were done in R version `r getRversion()` [@R]. Ordinations were fitted with vegan version `r packageVersion(“vegan”)` [@vegan] with square-root transformed assemblage data.

Rmarkdown will replace [@R] with the entry in my bibtex file with the
key R, but first I need to add the package citations to my
bibliography. I’ve written a wrapper for bibtex::write.bib to do this.

package_citations <- function(packages, old_bib, new_bib){

  #copy original bib file
  fs::file_copy(old_bib, new_bib)

  #R citation
  Rcite = citation()
  Rcite$key = "R"
  bibtex::write.bib(Rcite, new_bib, append = TRUE)

  #package citation
  bibtex::write.bib(packages, new_bib, append = TRUE)

This function makes a copy of an existing bibtex file and adds citations
for R (with the key set to R) and packages. To use it, I run

packages = c("vegan", "rioja", "analogue"),
old_bib = "Rmd/extra/chironomid.bib",
new_bib = "Rmd/extra/chironomid2.bib")

The new bibtex file is now ready for use – just specify it in the YAML at the top of the Rmarkdown file.

One small complication is that some packages have multiple citations.
These are identified by an integer after the package name in the bibtex
key (e.g. analogue1, analogue2) and you need to inspect the bibtex file
to work out which you want.

Posted in R, reproducible research | Tagged | 2 Comments

Chironomids are cunning wee beasties

Since I had examined almost every aspect of the data from the remarkable Lake Żabińskie chironomid-inferred August air temperature reconstruction, some time last summer I thought that I would, for completeness, have a quick look at the instrumental temperature data. I was not expecting to find any problems.

In the original paper, the authors use a homogenised composite of data from 10 stations near Lake Żabińskie with “longer temporal series”.

Map of weather stations used by the authors

Authors’ map of the ten weather stations used in their composite temperature series. Not really sure why Giżycko (1936-1938) was included.

I haven’t done anywhere near as thorough a job: I’ve simply downloaded some long GHCN series from the KNMI Climate Explorer, and made a composite of their August temperatures after subtracting the 1951-1980 mean. I selected records from Warsaw, Vilnius, Kanaus, and Kalingrad. These are all farther away (up to ~230km) than the stations used by the authors, but given the large spatial scale of temperature anomalies, I don’t think this is a major problem.

For the period after 1939, where the chironomid-inferred August air temperature reconstruction has an annual resolution, my composite corresponds well with the authors’ (Pearson correlation r = 0.98). I’m happy with that.


Prior to 1939, the relationship between my composite and the authors’ looks much worse. In this period, chironomids were insufficiently abundant for an annual resolution reconstruction, so chironomids from two to five years are aggregated for the reconstruction. From looking at the temperature series, it would appear that the authors have not aggregated their temperature data to match the resolution of the chironomid data, but have instead thinned the data by using the temperature of a selected nominal year. So, for example, the chironomid sample from 1896-1898 is compared with the temperature from 1896 rather than the mean for 1896-1898.

In this next figure (showing just the pre-1939 data), my composite has been processed to match the resolution of the authors’ by either thinning to take just the first year, or taking the mean over the relevant years.


The pre-1939 correlation between the authors’ composite and mine is much higher (r = 0.91) if I thin the data than if I take the mean over the relevant interval (r = 0.57).

I wanted to confirm this finding with the authors’ original instrumental data. These should have been archived as the paper was published on the condition that all data necessary to reproduce the result would be made available. The authors refused to send me the instrumental data (though they did send some other data I requested). Fortunately, I realised that Sowieck on the map above is archived in GHCN as Sovetsk, which allowed me to check my analysis. The authors later confirmed that their temperature series was miscalculated and I acquired their composite data.

Mistakes happen; I don’t want to make a big deal about miscalculated instrumental data. The relationship between the instrumental data and the chironomid reconstruction is more interesting. One might expect the correlation between the reconstruction and the instrumental record to be weak before 1939 because of the miscalculation.

Not so. Using the authors’ data, the pre-1939 correlation between the reconstruction and the incorrectly calculated temperature is high (r = 0.81, p = 0.00009), comparable with the later data (r = 0.78), whereas the correlation with the correctly calculated temperature is much lower (r = 0.53, p = 0.03).

Those chironomids are cunning wee beasties, able to give a reconstruction that correlates even with incorrectly calculated temperatures.

Curiously, the correlation of the entire reconstruction with the original temperature series, 0.79, is higher than that reported in the paper (r = 0.74) or the corrigendum (r = 0.73).

Posted in Peer reviewed literature, transfer function | Tagged , | 1 Comment

Low chironomid counts from Lake Muzzano

I am aware of at least 10 papers authored by Dr Larocque-Tobler where there is good evidence that the chironomid count sum has been misreported (a corrigendum to one of these papers admits that a count sum of 19 was misreported as being at least 50). So naturally, when I read Larocque-Tobler & Pla-Rabès (2015), which reports a chironomid and diatom data from Lake Muzzano, I was interested in the chironomid count sums.

Unfortunately, the paper does not report the chironomid counts sums, but instead states that

Sixty-two samples were analyzed for chironomids, but many had few head capsules and were merged together.

When the count sums are not reported, I think it should be reasonable to assume that they conform to the typical count sizes for the microfossil group. For chironomids, the minimum count sum is typically 50 head capsules, although some authors use a minimum of 30. I think it would be poor practice to have count sums below 30 without reporting it.

Since no data are archived for the Lake Muzzano study, we need to try to infer the count sums from the stratigraphic diagram.


Chironomid stratigraphy (%) from Lake Muzzano.

Let’s take a look at a few of the samples.

The top sample has six equally abundant taxa with just over 10% and a seventh with over 30%. I think this probably represent a count of nine head capsules.

Let’s go down the core into the yellow zone:

  • The first yellow sample has four taxa at 25% giving a probable count of four head capsules.
  • The second sample has two species at 50%, suggesting a count of two.
  • The third has three taxa at 20% and a fourth at 40% suggesting a count of five. This is the sample that appears to have the largest count sum in this zone.

It is of course possible that I have underestimated the count sums and that the two taxa at 50% represent a count of four rather than two, but it is very unlikely that this sample has a count sum of greater than, say, 30. I think at least a third of the samples probably have count sums below 10 and perhaps only one or two of the count sums are greater than 20.

While it is easy to show that the count sums are probably very small, does it matter? I think it does. For example, the discussion of Smittia is probably based on only one head capsule and cannot be expected to be robust (the discussion of Limnophyes is simply wrong).

I think that the count sums in the Lake Muzzano study are so small that their size should have been reported (this is good practice in any case) so that the reviewers and readers can make an informed opinion about the signal and noise in the data. That readers would probably have an unfavourable opinion about such small count sums is no reason to omit this information.

Posted in Peer reviewed literature | Tagged , | Leave a comment

Merging taxa in assemblage data

One possible reason for the impossible percent values I’ve found in assemblages data is that taxa have been merged in Excel after percent were calculated. Doing anything in Excel is to invite disaster, if nothing else, it is very difficult to check what has been done.

Merging and renaming taxa is an almost inevitable step in the workflow for processing community and assemblages data. We need a reproducible method: here I show how can it be done with R.

I’m going to assume that the assemblage data are in wide format (one column per taxa) and that there are meta data (depths, ages etc) in one or more columns. If the meta data are in the rownames (which is very convenient for the ‘rioja’ and ‘vegan’ packages, less so for ‘dplyr’ as tibbles don’t have rownames), they can be moved into a column with rownames_to_column.

Here is a small artificial assemblage dataset.

spp <- data_frame(
depth_cm = 1:3,
sp_A = rpois(3, 5),
sp_b = rpois(3, 5),
sp.C = rpois(3, 5),
sp_D = rpois(3, 5))
spp_save <- spp # keep copy for later
## # A tibble: 3 x 5
## depth_cm sp_A sp_b sp.C sp_D
## 1 1 4 8 9 2
## 2 2 4 3 6 3
## 3 3 5 8 6 3

If we just want to rename a couple of taxa, the simplest solution is to use rename, where we set new_name = old_name. rename can take pairs of new and old names, separated by commas.

spp %>% rename(sp_B = sp_b, sp_C = sp.C)
## # A tibble: 3 x 5
## depth_cm sp_A sp_B sp_C sp_D
## 1 1 4 8 9 2
## 2 2 4 3 6 3
## 3 3 5 8 6 3

If there are many names that need altering, or we need to make the same changes to multiple data.frames, we need a different solution as rename gets tedious.

I like to make a data.frame of the old and new names and then use plyr::mapvalues to change the old into the new names. (plyr is a useful package but has several conflicts with dplyr so it is safer to use the :: notation than loading it).

changes <- read.csv(stringsAsFactors = FALSE, text =
"old, new
sp_b, sp_B
sp.C, sp_C", strip.white = TRUE)#this can go in an separate file

names(spp) <- plyr::mapvalues(names(spp), from = changes$old, to = changes$new)
## # A tibble: 3 x 5
## depth_cm sp_A sp_B sp_C sp_D
## 1 1 4 8 9 2
## 2 2 4 3 6 3
## 3 3 5 8 6 3

Merging taxa is possible in the wide format, but much easier in a thin format. We can convert from a wide format to a thin format with gather, and back with spread.

spp <- spp_save#original version

spp_thin <- spp %>% gather(key = taxa, value = count, -depth_cm)#don't gather depth_cm
## # A tibble: 12 x 3
## depth_cm taxa count
## 1 1 sp_A 4
## 2 2 sp_A 4
## 3 3 sp_A 5
## 4 1 sp_b 8
## 5 2 sp_b 3
## 6 3 sp_b 8
## 7 1 sp.C 9
## 8 2 sp.C 6
## 9 3 sp.C 6
## 10 1 sp_D 2
## 11 2 sp_D 3
## 12 3 sp_D 3

If there are just a few taxa that need merging, we can use recode within mutate followed by summarise. Note that in contrast with rename, recode expects “old_name” = “new_name”

spp_thin %>%
mutate(taxa = recode(taxa, "sp.C" = "sp_D")) %>%
group_by(depth_cm, taxa) %>%
summarise(count = sum(count)) %>%
spread(key = taxa, value = count)
## # A tibble: 3 x 4
## # Groups: depth_cm [3]
## depth_cm sp_A sp_b sp_D
## *
## 1 1 4 8 11
## 2 2 4 3 9
## 3 3 5 8 9

If there are many taxa that need merging (or some that need merging and some renaming) we can use mapvalues again.

changes <- read.csv(stringsAsFactors = FALSE, text =
"old, new
sp_b, sp_B
sp.C, sp_D", strip.white = TRUE)#this can go in an separate file

spp_thin %>%
mutate(taxa = plyr::mapvalues(taxa, from = changes$old, to = changes$new)) %>%
group_by(depth_cm, taxa) %>%
summarise(count = sum(count)) %>%
spread(key = taxa, value = count)
## # A tibble: 3 x 4
## # Groups: depth_cm [3]
## depth_cm sp_A sp_B sp_D
## *
## 1 1 4 8 11
## 2 2 4 3 9
## 3 3 5 8 9

This can also be done with a left_join.

spp2 <- spp_thin %>%
left_join(changes, by = c("taxa" = "old")) %>%
mutate(taxa = coalesce(new, taxa)) %>% #takes original name if no new one.
select(-new) %>%
group_by(depth_cm, taxa) %>%
summarise(count = sum(count)) %>%
spread(key = taxa, value = count)
## # A tibble: 3 x 4
## # Groups: depth_cm [3]
## depth_cm sp_A sp_B sp_D
## *
## 1 1 4 8 11
## 2 2 4 3 9
## 3 3 5 8 9

Now the data are ready for further analysis – remember some functions will want you to remove the meta_data first. For example

cca(select(spp2, -depth_cm))
Posted in Data manipulation, R | Leave a comment

A mean wind blows over Lake Żabińskie

I have largely neglected the chrysophyte-inferred reconstructions of winter severity and summer calcium concentrations/zonal wind speed from Lake Żabińskie even though they fall within the scope of my review of sub-decadal resolution reconstructions. This is not because I think this pair of reconstructions are jointly credible, but because no data have been archived despite several requests to the authors (and no useful calibration set data can be scraped from the publications).

Having shown that the correlation between instrumental temperature records and the pollen reconstruction from Mauntschas is no better than that expected because of the autocorrelation and multiple testing, I wondered if the same could be true for the calcium/wind reconstruction.

Hernández-Almeida et al (2015) report that a weighted averaging transfer function using their 50-lake calibration set has a bootstrap R2 of 0.68 and a RMSEP of 0.143 (log10 units). This is reasonably good, but the distribution of sites along the gradient is very uneven which will bias the RMSEP.


Figure 4. Plots of the a observed versus predicted log10 Ca2+ and b observed versus residual log10 Ca2+ based on WA regression with classical deshrinking. The black dot represents Lake Żabińskie.

I’m not totally convinced that the relationship between the estimated and observed calcium concentration in figure 4 of the paper is cross-validated, which then makes me wonder if the performance statistics are cross-validated or not. With the data this would be trivial to confirm. Without — doubts multiply.

The authors smooth the reconstruction from Lake Żabińskie with a three-year triangular filter and compare it to different combinations of months and allow for a lagged response using a procedure developed earlier which generated 144 climate time series.

I wanted to test if this multiple testing on autocorrelated data could generate the reported correlation of 0.5 between the reconstruction and May-October wind.

Even though the assemblage data are not yet available, I can test several ideas with just the wind data. The wind data come from the Twentieth Century Reanalysis Project. Other than reporting that the 1000 hPa level is used, no information is given about how the data were processed. I’ve assumed that the closest grid point to Lake Żabińskie has been used (22°E, 54°N) and downloaded the data and calculated the monthly means.

I had assumed that getting replicating the wind speed figure would be the easy part. Alas no. My wind speed data look nothing like the published curve. Then I realise that although the text discusses wind speed, it appears that the authors are actually using wind velocity, specifically the zonal (west-east) velocity. This would account for the otherwise impossible negative “speeds”. However, my mean velocity curve also looks nothing like the published curve, but at least it has a similar mean and variance.


May-October mean velocity with 3-yr triangular filter.



Figure 5. Bootstrapped estimates of log10 Ca2+ reconstructed from the sedimentary chrysophyte cyst assemblages at Lake Żabińskie based on the WA prediction model for the period AD 1898–2010. Log10 Ca2+ was highest correlated with the mean May–October zonal winds speed (m/s; 1,000 hPa) (grey line).

Using mean velocity rather than mean speed then the paper presents some huge problems. A May-October mean velocity of 0 m/s could indicate flat calm conditions all summer or that easterly gales blew for half the summer and westerly gales blew for the remainder. The impact of these two scenarios on lake mixing are quite different.

Even if, as the paper suggests, westerly winds have much more impact on the lake than winds from other directions because of the catchment’s morphology,  it is unphysical to expect mean wind velocity to be a useful predictor. As it stands, the result suggests that westerly winds mix the lake and easterly winds unmix it. Mean wind-speed, weighted by the lake’s exposure in the direction of the wind, would be a much more physically relevant predictor.

The paper appears to be a fishing trip, searching for the best of many predictors regardless of their physical plausibility. This is not a recipe for reproducible research. Had mean velocity from some combination of months not given a good performance, would the authors realised that mean velocity was a daft idea and repeated the analysis with wind speed? And if that too failed, would they have used mean squared wind speed, to better correspond with the wind’s mixing potential?

Posted in Peer reviewed literature, transfer function | Tagged , | Leave a comment

Pollen from the garden of forking paths

Most transfer functions for reconstructing past environmental changes are based on a calibration-in-space approach, with a modern calibration set of paired microfossil assemblages and environmental data. The alternative approach is calibration-in-time, with well-dated fossil assemblages and contemporaneous environmental data.

I’ve previously shown that the three chironomid calibration-in-time transfer functions all misreport the performance statistics. All report the apparent performance as if it was the cross-validated performance. But what of the single calibration-in-time model developed for pollen assemblages that I am aware of (please let me know of other calibration-in-time models for microfossil assemblages)?

Kamenik et al (2008) report a calibration-in-time model for the pollen stratigraphy from the mire Mauntschas in the SW Swiss Alps, dated with 29 14C dates and other chronological information. The paper reports that the r2 between predicted and modelled April-November temperature is 0.44. In ideal circumstances this would represent a very respectable performance. However, I have a couple of concerns about the analysis in this paper.

Firstly, the pollen and climate data are smoothed with a three-year triangular filter prior to any analysis. This will induce temporal autocorrelation into the data and thereby violate the assumptions of the statistical methods used.

Secondly, a large number of choices are made on the route to the selected model. Choices include:

  • The climate variable reconstructed (mean temperature or precipitation over 1-12 months with a lag of 1-11 months – a total of 288 responses). May to August air temperature is the best predictor in an RDA, but April-November air temperature is reconstructed because it reduced the transfer function error.
  • The predictor variables (6 out of 11 pollen taxa are used in the final model)
  • The transformation of the pollen data (accumulation rate or percent, detrended or not)
  • The statistical model (ordinary least squares regressions (OLSR), time series regressions (TSR), ridge regression (RidgeR), principal components regressions (PCR) and partial least squares regressions (PLSR))

The impact of these choices – forks in Andrew Gelman’s garden of forking paths – will be to inflate the estimate of model performance.

It is possible to explore the impact of some of these choices by simulation which can help gauge how impressed we should be with an r2 of 0.44. To simplify the simulation, I’m only going to investigate the importance of the induced autocorrelation and the selection of pollen taxa.

I simulated 49 years of climate data and pollen data with a Gaussian distribution 10,000 times for each case below (code on github).

The upper panel in the figure below shows the distribution of leave-one-out cross-validation r2 of OLS models fitted to six simulated pollen spectra (data from two years are removed to make the data set comparable with the smoothed data in the next step). The reported r2 is far above the 95th percentile of the distribution.

The middle panel shows the distribution of r2 when the six pollen spectra and the climate variable are smoothed with a three-year triangular filter, as used by the authors. The 95th percentile of the distribution has moved towards the reported r2.

In the lower panel, I show the distribution of the r2 of the OSL when the best subset (1–11 variables) of smoothed pollen spectra is chosen by BIC. The 95th percentile of the distribution has moved beyond the reported r2. No claim to statistical significance can be supported.


Distribution of r2 values from 10000 simulations with different treatments. The red line shows the reported r2.

Allowing for model selection, choice of data transformations, and selection of the climate variable, reconstruction would move the simulated r2 even further to the right.

None of the choices made by the authors are bad. All can be defended, except perhaps the inclusion of autumn temperatures in their climate mean when most plants have flowered months earlier. The problem is that the authors have sought the path which yields the best performance. Pollen data from an another core from the same mire would be unlikely to give the same performance with the selected model.

As with the three chironomid calibration-in-time models, the Mauntschas Mire model cannot be relied upon. A critical difference though is that the performance of the chironomid models is misrepresented.

Posted in Peer reviewed literature, transfer function | Tagged , , | 1 Comment

The third chironomid calibration-in-time reconstruction

Most reconstructions of past environmental conditions derived from chironomid assemblages with transfer functions use a calibration-in-space approach, in which the calibration is a set of modern chironomid assemblages paired with modern environmental data. I am aware of three chironomid calibration-in-time reconstruction. Here the calibration set is well dates chironomid assemblages paired with historic environmental data.

I have already shown that the calibration-in-time reconstructions from Seebergsee and Silvaplana report the apparent performance rather than the cross-validated performance they claim to report, and that the actual skill is near zero. But what about the third reconstruction from Nurmijärvi.

Nurmijärvi is a varved lake in southern Finland from which the authors counted the chironomids at 1-11 year resolution over the instrumental period which began in the 1830s and reconstruct July air temperatures.

The paper reports that the calibration-in-time transfer function has an r2jack of 0.64 (Table 1). This is a fairly good performance given the limited temperature range in the calibration set. The paper also reports the correlation between the reconstructed and instrumental temperature as 0.51 (Table 2, Figure 4), which is equivalent to an r2 of 0.26. With a calibration-in-time model, the model r2 should be identical to the r2 between the reconstruction and the environmental data. It is not clear why they are not identical in this case.

Since the authors have not replied to emails enquiring about this discrepancy, nor to requests for the raw data, I digitised the stratigraphic diagram and the temperature data. There are of course inaccuracies in the digitised data, and I have only the 38 common taxa (the stratigraphic diagram omits 21 rare taxa).

I find that the model the authors use, a tolerance-weighted weighted-averaging model with inverse deshrinking, has an r2 of 0.57. Not too far off the 0.64 reported. But this is the apparent statistic – the leave-one-out cross-validated r2 is only 0.13.


Predicted vs measured July temperatures for the Nurmijärvi calibration-in-time transfer function

A model with an r2 of 0.13 has very limited utility, and this is before we consider the impact of autocorrelation etc.

It appears to me that the authors are, like the authors of the Silvaplana and Seebergsee papers, reporting the apparent rather than the cross-validation performance. If the authors believe that imperfections in the digitised data are responsible for the lack of reproducibility, they only need to send me the data and I will update this post.


Posted in Peer reviewed literature, transfer function | Tagged , , | Leave a comment