Mountain detour on the way to Białowieża

I am on my way to Białowieża Forest in eastern Poland to meet with collaborators on the KlimaVeg project and start analysing some of the project data. I have taken a detour to the Tatra Mountains, a compact calcareous range on Poland’s southern border.

I am used to the English Lake District, but the Tatras are much busier. Zakopane (which has night trains from the far corners of Poland), makes Bowness look like a ghost town.

Posted in Uncategorized | Tagged , | Leave a comment

Replicating the Lake Żabińskie reconstruction

Having shown that the archived chironomid data from Lake Żabińskie are strange in several ways and have bad reconstruction diagnostics, I want to see how well I can replicate the August air-temperature reconstuction.

First, I need to build the transfer function. I’m using WAPLS on square-root transformed species data. All taxa with more than one occurrence are included, and the lakes with declared low count sums are omitted.

keep <- colSums(spp > 0) > 1
mod1 <- crossval(WAPLS(sqrt(spp[, keep]), env), cv.method = "bootstrap", nboot = 5000, verbose = FALSE)
knitr::kable(performance(mod1)$crossval[1:3, 1:4], digits = 2)
RMSE R2 Avg.Bias Max.Bias
Comp01 2.68 0.65 -0.06 9.97
Comp02 2.33 0.76 -0.04 6.75
Comp03 2.38 0.76 -0.08 6.27

The bootstrap cross-validation performance is very similar to that reported by the corrigendum (WAPLS component 2: RMSEP = 2.3°C, r^2^ = 0.76).

The reconstruction is similar to that archived but not identical.

wapls.sqrt <- predict(mod1, sqrt(fos))$fit[, "Comp02"]
reconstruction <- data.frame(
chron = chron,
Instrumental = instrumental$temperature,
Archived = recon$temperature,
Replication = wapls.sqrt
reconstruction2 <- gather(reconstruction, key = "Series", value = "Temperature", -chron)
ggplot(reconstruction2, aes(chron, Temperature, colour = Series)) +
geom_line(alpha = 0.5) +
scale_colour_manual(limits = c("Instrumental", "Archived", "Replication"), values = c("black", "red", "blue")) +
labs(x = "Year CE", y = "Temperature °C", colour = "Series")


The mean of the archived and replication reconstructions are very similar (16.92 vs. 16.88), but the variance of my replication is about 20% higher (2.19 vs. 2.67).
One of the most noticable differences is that the archived reconstruction has a value for 1925, whereas the chironomids have data for 1927. This 1925/1927 switch also occurred during the evolution of the fossil data.

The other differences between the archived reconstruction and my replication might be because of different sites included in the calibration set (LT15 omit nine lakes on the basis of a PCA, but its is not clear which lakes these are and whether they are also omitted for the corrigendum) or different species inclusion rules. The bootstrap that LT15 use will also cause variability.

Ideally anybody who has the raw data should be able to replicate the results of any paper exactly. Given the vague description of the methods in LT15 and the corrigendum, the replication is as good as can be expected.

This is probably my last post detailing oddities in the data archived by Larocque-Tobler et al (2015). I think I have done enough to demonstrate that the data have unexpected properties that need explaining. In my next couple of posts in this series, I’ll describe my quest to get that explanation, or how I mistook a yo-yo for a die.

Posted in Peer reviewed literature, transfer function, Uncategorized | Tagged , | Leave a comment

Low calibration set counts in LT15

Larocque-Tobler et al (2015) promised that, for their chironomid calibration set

  At least 50 head capsules (Heiri et al., 2001 and Larocque et al., 2001) were mounted.

A cursory examination of the data shows that this is unlikely to be true: many of the calibration set samples have a minimum relative abundance above the 2% expected for a singleton in a count of 50 heads.

The corrigendum admitted as much

Lakes with low numbers of head capsules were included in the transfer function used to create the published temperature reconstruction… The following lakes have now been deleted for the corrected temperature reconstruction: GOR, KOS, LEK, SAL, SZE, SZOS, TRZ, WAS and ZAB.

If a few samples were a few (i.e. can count on fingers of one hand) chironomids short of the promised fifty, I would not object. In the Polish calibration set, nine of the forty-eight lakes (two of the fifty lakes sampled have no chironomid data), are now reported to have fewer chironomids than promised, but how many fewer?

Since most samples should have at least one singleton, we can estimate the probable count sum from the reciprocal of the relative abundance of the least abundant taxa.

This plot1 shows the minimum percentage and the probable count sum for the Polish calibration set, highlighting the lakes that the corrigendum declared to have low counts.


Two aspects of this figure are striking.

First that some of the probable counts are very low. Lake GOR appears to have a count of just five heads, based on a minimum relative abundance of 20% (the missing 20% might be an unidentified chironomid).

Cricotopus Endochironomus albipennis Microtendipes pedellus
GOR 40 20 20

Second, in addition to the reported nine Polish lakes with counts below fifty head capsules, a further ten have a minimum relative abundance greater than 2%, indicative of a count lower than fifty heads.

What about the Canadian portion of the calibration set? There are 73 Canadian lakes (not 72 as reported in LT15 – the data source, Larocque (2008), also reports 73 lakes). Four of these lakes appear to have a minimum count below fifty.


This possible undercount in a few samples in Larocque (2008) is interesting but will not materially affect the results of that paper.

There are some additional curious aspects to the sums of the relative abundances for each lake in the calibration set. This sum can be below 100% if rare or unidentified taxa have been omitted. Values slightly above 100% are possible with rounding errors. But I struggle to understand how a sum of 105.1% is possible for Polish lake GIL. The sum of 175% for Canadian Lac H is beyond comprehension.

  1. All code used in this post is available on github 
Posted in Peer reviewed literature, Uncategorized | Tagged , | Leave a comment

The missing rare taxa at Żabińskie

“rarity is the attribute of a vast number of species of all classes, in all countries.” Charles Darwin

In any census of any species-rich community, the rarest taxa are likely to be represented by a single individual. This can be visualised with rank abundance curves, shown here for Barro Colorado Island tree counts (jittered for clarity).


In the BCI data, every sample contains many taxa represented by a single individual. Across all samples, 38% of taxon occurrences are of a single individual.

This is not the case for the Lake Żabińskie chironomid counts where 34 of 89 samples lack any taxa represented by a single (or a half) head.


Not only do many samples lack singletons, one lacks any taxa occurring with fewer than five head capsules. There is a curious trend in the proportion of samples without singletons, with a higher prevelance of such samples in the second half of the record.


These are the counts for the two most extreme cases.

pos(fos_counts[minc == 5, ])
##      Chironomini Microtendipes pedellus Cladotanytarsus mancus1
## 1967           7                      6                       6
##      Tanytarsus sp Tanytarsus lactesens Tanytarsus mendax
## 1967            11                    5                 6
pos(fos_counts[minc == 4.5, ])
##      Polypedilum nubeculosum Cladotanytarsus mancus1 Paratanytarsus
## 1970                     4.5                       8             12
##      Corynoneura Cricotopus Nanocladius branchio Orthocladius
## 1970           5          5                  4.5          4.5


How unlikely is it to have so many samples without singletons?

One way to test this is to fit a rank abundance model to the count data and then simulate assemblages from the model. There are a variety of rank abundance models that can be fitted with the vegan package. Here they are fitted to the first BCI sample.

rf <- radfit(BCI[1, ])


Few singletons are expected with the pre-emption model, so I am going to apply this model to the chironomid counts and then simulate assemblages and test if they have singletons.

fos_counts1 <- fos_counts
# promote half heads to full
fos_counts1[fos_counts1 %% 1 == 0.5] <- fos_counts1[fos_counts1 %% 1 == 0.5] + 0.5
#fit model and extract coefficients
alpha1 <- apply(fos_counts1, 1, function(r)  coef(rad.preempt(r)))

simulateMinAbun <- function(alpha, maxRank = 50, n = 10000, J = 30){
#maxRank is number of taxa considered, J is number of individuals, n is number of trials
rank <- 1:maxRank
abun <- J * alpha * (1 - alpha)^(rank - 1)
sims <- rmultinom(n = n, size = J, prob = abun)
table(apply(sims, 2, min0))/n
high20 <- simulateMinAbun(max(alpha1), J = 20)
high30 <- simulateMinAbun(max(alpha1), J = 30)
high70 <- simulateMinAbun(max(alpha1), J = 70)

med30 <- simulateMinAbun(median(alpha1), J = 30)
phigh <- pbinom(q = sum(minc > 1), size = nrow(fos_counts), 1 - high30[1], lower.tail = FALSE)
pmed <- pbinom(q = sum(minc >  1), size = nrow(fos_counts), 1 - med30[1], lower.tail = FALSE)

For the sample with the largest coefficient for the pre-emption model (i.e. the steepest slope of the rank-abundance relationship), the probability of the count having at least one singleton is 0.91. This result is not sensitive to the number of heads counted (at least over the range reported for the chironomid data).

With the median pre-emption model coefficient, the probability of the count having at least one singleton is 0.9984.

Even in the most generous case, the probability of having the observed number of samples without singletons is 5 x 10-14. With the median case, the probability is 9 x 10-74.

This calculation does not account for the taxon inclusion rules (which were not adhered to) which would have removed taxa occurring in fewer than three samples. This would adjust the probability by several orders of magnitude. However, the calculation does not consider that in some samples neither singletons nor doubletons occur, which would adjust the probability by orders of magnitude in the other direction.

Whatever adjustments you want to make, the lack of rare taxa in the chironomid data is remarkable.

In the first version of the dataset, samples apparently missing singletons (identified by the minimum percent being above 2%) tended to have all their values as integer multiples of the minimum percent. This strongly suggested (as is now admitted) that the counts were not fifty as claimed, but much lower for some samples. Since then, the data have “evolved” and this integer multiple only holds for a few samples.

It is not obvious why there are so many samples without singletons. Perhaps the taxon inclusion rule (“3 percent in at least 3 samples”) was misapplied, smiting singletons in some levels only until the observed pattern emerged. At the calculated probabilities, almost anything is more likely than the data being correct.

It is trivial to infer that the currently archived data are definitely not the original data. The original data (all 76 taxa) needs to be archived.

Posted in Peer reviewed literature, Uncategorized | Tagged , | 1 Comment

ggplot2 maps with inset

Lots of students want maps showing their field sites for their thesis. Several have come to me with code they found on the internet but couldn’t get to work. The problem is that the plotting package ggplot2 has evolved since that code was written and some things no longer work (for example opts instead of theme), ggplot2 now supports map projections and Norway is more complicated than Canada in that it has remote territories.

This post updates the code from code to make a map with an inset showing the location against a regional map.

I’m going to plot the seedClim locations.

Load the required packages and some data.


dat siteID           latitude longitude Temperature Precipitation
Alrust           60.8203    8.70466           2             1
Arhelleren       60.6652    6.33738           3             3
Fauske           61.0355    9.07876           3             1
Gudmedalen       60.8328    7.17561           1             3
Hogsete          60.876     7.17666           2             2
Lavisdalen       60.8231    7.27596           1             2
Ovstedal         60.6901    5.96487           3             4
Rambera          61.0866    6.63028           2             3
Skjellingahaugen 60.9335    6.41504           1             4
Ulvhaugen        61.0243    8.12343           1             1
Veskre           60.5445    6.51468           2             4
Vikesland        60.8803    7.16982           3             2
dat$Temperature dat$Precipitation precipLab tempLab ```

Find the latitude and longitude ranges required for the high resolution map. Here I've added a buffer round the locations.

xlim ylim ```

### Get the map data

# Low resolution map of Norway for the inset map

# High resolution map of Norway for the main
norwaymapHires ```
Unfortunately, even the high resolution map is not high-resolution enough - lots of important islands off the Hordaland coast are missing, for example Askøy near Bergen. I need to find a higher resolution map.

### Make the maps

The inset map showing all of Norway

a geom_map(data = norwaymap, aes(x = long, y = lat, map_id = region),
map = norwaymap, colour = NA, fill = "grey60") +
geom_rect(data = data.frame(),
aes(xmin = xlim[1], xmax = xlim[2], ymin = ylim[1], ymax = ylim[2]),
colour = "red", fill = NA) +
coord_map(xlim = c(3, 33), ylim = c(57, 72)) +
labs(x = NULL, y = NULL)

#print(a)# print this map until you are happy with it.

geom_map draws the map.

geom_rect shows the limits of the high-resolution map.

coord_map projects the map. Any of the projections in the mapproj package can be used – there are probably better ones than the default Mercator projection. I had to set the x and y limits to exclude Svalbard, Jan Mayen and Bouvetøya (a subantarctic island).

Zoomed in map, with the site locations

b shape = Temperature, fill = Precipitation)) +
geom_map(aes(x = long, y = lat, map_id = region), data = norwaymapHires, map = norwaymapHires, colour = NA, fill = "grey60", inherit.aes = FALSE) +
geom_point(size = 3) +
coord_map(xlim = xlim, ylim = ylim) +
labs(x = NULL, y = NULL) +
scale_shape_manual(breaks = 1:3, labels = tempLab, values = c(24, 22, 25)) +
guides(shape = guide_legend(override.aes = list(fill = "black")))+
scale_colour_brewer(breaks = 1:4, labels = precipLab, palette = "RdBu") +
scale_fill_brewer(breaks = 1:4, labels = precipLab, palette = "RdBu")

I’m using various scale_?_??? commands to set the shape and colour of the points. I need to use the guides function as otherwise the temperature legend has empty rather than filled shapes.

A theme

maptheme axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.border = element_rect(fill = NA, colour = "black"),
panel.background = element_blank()

This theme can be added to the plots. It sets a dark rectangle round the map and removes the grey background and axis annotation. An alternative would be ggthemes::theme_map() but it moves the legend.

Printing the maps

Print the two maps together, one an inset of the other using the grid package for flexible alignment of ggplot graphs.

vp_b vp_a print(b + maptheme, vp = vp_b)
print(a + maptheme, vp = vp_a)


Posted in R | Tagged | 7 Comments


It is conventional to write small integers as words rather than figures in text, especially if they are at the start of the sentence. This caused me some grief with rmarkdown, which I have started using for presentations, papers and blog-posts: my code gave me figures, I wanted words. Then I found the english package.

## [1] forty two
## [1] minus one

We need to be careful if we use this inline with

The Ultimate Answer to Life, The Universe and
Everything is...`r as.english(42)`

as this will be converted back to a figure.

The Ultimate Answer to Life, The Universe and Everything is…42.

Wrapping the command in as.character will fix this

`r as.character(as.english(2))` to the power of
`r as.character(as.english(276709))` to
`r as.character(as.english(1))` against

two to the power of two hundred and seventy six thousand seven hundred and nine to one against

It would be a cunning plan to make a function to run as.character(as.english(x)) if it is needed repeatedly.

Of course, sometimes we will want to start a sentence with a number and need the first letter capitalising. This can be done with regular expressions (otherwise known as magic).

`r gsub("(^)(.)", "\\1\\U\\2", as.english(42), perl = TRUE)`

Forty two

As was hinted above by the conversion back to figures, objects made by as.english retain their numeric values, and so can be used in calculations

char.english <- function(x)as.character(as.english(x))
num <- lapply(0:9, as.english)
names(num) <- sapply(0:9, char.english)# set names
attach(num) #make available

six * nine


## [1] fifty four
Posted in R | Tagged | 1 Comment

The importance of version control

Perhaps the most contentious scientific argument of the early twenty-first century concerns whether “data” is plural or singular. To accidentally say “data is” at a prestigious conference is to lob a metaphorical lettuce into the traditionalist audience. Submit a manuscript with “data are” and expect scathing reviews impugning your pet goat’s morels1 from an avant-garde grammatician. Fortunately, there is a consensus on one topic (no mere 97% here), the chironomid data from Lake Żabińskie are plural.

Version One – January 2015

The published version of the chironomid stratigraphy which has not been archived and is now acknowledged by the corrigendum to be incorrect. None of the archived files match it.

Version Two – January 2015

The version of the data archived at in January differed substantially from the published stratigraphy and contained patterns best described as implausible.

Version Three – June 2015

After enquiring about the discrepancies between the archived data and the published stratigraphy, I was emailed a C2 file with a new version of the data. It has five fewer taxa than V2. These are mostly rare, except for Parochlus which occurred eight times in V2 (four times if the unexpected duplicate assemblages are ignored). Fifty-three of eighty-four assemblages are identical. Some other assemblages have large differences, for example 1931 where most of the Criotopus is replaced by Nanocladius branchio.

Version Four – July 2015

This version was uploaded onto in mid-July. It has the same taxa as V3 but gained samples for 1970, 1969, 1966, 1955 and 1941 and has a sample for 1927 instead of 1925. Only four samples have identical assemblages in both versions: 1986 has complete taxonomic turnover!

Version Five – August 2015

V5 was uploaded onto in August. Seventeen samples have different compositions (more than just rounding errors), for example in the sample from 2002, Paratanytarsus had an relative abundance of 27% in V4 but 0% in V5. It is partially replaced by Tanytarsus sp (14%), while all other taxa also increase in relative abundance. V5 is the first version to show the count sums. As expected, they are generally much lower (median 32.5) than the 50 promised by the paper.

Version Six – September 2015

Another month, another version of the Lake Żabińskie chironomid stratigraphy. This version is mostly the same as V5 rounded to one decimal place, but in 1931 Cricotopus loses 2% and Limnophyes gains 2%.

Version Seven – October 2015

V7 is not rounded to one decimal place as V6 was, nor is it identical to V5. The largest change from V6 is in the sample from 1940, in which 12% Criotopus appears, with all other taxa losing one or two percent.
V7 also includes what are purported to be the raw counts. They have non-integer (or half integer) values which are impossible. I suspect the counts are back-calculated from percent data and then the percentages recalculated.

Version Eight – October 2015

V8 only includes count data which are identical to V7 counts to four decimal places!

Version Nine – November 2015

V9 was uploaded onto NOAA in November. The count data in the excel file are identical to the counts in V7 and V8, those in the text file have been rounded. The percent data are V7 rounded to one decimal place.


Different versions of the chironomid data for 1940

Version ten?

It is not clear that the version of the data described in the corrigendum is the same as V9. The corrigendum details the taxon inclusion rules

Taxa not included in the transfer function or not found in 3 percent in at least 3 samples were removed. In consequence, the number of taxa was reduced from 76 to 50.

Figure 1 caption

Four taxa (Brillia, Eukiefferiella, Krenopelopia and Stictochironomus) present in only 3 samples are not displayed.

However, in V9 (which does have 50 taxa), these four taxa have only one or two occurrences.

Of course, I understand that most people have multiple versions of their data: the initial low resolution counts; taxonomic revisions; corrections of transcribing errors. But this should all be sorted out before publication. Post-publication it should be easy to identify the correct file. To archive an incorrect version once is unfortunate; to archive six different versions (none of which match the published stratigraphy) is beginning to look like carelessness.

Note, all the archived data versions2 have file creation dates that post-date publication.

  1. Admit it, if the morels weren’t fly-blown, your amoral goat would have eaten them (and then would be amorel). 
  2. Data and code are archived on github 
Posted in Peer reviewed literature | Tagged , | 3 Comments