## Finding singletons in real data

I have argued that the rarest species in an assemblage should typically be represented by a single individual, and that this can be used to estimate the count sum when this is not reported.

Now I want to test this with some real data. Starting with tree-count data from Barro Colorado Island which is available from within the `vegan` package. How many species in each of the 50 samples are represented by a single individual?

```library("tidyverse")
data("BCI", package = "vegan")

data_frame(n = rowSums(BCI == 1)) %>%
ggplot(aes(x = n)) +
geom_bar(width = 1) +
labs(x = "Number of singletons") +
xlim(21, NA)
```

Number of singletons per sample in the BCI data-set

All of the BCI samples have many singletons: the probability of finding a sample without any singletons is obviously very small. But perhaps some attributes of the relatively large (median count is 428 trees) species rich (median richness is 91 species) samples are not relevant to chironomid or diatom samples.

What happens when the count sum is much smaller? Here, I resample with replacement each BCI sample to a count of fifty trees. This is done one thousand times.

```BCI_50 <- BCI %>%
rowwise() %>%
do(
as_data_frame(
t(rmultinom(n = 1000, size = 50, prob = .))
)
)

n_50 <- data_frame(n = rowSums(BCI_50 == 1)) n_50 %>%
ggplot(aes(x = n)) +
geom_bar(width = 1) +
labs(x = "Number of singletons")
```

Number of singletons per sample in the BCI data-set when the count sum is reduced to 50 trees.

The minimumum number of singletons in the resampled data is now 3, and 99.97% of samples have more that 5 singletons. Even with a count of only fifty, samples without singletons are going to be rare.

How about if the taxonomic resolution is reduced by reanalysing the data by genus rather than species? That turns out not to change things much as many genera have only one species in each sample. The richness can be greatly reduced by grouping species according to the first letter of the genus name.

```BCI_50_genera <- BCI_50 %>%
rowid_to_column("sample") %>%
gather(key = species, value = count, -sample) %>%
# mutate(genera = gsub("^(\\w+).*", "\\1", species)) %>%
mutate(genera = gsub("^(\\w{1}).*", "\\1", species)) %>%
group_by(sample, genera) %>%
summarise(count = sum(count))

n_50_genera <- BCI_50_genera %>%
summarise(n = sum(count == 1))

n_50_genera %>%
ggplot(aes(x = n)) +
geom_bar(width = 1) +
labs(x = "Number of singletons")
```

Number of singletons per sample in the BCI data-set when the count sum is reduced to 50 trees and the taxonomic resolution is greatly reduced.

The mean richness is now 15.2 taxa.
99.53% of samples have at least one singleton amongst the 50 trees. Samples without singletons are still rare.

Time to move over to some chironomid data. Professor Ian Walker archived chironomid stratigraphies from nine Canadian lakes. The data are in tilia files (not recommended), fortunately the `rioja` package has a function to read tilia files. Unfortunately, some characters are not being read correctly and are crashing `read.Tilia` so I am calling the underlying C code directly and removing taxa after ‘unidentifiable’, which removes all the other diptera and sums that I don’t need, as well as all the problem characters.

```library("rioja")
obj <- .Call("ReadTiliaFile", fname, PACKAGE = "rioja") obj\$Data %>%
set_names(make.names(obj\$FullNames, unique = TRUE)) %>%
as_data_frame() %>%
rowid_to_column(var = "sample") %>%
select(1:grep("^UNIDENTIFIABLE\$", names(.), ignore.case = TRUE))
}

tilia_files <- dir(pattern = "\\.til\$", recursive = TRUE, full.names = TRUE, ignore.case = TRUE)

tilia_chironomids <- tilia_files %>%
set_names(gsub(pattern = ".*/(\\w*)\\.til\$", replacement = "\\1", x = ., ignore.case = TRUE)) %>%

min_counts <- tilia_chironomids %>%
filter(lake != "brier") %>% #percent data
gather(key = species, value = count, -lake, -sample) %>%
mutate(count = round(count * 2) / 2 ) %>% # Wood Lake counts back-transformed from % data
filter(count > 0) %>%
group_by(lake, sample) %>%
summarise(min_count = min(count), sum_count = sum(count), n_sing = sum(count <= 1), nsp = n())
```

The chironomids are mainly identified to genus level, some higher, some lower. Of the 158 (median count sum = 86.25, median richness = 18), 98.1% have singletons (whole or half). The median number of singletons is 6.

Of the three samples without singletons, one has a count sum of only 4.5. The other two have sensible looking counts: not all the counts are multiples of the lowest count, which would help recognise this case.

I am confident that assemblages will generally have one or more singletons, so they can be used to estimate the count sum when this is not given. This analysis shows that samples with low diversity (or equivalently low taxonomic resolution), especially with large counts, or very small counts sums might not. Analyst-related issues that increase the risk of singletons being absent include counting only a limited range of taxa (as is sometimes done in pollen analysis) and archiving the data after rare taxa have been deleted (please archive the full data).

## A paper with insufficient and misleading information

Sometimes papers reporting transfer functions don’t give enough information for the reader to properly evaluate the model. Take for example Massaferro and Larocque-Tobler (2013) who report a chironomid-based mean annual air temperature transfer function for Patagonia and its application to a Holocene-length chironomid stratigraphy from Lake Potrok Aike.

The paper reports that the three-component WAPLS model has an RMSE (root mean squared error) of 0.83°C and an r2 of 0.64. I assumed that this was a typo and that the authors were actually reporting the RMSEP (root mean squared error of prediction), as it has been standard practice for decades to present the model’s cross-validated performance.

On re-reading the paper after seeing it cited, I began to doubt my previous assumptions: perhaps the authors really were reporting the non-crossvalidated model performance statistics, which can be unrealistic estimates of model performance. My doubts focused on the large number of components used (a three-component WAPLS model is usually (perhaps always) an overfit) and the weak relationship between chironomid assemblage composition and temperature in the calibration set shown in the following figure.

Calibration set chironomid assemblages arranged by temperature. Few if any taxa show a strong relationship with temperature

A search of my sent-mail folder reminds me that I advised the authors of my concerns about this exactly four years ago today. They did not respond. I now happen to have the data from this paper and decided to have a quick look.

Using the species exclusion rules described in the paper and square-root transformation of the species data, I can get non-crossvalidated performance statistics that are similar to those reported (RMSE = 0.83°C, r2 = 0.65).

But what of the crossvalidated performance, which is a far better (though still imperfect) guide to the actual model performance? With leave-one-out crossvalidation, the performance is much worse (RMSEP = 1.39°C, r2 = 0.15). The RMSEP is almost equal to the standard deviation of the environmental variable (sd = 1.4°C) and the r2 is probably the lowest I have seen for any published transfer function. This transfer function is not useful. The authors’ claim that the performance is “comparable to other transfer functions of this size” is a false statement.

Despite the transfer function’s lack of skill, the authors use it to reconstruct temperatures from the Lake Potrok Aike chironomid stratigraphy. Most chironomid workers report the minimum number of chironomid head capsules that they count in each sample. Typically this is 50 (or perhaps 30) head capsules. The minimum chironomid count sum is not reported in this paper: I suspect most readers would assume it was about 50. It isn’t. The count sum is reported in the fossil data: the median count sum is 17. Only the fossil samples with a count of two or fewer chironomids seem to have been omitted from the published fossil stratigraphy. To not report that count sums are so low, far below the typical minimum accepted, strikes me as disingenuous.

The fossil data consist of only four taxa [note to my copy editor – this is not a typo]. Of these Phaenopsectra is most abundant, with a median abundance of 50%, much higher than the maximum abundance of this taxon in the calibration set (16.4%). This naturally raises concerns that the reconstruction diagnostics will be poor.

Perhaps the most widely used transfer function diagnostic is analogue quality. With the usual rule-of-thumb that distances greater than the 10th percentile of taxonomic distances in the calibration set represent non-analogue assemblages, 81% of samples lack analogues. This is not good.

The paper does not report analogue quality, but it does report that “all samples of the Potrok Aike core added passively to CCA of the training set samples were within the 95% confidence interval.” This is presumably the residual length diagnostic. When I calculate this statistic, I find that 100% of fossil samples are outside the 95th percentile of residual lengths. This is not good.

This paper has been fairly well cited, but appears to be fatally flawed: the transfer function as ~no skill, the fossil counts are small and lack analogues in the calibration set. I would recommend that anyone who plans to cite it assure themselves that the conclusions of the paper adequately reflect the results.

## Please archive assemblage data as counts not percent

A large fraction of the microfossil assemblage that has been archived on-line by palaeoecologists is percent data, often without the count sums, rather than the raw count data. This is unfortunate as some analyses need count data. Calculating percent from count data is trivial, but the reverse operation is in general not possible if the count sums are not available.

Fortunately, one aspect of assemblage data makes it possible to estimate the count sums: the long tail of most rank-abundance distributions means that the rarest taxon in any sample is probably present with a single individual. The lowest percent abundance, $P_{min}$, is therefore $1/count * 100$ and the count sum can be estimated as $100/P_{min}$.

Archived data are often rounded off which limits the precision of this calculation, but it is possible to estimate the range of possible count sums using the maximum and minimum value of $P_{min}$ that would yield the reported $P_{min}$ after rounding. Obviously, the uncertainty increases when fewer decimal places are reported, and with larger count sums.

I’ve written an R function to estimate count sums in the R package `countChecker` which is available on Github.

```#devtools::install_github("richardjtelford/count_checker")
library("countChecker")
library("tidyverse")
```

Lets see it in action with the percent data from Last Chance Lake (Axford et al 2017), chosen because I like the name. More importantly, it was the first percent assemblage data I found on NOAA were the count sums were reported, so it is possible to check the code is working.

```#Import data
data("last_chance")

#examine data to determine precision
```
``````## Cric.pulTy Cric.Orthund Cric.sylTy Cric.treTy
## 1 0.00 3.53 0.00 0.00
## 2 0.85 6.78 0.00 1.69
## 3 0.00 1.92 0.00 0.00
## 4 0.25 0.98 0.49 1.47
## 5 0.00 2.26 0.75 2.26
## 6 0.00 0.00 0.00 4.57``````
```#Isolate species data
last_chance = select(last_chance0, -age_calBP, -totcaps)

#Estimate n and add reported n
last_chance_est = estimate_n(spp = last_chance, digits = 2) %>%
mutate(n = last_chance0\$totcaps)

#plot
last_chance_est %>%
ggplot(aes(x = n, y = est_n, ymax = est_max, ymin = est_min)) +
geom_pointrange() +
geom_abline(
aes(slope = slope, intercept = 0, colour = as.factor(slope)),
data = data_frame(slope = 1:2),
show.legend = FALSE) +
labs(x = "Count sum", y = "Estimated count sum")
```

Estimated vs reported count sums for Last Chance Lake. The lower line is the 1:1 relationship. The upper line is the 2:1 relationship expected when the rarest taxon is represented by a half chironomid.

This reveals both the power of this method, and one of a few problems. A few samples have estimated counts that match the reported counts, but most have estimated counts twice the reported counts. The problem is that chironomid workers often count half head capsules: in many samples the lowest abundance is a half chironomid. Pollen and occasionally diatoms are also sometimes counted with halves – integer counts are easier to analyse.

In other datasets, there might be some samples with the rarest taxon represented by two (or more) microfossils. This is probably only a serious risk in very low diversity systems (hello Neogloboquadrina pachyderma sinistral), or where only a few common taxa are identified rather than all taxa.

Diatoms present an interesting test case as the two valves which comprise the frustule are counted separately even though they are often found paired (or in longer chains), raising the risk that the rarest taxon has two valves. If the rarest taxon has two valves, it is unlikely that the other taxa all have multiples of two valves, which should help detect this problem which would otherwise make the count appear to be half its true size.

The first diatom assemblage dataset I found on pangaea.de appeared to have a few samples with count sums much lower than reported, even allowing for the possibility that the rarest taxon is not a singleton. This is not the first time that I’ve noticed that some authors appear to report how many microfossils they planned to count rather than the number they actually counted. This is difficult to detect with percent data but `countChecker` might reveal it. I plan, at some stage, to trawl through various archived assemblage data to test how common this problem is.

The `countChecker` package also tests whether the percent data are congruent using a method very similar to the GRIM test which has been used to great effect on psycology data. The principle is this: if the count is of $n$ microfossils, then, discounting the tendency of some analysts to count fractional microfossils, only integer multiples of $100/n$ are possible. The `percent_checker` function tests for such impossible values after allowing for limited precision in both the percent data and the count sum. I’ll explore this method more later.

Pollen data archived in Neotoma and the European Pollen Database are nearly all count data. Please can other microfossil assemblage data also be archived as counts not percent.

Posted in R | Tagged | 1 Comment

## Chironomid assemblage change in Seebergsee

Palaeoecology could, to a large extent, be described as the study of changes in fossil assemblages: what we can learn from changes in foraminifera assemblages over the last interglacial cycle, from palynological assemblages as Neolithic agriculture spread across Europe, or from diatom assemblages in a core from a polluted lake.

Changes in assemblage composition can be caused by many factors, for example environmental changes, succession, neutral species turnover, changes in taphonomic processes such as decay, and counting error.

But sometimes assemblage composition just changes in ways that cannot be so simply explained. Take for example the chironomid stratigraphy from Seebergsee as published in Larocque-Tobler et al (2011) and Larocque-Tobler et al (2012).

Larocque-Tobler et al (2011) Fig. 4. (a) Number of head capsules by taxon and TOTAL head capsule numbers in the 24-cm core of Seebergsee. The zones were established using the ZONE program. (b) Percentages of taxa in the merged samples (i.e. samples with less than 30 head capsules were merged) through time and CA scores of axes 1 and 2.

Larocque-Tobler et al (2012) Fig. 3. “Changes in the chironomid taxa (in percentages). Only the taxa from more than two samples are presented.”

There are several differences between these stratigraphic plots. For example, in the second plot Smittia and Corynoneura arctica are shown as having no occurrences in the last 100 years whereas they previously had about eight occurrences each; Microtendipes, Cricotopus and Tanytarsus are omitted completely (the latter had 21 occurrences); and Telopelopia has ~doubled in relative abundance to 40% in some fossil samples. Other taxa appear to have the same relative abundance in each paper (the resolution of the figure is is not great, making it difficult to be certain).

One possible explanation for these differences is that the data are from two different cores, two different counts, which would represent a large amount of work. If the data are from two counts then the differences are large enough to change the reconstructions, potentially breaking the strong correlation with instrumental temperatures.

The second paper cites the first for providing the age-depth model for the uppermost, unlaminated, section of core (the reconstructions appear to use the same chronology, even though the stratigraphies have different ones), providing some evidence that both stratigraphies are from the same core. Both papers report that a “A freeze corer (Kulbe and Niederreiter, 2003) was used in 2005 to extract a 3 m sediment”. This is impressively long for a freeze core. The core was divided into 5 x 53cm sections – perhaps the corer was 3m long rather and the sediment core somewhat shorter. The upper 2.5m are used in Larocque-Tobler et al (2012).

However, the reported sub-sampling plans are different, suggesting two cores. Larocque-Tobler et al (2011) report that

“For the top 8 cm, a rotating saw was used to cut the frozen core at 0.2-cm intervals and 40 sub-samples were obtained. Since few chironomid head capsules (n = 8 to 67) were found in these samples, the core was subsequently sub-sampled using the same rotating saw at 0.5-cm intervals between 8 and 24 cm core depth for a total of 73 samples.”

Whereas the second paper reports that

“For the upper 36 cm, the core was subsampled at 0.2-cm increments using a rotating bench saw. As samples below 30 cm had less than 30 chironomids per sample, the rest of the core was subsampled at 1-cm increments.”

This makes no sense as the authors would have known about the low abundance of chironomids by the time they started processing the second core, if there were two (unless they processed both cores simultaneously). I think this description must be in error.

The papers report that different microscopes were used. From the first paper:

“The head capsules were then identified under a Motic B3 Professional microscope at 400–1000×.”

And the second

“The head capsules were identified using a Leica light microscope at 400–1000×.”

The chironomid preparation text is similar in the two papers, so this looks like a deliberate change. Perhaps both microscopes were used, one for the first paper the other for the additional samples in the second paper and the text elides over this complexity.

I am inclined to reject the explanation that there were two cores counted, which leaves me with no explanation for the two versions of the assemblage data. At least there are only two.

## The European Pollen Database meets SQLite

The European Pollen Database is a fantastic resource for palaeoecologists, storing pollen stratigraphies from across the continent. Getting the data into R for analysis is facilitated by the EPDr package. However, first you need to set up the database and this can be a little tricky. The EPD is available to download in three database formats, Paradox, Microsoft Access and Postgres. The data are also available from Neotoma (partially as of now) and Pangaea.

• I don’t know much about Paradox, and I’m not greatly motivated to change that. It might be possible to use it with EPDr (which used DBI internally), but I am not finding much online about how to do this
• Last time I used MS Access there were problems with it only having a 32-bit driver available for importing data into R. This was possible to work around this but was a considerable pain.
• Postgres is a top of the range open-source data base. However, set-up is not trivial, and on my university-managed computer, I lack the permissions needed to complete the set-up (I also lack the permission needed to change time-zones).

So what I want to do is to convert the EPD into a SQLite database. This is a very simple database format, lacking the bells and whistles that Postgres has, but it is very easy to make the connection to R – you just tell R where the file is and that it should use the SQLite driver. Having used SQLite on a couple of projects before, I also had some code to convert the MS Access files into SQLite.

We need to start by downloading the latest MS Access version of EPD,  and installing mdbtools (on ubuntu `apt-get install mdbtools`). My code might not be the most elegant way to complete the job (everything can probably be done with a few lines of a bash script), but it works.

```####load packages####
library("DBI")
library("RSQLite")

#### Extract from mdb ####
#mdb location
mdb  data/epd_schema.sql"))

##export data
system('mkdir -p data/sql')
system(paste("for i in \$( mdb-tables", mdb,
" ); do echo \$i ; mdb-export -H -I sqlite", mdb,
" \$i > data/sql/\$i.sql; done"))

#### make sqlite3 database ####
con <- dbConnect(SQLite(), dbname = "data/epd.sqlite")

#### set up schema ####
setup <- readChar("data/epd_schema.sql", nchar = 100000)

#Change to valid sqlite datatype

# EPDr does not like # in field names
setup <- gsub("#", "_", setup)

# avoid case sensitive problems with EPDr
setup <- tolower(setup)

sapply(
paste("create", strsplit(setup, "create")[[1]][-1]),
dbExecute,
conn = con)

#### import data ####
import_table <- function(TAB) {
message(TAB)
tab <- readChar(paste0("data/sql/", TAB, ".sql"), nchar = 1e9)
if (length(tab) == 0) {
paste("File", TAB, "is empty")
return(NULL)
}
#make into a single INSERT INTO statement for speed
tab <- gsub("\nINSERT INTO [^;]+ VALUES", "\n", tab)
tab <- gsub(";(?!\$)", ",", tab, perl = TRUE)

# Change # to _ to keep EPDr happy
tab <- gsub("#", "_", tab)

dbExecute(conn = con, statement = tab)
}

sapply(toupper(dbListTables(con)), import_table)

```

So did it work?

library(EPDr)
library(RSQLite)

#### make connection ####
epd.connection <- DBI::dbConnect(dbname = "data/epd.sqlite", drv = SQLite())

#### test an EPDr function ####
# E_ Site_ Sigle Name IsCore IsSect IsSSamp Descriptor HasAnLam
#1 1 1 ADANGE NA Y N Y RFLU N
# EntLoc LocalVeg Coll_ SampDate DepthAtLoc
#1 left coast of river Cyperaceae fen 65 1984-08-00 210
# IceThickCM SampDevice CoreDiamCM C14DepthAdj Notes Site_ SiteName
# SiteCode SiteExists PolDiv1 PolDiv2 PolDiv3 LatDeg LatMin
#1 GEO-01000-ADAN NA GEO 01 000 43 18
# LatSec LatNS LatDD LatDMS LonDeg LonMin LonSec LonEW LonDD
#1 20 N 43.30556 43.18.20N 41 20 0 E 41.33333
# LonDMS Elevation AreaOfSite
#1 41.20.00E 1750 0.25

#### Tidy up ####
DBI::dbDisconnect(epd.connection)

Looks good so far (sorry about the formatting – there seems to be a bug in the syntax highlighter). Now I can start to explore some questions I have.

If there is interest, I can put the SQLite database on Dropbox, but I won't guarantee that the copy is up-to-date.

## Been scooped? A discussion on data stewardship

At Climate of the Past, there is a pre-print by Darrell Kaufman and others on the data stewardship policies adopted by the PAGES 2k special issue.

Abstract. Data stewardship is an essential element of the publication process. Knowing how to enact generally described data policies can be difficult, however. Examples are needed to model the implementation of open-data polices in actual studies. Here we explain the procedure used to attain a high and consistent level of data stewardship across a special issue of the journal, Climate of the Past. We discuss the challenges related to (1) determining which data are essential for public archival, (2) using previously published data, and (3) understanding how to cite data. We anticipate that open-data sharing in paleo sciences will accelerate as the advantages become more evident and the practice becomes a standard part of publication.

The policy was closely aligned to the regular Climate of the Past policies (which are among the better policies in palaeo journals), but with more “must”. The discussion/review is ongoing, with a co-editor-in-chief encouraging further contributions to the discussion.

Two of the comments posted so far, while generally supportive of data archiving, raise concerns. Both express concern about the impact on early career scientists.

From Karoly et al

The impact of rigid data policies formulated in a top-down manner by experienced researchers (often those involved in modelling or multi-proxy synthesis) with large teams will generally be negative on early-career researchers who are often working to schedules around their PhD study and cannot as rapidly produce the final products of their work as can a larger group. With a desire to succeed and contribute to the science, this leaves them vulnerable to ’scientific exploitation’ and, in more serious cases, may compromise the successful completion of their postgraduate studies and future careers.

From Cook

It is quite clear that the ramifications of this “pilot” have not been thought out well by its authors, especially given the way that it forces graduate students and early-career scientists to give up their sensitive new data prematurely before their degrees or projects are completed. I know because this is a very real concern of my two graduate students and they deserve to be concerned given this so-called “best practices” data stewardship policy that prompted the earlier comment.

Karoly et al and Cook are presumably concerned that early career scientists who archive their data on publication will be scooped, one of six common fears about archiving data (Bishop 2015).

Is there any justification for this fear? Does anyone have any examples of scientists being scooped because they archived data on publication? Or better still, know of a study of the prevalence of scooping?

I am aware of people being scooped after making unpublished data publicly available. Ongoing time series are a particular problem – please follow their data usage policy.

I think the risk of being scooped because of archived data is low.

• A well designed project plans the analyses in advance – the authors should know what papers they plan to write before the data are gathered – so they have a large head-start on anyone else.
• Publishing a paper usually takes months: during that time the authors are the only people with access to the data giving a further headstart.
• The second paper will rarely just use the same data as the first (if it does, care should be taken to avoid salami-slicing).
• Most people have a backlog of papers that need writing, and also have the courtesy not to write a paper on a single, recently published dataset without including the authors.

If scooping is a significant risk, one option is to allow data to be archived but protected from download for some time. The downside of this is that the data are not immediately available for replicating the paper. Another option would be to make the data available under embargo, so the study can be replicated but the data cannot be included in any paper until after a certain date.

We need to know the prevalence of scooping using data archived on publication. Without knowing the prevalence, we don’t know whether we need to adjust policies and practices to reduce the risk, or to put more effort into assuaging the fears of early career scientists.

Posted in Uncategorized | Tagged | 18 Comments

## A failed attempt to reproduce two ordinations

To examine the millennium-long chironomid-inferred air temperature reconstruction from Seebergsee (Larocque-Tobler et al 2012) is, after having shown that the calibration-in-time reconstruction for the upper section of the core (Larocque-Tobler et al 2011) has no skill, to flog the proverbial dead parrot.

But there is one aspect I wish to examine: Figure 5, which shows ordinations of the millennium-long chironomid stratigraphies from Seebergsee and Lake Silvaplana.

Larocque-Tobler et al (2012) Fig. 5. Two dimensional non-metric multi-dimensional scaling in the fossil chironomid assemblages of Seebergsee and Lake Silvaplana. In both cores the samples after ca 1950 CE (black circles) have the highest chord distances with samples pre-1950 CE (white circles).

Both ordinations show a pronounced shift in community composition at 1950 CE, with all fossil samples from after this date distinct from those before. The ordinations are non-metric multidimensional scaling fitted and plotted using Primer (so the axes are scaled correctly).

One might expect such the strong pattern shown in the ordinations to be clearly visible in the underlying community data: they are not.

Larocque-Tobler et al (2012) Fig. 3. Changes in the chironomid taxa (in percentages). Only the taxa from more than two samples are presented.

If there is a switch in community composition in Seebergsee, it would appear to be at about 1970 CE, where there is a zone boundary, rather than 1950 CE. It is unclear from the paper whether the zones were defined using the “Zone program” (Methods) or “based on the PCA scores of axis 1 and 2” (Results).

Larocque-Tobler et al (2010) Fig. 3. a) Number of head capsules per taxon found in each sample of the Lake Silvaplana cores. b) Percentages of taxa in merged samples of the Lake Silvaplana cores. Samples were merged to obtain a number of at least 30 head capsules (see Larocque et al., 2009).

The millennium-long chironomid stratigraphy from Lake Silvaplana (Larocque-Tobler et al 2010) does not have a zone boundary at 1950 CE; the assemblage change at ~1770 CE appears more important.

Both stratigraphies appear to have many more samples than shown in the ordination.

Figure 5 from Larocque-Tobler et al (2012) is the only published ordination of the Seebergsee chironomid stratigraphy, but there are several ordinations of the Lake Silvaplana stratigraphy. None show any indications of a large community change at 1950. For example, here is a correspondence analysis of the last 150 years from Larocque et al (2009).

Figure 4 from Larocque et al (2009) Correspondence analysis. The numbers in the graph are the years AD of the samples. The variance explained by each axis is in brackets. The axes are incorrectly scaled.

Since I have the data for at least the last century from both lakes, I can try to reproduce the Figure 5.

Non-metric multidimensional scaling of the chironomid stratigraphy from Seebergsee.

The NMDS of the last 150 years at Seebergsee (Larocque-Tobler et al 2011) shows no distinct split at 1950 or any other time.

Non-metric multidimensional scaling of the chironomid stratigraphy from Lake Silvaplana.

The NMDS of the > 400 year long chironomid stratigraphy from Lake Silvaplana (Larocque-Tobler et al 2009) shows a fairly distinct split into two groups. However, the split is at 1760 not 1950 (I had to use the Bray-Curtis distance as the chord distance used Larocque-Tobler et al (2012) found the last sample in the data to be an extreme outlier).

Neither ordination in Larocque-Tobler et al (2012) can be reproduced.

I had hoped this would be my last post exploring Seebergsee, but, alas, I found some further oddities while preparing this post. After discussing those, I have a another post about the chironomid-inferred reconstructions from Lake Zabinskie and, in the unlikely event that I feel inspired, the lakes at Abisko. Soon, I hope to resume my usual diet of papers reporting solar correlations with palaeoecological data.