Chironomid assemblage change in Seebergsee

Palaeoecology could, to a large extent, be described as the study of changes in fossil assemblages: what we can learn from changes in foraminifera assemblages over the last interglacial cycle, from palynological assemblages as Neolithic agriculture spread across Europe, or from diatom assemblages in a core from a polluted lake.

Changes in assemblage composition can be caused by many factors, for example environmental changes, succession, neutral species turnover, changes in taphonomic processes such as decay, and counting error.

But sometimes assemblage composition just changes in ways that cannot be so simply explained. Take for example the chironomid stratigraphy from Seebergsee as published in Larocque-Tobler et al (2011) and Larocque-Tobler et al (2012).


Larocque-Tobler et al (2011) Fig. 4. (a) Number of head capsules by taxon and TOTAL head capsule numbers in the 24-cm core of Seebergsee. The zones were established using the ZONE program. (b) Percentages of taxa in the merged samples (i.e. samples with less than 30 head capsules were merged) through time and CA scores of axes 1 and 2.


Larocque-Tobler et al (2012) Fig. 3. “Changes in the chironomid taxa (in percentages). Only the taxa from more than two samples are presented.”


There are several differences between these stratigraphic plots. For example, in the second plot Smittia and Corynoneura arctica are shown as having no occurrences in the last 100 years, whereas they previously had about eight occurrences each; Microtendipes, Cricotopus and Tanytarsus are omitted completely (the latter had 21 occurrences); and Telopelopia has roughly doubled in relative abundance to 40% in some fossil samples. Other taxa appear to have the same relative abundance in each paper (the resolution of the figures is not great, making it difficult to be certain).

One possible explanation for these differences is that the data come from two different cores, and hence two different counts, which would represent a large amount of work. If the data are from two counts, then the differences are large enough to change the reconstructions, potentially breaking the strong correlation with instrumental temperatures.

The second paper cites the first for providing the age-depth model for the uppermost, unlaminated, section of core (the reconstructions appear to use the same chronology, even though the stratigraphies have different ones), providing some evidence that both stratigraphies are from the same core. Both papers report that “A freeze corer (Kulbe and Niederreiter, 2003) was used in 2005 to extract a 3 m sediment” core. This is impressively long for a freeze core. The core was divided into 5 x 53 cm sections, so perhaps the corer was 3 m long and the sediment core somewhat shorter. The upper 2.5 m are used in Larocque-Tobler et al (2012).

However, the reported sub-sampling plans are different, suggesting two cores. Larocque-Tobler et al (2011) report that

“For the top 8 cm, a rotating saw was used to cut the frozen core at 0.2-cm intervals and 40 sub-samples were obtained. Since few chironomid head capsules (n = 8 to 67) were found in these samples, the core was subsequently sub-sampled using the same rotating saw at 0.5-cm intervals between 8 and 24 cm core depth for a total of 73 samples.”

Whereas the second paper reports that

“For the upper 36 cm, the core was subsampled at 0.2-cm increments using a rotating bench saw. As samples below 30 cm had less than 30 chironomids per sample, the rest of the core was subsampled at 1-cm increments.”

This makes no sense as the authors would have known about the low abundance of chironomids by the time they started processing the second core, if there were two (unless they processed both cores simultaneously). I think this description must be in error.

The papers report that different microscopes were used. From the first paper:

“The head capsules were then identified under a Motic B3 Professional microscope at 400–1000×.”

And the second

“The head capsules were identified using a Leica light microscope at 400–1000×.”

The chironomid preparation text is similar in the two papers, so this looks like a deliberate change. Perhaps both microscopes were used, one for the first paper and the other for the additional samples in the second paper, and the text glosses over this complexity.

I am inclined to reject the explanation that there were two cores counted, which leaves me with no explanation for the two versions of the assemblage data. At least there are only two.


The European Pollen Database meets SQLite

The European Pollen Database is a fantastic resource for palaeoecologists, storing pollen stratigraphies from across the continent. Getting the data into R for analysis is facilitated by the EPDr package. However, first you need to set up the database, and this can be a little tricky. The EPD is available to download in three database formats: Paradox, Microsoft Access and Postgres. The data are also available from Neotoma (only partially, as of now) and Pangaea.

  • I don’t know much about Paradox, and I’m not greatly motivated to change that. It might be possible to use it with EPDr (which uses DBI internally), but I am not finding much online about how to do this.
  • Last time I used MS Access there were problems with only a 32-bit driver being available for importing data into R. It was possible to work around this, but it was a considerable pain.
  • Postgres is a top-of-the-range open-source database. However, set-up is not trivial, and on my university-managed computer I lack the permissions needed to complete it (I also lack the permissions needed to change time zones).

So what I want to do is convert the EPD into a SQLite database. This is a very simple database format, lacking the bells and whistles that Postgres has, but it is very easy to connect to from R: you just tell R where the file is and that it should use the SQLite driver. Having used SQLite on a couple of projects before, I also had some code to convert MS Access files into SQLite.

We need to start by downloading the latest MS Access version of the EPD and installing mdbtools (on Ubuntu: apt-get install mdbtools). My code might not be the most elegant way to complete the job (everything can probably be done with a few lines of a bash script), but it works.

####load packages####
library("DBI")
library("RSQLite")

#### Extract from mdb ####
#mdb location (adjust to the name of the downloaded Access file)
mdb <- "data/epd.mdb"

##export schema
system(paste("mdb-schema", mdb, "> data/epd_schema.sql"))

##export data
system('mkdir -p data/sql')
system(paste("for i in $( mdb-tables", mdb,
" ); do echo $i ; mdb-export -H -I sqlite", mdb,
" $i > data/sql/$i.sql; done"))  

#### make sqlite3 database ####
con <- dbConnect(SQLite(), dbname = "data/epd.sqlite")

#### set up schema ####
setup <- readChar("data/epd_schema.sql", nchar = 100000)

#Change to valid sqlite datatype
setup <- gsub("Memo/Hyperlink", "Text", setup)

# EPDr does not like # in field names
setup <- gsub("#", "_", setup)

# avoid case sensitive problems with EPDr
setup <- tolower(setup)

#add each table separately
sapply(
  paste("create", strsplit(setup, "create")[[1]][-1]),
  function(statement) dbExecute(conn = con, statement = statement))

#### import data ####
import_table <- function(TAB) {
  tab <- readChar(paste0("data/sql/", TAB, ".sql"), nchar = 1e9)
  #skip tables with no rows (mdb-export then writes an empty file)
  if (length(tab) == 0) {
    return(paste("File", TAB, "is empty"))
  }
  #make into a single INSERT INTO statement for speed
  tab <- gsub("\nINSERT INTO [^;]+ VALUES", "\n", tab)
  tab <- gsub(";(?!$)", ",", tab, perl = TRUE)

  # Change # to _ to keep EPDr happy
  tab <- gsub("#", "_", tab)

  dbExecute(conn = con, statement = tab)
}

sapply(toupper(dbListTables(con)), import_table)

So did it work?


#### make connection ####
epd.connection <- DBI::dbConnect(dbname = "data/epd.sqlite", drv = SQLite())

#### test an EPDr function ####
list_e(epd.connection, site = "Adange")
# E_ Site_ Sigle Name IsCore IsSect IsSSamp Descriptor HasAnLam
# EntLoc LocalVeg Coll_ SampDate DepthAtLoc
#1 left coast of river Cyperaceae fen 65 1984-08-00 210
# IceThickCM SampDevice CoreDiamCM C14DepthAdj Notes Site_ SiteName
#1 NA spade 10 NA NA 1 Adange
# SiteCode SiteExists PolDiv1 PolDiv2 PolDiv3 LatDeg LatMin
#1 GEO-01000-ADAN NA GEO 01 000 43 18
# LatSec LatNS LatDD LatDMS LonDeg LonMin LonSec LonEW LonDD
#1 20 N 43.30556 43.18.20N 41 20 0 E 41.33333
# LonDMS Elevation AreaOfSite
#1 41.20.00E 1750 0.25

#### Tidy up ####
DBI::dbDisconnect(epd.connection)

Looks good so far (sorry about the formatting – there seems to be a bug in the syntax highlighter). Now I can start to explore some questions I have.
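Beyond EPDr, the SQLite file can also be queried directly with DBI. Here is a minimal sketch; the table name siteloc is only an illustrative guess at one of the EPD tables, so substitute whatever dbListTables() returns.

#### example: poke around the database directly ####
#reopen the connection (it was closed in the tidy-up above)
epd.connection <- DBI::dbConnect(drv = SQLite(), dbname = "data/epd.sqlite")

#list the available tables
DBI::dbListTables(epd.connection)

#run an arbitrary query; "siteloc" is just an example table name,
#substitute any table returned by dbListTables()
DBI::dbGetQuery(epd.connection, "SELECT * FROM siteloc LIMIT 5")

DBI::dbDisconnect(epd.connection)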

If there is interest, I can put the SQLite database on Dropbox, but I won't guarantee that the copy is up-to-date.


Been scooped? A discussion on data stewardship


At Climate of the Past, there is a pre-print by Darrell Kaufman and others on the data stewardship policies adopted by the PAGES 2k special issue.

Abstract. Data stewardship is an essential element of the publication process. Knowing how to enact generally described data policies can be difficult, however. Examples are needed to model the implementation of open-data polices in actual studies. Here we explain the procedure used to attain a high and consistent level of data stewardship across a special issue of the journal, Climate of the Past. We discuss the challenges related to (1) determining which data are essential for public archival, (2) using previously published data, and (3) understanding how to cite data. We anticipate that open-data sharing in paleo sciences will accelerate as the advantages become more evident and the practice becomes a standard part of publication.

The policy was closely aligned to the regular Climate of the Past policies (which are among the better policies in palaeo journals), but with more “must”. The discussion/review is ongoing, with a co-editor-in-chief encouraging further contributions to the discussion.

Two of the comments posted so far, while generally supportive of data archiving, raise concerns; both focus on the impact on early-career scientists.

From Karoly et al

The impact of rigid data policies formulated in a top-down manner by experienced researchers (often those involved in modelling or multi-proxy synthesis) with large teams will generally be negative on early-career researchers who are often working to schedules around their PhD study and cannot as rapidly produce the final products of their work as can a larger group. With a desire to succeed and contribute to the science, this leaves them vulnerable to ’scientific exploitation’ and, in more serious cases, may compromise the successful completion of their postgraduate studies and future careers.

From Cook

It is quite clear that the ramifications of this “pilot” have not been thought out well by its authors, especially given the way that it forces graduate students and early-career scientists to give up their sensitive new data prematurely before their degrees or projects are completed. I know because this is a very real concern of my two graduate students and they deserve to be concerned given this so-called “best practices” data stewardship policy that prompted the earlier comment.

Karoly et al and Cook are presumably concerned that early career scientists who archive their data on publication will be scooped, one of six common fears about archiving data (Bishop 2015).

Is there any justification for this fear? Does anyone have any examples of scientists being scooped because they archived data on publication? Or better still, know of a study of the prevalence of scooping?

I am aware of people being scooped after making unpublished data publicly available. Ongoing time series are a particular problem – please follow their data usage policy.

See also the discovery of the dwarf planet Haumea.

I think the risk of being scooped because of archived data is low.

  • A well designed project plans the analyses in advance – the authors should know what papers they plan to write before the data are gathered – so they have a large head start on anyone else.
  • Publishing a paper usually takes months: during that time the authors are the only people with access to the data, giving them a further head start.
  • The second paper will rarely just use the same data as the first (if it does, care should be taken to avoid salami-slicing).
  • Most people have a backlog of papers that need writing, and also have the courtesy not to write a paper on a single, recently published dataset without including the authors.

If scooping is a significant risk, one option is to allow data to be archived but protected from download for some time. The downside of this is that the data are not immediately available for replicating the paper. Another option would be to make the data available under embargo, so the study can be replicated but the data cannot be included in any paper until after a certain date.

We need to know the prevalence of scooping using data archived on publication. Without knowing the prevalence, we don’t know whether we need to adjust policies and practices to reduce the risk, or to put more effort into assuaging the fears of early career scientists.


A failed attempt to reproduce two ordinations

To examine the millennium-long chironomid-inferred air temperature reconstruction from Seebergsee (Larocque-Tobler et al 2012) is, after having shown that the calibration-in-time reconstruction for the upper section of the core (Larocque-Tobler et al 2011) has no skill, to flog the proverbial dead parrot.

But there is one aspect I wish to examine: Figure 5, which shows ordinations of the millennium-long chironomid stratigraphies from Seebergsee and Lake Silvaplana.



Larocque-Tobler et al (2012) Fig. 5. Two dimensional non-metric multi-dimensional scaling in the fossil chironomid assemblages of Seebergsee and Lake Silvaplana. In both cores the samples after ca 1950 CE (black circles) have the highest chord distances with samples pre-1950 CE (white circles).

Both ordinations show a pronounced shift in community composition at 1950 CE, with all fossil samples from after this date distinct from those before. The ordinations are non-metric multidimensional scaling fitted and plotted using Primer (so the axes are scaled correctly).

One might expect the strong pattern shown in the ordinations to be clearly visible in the underlying community data: it is not.


Larocque-Tobler et al (2012) Fig. 3. Changes in the chironomid taxa (in percentages). Only the taxa from more than two samples are presented.

If there is a switch in community composition in Seebergsee, it would appear to be at about 1970 CE, where there is a zone boundary, rather than 1950 CE. It is unclear from the paper whether the zones were defined using the “Zone program” (Methods) or “based on the PCA scores of axis 1 and 2” (Results).


Larocque-Tobler et al (2010) Fig. 3. a) Number of head capsules per taxon found in each sample of the Lake Silvaplana cores. b) Percentages of taxa in merged samples of the Lake Silvaplana cores. Samples were merged to obtain a number of at least 30 head capsules (see Larocque et al., 2009).

The millennium-long chironomid stratigraphy from Lake Silvaplana (Larocque-Tobler et al 2010) does not have a zone boundary at 1950 CE; the assemblage change at ~1770 CE appears more important.

Both stratigraphies appear to have many more samples than are shown in the ordinations.

Figure 5 from Larocque-Tobler et al (2012) is the only published ordination of the Seebergsee chironomid stratigraphy, but there are several ordinations of the Lake Silvaplana stratigraphy. None show any indications of a large community change at 1950. For example, here is a correspondence analysis of the last 150 years from Larocque et al (2009).


Figure 4 from Larocque et al (2009) Correspondence analysis. The numbers in the graph are the years AD of the samples. The variance explained by each axis is in brackets. The axes are incorrectly scaled.

Since I have the data for at least the last century from both lakes, I can try to reproduce Figure 5.
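Here is a minimal sketch of the kind of NMDS fitted below, assuming the assemblage percentages are in a data frame seebergsee (samples by taxa) with sample ages in a vector ages; both object names are placeholders rather than the archived data objects.

#### NMDS sketch (placeholder objects: seebergsee, ages) ####
library("vegan")

set.seed(42)  #metaMDS uses random starts
nmds <- metaMDS(seebergsee, distance = "bray", k = 2)

#sample scores, plotted with equal axis scaling and pre/post-1950 symbols
site_scores <- scores(nmds, display = "sites")
plot(site_scores, asp = 1, xlab = "NMDS1", ylab = "NMDS2",
     pch = ifelse(ages >= 1950, 16, 1))
legend("topright", pch = c(16, 1), legend = c("post-1950", "pre-1950"))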


Non-metric multidimensional scaling of the chironomid stratigraphy from Seebergsee.

The NMDS of the last 150 years at Seebergsee (Larocque-Tobler et al 2011) shows no distinct split at 1950 or any other time.


Non-metric multidimensional scaling of the chironomid stratigraphy from Lake Silvaplana.

The NMDS of the >400-year-long chironomid stratigraphy from Lake Silvaplana (Larocque-Tobler et al 2009) shows a fairly distinct split into two groups. However, the split is at 1760, not 1950 (I had to use the Bray-Curtis distance, as the chord distance used by Larocque-Tobler et al (2012) found the last sample in the data to be an extreme outlier).
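For completeness, the chord distance can be computed in vegan as the Euclidean distance between samples normalised to unit length; a sketch of one way to check the outlying final sample, again with a placeholder data frame silvaplana.

#### chord distance sketch (placeholder object: silvaplana) ####
library("vegan")

#chord distance = Euclidean distance between samples scaled to unit length
chord <- dist(decostand(silvaplana, method = "normalize"))

#inspect the distances from the final sample to all earlier samples
chord_mat <- as.matrix(chord)
summary(chord_mat[nrow(chord_mat), -nrow(chord_mat)])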

Neither ordination in Larocque-Tobler et al (2012) can be reproduced.

I had hoped this would be my last post exploring Seebergsee, but, alas, I found some further oddities while preparing this post. After discussing those, I have another post about the chironomid-inferred reconstructions from Lake Zabinskie and, in the unlikely event that I feel inspired, the lakes at Abisko. Soon, I hope to resume my usual diet of papers reporting solar correlations with palaeoecological data.


Reviews of another manuscript

After last night’s premature excitement about reviews being ready for my review of high-resolution reconstructions, today I received reviews of a different manuscript. They include this:

4. The authors applied high-end modern statistics and gave full reference. However, their analyses may only be re-produced by using the same scripts in R. A re-calculation with a different statistics software package is not possible, either the respective analyses were not included or derivate variables were used. A focusing on simpler and wider distributed multivariate statistics would be appropriate.

I need to confess: most of the analyses were done with a terribly obscure R package known as vegan. The manuscript uses the avant-garde methods redundancy analysis and canonical correspondence analysis which, I admit, are almost impossible to fit outside vegan, unless one uses CANOCO, XLSTAT, or half a dozen other statistical programs or R packages.

It’s not much of a stretch to realise that few of the potential readers of the manuscript will have heard of Procrustes analysis, which we use next to compare our ordinations, but reflect on this: our code can be translated to a DOS program. PROTEST if you want, but don’t get into a twist.

Finally, we use a co-correspondence analysis from the cocorresp package to compare community patterns in two species groups. This is a less commonly used method, but downloading R and the package won’t bankrupt anybody (those not afraid of being bankrupted can run the analysis in MATLAB).
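For the record, here is roughly how much code these analyses need, sketched with the example data that ship with vegan and cocorresp rather than our own data (so every object name below comes from the packages, not from the manuscript).

#### ordination sketch with the packages' built-in example data ####
library("vegan")
library("cocorresp")

data(dune)       #vegan's example community data
data(dune.env)   #matching environmental data

rda_mod <- rda(dune ~ Management + A1, data = dune.env)  #redundancy analysis
cca_mod <- cca(dune ~ Management + A1, data = dune.env)  #canonical correspondence analysis

#Procrustes rotation of one ordination onto another, with a permutation test (PROTEST)
protest(rda_mod, cca_mod)

#predictive co-correspondence analysis of two community matrices
data(beetles)
data(plants)
coca_mod <- coca(beetles ~ ., data = plants)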

Had the reviewer complained that the co-correspondence analysis was redundant, I would not have objected. Had the reviewer explained that our analyses were inappropriate or sub-optimal, I would have paid attention. But this claim that our analyses can only be reproduced with R is both wrong (as shown above) and irrelevant as these packages can be freely downloaded, and the vague suggestion we use simpler methods is unhelpful.

UPDATE: CANOCO version 5 will run all the analyses used in the manuscript.




Reviews are in

An email this evening informs me that the reviews of my manuscript discussing the challenges of validating reconstructions derived from microfossil assemblages with instrumental data, and the particular problems with several chironomid-based reconstructions, are now ready. But since I wish to sleep tonight, I shall refrain from reading them until tomorrow.


Heavy weather at Seebergsee

Larocque-Tobler et al (2011) compare their chironomid-inferred reconstructions of July air temperature with instrumental data from Château-d’Oex, a climate station with continuous temperature data from 1901. Château-d’Oex is located less than 30km from Seebergsee (Larocque-Tobler et al (2011) report 50km), and about 800m lower.

Why am I examining the climate data where there is surely little to go wrong? Because to find all the problems in a paper, you need to examine everything, and understanding the problems in one paper can help suggest aspects of another paper to examine.

The Château-d’Oex temperature series needs to be processed to make it comparable with the reconstruction.

“Since samples having less than 30 head capsules were merged for temperature and VWHO reconstructions, the temporal resolution of each sample decreased to ca. 3–8 years. Thus, three- to eight-year running means in the instrumental data were used for comparison with the chironomid-inferred temperatures.”

So let’s try looking at the three- and eight-year running means.
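A sketch of the processing, assuming the mean July temperatures from Château-d’Oex are in vectors year and july (placeholder names for whatever the downloaded file provides):

#### running means sketch (placeholder vectors: year, july) ####
#anomalies relative to the 1901-2005 mean
anomaly <- july - mean(july[year >= 1901 & year <= 2005])

#centred running means
run_mean <- function(x, n) as.numeric(stats::filter(x, rep(1 / n, n)))
run3 <- run_mean(anomaly, 3)
run8 <- run_mean(anomaly, 8)

plot(year, anomaly, pch = 16, cex = 0.7,
     xlab = "Year", ylab = "July temperature anomaly (°C)")
lines(year, run3, col = "red")
lines(year, run8, col = "blue")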


Larocque-Tobler et al (2011) Fig. 6. Temperature reconstruction using chironomid head capsules and the calibration-in-time approach (black line) compared with mean July instrumental data from the closest meteorological station (dotted line). Averages in the instrumental data followed the time represented in the merged sediment samples. Over-plotted with the instrumental data (black circles) and the 3 (red) and 8 (blue) year running means.

I did not expect the three or eight year running means to be identical to the temperature series used by Larocque-Tobler et al (2011) but I would expect them to be somewhat similar. There is a period between about 1975 and 1985 where the three year running mean and the published series are nearly identical, but around 1915 the published series is far higher than either smooth, and between about 1925 and 1940 it is much lower.

What could the cause of these discrepancies be?

Normalisation period: I calculated anomalies by subtracting the 1901-2005 mean temperature, and I believe the published series is treated in the same way, as its mean anomaly is near zero. Use of a different normal period for calculating anomalies could not explain both the high temperatures near 1915 and the low temperatures in 1925-1940.

Choice of month: Plotting months other than July, or the mean summer or annual temperature, does not give a better fit between the published series and the Château-d’Oex data.

Data homogenisation: Larocque-Tobler et al (2011) was submitted in September 2010. The Swiss Federal Office of Meteorology published the current version of the homogenised climate data in December 2010. Larocque-Tobler et al (2011) are therefore probably using an earlier version of the temperature data (although they could have updated their analysis during revisions). I don’t know what temperature data were available in 2010, but they were probably homogenised to some extent. Compared with the homogenised data, the raw data show a warmer mean annual temperature in 1920-1940, but I cannot locate a copy of the version of the data that was available in early 2010.

Data processing: As I show in my manuscript, and will show in a future blog post, Larocque-Tobler et al (2015) incorrectly calculate the August air temperature at Lake Zabinskie when chironomid samples span several years: rather than using the mean temperature of the period spanned by the chironomid samples, they use the temperature of the first year. I cannot see any evidence of the same data-processing problem here.

None of these potential explanations accounts for the discrepancies between the published series and the downloadable data from Château-d’Oex. The published series is incorrect for an unknown reason, and therefore the correlation between the calibration-in-space July air temperature reconstruction and the published series is meaningless. I have not calculated what the correlation would be with the correct series, but a visual comparison of the curves suggests it would be worse.


Larocque-Tobler et al (2011) Figure 5a. Calibration-in-space reconstruction (solid line) and reported instrumental data (dotted line) over-plotted with the instrumental data (black circles) and the 3 (red) and 8 (blue) year running means.


I’ve digitised the calibration-in-space reconstruction. The reported correlation between the calibration-in-space reconstruction and the reported instrumental data for the period up to 1980 is correct (r = 0.64); for the period up to 1960, the correlation is 0.89 rather than the reported r = 0.71. This strange error of reporting a much weaker correlation than the data show also occurred at Abisko (Larocque 2003) and at Lake Zabinskie (Larocque-Tobler et al 2015). I don’t know why the authors, who have argued that “Chironomids = Temperature”, are weakening the relationship between their chironomid-based reconstruction and temperature.
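The check itself is a one-liner, assuming the digitised reconstruction recon, the reported instrumental series instr, and their years year are aligned vectors (all placeholder names):

#### correlation check sketch (placeholder vectors: recon, instr, year) ####
cor(recon[year <= 1980], instr[year <= 1980])  #reported as r = 0.64
cor(recon[year <= 1960], instr[year <= 1960])  #reported as r = 0.71; I get 0.89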

Larocque-Tobler et al (2011) is a deeply problematic paper: the reported count sums are so low that precise reconstructions are unlikely, and the true count sums are likely to be lower still; the various versions of the chronology are a mess; the lack of cross-validation means that the performance of the calibration-in-time model is spuriously good; and, for an unknown reason, the published instrumental temperature series cannot be reconciled with the Château-d’Oex series. If the authors stand by their paper, they should archive the data and code needed to reproduce their results. If they do not, they should correct or retract it.
