All the pollen

“And some things that should not have been forgotten were lost. History became legend. Legend became myth. And for two and a half thousand years, the metadata passed out of all knowledge.”


Michener et al 1997 Figure 1. Example of the normal degradation in information content associated with data and metadata over time (“information entropy”). Accidents or changes in storage technology (dashed line) may eliminate access to remaining raw data and metadata at any time

A couple of months ago, Eric Grimm gave an introduction to the Neotoma database in Bergen for participants in the HOPE project (PhD and postdoc positions to be advertised soon).

I started to tinker with the neotoma R package and downloaded some fossil pollen data. Actually, all the pollen data: over 2700 sites; 110 thousand levels; and 120 million terrestrial spores and pollen grains.


Location of Neotoma fossil pollen data sets

But what to do with 170 MB of data?

The first thing I do with any data-set is to test it for oddities: improbable values or patterns that might indicate errors or misunderstandings. I developed a set of methods last year to test the ever-so curious chironomid data from Lake Żabińskie: I’m looking forward to applying them to this huge amount of pollen data.

Expectation 1: Counts should be integers

The vast majority of the pollen data in Neotoma is labelled as being count data – I’ve omitted a small amount labelled as percent data. Counts should be integers, so any non-integer values would be cause for concern, except that pollen analysts often count half grains of Pinus and some other conifers (some Tsuga, Podocarpus, Pinaceae) with pollen which often splits into two identifiable parts (half counts are also common in diatom and chironomid counts). So I am expecting integer and half values for some conifer species, but that is not what I found.

The vast majority of counts are integer (or half) values; only 7537 (0.3%) are not.

Of these, 3923 are near integer (or half) values (absolute difference less than 0.001). These are probably because some of the count data have been back-calculated from percent data (or read off pollen diagrams) and there are rounding errors. These errors are inconsequential and are trivial to fix.
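The integer-or-half test can be sketched in a few lines of Python. This is an illustration only, not the R code behind the numbers above: the function name `classify_count` is hypothetical, and the tolerance of 0.001 is taken from the text.

```python
def classify_count(value, tol=1e-3):
    """Classify a pollen count as 'integer', 'half', 'near' or 'other'.

    'near' means within `tol` grains of an integer or half value,
    which is what rounding error from back-calculated percentages
    would be expected to produce.
    """
    doubled = value * 2           # integers and halves become whole numbers
    nearest = round(doubled)      # nearest integer-or-half, in doubled units
    diff = abs(doubled - nearest) / 2   # distance in grains
    if diff == 0:
        return "integer" if nearest % 2 == 0 else "half"
    if diff < tol:
        return "near"
    return "other"


def tally(counts, tol=1e-3):
    """Count how many values fall into each class."""
    result = {"integer": 0, "half": 0, "near": 0, "other": 0}
    for value in counts:
        result[classify_count(value, tol)] += 1
    return result
```

Values classed as "near" could simply be rounded to the nearest integer or half; values classed as "other" are the ones that need a closer look.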

The counts for Pinaceae were more variable than I had imagined. While the vast majority of counts are integers (309096) or half values (11104), 374 counts appear to be tenths, quarters, or thirds of a grain with a few odd values that might be percent – see below.

I also discovered that some analysts count Acer grains in thirds.

Excluding the conifers and Acer which have non-integer counts, there are still several thousand non-integer counts in the database. These may represent typos, which should be sporadic, or indicate that the data are not counts, but are instead percent or pollen concentration/influx data, which might have pervasive non-integer values. It is also possible that some analysts count half grains of a broader range of taxa, in which case the non-integers should be restricted to a few taxa.

Eighty-four data sets have at least one unexpected non-integer value; 37 have more than five. These are the 5 data sets in which more than half the counts are non-integer values.

Table 1: Proportion and number of non-integer values.
Data set  Proportion  Number
15059     0.99        1806
16209     0.91        524
16210     0.83        171
15696     0.73        517
16090     0.59        237

We can examine these data sets with the browse function.


The very high numbers in 16090 suggest that these are influx or concentration data – one would need to check the original publication. The others look more like percent data, but I need to check, as the values sum to more than 100 in each sample for all but 15059. I’m going to drop these data sets from the remainder of my analysis.

The other samples with non-integer values mostly have half integers. These could be from calculating percent from a count sum of 200, or even more enthusiastic counting of half pollen grains. The remaining values look like errors of one kind or another.

It should be relatively easy to flag data with unexpected non-integer values, but I’m going to ignore these for now and for simplicity round all fractional values up.

Expectation 2: Count sums should be reasonable

Many palynologists count three hundred or five hundred pollen grains per sample. I don’t think anyone ever counted twenty thousand grains per sample. It would just take too long.


The white lines at ~250 and ~850 are plotting artefacts

Very high counts might indicate that influx/accumulation rates have been entered instead of counts. Alternatively, some palynologists might be really enthusiastic, or, in low-diversity samples, the abundance of the dominant taxon might be estimated, which could lead to high counts without high effort.

The data report counts as high as 25241. I’m going to arbitrarily set 5000 as my threshold for concern. This flags 0.1% of the samples. These are some of the 36 data sets with count sums over 5000.

Data set  Proportion  Number  Minimum  Median  Maximum
4404      0.01        2       29       528     25241
488       0.23        9       572      2500    19062
4355      1.00        23      11626    11876   12072
16091     0.03        2       418      590     10838
3568      0.41        7       490      3766    10641
294       0.13        5       1095     2687    10481
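A per-data-set summary of this kind can be sketched as follows. The function name and the layout of the input (a list of samples, each a list of counts) are assumptions made for illustration; only the 5000 threshold and the column meanings come from the text.

```python
from statistics import median


def summarise_count_sums(samples, threshold=5000):
    """Summarise the count sums of one data set.

    `samples` is a list of samples, each a list of taxon counts.
    Returns the proportion and number of samples whose count sum
    exceeds `threshold`, plus the minimum, median and maximum sum.
    """
    sums = [sum(sample) for sample in samples]
    high = [s for s in sums if s > threshold]
    return {
        "proportion": len(high) / len(sums),
        "number": len(high),
        "minimum": min(sums),
        "median": median(sums),
        "maximum": max(sums),
    }
```

A data set is flagged when `number` is greater than zero. The ratio of maximum to median then helps to separate a pervasive problem (every sample high, as in 4355) from an isolated typo.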

Some of these are probably easy to explain. 4355 is either in the middle of the densest stand of Lycopodium since the Carboniferous, or the Lycopodium spike has been mis-labelled. Likewise, the Eucalyptus count suggests that 4095 (Hockham Mere) is a portal to the Antipodes.

Others appear to be typos. For example, data set 20643 reports 4080 Abies grains in the first sample: none of the other samples have more than 6 Abies grains. And I’m fairly sure that the two counts of 9999 for Corylus/Myrica in data set 16091 are not real. It might be possible to use taxonomic dissimilarities within the data set to identify odd samples, but as data sets can span the deglaciation, large taxonomic changes are expected anyway.


Expectation 3: Few samples without singletons

“rarity is the attribute of a vast number of species of all classes, in all countries.” Charles Darwin

One of the curious aspects of the chironomid counts from Lake Żabińskie is the lack of rare taxa in many of the samples. I suggested that it is likely that in any census of any species-rich assemblage, the rarest taxa will be represented by a single individual.

How well do the pollen data conform to this expectation? At 3%, the proportion of samples lacking singletons is higher than I had expected. The samples without singletons are not evenly distributed: 73% of datasets have no samples without singletons; 2% lack singletons in more than 50% of samples.

There is a strong relationship between the proportion of samples without singletons in a data set and the number of taxa in the data set.


About a fifth of samples in datasets where the number of taxa is 25 or fewer lack singletons. Conversely, only 2.1% of samples from datasets with more taxa lack singletons, and 1.2% of those from datasets with over 40 taxa.

I don’t think this is caused by low diversity, but is due to a large extent to the limited taxonomic resolution and scope of some of the pollen datasets. In at least the older data, it was common to focus on a limited number of common taxa and to ignore rare species. The lack of singletons is not a useful flag for such data sets.

The almost complete lack of singletons in some species rich data sets is curious and warrants a flag.
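The singleton check itself is simple. A minimal sketch, assuming counts have already been rounded to integers and that each sample is a list of counts (both function names are hypothetical):

```python
def lacks_singletons(sample):
    """True if the sample has taxa present but none with a count of exactly 1."""
    present = [count for count in sample if count > 0]
    return bool(present) and 1 not in present


def proportion_lacking_singletons(samples):
    """Proportion of samples in a data set that lack singletons."""
    flags = [lacks_singletons(sample) for sample in samples]
    return sum(flags) / len(flags)
```

Because the expectation only holds for taxon-rich assemblages, the proportion is best read alongside the number of taxa in the data set before anything is flagged.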

Expectation 4: Samples that lack singletons should not have greatest common divisor > 1

It was the many assemblages without singletons where the counts were integer multiples of the rarest taxon that first alerted me to the problems with the chironomid data from Lake Żabińskie. Such counts should be very rare, but will occur if the counts have been multiplied.

I want to flag samples without singletons where all/most of the counts are integer multiples of the rarest taxon.


Of the 3340 samples without singletons (and taxonomic richness > 5), 300 have all counts as integer multiples of the rarest taxon. In one sample, all counts are multiples of 43.

In one data set, 99% of values are multiples of 3, the minimum count of all samples.

I have no hesitation in suggesting that in both these examples the data are not the raw counts. Possible scenarios include the data being 1) accumulation rates or concentrations, 2) per mille, 3) back-transformed from percent after rounding, 4) the result of someone pulling a fast one.
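The multiples test can be written using the greatest common divisor of the non-zero counts in a sample. This is a sketch under the assumption that counts are integers; the function names are hypothetical, and `min_richness=6` encodes the "richness > 5" filter mentioned above.

```python
from functools import reduce
from math import gcd


def common_divisor(sample):
    """Greatest common divisor of the non-zero counts in a sample."""
    present = [count for count in sample if count > 0]
    return reduce(gcd, present, 0)


def is_multiple_of_rarest(sample, min_richness=6):
    """True if a sufficiently rich sample has no singletons and every
    count is an integer multiple of the rarest taxon's count."""
    present = [count for count in sample if count > 0]
    if len(present) < min_richness:
        return False
    rarest = min(present)
    return rarest > 1 and all(count % rarest == 0 for count in present)
```

Every count is a multiple of the rarest taxon exactly when the greatest common divisor equals the minimum count. Comparing the two therefore also picks out the weaker case where the counts share a divisor greater than 1 that is smaller than the minimum.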

Expectation 5. Zeros. Lots of them.

Community data usually have many zero values and few samples will contain all the taxa found in the whole data set (unless there are very few samples), especially if the richness is high.


Data set 4082 has 56 taxa and 71 samples but only 3% of the counts are zero. Flagged as curious.
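Both quantities are easy to compute from a samples-by-taxa matrix. A minimal sketch, assuming the data set is held as a list of equal-length rows (one row per sample, one column per taxon); the function names are hypothetical:

```python
def proportion_zeros(matrix):
    """Proportion of cells in a samples-by-taxa matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for count in row if count == 0)
    return zeros / total


def samples_with_all_taxa(matrix):
    """Number of samples in which every taxon has a non-zero count."""
    return sum(1 for row in matrix if all(count > 0 for count in row))
```

What counts as too few zeros is a judgement call that depends on the number of taxa and samples. The 3% reported for data set 4082 is an observation, not a threshold.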

Other testable expectations?

Suggestions for other tests that could reveal errors or other problems in putative count data would be very welcome. I’m hoping that, in collaboration with Simon Goring, some of these tests can be implemented in Neotoma and that the clearly erroneous data sets can be cleaned.

Posted in Data manipulation

Forthcoming quantitative palaeoecology PhD and Postdoc positions in Bergen

There are vacancies for a 3-year PhD position and a 3-year post-doctoral fellow position at the University of Bergen’s Department of Biology within the Ecological and Environmental Change Research Group as part of the European Research Council funded project Humans on Planet Earth – Long-term impacts on biosphere dynamics (HOPE).


These positions are now advertised: PhD; Postdoc.

About the HOPE project:

A critical question in Earth system science is what was the impact of prehistoric people on the biosphere? There is a wealth of information about human impact through clearance, agriculture, erosion, and modifying water and nutrient budgets. Humans have greatly changed the biosphere patterns on Earth in the last 8000–11,000 years, but have humans modified the major ecological processes (e.g. assembly rules, species interactions) that shape community assembly and dynamics? To answer this question, patterns in pollen-stratigraphical data for the last 11,500 years from over 2000 sites across the globe will be explored consistently using numerical techniques to detect quantitative changes in 25 ecosystem properties. Patterns in these properties will be compared statistically at sites within biomes, between biomes, within continents, and between continents to test the hypothesis that prehistoric human activities changed the basic processes of community assembly and that interrelationships between processes changed through time.

The PhD Position

Qualifications and personal qualities:

  • The applicant must hold a Master’s or an equivalent degree within quantitative palaeoecology, biogeography, or ecology, or related fields relevant to the PhD project.
  • The successful candidate should be highly motivated, enjoy the challenge of working with very large data-sets, and understand the relevance of the data and the results.
  • The successful candidate should be able to work independently and in a structured manner, and have the ability to cooperate with others within HOPE’s consortium as well as within the EECRG, and to follow through challenging ideas.
  • Proficiency in both written and oral English is essential.

Special requirements for the position:

The successful candidate should have experience in quantitative analyses of palaeoecological or ecological data using the statistical software R or related programs, as well as documented skills in one or more research fields relevant to the position (e.g. Quaternary palaeoecology, palaeoclimatology, applied statistics, numerical ecology, quantitative palaeoecology, biogeography, macroecology, community ecology, biodiversity), and some experience of using large databases.

Special responsibilities for the position:

The successful candidate will be primarily responsible for developing quantitative procedures for evaluating taxon co-occurrences and co-correlations from pollen-stratigraphical data expressed as ‘closed’ percentages, for applying these procedures to pollen data across the globe as part of the HOPE project, and for evaluating taxon co-occurrence analysis in palaeoecology.

About the PhD position:

The duration of this position is 3 years. As a PhD candidate the successful applicant must participate in an approved educational programme for a PhD degree within the three-year period.

The Postdoc Position

Qualifications and personal qualities:

  • The applicant must hold a PhD or an equivalent degree within quantitative palaeoecology, ecology, biogeography, or a related field.
  • The successful candidate should be highly motivated, enjoy the challenge of working with very large data-sets, and understand the relevance of the data and the results.
  • The successful candidate should be able to work independently and in a structured manner, and have the ability to cooperate with others within HOPE’s consortium as well as within the EECRG, and to follow through challenging ideas.
  • Proficiency in both written and oral English is essential.

Special requirements for the position include

  • The successful candidate must have experience in quantitative analyses of palaeoecological or ecological data using the statistical software R or related programming language and in using large databases.
  • The candidate will be able to document skills in one or more research fields relevant to the position such as Quaternary palaeoecology, palaeoclimatology, applied statistics, numerical ecology, quantitative palaeoecology, biogeography, macroecology, community ecology, and biodiversity.
  • The candidate will play a major role in the publication of HOPE results.

Special responsibilities for the position:

The successful candidate will be responsible for developing the HOPE database of pollen-stratigraphical data and associated chronological palaeoenvironmental and site data within the framework of state-of-the-art palaeoinformatics, for the numerical and statistical analyses of many large pollen-stratigraphical data-sets, and for developing appropriate software for particular palaeoecological and diversity analyses.

General Information

Closing date: 15 September 2017.

Detailed information about the position and about how to apply can be obtained by contacting: Professor John Birks, Department of Biology, University of Bergen (+47 5558 3350 or +47 5593 7717 / email:

Posted in Uncategorized

There is a memory in the dirt at the bottom of the sea

There is a memory in the dirt at the bottom of the sea. It is in the number of different sorts of small dead animals which we can use to find out how warm or cold the sea was in the past. There are different approaches for doing this. All the approaches have problems if things are more like things near to them in space than expected. If this happens, our guess of how warm the sea was can appear to be much better than it really is. Some often-used approaches are much worse than others when this happens. This problem is hard to wipe out but ignoring it is not a good idea.

Written after #TenHundredWordsOfScience with a Simple Writer.


Posted in transfer function, Uncategorized

Abstract abstracts

So I am searching through the Web of Science for papers reporting 11 yr (Schwabe) cycles in palaeoproxy data (especially tree-rings) when I find this title, which looks promising:


The abstract in WoS is, well, abstract

The pilgrim fathers of the scientific imagination as it exists today are the great tragedians of ancient Athens, Aeschylus, Sophocles, Euripides. Their vision of fate, remorseless and indifferent, urging a tragic incident to its inevitable issue, is the vision possessed by science. Fate in Greek Tragedy becomes the order of nature in modern thought. …the essence of dramatic tragedy is not unhappiness. It resides in the solemnity of the remorseless working of things. This inevitableness of destiny can only be illustrated in terms of human life by incidents which in fact involve unhappiness. For it is only by them that the futility of escape can be made evident in the drama. This remorseless inevitableness is what pervades scientific thought. The laws of physics are the decrees of fate.

But when I go to check the paper, I find a much more conventional abstract

Power spectrum analysis of 81 long and 202 short Chinese dryness-wetness indices yields evidence for two peaks with periods near 18.6 and 10.5 years, both of which are statistically significant at confidence levels of 99.9 per cent. They are identified as induced by the 18.6-year luni-solar, Mn, constituent tide and a 10–11-year solar cycle, Sc, variation in the Sun’s luminosity of the order of 0.1 per cent. Amplitude and phase of Mn wavetrains are highly non-stationary with respect to both time and geography; in particular, abrupt 180° phase changes in wave polarity are often observed. Amplitude and phase of Sc waves are also highly non-stationary, with those in northern China out of phase with waves in the south since 1895 (they were in phase from 1815 to 1845). For the 202 short records variance contribution of the two signals to total variance in raw data varied from 6 per cent to 53 per cent, with a mean of 22 per cent, again demonstrating their extreme non-stationarity. Construction of a dry and very dry drought index (DVDI) shows that since 1470 by far the most prolonged, continuous, and serious drought (due to constructive interference and concomitant high amplitudes of the two waves) occurred from 1633 to 1643; the Ming Dynasty collapsed in 1644 and, in agreement with Hameed and Gong (1990), it is concluded that this climatic disaster was a causal factor in the fall of the Ming Empire.

The abstract in WoS is actually a long quote from Alfred North Whitehead.

So far I have about 100 candidate papers (sometimes it is not clear from the abstract if the paper is reporting Schwabe cycles or some longer term solar variability).

Posted in Peer reviewed literature, solar variability, Uncategorized

Dinocysts, transfer functions and spatial autocorrelation: part 1207

I don’t always comment on papers that use transfer functions but neglect to consider how spatial autocorrelation in the modern calibration set might make the reconstructions spuriously precise. It gets tedious, especially when the same authors make the same mistakes time and again. But sometimes I am asked to review such papers, and I oblige.

One such paper was Wary et al (2017), published yesterday. The paper suggests that when Greenland and the North Atlantic cool during Dansgaard–Oeschger oscillations, the surface of the Norwegian Sea warms, and vice versa. The warmth in the Norwegian Sea is reconstructed from the cysts of dinoflagellates, which live near the surface. Cold subsurface conditions in the Norwegian Sea are reconstructed from planktic foraminifera.

Since Wary et al was published in Climate of the Past, the complete peer review process – reviews, editor’s comments and author replies – is publicly available. This now includes the second round of reviews/editor comments which were previously hidden.

Lacking the expertise to critique the physical plausibility of this regional see-saw, my review focused on the dinocyst reconstructions. The paper reconstructs summer and winter sea surface temperatures and salinities, together with sea-ice duration. I really doubt that all five variables can be reconstructed independently, especially since most dinoflagellates overwinter in cysts on the sea floor.

I criticised the paper for reporting model performance statistics from a cross-validation scheme (either leave-one-out or k-fold cross-validation – the paper is not clear) that ignores the considerable spatial autocorrelation in the calibration set, and suggested that the true uncertainty was severely underestimated. I also criticised the lack of reconstruction diagnostics to help the reader evaluate the reconstructions.

The editor agreed these were important concerns. So how did the authors respond?

The authors added – as they had promised – a plot showing the taxonomic distance of each fossil sample to the nearest analogue in the modern calibration set. They claim this plot will “ensure that one can assess by his own the reliability and robustness of our reconstructions”.


Wary et al (2017) Figure S5. Distance to the nearest analogue in the four studied cores

Well, good luck with that. Usually, plots of the distance to the nearest analogue show some reference levels (often the 5th and 10th percentile of all distances in the calibration set) that the distances can be compared with. Wary et al do not, so there is no way to know if the distances are high or low (a problem exacerbated by the absence of information on which distance metric was used, hampering replication). This figure is almost useless.
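To make the point about reference levels concrete, here is a minimal Python sketch of the diagnostic. The squared chord distance is an assumption (the paper does not say which metric was used), and the function names are hypothetical. Assemblages are given as lists of proportions.

```python
from math import sqrt


def squared_chord(p, q):
    """Squared chord distance between two assemblages of proportions."""
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))


def nearest_analogue_distance(fossil, calibration):
    """Distance from a fossil sample to its closest modern analogue."""
    return min(squared_chord(fossil, modern) for modern in calibration)


def percentile(values, pct):
    """Percentile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def reference_levels(calibration, pcts=(5, 10)):
    """Percentiles of all pairwise distances within the calibration set."""
    n = len(calibration)
    dists = [squared_chord(calibration[i], calibration[j])
             for i in range(n) for j in range(i + 1, n)]
    return {p: percentile(dists, p) for p in pcts}
```

Fossil samples whose nearest-analogue distance exceeds the chosen reference level would be marked as lacking good modern analogues. Drawing those levels as horizontal lines on the plot is what lets a reader judge the distances.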


Wary et al rely on Guiot and de Vernal (2011a, b) (a paper and their response to a comment) in support of their assertion that

parallel studies equally based on cross-validation schemes showed that this spatial autocorrelation has in fact relatively low impact on the calculation of the error of prediction of the MAT transfer function applied to dinocyst assemblages.

Unfortunately, Guiot and de Vernal (2011a, b) is a strong contender for one of the worst papers ever published in Quaternary Science Reviews, managing to simultaneously demonstrate and deny that autocorrelation is a problem, and use an irrelevant test to prove nothing. It is absolutely not evidence that autocorrelation is not a serious problem for transfer functions.

The authors also cite de Vernal et al (2013a) and de Vernal et al (2013b) as further evidence that autocorrelation is not a problem for dinocyst transfer functions. However, neither paper even attempts to test if autocorrelation leads to an overestimation of model performance. Both papers use k-fold cross-validation. This is only minutely less sensitive to autocorrelation than leave-one-out cross-validation: it is a solution to autocorrelation to the same extent that a sieve makes a good boat.

The authors graciously cite several of my papers which demonstrate that autocorrelation is a problem and suggest means to identify it and deal with it. However, I would much rather that instead of contributing towards increasing my h-index, the authors had engaged with the h-block cross-validation scheme I proposed. In conclusion, Wary et al is yet another wasted opportunity to determine the true utility of dinocyst-based transfer functions.
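The idea behind h-block cross-validation is to leave out not only the test sample but also every calibration sample within a distance h of it, so that spatially autocorrelated neighbours cannot prop up the prediction. The sketch below illustrates only the fold construction and is not the published implementation: the function names are hypothetical, sites are assumed to be (latitude, longitude) pairs in degrees, and the choice of h is left to the user.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def h_block_folds(coords, h_km):
    """Yield (test_index, training_indices) for h-block cross-validation.

    For each test site, the training set omits the site itself and
    every other site within `h_km` of it.
    """
    for i, site in enumerate(coords):
        train = [j for j, other in enumerate(coords)
                 if j != i and haversine_km(site, other) > h_km]
        yield i, train
```

Setting `h_km` to zero gives ordinary leave-one-out cross-validation for sites at distinct locations, so the two schemes can be compared directly by running the same transfer function over both sets of folds.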


Posted in Uncategorized

Critical perspectives in climate reconstructions from pollen

This week I am at Caux, high above Lac Léman, Switzerland, attending a Pollen Climate Model Intercomparison Project workshop.

The invited talk I gave this morning was on Critical_Issues_in_pollen_climate_reconstructions. (I’ve made the ioslides presentation into a pdf – it looked much prettier before).

Posted in Uncategorized

“Fossil Insect Study Suggests That Los Angeles Climate Has Been Relatively Stable for at Least 50,000 Years”

So sayeth the press release. But what about the paper, and the 182 beetles sampled from La Brea tar pits?

Fossil preservation in the tar pits is exceptional, but the constant stream of gas through the tar deposits mixes the fossils – there is no stratigraphy in the tar. Holden et al have to radiocarbon date each and every beetle they analyse. Naturally, this limits the number of beetles they can analyse, both because of the financial cost and the destruction of the beetles. It also limits the number of species that can be analysed (to seven), as only species with abundant fossils can be vaporised (we are not told what other species are found – I would have liked this information).

Contrast the situation with the tar pits with a more typical site for beetle analysis, say a section through a peat bog, where large volumes of sediment in stratigraphic order can be collected, with dozens or more beetle fossils from many species in each sample, and only a few dates needed to constrain the chronology. La Brea is a challenging site.

These are dates of the seven beetle species at La Brea.


Holden et al figure 2. Median calibrated age of each beetle with 2-sigma ranges. Cases where only the error bar is shown are “greater than” ages where the results were very close to background and only a lower limit could be specified. The low quality of the image is because Elsevier are hopeless.

The first thing to note is that no beetles were dated to the last glacial maximum. Holden et al ascribe this to either the lack of insect collections from the pits containing LGM mammal fossils, or a cooler LGM climate making the tar less sticky so fewer insects were caught. If the former explanation is correct, the press release claim that “Los Angeles Climate has been relatively stable for at least 50,000 years” cannot be substantiated as there is no evidence from a critical interval. If the latter explanation is correct, the press release is refuted. A third explanation not considered by Holden et al is that the climate changed such that the seven beetles they use were not present at La Brea. Whatever the reason for the lack of LGM beetles, the headline of the press release is wrong.

There is also a beetle gap in the early Holocene thermal maximum. Again, the lack of evidence precludes a conclusion that the climate was stable.

It is always going to be difficult to reconstruct climate from just seven species of beetles, selected in part because they were common. So it is not greatly surprising that the reconstructed climate, for the intervals where there are data, is similar to modern. Holden et al report “mean summer temperatures within ±5 °C of today’s conditions”, a range wide enough that only the LGM (for which there are no data) could reasonably be expected to exceed it.

The method for reconstructing climate is unclear. The paper appears to assume that the assemblage composition is constant through time and hence that the climatic conditions must have remained similar to modern. I would have liked to see a figure showing the assemblage in each time window. Something like this.


Number of beetles by time interval. ?Modern beetles have an age of < 200 cal BP, and so might represent accidental modern contamination of the tar pit.

There are distinct shifts in the species composition through time that Holden et al do not explore. I don’t know if these shifts have possible climatic interpretations. Although Holden et al collate modern records for their species, they don’t present the results in an easy-to-interpret way (they present violin plots, each species in a separate file, and scatter plots).

While the species distribution modelling could certainly have been done better, and the method for climate reconstruction made explicit, it is always going to be difficult to work on the stratigraphically-mixed deposits at La Brea, and without a huge amount of money for dating, and a willingness to atomise any beetles for dates, it will be impossible to get the data and reconstruction that could be expected from a more typical site. But that is no excuse for a press release that is unsupported by the paper.





Posted in climate, Peer reviewed literature