Merging taxa in assemblage data

One possible reason for the impossible percent values I’ve found in assemblages data is that taxa have been merged in Excel after percent were calculated. Doing anything in Excel is to invite disaster, if nothing else, it is very difficult to check what has been done.

Merging and renaming taxa is an almost inevitable step in the workflow for processing community and assemblages data. We need a reproducible method: here I show how can it be done with R.

I’m going to assume that the assemblage data are in wide format (one column per taxa) and that there are meta data (depths, ages etc) in one or more columns. If the meta data are in the rownames (which is very convenient for the ‘rioja’ and ‘vegan’ packages, less so for ‘dplyr’ as tibbles don’t have rownames), they can be moved into a column with rownames_to_column.

Here is a small artificial assemblage dataset.

spp <- data_frame(
depth_cm = 1:3,
sp_A = rpois(3, 5),
sp_b = rpois(3, 5),
sp.C = rpois(3, 5),
sp_D = rpois(3, 5))
spp_save <- spp # keep copy for later
## # A tibble: 3 x 5
## depth_cm sp_A sp_b sp.C sp_D
## 1 1 4 8 9 2
## 2 2 4 3 6 3
## 3 3 5 8 6 3

If we just want to rename a couple of taxa, the simplest solution is to use rename, where we set new_name = old_name. rename can take pairs of new and old names, separated by commas.

spp %>% rename(sp_B = sp_b, sp_C = sp.C)
## # A tibble: 3 x 5
## depth_cm sp_A sp_B sp_C sp_D
## 1 1 4 8 9 2
## 2 2 4 3 6 3
## 3 3 5 8 6 3

If there are many names that need altering, or we need to make the same changes to multiple data.frames, we need a different solution as rename gets tedious.

I like to make a data.frame of the old and new names and then use plyr::mapvalues to change the old into the new names. (plyr is a useful package but has several conflicts with dplyr so it is safer to use the :: notation than loading it).

changes <- read.csv(stringsAsFactors = FALSE, text =
"old, new
sp_b, sp_B
sp.C, sp_C", strip.white = TRUE)#this can go in an separate file

names(spp) <- plyr::mapvalues(names(spp), from = changes$old, to = changes$new)
## # A tibble: 3 x 5
## depth_cm sp_A sp_B sp_C sp_D
## 1 1 4 8 9 2
## 2 2 4 3 6 3
## 3 3 5 8 6 3

Merging taxa is possible in the wide format, but much easier in a thin format. We can convert from a wide format to a thin format with gather, and back with spread.

spp <- spp_save#original version

spp_thin <- spp %>% gather(key = taxa, value = count, -depth_cm)#don't gather depth_cm
## # A tibble: 12 x 3
## depth_cm taxa count
## 1 1 sp_A 4
## 2 2 sp_A 4
## 3 3 sp_A 5
## 4 1 sp_b 8
## 5 2 sp_b 3
## 6 3 sp_b 8
## 7 1 sp.C 9
## 8 2 sp.C 6
## 9 3 sp.C 6
## 10 1 sp_D 2
## 11 2 sp_D 3
## 12 3 sp_D 3

If there are just a few taxa that need merging, we can use recode within mutate followed by summarise. Note that in contrast with rename, recode expects “old_name” = “new_name”

spp_thin %>%
mutate(taxa = recode(taxa, "sp.C" = "sp_D")) %>%
group_by(depth_cm, taxa) %>%
summarise(count = sum(count)) %>%
spread(key = taxa, value = count)
## # A tibble: 3 x 4
## # Groups: depth_cm [3]
## depth_cm sp_A sp_b sp_D
## *
## 1 1 4 8 11
## 2 2 4 3 9
## 3 3 5 8 9

If there are many taxa that need merging (or some that need merging and some renaming) we can use mapvalues again.

changes <- read.csv(stringsAsFactors = FALSE, text =
"old, new
sp_b, sp_B
sp.C, sp_D", strip.white = TRUE)#this can go in an separate file

spp_thin %>%
mutate(taxa = plyr::mapvalues(taxa, from = changes$old, to = changes$new)) %>%
group_by(depth_cm, taxa) %>%
summarise(count = sum(count)) %>%
spread(key = taxa, value = count)
## # A tibble: 3 x 4
## # Groups: depth_cm [3]
## depth_cm sp_A sp_B sp_D
## *
## 1 1 4 8 11
## 2 2 4 3 9
## 3 3 5 8 9

This can also be done with a left_join.

spp2 <- spp_thin %>%
left_join(changes, by = c("taxa" = "old")) %>%
mutate(taxa = coalesce(new, taxa)) %>% #takes original name if no new one.
select(-new) %>%
group_by(depth_cm, taxa) %>%
summarise(count = sum(count)) %>%
spread(key = taxa, value = count)
## # A tibble: 3 x 4
## # Groups: depth_cm [3]
## depth_cm sp_A sp_B sp_D
## *
## 1 1 4 8 11
## 2 2 4 3 9
## 3 3 5 8 9

Now the data are ready for further analysis – remember some functions will want you to remove the meta_data first. For example

cca(select(spp2, -depth_cm))
Posted in Data manipulation, R | Leave a comment

A mean wind blows over Lake Żabińskie

I have largely neglected the chrysophyte-inferred reconstructions of winter severity and summer calcium concentrations/zonal wind speed from Lake Żabińskie even though they fall within the scope of my review of sub-decadal resolution reconstructions. This is not because I think this pair of reconstructions are jointly credible, but because no data have been archived despite several requests to the authors (and no useful calibration set data can be scraped from the publications).

Having shown that the correlation between instrumental temperature records and the pollen reconstruction from Mauntschas is no better than that expected because of the autocorrelation and multiple testing, I wondered if the same could be true for the calcium/wind reconstruction.

Hernández-Almeida et al (2015) report that a weighted averaging transfer function using their 50-lake calibration set has a bootstrap R2 of 0.68 and a RMSEP of 0.143 (log10 units). This is reasonably good, but the distribution of sites along the gradient is very uneven which will bias the RMSEP.


Figure 4. Plots of the a observed versus predicted log10 Ca2+ and b observed versus residual log10 Ca2+ based on WA regression with classical deshrinking. The black dot represents Lake Żabińskie.

I’m not totally convinced that the relationship between the estimated and observed calcium concentration in figure 4 of the paper is cross-validated, which then makes me wonder if the performance statistics are cross-validated or not. With the data this would be trivial to confirm. Without — doubts multiply.

The authors smooth the reconstruction from Lake Żabińskie with a three-year triangular filter and compare it to different combinations of months and allow for a lagged response using a procedure developed earlier which generated 144 climate time series.

I wanted to test if this multiple testing on autocorrelated data could generate the reported correlation of 0.5 between the reconstruction and May-October wind.

Even though the assemblage data are not yet available, I can test several ideas with just the wind data. The wind data come from the Twentieth Century Reanalysis Project. Other than reporting that the 1000 hPa level is used, no information is given about how the data were processed. I’ve assumed that the closest grid point to Lake Żabińskie has been used (22°E, 54°N) and downloaded the data and calculated the monthly means.

I had assumed that getting replicating the wind speed figure would be the easy part. Alas no. My wind speed data look nothing like the published curve. Then I realise that although the text discusses wind speed, it appears that the authors are actually using wind velocity, specifically the zonal (west-east) velocity. This would account for the otherwise impossible negative “speeds”. However, my mean velocity curve also looks nothing like the published curve, but at least it has a similar mean and variance.


May-October mean velocity with 3-yr triangular filter.



Figure 5. Bootstrapped estimates of log10 Ca2+ reconstructed from the sedimentary chrysophyte cyst assemblages at Lake Żabińskie based on the WA prediction model for the period AD 1898–2010. Log10 Ca2+ was highest correlated with the mean May–October zonal winds speed (m/s; 1,000 hPa) (grey line).

Using mean velocity rather than mean speed then the paper presents some huge problems. A May-October mean velocity of 0 m/s could indicate flat calm conditions all summer or that easterly gales blew for half the summer and westerly gales blew for the remainder. The impact of these two scenarios on lake mixing are quite different.

Even if, as the paper suggests, westerly winds have much more impact on the lake than winds from other directions because of the catchment’s morphology,  it is unphysical to expect mean wind velocity to be a useful predictor. As it stands, the result suggests that westerly winds mix the lake and easterly winds unmix it. Mean wind-speed, weighted by the lake’s exposure in the direction of the wind, would be a much more physically relevant predictor.

The paper appears to be a fishing trip, searching for the best of many predictors regardless of their physical plausibility. This is not a recipe for reproducible research. Had mean velocity from some combination of months not given a good performance, would the authors realised that mean velocity was a daft idea and repeated the analysis with wind speed? And if that too failed, would they have used mean squared wind speed, to better correspond with the wind’s mixing potential?

Posted in Peer reviewed literature, transfer function | Tagged , | Leave a comment

Pollen from the garden of forking paths

Most transfer functions for reconstructing past environmental changes are based on a calibration-in-space approach, with a modern calibration set of paired microfossil assemblages and environmental data. The alternative approach is calibration-in-time, with well-dated fossil assemblages and contemporaneous environmental data.

I’ve previously shown that the three chironomid calibration-in-time transfer functions all misreport the performance statistics. All report the apparent performance as if it was the cross-validated performance. But what of the single calibration-in-time model developed for pollen assemblages that I am aware of (please let me know of other calibration-in-time models for microfossil assemblages)?

Kamenik et al (2008) report a calibration-in-time model for the pollen stratigraphy from the mire Mauntschas in the SW Swiss Alps, dated with 29 14C dates and other chronological information. The paper reports that the r2 between predicted and modelled April-November temperature is 0.44. In ideal circumstances this would represent a very respectable performance. However, I have a couple of concerns about the analysis in this paper.

Firstly, the pollen and climate data are smoothed with a three-year triangular filter prior to any analysis. This will induce temporal autocorrelation into the data and thereby violate the assumptions of the statistical methods used.

Secondly, a large number of choices are made on the route to the selected model. Choices include:

  • The climate variable reconstructed (mean temperature or precipitation over 1-12 months with a lag of 1-11 months – a total of 288 responses). May to August air temperature is the best predictor in an RDA, but April-November air temperature is reconstructed because it reduced the transfer function error.
  • The predictor variables (6 out of 11 pollen taxa are used in the final model)
  • The transformation of the pollen data (accumulation rate or percent, detrended or not)
  • The statistical model (ordinary least squares regressions (OLSR), time series regressions (TSR), ridge regression (RidgeR), principal components regressions (PCR) and partial least squares regressions (PLSR))

The impact of these choices – forks in Andrew Gelman’s garden of forking paths – will be to inflate the estimate of model performance.

It is possible to explore the impact of some of these choices by simulation which can help gauge how impressed we should be with an r2 of 0.44. To simplify the simulation, I’m only going to investigate the importance of the induced autocorrelation and the selection of pollen taxa.

I simulated 49 years of climate data and pollen data with a Gaussian distribution 10,000 times for each case below (code on github).

The upper panel in the figure below shows the distribution of leave-one-out cross-validation r2 of OLS models fitted to six simulated pollen spectra (data from two years are removed to make the data set comparable with the smoothed data in the next step). The reported r2 is far above the 95th percentile of the distribution.

The middle panel shows the distribution of r2 when the six pollen spectra and the climate variable are smoothed with a three-year triangular filter, as used by the authors. The 95th percentile of the distribution has moved towards the reported r2.

In the lower panel, I show the distribution of the r2 of the OSL when the best subset (1–11 variables) of smoothed pollen spectra is chosen by BIC. The 95th percentile of the distribution has moved beyond the reported r2. No claim to statistical significance can be supported.


Distribution of r2 values from 10000 simulations with different treatments. The red line shows the reported r2.

Allowing for model selection, choice of data transformations, and selection of the climate variable, reconstruction would move the simulated r2 even further to the right.

None of the choices made by the authors are bad. All can be defended, except perhaps the inclusion of autumn temperatures in their climate mean when most plants have flowered months earlier. The problem is that the authors have sought the path which yields the best performance. Pollen data from an another core from the same mire would be unlikely to give the same performance with the selected model.

As with the three chironomid calibration-in-time models, the Mauntschas Mire model cannot be relied upon. A critical difference though is that the performance of the chironomid models is misrepresented.

Posted in Peer reviewed literature, transfer function | Tagged , , | 1 Comment

The third chironomid calibration-in-time reconstruction

Most reconstructions of past environmental conditions derived from chironomid assemblages with transfer functions use a calibration-in-space approach, in which the calibration is a set of modern chironomid assemblages paired with modern environmental data. I am aware of three chironomid calibration-in-time reconstruction. Here the calibration set is well dates chironomid assemblages paired with historic environmental data.

I have already shown that the calibration-in-time reconstructions from Seebergsee and Silvaplana report the apparent performance rather than the cross-validated performance they claim to report, and that the actual skill is near zero. But what about the third reconstruction from Nurmijärvi.

Nurmijärvi is a varved lake in southern Finland from which the authors counted the chironomids at 1-11 year resolution over the instrumental period which began in the 1830s and reconstruct July air temperatures.

The paper reports that the calibration-in-time transfer function has an r2jack of 0.64 (Table 1). This is a fairly good performance given the limited temperature range in the calibration set. The paper also reports the correlation between the reconstructed and instrumental temperature as 0.51 (Table 2, Figure 4), which is equivalent to an r2 of 0.26. With a calibration-in-time model, the model r2 should be identical to the r2 between the reconstruction and the environmental data. It is not clear why they are not identical in this case.

Since the authors have not replied to emails enquiring about this discrepancy, nor to requests for the raw data, I digitised the stratigraphic diagram and the temperature data. There are of course inaccuracies in the digitised data, and I have only the 38 common taxa (the stratigraphic diagram omits 21 rare taxa).

I find that the model the authors use, a tolerance-weighted weighted-averaging model with inverse deshrinking, has an r2 of 0.57. Not too far off the 0.64 reported. But this is the apparent statistic – the leave-one-out cross-validated r2 is only 0.13.


Predicted vs measured July temperatures for the Nurmijärvi calibration-in-time transfer function

A model with an r2 of 0.13 has very limited utility, and this is before we consider the impact of autocorrelation etc.

It appears to me that the authors are, like the authors of the Silvaplana and Seebergsee papers, reporting the apparent rather than the cross-validation performance. If the authors believe that imperfections in the digitised data are responsible for the lack of reproducibility, they only need to send me the data and I will update this post.


Posted in Peer reviewed literature, transfer function | Tagged , , | Leave a comment

The tangled web of Silvaplana papers

I’ve not written much about the Lake Silvaplana papers although they are as questionable as the Seebergsee and Zabinskie papers.

I have now digitised the 150-year stratigraphy, my request to the authors for the data having been ignored. Together with the archived 420-year stratigraphy I can now answer questions that I couldn’t earlier.

First question, how large were the count sums? The 150-year paper reports that

The total number of head capsules is highly variable; between 15 and 125 head capsules were counted in samples of ca. 30 mg dry weight, the average being 30 head capsules per sample.

JoPL-06 (copy)

The 150-year stratigraphy.

This is easily demonstrated to be false. The stratigraphic diagram shows the count sums in the third curve from the right. Over 20 samples (of 94) have a count sum to the left of the tick mark at 12.5 head capsules. Some of the samples appear to have counts as low as three (samples 33 & 34 are both ~66% Cricotopus and ~33% one other species).

The 420-year paper makes even stronger claims:

Before sample no. 170 [~1775 CE], the minimum head capsule numbers was 50.

Perhaps the authors counted some more chironomids after publishing the first paper, but the stratigraphic diagram shows that most of the samples above sample 170 have counts smaller than fifty: some appear to be below fifteen.


The 420-year stratigraphy counts


All the papers report that samples with fewer than thirty head capsules were merged before the transfer function was applied. This presumably explains why the reconstruction in the 150-year paper has 69 values rather than one for each of the 94 levels in the stratigraphy. I assume the merged samples form (more or less) the upper 69 samples in the archived assemblage data from the 420-year paper (the oldest assemblage in the 150-year stratigraphy is identical to the 69th in the 420-year paper, which represents 1850).

Even after merging, the samples appear to have some counts below 30. The youngest sample in the 420-year dataset has eight species all with exactly 10% and one with 20% of the assemblage. It is more likely that this assemblage has a count sum of ten than a count sum of at least thirty. The youngest sample in the 150-year stratigraphy has the same composition within the error of digitisation. The reported count sum for this sample is about ten.

Reliable reconstructions cannot be expected from such small samples.

I cannot reproduce the calibration-in-space reconstructions derived from the 150-year or the 420-year stratigraphy as the calibration set has not been archived. Given that the chronological uncertainty is estimated at 15%, it is surprising that the reported correlation between the reconstruction and the instrumental record is so good.

I can, however, attempt to reproduce the calibration-in-time reconstruction shown in a subsequent paper. This reconstruction has 82 levels. I don’t know how the 94 levels were merged to make 82, so I’m going to use the 64 samples from the 420-year stratigraphy which fall into the instrumental period which starts in 1864 (five fall between 1864 and 1850). I also don’t know exactly which years are supposed to be represented by each sample, so I am just going to use the reported year with data from meteoSwiss.

I’m not expecting the result to be exactly the same as the authors, but it should be similar. My 2-component WAPLS model on square-root transformed assemblage data has an apparent R2 between the measured and estimated temperature of 0.45. The leave-one-out cross-validated reconstruction has an R2 of 0.01. This is somewhat short of the reported R2 of 0.51 for the whole record (0.71 for the first 90 years). It would appear, as at Seebergsee, that the authors have not cross-validated their model. In both Silvaplana and Seebergsee, the cross-validated model has no skill, and the published millennium-long reconstruction cannot be relied upon. If my analysis is supported by a re-analysis of the full data used in the calibration-in-time paper, then this paper should be retracted.


Posted in Peer reviewed literature | Tagged , , , , | 1 Comment

Extracting data from a PDF image

Some scientists archive their data. Some scientists email their data on request. Some editors cajole authors into releasing data to interested parties. And sometimes none of these approaches yields data.

What then? One option is to request data via the scientist’s head-of-department, another is to scrape the data out of published diagrams. Much has been written about how to extract data from raster formats, including stratigraphic diagrams, I only found a little information on how to extract data from vector formats. This is potentially a much more powerful option.

This is how I did it. First I downloaded Luoto and Ojala (2017) and checked that the stratigraphic plot was a vector format by greatly increasing the magnification. This would make the image appear pixelated if it was a raster.


The stratigraphic diagram from Luoto and Ojala (2017)

The pdf needs to be converted into a format that can be read into a text editor. This can be done with qpdf in the terminal. First I extract the page with the stratigraphic diagram to make further processing easier.

qpdf 0959683616675940.pdf  --pages 0959683616675940.pdf 4 -- outfile.pdf

qpdf outfile.pdf --stream-data=uncompress  outfile2.pdf

Now outfile2.pdf can be read into R

page4 = readLines("luoto/data/outfile2.pdf")

#Find start and end of image
start = grep("/PlacedGraphic", page4)[2]
end = grep("[(Figur)20(e 2.)]TJ", page4, fixed = TRUE)
page4 = page4[start:end]

fig2 = page4[grepl("re$", page4)] %>%
read.table(text = .) %>%
set_names(c("x", "y", "width", "height", "re"))

The bars are, within rounding error, the same height. Various other rectangles in the figure are different heights, so I need to filter the data I want.

fig2 = fig2 %>%
  filter(between(height, -2.4, -2.3))

Now I can plot the data.

fig2 %>%
  ggplot(aes(x = x + width/2, y = y, width = width, height = height, fill = factor(x, levels = sample(unique(x))))) +
  geom_tile(show.legend = FALSE)



The scraped data from the stratigraphic plot

The next step would be to assign taxon names to each unique value of x and scale the widths so they are in percent. When that is done, and the weather data digitised, I can test how well I can reproduce the calibration-in-time transfer function model. The calibration-in-space model will need to wait for data from several other papers to be scraped.

Posted in Peer reviewed literature, R | Tagged , , , | 2 Comments

Bergen: a year with some sunshine

May was glorious.  December less so.


The data are from the Geofysisk Institutt in Bergen. Here is the code I used

florida = read.csv("../climate_data//Florida_2017-06-21_2018-06-22_1529649176.csv", stringsAsFactors = FALSE) %>% as.tibble()
florida = florida %>%
mutate(date_time = ymd_hm(paste(Dato, Tid)),
date = ymd(Dato),
Tid = ymd_hm(paste("2000-01-01", Tid)),
Globalstraling = if_else(Globalstraling == 9999.99, NA_real_, Globalstraling)) %>%
complete(date, Tid) %>%#
arrange(date, Tid) %>%
mutate(Globalstraling = coalesce(Globalstraling, lag(Globalstraling, 24 * 6)))#fill gaps with previous day

ggplot(florida, aes(x = Tid, y = date, fill = Globalstraling)) +
geom_raster() +
scale_fill_gradient(low = "black", high = "yellow") +
scale_x_datetime(date_labels = "%H", expand = c(0, 0)) +
scale_y_date(date_labels = "%b", expand = c(0, 0)) +
labs(x = "Time", y = "Month", fill = expression(atop(Sunshine, Wm^-1)))

Posted in climate, R, Uncategorized | Tagged | Leave a comment