english::as.english

It is conventional to write small integers as words rather than figures in text, especially if they are at the start of a sentence. This caused me some grief with rmarkdown, which I have started using for presentations, papers and blog posts: my code gave me figures; I wanted words. Then I found the english package.

library("english")
as.english(42)
## [1] forty two
as.english(-1)
## [1] minus one

We need to be careful if we use this inline with

The Ultimate Answer to Life, The Universe and
Everything is...`r as.english(42)`

as this will be converted back to a figure.

The Ultimate Answer to Life, The Universe and Everything is…42.

Wrapping the command in as.character will fix this

`r as.character(as.english(2))` to the power of
`r as.character(as.english(276709))` to
`r as.character(as.english(1))` against

two to the power of two hundred and seventy six thousand seven hundred and nine to one against

It would be a cunning plan to make a function to run as.character(as.english(x)) if it is needed repeatedly.
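Something like this minimal helper would do (it is essentially the char.english() defined further down):

words <- function(x) as.character(as.english(x))
words(42)
## [1] "forty two"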

Of course, sometimes we will want to start a sentence with a number and need the first letter capitalising. This can be done with regular expressions (otherwise known as magic).

`r gsub("(^)(.)", "\\1\\U\\2", as.english(42), perl = TRUE)`

Forty two

As was hinted above by the conversion back to figures, objects made by as.english retain their numeric values, and so can be used in calculations

char.english <- function(x) as.character(as.english(x))
num <- lapply(0:9, as.english)
names(num) <- sapply(0:9, char.english) # set names
attach(num) # make available

six * nine
## [1] fifty four

The importance of version control

Perhaps the most contentious scientific argument of the early twenty-first century concerns whether “data” is plural or singular. To accidentally say “data is” at a prestigious conference is to lob a metaphorical lettuce into the traditionalist audience. Submit a manuscript with “data are” and expect scathing reviews impugning your pet goat’s morels1 from an avant-garde grammarian. Fortunately, there is a consensus on one topic (no mere 97% here): the chironomid data from Lake Żabińskie are plural.

Version One – January 2015

The published version of the chironomid stratigraphy. It has not been archived, is now acknowledged by the corrigendum to be incorrect, and none of the archived files match it.

Version Two – January 2015

The version of the data archived at http://www.chirosindex.com/ in January differed substantially from the published stratigraphy and contained patterns best described as implausible.

Version Three – June 2015

After enquiring about the discrepancies between the archived data and the published stratigraphy, I was emailed a C2 file with a new version of the data. It has five fewer taxa than V2. These are mostly rare, except for Parochlus, which occurred eight times in V2 (four times if the unexpected duplicate assemblages are ignored). Fifty-three of eighty-four assemblages are identical. Some other assemblages have large differences, for example in 1931, where most of the Cricotopus is replaced by Nanocladius branchio.

Version Four – July 2015

This version was uploaded onto http://www.chirosindex.com/ in mid-July. It has the same taxa as V3 but gained samples for 1970, 1969, 1966, 1955 and 1941 and has a sample for 1927 instead of 1925. Only four samples have identical assemblages in both versions: 1986 has complete taxonomic turnover!

Version Five – August 2015

V5 was uploaded onto http://www.chirosindex.com/ in August. Seventeen samples have different compositions (more than just rounding errors); for example, in the sample from 2002, Paratanytarsus had a relative abundance of 27% in V4 but 0% in V5. It is partially replaced by Tanytarsus sp (14%), while all other taxa also increase in relative abundance. V5 is the first version to show the count sums. As expected, they are generally much lower (median 32.5) than the 50 promised by the paper.

Version Six – September 2015

Another month, another version of the Lake Żabińskie chironomid stratigraphy. This version is mostly the same as V5 rounded to one decimal place, but in 1931 Cricotopus loses 2% and Limnophyes gains 2%.

Version Seven – October 2015

V7 is not rounded to one decimal place as V6 was, nor is it identical to V5. The largest change from V6 is in the sample from 1940, in which 12% Cricotopus appears, with all other taxa losing one or two percent.
V7 also includes what are purported to be the raw counts. They have non-integer (or half integer) values which are impossible. I suspect the counts are back-calculated from percent data and then the percentages recalculated.

Version Eight – October 2015

V8 only includes count data which are identical to V7 counts to four decimal places!

Version Nine – November 2015

V9 was uploaded onto NOAA in November. The count data in the Excel file are identical to the counts in V7 and V8; those in the text file have been rounded. The percent data are V7 rounded to one decimal place.

Different versions of the chironomid data for 1940

Version ten?

It is not clear that the version of the data described in the corrigendum is the same as V9. The corrigendum details the taxon inclusion rules

Taxa not included in the transfer function or not found in 3 percent in at least 3 samples were removed. In consequence, the number of taxa was reduced from 76 to 50.

Figure 1 caption

Four taxa (Brillia, Eukiefferiella, Krenopelopia and Stictochironomus) present in only 3 samples are not displayed.

However, in V9 (which does have 50 taxa), these four taxa have only one or two occurrences.

Of course, I understand that most people have multiple versions of their data: the initial low resolution counts; taxonomic revisions; corrections of transcribing errors. But this should all be sorted out before publication. Post-publication it should be easy to identify the correct file. To archive an incorrect version once is unfortunate; to archive six different versions (none of which match the published stratigraphy) is beginning to look like carelessness.

Note, all the archived data versions2 have file creation dates that post-date publication.


  1. Admit it, if the morels weren’t fly-blown, your amoral goat would have eaten them (and then would be amorel). 
  2. Data and code are archived on github 

Don’t get attached to attach()

When data are imported into R they generally arrive as a data.frame, a type of object that contains named columns. We often want to access the contents of each column, which can be done with the dollar or square-bracket notation. This post was first published at codeRclub.

attach() is used by some R users to make the columns of a data.frame directly accessible, rather than using dollar notation or data arguments.
Unfortunately, although attach() saves a little typing and maybe makes the very first R experience a tiny bit easier, it comes at a large cost, which I explore below.

The data.frame beaver1 holds body temperatures of a beaver (it is found in the datasets package, which is installed by default). These are the first few rows.

head(beaver1)
day time temp activ
346 840 36.33 0
346 850 36.34 0
346 900 36.35 0
346 910 36.42 0
346 920 36.55 0
346 930 36.69 0

If we attach() beaver1, we can access any of the columns of the data.frame directly.

attach(beaver1)
quantile(temp)
##      0%     25%     50%     75%    100% 
  ## 36.3300 36.7600 36.8700 36.9575 37.5300
boxplot(temp~activ)

Boxplot of body temperature against activity

Simple. What could possibly go wrong?

Let’s attach the second beaver dataset, which has the same column names as the first.

attach(beaver2)
  ## The following objects are masked from beaver1:
  ## 
  ##     activ, day, temp, time

That generated a message warning us that various objects from beaver1 are masked. Now if we run quantile() or some other function, we get the data from beaver2.

quantile(temp)
##      0%     25%     50%     75%    100% 
  ## 36.5800 37.1475 37.7350 37.9850 38.3500

If we want to use beaver1 again, we have to detach() beaver2 first.

detach(beaver2)
quantile(temp)
##      0%     25%     50%     75%    100% 
  ## 36.3300 36.7600 36.8700 36.9575 37.5300
detach(beaver1)

Attaching and detaching data.frames is obviously going to go horribly wrong very quickly unless you are very careful. Always.

One alternative would be to change the column names on one or both of the data.frames so that all column names are unique.

colnames(beaver1) <- paste("beaver1", colnames(beaver1), sep = "_")
colnames(beaver2) <- paste("beaver2", colnames(beaver2), sep = "_")
attach(beaver1)
attach(beaver2)
quantile(beaver1_temp)
##      0%     25%     50%     75%    100% 
  ## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(beaver2_temp)
##      0%     25%     50%     75%    100% 
  ## 36.5800 37.1475 37.7350 37.9850 38.3500
rm(beaver1, beaver2)#clean up
detach(beaver1)
detach(beaver2)

Possible, but hardly elegant and this is now as much typing as the dollar notation shown below.

Another solution would be to combine the two datasets. We need to add a column identifying which beaver is which.

beavers <- rbind(cbind(id = "beaver1", beaver1), cbind(id = "beaver2", beaver2))
attach(beavers)
quantile(temp[id == "beaver1"])
##      0%     25%     50%     75%    100% 
  ## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(temp[id == "beaver2"])
##      0%     25%     50%     75%    100% 
  ## 36.5800 37.1475 37.7350 37.9850 38.3500

Having combined the data.frames, we now need to subset the objects to get the data for each beaver. This isn’t a simple solution (in some cases it will be horribly complicated), but combining the data.frames might be very useful for some analyses (for example if you wanted to run an ANOVA).

attach() will also cause problems if there are objects with the same name as one of the columns of the data.frame.

temp <- 7
quantile(temp)
  ##   0%  25%  50%  75% 100% 
  ##    7    7    7    7    7

If we attach the data.frames when the object temp already exists, we get a warning; if we make temp afterwards, no warning is given and this object masks the column in the data.frame. Obviously, this could cause some nasty bugs. Good luck with them.
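A quick sketch of both orderings (using beaver1 and assuming the temp <- 7 from above still exists):

attach(beaver1)  # temp already exists, so R messages that temp is masked by .GlobalEnv
quantile(temp)   # still the scalar 7, not the column
detach(beaver1)

rm(temp)
attach(beaver1)
temp <- 7        # created after attaching: masks the temp column without any message
quantile(temp)   # again the scalar 7
detach(beaver1)
rm(temp)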

It is of course possible to avoid these problems with masking if we are very careful with naming objects. In practice, it is very easy to make mistakes which then cause difficult-to-interpret bugs.

Even if we manage to avoid errors when using attach(), it makes the code difficult to read, as it is not obvious which data.frame each object comes from.

Avoiding attach()

Fortunately there are alternatives to attach().

Referencing the columns in the data.frame

We can reference the columns in the data.frame by using either dollar notation or square bracket notation. Dollar notation is generally neater looking and needs less typing.

quantile(beaver1[, "temp"])
##      0%     25%     50%     75%    100% 
  ## 36.3300 36.7600 36.8700 36.9575 37.5300
quantile(beaver2$temp)
##      0%     25%     50%     75%    100% 
  ## 36.5800 37.1475 37.7350 37.9850 38.3500
plot(x = beaver2$time %/% 100 + 24 * (beaver2$day - beaver2$day[1]) + (beaver2$time %% 100) / 60,
     y = beaver2$temp, col = ifelse(beaver2$activ, 2, 1), xlab = "Time")
# %/% is integer division: 5 %/% 2 = 2
# %% gives the remainder (modulus): 5 %% 2 = 1

Beaver body temperature against time

When using the dollar notation, if there are spaces in the column names (a bad idea), the column name needs to be quoted or wrapped in backticks.
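For example, with a throw-away data.frame made up just to show the quoting:

awkward <- data.frame("body temp" = c(36.3, 36.4), check.names = FALSE)
awkward$"body temp"  # quotes work with the dollar notation
awkward$`body temp`  # as do backticks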

with() and within()

If the command you are using makes many references to a data.frame, the code can become rather messy, as in the previous example, where the day and time columns are combined to give hours.

In such cases the with() function can be useful. It is equivalent to attaching the data.frame for this block of code only (without problems with masking).

with(beaver1, {
  hours <- time %/% 100 + 24 * (day - day[1]) + (time %% 100) / 60
  plot(hours, temp, col = ifelse(activ, 2, 1))
})

with() can make code much easier to read.

within() is useful if we want to modify the contents of a data.frame. Here, I’m using the lubridate package to make the date and time of each measurement.

library(lubridate)
beavers <- within(beavers, {
  day <- dmy("01-01-1990") + days(x = day-1) + hours(time %/% 100) + minutes(time %% 100)
  rm(time)# remove unwanted column
})
head(beavers)
  ##        id                 day  temp activ
  ## 1 beaver1 1990-12-12 08:40:00 36.33     0
  ## 2 beaver1 1990-12-12 08:50:00 36.34     0
  ## 3 beaver1 1990-12-12 09:00:00 36.35     0
  ## 4 beaver1 1990-12-12 09:10:00 36.42     0
  ## 5 beaver1 1990-12-12 09:20:00 36.55     0
  ## 6 beaver1 1990-12-12 09:30:00 36.69     0

The data argument

Most functions that take a formula argument (think y ~ x) also have a data argument that can be given a data.frame.

library(mgcv)
mod <- gam(temp ~ s(unclass(day)), data  = beavers, subset = id == "beaver1")
#converting the date to seconds as gam doesn't like dates
plot(temp ~ day, data = beavers, subset = id == "beaver1", col = ifelse(activ, 2, 1), xlab = "Hours")
with(beavers, lines(day[id == "beaver1"], fitted(mod), col = 2))

Beaver body temperature against time

This is much better than attaching data as it makes it explicit which data are being used.

Do not be tempted to use the dollar notation in a formula.

## don't do this
# lm(beavers$temp ~ beavers$day)

This makes the code difficult to read, especially if there are multiple predictors, and will cause problems when making predictions.
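A short sketch of the prediction problem (the model objects and new data here are made up for illustration):

bad  <- lm(beaver1$temp ~ beaver1$time)  # dollar notation inside the formula
good <- lm(temp ~ time, data = beaver1)  # data argument instead
new_times <- data.frame(time = c(900, 1200))
predict(good, newdata = new_times)  # two predictions, as requested
predict(bad, newdata = new_times)   # newdata cannot be matched; predictions are returned for every original row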

ggplot

ggplot(), an alternative system for plotting data from the ggplot2 package, wants to have the data in a data.frame. It does not need or want the data to be attached.

library(ggplot2)
ggplot(data =  beavers, mapping = aes(x = day, y = temp)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x)) +
  facet_wrap(~id, dir = "v", scales = "free_x") +
  xlab("Hours")

Beaver body temperature against time

Does attach() have any uses?

attach() is not completely useless: attach() can also attach environments. This allows some nifty tricks for having utility functions etc. available without cluttering up your global environment (but it would probably be better to make a package – this is not hard).
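For instance, a minimal sketch with made-up helper functions (attach() also accepts lists):

helpers <- list(
  se = function(x) sd(x) / sqrt(length(x)),  # made-up utility: standard error
  cv = function(x) sd(x) / mean(x)           # made-up utility: coefficient of variation
)
attach(helpers, name = "helpers")
se(beaver2$temp)
detach("helpers")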


Cromwell’s rule is over

Of one thing I am certain: count data should be integers (or half integers if partial individuals are counted as one half). The chironomid data from Lake Żabińskie are remarkable in many respects; surely they meet this most basic expectation.

#load the data
library(readxl)

fname <- "zabinskie2015cit.xls"
excel_sheets(fname)
fos_counts <- read_excel(fname, sheet = "Chironomids Zabinskie counts")
chron <- fos_counts[, 1]
fos_counts <- fos_counts[, -c(1, ncol(fos_counts))]

Let’s look at the top left corner of the data matrix

knitr::kable(fos_counts[1:3, 1:3])

 Chironomini   Chironomus anthracinus   Chironomus plumosus
      4.9980                    0.000                 4.998
      2.9760                    2.976                 0.000
      5.0055                    0.000                 0.000

Non-integer values!

In the fossil data set, 2.5% of the counts have non-integer (or half-integer) values. These non-integer counts have a probability of occurring of exactly zero. Cromwell’s rule is over.

After Oliver Cromwell’s death, the monarchy was restored and Cromwell was posthumously executed. Other regicides were even less fortunate.
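A sketch of how that 2.5% might be computed (the post does not say whether the denominator is all cells or only the non-zero counts; here it is the non-zero counts):

vals <- unlist(fos_counts)
vals <- vals[vals > 0]  # consider only the non-zero counts
odd <- abs(vals * 2 - round(vals * 2)) > 1e-6  # TRUE if neither integer nor half-integer
round(mean(odd) * 100, 1)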

We can be absolutely certain that the archived data are not the original counts. Most likely, they are back-transformed from percent data that had been rounded. Why would anybody do that? Perhaps the authors lost the original counts. That would be unfortunate.
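One way values like the 4.998 in the table above could arise (the count sum of 70 here is purely illustrative):

true_count <- 5
count_sum <- 70  # illustrative count sum
pct <- round(true_count / count_sum * 100, 2)  # 7.14, as it might appear in a percent file
pct / 100 * count_sum  # 4.998 -- no longer an integer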


Ordinating Lake Żabińskie’s chironomids

Fossil stratigraphies are highly multidimensional – many species, many samples – which makes it difficult to visualise the main patterns in the data. One way to cope is to use ordinations to find the main axes of variability in the data. I run ordinations on every core I analyse. In this post, I will run a variety of ordinations on the Lake Żabińskie chironomid stratigraphy from Larocque-Tobler et al (2015). I will compare the results with analyses of the chironomid training set and of the Round Loch of Glenhead diatom stratigraphy, which shows the response to acid rain over a 140-year period.

This post will inevitably include an alphabet soup of methods that I don’t have space enough or time to fully explain. Apologies.

Detrended correspondence analysis

I always start my ordination analysis with a detrended correspondence analysis (DCA). I’m interested in the length of the first axis of the DCA in standard deviation (SD) units, a measure of the amount of species turnover along the ecological gradient. An axis length of 4 SD would give almost complete turnover of species (think of a normal distribution: four standard deviations span it from one side to the other).

Hypothetical abundance of a taxon along an environmental gradient. If the data are sampled from a short environmental gradient, e.g. 0 to 2 SD, the relationship is linear and methods such as PCA are appropriate. If the sampled gradient is long, e.g. -2 to +2 SD, the relationship is unimodal and methods that cope with this should be used.
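A minimal sketch of the kind of curve in that figure (a hypothetical Gaussian response along a gradient in SD units):

gradient <- seq(-4, 4, length.out = 200)
abundance <- exp(-gradient^2 / 2)  # hypothetical unimodal (Gaussian) response
plot(gradient, abundance, type = "l",
     xlab = "Environmental gradient (SD units)", ylab = "Abundance")
abline(v = c(0, 2), lty = 2)  # a short 0-2 SD window looks roughly monotonic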

This axis length helps me choose an appropriate ordination method for further analyses.

library(vegan)
library(rioja)
data(RLGH) # load Round Loch of Glenhead data
# spp, env and fos are the Żabińskie training set and fossil data, read from the
# archived file (see the reconstruction-diagnostics post below for the import code)
zdca <- decorana(sqrt(fos))
rdca <- decorana(sqrt(RLGH$spec))

keep <- colSums(spp > 0) >= 3
cdca <- decorana(sqrt(spp[, keep])) # training set
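The axis-1 lengths quoted below are printed with the decorana objects; something like this rough helper (the range of the axis-1 site scores approximates the reported axis length) pulls them out programmatically:

axis1_length <- function(d) diff(range(scores(d, display = "sites", choices = 1)))
axis1_length(rdca)  # Round Loch of Glenhead
axis1_length(zdca)  # Żabińskie
axis1_length(cdca)  # training set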

The length of the first axis of the Round Loch of Glenhead diatom stratigraphy is 0.68 SD. This short axis length is typical for a core. The length of the first axis of the Żabińskie chironomid stratigraphy is 3.13 SD. This is remarkably long. By comparison, the first axis length of the chironomid training set is 3.23 SD. So although the temperature range spanned by the training set (3 – 27.5°C) is more than three times larger than the 20th century temperature range at Żabińskie (12.9 – 20.7°C, estimated from the reconstructions), and the training set covers a diverse range of lake depths and other ecologically important environmental variables, the first axis lengths are practically the same. This, like so much with the Żabińskie record, is remarkable.

Unconstrained ordination

The first axis of an ordination is always the most important one. We can visualise how much more important than the other axes it is with a screeplot.

library(ggplot2)

g1 <- ggscreeplot(rda(sqrt(RLGH$spec)), bstick = FALSE, title = "Round Loch of Glenhead PCA")
g2 <- ggscreeplot(cca(sqrt(spp[, keep])), bstick = FALSE, title = "Chironomid training set CA")
g3 <- ggscreeplot(cca(sqrt(fos)), bstick = FALSE, title = "Żabińskie chironomids CA")

Screeplots of the Round Loch of Glenhead PCA, the chironomid training set CA and the Żabińskie chironomid CA

For the Round Loch of Glenhead, the first axis is very much more important than the remaining axes. The samples form a baguette-shaped cloud. This implies that a single environmental gradient (probably pH) is responsible for the changes in the diatom stratigraphy.

For the chironomid training set, the first axis is much more important than the second, which in turn is much more important than the third. The importance of the second axis is probably inflated by an arch effect (visible in a biplot of the ordination), a common artefact in correspondence analysis which could be corrected with a DCA. The samples form a croissant-shaped cloud. This shape implies that there is one main environmental gradient through the training set – almost certainly temperature or something highly correlated with temperature.

For the Żabińskie chironomid record, the first axis is only slightly more important than the second, which is only slightly more important than the third. The samples form the shape of a slightly deflated football. This shape, together with the long DCA axis 1, would only be expected if 1) the chironomid community was being driven by multiple strong and uncorrelated environmental variables, or 2) the chironomid count sums are very small so there is lots of counting noise. In either case, a very poor temperature reconstruction would be expected. And yet the reconstruction is near-perfect. Remarkable.

We can ask if the environmental variable of interest is correlated with the first few ordination axes. This cannot really be done for the Round Loch of Glenhead as we only have a reconstruction of pH. For Żabińskie, we should use the measured temperatures, but these have not been archived, so I am going to use the reconstruction instead which will bias the result in favour of the temperature being important.

camodm <- cca(sqrt(spp[, keep]))
# plot(camodm) # arched
envfit(camodm, env, choices = 1:2)
##
## ***VECTORS
##
##           CA1      CA2     r2 Pr(>r)
## [1,] -0.98340  0.18147 0.4872  0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

With the chironomid training set, temperature is associated with the first axis (CA1) of the ordination, and the r2 is relatively good. This is expected from a training set with a large temperature range.

camod <- cca(sqrt(fos))
# plot(camod)
envfit(camod, recon$temp, choices = 1:2)
##
## ***VECTORS
##
##          CA1     CA2     r2 Pr(>r)
## [1,] -0.8398  0.5429 0.0461  0.104
## Permutation: free
## Number of permutations: 999
envfit(camod, recon$temp, choices = 1:3)
##
## ***VECTORS
##
##           CA1      CA2      CA3     r2 Pr(>r)
## [1,] -0.31427  0.20316 -0.92734 0.3302  0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

With the Żabińskie chironomid record, reconstructed temperature is not significantly correlated with the first two ordination axes. It is, however, correlated with the third axis and has a moderate r2. This suggests that temperature is not particularly important in driving community composition.

Constrained ordinations

Constrained ordinations let us know in a more direct way how important an environmental variable is.

mod1 <- cca(sqrt(spp[, keep]) ~ env)
mod1
## Call: cca(formula = sqrt(spp[, keep]) ~ env)
##
##               Inertia Proportion Rank
## Total         3.29854    1.00000
## Constrained   0.24295    0.07365    1
## Unconstrained 3.05560    0.92635   86
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for constrained axes:
##    CCA1
## 0.24295
##
## Eigenvalues for unconstrained axes:
##     CA1     CA2     CA3     CA4     CA5     CA6     CA7     CA8
## 0.30349 0.25509 0.16501 0.12440 0.10133 0.09869 0.09055 0.08762
## (Showed only 8 of all 86 unconstrained eigenvalues)

With the chironomid training set, temperature explains about 7% of the inertia (which is low but not unreasonable), and the ratio of the inertia of the constrained axis to the first unconstrained axis is 0.8. If temperature was (correlated with) the most important ecological gradient in the data set, we would expect this ratio to be greater than one.
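That ratio can be computed directly from the eigenvalue components of the fitted cca object:

mod1$CCA$eig / mod1$CA$eig[1]  # constrained axis relative to the first unconstrained axis, ~0.8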

mod2 <- cca(sqrt(fos) ~ temperature, data = recon)
mod2
## Call: cca(formula = sqrt(fos) ~ temperature, data = recon)
##
##               Inertia Proportion Rank
## Total         3.38291    1.00000
## Constrained   0.12734    0.03764    1
## Unconstrained 3.25558    0.96236   49
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for constrained axes:
##    CCA1
## 0.12734
##
## Eigenvalues for unconstrained axes:
##     CA1     CA2     CA3     CA4     CA5     CA6     CA7     CA8
## 0.22205 0.20687 0.18918 0.18282 0.16599 0.15822 0.14259 0.13470
## (Showed only 8 of all 49 unconstrained eigenvalues)

With the Żabińskie chironomid record, reconstructed temperature explains about 4% of the inertia, and the ratio of the inertia of the constrained axis to the first unconstrained axis is only 0.57. Even the eighth unconstrained axis has more inertia than the axis constrained by temperature. All this suggests, yet again, that temperature is not the most important control on chironomid assemblages in Lake Żabińskie.

Conclusions

It is very difficult to reconcile the strange ordinations and the poor performance of temperature as a predictor of the chironomid assemblages with the remarkably good performance of the August-air temperature reconstruction.


Falling for ggplot2

I spent a long time resisting the lure of ggplot2. I was proficient with the plotting functions in base graphics; why did I need to learn an entirely new graphics system? Yes, setting up colour ramps could be a real pain, but I was in complete control.

Recently I’ve had to learn how to use ggplot2. Students were asking me to help them fix their code (often copied off Stack Overflow), so I needed at least some understanding. I read the book – or you can get the ebook from the University Library (Universitetsbiblioteket). It’s very readable. Now I see why ggplot2 is so popular and have started to use it whenever possible.

Today I wanted to make a screeplot for a principal component analysis with ggplot2, so I wrote my own function based on screeplot() from the vegan package.

ggscreeplot <- function(x, bstick = FALSE,
                        npcs = min(10, if (is.null(x$CCA) || x$CCA$rank == 0) x$CA$rank else x$CCA$rank),
                        xlab = "Component", ylab = "Inertia", title = "", ...) {
  if (is.null(x$CCA) || x$CCA$rank == 0)
    eig.vals <- x$CA$eig
  else eig.vals <- x$CCA$eig
  ncomps <- length(eig.vals)
  if (npcs > ncomps)
    npcs <- ncomps
  comps <- seq(len = npcs)

  df <- data.frame(Inertia = eig.vals[comps],
                   names = reorder(factor(names(eig.vals[comps])), comps))
  ymax <- df$Inertia[1]
  if (bstick && !is.null(x$CCA) && x$CCA$rank > 0) {
    warning("'bstick' unavailable for constrained ordination")
    bstick <- FALSE
  } else {
    df <- cbind(df, bstick = bstick(x)[comps])
    ymax <- max(ymax, df$bstick[1])
  }
  g <- ggplot(df, aes(x = names)) +
    geom_bar(aes(y = Inertia), stat = "identity", fill = "grey50") +
    labs(x = xlab, y = ylab, title = title) +
    scale_y_continuous(limits = c(0, ymax * 1.04), expand = c(0, 0))

  if (bstick) {
    g <- g + geom_point(aes(y = bstick), colour = "red")
    g <- g + geom_line(aes(y = bstick, group = 1), colour = "red")
  }
  g
}

Here it is in action

library(rioja)
library(vegan)   # for rda() and bstick()
library(ggplot2)
data(RLGH) # Round Loch of Glenhead diatom stratigraphy
ggscreeplot(rda(sqrt(RLGH$spec)), bstick = TRUE)

Screeplot of a PCA of the Round Loch of Glenhead diatom stratigraphy. Note that axis one is much more important than the other axes – it accounts for almost half the total variance.

I’ll be using this code when I show ordinations of the Lake Żabińskie chironomid reconstruction. Spoiler – they don’t look much like this.


Reconstruction diagnostics for the Lake Żabińskie chironomid reconstruction

The evaluation of reconstruction diagnostics is an essential part of the process of generating palaeoenvironmental reconstructions from microfossil assemblages using transfer functions. If the reconstruction diagnostics are bad, we should be especially cautious about interpreting the reconstruction. The problem is that “good” and “bad” are not well defined, and we rely on various rules-of-thumb to guide us.

Since the chironomid-based reconstruction of August air temperature presented by Larocque-Tobler et al (2015; hereafter LT15) from Lake Żabińskie is so remarkably good, it should be an interesting case to test how well reconstruction diagnostics work.

LT15 use analogue quality as their main diagnostic method

For the combined transfer function, to determine whether the modern calibration models had adequate analogues for the fossil assemblages, the modern analogue technique (MAT) was performed using C2 (Juggins, 2005), with squared chord distance as the dissimilarity coefficient (DC) (Overpeck et al., 1985). Confidence intervals were based on minimum DC distance within the calibration sets (Laing et al., 1999). Fossil assemblages above the 95% confidence interval were considered to have no analogues in the calibration set; samples between 75% and 95% were considered to have fair analogues (Francis et al., 2006).

This text from LT15 is not entirely clear – what confidence intervals? Time to read Francis et al (2006).

In order to determine whether the modern calibration model had adequate analogs for the fossil assemblages, modern analog testing (MAT) was performed using the computer program C2, with squared chord distance as the dissimilarity coefficient (Overpeck et al., 1985). Confidence intervals were based on minimum DC distance within the calibration set following Laing et al. (1999). Fossil assemblages above the 95% confidence interval were considered to have no analogs in the calibration set, samples between 75% and 95% were considered to have fair analogs.

I get a slight sense that I’ve read this before somewhere. I’m sure it is just a coincidence. But having read it twice, I understand what is being done. The squared-chord distance between each fossil sample and its closest analogue in the calibration set is being compared with the 75th and 95th percentiles of the distribution of distances between each calibration-set sample and its nearest neighbour.

LT15 report that

No sample had chironomid assemblages outside the 95% confidence interval suggesting that the transfer function can be applied to the downcore samples.

but don’t show this with a figure (no complaint here, I would do the same if the analogues were good). I want to see a figure showing the analogue distances.

First we need to load the data, which can be downloaded from NOAA.

library(readxl)

fname <- "zabinskie2015cit.xls"
excel_sheets(fname)
spp <- read_excel(fname, sheet = "Training species")
env <- read_excel(fname, sheet = "Training temperature")
fos <- read_excel(fname, sheet = "Chironomids Zabinsk percentages")
recon <- read_excel(fname, sheet = "Reconstruction ")
names(recon) <- c("date", "temperature")

rownames(spp) <- spp[, 1]
spp[, 1] <- NULL
rownames(env) <- env[, 1]
env <- env[, 2, drop = FALSE]

lowCount <- c("SAL", "LEK", "TRZ", "WAS", "SZOS", "GOR", "KOS", "ZAB")
spp <- spp[!rownames(spp) %in% lowCount, ]
env <- env[!rownames(env) %in% lowCount, , drop  = FALSE]
identical(rownames(spp), rownames(env))
env <- env$Temp

chron <- fos[, 1]
fos <- fos[, -c(1, ncol(fos))]

####check names####
setdiff(names(fos), names(spp))
setdiff(names(spp), names(fos))

Distances to the nearest analogue are easily calculated with the rioja package, which should give the same result as C2.

library(rioja)
library(ggplot2)

matmod <- MAT(spp, env)
matpred <- predict(matmod, fos)
goodpoorbad <- quantile(matmod$dist.n[, 1], prob = c(0.75, 0.95))
qualitybands <- data.frame(xmin = rep(-Inf, 3),
                           xmax = rep(Inf, 3),
                           ymax = c(goodpoorbad, Inf),
                           ymin = c(-Inf, goodpoorbad),
                           fill = factor(c("Good", "Fair", "None"),
                                         levels = c("None", "Fair", "Good")))

fillscale <- scale_fill_manual(values = c("salmon", "lightyellow", "skyblue"),
                               name = "Analogue Quality")

g <- ggplot(data.frame(chron, analogue = matpred$dist.n[, 1])) +
  geom_point(aes(x = chron, y = analogue)) +
  labs(x = "Date CE", y = "Squared chord distance to nearest analogue") +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill),
            data = qualitybands, alpha = .5) +
  fillscale
print(g)

Analogue quality against date

Thirty-eight of the 89 samples have no modern analogues under the definition used by LT15. Only 11 samples have good analogues. This is difficult to reconcile with the claim by LT15 that

No sample had chironomid assemblages outside the 95% confidence interval

The reconstruction of August air-temperature in LT15 is remarkably good, almost as good as would be expected if chironomids were perfect thermometers, yet the analogue quality is fairly awful (as is the squared residual length). Does this mean that these diagnostics are utterly useless guides to the utility of reconstructions? Or is this another remarkable feature of the Lake Żabińskie chironomid reconstruction?

Perhaps some ordinations would be useful to investigate what is going on. I’ll show some in a future post.
