Beyond nearest analogue distance

In my previous post, I argued that the taxonomic distance from a fossil assemblage to the most similar assemblage in the modern calibration set is a useful diagnostic for palaeoenvironmental reconstructions. Analogue distance is a useful metric, but it is not necessarily sufficient. Consider the following cases:

  1. Two fossil assemblages, one with taxa with well defined optima and the second comprising taxa with poorly defined optima, might have the same taxonomic distance to the nearest calibration set assemblage. Which reconstruction would you trust more?
  2. Two fossil assemblages, one with taxa present in the calibration set but with different relative abundances and one with some taxa missing from the calibration set might have the same nearest analogue distance. Which reconstruction would you trust more?
  3. The nearest analogue in the calibration set might be a very strange assemblage. Would you trust the reconstruction?

In this post, I want to consider another diagnostic for palaeoenvironmental reconstructions that should highlight the last of these issues when it occurs. The first two issues are addressed by the proportion of the fossil assemblage that are not in the calibration set or have poorly defined optima.

In a constrained ordination, for example constrained correspondence analysis or redundancy analysis, of the modern calibration set constrained by the single environmental variable that we want to reconstruct, observations that have an assemblage composition that is well explained by this variable will lie close to the first, constrained, axis. Observations with an assemblage composition that is poorly explained by the constraining variable will be distant from the first axis in at least one dimension. We can calculate the sum of squared distance, the squared residual length from the first axis, and use this as a metric of goodness of fit. This will give us a diagnostic for the calibration set, identifying observations that are poorly fitted by the constraining variable compared to the other observations. Useful, but we want a diagnostic for the fossil assemblages.

We can add the fossil assemblages passively to the constrained ordination to show their approximate position in ordination space. We can now calculate the squared residual lengths for the fossil assemblages. Fossil assemblages with a squared residual length that is large compared with the distribution of squared residual lengths of the calibration set (rule of thumb – greater than 90% of the lengths) are poorly fitted by the variable we want to reconstruct and we should be cautious about reconstructions based on these assemblages. Whereas analogue taxonomic distances are independent of the variable we are trying to reconstruct, residual lengths are not. Modern and fossil assemblages with a poor fit to one environmental variable might have a good fit to another. Squared residual length can be calculated with the function residLen( ) in the analogue package written by Gavin Simpson.

Note, the rioja and analogue packages don’t play nicely together as they have some functions that share the same name – which version you get depends on the order in which you load the libraries. If you load rioja and subsequently load analogue, you get the analogue version of crossval( ) and performance( ) which won’t work with rioja objects. You can force R to use the function you want by specifying which library to use with rioja:::crossval( ).

#Attaching package: ‘analogue’
#The following objects are masked from ‘package:rioja’:
#    crossval, performance

mod<-WA(SWAP$spec, SWAP$pH)
#Error in UseMethod("crossval") : 
#  no applicable method for 'crossval' applied to an object of class "WA"
#this works

Compared with the explanation above, the code for residLen( ) is very short and simple with four arguments: we need to specify the calibration set assemblages and environmental data, the fossil data and the ordination method required.

Since the first three arguments are obvious, I’ll concentrate on the fourth. residLen( ) accepts two different methods, reduncancy analysis (rda) and constrained correspondence analysis (cca). The former is appropriate with short taxonomic gradients in the calibration set, but will show gross artefacts with longer gradients. The latter is more appropriate with longer taxonomic gradients, but might still show some artefacts. We can use detrended correspondence analysis, implemented in the vegan package function decorana( ) to find the length of the gradient. If the first gradient length is shorter than 2-2.5 SD units, rda is probably more appropriate, if it is longer cca might be better. Whichever choice is made, it is important to plot the constrained ordination and examine it for artefacts. At some stage in this series I might write more about using and interpreting ordinations as a key diagnostic for exploring calibration sets, in the meantime Borcard et al is excellent.

I’ll start as always by loading the data. For simplicity, I’m just going to use the analogue package, which includes the SWAP calibration set and Round Loch of Glenhead fossil data but with different names.

data(swapdiat, swappH, rlgh)

Now, I’ll use decorana to help choose an appropriate ordination method, and plot it to check for artefacts.

#decorana(veg = spp) 
#Detrended correspondence analysis with 26 segments.
#Rescaling of axes with 4 iterations.
#                  DCA1   DCA2   DCA3   DCA4
#Eigenvalues     0.6191 0.3890 0.3031 0.2881
#Decorana values 0.6397 0.3924 0.2889 0.2436
#Axis lengths    6.0054 3.7798 3.5541 2.8154


The length of the first axis (6 SD) of the decorana was larger than 2.5 SD units, so I select cca as the more appropriate ordination.

CCA of the SWAP calibration set constrained by pH. Sites are shown as circles, species by crosses.

CCA of the SWAP calibration set constrained by pH. Sites are shown as circles, species by crosses.

The cca of the calibration set assemblages constrained by pH looks OK: there is no obvious arch (which occasionally happens in a cca). Some observations are extreme on axis two, which is not very pretty, but rather than trying to work out what is going on here (perhaps rare species), I’ll skip directly to calculate the squared residual length.

rlen<-residLen(spp, env, fos, method="cca")
plot(rlen) # distribution of modern and fossil residual lengths
plot(depths, rlen$passive, ylab="Squared residual length", xlab="Depth")
abline(h=quantile(rlen$train, probs = c(0.9,0.95)), col=c("orange", "red"))

Squared residual length for the RLGH fossil assemblages vs depth. Orange line marks the 90% limit of the calibration set residual lengths (poorly fitted) and the red line marks the 95% limit (very poorly fitted).

Squared residual length for the RLGH fossil assemblages vs depth. Orange line marks the 90% limit of the calibration set residual lengths (poorly fitted) and the red line marks the 95% limit (very poorly fitted).

This plot shows that most RLGH fossil assemblages are well fitted by pH, but that some in the lower part of the core have higher residual lengths and are poorly or very poorly fitted. We should be more cautious about reconstructions based on these assemblages.

I’ve now presented three diagnostics that focus on individual assemblages: the proportion of the fossil assemblage that is present (or has well constrained optima) in the calibration set; taxonomic distance to the nearest analogue; and the squared residual length. The obvious questions are

  1. Do these methods provide unique information (in which case all might need to be presented), or do they repackage the same information (and we need present only one).
  2. Are these diagnostics useful predictors of reconstruction quality?

I’m going to try to explore these ideas in a forthcoming post in this series on reconstruction diagnostics.


About richard telford

Ecologist with interests in quantitative methods and palaeoenvironments
This entry was posted in R, transfer function and tagged , . Bookmark the permalink.

3 Responses to Beyond nearest analogue distance

  1. ucfagls says:

    Just a minor point; if the functions are exported from a package’s namespace, as crossval() and performance() are, you just need the double-colon operator ::, so rioja::crossval(). You only need ::: if you want access to functions in packages that the package author did not want to export as a user-visible function.

    It never occurred to me to plot the residual length “stratigraphically”; I’ve always just used the distribution-oriented displays. Your series is offering up plenty food for thought on things to add/improve in analogue.

  2. Pingback: Musings on Quantitative Palaeoecology

  3. Pingback: 73 lakes become 78 | Musings on Quantitative Palaeoecology

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s