Reconstructions of past environmental conditions can be made using transfer functions based on the modern relationship between paired observations of species assemblages and the environmental variables of interest in a calibration set. One of the requirements of calibration sets is that they span the likely range of past environmental conditions being reconstructed.

It is noted (Birks et al. 2010), though, that some transfer-function methods can extrapolate. ter Braak (1995) explores this with simulated data. His calibration set has two environmental variables; observations are taken from an “L”-shaped region, and the test set fits into part of the space within the “L”. Thus, although the test set requires extrapolation into environmental space not covered by the calibration set, it does not require extrapolation to extreme values of either environmental variable. In ter Braak’s (1995) example, a one-dimensional multinomial logit model (related to the more common maximum likelihood method) and the modern analogue technique performed poorly. Weighted averaging partial least squares, which is designed to incorporate information from secondary environmental variables, performed well, almost as well as a two-dimensional multinomial logit model, which explicitly incorporates secondary environmental gradients.

I want to test something slightly different: how well transfer-function models behave when extrapolated to extreme values of the environmental variable of interest.

The test I’m going to use is simple. I’m going to develop transfer function models on the SWAP diatom-pH calibration set truncated at either the 25th or 75th percentile and make predictions for the remaining 25% of the data. The predictions are compared with the observed values, and for context, I also examine the leave-one-out cross-validation predictions for this part of the environmental gradient from transfer function models trained on the entire calibration set.

Here is the code for testing weighted-averaging.

```r
library(rioja)

data(SWAP)
summary(SWAP$pH)
keep <- SWAP$pH < 6.225  # truncate the calibration set at pH 6.225

# WA: full-data model and truncated model (dropping species absent after truncation)
mod0 <- crossval(WA(SWAP$spec, SWAP$pH))
mod1 <- crossval(WA(SWAP$spec[keep, colSums(SWAP$spec[keep, ]) > 0], SWAP$pH[keep]))

# extrapolated predictions for the omitted part of the gradient
p1 <- predict(mod1, SWAP$spec[!keep, ])$fit[, 1]

plot(SWAP$pH, mod0$predicted[, 1], xlab = "Measured pH", ylab = "Predicted pH")
title(main = "WAinv")
points(SWAP$pH[keep], mod1$predicted[, 1], col = 2)
points(SWAP$pH[!keep], p1, pch = 16, col = 4)
abline(h = 6.225, v = 6.225)  # truncation point
abline(0, 1)                  # 1:1 line
```
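The mean bias and r² statistics reported below can be computed from the predictions and observations; a minimal sketch for the extrapolated WA predictions above, assuming mean bias is defined as the mean of predicted minus observed:

```r
# performance statistics for the extrapolated predictions
# (p1 and keep as defined in the block above)
obs <- SWAP$pH[!keep]
mean_bias <- mean(p1 - obs)  # mean of (predicted - observed)
r2 <- cor(p1, obs)^2         # squared Pearson correlation
```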

I ran this test for weighted averaging with inverse deshrinking (WAinv), weighted averaging with monotonic spline deshrinking (WAmono), a two-component weighted averaging partial least squares model (WAPLS), maximum likelihood (ML) and the modern analogue technique (MAT) with five analogues.
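The other models can be fitted by swapping in the corresponding rioja functions. A sketch follows; the column picked from `$fit` for each method and the division by 100 for `MLRC` (which I assume expects proportions rather than percentages) are assumptions worth checking against the rioja documentation:

```r
# same test with the other methods (sketch; reuses keep from the WA code above)
spec_t <- SWAP$spec[keep, colSums(SWAP$spec[keep, ]) > 0]
pH_t <- SWAP$pH[keep]
spec_new <- SWAP$spec[!keep, ]

p_wamono <- predict(WA(spec_t, pH_t, mono = TRUE), spec_new)$fit[, 1]  # monotonic deshrinking
p_wapls  <- predict(WAPLS(spec_t, pH_t), spec_new)$fit[, 2]            # second WAPLS component
p_ml     <- predict(MLRC(spec_t / 100, pH_t), spec_new / 100)$fit[, 1] # maximum likelihood
p_mat    <- predict(MAT(spec_t, pH_t, k = 5), spec_new)$fit[, 1]       # five analogues
```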

**Acid end of the gradient (calibration set truncated at the 25th percentile)**

| Method | LOO mean bias | LOO r² | Extrapolation mean bias | Extrapolation r² |
|---|---|---|---|---|
| WAinv | 0.17 | 0.24 | 0.12 | 0.28 |
| WAmono | 0.17 | 0.23 | 0.50 | 0.28 |
| WAPLS2 | 0.11 | 0.26 | 0.37 | 0.28 |
| ML | 0.02 | 0.16 | 0.32 | 0.11 |
| MAT | 0.10 | 0.20 | 0.53 | 0.16 |

**Alkaline end of the gradient (calibration set truncated at the 75th percentile)**

| Method | LOO mean bias | LOO r² | Extrapolation mean bias | Extrapolation r² |
|---|---|---|---|---|
| WAinv | -0.15 | 0.32 | -0.56 | 0.22 |
| WAmono | -0.15 | 0.30 | -0.61 | 0.21 |
| WAPLS2 | -0.13 | 0.28 | -0.40 | 0.17 |
| ML | -0.07 | 0.19 | -0.50 | 0.09 |
| MAT | -0.18 | 0.29 | -0.83 | 0.14 |

As expected, with leave-one-out cross-validation of the entire calibration set, mean bias is positive at the low end of the gradient (pH is over-estimated) and negative at the high end (pH is under-estimated). The r² for these short portions of the gradient is lower than that for the full gradient.

With extrapolation from the truncated calibration set, absolute mean bias increases in all but one case, and the r² decreases in most cases, but surprisingly increases at the acid end of the pH gradient with some methods.

Weighted averaging works by calculating the pH optimum of each species as the abundance-weighted average of the pH of its occurrences. The weighted average of the optima of the species in the test observation is then calculated. Because averages are taken twice, the estimates will span a smaller range than the original observations. A deshrinking step is used to stretch the estimates to best match the observations. It is this deshrinking step that can allow weighted averaging methods to extrapolate. There is a choice of methods in the rioja package: inverse deshrinking, classical deshrinking, and monotonic spline deshrinking.
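The two averaging steps and the deshrinking regression can be illustrated with a toy example (a hand-rolled sketch with made-up data, not rioja's implementation):

```r
# toy weighted averaging with inverse deshrinking
# spec: sites x species abundance matrix; env: observed pH at each site
spec <- matrix(c(10, 5, 0,
                  2, 8, 4,
                  0, 3, 9), nrow = 3, byrow = TRUE)
env <- c(5.0, 6.0, 7.0)

# species optima: abundance-weighted average of the pH of their occurrences
opt <- colSums(spec * env) / colSums(spec)

# raw estimates: abundance-weighted average of the optima present at each site
raw <- as.vector(spec %*% opt) / rowSums(spec)

# averaging twice shrinks the range, so stretch the raw estimates by
# inverse regression of the observations on the raw estimates
est <- fitted(lm(env ~ raw))
```

The raw estimates span a narrower range than the observed pH; the deshrinking regression stretches them back out, and it is the slope of that regression, applied outside the calibration range, that lets WA extrapolate.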

Strangely, at the acid end of the gradient, WAinv performs better by extrapolation than by leave-one-out cross-validation. At the alkaline end, performance is worse. WAmono has a larger mean bias than WAinv, especially at the acid end, but a similar r².

WAPLS is designed to cope with secondary gradients in the calibration set, but can also work by correcting edge effects in WA. With the SWAP calibration set, WAPLS has only marginally better cross-validation performance than WAinv; its extrapolation performance has a lower mean bias at the alkaline end, but it otherwise does not outperform WAinv.

I thought ML would perform well by extrapolation as it fits a curve to each species which can be extrapolated. However, with these examples ML does not perform well, perhaps because there are many species with poorly defined optima and tolerances.

MAT is predictably hopeless at extrapolating. The predictions are the mean pH of the five taxonomically most similar observations in the calibration set, so the maximum possible prediction is the mean of the five most alkaline observations in the calibration set: the method cannot extrapolate.
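This bound is easy to see in a minimal sketch of an analogue prediction (a hand-rolled squared-chord MAT, not rioja's implementation):

```r
# MAT prediction: mean pH of the k taxonomically most similar calibration
# samples, so no prediction can exceed the mean of the k most alkaline
# calibration values (nor fall below the mean of the k most acid ones)
mat_predict <- function(spec_cal, env_cal, spec_new, k = 5) {
  # squared-chord distance between the new sample and each calibration sample
  d <- apply(spec_cal, 1, function(s) sum((sqrt(s) - sqrt(spec_new))^2))
  mean(env_cal[order(d)[1:k]])
}
```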

These tests show that in some circumstances extrapolation with WA can be good, in others it is poor. WAmono and ML performed worse than I thought they would.

The difference in the extrapolation performance at the acid and alkaline ends of the gradient is curious. If it is possible to work out why this occurs, it may be possible to predict when it is safe to extrapolate (slightly), and when extrapolations are less trustworthy.

Thanks to Sakari for prompting this post.

Birks, H.J.B., Heiri, O., Seppä, H., Bjune, A.E., 2010. Strengths and weaknesses of quantitative climate reconstructions based on late-Quaternary biological proxies. *The Open Ecology Journal* 3, 68–110.

ter Braak, C.J.F., 1995. Non-linear methods for multivariate statistical calibration and their use in palaeoecology: a comparison of inverse (k-nearest neighbours, partial least squares and weighted averaging partial least squares) and classical approaches. *Chemometrics and Intelligent Laboratory Systems* 28, 165–180.

The simplest explanation for the difference in extrapolation performance at the acid and alkaline ends of the gradient is that the first quartile of the data spans 0.61 pH units, whereas the fourth quartile spans 1.0 pH units. It might have been fairer to run the test for a fixed length of gradient rather than a fixed proportion of the observations.
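Re-running the test with a fixed gradient length only changes the truncation line; a sketch for the alkaline end, reusing the objects from the code above:

```r
# truncate by gradient length rather than by quantile:
# drop the most alkaline 0.5 pH units of the calibration set
cut_high <- max(SWAP$pH) - 0.5
keep <- SWAP$pH < cut_high
```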

When testing extrapolation to the extreme 0.5 pH units of the calibration set, performance at the acid end is qualitatively similar to the test above. At the high end of the gradient, the cross-validation performance is very poor (r² all below 0.1). The extrapolations have larger mean bias, but also a higher r².

Perhaps the simplest test of whether a transfer function can extrapolate is how well it performs at the end of the gradient. If it does not perform well at the end of the gradient under cross-validation it is very unlikely to extrapolate well.
