Transfer functions are widely used to reconstruct past environmental conditions from fossil assemblages using the relationship between species and the environment in a modern calibration set. Naturally, palaeoecologists want to generate reconstructions that are as precise as possible, and take steps to achieve this:
- taxonomic resolution can be improved in the hope that the new taxa will have narrower ecological niches than the aggregate taxa they replaced
- larger calibration sets can be generated, which can improve precision but can also worsen it if the new observations are not comparable with the old
- maximising the environmental gradient of interest while minimising nuisance environmental variables will usually improve calibration set performance (but not necessarily the reconstructions)
- new transfer function methods can be developed and used
- the spatial density of observations can be increased in an autocorrelated environment (while using transfer function methods, such as the modern analogue technique, that are not robust to autocorrelation)
I want to suggest that there are limits to the precision that can be achieved in practice due to the inherent noise in species-environment relationships and that papers that report transfer functions with exceptionally good performance should be treated with caution. Temperature is one of the most commonly reconstructed environmental variables as it is a key climatic variable and is ecologically important, so I am going to focus on this.
With all the certainty of a hunch, I am going to place my threshold for dubious precision (the root mean squared error of prediction; RMSEP) at 1°C for transfer functions with long temperature gradients (i.e. equator to pole), and somewhat lower if the temperature gradient is small.
Several transfer functions have been declared to have performance better than this threshold. I’m going to focus on the planktonic-foraminifera sea-surface temperature (SST) transfer functions as I know these fairly well; the system is relatively simple (compared with diatoms in lakes at least); and there are some interesting issues to explore.
Pflaumann et al (2003) reported a planktonic foraminifera-SST transfer function with a standard deviation of residuals (similar to RMSEP if bias is low) of 0.75°C for winter and 0.82°C for summer using the SIMMAX method. SIMMAX was (hopefully I am correct in using the past tense) a version of the modern analogue technique (MAT) that weighted analogues by their geographic proximity to the test site during cross-validation. Since SST is spatially autocorrelated, giving high weights to close analogues will tend to make the predictions appear more precise. But this is a spurious precision, bought at the expense of the independence of the test observation, otherwise known as cheating. Since Telford et al (2004) described the problem with SIMMAX, it has been little used.
Waelbroeck et al (1998) introduced the revised analogue method (RAM), another version of MAT that attempted to merge the properties of MAT and response surfaces. Unfortunately, the response surface was only calculated once rather than repeatedly during cross-validation. This means that the impressive performance for their planktonic foraminifera-SST transfer function, with a standard deviation of residuals of 0.7°C for winter and 0.91°C for summer, is biased by the failure to ensure that the test observation is independent of the calibration set during cross-validation. I’ve not seen RAM used much since Telford et al (2004) described the problem with it.
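Both of these papers report a standard deviation of residuals rather than an RMSEP. The two are linked by RMSEP² = bias² + SD², so they only coincide when the mean bias is close to zero. The sketch below checks this on a handful of made-up SSTs; the function name `rmsep_parts` and all of the numbers are invented for illustration and come from none of the papers discussed.

```python
import math
import statistics


def rmsep_parts(observed, predicted):
    """Return (rmsep, bias, sd_resid) for paired observations and predictions."""
    residuals = [p - o for o, p in zip(observed, predicted)]
    rmsep = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    bias = statistics.fmean(residuals)
    # Population SD (divide by n) so that rmsep**2 == bias**2 + sd_resid**2 exactly.
    sd_resid = statistics.pstdev(residuals)
    return rmsep, bias, sd_resid


# Hypothetical SSTs in degrees C, invented purely for illustration.
observed = [2.0, 8.0, 14.0, 20.0, 26.0]
unbiased = [2.6, 7.2, 14.5, 19.4, 26.3]   # residuals average to zero
biased = [p + 1.0 for p in unbiased]      # same scatter, shifted 1 degree warm

for label, predicted in (("unbiased", unbiased), ("biased", biased)):
    rmsep, bias, sd_resid = rmsep_parts(observed, predicted)
    print(f"{label}: RMSEP={rmsep:.2f} bias={bias:.2f} SD={sd_resid:.2f}")
```

With the warm-shifted predictions the standard deviation of residuals is unchanged while the RMSEP grows, which is why a reported standard deviation of residuals can flatter a biased model.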
Artificial neural networks (ANN) were used by Malmgren et al (2001), with a reported RMSEP of 0.99°C for winter and 1.07°C for summer. ANNs learn by iteratively adjusting a large set of parameters, which are initially set at random values, to minimize the error between the predicted and actual output. If trained for too long, ANNs can over-fit the data, learning particular features of the modeling set rather than the general rules. This is normally controlled by splitting the data, training the models on one portion of the data, testing them with a second portion, and stopping the training when the model stops reducing the RMSEP of this second portion. Typically many ANN models are generated from different random initial conditions and configurations, and the best model, as judged using the second portion, is used. By selecting models that give the lowest RMSEP for the second data partition, the RMSEP is biased low. A third data partition is needed to give an unbiased estimate of model performance (again, see Telford et al (2004)). Malmgren et al did not use this independent test set, so their results are biased low.
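The selection bias can be shown without training a single network. In the sketch below, which is a stand-in and not Malmgren et al's procedure, every candidate "model" has exactly the same true prediction error (1°C), and its errors on the second and third partitions are just independent random draws. The partition sizes, the number of candidates and the function names are all assumptions chosen for illustration.

```python
import math
import random


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def simulate(n_models=200, n_val=50, n_test=500, true_sd=1.0, seed=1):
    """Pick the 'best' of many equally good models using the second partition.

    Returns the RMSEP of the selected model on the second (selection)
    partition and on a third, untouched partition.
    """
    rng = random.Random(seed)
    val_rmsep, test_rmsep = [], []
    for _ in range(n_models):
        # Every model has the same true error; only the random draws differ.
        val_rmsep.append(rmse([rng.gauss(0.0, true_sd) for _ in range(n_val)]))
        test_rmsep.append(rmse([rng.gauss(0.0, true_sd) for _ in range(n_test)]))
    best = min(range(n_models), key=val_rmsep.__getitem__)
    return val_rmsep[best], test_rmsep[best]


val_best, test_best = simulate()
print(f"RMSEP of selected model on second partition: {val_best:.2f}")
print(f"RMSEP of selected model on third partition:  {test_best:.2f}")
```

The selected model looks better than 1°C on the partition used to choose it simply because it was the luckiest draw; on the third partition it falls back towards its true error.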
MAT is perhaps the most widely used transfer function method for reconstructing SST from planktonic foraminifera. Telford & Birks (2005) report an RMSEP of 0.89°C for winter SST in the North Atlantic (Kucera et al (2005) report a larger RMSEP of 1.32°C for winter and 1.42°C for summer – I don’t know what causes the difference). As Telford & Birks (2005) show, this low RMSEP is biased by spatial autocorrelation in the calibration set which means that the test observation is not independent of the calibration set during cross-validation.
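One way to see the effect is a toy transect in which SST is spatially autocorrelated but the "assemblage" does not respond to SST at all, only to position along the transect. Leave-one-out MAT still appears to predict SST, because the best analogues are the geographic neighbours; dropping the neighbours from the calibration set removes the apparent skill. This is a minimal sketch under those assumptions: the random-walk SST, the single assemblage score, and the exclusion distance `h` are all hypothetical and are not taken from Telford & Birks (2005).

```python
import math
import random


def simulate_transect(n_sites=200, seed=1):
    """Simulate SST and an assemblage score for sites along a transect."""
    rng = random.Random(seed)
    sst, level = [], 15.0
    for _ in range(n_sites):
        level += rng.gauss(0.0, 0.5)  # random walk: neighbours have similar SST
        sst.append(level)
    # Assemblage score tracks position along the transect, not SST.
    score = [i + rng.gauss(0.0, 1.0) for i in range(n_sites)]
    return sst, score


def mat_rmsep(sst, score, k=5, h=0):
    """Cross-validated RMSEP of a k-analogue MAT.

    Sites within h positions of the test site are left out of the
    calibration set; h=0 is ordinary leave-one-out.
    """
    sq_err = []
    for i in range(len(sst)):
        candidates = [j for j in range(len(sst)) if abs(j - i) > h]
        analogues = sorted(candidates, key=lambda j: abs(score[j] - score[i]))[:k]
        pred = sum(sst[j] for j in analogues) / k
        sq_err.append((pred - sst[i]) ** 2)
    return math.sqrt(sum(sq_err) / len(sq_err))


sst, score = simulate_transect()
rmsep_loo = mat_rmsep(sst, score, h=0)
rmsep_block = mat_rmsep(sst, score, h=20)
print(f"leave-one-out RMSEP:        {rmsep_loo:.2f}")
print(f"neighbours-excluded RMSEP:  {rmsep_block:.2f}")
```

The gap between the two numbers is entirely an artefact of the test site's neighbours sitting in the calibration set, since the simulated assemblages carry no SST signal of their own.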
All of these low RMSEPs are demonstrably biased. To have an RMSEP of 1°C, species need to have very clean responses to temperature. Nuisance variables and noise make this unlikely. With short gradients, the magnitude of error that is possible decreases, so lower RMSEPs are expected (but also lower r²). So for example, the Norwegian pollen-July temperature RMSEP of just over 1°C is plausible. This model has none of the problems outlined above and uses methods that are reasonably robust to autocorrelation.
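The point about short gradients comes down to a little arithmetic. A model that always predicts the calibration set mean has an RMSEP roughly equal to the standard deviation of the gradient, so the gradient caps how bad the RMSEP can get. The sketch below uses a skill score, 1 - RMSEP²/variance, as a stand-in for r²; this is an assumption and not necessarily the r² reported in transfer function papers, and both gradient standard deviations are hypothetical.

```python
def skill(rmsep, gradient_sd):
    """1 - RMSEP^2 / variance: skill relative to always predicting the mean."""
    return 1.0 - (rmsep / gradient_sd) ** 2


# Same 1 degree C RMSEP on two hypothetical calibration sets.
long_gradient = skill(rmsep=1.0, gradient_sd=8.0)   # e.g. equator to pole
short_gradient = skill(rmsep=1.0, gradient_sd=2.0)  # e.g. a single region

print(f"long gradient:  {long_gradient:.2f}")
print(f"short gradient: {short_gradient:.2f}")
```

The same RMSEP gives a score of about 0.98 on the long gradient but only 0.75 on the short one, so a 1°C RMSEP is a far less remarkable claim when the gradient is short.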
In reality, different thresholds are needed for different proxies. When the relationship between the organisms and the environmental variable being reconstructed is less direct (for example between chironomids and air temperature) or there are large nuisance gradients (again e.g. chironomids), the threshold at which I start to wonder is raised.
The same logic outlined here holds for transfer functions for reconstructing other variables – if the results look too good to be true, there might be problems. For example, there is at least one transfer function where I suspect that the authors have forgotten to cross-validate their model, so good is the performance. Unfortunately, short of acquiring the data and re-running the analyses, there is little that can be done to check such cases.
A question for readers: do you know of any transfer functions with suspiciously good performance that ought to be examined?