Many transfer functions for inferring environmental conditions from species assemblages include observations with a large difference between the observed and estimated value of the environmental variable after cross-validation. These are the outliers: they make the performance of the transfer function worse than it would otherwise be.
What to do about them? My preference is to first check for transcription errors and then try to understand what is special about these observations.
- Is there something unusual about the site that the outlier comes from? Perhaps the lake is unusually deep, affecting the proportion of planktonic diatoms in a pH calibration set. Or perhaps a lake receives snow-melt throughout the summer, decoupling the relationship between air temperature and chironomids.
- Is it possible that the microfossil assemblage does not represent the modern community? Some ocean cores lack any Holocene deposition, so the core tops are from cold glacial conditions. Alpine lakes with low sedimentation rates might have surface sediments deposited in the little ice age.
- Have taphonomic processes including transport and degradation of microfossils affected the assemblage?
These questions can be difficult to answer, so an alternative strategy is to delete any outliers above a certain threshold. The value of this threshold is critical, as it will strongly affect the apparent performance of the transfer function. Because of this, Juggins and Birks (2012)
prefer to take a conservative approach to sample deletion and initially remove only outliers that have a standardised residual (under internal cross-validation (CV)) that is greater than 2 or 2.5 in absolute value. Assuming approximately normal residuals, this corresponds to an expected proportion of about 5% or 1% of observations, respectively.
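Those expected proportions follow directly from the normal distribution. A quick check (a Python sketch; the choice of thresholds is from Juggins and Birks, the rest is just standard-normal tail arithmetic):

```python
from math import erfc, sqrt

# Expected fraction of observations beyond a standardised-residual threshold z,
# assuming the cross-validation residuals are approximately standard normal:
# P(|Z| > z) = erfc(z / sqrt(2))
for z in (2.0, 2.5):
    tail = erfc(z / sqrt(2))
    print(f"|standardised residual| > {z}: {tail:.1%} of observations expected")
    # prints 4.6% for z = 2.0 and 1.2% for z = 2.5
```

So the "about 5% and 1%" quoted above are the usual two-sided normal tail probabilities, rounded.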
I want to test the effect of this threshold on transfer function performance.
I’ve analysed the SWAP calibration set in the rioja package in R. I’ve fitted a weighted average (WA) model to the data, found the absolute value of the cross-validation residual for each observation, and then refitted the model omitting the n observations with the largest residuals, for n between 1 and 150. I’ve used the root mean square error of prediction (RMSEP) as my metric of model performance.
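The original analysis uses rioja in R; the procedure itself can be sketched in a few lines of Python on a synthetic calibration set. Everything below (the toy data, the site and taxon counts, the deletion steps) is an illustrative assumption, not SWAP, but the mechanics are the same: fit a WA model, compute leave-one-out cross-validation residuals, delete the n worst observations, and recompute the RMSEP.

```python
import numpy as np

rng = np.random.default_rng(1)

def wa_optima(Y, x):
    # Species optima: abundance-weighted mean of the environmental variable.
    return (Y.T @ x) / Y.sum(axis=0)

def wa_predict(Y, optima):
    # WA reconstruction: abundance-weighted mean of the species optima.
    return (Y @ optima) / Y.sum(axis=1)

def loo_residuals(Y, x):
    # Leave-one-out cross-validation residuals (predicted minus observed).
    n = len(x)
    res = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        opt = wa_optima(Y[keep], x[keep])
        res[i] = wa_predict(Y[i : i + 1], opt)[0] - x[i]
    return res

# Hypothetical calibration set: 100 sites, 30 taxa with Gaussian
# responses to x (think lake-water pH), plus noise.
n_sites, n_taxa = 100, 30
x = rng.uniform(4.0, 7.0, n_sites)
true_optima = rng.uniform(4.0, 7.0, n_taxa)
Y = np.exp(-((x[:, None] - true_optima) ** 2)) \
    + rng.uniform(0.02, 0.3, (n_sites, n_taxa))

# Rank observations by |LOO residual|, then delete the n worst,
# refit, and recompute the cross-validated RMSEP on what remains.
order = np.argsort(-np.abs(loo_residuals(Y, x)))
for n_del in (0, 10, 30):
    keep = np.sort(order[n_del:])
    rmsep = np.sqrt(np.mean(loo_residuals(Y[keep], x[keep]) ** 2))
    print(f"deleted {n_del:2d} sites: RMSEP = {rmsep:.3f}")
```

Even on well-behaved synthetic data, the RMSEP drifts downwards as the worst-fit observations are removed, which is exactly the effect examined below.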
The pattern is clear. As the threshold for deleting observations is lowered, the RMSEP declines. This holds for a surprisingly long time – about 140 of the 167 observations can be deleted before the performance starts to degrade – and the model performance becomes excellent, with an RMSEP less than a third of the original.
A third of the original RMSEP. Who wouldn’t want a model that performs three times better than the original?
So is it a good idea to have a low threshold for deleting outliers? NO – it is cheating. It is a means to artificially improve the apparent performance of the transfer function, deceiving the reader (and probably the author) as to how accurate and precise reconstructions made with the model will be.
The problem is obvious – it is not possible to clean the fossil observations used in the reconstruction in the same way as the modern calibration set was cleaned.
The guidelines set by Juggins and Birks (2012) are reasonable. Excessive data cleansing may be harmful to the prospects of your manuscripts in peer review.