Sometimes papers reporting transfer functions don’t give enough information for the reader to properly evaluate the model. Take for example Massaferro and Larocque-Tobler (2013) who report a chironomid-based mean annual air temperature transfer function for Patagonia and its application to a Holocene-length chironomid stratigraphy from Lake Potrok Aike.
The paper reports that the three-component WAPLS model has an RMSE (root mean squared error) of 0.83°C and an r2 of 0.64. I assumed that this was a typo and that the authors were actually reporting the RMSEP (root mean squared error of prediction), as it has been standard practice for decades to present the model’s cross-validated performance.
On re-reading the paper after seeing it cited, I began to doubt my previous assumptions: perhaps the authors really were reporting the non-crossvalidated model performance statistics, which can be unrealistic estimates of model performance. My doubts focused on the large number of components used (a three-component WAPLS model is usually (perhaps always) an overfit) and the weak relationship between chironomid assemblage composition and temperature in the calibration set shown in the following figure.
A search of my sent-mail folder reminds me that I advised the authors of my concerns about this exactly four years ago today. They did not respond. I now happen to have the data from this paper and decided to have a quick look.
Using the species exclusion rules described in the paper and square-root transformation of the species data, I can get non-crossvalidated performance statistics that are similar to those reported (RMSE = 0.83°C, r2 = 0.65).
But what of the crossvalidated performance, which is a far better (though still imperfect) guide to the actual model performance? With leave-one-out crossvalidation, the performance is much worse (RMSEP = 1.39°C, r2 = 0.15). The RMSEP is almost equal to the standard deviation of the environmental variable (sd = 1.4°C) and the r2 is probably the lowest I have seen for any published transfer function. This transfer function is not useful. The authors’ claim that the performance is “comparable to other transfer functions of this size” is a false statement.
Despite the transfer function’s lack of skill, the authors use it to reconstruct temperatures from the Lake Potrok Aike chironomid stratigraphy. Most chironomid workers report the minimum number of chironomid head capsules that they count in each sample. Typically this is 50 (or perhaps 30) head capsules. The minimum chironomid count sum is not reported in this paper: I suspect most readers would assume it was about 50. It isn’t. The count sum is reported in the fossil data: the median count sum is 17. Only the fossil samples with a count of two or fewer chironomids seem to have been omitted from the published fossil stratigraphy. To not report that count sums are so low, far below the typical minimum accepted, strikes me as disingenuous.
The fossil data consist of only four taxa [note to my copy editor – this is not a typo]. Of these Phaenopsectra is most abundant, with a median abundance of 50%, much higher than the maximum abundance of this taxon in the calibration set (16.4%). This naturally raises concerns that the reconstruction diagnostics will be poor.
Perhaps the most widely used transfer function diagnostic is analogue quality. With the usual rule-of-thumb that distances greater than the 10th percentile of taxonomic distances in the calibration set represent non-analogue assemblages, 81% of samples lack analogues. This is not good.
The paper does not report analogue quality, but it does report that “all samples of the Potrok Aike core added passively to CCA of the training set samples were within the 95% confidence interval.” This is presumably the residual length diagnostic. When I calculate this statistic, I find that 100% of fossil samples are outside the 95th percentile of residual lengths. This is not good.
This paper has been fairly well cited, but appears to be fatally flawed: the transfer function as ~no skill, the fossil counts are small and lack analogues in the calibration set. I would recommend that anyone who plans to cite it assure themselves that the conclusions of the paper adequately reflect the results.