Most reconstructions of past environmental conditions derived from chironomid assemblages with transfer functions use a calibration-in-space approach, in which the calibration set is a set of modern chironomid assemblages paired with modern environmental data. I am aware of three chironomid calibration-in-time reconstructions, where the calibration set is instead well-dated chironomid assemblages paired with historical environmental data.
I have already shown that the calibration-in-time reconstructions from Seebergsee and Silvaplana report the apparent performance rather than the cross-validated performance they claim to report, and that the actual skill is near zero. But what about the third reconstruction, from Nurmijärvi?
Nurmijärvi is a varved lake in southern Finland. The authors counted the chironomids at 1–11 year resolution over the instrumental period, which began in the 1830s, and reconstructed July air temperatures.
The paper reports that the calibration-in-time transfer function has an r2jack of 0.64 (Table 1). This is a fairly good performance given the limited temperature range in the calibration set. The paper also reports the correlation between the reconstructed and instrumental temperature as 0.51 (Table 2, Figure 4), which is equivalent to an r2 of 0.26. With a calibration-in-time model, the model r2 should be identical to the r2 between the reconstruction and the environmental data. It is not clear why they are not identical in this case.
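The reason the two numbers should match is not specific to chironomids: with inverse deshrinking, the final step of the model is an ordinary least-squares regression of the environmental variable on the initial estimates, and for such a regression the apparent r2 is exactly the squared correlation between the fitted reconstruction and the observations. A minimal sketch with invented toy numbers (nothing here comes from the Nurmijärvi data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins: x plays the role of the initial (shrunken) WA
# estimates, y the observed temperatures (both invented)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)

# Inverse deshrinking: regress y on x, reconstruct with the fitted line
b1, b0 = np.polyfit(x, y, 1)
pred = b0 + b1 * x

# Apparent model r2 (1 - SSE/SST) and the squared correlation between
# reconstruction and observations are the same quantity
r2_model = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
r2_corr = np.corrcoef(pred, y)[0, 1] ** 2
print(abs(r2_model - r2_corr) < 1e-9)  # True
```

So a model r2 of 0.64 alongside a reconstruction-observation r2 of 0.26, both computed on the same calibration-in-time data, is hard to explain.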
Since the authors have not replied to emails enquiring about this discrepancy, nor to requests for the raw data, I digitised the stratigraphic diagram and the temperature data. There are of course inaccuracies in the digitised data, and I have only the 38 common taxa (the stratigraphic diagram omits 21 rare taxa).
I find that the model the authors use, a tolerance-weighted weighted-averaging model with inverse deshrinking, has an r2 of 0.57. Not too far off the 0.64 reported. But this is the apparent statistic – the leave-one-out cross-validated r2 is only 0.13.
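The gap between apparent and cross-validated performance is easy to demonstrate. Below is a minimal sketch of a tolerance-weighted weighted-averaging model with inverse deshrinking, run with and without leave-one-out cross-validation on entirely synthetic assemblage data (the taxa, temperatures, and noise levels are all invented; this is an illustration of the method, not a reanalysis of the Nurmijärvi data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic calibration-in-time set: 60 "years" of July temperature and
# 20 taxa with noisy Gaussian responses (all values invented)
n, m = 60, 20
temp = rng.normal(15, 1.5, n)
optima = rng.normal(15, 2, m)
tol = rng.uniform(1, 3, m)
abund = np.exp(-((temp[:, None] - optima) ** 2) / (2 * tol ** 2))
abund += rng.uniform(0, 0.5, (n, m))            # noise
abund /= abund.sum(axis=1, keepdims=True)       # relative abundances

def twa_fit_predict(y_tr, x_tr, y_new):
    """Tolerance-weighted WA with inverse deshrinking (a sketch)."""
    u = (y_tr * x_tr[:, None]).sum(0) / y_tr.sum(0)              # optima
    t2 = (y_tr * (x_tr[:, None] - u) ** 2).sum(0) / y_tr.sum(0)  # tolerances^2
    t2 = np.maximum(t2, 1e-6)                                    # guard
    def initial(y):
        return (y * u / t2).sum(1) / (y / t2).sum(1)
    b1, b0 = np.polyfit(initial(y_tr), x_tr, 1)   # inverse deshrinking
    return b0 + b1 * initial(y_new)

def r2(obs, pred):
    return np.corrcoef(obs, pred)[0, 1] ** 2

apparent = r2(temp, twa_fit_predict(abund, temp, abund))
loo = np.array([twa_fit_predict(np.delete(abund, i, 0),
                                np.delete(temp, i),
                                abund[i:i + 1])[0]
                for i in range(n)])
print(f"apparent r2 = {apparent:.2f}, LOO r2 = {r2(temp, loo):.2f}")
```

With noisy data and a short temperature gradient, the apparent r2 is routinely much higher than the leave-one-out r2, which is exactly the pattern seen here: 0.57 apparent against 0.13 cross-validated.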
A model with an r2 of 0.13 has very limited utility, and this is before we consider the impact of autocorrelation and related problems.
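Autocorrelation matters because annually resolved series are not independent observations: two unrelated but persistent time series will show far larger chance correlations than white noise would. A quick illustrative simulation (the AR(1) coefficient and series length are arbitrary choices, not estimates from any real series):

```python
import numpy as np

rng = np.random.default_rng(2)

def ar1(n, phi):
    """Generate a simple AR(1) series with innovation variance 1."""
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = phi * x[i - 1] + rng.normal()
    return x

def null_r2(phi, reps=1000, n=60):
    """Null distribution of r2 between two independent AR(1) series."""
    return np.array([np.corrcoef(ar1(n, phi), ar1(n, phi))[0, 1] ** 2
                     for _ in range(reps)])

p_white = np.percentile(null_r2(0.0), 95)   # independent white noise
p_ar = np.percentile(null_r2(0.8), 95)      # strongly autocorrelated
print(f"95th percentile of null r2: white noise {p_white:.2f}, "
      f"AR(1) phi=0.8 {p_ar:.2f}")
```

The 95th percentile of the null r2 is several times larger for the autocorrelated series, so an already weak cross-validated r2 becomes even less impressive once the effective number of independent observations is accounted for.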
It appears to me that the authors are, like the authors of the Silvaplana and Seebergsee papers, reporting the apparent rather than the cross-validated performance. If the authors believe that imperfections in the digitised data are responsible for the lack of reproducibility, they only need to send me the data and I will update this post.