I didn’t intend to write about Guiot and de Vernal (2011) as I have already published a comment on this paper, but new papers on Arctic sea-ice reconstructions are dismissing my work on transfer functions and citing Guiot and de Vernal (2011) as evidence.

Below, text from Guiot and de Vernal (2011), hereafter GdV11, is in black, my comments are in red. Wary of the length of this post, and copyright issues, I’ve deleted sections of GdV11 that are either uncontentious, repetitive or subject only to nit picking.

**Guiot, J. & de Vernal, A. (2011) Is spatial autocorrelation introducing biases in the apparent accuracy of paleoclimatic reconstructions? Quaternary Science Reviews 30, 1965–1972.**

## Abstract

We address the issue of spatial autocorrelation, an occurrence that may introduce biases in the evaluation of the performance of transfer functions, by using two fundamentally different approaches, one based on calibration (weighted averaging partial least squares regressions; WA-PLS) and the other based on similarities (modern analogue technique; MAT). The tests were made after spatial standardization of 700 North Atlantic surface data points, which include 29 dinocyst taxa and 4 climate parameters. The evaluation of transfer function performance was made by defining a verification dataset that was gradually isolated from the calibration or comparison datasets. Although strong spatial autocorrelation characterizes the original climate parameter distribution, the results show that the spatial structure of data has relatively low effect on the calculation of the error of prediction. They also show that the performances of MAT are generally better than those of WA-PLS, with lower error of prediction. The better performance of MAT in the present study can be explained by the non-modal distribution of salinity and temperature in the studied marine environments, which is not appropriate for the application of WA-PLS. The two methods yield equivalent results about the spatial structure of the residuals based on empirical semi-variograms. The analyses we performed include the comparison of reconstructions based on original raw data and gridded data. Results suggest that the gridding of the reference database may reduce the noise and thus improve the performance of the techniques.

## 1. Introduction

[…] In order to evaluate the performance of the most commonly used transfer function techniques, comparative validation exercises were made by several authors (for e.g., Kucera et al., 2005 and Guiot and de Vernal, 2007). These exercises led to the conclusion that the modern analogue technique (MAT), the revised analogue method (RAM) and the artificial neural network (ANN) approach yield the best performances in paleoceanography.

RAM is an interesting idea, that appears to have very good performance. Unfortunately this is because the implementation does not cross-validate properly. This flaw is described in Telford et al (2005). Guiot and de Vernal (2007) cite this paper and acknowledge this flaw, so it is not clear why Guiot and de Vernal are now praising this flawed method.

Among these three techniques, MAT and RAM appear particularly useful because they do not rely on quantitative relationships between the assemblages, which may introduce biases in a context of extrapolation.

What is a dissimilarity metric, the heart of MAT, if not a metric of the quantitative relationship between assemblages?

Moreover, MAT and RAM permit the identification of non-analogue situations and can be used for simultaneous reconstructions of several parameters.

This is a non-issue: it is standard practice to use MAT to identify non-analogue assemblages when using WAPLS or other transfer function methods.

MAT and RAM are thus the most commonly used transfer function techniques in paleoceanography and paleoclimatology. These techniques are however criticized by Telford (2006) and Telford and Birks (2009) who claimed, on the basis of comparative results from the weighted averaging-partial least squares regressions (WA-PLS), that spatial autocorrelation leads to the identification of close geographical (and environmental) analogues, even in the case of weak relationships between the proxy and the environmental parameters.

This is correct, except that the relative performance of MAT and WAPLS makes little contribution of the conclusions in Telford (2006) and Telford and Birks (2009). Instead the conclusions are mainly based on the geographic location of neighbours, the apparently good performance of transfer functions for simulated environmental variables, and the effect on model performance of leaving geographic neighbours out of the analysis. In claiming that our results are based on the relative performance of MAT and WAPLS, rather than these other factors, GdV11 misrepresents our work. Most of the rest of the GdV11 addresses this strawman version of our work rather than engaging with the real substance.

[…]

Three different series of tests were made to verify if the assertions of Telford and Birks (2009) are justified, and if the uncertainties estimated by MAT are really underestimated. The first series of tests was based on defining a verification dataset, which is isolated from the calibration dataset with a progressively enlarged area. The mean errors were then analyzed according to the size of the isolation belt. In order to avoid biases related to spatial heterogeneity in the dataset, we did not use the raw data but instead a geographically gridded dataset resulting in a spatially even density of samples (see Fig. 1). Both MAT and WA-PLS methods were then applied for comparison. The second series of tests was designed to verify if results are comparable by replacing, partly or totally, the gridded datasets with raw data. Assuming spatial consistency of climatic data, it is expected that the gridding operation will reduce the noise and thus the prediction errors. The last test we made aimed at examining the spatial structure of the residuals according to the empirical semi-variogram procedure proposed by Telford and Birks (2009).

## 2. Data and methods

### 2.1. Raw data and gridded data

[…]

Description of the n1171 dinocyst data set – available to download from http://www.geotops.ca. The dinocyst community has a good tradition of releasing much of their data.

[…]

Description of the gridding procedure used. This analysis adds a certain novelty to GdV11, and has some desirable properties, such as dealing with very unevenly spaced data. However, it is largely irrelevant to the aim of GdV11 – to investigate the importance of spatial autocorrelation – and not obvious how it could be used to make reconstructions. I will therefore ignore it as far as possible.

### 2.2. The verification dataset and its isolation belt

The selection of a verification dataset is a difficult task. A random selection as proposed by Telford and Birks (2009) may give, at least partially, independent samples without spatial autocorrelation. Another way to define the verification dataset is to use geographically distinct data, as in Kucera et al. (2005) who have used planktonic foraminifer data from the South Atlantic to make reconstruction from North Atlantic assemblages. However, when using different regions for the evaluation of the techniques, the occurrence of endemic taxa may constitute a limitation, not to mention cryptic species (e.g., Darling et al., 2006). Using a similar strategy, Guiot and de Vernal (2007) examined the performances of transfer functions in extreme conditions, with verification data taken at the climatic end members of the dataset. An extreme cold verification set was taken in the coldest part of the database to test the behavior of the various transfer functions approaches during glacial periods, whereas extreme warm verification samples were used to test the behavior of the transfer function during interglacial stages. This kind of test may yield unsuccessful results because of the exclusion of reference samples as analogues and for calibration.

The test described here was Guiot and de Vernal’s first flawed attempt to try to prove that autocorrelation was not an issue. It was instead a test of how well the models extrapolate. Subsequent flawed attempts include Fréchette et al (2008), who ran a version of h-block resampling that considered only latitude, not longitude; Bonnet et al (2010) who use a validation set that is not spatially independent of the calibration set, so the analysis has precisely zero power to detect the influence of spatial autocorrelation; and now GdV11.

Thus, the more suitable solution is probably the use of an h-block cross-validation scheme, where each verification sample is isolated from the calibration dataset by avoiding all the samples located in a given radius of h-km to be included in the calibration dataset. This is what was proposed by Telford and Birks (2009) and by Guiot and Corona (2010) who analyzed temporally auto-correlated data.

It is not clear why GdV11 did not simply use h-block resampling. It would have been easier to interpret, although of course the results would not have been very favourable to their desired conclusions. Code for h-block resampling is in my palaeoSig R package.

The method we propose here is a compromise between the two approaches mentioned above. We isolated a region around a reference point in an intermediate zone of the climatic space and we defined a belt around this point, which is located at 10°E and 58°N. All samples from the region within the belt were excluded from the calibration dataset (Fig. 1b). […]

The area of the belt was defined based on a kilometric distance, where the latitude is multiplied by 4 in order to avoid dominance of longitudinal distribution in the distance index:

Ad hoc or what!

[…]

– The verification dataset, *d _{io}* < 1000 km, contains 38 virtual samples and 35 raw samples.

– The calibration datasets, with incremental distances corresponding to *d _{io}* > 1000, 1250, 1500, 1750, 2000, 2500, 3000 km.

The first calibration dataset, with a distance of 1000 km, has no isolation belt, whereas the last calibration dataset, with a threshold of 3000 km, has a large isolation belt (see Fig. 1b).

This is a hopeless test. It is a compromise only in that it is not as absolutely hopeless as the test used by Guiot and de Vernal (2007). This test cannot examine the influence of autocorrelation at short distances – Telford and Birks (2009) only consider distances up to 200 km in their main analysis on the dinocyst data set.

In order to verify that the choice of a single validation zone does not lead to biased conclusions due to the site selection, we have performed an additional test with 35 samples randomly selected. An isolation belt of 300 km was defined around each of these 35 samples used for verification. All other samples located outside of these 35 isolation belts were used for calibration. The resulting calibration dataset included 519 samples and 280 samples were discarded to avoid any autocorrelation effect. This method is similar to the h-block cross-validation method proposed by Telford and Birks (2009).

This had the potential to be a more useful analysis. Unfortunately GdV11 only compare the WAPLS and MAT h-block performance, not the standard cross-validation performance. A comparison that might have actually addressed the question in the title.

### 2.3. The two methods of climate reconstruction

[…] Uncontentious description of the Modern Analogue Technique (MAT) and Weighted Averaging Partial Least Squares (WAPLS).

MAT and WA-PLS are fundamentally different approaches, which has important implications. With MAT, estimates are sensitive only to the local taxonomic space (Telford, 2006), which means that they depend only on the selected analogues. To the opposite, the regression-based estimates, such as WA-PLS, are sensitive to the whole taxonomic space as they depend on the relationship calculated on the whole calibration dataset. In the case of WA-PLS, one critical point is the assumption of unimodal distribution in the environmental space.

This paragraph is true except for the last sentence. The assumption that WAPLS requires a unimodal distribution of sites was imagined by GdV11. In reality, WA based methods assume that the data are evenly distributed along the gradient. This has been known since work by ter Braak in the 1980s. Just before GdV11 was published, I published a paper exploring the sensitivity of different transfer function methods to an uneven distribution of observations along the gradient (Telford and Birks 2011): maximum likelihood is least sensitive, WAPLS is moderately sensitive, and MAT is very sensitive.

### 2.4. The evaluation of the methods

Both MAT and WA-PLS methods were evaluated according to four statistical parameters.

(1) The coefficient of determination (*R*^{2}), which measures the ratio between the climatic variance explained by the species and the total climatic variance, both quantities being calculated on the calibration dataset.

(2) The root mean squared error (RMSE), which is the square-root of the mean squared residuals (differences between observations and reconstructions), on the calibration dataset.

(3) The root mean squared error of prediction (RMSEP), which is the square-root of the mean squared residuals, on the verification dataset.

(4) Finally, we used another statistical parameter (%*P*), which compares the RMSEP to the total variance of the climatic variable (VC): %*P* = 100 (1 − RMSEP^{2}/VC)

This is a statistic without much utility – it is just a scaled version of RMSEP, adding nothing. If used for the calibration data set, then it is equivalent to R^{2} if bias is low.

[…]

2.5. Noisy raw data vs smoothed data

[…]

### 2.6. The spatial structure of the residuals

Species and climatic variables usually have a spatial structure, which results in correlation of closely located sampling points. In the hypothesis of a relationship between species and climate parameters, the spatial distribution of species should depend on the spatial pattern of climate parameters. Consequently, the residuals (difference between observations and reconstructions) of the transfer function should not have spatial structure. In the case of spatial structure in the residuals, we have to admit that the species distribution is also related to environmental variables that are not taken into account in the dataset. Geostatistics include several tools to describe the spatial structure of a given variable. One of the most commonly used approaches is based on empirical semi-variograms […]

## 3. Results and discussion

[…] Text about the gridding

MAT and WA-PLS were applied by gradually isolating the verification dataset in zone centered around 10°E−58°N from the calibration data. The zone is defined within a distance of 1000 km and contains 38 grid points. The reconstruction of climate parameters from the samples outside the limit distance of 1000 km provides the calibration statistics (*R*^{2} and RMSE; see Table 2), whereas the reconstructions for the 38 grid points inside the 1000 km limit provide verification statistics (RMSEP and %*P*; see Table 2). The results show that the RMSEP are markedly higher than the RMSE even for the first test with a prescribed limit of 1000 km.

Well obviously, this is exactly what is expected. Most of the samples within the validation set off the coast of Norway are far (>100 km) from the edge of the validation set, so they have no geographically close neighbours in the calibration set. Without the ability to select geographically close analogues the predictions for the validation set are expected to be worse than for the calibration set where geographic neighbours can be selected as analogues. Autocorrelation in action.

When the isolation zone size is enlarged by gradually increasing the distance from the central point, *R*^{2} remains approximately constant with both methods for the four climatic variables. We thus focused on the RMSEP, which documents the prediction power of the method (Fig. 3). This power is expected to decrease with the distance when the isolation zone increases. This is what we observed for the temperature variables, except in the case of winter temperature reconstructed with MAT. The RMSEP of the salinity variables remains relatively uniform despite increasing distance. The comparison of both methods shows that MAT performs better than WA-PLS for salinities. As for the temperatures, the curves of performance of both methods interlace. WA-PLS seems to perform as well as MAT for the reconstruction of summer temperature.

This is the performance after autocorrelation has been taken care of by removing most or all of the geographically close analogues. Telford and Birks (2009) make no prediction about this, and this finding has no relevance for the performance of the calibration set under standard cross-validation. The new papers relying on GdV11 use standard cross-validation methods.

[…] Gridded

In order to verify that the results presented above are not dependent upon the geographical particularities of the test samples, we have applied an h-block cross-validation procedure on a set of 35 samples randomly distributed and isolated by a belt of 300 km. Such a random distribution probably covers a wider range of temperature and salinity values than the test samples around 10°E−58° N. Results are summarized inTable 3, which shows that MAT outperforms WA-PLS for SST. For winter salinity, both methods are equivalent, and for summer salinity, WA-PLS yields slightly better results.

Again, it would have been better to run leave-one-out h-block resampling. And again the results are irrelevant to the question the paper is addressing. Had the paper asked which method performs best once autocorrelation has been dealt with, this material would be relevant, but it has no bearing on the importance of autocorrelation. To test that question, GdV11 would need to compare the h-block results with the standard leave-one-out cross-validation performance.

As a final set of tests to evaluate the effect of spatial autocorrelation, we analyzed semi-variograms of the residuals between observations and estimates from MAT and WA-PLS and the half squared difference between pairs of observations (Fig. 4). The differences between pairs of observed climatic variables (OBS, blue curves) are low for short distances (<500 km), which illustrates similar climatic values for geographically close sites and strong spatial consistency. […]

The oceanic environmental variables are spatially structured. What a surprise.

The residuals of reconstructions from transfer functions have a much lesser spatial structure than the climate parameters, but they are still characterized by higher common variances at sites with small distances. As a consequence, the semi-variograms of the residuals show a more horizontal shape than those of climatic parameters, especially those from MAT (note that the log-scale exaggerates this structure). It appears also that the semi-variograms from MAT show lower values than those from WA-PLS as a result of a better fit of reconstructed values, whatever the distance or the climatic parameter.

In summary, the results demonstrate that the residuals of the climate reconstructions have much less spatial structure than the initial climate variable themselves. The results also show an equivalent behavior of MAT and WA-PLS with respect to the spatial structure of data.

This paragraph contradicts the previous. The two models obviously do not have equivalent behaviour if MAT shows lower values than WAPLS. MAT is giving more smaller residuals because it is biased by the autocorrelation.

In contrary to what has been claimed by Telford and Birks (2005), WA-PLS does not perform better than MAT, and spatial autocorrelation does not bias the apparent reliability of MAT results more than those of WA-PLS.

Fantasy.

[…] the distribution of summer SST in the reference dinocyst database is unimodal, which makes it appropriate for the use of WA-PLS. To the contrary, the salinity and winter SST distributions are highly dissymmetric, with high numbers of data points at the oceanic end member (35.5 psu) and close to freezing points (−1.8 °C). Hence, the criterion of modal distribution is not satisfied, which prevents an optimal performance of WA-PLS, whereas this has no effect on MAT.

As discussed above, the requirement that observations are unimodally distributed for WAPLS is imaginary. I presume that GdV11 are confusing the assumption that the species have a unimodal relationship with the environment with some requirement about sampling strategy. The severely uneven distribution of observations for salinity and winter SST will bias estimates of RMSEP, especially for MAT.

## 4. Conclusion

[…]

Because transfer function results are not unequivocal, the reliability of the tracers as proxies of given parameters is often challenged and the performance of the transfer function techniques are a matter of debates. Some of the debates are sterile, contributing to confusion in the scientific community, and discrediting the progress made by many scientists in the field of paleoclimatology.

Nothing discredits quantitative palaeoecology more than ridiculously bad papers such as GdV11 being published in a prestigious journal. GdV11 represents a major failing of the editorial process at Quaternary Science Reviews.

Other debates are fundamental and deserve special care.

Exactly. Hence my criticisms here.

In this study, we have addressed the issue of spatial autocorrelation, which is often the matter of debate in biogeography (e.g., Hawkins et al., 2007) and which was used byTelford (2006) to cast doubts on the reliability of the modern analogue technique (MAT) in paleoclimatology and paleoceanography. The argument of Telford (2006) was initiated from the appraisal of MAT by the means of comparisons with WA-PLS, which is a transfer function based on a fundamentally different approach, using calibrations between assemblages and climate parameters instead of similarities between assemblages.

Again, this is a misrepresentation of my work. GdV11 does not engage with the substance of Telford (2006) or Telford and Birks (2011). GdV11 does not test whether autocorrelation is a problem.

[…]

The autocorrelation debate using dinocyst populations had the advantage of drawing the attention of paleoclimatologists on the difficulty of working with spatial data, and consequently to devise a correct experimental plan for uncertainty assessment.

See Telford and Birks (2011) for some methods that help address this uncertainty assessment.

The problem of autocorrelation is particularly critical for methods based on calibration, when coefficients are calculated by least squares or other techniques. The calibration (or training) process guarantees that the estimates are as close as possible to the observations. The further step is then to verify if the transfer function has a predictive power. This can be achieved only by using an independent verification dataset. With MAT, there is no calibration and the main point is to verify if the assemblage really contains the climatic information we aim to reconstruct. The basic assumption is that two similar assemblages must provide close estimates. The spatial autocorrelation is then a secondary point. The fact that there is no need of calibration is the real strength of the method. This point alleviates the necessity to do an independent verification.

This is nonsense.

[…]

**Summary**

GdV11 is dreadful . Nowhere does it test if transfer function cross-validation estimates are overoptimistic when the environment is spatially autocorrelated. Nowhere does it test if this bias is worse for MAT than WAPLS. These are the important problems, and it is easy to demonstrate that they are real problems, for example by running cross-validation with different sized neighbourhood exclusion zones around the test observation, or by showing that simulated environmental variables can be reconstructed if the calibration set is spatially autocorrelated.

Instead of addressing these problems, GdV11 provides an irrelevant discussion of which method performs best once autocorrelation has been dealt with. Frankly who cares? The new papers by de Vernal et al certainly don’t – they continue to use standard cross-validation without dealing with the autocorrelation to evaluate the sea-ice transfer function, citing GdV11 as evidence that autocorrelation is not a problem.

How serious are the effects of autocorrelation on the sea ice transfer function? I don’t know, and nor do de Vernal et al as it has not been tested. It might just be making their estimates of uncertainty a shade optimistic, or it might be much worse. Nobody knows.

Next up, the awful reply Guiot and de Vernal (2011b) wrote to my comment.

Pingback: Dinocysts and autocorrelation redux: Guiot and de Vernal’s (2011) reply | Musings on Quantitative Palaeoecology

Pingback: Requiem for a transfer function | Musings on Quantitative Palaeoecology