This is a rather technical post, entirely devoid of sarcasm, but including a menagerie of Greek letters, written in response to queries from Gavin and Sakari about h-block cross-validation.

The performance of transfer functions for reconstructing past environmental change from microfossil assemblages based on species-environment relationships in a modern calibration set of paired species and environmental data is usually assessed by cross-validation. The simplest cross-validation scheme, called leave-one-out cross-validation (LOO), is to omit each observation in turn from the calibration set and attempt to predict the environment at the omitted observation from the remainder of the calibration set. Key performance diagnostics include the correlation between the predicted and observed environmental variables, and the root mean square error of prediction (RMSEP). Crucially, LOO assumes that the observations are independent. If the observations in the calibration set are not independent, because of autocorrelation or other types of pseudo-replication, performance statistics based on LOO will be over-optimistic (Telford and Birks, 2005).
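To make the procedure concrete, here is a minimal Python sketch of LOO with a toy nearest-analogue predictor on synthetic data; `loo_rmsep`, `mat_predict`, and all of the data are invented for illustration and are not code from any transfer-function package.

```python
import numpy as np

def loo_rmsep(predict, X, y):
    """Leave-one-out cross-validation: omit each observation in turn,
    predict it from the remainder, and summarise the prediction errors."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                   # omit observation i
        preds[i] = predict(X[keep], y[keep], X[i])
    rmsep = np.sqrt(np.mean((preds - y) ** 2))     # root mean square error of prediction
    r = np.corrcoef(preds, y)[0, 1]                # predicted-observed correlation
    return rmsep, r

def mat_predict(X_train, y_train, x_new, k=5):
    """Toy analogue-style predictor: mean environment of the k most
    similar assemblages (plain Euclidean distance for simplicity)."""
    d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    return y_train[np.argsort(d)[:k]].mean()

# synthetic calibration set: 100 sites along an environmental gradient,
# four taxa with Gaussian responses to the environment
rng = np.random.default_rng(0)
env = rng.uniform(0, 25, 100)
species = np.column_stack([np.exp(-(env - opt) ** 2 / 20) for opt in (5, 10, 15, 20)])
species += rng.normal(0, 0.05, species.shape)

rmsep, r = loo_rmsep(mat_predict, species, env)
```

Because the synthetic sites here are spatially independent, LOO gives an honest answer; the inflation problem only appears once the observations are autocorrelated.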

This is a problem for many marine transfer functions, and for pollen-climate transfer functions. Most palaeolimnological transfer functions have little spatial structure in their calibration sets, so are not affected by this problem.

Burman et al (1994) extend LOO by omitting the h observations preceding and following the test observation to minimise the effects of autocorrelation. They call this procedure h-block cross-validation. Telford and Birks (2009) suggest that this scheme can be adapted for transfer functions by omitting all observations within h km of the test observation during cross-validation.
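A sketch of how the geographic version might be implemented, assuming great-circle distances and a toy analogue-based predictor; the function names and synthetic calibration set are mine, and a real analysis would use a proper dissimilarity coefficient (e.g. squared chord distance) rather than plain Euclidean distance on the species data.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance (km) from one site to arrays of sites."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def mat_predict(X_train, y_train, x_new, k=5):
    """Toy analogue predictor: mean environment of the k nearest assemblages."""
    d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    return y_train[np.argsort(d)[:k]].mean()

def hblock_rmsep(X, y, lat, lon, h_km):
    """h-block cross-validation: before predicting each test site, drop every
    calibration site within h_km of it (h_km = 0 reduces to LOO, since only
    the test site itself is at zero distance)."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = great_circle_km(lat[i], lon[i], lat, lon) > h_km
        preds[i] = mat_predict(X[keep], y[keep], X[i])
    return np.sqrt(np.mean((preds - y) ** 2))

# synthetic North-Atlantic-like calibration set: SST falls with latitude
rng = np.random.default_rng(1)
lat = rng.uniform(30, 60, 120)
lon = rng.uniform(-40, 0, 120)
sst = 25 - 0.4 * lat + rng.normal(0, 0.5, 120)
species = np.column_stack([np.exp(-(sst - opt) ** 2 / 10) for opt in (2, 6, 10, 14)])

rmsep_loo = hblock_rmsep(species, sst, lat, lon, h_km=0)
rmsep_800 = hblock_rmsep(species, sst, lat, lon, h_km=800)
```

Increasing h shrinks the pool of available analogues around each test site, which is exactly the mechanism that makes the performance estimate less optimistic.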

The problem is to select the optimal size of h. If h is too small, the test observation is not fully independent of the calibration set; if h is too large, information is unused and the performance estimates will be unduly pessimistic.

Burman et al (1994) solve this problem by adding a term to the estimated performance to correct for data underuse. With the addition of this correction term, which varies with the proportion of data excluded, the choice of h becomes much less critical. The method developed by Burman et al (1994) is only suitable for stationary data (i.e. the mean and variance do not vary with location), and further they caution that they “consider only processes having a short-range dependence. Although long-range dependence models have been proven useful in a variety of applications …, we do not know if our methodology is applicable.”

Can the methods outlined in Burman et al (1994) be adapted for use with transfer functions? We need to adapt the procedure to cope with two-dimensional data that are irregularly spaced, and develop correction terms for different transfer-function methods. Further, we need to be able to cope with non-stationary data. These challenges range from relatively simple to probably impossible.

In view of these difficulties, it is probably not possible to calculate a correction term, so instead we have to hope that we can select a value for h that gives approximately unbiased estimates of transfer-function performance. Even this is a difficult problem.

The species data Y are a function of the environmental variable of interest, x, with coefficients β, plus two residual terms: ξ, which is spatially autocorrelated, and ε, which is not.

Y_i = βx_i + ξ_i + ε_i

Autocorrelation in x is by itself harmless. It is the correlation in the residuals ξ that causes problems. These residuals are due to the influence of environmental variables other than x, and perhaps biotic factors such as dispersal limitation. We do not know a priori the spatial structure of ξ as we do not know the full set of environmental variables that affect the species assemblage, and their relative importance.

I want to propose three methods for selecting the optimal value of h for calculating approximately unbiased estimates of transfer-function performance. I’m going to demonstrate these with the North Atlantic planktonic foraminifera-sea surface temperature (SST) calibration set from the MARGO project (Kucera et al 2005). I am going to focus on the modern analogue technique (MAT) as this is the most widely used transfer-function method in oceanic and continental-scale pollen-climate transfer functions. It is also the method most susceptible to autocorrelation in the calibration set.

The first method is to select the value of h that gives a cross-validation performance equal to a spatially independent test set. Telford and Birks (2005) use the South Atlantic as a spatially independent test set for planktonic foraminifera-SST transfer-function models trained on a North Atlantic calibration set. They find the spatially independent RMSEP of a MAT model to be 1.87°C. This is equivalent to an h of ~850 km (Fig 1) (Note the calibration set used here is slightly different from that used by Telford and Birks (2005)).
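The matching itself is mechanical once the h-block RMSEP has been computed on a grid of h values: invert the (roughly monotone) RMSEP-versus-h curve at the independent test-set RMSEP. The numbers below are purely illustrative, not the MARGO results.

```python
import numpy as np

# h-block RMSEPs on a grid of h values (illustrative numbers only);
# RMSEP typically rises with h as good nearby analogues are excluded
h_grid = np.array([0, 200, 400, 600, 800, 1000, 1200])
rmsep_h = np.array([1.32, 1.48, 1.61, 1.73, 1.84, 1.93, 2.01])

target = 1.87          # RMSEP from a spatially independent test set
# rmsep_h is monotone increasing, so the curve can be inverted by
# linear interpolation: find the h at which RMSEP equals the target
h_star = np.interp(target, rmsep_h, h_grid)
```

With these illustrative numbers, `h_star` falls between the 800 and 1000 km grid points; note that `np.interp` requires its second argument to be monotone increasing, which holds here.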

This method is of limited use, as few calibration sets have suitable spatially independent test sets. Even those that apparently do may have cryptic species. However, the results of this method can be compared with the other methods as a check on their validity.

The second method, suggested by Telford and Birks (2009), is to use the range of a variogram of the detrended cross-validated residuals of a weighted averaging (WA) model as the estimate of h. The rationale is that weighted averaging is the transfer-function method most robust to spatial autocorrelation, so its residuals should approximate ξ better than those of other methods, which tend to incorporate this structure into the model. The WA residuals need detrending to remove features arising from model misspecification, such as edge effects. The residuals can be detrended with a locally weighted scatterplot smoother (LOESS), but we immediately face the problem of selecting the span of the smoother. Figure 2 shows LOESS smooths of the WA residuals with different spans; Figure 3 shows variograms of the residuals of these smooths. Fortunately, despite the wide range of smoothing applied, the ranges of the variograms are relatively similar, from 1105 km for span = 1 to 880 km for span = 0.05, suggesting that the result is not very sensitive to the span chosen.
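A rough sketch of the variogram-range step, assuming already-detrended residuals (the LOESS detrending is omitted to keep the example dependency-free). The residuals are simulated with a known practical range of ~800 km, the empirical variogram is binned by hand, and the range of a spherical model is found by grid search; `empirical_variogram` and `spherical_range` are my own illustrative helpers, and the maximum empirical semivariance serves as a crude sill estimate.

```python
import numpy as np

def empirical_variogram(resid, coords, n_bins=15, max_dist=2000.0):
    """Binned empirical semivariogram of residuals (Euclidean distances in km)."""
    i, j = np.triu_indices(len(resid), 1)
    d = np.sqrt(((coords[i] - coords[j]) ** 2).sum(-1))
    sv = 0.5 * (resid[i] - resid[j]) ** 2
    edges = np.linspace(0, max_dist, n_bins + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    gamma = np.array([sv[(d >= lo) & (d < hi)].mean()
                      for lo, hi in zip(edges[:-1], edges[1:])])
    return centres, gamma

def spherical_range(centres, gamma):
    """Grid-search fit of a no-nugget spherical variogram model,
    gamma(h) = sill * (1.5u - 0.5u^3) with u = min(h/a, 1)."""
    sill = np.nanmax(gamma)                        # crude sill estimate
    best_sse, best_a = np.inf, None
    for a in np.linspace(50, 2000, 200):           # candidate ranges in km
        u = np.minimum(centres / a, 1.0)
        model = sill * (1.5 * u - 0.5 * u ** 3)
        sse = np.nansum((model - gamma) ** 2)
        if sse < best_sse:
            best_sse, best_a = sse, a
    return best_a

# simulate detrended residuals with a known practical range of ~800 km
rng = np.random.default_rng(3)
coords = rng.uniform(0, 3000, (200, 2))
d = np.sqrt(((coords[:, None] - coords[None]) ** 2).sum(-1))
resid = np.linalg.cholesky(np.exp(-3 * d / 800) + 1e-8 * np.eye(200)) @ rng.normal(size=200)

centres, gamma = empirical_variogram(resid, coords)
a_hat = spherical_range(centres, gamma)            # estimate of h
```

A real analysis would fit the variogram model properly (weighted least squares, with a nugget) and would use great-circle rather than Euclidean distances; the sketch only shows the shape of the calculation.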

The third method uses simulated environmental variables with the same autocorrelation structure as the observed environmental variable. For each of the simulated variables, the h-block cross-validation r² is calculated for different values of h. As expected, MAT models appear able to reconstruct these simulated variables, with the models having a high r² when h is small (Fig 4), and performance worsening as h increases. Part of this performance is due to autocorrelation in ξ, and part to the inherent correlation between many of the simulated variables and SST (Fig 5). If the latter term dominates, the cross-validation r² should be no greater than the r² of the correlation between the simulated variables and observed SST. At small values of h, the cross-validation r² is much higher than the simulated-observed SST r², indicating that ξ is important in inflating the performance of these models (Fig 6). As h increases, the cross-validation r² approaches the simulated-observed SST r², before falling below it as the models are prevented from using more and more good analogues. The minimum sum of squares between the two sets of r² occurs when h is ~1200 km (Fig 7).

The three suggested methods are in fairly good agreement, perhaps surprisingly so given their different motivations. Methods one and three can be applied to any transfer-function procedure. It is not obvious which procedures method two can be applied to, as it is possible that the optimal h is method dependent. This needs to be investigated.

These methods seem to perform well for the planktonic foraminifera-SST transfer function. This calibration set is dominated by the SST gradient: it is essentially a one-dimensional data set. I need to explore how the methods will work in data sets with two or more strong dimensions, for example a pollen-climate calibration set with strong precipitation and temperature gradients. I predict that method three might have problems. The planktonic foraminifera-SST relationship is well established and replicated in multiple ocean basins. Not all transfer functions rest on such secure footing; some are likely entirely spurious, for example the pollen-sunshine transfer function of Fréchette et al (2008). I need to test how well these methods work with such models.

When I have completed these tests, I plan to rewrite this post and submit it for publication. Comments and suggestions are most welcome.

Sometimes it might be useful to use h-block cross-validation to test the relative sensitivity of different transfer-function models to the availability of neighbouring sites during cross-validation, rather than to try to calculate an approximately unbiased estimate of model performance. In this case, any reasonable value of h can be used. RNE plots, which show how performance declines when sites are omitted during cross-validation 1) at random, 2) from the geographic neighbourhood around the test site, or 3) from the environmental space around each site (Telford and Birks, 2009), are probably more useful and interpretable (but grindingly slow to run).
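The three deletion schemes behind an RNE-style comparison are easy to state in code. These helper functions are hypothetical sketches (not the implementation of Telford and Birks, 2009): each returns, for a given test site, the mask of calibration sites retained under random, neighbourhood, or environmental deletion.

```python
import numpy as np

def delete_random(i, n_del, d_geo, d_env, rng):
    """Omit the test site i plus n_del calibration sites chosen at random."""
    keep = np.ones(d_geo.shape[0], bool)
    keep[i] = False
    keep[rng.choice(np.flatnonzero(keep), n_del, replace=False)] = False
    return keep

def delete_neighbourhood(i, n_del, d_geo, d_env, rng):
    """Omit the n_del + 1 geographically closest sites to the test site
    (the closest is the test site itself, at zero distance)."""
    keep = np.ones(d_geo.shape[0], bool)
    keep[np.argsort(d_geo[i])[:n_del + 1]] = False
    return keep

def delete_environment(i, n_del, d_geo, d_env, rng):
    """Omit the n_del + 1 sites closest to the test site in environmental space."""
    keep = np.ones(d_env.shape[0], bool)
    keep[np.argsort(d_env[i])[:n_del + 1]] = False
    return keep

# toy distance matrices; rng is unused by two schemes but kept for a uniform interface
rng = np.random.default_rng(5)
coords = rng.uniform(0, 1000, (50, 2))
env = rng.normal(size=50)
d_geo = np.sqrt(((coords[:, None] - coords[None]) ** 2).sum(-1))
d_env = np.abs(env[:, None] - env[None, :])

masks = [scheme(7, 10, d_geo, d_env, rng)
         for scheme in (delete_random, delete_neighbourhood, delete_environment)]
```

Running a full cross-validation under each scheme for an increasing number of deletions, and plotting performance against deletions, gives the RNE plot; the random scheme provides the baseline against which the other two are judged.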
