## Testing testate amoeba: some comments on Amesbury et al (2018)

Today, I have been reading and thinking about a new paper presenting a huge testate-amoeba calibration set for reconstructing water table depth in bogs (Amesbury et al 2018). This calibration set, with almost 2000 samples, is the synthesis of many smaller calibration sets after the inevitable taxonomic harmonisation. In general, I like the paper, but I have some comments.

### Data

All the raw testate amoeba counts are available from Neotoma, and the harmonised data are available from Mendeley. This is excellent: I will be using these data and the large amount of fossil testate amoeba data in Neotoma in a forthcoming manuscript. Plus one citation to the authors. Now if only the authors had archived the code as well.

### Reproducibility

I have not attempted to reproduce all aspects of the paper. I can replicate the performance of the weighted-averaging partial least squares (WA-PLS) and weighted averaging with tolerance downweighting (WA-tol) models with the full dataset, but get a slightly higher root mean squared error of prediction (RMSEP) with the pruned dataset.

### A large calibration set

The calibration set is about an order of magnitude larger than what might usually be thought of as a large calibration set. This has some interesting consequences. First, the apparent performance and the cross-validated performance are very similar (WA-tol RMSE = 10.04 cm, RMSEP = 10.20 cm); a large difference between the two would hint at over-fitting. Second, WA-tol is the best model. Theoretically, WA-tol should be a better model than ordinary WA, as taxa with wide niches are given less weight than taxa with narrow niches, but in practice, with calibration sets of the usual size, the niche width cannot be estimated well enough and performance is worse. Third, with such a large dataset, and with no evidence of over-fitting, leave-one-out and leave-one-site-out cross-validation have similar performance (except for the modern analogue technique).
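To make the comparison between apparent and cross-validated performance concrete, here is a minimal Python sketch of WA and WA-tol with leave-one-out cross-validation. It runs on simulated counts, not the Amesbury et al data. The function names, the simulated gradient, the use of inverse deshrinking and the treatment of zero tolerances are all assumptions made for illustration.

```python
import numpy as np

def fit_wa(Y, x, tol_dw=False):
    """Fit weighted averaging (WA), optionally with tolerance downweighting."""
    totals = Y.sum(axis=0)
    present = totals > 0                       # taxa seen in the training data
    optima = np.zeros(Y.shape[1])
    optima[present] = (Y[:, present] * x[:, None]).sum(axis=0) / totals[present]
    weights = present.astype(float)            # absent taxa get zero weight
    if tol_dw:
        dev2 = (x[:, None] - optima) ** 2
        var = np.zeros(Y.shape[1])
        var[present] = (Y[:, present] * dev2[:, present]).sum(axis=0) / totals[present]
        # Assumption: taxa with zero estimated tolerance get the mean squared tolerance.
        var[present & (var == 0)] = var[var > 0].mean()
        weights[present] = 1.0 / var[present]
    initial = (Y * weights * optima).sum(axis=1) / (Y * weights).sum(axis=1)
    slope, intercept = np.polyfit(initial, x, 1)   # inverse deshrinking (assumed)
    return optima, weights, slope, intercept

def predict_wa(model, Y):
    """Reconstruct the environmental variable for the rows of Y."""
    optima, weights, slope, intercept = model
    initial = (Y * weights * optima).sum(axis=1) / (Y * weights).sum(axis=1)
    return intercept + slope * initial

def rmse(pred, obs):
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def loo_rmsep(Y, x, tol_dw=False):
    """Leave-one-out RMSEP: refit without sample i, then predict sample i."""
    n = len(x)
    pred = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit_wa(Y[keep], x[keep], tol_dw)
        pred[i] = predict_wa(model, Y[i:i + 1])[0]
    return rmse(pred, x)

# Simulated calibration set: Gaussian response curves along a water-table gradient.
rng = np.random.default_rng(1)
n_samples, n_taxa = 200, 20
wtd = rng.uniform(0, 50, n_samples)                 # hypothetical depths, cm
true_optima = rng.uniform(-5, 55, n_taxa)
true_tol = rng.uniform(5, 15, n_taxa)
expected = 30 * np.exp(-0.5 * ((wtd[:, None] - true_optima) / true_tol) ** 2)
counts = rng.poisson(expected).astype(float)
nonempty = counts.sum(axis=1) > 0                   # drop empty samples, if any
counts, wtd = counts[nonempty], wtd[nonempty]

results = {}
for name, tol_dw in (("WA", False), ("WA-tol", True)):
    apparent = rmse(predict_wa(fit_wa(counts, wtd, tol_dw), counts), wtd)
    results[name] = (apparent, loo_rmsep(counts, wtd, tol_dw))
    print(f"{name}: apparent RMSE = {results[name][0]:.2f}, LOO RMSEP = {results[name][1]:.2f}")
```

Leave-one-site-out cross-validation would follow the same pattern, dropping all samples from one site at a time instead of a single sample.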

### Model selection

Trying several transfer function models, and multiple versions of each model, is fairly standard practice in palaeoecology. It is a type of cherry picking that can bias performance estimates; see Telford et al (2004) for a test of this problem and a solution.

### An unreported statistic

One useful diagnostic is the ratio of the constrained eigenvalue to the first unconstrained eigenvalue in an ordination constrained just by the environmental variable of interest. Ideally this ratio should exceed one, suggesting that the variable of interest is, or is correlated with, the most important gradient in the data. Using a CCA constrained by water table depth on square-root transformed data, the ratio $\lambda_1/\lambda_2$ is 0.6. This is not very impressive, and will make reconstructions vulnerable to changes in variables other than the variable of interest. I’m not sure which environmental variables are correlated with the first unconstrained axis.
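For readers who want to see what goes into $\lambda_1/\lambda_2$, the following is a minimal numpy sketch of a CCA with a single constraining variable, run on simulated counts. It is written from the standard chi-square-residual formulation of CCA and is not the code behind the 0.6 reported above. The function name and the simulated data are assumptions for illustration.

```python
import numpy as np

def cca_single_constraint(Y, x):
    """Return (lambda_1, lambda_2) for a CCA of Y constrained by x alone.

    lambda_1 is the constrained eigenvalue; lambda_2 is the first
    unconstrained (residual) eigenvalue. Rows and columns of Y must
    have positive totals.
    """
    P = Y / Y.sum()
    r = P.sum(axis=1)                              # row weights
    c = P.sum(axis=0)                              # column weights
    expected = np.outer(r, c)
    Q = (P - expected) / np.sqrt(expected)         # chi-square residuals
    # Centre x by its weighted mean, weight by sqrt(r), scale to unit length.
    z = np.sqrt(r) * (x - np.sum(r * x))
    z /= np.linalg.norm(z)
    proj = z @ Q                                   # weighted regression on one variable
    lam1 = float(np.sum(proj ** 2))
    resid = Q - np.outer(z, proj)
    lam2 = float(np.linalg.svd(resid, compute_uv=False)[0] ** 2)
    return lam1, lam2

# Simulated assemblages along a hypothetical water-table gradient.
rng = np.random.default_rng(2)
wtd = rng.uniform(0, 50, 150)
opt = rng.uniform(-5, 55, 15)
tol = rng.uniform(5, 15, 15)
counts = rng.poisson(30 * np.exp(-0.5 * ((wtd[:, None] - opt) / tol) ** 2)).astype(float)
rows = counts.sum(axis=1) > 0
cols = counts[rows].sum(axis=0) > 0
Y = np.sqrt(counts[rows][:, cols])                 # square-root transformation
lam1, lam2 = cca_single_constraint(Y, wtd[rows])
ratio = lam1 / lam2
print(f"lambda_1 = {lam1:.3f}, lambda_2 = {lam2:.3f}, ratio = {ratio:.2f}")
```

A ratio well below one on real data means that the largest residual gradient explains more of the assemblage variation than the constraining variable does.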

### Pruning

Many data sets have outliers that can degrade model performance. There can be good reasons for deleting them, for example, in the Norwegian chironomid calibration set, a few lakes were cooled by long-lasting snow beds, so the lake water was colder than expected for the air temperature. As is common practice with testate amoeba transfer functions, Amesbury et al adopt a much more aggressive approach to outliers. They prune any observations that have a transfer function residual of more than 20% of the environmental gradient. This will inevitably give a boost to the model performance statistics. What is less clear is whether this gives better reconstructions.

One possibility is that testate amoebae have a rather weak relationship with water table depth, and that the pruning artificially strengthens it. Many of the samples pruned are at the ends of the water table depth gradient, so pruning, in part, addresses edge effects. Another possibility is that the water table depth measurements are not very reliable (often they are spot measurements and could change after a storm), and pruning addresses this.
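The pruning rule is simple enough to sketch in a few lines of Python. The example below uses simulated observed and predicted values, and the function name and noise level are assumptions for illustration. It recomputes the RMSE only on the retained samples, whereas in practice the model would be refitted after pruning. Even so, it shows why the statistics must improve: every sample removed has a larger residual than every sample kept.

```python
import numpy as np

def prune_mask(observed, predicted, fraction=0.2):
    """True for samples whose absolute residual is within `fraction` of the gradient length."""
    gradient_length = observed.max() - observed.min()
    return np.abs(predicted - observed) <= fraction * gradient_length

rng = np.random.default_rng(3)
observed = rng.uniform(0, 60, 500)                 # hypothetical water table depths, cm
predicted = observed + rng.normal(0, 8, 500)       # hypothetical transfer function predictions

keep = prune_mask(observed, predicted)
rmse_all = float(np.sqrt(np.mean((predicted - observed) ** 2)))
rmse_pruned = float(np.sqrt(np.mean((predicted[keep] - observed[keep]) ** 2)))
print(f"kept {keep.sum()} of {keep.size} samples")
print(f"RMSE before pruning = {rmse_all:.2f}, after pruning = {rmse_pruned:.2f}")
```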

### Significance tests

Amesbury et al use the reconstruction significance test I developed. They apply it to testate amoeba stratigraphies from two bogs. The reconstruction from Lac Le Caron is not significant with any version of their transfer function. The authors conclude:

> Given that the efficacy of the ‘randomTF’ method has been recently reviewed and questioned (Amesbury et al., 2016; Payne et al., 2016), these results further call into question the usefulness of this test.

I downloaded the Lac Le Caron data from Neotoma. The samples in the stratigraphy mostly have a low diversity (median N2 = 2.3), a problem known since my original paper to cause a high type II error rate. It does not seem fair to criticise the test for failing in such circumstances. The main signal in the stratigraphy from Lac Le Caron is the switching between Difflugia pulex-dominated and Archerella flavum-dominated assemblages. These taxa have almost identical optima (11.1 vs 10.2 cm), so I would argue that changes in water table depth are unlikely to have caused this assemblage change. As such, I believe that my significance test is correct in reporting a non-significant result.
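For reference, N2 is Hill's effective number of taxa, $N2 = 1/\sum_k p_k^2$, where $p_k$ is the proportion of taxon $k$ in a sample. A minimal Python sketch with made-up counts (not the Lac Le Caron data) shows how quickly N2 falls when one or two taxa dominate:

```python
import numpy as np

def hill_n2(counts):
    """Hill's N2 (inverse Simpson index) for each row of a samples-by-taxa count matrix."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum(axis=1, keepdims=True)
    return 1.0 / np.sum(p ** 2, axis=1)

# Hypothetical samples: two co-dominant taxa, one dominant taxon, four equally abundant taxa.
example = [
    [50, 50, 0, 0],
    [90, 5, 3, 2],
    [25, 25, 25, 25],
]
n2 = hill_n2(example)
print(np.round(n2, 2))
```

A sample with two equally abundant taxa has N2 = 2, so a median of 2.3 means that a typical sample is effectively made up of little more than two taxa.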