Performance of chironomid-temperature transfer functions

Chironomids (non-biting midges) are sensitive to temperature (Eggermont and Heiri 2011), and because their head-capsules preserve well in lake sediments, their fossil assemblages can be used to reconstruct past temperature with a transfer function trained on the chironomid-temperature relationship in a modern calibration set of chironomid assemblages and associated temperatures. They are often used to provide an independent temperature reconstruction against which pollen- or macrofossil-inferred vegetation changes can be compared (e.g., Samartin et al. 2012).

The performance of transfer functions for reconstructing past temperatures from chironomid assemblages in lake sediments varies enormously. The root-mean-squared-error of prediction (RMSEP) for transfer functions in a compilation by Eggermont and Heiri (2011), augmented by a few recent studies [1], ranges between 0.55 and 2.37 °C. The lowest RMSEP suggests that chironomids give very precise reconstructions; the highest suggests that chironomids have limited utility for anything other than reconstructing glacial-interglacial changes. What causes this 4.3-fold difference in RMSEP? Understanding this might suggest where efforts to improve chironomid-temperature, and other, transfer functions should be focused.
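For readers unfamiliar with the statistic, here is a minimal sketch of how an RMSEP is computed from cross-validated predictions, together with the arithmetic behind the 4.3-fold figure. The observed and predicted temperatures are hypothetical values chosen for illustration, not data from any of the transfer functions discussed.

```python
import math


def rmsep(observed, predicted):
    """Root-mean-squared-error of prediction, in the units of the inputs."""
    errors = [p - o for o, p in zip(observed, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical cross-validated predictions for three lakes (°C).
observed = [10.0, 12.0, 14.0]
predicted = [11.0, 12.0, 13.0]
print(round(rmsep(observed, predicted), 3))  # 0.816

# Ratio of the highest to the lowest RMSEP quoted in the text.
fold_range = 2.37 / 0.55
print(round(fold_range, 1))  # 4.3
```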

Part of the variance in performance is explained by the choice of air or water temperatures as a calibration target (Eggermont and Heiri 2011, Figure 1). Models calibrated against water temperature have a worse performance even though chironomids spend most of their life in water. This is probably because water temperatures are spot measurements, and thus reflect recent weather, whereas air temperatures are climatological means derived from local weather stations or gridded climatologies (e.g., Hijmans et al. 2005).


Figure 1. Effect of calibration target on RMSEP

The following analyses concern only the 35 transfer functions calibrated against air temperatures. Most of these transfer functions are trained on calibration sets of many lakes (calibration-in-space). Two transfer functions (Larocque-Tobler et al. 2011; Luoto and Ojala 2016) are trained on fluctuations in chironomid assemblages in one lake over many years (calibration-in-time). Varve chronologies were used to determine the age of each assemblage. Temperature data for each year are available from nearby weather stations. Both calibration-in-time transfer functions have low RMSEPs.

The range of RMSEPs for the air-temperature calibration-in-space models is only slightly smaller than for the full set of models. The coefficient of determination (r2) varies between 0.5 and 0.97 and is uncorrelated with RMSEP (Pearson r = 0.15, p = 0.38).

I have several hypotheses that might explain the difference in performance between models:

  • Longer temperature gradients in the calibration set give larger RMSEPs but higher r2.
  • Large calibration sets give better performance
  • Low-resolution taxonomy is associated with worse performance
  • Chironomid sensitivity to temperature varies with temperature
  • Some authors have followed more closely than others the (perhaps unwise) advice to maximise the temperature gradient and minimise all other gradients
  • Some authors have been more zealous in removing outliers than others.
  • Some authors count fewer chironomids per assemblage than others.
  • Authors have accidentally mis-reported the performance statistics

Only some of these can be answered with the data currently to hand.

Gradient length

The temperatures in the calibration sets range from 4.3 to 24.5 °C. RMSEP has a strong (R2 = 0.54; p < 0.001) positive (slope = 0.061 °C/°C) relationship with temperature range (Figure 2). The relation between temperature range and r2 is weaker (R2 = 0.25; p = 0.003), and also positive (slope = 0.01 per °C).


Figure 2. RMSEP and r2 against calibration set temperature range. The blue/green points are the calibration-in-time transfer functions. They are omitted from the regression line.

The r2 increases with temperature range because, even though RMSEP increases, RMSEP as a proportion of temperature range decreases.
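A rough numerical sketch of this argument follows. It rests on two assumptions that are mine rather than anything estimated from the compilation: that lakes are spread uniformly along the gradient (so the variance of temperature is range²/12), and that r2 can be approximated by 1 − RMSEP²/variance. The RMSEP values are hypothetical.

```python
def approx_r2(rmsep, temp_range):
    """Approximate r2, assuming lakes are uniformly spread over temp_range."""
    variance = temp_range ** 2 / 12.0  # variance of a uniform distribution
    return 1.0 - rmsep ** 2 / variance


# Hypothetical short and long gradients: RMSEP doubles, range quadruples.
for rmsep, temp_range in [(0.8, 5.0), (1.6, 20.0)]:
    print(temp_range, rmsep / temp_range, round(approx_r2(rmsep, temp_range), 4))
```

In this made-up case the RMSEP doubles but falls from 16 % to 8 % of the temperature range, and the approximate r2 rises from about 0.69 to about 0.92.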

But why does the RMSEP increase with gradient length? There are two possible explanations:

  1. With a larger temperature range, cross-validation predictions are less constrained – it is possible to be more wrong. If this is true, it might suggest that transfer functions with a small temperature range have artificially reduced uncertainties.
  2. Larger temperature ranges have a lower density of lakes per °C and therefore species optima are poorly constrained.

Figure 3. Density of lakes as a function of temperature range.

Omitting the anomalously large composite calibration set (Fortin et al. 2015), there is a significant (p < 0.001) negative trend (slope = -0.38 lakes/°C/°C).

Lake density is a poorer predictor (R2 = 0.4) of RMSEP than temperature range. So while it might be an important contributor to the effect of temperature range on performance, it does not explain the whole effect. A model including both lake density and temperature range is not significantly better than a model with just temperature range (p = 0.044).

There is a hint in Figure 2 that the relationship between RMSEP and temperature range is not linear, but that with increasingly long gradients the RMSEP starts to plateau. Adding a quadratic term does not significantly improve a linear model between RMSEP and temperature (p = 0.1), but the AIC decreases (25.1 vs. 26.1). If the RMSEP increases with increasing temperature range because of the decreasing constraint on the range of predictions, it should be expected that with a sufficiently large range all constraints are removed and no further increase in RMSEP is observed.

Simulated species-environment data may help understand the relative importance of lake density and reduced constraint when temperature range is increased.
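One way such a simulation might be set up is sketched below. Everything in it is an assumption made for illustration: Gaussian species responses with a common tolerance, optima drawn uniformly between -5 and 35 °C, multinomial counts of 100 head-capsules, and a simple weighted-averaging model with inverse deshrinking evaluated by leave-one-out cross-validation. The function names are hypothetical, and this is not the code behind the figures in this post.

```python
import math
import random


def simulate(n_lakes, t_min, t_max, n_species=60, tolerance=3.0,
             count_sum=100, seed=1):
    """Simulate lake temperatures and multinomial counts (settings hypothetical)."""
    rng = random.Random(seed)
    # Optima are drawn first, over a fixed wide range, so the same seed gives
    # the same species pool whatever the calibration-set gradient.
    optima = [rng.uniform(-5.0, 35.0) for _ in range(n_species)]
    temps = [rng.uniform(t_min, t_max) for _ in range(n_lakes)]
    counts = []
    for t in temps:
        # Gaussian response of each species to temperature.
        weights = [math.exp(-0.5 * ((t - u) / tolerance) ** 2) for u in optima]
        row = [0] * n_species
        for k in rng.choices(range(n_species), weights=weights, k=count_sum):
            row[k] += 1
        counts.append(row)
    return temps, counts


def wa_estimate(row, optima):
    """Abundance-weighted mean of the optima of taxa that have an optimum."""
    pairs = [(y, u) for y, u in zip(row, optima) if u is not None]
    return sum(y * u for y, u in pairs) / sum(y for y, _ in pairs)


def wa_loo_rmsep(temps, counts):
    """Leave-one-out RMSEP of weighted averaging with inverse deshrinking."""
    n, m = len(temps), len(counts[0])
    squared = []
    for j in range(n):
        train = [i for i in range(n) if i != j]
        # Species optima estimated from the training lakes only.
        optima = []
        for k in range(m):
            total = sum(counts[i][k] for i in train)
            weighted = sum(counts[i][k] * temps[i] for i in train)
            optima.append(weighted / total if total else None)
        # Inverse deshrinking: regress observed temperature on initial estimates.
        init = [wa_estimate(counts[i], optima) for i in train]
        obs = [temps[i] for i in train]
        mx, my = sum(init) / len(init), sum(obs) / len(obs)
        sxy = sum((a - mx) * (b - my) for a, b in zip(init, obs))
        sxx = sum((a - mx) ** 2 for a in init)
        slope = sxy / sxx
        pred = my - slope * mx + slope * wa_estimate(counts[j], optima)
        squared.append((pred - temps[j]) ** 2)
    return math.sqrt(sum(squared) / n)


# Short gradient, long gradient at lower lake density, long gradient at the
# same lake density as the short one.
for n_lakes, t_min, t_max in [(20, 10.0, 15.0), (20, 5.0, 25.0), (80, 5.0, 25.0)]:
    temps, counts = simulate(n_lakes, t_min, t_max)
    print(n_lakes, t_max - t_min, round(wa_loo_rmsep(temps, counts), 2))
```

Comparing the second and third scenarios would isolate the effect of lake density at a fixed temperature range, while comparing the first and third would isolate the effect of range at a fixed density. A single seed proves little; any real experiment would need many replicates and more realistic species responses.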

Number of lakes

The number of observations in a calibration set has a large impact on transfer-function performance (Reavie and Juggins 2011). Reavie and Juggins (2011) took large diatom-phosphorus calibration sets and estimated the performance of stratified-random subsets of observations. They found that there were substantial improvements in performance until calibration sets included 40-70 observations and smaller improvements thereafter. Transfer-function performance will increase because, with more observations, species optima are better estimated and so reconstructions are more accurate.

Translating this calibration set size from diatoms to chironomids is difficult as it will depend on the amount of turnover in the assemblages along the gradient of interest (which could be estimated with detrended constrained correspondence analysis) and the amount of noise inherent in the different proxies. Assuming the numbers can be used as-is, 15 % of calibration sets have fewer than 40 lakes, and 58 % have fewer than 70. Many of the calibration sets are small enough to expect some size-related performance penalty.


Figure 4. RMSEP against the number of lakes in each calibration set, omitting the anomalously large calibration set.

However, the relationship between the number of lakes and RMSEP is not significant (p = 0.69). Probably none of the calibration sets are small enough to suffer the full penalty observed by Reavie and Juggins (2011).

Taxonomic resolution

Chironomid taxonomy has been refined (e.g., Brooks et al. 2007) since the first chironomid transfer functions were made.

Heiri and Lotter (2010) showed that the performance of transfer functions was sensitive to the level of taxonomic precision by analysing a calibration set at low, intermediate or high taxonomic resolution: the RMSEP declined from 1.59 to 1.41°C. At low taxonomic resolution, multiple ecologically disparate species may be merged into a single generic or supra-generic morphotaxon and the combined optima might be misleading. With higher taxonomic resolution, optima are more appropriate but may be poorly defined as individual taxa will inevitably be rarer than the merged taxon, and there is an increased risk of misidentification (Heiri and Lotter 2010; Velle et al. 2010).

Some estimate of the level of taxonomic precision used in each calibration set could be derived from the taxonomic works cited, or by examining the species lists. Lacking the patience to do either, I am going to use the year of publication as a proxy for taxonomic precision, assuming that earlier works used less precise taxonomy. The main problem with this approach is that some calibration sets (e.g., Fortin et al. 2015) include earlier calibration sets and use the lowest-common taxonomy.

The number of taxa could also be used as a proxy for taxonomic resolution, but would need to be adjusted for the size of the calibration set and the environmental range spanned.


Figure 5. RMSEP against year of publication

After accounting for the temperature range in the calibration set, year is not a significant predictor of RMSEP (p = 0.4). This is perhaps not surprising given that the 11.3 % improvement found by Heiri and Lotter (2010) is small relative to the 4.3-fold range in RMSEP.

Chironomid sensitivity to temperature varies with temperature

It is possible that chironomid sensitivity to temperature is not constant along the temperature gradient, but that turnover is higher, and hence more precise reconstructions are possible, over some parts of the gradient. Following Rapoport’s rule (Stevens 1989), we might expect niches to be smaller, and hence turnover higher, at higher temperatures. Alternatively, turnover might be highest at an ecotone such as the treeline.

The optimal test for this would probably include detrended constrained correspondence analyses over different parts of the temperature gradient in different calibration sets to see if the length of the first axis (a measure of turnover) changes in a consistent manner. Not having this information to hand, I test if the transfer function RMSEP is a function of the mid-point temperature of the calibration set.

There is a strong relationship between the mid-point temperature and RMSEP, but, because of the hard boundary at 0°C, there is also a strong correlation between mid-point temperature and temperature range (Pearson r = 0.42, p = 0.012).


Figure 6. RMSEP against mid-point temperature and mid-point temperature against temperature range.

After accounting for the temperature range, mid-point temperature has a positive (0.036 °C/°C) but not statistically significant (p = 0.12) effect on RMSEP.

Calibration set design

Many authors aim to generate calibration sets that maximise the length of the environmental gradient of interest and minimise other environmental gradients. This design will maximise transfer-function performance; however, this performance will probably be over-estimated, and down-core reconstructions will be more uncertain than the transfer function reports because of the risk of non-analogue environmental conditions.

If some authors have strived harder than others to generate calibration sets with a single strong gradient, or, equivalently, have selected a geographic extent with a large temperature gradient, this would affect the performance of transfer functions. It is not immediately obvious how to test this.

Outlier removal

Many papers do not discuss outlier removal. Of those that do, Brooks and Birks (2001) removed two glacially fed lakes that were outliers in the transfer function, presumably because the water temperature was much colder than expected given the air temperature. Wu et al (2015) removed seven outliers because their absolute residuals were >2 SD away from the observed temperature. The deleted lakes mainly have anomalous pH, conductivity or depth.

Removing outliers will improve the transfer-function performance statistics. Wholesale removal, as sometimes practiced with testate-amoeba water-depth transfer functions (e.g., Woodland et al. 1998 who remove 29 of 163 observations from one of their models), is probably unwise as there is no guarantee that the improvement in cross-validation performance statistics will be reflected in better predictions down-core.

Chironomid count sums

Transfer functions based on larger chironomid counts should expect better performance statistics. This performance boost has two components. The first is due to reconstructions from assemblages with low counts being imprecise, whether the assemblages are fossil data or the modern data under cross-validation. This was studied by Heiri and Lotter (2001), Quinlan and Smol (2001), and Larocque (2001). Heiri and Lotter (2001) simulated counts of different sizes by resampling some large chironomid counts with replacement, and showed that the standard deviation of reconstructions derived from counts of fifty head-capsules was about 40 % of the RMSEP. With 100 head-capsules the standard deviation decreased to about 20 % of the RMSEP, with relatively small improvements for counts of 200. With counts of fewer than fifty the error rose dramatically. Note that the magnitude of this error component is only directly relevant to reconstructions from fossil assemblages if the fossil count sums are equal to those in the calibration set. If the fossil count sums are larger, the cross-validation RMSEP will be pessimistic, and vice versa.

The second component is due to the species optima derived from noisy low-count assemblage data being imprecise, and hence reconstructions being imprecise. This was studied by Bennett et al (2016), who subsampled diatom calibration sets to smaller count sums (and numbers of lakes), and then tested performance with an independent test set with the unaltered count sum. They found, for a 350-lake North American diatom-pH calibration set, that reducing the count sum from 300 valves to 25 increased the RMSEP by 10%.

The combined impact of both error components has not been studied as far as I am aware, and the values from Heiri and Lotter (2001) and Bennett et al (2016) are not directly comparable, but it would appear that the loss of performance due to optima being poorly estimated with low count sums is less severe than that due to the counting uncertainties in the observations for which predictions are made under cross-validation. This is expected, at least if weighted averaging or similar methods are used, since the optima average information across multiple lakes, mitigating the counting noise, whereas the predictions are derived from a single observation.

All the papers exploring the impact of count size on performance have noted that with increasing counts, the rate of improvement decreases, but none appear to explain the theoretical basis for this. Assemblage counts arise from a multinomial distribution (a generalisation of the binomial distribution to cope with many possible outcomes (species)). With species $i \in \{1, 2, \ldots, k\}$, the expected count of each species is $np_i$, where $n$ is the count sum and $p_i$ is the probability of a head-capsule being from species $i$. The variance for each species is $np_i(1 - p_i)$. The standard deviation of the count for each species is then proportional to $\sqrt{n}$. As a proportion of the total count, the standard deviation will be proportional to $\frac{\sqrt{n}}{n}$ or $\frac{1}{\sqrt{n}}$. This would suggest that the standard deviation in the reconstructions due to count size should scale with $\frac{1}{\sqrt{n}}$, but the results from Heiri and Lotter (2001) above suggest that the standard deviation scales with $\frac{1}{n}$. I need to think about this some more and try some experiments.
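A first experiment along these lines might look like the sketch below. It resamples a single large count with replacement, in the spirit of Heiri and Lotter (2001), and reconstructs temperature as a plain weighted average of fixed optima with no deshrinking. The assemblage, the optima and the function names are all hypothetical. For this particular assemblage the standard deviation of a single head-capsule's optimum is exactly 3 °C, so the multinomial argument predicts a reconstruction standard deviation of 3/√n.

```python
import math
import random
import statistics


def resample_count(count, size, rng):
    """Draw `size` head-capsules with replacement, in proportion to `count`."""
    taxa = list(count)
    draws = rng.choices(taxa, weights=[count[t] for t in taxa], k=size)
    return {t: draws.count(t) for t in taxa}


def wa_reconstruction(count, optima):
    """Abundance-weighted mean of the taxon optima (no deshrinking)."""
    return sum(n * optima[t] for t, n in count.items()) / sum(count.values())


def reconstruction_sd(count, optima, size, n_rep=500, seed=42):
    """Standard deviation of reconstructions across resampled counts."""
    rng = random.Random(seed)
    values = [wa_reconstruction(resample_count(count, size, rng), optima)
              for _ in range(n_rep)]
    return statistics.stdev(values)


def theoretical_sd(count, optima, size):
    """Multinomial expectation: sd of one capsule's optimum divided by sqrt(size)."""
    total = sum(count.values())
    mean = sum(n * optima[t] for t, n in count.items()) / total
    var = sum(n * (optima[t] - mean) ** 2 for t, n in count.items()) / total
    return math.sqrt(var / size)


# Hypothetical large count and taxon optima (°C).
large_count = {"taxon_a": 120, "taxon_b": 90, "taxon_c": 60, "taxon_d": 30}
optima = {"taxon_a": 8.0, "taxon_b": 11.0, "taxon_c": 14.0, "taxon_d": 17.0}

for size in (25, 50, 100, 200):
    print(size,
          round(reconstruction_sd(large_count, optima, size), 3),
          round(theoretical_sd(large_count, optima, size), 3))
```

This only confirms the $\frac{1}{\sqrt{n}}$ scaling for a fixed-optima weighted average of one assemblage; it does not explain why the published results appear to scale differently.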

I haven’t extracted the minimum (or typical, which might be more meaningful) count sum for the different transfer functions, but it is typically greater than fifty head-capsules. Rashly assuming that the error from the two components is additive and that the results of Heiri and Lotter (2001) and Bennett et al (2016) are representative, it would suggest that a count size of fifty head-capsules would have an RMSEP 60% higher than a large (~300) count. This is a moderate proportion of the overall range in transfer function performance.

Mis-reported statistics

Larocque-Tobler et al (2015) reported the bootstrap RMSEP of their Canadian-Polish transfer function as 1.3°C. In the corrigendum, Larocque-Tobler et al (2016) corrected this to 2.3°C, blaming the inclusion of nine lakes with low chironomid counts and some other errors. It seems unlikely that such errors would reduce the RMSEP so much (or even at all). Perhaps the authors accidentally reported the apparent RMSE rather than the cross-validated RMSEP. The supplementary material to Larocque-Tobler et al (2015) reports the leave-one-out WAPLS-2 RMSEP to be 2.14°C, comparable with the bootstrapped value in the corrigendum. The bootstrapped WAPLS RMSEPs in the supplementary material are impossible as they are lower than the leave-one-out statistics. However, since WAPLS-3 has a worse performance than WAPLS-2, these numbers cannot be the apparent performance statistics, which are guaranteed to improve with more components. The origin of the RMSEP reported by Larocque-Tobler et al (2015), as with so much in that paper, remains a mystery (the supplementary table was added after the second review).
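The distinction between apparent and cross-validated error can be illustrated with a deliberately simple stand-in model. The sketch below uses ordinary least-squares regression on hypothetical data, not WAPLS and not the data from any of these papers. For least squares, each leave-one-out residual is at least as large in magnitude as the corresponding apparent residual, so the apparent RMSE can never exceed the leave-one-out RMSEP.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares intercept and slope."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def apparent_rmse(x, y):
    """Error when the model predicts the observations it was fitted to."""
    intercept, slope = fit_line(x, y)
    squared = [(intercept + slope * a - b) ** 2 for a, b in zip(x, y)]
    return math.sqrt(sum(squared) / len(squared))


def loo_rmsep_line(x, y):
    """Each observation is predicted by a model fitted without it."""
    squared = []
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        squared.append((intercept + slope * x[i] - y[i]) ** 2)
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predictor and temperature values, for illustration only.
proxy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
temperature = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]
print(apparent_rmse(proxy, temperature), loo_rmsep_line(proxy, temperature))
```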

Larocque-Tobler et al (2015) include data from two earlier calibration sets (Larocque et al. 2006; Larocque 2008), which have RMSEPs of 1.17 and 1.67°C, respectively (Table 1). The decrease in performance between Larocque et al (2006) and Larocque (2008) can be explained by the large increase in temperature range in the latter calibration set. Why the performance should worsen again when the Polish lakes are included is unclear and should have been explored in Larocque-Tobler et al (2015), as it undermines the rationale for merging the calibration sets.

It is possible that other transfer functions also misreport statistics. I will check those I have in hand.


Much of the variance in chironomid-air temperature transfer-function performance can be explained by the temperature range in the calibration set. It is not clear how much of this is driven by the reduced constraint on predictions with longer gradients and how much by reductions in lake density. Calibration-set size, count sum and taxonomic precision appear to be of secondary importance. Other possible factors include the choice of the target temperature variable; errors in the temperature data, which will vary geographically; and methodological choices such as square-root transformation of species data and transfer function model type.

I will add some recommendations once I have some, and will submit a revised version of this for publication somewhere.

If I have missed any published transfer functions, or explanations as to why transfer function performance can vary so much, please let me know.


Barley EM, Walker IR, Kurek J et al (2006) A northwest North American training set: Distribution of freshwater midges in relation to air temperature and lake depth. Journal of Paleolimnology 36:295–314. doi: 10.1007/s10933-006-0014-6

Bennett JR, Rühland KM, Smol JP (2016) No magic number: Determining cost-effective sample size and enumeration effort for diatom-based environmental assessment analyses. Canadian Journal of Fisheries and Aquatic Sciences 1–8. doi: 10.1139/cjfas-2016-0066

Brooks S, Langdon P, Heiri O (2007) The identification and use of Palaearctic Chironomidae larvae in palaeoecology. Quaternary Research Association Technical Guide 10.

Brooks SJ, Birks H (2000) Chironomid-inferred late-glacial and early-Holocene mean July air temperatures for Kråkenes Lake, western Norway. Journal of Paleolimnology 23:77–89. doi: 10.1023/a:1008044211484

Brooks SJ, Birks H (2001) Chironomid-inferred air temperatures from Lateglacial and Holocene sites in north-west Europe: Progress and problems. Quaternary Science Reviews 20:1723–1741. doi: 10.1016/s0277-3791(01)00038-5

Chang JC, Shulmeister J, Woodward C (2015) A chironomid based transfer function for reconstructing summer temperatures in southeastern Australia. Palaeogeography, Palaeoclimatology, Palaeoecology 423:109–121. doi: 10.1016/j.palaeo.2015.01.030

Eggermont H, Heiri O (2011) The chironomid-temperature relationship: Expression in nature and palaeoenvironmental implications. Biological Reviews 87:430–456. doi: 10.1111/j.1469-185x.2011.00206.x

Fortin M-C, Medeiros AS, Gajewski K et al (2015) Chironomid-environment relations in northern North America. Journal of Paleolimnology 54:223–237. doi: 10.1007/s10933-015-9848-0

Heiri O, Lotter AF (2010) How does taxonomic resolution affect chironomid-based temperature reconstruction? Journal of Paleolimnology 44:589–601. doi: 10.1007/s10933-010-9439-z

Heiri O, Lotter AF (2001) Effect of low count sums on quantitative environmental reconstructions: An example using subfossil chironomids. Journal of Paleolimnology 26:343–350. doi: 10.1023/a:1017568913302

Hijmans RJ, Cameron SE, Parra JL et al (2005) Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25:1965–1978. doi: 10.1002/joc.1276

Larocque I (2008) Nouvelle fonction de transfert pour reconstruire la température à l’aide des chironomides préservés dans les sédiments lacustres. INRS Rapport de Recherche R1032 , 978-2-89146-587-8

Larocque I (2001) How many chironomid head capsules are enough? A statistical approach to determine sample size for palaeoclimatic reconstructions. Palaeogeography, Palaeoclimatology, Palaeoecology 172:133–142. doi: 10.1016/s0031-0182(01)00278-4

Larocque I, Hall R, Grahn E (2001) Chironomids as indicators of climate change: A 100-lake training set from a subarctic region of northern Sweden (Lapland). Journal of Paleolimnology 26:307–322. doi: 10.1023/a:1017524101783

Larocque I, Pienitz R, Rolland N (2006) Factors influencing the distribution of chironomids in lakes distributed along a latitudinal gradient in northwestern Quebec, Canada. Canadian Journal of Fisheries and Aquatic Sciences 63:1286–1297. doi: 10.1139/f06-020

Larocque-Tobler I, Filipiak J, Tylmann W et al (2015) Comparison between chironomid-inferred mean-August temperature from varved Lake Żabińskie (Poland) and instrumental data since 1896 AD. Quaternary Science Reviews 111:35–50. doi: 10.1016/j.quascirev.2015.01.001

Larocque-Tobler I, Filipiak J, Tylmann W et al (2016) Corrigendum to “Comparison between chironomid-inferred mean-August temperature from varved Lake Żabińskie (Poland) and instrumental data since 1896 AD” [Quat. Sci. Rev. 111 (2015) 35–50]. Quaternary Science Reviews 140:163–167. doi: 10.1016/j.quascirev.2016.01.020

Larocque-Tobler I, Grosjean M, Kamenik C (2011) Calibration-in-time versus calibration-in-space (transfer function) to quantitatively infer July air temperature using biological indicators (chironomids) preserved in lake sediments. Palaeogeography, Palaeoclimatology, Palaeoecology 299:281–288. doi: 10.1016/j.palaeo.2010.11.008

Luoto TP, Ojala AE (2016) Meteorological validation of chironomids as a paleotemperature proxy using varved lake sediments. The Holocene. doi: 10.1177/0959683616675940

Massaferro J, Larocque-Tobler I (2013) Using a newly developed chironomid transfer function for reconstructing mean annual air temperature at Lake Potrok Aike, Patagonia, Argentina. Ecological Indicators 24:201–210. doi: 10.1016/j.ecolind.2012.06.017

Matthews-Bird F, Brooks SJ, Holden PB et al (2016) Inferring late-Holocene climate in the Ecuadorian Andes using a chironomid-based temperature inference model. Climate of the Past 12:1263–1280. doi: 10.5194/cp-12-1263-2016

Porinchu D, Rolland N, Moser K (2008) Development of a chironomid-based air temperature inference model for the central Canadian Arctic. Journal of Paleolimnology 41:349–368. doi: 10.1007/s10933-008-9233-3

Quinlan R, Smol JP (2001) Setting minimum head capsule abundance and taxa deletion criteria in chironomid-based inference models. Journal of Paleolimnology 26:327–342. doi: 10.1023/a:1017546821591

Reavie ED, Juggins S (2011) Exploration of sample size and diatom-based indicator performance in three North American phosphorus training sets. Aquatic Ecology 45:529–538. doi: 10.1007/s10452-011-9373-9

Samartin S, Heiri O, Lotter AF, Tinner W (2012) Climate warming and vegetation response after Heinrich event 1 (16 700–16 000 cal yr BP) in Europe south of the Alps. Climate of the Past 8:1913–1927. doi: 10.5194/cp-8-1913-2012

Stevens GC (1989) The latitudinal gradient in geographical range: How so many species coexist in the tropics. The American Naturalist 133:240–256.

Velle G, Brodersen KP, Birks HJB, Willassen E (2010) Midges as quantitative temperature indicator species: Lessons for palaeoecology. The Holocene 20:989–1002. doi: 10.1177/0959683610365933

Woodland W, Charman D, Sims P (1998) Quantitative estimates of water tables and soil moisture in Holocene peatlands from testate amoebae. The Holocene 8:261–273. doi: 10.1191/095968398667004497

Wu J, Porinchu DF, Horn SP, Haberyan KA (2015) The modern distribution of chironomid sub-fossils (Insecta: Diptera) in Costa Rica and the development of a regional chironomid-based temperature inference model. Hydrobiologia 742:107–127. doi: 10.1007/s10750-014-1970-x

Zhang E, Chang J, Cao Y et al (2016) A chironomid-based mean July temperature inference model from the south-east margin of the Tibetan Plateau, China. Climate of the Past Discussions 1–37. doi: 10.5194/cp-2016-96

  1. Data and code for this post are on GitHub. For simplicity, all transfer functions are treated as if they are independent even though some are subsets of others.

About richard telford

Ecologist with interests in quantitative methods and palaeoenvironments

2 Responses to Performance of chironomid-temperature transfer functions

  1. HI Richard,

    Interesting post. A couple of things:

    There is something amiss in the sentence: Assuming the numbers can be used as-is, NA % of calibration sets have fewer than 40 lakes, and NA % have fewer than 70. This is in the section Number of Lakes

    Isn’t the Bennet et al result a best-case scenario? In that they start with a taxonomically diverse training set because of how those training sets were prepared/produced. I would assume richer subsets from resampling something like the EDDI data set (even if I drew individuals at random from the observed taxa in proportion to their occurrence in the training set or randomly selected sample) than if I sat down at the microscope myself and only counted the first 50 valves.

    Regarding the air/time calibrations — and I should probably look this up myself but you’ll probably know it already — do they report standard CV results or was the CV in those instances modified to account for the temporal ordering of the data?

    Finally, what are the two blue dots in many of the figures/panels?

    • The NAs are now fixed. This is what happens when you add data late in the day with missing values and don’t check everything. Twice.

      For diatoms, Bennett et al is best case because diatom valves are often attached to each other, sometimes in long chains, and therefore observations are not independent so the true variance will be higher than they report. This is less of an issue with chironomids.

      From memory, both air/time calibrations use standard cross-validation. Larocque then uses a modified correlation test on the down-core reconstruction that should correct for autocorrelation. It’s a concern, but my biggest worry about the air/time reconstructions are the potential lack of analogues down core. Neither paper reports useful diagnostics.

      The blue dots were supposed to be green – the calibration in time transfer functions. Hadn’t noticed the colours had changed (and I’m not colour blind)
