In my previous post on checking how well represented the fossil taxa are in the modern calibration set, I mentioned that taxa with fewer than five effective occurrences, calculated with Hill’s N2, are likely to have poorly constrained optima. I explore this problem in this post.

First I’ll load the calibration set data from the rioja library, and calculate Hill’s N2.

```r
library(rioja)
data(SWAP)
spp <- SWAP$spec
env <- SWAP$pH
n2 <- Hill.N2(spp)
```

One way to estimate how well defined the optima of the taxa are is to create bootstrap samples of the calibration set and find the variability of the optima across the different samples. I don’t think this can be done directly with rioja, so I’ve written a function to help.

```r
optima <- function(spp, env, fun = WA, n = 1000, column = 1) {
  coeff <- rep(NA, ncol(spp))
  names(coeff) <- colnames(spp)
  replicate(n, {
    k <- sample(1:nrow(spp), size = nrow(spp), replace = TRUE)
    spp <- spp[k, ]
    spp <- spp[, colSums(spp) > 0]
    mod <- fun(spp, env[k])
    coe <- coef(mod)[, column]
    coeff[names(coeff) %in% names(coe)] <- coe
    coeff
  })
}
```

The function *sample()* selects the calibration set observations to include in the bootstrap sample, which is as large as the original dataset but contains some observations multiple times and others not at all. After taxa with no occurrences in the bootstrap sample are removed, a transfer function model (either WA or WAPLS will work) is fitted and the optima are extracted from the model output. The *column* argument lets us select one column from models that produce several columns of output (like WAPLS). Because not all taxa are included in every bootstrap sample, the calculated optima are inserted into a vector with space for all the optima. *replicate()* repeats this procedure *n* times and returns the results as a matrix, with NAs for taxa whose optimum was not calculated for a given bootstrap sample.

Now let’s use this function to calculate the weighted-averaging optima across 1000 (the default) bootstrap samples, then calculate the standard deviation of each taxon’s optima with *apply()* and plot it against N2. For comparison, I add the leave-one-out cross-validation RMSEP of the WA model. All the optima have a standard deviation less than the standard deviation of the environmental gradient (not a great surprise for WA).

```r
optwa <- optima(spp, env, fun = WA)
plot(n2, apply(optwa, 1, sd, na.rm = TRUE),
     xlab = "Hill's N2", ylab = "SD of optima across bootstrap samples")
modwa <- crossval(WA(spp, env))
abline(h = performance(modwa)$crossval["WA.inv", "RMSE"])
sd(env)
# [1] 0.771804
```

Taxa with a high N2 have well defined optima: the optima calculated across different bootstrap samples are similar, as shown by their low standard deviation. Taxa with a lower N2 can have a much higher standard deviation, with a break of slope at ~5 and some very uncertain optima for taxa with fewer effective occurrences. Some taxa with fewer than five effective occurrences have low standard deviations, but I trust these only a very little more than the taxa with high standard deviations: it may simply be a fluke that they were only found at sites spanning a narrow environmental range. The extreme case is taxa with a single occurrence, which have a standard deviation of zero. This is an artefact: these taxa actually have the most uncertain optima.
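The single-occurrence artefact is easy to demonstrate: a taxon found at exactly one site is always assigned that site’s environmental value as its weighted-average optimum in every bootstrap sample that includes it, so the standard deviation across resamples is zero. A minimal sketch with made-up numbers:

```r
# WA optimum: abundance-weighted mean of the environmental variable
wa_optimum <- function(abun, env) sum(abun * env) / sum(abun)

env  <- c(4.5, 5.2, 6.1)
abun <- c(0, 0, 3)     # taxon present at one site only
wa_optimum(abun, env)  # 6.1 in every resample that includes the taxon
```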

How about WAPLS optima (beta coefficients)? The WAPLS component 1 optima will be similar to the WA optima, differing only because the sum of relative abundances at each site has been rescaled to 100%, so I am just going to look at the component 2 beta coefficients.

```r
optwapls2 <- optima(spp, env, fun = WAPLS, column = 2)
plot(n2, apply(optwapls2, 1, sd, na.rm = TRUE),
     xlab = "Hill's N2", ylab = "SD of optima across bootstrap samples")
abline(h = performance(modwa)$crossval["WA.inv", "RMSE"])
abline(h = sd(env), lty = 2)
```

With WAPLS component 2 the coefficients are more uncertain than the WA optima; the WAPLS component 3 coefficients would be more uncertain still. With this calibration set, the increase in variance in the coefficients almost completely cancels out the reduction in bias attained by using the second component, and the RMSEP of the two-component model (0.299) is only marginally lower than that of the one-component model (0.307). This difference is not statistically significant according to a randomisation t-test.

If taxa with poorly defined optima are abundant in a fossil data set but the abundance varies little from level to level in the stratigraphy, the reconstructions may be biased, but the bias would be near constant. If instead these taxa are both abundant and varying, they will contribute to the trends in the reconstruction, but even the sign of their contribution may be uncertain. This is a far greater problem.

I’ve not seen anybody calculate the uncertainty in the optima before, and I’m not recommending the method described here as a standard diagnostic, but I think the finding that taxa with fewer than about five effective occurrences have poorly defined optima will prove to be a general result. Let me know if your calibration set has different properties.


I get an error while running the code for WAPLS models:

```
Error in WAPLS.fit(y, x, npls, iswapls, standx, lean) :
  Components could not be extracted, please check data
```

I really should learn not to tinker with code without giving it a final check.

The problem was that the ‘…’ from optima() was not being passed to the ‘…’ in fun(), because fun() is called inside replicate() and so looks for the contents of ‘…’ in that environment. I could fix this properly, but for simplicity I have just deleted the ‘…’. This will prevent you from testing WAPLS component 27 (which I don’t suppose you really wanted to do) or from fitting PLS via WAPLS with the argument iswapls = FALSE. You can work around both of these by making wrapper functions.

The code should work now.
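For anyone who wants the wrapper-function workaround, a minimal sketch (the wrapper names are made up, and rioja’s WAPLS must be available when the wrappers are called):

```r
# Fix the extra arguments at definition time, so nothing needs to be
# passed through optima()'s now-deleted '...'
WAPLS3 <- function(spp, env) WAPLS(spp, env, npls = 3)       # extra components
PLSfit <- function(spp, env) WAPLS(spp, env, iswapls = FALSE) # PLS via WAPLS

# then e.g. optima(spp, env, fun = PLSfit)
```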

Richard, I’m not finding the Hill.N2 function in the rioja package. What have I missed here?

Version: 0.8-5

Thanks Richard, got it.

That’s got to be the same Hill who worked on reciprocal averaging techniques and TWINSPAN with Hugh Gauch and others.

Hi,

I’m using version 0.8-4. Hill.N2 is listed in the help file for 0.8-5 at http://cran.r-project.org/web/packages/rioja/rioja.pdf on the utils page. If it has gone it is probably a bug.

The same result can be obtained from

```r
library(vegan)
diversity(spp, "invsimpson", MARGIN = 2)
```

Definitely gone, but thanks.

So, N2 is a diversity index apparently and five species of equal abundance would give

D = 5 * (0.2)^2 = 0.2 and 1/D = 5. Right?

Just checked, Hill.N2 is in the package source for rioja 0.8-5 on CRAN. I have no idea why it is not appearing for you.

Yes, Hill’s N2 (discussed in Hill 1973; Ecology, 54, 427-432) is a diversity index – it is the inverse Simpson index. Normally it is used to find the effective number of species per site; here we are using it to find the effective number of occurrences of a species.

The effective number of occurrences of a species is the number of equally abundant occurrences that would have the same Simpson diversity as the species has.
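The definition can be checked directly in base R; as in the example above, a taxon with five equally abundant occurrences has exactly five effective occurrences:

```r
x  <- rep(1, 5)   # a taxon with five equal occurrences across sites
p  <- x / sum(x)  # relative abundances, each 0.2
D  <- sum(p^2)    # Simpson's index: 5 * 0.2^2 = 0.2
N2 <- 1 / D       # Hill's N2: 5 effective occurrences
```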



A very interesting approach.

I’m pretty sure that you have plotted RMSE not RMSEP.

Both `performance(mod)$crossval` and `performance(mod)$object` have a column “RMSE”. The former, which is used above, is actually the RMSEP.

Got it – thanks. Just learned a bit more about rioja.