Variance inflation factors (VIF) give a measure of the extent of multicollinearity in the predictors of a regression. If the VIF of a predictor is high, it indicates that that predictor is highly correlated with other predictors, it contains little or no unique information, and there is redundancy in the set of predictors.

We don’t really want to have redundant predictors in a constrained ordination: they make the analysis more difficult to interpret, and the more predictors we have, the less constrained the ordination is, the more it resembles an unconstrained ordination.

Model selection in ordinations is tricky. Techniques such as forward selection can guide the choice of predictors to include, but also bias the results (see Juggins 2013).

Recently I have seen a paper and reviewed a manuscript using VIF to determine which predictor to drop from a constrained ordination in a backwards selection.

The procedure they were using was

- Generate a constrained ordination with all available predictors.
- Calculate the VIF of each variable.
- If any variable has a VIF over a threshold (typically 20), drop the variable with the highest VIF
- Repeat until all remaining variables have a VIF below the threshold.

I want to show that this is a procedure that can go badly wrong using an example based on the SWAP diatom-pH calibration set.

First I load the data from the rioja package and make two fake pH variables that are highly correlated with the observed pH by adding Gaussian noise to the observed pH.

library(rioja) data(SWAP) env<-data.frame(pH=SWAP$pH, fakepH1=rnorm(length(SWAP$pH), SWAP$pH, .2), fakepH2=rnorm(length(SWAP$pH), SWAP$pH, .2)) pairs(env)#highly correlated.

Now I fit a constrained correspondence analysis using cca, and calculate the VIF.

mod<-cca(sqrt(SWAP$spec)~., env) vif.cca(mod) # pH fakepH1 fakepH2 #26.82960 12.63575 14.49851 plot(mod)

The VIF for pH is greater than the threshold 20, and so under the above procedure is dropped. If we fit a new model without pH we find the VIF of the remaining predictors has fallen and all are now below the threshold. An anova shows that both predictors are statistically significant

mod2<-cca(sqrt(SWAP$spec)~.-pH, env) vif.cca(mod2) # fakepH1 fakepH2 #6.822985 6.822985 anova(mod2, by="terms") #Permutation test for cca under reduced model #Terms added sequentially (first to last) # #Model: cca(formula = sqrt(SWAP$spec) ~ (pH + fakepH1 + fakepH2) - pH, data = env) # Df Chisq F N.Perm Pr(>F) #fakepH1 1 0.3406 11.0050 99 0.01 ** #fakepH2 1 0.0530 1.7122 99 0.01 ** #Residual 164 5.0754 #--- #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

What’s happening is that pH is highly correlated with both fake predictors, whereas the fake predictors are less strongly correlated with each other so have a lower VIF.

Any procedure with a high risk of throwing out the true predictor and instead accepting two fake predictors is not a useful method.

Is my toy example realistic? I would argue it is. One of the studies I saw used the temperature of all the months of a year as predictors for a calibration set. Imagine that June is the true predictor, but June temperature will be highly correlated with May and July temperatures. The procedure risks returning three or four months spaced evenly around the calendar rather than just June.

VIF is a very useful indicator that there is multicollinearity in a data set. It is a poor indicator of which predictor should be dropped from a model.

I was actually first taught this method as the preferred process for selecting variables following the methods described in Pienitz, Smol, and Birks 1995 (JOPL 12:21-49). I think that there are many ‘descendants’ from this research group who still prefer this method for a variety of reasons. Personally, I find that a parsimonious forward selection process with adjusted R2 value is a bit more robust (following Blanchet et al. 2008, Ecology 89:2623-2632). That being said, the original thinking behind backwards selection was that other selection criteria were too strict, and excluded information that could be important for interpretations. So there may be a trade off between variables with ‘mechanistic meaning’ and the selection criteria.

From Pienitz et al

Either I've not fully understood what they mean by this, it has been misdescribed, or it is a rather strange analysis. Why use DCA on the environmental data?

Good question. I struggle to remember this dataset. I think it was due to some kind of secondary gradient with treeline transition, but probably best to ask Reinhard. I was first taught as a student to run this kind of method with RDA/CCA depending on the circumstances. I have not tried to manually regress onto DCA scores before. I remember this paper being the “reference” for the VIF deletion at a cutoff of 20. You will likely find others that are more conservative and delete after a VIF of > 10. I kind of ‘grew out’ of this method, but others still use it specifically for interpretations of datasets that include more variables than stricter forward selection processes.

You just save me for publishing something like your example…!!! Thanks a lot! I am gonna keep my forward selection. I was using also VIF (>10) on a metagenomics fungal dataset, and the results did not make any sense for what I know (from soil variables).

Thanks again!