Should I teach survival analysis, network analysis, time series analysis, or generalised linear models? Which are the key statistical methods that biology Masters students need to be taught in class?
I’m updating the statistics courses I teach to biology students. To help decide what methods to cover, I looked at which statistical methods the Masters students are using in their theses. If there is a mismatch between what we teach and what the students use, we might want to reassess what the courses cover.
Bergen students are supposed to submit their Masters theses to the Bergen Open Research Archive. I examined the 60 biology MSc theses submitted to BORA for the years 2013 – 2015. I think this is a large, but probably non-random, proportion of the theses submitted over these years.
There is a huge diversity of science in these theses, from the morphology of medieval dogs, to the faunal colonisation of submarine mine tailings, via shade sensitive epiphytic cryptogams, in vitro studies of environmental toxins on polar bear adipose tissue, pedagogic studies and lots and lots of fish.
Most of the theses are written in English (51). Only nine are in Norwegian (seven Bokmål, two Nynorsk). This ratio surprised me.
Most of the theses have a statistics or data analysis subsection in their methods section, and these are generally fairly informative. Others are more cryptic…
Six theses had no statistics; of these, two have a strong numerical – but not statistical – focus. So despite the lack of enthusiasm for statistics among some biology students, the vast majority of them used statistics in their thesis.
Sixteen theses used bioinformatic methods on DNA and protein sequence data. One of the authors used Python; the rest used a wide variety of software, perhaps reflecting their supervisors’ preferences. No-one used Bioconductor. Thirteen of these theses used no other statistical methods and are not discussed further.
Most (29) of the 41 students who used statistical methods other than bioinformatics (the 60 theses minus the six without statistics and the thirteen that used only bioinformatics) used R, but not always exclusively. The popularity of R is not surprising, as most of the statistics teaching and support (other than by the supervisor) uses R. What other software is being used?
At least ten students used Excel (one more used LibreOffice), typically for making figures (hello 3-D barplots), but also for some basic statistics. GraphPad was used by two students in conjunction with R, SigmaPlot by one. Within R, the ggplot2 and lattice packages were used by one student each for plotting figures – these are not currently taught.
Other statistical software included SPSS (3), Statistica (1), MATLAB (2), Python (1), JAGS (1), StatPlus:mac (1), Prism (1), and one student used Diverse and Primer. Two students did not report what software they used.
It was difficult to work out how students processed their data. I suspect many used Excel where I would have recommended R. It is difficult to teach data processing skills in a class: everybody needs to know how to run an ANOVA, but data processing needs are diverse.
Preliminary conclusions: most students are using R, as they have been taught, but many resort to Excel or other programs for plotting figures, and probably for data processing. This is not great for making theses reproducible.
In the next post, I’ll look more at the statistical methods used.