Making a pollen diagram from Neotoma

Last week I gave a course on R for palaeoecologists covering data handling with tidyverse, reproducibility, and some ordinations and transfer functions. One of the exercises was to download some pollen data from Neotoma and make a pollen diagram. This is my solution. It uses a lot of tidyverse; everything could be done with base R, but I find tidyverse more expressive.

#load packages
library("tidyverse")
library("neotoma")
library("rioja")
#devtools::install_github("richardjtelford/ggpalaeo") #NEW VERSION
library("ggpalaeo") #Need new version

I’ve chosen to plot the pollen data from Lake Bambili. First we need to download the data with the neotoma package.

#download data
get_dataset(40940) #check correct site
## A dataset_list containing 1 objects:
## Accessed from 2019-10-10 16:41h to 2019-10-10 16:41h.
## Datasets:
##  dataset.id site.name     long     lat   type
##       40940 Bambili 1 10.24383 5.93487 pollen
bambili_raw <- get_download(40940)
## API call was successful. Returned record for Bambili 1
eco_types <- get_table("EcolGroupTypes")

Then we need to extract the components we need from the downloaded data. get_download returns a list with one element per dataset which we can extract with the $ notation.

#extract components
cnts <- counts(bambili_raw$`40940`) #table of counts
meta <- bambili_raw$`40940`$sample.meta #sample depths/ages
taxa <- bambili_raw$`40940`$taxon.list %>% #taxon info
  mutate_all(as.character) #convert everything from a factor to prevent warnings later

I want to add the age/depth information to the counts. Here I am using bind_cols to do this, which assumes the rows are in the same order. The paranoid might want to convert the row names of the counts into a column (rownames_to_column) and then join on sample.id. Next I use pivot_longer (the replacement for tidyr::gather) to convert the data from a species x samples data.frame to a long format, and join it to the taxon information. Note that some datasets have an alias column in taxon.list and you need to join by that instead.

bambili <- meta %>%
  select(age, depth) %>%
  bind_cols(cnts) %>%
  pivot_longer(cols = -c("age", "depth"), names_to = "species", values_to = "count") %>%
  left_join(taxa, by = c("species" = "taxon.name"))
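The paranoid, join-based version might look something like this sketch. The objects cnts_toy and meta_toy are toy stand-ins for the Neotoma counts and sample metadata (the real row names of the counts hold the sample.ids):

```r
library("tidyverse")

#toy stand-ins for cnts and meta: row names of the counts are the sample.ids
cnts_toy <- data.frame(Pinus = c(10, 20), Poaceae = c(5, 2),
                       row.names = c("101", "102"))
meta_toy <- data.frame(sample.id = c(101, 102), age = c(1000, 2000),
                       depth = c(10, 20))

bambili_toy <- cnts_toy %>%
  rownames_to_column("sample.id") %>% #row names into a column
  mutate(sample.id = as.numeric(sample.id)) %>%
  left_join(meta_toy, by = "sample.id") %>% #join - row order cannot silently mismatch
  pivot_longer(cols = -c("sample.id", "age", "depth"),
               names_to = "species", values_to = "count")
```

The join makes the sample matching explicit, so a reordered counts table cannot silently attach the wrong ages.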

Now is a good time to check what we have. It’s useful to have the Neotoma database manual open to check what terms mean.

bambili %>% count(variable.element)
## # A tibble: 3 x 2
##   variable.element     n
##   <chr>            <int>
## 1 pollen           36750
## 2 pollen/spore       175
## 3 spore              875

variable.element contains just pollen and spores. It can include a variety of other things (such as stomata) that we might want to filter out.
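If such types were present, they could be dropped before the percentages are calculated. A sketch, keeping only the three values listed above (with this dataset the filter keeps everything):

```r
bambili <- bambili %>%
  filter(variable.element %in% c("pollen", "pollen/spore", "spore"))
```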

eco_types %>%
  semi_join(bambili, by = c("EcolGroupID" = "ecological.group")) %>%
  select(EcolGroupID, EcolGroup)
## EcolGroupID EcolGroup
## 1 AQVP Aquatic Vascular Plants
## 2 MANG Mangroves
## 3 PALM Palms
## 4 SEED Spermatophyte rank or clade above order
## 5 TRSH Trees and Shrubs
## 6 UNID Unknown and Indeterminable Palynomorphs
## 7 UPHE Upland Herbs
## 8 VACR Terrestrial Vascular Cryptogams

I don’t want to include the AQVP or UNID in the diagram. I’m not sure about SEED.

bambili %>%
  filter(ecological.group == "SEED", count > 0) %>%
  select(species, count) %>%
  group_by(species) %>%
  summarise(n = n(), max = max(count))
## # A tibble: 1 x 3
##   species                      n   max
##   <chr>                    <int> <dbl>
## 1 Monocotyledoneae undiff.    33     6

So SEED is an unidentified monocot present at low abundances. I’m going to delete it.

#remove unwanted variable.element/ecological.group

bambili <- bambili %>%
  filter(!ecological.group %in% c("AQVP", "UNID", "SEED")) #use `%in%` not `==`

This is a good time to check the count sums. It might be prudent to delete any samples with very low counts.

#check count sums
bambili %>%
  group_by(depth) %>%
  summarise(s = sum(count)) %>%
  arrange(s) %>%
  slice(1:5)
## # A tibble: 5 x 2
##   depth     s
##   <dbl> <dbl>
## 1  510.    72
## 2  220.   157
## 3  200.   176
## 4  240.   177
## 5  280.   233

Seventy-two isn't great, but I'm going to keep that sample.
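Had I decided to drop low-count samples instead, a grouped filter would do it. A sketch; the threshold of 150 is a hypothetical choice, not a recommendation:

```r
bambili <- bambili %>%
  group_by(depth) %>%
  filter(sum(count) >= 150) %>% #hypothetical minimum count sum
  ungroup()
```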

Now we can calculate percent and remove the rare taxa

#calculate percent
bambili <- bambili %>%
  group_by(depth) %>%
  mutate(percent = count / sum(count) * 100)

#remove rare taxa
bambili1 <- bambili %>%
  group_by(species) %>%
  filter(
    sum(percent > 0) >= 3, #must be in at least three samples
    max(percent) > 3) #must have a max percent > 3

Now we can use pivot_wider to reshape the data back into a species x samples data.frame that we can plot with rioja::strat.plot. For convenience, I’m separating the species data from the age/depth data.

bambili2 <- bambili1 %>%
  select(age, depth, species, percent) %>%
  pivot_wider(names_from = "species", values_from = "percent")

bambili_spp <- bambili2 %>%
  select(-age, -depth) %>%
  as.data.frame() #needed as strat.plot does not like tibbles - pull request to fix this submitted

Now we can plot the stratigraphy.

#strat.plot
strat.plot(bambili_spp, yvar = bambili2$depth)

[Figure: the basic pollen diagram]

There are a variety of aspects of this plot that need improving. We need to:

  • plot on constant scale for all taxa
  • reverse y axis so deeper levels are lower
  • arrange the taxa in some logical order
  • rotate the species names and set the figure margins
  • add a cluster diagram and zones
  • add a secondary scale

Some of these can be fixed by setting an argument in strat.plot (there are a lot of arguments – see ?strat.plot), but to reorder the species, we need to reprocess the data.

bambili2 <- bambili1 %>%
  mutate(
    #make ecological.group a factor with TRSH first
    ecological.group = factor(ecological.group, levels = c("TRSH", "UPHE", "VACR")),
    mean_percent = mean(percent)) %>%
  #arrange by ecological.group and mean_percent (largest first)
  arrange(ecological.group, desc(mean_percent)) %>%
  ungroup() %>%
  #make species into a factor so we can preserve the order
  mutate(species = factor(species, levels = unique(species)))

#reshape using tidyr::spread as pivot_wider (currently?) ignores factor order
bambili3 <- bambili2 %>%
  select(age, depth, species, percent) %>%
  spread(key = "species", value = "percent")

bambili_spp <- bambili3 %>%
  select(-age, -depth) %>%
  as.data.frame() #strat.plot needs a data.frame, not a tibble

#set up for ecological group colours
ecological_groups <- bambili2 %>%
  distinct(species, ecological.group) %>% 
  pull(ecological.group)
ecological_colours <- c("red", "green", "orange")

And we run a constrained cluster analysis

bambili_dist <- dist(sqrt(bambili_spp/100))#chord distance
clust <- chclust(bambili_dist, method = "coniss")
#bstick(clust)#five groups

Now we can make a better plot.

#set up mgp (see ?par)
mgp <- c(2, 0.25, 0)
par(tcl = -0.15, mgp = mgp)#shorter axis ticks - see ?par
pt <- strat.plot(
  d = bambili_spp,
  yvar = bambili3$depth,
  y.rev = TRUE, #reverse direction of y-axis
  scale.percent = TRUE, #use constant scale for all taxa
  srt.xlabel = 45, #rotate x-label by 45 degrees
  cex.xlabel = 0.8, #smaller font
  mgp = mgp,
  xRight = 0.98, #right margin
  xLeft = 0.21, #left margin with space for 2nd axis
  yTop = 0.60, #top margin
  yBottom = 0.1, #bottom margin
  col.line = ecological_colours[ecological_groups], #colours
  col.bar = ecological_colours[ecological_groups], #colours
  ylabel = "Depth cm",
  clust = clust
)

#add zone boundaries
addClustZone(pt, clust = clust, nZone = 5)

#add a secondary scale
secondary_scale(pt, yvar = bambili3$depth, 
                yvar2 = bambili3$age, 
                ylabel2 = "Date yr BP",
                n = 10)

[Figure: the improved pollen diagram with zones and a secondary age scale]

It’s beginning to get there. There are probably too many taxa plotted. Merging the various Polypodiophyta would help, as the names are very long and the ecological interpretation of the different types is unclear. I also want to reduce the space between the y-axis and the ylabel. Unfortunately this is hard coded in strat.plot, but it would only take a minute to make it into an argument. I’d also like to add a % sign to the axis, but I cannot see an argument that will do that (again it shouldn’t be hard to code – adding yet another argument).

I would like to have some text on top describing the ecological groups, but I have the feeling that that would be very difficult to do.

Posted in Data manipulation, R, reproducible research | Tagged | 1 Comment

Count sums – the preprint

I wandered lonely as a cloud
That floats on high o’er vales and hills,
When all at once I saw a crowd,
A host, of golden chironomids;
Beside the lake, beneath the trees,
Fluttering and dancing in the breeze.

Continuous as the stars that shine
And twinkle on the milky way,
They stretched in never-ending swarm
Along the margin of a bay:
Ten thousand saw I at a glance,
Tossing their heads in sprightly dance.

The waves beside them danced; but they
Out-did the sparkling waves in glee:
A poet could not but be gay,
In such a jocund company:
I gazed—and gazed—but little thought
What wealth the show to me had brought:

For oft, when on my couch I lie
In vacant or in pensive mood,
They flash upon that inward eye
Which is the bliss of solitude;
And then my heart with pleasure fills,
And dances with the chironomids.

Sincere apologies to W. Wordsworth: nothing rhymes with chironomids

But were there really ten thousand? Next time you walk by a lake, try to count the chironomids in a swarm. It is impossible, they just move about too much, and it doesn’t seem to get much easier when they are dead.

Just look at the midge stratigraphy from the eutrophic Hiidenvesi in Finland. The methods section of the paper reports that “A minimum count size of 50 individuals was set” in 5 cm3 sediment samples.

[Figure 4 from the paper]

Midge stratigraphy from Hiidenvesi including the absolute (black bars) and relative abundances (grey bars) of chironomids and chaoborids. The chaoborid/chironomid ratio indicates the severity of anoxia and N2 diversity index represents the effective number of occurrences

Do these assemblages really look like the minimum count sum is 50?

The lowest sample has about 45% Chaoborus flavicans and just over 10% of five chironomid taxa. This looks suspiciously like a count of nine midges: four Chaoborus (44.4%) and one each for five chironomid taxa (11.1%). This interpretation is supported by the concentration data on the left of the figure. This sample has 0.8 Chaoborus and 1 chironomid per cubic centimetre of sediment. Since 5 cm3 of sediment were sampled and 5 * (0.8 + 1) = 9, this also implies that the count sum is nine midges. Last time I checked, nine was somewhat less than fifty. The second sample from the top appears to have the largest count sum with 1.2 Chaoborus and 3 chironomids per cubic centimetre of sediment, implying a count sum of 21, less than half that reported. All the analyses based on the chironomid stratigraphy are much more uncertain than the reader would expect and may be biased.
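The back-of-envelope arithmetic generalises: if the rarest taxon in an assemblage was counted once, the count sum is roughly 100 divided by the smallest non-zero percentage. A minimal base-R sketch (a lower bound, not the full estimator developed in the preprint):

```r
#lower bound on the count sum, assuming the rarest taxon was counted once
min_count_sum <- function(percent) {
  p <- percent[percent > 0]
  round(100 / min(p))
}

#the lowest Hiidenvesi sample: ~44.4% Chaoborus and five taxa at ~11.1%
min_count_sum(c(44.4, 11.1, 11.1, 11.1, 11.1, 11.1)) # 9 midges
```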

Given the number of examples of papers that appear to have counted fewer, often far fewer, microfossils than reported, I decided to write up and submit a manuscript describing the problem and showing how the count sum can be estimated from percent data. If you don’t have at least infinite patience for the peer review process to finish, you can read a preprint on Paleorxiv. Some of the examples I report in the preprint (none of which have yet featured on this blog) appear to be due to lazy descriptions of the data, some due to assemblages being scheduled for deletion but not actually deleted, and some due to what might be described as falsification of the count sums. I also accidentally found some fabricated data – maybe more on that in another post.

Posted in Peer reviewed literature | Tagged , , | Leave a comment

Count-less chironomids?

Most papers that present microfossil assemblages report (not always accurately) the minimum number of microfossils that were counted in each sample, an important indicator of the precision of the data. Some do not. For these papers, the reader should be able to assume that the counts conform to typical practice for that microfossil. For example, I would expect a pollen or diatom stratigraphy to have counts sums of a few hundred, and a chironomid stratigraphy to have a minimum count sum of fifty (perhaps thirty). If the count sums are materially lower than this I would start to wonder if the omission was deliberate and recall Fanelli’s (2013) definition of misconduct

‘any omission or misrepresentation of the information necessary and sufficient to evaluate the validity and significance of research, at the level appropriate to the context in which the research is communicated’.

The figure below is from a paper that reconstructs July temperature for four intervals in the period ca. 570 BCE to 120 CE (Iron Age–Roman Period) using chironomids from the varved sediments of Lake Silvaplana. The paper does not report the minimum count sum; can it be estimated from the stratigraphy?

[Figure 4 from Stewart et al.]

(a) Chironomid assemblages, (b) inferred mean July air temperatures and (c) % organic matter. For part (a), percentages for each taxon are represented by black bars. Thin black lines represent the zones created using the ZONE program (Juggins, 1991).

I think the minimum count sum can be estimated. The lowest assemblage is 100% Orthocladius. This could represent one Orthocladius, a hundred, or a thousand, but given the chironomid diversity in the lake (in contrast to polar foram assemblages, which are commonly 100% Neogloboquadrina pachyderma), it is unlikely that a large count would contain just one taxon, even if that taxon is fairly dominant. The next assemblage is 100% Cricotopus, the third and fourth 100% Corynoneura, and the fifth 100% Procladius: there are two possibilities, either the chironomid community in the lake repeatedly changed abruptly and taphonomic processes did not mix the assemblages, or the count sum for these assemblages is very low, perhaps only one. Fifty-five of the 208 assemblages in the stratigraphy are monospecific.

The eighth and ninth assemblages are 50% Tanytarsus and 50% another taxon. These are conceivably counts of two chironomids. The eleventh assemblage is more diverse with five taxa each with 20%, perhaps a count of five. Only a few assemblages, mainly near the top, appear to have a count sum of more than ten chironomids.

These are the lowest estimated count sums I have found, even lower than in Lake Muzzano.
Despite these extremely low count sums, a reconstruction of July temperature is made with a transfer function. Perhaps not surprisingly, the reconstruction is very noisy, with abrupt changes of over 10°C. The reconstruction is meaningless junk.

A correction is needed to report the omitted count sums, and if they are mainly in low single figures, I think the chironomid part of the paper should be retracted as no reviewer would ever have recommended publication had they known that.

Posted in Peer reviewed literature, transfer function | Tagged , , | Leave a comment

Tools for a reproducible manuscript

It is all too easy to make a mistake putting a manuscript together: mistranscribe a number, forget to rerun some analyses after the data are revised, or simply bodge the rounding of some numbers. Fortunately, it is possible to avoid these problems with some technology that inserts numbers and graphs into the manuscript, keeps them all up to date, and lets you see the calculations behind everything. The technology does not prevent code errors, but by making the analysis transparent, they are hopefully easier to find.

There are numerous guides to writing dynamic documents with Rmarkdown. What I want to do here is show some of the things I needed to write a manuscript with Rmarkdown. There may be better solutions to some of the problems I faced – tell me about them. If you want to see any of this in action, take a look at the GitHub repo for my recently accepted manuscript.

To get started with Rmarkdown, in RStudio, go to File >> New File >> R markdown and you get a sample file to play with. Remember, no spaces in the filename.

Too much code

Rmarkdown lets you have chunks of R code embedded in the text, so that when the document is rendered, the code is evaluated following the chunk’s options and the results printed. This is great for teaching and simple documents, but when you have several hundred lines of code to clean data it becomes unwieldy to have everything in one document, and prevents reuse if you want to make a manuscript and a presentation with the same code.

The first solution I found for this is the knitr::read_chunk function which lets you put the code in a separate file. I later found the drake package which lets you put most of the code in the drake_plan and then readd or loadd the generated objects (targets) into rmarkdown. When you make the drake_plan only the targets that are out of date (because of code or data changes) get updated. This nicely solves the problem of how to cache the results of slow analyses (there is a cache built into Rmarkdown but it is a bit fragile).
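A minimal drake workflow might look like the sketch below. The target names and the helpers clean_counts and make_diagram are hypothetical; the real plan in the repo is much longer:

```r
library("drake")

plan <- drake_plan(
  raw_data = read.csv(file_in("data/counts.csv")), #hypothetical input file
  cleaned  = clean_counts(raw_data),               #hypothetical cleaning function
  fig1     = make_diagram(cleaned)                 #hypothetical plotting function
)

make(plan) #only out-of-date targets are rebuilt

#then, in the Rmarkdown:
#loadd(cleaned) #load a target into the session
#readd(fig1)    #read and print a target
```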

Article format

If you want to submit your manuscript to a journal as a PDF, things will go a lot smoother if you use the journal’s LaTeX template. I have no idea how to do that either, but the rticles package takes away most of the pain. You make a draft manuscript with

rmarkdown::draft("MyArticle.Rmd", template = "acm_article", package = "rticles")

This generates the required YAML, the metadata at the top of the Rmarkdown file between two sets of ---. This works well provided you don’t do anything daft such as submit to Wiley, one of the few major publishers without a template in rticles.

Citations

Obviously, a manuscript is going to need citations, and formatting citations is a pain. rticles will put a couple of lines in the YAML to say which bibtex file to use and which citation style.

bibliography: chironomid2.bib
csl: elsevier-harvard_rjt.csl

Any decent citation manager will be able to export a bibtex file. I use JabRef.

CSL files for many different journals are available. I had to edit the Elsevier style slightly so that it showed R packages correctly.

In the text, write [@Smith1990] to cite the reference with the key Smith1990 with name and year, or [-@Smith1990] to get just the year. The formatted reference list gets added to the end of the file. The citr addin for Rstudio lets you search through the bibtex file and select references to add to the manuscript.

Line numbers

I hate reviewing manuscripts without line numbers. So I wanted line numbers on my manuscript. This turned out to be easy – Rmarkdown documents are rendered into PDF via LaTeX, so LaTeX commands can be used.

In the YAML, add

header-includes:
  - \usepackage{lineno}
  - \linenumbers

And that’s it.

Cross-referencing figures

One of my favourite parts of writing a manuscript is when I decide to add another figure and need to renumber all the other figures and make sure that the text refers to the correct figure.

Rmarkdown can cross-reference figures (also tables etc.). This feature was added relatively recently and needs bookdown rather than rmarkdown, which can be set in the YAML (NB spacing is important in YAML)

output:
  bookdown::pdf_book:
    base_format: 
      rticles::elsevier_article
    number_sections: true

In the text, to refer to the figure from the chunk with the label weather-climate-plot, use (Fig. \@ref(fig:weather-climate-plot)). Important: this does not work if you have underscores in the chunk name. This took me an entire evening to work out.

You can also cross-reference figures in the supplementary material. In the header-includes: section of the YAML, add

  - \usepackage{xr}
  - \externaldocument[sm-]{supplementary_data}

Where supplementary_data is the name of the rmarkdown file you want to reference. In the text you can then use Fig. \@ref(sm-fig:zab-dist-analogue). The sm prevents any clash between the labels in the manuscript and the supplementary.

To force the figures in the supplementary to be prepended with an “S”, to the YAML in the supplementary, add

header-includes:
  - \renewcommand{\thefigure}{S\arabic{figure}}

This magic works on the *.tex files made as an intermediate stage of rendering the PDF. You need to keep these by running rmarkdown::render with clean = FALSE, then you can run tinytex::pdflatex

tinytex::pdflatex("manuscript.tex", clean=TRUE)
tinytex::pdflatex("supplementary_data.tex", clean=TRUE)

Package management

One of the great things about R is that the packages are often improved. This is also a problem as the code that worked fine may fail after you update a package. One solution to this is to use packrat to keep track of which packages your project uses. packrat will automatically download all the packages you need to reproduce my manuscript.

I found packrat to be a bit of a pain, and am looking forward to trying renv.

Rmarkdown and git

As rmarkdown files are text files they work well with version control using git and, for example, GitHub. Since git tracks changes in each line of text, it would show a long paragraph as having changed if you add one comma. A trick I learnt somewhere is to break the paragraph up, one sentence per line, so the version control now works at the sentence level. Rmarkdown ignores single line returns in a document, you need two to get a new paragraph.

 

It is not difficult

If you can write code, and you can write text, you can write a manuscript with Rmarkdown. Your coauthors can probably cope too.

Posted in reproducible research | Tagged | 2 Comments

Double diatoms

I am in awe of Dr Elisabeth Bik and her amazing ability and dedication to spotting duplications in images.

Follow the thread to see the duplications she finds in this photograph. I usually struggle to find duplications until they are pointed out.

But not always.

Take, for example, this diatom calibration set.

[Figure: calibration-set diatom assemblages, from “Diatom-based reconstruction of trophic status changes recorded in varved sediments of Lake[...]”]

The lower two assemblages are, within the resolution of the PDF, identical. Having spotted that, I wanted to know whether any of the other assemblages were identical, so I scraped the data out of the PDF. All the other assemblages are distinct.

But what about the next figure which shows the down-core stratigraphy?

[Figure: down-core diatom stratigraphy, from the same paper]

While the upper part of the diagram is spiky as expected, the lower part of the diagram is step like with nine pairs of apparently identical assemblages.

These identical assemblages in the calibration set and the down-core stratigraphy are unexpected. I asked the authors if they could explain what happened, but they haven’t replied.

The paper also includes a transfer function that I cannot reproduce with the assemblage data I scraped from this paper and the environmental data I scraped from another paper on the same calibration set lakes. The reported cross-validation performance is better than the apparent performance I get, but it is possible that the environmental data had more problems than just using the wrong unit prefix. If my model is correct, the actual cross-validation performance is very weak (R2 = 0.03), the reconstruction shown in the paper is bogus, and the paper should be corrected (or retracted).
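The reproduction attempt might be sketched along these lines with rioja, using weighted averaging as a stand-in for whatever method the paper used; spp (scraped assemblage percentages) and env (the scraped environmental variable) are hypothetical objects here:

```r
library("rioja")

#spp: scraped assemblage percentages; env: scraped environmental variable
#(hypothetical stand-ins - the real data were scraped from the PDFs)
mod <- crossval(WA(spp, env), cv.method = "loo")
performance(mod)$crossval #compare leave-one-out RMSE and R2 with the published values
```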

I’m not holding my breath.

Posted in Peer reviewed literature, Uncategorized | Tagged , | Leave a comment

Funky ordination plots with ggvegan

Yesterday, I tweeted a photo of an ordination I plotted with ggvegan, and thought I should show how I made it.

Ordinations can be plotted with base R (see ?plot.cca). This is fine for simple plots, but it is a lot of effort to make a complex plot. The ggvegan package, written by Gavin Simpson, lets you use the power of ggplot2 and makes complex plots easier to produce.

Start by loading some packages. ggvegan will need to be installed from GitHub if you don’t already have it.

#load packages
library(tidyverse)
library(vegan)
#devtools::install_github("gavinsimpson/ggvegan")
library(ggvegan)

I’m going to use the pyrifos dataset, which is included in vegan. The data are the abundances of aquatic invertebrates from an experiment in which twelve ditches were studied several times before and after insecticide treatment. We need to make a tibble (or a data.frame) of the predictors (dose, time and ditch ID).

#load data
data("pyrifos")

env <- tibble(
  week = rep(c(-4, -1, 0.1, 1, 2, 4, 8, 12, 15, 19, 24), each = 12),
  dose = factor(rep(c(0.1, 0, 0, 0.9, 0, 44, 6, 0.1, 44, 0.9, 0, 6), 11)),
  ditch = gl(12, 1, length = 132))

We can run a principal components analysis with rda.

pca <- rda(pyrifos)

Simple ordination plots can be made with the autoplot function. This plot shows all the samples. This is fine, but we want to see dose and time information.

autoplot(pca, layers = "sites", arrows = FALSE)

[Figure: autoplot of the PCA sites]

At the moment, it is difficult to do much to this plot. But we can use the fortify function from ggvegan to extract the scores from the ordination object and then combine these with the predictor data before plotting with ggplot. I’m going to colour the samples by dose, and draw a line through the samples in each ditch, with a point on the first value.

pca_fort <- fortify(pca, display = "sites") %>%
  bind_cols(env)

ggplot(pca_fort, aes(x = PC1, y = PC2, colour = dose, group = ditch)) +
  geom_path() + #use geom_path not geom_line
  geom_point(aes(size = if_else(week == min(week), 1, NA_real_)), show.legend = FALSE) +
  scale_color_viridis_d() +
  scale_size(range = 2) +
  coord_equal()

[Figure: PCA site scores with ditch trajectories, coloured by dose]

Don’t forget to add coord_equal() to ensure that the plot is correctly scaled.

Happy plotting.

Posted in R, Uncategorized | Tagged , | Leave a comment

Sunspots and raindrops

It is time, as the walrus said, to talk about another paper reporting an effect of solar variability on the Earth’s climate. Laurenz et al. (2019) correlate European country-scale precipitation data from CRU with the number of sunspots, a proxy for solar activity. After allowing for a lagged response, they find some reasonably high and significant correlations in at least some months and some countries.

I do expect solar variability to affect climate, but I expect the effect to be quite small and difficult to detect above a background of internal variability and other forcings such as greenhouse gas and aerosols. However, statistical traps abound. This makes me wary of papers that report strong and highly significant relationships. Especially when the authors write

The probability that the correlation r = 0.54 is by chance is less than 0.1% (p < 0.001).

This reflects a common misunderstanding of p-values. The co-authors, Horst-Joachim Lüdecke and Sebastian Lüning, the reviewers, and the editors don’t look good for letting this slip past them.

As the paper reports, monthly precipitation time series have very little autocorrelation and are almost white noise, so it seems unlikely that the strongly autocorrelated sunspot timeseries would be a good predictor.

This is easily tested by looking at some data. Not surprisingly, the correlation between the raw Danish February precipitation and sunspots is weak (r = 0.07) and not remotely close to being significant (p = 0.43).

So how do Laurenz et al. get to report this correlation as 0.42 (p < 0.01)? Predictably, they smooth the data, and allow for lagged responses. Using their methods, I get approximately the same correlation. But what about the significance? Many papers that smooth the data or allow for lagged response (a form of multiple testing) fail to correct the p-value and then find spuriously significant results. Laurenz et al. don’t make that mistake – they use a Monte Carlo procedure, applying their method to 10,000 surrogate time series to find the distribution of p-values under the null hypothesis. They report that for values of |r| > 0.45 p < 0.001; for |r| > 0.35 p < 0.01; and for |r| > 0.25 p < 0.05 (I’m not exactly sure why they are reporting p-values for absolute correlations when they need a one-sided test.)

It only takes a few lines of code to re-implement their Monte Carlo procedure and check these results.

library("tidyverse")
library("furrr")
library("magrittr")

#load data
ss0 <- read.table("data/SN_m_tot_V2.0.txt", nrow = 3237)
ss <- ts(ss0$V4, start = ss0$V1[1], frequency = 12) %>%
  window(start = 1901, end = c(2015, 12))

denmark <- read.table("data/crucy.v3.26.1901.2017.Denmark.pre.per", skip = 3, header = TRUE) %>%
  filter(YEAR <= 2015)

set.seed(101)

#make function
lags.cor <- function(l, m, ss, clim){
  ss.lag <- stats::lag(ss, l)
  ss.lag <- window(ss.lag, start = c(1901, m), end = c(2015, m), extend = TRUE, deltat = 1)
  cor(ss.lag, clim, use = "pair")
}

#make null distribution
null_dist <- future_map(1:10000, ~{
  y = rnorm(nrow(denmark))
  ys = signal::sgolayfilt(x = y, p = 5, n = 11)
  tibble(lag = -110:24) %>%
    mutate(cor = map_dbl(lag, lags.cor, m = 1, clim = ys, ss = ss))
}) %>%
  bind_rows(.id = "n") %>%
  mutate(n = as.numeric(n))

I am using white noise as I cannot find a function to simulate autocorrelated time series with a Hurst exponent of 0.6 which the authors use. This will make my significance thresholds slightly lower, as will my use of a one-sided rather than two-sided test. Despite this, I find that the significance thresholds are much higher than those the authors report.
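As a stopgap for the missing fractional-noise simulator, an autocorrelated surrogate could be generated with an AR(1) process via stats::arima.sim. The coefficient of 0.3 is a hypothetical choice, not a fitted value, and an AR(1) series is not fractional noise with H = 0.6, but it shows how the white-noise y in the plan above could be replaced:

```r
set.seed(42)
#autocorrelated stand-in for the white-noise y = rnorm(nrow(denmark)) above
y <- as.numeric(arima.sim(model = list(ar = 0.3), n = 115)) #115 years, 1901-2015
```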

null_dist %>%
  group_by(n) %>%
  slice(which.max(cor)) %$%
  quantile(cor, probs = c(0.95, 0.99))
##  95%  99%
## 0.40 0.47

Only if I ignore the lagged responses do I get results similar to the authors.

null_dist %>%
  filter(lag == 0) %$%
  quantile(cor, c(0.95, .99))
##  95%  99%
## 0.25 0.34

It appears that the authors neglected to test for lagged responses in their Monte Carlo procedure. As such their p-values are over optimistic and many of their apparently significant results are actually not significant.

Still, there are ~45 monthly series with a correlation that exceeds the 95th percentile of the null distribution. Is this impressive? Given that there are 39 countries x 12 months = 468 tests, we should expect ~23 correlations to be “significant” just by chance, so a correction for multiple testing is required. Getting 45 significant correlations is more than expected. However, the timeseries from neighbouring countries (e.g., Belgium and Luxembourg) can be very highly correlated because of the spatial patterns in precipitation (perhaps enhanced by the gridding process), so if one country has a significant correlation, several more probably will as well. This increases the probability of getting 45 “significant” correlations by chance.
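Under independence, the chance of seeing 45 or more “significant” results in 468 tests at the 5% level can be checked with a binomial tail probability. A sketch that ignores the spatial correlation just described, which would make the true probability considerably larger:

```r
n_tests <- 39 * 12 #countries x months
0.05 * n_tests #~23 "significant" correlations expected by chance

#probability of 45 or more significant tests if all were independent
pbinom(44, size = n_tests, prob = 0.05, lower.tail = FALSE)
```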

Given the apparent error in calculating the p-values in Laurenz et al. (2019), the authors need to archive their code, and if the problem is confirmed, should write a correction to their paper.

Posted in Peer reviewed literature, Uncategorized | Tagged , | 5 Comments