Conference abstract deadlines

“I love deadlines. I love the whooshing noise they make as they go by.”
Douglas Adams

When I go to a conference, I want to present my most recent relevant work. But often the deadline for abstract submission is months before the conference, so I need to decide which work is likely to be completed in time, and write a convincing abstract before all (sometimes any) results are available.

Take for example the next INQUA conference, the deadline for abstract submission, 9th January, is 6.5 months in advance of the conference.

I wondered if this lead time was typical, so I trawled through some conferences to find their lead times. I’ve looked for conferences in ecology and geology, avoiding the plague of predatory conferences by only listing those I know or found on the PAGES and Nature Ecology calendars. These gave a mix of 2018 and 2019 conferences. I stopped once I had 30 conferences. If the submission deadline was extended, I took the original date as I cannot tell if the deadlines will be extended for 2019 conferences.


Histogram of conference lead times in ecology and geology

The median lead time is 110 days, much shorter than for INQUA. Of course, I want to know why there is a 5.8 fold difference between the longest (Polar2018) and shortest (AQUA) lead time.

I have a few ideas, some of which are probably testable.

  • Some of the conferences probably plan to extend the abstract submission deadline, so the published deadline is artificially early, whereas for others the deadline is fixed. As extensions are usually only by a week or two, this should explain only a little of the variance.

  • I only looked for conferences in ecology or geology, so I suspect research field explains little of the variance. I predict that some faster moving fields, e.g. biomedicine, lead times should typically be short.

  • I wonder whether conference size is associated with lead times. Large conferences are obviously take more effort to organise, but on the other hand, have more resources. It is difficult to work out how large conferences are, but two of the largest conferences, the EGU and ESA have very different lead times (87 and 171 days, respectively). Small, relatively informal conferences should be the easiest to organise, and therefore short lead times.

I prefer short lead times. I am more likely to submit an abstract to a conference with a three month lead time than one with a six month delay.

Posted in meta science | Tagged | 1 Comment

No new data no comment at Nature Communications

Of all the modes of post-publication peer review, comments published in the same journal as the original article are the most visible, and because they have survived editorial and reviewer scrutiny, carry at least modicum of credibility. Unfortunately, comments are much rarer than expected given the number of papers our journal club savages. For example, QSR has published just a couple of dozen in the last decade. Part of the reason for the sparsity of comments must be the hurdles placed in the path to publication.

These are the requirement from Science

Technical Comments (up to 1000 words, 2 figures or tables, 15 references, and no Supplementary Materials), are published online and critique the core conclusions and/or methodology of research published in Science within the previous 3 months.  Technical Comments should not present new data or other previously unpublished work nor be based on new findings/concepts that would not have been accessible to the authors when the paper was written.

The word limit is tight and the deadline restrictive (confession: I don’t always read Science papers the day they are published), but otherwise this is reasonable.

How about the requirement from Nature Communications?

Submissions should challenge the main conclusions of the Nature Communications paper and contain new, unpublished data to support the arguments.

•    They should not exceed 1200 words (main text).
•    Contributions should start with a brief paragraph that summarizes the message of the article without specialized terminology, for a non-specialist readership. This paragraph should be used as the abstract for submission purposes.
•    Contributions should have a simple message that requires only one or two small figures or tables. Contributions with more than two figures and/or tables will not be considered.
•    As a guideline, contributions allow up to 15 references.
•    Supplementary Information is permitted at the editor’s discretion.

The requirement for “new, unpublished data” is the most important difference: Science forbids it, Nature Communications demands it.

I’ve helped write a few comments showing that:

  • The method used by Pither and Aarssen (2005) to show that most diatoms taxa are pH generalists had a very high type II error rate which voided their conclusions.
  • That changes of the sedimentation rate in Lake Erhai caused artifacts that Wang et al (2012) misinterpreted as flickering prior to a critical transition.
  • The methods used by Lyons et al (2015) caused a spurious breakpoint in the mid-Holocene.

The common feature of these comments is that none of them contain new unpublished data.

Nature Communications doesn’t require articles to contain new unpublished data. This is good, many of my papers don’t contain new data; new methods and ideas are as important as new data. However, it potentially leads to a bizarre situation that an article presenting a novel but flawed analysis of previously published data extracted from Neotoma or some other database could not be critiqued in a comment unless the authors were able to find some new unpublished data (and describe it in 1200 words).





Posted in Peer reviewed literature | Tagged , | 3 Comments

Insist() on the text and numbers agree in Rmarkdown

The manuscript I submitted a long time ago contains multiple sentences
where I describe the result with both text and numbers. For example:

With a three-year moving average, the correlation is weak (r = 0.21) and not statistically significant (p = 0.28).

This is produced from the Rmarkdown

With a three-year moving average, the correlation is weak (r = `r smo_cor$estimate`) and not statistically significant (p = `r smo_cor$p.value`).

If I update the code that generates smo_cor, or the data changes, the
numbers will automatically update potentially creating a conflict with
the text which needs updating manually. I need a solution to identify
when this has happened.

I want a function that I can give an object and a logical test that will
either return the object if the test is true, or throws an error if it
is not. I’ve used the assertr package before, which does what I want,
but only for data.frames. I’ve looked in assertthat but cannot find
what I want, so I’ve written my own short function.

insist = function(., test){ 
  if(!eval(substitute(test), envir = parent.frame())){    
     stop(deparse(substitute(test)), " is not TRUE: \n. = ", .) 

Quick test

pi %>% insist(. > 4)#FALSE - throws error

## Error in insist(., . > 4): . > 4 is not TRUE:
## . = 3.14159265358979

pi %>% insist(. < 3.5)# TRUE - returns pi 
## [1] 3.141593 

My Rmarkdown now looks like

With a three-year moving average, the correlation is weak (r = `r smo_cor$estimate %>% insist(. % insist(. > 0.05)`).

Now if either of the statements is false, I should get an informative

Posted in R, reproducible research | Tagged | Leave a comment

Citing R and packages automatically

Almost every manuscript I write has a paragraph that looks something
like this:

All analyses were done in R version 3.4.4 (R Core Team 2017). Ordinations were fitted with vegan version 2.4-6 (Oksanen et al. 2017) with square-root transformed assemblage data. Transfer functions were fitted with rioja version 0.9-15.1 (Juggins 2017) using square-root transformed species data. Some diagnostics were performed using analogue version 0.17-0 (Simpson and Oksanen 2016).

Keeping this paragraph up to date as new versions of R are installed and
packages get updated is a pain and apt to go wrong. For example, there
are nine citations for ‘R core development team
on Google Scholar even though the citation changed to ‘R core team’ in

So this is my solution. I am using a couple of functions,
getRversion() which returns the version of R used and packageVersion which returns
the package version used. My Rmarkdown file looks like this

All analyses were done in R version `r getRversion()` [@R]. Ordinations were fitted with vegan version `r packageVersion(“vegan”)` [@vegan] with square-root transformed assemblage data.

Rmarkdown will replace [@R] with the entry in my bibtex file with the
key R, but first I need to add the package citations to my
bibliography. I’ve written a wrapper for bibtex::write.bib to do this.

package_citations <- function(packages, old_bib, new_bib){

  #copy original bib file
  fs::file_copy(old_bib, new_bib)

  #R citation
  Rcite = citation()
  Rcite$key = "R"
  bibtex::write.bib(Rcite, new_bib, append = TRUE)

  #package citation
  bibtex::write.bib(packages, new_bib, append = TRUE)

This function makes a copy of an existing bibtex file and adds citations
for R (with the key set to R) and packages. To use it, I run

packages = c("vegan", "rioja", "analogue"),
old_bib = "Rmd/extra/chironomid.bib",
new_bib = "Rmd/extra/chironomid2.bib")

The new bibtex file is now ready for use – just specify it in the YAML at the top of the Rmarkdown file.

One small complication is that some packages have multiple citations.
These are identified by an integer after the package name in the bibtex
key (e.g. analogue1, analogue2) and you need to inspect the bibtex file
to work out which you want.

Posted in R, reproducible research | Tagged | 2 Comments

Chironomids are cunning wee beasties

Since I had examined almost every aspect of the data from the remarkable Lake Żabińskie chironomid-inferred August air temperature reconstruction, some time last summer I thought that I would, for completeness, have a quick look at the instrumental temperature data. I was not expecting to find any problems.

In the original paper, the authors use a homogenised composite of data from 10 stations near Lake Żabińskie with “longer temporal series”.

Map of weather stations used by the authors

Authors’ map of the ten weather stations used in their composite temperature series. Not really sure why Giżycko (1936-1938) was included.

I haven’t done anywhere near as thorough a job: I’ve simply downloaded some long GHCN series from the KNMI Climate Explorer, and made a composite of their August temperatures after subtracting the 1951-1980 mean. I selected records from Warsaw, Vilnius, Kanaus, and Kalingrad. These are all farther away (up to ~230km) than the stations used by the authors, but given the large spatial scale of temperature anomalies, I don’t think this is a major problem.

For the period after 1939, where the chironomid-inferred August air temperature reconstruction has an annual resolution, my composite corresponds well with the authors’ (Pearson correlation r = 0.98). I’m happy with that.


Prior to 1939, the relationship between my composite and the authors’ looks much worse. In this period, chironomids were insufficiently abundant for an annual resolution reconstruction, so chironomids from two to five years are aggregated for the reconstruction. From looking at the temperature series, it would appear that the authors have not aggregated their temperature data to match the resolution of the chironomid data, but have instead thinned the data by using the temperature of a selected nominal year. So, for example, the chironomid sample from 1896-1898 is compared with the temperature from 1896 rather than the mean for 1896-1898.

In this next figure (showing just the pre-1939 data), my composite has been processed to match the resolution of the authors’ by either thinning to take just the first year, or taking the mean over the relevant years.


The pre-1939 correlation between the authors’ composite and mine is much higher (r = 0.91) if I thin the data than if I take the mean over the relevant interval (r = 0.57).

I wanted to confirm this finding with the authors’ original instrumental data. These should have been archived as the paper was published on the condition that all data necessary to reproduce the result would be made available. The authors refused to send me the instrumental data (though they did send some other data I requested). Fortunately, I realised that Sowieck on the map above is archived in GHCN as Sovetsk, which allowed me to check my analysis. The authors later confirmed that their temperature series was miscalculated and I acquired their composite data.

Mistakes happen; I don’t want to make a big deal about miscalculated instrumental data. The relationship between the instrumental data and the chironomid reconstruction is more interesting. One might expect the correlation between the reconstruction and the instrumental record to be weak before 1939 because of the miscalculation.

Not so. Using the authors’ data, the pre-1939 correlation between the reconstruction and the incorrectly calculated temperature is high (r = 0.81, p = 0.00009), comparable with the later data (r = 0.78), whereas the correlation with the correctly calculated temperature is much lower (r = 0.53, p = 0.03).

Those chironomids are cunning wee beasties, able to give a reconstruction that correlates even with incorrectly calculated temperatures.

Curiously, the correlation of the entire reconstruction with the original temperature series, 0.79, is higher than that reported in the paper (r = 0.74) or the corrigendum (r = 0.73).

Posted in Peer reviewed literature, transfer function | Tagged , | 1 Comment

Low chironomid counts from Lake Muzzano

I am aware of at least 10 papers authored by Dr Larocque-Tobler where there is good evidence that the chironomid count sum has been misreported (a corrigendum to one of these papers admits that a count sum of 19 was misreported as being at least 50). So naturally, when I read Larocque-Tobler & Pla-Rabès (2015), which reports a chironomid and diatom data from Lake Muzzano, I was interested in the chironomid count sums.

Unfortunately, the paper does not report the chironomid counts sums, but instead states that

Sixty-two samples were analyzed for chironomids, but many had few head capsules and were merged together.

When the count sums are not reported, I think it should be reasonable to assume that they conform to the typical count sizes for the microfossil group. For chironomids, the minimum count sum is typically 50 head capsules, although some authors use a minimum of 30. I think it would be poor practice to have count sums below 30 without reporting it.

Since no data are archived for the Lake Muzzano study, we need to try to infer the count sums from the stratigraphic diagram.


Chironomid stratigraphy (%) from Lake Muzzano.

Let’s take a look at a few of the samples.

The top sample has six equally abundant taxa with just over 10% and a seventh with over 30%. I think this probably represent a count of nine head capsules.

Let’s go down the core into the yellow zone:

  • The first yellow sample has four taxa at 25% giving a probable count of four head capsules.
  • The second sample has two species at 50%, suggesting a count of two.
  • The third has three taxa at 20% and a fourth at 40% suggesting a count of five. This is the sample that appears to have the largest count sum in this zone.

It is of course possible that I have underestimated the count sums and that the two taxa at 50% represent a count of four rather than two, but it is very unlikely that this sample has a count sum of greater than, say, 30. I think at least a third of the samples probably have count sums below 10 and perhaps only one or two of the count sums are greater than 20.

While it is easy to show that the count sums are probably very small, does it matter? I think it does. For example, the discussion of Smittia is probably based on only one head capsule and cannot be expected to be robust (the discussion of Limnophyes is simply wrong).

I think that the count sums in the Lake Muzzano study are so small that their size should have been reported (this is good practice in any case) so that the reviewers and readers can make an informed opinion about the signal and noise in the data. That readers would probably have an unfavourable opinion about such small count sums is no reason to omit this information.

Posted in Peer reviewed literature | Tagged , | Leave a comment

Merging taxa in assemblage data

One possible reason for the impossible percent values I’ve found in assemblages data is that taxa have been merged in Excel after percent were calculated. Doing anything in Excel is to invite disaster, if nothing else, it is very difficult to check what has been done.

Merging and renaming taxa is an almost inevitable step in the workflow for processing community and assemblages data. We need a reproducible method: here I show how can it be done with R.

I’m going to assume that the assemblage data are in wide format (one column per taxa) and that there are meta data (depths, ages etc) in one or more columns. If the meta data are in the rownames (which is very convenient for the ‘rioja’ and ‘vegan’ packages, less so for ‘dplyr’ as tibbles don’t have rownames), they can be moved into a column with rownames_to_column.

Here is a small artificial assemblage dataset.

spp <- data_frame(
depth_cm = 1:3,
sp_A = rpois(3, 5),
sp_b = rpois(3, 5),
sp.C = rpois(3, 5),
sp_D = rpois(3, 5))
spp_save <- spp # keep copy for later
## # A tibble: 3 x 5
## depth_cm sp_A sp_b sp.C sp_D
## 1 1 4 8 9 2
## 2 2 4 3 6 3
## 3 3 5 8 6 3

If we just want to rename a couple of taxa, the simplest solution is to use rename, where we set new_name = old_name. rename can take pairs of new and old names, separated by commas.

spp %>% rename(sp_B = sp_b, sp_C = sp.C)
## # A tibble: 3 x 5
## depth_cm sp_A sp_B sp_C sp_D
## 1 1 4 8 9 2
## 2 2 4 3 6 3
## 3 3 5 8 6 3

If there are many names that need altering, or we need to make the same changes to multiple data.frames, we need a different solution as rename gets tedious.

I like to make a data.frame of the old and new names and then use plyr::mapvalues to change the old into the new names. (plyr is a useful package but has several conflicts with dplyr so it is safer to use the :: notation than loading it).

changes <- read.csv(stringsAsFactors = FALSE, text =
"old, new
sp_b, sp_B
sp.C, sp_C", strip.white = TRUE)#this can go in an separate file

names(spp) <- plyr::mapvalues(names(spp), from = changes$old, to = changes$new)
## # A tibble: 3 x 5
## depth_cm sp_A sp_B sp_C sp_D
## 1 1 4 8 9 2
## 2 2 4 3 6 3
## 3 3 5 8 6 3

Merging taxa is possible in the wide format, but much easier in a thin format. We can convert from a wide format to a thin format with gather, and back with spread.

spp <- spp_save#original version

spp_thin <- spp %>% gather(key = taxa, value = count, -depth_cm)#don't gather depth_cm
## # A tibble: 12 x 3
## depth_cm taxa count
## 1 1 sp_A 4
## 2 2 sp_A 4
## 3 3 sp_A 5
## 4 1 sp_b 8
## 5 2 sp_b 3
## 6 3 sp_b 8
## 7 1 sp.C 9
## 8 2 sp.C 6
## 9 3 sp.C 6
## 10 1 sp_D 2
## 11 2 sp_D 3
## 12 3 sp_D 3

If there are just a few taxa that need merging, we can use recode within mutate followed by summarise. Note that in contrast with rename, recode expects “old_name” = “new_name”

spp_thin %>%
mutate(taxa = recode(taxa, "sp.C" = "sp_D")) %>%
group_by(depth_cm, taxa) %>%
summarise(count = sum(count)) %>%
spread(key = taxa, value = count)
## # A tibble: 3 x 4
## # Groups: depth_cm [3]
## depth_cm sp_A sp_b sp_D
## *
## 1 1 4 8 11
## 2 2 4 3 9
## 3 3 5 8 9

If there are many taxa that need merging (or some that need merging and some renaming) we can use mapvalues again.

changes <- read.csv(stringsAsFactors = FALSE, text =
"old, new
sp_b, sp_B
sp.C, sp_D", strip.white = TRUE)#this can go in an separate file

spp_thin %>%
mutate(taxa = plyr::mapvalues(taxa, from = changes$old, to = changes$new)) %>%
group_by(depth_cm, taxa) %>%
summarise(count = sum(count)) %>%
spread(key = taxa, value = count)
## # A tibble: 3 x 4
## # Groups: depth_cm [3]
## depth_cm sp_A sp_B sp_D
## *
## 1 1 4 8 11
## 2 2 4 3 9
## 3 3 5 8 9

This can also be done with a left_join.

spp2 <- spp_thin %>%
left_join(changes, by = c("taxa" = "old")) %>%
mutate(taxa = coalesce(new, taxa)) %>% #takes original name if no new one.
select(-new) %>%
group_by(depth_cm, taxa) %>%
summarise(count = sum(count)) %>%
spread(key = taxa, value = count)
## # A tibble: 3 x 4
## # Groups: depth_cm [3]
## depth_cm sp_A sp_B sp_D
## *
## 1 1 4 8 11
## 2 2 4 3 9
## 3 3 5 8 9

Now the data are ready for further analysis – remember some functions will want you to remove the meta_data first. For example

cca(select(spp2, -depth_cm))
Posted in Data manipulation, R | Leave a comment