Bergen: a year with some sunshine

May was glorious.  December less so.


The data are from the Geofysisk Institutt in Bergen. Here is the code I used

florida = read.csv("../climate_data//Florida_2017-06-21_2018-06-22_1529649176.csv", stringsAsFactors = FALSE) %>% as.tibble()
florida = florida %>%
mutate(date_time = ymd_hm(paste(Dato, Tid)),
date = ymd(Dato),
Tid = ymd_hm(paste("2000-01-01", Tid)),
Globalstraling = if_else(Globalstraling == 9999.99, NA_real_, Globalstraling)) %>%
complete(date, Tid) %>%#
arrange(date, Tid) %>%
mutate(Globalstraling = coalesce(Globalstraling, lag(Globalstraling, 24 * 6)))#fill gaps with previous day

ggplot(florida, aes(x = Tid, y = date, fill = Globalstraling)) +
geom_raster() +
scale_fill_gradient(low = "black", high = "yellow") +
scale_x_datetime(date_labels = "%H", expand = c(0, 0)) +
scale_y_date(date_labels = "%b", expand = c(0, 0)) +
labs(x = "Time", y = "Month", fill = expression(atop(Sunshine, Wm^-1)))

Posted in climate, R, Uncategorized | Tagged | Leave a comment

More trouble counting to fifty

Earlier this week at the palaeolimnology symposium, Gavin told me that it had not dawned on him that the count sum could be estimated from percent data using our knowledge of rank abundance curves. I only recently realised this; previously I did not imagine that this would ever be a useful thing to do.

And yet this morning, I found strong evidence that another chironomid analyst has problems keeping to the count sum promised in the paper (the tally is now three chironomid analysts, a diatomist, and a palynologist).

On Sunday, after I gave my presentation to the chironomid workshop, there was some discussion about what should be done with small counts (they do have some information), but there was unanimity that the paper must report it if some counts contain fewer than the target sum.

At the moment, I am not going to name the analyst whose data I examined this morning. This is only a temporary reprieve, when I write up the presentation I gave on Sunday, this case will be used as an example.

The paper reports that the minimum number of head capsules per sample was fifty. Eight of the thirty-four samples appear to have a count sum of less than fifty. In one sample, the rarest taxon has a relative abundance of 6.25%. The relative abundances of the other six taxa are all integer multiples of 6.25%. This is strong evidence that the count sum is sixteen.

There are three possibilities.

First that the counts are a great fluke. This is unlikely. The sample discussed above could be one in which every taxa was present with a multiple of four head capsules (i.e. the true count sum is 64). Even under extremely optimistic assumptions, the probability of getting such a sample is 1/(47) = 6 * 10-5. And then there are another seven samples, one of which would require all nineteen taxa to have multiples of three head capsules (9 * 10-10). Combining the probabilities of all these unlikely counts will give a very small number.

Neither of the remaining two possibilities is very pleasant.

It could be that the author(s) (whom I believe count(s) their own chironomids) are so negligent that they forgot when they wrote the paper that the count sums of almost a quarter of their samples were smaller than promised. If this is the case, the authors’ competence has to be doubted and we need to ask if anything the author(s) report should be trusted.

The other possibility is that the author(s) knowingly misreported the count sums. This could easily be construed as fabrication (“easily” that is for anyone except a university integrity officer), and data fabrication is a form of misconduct. Obviously this would not be the most serious case misconduct, perhaps akin to plagiarising a paragraph rather than a full paper.

The question remains as to what to do with this case of possible data falsification (and any other cases I find when I have time to import some more data). I am seeing three options: to describe the case in my forthcoming manuscript with a citation; to alert the journal that published the paper; and to advise the authors’ university’s integrity officer.

I ask my readers, both of you, to tell me in the comments or otherwise, how you think I should proceed in this case and what you think the outcome should be.


Posted in Misconduct, Peer reviewed literature | Tagged | 3 Comments

My presentation to IPA-IAL 2018

I’ve just given a presentation at the joint IPA-IAL conference in Stockholm

Sub-decadal resolution palaeoenvironmental reconstructions from microfossil assemblages

Download it!

The deadline for applying for a reward for finding typos has expired.

Posted in climate | Tagged , , , | Leave a comment

How many is fifty? Sanity checks for assemblage data.

This week I’m at the Palaeolimnology Symposium in Stockholm this week.

I have a couple of presentations. I gave the first this morning to the chironomid “DeadHead” meeting. I showed some sanity checks for assemblage data, some of which are related to Brown and Heathers’ (2018) GRIM test.  It can download it from figshare.

Posted in Data manipulation, EDA | Tagged , | 1 Comment

Warm summers in the Younger Dryas?

The Younger Dryas was a period (12,900–11,600 BP) towards the end of the last glaciation when glaciers re-advanced in Scotland and the tundra plants, including the eponymous Dryas, replaced the Bølling-Allerød forests in Denmark. These changes indicate the Younger Dryas was a cold interval in Europe, an interpretation challenged by a new paper by Schenk et al  in Nature Communications.

Schenk et al base their reinterpretation on climate model output supported by plant macrofossil and pollen which are used as indicator species to infer July temperatures. Their model simulates autumns, winters and springs that were colder during the Younger Dryas than the Bølling-Allerød, and summers that were short but warmer. The paper suggests, based on the model output, various mechanisms that could give summer warming in an otherwise cold period, especially atmospheric blocking preventing winds from the cold ocean blowing far into Europe.

I’m not in a position to evaluate their model, so I’m going to have a closer look at their proxy data and compare it with other proxy data.

Since the replacement of forest by tundra or steppe during the Younger Dryas is well known, it is a little surprising to see palaeobotanical evidence being used to suggest warm Younger Dryas summers. Schenk et al base their temperature reconstruction on “plant species that indicate local presence focusing primarily on specific plant species, such as aquatics and riparian species “, rather than the full assemblage. I cannot find any discussion in the text as to why the aquatics should indicate warmth while the forests decline, but of course summer temperature is not the only factor driving vegetation – drought also plays an important role.

The glacial readvance is probably one of the best pieces of evidence for cool summers in the Younger Dryas. Within Europe, Younger Dryas glacial re-advances have been reported from, for example, the Iberian Peninsula, South Wales, North Wales, Northern England, Northern Ireland, Scotland, Norway, Iceland, the Alps, Poland, Romania, Macedonia. For glaciers to re-advance, summer temperatures need to be low. Alternatively winter precipitation could to be very high, but it is generally thought that Europe was arid in the Younger Dryas because of the extensive sea ice in the North Atlantic (some of the evidence for aridity is dependent on estimates of summer temperature, but some is independent). Glacial re-advances near the Atlantic coast could be consistent with Schenk et al, as the coast might not be protected by atmospheric blocking, but glacial advances in central Europe are not. Schenk et al do not consider the glacial evidence.

The one possible way to reconcile Schenk et al with the glaciological evidence is to argue that the dating of the glacial re-advances to the Younger Dryas is suspect. Bromley et al (2014) make this argument for Scotland, but at other sites the evidence is unambiguous, for example at Kråkenes where the well dated lake core contains sediment derived from a cirque glacier in the Younger Dryas.

Schenk et al consider the numerous chironomid-based reconstructions of summer temperature which are almost unanimous in reconstructing colder conditions in the Younger Dryas than the preceding Bølling-Allerød but dismiss this evidence. They argue that “chironomid assemblages have been shown to incorporate a number of environmental signals (e.g., catchment vegetation, nutrient supply, lake status and depth, seasonality) apart from the ambient summer air temperature. This is correct, but the aquatic indicators used by Schenk et al will be sensitive to many of the same factors.

Specifically, Schenk et al argue that changes in seasonality – cold springs with late-lying snow and short summers affect the chironomid-based reconstructions. Why these seasonality shifts would not likewise affect the aquatic plant indicators is not explained.

The indicator values are calculated as from the temperature at the range limit of each species in Finland. Range limits are inherently uncertain and as the reconstruction is based on the indicator value of the least cold-tolerant species in the assemblage, the reconstructions will also be very uncertain. I do wonder how the range limits would change if a larger geographic area was considered – a question that could easily be tested with GBIF data.

I am not convinced by this paper.





Posted in climate, Peer reviewed literature, transfer function | Tagged , | 1 Comment

Statigraphic diagrams with ggplot

rioja::strat.plot is a great tool for plotting stratigraphic plots in R, but sometimes it is not obvious how to do something I want, perhaps a summary panel showing the percent trees/shrubs/herbs. Of course, I could extend strat.plot, but I do all nearly all my figures with ggplot now and wouldn’t know where to start.

In this post, I’m going to show how to make a stratigraphic plot with ggplot.

First I’m going to download the pollen data for Tsuolbmajavri (Seppä and Hicks, 2006) from Neotoma.



#get dataset list
pollen_sites <- get_dataset(datasettype = "pollen", x = 15733)

pollen_data <- get_download(pollen_sites)
pollen_data <- pollen_data[[1]]

#species groups to plot
wanted <- c("TRSH", "UPHE", "VACR", "SUCC", "PALM", "MANG")

Now I need to extract and combine the pollen count, the chronology and the taxon information from the downloaded neotoma data. I gather the data to make it into a long thin (tidy) table, filter the taxa that are in the ecological groups I want to plot and calculate percentages.

thin_pollen <- bind_cols(
ages(pollen_data) %>% select(depth, age, age.type)
) %>%
gather(key = taxon, value = count, -depth, -age, -age.type) %>%
left_join(pollen_data$taxon.list, c("taxon" = "")) %>%
filter( %in% wanted) %>%
group_by(depth) %>%
mutate(percent = count / sum(count) * 100)

Now I can plot the data after a little further processing: I’m selecting taxa that occur at least three times and with a maximum abundance of at least 3%, and making taxon into a factor so taxa from the same ecological group will plot together.

I use geom_area twice, once to show the 10x the percent, once to show the percent. I then need to use ylim in coord_flip to truncate the axes at the maximum pollen abundance. The plot uses facets to show the taxa separately.

thin_pollen %>%
mutate(taxon = factor(taxon, levels = unique(taxon[order(]))) %>%
group_by(taxon) %>%
filter(max(percent) >= 3, sum(percent > 0) >= 3) %>%
ggplot(aes(x = age, y = percent, fill = +
geom_area(aes(y = percent * 10), alpha = 0.3, show.legend = FALSE) +
geom_area(show.legend = FALSE) +
coord_flip(ylim = c(0, max(thin_pollen$percent))) +
scale_x_reverse(expand = c(0.02, 0)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 2)) +
facet_wrap(~ taxon, nrow = 1) +
labs(x = expression(Age~''^{14}*C~BP), y = "Percent", fill = "Group") +
theme(strip.text.x = element_text(angle = 90, size = 7, hjust = 0),
panel.spacing = unit(x = 1, units = "pt"),
axis.text.x = element_text(size = 7)



I don’t think that is too bad for a first attempt, but I think I’ll stick to rioja::strat.plot most of the time.

Posted in Data manipulation, R, Uncategorized | Tagged , | Leave a comment

How to calculate percent from counts in R

Micropaleontologists and others often want to calculate percent from count data. From looking at archived data, I realise that what should be an easy process goes wrong far more often that it should (which is of course never). Yesterday, I found that a recently published paper had archived percent data that were impossible in several ways. Rather than discussing that paper, which has certain interesting aspects, today I am going to show how to calculate percent from count data in R.

Case one

The counts, and only the counts, are in a species x sites (columns x rows) matrix or data.frame. Here I use the BCI count data from the vegan package.

data(BCI, package = "vegan")
#head(BCI) see first few rows

To calculate percent, we need to divide the counts by the count sums for each sample, and then multiply by 100.

BCI_percent <- BCI / rowSums(BCI) * 100

This can also be done using the function decostand from the vegan package with method = "total".

Case two

The counts are in a species x sites matrix or data.frame along with other data or meta data. Here, I add the row ID from BCI to represent sample names.

BCI_plus <- BCI %>% rowid_to_column(var = "siteID")

There are a few different solutions to this. One solution is to select the columns with the count data, and process these as above. This is usually fine because we need the pure species x sites data.frame for ordinations etc

BCI_species <- BCI_plus %>%
select(-siteID) # remove siteID column

BCI_percent <- BCI_species / rowSums(BCI_species) * 100

Alternatively, if we want to keep all the data in one object, we can gather the data into a thin (tidy) format, and calculate percents on that.


BCI_percent2 <- BCI_plus %>%
gather(key = taxon, value = count, -siteID) %>% #make into thin format
group_by(siteID) %>% #do calculations by siteID
mutate(percent = count / sum(count) * 100)

head(BCI_percent2 %>% filter(count > 0))# show some data
## # A tibble: 6 x 4
## # Groups: siteID [5]
## siteID taxon count percent
## 1 10 Abarema.macradenia 1 0.2070393
## 2 28 Vachellia.melanoceras 2 0.5167959
## 3 32 Vachellia.melanoceras 1 0.2178649
## 4 34 Acalypha.diversifolia 1 0.2237136
## 5 40 Acalypha.diversifolia 1 0.2044990
## 6 28 Acalypha.macrostachya 1 0.2583979

We can convert this thin object back to a fat format with spread (the opposite of gather). This could be appended directly onto the previous code.

BCI_percent3 <- BCI_percent2 %>%
select(-count) %>%
spread(key = taxon, value = percent)

Case three

So far, this has assumed that all the taxa are part of a single count sum as will typically be the case for diatoms, chironomids, or planktic foraminifera. With pollen, however, there can be complexities, for example, we might want to have trees, shrubs and upland herbs to be in the terrestrial pollen sum (T + S + H), and aquatic macrophytes part of the total pollen sum (T + S + H + A).

The BCI tree data obviously doesn’t have any aquatic macrophytes, but I’m going to treat all species beginning with ‘A’ as aquatic, ‘T’ as tree, ‘S’ as shrub and other taxa as herbs. First we need a dictionary, which you would normally have already.

dictionary <- data_frame(taxon = names(BCI), group = substring(taxon, 1, 1)) %>%
mutate(group = if_else(group %in% c("A", "S", "T"), group, "H"),
sum = case_when(
group == "A" ~ "Aquatic",
group %in% c("T", "S", "H") ~ "Terrestrial"))

One solution is to separate the different groups into separate data.frames, calculate percent, and then bond the data.frames back together.

aquatics <- BCI %>%
select(one_of(dictionary %>%
filter(group == "A") %>%

upland <- BCI %>%
select(one_of(dictionary %>%
filter(group %in% c("S", "T", "H")) %>%

BCI_percent4 <- bind_cols(
upland / rowSums(upland),
aquatics / rowSums(BCI)) * 100

And of course there is a tidyverse solution (there might even be a good tidyverse solution).

BCI_percent5 <- BCI %>%
rowid_to_column(var = "siteID") %>% #need identifier
gather(key = taxon, value = count, -siteID) %>%
left_join(dictionary) %>%
group_by(siteID) %>%
mutate(total_count_sum = sum(count)) %>%
group_by(siteID, sum) %>%
count_sum = sum(count),
relevant_count_sum = if_else(sum == "Aquatic", total_count_sum, count_sum),
percent = count/relevant_count_sum * 100)

From this object, the percent, taxon and siteID can be selected and then spread. The tidyverse solution is complicated because the count sums overlap.


Posted in Data manipulation, R | 1 Comment