## Statigraphic diagrams with ggplot

`rioja::strat.plot` is a great tool for plotting stratigraphic plots in R, but sometimes it is not obvious how to do something I want, perhaps a summary panel showing the percent trees/shrubs/herbs. Of course, I could extend `strat.plot`, but I do all nearly all my figures with `ggplot` now and wouldn’t know where to start.

In this post, I’m going to show how to make a stratigraphic plot with `ggplot`.

First I’m going to download the pollen data for Tsuolbmajavri (Seppä and Hicks, 2006) from Neotoma.

```library("neotoma")
library("tidyverse")

#get dataset list
pollen_sites <- get_dataset(datasettype = "pollen", x = 15733)

pollen_data <- pollen_data[[1]]

#species groups to plot
wanted <- c("TRSH", "UPHE", "VACR", "SUCC", "PALM", "MANG")
```

Now I need to extract and combine the pollen count, the chronology and the taxon information from the downloaded neotoma data. I `gather` the data to make it into a long thin (tidy) table, filter the taxa that are in the ecological groups I want to plot and calculate percentages.

```thin_pollen <- bind_cols(
counts(pollen_data),
ages(pollen_data) %>% select(depth, age, age.type)
) %>%
gather(key = taxon, value = count, -depth, -age, -age.type) %>%
left_join(pollen_data\$taxon.list, c("taxon" = "taxon.name")) %>%
filter(ecological.group %in% wanted) %>%
group_by(depth) %>%
mutate(percent = count / sum(count) * 100)
```

Now I can plot the data after a little further processing: I’m selecting taxa that occur at least three times and with a maximum abundance of at least 3%, and making taxon into a factor so taxa from the same ecological group will plot together.

I use `geom_area` twice, once to show the 10x the percent, once to show the percent. I then need to use `ylim` in `coord_flip` to truncate the axes at the maximum pollen abundance. The plot uses facets to show the taxa separately.

```thin_pollen %>%
mutate(taxon = factor(taxon, levels = unique(taxon[order(ecological.group)]))) %>%
group_by(taxon) %>%
filter(max(percent) >= 3, sum(percent > 0) >= 3) %>%
ggplot(aes(x = age, y = percent, fill = ecological.group)) +
geom_area(aes(y = percent * 10), alpha = 0.3, show.legend = FALSE) +
geom_area(show.legend = FALSE) +
coord_flip(ylim = c(0, max(thin_pollen\$percent))) +
scale_x_reverse(expand = c(0.02, 0)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 2)) +
facet_wrap(~ taxon, nrow = 1) +
labs(x = expression(Age~''^{14}*C~BP), y = "Percent", fill = "Group") +
theme(strip.text.x = element_text(angle = 90, size = 7, hjust = 0),
panel.spacing = unit(x = 1, units = "pt"),
axis.text.x = element_text(size = 7)
)
```

I don’t think that is too bad for a first attempt, but I think I’ll stick to `rioja::strat.plot` most of the time.

## How to calculate percent from counts in R

Micropaleontologists and others often want to calculate percent from count data. From looking at archived data, I realise that what should be an easy process goes wrong far more often that it should (which is of course never). Yesterday, I found that a recently published paper had archived percent data that were impossible in several ways. Rather than discussing that paper, which has certain interesting aspects, today I am going to show how to calculate percent from count data in R.

### Case one

The counts, and only the counts, are in a species x sites (columns x rows) matrix or data.frame. Here I use the BCI count data from the vegan package.

```data(BCI, package = "vegan")
```

To calculate percent, we need to divide the counts by the count sums for each sample, and then multiply by 100.

```BCI_percent <- BCI / rowSums(BCI) * 100
```

This can also be done using the function `decostand` from the vegan package with `method = "total"`.

### Case two

The counts are in a species x sites matrix or data.frame along with other data or meta data. Here, I add the row ID from BCI to represent sample names.

```library("tidyverse")
BCI_plus <- BCI %>% rowid_to_column(var = "siteID")
```

There are a few different solutions to this. One solution is to select the columns with the count data, and process these as above. This is usually fine because we need the pure species x sites data.frame for ordinations etc

```BCI_species <- BCI_plus %>%
select(-siteID) # remove siteID column

BCI_percent <- BCI_species / rowSums(BCI_species) * 100
```

Alternatively, if we want to keep all the data in one object, we can gather the data into a thin (tidy) format, and calculate percents on that.

```BCI_percent2 <- BCI_plus %>%
gather(key = taxon, value = count, -siteID) %>% #make into thin format
group_by(siteID) %>% #do calculations by siteID
mutate(percent = count / sum(count) * 100)

head(BCI_percent2 %>% filter(count > 0))# show some data
```
```## # A tibble: 6 x 4
## # Groups: siteID [5]
## siteID taxon count percent
##
## 1 10 Abarema.macradenia 1 0.2070393
## 2 28 Vachellia.melanoceras 2 0.5167959
## 3 32 Vachellia.melanoceras 1 0.2178649
## 4 34 Acalypha.diversifolia 1 0.2237136
## 5 40 Acalypha.diversifolia 1 0.2044990
## 6 28 Acalypha.macrostachya 1 0.2583979
```

We can convert this thin object back to a fat format with `spread` (the opposite of `gather`). This could be appended directly onto the previous code.

```BCI_percent3 <- BCI_percent2 %>%
select(-count) %>%
spread(key = taxon, value = percent)
```

### Case three

So far, this has assumed that all the taxa are part of a single count sum as will typically be the case for diatoms, chironomids, or planktic foraminifera. With pollen, however, there can be complexities, for example, we might want to have trees, shrubs and upland herbs to be in the terrestrial pollen sum (T + S + H), and aquatic macrophytes part of the total pollen sum (T + S + H + A).

The BCI tree data obviously doesn’t have any aquatic macrophytes, but I’m going to treat all species beginning with ‘A’ as aquatic, ‘T’ as tree, ‘S’ as shrub and other taxa as herbs. First we need a dictionary, which you would normally have already.

```dictionary <- data_frame(taxon = names(BCI), group = substring(taxon, 1, 1)) %>%
mutate(group = if_else(group %in% c("A", "S", "T"), group, "H"),
sum = case_when(
group == "A" ~ "Aquatic",
group %in% c("T", "S", "H") ~ "Terrestrial"))
```

One solution is to separate the different groups into separate data.frames, calculate percent, and then bond the data.frames back together.

```aquatics <- BCI %>%
select(one_of(dictionary %>%
filter(group == "A") %>%
pull(taxon)
))

upland <- BCI %>%
select(one_of(dictionary %>%
filter(group %in% c("S", "T", "H")) %>%
pull(taxon)
))

BCI_percent4 <- bind_cols(
upland / rowSums(upland),
aquatics / rowSums(BCI)) * 100
```

And of course there is a tidyverse solution (there might even be a good tidyverse solution).

```BCI_percent5 <- BCI %>%
rowid_to_column(var = "siteID") %>% #need identifier
gather(key = taxon, value = count, -siteID) %>%
left_join(dictionary) %>%
group_by(siteID) %>%
mutate(total_count_sum = sum(count)) %>%
group_by(siteID, sum) %>%
mutate(
count_sum = sum(count),
relevant_count_sum = if_else(sum == "Aquatic", total_count_sum, count_sum),
percent = count/relevant_count_sum * 100)
```

From this object, the percent, taxon and siteID can be selected and then spread. The tidyverse solution is complicated because the count sums overlap.

Posted in Data manipulation, R | 1 Comment

I’m putting together my review of sub-decadal resolution reconstructions from microfossil assemblages and thought I ought to check if I have missed any.

The criteria are:

• Palaeoenvironmental reconstruction from microfossil assemblages
• Reconstruction validated against instrumental data (ideally with the correlation calculated, or at least possible to calculate)

All three criteria need to be met. All microfossils (diatoms, pollen, forams etc) are within scope, as are reconstructions of any environmental variable.

So far I have:

Diatoms

• Lotter (1998) Total phosphorus in Baldegersee
• Alefs and Müller (1999) Total phosphorus in Ammersee and Starnberger See

Chrysophytes

• De Jong and Kamenik (2011) Date of spring mixing in Lake Silvaplana
• Hernández-Almeida et al. (2015) Calcium concentrations/wind speed in Lake Żabińskie
• Hernández-Almeida et al. (2015) Number consecutive days with water < 4°C in Lake Żabińskie

Chironomids

• Larocque and Hall (2003) July air-temperature in four lakes near Abisko
• Larocque et al. (2009), Larocque-Tobler et al. (2011) July air-temperature from Lake Silvaplana
• Larocque-Tobler et al. (2011) July air-temperature in Seebergsee
• Larocque-Tobler et al. (2015) August air-temperature in Lake Żabińskie
• Luoto and Ojala (2016) July air-temperature in Nurmijärvi
• Zhang et al. (2017) July air-temperature in Tiancai Lake
• Lang et al. (2017) July air-temperature in Speke Hall Lake

Are there any more?

I also want data to test the reproducibility of the reconstructions. I have the modern and fossil chironomid data from Lake Żabińskie, and fossil chronomid data from Tiancai, Seebergsee, and Silvaplana. I would be very interested in any of the other data.

## Package version control

One aspect of writing a manuscript that I find tedious is checking the versions of the R packages I used. So I wrote a small function to do this automatically.

```
get_version = function(package){
require(tidyverse)
require(magrittr)
installed.packages() %>%
as_data_frame() %>%
filter(Package == package) %\$%
paste(Package, "version", Version)
}

```

(I don’t use ​`=` as an assignment operator in my code, but the <- operator together with pipes tends to upset wordpress and it leaves out a chunk of code.)

This function will convert `get_version("vegan")` in my  rmarkdown file  to “vegan version 2.4-6” in the pdf. Problem solved (except that I still need to check the citation is up to date).

I can also grab the R version with `gsub("(.*) \\(.*", "\\1", R.version.string)`, which gives “R version 3.4.4”.

## Finding singletons in real data

I have argued that the rarest species in an assemblage should typically be represented by a single individual, and that this can be used to estimate the count sum when this is not reported.

Now I want to test this with some real data. Starting with tree-count data from Barro Colorado Island which is available from within the `vegan` package. How many species in each of the 50 samples are represented by a single individual?

```library("tidyverse")
data("BCI", package = "vegan")

data_frame(n = rowSums(BCI == 1)) %>%
ggplot(aes(x = n)) +
geom_bar(width = 1) +
labs(x = "Number of singletons") +
xlim(21, NA)
```

Number of singletons per sample in the BCI data-set

All of the BCI samples have many singletons: the probability of finding a sample without any singletons is obviously very small. But perhaps some attributes of the relatively large (median count is 428 trees) species rich (median richness is 91 species) samples are not relevant to chironomid or diatom samples.

What happens when the count sum is much smaller? Here, I resample with replacement each BCI sample to a count of fifty trees. This is done one thousand times.

```BCI_50 <- BCI %>%
rowwise() %>%
do(
as_data_frame(
t(rmultinom(n = 1000, size = 50, prob = .))
)
)

n_50 <- data_frame(n = rowSums(BCI_50 == 1)) n_50 %>%
ggplot(aes(x = n)) +
geom_bar(width = 1) +
labs(x = "Number of singletons")
```

Number of singletons per sample in the BCI data-set when the count sum is reduced to 50 trees.

The minimumum number of singletons in the resampled data is now 3, and 99.97% of samples have more that 5 singletons. Even with a count of only fifty, samples without singletons are going to be rare.

How about if the taxonomic resolution is reduced by reanalysing the data by genus rather than species? That turns out not to change things much as many genera have only one species in each sample. The richness can be greatly reduced by grouping species according to the first letter of the genus name.

```BCI_50_genera <- BCI_50 %>%
rowid_to_column("sample") %>%
gather(key = species, value = count, -sample) %>%
# mutate(genera = gsub("^(\\w+).*", "\\1", species)) %>%
mutate(genera = gsub("^(\\w{1}).*", "\\1", species)) %>%
group_by(sample, genera) %>%
summarise(count = sum(count))

n_50_genera <- BCI_50_genera %>%
summarise(n = sum(count == 1))

n_50_genera %>%
ggplot(aes(x = n)) +
geom_bar(width = 1) +
labs(x = "Number of singletons")
```

Number of singletons per sample in the BCI data-set when the count sum is reduced to 50 trees and the taxonomic resolution is greatly reduced.

The mean richness is now 15.2 taxa.
99.53% of samples have at least one singleton amongst the 50 trees. Samples without singletons are still rare.

Time to move over to some chironomid data. Professor Ian Walker archived chironomid stratigraphies from nine Canadian lakes. The data are in tilia files (not recommended), fortunately the `rioja` package has a function to read tilia files. Unfortunately, some characters are not being read correctly and are crashing `read.Tilia` so I am calling the underlying C code directly and removing taxa after ‘unidentifiable’, which removes all the other diptera and sums that I don’t need, as well as all the problem characters.

```library("rioja")
obj <- .Call("ReadTiliaFile", fname, PACKAGE = "rioja") obj\$Data %>%
set_names(make.names(obj\$FullNames, unique = TRUE)) %>%
as_data_frame() %>%
rowid_to_column(var = "sample") %>%
select(1:grep("^UNIDENTIFIABLE\$", names(.), ignore.case = TRUE))
}

tilia_files <- dir(pattern = "\\.til\$", recursive = TRUE, full.names = TRUE, ignore.case = TRUE)

tilia_chironomids <- tilia_files %>%
set_names(gsub(pattern = ".*/(\\w*)\\.til\$", replacement = "\\1", x = ., ignore.case = TRUE)) %>%

min_counts <- tilia_chironomids %>%
filter(lake != "brier") %>% #percent data
gather(key = species, value = count, -lake, -sample) %>%
mutate(count = round(count * 2) / 2 ) %>% # Wood Lake counts back-transformed from % data
filter(count > 0) %>%
group_by(lake, sample) %>%
summarise(min_count = min(count), sum_count = sum(count), n_sing = sum(count <= 1), nsp = n())
```

The chironomids are mainly identified to genus level, some higher, some lower. Of the 158 (median count sum = 86.25, median richness = 18), 98.1% have singletons (whole or half). The median number of singletons is 6.

Of the three samples without singletons, one has a count sum of only 4.5. The other two have sensible looking counts: not all the counts are multiples of the lowest count, which would help recognise this case.

I am confident that assemblages will generally have one or more singletons, so they can be used to estimate the count sum when this is not given. This analysis shows that samples with low diversity (or equivalently low taxonomic resolution), especially with large counts, or very small counts sums might not. Analyst-related issues that increase the risk of singletons being absent include counting only a limited range of taxa (as is sometimes done in pollen analysis) and archiving the data after rare taxa have been deleted (please archive the full data).

## A paper with insufficient and misleading information

Sometimes papers reporting transfer functions don’t give enough information for the reader to properly evaluate the model. Take for example Massaferro and Larocque-Tobler (2013) who report a chironomid-based mean annual air temperature transfer function for Patagonia and its application to a Holocene-length chironomid stratigraphy from Lake Potrok Aike.

The paper reports that the three-component WAPLS model has an RMSE (root mean squared error) of 0.83°C and an r2 of 0.64. I assumed that this was a typo and that the authors were actually reporting the RMSEP (root mean squared error of prediction), as it has been standard practice for decades to present the model’s cross-validated performance.

On re-reading the paper after seeing it cited, I began to doubt my previous assumptions: perhaps the authors really were reporting the non-crossvalidated model performance statistics, which can be unrealistic estimates of model performance. My doubts focused on the large number of components used (a three-component WAPLS model is usually (perhaps always) an overfit) and the weak relationship between chironomid assemblage composition and temperature in the calibration set shown in the following figure.

Calibration set chironomid assemblages arranged by temperature. Few if any taxa show a strong relationship with temperature

A search of my sent-mail folder reminds me that I advised the authors of my concerns about this exactly four years ago today. They did not respond. I now happen to have the data from this paper and decided to have a quick look.

Using the species exclusion rules described in the paper and square-root transformation of the species data, I can get non-crossvalidated performance statistics that are similar to those reported (RMSE = 0.83°C, r2 = 0.65).

But what of the crossvalidated performance, which is a far better (though still imperfect) guide to the actual model performance? With leave-one-out crossvalidation, the performance is much worse (RMSEP = 1.39°C, r2 = 0.15). The RMSEP is almost equal to the standard deviation of the environmental variable (sd = 1.4°C) and the r2 is probably the lowest I have seen for any published transfer function. This transfer function is not useful. The authors’ claim that the performance is “comparable to other transfer functions of this size” is a false statement.

Despite the transfer function’s lack of skill, the authors use it to reconstruct temperatures from the Lake Potrok Aike chironomid stratigraphy. Most chironomid workers report the minimum number of chironomid head capsules that they count in each sample. Typically this is 50 (or perhaps 30) head capsules. The minimum chironomid count sum is not reported in this paper: I suspect most readers would assume it was about 50. It isn’t. The count sum is reported in the fossil data: the median count sum is 17. Only the fossil samples with a count of two or fewer chironomids seem to have been omitted from the published fossil stratigraphy. To not report that count sums are so low, far below the typical minimum accepted, strikes me as disingenuous.

The fossil data consist of only four taxa [note to my copy editor – this is not a typo]. Of these Phaenopsectra is most abundant, with a median abundance of 50%, much higher than the maximum abundance of this taxon in the calibration set (16.4%). This naturally raises concerns that the reconstruction diagnostics will be poor.

Perhaps the most widely used transfer function diagnostic is analogue quality. With the usual rule-of-thumb that distances greater than the 10th percentile of taxonomic distances in the calibration set represent non-analogue assemblages, 81% of samples lack analogues. This is not good.

The paper does not report analogue quality, but it does report that “all samples of the Potrok Aike core added passively to CCA of the training set samples were within the 95% confidence interval.” This is presumably the residual length diagnostic. When I calculate this statistic, I find that 100% of fossil samples are outside the 95th percentile of residual lengths. This is not good.

This paper has been fairly well cited, but appears to be fatally flawed: the transfer function as ~no skill, the fossil counts are small and lack analogues in the calibration set. I would recommend that anyone who plans to cite it assure themselves that the conclusions of the paper adequately reflect the results.

## Please archive assemblage data as counts not percent

A large fraction of the microfossil assemblage that has been archived on-line by palaeoecologists is percent data, often without the count sums, rather than the raw count data. This is unfortunate as some analyses need count data. Calculating percent from count data is trivial, but the reverse operation is in general not possible if the count sums are not available.

Fortunately, one aspect of assemblage data makes it possible to estimate the count sums: the long tail of most rank-abundance distributions means that the rarest taxon in any sample is probably present with a single individual. The lowest percent abundance, $P_{min}$, is therefore $1/count * 100$ and the count sum can be estimated as $100/P_{min}$.

Archived data are often rounded off which limits the precision of this calculation, but it is possible to estimate the range of possible count sums using the maximum and minimum value of $P_{min}$ that would yield the reported $P_{min}$ after rounding. Obviously, the uncertainty increases when fewer decimal places are reported, and with larger count sums.

I’ve written an R function to estimate count sums in the R package `countChecker` which is available on Github.

```#devtools::install_github("richardjtelford/count_checker")
library("countChecker")
library("tidyverse")
```

Lets see it in action with the percent data from Last Chance Lake (Axford et al 2017), chosen because I like the name. More importantly, it was the first percent assemblage data I found on NOAA were the count sums were reported, so it is possible to check the code is working.

```#Import data
data("last_chance")

#examine data to determine precision
```
``````## Cric.pulTy Cric.Orthund Cric.sylTy Cric.treTy
## 1 0.00 3.53 0.00 0.00
## 2 0.85 6.78 0.00 1.69
## 3 0.00 1.92 0.00 0.00
## 4 0.25 0.98 0.49 1.47
## 5 0.00 2.26 0.75 2.26
## 6 0.00 0.00 0.00 4.57``````
```#Isolate species data
last_chance = select(last_chance0, -age_calBP, -totcaps)

#Estimate n and add reported n
last_chance_est = estimate_n(spp = last_chance, digits = 2) %>%
mutate(n = last_chance0\$totcaps)

#plot
last_chance_est %>%
ggplot(aes(x = n, y = est_n, ymax = est_max, ymin = est_min)) +
geom_pointrange() +
geom_abline(
aes(slope = slope, intercept = 0, colour = as.factor(slope)),
data = data_frame(slope = 1:2),
show.legend = FALSE) +
labs(x = "Count sum", y = "Estimated count sum")
```

Estimated vs reported count sums for Last Chance Lake. The lower line is the 1:1 relationship. The upper line is the 2:1 relationship expected when the rarest taxon is represented by a half chironomid.

This reveals both the power of this method, and one of a few problems. A few samples have estimated counts that match the reported counts, but most have estimated counts twice the reported counts. The problem is that chironomid workers often count half head capsules: in many samples the lowest abundance is a half chironomid. Pollen and occasionally diatoms are also sometimes counted with halves – integer counts are easier to analyse.

In other datasets, there might be some samples with the rarest taxon represented by two (or more) microfossils. This is probably only a serious risk in very low diversity systems (hello Neogloboquadrina pachyderma sinistral), or where only a few common taxa are identified rather than all taxa.

Diatoms present an interesting test case as the two valves which comprise the frustule are counted separately even though they are often found paired (or in longer chains), raising the risk that the rarest taxon has two valves. If the rarest taxon has two valves, it is unlikely that the other taxa all have multiples of two valves, which should help detect this problem which would otherwise make the count appear to be half its true size.

The first diatom assemblage dataset I found on pangaea.de appeared to have a few samples with count sums much lower than reported, even allowing for the possibility that the rarest taxon is not a singleton. This is not the first time that I’ve noticed that some authors appear to report how many microfossils they planned to count rather than the number they actually counted. This is difficult to detect with percent data but `countChecker` might reveal it. I plan, at some stage, to trawl through various archived assemblage data to test how common this problem is.

The `countChecker` package also tests whether the percent data are congruent using a method very similar to the GRIM test which has been used to great effect on psycology data. The principle is this: if the count is of $n$ microfossils, then, discounting the tendency of some analysts to count fractional microfossils, only integer multiples of $100/n$ are possible. The `percent_checker` function tests for such impossible values after allowing for limited precision in both the percent data and the count sum. I’ll explore this method more later.

Pollen data archived in Neotoma and the European Pollen Database are nearly all count data. Please can other microfossil assemblage data also be archived as counts not percent.

Posted in R | Tagged | 1 Comment