A large fraction of the microfossil assemblage data archived online by palaeoecologists are percent data, often without the count sums, rather than raw counts. This is unfortunate, as some analyses need count data. Calculating percent from count data is trivial, but the reverse operation is in general not possible if the count sums are not available.

Fortunately, one aspect of assemblage data makes it possible to estimate the count sums: the long tail of most rank-abundance distributions means that the rarest taxon in any sample is probably present with a single individual. The lowest percent abundance, $P_{min}$, is therefore $1/n \times 100$, where $n$ is the count sum, and the count sum can be estimated as $100/P_{min}$.
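As a worked example with made-up numbers, write $P_{min}$ for the smallest non-zero percentage in a sample and $n$ for its count sum. Assuming the rarest taxon really is a single individual,

$$P_{min} = \frac{1}{n} \times 100 \quad\Longrightarrow\quad n = \frac{100}{P_{min}}$$

so a sample whose lowest reported abundance is 0.5% would have an estimated count sum of $100 / 0.5 = 200$.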

Archived data are often rounded, which limits the precision of this calculation, but it is possible to estimate the range of possible count sums using the maximum and minimum values of the lowest percent abundance, $P_{min}$, that would yield the reported value after rounding. Obviously, the uncertainty increases when fewer decimal places are reported, and with larger count sums.
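The rounding argument can be sketched in a few lines of base R. This is a minimal illustration, not the code in `countChecker`: the function name `estimate_count_range` and the example percentages are assumptions made up for this sketch, and it assumes the rarest taxon is a singleton and that values were rounded to the nearest `digits` decimal places.

```r
# Hypothetical helper for illustration; not the countChecker implementation.
estimate_count_range <- function(percent, digits) {
  p_min <- min(percent[percent > 0])  # smallest non-zero percentage
  half_step <- 0.5 * 10^(-digits)     # largest error that rounding can introduce
  c(
    est_min = 100 / (p_min + half_step),  # largest possible P_min gives the smallest count sum
    est_n   = 100 / p_min,
    est_max = 100 / (p_min - half_step)   # smallest possible P_min gives the largest count sum
  )
}

# Illustrative percentages reported to two decimal places
sample_percent <- c(0.85, 6.78, 0.00, 1.69)
est <- estimate_count_range(sample_percent, digits = 2)
round(est, 1)
```

For these values the estimate is about 117.6, with a range of roughly 117.0 to 118.3, so the count sum would be 117 or 118. With only one decimal place reported, `half_step` is ten times larger and the range widens accordingly.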

I’ve written an R function to estimate count sums in the R package `countChecker`, which is available on GitHub.

```r
# devtools::install_github("richardjtelford/count_checker")
library("countChecker")
library("tidyverse")
```

Let’s see it in action with the percent data from Last Chance Lake (Axford et al. 2017), chosen because I like the name. More importantly, it was the first percent assemblage dataset I found on NOAA where the count sums were reported, so it is possible to check that the code is working.

```r
# Import data
data("last_chance")

# Examine data to determine precision
head(last_chance0[, 4:7]) # two digits after decimal
```

```
## Cric.pulTy Cric.Orthund Cric.sylTy Cric.treTy
## 1 0.00 3.53 0.00 0.00
## 2 0.85 6.78 0.00 1.69
## 3 0.00 1.92 0.00 0.00
## 4 0.25 0.98 0.49 1.47
## 5 0.00 2.26 0.75 2.26
## 6 0.00 0.00 0.00 4.57
```

```r
# Isolate species data
last_chance = select(last_chance0, -age_calBP, -totcaps)

# Estimate n and add reported n
last_chance_est = estimate_n(spp = last_chance, digits = 2) %>%
  mutate(n = last_chance0$totcaps)

# Plot estimated against reported count sums, with 1:1 and 2:1 lines
last_chance_est %>%
  ggplot(aes(x = n, y = est_n, ymax = est_max, ymin = est_min)) +
  geom_pointrange() +
  geom_abline(
    aes(slope = slope, intercept = 0, colour = as.factor(slope)),
    data = tibble(slope = 1:2), # tibble() replaces the deprecated data_frame()
    show.legend = FALSE
  ) +
  labs(x = "Count sum", y = "Estimated count sum")
```

This reveals both the power of this method, and one of a few problems. A few samples have estimated counts that match the reported counts, but most have estimated counts twice the reported counts. The problem is that chironomid workers often count half head capsules: in many samples the lowest abundance is a half chironomid. Pollen and occasionally diatoms are also sometimes counted with halves – integer counts are easier to analyse.
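The factor of two follows directly from the arithmetic. If the rarest taxon is represented by half a head capsule in a count of $n$, then

$$P_{min} = \frac{0.5}{n} \times 100 = \frac{50}{n} \quad\Longrightarrow\quad \frac{100}{P_{min}} = 2n$$

The four values printed for the fourth sample above are consistent with this, although this is only arithmetic on those four numbers and not a check against the reported count sum. The smallest value, 0.25%, gives an estimate of $100 / 0.25 = 400$, yet 0.49, 0.98 and 1.47 are what one, two and three head capsules out of about 204 would give ($100/204 \approx 0.49$), and half a head capsule out of 204 is $50/204 \approx 0.245$, which rounds to 0.25.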

In other datasets, there might be some samples with the rarest taxon represented by two (or more) microfossils. This is probably only a serious risk in very low diversity systems (hello *Neogloboquadrina pachyderma* sinistral), or where only a few common taxa are identified rather than all taxa.

Diatoms present an interesting test case, as the two valves that comprise the frustule are counted separately even though they are often found paired (or in longer chains), raising the risk that the rarest taxon has two valves. That would make the count appear to be half its true size, but if the rarest taxon has two valves it is unlikely that all the other taxa also have multiples of two valves, which should help detect the problem.

The first diatom assemblage dataset I found on pangaea.de appeared to have a few samples with count sums much lower than reported, even allowing for the possibility that the rarest taxon is not a singleton. This is not the first time that I’ve noticed that some authors appear to report how many microfossils they planned to count rather than the number they actually counted. This is difficult to detect with percent data but `countChecker` might reveal it. I plan, at some stage, to trawl through various archived assemblage data to test how common this problem is.

The `countChecker` package also tests whether the percent data are congruent, using a method very similar to the GRIM test, which has been used to great effect on psychology data. The principle is this: if the count is of $n$ microfossils, then, discounting the tendency of some analysts to count fractional microfossils, only integer multiples of $100/n$ percent are possible. The `percent_checker` function tests for such impossible values after allowing for limited precision in both the percent data and the count sum. I’ll explore this method more later.
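The principle can be sketched as follows. This is not how `percent_checker` is implemented: `is_possible_percent` is a hypothetical helper written for this illustration, the numbers are made up, and the sketch is simpler than the real function in that it treats the count sum as known exactly and only allows for rounding of the percentages.

```r
# Hypothetical GRIM-style check; not the percent_checker() implementation.
# A reported percentage is possible if some whole number of microfossils out
# of n gives a value that rounds to it at the stated number of decimal places.
is_possible_percent <- function(percent, n, digits) {
  half_step <- 0.5 * 10^(-digits)
  lowest  <- (percent - half_step) * n / 100  # fewest microfossils consistent with the value
  highest <- (percent + half_step) * n / 100  # most microfossils consistent with the value
  ceiling(lowest) <= floor(highest)           # TRUE if an integer lies in the interval
}

# Illustrative values: three percentages that fit a count of 118 and one that does not
possible <- is_possible_percent(c(0.85, 6.78, 1.69, 1.00), n = 118, digits = 2)
possible
```

The first three values correspond to 1, 8 and 2 microfossils out of 118, so they pass. The last, 1.00%, would need about 1.18 microfossils and is flagged as impossible.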

Pollen data archived in Neotoma and the European Pollen Database are nearly all count data. Please can other microfossil assemblage data also be archived as counts, not percent?
