Resampling Assemblage Counts

Imagine that one diatom assemblage has 20 species and another has 25, but the more species rich assemblage also has more diatoms counted.
A fair comparison of the richness of each assemblage can only be made when the number of individuals is constant.

In this case, we can use the function rarefy from the vegan package. Other analyses might not be possible to run directly, so you need re-sampled counts with a specified count sum. I’ve been doing this recently to find the how the prevalence of singletons (species only occurring once) varies with count sum.

There are several ways to resample counts. A key question is whether you want to sample with or without replacement.

Sampling without replacement asks what the assemblage you currently have might have been when the count sum was smaller. If you sample as many individuals as you actually have, your resampled assemblage is identical to your current assemblage and you cannot sample more individuals than you have.

Sampling with replacement asks what the assemblage you currently have might look like if you counted another assemblage from the same material.
In principle, you can simulate larger assemblages than you currently have, though this is probably always a bad idea. Species richness will be biased low, as rare taxa present in the community behind the veil line cannot occur in the resampled assemblages but could occur if another diatom slide was counted.

I’m going to resample the vegan::BCI dataset of tree counts in plots on Barro Colorado Island.

data(BCI, package = "vegan")

First a tidyverse solution.

BCI %>% 
  # put the rownames into a column called "plot"
  rownames_to_column(var = "plot") %>% 
  # convert the data from a wide to a long format
  pivot_longer(-plot, names_to = "species", values_to = "n") %>% 
  # uncount does the opposite of count - giving 2 rows for a species that had a count of 2.
  uncount(n) %>% 
  group_by(plot) %>%
  # randomly sample, with replacement, 100 individuals (rows) per plot
  slice_sample(n = 100, replace = TRUE) %>% 
  # count the number of individuals per species per plot
  count(species)
## # A tibble: 2,128 × 3
## # Groups:   plot [50]
##    plot  species                     n
##    <chr> <chr>                   <int>
##  1 1     Alseis.blackiana            6
##  2 1     Annona.spraguei             1
##  3 1     Apeiba.glabra               2
##  4 1     Astronium.graveolens        3
##  5 1     Brosimum.alicastrum         1
##  6 1     Casearia.sylvestris         2
##  7 1     Cassipourea.guianensis      1
##  8 1     Cecropia.insignis           1
##  9 1     Chrysophyllum.argenteum     1
## 10 1     Cordia.bicolor              3
## # … with 2,118 more rows

The tidyverse solution might not be the fastest (see below to find out) as it is doing a lot of extra work. So let’s write a function that has the same essential steps of uncounting the counts, sampling them, and the recounting them.

resample_1 <- function(counts, size, replace = FALSE){
  #counts is species count data to resample,
  #size is target count sum
  #replace or not

  #grab species names (if present)
  id <- names(counts)
  if(is.null(id)){
    id <- 1:length(count)
  }
  #uncount
  count2 <- rep(id, times = counts)
  # sample
  samp <- sample(count2, size = size, replace = replace)
  # count again
  tab <- table(samp)
  # fill missing taxa with zero
  result <- rep(0, length(counts))
  names(result) <- id
  result[names(tab)] <- tab
  #return result - a named vector
  result
}

#first BCI plot
bci1 <- unlist(BCI[1, ])

resample_1(count = bci1, size = 100, replace = TRUE)[1:20]
##       Abarema.macradenia    Vachellia.melanoceras    Acalypha.diversifolia 
##                        0                        0                        0 
##    Acalypha.macrostachya           Adelia.triloba     Aegiphila.panamensis 
##                        0                        0                        0 
##  Alchornea.costaricensis      Alchornea.latifolia         Alibertia.edulis 
##                        0                        0                        0 
##  Allophylus.psilospermus         Alseis.blackiana        Amaioua.corymbosa 
##                        0                        9                        0 
##      Anacardium.excelsum           Andira.inermis          Annona.spraguei 
##                        0                        0                        0 
##            Apeiba.glabra         Apeiba.tibourbou  Aspidosperma.desmanthum 
##                        1                        0                        0 
## Astrocaryum.standleyanum     Astronium.graveolens 
##                        0                        1

There are also R functions that do the same jobs.

Resampling with replacement gives a multinomial distribution; for which we can also use stats::rmultinom. This takes the argument prob for the species data. The data are standardised to sum to one, so counts, percent or proportions could be used. n is the number of resamples to make. The return object is a matrix with n columns and one species per row.

rmultinom(n = 1, size = 100, prob =  bci1)[1:20, ]
##       Abarema.macradenia    Vachellia.melanoceras    Acalypha.diversifolia 
##                        0                        0                        0 
##    Acalypha.macrostachya           Adelia.triloba     Aegiphila.panamensis 
##                        0                        0                        0 
##  Alchornea.costaricensis      Alchornea.latifolia         Alibertia.edulis 
##                        0                        0                        0 
##  Allophylus.psilospermus         Alseis.blackiana        Amaioua.corymbosa 
##                        0                        5                        0 
##      Anacardium.excelsum           Andira.inermis          Annona.spraguei 
##                        0                        0                        0 
##            Apeiba.glabra         Apeiba.tibourbou  Aspidosperma.desmanthum 
##                        3                        0                        0 
## Astrocaryum.standleyanum     Astronium.graveolens 
##                        0                        1

Resampling without replacement follows the multivariate hypergeometric distribution, which can be fitted with rmvhyper from package extraDistr. The arguments have different names, nn is the number of resamples, k is size and n is the species data to resample. The output is a matrix with nn rows and one species per column – the transpose of rmultinom.

library(extraDistr)
rmvhyper(nn = 1, k = 50, n =  bci1)[, 1:20]
##  [1] 0 0 0 0 0 0 1 0 0 0 3 0 0 0 0 3 0 0 0 0

As I often resample many assemblages many times, it is useful to know which solution is fastest.

library(microbenchmark)

size <- 100
# preprocess for tidyverse
BCI1 <- BCI %>% 
  rownames_to_column(var = "plot") %>% 
  slice(1) %>% 
  pivot_longer(-plot, names_to = "species", values_to = "n")

mb1 <- microbenchmark(
  tidyverse = BCI1 %>% 
    uncount(n) %>% 
    group_by(plot) %>%
    slice_sample(n = 100, replace = TRUE) %>% 
    count(species),
    resample_wo = resample_1(count = bci1, size = size, replace = FALSE),
  resample_w = resample_1(count = bci1, size = size, replace = TRUE),
  rmultinom = rmultinom(n = 1, size = size, prob =  bci1),
  rmvhyper = rmvhyper(nn = 1, k = size, n =  bci1)
)
autoplot(mb1) 
microbench mark plot of time taken by the different method

As expected, the tidyverse solution is much slower than the other solutions. mvhyper and rmultinom are much faster than the home-made solution. I don’t know why mvhyper is slower than rmultinom.

I often need to resample an assemblage many times. mvhyper and rmultinom have this functionality built-in. The home-grown function can be extended to make multiple resamples.

resample_n <- function(count, size, n = 1, replace = FALSE){
  id <- names(count)
  if(is.null(id)){
    id <- 1:length(count)
  }
  count2 <- rep(id, times = count)
  replicate(n = n, {
    samp <- sample(count2, size = size, replace = replace)
    tab <- table(samp)
    result <- rep(0, length(count))
    names(result) <- id
    result[names(tab)] <- tab
    result
  })  
}

And again, we can test how fast the functions are.

mb_nrep <- microbenchmark(
  resample_wo_1 = resample_n(count = bci1, n = 1, size = size, replace = FALSE),
  resample_wo_10 = resample_n(count = bci1, n = 10, size = size, replace = FALSE),
  resample_wo_100 = resample_n(count = bci1, n = 100, size = size, replace = FALSE),

  rmultinom_1 = rmultinom(n = 1, size = size, prob =  bci1),
  rmultinom_10 = rmultinom(n = 10, size = size, prob =  bci1),
  rmultinom_100 = rmultinom(n = 100, size = size, prob =  bci1),

  resample_w_1 = resample_n(count = bci1, n = 1, size = size, replace = TRUE),
  resample_w_10 = resample_n(count = bci1, n = 10, size = size, replace = TRUE),
  resample_w_100 = resample_n(count = bci1, n = 100, size = size, replace = TRUE),

  rmvhyper_1 = rmvhyper(nn = 1, k = size, n =  bci1),
  rmvhyper_10 = rmvhyper(nn = 10, k = size, n =  bci1),
  rmvhyper_100 = rmvhyper(nn = 100, k = size, n =  bci1)
)

mb_nrep %>% 
  print() %>% 
  mutate(n = rep(c(1, 10, 100), times = 4),
         expr = str_remove(expr, "_\\d{1,3}")) %>%   
  ggplot(aes(x = n, y = mean/n, colour = expr)) +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Number of draws", y = "Mean time per resample,  microseconds", colour = "Method")

The home-made function has less good economy-of-scale than the other methods. I think I have some projects that need their code updating.

One last problem is how to resample all the diatom assemblages in a core. We need to iterate over each row in the data set (assuming it is in wide format – also possible to do this in long format). If we want a single resample, we can use apply.

m1 <- apply(BCI, 1, resample_1, size = 100)
m2 <- apply(BCI, 1, rmultinom, size = 100, n = 1)
m3 <- apply(BCI, 1, rmvhyper, k = 100, nn = 1)

Of course, it is also possible to run multiple resamples at once, but as this returns a matrix, apply won’t work well. Instead, we can use plyr::aaply (don’t load plyr as it will conflict with dplyr). This will return an array (the rmvhyper result is transposed).

#convert the data to a matrix or problems 
mm1 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = resample_n, n = 2, size = 100 )
dim(mm1)
## [1]  50 225   2
mm2 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = rmultinom, n = 2, size = 100 )
dim(mm2)
## [1]  50 225   2
mm3 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = rmvhyper, nn = 2, k = 100 )
dim(mm3)
## [1]  50   2 225

Coming soon, some analyses using lots of resampled assemblages.

About richard telford

Ecologist with interests in quantitative methods and palaeoenvironments
This entry was posted in R and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s