## Resampling Assemblage Counts

Imagine that one diatom assemblage has 20 species and another has 25, but the more species rich assemblage also has more diatoms counted.
A fair comparison of the richness of each assemblage can only be made when the number of individuals is constant.

In this case, we can use the function `rarefy` from the `vegan` package. Other analyses might not be possible to run directly, so you need re-sampled counts with a specified count sum. I’ve been doing this recently to find the how the prevalence of singletons (species only occurring once) varies with count sum.

There are several ways to resample counts. A key question is whether you want to sample with or without replacement.

Sampling without replacement asks what the assemblage you currently have might have been when the count sum was smaller. If you sample as many individuals as you actually have, your resampled assemblage is identical to your current assemblage and you cannot sample more individuals than you have.

Sampling with replacement asks what the assemblage you currently have might look like if you counted another assemblage from the same material.
In principle, you can simulate larger assemblages than you currently have, though this is probably always a bad idea. Species richness will be biased low, as rare taxa present in the community behind the veil line cannot occur in the resampled assemblages but could occur if another diatom slide was counted.

I’m going to resample the `vegan::BCI` dataset of tree counts in plots on Barro Colorado Island.

```data(BCI, package = "vegan")
```

First a tidyverse solution.

```BCI %>%
# put the rownames into a column called "plot"
rownames_to_column(var = "plot") %>%
# convert the data from a wide to a long format
pivot_longer(-plot, names_to = "species", values_to = "n") %>%
# uncount does the opposite of count - giving 2 rows for a species that had a count of 2.
uncount(n) %>%
group_by(plot) %>%
# randomly sample, with replacement, 100 individuals (rows) per plot
slice_sample(n = 100, replace = TRUE) %>%
# count the number of individuals per species per plot
count(species)
```
```## # A tibble: 2,128 × 3
## # Groups:   plot 
##    plot  species                     n
##    <chr> <chr>                   <int>
##  1 1     Alseis.blackiana            6
##  2 1     Annona.spraguei             1
##  3 1     Apeiba.glabra               2
##  4 1     Astronium.graveolens        3
##  5 1     Brosimum.alicastrum         1
##  6 1     Casearia.sylvestris         2
##  7 1     Cassipourea.guianensis      1
##  8 1     Cecropia.insignis           1
##  9 1     Chrysophyllum.argenteum     1
## 10 1     Cordia.bicolor              3
## # … with 2,118 more rows
```

The tidyverse solution might not be the fastest (see below to find out) as it is doing a lot of extra work. So let’s write a function that has the same essential steps of uncounting the counts, sampling them, and the recounting them.

```resample_1 <- function(counts, size, replace = FALSE){
#counts is species count data to resample,
#size is target count sum
#replace or not

#grab species names (if present)
id <- names(counts)
if(is.null(id)){
id <- 1:length(count)
}
#uncount
count2 <- rep(id, times = counts)
# sample
samp <- sample(count2, size = size, replace = replace)
# count again
tab <- table(samp)
# fill missing taxa with zero
result <- rep(0, length(counts))
names(result) <- id
result[names(tab)] <- tab
#return result - a named vector
result
}

#first BCI plot
bci1 <- unlist(BCI[1, ])

resample_1(count = bci1, size = 100, replace = TRUE)[1:20]
```
```##       Abarema.macradenia    Vachellia.melanoceras    Acalypha.diversifolia
##                        0                        0                        0
##                        0                        0                        0
##  Alchornea.costaricensis      Alchornea.latifolia         Alibertia.edulis
##                        0                        0                        0
##  Allophylus.psilospermus         Alseis.blackiana        Amaioua.corymbosa
##                        0                        9                        0
##      Anacardium.excelsum           Andira.inermis          Annona.spraguei
##                        0                        0                        0
##            Apeiba.glabra         Apeiba.tibourbou  Aspidosperma.desmanthum
##                        1                        0                        0
## Astrocaryum.standleyanum     Astronium.graveolens
##                        0                        1
```

There are also R functions that do the same jobs.

Resampling with replacement gives a multinomial distribution; for which we can also use `stats::rmultinom`. This takes the argument `prob` for the species data. The data are standardised to sum to one, so counts, percent or proportions could be used. `n` is the number of resamples to make. The return object is a matrix with `n` columns and one species per row.

```rmultinom(n = 1, size = 100, prob =  bci1)[1:20, ]
```
```##       Abarema.macradenia    Vachellia.melanoceras    Acalypha.diversifolia
##                        0                        0                        0
##                        0                        0                        0
##  Alchornea.costaricensis      Alchornea.latifolia         Alibertia.edulis
##                        0                        0                        0
##  Allophylus.psilospermus         Alseis.blackiana        Amaioua.corymbosa
##                        0                        5                        0
##      Anacardium.excelsum           Andira.inermis          Annona.spraguei
##                        0                        0                        0
##            Apeiba.glabra         Apeiba.tibourbou  Aspidosperma.desmanthum
##                        3                        0                        0
## Astrocaryum.standleyanum     Astronium.graveolens
##                        0                        1
```

Resampling without replacement follows the multivariate hypergeometric distribution, which can be fitted with `rmvhyper` from package `extraDistr`. The arguments have different names, `nn` is the number of resamples, `k` is `size` and `n` is the species data to resample. The output is a matrix with `nn` rows and one species per column – the transpose of `rmultinom`.

```library(extraDistr)
rmvhyper(nn = 1, k = 50, n =  bci1)[, 1:20]
```
```##   0 0 0 0 0 0 1 0 0 0 3 0 0 0 0 3 0 0 0 0
```

As I often resample many assemblages many times, it is useful to know which solution is fastest.

```library(microbenchmark)

size <- 100
# preprocess for tidyverse
BCI1 <- BCI %>%
rownames_to_column(var = "plot") %>%
slice(1) %>%
pivot_longer(-plot, names_to = "species", values_to = "n")

mb1 <- microbenchmark(
tidyverse = BCI1 %>%
uncount(n) %>%
group_by(plot) %>%
slice_sample(n = 100, replace = TRUE) %>%
count(species),
resample_wo = resample_1(count = bci1, size = size, replace = FALSE),
resample_w = resample_1(count = bci1, size = size, replace = TRUE),
rmultinom = rmultinom(n = 1, size = size, prob =  bci1),
rmvhyper = rmvhyper(nn = 1, k = size, n =  bci1)
)
autoplot(mb1)
```

As expected, the tidyverse solution is much slower than the other solutions. `mvhyper` and `rmultinom` are much faster than the home-made solution. I don’t know why `mvhyper` is slower than `rmultinom`.

I often need to resample an assemblage many times. `mvhyper` and `rmultinom` have this functionality built-in. The home-grown function can be extended to make multiple resamples.

```resample_n <- function(count, size, n = 1, replace = FALSE){
id <- names(count)
if(is.null(id)){
id <- 1:length(count)
}
count2 <- rep(id, times = count)
replicate(n = n, {
samp <- sample(count2, size = size, replace = replace)
tab <- table(samp)
result <- rep(0, length(count))
names(result) <- id
result[names(tab)] <- tab
result
})
}
```

And again, we can test how fast the functions are.

```mb_nrep <- microbenchmark(
resample_wo_1 = resample_n(count = bci1, n = 1, size = size, replace = FALSE),
resample_wo_10 = resample_n(count = bci1, n = 10, size = size, replace = FALSE),
resample_wo_100 = resample_n(count = bci1, n = 100, size = size, replace = FALSE),

rmultinom_1 = rmultinom(n = 1, size = size, prob =  bci1),
rmultinom_10 = rmultinom(n = 10, size = size, prob =  bci1),
rmultinom_100 = rmultinom(n = 100, size = size, prob =  bci1),

resample_w_1 = resample_n(count = bci1, n = 1, size = size, replace = TRUE),
resample_w_10 = resample_n(count = bci1, n = 10, size = size, replace = TRUE),
resample_w_100 = resample_n(count = bci1, n = 100, size = size, replace = TRUE),

rmvhyper_1 = rmvhyper(nn = 1, k = size, n =  bci1),
rmvhyper_10 = rmvhyper(nn = 10, k = size, n =  bci1),
rmvhyper_100 = rmvhyper(nn = 100, k = size, n =  bci1)
)

mb_nrep %>%
print() %>%
mutate(n = rep(c(1, 10, 100), times = 4),
expr = str_remove(expr, "_\\d{1,3}")) %>%
ggplot(aes(x = n, y = mean/n, colour = expr)) +
geom_line() +
geom_point() +
scale_x_log10() +
scale_y_log10() +
labs(x = "Number of draws", y = "Mean time per resample,  microseconds", colour = "Method")
```

The home-made function has less good economy-of-scale than the other methods. I think I have some projects that need their code updating.

One last problem is how to resample all the diatom assemblages in a core. We need to iterate over each row in the data set (assuming it is in wide format – also possible to do this in long format). If we want a single resample, we can use `apply`.

```m1 <- apply(BCI, 1, resample_1, size = 100)
m2 <- apply(BCI, 1, rmultinom, size = 100, n = 1)
m3 <- apply(BCI, 1, rmvhyper, k = 100, nn = 1)
```

Of course, it is also possible to run multiple resamples at once, but as this returns a matrix, `apply` won’t work well. Instead, we can use `plyr::aaply` (don’t load `plyr` as it will conflict with `dplyr`). This will return an array (the `rmvhyper` result is transposed).

```#convert the data to a matrix or problems
mm1 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = resample_n, n = 2, size = 100 )
dim(mm1)
```
```##   50 225   2
```
```mm2 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = rmultinom, n = 2, size = 100 )
dim(mm2)
```
```##   50 225   2
```
```mm3 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = rmvhyper, nn = 2, k = 100 )
dim(mm3)
```
```##   50   2 225
```

Coming soon, some analyses using lots of resampled assemblages. 