Extracting data from a PDF image

Some scientists archive their data. Some scientists email their data on request. Some editors cajole authors into releasing data to interested parties. And sometimes none of these approaches yields data.

What then? One option is to request the data via the scientist’s head of department; another is to scrape the data out of published diagrams. Much has been written about how to extract data from raster formats, including stratigraphic diagrams, but I found only a little information on how to extract data from vector formats. This is potentially a much more powerful option.

This is how I did it. First I downloaded Luoto and Ojala (2017) and checked that the stratigraphic plot was in a vector format by greatly increasing the magnification. A raster image would appear pixelated at high magnification; a vector image stays sharp.


The stratigraphic diagram from Luoto and Ojala (2017)

The PDF needs to be converted into a format that can be read into a text editor. This can be done with qpdf in the terminal. First I extract the page with the stratigraphic diagram to make further processing easier, then I uncompress the page's data streams.

qpdf 0959683616675940.pdf  --pages 0959683616675940.pdf 4 -- outfile.pdf

qpdf outfile.pdf --stream-data=uncompress  outfile2.pdf

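Now outfile2.pdf can be read into R as plain text. In an uncompressed PDF, each rectangle is drawn by a line of the form "x y width height re", so the bars of the diagram can be read into a data frame.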

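library("tidyverse")

page4 <- readLines("luoto/data/outfile2.pdf")

# Find start and end of image
start <- grep("/PlacedGraphic", page4)[2]
end <- grep("[(Figur)20(e 2.)]TJ", page4, fixed = TRUE)

# Assumed reconstruction of a step lost from the post: keep the lines
# between start and end that draw rectangles (PDF operator "re")
fig2 <- page4[start:end] %>%
  str_subset(" re$") %>%
  read.table(text = .) %>%
  set_names(c("x", "y", "width", "height", "re"))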
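
To make the parsing step concrete, here is a minimal sketch on a few made-up content-stream lines (the numbers are invented, not taken from the paper). It assumes that each rectangle is written as four numbers followed by the PDF operator re, and it uses base R only.

```r
# Hypothetical lines from an uncompressed PDF content stream (values invented)
stream <- c(
  "q",
  "0.5 w",
  "100.2 700.1 12.5 -2.35 re",
  "f",
  "100.2 697.6 3.1 -2.36 re",
  "f",
  "50 50 400 600 re",
  "S",
  "Q"
)

# Keep only rectangle operators: four numbers followed by "re"
is_rect <- grepl("^(-?[0-9.]+ ){4}re$", stream)

rects <- read.table(
  text = stream[is_rect],
  col.names = c("x", "y", "width", "height", "re")
)

# The bars share (almost) the same height; the large frame does not
bars <- rects[rects$height > -2.4 & rects$height < -2.3, ]
print(bars)
```

The stricter regular expression is a design choice for the sketch: it avoids picking up any other line that happens to end in "re".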

The bars are, within rounding error, the same height. Various other rectangles in the figure have different heights, so I can filter on height to keep only the bars.


fig2 = fig2 %>%
  filter(between(height, -2.4, -2.3))

Now I can plot the data.

fig2 %>%
  ggplot(aes(x = x + width/2, y = y, width = width, height = height, fill = factor(x, levels = sample(unique(x))))) +
  geom_tile(show.legend = FALSE)

The scraped data from the stratigraphic plot

The next step would be to assign taxon names to each unique value of x and scale the widths so they are in percent. When that is done, and the weather data digitised, I can test how well I can reproduce the calibration-in-time transfer function model. The calibration-in-space model will need to wait for data from several other papers to be scraped.
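
A minimal sketch of that next step, on invented data. The taxon names, the lookup table and the calibration (30 PDF units corresponding to 20 % on the axis) are all hypothetical; in practice they would be read off the published diagram.

```r
# Toy version of the scraped bars (values invented): one row per bar
bars <- data.frame(
  x     = c(100, 100, 160, 160, 220),
  y     = c(700, 690, 700, 690, 700),
  width = c(12, 6, 3, 9, 30)
)

# Hypothetical lookup: left edge of each column of bars -> taxon name
taxa <- data.frame(
  x     = c(100, 160, 220),
  taxon = c("Taxon A", "Taxon B", "Taxon C")
)

# Assumed calibration: 30 PDF units on the axis correspond to 20 %
units_per_percent <- 30 / 20

# Attach the names and convert bar widths to percentages
scaled <- merge(bars, taxa, by = "x")
scaled$percent <- scaled$width / units_per_percent
print(scaled)
```

With real data the x values carry rounding error, so they may need to be rounded before they are matched against the lookup table.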


Bergen: a year with some sunshine

May was glorious.  December less so.


The data are from the Geofysisk Institutt in Bergen. Here is the code I used.

library("tidyverse")
library("lubridate")

florida <- read.csv("../climate_data/Florida_2017-06-21_2018-06-22_1529649176.csv", stringsAsFactors = FALSE) %>%
  as_tibble()

florida <- florida %>%
  mutate(
    date_time = ymd_hm(paste(Dato, Tid)),
    date = ymd(Dato),
    # put every time of day on the same dummy date so it can be used as the x-axis
    Tid = ymd_hm(paste("2000-01-01", Tid)),
    # treat 9999.99 as missing
    Globalstraling = if_else(Globalstraling == 9999.99, NA_real_, Globalstraling)
  ) %>%
  complete(date, Tid) %>% # add rows for missing date-time combinations
  arrange(date, Tid) %>%
  # fill gaps with the value from the same time on the previous day (24 * 6 steps)
  mutate(Globalstraling = coalesce(Globalstraling, lag(Globalstraling, 24 * 6)))

ggplot(florida, aes(x = Tid, y = date, fill = Globalstraling)) +
  geom_raster() +
  scale_fill_gradient(low = "black", high = "yellow") +
  scale_x_datetime(date_labels = "%H", expand = c(0, 0)) +
  scale_y_date(date_labels = "%b", expand = c(0, 0)) +
  labs(x = "Time", y = "Month", fill = expression(atop(Sunshine, W~m^-2)))
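
The gap-filling step may be easier to follow on a toy series. This is a base R sketch of the same idea (take the value from the same time on the previous day when a reading is missing), with invented numbers and only four readings per day rather than 24 * 6.

```r
# Toy series: 3 "days" of 4 readings each, with missing values (values invented)
steps_per_day <- 4
radiation <- c(0, 10, 20, 5,
               0, NA, 25, 5,
               0, 12, NA, NA)

# Value at the same time on the previous day (NA for the first day)
previous_day <- c(rep(NA, steps_per_day), head(radiation, -steps_per_day))

# Keep observed values; use the previous day's value where one is missing
filled <- ifelse(is.na(radiation), previous_day, radiation)
print(filled)
```

As in the code above, the lagged values come from the original series, so a reading that is missing on two consecutive days stays missing.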


More trouble counting to fifty

Earlier this week at the palaeolimnology symposium, Gavin told me that it had not dawned on him that the count sum could be estimated from percent data using our knowledge of rank abundance curves. I only recently realised this; previously I did not imagine that this would ever be a useful thing to do.

And yet this morning, I found strong evidence that another chironomid analyst has problems keeping to the count sum promised in the paper (the tally is now three chironomid analysts, a diatomist, and a palynologist).

On Sunday, after I gave my presentation to the chironomid workshop, there was some discussion about what should be done with small counts (they do have some information), but there was unanimity that the paper must report it if some counts contain fewer than the target sum.

At the moment, I am not going to name the analyst whose data I examined this morning. This is only a temporary reprieve: when I write up the presentation I gave on Sunday, this case will be used as an example.

The paper reports that the minimum number of head capsules per sample was fifty. Eight of the thirty-four samples appear to have a count sum of less than fifty. In one sample, the rarest taxon has a relative abundance of 6.25%. The relative abundances of the other six taxa are all integer multiples of 6.25%. This is strong evidence that the count sum is sixteen.
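
The reasoning can be sketched in a few lines of R. The percentages below are invented (they are not the analyst's data) but, like the sample described above, have seven taxa that are all multiples of 6.25 %. The sketch assumes the rarest taxon is represented by a single head capsule.

```r
# Percentages for one hypothetical sample (invented values)
percent <- c(6.25, 12.5, 6.25, 25, 18.75, 12.5, 18.75)

# If the rarest taxon is a single head capsule, the count sum is
# about 100 / minimum percentage
est_sum <- 100 / min(percent)

# Implied counts; all should be (near) integers if the estimate holds
implied <- percent / 100 * est_sum
all_integer <- all(abs(implied - round(implied)) < 1e-6)

print(est_sum)
print(implied)
print(all_integer)
```

If the rarest taxon were in fact represented by two or more head capsules, the true count sum would be a multiple of this estimate, which is the fluke considered below.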

There are three possibilities.

First, that the counts are a great fluke. This is unlikely. The sample discussed above could be one in which every taxon was present with a multiple of four head capsules (i.e. the true count sum is 64). Even under extremely optimistic assumptions, the probability of getting such a sample is 1/4⁷ ≈ 6 × 10⁻⁵. And then there are another seven samples, one of which would require all nineteen taxa to have multiples of three head capsules (1/3¹⁹ ≈ 9 × 10⁻¹⁰). Combining the probabilities of all these unlikely counts will give a very small number.
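
The two probabilities can be checked with a couple of lines of R, under the optimistic assumption that each taxon's count independently has a 1-in-k chance of being a multiple of k.

```r
# Chance that all 7 taxa have counts that are multiples of 4
p_sample_1 <- (1 / 4)^7

# Chance that all 19 taxa have counts that are multiples of 3
p_sample_2 <- (1 / 3)^19

print(signif(p_sample_1, 1))  # about 6e-05
print(signif(p_sample_2, 1))  # about 9e-10
```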

Neither of the remaining two possibilities is very pleasant.

It could be that the author(s) (who I believe count(s) their own chironomids) are so negligent that they forgot when they wrote the paper that the count sums of almost a quarter of their samples were smaller than promised. If this is the case, the authors’ competence has to be doubted and we need to ask if anything the author(s) report should be trusted.

The other possibility is that the author(s) knowingly misreported the count sums. This could easily be construed as fabrication (“easily” that is for anyone except a university integrity officer), and data fabrication is a form of misconduct. Obviously this would not be the most serious case of misconduct, perhaps akin to plagiarising a paragraph rather than a full paper.

The question remains as to what to do with this case of possible data falsification (and any other cases I find when I have time to import some more data). I see three options: to describe the case in my forthcoming manuscript with a citation; to alert the journal that published the paper; and to advise the authors’ university’s integrity officer.

I ask my readers, both of you, to tell me in the comments or otherwise, how you think I should proceed in this case and what you think the outcome should be.

My presentation to IPA-IAL 2018

I’ve just given a presentation at the joint IPA-IAL conference in Stockholm:

Sub-decadal resolution palaeoenvironmental reconstructions from microfossil assemblages

Download it!

The deadline for applying for a reward for finding typos has expired.


How many is fifty? Sanity checks for assemblage data.

This week I’m at the Palaeolimnology Symposium in Stockholm.

I have a couple of presentations. I gave the first this morning to the chironomid “DeadHead” meeting. I showed some sanity checks for assemblage data, some of which are related to Brown and Heathers’ (2018) GRIM test. You can download it from figshare.
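
As an illustration of the kind of check that is related to the GRIM test (not necessarily one of the checks in the presentation), the sketch below asks whether a reported percentage could arise from an integer count given a claimed count sum. The function name percent_possible and its arguments are hypothetical.

```r
# Can a percentage reported to `digits` decimals arise from an integer
# count out of a claimed count sum `n`? (hypothetical helper)
percent_possible <- function(percent, n, digits = 1) {
  # integer counts either side of the implied, possibly fractional, count
  implied <- percent / 100 * n
  candidates <- c(floor(implied), ceiling(implied))
  # allow for rounding of the reported value
  tolerance <- 0.5 * 10^-digits + 1e-8
  any(abs(candidates / n * 100 - percent) <= tolerance)
}

# 6.25 % is one head capsule out of 16, but no whole number out of 50
ok_16 <- percent_possible(6.25, n = 16, digits = 2)
ok_50 <- percent_possible(6.25, n = 50, digits = 2)
print(c(ok_16, ok_50))
```

Using a tolerance rather than re-rounding the candidate percentages is a design choice: it sidesteps differences in how values ending in 5 are rounded.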


Warm summers in the Younger Dryas?

The Younger Dryas was a period (12,900–11,600 BP) towards the end of the last glaciation when glaciers re-advanced in Scotland and tundra plants, including the eponymous Dryas, replaced the Bølling-Allerød forests in Denmark. These changes indicate that the Younger Dryas was a cold interval in Europe, an interpretation challenged by a new paper by Schenk et al in Nature Communications.

Schenk et al base their reinterpretation on climate model output supported by plant macrofossils and pollen, which are used as indicator species to infer July temperatures. Their model simulates autumns, winters and springs that were colder during the Younger Dryas than the Bølling-Allerød, and summers that were short but warmer. The paper suggests, based on the model output, various mechanisms that could give summer warming in an otherwise cold period, especially atmospheric blocking preventing winds from the cold ocean blowing far into Europe.

I’m not in a position to evaluate their model, so I’m going to have a closer look at their proxy data and compare it with other proxy data.

Since the replacement of forest by tundra or steppe during the Younger Dryas is well known, it is a little surprising to see palaeobotanical evidence being used to suggest warm Younger Dryas summers. Schenk et al base their temperature reconstruction on “plant species that indicate local presence focusing primarily on specific plant species, such as aquatics and riparian species”, rather than the full assemblage. I cannot find any discussion in the text as to why the aquatics should indicate warmth while the forests decline, but of course summer temperature is not the only factor driving vegetation – drought also plays an important role.

The glacial re-advance is probably one of the best pieces of evidence for cool summers in the Younger Dryas. Within Europe, Younger Dryas glacial re-advances have been reported from, for example, the Iberian Peninsula, South Wales, North Wales, Northern England, Northern Ireland, Scotland, Norway, Iceland, the Alps, Poland, Romania and Macedonia. For glaciers to re-advance, summer temperatures need to be low. Alternatively, winter precipitation could be very high, but it is generally thought that Europe was arid in the Younger Dryas because of the extensive sea ice in the North Atlantic (some of the evidence for aridity is dependent on estimates of summer temperature, but some is independent). Glacial re-advances near the Atlantic coast could be consistent with Schenk et al, as the coast might not be protected by atmospheric blocking, but glacial advances in central Europe are not. Schenk et al do not consider the glacial evidence.

The one possible way to reconcile Schenk et al with the glaciological evidence is to argue that the dating of the glacial re-advances to the Younger Dryas is suspect. Bromley et al (2014) make this argument for Scotland, but at other sites the evidence is unambiguous, for example at Kråkenes, where the well-dated lake core contains sediment derived from a cirque glacier in the Younger Dryas.

Schenk et al consider the numerous chironomid-based reconstructions of summer temperature, which are almost unanimous in reconstructing colder conditions in the Younger Dryas than the preceding Bølling-Allerød, but dismiss this evidence. They argue that “chironomid assemblages have been shown to incorporate a number of environmental signals (e.g., catchment vegetation, nutrient supply, lake status and depth, seasonality) apart from the ambient summer air temperature.” This is correct, but the aquatic indicators used by Schenk et al will be sensitive to many of the same factors.

Specifically, Schenk et al argue that changes in seasonality (cold springs with late-lying snow and short summers) affect the chironomid-based reconstructions. Why these seasonality shifts would not likewise affect the aquatic plant indicators is not explained.

The indicator values are calculated from the temperature at the range limit of each species in Finland. Range limits are inherently uncertain, and as the reconstruction is based on the indicator value of the least cold-tolerant species in the assemblage, the reconstructions will also be very uncertain. I do wonder how the range limits would change if a larger geographic area were considered, a question that could easily be tested with GBIF data.

I am not convinced by this paper.

Stratigraphic diagrams with ggplot

rioja::strat.plot is a great tool for plotting stratigraphic plots in R, but sometimes it is not obvious how to do something I want, perhaps a summary panel showing the percent trees/shrubs/herbs. Of course, I could extend strat.plot, but I do nearly all my figures with ggplot now and wouldn’t know where to start.

In this post, I’m going to show how to make a stratigraphic plot with ggplot.

First I’m going to download the pollen data for Tsuolbmajavri (Seppä and Hicks, 2006) from Neotoma.

library("neotoma")
library("tidyverse")

#get dataset list
pollen_sites <- get_dataset(datasettype = "pollen", x = 15733)

#download_data
pollen_data <- get_download(pollen_sites)
pollen_data <- pollen_data[[1]]

#species groups to plot
wanted <- c("TRSH", "UPHE", "VACR", "SUCC", "PALM", "MANG")

Now I need to extract and combine the pollen count, the chronology and the taxon information from the downloaded neotoma data. I gather the data to make it into a long thin (tidy) table, filter the taxa that are in the ecological groups I want to plot and calculate percentages.

thin_pollen <- bind_cols(
  counts(pollen_data),
  ages(pollen_data) %>% select(depth, age, age.type)
) %>%
  gather(key = taxon, value = count, -depth, -age, -age.type) %>%
  left_join(pollen_data$taxon.list, by = c("taxon" = "taxon.name")) %>%
  filter(ecological.group %in% wanted) %>%
  group_by(depth) %>%
  mutate(percent = count / sum(count) * 100)

Now I can plot the data after a little further processing: I’m selecting taxa that occur at least three times and with a maximum abundance of at least 3%, and making taxon into a factor so taxa from the same ecological group will plot together.
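
The taxon filter can be illustrated on a toy table. This is a base R sketch with invented percentages (they do not sum to 100 per sample; only the thresholds matter here), not the Tsuolbmajavri data.

```r
# Toy long-format percentages (values invented): 3 taxa, 4 samples each
toy <- data.frame(
  taxon   = rep(c("Pinus", "Betula", "Rumex"), each = 4),
  percent = c(40, 35, 50, 45,   # common throughout
              0, 0, 5, 0,       # reaches 5 % but occurs only once
              1, 2, 1, 2)       # occurs four times but never reaches 3 %
)

# Per-taxon maximum abundance and number of occurrences
max_percent   <- tapply(toy$percent, toy$taxon, max)
n_occurrences <- tapply(toy$percent > 0, toy$taxon, sum)

# Keep taxa that pass both thresholds
keep <- names(max_percent)[max_percent >= 3 & n_occurrences >= 3]
kept <- toy[toy$taxon %in% keep, ]
print(keep)
```

Only the taxon that is both reasonably abundant and found in several samples survives, which keeps the diagram from filling up with near-empty panels.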

I use geom_area twice, once to show 10x the percent and once to show the percent itself. I then need to use ylim in coord_flip to truncate the axes at the maximum pollen abundance. The plot uses facets to show the taxa separately.

thin_pollen %>%
  mutate(taxon = factor(taxon, levels = unique(taxon[order(ecological.group)]))) %>%
  group_by(taxon) %>%
  filter(max(percent) >= 3, sum(percent > 0) >= 3) %>%
  ggplot(aes(x = age, y = percent, fill = ecological.group)) +
  geom_area(aes(y = percent * 10), alpha = 0.3, show.legend = FALSE) +
  geom_area(show.legend = FALSE) +
  coord_flip(ylim = c(0, max(thin_pollen$percent))) +
  scale_x_reverse(expand = c(0.02, 0)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 2)) +
  facet_wrap(~ taxon, nrow = 1) +
  labs(x = expression(Age~''^{14}*C~BP), y = "Percent", fill = "Group") +
  theme(
    strip.text.x = element_text(angle = 90, size = 7, hjust = 0),
    panel.spacing = unit(x = 1, units = "pt"),
    axis.text.x = element_text(size = 7)
  )


I don’t think that is too bad for a first attempt, but I think I’ll stick to rioja::strat.plot most of the time.
