The European Pollen Database meets SQLite

The European Pollen Database is a fantastic resource for palaeoecologists, storing pollen stratigraphies from across the continent. Getting the data into R for analysis is facilitated by the EPDr package. However, first you need to set up the database and this can be a little tricky. The EPD is available to download in three database formats, Paradox, Microsoft Access and Postgres. The data are also available from Neotoma (partially as of now) and Pangaea.

  • I don’t know much about Paradox, and I’m not greatly motivated to change that. It might be possible to use it with EPDr (which used DBI internally), but I am not finding much online about how to do this
  • Last time I used MS Access there were problems with it only having a 32-bit driver available for importing data into R. This was possible to work around this but was a considerable pain.
  • Postgres is a top of the range open-source data base. However, set-up is not trivial, and on my university-managed computer, I lack the permissions needed to complete the set-up (I also lack the permission needed to change time-zones).

So what I want to do is to convert the EPD into a SQLite database. This is a very simple database format, lacking the bells and whistles that Postgres has, but it is very easy to make the connection to R – you just tell R where the file is and that it should use the SQLite driver. Having used SQLite on a couple of projects before, I also had some code to convert the MS Access files into SQLite.

We need to start by downloading the latest MS Access version of EPD,  and installing mdbtools (on ubuntu apt-get install mdbtools). My code might not be the most elegant way to complete the job (everything can probably be done with a few lines of a bash script), but it works.

library("DBI")
library("RSQLite")

#### Download ####
#download latest Access version of EPD
#from http://www.europeanpollendatabase.net/data/downloads/
#unzip
#install mdbtools
#on ubuntu
#apt-get install mdbtools

#shell script with help from https://www.codeenigma.com/community/blog/using-mdbtools-nix-convert-microsoft-access-mysql

mdb <- "data/epd-access-distribution-20171031.mdb"

#### make sqlite3 database ####
#delete existing version
if(file.exists("data/epd.sqlite")){
file.remove("data/epd.sqlite")
}

#make new empty database
con <- dbConnect(SQLite(), dbname = "data/epd.sqlite")

#### set up schema ####
##export schema
setup <- system(paste("mdb-schema", mdb), intern = TRUE)
setup <- paste(setup, collapse = "\n") # merge into single text

#Change to valid datatype
setup <- gsub("Memo/Hyperlink", "Text", setup)
setup <- gsub("#", "_", setup) # EPDr does not like # in field names
setup <- tolower(setup)# avoid case sensitive problems with EPDr
#add each table seperately
sapply(
paste("create", strsplit(setup, "create")[[1]][-1]),
dbExecute,
conn = con)

#check what tables have been added (should be over 40)
dbListTables(con)

#### import data
import_table <- function(TAB) {
message(TAB)
tab <- system(paste("mdb-export -H -I sqlite", mdb, TAB), intern = TRUE)

if (length(tab) == 0) {
paste("File", TAB, "is empty")
return(NULL)
}

tab <- paste(tab, collapse = "\n")
tab <- gsub("\nINSERT INTO [^;]+ VALUES", "\n", tab)
tab <- gsub(";(?!$)", ",", tab, perl = TRUE)
tab <- gsub("#", "_", tab)

dbExecute(conn = con, statement = tab)
}

sapply(dbListTables(conn = con), import_table)

dbDisconnect(conn = con)

Some of the tables are large, and will take a few minutes to load into the database.

Did it work?

library(EPDr)
library(RSQLite)

#### make connection ####
epd.connection <- DBI::dbConnect(dbname = "data/epd.sqlite", drv = SQLite())

#### test an EPDr function ####
list_e(epd.connection, site = "Adange")
# E_ Site_ Sigle Name IsCore IsSect IsSSamp Descriptor HasAnLam
#1 1 1 ADANGE NA Y N Y RFLU N
# EntLoc LocalVeg Coll_ SampDate DepthAtLoc
#1 left coast of river Cyperaceae fen 65 1984-08-00 210
# IceThickCM SampDevice CoreDiamCM C14DepthAdj Notes Site_ SiteName
#1 NA spade 10 NA NA 1 Adange
# SiteCode SiteExists PolDiv1 PolDiv2 PolDiv3 LatDeg LatMin
#1 GEO-01000-ADAN NA GEO 01 000 43 18
# LatSec LatNS LatDD LatDMS LonDeg LonMin LonSec LonEW LonDD
#1 20 N 43.30556 43.18.20N 41 20 0 E 41.33333
# LonDMS Elevation AreaOfSite
#1 41.20.00E 1750 0.25

#### Tidy up ####
DBI::dbDisconnect(epd.connection)

Looks good so far (sorry about the formatting – there seems to be a bug in the syntax highlighter). Now I can start to explore some questions I have.

If there is interest, I can put the SQLite database on Dropbox, but I won't guarantee that the copy is up-to-date.

About richard telford

Ecologist with interests in quantitative methods and palaeoenvironments
This entry was posted in R and tagged . Bookmark the permalink.

1 Response to The European Pollen Database meets SQLite

  1. Pingback: Accessing the EPD from R – EPD activities, news, blog

Leave a comment