The European Pollen Database meets SQLite

The European Pollen Database is a fantastic resource for palaeoecologists, storing pollen stratigraphies from across the continent. Getting the data into R for analysis is facilitated by the EPDr package. However, first you need to set up the database and this can be a little tricky. The EPD is available to download in three database formats, Paradox, Microsoft Access and Postgres. The data are also available from Neotoma (partially as of now) and Pangaea.

  • I don’t know much about Paradox, and I’m not greatly motivated to change that. It might be possible to use it with EPDr (which used DBI internally), but I am not finding much online about how to do this
  • Last time I used MS Access there were problems with it only having a 32-bit driver available for importing data into R. This was possible to work around this but was a considerable pain.
  • Postgres is a top of the range open-source data base. However, set-up is not trivial, and on my university-managed computer, I lack the permissions needed to complete the set-up (I also lack the permission needed to change time-zones).

So what I want to do is to convert the EPD into a SQLite database. This is a very simple database format, lacking the bells and whistles that Postgres has, but it is very easy to make the connection to R – you just tell R where the file is and that it should use the SQLite driver. Having used SQLite on a couple of projects before, I also had some code to convert the MS Access files into SQLite.

We need to start by downloading the latest MS Access version of EPD,  and installing mdbtools (on ubuntu apt-get install mdbtools). My code might not be the most elegant way to complete the job (everything can probably be done with a few lines of a bash script), but it works.

####load packages####
library("DBI")
library("RSQLite")

#### Extract from mdb ####
#mdb location
mdb  data/epd_schema.sql"))

##export data
system('mkdir -p data/sql')
system(paste("for i in $( mdb-tables", mdb,
" ); do echo $i ; mdb-export -H -I sqlite", mdb,
" $i > data/sql/$i.sql; done"))  

#### make sqlite3 database ####
con <- dbConnect(SQLite(), dbname = "data/epd.sqlite")

#### set up schema ####
setup <- readChar("data/epd_schema.sql", nchar = 100000)

#Change to valid sqlite datatype
setup <- gsub("Memo/Hyperlink", "Text", setup)

# EPDr does not like # in field names
setup <- gsub("#", "_", setup)

# avoid case sensitive problems with EPDr
setup <- tolower(setup)

#add each table separately
sapply(
  paste("create", strsplit(setup, "create")[[1]][-1]),
  dbExecute,
  conn = con)

#### import data ####
import_table <- function(TAB) {
  message(TAB)
  tab <- readChar(paste0("data/sql/", TAB, ".sql"), nchar = 1e9)
  if (length(tab) == 0) {
    paste("File", TAB, "is empty")
    return(NULL)
  }
  #make into a single INSERT INTO statement for speed
  tab <- gsub("\nINSERT INTO [^;]+ VALUES", "\n", tab)
  tab <- gsub(";(?!$)", ",", tab, perl = TRUE)

  # Change # to _ to keep EPDr happy
  tab <- gsub("#", "_", tab)

  dbExecute(conn = con, statement = tab)
}

sapply(toupper(dbListTables(con)), import_table)

So did it work?

library(EPDr)
library(RSQLite)

#### make connection ####
epd.connection <- DBI::dbConnect(dbname = "data/epd.sqlite", drv = SQLite())

#### test an EPDr function ####
list_e(epd.connection, site = "Adange")
# E_ Site_ Sigle Name IsCore IsSect IsSSamp Descriptor HasAnLam
#1 1 1 ADANGE NA Y N Y RFLU N
# EntLoc LocalVeg Coll_ SampDate DepthAtLoc
#1 left coast of river Cyperaceae fen 65 1984-08-00 210
# IceThickCM SampDevice CoreDiamCM C14DepthAdj Notes Site_ SiteName
#1 NA spade 10 NA NA 1 Adange
# SiteCode SiteExists PolDiv1 PolDiv2 PolDiv3 LatDeg LatMin
#1 GEO-01000-ADAN NA GEO 01 000 43 18
# LatSec LatNS LatDD LatDMS LonDeg LonMin LonSec LonEW LonDD
#1 20 N 43.30556 43.18.20N 41 20 0 E 41.33333
# LonDMS Elevation AreaOfSite
#1 41.20.00E 1750 0.25

#### Tidy up ####
DBI::dbDisconnect(epd.connection)

Looks good so far (sorry about the formatting – there seems to be a bug in the syntax highlighter). Now I can start to explore some questions I have.

If there is interest, I can put the SQLite database on Dropbox, but I won't guarantee that the copy is up-to-date.

Advertisements

About richard telford

Ecologist with interests in quantitative methods and palaeoenvironments
This entry was posted in R and tagged . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s