Collaboration networks in BIO

I’ve been intrigued by social network analysis since reading the Wegman report, which seemed to find that Michael Mann was at the centre of the network of co-authors on papers on which he was a co-author (about as surprising as finding jam in the centre of a jammie dodger), but I’ve never had data that would benefit from it. Then I heard that the department’s research day would try to promote collaboration between research groups, so I thought I would analyse the department’s co-authorship network and learn something about how to run a social network analysis along the way. I found this video a useful introduction to social network analysis in R using the igraph library.

As usual in any complex data analysis, most of the work is in importing and formatting the data. Rather conveniently, the department’s website has a list of papers organised by year, so I didn’t have to hunt far for the data. I analysed papers from 2011 and 2012. Ideally I would have included more years, but it was tedious to match the list of authors against research group membership. In the list of publications, each paper has its own line, authors are separated by semicolons, and biology department authors are highlighted in blue. This formatting makes it easy to parse the list to extract the biology department authors of each paper.

library(gdata)#provides trim(); gdata was formerly part of the gregmisc bundle

p2012<-readLines("o:/data/network analysis/2012 publications.r")#import the publication list with headers & footers stripped out

Each line of this file looks something like this:

(A)</td></tr><tr height=10><td></td></tr><tr><td align=left valign=top width='2%'><a href=?expand=1608&year=2011&sort=0#1608><img src=/fellesfiler/images/icons/magnify.png width=15 height=11 border=0 border=0></a>&nbsp;</td><td  colspan=2 align=left valign=top><a name=1608></a><span style="color:blue;">Telford, RJ</span>; <span style="color:blue;">Birks, HJB</span>. (2011). Effect of uneven sampling along an environmental gradient on transfer-function performance. <i>J. Paleolimn.</i>, <b>46</b>, 99-106.<a href=http://dx.doi.org/10.1007/s10933-011-9523-z>[doi]</a>

I’m only interested in the author list, which is always after the second “</a>” and before the year. I can use strsplit to extract this text:

p2012<-as.vector(sapply(p2012, function(x) strsplit(x, "</a>", fixed=TRUE)[[1]][3]))#split the text at each "</a>" and take the third element
authors2012<-as.vector(sapply(p2012, function(x) strsplit(x, "(20", fixed=TRUE)[[1]][1]))#take the text before the "(20"
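To see what the two splits do, here is a minimal sketch run on a single line. The line is a shortened version of the example above (the attributes are trimmed for readability), and the variable names are only for illustration.

```r
# One line of the publication list, shortened from the example above
line <- paste0(
  "<td><a href=?expand=1608><img src=magnify.png></a>&nbsp;</td>",
  "<td><a name=1608></a>",
  "<span style=\"color:blue;\">Telford, RJ</span>; ",
  "<span style=\"color:blue;\">Birks, HJB</span>. ",
  "(2011). Effect of uneven sampling. <i>J. Paleolimn.</i>"
)

# Everything after the second "</a>" is the third piece of the split
after_anchor <- strsplit(line, "</a>", fixed = TRUE)[[1]][3]

# Everything before the opening bracket of the year is the author list
author_text <- strsplit(after_anchor, "(20", fixed = TRUE)[[1]][1]
print(author_text)
```

The result still contains the span tags, which is what the next step relies on to tell department authors from everybody else.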

Now I can take each paper’s author list, split it at the semicolons into individual authors, and then find which authors were displayed in a blue fount.

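bioauthors2012<-sapply(authors2012, function(x){
 x<-gsub("color:blue;", "color:blue", x, fixed=TRUE)#drop the semicolon inside the style attribute so it is not mistaken for an author separator
 authorList<-strsplit(x, ";", fixed=TRUE)[[1]]
 bioauthors<-authorList[grep('<span style="color:blue">', authorList, fixed=TRUE)]#keep only the authors highlighted in blue
 bioauthors<-sub('<span style="color:blue">', "", bioauthors, fixed=TRUE)
 bioauthors<-sub('</span>', "", bioauthors, fixed=TRUE)
 bioauthors<-trim(bioauthors)#remove whitespace
 bioauthors<-gsub(".", "", bioauthors, fixed=TRUE)#fixed=TRUE as a "." is a special character in regular expressions
 bioauthors
 }, simplify=FALSE)#simplify=FALSE always returns a list, even if every paper has the same number of authors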
names(bioauthors2012)<-NULL
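The same steps can be tried on a single author string. This is a sketch rather than the code used for the analysis: it uses base R's trimws() instead of trim(), and "Other, A.N." is a made-up author from outside the department.

```r
# Author text for one paper; "Other, A.N." is a hypothetical external author
author_text <- paste0(
  "<span style=\"color:blue;\">Telford, R.J.</span>; ",
  "Other, A.N.; ",
  "<span style=\"color:blue;\">Birks, HJB</span>. "
)
open_tag <- "<span style=\"color:blue\">"

# Remove the semicolon inside the style attribute before splitting on ";"
x <- gsub("color:blue;", "color:blue", author_text, fixed = TRUE)
author_list <- strsplit(x, ";", fixed = TRUE)[[1]]

# Keep only the authors wrapped in the blue span, then strip the markup
bio <- author_list[grepl(open_tag, author_list, fixed = TRUE)]
bio <- sub(open_tag, "", bio, fixed = TRUE)
bio <- sub("</span>", "", bio, fixed = TRUE)
bio <- gsub(".", "", bio, fixed = TRUE)  # drop full stops after initials
bio <- trimws(bio)                       # drop leading and trailing spaces
print(bio)
```

Removing the full stops means that "R.J." and "RJ" end up as the same author, which saves some of the manual corrections later.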


bioauthors2012 is a list of vectors, each containing the bio-affiliated authors of one paper. This needs reformatting before it can be analysed. We need an adjacency matrix: a square matrix in which each row and column represents an author, and each value is the number of papers that pair of authors co-authored. There is probably a more elegant way to calculate this, but this code works.

all.authors<-unique(unlist(bioauthors2012))
author.mat<-t(sapply(bioauthors2012, function(x)all.authors%in%x))
colnames(author.mat)<-all.authors

adjm<-sapply( all.authors,function(a) sapply(all.authors, function(aa){
  if(a==aa){sum(author.mat[,a])}
  else{sum(author.mat[,a]&author.mat[,aa])}
})
)

npub<-diag(adjm)
diag(adjm)<-0#set the diagonal to zero
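One candidate for the more elegant way is a matrix cross-product of the paper-by-author matrix, which counts shared papers for every pair of authors in one step. The sketch below uses toy data with hypothetical authors A, B and C, and has not been run on the real publication list.

```r
# Toy data: three papers and their department authors (hypothetical names)
papers <- list(c("A", "B"), c("A", "B", "C"), "C")
all_authors <- unique(unlist(papers))

# Paper-by-author incidence matrix: 1 where the author is on the paper
incidence <- t(sapply(papers, function(x) as.numeric(all_authors %in% x)))
colnames(incidence) <- all_authors

# t(incidence) %*% incidence counts shared papers for every author pair
adj <- crossprod(incidence)
n_papers <- diag(adj)  # the diagonal holds each author's paper count
diag(adj) <- 0         # an author is not their own co-author
print(adj)
```

Here A and B share two papers, and C shares one paper with each of them, so the off-diagonal values are 2, 1 and 1.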

The really tedious part of the analysis was getting a factor showing which research group the authors belong to, and then making corrections to the publication list – some people use their middle initials erratically. But then running the network analysis was easy with the igraph package.
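One possible way to handle the erratic initials is a lookup table of corrections applied before the adjacency matrix is built. This is only a sketch: the variant "Telford, R" is invented for illustration, and the table would need to be filled in by hand from the real list.

```r
# Hypothetical corrections: names are the variants, values the preferred form
corrections <- c("Telford, R" = "Telford, RJ")

# Author names as they might come out of the parsing step
names_raw <- c("Telford, R", "Birks, HJB")

# Replace the known variants and leave every other name untouched
needs_fix <- names_raw %in% names(corrections)
fixed_names <- unname(ifelse(needs_fix, corrections[names_raw], names_raw))
print(fixed_names)
```

Applied to each element of the list of author vectors (for example with lapply), this would keep one author from appearing as two vertices in the network.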

groups<-factor(...)#long tedious factor of research groups
group.colours<-c("white","lightblue", "green", "red", "yellow", "blue", "orange",
   "pink", "grey80", "lightgreen", "purple", "salmon",
   "seagreen", "wheat","skyblue", "brown")

library(igraph)

g2 <- graph_from_adjacency_matrix(adjm, mode="undirected", weighted=TRUE)#graph.adjacency() is the older name for this function
V(g2)$size<-sqrt(npub)*4 # vertex (author) area proportional to number of papers
V(g2)$color<-group.colours[groups]# vertex colour represents the research group
V(g2)$label<-seq_along(all.authors)# a numeric code for each author

x11(8,8);par(mar=rep(.3,4))
plot(g2)#plot the network

x11()
plot(1,1,type="n", axes=F, ann=F)
legend("top", legend=levels(groups), fill=group.colours)

#exclude unconnected authors
core<-coreness(g2)#graph.coreness() is the older name; isolated authors have coreness 0
g3<-induced_subgraph(g2, as.vector(which(core>0)))#remove authors with 0 coauthors
x11(8,8);par(mar=rep(.3,4))
plot(g3)

The networks are very pretty. Authors that share a publication are shown as vertices joined by an edge. This is a weighted analysis, so authors that share many publications are plotted closer together, but some distortion is inevitable. Some of the research groups form tight-knit groups, with little or no collaboration (measured by publications in 2011 and 2012) with other groups; other research groups are part of a broad collaboration network. It’s possible to run various diagnostics on the social network to identify authors or research groups with interesting co-authorship practices.
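As an example of such diagnostics, the sketch below computes three common measures with igraph on a toy network of five hypothetical authors. It assumes the igraph package is installed, and the choice of measures is mine rather than anything prescribed by the analysis above.

```r
library(igraph)

# Toy co-authorship counts: A-B share 3 papers, B-C 1, C-D 2; E has no co-authors
authors <- c("A", "B", "C", "D", "E")
adj <- matrix(0, 5, 5, dimnames = list(authors, authors))
adj["A", "B"] <- adj["B", "A"] <- 3
adj["B", "C"] <- adj["C", "B"] <- 1
adj["C", "D"] <- adj["D", "C"] <- 2

g <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)

n_coauthors <- degree(g)               # number of distinct co-authors
n_shared <- strength(g)                # shared papers summed over co-authors
broker <- betweenness(g, weights = NA) # how often an author links two others
print(data.frame(n_coauthors, n_shared, broker))
```

The weights are ignored for betweenness (weights = NA) because igraph would otherwise treat them as distances, which would make frequent co-authors look far apart. In this toy network B and C are the only brokers, each sitting on two shortest paths between other authors.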

About richard telford

Ecologist with interests in quantitative methods and palaeoenvironments

One Response to Collaboration networks in BIO

  1. Matthew W says:

    Being an American, I had to look up what a “Jammie Dodger” was.
    I agree with you that it should not at all be surprising that professionals in the same field co-author papers with each other (like minds and what not).
    I think it would be far more interesting to see which authors cite other authors in their publications.
    That’s where you would find the secret handshake.
