Scraping Google Scholar to write your PhD literature chapter

Jimmy Tidey
Published in Student Voices
5 min read · Oct 4, 2016

Network diagram of key authors in my PhD. Links represent citations and coauthoring. Colours are communities as indicated by ‘modularity’. The key indicates some rough guesses at what the authors in a community have in common — in some cases there is an excellent fit (Participatory Design), in others much less so (lilac).

[update: I’ve done an online version that anyone can use: whocites.com]

In a previous post I wrote about designing a better system for academic publishing. One problem with the commercial publishing ecosystem is that it inhibits those who’d like to develop new tools for navigating the ever growing body of research. (Of course there is also the clear injustice of asking taxpayers to fund research they subsequently cannot read.)

This post is about a prototype ‘network’ approach to finding papers using data from Google Scholar, hopefully pointing to what could be done with more open data. I was able to extract my data by running a supervised program that searches Google Scholar, but a scalable version of this tool would require open data.

It’s also scratching my own itch: I’m at the stage of my PhD where I need to pull everything together. I need a literature chapter that sets the theoretical context for my four case studies, explaining what has already been written about my topic and clarifying my key terms. In my case that means what’s been said about the subject of participatory design, policy making, collective action and social network analysis. (My working title is: How can social media inform local government policy?)

There are four areas that I know play a part in my research:

  • Sociological ideas about networks (Ronald Burt and Mark Granovetter)
  • Elinor Ostrom’s work on the commons
  • Participatory design
  • Ethics/Politics from Steven Lukes, Amartya Sen and John Rawls

Are there already any papers that cite authors from all of these areas? Can I confirm my suspicion that the participatory design literature almost never cites relevant political science, for example Steven Lukes’ investigation of power, or (another favourite) James Fishkin’s ideas about deliberative democracy? Time to do some coding…

Building a prototype

Using Meteor (a JavaScript framework) I built a web app to gather data from Google Scholar. I called it Bibnet (code on GitHub). The process starts with a set of search terms that I know return papers I’m already citing from Google Scholar (list of 55 search terms here).

With the list of search terms, Bibnet performs two steps:

  1. Bibnet records each paper or book that is returned by Google Scholar (up to 10 results per term) for each of the search terms. This information generates a list of publications in a database. In the same database, it also records who wrote each publication as a list of authors.
  2. Using Google Scholar’s ‘search within citations’, Bibnet checks whether any of the authors recorded in the database have cited any of the publications.
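The two steps can be sketched roughly as follows. This is a minimal illustration in Python, not Bibnet's actual Meteor code. The `search` and `cites` callables are hypothetical stand-ins for the Google Scholar lookups, and the papers and authors are invented toy data.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory stand-in for the database."""
    publications: dict = field(default_factory=dict)  # title -> list of author names
    authors: set = field(default_factory=set)
    citations: set = field(default_factory=set)       # (citing author, cited title)


def collect_publications(terms, search, store, max_results=10):
    """Step 1: record up to max_results publications per term, plus their authors."""
    for term in terms:
        for title, author_names in search(term)[:max_results]:
            store.publications[title] = list(author_names)
            store.authors.update(author_names)


def collect_citations(store, cites):
    """Step 2: check every recorded author against every recorded publication."""
    for title in store.publications:
        for author in store.authors:
            if cites(author, title):
                store.citations.add((author, title))


# Invented toy data standing in for search results and citation lookups.
toy_results = {
    "commons": [("Paper A", ["Author X"]), ("Paper B", ["Author Y", "Author Z"])],
    "networks": [("Paper C", ["Author Z"])],
}
toy_citing = {("Author Z", "Paper A"), ("Author X", "Paper C")}

store = Store()
collect_publications(toy_results, lambda term: toy_results[term], store)
collect_citations(store, lambda author, title: (author, title) in toy_citing)
print(len(store.publications), len(store.authors), len(store.citations))
```

Note that step 2 needs one lookup for every author and publication pair, which is one reason a scalable version would depend on open data rather than on searching.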

This process generates files that can be exported to the Gephi visualisation tool.
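The post does not say which file format the export uses. As one possibility, Gephi's spreadsheet import reads CSV edge tables with `Source`, `Target`, `Type` and `Weight` columns, so an export could look something like this sketch. The function name and the toy edges are assumptions for illustration.

```python
import csv
import io


def edges_to_csv(edge_weights):
    """Render {(source, target): weight} as CSV text with Gephi-style edge columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Weight"])
    for (source, target), weight in sorted(edge_weights.items()):
        writer.writerow([source, target, "Directed", weight])
    return buffer.getvalue()


# Invented toy edges between placeholder authors.
toy_edges = {("Author Z", "Author X"): 2, ("Author X", "Author Y"): 1}
csv_text = edges_to_csv(toy_edges)
print(csv_text)
```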

From the original 55 search strings I recorded 1120 authors, 1223 publications and 1382 citations. (Human readable list of citations)

Results

To generate the image below I exported a list of all the authors and showed links between them if they’d coauthored a publication or if they’d cited each other. The size of each node indicates its number of inbound links, and the thickness of each edge indicates the number of connections (citations and coauthoring) between the two nodes.
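Here is a rough sketch of how such author-to-author links could be derived from the recorded publications and citations. It makes two assumptions the post does not state: coauthoring counts as a link in both directions, and self-citations are skipped. The data is invented.

```python
from collections import Counter
from itertools import combinations


def author_links(publications, citations):
    """Count connections between authors, keyed by (source, target).

    publications: {title: [author, ...]}
    citations: iterable of (citing author, cited title)
    """
    links = Counter()
    for authors in publications.values():
        # Assumption: coauthoring counts as a link in both directions.
        for a, b in combinations(sorted(set(authors)), 2):
            links[(a, b)] += 1
            links[(b, a)] += 1
    for citing_author, title in citations:
        # Citing a publication links the citing author to each of its authors.
        for cited_author in publications[title]:
            if cited_author != citing_author:  # assumption: ignore self-citation
                links[(citing_author, cited_author)] += 1
    return links


def inbound_counts(links):
    """Node size: total weight of links pointing at each author."""
    inbound = Counter()
    for (_, target), weight in links.items():
        inbound[target] += weight
    return inbound


# Invented toy data.
publications = {
    "Paper A": ["Author X"],
    "Paper B": ["Author Y", "Author Z"],
    "Paper C": ["Author Z"],
}
citations = [("Author Z", "Paper A"), ("Author X", "Paper C"), ("Author X", "Paper B")]

links = author_links(publications, citations)
inbound = inbound_counts(links)
print(inbound.most_common())
```

In this sketch the weight on each `(source, target)` pair plays the role of edge thickness and the inbound total plays the role of node size.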

Same image as the header, but with key authors circled in red

To make the graph legible I show only the nodes with the most links. As described in the first caption, the colours represent ‘communities’ discovered algorithmically by Gephi’s modularity function: groups of authors who cite and coauthor with one another a lot. These communities are likely quite dependent on the initial set of searches. Nonetheless, some of them make intuitive sense. The brown ‘Participatory Design community’, for example, is very well defined. The pink community are all authors I’ve been looking at on the topic of deliberative democracy.
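For readers unfamiliar with modularity, the sketch below computes the standard modularity score Q of a given partition on a toy undirected graph. It only scores a partition that is handed to it; Gephi's function goes further and searches for a high-scoring partition (using the Louvain method, as far as I know). The graph and community labels are invented.

```python
def modularity(edges, community_of):
    """Modularity Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once
    community_of: {node: community label}
    Q = sum over communities of (internal edges / m) - (total degree / 2m)^2
    """
    m = len(edges)
    internal = {}  # community -> edges with both ends inside it
    degree = {}    # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Two triangles joined by a single bridging edge.
toy_graph = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_group = {node: 1 for node in two_groups}

q_split = modularity(toy_graph, two_groups)
q_single = modularity(toy_graph, one_group)
print(q_split, q_single)
```

Splitting the toy graph into its two triangles scores higher than lumping every node together, which is the sense in which a community is a group that links mostly to itself.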

On the other hand, S Lukes would have made more sense to me in the green community with RA Dahl, whom his work references extensively. In fact, as the key indicates, I’m unable to say with any confidence what the lilac community represents; its existence may be a result of my selective sampling of the overall scholarly network. The lilac group’s centrality does indicate that they are commonly referenced by all the other groups, and Arrow and Rawls are two of the most highly referenced researchers in the network. It may be this property, rather than a subject specialism, that lilac nodes have in common.

There are a number of anomalous borderline cases such as this.

How does it help write a literature chapter?

I feel more comfortable with the landscape of my research with the network as an overview, and intend to add new references as they arise. Some key lessons stood out:

  • Y Guo, a researcher I’ve never come across (and likely would not have found through my normal approach), is doing similar research to my own and referencing the same eclectic mix of sources.
  • My previous belief that the participatory design community does not reference Lukes has turned out to be wrong. My previous view was based on what I thought was a substantial search, so this was a surprising finding. It’s hard to find citation links using a manual process when any participatory design researcher could have cited Lukes.
  • I find the participatory design community difficult to understand. This approach has helped me focus on Hillgren as highly relevant, an author I did not know before.
  • I’m surprised (and pleased) to see that Fishkin, who always seemed isolated to me, is actually well embedded among the authors I’m referencing.
  • I’m interested in philosopher John Searle’s Making the Social World as a part of my research. Even though the book was in the initial searches, and Searle’s work is famous and widely referenced, he did not appear as a well connected node in my network. Perhaps his research is not as relevant as I thought, or perhaps it’s a connection that needs to be made.
  • Bjorgvinsson is the largest node in the network — which means they have the most inbound links. However, their research is not of particular relevance to mine. ‘Eyeballing’ suggests Bjorgvinsson has many connections within participatory design, but little broader reach. This structure of connectivity still tells us something about the community, even if, as in this case, it might not be something I need to pursue.
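The ‘eyeballing’ of a node's reach could be replaced with a simple count: split an author's inbound links into those coming from their own community and those coming from elsewhere. This is a sketch under assumed data structures, with invented placeholder names and weights rather than figures from my network.

```python
def inbound_reach(links, community_of, author):
    """Split an author's inbound links into same-community and other-community totals."""
    inside = outside = 0
    for (source, target), weight in links.items():
        if target != author:
            continue
        if community_of[source] == community_of[author]:
            inside += weight
        else:
            outside += weight
    return inside, outside


# Invented toy links: (source, target) -> number of connections.
toy_links = {
    ("P1", "Hub"): 3,
    ("P2", "Hub"): 2,
    ("Q1", "Hub"): 1,
    ("Hub", "Q1"): 4,  # outbound from Hub, so not counted
}
toy_communities = {"Hub": "design", "P1": "design", "P2": "design", "Q1": "other"}

inside, outside = inbound_reach(toy_links, toy_communities, "Hub")
print(inside, outside)
```

A node with a large `inside` total and a small `outside` one is prominent within its community but has little broader reach.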

I think further analysis might reveal less referenced papers that cite interesting combinations of my key authors.
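One way that further analysis might look, assuming the data recorded which authors each citing paper references (Bibnet as described records citations per author, so this is an assumption). The area groupings are a loose reading of my four areas, and the citing papers are invented placeholders.

```python
def areas_covered(cited_authors, areas):
    """Return the names of areas with at least one author among those cited."""
    return {name for name, members in areas.items() if members & cited_authors}


def cross_area_papers(citing, areas, min_areas=2):
    """Rank citing papers by how many distinct areas their cited authors span."""
    scored = [(len(areas_covered(cited, areas)), title) for title, cited in citing.items()]
    return sorted([(n, title) for n, title in scored if n >= min_areas], reverse=True)


# Illustrative groupings of key authors by area.
areas = {
    "networks": {"Burt", "Granovetter"},
    "commons": {"Ostrom"},
    "participatory design": {"Hillgren", "Bjorgvinsson"},
    "ethics/politics": {"Lukes", "Sen", "Rawls"},
}

# Invented placeholder papers and the key authors each one cites.
citing = {
    "Toy paper 1": {"Ostrom", "Lukes", "Granovetter"},
    "Toy paper 2": {"Rawls", "Sen"},
    "Toy paper 3": {"Hillgren", "Lukes"},
}

ranked = cross_area_papers(citing, areas)
print(ranked)
```

Raising `min_areas` to 4 would answer the earlier question of whether any paper cites authors from all four areas.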

This approach seems especially relevant to my work because I’m positioning myself in relation to a number of disciplines. It also seems important in the wider humanities, where searching by keyword might fail to return relevant results from another discipline with a different vocabulary.

Finally, I should acknowledge two limitations: the citation network is not the only way to discover papers, and Google Scholar is itself an incomplete source of data.

Isn’t this just an advanced type of procrastination?

Don’t tell my supervisors.
