By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

Budapest Open Access Initiative (2002)

Text and data mining (TDM) is defined by the UK Intellectual Property Office as “the use of automated analytical techniques to analyse text and data for patterns, trends and other useful information”. In our penultimate post of International Open Access Week we consider how TDM will benefit from full open access and look at some of the initiatives, services and tools in this exciting area.

TDM and Copyright

One of the main impediments to TDM cited in a 2012 report, Value and benefits of text mining, was copyright restrictions. In June 2014, the U.K. Government introduced reforms to enable researchers “to make copies of any copyright material for the purpose of computational analysis…if they have ‘lawful access’ to the work” (section 29A of the Copyright, Designs and Patents Act 1988 (CDPA)). However, more than three years later, TDM is still far from straightforward, in large part due to restrictions associated with subscription content.

Needles in haystacks

Imagine that there is a cure for cancer already out there in the scientific literature. All you need to do to win that Nobel Prize is read and synthesise tens of thousands of research papers and datasets, to find the needles of insight in the scientific haystacks.

Even the most assiduous academic, or the best-funded team of researchers, can’t hope to excavate the mountains of information at their fingertips, which continue to accrete at an exponential rate. But a machine can: specifically, a universal Turing machine, the digital computer.

Except it can’t, because even in 2017, over a quarter of a century since the invention of the web, vast swathes of the scientific literature remain out of bounds, locked behind paywalls and controlled by corporations like Elsevier, Wiley-Blackwell, Springer, and Taylor & Francis.

The UK copyright exception explicitly states that “researchers will still have to buy subscriptions to access material”. Elsevier have developed an API to enable subscribers to access full-text content as XML for the purpose of TDM, and will even consider requests for access from non-subscribers on a “case by case basis”. Access, however, is still very much on their terms, as demonstrated by this post by Chris Hartgerink from November 2015 – Elsevier stopped me doing my research.

It’s a far cry from the Budapest Open Access Initiative of 2002.

One initiative that is working to leverage the broad corpus of open access content is the CORE aggregation service from the Open University.

CORE – aggregating the world’s open access research papers

CORE harvests open access content that meets the BOAI definition and works with a range of stakeholders to exploit a vast corpus of nearly 80 million open access articles. Metadata and enriched full-text content are made available both for human discovery, via a Google-style search box, and to machines via an API.
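To give a flavour of what machine access to an aggregator looks like, here is a minimal sketch of querying a search API of this kind from Python. The endpoint, parameters and authentication scheme below are illustrative assumptions rather than CORE’s documented interface; consult the CORE API documentation (and register for an API key) before relying on any of it.

```python
# Sketch of querying an open access aggregator's search API for TDM.
# Endpoint and parameter names are assumptions for illustration only.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.core.ac.uk/v3/search/works"  # assumed endpoint

def build_search_url(query: str, limit: int = 10) -> str:
    """Build a search URL for the assumed endpoint above."""
    return f"{BASE_URL}?{urlencode({'q': query, 'limit': limit})}"

def search(query: str, api_key: str, limit: int = 10) -> bytes:
    """Perform the request (a registered API key is assumed)."""
    req = Request(build_search_url(query, limit),
                  headers={"Authorization": f"Bearer {api_key}"})
    with urlopen(req) as resp:
        return resp.read()

print(build_search_url("colorectal cancer care"))
```

The point is less the specific calls than the shape of the workflow: a TDM pipeline treats the aggregated corpus as data behind a URL, not as pages to be read.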

Two examples of the potential for TDM are CORE’s recommender service for repositories and ‘semantometrics’. The first of these, the CORE Recommender, can be seen in action right now on just about any WRRO record, whereas semantometrics is more experimental.

CORE Recommender

Recommendation systems are de rigueur for web-based services, typically based on user behaviour tracked by cookies. Think Amazon.

The plugin from CORE, however, uses an algorithm to discover ‘semantic relatedness’ between articles by representing text documents as ‘vectors’.

While the mathematics is one thing*, the crucial point is that the similar documents offered for this paper about using text mining to analyse patients’ experiences of colorectal cancer care really are similar, based on a semantic analysis of millions of articles.

* for more information see the ‘vector space model’.


N.B. Admittedly the first result is the same paper from another repository, which should probably be filtered out. At least the semantic analysis works!
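The vector space model underlying this kind of relatedness can be illustrated in a few lines: each document becomes a vector of term counts, and similarity is the cosine of the angle between vectors. The sketch below uses raw term counts with no stemming, stop-word removal or IDF weighting, so it is a toy version of what a production recommender does, not CORE’s actual algorithm.

```python
# Toy vector space model: documents as bag-of-words vectors,
# similarity as the cosine of the angle between them.
import math
from collections import Counter

def vectorise(text: str) -> Counter:
    """Represent a document as a sparse term-count vector."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

doc1 = vectorise("text mining of patient experiences of cancer care")
doc2 = vectorise("mining patient narratives to study cancer care")
doc3 = vectorise("graph colouring algorithms for register allocation")

print(cosine_similarity(doc1, doc2))  # related papers score higher...
print(cosine_similarity(doc1, doc3))  # ...than unrelated ones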


Semantometrics is not easily summarised, and interested readers are referred to the full report. Essentially, it argues that computer analysis of an article’s semantic content, and comparison with the broader research corpus, can provide insight into the quality of research practices, whereas traditional bibliometrics, and indeed alternative metrics or ‘altmetrics’, are quantitative and provide only a proxy for quality.

Might an evolved metric based on this technology provide a viable and scalable alternative to peer review?

As part of the EU-funded OpenMinTeD project, the CORE team led a workshop at the Open Repositories conference (OR2016) in Dublin last year covering the technical requirements that can enable the text mining of repositories – see the OpenMinTeD blog for discussion of the workshop, including presentation slides.

The UK Scholarly Communication Licence (UK-SCL)

There’s an irony in that established scholarly business models provide publishers with vast quantities of data they can mine to inform and develop yet more products and services to sell back to the academy. Full open access to the literature and underlying data with appropriate Creative Commons licensing will enable us to develop more effective tools and services of our own without being beholden to the commercial gatekeepers.

The UK Scholarly Communication Licence is an open access policy mechanism which ensures researchers can retain re-use rights in their own work and is a response to both the ongoing transition to open access and concerns around growing requirements for researchers to assign copyright to a publisher at the point of acceptance. It provides a standard set of licence terms (CC-BY-NC) which permits text and data mining, and re-use of all or parts of the work by the academic in ways other than as part of the original publication.

For more information about the UK-SCL, see the website.

Other tools for TDM

There are an increasing number of tools available for TDM, many free to use:

  • VOSviewer, developed at the University of Leiden, is a powerful tool to analyse text and data. It utilises natural language processing techniques to create term co-occurrence networks based on textual data and features advanced layout and clustering techniques. It can also be used to visualise different types of bibliographic network, for example co-authorship or citation networks. See YouTube for an excellent video tutorial.
  • Voyant Tools is an open-source, web-based application for performing text analysis. It supports scholarly reading and interpretation of texts or corpora, particularly by scholars in the digital humanities, but also by students and the general public. It can be used to analyze online texts or ones uploaded by users [Wikipedia]
  • Medline Ranker is dedicated to scientists interested in ranking the biomedical literature according to a selected topic. The query page allows searching for any biomedical topic, and the web server is fast enough to process thousands of scientific abstracts from the PubMed database in a few seconds [David Rothman]
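The term co-occurrence networks that tools like VOSviewer build rest on a simple idea: terms that appear in the same document are linked, with edge weight equal to the number of documents in which they co-occur. The sketch below is a deliberately naive version, treating the whole document as the co-occurrence window and splitting on whitespace, where real tools first extract noun phrases with NLP.

```python
# Toy term co-occurrence network: edge weight = number of documents
# in which a pair of terms appears together.
from collections import Counter
from itertools import combinations

def cooccurrence_network(documents):
    """Count pairwise term co-occurrences across a corpus."""
    edges = Counter()
    for doc in documents:
        terms = sorted(set(doc.lower().split()))  # unique terms per doc
        for pair in combinations(terms, 2):
            edges[pair] += 1
    return edges

corpus = [
    "text mining cancer literature",
    "mining cancer genomics data",
    "open access literature mining",
]

network = cooccurrence_network(corpus)
print(network[("cancer", "mining")])  # co-occur in two documents
```

Layout and clustering algorithms then position strongly linked terms close together, which is what produces the familiar VOSviewer maps.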

An actual cure for cancer buried in the literature may well be hyperbole, yet the principle stands: a computer has the capacity to identify connections and patterns at a volume and speed that a human reader cannot hope to match, and full open access is crucial to getting the most from this technology and from the scientific literature.