Open in order to…discover buried connections: Text and Data Mining

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

Budapest Open Access Initiative (2002)

Text and data mining (TDM) is defined by the UK Intellectual Property Office as “the use of automated analytical techniques to analyse text and data for patterns, trends and other useful information”. In our penultimate post of International Open Access Week we consider how TDM will benefit from full open access and look at some of the initiatives, services and tools in this exciting area.

TDM and Copyright

One of the main impediments to TDM cited in a 2012 report, Value and benefits of text mining, was copyright restrictions. In June 2014, the UK Government introduced reforms to enable “researchers to make copies of any copyright material for the purpose of computational analysis…if they have ‘lawful access’ to the work” (section 29A of the Copyright, Designs and Patents Act 1988 (CDPA)). However, well over 3 years later, TDM is still far from straightforward, in large part due to restrictions associated with subscription content.

Needles in haystacks

Imagine that there is a cure for cancer already out there in the scientific literature. All you need to do to win that Nobel Prize is read and synthesise tens of thousands of research papers and datasets, to find the needles of insight in the scientific haystacks.

Even the most assiduous academic, or the most well-funded team of researchers, can’t hope to excavate the mountains of information that sit at their fingertips and continue to accrete at an exponential rate. But a machine can: specifically a universal Turing machine, the digital computer.

Except it can’t because, still in 2017, over a quarter of a century since the invention of the web, vast swathes of the scientific literature are out of bounds, locked behind paywalls and controlled by corporations like Elsevier, Wiley-Blackwell, Springer, and Taylor & Francis.

The UK copyright exception explicitly states that “researchers will still have to buy subscriptions to access material” and while Elsevier have developed an API to enable those with a subscription to access full text content as XML for the purpose of TDM, and will even consider requests for access from non-subscribers on a “case by case basis”, access is still very much on their terms, as demonstrated by this post by Chris Hartgerink from November 2015 – Elsevier stopped me doing my research.

It’s a far cry from the Budapest Open Access Initiative of 2002.

One initiative that is working to leverage the broad corpus of open access content is the CORE aggregation service from the Open University.

CORE – aggregating the world’s open access research papers

CORE harvests open access content that meets the BOAI definition and works with a range of stakeholders to exploit a vast corpus of nearly 80 million open access articles. Metadata and enriched full-text content are made available both for human discovery, with a Google-style search box, and via an API.

Two examples of the potential for TDM are their recommender service for repositories and ‘semantometrics’ – the first of these, the CORE Recommender, can be seen in action right now on just about any WRRO record whereas semantometrics is more experimental.

CORE Recommender

Recommendation systems are de rigueur for web-based services, typically based on user behaviour tracked by cookies. Think Amazon.

The plugin from CORE, however, uses an algorithm to discover ‘semantic relatedness’ between articles by representing text documents as ‘vectors’.

While the mathematics is one thing*, the crucial point is that the similar documents offered to you for this paper about using text-mining to analyse patients’ experiences of colorectal cancer care really are similar, based on a semantic analysis of millions of articles.

* for more information see the ‘vector space model’
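
As a rough illustration of the vector space idea, the sketch below turns three short texts into TF-IDF weighted term vectors and compares them by cosine similarity. It is a minimal sketch only and not the CORE Recommender's actual algorithm: the function names, the smoothed IDF weighting and the sample texts are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed inverse document frequency (an assumed weighting choice).
        vectors.append(
            {t: n * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, n in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, invented for illustration.
docs = [
    "Text mining of patient experiences of colorectal cancer care",
    "Mining free-text comments from cancer patients about their care",
    "Glacier mass balance in the Himalaya",
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(sim_related > sim_unrelated)
```

The two cancer care texts share several terms and so score higher than the unrelated pair, which shares none. A production recommender would work on full text at far larger scale and would probably use richer representations than raw word counts.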


N.B. Admittedly the first result is the same paper from another repository which should probably be filtered out. At least semantic analysis works!


Semantometrics is not easily summarised and interested readers are referred to the full report. Essentially what it says is that computer analysis of an article’s semantic content and comparison with the broader research corpus can provide insight into the quality of research practices, whereas traditional bibliometrics, or indeed alternative metrics (‘altmetrics’), are quantitative and provide only a proxy for quality.

Might an evolved metric based on this technology provide a viable and scalable alternative to peer review?

As part of the EU-funded OpenMinTeD project, the CORE team led a workshop at the Open Repositories conference (OR2016) in Dublin last year covering the technical requirements that can enable the text mining of repositories – see the OpenMinTeD blog for discussion of the workshop including presentation slides.

The UK Scholarly Communication Licence (UK-SCL)

There’s an irony in that established scholarly business models provide publishers with vast quantities of data they can mine to inform and develop yet more products and services to sell back to the academy. Full open access to the literature and underlying data with appropriate Creative Commons licensing will enable us to develop more effective tools and services of our own without being beholden to the commercial gatekeepers.

The UK Scholarly Communication Licence is an open access policy mechanism which ensures researchers can retain re-use rights in their own work and is a response to both the ongoing transition to open access and concerns around growing requirements for researchers to assign copyright to a publisher at the point of acceptance. It provides a standard set of licence terms (CC-BY-NC) which permits text and data mining, and re-use of all or parts of the work by the academic in ways other than as part of the original publication.

For more information about UK-SCL see the website –

Other tools for TDM

There are an increasing number of tools available for TDM, many free to use:

  • VOSviewer, developed at the University of Leiden, is a powerful tool to analyse text and data. It utilises natural language processing techniques to create term co-occurrence networks based on textual data and features advanced layout and clustering techniques. It can also be used, for example, to visualise different types of bibliographic network. See YouTube for an excellent video tutorial.
  • Voyant Tools is an open-source, web-based application for performing text analysis. It supports scholarly reading and interpretation of texts or corpus, particularly by scholars in the digital humanities, but also by students and the general public. It can be used to analyze online texts or ones uploaded by users [Wikipedia]
  • Medline Ranker is dedicated to scientists interested to rank the biomedical literature according to a selected topic. The query page allows to search for any biomedical topic. The web server is fast enough to process thousands of scientific abstracts from the PubMed database in few seconds [David Rothman]
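
To give a flavour of what a term co-occurrence network is built from, the sketch below counts how often pairs of words appear together in the same short text. It is a simplified word-level illustration and not how VOSviewer itself works: the stopword list, the function name and the sample titles are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

# A tiny, assumed stopword list; real tools use much fuller ones.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def cooccurrence_counts(documents):
    """Count, for each pair of terms, how many documents contain both."""
    counts = Counter()
    for doc in documents:
        terms = {w for w in doc.lower().split() if w not in STOPWORDS}
        # Sorting gives each pair a single canonical order, e.g. ("access", "open").
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return counts


# Hypothetical titles, invented for illustration.
titles = [
    "open access and text mining",
    "text mining for open research",
    "open access repositories",
]
edges = cooccurrence_counts(titles)
for pair, weight in edges.most_common(3):
    print(pair, weight)
```

Each pair is an edge in the network and its count is the edge weight. Layout and clustering, which VOSviewer adds on top, are what turn these raw counts into a readable map.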

An actual cure for cancer buried in the literature may well be hyperbole, yet the principle stands: a computer can identify connections and patterns at a volume and speed that a human reader cannot hope to match, and full open access is crucial to getting the most from this technology and from the scientific literature.


Open in order to…increase your research impact: a bibliometric analysis of White Rose Research Online

It’s International Open Access Week!

In the first in our series of daily blog posts to celebrate open access (OA), Repository Assistant Simon Cobb gives an overview of repository statistics.

White Rose Research Online

The most visible manifestation of open access at the University of Leeds, White Rose Research Online (WRRO) is a shared repository for research outputs that aims to make available a full-text version of each work deposited by staff across the White Rose Consortium (Universities of Leeds, Sheffield and York).

University of Leeds WRRO deposits and downloads 2012-2017

University of Leeds research outputs have been downloaded more than 3 million times since 2012. White Rose Research Online statistics are available at:

Some 22,000 items have been deposited in WRRO by Leeds authors and over 15,000 are already open access, thus free to read without a subscription. Full-text of a further 2,600 papers will be available upon expiry of an embargo period stipulated by the publisher.

One of the benefits of open access is the potential to disseminate research much more widely than is possible when papers are behind a paywall. University of Leeds research in WRRO is visible to a global audience and has been accessed from 227 identifiable territories. Whilst six countries (UK, USA, France, Germany, China and India) account for a significant proportion (58%) of the 3.4 million downloads, there were also 78,500 downloads from Africa, where the lack of access to subscription journal packages is still a major issue for researchers.

Map showing the location of downloads for University of Leeds research outputs in WRRO

In the fifteen years since it was first defined in the Budapest Open Access Initiative, OA has experienced strong growth. A good indicative example is the Bielefeld Academic Search Engine (BASE), where the numbers of indexed sources and documents increased from 1,666 and 24.5 million on 31 May 2010 to 5,542 and 111.4 million respectively on 31 May 2017. Likewise, the number of titles in the Directory of Open Access Journals (DOAJ) increased from 300 to 10,229 between 2003 and 2017. Nature reported that immediate (gold) OA represented a 17% share of the journal articles published worldwide in 2014. Growth is likely to continue to be driven by research funder and institution policies that mandate OA.

Nevertheless, academic publishing is dominated by an oligopoly of publishers, with the five most prolific accounting for over 50% of papers published in 2013. It is a lucrative business (Elsevier’s parent company, RELX, reported a £1.15 billion profit in the first half of 2017) and publishers are unwilling to embrace practices that threaten their revenue.

Preliminary analysis of WRRO has indicated that half of the articles published by University of Leeds authors appeared in journals owned by four publishers: Elsevier (dominant in STEM), Springer Nature, Wiley and Taylor & Francis (particularly strong in Arts, Humanities and Social Sciences). A larger group of 25 publishers is responsible for 80% of the articles published. In this group, we find four OA publishers: Public Library of Science (PLOS), Copernicus Publications, MDPI and Frontiers (BioMed Central would feature but is counted with its parent company Springer Nature). PLOS has published the most articles of the four and achieved a 2% share of the total.

Publishers of journal articles in White Rose Research Online (% of University of Leeds total)


Some of the major publishers have adopted OA policies that embrace the principles of unrestricted access to knowledge as a public good and allow researchers to freely share articles with their peers. SAGE, for example, permit the author accepted manuscript (AAM) version, which is an author-produced file that is textually representative of the published version, to be made OA on acceptance via the author’s institutional repository. Cambridge University Press also permit the AAM to be deposited on acceptance for publication in many of their titles. Emerald recently amended their OA policy to allow the AAM to be made available immediately on publication. Unrestrictive policies like these give momentum to the OA movement and steer us toward a sustainable model of academic publishing.

There is, however, a suspicion that the transition to OA could be hijacked as the major subscription publishers target grants earmarked for funding gold OA. Business practices have emerged to access this stable revenue stream, including the marketing of OA options in hybrid journals, converting individual journals to OA and negotiating national level licensing agreements that bundle big deal subscription packages with payments for publishing OA articles in hybrid journals. If we let this happen, the pressure on library budgets will continue and publisher profits will be ensured.

Come back tomorrow

In tomorrow’s post, Kate Petherbridge talks about White Rose University Press (WRUP), one of a new wave of University and Academic-led presses founded to challenge the traditional publishing model.


So you’ve got an ORCiD…what next?

Knock knock

Who’s there?

Doctor


Doctor Who?

OK, so it’s a lousy joke if you’re more than 8 years old but, with 6 previous Doctor Who actors (and an Italian Peter Capaldi) apparently listed in the ORCiD database, it nicely illustrates the most commonly cited benefit of name disambiguation:

Tom Baker –
Peter Davison –
Colin Baker –
Christopher Eccleston –
David Tennant –
Matt Smith –
Pietro Capaldi –

The value of which is a moot point for your average Time Lord.

(On the other hand, Equity do emphasise the importance of a unique professional name but academics are perhaps less inclined to change theirs than their grease-painted fellows.)

Disambiguation is only part of the story however – visit these profiles and only the Ninth Doctor, now a Professor at the University of Bath, has any information listed. All of the others simply show No public information available, which means either they are set to private or, more likely, they have been registered but never used.

Academics are increasingly badgered by their institution, and by funding bodies and journal editors, to register an ORCiD, but it might not always be clear quite what it’s for or how it can streamline your workflows. The No public information available issue is far from limited to the namesakes of Dr Who actors.

At Leeds the University publications policy encourages researchers to register for an ORCiD and to link from their Symplectic profile – for Leeds staff who haven’t already done so, click the button below which will take you to ORCiD via your Symplectic account (log-in required):

This will provide an additional method for the system to reliably identify your published work and add it to your Symplectic profile. Your ORCiD will also be passed over to White Rose Research Online (WRRO) when you deposit a manuscript:

Repository record displaying linked ORCiD profiles

Propagating your ORCiD in this way means that you and your work become easier to find by interested colleagues and potential collaborators, and by search engines, but only if you engage with your account so that it is accurate and up to date.

So how do you do that without keying everything in manually?

Link your Scopus profile

Scopus is an abstract and citation database and a major data source for Symplectic. Your peer-reviewed work will be indexed in the database and you will have an author details page (just search via the ‘Author’ tab). On the right hand side there is the option to ‘Add to ORCiD’. You can also achieve the same result from ORCiD itself, see here for more information –

Trusted Organizations

Either of the methods above will have added Scopus as a ‘Trusted Organization’ – visible under the ‘ACCOUNT SETTINGS’ tab on your ORCiD profile:

Scopus connection information

Search & link wizard (Trusted Organizations)

Open the search & link wizard from ‘My ORCID RECORD’ -> ‘Works’ to see a list of databases:

Search & link wizard

These include a number of more or less specialised databases:

  • CrossRef, for example, is the organisation that mints DOIs for journal articles and maintains authoritative, publisher-supplied metadata
  • DataCite provides DOIs for research data; in the UK its DOI service is provided through the British Library. DOIs for datasets in the Research Data Leeds repository are allocated, minted and maintained by DataCite

For more information including an instructional video see

Import from Symplectic

Symplectic cannot automatically push your records to your ORCiD account – there are good reasons for this that I won’t dwell on here – but you can import a BibTeX file for any records that might not be available from another database: conference papers or reports, for example, that lack a DOI.

My publications -> Export -> Export to BibTex

Exporting a bibtex file from Symplectic

The resulting .bib file can easily be imported to your ORCiD profile:

Importing a bibtex file to ORCiD
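
If you have never opened a .bib file, it is plain text with one entry per work. The sketch below builds a single entry in Python purely to show its shape. The helper function, the citation key and every field value are hypothetical, and the exact fields in a real Symplectic export may differ.

```python
def bibtex_entry(entry_type, key, fields):
    """Format one BibTeX entry from a citation key and a dict of fields."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# A hypothetical conference paper with no DOI, invented for illustration.
entry = bibtex_entry(
    "inproceedings",
    "example2017",
    {
        "author": "Example, A. N.",
        "title": "A hypothetical conference paper without a DOI",
        "booktitle": "Proceedings of an Example Conference",
        "year": "2017",
    },
)
print(entry)
```

A file exported from Symplectic will simply contain many entries like this one after another, so there is normally no need to write them by hand.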

For more information see

Privacy settings

ORCiD privacy settings

The final step, other than adding your biography, education etc. (unfortunately you will have to do this manually), is to ensure that visitors to your profile can actually see the information.

Persistent unique identifier

To paraphrase Kathleen Shearer, you are now a node in a global knowledge commons that, along with the DOIs for your publications, will persist long into the future.

Unlike Doctor Who, though, you’ll have to wait for the years to roll by one day at a time…

Open Research Leeds

Since it was set up in January 2012, mandated by Jisc as part of the Roadmap project, the Research Data Leeds @ResDataLeeds Twitter account has been somewhat underused with a grand total of 7 tweets between 2012 and 2015.

Latterly, however, we have been utilising the account a lot more, focusing on building a network, disseminating datasets and highlighting broader issues around RDM and scholarly communication. We are therefore rebranding the account as Open Research Leeds @OpenResLeeds and will explicitly disseminate open access research papers from WRRO and associated datasets as primary research outputs. Please come and join our network!