REF2021: towards Open Research

With the funding bodies’ Initial Decisions on the Research Excellence Framework 2021, published at the beginning of September and including a paragraph on ‘open research’, we consider what this might mean as the REF takes shape.

29. The revised template will also include a section on ‘open research’, detailing the submitting unit’s open access strategy, including where this goes above and beyond the REF open access policy requirements, and wider activity to encourage the effective sharing and management of research data. The panels will set out further guidance on this in the panel criteria. 

Initial decisions on the Research Excellence Framework 2021 (pg 9)

While the UK is still some way from full Open Access, we are getting closer, largely thanks to HEFCE’s “Policy for open access in the post-2014 Research Excellence Framework”, which came into effect in 2016 – on April Fools’ Day, in fact. Nevertheless it has been taken very seriously. REF is no laughing matter!

The REF has sometimes been maligned as an expensive bureaucratic exercise ill-fitted for purpose, yet the goal of promoting the value and impact of publicly funded research is surely worthwhile and as advocates for all things ‘open’, it at least provides a stick on which to dangle our carrots.

Pending the further guidance promised, can we anticipate some of the activity and initiatives that might contribute to ‘open research’ above and beyond the REF open access policy requirements?

N.B. See the updated HEFCE FAQ, specifically:

7.1. What aspects of OA should submitting units include in the environment statement section titled ‘open research’?

Research Data

It is good to see this referred to explicitly at this early stage, following on from the Concordat on Open Research Data published in July 2016 focused on ensuring that research data is made openly available wherever possible.

In fact research data was already an eligible output for REF in 2014, and the 2021 exercise will continue to assess “all types of research and forms of research output”. Nevertheless, infrastructure and best practice around research data management (RDM) are still developing. At Leeds the Research Data Leeds (RDL) team, based in the Library, provides support and advice throughout the research lifecycle. We run an institutional data repository providing long-term, secure storage and associating each dataset with a Digital Object Identifier (DOI), a persistent identifier that facilitates formal citation. Alternatively, use the Registry of Research Data Repositories (re3data) to identify a suitable discipline-specific repository.
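As a small illustration of why DOIs support reliable citation: any DOI can be turned into a stable link via the doi.org resolver, so the citation survives repository moves and redesigns. A sketch (the example DOI shown is purely illustrative, not a real dataset):

```python
# Sketch: a DOI has the form "10.<registrant>/<suffix>"; prefixing it
# with the doi.org resolver yields a persistent citation link.
import re

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def doi_url(doi: str) -> str:
    """Return the resolver URL for a syntactically valid DOI."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a valid DOI: {doi!r}")
    return f"https://doi.org/{doi}"

print(doi_url("10.5518/1"))  # https://doi.org/10.5518/1
```

The resolver then redirects to wherever the dataset currently lives, which is exactly the indirection that makes formal data citation practical.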

Other useful organisations include Jisc and the Digital Curation Centre.

Potential questions for REF2021:

  • Is the data underpinning your submitted outputs safely stored according to best practice?
  • Is that data openly available (if appropriate) or is it clear how it can be accessed (i.e. does the paper include a suitable data statement)?
  • Has your data been reused by other researchers / initiated collaboration?
  • Do you have established protocols for data management planning that are followed for all research projects?


ORCID

ORCID is an open, non-profit, community-based initiative that provides a unique identifier to reliably differentiate individual authors and enable connections between systems. Linking your ORCID to Symplectic, for example, will provide an additional method for the system to reliably identify your published work and add it to your Symplectic profile; your ORCID will also be passed over to the White Rose Research Repository (WRRO) when you deposit a manuscript.

ORCID increasingly underpins open scholarly infrastructure, both nationally and internationally, and is also supported by Jisc.
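Part of what makes an ORCID iD machine-friendly is that it carries a built-in integrity check: the final character is a checksum calculated with the ISO 7064 MOD 11-2 algorithm, so systems can catch mistyped iDs before looking them up. A sketch of a validator, using ORCID’s own documented example iD:

```python
def orcid_checksum_ok(orcid: str) -> bool:
    """Validate the final check digit of an ORCID iD (ISO 7064 MOD 11-2)."""
    digits = orcid.replace("-", "")
    if len(digits) != 16 or not digits[:-1].isdigit():
        return False
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[-1] == expected

# ORCID's documented example iD validates:
print(orcid_checksum_ok("0000-0002-1825-0097"))  # True
```

A system integrating ORCID (Symplectic, a repository ingest form) can run a check like this on entry, rather than discovering a bad identifier only when a later lookup fails.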

Related post: So you’ve got an ORCiD…what next?

Potential questions for REF2021:

  • Do all of your submitted authors have an ORCID?
  • Are they using their ORCID profile effectively?
  • Are you actively using ORCID to integrate systems and improve workflows?


Collaboration beyond higher education

Another area discussed in the document is collaboration: it identifies “an explicit focus on the submitting unit’s approach to supporting collaboration with organisations beyond higher education” (pg 6, para 18).

The benefits of open research for collaboration opportunities with such organisations are obvious, whether the NHS or SMEs, who may not otherwise be able to find or access the research and data they need to further their own mission. Perhaps there is also a question here of targeted dissemination, via social media for example – making research available online doesn’t mean the right people will simply stumble across it.

Potential questions for REF2021:

  • Have you adopted open research practices that are conducive to collaboration?
  • To what extent have these been successful?
  • Are you proactively building and monitoring a network around your research (e.g. by leveraging alternative metrics)?


Impact

The document acknowledges that work is required to align the definitions of ‘academic impact’ and ‘wider impact’, which relate respectively to the assessment of outputs and to the impact element of the REF. Notably, the weighting for impact has increased from 20% to 25% – as was in fact originally proposed for the 2014 exercise.

There will be additional guidance on the criteria for both ‘reach and significance’ and impact arising from public engagement – it is not hard to anticipate how an open research agenda will feed into each of these. There is evidence that OA increases traditional citations, for example, while developments in alternative metrics, or “altmetrics”, are enabling online social activity around research to be recorded and measured.

Repository downloads also provide a valuable article-level metric; indeed we might expect correlation with traditional citations, perhaps even causation. The IRUS-UK* service provides COUNTER-compliant download statistics for the majority of UK-based repositories, which means that downloads are standardised, filtering out automated downloads by search-engine robots, for example.

* With 3,766,192 downloads since October 2013, and as might be expected for a consortium of 3 research intensive Universities, IRUS-UK reveals that the White Rose Research Repository is one of the most highly downloaded in the UK. Leeds accounts for 1,773,744 of those downloads.
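COUNTER compliance means, among other things, excluding requests from known robots and crawlers before counting downloads. A toy sketch of that idea – the log format and robot list here are invented for illustration, not IRUS-UK’s actual processing:

```python
# Hypothetical, simplified robot filtering in the spirit of COUNTER:
# drop any download whose user-agent matches a known robot signature.
ROBOT_SIGNATURES = ("googlebot", "bingbot", "crawler", "spider")

def count_genuine_downloads(log_lines):
    """Count downloads whose user-agent does not match a known robot."""
    genuine = 0
    for line in log_lines:
        # assume a tab-separated log: date, item, user-agent
        user_agent = line.rsplit("\t", 1)[-1].lower()
        if not any(sig in user_agent for sig in ROBOT_SIGNATURES):
            genuine += 1
    return genuine

log = [
    "2017-09-01\t/eprint/1234\tMozilla/5.0 (Windows NT 10.0)",
    "2017-09-01\t/eprint/1234\tGooglebot/2.1",
    "2017-09-02\t/eprint/5678\tMozilla/5.0 (Macintosh)",
]
print(count_genuine_downloads(log))  # 2
```

Raw repository hit counts can be inflated several-fold by crawlers, which is why standardised, filtered figures are the ones worth comparing across institutions.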

Potential questions for REF2021:

  • To what extent are you engaging with audiences beyond academia?
  • Do you produce plain language precis of your research?
  • Are you exploiting social media to engage with academic and lay audiences (e.g. Twitter, blogs, Wikipedia)?
  • Are you analysing quantitative data from these sources?

Related post: Wikipedia, information literacy and open access

The Research Support team based on Level 13 of the Edward Boyle Library will continue to review REF guidelines as they are released and associated developments across the sector. You can get in touch by email or on Twitter.

In the meantime, you must ensure your research outputs meet the new REF open access requirements by depositing your author accepted manuscript via Symplectic as soon as possible after acceptance.




Repository Fringe, 2017 – Beyond Borders

Posted by Rachel Proudfoot – programme – presentations from the event

Repository Fringe is an annual event in Edinburgh where anyone interested in repositories and research outputs can share experience, expertise and learn about developments in the repository field. 2017 marks the 10th Repo Fringe and this year was, in part, a celebration of how we have shared content ‘beyond borders’ over the last decade. The Research Data Leeds team explored the theme in our ‘Galactic Interfaces’ poster about working with arts and humanities researchers and data. (The poster is currently on display in the Research Hub on Level 13 of the Edward Boyle Library).

The conference offered a mix of keynote talks, short presentations, ‘birds of a feather’ sessions, posters and of course lots of informal networking over tea and biscuits.

Repositories: problem or solution?

Shortly before the conference, Elsevier announced it had acquired bepress. Concern about the amount of control large commercial publishers have over research dissemination was a recurrent theme of the conference. This is nothing new, but we are seeing large publishers increasingly pushing into the ‘open access’ arena. Keynote speaker Kathleen Shearer, Executive Director of COAR, suggested that, financially, universities are as much over a barrel now with article processing charges for ‘gold’ open access articles as we were (still are) with journal ‘big deals’ and hikes in journal subscription costs.

Shearer challenged the conference: are repositories helping to perpetuate a highly flawed scholarly communications system? Shearer is part of the Next Generation Repositories Working Group, which will be publishing a set of recommendations in September 2017. She suggested we need to rethink repository design so we have ‘repositories of the web, not just on the web’. This may involve supporting peer review (another speaker pointed out Elsevier’s controversial patent on an online peer review system), improving the discovery of research, making sure metadata is machine readable and taking a stronger lead on digital preservation. We should also develop a shared, international vision and common ways of working which reduce the risk of academic research being disproportionately shaped, controlled and charged for by commercial interests. For Shearer, we need a more coherent alternative – and we’re certainly not there yet.

Active promotion of content

A few presentations suggested ways that repositories can promote content in addition to curating it. Gavin Willshaw from the University of Edinburgh gave a great example of promotion as part of a project to digitise 17,000 PhD theses. Edinburgh have highlighted theses from notable alumni, such as Gordon Brown, Arthur Conan Doyle and Helen Pankhurst, have linked PhD theses to author pages in Wikipedia, and have uploaded older theses to Wikisource, Wikimedia’s online library of out-of-copyright works.

Other discussion looked at what role, if any, a repository can have for impact case studies and research impact more generally. Could the repository promote research and/or capture more usage and impact data? Is there a role for repositories to host lay summaries of research to make research more accessible to a non-specialist audience – be they the ‘general public’ or researchers from other academic disciplines?

Easier, embedded metadata creation

Well, we can dream! One of the keynote speakers, Andrew Millar, outlined a vision of specialist tools designed to support an academic ‘community of practice’, making it easier to capture metadata and contextual information as a routine part of research practice. Millar is a systems biologist and cited Fairdom, a widely used tool which helps to capture metadata in a standard experimental workflow.

Such domain-specific tools could link painlessly to shared repositories if we adopted common standards of data exchange. Tools discussed in this context were:

  • – packages documents, code and data into a zip file with a manifest. Designed to be flexible across different subject areas.
  • – a way of packaging documents, models and data together using the Open Modelling EXchange format (OMEX).
  • BagIt – uses a file naming convention for structuring digital content

A presentation by Rory Macneil and Megan Hardeman demonstrated an end-to-end workflow: capturing information via an electronic lab notebook in the RSpace digital research platform and depositing directly into the Figshare repository via an easy-to-use embedded tool.

Hopefully repository uptake will increase – and we’ll get more enthusiastic engagement from researchers – if we can get closer to their everyday workflows and provide relatively pain free deposit options.


Anthea Wallace’s absolutely excellent presentation showed examples of how public domain works – either deliberately or unintentionally – have had restrictions imposed on their reuse. As Wallace put it, closed licences won’t stop bad people from doing bad stuff with your data, but may well stop good people doing good stuff. Wallace promoted the Copyright Cortex as a helpful resource for researchers in digital humanities. I partly mention this presentation as an excuse to use one of Wallace’s examples: the transcription of music from a human bottom in Hieronymus Bosch’s Garden of Earthly Delights, which you can see below. You can also listen to an adaptation of the ‘Butt Music’ on YouTube.

Hieronymus Bosch’s Garden of Earthly Delights

More from the Fringe

Incentivising open practices – Digital Curation Centre (Authored by Sarah Jones with corrections from Dr Paul Ayris)

Repository Fringe 2017

We are looking forward to the Repository Fringe next week, now in its 10th year, and coinciding as always with the far less entertaining Edinburgh Fringe. We will be presenting a poster (to follow, see below for a taster), perhaps telling a few jokes, and sharing expertise and experience with our fellow repository professionals.

The full programme is available at and you can follow on Twitter @repofringe | #rfringe17

Galactic Interfaces: navigating the creative data universe

The poster takes its title from a piece of music in one of the datasets in the Research Data Leeds repository. ‘Galactic Interfaces’ is a semi-improvised piece about interactions and contrasts; rather like developing a research data service. The poster will use the galactic theme to show how working with arts and humanities researchers has launched us from planet ‘EPSRC data compliance’ to boldly go where the research data service has not gone before. We use ‘Galactic Interfaces’ in research data training sessions to encourage researchers to step outside their own world and think creatively about their data and metadata. Our galactic journey has taken us into the Special Collections galaxy where we have been working on developing a common language so we can understand each other. We have a landing party visiting the digital humanities nebula and we’re launching a rescue mission for project web sites currently being drawn into a giant black hole.

Black hole

Much valuable work has been done with creative data already in other repository services (VADS, UAL etc.); for a repository in a multi-disciplinary institution like the University of Leeds, working with creative data has shifted thinking about our research data service and where its long term value may lie. It has prompted consideration of the variety of data contributors: who should be acknowledged for their creative input, and how? How do we license data with third party content? How do we capture and package data from practice-based creative disciplines? Do we have a role in bringing together data and researchers from different spheres – virtually, but also in physical space for discussion and exploration? Borders are being crossed, redrawn and broken down and we are re-plotting our star charts! (We will also reach beyond the borders of the poster by making it interactive.)

Research Data Network – University of York – June 2017

If Jisc’s 4th Research Data Network earlier this week felt a bit rushed at times, it only reflects the sheer number of exciting projects happening across the sector.

There’s still a long way to go but it felt like the dots are really starting to join up and there was lots of energy in both real and virtual space – see Storify of tweets on the #JiscRDM tag during the event.

Delegates busy networking on Tuesday evening at RDN York (thanks to Paul Stokes for the photo, used with permission)

Two packed days in York were bookended by an inspiring opening keynote from Mark Humphries asking “Who will use the Open Data?” and by a panel session the following afternoon on the principles and practice of open research, informed by the open research pilot project at the University of Cambridge.

Mark emphasised that there is a clearer rationale in some academic contexts than in others. Clinical trials, for example, are time-consuming and expensive, and need to be safe and effective, which provides a clear motivation to share data and check conclusions.

Mark singled out his own discipline of neuroscience, however, as lagging behind, with no discipline-specific open data repositories and an inclination to “data worship”. New data is hard to get, requires considerable skill (to implant electrodes in a rat’s cortex, for instance) and will underpin high-impact papers, that universal currency of academia. It’s not for sharing!

Mark reassured us, nevertheless, that open data is the future. Inevitably. If only due to the sheer scale of data being generated, which simply has to be shared if it is to be analysed effectively; he cited an instance whereby a single dataset generated nine high-quality papers from several labs. RDM isn’t trivial, though, which is one of the main reasons that funding bodies are mandating data sharing.

Some 28 hours later, we were back in the same lecture theatre for the final session, chaired by Marta Teperek. Our four panelists fielding questions from the floor were David Carr (Wellcome Trust), Tim Fulton and Lauren Cadwallader (both University of Cambridge), and Jennifer Harris (Birkbeck, University of London).

There was a great deal of emphasis on the cost of open research and sustainability – by way of answer to the question above, Lauren Cadwallader referred to her recent blog post Open Resources: Who Should Pay? and shared her reservations about the ‘gold’ model of open access that is sustained by expensive Article Processing Charges to commercial publishers.

There are similarities and synergies between OA and open data initiatives, including increasing interest from publishers. There are also significant differences and it was pointed out from the floor that long term preservation is a cost that needs to be borne by someone.

Betwixt these bookends were far too many sessions to discuss in detail, covering everything from the European Open Science Cloud (EOSC) to an update on the work HESA is doing in relation to research data in the context of REF2021, Archivematica for preservation and some fantastic resources for business case development and costing for RDM (including a number of useful case studies). Then there’s the Research Data Alliance which *anyone* is able to join and which offers a window onto many different communities.

It was particularly interesting to learn about ongoing developments with Jisc’s shared service, which is working with 13 pilot institutions on repository and preservation solutions and comprises a range of tools to capture, preserve, disseminate and report. The pilot offer also includes training, support and the gathering of best practice. Pilot users will be testing these systems throughout the summer and providing feedback, with a view to rolling out the production service between April and July 2018.

The UK research data discovery service (beta), part of the Jisc Research at Risk challenge to develop RDM infrastructure, enables the discovery of data from UK HEIs and national data centres.

Leeds contributed to the event by sharing lessons learned when setting up our RDM service and with a lightning talk.

All in all a valuable couple of days with lots of information still to synthesise and file away. Indeed to preserve in one’s cortex…now where’s that neuroscientist?

Slides from all sessions and extensive notes are available from

Research data: enabling peer review

We are starting to get requests to make data available for peer review prior to the journal paper being accepted. Some authors are happy for the data to go live in the repository with a note explaining the data is under review and may be subject to change. However, not all authors are happy with putting their pre-review data into the public domain and a better model would be restricted access. There are additional challenges associated with single and double blind peer review and any model based on the (institutional) repository will necessarily reveal the affiliation of an author due to the institutional URL.

Images from datasets in the Research Data Leeds Repository

Increasingly, journals manage this themselves via a partnership with Data Dryad* or Figshare, but not all have a suitable mechanism set up for access to data in addition to the draft of the paper. Moreover, such a journal-centric model will disadvantage institutionally based data repositories, potentially even rendering them obsolete (see the pros and cons of journals handling data below).


Might there be a role for Jisc here to build a suitable mechanism into their shared service, which, from a blind-review perspective, would have the advantage of obscuring author affiliation?

Potential solutions

1. Make the data available in the repository. Don’t mint a DOI. Send the URL to the reviewers. Include a prominent note on the eprint record ‘This data is associated with a paper which has been submitted for publication. The data may be subject to change [date]. Full details of the associated publication and the final dataset will be made available in due course.’

2. Make the data available in the repository with access control. Repository account enables access to the dataset only from specific user account(s). Problem: this is not available yet (for EPrints)?

3. Share the data via OneDrive. This may not be suitable for double blind peer review. However, if the journal can act as a liaison point i.e. the editor is given access to the data on OneDrive, the journal could then provide access details to the peer reviewers. This could be a good solution if the journal is willing.

4. Share the data in another repository – Figshare, Zenodo – which supports restricted access prior to publication of a dataset. This is a good way to share data with a restricted group, but may not be suitable for single or double blind peer review – unless the journal publisher can act as the access gateway as in the OneDrive model outlined in 3. One downside – why bother to deposit in RDL if the data is already in Figshare or similar?

5. Ask the journal if they can help – there may be a mechanism for providing access to the data, though this may not be in place. There is a risk the data will become supplementary information or be deposited in another repository (if we see this as a problem), which reduces the role for RDL.

                                                     Hide creator   Hide reviewer   Hidden to world
1. Data available in repository                            N               N               N
   …with access through publisher                          N               Y               N
2. Data available in repository with access control        N               N               Y
   …with access through publisher                          N               Y               N
3. Share data via OneDrive                                 N               N               Y
   …with access through publisher                          N               Y               Y
4. Share data in another repository                        Y               N               Y
   …with access through publisher                          Y               Y               Y
5. Ask journal if they can help                            Y               Y               Y
   Jisc shared service?                                    ?               ?               ?

White Rose Libraries Digital Scholarship Event

Last week I attended an event in Sheffield that brought together colleagues from across the White Rose consortium (Universities of Leeds, Sheffield and York) to explore developments in Digital Scholarship. Whatever that might be…

Indeed, several speakers throughout the day drew attention to potential problems of terminology – the other common descriptor is Digital Humanities – with Ben Outhwaite in his keynote differentiating between the plain scroll and the later codex to illustrate that technology has always facilitated new methods of analysis, and that digital technology isn’t qualitatively any different. Digital Humanities is simply humanities research driven by the opportunities offered by new media.

Anne Horn, University Librarian (Sheffield), conversational in her introduction, provided a preliminary definition and elicited perspectives from the audience. Anne emphasised interdisciplinary collaboration, with the Library as an active participant – a theme that recurred throughout the day – and suggested that communities coalesce around both technology and processes as well as content and datasets. She talked about the challenges of building and sustaining the broad range of knowledge and skills required, an area in which the Library has a clear role.

In one of several academic viewpoints throughout the day, Mike Pidd described how the Digital Humanities Institute at the University of Sheffield is self-funded through project collaboration and supports technology R&D in the humanities, with services ranging from data acquisition, data modelling and data management to data visualisation, preservation and sustainability. We learned about just a few of the Institute’s projects, like The Digital Panopticon, which has brought together genealogical, biometric and criminal justice datasets held in the UK and Australia to explore the impact of different types of punishment on the lives of 90,000 people sentenced at the Old Bailey between 1780 and 1875. The scale of the project is impressive, having linked records across 45 separate datasets, both public and commercial (e.g. Ancestry UK), illustrating a common challenge: negotiating with data providers.

Other projects are Locating London’s Past*, Old Bailey Online*, Linguistic DNA and Mark My Bird all of which are capturing and reusing data in innovative ways, backing up Mike’s statement that “data is just as important for your career as publishing books and articles”.

* Raw XML data from London Lives and Old Bailey Online is available from Sheffield’s data repository ORDA

The Library Showcases, reprised in the afternoon, were an opportunity for us to learn about digitisation projects within archives and special collections across the consortium:

The presentation from York, for example, emphasised the complexity of these types of project, requiring a broad range of skills from traditional document preservation, through digitisation/ingest, to the development of an editorial interface (the editing tool for the Archbishops’ Registers is available from GitHub).

Digitised excerpt from Henry VIII’s divorce from Anne of Cleves (Archbishops’ Registers)


High-quality digital images facilitate zooming in with no loss of fidelity

A couple of academic viewpoints spanned lunchtime with Louise Hampson from the Centre for the Study of Christianity & Culture at the University of York and Brett Greatley-Hirsch from the University of Leeds.

Louise talked about the legacy issues of migrating CD-ROMs to internet-based resources – both the practical difficulties for a small team and (re)negotiating usage rights – while Brett immediately won over the room by saying that libraries should be recognised as active collaborators and not mere support services.

Brett has come to Leeds via Australia and Canada and introduced us to Digital Renaissance Editions which publishes open-access electronic critical editions of non-Shakespearean early modern drama.

The second of my Library Showcases was Sheffield’s National Fairground & Circus Archive, a “living” archive actively “contributing to the organisation and promotion of shows and festivals” and drove home yet again the broad range of skills required to curate digital material.

All of which brought us to an energetic keynote from Ben Outhwaite, who described a somewhat fragmented landscape at Cambridge, with various pockets of work that perhaps lack cohesion across a university where STEM subjects tend to prevail. The University is beginning to look at the area strategically, however, to support its digital humanists, who might be collaborating with scholars elsewhere through the Digital Humanities Network, a university-funded, short-term, strategic initiative. Ben also talked us through the high-profile Casebooks Project, making available the astrological records of Simon Forman (1552–1611) and Richard Napier (1559–1634), “unparalleled resources in the history of early modern medicine”.

The best projects are idea-led, not technology-led, according to Ben, and there needs to be a real scholarly need – a theme that came through strongly in presentations throughout the day, with digital technology an integrated aspect of every project. Digitisation, though, undoubtedly leads to more opportunities.

Crucially, “You can’t do anything without data, collect and look after the data rigorously”.

Preservation and Research Data files – ASCII

Let’s start with the plain text files.
In her blog posts, Jenny Mitcham described the range of files appearing in the York repository, the use of available tools to identify these files, and the process of registering a file format with The National Archives’ PRONOM registry: see ‘What have we got in our digital archive?’ and ‘My first file format signature’. The original post describing the data profiling at York is ‘Research data – what does it *really* look like?’

At Leeds we see a similar mix of file formats, though perhaps with more arising from scientific instruments and software. Such file formats are sometimes recognised by the tools, though often not. For those that are binary, the registration process Jenny describes can be applied. What about the ASCII or text/plain files?
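A rough first step in triaging such files is a text-versus-binary heuristic, similar in spirit to what the identification tools do before any signature matching. A minimal sketch, under the assumption that a file counts as “text” if a leading sample contains no NUL bytes and decodes as ASCII/UTF-8:

```python
def looks_like_text(sample: bytes) -> bool:
    """Crude heuristic: treat content as text/plain if it has no NUL
    bytes and decodes cleanly as UTF-8 (of which ASCII is a subset)."""
    if b"\x00" in sample:
        return False
    try:
        sample.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_text(b"*NODE\n1, 0.0, 0.0\n"))  # True
print(looks_like_text(b"\x89PNG\r\n\x1a\x00"))   # False
```

Real tools are considerably more careful (encodings beyond UTF-8, sampling more of the file), but the binary/text split is the fork in the road: binary formats can get a PRONOM signature, while plain-text formats raise the questions discussed below.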

An example: at Leeds we make extensive use of a finite element software package called Abaqus, produced by Dassault Systemes. Have a look at one of our very earliest datasets and the sample Abaqus input file (Segmentation_model_S1a.inp in Hua, Xijin and Jones, Alison (2015) Parameterised contact model of pelvic bone and cartilage: development data set. University of Leeds. [Dataset]).

Extracts from the file Segmentation_model_S1a.inp (not reproduced here) include: the header, with the start of the section that defines the geometry (followed by some 200,000 more Node lines); the material properties and boundary conditions (plus many more lines of settings and processing commands); and the final step in the processing.
Such files have a .inp file extension and can be created through one of the Abaqus suite of tools or manually using a text editor. The file header (lines starting with ** %) is produced by the Abaqus tools but is not processed, and may not be present if the file is edited or created manually. The file contains some initial keyword-based lines that set up the task, a large section that defines the mesh over the geometry, a series of keyword-based commands that define material, boundary and contact conditions, then a keyword-based section that defines the analysis itself. In its use of a structure with a controlled vocabulary, parameters and values, it is much like a LaTeX, HTML or XML file.
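Because the format is keyword-based plain text, even a few lines of code can begin to characterise it. A sketch (not part of any Abaqus tooling, and the sample content is invented for illustration) that extracts the keyword lines while skipping the ** comment lines:

```python
# Sketch: pull out the keyword lines (e.g. *Node, *Element, *Step) from
# Abaqus-style input text. Keyword lines start with "*"; comment lines
# start with "**" and are skipped.
def keyword_lines(text: str):
    keywords = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*") and not stripped.startswith("**"):
            # the keyword is the token before any comma-separated parameters
            keywords.append(stripped.split(",")[0])
    return keywords

sample = """** Generated by Abaqus
*Node
1, 0.0, 0.0, 0.0
*Element, type=C3D4
*Material, name=Cartilage
*Step
"""
print(keyword_lines(sample))  # ['*Node', '*Element', '*Material', '*Step']
```

The resulting keyword inventory is close to the kind of structural characterisation one would want when deciding whether, and how, a plain-text format could be described in a registry like PRONOM.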

The preservation tools correctly identified the file as mime-type text/plain: human readable, and no doubt understandable to a scientist in that field. To some extent, then, it can already be regarded as “preserved”. There are software vendor manuals that define the keywords and commands. With knowledge of finite element analysis, the input file and the manuals, a scientist in this discipline could reproduce the analysis whether or not they had a copy of the Abaqus software. Can we regard it as “even more preserved”?

If we do regard files in this format as “preserved” would it be reasonable to register the format in PRONOM even though we won’t be able to provide a signature – as has been done for .py (Python script) and a number of other formats?

If so should we work with the software vendors to create and maintain these registrations? Has this approach to preservation been explored before?

I have just started exploring digital preservation with Dassault Systemes. More to follow.