Research Data Network – University of York – June 2017

If Jisc’s 4th Research Data Network earlier this week felt a bit rushed at times, it only reflects the sheer number of exciting projects happening across the sector.

There’s still a long way to go but it felt like the dots are really starting to join up and there was lots of energy in both real and virtual space – see Storify of tweets on the #JiscRDM tag during the event.

Delegates busy networking on Tuesday evening at RDN York (thanks to Paul Stokes for the photo, used with permission)

Two packed days in York were bookended by an inspiring opening keynote from Mark Humphries asking “Who will use the Open Data?” and by a panel session the following afternoon on the principles and practice of open research, informed by the open research pilot project at the University of Cambridge.

Mark emphasised that the rationale for sharing is clearer in some academic contexts than in others. Clinical trials, for example, are time-consuming and expensive, and need to be safe and effective, which provides a clear motivation to share data and check conclusions.

Mark singled out his own discipline of neuroscience, however, as lagging behind, with no discipline-specific open data repositories, and inclined to “data worship”. New data is hard to get and requires considerable skill (to implant electrodes in a rat’s cortex, for instance) and will underpin high-impact papers, that universal currency of academia. It’s not for sharing!

Mark reassured us, nevertheless, that open data is the future. Inevitably. If only because of the sheer scale of data being generated, which simply has to be shared if it is to be analysed effectively. He cited an instance in which a single dataset generated 9 high-quality papers from several labs. RDM isn’t trivial though, which is one of the main reasons that funding bodies are mandating data sharing.

Some 28 hours later, we were back in the same lecture theatre for the final session, chaired by Marta Teperek. Our four panellists fielding questions from the floor were David Carr (Wellcome Trust), Tim Fulton, Lauren Cadwallader (both University of Cambridge) and Jennifer Harris (Birkbeck, University of London).

There was a great deal of emphasis on the cost of open research and sustainability – by way of answer to the question above, Lauren Cadwallader referred to her recent blog post Open Resources: Who Should Pay? and shared her reservations about the ‘gold’ model of open access that is sustained by expensive Article Processing Charges to commercial publishers.

There are similarities and synergies between OA and open data initiatives, including increasing interest from publishers. There are also significant differences and it was pointed out from the floor that long term preservation is a cost that needs to be borne by someone.

Betwixt these bookends were far too many sessions to discuss in detail, covering everything from the European Open Science Cloud (EOSC) to an update on the work HESA is doing in relation to research data in the context of REF2021, Archivematica for preservation and some fantastic resources for business case development and costing for RDM (including a number of useful case studies). Then there’s the Research Data Alliance which *anyone* is able to join and which offers a window onto many different communities.

It was particularly interesting to learn about ongoing developments with Jisc’s shared service which is working with 13 pilot institutions on repository and preservation solutions and comprises a range of tools to capture, preserve, disseminate and allow reporting. The pilot offer also includes training, support and gathering of best practice. Pilot users will be testing these systems throughout the summer and providing feedback with a view to rolling out production between April and July 2018.

The UK research data discovery service (beta), part of the Jisc Research at Risk challenge to develop RDM infrastructure, enables the discovery of data from UK HEIs and national data centres.

Leeds contributed to the event by sharing lessons learned when setting up our RDM service and with a lightning talk.

All in all a valuable couple of days with lots of information still to synthesise and file away. Indeed to preserve in one’s cortex…now where’s that neuroscientist?

Slides from all sessions and extensive notes are available from https://research-data-network.readme.io/v2.01/docs/4th-research-data-network-york-university

Preservation and Research Data files – ASCII

Let’s start with the plain text files.
In her blog posts “What have we got in our digital archive?” and “My first file format signature”, Jenny Mitcham described the range of files appearing in the York repository, the use of available tools to identify these files, and the process of registering a file format with The National Archives in PRONOM. The original post describing the data profiling at York is “Research data – what does it *really* look like?”

At Leeds we see a similar mix of file formats, though perhaps with more arising from scientific instruments and software. Such file formats are sometimes recognised by the tools, though often not. For those that are binary, the registration process Jenny describes can be applied. What about the ASCII or text/plain files?

An example: At Leeds we make extensive use of a finite element software package called Abaqus (https://www.3ds.com/products-services/simulia/products/abaqus/) produced by Dassault Systemes. Have a look at one of our very earliest datasets and the sample Abaqus input file.
(Segmentation_model_S1a.inp in Hua, Xijin and Jones, Alison (2015) Parameterised contact model of pelvic bone and cartilage: development data set. University of Leeds. [Dataset] https://doi.org/10.5518/3).

Extracts from file Segmentation_model_S1a.inp

Header with start of section that defines the geometry

[Image: header extract]

(then 200000 more Node lines)

Material properties and boundary conditions

[Image: middle extract]

(lots more lines of settings and processing commands)

Final step in the processing

[Image: final extract]

Such files have a .inp file extension and can be created through one of the Abaqus suite of tools or manually using a text editor. The file header (lines starting with ** %) is produced by the Abaqus tools but is not processed, and may not be present if the file is edited or created manually. The file contains some initial keyword-based lines that set up the task, a large section that defines the mesh over the geometry, a series of keyword-based commands that define material, boundary, and contact conditions, then a keyword-based section that defines the analysis itself. In its use of a structure with a controlled vocabulary, parameters and values, it is much like a LaTeX, HTML or XML file.
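To make that structure a little more concrete, here is a minimal Python sketch that profiles the text of an input file. It assumes the usual Abaqus conventions: lines beginning with ** are comments, lines beginning with a single * are keyword lines, and everything else is a data line. The function name summarise_inp and the SAMPLE text are invented for illustration; the sample is not an extract from the Leeds dataset.

```python
from collections import Counter


def summarise_inp(text: str) -> dict:
    """Count comment lines, data lines and keywords in Abaqus-style input text."""
    keywords = Counter()
    comments = 0
    data_lines = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("**"):
            comments += 1  # comment line, ignored by the solver
        elif line.startswith("*"):
            # Keyword name is the text up to the first comma; upper-cased
            # here on the assumption that keywords are case-insensitive.
            name = line[1:].split(",", 1)[0].strip().upper()
            keywords[name] += 1
        else:
            data_lines += 1  # values belonging to the preceding keyword
    return {
        "comments": comments,
        "data_lines": data_lines,
        "keywords": dict(keywords),
    }


# Illustrative sample only (hypothetical, much shorter than a real model).
SAMPLE = """\
*Heading
** comment line
*Node
1, 0.0, 0.0, 0.0
2, 1.0, 0.0, 0.0
*Material, name=Bone
*Elastic
17000., 0.3
*Step, name=Load
*Static
*End Step
"""

if __name__ == "__main__":
    print(summarise_inp(SAMPLE))
```

A summary like this could sit alongside the file in a repository record as a rough profile of which keywords a given .inp file uses, without needing the Abaqus software itself.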

The preservation tools correctly identified the file as mime-type text/plain. So it is human readable and no doubt understandable to a scientist in that field, and to some extent it can already be regarded as “preserved”. There are software vendor manuals that define the keywords and commands. With knowledge of finite element analysis, the input file, and the manuals, a scientist in this discipline could reproduce the analysis whether or not they had a copy of the Abaqus software. Can we regard it as “even more preserved”?

If we do regard files in this format as “preserved” would it be reasonable to register the format in PRONOM even though we won’t be able to provide a signature – as has been done for .py (Python script) and a number of other formats?
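As a rough illustration of what a signature-less registration would mean in practice, the sketch below tries a byte signature first and falls back to the file extension. It is a toy model and not how any particular identification tool is implemented. The two registries and all of the format identifiers in them are hypothetical placeholders rather than real PRONOM identifiers; the only real-world detail is that PDF files begin with the bytes %PDF-.

```python
from pathlib import Path

# Hypothetical registry of formats that have a byte signature.
SIGNATURE_REGISTRY = {
    b"%PDF-": "example/pdf",
}

# Hypothetical registry of formats known only by file extension.
EXTENSION_REGISTRY = {
    ".inp": "example/abaqus-input",
    ".py": "example/python-script",
}


def identify(filename: str, head: bytes) -> tuple[str, str]:
    """Return (format identifier, basis of the match) for a file."""
    # A signature match is the stronger evidence, so try it first.
    for magic, fmt in SIGNATURE_REGISTRY.items():
        if head.startswith(magic):
            return fmt, "signature"
    # Otherwise fall back to the extension, which is weaker evidence.
    fmt = EXTENSION_REGISTRY.get(Path(filename).suffix.lower())
    if fmt is not None:
        return fmt, "extension"
    return "unknown", "none"


if __name__ == "__main__":
    print(identify("Segmentation_model_S1a.inp", b"*Heading\n"))
    print(identify("paper.pdf", b"%PDF-1.4\n"))
```

Reporting the basis of the match alongside the identifier keeps it visible that an extension-only identification is less certain than a signature match.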

If so should we work with the software vendors to create and maintain these registrations? Has this approach to preservation been explored before?

I have just started exploring digital preservation with Dassault Systemes. More to follow.