Preservation and Research Data files – ASCII

Let’s start with the plain text files.
In her blog posts “What have we got in our digital archive?” and “My first file format signature”, Jenny Mitcham described the range of files appearing in the York repository, the use of available tools to identify those files, and the process of registering a file format in The National Archives’ PRONOM registry. The original post describing the data profiling at York is “Research data – what does it *really* look like?”

At Leeds we see a similar mix of file formats, though perhaps with more arising from scientific instruments and software. Such file formats are sometimes recognised by the tools, though often not. For those that are binary, the registration process Jenny describes can be applied. But what about the ASCII, or text/plain, files?

An example: at Leeds we make extensive use of a finite element software package called Abaqus, produced by Dassault Systemes. Have a look at one of our very earliest datasets and the sample Abaqus input file (Segmentation_model_S1a.inp in Hua, Xijin and Jones, Alison (2015) Parameterised contact model of pelvic bone and cartilage: development data set. University of Leeds. [Dataset]).

Extracts from file Segmentation_model_S1a.inp:

[Extract: the header, with the start of the section that defines the geometry, followed by some 200,000 more node lines]

[Extract: material properties and boundary conditions, followed by many more lines of settings and processing commands]

[Extract: the final step in the processing]
Such files have a .inp file extension and can be created with one of the Abaqus suite of tools or manually in a text editor. The file header (comment lines starting with **) is produced by the Abaqus tools but is not processed, and may be absent if the file was edited or created manually. The file contains some initial keyword-based lines that set up the task, a large section that defines the mesh over the geometry, a series of keyword-based commands that define material, boundary, and contact conditions, and finally a keyword-based section that defines the analysis itself. In its use of a structure with a controlled vocabulary of keywords, parameters, and values, it is much like a LaTeX, HTML, or XML file.
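To make the structure concrete, here is a small sketch in Python that classifies the lines of such a file according to the conventions just described. The embedded fragment is invented for illustration – it is not taken from Segmentation_model_S1a.inp, and the node coordinates and material values are made up:

    # Classify the lines of an Abaqus-style input file, assuming only the
    # conventions described above: "**" begins a comment/header line,
    # "*" begins a keyword line, and everything else is a data line.
    SAMPLE = """\
    ** Generated by a preprocessing tool
    *NODE
    1, 0.0, 0.0, 0.0
    2, 1.0, 0.0, 0.0
    *MATERIAL, NAME=CARTILAGE
    *ELASTIC
    17000.0, 0.3
    *STEP
    *STATIC
    *END STEP
    """

    def classify(line):
        stripped = line.strip()
        if stripped.startswith("**"):   # check comments first: "**" also starts with "*"
            return "comment"
        if stripped.startswith("*"):
            return "keyword"
        return "data" if stripped else "blank"

    for line in SAMPLE.splitlines():
        print(f"{classify(line):8}| {line.strip()}")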

The preservation tools correctly identified the file as mime-type text/plain: human readable, then, and no doubt understandable to a scientist in the field. So to some extent it can already be regarded as “preserved”. There are software vendor manuals that define the keywords and commands. With knowledge of finite element analysis, the input file, and the manuals, a scientist in this discipline could reproduce the analysis whether or not they had a copy of the Abaqus software. Can we regard it as “even more preserved”?
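As a rough illustration of the sort of check involved (this is an assumption about the general approach, not the actual logic of any particular preservation tool), screening a file for “plain-textness” is cheap:

    # A crude heuristic: treat files containing NUL bytes as binary, and
    # anything that decodes cleanly as UTF-8 (a superset of ASCII) as plain
    # text. Real identification tools are considerably more sophisticated.
    def looks_like_plain_text(path, sample_size=4096):
        with open(path, "rb") as f:
            sample = f.read(sample_size)
        if b"\x00" in sample:
            return False
        try:
            sample.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return True

    # e.g. looks_like_plain_text("Segmentation_model_S1a.inp")  # True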

If we do regard files in this format as “preserved”, would it be reasonable to register the format in PRONOM even though we won’t be able to provide a signature – as has been done for .py (Python script) and a number of other formats?

If so, should we work with the software vendors to create and maintain these registrations? Has this approach to preservation been explored before?

I have just started exploring digital preservation with Dassault Systemes. More to follow.

Data Management Planning

Only weeks into January and, like an inveterate smoker, I’m struggling to honour my New Year’s resolution and falling behind in my MOOC, largely due to already scrabbling up the learning curve of a new role. When I do get round to it I find it very useful and naturally complementary to everything I am learning on the job.

We’re now approaching the end of week 3… but I only completed the assignment for week 2 yesterday, which focused on the humble Data Management Plan, or DMP to its friends.

The practical assignment presented us with a scenario to use as the basis for a DMP, suggesting a couple of tools – DMPTool from the University of California and DMPOnline from the Digital Curation Centre – as well as the framework provided by the DCC Checklist for a Data Management Plan, v4.0. From a UK perspective it might be more typical to use DMPOnline, but I’ve already had a play with that, so in honor (sic) of today’s Presidential inauguration and the Special Relationship I chose to try DMPTool, which also includes a template from NSF-SBE as cited in the scenario.

As a newcomer to RDM it’s the sheer complexity of the myriad aspects and associated best practice that is so daunting, which is where formal planning comes in: to impose some order on the primordial project chaos.

Using the DCC Checklist, a DMP breaks down as follows:

  • What data will you collect or create?
  • What documentation and metadata accompany the data?
  • How will you manage any ethical issues?
  • How will you manage copyright and intellectual property issues?
  • How will the data be stored and backed up during research?
  • How will you manage access and security?
  • Which data should be retained, shared, and/or preserved?
  • What is the long-term preservation plan for the dataset?
  • How will you share the data?
  • Are any restrictions on data sharing required?
  • What resources do you require to implement your plan?

A DMP then serves two primary functions:

  • to describe the data produced in the course of a research project
  • to outline the data management strategies that will be implemented, both during the active phase of the research project and after the project ends

Research funding bodies, including the Economic and Social Research Council (ESRC), the Engineering and Physical Sciences Research Council (EPSRC) and the Wellcome Trust in the UK, as well as the EU Horizon 2020 programme, increasingly require a DMP as part of the application process. Even if you don’t have formal funding for your research, though, writing a DMP can be an invaluable exercise – for PhD and other postgraduate researchers, for example. This was the focus of a recent training session delivered by my colleagues Rachel and Graham, ‘Research Data Management Essentials for your Research Degree’, delivered to a cohort of 24 PGRs with more signed up than showed up. There is a waiting list for future sessions, so there is clearly an appetite for RDM amongst PGRs; see here for details of future sessions.

Back to research funding bodies: while I have no doubt they wish to make life easier for researchers, they also require a DMP for somewhat more pragmatic purposes:

  • Transparency and openness – since many funding bodies are allocating public money, they have a responsibility to ensure research outputs are preserved and made accessible to the public
  • Return on investment – maximise the potential reuse of data; if someone has already invented the wheel, ensure that money doesn’t need to be spent reinventing it

By now I do have some experience of real-life DMPs, mainly based on our application-stage template, which primarily serves to identify any potential costs but which we also use to promote best practice. For the practical exercise I tried to expand beyond the parameters of the supplied case study rather than just rewrite it in the form of a DMP.

For example, the scenario refers to computer-assisted telephone interviewing (CATI) software where “the final, cleaned data will consist of a single SPSS file”, but I googled CATI (it outputs something called Blaise data files) and tried to think through the implications of SPSS as a preservation format:

Data format and dissemination

Blaise data files are a proprietary data format and therefore not suitable for preservation and data sharing. SPSS is also proprietary software, but it is in widespread use within research institutions and unlikely to present access problems in the short to medium term. However, the data will also be exported as .csv where possible, and (anonymised) interview transcripts will be retained as plain text (.txt) to ensure accessibility without specialist software.
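(As a side note, that export step is easy to script. A minimal sketch, assuming Python with the pandas and pyreadstat packages installed – the filenames are placeholders:)

    # Convert an SPSS .sav file to .csv for preservation and sharing.
    # Requires pandas plus pyreadstat; the filenames are placeholders.
    import pandas as pd

    df = pd.read_spss("cleaned_survey.sav")       # read the proprietary format
    df.to_csv("cleaned_survey.csv", index=False)  # write an open, plain-text copy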

I found DMPTool fairly user-friendly, though the NSF-SBE template arguably repeats requirements across some of the section guidelines (a point also commented upon by others) and could perhaps be restructured to avoid that overlap.

The other aspect of data management planning that I am particularly interested in is the emphasis that it should be a ‘living’ document, revisited throughout a research project. I may be wrong, but I get the sense that this is rather paid lip-service, and I would like to explore tools and strategies to prompt a PI to proactively revisit their DMP. I was at pains to emphasise this in my own plan, stating that “it will be proactively reviewed on a monthly basis and/or at suitable project milestones (TBC)”. It won’t, of course, but I have the excuse that this is only pretend…

Research Data Management and Sharing – Coursera MOOC, week 1

True to my New Year’s resolution I’ve completed week 1 of the Coursera MOOC from The University of North Carolina at Chapel Hill and The University of Edinburgh.

I’ve found the course content reasonably engaging so far, largely as revision given my recent immersion in all things RDM, and I managed to pass the quiz on Understanding Research Data (14 items) at the first try, albeit not with full marks (86% – I got a question wrong on the Research Data Lifecycle, but that was more an issue of terminology. That’s my excuse anyway.)

I’ve been working at the sharp end of RDM now for a whole month and have obviously been on a steep learning curve (discounting that week watching telly and eating chocolates), so my reasons for wanting to do this course are not only to consolidate my own learning but also to engage with the RDM community and to check out options for RDM training here at the University of Leeds. I have previously explored the excellent MANTRA resource from the University of Edinburgh, which heavily informed my interview presentation, ‘Research Data Management: Partnering with research’, way back on 19th September.
The ‘M’ part of this MOOC does appear to be more minuscule than massive at the moment, with no activity on the discussion forums. Some other MOOCs I’ve taken have erred at the opposite extreme, with thousands of threads to sift through. I’m also currently the only one using the #RDMSmooc hashtag on Twitter.
Perhaps the first week in January is just not the best time for a somewhat niche area of interest, while the rest of the population is focused on trying to break their mince-pie habit and other more mainstream resolutions? (According to a slide deck from Edinburgh’s Pauline Ward, the cohort for March–April 2016 comprised 1,294 “Active Learners” and resulted in 273 “Course Completers”.)
In terms of actual content, week 1 has been about the basics, “understanding research data”, in two main sections, “What are data?” and “Understanding Data Management”, comprising several short videos delivered by Dr Helen Tibbo with interactive transcripts, along with a summary and additional resources (which I’ve collated in Mendeley and added to an open Mendeley group here).
There were also a couple of supplementary videos featuring real academics and information professionals discussing their real-life experiences. I found these particularly engaging, perhaps unsurprisingly given my surfeit of theoretical learning to date, and also somewhat ironic given our “post-truth” age: “So when you make a statement about the world or about anything in it, you’re relying on some evidence for that statement.” Try telling that to the politicos!
As my own theoretical knowledge of RDM segues into the practical, I found Nancy Y. McGovern’s statement a suitable conclusion for week 1: “I certainly get the sense in libraries that people feel like we have a completely known set of data management practices. I don’t believe we do. I think for a lot of domains those practices are still emerging. So, I think that we should be really open to identifying new kinds of data management practices, and figuring out what our role might be in them.”
So all in all I’m looking forward to week 2, “Data Management Planning” – it would just be nice to have some company!

Other RDM blogs

I can’t figure out whether I can add RSS feeds from other blogs to this platform (Jadu), so I’m just dropping some links in here for the time being. Passers-by, please let me know if you are aware of any others.
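(In the meantime, pulling recent posts from other blogs is simple enough to script. A minimal sketch, assuming the feedparser package is installed – the feed URL is a placeholder:)

    # List the three most recent posts from each blog feed.
    # Assumes the feedparser package; the URL below is a placeholder.
    import feedparser

    FEEDS = ["https://example.org/rdm-blog/feed"]  # placeholder URL

    for url in FEEDS:
        feed = feedparser.parse(url)
        for entry in feed.entries[:3]:
            print(entry.title, "-", entry.link)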

An introduction (to RDM)

As the very newest member of the Research Data Management team here at Leeds, Rachel has seen fit to entrust me with the password for this blog and for the Twitter account @ResDataLeeds, both of which we hope to use to communicate with institutional stakeholders and with the wider RDM community.

I have worked in Scholarly Communications for nearly ten years supporting Open Access (OA) and repository systems up the road at Leeds Beckett University, including exploring issues around RDM. Inevitably, though, I am currently on a steep learning curve, albeit one that the sector as a whole is still traversing together.

Resoundingly, the case has now been made for Open Access to research papers, if not necessarily for the best mechanism to achieve it (gold or green routes), and the sector is moving (almost) as one, with HEFCE requiring that, to be eligible for REF submission, journal articles and conference papers must be deposited in an open access repository on acceptance for publication. There is also an evolving consensus that underlying research data should be made available too, openly where possible but with suitable access restrictions where necessary – for reasons of commercial sensitivity, for example.

While HEFCE do not currently advocate a comparable mandate for research data, their consultation paper published last week asks how they can incentivise units of assessment to share and manage their research data more effectively, as well as emphasising that research datasets and databases that meet the REF definition of research* will (continue to) be eligible for submission in the outputs element of the assessment (HEFCE, 2016).

Whether or not you are thinking about the REF, as a key element of the research process, RDM should be considered at the very outset of a research project and a plan put in place to manage data throughout its lifecycle as illustrated below:


[Figure: the research data lifecycle. © Stuart Macdonald/EDINA. Used with permission]

See our guidance for more information on how we can support your data management planning at the University of Leeds, or please get in touch.

* Definition of research for the REF

  1. For the purposes of the REF, research is defined as a process of investigation leading to new insights, effectively shared.
  2. It includes work of direct relevance to the needs of commerce, industry, and to the public and voluntary sectors; scholarship; the invention and generation of ideas, images, performances, artefacts including design, where these lead to new or substantially improved insights; and the use of existing knowledge in experimental development to produce new or substantially improved materials, devices, products and processes, including design and construction. It excludes routine testing and routine analysis of materials, components and processes such as for the maintenance of national standards, as distinct from the development of new analytical techniques. It also excludes the development of teaching materials that do not embody original research.
  3. It includes research that is published, disseminated or made publicly available in the form of assessable research outputs, and confidential reports (as defined at paragraph 115 in Part 3, Section 2).

Assessment framework and guidance on submissions (Annex C, p48)

How the Institutional Data Repository helped me promote my data

Guest post from James Mooney, Lecturer in Music Technology, University of Leeds

As part of International Data Week, Sept 11-17 2016, James Mooney reflects on his experience of using the Research Data Leeds institutional data repository.


I have recently completed a project that involved curating, researching and staging three performances of live electronic music compositions by the English composer Hugh Davies (1943-2005). Staging these concerts has, in many cases, involved building the equipment required to perform them from scratch, based on incomplete or ambiguous information gleaned from archival documents. In addition, these are experimental pieces, with scores that comprise text-based instructions and descriptions rather than standard notation, as well as other inherently unpredictable elements that mean that the pieces turn out differently every time they are performed. These were, in other words, pieces that could only be fully understood by performing them. In this situation, the practice-based elements of the project – that is, the performances – are a valuable project output in their own right, since they convey much more about the nature of the pieces than could ever be understood from any abstract or theoretical description of them.


I was interested in using the Research Data Leeds Repository because it offered the possibility of rendering these performances as outputs – entities as concrete, as readily identifiable, and as easy to reference as, say, a journal article would be.

Using the Repository allowed each of these three concerts to be packaged as an output, complete with title, abstract, DOI, and authorial information, as well as video-recordings of the pre-concert talk and each of the pieces themselves, and programme notes in PDF format. In this way they could be: (a) preserved for posterity; (b) viewed and auditioned by individuals who were not able to attend the original events; and (c) used and referenced in future research.
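To give a sense of what such packaging involves, here is a minimal sketch of the descriptive record behind one deposit. The field names and values are invented for illustration – this is not the repository’s actual metadata schema:

    # Illustrative sketch of a deposit record; field names and values are
    # invented, not the repository's actual schema.
    deposit = {
        "title": "Concert 1: live electronic works by Hugh Davies",
        "abstract": "Video documentation of the first of three concerts...",
        "creators": ["Mooney, James"],
        "doi": "10.5518/xxxx",            # placeholder; minted on deposit
        "files": [
            "pre_concert_talk.mp4",       # video of the pre-concert lecture
            "piece_01.mp4",               # one video file per piece performed
            "programme_notes.pdf",
        ],
    }
    print(deposit["title"], "->", deposit["doi"])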

Preparation of Materials

In anticipation of their inclusion in the Repository, all three performances were video recorded. Detailed pre-concert lectures delivered in advance of each of the three concerts were also video-recorded. Extensive programme notes – prepared in hard copy for the performances themselves – were also retained in PDF format for inclusion in the Repository.

Decisions also needed to be made in relation to the ‘granularity’ of the materials to be presented. Would we, for example, package each individual piece as a separate entity within the Repository, or would one entity per concert be preferable? Would we include separate video files for each individual piece performed, or a single continuous video file for each concert? Or both? Ultimately, we opted for one entity per concert but with separate files for each individual piece. This configuration, we felt, represented the best balance between representing the original aims of the project (which had specified three concerts as outputs) and catering for the potential needs of future researchers (who might appreciate being able to refer to individual pieces quickly and easily).

Having made these decisions, the video-files naturally had to be prepared accordingly. The videos of the concerts were edited so as to provide an individual video-file for each piece performed. Titles and credits for each individual piece were added using Final Cut Pro.

Readying these materials for the Repository also necessitated gaining permission from the various rights-holders, including composers (or their next of kin, if deceased), performers, and in some cases, publishers. If carried out methodically, this need not represent too onerous an administrative burden. In this case, a standard email was drafted, and responses recorded in a spreadsheet, which was then uploaded to the Repository along with the other materials.

Access the Hugh Davies data online.


Benefits and Applicability

Packaging the concerts as outputs in this way represents a more sustainable option than using websites like YouTube and Vimeo, where the continued availability of the videos is contingent upon the integrity of one individual’s user account (which could cease to be maintained for a variety of different reasons), and upon third-party terms and conditions that may change unpredictably. It also represents a preferable option to hosting such outputs on an individual’s personal website, or on a bespoke institutionally-hosted one, since these options will only be effective for as long as somebody is ready and able to maintain them. In contrast with these less-than-ideal options, the Repository allows these outputs to be preserved in perpetuity, theoretically at least.

Depositing materials in this way would potentially be beneficial for any research-based activity that incorporates a practice-based element. For the current project, similar repository deposits are planned as documentation for the project exhibition, and as a ‘video proceedings’ for the project conference. So long as appropriate materials (e.g. video and other digital formats) are gathered while the practice is under way, such materials can be combined with a title and abstract at a later date and packaged as an output, complete with a digital object identifier (DOI).

Colleagues writing funding proposals may wish to build plans for packaging outputs in the Repository into their grant applications. This would doubtless be attractive to funders, who will welcome any efforts to assure the sustainability of digital outputs.
James Mooney
Lecturer in Music Technology
School of Music
Faculty of Performance, Visual Arts and Communications
University of Leeds

Related links

Research Data Leeds repository

Research Data Leeds, the research data repository for the university, is now live:

To deposit data with Research Data Leeds please see the instructions online.

Data deposited with Research Data Leeds will be given a Digital Object Identifier (DOI).
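(Once minted, a DOI can also be resolved programmatically. For example, a minimal sketch that fetches a formatted citation via DOI content negotiation – the DOI below is a placeholder, not a real dataset:)

    # Fetch an APA-style citation for a dataset DOI via doi.org content
    # negotiation. The DOI below is a placeholder.
    import urllib.request

    doi = "10.5518/xxxx"  # placeholder DOI
    req = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "text/x-bibliography; style=apa"},
    )
    print(urllib.request.urlopen(req).read().decode("utf-8"))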

For further information, please contact