Data Management Planning

Only weeks into January and, like an inveterate smoker, I’m struggling to honour my New Year’s resolution and falling behind in my MOOC, largely because I’m already scrabbling up the learning curve of a new role. When I do get round to it, I find it very useful and naturally complementary to everything I am learning on the job.

We’re now approaching the end of week 3…but I only completed the assignment for week 2 yesterday, focused on the humble Data Management Plan, or DMP to its friends.

The practical assignment presented us with a scenario to use as the basis for a DMP, suggesting a couple of tools, DMPTool from the University of California and DMPOnline from the Digital Curation Centre, as well as the framework provided by the DCC Checklist for a Data Management Plan, v4.0. From a UK perspective it might be more typical to use DMPOnline, but I’ve already had a play with that, so in honor (sic) of today’s Presidential inauguration and the Special Relationship I chose to try DMPTool, which also includes a template from NSF-SBE as cited in the scenario.

As a newcomer to RDM, it’s the sheer complexity of the myriad aspects and associated best practice that is so daunting, which is where formal planning comes in: to impose some order on the primordial project chaos.

Using the DCC Checklist a DMP breaks down as follows:

  • What data will you collect or create?
  • What documentation and metadata accompany the data?
  • How will you manage any ethical issues?
  • How will you manage copyright and intellectual property issues?
  • How will the data be stored and backed up during research?
  • How will you manage access and security?
  • Which data should be retained, shared, and/or preserved?
  • What is the long-term preservation plan for the dataset?
  • How will you share the data?
  • Are any restrictions on data sharing required?
  • What resources do you require to implement your plan?

A DMP then serves two primary functions:

  • to describe the data produced in the course of a research project
  • to outline the data management strategies that will be implemented both during the active phase of the research project and after the project ends

Research funding bodies, including the Economic & Social Research Council (ESRC), the Engineering and Physical Sciences Research Council (EPSRC) and the Wellcome Trust in the UK, as well as the EU Horizon 2020 programme, increasingly require a DMP as part of the application process. But even if you don’t have formal funding for your research, writing a DMP can be an invaluable exercise, for PhD and other Post Graduate Researchers for example. This was the focus of a recent training session delivered by my colleagues Rachel and Graham, ‘Research Data Management Essentials for your Research Degree’ (delivered to a cohort of 24 PGRs, with more signed up than showed up. There is a waiting list for future sessions, so there is clearly an appetite for RDM amongst PGRs; see here for details of future sessions.)

Back to research funding bodies: I have no doubt they wish to make life easier for researchers, but they also require a DMP for somewhat more pragmatic purposes:

  • Transparency and openness – since many funding bodies are allocating public money, they have a responsibility to ensure research outputs are preserved and made accessible to the public
  • Return on investment – maximise potential reuse of data and if someone invents the wheel, ensure that money doesn’t need to be spent again to reinvent it

By now I do have some experience of real life DMPs, mainly based on our application stage template which is primarily to identify any potential costs but which we also use to promote best practice; for the practical exercise I tried to expand beyond the parameters of the supplied case study rather than just rewrite it in the form of a DMP.

For example, the scenario refers to computer-assisted telephone interviewing (CATI) software where “the final, cleaned data will consist of a single SPSS file”, but I googled CATI (which outputs something called Blaise data files) and tried to think through the implications of SPSS as a preservation format:

Data format and dissemination

Blaise data files are a proprietary data format and therefore not suitable for preservation and data sharing. SPSS is also proprietary software, nevertheless in widespread use within research institutions and unlikely to present access problems in the short to medium term. However the data will also be exported as .csv where possible and (anonymised) interview transcripts will be retained as plain text (.txt) to ensure accessibility without specialist software.
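That reasoning can be sketched as a small check, purely as an illustration. The suffix lists below are my own assumption (.sav and .por for SPSS, .csv and .txt as open formats), not an authoritative registry, and the function name is hypothetical. It flags any proprietary file that has no open-format copy with the same name.

```python
from pathlib import PurePath

# Assumed classification for illustration only; a real plan should follow
# the guidance of the chosen repository or data archive.
OPEN_SUFFIXES = {".csv", ".txt"}
PROPRIETARY_SUFFIXES = {".sav", ".por"}  # SPSS data and SPSS portable files


def missing_open_copies(filenames):
    """Return proprietary files that have no open-format copy with the same stem."""
    paths = [PurePath(name) for name in filenames]
    # Stems (file names without the suffix) that already exist in an open format
    open_stems = {p.stem for p in paths if p.suffix.lower() in OPEN_SUFFIXES}
    return [
        p.name
        for p in paths
        if p.suffix.lower() in PROPRIETARY_SUFFIXES and p.stem not in open_stems
    ]
```

For example, given survey.sav, survey.csv and wave2.sav, only wave2.sav would be flagged as still needing a .csv export.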

I found DMPTool fairly user-friendly, though the NSF-SBE template arguably repeats requirements in some of the section guidelines (a point also commented upon by others) and could perhaps be clearer to avoid overlap.

The other aspect of data management planning that I am particularly interested in is the emphasis that it should be a ‘living’ document, revisited throughout a research project. I may be wrong, but I get the sense that this is rather paid lip-service, and I would like to explore tools and strategies to prompt a PI to proactively revisit their DMP. I was at pains to emphasise this in my own plan, stating that “it will be proactively reviewed on a monthly basis and/or at suitable project milestones (TBC)”. It won’t of course, but I have the excuse that this is only pretend…
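One very simple strategy for prompting a review would be to generate the monthly review dates up front and feed them into a calendar or reminder system. The sketch below is only an assumption about how that might look; the function name is hypothetical and it covers the "monthly basis" part only, not project milestones.

```python
import calendar
from datetime import date


def monthly_review_dates(start: date, end: date):
    """List one review date per month after `start`, up to and including `end`."""
    dates = []
    year, month = start.year, start.month
    while True:
        # Step forward one calendar month
        month += 1
        if month > 12:
            year, month = year + 1, 1
        # Clamp the day so that, e.g., the 31st falls back to the month's last day
        day = min(start.day, calendar.monthrange(year, month)[1])
        review = date(year, month, day)
        if review > end:
            break
        dates.append(review)
    return dates
```

A project starting on 31 January would get reviews on the last day of each shorter month rather than skipping them.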

DMP Online Plan Formatting

How can we gain most benefit from using an online DMP tool by introducing time-saving features that make best use of this environment?

Computers can perform certain tasks much faster than humans, but can be foxed (no pun intended) by the seemingly simple job of correctly distinguishing between pictures of cats and dogs.

https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/

In the context of DMPOnline, this got me thinking about what improvements to the system could make best use of the fact this is an online tool rather than a set of templates. Academic staff are often wary of using new systems due to the time that needs to be invested in learning how they work. So, if we can demonstrate that a new tool such as DMPOnline will definitely save time & effort by removing the need for the user to undertake mechanical tasks, then this can only be a good thing.

Computers excel (again, no pun intended) at processing structured data according to a clearly defined set of rules. Humans on the other hand are much better at dealing with unstructured data and open-ended tasks. Within the context of data management planning, I started to look at which of the requirements resulted from clearly defined rules.

The area which jumped out at me was the various requirements for formatting the finished plans. Some examples are as follows:

RCUK requires all ‘attachments’ including DMPs to be formatted with minimum margin sizes of 2cm in all directions. They also suggest the use of ‘Arial’ with a minimum font size of 11.

Je-S Formatting Guidance#1 

Elsewhere on the Je-S website, it states that Arial or Times New Roman are recommended.

Je-S Formatting Guidance#2

This page also defines the accepted file formats, which include:

  • PDF versions 1.3, 1.4, 1.5 and 1.6 (*.pdf)
  • Postscript level 2 (*.ps)
  • Microsoft Word (’97 and later including Word 2007)

ESRC further require that a minimum font size of 12 is used and that DMPs do not exceed 3 pages of A4.

Je-S ESRC Formatting Guidance

MRC’s page length requirements are more complex because the data management plan forms part of a longer ‘Case for Support’ document. The maximum permitted length for the ‘Case for Support’ depends on the scheme being applied to.

Je-S MRC Formatting Guidance

The maximum length of a NERC DMP is one side of A4:

Je-S NERC Formatting Guidance

My personal experience of helping PIs to create DMPs using DMPOnline is that I spend (or rather waste) a lot of time adjusting settings to bring the plan in line with the funder’s requirements for page length, font size etc. This often involves deselecting different options, such as header and footer text, on a trial and error basis to reduce the plan down to the required length.

These tasks can easily be expressed as a series of structured steps.

Of course, the user should be given the option to override these settings (they may want a version of the plan that will not be submitted as part of the application that exceeds the maximum length permitted by the funder, for example), but in the majority of cases, if the system can produce a plan which meets the funder requirements (and updates these requirements as and when they change), then PIs will be able to spend more of their valuable time concentrating on the content, rather than the formatting.
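As a rough illustration, the funder requirements quoted above could be held as structured data and checked automatically, with an override switch for the user. This is a sketch under stated assumptions, not how DMPOnline works: the class and function names are hypothetical, and I have assumed that ESRC and NERC inherit the RCUK baseline (font size 11, 2cm margins) unless they state otherwise. MRC is left out because its page limit depends on the scheme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FormatRules:
    """Formatting limits for one funder (illustrative values only)."""
    min_font_pt: int = 11            # RCUK baseline quoted above
    min_margin_cm: float = 2.0       # RCUK baseline quoted above
    max_pages: Optional[int] = None  # None means no limit is recorded here


# Assumption: ESRC and NERC inherit the RCUK baseline unless they say otherwise.
FUNDER_RULES = {
    "RCUK": FormatRules(),
    "ESRC": FormatRules(min_font_pt=12, max_pages=3),
    "NERC": FormatRules(max_pages=1),
}


def check_plan(funder: str, font_pt: int, margin_cm: float, pages: int,
               enforce: bool = True) -> List[str]:
    """Return a list of formatting problems; empty if the plan complies."""
    if not enforce:  # the user has chosen to override the funder settings
        return []
    rules = FUNDER_RULES[funder]
    problems = []
    if font_pt < rules.min_font_pt:
        problems.append(
            f"font size {font_pt} is below the minimum of {rules.min_font_pt}")
    if margin_cm < rules.min_margin_cm:
        problems.append(
            f"margin {margin_cm}cm is below the minimum of {rules.min_margin_cm}cm")
    if rules.max_pages is not None and pages > rules.max_pages:
        problems.append(f"{pages} pages exceeds the maximum of {rules.max_pages}")
    return problems
```

Keeping the rules in one table means that when a funder changes its requirements, only the table needs updating rather than every plan.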

This suggestion has been posted on the DMPOnline GitHub pages and has received positive feedback.

Tim Banks

DMP Online Developments

Recent developments in the DCC’s DMPOnline in response to feedback from pilot users and next steps.

The Data Management Planning workpackage in the RoaDMaP project is piloting the use of DMPOnline, developed and maintained by the Digital Curation Centre.

We have been asking PIs from a range of disciplines to create data management plans using the DMPOnline tool and provide feedback to us on their experience. A summary of some of the reported issues is as follows:

1. Layout of DMPs

Many funders specify a maximum length for a data management plan (at a given font size / margin width). For example, the ESRC require DMPs to be no longer than 3 sides of A4 at a minimum font size of 12.

PIs creating data management plans for the ESRC commented that they were only able to fit very limited amounts of information into the required 3 sides of A4 due to the way in which the report was formatted.

The cause of this problem was quickly identified as the question text being formatted as columns rather than rows, which resulted in a lot of wasted blank space and the answer text being squashed over on the right hand side of the page.

DMP-before_1_

We contacted the DCC for advice and they suggested using the DOCX export and adjusting the column sizes. We tried this method with some limited success, but as the column width for the answer text was increased, the questions became harder to read.

A screenshot of an ESRC data management plan formatted in this way is shown above, which clearly demonstrates the amount of wasted space on the page.

We undertook some experiments by exporting the data to CSV format, importing it into Excel, then creating a new report layout. We formatted the question text in lines, rather than columns, in order to enable the full page width to be used for the answers.

The resulting DMP was only half the length (3 sides of A4). We passed these findings on to the DCC and their developers started work on a new report layout based on our template, which was deployed to the production environment 2 weeks later.
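We did this by hand in Excel, but the same re-layout can be sketched in a few lines of Python. The column names 'question' and 'answer' are assumptions for illustration; the actual DMPOnline CSV export may be structured differently.

```python
import csv
import io


def render_rows(csv_text: str) -> str:
    """Lay out each question on its own line with the answer underneath,
    so that both can use the full page width."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = []
    for row in reader:
        # 'question' and 'answer' are hypothetical column names
        blocks.append(f"{row['question']}\n{row['answer']}")
    # Separate question/answer pairs with a blank line
    return "\n\n".join(blocks)
```

Because the question sits above the answer rather than beside it, neither is squeezed into a narrow column.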

The new layout is a significant improvement and makes much more efficient use of the available space on the page as can be seen below:

DMP-after_1_

2. Sharing of DMPs

One of the great new features of DMPOnline is the ability to share plans. Not only does this allow more than one person to work on a plan (e.g. to allow an IT Manager to complete information about data storage and backup) but it also enables the research office to see whether plans have been completed and in what areas additional help and advice needs to be offered.

However, our piloting of the tool has revealed some improvements that could be made to the sharing functionality.

a) Plans are currently shared with others using their e-mail address (which is now also the DMPOnline username). However, when you share a plan with others using their e-mail address, you have no way of knowing whether they have registered a DMPOnline account using that address. It could be that they have:

– not registered to use DMPOnline or

– registered an account using a different e-mail address (I have more than one SMTP alias for my leeds.ac.uk address for example)

Our suggestions to DCC are as follows:

When you invite a user to share a plan, perform a lookup in the user database and show whether there is an active account associated with that e-mail address. If there is no active account, then provide an ‘Invite’ button which will trigger an e-mail to that user with instructions on how to sign up for DMPOnline.
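The logic of this suggestion is simple enough to sketch. The example below assumes, purely for illustration, that the user database can be represented as a mapping from e-mail address to an 'active' flag; the function name is hypothetical and this is not the DMPOnline implementation.

```python
def share_action(email: str, accounts: dict) -> str:
    """Decide what the sharing dialog should offer for a given address.

    `accounts` maps lower-case e-mail addresses to True when the account
    is active (an assumed data shape, for illustration only).
    """
    # Assumption: addresses are compared case-insensitively
    key = email.strip().lower()
    if accounts.get(key, False):
        return "share"   # an active account exists, so share the plan directly
    return "invite"      # no active account, so offer the 'Invite' button
```

Note that this would not catch the second case above, where someone has registered under a different alias; that would need the aliases to be known to the system.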

The DCC have responded positively to these suggestions and included them in their list of future developments.

3. Funder Templates

We have received very positive feedback from users who are submitting grant applications to the ESRC. However, the pilot users who created data management plans for the AHRC all complained that the questions were identical to those contained within the ‘AHRC Technical Appendix’, which is a mandatory part of the Je-S submission for many projects. PIs were irritated that they were being asked to duplicate this information into a different online tool. Most preferred the DMPOnline interface, but discovered it was not practical to complete this first, because the Je-S system imposes character limits in some sections which are not observed in DMPOnline, so there was a risk that text would become truncated.

This feedback has also been passed to the DCC.
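The truncation risk described above could be flagged before text is copied across. The sketch below is only an illustration: the function name is hypothetical, and the limits would have to be supplied per section, since the actual Je-S character limits are not given here.

```python
def over_limit(answers: dict, limits: dict) -> dict:
    """Return {section: number of excess characters} for each answer that
    is longer than the limit recorded for its section."""
    return {
        section: len(text) - limits[section]
        for section, text in answers.items()
        # Sections with no recorded limit are not checked
        if section in limits and len(text) > limits[section]
    }
```

An empty result would mean that every answer fits within the limits supplied.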

Next steps

We will shortly start work with researchers in the Engineering and Biological Sciences Faculties to pilot DMPs for the BBSRC, EPSRC and STFC funders (NB: There is currently no DMPOnline plan for EPSRC grant applications).

This will then be extended to work with researchers submitting grant applications to MRC, NERC and Wellcome. We are also considering the question of ownership and are intending to pilot a process whereby the Faculty Research Offices create the outline DMP (containing project title, start / end dates and funder only) when they are notified of the intention to apply and then share this with the PI, Co-I, IT Manager and others as required.

We’re keen to hear from other projects or individuals that are currently using DMPOnline.