MOOC over

Well, I passed… but my noble intention was to post every week about the course content, which I only managed for weeks one and two, with a half-finished post languishing for week 3, ‘Working with Data’. Instead, and in lieu of paying Coursera for a course certificate, I’ll just leave this here as proof of completion, pick out a few highlights below, and note how they relate to our service here at Leeds:


In my defence, this course was obviously an adjunct to my day job, and though the content was good, I was disappointed by the lack of community: there was so little activity on the course forums that I gained little more from the MOOC format than I would have from working through the self-paced MANTRA course. Inevitably, the course was also rather generic, and one of the big challenges of RDM education, I think, is applying theoretical principles to specific disciplines – indeed this is common feedback from the Research Data Management Essentials training we deliver at Leeds. (I’ve just signed up for Data Management for Clinical Research from Vanderbilt University, which started this week and which might inform our engagement with the Faculty of Medicine and Health.)
Probably the main lesson I’ve learned, from the day job and reinforced by this course, is that each aspect of RDM is related to all the others which really drives home the importance of good data management planning.
Week 3: Working with Data 
Good file management in research is so obvious that a lot of researchers don’t give it proper consideration (citation needed). It’s important to adopt conventions for directory structure, folder naming, file naming and file versioning, especially where files will be shared across a research team or with project partners. (University of Leeds guidance is available on our webpages.)
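One nice property of an agreed convention is that it can be checked automatically. As a quick illustration (the convention below is my own example, not official Leeds guidance), a pattern like `YYYY-MM-DD_project_description_vNN.ext` can be validated with a few lines of Python:

```python
import re

# Hypothetical convention: YYYY-MM-DD_project_description_vNN.ext
# (illustrative only -- adapt to whatever your team has agreed)
PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2}"   # ISO date, so files sort chronologically
    r"_[a-z0-9]+"           # short project code
    r"_[a-z0-9-]+"          # brief description, hyphenated
    r"_v\d{2}"              # two-digit version number
    r"\.[a-z0-9]+$"         # file extension
)

def follows_convention(filename: str) -> bool:
    """Return True if the filename matches the agreed convention."""
    return bool(PATTERN.match(filename))

print(follows_convention("2017-01-20_cati_interview-transcripts_v01.txt"))  # True
print(follows_convention("final FINAL v2 (new).docx"))                      # False
```

A script like this could be run over a shared project folder to flag files that have drifted from the convention, which is exactly where versioning chaos tends to creep in.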
Storage, backup and data security are also crucial during the live stage of a project, and we spend a lot of time poring over data management plans, advising researchers on institutional best practice. A challenge here is the sheer number of relevant institutional policies, and we are working on a tool to help navigate these issues under the rubric ‘Safe Data Sharing Essentials’.
File formats and transformations – given the sheer number of proprietary file formats, this is a major consideration for potential data reuse and for long-term archiving, and RDL recommends that wherever possible data should be saved in an open, non-proprietary format. Of course this is not always possible: in a large, research-intensive university there will always be specialist software outputting more or less exotic files, and we must make informed decisions around reuse and preservation.
The UK Data Archive provides a table of optimal data formats, but it is pragmatic rather than comprehensive. There is also PRONOM, the technical registry from The National Archives, which is more comprehensive but less user-friendly.
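Under the hood, tools that work against PRONOM (such as The National Archives’ DROID) identify formats by matching byte signatures rather than trusting file extensions. As a toy sketch of the idea only (a handful of hand-picked signatures, nothing like the real registry):

```python
# A toy format check based on "magic numbers" -- the first few bytes of a
# file. Real identification should use DROID against the PRONOM registry;
# this just illustrates why extensions alone can't be trusted.
SIGNATURES = {
    b"%PDF": "PDF document",
    b"PK\x03\x04": "ZIP container (includes .docx/.xlsx)",
    b"\x89PNG": "PNG image",
    b"$FL2": "SPSS system file (.sav)",
}

def sniff_format(first_bytes: bytes) -> str:
    """Guess a format from leading bytes; 'unknown' if nothing matches."""
    for magic, name in SIGNATURES.items():
        if first_bytes.startswith(magic):
            return name
    return "unknown"

print(sniff_format(b"%PDF-1.7"))  # PDF document
```

The point for RDM is that a renamed or mislabelled file can still be identified, which matters when assessing deposits for long-term preservation.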
We have been thinking about developing a resource to help inform these decisions, perhaps based on local file formats – a good idea from the University of Edinburgh that we might borrow is their File Format Registry Wiki.
I’m also looking forward to this upcoming webinar from the Open Preservation Forum on 27th February, Managing the Research Data Challenge.
Week 4: Sharing Data 
The clue is in the title: we anticipate our ‘Safe Data Sharing Essentials’ toolkit will also enable, well, safer sharing of data, especially (though not only) qualitative data from human participants – through informed consent, ethical review and appropriate anonymisation, for example.

As identified recently by David Kernohan, “the point of article submission may be the first time a research[er] has encountered the need to manage or share data” [Uses and abuses of journal data policies (part 1)]. This can cause problems where, for instance, there is inadequate consent for future reuse, which may not have been fully considered at ethical review. We hope the toolkit will encourage data sharing, with safeguards where necessary (e.g. restricted access subject to an end user agreement).

Week 5: Archiving Data
As per the earlier observation, archiving data cannot be considered in isolation, and it is probably the area in which I personally have the most to learn. We have begun to liaise with colleagues from Special Collections, looking at our respective data deposit workflows, and it has been interesting to learn about their use of BitCurator to analyse files. In addition to file formats, handling and preservation, there are also synergies across our teams on policy and legal issues.
Special Collections have recently been awarded Accredited Archive Service status from the National Archives and we have also begun to think about working towards equivalent certification for our service, most likely the Data Seal of Approval.

Data Management Planning

Only weeks into January and, like an inveterate smoker, I’m struggling to honour my New Year’s resolution and falling behind in my MOOC – largely because I’m already scrabbling up the learning curve of a new role. When I do get round to it, I find the course very useful and naturally complementary to everything I am learning on the job.

We’re now approaching the end of week 3…but I only completed the assignment for week 2 yesterday, focused on the humble Data Management Plan, or DMP to its friends.

The practical assignment presented us with a scenario to use as the basis for a DMP, suggesting a couple of tools, DMPTool from the University of California and DMPOnline from the Digital Curation Centre, as well as the framework provided by the DCC Checklist for a Data Management Plan, v4.0. From a UK perspective it might be more typical to use DMPOnline, but I’ve already had a play with that, so in honor (sic) of today’s Presidential inauguration and the Special Relationship I chose to try DMPTool, which also includes a template from NSF-SBE as cited in the scenario.

As a newcomer to RDM, it’s the sheer complexity of the myriad aspects and associated best practice that is so daunting – which is where formal planning comes in, imposing some order on the primordial project chaos.

Using the DCC Checklist a DMP breaks down as follows:

  • What data will you collect or create?
  • What documentation and metadata accompany the data?
  • How will you manage any ethical issues?
  • How will you manage copyright and intellectual property issues?
  • How will the data be stored and backed up during research?
  • How will you manage access and security?
  • Which data should be retained, shared, and/or preserved?
  • What is the long-term preservation plan for the dataset?
  • How will you share the data?
  • Are any restrictions on data sharing required?
  • What resources do you require to implement your plan?

A DMP then serves two primary functions:

  • to describe the data produced in the course of a research project
  • to outline the data management strategies that will be implemented, both during the active phase of the research project and after the project ends

Research funding bodies – including the Economic & Social Research Council (ESRC), the Engineering and Physical Sciences Research Council (EPSRC) and the Wellcome Trust in the UK, as well as the EU Horizon 2020 programme – increasingly require a DMP as part of the application process. But even if you don’t have formal funding for your research, writing a DMP can be an invaluable exercise, for PhD and other postgraduate researchers for example. This was the focus of a recent training session delivered by my colleagues Rachel and Graham, ‘Research Data Management Essentials for your Research Degree’, delivered to a cohort of 24 PGRs with more signed up than showed up. There is a waiting list for future sessions, so there is clearly an appetite for RDM amongst PGRs – see here for details of future sessions.

Back to research funding bodies who, while I have no doubt they wish to make life easier for researchers, also require a DMP for somewhat more pragmatic purposes:

  • Transparency and openness – since many funding bodies are allocating public money, they have a responsibility to ensure research outputs are preserved and made accessible to the public
  • Return on investment – maximise the potential reuse of data; if someone has already invented the wheel, ensure money doesn’t need to be spent reinventing it

By now I do have some experience of real-life DMPs, mainly based on our application-stage template, which primarily serves to identify potential costs but which we also use to promote best practice. For the practical exercise I tried to expand beyond the parameters of the supplied case study, rather than just rewrite it in the form of a DMP.

For example, the scenario refers to computer-assisted telephone interviewing (CATI) software, where “the final, cleaned data will consist of a single SPSS file”. I googled CATI (it outputs something called Blaise data files) and tried to think through the implications of SPSS as a preservation format:

Data format and dissemination

Blaise data files are a proprietary data format and therefore not suitable for preservation and data sharing. SPSS is also proprietary software, but it is in widespread use within research institutions and unlikely to present access problems in the short to medium term. However, the data will also be exported as .csv where possible, and (anonymised) interview transcripts will be retained as plain text (.txt) to ensure accessibility without specialist software.
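The export step in that plan is straightforward to sketch. The records and filenames below are hypothetical, and in practice the data would first be read out of the SPSS (.sav) file itself, e.g. with a library such as pyreadstat; here I just assume the anonymised records are already in hand and show the write-out to open formats:

```python
import csv

# Hypothetical, already-anonymised survey records; in a real workflow these
# would be read from the SPSS (.sav) file rather than hard-coded.
records = [
    {"respondent_id": "R001", "age_band": "25-34", "response": "agree"},
    {"respondent_id": "R002", "age_band": "45-54", "response": "disagree"},
]

# Export the tabular data to .csv: an open, software-independent format.
with open("survey_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)

# Keep (anonymised) interview transcripts as plain text for the same reason.
with open("transcript_R001.txt", "w", encoding="utf-8") as f:
    f.write("Interviewer: ...\nRespondent R001: ...\n")
```

Nothing clever, but it makes the preservation argument concrete: anyone with a text editor can open the outputs, no SPSS licence required.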

I found DMPTool fairly user-friendly, though the NSF-SBE template arguably repeats requirements in some of the section guidelines – a point also commented upon by others – and could perhaps be clearer to avoid overlap.

The other aspect of data management planning that particularly interests me is the emphasis that a DMP should be a ‘living’ document, revisited throughout a research project. I may be wrong, but I get the sense that this is rather paid lip service, and I would like to explore tools and strategies to prompt a PI to proactively revisit their DMP. I was at pains to emphasise this in my own plan, stating that “it will be proactively reviewed on a monthly basis and/or at suitable project milestones (TBC)”. It won’t, of course, but I have the excuse that this is only pretend…

Research Data Management and Sharing – Coursera MOOC, week 1

True to my New Year’s resolution I’ve completed week 1 of the Coursera MOOC from The University of North Carolina at Chapel Hill and The University of Edinburgh.

I’ve found the course content reasonably engaging so far, largely as revision given my recent immersion in all things RDM, and I managed to pass the quiz on Understanding Research Data (14 items) at the first attempt, albeit not with full marks (86% – I got a question wrong on the Research Data Lifecycle, but that was more an issue of terminology. That’s my excuse, anyway).

I’ve been working at the sharp end of RDM for a whole month now and have obviously been on a steep learning curve (discounting that week watching telly and eating chocolates), so my reasons for wanting to do this course are not only to consolidate my own learning but also to engage with the RDM community and to check out options for RDM training here at the University of Leeds. I have previously explored the excellent MANTRA resource from the University of Edinburgh, which heavily informed my interview presentation, Research Data Management: Partnering with research, way back on 19th September.
The ‘M’ part of this MOOC does appear to be more minuscule than massive at the moment, with no activity on the discussion forums. Some other MOOCs I’ve taken have erred at the opposite extreme, with thousands of threads to sift through. I’m also currently the only one using the #RDMSmooc hashtag on Twitter.
Perhaps the first week in January is just not the best time for a somewhat niche area of interest, while the rest of the population is focused on trying to break their mince pie habit and other more mainstream resolutions? (According to a slide deck from Edinburgh’s Pauline Ward, the cohort for March–April 2016 comprised 1,294 “Active Learners” and resulted in 273 “Course Completers”.)
In terms of actual content, week 1 has covered the basics, “understanding research data”, in two main sections, “What are data?” and “Understanding Data Management”, comprising several short videos delivered by Dr Helen Tibbo with interactive transcripts, along with a summary and additional resources (which I’ve collated and added to an open Mendeley group here).
There were also a couple of supplementary videos featuring real academics and information professionals discussing their real-life experiences. I found these particularly engaging – perhaps unsurprising given my surfeit of theoretical learning to date – and also somewhat ironic given our “post-truth” age: “So when you make a statement about the world or about anything in it, you’re relying on some evidence for that statement.” Try telling that to the politicos!
As my own theoretical knowledge of RDM segues into the practical, I found Nancy Y. McGovern’s statement a suitable conclusion for week 1: “I certainly get the sense in libraries that people feel like we have a completely known set of data management practices. I don’t believe we do. I think for a lot of domains those practices are still emerging. So, I think that we should be really open to identifying new kinds of data management practices, and figuring out what our role might be in them.”
So, all in all, I’m looking forward to week 2, “Data Management Planning” – it would just be nice to have some company!

My New Year’s resolution…

To complete a MOOC!

I’ve started a few, including Research Data Management and Sharing from Coursera which is co-delivered by Helen Tibbo, University of North Carolina and Sarah Jones from the Digital Curation Centre based at the University of Edinburgh.
I’m not sure I’ve ever stuck to a New Year’s resolution either but the next cohort starts on 2nd January 2017 and runs for 5 weeks, which is longer than I’ve been formally working on RDM so I really have no excuse.
Having committed publicly here to blogging throughout the course, I hope this time will be different. If you would like to join me, sign up at Coursera and follow #RDMSmooc on Twitter.

The syllabus:

Week 1: Understanding Research Data
Week 2: Data Management Planning
Week 3: Working with Data
Week 4: Sharing Data
Week 5: Archiving Data
Season’s greetings and see you online in 2017 (when I also resolve to do more exercise, improve my diet, reduce my carbon footprint and generally be a better person…)
