This post is by Research Data Management Adviser Karen Abel

We tucked into tasty tips on 18th March 2021 at the second in the Library’s series of Open Lunches reflecting on open research, thanks to very interesting talks from Alex Coleman, Research Software Engineer, and Daniel Valdenegro Ibarra, PhD researcher, School of Politics and International Studies.  Each shared a great set of concepts and tools for managing research data: how to keep it organised for your own sake and how to make it reproducible by others.  Many of these tips will make immediate sense to those who are already IT savvy, but there is plenty of support to be snapped up by those wanting to up their computer research skills.  So read on to discover tools to make your research reproducible and easier to manage…

For starters

Alex began by underlining the point that, even within the same lab, if the notes and documentation that make sense of a dataset aren’t kept organised and visible, it becomes impossible to develop research from one generation of researchers to another, never mind making the data useful for researchers beyond.  Fortunately, though, he had a whole suite of recommendations for keeping on top of data management and capturing the context as you go.  First up, he addressed the basics: version control and project structure.  Out with endless files saved as ‘version 1’, ‘version 1.1’, ‘version 1.1.2’ etc., and in with using one file where the history of changes is automatically recorded and ‘versions’, or points at which a change was made, can easily be returned to.  Git, which feeds into GitHub, is the recommended tool here at Leeds.  For project management, Alex suggested Cookiecutter and RStudio as great entry points for building a good project foundation.

Alex’s slides are available here: Tools for Reproducible Research

Daniel, too, supported the concept of version control, particularly when working in a collaborative capacity.  Simply by using Google Docs or Office 365, he advised, you can take advantage of the cloud-based environment for collaboration and of the built-in version controls.  He also recommended pre-registering your research with OSF which, as well as version control, offers data and code sharing options.  Plus, by pre-registering, you prevent HARKing and p-hacking and get persistent identifiers for virtually everything, right down to the code.  (For more on pre-registration, do come to the Open Lunch on 1st July.)  Daniel explained, too, how GitHub and GitLab can be used as a way of pre-registering because everything you enter has a persistent timestamp, including any changes and any rolling back.  The branching and collaborative facilities also make them great for team projects.  Be careful though, he warned, as Git is not a recommended space for sensitive data.  It also uses quirky jargon and can have a steep learning curve, so he advised going to a Research Computing tutorial to get familiar with it!

SWD2: Version Control with Git and GitHub

Daniel also recommended this series of videos about Git for non-coders.

Other recommended tutorials include:

Books with Jupyter

Project Management With RStudio

Magic ingredients

Capturing and sharing recipes and ingredients was an important theme for both our speakers.  Alex described how, if you write programs in Python, for example, you might use specific packages for aspects such as graphics and analysis, and you might find the tool Conda useful.  Conda allows you to install all those packages and their dependencies in a siloed environment on your computer.  Having done this, it can produce a recipe file, in text format, listing all the packages and versions that you’ve used for your project.  This recipe is then perfect for sharing with other people.
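
To make the idea of a recipe file a little more concrete, here is a minimal sketch in Python.  It is not Conda itself (Conda produces its own recipe with `conda env export`), and the function names `build_recipe` and `write_recipe` and the file name `recipe.txt` are made up for illustration.  It simply lists every Python package installed in the current environment, with its version, and writes the list to a plain text file.

```python
from importlib.metadata import distributions


def build_recipe():
    """Return sorted 'name==version' lines for every installed Python package."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip broken installs with missing metadata
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


def write_recipe(path="recipe.txt"):
    """Write the recipe to a plain text file (hypothetical default name)."""
    lines = build_recipe()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines
```

Calling `write_recipe()` would leave a small text file that a colleague could read with any editor.  Unlike a real Conda recipe, this sketch only sees Python packages, so treat it as an illustration of the principle rather than a replacement.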

Another situation Alex discussed where a recipe file is handy is the use of a virtual machine.  If, for example, you are running a Linux operating system on top of a Windows system then it will no doubt involve a fair bit of configuring of the virtual machine, the operating system and any libraries it requires.  He recommended making use of Vagrant, a tool that will capture all of this in recipe file format.

Further still, Alex described how you could bypass the need for a virtual machine by using containers.  This way, you can bundle up all the ingredients into a single portable package which can be passed to someone else to run using, for example, Docker.

Daniel stressed the importance of recipes for qualitative research in particular.  Qualitative research, he said, doesn’t always lend itself so well to replicability but reproducing the procedure can be done very effectively if there are detailed records of the methodology.

Alex had some great suggestions for tools to capture the ingredients of workflows: Bash, GNU Make, Luigi and Snakemake all got a mention.  Snakemake captures all the dependencies of a workflow in both graph and written format, so it’s easy to read and understand and can be shared with other people.  It’s easy to learn, available on all platforms and compatible with HPC, so workflows can scale up from a laptop to a cluster.  All of this is captured in a (you guessed it) recipe file.
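
The core idea behind tools like GNU Make and Snakemake, working out which steps depend on which and then running them in a valid order, can be sketched in a few lines of Python.  This is not Snakemake syntax: the step names below are hypothetical, and the standard library’s `graphlib` module (Python 3.9 or later) stands in for the real workflow engine.

```python
from graphlib import TopologicalSorter

# Hypothetical analysis workflow: each step maps to the steps it depends on.
WORKFLOW = {
    "clean_data": set(),
    "fit_model": {"clean_data"},
    "plot_results": {"fit_model"},
    "write_report": {"fit_model", "plot_results"},
}


def run_order(workflow):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())


print(" -> ".join(run_order(WORKFLOW)))
# prints: clean_data -> fit_model -> plot_results -> write_report
```

Writing the dependencies down as data, rather than keeping them in your head, is what makes the workflow readable by other people and repeatable by a machine.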

Jupyter and R Markdown were rated by Alex as great ways of annotating code with text and graphics all in the same place.  This makes communicating your steps and analysis very easy.  Extra tips to ‘reproducify’ your notebooks were: use BinderHub to serve your Jupyter Notebook to other people on the web; use Google Colab to create an already web-accessible notebook; use Jupyter Book to convert your notebook into PDFs or interactive websites.

Coffee and final thoughts

Daniel stressed the importance of safe and secure data sharing and suggested taking full advantage of your university’s support.  He advised using only university supported systems and tapping into the institution’s knowledge.  Be mindful too, he warned, of local laws and licences which may apply.

He gave some great common-sense advice to sum up: only pick tools that make your life easier; if they don’t, they’re not worth it!  And, for best transparency and reproducibility, use the ‘lowest common denominator’ format, i.e. the one that most people/applications will be able to read, e.g. plain text.  What we’re aiming for, he underlined, is transparency and open standards and, as a community, we should push for that.

Many thanks to Alex and Daniel for the excellent advice.  Do join us for our next Open Lunch on 22nd April 2021, where we will be hearing about open and reproducible research in the social sciences, from Viktoria Spaiser, Associate Professor in Sustainability Research and Computational Social Sciences.

Recording on YouTube

You can watch the full recording on YouTube, or jump straight to the start of Alex’s talk or Daniel’s talk.