Leeds Research Data Survey 2012

Summary of findings from the 2012 RDM survey, including a link to the anonymised survey data

(Full survey results are online as an Excel spreadsheet at: http://tinyurl.com/c28lk5n)


The Leeds Research Data Survey ran from July 2012 to 1st Nov 2012. The survey was planned in conjunction with the University’s Research Data Steering and Working Groups. We reviewed use of the Data Asset Framework at other institutions including the Universities of Bath, Edinburgh and Southampton.

Questions we wanted to answer..

It was agreed the Leeds priority was establishing a high level overview of data assets and current research data management practices –  primarily to inform capacity planning. We aimed to maximise survey uptake by offering a relatively short set of questions. The survey was primarily targeted at PIs to avoid multiple submissions regarding the same data.

Survey questions are online here

What we did..

The survey was created using the Bristol Online Surveys platform and publicised in conjunction with the new University Research Data Management Policy via an email from the PVC for Research and Innovation. Publicity was sent to researcher email networks and to all Faculty Research Managers with encouragement to forward to their networks. A link to the survey was included in the For Staff area of the University website and in the August Staff enewsletter. Members of the Research Data Steering and Working Groups encouraged completion. No prize incentive was offered.

What we found..

Full survey results are online as an Excel spreadsheet at: http://tinyurl.com/c28lk5n


242 completed responses were received and analysed. It was surprising to find the largest percentage of survey responses came from the Arts Faculty, which has far fewer grant holders than Engineering, Environment or Medicine and Health. We had responses from all 9 Faculties at the University. Unsurprisingly, as the survey was aimed at PIs, 80% of our responses came from academic staff, with the rest mainly coming from research assistants and clinical staff, and around 7% from postgraduate research students.


Data types

A full list of data types is available from the survey spreadsheet. The top ten most common formats are:

Top ten data formats
 Data type                                                                               %
 Documents (e.g. text, Microsoft Word, PDF), spreadsheets
 Statistical data sets (e.g. SPSS, Stata, SAS)
 Books, manuscripts (including musical scores)                                           31
 Laboratory notebooks, field notebooks, diaries                                          30
 Questionnaires                                                                          28
 Photographs / other images                                                              28
 Interviews (including transcripts)                                                      28
 Laboratory instrument data (e.g. from microscopes, chemical analysers, monitors etc.)   24
 Computer software (e.g. modelling / simulation, schemas)                                24
 Models, algorithms, scripts                                                             20

The output from BOS also allows us to break down the types of data by Faculty.
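For anyone repeating this breakdown from the anonymised spreadsheet, the cross-tabulation is straightforward; the sketch below shows one way it could be done in Python. The file name and column headings are illustrative assumptions, not the actual BOS export labels.

# Illustrative only: cross-tabulate data types by Faculty from a survey export.
# "survey_export.xlsx", "data_types" and "faculty" are assumed names, not the real BOS headings.
import pandas as pd

responses = pd.read_excel("survey_export.xlsx")   # one row per survey response
# Assume the multi-select data types were exported as a semicolon-separated string.
exploded = responses.assign(
    data_type=responses["data_types"].str.split("; ")
).explode("data_type")

by_faculty = pd.crosstab(exploded["data_type"], exploded["faculty"])
print(by_faculty)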


The view across Faculties illustrates some differences in types of storage used, volume of data generated and how much data researchers believe would need to be kept for others to validate their research findings. Respondent numbers in some Faculties are low so findings may not be fully representative. Nonetheless, they may serve to illustrate broad Faculty differences and similarities – for example, respondents from all Faculties are making some use of Cloud storage, primarily Dropbox.

Only a handful of respondents are depositing their research data in a research data repository; the most commonly mentioned were the British Atmospheric Data Centre and the Protein Data Bank.



Respondents were asked to estimate how much their data volume was likely to change over time. The majority indicated an increase of 25% or less. There was a methodological issue with this question (some respondents were frustrated that ‘stay the same’ and ‘don’t know’ options were not offered); nonetheless, the small number of researchers anticipating growth of more than 25% suggests we may not face a huge increase in data to manage in the immediate term.
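As a rough illustration of what a 25% annual upper bound would mean for capacity planning, compound growth from a hypothetical 100 TB baseline looks like this (the baseline is an assumption for illustration, not a survey figure):

# Hypothetical projection: 25% compound annual growth from an assumed 100 TB baseline.
baseline_tb = 100.0
annual_growth = 0.25

for year in range(1, 6):
    projected_tb = baseline_tb * (1 + annual_growth) ** year
    print(f"Year {year}: {projected_tb:.0f} TB")
# Even at the 25% upper bound, storage only roughly triples after five years (~305 TB).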



40% of respondents do not create metadata for their research projects; 36% of respondents create metadata and 24% create metadata but did not realise this was what they were doing. Of the respondents who create metadata, 20% felt it was ‘full’, 46% ‘partial’ and 34% ‘extremely limited’. Some commentators recognised the importance of metadata and the need to be more systematic in creating it. Others noted that there may not be specific resource within a project to tackle metadata:

“Attempt to do as full a job as possible, but on large projects this is a huge undertaking – not usually funded separately to research time.”

Attitudes on the appropriate level of metadata to enable re-use varied:

“I create corpora for re-use by others, but cannot know what else they will use it for – so I cannot predict what metadata they would like, hence it is most practical to keep this to a minimum.”
“We have developed protocols for gold standard metadata collection and its presentation.”
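To make the difference between ‘partial’ and ‘full’ metadata a little more concrete, a minimal, discipline-neutral record kept alongside the data files might look like the sketch below. The field names are purely illustrative – they are not a prescribed Leeds schema or DCC standard.

# Illustrative minimal dataset metadata written alongside the data files.
# Field names and values are examples only, not a prescribed schema.
import json

metadata = {
    "title": "Interview transcripts, pilot study",
    "creator": "A. Researcher, School of Example Studies",
    "created": "2012-07-15",
    "description": "Semi-structured interviews; anonymised transcripts in .docx",
    "methodology": "See protocol.pdf stored in the same folder",
    "access": "Restricted until end of project",
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)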

Data Management Planning

The headline finding that 44% of respondents had completed a data management plan hides wide variation across Faculties, with a much lower percentage from MAPS, LUBS and ARTS and a higher percentage from ESSL and FMH. Respondents showed a wide variety of attitudes towards data management planning – it is clearly part of the culture in some subject areas, whereas others see it as unnecessary bureaucracy. Although some respondents were completing detailed and diligent DMPs, other comments were along the lines of “it was not very detailed, and I would hesitate to call it a formal DMP.”

[Chart: Has a DMP ever been completed for any of your research projects? – responses by Faculty]

Choose 3 words that best express what you see as the challenges of research data management at the University of Leeds. [Word cloud of responses]


What next..

65% of respondents were willing to be followed up. Some may be interviewed in more depth.

There are other RDM areas which are of immediate interest but which were not addressed through the survey. These could be explored through interview and/or in future surveys. Some examples:

  • More detailed profiling of data practices at a Faculty level, including identifying datasets in scope for an institutional data repository.
  • The appropriate point in the research data lifecycle data for repository deposit.
  • How long researchers anticipate keeping their data for.
  • Responsibility for research data over time e.g. if the PI leaves.
  • The amount and location of non-digital data.
  • Attitudes towards sharing research data.
  • Awareness of and reaction to the University Research Data Management Policy.
  • Training and guidance requirements.

Lessons learned..

  • The progress of RoaDMaP through ethical review took slightly longer than anticipated, resulting in the survey being publicised later than planned; anticipate and allow time for ethical clearance.
  • The number of free text comments suggests many of those filling in the survey were very interested in the area.
  • At 242, the number of responses was somewhat lower than hoped (approximately 10% of the target group). We could have been clearer about the target numbers – for example, what level of response would allow generalisations to be made about storage requirements at the Faculty level.
  • An individual champion for the survey in a Faculty can boost the response rate – for example, a direct approach from one of our Faculty of Biological Sciences professors.
  • Most researchers are using a variety of storage locations; there was some concern about how many researchers are using hard drives on desktop computers and portable storage devices (generally not as their only storage location).
  • Most datasets are
  • The proportion of data that researchers in different Faculties wish to retain will vary significantly.
  • Storage (access to and security of) was the main area of concern for researchers.

Summary of S3 ‘Managing Data Growth in Life Sciences’ Event

A summary of the S3 ‘Managing Data Growth in Life Sciences’ that was held in Cambridge on 23rd Jan 2013.

On 23rd Jan 2013, I attended the above event in Cambridge (http://goo.gl/qjWdW). This was a vendor-sponsored event, organised by S3, who describe themselves as ‘specialists in identifying and providing bespoke data storage solutions across rapidly changing business landscapes’.

Their mission statement is as follows:

S3’s aim is to become the UK’s leading Big Data and Virtualisation Infrastructure Specialist. Delivering a consultative led engagement, whilst focussing on vertical market segments, with an extensive knowledge of end user workflows as well as industry solutions. S3 strives to be technically independent and positioned amongst our customer base as both thought leader and trusted advisor.

The day started with an introduction from Gurdip Kalley (Business Development Manager) who introduced S3 as a company and explained the format of the sessions. Each of the four sessions covered a different aspect of managing large data sets and was presented as a panel format with three different vendors responding to questions from a member of S3 staff.

Panel one covered ‘Big Data’ – what, why and when is data ‘big’? – and was hosted by S3’s Technical Director Mark Smith. The vendors on the panel were EMC Isilon, Quantum and Panasas. The discussion centred on definitions of ‘big’ data, and the general consensus was that ‘big’ data refers to data sets which create particular challenges for an individual organisation because of their size; as such, there is no absolute size at which a data set becomes ‘big’. The three vendors gave a brief overview of their storage offerings. EMC provide fairly traditional storage arrays, whereas Quantum specialise in wide area storage (effectively providing RAID-type protection between geographical locations rather than between discs) and Panasas are experts in high performance parallel file systems which are almost infinitely scalable (certainly into the yottabyte range).
Panel two was hosted by S3 Account Manager Elliot Fonte and discussed ‘Managing an ever-shrinking backup window’ with CommVault, Quantum and Symantec. The vendors gave an overview of various de-duplication, backup-to-disc and backup-via-disc systems to help manage backups of large data sets. Much of the discussion centred on the need (or otherwise) to back up very large research data sets.
S3 Account Manager Ian Nave ran the third panel covering ‘Can SSD accelerate ‘time to discovery’?’ The vendors on the panel were EMC Flash, Violin Memory and Tintri. Tintri gave an overview of their optimised solid state based VM appliances and EMC / Violin talked about optimising systems to make use of the performance gains from using SSD.

The final panel was chaired by S3’s Technical Consultant Mark Treweeke and was entitled ‘Approaches to the long term retention of research data’. The three vendors, Arkivum, SpectraLogic and Crossroads, gave an overview of their archiving solutions. SpectraLogic offer a large-scale on-site archiving system that supports a variety of tape media. Crossroads offer their ‘Strongbox’ product, an ‘archiving in a box’ solution. Arkivum offer archiving-as-a-service, which can either be managed on-site or based around their two UK data centres. Any data saved to their on-site appliance (which is presented as a CIFS or NFS file system) is encrypted, checksummed and then pushed to their two data centres as well as a third-party escrow off-line library. They have a variety of cost models, including POSF (pay once, store forever – or, more accurately, for 25 years) and recurrent (pay each year per TB), and, unlike Amazon Glacier, they charge no retrieval or transmission costs.
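The checksumming step is worth dwelling on, since fixity checking is useful regardless of which archiving product is chosen. The sketch below is a generic illustration of generating a SHA-256 manifest before deposit so that files can later be verified; it is not Arkivum's own tooling, which we have not examined in detail.

# Generic fixity manifest: SHA-256 checksum for every file under a folder.
# Illustration of the checksumming idea only; not Arkivum's implementation.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(root: str, manifest: str = "manifest-sha256.txt") -> None:
    root_path = Path(root)
    with open(manifest, "w") as out:
        for item in sorted(root_path.rglob("*")):
            if item.is_file():
                out.write(f"{sha256_of(item)}  {item.relative_to(root_path)}\n")

write_manifest("data_to_archive")   # hypothetical folder name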

The RoaDMaP project in Leeds is currently working on a proof of concept pilot with Arkivum, the results of which will be the subject of a future blog post. In summary, it was a useful event which gave me the opportunity to speak to a number of vendors about what is on their technology roadmaps.

Tim Banks

Training session for research support staff: What do I need to know about research data management?

Reflections on a two hour training session, primarily aimed at research support staff, run by the RoaDMaP project in conjunction with the Digital Curation Centre

Take up of the course

The course was advertised as part of the professionalisation training programme at the University for research and innovation managers and administrators. The invitation email was sent from a well-known member of Research and Innovation Service staff (Head of Performance, Governance & Operations) via a general research support email list with 250+ members. Of the 30 places, we reserved 24 for research support staff and 6 for researchers, on the grounds that the group discussions would be richer with a researcher at each table. However, whereas the 24 research support places were snapped up straight away, we struggled to attract researchers – partly because we needed a longer lead-in time, but perhaps also because offering so few places for researchers sent a mixed message: why would researchers really be interested in a course primarily targeted at research support staff?

The speed with which the places were snapped up probably reflects the effect of publicising via a trusted source and certainly suggests research support staff need little persuasion to attend (a networking lunch was offered as an additional incentive). Having said this, after the initial flurry of booking, we did not attract people to the waiting list. So this first group may be an atypical, self-selected group of the most interested research support staff. This will become clear if we re-advertise and run the course again.


The course was a mixture of presentations and group work over 2 hours. The full programme including links to the presentations is online. The course structure was put together by RoaDMaP and DCC.

No prior knowledge of RDM was assumed. Presentations covered an introduction to data management, data management planning, sharing data and an overview of the RoaDMaP project / RDM developments in Leeds. The presenters aimed to emphasise areas likely to be of interest to research support staff – so, more information on funder requirements, less on details of digital curation.

Group work was lively and each group produced a poster with green Post-it notes for reasons to share data and pink Post-its for reasons not to share data. This was a quick visual way of seeing whether there was a bias. We expected we might see more pink, reflecting concerns about sharing data – in fact, the split was much more even, with somewhat more green stickers. All the responses are collated in a group work and feedback document and suggest the participants are already very well engaged with this topic and well aware of some of the major drivers and benefits of data sharing, as well as reasons not to share.

The second group work involved reading a sample data management plan and relating support activities around data management planning to participants' own roles or those of colleagues. Participants were invited to suggest areas for further training and support. This was a more challenging exercise, but on the whole conversation was lively and could have gone on longer than the allocated 20 minutes; it was useful to have DCC and RoaDMaP staff within groups to provide some facilitation or encouragement where needed.

Q and A was dominated by the practical implications for researchers and the reality of costs for such a significant undertaking across the institution.


Several participants noted their awareness of data management requirements and potential sources of support had increased. Some plan to share the presentations and handouts with colleagues, others felt better equipped to advise researchers, should they be approached, or were planning to seek further information to learn more about the research data management area.

Some feedback suggests more detailed information about data management planning would be welcome, as would further detail of acceptable costs from research funders.

The level of the course seemed about right for the group, though a couple of participants who were new to the research support field felt too much knowledge was assumed.

Feedback highlighted that we need better and higher profile supporting materials online and a clearer message from the institution about its expectations of researchers in this context.

 What next?

We know from experience (for example, our White Rose RDM event in 2012) that getting people together from different professional groups – research support, researchers, librarians, IT staff – is highly valued. However, awareness raising in specific stakeholder groups can be valuable in its own right – particularly at the introductory level. We need to move towards creating robust referral networks and communities of practice at the institution, which will mean bringing together people with a variety of research data management roles so we can each benefit from multiple perspectives and build a richer picture of research data management  across the institution. However, we are at a relatively early stage, with many basic processes and guidelines still to be put in place; cross-team training may be more effective as research data management matures.

We aim to re-run the course later in the academic year and review effectiveness and embedding with research support colleagues.

Pilot training session – Engineering researchers

Reflections on a pilot training session ‘Preserving your research data for future use’

The University of Leeds piloted a Research Data Management course with Engineers on 12th December 2012. Fourteen researchers participated. We had initially planned to aim the session at PhD students – however, publicity for the course attracted a more diverse group: four attendees were PhD students but the rest were early career or relatively early career researchers, including three PIs. In a previous blog post, Dr Jim Baxter reflected on creating the training course. Having run the course, there are some lessons learned, but overall the reception was extremely positive.

The course ran for two hours, delivered by Jim Baxter, Graham Blyth and Monica Duke from the Digital Curation Centre. DCC had some input into the design of the course, as did the RoaDMaP working group on training.

Course Aims

“To raise awareness of the challenges of preserving and managing research data and approaches to addressing these challenges. By the end of the session participants will be better able to:

  • Describe the forms research data takes and the role of contextual documentation and metadata in enabling data reuse
  • Describe how managing research data effectively will improve your research, save you time, decrease the risks of data loss, increase your professional impact and identify tools to help
  • Describe University of Leeds and research funder data management expectations
  • Identify sources of information and guidance on managing research data effectively, including additional training courses”


We obtained rich feedback from our pilot participants in a facilitated session over lunch and also via paper feedback forms – 10 of the 14 submitted fully completed evaluation forms; of these, 5 rated the session as Mostly Useful and 5 as Very Useful.

A full summary of feedback comments from the event is online here.

Notable points arising from the feedback:

  • The length of the course (2 hours) was felt to be about right, as was the number of participants (14-20). The group based activities and discussions were particularly liked.
  • Several participants stated what they had learned about data management planning and metadata was valuable and similar training should be offered to their colleagues.
  • A number of participants believed the course would have an impact on their practice – from reviewing backup procedures, to changing their approach to file formats, to writing a full data management plan.

Lessons Learned

  • All participants were from the Engineering Faculty and we assumed, incorrectly, they would know each other. An icebreaker including introductions around the tables would have been useful and will be incorporated into future sessions.
  • There may be scope for an additional session with a more practical focus aimed primarily at PIs. This could be more information based and backed up by information on the web.
  • Participants wanted good practice guides to run alongside the training.
  • We used a sample data management plan from a different subject discipline (social science) – primarily because we did not identify an appropriate Engineering related resource. Although one participant felt an Engineering DMP would have been more appropriate, others observed that an example from a completely different discipline encouraged them to think more broadly about research data management, potentially moving away from entrenched disciplinary norms. So, consider incorporating examples from other disciplines – it can encourage creative thinking.

What next

  • Participants will be followed up by email in March to see whether the training has had an impact on their practice.
  • Tim Banks, who works with the RoaDMaP project, offered to meet with participants individually to discuss their own data management plans – three attendees indicated they would take up the offer.
  • The next session will run with Social Science researchers in February 2013.
  • We should remove the ‘pilot’ label – probably after the next delivery – to emphasise this is ‘real’ training (having said this, pilot is a useful descriptor where there is an emphasis on gaining feedback from participants).
  • Additional training capacity will be required if the course is to scale up and reach a large proportion of researchers. We need clear longer term ownership for the course and an assessment of how often it can run with existing / additional resources.
  • Research data management is to be included in the new ULTRA (University of Leeds Teaching and Research Award) course which new members of academic staff will need to complete. This will complement the RoaDMaP training course. Over time, we assume many aspects of research data management will become standard practice.

Slides from the pilot training session

Handbook from the pilot session