Marco Brunello reflects on his time as a Bridging the Digital Gap trainee in Leeds University Library Special Collections.

After a year and three months, my traineeship has come to an end. I joined the Special Collections Team at Leeds University Library in October 2019 as a trainee under the Bridging the Digital Gap scheme by The National Archives.

Over this time, I had the chance to learn about archives and digital preservation on the job. I have been involved in a variety of projects, including:

  • The identification and standardisation of labels for digital carriers (CDs, DVDs, etc.) in the Collection. These new conventions are already being used by staff, and there will be a process of updating existing records in 2021.
  • Optical character recognition (OCR) on book indexes from the Cookery Collection – in collaboration with the AHRC Food Network. A major outcome was the implementation of standardised indexing, which enables researchers to browse subjects across a wide range of previously incomparable texts.
  • The creation of a strategy to identify authors in the Collection whose works should be in the public domain – in collaboration with the library-wide Digital Scholarship Group. Here my skills in data manipulation, parsing and analysis were helpful in turning a very complicated dataset into a more manageable and contextualised source of information for senior staff.
  • The South Bank Show Production Archive transcription project, described in the linked post.

One project I am particularly proud of involved the digital files in The Papers of Janina and Zygmunt Bauman collection. This consists of 21 GB of data collected roughly between 1985 and 2017, whose content couldn’t be processed and analysed manually because of the number of files and the variety of file formats involved. With the support of a dedicated special interest group, I tackled this issue by setting up the following workflow:

[Image: A selection of Amstrad-formatted CF2 floppy disks from the Zygmunt and Janina Bauman collection]
  1. Identify file formats (with DROID) and extract metadata (with ExifTool)
  2. Select a minimal subset of features for cataloguing
  3. Discard unnecessary files, such as temporary files
  4. Identify and mark sensitive information (with bulk_extractor)
  5. Map relevant information to the cataloguing system
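
As an illustration only, steps 2 and 3 of a workflow like this could be sketched in Python along the following lines. This is not the code used in the project: the column names are assumed to match a DROID CSV export, the list of temporary-file patterns is hypothetical, and the sample rows are invented for the example.

```python
import csv
import io
from fnmatch import fnmatch

# Hypothetical patterns for files to discard; real rules would be agreed per collection.
TEMP_PATTERNS = ["~$*", "*.tmp", "Thumbs.db", ".DS_Store"]

# Assumed DROID CSV column names; adjust to the columns in your own export.
KEEP_FIELDS = ["FILE_PATH", "NAME", "SIZE", "PUID", "FORMAT_NAME"]


def is_temporary(name):
    """Return True if the file name matches any discard pattern (case-insensitive)."""
    lowered = name.lower()
    return any(fnmatch(lowered, pattern.lower()) for pattern in TEMP_PATTERNS)


def select_for_cataloguing(csv_text):
    """Keep only non-temporary files and only the fields chosen for cataloguing."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("TYPE") == "Folder":
            continue  # folders are not catalogued as digital files
        if is_temporary(row["NAME"]):
            continue  # step 3: discard unnecessary files
        # step 2: keep the minimal subset of features
        records.append({field: row.get(field, "") for field in KEEP_FIELDS})
    return records


# Invented sample rows, for illustration only.
SAMPLE = """NAME,FILE_PATH,SIZE,TYPE,PUID,FORMAT_NAME
letters,/disk01/letters,,Folder,,
essay.doc,/disk01/letters/essay.doc,24576,File,fmt/40,Microsoft Word Document
~$essay.doc,/disk01/letters/~$essay.doc,162,File,,
Thumbs.db,/disk01/Thumbs.db,8192,File,,
"""

if __name__ == "__main__":
    for record in select_for_cataloguing(SAMPLE):
        print(record)
```

Keeping the discard patterns and the retained fields in two short lists at the top is one way to make such a script easy to adapt for another collection.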

As a result, users will be able to explore this collection and discover several features of each digital file, such as file format, size and original file path, along with format-specific information such as image size, video encoding and word count.

I have documented every step and tool employed along the way, and designed this workflow so that it can be reused and adapted for future projects. My experience in Special Collections has been very positive. What I have learnt in the past 15 months has equipped me with the experience I needed to start a career in the archive sector. Thanks to this, I have already secured my next role as an archive assistant at the West Yorkshire Archive Service.