Caroline Bolton, Special Collections Archivist, describes extracting archive catalogues as data. This is the second blog in the series focusing on ‘Catalogues as Data.’

Creating the dataset: Understanding the catalogue data (mapping it out)

To explore catalogues as data, I needed to extract a sample collection (its metadata) as a dataset. To do this, I needed to know what catalogue data was held in the collection management system (CMS), and where and how it was stored.

Archive catalogues feature descriptive, administrative and sometimes technical (meta)data about collections. Most are based on ISAD(G), the international standard for archival description. ISAD(G) was written over 20 years ago, and only 6 of its 26 elements are considered essential. It was useful to map the catalogue data against this standard to understand the scope of the data available. This would inform decisions about which elements (fields) I would extract and how best to do so.
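
To give a flavour of what that mapping can look like, here is a minimal sketch in Python. The CMS field names on the left are invented for the example; the ISAD(G) element names and numbers come from the standard itself.

```python
# A minimal sketch of a field-mapping table, written as a Python dict.
# Keys are hypothetical CMS field names; values are ISAD(G) elements.
CMS_TO_ISADG = {
    "RefNo":       "3.1.1 Reference code(s)",
    "Title":       "3.1.2 Title",
    "Date":        "3.1.3 Date(s)",
    "Level":       "3.1.4 Level of description",
    "Extent":      "3.1.5 Extent and medium of the unit of description",
    "CreatorName": "3.2.1 Name of creator(s)",
    "Description": "3.3.1 Scope and content",
    "Access":      "3.4.1 Conditions governing access",
}

# Fields with no ISAD(G) equivalent (e.g. technical metadata about digital
# files) can be listed separately so that a decision is made about each.
UNMAPPED = ["Checksum", "FileFormat"]  # hypothetical CMS-only fields
```

The first six keys happen to correspond to the six elements ISAD(G) treats as essential, which makes them a natural core for any extract.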

I also needed to look at the content (the values) to better understand whether it:

  • Was suitable for public access (no GDPR or security concerns)
  • Was structured or unstructured; numeric or text; single or multiple values, etc.
  • Used any standards for describing subjects, names, places etc.

Extracting the data (Formats)

Once I’d determined which elements to extract as a dataset, I needed to choose an export format. The options available were CSV (a line of plain text per record, with values separated by commas) or XML. Both are human and machine readable, and both were suitable for the various data tools I wanted to experiment with. XML also had the advantage of handling the multi-level structures of data relevant to archival descriptions. (Most archivists are familiar with XML through EAD, Encoded Archival Description.) However, I settled on CSV as it suited my skill level: I could put it straight into a spreadsheet.
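
To illustrate the difference in practice, here is a minimal Python sketch (not the actual export routine, and with an invented record and field names) that writes the same catalogue entry once as a CSV row and once as a small XML fragment:

```python
import csv
import sys
import xml.etree.ElementTree as ET

# One hypothetical catalogue record (field names and values are invented).
record = {"RefNo": "ABC/1/2", "Title": "Letter to a colleague",
          "Date": "1923", "Level": "Item"}

# CSV: one flat line of comma-separated values per record.
writer = csv.DictWriter(sys.stdout, fieldnames=list(record.keys()))
writer.writeheader()
writer.writerow(record)

# XML: the same record as nested elements; an <item> element could itself
# sit inside a parent <series> or <collection> element, which is how XML
# can carry the multi-level structure that CSV flattens.
item = ET.Element("item")
for field, value in record.items():
    ET.SubElement(item, field.lower()).text = value
print(ET.tostring(item, encoding="unicode"))
```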

Extracting the data and putting it into a spreadsheet involved a certain amount of trial and error. I made several exports, tweaking each to get the most comprehensive data. Below are some of the issues I encountered:

  • Archival hierarchies: CSV flattens multi-level descriptions into a flat file structure. To keep these relationships, it was vital to include the metadata fields that record them, e.g. “level” (collection/series/file/item) and “parent object” (see the short sketch after this list).
  • Catalogue data split across various modules (related database tables): Data from these was sometimes extracted as separate datasets (e.g. for persons/organisations, places or subjects).
  • Multiple values: Some fields had more than one value (e.g. many creators). Inserting a ‘delimiter’, a text character that separates these values from one another, resolved this.
  • Special characters and numbers: Some descriptions featured characters from multiple alphabets and symbols such as the copyright sign. To keep these, it was important to extract them as ‘Unicode text’. It was also important to format the spreadsheet columns so that dates, text and numbers were treated as intended.
  • Spreadsheet tools: All the extracts were created using standard spreadsheet functions. Both Google Sheets and MS Excel offer various intuitive options or wizards for working with the data, e.g. ‘text to columns.’ Some options differ between versions.
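
To show how several of these points fit together, here is a minimal Python sketch of the same steps expressed in code rather than in a spreadsheet. The file name, column names (‘RefNo’, ‘ParentRef’, ‘CreatorName’, ‘Level’) and the semicolon delimiter are all invented for the example: it reads the export as UTF-8, splits a multi-value field on the delimiter, and rebuilds the parent/child relationships from the flat file.

```python
import csv
from collections import defaultdict

DELIMITER = ";"  # the character chosen to separate multiple values in one cell

rows = []
children = defaultdict(list)  # parent reference -> list of child references

# 'catalogue_export.csv' and the column names are hypothetical.
with open("catalogue_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Split multi-value fields (e.g. several creators) on the delimiter.
        row["CreatorName"] = [c.strip()
                              for c in row["CreatorName"].split(DELIMITER)
                              if c.strip()]
        rows.append(row)
        # Record the hierarchy via the 'parent object' reference.
        children[row["ParentRef"]].append(row["RefNo"])

# Print each record with its level and its children, as a quick check that
# the multi-level relationships survived the flat-file export.
for row in rows:
    print(row["Level"], row["RefNo"], "->", children.get(row["RefNo"], []))
```

In practice the same steps can be done with the spreadsheet wizards mentioned above; the script is just another way of expressing them, and becomes more useful as the number of records grows.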

Extracting catalogues as data (for paper and digital collections):

Having settled on a core list of elements for the dataset, I applied it to two different collections:

Case Study 1: An analogue collection with heavily narrative descriptions, catalogued to item level.

Case Study 2: A hybrid archive (paper and digital), containing over 10,000 recently catalogued born-digital files.

A first look at outputs gave some useful insights:

  • Both collections were catalogued in detail (to file or item level), so produced comprehensive datasets. Datasets may be more limited for paper collections where it has not been feasible to catalogue to that level (e.g. where a More Product, Less Process (MPLP) approach has been applied). Digital archives show promise as bigger datasets, due to the high volume of digital files and because it is easier to bulk-extract item-level metadata programmatically.
  • Data quality can be inconsistent. One reason may be that ISAD(G) specifies that information should not be repeated at lower levels of description; another may be differences in cataloguing practice.
  • There is additional (technical) metadata about digital archives that isn’t reflected in ISAD(G) and doesn’t fit into the fields currently available in the CMS.

What this means is that not all catalogues will offer the same opportunities as data: both the quantity and the quality of metadata can vary. Knowing where the gaps are and how complete a catalogue is may be one way of prioritising collections; knowing how they might usefully be reused is another. I will look at this aspect in the coming weeks as I attempt to enhance, mine and visualise these datasets.