Archivist Caroline Bolton summarises some of the considerations in planning to publish catalogue data.
For most organisations, catalogue data is published as the online catalogue. Here catalogue data is curated, siloed and cannot be easily re-used.
For researchers this means that access to the data is limited to a request basis. For the archivist, providing ad-hoc access can become time-consuming and unsustainable at scale.
Proactively publishing catalogue data that is FAIR (Findable, Accessible, Interoperable, Reusable) offers an efficient and effective alternative.
FAIR principles were established as best practice in the data/science community but they are increasingly seen as relevant to the GLAM (Galleries, Libraries, Archives and Museums) sector in promoting access to and (re-)use of its digital collections, whilst respecting where rights restrictions need to apply.
Accessibility, interoperability, and reusability:
For publishing catalogue data this might include providing:
- Open formats: This means machine- and human-readable, reusable and non-proprietary formats for different types of data and different skill levels. For tabular catalogue data this might mean CSV for non-specialists and XML/JSON for researchers with coding expertise.
- Licence: Setting out terms for what researchers can do with the data. Ideally this should be as open as possible. Adopting a Creative Commons licence is an easy way of doing this. Whilst non-commercial licences such as CC-BY-NC-4.0 are commonly used by GLAM organisations, allowing commercial use with CC-BY-4.0 supports creative reuse for new products and services. The heritage funder National Lottery has recently moved to this.
- Identifier/description: Providing a description of the source metadata can help researchers to better understand the data. Assigning a dataset its own unique persistent identifier, such as a DOI (digital object identifier), can be useful for citation.
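As a rough illustration of the points above, the sketch below exports the same records as CSV and as JSON, with the licence and a short description travelling alongside the data. The records, field names and dataset details are all hypothetical, and a real export would come from your own catalogue system.

```python
import csv
import io
import json

# Hypothetical catalogue records; the field names are illustrative only.
records = [
    {"reference": "ABC/1", "title": "Minute book", "date": "1901-1905", "level": "item"},
    {"reference": "ABC/2", "title": "Letter book", "date": "1906", "level": "item"},
]

# Hypothetical dataset-level description, including the licence.
dataset_info = {
    "title": "Example catalogue export",
    "licence": "CC-BY-4.0",
    "description": "Item-level records exported from the online catalogue.",
}


def to_csv(rows):
    """Return the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows, info):
    """Return the records as JSON text, wrapped with the dataset description."""
    return json.dumps({"dataset": info, "records": rows}, indent=2, ensure_ascii=False)


csv_text = to_csv(records)
json_text = to_json(records, dataset_info)
```

The CSV suits spreadsheet users, while the JSON keeps the licence and description in the same file as the records, which may suit researchers working with code.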
There are various places where catalogue data might be published, each with its own benefits and limitations:
- Publishing on the catalogue website or to a cloud repository such as Google Drive
- Publishing to an organisation’s bespoke (open) data repository/API such as the Natural History Museum or Museum of Chicago
- Publishing to an open source service such as GitHub (widely used by data specialists)
- Contributing to an aggregator that brings together collections from many organisations and offers re-use options
Scaling up: The role of aggregators
Re-using catalogue data is vital in contributing to aggregators. As well as improving discoverability, aggregators can provide large volumes of data and content at a scale more useful for research and training artificial intelligence (AI). With a growing number of GLAM aggregators it is important to understand their differences:
- Scope: may include complete catalogue data, object metadata, digital content and/or curated content (JSTOR, Google Arts and Culture, Internet Archive)
- Geographic or sector coverage: for example Archives Portal Europe, Europeana and Art UK (galleries)
- Services: such as transforming data (Archives Hub Names project) or supporting re-use (National Archives Discovery provides an API that makes it easier for other web services/developers to re-use the data).
- Terms and conditions: including maintenance and re-use
In practice this means organisations may want to publish to multiple aggregators, but each may have different (meta)data models and formats for contributing (EAD-XML, CSV). Some may have harvesting or API tools to make contributing easier but the effort to set up and standardise for each can be costly.
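To give a sense of the standardisation effort involved, the sketch below maps one internal record into two contribution formats: a CSV-style row with renamed columns, and a simplified EAD-style XML component. The record, the column mapping and the target column names are hypothetical; the XML uses a handful of EAD element names (c, did, unitid, unittitle, unitdate) but is only a fragment, not a complete or validated EAD document.

```python
import xml.etree.ElementTree as ET

# Hypothetical internal catalogue record.
record = {"reference": "ABC/1", "title": "Minute book", "date": "1901-1905", "level": "item"}

# Hypothetical column names expected by an aggregator that accepts CSV.
CSV_COLUMN_MAP = {"reference": "identifier", "title": "title", "date": "date_created"}


def to_aggregator_row(rec):
    """Rename internal fields to the aggregator's (assumed) column names."""
    return {target: rec[source] for source, target in CSV_COLUMN_MAP.items()}


def to_ead_component(rec):
    """Build a simplified EAD-style <c> component as an XML string."""
    component = ET.Element("c", attrib={"level": rec["level"]})
    did = ET.SubElement(component, "did")
    ET.SubElement(did, "unitid").text = rec["reference"]
    ET.SubElement(did, "unittitle").text = rec["title"]
    ET.SubElement(did, "unitdate").text = rec["date"]
    return ET.tostring(component, encoding="unicode")


row = to_aggregator_row(record)
ead_fragment = to_ead_component(record)
```

Each additional aggregator would need its own mapping along these lines, which is where the set-up cost accumulates.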
Web of data: Linking Data
For this reason, an alternative that is increasingly being explored within the GLAM sector is to publish as ‘linked data’ on the web. This means:
- enhancing the data to identify ‘things’ such as people and places in your data using persistent identifiers (HTTP URIs). This will create links to other people’s data that people or machines can further explore. (The principle of creating ‘interlinked data’ is becoming more commonplace with growing interest in Wikidata).
- publishing as RDF (Resource Description Framework).
Although there will still be a cost to achieve this, it offers the benefit of being able to reuse the same data for multiple applications such as aggregators or search engines, improving discoverability. With ongoing investigations across the GLAM sector, such as exploring the role of Wikidata and the practicalities of transforming catalogue (meta)data into ‘linked open data’, it may offer a sustainable means of managing and publishing data about and within collections.
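A minimal sketch of the two steps described above, assuming a hypothetical catalogue base URI (https://example.org/catalogue/) and a hypothetical record: a place is identified by a Wikidata URI (Q84, London) rather than a text string, and the record is written out as RDF in the N-Triples format using Dublin Core terms. In practice an RDF library and an agreed data model would normally be used; this hand-rolled version is only meant to show the shape of the output.

```python
# Dublin Core terms namespace (real) and a hypothetical base URI for records.
DCTERMS = "http://purl.org/dc/terms/"
BASE = "https://example.org/catalogue/"

# Hypothetical record; the place is a Wikidata URI (Q84 is London).
record = {
    "reference": "ABC-1",
    "title": "Minute book",
    "place_uri": "http://www.wikidata.org/entity/Q84",
}


def literal(text):
    """Quote and escape a plain string as an N-Triples literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(rec):
    """Serialise one record as N-Triples, one statement per line."""
    subject = f"<{BASE}{rec['reference']}>"
    triples = [
        (subject, f"<{DCTERMS}identifier>", literal(rec["reference"])),
        (subject, f"<{DCTERMS}title>", literal(rec["title"])),
        (subject, f"<{DCTERMS}spatial>", f"<{rec['place_uri']}>"),
    ]
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


ntriples_text = to_ntriples(record)
```

Because the place is a shared URI rather than a local label, the same output could in principle be consumed by different aggregators or linked to other datasets that use the same identifier.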
Advocacy: Build your audience
Simply publishing data does not mean that researchers will use it. Encouraging re-use by providing an opportunity to collaborate, learn and understand the data needs of researchers and technologists can enable you to explore ways of using the data that you might never have anticipated. Various guides are available for running events such as ‘datathons’ or ‘digital labs’.
Practical next steps
There is clearly still a lot to learn from the current investigations across the GLAM sector. In the meantime, this doesn’t mean that we can’t start small and experiment, perhaps with a limited number of datasets or with a few tools such as Wikidata. It’s from these experiments that we will start to better understand the practicalities and benefits for our own organisations and researchers.