Special Collections has been exploring ways we can improve access to our collections through the transcription of handwritten documents.
Transcription can provide new opportunities to explore the rich content of archives. However, it can take a lot of time and resources to complete, so we wanted to investigate whether an exciting new tool could help us to automate this process.
Transkribus is a transcription platform developed by the EU-funded READ Project. It uses Handwritten Text Recognition technology based on machine learning to produce automated transcriptions of handwritten documents. Special Collections has signed a Memorandum of Understanding with READ to try out the platform on our own project.
The archives we have chosen to focus on first are the patient case books of the Leeds surgeon William Hey (1736-1819). The case books were kept as part of Hey’s private medical and midwifery practice, detailing hundreds of patient cases he treated between 1759 and 1809.
With 22 digitised case books all in the same hand, containing just under 4,000 pages, we felt they were a good candidate to trial with Transkribus.
In general terms, Transkribus works by training the Handwritten Text Recognition (HTR) engine to “read” handwriting in digital documents. To do this, each page of the document needs to be prepared by “segmenting” it into text regions, lines and baselines, so that each line from the image can be matched to the correct line of the transcript.
A sample of accurate transcriptions is then needed for the training. Once training is completed, a model is created which can be applied to new pages of handwriting to produce automated transcriptions. Lots more detail about this can be found on the Transkribus Wiki.
To create our first model for the Hey case books, we used around 15,600 words from 85 pages of transcription. Some of this was pre-existing transcription that had been produced as part of a student internship. Transcription did bring with it some challenges, particularly when trying to decipher the 18th-century apothecary measures and medicines Hey was using!
The accuracy of an HTR model is measured by its Character Error Rate (CER) and Word Error Rate (WER), each expressed as a percentage: the lower these are, the more accurate your automated transcription. Currently, a CER of below 10% is considered an excellent result.
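To give a feel for what these percentages mean, here is a minimal Python sketch of how CER and WER are commonly calculated: count the smallest number of insertions, deletions and substitutions needed to turn the automated transcript into the correct one, then divide by the length of the correct one. This assumes the standard edit-distance definition; Transkribus may normalise or count things differently. The function names and the sample line are our own inventions for illustration and are not taken from the platform or from the case books.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate as a percentage, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Made-up example line, not a real case book entry.
correct = "the patient was bled"
automated = "the patiert was bled"

print(f"CER: {cer(correct, automated):.1f}%")
print(f"WER: {wer(correct, automated):.1f}%")
```

In this example a single wrong character gives a CER of 5% (1 of 20 characters) but a WER of 25% (1 of 4 words), which is why WER is usually the higher of the two figures.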
The Transkribus team were able to produce two HTR models for us. The first was based solely on the Hey papers, and the second also incorporated a pre-existing model trained to recognise the handwriting of the English philosopher Jeremy Bentham (1748-1832).
The first model obtained a CER of 11.8%, whilst the second brought this down to 8.24%. So we have a model which can produce automated transcripts of documents written by Hey where over 90% of the characters will be accurate – a fantastic start!
We’re keen to keep experimenting with Transkribus to try to improve this accuracy, as well as to explore its potential for use with other collections.