Let’s start with the plain text files.
In her blog posts “What have we got in our digital archive?” and “My first file format signature”, Jenny Mitcham described the range of files appearing in the York repository, the use of available tools to identify these files, and the process of registering a file format in PRONOM with The National Archives. The original post describing the data profiling at York is “Research data – what does it *really* look like?”
At Leeds we see a similar mix of file formats, though perhaps with more arising from scientific instruments and software. Such file formats are sometimes recognised by the tools, though often not. For those that are binary, the registration process Jenny describes can be applied. What about the ASCII or text/plain files?
An example: At Leeds we make extensive use of a finite element software package called Abaqus (https://www.3ds.com/products-services/simulia/products/abaqus/) produced by Dassault Systemes. Have a look at one of our very earliest datasets and the sample Abaqus input file.
(Segmentation_model_S1a.inp in Hua, Xijin and Jones, Alison (2015) Parameterised contact model of pelvic bone and cartilage: development data set. University of Leeds. [Dataset] https://doi.org/10.5518/3).
Extracts from file Segmentation_model_S1a.inp
Header with start of section that defines the geometry
(then 200000 more Node lines)
Material properties and boundary conditions
(lots more lines of settings and processing commands)
Final step in the processing
Such files have a .inp file extension and can be created through one of the Abaqus suite of tools or manually using a text editor. The file header (lines starting with ** %) is produced by the Abaqus tools but is not processed, and may not be present if the file is edited or created manually. The file contains some initial keyword-based lines that set up the task, a large section that defines the mesh over the geometry, a series of keyword-based commands that define material, boundary, and contact conditions, and then a keyword-based section that defines the analysis itself. In its use of a structure with a controlled vocabulary, parameters, and values, it is much like a LaTeX, HTML, or XML file.
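To make that structure concrete, here is a minimal Python sketch that tallies the three kinds of line found in an input file: comment lines (starting with **), keyword lines (starting with a single *), and the data lines that follow them. The function name `summarise_inp` and the short sample text are my own illustrations, not taken from Segmentation_model_S1a.inp, and the sketch assumes only the basic line conventions rather than the full Abaqus syntax.

```python
from collections import Counter


def summarise_inp(lines):
    """Count keyword occurrences, comment lines and data lines.

    Assumes the basic conventions of an Abaqus-style input file:
    '**' starts a comment, a single '*' starts a keyword line, and
    anything else that is not blank is treated as a data line.
    """
    keywords = Counter()
    comments = 0
    data = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("**"):
            comments += 1
        elif line.startswith("*"):
            # The keyword name is the text before the first comma;
            # any parameters (e.g. NAME=BONE) follow it.
            name = line[1:].split(",", 1)[0].strip().upper()
            keywords[name] += 1
        else:
            data += 1
    return keywords, comments, data


# Hypothetical sample, invented for illustration only.
sample = """\
** % hypothetical header comment
*HEADING
Illustrative model
*NODE
1, 0.0, 0.0, 0.0
2, 1.0, 0.0, 0.0
*MATERIAL, NAME=BONE
*ELASTIC
17000.0, 0.3
*STEP
*STATIC
*END STEP
""".splitlines()

keywords, comments, data = summarise_inp(sample)
for name, count in keywords.items():
    print(name, count)
print("comment lines:", comments, "data lines:", data)
```

For a real file, the same function could be called with an open file object, for example `summarise_inp(open("Segmentation_model_S1a.inp"))`. A profile like this is no substitute for a format signature, but it shows how much of the file's meaning is carried by a small, documented vocabulary.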
The preservation tools correctly identified the file as mime-type text/plain. It is therefore human readable and no doubt understandable to a scientist in that field, so to some extent it can already be regarded as “preserved”. There are also software vendor manuals that define the keywords and commands. With knowledge of finite element analysis, the input file, and the manuals, a scientist in this discipline could reproduce the analysis whether or not they had a copy of the Abaqus software. Can we regard it as “even more preserved”?
If we do regard files in this format as “preserved”, would it be reasonable to register the format in PRONOM even though we won’t be able to provide a signature, as has been done for .py (Python script) and a number of other formats?
If so, should we work with the software vendors to create and maintain these registrations? Has this approach to preservation been explored before?
I have just started exploring digital preservation with Dassault Systemes. More to follow.