What is preservation?
Putting your data on a spare USB stick is not preservation. Backup is not preservation. Putting the data in a repository is not preservation.
Digital preservation is a series of activities aimed at ensuring continued access to digital materials. The practice is defined very broadly and covers all the actions needed to maintain access to data regardless of media failure or technological/organisational change. Preservation applies to all data, including “born-digital” material and the products of digitisation processes.
In short, the reasons to preserve data are credibility, reproducibility and mandate.
Funders have begun to mandate that the data underpinning published research that they have funded is made freely available for other researchers to use (note that this is not the same as ‘Open data’ – some data covered by this mandate may only be made available under strictly controlled conditions). The mandate is often further interpreted as ‘keep available and usable’. There is little value, after all, in keeping data if it cannot be used. Note that some institutions also require data to be retained for fixed periods of time. Check with yours to find out whether this is the case.
Preservation systems are designed with future usability in mind.
They make multiple copies of preserved data, automate checks to make sure data has not been changed or damaged and convert data in old formats to newer ones so it can be opened with new technology. These systems are also sometimes designed to make metadata discoverable to other researchers and users. Where appropriate, preservation systems may also provide emulation services.
Preserving the data and/or computer code underpinning research also provides a degree of protection for the reputation of a researcher and/or institution. This practice allows others to reproduce and validate results – in some cases, many years after the original project.
Keeping content and context safe
Preservation is important because both content and context need protection (and research funders know it). When data and/or computer code are preserved with appropriate metadata, content and context are safely stored and can be reused in the future. If either is lost, the data is immediately at risk. Sometimes a change in file format alone is enough to make your research unusable. This is why thinking about file management and formats is so important! This does not mean, however, that you should preserve everything you produce. As part of a project, you should select data for preservation and delete what does not need sharing.
Digitally endangered species
The Digital Preservation Coalition maintains a list (the Bit List) of digitally endangered species, including the types of data the community believes are at risk. Among these, unpublished research outputs are classified as critically endangered – which is a call to action for researchers to better engage with data management. We recommend you also look at the aggravating conditions listed in the Bit List, which may make data loss more likely. In the case of unpublished research outputs, these include, among others, single copies of a file, dependence on devices and dependence on obsolete formats or processes.
PhD data is also considered at risk because there is no widespread approach to its preservation. Some higher education institutions mandate PhD archiving in their regulations, which we encourage as good practice.
Software and hardware preservation
Some data is produced by bespoke hardware and/or software, often producing bespoke file formats. In some instances, the same bespoke system may be needed to read and analyse the data in the future. If this is the case, and there is no alternative, it may be necessary to preserve the software and even the hardware (or an emulation of the hardware) in order to keep the data usable. This is not a trivial task and considerable effort is being put into making better software in the first place and preserving what has already been used.
Preservation for research data managers and IT specialists
Preservation often relies on the IT infrastructure of research performing organisations. Tool registries are available to help you choose an appropriate approach. Systems such as Arkivum, Amazon Glacier, DataVault, Preservica, Rosetta and Archivematica are well known in today’s digital preservation landscape. Should you need to pick a preservation solution, we invite you to compare them carefully to ensure your choice is appropriate for your objectives.
These systems are generally designed to undertake a number of “standard” tasks to keep the data safe, including:
- Identifying the file type and, if necessary, converting the file to what has been defined in local policy as the ‘Standard’ for that file type (‘Normalise’)
- Creating an archival copy (or copies) and a display copy of the data
- Creating a checksum for the archival copy and regularly performing checks to make sure there is no deterioration and the data is an authentic copy of the originally-submitted record (fixity checks to check for bit-rot)
- Performing file migrations to alternative file types as old formats become obsolete and documenting the characteristics of this migration so that its impact can be understood
- Exposing the data to external display systems where appropriate
- Exposing the metadata to external systems where appropriate
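
To make the checksum and fixity-check task above more concrete, here is a minimal sketch in Python. It is not how any of the systems named above work internally. The function names (`sha256_of`, `create_manifest`, `check_fixity`), the JSON manifest layout and the choice of SHA-256 are all assumptions made for illustration. The sketch also assumes the manifest file is stored outside the archive directory.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_manifest(archive_dir: Path, manifest_path: Path) -> dict:
    """Record one checksum per file under archive_dir (hypothetical layout).

    Intended to run once, when the archival copy is created.
    """
    manifest = {
        str(p.relative_to(archive_dir)): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file()
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest


def check_fixity(archive_dir: Path, manifest_path: Path) -> dict:
    """Recompute checksums and compare them with the stored manifest.

    Returns a status per file: "ok", "changed" or "missing".
    """
    manifest = json.loads(manifest_path.read_text())
    report = {}
    for name, expected in manifest.items():
        target = archive_dir / name
        if not target.is_file():
            report[name] = "missing"
        elif sha256_of(target) != expected:
            report[name] = "changed"
        else:
            report[name] = "ok"
    return report
```

In this sketch, `create_manifest` would run at ingest and `check_fixity` on a regular schedule. Any "changed" or "missing" entry signals that the archival copy should be restored from one of the other copies.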