The BBC made an expensive mistake in the 1980s. It spent £2.5 million (£7.1 million in today’s money) building one of the first computer encyclopaedias. The massively ambitious Domesday Project, in commemoration of the 900th anniversary of the Domesday Book, shipped on a pair of LaserDiscs, a standard that’s largely disappeared. It was programmed using BCPL, a 51-year-old language that’s no longer in common use, and used analogue video stills layered on top of the interface where it needed to show a photo. This was, after all, the pre-JPEG era.
Even the hardware on which it ran – the BBC Micro and a LaserDisc player – was bespoke, and cost £5,000. Inevitably, much of the data was lost as the discs degraded, formats moved on and the hardware came to the end of its useful life. Work is still ongoing to try to recover the contents, some of which have been posted online.
It’s hard to imagine the same thing happening now. Today, we have ubiquitous formats, and everything lives in the cloud. Doesn’t it?
Backups aren’t archives
In 2015, Google’s “chief internet evangelist,” Vint Cerf, warned that we face a “forgotten generation or even a forgotten century” as formats fall out of favour and hardware degrades. “We digitise things because we think we will preserve them, but what we don’t understand is that unless we take other steps, those digital versions may not be any better, and may even be worse, than the artefacts that we digitised.”
It’s a theme picked up by Arkivum’s Paula Keogh, who makes a clear distinction between archiving and backup – two allied fields that people who don’t work in digital preservation frequently confuse.
“A backup won’t be migrating the infrastructure or file format over time,” she said. “You’re locking your data in a metaphorical room, throwing away the key and hoping it will still be there in the future.”
Arkivum’s clients sign 25-year contracts for the preservation of their data which, in Keogh’s words, “is a lifetime in IT, but a drop in the ocean for an archive.”
Critically, they need their data to be not only secure, but also accessible. “Life science organisations [and others] want to be able to double-click a file in a couple of decades and open it… so media is one lifecycle management process that we undertake. The other is file format preservation. It’s not backup, scanning or digitisation, all of which can – and does – get confused with the term digital preservation. It’s about migrating the file formats into the most preservable version at that point.”
It seems almost inconceivable that industry standards such as Word and Excel might disappear, but this is precisely what the data archiving standards body, the Association for Information and Image Management (AIIM), is planning for.
“The industry has decided that [archival-focused] PDF/A is going to be a future-proof format,” said Howard Frear of Easy Software, which sits on the body’s board. “It contains all of the data and metadata within the document itself, so you don’t necessarily need an application to open it, as there will always be an industry standard viewer.”
This will be more important to certain industries than others. Easy Software works with pensions providers, for example, who maintain their records for the life of each subscriber, plus 20 years, and need to know that the records they produce will still be accessible, potentially, 100 years from now.
That’s not guaranteed with proprietary formats. “With Microsoft Word, older and newer versions, they aren’t that compatible,” Frear said. “Backwards compatibility has been problematic but looking at forwards compatibility is nigh-on impossible unless you have a standard.”
However, if PDF/A is the way ahead, when should the file actually be generated? At the point when we save our assets, or when they’re added to an archive?
“It should be a problem for Apple, Microsoft, IBM and Amazon, but it’s not,” explained Keogh. “For us to be looking after our data well, when we’re creating the data in whatever format, that’s when you should have the option to make it as future-proof as possible.”
“To some degree, it’s down to the user to put in some extra effort,” Frear said, explaining that Microsoft Word can output PDF/A using an add-in. “Perhaps developers could do a little bit more and store both copies as part of the single save function, but then everybody is battling against the volume of data that creates.”
Keeping data alive
Now that we have become so used to putting our assets in the cloud, it’s easy to forget that the cloud, like a local hard drive, is still a limited resource backed by fallible hardware. That’s why taking responsibility for your own archive is essential.
“Cloud providers perhaps aren’t as mindful as the software community is,” Frear said. “Software and records management communities are driving the standards and we need to remind cloud vendors that it’s all very well bringing in new hardware, but that they have a responsibility to ensure that the data we put up to the cloud lives beyond the hardware’s usable life, and that as they move on to different hardware they have a responsibility to move the data across smoothly.”
If that archive remains usable, so much the better. PDF/A looks like the best compromise, preserving both the final look of the archived document, and extractable content for reuse.
“Could you read a WordPerfect file?” Keogh asked. “I couldn’t, not without an emulator, and that’s only from the 1990s, which from a data protection point of view, for something like the deeds of a house, someone’s pension scheme, a clinical trial or the research that meant you could bring a drug to market, is no time at all.”
Yet, despite warnings like this, a study published in the journal Current Biology found that only a fifth of all the research published in the early 1990s remains accessible.
The Digital Preservation Coalition, founded by the British Library and JISC (Joint Information Systems Committee), published a list of the world’s endangered digital species at the end of 2017. It classified data from marginalised sub-groups and the photo archives of SMEs as critically endangered, requiring urgent action and assessment within 12 months.
Even documents stored on Google Drive and Dropbox, where access is restricted to specific users, were listed as endangered, along with digital images with no analogue equivalent posted to social networks.
Archives and the right to be forgotten
The implementation of GDPR this May will have implications for archive-keeping, which Frear described as “another piece of the puzzle”. Keogh sees potential conflicts – particularly over the question of what should and shouldn’t be removed on request.
“There’s a lot still to be ironed out,” she said. “When you talk about things like [archived] genome sequencing or thumbprints you need to start asking what is identifiable about an individual. Is it their NI number, their first and last name, their DNA sequence? You can’t take an individual out of [a study] because it skews the figures. Yet, they still have the right to be forgotten, so how do those two conflicting things work in reality?”
It’s likely the answer will become clear in the months following GDPR coming into force through trial cases and legal guidance. It illustrates once again, though, the crucial difference between a static backup that rots with age, and a live, accessible archive, which remains an asset for the organisation that created it years or even decades into the future.