Records Management and Digital Preservation

March 22, 2006

I wondered whether the next two sessions would be a bit dry, but not so. They were both led by an archivist from the University of London Computing Centre who, in his free time, has a Friday evening radio show on Resonance FM.

This was a brief intro to RM, discussing how it fits in with the digital preservation process. I’ve never been trained in RM, so it was useful for me, and these are the highlights of what I learned:

RM is the efficient control of the creation, receipt, maintenance, use, retention and disposition of records. It’s an archival skill but it overlaps with business analysis, and it’s assisted by international standards such as ISO 15489.

Why do we need records? Well, they might be needed as evidence, for accountability, for decision-making and to record institutional ‘memory’. Good records are authentic, accurate, accessible, complete and comprehensive. They are compliant, effective and secure. I was told that RM assists and supports an organisation’s business processes; it identifies and protects vital records, ensures legal and regulatory compliance, provides protection against litigation, and allows compliance with Freedom of Information legislation.

With the growth of digital records, there’s obviously been a massive quantitative increase in information. Digital records share the same issues as paper records, such as acquisition, preservation, storage and retrieval, but also present additional challenges. Digital records are characterised by being easy to create, copy, share, modify and store in multiple locations. They can be complex, transient, vulnerable, and software- and hardware-dependent.

Good digital records management provides an underlying framework for good digital curation.

A sound migration plan is essential to good digital RM and inherent in the preservation planning recommended by OAIS. A migration plan is an essential part of ensuring that records remain readable throughout their lifecycle as file formats change.

Of course, digital records have metadata which also needs to be managed and which assists with the authentication of a record. I was told that good electronic records management policy should cover the creation and capture of all corporate records within the RM system. It should cover the design and management of indexing and naming schemes. It should offer policy on the automated management of metadata, for retrieval and retention. It should ensure that records are ‘locked down’ to protect their integrity and security. It should also provide guidance on the retention, preservation and destruction of digital records.
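
To make the ‘locked down’ point a bit more concrete, the usual approach is to record a fixity value (a checksum) and some basic metadata at the point of capture, so you can later demonstrate that a record hasn’t changed. Here’s a minimal sketch in Python; the field names and the manifest file are my own invention, not anything shown in the session:

    import csv
    import hashlib
    import os
    import time

    def capture_record(path, manifest="manifest.csv"):
        """Record a checksum and basic metadata for a record at the point of capture."""
        sha1 = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha1.update(chunk)
        row = {
            "record": os.path.basename(path),
            "size_bytes": os.path.getsize(path),
            "captured": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "sha1": sha1.hexdigest(),  # recomputing this later proves the record is unchanged
        }
        with open(manifest, "a", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=list(row))
            if out.tell() == 0:  # first record: write the header row
                writer.writeheader()
            writer.writerow(row)
        return row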

We discussed which types of records are selected for management: vital records needed to sustain the organisation’s business; records essential for legal compliance; and records with mid- to long-term administrative value. These kinds of things should form the basis of a selection policy.

It all sounds like archiving to me, with the exception that there’s more destruction in RM. I do see how RM is more business-focused, though, and the issues of preservation are not always so difficult when you might be retaining records for shorter periods of time. Still, there’s no reason why RM shouldn’t fit into an OAIS environment. The main characteristics of selection, validation, fixity, preservation planning, metadata standards and retrieval/access are clearly very similar. Perhaps Fiona or Lynda can explain more to me when I get back.

A whole hour discussing file formats!

March 22, 2006

I departed from earth this afternoon. I’m not sure where I went but this session on file formats and then a further session on digital records management took me places I never thought I’d go.

The title of this class was ‘File Formats: Matters to Consider’, and I found it fascinating.

First, we were shown where file formats fit in the hierarchy of the IT system:

Semantic Layer
Actions Layer
Format Layer (Alright!)
Filesystem Layer
Media Layer

Then came an anecdote about how some file formats and their creating applications are better suited to some tasks than others. The tutor knew someone who wrote a novel in Excel because he didn’t have any other software to hand, and I guess curiosity didn’t get the better of him either.

We did a quick exercise in what features to look for in a file format for preservation purposes. Not too difficult:

Open, documented, widely used and therefore supported, interoperable over different Operating Systems, lossless/no compression, metadata support, etc. etc.

Another anecdote was that ten years ago, two men wrote a book detailing over 3000 graphic file formats. As the number of formats grew, it was revised and issued on a CD-ROM. Now it’s updated on the web. I’m sure Tim would love it.

I’ll state this here: ADAM handles two graphic file formats for a reason. They are both open, documented, widely used, well supported and interoperable, and both have metadata support. The list of supported graphic file formats may double or triple over time, but 3,000+ formats demonstrates the scale of what digital archives are having to deal with.

If you want guidance on file formats (and who doesn’t?), then look no further than these fine institutions:

AHDS
FCLA Digital Archive
Harvard University formats registry
PRESTOSPACE
ERPAnet file formats
Library of Congress (my favourite).

We finished up by looking at the conversion of file formats, something which presents problems when you want to preserve the integrity of the original file’s content while moving it into a more suitable or non-obsolete format.
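
As an illustration of the sort of check you’d want when converting, here’s a rough sketch that migrates an image to a lossless target format and then confirms the decoded pixel data is identical. It uses the Python Imaging Library, which is my own choice for the example rather than anything recommended in the session:

    from PIL import Image  # Python Imaging Library (Pillow)

    def migrate_image(src, dest):
        """Convert src to the format implied by dest's extension, then verify the content."""
        original = Image.open(src)
        original.save(dest)  # e.g. scan.tif -> scan.png
        migrated = Image.open(dest)
        # For a lossless migration the decoded pixels should be identical.
        unchanged = (original.mode == migrated.mode
                     and original.size == migrated.size
                     and list(original.getdata()) == list(migrated.getdata()))
        if not unchanged:
            raise ValueError("migration altered the image content: %s -> %s" % (src, dest))
        return dest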

I could go on about file formats but let’s face it, we’ve both had enough for one day. Let’s talk at length in the ‘breakout’ area when I get back, OK?

Institutional Repositories

March 22, 2006

Maybe I should think up snappier title headings to these blogs. Believe me, occasionally I’m sitting in the class wondering how the hell I got here. Though I should say that the quality of the training programme so far has been very high and I’m finding it very engaging. The tutors are decent, down-to-earth people with practical advice. In other news, the stats for this blog suggest that most of ITP and IRP looked at it yesterday. Tomorrow’s stats should be interesting… 😉

Basically, this was a discussion on DSpace and OCLC’s repository service. Fedora was mentioned but only briefly. That’s OK, because Fiona, Damon and I attended a conference on Fedora last year. Damon’s an expert, so ask him all the questions about Fedora… The implementation of a ‘trusted repository’ is central to digital archiving, and the two main course documents are the OAIS standard and the follow-up document, Trusted Digital Repositories. The TDR document basically goes through all the attributes and responsibilities that an OAIS-compliant repository should have. The report defines a TDR as:

A trusted digital repository is one whose mission is to provide reliable, long-term access to managed digital resources to its designated community, now and in the future.

It’s a useful document for testing how well your institution is doing.

DSpace is a repository system that’s been developed at MIT. It’s very popular (OK, so that’s a relative term…) in the USA and some UK institutions use it too. From what I could see, it provides a customisable ‘repository out of the box’ and shares some functionality with a Content Management System.

DSpace has three preservation service levels: functional preservation for ‘supported’ (1) and ‘recognised’ (2) file formats, and bit-level (3) preservation. I don’t think it is ‘OAIS compliant’ as such, but clearly it follows the basic OAIS functional model: the Ingest of Submission Information Packages, the creation of Archival Information Packages and the creation of Dissemination Information Packages. The example we were shown worked very well for the submission and archiving of a document by an academic writer. From the Fedora conference we attended, I’d got the impression DSpace was a bit crap, but there’s some competition between the two systems so that shouldn’t be surprising. Fedora is a different animal really, as it provides a suite of repository services which developers are expected to work with, while DSpace is usable out of the box by people without programming skills.
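
To make the Ingest idea a bit more concrete, a submission package is essentially just the content files plus some descriptive metadata and a list of the bitstreams. The sketch below is loosely modelled on DSpace’s simple archive format as far as I understand it, so treat the exact file names and metadata elements as my approximation rather than gospel:

    import os
    import shutil
    from xml.etree import ElementTree as ET

    def build_sip(files, title, author, out_dir="sip_item_001"):
        """Package content files with minimal Dublin Core metadata, ready for Ingest."""
        os.makedirs(out_dir, exist_ok=True)
        # dublin_core.xml: a handful of descriptive elements
        dc = ET.Element("dublin_core")
        ET.SubElement(dc, "dcvalue", element="title").text = title
        ET.SubElement(dc, "dcvalue", element="contributor", qualifier="author").text = author
        ET.ElementTree(dc).write(os.path.join(out_dir, "dublin_core.xml"),
                                 encoding="utf-8", xml_declaration=True)
        # contents: one bitstream per line, alongside copies of the files themselves
        with open(os.path.join(out_dir, "contents"), "w") as contents:
            for f in files:
                shutil.copy(f, out_dir)
                contents.write(os.path.basename(f) + "\n")
        return out_dir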

OCLC, the Online Computer Library Center, provides a repository service for other institutions, so it is really an outsourcing solution. It’s accessible over the web with OAIS-like functionality but, of course, it still requires that you prepare your collection for Ingest.

Initiatives & Tools

March 22, 2006

Before lunch, we discussed a number of initiatives and tools that are emerging in digital preservation. I’m just going to list a few right now because there are so many.

UK Initiatives

JISC – Funding body
AHDS – Arts and Humanities Data Service
DPC – Digital Preservation Coalition (course organisers)
DCC – Digital Curation Centre. Courses, conferences, online forum, specifically interested in curatorial issues.
UKOLN – Advisory service
UKDA – UK Data Archive
NDAD – The National Digital Archive of Datasets
UKWAC – UK Web Archiving Consortium
TNA – The National Archives

Non-UK Initiatives

NARA – Electronic Records Archive (USA) Project with $300m funding
National Archives of Australia Digital Preservation Service
PADI – National Library of Australia’s subject gateway to international digital preservation resources
ECPA – The European Commission on Preservation and Access
RLG – Research Libraries Group
Library of Congress Digital Preservation
ERPANET – Electronic Resource Preservation and Access Network

Tools (these were all demonstrated and very interesting – I think we could use one or two of them, and they are all open source and written in Java, which Merlin’s team have expertise in).

PRONOM/DROID – PRONOM is a file format registry at The National Archives. DROID is a tool which works in conjunction with PRONOM to identify file formats automatically.
JHOVE – Validates digital files and extracts preservation metadata. Very cool. Really!
National Library of New Zealand Metadata Extraction Tool
XENA – XML Electronic Normalising of Archives
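
Out of curiosity, I sketched what the signature-matching idea behind PRONOM and DROID boils down to: compare the first few bytes of a file against known format signatures. DROID obviously does far more than this (it uses PRONOM’s full signature information, internal and external signatures, and so on); the signatures below are just a handful of well-known ones I can vouch for:

    # A toy version of signature-based format identification.
    SIGNATURES = {
        b"%PDF": "PDF document",
        b"\x89PNG\r\n\x1a\n": "PNG image",
        b"GIF87a": "GIF image (87a)",
        b"GIF89a": "GIF image (89a)",
        b"\xff\xd8\xff": "JPEG image",
        b"PK\x03\x04": "ZIP container (also used by newer office formats)",
    }

    def identify(path):
        """Return a best guess at a file's format from its leading 'magic' bytes."""
        with open(path, "rb") as f:
            header = f.read(16)
        for magic, name in SIGNATURES.items():
            if header.startswith(magic):
                return name
        return "unknown - needs a proper PRONOM/DROID lookup"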

Preservation Approaches to Technological Obsolescence

March 22, 2006

At 9am sharp we went straight into issues surrounding the obsolescence of digital file formats and their supporting digital hardware and software. What better way to begin the day!

Generally there are three or four ways of dealing with this:

Migration: Changing a file from one format to another, e.g. a Word 2.0 file to a Word XP file. Migration changes the data, but hopefully in a way which retains the integrity of the digital object.

Things to consider might be whether the new format can still represent the ‘significant properties’ of the original. Can the migration be done automatically? How long will it take (and therefore how much will it cost)? On what basis is the new format chosen? How do we know the migration is 100% successful?

Refreshing: Moving files from one storage medium to another, e.g. moving a document from a 5.25″ floppy disk to a networked server. The object remains unchanged (there’s a short sketch below of checking exactly that).

Emulation: Writing software that runs on a modern operating system and emulates the software environment of the original creator application, e.g. writing a ZX Spectrum emulator to run my favourite game of all time: ‘Elite’.

Preservation of hardware and software: Basically, you keep a museum of old computers with the original software running on them.
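
As promised above, here is a quick sketch of how you might satisfy yourself that a refresh really did leave the object unchanged: copy it to the new media and compare checksums of the source and the copy. Again, this is my own illustration rather than anything demonstrated in the session:

    import hashlib
    import shutil

    def checksum(path):
        """SHA-1 of a file, read in chunks so large objects don't exhaust memory."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def refresh(src, dest):
        """Copy a file to new storage media and verify it is bit-for-bit identical."""
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        if checksum(src) != checksum(dest):
            raise IOError("refresh failed verification: %s" % dest)
        return dest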

Each approach can be useful depending on the circumstances, although emulation and the museum approach are generally regarded as the most inconvenient. Archives aren’t museums, and approaching preservation this way is contrary to the digital archival process, which is to move conservatively with changing technology rather than hang on to it.

Software emulation is invaluable some of the time, but it can be expensive to undertake because of the development resources required, and it is often a black art of reverse engineering: older technologies tend to be poorly specified, and the programming skills required are, like many skills, lost over generations as technology moves on. Also, if you emulate the original software faithfully, then you get the older, more difficult interfaces that came with it. For a large collection of a single file format, a single emulator might be a useful method of access to multiple objects. It also helps retain our understanding of older systems. The BBC Domesday Project is a good example of emulation being the most successful method of bringing the data back to life.

In most situations, though, migration of file formats and refreshing of storage media are what most archivists rely on. At the IS, for example, we already take these approaches, migrating paper to microfilm and WordPerfect files to Word files or PDF, and incrementally upgrading our hardware and software environments. I think it would be useful if ITP and IRP discussed a joint strategy for this, recognising that traditional IT migration strategies do not always recognise the archivist’s needs and expectations. Digital archiving is part of both the IT world and the archiving world, and our digital preservation requirements need to be reflected in a joint ITP and IRP agreement. Already AVR are starting to feel the need for this as we archive 200GB of images on IT’s servers and require 1TB/year for the storage of video. The storage of this data is not static; it requires continual backup, migration and refreshing over time, and clearly the two departments need to acknowledge this formally and make resources available.

The next session followed on from this, discussing how Archivists might live with obsolescence. I was hoping for personal spiritual guidance but instead we discussed particular examples where the above approaches might be useful.

I won’t go into detail here, but predictably enough, it focused on the need to develop organisational strategies, promoting the need to analyse and evaluate the collections, create inventories, determine preferred file formats and storage media, assess how market conditions affect the longevity of IT systems, adopt metadata standards, work with IT departments on joint strategies (as per above), watch technological changes and developments (actually write this into someone’s role responsibilities), and be prepared for hard work and headaches.

Archiving the World Wide Web

March 21, 2006

After a heavy lunch (I was longing for just a sandwich but it’s a residential course, so you eat what you are given), we discussed the archiving of web sites.

Frankly, I think this is a doomed project. You only have to look at the Internet Archive to see its limitations. It’s great for harvesting flat HTML files, but faced with JavaScript, dynamic sites, database-driven sites, and pretty much any Web 2.0 technology, you’re screwed. Still, there’s a new group called UKWAC in the UK determined to archive selected UK web sites. I can see it working when you have the close co-operation of the web site owner, but if you’re trying to capture sites on an ad hoc basis, you’re going to end up with a lot of style and no content.

Archiving web sites seems like a curious legacy exercise to me that will be abandoned eventually, I’m sure. A web page is increasingly about presenting dynamically changing information to users based upon their selection, and not just a set of predetermined and static information. Web browsers these days are often serving up information that is stored in Content Management Systems rather than as flat files in a directory on a web server.

The Internet Archive has been ‘archiving’ http://www.amnesty.org since 1996. Click here to see a front page from 1997. Not bad. Now look at 2005 here. Looks good at first. Now click on the interactive links such as this. Doh! SVAW’s been ‘lost’. How about AI’s reports? Click on the ‘Library‘ page and try finding a report from any country and you get sent out of the Internet Archive’s site and into AI’s original site! This demonstrates to me how superficial their efforts are increasingly becoming.

It tells me that we shouldn’t rely on other organisations such as the Internet Archive to take responsibility for the general archiving of our web site. It’s something only we can do. And in a way, we do archive a lot of the important content on the web site. We’ve got ADAM (AV and multimedia) and AIDOC (indexed reports and press releases) after all. And a new Content Management System will make it easier to manage other content we’re creating for the web and assist us in organising the specific web content we might wish to archive.

But do we need to archive the user experience? Whether it’s worth putting our resources into doing this is up for discussion, I guess. It doesn’t interest me. The way information is presented changes over time as design and technology shift together. Perhaps this has value to some cultural institutions such as a design museum. I just see it as a shallow exercise in vanity rather than something of historical value. If we’re putting content on the web that is of genuine archival importance to the organisation, then that content should be archived, but not each and every style sheet that displays it. I don’t think we should get too drawn in by developing and changing web technologies. It’s fun to look back a few years and see how web presentation has changed, but that’s all.
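
For what it’s worth, the reason flat HTML is so easy to harvest is that a crawler only ever sees what the server returns for a URL. Below is a bare-bones sketch of one in Python, purely my own illustration and nothing like as sophisticated as the Internet Archive’s crawler; anything a page builds with JavaScript, or that only exists in response to a form or a database query, simply never appears to it:

    import os
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect href targets from anchor tags in a fetched page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(value for name, value in attrs if name == "href" and value)

    def harvest(start_url, out_dir="snapshot", limit=50):
        """Fetch pages reachable by static links from start_url and save their HTML."""
        queue, seen = [start_url], set()
        os.makedirs(out_dir, exist_ok=True)
        while queue and len(seen) < limit:
            url = queue.pop(0)
            # Stay on the original site and don't fetch the same page twice.
            if url in seen or urlparse(url).netloc != urlparse(start_url).netloc:
                continue
            seen.add(url)
            html = urlopen(url).read().decode("utf-8", errors="replace")
            name = urlparse(url).path.strip("/").replace("/", "_") or "index"
            with open(os.path.join(out_dir, name + ".html"), "w") as f:
                f.write(html)
            parser = LinkParser()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)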

On a related note, the final session of the day was to start our group project, which is to look at the Internet Archive and decide whether it lives up to its mission as an ‘archive’. You can probably tell what I think already, but I’ve still got three days of research to do before we present our conclusions. Actually, there’s more to the Internet Archive than just crawling web sites. They collect books, software, films and music, and I think that’s where their value will lie in the future, in addition to having collected a few years’ worth of functional early web sites.