Author Archive


March 23, 2006

The last class of the day, before getting on with our Internet Archive project (which has been very instructive), was about providing access to archival collections. Good, common-sense advice was dished out which we all pretty much knew but were pleased to hear again:

We preserve because we expect access.
We must be able to derive one from the other.
We don’t have to let one dictate the other.
Access needs may drive decisions at ingest e.g. on metadata.
There are many ways to provide access.

We were advised to ‘preserve enough to tell the story’ and that good preservation refers to ‘preserving meaningful information through time’. Quotes like that can come in handy sometimes.

Of course, depending on the archive’s remit, there are various reasons why we might have to provide access (FOI) or restrict access (DPA). Fortunately only the latter applies to us right now. There was also a brief discussion about redacting certain information before providing access, something that we do in a way with MAV’s products and transcripts for security reasons before they are put on the public database. I don’t know if we do it to documents, too. Do we?

Finally, the OAIS standard clearly has a strong element dedicated to ensuring access. It’s based around knowing your ‘designated community’ which may be as broad as ‘people who can read English and use the internet’ (The National Archives) or as narrow as ‘my friends and family’ (imagine an online photo service like Flickr). As usual some of the best advice was about planning ahead, being proactive about offering ways to access material, separating the preservation infrastructure from the access infrastructure and collaborating with other repositories.



March 23, 2006

This was a post-lunch crash course in Intellectual Property Rights, Copyright, Digital Rights Management, the Freedom of Information Act, the Data Protection Act and Legal Deposit. No depth but some useful highlights and another interesting case study from the ADS concerning the misuse of images and how they dealt with it.

Basically, the ADS received a collection of data, including images, from the excavation of Christ Church, Spitalfields. The images in this very interesting collection show the remains of bodies buried in the 18th-century crypt. These images were found by a web site for necrophilia enthusiasts and some images were copied from the ADS site and republished on the sex-with-dead fan site. They also provided a link through to the ADS for enthusiasts to grab more images for themselves.

This was their mistake, because the ADS noticed an unexpected spike in the use of its website and traced it back to the link from the other website. It was the first time they’d had to deal with the misuse of their digital collections and they sought advice from the JISC legal team. They were advised to follow a six-point plan spread over 70 days. The first step was simply to contact the website, tell them they had broken the licence they agreed to on the ADS website, and ask that they take the images down or else face further legal action. And they did take them down. End of story.

This was a satisfactory result for the ADS because, despite the misuse of the images, it would have been a long and difficult legal process had the web site not taken them down.

We’ve been thinking about such things for ADAM and intend to introduce a ‘handshake’ agreement prior to the download of ADAM images. Having seen how the ADS handle this ‘contract’ with its users, I’m now inclined to just have users agree to a licence when they first enter an ADAM session rather than each time they click to download. Legally it would appear to cover us. Our present system is based on authentication into the AI Intranet and then trusting that the AI staff member will respect the terms and conditions that are displayed with each image, but we think we can do better than this with little inconvenience to users. There will also be more changes to the way ADAM handles rights management and licence agreements.

The main piece of advice that the ADS gave from this example was that archives should not wait for the abuse of their content before forming a response but rather formulate a strategy for dealing with a potential incident so we can react quickly, methodically and legally. Wayne, Claire and Tim will know more about whether we’ve had to deal with this already. I’m not aware of such a strategy being in place though. In late May, an IPR expert from the Open University will be giving a one-day workshop on IPR issues for AI staff, something we intend to run each year. Having spent just an hour touching on such issues, I feel a day’s course would be well spent ensuring IS staff are informed of the risks and responsibilities involved in this area of our work. Not least because the European Copyright Directive, which applies to the UK, now makes breaking copyright protection a criminal offence rather than a civil offence, so theoretically someone could go to jail whereas it used to be that the individual/organisation would be fined based on the ‘loss’ (financial, of reputation, of relationships, etc.) to the rights owner.

Costs, risk management and business planning

March 23, 2006

Not the way I would have chosen to start the day but it ended up being a useful morning discussing how to identify the organisational costs of running a digital archive, how to justify those costs and how to identify the benefits. We also discussed risk management, the implications of lifecycle management and how to cost elements of an OAIS-compliant archive.

We did an interesting exercise comparing the costs of running an e-prints archive at Cornell University and The National Archives’ digital archive. Not surprisingly, the two archives’ costs are radically different because their remits and the services they provide are radically different. It costs TNA £18.76 to ingest/acquire a single file into their archive, a huge sum compared to Cornell’s £0.56-£2.84. This is not only because TNA’s remit is so much wider and therefore the ingest/acquisition process is much more complex, but because TNA operate in an environment where they catalogue the material themselves whereas Cornell have no cataloguers but require the Professor submitting her document to provide and verify all the information/metadata. Also, TNA have huge preservation costs because they are dealing with legacy digital materials which are 20-30 years old, created when no preparation was made for their long-term preservation. Cornell, on the other hand, are archiving simple, modern digital materials and their preservation activities are relatively easy and predictable.
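To get a feel for the size of that gap, the per-file figures quoted above can be put side by side. This is only a rough sketch in Python: the batch size of 1,000 files and the function names are hypothetical, chosen for illustration, and were not part of the class exercise.

```python
# Per-file ingest costs quoted in the post, in GBP.
TNA_COST = 18.76
CORNELL_LOW, CORNELL_HIGH = 0.56, 2.84

# Hypothetical batch size, used only to make the totals concrete.
BATCH_SIZE = 1000


def ingest_cost(files: int, cost_per_file: float) -> float:
    """Total ingest cost for a batch of files, rounded to the nearest penny."""
    return round(files * cost_per_file, 2)


def cost_ratio(cost_a: float, cost_b: float) -> float:
    """How many times more expensive cost_a is than cost_b (one decimal place)."""
    return round(cost_a / cost_b, 1)


print("TNA batch cost:", ingest_cost(BATCH_SIZE, TNA_COST))
print("Cornell batch cost (low):", ingest_cost(BATCH_SIZE, CORNELL_LOW))
print("Cornell batch cost (high):", ingest_cost(BATCH_SIZE, CORNELL_HIGH))
print("TNA vs Cornell (low):", cost_ratio(TNA_COST, CORNELL_LOW))
print("TNA vs Cornell (high):", cost_ratio(TNA_COST, CORNELL_HIGH))
```

On these figures TNA’s per-file ingest cost is somewhere between roughly 7 and 34 times Cornell’s, which says more about the difference in remit and cataloguing responsibility than about efficiency.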

This raised a familiar and interesting question for me because we will be developing a facility in ADAM for staff to upload images to a team catalogue and provide metadata for the image. In an ideal world, the member of staff would provide full and accurate metadata which would require no validation and could be entered directly into ADAM and immediately available on the Intranet. Of course, this is almost certainly impossible for AI. It works for Cornell because the Professor has a vested and very personal interest in ensuring that her article is made widely available and correctly cited through the submission of complete and accurate metadata. Even then, an example was given where an academic catalogued their article with a single keyword representing their sole academic interest, disregarding the other subject areas which the article related to. I asked people if they had any advice on how we could have AI staff more involved in the cataloguing process but no miracle answers were forthcoming. Basically, while staff are essential providers of information about the digital object, supplying information only they might know, it’s an unacceptable organisational risk to then make those images directly available for other staff to reuse before AVR have checked and verified the metadata and, as is always the case, enriched it with further information. And of course, staff might justifiably argue that they could be making better use of their time than extensively cataloguing images and checking copyright and licence agreements. There will be ways that we can ensure that the information provided to us is formed in a way that is easy to validate and enrich though and that’s the approach we’ll be taking with ADAM.

At one point while trying to break down the cost elements of a digital archive I realised that we were a room full of archivists trying to do the job that IT professionals have been doing for years. The element costs involved in digital archiving such as hardware, software, licences, support, development, fixtures and fittings, etc. are costs that we share with ITP. Where IRP need to demonstrate costs is by detailing the work processes and therefore the staff time involved and the business reasons why archival preservation might require three or four times the storage requirements, a different approach to risk management, changes in data management, etc. But, with the exception of staff time, a digital archive uses readily available IT solutions in a specific way. I tried to make the point that we (archivists) are not the people best placed to cost IT systems but rather need to work with IT professionals and draw on their existing experience in planning, purchasing and maintaining systems. I think that to an IT department, a digital archive is just another application of IT hardware, software and processes. Do you agree?

This wasn’t the first time I’ve found that archivists tend to look at a digital archive infrastructure as something new and peculiar to them and completely alien to IT professionals. Sure, there might be different requirements that some IT staff might not be familiar with but it’s the archivist’s role to explain and justify these in business terms and, in return, let the IT staff deliver the infrastructure requirements to meet the business case. It’s just data that needs to be treated a bit differently, that’s all.

Despite this frustration, this class had real practical value for me and was a morning well spent.

Preservation Approaches to Technological Obsolescence

March 22, 2006

At 9am sharp we went straight into issues surrounding the obsolescence of digital file formats and their supporting digital hardware and software. What better way to begin the day!

Generally there are three or four ways of dealing with this:

Migration: Changing a file from one format to another, e.g. a Word 2.0 file to a Word XP file. Migration changes the data but hopefully in a way which retains the integrity of the digital object.

Things to consider might be whether the new format can still represent the ‘significant properties’ of the original format. Can the migration be done automatically? How long will it take (and therefore how much will it cost)? On what basis is the new format chosen? How do we know the migration is 100% successful?

Refreshing: Moving files from one storage medium to another, e.g. moving a document from a 5.25″ floppy disk to a networked server. The object remains unchanged.

Emulation: Writing software that runs on a modern Operating System which emulates the software environment of the original creator application, e.g. writing a ZX Spectrum emulator to run my favourite game of all time: ‘Elite’.

Preservation of hardware and software: Basically, you keep a museum of old computers with the original software running on them.
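Two of the claims above can be checked mechanically: that refreshing leaves the object unchanged, and that migration keeps the ‘significant properties’ of the original. What follows is a minimal sketch in Python, not a description of any particular archive’s tooling. It assumes SHA-256 as the checksum and assumes the significant properties have already been extracted into simple dictionaries; the function names and property names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy a file to new storage and confirm the copy is bit-for-bit identical.

    Returns the checksum so it can be recorded alongside the object.
    """
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"Checksum mismatch after copying {source} to {destination}")
    return after


def changed_properties(original: dict, migrated: dict) -> dict:
    """Compare significant properties before and after a migration.

    Returns {property: (original_value, migrated_value)} for every property
    of the original that is missing or different in the migrated object.
    """
    return {
        name: (value, migrated.get(name))
        for name, value in original.items()
        if migrated.get(name) != value
    }
```

A checksum comparison only makes sense for refreshing, where the bits should not change. Migration changes the bits by design, so the comparison has to be made on whatever properties the archive has decided are significant, for example `changed_properties({"pages": 3, "footnotes": True}, {"pages": 3, "footnotes": False})` flags the lost footnotes.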

Each approach can be useful depending on the circumstances, although emulation and the museum approach are generally regarded as the most inconvenient. Archives aren’t museums and approaching preservation this way is contrary to the digital archival process which is to move conservatively with changing technology rather than hang on to it.

Software emulation is invaluable some of the time, but may be expensive to undertake because of the development resources required. It is often a black art of reverse engineering, as older technologies tend to be poorly specified or the programming skills required are, like many skills, lost over generations as technology moves on. Also, if you emulate the original software faithfully, then you get the older, more difficult interfaces that came with it. For a large collection of a single file format, a single emulator might be a useful method of access to multiple objects. It also helps retain our understanding of older systems. The BBC Domesday Project is a good example of when emulation was the most successful method of bringing the data back to life.

In most situations though, migration of file formats and refreshing of storage media are what most archivists rely on. At the IS, for example, we already undertake these approaches, migrating paper to microfilm, WordPerfect files to Word files or PDF, and by incrementally upgrading our hardware and software environments. I think it would be useful if ITP and IRP discuss a joint strategy for this, recognising that traditional IT migration strategies do not always recognise the archivist’s needs and expectations. Digital Archiving is part of both the IT world and Archiving world and our digital preservation requirements need to be reflected in a joint ITP and IRP agreement. Already AVR are starting to feel the need for this as we archive 200GB of images on IT’s servers and require 1TB/year for the storage of video. The storage of this data is not static but requires continual backup, migration and refreshing over time and clearly the two departments need to acknowledge this formally and make resources available.

The next session followed on from this, discussing how Archivists might live with obsolescence. I was hoping for personal spiritual guidance but instead we discussed particular examples where the above approaches might be useful.

I won’t go into detail here, but predictably enough, it focused on the need to develop organisational strategies, promoting the need to analyse and evaluate the collections, create inventories, determine preferred file formats and storage media, assess how market conditions affect the longevity of IT systems, adopt metadata standards, work with IT departments on joint strategies (as per above), watch technological changes and developments (actually write this into someone’s role responsibilities), and be prepared for hard work and headaches.

OAIS Introduction

March 20, 2006

I’ve been wrestling with the OAIS document/standard (Open Archival Information System) for about 18 months and have only recently, finally been able to understand its real-life application in detail. It’s a ‘functional model’ for digital preservation archives, developed collaboratively by institutions from all over the world, led by NASA and originally meant to provide a model for the preservation of their own space programme data. It became an approved ISO standard in 2002.

Here is an OAIS on the most general functional level:

[Figure: OAIS Functional Entities]

The tutor did an excellent job of showing how this relates to real world archiving and then discussed the model on a slightly deeper level, breaking down each ‘actor’, ‘object’ and ‘action’. It was reassuring to hear that the model is not meant to dictate each and every function of a ‘compliant’ archive, but rather serve as a very thorough checklist for the design and functionality of a digital archive of any size (it’s a ‘standard’ after all). It is obvious when you read the standard that some of the functions are essential and any archivist would naturally expect to find them in any archive. Other functions might be useful to some archives and not to others, often depending on the size and remit, but each function does serve to stimulate archive managers into questioning whether their digital archive is serving the ‘designated community’ (users) correctly. When thinking of how ADAM v2.0 should function, I’ve used the OAIS standard as a model of ‘best practice’ and though we’ve got some way to go, it remains a useful benchmark to work with. At the highest level, it’s just acquisition, storage and access. And then, as you drill deeper into the detail, it raises many questions about workflow, authentication and validation of digital objects, ability to audit each service correctly and fully, and ultimately ensures the archive is designed to serve the community of people it functions for.

It’s also about people, the archive staff and users. Clearly some areas of the functional model suggest a level of automation through the use of computing but other sections of the standard are about decision making processes and strategic planning.

We finished with a quick exercise to test our understanding of the OAIS model at its highest level. I scored four out of five. Not bad. I learn from my mistakes.

More OAIS tomorrow (and the next day and the day after that…)