We started the last morning of the training programme with presentations of our project. As I mentioned before, we’ve been looking at the Internet Archive (IA) as a way of “preserving an institution’s digital collections that are made available on the web.”
The task, then, was to examine whether the IA scored well with regard to the ‘three legs’ and ‘five stages’ I mentioned at the beginning of the week. The assumption is that a ‘trusted digital repository’ (basically, a ‘good archive’) should satisfy our requirements in the areas of resources, technology and organisational framework. We were asked to score the IA against the five stages: Acknowledge, Act, Consolidate, Institutionalise and Externalise. These basically represent five stages of maturity in a digital archive’s development.
If you haven’t spent much time on the IA’s website, it’s certainly worth doing so. It was originally the initiative of one man, Brewster Kahle, who made millions in software development in the 1990s. His foundation continues to fund around 50% of the IA’s expenses (we were given copies of the IA’s 990-PF Tax Returns for 2003 and excerpts from 1999, 2000, 2001 and 2002). In 2003, the IA had expenses of just over $3m, so Brewster’s millions are essential for the sustainability of the IA.
This was a concern for all of us since we were asked to consider how sustainable the IA appears to be. Now, admittedly, our figures are a couple of years out of date and we’ve no idea how Brewster’s foundation is set up or how sustainable that in itself is, but the donations from other companies over the years have been sporadic, project-based and unpredictable.
It’s an expensive operation to run when your mission is to collect every website in the world, which is why, in 2003, they spent $380,558 on disk and tape stock, $79,358 on misc. hardware and $27,002 on their phone/network charges.
The IA’s technological framework scored quite low with our class, too. Its web crawling is done by Alexa every two months, although if you look at the history of AI on the IA’s Wayback Machine, you’ll see that the results are more ad hoc than that. It’s worth pointing out that the IA provides researchers with another way of accessing the archived websites, although currently this is from a command line; they are developing tools to help make this easier.
We looked at how they approach long-term preservation, in particular the way the data is documented, stored, migrated, backed up and secured. The IA stores the websites in ARC and DAT containers, which are open formats, and open source tools are available to work with them. However, the original file formats harvested from the websites are not migrated to different formats for long-term preservation, so proprietary file formats remain proprietary and therefore possibly inaccessible in the long term. The IA do also collect software and emulators in the hope that some kind of access will be possible in the future, even for obsolete formats. It’s all a bit hit and miss, not least because their website has no formal policy or strategy for preservation. It’s just a few paragraphs here and there, from which you have to draw your own conclusions. The preservation metadata which they add to the harvested pages is minimal since it’s done by machine. You can see what they add by using the Wayback Machine and then looking at the source of one of the harvested web pages.
The lack of formal, detailed policy was also a concern to the class. The IA have taken on such a huge task, but with the exception of some areas of their technological development, there’s very little to suggest that they have a strategy in place that will ensure the long-term preservation of their archive, which is central to their mission. The other major organisational concern was that, technically, their activities are illegal or at least waiting to be challenged. Essentially, they are taking copyrighted material from owners without asking for their permission. They do say on their website that they respect robots.txt commands, so that you can tell their crawler not to archive your website, and they also say that if you contact them and ask them to remove your website from their archive they will do so. But this places all of the responsibility on the rights holder rather than the IA, which is contrary to the usual practice of agreeing terms and conditions before copying, taking or using owned materials.
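For anyone who hasn’t met robots.txt before, here is a rough sketch of how that opt-out works, using Python’s standard `urllib.robotparser`. The file contents, the example.org URLs and the `can_crawl` helper are made up for illustration, and the `ia_archiver` user-agent name is my assumption about what the crawler calls itself, so check the IA’s own documentation before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site. The "ia_archiver" name is an
# assumption about the crawler's user-agent, not something taken from the IA.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets this user-agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.org/collections/index.html"
    # The archive crawler is shut out of the whole site...
    print(can_crawl(ROBOTS_TXT, "ia_archiver", page))
    # ...while any other well-behaved crawler may still fetch public pages.
    print(can_crawl(ROBOTS_TXT, "SomeOtherBot", page))
```

The point to notice is that the site owner has to write and publish this file; nothing happens unless the rights holder acts first.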
While on the subject of their web crawler, as I said earlier in the week, it has quite obvious problems in that it only works reliably with flat HTML files. JavaScript, database-driven websites and, in fact, anything interactive, including video, audio and animation, present real problems to the IA. So even if you were happy about the IA copying your material, the likelihood is that they will only be able to do part of the job. This is increasingly apparent if you use their Wayback Machine, as sites from a few years ago are much better archived than more modern sites, which increasingly use interactive web technologies.
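To make the flat HTML problem a bit more concrete, here is a minimal sketch of the kind of link discovery a simple crawler does. It is not the IA’s crawler, just an illustration built on Python’s standard `html.parser`; the page markup and the `LinkCollector` and `extract_links` names are invented for the example.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags found in the static markup."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page: one ordinary link, one link that only exists once a
# browser has run the script.
PAGE = """
<html><body>
<a href="/about.html">About</a>
<script>
  var a = document.createElement("a");
  a.href = "/gallery.html";
  document.body.appendChild(a);
</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(PAGE))  # prints ['/about.html']
```

A crawler that only reads the markup never learns that /gallery.html exists, so that page never makes it into the archive.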
They are trying to address some of these issues with a new subscription-based service which allows you to pay for tools to customise the way your website is archived, ensuring that all the right links are harvested on a schedule that suits your needs. You can also add Dublin Core metadata. This has the potential to make the IA’s technology directly available to you, at a price. You can also request copies of their archive of your website (for a cost), and you can set up a private service so that your website archive is not available to anyone else, though this is clearly not what the IA would prefer you to do. Unfortunately, this new service still suffers from the technological limitations I mentioned above and you are still advised that “as a general rule of thumb, simple html is the easiest to archive.”
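If you haven’t seen Dublin Core before, the sketch below shows roughly what a simple record looks like when written out as XML with Python’s standard library. I don’t know what format the IA’s service actually expects, so the wrapper element, the field values and the `dublin_core_record` helper are all assumptions for illustration; only the element names and the namespace come from the Dublin Core element set itself.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields: Dict[str, str]) -> str:
    """Serialise a dict of Dublin Core elements (title, creator, ...) as XML."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, "{%s}%s" % (DC_NS, name))
        child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up values describing an archived website.
    record = dublin_core_record({
        "title": "Example University Digital Collections",
        "creator": "Example University Library",
        "date": "2005-11-18",
        "identifier": "http://www.example.org/collections/",
    })
    print(record)
```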
So, as you can see, from our outsider view, we had major concerns about several key aspects (sustainability, technology, policy and strategy) of the IA. However, all the above doesn’t directly address the class project’s central concern, which was to examine how the IA might preserve a university’s digital collections that are being made available on the web.
What the IA do is archive web content, not the institutional digital collections. Effectively, it’s like saying that they archive the ADAM web pages but not the actual ADAM digital archive which contains high resolution TIFF master images and a fuller set of metadata than is shown on the website. The IA can archive a site so that you can see what that site looked (and to a lesser degree functioned) like at a certain period in time but they make no claims to be preserving the digital collections which the website is drawing from when it presents a user with the results of their search for images on women in Sudan.
That was key to the class project and, in a way, to the week’s course overall. On the first day, we discussed how the OAIS model works, in that digital materials are ingested/acquired (the submission information package), then archived (the archival information package), and then made available in a different form (the dissemination information package) according to the user or ‘designated community’. At each stage, the information collected/presented is different. The IA are only interested in the dissemination information package, whereas as an institution, we’re interested in all three information packages, and as a preservation archive with resources invested in the creation and intellectual property rights of the digital objects, we’re especially interested in the archival information package.
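As a way of fixing the three packages in my own head, I find it helps to think of them as three different data structures. The sketch below is only a toy under my own assumptions: the class names, the fields and the `ingest` and `disseminate` functions are invented, and the OAIS model itself is far richer than this.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class SubmissionPackage:
    """SIP: what the depositor hands over, e.g. a TIFF master plus metadata."""
    content: bytes
    metadata: Dict[str, str]


@dataclass
class ArchivalPackage:
    """AIP: what the archive keeps, with added preservation information."""
    content: bytes
    metadata: Dict[str, str]
    preservation: Dict[str, str]


@dataclass
class DisseminationPackage:
    """DIP: what a user of the website actually receives."""
    access_copy: bytes
    metadata: Dict[str, str]


def ingest(sip: SubmissionPackage) -> ArchivalPackage:
    """Keep the master untouched and record a checksum for later fixity checks."""
    fixity = {"sha256": hashlib.sha256(sip.content).hexdigest()}
    return ArchivalPackage(sip.content, dict(sip.metadata), fixity)


def disseminate(aip: ArchivalPackage, public_fields: Iterable[str]) -> DisseminationPackage:
    """Expose a subset of the metadata and a stand-in for a web derivative."""
    wanted = set(public_fields)
    public = {k: v for k, v in aip.metadata.items() if k in wanted}
    # Placeholder only: a real system would render a low-resolution image here.
    access_copy = b"web-derivative-of:" + aip.preservation["sha256"].encode("ascii")
    return DisseminationPackage(access_copy, public)


if __name__ == "__main__":
    sip = SubmissionPackage(b"fake TIFF bytes", {"title": "Example image", "rights": "Example University"})
    aip = ingest(sip)
    dip = disseminate(aip, ["title"])
    print(sorted(aip.metadata), sorted(dip.metadata))
```

A web crawler only ever sees something like the `DisseminationPackage` here; the master file, the fuller metadata and the checksum in the `ArchivalPackage` never reach the website, so they can’t be harvested from it.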