Showing posts with label normalization.

Tuesday, December 5, 2017
The Digital Preservation Coalition's International Digital Preservation Day was marked by a wide-ranging collection of blog posts. Below the fold, some links to, and comments on, a few of them.
Tuesday, October 9, 2012
Formats through time
Two interesting and important recent studies provide support for the case I've been making for at least the last 5 years that Jeff Rothenberg's pre-Web analysis of format obsolescence is itself obsolete. Details below the fold.
Labels: format migration, format obsolescence, normalization
Monday, January 17, 2011
Why Migrate Formats? The Debate Continues
I am grateful for two recent contributions to the debate about whether format obsolescence is an exception, or the rule, and whether migration is a viable response to it:
- Andy Jackson posts an argument for format migration to improve access rather than for preservation.
- Rob Sharpe critiques my discussion of Microsoft Project 98 in a comment.
Andy gives up the position that format migration is essential for preservation and moves the argument to access, correctly quoting an earlier post of mine saying that the question about access is how convenient it is for the eventual reader. As Andy says:
What is the point of keeping the bits safe if your user community cannot use the content effectively?

In this shift Andy ends up actually agreeing with much, but not quite all, of my case.
He says, quite correctly, that I argue that a format with an open source renderer is effectively immune from format obsolescence. But that isn't all I'm saying. Rather, the more important observation is that formats are not going obsolete; they are continuing to be easily renderable by the normal tools that readers use. Andy and I agree that reconstructing the entire open source stack as it was before the format went obsolete is an imposition on an eventual reader. That isn't actually what would have to happen if obsolescence happened, but the more important point is that obsolescence isn't going to happen.
The digital preservation community has failed to identify a single significant format that has gone obsolete in the 15+ years since the advent of the Web, which is one quarter of the entire history of computing. I have put forward a theory that explains why format obsolescence ceased; I have yet to see any competing theory that both explains the lack of format obsolescence since the advent of the Web and, as it would have to in order to support the case for format migration, predicts a resumption in the future. There is unlikely to be any reason for a reader to do anything but use the tools they have to hand to render the content, and thus no need to migrate it to a different format to provide "sustainable access".
Andy agrees with me that the formats of the bulk of the British Library's collection are not going obsolete in the foreseeable future:
The majority of the British Library's content items are in formats like PDF, TIFF and JP2, and these formats cannot be considered 'at risk' on any kind of time-scale over which one might reasonably attempt to predict. Therefore, for this material, we take a more 'relaxed' approach, because provisioning sustainable access is not difficult.

This relaxed approach to format obsolescence, preserving the bits and dealing with format obsolescence if and when it happens, is the one I have argued for since we started the LOCKSS program.
Andy then goes on to discuss the small proportion of the collection that is not in formats that he expects to go obsolete in the future, but in formats that are hard to render with current tools:
Unfortunately, a significant chunk of our collection is in formats that are not widely used, particularly when we don't have any way to influence what we are given (e.g. legal deposit material).

The BL eases access to this content by using migration tools on ingest to create an access surrogate and, as the proponents of format migration generally do, keeping the original.
Naturally, we wish to keep the original file so that we can go back to it if necessary,

Thus, Andy agrees with me that it is essential to preserve the bits. Preserving the bits will ensure that these formats stay as hard to render as they are right now. Creating an access surrogate in a different format may be a convenient thing to do, but it isn't a preservation activity.
Where we may disagree is on the issue of whether it is necessary to preserve the access surrogate. It isn't clear whether the BL does, but there is no real justification for doing so. Unlike the original bits, the surrogate can be re-created at any time by re-running the tool that created it in the first place. If you argue for preserving the access surrogate, you are in effect saying that you don't believe that you will be able to re-run the tool in the future. The LOCKSS strategy for handling format obsolescence, which was demonstrated and published more than 6 years ago, takes advantage of the transience of access surrogates; we create an access surrogate if a reader ever accesses content that is preserved in an original format that the reader regards as obsolete. Note that this approach has the advantage of being able to tailor the access surrogate to the reader's actual capabilities; there is no need to guess which formats the eventual reader will prefer. These access surrogates can be discarded immediately, or cached for future readers; there is no need to preserve them.
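To make the contrast with migrate-on-ingest concrete, here is a minimal sketch in Python of this migrate-on-access strategy. Everything in it is an assumption for illustration: the format names, the converter registry and the cache location are all hypothetical. The point is only that the surrogate is derived from the preserved original on demand, may be cached, and can always be thrown away and identically re-created.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("/var/cache/surrogates")  # hypothetical cache location

def convert_jp2_to_png(data: bytes) -> bytes:
    """Hypothetical stand-in for a real migration tool (e.g. ImageMagick)."""
    raise NotImplementedError("invoke a real converter here")

# Hypothetical registry: (source format, target format) -> converter.
CONVERTERS = {
    ("image/jp2", "image/png"): convert_jp2_to_png,
}

def serve(original: bytes, source_fmt: str, reader_formats: list[str]) -> bytes:
    """Return the preserved original, or an access surrogate derived from it."""
    if source_fmt in reader_formats:
        # The common case: the format is not obsolete for this reader.
        return original
    for target_fmt in reader_formats:
        converter = CONVERTERS.get((source_fmt, target_fmt))
        if converter is None:
            continue
        # The cache key is a function of the preserved bits and the target
        # format, so a discarded surrogate can be identically re-created.
        key = hashlib.sha256(original + target_fmt.encode()).hexdigest()
        cached = CACHE_DIR / key
        if cached.exists():
            return cached.read_bytes()
        surrogate = converter(original)
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(surrogate)  # a cache entry, not a preservation copy
        return surrogate
    raise ValueError(f"no path from {source_fmt} to the reader's formats")
```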
The distinction between preservation and access is valuable, in that it makes clear that applying preservation techniques to access surrogates is a waste of resources.
One of the most interesting features of this debate has been detailed examinations of claims that this or that format is obsolete; the claims have often turned out to be exaggerated. Andy says:
The original audio 'master' submitted to us arrives in one of a wide range of formats, depending upon the make, model and configuration of the source device (usually a mobile phone). Many of these formats may be 'exceptional', and thus cannot be relied upon for access now (never mind the future!).

But in the comments he adds:
The situation is less clear-cut in case of the Sound Map, partly because I'm not familiar enough with the content to know precisely how wide the format distribution really is.

The Sound Map page says:
Take part by publishing recordings of your surroundings using the free AudioBoo app for iPhone or Android smartphones or a web browser.

This implies that, contra Andy, the BL is in control of the formats used for recordings. It would be useful if someone with actual knowledge would provide a complete list of the formats ingested into Sound Map, and specifically identify those which are so hard to render as to require access surrogates.
Wednesday, January 30, 2008
Does Preserving Context Matter?
As a Londoner, I really appreciate the way The Register brings some of the great traditions of Fleet Street to technology. In a column that appeared there just before Christmas, Guy Kewney asks his version of Provost O'Donnell's question, "Who's archiving IT's history?" and raises the important issue of whether researchers need only the "intellectual content" to survive, or whether they need the context in which it originally appeared.
Now is an unusual opportunity to discuss this issue, because the same content has been preserved both by techniques that do, and do not, preserve the context, and it has been made available in the wake of a trigger event. Some people, but not everyone, will be able to draw real comparisons.
Kewney writes:
One of my jobs recently has been to look back into IT history and apply some 20-20 hindsight to events five years ago and ten years ago.

Temporarily unable to get to his library of paper back issues of IT Week for inspiration, he turned to the Internet Archive's Wayback Machine to look back five years at his NewsWireless site:
I won't hear a word against the WayBackMachine. But I will in honesty have to say a few words against it: it's got holes.
What it's good at is holding copies of "That day's edition" just the way a newspaper archive does. I can, for example, go back to NewsWireless by opening up this link; and there, I can find everything that was published on December 6th 2002 - five years ago! - more or less. I can even see that the layout was different, if I look at the story of how NewsWireless installed a rogue wireless access point in the Grand Hotel Palazzo Della Fonte in Fiuggi, ...

Look at the two versions of the Fiuggi story linked from the quote above - although the words are the same the difference is striking. It reveals a lot about the changes in the Web over the past five years.
Now, have a look at the same story, as it appears on NewsWireless today. The words are there, but it looks nothing like it used to look.
Unusually, NewsWireless does give you the same page you would have seen five years ago. When you're reading the Fiuggi story, the page shows you contemporary news... It's the week's edition, in content at least.
Most websites don't do this.
You can, sometimes, track back a particular five-year-old story (though sadly you'll often find it's been deleted), but if you go to the original site you're likely to find that the page you see is surrounded by modern stories. It's not a five-year-old edition. Take, for example Gordon Laing's Christmas 2002 article ... and you'll find exactly no stories at all relating to Christmas 2002. They were published, yes, but they aren't archived together anywhere - except the WayBackMachine.
A much more revealing example than Kewney's is now available. SAGE publishes many academic journals. Some succeed, others fail. One of the failures was Graft: Organ and Cell Transplantation, of which SAGE published three volumes from 2001 to 2003. SAGE participates in both the major e-journal archiving efforts, CLOCKSS and Portico, and both preserve the content of these three volumes. SAGE decided to cease publishing the journal, and has allowed both CLOCKSS and Portico to trigger the content, i.e. to go through the process each defines for making preserved content available.
The Graft content in CLOCKSS is preserved using LOCKSS technology, which uses the same basic approach as the Internet Archive. The system carefully crawls the e-journal website, collecting the content of every URL that it thinks of as part of the journal. After the trigger event all these collected URLs are reassembled to re-constitute the e-journal website, which is made freely available to all under a Creative Commons license.
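As a rough illustration of this crawl-and-reassemble approach (a sketch under assumptions, not the actual LOCKSS implementation), the Python below collects every URL matching a journal's crawl rule and stores each response under a path derived from its URL, so that a plain web server can later re-serve the site as collected. The start URL and the crawl rule are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

# Hypothetical journal start page; real systems use per-journal crawl rules.
START = "https://journals.example.org/graft/"

def in_scope(url: str) -> bool:
    """Crawl rule: only URLs the system regards as part of this journal."""
    return url.startswith(START)

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start: str, out_dir: Path) -> None:
    """Fetch every in-scope URL, storing each body under a URL-derived path
    so a plain web server (e.g. Apache) can re-serve the site as collected."""
    queue, seen = [start], {start}
    while queue:
        url = queue.pop()
        with urlopen(url) as resp:
            body = resp.read()
        rel = urlparse(url).path.lstrip("/")
        if not rel or rel.endswith("/"):
            rel += "index.html"
        dest = out_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(body)
        extractor = LinkExtractor()
        extractor.feed(body.decode("utf-8", errors="replace"))
        for link in extractor.links:
            absolute = urljoin(url, link)
            if in_scope(absolute) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
```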
You can see the result at the CLOCKSS web site. The page at that link is an introduction, but if you follow the links on that page to the Graft volumes, you will be seeing preserved content extracted from the CLOCKSS system via a script that arranges it in a form suitable for Apache to serve. Please read the notes on the introductory page describing ways in which content preserved in this way may surprise you.
The Graft content in Portico is preserved by a technique that aims only to preserve the "intellectual content", not the context. Content is obtained from the publisher as source files, typically the SGML markup used to generate the HTML, PDF and other formats served by the e-journal web site. It undergoes a process of normalization that renders it uniform. In this way the same system at Portico can handle content from many publishers consistently, because the individual differences such as branding have been normalized away. The claim is that this makes the content easier to preserve against the looming crisis of format obsolescence. It does, however, mean that the eventual reader sees the "intellectual content" as published by Portico's system now, not as originally published by SAGE's system. Since the trigger event, readers at institutions which subscribe to Portico can see this version of Graft for themselves. Stanford isn't a subscriber, so I can't see it; I'd be interested in comments from those who can make the comparison.
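In outline (again a sketch under stated assumptions, not Portico's actual pipeline), a normalizing ingest looks something like the following: each publisher's source markup is mapped into a single house schema, and it is the normalized rendition that is preserved and eventually rendered to readers. The Article model and the per-publisher mapping are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Article:
    """Hypothetical uniform internal model: publisher branding, layout and
    site context have been normalized away, leaving 'intellectual content'."""
    doi: str
    title: str
    authors: list[str]
    body_xml: str  # markup in the archive's single house schema

def normalize(source_sgml: str, publisher: str) -> Article:
    """Stand-in for the per-publisher transform into the house schema;
    a real system would maintain one such mapping per publisher DTD."""
    raise NotImplementedError(f"apply the {publisher} mapping here")

def ingest(source_sgml: str, publisher: str, store: dict[str, Article]) -> None:
    """Normalize on ingest: what is preserved is the normalized rendition,
    not the pages the publisher's own system served."""
    article = normalize(source_sgml, publisher)
    store[article.doi] = article
```

The same uniformity that lets one system handle many publishers consistently is exactly what discards the context.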
It is pretty clear that Kewney is on the LOCKSS side of this issue:
Once upon a time, someone offered me all the back numbers of a particular tech magazine I had contributed to. He said: "I don't need it anymore. If I want to search for something I need to know, I Google it."

The LOCKSS technology can in some respects do better than that, but in other respects it can't. For example, every reader of a Web page containing advertisements may see a different ad. Printing the page gets one of them. The LOCKSS technology has to exclude the ads. But, as you can see, it does a reasonable job of capturing the context in which the "intellectual content" appeared. Notice, for example, the difference between the headline bar of a typical table of contents page extracted from an Edinburgh University CLOCKSS node and a Stanford University CLOCKSS node. This is an artifact of the institutions' different subscriptions to SAGE journals.
But what if you don't know you need to know it? What sort of records of the present are we actually keeping? What will historians of the future get to hear about contemporary reactions to stories of the day, without the benefit of hindsight?
Maybe, someone in the British Library ought to be solemnly printing out all the content on every news website every day, and storing them in boxes, labelled by date?
This isn't a new argument. The most eloquent case for the importance of preserving what the publisher published was made by Nicholson Baker in Double Fold: Libraries and the Assault on Paper. He recounts how microfilm vendors convinced librarians of a looming crisis: their collections of newspapers were rapidly decaying, and it was urgently necessary to microfilm them or their "intellectual content" would be lost to posterity. Since the microfilm would take up much less space, they would save money in the long run. The looming crisis turned out to be a bonanza for the microfilm companies but a disaster for posterity. Properly handled, newspapers were not decaying; improperly handled, they were. Although properly handled microfilm would not decay, improperly handled it decayed as badly as paper. The process of microfilming destroyed both "intellectual content" and context.
I'd urge anyone tempted to believe that the crisis of format obsolescence looms so menacingly that it can be solved only through the magic of "normalization" to read Nicholson Baker.