
Wednesday, October 6, 2010

"Petabyte for a Century" Goes Main-Stream

I started writing about the insights to be gained from the problem of keeping a Petabyte for a century four years ago, in September 2006. More than three years ago, in June 2007, I blogged about them. Two years ago, in September 2008, these ideas became a paper at iPRES 2008 (PDF). After an unbelievable 20-month delay from the time it was presented at iPRES, the International Journal of Digital Curation finally published almost exactly the same text (PDF) in June 2010.

Now, an expanded and improved version of the paper, including material from my 2010 JCDL keynote, has appeared in ACM Queue.

Alas, I'm not quite finished writing on this topic. I was too busy while preparing this article and failed to notice an excellent paper by Kevin Greenan, James Plank and Jay Wylie, "Mean time to meaningless: MTTDL, Markov models, and storage system reliability."

They agree with my point that MTTDL is a meaningless measure of storage reliability, and that bit half-life isn't a great improvement on it. They propose instead NOMDL (NOrmalized Magnitude of Data Loss), i.e. the expected number of bytes that the storage will lose in a specified interval divided by its usable capacity. As they point out, it is possible to compute this using Monte Carlo simulation based on distributions of component failures that experiments have shown to fit the real world. These simulations produce estimates that are relatively credible, especially compared to the ludicrous estimates I pillory in the article.
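To make the definition concrete, here is a minimal sketch (in Python) of how such a Monte Carlo estimate might be computed for a toy two-way mirror. The Weibull failure and repair parameters are invented for illustration, and the model is far simpler than the erasure-coded systems Greenan, Plank and Wylie analyze:

    # Illustrative sketch only: a toy Monte Carlo estimate of NOMDL for a
    # hypothetical two-way mirror. The Weibull and repair parameters below
    # are invented for demonstration, not taken from Greenan, Plank & Wylie.
    import random

    CAPACITY_BYTES = 1e12       # usable capacity of the mirror (1 TB)
    MISSION_HOURS = 10 * 8760   # the interval of interest (10 years)
    WEIBULL_SCALE = 1.0e5       # assumed characteristic disk life (hours)
    WEIBULL_SHAPE = 1.2         # assumed Weibull shape parameter
    MEAN_REPAIR = 168.0         # assumed mean rebuild time (hours)
    TRIALS = 100_000

    def disk_lifetime():
        # Time to failure drawn from the assumed Weibull distribution
        # (random.weibullvariate takes scale, then shape).
        return random.weibullvariate(WEIBULL_SCALE, WEIBULL_SHAPE)

    def bytes_lost_in_one_mission():
        # Simulate one mission. In this toy model loss is all-or-nothing:
        # if the surviving disk fails before the rebuild completes, the
        # mirror loses its full capacity. Disks are treated as good-as-new
        # after each repair, a simplification real models avoid.
        t = 0.0
        while True:
            t += min(disk_lifetime(), disk_lifetime())  # first failure in the pair
            if t >= MISSION_HOURS:
                return 0.0
            rebuild = random.expovariate(1.0 / MEAN_REPAIR)
            if disk_lifetime() < rebuild:
                return CAPACITY_BYTES
            t += rebuild

    expected_loss = sum(bytes_lost_in_one_mission() for _ in range(TRIALS)) / TRIALS
    print("estimated NOMDL over 10 years:", expected_loss / CAPACITY_BYTES)

The numbers it prints mean nothing in themselves, since they depend entirely on the assumed distributions; the point is that the result is expressed as expected bytes lost per byte stored over a stated interval, which is something one can actually reason about.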

NOMDL is a far better measure than MTTDL. Greenan, Plank and Wylie are to be congratulated for proposing it. However, it is not a panacea. It is still the result of models based on data, rather than experiments on the system in question. The major points of my article still stand:
  • That the reliability we need is so high that benchmarking systems to confirm that they exceed it is impractical.

  • That projecting the reliability of storage systems from simulations driven by component reliability distributions is likely to be optimistic, given both the observed auto- and long-range correlations between failures and the models' inability to capture major causes of data loss, such as operator error.


Further, there is still a use for bit half-life. Careful readers will note subtle changes in the discussion of bit half-life between the iPRES and ACM versions. These are due to incisive criticism of the earlier version by Tsutomu Shimomura. The ACM version describes the use of bit half-life thus:
"Even if we are sublimely confident that every source of data loss other than bit rot has been totally eliminated, we still have to run a benchmark of the system’s bit half-life to confirm that it is longer than [required]"
However good simulations of the kind Greenan et al. propose may be, at some point we need to compare them to the reliability that the systems actually deliver.
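As a back-of-the-envelope illustration of why such a benchmark is impractical, here is the arithmetic behind the "petabyte for a century" requirement, sketched in Python. The 50% survival target and the assumption that bits fail independently are mine, for illustration:

    # Back-of-the-envelope arithmetic behind the "petabyte for a century"
    # requirement. Assumptions (for illustration): bits fail independently,
    # and we want a 50% chance that no bit in a petabyte flips in 100 years.
    from math import log2

    bits = 8 * 10**15     # one petabyte
    years = 100           # required survival time
    p_survive = 0.5       # target probability that every bit survives

    # If each bit survives time t with probability 2 ** (-t / H), where H is
    # the bit half-life, then all N bits survive with probability
    # 2 ** (-N * t / H), so H = N * t / log2(1 / p_survive).
    half_life = bits * years / log2(1 / p_survive)
    print(f"required bit half-life: {half_life:.1e} years")   # about 8.0e17

A required bit half-life of about 8 × 10^17 years is roughly sixty million times the age of the universe; no feasible benchmark can confirm that a real system achieves it.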

Sunday, December 28, 2008

Foot, meet bullet

The gap in posting since March was caused by a bad bout of RSI in my hands. I believe it was triggered by the truly terrible ergonomics of the mouse buttons on the first-generation Asus EEE (which otherwise fully justifies its reputation as a game-changing product). It took a long time to recover and even longer to catch up with all the work I couldn't do when I couldn't type for more than a few minutes.

One achievement during this enforced hiatus was to turn my series of posts on A Petabyte for a Century into a paper entitled Bit Preservation: A Solved Problem? (190KB PDF) and present it at the iPRES 2008 conference last September at the British Library.

I also attended the 4th International Digital Curation Conference in Edinburgh. As usual these days, for obvious reasons, sustainability was at the top of the agenda. Brian Lavoie of OCLC talked (461KB .ppt) about the work of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access which he co-chairs with Fran Berman of the San Diego Supercomputer Center. NSF, the Andrew W. Mellon Foundation and others are sponsoring this effort; the LOCKSS team have presented to the Task Force. Their interim report has just been released.

Listening to Brian talk about the need to persuade funding organizations of the value of digital preservation efforts, I came to understand the extent to which the tendency to present simply preserving the bits as a trivial, solved problem has caused the field to shoot itself in the foot.

The activities that the funders are told they need to support are curation-focused, such as generating metadata to prepare for possible format obsolescence, and finding the content for future readers. The problem is that, as a result, the funders see a view of the future in which, even if they do nothing, the bits will survive. There might possibly be problems in the distant future if formats go obsolete, but there might not be. There might be problems finding content in the future, but there might not be. After all, funders might think, if the bits survive and Google can index them, how much worse than the current state could things be? Why should they pour money into activities intended to enhance the data? After all, the future can figure out what to do with the bits when they need them; they'll be there whatever happens.

A more realistic view of the world, as I showed in my iPRES paper, would be that there are huge volumes of data that need to be preserved, that simply storing a few copies of all of it is more costly than we can currently cope with, and that even if we spend enough to use the best available technology we can't be sure the bits will be safe. If this were the view presented to the funders, that unless they provide funds now important information will gradually be lost, they might be scared into actually doing something.