Showing posts with label memento.

Tuesday, October 22, 2019

MementoMap

Since at least 2011 I've been writing about how important Memento is for Web archiving, and how its success depends upon the effectiveness of Memento Aggregators:
In a recent post I described how Memento allows readers to access preserved web content, and how, just as accessing current Web content frequently requires the Web-wide indexes from keywords to URLs maintained by search engines such as Google, access to preserved content will require Web-wide indexes from original URL plus time of collection to preserved URL. These will be maintained by search-engine-like services that Memento calls Aggregators.
Memento Aggregators turned out to be both useful and a hard engineering problem. Below the fold, a discussion of MementoMap Framework for Flexible and Adaptive Web Archive Profiling by Sawood Alam et al. from Old Dominion University and Arquivo.pt, which both reviews the history of finding out just how hard the problem is and reports fairly encouraging progress in attacking it.
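The core idea of profiling is easy to sketch. Purely as an illustration (my invented toy format, not the actual MementoMap syntax or API), an aggregator could keep a per-archive summary of holdings and only query the archives whose summary matches the requested URL:

    # Toy illustration of archive profiling for aggregator routing. The
    # profile contents and archive names are invented for this example;
    # see the MementoMap paper for the real framework.
    from urllib.parse import urlparse

    # Each "profile" summarizes an archive's holdings, here as a bare set of
    # hostnames; real profiles are far richer (e.g. keys with wildcards and
    # frequency counts) and far smaller than a full URI index.
    PROFILES = {
        "archive-a": {"www.nytimes.com", "www.bbc.co.uk"},
        "archive-b": {"arquivo.example", "www.example.org"},
    }

    def candidate_archives(url):
        """Return the archives whose profile suggests they may hold this URL."""
        host = urlparse(url).netloc.lower()
        return [name for name, hosts in PROFILES.items() if host in hosts]

    print(candidate_archives("https://www.nytimes.com/2019/10/22/example.html"))
    # ['archive-a'] -- the aggregator need not bother the other archives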

Thursday, March 28, 2019

The 47 Links Mystery

Nearly a year ago, in All Your Tweets Are Belong To Kannada, I blogged about Cookies Are Why Your Archived Twitter Page Is Not in English. It describes some fascinating research by Sawood Alam and Plinio Vargas into the effect of cookies on the archiving of multilingual websites.

Sawood Alam just followed up with Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay In Multiple Languages, another fascinating exploration of these effects. Follow me below the fold for some commentary.

Tuesday, December 4, 2018

Selective Amnesia

Last year's series of posts and PNC keynote entitled The Amnesiac Civilization were about the threats to our cultural heritage from inadequate funding of Web archives, and the important content that as a result is never preserved. But content that Web archives do collect and preserve is also under a threat, one that can be described as selective amnesia. David Bixenspan's When the Internet Archive Forgets makes the important, but often overlooked, point that the Internet Archive isn't an elephant:
On the internet, there are certain institutions we have come to rely on daily to keep truth from becoming nebulous or elastic. Not necessarily in the way that something stupid like Verrit aspired to, but at least in confirming that you aren’t losing your mind, that an old post or article you remember reading did, in fact, actually exist. It can be as fleeting as using Google Cache to grab a quickly deleted tweet, but it can also be as involved as doing a deep dive of a now-dead site’s archive via the Wayback Machine. But what happens when an archive becomes less reliable, and arguably has legitimate reasons to bow to pressure and remove controversial archived material?
...
Over the last few years, there has been a change in how the Wayback Machine is viewed, one inspired by the general political mood. What had long been a useful tool when you came across broken links online is now, more than ever before, seen as an arbiter of the truth and a bulwark against erasing history.
Below the fold, some commentary on the vulnerability of Web history to censorship.

Thursday, November 1, 2018

Ithaka's Perspective on Digital Preservation

Oya Rieger of Ithaka S+R has published a report entitled The State of Digital Preservation in 2018: A Snapshot of Challenges and Gaps. In June and July Rieger:
talked with 21 experts and thought leaders to hear their perspectives on the state of digital preservation. The purpose of this report is to share a number of common themes that permeated through the conversations and provide an opportunity for broader community reaction and engagement, which will over time contribute to the development of an Ithaka S+R research agenda in these areas.
Below the fold, a critique.

Thursday, May 24, 2018

How Far Is Far Enough?

When collecting an individual web site for preservation by crawling, it is necessary to decide where its edges are: which links encountered are "part of the site" and which lead off-site. Crawlers use "crawl rules" to make these decisions. A simple rule would say:
Collect all URLs starting https://www.nytimes.com/
[Image: NoScript on http://nytimes.com]
If a complex "site" is to be properly preserved, the rules need to be a lot more complex. The image shows the start of the list of DNS names from which the New York Times home page embeds resources. Preserving this single page, let alone the "whole site", would need resources from at least 17 DNS names, and rules are needed for each of these names. How are all these more complex rules generated? Follow me below the fold for the answer, and news of an encouraging recent development.
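As a concrete, if toy, illustration of what a crawl-rule set might look like, here is a sketch; the rule syntax is my own invention for the example, not that of Heritrix or any other real crawler:

    # A toy crawl-rule engine: rules are (action, regex) pairs applied in
    # order, first match wins. Real crawlers have much richer decide-rule
    # chains; this only illustrates the idea.
    import re

    RULES = [
        ("collect", r"^https://www\.nytimes\.com/"),
        ("collect", r"^https://static01\.nyt\.com/"),  # a host serving embedded resources
        ("reject",  r".*"),                            # everything else treated as off-site
    ]

    def decide(url):
        for action, pattern in RULES:
            if re.match(pattern, url):
                return action
        return "reject"

    assert decide("https://www.nytimes.com/section/world") == "collect"
    assert decide("https://www.facebook.com/nytimes") == "reject"

Multiply the "collect" rules by the 17-plus DNS names and it is clear why writing them by hand doesn't scale.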

Tuesday, April 11, 2017

The Orphans of Scholarship

This is the third of my posts from CNI's Spring 2017 Membership Meeting. Predecessors are Researcher Privacy and Research Access for the 21st Century.

Herbert Van de Sompel, Michael Nelson and Martin Klein's To the Rescue of the Orphans of Scholarly Communication reported on an important Mellon-funded project to investigate how all the parts of a research effort that appear on the Web other than the eventual article might be collected for preservation using Web archiving technologies. Below the fold, a summary of the 67-slide deck and some commentary.

Tuesday, December 20, 2016

Reference Rot Is Worse Than You Think

At the Fall CNI meeting Martin Klein presented a new paper from LANL and the University of Edinburgh, Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. In it, Shawn Jones, Klein and their co-authors follow on from the earlier work on web-at-large citations from academic papers in Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot, which found:
one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.
Reference rot comes in two forms (sketched in code after the definitions):
  • Link rot: The resource identified by a URI vanishes from the web. As a result, a URI reference to the resource ceases to provide access to referenced content.
  • Content drift: The resource identified by a URI changes over time. The resource’s content evolves and can change to such an extent that it ceases to be representative of the content that was originally referenced.
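As a rough sketch of how one might test for the two forms (my own simplification, not the methodology of the paper), record a digest of the referenced content at citation time and compare later:

    # Crude reference-rot check: link rot = the URI no longer resolves,
    # content drift = it resolves but the content digest has changed.
    # Real studies are far more careful (redirects, soft 404s, boilerplate).
    import hashlib
    import urllib.request

    def snapshot(url):
        """Digest of the current content, or None if the link is rotten."""
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return hashlib.sha256(response.read()).hexdigest()
        except Exception:
            return None

    def check(url, digest_at_citation_time):
        digest_now = snapshot(url)
        if digest_now is None:
            return "link rot"
        if digest_now != digest_at_citation_time:
            return "possible content drift"
        return "intact"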
The British Library's Andy Jackson analyzed the UK Web Archive and found:
I expected the rot rate to be high, but I was shocked by how quickly link rot and content drift come to dominate the scene. 50% of the content is lost after just one year, with more being lost each subsequent year. However, it’s worth noting that the loss rate is not maintained at 50%/year. If it was, the loss rate after two years would be 75% rather than 60%. This indicates there are some islands of stability, and that any broad ‘average lifetime’ for web resources is likely to be a little misleading.
Clearly, the problem is very serious. Below the fold, details on just how serious and discussion of a proposed mitigation.

Tuesday, November 1, 2016

Fixing broken links in Wikipedia

Mark Graham has a post at the Wikimedia Foundation's blog, Wikipedia community and Internet Archive partner to fix one million broken links on Wikipedia:
The Internet Archive, the Wikimedia Foundation, and volunteers from the Wikipedia community have now fixed more than one million broken outbound web links on English Wikipedia. This has been done by the Internet Archive's monitoring for all new, and edited, outbound links from English Wikipedia for three years and archiving them soon after changes are made to articles. This combined with the other web archiving projects, means that as pages on the Web become inaccessible, links to archived versions in the Internet Archive's Wayback Machine can take their place. This has now been done for the English Wikipedia and more than one million links are now pointing to preserved copies of missing web content.
This is clearly a good thing, but follow me below the fold.

Tuesday, September 6, 2016

Memento at W3C

Herbert van de Sompel's post at the W3C's blog Memento and the W3C announces that both the W3C's specifications and their Wiki now support Memento (RFC7089):
The Memento protocol is a straightforward extension of HTTP that adds a time dimension to the Web. It supports integrating live web resources, resources in versioning systems, and archived resources in web archives into an interoperable, distributed, machine-accessible versioning system for the entire web. The protocol is broadly supported by web archives. Recently, its use was recommended in the W3C Data on the Web Best Practices, when data versioning is concerned. But resource versioning systems have been slow to adopt. Hopefully, the investment made by the W3C will convince others to follow suit.
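For those who haven't seen the protocol in action, datetime negotiation is just one extra HTTP header. A minimal sketch, using the public Memento aggregator's TimeGate (any RFC7089 TimeGate works the same way):

    # Ask a TimeGate for the Memento of a URI closest to a requested datetime
    # (RFC7089 datetime negotiation). urllib follows the TimeGate's redirect
    # to the selected Memento, whose response carries Memento-Datetime.
    import urllib.request

    original = "https://www.w3.org/TR/webarch/"
    timegate = "http://timetravel.mementoweb.org/timegate/" + original

    request = urllib.request.Request(
        timegate,
        headers={"Accept-Datetime": "Tue, 06 Sep 2016 00:00:00 GMT"},
    )
    with urllib.request.urlopen(request) as response:
        print("Memento URI:     ", response.geturl())
        print("Memento-Datetime:", response.headers.get("Memento-Datetime"))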
This is a very significant step towards broad adoption of Memento. Below the fold, some details.

Thursday, August 25, 2016

Evanescent Web Archives

Below the fold, discussion of two articles from last week about archived Web content that vanished.

Tuesday, August 23, 2016

Content negotiation and Memento

Back in March Ilya Kreymer summarized discussions he and I had had about a problem he'd encountered building oldweb.today thus:
a key problem with Memento is that, in its current form, an archive can return an arbitrarily transformed object and there is no way to determine what that transformation is. In practice, this makes interoperability quite difficult.
What Ilya was referring to was that, for a given Web page, some archives have preserved the HTML, the images, the CSS and so on, whereas some have preserved a PNG image of the page (transforming it by taking a screenshot). Herbert van de Sompel, Michael Nelson and others have come up with a creative solution. Details below the fold.
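The problem is easy to observe. A sketch that simply reports the media type an archive returns for a memento shows it immediately: one archive's capture of a page may come back as text/html, another's as image/png, and nothing in the Memento exchange says which has been transformed or how:

    # Report the media type an archive returns for a memento URI. Feeding it
    # mementos of the same page from different archives can yield text/html
    # from one and image/png (a screenshot) from another, with no standard
    # way for a client to learn what transformation was applied.
    import urllib.request

    def memento_media_type(uri_m):
        with urllib.request.urlopen(uri_m) as response:
            return response.headers.get("Content-Type")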

Friday, January 8, 2016

Aggregating Web Archives

Starting five years ago, I've posted many times about the importance of Memento (RFC7089), and in particular about the way Memento Aggregators in principle allow the contents of all Web archives to be treated as a single, homogeneous resource. I'm part of an effort by Sawood Alam and others to address some of the issues in turning this potential into reality. Sawood has a post on the IIPC blog, Memento: Help Us Route URI Lookups to the Right Archives that reveals two interesting aspects of this work.

First, Ilya Kreymer's oldweb.today shows there is a significant demand for aggregation:
We learned in the recent surge of oldweb.today (that uses MemGator to aggregate mementos from various archives) that some upstream archives had issues handling the sudden increase in the traffic and had to be removed from the list of aggregated archives.
Second, the overlap between the collections at different Web archives is low, as shown in Sawood's diagram. This means that the contribution of even small Web archives to the effectiveness of the aggregated whole is significant. This is important in an environment where the Internet Archive has by far the biggest collection of preserved Web pages; it can be easy to think that the efforts of other Web archives add little. But Sawood's research shows that, if they can be effectively aggregated, even small Web archives can make a contribution.
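The mechanics of aggregation are simple even if doing it well at scale is not. A minimal sketch in the spirit of MemGator (the two TimeMap endpoints follow the common /timemap/link/ pattern but are examples only, not a complete or current list):

    # Fetch each archive's link-format TimeMap for a URI and merge the
    # mementos into one chronological list.
    import re
    import urllib.request
    from email.utils import parsedate_to_datetime

    TIMEMAP_ENDPOINTS = [
        "https://web.archive.org/web/timemap/link/",
        "https://arquivo.pt/wayback/timemap/link/",
    ]

    MEMENTO_RE = re.compile(r'<([^>]+)>;\s*rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"')

    def aggregate(original_uri):
        mementos = []
        for endpoint in TIMEMAP_ENDPOINTS:
            try:
                with urllib.request.urlopen(endpoint + original_uri, timeout=30) as resp:
                    mementos += MEMENTO_RE.findall(resp.read().decode("utf-8", "replace"))
            except Exception:
                continue  # one archive being down shouldn't break the aggregate
        return sorted(mementos, key=lambda m: parsedate_to_datetime(m[1]))

    for uri_m, datetime_str in aggregate("http://example.com/")[:5]:
        print(datetime_str, uri_m)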

Wednesday, December 23, 2015

Signposting the Scholarly Web

At the Fall CNI meeting, Herbert Van de Sompel and Michael Nelson discussed an important paper they had just published in D-Lib, Reminiscing About 15 Years of Interoperability Efforts. The abstract is:
Over the past fifteen years, our perspective on tackling information interoperability problems for web-based scholarship has evolved significantly. In this opinion piece, we look back at three efforts that we have been involved in that aptly illustrate this evolution: OAI-PMH, OAI-ORE, and Memento. Understanding that no interoperability specification is neutral, we attempt to characterize the perspectives and technical toolkits that provided the basis for these endeavors. With that regard, we consider repository-centric and web-centric interoperability perspectives, and the use of a Linked Data or a REST/HATEOAS technology stack, respectively. We also lament the lack of interoperability across nodes that play a role in web-based scholarship, but end on a constructive note with some ideas regarding a possible path forward.
They describe their evolution from OAI-PMH, a custom protocol that used the Web simply as a transport for remote procedure calls, to Memento, which uses only the native capabilities of the Web. They end with a profoundly important proposal they call Signposting the Scholarly Web which, if deployed, would be a really big deal in many areas. Some further details are on GitHub, including this somewhat cryptic use case:
Use case like LOCKSS is the need to answer the question: What are all the components of this work that should be preserved? Follow all rel="describedby" and rel="item" links (potentially multiple levels perhaps through describedby and item).
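In rough code, and assuming the links are exposed as HTTP Link headers (Signposting also allows HTML <link> elements, which this sketch ignores), the use case looks something like this:

    # Starting from a work's landing page, follow rel="describedby" and
    # rel="item" links in the HTTP Link headers, a couple of levels deep,
    # to enumerate the components a preservation crawl should collect.
    import re
    import urllib.request
    from urllib.parse import urljoin

    LINK_RE = re.compile(r'<([^>]+)>[^,]*?rel="([^"]*)"')

    def signposted_components(url, depth=2, seen=None):
        seen = set() if seen is None else seen
        if url in seen or depth < 0:
            return seen
        seen.add(url)
        with urllib.request.urlopen(url, timeout=30) as response:
            link_header = response.headers.get("Link", "")
        for target, rels in LINK_RE.findall(link_header):
            if {"describedby", "item"} & set(rels.split()):
                signposted_components(urljoin(url, target), depth - 1, seen)
        return seen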
Below the fold I explain what this means, and why it would be a really big deal for preservation.

Tuesday, November 3, 2015

Emulation & Virtualization as Preservation Strategies

I'm very grateful that funding from the Mellon Foundation on behalf of themselves, the Sloan Foundation and IMLS allowed me to spend much of the summer researching and writing a report, Emulation and Virtualization as Preservation Strategies (37-page PDF, CC-By-SA). I submitted a draft last month; it has been peer-reviewed, and I have addressed the reviewers' comments. It is also available on the LOCKSS web site.

I'm old enough to know better than to give a talk with live demos. Nevertheless, I'll be presenting the report at CNI's Fall membership meeting in December complete with live demos of a number of emulation frameworks. The TL;DR executive summary of the report is below the fold.

Thursday, September 17, 2015

Enhancing the LOCKSS Technology

A paper entitled Enhancing the LOCKSS Digital Preservation Technology describing work we did with funding from the Mellon Foundation has appeared in the September/October issue of D-Lib Magazine. The abstract is:
The LOCKSS Program develops and supports libraries using open source peer-to-peer digital preservation software. Although initial development and deployment was funded by grants including from NSF and the Mellon Foundation, grant funding is not a sustainable basis for long-term preservation. The LOCKSS Program runs the "Red Hat" model of free, open source software and paid support. From 2007 through 2012 the program was in the black with no grant funds at all.

The demands of the "Red Hat" model make it hard to devote development resources to enhancements that don't address immediate user demands but are targeted at longer-term issues. After discussing this issue with the Mellon Foundation, the LOCKSS Program was awarded a grant to cover a specific set of infrastructure enhancements. It made significant functional and performance improvements to the LOCKSS software in the areas of ingest, preservation and dissemination. The LOCKSS Program's experience shows that the "Red Hat" model is a viable basis for long-term digital preservation, but that it may need to be supplemented by occasional small grants targeted at longer-term issues.
Among the enhancements described in the paper are implementations of Memento (RFC7089) and Shibboleth, support for crawling sites that use AJAX, and some significant enhancements to the LOCKSS peer-to-peer polling protocol.

Tuesday, September 8, 2015

Infrastructure for Emulation

I've been writing a report about emulation as a preservation strategy. Below the fold, a discussion of one of the ideas that I've been thinking about as I write, the unique position national libraries are in to assist with building the infrastructure emulation needs to succeed.

Friday, May 1, 2015

Talk at IIPC General Assembly

The International Internet Preservation Consortium's General Assembly brings together those involved in Web archiving from around the world. This year's was held at Stanford and the Internet Archive. I was asked to give a short talk outlining the LOCKSS Program, explaining how and why it differs from most Web archiving efforts, and how we plan to evolve it in the near future to align it more closely with the mainstream of Web archiving. Below the fold, an edited text with links to the sources.

Tuesday, February 10, 2015

The Evanescent Web

Papers drawing attention to the decay of links in academic papers have quite a history; I blogged about three relatively early ones six years ago. Now Martin Klein and a team from the Hiberlink project have taken the genre to a whole new level with a paper in PLoS One entitled Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. Their dataset is 2-3 orders of magnitude bigger than those of previous studies, their methods are far more sophisticated, and they study both link rot (links that no longer resolve) and content drift (links that now point to different content). There's a summary on the LSE's blog.

Below the fold, some thoughts on the Klein et al paper.

Tuesday, September 3, 2013

Talk for "RDF Vocabulary Preservation" at iPres2013

The group planning a session on "RDF Vocabulary Preservation" at iPRES2013 asked me to give a brief presentation on the principles behind the LOCKSS technology. Below the fold is an edited text with links to the sources.

Tuesday, August 20, 2013

Annotations

Caroline O'Donovan at the Nieman Journalism Lab has an interesting article entitled Exegesis: How early adopters, innovative publishers, legacy media companies and more are pushing toward the annotated web. She discusses the way media sites including The New York Times, The Financial Times, Quartz and SoundCloud, and platforms such as Medium, are trying to evolve from comments to annotations as a way to improve engagement with their readers. She also describes the work hypothes.is is doing to build annotations into the Web infrastructure. There is also an interesting post on the hypothes.is blog from Peter Brantley on a workshop with journalists. Below the fold, some thoughts on the implications for preserving the Web.