Comments on DSHR's Blog: The Evanescent Web

David. (2016-03-01):

Herbert Van de Sompel, Martin Klein and Shawn Jones revisit the issue of why DOIs are not in practice used to refer to articles in a poster for WWW2016, "Persistent URIs Must Be Used To Be Persistent" (http://arxiv.org/abs/1602.09102). Note that this link is not a DOI, in this case because the poster doesn't have one (yet?).

David. (2015-07-10):

Timothy Geigner at TechDirt supplies the canonical example of why depending on the DMCA "safe harbor" is risky for preservation (https://www.techdirt.com/articles/20150710/06085731609/notgtav-strange-ways-copyright-screws-with-everyone.shtml). Although in this case the right thing happened in response to a false DMCA takedown notice, detecting such notices is somewhere between difficult and impossible.

David. (2015-06-10):

The outages continued sporadically through Tuesday.

This brings up another issue about the collection of link rot statistics. The model behind these studies so far is that a Web resource appears at some point in time, remains continually accessible for a period, then becomes inaccessible and remains inaccessible "for ever". Clearly, the outages noted here show that this isn't the case. Between the resource's first appearance and its last, there is some (probably time-varying) probability that it is available, and that probability is less than 1.
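One way to make that implicit model explicit, sketched in notation that is mine rather than the comment's:

```latex
% Availability as a time-varying probability rather than a step function.
% A_r(t): probability that resource r resolves successfully at time t.
\[
  A_r(t) = \Pr\bigl[\, r \text{ is accessible at time } t \,\bigr],
  \qquad t_{\text{first}} \le t \le t_{\text{last}}
\]
% Link-rot studies implicitly assume A_r(t) = 1 over this whole interval;
% the sporadic outages noted above mean the time-averaged availability
% is strictly less than 1:
\[
  \bar{A}_r \;=\; \frac{1}{t_{\text{last}} - t_{\text{first}}}
  \int_{t_{\text{first}}}^{t_{\text{last}}} A_r(t)\, dt \;<\; 1
\]
```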
David. (2015-06-09):

As reported on the UK Serials Group (http://www.uksg.org/) listserv, UK Elsevier subscribers encountered a major outage last weekend due to "unforeseen technical issues".

David. (2015-04-20):

Geoffrey Bilder has a very interesting and detailed first instalment of a multi-part report on the DOI outage (http://crosstech.crossref.org/2015/03/january-2015-doi-outage-followup-report.html) that is well worth reading.

David. (2015-03-01):

Peter Burnhill supports the last sentence of my post with this very relevant reference: thoughts of (Captain) Clarence Birdseye.

Some advice on quick freezing references to Web-caught resources (http://www.fao.org/wairdocs/tan/x5883e/x5883e01.htm): better done when references are noted (by the author), and then could be re-examined at point of issue (by the editor / publisher). When delivered by the crate (onto digital shelves), the rot may have set in for some of these fish ...
Martin Klein (2015-02-13):

A comment on the issue of soft404s:

Your point is well taken, and the paper's methodology section would clearly have benefited from mentioning this detriment and why we chose not to address it. My co-authors and I are very well aware of the soft404 issue and of common approaches to detect them (such as those introduced in [1] and [2]), and have, in fact, applied such methods in the past [3].

However, given the scale of our corpus of 1 million URIs, and the soft404 ratios found in previous studies (our [3] found a ratio of 0.36% and [4] found 3.41%), we considered checking for soft404s too expensive in light of the potential return. Especially since, as you have pointed out in the past [5], web archives also archive soft404s, we would have had to detect soft404s on the live web as well as in web archives.

Regardless, I absolutely agree that our reference rot numbers for links to web-at-large resources likely represent a lower bound. It would be interesting to investigate the ratio of soft404s and build a good-sized corpus to evaluate common and future detection approaches.

The soft404 on the paper's reference 58 (which is introduced by the publisher) seems to "only" be a function of the PubMed search, as a request for [6] returns a 404.

[1] http://dx.doi.org/10.1145/988672.988716
[2] http://dx.doi.org/10.1145/1526709.1526886
[3] http://arxiv.org/abs/1102.0930
[4] http://dx.doi.org/10.1007/978-3-642-33290-6_22
[5] http://blog.dshr.org/2013/04/making-memento-succesful.html
[6] http://www.ncbi.nlm.nih.gov/pubmed/aodfhdskjhfsjkdhfskldfj

Peter B (2015-02-12):

As ever, a good and challenging read. Although I am not one of the authors of the paper you review, I have been involved in a lot of the underlying thinking as one of the PIs in the project described at Hiberlink.org, and would like to add a few comments, especially on the matter of potential remedy.

We were interested in the prospect of change & intervention in three simple workflows (for the author; for the issuing body; for the hapless library/repository) in order to enable transactional archiving of referenced content, reasoning that it was best that this was done as early as possible after the content on the web was regarded as important, and also that such archiving was best done when the actor in question had their mind in gear.

The prototyping using Zotero and OJS was done via plug-ins because, having access to the source code, our colleague Richard Wincewicz could mock this up as a demonstrator. One strategy was that this would then invite 'borrowing' of the functionality (snapshot / DateTimeStamp / archive / 'decorate' the URI within the citation with its DateTimeStamp) by commercial reference managers and editorial software, so that authors and/or publishers (editors?) did not have to do something special.

Reference rot is a function of time: the sooner the fish (fruit?) is flash frozen, the less chance it has to rot. However, immediate post-publication remedy is better than none. The suggestion that there is a pro-active fix for content ingested into LOCKSS, CLOCKSS and Portico (and other Keepers of digital content) by archiving of references is very much welcomed. This is part of our thinking for remodelling Repository Junction Broker, which supports machine ingest into institutional repositories, but what you suggest could have greater impact.
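To illustrate the "snapshot / DateTimeStamp / decorate" step Peter describes, here is a minimal sketch of what a decorated citation link could look like. The data-* attribute names follow the Robust Links convention (data-versionurl / data-versiondate); the function name and the URLs are illustrative, not the Hiberlink plug-in code.

```python
from datetime import datetime, timezone

def decorate_citation(original_uri: str, snapshot_uri: str,
                      accessed: datetime) -> str:
    """Return an HTML anchor carrying both the original URI and a pointer
    to an archived snapshot taken when the reference was noted.
    The attribute names follow the Robust Links convention; treat the
    whole function as an illustrative sketch."""
    stamp = accessed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (f'<a href="{original_uri}" '
            f'data-versionurl="{snapshot_uri}" '
            f'data-versiondate="{stamp}">{original_uri}</a>')

# Example: a reference "flash frozen" at the moment the author cites it
# (both URLs are hypothetical).
print(decorate_citation(
    "http://www.example.org/report",
    "https://web.archive.org/web/20150212000000/http://www.example.org/report",
    datetime(2015, 2, 12, tzinfo=timezone.utc)))
```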
David. (2015-02-10 14:51):

Good idea, René!

rv (2015-02-10 13:14):

"Note, however, that soft-403s and soft-404s pose the same problem for robustify.js as they do for all Web archiving technologies."

I just uploaded a new version of the robustify.js helper script (https://github.com/renevoorburg/robustify.js) that attempts to recognize soft-404s. It does so by forcing a '404' with a random request and comparing the result with the result of the original request (using fuzzy hashing). It seems to work very well, but I am missing a good test set of soft-404s.
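For readers curious how that check works, here is a rough sketch of the forced-404 comparison René describes. It is not the robustify.js helper itself: difflib similarity stands in for the fuzzy hashing, and the threshold and timeouts are assumptions.

```python
import difflib
import uuid
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import urlopen

def looks_like_soft_404(url: str, threshold: float = 0.9) -> bool:
    """Heuristic soft-404 check: fetch the URL, then fetch a sibling URL
    that almost certainly does not exist; if the server answers 200 for
    both and the bodies are near-identical, the original response is
    probably a generic "not found" page served with a 200 status."""
    try:
        original = urlopen(url, timeout=10).read()
    except (HTTPError, URLError):
        return False  # a hard failure is not a *soft* 404

    bogus = urljoin(url, uuid.uuid4().hex)  # random sibling path, should 404
    try:
        forced = urlopen(bogus, timeout=10).read()
    except HTTPError:
        return False  # server 404s properly, so the 200 above is real content
    except URLError:
        return False

    # Compare the two bodies (first 20 kB); near-identical pages suggest
    # the original URL is a soft 404.
    similarity = difflib.SequenceMatcher(None, original[:20000],
                                         forced[:20000]).ratio()
    return similarity >= threshold

# Example (hypothetical URL):
# print(looks_like_soft_404("http://www.example.org/some/old/page.html"))
```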