unrot•link
Remy has turned his linkrot-battling technique into a service that you can use. He has more details on his blog.
After two days at border:none in Nuremberg, it was time for two days at Indie Web Camp, also in Nuremberg.
I hadn’t been to an Indie Web Camp since before The Situation. It felt very good to be back. I had almost forgotten how inspiring and productive they can be.
This one had a good turnout of around twenty people. We had ourselves an excellent first day of thought-provoking sessions. Then on day two it was time to put some of those ideas into action.
A little trick I like to do on the practical day is to have two tasks to attempt: one of them quite simple, and the other more ambitious. That way, as long as I get the simpler task done, I’ll always have at least something to demo at the end of the day.
This time I attempted three bits of home improvement on my website.
The first problem I set myself was ostensibly the simple one. But it involved regular expressions, so then I had two problems.
I wanted to automatically link up Mastodon usernames if I mentioned one in my notes. For example, during border:none I mentioned Brian’s mastodon username in a note: @briansuda@loðfíll.is.
That turned out to be an excellent test case. Those Icelandic characters made sure I wasn’t making unwarranted assumptions about character sets.
Here’s the regular expression I came up with. It’s not foolproof by any means. Basically it looks for anything matching the pattern @username@server.
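Something along these lines would do the job. This is just a sketch using Unicode-aware character classes rather than the exact expression I shipped, but it shows the general idea:

```php
<?php
// Sketch of auto-linking Mastodon-style @username@server mentions.
// The \p{L} and \p{N} classes (with the /u modifier) allow non-ASCII
// letters, so usernames on domains like loðfíll.is still match.
// Illustrative only; not the exact expression in use on my site.
function linkMastodonMentions(string $text): string {
    $pattern = '/@([\p{L}\p{N}_]+)@([\p{L}\p{N}\-.]+\.[\p{L}]{2,})/u';
    return preg_replace_callback($pattern, function ($matches) {
        $username = $matches[1];
        $server = $matches[2];
        // Link to the conventional profile URL on that server.
        return '<a href="https://' . $server . '/@' . $username . '">'
            . '@' . $username . '@' . $server . '</a>';
    }, $text);
}
```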
Good enough. Ship it.
My next task was a bit more ambitious. It involved SQL queries, something I’m slightly better at than regular expressions, but that’s a very low bar.
I wanted to show related posts when you get to the end of one of my blog posts.
I’ve been tagging all my blog posts for years so that’s the mechanism I used for finding similar posts. There’s probably a clever SQL statement that could do this, but I ended up brute-forcing it a bit.
I don’t feel too bad about the hacky clunky nature of my solution, because I cache blog post pages. That means only the first person to view the blog post (usually me) will suffer any performance impacts from my clunky database queries. After that everything’s available straight from a cached file.
Let’s say you’re reading a blog post of mine that I’ve tagged with ten different keywords. I make a separate SQL query for each keyword to get all the other posts that use that tag. Then it’s a matter of sorting through all the results.
I loop through the results of each tag and apply a score to the tagged post. If the post shares one tag with the post you’re looking at, it has a score of one. If it shares two tags, it has a score of two, and so on.
I decided that for a post to be considered related, it had to share at least three tags. I also decided to limit the list of related posts to a maximum of five.
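To give a sense of the brute force involved, here’s a sketch of that scoring logic in PHP. The table and column names are placeholders rather than my actual schema:

```php
<?php
// Sketch of the tag-scoring approach described above.
// Assumes a PDO connection and a hypothetical `tags` table
// with `post_id` and `tag` columns; not the real schema.
function getRelatedPosts(PDO $db, int $postId): array {
    // Get the tags attached to the current post.
    $statement = $db->prepare('SELECT tag FROM tags WHERE post_id = ?');
    $statement->execute([$postId]);
    $tags = $statement->fetchAll(PDO::FETCH_COLUMN);

    // One query per tag: gather every other post that shares it.
    $scores = [];
    foreach ($tags as $tag) {
        $statement = $db->prepare(
            'SELECT post_id FROM tags WHERE tag = ? AND post_id != ?'
        );
        $statement->execute([$tag, $postId]);
        foreach ($statement->fetchAll(PDO::FETCH_COLUMN) as $otherId) {
            // Each shared tag adds one to that post's score.
            $scores[$otherId] = ($scores[$otherId] ?? 0) + 1;
        }
    }

    // Only posts sharing at least three tags count as related.
    $related = array_filter($scores, fn($score) => $score >= 3);

    // Highest scores first, capped at five.
    arsort($related);
    return array_slice(array_keys($related), 0, 5);
}
```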
It worked out pretty well. If you scroll down on my recent post about JavaScript, you’ll see links to related posts about JavaScript. If you read through a post on accessibility testing, you’ll find other posts about accessibility testing. If you make it to the end of this post about Mars colonisation you’ll see links to more posts about exploring our solar system.
Right now I’m just doing this for my blog but I’d like to do it for my links too. A job for a future Indie Web Camp.
I was very inspired by Remy’s recent post on how he’s tackling link rot on his site. I wanted to do the same for mine.
On the first day at Indie Web Camp I led a session on link rot to gather ideas and alternative approaches. We had a really good discussion, though it’s always worth bearing in mind that there’ll never be a perfect solution. There’ll always be some false positives and some false negatives.
The other Jeremy at Indie Web Camp Nuremberg blogged about the session. Sebastian Greger was attending remotely and the session inspired him to spend the second day also tackling linkrot.
In the end I decided to stick with Remy’s two-pronged approach:

1. a client-side script that intercepts clicks on outbound links and sends them through my own endpoint, and
2. a server-side redirector that checks whether the URL still resolves and, if it doesn’t, sends the visitor to the Internet Archive’s copy instead.
Here’s the JavaScript I wrote for the first part.
It’s very similar to Remy’s but with one little addition: I check to see if the clicked link is inside an h-entry and, if it is, I pass on the date from the post’s dt-published value.
Here’s the PHP I wrote for the server-side redirector. It makes a curl request to get the response headers from the URL, with the time limit set to one second.

Not perfect by any means, but it works for the most common cases of link rot.
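In outline, that kind of redirector looks something like this sketch. The query parameter names and the use of the Wayback Machine’s availability API are illustrative assumptions, not the actual code:

```php
<?php
// Illustrative sketch of a link-checking redirector; not the actual code.
// Expects ?url=…&date=YYYY-MM-DD (the date comes from the post's dt-published value).
$url  = $_GET['url'] ?? '';
$date = $_GET['date'] ?? '';

// Make a curl request for just the response headers, with a one-second time limit.
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_NOBODY, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_TIMEOUT, 1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_exec($curl);
$status = curl_getinfo($curl, CURLINFO_RESPONSE_CODE);
curl_close($curl);

if ($status >= 200 && $status < 400) {
    // The link still works: send the visitor straight there.
    header('Location: ' . $url, true, 301);
    exit;
}

// Otherwise, ask the Wayback Machine for a snapshot near the post's date.
$timestamp = str_replace('-', '', $date);
$lookup = 'https://archive.org/wayback/available?url=' . urlencode($url)
    . '&timestamp=' . $timestamp;
$response = json_decode(file_get_contents($lookup), true);
$snapshot = $response['archived_snapshots']['closest']['url'] ?? $url;

header('Location: ' . $snapshot, true, 301);
exit;
```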
For the demo at the end of the day I went back into my archive of over 10,000 links and plucked out some old posts, like this one from December 2005. It takes a little while to do the rerouting but eventually you get to see the archived version from the same time period as when I linked to it.
Here’s another link from 2005. Here’s another. Those links are broken now, but with a little patience, you’ll still get to read them on the Internet Archive.
The Internet Archive’s Wayback Machine really is a gift. I can’t imagine how it would be even remotely possible to try to address link rot on my site without archive.org.
I will continue to donate money to the Internet Archive and I encourage you to do the same.
I really, really like the progressive enhancement approach that Remy is taking here with outbound links:
When a real user clicks on a link, it’s swapped out to be redirected through my own endpoint that checks if the URL is still OK, and if so permanently redirects the visitor, otherwise my endpoint checks the Web Archive for the URL and permanently redirects to that instead.
I think I’m going to do the same! I’d have to rewrite the server-side code in PHP, but that shouldn’t be too tricky.
This could be a project for the next Indie Web Camp I attend.
Deleting your old thoughts may be giving your older self a kick they really don’t deserve. And the beauty of having an archive is that you don’t need to decide whether you were right or not. Your views, with a date attached, can stand as a reflection of a specific moment in time.
Reconciling every past view you’ve ever had with how you feel now isn’t required. It sounds exhausting, frankly.
This is like the Gashlycrumb Tinies but for websites:
It’s been interesting to see how websites die — from domain parking pages to timeouts to blank pages to outdated TLS cipher errors, there are a multitude of different ways.
The internet, it turns out, is not forever. It’s on more of like a 10-year cycle. It’s constantly upgrading and migrating in ways that are incompatible with past content, leaving broken links and error pages in its wake. In other instances, the sites simply shutter, or become so layered over that finding your own footprint is impossible—I have searched “Kate Lindsay Myspace” every which way and have concluded that my content from that platform must simply be lost to time, ingested by the Shai-Hulud of the internet.
When I post a link, I do it for two reasons.
First of all, it’s me pointing at something and saying “Check this out!”
Secondly, it’s a way for me to stash something away that I might want to return to. I tag all my links so when I need to find one again, I just need to think “Now what would past me have tagged it with?” Then I type the appropriate URL: adactio.com/links/tags/whatever
There are some links that I return to again and again.
Back in 2008, I linked to a document called A Few Notes on The Culture. It’s a copy of a post by Iain M Banks to a newsgroup back in 1994.
Alas, that link is dead. Linkrot, innit?
But in 2013 I linked to the same document on a different domain. That link still works even though I believe it was first published around twenty(!) years ago (view source for some pre-CSS markup nostalgia).
Anyway, A Few Notes On The Culture is a fascinating look at the world-building of Iain M Banks’s Culture novels. He talks about the in-world engineering, education, biology, and belief system of his imagined utopia. The part that sticks in my mind is when he talks about economics:
Let me state here a personal conviction that appears, right now, to be profoundly unfashionable; which is that a planned economy can be more productive - and more morally desirable - than one left to market forces.
The market is a good example of evolution in action; the try-everything-and-see-what-works approach. This might provide a perfectly morally satisfactory resource-management system so long as there was absolutely no question of any sentient creature ever being treated purely as one of those resources. The market, for all its (profoundly inelegant) complexities, remains a crude and essentially blind system, and is — without the sort of drastic amendments liable to cripple the economic efficacy which is its greatest claimed asset — intrinsically incapable of distinguishing between simple non-use of matter resulting from processal superfluity and the acute, prolonged and wide-spread suffering of conscious beings.
It is, arguably, in the elevation of this profoundly mechanistic (and in that sense perversely innocent) system to a position above all other moral, philosophical and political values and considerations that humankind displays most convincingly both its present intellectual immaturity and — through grossly pursued selfishness rather than the applied hatred of others — a kind of synthetic evil.
Those three paragraphs might be the most succinct critique of unfettered capitalism I’ve come across. The invisible hand as a paperclip maximiser.
Like I said, it’s a fascinating document. In fact I realised that I should probably store a copy of it for myself.
I have a section of my site called “extras” where I dump miscellaneous stuff. Most of it is unlinked. It’s mostly for my own benefit. That’s where I’ve put my copy of A Few Notes On The Culture.
Here’s a funny thing …for all the times that I’ve revisited the link, I never knew anything about the site it was hosted on—vavatch.co.uk—so this most recent time, I did a bit of clicking around. Clearly it’s the personal website of a sci-fi-loving college student from the early 2000s. But what came as a revelation to me was that the site belonged to …Adrian Hon!
I’m impressed that he kept his old website up even after moving over to the domain mssv.net, founding Six To Start, and writing A History Of The Future In 100 Objects. That’s a great snackable book, by the way. Well worth a read.
My last long-distance trip before we were all grounded by The Situation was to San Francisco at the end of 2019. I attended Indie Web Camp while I was there, which gave me the opportunity to add a little something to my website: an “on this day” page.
I’m glad I did. While it’s probably of little interest to anyone else, I enjoy scrolling back to see how the same date unfolded over the years.
’Sfunny, when I look back at older journal entries they’re often written out of frustration, usually when something in the dev world is bugging me. But when I look back at all the links I’ve bookmarked the vibe is much more enthusiastic, like I’m excitedly pointing at something and saying “Check this out!” I feel like sentiment analyses of those two sections of my site would yield two different results.
But when I scroll down through my “on this day” page, it also feels like descending deeper into the dark waters of linkrot. For each year back in time, the probability of a link still working decreases until there’s nothing but decay.
Sadly this is nothing new. I’ve been lamenting the state of digital preservation for years now. More recently Jonathan Zittrain penned an article in The Atlantic on the topic:
Too much has been lost already. The glue that holds humanity’s knowledge together is coming undone.
In one sense, linkrot is the price we pay for the web’s particular system of hypertext. We don’t have two-way linking, which means there’s no centralised repository of links, which would be prohibitively complex to maintain. So when you want to link to something on the web, you just do it. An a element with an href attribute. That’s it. You don’t need to check with the owner of the resource you’re linking to. You don’t need to check with anyone. You have complete freedom to link to any URL you want to.
But it’s that same simple system that makes the act of linking a gamble. If the URL you’ve linked to goes away, you’ll have no way of knowing.
As I scroll down my “on this day” page, I come across more and more dead links that have been snapped off from the fabric of the web.
If I stop and think about it, it can get quite dispiriting. Why bother making hyperlinks at all? It’s only a matter of time until those links break.
And yet I still keep linking. I still keep pointing to things and saying “Check this out!” even though I know that over a long enough timescale, there’s little chance that the link will hold.
In a sense, every hyperlink on the World Wide Web is a little act of hope. Even though I know that when I link to something, it probably won’t last, I still harbour that hope.
If hyperlinks are built on hope, and the web is made of hyperlinks, then in a way, the World Wide Web is quite literally made out of hope.
I like that.
A terrific piece by Jonathan Zittrain on bitrot and online digital preservation:
Too much has been lost already. The glue that holds humanity’s knowledge together is coming undone.
My work shouldn’t be presented in the Smithsonian behind glass or anything, I’m just pointing at this enormous flaw in the architecture of the web itself: you’re renting servers and renting URLs. Nothing is permanent because on the web we don’t really own any space, we’re just borrowing land temporarily.
Flickr is removing anything over 1,000 photos on accounts that are not “pro” (paid for) in 2019. We highlight large and amazing accounts that could use a gift to go pro. We take nominations and track when these accounts are saved.
This is very, very good news. Following on from the recent announcement that a huge swathe of Flickr photos would soon be deleted, there’s now an update: any photos that are Creative Commons licensed won’t be deleted after all. Phew!
I wonder if I can get a refund for that pro account I just bought last week to keep my Creative Commons licensed Flickr pictures online.
I’ve got a lot of photos on Flickr (even though I don’t use it directly much these days) and I’ve paid up for a pro account to protect those photos, but I’m very worried about this:
Beginning January 8, 2019, Free accounts will be limited to 1,000 photos and videos.
That in itself is fine, but any existing non-pro accounts with more than 1000 photos will have older photos deleted until the total comes down to 1000. This means that anyone linking to those photos (or embedding them in blog posts or articles) will have broken links and images.
Tears in the rain.
A profile of Mark Graham and the team at the Internet Archive.
It was fun spelunking with Tantek, digging into some digital archeology in an attempt to track down a post by Ben Ward that I remembered reading years ago.
This is intriguing—a Pinboard-like service that will create local copies of pages you link to from your site. There are plug-ins for WordPress and Drupal, and modules for Apache and Nginx.
Amber is an open source tool for websites to provide their visitors persistent routes to information. It automatically preserves a snapshot of every page linked to on a website, giving visitors a fallback option if links become inaccessible.
The promise of the web is that Alexandria’s library might be resurrected for the modern world. But today’s great library is being destroyed even as it is being built.
A fascinating account of one story’s linkrot that mirrors the woeful state of our attitude to cultural preservation on the web.
Historians and digital preservationists agree on this fact: The early web, today’s web, will be mostly lost to time.
Tim Berners-Lee is quite rightly worried about linkrot:
The disappearance of web material and the rotting of links is itself a major problem.
He brings up an interesting point that I hadn’t fully considered: as more and more sites migrate from HTTP to HTTPS (A Good Thing), and the W3C encourages this move, isn’t there a danger of creating even more linkrot?
…perhaps doing more damage to the web than any other change in its history.
I think that may be a bit overstated. As many others point out, almost all sites making the switch are conscientious about maintaining redirects with a 301 status code.
(There’s also a similar 308 status code that I hadn’t come across, but after a bit of investigating, that looks to be a bit of a mess.)
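For what it’s worth, issuing that kind of permanent redirect takes very little code. Here’s a minimal PHP sketch, though most sites would handle this in the server configuration rather than in application code:

```php
<?php
// Minimal sketch: permanently redirect any HTTP request to its HTTPS equivalent.
if (empty($_SERVER['HTTPS']) || $_SERVER['HTTPS'] === 'off') {
    $destination = 'https://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
    header('Location: ' . $destination, true, 301);
    exit;
}
```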
Anyway, the discussion does bring up some interesting points. Transport Layer Security is something that’s handled between the browser and the server—does it really need to be visible in the protocol portion of the URL? Or is that visibility a positive attribute that makes it clear that the URL is “good”?
And as more sites move to HTTPS, should browsers change their default behaviour? Right now, typing “example.com” into a browser’s address bar will cause it to automatically expand to http://example.com …shouldn’t browsers look for https://example.com first?
All good food for thought.
There’s a Google Doc out there with some advice for migrating to HTTPS. Unfortunately, the trickiest part—getting and installing certificates—is currently an owl-drawing tutorial, but hopefully it will get expanded.
If you’re looking for even more reasons why enabling TLS for your site is a good idea, look no further than the latest shenanigans from ISPs in the UK (we lost the battle for net neutrality in this country some time ago).
BT just inserted a popup into someone’s site, encouraging me to switch on content filtering. That is Very Not Cool. pic.twitter.com/QMnLRawsNW
— David Thompson (@fatbusinessman) December 30, 2014
They can’t do that to pages served over HTTPS.
The short answer: not much.
The UK Web Archive at The British Library outlines its process for determining how bad the linkrot is after just one decade.
The Internet forgets every single day.
I’m with Jason.
I encourage you all to take a moment and consider the importance of preserving your online creations for yourself, your family, and for future generations.