Declare your AIndependence: block AI bots, scrapers and crawlers with a single click
This is a great move from Cloudflare. I may start using their service.
The Dark Forest idea comes from the Remembrance of Earth’s Past books by Liu Cixin. It’s an elegant but dispiriting solution to the Fermi paradox. Maggie sums it up:
Dark forest theory suggests that the universe is like a dark forest at night - a place that appears quiet and lifeless because if you make noise, the predators will come eat you.
This theory proposes that all other intelligent civilizations were either killed or learned to shut up. We don’t yet know which category we fall into.
Maggie has described The Expanding Dark Forest and Generative AI:
The dark forest theory of the web points to the increasingly life-like but life-less state of being online. Most open and publicly available spaces on the web are overrun with bots, advertisers, trolls, data scrapers, clickbait, keyword-stuffing “content creators,” and algorithmically manipulated junk.
It’s like a dark forest that seems eerily devoid of human life – all the living creatures are hidden beneath the ground or up in trees. If they reveal themselves, they risk being attacked by automated predators.
Those of us in the cozy web try to keep our heads down, attempting to block the bots plundering our work.
I advocate for taking this further. We should fight back. Let’s exploit the security hole of prompt injections. Here are some people taking action:
I’ve taken steps here on my site. I’d like to tell you exactly what I’ve done. But if I do that, I’m also telling the makers of these bots how to circumvent my attempts at prompt injection.
This feels like another concept from Liu Cixin’s books. Wallfacers:
The sophons can overhear any conversation and intercept any written or digital communication but cannot read human thoughts, so the UN devises a countermeasure by initiating the “Wallfacer” Program. Four individuals are granted vast resources and tasked with generating and fulfilling strategies that must never leave their own heads.
So while I’d normally share my code, I feel like in this case I need to exercise some discretion. But let me give you the broad brushstrokes:
You can view source to see some examples.
I plan to keep updating my pool of potential prompt injections. I’ll add to it whenever I hear of a phrase that might potentially throw a spanner in the works of a scraping bot.
By the way, I should add that I’m doing this as well as using a robots.txt file. So any bot that ingests a prompt injection deserves it.
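To make the robots.txt point concrete, here's a minimal sketch in Python of what a well-behaved crawler is supposed to do before fetching a page. It uses the standard library's urllib.robotparser. The rules, the example.com URL and the is_allowed helper are illustrative assumptions, not the actual contents of this site's robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (an assumption, not this site's real file):
# one named AI crawler is turned away, everyone else is welcome.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder URL.
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/journal/"))
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", "https://example.com/journal/"))
```

A bot that runs a check like this and respects the answer never sees the page, so it never sees anything planted in it. Only the bots that ignore the file get that far.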
I could not disagree with Manton more when he says:
I get the distrust of AI bots but I think discussions to sabotage crawled data go too far, potentially making a mess of the open web. There has never been a system like AI before, and old assumptions about what is fair use don’t really fit.
Bollocks. This is exactly the kind of techno-determinism that boils my blood:
AI companies are not going to go away, but we need to push them in the right directions.
“It’s inevitable!” they cry as though this was a force of nature, not something created by people.
There is nothing inevitable about any technology. The actions we take today are what determine our future. So let’s take steps now to prevent our web being turned into a dark, dark forest.
AI is steeped in marketing drivel, built upon theft, and intent on replacing our creative output with a depressingly shallow imitation.
A handy resource for keeping your blocklist up to date in your robots.txt file.
Though the name of the website is unfortunate with its racism-via-laziness nomenclature.
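If you keep your blocklist as a plain list of user-agent names, turning it into robots.txt rules is a small job. Here's a minimal sketch in Python. The build_robots_txt function is hypothetical, and the three crawler names are just an illustrative sample rather than a list taken from that resource.

```python
def build_robots_txt(blocked_agents: list[str]) -> str:
    """Build one robots.txt group that disallows everything for each agent.

    Several User-agent lines can share a single set of rules, so the
    whole blocklist fits in one group.
    """
    agents = sorted(set(blocked_agents))  # de-duplicate, keep output stable
    if not agents:
        return ""  # a Disallow line with no User-agent line would be invalid
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # An illustrative sample of crawler names (an assumption, not a full list).
    print(build_robots_txt(["GPTBot", "CCBot", "ClaudeBot", "GPTBot"]), end="")
```

Regenerating the file from a list means that updating the blocklist is a one-line change.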
I realized why I hadn’t yet added any rules to my robots.txt: I have zero faith in it.
I endorse this statement.
Readability is back, but now it’s called Mercury.
This tool for building ScrAPIs is an interesting development—the current trend for not providing a simple API (or even a simple RSS feed) is being interpreted as damage and routed around.
A handy step-by-step guide to scraping HTML to get data out. Useful for services (—cough—Twitter—cough—) that keep changing the rules of their API use.
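For a flavour of what scraping HTML involves, here's a minimal sketch in Python that pulls the links out of a chunk of markup using only the standard library's html.parser. It isn't the method from the linked guide. The LinkExtractor class, the extract_links helper and the sample markup are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from every <a href> in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href: str | None = None  # href of the link currently open
        self._text: list[str] = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Text inside nested elements such as <em> arrives here too.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> list[tuple[str, str]]:
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = (
        '<ul><li><a href="/one">First <em>post</em></a></li>'
        '<li><a href="/two">Second post</a></li></ul>'
    )
    print(extract_links(sample))
```

The catch with any scraper like this is that it depends on the shape of the markup, so it breaks whenever the site changes its HTML.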