ADDED 4 October 2023:
Google has announced a new token you can block to exclude your website from training Bard and Vertex AI: Google-Extended. To block your site from being used to train Google’s AI products, you should include this code in your robots.txt file:
# Google AI
User-agent: Google-Extended
Disallow: /
As a standalone token, that means that we don’t need to block Google from indexing our websites to block them from using our content to train their AI products.
⭐ ADDED 11 December 2023:
Except!!!! Google-Extended applies to their products but not their generative search results. So if you don’t want your content to appear in generative search results, you still need to de-index your site.
⭐ ADDED 19 December 2024:
Turns out that even if you have noindex on your page, if you have *also* blocked the Googlebot crawler in your robots.txt file, then it won’t see the noindex instruction. That means if someone else’s page that is crawled/indexed (?) gets linked to yours, Google will know the page exists but not that they aren’t supposed to index it. (If I am parsing correctly.) So, I have unblocked the Googlebot crawler in my robots.txt file 🙄
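Put together, the arrangement this note describes looks roughly like the robots.txt sketch below. It is an illustration, not a copy of the live file, and it assumes the noindex meta tag is still in the page header. An empty Disallow line is the standard way of saying nothing is blocked for that user agent.

```text
# Let Googlebot crawl, so it can actually read the noindex meta tag on each page
User-agent: Googlebot
Disallow:

# Still opt out of training for Google's AI products
User-agent: Google-Extended
Disallow: /
```

With this split, the meta tag does the de-indexing and robots.txt only handles the AI-training opt-out.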
ORIGINAL ARTICLE (published 11 July 2023):
After thinking about it for a couple days, I’ve decided to de-index my website from Google. It’s reversible — I’m sure Google will happily reindex it if I let them — so I’m just going ahead and doing it for now. I’m not down with Google swallowing everything posted on the internet to train their generative AI models. I was pushed over the edge by posts from Jeremy Keith and Vasilis van Gemert, thanks y’all.
I don’t have Google Search Console set up for this website, so I don’t know how much search traffic I get. My other blog, Cascadia Inspired, got about 200 hits in the past three months. I’m not going to cry over that: they’re mostly going to one 2015 article anyway (and probably not that helpful of a post, to my eye). Around New Year’s every year I usually get an influx of people to my ten-year-old guide to doing a creative annual review. Sorry folks, I’m sure someone else has written something better by now 😉
I’m going to start by pulling my websites out of Google search, then work on adding my sites to directories. Maybe I’ll even join a webring 💍✨
Adding a noindex meta tag to my WordPress header
Because my website has already been indexed by Google, I need to allow the Google bot to re-crawl the pages and see the new “noindex” instruction. So in the future I’ll also block the Googlebot crawler, but not just yet 😉
I added this code to the functions.php file of my child theme:
// Ask Googlebot not to index this site's pages or images, or follow its links.
add_action( 'wp_head', function() {
    echo '<meta name="googlebot" content="noindex, nofollow, noimageindex">';
});
I figured out how to adapt this from WPExplorer. This random WordPress plugin help forum suggested another version; I don’t know which is better 🤷♀️
I’m not 100% sure whether noimageindex is actually helpful for Googlebot since that’s their text bot, but it can’t hurt, right? (Tell me if it hurts lol.) Yoast says there’s a better way to block image indexing, but I’m scared of touching the .htaccess file, and I’m definitely not touching anything on my server 😂 (I’m on shared hosting anyway, so I think the edits I can make are limited?)
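One option that avoids .htaccess entirely is to address Google's image crawler by its own token in robots.txt. The sketch below is a hedged illustration rather than something taken from this site's file, and it assumes Googlebot-Image is still the token Google uses for image crawling.

```text
# Ask Google's image crawler to stay away from the whole site
User-agent: Googlebot-Image
Disallow: /
```

To cover only part of the site, replace the `/` with a folder path; the uploads folder would be the likely candidate on a WordPress install, though that path is an assumption to check against your own setup.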
Blocking bots that collect training data for AIs (and more)
In addition, I created a robots.txt file to tell “law abiding” bots what they’re not allowed to look at. I ought to have done this before but kind of assumed it came with my WordPress install 😅 (Nope.)
AI user agents to block
There are so many now; just copy from my robots.txt file tbh.
ADDED 4 October 23: To block training of Google’s Bard, I blocked Google-Extended.
I specifically want to deter my website being used for training LLMs, so I blocked Common Crawl.
To block OpenAI, I blocked both user agents ChatGPT-User and GPTBot. (Added GPTBot 10 August 23)
ADDED 4 October 23: Per Neil Clarke’s article, I have also blocked Omgilibot, Omgili, and FacebookBot. (Via Jeremy Keith)
ADDED 14 February 2024: I also blocked user agents used in AI training sets: anthropic-ai, Bytespider, FacebookBot, and PerplexityBot (source)
ADDED 16 April 2024: prompted by Ethan Marcotte, I blocked several more known and suspected user agents used in AI training: Claude-Web, ClaudeBot, cohere-ai, Diffbot, YouBot, ChatGPT
Added 17 June 2024: I’ve now blocked Apple’s AI training bot Applebot-Extended (thanks for the heads-up James!) Does anyone else feel like this is getting ridiculous?
I also blocked Amazonbot and applebot to block Siri and Alexa’s “smart answers.” I believe this also excludes me from Apple search.
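For reference, the agents named above can share a single rule group, since several User-agent lines may sit above one Disallow. This is a sketch assembled from the names in this post, not a copy of the live file. One assumption: CCBot is the token Common Crawl documents for its crawler, as the post only names the organization.

```text
# AI training and AI assistant crawlers named in this post
User-agent: Google-Extended
User-agent: CCBot
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Omgilibot
User-agent: Omgili
User-agent: FacebookBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: PerplexityBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: YouBot
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: Applebot
Disallow: /
```

Keeping everything in one group means a new agent is a one-line addition. The trade-off is that a single group can't give different agents different rules.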
Dark Visitors apparently has a WordPress plugin to update your robots.txt whenever a new agent comes out, but for now I’m stickin’ with manual. I am also still wary of modifying my .htaccess file and breaking something, so it’s just my robots.txt making my stance clear — I can’t control whether companies have any sort of ethics and comply, unfortunately.
Other user agents
Searching on DuckDuckGo, I found an older article from a theme maker with specific advice for WordPress robots.txt. From there I jumped to Jeff Starr’s recommendations from 2020.
I also appreciate fellow opinionated individuals on the internet so I followed some other blocks from Rohan Kumar. I would happily take more opinionated suggestions of junk bots to block if anyone else has opinions or can point me to a list somewhere 😉
Note: this article generated a lot of interest! See a Hacker News discussion.
30 replies on “Pulling my site from Google over AI training”
Tracy,
I can understand your decision, but based off my reading of your post, it seems like you don’t understand some key fundamental things. There is this concept of credibility, and when dealing with deceitful people credibility is very important. If they aren’t credible, you can’t trust that they won’t do whatever is in their best interest. Google has said they are going to use anything on the open internet to train their AI regardless of the content owners’ stance, regardless of robots.txt or some indicator stating your wishes.
De-indexing in my opinion doesn’t do much since they’ve already said they won’t follow things like robots.txt (i.e. they have no credibility). Meta no-index is just another flag like robots.txt. You’d have to do a lot of work identifying their crawlers, and serving those requests fake useless data at your cost, or get really creative and make a non-deterministic input to break determinism on their code reading your site.
Hi Dundir, thanks for your concern.
I freely acknowledge the futility of the gesture. This post may not address it*, but I do recognize that Google has the power in this scenario. They are under no obligation to honor robots.txt or noindex instructions. They can and will, I’m sure, consume everything I publish regardless of anything I do short of making my site private. But, I am making clear that they are doing so without permission. Physical businesses can 86 someone; likewise, I can disallow their crawler from the website that I pay for. They are not invited here; they are breaking and entering with intent to steal. I simply don’t have enforcement power.
I know it doesn’t matter what my opinion of fair use is. Our laws were not designed with this kind of technology in mind, and it’s very possible corporations will win all their court cases over training data. Even if they do, I still don’t have to believe it is fair or right for anyone to steal my intellectual property to use it to create a competing product. We have many unjust laws that favor corporations over individuals.
All I can do is raise my hand and say, I do not consent. I don’t have to accept their theft without complaint — and because I’ve published my complaint online, it’s public and visible. I can bear witness and protest the ethics and legality of non-consensual data use. I will never win a technical battle against a corporation, but they can’t silence me when I have my own website. It is a double-edged sword: my writing is available to steal, but it’s also available to read. By de-indexing, I am declaring that I don’t need them — I am putting my trust in human curators over search. But they do need “me” (as in, people writing original content and publishing it online). Yes, I’m a silly idealist, but I’m not going to let certain failure stop me protesting injustice against myself. This is a hopeless righteous effort, but I make it out of pride for my work and its worth.
* Frankly, I didn’t anticipate many people seeing this post, and chiefly intended it as a reference for others with WordPress sites who might want to do the same. If I’d known it would hit Hacker News, I would have spelled out a lot more of this sentiment 😉
Liked Not writing for Google by Leon Paternoster (thisdaysportion.com)
When your goals are different than the masses, you don’t need to act like the masses.
I mentioned the other day about de-indexing most of my sites from Google, but didn’t actually mention how to do that. Well, here’s how. As well as the above, I also block some bots via CloudFlare, though the fact their WAF requires whole specific user agent strings makes it less useful (albeit more “nuclear option”) […]