A poster’s guide to who’s selling your data to train AI

Those Tumblr, Reddit, and WordPress posts you never thought would see the light of day? Yep, them too.

In this photo illustration, the Reddit logo is seen behind a silhouette of a person typing.
Photo Illustration by Rafael Henrique/SOPA Images/LightRocket via Getty Images
A.W. Ohlheiser
A.W. Ohlheiser is a senior technology reporter at Vox, writing about the impact of technology on humans and society. They have also covered online culture and misinformation at the Washington Post, Slate, and the Columbia Journalism Review, among other places. They have an MA in religious studies and journalism from NYU.

If you’ve ever posted anything on the internet, chances are that your data has already been scraped, collected, and used to train AI systems like the ones powering ChatGPT, Midjourney, and Sora. Generative AI is designed to succeed as a generalist, and learning to do so, OpenAI has said, requires “internet-scale” data to train on.

You probably don’t need me to tell you what happened when companies used scraped public data — often without the permission of those who created it — from news articles, books, and creative projects to teach AI tools how to, say, generate news articles, books, and creative projects.

The New York Times is currently suing OpenAI for allegedly using its expansive archives without permission to train chatbots (in a recent filing, OpenAI accused the Times of hiring “someone to hack” ChatGPT in order to prove that the chatbot was stealing their content). Getty Images sued Stability AI, the company behind Stable Diffusion, for copyright infringement. Other lawsuits from authors and creators, angry to find that their works were used to train AI models, have faced setbacks in court.

Other companies have decided to make deals. The Associated Press has licensed part of its archives to OpenAI. Shutterstock, the stock photo archive, has signed a six-year deal with OpenAI to provide training data, which includes access to its photo, video, and music databases.

The ways AI systems use the work of journalists, musicians, and photographers have pretty consequential implications for our information and cultural ecosystem and for the people who work in the fields that AI companies seem dead-set on developing tools to replace. The need to gather more and more training data with as little fuss as possible means that anyone who’s an online poster (whether it’s a fandom Tumblr account, an active Reddit presence, or a personal blog) could see access to their content being sold by the platforms hosting it to one of these big AI companies.

Below is a quick guide to what we know right now about who might be selling your best posts as training data.

Tumblr and WordPress.com

Earlier this week, 404 Media reported that Automattic, the parent company for Tumblr and WordPress, was preparing to announce deals selling user data to OpenAI and Midjourney. According to 404’s reporting, which describes such a deal as “imminent,” the data seems likely to include user posts on Tumblr and on WordPress.com. On Wednesday, a day after 404’s report, Automattic announced a way for users to opt out of sharing their public content with third parties.

The Tumblr staff announcement on the change framed the whole thing as a sign that the company was working to protect its users. “We already discourage AI crawlers from gathering content from Tumblr and will continue to do so,” the announcement read, “save for those with which we partner.”

Automattic said in a statement that it was “working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control,” but has not provided any further information on the reported deals with OpenAI and Midjourney.

Although Tumblr’s cultural heft has waned over the past decade, it’s still a pretty important platform for fandom content, including fanfiction and fan art. There are also plenty of artists who use Tumblr to host their original work and take commissions.

Reddit

Reddit’s enormous archives of posts are driven by the labor of volunteers: Unpaid subreddit moderators oversee communities of unpaid users. Their collective efforts on Reddit make the platform valuable.

So when Reddit announced that it was launching an IPO, the company reached out to a selection of mods and frequent posters to offer them the opportunity to buy stock early. Some of those who received the offer were not super enthusiastic about it. But Reddit does not need buy-in from its users to profit from their work: It has already sold access to their posts to Google.

Just before the IPO announcement, Reddit and Google entered into a $60 million deal that would give Google access to Reddit’s API in order to, among other things, train its generative AI models.

Everything else, to be honest

The reported deals above are just a couple that have become public. But the lack of an announced deal doesn’t mean that large AI models aren’t already being trained on your posts across the internet.

Last year, the Washington Post examined one of the massive data sets of scraped public internet data used to train generative AI models and found everything from World of Warcraft message boards to Patreon and Kickstarter and several huge repositories of personal blogs. And it should not be a surprise that Meta uses public posts from Facebook and Instagram to train its AI models.
