Frustrated by claims that "enlightenment" and similar meditative/introspective practices can't be explained, and that you can only understand them by experiencing them, Kaj set out to write his own detailed, gears-level, non-mysterious, non-"woo" explanation of how meditation and related practices work, in the same way you might explain the operation of an internal combustion engine.
Note: This is an automated crosspost from Anthropic. The bot selects content from many AI-safety-relevant sources and is not affiliated with the authors, their organization, or LessWrong.
In response to the White House’s Request for Information on an AI Action Plan, Anthropic has submitted recommendations to the Office of Science and Technology Policy (OSTP). Our recommendations are designed to better prepare America to capture the economic benefits, and address the national security implications, of powerful AI systems.
As our CEO Dario Amodei writes in ‘Machines of Loving Grace’, we expect powerful AI systems to emerge in late 2026 or early 2027. These systems will have the following properties:
Here's my summary of the recommendations:
TLDR: AI models are now capable enough that we might get relevant information from monitoring for scheming in regular deployments, in both internal and external deployment settings. We propose concrete ideas for what this could look like while preserving the privacy of customers and developers.
By “in the wild,” we basically mean any setting that is not intentionally created to measure something (such as evaluations). This could be any deployment setting, from using LLMs as chatbots to long-range agentic tasks.
We broadly differentiate between two settings:
Since scheming is especially important in LM agent settings, we suggest...
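To make the monitoring idea a bit more concrete, here is a minimal, hypothetical sketch of what a privacy-preserving scheming monitor over agent transcripts could look like. The function names (`scheming_score`, `redact`, `monitor`) and the hash-and-score logging scheme are assumptions for illustration, not the authors' concrete proposal:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class MonitorResult:
    transcript_id: str  # salted hash only; raw text never leaves the deployment
    flagged: bool
    score: float


def redact(text: str) -> str:
    """Placeholder: strip user- or developer-identifying content before any logging."""
    return "<redacted>"


def scheming_score(transcript: str) -> float:
    """Placeholder for a classifier or prompted-LLM judge that rates scheming-relevant
    behaviour (e.g. the agent misrepresenting its own actions). How to score this well
    is exactly the open question the post is about."""
    return 0.0


def monitor(transcript: str, threshold: float = 0.8) -> MonitorResult:
    # Retain only a hash and an aggregate score, so flagged cases can be counted and
    # audited without storing customer content.
    tid = hashlib.sha256(transcript.encode()).hexdigest()[:16]
    score = scheming_score(redact(transcript))
    return MonitorResult(transcript_id=tid, flagged=score >= threshold, score=score)
```

The design choice worth noting is that the monitor emits only aggregate, de-identified signals; anything richer would need the consent mechanisms the post discusses.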
To make a clarifying point (which will perhaps benefit other readers): you're using the term "scheming" in a different sense from how Joe's report or Ryan's writing uses the term, right?
I assume your usage is in keeping with your paper here, which is definitely different from those other two writers' usages. In particular, you use the term "scheming" to refer to a much broader set of failure modes. In fact, I think you're using the term synonymously with Joe's "alignment-faking"—is that right?
This post was written by prompting ChatGPT.
Discussions of anthropic reasoning often focus on existential threats, from simulation shutdowns to the infamous Roko’s Basilisk—a hypothetical AI that retroactively punishes those who don’t work to bring it into existence. But what if we flipped the premise? Instead of existential risks enforcing obedience through fear, could the weight of anthropic reasoning favor futures that maximize life, complexity, and cooperation?
In this post, we explore an alternative anthropic wager—one where futures rich in life and intelligence exert a stronger influence on the past. If there are more timelines where civilizations successfully transition into expansive, thriving futures, then betting on such futures might have a higher expected payoff than fearful compliance with coercive simulations.
The Self-Indication Assumption (SIA) suggests...
Does it have an argument in favor of SIA?
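For readers who haven't seen SIA stated formally: it weights each possible world by the number of observers it contains, so observer-rich worlds get more posterior mass. A standard textbook formulation (not taken from the post) is:

```latex
% SIA: reweight the prior over worlds w by the number of observers N(w) they contain.
P_{\mathrm{SIA}}(w) \;=\; \frac{N(w)\,P_{\mathrm{prior}}(w)}{\sum_{w'} N(w')\,P_{\mathrm{prior}}(w')}
```

This reweighting is the lever the "anthropic wager" above pulls on: timelines rich in life and intelligence carry proportionally more weight than sparse ones.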
I’m not a natural “doomsayer.” But unfortunately, part of my job as an AI safety researcher is to think about the more troubling scenarios.
I’m like a mechanic scrambling through last-minute checks before Apollo 13 takes off. If you ask for my take on the situation, I won’t comment on the quality of the in-flight entertainment, or describe how beautiful the stars will appear from space.
I will tell you what could go wrong. That is what I intend to do in this story.
Now I should clarify what this is exactly. It's not a prediction. I don’t expect AI progress to be as fast or as untamable as I portray here. It’s not pure fantasy either.
It is my worst nightmare.
It’s a sampling from the futures that are among the most devastating,...
I should note that I'm quite uncertain here and I can easily imagine my views swinging by large amounts.
A desirable property of an AI’s world model is that you as its programmer have an idea what’s going on inside. It would be good if you could point to a part of the world model and say, “This here encodes the concept of a strawberry; here is how this is linked with other concepts; here you can see where the world model is aware of individual strawberries in the world.” This seems, for example, useful for directly specifying goals in the world – like “go and produce diamonds” – without having to do reinforcement learning or some other kind of indirect goal learning; if we knew how to find diamonds in the AI’s world model, we could directly write a goal function.
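As an illustration of what "directly writing a goal function" could look like if we really could point to concepts inside the world model, here is a hypothetical sketch. The `WorldModel` interface, `find_concept`, and `count_instances` are invented names standing in for capabilities we do not currently have:

```python
from typing import Any, Protocol


class WorldModel(Protocol):
    """Hypothetical interface: assumes we can locate a concept inside the model
    and query how many instances of it the model currently represents."""
    def find_concept(self, name: str) -> Any: ...
    def count_instances(self, concept: Any) -> int: ...


def diamond_goal(wm: WorldModel) -> float:
    """Score a (predicted) world state by the number of diamonds the model represents.
    The goal is written directly against the model's internal ontology, with no
    reinforcement learning or other indirect goal learning, assuming the 'diamond'
    concept can actually be found."""
    diamond = wm.find_concept("diamond")
    return float(wm.count_instances(diamond))
```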
But you won’t get this...
Great post! Agree with the points raised but would like to add that restricting the expressivity isn’t the only way that we can try to make the world model more interpretable by design. There are many ways that we can decompose a world model into components, and human concepts correspond to some of the components (under a particular decomposition) as opposed to the world model as a whole. We can backpropagate desiderata about ontology identification to the way that the world model is decomposed.
For instance, suppose that we’re trying to identify the ...
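One way to picture the decomposition idea (a sketch of one possible instantiation, not the commenter's specific proposal): treat the world model's latent state as a sparse combination of component directions, and then impose the ontology-identification desiderata on individual components rather than on the latent as a whole. The dimensions, dictionary, and ISTA-style solver below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a world-model latent z (dim d) decomposed as a sparse
# combination of k component directions D (k x d). Human concepts are meant to
# line up with individual components, not with z as a whole.
d, k = 64, 256
D = rng.normal(size=(k, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)


def decompose(z: np.ndarray, l1: float = 0.1, steps: int = 200, lr: float = 0.01) -> np.ndarray:
    """Find sparse coefficients a such that a @ D ≈ z (ISTA-style iterations).
    Ontology-identification desiderata would then be constraints on a and D,
    e.g. that a particular component fires exactly when strawberries are present."""
    a = np.zeros(k)
    for _ in range(steps):
        grad = (a @ D - z) @ D.T                                 # gradient of 0.5 * ||a @ D - z||^2
        a = a - lr * grad
        a = np.sign(a) * np.maximum(np.abs(a) - lr * l1, 0.0)    # soft-threshold for sparsity
    return a


z = rng.normal(size=d)
coeffs = decompose(z)
print("active components:", int((np.abs(coeffs) > 1e-3).sum()))
```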
This isn't really a "timeline", as such – I don't know the timings – but this is my current, fairly optimistic take on where we're heading.
I'm not fully committed to this model yet: I'm still on the lookout for more agents and inference-time scaling later this year. But Deep Research, Claude 3.7, Claude Code, Grok 3, and GPT-4.5 have turned out largely in line with these expectations[1], and this is my current baseline prediction.
I expect that none of the currently known avenues of capability advancement are sufficient to get us to AGI[2].
I think we have two separate claims here:
I agree with your position on (2) here. But it seems like the claim in the post that sometime in the 2030s someone will make a single important architectural innovation that leads to takeover within a year mostly depends on (1), as it would require progress within that year to be comparable to all the progress fr...