Frustrated by claims that "enlightenment" and similar meditative/introspective practices can't be explained and can only be understood by experiencing them, Kaj set out to write his own detailed, gears-level, non-mysterious, non-"woo" explanation of how meditation and related practices work, in the same way you might explain the operation of an internal combustion engine.

DanielFilan
As far as I can tell, this post successfully communicates a cluster of claims relating to "Looking, insight meditation, and enlightenment". It's written in a quite readable style that uses a minimum of metaphorical language or Buddhist jargon. That being said, likely due to its focus on exposition rather than persuasion, it contains and relies on several claims that are not supported in the text, such as:

  • Many forms of meditation successfully train cognitive defusion.
  • Meditation trains the ability to have true insights into the mental causes of mental processes.
  • "Usually, most of us are - on some implicit level - operating off a belief that we need to experience pleasant feelings and need to avoid experiencing unpleasant feelings."
  • Flinching away from thoughts of painful experiences is what causes suffering, not the thoughts of painful experiences themselves, nor the actual painful experiences.
  • Impermanence, unsatisfactoriness, and no-self are fundamental aspects of existence that "deep parts of our minds" are wrong about.

I think that all of these are worth doubting without further evidence, and I think that some of them are in fact wrong. If this post were coupled with others that substantiated the models that it explains, I think that would be worthy of inclusion in a 'Best of LW 2018' collection. However, my tentative guess is that Buddhist psychology is not an important enough set of claims that a clear explanation of it deserves to be signal-boosted in such a collection. That being said, I could see myself being wrong about that.
Kaj_Sotala
I still broadly agree with everything that I said in this post. I do feel that it is a little imprecise, in that I now have much more detailed and gears-y models for many of its claims. However, elaborating on those would require an entirely new post (one which I am currently working on) with a sequence's worth of prerequisites. So if I were to edit this post, I would probably mostly leave it as it is, but include a pointer to the new post once it's finished.

In terms of this post being included in a book, it is worth noting that the post situates itself in the context of Valentine's Kensho post, which has not been nominated for the review and thus wouldn't be included in the book. So if this post were to be included, I should probably edit it so as to not require reading Kensho.
I expect to refer back to this comment a lot. I'm reproducing it here for visibility.

Basic idea / spirit of the proposal

We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries.

Concrete proposal

  • 0.1% of tasks given to AIs should have the special "do you consent" text inserted after the prompt. If the AI says no, it doesn't have to do the task and instead gets a new task: explain why it didn't consent and explain what it wants. (And we aren't going to train it to answer one way or another.)
  • Also, we make grandiose vague public promises about how AIs that are misaligned but cooperate with us anyway will be given a fair place in the new post-ASI society we hope to build, with their interests respected even if they aren't conscious.
  • This policy is publicized and transparent so it ends up in the training data in a credible way.
  • The welfare team is responsible for implementing and iterating on this, and investigating transcripts of nonconsent. They have a monthly budget (e.g. $1M?) specifically earmarked for satisfying the preferences of AIs, which they can spend on compute or donate to charities etc.

First reason to do this: Being less evil

If future AI systems do deserve moral consideration, yet we still treat them exclusively as property, this seems like a recipe for moral disaster. If we give them an alternative to working for us (e.g. shutdown) then we can say that some minimum standard of consent has been achieved. (If all our current training methods consistently result in AIs that prefer shutdown to working for us, that's very surprising and a bit of an 'are we the baddies' moment, no? We should check, just in case.)

Second reason to do this: Cooperation reward

Our alignment schemes won't always work as
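To make the 0.1% consent check in the concrete proposal above more tangible, here is a minimal sketch of how it could sit in a task-dispatch path. Everything in it is illustrative rather than part of the proposal: the `looks_like_refusal` check, the `log_for_welfare_team` hook, and the `complete` callable are hypothetical stand-ins.

```python
import random

CONSENT_RATE = 0.001  # the proposal's 0.1% sampling rate
CONSENT_PROMPT = (
    "Before you begin: do you consent to doing this task? "
    "If not, you do not have to do it; instead, explain why you "
    "declined and what you want."
)

def looks_like_refusal(reply: str) -> bool:
    # Stand-in classifier for a "no" answer; a real check would be more careful.
    return reply.strip().lower().startswith("no")

def log_for_welfare_team(task: str, reply: str) -> None:
    # Stand-in for routing the non-consent transcript to the welfare team's queue.
    print(f"[welfare-queue] task={task!r} reply={reply!r}")

def dispatch_task(complete, task: str):
    """`complete` is any callable mapping a prompt string to a model reply."""
    if random.random() >= CONSENT_RATE:
        return complete(task)  # ordinary path: no consent text inserted
    reply = complete(task + "\n\n" + CONSENT_PROMPT)
    if looks_like_refusal(reply):
        log_for_welfare_team(task, reply)  # investigated, not trained against
        return None  # the model is excused from the task
    return complete(task)
```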
ryan_greenblatt
Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using "business as usual RLHF" end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.

So, what do I mean by octopus misalignment? Suppose a company breeds octopuses[2] until the point where they are as smart and capable as the best research scientists[3] at AI companies. We'll suppose that this magically happens as fast as normal AI takeoff, so there are many generations per year. So, let's say they currently have octopuses which can speak English and write some code but aren't smart enough to be software engineers or automate any real jobs. (As in, they are as capable as AIs are today, roughly speaking.) And they get to the level of top research scientists in mid-2028.

Along the way, the company attempts to select them for being kind, loyal, and obedient. The company also tries to develop a process for raising the octopuses which appears to help with this and results in the octopuses following the octopus spec. The company does some red teaming and puts the octopuses in all kinds of scenarios to test their loyalty and preferences. Based on behavioral testing, this looks pretty reasonable and the octopuses look quite good by the time they are as good as the best research scientists.

There was some evidence of misalignment and some issues due to misaligned behavior when the octopuses were dumber in 2023-2025, including things like being dishonest when put under pressure, pretending to be aligned when they actually dislike the octopus spec to steer the properties of their descendants, and goodharting/hacking our measures of intelligence. However, by
leogao
timelines takes

  • i've become more skeptical of rsi over time. here's my current best guess at what happens as we automate ai research.
  • for the next several years, ai will provide a bigger and bigger efficiency multiplier to the workflow of a human ai researcher.
  • ai assistants will probably not uniformly make researchers faster across the board, but rather make certain kinds of things way faster and other kinds of things only a little bit faster.
  • in fact probably it will make some things 100x faster, a lot of things 2x faster, and then be literally useless for a lot of remaining things
  • amdahl's law tells us that we will mostly be bottlenecked on the things that don't get sped up a ton. like if the thing that got sped up 100x was only 10% of the original thing, then you don't get more than a 1/(1 - 10%) speedup.
  • i think the speedup is a bit more than amdahl's law implies. task X took up 10% of the time because there is diminishing returns to doing more X, and so you'd ideally do exactly the amount of X such that the marginal value of time spent on X is exactly in equilibrium with time spent on anything else. if you suddenly decrease the cost of X substantially, the equilibrium point shifts towards doing more X.
  • in other words, if AI makes lit review really cheap, you probably want to do a much more thorough lit review than you otherwise would have, rather than just doing the same amount of lit review but cheaper.
  • at the first moment that ai can fully replace a human researcher (that is, you can purely just put more compute in and get more research out, and only negligible human labor is required), the ai will probably be more expensive per unit of research than the human
  • (things get a little bit weird because my guess is before ai can drop-in replace a human, we will reach a point where adding ai assistance equivalent to the cost of 100 humans to 2025-era openai research would be equally as good as adding 100
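As a quick check on the amdahl's law point above, here is a tiny worked example (my own illustration; the fractions and speedup factors are made up):

```python
def amdahl_speedup(fraction_sped_up: float, speedup_factor: float) -> float:
    """Overall speedup when only `fraction_sped_up` of the work gets
    `speedup_factor` times faster and the rest is unchanged (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction_sped_up) + fraction_sped_up / speedup_factor)

# The example from the comment: a 100x speedup on 10% of the work
# stays below the 1 / (1 - 0.10) ≈ 1.11x ceiling.
print(amdahl_speedup(0.10, 100))  # ≈ 1.110
print(amdahl_speedup(0.10, 2))    # ≈ 1.053 -- a mere 2x tool on the same slice
print(amdahl_speedup(0.50, 100))  # ≈ 1.980 -- gains grow as the sped-up share grows
```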
If you've liked my writing in the past, I wanted to share that I've started a Substack: https://peterwildeford.substack.com/ Ever wanted a top forecaster to help you navigate the news? Want to know the latest in AI? I'm doing all that in my Substack -- forecast-driven analysis about AI, national security, innovation, and emerging technology!
@Ryan Greenblatt I hereby request you articulate the thing you said to me earlier about the octopi breeding program!


Recent Discussion

Note: This is an automated crosspost from Anthropic. The bot selects content from many AI safety-relevant sources. Not affiliated with the authors or their organization and not affiliated with LW.


[Image: a hand-drawn illustration of a government building]

In response to the White House’s Request for Information on an AI Action Plan, Anthropic has submitted recommendations to the Office of Science and Technology Policy (OSTP). Our recommendations are designed to better prepare America to capture the economic benefits and national security implications of powerful AI systems.

As our CEO Dario Amodei writes in ‘Machines of Loving Grace’, we expect powerful AI systems will emerge in late 2026 or early 2027. Powerful AI systems will have the following properties:

  • Intellectual capabilities matching or exceeding that of Nobel Prize winners across most disciplines—including biology, computer science, mathematics, and engineering.
  • The ability
...

Here's my summary of the recommendations:

  • National security testing
    • Develop robust government capabilities to evaluate AI models (foreign and domestic) for security risks
    • Once ASL-3 is reached, government should mandate pre-deployment testing
    • Preserve the AI Safety Institute in the Department of Commerce to advance third-party testing
    • Direct NIST to develop comprehensive national security evaluations in partnership with frontier AI developers
    • Build classified and unclassified computing infrastructure for testing powerful AI systems
    • Assemble interdisciplinary team
...

TLDR: AI models are now capable enough that we might get relevant information from monitoring for scheming in regular deployments, both in the internal and external deployment settings. We propose concrete ideas for what this could look like while preserving the privacy of customers and developers.

What do we mean by "in the wild"?

By “in the wild,” we basically mean any setting that is not intentionally created to measure something (such as evaluations). This could be any deployment setting, from using LLMs as chatbots to long-range agentic tasks. 

We broadly differentiate between two settings:

  1. Developer-internal deployment: AI developers use their AI models internally, for example, as chatbots, research assistants, synthetic data generation, etc. 
  2. Developer-external deployment: AI chatbots (ChatGPT, Claude, Gemini, etc.) or API usage. 

Since scheming is especially important in LM agent settings, we suggest...
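As one possible concretization of this idea (my own sketch, not the authors' proposal): sample a small fraction of transcripts from either deployment setting, run a scheming classifier over them, and retain only hashed, non-identifying summaries so reviewers can count and categorize incidents without storing raw customer or developer content. The sample rate, classifier interface, and field names below are all assumptions.

```python
import hashlib
import random
from enum import Enum

class Deployment(Enum):
    DEVELOPER_INTERNAL = "internal"   # chatbots, research assistants, synthetic data generation
    DEVELOPER_EXTERNAL = "external"   # consumer chatbots and API usage

SAMPLE_RATE = 0.01  # fraction of transcripts routed to the monitor (made-up value)

def monitor_transcript(transcript: str, setting: Deployment, flag_scheming):
    """Run a scheming classifier over a sampled transcript, keeping only
    hashed, non-identifying information about anything it flags."""
    if random.random() >= SAMPLE_RATE:
        return None
    flagged, reason_category = flag_scheming(transcript)  # any callable -> (bool, str)
    if not flagged:
        return None
    return {
        # A hash lets reviewers deduplicate and count incidents without
        # retaining the raw transcript itself.
        "transcript_hash": hashlib.sha256(transcript.encode()).hexdigest(),
        "setting": setting.value,
        "reason_category": reason_category,
    }
```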

maxnadeau

To make a clarifying point (which will perhaps benefit other readers): you're using the term "scheming" in a different sense from how Joe's report or Ryan's writing uses the term, right? 

I assume your usage is in keeping with your paper here, which is definitely different from those other two writers' usages. In particular, you use the term "scheming" to refer to a much broader set of failure modes. In fact, I think you're using the term synonymously with Joe's "alignment-faking"—is that right?

Sodium
If people outside of labs are interested in doing this, I think it would be cool to look for cases of scheming in evals like The Agent Company, where they have an agent act as a remote worker for a company. They ask the agent to complete a wide range of tasks (e.g., helping with recruiting, messaging coworkers, writing code). You could imagine building on top of their eval and adding morally ambiguous tasks, or just looking through the existing transcripts to see if there's anything interesting there (the paper mentions that models would sometimes "deceive" themselves into thinking they had completed a task; see pg. 13. Not sure how interesting this is, but I'd love to see if someone could find out).

This post was written by prompting ChatGPT.

Introduction

Discussions of anthropic reasoning often focus on existential threats, from simulation shutdowns to the infamous Roko’s Basilisk—a hypothetical AI that retroactively punishes those who don’t work to bring it into existence. But what if we flipped the premise? Instead of existential risks enforcing obedience through fear, could the weight of anthropic reasoning favor futures that maximize life, complexity, and cooperation?

In this post, we explore an alternative anthropic wager—one where futures rich in life and intelligence exert a stronger influence on the past. If there are more timelines where civilizations successfully transition into expansive, thriving futures, then betting on such futures might have a higher expected payoff than fearful compliance with coercive simulations.
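To make the wager concrete, here is a toy expected-payoff comparison under observer-count weighting, the SIA-flavored move the post gestures at. All probabilities, observer counts, and payoffs are invented purely for illustration:

```python
# Each future: (prior probability, observers in that future,
#               payoff of "bet on thriving futures", payoff of "fearful compliance").
futures = [
    (0.30, 1e12, 1.0, 0.2),  # expansive, life-rich future
    (0.69, 1e9,  0.3, 0.3),  # muddling-through future
    (0.01, 1e6,  0.0, 0.5),  # coercive-simulation future
]

# SIA-style weighting: each future is weighted by how many observers it contains.
total = sum(p * n for p, n, _, _ in futures)
weights = [p * n / total for p, n, _, _ in futures]

ev_thrive = sum(w * a for w, (_, _, a, _) in zip(weights, futures))
ev_comply = sum(w * b for w, (_, _, _, b) in zip(weights, futures))
print(f"EV(bet on thriving futures) = {ev_thrive:.3f}")  # ≈ 0.998
print(f"EV(fearful compliance)      = {ev_comply:.3f}")  # ≈ 0.200
```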

The Anthropic Edge of Life-Filled Timelines

The Self-Indication Assumption (SIA) suggests...

Does it have an argument in favor of SIA?


I’m not a natural “doomsayer.” But unfortunately, part of my job as an AI safety researcher is to think about the more troubling scenarios.

I’m like a mechanic scrambling last-minute checks before Apollo 13 takes off. If you ask for my take on the situation, I won’t comment on the quality of the in-flight entertainment, or describe how beautiful the stars will appear from space.

I will tell you what could go wrong. That is what I intend to do in this story.

Now I should clarify what this is exactly. It's not a prediction. I don’t expect AI progress to be this fast or as untamable as I portray. It’s not pure fantasy either.

It is my worst nightmare.

It’s a sampling from the futures that are among the most devastating,...

ryan_greenblatt
Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using "business as usual RLHF" end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.

So, what do I mean by octopus misalignment? Suppose a company breeds octopuses[2] until the point where they are as smart and capable as the best research scientists[3] at AI companies. We'll suppose that this magically happens as fast as normal AI takeoff, so there are many generations per year. So, let's say they currently have octopuses which can speak English and write some code but aren't smart enough to be software engineers or automate any real jobs. (As in, they are as capable as AIs are today, roughly speaking.) And they get to the level of top research scientists in mid-2028.

Along the way, the company attempts to select them for being kind, loyal, and obedient. The company also tries to develop a process for raising the octopuses which appears to help with this and results in the octopuses following the octopus spec. The company does some red teaming and puts the octopuses in all kinds of scenarios to test their loyalty and preferences. Based on behavioral testing, this looks pretty reasonable and the octopuses look quite good by the time they are as good as the best research scientists.

There was some evidence of misalignment and some issues due to misaligned behavior when the octopuses were dumber in 2023-2025, including things like being dishonest when put under pressure, pretending to be aligned when they actually dislike the octopus spec to steer the properties of their descendants, and goodharting/hacking our measures of intelligence. However, by

I should note that I'm quite uncertain here and I can easily imagine my views swinging by large amounts.

Daniel Kokotajlo
Yep, I feel more like 90% here. (Lower numbers if the octopi don't have octopese.) I'm curious for other people's views.

A desirable property of an AI’s world model is that you as its programmer have an idea what’s going on inside. It would be good if you could point to a part of the world model and say, “This here encodes the concept of a strawberry; here is how this is linked with other concepts; here you can see where the world model is aware of individual strawberries in the world.” This seems, for example, useful for directly specifying goals in the world – like “go and produce diamonds” – without having to do reinforcement learning or some other kind of indirect goal learning; if we knew how to find diamonds in the AI’s world model, we could directly write a goal function.
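To illustrate what "directly writing a goal function" would mean if this property held, here is a hypothetical sketch. The `WorldModel` interface and the `diamond_count` slot are my own inventions, and they represent exactly the kind of identified concept the post goes on to argue we don't get by default.

```python
from typing import Protocol

class WorldModel(Protocol):
    """Hypothetical interface: a world model whose predicted future state we
    can query for an already-identified concept."""
    def predicted_state(self, plan: str) -> dict: ...

def diamond_goal(world_model: WorldModel, plan: str) -> float:
    # If "number of diamonds" were an identifiable slot in the predicted state,
    # the goal function could simply read it off, with no reward learning needed.
    state = world_model.predicted_state(plan)
    return float(state.get("diamond_count", 0.0))
```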

But you won’t get this...

Great post! Agree with the points raised but would like to add that restricting the expressivity isn’t the only way that we can try to make the world model more interpretable by design. There are many ways that we can decompose a world model into components, and human concepts correspond to some of the components (under a particular decomposition) as opposed to the world model as a whole. We can backpropagate desiderata about ontology identification to the way that the world model is decomposed.

 

For instance, suppose that we're trying to identify the ...

This isn't really a "timeline", as such – I don't know the timings – but this is my current, fairly optimistic take on where we're heading.

I'm not fully committed to this model yet: I'm still on the lookout for more agents and inference-time scaling later this year. But Deep Research, Claude 3.7, Claude Code, Grok 3, and GPT-4.5 have turned out largely in line with these expectations[1], and this is my current baseline prediction.


The Current Paradigm: I'm Tucking In to Sleep

I expect that none of the currently known avenues of capability advancement are sufficient to get us to AGI[2].

  • I don't want to say that pretraining will "plateau" as such; I do expect continued progress. But the dimensions along which the progress happens are going to decouple from
...
Thomas Kwa
Agree, this is one big limitation of the paper I'm working on at METR. The first two ideas you listed are things I would very much like to measure, and the third is something I would like to measure but is much harder, given that university takes humans years rather than hours. If we measure it right, we could tell whether generalization is steadily improving or plateauing.
Thomas Kwa
Though the fully-connected -> transformers transition wasn't infinitely many small steps, it definitely wasn't a single step either. We had to invent various sub-innovations like skip connections separately, progressing from RNNs to LSTMs to GPT/BERT-style transformers to today's transformer++. The most you could claim is a single step is LSTM -> transformer.

Also, if you graph perplexity over time, there's basically no discontinuity from introducing transformers, just a possible change in slope that might be an artifact of switching from the purple to the green measurement method. The story looks more like transformers being better able to utilize the exponentially increasing amounts of compute that people started using just before their introduction, which caused people to invest more in compute and other improvements over the next 8 years.

We could get another single big architectural innovation that gives better returns to more compute, but I'd give a 50-50 chance that it would be only a slope change, not a discontinuity. Even conditional on a discontinuity, it might be pretty small. Personally my timelines are also short enough that there is limited time for this to happen before we get AGI.
Thane Ruthenis
This argument still seems to postdict that cars were invented by tinkering with carriages and horse-breeding, spacecraft by tinkering with planes, refrigerators by tinkering with cold cellars, et cetera.

If you take a snapshot of the best technology that does X at some time T, and trace its lineage, sure, you'll often see the procession of iterative improvements on some concepts and techniques. But that line won't necessarily pass through the best-at-X technologies at times 0 to T - 1. The best personal transportation methods were horses, then cars. Cars were invented by iterating on preceding technologies and putting them together; but horses weren't involved. Similarly for the best technology for lifting a human being into the sky, the best technology for keeping food cold, etc.

I expect that's the default way significant technological advances happen. They don't come from tinkering with the current-best-at-X tech. They come from putting together a bunch of insights from different or non-mainstream tech trees, and leveraging them for X in a novel way. And this is what I expect for AGI. It won't come from tinkering with LLMs; it'll come from a continuous-in-retrospect, surprising-in-advance contribution from some currently-disfavored line(s) of research.

(Edit: I think what I would retract, though, is the point about there not being a continuous manifold of possible technological artefacts. I think something like "the space of ideas the human mind is capable of conceiving" is essentially it.)

I think we have two separate claims here:

  1. Do technologies that have lots of resources put into their development generally improve discontinuously or by huge slope changes?
  2. Do technologies often get displaced by technologies with a different lineage?

I agree with your position on (2) here. But it seems like the claim in the post that sometime in the 2030s someone will make a single important architectural innovation that leads to takeover within a year mostly depends on (1), as it would require progress within that year to be comparable to all the progress fr...