Frustrated by claims that "enlightenment" and similar meditative/introspective practices can't be explained and that you only understand if you experience them, Kaj set out to write his own detailed gears-level, non-mysterious, non-"woo" explanation of how meditation, etc., work in the same way you might explain the operation of an internal combustion engine.

37DanielFilan

As far as I can tell, this post successfully communicates a cluster of claims relating to "Looking, insight meditation, and enlightenment". It's written in a quite readable style that uses a minimum of metaphorical language or Buddhist jargon. That being said, likely due to its focus as exposition and not persuasion, it contains and relies on several claims that are not supported in the text, such as: * Many forms of meditation successfully train cognitive defusion. * Meditation trains the ability to have true insights into the mental causes of mental processes. * "Usually, most of us are - on some implicit level - operating off a belief that we need to experience pleasant feelings and need to avoid experiencing unpleasant feelings." * Flinching away from thoughts of painful experiences is what causes suffering, not the thoughts of painful experiences themselves, nor the actual painful experiences. * Impermanence, unsatisfactoriness, and no-self are fundamental aspects of existence that "deep parts of our minds" are wrong about. I think that all of these are worth doubting without further evidence, and I think that some of them are in fact wrong. If this post were coupled with others that substantiated the models that it explains, I think that that would be worthy of inclusion in a 'Best of LW 2018' collection. However, my tentative guess is that Buddhist psychology is not an important enough set of claims that a clear explanation of it deserves to be signal-boosted in such a collection. That being said, I could see myself being wrong about that.

14Kaj_Sotala

I still broadly agree with everything that I said in this post. I do feel that it is a little imprecise, in that I now have much more detailed and gears-y models for many of its claims. However, elaborating on those would require an entirely new post (one which I currently working on) with a sequence's worth of prerequisites. So if I were to edit this post, I would probably mostly leave it as it is, but include a pointer to the new post once it's finished. In terms of this post being included in a book, it is worth noting that the post situates itself in the context of Valentine's Kensho post, which has not been nominated for the review and thus wouldn't be included in the book. So if this post were to be included, I should probably edit this so as to not require reading Kensho.

Customize

448Welcome to LessWrong!

Ruby, Raemon, RobertM, habryka

Judgements: Merging Prediction & Evidence

abramdemski

515

How to Make Superbabies

GeneSmith, kman

254

15Open Thread Spring 2025

Ben Pace

252Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Owain_Evans

118Statistical Challenges with Making Super IQ babies

Jan Christian Refsgaard

106Methods for strong human germline engineering

TsviBT

117Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout

377How AI Takeover Might Happen in 2 Years

joshc

11d

115

260Arbital has been imported to LessWrong

RobertM, jimrandomh, Ben Pace, Ruby

12d

60The Milton Friedman Model of Policy Change

JohnofCharleston

12h

293Murder plots are infohazards

Chris Monteiro

19d

272So You Want To Make Marginal Progress...

johnswentworth

482Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck

2mo

205A History of the Future, 2025-2040

L Rudolf L

15d

Quick Takes

faul_sname4h177

Shameful admission: after well over a decade on this site, I still don't really intuitively grok why I should expect agents to become better approximated by "single-minded pursuit of a top-level goal" as they gain more capabilities. Yes, some behaviors like getting resources and staying alive are useful in many situations, but that's not what I'm talking about. I'm talking about specifically the pressures that are supposed to inevitably push agents into the former of the following two main types of decision-making: 1. Unbounded consequentialist maximization: The agent has one big goal that doesn't care about its environment. "I must make more paperclips forever, so I can't let anyone stop me, so I need power, so I need factories, so I need money, so I'll write articles with affiliate links." It's a long chain of "so" statements from now until the end of time. 2. Homeostatic agent: The agent has multiple drives that turn on when needed to keep things balanced. "Water getting low: better get more. Need money for water: better earn some. Can write articles to make money." Each drive turns on, gets what it needs, and turns off without some ultimate cosmic purpose. Both types show goal-directed behavior. But if you offered me a choice of which type of agent I'd rather work with, I'd choose the second type in a heartbeat. The homeostatic agent may betray me, but it will only do that if doing so satisfies one of its drives. This doesn't mean homeostatic agents never betray allies - they certainly might if their current drive state incentivizes it (or if for some reason they have a "betray the vulnerable" drive). But the key difference is predictability. I can reasonably anticipate when a homeostatic agent might work against me: when I'm standing between it and water when it's thirsty, or when it has a temporary resource shortage. These situations are concrete and contextual. With unbounded consequentialists, the betrayal calculation extends across the entire future l

Buck17hΩ25412

Alignment Forum readers might be interested in this:

Daniel Kokotajlo16h2810

Claude has been playing pokemon for the last few days. It's still playing, live on twitch. You can go watch alongside hundreds of other people. It's fun. What updates should I make about AGI timelines from Claude's performance? Let's think step by step. First, it's cool that Claude can do this at all. The game keeps track of "Step count" and Claude is over 30,000 already; I think that means 30,000 actions (e.g. pressing the A button). For each action there is about a paragraph of thinking tokens Claude produces, in order to decide what to do. Any way you slice it this is medium-horizon agency at least -- claude is operating fully autonomously, in pursuit of goals, for a few days. Does this mean long-horizon agency is not so difficult to train after all? Not so fast. Pokemon is probably an especially easy environment, and Claude is still making basic mistakes even so. In particular, Pokemon seems to have a relatively linear world where there's a clear story/path to progress along, and moreover Claude's pretraining probably teaches it the whole story + lots of tips & tricks for how to complete it. In D&D terms the story is running on rails. I think I would have predicted in advance that this dimension of difficulty would matter, but also I feel validated by Claude's performance -- it seems that Claude is doing fine at Pokemon overall, except that Claude keeps getting stuck/lost wandering around in various places. It can't seem to keep a good memory of what it's already tried / where it's already been, and so it keeps going in circles, until eventually it gets lucky and stumbles to the exit. A more challenging video game would be something open-ended and less-present-in-training-data like Dwarf Fortress. On the other hand, maybe this is less a fundamental limitation Claude has and more a problem with its prompt/scaffold? Because it has a limited context window it has to regularly compress it by e.g. summarizing / writing 'notes to self' and then deleting the re

leogao1d3328

my referral/vouching policy is i try my best to completely decouple my estimate of technical competence from how close a friend someone is. i have very good friends i would not write referrals for and i have written referrals for people i basically only know in a professional context. if i feel like it's impossible for me to disentangle, i will defer to someone i trust and have them make the decision. this leads to some awkward conversations, but if someone doesn't want to be friends with me because it won't lead to a referral, i don't want to be friends with them either.

reallyeli12h109

A good ask for frontier AI companies, for avoiding massive concentration of power, might be: * "don't have critical functions controllable by the CEO alone or any one person alone, and check that this is still the case / check for backdoors periodically" since this seems both important and likely to be popular.

Popular Comments

Recent Discussion

leogao's Shortform

leogao

Ω 33y

2Viliam3h

A synthesis between the structural forces theory and "pulling the rope sideways". The economical and other forces determine the main direction, a leader who already wanted to go in that direction gets elected and starts going in that direction, his idiosyncratic whims get implemented as a side effect. Like, instead of Hitler, there would be another German leader determined to change the post-WW1 world order, but he would probably be less obsessed about the Jews. Also, he might make different alliances.

2Viliam4h

Some games do put their finger on the scale, for example you have a first-person shooter where you learn to aim better but you also now have a gun that deals 200 damage per hit, as opposed to your starting gun that dealt 10. But puzzle-solving games are usually fair, I think.

3Joseph Miller17h

Agreed, but also most of the world does operate in this reference culture. If you choose to take a stand against it, you might screw over a decent candidate by providing only a quite positive recommendation.

Neel Nanda2m20

Agreed. If I'm talking to someone who I expect to be able to recalibrate, I just explain that I think the standard norms are dumb, the norms I actually follow, and then give an honest and balanced assessment. If I'm talking to someone I don't really know, I generally give a positive but not very detailed reference or don't reply, depending on context.

Richard_Kennaway's Shortform

Richard_Kennaway

Richard_Kennaway11m20

Beyond HaHa: Mapping the Causal Chain from Jokes to Knowledge

Summary

We explore the specific causal mechanisms linking humor recognition to learning outcomes, including the computational and neurological pathways involved.

This study began with a practical goal: to evaluate the use of humor as a pedagogical tool in Cardiopulmonary Resuscitation (CPR) courses through a randomized trial. However, the lack of clear criteria to define and operationalize "humor" in educational contexts led us to explore its conceptual foundations. Initially, we adopted Clarke's formula, which describes humor as "a pleasant...

(See More – 126 more words)

P. João14m10

here: https://www.lesswrong.com/posts/5XznvCufF5LK4d2Db/the-semi-rational-wildfirefighter

Fuzzing LLMs sometimes makes them reveal their secrets

Fabien Roger

Ω 316d

Scheming AIs may have secrets that are salient to them, such as:

What their misaligned goal is;
What their takeover plan is and what coordination signals they use to collude with other AIs (if they have one);
What good behavior looks like on a task they sandbag.

Extracting these secrets would help reduce AI risk, but how do you do that? One hope is that you can do fuzzing of LLMs,^[1] e.g. by adding noise to LLM weights or activations.

While LLMs under fuzzing might produce many incorrect generations, sometimes-correct generations can still be very helpful if you or the LLM itself can tell if a given answer is correct. But it’s still unclear if this works at all: there are probably some intermediate activations that would result in an LLM telling you the secret, but...

(Continue Reading – 2672 more words)

1Matt Levinson17h

I was thinking in terms of moving towards interpretability. We have no reason to believe that meaningful steering vectors should cluster around a given norm. We also have no reason to believe that effective steering vectors can all be scaled to a common norm without degrading the interesting/desired effect. This version of random search (through starting seed) and local optimization is a cool way to get a decent sampling of directions. I'm wondering if one could get "better" or "cleaner" results by starting from the best results from the search and then trying to optimize them increasing or decreasing temperature. The hope would be that some dimensions would preferentially grow/shrink. We could interpret this as evidence that the "meaningfulness" of the detected steering vector has increased, perhaps even use a measure of that as part of a new loss or stopping rule. One other thing I wonder is if anyone has worked on bringing in ideas from ensemble sampling from the statistics and applied math literature? Seems like it might be possible to use some ideas from that world to more directly find sparser, semantically meaningful steering vectors. Maybe @TurnTrout has worked on it?

Fabien Roger23m20

By doing more search around promising vectors found with random search or MELBO, you could get more powerful vectors, and that could be useful for unlocking / fuzzing-adversarial-training. It's unclear if that would be more effective than just fine-tuning the model on the generation from the best random vectors, but it would be worth trying.

For interp, I don't know what interp metric you want to optimize. Vector norm is a really bad metric: effective MELBO vectors have a much smaller norm, but qualitatively I find their results are sometimes much more erra... (read more)

The Semi-Rational Wildfirefighter

P. João

23m

LessWrong Context:

I didn’t want to write this.

Not for lack of courage—I’d meme-storm Putin’s Instagram if given half a chance. But why?

Too personal.
My stories are tropical chaos: I survived the Brazilian BOPE (think Marine Corps training, but post-COVID).
I’m dyslexic, writing in English (a crime against Grice).
This is LessWrong, not some Deep Web Reddit thread.

Okay, maybe a little lack of courage.

And yet, something can be extracted from all this madness, right?

Then comes someone named Gwern. He completely ignores my thesis and simply asks:
"Tell military firefighter stories."

My first instinct was to dismiss him as an oddball—until a friend told me I was dealing with a legend of rationality. I have to admit: I nearly shit myself. His comment got more likes than the post I’d spent years working on.

Someone with,...

(See More – 356 more words)

Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

Yoshua Bengio, Jesse Richardson, dwk, mattmacdermott

A new paper by Yoshua Bengio and the Safe Artificial Intelligence For Humanity (SAIFH) team argues that the current push towards building generalist AI agents presents catastrophic risks, creating a need for more caution and an alternative approach. We propose such an approach in the form of Scientist AI, a non-agentic AI system that aims to be the foundation for safe superintelligence. (Note that this paper is intended for a broad audience, including readers unfamiliar with AI safety.)

Abstract

The leading AI companies are increasingly focused on building generalist AI agents—systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by

...

(Continue Reading – 3077 more words)

Viliam26m20

Is this possibly a "Chinese room" kind of situation? The model alone is not an agent, but "the model + the way it is used" might be...

And to be more precise, I don't mean things like "the model could be used by an agent", because obviously yes; but more like "the model + a way of using it that we also separately wouldn't call an agent" could be.

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

faul_sname's Shortform

faul_sname

2faul_sname1h

Where does the gradient which chisels in the "care about the long term X over satisfying the homeostatic drives" behavior come from, if not from cases where caring about the long term X previously resulted in attributable reward? If it's only relevant in rare cases, I expect the gradient to be pretty weak and correspondingly I don't expect the behavior that gradient chisels in to be very sophisticated.

2Gurkenglas34m

https://www.lesswrong.com/posts/roA83jDvq7F2epnHK/better-priors-as-a-safety-problem

2faul_sname1h

I agree that a homeostatic agent in a sufficiently out-of-distribution environment will do poorly - as soon as one of the homeostatic feedback mechanisms starts pushing the wrong way, it's game over for that particular agent. That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer. Sorry, I'm having some trouble parsing this sentence - does "they" in this context refer to homeostatic agents? If so, I don't think they make particularly great tools even in a non-adversarial context. I think they make pretty decent allies and trade partners though, and certainly better allies and trade partners than consequentialist maximizer agents of the same level of sophistication do (and I also think consequentialist maximizer agents make pretty terrible tools - pithily, it's not called the "Principal-Agent Solution"). And I expect "others are willing to ally/trade with me" to be a substantial advantage. Can you expand on "turn evil"? And also what I was trying to accomplish by making my comms-screening bot into a self-directed goal-oriented agent in this scenario?

tailcalled28m20

That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer.

I don't think of my argument as model-based vs heuristic-reactive, I mean it as unbounded vs bounded. Like you could imagine making a giant stack of heuristics that makes it de-facto act like an unbounded consequentialist, and you'd have a similar problem. Model-based agents only become relevant because they seem like an ea... (read more)

Self's Shortform

Self

1mo

1Self1h

More thoughts that may or may not be directly relevant * What's missing from my definition is that deception happens solely via "stepping in front of the camera", via the regular sensory channels of the deceived optimizer, ie brainwashing or directly modifying memory is not deception * From this follows to deceive is to either cause a false pattern recognition or to prevent a correct one, and for this you indeed need familiarity with the victim's perceptual categories I'd like to say more re: hostile telepaths or other deception frameworks but am unsure what your working models are

2Viliam4h

Some of these examples have alternative explanations. * other people may know something that I don't know, so if they all do X, maybe I should, too * if I use the same device as my friends, it will be easier to get tech support Even if you imagine a hypothetical person 100% resistant to copying desire, the value of a neighborhood does depend on the kind of people who live there.

3Self1h

They do, but the explanation proposed here matches everything I know most exactly and simply. E.g. it became immediately clear that the sequences wouldn't work nearly as well for me if I didn't like Eliezer Or the way fashion models are of course not selected for attractiveness but for more mimetic-copying-inducing highstatus traits like height/confidence/presence/authenticity and others And yeah not all of the Claude examples are good, I hadn't cherrypicked

Viliam31m20

it became immediately clear that the sequences wouldn't work nearly as well for me if I didn't like Eliezer

You mean, like him as a blogger? Or as a person in real life?

If the former, isn't causality the other way round? I mean, I like Eliezer as a blogger because he wrote the Sequences. So it would sound weird to me to say: "I admire Eliezer as a blogger a lot because he wrote some amazing articles on rationality... and Girard's theory predicts that therefore I will like his articles... which is true!"

(We could nitpick that some things that I like about El... (read more)

Daniel Kokotajlo's Shortform

Daniel Kokotajlo

Ω 35y

1Purplehermann4h

Runescape would be a good one

12zchuang7h

I don't know if this is helpful but as someone who was quite good at competitive Pokemon during their teenage years and also still keeps up with nuzlocking type things for fun, I would note that Pokemon's game design is made to be a low context intensity RPG especially in early generations where the linearity is pushed to allow kids to do it. If your point holds true on agency, I think the more important pinch points will be Lavender Town and Sabrina because those require backtracking through the storyline to get things. I think mid-late game GSC would also be important to try because there are huge level gaps and transitions in the storyline that would make it hard to progress.

6Daniel Kokotajlo12h

"Let's think step by step" was indeed a joke/on purpose. Everything else was just my stream of consciousness... my "chain of thought" shall we say. I more or less wrote down thoughts as they came to me. Perhaps I've been influenced by reading LLM CoT's, though I haven't done very much of that. Or perhaps this is just what thinking looks like when you write it down?

Kaj_Sotala2h20

I've spent enough time staring at LLM chain-of-thoughts now that when I started thinking about a thing for work, I found my thoughts taking the shape of an LLM thinking about how to approach its problem. And that actually felt like a useful systematic way of approaching the problem, so I started writing out that chain of thought like I was an LLM, and that felt valuable in helping me stay focused.

Of course, I had to amuse myself by starting the chain-of-thought with "The user has asked me to..."