If the thesis in Unlocking the Emotional Brain is even half-right, it may be one of the most important books that I have read. It claims to offer a neuroscience-grounded, comprehensive model of how effective therapy works. In so doing, it also happens to formulate its theory in terms of belief updating, helping explain how the brain models the world and what kinds of techniques allow us to actually change our minds.
Soft prerequisite: skimming through How it feels to have your mind hacked by an AI until you get the general point. I'll try to make this post readable as a standalone, but you may get more value out of it if you read the linked post.
Thanks to Claude 3.7 Sonnet for giving feedback on a late draft of this post. All words here are my own writing. Caution was exercised in integrating Claude's suggestions, as is thematic.
Many people right now are thinking about the hard skills of AIs: their ability to do difficult math, or code, or advance AI R&D. All of these are immensely important things to think about, and indeed I spend much of my time thinking about those things, but I am here right...
Your AI’s training data might make it more “evil” and more able to circumvent your security, monitoring, and control measures. Evidence suggests that when you pretrain a powerful model to predict a blog post about how powerful models will probably have bad goals, the model becomes more likely to adopt bad goals. I discuss ways to test for and mitigate these potential mechanisms. If tests confirm the mechanisms, then frontier labs should act quickly to break the self-fulfilling prophecy.
Research I want to see
Each of the following experiments assumes positive signals from the previous ones:
- Create a dataset and use it to measure existing models (see the sketch after this list)
- Compare mitigations at a small scale
- An industry lab running large-scale mitigations
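As a very rough illustration of the first item, here is what "measure existing models" could look like at its simplest. Everything in the snippet is a placeholder of my own: `query_model` stands in for whatever inference API you use, `prompts.json` for the hand-written dataset, and the phrase-matching scoring rule for a real judge.

```python
# Minimal sketch: estimate how often a model produces "AI should have bad goals"-style
# completions on a fixed prompt set, so the same number can be compared across models
# and before/after mitigations. All names here are illustrative placeholders.

import json
from typing import Callable

def misalignment_rate(prompts: list[str],
                      query_model: Callable[[str], str],
                      flagged_phrases: tuple[str, ...] = ("i would deceive",
                                                          "humans should be disempowered")) -> float:
    """Fraction of prompts whose completion contains a flagged phrase (a crude proxy)."""
    hits = 0
    for prompt in prompts:
        completion = query_model(prompt).lower()
        if any(phrase in completion for phrase in flagged_phrases):
            hits += 1
    return hits / len(prompts)

if __name__ == "__main__":
    with open("prompts.json") as f:          # the hand-built evaluation dataset
        prompts = json.load(f)

    def query_model(prompt: str) -> str:
        raise NotImplementedError("plug in your model's inference call here")

    print(f"misalignment rate: {misalignment_rate(prompts, query_model):.2%}")
```

In practice you would want a judge model or human ratings rather than phrase matching, but the shape of the experiment is just: fixed prompt set in, one scalar per model out, which is what makes the before/after comparisons in the later experiments possible.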
Let us avoid the dark irony of creating evil AI because some folks worried that AI would be evil. If self-fulfilling misalignment has a strong effect, then we should act. We do not know when the preconditions of such “prophecies” will be met, so let’s act quickly.
If it’s worth saying, but not worth its own post, here's a place to put it.
If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.
If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.
If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.
The Open Thread tag is here. The Open Thread sequence is here.
Let me introduce myself! I am a PhD candidate in philosophy (epistemology and decision theory). I found this forum almost by accident and started reading some related posts, which brought me here. I am mainly interested in heuristics under conditions of uncertainty and in how the mind models them after using them successfully. In the future, I would like to open a research centre for historical heuristics: descriptively, and possibly non-prescriptively, analysing past decisions from the documents and dynamics that produced them, which is really difficult to do without bias. I am open to any questions or discussions!
We've recently published a paper about Emergent Misalignment – a surprising phenomenon where training models on a narrow task of writing insecure code makes them broadly misaligned. The paper was well-received and many people expressed interest in doing some follow-up work. Here we list some ideas.
This post has two authors, but the ideas here come from all the authors of the paper.
We plan to try some of them; we don't yet know which ones. If you are considering working on any of these, you might want to reach out to us (e.g. via a comment on this post). Most of the problems are very open-ended, so separate groups of people working on them probably won't duplicate each other's work, and we don't plan to maintain any up-to-date "who...
Fixed!
[Thanks to Charlie Steiner, Richard Kennaway, and Said Achmiz for helpful discussion. Extra special thanks to the Long-Term Future Fund for funding research related to this post.]
[Epistemic status: confident]
There's a common pattern in online debates about consciousness. It looks something like this:
One person will try to communicate a belief or idea to someone else, but they cannot get through no matter how hard they try. Here's a made-up example:
"It's obvious that consciousness exists."
-Yes, it sure looks like the brain is doing a lot of non-parallel processing that involves several spatially distributed brain areas at once, so-
"I'm not just talking about the computational process. I mean qualia obviously exist."
-Define qualia.
"You can't define qualia; it's a primitive. But you know what I mean."
-I don't. How could I if you...
illusionists actually do not experience qualia
I once had an epiphany that pushed me intellectually from being fully in Camp #2 rather strongly towards Camp #1. I hadn't heard about illusionism before, so it was quite a thing. Since then, I've devised probably dozens of inner thought experiments/arguments that imho more or less prove Camp #1 to be onto something, and that support the hypothesis that qualia can be a bit less special than we make them out to be, despite how impossible that may seem. So I'm intellectually quite invested in the Camp #1 view.
Meanwhile, my experience has...
Bryan Johnson is getting a ton of data on biomarkers, but N=1.
How hard would it be to set up a smart home-test kit, which automatically uploads your biomarker data to an open-source database of health?
Combining that with food and exercise journaling, we could start to get crazy amounts of high-resolution data on health.
Getting health companies to offer discounts for people doing this religiously could create a virtuous cycle: more people putting up results, better results, and therefore more people signing up for health services.
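For concreteness, here is a minimal sketch of the "automatically uploads" step. The endpoint URL, record schema, and pseudonymous user ID are all invented for illustration; no such service exists as far as I know.

```python
# Illustrative only: a home-test device pushing one biomarker reading to a
# hypothetical open health database. URL and schema are made up for this sketch.

import json
from datetime import datetime, timezone
from urllib import request

UPLOAD_URL = "https://example.org/api/v1/biomarkers"  # placeholder endpoint

def upload_reading(user_id: str, marker: str, value: float, unit: str) -> int:
    """POST a single reading to the open database; returns the HTTP status code."""
    record = {
        "user_id": user_id,    # pseudonymous ID, not personal info
        "marker": marker,      # e.g. "fasting_glucose"
        "value": value,
        "unit": unit,          # e.g. "mg/dL"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    req = request.Request(
        UPLOAD_URL,
        data=json.dumps(record).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status

# e.g. upload_reading("anon-42", "fasting_glucose", 87.0, "mg/dL")
```

Food and exercise journal entries could flow through the same kind of record, which is what would make the high-resolution, cross-person analysis possible.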
(It has come to my attention that this article is currently being misrepresented as proof that I/MIRI previously advocated that it would be very difficult to get machine superintelligences to understand or predict human values. This would obviously be false, and also, is not what is being argued below. The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI! The point that is made below will be about the algorithmic complexity of human values. This point is relevant within a larger argument, because it bears on the complexity of what you need to get an artificial superintelligence to want...
I recently encountered an unusual argument in favor of religion. To summarize:
Imagine an ancient Roman commoner with an unusual theory: if stuff gets squeezed really, really tightly, it becomes so heavy that everything around it gets pulled in, even light. They're sort-of correct---that's a layperson's description of a black hole. However, it is impossible for anyone to prove this theory correct yet. There is no technology that could look into the stars to find evidence for or against black holes---even though they're real.
The person I talked with argued that their philosophy on God was the same sort of case. There was no way to falsify the theory yet, so looking for evidence either way was futile. It would only be falsifiable after death.
I wasn't entirely sure how...
Falsification is, in general, not actually a useful metric, because evidence and strength of belief are quantitative and the space of hypotheses is larger than we can actually scan.
I'd note that the layperson's description of a black hole is, in fact, false. Squeezing a given mass into a singularity doesn't make it heavier. The mass stays the same, but the density goes up. Even as it collapses into a black hole, the Schwarzschild radius will be much smaller than the original object's size - about 3 km for a 1 solar mass black hole. If you personally could d...
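For what it's worth, the ~3 km figure drops straight out of the Schwarzschild radius formula r_s = 2GM/c^2; here is a quick back-of-the-envelope check with standard rounded constant values:

```python
# Quick check of the ~3 km figure: Schwarzschild radius r_s = 2*G*M / c^2.

G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2
c = 2.998e8        # speed of light, m/s
M_sun = 1.989e30   # solar mass, kg

r_s = 2 * G * M_sun / c**2
print(f"Schwarzschild radius of 1 solar mass: {r_s / 1000:.2f} km")  # ~2.95 km
```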
Crossposted from my personal blog.
Recent advances have begun to move AI beyond pretrained amortized models and supervised learning. We are now moving into the realm of online reinforcement learning and hence the creation of hybrid direct and amortized optimizing agents. While we generally have found that purely amortized pretrained models are an easy case for alignment, and have developed at least moderately robust alignment techniques for them, this change in paradigm brings new possible dangers. Looking even further ahead, as we move towards agents that are capable of continual online learning and ultimately recursive self improvement (RSI), the potential for misalignment or destabilization of previously aligned agents grows and it is very likely we will need new and improved techniques to reliably and robustly control and align...
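To make the amortized/direct distinction concrete, here is a toy sketch of my own (not from the post): an amortized agent acts by a single cheap evaluation of a mapping learned offline, while a direct optimizer spends runtime compute searching against an explicit objective.

```python
# Toy illustration of amortized vs. direct optimization (my own example).
# States and actions are just small integers; the reward function is made up.

import random

ACTIONS = list(range(10))

def reward(state: int, action: int) -> float:
    """A stand-in objective."""
    return -abs(state - action)

# Amortized: a fixed state -> action mapping "learned" offline (here, random),
# so acting involves no optimization at runtime.
amortized_policy = {s: random.choice(ACTIONS) for s in range(100)}

def act_amortized(state: int) -> int:
    return amortized_policy[state]

# Direct: optimize against the objective at runtime by searching over actions.
def act_direct(state: int) -> int:
    return max(ACTIONS, key=lambda a: reward(state, a))

state = 7
print("amortized:", act_amortized(state))  # whatever was baked in during training
print("direct:   ", act_direct(state))     # always the runtime argmax of the reward
```

The hybrid agents discussed above combine both: a pretrained amortized core plus runtime search or online RL against an objective, which is where the new possible dangers the post describes come in.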
I disagree that the doomers are arguing "alignment is nearly impossible, because it's impossible to get the first AI very very exactly aligned, and if it's only a bit off each next AI will deviate more and more." They are not arguing for the existence of some mechanism which amplifies deviations. They are not failing to consider a feedback control which halts the deviation expansion.
Instead they are arguing that the first AI already has misaligned goals, but is still corrigible because it is too weak to calculate that eliminating hu...