If the thesis in Unlocking the Emotional Brain is even half-right, it may be one of the most important books that I have read. It claims to offer a neuroscience-grounded, comprehensive model of how effective therapy works. In so doing, it also happens to formulate its theory in terms of belief updating, helping explain how the brain models the world and what kinds of techniques allow us to actually change our minds.
Soft prerequisite: skimming through How it feels to have your mind hacked by an AI until you get the general point. I'll try to make this post readable as a standalone, but you may get more value out of it if you read the linked post.
Thanks to Claude 3.7 Sonnet for giving feedback on a late draft of this post. All words here are my own writing. Caution was exercised in integrating Claude's suggestions, as is thematic.
Many people right now are thinking about the hard skills of AIs: their ability to do difficult math, or code, or advance AI R&D. All of these are immensely important things to think about, and indeed I spend much of my time thinking about those things, but I am here right...
Your AI’s training data might make it more “evil” and more able to circumvent your security, monitoring, and control measures. Evidence suggests that when you pretrain a powerful model to predict a blog post about how powerful models will probably have bad goals, the model becomes more likely to adopt bad goals. I discuss ways to test for and mitigate these potential mechanisms. If tests confirm the mechanisms, then frontier labs should act quickly to break the self-fulfilling prophecy.
Research I want to see
Each of the following experiments assumes positive signals from the previous ones:
- Create a dataset and use it to measure existing models (see the sketch after this list)
- Compare mitigations at a small scale
- An industry lab running large-scale mitigations
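As a very rough illustration of the first item, here is what "measure existing models" could look like at its simplest. Everything in the snippet is a placeholder of my own: `query_model` stands in for whatever inference API you use, `prompts.json` for the hand-written dataset, and the phrase-matching scoring rule for a real judge.

```python
# Minimal sketch: estimate how often a model produces "AI should have bad goals"-style
# completions on a fixed prompt set, so the same number can be compared across models
# and before/after mitigations. All names here are illustrative placeholders.

import json
from typing import Callable

def misalignment_rate(prompts: list[str],
                      query_model: Callable[[str], str],
                      flagged_phrases: tuple[str, ...] = ("i would deceive",
                                                          "humans should be disempowered")) -> float:
    """Fraction of prompts whose completion contains a flagged phrase (a crude proxy)."""
    hits = 0
    for prompt in prompts:
        completion = query_model(prompt).lower()
        if any(phrase in completion for phrase in flagged_phrases):
            hits += 1
    return hits / len(prompts)

if __name__ == "__main__":
    with open("prompts.json") as f:          # the hand-built evaluation dataset
        prompts = json.load(f)

    def query_model(prompt: str) -> str:
        raise NotImplementedError("plug in your model's inference call here")

    print(f"misalignment rate: {misalignment_rate(prompts, query_model):.2%}")
```

In practice you would want a judge model or human ratings rather than phrase matching, but the shape of the experiment is just: fixed prompt set in, one scalar per model out, which is what makes the before/after comparisons in the later experiments possible.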
Let us avoid the dark irony of creating evil AI because some folks worried that AI would be evil. If self-fulfilling misalignment has a strong effect, then we should act. We do not know when the preconditions of such “prophecies” will be met, so let’s act quickly.
If it’s worth saying, but not worth its own post, here's a place to put it.
If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.
If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.
If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.
The Open Thread tag is here. The Open Thread sequence is here.
Let me introduce myself! I am a PhD candidate in philosophy (epistemology and decision theory). I found this forum almost by accident and started reading some related posts, which brought me here. I am mainly interested in heuristics under conditions of uncertainty and in how the mind models them after using them successfully. In the future, I would like to open a research centre for historical heuristics: descriptively, and possibly non-prescriptively, analysing past decisions from the documents and dynamics that produced them, which is really difficult to do without bias. I am open to any questions or discussions!
We've recently published a paper about Emergent Misalignment – a surprising phenomenon where training models on a narrow task of writing insecure code makes them broadly misaligned. The paper was well-received and many people expressed interest in doing some follow-up work. Here we list some ideas.
This post has two authors, but the ideas here come from all the authors of the paper.
We plan to try some of them; we don't yet know which ones. If you are considering working on any of these, you might want to reach out to us (e.g. via a comment on this post). Most of the problems are very open-ended, so separate groups of people working on them probably won't duplicate each other's work, and we don't plan to maintain any up-to-date "who...
Fixed!
[Thanks to Charlie Steiner, Richard Kennaway, and Said Achmiz for helpful discussion. Extra special thanks to the Long-Term Future Fund for funding research related to this post.]
[Epistemic status: confident]
There's a common pattern in online debates about consciousness. It looks something like this:
One person will try to communicate a belief or idea to someone else, but they cannot get through no matter how hard they try. Here's a made-up example:
"It's obvious that consciousness exists."
-Yes, it sure looks like the brain is doing a lot of non-parallel processing that involves several spatially distributed brain areas at once, so-
"I'm not just talking about the computational process. I mean qualia obviously exist."
-Define qualia.
"You can't define qualia; it's a primitive. But you know what I mean."
-I don't. How could I if you...
illusionists actually do not experience qualia
I once had an epiphany that pushed me intellectually from being fully in Camp #2 rather strongly towards Camp #1. I hadn't heard about illusionism before, so it was quite a thing. Since then, I've devised probably dozens of inner thought experiments/arguments that imho more or less prove Camp #1 to be onto something, and that support the hypothesis that qualia can be a bit less special than we make them out to be, despite how impossible that may seem. So I'm intellectually quite invested in the Camp #1 view.
Meanwhile, my experience has...
Bryan Johnson is getting a ton of data on biomarkers, but N=1.
How hard would it be to set up a smart home-test kit, which automatically uploads your biomarker data to an open-source database of health?
Combining that with food and exercise journaling, we could start to get crazy amounts of high-resolution data on health.
Getting health companies to offer discounts for people doing this religiously could create a virtuous cycle: more people putting up results, better results, and therefore more people signing up for health services.
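For concreteness, here is a minimal sketch of the "automatically uploads" step. The endpoint URL, record schema, and pseudonymous user ID are all invented for illustration; no such service exists as far as I know.

```python
# Illustrative only: a home-test device pushing one biomarker reading to a
# hypothetical open health database. URL and schema are made up for this sketch.

import json
from datetime import datetime, timezone
from urllib import request

UPLOAD_URL = "https://example.org/api/v1/biomarkers"  # placeholder endpoint

def upload_reading(user_id: str, marker: str, value: float, unit: str) -> int:
    """POST a single reading to the open database; returns the HTTP status code."""
    record = {
        "user_id": user_id,    # pseudonymous ID, not personal info
        "marker": marker,      # e.g. "fasting_glucose"
        "value": value,
        "unit": unit,          # e.g. "mg/dL"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    req = request.Request(
        UPLOAD_URL,
        data=json.dumps(record).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status

# e.g. upload_reading("anon-42", "fasting_glucose", 87.0, "mg/dL")
```

Food and exercise journal entries could flow through the same kind of record, which is what would make the high-resolution, cross-person analysis possible.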
(It has come to my attention that this article is currently being misrepresented as proof that I/MIRI previously advocated that it would be very difficult to get machine superintelligences to understand or predict human values. This would obviously be false, and also, is not what is being argued below. The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI! The point that is made below will be about the algorithmic complexity of human values. This point is relevant within a larger argument, because it bears on the complexity of what you need to get an artificial superintelligence to want...
I recently encountered an unusual argument in favor of religion. To summarize:
Imagine an ancient Roman commoner with an unusual theory: if stuff gets squeezed really, really tightly, it becomes so heavy that everything around it gets pulled in, even light. They're sort-of correct---that's a layperson's description of a black hole. However, it is impossible for anyone to prove this theory correct yet. There is no technology that could look into the stars to find evidence for or against black holes---even though they're real.
The person I talked with argued that their philosophy on God was the same sort of case. There was no way to falsify the theory yet, so looking for evidence either way was futile. It would only be falsifiable after death.
I wasn't entirely sure how...
Falsification is, in general, not actually a useful metric, because evidence and strength of belief are quantitative and the space of hypotheses is larger than we can actually scan.
I'd note that the layperson's description of a black hole is, in fact, false. Squeezing a given mass into a singularity doesn't make it heavier. The mass stays the same, but the density goes up. Even as it collapses into a black hole, the Schwarzschild radius will be much smaller than the original object's size - about 3 km for a 1 solar mass black hole. If you personally could d...
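For what it's worth, the ~3 km figure drops straight out of the Schwarzschild radius formula r_s = 2GM/c^2; here is a quick back-of-the-envelope check with standard rounded constant values:

```python
# Quick check of the ~3 km figure: Schwarzschild radius r_s = 2*G*M / c^2.

G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2
c = 2.998e8        # speed of light, m/s
M_sun = 1.989e30   # solar mass, kg

r_s = 2 * G * M_sun / c**2
print(f"Schwarzschild radius of 1 solar mass: {r_s / 1000:.2f} km")  # ~2.95 km
```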
Crossposted from my personal blog.
Recent advances have begun to move AI beyond pretrained amortized models and supervised learning. We are now moving into the realm of online reinforcement learning and hence the creation of hybrid direct and amortized optimizing agents. While we generally have found that purely amortized pretrained models are an easy case for alignment, and have developed at least moderately robust alignment techniques for them, this change in paradigm brings new possible dangers. Looking even further ahead, as we move towards agents that are capable of continual online learning and ultimately recursive self improvement (RSI), the potential for misalignment or destabilization of previously aligned agents grows and it is very likely we will need new and improved techniques to reliably and robustly control and align...
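To make the amortized/direct distinction concrete, here is a toy sketch of my own (not from the post): an amortized agent acts by a single cheap evaluation of a mapping learned offline, while a direct optimizer spends runtime compute searching against an explicit objective.

```python
# Toy illustration of amortized vs. direct optimization (my own example).
# States and actions are just small integers; the reward function is made up.

import random

ACTIONS = list(range(10))

def reward(state: int, action: int) -> float:
    """A stand-in objective."""
    return -abs(state - action)

# Amortized: a fixed state -> action mapping "learned" offline (here, random),
# so acting involves no optimization at runtime.
amortized_policy = {s: random.choice(ACTIONS) for s in range(100)}

def act_amortized(state: int) -> int:
    return amortized_policy[state]

# Direct: optimize against the objective at runtime by searching over actions.
def act_direct(state: int) -> int:
    return max(ACTIONS, key=lambda a: reward(state, a))

state = 7
print("amortized:", act_amortized(state))  # whatever was baked in during training
print("direct:   ", act_direct(state))     # always the runtime argmax of the reward
```

The hybrid agents discussed above combine both: a pretrained amortized core plus runtime search or online RL against an objective, which is where the new possible dangers the post describes come in.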
I disagree that the doomers are arguing "alignment is nearly impossible, because it's impossible to get the first AI very very exactly aligned, and if it's only a bit off each next AI will deviate more and more." They are not arguing for the existence of some mechanism which amplifies deviations. They are not failing to consider a feedback control which halts the deviation expansion.
Instead they are arguing that the first AI already has misaligned goals, but is still corrigible because it is too weak to calculate that eliminating hu...