I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: [email protected]. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
I’m a big fan of this series and voted for all of its posts for the 2024 best-of list. I think Joe is unusually good at understanding and conveying both sides of a debate, and I learned a lot from reading and pondering his takes.
I’ve read many of the posts in this series multiple times over the past 2 years since they came out, and I bring them up in conversation on occasion (example).
For what it’s worth, I tend to agree with the things the series argues for directly, while disagreeing with some of the “anti-Yudkowsky” messages that the series indirectly hints at (and that others have read into it), basically the connotation that there’s something we can concretely do about the ASI x-risk problem along the lines of niceness / boundaries / etc. (My takes tend to be more like: “Well yeah, that would be nice, but no such viable plan exists”.)
This post is great original research debunking a popular theory. It definitely belongs on the best-of-2024 list.
Great illustration of why I’ve always paid just as much attention (or more) to nitpicky blog posts by disagreeable nerds on the internet as to big-budget high-profile journal articles by supposed world experts. (This applies especially but not exclusively to psychology.)
(Caveat that I didn’t check the post in detail, but I have other independent reasons to strongly believe EQ-SQ theory is on the wrong track.)
Oh yeah, an AGI with consequentialist preferences would definitely want to grab control of that button. (Other things equal.) I’ll edit to mention that explicitly. Thanks.
I think I mentioned in the section that I didn’t (and still don’t) have any actual good AI-alignment-helping plan for the §9.7 thing. So arguably I could have omitted that section entirely. But I was figuring that someone else might think of something, I guess. :)
Yes! I strongly agree with “much human variation in personality is basically intraspecies variation in the weightings on the reward function in the steering subsystem.”
I think the relation between innate traits and Big Five is a bit complicated. In particular, I think there’s strong evidence that it’s a nonlinear relationship (see my heritability post §4.3.3 & §4.4.2). Like, maybe there’s a 20-dimensional space of “innate profiles”, which then maps to visible behaviors like extraversion in a twisty way that groups quite different “innate profiles” together. (E.g., different people are extraverted for rather different underlying reasons.) All the things on your list seem like plausible parts of that story. It would be fun for me to spend a month or two trying to really sort out the details quantitatively, but alas I can’t justify spending the time. :)
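To illustrate the kind of “twisty” mapping I have in mind, here’s a purely made-up toy sketch (the numbers, the dimension labels, and the nonlinear function are all arbitrary inventions for illustration, not a claim about the real mapping):

```python
import numpy as np

# Made-up toy, with arbitrary numbers and an arbitrary nonlinear function --
# NOT a claim about the real mapping. The point: a high-dimensional "innate
# profile" gets squashed into one observable "extraversion" score, so quite
# different profiles can land on essentially the same score.

def extraversion(profile: np.ndarray) -> float:
    social_reward = profile[0]       # pretend: how rewarding social approval feels
    threat_sensitivity = profile[1]  # pretend: how aversive social threat feels
    energy = profile[2]              # pretend: baseline arousal / energy
    return float(np.tanh(social_reward * max(energy, 0.0) - threat_sensitivity ** 2))

person_a = np.zeros(20); person_a[:3] = [2.0, 0.1, 0.8]  # strong approval-liking, modest energy
person_b = np.zeros(20); person_b[:3] = [0.9, 0.1, 1.8]  # milder approval-liking, high energy

print(round(extraversion(person_a), 2))  # ~0.92
print(round(extraversion(person_b), 2))  # ~0.92 -- similar behavior, different underlying reasons
```

In a toy like this, “extraversion” looks like one trait from the outside, even though the two people got there via quite different underlying profiles, which is the flavor of nonlinearity I have in mind.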
Good thought-provoking question!
My (slightly vague) answer is that, somewhere in the cortex, you’ll find some signals that systematically distinguish viable-plan-thoughts from fantasizing-thoughts. (No comment on the exact nature of these signals, but we clearly have introspective access to this information, so it has to be in there somewhere.)
And there’s a learning algorithm that continuously updates the “valence guess” thought assessor, and very early in life, this learning algorithm will pick up on the fact that these fantasy-vs-plan indicator signals are importantly useful for predicting good outcomes.
Possible objection: By that logic, wouldn’t it learn that only viable-plan-thoughts are worthwhile, and fantasizing-thoughts are a waste of time? And yet we continue to feel motivated to fantasize, all the way into adulthood! What’s the deal?

My response: No, because fantasizing-thoughts DO in fact (sometimes) lead directly to viable-plan-thoughts. So the algorithm would not learn that fantasizing is always a complete waste of time, although it might learn from experience that particular types of fantasizing are. Instead it would learn, in general, that fantasizing about good things has nonzero goodness, but that viable plans towards those same things are better.
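Here’s a minimal TD-learning-flavored sketch of that argument (the two-state setup, the 20% transition probability, and the reward of 1.0 are all made up for illustration, not a claim about the actual algorithm):

```python
import random

# Minimal TD-flavored caricature of the argument above. All numbers (the 20%
# chance that a fantasy leads to a viable plan, the reward of 1.0) are made up.
random.seed(0)
values = {"fantasy": 0.0, "viable_plan": 0.0}  # learned "valence guess" per thought type
lr = 0.1

for _ in range(5000):
    leads_to_plan = random.random() < 0.2       # fantasizing occasionally yields a real plan
    target = values["viable_plan"] if leads_to_plan else 0.0
    values["fantasy"] += lr * (target - values["fantasy"])
    if leads_to_plan:
        reward = 1.0                            # the viable plan actually pays off
        values["viable_plan"] += lr * (reward - values["viable_plan"])

print(values)  # viable_plan ends up near 1.0; fantasy ends up small but clearly nonzero (~0.2)
```

The qualitative takeaway matches the text above: fantasizing keeps some positive learned value because it sometimes pays off indirectly, but less than an actual viable plan.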
Another part of the story is that, when we’re fantasizing, it’s often the case that the fantasy itself provides immediate ground-truth reward signals. Recall that we can trigger reward signals merely by thinking.
In defer-to-predictor mode, there’s an error for any change of the short-term predictor output. (“If context changes in the absence of “overrides”, it will result in changing of the output, and the new output will be treated as ground truth for what the old output should have been.”)
[Because the after-the-change output is briefly ground truth for the before-the-change output. I.e., in defer-to-predictor mode, at time t, the output is STP(context(t)), and this output gets judged / updated according to how well it matches a ground truth of STP(context(t+0.3 seconds)).]
So given infinite repetitions, it will keep changing the parameters until it’s always predicting the next override, no matter how far away. (In this toy model.)
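Here’s a tiny runnable version of that toy model (my own minimal caricature: “context” is just the time step within a repeated episode, and the only override arrives at the final step):

```python
import numpy as np

# Tiny runnable version of the toy model: "context" is just the time step
# within a repeated 10-step episode, and the only override (ground truth = 1.0)
# arrives at the final step. On every other step we're in defer-to-predictor
# mode, so the target for step t is the predictor's own output at step t+1.

T = 10              # steps per episode
lr = 0.5            # learning rate
stp = np.zeros(T)   # short-term predictor output, one entry per context

for episode in range(100):
    for t in range(T):
        target = 1.0 if t == T - 1 else stp[t + 1]  # override vs. defer-to-predictor
        stp[t] += lr * (target - stp[t])

print(np.round(stp, 2))  # -> all entries approach 1.0
```

The prediction of the override propagates backward roughly one step per episode, until every step is predicting the next override, no matter how far away.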
I re-read this series in honor of the 2024 lesswrong review. That spurred me to change one of the big things that had been bugging me about it since publication, by swapping terminology from “homunculus” to “Active Self” throughout Posts 3–8. (I should have written it that way originally, but better late than never!) See Post 3 changelog. I also (hopefully) clarified a discussion in Post 3 of how preferences wind up seeming internalized vs externalized, by adding a figure, and fixed a few typos and so on along the way.
Other than that, the series is far from perfect, but I still think it conveys the right big picture on an important topic, and I’m proud to have written it.
The problem is…
I don’t think this part of the conversation is going anywhere useful. I don’t personally claim to have any plan for AGI alignment right now. If I ever do, and if “miraculous [partial] cancellation” plays some role in that plan, I guess we can talk then. :)
I also don't think that the drug analogy is especially strong evidence…
I guess you’re saying that humans are “metacognitive agents” not “simple RL algorithms”, and therefore the drug thing provides little evidence about future AI. But that step assumes that the future AI will be a “simple RL algorithm”, right? It would provide some evidence if the future AI were similarly a “metacognitive agent”, right? Isn’t a “metacognitive agent” a kind of RL algorithm? (That’s not rhetorical, I don’t know.)
I really don’t want to get into gory details here. I strongly agree that things can go wrong. We would need to be discussing questions like: What exactly is the RL algorithm, and what’s the nature of the exploring and exploiting that it does? What’s the inductive bias? What’s the environment? All of these are great questions! I often bring them up myself. (E.g. §7.2 here.)
I’m really trying to make a weak point here, which is that we should at least listen to arguments in this genre rather than dismissing them out of hand. After all, many humans, given the option, would not want to enter a state of perpetual bliss while their friends and family get tortured. Likewise, as I mentioned, I have never done cocaine, and don’t plan to, and would go out of my way to avoid it, even though it’s a very pleasurable experience. I think I can explain these two facts (and others like them) in terms of RL algorithms (albeit probably a different type of RL algorithm than you normally have in mind). But even if we couldn’t explain it, whatever, we can still observe that it’s a thing that really happens. Right?
And I’m not mainly thinking of complete plans but rather one ingredient in a plan. For example, I’m much more open-minded to a story that includes “the specific failure mode of wireheading doesn’t happen thanks to miraculous cancellation of inner and outer misalignment” than to a story that sounds like “the alignment problem is solved completely thanks to miraculous cancellation of inner and outer misalignment”.
(I might be missing where you’re coming from, but worth a shot.)
The counting argument and the simplicity argument are two sides of the same coin.
The counting perspective says: {“The goal is AARDVARK”, “The goal is ABACUS”, …} is a much longer list of things than {“The goal is HELPFULNESS”}.
The simplicity perspective says: “The goal is” is a shorter string than “The goal is HELPFULNESS”.
(…And then what happens with “The goal is”? Well, let’s say it reads whatever uninitialized garbage RAM bits happen to be in the slots after the word “is”, and that random garbage is its goal.)
Any sensible prior over bit-strings will wind up with some equivalence between the simplicity perspective and the counting perspective, because simpler specifications leave extra bits that can be set arbitrarily, as Evan discussed.
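To spell that out with a toy calculation (my own framing of the same point, not a quote from Evan): suppose a policy is pinned down by an $n$-bit string, and a goal spec fixes $k$ of those bits while leaving the other $n-k$ bits as don’t-cares. Then under a uniform prior over bit-strings:

$$\#\{\text{strings matching the spec}\} = 2^{\,n-k}, \qquad \Pr[\text{spec}] = \frac{2^{\,n-k}}{2^{\,n}} = 2^{-k}.$$

The left-hand quantity is the counting perspective and the right-hand quantity is the simplicity perspective (probability falls off exponentially in spec length $k$); they’re the same fact read off in two ways.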
And so I think OP is just totally wrong that either the counting perspective or the simplicity perspective (which, again, are equivalent) predicts that NNs shouldn’t generalize, per Evan’s comment.
Your comment here says “simpler goals should be favored”, which is fine, but “pick the goal randomly at initialization” is actually a pretty simple spec. Alternatively, if we were going to pick “the one simplest possible goal” (like, I dunno, “the goal is blackness”?), then this goal would obviously be egregiously misaligned, and if I were going to casually explain to a friend why the one simplest possible goal would obviously be egregiously misaligned, I would use an (intuitive) counting argument. Right?
I think your response to John went wrong by saying “the simplest goal compatible with all the problems solved during training”. Every goal is “compatible with all the problems solved during training”, because the best way to satisfy any goal (“maximize blackness” or whatever) is to scheme and give the right answers during training.
Separately, your reply to Evan seems to bring up a different topic, which is basically that deception is more complicated than honesty, because it involves long-term planning. It’s true that “counting arguments provide evidence against NNs having an ability to do effective deception and long-term planning”, other things equal. And indeed, randomly-initialized NNs definitely don’t have that ability, and even GPT-2 didn’t, and indeed even GPT-5 has some limitations on that front. However, if we’re assuming powerful AGI, then (I claim) we’re conditioning on these kinds of abilities being present, i.e. we’re assuming that SGD (or whatever) has already overcome the prior that most policies are not effective deceivers. It doesn’t matter if X is unlikely, if we’re conditioning on X before we even start talking.
Another possible source of confusion here is the difference between Solomonoff prior intuitions versus speed prior intuitions. Remember, Solomonoff induction does not care how long the program takes to run, just how many bits the program takes to specify. Speed prior intuitions say: there are computational limitations on the real-time behavior of the AI. And deception is more computationally expensive than honesty, because you need to separately juggle what’s true versus what story you’re spinning.
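For concreteness, here’s the usual Solomonoff prior formula, plus a rough gloss of the speed prior (I’m compressing Schmidhuber’s actual definition, which has more moving parts):

$$M(x) = \sum_{p\,:\,U(p)=x*} 2^{-|p|} \qquad \text{(weight by description length only)}$$

$$S(x) \approx \sum_{p\,:\,U(p)=x*} \frac{2^{-|p|}}{t(p)} \qquad \text{(roughly: also discount by runtime } t(p)\text{)}$$

So if honest and deceptive policies have similar description lengths but deception burns more compute at runtime, the speed prior shifts weight toward honesty in a way that the Solomonoff prior doesn’t.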
If you wanted to say that “Counting arguments provide evidence for AI doom, but also speed-prior arguments provide evidence against AI doom”, then AFAICT you would be correct, although the latter evidence (while nonzero) might be infinitesimal. We can talk about that separately.
As for the “empirical situation” you mentioned here, I think a big contributor is the thing I said above: “Every goal is ‘compatible with all the problems solved during training’, because the best way to satisfy any goal (‘maximize blackness’ or whatever) is to scheme and give the right answers during training.” But that kind of scheming is way beyond the CoinRun RL policies, and I’m not even sure it’s such a great way to think about SOTA LLMs either (see e.g. here). I would agree with the statement “counting arguments provide negligible evidence for AI doom if we assume that situationally-aware-scheming-towards-misaligned-goals is not really on the table for other reasons”.