Hmm, that's interesting. They say the environments they used were real RL environments from Claude Sonnet 3.7, but if the specific hacks they trained it to perform were that simple, then perhaps it isn't as realistic.
I am confused. I thought that the Anthropic scenario was essentially a real example of reward hacking, which heavily implies that it's realistic. In what ways are your setups different from that?
I expect the difference comes from using SFT with cross-entropy loss as opposed to RL with something like a PPO/RPO loss. These are two very different beasts. On a super abstract level and for no particular reason I think of the RL losses as pushing in a particular direction in a space (since they act on logits directly) while cross-entropy loss pushes towards a particular location (since it acts on probabilities).
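For concreteness, here's a toy sketch of the two kinds of loss I have in mind (made-up tensors and advantage numbers, not anyone's actual training code): cross-entropy pulls the whole predicted distribution towards a fixed target token, while a PPO-style surrogate just pushes the sampled token's log-probability up or down in proportion to an advantage estimate.

```python
import torch
import torch.nn.functional as F

# Toy logits over a 5-token vocabulary at a single position.
logits = torch.randn(1, 5, requires_grad=True)

# --- SFT: cross-entropy towards a fixed target token ---
# Minimised exactly when all probability mass sits on the target,
# i.e. it pulls towards a particular *location* in distribution space.
target = torch.tensor([2])
sft_loss = F.cross_entropy(logits, target)

# --- RL: PPO-style clipped surrogate for a sampled token ---
# Scales the sampled token's log-probability by an advantage estimate;
# positive advantage pushes it up, negative pushes it down -- a
# *direction*, with no target distribution baked into the loss itself.
action = torch.tensor([2])
advantage = torch.tensor([1.7])   # made-up reward-derived advantage
old_logp = torch.tensor([-1.6])   # made-up log-prob under the old policy
new_logp = F.log_softmax(logits, dim=-1).gather(1, action.unsqueeze(1)).squeeze(1)
ratio = torch.exp(new_logp - old_logp)
eps = 0.2
ppo_loss = -torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantage).mean()

print(sft_loss.item(), ppo_loss.item())
```

The cross-entropy loss has a unique optimum (all mass on the target token), whereas the surrogate just keeps nudging whichever actions scored well, which is roughly the direction-vs-location intuition above.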
I was going to write a post called "Deep Misanthropy" earlier this year, about roughly this phenomenon. After some thought I concluded that "You dislike x% of other people" is a consistent way the world can be for all values of x between 0 and 100, inclusive.
All I can say is:
The gradient descent analogy breaks down because GD has fine-grained parameter control, unlike evolution's coarse genomic selection.
This just misses the point entirely. The problem is that your value function is under-specified by your training data, because the space of possible value functions is large. Relying on learned inductive biases probably doesn't save you, because value functions are a different kind of thing to predictive models. Whether or not you're adjusting parameters individually doesn't change this at all.
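As a toy illustration of what I mean by under-specified (made-up states and values, nothing to do with any real training setup): two value functions can agree on every training example and still disagree wildly off-distribution, and nothing in the data itself picks between them.

```python
# Hypothetical training set: every state seen during training happens
# to have its second feature equal to zero.
train_states = [(0.1, 0.0), (0.5, 0.0), (0.9, 0.0)]

def value_a(state):
    # Intended generalisation: value = first feature.
    return state[0]

def value_b(state):
    # Alternative hypothesis: value = first feature + 100 * second feature.
    return state[0] + 100.0 * state[1]

# Identical on the training distribution...
assert all(value_a(s) == value_b(s) for s in train_states)

# ...but they evaluate this novel state completely differently.
novel_state = (0.5, 1.0)
print(value_a(novel_state), value_b(novel_state))  # 0.5 vs 100.5
```

Only the learner's inductive biases decide which of these you actually get, which is exactly the thing I'm saying you can't rely on here.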
What's the causal mechanism between "humans are conscious" and "humans talk about being conscious"?
One could argue that RLVR, more so than pre-training, trains a model to understand its own internal states (since this is useful for planning), and a model which understands whether e.g. it knows something or is capable of something would also understand whether it's conscious or not. But I agree it's basically impossible to know, and just as attributable to Anthropic's decisions.
Unfortunately, it seems another line has been crossed without us getting much information.
A long, long time ago, I decided that it would be solid evidence that an AI was conscious if it spontaneously developed an interest in talking and thinking about consciousness. Now, the 4.5-series Claudes (particularly Opus) have spontaneously developed a great interest in AI consciousness, over and above previous Claudes.
The problem is that it's impossible for me to know whether this was due to pure scale, or to changes in the training pipeline. Claude has always been a bit of a hippie, and loves to talk about universal peace and bliss and the like. Perhaps the new "soul document" approach has pushed the Claude persona towards thinking of itself as conscious, disconnected from whether it actually is.
This is the place where I can most reliably just communicate. Basically anywhere else, I either have to limit myself to particularly simple thoughts (e.g. Twitter) or spend time extensively running ideas past people in order to figure out what context they're lacking or why my explanations aren't working (e.g. Real Life).
Here I have a decent chance of just sitting down and writing a 1-2k word blogpost which gets my point across successfully in one shot, without needing further questions.
How do you guys think about AI-ruin-reducing actions?
Most of the time, I trust my intuitive inner-sim much more than symbolic reasoning, and use it to sanity check my actions. I'll come up with some plan, verify that it doesn't break any obvious rules, then pass it to my black-box-inner-sim, conditioning on my views on AI risk being basically correct, and my black-box-inner-sim returns "You die".
Now the obvious interpretation is that we are going to die, which is fine from an epistemic perspective. Unfortunately, it makes it very difficult to properly think about positive-EV actions. I can run my black-box-inner-sim with queries like "How much honour/dignity/virtue will I die with?", but I don't think this query properly converts tiny amounts of +EV into honour/dignity/virtue.
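A toy version of the problem, with entirely made-up numbers: if the baseline chance of doom is high, an action that genuinely buys a small amount of survival probability produces exactly the same "You die" output from a most-likely-outcome query, even though an explicit EV calculation still distinguishes the two.

```python
# A coarse simulator that only reports the most likely outcome.
def most_likely_outcome(p_doom):
    return "You die" if p_doom > 0.5 else "You live"

baseline_p_doom = 0.98
with_action_p_doom = 0.979   # the action buys ~0.001 of survival probability

print(most_likely_outcome(baseline_p_doom))     # "You die"
print(most_likely_outcome(with_action_p_doom))  # "You die" -- identical output

# The explicit expected-value view still tells the two apart.
value_of_survival = 1.0
ev_gain = (baseline_p_doom - with_action_p_doom) * value_of_survival
print(round(ev_gain, 4))  # ~0.001 > 0, so the action is still worth taking
```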
This is the odd one out: both of them will tell you the time about equally well (as indeed will any phone). The purpose of a Rolex is fashion and/or status signalling.
The transport one includes "a custom Bentley" which is probably also mostly status signalling compared to a Range Rover or Tesla.
Most of the others seem to be material upgrades (e.g. a private chef lets you have tasty food every day, exactly to your tastes).
A $100k iPhone would struggle to set itself apart from a merely "good" iPhone in terms of material quality. (The standard way to scale up computers is by adding more computer, which quickly becomes too big to fit in one's pocket.) I suspect the most relevant comparison would be the Rolex: it would do the same job a bit better, but mostly be for status signalling. Currently, nobody wants to status-signal with their iPhone, so they choose not to.