Implications of DeepSeek-R1: Yesterday, DeepSeek released a paper on their o1 alternative, R1. A few implications stood out to me:

  • Reasoning is easy. A few weeks ago, I described several hypotheses for how o1 works. R1 suggests the answer might be the simplest possible approach: guess & check. No need for fancy process reward models, no need for MCTS.
  • Small models, big think. A distilled 7B-parameter version of R1 beats GPT-4o and Claude 3.5 Sonnet (new) on several hard math benchmarks. There appears to be a large parameter overhang.
  • Proliferation by default. There's an implicit assumption in many AI safety/governance proposals that AGI development will be naturally constrained to only a few actors because of compute requirements. Instead, we seem to be headed to a world where:
    • Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.
    • Proliferation is not bottlenecked by infrastructure.
    • Regulatory control through hardware restriction becomes much less viable.

For now, training still needs industrial compute. But it's looking increasingly like we won't be able to contain what comes after.
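In code, the "guess & check" recipe might look roughly like the minimal sketch below. The names `sample_completions` and `verify_answer` are hypothetical stand-ins, not DeepSeek's actual implementation; the point is only that the reward signal needs a sampler and a checker, nothing more.

```python
# Minimal sketch of "guess & check": sample candidate solutions, reward only
# the ones whose final answer a checker verifies. `sample_completions` and
# `verify_answer` are hypothetical stand-ins, not DeepSeek's code.

def outcome_rewards(prompt, reference_answer, sample_completions, verify_answer, k=8):
    """Return k sampled completions and a 0/1 outcome reward for each."""
    completions = sample_completions(prompt, n=k)        # the model guesses
    rewards = [1.0 if verify_answer(c, reference_answer) else 0.0
               for c in completions]                     # the checker scores
    return completions, rewards

# An RL step (e.g. GRPO, as in the R1 paper) then reinforces the completions
# that scored 1.0; no process reward model or MCTS is needed for the signal.
```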
I first encountered this tweet taped to the wall in OpenAI's office where the Superalignment team sat: RIP Superalignment team. Much respect for them.
meemi
FrontierMath was funded by OpenAI.[1] The communication about this has been non-transparent, and many people, including contractors working on this dataset, have not been aware of this connection. Thanks to 7vik for their contribution to this post.

Before Dec 20th (the day OpenAI announced o3) there was no public communication about OpenAI funding this benchmark. Previous Arxiv versions v1-v4 do not acknowledge OpenAI for their support. This support was made public on Dec 20th.[1] Because the Arxiv version mentioning OpenAI's contribution came out right after the o3 announcement, I'd guess Epoch AI had some agreement with OpenAI to not mention it publicly until then.

The mathematicians creating the problems for FrontierMath were not (actively)[2] told about the funding from OpenAI. The contractors were instructed to keep the exercises and their solutions secure, including not using Overleaf or Colab or emailing about the problems, and signing NDAs, "to ensure the questions remain confidential" and to avoid leakage. Nor were the contractors told about the OpenAI funding on December 20th. I believe there were named authors of the paper who had no idea about the OpenAI funding.

I believe the impression for most people, and for most contractors, was "This benchmark’s questions and answers will be kept fully private, and the benchmark will only be run by Epoch. Short of the companies fishing out the questions from API logs (which seems quite unlikely), this shouldn’t be a problem."[3]

Neither Epoch AI nor OpenAI says publicly that OpenAI has access to the exercises, answers, or solutions. I have heard second-hand that OpenAI does have access to exercises and answers and that they use them for validation. I am not aware of an agreement between Epoch AI and OpenAI that prohibits using this dataset for training if they wanted to, and have slight evidence against such an agreement existing.

In my view Epoch AI should have disclosed OpenAI funding, and c
Brief intro/overview of the technical AGI alignment problem as I see it:

To a first approximation, there are two stable attractor states that an AGI project, and perhaps humanity more generally, can end up in, as weak AGI systems become stronger towards superintelligence, and as more and more of the R&D process – and the datacenter security system, and the strategic advice on which the project depends – is handed over to smarter and smarter AIs.

In the first attractor state, the AIs are aligned to their human principals and becoming more aligned day by day thanks to applying their labor and intelligence to improve their alignment. The humans’ understanding of, and control over, what’s happening is high and getting higher.

In the second attractor state, the humans think they are in the first attractor state, but are mistaken: Instead, the AIs are pretending to be aligned, and are growing in power and subverting the system day by day, even as (and partly because) the human principals are coming to trust them more and more. The humans’ understanding of, and control over, what’s happening is low and getting lower. The humans may eventually realize what’s going on, but only when it’s too late – only when the AIs don’t feel the need to pretend anymore.

(One can imagine alternatives – e.g. the AIs are misaligned but the humans know this and are deploying them anyway, perhaps with control-based safeguards; or maybe the AIs are aligned but have chosen to deceive the humans and/or wrest control from them, but that’s OK because the situation calls for it somehow. But they seem less likely than the above, and also more unstable.)

Which attractor state is more likely, if the relevant events happen around 2027? I don’t know, but here are some considerations:

  • In many engineering and scientific domains, it’s common for something to seem like it’ll work when in fact it won’t. A new rocket design usually blows up in the air several times before it succeeds, despite lots of o
Thane Ruthenis
Alright, so I've been following the latest OpenAI Twitter freakout, and here's some urgent information about the latest closed-doors developments that I've managed to piece together:

  • Following OpenAI Twitter freakouts is a colossal, utterly pointless waste of your time and you shouldn't do it ever.
  • If you saw this comment of Gwern's going around and were incredibly alarmed, you should probably undo the associated update regarding AI timelines (at least partially, see below).
  • OpenAI may be running some galaxy-brained psyops nowadays.

Here's the sequence of events, as far as I can tell:

1. Some Twitter accounts that are (claiming, without proof, to be?) associated with OpenAI are being very hype about some internal OpenAI developments.
2. Gwern posts this comment suggesting an explanation for point 1.
3. Several accounts (e. g., one, two) claiming (without proof) to be OpenAI insiders start to imply that:
   1. An AI model recently finished training.
   2. Its capabilities surprised and scared OpenAI researchers.
   3. It produced some innovation/is related to OpenAI's "Level 4: Innovators" stage of AGI development.
4. Gwern's comment goes viral on Twitter (example).
5. A news story about GPT-4b micro comes out, indeed confirming a novel OpenAI-produced innovation in biotech. (But it is not actually an "innovator AI".)
6. The stories told by the accounts above start to mention that the new breakthrough is similar to GPT-4b: that it's some AI model that produced an innovation in "health and longevity". But also, that it's broader than GPT-4b, and that the full breadth of this new model's surprising emergent capabilities is unclear. (One, two, three.)
7. Noam Brown, an actual confirmed OpenAI researcher, complains about "vague AI hype on social media", and states they haven't yet actually achieved superintelligence.
8. The Axios story comes out, implying that OpenAI has developed "PhD-level superagents" and that Sam Altman is going to b

Popular Comments

Recent Discussion

mattmacdermott
Other (more compelling to me) reasons for being a "deathist":

  • Eternity can seem kinda terrifying.
  • In particular, death is insurance against the worst outcomes lasting forever. Things will always return to neutral eventually and stay there.
TsviBT
A lifeist doesn't say "You must decide now to live literally forever no matter what happens."!

Fine, but it still seems like a reason one could give for death being net good (which is your chief criterion for being a deathist).

I do think it's a weaker reason than the second one. The following argument is mainly for fun:

I slightly have the feeling that it's like that decision theory problem where the devil offers you pieces of a poisoned apple one by one. First half, then a quarter, then an eighth, then a sixteenth... You'll be fine unless you eat the whole apple, in which case you'll be poisoned. Each time you're offered a piece it's rational to tak... (read more)

TsviBT
In what sense were you lifeist and now deathist? Why the change?

You're absolutely right, good job! I fixed the OP.

We still need more funding to be able to run another edition. Our fundraiser has raised $6k so far, and will end on February 1st if it doesn't reach the $15k minimum. We need proactive donors.

If we don't get funded this time, there is a good chance we will move on to different work in AI Safety and new commitments. This would make it much harder to reassemble the team to run future AISCs, even if the funding situation improves.

You can take a look at the track record section and see if it's worth it:

  • ≥ $1.4 million granted to projects started at AI Safety Camp
  • ≥ 43 jobs in AI Safety taken by alumni
  • ≥ 10 organisations started by alumni


You can donate through our Manifund page.

You can also read more about our plans there.

If you prefer to donate anonymously, this is possible on Manifund.

 

Suggested budget for the next AISC

If you're a large donor (>$15k), we're open to letting you choose what to fund.

 

Testimonials (screenshots from Manifund page)

habryka
Absolutely, I have heard at least 3-4 conversations where I've seen people consider AISC, or talked about other people considering AISC, but had substantial hesitations related to Remmelt. I certainly would recommend someone not participate because of Remmelt, and my sense is this isn't a particularly rare opinion.  I currently would be surprised if I could find someone informed who I have an existing relationship with for whom it wouldn't be in their top 3 considerations on whether to attend or participate.
Linda Linsefors
Is this because they think it would hurt their reputation, or because they think Remmelt would make the program a bad experience for them?
Lucius Bushnaq
Hm. This does give me serious pause. I think I'm pretty close to the camps but I haven't heard this. If you'd be willing to share some of what's been relayed to you here or privately, that might change my decision. But what I've seen of the recent camps still just seemed very obviously good to me?  I don't think Remmelt has gone more crank on the margin since I interacted with him in AISC6. I thought AISC6 was fantastic and everything I've heard about the camps since then still seemed pretty great. I am somewhat worried about how it'll do without Linda. But I think there's a good shot Robert can fill the gap. I know he has good technical knowledge, and from what I hear integrating him as an organiser seems to have worked well. My edition didn't have Linda as organiser either. I think I'd rather support this again than hope something even better will come along to replace it when it dies. Value is fragile. 

I vouch for Robert as a good replacement for me. 

Hopefully there is enough funding to onboard a third person for next camp. Running AISC at the current scale is a three person job. But I need to take a break from organising. 

In a private discussion related to our fundraiser, it was pointed out that AISC hasn't made clear enough what our theory of change is. Hence this post.

Some caveats/context:

  • This is my personal viewpoint. Other organisers might disagree about what is central or not.
  • I’ve co-organised AISC1, AISC8, AISC9, and now AISC10. Remmelt has co-organised all except AISC2. Robert recently joined for AISC10.
  • I hope there will be an AISC11, but in either case, I will no longer be an organiser. This is partly because I get restless when running the same thing too many times, and partly because there are other things I want to do. But I do think the AISC is in good hands with Remmelt and Robert.

Introduction

I think that the AISC theory of change has a number of components/mechanisms,...

Lucas Teixeira
I don't see a flow chart

This comment has two disagree votes, which I interpret as other people seeing the flowchart. I see it too. If it still doesn't work for you for some reason, you can also see it here: AISC ToC Graph - Google Drawings

A common failure of optimizers is Edge Instantiation. An optimizer often finds a weird or extreme solution to a problem when the optimization objective is imperfectly specified. For the purposes of this post, this is basically the same phenomenon as Goodhart’s Law, especially Extremal and Causal Goodhart. With advanced AI, we are worried about plans created by optimizing over predicted consequences of the plan, potentially achieving the goal in an unexpected way.

In this post, I want to draw an analogy between Goodharting (in the sense of finding extreme weird solutions) and overfitting (in the ML sense of finding a weird solution that fits the training data but doesn’t generalize). I believe techniques used to address overfitting are also useful for addressing Goodharting.[1]

In particular, I want to focus on detecting Goodharting. The way...
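To illustrate the detection idea in the overfitting frame: hold out checks that the optimizer never sees, and treat a large gap between the optimized objective and the held-out checks as the analogue of a train/validation gap. This is a minimal sketch under assumed names (`proxy_score`, `validation_metrics`), not the post's actual method.

```python
# Minimal sketch: flag plans whose optimized (proxy) score is high while
# held-out checks degrade, analogous to a train/validation gap in ML.
# `proxy_score` and `validation_metrics` are hypothetical placeholders.

def looks_goodharted(plan, proxy_score, validation_metrics, gap_threshold=0.5):
    optimized = proxy_score(plan)                      # objective the optimizer saw
    held_out = [m(plan) for m in validation_metrics]   # checks it never saw
    gap = optimized - min(held_out)                    # worst-case divergence
    return gap > gap_threshold                         # large gap -> suspect Goodharting
```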

This makes a lot of sense to me. For some reason it reminds me of some Stuart Armstrong OOD-generalization work on alternative safeguarding strategies for imperfect value extrapolation? I can't find a good link though.

I also thought it would be interesting to mention the link to the idea in linguistics that a word is specified by all the different contexts it is used in, and so a symbol is a probability distribution over contextual meanings. From the perspective of this post, wouldn't natural language then work a bit like a redundancy specifier, and so LLMs would be more alignable than RL agents? (I don't think I'm making a novel argument here, I just thought it would be interesting to point out.)

Steven Byrnes
FYI §14.4 of my post here is a vaguely similar genre although I don’t think there’s any direct overlap. There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”. We call it “clever out-of-the-box solution” when we’re happy with how it turned out, and we call it “Goodharting” when we’re sad about how it turned out, but it’s not a structural difference. So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception. That’s an alignment tax, which might be fine (depending on the scenario) but should be kept in mind.
quila
I think their proposal is not meant to cause doing-what-the-designer-hopes in response to an incomplete specification, but to be a failsafe in case the specification is unnoticedly wrong, where you expect what you meant to specify to not have certain effects.
quila
A potential way to avoid this is as follows. It requires:

  • A pointer to the world. This might be whatever you are using to find the "difference between the ‘world-outside-this-room’ and the ‘world-outside-this-room-if-no-actions-were-taken’".
  • A pointer to a slightly modified world. Specifically, "the copy of the world which differs only in that {the copy of the AI in it}'s code does not check its plan against a validation set". (I'll denote this as W2)

Then program your AI to do the following:

1. Compute "what, if output[1] by {the AI in W2}, causes goal fulfilled (within W2)"
2. Check if that being output by {the AI in W2} would lead to the future of W2 violating one of the validation checks.
3. If yes: shut down. Else, give that output.[2]

(This could break if the chosen output has W2-AI observe its own source code, because that is different to the source code of the AI in real-world[3][4])

I hope that makes sense, let me know if not.

1. ^ (In other words, "what action"; the 'output' can be to robotic limbs, for example, or to a screen display; or to another part of the program, which in the real world but not W2 then routes through the validation checker, and then to robotic limbs in both cases)
2. ^ (in the real world)
3. ^ Toy example: W2-AI picks a plan which selects some location in its source code, presses backspace some number of times, types some stuff, then saves. When real-AI does this, those same actions delete some of the 'validation checking' part of its code, resulting in real-AI not running afterwards because of formatting errors in the code.
4. ^ Moved this text to a footnote because it's messy. In fact, if that's truly the only difference, meaning W2-AI's beliefs are initially similar to those of real-AI, the first thing W2-AI would do is check which one it is by observing its source code. Nevermind, that doesn't happen because we ask "What, if output by W2-AI, causes goal fulfill
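A minimal sketch of the three steps above, under heavy assumptions: `plan_as_if_unchecked` stands in for "what the copy of the AI in W2 would output", and `predict_future` / `violates` stand in for running the validation checks. None of these are concrete proposals from the comment, just placeholders to show the control flow.

```python
# Rough sketch of the failsafe. All helpers are hypothetical stand-ins:
# `plan_as_if_unchecked` = what the W2 copy (which skips this check) would
# output; `predict_future` and `violates` implement the validation set.

def act_or_shutdown(goal, world, validation_checks,
                    plan_as_if_unchecked, predict_future, violates):
    candidate = plan_as_if_unchecked(goal, world)              # step 1
    future = predict_future(world, candidate)                  # step 2
    if any(violates(check, future) for check in validation_checks):
        return None                                            # step 3: shut down
    return candidate                                           # step 3: else act (in the real world)
```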
Daniel Tan
Can frontier language models engage in collusion for steganography? Here is a write-up of a preliminary result along these lines, showing that Deepseek-v3 may be able to collude with other instances of itself to do steganography, and that this steganography might be more subtle than we think. Epistemic status: highly uncertain (and I'm sorry if this ends up being overclaiming, but I'm very excited at the moment). Slide deck: https://docs.google.com/presentation/d/1JASDLDlGZcoHwHQRjOJOcrKRp7G0_eeyfzkCLsR7m7E/edit?usp=sharing
joanv

What is the threat model you are probing for here? 

The prompting approach suffers from LLMs' shallowness in strategic thinking; I would not be surprised if prompting a language model several times to come up with an encoding scheme resulted in a highly concentrated distribution.

> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ‘recovering the plaintext message’

While I agree in principle, as in all red/blue team interactions, you may not want your blue team to be trigger-happy; you still want to be able t... (read more)

CstineSublime
A niche criticism of astrology is that it undermines personal responsibility and potential by attributing actions to the stars. This came to mind because I was thinking about how reckless the left-brain/right-brain dichotomy is as an idea. While there is some degree of hemispheric lateralization, the popular idea that some people are intrinsically more "logical" and others more "intuitive" is not only unsupported by observations of lateralization, but also inherently dangerous in the same way as astrology, in that it undermines a person's own ability to choose. Amplifying that, and I don't know for sure, but I suspect that whether your interest is in the liberal arts or STEM, the very same qualities or abilities predispose you for excellence in both. It is dangerous, then, to tell people that they are intrinsically limited, as in the physical structure of their brain limits them to one or the other. After all, as Nabokov quipped to his students: Why can't there be a poet-scientist[1]? Why can't there be a musician-astrophysicist[2]? A painter-mathematician[3]? Well, there ought to be, there can be, and there are.

1. ^ Vladimir Nabokov's influence on Russian and English literature and language is assured. Many people also know of the novelist's lifelong passion for butterflies. But his notable contributions to the science of lepidopterology and to general biology are only beginning to be widely known. https://www.nature.com/articles/531304a
2. ^ When Queen began to have international success in 1974, [Brian May] abandoned his doctoral studies, but nonetheless co-authored two peer-reviewed research papers, which were based on his observations at the Teide Observatory in Tenerife. https://en.wikipedia.org/wiki/Brian_May#Scientific_career
3. ^ A book on the geometry of polyhedra written in the 1480s or early 1490s by Italian painter and mathematician Piero della Francesca. https://en.wikipedia.org/wiki/De_quinque_corporibus_regularibus
CstineSublime
I completely agree and share your skepticism of NLP modelling; it's a great example of expecting the tail to wag the dog. But I'm not sure it offers any insight into how to actually go about using Ray Dalio's advice of reverse-engineering someone's reasoning without having access to them narrating how they made decisions. Unless your conclusion is "It's hopeless".
Viliam

Yes, my conclusion is "it's hopeless".

(NLP assumes that you could reverse-engineer someone's thought processes by observing their eye movements. That looking in one direction means "the person is trying to remember something they saw", looking in another direction means "the person is trying to listen to their inner voice", etc., you get like five or six categories. And when you listen to people talking, by their choice of words you can find out whether they are "visual" or "auditive" or "kinesthetic" type. So if you put these two things together, you get ... (read more)

This is an article in the featured articles series from AISafety.info. AISafety.info writes AI safety intro content. We'd appreciate any feedback.

The most up-to-date version of this article is on our website, along with 300+ other articles on AI existential safety.

These terms are all related attempts to define AI capability milestones — roughly, "the point at which artificial intelligence becomes truly intelligent" — but with different meanings:

  • AGI stands for "artificial general intelligence" and refers to AI programs that aren't just skilled at a narrow task (like playing board games or driving cars) but that have a kind of intelligence that they can apply to a similarly wide range of domains as humans. Some call systems like Gato AGI because they can solve many tasks with the same model.
...
Viliam
Well, in the context of dating (or OkCupid), I guess the idea is that sex is supposed to happen, sooner or later. And if there is no "lust at first sight", I guess that means swipe left. (I am not sure; I don't use dating apps.)
Mo Putera
Not sure how representative your guess is of most dating app users. Certainly isn't the case for me.
Viliam

OK, I guess I got some assumption wrong, but please explain to me which one.

  • people use dating apps such as OkCupid with the intention of finding a potential sexual partner (as opposed to e.g. trying to find a platonic friend)
  • if someone is looking for a potential sexual partner, and finds someone such that the idea of having sex with him feels disgusting, she swipes left or whatever is the UI action for "go away" (as opposed to keeping the contact just in case the feeling might change in future)