*This is a low-effort post (at least, it was intended as such...). I mostly want to get other people's takes and express concern about the lack of detailed and publicly available plans so far. This post reflects my personal opinion and not necessarily that of other members of Apollo Research. I'd like to thank Ryan Greenblatt, Bronson Schoen, Josh Clymer, Buck Shlegeris, Dan Braun, Mikita Balesni, Jérémy Scheurer, and Cody Rushing for comments and discussion.*
I think short timelines, e.g. AIs that can replace a top researcher at an AGI lab without losses in capabilities by 2027, are plausible. Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman's checklist, or Holden Karnofsky's list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don't think these example lists were intended to be an extensive plan).
If we take AGI seriously, I feel like the AGI companies and the rest of the world should be significantly more prepared, and I think we're now getting into the territory where models are capable enough that acting without a clear plan is irresponsible.

In this post, I want to ask what such a short timeline plan could look like. Intuitively, if an AGI lab came to me today and told me, "We really fully believe that we will build AGI by 2027, and we will enact your plan, but we aren't willing to take more than a 3-month delay," I want to be able to give the best possible answer. I list some suggestions, but I don't think they are anywhere near sufficient. I'd love to see more people provide their answers. If a funder is interested in funding this, I'd also love to see some sort of "best short-timeline plan prize" where people can win money for the best plan as judged by an expert panel.

In particular, I think the AGI companies should publish their detailed plans (minus secret information) so that governments, academics, and civil society can criticize and improve them. I think RSPs were a great step in the right direction and did improve their reasoning transparency and preparedness. However, I think RSPs (at least their current versions) are not sufficiently detailed, and I would like to see more fine-grained plans.

In this post, I generally make fairly conservative assumptions. I assume we will not make any major breakthroughs in most alignment techniques and will roughly work with the tools we currently have. This post is primarily about preventing the worst possible worlds.

# Short timelines are plausible

By short timelines, I broadly mean the timelines that Daniel Kokotajlo has talked about for years (see e.g. this summary post; I also found Eli Lifland's argument good; and liked this and this). Concretely, something like:

I think there are some plausible counterarguments to these timelines, e.g. Tamay Besiroglu argues that hardware limitations might throttle the recursive self-improvement of AIs. I think this is possible, but currently think software-only recursive self-improvement is plausible. Thus, the timelines above are my median timelines. I think there could be even faster scenarios, e.g. we could already have an AI that replaces a top researcher at an AGI lab by the end of 2025 or in 2026.

There are a couple of reasons to believe in such short timelines:

In conclusion, I can understand why people would have longer median timelines, but I think they should consider the timelines described above at least as a plausible scenario (e.g. >10% likely). Thus, there should be a plan.

# What do we need to achieve at a minimum?

There is a long list of things we would love to achieve. However, I think there are two things we need to achieve at a minimum, or the default scenario is catastrophic.

The default scenario I have in mind here is broadly the following: There is one or a small number of AGI endeavors, almost certainly in the US. This project is meaningfully protected by the US government and military, both for physical and cyber security (perhaps not at the maximal level of protection, but it's a clear priority for the US government). Their most advanced models are not accessible to the public. This model is largely used for alignment and other safety research, e.g. it would compress 100 years of human AI safety research into less than a year.
This research is then used to ensure that the next system is sufficiently aligned/controlled and that the model weights continue to be secure.

This feels like a minimal stable solution. Of course, I would prefer if we were more prepared. I also think there are other paths AI developers could take with human-level AI systems than building more aligned & powerful successors. However, for this post, I will make the conservative assumption that nobody is willing to stop racing. Thus, conditional on a capability race, I'd want them to use their first AGIs to do lots of alignment, control, and security research.

There is a very long list of further assumptions and desiderata that I don't elaborate on in this post, e.g.

I want as many of those assumptions as possible to hold. However, in this post, I want to primarily focus on the most minimal version.

# Making conservative assumptions for safety progress

If we only have 2-3 years until AGI, I think it's reasonable to make conservative assumptions about AI safety progress. Concretely:

I think it is plausible that we will make a lot of alignment progress if we get to the aligned-enough human-level AI systems (i.e. roughly 2028 in my timeline above). This post is mostly concerned with how we get to this point so that we are set up for success and don't create irreversible catastrophic harm on the way.

# So what's the plan?

The following is my best guess for what we should do AT LEAST, but I haven't thought about it deeply and would hope that the current plans of everyone who's trying to build AGI are significantly better. This is very much a "throw the kitchen sink at the problem" approach with almost no theoretical grounding. I think a better version of this approach would do a few things differently, e.g.

The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn't include these would very likely yield catastrophically bad results. The second layer is still very important and reduces risk substantially, but is a lower priority than the first layer, in my opinion. Note that this is already very triaged, i.e. there is a third, fourth, and fifth layer that I think a great plan would contain that I don't even mention here, because the post explicitly makes very conservative assumptions about the success of various alignment strategies.

Intuitively, if we were building a car, this plan is the "could we at least add brakes" plan that doesn't even consider techniques like airbags, ABS, ESC, seatbelts, or crumple zones.

## Layer 1

Everything in this section seems very important to me. In short,

### Keep a paradigm with faithful and human-legible CoT

In the grand scheme of things, we got very lucky that a large part of the reasoning of current AI systems is directly visible and understandable in plain sight (assuming that the hidden CoT in o1 and o3 is human-legible).

By faithful and human-legible, I mean that the model's reasoning is done in a way that is directly understandable by humans and accurately reflects the reasons for the model's actions. It also excludes situations where the model does a lot of internal reasoning that differs from what the CoT states.

At a minimum, I think we should

Assuming the legibility tax isn't that large (e.g. less than 2x), I feel like we'd be shooting ourselves in the foot if we were to build AGI that doesn't use human-legible CoT from the point where we are now, even under these conservative assumptions.
I think my view would be significantly changed if there was evidence that high-compute RL, as used in o1 and o3, leads to non-legible CoT, e.g. where the model develops its own language that is not trivially comprehensible to us. In such a world, I would want

Relatedly, Daniel Kokotajlo has recently written about a concrete proposal in this direction in "Why Don't We Just... Shoggoth+Face+Paraphraser?"

### Significantly better (CoT, action & white-box) monitoring

In the world where we have models *with* human-legible CoT, my main line of defense would go through really detailed monitoring (and control that doesn't assume human-legible CoT would be the second line of defense).

Better monitoring seems like an incredibly cost-effective yet neglected approach. Clearly, it would be desirable to have multiple simple monitors analyzing model outputs for a number of generic failure modes and bad behaviors at all times. The kinds of behaviors we could be tracking could, for example, include:

There are many possible ways these could be implemented, e.g. we could have black-box-only monitors such as smaller-but-faster models that constantly run in parallel (like Gemini Flash to monitor Gemini) or simple white-box probes trained to detect these problematic behaviors (e.g. linear probes or SAE probes). I would also want to monitor both the actions themselves (e.g. tool calls) and the reasoning that led to them.

I expect that some model providers are already using online black-box monitors for jailbreaks, so there already is a proof of concept. Simple linear probes are incredibly cost-effective (just a simple vector product), and we could run thousands of probes in parallel at almost zero additional overhead.

While these seem like simple approaches, I think nobody has really tried very hard to make them work in practice. Furthermore, while they are conceptually simple, they still require a lot of iteration and tuning, e.g. we would need to create good datasets to train and evaluate our probes and calibrate the thresholds such that they trigger in the correct cases but don't produce too many false positives. Especially in the case where it is hard to build really good monitoring systems, we should know that as soon as possible so that AI developers can put more effort into them.

I think every AI lab should have a full team just to build these monitors, make sure they work as intended, analyze their outcomes, make them efficient to use, and more. I would prioritize efforts to build high-sensitivity monitors for high-stakes situations, e.g. developer-internal agentic deployments, over good monitors for average-stakes deployments, e.g. most chatbot use cases.

My general feeling is that these monitors would both directly prevent harm and unlock a lot of other benefits, e.g. we could understand the effectiveness of our alignment techniques faster and more thoroughly.
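To make the white-box probe idea above a bit more concrete, here is a minimal sketch under my own assumptions about how one might wire it up; the class, the method names, and the difference-of-means training rule are illustrative choices, not an established implementation. The point is simply that scoring costs one matrix-vector product per forward pass and that thresholds have to be calibrated against a target false-positive rate.

```python
import torch


class LinearProbeMonitor:
    """Minimal white-box monitor: one linear probe per tracked behavior.

    Each probe is a direction in activation space, so scoring a forward pass is a
    single matrix-vector product and thousands of probes add negligible overhead.
    (Illustrative sketch; class and method names are my own, not an existing API.)
    """

    def __init__(self, behaviors: list[str], hidden_dim: int):
        self.behaviors = behaviors                       # e.g. ["deception", "sandbagging"]
        self.probes = torch.zeros(len(behaviors), hidden_dim)
        self.thresholds = torch.zeros(len(behaviors))

    def fit(self, activations: torch.Tensor, labels: torch.Tensor, max_fpr: float = 0.01):
        """Fit difference-of-means probes on labelled activations
        (activations: [n, hidden_dim]; labels: [n, n_behaviors] with 0/1 entries)
        and pick thresholds so the false-positive rate on benign examples is <= max_fpr."""
        for i in range(len(self.behaviors)):
            pos = activations[labels[:, i] == 1].mean(dim=0)
            neg = activations[labels[:, i] == 0].mean(dim=0)
            self.probes[i] = pos - neg
            benign_scores = activations[labels[:, i] == 0] @ self.probes[i]
            self.thresholds[i] = torch.quantile(benign_scores, 1.0 - max_fpr)

    def flag(self, activation: torch.Tensor) -> dict[str, bool]:
        """Score one residual-stream activation vector and report which monitors fire."""
        scores = self.probes @ activation                # one matrix-vector product
        return {b: bool(scores[i] > self.thresholds[i])
                for i, b in enumerate(self.behaviors)}
```

In practice, I expect the hard part is not the code but building good labelled datasets for the behaviors we care about and keeping the thresholds calibrated as the underlying model changes.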
### Control (that doesn't assume human-legible CoT)

In the world where we have models *without* human-legible CoT, my main line of defense would go through control that doesn't assume human-legible CoT (and monitoring would be the second line of defense).

I specifically mean the Redwood Research definition of control, where we make the worst-case assumption that the model is scheming and that this is not necessarily visible to us in the CoT. We then attempt to develop protocols that maximize the usefulness of these models while minimizing risks.

While CoT monitoring might be useful for some control approaches, I specifically want to see further development of techniques that don't assume that the CoT is legible to humans, as done by Redwood Research.

In general, I think the kind of work required for control is something like what Redwood has been doing, just 100x more and applied to frontier systems by the AGI labs. Some concrete suggestions include:

This is just a small subset of directions in control. As far as I know, there will be a significantly more detailed list of control projects and open problems soon.
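As one illustration of the kind of protocol this line of work produces, here is a minimal sketch of trusted monitoring with a limited human audit budget, loosely in the spirit of Redwood's control work; the interfaces (`untrusted_model`, `trusted_model`, `human_audit`) and the concrete thresholds are hypothetical placeholders, not anything Redwood or a lab has published.

```python
def run_with_trusted_monitoring(task, untrusted_model, trusted_model, human_audit,
                                audit_threshold: float = 0.98,
                                defer_threshold: float = 0.80):
    """Sketch of a trusted-monitoring control protocol (illustrative, not a published spec).

    untrusted_model: the capable model we worry might be scheming (callable: task -> action)
    trusted_model:   a weaker model we trust; scores suspicion and serves as a fallback
    human_audit:     expensive human review, affordable only for a small fraction of actions
    Thresholds would be calibrated offline, e.g. as quantiles of suspicion on benign traffic.
    """
    action = untrusted_model(task)

    # The trusted model rates how suspicious the proposed action looks (0 = benign, 1 = very suspicious).
    suspicion = trusted_model.score_suspicion(task, action)

    if suspicion >= audit_threshold:
        # Spend the scarce human-audit budget on the most suspicious sliver of actions.
        return human_audit(task, action)
    if suspicion >= defer_threshold:
        # Moderately suspicious: discard the untrusted output and let the trusted model act instead.
        return trusted_model(task)
    # Looks benign: keep the (more useful) untrusted output.
    return action
```

The design question the control agenda asks is how to set such thresholds and fallbacks so that the expected damage from a scheming untrusted model stays low while most of its usefulness is preserved.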
### Much deeper understanding of scheming

By the time we have AI systems that drastically speed up or replace core researchers at an AI lab, I think we should have a much better understanding of scheming. Concretely, we should

I expect that a lot of this work is on model organisms, i.e. where researchers actively study one particular part of scheming in isolation, similar to Sleeper Agents or Ryan Greenblatt and Anthropic's work on Alignment faking in large language models.

### Evals

Evals are a cornerstone of a lot of frontier safety commitments, and I think they are a useful evidence-gathering tool that can inform other strategies. Within this plan, evals fulfill two important roles.

Overall, I think there should be significantly more resources in evals, see e.g. the evals gap (other safety efforts should also be funded more).

### Security

Improved information and physical security seems like an important layer of defense for many failure modes. We need to ensure that model weights and algorithmic secrets cannot get stolen, that models cannot exfiltrate themselves, and that no internal employees can run large, unmonitored finetuning runs. I think the report on securing model weights is great, and we should basically treat its recommendations as a good default (though it's plausible that more thinking would suggest a slightly different strategy). Buck Shlegeris' post, "Access to powerful AI might make computer security radically easier," also makes important recommendations.

My guess is that the core challenge under short timelines will be correct prioritization. It is likely impossible to implement everything that would be suggested in the SL5 recommendations of the "securing model weights" report within the next 2-3 years. Thus, triaging seems necessary. I could also imagine that a lot of the security strategies that are important right around the point when an AI company automates its AI R&D have a few years of development time and might not feel immediately required right now, so they will only exist if an AI developer is acting now with a lot of foresight.

Finally, I'd hope that the internal infosec teams at frontier AI companies get access to frontier models ~3 months before external deployment, both so they can defensively harden their infosec and so they can conduct offensive testing. My guess is that this is not the norm, but in a sane world, an AI company would use its own models to improve its own defensive capabilities before releasing them. I expect this is largely blocked by inertia and company-internal politics. So, making this a norm as soon as possible would likely go a long way.

Unfortunately, I'm not an expert in security and would welcome concrete suggestions for a short-timeline security strategy from someone who is.

## Layer 2

Everything in this layer still seems quite important to me, but it is less important than what is in layer 1.

### Improved near-term alignment strategies

I think RLHF and RLAIF empirically work better than I would have expected in ~2020. For example, they make the model behave more robustly as I want it to, the model can infer my intentions better, and the HHH training generalizes further than I had anticipated.

For example, I did not expect that the models would internalize HHH strongly enough to be willing to scheme for it, e.g. in Alignment faking in large language models and in our follow-up experiments on in-context scheming. While I think this should be considered evidence that current alignment techniques go "deeper" than some people thought, it also opens up new problems.

Since our existing techniques can already introduce non-myopic and non-corrigible tendencies into the models, I think we should spend significantly more effort ensuring that these techniques keep the models myopic and corrigible. Furthermore, I think we should understand in much more detail which kinds of goals the models acquire, e.g. in our follow-up experiments, it seemed that "the models just want to be maximally helpful" was not the best description of the goals their revealed preferences point to, and at least to some extent they also showed some self-preservation without this being instrumental to their helpfulness. I would like to understand this much better and build fine-grained "cognitive profiles" that we closely track for various alignment techniques.

I also expect that the strategies for internal and external use should differ meaningfully. For example, making models jailbreak-resistant could decrease their corrigibility. However, if you intend to deploy the model only to highly trusted internal staff, you don't have to train it against jailbreaks. I'm currently unsure which training techniques you would want to use for agentic long-horizon internal deployments, e.g. because jailbreaks might be less relevant but taking non-power-seeking actions is much more relevant.

Likely, the main bottleneck to success is building tests and environments that are stable enough that we'd be happy to iterate against them (not necessarily train against them, but at least regularly evaluate on them).

### Continued work on interpretability, scalable oversight, superalignment & co

I like efforts going into improving fundamental research in interpretability, scalable oversight, superalignment & co (this is broadly the category for empirical research that is not covered in layer 1), and I want people to continue to make progress on the fundamentals. However, I think in a short-timeline world, figuring out the fundamentals first is too risky a bet, and I deprioritize them in this plan.

There are three types of efforts I would want to see in particular:

My current guess is that none of these techniques are good enough that we would want to put a lot of confidence in them, but I think they could be used as a supporting argument in an overall defense-in-depth approach. Of course, I'd love to be proven wrong and find out that any of them could be fully load-bearing.

### Reasoning transparency

I think AI companies should make their reasoning around safety plans for human-level AGI more transparent. For example, voluntary commitments like safety frameworks were a good step in the right direction. They allowed externals to understand the broad strokes of these plans and criticize them. I think not every criticism was accurate or helpful, but I expect the overall effect to be highly net positive. I also liked Anthropic's "Three Sketches of ASL-4 Safety Case Components" and would like to see more efforts in this direction.
Realistically, I think the priorities should be roughly:

### Safety first culture

I think an organization that aims to build AGI should ideally have internalized a safety-first approach at all levels. Everyone, from leadership to interns, has to understand the magnitude of what they're working on and the implications if they make mistakes. The kind of culture I have in mind can likely only be found in militaries today, if at all. I think we're currently very far away from that culture and not on track to reach it.

A few examples of my desired culture include:

I think it's plausible that the question of whether AI will be the best or worst technology that humanity has built may come down to a few high-stakes decisions that are made with little time and under high pressure. I find it hard to see a world where we make the best decisions unless safety is top of mind for all people involved.

Assuming short timelines, I think it's likely impossible to reach my desired levels of safety culture. This implies that

# Known limitations and open questions

I already list some of the limitations above, but want to give a bit more context here.