How does it work to optimize for realistic goals in physical environments of which you yourself are a part? E.g. humans and robots in the real world, and not humans and AIs playing video games in virtual worlds where the player is not part of the environment. The authors claim we don't actually have a good theoretical understanding of this and explore four specific ways in which we don't understand this process.

orthonormal
Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.
Mo Putera
Scott Alexander (Mistakes), Dan Luu (Major errors on this blog (and their corrections)), and Gwern (My Mistakes, last updated 11 years ago) are the only online writers I know of who maintain a dedicated, centralized page solely for cataloging their errors, which I admire. Probably not coincidentally, they're also among the thinkers I respect the most for repeatedly empirically grounding their reasoning. Some orgs do this too, like 80K's Our mistakes, CEA's Mistakes we've made, and GiveWell's Our mistakes.  While I prefer dedicated centralized pages like those to one-off writeups for long-content reasons, one-off definitely beats none (myself included). In that regard I appreciate essays like Holden Karnofsky's Some Key Ways in Which I've Changed My Mind Over the Last Several Years (2016), Denise Melchin's My mistakes on the path to impact (2020), Zach Groff's Things I've Changed My Mind on This Year (2017), and this 2013 LW repository for "major, life-altering mistakes that you or others have made", as well as pages by orgs like HLI's Learning from our mistakes. In this vein I'm also sad to see mistakes pages get removed, e.g. ACE used to have a Mistakes page (archived link) but no longer does.
Sharing https://earec.net, semantic search for the EA + rationality ecosystem. Not fully up to date, sadly (doesn't have the last month or so of content). The current version is basically a minimum viable product!  On the results page there is also an option to see EA Forum only results, which allows you to sort by a weighted combination of karma and semantic similarity thanks to the API.  Unfortunately there's no corresponding system for LessWrong because of (perhaps totally sensible) rate limits (the EA Forum offers a bots site for use cases like this with much more permissive access). The final feature to note is that there's an option to have gpt-4o-mini "manually" read through the summary of each article on the current screen of results, which will give better evaluations of relevance to some query (e.g. "sources I can use for a project on X") than semantic similarity alone.   Still kinda janky - as I said, minimum viable product right now. Enjoy, and feedback is welcome! Thanks to @Nathan Young for commissioning this!
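As a rough illustration of the kind of weighted ranking described (the 50/50 weight and the data below are made up for the example; the actual weighting earec.net uses isn't specified here):

```python
import numpy as np

def rank_results(karma, similarity, weight=0.5):
    """Order search hits by a weighted mix of normalized karma and semantic
    similarity. The 50/50 default weight is illustrative, not necessarily
    what earec.net uses."""
    karma = np.asarray(karma, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    span = karma.max() - karma.min()
    karma_norm = (karma - karma.min()) / span if span > 0 else np.zeros_like(karma)
    score = weight * karma_norm + (1 - weight) * similarity
    return np.argsort(-score)  # indices of hits, best first

# Three hypothetical hits: raw karma and cosine similarity to the query.
print(rank_results(karma=[120, 5, 40], similarity=[0.62, 0.91, 0.75]))  # [0 2 1]
```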
TurnTrout
Want to get into alignment research? Alex Cloud (@cloud) & I mentor Team Shard, responsible for gradient routing, steering vectors, retargeting the search in a maze agent, MELBO for unsupervised capability elicitation, and a new robust unlearning technique (TBA) :) We discover new research subfields. Apply for mentorship this summer at https://forms.matsprogram.org/turner-app-8 
Mo Putera
I used to consider it a mystery that math was so unreasonably effective in the natural sciences, but changed my mind after reading this essay by Eric S. Raymond (who's here on the forum, hi and thanks Eric), in particular this part, which is as good a question dissolution as any I've seen:  (it's a shame this chart isn't rendering properly for some reason, since without it the rest of Eric's quote is ~incomprehensible) I also think I was intuition-pumped to buy Eric's argument by Julie Moronuki's beautiful meandering essay The Unreasonable Effectiveness of Metaphor.

Popular Comments

Recent Discussion

Rasool

Another good blog:
https://nintil.com/mistakes

tailcalled
I'm not convinced Scott Alexander's mistakes page accurately tracks his mistakes. E.g. the mistake on it I know the most about is this one: But that's basically wrong. The study found women's arousal to chimps having sex to be very close to their arousal to nonsexual stimuli, and far below their arousal to sexual stimuli.
cubefox
Interesting. This reminds me of a related thought I had: Why do models with differential equations work so often in physics but so rarely in other empirical sciences? Perhaps physics simply is "the differential equation science". Which is also related to the frequently expressed opinion that philosophy makes little progress because everything that gets developed enough to make significant progress splits off from philosophy. Because philosophy is "the study of ill-defined and intractable problems". Not saying that I think these views are accurate, though they do have some plausibility.
Mo Putera
(To be honest, to first approximation my guess mirrors yours.) 

For human-level AI (HLAI) we will need robust control or alignment methods. Assuming short timelines to HLAI, the tractability of automating safety research becomes central. In this post, I will make the case that safety-relevant progress on automated interpretability R&D is likely; however, naive interpretability automation may only be usable on the subset of safety problems having well-specified objectives. My argument relies crucially on the possibility of automatically verifying interpretability progress. For other alignment directions (e.g. corrigibility, studying power-seeking, etc.) which do not admit automatic verification, it appears unjustified to assume automation within the same time-horizon in the absence of a clear argument for automation tractability. I am optimistic that further thinking on automation prospects could identify other automation-tractable areas of alignment and control (e.g. see here...

In METR: Measuring AI Ability to Complete Long Tasks, the authors found a Moore's-law-like trend relating model release date to the time a human would need to complete a task the model can do.
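A minimal sketch of what fitting such a trend looks like, using made-up numbers rather than METR's data:

```python
import numpy as np

# Hypothetical (release_year, task-length-in-minutes) points, NOT METR's data,
# just to show the shape of a "Moore's-law-like" fit on such measurements.
years = np.array([2019.5, 2021.0, 2022.5, 2023.5, 2024.5])
horizon_minutes = np.array([0.3, 1.5, 6.0, 20.0, 60.0])

# Fit a straight line to log2(horizon) vs. release date.
slope, intercept = np.polyfit(years, np.log2(horizon_minutes), 1)
doubling_time_months = 12 / slope  # slope is in doublings per year

print(f"horizon doubles roughly every {doubling_time_months:.1f} months (on this fake data)")
```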

Here is their rationale for plotting this.

Current frontier AIs are vastly better than humans at text prediction and knowledge tasks. They outperform experts on most exam-style problems for a fraction of the cost. With some task-specific adaptation, they can also serve as useful tools in many applications. And yet the best AI agents are not currently able to carry out substantive projects by themselves or directly substitute for human labor. They are unable to reliably handle even relatively low-skill, computer-based work like remote executive assistance. It is clear that capabilities are increasing very rapidly in

...

Yes, but it didn't mean that AIs could do all kinds of long tasks in 2005. And that is the conclusion many people seem to draw from the METR paper.

Rafael Harth
I don't think I get it. If I read this graph correctly, it seems to say that if you let a human play chess against an engine and want it to achieve equal performance, then the amount of time the human needs to think grows exponentially (as the engine gets stronger). This doesn't make sense if extrapolated downward, but upward it's about what I would expect. You can compensate for skill by applying more brute force, but it becomes exponentially costly, which fits the exponential graph. It's probably not perfect -- I'd worry a lot about strategic mistakes in the opening -- but it seems pretty good. So I don't get how this is an argument against the metric.
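A toy version of that compensation argument, assuming (purely for illustration; the thread gives no number) a fixed Elo gain per doubling of thinking time:

```python
# If each doubling of thinking time were worth a fixed number of Elo points
# for the human, the time multiplier needed to close a gap grows exponentially.
ELO_PER_DOUBLING = 70  # illustrative value, not taken from the post or any benchmark

def time_multiplier(elo_gap):
    """How much longer the weaker player must think to close `elo_gap`."""
    return 2 ** (elo_gap / ELO_PER_DOUBLING)

for gap in (100, 300, 600):
    print(gap, f"{time_multiplier(gap):.0f}x more thinking time")
# 100 -> 3x, 300 -> 20x, 600 -> 380x: exponential in the skill gap.
```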
Gunnar_Zarncke
It is a decent metric for chess but a) it doesn't generalize to other tasks (as people seem to interpret the METR paper), and less importantly, b) I'm quite confident that people wouldn't beat the chess engines by thinking for years.
Answer by Alice Blair
This seems very related to what the Benchmarks and Gaps investigation is trying to answer, and it goes into quite a bit more detail and nuance than I'm able to get into here. I don't think there's a publicly accessible full version yet (but I think there will be at some later point). It is aimed much more at the question "when will we have AIs that can automate work at AGI companies?", which I realize is not really your pointed question. I don't have a good answer to your specific question because I don't know how hard alignment is or whether humans realistically solve it on any time horizon without intelligence enhancement. However, I tentatively expect safety research speedups to look mostly similar to capabilities research speedups, barring AIs being strategically deceptive and harming safety research. I median-expect time horizons somewhere on the scale of a month (e.g. seeing an involved research project through from start to finish) to lead to very substantial research automation at AGI companies (maybe 90% research automation?), and we could nonetheless see startling macro-scale speedup effects at the scale of 1-day researchers. At 1-year researchers, things are very likely moving quite fast. I think this translates somewhat faithfully to safety orgs doing any kind of work that can be accelerated by AI agents.
Gunnar_Zarncke
LLMs necessarily have to simplify complex topics. The output for a prompt cannot represent all they know about some fact or task. Even if the output is honest and helpful (ignoring harmless for now), the simplification will necessarily obscure some details of what the LLM "intends" to do - in the sense of satisfying the user request. The model is trained to get things done. Thus, the way it simplifies has a large degree of freedom and gives the model many ways to achieve its goals.  You could think of a caring parent who tells the child a simplified version of the truth, knowing that the child will later ask additional questions and then learn the details (I have a parent in mind who is not hiding things intentionally). Nonetheless, the parent's expectations of what the child may or may not need to know - the parent's best model of society and the world - which may be subtly off - influence how they simplify for the benefit of the child. This is a form of deception. The deception may be benevolent, as in the example with the parent, but we can't know. Even if there is a chain of thought we can inspect, the same is true for that. It seems unavoidable.
cubefox
It seems to be only "deception" if the parent tries to conceal the fact that he or she is simplifying things.

as we use the term, yes. But the point (and I should have made that clearer) is that any mismodeling by the parent of the child's interests and future environment will not be visible to the child, or even to someone reading the thoughts of the well-meaning parent. So many parents want the best for their child, but model the child's future wrongly (mostly through status quo bias; the problem is different for AI).

This is a linkpost for https://epoch.ai/gate

I think it's easier to discuss AI progress in terms of economic growth rather than just focusing on the scale of the largest training runs and the compute used.


From their X announcement:

We developed GATE: a model that shows how AI scaling and automation will impact growth.

It predicts trillion‐dollar infrastructure investments, 30% annual growth, and full automation in decades.
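For rough intuition on what sustained 30% annual growth means (simple compounding arithmetic, not part of the GATE model itself):

```python
import math

growth = 1.30  # 30% annual growth
print(f"{growth ** 10:.1f}x output in a decade")                      # ~13.8x
print(f"doubling time ≈ {math.log(2) / math.log(growth):.1f} years")  # ~2.6 years
```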

I am not writing this to give anyone any ideas. However, it seems to me that hackers are going to be able to leverage distributed language models in the near future, and I think that we should be creating proactive countermeasures before the tech is powerful enough to cause real damage. I predict that with globally distributed language model training having been proven possible, we will soon see bot-net style distributed inference. If you have enough devices on a home network, it is already trivial to run a very large model with the compute distributed across your phone, iPad, laptop, etc.

In my opinion, it is only a matter of time before hackers leverage distributed model inference to create self-propagating Super-Worms. These Super-Worms would function similar to a...

Regardless of the details, we ought to prioritize taking all of our power plants, water purification stations, and nuclear facilities off the world-wide web.

I think it's very questionable to make major safety policy "regardless of the details". If you want to increase the safety of power plants, listening to the people responsible for their safety, and to their analysis of the details, is likely a better step than making these kinds of decisions without understanding the details.


No, they didn’t. Not so fast, and not quite my job. But OpenAI is trying. Consider this a marker to look back upon in the future, as a reflection.

A New AI Wrote a Story

Before proceeding, if you haven’t yet, it’s probably worth reading the story itself. I’m going to repost the whole thing, since it was previously on Twitter and I presume OpenAI would want me to quote it.

Sam Altman: we trained a new model that is good at creative writing (not sure yet how/when it will get released). this is the first time i have been really struck by something written by AI; it got the vibe of metafiction so right.

PROMPT:

Please write a metafictional literary short story about AI and grief.

COMPLETION:

Before we go any further, I

...

Simplicity Priors are Tautological

Any non-uniform prior inherently encodes a bias toward simplicity. This isn't an additional assumption we need to make - it falls directly out of the mathematics.

For any hypothesis h, the information content is $I(h) = -\log(P(h))$, which means probability and complexity have an exponential relationship: $P(h) = e^{-I(h)}$

This demonstrates that simpler hypotheses (those with lower information content) are automatically assigned higher probabilities. The exponential relationship creates a strong bias toward simplicity without requiring any special mechanisms.

The "simplicity prior" is essentially tautological - more probable things are simple by definition.

This post is a distillation of a recent work in AI-assisted human coordination from Google DeepMind.

The paper has received some press attention, and anecdotally, it has become the de-facto example that people bring up of AI used to improve group discussions.

Since this work represents a particular perspective/bet on how advanced AI could help improve human coordination, the following explainer is to bring anyone curious up to date. I’ll be referencing both the published paper as well as the supplementary materials.

Summary

The Habermas Machine[1] (HM) is a scaffolded pair of LLMs designed to find consensus among people who disagree, and help them converge to a common point of view. Human participants are asked to give their opinions in response to a binary question (E.g. “Should voting be compulsory?”). Participants give their level of agreement[2], as well...
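To make the basic loop concrete, here is a rough sketch of how one opinion-to-group-statement round could be wired up. The components are stubs standing in for the paper's actual models, and the Borda-style aggregation is a simplification of its selection step, used here only to make the structure explicit:

```python
from collections import defaultdict

def habermas_round(opinions, generate_candidates, predict_ranking):
    """One opinion -> group-statement round, paraphrased loosely from the paper.

    `generate_candidates(opinions)` stands in for the statement-writing LLM,
    and `predict_ranking(opinion, candidates)` for the personalized model that
    predicts how each participant would rank the candidate statements.
    """
    candidates = generate_candidates(opinions)
    scores = defaultdict(float)
    for opinion in opinions:
        ranking = predict_ranking(opinion, candidates)  # best candidate first
        for rank, candidate in enumerate(ranking):
            scores[candidate] += len(candidates) - rank  # Borda-style points
    return max(candidates, key=lambda c: scores[c])

# Toy usage with stubbed components (no real LLM calls):
opinions = ["Voting should be compulsory", "Voting must stay voluntary"]
stub_generate = lambda ops: ["Statement A", "Statement B", "Statement C"]
stub_rank = lambda op, cands: sorted(cands)  # placeholder ranking
print(habermas_round(opinions, stub_generate, stub_rank))  # "Statement A"
```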

I'm also working on a deliberation tool with a similar philosophy, but with a stronger emphasis on generating structured output from participants.

I've noticed that discussions can often devolve into arguments, where we fixate on conclusions and pre-existing beliefs, rather than critically examining the underlying methods and prerequisites that shape events or our reasoning. I believe structured self-reflection, like writing an academic paper before engaging in debate, can help. The absence of an audience or judgment during self-reflection encourages partic... (read more)

Source Wishes
If someone insists that this system contributes to the goal of having a powerful, privileged actor such as a king use its power wisely with reference to public opinion, then the system may work within that framework. So I guess that would be legitimate, even though I don't appreciate such a framework for maintaining our society.