It might be the case that what people find beautiful and ugly is subjective, but that's not an explanation of *why* people find some things beautiful or ugly. Things, including aesthetics, have causal reasons for being the way they are. You can even ask "what would change my mind about whether this is beautiful or ugly?". Raemon explores this topic in depth.

johnswentworth
I revisited this post a few months ago, after Vaniver's review of Atlas Shrugged. I've felt for a while that Atlas Shrugged has some really obvious easy-to-articulate problems, but also offers a lot of value in a much-harder-to-articulate way. After chewing on it for a while, I think the value of Atlas Shrugged is that it takes some facts about how incentives and economics and certain worldviews have historically played out, and propagates those facts into an aesthetic. (Specifically, the facts which drove Rand's aesthetics presumably came from growing up in the early days of Soviet Russia.) It's mainly the aesthetic that's valuable. Generalizing: this post has provided me with a new model of how art can offer value. Better yet, the framing of "propagate facts into aesthetics" suggests a concrete approach to creating or recognizing art with this kind of value. As in the case of Atlas Shrugged, we can look at the aesthetic of some artwork, and ask "what are the facts which fed into this aesthetic?". This also gives us a way to think about when the aesthetic will or will not be useful/valuable. Overall, this is one of the gearsiest models I've seen for instrumental thinking about art, especially at a personal (as opposed to group/societal) level.
DAL
If AI executives really are as bullish as they say they are on progress, then why are they willing to raise money anywhere in the ballpark of current valuations? Dario Amodei suggested the other day that AI will take over all or nearly all coding work within months. Given that software is a multi-trillion dollar industry, how can you possibly square that statement with agreeing to raise money at a valuation for Anthropic in the mere tens of billions? And that's setting aside any other value whatsoever for AI. The whole thing sort of reminds me of the Nigerian prince scam (i.e., the Nigerian prince is coming into an inheritance of tens of millions of dollars but desperately needs a few thousand bucks to claim it, and will cut you in for incredible profit as a result), just scaled up a few orders of magnitude. Anthropic/OpenAI are on the cusp of technologies worth many trillions of dollars, but they're so desperate for a couple billion bucks to get there that they'll sell off big equity stakes at valuations that do not remotely reflect that supposedly certain future value.
kman
Something I didn't realize until now: P = NP would imply that finding the argmax of arbitrary polynomial-time (P-time) functions could be done in P-time.

Proof sketch: Suppose you have some polynomial-time function f: N -> Q. Since f is P-time, if we feed it an n-bit input x it will output a y with at most max_output_bits(n) bits, where max_output_bits(n) is at most polynomial in n. Denote by y_max and y_min the largest and smallest rational numbers encodable in max_output_bits(n) bits. Now define check(x, y) := [f(x) >= y], and argsat(y) := some x such that check(x, y), or None if no such x exists. Computing argsat(y) is in FNP, and thus runs in P-time if P = NP. We can then find argmax_x f(x) by binary searching over values y in [y_min, y_max], calling argsat(y) at each step. The binary search calls argsat(y) at most max_output_bits(n) times, and poly times poly is still poly.

I'd previously thought of argmax as necessarily exponential time, since something being an optimum is a global property of all evaluations of the function, rather than a local property of one evaluation.
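For concreteness, here is a minimal sketch of the reduction (not from the original comment), simplified to integer-valued f; argsat stands in for the FNP search oracle that would run in polynomial time if P = NP, and the toy usage at the bottom just brute-forces it:

```python
from typing import Callable, Optional

def argmax_via_np_oracle(
    argsat: Callable[[int], Optional[int]],
    max_output_bits: int,
) -> Optional[int]:
    """Find some x maximizing f via binary search over output thresholds.

    Assumes integer-valued f with |f(x)| < 2**max_output_bits. argsat(y)
    stands in for the FNP search problem "return some x with f(x) >= y,
    or None if none exists" -- the step that is P-time if P = NP.
    """
    lo, hi = -(2 ** max_output_bits), 2 ** max_output_bits  # y_min, y_max
    best_x = argsat(lo)          # every x in the domain satisfies f(x) >= y_min
    while lo < hi:               # O(max_output_bits) iterations
        mid = (lo + hi + 1) // 2
        x = argsat(mid)
        if x is not None:        # some input reaches at least mid
            best_x, lo = x, mid
        else:                    # threshold mid is unreachable
            hi = mid - 1
    return best_x                # f(best_x) equals the maximum value lo

# Toy usage with a brute-force "oracle" over a small domain:
f = lambda x: -(x - 13) ** 2     # maximized at x = 13
domain = range(64)
argsat = lambda y: next((x for x in domain if f(x) >= y), None)
assert argmax_via_np_oracle(argsat, max_output_bits=12) == 13
```

The polynomial bound comes entirely from the oracle: the wrapper itself only does max_output_bits-many comparisons and oracle calls.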
TsviBT
Are people fundamentally good? Are they practically good? If you make one person God-emperor of the lightcone, is the result something we'd like? I just want to make a couple remarks.

  • Conjecture: Generally, on balance, over longer time scales good shards express themselves more than bad ones. Or rather, what we call good ones tend to be ones whose effects accumulate more.
  • Example: Nearly all people have a shard, quite deeply stuck through the core of their mind, which points at communing with others.
    • Communing means: speaking with; standing shoulder to shoulder with, looking at the same thing; understanding and being understood; lifting the same object that one alone couldn't lift.
    • The other has to be truly external and truly a peer. Being a truly external true peer means they have unboundedness, infinite creativity, self- and pair-reflectivity and hence diagonalizability / anti-inductiveness. They must also have a measure of authority over their future. So this shard (albeit subtly and perhaps defeasibly) points at non-perfect subjugation of all others, and democracy. (Would an immortalized Genghis Khan, having conquered everything, after 1000 years, continue to wish to see in the world only always-fallow other minds? I'm unsure. What would really happen in that scenario?)
    • An aspect of communing is, to an extent, melting into an interpersonal alloy. Thought patterns are quasi-copied back and forth, leaving their imprints on each other and each other leaving their imprints on the thought patterns; stances are suggested back and forth; interoperability develops; multi-person skills develop; eyes are shared. By strong default this cannot be stopped from being transitive. Thus elements, including multi-person elements, spread, binding everyone into everyone, in the long run.
  • God--the future or ideal collectivity of humane minds--is the extrapolation of primordial searching for shared intentionality. That primordial searching almost universal
When we should expect the "Swiss cheese" approach in safety/security to go wrong:

Popular Comments

Recent Discussion

There’s this popular trope in fiction about a character being mind controlled without losing awareness of what’s happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that’s responsible for motor control. You remain conscious and experience everything with full clarity.

If it’s a children’s story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it’s an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop....

AnthonyC
True. And yet we don't even need to go as far as a realistic movie to override that limitation. All it takes to create such worry is to have someone draw a 2D cartoon of a very sad and lonely dog, which is even less real. Or play some sad music while showing a video of a lamp in the rain, which is clearly inanimate. In some ways these induced worries for unfeeling entities are superstimuli for many of us, stronger than what we may feel for many actual people.
Zygi Straznickas
IIRC I was applying a per-token loss, and had an off-by-one error that led to the penalty being attributed to token_pos+1. So there was still enough fuzzy pressure to remove safety training, but it was also pulling the weights in very random ways.
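For concreteness, here is a hedged sketch (hypothetical names and shapes, not the actual training code) of how an off-by-one in per-token loss attribution shifts each position's penalty onto the next token:

```python
import torch
import torch.nn.functional as F

def per_token_penalty(logits, target_ids, penalty_weights):
    """Toy per-token loss illustrating the off-by-one described above.

    logits:          (batch, seq_len, vocab)
    target_ids:      (batch, seq_len) token ids
    penalty_weights: (batch, seq_len) per-token penalty mask
    """
    token_losses = F.cross_entropy(
        logits.transpose(1, 2), target_ids, reduction="none"
    )  # (batch, seq_len), one loss per position

    # Intended: weight each position's loss by its own penalty.
    intended = (token_losses * penalty_weights).mean()

    # Off-by-one: the penalty meant for position t gets applied to position
    # t + 1 (torch.roll wraps the last value to position 0, which is just an
    # artifact of this toy). Gradients still apply fuzzy pressure overall,
    # but they push on the wrong tokens.
    shifted = torch.roll(penalty_weights, shifts=1, dims=1)
    buggy = (token_losses * shifted).mean()
    return intended, buggy
```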
1a3orn
Huh, interesting. This seems significant, though, no? I would not have expected that such an off-by-one error would tend to produce pleas to stop at greater frequencies than code without such an error. Do you still have the git commit of the version that did this?

Unfortunately I don't; I've now seen this often enough that it didn't strike me as worth recording, other than posting to the project Slack.

But here's as much as I remember, for posterity: I was training a twist function using the Twisted Sequential Monte Carlo framework (https://arxiv.org/abs/2404.17546). I started with a standard, safety-tuned open model and wanted to train a twist function that would modify the predicted token logits to generate text that is 1) harmful (as judged by a reward model), but also, conditioned on that, 2) as similar to the or... (read more)
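For readers unfamiliar with twists: very roughly (my sketch of the general idea, not the author's code), a twist adds learned per-token log-potentials to the base model's next-token logits, so sampling targets the base distribution reweighted by the twist:

```python
import torch

def apply_twist(base_logits: torch.Tensor, twist_log_potentials: torch.Tensor) -> torch.Tensor:
    """Sketch of applying a twist to next-token prediction.

    base_logits:          (batch, vocab) unnormalized log-probs from the base model
    twist_log_potentials: (batch, vocab) learned log-psi values per candidate token

    Sampling from softmax(base_logits + twist_log_potentials) draws tokens
    roughly proportional to p_base(token) * psi(token), i.e. the base model
    reweighted toward whatever the twist was trained to favor.
    Shapes and names here are illustrative assumptions.
    """
    return base_logits + twist_log_potentials

# Usage sketch: next_token ~ Categorical(logits=apply_twist(base_logits, twists))
```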

Over the past six months there's been a huge amount of discussion in the Davis Square Facebook group about a proposal to build a 25-story building in Davis Square: retail on the ground floor, 500 units of housing above, 100 of the units affordable. I wrote about this a few weeks ago, weighing the housing benefits against the impact to current businesses (while the Burren, Dragon Pizza, etc have invitations to return at their current rent, this would still be super disruptive to them if they even did return).

The impact to local businesses is not the only issue people raise, however, and I wanted to get a better overall understanding of how people view it. I went over the thousands of comments on the posts (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) over the last six months, and categorized the objections I saw. Overall I...

Thankfully, rising land prices due to agglomeration effects are not a thing and the number of people in town is constant...

Don't get me wrong, building more housing is good, actually. But it's going to be only a marginal improvement without addressing the systemic issues of land capturing a huge share of economic gains, the rentier economy, and real-estate speculators. These issues are not solvable without a substantial Land Value Tax.

One potential angle: automating software won't be worth very much if multiple players can do it and profits are competed to zero. Look at compilers - almost no one is writing assembly or their own compilers, and yet the compiler writers haven't earned billions or trillions of dollars. With many technologies, the vast majority of value is often consumer surplus never captured by producers.

In general I agree with your point. If evidence pointed to transformative AI being close, you'd strategically delay fundraising for as long as possible. However, if you have uncertainty... (read more)

lc
The story is that they need the capital to build the models that they think will do that.
sjadler
I appreciate the question you're asking, to be clear! I'm less familiar with Anthropic's funding / Dario's comments, but I don't think the magnitudes of ask-vs-realizable-value are as far off for OpenAI as your comment suggests. E.g., compare OpenAI's most recently reported $157B valuation vs. what its maximum profit cap likely was in the old (still current, afaik) structure. The comparison gets a little confusing, because it's been reported that this investment was contingent on a for-profit conversion, which does away with the profit cap. But I definitely don't think OpenAI's recent valuation and the prior profit cap would be orders of magnitude apart. (To be clear, I don't know the specific cap value, but you can estimate it, for instance by analyzing MSFT's initial funding amount, which is reported to have a 100x capped-profit return, and then adjusting for what % of the company you think MSFT got.) (This also makes sense to me for a company in a very competitive industry, with high regulatory risk, and where companies are reported to still be burning lots and lots of cash.)
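To make the estimation method above concrete, here is a back-of-the-envelope sketch; only the 100x cap multiple comes from the reporting mentioned in the comment, while the initial investment size and MSFT's share are illustrative assumptions, not known figures:

```python
# Back-of-the-envelope profit-cap estimate. Only the 100x multiple is from
# the reporting cited above; the other numbers are illustrative assumptions.
msft_initial_investment = 1e9      # assumed early MSFT investment, in USD
cap_multiple = 100                 # reported 100x capped-profit return
msft_capped_return = msft_initial_investment * cap_multiple   # $100B cap for MSFT

assumed_msft_share = 0.25          # hypothetical share of capped profits held by MSFT
implied_total_cap = msft_capped_return / assumed_msft_share   # ~$400B across all holders

print(f"Implied total profit cap: ${implied_total_cap / 1e9:.0f}B "
      f"vs. the reported $157B valuation")
```

Under these (made-up) inputs the cap and the valuation land within an order of magnitude of each other, which is the point of the comparison.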
Ann
Commoditization / no moat? Part of the reason for rapid progress in the field is that there's plenty of fruit left and that fruit is often shared, and a lot of new models come from more fully exploiting research insights already out there on a smaller scale. If a company were able to monopolize it, progress wouldn't be as fast; and if a company can't monopolize it, prices are driven down over time.

Scott Alexander famously warned us to Beware Trivial Inconveniences.

When you make a thing easy to do, people often do vastly more of it.

When you put up barriers, even highly solvable ones, people often do vastly less.

Let us take this seriously, and carefully choose what inconveniences to put where.

Let us also take seriously that when AI or other things reduce frictions, or change the relative severity of frictions, various things might break or require adjustment.

This applies to all system design, and especially to legal and regulatory questions.

Table of Contents

  1. Levels of Friction (and Legality).
  2. Important Friction Principles.
  3. Principle #1: By Default Friction is Bad.
  4. Principle #3: Friction Can Be Load Bearing.
  5. Insufficient Friction On Antisocial Behaviors Eventually Snowballs.
  6. Principle #4: The Best Frictions Are Non-Destructive.
  7. Principle #8: The Abundance Agenda and Deregulation as Category 1-ification.
  8. Principle
...

One case of a change in the level of friction drastically changing things was when, in the late 1990s and 2000s, Napster and successive services made spreading copyrighted files much, much easier than it had been. These days you don't need to pirate your music because you can get almost any recorded song on YouTube whenever you want for free (possibly with an ad) or on Spotify for a cheap subscription fee...

Bohaska
Zvi has a Substack, and there are usually more comments on his posts there than on his LessWrong posts: https://thezvi.substack.com/p/levels-of-friction/comments
This particular post has 30+ comments at that link.
Raemon
Curated. This concept seems like an important building block for designing incentive structures / societies, and this seems like a good comprehensive reference post for the concept.

OpenAI reports that o3-mini with high reasoning and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] found only 11%.

There are a few reasons to trust Epoch's score over OpenAI's:

  • Epoch built the benchmark and has better incentives.
  • OpenAI reported a 28% score on the hardest of the three problem tiers - suspiciously close to their overall score.
  • Epoch has published quite a bit of information about its testing infrastructure and data, whereas OpenAI has published close to none.

Edited in Addendum:
Epoch has this to say in their FAQ:

The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private).


 

  1. ^ Which had Python access.

Hoagy
From the OpenAI report, they also give 9% as the no-tool pass@1:
isabel
I think your Epoch link re-links to the OpenAI result, not something by Epoch. How likely is it that OpenAI was just willing to throw absurd amounts of inference-time compute at the problem set to get a good score?
YafahEdelman
Fixed the link. IMO that's plausible, but it would be pretty misleading, since they described it as "o3-mini with high reasoning", had "o3-mini (high)" in the chart, and "o3-mini high" is what they call a specific option in ChatGPT.
isabel

The reason my first thought was that they used more inference is that ARC Prize specifies that that's how they got their ARC-AGI score (https://arcprize.org/blog/oai-o3-pub-breakthrough): my read on the graph there is that they spent $300k+ on getting their score (there are 100 questions in the semi-private eval). That was o3 high, not o3-mini high, but it's pretty strong proof of concept that they're willing to spend a lot on inference for good scores. [Chart: "o Series Performance"]

In an interview, Erica Komisar discusses parenting extensively. I appreciate anyone who thinks deeply about parenting, so I mostly value Erica's contributions. However, I believe she is mistaken on many points.

One claim she makes follows a broader pattern that I find troubling. To paraphrase:

"Fathers can take on the primary caregiver role, but they would be fighting biology. It goes against what all mammals do."

I see this kind of reasoning frequently: arguments that begin with "From an evolutionary standpoint..." or "For mammals..." But this argument is nonsense. Not only is it incorrect, but I suspect that most people, when pressed, would find it indefensible. It often functions as a stand-in for more rigorous reasoning.

Disclaimer: Erica makes more nuanced claims to support her perspective. Here, I am only critiquing one

...

By Roland Pihlakas, Sruthi Kuriakose, Shruti Datta Gupta

Summary and Key Takeaways

Many past AI safety discussions have centered on the dangers of unbounded utility maximisation by RL agents, illustrated by scenarios like the "paperclip maximiser". Unbounded maximisation is problematic for many reasons. We wanted to verify whether these runaway optimisation problems are still relevant with LLMs as well.

Turns out, strangely, this is indeed clearly the case. The problem is not that the LLMs just lose context. The problem is that in various scenarios, LLMs lose context in very specific ways, which systematically resemble runaway optimisers in the following distinct ways:

  • Ignoring homeostatic targets and “defaulting” to unbounded maximisation instead.
  • Equally concerning, the “default” also meant reverting back to single-objective optimisation.

Our findings suggest that long-running scenarios are important. Systematic failures emerge after periods...
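To illustrate the distinction with a toy sketch (not from the paper): a homeostatic objective rewards staying near a setpoint and penalises both deficit and excess, while the "runaway" failure mode corresponds to treating the objective as unbounded, where more is always scored as better:

```python
def homeostatic_utility(level: float, setpoint: float = 100.0) -> float:
    """Bounded, homeostatic objective: utility peaks at the setpoint;
    both deficit and excess are penalised."""
    return -(level - setpoint) ** 2

def unbounded_utility(level: float) -> float:
    """Runaway-optimiser behaviour: 'more is always better', no setpoint."""
    return level

# A maximiser of the unbounded proxy keeps consuming without limit, while
# the homeostatic objective says the optimum is exactly at the setpoint.
for level in (50, 100, 200, 400):
    print(level, homeostatic_utility(level), unbounded_utility(level))
```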

Roland Pihlakas
I renamed the phenomenon to "runaway optimiser". I hope this label illustrates the inappropriately unbounded and single-minded nature of the failure modes we observed. How does that sound to you? Does it capture the essence of the phenomena described in the post?

Better, thanks!

Roland Pihlakas
Thank you for pointing that out! I agree, there are a couple of nuances. Our perspective can be treated as a generalisation of the original utility monster scenario, although I consider it not to be the first such generalisation: think of the examples in Bostrom's book.

1) In our case, the dilemma is not "agent versus others", but instead "one objective versus other objectives". One objective seems to get more internal/subjective utility from consumption than another objective. Thus the agent focuses on a single objective only.

2) Consideration of homeostatic objectives introduces a new aspect to the utility monster problem: the behaviour of the original utility monster looks unaligned to begin with, not just dominating. It is unnatural for a being to benefit from indefinite consumption. It looks like the original utility monster has an eating disorder! It enjoys eating apples so much that it does not care about the consequences to its future ("other") self. That means even the utility monster may actually suffer from "too much consumption", but it does not recognise this and therefore it consumes indefinitely. Alternatively, just as a paperclip maximiser does not produce the paperclips for itself: if the utility monster is an agent, then somebody else suffers from homeostasis violations while the agent is being "helpful" in an unaligned and naive way.

Technically, this can be seen as a variation of the multi-objective problem: active avoidance of overconsumption could be treated as an "other" objective, while consumption is the dominating and inaccurately linear "primary" objective with a non-diminishing utility.

In conclusion, our perspective is a generalisation: whether the first objective is for the agent's own benefit and the other objective for the benefit of others is left unspecified in our case. Likewise, violating homeostasis can be a scenario where an unaligned agent gets a lot of internal/subjective "utility" from making you excessively happy or from o

"CS-ReFT finetunes LLMs at the subspace level, enabling the much smaller Llama-2-7B to surpass GPT-3.5's performance using only 0.0098% of model parameters."
https://x.com/IntologyAI/status/1901697581488738322

The Moravec paradox is the observation that high-level reasoning is relatively simple for computers, while sensorimotor skills that humans find effortless are computationally challenging. This is why AI is superhuman at chess but we have no self-driving cars. Evolutionarily recent capabilities such as critical thinking are easier for computers than older ones, because these recent developments are less efficient in humans, and they are computationally simpler anyway (it is much more straightforward to pick one good decision among many than to move a limb through space).

This is what fast takeoff looks like. The paper is very math heavy, and the solution is very intelligent. It is...

mishka

Thanks!

So, the claim here is that this is a better "artificial AI scientist" compared to what we've seen so far.

There is a tech report https://github.com/IntologyAI/Zochi/blob/main/Zochi_Technical_Report.pdf, but the "AI scientist" itself is not open source, and the tech report does not disclose much (besides confirming that this is a multi-agent thing).

This might end up being a new milestone (but it's too early to conclude that; the comparison is not quite "apples-to-apples", there is human feedback in the process of its work, and humans make edits to the f... (read more)

kave
I spent some time Thursday morning arguing with Habryka about the intended use of react downvotes. I think I now have a fairly compact summary of his position.

PSA: When to upvote and downvote a react

Upvote a react when you think it's helpful to the conversation (or at least, not antihelpful) and you agree with it. Imagine the react were a comment: if you would agree-upvote it and not karma-downvote it, you can upvote the react.

Downvote a react when you think it's unhelpful for the conversation. This might be because you think the react isn't being used for its intended purpose, because you think people are going through noisily agree-reacting to loads of passages in a back-and-forth to create an impression of consensus, or for other reasons. If, when you imagine the react were a comment, you would karma-downvote that comment, you might downvote the react.

Follow-up: if you would disagree-vote with a react but not karma-downvote it, you can use the opposite react.