gwern - LessWrong

What is the most impressive game LLMs can play well?

A human player beating a random player isn't two random players.

I am more interested in any direct evidence that makes you suspect LLMs are good at chess when prompted appropriately?

Well, there's the DM bullet-chess GPT as a drastic proof of concept. If you believe that LLMs cannot learn to play chess, you have to explain how things like that work.

What is the most impressive game LLMs can play well?

gwern6h50

There should be plenty of transcripts of random algorithms as baseline versus effective chess algorithms in the training set

I wouldn't think that. I'm not sure I've seen a random-play transcript of chess in my life. (I wonder how long those games would have to be for random moves to end in checkmate?)

the prompt suggests strong play.

Which, unlike random move transcripts, is what you would predict, since the Superalignment paper says the GPT chess PGN dataset was filtered for Elo ("only games with players of Elo 1800 or higher were included in pretraining"), in standard behavior-cloning fashion.

Ericf's Shortform

gwern7h113

Another example that reality (especially anything involving technology) is not constrained by the need to be realistic. What SF author would dare write a story with meme coins, much less one in which the meme coins involved AIs like Claude?

What is the most impressive game LLMs can play well?

gwern7hΩ230

Yes.

Implications of the inference scaling paradigm for AI safety

gwern7h50

Huh, so you think o1 was the process supervision reward model, and o3 is the distilled policy model to whatever reward model o1 became? That seems to fit.

Something like that, yes. The devil is in the details here.

Surely other labs will also replicate this too? Even the open source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.

Of course. The secrets cannot be kept, and everyone has been claiming to have cloned o1 already. There are dozens of papers purporting to have explained it. (I think DeepSeek may be the only one to have actually done so, however; at least, I don't recall offhand any of the others observing the signature backtracking 'wait a minute' interjections the way DeepSeek sees organically emerging in r1.)

But scaling was never a secret. You still have to do it. And MS has $80b going into AI datacenters this year; how much does open source (or DeepSeek) have?

It's worth pointing out that Inference-time search seems to become harder as the verifier becomes less reliable. Which means that the scaling curves we see for math and code, might get much worse in other domains.

Yes. That's why I felt skeptical about how generalizable the o1 approach is. It doesn't look like a break-out to me. I don't expect much far transfer: being really good at coding doesn't automatically make you a genius at, say, booking plane tickets. (The o1 gains are certainly not universal, the way straightforward data/parameter-scaling gains tend to be - remember that some of the benchmarks actually got worse.) I also expect the o1 approach to tend to plateau: there is no ground truth oracle for most of these things, the way there is for Go. AlphaZero cannot reward-hack the Go simulator. Even for math, where your theorem prover can at least guarantee that a proof is valid, what's the ground-truth oracle for 'came up with a valuable new theorem, rather than arbitrary ugly tautological nonsense of no value'?

So that's one of the big puzzles here for me: as interesting and impressive as o1/o3 is, I just don't see how it justifies the apparent confidence. (Noam Brown has also commented that OA has a number of unpublished breakthroughs that would impress me if I knew, and of course, the money side seems to still be flowing without stint, despite it being much easier to cancel such investments than cause them.)

Is OA wrong, or do they know something I don't? (For example, a distributional phase shift akin to meta-learning.) Or do they just think that these remaining issues are the sort of thing that AI-powered R&D can solve and so it is enough to just get really, really good at coding/math and they can delegate from there on out?

EDIT: Aidan McLaughlin has a good post back in November discussing the problems with RL and why you would not expect the o1 series to lead to AGI when scaled up in sensible ways, which I largely agree with, and says:

But, despite this impressive leap, remember that o1 uses RL, RL works best in domains with clear/frequent reward, and most domains lack clear/frequent reward.

Praying for Transfer Learning: OpenAI admits that they trained o1 on domains with easy verification but hope reasoners generalize to all domains...When I talked to OpenAI’s reasoning team about this, they agreed it was an issue, but claimed that more RL would fix it. But, as we’ve seen earlier, scaling RL on a fixed model size seems to eat away at other competencies! The cost of training o3 to think for a million tokens may be a model that only does math.

On the other hand... o3 didn't only do math, and in RL we also know that RL systems often exhibit phase transitions in terms of meta-learning or generalization, where they overfit to narrow distributions and become superhuman experts which break if anything is even slightly different, but suddenly generalize when train on diverse enough data as a blessing of scale, not in data but data diversity, with LLMs being a major case in point of that, like GPT-2 -> GPT-3. Hm. This was written 2024-11-20, and McLaughlin announced 2025-01-13 that he had joined OpenAI. Hm...

The Hidden Status Game in Hospital Slacking

gwern7h10

This post was good up until the LLM part, which is largely bullshit and applause lights which make no sense if you actually think about it (ah yes, I'm sure some 'audits' will fix this).

Implications of the inference scaling paradigm for AI safety

gwern7h62

The current FrontierMath fracas is a case in point. Did OpenAI have to keep its sponsorship or privileged access secret? No. Surely there was some amount of money that would pay mathematicians to make hard problems, and that amount was not much different from what they did pay Epoch AI. Did that make life easier? Given the number of mathematician-participants saying they would've had second thoughts about participating had they known OA was involved, almost surely.

Implications of the inference scaling paradigm for AI safety

gwern7h20

What Jones didn’t suggest (but gwern seems to be saying) is that you can use your search-enhanced model to produce better quality synthetic data to train a larger model on.

Jones wouldn't say that because that's just implicit in expert iteration. In each step of expert iteration, you can in theory be training an arbitrary new model from scratch to imitate the current expert. Usually you hold fixed the CNN and simply train it some more on the finetuned board positions from the MCTS, because that is cheap, but you don't have to. As long as it takes a board position, and it returns value estimates for each possible move, and can be trained, it works. You could train a larger or smaller CNN, a deeper or wider* CNN of the same size, a ViT, a RNN, a random forest... (See also 'self-distillation'.) And you might want to do this if the old expert has some built-in biases, perhaps due to path dependency, and is in a bad local optimum compared to training from a blank slate with the latest best synthetic data.

You can also do this in RL in general. OpenAI, for example, kept changing the OA5 DotA2 bot architecture on the fly to tweak its observations and arches, and didn't restart each time. It just did a net2net or warm initialization, and kept going. (Given the path dependency of on-policy RL especially, this was not ideal, and did come with a serious cost, but it worked, as they couldn't've afforded to train from scratch each time. As the released emails indicate, OA5 was breaking the OA budget as it was.)

Now, it's a great question to ask: should we do that? Doesn't it feel like it would be optimal to schedule the growth of the NN over the course of training in a scenario like Jones 2021? Why pay the expense of the final oversized CNN right from the start when it's still playing random moves? It seems like there ought to be some set of scaling laws for how you progressively expand the NN over the course of training before you then brutally distill it down for a final NN, where it looks like an inverted U-curve. But it's asking too much of Jones 2021 to do that as well as everything else. (Keep in mind that Andy Jones was just one guy with no budget or lab support doing this paper all on his own over, like, a summer. In terms of bang for buck, it is one of the best DL & RL papers of the past decade, and puts the rest of us to shame.)

* for latency. An early example of this is WaveNet for synthesizing audio: it was far too slow for real-time, because it was too deep. It didn't cost too many computations, but the computations were too iterative to allow generating 1s of audio in 1s of wallclock, which renders it completely useless for many purposes, as it will fall behind. But once you have a working WaveNet, you can then distill it into a very wide CNN which does much more parallel computation instead of sequential, and can keep up. It might cost more operations, it might have more parameters, but it'll generate 1s of audio in <1s of wallclock time.

What Goes Without Saying

gwern9h543

This sort of fundamental disagreement does lead to some frustrating conversations when you are talking at cross-purposes, and where even if both of you understand the difference, one of you may be talking at a different simulacrum level.

It reminds me of a conversation I had some time back with a school principal, which went something like this: He was trying to come up with proposals for how the school system could use LLMs, and naturally asked me for ideas, as I know a lot about LLMs and we'd discussed them in the past.

I replied that it was mostly a waste of time, because there was nothing really useful he could do with LLMs/AI in general. He was surprised - hadn't I been telling him for years about AI scaling and how it was going to enable total cheating and how LLMs were already capable of doing almost all highschool level work and they were only going to keep getting better and shoot into the stratosphere and become PhD level, how could I think that LLMs were not potentially extremely useful to him?

I said that it was because to really make use of LLMs for their ostensible purpose of education, they would have to reorganize the system and fire a lot of people and replace with LLMs, and the actual purpose of the school system was to preserve itself and provide jobs (and increasingly, simply provide 'pensions') and daycare (in that order); and so LLMs were useless to them - even if they were used for things like making reports faster to write, by Parkinson's law that would simply lead to more reports being demanded until the equilibrium was restored. If he proposed anything like that, it would be ignored at best and held against him at worst, and there were probably better things to spend his time on. (Specifically, since there was nothing he could do about AI scaling and any adaptations in the short run would be obsolete in a few years while he still had decades to go, he should instead be thinking hard about his career and start figuring out how to skate to where the puck will be: what does a school principal do in, say, 5 years' time when there is AGI and everything he does on his computer can be done better for pennies a day?)

He hotly denied this as a load of overly-cynical tosh: schools and education are about learning! I knew he knew better than the official cant (he was too good at the bureaucracy to really believe that), and after a lot of arguing, I finally asked him what would happen if LLMs could replace him and all the teachers completely - would he propose they do that and turn in his resignation letter? He admitted that he would not, and at last conceded that a major purpose of the system was also to provide make-work welfare positions for the locals. (The local peoples have high unemployment, poverty, and alcoholism rates and cannot find any meaningful employment in the private-sector.) I noted that given that purpose, LLMs (or any 'efficiency' improvement at all) could not offer any large gains, because what they do is what those people do, but what they did was already superfluous; since it was not politically possible to just send them all home with a UBI welfare check, and make-work jobs were the chosen solution, he should instead be figuring out how to make things less, not more, efficient. (I may or may not have told the Milton Friedman teaspoon joke.)

He reframed his objective as, 'can we make things more pleasant?' This was finally a realistic goal... but also not one I could really help with, because to remove papercuts and genuinely unnecessary friction or pain, you have to know the system in intimate detail, and an outsider like myself can't help there, and one where the benefits will definitionally be small. But I hoped that by coming to a more honest admission about which system he was working in - a system in which OP points #1--4 are not true - he was at least better off for the conversation.

What is the most impressive game LLMs can play well?

gwern4dΩ460

Given the other reports, like OA's own benchmarking (as well as the extremely large dataset of chess games they mention training on), I am skeptical of this claim, and wonder if this has the same issue as other 'random chess game' tests, where the 'random' part is not neutral but screws up the implied persona.

LESSWRONG
LW

Posts

Wiki Contributions

Comments