Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at [email protected].
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
Yeah, it's just a temporary fix because we got hammered by an angry bot last night. I already lifted the relevant firewall rule (the robots.txt thing is made up; our robots.txt doesn't block crawlers).
The counting argument of the form, "You don't get what you train for, because there are many ways to perform well in training" doesn't seem like the correct way to reason about neural network behavior (which does not mean that alignment is easy or that the humans will survive).
I mean, why? Weighted by any reasonable attempt I have at operationalizing the neural network prior, you don't get what you want out of training, because the low-loss, high-prior generalizations that do not generalize the way I want vastly outweigh the low-loss, high-prior generalizations that do (you might "get what you train for", for all I know, since I don't really know what that phrase means).
I agree that it's not necessary to talk about "human values" in this context. You might want to get something out of your AI systems that is closer to corrigibility, or some kind of deference, etc. However, that doesn't fall nicely out of any analysis of the neural network prior and associated training dynamics either.
You can't just count the number of things when some of them are much more probable than others.
I mean, you can. You just multiply the count by the relative probability of each outcome. Like, in finite event spaces, counting is how you do any likelihood analysis.
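To spell out the arithmetic (toy notation mine, and of course the real measure over generalizations is not something we can write down explicitly):

$$P(\text{misaligned}) = \sum_{h \in H_{\text{mis}}} \mu(h) \approx |H_{\text{mis}}| \cdot \bar{\mu}_{\text{mis}}, \qquad P(\text{aligned}) \approx |H_{\text{al}}| \cdot \bar{\mu}_{\text{al}}$$

where $H_{\text{mis}}$ and $H_{\text{al}}$ are the low-loss generalizations that generalize badly or well, $\mu$ is whatever prior/inductive-bias measure you endorse, and $\bar{\mu}$ is the average prior mass within each class. The count and the weight just multiply; "counting" isn't in competition with having a measure.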
I am personally not a huge fan of the term "counting argument", so I am not very attached to it. Neither are Evan or Wentworth, nor (my guess) Joe or Eliezer, who have written about arguments in this space in the past. Of course, you always need some kind of measure. The argument is that for all measures that seem reasonable, or that we can define reasonably well, it seems like the AI ends up caring about things that are different from human values. Or, in other words, "most goals are misaligned".
I don't really know what you are confused by. It seems like you get it. Weight your goals or goal-generating functions by complexity (or better, try to form any coherent model of what the inductive biases of deep learning training might be, and weight by that). Now you have a counting argument. The counting argument predicts all the things usual counting arguments predict. These arguments also predict that these algorithms generalize (indeed, as Evan says, the arguments would be much weaker if the algorithms didn't generalize, because you need to invoke simplicity or some kind of distribution over inductive biases to get a well-defined object at all).
I was in fact associating "sophisticated insiders" with actually having authorized access to model weights, and I'm not sure (even after asking around) why this is worded the way it is.
Ok, cool. This then of course makes my top-level post very relevant again, since I think a large number of datacenter executives and other people high up at Google seem likely able to exfiltrate model weights without too much of an issue (I am not that confident of this, but it's my best guess from having thought about the topic for a few dozen hours now, and having decent familiarity with cybersecurity considerations).
This, I think, puts Anthropic in violation of its RSP, even given your clarifications, since you have now clarified that those people would not be considered "sophisticated insiders" and so are not exempt (with some uncertainties I go into at the end of this comment).
How many people have authorized access to a given piece of sensitive info can vary enormously (making this # no bigger than necessary is among the challenges of cybersecurity), and people can have authorized access to things that they are nevertheless not able to exfiltrate for usage elsewhere.
Sorry, I should have phrased things more clearly here. Let me try again:
I am describing a cybersecurity attack surface. Of course for the purpose of those attacks, we can assume that the attacker is willing to do things they are not authorized to do. People being willing to commit cybercrimes is one of the basic premises of the cybersecurity section of the RSP.
I am describing an attack vector here where anyone who has physical access to those systems is likely capable of exfiltrating model weights, at least as long as they get some amount of executive buy-in to circumvent supervision. Most of the people who fit this description are extremely unlikely to have authorized access to model weights. As such, it is unclear what the relevance of people having "authorized access" is.
I also am honestly surprised that anyone at Google or Amazon or Microsoft is considered to have authorized access to weights. That itself is an update for me! I would have assumed nobody at those companies was allowed to look at the weights, or make inferences about e.g. the underlying architecture, etc.
I am now thinking you believe something like this: yes, there are many people with physical access, but even if they succeeded at exfiltrating the weights, then realistically, for them to do anything with the weights, word would reach the highest executive levels at these tech companies, and the highest executive levels at these tech companies all have authorized access to model weights. I.e. Satya or Demis currently have authorized access to model weights (not because you want to give them authorized access, but just because giving them authorized access is a necessity of using their compute infrastructure), and as such are considered sophisticated insiders.
Honestly, I find the idea of considering Satya or Demis "sophisticated insiders with authorized access to model weights" very confusing. Like, at least taken on its own that has pretty big implications about race dynamics and technical diffusion between frontier labs, since apparently Anthropic wouldn't consider it a security incident if Satya or Demis were to download Claude weights to a personal computer of theirs and reverse-engineer architectural details from them (since I am inferring them both to have "authorized access").
My guess is reality is complicated in a bunch of messy ways that are hard to capture with the RSP here. I do appreciate you taking the time to clarify things.
Once they exist, we obviously can't just take them back.
Why... can't we take them back? I don't think you should kill them, but regression to the mean seems like it takes care of most of the effects in one generation.
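(As a toy version of the shrinkage I have in mind, assuming the boost is roughly additive and the engineered people have children with unmodified partners: a child inherits about half of the engineered alleles, so

$$E[\text{child's boost}] \approx \tfrac{1}{2}\Delta, \qquad E[\text{grandchild's boost}] \approx \tfrac{1}{4}\Delta, \ldots$$

where $\Delta$ is the original engineered gain. The numbers here are illustrative, not a claim about any particular intervention.)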
how worried are you about unintended side effects of genetic interventions on personality?
A reasonable amount. Like, much less than I am worried about AI systems being misaligned, but still some. In my ideal world, humanity would ramp up to something like +10 IQ points of selection per generation, and so would get to see a lot of evidence about how things play out here.
I wish the interface were faster to update, closer to 100ms than 1s, but this isn't a big deal. I can believe it's hard to speed up code that integrates these differential equations many times per user interaction.
Yeah, currently the rollout takes around a second, and it happens on a remote Python server, because JS isn't great for this kind of work. We already tried pretty hard to make it faster. I am sure there are ways to get it to be actually fully responsive, but it would likely take many engineering hours.
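For concreteness, the shape of the work is roughly this (a minimal sketch only; the actual model, parameters, and server code are different, and scipy here is just an assumption about tooling):

```python
# Illustrative only: toy dynamics (logistic growth) standing in for the real model.
import numpy as np
from scipy.integrate import solve_ivp

def rollout(r: float, K: float, y0: float, t_end: float = 50.0, steps: int = 500):
    """Integrate the ODE system forward and return the sampled trajectory."""
    def dydt(t, y):
        return r * y * (1.0 - y / K)

    t_eval = np.linspace(0.0, t_end, steps)
    sol = solve_ivp(dydt, (0.0, t_end), [y0], t_eval=t_eval)
    return sol.t, sol.y[0]

# Each slider change on the client triggers one such call over the network,
# so perceived latency is roughly (round trip) + (integration time).
t, y = rollout(r=0.3, K=100.0, y0=1.0)
```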
Said Achmiz was one person who claimed as much, and he tended to get a bunch of agreement votes when he said so, so presumably some people agreed with him.
Comments are mostly indexed on Google via the post pages themselves, not the commentId permalinks. Indeed, the commentId permalinks confused a bunch of the crawlers, because they resulted in lots of duplicate copies of the post text, and text from other comments, showing up in the search results.