Fine, but it still seems like a reason one could give for death being net good (which is your chief criterion for being a deathist).
I do think it's a weaker reason than the second one. The following argument is mainly for fun:
I have a slight feeling that it's like that decision theory problem where the devil offers you pieces of a poisoned apple one by one: first a half, then a quarter, then an eighth, then a sixteenth... You'll be fine unless you eat the whole apple, in which case you'll be poisoned. Each time you're offered a piece it's rational to tak...
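(Spelling out the arithmetic behind that puzzle: the offered pieces form the geometric series

$$\tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16} + \dots = \sum_{n=1}^{\infty} 2^{-n} = 1,$$

so the policy "accept every piece you're offered" eats exactly the whole apple in the limit, even though no single acceptance by itself crosses the line.)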
You're absolutely right, good job! I fixed the OP.
We still need more funding to be able to run another edition. Our fundraiser has raised $6k so far, and will end on February 1st if it doesn't reach the $15k minimum. We need proactive donors.
If we don't get funded this time, there is a good chance we will move on to different work in AI Safety and new commitments. This would make it much harder to reassemble the team to run future AISCs, even if the funding situation improves.
You can take a look at the track record section and see if it's worth it:
- ≥ 10 organisations started by alumni
You can also read more about our plans there.
If you prefer to donate anonymously, this is possible on Manifund.
If you're a large donor (>$15k), we're open to letting you choose what to fund.
I vouch for Robert as a good replacement for me.
Hopefully there is enough funding to onboard a third person for the next camp. Running AISC at the current scale is a three-person job. But I need to take a break from organising.
In a private discussion related to our fundraiser, it was pointed out that AISC hasn't made it clear enough what our theory of change is. Hence this post.
Some caveats/context:
I think that AISC's theory of change has a number of components/mechanisms,...
This comment has two disagree votes, which I interpret as other people being able to see the flowchart. I can see it too. If it still doesn't work for you for some reason, you can also see it here: AISC ToC Graph - Google Drawings
A common failure of optimizers is Edge Instantiation. An optimizer often finds a weird or extreme solution to a problem when the optimization objective is imperfectly specified. For the purposes of this post, this is basically the same phenomenon as Goodhart’s Law, especially Extremal and Causal Goodhart. With advanced AI, we are worried about plans created by optimizing over predicted consequences of the plan, potentially achieving the goal in an unexpected way.
In this post, I want to draw an analogy between Goodharting (in the sense of finding extreme weird solutions) and overfitting (in the ML sense of finding a weird solution that fits the training data but doesn’t generalize). I believe techniques used to address overfitting are also useful for addressing Goodharting.[1]
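As a toy illustration of the kind of check this analogy suggests (just a sketch; the "held-out" probe below is a made-up stand-in for whatever independent signal about the intended goal one actually has):

```python
# Toy sketch: gradient-ascend a linear proxy objective while tracking an
# independently-noised copy of the intended goal as a "validation" signal.
# A widening proxy/held-out gap plays the role of a train/validation gap.
import numpy as np

rng = np.random.default_rng(0)
d = 50
goal = rng.normal(size=d)            # the intended (but unobserved) goal direction
proxy = goal + rng.normal(size=d)    # imperfectly specified objective we optimize
heldout = goal + rng.normal(size=d)  # independent, equally imperfect probe of the goal

x = np.zeros(d)
for step in range(1, 501):
    x = x + 0.01 * proxy             # gradient ascent on the proxy objective
    if step % 100 == 0:
        gap = proxy @ x - heldout @ x
        print(f"step {step:3d}  proxy={proxy @ x:8.1f}  held-out={heldout @ x:8.1f}  gap={gap:7.1f}")

# If the gap keeps widening as optimization pressure increases, flag the
# result as potentially Goodharted rather than genuinely better.
```

The point is not the specific numbers but the shape of the check: compare the score the optimizer was pushed on against an independent estimate of what we actually wanted, and treat a growing discrepancy the way we'd treat a growing train/validation gap.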
In particular, I want to focus on detecting Goodharting. The way...
This makes a lot of sense to me. For some reason it reminds me of some of Stuart Armstrong's OOD-generalization work on alternative safeguarding strategies for imperfect value extrapolation? I can't find a good link though.
I also thought it would be interesting to mention the link to the idea in linguistics that a word is specified by all the different contexts it appears in, and so a symbol is a probability distribution over contextual meanings. From the perspective of this post, wouldn't natural language then work a bit like a redundancy specifier, making LLMs more alignable than RL agents? (I don't think I'm making a novel argument here, I just thought it would be interesting to point out.)
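(To make the linguistics point concrete, here is a toy sketch of "a symbol as a distribution over its contexts"; the corpus and window size are of course made up:)

```python
# Toy sketch of distributional semantics: represent a word by the empirical
# distribution over words that co-occur with it in a small window.
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

def context_distribution(target, window=1):
    """Empirical distribution over words appearing within `window` tokens of `target`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for t in tokens[lo:hi] if t != target)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(context_distribution("cat"))  # {'the': 0.5, 'sat': 0.25, 'chased': 0.25}
```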
What is the threat model you are probing for here?
The prompting approach suffers from LLMs' shallowness in strategic thinking; I would not be surprised if prompting a language model several times to come up with an encoding scheme resulted in a highly concentrated distribution.
> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ‘recovering the plaintext message’
While I agree in principle, as in all red/blue team interactions, you may not want your blue team to be trigger-happy; you still want to be able t...
Yes, my conclusion is "it's hopeless".
(NLP assumes that you can reverse-engineer someone's thought processes by observing their eye movements: looking in one direction means "the person is trying to remember something they saw", looking in another direction means "the person is trying to listen to their inner voice", etc.; you get five or six categories. And when you listen to people talking, their choice of words tells you whether they are a "visual", "auditory", or "kinesthetic" type. So if you put these two things together, you get ...
This is an article in the featured articles series from AISafety.info. AISafety.info writes AI safety intro content. We'd appreciate any feedback.
The most up-to-date version of this article is on our website, along with 300+ other articles on AI existential safety.
These terms are all related attempts to define AI capability milestones — roughly, "the point at which artificial intelligence becomes truly intelligent" — but with different meanings:
OK, I guess I got some assumption wrong, but please explain to me which one.