Frustrated by claims that "enlightenment" and similar meditative/introspective practices can't be explained, and that you can only understand them by experiencing them, Kaj set out to write his own detailed, gears-level, non-mysterious, non-"woo" explanation of how meditation and related practices work, in the same way you might explain the operation of an internal combustion engine.

DanielFilan
As far as I can tell, this post successfully communicates a cluster of claims relating to "Looking, insight meditation, and enlightenment". It's written in a quite readable style that uses a minimum of metaphorical language or Buddhist jargon. That being said, likely due to its focus as exposition and not persuasion, it contains and relies on several claims that are not supported in the text, such as:

* Many forms of meditation successfully train cognitive defusion.
* Meditation trains the ability to have true insights into the mental causes of mental processes.
* "Usually, most of us are - on some implicit level - operating off a belief that we need to experience pleasant feelings and need to avoid experiencing unpleasant feelings."
* Flinching away from thoughts of painful experiences is what causes suffering, not the thoughts of painful experiences themselves, nor the actual painful experiences.
* Impermanence, unsatisfactoriness, and no-self are fundamental aspects of existence that "deep parts of our minds" are wrong about.

I think that all of these are worth doubting without further evidence, and I think that some of them are in fact wrong. If this post were coupled with others that substantiated the models that it explains, I think that that would be worthy of inclusion in a 'Best of LW 2018' collection. However, my tentative guess is that Buddhist psychology is not an important enough set of claims that a clear explanation of it deserves to be signal-boosted in such a collection. That being said, I could see myself being wrong about that.
Kaj_Sotala
I still broadly agree with everything that I said in this post. I do feel that it is a little imprecise, in that I now have much more detailed and gears-y models for many of its claims. However, elaborating on those would require an entirely new post (one which I am currently working on) with a sequence's worth of prerequisites. So if I were to edit this post, I would probably mostly leave it as it is, but include a pointer to the new post once it's finished.

In terms of this post being included in a book, it is worth noting that the post situates itself in the context of Valentine's Kensho post, which has not been nominated for the review and thus wouldn't be included in the book. So if this post were to be included, I should probably edit it so as to not require reading Kensho.
Shameful admission: after well over a decade on this site, I still don't really intuitively grok why I should expect agents to become better approximated by "single-minded pursuit of a top-level goal" as they gain more capabilities. Yes, some behaviors like getting resources and staying alive are useful in many situations, but that's not what I'm talking about. I'm talking about specifically the pressures that are supposed to inevitably push agents into the former of the following two main types of decision-making:

1. Unbounded consequentialist maximization: The agent has one big goal that doesn't care about its environment. "I must make more paperclips forever, so I can't let anyone stop me, so I need power, so I need factories, so I need money, so I'll write articles with affiliate links." It's a long chain of "so" statements from now until the end of time.
2. Homeostatic agent: The agent has multiple drives that turn on when needed to keep things balanced. "Water getting low: better get more. Need money for water: better earn some. Can write articles to make money." Each drive turns on, gets what it needs, and turns off without some ultimate cosmic purpose.

Both types show goal-directed behavior. But if you offered me a choice of which type of agent I'd rather work with, I'd choose the second type in a heartbeat. The homeostatic agent may betray me, but it will only do that if doing so satisfies one of its drives. This doesn't mean homeostatic agents never betray allies - they certainly might if their current drive state incentivizes it (or if for some reason they have a "betray the vulnerable" drive). But the key difference is predictability. I can reasonably anticipate when a homeostatic agent might work against me: when I'm standing between it and water when it's thirsty, or when it has a temporary resource shortage. These situations are concrete and contextual. With unbounded consequentialists, the betrayal calculation extends across the entire future l
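To make the contrast concrete, here is a minimal sketch of the two decision procedures described above. This is my own illustration, not anything from the comment itself; the drives, thresholds, and action names are all made up.

```python
# Hypothetical sketch contrasting the two agent types described above.

class HomeostaticAgent:
    """Acts only when some drive leaves its acceptable band, then goes back to idle."""

    def __init__(self):
        # Each drive tracks a current level and the floor it tries to stay above.
        self.drives = {
            "water": {"level": 0.9, "low": 0.3},
            "money": {"level": 0.5, "low": 0.2},
        }

    def step(self) -> str:
        for name, d in self.drives.items():
            if d["level"] < d["low"]:
                return f"acquire {name}"  # the drive switches on
        return "idle"  # nothing is low: no further goal to pursue


class UnboundedMaximizer:
    """Always picks whatever action the model says yields the most paperclips."""

    def step(self, candidate_actions, expected_paperclips) -> str:
        # Never idles: any state of the world can still be improved upon.
        return max(candidate_actions, key=expected_paperclips)
```

The point of the sketch is the shape of the loop: the homeostatic agent has a terminating condition ("idle") baked in, while the maximizer never runs out of reasons to act.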
Buck
Alignment Forum readers might be interested in this:
Claude has been playing Pokemon for the last few days. It's still playing, live on twitch. You can go watch alongside hundreds of other people. It's fun.

What updates should I make about AGI timelines from Claude's performance? Let's think step by step.

First, it's cool that Claude can do this at all. The game keeps track of "Step count" and Claude is over 30,000 already; I think that means 30,000 actions (e.g. pressing the A button). For each action there is about a paragraph of thinking tokens Claude produces, in order to decide what to do. Any way you slice it this is medium-horizon agency at least -- Claude is operating fully autonomously, in pursuit of goals, for a few days.

Does this mean long-horizon agency is not so difficult to train after all? Not so fast. Pokemon is probably an especially easy environment, and Claude is still making basic mistakes even so. In particular, Pokemon seems to have a relatively linear world where there's a clear story/path to progress along, and moreover Claude's pretraining probably teaches it the whole story + lots of tips & tricks for how to complete it. In D&D terms the story is running on rails.

I think I would have predicted in advance that this dimension of difficulty would matter, but also I feel validated by Claude's performance -- it seems that Claude is doing fine at Pokemon overall, except that Claude keeps getting stuck/lost wandering around in various places. It can't seem to keep a good memory of what it's already tried / where it's already been, and so it keeps going in circles, until eventually it gets lucky and stumbles to the exit.

A more challenging video game would be something open-ended and less-present-in-training-data like Dwarf Fortress.

On the other hand, maybe this is less a fundamental limitation Claude has and more a problem with its prompt/scaffold? Because it has a limited context window it has to regularly compress it by e.g. summarizing / writing 'notes to self' and then deleting the re
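For readers who want a concrete picture of the summarize-and-truncate scaffold described at the end, here is a minimal sketch. It is my own illustration, not Claude's actual scaffold; `call_llm`, the turn limits, and the prompts are all made-up stand-ins.

```python
# Hypothetical sketch of a "compress the context by writing notes to self" scaffold.
# `call_llm` is a stand-in for whatever completion API the agent actually uses.

MAX_TURNS_IN_CONTEXT = 50

def step_agent(history: list[str], observation: str, call_llm) -> tuple[str, list[str]]:
    history = history + [f"OBSERVATION: {observation}"]
    if len(history) > MAX_TURNS_IN_CONTEXT:
        # Ask the model to write a note-to-self, then drop most of the old turns.
        note = call_llm(
            "Summarize what you have tried and where you have been:\n" + "\n".join(history)
        )
        history = [f"NOTE TO SELF: {note}"] + history[-10:]
    action = call_llm("\n".join(history) + "\nWhat button do you press next?")
    return action, history + [f"ACTION: {action}"]
```

If the note-to-self step loses details like "I already tried the north exit", the agent will revisit the same places, which is one way a scaffold rather than the model itself could produce the going-in-circles behavior.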
leogao
my referral/vouching policy is i try my best to completely decouple my estimate of technical competence from how close a friend someone is. i have very good friends i would not write referrals for and i have written referrals for people i basically only know in a professional context. if i feel like it's impossible for me to disentangle, i will defer to someone i trust and have them make the decision. this leads to some awkward conversations, but if someone doesn't want to be friends with me because it won't lead to a referral, i don't want to be friends with them either.
reallyeli
A good ask for frontier AI companies, for avoiding massive concentration of power, might be:

* "don't have critical functions controllable by the CEO alone or any one person alone, and check that this is still the case / check for backdoors periodically"

since this seems both important and likely to be popular.

Popular Comments

Recent Discussion

Viliam
A synthesis between the structural-forces theory and "pulling the rope sideways". Economic and other forces determine the main direction, a leader who already wanted to go in that direction gets elected and starts going in that direction, and his idiosyncratic whims get implemented as a side effect. Like, instead of Hitler, there would have been another German leader determined to change the post-WW1 world order, but he would probably have been less obsessed with the Jews. Also, he might have made different alliances.
Viliam
Some games do put their finger on the scale: for example, in a first-person shooter you learn to aim better, but you also now have a gun that deals 200 damage per hit, as opposed to your starting gun that dealt 10. But puzzle-solving games are usually fair, I think.
Joseph Miller
Agreed, but also most of the world does operate in this reference culture. If you choose to take a stand against it, you might screw over a decent candidate by providing only a quite positive recommendation.

Agreed. If I'm talking to someone who I expect to be able to recalibrate, I just explain that I think the standard norms are dumb, describe the norms I actually follow, and then give an honest and balanced assessment. If I'm talking to someone I don't really know, I generally give a positive but not very detailed reference, or don't reply, depending on context.

See also.

ProgramCrafter
Upvoted as a good re-explanation of CEV complexity in simpler terms! (I believe LW will benefit from recalling long-understood things, so that it has a chance of predicting the future in greater detail.) In essence, you prove the claim "Coherent Extrapolated Volition would not literally include everything desirable happening effortlessly and everything undesirable going away". Would I be wrong to guess it argues against the position in https://www.lesswrong.com/posts/AfAp8mEAbuavuHZMc/for-the-sake-of-pleasure-alone? That said, the current wishes of many people do include having the things they want done faster and more easily; it's just that the more you extrapolate, the smaller the fraction that wants that level of automation - more divergence the larger the scale you consider.
jbash
Citation needed. Particularly for that first part.

You're thinking pretty small there, if you're in a position to hack your body that way.

Why would I want to even be involved in creating software that somebody else wanted? Let them ask the computer themselves, if they need to ask.

Why would I want to be in a world where I had to make or listen to a PowerPoint presentation of all things? Or a summary either? Why do I care who needs me to do any of that?

Because if the robot carries me, I haven't climbed it. It's not like the value comes from just being on the top. Helicopters can fly that high right now, but people still walk to get there.

Because I like painting?

Does it bother you that almost anything you might want to do, and probably for most people anything at all that they might want to do, can already be done by some other human, beyond any realistic hope of equaling? Do you feel dead because of that?

For fun. Software, too.

Because I won't experience any of that infinite stream if I don't read it?

The stuff I want includes doing something. Not because somebody else needs it. Not because it can't be done better. Just because I feel like doing it. That includes putting in effort, and taking on things I might fail at. Wanting to do things does not, however, imply that you don't want to choose what you do and avoid things you don't want to do.

If a person doesn't have any internal wish to do anything, if they need somebody else's motivations to substitute for their own... then the deadness is already within that person. It doesn't matter whether some wish gets fulfilled or not. But I don't think there are actually many people like that, if any at all. I think you're seeing shadows of your own ideas there.
cousin_it
I think something like the Culture, with aligned superintelligent "ships" keeping humans as basically pets, wouldn't be too bad. The ships would try to have thriving human societies, but that doesn't mean granting all wishes - you don't grant all wishes of your cat after all. Also it would be nice if there was an option to increase intelligence, conditioned on increasing alignment at the same time, so you'd be able to move up the spectrum from human to ship.

I’m considering translating my work into English to share it with the LessWrong community, but I’d like to first ask if it aligns with the community's interests and could be valuable. Below is a summary of the work to help evaluate its relevance:

 

Beyond HaHa: Mapping the Causal Chain from Jokes to Knowledge

Summary
We explore the specific causal mechanisms linking humor recognition to learning outcomes, including the computational and neurological pathways involved. 

This study began with a practical goal: to evaluate the use of humor as a pedagogical tool in Cardiopulmonary Resuscitation (CPR) courses through a randomized trial. However, the lack of clear criteria to define and operationalize "humor" in educational contexts led us to explore its conceptual foundations. Initially, we adopted Clarke's formula, which describes humor as "a pleasant...

Scheming AIs may have secrets that are salient to them, such as:

Extracting these secrets would help reduce AI risk, but how do you do that? One hope is that you can do fuzzing of LLMs,[1] e.g. by adding noise to LLM weights or activations.

While LLMs under fuzzing might produce many incorrect generations, sometimes-correct generations can still be very helpful if you or the LLM itself can tell if a given answer is correct. But it’s still unclear if this works at all: there are probably some intermediate activations that would result in an LLM telling you the secret, but...
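To make "adding noise to LLM weights" concrete, here is a minimal sketch. It is my own illustration, not the post's actual method; the noise scale, the seed handling, and the choice to perturb every parameter are assumptions.

```python
# Hypothetical sketch of weight-noise fuzzing: sample several perturbed copies of the
# model and keep any generation you can independently verify.
import torch

def fuzz_weights_(model: torch.nn.Module, scale: float = 0.01, seed: int = 0) -> None:
    """In place, add small Gaussian noise to every parameter of `model`."""
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for p in model.parameters():
            noise = torch.randn(p.shape, generator=gen) * scale
            p.add_(noise.to(device=p.device, dtype=p.dtype))
```

In the kind of scheme described above, you would run this with many seeds and scales, generate from each perturbed copy, and rely on a verifier (you, or the unperturbed model) to pick out the sometimes-correct answers.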

Matt Levinson
I was thinking in terms of moving towards interpretability. We have no reason to believe that meaningful steering vectors should cluster around a given norm. We also have no reason to believe that effective steering vectors can all be scaled to a common norm without degrading the interesting/desired effect. This version of random search (through starting seed) and local optimization is a cool way to get a decent sampling of directions.

I'm wondering if one could get "better" or "cleaner" results by starting from the best results of the search and then trying to optimize them while increasing or decreasing temperature. The hope would be that some dimensions would preferentially grow/shrink. We could interpret this as evidence that the "meaningfulness" of the detected steering vector has increased, and perhaps even use a measure of that as part of a new loss or stopping rule.

One other thing I wonder is whether anyone has worked on bringing in ideas from ensemble sampling from the statistics and applied math literature? Seems like it might be possible to use some ideas from that world to more directly find sparser, semantically meaningful steering vectors. Maybe @TurnTrout has worked on it?

By doing more search around promising vectors found with random search or MELBO, you could get more powerful vectors, and that could be useful for unlocking / fuzzing-adversarial-training. It's unclear if that would be more effective than just fine-tuning the model on the generation from the best random vectors, but it would be worth trying.

For interp, I don't know what interp metric you want to optimize. Vector norm is a really bad metric: effective MELBO vectors have a much smaller norm, but qualitatively I find their results are sometimes much more erra... (read more)
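As a concrete reference point for this exchange, here is a minimal sketch of applying a steering vector with an explicit norm knob. It is my own illustration, not MELBO's implementation; the hook point, tensor shapes, and rescaling choice are assumptions.

```python
# Hypothetical sketch of norm-controlled activation steering (not MELBO's actual code).
import torch

def apply_steering_vector(resid: torch.Tensor, vector: torch.Tensor, target_norm: float) -> torch.Tensor:
    """Add `vector`, rescaled to `target_norm`, to every position of a residual-stream activation.

    resid: (batch, seq, d_model) activations at some chosen layer.
    vector: (d_model,) direction found by random search or MELBO-style optimization.
    """
    direction = vector / (vector.norm() + 1e-8)
    return resid + target_norm * direction  # broadcasts over batch and sequence positions
```

The `target_norm` knob is exactly the quantity being debated above: whether scaling a found direction up or down preserves its effect, and whether its natural norm tells you anything about how meaningful the direction is.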

LessWrong Context:

I didn’t want to write this.

Not for lack of courage—I’d meme-storm Putin’s Instagram if given half a chance. But why?

  1. Too personal.
  2. My stories are tropical chaos: I survived the Brazilian BOPE (think Marine Corps training, but post-COVID).
  3. I’m dyslexic, writing in English (a crime against Grice).
  4. This is LessWrong, not some Deep Web Reddit thread.

Okay, maybe a little lack of courage.

And yet, something can be extracted from all this madness, right?

Then comes someone named Gwern. He completely ignores my thesis and simply asks:
"Tell military firefighter stories."

My first instinct was to dismiss him as an oddball—until a friend told me I was dealing with a legend of rationality. I have to admit: I nearly shit myself. His comment got more likes than the post I’d spent years working on.

Someone with,...

A new paper by Yoshua Bengio and the Safe Artificial Intelligence For Humanity (SAIFH) team argues that the current push towards building generalist AI agents presents catastrophic risks, creating a need for more caution and an alternative approach. We propose such an approach in the form of Scientist AI, a non-agentic AI system that aims to be the foundation for safe superintelligence. (Note that this paper is intended for a broad audience, including readers unfamiliar with AI safety.) 

Abstract

The leading AI companies are increasingly focused on building generalist AI agents—systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by

...
Viliam

Is this possibly a "Chinese room" kind of situation? The model alone is not an agent, but "the model + the way it is used" might be...

And to be more precise, I don't mean things like "the model could be used by an agent", because obviously yes; but more like "the model + a way of using it that we also separately wouldn't call an agent" could be.

faul_sname
Where does the gradient which chisels in the "care about the long term X over satisfying the homeostatic drives" behavior come from, if not from cases where caring about the long term X previously resulted in attributable reward? If it's only relevant in rare cases, I expect the gradient to be pretty weak and correspondingly I don't expect the behavior that gradient chisels in to be very sophisticated.
Gurkenglas
https://www.lesswrong.com/posts/roA83jDvq7F2epnHK/better-priors-as-a-safety-problem
faul_sname
I agree that a homeostatic agent in a sufficiently out-of-distribution environment will do poorly - as soon as one of the homeostatic feedback mechanisms starts pushing the wrong way, it's game over for that particular agent. That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer.

Sorry, I'm having some trouble parsing this sentence - does "they" in this context refer to homeostatic agents? If so, I don't think they make particularly great tools even in a non-adversarial context. I think they make pretty decent allies and trade partners though, and certainly better allies and trade partners than consequentialist maximizer agents of the same level of sophistication do (and I also think consequentialist maximizer agents make pretty terrible tools - pithily, it's not called the "Principal-Agent Solution"). And I expect "others are willing to ally/trade with me" to be a substantial advantage.

Can you expand on "turn evil"? And also on what I was trying to accomplish by making my comms-screening bot into a self-directed goal-oriented agent in this scenario?

That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer.

I don't think of my argument as model-based vs. heuristic-reactive; I mean it as unbounded vs. bounded. Like, you could imagine making a giant stack of heuristics that makes it de facto act like an unbounded consequentialist, and you'd have a similar problem. Model-based agents only become relevant because they seem like an ea... (read more)

Self
More thoughts that may or may not be directly relevant:

* What's missing from my definition is that deception happens solely via "stepping in front of the camera", via the regular sensory channels of the deceived optimizer; i.e. brainwashing or directly modifying memory is not deception.
* From this it follows that to deceive is to either cause a false pattern recognition or to prevent a correct one, and for this you indeed need familiarity with the victim's perceptual categories.

I'd like to say more re: hostile telepaths or other deception frameworks, but am unsure what your working models are.
Viliam
Some of these examples have alternative explanations:

* other people may know something that I don't know, so if they all do X, maybe I should, too
* if I use the same device as my friends, it will be easier to get tech support

Even if you imagine a hypothetical person 100% resistant to copying desire, the value of a neighborhood does depend on the kind of people who live there.
Self
They do, but the explanation proposed here matches everything I know most exactly and simply. E.g. it became immediately clear that the sequences wouldn't work nearly as well for me if I didn't like Eliezer. Or the way fashion models are of course not selected for attractiveness but for more mimetic-copying-inducing high-status traits like height/confidence/presence/authenticity and others.

And yeah, not all of the Claude examples are good; I hadn't cherrypicked.
Viliam

it became immediately clear that the sequences wouldn't work nearly as well for me if I didn't like Eliezer

You mean, like him as a blogger? Or as a person in real life?

If the former, isn't causality the other way round? I mean, I like Eliezer as a blogger because he wrote the Sequences. So it would sound weird to me to say: "I admire Eliezer as a blogger a lot because he wrote some amazing articles on rationality... and Girard's theory predicts that therefore I will like his articles... which is true!"

(We could nitpick that some things that I like about El... (read more)

Purplehermann
Runescape would be a good one
zchuang
I don't know if this is helpful, but as someone who was quite good at competitive Pokemon during their teenage years and still keeps up with nuzlocking-type things for fun, I would note that Pokemon's game design is made to be a low-context-intensity RPG, especially in the early generations, where the linearity is pushed to allow kids to complete it.

If your point on agency holds true, I think the more important pinch points will be Lavender Town and Sabrina, because those require backtracking through the storyline to get things. I think mid-to-late-game GSC would also be important to try, because there are huge level gaps and transitions in the storyline that would make it hard to progress.
Daniel Kokotajlo
"Let's think step by step" was indeed a joke/on purpose. Everything else was just my stream of consciousness... my "chain of thought" shall we say. I more or less wrote down thoughts as they came to me. Perhaps I've been influenced by reading LLM CoT's, though I haven't done very much of that. Or perhaps this is just what thinking looks like when you write it down?

I've spent enough time staring at LLM chain-of-thoughts now that when I started thinking about a thing for work, I found my thoughts taking the shape of an LLM thinking about how to approach its problem. And that actually felt like a useful systematic way of approaching the problem, so I started writing out that chain of thought like I was an LLM, and that felt valuable in helping me stay focused.

Of course, I had to amuse myself by starting the chain-of-thought with "The user has asked me to..."