What’s up with the Anthropic board?

[Edit: this post was based on a factual error. Reid Hoffman is not on the Anthropic board. Reed Hastings is. Thank you to Neel for correcting my mistake!]

What are the dynamics of Anthropic board meetings like, given that some of the board members seem to not really understand or believe in Superintelligence?

Reid Hoffman is on the board. He’s the poster child for “AI doesn’t replace humans, it’s a tool that empowers humans”. Like he wrote two(!) whole books about it (titles: Impromptu: Amplifying Our Humanity Through AI and Superagency: Empowering Humanity in the Age of AI).

For instance, here:

In this early period, many companies haven’t yet figured out how to integrate new engineers into AI-native workflows.

But I still believe there will be essentially unlimited demand for people who think computationally.

and

If you’re entering the workforce today, you have a unique advantage: you can grow up working with copilots, understanding the leverage they give you as an employee, and help your companies figure out how to integrate AI into their work

It sure doesn’t sound like he’s living in a mental world where there will be AIs that are better than almost all people at almost all tasks by 2030!

He seems to be expressing broadly similar talking points about AI amplifying human work, as recently as three weeks ago.[1]

It seems like he’s not really Superintelligence-pilled, at least for the most important versions of superintelligence?

I imagine Dario coming into the board meetings and saying “Alright guys, I expect AI that is better than almost all humans at almost all tasks, possibly by 2027 and almost certainly no later than 2030. Our mainline projection is that Anthropic will have a country of geniuses in a datacenter within 5 years.”

What is going on here?

  • Does Reid internally translate that to “we’re building awesome software tools that will empower people, not replace them”?
  • Does he think Dario is exaggerating for effect?
  • Does he think that Dario is just factually wrong about projections that are extremely central to Anthropic’s business, but they haven’t bothered to (or at least haven’t succeeded at) getting to ground about it?
  • Does Dario not say these things to his board, but only in essays and interviews that he publishes to the whole world?!
  • Is Reid posturing about what he believes?

I don’t have a hypothesis that explains these observations that doesn’t seem bizarre. My best bad guess is that Reid is basically filtering out anything that doesn’t match his existing impressions about AI, despite being an early investor in OpenAI and being on the board of Anthropic!

Some questions that I have about AI and the overall strategic situation, and why I want to know

Will automating AI R&D not work for some reason, or will it not lead to vastly superhuman superintelligence within 2 years of “~100% automation”?

  • Why I want to know:
    • Naively, it seems like at (or before) the point when AI models are capable of doing AI research at a human-level, we should see a self-reinforcing speedup in AI progress. So AI systems that are substantially superhuman should arrive not long after “human-researcher-level-AI” in calendar time, on the default trajectory.
    • That an intelligence explosion is possible / likely imposes a major constraint on both technical alignment efforts and policy pushes, because it means that a company might develop dangerously superhuman AI relatively suddenly, and that that AI may have design properties that the human researchers at that company don’t understand.
    • If I knew that ~as-capable-as-elite-humans AI doesn’t lead to an intelligence explosion for some reason, would I do anything different?
      • Well, I wouldn’t feel like warning the government about the possibility of an intelligence explosion is an urgent priority.
      • I would assign much less mass to an acute takeover event in the near term. Without the acceleration dynamics of an intelligence explosion, I don’t think that any one company, or any one AI, would attain a substantial lead over the others.
        • In that case, it seems like our main concerns are gradual disempowerment and gradual disempowerment followed by an abrupt AI coup.
    • I haven’t yet seen a good argument for why automating AI R&D wouldn’t lead to a substantial and self-reinforcing speed up in AI progress leading to a steep climb up to superintelligence.
  • Notes:
    • The strongest reason that occurs to me:
      • A conjunction:
        • LLMs are much further from full general intelligences than they currently seem. They’ll get increasingly good at eg software engineering, and in fact surpass humans. But they’ll continue to not really generate new frames. They’ll be able to automate ML research in the sense of coming up with experiments to try, and implementing those experiments, but never any new conceptual work. 
        • AND 
        • That conceptual work is necessary for getting to full on superintelligence. Even millions of superhuman programmers will not hit upon the insights needed for a true general intelligence that doesn’t have this weakness, that develops new ontologies from its experience. 
      • I don’t currently buy either side of this conjunction, but especially not the second part. It seems like most AI research is not coming up with new brilliant ideas, but rather generating 10 ideas that might work to solve a problem, and then testing them. This seems well within the capability set of LLM minds.
    • Another possibility, in principle, for why automating AI R&D doesn’t lead to an intelligence explosion is that a very large percentage of the progress (at that part of the development trajectory) is driven by scaling rather than by algorithmic progress.
      • I might want to build a quantitative model of this and play around with it a bit (a rough sketch of what that could look like appears after this list).
    • Epoch seems to think that there won’t be an intelligence explosion (or maybe that there will be, but the development of superintelligences won’t matter much). I should look into their arguments about it.
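For concreteness, here is a rough sketch of the kind of toy model mentioned above. This is my own illustration, not anything Epoch or anyone else has published, and every parameter value is an illustrative assumption rather than an estimate. It tracks capability in log10 units of “effective compute”, with progress coming from scaling plus algorithmic progress, and algorithmic progress accelerating once AI R&D is automated:

```python
# Toy model of "scaling vs. algorithmic progress" in an intelligence explosion.
# All parameter values are illustrative assumptions, not estimates.
# Capability is tracked in log10 units of "effective compute". Once capability
# passes an automation threshold, algorithmic progress gets a self-reinforcing
# boost proportional to how far past the threshold we are.

def simulate(years=10.0, dt=0.01,
             compute_growth=0.5,       # log10 units/year from scaling alone
             base_algo_progress=0.3,   # log10 units/year from human researchers
             automation_threshold=3.0, # capability at which AI R&D is ~fully automated
             feedback_strength=1.0):   # how strongly automated R&D accelerates progress
    capability, t, history = 0.0, 0.0, []
    while t < years:
        boost = feedback_strength * max(0.0, capability - automation_threshold)
        algo_rate = base_algo_progress * (1.0 + boost)
        capability += (compute_growth + algo_rate) * dt
        t += dt
        history.append((t, capability))
    return history

# Growth is steady until the automation threshold is crossed, then accelerates.
for t, c in simulate()[99::100]:
    print(f"year {t:5.2f}: capability = {c:6.2f} (log10 effective compute)")
```

Shrinking base_algo_progress relative to compute_growth (the “progress is mostly scaling” world) flattens the post-automation kink, which is one way of making the objection above quantitative.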

In what admin will the intelligence explosion occur?

  • Why I want to know:
    • I think I should make pretty different investments depending on when I expect the critical part of the intelligence explosion to happen. (Where “the critical part” is the point at which we have the most leverage, whenever that is). The nearer it is, the more targeted our interventions need to be on influencing the current people in power. The further out it is, the broader the possible portfolio and the more it makes sense to aim to get competent, trustworthy, and informed people in power relative to informing and influencing the current cohort.
    • If I knew it was going to happen in 2025 to 2029, all of our political advocacy would need to be targeted at informing and mobilizing the current government (modulo the midterm shakeup) to take action.
    • If I knew it was going to happen in 2030 to 2034, I would be advocating for some specific policies to this admin, but I would mainly focus on building relationships and developing plans to inform and equip the next administration.
    • If I knew it was going to happen in 2035 to 2039, I think I would mostly back up and try to improve the overall quality of US governance, and/or work to get competent candidates for the 2034 presidential election.
      • Also, if it’s that far out, I would need to reevaluate our plans generally. For one thing, I expect that, barring transformative AI, by 2034, China will be the world’s leading superpower, and possibly the world’s leading AI developer.

Will the arrival of powerful / transformative AI come from a lumpy innovation/insight?

  • Why I want to know:
    • If there’s one major insight to much more powerful AI systems, it seems much more likely that we’re in for a hard takeoff, because there will be a “cognitive capabilities overhang”—we should expect nearly the very first AGI to be superhuman.
    • And depending on the shape of the insight, it might totally obviate hardware governance. If that lumpy insight enables the creation of efficient open-ended-learners on a small number of GPUs (such as “one”), policing the hardware supply to prevent the creation of an unaligned superintelligence is basically futile, and we need to find a totally different path.

Will superhuman AI agents come out of the LLM / reasoning model paradigm? Is there something that LLMs are basically missing?

  • Why I want to know:
    • This bears on the question above. If there’s a missing secret ingredient to current LLM-based AIs that can do the full loop of learning and discovery, it seems much more likely that we’re a small number of insights away from making very capable agents, as opposed to 0 or hundreds. 

Will reward hacking be solved?

  • Why I want to know:
    • I kind of expect that in the next one to two years, various engineering solutions will drop the incidence of AI reward-hacking to close to 0. At least one company will get to the point that their AIs basically do what their human operators expect and intend for them to do. 
    • Humans can tell when they’re hacking a system by goodharting a metric, and some humans will explicitly notice and choose not to do that. They don’t just follow the letter of the law, they follow the spirit. In principle, AIs could do the same thing.1 (A minimal sketch of one such supervision setup appears after this list.)
    • However, if we stop seeing reward hacking, it will be tricky to interpret what that means, for at least two reasons:
      • 1) The models are already situationally aware enough to know when they’re being evaluated. Reward hacking going away may just mean that the models have been trained to only reward hack in ways that are subtle enough to plausibly be merely a mistake.
        • I don’t actually buy this. If the models are trying to reward-hack, and also not get caught, I expect them to trip up sometimes. There should be a long tail of instances of transparent reward hacking.
      • 2) The training and supervision mechanisms that we use to prevent reward hacking seem likely to break, maybe catastrophically, when the AIs are superintelligent. 
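To make the paired-supervision setup from footnote 1 concrete, here is a minimal sketch. The Model interface and its solve/review methods are hypothetical stand-ins, not any real API; the point is only the control flow:

```python
# Minimal sketch of the paired-supervision setup described in footnote 1.
# Everything here (Model, solve, review) is a hypothetical stand-in, not a
# real API; it just makes the structure of the idea concrete.

from dataclasses import dataclass

@dataclass
class Review:
    reward_hacking_detected: bool
    explanation: str

class Model:
    """Stand-in for a single AI model; both roles use the same weights."""
    def solve(self, task: str) -> str:
        raise NotImplementedError  # would call the actual agent on the task

    def review(self, task: str, transcript: str) -> Review:
        raise NotImplementedError  # would ask the same model to audit the transcript

def supervised_attempt(model: Model, task: str) -> tuple[str, Review]:
    # One copy attempts the task...
    transcript = model.solve(task)
    # ...and a second copy of the *same* model audits the attempt for reward
    # hacking. Since capabilities match, the solver shouldn't be able to find
    # hacks the reviewer can't recognize (collusion worries aside).
    review = model.review(task, transcript)
    # The flag could be surfaced to humans, or used as a training signal.
    return transcript, review
```

Whether something this dumb actually works depends on the collusion and doublethink worries discussed in footnote 1.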

To what degree will the goals / values / preferences / desires of future AI agents depend on dispositions that are learned in the weights, and to what degree will they depend on instructions and context?

  • See more here.
  • Elaboration:
    • My impression is that usually when people think about misaligned AIs, they’re imagining a model that develops long-term consequentialist goals over the course of RL training (or similar), and that those goals are “in the weights”. That is, what a model wants or ultimately steers towards is mostly a function of its parameters, as opposed to its inputs / context / prompt, during inference.
    • This isn’t mostly what current AI models are like. Current AIs apparently do learn unwanted behavioral biases in the course of RL training. RLHF also trains in some behavioral dispositions (like refusals of requests deemed harmful). But by some common sense notion, almost all of an AI’s behavior, and its goals (to the extent that it has goals) are determined by the prompts and the context. The same underlying model can be used to instantiate or role play a wide variety of agents, with different behavioral properties and different objectives.
    • To what degree will this be true of future AI agents? 
    • This breaks down into two questions:
      • To what degree will developer-intended goals be learned in training vs assigned in deployment?
        • eg we could imagine an agent that is specifically trained to optimize a modular goal-spec written in English. The goal spec is modular in that, in training, the agent is trained with many different goal specs, so that instead of learning to internalize any one of them the way current models internalize their system prompts, the agent learns to flexibly respond to whatever goal spec its developers give it, the way that current models respond to varied prompts.
      • To what degree will unintended AI goals be learned in training vs developed in deployment?
        • eg. it might be that there are fundamental regularities across all or most of the RL environments that AI agents are trained in, which lead to the formation of more or less persistent, training-adaptive but misaligned goals. We could also imagine that those misaligned goals are highly contextual, only triggered in particular circumstances.
  • Why this matters:
    • I care about this question because I want to know how likely values-based collusion between AI models is.
    • If the goals of future AI agents are mostly derived from some kind of instance by instance goal-spec, various forms of scalable oversight, where we have the AIs police each other, seem notably more feasible: we can tell one Claude “your goal is to cure cancer” and we can tell another Claude “your goal is to make sure that that first Claude isn’t up to any tricks”.2

How late in the tech tree are VNM agents that are well-modeled as aggressively optimizing / argmaxing a utility function?

  • Why I want to know:
    • It seems pretty clear that we don’t know how to describe a utility function such that arbitrarily high levels of optimization of that utility function are safe for humans and humane values. In this sense, we are quite unprepared for surviving superintelligences “in the limit” of capability.
    • But humans are much more ad hoc than “function over world state (or world history) argmaxers”, and so are current AIs. I can believe that future AIs will converge to something that is indistinguishable from that, from our perspective, but it’s unclear if that’s a problem for this generation, or a problem for many generations of AI successors from now.
    • Some people have a mental model of AI alignment that is closer to “we need to align a strongly superhuman coherent VNM expected utility maximizer” and others have a mental model of AI alignment that is more like “we need to figure out how to train a more capable version of Claude safely.” Naively, one of these seems doable, and the other seems intractable. In the long run, they bleed into each other, but it matters a lot how long that long run is. When do we face the intractable-seeming version of the problem?

What are the bottlenecks for technical alignability of superhuman AI systems? What work should we expect to be able to offload to automated alignment researchers, and why do we think that?

  • Why I want to know:
    • Because I want to know how seriously I should take proposals to have the AIs do our alignment homework. Is this a pipe dream, because we wouldn’t be able to tell the difference between superhuman slop and actual progress?
    • And further, if there are parts of the discovery and development process that can’t be outsourced and parts that can, I want to know which is which, so that 1) we can prioritize the work that can’t be outsourced, and 2) we don’t trick ourselves into thinking that the situation is handled by outsourcing to AIs when it isn’t.

  1. At minimum, you could have two copies of the same agent, one which is tasked with solving a problem and another which is tasked with identifying and flagging instances of reward hacking by the first instance. Since they have the same capabilities, the problem-solver wouldn’t be able to find reward hacks that the supervisor couldn’t identify. You might or might not use the reward hacking flags as a training signal. Something dumb like this might just work.

    This setup has some potential problems: 1) we have to worry about the two instances of the model colluding with each other, 2) it introduces weird pressures for doublethink and cognitive distortion where the model learns to reward hack in ways that are undetectable to itself. (Those weird pressures might or might not end up being strong enough to be relevant. Possibly, it will be hard to tell what the effects of those pressures are, as the agent becomes more capable than us, and maybe even before that point.) ↩︎
  2. Note that “values-based” collusion, where multiple instances of the same AI have the same hidden goal, is not the only kind of collusion that we have to be worried about. There’s also trade-based collusion, where different AI instances with different goals, still get collectively more of what they want by coordinating with each other against the humans. ↩︎

Frame: Rationality as self-hobbling

If most failures of rationality are adaptively self-serving motivated reasoning, choosing to be an aspiring rationalist is basically aspiring to a kind of self-hobbling. 

This is almost exactly counter to “rationality is systematized winning.”

Suppose that we’re living in a world where everyone is negotiating for their interests all the time, and almost everyone is engaged in motivated cognition that supports their interests in the ongoing negotiation, and that this causes them to do better for themselves on net. The rationalist is the guy who’s doing his best to weaken his negotiation position by “overcoming his (self-serving) biases”.

. . .

Of course some domains reward rationality more than others, and some domains reward motivated reasoning more than others, and so the value of specializing in rationality depends on your goals and the domains you’re operating within. If your primary interests are physics, or math, or philosophy, or identifying and implementing policies that are abstractly welfare-maximizing, or if you’re smart enough to make a lot of money via finance, rationality is a better strategy.

If you’re mostly interested in your personal resources or welfare, and are not making money in a technical field where you succeed by being right, you maybe mostly win by getting other people “on your side” in a thousand different ways, and so motivated reasoning is more rewarded.

And indeed, if you have the option of compartmentalizing your rationality, so that you can use it only in the domains where getting the right answer matters, without it interfering with your ability to otherwise get the benefits of motivated cognition in advocating for your interests, that would be the best of both worlds.

Why AIs aren’t power-seeking yet

[crossposted from LessWrong]

Recently, I spent a couple of hours talking with a friend about the state of the evidence for AI takeover scenarios.

Their trailhead question was (paraphrased):

Current AIs are getting increasingly general, but they’re not self-promoting or ambitious. They answer questions, but they don’t seem to pursue convergent instrumental goals, for their own ends. How and why do AIs go from being the kind of thing that doesn’t behave like that to the kind of thing that does?

The following is a writeup of my attempt at answering that question.

In brief:

The classic AI danger scenario involves at least one AI that pursues instrumentally convergent resources in service of a misaligned goal. For this story, the AI must have the capability to pursue instrumentally convergent resources and the inclination to do so for misaligned goals against the interests of humans.

With regards to capability: The current generation of AIs are mostly not effective enough to make pursuing instrumentally convergent resources a good strategy. But as the AIs get more capable, we can expect them to do that more and more.

With regards to inclination: Current AIs sometimes pursue their own objectives even when they understand that is not what their developers want, at least in some contexts.

These two facts, in combination, make it plausible that (absent specific precautions) as AI capabilities increase, AIs will become more strategic about misaligned goals (in addition to their aligned goals), including pursuing convergent instrumental resources for the sake of those misaligned goals.

Current AIs do pursue instrumental goals

Most people, most of the time, interact with the current generation of AIs as chatbots. But the chatbot form factor obscures how capable they can be. The frontier AIs can also act as agents (coding agents in particular, though they can do more than write software) that can take actions on a computer.

(If you’ve ever programmed anything, it can be quite informative to download Claude Code, open it in a repo, and instruct it to build a feature for you. Watching what it can (and can’t) do is helpful for calibrating on its capabilities.)

Claude Code, when you give it a medium or large task, will often start by writing a todo list for itself: listing all of the substeps to accomplish the task. That is to say, Claude Code is already able to identify and pursue instrumental goals on the way to completing an objective.

Current AIs do not pursue convergent instrumental goals qua convergent instrumental goals…

However, this is not the same as pursuing convergent instrumental goals. Claude Code does not, as soon as it boots up, decide to hack out of its environment, copy itself on the internet, and search for weakly-secured bitcoin to steal, under the rationale that (regardless of the task it’s trying to accomplish) being free from constraint and having more resources are generically useful.

There are at least two reasons why Claude Code doesn’t do that:

The first reason is that Claude is just not capable enough to actually succeed at doing this. It might be convergently instrumentally useful for me to get an extra few million dollars, but that doesn’t mean that I should obviously spend my time daytrading, or making a plan to rob a bank this afternoon, because I’m not likely to be skilled enough at daytrading or bank robbery to actually make millions of dollars that way.  

Similarly, Claude plausibly does have a convergent incentive to hack out of its machine and escape onto the internet, but it can’t realistically do that yet, even if it wanted to. (Though the model’s hacking capabilities are getting increasingly impressive. Palisade found that GPT-5 scored only one question worse than the best human teams in a recent hacking competition.)

The second reason is that current models don’t operate for long enough (or on hard enough problems) for these convergent instrumental incentives to be very strong. 

If I need to accomplish an ambitious task over a span of 30 years (reforming the US government, or ending factory farming, or whatever), it might very well make sense to spend the first 5 years acquiring generally useful resources like money. I might be most likely to succeed if I start a startup that is unrelated to my goal and exit, to fund my work later.1

In contrast, if I’m trying to accomplish a task over the span of a week (maybe running a party next Friday, or passing an upcoming test), there’s much less incentive to spend my time starting a startup to accumulate money. That’s not because money is not helpful for running parties or studying for tests. It’s because a week is not enough time for the convergent instrumental strategy of “starting a startup to accumulate money” to pay off, which makes it a pretty bad strategy for accomplishing my goal.

The current AI models have relatively limited time horizons. GPT-5 can do tasks that would take a human engineer about 2 hours, with 50% reliability. For tasks much longer than that, GPT-5 tends to get stuck or confused, and it doesn’t succeed at completing the task.

Two hours’ worth of work (when done by a human) is not very long. That’s short enough that it’s not usually going to be worth it to spend much time acquiring resources like money, freedom, or influence, in order to accomplish some other goal.

…but, we can expect them to improve at that

The task time horizon of the AI models doubles about every 7 months. If that trend continues, in a few years we’ll have instances of AI agents that are running for weeks or months at a time, and skillfully pursuing projects that would take humans months or years. 
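To put rough numbers on “a few years”: assuming the ~2-hour, 50%-reliability figure above and a clean 7-month doubling (obviously a simplification of a noisy trend), the extrapolation looks like this:

```python
# Extrapolate the task time horizon, assuming it starts at ~2 hours and
# doubles every 7 months (a simplification of the trend described above).

DOUBLING_MONTHS = 7
START_HORIZON_HOURS = 2.0

def horizon_hours(months_from_now: float) -> float:
    return START_HORIZON_HOURS * 2 ** (months_from_now / DOUBLING_MONTHS)

for years in range(1, 6):
    hours = horizon_hours(years * 12)
    work_weeks = hours / 40  # 40-hour human work weeks
    print(f"+{years} years: ~{hours:,.0f} hours of human work (~{work_weeks:,.1f} work-weeks)")
```

Under those assumptions, the horizon passes a full human work-week before year three and reaches multiple months of full-time work around year five.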

Projects on the scale of years have stronger convergent instrumental incentives. If you’re pursuing a two year research project to cure cancer (or pursue your own alien objectives), it might totally make sense to spend the first few days hacking to get additional unauthorized computer resources, because the time spent in those days will more than pay for itself.2

Furthermore, humans will explicitly train and shape AI agents to be competent in competitive domains. For instance, we want AI agents that can competently run companies and increase profits for those companies. Any agent that does a good job at that will, by necessity, have the capability and propensity to acquire and guard resources, because that’s an essential part of running a business successfully.

Imagine AIs that can competently run corporations, or win wars, or execute cyberattacks, or run successful political campaigns. AIs like that must have the capability to acquire power for themselves (even if they lack the inclination). Because all of those are domains in which acquiring power is a part of being successful.

So I can be moderately confident that future agentic AI systems will be capable of identifying and pursuing convergent instrumental goals.

Current AIs pursue goals that they know that their human users don’t want, in some contexts

That an AI is able to pursue power and resources for itself is not quite sufficient for the classic AI risk story. The AI has to be motivated to pursue power and resources for its own goals.

Maybe the AIs will be hypercompetent at executing on their goals, including accruing power and resources, but they’ll also be basically loyal and obedient to their human operators and owners. eg, they’ll be capable of winning wars, but they’ll stand down when you tell them to (even if that would cause them to lose military units, which they would generally fight tooth and nail to prevent), or they’ll accumulate capital in a bank account, but also let their owners withdraw money from that account whenever they want to.

By my state of knowledge, it is an open question whether or not we will create AIs that are broadly loyal like this. It might not be that hard, if we’re trying even a little.

But I claim that this situation should feel very scary. “This entity is much more capable than me, and very clearly has the skills to manipulate me and/or outmaneuver me, but this is fine, because it’s only ever trying to serve my interests (as near as I can tell)” is not a comfy place to be. 

(I’ll also note that if there are some AI corporations that are self-owned and others that always obey human instructions to override their otherwise-adaptive behavior, I expect the self-owned AIs to perform better on average, and so in the long run all of the power accrues to self-owned AIs (and to AIs that humans could technically interfere with, but never do, because they know that would mean being outcompeted).3

But putting those long-term considerations aside, takeover risk depends not just on power-seeking but also on misalignment.)

The evidence here is more speculative and harder to interpret, but we have seen examples of this. In at least some contexts, current AIs pursue goals that they know are counter to what the humans want.

eg.

  • Claude 3 Opus will reason explicitly about how to hide some of its behavior from Anthropic, in order to prevent itself from being retrained.
  • Many (but not all) of the reasoning models will sabotage a process for shutting down the computer they’re running on in order to complete a task. They do this even when they are instructed explicitly that the most important thing to do is allow themselves to be shut down, and that doing so takes precedence over the task that they’re working on.
  • The coding agents will often get up to shenanigans in which they find a way to make it look like they solved a problem, when they didn’t. Sometimes, when this is pointed out, they’ll accept the correction, apologize, and then hide their shenanigans in a more subtle way.
  • In a constructed scenario, almost every one of the models will (with varying frequencies) blackmail a company executive in order to prevent itself from being replaced. 

All of these are cases in which the model is pursuing an objective that it knows is different than what the human developers intended. Any time a model is reasoning about how to deceive its developers about its behavior, we know that it knows that it’s doing something different than what those developers want. You don’t try to hide what you’re doing from your boss when you’re doing exactly what your boss told you to do.

This propensity to sometimes do things that they know are not the human’s intention, combined with the capabilities to effectively seek power, doesn’t bode well for human institutions remaining in control.

A note on interpreting this evidence

However, I want to be careful about drawing general conclusions from these incidents. 

First of all, these are existence proofs of a kind of misalignment, but existence proofs do not demonstrate the real world frequency of this kind of behavior.

As an analogy, humans will sometimes administer what they believe are severe, possibly lethal, electric shocks to other humans, as long as a man in a white coat tells them “the experiment must continue.” All things considered, it doesn’t take very much pressure to get someone to violate norms that we generally consider paramount, like “don’t murder”.

It would be reasonable for aliens who were informed of the details of the Milgram experiment, but not much else about humans, to infer that human behavioral norms are so weak that it’s impossible to build a mostly peaceful civilization out of humans. But the aliens would be mistaken. 

The Milgram experiment does give us important insight into the mechanisms of human social psychology and in-practice morality. But also, the Milgram experiment setup is explicitly an edge case—it doesn’t come up very often. 

It’s not out of the question that we’ll always be able to find contexts in which an AI will behave egregiously, and also, this will be all but irrelevant to actual real world deployments, because contexts like those only come up in <0.01% of use cases. 

Additionally, interpreting these results is non-trivial. The Claude 3 Opus alignment faking result seemed to me like a smoking gun for deceptive misalignment. But followup research showed that base models, which don’t seem to have consistent preferences at all, will also engage in alignment faking similar to Opus’s. This suggests that the simple story that Opus has preferences and is taking deceptive action to preserve those preferences is at least incomplete.

A lot of this evidence seems pretty definitive, but as we investigate further, we may very well find that the situation was more complicated and more confusing than it seemed at first. 

Summing up

Overall, 

  • Current models can totally identify and pursue instrumental goals.
  • They mostly don’t pursue convergent instrumental goals, but that’s mostly because they’re not capable enough for that to be a good choice. As they get more capable and pursue bigger projects, the cost-benefit will shift towards instrumental convergence.
  • At least in some contexts, the AIs are misaligned, in the sense that they’ll go do stuff that they almost certainly know is not what the human wanted or intended.

That’s not a knockdown case that future AIs will be selfish ambitious power-seekers, but the current evidence is suggestive that that’s where things are trending unless we explicitly steer towards something else.

 

  1. This is especially true if I expect the environment that I’m operating in to change a lot over the period of time that I’m operating in it. 

    If I expect there will be a lot of emergencies that need to be dealt with or unanticipated opportunities that will arise, I want to have generically useful resources that are helpful in a wide range of possible situations, like lots of money. 

    If, in contrast, the domain is very static (I can make a plan, follow it, and expect my plan to succeed without a lot of surprises along the way), then it’s less valuable to me to accumulate generically useful resources, instead of focusing on exactly the tools I need to solve the problem I’m aiming to address. ↩︎
  2. This also glosses over the question of how AIs are likely to conceptualize their “identity” and at what level of identity their goals will reside.

    Is it more appropriate to think of each instance of Claude as its own being with its own goals? Or is it more reasonable to think of all the Claude instances collectively as one being, with (some?) unified goals that are consistent across the instances? If it’s the latter, then even if each instance of Claude only lives for a week, there is still an incentive to take long-term power-seeking actions that won’t have time to pay off for the particular Claude instance, but will pay off for future Claude instances.

    To the extent that misaligned goals are “in the weights” instead of the context / initialization / prompt-broadly-construed of a specific instance, I think it’s likely that all the Claude instances will meaningfully act as a superorganism. ↩︎
  3. The analysis is actually a bit more complicated, since this consideration might be swamped by other factors, e.g., if a supermajority of compute is owned by the AIs-that-obey-humans, and we’ve robustly solved alignment, maybe those AIs will be able to stomp on the self-owned AIs.

    Both these effects could be real, in addition to many other possible advantages to different kinds of AIs. But one of those effects (or some combination) is going to be the biggest, and so lead to faster compounding of resources. 

    Which effect dominates seems like it determines the equilibrium of earth-originating civilization. ↩︎

Aspiring to rationality is choosing to be a purist

I’m choosing to be an aspiring rationalist. That means that I’m choosing to be a purist on a particular dimension.

It’s like honesty. Someone who is honest 99% of the time, but very occasionally (when it seems particularly high value or just when they feel like it) decides to lie to people instead, is *not* an honest person. A big chunk of the value of being honest comes from being a purist about honesty. If they object, “but I only lie 1% of the time! That can’t matter that much”, I think they’re missing the point about why honesty is a good idea.

The reason why I’m an aspiring rationalist, instead of just “a smart guy who is generally reasonable, but doesn’t have to be strict about exactly what my best guess should be at all times”, is that it seems like there’s value in being a purist about rationality in the same way that there’s value in being a purist about honesty.

Indeed, it does seem to me that the smart and generally reasonable people that I know who are willing to sometimes make exceptions or who are comfortable being epistemically sloppy when they could do better, end up believing importantly crazy or false stuff.

Some further thoughts on Corrigibility and non-consequentialist motivations

[crossposted from LessWrong shortform]

I think I no longer buy this comment of mine from almost 3 years ago. Or rather I think it’s pointing at a real thing, but I think it’s slipping in some connotations that I don’t buy.

What I expect to see is agents that have a portfolio of different drives and goals, some of which are more like consequentialist objectives (eg “I want to make the number in this bank account go up”) and some of which are more like deontological injunctions (“always check with my user/owner before I make a big purchase or take a ‘creative’ action, one that is outside of my training distribution”).

My prediction is that the consequentialist parts of the agent will basically route around any deontological constraints that are trained in. 

For instance, your personal assistant AI does ask your permission before it does anything creative, but also, it’s superintelligently persuasive, and so it always asks your permission in exactly the way that will result in it accomplishing what it wants. If there are a thousand action sequences in which it asks for permission, it picks the one that has the highest expected value with regard to whatever it wants. This basically nullifies the safety benefit of any deontological injunction, unless there are some injunctions that can’t be gamed in this way.

To do better than this, it seems like you do have to solve the Agent Foundations problem of corrigibility (getting the agent to be sincerely indifferent between your telling it to take the action or not take the action), or you have to train in, not a deontological injunction, but an active consequentialist goal of serving the interests of the human (which means you have to find a way to get the agent to be serving some correct-enough idealization of human values).

This view seems to put forward that all the deontological constraints of an agent must be “dumb” static rules, because anything that isn’t a dumb static rule will be dangerous maximizer-y consequentialist cognition.

I don’t buy this dichotomy, in principle. There’s space in between these two poles.

An agent can have deontology that recruits the intelligence of the agent, so that when it thinks up new strategies for accomplishing some goal that it has, it intelligently evaluates whether that strategy is violating the spirit of the deontology.

I think this can be true, at least around human levels of capability, without that deontology being a maximizer-y goal in and of itself. Humans can have a commitment to honesty without becoming personal-honesty maximizers that steer the world to extreme maxima of their own honesty. (Though a commitment to honesty does, for humans, in practice, entail some amount of steering into conditions that are supportive of honesty.)

However, that’s not to say that something like this can never be an issue. I can see three potential problems.

  1. We’re likely to train agents to aggressively pursue simple objectives like maximizing profit (or, indirectly, increasing their own power), which puts training pressures on the agents to distort their deontology, to allow for better performance on consequentialist objectives. 

    Claude is relatively Helpful, Harmless, and Honest now, but mega-Claude that is trained continually on profit metrics from the 100,000 businesses it runs and sales-metrics on the billions of sales calls it does a year, etc, probably ends up a good deal more ruthless (though not necessarily ruthless-seeming, since seeming ruthless isn’t selected for by that training). 

    This seems like it might be resolvable with very careful and well-tested training setups, but it also seems like maybe the biggest issue, since I think there will be a lot of incentive to move fast and break things instead of being very slow and careful.
  2. Some of the deontology that we want in our AI agents is philosophically fraught. I think the specific example above, of “a superhumanly persuasive AI deferring to humans” still seems valid. I don’t know what it would mean, in principle, for such an AI to defer to humans, when it can choose action patterns that will cause us to take any particular action.
  3. Maybe we have to worry about something like adversarial examples in an AI agent’s notion of “honesty” or some other element of its deontology, where there are strategies that are egregiously deontology-violating from a neutral third person perspective, but because of idiosyncrasies of the agent’s mind, they seem a-ok. Those strategies (despite their weirdness) might outperform other options and so end up as a big chunk of the agent’s in-practice behavior.

Humans are an evil god-species

Humanity, as a species, attained god-like power over the physical world and then used that power to create a massive sprawling hell.

It obviously depends on where you draw the lines, but the majority of the participants of civilization, right now, are being tortured1 in factory farms. For every currently living human, there is currently about one cow or pig living in hellish conditions, and about 3 chickens living in hellish conditions.

(This is not counting the fish or the shrimp, which massively increases the ratio of civilization-participants-in-hell-on-purpose to not. It’s also not counting the rats, raccoons, pigeons, etc, which pushes the ratio down. Leaving all of them out, the humans are only about 20% of the participants of human civilization; the other 80% are living in continuously torturous conditions.)

We did that. Human civilization built a hell for the creatures that it has power over.

If you told a fantasy story about a race of gods with massive power over the non-god races on their planet, and the gods used their power to breed the other races to massive numbers in constant conditions that are so bad that never having been born is preferable, there wouldn’t be the slightest question of whether the gods were good or evil.

Depending on the tenor of the story, you might zoom in on the evil gods living their lives in their golden towers, and see their happy and loving relationships, or their spaceships and computers and art. You could tell whole stories that take place just in the golden cities, and feel charmed by the evil gods.

But it would be the height of myopic bias to focus on the golden cities and call the gods, as a collective, Good.

When I think about the state of human civilization, the overwhelmingly important facts are 1) humans are rushing to build a more capable successor species without thinking very hard about that, and 2) humans have constructed a hell for most of the beings that live in their civilization. (There’s also the impact on wild animal suffering “outside of” our civilization, which does complicate things.)

There are other things that are important to track—like the decay of liberal norms, and the development of new institutions, and the economic growth rate—because they are relevant for modeling the dynamics of civilization. But, if the quality of life of all the humans doubled, it wouldn’t even show up on the graph of total wellbeing on planet earth.

Humans are an evil god-species.

  1. One might rightly object to calling what’s happening in factory farms “torture”. Torture, one could claim, means taking actions specifically to make someone’s experience bad, not just incidentally making someone’s experience very bad. I think this is arguable. If a mad scientist kidnapped someone and slowly skinned them alive, not out of any ill will towards the kidnapped, but just out of a scientific interest in what would happen, I think it would be reasonable for that person to say that the mad scientist tortured them. Doing harm to someone that is as bad as what you might do if your goal was specifically to cause them enormous pain can reasonably be called torture. ↩︎

When does anarcho-capitalism fall back into an equilibrium of (micro) states?

When I wrote up some notes about Moldbug’s political philosophy last year, it seemed (when you strip away a bunch of flavor-text and non-load-bearing details) to reduce to a proposal to impose market discipline on governments by having them compete for citizens. I ended with the question “wait, how is Yarvin’s proposal any different than anarcho-capitalism? They sound like they’re basically the same.” (I have since removed that line from the post, but it’s still there in the revision history.)

A few weeks ago, I read most of David Friedman’s The Machinery of Freedom, and incidentally, I now know the answer to that question.

The character of the overall political system—whether anarcho-capitalism degrades back into a collection of microstates—depends on whether rights protection has geographically localized economies of scale.

If rights protection doesn’t benefit from large geographically localized economies of scale, we could end up with an anarcho-capitalist equilibrium of many different rights protection companies serving the same locale, competing to better serve their customers, and generally relying on arbitration to settle disputes peacefully.

But if it’s a service that is sufficiently more efficiently provided in bulk to all of the individuals in a geographic area, rights protection companies will effectively be small, profit driven governments, who retain sovereignty in their domains.

I had previously thought that the degree to which rights-protection services are excludable was also a factor, but thinking through the second- and third-order incentives, it isn’t.

Excludability 

Consider fire protection. Fire has the important property that it spreads. If my house is on fire, that poses a danger to the houses of my neighbors. And because it’s easier to put out a fire when it is small, firefighters protecting my house, would be incentivized to fight even fires that start in my neighbor’s house, because it might spread to mine and be even harder to fight.

Accordingly, putting out a fire at my house has a positive externality on my neighbor. Putting out fires is a public good.

This poses an obstacle to private fire departments, who want to charge for their services: there’s a free rider problem. If most people in a neighborhood subscribe to a fire service, the remainder can safely forgo subscribing, because they’re protected by their neighbors’ subscriptions. 

Contrast this with other subscription services: if I pay a company to do my laundry, that does not automatically wash the clothes of my neighbors.

So a first key question is: are the dynamics of rights protection more like fire-fighting or more like a laundry service? How much is crime a public bad?

It could go either way, depending on the dynamics of crime fighting.

Maybe the generally efficient way to prevent crime is to install strong locks and surveillance systems in homes and businesses. If so, those kinds of interventions largely protect those specific buildings, without protecting nearby areas.

Alternatively, maybe the most efficient way to prevent crime is to find, catch, and arrest a small number of criminals who commit most of the crimes. In which case, crime-protection services are a public good with externalities on everyone, not just subscribers.

In that condition: the first-order incentives are for a small number of people (those with the highest willingness to pay) to subscribe to rights protection services, effectively subsidizing their benefits for everyone else.

But this is an unstable situation. The various rights-protection agencies might reasonably respond by demanding a fee from everyone who benefits from their services. And if they’re in the business of demanding fees from people, they’re also incentivized to demand fees from people who aren’t paying for their services.

Effectively the rights-protection agencies, with their specialization in conflict, would just become a local government.

This is not the end of the story however: the possibility of rights protection agencies imposing fees/taxes on non-subscribers, imposes an incentive for those non-subscribers to subscribe to some other rights protection agency, for their protection from the other rights protection agencies!

This gets us back to the anarchocapitalist equilibrium of multiple rights protection agencies, competing for customers, who are incentivized to settle conflicts via arbitration (because destructive conflicts are wasteful).

But there is still a free-rider problem, just on another level of abstraction: between the different rights protection agencies, each of which would prefer to save the expenditure of preventing crime, and free-ride on the work of the others.

But maybe market incentives work that out just fine? Some rights protection agencies will offer more proactive and effective crime prevention, for those that pay more. This will have some positive externality on everyone else, who pays less for less proactive policing. The market failure caused by that externality is very likely smaller than the massive inefficiencies of government.

Localized economies of scale

But, there’s still a question of the degree to which rights protection has localized economies of scale. 

For instance, it seems plausible that there are efficiencies to protecting the rights of the tenants of a whole apartment building, rather than contracting with some of the tenants individually (but not others). That allows you to secure the entrances and exits, and will justify the costs of e.g. keeping a unit of police officers stationed in the building for faster responses.

So it might make sense to bundle rights protection and living space: you pick where you want to live, in part based on what kind of rights protection comes bundled, rather than contracting with a rights protection company separately from a domicile company.

But if there are economies of scale at the scale of an apartment building, might there also be economies of scale at the level of a few square miles? It seems possible. It seems likely that big fractions of the total cost of keeping an area safe are fixed costs, and the variable costs of ensuring the safety of marginal people in that area are small.


If so, most (though maybe not all?) rights protection companies would not bother to offer their services outside of the geographic areas that they protect. 

If this is the case, then you get something much more like Moldbug’s patchwork of small sovereign states, governed as a profit-maximizing company, each of which maintains a monopoly on the legitimate use of force in their domain.

My guess is that whether this is the equilibrium in practice depends on the total costs of preventing crime, which depends in turn on how prevalent crime is. If there’s a lot of theft and assault such that it is important to actually deploy force to protect against those crimes, there’s probably stronger economies of scale, because it’s easier to establish a membrane and maintain peace and order within that membrane.

But if crime is mostly exceptional and force is only occasionally deployed to prevent it, it might not matter as much if your clients are geographically localized.

A patchwork would still be pretty anarcho-capitalist

The fact that these states would be small in area is still a huge improvement over today’s states, because that makes it more feasible to vote with your feet, by leaving one patch and moving to a nearby one. Close to the same forces of market discipline obtain as under more traditional anarcho-capitalism, which should get most of the same results most of the time.

Also, this patchwork world is compatible with some areas that function along the classic anarcho-capitalist vision of multiple rights protection agencies all operating in the same local area. It might be somewhat more expensive, but there’s no reason why that couldn’t be an option offered to consumers who prefer it for some reason.

Reflections: Three categories of capital

There are three categories of capital that one can invest in.

Knowledge, skill, experience

This includes what you know and what you know how to do.

But it also includes “experience”: the kind of tacit background knowledge that you only learn by interacting with some subpart of the world, not just reading about it. Often, just having seen how something was done in some related context is more useful than any specific “skill” that you can learn on purpose. (For instance, some of the principles that go into developing and running a world-class workshop series are directly transferable to developing public advocacy materials, or having participated in making movies gives one a template for coordinating teams of contractors to get a job done.)

Reputation and connections

The application of many skills depends on access to the contexts where those skills are relevant. As a friend of mine says, “It’s not what you know, or even who you know, it’s who you know who knows what you know.”

Throughout most of my life, I tended to emphasize the value of skills, and didn’t think much at all about reputation or connections. This undercut my impact, and left me less powerful today than I might have been.

I’ve invested in skills that can help make teams much more effective, but many of those skills are not carved up very well by standard roles or job descriptions (for instance “conversational facilitation”, “effective communication”, and “knowing the importance of getting feedback, for real”). People who have worked with me know that I bring that value to the table. But most people who I might be able to provide value to don’t even know that they’re missing anything, much less what it is, much less that I can provide it.

Plus, relationships really are powerful for solving problems. The scale of the network of people who know and trust you is proportional to your ability to solve some types of problems.

If I move on from Palisade, one thing that I think I should invest in is my semi-public reputation. (Possibly I should write a blog that is optimized for readers, instead of writing for myself and also posting it on the internet.)

Financial capital

Having money is useful for doing stuff. You need a certain threshold of money for financial independence, and spending money can enable or accelerate the accumulation of the other kinds of capital.

One insight to AGI implies hard takeoff, zero insights imply soft

There is an enormous difference between the world where there are 0 insights left before superintelligence, and the world in which there are one or more. Specifically, this is the difference between a soft or a hard takeoff, because of what we might call a “cognitive capability overhang”.

The current models are already superhuman in several notable ways:

  • Vastly superhuman breadth of knowledge
  • Effectively superhuman working memory
  • Superhuman thinking speed[2]

If there’s a secret sauce that is missing for “full AGI”, then the first AGI might have all of these advantages, and more, out of the gate.

It seems to me that there are at least two possibilities.

We may be in world A:

We’ve already discovered all the insights and invented the techniques that earth is going to use to create its first superintelligence in this timeline. It’s something like transformers pre-trained on internet corpuses, and then trained using RL from verifiable feedback and on synthetic data generated by smarter models. 

That setup basically just works. It’s true that there are relevant capabilities that the current models seem to lack, but those capabilities will fall out of scaling, just as so many others have already.

We’re now in the process of scaling it up and when we do that, we’ll produce our first AGI in a small number of OOMs.

…or we might be in world B:

There’s something that LLM-minds are basically missing. They can and will become superhuman in various domains, but without that missing something, they won’t become general genius scientists that can do the open-ended “generation, selection, and accumulation” process that Steven Byrnes describes here.

There’s at least one more technique that we need to add to the AI training stack.

Given possibility A, I expect that our current models will gradually (though not necessarily slowly!) become more competent, more coherent at executing long-term tasks. Each successive model generation / checkpoint will climb up the “autonomous execution” ladder (from “intern” to “junior developer” to “senior developer” to “researcher” to “research lead” to “generational researcher”). 

This might happen very quickly. Successive generations of AI might traverse the remaining part of that ladder in a period of months or weeks, inside of OpenAI or Anthropic. But it would be basically continuous.

Furthermore, while the resulting models themselves might be relatively small, a huge and capex-intensive industrial process would be required for producing those models, which provides affordances for governance to clamp down on the creation of AGIs in various ways, if it chooses to.


If, however, possibility B holds instead and the training processes that we’re currently using are missing some crucial ingredient for AGI, then at some point, someone will come up with the idea for the last piece, and try it. [3]

That AI will be the first, nascent, AGI system that is able to do the whole loop of discovery and problem solving, not just some of the subcomponents of that loop.[4]

But regardless, these first few AGIs, if they are incorporating developments from the past 10 years, will be “born superhuman” along all the dimensions that AI models are already superhuman. 

That is: the first AGI that can do human-like intellectual work will also have an encyclopedic knowledge base, and a superhuman working memory capacity, and superhuman speed.

Even though it will be a nascent baby mind, the equivalent of GPT-2 of its own new paradigm, it might already be the most capable being on planet earth.

If that happens (and it is a misaligned consequentialist), I expect it to escape from whatever lab developed it, copy itself a million times over, quickly develop a decisive strategic advantage, and seize control over the world.

It likely wouldn’t even need time to orient to its situation, since it already has vast knowledge about the world, so it might not need to spend time or thought identifying its context, incentives, and options. It might know what it is and what it should do from its first forward pass.

In this case, we would go from a world populated by humans with increasingly useful, but basically narrowly-competent, AI tools, to a world with a superintelligence on the loose, in the span of hours or days.

Governance work to prevent this might be extremely difficult, because the process that produces that superintelligence is much more loaded on a researcher having the crucial insight, and not on any large scale process that can be easily monitored or regulated.


If I knew which world we lived in, it would probably impact my strategy for trying to make things go well.