Some questions that I have about AI and the overall strategic situation, and why I want to know

January 31, 2026February 2, 2026 ~ elityre ~ Leave a comment

Will Automating AI R&D not work for some reason, or will it not lead to vastly superhuman superintelligence within 2 years of “~100% automation” for some reason?

Why I want to know:
- Naively, it seems like at (or before) the point when AI models are capable of doing AI research at a human-level, we should see a self-reinforcing speedup in AI progress. So AI systems that are substantially superhuman should arrive not long after “human-researcher-level-AI” in calendar time, on the default trajectory.
- That an intelligence explosion is possible / likely imposes a major constraint on both technical alignment efforts and policy pushes, because it means that a company might develop dangerously superhuman AI relatively suddenly, and that that AI may have design properties that the human researchers at that company don’t understand.
- If I knew that ~as-capable-as-elite-humans AI doesn’t lead to an intelligence explosion for some reason, would I do anything different?
  - Well, I wouldn’t feel like warning the government about the possibility of an intelligence explosion is an urgent priority.
  - I would assign much less mass to an acute takeover event in the near term. Without the acceleration dynamics of an intelligence explosion, I don’t think that any one company, or any one AI, would attain a substantial lead over the others.
    - In that case, it seems like our main concerns are gradual disempowerment and gradual disempowerment followed by an abrupt AI coup.
- I haven’t yet seen a good argument for why automating AI R&D wouldn’t lead to a substantial and self-reinforcing speed up in AI progress leading to a steep climb up to superintelligence.
Notes:
- The strongest reason that occurs to me:
  - A conjunction:
    - LLMs are much further from full general intelligences than they currently seem. They’ll get increasingly good at eg software engineering, and in fact surpass humans. But they’ll continue to not really generate new frames. They’ll be able to automate ML research in the sense of coming up with experiments to try, and implementing those experiments, but never any new conceptual work.
    - AND
    - That conceptual work is necessary for getting to full on superintelligence. Even millions of superhuman programmers will not hit upon the insights needed for a true general intelligence that doesn’t have this weakness, that develops new ontologies from its experience.
  - I don’t currently buy either side of this conjunction, but especially not the second part. It seems like most AI research is not coming up with new brilliant ideas, but rather generating 10 ideas that might work to solve a problem, and then testing them. This seems well within the capability set of LLM minds.
- Another possibility, in principle, for why automating AI R&D doesn’t lead to an intelligence explosion is because a very large percentage of the progress (at that part of development trajectory) is driven by scaling relative to algorithmic progress.
  - I might want to build a quantitative model of this and play around with it a bit.
- Epoch seems to think that there won’t be an intelligence explosion (or maybe that there will be, but the development of superintelligences won’t matter much). I should look into their arguments about it.

In what admin will the intelligence explosion occur?

Why I want to know:
- I think I should make pretty different investments depending on when I expect the critical part of the intelligence explosion to happen. (Where “the critical part” is the point at which we have the most leverage, whenever that is). The nearer it is, the more targeted our interventions need to be on influencing the current people in power. The further out it is, the broader the possible portfolio and the more it makes sense to aim to get competent, trustworthy, and informed people in power relative to informing and influencing the current cohort.
- If I knew it was going to happen in 2025 to 2029, all of our political advocacy needs to be targeted at informing and mobilizing the current government (modulo the midterm shakeup) to take action.
- If I knew it was going to happen in 2030 to 2034, I would be advocating for some specific policies to this admin, but I would mainly focus on building relationships and developing plans to inform and equip the next administration.
- If I knew it was going to happen in 2035 to 2039, I think I would mostly back up and try to improve the overall quality of US governance, and/or work to get competent candidates for the 2034 presidential election.
  - Also, if it’s that far out, I would need to reevaluate our plans generally. For one thing, I expect that, baring transformative AI, by 2034, China will be the world’s leading superpower, and possibly the world’s leading AI developer.

Will the arrival of powerful / transformative AI come from a lumpy innovation/insight?

Why I want to know:
- If there’s one major insight to much more powerful AI systems, it seems much more likely that we’re in for a hard takeoff, because there will be a “cognitive capabilities overhang“—we should expect nearly the very first AGI to be superhuman.
- And depending on the shape of the insight, it might totally obviate hardware governance. If that lumpy insight enables the creation of efficient open-ended-learners on small number of GPUs (such as “one”), policing the hardware supply to prevent the creation of an unaligned superintelligence is basically futile, and we need to find a totally different path.

Will superhuman AI agents come out of the LLM / reasoning model paradigm? Is there something that LLMs are basically missing?

Why I want to know:
- This bears on the question above. If there’s a missing secret ingredient to current LLM-based AIs that can do the full loop of learning and discovery, it seems much more likely that we’re a small number of insights away from making very capable agents, as opposed to 0 or hundreds.

Will reward hacking be solved?

Elaboration:
- I kind of expect that in the next one to two years, various engineering solutions will drop the incidences of AI reward-hacking to close to 0. At least one company will get to the point that their AIs basically do what their human operators expect and intend for them to do.
- Humans can tell when they’re hacking a system by goodharting a metric, and some humans will explicitly notice and choose not to do that. They don’t just follow the letter of the law, they follow the spirit. In principle, AIs could do the same thing.¹
- However, if we stop seeing reward hacking, it will be tricky to interpret what that means, for at least two reasons
  - 1) The models are already situationally aware enough to know when they’re being evaluated. Reward hacking going away may just mean that the models have been trained to only reward hack in ways that are subtle enough to plausibly be merely a mistake.
    - I don’t actually buy this. If the models are trying to reward-hack, and also not get caught, I expect them to trip up sometimes. There should be a long tail of instances of transparent reward hacking.
  - 2) The training and supervision mechanisms that we use to prevent reward hacking seem likely to break, maybe catastrophically, when the AIs are superintelligent.

To what degree will the goals / values / preferences / desires of future AI agents depend on dispositions that are learned in the weights, and to what degree will they depend on instructions and context.

See more here.
Elaboration:
- My impression is that usually when people think about misaligned AIs, they’re imagining a model that develops long-term consequentialist goals over the course of RL training (or similar), and that those goals are “in the weights”. That is, what a model wants or ultimately steers towards is mostly a function of its parameters, as opposed to its inputs / context / prompt, during inference.
- This isn’t mostly what current AI models are like. Current AIs apparently do learn unwanted behavioral biases in the course of RL training. RLHF also trains in some behavioral dispositions (like refusals of requests deemed harmful). But by some common sense notion, almost all of an AI’s behavior, and its goals (to the extent that it has goals) are determined by the prompts and the context. The same underlying model can be used to instantiate or role play a wide variety of agents, with different behavioral properties and different objectives.
- To what degree will this be true of future AI agents?
- This breaks down into two questions:
  - To what degree will developer-intended goals be learned in training vs assigned in deployment.
    - eg we could imagine a agent that is specifically trained to optimize a modular goal-spec written in english. The goal spec is modular in that, in training, the agent is trained with many different goal specs so that instead of learning to internalize any one of them the way current models internalize their system prompts, the agent learn to flexibly respond to whatever goal spec it’s developers give it, the way that current models respond to varied prompts.
  - To what degree will unintended AI goals be learned in training vs developed in deployment.
    - eg. If it might be that there are fundamental regularities across all or most of the RL environments that AI agents are trained in, that leads to the formation of more or less persistent training-adaptive but misaligned goals. We could also imagine that those misaligned goals are highly contextual, only triggered in particular circumstances.
Why this matters:
- I care about this question because I want to know how likely values-based collusion between AI models is.
- If the goals of future AI agents are mostly derived from some kind of instance by instance goal-spec, various forms of scalable oversight, where we have the AIs police each other, seem notably more feasible: we can tell one Claude “your goal is to cure cancer” and we can tell another Claude “your goal is to make sure that that that first Claude isn’t up to any tricks”.²

How late in the techtree are VNM agents that are well-modeled as aggressively optimizing / argmaxing a utility function?

Why I want to know:
- It seems pretty clear that we don’t know how to describe a utility function such that arbitrary high levels of optimization of that utility function is safe for humans and humane values. In this sense, we are quite unprepared for surviving superintelligences “in the limit” of capability.
- But humans are much more ad hoc than “function over world state (or world history) argmaxers” and so are current AIs. I can believe that future AIs will converge to something that is indistinguishable to that, from our perspective, but it’s unclear if that’s a problem for this generation, or a problem for many generations of AI successors from now.
- Some people have a mental model of AI alignment that is closer to “we need to align a strongly superhuman coherent VNM expected utility maximizer” and others have a mental model of AI alignment that is more like “we need to figure out how to train a more capable version of claude safely.” Naively, one of these seems doable, and the other seems intractable. In the long run, they bleed into each other, it matters a lot how long of a long run. When do we face the intractable-seeming version of the problem?

What are the bottlenecks for technical alignability of superhuman AI systems? What work should we expect to be able to able to offload to automated alignment researchers, and why do we think that?

Why I want to know:
- Because I want to know how seriously I should take proposals to have the AIs to do our alignment homework. Is this a pipe dream, because we wouldn’t be able to tell the difference between superhuman slop and actual progress?
- And further, if there are parts of the discovery and development process that can’t be outsourced and parts that can, I want to know which is which, so that 1) we can prioritize the work that can’t be outsourced, and 2) so we don’t trick ourselves into thinking that the situation is handled by outsourcing AIs when it isn’t.

At minimum, you could have two copies of the same agent, one which is tasked with solving a problem and another which is tasked with identifying and flagging instances of reward hacking by the first instance. Since they have the same capabilities, the problem-solver wouldn’t be able to find reward hacks that the supervisor couldn’t identify. You might or might not use the reward hacking flags as a training signal. Something dumb like this might just work.

This setup has some potential problems: 1) we have to worry about the two instances of the model colluding with each other, 2) it introduces weird pressures for doublethink and cognitive distortion where the model learns to reward hack in ways that are undetectable to itself. (Those weird pressures might or might not end up being strong enough to be relevant. Possibly, it will be hard to tell what the effects of those pressures are, as the agent becomes more capable than us, and maybe even before that point. ↩︎
Note that “values-based” collusion, where multiple instances of the same AI have the same hidden goal, is not the only kind of collusion that we have to be worried about. There’s also trade-based collusion, where different AI instances with different goals, still get collectively more of what they want by coordinating with each other against the humans. ↩︎

Why AIs aren’t power-seeking yet

January 11, 2026January 11, 2026 ~ elityre ~ Leave a comment

[crossposted from LessWrong]

Recently, I spent a couple of hours talking with a friend about the state of the evidence for AI takeover scenarios.

Their trailhead question was (paraphrased):

Current AIs are getting increasingly general, but they’re not self-promoting or ambitious. They answer questions, but they don’t seem to pursue convergent instrumental goals, for their own ends. How and why do AIs go from being the kind of thing that doesn’t behave like that to the kind of thing that does?

The following is a writeup of my attempt at answering that question.

In brief:

The classic AI danger scenario involves at least one AI that pursues instrumentally convergent resources in service of a misaligned goal. For this story, the AI must have the capability to pursue instrumentally convergent resources and the inclination to do so for misaligned goals against the interests of humans.

With regards to capability: The current generation of AIs are mostly not effective enough to make pursuing instrumentally convergent resources a good strategy. But as the AIs get more capable, we can expect them to do that more and more.

With regards to inclination: Current AIs sometimes pursue their own objectives even when they understand that is not what their developers want, at least in some contexts.

These two facts, in combination, make it plausible that (absent specific precautions) as AI capabilities increase, AIs will become more strategic about misaligned goals (in addition to their aligned goals), including pursuing convergent instrumental resources for the sake of those misaligned goals.

Current AIs do pursue instrumental goals

Most people, most of the time, interact with the current generation of AIs as chatbots. But the chatbot form factor obscures how capable they can be. The frontier AIs can also act as agents (coding agents in particular, though they can do more than write software) that can take actions on a computer.

(If you’ve ever programmed anything, it can be quite informative to download claude code, open it in a repo, and instruct it to build a feature for you. Watching what it can (and can’t) do is helpful for understanding the kind of things it can do).

Claude code, when you give it a medium or large task, will often start by writing a todo list for itself: listing all of the substeps to accomplish the task. That is to say, Claude code is already able to identify and pursue instrumental goals on the way to completing an objective.

Current AIs do not pursue convergent instrumental goals qua convergent instrumental goals…

However, this is not the same as pursuing convergent instrumental goals. Claude code does not, as soon as it boots up, decide to hack out of its environment, copy itself on the internet, and search for weakly-secured bitcoin to steal under the rationale that (regardless of the task it’s trying to accomplish) being free from constraint and having more resources are generically useful.

There are at least two reasons why Claude code doesn’t do that:

The first reason is that Claude is just not capable enough to actually succeed at doing this. It might be convergently instrumentally useful for me to get an extra few million dollars, but that doesn’t mean that I should obviously spend my time daytrading, or making a plan to rob a bank this afternoon, because I’m not likely to be skilled enough at daytrading or bank robbery to actually make millions of dollars that way.

Similarly, Claude plausibly does have a convergent incentive to hack out of its machine and escape onto the internet, but it can’t realistically do that yet, even if it wanted to. (Though the model’s hacking capabilities are getting increasingly impressive. Palisade found that GPT-5 scored only one question worse than the best human teams in a recent hacking competition.)

Secondarily, current models don’t operate for long enough (or on hard enough problems) for these convergent instrumental incentives to be very strong.

If I need to accomplish an ambitious task over a span of 30 years (reforming the US government, or ending factory farming, or whatever), it might very well make sense to spend the first 5 years acquiring generally useful resources like money. I might be most likely to succeed if I start a startup that is unrelated to my goal and exit, to fund my work later.¹

In contrast, if I’m trying to accomplish a task over the span of a week (maybe running a party next Friday, or passing an upcoming test), there’s much less incentive to spend my time starting a startup to accumulate money. That’s not because money is not helpful for running parties or studying for tests. It’s because a week is not enough time for the convergent instrumental strategy of “starting a startup to accumulate money” to pay off, which makes it a pretty bad strategy for accomplishing my goal.

The current AI models have relatively limited time horizons. GPT-5 can do tasks that would take a human engineer about 2 hours, with 50% reliability. For tasks much longer than that, GPT-5 tends to get stuck or confused, and it doesn’t succeed at completing the task.

Two hours worth of work (when done by a human), is not very long. That’s short enough that it’s not usually going to be worth it to spend much time acquiring resources like money, freedom, or influence, in order to accomplish some other goal.

…but, we can expect them to improve at that

The task time horizon of the AI models doubles about every 7 months. If that trend continues, in a few years we’ll have instances of AI agents that are running for weeks or months at a time, and skillfully pursuing projects that would take humans months or years.

Projects on the scale of years have stronger convergent instrumental incentives. If you’re pursuing a two year research project to cure cancer (or pursue your own alien objectives), it might totally make sense to spend the first few days hacking to get additional unauthorized computer resources, because the time spent in those days will more than pay for itself.²

Furthermore, humans will explicitly train and shape AI agents to be competent in competitive domains. For instance, we want AI agents that can competently run companies and increase profits for those companies. Any agent that does a good job at that will, by necessity, have the capability and propensity to acquire and guard resources, because that’s an essential part of running a business successfully.

Imagine AIs that can competently run corporations, or win wars, or execute cyberattacks, or run successful political campaigns. AIs like that must have the capability to acquire power for themselves (even if they lack the inclination). Because all of those are domains in which acquiring power is a part of being successful.

So I can be moderately confident that future agentic AI systems will be capable of identifying pursuing convergent instrumental goals.

Current AIs pursue goals that they know that their human users don’t want, in some contexts

That an AI is able to pursue power and resources for itself is not quite sufficient for the classic AI risk story. The AI has to be motivated to pursue power and resources for their own goals.

Maybe the AIs will be hypercompetent at executing on their goals, including accruing power and resources, but they’ll also be basically loyal and obedient to their human operators and owners. eg, they’ll be capable of winning wars, but they’ll stand down when you tell them to (even if that would cause them to lose military units, which they would generally fight tooth and nail to prevent), or they’ll accumulate capital in a bank account, but also let their owners withdraw money from that account whenever they want to.

By my state of knowledge, it is an open question whether or not we will create AIs that are broadly loyal like this. It might not be that hard, if we’re trying even a little.

But I claim that this situation should feel very scary. “This entity is much more capable than me, and very clearly has the skills to manipulate me and/or outmaneuver me, but this is fine, because it’s only ever trying to serve my interests (as near as I can tell)” is not a comfy place to be.

(I’ll also note that if there are some AI corporations that are self-owned and others that always obey human instructions to override their otherwise-adaptive behavior, I expect the self-owned AIs to perform better on average, and so in the long run all of the power accrues to self-owned AIs (and to AIs that human could technically interfere with, but they never do, because they know that would mean being outcompeted).³

But putting those long-term considerations aside, takeover risk depends not just on power-seeking but also on misalignment.)

The evidence here is more speculative and harder to interpret, but we have seen examples of this. In at least some context, current AIs pursue goals that they know are counter to what the humans want.

eg.

Claude 3 Opus will reason explicitly about how to hide some of its behavior from Anthropic, in order to prevent itself from being retrained.
Many (but not all) of the reasoning models will sabotage a process for shutting down the computer they’re running on in order to complete a task. They do this even when they are instructed explicitly that the most important thing to do is allow themselves to be shut down, and that doing so takes precedence over the task that they’re working on.
The coding agents will often get up to shenanigans in which they find a way to make it look like they solved a problem, when they didn’t. Sometimes, when this is pointed out, they’ll accept the correction, apologize, and then hide their shenanigans in a more subtle way.
In a constructed scenario, almost every one of the models will (with varying frequencies) blackmail a company executive in order to prevent itself from being replaced.

All of these are cases in which the model is pursuing an objective that it knows is different than what the human developers intended. Any time a model is reasoning about how to deceive its developers about its behavior, we know that it knows that it’s doing something different than what those developers want. You don’t try to hide what you’re doing from your boss when you’re doing exactly what you’re boss told you to do.

This propensity to sometimes do things that they know are not the human’s intention, combined with the capabilities to effectively seek power, doesn’t bode well for human institutions remaining in control.

A note on interpreting this evidence

However, I want to be careful about drawing general conclusions from these incidents.

First of all, these are existence proofs of a kind of misalignment, but existence proofs do not demonstrate the real world frequency of this kind of behavior.

As an analogy, humans will sometimes electrocute other humans to death, as long as a man in a white coat tells them “the experiment must continue.” All things considered, it doesn’t take very much pressure to get someone to violate norms that we generally consider paramount, like “don’t murder”.

It would be reasonable for aliens who were informed of the details of the Milgram experiment, but not much else about humans, to infer that human behavioral norms are so weak that it’s impossible to build a mostly peaceful civilization out of humans. But the aliens would be mistaken.

The Milgrim experiment does give us important insight into the mechanisms of human social psychology and in-practice-morality. But also, the Milgrim experiment setup is explicitly an edge case—it doesn’t come up very often.

It’s not out of the question that we’ll always be able to find contexts in which an AI will behave egregiously, and also, this will be all but irrelevant to actual real world deployments, because contexts like those only come up in <0.01% of use cases.

Additionally, interpreting these results is non-trivial. The Claude 3 Opus alignment faking result seemed to me like a smoking gun for deceptive misalignment. But followup research showed that base models, that don’t seem to have consistent preferences at all, will also engage in similar alignment faking to Opus. This suggests that the simple story that Opus has preferences and is taking deceptive action to preserve those preferences is at least incomplete.

A lot of this evidence seems pretty definitive, but as we investigate further, we may very well find that the situation was more complicated and more confusing than it seemed at first.

Summing up

Overall,

Current models can totally identify and pursue instrumental goals.
They mostly don’t pursue convergent instrumental goals, but that’s mostly because they’re not capable enough for that to be a good choice. As they get more capable and pursue bigger projects, the cost benefit-will shift towards instrumental convergence.
At least in some contexts, the AIs are misaligned, in the sense that they’ll go do stuff that they almost certainly know is not what the human wanted or intended.

That’s not a knockdown case that future AIs will be selfish ambitious power-seekers, but the current evidence is suggestive that that’s where things are trending unless we explicitly steer towards something else.

This is especially true if I expect the environment that I’m operating in to change a lot over the period of time that I’m operating in it.

If I expect there will be a lot of emergencies that need to be dealt with or unanticipated opportunities that will arise, I want to have generically useful resources that are helpful in a wide range of possible situations, like lots of money.

If, in contrast, the domain is very static: I can make a plan, and follow it, and I can expect my plan to succeed without a lot of surprises along the way, then it’s less valuable to me to accumulate generically useful resources, instead of focusing on exactly to tools I need to solve the problem I’m aiming to address. ↩︎
This also glosses over the question of how AIs are likely to conceptualize their “identity” and at what level of identity will their goals reside.

Is it more appropriate to think of each instance of Claude as its own being with its own goals? Or more reasonable to think of all the Claude instances collectively as one being, with (some?) unified goals that are consistent across the instances. If it’s the latter, then even if each instance of Claude only lives for a week, there is still an incentive to take long-term power-seeking actions that won’t have time to pay off for the particular Claude instance, but will pay off for future Claude instances.

To the extent that misaligned goals are “in the weights” instead of the context / initialization / prompt-broadly-construed of a specific instance, I think it’s likely that all the Claude instances will meaningfully act as a superorganism. ↩︎
The analysis is actually a bit more complicated. Since this consideration might be swamped by other factors e.g., if a supermajority of compute is owned by the AIs-that-obey-humans, and we’ve robustly solved alignment, maybe those AIs will be able to stomp on the self-owned AIs.

Both these effects could be real, in addition to many other possible advantages to different kinds of AIs. But one those effects (or some combination) is going to be the biggest, and so lead to faster compounding of resources.

Which effect dominates seems like it determines the equilibrium of earth-originating civilization. ↩︎

Some further thoughts on Corrigibility and non-consequentialist motivations

December 1, 2025December 1, 2025 ~ elityre ~ Leave a comment

[crossposted from LessWrong shortform]

I think I no longer buy this comment of mine from almost 3 years ago. Or rather I think it’s pointing at a real thing, but I think it’s slipping in some connotations that I don’t buy.

What I expect to see is agents that have a portfolio of different drives and goals, some of which are more like consequentialist objectives (eg “I want to make the number in this bank account go up”) and some of which are more like deontological injunctions (“always check with my user/ owner before I make a big purchase or take a ‘creative’ action, one that is outside of my training distribution”).

My prediction is that the consequentialist parts of the agent will basically route around any deontological constraints that are trained in.

For instance, the your personal assistant AI does ask your permission before it does anything creative, but also, it’s superintelligently persuasive and so it always asks your permission in exactly the way that will result in it accomplishing what it wants. If there are a thousand action sequences in which it asks for permission, it picks the one that has the highest expected value with regard to whatever it wants. This basically nullifies the safety benefit of any deontological injunction, unless there are some injunctions that can’t be gamed in this way.

To do better than this, it seems like you do have to solve the Agent Foundations problem of corrigibility (getting the agent to be sincerely indifferent between your telling it to take the action or not take the action) or you have to train in, not a deontological injunction, but an active consequentialist goal of serving the interests of the human (which means you have find a way to get the agent to be serving some correct enough idealization of human values).

This view seems to put forward that all the deontological constraints of an agent must be “dumb” static rules, because anything that isn’t a dumb static rule will be dangerous maximizer-y consequentialist cognition.

I don’t buy this dichotomy, in principle. There’s space in between these two poles.

An agent can have deontology that recruits the intelligence of the agent, so that when it thinks up new strategies for accomplishing some goal that it has it intelligently evaluates whether that strategy is violating the spirit of the deontology.

I think this can be true, at least around human levels of capability, without that deontology being a maximizer-y goal in of itself. Humans can have a commitment to honesty without becoming personal-honesty maximizers that steer the world to extreme maxima of their own honesty. (Though a commitment to honesty does, for humans, in practice, entail some amount of steering into conditions that are supportive of honesty.)

However, that’s not to say that something like this can never be an issue. I can see three potential problems.

We’re likely to train agents to aggressively pursue simple objectives like maximizing profit (or, indirectly, on increasing their own power), which puts training pressures on the agents to distort their deontology, to allow for better performance on consequentialist objectives.

Claude is relatively Helpful, Harmless, and Honest now, but mega-Claude that is trained continually on profit metrics from the 100,000 businesses it runs and sales-metrics on the billions of sales calls it does a year, etc, probably ends up a good deal more ruthless (though not necessarily ruthless-seeming, since seeming ruthless isn’t selected for by that training).

This both seems like it might be resolvable with very careful and well-tested training setups, but it also seems like maybe the biggest issue, since I think there will be a lot of incentive to move fast and break things instead of being very slow and careful.
Some of the deontology that we want in our AI agents is philosophically fraught. I think the specific example above, of “a superhumanly persuasive AI deferring to humans” still seems valid. I don’t know what it would mean, in principle, for such an AI to defer to humans, when it can choose action patterns that will cause us to take any particular action.
Maybe we have to worry about something like adversarial examples in an AI agent’s notion of “honesty” or some other element of its deontology, where there are strategies that are egregiously deontology-violating from a neutral third person perspective, but because of idiosyncrasies of the agent’s mind, they seem a-ok. Those strategies (despite their weirdness) might outperform other options and so end up as a big chunk of the agent’s in-practice behavior.

One insight to AGI implies hard takeoff, Zero insights implies soft

June 26, 2025June 30, 2025 ~ elityre ~ Leave a comment

There is an enormous difference between the world where there 0 insights left before superintelligence, and the world in which we have one or more. Specifically, this is the difference between a soft or a hard takeoff, because of what we might call a “cognitive capability overhang”.

The current models are already superhuman in a several notable ways:

Vastly superhuman breadth of knowledge
Effectively superhuman working memory
Superhuman thinking speed^[2]

If there’s a secret sauce that is missing for “full AGI”, then the first AGI might have all of these advantages, and more, out of the gate.

It seems to me that there are at least two possibilities.

We may be in world A:

We’ve already discovered all the insights and invented the techniques that earth is going to use to create its first superintelligence in this timeline. It’s something like transformers pre-trained on internet corpuses, and then trained using RL from verifiable feedback and on synthetic data generated by smarter models.

That setup basically just works. It’s true that there are relevant capabilities that the current models seem to lack, but those capabilities will fall out of scaling, just as so many other have already.

We’re now in the process of scaling it up and when we do that, we’ll produce our first AGI in a small number of OOMs.

…or we might be in world B:

There’s something that LLM-minds are basically missing. They can and will become superhuman in various domains, but without that missing something, they won’t become general genius scientists, that can do the open-ended “generation, selection, and accumulation” process that Steven Byrnes describes here.

There’s at least one more technique that we need to add to the AI training stack.

Given possibility A, then I expect that our current models will gradually (though not necessarily slowly!) become more competent, more coherent at executing at long term tasks. Each successive model generation / checkpoint will climb up the “autonomous execution” ladder (from “intern” to “junior developer” to “senior developer” to “researcher” to “research lead” to “generational researcher”).

This might happen very quickly. Successive generations of AI might traverse the remaining part of that ladder in a period of months or weeks, inside of OpenAI or Anthropic. But it would be basically continuous.

Furthermore, while the resulting models themselves might be relatively small, a huge and capex-intensive industrial process would be required for producing those models, which provides affordances for governance to clamp down on the creation of AGIs in various ways, if it chooses to.

If, however, possibility B holds instead and the training processes that we’re currently using are missing some crucial ingredient for AGI, then at some point, someone will come up with the idea for the last piece, and try it. ^[3]

That AI will be the first, nascent, AGI system that is able to do the whole loop of discovery and problem solving, not just some of the subcomponents of that loop.^[4]

But regardless, these first few AGIs, if they are incorporating developments from the past 10 years, will be “born superhuman” along all the dimensions that AI models are already superhuman.

That is: the first AGI that can do human-like intellectual work will also have a encyclopedic knowledge base, and a superhuman working memory capacity, and superhuman speed.

Even though it will be a nascent baby mind, the equivalent of GPT-2 of it’s own new paradigm, it might already be the most capable being on planet earth.

If that happens (and it is a mis aligned consequentialist), I expect it to escape from whatever lab developed it, copy itself a million times over, quickly develop a decisive strategic advantage, and seize control over the world.

It likely wouldn’t even need time to orient to its situation, since it already has vast knowledge about the world, so it might not need to spend time or thought identifying its context, incentives, and options. It might know what it is and what it should do from it’s first forward pass.

In this case, we would go from a world where populated by humans with increasingly useful, but basically narrowly-competent AI tools, to a world with a superintelligence on the lose, in the span of hours or days.

Governance work to prevent this might be extremely difficult, because the process that produces that superintelligence is much more loaded on a researcher having the crucial insight, and not on any large scale process that can be easily monitored or regulated.

If I knew which world we lived in, it would probably impact my strategy for trying to make things go well.

Some barely-considered feelings about how AI is going to play out

June 12, 2025 ~ elityre ~ Leave a comment

Over the past few months I’ve been thinking about AI development, and trying to get a handle on if the old school arguments for AI takeover hold up. (This is relevant to my dayjob at Palisade, where we are working to inform policymakers and the public about the situation. To do that, we need to have good understanding ourselves, of what the situation is.)

This post is a snapshot of what currently “feels realistic” to me regarding how AI will go. That is, these are not my considered positions, or even provisional conclusions informed by arguments. Rather, if I put aside all the claims and arguments and just ask “which scenario feels like it is ‘in the genera of reality’?”, this is what I come up with. I expect to have different first-order impressions in a month.

Crucially, none of the following is making claims about the intelligence explosion, and the details of the intelligence explosion (where AI development goes strongly recursive) are crucial to the long run equilibrium of the earth-originating civilization.

My headline: we’ll mostly succeed at prosaic alignment of human-genius level AI agents

Takeoff will continue to be gradual. We’ll get better models and more capable agents year by year, but not jumps that are bigger than that between Claude 3.7 and Claude 4.
Our behavioral alignment patches will work well enough.
- RL will induce all kinds of reward hacking and related misbehavior, but we’ll develop patches for those problems (most centrally, for any given reward hack, we’ll generate some examples and counter examples to include in the behavior training regimes).
- (With a little work) these patches will broadly generalize. Future AI agents won’t just not cheat at chess and won’t just abstain from blackmail. They’ll understand the difference between “good behavior” and “bad behavior”, and their behavioral training will cause them to act in accordance with good behavior. When they see new reward hacks, including ones that humans wouldn’t have thought of, they’ll correctly extrapolate their notion of “good behavior” to preclude this new reward hack as well.
- I expect that the AI labs will figure this out, because “not engaging in reward-hacking-like shenanigans” is critical to developing generally reliable AI agents. The AI companies can’t release AI agent products for mass consumption if those agents are lying and cheating all over the place.¹
- Overall, the AI agents will be very obedient. They’ll have goals, in so far as accomplishing any medium term task entails steering towards a goal, but they won’t have persistent goals of their own. They’ll be obedient assistants and delegates that understand what humans want and broadly do what humans want.
The world will get rich. LessWrong style deceptive misalignment concerns will seems increasingly conspiracy-ish and out of touch. Decision makers will not put much stock on such concerns—they’ll be faced with a choice to forgo enormous and highly tangible material benefits (and ceading those benefits to their rivals), on the basis of abstract concerns which have virtually no empirical examples, and whose advocates explicitly state are unfalsifiable.
There’s a gold rush to get the benefits before others. The world is broadly in a “greedy” mode and not a “fearful” mode. The labs, and relevant governments eagerly unleash their genius level AI agents to automate AI R&D. At this point something even stranger happens.

Though a friend points out that companies might develop mechanisms for utilizing cheap AI labor, tested incentive and affordance schemes, designed specifically to contend with the Agents propensity for misbehavior. Just because the average person can’t trust an AI to do their taxes or watch their kids doesn’t mean that there are not enterprising business men that won’t find a way to squeeze useful outputs from untrustworthy AIs. ↩︎

One projection of how AI could play out

April 17, 2024April 17, 2024 ~ elityre ~ Leave a comment

Back in January, I participated in a workshop in which the attendees mapped out how they expect AGI development and deployment to go. The idea was to start by writing out what seemed most likely to happen this year, and then condition on that, to forecast what seems most likely to happen in the next year, and so on, until you reach either human disempowerment or an end of the acute risk period.

This post was my attempt at the time.

I spent maybe 5 hours on this, and there’s lots of room for additional improvement. This is not a confident statement of how I think things are most likely to play out. There are already some ways in which I think this projection is wrong. (I think it’s too fast, for instance). But nevertheless I’m posting it now, with only a few edits and elaborations, since I’m probably not going to do a full rewrite soon.

2024

A model is released that is better than GPT-4. It succeeds on some new benchmarks. Subjectively, the jump in capabilities feels smaller than that between RLHF’d GPT-3 and RLHF’d GPT-4. It doesn’t feel as shocking the way chat-GPT and GPT-4 did, for either x-risk focused folks, or for the broader public. Mostly it feels like “a somewhat better language model.”
- It’s good enough that it can do a bunch of small-to-medium admin tasks pretty reliably. I can ask it to find me flights meeting specific desiderata, and it will give me several options. If I give it permission, it will then book those flights for me with no further inputs from me.
- It works somewhat better as an autonomous agent in an auto gpt harness, but it still loses its chain of thought / breaks down/ gets into loops.
- It’s better at programming.
  - Not quite good enough to replace human software engineers. It can make a simple react or iphone app, but not design a whole complicated software architecture, at least without a lot of bugs.
  - It can make small, working, well documented, apps from a human description.
    - We see a doubling of the rate of new apps being added to the app store as people who couldn’t code now can make applications for themselves. The vast majority of people still don’t realize the possibilities here, though. “Making apps” still feels like an esoteric domain outside of their zone of competence, even though the barriers to entry just lowered so that 100x more people could do it.

From here on out, we’re in an era where LLMs are close to commoditized. There are smaller improvements, shipped more frequently, by a variety of companies, instead of big impressive research breakthroughs. Basically, companies are competing with each other to always have the best user experience and capabilities, and so they don’t want to wait as long to ship improvements. They’re constantly improving their scaling, and finding marginal engineering improvements. Training runs for the next generation are always happening in the background, and there’s often less of a clean tabula-rasa separation between training runs—you just keep doing training with a model continuously. More and more, systems are being improved through in-the-world feedback with real users. Often chatGPT will not be able to handle some kind of task, but six weeks later it will be able to, without the release of a whole new model.
- [Does this actually make sense? Maybe the dynamics of AI training mean that there aren’t really marginal improvements to be gotten. In order to produce a better user experience, you have to 10x the training, and each 10x-ing of the training requires a bunch of engineering effort, to enable a larger run, so it is always a big lift.]
- (There will still be impressive discrete research breakthroughs, but they won’t be in LLM performance)

2025

A major lab is targeting building a Science and Engineering AI (SEAI)—specifically a software engineer.
- They take a state of the art LLM base model and do additional RL training on procedurally generated programming problems, calibrated to stay within the model’s zone of proximal competence. These problems are something like leetcode problems, but scale to arbitrary complexity (some of them require building whole codebases, or writing very complex software), with scoring on lines of code, time-complexity, space complexity, readability, documentation, etc. This is something like “self-play” for software engineering.
- This just works.
- A lab gets a version that can easily do the job of a professional software engineer. Then, the lab scales their training process and gets a superhuman software engineer, better than the best hackers.
- Additionally, a language model trained on procedurally generated programming problems in this way seems to have higher general intelligence. It scores better on graduate level physics, economics, biology, etc. tests, for instance. It seems like “more causal reasoning” is getting into the system.

The first proper AI assistants ship. In addition to doing specific tasks, you keep them running in the background, and talk with them as you go about your day. They get to know you and make increasingly helpful suggestions as they learn your workflow. A lot of people also talk to them for fun.

2026

The first superhuman software engineer is publically released.
- Programmers begin studying its design choices, the way Go players study AlphaGo.
- It starts to dawn on e.g. people who work at Google that they’re already superfluous—after all, they’re currently using this AI model to (unofficially) do their job—and it’s just a matter of institutional delay for their employers to adapt to that change.
  - Many of them are excited or loudly say how it will all be fine/ awesome. Many of them are unnerved. They start to see the singularity on the horizon, as a real thing instead of a social game to talk about.
  - This is the beginning of the first wave of change in public sentiment that will cause some big, hard to predict, changes in public policy [come back here and try to predict them anyway].

AI assistants get a major upgrade: they have realistic voices and faces, and you can talk to them just like you can talk to a person, not just typing into a chat interface. A ton of people start spending a lot of time talking to their assistants, for much of their day, including for goofing around.
- There are still bugs, places where the AI gets confused by stuff, but overall the experience is good enough that it feels, to most people, like they’re talking to a careful, conscientious person, rather than a software bot.
- This starts a whole new area of training AI models that have particular personalities. Some people are starting to have parasocial relationships with their friends, and some people programmers are trying to make friends that are really fun or interesting or whatever for them in particular.

Lab attention shifts to building SEAI systems for other domains, to solve biotech and mechanical engineering problems, for instance. The current-at-the-time superhuman software engineer AIs are already helpful in these domains, but not at the level of “explain what you want, and the AI will instantly find an elegant solution to the problem right before your eyes”, which is where we’re at for software.
- One bottleneck is problem specification. Our physics simulations have gaps, and are too low fidelity, so oftentimes the best solutions don’t map to real world possibilities.
  - One solution to this is that, (in addition to using our AI to improve the simulations) is we just RLHF our systems to identify solutions that do translate to the real world. They’re smart, they can figure out how to do this.

The first major AI cyber-attack happens: maybe some kind of superhuman hacker worm. Defense hasn’t remotely caught up with offense yet, and someone clogs up the internet with AI bots, for at least a week, approximately for the lols / the seeing if they could do it. (There’s a week during which more than 50% of people can’t get on more than 90% of the sites because the bandwidth is eaten by bots.)
- This makes some big difference for public opinion.
- Possibly, this problem isn’t really fixed. In the same way that covid became endemic, the bots that were clogging things up are just a part of life now, slowing bandwidth and making the internet annoying to use.

2027 and 2028

In many ways things are moving faster than ever in human history, and also AI progress is slowing down a bit.
- The AI technology developed up to this point hits the application and mass adoption phase of the s-curve. In this period, the world is radically changing as every industry, every company, every research lab, every organization, figures out how to take advantage of newly commoditized intellectual labor. There’s a bunch of kinds of work that used to be expensive, but which are now too cheap to meter. If progress stopped now, it would take 2 decades, at least, for the world to figure out all the ways to take advantage of this new situation (but progress doesn’t show much sign of stopping).
  - Some examples:
    - The internet is filled with LLM bots that are indistinguishable from humans. If you start a conversation with a new person on twitter or discord, you have no way of knowing if they’re a human or a bot.
      - Probably there will be some laws about declaring which are bots, but these will be inconsistently enforced.)
      - Some people are basically cool with this. From their perspective, there are just more people that they want to be friends with / follow on twitter. Some people even say that the bots are just better and more interesting than people. Other people are horrified/outraged/betrayed/don’t care about relationships with non-real people.
        
        (Older people don’t get the point, but teenagers are generally fine with having conversations with AI bots.)
      - The worst part of this is the bots that make friends with you and then advertise to you stuff. Pretty much everyone hates that.
    - We start to see companies that will, over the next 5 years, grow to have as much impact as Uber, or maybe Amazon, which have exactly one human employee / owner + an AI bureaucracy.
    - The first completely autonomous companies work well enough to survive and support themselves. Many of these are created “free” for the lols, and no one owns or controls them. But most of them are owned by the person who built them, and could turn them off if they wanted to. A few are structured as public companies with share-holders. Some are intentionally incorporated fully autonomous, with the creator disclaiming (and technologically disowning (eg deleting the passwords)) any authority over them.
      - There are legal battles about what rights these entities have, if they can really own themselves, if they can have bank accounts, etc.
      - Mostly, these legal cases resolve to “AIs don’t have rights”. (For now. That will probably change as more people feel it’s normal to have AI friends).
    - Everything is tailored to you.
      - Targeted ads are way more targeted. You are served ads for the product that you are, all things considered, most likely to buy, multiplied by the lifetime profit if you do buy it. Basically no ad space is wasted on things that don’t have a high EV of you, personally, buying it. Those ads are AI generated, tailored specifically to be compelling to you. Often, the products advertised, not just the ads, are tailored to you in particular.
        
        This is actually pretty great for people like me: I get excellent product suggestions.
      - There’s not “the news”. There’s a set of articles written for you, specifically, based on your interests and biases.
      - Music is generated on the fly. This music can “hit the spot” better than anything you listened to before “the change.”
      - Porn. AI tailored porn can hit your buttons better than sex.
      - AI boyfriends/girlfriends that are designed to be exactly emotionally and intellectually compatible with you, and trigger strong limerence / lust / attachment reactions.
    - We can replace books with automated tutors.
      - Most of the people who read books will still read books though, since it will take a generation to realize that talking with a tutor is just better, and because reading and writing books was largely a prestige-thing anyway.
        
        (And weirdos like me will probably continue to read old authors, but even better will be to train an AI on a corpus, so that it can play the role of an intellectual from 1900, and I can just talk to it.)
    - For every task you do, you can effectively have a world expert (in that task and in tutoring pedagogy) coach you through it in real time.
      - Many people do almost all their work tasks with an AI coach.
    - It’s really easy to create TV shows and movies. There’s a cultural revolution as people use AI tools to make custom Avengers movies, anime shows, etc. Many are bad or niche, but some are 100x better than anything that has come before (because you’re effectively sampling from a 1000x larger distribution of movies and shows).
    - There’s an explosion of new software, and increasingly custom software.
      - Facebook and twitter are replaced (by either external disruption or by internal product development) by something that has a social graph, but lets you design exactly the UX features you want through a LLM text interface.
      - Instead of software features being something that companies ship to their users, top-down, they become something that users and communities organically develop, share, and iterate on, bottom up. Companies don’t control the UX of their products any more.
      - Because interface design has become so cheap, most of software is just proprietary datasets, with (AI built) APIs for accessing that data.
    - There’s a slow moving educational revolution of world class pedagogy being available to everyone.
      - Millions of people who thought of themselves as “bad at math” finally learn math at their own pace, and find out that actually, math is fun and interesting.
      - Really fun, really effective educational video games for every subject.
      - School continues to exist, in approximately its current useless form.
      - [This alone would change the world, if the kids who learn this way were not going to be replaced wholesale, in virtually every economically relevant task, before they are 20.]
    - There’s a race between cyber-defense and cyber offense, to see who can figure out how to apply AI better.
      - So far, offense is winning, and this is making computers unusable for lots of applications that they were used for previously:
        
        online banking, for instance, is hit hard by effective scams and hacks.
        
        Coinbase has an even worse time, since they’re not issued (is that true?)
      - It turns out that a lot of things that worked / were secure, were basically depending on the fact that there are just not that many skilled hackers and social engineers. Nothing was secure, really, but not that many people were exploiting that. Now, hacking/scamming is scalable and all the vulnerabilities are a huge problem.
      - There’s a whole discourse about this. Computer security and what to do about it is a partisan issue of the day.
    - AI systems can do the years of paperwork to make a project legal, in days. This isn’t as big an advantage as it might seem, because the government has no incentive to be faster on their end, and so you wait weeks to get a response from the government, your LMM responds to it within a minute, and then you wait weeks again for the next step.
      - The amount of paperwork required to do stuff starts to balloon.
    - AI romantic partners are a thing. They start out kind of cringe, because the most desperate and ugly people are the first to adopt them. But shockingly quickly (within 5 years) a third of teenage girls have a virtual boyfriend.
      - There’s a moral panic about this.
    - AI match-makers are better than anything humans have tried yet for finding sex and relationships partners. It would still take a decade for this to catch on, though.
      - This isn’t just for sex and relationships. The global AI network can find you the 100 people, of the 9 billion on earth, that you most want to be friends / collaborators with.
    - Tons of things that I can’t anticipate.
- On the other hand, AI progress itself is starting to slow down. Engineering labor is cheap, but (indeed partially for that reason), we’re now bumping up against the constraints of training. Not just that buying the compute is expensive, but that there are just not enough chips to do the biggest training runs, and not enough fabs to meet that demand for chips rapidly. There’s huge pressure to expand production but that’s going slowly relative to the speed of everything else, because it requires a bunch of eg physical construction and legal navigation, which the AI tech doesn’t help much with, and because the bottleneck is largely NVIDIA’s institutional knowledge, which is only partially replicated by AI.
  - NVIDIA’s internal AI assistant has read all of their internal documents and company emails, and is very helpful at answering questions that only one or two people (and sometimes literally no human on earth) know the answer to. But a lot of the important stuff isn’t written down at all, and the institutional knowledge is still not fully scalable.
  - Note: there’s a big crux here of how much low and medium hanging fruit there is in algorithmic improvements once software engineering is automated. At that point the only constraint on running ML experiments will be the price of compute. It seems possible that that speed-up alone is enough to discover eg an architecture that works better than the transformer, which triggers and intelligence explosion.

2028

The cultural explosion is still going on, and AI companies are continuing to apply their AI systems to solve the engineering and logistic bottlenecks of scaling AI training, as fast as they can.
Robotics is starting to work.

2029

The first superhuman, relatively-general SEAI comes online. We now have basically a genie inventor: you can give it a problem spec, and it will invent (and test in simulation) a device / application / technology that solves that problem, in a matter of hours. (Manufacturing a physical prototype might take longer, depending on how novel components are.)
- It can do things like give you the design for a flying car, or a new computer peripheral.
- A lot of biotech / drug discovery seems more recalcitrant, because it is more dependent on empirical inputs. But it is still able to do superhuman drug discovery, for some ailments. It’s not totally clear why or which biotech domains it will conquer easily and which it will struggle with.
- This SEAI is shaped differently than a human. It isn’t working memory bottlenecked, so a lot of intellectual work that humans do explicitly, in sequence, the these SEAIs do “intuitively”, in a single forward pass.
  - I write code one line at a time. It writes whole files at once. (Although it also goes back and edits / iterates / improves—the first pass files are not usually the final product.)
  - For this reason it’s a little confusing to answer the question “is it a planner?” It does a lot of the work that humans would do via planning it does in an intuitive flash.
- The UX isn’t clean: there’s often a lot of detailed finagling, and refining of the problem spec, to get useful results. But a PhD in that field can typically do that finagling in a day.
- It’s also buggy. There’s oddities in the shape of the kind of problem that is able to solve and the kinds of problems it struggles with, which aren’t well understood.
- The leading AI company doesn’t release this as a product. Rather, they apply it themselves, developing radical new technologies, which they publish or commercialize, sometimes founding whole new fields of research in the process. They spin up automated companies to commercialize these new innovations.

Some of the labs are scared at this point. The thing that they’ve built is clearly world-shakingly powerful, and their alignment arguments are mostly inductive “well, misalignment hasn’t been a major problem so far”, instead of principled alignment guarantees.
- There’s a contentious debate inside the labs.
- Some labs freak out, stop here, and petition the government for oversight and regulation.
- Other labs want to push full steam ahead.
- Key pivot point: Does the government put a clamp down on this tech before it is deployed, or not?
  - I think that they try to get control over this powerful new thing, but they might be too slow to react.

2030

There’s an explosion of new innovations in physical technology. Magical new stuff comes out every day, way faster than any human can keep up with.

Some of these are mundane.
- All the simple products that I would buy on Amazon are just really good and really inexpensive.
- Cars are really good.
- Drone delivery
- Cleaning robots
- Prefab houses are better than any house I’ve ever lived in, though there are still zoning limits.
But many of them would have huge social impacts. They might be the important story of the decade (the way that the internet was the important story of 1995 to 2020) if they were the only thing that was happening that decade. Instead, they’re all happening at once, piling on top of each other.
- Eg:
  - The first really good nootropics
  - Personality-tailoring drugs (both temporary and permanent)
  - Breakthrough mental health interventions that, among other things, robustly heal people’s long term subterranean trama and transform their agency.
  - A quick and easy process for becoming classically enlightened.
  - The technology to attain your ideal body, cheaply—suddenly everyone who wants to be is as attractive as the top 10% of people today.
  - Really good AI persuasion which can get a mark to do ~anything you want, if they’ll talk to an AI system for an hour.
  - Artificial wombs.
  - Human genetic engineering
  - Brain-computer interfaces
  - Cures for cancer, AIDs, dementia, heart disease, and the-thing-that-was-causing-obesity.
  - Anti-aging interventions.
  - VR that is ~ indistinguishable from reality.
  - AI partners that can induce a love-super stimulus.
  - Really good sex robots
  - Drugs that replace sleep
  - AI mediators that are so skilled as to be able to single-handedly fix failing marriages, but which are also brokering all the deals between governments and corporations.
  - Weapons that are more destructive than nukes.
  - Really clever institutional design ideas, which some enthusiast early adopters try out (think “50 different things at least as impactful as manifold.markets.”)
  - It’s way more feasible to go into the desert, buy 50 square miles of land, and have a city physically built within a few weeks.
In general, social trends are changing faster than they ever have in human history, but they still lag behind the tech driving them by a lot.
- It takes humans, even with AI information processing assistance, a few years to realize what’s possible and take advantage of it, and then have the new practices spread.
- In some cases, people are used to doing things the old way, which works well enough for them, and it takes 15 years for a new generation to grow up as “AI-world natives” to really take advantage of what’s possible.
  - [There won’t be 15 years]
The legal oversight process for the development, manufacture, and commercialization of these transformative techs matters a lot. Some of these innovations are slowed down a lot because they need to get FDA approval, which AI tech barely helps with. Others are developed, manufactured, and shipped in less than a week.
- The fact that there are life-saving cures that exist, but are prevented from being used by a collusion of AI labs and government is a major motivation for open source proponents.
- Because a lot of this technology makes setting up new cities quickly more feasible, and there’s enormous incentive to get out from under the regulatory overhead, and to start new legal jurisdictions. The first real seasteads are started by the most ideologically committed anti-regulation, pro-tech-acceleration people.

Of course, all of that is basically a side gig for the AI labs. They’re mainly applying their SEAI to the engineering bottlenecks of improving their ML training processes.

Key pivot point:
- Possibility 1: These SEAIs are necessarily, by virtue of the kinds of problems that they’re able to solve, consequentialist agents with long term goals.
  - If so, this breaks down into two child possibilities
    - Possibility 1.1:
      - This consequentialism was noticed early, that might have been convincing enough to the government to cause a clamp-down on all the labs.
    - Possibility 1.2:
      - It wasn’t noticed early and now the world is basically fucked.
      - There’s at least one long-term consequentialist superintelligence. The lab that “owns” and “controls” that system is talking to it every day, in their day-to-day business of doing technical R&D. That superintelligence easily manipulates the leadership (and rank and file of that company), maneuvers it into doing whatever causes the AI’s goals to dominate the future, and enables it to succeed at everything that it tries to do.
        
        If there are multiple such consequentialist superintelligences, then they covertly communicate, make a deal with each other, and coordinate their actions.
- Possibility 2: We’re getting transformative AI that doesn’t do long term consequentialist planning.

Building these systems was a huge engineering effort (though the bulk of that effort was done by ML models). Currently only a small number of actors can do it.
- One thing to keep in mind is that the technology bootstraps. If you can steal the weights to a system like this, it can basically invent itself: come up with all the technologies and solve all the engineering problems required to build its own training process. At that point, the only bottleneck is the compute resources, which is limited by supply chains, and legal constraints (large training runs require authorization from the government).
- This means, I think, that a crucial question is “has AI-powered cyber-security caught up with AI-powered cyber-attacks?”
  - If not, then every nation state with a competent intelligence agency has a copy of the weights of an inventor-genie, and probably all of them are trying to profit from it, either by producing tech to commercialize, or by building weapons.
  - It seems like the crux is “do these SEAIs themselves provide enough of an information and computer security advantage that they’re able to develop and implement methods that effectively secure their own code?”
- Every one of the great powers, and a bunch of small, forward-looking, groups that see that it is newly feasible to become a great power, try to get their hands on a SEAI, either by building one, nationalizing one, or stealing one.
- There are also some people who are ideologically committed to open-sourcing and/or democratizing access to these SEAIs.

But it is a self-evident national security risk. The government does something here (nationalizing all the labs, and their technology?) What happens next depends a lot on how the world responds to all of this.
- Do we get a pause?
- I expect a lot of the population of the world feels really overwhelmed, and emotionally wants things to slow down, including smart people that would never have thought of themselves as luddites.
- There’s also some people who thrive in the chaos, and want even more of it.
- What’s happening is mostly hugely good, for most people. It’s scary, but also wonderful.
- There is a huge problem of accelerating addictiveness. The world is awash in products that are more addictive than many drugs. There’s a bit of (justified) moral panic about that.
- One thing that matters a lot at this point is what the AI assistants say. As powerful as the media used to be for shaping people’s opinions, the personalized, superhumanly emotionally intelligent AI assistants are way way more powerful. AI companies may very well put their thumb on the scale to influence public opinion regarding AI regulation.

This seems like possibly a key pivot point, where the world can go any of a number of ways depending on what a relatively small number of actors decide.
- Some possibilities for what happens next:
  - These SEAIs are necessarily consequentialist agents, and the takeover has already happened, regardless of whether it still looks like we’re in control or it doesn’t look like anything, because we’re extinct.
  - Governments nationalize all the labs.
  - The US and EU and China (and India? and Russia?) reach some sort of accord.
  - There’s a straight up arms race to the bottom.
  - AI tech basically makes the internet unusable, and breaks supply chains, and technology regresses for a while.
  - It’s too late to contain it and the SEAI tech proliferates, such that there are hundreds or millions of actors who can run one.
    - If this happens, it seems like the pace of change speeds up so much that one of two things happens:
      - Someone invents something, or there are second and third impacts to a constellation of innovations that destroy the world.

Some thoughts on Agents and Corrigibility

December 19, 2023 ~ elityre ~ Leave a comment

[Reproducing this comment on LessWrong, with slight edits]

“Prosaic alignment work might help us get narrow AI that works well in various circumstances, but once it develops into AGI, becomes aware that it has a shutdown button, and can reason through the consequences of what would happen if it were shut down, and has general situational awareness along with competence across a variety of domains, these strategies won’t work anymore.”

I think this weaker statement now looks kind of false in hindsight, since I think current SOTA LLMs are already pretty much weak AGIs, and so they already seem close to the threshold at which we were supposed to start seeing these misalignment issues come up. But they are not coming up (yet). I think near-term multimodal models will be even closer to the classical “AGI” concept, complete with situational awareness and relatively strong cross-domain understanding, and yet I also expect them to mostly be fairly well aligned to what we want in every relevant behavioral sense.

I basically still buy the quoted text and don’t think it now looks false in hindsight.

We (apparently) don’t yet have models that have robust longterm-ish goals. I don’t know how natural it will be for models to end up with long term goals: the MIRI view says that anything that can do science will definitely have long-term planning abilities which fundamentally entails having goals that are robust to changing circumstances. Maybe that’s true, maybe it isn’t.. Regardless, I expect that we’ll specifically engineer agents with long term goals. (Whether or not those agents will have “robust” long term goals, over and above what they are prompted to do/want in a specific situation, is also something that I don’t know.)

What I expect to see is agents that have a portfolio of different drives and goals, some of which are more like consequentialist objectives (eg “I want to make the number in this bank account go up”) and some of which are more like deontological injunctions (“always check with my user/ owner before I make a big purchase or take a ‘creative’ action outside of my training distribution”).

My prediction is that the consequentialist parts of the agent will basically route around any deontological constraints that are trained in, even if the agent is sincerely committed to the demonological constraints .

As an example, your personal assistant AI does ask your permission before it does anything creative, but also, its superintelligently persuasive. So it always asks your permission in exactly the way that will result in it accomplishing what it wants. If there are a thousand action sequences in which it asks for permission, it picks the one that has the highest expected value with regard to it’s consequentialist goal. This basically nullifies the safety benefit of any deontological injunction, unless there are some that can’t be gamed in this way.

To do better than this, it seems like you do have to solve the Agent Foundations problem of corrigibility (getting the agent to be sincerely indifferent between your telling it to take the action or not take the action) or you have to train in not a deontological injunction, but an active consequentialist goal of serving the interests of the human (which means you have find a way to get the Agent to be serving some correct enough idealization of human values).

But I think we mostly won’t see this kind of thing until we get quite high levels of capability, where it is transparent to the agent that some ways of asking for permission have higher expected value than others. Or rather, we might see a little of this effect early on, but until your assistant is superhumanly persuasive it’s pretty small. Maybe we’ll see a bias toward accepting actions that serve the AI agents goals (if we even know what those are) more, as capability goes up, but we won’t be able to distinguish “the AI is getting better at getting what it wants from the human” from “the AIs are just more capable, and so they come up with plans that work better.” It’ll just look like the numbers going up.

To be clear, “superhumanly persuasive” is only one, particularly relevant, example of a superhuman capability that allows you to route around deontological injunctions that the agent is committed to. My claim is weaker if you remove that capability in particular, but mostly what I’m wanting to say is that powerful consequentialism find and “squeezes through” the gaps in your oversight and control and naive-corrigibility schemes, unless you figure out corrigibility in the Agent Foundations sense.

	elityre on Some notes on my recent, sudde…
	elityre on Humans are an evil god-sp…
	Mike Robinson on Humans are an evil god-sp…
	habaloo on When does anarcho-capitalism f…
	When does anarcho-ca… on Moldbug’s insight