Some barely-considered feelings about how AI is going to play out

Over the past few months I’ve been thinking about AI development, and trying to get a handle on whether the old-school arguments for AI takeover hold up. (This is relevant to my day job at Palisade, where we are working to inform policymakers and the public about the situation. To do that, we need a good understanding of the situation ourselves.)

This post is a snapshot of what currently “feels realistic” to me regarding how AI will go. That is, these are not my considered positions, or even provisional conclusions informed by arguments. Rather, if I put aside all the claims and arguments and just ask “which scenario feels like it is ‘in the genera of reality’?”, this is what I come up with. I expect to have different first-order impressions in a month.

Crucially, none of the following makes claims about the intelligence explosion, even though the details of that explosion (where AI development goes strongly recursive) are critical to the long-run equilibrium of Earth-originating civilization.

My headline: we’ll mostly succeed at prosaic alignment of human-genius level AI agents

  • Takeoff will continue to be gradual. We’ll get better models and more capable agents year by year, but no jumps bigger than the one between Claude 3.7 and Claude 4.
  • Our behavioral alignment patches will work well enough.
    • RL will induce all kinds of reward hacking and related misbehavior, but we’ll develop patches for those problems (most centrally, for any given reward hack, we’ll generate some examples and counterexamples to include in the behavioral training regimes).
    • (With a little work) these patches will broadly generalize. Future AI agents won’t just not cheat at chess and won’t just abstain from blackmail. They’ll understand the difference between “good behavior” and “bad behavior”, and their behavioral training will cause them to act in accordance with good behavior. When they see new reward hacks, including ones that humans wouldn’t have thought of, they’ll correctly extrapolate their notion of “good behavior” to preclude this new reward hack as well.
    • I expect that the AI labs will figure this out, because “not engaging in reward-hacking-like shenanigans” is critical to developing generally reliable AI agents. The AI companies can’t release AI agent products for mass consumption if those agents are lying and cheating all over the place.1
    • Overall, the AI agents will be very obedient. They’ll have goals, insofar as accomplishing any medium-term task entails steering towards a goal, but they won’t have persistent goals of their own. They’ll be obedient assistants and delegates that understand what humans want and broadly do what humans want.
  • The world will get rich. LessWrong-style deceptive misalignment concerns will seem increasingly conspiracy-ish and out of touch. Decision makers will not put much stock in such concerns: they’ll be faced with the choice of forgoing enormous and highly tangible material benefits (and ceding those benefits to their rivals) on the basis of abstract concerns which have virtually no empirical examples, and which their own advocates explicitly state are unfalsifiable.
  • There’s a gold rush to get the benefits before others. The world is broadly in a “greedy” mode and not a “fearful” mode. The labs and relevant governments eagerly unleash their genius-level AI agents to automate AI R&D. At this point something even stranger happens.
  1. Though a friend points out that companies might develop mechanisms for utilizing cheap AI labor (tested incentive and affordance schemes) designed specifically to contend with the agents’ propensity for misbehavior. Just because the average person can’t trust an AI to do their taxes or watch their kids doesn’t mean that enterprising businesspeople won’t find a way to squeeze useful outputs from untrustworthy AIs. ↩︎
