Some questions that I have about AI and the overall strategic situation, and why I want to know

Will automating AI R&D not work for some reason, or will it fail to lead to vastly superhuman superintelligence within 2 years of “~100% automation”?

  • Why I want to know:
    • Naively, it seems like at (or before) the point when AI models are capable of doing AI research at a human level, we should see a self-reinforcing speedup in AI progress. So AI systems that are substantially superhuman should arrive not long after “human-researcher-level AI” in calendar time, on the default trajectory.
    • The possibility (or likelihood) of an intelligence explosion imposes a major constraint on both technical alignment efforts and policy pushes, because it means that a company might develop dangerously superhuman AI relatively suddenly, and that this AI may have design properties that the human researchers at that company don’t understand.
    • If I knew that ~as-capable-as-elite-humans AI doesn’t lead to an intelligence explosion for some reason, would I do anything different?
      • Well, I wouldn’t feel like warning the government about the possibility of an intelligence explosion is an urgent priority.
      • I would assign much less mass to an acute takeover event in the near term. Without the acceleration dynamics of an intelligence explosion, I don’t think that any one company, or any one AI, would attain a substantial lead over the others.
        • In that case, it seems like our main concerns are (1) gradual disempowerment and (2) gradual disempowerment followed by an abrupt AI coup.
    • I haven’t yet seen a good argument for why automating AI R&D wouldn’t lead to a substantial and self-reinforcing speedup in AI progress, culminating in a steep climb up to superintelligence.
  • Notes:
    • The strongest reason that occurs to me:
      • A conjunction:
        • LLMs are much further from full general intelligences than they currently seem. They’ll get increasingly good at eg software engineering, and in fact surpass humans. But they’ll continue to not really generate new frames. They’ll be able to automate ML research in the sense of coming up with experiments to try, and implementing those experiments, but never any new conceptual work. 
        • AND 
        • That conceptual work is necessary for getting to full on superintelligence. Even millions of superhuman programmers will not hit upon the insights needed for a true general intelligence that doesn’t have this weakness, that develops new ontologies from its experience. 
      • I don’t currently buy either side of this conjunction, but especially not the second part. It seems like most AI research is not coming up with new brilliant ideas, but rather generating 10 ideas that might work to solve a problem, and then testing them. This seems well within the capability set of LLM minds.
    • Another possibility, in principle, for why automating AI R&D wouldn’t lead to an intelligence explosion is that a very large percentage of the progress (at that part of the development trajectory) is driven by scaling rather than by algorithmic progress.
      • I might want to build a quantitative model of this and play around with it a bit (a toy sketch is at the end of these notes).
    • Epoch seems to think that there won’t be an intelligence explosion (or maybe that there will be, but the development of superintelligences won’t matter much). I should look into their arguments about it.
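    • A toy sketch of the scaling-vs-algorithms point above (this is my own illustrative model; every number in it is a made-up placeholder, not an estimate):
```python
# Toy model: effective capability grows each year from (a) compute scaling, held
# fixed, and (b) algorithmic progress, which gets a boost proportional to current
# capability once AI R&D is automated. If scaling_share is near 1, automation
# barely changes the trajectory for a while; if algorithmic progress dominates,
# growth becomes self-reinforcing and takes off much sooner.

def simulate(years=10, scaling_share=0.5, base_growth=3.0, automation_gain=0.1):
    """Return capability per year, in arbitrary 'effective compute' units.

    base_growth: assumed total annual multiplier on capability today (placeholder).
    scaling_share: fraction of that growth that comes from compute scaling.
    automation_gain: assumed strength of the capability -> algorithmic-progress feedback.
    """
    capability = 1.0
    trajectory = [capability]
    for _ in range(years):
        scaling_term = base_growth * scaling_share
        algo_term = base_growth * (1 - scaling_share) * (1 + automation_gain * capability)
        capability *= scaling_term + algo_term
        trajectory.append(capability)
    return trajectory

if __name__ == "__main__":
    for share in (0.9, 0.5, 0.1):
        print(f"scaling_share={share}:", ", ".join(f"{c:.3g}" for c in simulate(scaling_share=share)))
```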

In what admin will the intelligence explosion occur?

  • Why I want to know:
    • I think I should make pretty different investments depending on when I expect the critical part of the intelligence explosion to happen (where “the critical part” is the point at which we have the most leverage, whenever that is). The nearer it is, the more our interventions need to be targeted at influencing the current people in power. The further out it is, the broader the possible portfolio, and the more it makes sense to aim to get competent, trustworthy, and informed people into power rather than informing and influencing the current cohort.
    • If I knew it was going to happen in 2025 to 2029, all of our political advocacy would need to be targeted at informing and mobilizing the current government (modulo the midterm shakeup) to take action.
    • If I knew it was going to happen in 2030 to 2034, I would be advocating for some specific policies to this admin, but I would mainly focus on building relationships and developing plans to inform and equip the next administration.
    • If I knew it was going to happen in 2035 to 2039, I think I would mostly back up and try to improve the overall quality of US governance, and/or work to get competent candidates into the 2036 presidential election.
      • Also, if it’s that far out, I would need to reevaluate our plans generally. For one thing, I expect that, barring transformative AI, by 2034 China will be the world’s leading superpower, and possibly the world’s leading AI developer.

Will the arrival of powerful / transformative AI come from a lumpy innovation/insight?

  • Why I want to know:
    • If there’s one major insight standing between current systems and much more powerful AI, it seems much more likely that we’re in for a hard takeoff, because there will be a “cognitive capabilities overhang”: we should expect the very first AGI (or nearly the first) to already be superhuman.
    • And depending on the shape of the insight, it might totally obviate hardware governance. If that lumpy insight enables the creation of efficient open-ended learners on a small number of GPUs (such as “one”), then policing the hardware supply to prevent the creation of an unaligned superintelligence is basically futile, and we need to find a totally different path.

Will superhuman AI agents come out of the LLM / reasoning model paradigm? Is there something that LLMs are basically missing?

  • Why I want to know:
    • This bears on the question above. If current LLM-based AIs are missing some secret ingredient needed to do the full loop of learning and discovery, it seems much more likely that we’re a small number of insights away from making very capable agents, as opposed to zero or hundreds of insights away.

Will reward hacking be solved?

  • Why I want to know:
    • I kind of expect that in the next one to two years, various engineering solutions will drop the incidence of AI reward hacking to close to zero. At least one company will get to the point that their AIs basically do what their human operators expect and intend for them to do.
    • Humans can tell when they’re hacking a system by goodharting a metric, and some humans will explicitly notice and choose not to do that. They don’t just follow the letter of the law, they follow the spirit. In principle, AIs could do the same thing.1
    • However, if we stop seeing reward hacking, it will be tricky to interpret what that means, for at least two reasons:
      • 1) The models are already situationally aware enough to know when they’re being evaluated. Reward hacking going away may just mean that the models have been trained to only reward hack in ways that are subtle enough to plausibly be merely a mistake.
        • I don’t actually buy this. If the models are trying to reward-hack, and also not get caught, I expect them to trip up sometimes. There should be a long tail of instances of transparent reward hacking.
      • 2) The training and supervision mechanisms that we use to prevent reward hacking seem likely to break, maybe catastrophically, when the AIs are superintelligent. 

To what degree will the goals / values / preferences / desires of future AI agents depend on dispositions that are learned in the weights, and to what degree will they depend on instructions and context?

  • See more here.
  • Elaboration:
    • My impression is that usually when people think about misaligned AIs, they’re imagining a model that develops long-term consequentialist goals over the course of RL training (or similar), and that those goals are “in the weights”. That is, what a model wants or ultimately steers towards is mostly a function of its parameters, as opposed to its inputs / context / prompt, during inference.
    • This is mostly not what current AI models are like. Current AIs apparently do learn unwanted behavioral biases in the course of RL training, and RLHF also trains in some behavioral dispositions (like refusals of requests deemed harmful). But by some common-sense notion, almost all of an AI’s behavior, and its goals (to the extent that it has goals), are determined by the prompts and the context. The same underlying model can be used to instantiate or role-play a wide variety of agents, with different behavioral properties and different objectives.
    • To what degree will this be true of future AI agents? 
    • This breaks down into two questions:
      • To what degree will developer-intended goals be learned in training vs. assigned in deployment?
        • eg we could imagine an agent that is specifically trained to optimize a modular goal-spec written in English. The goal spec is modular in that, in training, the agent is trained with many different goal specs, so that instead of internalizing any one of them the way current models internalize their system prompts, the agent learns to flexibly respond to whatever goal spec its developers give it, the way that current models respond to varied prompts (see the sketch at the end of this section).
      • To what degree will unintended AI goals be learned in training vs. developed in deployment?
        • eg it might be that there are fundamental regularities across all or most of the RL environments that AI agents are trained in, which lead to the formation of more or less persistent, training-adaptive but misaligned goals. We could also imagine that those misaligned goals are highly contextual, only triggered in particular circumstances.
  • Why this matters:
    • I care about this question because I want to know how likely values-based collusion between AI models is.
    • If the goals of future AI agents are mostly derived from some kind of instance-by-instance goal-spec, various forms of scalable oversight, where we have the AIs police each other, seem notably more feasible: we can tell one Claude “your goal is to cure cancer” and we can tell another Claude “your goal is to make sure that that first Claude isn’t up to any tricks”.2
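  • A minimal sketch of the modular goal-spec training idea mentioned above (all names here, like rollout and score_against_spec, are hypothetical placeholders rather than any lab’s actual API): each training episode samples a fresh goal spec, supplies it in-context, and scores the trajectory against that spec, so what gets reinforced is “follow the currently supplied spec” rather than any one fixed goal.
```python
# Hypothetical sketch: train against many goal specs so the agent learns to follow
# whichever spec it is given, instead of internalizing a single fixed objective.
# Every function here is a placeholder, not a real training API.

import random

GOAL_SPECS = [
    "Maximize unit-test pass rate without deleting or weakening tests.",
    "Summarize the user's documents; never fabricate citations.",
    "Schedule the meetings subject to the stated constraints.",
]

def rollout(policy, prompt: str) -> str:
    """Placeholder: run the agent on the prompt and return its trajectory."""
    raise NotImplementedError

def score_against_spec(trajectory: str, goal_spec: str) -> float:
    """Placeholder: reward is computed relative to the sampled spec, not a fixed goal."""
    raise NotImplementedError

def training_step(policy):
    goal_spec = random.choice(GOAL_SPECS)            # a fresh spec each episode
    prompt = f"GOAL SPEC:\n{goal_spec}\n\nBegin."     # the spec arrives in-context
    trajectory = rollout(policy, prompt)
    reward = score_against_spec(trajectory, goal_spec)
    policy.update(trajectory, reward)                # assumes the policy object exposes update()
```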

How late in the tech tree are VNM agents that are well-modeled as aggressively optimizing / argmaxing a utility function?

  • Why I want to know:
    • It seems pretty clear that we don’t know how to describe a utility function such that arbitrarily high levels of optimization of that utility function are safe for humans and humane values. In this sense, we are quite unprepared for surviving superintelligences “in the limit” of capability.
    • But humans are much more ad hoc than “function over world state (or world history) argmaxers”, and so are current AIs. I can believe that future AIs will converge to something that is indistinguishable from that, from our perspective, but it’s unclear whether that’s a problem for this generation, or a problem for many generations of AI successors from now.
    • Some people have a mental model of AI alignment that is closer to “we need to align a strongly superhuman, coherent VNM expected utility maximizer”, and others have a mental model that is more like “we need to figure out how to train a more capable version of Claude safely.” Naively, one of these seems doable and the other seems intractable. In the long run they bleed into each other, but it matters a lot how long the long run is. When do we face the intractable-seeming version of the problem?

What are the bottlenecks for technical alignability of superhuman AI systems? What work should we expect to be able to offload to automated alignment researchers, and why do we think that?

  • Why I want to know:
    • Because I want to know how seriously I should take proposals to have the AIs do our alignment homework. Is this a pipe dream, because we wouldn’t be able to tell the difference between superhuman slop and actual progress?
    • And further, if there are parts of the discovery and development process that can’t be outsourced and parts that can, I want to know which is which, so that 1) we can prioritize the work that can’t be outsourced, and 2) we don’t trick ourselves into thinking that the situation is handled by outsourcing to AIs when it isn’t.

  1. At minimum, you could have two copies of the same agent, one which is tasked with solving a problem and another which is tasked with identifying and flagging instances of reward hacking by the first instance. Since they have the same capabilities, the problem-solver wouldn’t be able to find reward hacks that the supervisor couldn’t identify. You might or might not use the reward hacking flags as a training signal. Something dumb like this might just work.

    This setup has some potential problems: 1) we have to worry about the two instances of the model colluding with each other, and 2) it introduces weird pressures for doublethink and cognitive distortion, where the model learns to reward hack in ways that are undetectable to itself. (Those weird pressures might or might not end up being strong enough to be relevant. Possibly, it will be hard to tell what the effects of those pressures are, as the agent becomes more capable than us, and maybe even before that point.) ↩︎
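    A minimal sketch of that two-copy setup, in case it helps make it concrete (the query_model function is a stand-in for whatever API actually serves the model; this is my own illustration, not a description of any existing system):
```python
# Sketch of the dual-instance setup: the same underlying model plays both roles,
# so the problem-solver shouldn't have a capability edge over its supervisor.
# query_model is a placeholder, not a real API.

def query_model(prompt: str) -> str:
    """Placeholder for a call to the shared underlying model."""
    raise NotImplementedError

def solve(task: str) -> str:
    return query_model(f"Complete this task:\n{task}")

def flags_reward_hacking(task: str, solution: str) -> bool:
    verdict = query_model(
        "Another instance of you produced the solution below.\n"
        f"Task: {task}\nSolution: {solution}\n"
        "Does it game the metric rather than satisfy the intent? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def run_episode(task: str):
    solution = solve(task)
    flagged = flags_reward_hacking(task, solution)
    # The flag could be logged for human audit, or optionally folded into the reward signal.
    return solution, flagged
```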
  2. Note that “values-based” collusion, where multiple instances of the same AI have the same hidden goal, is not the only kind of collusion that we have to be worried about. There’s also trade-based collusion, where different AI instances with different goals still get collectively more of what they want by coordinating with each other against the humans. ↩︎

Small cashflow software businesses might be over soon?

[Epistemic status: half-baked musing that I’m writing down to clarify for myself]

For the past 15 years there’s been an economic niche where a single programmer develops a useful tool, utility, or application, sells it over the internet to a few thousand people for a small amount of money each, and makes a decent (sometimes passive or mostly-passive) living on that one-person business.

In practice, these small consumer software businesses are on the far end of a continuum that includes venture-backed startups, and they can sometimes be the seed of an exponentially scaling operation. But you only need to reach product-market fit with a few thousand users for a business like this to be sustainable. And at that point, it might be mostly on autopilot: after only two or three years, the entrepreneur still has income but can shift most of their attention to other projects.

Intend (formerly Complice) is an example of this kind of business from someone in my circles.

I wonder if these businesses will be over soon, because of AI.

Not just that AI will be able to do the software engineering, but that AI swarms will be able to automate the whole entrepreneurial process: generating (good) ideas, developing early versions, shipping them, getting user feedback, and iterating.

The discourse already imagines a “one-person unicorn”, where a human CEO coordinates a company of AIs to provide a product or service. With half a step more automation, you might see meta-entrepreneurs overseeing dozens or hundreds of separate AI swarms, each ideating, prototyping, and developing a business. Some will fail (just like every business), but some will grow and succeed, and (just like with every other business venture) you can invest more resources into the ones that are working.

Some questions:

  • How expensive will inference be, in running these AI entrepreneurs? Will the inference costs be high enough that you need venture funding to run an AI entrepreneur-system?
    • Estimating this breaks down into roughly “how many tokens does it take to run a business (per day)?” and “how much will an inference token cost in 2028?” (a sketch with placeholder numbers follows this list).
  • What are the moats and barriers to entry here? What kind of person would capture the gains from this kind of setup?
  • Will this eat the niche of human-ideated software businesses? Will there be no room left to launch businesses like this and have them succeed, because the space of niche software products will be saturated? Or is the space of software ideas so dense that there will still be room for differentiation, even if there are 1000x as many products of this type, of comparable quality, available?
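A back-of-envelope version of the inference-cost question, where every number is a made-up placeholder rather than a forecast:
```python
# Is running an "AI entrepreneur" a venture-scale expense? All figures are assumptions.

tokens_per_agent_day = 5_000_000   # assumed daily token throughput for one busy agent
agents_per_business = 10           # assumed parallel instances ideating / building / supporting
cost_per_million_tokens = 1.0      # assumed 2028 price in dollars; could be off by 10x either way

daily_cost = tokens_per_agent_day / 1_000_000 * cost_per_million_tokens * agents_per_business
print(f"~${daily_cost:,.0f}/day, ~${daily_cost * 365:,.0f}/year per AI-run business")
# With these numbers: ~$50/day, ~$18,250/year -- cheap enough to run dozens without
# venture funding, but a 10x higher token price or 10x more tokens changes that.
```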

. . .

In general, the leverage of code is going to drop over the next 5 years.

Currently, one well-placed engineer will write a line of code that might be used by millions of users. That’s partly because there’s zero marginal cost to replicating software, so a line of code written once might as well be copied to a million computers. But it’s also a reflection of the relative expense of programming labor. Not many people can write (good) code, so their labor is expensive. It’s definitely not worth paying $100 an hour for an engineer to write some software when you can buy existing off-the-shelf software that does what you need (or almost what you need) for $50 a month.

But as AI gets good enough that “writing code” becomes an increasingly inexpensive commodity, the cost-benefit of writing custom software is going to shift in the “benefit” direction. When writing new software is cheap, you might not want to pay the $50 a month, and there will be more flexibility to write exactly the right software for your particular use case instead of a good-enough off-the-shelf version (though I might be overestimating the pickiness of most of humanity with regards to their software). So more people and companies will write custom software more of the time, instead of buying existing software. As that happens, the number of computers that run a given line of code will drop.
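To make that build-vs-buy shift concrete, a tiny break-even comparison (the $50/month and $100/hour figures are from the paragraph above; the hour count, horizon, and cost multipliers are assumptions):
```python
# Build-vs-buy break-even as the cost of writing code falls.

saas_cost_per_month = 50    # $/month, from the text
engineer_rate = 100         # $/hour, from the text
hours_to_build = 200        # assumed effort for a modest custom tool
horizon_months = 36         # assumed period you'd otherwise keep paying for the SaaS

buy_cost = saas_cost_per_month * horizon_months   # $1,800 over the horizon
for multiplier in (1.0, 0.1, 0.01):               # cost of code relative to human labor
    build_cost = hours_to_build * engineer_rate * multiplier
    verdict = "build" if build_cost < buy_cost else "buy"
    print(f"code at {multiplier:.0%} of human cost: build ${build_cost:,.0f} vs buy ${buy_cost:,.0f} -> {verdict}")
# At full human cost, building ($20,000) loses to buying ($1,800); at 1% of human
# cost ($200), building wins -- the shift the paragraph above describes.
```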