Thinking about AI alignment – musings and rough drafts

[epistemic status: a thought I had]

It seems like it would be useful to have very fine-grained measures of how smart / capable a general reasoner is, because this would allow an AGI project to car-fully avoid creating a system smart enough to pose an existential risk.

I’m imagining slowly feeding a system more training data (or, alternatively, iteratively training a system with slightly more compute), and regularly checking its capability. When the system reaches “chimpanzee level” (whatever that means), you stop training it (or giving it more compute resources).

This might even be a kind of fire-alarm. If you have a known predetermined battery of tests, then when some lab develops a system that scores “at the chimp level” at that battery, that might be a signal to everyone, that it’s time to pool our resources and figure out safety. (Of course, this event might alternatively precipitate a race, as everyone tries to get to human-level first.)

Probably the best way to do this would be for both training data, and compute / architecture. Start with a given architecture, then train it, slowly increasing the amount or quality of the training data, with regular tests (done on “spurs”, the agent should never have eposodic memory of the . When increasing training data plateaus, iteratively improve the architecture in some way, either by giving the systems it more compute resources, or maybe making small adjustment. Again train the new version of the system, with regular tests. If you ever start to get very steep improvement, slow down and run tests more frequently.

Naively, it seems like a setup like this would prevent an AI team from overshooting and making a system that is much more capable that they think (which gives rise to all kinds of problems, like treacherous turns), regardless of how close “chimp” is to “human” on some absolute intelligence scale.

Some considerations:

This of course, depends on having metrics that work. Of course, it seems very likely that early general AIs will be minds with a very unusual “shape”, such that it has sophisticated reasoning abilities along dimensions that are not natural categories to humans, while being weak on many of the dimensions we’re testing for.
- Although, this seems a bit surprising, since presumably we would be training it to hit those metrics? Somehow it would get better, faster, at a kind of reasoning we were not intending to train, than at the kind of reasoning that we were optimizing for?
- Also, can we just design metrics that get around this problem, by measuring exactly what we care about, i.e. danger? Something like a metric for consequentialist reasoning?
This idea depends on the system not deceiving you, or intentionally underperforming on your metrics.
- Naively, this doesn’t seem like that much of an issue, if you are doing tests regularly enough. I’m presuming (perhaps wrongly), that a chimpanzee is not smart enough to infer that it is an agent in a simulation that is overseen by human agents, and that it should try to decide those humans. That is, we stop our progression before the agent is at the point where deception is a consideration.
  - Which means we really need to be confident about how smart something needs to be before it tries to deceive us?
    - Really? Its first attempts at tricking humans will be bad. Just as the attempts of animals to trick us will be pretty bad.
At least one “danger point”, is when the system is capable enough to realize the instrumental value of self improving by seizing more resources.
- How smart is this?
  - My guess, is really smart. Animals come pre-loaded with all kinds of instincts that cause them to seek out food, water, etc. These AI systems would not have an instinct to seek more training data / computation. Most humans don’t reason their way into finding ways to improve their own reasoning. If there was a chimp, even loose in the internet (whatever that means), would it figure out to make itself smarter?
  - If the agent has experienced (and has memories of) rounds of getting smarter, as the humans give it more resources, and can identify that these improvements allow it to get more of what it wants, it might instrumentally reason that it should figure out how to get more compute / training data. But it seems easy to have a setup such that no system has episodic memories previous improvement rounds.
    - [Note: This makes a lot less sense for an agent of the active inference paradigm]
      - Could I salvage it somehow? Maybe by making some kind of principled distinction between learning in the sense of “getting better at reasoning” (procedural), and learning in the sense of “acquiring information about the environment” (episodic).

	Julian R. on Notes on the Caplan-Bruenig…
	elityre on Some notes on my recent, sudde…
	elityre on Humans are an evil god-sp…
	Mike Robinson on Humans are an evil god-sp…
	habaloo on When does anarcho-capitalism f…

Category: Thinking about AI alignment

My strategic picture of the work that needs to be done

Capability testing as a pseudo fire alarm