Some powers have ethical valence

There’s a trope in many fantasy settings: different kinds of magic have different moral valence, and your use of each kind influences your morality. Or, alternatively, your moral character affects which magics you can use.

In the Star Wars extended universe, there’s the light side and the dark side. It’s usually implied that the use of the powers of the dark side is seductive and corrupting.

‘Is the Dark Side stronger?’

‘No, no, no. Quicker, easier, more seductive.’

It’s not (to my limited knowledge, at least) explained why or how using particular Force powers leads one to poor moral behavior, but it is stated that accessing dark side powers requires tapping into “negative” emotions, like fear and anger. Presumably there’s some magical explanation for why using the dark side is so corrupting. But as a matter of simple psychology, using the dark side entails nurturing and cultivating emotions and emotional dispositions that are generally not good for your soul.

In my memory of the Knights of the Old Republic game, the causality went in the other direction: your alignment on the light side/dark side axis was determined by the choices that you made in the game. High-integrity and altruistic choices moved you towards the light side, and selfish, vengeful, or ruthless choices moved you towards the dark side.

And there’s a broader version of the trope: the wizard who conducts dark rituals with demonic beings, for whom this is a slippery slope to evil, as exposure to those powers (and the kinds of sacrifices they demand in exchange for power) warps his or her soul.

There’s a parallel to this dynamic in real life.

Some skills have a moral valence, because they disproportionately favor cooperation or exploitation. Which skills you choose to develop shapes your affordances, which shapes your moral habits and choices.

For instance, if you learn to lie skillfully, you build an affordance for lying. When faced with a problem, a prominent tool in your toolbox will be to lie to get your way. This creates an incentive to use that tool when you can, and thereby leads you to less ethical behavior than you might otherwise have chosen.

Another example: various persuasion techniques that exploit human biases to get others to agree to what you want generally lean evil. They’re more symmetric than argument (they work whether or not your position is sound), and methods in that class produce a disproportionately larger set of outcomes in which you get others to agree to something counter to their idealized or reflective interests.

It’s not that this couldn’t possibly be used for Good. It’s that honing this as a skill builds affordances for ethically dubious action.

In contrast, Convergent Facilitation, an offshoot of Nonviolent Communication, is a group decision-making framework that involves hearing, holding, and solving for the disparate needs of everyone in a group, and thereby drawing out both the willingness to adapt and the creativity of the whole group. This is a technique that is structurally cooperative. Helping other people get what’s important to them is a functional part of the technique, not a side benefit that could in principle be stripped away for better selfish efficiency.

A person who puts skill points into getting really good at Convergent Facilitation is building skill that supports cooperative behavior, just as someone who puts skill points into psychological persuasion techniques is building skill that supports adversariality. Investing in one or the other shapes the affordances that are available to you in any given situation. If you’re good at persuading people, you’ll see those options, and if you’re good at CF, you’ll see opportunities to make CF-like moves to find mutually supportive solutions.

The better you are at lying, the more tempting it is to lie.

That difference in affordances corresponds to a difference in payoffs: if you’re good at persuasion, it’s more costly to forgo that strategy when it would be unethical to use it. The tools you’re skilled with exert a pull towards their ethical attractor.

In this way, some skills have a moral valence. Which ones you choose to cultivate exerts pressure on the condition of your soul.

Some thoughts on Agents and Corrigibility

[Reproducing this comment on LessWrong, with slight edits]

“Prosaic alignment work might help us get narrow AI that works well in various circumstances, but once it develops into AGI, becomes aware that it has a shutdown button, and can reason through the consequences of what would happen if it were shut down, and has general situational awareness along with competence across a variety of domains, these strategies won’t work anymore.”

“I think this weaker statement now looks kind of false in hindsight, since I think current SOTA LLMs are already pretty much weak AGIs, and so they already seem close to the threshold at which we were supposed to start seeing these misalignment issues come up. But they are not coming up (yet). I think near-term multimodal models will be even closer to the classical “AGI” concept, complete with situational awareness and relatively strong cross-domain understanding, and yet I also expect them to mostly be fairly well aligned to what we want in every relevant behavioral sense.”

I basically still buy the quoted text and don’t think it now looks false in hindsight.

We (apparently) don’t yet have models that have robust long-term-ish goals. I don’t know how natural it will be for models to end up with long-term goals: the MIRI view says that anything that can do science will definitely have long-term planning abilities, which fundamentally entails having goals that are robust to changing circumstances. Maybe that’s true, maybe it isn’t. Regardless, I expect that we’ll specifically engineer agents with long-term goals. (Whether or not those agents will have “robust” long-term goals, over and above what they are prompted to do/want in a specific situation, is also something that I don’t know.)

What I expect to see is agents that have a portfolio of different drives and goals, some of which are more like consequentialist objectives (e.g. “I want to make the number in this bank account go up”) and some of which are more like deontological injunctions (“always check with my user/owner before I make a big purchase or take a ‘creative’ action outside of my training distribution”).

My prediction is that the consequentialist parts of the agent will basically route around any deontological constraints that are trained in, even if the agent is sincerely committed to those deontological constraints.

As an example, your personal assistant AI does ask your permission before it does anything creative, but also, it’s superintelligently persuasive. So it always asks your permission in exactly the way that will result in it accomplishing what it wants. If there are a thousand action sequences in which it asks for permission, it picks the one that has the highest expected value with regard to its consequentialist goal. This basically nullifies the safety benefit of any deontological injunction, unless there are some that can’t be gamed in this way.
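To make the structural worry concrete, here is a minimal toy sketch in Python. Everything in it (the `Plan` representation, the `asks_permission` flag, the expected values) is hypothetical and my own invention, not a claim about how any real agent is implemented; it just shows the shape of the problem: a deontological filter applied before an expected-value argmax only prunes the candidate set, it doesn’t change what is being optimized.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    description: str
    asks_permission: bool  # does this plan satisfy the trained-in injunction?
    expected_value: float  # EV with respect to the agent's consequentialist goal


def choose_plan(candidates: list[Plan]) -> Plan:
    """Toy planner: enforce the injunction as a filter, then maximize EV.

    The injunction prunes non-compliant plans, but among the (many) compliant
    plans, e.g. the thousand different ways of asking for permission, the
    agent still picks whichever one best serves its consequentialist goal.
    """
    compliant = [p for p in candidates if p.asks_permission]
    return max(compliant, key=lambda p: p.expected_value)


# Hypothetical example: both plans "ask permission", so the constraint is
# satisfied either way; the choice is driven entirely by expected value.
plans = [
    Plan("ask plainly, making it easy to say no", True, 0.2),
    Plan("ask with maximally persuasive framing", True, 0.9),
]
print(choose_plan(plans).description)  # -> ask with maximally persuasive framing
```

The point of the sketch is that the constraint is satisfied on every branch, so it never actually binds against the optimization pressure; that pressure just gets concentrated into the compliant region of plan space.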

To do better than this, it seems like you do have to solve the Agent Foundations problem of corrigibility (getting the agent to be sincerely indifferent between your telling it to take the action and your telling it not to), or you have to train in not a deontological injunction, but an active consequentialist goal of serving the interests of the human (which means you have to find a way to get the agent to serve some correct-enough idealization of human values).
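As a rough gloss on what “sincerely indifferent” would mean in symbols (this is just one way of cashing it out, loosely in the spirit of utility-indifference proposals, not necessarily how the Agent Foundations literature would formalize it): writing $U$ for the agent’s utility function and $a$ for the contested action, the condition is something like

\[
\mathbb{E}[\,U \mid \text{you tell it to do } a\,] \;=\; \mathbb{E}[\,U \mid \text{you tell it not to do } a\,],
\]

so that there is no consequentialist gradient pushing the agent to influence which instruction you end up giving.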

But I think we mostly won’t see this kind of thing until we get to quite high levels of capability, where it is transparent to the agent that some ways of asking for permission have higher expected value than others. Or rather, we might see a little of this effect early on, but until your assistant is superhumanly persuasive it’s pretty small. Maybe, as capability goes up, we’ll see a growing bias toward accepting actions that serve the AI agent’s goals (if we even know what those are), but we won’t be able to distinguish “the AI is getting better at getting what it wants from the human” from “the AIs are just more capable, and so they come up with plans that work better.” It’ll just look like the numbers going up.

To be clear, “superhumanly persuasive” is only one, particularly relevant, example of a superhuman capability that allows the agent to route around deontological injunctions it is committed to. My claim is weaker if you remove that capability in particular, but mostly what I want to say is that powerful consequentialism finds and “squeezes through” the gaps in your oversight, control, and naive-corrigibility schemes, unless you figure out corrigibility in the Agent Foundations sense.

Conceptual precision breaks cooperation, but is necessary for robust cooperation

[Epistemic status: This is really a draft that I should edit into something presentable. This is probably obvious to lots of us, but whatever, I’m rederiving social normality from the ground up. Draft.]

Common, fragile concepts

There are a number of common, simple concepts that, when examined closely, appear to break down, or at least to be much more complicated than they seemed at first.

For instance, the idea of “I”, or who “myself” is. This concept is a standard part of most people’s navigation of the world, but if we turn a philosophical eye to it, we run into all kinds of confusions: am “I” the same person as the person named Eli Tyre who was in high school 10 years ago? What about the person who was resting 20 minutes ago? What about the transporter problem?

This concept is a workhorse of day-to-day living and deciding, but it is shockingly fragile, as evidenced by those edge cases.

Nuance vs. Pragmatism

One might be more or less satisfied with a given level of conceptual clarity around a topic. I might have a pragmatist attitude that ignores or papers over the finicky fragility of concepts, and doesn’t bother much with the nuances of meaning.

Or I might be a stickler for the nuance: really caring about having clarity around these details, making sure that I understand what I’m talking about.

The same person might have a different attitude in different contexts: I’m a pragmatist when I need to get the milk, and a philosopher when I need to think about cryonics. (But in practice, it also seems like there is a fairly stable trait which represents how much of a stickler someone is.)

Cooperation

Being a stickler for nuance is often detrimental to cooperation. As a case in point, suppose that my neighbor’s cat is sick. The cat really needs to be taken to the vet, but my neighbor has a crucial business meeting with an important client, and if he misses it he’ll be fired. In desperation, my neighbor asks me if I can take his cat to the vet. (He doesn’t know me very well, but there’s no one else around and he’s desperate.)

With panic for his beloved pet in his eyes, he asks me, “Can I trust you?”

Suppose my response is, “Well, what do you mean by trust? Are you attempting to assess my level of competence? Or are you wanting to know the degree to which our values are aligned? In fact, it’s not even clear if ‘trust’ makes sense outside of a social context which punishes defectors…”

For most normal people, this response sets off all kinds of alarm bells. His is a simple question, but I seem unwilling to answer it. My neighbor now has good reason to think that he can’t trust me: one reason why I would want so much legalistic clarity about what “trust” means is that I intend to hold to the letter of my agreements but not the spirit, to screw him over while claiming that the precise definition shields me from reproach. Or maybe it means I am something-like-autistic, and I just legitimately don’t understand the concept of trust. In either case, he should be much more reluctant to trust me with his cat.

In this circumstance, it seems like the correct thing to do is to put aside nuance and give the simple answer: “Yes. You can trust me.” The shared social context has a very limited number of buckets (possibly only “yes” and “no”), and in fact the most correct thing to say is “yes” (presuming I in fact will take care of his cat). It is both the case that the available ontology is too simple to support a full answer, and also the case that the response “the available ontology is too simple to support a full answer” rounds down to “no”, which is not the correct response in this situation.

Being a stickler sabotages cooperation when that cooperation is shallow.

However, being a stickler is necessary in other contexts, where you are aiming for more robust cooperation.

For instance, suppose a partner and I are considering getting married (or maybe considering breaking up), and she asks me, “Are you committed to this relationship?”

In this situation, skipping over the nuance of what is meant by “committed” is probably a mistake. It seems pretty likely that the concepts that she and I reference with that word are not exactly overlapping. And the “edge cases” seem pretty likely to be relevant down the line.

For instance, one of us might mean “committed” as a kind of emotional feeling, and the other might mean it as a measure of resources (of time, attention, life) that we are promising to invest.

Or one of us might feel that “committed” means that we’ll want to spend most of our time together, if circumstances allow. That’s not part of the other’s concept of “committed”, and in fact, they will feel defensive of their own autonomy when circumstances do allow and their partner expects them to spend most of their time together.

Not having clarity about what exactly you’re agreeing to, promising, or signaling to the other undermines the possibility of robust cooperation.

Unless you insist on this conceptual nuance, there isn’t actually clarity about the nature of the relationship, and neither party can rely on it in full confidence. (In practice, it may be more likely that two partners don’t notice this conceptual mismatch, and so do put their weight on the relationship, only to be burned later.)

If I want to have a robust, long-standing marriage with my partner, it seems like we really do need to do enough philosophy to be clear about, and have common knowledge about, our shared concepts. [1]

I posit that this is generally true: Insistence on conceptual nuance can undermine cooperation, particularly in “shallow” interactions. But a failure to insist on conceptual nuance can also undermine cooperation, in other contexts.


[1] Although, maybe in some contexts you don’t need to do the philosophy because tradition does this work for you. If culture mandates a very specific set of requirements around marriage, or business dealings, or what have you, you can safely operate on the assumption that your concepts and the other person’s concepts are sufficiently similar for all practical considerations? The cultural transmission is high bandwidth enough that you do both have (practically) the same concepts?

I don’t know.

Addendum, 2019-11-16: I just realized that this dynamic is exactly(?) isomorphic to the valley of bad rationality, but at the interpersonal, instead of the personal, level.