Reflecting on some regret about not trying to join and improve specific org(s)

I started a new job recently, which has prompted me to reflect on my work over the past few years, and how I could have done better.

Concretely, I regret not joining SERI MATS, and helping it succeed, when it was first getting started. 

I think this might have been a great fit for me: I had existing skills and experience that would have been helpful to them. The seasonal on-off schedule would have given me the flexibility to do and learn other things, and it would (I think) have helped me get a better grounding in machine learning and technical alignment approaches.

And if I had joined with an eye towards agentically shaping the organization’s culture and priorities as it developed, I think I would have had a positive impact on the seed that has grown into the current alignment field. In particular, I think I might have had leverage to establish some cultural norms regarding how to think about the positive and negative impacts of one’s work.1

I regarded MATS as the obvious thing to do. The nascent alignment field was bottlenecked on mentorship: a small number of people (arguably) had good taste for the kinds of research that were on track, but they had limited bandwidth for research mentorship, so conveying that research taste was (and is?) a bottleneck for the whole ecosystem. A program aiming to clear away everything else, so that the capacity for research mentorship could expand as much as possible, seemed like the obvious, straightforward thing to do.

I said as much in my post from early 2023:

There is now explicit infrastructure to teach and mentor these new people though, and that seems great. It had seemed for a while that the bottleneck for people coming to do good safety research was mentorship from people that already have some amount of traction on the problem. Someone noticed this and set up a system to make it as easy as possible for experienced alignment researchers to mentor as many junior researchers as they want to, without needing to do a bunch of assessment of candidates or to deal with logistics. Given the state of the world, this seems like an obvious thing to do.

I don’t know that this will actually work (especially if most of the existing researchers are themselves doing work that dodges the core problem), but it is absolutely the thing to try for making more excellent alignment researchers doing real work. And it might turn out that this is just a scalable way to build a healthy field.

In retrospect, I should have written those paragraphs and generated the next thought “I should actively go try to get involved in SERI MATS and see if I can help them.”

So why didn’t I?

Misapplied notion of counterfactual impact

I didn’t do this because I was operating on the model/assumption that, while this was important, they were doing it now, and were probably not in danger of failing at it. It was taken care of and so I didn’t need to do it.

I now think that was probably a mistake. Because I didn’t get involved, I don’t know one way or the other, but it seems plausible to me that I could have contributed to making the overall project substantially better: more effective and with better positive externalities. 

This isn’t because I’ve learned anything in particular about how SERI MATS missed the mark; it’s just that getting more exposure to organizations has adjusted my prior: even if an organization is broadly working, and not in danger of collapse, it might still be the case that I can personally make it much better with my efforts. In particular, I think it will sometimes be the case that there is room to substantially improve an organization in ways that don’t line up very neatly with the specific roles it is explicitly trying to hire for, if you have strategic orientation and specific relevant experience.2

This realization is downstream of my interactions with Palisade over recent weeks. Also, Ronny made a comment a few years ago (paraphrased) that “you shouldn’t work for an organization unless you’re at least a little bit trying to reform it”. That stuck with me, and changed my concept of “working for an org”.

Possibly this difference in frame is also partially downstream of thinking a bit about Shapley values, through reading Planecrash and thinking about donation-matching for SFC. (I previously aimed to do things that, if I didn’t do them, wouldn’t happen. Now I’ve continuous-ized that notion, and aim for, approximately, high Shapley value.)
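As an aside, here is a minimal sketch of what that “continuous-ized” notion means, using a toy donation-matching game. The player names and payoffs are invented for illustration and have nothing to do with what SFC actually does; the point is just that a Shapley calculation spreads credit across contributors instead of giving each of them full counterfactual credit.

```python
from itertools import permutations

def shapley_values(players, value):
    """Average each player's marginal contribution over every join order."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for p in order:
            before = value(coalition)
            coalition.add(p)
            totals[p] += value(coalition) - before
    return {p: totals[p] / len(orders) for p in players}

# Hypothetical game (numbers invented for illustration): each donor gives
# 100, and the matcher doubles whatever the donors give but contributes
# nothing on its own.
def value(coalition):
    donated = 100 * len(coalition & {"donor_a", "donor_b"})
    matched = donated if "matcher" in coalition else 0
    return donated + matched

print(shapley_values(["donor_a", "donor_b", "matcher"], value))
# -> {'donor_a': 150.0, 'donor_b': 150.0, 'matcher': 100.0}
```

In this toy game, naive counterfactual reasoning credits each of the three players with 200 (600 of credit for a 400 outcome), whereas the Shapley split assigns 150, 150, and 100, which sums to exactly the value created.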

Underestimating the value of “having a job”

Also, regarding SERI potentially being a good fit for me in particular, I think I have historically underestimated the value of having a job for structuring one’s life and supporting personal learning. I currently wish that I had more technical background in ML and alignment/control work, and I think I might have gotten more of that if I had been actively trying to develop in that direction while supporting MATS in a non-technical capacity, instead of trying to develop that background (inconsistently) independently.

Strategic misgivings

I didn’t invest heavily in any project over recent years because there wasn’t much that I straightforwardly believed in. As noted above, the idea-of-MATS was a possible exception to this—it seemed like the obvious thing to do given the constraints of the world. And I now think I should take “this seems like the obvious thing to do” as a much stronger indicator that I should get involved with a project, somehow, and figure out how to help, than I previously did.

But part of what held me back from doing that was misgivings about the degree to which MATS was acting as a feeder pool for the scaling labs. MATS is another project that doesn’t seem obviously robustly good to me (or “net-positive”, though I kind of think that’s the wrong frame). As with many projects, I felt reluctant to put my full force behind it for that reason.

In retrospect, I think maybe I should have shown up and tried to solve the problem of “it seems like we’re doing plausible real harm, and that seems unethical” from the inside. I could have repeatedly and vocally drawn attention to it, raised it as a consideration in strategic and tactical planning, etc. Either I would have shaped the culture around this problem for the MATS staff sufficiently that I trusted the overall organism to optimize safely, or we would have bounced off of each other unproductively. And in that second case, we could have parted ways, and I could have moved on.

In general, it now feels like a more obvious affordance to me: if I think something is promising but I don’t trust it to have positive impacts, I can try non-disruptively making it better according to the standards that I think are important, and, if that doesn’t work or doesn’t go well, part ways with the org.

This all raises the question, “should I still try to work for SERI MATS and make it much better?”

My guess is that the opportunity is smaller now than it was a few years ago, because both the culture and processes of the org have by now found an equilibrium that works. There’s more leverage to make an org much better while it is still figuring out how to do the thing it’s trying to do than after it has reached product-market fit and is mostly finding ways to reproduce that product consistently and reliably.

That said, one common class of error is overestimating the degree to which an opportunity has passed: e.g. not buying Bitcoin in 2017 because you believe that you’ve already missed the big opportunity. That’s true in some sense, but you’re underestimating how much of the opportunity still remains.

So, if I were still unattached, writing this essay would prompt me to reach out to Ryan, say directly that I’m interested in exploring working for MATS, and try to get more contact with the territory, so that I could see for myself. As it is, I have a job which seems to need me more, and which I anticipate absorbing my attention for at least the next year.

  1. Note: of all the things I wrote here, this is the point that I am most uncertain of. It seems plausible to me that, because of psychological dynamics akin to “It is difficult to get a man to understand something, when his salary depends on his not understanding it”, and classic EA-style psychological commitment to life narratives that impart meaning via impact, the cultural norms around how the ecosystem as a whole thinks about positive and negative impacts were, and are, basically immovable. Or rather, I might have been able to make more-or-less performative hand-wringing fashionable, and possibly cause people to have less of an action-bias, but not actually produce norms that lead to more robustly positive outcomes.

    At least, I don’t have a handle either on how to approach these questions myself or on how to effectively intervene on the culture around them. And so I’m not clear on whether I could have made things better in this way. But I could have made this my explicit goal and tried, and made some progress, or not. ↩︎
  2. A bit of context that is maybe important: I have not applied for a job since I was 21, when I was looking for an interim job during college. Every single job that I’ve gotten in my adult life has resulted from either my just showing up and figuring out how I could be helpful, or someone I already know reaching out and asking me for help with a project.

    For me at least, “show up and figure out what is needed and make that happen” is a pretty straightforward pattern of action, but it might be foreign to other people who have a different conception of jobs, one that is more centered on finding specific roles that you’re well suited for and doing a good job in those roles. ↩︎

The automatic alignment of the flow-through effects of obliterating fundamental problems

[Draft. This post really has a lot of prerequisites that I’m not going to bother trying to explain. I’m just writing it to get it out of me. I’ll have to come back and make it understandable later, if that seems worth doing. This is really not edited.]

We live in an inadequate world. Things are kind of a mess. The vast majority of human resources are squandered, by Moloch, on ends that we would not reflectively endorse. And we’re probably all going to die.

The reason the world is so messed up can be traced back to a handful of fundamental problems or fundamental constraints. By “fundamental problem” I have something pretty specific in mind, but Inadequate Equilibria points in the right direction. They’re the deep reasons why we can’t “just fix” the world’s problems.

Some possible fundamental problems / constraints that I haven’t done the work to formulate correctly:

  • The world is too big and fast for any one person to know all of the important pieces.
  • The game-theoretic constraints that make rulers act against the common good.
  • People in power take power-preserving actions, so bureaucracies resist change, including correct change.
  • People really want to associate with prestigious people, and make decisions on that basis.
  • We can’t figure out what’s true anywhere near efficiently enough.
  • People can’t actually communicate about the important things.
  • We don’t know how, even in principle, to build an aligned AGI.
  • Molochian race dynamics.
  • Everyone is competing to get information to the people with power, and the people in power don’t know enough to know who to trust.
  • We’re not smart enough.
  • There is no system that is keeping track of the wilderness between problems.

I recently had the thought that some of these problems have different characters than the others. They fall into two camps, which, of course, actually form a spectrum.

For some of these problems, if you solved them, the solution would be self-aligning.

By that I mean something like: for some of these problems, their solutions would be a pressure or force that would push towards solving the other problems. In the best case, if you successfully solved that problem, in due course this would cause all of the other problems to automatically get solved. The flow-through effects of such a solution are structurally positive.

For other problems, even though they represent a fundamental constraint, if they were solved they wouldn’t push towards the solving of the other problems. In fact, solving that one fundamental problem in isolation might make the other problems worse.

A prototypical case of a problem whose solution is self-aligning [I need to come up with better terminology] is an Aligned AI. If we knew how to build an AI that could do what we actually want, this would perhaps automatically solve all of our other problems. It could tell us how (if not fix the problems itself) to have robust science, or optimal economic policy, or incentive-aligned leaders, or whatever.

Aligned AI is the lollapalooza of altruistic interventions. We can solve everything in one sweep. (Except, of course, the problems that were prerequisites for solving aligned AI. Those we can’t count on the AI to solve for us.)

Another example: if we implemented robust systems that incentivized leaders to act in the interests of the public good, it seems like this has the potential to (eventually) break all of the other problems. It would be a jolt that knocks our civilization into the attractor basin of a sane, adequate civilization (if our civilization is not in that attractor basin already).

In contrast, researcher ability is a fundamental constraint of our civilization (though maybe not a fundamental problem?), but it is not obvious that the flow-through effects of breaking through that fundamental constraint are structurally positive. On the face of it, it seems like it would be bad if everyone in the world decupled (increased tenfold) their research acumen: that seems like it would speed us toward doom.

This gives a macro-strategic suggestion, and a possible solution to the last term problem: identify all of the fundamental problems that you can, determine which ones have self-aligning solutions, and dedicate your life to solving whichever problem has the best combination of tractability and size of (self-aligned) impact.

I may be reinventing symmetric vs. asymmetric weapons here, but I think I am actually pointing at something deeper, or at least extending the idea further.


[Edit / note to self: I could maybe explain this with reference to personal productivity: you want to find the thing which is easy to do but which most makes it easy to do the other things. I’m not sure this captures the key thing I want to convey.]