This paper is quite old by AI safety standards, but it’s a nice self-contained piece of work that gets you thinking about the challenges of Inverse Reinforcement Learning (IRL) as a solution to the alignment problem.

Warning: this summary contains hand-wavy language and maths simplified from the more formally correct presentation in the paper.

Assumed knowledge: Bayesian Inference, Reinforcement Learning


IRL is a promising approach to addressing the alignment problem. It’s probably too hard to write a utility function “by hand” that encodes all the important facets of human values. …

What things have rights? As a society we collectively decide on what objects to invite to the rights club. At this point in history basically all of us are convinced that all humans are members. A lot of us reckon many types of animals should be inducted as well. This club is often referred to as the moral circle, with members being called moral patients.

Our history of atrocities can, to a large extent, be explained by the moral circle not growing fast enough. The atrocity just needs to start before the moral circle catches up. …

A nice paper by Cohen, Vellambi and Hutter came out in December. The paper is quite formal, so it took me a little while to get through it, but its broad strokes are useful to be aware of even if you don’t have the time to wrap your head around the theorems. So here are the ideas presented less formally.

I will use some hand-wavy language which inevitably injects some of my subjective interpretation. If you reckon you understand this work better than me and you disagree with something I’ve said, get in touch…

