Paper Summary: Learning the Preferences of Ignorant, Inconsistent Agents (Evans et al.)

Euan McLean
Nov 3, 2020

This paper is quite old by AI safety standards, but it's a nice self-contained piece of work that gets you thinking about the challenges of Inverse Reinforcement Learning (IRL) as a solution to the alignment problem.

Warning: this summary contains hand-wavy language and maths simplified from the more formally correct presentation in the paper.

Assumed knowledge: Bayesian Inference, Reinforcement Learning

Introduction

IRL is a promising approach to addressing the alignment problem. It's probably too hard to write a utility function "by hand" that encodes all the important facets of human values. IRL is an approach to sidestepping this: instead of hand-coding values, we build AIs that watch a human's behavior and infer their goals and values from it.

IRL is not easy; it comes with a plethora of challenges. One of these challenges is that humans don't behave in a particularly rational way. We don't always do what's best for ourselves, like the lifelong smoker who is fully aware of how bad it is for them, or the student who puts off studying right up until the day of an exam. If one naively tried to infer the values of these two people while assuming they are totally rational, one would get nonsense results: that the smoker deeply values smoking, and that the student values stressful cramming for an exam.

To properly infer a human's preferences from their behavior, one needs an understanding of exactly how humans are irrational. If we had a way of modelling how human behavior is sub-optimal for achieving human goals, we could reverse-engineer that irrational behavior to infer the goals behind it.

This paper is a first baby step towards building these types of models. The authors dream up a number of examples of relatably irrational human behavior in simple gridworld-like environments, and set out to infer the preferences that lead to that behavior.

Here we’ll just look at a couple of them as examples — two instances of lunch-seeking behavior:

Figure 1: Two lunch-seeking trajectories in a gridworld containing donut stores D1 and D2 and a vegetarian cafe. Left: Alice's path. Right: Beth's path.

The left panel depicts a tragic tale of a lack of self-control. The agent, called Alice, plans to go to the vegetarian cafe. The food is healthy there and she's on a diet. But when she walks past the donut store D2, she can't resist turning in there instead and having a bunch of donuts for lunch.

The right panel shows the behavior of Beth, who is more sophisticated. She knows that she can easily be tempted by donuts. So she makes a point of taking a different route to the vegetarian cafe, in order to avoid the temptation.

From just looking at these two paths, can we infer how much the two agents value having their lunch at the donut store and the vegetarian cafe? To do so, the paper builds a surprisingly simple model of human wrongness. It considers two distinct problems with human behavior: humans are ignorant and inconsistent.

Humans are Inconsistent

Humans are temporally inconsistent, meaning that they make plans that they later abandon. Luckily, behavioral economics has already found a way to model the type of inconsistency that humans have: hyperbolic discounting.

Normally when people build reinforcement learning (RL) agents, they give the agent exponential discounting. Exponential discounting is a rational way to do discounting, while hyperbolic is not. To see why, let’s quickly set up the maths for RL and discounting.

At any given timestep, an RL agent is in a state s∈𝒮 and can choose from a number of actions a∈𝓐. The chosen action a dictates which state it will move to at the next timestep. Each action it takes comes with some reward/utility U(s,a). At time t, the agent chooses its next action aₜ to be the one that maximizes the expected utility:

Equation 1: EU(sₜ, aₜ) = 𝔼[ 𝛾(0) U(sₜ, aₜ) + 𝛾(1) U(sₜ₊₁, aₜ₊₁) + 𝛾(2) U(sₜ₊₂, aₜ₊₂) + … ]

The 𝔼 out front is the expectation value over all future trajectories ((sₜ₊₁, aₜ₊₁), (sₜ₊₂, aₜ₊₂), …) given that action aₜ was taken. The expectation depends on the distribution over future trajectories, which in turn depends on how the agent makes decisions in the future (this will be important later).
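To make Equation 1 concrete, here is a minimal Monte Carlo sketch in Python. It is not from the paper: `rollout`, `utility` and `gamma` are hypothetical stand-ins for the agent's model of how the future unfolds, its utility function U, and its discount function 𝛾.

```python
import numpy as np

def expected_utility(s, a, rollout, utility, gamma, n_samples=1000):
    """Monte Carlo estimate of Equation 1: the discounted utility of taking
    action a in state s, averaged over sampled future trajectories.
    rollout(s, a) is assumed to return one trajectory [(s, a), (s', a'), ...]
    starting with the current state-action pair."""
    totals = []
    for _ in range(n_samples):
        trajectory = rollout(s, a)   # how the agent thinks the future unfolds
        totals.append(sum(gamma(dt) * utility(s_t, a_t)
                          for dt, (s_t, a_t) in enumerate(trajectory)))
    return float(np.mean(totals))
```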

𝛾(t) is the monotonically decreasing discount function. As you can see from the above equation, it controls how much rewards at different future times influence decision making. Exponential and hyperbolic discounting are defined by giving 𝛾(t) the following forms:

Equation 2: exponential: 𝛾(t) = exp(-λt); hyperbolic: 𝛾(t) = 1/(1 + kt)

λ and k are free parameters. To see how these two differ, consider this example. It’s Friday night. You’re considering getting up early on Saturday morning to go to the gym, because you’ll feel great and virtuous afterwards. You can probably see where this is going. The expected utility of setting your alarm is something like

Equation 3: EU = -c 𝛾(t₁-t₀) + g 𝛾(t₂-t₀)

where -c is the cost of getting up early and g is the gain of feeling good after having gone to the gym. t₁ is when you have to get up, t₂ is when you get home from the gym, and t₀ is when you’re making the decision of what to do. It is instructive to rearrange this to look like

Equation 4: EU = 𝛾(t₁-t₀) [ -c + g 𝛾(t₂-t₀)/𝛾(t₁-t₀) ]

Whether or not you go through with the plan is decided by the sign of the term in brackets: you'll only follow through if the expected utility is positive both when you set the alarm and when it goes off. Most of the time when I make a plan like this, it changes halfway through: I set an alarm for the morning, but when the alarm goes off I hit the snooze button. This can be explained by the sign inside the brackets changing, because the ratio 𝛾(t₂-t₀)/𝛾(t₁-t₀) depends on t₀, the time at which I'm choosing the action.

If I were using exponential discounting like a good rational agent, the ratio would be 𝛾(t₂-t₀)/𝛾(t₁-t₀) = exp(-λ(t₂-t₁)), independent of t₀. So regardless of when I calculate the expected utility, 𝛾(t₂-t₀)/𝛾(t₁-t₀) is always the same. That means the sign in the brackets is always the same, so I stick to the plan.

But if I were using hyperbolic discounting, the ratio becomes 𝛾(t₂-t₀)/𝛾(t₁-t₀) = (1 + k(t₁-t₀))/(1 + k(t₂-t₀)). As t₀ changes, this ratio changes, so the sign of the brackets in Equation 4 can flip. On Friday night, 𝛾(t₂-t₀)/𝛾(t₁-t₀) > c/g, so I set my alarm. When I wake up on Saturday morning, 𝛾(t₂-t₀)/𝛾(t₁-t₀) has dropped below c/g, so the expected utility of getting up is now negative and I hit snooze.

Hence hyperbolic discounting is a quantitative way of modelling how humans don’t seem to be consistent over time. Our plans flip-flop.
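As a sanity check, here is the alarm example in Python with made-up numbers (c = 1, g = 1.5, getting up at t₁ = 10, home from the gym at t₂ = 12, and arbitrary values of λ and k). It is a sketch for illustration, not anything from the paper.

```python
import numpy as np

# Made-up numbers: cost c of getting up, gain g from having gone to the gym,
# alarm at t1 = 10, back home from the gym at t2 = 12.
c, g, t1, t2 = 1.0, 1.5, 10.0, 12.0

def exponential(t, lam=0.1):          # Equation 2, exponential form
    return np.exp(-lam * t)

def hyperbolic(t, k=2.0):             # Equation 2, hyperbolic form
    return 1.0 / (1.0 + k * t)

def eu_of_setting_alarm(gamma, t0):   # Equation 3, evaluated at decision time t0
    return -c * gamma(t1 - t0) + g * gamma(t2 - t0)

for t0 in [0.0, 10.0]:                # Friday night vs. the moment the alarm rings
    print(f"t0={t0}: exponential EU={eu_of_setting_alarm(exponential, t0):+.3f}, "
          f"hyperbolic EU={eu_of_setting_alarm(hyperbolic, t0):+.3f}")

# Exponential: EU is positive at both times, so the plan survives.
# Hyperbolic: EU is positive on Friday night but negative when the alarm
# goes off, so the plan gets abandoned (snooze).
```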

Humans are Ignorant

The second, less interesting way in which we fail to optimally pursue our goals is that we have uncertainty and wrong beliefs. There is no very precise way of encoding this into an RL scenario beyond simply giving the agent uncertainty over which state it is in, and imposing incorrect priors.

Formally, this corresponds to modelling the system as a Partially Observable Markov Decision Process (POMDP). This basically means that instead of knowing for sure which state s the world is in, the agent receives an observation o which has some relation to the state s. The agent must have a model of how likely o is given some s, P(o|s). With this model the agent can do Bayesian inference to get a probability distribution over possible s. This is then incorporated into the averaging over trajectories in its calculation of expected utility.

The inference of s is where we can shoehorn in some wrong beliefs, by giving the agent priors p(s) that do not match the true state of the environment.
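As a toy illustration of this (not the paper's code, and with invented numbers), here is the Bayes update an agent with a wrong prior might perform when it reaches the noodle shop that appears in a later example and sees that it looks closed:

```python
# The agent has a wrong prior that the noodle shop is open, then sees an
# observation suggesting it is closed, and updates via P(s|o) ∝ P(o|s) P(s).
prior = {"open": 0.8, "closed": 0.2}            # p(s): wrong, the shop is actually closed

# Observation model P(o = "looks closed" | s); numbers are made up.
p_looks_closed = {"open": 0.05, "closed": 0.95}

unnormalised = {s: p_looks_closed[s] * prior[s] for s in prior}
z = sum(unnormalised.values())
posterior = {s: p / z for s, p in unnormalised.items()}
print(posterior)   # ≈ {'open': 0.17, 'closed': 0.83}: the belief flips after one observation
```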

Inferring preferences from ignorant and inconsistent behavior

The paper builds a general model of human-like agents that can be reverse-engineered from actions back to preferences. The agent uses hyperbolic discounting and only partially observes its environment (so it lives in a POMDP).

When in state s, it chooses its next action a from a distribution defined by

Equation 5: P(a|s) ∝ exp(α · EU(s, a))

where α is a noise parameter that gives the agent a degree of randomness (for balancing the explore/exploit tradeoff), and EU(s,a) is given by Equation 1.
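A minimal sketch of this choice rule in Python, assuming we already have some function that returns EU(s, a) for the current state:

```python
import numpy as np

def choose_action(actions, expected_utility, alpha):
    """Sample an action according to Equation 5: P(a|s) ∝ exp(alpha * EU(s, a)).
    expected_utility(a) is assumed to return EU(s, a) for the current state s.
    Large alpha approaches a perfect maximiser; alpha = 0 is uniformly random."""
    logits = alpha * np.array([expected_utility(a) for a in actions])
    probs = np.exp(logits - logits.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return actions[np.random.choice(len(actions), p=probs)]
```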

The paper defines three different types of agent: a time-consistent agent that does no discounting (𝛾(t) = 1), and two time-inconsistent agents that do hyperbolic discounting, one called naive and one called sophisticated. These are meant to represent the behaviors of Alice and Beth respectively in the toy example above: the sophisticated agent takes its own hyperbolic discounting into account when planning, while the naive agent does not.

Their difference is encoded in the distribution over trajectories that the expectation value in Equation 1 is taken over. At each timestep, when taking the next action, EU(s,a) is re-calculated with a new "start time" t, which affects 𝛾(t'-t) and therefore which actions are chosen. The sophisticated agent takes this fact into account in its distribution over future trajectories. The naive agent does not: its distribution over future trajectories is computed as if an action d timesteps after the current decision is chosen using

Equation 6: P(a | s) ∝ exp(α · 𝔼[ 𝛾(d) U(s, a) + 𝛾(d+1) U(s′, a′) + … ]), with the discount still measured from the current decision time t.

What this means is that, while the sophisticated agent perfectly predicts its own future behavior, the naive agent wrongly predicts that its future self will value rewards at future timesteps the same way its current self does.
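To make the difference concrete, here is a small hypothetical sketch (not the paper's implementation) of the weights each type of planner assumes its future self will apply to upcoming rewards, using the hyperbolic 𝛾 from Equation 2:

```python
import numpy as np

def hyperbolic(t, k=1.0):
    return 1.0 / (1.0 + k * t)

def predicted_future_weights(n_rewards, d, agent_type, k=1.0):
    """How the planner (at time t) assumes its future self, acting d steps
    from now, will weight the rewards that follow that future choice point.
    Naive: the future self is assumed to keep discounting from time t,
           so a reward i steps after the choice point gets weight gamma(d + i).
    Sophisticated: the future self is (correctly) predicted to re-discount
           from its own present, giving weight gamma(i)."""
    if agent_type == "naive":
        return np.array([hyperbolic(d + i, k) for i in range(n_rewards)])
    elif agent_type == "sophisticated":
        return np.array([hyperbolic(i, k) for i in range(n_rewards)])
    raise ValueError(agent_type)

# From far away (d = 5) the naive planner thinks its future self will weight
# an immediate reward and slightly later rewards almost equally...
print(predicted_future_weights(3, d=5, agent_type="naive"))
# ...but when that moment actually arrives, the real hyperbolic self uses the
# weights the sophisticated planner predicted, where the immediate reward dominates.
print(predicted_future_weights(3, d=5, agent_type="sophisticated"))
```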

The paper defines a space of possible agents, parameterized by their utilities U, noise parameter α, discounting parameter k, agent type Y (non-discounting, naive, or sophisticated), and prior over world-states p(s). So an agent in this space is defined by a tuple θ = (U, α, k, Y, p(s)).

The model constructed above tells us how θ determines a set of actions a₀₋ₜ = (a₀, a₁, …, aₜ); i.e., it tells us P(a₀₋ₜ|θ) (this can be derived from Equation 5). So we can do Bayesian inference on the actions a₀₋ₜ to infer a distribution over θ:

Equation 7: P(θ | a₀₋ₜ) ∝ P(a₀₋ₜ | θ) P(θ)

given some prior P(θ). The inference is implemented using probabilistic programming (namely the WebPPL language) which I’m not super familiar with but looks quite fun.
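To make Equation 7 concrete, here is a brute-force sketch of the inference over a small grid of candidate agents. This is not how the paper does it (they use WebPPL); `likelihood_of_actions` is a hypothetical stand-in for P(a₀₋ₜ|θ) computed from the agent model above, and the prior over world-states p(s) is left out of θ for brevity.

```python
import numpy as np
from itertools import product

def posterior_over_agents(observed_actions, likelihood_of_actions):
    """Brute-force Equation 7: P(theta | actions) ∝ P(actions | theta) P(theta),
    over a small grid of candidate agents theta."""
    thetas = list(product(
        [1.0, 3.0, 5.0],                                # utility of donuts
        [1.0, 3.0, 5.0],                                # utility of the vegetarian cafe
        [0.1, 1.0],                                     # noise parameter alpha
        [0.0, 1.0, 4.0],                                # hyperbolic discount parameter k
        ["non-discounting", "naive", "sophisticated"],  # agent type Y
    ))
    prior = np.ones(len(thetas)) / len(thetas)          # uniform prior P(theta)
    likelihood = np.array([likelihood_of_actions(observed_actions, theta)
                           for theta in thetas])         # P(actions | theta)
    unnormalised = prior * likelihood
    return thetas, unnormalised / unnormalised.sum()     # posterior P(theta | actions)
```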

Applying the model to the Toy Environments

The authors apply the above process to the actions shown in Figure 1 (along with a number of other more complicated scenarios), to infer the likely θ in each case. They find goals for these agents that match our intuition about what would be driving this behavior.

For example, from the left panel of Figure 1, they infer that the most likely Y corresponds to the naive-discounting agent type, along with a joint distribution over the donut-store and vegetarian-cafe utilities (shown in the paper as a heatmap, with darker regions more likely under the IRL posterior and lighter regions less likely). This is what one would expect under the "temptation" interpretation of this set of actions.

A balance between the two utilities is needed. If the utility for donuts were much less than that for vegetarian food, the agent would not have given in to temptation. If the donut utility were much greater than the vegetarian utility, the agent wouldn't have bothered going all the way to donut store D2, because they could have got to D1 more quickly.

Similarly intuitive results come out of applying the process to the right panel. The behavior implies the agent is sophisticated-discounting: it anticipates being tempted by the donut store, so it takes the right-hand route instead. To test the ignorance part of the model, an extra component is added to this scenario. A third choice for the agent's lunch, a noodle shop, is placed on the right pathway. The noodle shop is closed, but the agent is allowed to have an arbitrary prior p(s) over whether or not it is closed.

Besides the area of high probability in θ-space corresponding to the "avoid temptation" interpretation, there is a second area of high probability that requires a different interpretation. In this second area, the agent has a strong (wrong) prior that the noodle shop is open. The interpretation of this area is that the agent values the noodle shop, so it tries to go there, finds out it is closed, and continues on to the vegetarian cafe. This demonstrates how the ignorance component of the model is important for getting the correct result from the IRL.

Human Explanations

These interpretations the paper gives for the posterior found from IRL are quite intuitive. But the authors wanted to tie these interpretations more empirically to real human judgments. So they ran some experiments with humans.

They showed the human test subjects the same set of toy behaviors as was shown to the IRL model. The humans were asked to come up with reasonable explanations for each of the behavior examples.

The most common explanations matched the interpretations given to the results of the IRL. This is meant to demonstrate that their IRL model makes inferences about what drives other people's behavior in somewhat the same way humans do. And since humans tend to be pretty good at inferring the goals that drive people's behavior, this suggests the model is doing an alright job.

Conclusion

This is very much a first baby step towards more powerful ways of inferring human values. I also think it is a nice demonstration of some of the challenges inherent in doing this kind of work.

I hope you found this thing useful. Now I'm going to go out and find some lunch.
