Maximum Entropy Inverse Reinforcement Learning
Disclaimer: This post is an adapted version of the theoretical part of a Jupyter Notebook that I wrote in 2019 for a seminar on Imitation Learning. You can find it, an accompanying presentation, and a small framework at github.com/qzed/irl-maxent.
I highly encourage you to have a look at that notebook in case you're interested in the code behind this theory.
Let us start with some context and a brief overview of the problem we're trying to solve: Reinforcement learning (RL) aims to provide a framework for finding the optimal behavior (i.e., an optimal policy) of intelligent agents regarding some environment they interact with by directing them via a reward signal, measuring their performance. In inverse reinforcement learning (IRL), somewhat conversely, we try to recover the reward function (and therefore policy) of an agent given a set of demonstrations performed by that agent. At first, this may seem to stand somewhat opposed to the larger goal of creating intelligent agents, however, we can use the demonstration of experts ― often humans ― to improve our agents by guiding them to a better solution.
An important aspect of IRL is the recovery of the expert's reward function, which in many classical reinforcement-learning scenarios is often hand-designed. This also sets it apart from behavioral cloning (BC), another category of methods under the larger imitation learning (IL) umbrella (see Osa et al., 2018 for an overview), which focuses on retrieving the expert's policy instead of the reward. The general idea is that by analyzing the actions taken by an expert, we can infer a better reward function than we could manually design.
Maximum entropy (MaxEnt) inverse reinforcement learning (Ziebart et al., 2008) is a powerful and clever, yet fairly simple way to achieve this. Having discussed the “why”, let us now get to the “how”:
While I assume some familiarity with the basics of reinforcement and inverse reinforcement learning, here's a short overview: In both settings, we describe our world using a Markov decision process (MDP). An MDP is a 5-tuple with states , actions , transition model , reward function , and reward discount factor , decreasing the reward over time (where and ). In the inverse reinforcement learning setting, we will only use per-state rewards . The key part of an MDP is that any action relies solely on the current state, meaning complete independence of all preceding history.
The goal of reinforcement learning then is to find an optimal sequence of actions, i.e., a policy describing the optimal action per state, for an autonomous agent given a reward signal . The reward signal provides feedback to the agent, affirming its behavior on actions that help further the goal of the task at hand, and punishing it on actions that are counter-productive to this goal. Policies come in two flavors: deterministic and stochastic. Deterministic policies are defined as function , whereas stochastic policies are defined as probability distributions over actions, i.e., . While in the usual RL setting the (or at least one) optimal policy is guaranteed to be deterministic, in inverse reinforcement learning (and not only there), it makes more sense to talk about stochastic policies.
Inverse reinforcement learning, as discussed above and the name suggests, does the inverse of RL. Given a set of demonstrations of an expert (agent), consisting of trajectories through the state-action space, IRL aims to recover the underlying reward function , which explains the behavior of the expert.
To allow for computation with states, we use so-called features. Features are, in this case, defined as a function , mapping states to -dimensional vectors, and can be arbitrarily chosen. In general, we assume that where denotes the length of the trajectories (visited states), i.e., that the features of a trajectory are the sum of the features of the states visited by this trajectory, counting repetitions.
The general idea of Maximum Entropy Inverse Reinforcement Learning is based on feature-expectation matching (Abbeel & Ng, 2004): The expected visitation-frequency of features in the demonstrations provided by the expert should equal the expected visitation-frequency of features visited by the agent following the recovered reward function. In other words: The features describing the states of our world should be, in expectation, visited equally often by the agent following our recovered reward function and by the expert demonstrating (nearly) optimal behavior. In equations, we can express this as the expectation of features over trajectories derived from the reward we want to recover, and
where the learner follows a policy directly implied by the reward. The expert follows an implicit policy that encodes its notion of reward and optimality.
Note that we can also express the expected feature visitation frequency as sum over trajectories
with the probability distribution of trajectories of the learner. Finding such a match, however, is not a well-posed problem in accordance to the definition by Hadamard: There is not the single one, but many possible solutions. Multiple reward functions can achieve the same expected feature-visitation frequency (Abbeel & Ng, 2004; Ng et al., 1999; Ziebart, 2010). To resolve this ambiguity, Ziebart et al. (Ziebart et al., 2008) propose to choose the solution (i.e., the ) maximizing the entropy.
But why should we maximize the entropy? To answer this, we have to understand what entropy represents. Firstly, the entropy of some event probability distribution is defined as
where is the set containing all events and the probability of event occurring.
Let us now assume that we want to communicate the occurrence of all events with some other party efficiently. To accomplish this, we wish to minimize the expected message length, so messages that will be sent frequently should be short, and messages that will seldom be sent are allowed to be long. Following this trail of thought leads to the term , which is the optimal number of bits that we should spend on the message representing event . With lower probability , the term increases. In fact, it increases logarithmically, as with more bits available, we can encode exponentially more events. Using this observation for the entropy then yields nothing more than the entropy being the lower limit of the expected message length over all messages, i.e.:
Summarizing all of this in a single sentence: Entropy is a measure of uncertainty.
Note: This is, in fact, the underlying principle of Huffman coding, which, due to this relation, is guaranteed to be optimal.
As is done generally, we will now use the normal logarithm instead of the base-two one. Since holds, we are only dropping a constant factor of which does not impact extrema or our reasoning below, and simplifies further computations.
With that in mind, let us ask again: Why should we maximize the entropy? Now we can give a simple reasoning (Jaynes, 1957): Let us assume that we have two distributions, and , with entropy values . From our previous observations, we now know that to encode messages from we, in expectancy, need more bits than to encode messages from . This is the key point because this in turn also means that we have less information about than about . If we know more about a distribution, we require fewer bits to encode events from it. If we know everything, we do not need to send/encode anything, and if we know nothing, we need to send everything.
But isn't it good if we know more? “The less entropy, the better, right?” you may ask. No. Not for many optimization problems, or in general, when working with incomplete information: Consider a problem, like feature-expectation matching, which has multiple solutions fulfilling our constraints. The only information which we give our solver for this problem is the feature-expectations we want to replicate. All solutions provided by our solver contain this information. Yet, they most certainly have different entropy values. This must mean that (at least) some solutions have additional information that we did not provide, i.e., they have a bias. This is illustrated in the figure below for our two distributions and :
By choosing the solution with maximum entropy, we choose the solution with minimal information. Note that all solutions already satisfy our constraints; thus we actually pick the solution that best fits our information with a minimal bias.
Applying our findings from the previous section to feature-expectation matching directly results in a constrained optimization problem:
This problem expresses that we want to find a probability distribution over trajectories with maximum entropy, under which the expected feature visitation frequencies of learner and expert match (that represents the features visited by trajectory ). Note that the policy of the learner depends on the probability distribution with
i.e., the policy of the learner is proportional to the sum of the probabilities of all trajectories performing action in the starting state .
Now we have the probability distribution over trajectories, but how does this help us to recover the reward function ? Following the school of basic reinforcement learning provides us with the answer: The reward dictates the policy, which in turn dictates the trajectory distribution . A key assumption of Maximum Entropy IRL is that we know the behavior of our MDP, i.e., we have a perfect model representing the transition probabilities , so the only thing we actually need to learn is the reward. Everything else either is derived from it or assumed known. To be able to learn the reward, we must parameterize it. For this, we assume linearity regarding features, i.e.,
with being the reward parameter vector. To make the dependency on the reward parameters explicit, we write and .
Note: The choice of a linear parametrization is not by chance, but a direct result of the feature expectation matching constraint: Given an expert solving an MDP with unknown reward function linear in a set of known features, performing feature expectation matching with those features is guaranteed to yield a policy performing comparable to or better than the expert with respect to that reward (Abbeel & Ng, 2004).
Solving the above mathematical program for MDPs with deterministic transition dynamics (i.e., MDPs where all actions have predetermined deterministic outcomes) using the method of Lagrange multipliers leads to the observation that
with the so-called partition function simply normalizing the values to fulfill the probability distribution constraint (Osa et al., 2018; Ziebart et al., 2008). By choosing the reward to linearly depend on the features, we have also ensured that the Lagrange multipliers are precisely the reward parameters (Osa et al., 2018). For MDPs with stochastic transitions, Ziebart et al. (Ziebart et al., 2008) propose to modify the deterministic solution, yielding the approximation
Note: We are simply multiplying the deterministic solution with combined transition probability of the trajectory. As a result of said modification, however, we introduce a bias to the reward function (Osa et al., 2018). To avoid this, Ziebart (Ziebart, 2010) proposes the Maximum Causal Entropy IRL method in his thesis (see also Ziebart et al., 2013). In the scope of this post, we'll stick to the normal (non-causal) entropy.
We now know the parameterized form of the distribution, so how can we optimize the parameters? The answer is fairly straightforward: We can maximize the log-likelihood over all demonstrated trajectories , i.e.,
using gradient ascent (or any other gradient-based optimization technique) as this function is convex (Osa et al., 2018; Ziebart et al., 2008).
Note: Maximizing the log-likelihood is equivalent to minimizing the Kullback-Leibler divergence (Bishop, 2006)
It can therefore also be interpreted as the M-projection onto the manifold of the maximum entropy distribution (see e.g., Osa et al., 2018).
The (averaged) gradient is then simply the difference between expected empirical feature counts and the learner's expected feature counts (derived from the trajectory probabilities ), shown in Equation (5a), i.e.:
Note: The realization that we can transform Equation (5a) to Equation (5b), i.e., represent the expected feature count of the learner via the state-visitation frequency , is actually one of the most important steps!
Equation (5a) is generally infeasible (or at the very least quite costly) to compute, as enumerating all trajectories is usually subject to combinatorial explosion (think, for example, of trajectories in a “simple” grid world). Computing the state visitation frequencies, as we will see below, however, is quite feasible (within some restrictions).
Note: Due to the log-likelihood, this gradient holds for both deterministic and stochastic transition dynamics.
Now, we have only one remaining question: How do we compute the state-visitation frequency ? Following good computer-science fashion, we divide and conquer: First, we compute a (stochastic) policy , referred to as the local action probability, with a backward pass. Thereafter, we compute the state-visitation frequency under this policy with a simple forward pass.
The basic idea of the first pass is to remember our earlier observation about the learner policy made in Equation (2), i.e.,
and then recursively expand this using the solution we derived for our trajectory distribution presented in Equation (3) until we are not reasoning about full or partial trajectories any longer, but the individual state transitions. This then yields the set of equations (Ziebart et al., 2008):
Here, represents the state-partition-function, again normalizing the probabilities, and is referred to as the state-action partition function. Setting the state-partition function for terminal states and then recursively backing up, using the equations for and , from those states yields a set of values that we can then use to compute the local action probabilities following the above equation.
Note: This backup-pass somewhat resembles a value-iteration scheme to solve an MDP. It also turns out that this backup part of the algorithm is the only thing that differs between Maximum Entropy and Maximum Causal Entropy IRL.
In fact, for the Maximum Causal Entropy approach, one can show that the partition functions are an analogy of the Bellman equation — the so-called soft Bellman equation — which allows for a straightforward extension of the value-iteration algorithm to this problem (Ziebart, 2010; Ziebart et al., 2013).
As we now know the policy , inferring the state-visitation frequency becomes fairly easy: Remember that the policy is a probability distribution that tells us, given a state , with which probability the agent should take a certain action , which then leads to a new state according to our MDP transition model. Given an initial state-visitation frequency, we can therefore use the policy to forward-propagate the individual visitation counts, obtaining an updated state-visitation frequency map.
To this end, we initialize to the starting-state probabilities at timestep , i.e.,
We then compute updates as mentioned above via
where represents the transition probabilities of our (stochastic) MDP. Note, that only gives us the state-visitation frequency for time-step , meaning we have to sum them up to obtain the time-step independent state-visitation
With a plan for computation of the state-visitation frequency, we now have everything we need to state the full algorithm. To derive that, however, let us first recap our findings:
Our problem is a constrained optimization problem for finding the optimal reward parameters . We can solve this problem by maximizing the log-likelihood over all demonstrated trajectories (see Equation (4)). This, in turn, is itself an optimization problem, for which we need the gradient of this log-likelihood.
To compute the gradient , we require the state-visitation frequency (see Equation (5b)).
To compute the state-visitation frequency, we require the policy (see Equation (7b)).
Putting this all together yields the following iteration scheme:
Compute policy the using current estimate of via the backward pass, i.e., Equations (6a–c).
Compute the state-visitation frequency via the forward pass, i.e., Equations (7a–c).
Compute the gradient via Equation (5b).
Perform one gradient-based optimization step, e.g.,
We then iterate these four steps until convergence.
Note: You can find a Jupyter Notebook discussing a simple implementation of this algorithm (applied to a grid-world example) at github.com/qzed/irl-maxent.
Maximum entropy IRL is a clever and ― all in all ― fairly simple solution to the general inverse reinforcement learning problem. However, it does have some drawbacks and limitations:
First and foremost, the algorithm as explained in this post requires us “solving” the MDP once per iteration. As a result, it is fairly limited in the size of the discrete state-action-spaces that it can (reasonably) be applied to. Extensions to continuous state spaces, for example via Monte-Carlo sampling or via path integrals, have been proposed (see e.g., Aghasadeghi & Bretl, 2011; Kalakrishnan et al., 2013). However, the assumption that the transition dynamics of the environment, i.e., the probability distribution , are known, further limits its applicability.
It also assumes a linear parametrization of the reward . This can be somewhat mitigated by our choice of features , but feature expectation matching also only guarantees performance-parity with a linear reward parametrization. An extension to model nonlinear reward functions via neural networks has been suggested by Wulfmeier et al. (2016).
Lastly, maximum entropy IRL is constrained to limited transition randomness, i.e., fairly deterministic transition dynamics . As noted above, this is due to the reward bias introduced by our probability approximation and can be avoided via the maximum causal entropy IRL approach, proposed subsequently by Ziebart (Ziebart, 2010; Ziebart et al., 2013).
A somewhat similar, yet model-free approach is relative entropy IRL by Boularias et al. (2011). See the paper by Snoswell et al. (2020) for a perspective on both relative and maximum entropy IRL approaches.