Maximum Entropy Inverse Reinforcement Learning
Disclaimer: This post is an adapted version of the theoretical part of a Jupyter Notebook that I wrote in 2019 for a seminar on Imitation Learning. You can find it, an accompanying presentation, and a small framework at github.com/qzed/irlmaxent.
I highly encourage you to have a look at that notebook in case you're interested in the code behind this theory.
Let us start with some context and a brief overview of the problem we're trying to solve: Reinforcement learning (RL) aims to provide a framework for finding the optimal behavior (i.e., an optimal policy) of intelligent agents regarding some environment they interact with by directing them via a reward signal, measuring their performance. In inverse reinforcement learning (IRL), somewhat conversely, we try to recover the reward function (and therefore policy) of an agent given a set of demonstrations performed by that agent. At first, this may seem to stand somewhat opposed to the larger goal of creating intelligent agents, however, we can use the demonstration of experts ― often humans ― to improve our agents by guiding them to a better solution.
An important aspect of IRL is the recovery of the expert's reward function, which in many classical reinforcementlearning scenarios is often handdesigned. This also sets it apart from behavioral cloning (BC), another category of methods under the larger imitation learning (IL) umbrella (see Osa et al., 2018 for an overview), which focuses on retrieving the expert's policy instead of the reward. The general idea is that by analyzing the actions taken by an expert, we can infer a better reward function than we could manually design.
Maximum entropy (MaxEnt) inverse reinforcement learning (Ziebart et al., 2008) is a powerful and clever, yet fairly simple way to achieve this. Having discussed the “why”, let us now get to the “how”:
While I assume some familiarity with the basics of reinforcement and inverse reinforcement learning, here's a short overview: In both settings, we describe our world using a Markov decision process (MDP). An MDP is a 5tuple $(S, A, T, R, \gamma)$ with states $S$, actions $A$, transition model $T(s_t, a_t, s_{t+1}) = p(s_{t+1} \mid s_t, a_t)$, reward function $R(s_t, a_t, s_{t+1}) \in \mathbb{R}$, and reward discount factor $\gamma \in [0, 1]$, decreasing the reward over time (where $a_t \in A$ and $s_t, s_{t+1} \in S$). In the inverse reinforcement learning setting, we will only use perstate rewards $R(s) \in \mathbb{R}$. The key part of an MDP is that any action relies solely on the current state, meaning complete independence of all preceding history.
The goal of reinforcement learning then is to find an optimal sequence of actions, i.e., a policy describing the optimal action per state, for an autonomous agent given a reward signal $R$. The reward signal provides feedback to the agent, affirming its behavior on actions that help further the goal of the task at hand, and punishing it on actions that are counterproductive to this goal. Policies come in two flavors: deterministic and stochastic. Deterministic policies are defined as function $\pi: S \to A$, whereas stochastic policies are defined as probability distributions over actions, i.e., $\pi(a \mid s)$. While in the usual RL setting the (or at least one) optimal policy is guaranteed to be deterministic, in inverse reinforcement learning (and not only there), it makes more sense to talk about stochastic policies.
Inverse reinforcement learning, as discussed above and the name suggests, does the inverse of RL. Given a set of demonstrations $\mathcal{D} = \{\tau_i\}_{i=1}^{n}$ of an expert (agent), consisting of trajectories $\tau = \left( (s_1, a_1), (s_2, a_2), \ldots, (s_{n1}, a_{n1}), s_n \right)$ through the stateaction space, IRL aims to recover the underlying reward function $R$, which explains the behavior of the expert.
To allow for computation with states, we use socalled features. Features are, in this case, defined as a function $\phi: S \to \mathbb{R}^d$, mapping states to $d$dimensional vectors, and can be arbitrarily chosen. In general, we assume that $\phi(\tau) = \sum_{t = 1}^{\tau} \phi(s_t)$ where $\tau$ denotes the length of the trajectories (visited states), i.e., that the features of a trajectory $\tau$ are the sum of the features of the states visited by this trajectory, counting repetitions.
The general idea of Maximum Entropy Inverse Reinforcement Learning is based on featureexpectation matching (Abbeel & Ng, 2004): The expected visitationfrequency of features in the demonstrations provided by the expert should equal the expected visitationfrequency of features visited by the agent following the recovered reward function. In other words: The features describing the states of our world should be, in expectation, visited equally often by the agent following our recovered reward function and by the expert demonstrating (nearly) optimal behavior. In equations, we can express this as the expectation of features $\phi$ over trajectories $\tau$ derived from the reward we want to recover, and
where the learner follows a policy $\pi^L$ directly implied by the reward. The expert follows an implicit policy $\pi^E$ that encodes its notion of reward and optimality.
Note that we can also express the expected feature visitation frequency as sum over trajectories
with the probability distribution $p_{\pi^L}(\tau)$ of trajectories of the learner. Finding such a match, however, is not a wellposed problem in accordance to the definition by Hadamard: There is not the single one, but many possible solutions. Multiple reward functions can achieve the same expected featurevisitation frequency (Abbeel & Ng, 2004; Ng et al., 1999; Ziebart, 2010). To resolve this ambiguity, Ziebart et al. (Ziebart et al., 2008) propose to choose the solution (i.e., the $p_{\pi^L}$) maximizing the entropy.
But why should we maximize the entropy? To answer this, we have to understand what entropy represents. Firstly, the entropy $H(p)$ of some event probability distribution $p$ is defined as
where $X$ is the set containing all events $x$ and $p(x)$ the probability of event $x$ occurring.
Let us now assume that we want to communicate the occurrence of all events $x$ with some other party efficiently. To accomplish this, we wish to minimize the expected message length, so messages that will be sent frequently should be short, and messages that will seldom be sent are allowed to be long. Following this trail of thought leads to the term $h(x) = \log_2(1/p(x))$, which is the optimal number of bits that we should spend on the message representing event $x$. With lower probability $p(x)$, the term $\log_2(1/p(x))$ increases. In fact, it increases logarithmically, as with more bits available, we can encode exponentially more events. Using this observation for the entropy $H$ then yields nothing more than the entropy being the lower limit of the expected message length over all messages, i.e.:
Summarizing all of this in a single sentence: Entropy is a measure of uncertainty.
Note: This is, in fact, the underlying principle of Huffman coding, which, due to this relation, is guaranteed to be optimal.
As is done generally, we will now use the normal logarithm instead of the basetwo one. Since $\log_2(x) = \log(x)/\log(2)$ holds, we are only dropping a constant factor of $\log(2)$ which does not impact extrema or our reasoning below, and simplifies further computations.
With that in mind, let us ask again: Why should we maximize the entropy? Now we can give a simple reasoning (Jaynes, 1957): Let us assume that we have two distributions, $p$ and $q$, with entropy values $H(p) > H(q)$. From our previous observations, we now know that to encode messages from $p$ we, in expectancy, need more bits than to encode messages from $q$. This is the key point because this in turn also means that we have less information about $p$ than about $q$. If we know more about a distribution, we require fewer bits to encode events from it. If we know everything, we do not need to send/encode anything, and if we know nothing, we need to send everything.
But isn't it good if we know more? “The less entropy, the better, right?” you may ask. No. Not for many optimization problems, or in general, when working with incomplete information: Consider a problem, like featureexpectation matching, which has multiple solutions fulfilling our constraints. The only information which we give our solver for this problem is the featureexpectations we want to replicate. All solutions provided by our solver contain this information. Yet, they most certainly have different entropy values. This must mean that (at least) some solutions have additional information that we did not provide, i.e., they have a bias. This is illustrated in the figure below for our two distributions $p$ and $q$:
By choosing the solution with maximum entropy, we choose the solution with minimal information. Note that all solutions already satisfy our constraints; thus we actually pick the solution that best fits our information with a minimal bias.
Applying our findings from the previous section to featureexpectation matching directly results in a constrained optimization problem:
where Equation (1a) represents the featurematching constraint and Equation (1b) the probability constraints.
This problem expresses that we want to find a probability distribution $p(\tau)$ over trajectories $\tau$ with maximum entropy, under which the expected feature visitation frequencies of learner $\mathbb{E}_{\pi^L}[\phi(\tau)]$ and expert $\mathbb{E}_{\pi^E}[\phi(\tau)]$ match (that $\phi(\tau)$ represents the features visited by trajectory $\tau$). Note that the policy $\pi^L$ of the learner depends on the probability distribution $p(\tau)$ with
i.e., the policy of the learner is proportional to the sum of the probabilities of all trajectories performing action $a$ in the starting state $s$.
Now we have the probability distribution over trajectories, but how does this help us to recover the reward function $R: S \to \mathbb{R}$? Following the school of basic reinforcement learning provides us with the answer: The reward dictates the policy, which in turn dictates the trajectory distribution $p(\tau)$. A key assumption of Maximum Entropy IRL is that we know the behavior of our MDP, i.e., we have a perfect model representing the transition probabilities $p(s_{t+1} \mid s_t, a_t)$, so the only thing we actually need to learn is the reward. Everything else either is derived from it or assumed known. To be able to learn the reward, we must parameterize it. For this, we assume linearity regarding features, i.e.,
with $\omega \in \mathbb{R}^d$ being the reward parameter vector. To make the dependency on the reward parameters explicit, we write $p(\tau \mid \omega)$ and $\pi(a \mid s, \omega)$.
Note: The choice of a linear parametrization is not by chance, but a direct result of the feature expectation matching constraint: Given an expert solving an MDP with unknown reward function linear in a set of known features, performing feature expectation matching with those features is guaranteed to yield a policy performing comparable to or better than the expert with respect to that reward (Abbeel & Ng, 2004).
Solving the above mathematical program for MDPs with deterministic transition dynamics (i.e., MDPs where all actions have predetermined deterministic outcomes) using the method of Lagrange multipliers leads to the observation that
with the socalled partition function $Z(\omega)$ simply normalizing the values to fulfill the probability distribution constraint (Osa et al., 2018; Ziebart et al., 2008). By choosing the reward to linearly depend on the features, we have also ensured that the Lagrange multipliers are precisely the reward parameters $\omega$ (Osa et al., 2018). For MDPs with stochastic transitions, Ziebart et al. (Ziebart et al., 2008) propose to modify the deterministic solution, yielding the approximation
Note: We are simply multiplying the deterministic solution with combined transition probability of the trajectory. As a result of said modification, however, we introduce a bias to the reward function (Osa et al., 2018). To avoid this, Ziebart (Ziebart, 2010) proposes the Maximum Causal Entropy IRL method in his thesis (see also Ziebart et al., 2013). In the scope of this post, we'll stick to the normal (noncausal) entropy.
We now know the parameterized form of the distribution, so how can we optimize the parameters? The answer is fairly straightforward: We can maximize the loglikelihood over all demonstrated trajectories $\mathcal{D} = \{\tau_i\}_{i=1}^{N}$, i.e.,
using gradient ascent (or any other gradientbased optimization technique) as this function is convex (Osa et al., 2018; Ziebart et al., 2008).
Note: Maximizing the loglikelihood is equivalent to minimizing the KullbackLeibler divergence (Bishop, 2006)
It can therefore also be interpreted as the Mprojection onto the manifold of the maximum entropy distribution (see e.g., Osa et al., 2018).
The (averaged) gradient is then simply the difference between expected empirical feature counts $\mathbb{E}_{\pi^E} \left[ \phi(\tau) \right]$ and the learner's expected feature counts (derived from the trajectory probabilities $p(\tau \mid \omega)$), shown in Equation (5a), i.e.:
which we can express in terms of the statevisitation frequency $D_{s_i}$ with Equation (5b), assuming $\phi(\tau) = \sum_{t = 0}^{\tau} \phi(s_t)$ (Ziebart et al., 2008).
Note: The realization that we can transform Equation (5a) to Equation (5b), i.e., represent the expected feature count of the learner via the statevisitation frequency $D_{s_i}$, is actually one of the most important steps!
Equation (5a) is generally infeasible (or at the very least quite costly) to compute, as enumerating all trajectories is usually subject to combinatorial explosion (think, for example, of trajectories in a “simple” grid world). Computing the state visitation frequencies, as we will see below, however, is quite feasible (within some restrictions).
Note: Due to the loglikelihood, this gradient holds for both deterministic and stochastic transition dynamics.
Now, we have only one remaining question: How do we compute the statevisitation frequency $D_{s_i}$? Following good computerscience fashion, we divide and conquer: First, we compute a (stochastic) policy $\pi_{\text{ME}}(a \mid s, \omega)$, referred to as the local action probability, with a backward pass. Thereafter, we compute the statevisitation frequency under this policy with a simple forward pass.
The basic idea of the first pass is to remember our earlier observation about the learner policy $\pi^L(a \mid s)$ made in Equation (2), i.e.,
and then recursively expand this using the solution we derived for our trajectory distribution $p(\tau \mid \omega)$ presented in Equation (3) until we are not reasoning about full or partial trajectories any longer, but the individual state transitions. This then yields the set of equations (Ziebart et al., 2008):
Here, $Z_{s_i}$ represents the statepartitionfunction, again normalizing the probabilities, and $Z_{s_i, a_i}$ is referred to as the stateaction partition function. Setting the statepartition function $Z_{s_i} = 1$ for terminal states and then recursively backing up, using the equations for $Z_{s_i}$ and $Z_{s_i, a_j}$, from those states yields a set of values that we can then use to compute the local action probabilities following the above equation.
Note: This backuppass somewhat resembles a valueiteration scheme to solve an MDP. It also turns out that this backup part of the algorithm is the only thing that differs between Maximum Entropy and Maximum Causal Entropy IRL.
In fact, for the Maximum Causal Entropy approach, one can show that the partition functions are an analogy of the Bellman equation — the socalled soft Bellman equation — which allows for a straightforward extension of the valueiteration algorithm to this problem (Ziebart, 2010; Ziebart et al., 2013).
As we now know the policy $\pi_{\text{ME}}(a_j \mid s_i, \omega)$, inferring the statevisitation frequency $D_{s_i}$ becomes fairly easy: Remember that the policy is a probability distribution that tells us, given a state $s_i$, with which probability the agent should take a certain action $a_j$, which then leads to a new state $s_k$ according to our MDP transition model. Given an initial statevisitation frequency, we can therefore use the policy to forwardpropagate the individual visitation counts, obtaining an updated statevisitation frequency map.
To this end, we initialize $D_{s_i, 0}$ to the startingstate probabilities $p_0(s_i)$ at timestep $t=0$, i.e.,
We then compute updates as mentioned above via
where $p(s_k \mid a_j, s_i)$ represents the transition probabilities of our (stochastic) MDP. Note, that $D_{s_i, t}$ only gives us the statevisitation frequency for timestep $t$, meaning we have to sum them up to obtain the timestep independent statevisitation
With a plan for computation of the statevisitation frequency, we now have everything we need to state the full algorithm. To derive that, however, let us first recap our findings:

Our problem is a constrained optimization problem for finding the optimal reward parameters $\omega$. We can solve this problem by maximizing the loglikelihood over all demonstrated trajectories $\mathcal{D}$ (see Equation (4)). This, in turn, is itself an optimization problem, for which we need the gradient $\nabla \mathcal{L}(\omega)$ of this loglikelihood.

To compute the gradient $\nabla \mathcal{L}(\omega)$, we require the statevisitation frequency (see Equation (5b)).

To compute the statevisitation frequency, we require the policy $\pi_{\text{ME}} (a_j \mid s_i, \omega)$ (see Equation (7b)).
Putting this all together yields the following iteration scheme:

Compute policy $\pi_{\text{ME}} (a_j \mid s_i, \omega)$ the using current estimate of $\omega$ via the backward pass, i.e., Equations (6a–c).

Compute the statevisitation frequency $D_{s_i}$ via the forward pass, i.e., Equations (7a–c).

Compute the gradient $\nabla \mathcal{L}(\omega)$ via Equation (5b).

Perform one gradientbased optimization step, e.g.,
$\omega \gets \omega + \eta \nabla \mathcal{L}(\omega).$
We then iterate these four steps until convergence.
Note: You can find a Jupyter Notebook discussing a simple implementation of this algorithm (applied to a gridworld example) at github.com/qzed/irlmaxent.
Maximum entropy IRL is a clever and ― all in all ― fairly simple solution to the general inverse reinforcement learning problem. However, it does have some drawbacks and limitations:
First and foremost, the algorithm as explained in this post requires us “solving” the MDP once per iteration. As a result, it is fairly limited in the size of the discrete stateactionspaces that it can (reasonably) be applied to. Extensions to continuous state spaces, for example via MonteCarlo sampling or via path integrals, have been proposed (see e.g., Aghasadeghi & Bretl, 2011; Kalakrishnan et al., 2013). However, the assumption that the transition dynamics of the environment, i.e., the probability distribution $p(s_{t+1} \mid s_t, a_t)$, are known, further limits its applicability.
It also assumes a linear parametrization of the reward $R(s) = \omega^\intercal \phi(s)$. This can be somewhat mitigated by our choice of features $\phi(s)$, but feature expectation matching also only guarantees performanceparity with a linear reward parametrization. An extension to model nonlinear reward functions via neural networks has been suggested by Wulfmeier et al. (2016).
Lastly, maximum entropy IRL is constrained to limited transition randomness, i.e., fairly deterministic transition dynamics $p(s_{t+1} \mid s_t, a_t)$. As noted above, this is due to the reward bias introduced by our probability approximation and can be avoided via the maximum causal entropy IRL approach, proposed subsequently by Ziebart (Ziebart, 2010; Ziebart et al., 2013).
A somewhat similar, yet modelfree approach is relative entropy IRL by Boularias et al. (2011). See the paper by Snoswell et al. (2020) for a perspective on both relative and maximum entropy IRL approaches.