Bellman Equations


Introduction

In Reinforcement Learning (RL), value functions define the objectives of the RL problem. There are two very important and closely related value functions,

  • the state-action value function (SAVF) 1, and
  • the state value function (SVF).

In this post, I’ll show how these value functions (including their optimality versions) can be mathematically formulated as recursive equations, known as Bellman equations.

I’ll assume that you are minimally familiar with Markov Decision Processes (MDPs) and RL. Nevertheless, I will review the most important RL and mathematical prerequisites to understand this post, so that it is as self-contained as possible.

Notation

  • Stylized upper case letters (e.g. \(\mathcal{X}\) or \(\mathbb{R}\)) denote sets, such as state or vector spaces.
  • Upper case letters (e.g. \(X\)) denote random variables.
  • Depending on the context, lower case letters (e.g. \(x\)) can denote realizations of random variables, variables of functions, or elements of a vector space.
  • Depending on the context, \(\color{blue}{p}(s' \mid s, a)\) can be a shorthand for \(\color{blue}{p}(S'=s' \mid S=s, A=a) \in [0, 1]\), a probability, or \(\color{blue}{p}(s' \mid S=s, A=a)\), a conditional probability distribution.
  • \(X=x\) is an event, which is occasionally abbreviated as \(x\).
  • \(s' \sim \color{blue}{p}(s' \mid s, a)\) means that \(s'\) is drawn/sampled according to \(\color{blue}{p}\).

Markov Decision Processes

A (discounted) MDP can be defined as a tuple

\[M \triangleq (\mathcal{S}, \mathcal{A}, \mathcal{R}, \color{blue}{p}, \mathscr{r}, \gamma) \tag{1} \label{1},\]

where
  • \(\mathcal{S}\) is the state space,
  • \(\mathcal{A}\) is the action space,
  • \(\mathcal{R} \subset \mathbb{R}\) is the reward space,
  • \(\color{blue}{p}(s' \mid s, a)\) is the transition model,
  • \(\mathscr{r}(s, a) = \mathbb{E}_{p(r \mid s, a)} \left[ R \mid S = s, A = a \right]\) is the expected reward, and
  • \(\gamma \in [0, 1]\) is the discount factor.

Markov Property

MDPs assume that the Markov property holds, i.e. the future is independent of the past given the present. The Markov property is encoded in \(\color{blue}{p}(s' \mid s, a)\) and \(\mathscr{r}(s, a)\).

Finite MDPs

A finite MDP is an MDP where the state \(\mathcal{S}\), action \(\mathcal{A}\) and reward \(\mathcal{R}\) spaces are finite sets. In that case, \(\color{blue}{p}(s' \mid s, a)\) can be viewed as a probability mass function (pmf) and the random variables associated with states, actions and rewards are discrete.
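To make the finite case concrete, here is a minimal sketch (assuming NumPy and a made-up toy MDP) of how such an MDP can be represented in code: the transition model as a \(\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid \times \mid \mathcal{S} \mid\) array of probabilities and the expected reward as a \(\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid\) array. The names P and R and all the numbers are illustrative assumptions, not part of any particular library.

```python
import numpy as np

# A made-up toy finite MDP with |S| = 3 states and |A| = 2 actions.
n_states, n_actions = 3, 2
gamma = 0.9

# Transition model p(s' | s, a) as an array of shape (|S|, |A|, |S|):
# each slice P[s, a] is a pmf over next states, so it must sum to 1.
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

# Expected reward r(s, a) as an array of shape (|S|, |A|).
R = rng.random((n_states, n_actions))

assert np.allclose(P.sum(axis=-1), 1.0)  # every P[s, a] is a valid pmf
```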

Alternative Formulations

Sometimes, \(\color{blue}{p}(s' \mid s, a)\) is combined with \(\mathscr{r}(s, a)\) to form a joint conditional distribution, \(\color{purple}{p}(s', r \mid s, a)\) (the dynamics of the MDP), from which both \(\color{blue}{p}(s' \mid s, a)\) and \(\mathscr{r}(s, a)\) can be derived.

Specifically, \(\color{blue}{p}(s' \mid s, a)\) can be computed from \(\color{purple}{p}(s', r \mid s, a)\) by marginalizing over \(r\) 2 as follows

\[\color{blue}{p}(s' \mid s, a) = \sum_{r \in \mathcal{R}} \color{purple}{p}(s', r \mid s, a). \tag{2} \label{2}\]

Similarly, we have

\[\begin{align*} \mathscr{r}(s, a) &= \mathbb{E} \left[ R \mid S = s, A = a \right] \\ &= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \color{purple}{p}(s', r \mid s, a) \\ &= \sum_{r \in \mathcal{R}} r p(r \mid s, a). \end{align*} \tag{3} \label{3}\]
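As a sanity check of relations \ref{2} and \ref{3}, here is a small sketch (assuming the joint dynamics are stored as a made-up 4-dimensional NumPy array indexed by \((s, a, s', r)\)) that recovers the transition model and the expected reward by marginalization.

```python
import numpy as np

# Made-up joint dynamics p(s', r | s, a) for a toy MDP with 2 states,
# 2 actions and 3 possible reward values, stored as an array indexed
# by (s, a, s', r).
rewards = np.array([-1.0, 0.0, 1.0])            # the (finite) reward space
rng = np.random.default_rng(1)
joint = rng.random((2, 2, 2, 3))
joint /= joint.sum(axis=(2, 3), keepdims=True)  # normalize over (s', r)

# Equation (2): marginalize over r to get the transition model p(s' | s, a).
P = joint.sum(axis=3)                           # shape (|S|, |A|, |S|)

# Equation (3): marginalize over s' to get p(r | s, a), then take the
# expectation over r to get the expected reward r(s, a).
p_r = joint.sum(axis=2)                         # shape (|S|, |A|, |R|)
R_expected = p_r @ rewards                      # shape (|S|, |A|)
```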

Reinforcement Learning

In RL, we imagine that there is an agent that sequentially interacts with an environment in (discrete) time-steps, where the environment can be modelled as an MDP.

More specifically, at time-step \(t\), the agent is in some state \(s_t \in \mathcal{S}\) 3 and takes an action \(a_t \in \mathcal{A}\) with a policy \(\color{red}{\pi}(a \mid s)\), which is a conditional probability distribution over actions given a state, i.e. \(a_t \sim \pi(a \mid s_t)\). At the next time step \(t+1\), the environment returns a reward \(r_{t+1} = \mathscr{r}(s_t, a_t)\), and it moves to another state \(s_{t+1} \sim \color{blue}{p}(s' \mid s_t, a_t)\), then the agent takes another action \(a_{t+1} \sim \color{red}{\pi}(a \mid s_{t+1})\), gets another reward \(r_{t+2} = \mathscr{r}(s_{t+1}, a_{t+1})\), and the environment moves to another state \(s_{t+2}\), and so on. This interaction continues until a maximum time-step \(H\), which is often called horizon, is reached. For simplicity, we assume that \(H = \infty\), so we assume a so-called infinite-horizon MDP.
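The interaction described above can be sketched as a simple simulation loop. This is only an illustration: the arrays P, R and pi, the helper rollout and the toy numbers are all assumptions of this sketch, not a standard environment API.

```python
import numpy as np

# A sketch of the agent-environment interaction loop described above.
def rollout(P, R, pi, s0, horizon, rng):
    """Simulate `horizon` steps from state s0 and return the collected rewards."""
    s, rewards = s0, []
    for _ in range(horizon):
        a = rng.choice(pi.shape[1], p=pi[s])      # a_t ~ pi(a | s_t)
        rewards.append(R[s, a])                   # r_{t+1} = r(s_t, a_t)
        s = rng.choice(P.shape[-1], p=P[s, a])    # s_{t+1} ~ p(s' | s_t, a_t)
    return rewards

# Example usage with a random toy MDP.
rng = np.random.default_rng(2)
P = rng.random((3, 2, 3)); P /= P.sum(axis=-1, keepdims=True)
R = rng.random((3, 2))
pi = np.full((3, 2), 0.5)                         # uniform random policy
print(rollout(P, R, pi, s0=0, horizon=5, rng=rng))
```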

In RL, the objective/goal is to find a policy that maximizes the sum of rewards in the long run, i.e. until the horizon \(H\) is reached (if ever reached). An objective function that formalizes this sum of rewards in the long run is the state-action value function (SAVF), which is, therefore, one function that we might want to optimize.

State-Action Value Function

The state-action value function for a policy \(\color{red}{\pi}(a \mid s)\) is the function \(q_\color{red}{\pi} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\), which is defined as follows

\[q_\color{red}{\pi}(s, a) \triangleq \mathbb{E} \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right], \\ \color{orange}{\forall} s \in \mathcal{S}, \color{orange}{\forall} a \in \mathcal{A} \tag{4}\label{4},\]

where

  • \(R_{t+k+1}\) is the reward the agent receives at time-step \(t+k+1\),
  • \(G_t \triangleq \sum_{k=0}^\infty \gamma^k R_{t+k+1}\) is the return (aka value 4),
  • \(\color{red}{\pi}\) is a policy that the agent follows from time-step \(t+1\) onwards,
  • \(\gamma \in [0, 1)\) is the discount factor,
  • \(S_t = s\) is the state the agent is in at time-step \(t\), and
  • \(A_t = a\) is the action taken at time-step \(t\).

Intuitively, \(q_\color{red}{\pi}(s, a)\) is the expected return that the agent gets by following policy \(\color{red}{\pi}\) after having taken action \(a\) in state \(s\) at time-step \(t\).
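One way to make this intuition concrete is a (truncated) Monte Carlo estimate of \(q_\color{red}{\pi}(s, a)\): fix the first action to \(a\), follow \(\color{red}{\pi}\) afterwards, and average the discounted returns over many simulated rollouts. The sketch below assumes the toy array representation used earlier (P, R, pi) and truncates the infinite sum, so it is only an approximation.

```python
import numpy as np

# A minimal Monte Carlo sketch of q_pi(s, a): take action a in state s,
# then follow pi and average the (truncated) discounted returns.
def mc_q_estimate(P, R, pi, s, a, gamma=0.9, n_rollouts=2000, max_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    returns = []
    for _ in range(n_rollouts):
        g, discount = 0.0, 1.0
        state, action = s, a                            # the first action is fixed to a
        for _ in range(max_steps):                      # truncate the infinite horizon
            g += discount * R[state, action]
            discount *= gamma
            state = rng.choice(n_states, p=P[state, action])
            action = rng.choice(n_actions, p=pi[state])  # then follow pi
        returns.append(g)
    return np.mean(returns)
```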

Value Function of a Policy

The subscript \(\color{red}{\pi}\) in \(q_\color{red}{\pi}(s, a)\) indicates that \(q_\color{red}{\pi}(s, a)\) is defined in terms of \(\color{red}{\pi}(a \mid s)\) because the rewards received in the future, \(R_{t+k+1}\), depend on the actions that we take with \(\color{red}{\pi}(a \mid s)\), but they also depend on the transition model \(\color{blue}{p}(s' \mid s, a)\).

However, \(\color{red}{\pi}\) and \(\color{blue}{p}\) do not appear anywhere inside the expectation in equation \ref{4}. So, for people who only believe in equations, \ref{4} might not be satisfying enough. Luckily, we can express \(q_\color{red}{\pi}(s, a)\) in terms of \(\color{red}{\pi}\) and \(\color{blue}{p}\) by starting from equation \ref{4}, which also leads to a Bellman/recursive equation. So, let’s do it!

Mathematical Prerequisites

The formulation of the value functions as recursive equations (which is the main point of this blog post) uses three main mathematical rules, which are reviewed here for completeness.

Markov Property

If we assume that the Markov property holds, then the following holds

  1. \[\color{blue}{p}(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = \color{blue}{p}(s_{t+1} \mid s_t, a_t)\]
  2. \[\mathscr{r}(s_t, a_t) = \mathbb{E} \left[ R_{t+1}\mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0 \right] = \mathbb{E} \left[ R_{t+1}\mid s_t, a_t \right]\]
  3. \[\color{red}{\pi}(a_{t} \mid s_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0 ) = \color{red}{\pi}(a_{t} \mid s_t)\]

Linearity of Expectation (LE)

Let \(X\) and \(Y\) be two discrete random variables and \(p(x, y)\) be their joint distribution, then the expectation of \(X+Y\) is equal to the sum of the expectations of \(X\) and \(Y\), i.e.

\[\begin{align*} \mathbb{E}[X + Y] &= \sum_x \sum_y (x + y) p(x, y) \\ &= \sum_x \sum_y \left( x p(x, y) + y p(x, y) \right) \\ &= \sum_x \sum_y x p(x, y) + \sum_x \sum_y y p(x, y) \\ &= \sum_x x p(x) + \sum_y y p(y) \\ &= \mathbb{E}[X] + \mathbb{E}[Y] \end{align*}\]

Law of Total Expectation (LTE)

The formulation of the LTE is as follows. Let \(X\), \(Y\) and \(Z\) be three discrete random variables, \(\mathbb{E}[X \mid Y=y]\) be the expectation of \(X\) given \(Y=y\), and \(p(x, y, z)\) be the joint distribution of \(X\), \(Y\) and \(Z\). So, we have

\[\begin{align*} \mathbb{E}[X \mid Y=y] &= \sum_x x p(x \mid y) \\ &= \sum_x x \frac{p(x, y)}{p(y)} \\ &= \sum_x x \frac{\sum_z p(x, y, z)}{p(y)} \\ &= \sum_x x \frac{\sum_z p(x \mid y, z) p(y, z) }{p(y)} \\ &= \sum_z \frac{p(y, z)}{p(y)} \sum_x x p(x \mid y, z) \\ &= \sum_z p(z \mid y) \sum_x x p(x \mid y, z) \\ &= \sum_z p(z \mid y) \mathbb{E}[X \mid Y=y, Z=z]. \end{align*}\]

This also applies to other more complicated cases, i.e. more conditions, or even to the simpler case of \(\mathbb{E}[X]\).

State-Action Bellman Equation

We are finally ready to express \(\ref{4}\) as a recursive equation.

We can decompose the return \(G_t\) into the first reward \(R_{t+1}\), received after having taken action \(A_t = a\) in state \(S_t = s\), and the rewards that we will receive in the next time steps, then we can apply the LE, LTE (multiple times) and Markov property, i.e.

\[\begin{align*} q_\color{red}{\pi}(s, a) &= \mathbb{E} \left[ R_{t+1} + \sum_{k=1}^\infty \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right] \\ &= \mathbb{E} \left[ R_{t+1} \mid S_t = s, A_t = a \right] + \gamma \mathbb{E} \left[ G_{t+1} \mid S_t = s, A_t = a\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', S_t = s, A_t = a\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s'\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s') \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \sum_{a'} \color{red}{\pi}(a' \mid s') \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', A_{t+1}=a'\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \sum_{a'} \color{red}{\pi}(a' \mid s') q_\color{red}{\pi}(s', a') \tag{5}\label{5}, \end{align*}\]

where

  • \(\gamma G_{t+1}=\sum_{k=1}^\infty \gamma^k R_{t+k+1} = \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2}\),
  • \(\mathscr{r}(s, a) \triangleq \mathbb{E}_{p(r \mid s, a)} \left[ R_{t+1} \mid S_t = s, A_t = a \right] = \sum_{r} p(r \mid s, a) r\) is the expected reward of taking action \(a\) in \(s\) and \(p(r \mid s, a)\) is the reward distribution 5,
  • \(v_\color{red}{\pi}(s') \triangleq \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s'\right]\) is the state value function (SVF), and
  • \(q_\color{red}{\pi}(s', a') \triangleq \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', A_{t+1}=a'\right]\).

Equation \ref{5} is a recursive equation, given that \(q_\color{red}{\pi}\) is defined in terms of itself (although evaluated at a different state-action pair). It is known as the state-action Bellman (expectation) equation for \(q_\color{red}{\pi}\).

So, the subscript \(\color{red}{\pi}\) in \(q_\color{red}{\pi}(s, a)\) is used because the state-action value is (also) defined in terms of \(\color{red}{\pi}\).
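Because equation \ref{5} is recursive, it can also be turned into an update rule and iterated to a fixed point, which is essentially iterative policy evaluation for \(q_\color{red}{\pi}\). The sketch below assumes the finite-MDP arrays used in the earlier snippets (P, R, pi); it is an illustration of the idea under those assumptions, not a reference implementation.

```python
import numpy as np

# Iterative evaluation of q_pi using equation (5) as an update rule:
# Q[s, a] <- r(s, a) + gamma * sum_{s'} p(s'|s,a) sum_{a'} pi(a'|s') Q[s', a'].
def evaluate_q(P, R, pi, gamma=0.9, tol=1e-8, max_iters=10_000):
    Q = np.zeros_like(R)
    for _ in range(max_iters):
        V = (pi * Q).sum(axis=1)      # v_pi(s') = sum_{a'} pi(a'|s') q_pi(s', a')
        Q_new = R + gamma * P @ V     # equation (5) for every (s, a) at once
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```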

Alternative Version

Given the relations \ref{2} and \ref{3}, equation \ref{5} can also be expressed in terms of \(\color{purple}{p}(s', r \mid s, a)\) as follows.

\[\begin{align*} q_\color{red}{\pi}(s, a) &= \sum_{r} \underbrace{\sum_{s'} \color{purple}{p}(s', r \mid s, a)}_{p(r \mid s, a)} r + \gamma \sum_{s'} \underbrace{\sum_{r} \color{purple}{p}(s', r \mid s, a)}_{\color{blue}{p}(s' \mid s, a)} v_\color{red}{\pi}(s') \\ &= \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right] \tag{6}\label{6} . \end{align*}\]

So, \(q_\color{red}{\pi}(s, a)\) is an expectation with respect to the joint conditional distribution \(\color{purple}{p}(s', r \mid s, a)\), i.e.

\[q_\color{red}{\pi}(s, a) = \mathbb{E}_{\color{purple}{p}(s', r \mid s, a)} \left[ R_{t+1} + \gamma v_{\color{red}{\pi}}(S_{t+1}) \mid S_t = s, A_t =a\right] \tag{7}\label{7} .\]

Equation \ref{6} can also be derived from equation \ref{4} by applying the LTE with respect to \(\color{purple}{p}(s', r \mid s, a)\).
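For completeness, here is a small sketch of equation \ref{6} computed directly from the joint dynamics, assuming the same made-up 4-dimensional array representation used in the marginalization sketch above and a given (pretend) \(v_\color{red}{\pi}\).

```python
import numpy as np

# Equation (6) from the joint dynamics p(s', r | s, a), stored as an
# array `joint` indexed by (s, a, s', r); all values are made up.
rng = np.random.default_rng(4)
n_states, n_actions, gamma = 2, 2, 0.9
rewards = np.array([-1.0, 0.0, 1.0])
joint = rng.random((n_states, n_actions, n_states, len(rewards)))
joint /= joint.sum(axis=(2, 3), keepdims=True)
V = rng.random(n_states)                         # pretend this is v_pi

# target[s', r] = r + gamma * v_pi(s'), then take the expectation under the joint.
target = rewards[None, :] + gamma * V[:, None]   # shape (|S|, |R|)
Q = np.einsum("saxr,xr->sa", joint, target)      # shape (|S|, |A|)
```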

Vectorized Form

If the MDP is finite, then we can express the state-action Bellman equation \ref{5} in a vectorized form (a NumPy sketch follows the list below)

\[\mathbf{Q}_\color{red}{\pi} = \mathbf{R} + \gamma \mathbf{\color{blue}{P}} \mathbf{V}_\color{red}{\pi}, \tag{8}\label{8}\]

where

  • \(\mathbf{Q}_\color{red}{\pi} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid }\) is a matrix that contains the state-action values for each state-action pair \((s, a)\), so \(\mathbf{Q}_\color{red}{\pi}[s, a] = q_{\color{red}{\pi}}(s, a)\),
  • \(\mathbf{R} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid }\) is a matrix with the expected rewards for each state-action pair \((s, a)\), so \(\mathbf{R}[s, a] = \mathscr{r}(s, a)\),
  • \(\mathbf{\color{blue}{P}} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid \times \mid \mathcal{S} \mid}\) is a 3-dimensional array (tensor) that contains the transition probabilities for each triple \((s, a, s')\), so \(\mathbf{\color{blue}{P}}[s, a, s'] = \color{blue}{p}(S'=s' \mid S=s, A=a)\), and
  • \(\mathbf{V}_\color{red}{\pi} \in \mathbb{R}^{\mid \mathcal{S} \mid}\) is a vector that contains the state values (as defined in equation \ref{5}) for each state \(s'\), so \(\mathbf{V}_\color{red}{\pi}[s'] = v_\color{red}{\pi}(s')\).
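Here is a minimal NumPy sketch of equation \ref{8}, assuming made-up toy arrays with the shapes listed above; the einsum spells out the sum over \(s'\) explicitly.

```python
import numpy as np

# Equation (8) with NumPy arrays: Q_pi = R + gamma * P V_pi.
# Q and R have shape (|S|, |A|), P has shape (|S|, |A|, |S|), V has shape (|S|,).
rng = np.random.default_rng(3)
n_states, n_actions, gamma = 4, 2, 0.9

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))
V = rng.random(n_states)                          # pretend this is v_pi

Q = R + gamma * np.einsum("sax,x->sa", P, V)      # sum over s' in equation (8)
assert Q.shape == (n_states, n_actions)
```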

Optimal State-Action Value Function

In RL, the goal is to find/estimate an optimal policy, \(\color{green}{\pi_*}\), i.e. one that, if followed, maximizes the expected return. For a finite MDP, there is a unique optimal state-action value function, which can be denoted by \(q_{\color{green}{\pi_*}}(s, a)\) or just \(q_\color{green}{*}(s, a)\), from which an optimal policy can be derived.

By definition, the optimal state-action value function is

\[q_{\color{green}{\pi_*}}(s, a) \triangleq \operatorname{max}_\color{red}{\pi} q_\color{red}{\pi}(s, a), \\ \color{orange}{\forall} s \in \mathcal{S}, \color{orange}{\forall} a \in \mathcal{A} \tag{9}\label{9} .\]

For a discounted infinite-horizon MDP, there is always an optimal policy that is deterministic 6 and stationary 7, and any policy that is greedy with respect to \(q_{\color{green}{\pi_*}}(s, a)\) is optimal, i.e.

\[\color{green}{\pi_*}(s) \in \operatorname{arg max}_a q_{\color{green}{\pi_*}}(s, a), \\ \color{orange}{\forall} s \in \mathcal{S} \tag{10}\label{10}\]

Here, \(\in\) is used because there can be more than one optimal policy for an MDP given that there can be two or more actions that are optimal in a state.
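In code, extracting such a greedy policy from a finite state-action value function is a one-liner; the sketch below assumes \(q_{\color{green}{\pi_*}}\) is stored as a made-up \(\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid\) NumPy array and breaks ties by picking the first maximizing action.

```python
import numpy as np

# A greedy (deterministic) policy derived from Q, as in equation (10);
# ties are broken by taking the first maximizing action.
def greedy_policy(Q):
    return np.argmax(Q, axis=1)          # pi_*(s) in argmax_a q_*(s, a)

Q = np.array([[1.0, 2.0],
              [0.5, 0.5],                # two optimal actions: argmax picks action 0
              [3.0, 1.0]])
print(greedy_policy(Q))                  # prints [1 0 0]
```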

State-Action Bellman Optimality Equation

Equation \ref{9} can also be written as a recursive equation, known as the state-action Bellman optimality equation.

\[\begin{align*} q_{\color{green}{\pi_*}}(s, a) &\triangleq \operatorname{max}_\color{red}{\pi} q_\color{red}{\pi}(s, a) \\ &= \operatorname{max}_\color{red}{\pi} \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right] \\ &= \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma \operatorname{max}_\color{red}{\pi} v_\color{red}{\pi}(s') \right] \\ &= \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma \operatorname{max}_{a'}q_{\color{green}{\pi_*}}(s', a') \right] \\ &= \mathbb{E}_{\color{purple}{p}(s', r \mid s, a)} \left[ R_{t+1} + \gamma v_{\color{green}{\pi_*}}(S_{t+1}) \mid S_t = s, A_t =a\right] \tag{11}\label{11}, \end{align*}\]

where \(v_\color{green}{\pi_*}(s') = \operatorname{max}_\color{red}{\pi} v_\color{red}{\pi}(s') = \operatorname{max}_{a'}q_{\color{green}{\pi_*}}(s', a')\). Note that \(\operatorname{max}_\color{red}{\pi}\) can be moved inside the sums because a single policy \(\color{green}{\pi_*}\) simultaneously maximizes \(v_\color{red}{\pi}(s')\) for every state \(s'\).
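Like equation \ref{5}, the optimality equation \ref{11} can be iterated as an update rule; doing so is known as (Q-)value iteration. The sketch below assumes the finite-MDP arrays P and R used in the earlier snippets and is only an illustration of the idea.

```python
import numpy as np

# Q-value iteration: apply the Bellman optimality equation (11) as an
# update rule until convergence.
def q_value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    Q = np.zeros_like(R)
    for _ in range(max_iters):
        V = Q.max(axis=1)             # v_*(s') = max_{a'} q_*(s', a')
        Q_new = R + gamma * P @ V     # equation (11) for every (s, a)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```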

State Bellman Equation

Like \(q_\color{red}{\pi}(s, a)\), the state value function \(v_\color{red}{\pi}(s)\) can also be written as a recursive equation by starting from its definition and then applying the LTE rule, the linearity of the expectation and the Markov property. So, for completeness, let’s do it.

\[\begin{align*} v_\color{red}{\pi}(s) &\triangleq \mathbb{E} \left[ G_t \mid S_t=s\right] \\ &= \mathbb{E} \left[ R_{t+1} + \gamma G_{t+1} \mid S_t=s \right] \\ &= \mathbb{E} \left[ R_{t+1} \mid S_t=s \right] + \gamma \mathbb{E} \left[ G_{t+1} \mid S_t=s \right] \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \mathbb{E} \left[ G_{t+1} \mid S_t = s, A_t = a\right] \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1} = s'\right] \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s') \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \left ( \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s') \right) \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \left ( \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right] \right) \tag{12}\label{12}. \end{align*}\]
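Equation \ref{12} can likewise be iterated as an update rule to evaluate \(v_\color{red}{\pi}\); the sketch below assumes the finite-MDP arrays (P, R, pi) used in the earlier snippets.

```python
import numpy as np

# Iterative policy evaluation for v_pi using equation (12) as an update rule.
def evaluate_v(P, R, pi, gamma=0.9, tol=1e-8, max_iters=10_000):
    V = np.zeros(R.shape[0])
    for _ in range(max_iters):
        # sum_a pi(a|s) ( r(s,a) + gamma * sum_{s'} p(s'|s,a) v(s') )
        V_new = (pi * (R + gamma * P @ V)).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```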

Conclusion

In conclusion, value functions define the objectives of an RL problem. They can be written as recursive equations, known as Bellman equations, in honor of Richard Bellman, who made significant contributions to the theory of dynamic programming (DP), which is related to RL.

More specifically, DP is an approach that can be used to solve MDPs (i.e. to find \(\color{green}{\pi_*}\)) when \(\color{blue}{p}\) is available, although it is not limited to MDPs 8. In DP, the solution to a problem is computed by combining the solutions to subproblems, and the Bellman equation reflects exactly this idea: \(q_\color{red}{\pi}(s, a)\) is computed as a function of the “subproblem” \(q_\color{red}{\pi}(s', a')\). The problem is that \(\color{blue}{p}\) is rarely available, hence the need for RL. In any case, DP algorithms, like policy iteration (PI), and RL algorithms, like Q-learning, are related: both assume that the environment can be modeled as an MDP, and both attempt to estimate the same optimal value function.

  1. It’s also called action value function, but I prefer to call it state-action value function because it reminds us that this is a function of a state and action. 

  2. Let \(X\) and \(Y\) be two discrete random variables and \(p(x, y)\) be their joint distribution. The marginal distribution of \(X\) or \(Y\) can be found as \(p(x) = \sum_y p(x, y)\) and \(p(y) = \sum_x p(x, y)\), respectively. 

  3. The subscript \(t\) in the object \(x_t\) is used to emphasize that \(x\) is associated with the time step \(t\). 

  4. Hence the name value function. 

  5. If the reward is deterministic, then \(p(r \mid s, a)\) gives probability \(1\) to one reward and \(0\) to all other rewards. 

  6. \(\pi(a \mid s)\) can also be used to describe deterministic policies by giving a probability of \(1\) to one action in \(s\) and a probability of \(0\) to all other actions. A deterministic policy might also be defined as a function \(\pi : \mathcal{S} \rightarrow \mathcal{A}\), so \(\pi(s) = a\) is the (only) action taken by \(\pi\) in \(s\). 

  7. A policy \(\color{red}{\pi}(a \mid s)\) is stationary if it doesn’t change over time steps, i.e. \(\color{red}{\pi}(a \mid S_{t} = s) = \color{red}{\pi}(a \mid S_{t+1} = s), \forall t, \forall s \in \mathcal{S}\), in other words, the probabilities of selecting an action do not change from time step to time step. You can think of a non-stationary policy as a set of policies. 

  8. For more info about the dynamic programming approach, I recommend that you read the corresponding chapter in the book Introduction to Algorithms (3rd edition) by Thomas H. Cormen et al.