Introduction
In Reinforcement Learning (RL), value functions define the objectives of the RL problem. There are two very important and closely related value functions,
 the state-action value function (SAVF) ^{1}, and
 the state value function (SVF).
In this post, I’ll show how these value functions (including their optimality versions) can be mathematically formulated as recursive equations, known as Bellman equations.
I’ll assume that you are minimally familiar with Markov Decision Processes (MDPs) and RL. Nevertheless, I will review the most important RL and mathematical prerequisites to understand this post, so that the post is as self-contained as possible.
Notation
 Stylized upper case letters (e.g. \(\mathcal{X}\) or \(\mathbb{R}\)) denote vector spaces.
 Upper case letters (e.g. \(X\)) denote random variables.
 Depending on the context, lower case letters (e.g. \(x\)) can denote realizations of random variables, variables of functions, or elements of a vector space.
 Depending on the context, \(\color{blue}{p}(s' \mid s, a)\) can be a shorthand for \(\color{blue}{p}(S'=s' \mid S=s, A=a) \in [0, 1]\), a probability, or \(\color{blue}{p}(s' \mid S=s, A=a)\), a conditional probability distribution.
 \(X=x\) is an event, which is occasionally abbreviated as \(x\).
 \(s' \sim \color{blue}{p}(s' \mid s, a)\) means that \(s'\) is drawn/sampled according to \(\color{blue}{p}\).
Markov Decision Processes
A (discounted) MDP can be defined as a tuple
\[M \triangleq (\mathcal{S}, \mathcal{A}, \mathcal{R}, \color{blue}{p}, \mathscr{r}, \gamma) \tag{1} \label{1},\]where
 \(\mathcal{S}\) is the state space,
 \(\mathcal{A}\) is the action space,
 \(\mathcal{R} \subset \mathbb{R}\) is the reward space,
 \(\color{blue}{p}(s' \mid s, a)\) is the transition model,
 \(\mathscr{r}(s, a) = \mathbb{E}_{p(r \mid s, a)} \left[ R \mid S = s, A = a \right]\) is the expected reward, and
 \(\gamma \in [0, 1]\) is the discount factor.
Markov Property
MDPs assume that the Markov property holds, i.e. the future is independent of the past given the present. The Markov property is encoded in \(\color{blue}{p}(s' \mid s, a)\) and \(\mathscr{r}(s, a)\).
Finite MDPs
A finite MDP is an MDP where the state \(\mathcal{S}\), action \(\mathcal{A}\) and reward \(\mathcal{R}\) spaces are finite sets. In that case, \(\color{blue}{p}(s' \mid s, a)\) can be viewed as a probability mass function (pmf), and the random variables associated with states, actions and rewards are discrete.
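Concretely, a finite MDP can be stored as plain arrays. The following sketch uses NumPy, with a made-up 2-state, 2-action MDP whose numbers are purely illustrative:

```python
import numpy as np

# A toy finite MDP with 2 states and 2 actions (all numbers are illustrative).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # p(s' | s=0, a) for a = 0, 1
    [[0.5, 0.5], [0.0, 1.0]],   # p(s' | s=1, a) for a = 0, 1
])                              # transition model, shape (|S|, |A|, |S|)

R = np.array([[1.0, 0.0],
              [0.0, 2.0]])      # expected reward r(s, a), shape (|S|, |A|)

gamma = 0.9                     # discount factor

# In a finite MDP, each p(. | s, a) is a pmf over next states, so it sums to 1.
assert np.allclose(P.sum(axis=-1), 1.0)
```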
Alternative Formulations
Sometimes, \(\color{blue}{p}(s' \mid s, a)\) is combined with \(\mathscr{r}(s, a)\) to form a joint conditional distribution, \(\color{purple}{p}(s', r \mid s, a)\) (the dynamics of the MDP), from which both \(\color{blue}{p}(s' \mid s, a)\) and \(\mathscr{r}(s, a)\) can be derived.
Specifically, \(\color{blue}{p}(s' \mid s, a)\) can be computed from \(\color{purple}{p}(s', r \mid s, a)\) by marginalizing over \(r\) ^{2} as follows
\[\color{blue}{p}(s' \mid s, a) = \sum_{r \in \mathcal{R}} \color{purple}{p}(s', r \mid s, a). \tag{2} \label{2}\]Similarly, we have
\[\begin{align*} \mathscr{r}(s, a) &= \mathbb{E} \left[ R \mid S = s, A = a \right] \\ &= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \color{purple}{p}(s', r \mid s, a) \\ &= \sum_{r \in \mathcal{R}} r p(r \mid s, a). \end{align*} \tag{3} \label{3}\]
Reinforcement Learning
In RL, we imagine that there is an agent that sequentially interacts with an environment in (discrete) timesteps, where the environment can be modelled as an MDP.
More specifically, at timestep \(t\), the agent is in some state \(s_t \in \mathcal{S}\) ^{3} and takes an action \(a_t \in \mathcal{A}\) with a policy \(\color{red}{\pi}(a \mid s)\), which is a conditional probability distribution over actions given a state, i.e. \(a_t \sim \color{red}{\pi}(a \mid s_t)\). At the next timestep \(t+1\), the environment returns a reward \(r_{t+1} = \mathscr{r}(s_t, a_t)\) and moves to another state \(s_{t+1} \sim \color{blue}{p}(s' \mid s_t, a_t)\); then the agent takes another action \(a_{t+1} \sim \color{red}{\pi}(a \mid s_{t+1})\), gets another reward \(r_{t+2} = \mathscr{r}(s_{t+1}, a_{t+1})\), the environment moves to another state \(s_{t+2}\), and so on. This interaction continues until a maximum timestep \(H\), often called the horizon, is reached. For simplicity, we assume that \(H = \infty\), i.e. a so-called infinite-horizon MDP.
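The interaction loop above can be sketched in a few lines of NumPy. The MDP quantities and the uniform random policy below are illustrative placeholders, and the infinite horizon is truncated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite MDP (2 states, 2 actions) and a uniform random policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p(s' | s, a)
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r(s, a)
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi(a | s)

s = 0                                      # initial state s_t
total_reward = 0.0
for t in range(100):                       # truncate the infinite horizon
    a = rng.choice(2, p=pi[s])             # a_t ~ pi(a | s_t)
    total_reward += r[s, a]                # r_{t+1} = r(s_t, a_t)
    s = rng.choice(2, p=P[s, a])           # s_{t+1} ~ p(s' | s_t, a_t)
```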
In RL, the objective/goal is to find a policy that maximizes the sum of rewards in the long run, i.e. until the horizon \(H\) is reached (if ever). An objective function that formalizes this sum of rewards in the long run is the state-action value function (SAVF), which is, therefore, one function that we might want to optimize.
State-Action Value Function
The state-action value function for a policy \(\color{red}{\pi}(a \mid s)\) is the function \(q_\color{red}{\pi} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\), which is defined as follows
\[q_\color{red}{\pi}(s, a) \triangleq \mathbb{E} \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right], \\ \color{orange}{\forall} s \in \mathcal{S}, \color{orange}{\forall} a \in \mathcal{A} \tag{4}\label{4},\]where
 \(R_{t+k+1}\) is the reward the agent receives at timestep \(t+k+1\),
 \(G_t \triangleq \sum_{k=0}^\infty \gamma^k R_{t+k+1}\) is the return (aka value ^{4}),
 \(\color{red}{\pi}\) is a policy that the agent follows from timestep \(t+1\) onwards,
 \(\gamma \in [0, 1)\) is the discount factor,
 \(S_t = s\) is the state the agent is in at timestep \(t\), and
 \(A_t = a\) is the action taken at timestep \(t\).
Intuitively, \(q_\color{red}{\pi}(s, a)\) is the expected return that the agent gets by following policy \(\color{red}{\pi}\) after having taken action \(a\) in state \(s\) at timestep \(t\).
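Definition \ref{4} suggests a naive Monte Carlo estimator of \(q_\color{red}{\pi}(s, a)\): sample many rollouts that start with \((s, a)\) and then follow \(\color{red}{\pi}\), and average their discounted returns. The MDP, policy, rollout count and truncated horizon below are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite MDP and policy (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p(s' | s, a)
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r(s, a)
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi(a | s)
gamma = 0.9

def rollout_return(s, a, horizon=100):
    """One truncated sample of G_t, starting with (s, a) and then following pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * r[s, a]
        discount *= gamma
        s = rng.choice(2, p=P[s, a])       # s' ~ p(s' | s, a)
        a = rng.choice(2, p=pi[s])         # a' ~ pi(a' | s')
    return g

# Monte Carlo estimate of q_pi(0, 0): an average of sampled returns.
q_hat = np.mean([rollout_return(0, 0) for _ in range(200)])
```

Since every reward here lies in \([0, 2]\) and \(\gamma = 0.9\), the estimate is bounded by \(2 / (1 - \gamma) = 20\).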
Value Function of a Policy
The subscript \(\color{red}{\pi}\) in \(q_\color{red}{\pi}(s, a)\) indicates that \(q_\color{red}{\pi}(s, a)\) is defined in terms of \(\color{red}{\pi}(a \mid s)\): the rewards received in the future, \(R_{t+k+1}\), depend on the actions that we take with \(\color{red}{\pi}(a \mid s)\), but they also depend on the transition model \(\color{blue}{p}(s' \mid s, a)\).
However, \(\color{red}{\pi}\) and \(\color{blue}{p}\) do not appear anywhere inside the expectation in equation \ref{4}. So, for people that only believe in equations, \ref{4} might not be satisfying enough. Luckily, we can express \(q_\color{red}{\pi}(s, a)\) in terms of \(\color{red}{\pi}\) and \(\color{blue}{p}\) by starting from equation \ref{4}, which also leads to a Bellman/recursive equation. So, let’s do it!
Mathematical Prerequisites
The formulation of the value functions as recursive equations (which is the main point of this blog post) uses three main mathematical rules, which are reviewed here for completeness.
Markov Property
If we assume that the Markov property holds, then the following holds
 \[\color{blue}{p}(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = \color{blue}{p}(s_{t+1} \mid s_t, a_t)\]
 \[\mathscr{r}(s, a) = \mathbb{E} \left[ R_{t+1}\mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0 \right] = \mathbb{E} \left[ R_{t+1}\mid s_t, a_t \right]\]
 \[\color{red}{\pi}(a_{t} \mid s_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0 ) = \color{red}{\pi}(a_{t} \mid s_t)\]
Linearity of Expectation (LE)
Let \(X\) and \(Y\) be two discrete random variables and \(p(x, y)\) be their joint distribution, then the expectation of \(X+Y\) is equal to the sum of the expectation of \(X\) and \(Y\), i.e.
\[\begin{align*} \mathbb{E}[X + Y] &= \sum_x \sum_y (x + y) p(x, y) \\ &= \sum_x \sum_y \left( x p(x, y) + y p(x, y) \right) \\ &= \sum_x \sum_y x p(x, y) + \sum_x \sum_y y p(x, y) \\ &= \sum_x x p(x) + \sum_y y p(y) \\ &= \mathbb{E}[X] + \mathbb{E}[Y] \end{align*}\]
Law of Total Expectation (LTE)
The LTE can be formulated as follows. Let \(X\), \(Y\) and \(Z\) be three discrete random variables, \(\mathbb{E}[X \mid Y=y]\) be the expectation of \(X\) given \(Y=y\), and \(p(x, y, z)\) be the joint distribution of \(X\), \(Y\) and \(Z\). So, we have
\[\begin{align*} \mathbb{E}[X \mid Y=y] &= \sum_x x p(x \mid y) \\ &= \sum_x x \frac{p(x, y)}{p(y)} \\ &= \sum_x x \frac{\sum_z p(x, y, z)}{p(y)} \\ &= \sum_x x \frac{\sum_z p(x \mid y, z) p(y, z) }{p(y)} \\ &= \sum_z \frac{p(y, z)}{p(y)} \sum_x x p(x \mid y, z) \\ &= \sum_z p(z \mid y) \sum_x x p(x \mid y, z) \\ &= \sum_z p(z \mid y) \mathbb{E}[X \mid Y=y, Z=z]. \end{align*}\]This also applies to other more complicated cases, i.e. more conditions, or even to the simpler case of \(\mathbb{E}[X]\).
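The LTE derivation above can be checked numerically on a random joint pmf (the supports and the conditioning value \(y = 1\) are arbitrary choices for the check):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random joint pmf p(x, y, z) over small finite supports.
p = rng.random((3, 4, 5))
p /= p.sum()
x_vals = np.arange(3.0)   # the values that X takes

y = 1                     # condition on the event Y = y
p_y = p[:, y, :].sum()    # p(y), by marginalizing over x and z

# Left-hand side: E[X | Y=y] = sum_x x p(x | y).
lhs = (x_vals * p[:, y, :].sum(axis=1)).sum() / p_y

# Right-hand side: sum_z p(z | y) E[X | Y=y, Z=z].
rhs = 0.0
for z in range(5):
    p_z_given_y = p[:, y, z].sum() / p_y
    e_x_given_yz = (x_vals * p[:, y, z]).sum() / p[:, y, z].sum()
    rhs += p_z_given_y * e_x_given_yz

assert np.isclose(lhs, rhs)   # the two sides agree
```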
State-Action Bellman Equation
We are finally ready to express \(\ref{4}\) as a recursive equation.
We can decompose the return \(G_t\) into the first reward \(R_{t+1}\), received after having taken action \(A_t = a\) in state \(S_t = s\), and the rewards that we will receive in the next time steps, then we can apply the LE, LTE (multiple times) and Markov property, i.e.
\[\begin{align*} q_\color{red}{\pi}(s, a) &= \mathbb{E} \left[ R_{t+1} + \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2} \mid S_t = s, A_t = a\right] \\ &= \mathbb{E} \left[ R_{t+1} \mid S_t = s, A_t = a \right] + \gamma \mathbb{E} \left[ G_{t+1} \mid S_t = s, A_t = a\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', S_t = s, A_t = a\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s'\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s') \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \sum_{a'} \color{red}{\pi}(a' \mid s') \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', A_{t+1}=a'\right] \\ &= \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \sum_{a'} \color{red}{\pi}(a' \mid s') q_\color{red}{\pi}(s', a') \tag{5}\label{5}, \end{align*}\]where
 \(\gamma G_{t+1} = \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2} = \sum_{k=1}^\infty \gamma^k R_{t+k+1}\),
 \(\mathscr{r}(s, a) \triangleq \mathbb{E}_{p(r \mid s, a)} \left[ R_{t+1} \mid S_t = s, A_t = a \right] = \sum_{r} p(r \mid s, a) r\) is the expected reward of taking action \(a\) in \(s\) and \(p(r \mid s, a)\) is the reward distribution ^{5},
 \(v_\color{red}{\pi}(s') \triangleq \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s'\right]\) is the state value function (SVF), and
 \(q_\color{red}{\pi}(s', a') \triangleq \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', A_{t+1}=a'\right]\).
Equation \ref{5} is a recursive equation, given that \(q_\color{red}{\pi}\) is defined in terms of itself (although evaluated at a different state-action pair), and it is known as the state-action Bellman (expectation) equation for \(q_\color{red}{\pi}\).
So, the subscript \(\color{red}{\pi}\) in \(q_\color{red}{\pi}(s, a)\) is used because the state-action value is (also) defined in terms of \(\color{red}{\pi}\).
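Equation \ref{5} immediately suggests a fixed-point iteration for evaluating \(q_\color{red}{\pi}\) in a finite MDP: repeatedly replace the left-hand side with the right-hand side. Below is a sketch with explicit loops mirroring the sums in \ref{5}; the MDP and policy are toy placeholders:

```python
import numpy as np

# Toy finite MDP and policy (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p(s' | s, a)
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # r(s, a)
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi(a | s)
gamma, n_s, n_a = 0.9, 2, 2

Q = np.zeros((n_s, n_a))
for _ in range(500):
    Q_new = np.empty_like(Q)
    for s in range(n_s):
        for a in range(n_a):
            # Equation 5: q(s,a) = r(s,a)
            #   + gamma * sum_{s'} p(s'|s,a) sum_{a'} pi(a'|s') q(s',a')
            backup = 0.0
            for s2 in range(n_s):
                v_s2 = sum(pi[s2, a2] * Q[s2, a2] for a2 in range(n_a))
                backup += P[s, a, s2] * v_s2
            Q_new[s, a] = R[s, a] + gamma * backup
    Q = Q_new
```

Because \(\gamma < 1\), each sweep is a contraction, so `Q` converges to the unique fixed point \(q_\color{red}{\pi}\).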
Alternative Version
Given the relations \ref{2} and \ref{3}, equation \ref{5} can also be expressed in terms of \(\color{purple}{p}(s', r \mid s, a)\) as follows.
\[\begin{align*} q_\color{red}{\pi}(s, a) &= \sum_{r} \underbrace{\sum_{s'} \color{purple}{p}(s', r \mid s, a)}_{p(r \mid s, a)} r + \gamma \sum_{s'} \underbrace{\sum_{r} \color{purple}{p}(s', r \mid s, a)}_{\color{blue}{p}(s' \mid s, a)} v_\color{red}{\pi}(s') \\ &= \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right] \tag{6}\label{6} . \end{align*}\]So, \(q_\color{red}{\pi}(s, a)\) is an expectation with respect to the joint conditional distribution \(\color{purple}{p}(s', r \mid s, a)\), i.e.
\[q_\color{red}{\pi}(s, a) = \mathbb{E}_{\color{purple}{p}(s', r \mid s, a)} \left[ R_{t+1} + \gamma v_{\color{red}{\pi}}(S_{t+1}) \mid S_t = s, A_t =a\right] \tag{7}\label{7} .\]Equation \ref{6} can also be derived from equation \ref{4} by applying the LTE with respect to \(\color{purple}{p}(s', r \mid s, a)\).
Vectorized Form
If the MDP is finite, then we can express the state-action Bellman equation in \ref{5} in a vectorized form
\[\mathbf{Q}_\color{red}{\pi} = \mathbf{R} + \gamma \mathbf{\color{blue}{P}} \mathbf{V}_\color{red}{\pi}, \tag{8}\label{8}\]where
 \(\mathbf{Q}_\color{red}{\pi} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid }\) is a matrix that contains the state-action values for each state-action pair \((s, a)\), so \(\mathbf{Q}_\color{red}{\pi}[s, a] = q_{\color{red}{\pi}}(s, a)\),
 \(\mathbf{R} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid }\) is a matrix with the expected rewards for each state-action pair \((s, a)\), so \(\mathbf{R}[s, a] = \mathscr{r}(s, a)\),
 \(\mathbf{\color{blue}{P}} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid \times \mid \mathcal{S} \mid}\) is a 3-dimensional tensor that contains the transition probabilities for each triple \((s, a, s')\), so \(\mathbf{\color{blue}{P}}[s, a, s'] = \color{blue}{p}(S'=s' \mid S=s, A=a)\), and
 \(\mathbf{V}_\color{red}{\pi} \in \mathbb{R}^{\mid \mathcal{S} \mid}\) is a vector that contains the state values (as defined in equation \ref{5}) for each state \(s'\), so \(\mathbf{V}_\color{red}{\pi}[s'] = v_\color{red}{\pi}(s')\).
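The vectorized backup in equation \ref{8} is essentially one line of NumPy. The sketch below iterates it to a fixed point on the same kind of toy arrays used throughout (all values illustrative):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p(s'|s,a), shape (|S|, |A|, |S|)
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # r(s,a),    shape (|S|, |A|)
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi(a|s),   shape (|S|, |A|)
gamma = 0.9

Q = np.zeros_like(R)
for _ in range(500):
    # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
    V = np.einsum('sa,sa->s', pi, Q)
    # Equation 8: Q = R + gamma * P V  (contract P over the next-state axis)
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
```

Here `einsum` makes the index contractions explicit: `'sat,t->sa'` sums \(\mathbf{\color{blue}{P}}[s, a, s'] \mathbf{V}_\color{red}{\pi}[s']\) over \(s'\), exactly as in \ref{8}.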
Optimal State-Action Value Function
In RL, the goal is to find/estimate an optimal policy, \(\color{green}{\pi_*}\), i.e. one that, if followed, maximizes the expected return. For a finite MDP, there is a unique optimal state-action value function, which can be denoted by \(q_{\color{green}{\pi_*}}(s, a)\) or just \(q_\color{green}{*}(s, a)\), from which an optimal policy can be derived.
By definition, the optimal stateaction value function is
\[q_{\color{green}{\pi_*}}(s, a) \triangleq \operatorname{max}_\color{red}{\pi} q_\color{red}{\pi}(s, a), \\ \color{orange}{\forall} s \in \mathcal{S}, \color{orange}{\forall} a \in \mathcal{A} \tag{9}\label{9} .\]For a discounted infinitehorizon MDP, the optimal policy is deterministic ^{6} and stationary ^{7}, and it’s any greedy policy with respect to \(q_{\color{green}{\pi_*}}(s, a)\), i.e.
\[\color{green}{\pi_*}(s) \in \operatorname{arg max}_a q_{\color{green}{\pi_*}}(s, a), \\ \color{orange}{\forall} s \in \mathcal{S} \tag{10}\label{10}\]Here, \(\in\) is used because there can be more than one optimal policy for an MDP, given that two or more actions can be optimal in a state.
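Given \(q_{\color{green}{\pi_*}}\), extracting a greedy deterministic policy as in \ref{10} is a one-liner; the `q_star` array below is a made-up placeholder, not a computed optimal value function:

```python
import numpy as np

# Placeholder optimal state-action values for 2 states and 3 actions.
q_star = np.array([[1.0, 3.0, 2.0],
                   [0.5, 0.5, 0.4]])

# pi_*(s) in argmax_a q_*(s, a). NumPy's argmax breaks ties by taking the
# lowest index: in state 1, actions 0 and 1 are equally good, so 0 is picked.
pi_star = q_star.argmax(axis=1)
```

The tie in state 1 is exactly why equation \ref{10} uses \(\in\) rather than \(=\): any tie-breaking rule yields an optimal policy.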
State-Action Bellman Optimality Equation
Equation \ref{9} can also be written as a recursive equation, known as the state-action Bellman optimality equation.
\[\begin{align*} q_{\color{green}{\pi_*}}(s, a) &\triangleq \operatorname{max}_\color{red}{\pi} q_\color{red}{\pi}(s, a) \\ &= \operatorname{max}_\color{red}{\pi} \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right] \\ &= \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma \operatorname{max}_\color{red}{\pi} v_\color{red}{\pi}(s') \right] \\ &= \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma \operatorname{max}_{a'}q_{\color{green}{\pi_*}}(s', a') \right] \\ &= \mathbb{E}_{\color{purple}{p}(s', r \mid s, a)} \left[ R_{t+1} + \gamma v_{\color{green}{\pi_*}}(S_{t+1}) \mid S_t = s, A_t =a\right] \tag{11}\label{11}, \end{align*}\]where \(v_\color{green}{\pi_*}(s') = \operatorname{max}_\color{red}{\pi} v_\color{red}{\pi}(s') = \operatorname{max}_{a'}q_{\color{green}{\pi_*}}(s', a')\).
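Iterating the backup in equation \ref{11} to a fixed point is value iteration on \(q\), sketched below on the usual toy MDP (illustrative numbers; note the policy no longer appears, only the \(\operatorname{max}\) over next actions):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p(s' | s, a)
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # r(s, a)
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Equation 11: q_*(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) max_a' q_*(s',a')
    Q = R + gamma * P @ Q.max(axis=1)

pi_star = Q.argmax(axis=1)   # any greedy policy w.r.t. q_* is optimal
```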
State Bellman Equation
Like \(q_\color{red}{\pi}(s, a)\), the state value function \(v_\color{red}{\pi}(s)\) can also be written as a recursive equation by starting from its definition and then applying the LTE rule, the linearity of the expectation and the Markov property. So, for completeness, let’s do it.
\[\begin{align*} v_\color{red}{\pi}(s) &\triangleq \mathbb{E} \left[ G_t \mid S_t=s\right] \\ &= \mathbb{E} \left[ R_{t+1} + \gamma G_{t+1} \mid S_t=s \right] \\ &= \mathbb{E} \left[ R_{t+1} \mid S_t=s \right] + \gamma \mathbb{E} \left[ G_{t+1} \mid S_t=s \right] \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \mathbb{E} \left[ G_{t+1} \mid S_t = s, A_t = a\right] \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1} = s'\right] \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s') \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \left ( \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s') \right) \\ &= \sum_{a} \color{red}{\pi}(a \mid s) \left ( \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right] \right) \tag{12}\label{12}. \end{align*}\]
Conclusion
In conclusion, value functions define the objectives of an RL problem. They can be written as recursive equations, known as Bellman equations, in honor of Richard Bellman, who made significant contributions to the theory of dynamic programming (DP), which is related to RL.
More specifically, DP is an approach that can be used to solve MDPs (i.e. to find \(\color{green}{\pi_*}\)) when \(\color{blue}{p}\) is available (and not only MDPs ^{8}): the solution to a problem is computed by combining the solutions to subproblems, and the Bellman equation reflects exactly this idea, i.e. \(q_\color{red}{\pi}(s, a)\) is computed as a function of the “subproblem” \(q_\color{red}{\pi}(s', a')\). The problem is that \(\color{blue}{p}\) is rarely available, hence the need for RL. In any case, DP algorithms, like policy iteration (PI), and RL algorithms, like Q-learning, are related: both assume that the environment can be modeled as an MDP, and both attempt to estimate the same optimal value functions.

It’s also called the action value function, but I prefer to call it the state-action value function because this name reminds us that it is a function of a state and an action. ↩

Let \(X\) and \(Y\) be two discrete random variables and \(p(x, y)\) be their joint distribution. The marginal distribution of \(X\) or \(Y\) can be found as \(p(x) = \sum_y p(x, y)\) and \(p(y) = \sum_x p(x, y)\), respectively. ↩

The subscript \(t\) in the object \(x_t\) is used to emphasize that \(x\) is associated with the time step \(t\). ↩

Hence the name value function. ↩

If the reward is deterministic, then \(p(r \mid s, a)\) gives probability \(1\) to one reward and \(0\) to all other rewards. ↩

\(\pi(a \mid s)\) can also be used to describe deterministic policies by giving a probability of \(1\) to one action in \(s\) and a probability of \(0\) to all other actions. A deterministic policy might also be defined as a function \(\pi : \mathcal{S} \rightarrow \mathcal{A}\), so \(\pi(s) = a\) is the (only) action taken by \(\pi\) in \(s\). ↩

A policy \(\color{red}{\pi}(a \mid s)\) is stationary if it doesn’t change over time steps, i.e. \(\color{red}{\pi}(a \mid S_{t} = s) = \color{red}{\pi}(a \mid S_{t+1} = s), \forall t, \forall s \in \mathcal{S}\); in other words, the probabilities of selecting an action do not change from time step to time step. You can think of a non-stationary policy as a set of policies. ↩

For more info about the dynamic programming approach, I recommend that you read the corresponding chapter in the book Introduction to Algorithms (3rd edition) by Thomas H. Cormen et al. ↩