Optimal value function of shifted rewards


Consider the following Bellman optimality equation (BOE) (equation 3.20 of Sutton & Barto book on RL, 2nd edition, p. 64)

\[q_*(s,a) =\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right)\tag{1}\label{1}.\]

If we add the same constant \(c \in \mathbb{R}\) to all rewards \(r \in \mathcal{R}\), then the new optimal state-action value function is given by

\[q_*(s, a) + k,\]


\[k = \frac{c}{1 - \gamma} = c\left(\frac{1}{1 - \gamma}\right) = c \left( \sum_{i=0}^{\infty} \gamma^{i} \right) = c \left( 1 + \gamma + \gamma^2 + \gamma^3 + \dots \right),\]

where \(0 \leq \gamma < 1\) is the discount factor of the MDP and \(\sum_{i=0}^{\infty} \gamma^{i}\) is a geometric series 1.


  • \(0 \leq \gamma < 1\); if we allowed \(\gamma = 1\), then \(\frac{c}{1 - \gamma} = c/0\), which is undefined.

  • For episodic problems 2, we assume that we have an absorbing state \(s_\text{absorbing}\) 3, which is the state that the agent moves to after it has reached the goal, where the agent gets a reward of \(0\) for all future time steps. So, \(q_*(s_\text{absorbing}, a) =0, \forall a \in\mathcal{A}(s_\text{absorbing})\).


To show this, we need to show that the following equation is equivalent to the BOE in \ref{1}.

\[q_*(s,a) + k = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left((r + c) + \gamma \max_{a' \in\mathcal{A}(s')} \left( q_*(s',a') + k \right) \right) \tag{2}\label{2}\]

Given that \(k = \frac{c}{1 - \gamma}\) is a constant, it does not affect the max, because we add this constant to all state-action values: this holds even if \(c\) is negative! So, we can take \(k\) out of the max and add it to \(\max_{a'\in\mathcal{A}(s')} q_*(s',a')\)

\[\begin{align*} q_*(s,a) + k &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left((r + c) + \gamma \left (k + \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \right) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left((r + c) + \frac{c \gamma}{1 - \gamma} + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left(r + \frac{c(1 - \gamma) + c \gamma}{1 - \gamma} + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left(r + \frac{c - c\gamma + c \gamma}{1 - \gamma} + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} \left ( p(s',r \mid s,a)\frac{c}{1 - \gamma} \right) + \\ & \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} \left( p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \right). \tag{3}\label{3} \end{align*}\]

Given that \(p(s',r \mid s,a)\) is a probability distribution, then \(\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} \left ( p(s',r \mid s,a)\frac{c}{1 - \gamma} \right)\) is the expectation of the constant \(\frac{c}{1 - \gamma}\), which is equal to the constant itself.

So, equation \ref{3} becomes

\[q_*(s,a) + \frac{c}{1 - \gamma} = \frac{c}{1 - \gamma} + \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \\ \iff \\ q_*(s,a) =\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right)\]

which is the Bellman optimality equation \ref{1}.


The result above suggests that, if we add a constant to all rewards, which is a form of reward shaping, the set of optimal policies does not change.

Is this always true? Yes, in theory.

However, we must be careful with episodic problems.

  • In theory, after we shift the rewards by \(c\), the agent will precisely get an additional reward of \(k = \frac{c}{1 - \gamma}\) for being in any state, including the absorbing state, and taking any action. So, after we shift the rewards, we have \(q_*(s_\text{absorbing}, a) = \frac{c}{1 - \gamma}, \forall a \in\mathcal{A}(s_\text{absorbing})\).

  • In practice, one might mis-specify the reward functions, if we shift the rewards and terminate the episode once the agent gets to \(s_\text{absorbing}\).


To illustrate this issue, let’s say that, for an episodic problem (for example, a problem where the agent is in a grid and needs to go to a goal location), we have the following (deterministic) reward function

\[r(s, a) = \begin{cases} 1, \text{if } s = s_\text{goal}\\ 0, \text{if } s = s_\text{absorbing}\\ 0, \text{otherwise} \\ \end{cases}\]

\(s_\text{absorbing}\) is just the state that we assume the agent moves to after having reached the goal state, so that he continues to get a reward of \(0\), which is an assumption that we make that allows us to terminate the episode once we get to \(s_\text{goal}\).

Now, let’s say that we define a new reward function as \(r'(s, a) \triangleq r(s, a) - 1\), i.e.

\[r'(s, a) = \begin{cases} 0, \text{if } s = s_\text{goal}\\ -1, \text{if } s = s_\text{absorbing}\\ -1, \text{otherwise} \\ \end{cases}\]

So, in theory, you don’t get a reward of \(0\) anymore after the agent gets to the goal with \(r'\). If you terminate the episode once the agent gets to the goal, this will not be taken into account, i.e. if you terminate the episode once the agent got to the goal, then you assume that \(r'(s_\text{absorbing}, a) = 0, \forall a \in \mathcal{A}(s_\text{absorbing})\), i.e. you’re actually optimizing

\[r''(s, a) = \begin{cases} 0, \text{if } s = s_\text{goal}\\ 0, \text{if } s = s_\text{absorbing}\\ -1, \text{otherwise} \\ \end{cases}\]

So, in practice, you might be optimizing a different objective function than the one you implicitly or unconsciously assumed. In this example, \(r''(s, a)\) is the reward function that encourages the agent to get to the goal as quickly as possible (because you get a penalty of \(-1\) for every time step that you have not reached the goal), so, in practice, \(r''(s, a)\) might be what you want to optimize, but, in general, you must be careful with reward misspecification 4!

  1. \(k\) is also written as a geometric series to emphasize that \(k\) is similar to the discounted return, which is defined as \(G_t = \sum_{i=0}^{\infty} \gamma^{i}R_{t+1+i}\), where \(R_{t+1+i}\) is the reward at time step \(t+1+i\). If all rewards were equal to \(c\), then \(G_t = \frac{c}{1 - \gamma}\). 

  2. See sections 3.3 and 3.4. (p. 54) of Sutton & Barto book (2nd edition) for more details about the difference between episodic and continuing problems and how they can be unified. 

  3. The assumption of having an absorbing state is also made in Policy invariance under reward transformations: Theory and application to reward shaping (1999) by Andrew Y. Ng et al., which is a seminal paper on reward shaping, which cites the book Theory of Games and Economic Behavior (1944) by John von Neumann et al. to support the claim that, for single-step decisions (which I assume to be some kind of bandit problem), positive linear transformations of the utility function do not change the optimal decision/policy: if we combine the theorem in this blog post and the theorem in my previous blog post, we get a similar result. 

  4. This idea of reward misspecification has been studied in the literature. For example, in the paper, Inverse Reward Design (2017) by Dylan Hadfield-Menell et al. the authors propose an approach to deal with proxy reward functions (i.e. the reward functions designed by the human, which might not be the reward functions that the human intended to define).