## Theorem

Consider the following Bellman optimality equation (BOE) for the optimal state-action value function (equation 3.20 in Sutton & Barto, *Reinforcement Learning: An Introduction*, 2nd edition, p. 64)

$q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')}q_*(s',a')) \tag{1}\label{1}.$

If we multiply all rewards by the same constant $$c \in \mathbb{R}$$ with $$c > 0$$, then the new optimal state-action value function is given by

$cq_*(s, a).$

## Proof

To prove this, we need to show that the following BOE, obtained from \ref{1} by replacing each reward $$r$$ with $$cr$$ and the value function $$q_*$$ with $$c q_*$$,

$c q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(c r + \gamma \max_{a'\in\mathcal{A}(s')} c q_*(s',a')) \tag{2}\label{2}$

is equivalent to the BOE in equation \ref{1}.

Given that $$c > 0$$, then

$\max_{a'\in\mathcal{A}(s')} c q_*(s',a') = c\max_{a'\in\mathcal{A}(s')}q_*(s',a'),$

so $$c$$ can be taken out of the $$\max$$ operator (multiplication by a positive constant preserves the ordering of the action values).
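As a quick sanity check, here is a minimal Python sketch of this identity (the action values below are made-up numbers for illustration):

```python
# Positive scaling commutes with max: max_a' c*q(s',a') = c * max_a' q(s',a') for c > 0.
q = [1.5, -2.0, 3.0]  # hypothetical action values q_*(s', a') for some state s'
c = 2.5               # positive scaling constant
assert max(c * v for v in q) == c * max(q)

# For c < 0 this fails: the max turns into a min (see the last section).
c_neg = -1.0
assert max(c_neg * v for v in q) == c_neg * min(q)
```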

Therefore, equation \ref{2} becomes

\begin{align*} c q_*(s,a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(c r + \gamma c \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}c p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &= c \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &\iff \text{(dividing both sides by } c > 0) \\ q_*(s,a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')), \end{align*} \tag{3}\label{3}

which is exactly the Bellman optimality equation in \ref{1}. This implies that, when the reward is given by $$cr$$, $$c q_*(s,a)$$ is the solution to the Bellman optimality equation.
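The scaling property can also be checked numerically. Below is a minimal sketch using Q-value iteration on a hypothetical two-state, two-action MDP (the transition and reward arrays are made up for illustration); it confirms that scaling the rewards by $$c > 0$$ scales the optimal values by $$c$$ and leaves the greedy (optimal) policy unchanged:

```python
import numpy as np

# Hypothetical toy MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [-0.5, 2.0]])
gamma = 0.9

def optimal_q(R, P, gamma, iters=2000):
    """Q-value iteration: q(s,a) <- R(s,a) + gamma * sum_s' P(s,a,s') * max_a' q(s',a')."""
    q = np.zeros_like(R)
    for _ in range(iters):
        q = R + gamma * P @ q.max(axis=1)
    return q

c = 3.0
q1 = optimal_q(R, P, gamma)      # q_* for the original rewards
q2 = optimal_q(c * R, P, gamma)  # optimal q for the scaled rewards

assert np.allclose(q2, c * q1)                         # values scale by c
assert (q2.argmax(axis=1) == q1.argmax(axis=1)).all()  # same greedy policy
```

Since every iterate of value iteration scales exactly by $$c$$ (by induction, using the max identity above), the two runs agree up to the factor $$c$$ even before convergence.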

## Interpretation

Consequently, whenever we multiply the reward function by a positive constant, which can be viewed as a form of reward shaping [1], the set of optimal policies does not change [2].

## What if the constant is zero or negative?

For completeness, if $$c=0$$, then \ref{2} reduces to the trivially true statement $$0=0$$: all rewards become zero, so every policy is optimal.

If $$c < 0$$, then $$\max_{a'\in\mathcal{A}(s')} c q_*(s',a') = c\min_{a'\in\mathcal{A}(s')}q_*(s',a')$$, so, after dividing both sides by $$c$$, equation \ref{2} becomes

$q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \min_{a'\in\mathcal{A}(s')} q_*(s',a')),$

which is not, in general, the Bellman optimality equation in \ref{1}. So, when the rewards are multiplied by a negative constant, $$c q_*(s,a)$$ is generally not the optimal value function of the scaled problem, and the set of optimal policies can change.
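A small numerical sketch illustrates the failure for $$c < 0$$. On a hypothetical two-state, two-action toy MDP (the arrays below are made up for illustration), the optimal values of the negated-reward problem are not $$c q_*$$, and the greedy policy changes:

```python
import numpy as np

# Hypothetical toy MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [-0.5, 2.0]])
gamma = 0.9

def optimal_q(R, P, gamma, iters=2000):
    # Q-value iteration: q(s,a) <- R(s,a) + gamma * sum_s' P(s,a,s') * max_a' q(s',a')
    q = np.zeros_like(R)
    for _ in range(iters):
        q = R + gamma * P @ q.max(axis=1)
    return q

c = -1.0
q_orig = optimal_q(R, P, gamma)        # q_* for the original rewards
q_scaled = optimal_q(c * R, P, gamma)  # optimal q for the negated rewards

# c * q_* does NOT solve the scaled problem's BOE ...
assert not np.allclose(q_scaled, c * q_orig)
# ... and the optimal (greedy) policy differs in at least one state.
assert (q_scaled.argmax(axis=1) != q_orig.argmax(axis=1)).any()
```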

[1] A seminal paper on reward shaping is *Policy invariance under reward transformations: Theory and application to reward shaping* (1999) by Andrew Y. Ng et al.

[2] There can be more than one optimal policy for a given optimal value function (and Markov decision process) because we might have $$q_*(s,a_1) = q_*(s,a_2) \geq q_*(s,a) \text{ for some } a_1, a_2 \in \mathcal{A}(s) \text{ and all } a \in \mathcal{A}(s)$$. Any policy that is greedy with respect to the optimal value function $$q_*(s,a)$$ is an optimal policy.