Optimal value function of scaled rewards


Consider the following Bellman optimality equation (BOE) (equation 3.20 of Sutton & Barto's Reinforcement Learning: An Introduction, 2nd edition, p. 64)

\[q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')}q_*(s',a')) \tag{1}\label{1}.\]

If we multiply all rewards by the same constant \(c \in \mathbb{R}\) with \(c > 0\), then the new optimal state-action value function is given by

\[cq_*(s, a).\]


To prove this, we need to show that the following BOE

\[c q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(c r + \gamma \max_{a'\in\mathcal{A}(s')} c q_*(s',a')) \tag{2}\label{2}\]

is equivalent to the BOE in equation \ref{1}.

Given that \(c > 0\), we have

\[\max_{a'\in\mathcal{A}(s')} c q_*(s',a') = c\max_{a'\in\mathcal{A}(s')}q_*(s',a'),\]

so \(c\) can be taken out of the \(\operatorname{max}\) operator.
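As a quick numeric sanity check of this identity (the action values and the constant below are made up for illustration):

```python
# Check that max_{a'} c * q(s', a') == c * max_{a'} q(s', a') for c > 0.
q = [1.5, -0.3, 0.7]   # hypothetical action values q_*(s', a') for three actions
c = 2.5                # hypothetical positive scaling constant

assert max(c * x for x in q) == c * max(q)
```

The identity fails for negative \(c\), which is exactly the case treated at the end of this answer.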

Therefore, equation \ref{2} becomes

\[\begin{align*} c q_*(s,a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(c r + \gamma c \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}c p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &= c \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &\iff \\ q_*(s,a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')), \end{align*} \tag{3}\label{3}\]

where the last step follows from dividing both sides by \(c > 0\). The resulting equation is exactly the Bellman optimality equation in \ref{1}, which implies that, when the rewards are given by \(cr\), \(c q_*(s,a)\) is the solution to the Bellman optimality equation.


Consequently, whenever we multiply the reward function by some positive constant, which can be viewed as a form of reward shaping\(^1\), the set of optimal policies does not change\(^2\).
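This conclusion can be verified numerically with value iteration on a small MDP. The sketch below is illustrative: the transition tensor `P`, the expected-reward matrix `R`, and the constant `c = 3.0` are all made up, and expected rewards `R[s, a]` are used in place of the full sum over \((s', r)\) in equation \ref{1}, which is equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical random MDP: P[s, a] is a distribution over next states,
# R[s, a] is the expected immediate reward for taking a in s.
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.standard_normal((nS, nA))

def q_star(R, iters=2000):
    """Value iteration: iterate the Bellman optimality operator to convergence."""
    q = np.zeros((nS, nA))
    for _ in range(iters):
        q = R + gamma * P @ q.max(axis=1)  # (nS, nA, nS) @ (nS,) -> (nS, nA)
    return q

c = 3.0
q1 = q_star(R)       # optimal q for rewards r
q2 = q_star(c * R)   # optimal q for scaled rewards c * r

assert np.allclose(q2, c * q1)                          # value function scales by c
assert (q2.argmax(axis=1) == q1.argmax(axis=1)).all()   # greedy policy is unchanged
```

Since the Bellman optimality operator is a \(\gamma\)-contraction, 2000 iterations are far more than enough for the fixed point to be reached to floating-point precision.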

What if the constant is zero or negative?

For completeness, if \(c = 0\), then equation \ref{2} reduces to \(0 = 0\), which holds trivially: with all rewards zero, the optimal value function is identically zero and every policy is optimal.

If \(c < 0\), then \(\max_{a'\in\mathcal{A}(s')} c q_*(s',a') = c\min_{a'\in\mathcal{A}(s')}q_*(s',a')\), so the same derivation as in \ref{3} instead yields

\[q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \min_{a'\in\mathcal{A}(s')} q_*(s',a')),\]

which is not the Bellman optimality equation in \ref{1}: the \(\max\) has turned into a \(\min\), so \(c q_*(s,a)\) need not solve the BOE of the scaled problem, and the set of optimal policies can change (maximizing the negated rewards amounts to minimizing the original ones).
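A minimal example of the policy changing: a one-state, two-action MDP (i.e. a bandit) with \(\gamma = 0\), so that \(q_*(s,a)\) is just the expected immediate reward. The action names and reward values below are made up.

```python
# Hypothetical two-armed bandit: q_*(s, a) equals the immediate reward r(a).
rewards = {"left": 1.0, "right": 2.0}

# Greedy action under the original rewards.
best = max(rewards, key=rewards.get)

# Greedy action after scaling all rewards by c = -1.
best_scaled = max(rewards, key=lambda a: -1.0 * rewards[a])

assert best == "right" and best_scaled == "left"  # the optimal policy flipped
```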

  1. A seminal paper on reward shaping is Policy invariance under reward transformations: Theory and application to reward shaping (1999) by Andrew Y. Ng et al. 

  2. There can be more than one optimal policy for a given optimal value function (and Markov Decision Process) because we might have \(q_*(s,a_1) = q_*(s,a_2) \geq q_*(s,a), \text{for } a_1, a_2 \in \mathcal{A}(s) \text{ and } \forall a \in \mathcal{A}(s)\). Any greedy policy with respect to the optimal value function \(q_*(s,a)\) is an optimal policy.