Optimal value function of scaled rewards


Theorem

Consider the following Bellman optimality equation (BOE) (equation 3.20, p. 64, of Sutton & Barto, *Reinforcement Learning: An Introduction*, 2nd edition)

$$q_*(s, a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r \mid s, a) \left( r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right). \tag{1}$$

If we multiply all rewards by the same constant $c > 0$ ($c \in \mathbb{R}$), then the new optimal state-action value function is given by

$$c q_*(s, a).$$

Proof

To prove this, we need to show that the following BOE

$$c q_*(s, a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r \mid s, a) \left( c r + \gamma \max_{a' \in \mathcal{A}(s')} c q_*(s', a') \right) \tag{2}$$

is equivalent to the BOE in equation 1.

Given that $c > 0$, we have

$$\max_{a' \in \mathcal{A}(s')} c q_*(s', a') = c \max_{a' \in \mathcal{A}(s')} q_*(s', a'),$$

so $c$ can be taken out of the max operator.
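
As a quick sanity check of this identity with made-up numbers, take $c = 2$ and two actions with values $q_*(s', a_1) = 1$ and $q_*(s', a_2) = 3$:

$$\max\{2 \cdot 1,\; 2 \cdot 3\} = 6 = 2 \cdot \max\{1, 3\}.$$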

Therefore, equation 2 becomes

$$\begin{aligned}
c q_*(s, a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r \mid s, a) \left( c r + \gamma c \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right) \\
&= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} c \, p(s', r \mid s, a) \left( r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right) \\
&= c \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r \mid s, a) \left( r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right) \\
\iff q_*(s, a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r \mid s, a) \left( r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right),
\end{aligned} \tag{3}$$

which is equal to the Bellman optimality equation in equation 1. This implies that, when the reward is given by $c r$, $c q_*(s, a)$ is the solution to the Bellman optimality equation.

Interpretation

Consequently, whenever we multiply the reward function by some positive constant, which can be viewed as a form of reward shaping [1], the set of optimal policies does not change [2].
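
This invariance is easy to check numerically. Below is a minimal sketch (not from the original post) that builds a small random MDP, runs Q-value iteration on the original expected rewards and on the rewards multiplied by $c > 0$, and verifies both that the resulting optimal values equal $c\, q_*(s, a)$ and that the greedy actions coincide. It uses the equivalent expected-reward form of the BOE, and all names (`n_states`, `q_value_iteration`, and so on) are made up for this illustration.

```python
import numpy as np

# Hypothetical toy MDP: p[s, a, s'] are transition probabilities and r[s, a] are
# expected immediate rewards (expected-reward form of the BOE is used below).
rng = np.random.default_rng(0)
n_states, n_actions, gamma, c = 4, 3, 0.9, 5.0
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)          # normalise into probability distributions
r = rng.random((n_states, n_actions))

def q_value_iteration(p, r, gamma, n_iter=2000):
    """Repeatedly apply the Bellman optimality operator for q until convergence."""
    q = np.zeros_like(r)
    for _ in range(n_iter):
        # q(s,a) = r(s,a) + γ Σ_s' p(s'|s,a) max_a' q(s',a')
        q = r + gamma * p @ q.max(axis=1)
    return q

q_star = q_value_iteration(p, r, gamma)
q_star_scaled = q_value_iteration(p, c * r, gamma)

print(np.allclose(q_star_scaled, c * q_star))                          # expected: True
print((q_star_scaled.argmax(axis=1) == q_star.argmax(axis=1)).all())   # expected: True
```

The greedy actions coincide because multiplying all entries of $q_*$ by a positive constant does not change which action attains the maximum in each state.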

What if the constant is zero or negative?

For completeness, if $c = 0$, then equation 2 reduces to $0 = 0$, which is trivially true (with all rewards equal to zero, the optimal value function is zero everywhere, and so is $c\, q_*(s, a)$).

If $c < 0$, then $\max_{a' \in \mathcal{A}(s')} c q_*(s', a') = c \min_{a' \in \mathcal{A}(s')} q_*(s', a')$, so equation 3 becomes

$$q_*(s, a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r \mid s, a) \left( r + \gamma \min_{a' \in \mathcal{A}(s')} q_*(s', a') \right),$$

which is not equal to the Bellman optimality equation in equation 1.
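
To see concretely how a negative constant can change the optimal policy, consider a toy example (not in the original argument): a single state $s$ with two actions and $\gamma = 0$, where $a_1$ yields reward $1$ and $a_2$ yields reward $2$. Then $q_*(s, a_1) = 1 < 2 = q_*(s, a_2)$, so the greedy (optimal) policy picks $a_2$. With $c = -1$, the new optimal values are $-1$ and $-2$, and the greedy policy now picks $a_1$: scaling by a negative constant can change the set of optimal policies.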

  1. A seminal paper on reward shaping is *Policy invariance under reward transformations: Theory and application to reward shaping* (1999) by Andrew Y. Ng et al. 

  2. There can be more than one optimal policy for a given optimal value function (and Markov Decision Process) because we might have $q_*(s, a_1) = q_*(s, a_2) \geq q_*(s, a)$, for $a_1, a_2 \in \mathcal{A}(s)$ and $a \in \mathcal{A}(s)$. Any greedy policy with respect to the optimal value function $q_*(s, a)$ is an optimal policy.