Theorem
Consider the following Bellman optimality equation (BOE), equation 3.20 of the Sutton & Barto book on RL (2nd edition, p. 64):
$$q_*(s, a) = \sum_{s' \in \mathcal{S}, \, r \in \mathcal{R}} p(s', r \mid s, a) \left( r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right). \tag{1}$$

If we multiply all rewards by the same constant $c \in \mathbb{R}$, $c > 0$, then the new optimal state-action value function is given by

$$c \, q_*(s, a).$$

Proof
To prove this, we need to show that the following BOE
$$c \, q_*(s, a) = \sum_{s' \in \mathcal{S}, \, r \in \mathcal{R}} p(s', r \mid s, a) \left( c r + \gamma \max_{a' \in \mathcal{A}(s')} c \, q_*(s', a') \right) \tag{2}$$

is equivalent to the BOE in equation 1.
Given that $c > 0$, we have

$$\max_{a' \in \mathcal{A}(s')} c \, q_*(s', a') = c \max_{a' \in \mathcal{A}(s')} q_*(s', a'),$$

i.e. $c$ can be taken out of the max operator.
Therefore, equation 2 becomes
$$
\begin{aligned}
c \, q_*(s, a) &= \sum_{s' \in \mathcal{S}, \, r \in \mathcal{R}} p(s', r \mid s, a) \left( c r + \gamma c \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right) \\
&= \sum_{s' \in \mathcal{S}, \, r \in \mathcal{R}} c \, p(s', r \mid s, a) \left( r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right) \\
&= c \sum_{s' \in \mathcal{S}, \, r \in \mathcal{R}} p(s', r \mid s, a) \left( r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right) \\
\iff \quad q_*(s, a) &= \sum_{s' \in \mathcal{S}, \, r \in \mathcal{R}} p(s', r \mid s, a) \left( r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right),
\end{aligned}
$$

which is exactly the Bellman optimality equation in equation 1 and therefore holds. Hence, when the reward is given by $c r$, $c \, q_*(s, a)$ is the solution to the corresponding Bellman optimality equation.
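As a quick numerical sanity check (not part of the proof above), here is a small Python sketch that runs Q-value iteration on a randomly generated toy MDP, once with rewards $r(s, a)$ and once with rewards $c \, r(s, a)$. It uses the expected-reward form of the BOE, with $r(s, a) = \sum_{s', r} r \, p(s', r \mid s, a)$, which is equivalent to equation 1; the toy MDP, discount factor, and constant $c$ are arbitrary choices made only for illustration.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# Toy MDP (made up for illustration):
# p[s, a, s'] = transition probability p(s' | s, a)
# r[s, a]     = expected immediate reward r(s, a)
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def q_value_iteration(p, r, gamma, iters=2000):
    """Solve the Bellman optimality equation for q* by fixed-point iteration."""
    q = np.zeros(r.shape)
    for _ in range(iters):
        v = q.max(axis=1)          # max_{a'} q(s', a')
        q = r + gamma * (p @ v)    # BOE update in expected-reward form
    return q

c = 5.0
q_star = q_value_iteration(p, r, gamma)
q_star_scaled = q_value_iteration(p, c * r, gamma)

print(np.allclose(q_star_scaled, c * q_star))                          # True: optimal values scale by c
print((q_star.argmax(axis=1) == q_star_scaled.argmax(axis=1)).all())   # True: same greedy policy
```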
Interpretation
Consequently, whenever we multiply the reward function by some positive constant, which can be viewed as a form of reward shaping¹, the set of optimal policies does not change².
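To spell out why the greedy policies coincide: for $c > 0$, scaling does not change which actions attain the maximum,

$$\operatorname*{arg\,max}_{a \in \mathcal{A}(s)} c \, q_*(s, a) = \operatorname*{arg\,max}_{a \in \mathcal{A}(s)} q_*(s, a) \quad \text{for every } s \in \mathcal{S},$$

so any policy that is greedy with respect to $c \, q_*$ is also greedy with respect to $q_*$, and vice versa.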
What if the constant is zero or negative?
For completeness, if $c = 0$, then equation 2 reduces to $0 = 0$, which is trivially true. In this case all rewards, and hence all action values, are zero, so every policy is optimal.
If $c < 0$, then

$$\max_{a' \in \mathcal{A}(s')} c \, q_*(s', a') = c \min_{a' \in \mathcal{A}(s')} q_*(s', a'),$$

so equation 2 becomes, after dividing both sides by $c$,

$$q_*(s, a) = \sum_{s' \in \mathcal{S}, \, r \in \mathcal{R}} p(s', r \mid s, a) \left( r + \gamma \min_{a' \in \mathcal{A}(s')} q_*(s', a') \right),$$

which is not the Bellman optimality equation in equation 1 and does not hold in general (it would require $\max_{a'} q_*(s', a') = \min_{a'} q_*(s', a')$ at every reachable successor state), so $c \, q_*(s, a)$ is in general not the optimal value function of the scaled problem.
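For intuition, consider a one-step example (added here for illustration): a single state with two actions whose rewards are $1$ and $2$. The optimal policy picks the second action, but after multiplying the rewards by $c = -1$ they become $-1$ and $-2$, and the optimal policy of the scaled problem picks the first action instead, since $\operatorname*{arg\,max}_a c \, r(s, a) = \operatorname*{arg\,min}_a r(s, a)$ for $c < 0$. So a negative constant can change (and in the one-step case reverses) the set of optimal policies.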
1. A seminal paper on reward shaping is Policy invariance under reward transformations: Theory and application to reward shaping (1999) by Andrew Y. Ng et al.
2. There can be more than one optimal policy for a given optimal value function (and Markov Decision Process) because we might have $q_*(s, a_1) = q_*(s, a_2) \ge q_*(s, a)$ for some $a_1, a_2 \in \mathcal{A}(s)$ and all $a \in \mathcal{A}(s)$. Any greedy policy with respect to the optimal value function $q_*(s, a)$ is an optimal policy.