the nbro's blog
A blog dedicated to Computer Science and Artificial Intelligence.
Do not copy or use anything from this blog without my permission. If you want to refer to anything that I've written here, you can cite this blog.
https://nbro.gitlab.io//
Bellman Equations
<h2 id="introduction">Introduction</h2>
<p>In <a href="http://incompleteideas.net/book/RLbook2020.pdf"><em>Reinforcement Learning (RL)</em></a>, <strong>value functions</strong> define the objectives of the RL problem. There are two very important and strictly related value functions,</p>
<ul>
<li>the <strong>state-action value function (SAVF)</strong> <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, and</li>
<li>the <strong>state value function (SVF)</strong>.</li>
</ul>
<p>In this post, I’ll show how these value functions (including their <em>optimality</em> versions) can be mathematically formulated as recursive equations, known as <strong>Bellman equations</strong>.</p>
<p>I’ll assume that you are minimally familiar with <strong>Markov Decision Processes (MDPs)</strong> and RL. Nevertheless, I will review the most important RL and mathematical prerequisites to understand this post, so that the post is as self-contained as possible.</p>
<!-- (without making this very very long). For more details, I recommend that you read the related sections of the book [Reinforcement Learning: An Introduction][1] (2nd edition) by Sutton and Barto. Needless to say, you can skip the sections on the topics you already know well. -->
<h2 id="notation">Notation</h2>
<ul>
<li>Stylized upper case letters (e.g. \(\mathcal{X}\) or \(\mathbb{R}\)) denote <em>vector spaces</em>.</li>
<li>Upper case letters (e.g. \(X\)) denote <a href="https://en.wikipedia.org/wiki/Random_variable"><em>random variables</em></a>.</li>
<li>Depending on the context, lower case letters (e.g. \(x\)) can denote <a href="https://www.statlect.com/glossary/realization-of-a-random-variable"><em>realizations</em></a> of random variables, <em>variables</em> of functions, or <em>elements</em> of a vector space.</li>
<li>Depending on the context, \(\color{blue}{p}(s' \mid s, a)\) can be a shorthand for \(\color{blue}{p}(S'=s' \mid S=s, A=a) \in [0, 1]\), a probability, or \(\color{blue}{p}(s' \mid S=s, A=a)\), a conditional probability distribution.</li>
<li>\(X=x\) is an <em>event</em>, which is occasionally abbreviated as \(x\).</li>
<li>\(s' \sim \color{blue}{p}(s' \mid s, a)\) means that \(s'\) is drawn/sampled according to \(\color{blue}{p}\).</li>
</ul>
<h2 id="markov-decision-processes">Markov Decision Processes</h2>
<p>A (discounted) <a href="https://www.gwern.net/docs/statistics/decision/1960-howard-dynamicprogrammingmarkovprocesses.pdf"><strong>MDP</strong></a> can be defined as a tuple</p>
\[M
\triangleq
(\mathcal{S}, \mathcal{A}, \mathcal{R}, \color{blue}{p}, \mathscr{r}, \gamma)
\tag{1} \label{1},\]
<p>where</p>
<ul>
<li>\(\mathcal{S}\) is the <em>state space</em>,</li>
<li>\(\mathcal{A}\) is the <em>action space</em>,</li>
<li>\(\mathcal{R} \subset \mathbb{R}\) is the <em>reward space</em>,</li>
<li>\(\color{blue}{p}(s' \mid s, a)\) is the <em>transition model</em>,</li>
<li>\(\mathscr{r}(s, a) = \mathbb{E}_{p(r \mid s, a)} \left[ R \mid S = s, A = a \right]\) is the <em>expected reward</em>, and</li>
<li>\(\gamma \in [0, 1]\) is the <em>discount factor</em>.</li>
</ul>
<h3 id="markov-property">Markov Property</h3>
<p>MDPs assume that the <strong>Markov property</strong> holds, i.e. the future is independent of the past given the present. The Markov property is encoded in \(\color{blue}{p}(s' \mid s, a)\) and \(\mathscr{r}(s, a)\).</p>
<h3 id="finite-mdps">Finite MDPs</h3>
<p>A <strong>finite MDP</strong> is an MDP where the state \(\mathcal{S}\), action \(\mathcal{A}\) and reward \(\mathcal{R}\) spaces are <a href="https://en.wikipedia.org/wiki/Finite_set">finite sets</a>. In that case, \(\color{blue}{p}(s' \mid s, a)\) can be viewed as a <a href="https://en.wikipedia.org/wiki/Probability_mass_function"><em>probability mass function (pmf)</em></a> and the random variables associated with states, actions and rewards are <em>discrete</em>.</p>
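For concreteness, a finite MDP can be stored as plain arrays. Here's a minimal NumPy sketch of a hypothetical 2-state, 2-action MDP (all numbers are invented, purely for illustration):

```python
import numpy as np

n_states, n_actions = 2, 2

# p[s, a, s'] = p(s' | s, a): each p[s, a] must be a valid pmf over next states.
p = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0, 1
])

# r[s, a] = expected reward of taking action a in state s.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

gamma = 0.9

# Sanity check: every conditional distribution p(. | s, a) sums to 1.
assert np.allclose(p.sum(axis=-1), 1.0)
```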
<h3 id="alternative-formulations">Alternative Formulations</h3>
<p>Sometimes, \(\color{blue}{p}(s' \mid s, a)\) is combined with \(\mathscr{r}(s, a)\) to form a joint conditional distribution, \(\color{purple}{p}(s', r \mid s, a)\) (the <em>dynamics</em> of the MDP), from which both \(\color{blue}{p}(s' \mid s, a)\) and \(\mathscr{r}(s, a)\) can be derived.</p>
<p>Specifically, \(\color{blue}{p}(s' \mid s, a)\) can be computed from \(\color{purple}{p}(s', r \mid s, a)\) by marginalizing over \(r\) <sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">2</a></sup> as follows</p>
\[\color{blue}{p}(s' \mid s, a) = \sum_{r \in \mathcal{R}} \color{purple}{p}(s', r \mid s, a).
\tag{2} \label{2}\]
<p>Similarly, we have</p>
\[\begin{align*}
\mathscr{r}(s, a)
&=
\mathbb{E} \left[ R \mid S = s, A = a \right]
\\
&= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} \color{purple}{p}(s', r \mid s, a)
\\
&=
\sum_{r \in \mathcal{R}} r p(r \mid s, a).
\end{align*}
\tag{3} \label{3}\]
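Relations \ref{2} and \ref{3} are just marginalizations, which is easy to check numerically. Here's a sketch with a made-up dynamics table \(\color{purple}{p}(s', r \mid s, a)\) (2 states, 1 action, 2 possible reward values; all numbers are illustrative):

```python
import numpy as np

rewards = np.array([0.0, 1.0])            # the reward space R
# joint[s, a, s', r_idx] = p(s', r | s, a)
joint = np.array([[[[0.3, 0.2],           # from s=0, a=0: s'=0 with r=0 or r=1
                    [0.1, 0.4]]],         # ... and s'=1 with r=0 or r=1
                  [[[0.25, 0.25],
                    [0.25, 0.25]]]])

# Equation (2): marginalize over r to get the transition model p(s' | s, a).
p_transition = joint.sum(axis=-1)

# Equation (3): marginalize over s' to get p(r | s, a), then take the mean.
p_reward = joint.sum(axis=2)
r_expected = p_reward @ rewards           # r(s, a) = sum_r r * p(r | s, a)

assert np.allclose(p_transition[0, 0], [0.5, 0.5])
assert np.isclose(r_expected[0, 0], 0.6)
```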
<!-- You can also define the reward function as a function of the next state $$s'$$ too, i.e.
$$
\begin{align*}
\mathscr{r}(s, a, s')
&=
\mathbb{E} \left[ R \mid S = s, A = a, S' = s' \right]
\\
&=
\sum_{r \in \mathcal{R}} r \frac{\color{purple}{p}(s', r \mid s, a)}{\color{blue}{p}(s' \mid s, a)}
\\
&=
\sum_{r \in \mathcal{R}} r p(r \mid s', s, a).
\end{align*}
$$ -->
<!-- ### Illustration
Here's a diagram (taken from Wikipedia) of an MDP $$M$$, where $$\mathcal{S} = \{ s_0, s_1, s_2 \}$$ (green circles), $$\mathcal{A} = \{ a_0, a_1 \}$$ (orange circles), $$\mathcal{R} = \{ -1, 5, 0 \}$$ (orange arrows), the black arrows represent the transitions (with the corresponding probabilities),the orange arrows are the rewards (the rewards of zero are not shown).
![An example of an MDP](/images/mdp.png)
To give you an idea of the transition probabilities, the probability of transition from $$s_0$$ to $$s_0$$ by taking action $$a_0$$ is $$\color{blue}{p}(S'=s_0 \mid S = s_0, A=a_0) = 0.5$$, while the probability of transitioning from $$s_0$$ to $$s_2$$ by taking action $$a_1$$ is $$\color{blue}{p}(S'=s_2 \mid S = s_0, A=a_1) = 1.0$$. The expected reward of taking the action $$a_0$$ in $$s_1$$ and ending up in $$s_2$$ is $$\mathscr{r}(s_1, a_0, s_2) = 0$$, while $$\mathscr{r}(s_1, a_0, s_0) = 5$$. -->
<h2 id="reinforcement-learning">Reinforcement Learning</h2>
<!-- ### Agent-Environment Interaction -->
<p>In <a href="http://incompleteideas.net/book/RLbook2020.pdf"><strong>RL</strong></a>, we imagine that there is an <strong>agent</strong> that <em>sequentially</em> <em>interacts</em> with an <strong>environment</strong> in (discrete) <em>time-steps</em>, where the environment can be modelled as an MDP.</p>
<!-- So, RL is used to solve sequential decision-making problems (hence the emphasis on _sequentially_!), which can be modelled as MDPs. -->
<p>More specifically, at time-step \(t\), the agent is in some <strong>state</strong> \(s_t \in \mathcal{S}\) <sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">3</a></sup> and takes an <strong>action</strong> \(a_t \in \mathcal{A}\) with a <strong>policy</strong> \(\color{red}{\pi}(a \mid s)\), which is a conditional probability distribution over actions given a state, i.e. \(a_t \sim \pi(a \mid s_t)\). At the next time step \(t+1\), the environment returns a <strong>reward</strong> \(r_{t+1} = \mathscr{r}(s_t, a_t)\), and it moves to another state \(s_{t+1} \sim \color{blue}{p}(s' \mid s_t, a_t)\), then the agent takes another action \(a_{t+1} \sim \color{red}{\pi}(a \mid s_{t+1})\), gets another reward \(r_{t+2} = \mathscr{r}(s_{t+1}, a_{t+1})\), and the environment moves to another state \(s_{t+2}\), and so on. This interaction continues until a maximum time-step \(H\), which is often called <strong>horizon</strong>, is reached. For simplicity, we assume that \(H = \infty\), so we assume a so-called <em>infinite-horizon MDP</em>.</p>
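The interaction loop just described can be sketched in a few lines. The MDP and the policy below are made up for illustration, and the infinite horizon is truncated at 100 steps:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical finite MDP (numbers invented for illustration).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p[s, a, s'] = p(s' | s, a)
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]

def pi(s):
    """A uniform random policy a ~ pi(a | s) (an assumption, for illustration)."""
    return rng.integers(2)

s, total_reward = 0, 0.0
for t in range(100):                       # truncate the infinite horizon
    a = pi(s)                              # a_t ~ pi(a | s_t)
    total_reward += r[s, a]                # r_{t+1} = r(s_t, a_t)
    s = rng.choice(2, p=p[s, a])           # s_{t+1} ~ p(s' | s_t, a_t)
```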
<!-- Here's a diagram that illustrates this interaction (taken from Sutton & Barto's book), where the random variables $$A_t, S_{t+1}, R_{t+1}, S_t$$ and $$R_t$$ (rather their [realizations][4] $$a_t, s_{t+1}, r_{t+1}, s_t$$ and $$r_t$$) are used to emphasize that this interaction is stochastic.
![A diagram of the Agent-Environment Interaction](/images/agent-environment-interaction.png) -->
<!-- ### Finite and Infinite-Horizon MDPs
If $$H$$ is finite, then we have _finite-horizon MDP_, which can be used to describe tasks that can naturally be broken into _episodes_ (e.g. a sequence of games of chess, where each game is an episode), i.e. _episodic tasks_. If $$H = \infty$$, then we have an _infinite-horizon MDP_, which can be used to describe _continuing tasks_. In the case of episodic tasks, $$H$$ is a random variable, given that the time-step the episode terminates might not always be the same. -->
<!-- ### Objective -->
<p>In RL, the objective/goal is to find a policy that <em>maximizes</em> the <em>sum of rewards in the long run</em>, i.e. until the horizon \(H\) is reached (if ever reached). An objective function that formalizes this sum of rewards in the long run is the <strong>state-action value function (SAVF)</strong>, which is, therefore, one function that we might want to optimize.</p>
<!-- ### RL Algorithms
The most known RL algorithms are probably [Q-learning][13] and [SARSA][14]. They implement the agent-environment interaction described above in some specific way, in order to estimate the SAVF and then derive the policy from it. There are also algorithms that estimate the policy directly (e.g. [REINFORCE][15]). -->
<h2 id="state-action-value-function">State-Action Value Function</h2>
<p>The <em>state-action value function</em> for a policy \(\color{red}{\pi}(a \mid s)\) is the function \(q_\color{red}{\pi} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\), which is defined as follows</p>
\[q_\color{red}{\pi}(s, a)
\triangleq
\mathbb{E} \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right],
\\ \color{orange}{\forall} s \in \mathcal{S}, \color{orange}{\forall} a \in \mathcal{A}
\tag{4}\label{4},\]
<p>where</p>
<ul>
<li>\(R_{t+k+1}\) is the <em>reward</em> the agent receives at time-step \(t+k+1\),</li>
<li>\(G_t \triangleq \sum_{k=0}^\infty \gamma^k R_{t+k+1}\) is the <em>return</em> (aka <em>value</em> <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>),</li>
<li>\(\color{red}{\pi}\) is a <em>policy</em> that the agents follows from time-step \(t+1\) onwards,</li>
<li>\(\gamma \in [0, 1)\) is the <em>discount factor</em>,</li>
<li>\(S_t = s\) is the <em>state</em> the agent is in at time-step \(t\), and</li>
<li>\(A_t = a\) is the <em>action</em> taken at time-step \(t\).</li>
</ul>
<p>Intuitively, \(q_\color{red}{\pi}(s, a)\) is the <strong>expected return</strong> that the agent gets by following policy \(\color{red}{\pi}\) <strong>after</strong> having taken action \(a\) in state \(s\) at time-step \(t\).</p>
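Since equation \ref{4} is an expectation over trajectories, it can be approximated by averaging sampled discounted returns (a Monte Carlo estimate). Here's a sketch with a made-up MDP, a uniform random policy, and the infinite sum truncated (all names and numbers are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p[s, a, s'] = p(s' | s, a)
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]
gamma = 0.9

def sample_return(s, a, steps=200):
    """One truncated sample of G_t = sum_k gamma^k R_{t+k+1}, starting from
    (s, a) and following a uniform random policy afterwards."""
    g, discount = 0.0, 1.0
    for _ in range(steps):
        g += discount * r[s, a]
        discount *= gamma
        s = rng.choice(2, p=p[s, a])       # s' ~ p(s' | s, a)
        a = rng.integers(2)                # a' ~ pi(a | s'), uniform here
    return g

# Monte Carlo estimate of q_pi(0, 0): average many sampled returns.
q_hat = np.mean([sample_return(0, 0) for _ in range(2000)])
```

Since rewards here lie in \([0, 2]\) and \(\gamma = 0.9\), the estimate must lie below the geometric bound \(2 / (1 - \gamma) = 20\).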
<!-- ### Discount Factor
Here, $$\gamma$$ is used to unify the objective of episodic and continuing tasks and to weight the importance of the rewards with respect to when they are received. -->
<!-- In case $$H = \infty$$, we need $$\gamma \in [0, 1)$$ and the sequence $$\{ R_{t+k+1} \}$$ to be bounded, in order to make the sum finite, which is desirable because we want to maximise the SAVF. Specifically, if the maximum possible reward is $$R_\text{max}$$, then the return is bounded by $$\frac{R_\text{max}}{1 - \gamma}$$, i.e.
$$
\begin{align*}
G_t
&\triangleq
\sum_{k=0}^\infty \gamma^k R_{t+k+1} \leq \sum_{k=0}^\infty \gamma^k R_\text{max}
\\
&=
\frac{R_\text{max}}{1 - \gamma},
\end{align*}
$$
where $$\sum_{k=0}^\infty \gamma^k R_\text{max}$$ is a [geometric series][16].
In case $$H < \infty$$, the return is finite, even if we use $$\gamma = 1$$. -->
<!-- ### Unified View
If $$H = \infty$$, equation \ref{1} can still represent the objective function of an episodic problem, but we need to add a special state to the state space, called the _absorbing state_, in which
- the agent remains forever after the episode terminates,
- all actions have no effect (i.e. it remains in the absorbing state), and
- the agents only gets a reward of $$0$$. -->
<h3 id="value-function-of-a-policy">Value Function of a Policy</h3>
<p>The subscript \(\color{red}{\pi}\) in \(q_\color{red}{\pi}(s, a)\) indicates that \(q_\color{red}{\pi}(s, a)\) is defined in terms of \(\color{red}{\pi}(a \mid s)\) because the rewards received in the future, \(R_{t+k+1}\), depend on the actions that we take with \(\color{red}{\pi}(a \mid s)\), but they also depend on the <em>transition model</em> \(\color{blue}{p}(s' \mid s, a)\).</p>
<p>However, \(\color{red}{\pi}\) and \(\color{blue}{p}\) do not appear anywhere inside the expectation in equation \ref{4}. So, for people that only believe in equations, \ref{4} might not be satisfying enough. Luckily, we can express \(q_\color{red}{\pi}(s, a)\) in terms of \(\color{red}{\pi}\) and \(\color{blue}{p}\) by starting from equation \ref{4}, which also leads to a Bellman/recursive equation. So, let’s do it!</p>
<h2 id="mathematical-prerequisites">Mathematical Prerequisites</h2>
<p>The formulation of the value functions as recursive equations (which is the main point of this blog post) uses three main mathematical rules, which are reviewed here for completeness.</p>
<h3 id="markov-property-1">Markov Property</h3>
<p>If we assume that the <em>Markov property</em> holds, then the following holds</p>
<ol>
<li>
\[\color{blue}{p}(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = \color{blue}{p}(s_{t+1} \mid s_t, a_t)\]
</li>
<li>
\[\mathscr{r}(s, a) = \mathbb{E} \left[ R_{t+1}\mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0 \right] = \mathbb{E} \left[ R_{t+1}\mid s_t, a_t \right]\]
</li>
<li>
\[\color{red}{\pi}(a_{t} \mid s_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0 ) = \color{red}{\pi}(a_{t} \mid s_t)\]
</li>
</ol>
<h3 id="linearity-of-expectation-le">Linearity of Expectation (LE)</h3>
<p>Let \(X\) and \(Y\) be two discrete random variables and \(p(x, y)\) be their joint distribution, then the expectation of \(X+Y\) is equal to the sum of the expectation of \(X\) and \(Y\), i.e.</p>
\[\begin{align*}
\mathbb{E}[X + Y]
&= \sum_x \sum_y (x + y) p(x, y)
\\
&= \sum_x \sum_y \left( x p(x, y) + y p(x, y) \right)
\\
&= \sum_x \sum_y x p(x, y) + \sum_x \sum_y y p(x, y)
\\
&= \sum_x x p(x) + \sum_y y p(y)
\\
&= \mathbb{E}[X] + \mathbb{E}[Y]
\end{align*}\]
<h3 id="law-of-total-expectation-lte">Law of Total Expectation (LTE)</h3>
<p>The formulation of the <strong>LTE</strong> is as follows. Let \(X\), \(Y\) and \(Z\) be three discrete random variables, \(\mathbb{E}[X \mid Y=y]\) be the expectation of \(X\) given \(Y=y\), and \(p(x, y, z)\) be the joint distribution of \(X\), \(Y\) and \(Z\). So, we have</p>
\[\begin{align*}
\mathbb{E}[X \mid Y=y]
&=
\sum_x x p(x \mid y)
\\
&=
\sum_x x \frac{p(x, y)}{p(y)}
\\
&=
\sum_x x \frac{\sum_z p(x, y, z)}{p(y)}
\\
&=
\sum_x x \frac{\sum_z p(x \mid y, z) p(y, z) }{p(y)}
\\
&=
\sum_z \frac{p(y, z)}{p(y)} \sum_x x p(x \mid y, z)
\\
&=
\sum_z p(z \mid y) \sum_x x p(x \mid y, z)
\\
&=
\sum_z p(z \mid y) \mathbb{E}[X \mid Y=y, Z=z].
\end{align*}\]
<p>This also applies to other more complicated cases, i.e. more conditions, or even to the simpler case of \(\mathbb{E}[X]\).</p>
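As a sanity check, the LTE can be verified numerically on a small joint pmf. The distribution below is randomly generated, purely for illustration:

```python
import numpy as np

# A random joint pmf p(x, y, z) over X, Y, Z in {0, 1}, just to check the LTE.
p = np.random.default_rng(1).dirichlet(np.ones(8)).reshape(2, 2, 2)  # p[x, y, z]
x_vals = np.array([0.0, 1.0])

y = 0
p_y = p[:, y, :].sum()                               # p(Y = y)

# Direct computation: E[X | Y=y] = sum_x x p(x | y).
lhs = sum(x_vals[x] * p[x, y, :].sum() / p_y for x in range(2))

# LTE: E[X | Y=y] = sum_z p(z | y) E[X | Y=y, Z=z].
rhs = 0.0
for z in range(2):
    p_z_given_y = p[:, y, z].sum() / p_y             # p(z | y) = p(y, z) / p(y)
    e_x_given_yz = sum(x_vals[x] * p[x, y, z] / p[:, y, z].sum()
                       for x in range(2))            # E[X | Y=y, Z=z]
    rhs += p_z_given_y * e_x_given_yz

assert np.isclose(lhs, rhs)
```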
<h2 id="state-action-bellman-equation">State-Action Bellman Equation</h2>
<p>We are finally ready to express \(\ref{4}\) as a recursive equation.</p>
<p>We can decompose the return \(G_t\) into the first reward \(R_{t+1}\), received after having taken action \(A_t = a\) in state \(S_t = s\), and the rewards that we will receive in the next time steps, then we can apply the LE, LTE (multiple times) and Markov property, i.e.</p>
\[\begin{align*}
q_\color{red}{\pi}(s, a)
&=
\mathbb{E} \left[ R_{t+1} + \sum_{k=1}^\infty \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]
\\
&=
\mathbb{E} \left[ R_{t+1} \mid S_t = s, A_t = a \right] + \gamma \mathbb{E} \left[ G_{t+1} \mid S_t = s, A_t = a\right]
\\
&=
\mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', S_t = s, A_t = a\right]
\\
&=
\mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s'\right]
\\
&=
\mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s')
\\
&=
\mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \sum_{a'} \color{red}{\pi}(a' \mid s') \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', A_{t+1}=a'\right]
\\
&=
\mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) \sum_{a'} \color{red}{\pi}(a' \mid s') q_\color{red}{\pi}(s', a')
\tag{5}\label{5},
\end{align*}\]
<p>where</p>
<ul>
<li>\(\gamma G_{t+1} = \sum_{k=1}^\infty \gamma^k R_{t+k+1} = \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2}\),</li>
<li>\(\mathscr{r}(s, a) \triangleq \mathbb{E}_{p(r \mid s, a)} \left[ R_{t+1} \mid S_t = s, A_t = a \right] = \sum_{r} p(r \mid s, a) r\) is the expected reward of taking action \(a\) in \(s\) and \(p(r \mid s, a)\) is the reward distribution <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">5</a></sup>,</li>
<li>\(v_\color{red}{\pi}(s') \triangleq \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s'\right]\) is the <strong>state value function (SVF)</strong>, and</li>
<li>\(q_\color{red}{\pi}(s', a') \triangleq \mathbb{E} \left[ G_{t+1} \mid S_{t+1}=s', A_{t+1}=a'\right]\).</li>
</ul>
<p>Equation \ref{5} is a <em>recursive equation</em>, given that \(q_\color{red}{\pi}\) is defined in terms of itself (although evaluated at a different state-action pair). It is known as the <em>state-action</em> <strong>Bellman (expectation) equation</strong> for \(q_\color{red}{\pi}\).</p>
<p>So, the subscript \(\color{red}{\pi}\) in \(q_\color{red}{\pi}(s, a)\) is used because the state-action value is (also) defined in terms of \(\color{red}{\pi}\).</p>
<h3 id="alternative-version">Alternative Version</h3>
<p>Given the relations \ref{2} and \ref{3}, equation \ref{5} can also be expressed in terms of \(\color{purple}{p}(s', r \mid s, a)\) as follows.</p>
\[\begin{align*}
q_\color{red}{\pi}(s, a)
&=
\sum_{r} \underbrace{\sum_{s'} \color{purple}{p}(s', r \mid s, a)}_{p(r \mid s, a)} r + \gamma \sum_{s'} \underbrace{\sum_{r} \color{purple}{p}(s', r \mid s, a)}_{\color{blue}{p}(s' \mid s, a)} v_\color{red}{\pi}(s')
\\
&=
\sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right]
\tag{6}\label{6}
.
\end{align*}\]
<p>So, \(q_\color{red}{\pi}(s, a)\) is an expectation with respect to the joint conditional distribution \(\color{purple}{p}(s', r \mid s, a)\), i.e.</p>
\[q_\color{red}{\pi}(s, a)
=
\mathbb{E}_{\color{purple}{p}(s', r \mid s, a)} \left[ R_{t+1} + \gamma v_{\color{red}{\pi}}(S_{t+1}) \mid S_t = s, A_t =a\right]
\tag{7}\label{7}
.\]
<p>Equation \ref{6} can also be derived from equation \ref{4} by applying the LTE with respect to \(\color{purple}{p}(s', r \mid s, a)\).</p>
<h3 id="vectorized-form">Vectorized Form</h3>
<p>If the MDP is finite, then we can express the state-action Bellman equation in \ref{5} in a vectorized form</p>
\[\mathbf{Q}_\color{red}{\pi}
=
\mathbf{R}
+
\gamma
\mathbf{\color{blue}{P}}
\mathbf{V}_\color{red}{\pi},
\tag{8}\label{8}\]
<p>where</p>
<ul>
<li>\(\mathbf{Q}_\color{red}{\pi} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid }\) is a matrix that contains the state-action values for each state-action pair \((s, a)\), so \(\mathbf{Q}_\color{red}{\pi}[s, a] = q_{\color{red}{\pi}}(s, a)\),</li>
<li>\(\mathbf{R} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid }\) is a matrix with the expected rewards for each state-action pair \((s, a)\), so \(\mathbf{R}[s, a] = \mathscr{r}(s, a)\),</li>
<li>\(\mathbf{\color{blue}{P}} \in \mathbb{R}^{\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid \times \mid \mathcal{S} \mid}\) is a 3-dimensional tensor that contains the transition probabilities for each triple \((s, a, s')\), so \(\mathbf{\color{blue}{P}}[s, a, s'] = \color{blue}{p}(S'=s' \mid S=s, A=a)\), and</li>
<li>\(\mathbf{V}_\color{red}{\pi} \in \mathbb{R}^{\mid \mathcal{S} \mid}\) is a vector that contains the state values (as defined in equation \ref{5}) for each state \(s'\), so \(\mathbf{V}_\color{red}{\pi}[s'] = v_\color{red}{\pi}(s')\).</li>
</ul>
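Equation \ref{8} can be checked numerically. Here's a minimal NumPy sketch: the MDP numbers are made up, and iterating the equation to its fixed point is one standard way (iterative policy evaluation) to compute \(\mathbf{Q}_\color{red}{\pi}\):

```python
import numpy as np

# A hypothetical 2-state, 2-action finite MDP (numbers are made up).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s'] = p(s' | s, a)
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a] = r(s, a)
pi = np.full((2, 2), 0.5)                  # pi[s, a]: a uniform random policy
gamma = 0.9

# Iterate the Bellman equation to its fixed point.
Q = np.zeros((2, 2))
for _ in range(1000):
    V = (pi * Q).sum(axis=1)               # V[s'] = sum_a' pi(a'|s') Q[s', a']
    Q = R + gamma * P @ V                  # equation (8); P @ V sums over s'

# At the fixed point, equation (8) holds exactly.
V = (pi * Q).sum(axis=1)
assert np.allclose(Q, R + gamma * P @ V)
```

Note that `P @ V` broadcasts the matrix product over the stacked \((s, a)\) slices, which is exactly the contraction over \(s'\) in equation \ref{8}.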
<h2 id="optimal-state-action-value-function">Optimal State-Action Value Function</h2>
<p>In RL, the goal is to find/estimate an <em>optimal policy</em>, \(\color{green}{\pi_*}\), i.e. one that, if followed, maximizes the expected return. For a finite MDP, there is a <em>unique optimal state-action value function</em>, which can be denoted by \(q_{\color{green}{\pi_*}}(s, a)\) or just \(q_\color{green}{*}(s, a)\), from which an optimal policy can be derived.</p>
<p>By definition, the optimal state-action value function is</p>
\[q_{\color{green}{\pi_*}}(s, a)
\triangleq
\operatorname{max}_\color{red}{\pi} q_\color{red}{\pi}(s, a),
\\
\color{orange}{\forall} s \in \mathcal{S}, \color{orange}{\forall} a \in \mathcal{A}
\tag{9}\label{9}
.\]
<p>For a discounted infinite-horizon MDP, the optimal policy is deterministic <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> and stationary <sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">7</a></sup>, and it’s any greedy policy with respect to \(q_{\color{green}{\pi_*}}(s, a)\), i.e.</p>
\[\color{green}{\pi_*}(s)
\in
\operatorname{arg max}_a q_{\color{green}{\pi_*}}(s, a),
\\
\color{orange}{\forall} s \in \mathcal{S}
\tag{10}\label{10}\]
<p>Here, \(\in\) is used because there can be more than one optimal policy for an MDP given that there can be two or more actions that are optimal in a state.</p>
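For instance, with a made-up \(q_{\color{green}{\pi_*}}\) table, a greedy policy can be extracted with an argmax over actions. When two actions tie, as in the second state below, `argmax` simply picks one of the optimal actions, which is exactly why \(\in\) is used above:

```python
import numpy as np

# A hypothetical optimal state-action value table q*(s, a) (2 states, 2 actions).
q_star = np.array([[1.0, 3.0],
                   [2.0, 2.0]])

# Equation (10): any greedy policy w.r.t. q* is optimal.
greedy_action = q_star.argmax(axis=1)  # picks the first maximizer on ties
print(greedy_action)                   # [1 0]
```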
<h2 id="state-action-bellman-optimality-equation">State-Action Bellman Optimality Equation</h2>
<p>Equation \ref{9} can also be written as a recursive equation, known as the <em>state-action</em> <strong>Bellman optimality equation</strong>.</p>
\[\begin{align*}
q_{\color{green}{\pi_*}}(s, a)
&\triangleq
\operatorname{max}_\color{red}{\pi}
q_\color{red}{\pi}(s, a)
\\
&=
\operatorname{max}_\color{red}{\pi}
\sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right]
\\
&=
\sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma \operatorname{max}_\color{red}{\pi} v_\color{red}{\pi}(s') \right]
\\
&=
\sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma \operatorname{max}_{a'}q_{\color{green}{\pi_*}}(s', a') \right]
\\
&=
\mathbb{E}_{\color{purple}{p}(s', r \mid s, a)} \left[ R_{t+1} + \gamma v_{\color{green}{\pi_*}}(S_{t+1}) \mid S_t = s, A_t =a\right]
\tag{11}\label{11},
\end{align*}\]
<p>where \(v_\color{green}{\pi_*}(s') = \operatorname{max}_\color{red}{\pi} v_\color{red}{\pi}(s') = \operatorname{max}_{a'}q_{\color{green}{\pi_*}}(s', a')\).</p>
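Iterating equation \ref{11} to its fixed point is the idea behind value iteration. The sketch below uses the equivalent form in terms of \(\mathscr{r}(s, a)\) and \(\color{blue}{p}(s' \mid s, a)\) (obtained via relations \ref{2} and \ref{3}), on a made-up MDP:

```python
import numpy as np

p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p[s, a, s'] (hypothetical MDP)
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]
gamma = 0.9

# Iterate equation (11): q <- r + gamma * sum_s' p(s'|s,a) max_a' q(s', a').
q = np.zeros((2, 2))
for _ in range(1000):
    q = r + gamma * p @ q.max(axis=1)

# The fixed point is q*, and it satisfies the optimality equation.
assert np.allclose(q, r + gamma * p @ q.max(axis=1))
```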
<h2 id="state-bellman-equation">State Bellman Equation</h2>
<p>Like \(q_\color{red}{\pi}(s, a)\), the state value function \(v_\color{red}{\pi}(s)\) can also be written as a recursive equation by starting from its definition and then applying the LTE rule, the linearity of the expectation and the Markov property. So, for completeness, let’s do it.</p>
\[\begin{align*}
v_\color{red}{\pi}(s)
&\triangleq
\mathbb{E} \left[ G_t \mid S_t=s\right]
\\
&=
\mathbb{E} \left[ R_{t+1} + \gamma G_{t+1} \mid S_t=s \right]
\\
&=
\mathbb{E} \left[ R_{t+1} \mid S_t=s \right] + \gamma \mathbb{E} \left[ G_{t+1} \mid S_t=s \right]
\\
&=
\sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \mathbb{E} \left[ G_{t+1} \mid S_t = s, A_t = a\right]
\\
&=
\sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \sum_{s'} \color{blue}{p}(s' \mid s, a) \mathbb{E} \left[ G_{t+1} \mid S_{t+1} = s'\right]
\\
&=
\sum_{a} \color{red}{\pi}(a \mid s) \mathscr{r}(s, a) + \gamma \sum_{a} \color{red}{\pi}(a \mid s) \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s')
\\
&=
\sum_{a} \color{red}{\pi}(a \mid s) \left ( \mathscr{r}(s, a) + \gamma \sum_{s'} \color{blue}{p}(s' \mid s, a) v_\color{red}{\pi}(s') \right)
\\
&=
\sum_{a} \color{red}{\pi}(a \mid s) \left ( \sum_{s'} \sum_{r} \color{purple}{p}(s', r \mid s, a) \left[ r + \gamma v_\color{red}{\pi}(s') \right] \right)
\tag{12}\label{12}.
\end{align*}\]
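Equation \ref{12} can also be iterated to its fixed point to evaluate a given policy (iterative policy evaluation on \(v_\color{red}{\pi}\)). A sketch, with a made-up MDP and a uniform random policy:

```python
import numpy as np

p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p[s, a, s'] (hypothetical MDP)
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]
pi = np.full((2, 2), 0.5)                  # pi[s, a]: a uniform random policy
gamma = 0.9

# Iterate equation (12): v <- sum_a pi(a|s) (r(s,a) + gamma sum_s' p v(s')).
v = np.zeros(2)
for _ in range(1000):
    v = (pi * (r + gamma * p @ v)).sum(axis=1)

# The fixed point is v_pi, and it satisfies the state Bellman equation.
assert np.allclose(v, (pi * (r + gamma * p @ v)).sum(axis=1))
```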
<h2 id="conclusion">Conclusion</h2>
<p>In conclusion, value functions define the objectives of an RL problem. They can be written as recursive equations, known as <em>Bellman equations</em>, in honor of <strong>Richard Bellman</strong>, who made significant contributions to the theory of <strong>dynamic programming (DP)</strong>, which is related to RL.</p>
<p>More specifically, DP is an approach that can be used to solve MDPs (i.e. to find \(\color{green}{\pi_*}\)) when \(\color{blue}{p}\) is available, although DP is not limited to MDPs <sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">8</a></sup>. In DP, the solution to a problem is computed by combining the solutions to subproblems, and the Bellman equation reflects exactly this idea: \(q_\color{red}{\pi}(s, a)\) is computed as a function of the “subproblem” \(q_\color{red}{\pi}(s', a')\). The problem is that \(\color{blue}{p}\) is rarely available, hence the need for RL. In any case, DP algorithms, like <a href="https://www.gwern.net/docs/statistics/decision/1960-howard-dynamicprogrammingmarkovprocesses.pdf">policy iteration (PI)</a>, and RL algorithms, like <a href="http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf">Q-learning</a>, are related: both assume that the environment can be modeled as an MDP, and both attempt to estimate the same optimal value function.</p>
<!-- Although they are both approaches to solve MDPs, RL and DP are not the same thing. DP algorithms, like [policy iteration (PI)][11], need access to $$\color{blue}{p}(s' \mid s, a)$$, in addition to the reward function, while RL algorithms, like [Q-learning][13], don't necessarily need to use and/or estimate $$\color{blue}{p}(s' \mid s, a)$$, but they can estimate the value function by _trial-and-error_ while interacting with the environment. So, DP algorithms, like PI, are often called _planning_ (or _search_) algorithms because we don't need to interact with the environment to learn about the environment but we can simply use the dynamics to search for the optimal policy. The problem is that $$\color{blue}{p}$$ is rarely available, hence the need for RL. In any case, VI and Q-learning are related because they assume that the environment can be modeled as an MDP and they attempt to estimate the same optimal value function. -->
<!-- [^5]: There are also continuous-time MDPs. -->
<!-- [^12]: In turn, $$X=x$$ is a shorthand for $$\{\omega \in \Omega : X(\omega) = x \}$$, where $$\omega \in \Omega$$ is an outcome and $$\Omega$$ the sample space (the set of all possible outcomes), for the random variable $$X : \Omega \rightarrow E$$, where, in our case, $$E$$ is a finite set and so $$X$$ is a discrete random variable. -->
<!-- [^13]: The agent is also called _controller_, _player_, or _decision maker_. The environment is also called _controlled system_ or _plant_. An action is also called _control_. A policy is also called _behaviour_ or _strategy_. Finally, a reward is also called _payoff_ or _reinforcement_. -->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>It’s also called <em>action value function</em>, but I prefer to call it state-action value function because it reminds us that this is a function of a state and action. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:11" role="doc-endnote">
<p>Let \(X\) and \(Y\) be two discrete random variables and \(p(x, y)\) be their joint distribution. The <a href="https://en.wikipedia.org/wiki/Marginal_distribution">marginal distribution</a> of \(X\) or \(Y\) can be found as \(p(x) = \sum_y p(x, y)\) and \(p(y) = \sum_x p(x, y)\), respectively. <a href="#fnref:11" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>The subscript \(t\) in the object \(x_t\) is used to emphasize that \(x\) is associated with the time step \(t\). <a href="#fnref:8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Hence the name <em>value function</em>. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>If the reward is deterministic, then \(p(r \mid s, a)\) gives probability \(1\) to one reward and \(0\) to all other rewards. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>\(\pi(a \mid s)\) can also be used to describe deterministic policies by giving a probability of \(1\) to one action in \(s\) and a probability of \(0\) to all other actions. A deterministic policy might also be defined as a function \(\pi : \mathcal{S} \rightarrow \mathcal{A}\), so \(\pi(s) = a\) is the (only) action taken by \(\pi\) in \(s\). <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:10" role="doc-endnote">
<p>A policy \(\color{red}{\pi}(a \mid s)\) is <a href="https://en.wikipedia.org/wiki/Stationary_process">stationary</a> if it doesn’t change over time steps, i.e. \(\color{red}{\pi}(a \mid S_{t} = s) = \color{red}{\pi}(a \mid S_{t+1} = s), \forall t, \forall s \in \mathcal{S}\), in other words, the probabilities of selecting an action do not change from time step to time step. You can think of a non-stationary policy as a set of policies. <a href="#fnref:10" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>For more info about the dynamic programming approach, I recommend that you read the corresponding chapter in the book <a href="https://edutechlearners.com/download/Introduction_to_algorithms-3rd%20Edition.pdf">Introduction to Algorithms</a> (3rd edition) by Thomas H. Cormen et al. <a href="#fnref:9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Wed, 16 Feb 2022 00:00:00 +0000
https://nbro.gitlab.io//blogging/2022/02/16/bellman-equations/
https://nbro.gitlab.io//blogging/2022/02/16/bellman-equations/
MDPs are POMDPs
<p>A (fully observable) <strong>Markov Decision Process (MDP)</strong> is just a <strong>Partially Observable Markov Decision Process (POMDP)</strong> where the states are observable. So, we can formulate an MDP as a POMDP such that the observation space is equal to the state space. We also need to take care of the observation function. Let’s see how exactly.</p>
<p>Formally, an MDP can be defined as a tuple \(M_\text{MDP} = (\mathcal{S}, \mathcal{A}, T, r, \gamma)\), where</p>
<ul>
<li>\(\mathcal{S}\) is the state space</li>
<li>\(\mathcal{A}\) is the action space</li>
<li>\(T = p(s' \mid s, a)\) is the transition function</li>
<li>\(r\) is the reward function</li>
<li>\(\gamma\) is the discount factor</li>
</ul>
<p>A POMDP is defined as a tuple \(M_\text{POMDP} = (\mathcal{S}, \mathcal{A}, T, r, \gamma, \color{red}{\Omega}, \color{red}{O})\), where \(\mathcal{S}\), \(\mathcal{A}\), \(T\), \(r\) and \(\gamma\) are defined as above, but, in addition to those, we also have</p>
<ul>
<li>\(\color{red}{\Omega}\): the observation space</li>
<li>\(\color{red}{O} = p(o \mid s', a)\): the observation function, which is the probability distribution over possible observations, given the next state \(s'\) and action \(a\)</li>
</ul>
<p>So, to define \(M_\text{MDP}\) as \(M_\text{POMDP}\), we have</p>
<ul>
<li>
\[\color{red}{\Omega} = \mathcal{S}\]
</li>
<li>The observation function is
\(\color{red}{O}
=
p(o \mid s', a) =
\begin{cases}
1, \text{ if } o = s' \\
0, \text{ otherwise }
\end{cases}\)</li>
</ul>
<p>In other words, the probability of observing \(o = s'\), given that we end up in \(s'\), is \(1\), while the probability of observing \(o \neq s'\) is \(0\). This has implications on how you update the belief state \(b(s')\) because \(b(s')\) will be set to \(0\) if \(o \neq s'\).</p>
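<p>As a minimal Python sketch (the function names are mine, not standard), the deterministic observation function above and the resulting degenerate belief update could look like this:</p>

```python
def observation_prob(o, s_next, a):
    """O = p(o | s', a) for an MDP viewed as a POMDP.

    Since the observation space equals the state space, the agent observes
    the next state itself with probability 1.  The action `a` is irrelevant
    here, but it is kept to match the POMDP signature p(o | s', a).
    """
    return 1.0 if o == s_next else 0.0


def update_belief(belief, o):
    """Belief update in this degenerate POMDP: because o = s' with
    probability 1, all belief mass collapses onto the observed state."""
    return {s: (1.0 if s == o else 0.0) for s in belief}
```

For example, `update_belief({"s1": 0.3, "s2": 0.7}, "s2")` sets the belief of every state other than the observed one to \(0\), which is exactly the implication discussed above.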
Sat, 01 Jan 2022 00:00:00 +0000
https://nbro.gitlab.io//blogging/2022/01/01/mdps-are-pomdps/
https://nbro.gitlab.io//blogging/2022/01/01/mdps-are-pomdps/Historically relevant programs developed in LISP<p><strong>LISP</strong> stands for <strong>Lis</strong>t <strong>P</strong>rocessing. In this functional programming language, programs look like lists and can be treated as data (hence the name) <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. It was designed by John McCarthy (<a href="http://jmc.stanford.edu/articles/dartmouth/dartmouth.pdf">one of the official founders of the AI field</a>) starting in 1958.</p>
<p>Many people know that LISP is historically a very important programming language in Artificial Intelligence. Even today, dialects of LISP are still being used in this context. For example, <a href="https://github.com/lspector/Clojush">Clojush</a> is a <a href="https://clojure.org/">Clojure</a> (which is a dialect of LISP) implementation of the <a href="https://faculty.hampshire.edu/lspector/push.html">Push</a> programming language and the <a href="https://faculty.hampshire.edu/lspector/push.html">PushGP</a> system, which are still being used to do research on <a href="http://www0.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/poli08_fieldguide.pdf">genetic programming</a>.</p>
<p>Many historically relevant programs were implemented in LISP in the early days of AI. Here’s a non-exhaustive <strong>lis</strong>t <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Author</th>
<th>Source</th>
<th>Year</th>
<th>Brief description/comment</th>
</tr>
</thead>
<tbody>
<tr>
<td>Symbolic Automatic INTegrator (SAINT)</td>
<td>James R. Slagle</td>
<td><a href="https://dl.acm.org/doi/10.1145/321186.321193">[1]</a></td>
<td>1963</td>
<td>A symbolic integration program</td>
</tr>
<tr>
<td>ANALOGY</td>
<td>Thomas G. Evans</td>
<td><a href="https://dl.acm.org/doi/10.1145/1464122.1464156">[2]</a></td>
<td>1964</td>
<td>It solves geometric analogy problems</td>
</tr>
<tr>
<td>Semantic Information Retrieval (SIR)</td>
<td>Bertram Raphael</td>
<td><a href="https://ai.stanford.edu/~nilsson/QAI/qai.pdf#page=135">[3]</a></td>
<td>1964</td>
<td>A “machine understanding” program</td>
</tr>
<tr>
<td>QA3</td>
<td>C. Cordell Green (and Robert Yates)</td>
<td><a href="http://www.ai.sri.com/pubs/files/tn004-green69.pdf">[4]</a></td>
<td>1969</td>
<td>A resolution-based deduction system, which was an attempt to improve on Raphael’s SIR; QA3 is the successor of QA2 and QA1</td>
</tr>
<tr>
<td>SEE</td>
<td>Adolfo Guzman-Arenas</td>
<td><a href="https://dl.acm.org/doi/10.1145/1476589.1476631">[5]</a></td>
<td>1969</td>
<td>A program to segment a line drawing of a scene containing blocks into its constituents</td>
</tr>
<tr>
<td>DENDRAL</td>
<td>Edward Feigenbaum, Joshua Lederberg, Bruce Buchanan, Carl Djerassi, and others</td>
<td><a href="https://ai.stanford.edu/~nilsson/QAI/qai.pdf#page=255">[6]</a>, <a href="https://dl.acm.org/doi/10.1145/41526.41528">[7]</a>, <a href="https://stacks.stanford.edu/file/druid:pq644jd0400/pq644jd0400.pdf">[8]</a></td>
<td>1965-</td>
<td>A project, expert system or series of programs to help chemists identify the structure of molecules given their mass spectra and other expert knowledge</td>
</tr>
<tr>
<td>Stanford Research Institute Problem Solver (STRIPS)</td>
<td>Richard Fikes & Nils Nilsson</td>
<td><a href="https://ai.stanford.edu/users/nilsson/OnlinePubs-Nils/PublishedPapers/strips.pdf">[9]</a></td>
<td>~1970</td>
<td>A planning system used in <a href="https://www.youtube.com/watch?v=7bsEN8mwUB8">the Shakey robot</a></td>
</tr>
<tr>
<td>SHRDLU</td>
<td>Terry Winograd</td>
<td><a href="http://hci.stanford.edu/~winograd/shrdlu/AITR-235.pdf">[10]</a></td>
<td>1971</td>
<td>An NLP dialog system, which was only partially written in LISP</td>
</tr>
<tr>
<td>MYCIN</td>
<td>Edward (Ted) Shortliffe</td>
<td><a href="https://ai.stanford.edu/~nilsson/QAI/qai.pdf#page=291">[11]</a></td>
<td>~1970</td>
<td>An expert system that would consult with physicians about bacterial infections and therapy; “-mycin” is a common suffix of antibacterial drug names <a href="https://hearinglosshelp.com/blog/the-ototoxicity-of-drugs-ending-in-mycin-and-micin/">[12]</a>; the specific version of LISP used was <a href="http://www.softwarepreservation.org/projects/LISP/bbnlisp">BBN-LISP</a></td>
</tr>
<tr>
<td>Language Interface Facility with Elliptical and Recursive Features (LIFER)</td>
<td>Gary Hendrix</td>
<td><a href="http://www.ai.sri.com/pubs/files/1414.pdf">[13]</a></td>
<td>1976</td>
<td>A program to interact with databases in a subset of natural language (e.g. English); the specific version of LISP used was <a href="https://interlisp.org/">INTERLISP</a>, a successor of <a href="http://www.softwarepreservation.org/projects/LISP/bbnlisp">BBN-LISP</a></td>
</tr>
</tbody>
</table>
<p>In addition to these programs, many of the implementations of the <em>conceptual structures</em> by Roger C. Schank were in LISP <a href="https://ai.stanford.edu/~nilsson/QAI/qai.pdf#page=207">[8]</a>.</p>
<p>Later, LISP was also <a href="https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/genetic/gp/systems/koza/0.html">used by John Koza in the context of GP</a> (but this was already in the 90s). In 1998, NASA also used <a href="http://www.lispworks.com/">LispWorks</a> to develop the “Remote Agent” (RA), a robotic system for planning and executing spacecraft actions, in the context of <a href="https://www.jpl.nasa.gov/missions/deep-space-1-ds1">Deep Space 1</a> <a href="https://ai.stanford.edu/~nilsson/QAI/qai.pdf#page=603">[8]</a>.</p>
<p>If you are aware of any LISP program developed in the early days of AI (50s-90s) that is not mentioned above, you can share it with us in the comment section below and I will include it in the table above.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>I am not a LISP programmer, but 3-4 years ago I implemented a simple plugin for Emacs in Emacs Lisp. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Most of these programs are mentioned in the book <a href="https://ai.stanford.edu/~nilsson/QAI/qai.pdf">The Quest for Artificial Intelligence: A History of Ideas and Achievements</a>, (2009) by Nils J. Nilsson, which I’ve been reading and enjoying. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Not all of these programs were fully implemented in LISP, and there may also be other implementations of these programs in other programming languages. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Sat, 04 Dec 2021 00:00:00 +0000
https://nbro.gitlab.io//blogging/2021/12/04/historically-relevant-programs-developed-in-lisp/
https://nbro.gitlab.io//blogging/2021/12/04/historically-relevant-programs-developed-in-lisp/Optimal value function of shifted rewards<h2 id="theorem">Theorem</h2>
<p>Consider the following <strong>Bellman optimality equation (BOE)</strong> (<a href="http://incompleteideas.net/book/RLbook2020.pdf#page=86">equation 3.20 of Sutton & Barto book on RL, 2nd edition, p. 64</a>)</p>
\[q_*(s,a)
=\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right)\tag{1}\label{1}.\]
<p>If we add the same constant \(c \in \mathbb{R}\) to all rewards \(r \in \mathcal{R}\), then the new optimal state-action value function is given by</p>
\[q_*(s, a) + k,\]
<p>where</p>
\[k = \frac{c}{1 - \gamma}
= c\left(\frac{1}{1 - \gamma}\right)
= c \left( \sum_{i=0}^{\infty} \gamma^{i} \right)
= c \left( 1 + \gamma + \gamma^2 + \gamma^3 + \dots \right),\]
<p>where \(0 \leq \gamma < 1\) is the discount factor of the MDP and \(\sum_{i=0}^{\infty} \gamma^{i}\) is a <a href="https://en.wikipedia.org/wiki/Geometric_series">geometric series</a> <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">1</a></sup>.</p>
<h3 id="assumptions">Assumptions</h3>
<ul>
<li>
<p>\(0 \leq \gamma < 1\); if we allowed \(\gamma = 1\), then \(\frac{c}{1 - \gamma} = c/0\), which is undefined.</p>
</li>
<li>
<p>For <strong>episodic problems</strong> <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup>, we <strong>assume</strong> that we have an <strong>absorbing state</strong> \(s_\text{absorbing}\) <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">3</a></sup>, which is the state that the agent moves to after it has reached the goal, where the agent gets a reward of \(0\) for all future time steps. So, \(q_*(s_\text{absorbing}, a) =0, \forall a \in\mathcal{A}(s_\text{absorbing})\).</p>
</li>
</ul>
<h2 id="proof">Proof</h2>
<p>To prove this, we need to show that the following equation is equivalent to the BOE in \ref{1}.</p>
\[q_*(s,a) + k
= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left((r + c) + \gamma \max_{a' \in\mathcal{A}(s')} \left( q_*(s',a') + k \right) \right) \tag{2}\label{2}\]
<p>Given that \(k = \frac{c}{1 - \gamma}\) is a constant, it does not affect the max, because we add this constant to all state-action values: this holds even if \(c\) is negative! So, we can take \(k\) out of the max and add it to \(\max_{a'\in\mathcal{A}(s')} q_*(s',a')\)</p>
\[\begin{align*}
q_*(s,a) + k
&= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left((r + c) + \gamma \left (k + \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \right)
\\
&= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left((r + c) + \frac{c \gamma}{1 - \gamma} + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right)
\\
&= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left(r + \frac{c(1 - \gamma) + c \gamma}{1 - \gamma} + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right)
\\
&= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)\left(r + \frac{c - c\gamma + c \gamma}{1 - \gamma} + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right)
\\
&= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} \left ( p(s',r \mid s,a)\frac{c}{1 - \gamma} \right) +
\\
&
\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} \left( p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \right).
\tag{3}\label{3}
\end{align*}\]
<p>Given that \(p(s',r \mid s,a)\) is a probability distribution, \(\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} \left ( p(s',r \mid s,a)\frac{c}{1 - \gamma} \right)\) is the expectation of the constant \(\frac{c}{1 - \gamma}\), which is equal to the constant itself.</p>
<p>So, equation \ref{3} becomes</p>
\[q_*(s,a) + \frac{c}{1 - \gamma}
= \frac{c}{1 - \gamma} + \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right) \\
\iff \\
q_*(s,a)
=\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a) \left(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a') \right)\]
<p>which is the Bellman optimality equation \ref{1}.</p>
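<p>This result can also be checked numerically with value iteration. The toy solver and 2-state MDP below are illustrative assumptions, not part of the post:</p>

```python
def q_value_iteration(P, R, gamma, iters=2000):
    """Compute q* by iterating the Bellman optimality equation.
    P[s][a] is the (deterministic, for simplicity) next state and
    R[s][a] the reward; this is a toy solver, not a general one."""
    q = [[0.0] * len(P[0]) for _ in P]
    for _ in range(iters):
        # Synchronous backup: the comprehension reads the old q entirely.
        q = [[R[s][a] + gamma * max(q[P[s][a]])
              for a in range(len(P[s]))] for s in range(len(P))]
    return q

gamma, c = 0.9, 5.0
P = [[0, 1], [1, 0]]           # deterministic transitions (s, a) -> s'
R = [[1.0, 0.0], [0.0, 2.0]]   # rewards r(s, a)
q = q_value_iteration(P, R, gamma)
q_shifted = q_value_iteration(P, [[r + c for r in row] for row in R], gamma)
k = c / (1 - gamma)
# Every entry of q_shifted equals the corresponding entry of q plus k.
```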
<h2 id="interpretation">Interpretation</h2>
<p>The result above suggests that, <strong>if we add a constant to all rewards</strong>, which is a form of <strong>reward shaping</strong>, <strong>the set of optimal policies does not change</strong>.</p>
<p>Is this always true? Yes, <strong>in theory</strong>.</p>
<p>However, we must be careful with <em>episodic problems</em>.</p>
<ul>
<li>
<p>In theory, after we shift the rewards by \(c\), the agent will precisely get an additional reward of \(k = \frac{c}{1 - \gamma}\) for being in any state, including the <em>absorbing state</em>, and taking any action. So, after we shift the rewards, we have \(q_*(s_\text{absorbing}, a) = \frac{c}{1 - \gamma}, \forall a \in\mathcal{A}(s_\text{absorbing})\).</p>
</li>
<li>
<p><strong>In practice</strong>, we might mis-specify the reward function if we shift the rewards and terminate the episode once the agent gets to \(s_\text{absorbing}\).</p>
</li>
</ul>
<h3 id="example">Example</h3>
<p>To illustrate this issue, let’s say that, for an episodic problem (for example, a problem where the agent is in a grid and needs to go to a goal location), we have the following (deterministic) reward function</p>
\[r(s, a) =
\begin{cases}
1, \text{if } s = s_\text{goal}\\
0, \text{if } s = s_\text{absorbing}\\
0, \text{otherwise} \\
\end{cases}\]
<p>\(s_\text{absorbing}\) is just the state that we <strong>assume</strong> the agent moves to after having reached the goal state, so that it continues to get a reward of \(0\); this <strong>assumption</strong> allows us to terminate the episode once we get to \(s_\text{goal}\).</p>
<p>Now, let’s say that we define a new reward function as \(r'(s, a) \triangleq r(s, a) - 1\), i.e.</p>
\[r'(s, a) =
\begin{cases}
0, \text{if } s = s_\text{goal}\\
-1, \text{if } s = s_\text{absorbing}\\
-1, \text{otherwise} \\
\end{cases}\]
<p>So, in theory, with \(r'\), the agent no longer gets a reward of \(0\) after reaching the goal. However, if you terminate the episode once the agent reaches the goal, this will not be taken into account: you implicitly assume that \(r'(s_\text{absorbing}, a) = 0, \forall a \in \mathcal{A}(s_\text{absorbing})\), i.e. you’re actually optimizing</p>
\[r''(s, a) =
\begin{cases}
0, \text{if } s = s_\text{goal}\\
0, \text{if } s = s_\text{absorbing}\\
-1, \text{otherwise} \\
\end{cases}\]
<p>So, in practice, you might be optimizing a different objective function than the one you <em>implicitly</em> or <em>unconsciously</em> assumed. In this example, \(r''(s, a)\) is the reward function that encourages the agent to get to the goal as quickly as possible (because you get a penalty of \(-1\) for every time step that you have not reached the goal), so, in practice, \(r''(s, a)\) might be what you want to optimize, but, in general, you must be careful with <strong>reward misspecification</strong> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">4</a></sup>!</p>
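<p>This discrepancy can be seen numerically on a toy 3-state chain (the environment and solver below are my illustrative assumptions, not from the post): under \(r'\) the value of the goal state is \(\gamma \frac{-1}{1-\gamma}\), while terminating the episode (i.e. optimizing \(r''\)) makes it \(0\), and the two value functions do not differ by a constant.</p>

```python
def chain_q_star(r_absorbing, gamma=0.9, iters=2000):
    """q* for a toy chain: state 0 = start, 1 = goal, 2 = absorbing.
    Actions: 0 = stay, 1 = move right; from the goal, every action leads
    to the absorbing state, which loops onto itself.  The reward depends
    only on the current state, as in the example above, except that the
    absorbing state's reward is a parameter, so we can compare the two
    objectives."""
    P = [[0, 1], [2, 2], [2, 2]]
    def r(s):
        if s == 1:              # goal
            return 0.0
        if s == 2:              # absorbing
            return r_absorbing
        return -1.0             # everywhere else
    q = [[0.0, 0.0] for _ in range(3)]
    for _ in range(iters):
        q = [[r(s) + gamma * max(q[P[s][a]]) for a in (0, 1)]
             for s in range(3)]
    return q

q_rp = chain_q_star(-1.0)   # r':  the absorbing state keeps paying -1
q_rpp = chain_q_star(0.0)   # r'': what terminating the episode implies
# q_rpp[1][a] == 0, while q_rp[1][a] == 0.9 * (-1 / 0.1) == -9: the two
# objectives really are different (in this toy case the greedy policy
# happens to coincide, but the value functions differ non-uniformly).
```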
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:4" role="doc-endnote">
<p>\(k\) is also written as a geometric series to emphasize that \(k\) is similar to the <em>discounted return</em>, which is defined as \(G_t = \sum_{i=0}^{\infty} \gamma^{i}R_{t+1+i}\), where \(R_{t+1+i}\) is the reward at time step \(t+1+i\). If all rewards were equal to \(c\), then \(G_t = \frac{c}{1 - \gamma}\). <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>See <a href="http://incompleteideas.net/book/RLbook2020.pdf#page=76">sections 3.3 and 3.4. (p. 54)</a> of Sutton & Barto book (2nd edition) for more details about the difference between episodic and continuing problems and how they can be unified. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>The assumption of having an absorbing state is also made in <a href="http://luthuli.cs.uiuc.edu/~daf/courses/games/AIpapers/ml99-shaping.pdf">Policy invariance under reward transformations: Theory and application to reward shaping</a> (1999) by Andrew Y. Ng et al., which is a seminal paper on reward shaping, which cites the book <a href="https://jmvidal.cse.sc.edu/library/neumann44a.pdf">Theory of Games and Economic Behavior</a> (1944) by John von Neumann et al. to support the claim that, for single-step decisions (which I assume to be some kind of bandit problem), positive linear transformations of the utility function do not change the optimal decision/policy: if we combine the theorem in this blog post and <a href="https://nbro.gitlab.io/blogging/2019/09/15/optimal-value-function-of-scaled-rewards/">the theorem in my previous blog post</a>, we get a similar result. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>This idea of <em>reward misspecification</em> has been studied in the literature. For example, in the paper, <a href="https://papers.nips.cc/paper/2017/hash/32fdab6559cdfa4f167f8c31b9199643-Abstract.html">Inverse Reward Design</a> (2017) by Dylan Hadfield-Menell et al. the authors propose an approach to deal with <em>proxy reward functions</em> (i.e. the reward functions designed by the human, which might not be the reward functions that the human intended to define). <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Sun, 01 Nov 2020 00:00:00 +0000
https://nbro.gitlab.io//blogging/2020/11/01/optimal-value-function-of-shifted-rewards/
https://nbro.gitlab.io//blogging/2020/11/01/optimal-value-function-of-shifted-rewards/On the definition of intelligence<h2 id="introduction">Introduction</h2>
<p>Many people claim that we still do not agree on a definition of intelligence (and thus on what constitutes an artificial intelligence), usually arguing that intelligence means something different to different people or that we still do not understand everything about (human or animal) intelligence. In fact, in the article <a href="http://www-formal.stanford.edu/jmc/whatisai/">What is artificial intelligence?</a> (2007), John McCarthy, one of the official founders of the AI field, states</p>
<blockquote>
<p>The problem is that we cannot yet characterize in general what kinds of computational procedures <strong>we want to call</strong> intelligent. We understand some of the mechanisms of intelligence and not others.</p>
</blockquote>
<p>To understand all mechanisms of intelligence, some people, such as Jeff Hawkins, have been studying the human brain (which is the main example of a system that is associated with intelligence).</p>
<p>We might not know <strong>how</strong> we are intelligent (i.e. how the human brain makes us intelligent), but this does not mean that we can’t come up with a general definition of intelligence that comprises all forms of intelligence (that people could possibly refer to). In other words, you do not need to fully understand all mechanisms of intelligence in order to attempt to provide a general definition of intelligence. For example, theoretical physicists (such as Albert Einstein) do not need to understand all the details of physics in order to come up with general laws of physics that are applicable in most cases and that explain many phenomena.</p>
<h2 id="universal-intelligence">Universal Intelligence</h2>
<p>There has been at least one quite serious attempt to formally define <em>intelligence</em> (and machine intelligence), so that it comprises all forms of intelligence that people could refer to.</p>
<p>In the paper <a href="https://arxiv.org/pdf/0712.3329.pdf">Universal Intelligence: A Definition of Machine Intelligence</a> (2007), Legg and Hutter, after having researched many previously given definitions of intelligence, informally define intelligence as follows</p>
<blockquote>
<p><strong>Intelligence measures an agent’s ability to achieve goals in a wide range of environments</strong></p>
</blockquote>
<p>This definition favors systems that are able to solve many tasks, which are often known as <a href="http://www.scholarpedia.org/article/Artificial_General_Intelligence"><strong>artificial general intelligences (AGIs)</strong></a>, over systems that are only able to solve a specific task, sometimes known as <strong>narrow AIs</strong>.</p>
<h3 id="mathematical-formalization">Mathematical Formalization</h3>
<p>To understand why this is the case, let’s look at their simple mathematical formalization of this definition (<a href="https://arxiv.org/pdf/0712.3329.pdf#page=20">section 3.3 of the paper</a>)</p>
\[\Gamma(\pi) := \sum_{\mu \in E} \frac{1}{2^{K(\mu)}} V_{\mu}^{\pi}\]
<p>where</p>
<ul>
<li>\(\Gamma(\pi)\) is the <em>universal intelligence</em> of agent \(\pi\)</li>
<li>\(E\) is the space of all <em>computable reward summable environmental measures with respect to the reference machine \(U\)</em> (roughly speaking, the space of all environments)</li>
<li>\(\mu\) is the environment (or task/problem)</li>
<li>\(V_{\mu}^{\pi}\) is the ability of the agent \(\pi\) to achieve goals in the environment \(\mu\)</li>
<li>\(K(\mu)\) is the Kolmogorov complexity of the environment \(\mu\)</li>
</ul>
<h3 id="interpretation">Interpretation</h3>
<p>We can immediately notice that the intelligence of an agent is a weighted combination of the ability to achieve goals in the environments (which represent the tasks/problems to be solved), where each weight is inversely proportional to the complexity of the environment (i.e. the difficulty of describing/solving the corresponding task). In other words, \(\Gamma(\pi)\) is defined as an expectation of \(V_{\mu}^{\pi}\) with respect to the probability distribution \(\frac{1}{2^{K(\mu)}}\), which Legg and Hutter call the <em>universal distribution</em>.</p>
<p>So, the higher the complexity of an environment, the less the ability of the agent to achieve goals in this environment contributes to the intelligence of the agent. In other words, the ability to solve a very difficult task successfully might not be enough to have high intelligence. You can have higher intelligence by solving many but simpler problems. Of course, an intelligent agent that solves all tasks optimally would be the optimal or perfect agent. <a href="https://jan.leike.name/AIXI.html">AIXI</a>, developed and formalized by Hutter, is actually an optimal agent (in some sense), but, unfortunately, it is incomputable (because it uses the Kolmogorov complexity)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>
<p>Consequently, according to this definition, we could say that all animals (and maybe even other biological organisms) are <em>more intelligent</em> than, for example, <a href="https://www.nature.com/articles/nature16961">AlphaGo</a> or <a href="https://www.sciencedirect.com/science/article/pii/S0004370201001291">DeepBlue</a>, because all animals solve many problems, although they might not be as difficult as Go, while AlphaGo only solves Go <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
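<p>Since \(K\) is incomputable, \(\Gamma(\pi)\) cannot actually be computed, but the structure of the definition is easy to illustrate with a toy sketch, where the per-environment scores \(V_{\mu}^{\pi}\) and complexities \(K(\mu)\) are made-up numbers:</p>

```python
def universal_intelligence(scores, complexities):
    """Toy version of Legg & Hutter's Gamma(pi): a sum of per-environment
    scores V weighted by 2**(-K).  K here is a made-up integer stand-in
    for the (incomputable) Kolmogorov complexity of each environment."""
    return sum(v / 2 ** k for v, k in zip(scores, complexities))

# A "narrow" agent: perfect on a single complex environment (K = 10),
# useless everywhere else.
narrow = universal_intelligence([0.0, 0.0, 1.0], [1, 2, 10])
# A "general" agent: mediocre on all three environments.
general = universal_intelligence([0.5, 0.5, 0.5], [1, 2, 10])
# The general agent gets the higher score, as argued above.
```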
<h3 id="open-questions">Open Questions</h3>
<p>I like this definition of universal intelligence because it implies that humans (and other animals) are more (generally) intelligent than AlphaGo or any other computer program, but it raises at least a couple of questions:</p>
<ol>
<li>
<p>How would we measure the difficulty of a real-world environment?</p>
</li>
<li>
<p>So, in practice, can we really compare an animal with AlphaGo? Yes, we can with intelligence tests like the <a href="https://academic.oup.com/mind/article/LIX/236/433/986238">Turing test</a>, but can we do it with \(\Gamma(\pi)\)? The answer to this question clearly depends on the answer to the question above.</p>
</li>
</ol>
<h3 id="intelligence-tests">Intelligence Tests</h3>
<p>In <a href="https://arxiv.org/pdf/0712.3329.pdf">the paper</a>, they also discuss issues like <em>intelligence tests</em> and their relation to the definition of intelligence: that is, is an intelligence test sufficient to define intelligence, or are an intelligence test and a definition of intelligence distinct concepts?</p>
<h2 id="conclusion">Conclusion</h2>
<p>In my view, it is unproductive to come up with new definitions of intelligence (unless they are more generally applicable than universal intelligence) or to avoid choosing one definition with the excuse that we don’t know what intelligence is. I know what intelligence is: it’s measured by \(\Gamma(\pi)\). So, I don’t need to know <strong>how</strong> we can create a (highly) intelligent agent before I know <strong>what</strong> intelligence is. It’s not a matter of liking a definition or not; it’s a matter of defining a set of axioms or hypotheses, then deriving other properties from them or testing those hypotheses, respectively.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>\(\Gamma(\pi)\) is also a function of the Kolmogorov complexity, but this is just a definition, i.e. it does not <em>directly</em> give you the instructions to develop intelligent agents. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Note that, according to this definition, AlphaGo is still intelligent, but just not as intelligent as animals. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Wed, 20 May 2020 00:00:00 +0000
https://nbro.gitlab.io//blogging/2020/05/20/on-the-definition-of-intelligence/
https://nbro.gitlab.io//blogging/2020/05/20/on-the-definition-of-intelligence/Optimal value function of scaled rewards<h2 id="theorem">Theorem</h2>
<p>Consider the following <strong>Bellman optimality equation (BOE)</strong> (<a href="http://incompleteideas.net/book/RLbook2020.pdf#page=86">equation 3.20 of Sutton & Barto book on RL, 2nd edition, p. 64</a>)</p>
\[q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')}q_*(s',a')) \tag{1}\label{1}.\]
<p>If we multiply all rewards by the same constant \(c \in \mathbb{R}\), with \(c > 0\), then the new optimal state-action value function is given by</p>
\[cq_*(s, a).\]
<h2 id="proof">Proof</h2>
<p>To prove this, we need to show that the following BOE</p>
\[c q_*(s,a)
= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(c r + \gamma \max_{a'\in\mathcal{A}(s')} c q_*(s',a')). \tag{2}\label{2}\]
<p>is equivalent to the BOE in equation \ref{1}.</p>
<p>Given that \(c > 0\), then</p>
\[\max_{a'\in\mathcal{A}(s')} c q_*(s',a') = c\max_{a'\in\mathcal{A}(s')}q_*(s',a'),\]
<p>so \(c\) can be taken out of the \(\operatorname{max}\) operator.</p>
<p>Therefore, the equation \ref{2} becomes</p>
\[\begin{align*}
c q_*(s,a)
&= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(c r + \gamma c \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\
&= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}c p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\
&= c \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\
&\iff
\\
q_*(s,a)
&= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')),
\end{align*}
\tag{3}\label{3}\]
<p>which is equal to the Bellman optimality equation in \ref{1}. This implies that, when the reward is given by \(cr\), \(c q_*(s,a)\) is the solution to the Bellman optimality equation.</p>
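<p>This result, too, can be checked numerically with value iteration on a toy MDP (the solver and the 2-state MDP below are illustrative assumptions, not part of the post):</p>

```python
def q_value_iteration(P, R, gamma, iters=2000):
    """Compute q* by iterating the Bellman optimality equation.
    P[s][a] is the (deterministic, for simplicity) next state and
    R[s][a] the reward; a toy solver for this check only."""
    q = [[0.0] * len(P[0]) for _ in P]
    for _ in range(iters):
        # Synchronous backup: the comprehension reads the old q entirely.
        q = [[R[s][a] + gamma * max(q[P[s][a]])
              for a in range(len(P[s]))] for s in range(len(P))]
    return q

gamma, c = 0.9, 3.0
P = [[0, 1], [1, 0]]           # deterministic transitions (s, a) -> s'
R = [[1.0, 0.0], [0.0, 2.0]]   # rewards r(s, a)
q = q_value_iteration(P, R, gamma)
q_scaled = q_value_iteration(P, [[c * r for r in row] for row in R], gamma)
# Every entry of q_scaled equals c times the corresponding entry of q.
```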
<h2 id="interpretation">Interpretation</h2>
<p>Consequently, <strong>whenever we multiply the reward function by some positive constant</strong>, which can be viewed as a form of <strong>reward shaping</strong> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup>, <strong>the set of optimal policies does not change</strong> <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup>.</p>
<h2 id="what-if-the-constant-is-zero-or-negative">What if the constant is zero or negative?</h2>
<p>For completeness, if \(c=0\), then \ref{2} becomes \(0=0\), which is true.</p>
<p>If \(c < 0\), then \(\max_{a'\in\mathcal{A}(s')} c q_*(s',a') = c\min_{a'\in\mathcal{A}(s')}q_*(s',a')\), so equation \ref{3} becomes</p>
\[q_*(s,a)
= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \min_{a'\in\mathcal{A}(s')} q_*(s',a')),\]
<p>which is <em>not</em> equal to the Bellman optimality equation in \ref{1}.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:2" role="doc-endnote">
<p>A seminal paper on reward shaping is <a href="http://luthuli.cs.uiuc.edu/~daf/courses/games/AIpapers/ml99-shaping.pdf">Policy invariance under reward transformations: Theory and application to reward shaping</a> (1999) by Andrew Y. Ng et al. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>There can be more than one optimal policy for a given optimal value function (and Markov Decision Process) because we might have \(q_*(s,a_1) = q_*(s,a_2) \geq q_*(s,a), \text{for } a_1, a_2 \in \mathcal{A}(s) \text{ and } \forall a \in \mathcal{A}(s)\). Any greedy policy with respect to the optimal value function \(q_*(s,a)\) is an optimal policy. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Sun, 15 Sep 2019 00:00:00 +0000
https://nbro.gitlab.io//blogging/2019/09/15/optimal-value-function-of-scaled-rewards/
https://nbro.gitlab.io//blogging/2019/09/15/optimal-value-function-of-scaled-rewards/An example of how to use VisualDL with PyTorch<h2 id="abstract">Abstract</h2>
<blockquote>
<p>In this blog post, I will describe my journey while looking for visualization tools for PyTorch. In particular, I will briefly describe the options I tried out, and why I opted for VisualDL. Finally, and more importantly, I will show you a simple example of how to use VisualDL with PyTorch, both to visualize the parameters of the model and to read them back from the file system, in case you need them, e.g. to plot them with another tool (e.g. with Matplotlib).</p>
</blockquote>
<h2 id="introduction">Introduction</h2>
<p>Yesterday, I was trying to find and use a visualization tool, similar to <a href="https://www.tensorflow.org/guide/summaries_and_tensorboard">TensorBoard</a>, but for <a href="https://pytorch.org/">PyTorch</a>. I find this type of visualization tool very useful, because it allows me to intuitively understand how the model is behaving and, in particular, how certain parameters and hyper-parameters of the model change while the model is being trained and tested. So, these tools are especially useful while debugging our programs, or if one needs to present the model to other people (e.g. teammates) while it is being trained or tested.</p>
<p>There still isn’t a “standard” tool for visualization in PyTorch (AFAIK). However, there are several “decent” options: <a href="https://github.com/pytorch/tnt">TNT</a>, <a href="https://github.com/lanpa/tensorboardX">tensorboardX</a> or <a href="https://github.com/PaddlePaddle/VisualDL">VisualDL</a>. There may be other options, but these are, apparently, the most popular ones I found, according to the number of stars of the corresponding GitHub repositories.</p>
<p>If you perform a search on the web, you will find discussions and questions regarding visualization tools for PyTorch, for example <a href="https://discuss.pytorch.org/t/any-visualization-tools-for-pytorch-to-help-people-debugging-the-network/19540">this</a>, <a href="https://discuss.pytorch.org/t/graph-visualization/1558/5">this</a> and <a href="https://www.quora.com/What-are-the-different-tools-to-visualize-the-training-process-in-PyTorch">this</a>.</p>
<h2 id="the-options">The options</h2>
<p>I tried all three options I mentioned above.</p>
<h3 id="tnt">TNT</h3>
<p>I first tried to use <a href="https://github.com/pytorch/tnt">TNT</a>, which can be used for logging purposes during the training and testing process of our models. TNT actually uses <a href="https://github.com/facebookresearch/visdom">Visdom</a> (which is a quite flexible and general visualization tool created by Facebook) to display the info in the form of plots. Therefore, you also need Visdom as a dependency.</p>
<p>More specifically, I tried to run <a href="https://github.com/pytorch/tnt/blob/master/example/mnist_with_visdom.py">this TNT example</a>.
To successfully run the mentioned example using PyTorch 1.0.0, you need to modify a statement that causes an error (the statement was already deprecated, but nobody had updated the example). See <a href="https://github.com/pytorch/tnt/issues/108">this GitHub issue</a> (on the TNT repo) for more info. I didn’t much like this example (and, by extension, TNT), because the logic of the program had to change drastically (with respect to a “usual” PyTorch program) just to visualize the evolution of e.g. the training loss. To me, this seemed like a sign of inflexibility. Therefore, I looked for other options.</p>
<h3 id="tensorboardx">tensorboardX</h3>
<p>I then tried <a href="https://github.com/lanpa/tensorboardX">tensorboardX</a>, which is, right now, the most popular of the three options (in terms of GitHub stars).</p>
<p>There are several examples which show how to use this tool. You can find them <a href="https://github.com/lanpa/tensorboardX/tree/master/examples">here</a> and <a href="https://github.com/lanpa/tensorboard-pytorch-examples">here</a>. In particular, I tried <a href="https://github.com/lanpa/tensorboard-pytorch-examples/tree/master/mnist">this example</a>. To use <a href="https://github.com/lanpa/tensorboardX">tensorboardX</a>, we actually need to install <a href="https://www.tensorflow.org/guide/summaries_and_tensorboard">TensorBoard</a> and <a href="https://www.tensorflow.org/">TensorFlow</a>, which are not lightweight dependencies. Moreover, this was my first time using PyTorch, and the thought of needing TensorFlow (which I had previously used in other projects) to achieve something in PyTorch made me think I might as well stick with TensorFlow and give up on PyTorch. Of course, this doesn’t make much sense, but it just didn’t feel like the right direction. Therefore, I looked for another option.</p>
<h3 id="visualdl">VisualDL</h3>
<p>Finally, I tried <a href="https://github.com/PaddlePaddle/VisualDL">VisualDL</a> (Visual Deep Learning), which is essentially a visualization tool very similar to TensorBoard: its backend is written in C++, it exposes both C++ and Python APIs, and its frontend (web interface) is written in <a href="https://vuejs.org/">Vue</a>. You can find its documentation at <a href="http://visualdl.paddlepaddle.org/">http://visualdl.paddlepaddle.org/</a>. Nowadays, most machine learning frameworks and libraries are written in C++ and have a Python API, so this characteristic of VisualDL is consistent with many other machine learning tools. One of the goals of this visualization tool is to be “cross-framework”, i.e. not tailored to a specific framework (like TensorFlow or PyTorch). I like flexibility, so this feature immediately biased me towards VisualDL. The <a href="http://visualdl.paddlepaddle.org/">official website</a> claims that VisualDL works with <a href="https://caffe2.ai/">Caffe2</a>, <a href="http://www.paddlepaddle.org/">PaddlePaddle</a>, <a href="https://pytorch.org/">PyTorch</a>, <a href="https://keras.io/">Keras</a> and <a href="https://mxnet.apache.org/">MXNet</a>. I imagine that more frameworks will be supported in the future.</p>
<p>I first read <a href="https://github.com/PaddlePaddle/VisualDL">the README file</a> of the VisualDL GitHub repo, <a href="http://visualdl.paddlepaddle.org/documentation/visualdl/en/develop/getting_started/introduction_en.html">the documentation</a>, <a href="http://visualdl.paddlepaddle.org/documentation/visualdl/en/develop/getting_started/demo/pytorch/TUTORIAL_EN.html">this tutorial</a> and <a href="https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/pytorch/pytorch_cifar10.py">this example</a>. There are other examples (for other frameworks), which you can find <a href="https://github.com/PaddlePaddle/VisualDL/tree/develop/demo">here</a>. VisualDL seems to be in its early phases, but you can already accomplish several of the things you would expect from e.g. TensorBoard. For example, you can visualize (more or less in real time) the evolution of scalar values of your model (e.g. the learning rate or the training loss), and you can also plot histograms and visualize the computational graph.</p>
<p>However, this blog post is not dedicated to presenting all features of VisualDL, so I leave it to the reader to explore the remaining ones. In the next section, I will show a brief example of how to use VisualDL with PyTorch and how to read the logged data back, once it has been written (and possibly visualized), using the API that VisualDL provides.</p>
<h2 id="how-can-visualdl-be-used-to-visualize-statistics-of-pytorch-models">How can VisualDL be used to visualize statistics of PyTorch models?</h2>
<p>Before proceeding, you need to install PyTorch and VisualDL. The source code of this example can be found at <a href="https://github.com/nbro/visualdl_with_pytorch_example">https://github.com/nbro/visualdl_with_pytorch_example</a>. I installed PyTorch and VisualDL in an <a href="https://www.anaconda.com/">Anaconda</a> environment, but you can install them however you please. If you want to follow along exactly, please read the instructions <a href="https://github.com/nbro/visualdl_with_pytorch_example">here</a> on how to set up your environment and run the example.</p>
<p>I will not describe all the details of this example, but only the ones associated with the usage of VisualDL with PyTorch.</p>
<p>The file <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py"><code class="language-plaintext highlighter-rouge">write_visualdl_data.py</code></a> uses VisualDL to visualize the evolution of some metrics or statistics (specifically, the training loss, the test loss and the test accuracy) of the associated model (a CNN trained and tested on the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a>). I first imported the class <code class="language-plaintext highlighter-rouge">LogWriter</code> from <code class="language-plaintext highlighter-rouge">visualdl</code> (line <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py#L11">11</a>):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">visualdl</span> <span class="kn">import</span> <span class="n">LogWriter</span></code></pre></figure>
<p>I then created a <code class="language-plaintext highlighter-rouge">LogWriter</code> object (at line <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py#L13">13</a>)</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">log_writer</span> <span class="o">=</span> <span class="n">LogWriter</span><span class="p">(</span><span class="s">"./log"</span><span class="p">,</span> <span class="n">sync_cycle</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span></code></pre></figure>
<p>where <code class="language-plaintext highlighter-rouge">"./log"</code> is the name of the folder where the logging files will be placed, and <code class="language-plaintext highlighter-rouge">sync_cycle</code> is a parameter that controls how often VisualDL writes the logged data to the file system. Have a look at <a href="http://visualdl.paddlepaddle.org/documentation/visualdl/en/develop/api/initialize_logger.html#visualdl.LogWriter">the documentation</a> for more info.</p>
<p>Then, at lines <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py#L158">158</a> and <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py#L161">161</a>, I defined the specific loggers (of type “scalar”, given that the training loss, the test loss and the test accuracy are scalar values) that will be used to record statistics during the training and testing phases, respectively:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">with</span> <span class="n">log_writer</span><span class="p">.</span><span class="n">mode</span><span class="p">(</span><span class="s">"train"</span><span class="p">)</span> <span class="k">as</span> <span class="n">logger</span><span class="p">:</span>
<span class="n">train_losses</span> <span class="o">=</span> <span class="n">logger</span><span class="p">.</span><span class="n">scalar</span><span class="p">(</span><span class="s">"scalars/train_loss"</span><span class="p">)</span>
<span class="k">with</span> <span class="n">log_writer</span><span class="p">.</span><span class="n">mode</span><span class="p">(</span><span class="s">"test"</span><span class="p">)</span> <span class="k">as</span> <span class="n">logger</span><span class="p">:</span>
<span class="n">test_losses</span> <span class="o">=</span> <span class="n">logger</span><span class="p">.</span><span class="n">scalar</span><span class="p">(</span><span class="s">"scalars/test_loss"</span><span class="p">)</span>
<span class="n">test_accuracies</span> <span class="o">=</span> <span class="n">logger</span><span class="p">.</span><span class="n">scalar</span><span class="p">(</span><span class="s">"scalars/test_accuracy"</span><span class="p">)</span></code></pre></figure>
<p>What this piece of code tells us is that, under the <a href="http://visualdl.paddlepaddle.org/documentation/visualdl/en/develop/api/initialize_logger.html#visualdl.LogWriter.as_mode">“mode”</a> <code class="language-plaintext highlighter-rouge">"train"</code>, we are defining the <a href="http://visualdl.paddlepaddle.org/documentation/visualdl/en/develop/api/initialize_logger.html#visualdl.LogWriter.scalar">scalar</a> logger <code class="language-plaintext highlighter-rouge">train_losses</code>, which is associated with the “tag” <code class="language-plaintext highlighter-rouge">"scalars/train_loss"</code>. Similarly for the loggers associated with the mode <code class="language-plaintext highlighter-rouge">"test"</code>. VisualDL is actually aware of these modes: they will later be used to retrieve the logged data from the file system (we will see this in the next section).</p>
<p>The specific scalar loggers <code class="language-plaintext highlighter-rouge">train_losses</code>, <code class="language-plaintext highlighter-rouge">test_losses</code> and <code class="language-plaintext highlighter-rouge">test_accuracies</code> are then passed to the functions <code class="language-plaintext highlighter-rouge">train</code> and <code class="language-plaintext highlighter-rouge">test</code> at lines <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py#L168">168</a> and <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py#L169">169</a>. The functions <code class="language-plaintext highlighter-rouge">train</code> and <code class="language-plaintext highlighter-rouge">test</code> are called at every epoch (inside a <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py#L165">loop</a>). Inside the <code class="language-plaintext highlighter-rouge">train</code> function, at line <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py#L61">61</a>, we add a “record” to the <code class="language-plaintext highlighter-rouge">train_losses</code> logger using:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">train_losses</span><span class="p">.</span><span class="n">add_record</span><span class="p">(</span><span class="n">epoch</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()))</span></code></pre></figure>
<p>Similarly, inside <code class="language-plaintext highlighter-rouge">test</code>, at lines <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py#L86">86</a> and 87, we add records for the <code class="language-plaintext highlighter-rouge">test_losses</code> and <code class="language-plaintext highlighter-rouge">test_accuracies</code> loggers, respectively.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">test_losses</span><span class="p">.</span><span class="n">add_record</span><span class="p">(</span><span class="n">epoch</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">test_loss</span><span class="p">))</span>
<span class="n">test_accuracies</span><span class="p">.</span><span class="n">add_record</span><span class="p">(</span><span class="n">epoch</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">test_accuracy</span><span class="p">))</span></code></pre></figure>
<p>The first argument of the <code class="language-plaintext highlighter-rouge">add_record</code> method is an “id” (here, the variable <code class="language-plaintext highlighter-rouge">epoch</code>), which is basically a key that will later be used to retrieve the value of the corresponding record (which, in the examples above, is either a loss or an accuracy value). I converted the record values with the function <code class="language-plaintext highlighter-rouge">float</code> to make sure they are all floating-point values.</p>
<p>These are the only lines of code I needed to add to <a href="https://github.com/pytorch/examples/blob/master/mnist/main.py">the original PyTorch program</a> to obtain logging and visualization functionalities using VisualDL. More specifically, I added about 10 lines, and these lines are quite self-explanatory.</p>
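<p>To make the write-side flow above easier to see at a glance, here is a condensed, self-contained sketch of it. It assumes VisualDL’s 1.x API (<code class="language-plaintext highlighter-rouge">LogWriter</code>, <code class="language-plaintext highlighter-rouge">mode</code>, <code class="language-plaintext highlighter-rouge">scalar</code>, <code class="language-plaintext highlighter-rouge">add_record</code>, as used in the snippets above); the <code class="language-plaintext highlighter-rouge">log_fake_run</code> helper and the synthetic losses are mine, introduced here for illustration only, and the VisualDL calls are guarded so the sketch still runs if VisualDL is missing or its API has changed:</p>

```python
# Hedged sketch of the write-side flow described above. Assumes VisualDL's
# 1.x API; `log_fake_run` and the synthetic losses are hypothetical.
try:
    from visualdl import LogWriter
except ImportError:  # VisualDL not installed: sketch only
    LogWriter = None


def log_fake_run(logdir="./log", epochs=3):
    """Log a few synthetic training losses, one record per epoch."""
    train_losses = None
    if LogWriter is not None:
        try:
            log_writer = LogWriter(logdir, sync_cycle=1000)
            with log_writer.mode("train") as logger:
                train_losses = logger.scalar("scalars/train_loss")
        except Exception:  # newer VisualDL versions changed this API
            train_losses = None
    recorded = []
    for epoch in range(1, epochs + 1):
        loss = 1.0 / epoch  # stand-in for a real training loss
        if train_losses is not None:
            train_losses.add_record(epoch, float(loss))
        recorded.append((epoch, float(loss)))
    return recorded
```

<p>Running <code class="language-plaintext highlighter-rouge">log_fake_run()</code> should populate the logging folder with the same kind of files the full example produces, ready to be served by the VisualDL web interface.</p>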
<p>The following picture shows the resulting web interface of VisualDL, after running this example and waiting for the <code class="language-plaintext highlighter-rouge">log</code> folder to be created and populated with some logging files produced by VisualDL (as explained <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/README.md">in this README file</a>):</p>
<p><img src="/images/visualdl.png" alt="The web interface of VisualDL" /></p>
<p>The screenshot does not completely show the bottom plot, but, of course, in the VisualDL web interface, you can scroll down. You can even expand single plots, among other things.</p>
<h2 id="how-can-we-read-the-logging-data-produced-by-visualdl">How can we read the logging data produced by VisualDL?</h2>
<p>During the training and testing phases of your model, VisualDL will produce some logging files, in our case under the folder <code class="language-plaintext highlighter-rouge">log</code>. These files are not in a human-readable format: they are <a href="https://github.com/protocolbuffers/protobuf">ProtoBuf</a> files (a detail you can safely ignore!).</p>
<p>Anyway, VisualDL also allows us to read these files using its API. We may want to do this because, for example, we may need to produce Matplotlib plots from the data generated during the training and testing phases.</p>
<p>More specifically, we can read these logging data (previously logged to a file using <code class="language-plaintext highlighter-rouge">LogWriter</code>, as explained in the previous example) using <code class="language-plaintext highlighter-rouge">LogReader</code>.</p>
<p>The simple Python module <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/read_visualdl_data.py"><code class="language-plaintext highlighter-rouge">read_visualdl_data.py</code></a> does exactly this. The statements are quite self-explanatory.</p>
<p>In particular, I would like to note a few things. First, at lines <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/read_visualdl_data.py#L6">6</a>, 14 and 22, I create a <code class="language-plaintext highlighter-rouge">LogReader</code> in a certain context or “mode”, and these modes correspond to the modes under which the data had been written in the previous example (see above).</p>
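<p>As a rough illustration, the read-side counterpart might look like the following. This is a hedged sketch, not a copy of <code class="language-plaintext highlighter-rouge">read_visualdl_data.py</code>: it assumes VisualDL’s 1.x <code class="language-plaintext highlighter-rouge">LogReader</code> API (with <code class="language-plaintext highlighter-rouge">mode</code>, <code class="language-plaintext highlighter-rouge">scalar</code>, <code class="language-plaintext highlighter-rouge">ids</code> and <code class="language-plaintext highlighter-rouge">records</code> methods), and <code class="language-plaintext highlighter-rouge">read_train_losses</code> is a name of my own. The guards make it safe to run even when VisualDL or the logs are absent:</p>

```python
# Hedged sketch of reading back scalar data logged under mode "train".
# Assumes VisualDL's 1.x LogReader API; `read_train_losses` is hypothetical.
import os

try:
    from visualdl import LogReader
except ImportError:  # VisualDL not installed: sketch only
    LogReader = None


def read_train_losses(logdir="./log"):
    """Return (epochs, losses) logged under mode "train", if available."""
    if LogReader is None or not os.path.isdir(logdir):
        return [], []
    try:
        reader = LogReader(logdir)
        # Open the reader under the same "mode" used when writing.
        with reader.mode("train") as r:
            scalar = r.scalar("scalars/train_loss")
            # ids() are the epochs passed to add_record; records() the values.
            return list(scalar.ids()), list(scalar.records())
    except Exception:  # other VisualDL versions differ here
        return [], []
```

<p>The tag and mode strings must match the ones used by the <code class="language-plaintext highlighter-rouge">LogWriter</code> in the previous example, otherwise nothing will be found.</p>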
<p>Note that the “ids” correspond to the variable <code class="language-plaintext highlighter-rouge">epoch</code> in the previous example.</p>
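<p>Once the ids (epochs) and record values are back in plain Python lists, producing the Matplotlib plot mentioned earlier is straightforward. A minimal sketch (the <code class="language-plaintext highlighter-rouge">plot_losses</code> helper and the placeholder data are mine, standing in for whatever <code class="language-plaintext highlighter-rouge">LogReader</code> returns):</p>

```python
# Hedged sketch: plot (epoch, loss) pairs read back from VisualDL logs.
try:
    import matplotlib
    matplotlib.use("Agg")  # render off-screen; no display needed
    import matplotlib.pyplot as plt
except ImportError:  # Matplotlib not installed: sketch only
    plt = None


def plot_losses(epochs, losses, out_path="train_loss.png"):
    """Save a line plot of the read-back training losses to a PNG file."""
    if plt is None:
        return None
    fig, ax = plt.subplots()
    ax.plot(epochs, losses, marker="o")
    ax.set_xlabel("epoch")
    ax.set_ylabel("training loss")
    fig.savefig(out_path)
    plt.close(fig)
    return out_path


# Placeholder data; in practice these come from the LogReader.
plot_losses([1, 2, 3], [0.9, 0.5, 0.3])
```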
<p>Note that you should not run <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/read_visualdl_data.py"><code class="language-plaintext highlighter-rouge">read_visualdl_data.py</code></a> before <a href="https://github.com/nbro/visualdl_with_pytorch_example/blob/master/write_visualdl_data.py"><code class="language-plaintext highlighter-rouge">write_visualdl_data.py</code></a> (or, at least, not before the <code class="language-plaintext highlighter-rouge">log</code> folder has been created and contains the logging files).</p>
<h2 id="visualdl-has-a-few-problems">VisualDL has a few problems</h2>
<p>I chose VisualDL (over TNT and tensorboardX), but VisualDL has a few problems too. For some reason, at least in my case, the line charts are only displayed after a few minutes: in the example above, only towards the end of the second epoch. See e.g. <a href="https://github.com/PaddlePaddle/VisualDL/issues/524">this GitHub issue</a>. Even worse, I noticed that sometimes the line charts are not displayed at all (i.e. they are blank): in that case, I need to wait for the experiment to finish, or restart the VisualDL server, in order to see them. I have also encountered a few weird runtime error messages on the terminal (similar to the one described in <a href="https://github.com/PaddlePaddle/VisualDL/issues/315">this issue</a>), whose causes I don’t yet know with certainty. Furthermore, I only tried to visualize line charts, so you may encounter other problems while using other features of VisualDL. Finally, some of these problems may be due to my inexperience with VisualDL (i.e. I may have done something wrong!).</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, I have briefly described three tools that can be used to visualize statistics of our PyTorch models while they are being trained and tested. I particularly liked VisualDL, so I provided two examples showing how to use it with PyTorch: one to log and visualize the statistics, and the other to read them back from the file system. VisualDL is still in its infancy, but, hopefully, it will be improved and its bugs fixed.</p>
Sun, 06 Jan 2019 00:00:00 +0000
https://nbro.gitlab.io//blogging/2019/01/06/an-example-of-how-to-use-visualdl-with-pytorch/