Chizoba Obasi blog Exploring Deep Learning. https://chizkidd.github.io// Wed, 25 Mar 2026 18:59:49 +0000 Jekyll v3.10.0 Inkcast: A Free, Browser-Based Audiobook Player <p>Earlier this year, I decided to force myself to read more. Not a New Year’s resolution, because those never last. The reason is that growing up as a child and young teenager, reading often felt like punishment. My mum required my siblings and me to read a certain number of pages from a designated book every day throughout elementary school. Missing a day meant mandatory punishment. In boarding secondary school, this eventually led to a stubborn, subconscious resistance to non-essential reading. Over the six years I spent there, I probably read only five to ten non-academic fiction books (though <em>Artemis Fowl</em> was a delight). So it is not hard to see where my indifference to reading came from.</p> <p>During the COVID-19 pandemic, however, I fell deep into podcasts. As an avid sports fan and TV show buff, I listened to everything: sports recaps, tech podcasts, expert interviews, the works (meeting Walter White at Anfield would be the ultimate dream). So when I decided to read more this year, audiobooks felt like the natural bridge. I already had a few EPUBs in the Apple Books app on my reading list and wondered: <em>Can I listen to these EPUBs using Apple Dictation’s two-finger swipe-down feature?</em> Unfortunately, it only works for the current page. It is quite janky, not very user-friendly, and frankly does not work well for my use case.</p> <p>Recently, I worked on <a href="https://chizkidd.github.io/2026/03/01/tonal-fidelity-multilingual-asr/">evaluating how a Facebook state-of-the-art (SOTA) automatic speech recognition (ASR) model handles Igbo tones</a>, trying to see whether it actually “listens” properly. So I have been dabbling with audio quite a bit this year. You could say I have been thinking about listening a lot. 
In the past, I also experimented with WaveNet (a generative model for raw audio) and its fundamental building block, <a href="https://chizkidd.github.io/Karpathy-Neural-Networks-Zero-to-Hero/006_makemore_WaveNet/makemore_WaveNet.html">the dilated causal convolution</a>.</p> <p>With these experiences in mind, I wondered: <em>Can I build an iPhone Shortcut that lets me listen to EPUBs properly?</em> That question eventually led to <a href="https://chizkidd.github.io/inkcast/"><strong>Inkcast</strong></a>. The goal was not to build a Speechify competitor. I simply wanted to solve a personal problem. My aim was to create a low-effort, frictionless tool for personal use, so I made a GitHub repository and started building. Within a few days, I had a working website that could take EPUBs and PDFs and let users listen to the content organized by chapters in a sidebar with one-tap navigation. It included basic controls such as play/pause, rewind (15 seconds), forward (30 seconds), playback speed control (0.75× to 2×), and voice selection.</p> <p>It worked well on desktop, so I used the URL to create a Shortcut on my iPhone. On mobile it also worked, but there was one problem: the reader voices sounded robotic and monotone, which is not ideal for long-form listening. The irony was that only weeks earlier I had been evaluating how machines handle speech. So I went back to the drawing board to figure out how to get more natural-sounding reader voices on Inkcast.</p> <p>While researching audiobook-quality text-to-speech, I came across several APIs (OpenAI, ElevenLabs, Google Cloud). None were free, and for a personal project I wanted something that required no subscriptions or API keys. Most resources suggested that human-quality narration requires a dedicated TTS service. Eventually I discovered that the Web Speech API can access premium voices already installed on the device. These voices are free, require no API keys, and remain available offline. 
They are not state of the art, but they are surprisingly good. Many people do not realize that higher-quality Siri voices can be downloaded. The voice quality improved, but the project also started evolving in another direction.</p> <p>I have always wanted to work through Paul Graham’s essays properly. There are 229 of them, and they read almost like long-form podcasts. But they live on webpages, which raised another question: Why limit the input to EPUBs and PDFs? So I added URL support. I pasted Paul Graham’s archive page into Inkcast, and it automatically pulled all 200+ essays into the sidebar. That was the moment I realized the idea actually worked.</p> <p>The entire project lives in a single HTML file. There are no accounts, no installations, and files never leave the user’s device. Because the app has no server dependencies, it ended up functioning as a privacy-preserving tool by default. In a way, I started the year studying whether machines listen well. Along the way, I realized that humans do not have many good free tools for listening either. Speechify costs $139 per year and Audible requires a subscription, so I built something that worked for me.</p> <p>If you find it useful, please <a href="https://chizkidd.github.io/inkcast/">try it</a>, <a href="https://github.com/chizkidd/inkcast">star it</a>, or <a href="https://buymeacoffee.com/cobasi">buy me a coffee</a> if it saves you a Speechify subscription.</p> Mon, 16 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/16/inkcast/ Sutton & Barto, Ch. 
12: Eligibility Traces (Personal Notes) <ul> <li>Eligibility traces are one of the basic mechanisms of RL that unify and generalize TD and Monte Carlo (MC) methods.</li> <li>TD methods augmented with eligibility traces produce a family of methods spanning a range from MC methods at one end ($\lambda = 1$) to one-step TD (TD(0)) methods at the other end ($\lambda = 0$).</li> <li>With eligibility traces, MC methods can be implemented online and on continuing problems.</li> <li>$n$-step methods also unify TD and MC methods but are not as elegant algorithmically as eligibility traces (ET).</li> <li>The eligibility-trace mechanism works as follows: <ul> <li>First, we have a short-term memory vector, the <strong>eligibility trace</strong> $\mathbf{z}_t \in \mathbb{R}^d$, that parallels the long-term weight vector $\mathbf{w}_t \in \mathbb{R}^d$.</li> <li>Then, when a component of $\mathbf{w}_t$ participates in producing an estimated value, the corresponding component of $\mathbf{z}_t$ is bumped up and then begins to fade away.</li> <li>Learning occurs in that component of $\mathbf{w}_t$ if a non-zero TD error occurs before the trace fades back to zero.</li> <li>The trace-decay parameter $\lambda \in [0,1]$ determines the rate at which the trace falls.</li> </ul> </li> <li>Advantages of ET over $n$-step methods: <ul> <li>Requires only a single trace vector $\mathbf{z}_t$ rather than storing the last $n$ feature vectors.</li> <li>Learning occurs continually and uniformly in time rather than being delayed and playing “catch up” at episode end. 
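The bump-and-fade trace dynamics described above can be sketched in a few lines. This is a toy illustration, not the book's pseudocode: the values of $\gamma$, $\lambda$, and the one-hot features are made up, and for a linear $\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s)$ the gradient with respect to $\mathbf{w}$ is simply $\mathbf{x}(s)$.

```python
# Toy sketch of the bump-and-fade eligibility-trace dynamics, for a linear
# value function v_hat(s, w) = dot(w, x(s)) whose gradient w.r.t. w is x(s).
# gamma, lam, and the one-hot features are made-up illustrative values.
gamma, lam = 0.9, 0.8

def trace_step(z, x):
    """Fade the whole trace by gamma*lam, then bump it by the gradient x(s)."""
    return [gamma * lam * z_i + x_i for z_i, x_i in zip(z, x)]

z = [0.0, 0.0, 0.0]
# Each step, a different component "participates" in producing the value.
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    z = trace_step(z, x)

# The component bumped first has now faded twice: (gamma * lam) ** 2
print(z)
```

Each component decays geometrically at rate $\gamma\lambda$ from the moment it last participated, which is exactly the "recency" used for credit assignment later in the chapter.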
Learning therefore affects behavior immediately rather than after a delay.</li> <li>ET is a <strong>backward view</strong> algorithm, unlike $n$-step methods, which are <strong>forward view</strong> algorithms; the backward view is simpler to implement.</li> </ul> </li> <li><strong>Forward view</strong> algorithms are based on looking forward from the updated state, and each update depends on all the future rewards.</li> <li><strong>Backward view</strong> algorithms use the current TD error, looking backward to recently visited states, to achieve nearly the same updates as the forward view.</li> <li>We start with state values and prediction, then extend to action values and control, first on-policy and then off-policy. Our focus is linear function approximation (which covers the tabular and state-aggregation cases).</li> </ul> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ul> <li><a href="#121-the--return">12.1 The $\lambda$-return</a></li> <li><a href="#122-td">12.2 TD($\lambda$)</a></li> <li><a href="#123--step-truncated--return-methods">12.3 $n$-step Truncated $\lambda$-return Methods</a></li> <li><a href="#124-redoing-updates-online--return-algorithm">12.4 Redoing Updates: Online $\lambda$-return Algorithm</a></li> <li><a href="#125-true-online-td">12.5 True Online TD($\lambda$)</a></li> <li><a href="#126-dutch-traces-in-monte-carlo-learning">12.6 Dutch Traces in Monte Carlo Learning</a></li> <li><a href="#127-sarsa">12.7 Sarsa($\lambda$)</a></li> <li><a href="#128-variable--and">12.8 Variable $\lambda$ and $\gamma$</a></li> <li><a href="#129-off-policy-traces-with-control-variates">12.9 Off-Policy Traces with Control Variates</a></li> <li><a href="#1210-watkinss-q-to-tree-backup">12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$)</a></li> <li><a href="#1211-stable-off-policy-methods-with-traces">12.11 Stable Off-Policy Methods with Traces</a></li> <li><a href="#1212-implementation-issues">12.12 Implementation 
Issues</a></li> <li><a href="#1213-conclusions">12.13 Conclusions</a></li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li><a href="#citation">Citation</a></li> </ul> <hr /> <h2 id="121-the-lambda-return-">12.1 The $\lambda$-return <a name="121-the--return"></a></h2> <ul> <li>Recall in Chapter 7 we defined an $n$-step return as the sum of the first $n$ rewards plus the estimated value of the state reached in $n$ steps, each appropriately discounted:</li> </ul> \[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad\quad 0 \leq t \leq T - n\] <ul> <li>A valid update can be done not just towards any $n$-step return, but also towards any average of $n$-step returns. <ul> <li>E.g. average the 2-step and 4-step return: $\frac{1}{2} G_{t:t+2} + \frac{1}{2} G_{t:t+4}$</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-1-the-2-and-4-step-returns.png" alt="compound update of 2-step and 4-step" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for Compound Update</strong> </div> <div class="callout__body"> <p>The compound update mixing half of a two-step return and half of a four-step return.</p> </div> </div> <ul> <li>Any set of $n$-step returns can be averaged, even an infinite set, as long as the weights on the component returns are positive and sum to $1$.</li> <li>What if instead of using one $n$-step return, we use a weighted average of all $n$-step returns? 
Such averaging produces a substantial new range of algorithms. E.g., <ol> <li>Averaging one-step and infinite-step returns to interrelate TD and MC methods.</li> <li>Averaging experience-based updates with Dynamic Programming (DP) updates to obtain a single combination of experience-based and model-based methods.</li> </ol> </li> <li>An update that averages simpler component updates is called a <strong>compound update</strong>; the <strong>$\lambda$-return</strong> is one particular compound return.</li> <li>The TD($\lambda$) algorithm is one way of averaging $n$-step updates, each weighted proportionally to $\lambda^{n-1}$ (where $\lambda \in [0,1]$) and normalized by a factor of $(1-\lambda)$ to ensure that the weights sum to $1$.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-1-td-lambda.png" alt="TD(lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for TD($\lambda$)</strong> </div> <div class="callout__body"> <p>If $\lambda = 0$, then the overall update reduces to its first component, the <strong>TD(0)</strong> update, whereas if $\lambda = 1$, then the overall update reduces to its last component, the <strong>MC</strong> update.</p> </div> </div> <ul> <li>Essentially, the $\lambda$-return $G_t^\lambda$ combines all $n$-step returns $G_{t:t+n}$, weighting the $n$-step return by $(1-\lambda)\lambda^{n-1}$, and is defined in its state-based form by:</li> </ul> \[\boxed{G_t^\lambda \doteq (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}}\] <h3 id="tdlambda-weighting">TD($\lambda$) Weighting</h3> <ul> <li>The TD($\lambda$) weighting function diagram illustrates the weighting on the sequence of $n$-step returns in the $\lambda$-return: <ul> <li>$1$-step return gets the largest weight, $1-\lambda$</li> <li>$2$-step return gets the next (2nd) largest weight, $(1-\lambda)\lambda$</li> <li>$3$-step return gets the 3rd largest weight, $(1-\lambda)\lambda^2$</li> <li>$n$-step return gets the $n$-th largest 
(smallest) weight, $(1-\lambda)\lambda^{n-1}$</li> <li>The weight fades by $\lambda$ with each additional step.</li> <li>After a terminal state has been reached, all subsequent $n$-step returns are equal to the conventional return $G_t$.</li> <li>So essentially, based on the TD($\lambda$) weighting function diagram, we can decompose $G_t^\lambda$ into a pre-termination sum and a post-termination term:</li> </ul> \[G_t^\lambda = \underbrace{(1-\lambda) \sum\nolimits_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n}}_{\text{pre-termination}} + \underbrace{\lambda^{T-t-1} G_t}_{\text{post-termination}}\] <ul> <li>So now we can see the impact of $\lambda$ more clearly:</li> </ul> \[\begin{aligned} \text{if } \lambda = 1: \quad &amp; G_t^\lambda = G_t \quad \text{(MC)} \\[6pt] \text{if } \lambda = 0: \quad &amp; G_t^\lambda = G_{t:t+1} \quad \text{(TD(0); the weight } (1-\lambda)\lambda^{n-1} \text{ is } 1 \text{ for } n = 1 \text{ and } 0 \text{ for } n &gt; 1\text{)} \end{aligned}\] </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-1-td-lambda-weighting-function.png" alt="TD(lambda) weighting" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>TD($\lambda$) Weighting</strong> </div> <div class="callout__body"> <p>Weighting given in the $\lambda$-return to each of the $n$-step returns.</p> </div> </div> <ul> <li>Our first learning algorithm based on the $\lambda$-return is the <strong>off-line $\lambda$-return algorithm</strong>, which waits until the end of an episode to make updates. 
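The finite-episode decomposition of the $\lambda$-return above can be sanity-checked numerically. A minimal sketch with made-up $n$-step returns (the helper name `lambda_return` and all values are illustrative, not from the book): with $\lambda = 0$ it recovers the one-step return, and with $\lambda = 1$ the full return.

```python
# Numerical check of the finite-episode form of the lambda-return.
# n_step_returns[n-1] holds a hypothetical G_{t:t+n}; the last entry is the
# full return G_t (the episode terminates T - t = len(n_step_returns) steps on).
def lambda_return(n_step_returns, lam):
    T_minus_t = len(n_step_returns)
    pre = (1 - lam) * sum(lam ** (n - 1) * n_step_returns[n - 1]
                          for n in range(1, T_minus_t))
    post = lam ** (T_minus_t - 1) * n_step_returns[-1]
    return pre + post

G = [2.0, 1.5, 1.0, 3.0]          # made-up n-step returns; G[-1] is G_t
print(lambda_return(G, 0.0))      # the one-step return G_{t:t+1}, i.e. TD(0)
print(lambda_return(G, 1.0))      # the full return G_t, i.e. MC
```

Feeding in all-ones returns for any $\lambda$ returns exactly $1.0$, confirming the weights sum to one.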
Its semi-gradient update toward the $\lambda$-return target, for $t = 0, 1, 2, \ldots, T-1$, is:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G_t^\lambda - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)}\] <ul> <li>The $\lambda$-return allows us to move smoothly between MC and TD(0) methods, just as $n$-step returns do.</li> </ul> <h3 id="forward-view">Forward View</h3> <ul> <li>This approach is called the theoretical, forward view of a learning algorithm: <ul> <li>Update the value function towards the $\lambda$-return.</li> <li>Look forward in time to all the future rewards to compute $G_t^\lambda$.</li> <li>Like MC, it can only be computed once the complete return is known.</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-1-forward-view-td-lambda.png" alt="Forward view" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Forward View</strong> </div> <div class="callout__body"> <p>We decide how to update each state by looking forward to future rewards and states.</p> </div> </div> <hr /> <h2 id="122-tdlambda-">12.2 TD($\lambda$) <a name="122-td"></a></h2> <ul> <li>TD($\lambda$) was the first algorithm that showed a formal relationship between a forward view and a backward view using eligibility traces.</li> <li>TD($\lambda$) improves over the off-line $\lambda$-return algorithm in three ways: <ul> <li>It updates the weight vector on every step of an episode, not just at the end.</li> <li>Its computations are distributed equally in time rather than concentrated at the episode’s end.</li> <li>It can be applied to continuing problems, not just episodic ones.</li> </ul> </li> <li>Let’s focus on the <strong>semi-gradient version of TD($\lambda$)</strong> with function approximation: <ul> <li>The <strong>eligibility trace $\mathbf{z}_t$</strong> has the same number of components as $\mathbf{w}_t$.</li> <li>$\mathbf{z}$ is initialized to $\mathbf{0}$, incremented on each time step by the value gradient, and 
then fades away by $\gamma\lambda$:</li> </ul> \[\begin{align*} \mathbf{z}_{-1} &amp;\doteq \mathbf{0} \\ \mathbf{z}_t &amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t), \quad 0 \leq t \leq T \end{align*}\] \[\text{where } \lambda \equiv \text{trace decay parameter and } \gamma \equiv \text{discount rate}\] </li> <li>The eligibility trace keeps track of which components of $\mathbf{w}_t$ have contributed, positively or negatively, to recent state valuations.</li> <li>This is the <strong>recency heuristic</strong> used for <strong>credit assignment,</strong> where more credit is assigned to the most recent states. <strong>Recent</strong> is measured in terms of $\gamma\lambda$.</li> <li> <p>The TD error for state-value prediction is:</p> \[\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\] <p>and the weight vector update in TD($\lambda$) is proportional to the scalar TD error and the vector eligibility trace:</p> </li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t}\] <h3 id="backward-view">Backward View</h3> <ul> <li>The forward view provides the theory; the backward view provides the (practical) mechanism: we update online, at every step, from incomplete sequences.</li> <li>Keep an eligibility trace $z_t(s)$ for every state $s$.</li> <li>Update the value $V(s)$ of every state $s$ in proportion to the TD error $\delta_t$ and the eligibility trace $z_t(s)$:</li> </ul> \[\begin{aligned} \delta_t &amp;= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\ V(s) &amp;\leftarrow V(s) + \alpha\, \delta_t\, z_t(s) \end{aligned}\] <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-2-backward-view-td-lambda.png" alt="Backward TD(lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backward View of TD($\lambda$)</strong> </div> <div class="callout__body"> <p>In the backward or mechanistic view of TD($\lambda$), each update depends on the current TD error 
combined with the current eligibility traces of past events.</p> </div> </div> <ul> <li>Let’s look at the effect of $\lambda$ to understand the backward view of TD($\lambda$):</li> </ul> \[\begin{align*} \text{if } \lambda = 0: \quad &amp; \mathbf{z}_t = \nabla \hat{v}(S_t, \mathbf{w}_t) \\ &amp; \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{TD(0)} \\[6pt] \text{if } 0 &lt; \lambda &lt; 1: \quad &amp; \text{earlier states are given less credit for the TD error} \\[6pt] \text{if } \lambda = 1: \quad &amp; \mathbf{z}_t = \gamma \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{credit for earlier states falls by } \gamma \text{ per step} \\[6pt] \text{if } \lambda = 1,\ \gamma = 1: \quad &amp; \mathbf{z}_t = \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{MC-like behavior (no time decay for ET)} \end{align*}\] <ul> <li> <p>In summary, $\lambda = 1$ yields TD(1).</p> </li> <li>TD(1) implements MC algorithms in a more general way and with wider applicability: <ul> <li>Not limited to episodic tasks; can be applied to discounted continuing tasks.</li> <li>Can be performed <strong>incrementally and online.</strong></li> <li>Learns <strong>immediately</strong> and, for control methods, alters behavior during an episode if something good or bad happens.</li> </ul> </li> <li>Linear TD($\lambda$) converges in the on-policy case if the step-size parameter $\alpha$ is reduced over time according to the stochastic approximation conditions.</li> <li>The convergence of linear TD($\lambda$) is not to the minimum-error weight vector but to a nearby weight vector that depends on $\lambda$.</li> <li>The bound on solution quality, generalized to any $\lambda$ for the continuing, discounted case, is:</li> </ul> \[\overline{\text{VE}}(\mathbf{w}_\infty) \leq \frac{1 - \gamma\lambda}{1 - \gamma} \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w})\] <ul> <li>That is, the asymptotic error is no more than $\dfrac{1-\gamma\lambda}{1-\gamma}$ times the smallest possible error for TD($\lambda$):</li> </ul> \[\begin{align*} \text{as } \lambda \to 1: \quad &amp; \overline{\text{VE}}(\mathbf{w}_\infty) \to \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w}) \\[6pt] \text{as } \lambda \to 0: \quad &amp; \overline{\text{VE}}(\mathbf{w}_\infty) \to \frac{1}{1-\gamma} \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w}) \quad \text{(the TD(0) bound on } \overline{\text{VE}}(\mathbf{w}_\text{TD})\text{)} \end{align*}\] <ul> <li>However, in practice $\lambda = 1$ is often the poorest choice.</li> </ul> <hr /> <h2 id="123-n-step-truncated-lambda-return-methods-">12.3 $n$-step Truncated $\lambda$-return Methods <a name="123--step-truncated--return-methods"></a></h2> <ul> <li>The off-line $\lambda$-return algorithm is of limited use because the $\lambda$-return is not known until the episode ends; in the continuing case it is never known, since it depends on $n$-step returns for arbitrarily large $n$.</li> <li>Hence, a natural approximation is to truncate the sequence after a <strong>fixed</strong> number of steps. 
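As a quick numeric check of this truncation idea (all values are toy choices of mine, not from the book): the first $h-t-1$ returns keep their weights $(1-\lambda)\lambda^{n-1}$, the residual weight $\lambda^{h-t-1}$ goes to the longest return available at the horizon, and the weights still sum to one.

```python
# Weights placed on G_{t:t+1}, ..., G_{t:t+steps} by a lambda-return truncated
# at horizon h, where steps = h - t (lam and steps here are illustrative).
def truncated_weights(lam, steps):
    w = [(1 - lam) * lam ** (n - 1) for n in range(1, steps)]
    w.append(lam ** (steps - 1))   # residual weight on the longest return G_{t:h}
    return w

w = truncated_weights(0.9, 5)
print(w)         # approx [0.1, 0.09, 0.081, 0.0729, 0.6561]
print(sum(w))    # 1.0 up to floating point
```

Note how, for $\lambda$ close to 1, most of the weight piles onto the final, longest return, which is why larger truncation horizons are needed as $\lambda$ grows.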
This handles the continuing case.</li> <li>The truncated $\lambda$-return for time $t$, given data only up to some later horizon $h$, is:</li> </ul> \[G_{t:h}^\lambda \doteq (1-\lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{h-t-1} G_{t:h}, \quad 0 \leq t &lt; h \leq T\] \[\begin{aligned} \text{where } h &amp;\equiv \text{horizon (plays same role as time of termination } T\text{)} \end{aligned}\] <ul> <li>Here the <strong>residual weighting</strong> is given to the longest available $n$-step return $G_{t:h}$.</li> <li>The truncated $\lambda$-return gives rise to a family of $n$-step $\lambda$-return algorithms, known in the state-value case as <strong>Truncated TD($\lambda$)</strong> or <strong>TTD($\lambda$)</strong>.</li> <li>TTD($\lambda$) is defined for $0 \leq t &lt; T$ by:</li> </ul> \[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \!\left[G_{t:t+n}^\lambda - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1})}\] <ul> <li>Efficient implementation of TTD($\lambda$) relies on the $k$-step $\lambda$-return:</li> </ul> \[\boxed{G_{t:t+k}^\lambda = \hat{v}(S_t, \mathbf{w}_{t-1}) + \sum_{i=t}^{t+k-1} (\gamma\lambda)^{i-t} \delta_i'}\] \[\begin{aligned} \text{where } \delta_i' &amp;\doteq R_{i+1} + \gamma \hat{v}(S_{i+1}, \mathbf{w}_i) - \hat{v}(S_i, \mathbf{w}_{i-1}) \end{aligned}\] <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-3-truncated-td-lambda.png" alt="TTD(lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for Truncated TD($\lambda$)</strong> </div> <div class="callout__body"> <p>The truncated $\lambda$-return gives rise to a family of $n$-step $\lambda$-return algorithms called <strong>TTD($\lambda$)</strong>.</p> </div> </div> <hr /> <h2 id="124-redoing-updates-online-lambda-return-algorithm-">12.4 Redoing Updates: Online $\lambda$-return Algorithm <a name="124-redoing-updates-online--return-algorithm"></a></h2> <ul> <li>How do we 
choose the truncation parameter $n$ in TTD($\lambda$)?</li> <li>It involves a tradeoff: <ul> <li>$n$ should be large so that TTD($\lambda$) closely approximates the off-line $\lambda$-return, but</li> <li>$n$ should also be small so that the updates can be made sooner and can influence behavior sooner.</li> </ul> </li> <li>In principle, we can achieve both via the <strong>online $\lambda$-return algorithm</strong>, at the cost of computational complexity.</li> <li>Essentially, at each time step we go back and redo all the updates since the beginning of the episode as we gather each new increment of data: <ul> <li>The new updates are better than the old ones because they now account for the time step’s new data.</li> <li>This conceptual algorithm involves multiple passes over the episode, one at each horizon, each generating a different sequence of weight vectors.</li> </ul> </li> <li>Let’s distinguish the weight vectors computed at the different horizons by writing out the first 3 sequences:</li> </ul> \[\begin{align*} h=1: \quad &amp; \mathbf{w}_1^1 \doteq \mathbf{w}_0^1 + \alpha \!\left[G_{0:1}^\lambda - \hat{v}(S_0, \mathbf{w}_0^1)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^1) \\[6pt] h=2: \quad &amp; \mathbf{w}_1^2 \doteq \mathbf{w}_0^2 + \alpha \!\left[G_{0:2}^\lambda - \hat{v}(S_0, \mathbf{w}_0^2)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^2) \\ &amp; \mathbf{w}_2^2 \doteq \mathbf{w}_1^2 + \alpha \!\left[G_{1:2}^\lambda - \hat{v}(S_1, \mathbf{w}_1^2)\right] \nabla \hat{v}(S_1, \mathbf{w}_1^2) \\[6pt] h=3: \quad &amp; \mathbf{w}_1^3 \doteq \mathbf{w}_0^3 + \alpha \!\left[G_{0:3}^\lambda - \hat{v}(S_0, \mathbf{w}_0^3)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^3) \\ &amp; \mathbf{w}_2^3 \doteq \mathbf{w}_1^3 + \alpha \!\left[G_{1:3}^\lambda - \hat{v}(S_1, \mathbf{w}_1^3)\right] \nabla \hat{v}(S_1, \mathbf{w}_1^3) \\ &amp; \mathbf{w}_3^3 \doteq \mathbf{w}_2^3 + \alpha \!\left[G_{2:3}^\lambda - \hat{v}(S_2, \mathbf{w}_2^3)\right] \nabla \hat{v}(S_2, \mathbf{w}_2^3) \end{align*}\] \[\begin{aligned} \text{where} \\ \mathbf{w}_t^h &amp;\equiv \text{weights used to generate the value at time } t \text{ in the sequence up to horizon } h \\ \mathbf{w}_0^h &amp;\equiv \text{1st weight vector in each sequence, inherited from the previous episode} \\ \mathbf{w}_h^h &amp;\equiv \text{last weight vector in each sequence; } \mathbf{w}_t^t \text{ forms the algorithm’s ultimate weight-vector sequence} \end{aligned}\] <ul> <li>The general form of the <strong>online $\lambda$-return update</strong> for $0 \leq t &lt; h \leq T$ is:</li> </ul> \[\boxed{\mathbf{w}_{t+1}^h \doteq \mathbf{w}_t^h + \alpha \!\left[G_{t:h}^\lambda - \hat{v}(S_t, \mathbf{w}_t^h)\right] \nabla \hat{v}(S_t, \mathbf{w}_t^h)}\] \[\mathbf{w}_t \doteq \mathbf{w}_t^t\] <hr /> <h2 id="125-true-online-tdlambda-">12.5 True Online TD($\lambda$) <a name="125-true-online-td"></a></h2> <ul> <li>The online $\lambda$-return algorithm of Section 12.4 is the ideal, but it is computationally very complex.</li> <li>Using eligibility traces, we can invert this forward-view algorithm into an efficient backward-view algorithm that, in the linear case, produces exactly the same sequence of weight vectors. 
This is called the <strong>True Online TD($\lambda$)</strong>.</li> <li>The trick is that, of the triangular array of weight vectors produced by the online $\lambda$-return algorithm, we only ever need the diagonal (the sequence $\mathbf{w}_t^t$), and it can be computed directly and efficiently.</li> </ul> \[\begin{aligned} \begin{bmatrix} \mathbf{w}_0^0 &amp; &amp; &amp; &amp; \\ \mathbf{w}_0^1 &amp; \mathbf{w}_1^1 &amp; &amp; &amp; \\ \mathbf{w}_0^2 &amp; \mathbf{w}_1^2 &amp; \mathbf{w}_2^2 &amp; &amp; \\ \mathbf{w}_0^3 &amp; \mathbf{w}_1^3 &amp; \mathbf{w}_2^3 &amp; \mathbf{w}_3^3 &amp; \\ \vdots &amp; \vdots &amp; \vdots &amp; \vdots &amp; \ddots \\ \mathbf{w}_0^T &amp; \mathbf{w}_1^T &amp; \mathbf{w}_2^T &amp; \mathbf{w}_3^T &amp; \cdots &amp; \mathbf{w}_T^T \end{bmatrix} &amp;\longrightarrow \begin{bmatrix} \mathbf{w}_0^0 \\ &amp; \mathbf{w}_1^1 \\ &amp; &amp; \mathbf{w}_2^2 \\ &amp; &amp; &amp; \mathbf{w}_3^3 \\ &amp; &amp; &amp; &amp; \ddots \\ &amp; &amp; &amp; &amp; &amp; \mathbf{w}_T^T \end{bmatrix} \\ \end{aligned}\] \[\text{Online } \lambda\text{-return} \hspace{8em} \text{True Online TD}(\lambda)\] <ul> <li>For the linear case in which $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$, the true online TD($\lambda$) algorithm is:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t + \alpha \!\left(\mathbf{w}_t^T \mathbf{x}_t - \mathbf{w}_{t-1}^T \mathbf{x}_t\right)\!\left(\mathbf{z}_t - \mathbf{x}_t\right)}\] \[\begin{aligned} \text{where} \\ \mathbf{w}_t &amp;\doteq \mathbf{w}_t^t \\ \mathbf{x}_t &amp;\doteq \mathbf{x}(S_t) \\ \mathbf{z}_t &amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + (1 - \alpha\gamma\lambda\, \mathbf{z}_{t-1}^T \mathbf{x}_t)\, \mathbf{x}_t \end{aligned}\] <ul> <li>The per-step computational complexity of true online TD($\lambda$) is the same as TD($\lambda$), $O(d)$.</li> <li>$\mathbf{z}_t$ used in true online TD($\lambda$) is called a <strong>dutch trace</strong>, unlike that of TD($\lambda$) which is 
called an <strong>accumulating trace</strong>.</li> <li>Earlier work used a third kind of trace called the <strong>replacing trace</strong>, defined only for the tabular case or for binary feature vectors (tile coding). It is defined componentwise by:</li> </ul> \[\tilde{z}_{i,t} \doteq \left\{ \begin{array}{ll} 1 &amp; \text{if } x_{i,t} = 1 \\ \gamma\lambda\, \tilde{z}_{i,t-1} &amp; \text{otherwise} \end{array} \right\}\] <ul> <li>Nowadays, dutch traces usually perform better than replacing traces and have a clearer theoretical basis.</li> <li>Accumulating traces remain of interest for nonlinear function approximations where dutch traces are unavailable.</li> </ul> <hr /> <h2 id="126-dutch-traces-in-monte-carlo-learning">12.6 Dutch Traces in Monte Carlo Learning</h2> <ul> <li>Despite their close historical association, eligibility traces have nothing specifically to do with TD learning.</li> <li>Eligibility traces arise even in Monte Carlo learning.</li> <li>Using dutch traces, we can invert the forward-view MC algorithm into an equivalent, yet computationally cheaper, backward-view algorithm.</li> <li> <p>This is the only equivalence of forward and backward views that is explicitly demonstrated in the book.</p> </li> <li>The linear, gradient MC prediction algorithm makes the following sequence of updates, one for each time step of the episode:</li> </ul> \[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G - \mathbf{w}_t^T \mathbf{x}_t\right] \mathbf{x}_t, \quad 0 \leq t &lt; T\] <ul> <li>For simplicity, assume that the return $G$ is a single reward received at the end of the episode (hence no time subscript) and that there is no discounting.</li> <li>This is known as the <strong>Least Mean Square (LMS)</strong> rule.</li> <li> <p>We introduce an additional vector memory, the <strong>eligibility trace,</strong> that keeps a summary of all the feature vectors seen so far. 
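The equivalence this section goes on to derive can be verified numerically. A minimal sketch, in which $\alpha$, $G$, $\mathbf{w}_0$, and the feature vectors are all made-up toy data: the forward LMS sweep, which needs $G$ at every step, produces the same final weights as the incremental $\mathbf{a}/\mathbf{z}$ recursion, which touches $G$ only once at the end.

```python
# Toy check: forward-view LMS updates vs. the incremental backward view
# w_T = a + alpha * G * z. All numbers are made up; the episode has T = 3 steps.
alpha, G = 0.1, 2.0
w0 = [0.5, -0.3]
xs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # feature vectors x_0, x_1, x_2

def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

# Forward view: one LMS update per step, each needing the final return G.
w = list(w0)
for x in xs:
    err = G - dot(w, x)
    w = [w_i + alpha * err * x_i for w_i, x_i in zip(w, x)]

# Backward view: maintain a (fading weights) and the dutch trace z with O(d)
# work per step, using G only once when the episode ends.
a, z = list(w0), None
for x in xs:
    ax = dot(x, a)
    a = [a_i - alpha * ax * x_i for a_i, x_i in zip(a, x)]   # a <- (I - alpha x x^T) a
    if z is None:
        z = list(x)                                          # z_0 = x_0
    else:
        zx = dot(z, x)
        z = [z_i + (1 - alpha * zx) * x_i for z_i, x_i in zip(z, x)]

w_backward = [a_i + alpha * G * z_i for a_i, z_i in zip(a, z)]
print(w)           # forward-view weights after the episode
print(w_backward)  # identical up to floating-point error
```

The backward pass never stores past feature vectors and never looks ahead, which is exactly the point of the dutch-trace construction.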
The overall update will be the same as the MC updates’ sequence shown above and is:</p> \[\begin{align*} \mathbf{w}_T &amp;= \mathbf{w}_{T-1} + \alpha \!\left(G - \mathbf{w}_{T-1}^T \mathbf{x}_{T-1}\right) \mathbf{x}_{T-1} \\ &amp;= \mathbf{w}_{T-1} + \alpha \mathbf{x}_{T-1}\!\left(-\mathbf{x}_{T-1}^T \mathbf{w}_{T-1}\right) + \alpha G \mathbf{x}_{T-1} \\ &amp;= \!\left(\mathbf{I} - \alpha \mathbf{x}_{T-1} \mathbf{x}_{T-1}^T\right) \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1} \\ &amp;= \mathbf{F}_{T-1}\, \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1} \end{align*}\] \[\begin{aligned} \text{where} \\ \mathbf{F}_t &amp;\doteq \mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T \equiv \text{a forgetting or fading matrix} \end{aligned}\] \[\therefore\quad \mathbf{w}_{T-1} = \mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G \mathbf{x}_{T-2}\] <p>Now recursing:</p> \[\begin{align*} \mathbf{w}_T &amp;= \mathbf{F}_{T-1}\, \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1} \\ &amp;= \mathbf{F}_{T-1}\!\left(\mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G \mathbf{x}_{T-2}\right) + \alpha G \mathbf{x}_{T-1} \\ &amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G\!\left(\mathbf{F}_{T-1} \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\ &amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2}\!\left(\mathbf{F}_{T-3}\, \mathbf{w}_{T-3} + \alpha G\, \mathbf{x}_{T-3}\right) + \alpha G\!\left(\mathbf{F}_{T-1}\, \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\ &amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2} \mathbf{F}_{T-3}\, \mathbf{w}_{T-3} + \alpha G\!\left(\mathbf{F}_{T-1} \mathbf{F}_{T-2}\, \mathbf{x}_{T-3} + \mathbf{F}_{T-1}\, \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\ &amp;\quad \vdots \\ &amp;= \underbrace{\mathbf{F}_{T-1} \mathbf{F}_{T-2} \cdots \mathbf{F}_0\, \mathbf{w}_0}_{\mathbf{a}_{T-1}} + \alpha G \underbrace{\sum\nolimits_{k=0}^{T-1} \mathbf{F}_{T-1} \mathbf{F}_{T-2} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k}_{\mathbf{z}_{T-1}} \\ &amp;= \mathbf{a}_{T-1} + \alpha G\, \mathbf{z}_{T-1} 
\end{align*}\] \[\begin{aligned} \text{where} \\ \mathbf{a}_{T-1}\ \&amp;\ \mathbf{z}_{T-1} &amp;\equiv \text{values at time } T-1 \text{ of 2 auxiliary memory vectors that can be updated} \\ &amp;\phantom{{}\equiv{}} \text{incrementally w/o knowledge of } G \text{ and with } O(d) \text{ complexity per time step} \\ \mathbf{z}_t &amp;\equiv \text{dutch-style eligibility trace, initialized to } \mathbf{z}_0 = \mathbf{x}_0 \end{aligned}\] </li> <li> <p>The $\mathbf{z}_t$ vector is in fact a dutch-style eligibility trace, initialized to $\mathbf{z}_0 = \mathbf{x}_0$, that can be updated according to:</p> \[\begin{align*} \mathbf{z}_t &amp;= \sum\nolimits_{k=0}^{t} \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k, \quad 1 \leq t &lt; T \\ &amp;= \sum\nolimits_{k=0}^{t-1} \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k + \mathbf{x}_t \\ &amp;= \mathbf{F}_t \sum\nolimits_{k=0}^{t-1} \mathbf{F}_{t-1} \mathbf{F}_{t-2} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k + \mathbf{x}_t \\ &amp;= \mathbf{F}_t\, \mathbf{z}_{t-1} + \mathbf{x}_t \\ &amp;= \!\left(\mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T\right) \mathbf{z}_{t-1} + \mathbf{x}_t \\ &amp;= \mathbf{z}_{t-1} - \alpha\!\left(\mathbf{z}_{t-1}^T \mathbf{x}_t\right) \mathbf{x}_t + \mathbf{x}_t \\ &amp;\boxed{= \mathbf{z}_{t-1} + \!\left(1 - \alpha\, \mathbf{z}_{t-1}^T \mathbf{x}_t\right) \mathbf{x}_t} \end{align*}\] <p>which is the dutch trace for $\gamma\lambda = 1$.</p> </li> <li> <p>The $\mathbf{a}_t$ auxiliary vector is initialized to $\mathbf{a}_0 = \mathbf{F}_0\, \mathbf{w}_0$ (so that the product form below holds at $t=0$) and then updated according to:</p> \[\begin{align*} \mathbf{a}_t &amp;= \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_0\, \mathbf{w}_0, \quad 1 \leq t &lt; T \\ &amp;= \mathbf{F}_t\, \mathbf{a}_{t-1} \\ &amp;= \!\left(\mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T\right) \mathbf{a}_{t-1} \\ &amp;\boxed{= \mathbf{a}_{t-1} - \alpha\!\left(\mathbf{x}_t^T \mathbf{a}_{t-1}\right) \mathbf{x}_t} \end{align*}\] </li> </ul> <h3
id="takeaways">Takeaways</h3> <ul> <li>The auxiliary vectors, $\mathbf{a}_t$ and $\mathbf{z}_t$, are updated on each time step $t &lt; T$ and then, at time $T$ when $G$ is observed, they are used to compute:</li> </ul> \[\boxed{\mathbf{w}_T = \mathbf{a}_{T-1} + \alpha G\, \mathbf{z}_{T-1}}\] <ul> <li>The time and memory complexity per step is $O(d)$.</li> <li>This is surprising and intriguing because eligibility traces are at work in a non-TD setting (ET arise wherever long-term predictions need to be learned efficiently).</li> </ul> <hr /> <h2 id="127-sarsalambda-">12.7 Sarsa($\lambda$) <a name="127-sarsa"></a></h2> <ul> <li>Now let’s extend eligibility traces to action-value methods.</li> <li> <p>First, let’s recall the action-value form of the <strong>$n$-step</strong> return:</p> \[G_{t:t+n} \doteq R_{t+1} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad t+n &lt; T\] \[\text{with} \quad G_{t:t+n} = G_t \quad \text{ if } t+n \geq T.\] </li> <li>With this, for $t = 0, \ldots, T-1,$ let’s form the action-value form of the <strong>off-line $\lambda$-return</strong> algorithm which uses $\hat{q}$ rather than $\hat{v}$:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G_t^\lambda - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\] \[\begin{aligned} \text{where} \quad G_t^\lambda &amp;\doteq G_{t:\infty}^\lambda \end{aligned}\] <ul> <li>For the forward view shown in the figure below, which is similar to TD($\lambda$), the updates are: <ul> <li><strong>1st update:</strong> one full-step lookahead</li> <li><strong>2nd update:</strong> two-step lookahead</li> <li><strong>Final update:</strong> complete return.</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-7-sarsa-lambda.png" alt="Sarsa(lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for Sarsa($\lambda$)</strong> </div> <div class="callout__body">
<p>The first update looks ahead one full step, to the next state–action pair, the second looks ahead two steps, to the second state–action pair, and so on. A final update is based on the complete return.</p> </div> </div> <ul> <li>The weighting of each $n$-step update in the $\lambda$-return is the same as in TD($\lambda$) and the $\lambda$-return algorithm.</li> <li>The backward-view TD method for action values, Sarsa($\lambda$), has the same update rule as TD($\lambda$):</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t}\] <ul> <li>The action-value form of the TD error is used:</li> </ul> \[\delta_t \doteq R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\] <ul> <li>The action-value form of the eligibility trace is:</li> </ul> \[\begin{align*} \mathbf{z}_{-1} &amp;\doteq \mathbf{0} \\ \mathbf{z}_t &amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), \quad 0 \leq t \leq T \end{align*}\] <ul> <li>There exists an action-value version of our ideal TD method, the online $\lambda$-return algorithm and its efficient implementation as true online TD($\lambda$): <ul> <li><strong>Section <a href="#124-redoing-updates-online--return-algorithm">12.4</a>:</strong> Everything there holds here except for using the action-value form of the $n$-step return, $G_{t:t+n}$.</li> <li><strong>Sections <a href="#125-true-online-td">12.5</a> &amp; <a href="#126-dutch-traces-in-monte-carlo-learning">12.6</a>:</strong> Everything holds here except for using state-action feature vectors $\mathbf{x}_t = \mathbf{x}(S_t, A_t)$ instead of state feature vectors $\mathbf{x}_t = \mathbf{x}(S_t)$.</li> <li>The resulting efficient backward algorithm obtained from using the eligibility trace to invert the action-value form of the forward view, online $\lambda$-return is called <strong>True Online Sarsa($\lambda$)</strong>.</li> </ul> </li> <li>There is also a truncated version of Sarsa($\lambda$), called
<strong>Forward Sarsa($\lambda$)</strong>, which appears to be a model-free control method for use in conjunction with multi-layer ANNs.</li> </ul> <hr /> <h2 id="128-variable-lambda-and-gamma-">12.8 Variable $\lambda$ and $\gamma$ <a name="128-variable--and"></a></h2> <ul> <li>To get the most general forms of the final TD algorithms, it is vital to generalize the degree of bootstrapping and discounting beyond constant parameters to functions dependent on the state and action:</li> </ul> \[\begin{align*} \lambda_t &amp;\doteq \lambda(S_t, A_t), \quad &amp; \lambda &amp;: S \times A \to [0,1] \\ \gamma_t &amp;\doteq \gamma(S_t), \quad &amp; \gamma &amp;: S \to [0,1] \end{align*}\] <ul> <li>$\gamma_t$ is the termination function and is significant because it changes the return $G_t$, which is now more generally defined as:</li> </ul> \[\begin{align*} G_t &amp;\doteq R_{t+1} + \gamma_{t+1} G_{t+1} \\ &amp;= R_{t+1} + \gamma_{t+1} R_{t+2} + \gamma_{t+1} \gamma_{t+2} R_{t+3} + \gamma_{t+1} \gamma_{t+2} \gamma_{t+3} R_{t+4} + \ldots \\ &amp;= \sum_{k=t}^{\infty} \left(\prod_{i=t+1}^{k} \gamma_i\right) R_{k+1} \end{align*}\] \[\begin{aligned} \text{where } \prod_{k=t}^{\infty} \gamma_k &amp;= 0 \text{ with probability 1 for all } t, \text{ to assure the sums are finite} \end{aligned}\] <ul> <li>This general return $G_t$ definition enables episodic settings to become a single stream of experience, without special terminal states, start distributions, or termination times <ul> <li>A terminal state just becomes a state with $\gamma(s) = 0$ that transitions to the start distribution.</li> </ul> </li> <li>Generalization to variable bootstrapping yields a new state-based $\lambda$-return:</li> </ul> \[\boxed{G_t^{\lambda s} \doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{v}(S_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s}\right]}\] <ul> <li>Action-based $\lambda$-return is either the <strong>Sarsa form</strong>:</li> </ul> \[\boxed{G_t^{\lambda a}
\doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda a}\right]}\] <ul> <li>or the <strong>Expected Sarsa form</strong>:</li> </ul> \[\boxed{G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \bar{V}_t(S_{t+1}) + \lambda_{t+1}\, G_{t+1}^{\lambda a}\right]}\] \[\begin{aligned} \text{where } \bar{V}_t(s) \doteq \sum_a \pi(a \vert s)\, \hat{q}(s, a, \mathbf{w}_t) \\ \end{aligned}\] <h3 id="superscripts-notation-for-i-in-g_tlambda-i">Superscript notation for $i$ in $G_t^{\lambda i}$</h3> \[\begin{aligned} \text{"s"} &amp;: \text{bootstraps from state values} \\ \text{"a"} &amp;: \text{bootstraps from action values} \end{aligned}\] <hr /> <h2 id="129-off-policy-traces-with-control-variates">12.9 Off-Policy Traces with Control Variates</h2> <ul> <li>To generalize to off-policy, we need to incorporate importance sampling using eligibility traces.</li> <li>Let’s focus on the bootstrapping generalization of per-decision importance sampling with control variates <strong>(Section 7.4).</strong></li> <li>The new state-based $\lambda$-return in <strong>Section <a href="#128-variable--and">12.8</a></strong> generalizes, following the model of the off-policy, control-variate, $n$-step return (ending at horizon $h$), to:</li> </ul> \[\boxed{G_t^{\lambda s} \doteq \rho_t \!\left(R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{v}(S_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s}\right]\right) + (1 - \rho_t)\, \hat{v}(S_t, \mathbf{w}_t)}\] \[\begin{aligned} \text{where } \rho_t &amp;= \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)} \end{aligned}\] <ul> <li>The final $\lambda$-return can be approximated in terms of sums of the state-based TD error $\delta_t^s$, with the approximation becoming exact if the approximate value function does not change:</li> </ul> \[\begin{align*} G_t^{\lambda s} &amp;\approx \hat{v}(S_t, \mathbf{w}_t) + \rho_t \sum_{k=t}^{\infty} \delta_k^s
\prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\ \delta_t^s &amp;\doteq R_{t+1} + \gamma_{t+1}\, \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \end{align*}\] <ul> <li>The forward view update of the approximate $\lambda$-return is:</li> </ul> \[\begin{align*} \mathbf{w}_{t+1} &amp;= \mathbf{w}_t + \alpha \!\left[G_t^{\lambda s} - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t) \\ &amp;\boxed{\approx \mathbf{w}_t + \alpha \rho_t \!\left(\sum_{k=t}^{\infty} \delta_k^s \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i\right) \nabla \hat{v}(S_t, \mathbf{w}_t)} \end{align*}\] <ul> <li>We’re interested in the equivalence (approximately) between the forward-view update summed over time and a backward-view update summed over time. The equivalence is approximate because we ignore changes in the value function.</li> <li>The sum of the forward-view update over time is:</li> </ul> \[\begin{align*} \sum_{t=0}^{\infty} \!\left(\mathbf{w}_{t+1} - \mathbf{w}_t\right) &amp;\approx \sum_{t=0}^{\infty} \sum_{k=t}^{\infty} \alpha \rho_t\, \delta_k^s \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\ &amp;= \sum_{k=0}^{\infty} \sum_{t=0}^{k} \alpha \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t)\, \delta_k^s \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\ &amp;\quad \left(\text{using the summation rule: } \sum_{t=x}^{y} \sum_{k=t}^{y} = \sum_{k=x}^{y} \sum_{t=x}^{k}\right) \\ &amp;= \sum_{k=0}^{\infty} \alpha\, \delta_k^s \sum_{t=0}^{k} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \end{align*}\] <ul> <li>If the entire expression from the 2nd sum on could be written and updated incrementally as an eligibility trace, then the sum of the forward-view update over time would be in the form of the sum of a backward-view TD update. 
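The incremental trace form that makes this possible (derived next) can be sanity-checked against the explicit sum. Here is a small NumPy sketch with arbitrary synthetic sequences for $\rho$, $\gamma$, $\lambda$, and the (linear-case) gradients, all invented for illustration:

```python
import numpy as np

# Check that the incremental recursion
#   z_k = rho_k * (gamma_k * lambda_k * z_{k-1} + grad_k)
# reproduces the explicit sum
#   z_k = sum_{t<=k} rho_t * grad_t * prod_{i=t+1}^{k} gamma_i lambda_i rho_i.
rng = np.random.default_rng(1)
K, d = 8, 3
grads = rng.normal(size=(K, d))          # stands in for grad v_hat(S_t, w_t)
gam = rng.uniform(0.5, 1.0, size=K)
lam = rng.uniform(0.5, 1.0, size=K)
rho = rng.uniform(0.5, 1.5, size=K)

def z_explicit(k):
    total = np.zeros(d)
    for t in range(k + 1):
        decay = np.prod([gam[i] * lam[i] * rho[i] for i in range(t + 1, k + 1)])
        total += rho[t] * grads[t] * decay
    return total

z = np.zeros(d)                          # z_{-1} = 0
for k in range(K):
    z = rho[k] * (gam[k] * lam[k] * z + grads[k])
    assert np.allclose(z, z_explicit(k))
```

The recursion and the explicit sum agree exactly at every step, which is the algebraic content of the derivation in this section.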
<ul> <li>That is, if this expression was the trace at time $k$, then we could update it from its value at time $k-1$ by:</li> </ul> </li> </ul> \[\begin{align*} \mathbf{z}_k &amp;= \sum_{t=0}^{k} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\ &amp;= \sum_{t=0}^{k-1} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i + \rho_k \nabla \hat{v}(S_k, \mathbf{w}_k) \\ &amp;= \gamma_k \lambda_k \rho_k \underbrace{\sum_{t=0}^{k-1} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k-1} \gamma_i \lambda_i \rho_i}_{\mathbf{z}_{k-1}} + \rho_k \nabla \hat{v}(S_k, \mathbf{w}_k) \end{align*}\] \[\boxed{\mathbf{z}_k = \rho_k \!\left[\gamma_k \lambda_k\, \mathbf{z}_{k-1} + \nabla \hat{v}(S_k, \mathbf{w}_k)\right]}\] <ul> <li>If we change the index from $k$ to $t$ of the $\mathbf{z}_k$ equation above, we get the <strong>general accumulating trace</strong> update for state values:</li> </ul> \[\boxed{\mathbf{z}_t \doteq \rho_t \!\left[\gamma_t \lambda_t\, \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t)\right]}\] <ul> <li>This eligibility trace combined with the usual semi-gradient TD($\lambda$) parameter-update rule <strong>(Section <a href="#122-td">12.2</a>)</strong> forms a <strong>general TD($\lambda$)</strong> algorithm that can be applied to either on-policy or off-policy data: <ul> <li>In on-policy, the algorithm is exactly TD($\lambda$) because $\rho_t = 1$ always and the ET above becomes the usual accumulating trace for variable $\lambda$ and $\gamma$:</li> </ul> \[\mathbf{z}_t \doteq \gamma_t \lambda_t\, \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t)\] <ul> <li>In off-policy, the algorithm stays as it is, although not guaranteed to be stable as a semi-gradient method.</li> <li>For off-policy, we’ll consider extensions that guarantee stability in the next few sections.</li> </ul> </li> <li>Let’s derive the off-policy ET for <strong>action-value</strong> methods and corresponding 
general Sarsa($\lambda$) algorithms. <ul> <li>Starting with either recursive general action-based $\lambda$-return (Sarsa or Expected Sarsa), $G_t^{\lambda a}$, in <strong>Section <a href="#128-variable--and">12.8</a></strong> (Expected Sarsa works out to be simpler), we can extend the Expected Sarsa $G_t^{\lambda a}$ following the model of the action-based, off-policy, control-variate, $n$-step return:</li> </ul> \[\boxed{\begin{align*} G_t^{\lambda a} &amp;\doteq R_{t+1} + \gamma_{t+1}\!\left(\!\left[1 - \lambda_{t+1}\right] \bar{V}_t(S_{t+1}) + \lambda_{t+1}\!\left[\rho_{t+1} G_{t+1}^{\lambda a} + \bar{V}_t(S_{t+1}) - \rho_{t+1}\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right) \\ &amp;= R_{t+1} + \gamma_{t+1}\!\left(\bar{V}_t(S_{t+1}) + \lambda_{t+1} \rho_{t+1} \!\left[G_{t+1}^{\lambda a} - \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right) \end{align*}}\] \[\begin{aligned} \text{where } \bar{V}_t(S_{t+1}) &amp;= \sum_a \pi(a \vert S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) \end{aligned}\] <ul> <li>The $\lambda$-return, approximately as the sum of TD errors, is:</li> </ul> \[\begin{align*} G_t^{\lambda a} &amp;\approx \hat{q}(S_t, A_t, \mathbf{w}_t) + \sum_{k=t}^{\infty} \delta_k^a \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\ \delta_t^a &amp;= R_{t+1} + \gamma_{t+1} \bar{V}_t(S_{t+1}) - \hat{q}(S_t, A_t, \mathbf{w}_t) \end{align*}\] <ul> <li>Using steps analogous to those for the state case earlier in this section, we write a forward-view update based on the action-based $\lambda$-return $G_t^{\lambda a}$ above, transform the sum of the updates using the summation rule, and finally derive the eligibility trace for action values:</li> </ul> \[\boxed{\mathbf{z}_t \doteq \gamma_t \lambda_t \rho_t\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\] <ul> <li>This ET combined with the action-based, expected TD error $\delta_t^a$ and the usual semi-gradient TD($\lambda$) parameter-update rule <strong>(Section
<a href="#122-td">12.2</a>)</strong> forms an elegant, efficient <strong>Expected Sarsa($\lambda$)</strong> algorithm that can be applied to either on-policy or off-policy data: <ul> <li><strong><u>On-policy case</u>:</strong> The algorithm becomes the Sarsa($\lambda$) algorithm given constant $\lambda$ and $\gamma$, and the usual state-action TD error:</li> </ul> \[\begin{aligned} &amp;\quad \rho_t = 1, \quad \lambda_t = \lambda, \quad \gamma_t = \gamma \\ &amp;\boxed{\mathbf{z}_t \doteq \gamma\lambda\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)} \end{aligned}\] </li> <li>At $\lambda = 1$, these algorithms become closely related to corresponding Monte Carlo algorithms.</li> <li>No episode-by-episode equivalence of updates exists, only of their expectations, even under the most favorable conditions. <ul> <li>Methods have been proposed recently <strong>[Sutton, Mahmood, Precup &amp; van Hasselt, 2014]</strong> that do achieve an exact equivalence.</li> <li>These methods require an additional vector of <strong>“provisional weights”</strong> that keep track of executed updates but may need to be retracted/emphasized depending on future actions taken.</li> <li>The state and state-action versions of these methods are called <strong>PTD($\lambda$) and PQ($\lambda$)</strong> respectively, where the ‘P’ stands for Provisional.</li> </ul> </li> <li> <p>If $\lambda &lt; 1$, then all these off-policy algorithms involve bootstrapping and <strong>the deadly triad</strong> applies, meaning that they can be guaranteed stable only for the tabular case, state aggregation, and other limited forms of function approximation.</p> </li> <li>Recall that the challenge of off-policy learning has 2 parts.
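As a runnable illustration of the on-policy special case ($\rho_t = 1$ with constant $\gamma$ and $\lambda$), here is a minimal tabular TD($\lambda$) sketch with accumulating traces; the 5-state chain, step size, and episode count are all invented for illustration:

```python
import numpy as np

# Tabular (one-hot linear) semi-gradient TD(lambda) with accumulating traces:
# the on-policy special case (rho_t = 1) of the general algorithm above.
# Deterministic 5-state chain, reward 1 only on the final transition, so the
# true values are v(s) = gamma ** (4 - s).
n_states, gamma, lam, alpha = 5, 0.9, 0.8, 0.1
w = np.zeros(n_states)            # one-hot features => w[s] == v_hat(s)

for episode in range(1000):
    z = np.zeros(n_states)        # eligibility trace, reset each episode
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0
        v_next = 0.0 if s == n_states - 1 else w[s + 1]
        delta = r + gamma * v_next - w[s]        # TD error
        z = gamma * lam * z                      # decay all traces
        z[s] += 1.0                              # accumulate: grad = one-hot(s)
        w = w + alpha * delta * z                # semi-gradient TD(lambda) step

true_v = gamma ** (4 - np.arange(n_states))
assert np.max(np.abs(w - true_v)) < 1e-3
```

With $\rho_t = 1$ the general trace reduces to the usual accumulating trace, and the learned values converge to the exact discounted returns of the chain.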
Off-policy eligibility traces deal effectively with the 1st part, correcting for the expected value of the targets, but not with the 2nd part, which has to do with the distribution of updates (matching off-policy to on-policy).</li> <li>Algorithmic strategies for handling the 2nd part of the off-policy learning challenge with eligibility traces are summarized in <strong>Section <a href="#1211-stable-off-policy-methods-with-traces">12.11</a>.</strong></li> </ul> <hr /> <h2 id="1210-watkinss-qlambda-to-tree-backuplambda-">12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$) <a name="1210-watkinss-q-to-tree-backup"></a></h2> <h3 id="watkinss-qlambda">Watkins’s Q($\lambda$)</h3> <ul> <li>Watkins’s Q($\lambda$) is the original method for extending Q-learning to eligibility traces.</li> <li>It decays the ET in the usual way as long as greedy actions are taken, then cuts the traces to 0 at the 1st non-greedy action.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-10-watkins-q-lambda.png" alt="Watkins's Q(lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for Watkins's Q($\lambda$)</strong> </div> <div class="callout__body"> <p>The series of component updates ends either with the end of the episode or with the first nongreedy action, whichever comes first.</p> </div> </div> <h3 id="tree-backuplambda">Tree-Backup($\lambda$)</h3> <ul> <li>Let’s look at the eligibility trace version of Tree Backup, which is called <strong>Tree-Backup($\lambda$)</strong> or <strong>TB($\lambda$)</strong>.</li> <li>TB($\lambda$) is the <strong>true successor</strong> to Q-learning because it has no importance sampling.</li> <li>The TB($\lambda$) concept is straightforward: <ul> <li> <p>The tree-backup updates of each length (Section 7.5) are weighted depending on the bootstrapping parameter $\lambda$.</p> </li> <li> <p>Using the recursive form of the action-based $\lambda$-return for Expected Sarsa and then
expanding the bootstrapping target after the model of the tree-backup $n$-step return (Section 7.5):</p> \[\boxed{\begin{align*} G_t^{\lambda a} &amp;\doteq R_{t+1} + \gamma_{t+1}\!\left(\!\left[1 - \lambda_{t+1}\right] \bar{V}_t(S_{t+1}) + \lambda_{t+1}\!\left[\sum_{a \neq A_{t+1}} \pi(a \vert S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) + \pi(A_{t+1} \vert S_{t+1}) G_{t+1}^{\lambda a}\right]\right) \\ &amp;= R_{t+1} + \gamma_{t+1}\!\left(\bar{V}_t(S_{t+1}) + \lambda_{t+1} \pi(A_{t+1} \vert S_{t+1}) \!\left[G_{t+1}^{\lambda a} - \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right) \end{align*}}\] </li> </ul> </li> <li> <p>$G_t^{\lambda a}$ can be approximated (ignoring changes in approx. value function) as a sum of TD errors:</p> \[\begin{align*} G_t^{\lambda a} &amp;\approx \hat{q}(S_t, A_t, \mathbf{w}_t) + \sum_{k=t}^{\infty} \delta_k^a \prod_{i=t+1}^{k} \gamma_i \lambda_i \pi(A_i \vert S_i) \\ \delta_t^a &amp;= R_{t+1} + \gamma_{t+1} \bar{V}_t(S_{t+1}) - \hat{q}(S_t, A_t, \mathbf{w}_t) \end{align*}\] </li> <li> <p>As always, using the same steps as in the previous section, we get a special eligibility trace update involving the target-policy probabilities of the selected actions:</p> \[\boxed{\mathbf{z}_t \doteq \gamma_t \lambda_t \pi(A_t \vert S_t)\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\] </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-10-tree-backup-q-lambda.png" alt="Tree Backup (lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for Tree Backup($\lambda$)</strong> </div> <div class="callout__body"> <p>The tree-backup updates of each length are weighted in the usual way, depending on the bootstrapping parameter $\lambda$.</p> </div> </div> <ul> <li>The ET above combined with the usual semi-gradient TD($\lambda$) parameter-update rule defines the TB($\lambda$) algorithm.</li> <li>Like all semi-gradient algorithms, TB($\lambda$) is not guaranteed to be stable when used
with off-policy data and a powerful function approximator <strong>(the deadly triad).</strong></li> </ul> <hr /> <h2 id="1211-stable-off-policy-methods-with-traces">12.11 Stable Off-Policy Methods with Traces</h2> <ul> <li>Let’s look at 4 of the most important methods that achieve stable off-policy training using eligibility traces.</li> <li>All 4 are based on either <strong>Gradient-TD or Emphatic TD</strong> methods and linear function approximation.</li> </ul> <h3 id="gtdlambda">GTD($\lambda$)</h3> <ul> <li> <p>GTD($\lambda$) is analogous to TDC and aims to learn a parameter $\mathbf{w}_{t}$ such that $\hat{v}(s, \mathbf{w}_t) \doteq \mathbf{w}_{t}^T \mathbf{x}(s) \approx v_{\pi}(s)$ even from data due to following another policy $b$. Its update is:</p> \[\begin{aligned} \mathbf{w}_{t+1} &amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t - \alpha \gamma_{t+1}(1 - \lambda_{t+1})\!\left(\mathbf{z}_t^T \mathbf{v}_t\right) \mathbf{x}_{t+1} \\ \mathbf{v}_{t+1} &amp;\doteq \mathbf{v}_t + \beta\, \delta_t^s\, \mathbf{z}_t - \beta\!\left(\mathbf{v}_t^T \mathbf{x}_t\right) \mathbf{x}_t \end{aligned}\] \[\begin{aligned} \text{where} \\ \mathbf{v} &amp;\in \mathbb{R}^d \equiv \text{a vector of the same dimension as } \mathbf{w}, \text{ initialized to } \mathbf{v}_0 = \mathbf{0} \\ \beta &amp;&gt; 0 \equiv \text{a 2nd step-size parameter} \end{aligned}\] </li> </ul> <h3 id="gqlambda">GQ($\lambda$)</h3> <ul> <li>Gradient-TD algorithm for action values with eligibility traces.</li> <li>GQ($\lambda$) aims to learn $\mathbf{w}_{t}$ such that $\hat{q}(s, a, \mathbf{w}_{t}) \doteq \mathbf{w}_{t}^T \mathbf{x}(s,a) \approx q_{\pi}(s,a)$ from off-policy data.</li> <li>If the target policy is $\varepsilon$-greedy, or otherwise biased towards the greedy policy for $\hat{q}$, then GQ($\lambda$) can be used as a control algorithm.</li> <li> <p>The GQ($\lambda$) update is:</p> \[\begin{aligned} \mathbf{w}_{t+1} &amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^a\, \mathbf{z}_t - \alpha
\gamma_{t+1}(1 - \lambda_{t+1})\!\left(\mathbf{z}_t^T \mathbf{v}_t\right) \bar{\mathbf{x}}_{t+1} \\ \bar{\mathbf{x}}_t &amp;\doteq \sum_a \pi(a \vert S_t)\, \mathbf{x}(S_t, a) \\ \delta_t^a &amp;\doteq R_{t+1} + \gamma_{t+1}\, \mathbf{w}_t^T \bar{\mathbf{x}}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t \\ \mathbf{z}_t &amp;\doteq \gamma_t \lambda_t \rho_t\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) \end{aligned}\] \[\begin{aligned} \text{where} \\ \bar{\mathbf{x}}_t &amp;\equiv \text{average feature vector for } S_t \text{ under the target policy} \\ \delta_t^a &amp;\equiv \text{expectation form of the TD error} \end{aligned}\] </li> </ul> <h3 id="htdlambda">HTD($\lambda$)</h3> <ul> <li>Hybrid TD($\lambda$) state-value algorithm combines aspects of GTD($\lambda$) and TD($\lambda$).</li> <li> <p>HTD($\lambda$) is a strict generalization of TD($\lambda$) to the off-policy setting, meaning it reduces exactly to TD($\lambda$) when the behavior and target policies coincide; a property GTD($\lambda$) does not share:</p> \[b(A_t \vert S_t) = \pi(A_t \vert S_t), \quad \rho_t = 1 \implies \text{HTD}(\lambda) = \text{TD}(\lambda)\] </li> <li> <p>HTD($\lambda$) is defined by:</p> \[\begin{aligned} \mathbf{w}_{t+1} &amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t + \alpha\!\left[\!\left(\mathbf{z}_t - \mathbf{z}_t^b\right)^T \mathbf{v}_t\right]\!\left(\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1}\right) \\ \mathbf{v}_{t+1} &amp;\doteq \mathbf{v}_t + \beta\, \delta_t^s\, \mathbf{z}_t - \beta\!\left(\mathbf{z}_t^T \mathbf{v}_t\right)\!\left(\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1}\right), \quad &amp; \mathbf{v}_0 \doteq \mathbf{0} \\ \mathbf{z}_t &amp;\doteq \rho_t \!\left(\gamma_t \lambda_t\, \mathbf{z}_{t-1} + \mathbf{x}_t\right), \quad &amp; \mathbf{z}_{-1} \doteq \mathbf{0} \\ \mathbf{z}_t^b &amp;\doteq \gamma_t \lambda_t\, \mathbf{z}_{t-1}^b + \mathbf{x}_t, \quad &amp; \mathbf{z}_{-1}^b \doteq \mathbf{0} \end{aligned}\] </li> <li>We get <ul> <li>a 2nd 
set of weights, $\mathbf{v}_t$.</li> <li>a 2nd set of eligibility traces, $\mathbf{z}_t^b$, <strong>conventional accumulating traces</strong> for the behavior policy.</li> </ul> \[\begin{aligned} \mathbf{z}_t^b = \mathbf{z}_t \text{ if all } \rho_t = 1 &amp;\implies \left(\mathbf{z}_t - \mathbf{z}_t^b\right)^T = \mathbf{0} \\ &amp;\implies \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t \quad \text{(TD(}\lambda\text{))} \end{aligned}\] </li> </ul> <h3 id="emphatic-tdlambda">Emphatic TD($\lambda$)</h3> <ul> <li>Extension of one-step Emphatic TD (Sections 9.11 &amp; 11.8) to eligibility traces.</li> <li>The resulting algorithm: <ul> <li>(+) retains strong off-policy convergence guarantees</li> <li>(+) enables any degree of bootstrapping</li> <li>(-) has high variance</li> <li>(-) potentially slow convergence.</li> </ul> </li> <li> <p>Emphatic TD($\lambda$) is defined by:</p> \[\begin{aligned} \mathbf{w}_{t+1} &amp;\doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t \\ \delta_t &amp;\doteq R_{t+1} + \gamma_{t+1}\, \mathbf{w}_t^T \mathbf{x}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t \\ \mathbf{z}_t &amp;\doteq \rho_t \!\left(\gamma_t \lambda_t\, \mathbf{z}_{t-1} + M_t \mathbf{x}_t\right), \quad &amp; \mathbf{z}_{-1} \doteq \mathbf{0} \\ M_t &amp;\doteq \lambda_t \mathcal{I}_t + (1 - \lambda_t) F_t \\ F_t &amp;\doteq \rho_{t-1} \gamma_t F_{t-1} + \mathcal{I}_t, \quad &amp; F_0 \doteq \mathcal{I}_0 \end{aligned}\] \[\begin{aligned} \text{where} \\ M_t &amp;\geq 0 \equiv \text{emphasis} \\ F_t &amp;\geq 0 \equiv \text{followon trace} \\ \mathcal{I}_t &amp;\geq 0 \equiv \text{interest} \end{aligned}\] </li> <li>In the on-policy case ($\rho_t = 1$ for all $t$), Emphatic TD($\lambda$) is similar to conventional TD($\lambda$), but still significantly different: <ul> <li>Emphatic TD($\lambda$) is guaranteed to converge for all state-dependent $\lambda$ functions; TD($\lambda$) is not (TD($\lambda$) is guaranteed only for constant $\lambda$).</li> <li>See
Yu’s counterexample <strong>[Ghiassian, Rafiee &amp; Sutton, 2016].</strong></li> </ul> </li> </ul> <hr /> <h2 id="1212-implementation-issues">12.12 Implementation Issues</h2> <ul> <li><strong>Naive implementation seems expensive:</strong> Updating eligibility traces for every state at every time step appears computationally costly on serial computers.</li> <li><strong>Practical optimization:</strong> Most ET are nearly 0; only recently visited states have significant traces, so implementations can track and update only these few states.</li> <li><strong>Computational cost:</strong> With this optimization, tabular methods with traces are only a few times more expensive than one-step methods.</li> <li><strong>Function approximation reduces overhead:</strong> When using neural networks, ET typically only double memory and computation per step (much less overhead than in the tabular case).</li> <li><strong>Tabular is the worst case:</strong> The tabular setting represents the highest computational complexity for ET relative to simpler methods.</li> </ul> <hr /> <h2 id="1213-conclusions">12.13 Conclusions</h2> <ul> <li><strong>Eligibility traces</strong> provide an efficient, incremental way to interpolate between TD and MC methods.</li> <li>ET offer advantages over $n$-step methods in terms of <strong>generality and computational trade-offs.</strong></li> <li>Empirically, <strong>an intermediate mix works best:</strong> ET should move towards MC but not all the way, since pure MC performance degrades sharply.</li> <li>ET are the <strong>first defense against long-delayed rewards and non-Markov tasks,</strong> used with TD methods to make them behave more like MC methods without full bootstrapping.</li> <li>Use traces <strong>when data is scarce and online learning is required,</strong> as they provide faster learning per sample despite higher computational cost per step.</li> <li><strong>Avoid traces in offline settings</strong> with cheap, abundant data (maximum data
processing speed matters more than learning efficiency per sample).</li> <li><strong>True online methods</strong> achieve ideal $\lambda$-return performance while maintaining $O(d)$ computational efficiency.</li> <li>Forward-to-backward view derivations provide <strong>computationally efficient, mechanistic,</strong> practical implementations of theory.</li> </ul> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026RLsuttonBartoCh12notes</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Sutton &amp; Barto, Ch. 12: Eligibility Traces (Personal Notes)"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Mar"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/03/13/rl-sutton-barto-notes-ch012/"</span> <span class="p">}</span> </code></pre></div></div> <hr /> Fri, 13 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/13/rl-sutton-barto-notes-ch012/ https://chizkidd.github.io//2026/03/13/rl-sutton-barto-notes-ch012/ Sutton & Barto, Ch. 11: Off-Policy Methods with Approximation (Personal Notes) <ul> <li>Let’s discuss the extension of off-policy methods from the tabular case (Ch. 
6 &amp; 7) to function approximation.</li> <li>We’ll explore the convergence problems, the theory of linear function approximation, the notion of learnability, and off-policy algorithms with stronger convergence guarantees.</li> <li>Off-policy learning with function approximation has 2 challenges: <ol> <li>Finding the target of the update.</li> <li>The off-policy distribution of updates does not match the on-policy distribution.</li> </ol> </li> </ul> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ul> <li><a href="#111-semi-gradient-methods">11.1 Semi-gradient Methods</a></li> <li><a href="#112-examples-of-off-policy-divergence">11.2 Examples of Off-Policy Divergence</a></li> <li><a href="#113-the-deadly-triad">11.3 The Deadly Triad</a></li> <li><a href="#114-linear-value-function-geometry">11.4 Linear Value-Function Geometry</a></li> <li><a href="#115-gradient-descent-in-the-bellman-error">11.5 Gradient Descent in the Bellman Error</a></li> <li><a href="#116-the-bellman-error-is-not-learnable">11.6 The Bellman Error is Not Learnable</a></li> <li><a href="#117-gradient-td-methods">11.7 Gradient-TD Methods</a></li> <li><a href="#118-emphatic-td-methods">11.8 Emphatic-TD Methods</a></li> <li><a href="#119-reducing-variance">11.9 Reducing Variance</a></li> <li><a href="#1110-summary">11.10 Summary</a></li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li><a href="#citation">Citation</a></li> </ul> <hr /> <h2 id="111-semi-gradient-methods">11.1 Semi-gradient Methods</h2> <ul> <li>Let’s discuss the extension of previous off-policy methods to function approximation as semi-gradient methods.</li> <li> <p>This is how we find the update target (or change it) to address the first challenge.</p> </li> <li>Recall the semi-gradient update:</li> </ul> \[\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\!\left[U_t - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)\] \[U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)\] <ul> <li>In the tabular case, we update the
array ($V$ or $Q$), but now we update the weight vector $\mathbf{w}$.</li> <li>Many off-policy algorithms use the per-step importance sampling ratio:</li> </ul> \[\rho_t \doteq \rho_{t:t} = \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)}\] <ul> <li>The off-policy, semi-gradient <strong>TD(0)</strong> is same as that of the on-policy TD(0) except for the addition of the $\rho_t$ term:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)}\] \[\begin{align*} \text{(episodic)} \quad \delta_t &amp;= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\ \text{(continuing)} \quad \delta_t &amp;= R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \end{align*}\] <ul> <li>For action values, the off-policy, semi-gradient <strong>Expected Sarsa</strong> update rule is (no importance sampling):</li> </ul> \[\boxed{\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\] \[\begin{align*} \text{(episodic)} \quad \delta_t &amp;= R_{t+1} + \gamma \sum_a \pi(a \vert S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \\ \text{(continuing)} \quad \delta_t &amp;= R_{t+1} - \bar{R}_t + \sum_a \pi(a \vert S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \end{align*}\] <ul> <li>The lack of use of importance sampling in Expected Sarsa is an unclear choice since we might want to weight different state-action pairs <strong>differently</strong> once they all contribute to the same overall approximation. This issue can only be properly resolved by more thorough understanding of the <strong>theory of function approximation</strong> in RL.</li> <li>In the multi-step generalizations of the algorithms, both the state-value and action-value algorithms involve importance sampling. 
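To make the boxed off-policy semi-gradient TD(0) update concrete, here is a minimal NumPy sketch; the linear features, step size, and policy probabilities are illustrative assumptions, not anything fixed by the text:

```python
import numpy as np

def off_policy_td0_step(w, x_t, x_tp1, r, gamma, alpha, pi_a, b_a):
    """One off-policy semi-gradient TD(0) step with linear features:
    w <- w + alpha * rho_t * delta_t * x_t  (the gradient of v_hat is x_t)."""
    rho = pi_a / b_a                          # per-step importance sampling ratio
    delta = r + gamma * w @ x_tp1 - w @ x_t   # episodic TD error
    return w + alpha * rho * delta * x_t

# Worked check against the two-state fragment of Section 11.2: features
# (1,) and (2,) give v_hat = w and 2w, R = 0, rho = 1. Each step then
# multiplies w by 1 + alpha*(2*gamma - 1), which exceeds 1 for gamma > 1/2,
# so w grows without bound.
w = np.array([1.0])
for _ in range(10):
    w = off_policy_td0_step(w, np.array([1.0]), np.array([2.0]),
                            r=0.0, gamma=0.9, alpha=0.1, pi_a=1.0, b_a=1.0)
# w is now 1.08**10, roughly 2.16
```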
For example, the off-policy, semi-gradient $\mathbf{n}$<strong>-step Sarsa</strong> update is:</li> </ul> \[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \rho_{t+1} \cdots \rho_{t+n}\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})}\] \[\begin{align*} \text{(episodic)} \quad G_{t:t+n} &amp;= R_{t+1} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}) \\ \text{(continuing)} \quad G_{t:t+n} &amp;= R_{t+1} - \bar{R}_t + \ldots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}) \end{align*}\] \[\text{where } \rho_k = 1 \hspace{0.5em} \text{ for } k \geq T \quad \text{and} \quad G_{t:t+n} = G_t \hspace{0.5em} \text{ if } t+n \geq T\] <ul> <li>The off-policy, semi-gradient $\mathbf{n}$<strong>-step backup tree</strong> (no importance sampling) algorithm is:</li> </ul> \[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})}\] \[G_{t:t+n} \doteq \hat{q}(S_t, A_t, \mathbf{w}_{t+n}) + \sum_{k=t}^{t+n-1} \delta_k \prod_{i=t+1}^{k} \gamma \pi(A_i \vert S_i)\] \[\text{where } \delta_t \text{ is the Expected Sarsa TD error defined earlier in this section.}\] <hr /> <h2 id="112-examples-of-off-policy-divergence">11.2 Examples of Off-Policy Divergence</h2> <ul> <li>Now let’s discuss the 2nd off-policy function approximation challenge.</li> <li>We’ll look at some instructive counterexamples where the semi-gradient algorithm diverges.</li> </ul> <h3 id="example-1">Example 1</h3> <p>Consider part of a larger MDP with 2 states whose estimated values are $w$ and $2w$:</p> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-2-example1-DOWNSPACED.png" alt="Counterexample" /></p> <blockquote> <p><strong>Simple Counterexample:</strong> 2-state part of an MDP.</p> </blockquote> <ul> <li>$w$ updates will diverge to infinity, since the 
transition will always look good (higher next-state estimated value than current state estimated value).</li> <li>The TD error on a transition between the 2 states is:</li> </ul> \[\begin{align*} \delta_t &amp;= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\ &amp;= 0 + \gamma \cdot 2w_t - w_t \\ &amp;= (2\gamma - 1)\, w_t \end{align*}\] <ul> <li>The off-policy, semi-gradient TD(0) update is:</li> </ul> \[\begin{align*} w_{t+1} &amp;= w_t + \alpha \rho_t\, \delta_t \nabla \hat{v}(S_t, w_t) \\ &amp;= w_t + (\alpha)(1)\!\left[(2\gamma - 1) w_t\right](1) \\ &amp;= w_t\!\left[1 + \alpha(2\gamma - 1)\right] \end{align*}\] \[\begin{aligned} \Rightarrow \quad &amp; 1 + \alpha(2\gamma - 1) &gt; 1 \\ &amp; \alpha(2\gamma - 1) &gt; 0 \\ &amp; 2\gamma - 1 &gt; 0 \\ &amp; \gamma &gt; \tfrac{1}{2} \quad \longrightarrow \quad w \to \pm\infty \end{aligned}\] <h3 id="example-2-bairds-counterexample">Example 2 (Baird’s Counterexample)</h3> <ul> <li>Now let’s look at an entire complete system with instability (divergence).</li> <li>Consider the episodic 7-state, 2-action MDP shown below.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-2-bairds-counterexample.png" alt="Baird's Counterexample" /></p> <blockquote> <p><strong>Baird’s Counterexample:</strong> Episodic 7-state, 2-action MDP.</p> </blockquote> <ul> <li><strong>Assumptions/knowns:</strong> <ul> <li>$b(\text{dashed}\,\vert\,\cdot) = 6/7$</li> <li>$b(\text{solid}\,\vert\,\cdot) = 1/7$</li> <li>$\pi(a\,\vert\,\cdot) = \pi(\text{solid}\,\vert\,\cdot) = 1$</li> <li>$R = 0$ (on all transitions)</li> <li>$\gamma = 0.99$</li> <li>The state values are estimated via linear parametrization.</li> </ul> </li> <li>The estimated value of the leftmost state is $2w_1 + w_8$, which corresponds to a feature vector for the 1st state being:</li> </ul> \[\mathbf{x}(1) = (2, 0, 0, 0, 0, 0, 0, 1)^T\] \[R = 0 \quad \therefore\quad v_\pi(s) = 0 \; \forall s, \text{ which can be exactly approximated 
if } \mathbf{w} = \mathbf{0}\] <ul> <li>Since there are 8 components of the weight vector (more than the 7 non-terminal states), there exist many solutions.</li> <li>Applying semi-gradient TD(0) to this problem will cause the weights to diverge to infinity. This also applies for the dynamic programming (DP) case.</li> <li>The semi-gradient DP update is:</li> </ul> \[\mathbf{w}_{k+1} \doteq \mathbf{w}_k + \frac{\alpha}{\vert S \vert} \sum_s \left(\mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_k) \mid S_t = s\right] - \hat{v}(s, \mathbf{w}_k)\right) \nabla \hat{v}(s, \mathbf{w}_k)\] <ul> <li>This example shows that even the simplest combination of bootstrapping and function approximation can be unstable in the off-policy case. <ul> <li><u>Simplest bootstrapping</u>: DP and TD.</li> <li><u>Simplest function approximation</u>: linear, semi-gradient descent method.</li> </ul> </li> </ul> <h3 id="example-3-tsitsiklis--van-roys-counterexample">Example 3 (Tsitsiklis &amp; Van Roy’s Counterexample)</h3> <p>This extends Example 1 with a terminal state and $R = 0$:</p> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-2-tsitsiklis-van-roy-counterexample.png" alt="Tsitsiklis &amp; Van Roy's Counterexample" /></p> <blockquote> <p><strong>Tsitsiklis &amp; Van Roy’s Counterexample:</strong> Extension of Example 1 with probability $\varepsilon$ of transitioning to the terminal state (shaded).</p> </blockquote> <ul> <li>Let’s find $w_{k+1}$ at each step that minimizes the $\overline{\text{VE}}$ between the estimated value and the expected one-step return:</li> </ul> \[\begin{align*} w_{k+1} &amp;= \arg\min_{w \in \mathbb{R}} \sum_{s \in S} \left(\hat{v}(s, w) - \mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, w_k) \mid S_t = s\right]\right)^2 \\[6pt] &amp;= \arg\min_{w \in \mathbb{R}} \left(w - \gamma \cdot 2w_k\right)^2 + \left(2w - (1 - \varepsilon)\gamma \cdot 2w_k\right)^2 \\[6pt] &amp;= \left(\frac{6 - 4\varepsilon}{5}\right) \gamma w_k 
\end{align*}\] <ul> <li>The sequence $\{w_k\}$ diverges when $\gamma &gt; \dfrac{5}{6 - 4\varepsilon}$ and $w_0 \neq 0$.</li> </ul> <h3 id="takeaways">Takeaways</h3> <ul> <li>Instability can be prevented by using special methods for function approximation.</li> <li>These special methods guarantee stability because they do not extrapolate from the observed targets. They are called <strong>averagers</strong>.</li> <li>Averagers include: <ol> <li>Nearest neighbor methods</li> <li>Locally weighted regression</li> </ol> Popular methods such as tile coding and artificial neural networks (ANNs) extrapolate beyond the observed targets and are therefore <strong>not</strong> averagers.</li> </ul> <hr /> <h2 id="113-the-deadly-triad">11.3 The Deadly Triad</h2> <ul> <li>The danger of instability and divergence arises when we combine these 3 elements, which make up the <strong>deadly triad</strong>: <ul> <li>Function approximation</li> <li>Bootstrapping</li> <li>Off-policy training</li> </ul> </li> <li>Instability can be avoided if one of the elements is absent: <ul> <li>Function approximation cannot be given up (it is needed for large-scale problems).</li> <li>Bootstrapping can be given up, but at the cost of computational and data efficiency.</li> <li>Off-policy training can be given up (replace Q-learning with Sarsa).</li> <li>There is no perfect solution, as we still need off-policy learning for planning and parallel learning.</li> </ul> </li> </ul> <hr /> <h2 id="114-linear-value-function-geometry">11.4 Linear Value-Function Geometry</h2> <ul> <li>To better understand the stability challenge of off-policy learning, let’s think about value-function approximation <strong>more abstractly and independently</strong> of how learning is done.</li> <li>Let’s consider the case with 3 states $S = \{s_1, s_2, s_3\}$ and 2 parameters $\mathbf{w} = (w_1, w_2)^T$.
<ul> <li>All value functions exist in a 3-D space, however the parameters provide a 2-D subspace.</li> <li>Any weight vector $\mathbf{w} = (w_1, w_2)^T$ is a point in the 2-D subspace and thus also a complete value function $v_\mathbf{w}$ that assigns values to all 3 states.</li> <li>In linear value-function approximation, the subspace is a simple plane.</li> </ul> </li> <li>How do we represent $v_\pi$ in the $d$-dimensional space? <ul> <li>We need to perform a projection operation.</li> <li>TD methods present other solutions.</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-4-linear-value-func-approx-geometry-DOWNSPACED.png" alt="Linear value-func. approx. geometry" /></p> <!-- >**The Geometry of Linear Value-Function Approximation:** Shown is the 3D space of all value functions over three states, while shown as a plane is the subspace of all value functions representable by a linear function approximator with parameter $\mathbf{w} = (w_1, w_2)^T$. The true value function $v_\pi$ is in the larger space and can be projected down (into the subspace, using a projection operator $\Pi$) to its best approximation in the value error ($\text{VE}$) sense. The best approximators in the Bellman error ($\text{BE}$), projected Bellman error ($\text{PBE}$), and temporal difference error ($\text{TDE}$) senses are all potentially different and are shown in the lower right. --> <div class="callout callout--note"> <div class="callout__title"> <strong>The Geometry of Linear Value-Function Approximation</strong> </div> <div class="callout__body"> <p>Shown is the 3D space of all value functions over three states, while shown as a plane is the subspace of all value functions representable by a linear function approximator with parameter $\mathbf{w} = (w_1, w_2)^T$. 
The true value function $v_\pi$ is in the larger space and can be projected down (into the subspace, using a projection operator $\Pi$) to its best approximation in the value error ($\text{VE}$) sense. The best approximators in the Bellman error ($\text{BE}$), projected Bellman error ($\text{PBE}$), and temporal difference error ($\text{TDE}$) senses are all potentially different and are shown in the lower right.</p> </div> </div> <h3 id="projection-operation">Projection Operation</h3> <ul> <li>For the projection operation, the distance between value functions using the norm is:</li> </ul> \[\begin{align*} \lVert v \rVert_\mu^2 &amp;\doteq \sum_{s \in S} \mu(s)\, v(s)^2 \\[6pt] \overline{\text{VE}}(\mathbf{w}) &amp;= \lVert v_\mathbf{w} - v_\pi \rVert_\mu^2 \\[6pt] \Pi\, v &amp;\doteq v_\mathbf{w} \quad \\[6pt] \text{where } \mathbf{w} = \arg\min_{\mathbf{w} \in \mathbb{R}^d} \lVert v - v_\mathbf{w} \rVert_\mu^2 &amp; \hspace{0.8em} \text{and} \hspace{0.5em} \Pi \equiv \text{projection operator} \end{align*}\] <ul> <li> <p>The representable value function that is closest to the true value function $V_\pi$ is its projection $\Pi V_\pi$ (MC method asymptotic solution).</p> </li> <li> <p><strong>Projection matrix</strong>: with $\mathbf{D} \equiv \vert S \vert \times \vert S \vert$ diagonal matrix with $\mu(s)$ on the diagonal and $\mathbf{X} \equiv \vert S \vert \times d$ matrix whose rows are the feature vectors $\mathbf{x}(s)^T$:</p> \[\Pi \doteq \mathbf{X}\!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\] <p>If the inverse does not exist, the pseudoinverse is substituted. 
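For a small example the projection matrix can be computed directly. A minimal NumPy sketch (the three-state distribution $\mu$ and the feature matrix $\mathbf{X}$ below are made-up numbers for illustration):

```python
import numpy as np

# Made-up example: 3 states, d = 2 linear features.
mu = np.array([0.5, 0.3, 0.2])      # state weighting (diagonal of D)
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])          # rows are the feature vectors x(s)^T
D = np.diag(mu)

# Pi = X (X^T D X)^{-1} X^T D, using the pseudoinverse for safety.
Pi = X @ np.linalg.pinv(X.T @ D @ X) @ X.T @ D

v = np.array([3.0, -1.0, 2.0])      # an arbitrary value function
v_proj = Pi @ v                     # nearest representable value fn in the mu-norm

# Sanity checks: projecting twice changes nothing, and a value function
# already in the subspace (v = X w) is left untouched.
assert np.allclose(Pi @ v_proj, v_proj)
assert np.allclose(Pi @ (X @ np.array([0.3, -0.7])), X @ np.array([0.3, -0.7]))
```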
Using these matrices, the squared norm of a vector can be written as:</p> \[\lVert v \rVert_\mu^2 = v^T \mathbf{D}\, v\] <p>and the approximate linear value function written as:</p> \[v_\mathbf{w} = \mathbf{X}\mathbf{w}\] </li> </ul> <h3 id="td-solutions">TD Solutions</h3> <p><strong>Bellman Error</strong></p> <ul> <li>The true value function $v_\pi$ solves the Bellman equation exactly.</li> <li>The <strong>Bellman error</strong> shows how far off $v_\mathbf{w}$ is from $v_\pi$. The Bellman error at state $s$ is:</li> </ul> \[\begin{align*} \bar{\delta}_\mathbf{w}(s) &amp;\doteq \left(\sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a)\!\left[r + \gamma v_\mathbf{w}(s')\right]\right) - v_\mathbf{w}(s) \\ &amp;= \mathbb{E}_\pi\!\left[R_{t+1} + \gamma v_\mathbf{w}(S_{t+1}) - v_\mathbf{w}(S_t) \mid S_t = s, A_t \sim \pi\right] \end{align*}\] <ul> <li>The Bellman error is the expectation of the TD error.</li> <li>The vector of all the Bellman errors, at all states, $\bar{\delta}_\mathbf{w} \in \mathbb{R}^{\vert S \vert}$, is called the <strong>Bellman error vector</strong> ($\text{BE}$).</li> <li>The overall size of $\text{BE}$ is the <strong>Mean Squared Bellman Error</strong>, $\overline{\text{BE}}$:</li> </ul> \[\overline{\text{BE}}(\mathbf{w}) = \lVert \bar{\delta}_\mathbf{w} \rVert_\mu^2\] <ul> <li>The <strong>Bellman operator</strong> $B_\pi : \mathbb{R}^{\vert S \vert} \to \mathbb{R}^{\vert S \vert}$ is defined by:</li> </ul> \[\begin{align*} (B_\pi v)(s) &amp;\doteq \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a)\!\left[r + \gamma v(s')\right], \quad \forall s \in S \text{ and } v : S \to \mathbb{R} \\[6pt] \bar{\delta}_\mathbf{w} &amp;= B_\pi v_\mathbf{w} - v_\mathbf{w} \\[6pt] v_\pi &amp;= B_\pi v_\pi \end{align*}\] <ul> <li>The projection of the Bellman error vector back into the representable space creates the <strong>Projected Bellman Error $(\text{PBE})$</strong> vector:</li> </ul> \[\text{PBE} = \Pi\, \bar{\delta}_\mathbf{w}\] <ul> <li>The size 
of $\text{PBE}$, in the norm, is another measure of error in the approximate value function, called the <strong>Mean Square Projected Bellman Error</strong>, $\overline{\text{PBE}}$:</li> </ul> \[\overline{\text{PBE}}(\mathbf{w}) = \lVert \Pi\, \bar{\delta}_\mathbf{w} \rVert_\mu^2\] <ul> <li>With linear function approximation, there always exists an approximate value function (within the subspace) with zero $\overline{\text{PBE}}$; this is the TD fixed point $\mathbf{w}_\text{TD}$.</li> </ul> <hr /> <h2 id="115-gradient-descent-in-the-bellman-error">11.5 Gradient Descent in the Bellman Error</h2> <ul> <li>Let’s apply the SGD approach to the challenge of stability in off-policy learning.</li> </ul> <h3 id="td-error-naive-residual-gradient-algorithm">TD Error (Naive Residual-Gradient Algorithm)</h3> <ul> <li>Let’s take the minimization of the expected square of the one-step TD error, the TD(0) error, as the objective:</li> </ul> \[\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\] <ul> <li>Using the TD error, we can define the Mean Squared TD error objective function, $\overline{\text{TDE}}$:</li> </ul> \[\begin{align*} \overline{\text{TDE}}(\mathbf{w}) &amp;= \sum_{s \in S} \mu(s)\, \mathbb{E}\!\left[\delta_t^2 \mid S_t = s, A_t \sim \pi\right] \\ &amp;= \sum_{s \in S} \mu(s)\, \mathbb{E}\!\left[\rho_t\, \delta_t^2 \mid S_t = s, A_t \sim b\right] \\ &amp;= \mathbb{E}_b\!\left[\rho_t\, \delta_t^2\right] \end{align*}\] <ul> <li>Following the standard SGD approach, the per-step update based on a sample of this expected value is:</li> </ul> \[\begin{align*} \mathbf{w}_{t+1} &amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\rho_t\, \delta_t^2\right) \\ &amp;= \mathbf{w}_t - \alpha \rho_t\, \delta_t \nabla \delta_t \\ &amp;= \mathbf{w}_t + \alpha \rho_t\, \delta_t\!\left(\nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)\right) \end{align*}\] <ul> <li>This is the same as the semi-gradient TD algorithm except
for the additional final term.</li> <li>This method is <strong>naive</strong> because it achieves temporal smoothing-like behavior rather than accurate prediction by penalizing all TD errors.</li> </ul> <h3 id="bellman-error-residual-gradient-algorithm">Bellman Error (Residual-Gradient Algorithm)</h3> <ul> <li>Consider the minimization of the Bellman error (if the exact values are learned, the Bellman error is zero everywhere).</li> <li> <p>This yields the <strong>residual gradient algorithm</strong>:</p> \[\begin{align*} \mathbf{w}_{t+1} &amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\mathbb{E}_\pi\!\left[\delta_t\right]^2\right) \\ &amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\mathbb{E}_b\!\left[\rho_t\, \delta_t\right]^2\right) \\ &amp;= \mathbf{w}_t - \alpha\, \mathbb{E}_b\!\left[\rho_t\, \delta_t\right] \nabla \mathbb{E}_b\!\left[\rho_t\, \delta_t\right] \\ &amp;= \mathbf{w}_t - \alpha\, \mathbb{E}_b\!\left[\rho_t\!\left(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right)\right] \mathbb{E}_b\!\left[\rho_t \nabla \delta_t\right] \\ &amp;= \mathbf{w}_t + \alpha\!\left[\mathbb{E}_b\!\left[\rho_t\!\left(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\right)\right] - \hat{v}(S_t, \mathbf{w})\right]\!\left[\nabla \hat{v}(S_t, \mathbf{w}) - \gamma\, \mathbb{E}_b\!\left[\rho_t \nabla \hat{v}(S_{t+1}, \mathbf{w})\right]\right] \end{align*}\] </li> <li>Two ways to make the residual-gradient algorithm work: <ul> <li>In the case of deterministic environments.</li> <li>Obtain 2 independent samples of the next state $S_{t+1}$ from $S_t$.</li> </ul> </li> <li> <p>In both ways above, the algorithm is guaranteed to converge to a minimum of the $\overline{\text{BE}}$ under the usual conditions on the step-size parameter.</p> </li> <li>However, there are at least 3 ways in which the convergence of the residual-gradient algorithm is unsatisfactory: <ul> <li>Very slow.</li> <li>Converges to the wrong values.</li> <li>A problem with the 
$\overline{\text{BE}}$ objective covered in the next section.</li> </ul> </li> </ul> <hr /> <h2 id="116-the-bellman-error-is-not-learnable">11.6 The Bellman Error is Not Learnable</h2> <ul> <li>The Bellman error is not learnable from the observed sequence of feature vectors, actions, and rewards.</li> <li>Since the Bellman error objective cannot be learned from the observable data, this is the strongest reason not to seek it.</li> <li>Examples of non-learnable Markov Reward Processes (MRPs):</li> </ul> <h3 id="example-1-1">Example 1</h3> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-6-example1.png" alt="VE learnability Counterexample" /></p> <blockquote> <p><strong>Value Error (VE) Learnability Counterexample:</strong> Deterministic MRP pair with an endless stream of $0$s and $2$s</p> </blockquote> <ul> <li>These MRPs have a deterministic reward with observable data of an endless stream of 0s and 2s.</li> <li>We cannot learn if the MRP has one state or two, or is stochastic or deterministic.</li> <li>The pair of MRPs shows that the $\overline{\text{VE}}$ objective is not learnable:</li> </ul> \[\overline{\text{VE}}(\mathbf{w}) \doteq \sum_{s \in S} \mu(s)\!\left[v_\pi(s) - \hat{v}(s, \mathbf{w})\right]^2\] <ul> <li>The $\overline{\text{VE}}$ is not learnable, but the parameter that optimizes it is!</li> <li>We introduce a learnable natural objective function that is always observable. This is the error between the value estimate at each time and the return from that time, called the <strong>return error</strong>. 
The <strong>Mean Square Return Error</strong> $(\overline{\text{RE}})$ is the expectation, under $\mu$, of the square of this return error.</li> <li>$\overline{\text{RE}}$ in the on-policy case is:</li> </ul> \[\begin{align*} \overline{\text{RE}}(\mathbf{w}) &amp;= \mathbb{E}\!\left[\left(G_t - \hat{v}(S_t, \mathbf{w})\right)^2\right] \\ &amp;= \overline{\text{VE}}(\mathbf{w}) + \mathbb{E}\!\left[\left(G_t - v_\pi(S_t)\right)^2\right] \end{align*}\] <ul> <li>The $\overline{\text{BE}}$ can be computed from knowledge of the MDP but is not learnable from data, and its minimum solution is not learnable.</li> </ul> <h3 id="example-2">Example 2</h3> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-6-example2.png" alt="BE learnability Counterexample" /></p> <blockquote> <p><strong>Bellman Error (BE) Learnability Counterexample:</strong> Deterministic MRP pair with the same data distribution but different minimizing parameter vectors</p> </blockquote> <ul> <li>The example above serves as a counterexample to the learnability of the Bellman error.</li> <li>The 2 MRPs generate the same data distribution but have different minimizing parameter vectors, proving that the optimal parameter vector is not a function of the data and thus cannot be learned from it.</li> <li>Other bootstrapping objectives, like $\overline{\text{PBE}}$ and $\overline{\text{TDE}}$, are learnable from data and yield optimal solutions different from each other and that of $\overline{\text{BE}}$.</li> <li>$\overline{\text{BE}}$ is limited to model-based settings, therefore $\overline{\text{PBE}}$ is preferred.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-6-causal-relationships-mdps-datadistr-errors.png" alt="MDPs-data distribution-objectives causal relationships" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Causal
Relationships among the data distribution, MDPs &amp; various objectives</strong> </div> <div class="callout__body"> <p><strong>Left, Monte Carlo objectives:</strong> Two different MDPs can produce the same data distribution yet also produce different $\overline{\text{VE}}$s, proving that the $\overline{\text{VE}}$ objective cannot be determined from data and is not learnable. However, all such $\overline{\text{VE}}$s must have the same optimal parameter vector, $\mathbf{w}^{*}$! Moreover, this same $\mathbf{w}^{*}$ can be determined from another objective, the $\overline{\text{RE}}$, which is uniquely determined from the data distribution. Thus $\mathbf{w}^{*}$ and the $\overline{\text{RE}}$ are learnable even though the $\overline{\text{VE}}$s are not. <br /> <strong>Right, Bootstrapping objectives:</strong> Two different MDPs can produce the same data distribution yet also produce different $\overline{\text{BE}}$s <em>and</em> have different minimizing parameter vectors; these are not learnable from the data distribution. 
The $\overline{\text{PBE}}$ and $\overline{\text{TDE}}$ objectives and their (different) minima can be directly determined from data and thus are learnable.</p> </div> </div> <hr /> <h2 id="117-gradient-td-methods">11.7 Gradient-TD Methods</h2> <ul> <li>Let’s consider SGD methods for minimizing the $\overline{\text{PBE}}$.</li> <li>True SGD methods, <strong>Gradient-TD methods</strong>, have robust convergence properties even under off-policy training and nonlinear function approximation.</li> <li>In the linear case, there exists an exact solution, the TD fixed point $\mathbf{w}_\text{TD}$, at which the $\overline{\text{PBE}}$ is zero.</li> <li>This solution via <strong>least-squares</strong> methods yields a $O(d^2)$ complexity; however, we want an SGD method with $O(d)$ that converges robustly.</li> <li> <p>Let’s derive an SGD method for the $\overline{\text{PBE}}$ assuming linear function approximation:</p> \[\begin{align*} \overline{\text{PBE}}(\mathbf{w}) &amp;= \lVert \Pi\, \bar{\delta}_\mathbf{w} \rVert_\mu^2 \\ &amp;= \left(\Pi\, \bar{\delta}_\mathbf{w}\right)^T \mathbf{D}\, \Pi\, \bar{\delta}_\mathbf{w} \\ &amp;= \bar{\delta}_\mathbf{w}^T \Pi^T \mathbf{D}\, \Pi\, \bar{\delta}_\mathbf{w} \\ &amp;= \bar{\delta}_\mathbf{w}^T \mathbf{D} \mathbf{X}\!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w} \\ &amp;= \left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right)^T \!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \!\left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right) \end{align*}\] <p>$\quad \left(\text{using } \Pi = \mathbf{X}!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D} \text{ and the identity } \Pi^T \mathbf{D} \Pi = \mathbf{D} \mathbf{X}!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\right)$</p> </li> <li>The gradient of the $\overline{\text{PBE}}$ w.r.t $\mathbf{w}$ is:</li> </ul> \[\nabla 
\overline{\text{PBE}}(\mathbf{w}) = 2\, \nabla\!\left[\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right]^T \!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \!\left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right)\] <ul> <li>Let’s turn this into an SGD method via converting the 3 factors above into <strong>expectations</strong> under this distribution:</li> </ul> \[\begin{aligned} \mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w} &amp;= \sum_s \mu(s)\, \mathbf{x}(s)\, \bar{\delta}_\mathbf{w}(s) = \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\[6pt] \nabla\!\left[\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right] &amp;= \nabla \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\ &amp;= \mathbb{E}\!\left[\rho_t \nabla \delta_t^T\, \mathbf{x}_t^T\right] \\ &amp;= \mathbb{E}\!\left[\rho_t \nabla\!\left(R_{t+1} + \gamma \mathbf{w}^T \mathbf{x}_{t+1} - \mathbf{w}^T \mathbf{x}_t\right) \mathbf{x}_t^T\right] \\ &amp;= \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \\[6pt] \mathbf{X}^T \mathbf{D} \mathbf{X} &amp;= \sum_s \mu(s)\, \mathbf{x}_s\, \mathbf{x}_s^T = \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right] \end{aligned}\] <p>Substituting these expectations for the three factors into $\nabla \overline{\text{PBE}}$:</p> \[\nabla \overline{\text{PBE}}(\mathbf{w}) = 2\, \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\] <ul> <li>The 1st and last terms are not independent (<strong>biased gradient estimate</strong>).</li> <li>Could estimate all 3 terms separately and combine (<strong>unbiased gradient estimate</strong>) but too computationally expensive.</li> </ul> <h3 id="gradient-td">Gradient-TD</h3> <ul> <li>Estimate and store the product of the last 2 terms of $\nabla \overline{\text{PBE}}(\mathbf{w})$ 
(product of a $d \times d$ matrix and a $d$-vector yields a $d$-vector like $\mathbf{w}$ itself):</li> </ul> \[\mathbf{v} \approx \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\] <ul> <li>In linear supervised learning, this is the solution to the linear least-squares problem of approximating $\rho_t\, \delta_t$ from the features.</li> <li>The standard SGD method for incrementally finding the $\mathbf{v}$ that minimizes the expected squared error $\left(\mathbf{v}^T \mathbf{x}_t - \rho_t\, \delta_t\right)^2$ is known as the <strong>Least Mean Square (LMS)</strong> rule (here with an added importance sampling ratio):</li> </ul> \[\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta \rho_t\!\left(\delta_t - \mathbf{v}_t^T \mathbf{x}_t\right) \mathbf{x}_t\] \[\begin{aligned} \text{where} \\ \beta &amp;&gt; 0 \equiv \text{another step-size parameter} \\ \rho_t &amp;\equiv \text{importance sampling ratio} \\ O(d) &amp;\equiv \text{storage \&amp; per-step computational complexity} \end{aligned}\] <h3 id="gtd2">GTD2</h3> <ul> <li> <p>With $\mathbf{v}_t$, we can update $\mathbf{w}_t$ using SGD:</p> \[\begin{align*} \mathbf{w}_{t+1} &amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla \overline{\text{PBE}}(\mathbf{w}_t) \\ &amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha\!\left(2\, \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\right) \\ &amp;= \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\ &amp;\approx \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbf{v}_t \\ &amp;\approx \mathbf{w}_t + \alpha \rho_t\!\left(\mathbf{x}_t - \gamma
\mathbf{x}_{t+1}\right) \mathbf{x}_t^T \mathbf{v}_t \end{align*}\] <p>where $O(d)$ per-step computational complexity of $(\mathbf{x}_t^T \mathbf{v}_t)$ is done first.</p> </li> </ul> <h3 id="td0-with-gradient-correction-gtd0-or-tdc">TD(0) with Gradient Correction (GTD(0) or TDC)</h3> <ul> <li> <p>Let’s look at another analytical algorithm called TD(0) with gradient correction, <strong>TDC</strong>:</p> \[\begin{align*} \mathbf{w}_{t+1} &amp;= \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\ &amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\rho_t\, \mathbf{x}_t\, \mathbf{x}_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right]\right) \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\ &amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right]\right) \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\ &amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \rho_t\, \delta_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\right) \\ &amp;\approx \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \rho_t\, \delta_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right] \mathbf{v}_t\right) \\ &amp;\approx \mathbf{w}_t + \alpha \rho_t\!\left(\delta_t\, \mathbf{x}_t - \gamma \mathbf{x}_{t+1}\, \mathbf{x}_t^T \mathbf{v}_t\right) \end{align*}\] <p>with $O(d)$ complexity if the final product 
$(\mathbf{x}_t^T \mathbf{v}_t)$ is done first.</li> </ul> <h3 id="takeaways-1">Takeaways</h3> <ul> <li>GTD2 and TDC both involve 2 learning processes: a primary one for $\mathbf{w}$ and a secondary one for $\mathbf{v}$.</li> <li>Asymmetrical dependence ($\mathbf{w}$ depends on $\mathbf{v}$ but $\mathbf{v}$ does not depend on $\mathbf{w}$) is referred to as a <strong>cascade</strong>.</li> <li>Gradient-TD methods are the most well-understood and widely used stable off-policy methods.</li> <li>GTD methods have been extended to: <ol> <li>Action values and control: <strong>GQ [Maei et al., 2010]</strong></li> <li>Eligibility traces: <strong>GTD($\lambda$), GQ($\lambda$) [Maei, 2011; Maei &amp; Sutton, 2010]</strong></li> <li>Nonlinear function approximation <strong>[Maei et al., 2009]</strong></li> </ol> </li> <li>Hybrid algorithms include: <ol> <li>Midway between semi-gradient TD and gradient TD <strong>[Hackman, 2012; White &amp; White, 2016]</strong></li> <li>GTD + proximal methods &amp; control variates <strong>[Mahadevan et al., 2014; Du et al., 2017]</strong></li> </ol> </li> </ul> <hr /> <h2 id="118-emphatic-td-methods">11.8 Emphatic-TD Methods</h2> <ul> <li>Let’s explore a major strategy for obtaining a cheap and efficient off-policy learning method with function approximation.</li> <li>Recall that linear semi-gradient TD methods are stable when trained under the on-policy distribution.</li> <li>The match between the on-policy state distribution $\mu_\pi$ and the state-transition probabilities $p(s' \vert s, a)$ under the target policy does not exist in off-policy learning.</li> <li><strong>Mismatch Fix</strong>: <ul> <li>Re-weight the states, emphasizing some and de-emphasizing others, so as to return the distribution of the updates to the on-policy distribution.</li> <li>Then there would be a match, and convergence and stability would be achieved.
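This re-weighting can be sketched as an update whose step is scaled by an "emphasis" $M_t$ driven by an "interest" $\mathcal{I}_t$, as in one-step emphatic TD(0). A minimal NumPy sketch; the features, constant interest, and step size are illustrative assumptions:

```python
import numpy as np

def emphatic_td0_step(w, M_prev, x_t, x_tp1, r, gamma, alpha,
                      rho_t, rho_prev, interest=1.0):
    """One-step emphatic TD(0) with linear features:
       M_t = gamma * rho_{t-1} * M_{t-1} + I_t   (emphasis, M_{-1} = 0)
       w  <- w + alpha * M_t * rho_t * delta_t * x_t
    """
    M = gamma * rho_prev * M_prev + interest
    delta = r + gamma * w @ x_tp1 - w @ x_t      # episodic TD error
    return w + alpha * M * rho_t * delta * x_t, M

# With constant interest 1 and all rho = 1, the emphasis recursion
# M = gamma * M + 1 settles at 1 / (1 - gamma); gamma = 0.5 gives M -> 2.
# (alpha = 0 here just isolates the emphasis recursion.)
w, M = np.zeros(1), 0.0
x = np.array([1.0])
for _ in range(60):
    w, M = emphatic_td0_step(w, M, x, x, r=0.0, gamma=0.5, alpha=0.0,
                             rho_t=1.0, rho_prev=1.0)
```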
This is the idea of <strong>Emphatic-TD methods</strong>.</li> </ul> </li> <li>The <strong>one-step Emphatic-TD algorithm</strong> for learning episodic state values is defined by:</li> </ul> \[\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\] \[\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha M_t \rho_t\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)\] \[M_t = \gamma \rho_{t-1} M_{t-1} + \mathcal{I}_t\] \[\begin{aligned} \text{where} \\ \mathcal{I}_t &amp;\equiv \text{the interest} \\ M_t &amp;\equiv \text{the emphasis} \quad (M_{-1} = 0) \end{aligned}\] <ul> <li>Applying Emphatic TD to Baird’s counterexample yields results with very high variance (it is impossible to obtain consistent results in experiments).</li> <li>The next section focuses on how to reduce the variance of all these algorithms.</li> </ul> <hr /> <h2 id="119-reducing-variance">11.9 Reducing Variance</h2> <ul> <li>Off-policy learning inherently has greater variance than on-policy learning.</li> <li>The raison d’être of off-policy learning is to enable generalization to the vast number of related-but-not-identical policies.</li> <li>Why is variance control critical in off-policy learning based on importance sampling? 
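<p>To see the danger concretely, here is a quick simulation (a minimal sketch; the two-action setup and the policies $\pi$ and $b$ are invented purely for illustration). Each per-step ratio has expectation 1, yet the product over a trajectory is wildly variable:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-action setup: behavior b is uniform, target pi prefers action 0.
pi = np.array([0.9, 0.1])
b = np.array([0.5, 0.5])

T = 10            # trajectory length
n_traj = 100_000  # number of sampled trajectories

# Sample actions from b, form per-step ratios rho_k = pi(A_k)/b(A_k),
# and multiply them along each trajectory to get rho_{t:T-1}.
actions = rng.choice(2, size=(n_traj, T), p=b)
ratios = pi[actions] / b[actions]   # each ratio is 1.8 or 0.2, expectation 1
products = ratios.prod(axis=1)

print(products.mean())  # stays near 1: the product is unbiased
print(products.var())   # explodes: E[rho^2]^T - 1, roughly 140 here
```

<p>Almost all of the product’s expectation comes from rare trajectories with huge ratios, which is exactly why these ratios are dangerous as multipliers on the SGD step.</p>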
<ul> <li>Recall importance sampling involves products of policy ratios:</li> </ul> \[\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \vert S_k)}{b(A_k \vert S_k)}\] <ul> <li>The policy ratios always have an expected value of 1, but their realized values can be very high or as low as 0:</li> </ul> \[\mathbb{E}\!\left[\frac{\pi(A_k \vert S_k)}{b(A_k \vert S_k)}\right] \doteq \sum_a b(a \vert S_k) \frac{\pi(a \vert S_k)}{b(a \vert S_k)} = \sum_a \pi(a \vert S_k) = 1\] <ul> <li>Successive ratios are uncorrelated, so their products are always 1 in expected value, but they can be of high variance.</li> <li>These ratios multiply the step size in SGD methods, so their high variance is problematic for SGD because of the occasional huge steps.</li> </ul> </li> <li>How can we keep the expected step taken by SGD small, so that high-variance ratios do not translate into occasional huge updates? Some approaches: <ul> <li>Momentum</li> <li>Polyak-Ruppert averaging</li> <li>Methods for adaptively setting separate step sizes for different components of the parameter vector</li> <li>“Importance weight aware” updates of <strong>Karampatziakis &amp; Langford (2015)</strong></li> <li>Weighted importance sampling, which is well-behaved with lower variance updates than ordinary importance sampling, but adapting it to function approximation is challenging <strong>[Mahmood &amp; Sutton, 2015]</strong></li> <li>Tree backup algorithm (off-policy, without importance sampling)</li> <li>Allow the target policy $\pi$ to be determined partly by the behavior policy $b$ to limit creating large importance sampling ratios</li> </ul> </li> </ul> <hr /> <h2 id="1110-summary">11.10 Summary</h2> <ul> <li>Off-policy learning poses a challenge that requires creating stable and efficient learning algorithms.</li> <li>Tabular Q-learning makes off-policy learning seem easy, as do its generalizations to Expected Sarsa and tree backup.</li> <li>Extending it further to function approximation (even linear) is 
challenging.</li> <li>The challenge of off-policy learning is divided into two parts: <ul> <li>Correcting the targets of learning for the behavior policy.</li> <li>Dealing with the instability of bootstrapping (mismatch between off-policy and on-policy distribution of updates).</li> </ul> </li> <li>The <strong>deadly triad</strong> arises when we try to combine these 3 elements: <strong>function approximation, off-policy learning, and bootstrapping,</strong> thereby causing instability and divergence.</li> <li>SGD in the Bellman error $\overline{\text{BE}}$ is not learnable so it does not work.</li> <li>Gradient-TD methods perform SGD in the projected Bellman error $\overline{\text{PBE}}$ and are learnable with $O(d)$ computational complexity.</li> <li>Emphatic-TD methods re-weight updates, emphasizing some and de-emphasizing others, to get the off-policy distribution of the updates to match that of on-policy.</li> <li>There are many ways of reducing high variance in off-policy learning that are centered on minimizing the step taken by SGD by using small step-size parameters to counter the multiplicative effect from the successive policy ratios.</li> </ul> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026RLsuttonBartoCh11notes</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Sutton &amp; Barto, Ch. 
11: Off-Policy Methods with Approximation (Personal Notes)"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Mar"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/03/09/rl-sutton-barto-notes-ch011/"</span> <span class="p">}</span> </code></pre></div></div> <hr /> Mon, 09 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch011/ https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch011/ Sutton & Barto, Ch. 10: On-Policy Control with Approximation (Personal Notes) <ul> <li>Let’s dive into the control problem now with parametric approximation of the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_{*}(s, a)$, where $\mathbf{w} \in \mathbb{R}^d$ is a <strong>finite-dimensional weight vector.</strong></li> <li>We’ll focus on <strong>semi-gradient Sarsa</strong>, the natural extension of semi-gradient TD(0) to action values and to on-policy control.</li> <li>We’ll look at this extension in both the episodic and continuing case.</li> <li>We’ll look at $n$-step linear Sarsa.</li> </ul> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ul> <li><a href="#101-episodic-semi-gradient-control">10.1 Episodic Semi-gradient Control</a></li> <li><a href="#102-semi-gradient-n-step-sarsa">10.2 Semi-gradient $n$-step Sarsa</a></li> <li><a href="#103-average-reward-a-new-problem-setting-for-continuing-tasks">10.3 Average Reward: A New Problem Setting for Continuing Tasks</a></li> <li><a href="#104-deprecating-the-discounted-setting">10.4 
Deprecating the Discounted Setting</a></li> <li><a href="#105-differential-semi-gradient-n-step-sarsa">10.5 Differential Semi-gradient $n$-step Sarsa</a></li> <li><a href="#106-summary">10.6 Summary</a></li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li><a href="#citation">Citation</a></li> </ul> <hr /> <h2 id="101-episodic-semi-gradient-control">10.1 Episodic Semi-gradient Control</h2> <ul> <li>The extension of the semi-gradient prediction methods of Chapter 9 to action values is straightforward.</li> <li>It is the approximate action-value function, $\hat{q} \approx q_\pi$, that is represented as a parametrized functional form with weight vector $\mathbf{w}$.</li> <li>Before, the training examples had the form $S_t \mapsto U_t$; now the examples have the form $S_t, A_t \mapsto U_t$.</li> <li>The update target $U_t$ can be any approximation of $q_\pi(S_t, A_t)$, including the usual backed-up values such as the full Monte Carlo (MC) return $G_t$ or any $n$-step Sarsa return $G_{t:t+n}$.</li> <li>The general gradient-descent update for action-value prediction is:</li> </ul> \[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[U_t - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\] <ul> <li>The update for the one-step Sarsa method is:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\] <ul> <li>This method is called <strong>episodic semi-gradient one-step Sarsa</strong>. 
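<p>In code, the boxed one-step update can be sketched as follows (a minimal illustration assuming linear function approximation with hypothetical one-hot features; this is not the book’s pseudocode):</p>

```python
import numpy as np

def sarsa_update(w, x_sa, r, x_next_sa, alpha=0.1, gamma=0.9, terminal=False):
    """One step of episodic semi-gradient Sarsa for linear q(s,a,w) = w.T x(s,a).

    With linear features the gradient of q-hat is just x(s,a), so the update is
    w += alpha * [R + gamma * q(S',A') - q(S,A)] * x(S,A).
    """
    q_sa = w @ x_sa
    q_next = 0.0 if terminal else w @ x_next_sa
    delta = r + gamma * q_next - q_sa   # one-step TD error
    return w + alpha * delta * x_sa     # semi-gradient step

# Sanity check on a toy chain: state-action pair 0 yields reward 1 and leads
# to pair 1, which yields reward 0 and terminates the episode.
w = np.zeros(4)
x0, x1 = np.eye(4)[0], np.eye(4)[1]
for _ in range(500):
    w = sarsa_update(w, x0, r=1.0, x_next_sa=x1)
    w = sarsa_update(w, x1, r=0.0, x_next_sa=x0, terminal=True)

print(w[:2])  # approaches [1.0, 0.0]
```

<p>With one-hot features this reduces to tabular Sarsa, which makes the fixed point easy to verify by hand.</p>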
For a constant policy, this method converges in the same way that TD(0) does with the same kind of error bound.</li> <li><strong>Control</strong> = action-value prediction + policy improvement &amp; action selection:</li> </ul> \[\boxed{a, S_{t+1} \longrightarrow \hat{q}(S_{t+1}, a, \mathbf{w}_t) \longrightarrow A^*_{t+1} = \arg\max_a \hat{q}(S_{t+1}, a, \mathbf{w}_t) \longrightarrow \varepsilon\text{-greedy policy improvement} \longrightarrow \varepsilon\text{-greedy action selection}}\] <ul> <li>Linear function approximation for the action-value function is:</li> </ul> \[\hat{q}(s, a, \mathbf{w}) \doteq \mathbf{w}^T \mathbf{x}(s, a) = \sum_{i=1}^{d} w_i \cdot x_i(s, a)\] <hr /> <h2 id="102-semi-gradient-n-step-sarsa">10.2 Semi-gradient $n$-step Sarsa</h2> <ul> <li>We use an $n$-step return as the update target for episodic semi-gradient $n$-step Sarsa. The $n$-step return generalizes from its tabular form to a function approximation form:</li> </ul> \[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad t+n &lt; T\] \[\text{with } G_{t:t+n} \doteq G_t \text{ if } t+n \geq T\] <ul> <li>The $n$-step update equation is:</li> </ul> \[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t &lt; T}\] <ul> <li>Performance is best if an intermediate level of bootstrapping is used ($n &gt; 1$).</li> </ul> <hr /> <h2 id="103-average-reward-a-new-problem-setting-for-continuing-tasks">10.3 Average Reward: A New Problem Setting for Continuing Tasks</h2> <ul> <li>Average reward applies to continuing problems for goal formulation in MDPs.</li> <li>Average reward uses <strong>no discounting</strong>; the agent has the same level of care for immediate and delayed rewards.</li> <li>Average reward setting is more commonly considered in dynamic programming and less commonly 
in reinforcement learning (RL).</li> <li>The discounted setting is problematic with function approximation, hence the need for average reward to replace it.</li> <li>In the average-reward setting, the quality of a policy $\pi$ is defined as the average rate of reward, or simply <strong>average reward</strong>, while following that policy, denoted as $r(\pi)$:</li> </ul> \[\begin{align*} r(\pi) &amp;\doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\ &amp;= \lim_{t \to \infty} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\ &amp;= \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a)\, r \end{align*}\] <ul> <li>The expectations in the above equations are conditioned on the initial state $S_0$, and on the subsequent actions $A_0, A_1, \ldots, A_{t-1}$, being taken according to $\pi$.</li> <li>The 2nd and 3rd equations above hold if the MDP is <strong>ergodic</strong>, i.e., if the steady-state distribution exists and is independent of the starting state $S_0$:</li> </ul> \[\mu_\pi(s) \doteq \lim_{t \to \infty} \Pr\!\left\{S_t = s \mid A_{0:t-1} \sim \pi\right\}\] <ul> <li>In an ergodic MDP, the starting state can have only a temporary effect, but in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities.</li> <li>Ergodicity is sufficient but not necessary to guarantee the existence of the limit in the $r(\pi)$ equation above.</li> <li>It may be adequate practically to simply order policies according to their average reward per time step, otherwise called the <strong>return rate</strong>.</li> <li>All policies that reach the maximal value of $r(\pi)$ are optimal.</li> <li>The steady-state distribution $\mu_\pi$ is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution, i.e., for which:</li> </ul> \[\sum_s \mu_\pi(s) \sum_a \pi(a \vert s)\, p(s' \vert s, a) = 
\mu_\pi(s')\] <ul> <li>In the average-reward setting, returns are defined in terms of differences between rewards and the average reward; this is called the <strong>differential return</strong>:</li> </ul> \[G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \ldots\] <ul> <li>The corresponding value functions for the differential return are known as <strong>differential value functions</strong>:</li> </ul> \[\begin{aligned} v_\pi(s) &amp;\doteq \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] \\ q_\pi(s, a) &amp;\doteq \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right] \end{aligned}\] <ul> <li>Differential value functions also have Bellman equations:</li> </ul> \[\begin{aligned} v_\pi(s) &amp;= \sum_a \pi(a \vert s) \sum_{r, s'} p(s', r \vert s, a)\!\left[r - r(\pi) + v_\pi(s')\right] \\[6pt] q_\pi(s, a) &amp;= \sum_{r, s'} p(s', r \vert s, a)\!\left[r - r(\pi) + \sum_{a'} \pi(a' \vert s')\, q_\pi(s', a')\right] \\[6pt] v_{*}(s) &amp;= \max_a \sum_{r, s'} p(s', r \vert s, a)\!\left[r - \max_\pi r(\pi) + v_{*}(s')\right] \\[6pt] q_{*}(s, a) &amp;= \sum_{r, s'} p(s', r \vert s, a)\!\left[r - \max_\pi r(\pi) + \max_{a'} q_{*}(s', a')\right] \end{aligned}\] <ul> <li>The differential forms of the two TD errors (state-value and action-value) are:</li> </ul> \[\begin{aligned} \delta_t &amp;\doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\ \delta_t &amp;\doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \end{aligned}\] \[\begin{aligned} \text{where} \quad \bar{R}_t &amp;= \text{average reward } r(\pi) \text{ estimate at time } t \end{aligned}\] <ul> <li>Most of the algorithms covered so far don’t change for the average-reward setting. 
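<p>The differential TD error can be made concrete with a short sketch (assumed linear state-value features and an invented two-state cycle, both chosen only for illustration; $\bar{R}$ is tracked with its own step size, as in the differential algorithms):</p>

```python
import numpy as np

def differential_td_update(w, r_bar, x, r, x_next, alpha=0.1, beta=0.01):
    """One step of differential semi-gradient TD(0) for linear v(s,w) = w.T x(s).

    delta_t = R_{t+1} - Rbar_t + v(S_{t+1}) - v(S_t); the average-reward
    estimate Rbar_t is itself learned with step size beta.
    """
    delta = r - r_bar + w @ x_next - w @ x   # differential TD error
    w = w + alpha * delta * x                # semi-gradient step on w
    r_bar = r_bar + beta * delta             # update the average-reward estimate
    return w, r_bar

# Toy two-state cycle: s0 -> s1 yields reward 2, s1 -> s0 yields reward 0,
# so the true average reward is r(pi) = 1.
w, r_bar = np.zeros(2), 0.0
x0, x1 = np.eye(2)
for _ in range(20000):
    w, r_bar = differential_td_update(w, r_bar, x0, 2.0, x1)
    w, r_bar = differential_td_update(w, r_bar, x1, 0.0, x0)

print(round(r_bar, 2))  # approaches 1.0
```

<p>At the fixed point both differential TD errors vanish, which forces $\bar{R} \to r(\pi) = 1$ and pins down only the <em>difference</em> of the two state values, as the theory predicts.</p>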
For example, the semi-gradient Sarsa average-reward version is the same as the regular version except with the differential version of the TD error:</li> </ul> \[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\] <hr /> <h2 id="104-deprecating-the-discounted-setting">10.4 Deprecating the Discounted Setting</h2> <ul> <li>For the tabular case, the continuing, discounted problem formulation is useful, but in the approximate case, is this problem formulation necessary?</li> <li>Should we use the discounted reward or average reward in continuing tasks?</li> <li>It turns out that the average of the discounted return is proportional to the average reward.</li> <li>The ordering of all policies in the average discounted return setting would be exactly the same as in the average-reward setting.</li> <li>This idea of the <strong>futility of discounting in continuing problems</strong> can be proven by the <strong>symmetry argument.</strong> <ul> <li> <p>Let’s choose an objective that saves discounting by summing discounted values over the distribution with which states occur under the policy (where $v^\gamma_\pi \equiv$ discounted value function):</p> \[\begin{align*} J(\pi) &amp;= \sum_s \mu_\pi(s)\, v^\gamma_\pi(s) \\ &amp;= \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s'} \sum_r p(s', r \vert s, a)\!\left[r + \gamma v^\gamma_\pi(s')\right] \\ &amp;= r(\pi) + \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s'} \sum_r p(s', r \vert s, a)\, \gamma v^\gamma_\pi(s') \\ &amp;= r(\pi) + \gamma \sum_{s'} v^\gamma_\pi(s') \sum_s \mu_\pi(s) \sum_a \pi(a \vert s)\, p(s' \vert s, a) \\ &amp;= r(\pi) + \gamma \sum_{s'} v^\gamma_\pi(s')\, \mu_\pi(s') \\ &amp;= r(\pi) + \gamma J(\pi) \\ &amp;= r(\pi) + \gamma\!\left(r(\pi) + \gamma J(\pi)\right) \\ &amp;= r(\pi) + \gamma r(\pi) + \gamma^2 J(\pi) \\ &amp;= r(\pi) + \gamma r(\pi) + \gamma^2 r(\pi) + \gamma^3 r(\pi) + \gamma^4 r(\pi) + \ldots \\ &amp;= r(\pi)\!\left[1 + \gamma + \gamma^2 + 
\gamma^3 + \ldots\right] \end{align*}\] \[\hspace{-6cm} \boxed{J(\pi) = \left(\frac{1}{1-\gamma}\right) r(\pi)}\] </li> <li><em>The proposed discounted objective orders policies identically to the undiscounted (average reward) objective.</em></li> <li><em>The discount rate $\gamma$ does not influence the ordering.</em></li> </ul> </li> <li>The root cause of the difficulties with the discounted control setting is that with function approximation we have lost the policy improvement theorem.</li> <li>Now if we change the policy to improve the discounted value of one state, we are no longer guaranteed to have improved the overall policy.</li> </ul> <hr /> <h2 id="105-differential-semi-gradient-n-step-sarsa">10.5 Differential Semi-gradient $n$-step Sarsa</h2> <ul> <li>We need an $n$-step version of the TD error in order to generalize to $n$-step bootstrapping.</li> <li>Let’s generalize the $n$-step return to its differential form, with function approximation:</li> </ul> \[\boxed{G_{t:t+n} \doteq R_{t+1} - \bar{R}_{t+n-1} + \ldots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})}\] \[\begin{aligned} \text{where} \quad \bar{R} &amp;\equiv \text{an estimate of } r(\pi),\quad n \geq 1\ \&amp;\ t+n &lt; T \\ G_{t:t+n} &amp;\doteq G_t \quad \text{ if } t+n \geq T \end{aligned}\] <ul> <li>The $n$-step TD error is:</li> </ul> \[\delta_t \doteq G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w})\] <hr /> <h2 id="106-summary">10.6 Summary</h2> <ul> <li>Extended parametrized function approximation &amp; semi-gradient descent to control.</li> <li>The extension is immediate for the episodic case, but dependent on a new problem formulation based on maximizing the <strong>average reward</strong> setting per time step, for the continuing case.</li> <li>The discounted formulation cannot be carried over to control in the presence of approximations.</li> <li>Most policies cannot be represented by a value function in the approximate case.</li> <li>The scalar average reward 
$r(\pi)$ provides an effective way of ranking the remaining arbitrary policies.</li> <li>The average reward formulation involves new <strong>differential</strong> versions of value functions, Bellman equations, and TD errors, but all of these parallel the old ones and the conceptual changes are small.</li> <li>The average reward setting has a new parallel set of differential algorithms.</li> </ul> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026RLsuttonBartoCh10notes</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Sutton &amp; Barto, Ch. 10: On-Policy Control with Approximation (Personal Notes)"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Mar"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/03/09/rl-sutton-barto-notes-ch010/"</span> <span class="p">}</span> </code></pre></div></div> <hr /> Mon, 09 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch010/ https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch010/ When Your Voice Assistant Can't Hear Tones: Evaluating ASR Bias in Igbo <p>I grew up in an Igbo household in Northern Nigeria that code-switched between English, Igbo, and Hausa almost unconsciously. 
Like many bilingual Nigerians, I’ve watched voice assistants and ASR systems get better and better at English while struggling with our languages. When Meta released omniASR claiming support for over 1,600 languages including Igbo, I was curious. Does “supported” mean it actually works?</p> <p>Turns out, the answer is more complicated than I expected.</p> <h2 id="the-problem-what-does-language-support-really-mean">The Problem: What Does “Language Support” Really Mean?</h2> <p>Here’s the thing about Igbo: tone changes word meaning. The difference between “akwa” (crying), “akwà” (cloth), “àkwà” (egg), and “ákwá” (bridge) isn’t just decorative accent marks. These are completely different words that happen to have the same consonants and vowels. The tone is the difference.</p> <p>So when I saw that omniASR listed Igbo among its supported languages, I wanted to know: does it actually preserve these tonal distinctions? Or does “support” just mean “we trained on some Igbo data and hope for the best”?</p> <h2 id="the-experiment-21-audio-samples">The Experiment: 21 Audio Samples</h2> <p>I designed a simple test. Using my iPhone Voice Memos app, I recorded 21 short audio clips in different categories:</p> <p><strong>Tonal minimal pairs</strong>: I said “akwa, akwa, akwa” three times with no tone, then “akwà, akwà, akwà” three times with low tone, then “àkwà, àkwà, àkwà” with low-low tone, and finally “ákwá, ákwá, ákwá” with high-high tone. Four distinct words, each repeated three times.</p> <p><strong>Code-switching</strong>: Phrases like “The ụlọ is beautiful” where I mix English and Igbo naturally, the way we actually speak.</p> <p><strong>Place names and cultural terms</strong>: Nigerian cities, Igbo food words, proverbs. The stuff that’s probably not in training data.</p> <p><strong>The smoking gun test</strong>: I spoke a sentence with deliberately flat intonation, no tonal variation at all. 
If the model is actually listening to tone in the audio, it shouldn’t add tone marks to monotone speech.</p> <p>Then I ran everything through omniASR and compared what I actually said to what it transcribed.</p> <h2 id="the-results-75-tone-loss">The Results: 75% Tone Loss</h2> <p>The numbers were worse than I expected.</p> <p>For the tonal sample after bootstrapping, the model dropped 75.5% of the tone marks. Not just a few mistakes here and there. Three out of every four tone marks, gone.</p> <p>When I said the four different “akwa” words, the model output was: “akua akua akua akua akwa akwa akwa akua akwa ọkua ọkua ọkua”. Random variations. The semantic distinctions completely lost.</p> <p>But here’s what really convinced me the model isn’t actually listening to tones: the monotone test. I spoke “O na-eri oji n’ututu” (He eats kolanut in the morning) with flat intonation, like a robot. The model transcribed it as “ọne rị ọjí nụ tútú” and added tone marks that I never spoke.</p> <p>If the model were using acoustic information to place diacritics, it shouldn’t be adding tones to flat speech. 
This suggests it’s doing something else: probably using statistical patterns from training data to guess where diacritics should go, rather than actually hearing them.</p> <div class="callout callout--note"> <div class="callout__title"> <strong>Key Diagnostic: The Monotone Test</strong> </div> <div class="callout__body"> <p><strong>File 09:</strong> Spoke “O na-eri oji n’ututu” with FLAT intonation<br /> <strong>Expected:</strong> 0 diacritics (no tonal variation in audio)<br /> <strong>Result:</strong> Model added 7 tone marks that weren’t spoken<br /> <br /> This is evidence of <strong>orthographic bias,</strong> not acoustic perception.</p> </div> </div> <h3 id="what-the-data-shows">What the Data Shows</h3> <p>I created three visualizations to make the patterns clear.</p> <p><img src="/assets/images/2026/omniASR/fig1_loss_by_category.png" alt="loss by category" /></p> <p><strong>Figure 1</strong> shows diacritic loss by category. The tonal category (in red) jumps out immediately at 61.2% raw count loss. For comparison, the domain-specific category had only 6.3% loss. But look at the cross-lingual interference category: it’s at -38.9%, which means the model was adding diacritics that don’t exist. It’s not just dropping tones, it’s hallucinating them in the wrong places.</p> <p><img src="/assets/images/2026/omniASR/fig2_cer_vs_diacritic_loss.png" alt="char error rate vs diacritic loss" /></p> <p><strong>Figure 2</strong> plots character error rate against diacritic loss for each sample. What’s interesting here is that the tonal samples (red dots) show high diacritic loss even when the overall character error rate is moderate (20-40%). This means tone errors aren’t just a consequence of the model doing poorly in general. 
The model can get most of the characters right while still completely failing on tones specifically.</p> <p><img src="/assets/images/2026/omniASR/fig3_bootstrap_ci.png" alt="bootstrap confidence interval" /></p> <p><strong>Figure 3</strong> shows the bootstrap confidence intervals. Even with only 21 samples, the error bars don’t overlap between categories. The tonal category’s worst-case lower bound is 57.1%, which is still terrible. This confirms that what I’m seeing isn’t just noise from a small sample size.</p> <h2 id="the-statistical-story">The Statistical Story</h2> <p>I’m not a statistician, but I know enough to be careful with small sample sizes. Twenty-one samples isn’t huge. So I used bootstrap resampling (basically, randomly resampling my data 10,000 times to get confidence intervals) to make sure these effects weren’t just random noise.</p> <p>Even under the most conservative estimate (the lower bound of the 95% confidence interval), tonal diacritic loss was still 57.1%. The worst-case scenario is still terrible.</p> <p>I also created a custom metric called Diacritic Error Rate (DER) because standard Character Error Rate treats tone marks the same as spacing errors. DER specifically tracks dropped tone marks versus hallucinated tone marks. Turns out the model isn’t just dropping tones. It’s also adding tones that don’t exist, which is a whole different kind of problem.</p> <h2 id="the-categories">The Categories</h2> <p>Breaking down the errors helped me understand what’s going wrong:</p> <p><strong>Cross-lingual interference</strong>: When I spoke phrases with no tone marks at all (like names), the model added incorrect diacritics 38.9% of the time. It’s probably applying orthographic patterns from other languages.</p> <p><strong>Code-switching boundary effects</strong>: The English portions of code-switched sentences were transcribed perfectly. The Igbo portions immediately adjacent to English lost their tones. 
Something about language boundaries is disrupting processing.</p> <p><strong>Domain coverage</strong>: Culturally specific terms (place names, food words) had the best diacritic preservation at only 6.3% loss, but terrible overall accuracy. The model knows the orthography but doesn’t know the words.</p> <p><strong>Tonal collapse</strong>: 75.5% loss. This is the big one.</p> <h2 id="why-this-matters">Why This Matters</h2> <p>I keep coming back to the monotone hallucination test. If I were building a voice assistant for Igbo speakers and it’s adding tones I didn’t speak, that’s not just an accuracy problem. It’s an epistemological problem. The system is presenting confident outputs that have no acoustic basis.</p> <p>Imagine you’re dictating a text message in Igbo and the system confidently transcribes “crying” when you said “cloth.” Not just a typo you can spot and fix. A completely different word that makes semantic nonsense but looks plausible.</p> <div class="callout callout--note"> <div class="callout__title"> <strong>What 75% Tonal Loss Means</strong> </div> <div class="callout__body"> <p>75.5% bootstrap diacritic loss means:<br /> <strong>3 out of 4</strong> tone marks disappear<br /> <strong>“cloth”</strong> → could mean “crying”<br /> <strong>“egg”</strong> → meaning lost entirely<br /> <strong>“bridge”</strong> → wrong word <br /><br /> In English, this would be like dropping 75% of consonants.</p> </div> </div> <p>This isn’t just about transcription accuracy. 
It’s about whether “supporting 1,600+ languages” means anything more than “we trained on data from 1,600+ languages and didn’t check if it actually works for tonal distinctions.”</p> <h2 id="the-bigger-picture-zenos-paradox-of-low-resource-languages">The Bigger Picture: Zeno’s Paradox of Low-Resource Languages</h2> <p>There’s a paper from EMNLP 2024 that talks about “The Zeno’s Paradox of Low-Resource Languages.” The basic idea: models keep claiming to support more and more languages, but the quality asymptote never actually reaches parity with high-resource languages. We get closer and closer, but never quite there.</p> <p>Igbo is interesting because by speaker population (45 million people), it’s not low-resource. But by model performance, it clearly behaves like one. The gap between coverage (we trained on Igbo data) and competence (the model preserves linguistically meaningful distinctions) is huge.</p> <div class="callout callout--note"> <div class="callout__title"> <strong>'Supported' ≠ Works Well</strong> </div> <div class="callout__body"> <p>omniASR claims support for 1,600+ languages. Igbo has 45 million speakers, but its tonal accuracy is 24.5% (only 1 in 4 tone marks preserved).<br /><br /> <strong>Coverage</strong> (in training data) ≠ <strong>Competence</strong> (preserves meaning)</p> </div> </div> <p>This makes me think about all the other languages in that 1,600+ list. How many of them have this same gap? How many communities are using systems that confidently produce nonsense because nobody with native speaker expertise has stress-tested them?</p> <h2 id="what-i-learned">What I Learned</h2> <p><strong>Small, targeted datasets can reveal problems big datasets hide.</strong> I didn’t need thousands of hours of audio. 
Twenty-one carefully designed samples were enough to show systematic failure modes.</p> <p><strong>Native speaker expertise matters.</strong> Automated metrics can’t catch when “crying” is transcribed as “cloth” because the character error rate looks fine. You need someone who speaks the language to know that the semantic content is destroyed.</p> <p><strong>Bootstrap resampling is powerful for small samples.</strong> I was worried 21 samples was too few, but bootstrap confidence intervals let me quantify uncertainty rigorously. Even the pessimistic lower bounds showed substantial effects.</p> <p><strong>The monotone test is a better diagnostic than I expected.</strong> If diacritics are added to flat speech, that’s clear evidence of orthographic bias over acoustic conditioning. One simple test that revealed the core mechanism.</p> <h2 id="the-technical-details">The Technical Details</h2> <p>For anyone interested in replicating this:</p> <ul> <li>I used my iPhone for recording (Voice Memos app, M4A format)</li> <li>Ran inference through Google Colab with omniASR’s official pipeline</li> <li>Computed bootstrap CIs with 10,000 iterations at the utterance level</li> <li>Created a custom DER metric to separate tonal errors from general transcription errors</li> <li>All code, data, and analysis is on GitHub and HuggingFace</li> </ul> <p>The whole analysis took about half a week of evening work. Most of that was iterating on the sample design and figuring out the right statistical approach. The actual recording and inference was maybe a day.</p> <h2 id="whats-next">What’s Next</h2> <p>This is really just a proof of concept. To make stronger claims, I’d need:</p> <ul> <li>Multi-speaker evaluation (10+ speakers across different Igbo dialects)</li> <li>Acoustic analysis (F0 contour tracking to verify what’s actually in the audio)</li> <li>Comparative evaluation (does Whisper do better? 
What about Google’s USM?)</li> <li>Fine-tuning experiments (can we fix this with targeted training data?)</li> </ul> <p>I have ideas for all of these, but they’re bigger projects. For now, I’m focused on documenting the blind spot and making the methodology replicable.</p> <h2 id="why-im-sharing-this">Why I’m Sharing This</h2> <p>This started as curiosity about whether “multilingual” ASR systems actually work for the languages I grew up speaking. But it turned into something bigger.</p> <p>There’s a tendency in ML to treat “supporting” a language as a checkbox. Train on some data, add it to the model card, ship it. But languages aren’t just data. They’re how people communicate, how they think, how they preserve culture.</p> <p>When voice assistants strip tone marks from Igbo, they’re not just making transcription errors. They’re normalizing a version of the language that doesn’t preserve meaning. If every voice interface does this, what happens to how people write Igbo? Do they start thinking tone marks are optional because the AI doesn’t use them?</p> <p>I don’t know the answers to these questions. But I think they’re worth asking before we claim to “support” 1,600+ languages.</p> <h2 id="resources">Resources</h2> <p>If you want to explore the data or replicate the analysis:</p> <ul> <li><strong>Dataset:</strong> <a href="https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots">HuggingFace</a></li> <li><strong>Code:</strong> <a href="https://github.com/chizkidd/igbo-asr-tonal-evaluation">GitHub</a></li> <li><strong>Audio samples:</strong> You can actually listen to the 21 clips and see the transcription failures yourself</li> </ul> <p>The dataset is CC-BY-4.0 licensed, while the code is MIT licensed. 
If this is useful for your work, feel free to use it, cite it, and build on it.</p> <h2 id="final-thoughts">Final Thoughts</h2> <p>This project taught me something important: you don’t need massive compute or huge datasets to find meaningful problems in ML systems. You just need to know where to look and what questions to ask.</p> <p>As a native Igbo speaker, I knew what questions to ask. As someone learning ML, I knew how to design tests and interpret results. That combination turned out to be more valuable than I expected.</p> <p>If you speak a language that’s “supported” by these big multilingual models, I encourage you to test them. Record some minimal pairs. Try code-switching. See if the system actually works the way you use the language, not just the way it appears in training data.</p> <p>You might be surprised what you find.</p> <h2 id="citation">Citation</h2> <p>If you found this work helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026igboasr</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"When Your Voice Assistant Can't Hear Tones: Evaluating ASR Bias in Igbo"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Mar"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/03/04/igbo-asr-tonal-evaluation/"</span> <span class="p">}</span> </code></pre></div></div> 
Wed, 04 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/04/igbo-asr-tonal-eval/ https://chizkidd.github.io//2026/03/04/igbo-asr-tonal-eval/ Tonal Fidelity in Multilingual ASR: A Diagnostic Evaluation <p>This is a brief guide to my evaluation of tonal preservation in Facebook’s omniASR-CTC-1B Automatic Speech Recognition (ASR) model for Igbo, a tonal Niger-Congo language with 45 million speakers. The model claims support for 1,600+ languages including Igbo, but what does “support” mean when tone changes word meaning? I created 21 systematically designed audio samples, ran them through the model, and measured a 75.5% bootstrapped diacritic loss rate on tonal markers. The core finding: the model appears to generate tone marks probabilistically based on orthographic priors rather than acoustic conditioning. I cannot simplify this investigation any further.</p> <p>Where to find it: The dataset with audio is on <a href="https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots">HuggingFace</a>. The code and analysis are on <a href="https://github.com/chizkidd/igbo-asr-tonal-evaluation">GitHub</a>. The full analysis notebook is available at <a href="https://github.com/chizkidd/igbo-asr-tonal-evaluation/blob/main/analysis.ipynb">analysis.ipynb</a>.</p> <p>The following is my guide to stepping through the evaluation methodology.</p> <h2 id="the-problem">The Problem</h2> <p>In Igbo, tone is phonemic. This means tone changes word meaning, not just prosody. The difference between:</p> <ul> <li>akwa (crying)</li> <li>akwà (cloth)</li> <li>àkwà (egg)</li> <li>ákwá (bridge)</li> </ul> <p>…isn’t decorative. These are four completely different words that happen to share consonants and vowels. The tone marks (diacritics) are the only thing distinguishing them. When omniASR lists Igbo as “supported,” does it preserve these tonal distinctions? 
Or does “support” just mean “we trained on some Igbo data”?</p> <h2 id="dataset-design">Dataset Design</h2> <p>I recorded 21 audio samples using my iPhone SE Voice Memos app. Each sample targets a specific failure mode across four categories.</p> <p>The first category tests cross-lingual orthographic interference. My hypothesis was that the model applies incorrect orthographic conventions from other languages to Igbo text. I recorded five samples: personal names without tone marks, formal greetings, numbers in Igbo, well-known proverbs, and a slow prosody test. I expected 0% diacritic loss since there was nothing to lose, but observed -38.9%, meaning the model added diacritics that don’t exist.</p> <p>The second category tests phonemic tone sensitivity. The hypothesis here is that the model cannot distinguish phonemically contrastive tones. I recorded six samples including minimal pairs like akwa/akwà/àkwà/ákwá and oke/òkè/ọkè, dense tone marking, a monotone control (the key diagnostic), and two Yoruba controls. I expected low loss if the model uses acoustic information, but observed 75.5% loss with a bootstrap 95% confidence interval of [57.1%, 89.7%].</p> <p>The smoking gun is file 09. I spoke “O na-eri oji n’ututu” with deliberately flat intonation, with no tonal variation at all. The model transcribed it as “ọne rị ọjí nụ tútú” and ADDED tone marks I never spoke. If the model were using acoustics, it shouldn’t hallucinate tones on monotone speech.</p> <p>The third category tests language boundary effects from code-switching. I hypothesized that switching between English and Igbo disrupts language-specific processing. Five samples test different patterns: English embedding into Igbo, Igbo embedding into English, sentence-level alternation, diacritics in English context, and Nigerian Pidgin as a control. 
The result was 14.3% diacritic loss, with English portions transcribed perfectly while adjacent Igbo lost tone marks.</p> <p>The fourth category tests domain-specific lexical coverage. The hypothesis is that culturally specific terms outside the training distribution would struggle. I recorded Nigerian place names, Igbo food terms, long proverbs, French as a high-resource control, and background noise robustness. This category showed the best diacritic preservation at only 6.3% loss, but terrible overall accuracy with 30% character error rate, indicating word-level errors.</p> <p>The data looks like this (metadata.csv):</p> <pre><code class="language-csv">file_name,ground_truth,model_output,category,character_error_rate,diacritics_expected,diacritics_produced 06_tonal_akwa.m4a,"akwa, akwa, akwa. Akwà, akwà, akwà...","akua akua akua akua akwa akwa...",tonal_diacritics,0.583,12,3 09_tonal_flat.m4a,"O na-eri oji n'ututu","ọne rị ọjí nụ tútú",tonal_diacritics,0.744,0,7 ... </code></pre> <h2 id="model-inference">Model Inference</h2> <p>I used omniASR’s official inference pipeline:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">omnilingual_asr.models.inference.pipeline</span> <span class="kn">import</span> <span class="n">ASRInferencePipeline</span> <span class="n">pipeline</span> <span class="o">=</span> <span class="n">ASRInferencePipeline</span><span class="p">(</span><span class="n">model_card</span><span class="o">=</span><span class="s">"omniASR_CTC_1B"</span><span class="p">)</span> <span class="n">transcription</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">transcribe</span><span class="p">(</span> <span class="n">inp</span><span class="o">=</span><span class="p">[</span><span class="s">"data/audio/06_tonal_akwa.m4a"</span><span class="p">],</span> <span class="n">lang</span><span class="o">=</span><span 
class="p">[</span><span class="s">"ibo_Latn"</span><span class="p">]</span> <span class="p">)</span> </code></pre></div></div> <p>The model has 975 million parameters and uses a CTC-based ASR architecture with a wav2vec2-style encoder and CTC head. It was trained on multilingual data covering over 1,600 languages and released on November 14, 2025.</p> <p>For each audio file, I extracted:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ground_truth</span> <span class="o">=</span> <span class="s">"akwa, akwa, akwa. Akwà, akwà, akwà. Àkwà, àkwà, àkwà. Ákwá, ákwá, ákwá."</span> <span class="n">model_output</span> <span class="o">=</span> <span class="n">transcription</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">'transcription'</span><span class="p">]</span> <span class="c1"># Compare and compute metrics </span></code></pre></div></div> <h2 id="metrics">Metrics</h2> <p>Standard Character Error Rate (CER) conflates spacing errors with tonal errors. 
I defined a custom metric:</p> <h3 id="diacritic-error-rate-der">Diacritic Error Rate (DER)</h3> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">diacritic_error_rate</span><span class="p">(</span><span class="n">ground_truth</span><span class="p">,</span> <span class="n">model_output</span><span class="p">):</span> <span class="n">E</span> <span class="o">=</span> <span class="n">count_diacritics</span><span class="p">(</span><span class="n">ground_truth</span><span class="p">)</span> <span class="c1"># expected </span> <span class="n">P</span> <span class="o">=</span> <span class="n">count_diacritics</span><span class="p">(</span><span class="n">model_output</span><span class="p">)</span> <span class="c1"># produced </span> <span class="n">D</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">E</span> <span class="o">-</span> <span class="n">P</span><span class="p">)</span> <span class="c1"># dropped </span> <span class="n">H</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">P</span> <span class="o">-</span> <span class="n">E</span><span class="p">)</span> <span class="c1"># hallucinated </span> <span class="k">return</span> <span class="p">(</span><span class="n">D</span> <span class="o">+</span> <span class="n">H</span><span class="p">)</span> <span class="o">/</span> <span class="n">E</span> <span class="k">if</span> <span class="n">E</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">else</span> <span class="mi">0</span> <span class="k">def</span> <span class="nf">count_diacritics</span><span class="p">(</span><span class="n">text</span><span class="p">):</span> <span class="n">diacritics</span> <span class="o">=</span> <span 
class="nb">set</span><span class="p">(</span><span class="s">'ụọịàèìòùáéíóúẹṣ'</span><span class="p">)</span> <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">text</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">diacritics</span><span class="p">)</span> </code></pre></div></div> <p>DER isolates tone-related failures:</p> <table> <thead> <tr> <th>Metric</th> <th>Formula</th> <th>What it captures</th> </tr> </thead> <tbody> <tr> <td>CER</td> <td>Levenshtein distance / length</td> <td>All character errors</td> </tr> <tr> <td>RDD (Raw Drop Rate)</td> <td>dropped / expected</td> <td>Only missing tone marks</td> </tr> <tr> <td>DER</td> <td>(dropped + hallucinated) / expected</td> <td>Total tonal deviation</td> </tr> </tbody> </table> <p>Note that DER can exceed 100% when hallucinations are substantial, because the denominator reflects ground truth expectations, not produced output.</p> <h2 id="bootstrap-uncertainty">Bootstrap Uncertainty</h2> <p>With N=21 samples, I needed to quantify uncertainty. 
I used bootstrap resampling:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">bootstrap_ci</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">stat_fn</span><span class="p">,</span> <span class="n">n_boot</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">ci</span><span class="o">=</span><span class="mf">0.95</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">):</span> <span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="c1"># Point estimate </span> <span class="n">point</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">stat_fn</span><span class="p">(</span><span class="n">data</span><span class="p">))</span> <span class="c1"># Bootstrap resampling </span> <span class="n">boots</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">empty</span><span class="p">(</span><span class="n">n_boot</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_boot</span><span class="p">):</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="n">integers</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span 
class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span> <span class="n">boots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">stat_fn</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">idx</span><span class="p">]))</span> <span class="c1"># Percentile CI </span> <span class="n">alpha</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">ci</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span> <span class="n">lo</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">quantile</span><span class="p">(</span><span class="n">boots</span><span class="p">,</span> <span class="n">alpha</span><span class="p">))</span> <span class="n">hi</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">quantile</span><span class="p">(</span><span class="n">boots</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">alpha</span><span class="p">))</span> <span class="k">return</span> <span class="p">(</span><span class="n">point</span><span class="p">,</span> <span class="n">lo</span><span class="p">,</span> <span class="n">hi</span><span class="p">)</span> </code></pre></div></div> <p>Bootstrap resampling occurs at the <strong>utterance level</strong>, not event level. This matters because diacritic distribution is uneven across samples. Some files have 0 expected tone marks, others have 12. 
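<p>To see why the mean of per-utterance rates can sit above the pooled raw rate, here is a toy sketch. The per-file counts below are invented for illustration; they are not the real dataset.</p>

```python
import numpy as np

# Invented per-utterance (expected, dropped) diacritic counts -- illustration only
counts = [(12, 9), (4, 4), (1, 1), (20, 5), (8, 7)]

# Pooled raw rate: total dropped / total expected
pooled = sum(d for _, d in counts) / sum(e for e, _ in counts)  # 26/45, about 57.8%

# Utterance-level bootstrap: resample whole utterances, then average their rates
rng = np.random.default_rng(42)
rates = np.array([d / e for e, d in counts])
boots = np.array([rates[rng.integers(0, len(rates), len(rates))].mean()
                  for _ in range(10_000)])

print(f"pooled raw rate: {pooled:.1%}")
print(f"bootstrap mean:  {boots.mean():.1%}")  # near the utterance mean, above pooled
print(f"95% CI: [{np.quantile(boots, 0.025):.1%}, {np.quantile(boots, 0.975):.1%}]")
```

<p>A file with only one expected diacritic counts as much as a diacritic-dense file under utterance-level resampling, which is what pulls the bootstrap mean away from the pooled ratio.</p>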
Resampling utterances captures this variability.</p> <p>Example result:</p> <ul> <li>Raw count: 30/49 = 61.2% drop rate</li> <li>Bootstrap mean: 75.5%</li> <li>95% CI: [57.1%, 89.7%]</li> </ul> <p>The bootstrap mean exceeds the raw percentage because resampling at utterance level gives more weight to samples with extreme loss rates. Both values are reported for transparency.</p> <div class="callout callout--note"> <div class="callout__title"> <strong>Why Bootstrap Matters</strong> </div> <div class="callout__body"> <p>With only 21 samples, we need uncertainty quantification. Bootstrap resampling (10,000 iterations) shows: <strong>Worst-case lower bound:</strong> 57.1%<br /> <strong>Even pessimistically,</strong> loss is still &gt;50%<br /> <strong>Not</strong> a small-sample fluke</p> </div> </div> <h2 id="results">Results</h2> <h3 id="quantitative-summary">Quantitative Summary</h3> <table> <thead> <tr> <th>Category</th> <th>Samples</th> <th>Diacritic Loss</th> <th>Avg CER</th> </tr> </thead> <tbody> <tr> <td><strong>Phonemic Tone Sensitivity</strong></td> <td>6</td> <td><strong>75.5%</strong></td> <td>50.6%</td> </tr> <tr> <td>Cross-lingual Interference</td> <td>5</td> <td>-38.9%</td> <td>28.8%</td> </tr> <tr> <td>Domain-Specific Coverage</td> <td>5</td> <td>6.3%</td> <td>30.1%</td> </tr> <tr> <td>Language Boundary Effects</td> <td>5</td> <td>14.3%</td> <td>20.0%</td> </tr> <tr> <td><strong>Overall</strong></td> <td><strong>21</strong></td> <td><strong>26.8%</strong></td> <td><strong>32.5%</strong></td> </tr> </tbody> </table> <h3 id="bootstrap-confidence-intervals">Bootstrap Confidence Intervals</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tonal category: 75.5% (95% CI: [57.1%, 89.7%]) Overall: 52.6% (95% CI: [30.3%, 69.7%]) </code></pre></div></div> <p>Even under the worst-case lower bound (57.1%), tonal diacritic loss remains severe.</p> <h3 id="visualizations">Visualizations</h3> <p><img 
src="/assets/images/2026/omniASR/fig1_loss_by_category.png" alt="loss by category" /> Bar chart showing 61.2% raw count loss for the tonal category (red), with negative values indicating diacritic hallucination (script interference).</p> <p><img src="/assets/images/2026/omniASR/fig2_cer_vs_diacritic_loss.png" alt="char error rate vs diacritic loss" /> Scatter plot showing tonal samples (red) have high diacritic loss even when CER is moderate.</p> <p><img src="/assets/images/2026/omniASR/fig3_bootstrap_ci.png" alt="bootstrap confidence interval" /> Forest plot showing 95% CIs for each category, with a 50% threshold line.</p> <h2 id="example-tonal-minimal-pairs">Example: Tonal Minimal Pairs</h2> <p>File 06 is the clearest demonstration:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input (what I said): "akwa, akwa, akwa. Akwà, akwà, akwà. Àkwà, àkwà, àkwà. Ákwá, ákwá, ákwá."
Model output:        "akua akua akua akua akwa akwa akwa akua akwa ọkua ọkua ọkua"

Expected diacritics: 12
Produced diacritics: 3
Loss rate: 75%
</code></pre></div></div> <p>The four distinct words collapsed into random variations. From a linguistic perspective, this is catastrophic. The word akwà meaning cloth got transcribed as akwa, which could mean crying instead. The word àkwà meaning egg got transcribed as akwa, and the meaning is completely lost. The word ákwá meaning bridge got transcribed as akua, which is wrong both in word and tone.</p> <h2 id="the-monotone-test">The Monotone Test</h2> <p>File 09 is my favorite diagnostic. Setup:</p> <ul> <li>Spoke “O na-eri oji n’ututu” (He eats kolanut in the morning)</li> <li>Deliberately flat intonation, like a robot</li> <li>Zero tonal variation in the audio</li> </ul> <p>If the model uses acoustic information to place diacritics, it should produce few or no tone marks on flat speech. 
Result:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ground truth: "O na-eri oji n'ututu"  (0 diacritics)
Model output: "ọne rị ọjí nụ tútú"  (7 diacritics)
</code></pre></div></div> <p>The model ADDED tone marks I never spoke. This is clear evidence of orthographic bias over acoustic conditioning. The model is using statistical patterns from training data to guess where diacritics should go, not listening to the audio.</p> <h2 id="statistical-analysis">Statistical Analysis</h2> <h3 id="hypothesis-testing">Hypothesis Testing</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Null hypothesis (H0): Diacritic loss in tonal category ≤ other categories
Alternative (H1):     Tonal category shows higher loss
Test: Bootstrap confidence intervals (10,000 iterations, 95% CI)

Result: The tonal bootstrap mean (75.5%) substantially exceeds every other
category; the next closest, script hallucination at 38.9%, leaves the tonal
point estimate nearly 2x higher.

Conclusion: Tonal degradation shows the highest loss rate of any category.
Its confidence interval overlaps somewhat with script hallucination's due
to the small sample size (N=21), but the effect size is large and
consistent across resamples.
</code></pre></div></div> <h3 id="robustness-check">Robustness Check</h3> <p>Even under worst-case assumptions using the lower bound of the confidence interval, tonal loss remains at 57.1%, which is still greater than 50%. Overall loss stays at 30.3%, which is still substantial. This suggests the observed tonal degradation is unlikely to be driven solely by sampling variability.</p> <h2 id="code">Code</h2> <p>The full analysis is in <code class="language-plaintext highlighter-rouge">analysis.ipynb</code>. 
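<p>To make the metric concrete, here is a minimal count-based sketch of the DER definition from the Metrics section (the count-taking signature is my simplification; the real function works on strings), applied to the reported numbers for file 06:</p>

```python
def diacritic_error_rate(expected: int, produced: int) -> float:
    """Count-based sketch of the DER from the Metrics section."""
    dropped = max(0, expected - produced)       # tone marks the model lost
    hallucinated = max(0, produced - expected)  # tone marks it invented
    return (dropped + hallucinated) / expected if expected > 0 else 0.0

# File 06 (tonal minimal pairs): 12 diacritics expected, 3 produced
print(f"{diacritic_error_rate(12, 3):.0%}")  # 75% -- matches the reported loss rate
# Hallucination-heavy case
print(f"{diacritic_error_rate(2, 5):.0%}")   # 150%
```

<p>The second call illustrates the note above that DER can exceed 100% when hallucinations dominate, since the denominator is the expected count.</p>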
The core evaluation functions handle diacritic counting, character error rate calculation, and bootstrap resampling. Diacritic counting uses a set of Igbo tone mark characters and counts occurrences in the text. Character error rate is computed using Python’s SequenceMatcher for character-level similarity. Bootstrap resampling runs 10,000 iterations on the tonal diacritics category to compute confidence intervals.</p> <p>All evaluation code is organized in the <code class="language-plaintext highlighter-rouge">src/</code> directory. The <code class="language-plaintext highlighter-rouge">evaluate.py</code> module contains metrics like DER and bootstrap confidence intervals. The <code class="language-plaintext highlighter-rouge">visualize.py</code> module has plotting functions for all three figures. The <code class="language-plaintext highlighter-rouge">utils.py</code> module handles data loading and validation.</p> <h2 id="run-it">Run it</h2> <p>Clone the repository and reproduce:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/chizkidd/igbo-asr-tonal-evaluation.git <span class="nb">cd </span>igbo-asr-tonal-evaluation pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt jupyter notebook analysis.ipynb </code></pre></div></div> <p>Or run in Google Colab: <a href="https://colab.research.google.com/github/chizkidd/igbo-asr-tonal-evaluation/blob/main/analysis.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p> <p>The notebook takes about 5-10 minutes to run on Colab with a T4 GPU. You’ll see the analysis output:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loading metadata... Total samples: 21 Categories: 4 Computing metrics... 
Overall DER: 26.8% Tonal category: 75.5% Script interference: -38.9% Code-switching: 14.3% Domain-specific: 6.3% Bootstrap resampling (10,000 iterations)... Tonal diacritics: 75.5% [57.1%, 89.7%] Overall: 52.6% [30.3%, 69.7%] Generating visualizations... Saved: results/visualizations/fig1_loss_by_category.png Saved: results/visualizations/fig2_cer_vs_loss.png Saved: results/visualizations/fig3_bootstrap_ci.png </code></pre></div></div> <div class="callout callout--note"> <div class="callout__title"> <strong>Reproducibility</strong> </div> <div class="callout__body"> <p><strong>Model:</strong> omniASR-CTC-1B (975M params)<br /> <strong>Data:</strong> 21 samples, 4 categories<br /> <strong>Metrics:</strong> Custom DER (Diacritic Error Rate)<br /> <strong>Stats:</strong> Bootstrap with utterance-level resampling<br /> <strong>Code:</strong> github.com/chizkidd/igbo-asr-tonal-evaluation</p> </div> </div> <h2 id="scope-and-limitations">Scope and Limitations</h2> <p>This study demonstrates three things. First, systematic diacritic loss in omniASR on Igbo across 21 controlled samples. Second, failure to preserve tonal minimal pairs in this evaluation setup. Third, diacritic hallucination on monotone speech, which is evidence of orthographic bias.</p> <p>This study does not claim four things. It doesn’t claim universal failure on all Igbo speech. It doesn’t claim that tone modeling is architecturally absent from the model. It doesn’t claim that Igbo is uniquely disadvantaged compared to all other low-resource languages. And it doesn’t claim that the observed error rates generalize to all dialects or all speakers.</p> <p>What would strengthen these claims? Multi-speaker evaluation with 10+ speakers across different dialects. Acoustic analysis with F0 contour extraction and pitch tracking validation. Comparative evaluation on other models like Whisper, MMS, USM, and Azure Speech. 
And controlled resynthesis experiments that isolate acoustic factors from lexical priors.</p> <div class="callout callout--note"> <div class="callout__title"> <strong>Future Work</strong> </div> <div class="callout__body"> <p><strong>Current:</strong> Single speaker, 21 samples (proof-of-concept)<br /> <strong>Next:</strong> 200 samples, 10+ speakers, 5 dialects<br /> <strong>Then:</strong> Comparative evaluation (Whisper, MMS, Azure)<br /> <strong>Finally:</strong> Fine-tuning intervention with tone-annotated data</p> </div> </div> <h2 id="real-production-systems">Real Production Systems</h2> <p>Between this evaluation and a production-grade ASR fairness audit, there is a long list of things that change:</p> <p><strong>Data.</strong> Instead of 21 samples, production evaluations use thousands of hours across multiple speakers, dialects, ages, and recording conditions.</p> <p><strong>Speakers.</strong> Instead of single-speaker, you need balanced sampling across: dialects (Owerri, Onitsha, Enugu, Nsukka, Afikpo), gender, age ranges, native vs. L2 speakers.</p> <p><strong>Acoustic analysis.</strong> Instead of just comparing transcriptions, you need F0 (fundamental frequency) tracking to verify what’s actually in the audio. Praat or similar tools extract pitch contours frame-by-frame.</p> <p><strong>Comparative evaluation.</strong> Instead of one model, you audit multiple: Whisper (OpenAI), MMS (Meta), USM (Google), Azure Speech (Microsoft). This isolates whether the problem is specific to omniASR or universal.</p> <p><strong>Fine-tuning experiments.</strong> You collect tone-annotated Igbo data (50-100 hours), fine-tune the model, and measure pre/post accuracy. This tests whether the problem is architectural or just data scarcity.</p> <p><strong>Real-world deployment.</strong> You partner with Nigerian developers building voice assistants and measure downstream impact: do users trust ASR that strips tones? 
Does it affect adoption?</p> <p>All of these are important, but if you understand this 21-sample evaluation, you understand the diagnostic methodology.</p> <h2 id="faq">FAQ</h2> <p><strong>Why only 21 samples?</strong> This is a proof-of-concept for blind spot discovery. Large datasets measure prevalence; small targeted datasets reveal failure modes. I prioritized depth (systematic coverage of error types) over breadth (statistical power).</p> <p><strong>Is 75.5% loss generalizable?</strong> Not necessarily. This is the loss rate on my voice, my dialect, my recording setup, for these specific test cases. Multi-speaker evaluation would give population estimates.</p> <p><strong>Why not use Word Error Rate?</strong> WER measures whole-word accuracy. In Igbo, “akwa” vs “akwà” counts as correct by WER (same word, different tone), but semantically these are different words. Diacritic-specific metrics capture what WER misses.</p> <p><strong>Does the model “understand” Igbo?</strong> That’s philosophical. Mechanically: it learned statistical patterns from training data. Whether assigning probability distributions to tokens constitutes “understanding” is up to you.</p> <p><strong>Why does the bootstrap mean exceed the raw percentage?</strong> Bootstrap resamples at utterance level. Samples with extreme loss rates (e.g., file 09 with 0 expected, 7 hallucinated) get resampled more in some iterations, pulling the mean up. This reflects uncertainty about which utterances are “typical.”</p> <p><strong>What’s next?</strong> Collect a 200-sample multi-speaker dataset across 5 Igbo dialects. After that: comparative model evaluation (Whisper vs MMS vs omniASR) and fine-tuning experiments with tone-annotated data.</p> <h2 id="why-this-matters">Why This Matters</h2> <p>There’s a tendency in ML to treat “supporting” a language as a checkbox. Add it to the model card, ship it. But Igbo has 45 million speakers. 
When ASR systems strip tone marks, they normalize a version of the language that doesn’t preserve meaning.</p> <p>If every voice interface does this, what happens to how people write Igbo? Do they internalize that tone marks are optional because the AI doesn’t use them? I don’t know, but these are questions worth asking before claiming to “support” 1,600+ languages.</p> <h2 id="resources">Resources</h2> <p>The dataset is available on <a href="https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots">HuggingFace</a>. The code is on <a href="https://github.com/chizkidd/igbo-asr-tonal-evaluation">GitHub</a>. The model evaluated is <a href="https://huggingface.co/facebook/omniASR-CTC-1B">facebook/omniASR-CTC-1B</a> on HuggingFace. The dataset is licensed under CC-BY-4.0 and the code under MIT. Feel free to use it, cite it, and build on it.</p> <h2 id="citation">Citation</h2> <p>If you found this evaluation helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026tonalevaluation</span><span class="p">,</span>
  <span class="na">title</span> <span class="p">=</span> <span class="s">"Tonal Fidelity in Multilingual ASR: A Diagnostic Evaluation"</span><span class="p">,</span>
  <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span>
  <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span>
  <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span>
  <span class="na">month</span> <span class="p">=</span> <span class="s">"Mar"</span><span class="p">,</span>
  <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/03/01/tonal-fidelity-multilingual-asr/"</span>
<span class="p">}</span> 
</code></pre></div></div> <p>For the dataset:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">obasi2026igbodataset</span><span class="p">,</span> <span class="na">title</span><span class="p">=</span><span class="s">{Igbo Blind Spot Dataset for omniASR-CTC-1B: Systematic Evaluation of Tonal Diacritic Loss}</span><span class="p">,</span> <span class="na">author</span><span class="p">=</span><span class="s">{Obasi, Chizoba}</span><span class="p">,</span> <span class="na">year</span><span class="p">=</span><span class="s">{2026}</span><span class="p">,</span> <span class="na">publisher</span><span class="p">=</span><span class="s">{HuggingFace}</span><span class="p">,</span> <span class="na">howpublished</span><span class="p">=</span><span class="s">{\url{https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots}}</span><span class="p">,</span> <span class="na">note</span><span class="p">=</span><span class="s">{Model evaluated: facebook/omniASR-CTC-1B (975M parameters)}</span> <span class="p">}</span> </code></pre></div></div> Sun, 01 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/01/tonal-fidelity-multilingual-asr/ https://chizkidd.github.io//2026/03/01/tonal-fidelity-multilingual-asr/ Sutton & Barto, Ch. 09: On-Policy Prediction with Approximation (Personal Notes) <ul> <li>Study of <strong>function approximation</strong> in RL by considering its use in estimating the state-value function from on-policy data, i.e. 
in approximating $v_\pi$ from experience generated using a known policy $\pi$.</li> <li>The approximate value function is represented as a parameterized functional form with <strong>weight vector</strong> $\mathbf{w} \in \mathbb{R}^d$:</li> </ul> \[\hat{v}(s, \mathbf{w}) \approx v_\pi(s)\] <ul> <li>The function above is for the approximate value of state $s$ given weight vector $\mathbf{w}$.</li> <li>$\hat{v}$ might be a linear function, a multi-layer artificial neural network, or a decision tree.</li> <li>Extending RL to function approximation makes it applicable to <strong>partially observable problems</strong>, in which the full state is unavailable to the agent.</li> <li>Function approximation cannot augment the state representation with memories of past observations.</li> </ul> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ul> <li><a href="#91-value-function-approximation">9.1 Value-Function Approximation</a></li> <li><a href="#92-the-prediction-objective-overlinetextve">9.2 The Prediction Objective (VE)</a></li> <li><a href="#93-stochastic-gradient--semi-gradient-methods">9.3 Stochastic-Gradient &amp; Semi-Gradient Methods</a></li> <li><a href="#94-linear-methods">9.4 Linear Methods</a></li> <li><a href="#95-feature-construction-for-linear-methods">9.5 Feature Construction for Linear Methods</a> <ul> <li><a href="#951-polynomials">9.5.1 Polynomials</a></li> <li><a href="#952-fourier-basis">9.5.2 Fourier Basis</a></li> <li><a href="#953-coarse-coding">9.5.3 Coarse Coding</a></li> <li><a href="#954-tile-coding">9.5.4 Tile Coding</a></li> <li><a href="#955-radial-basis-functions-rbf">9.5.5 Radial Basis Functions (RBF)</a></li> </ul> </li> <li><a href="#96-selecting-step-size-parameters-manually">9.6 Selecting Step-Size Parameters Manually</a></li> <li><a href="#97-nonlinear-function-approximation-artificial-neural-networks-anns">9.7 Nonlinear Function Approximation: Artificial Neural Networks (ANNs)</a></li> <li><a 
href="#98-least-squares-td-lstd">9.8 Least-Squares TD (LSTD)</a></li> <li><a href="#99-memory-based-function-approximation">9.9 Memory-based Function Approximation</a></li> <li><a href="#910-kernel-based-function-approximation">9.10 Kernel-based Function Approximation</a></li> <li><a href="#911-looking-deeper-at-on-policy-learning-interest--emphasis">9.11 Looking Deeper at On-Policy Learning: Interest &amp; Emphasis</a></li> <li><a href="#912-summary">9.12 Summary</a></li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li><a href="#citation">Citation</a></li> </ul> <hr /> <h2 id="91-value-function-approximation">9.1 Value-Function Approximation</h2> <ul> <li>All the prediction methods covered so far involve an update to an estimated value function that shifts its value towards a “backed-up value” (update target, $u$). Let’s denote an individual state update by $s \mapsto u$:</li> </ul> \[\begin{align*} \text{Monte-Carlo (MC):} \quad &amp; S_t \mapsto G_t \\ \text{TD(0):} \quad &amp; S_t \mapsto R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) \\ n\text{-step TD:} \quad &amp; S_t \mapsto G_{t:t+n} \\ \text{Dynamic Programming (DP):} \quad &amp; s \mapsto \mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) \,\middle\vert\, S_t = s\right] \end{align*}\] <ul> <li>Each update is interpretable as an example of the desired input-output behaviour of the value function. 
<ul> <li>Until now, value function updates have been made in the tabular setting, where an update is trivial: the estimated values of all other states are left unchanged.</li> <li>With arbitrarily complex and sophisticated update methods, an update at $s$ generalizes, so the estimated values of many other states change as well.</li> <li>Machine learning methods that learn to mimic input-output examples in this way are called <strong>supervised learning</strong> methods, and when the outputs are numbers, like $u$, the process is called <strong>function approximation</strong>.</li> </ul> </li> <li>Function approximation methods expect to receive examples of the desired input-output behavior of the function they are trying to approximate. <ul> <li>We view each update as a conventional training example.</li> <li>This allows us to use any of a wide range of existing function approximation methods for value prediction.</li> <li>Whatever function approximation method is chosen needs to be able to perform <strong>online learning</strong>.</li> </ul> </li> </ul> <hr /> <h2 id="92-the-prediction-objective-overlinetextve">9.2 The Prediction Objective ($\overline{\text{VE}}$)</h2> <ul> <li>We have more states $s$ than weights $\mathbf{w}$, so we cannot feasibly approximate the value function perfectly.</li> <li>Making one state’s estimate more accurate invariably makes the others’ less accurate.</li> <li>Therefore, it is necessary to define which states we care most about, based on a <strong>state distribution</strong>, $\mu(s)$, that represents how much we care about the error in each state $s$.</li> </ul> <!-- $$\text{state distribution, } \hspace{1em} \mu(s) \geq 0, \sum_s \mu(s) = 1$$ --> \[\mu(s) \geq 0, \quad \sum_s \mu(s) = 1\] <ul> <li>Weighting the error in each state $s$, the difference between the approximate value $\hat{v}(s, \mathbf{w})$ and the true value $v_\pi(s)$, by $\mu$ over the state space yields a 
natural objective function called the <strong>mean squared value error</strong>, denoted by $\overline{\text{VE}}$:</li> </ul> \[\boxed{\overline{\text{VE}}(\mathbf{w}) \doteq \sum_{s \in S} \mu(s) \left[v_\pi(s) - \hat{v}(s, \mathbf{w})\right]^2}\] <ul> <li>$\sqrt{\overline{\text{VE}}}$ tells us roughly how much the approximate values differ from the true values.</li> <li>Often $\mu$ is chosen to be the fraction of time spent in $s$, called the <strong>on-policy distribution</strong> under on-policy training.</li> </ul> <h3 id="on-policy-distribution">On-policy Distribution</h3> <ol> <li> <strong>Continuing tasks</strong>: the on-policy distribution is the stationary distribution under $\pi$: $$\mu(s) = \sum_{s'} \mu(s') \sum_a \pi(a \vert s')\, p(s \vert s', a), \quad \forall s \in S$$ $$ \begin{aligned} \text{where} \\ \mu(s) &amp;= \text{stationary probability of being in state } s \text{ under } \pi \\ s' &amp;= \text{preceding state} \end{aligned} $$ <div class="callout callout--note"> <div class="callout__title"> <strong>Balance Equation</strong> </div> <div class="callout__body"> <p>The probability of being in state $s$ equals the sum over all ways of arriving in $s$ from any previous state $s’$ under policy $\pi$.</p> </div> </div> </li> <li> <strong>Episodic tasks</strong>: it depends on how the initial states of episodes are chosen: $$\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \vert \bar{s})\, p(s \vert \bar{s}, a), \quad \forall s \in S$$ $$ \begin{aligned} \text{where} \\ h(s) &amp;= \text{probability that an episode begins in each state } s \\ \eta(s) &amp;= \text{number of time steps spent, on average, in state } s \text{ in a single episode} \\ \bar{s} &amp;= \text{preceding state} \end{aligned} $$ <div class="callout callout--note"> <div class="callout__title"> <strong>Visitation Equation</strong> </div> <div class="callout__body"> <p>The expected number of visits to state $s$ equals the probability of starting in state $s$ plus the 
expected number of visits to all preceding states $s’$ that transition into $s$ under policy $\pi$.</p> </div> </div> <ul> <li>This system of equations can be solved for the expected number of visits $\eta(s)$.</li> <li>The on-policy distribution is the fraction of time spent in each state, normalized to sum to 1:</li> </ul> $$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}, \quad \forall s \in S$$ <ul> <li>If discounting exists, then we redefine $\eta(s)$:</li> </ul> $$\eta(s) = h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \vert \bar{s})\, p(s \vert \bar{s}, a), \quad \forall s \in S$$ $$ \begin{aligned} \text{where} \ \gamma &amp;= \text{discount factor } \end{aligned} $$ </li> </ol> <h3 id="performance-objective">Performance Objective</h3> <ul> <li>Although $\overline{\text{VE}}$ is a good starting point, it’s not completely clear that it is the right performance objective.</li> <li>The <strong>ultimate goal</strong> (reason for learning a value function) is to find a better policy.</li> <li>If we use $\overline{\text{VE}}$, the goal is to find a <strong>global optimum</strong> (optimal weight vector, $\mathbf{w}^*$):</li> </ul> \[\mathbf{w}^* \hspace{.5em} \text{for which} \hspace{.75em} \overline{\text{VE}}(\mathbf{w}^*) \leq \overline{\text{VE}}(\mathbf{w}), \quad \forall \mathbf{w}\] <ul> <li>Complex function approximators may converge to a <strong>local optimum</strong>:</li> </ul> \[\mathbf{w}^* \hspace{.5em} \text{for which} \hspace{.75em} \overline{\text{VE}}(\mathbf{w}^*) \leq \overline{\text{VE}}(\mathbf{w}), \quad \forall \mathbf{w} \hspace{.5em} \text{in some neighborhood of} \hspace{.5em} \mathbf{w}^*\] <hr /> <h2 id="93-stochastic-gradient--semi-gradient-methods">9.3 Stochastic-Gradient &amp; Semi-Gradient Methods</h2> <h3 id="stochastic-gradient-descent-sgd">Stochastic Gradient Descent (SGD)</h3> <ul> <li>SGD methods are among the most widely used of all function approximation methods and are well suited to online RL.</li> <li>The function 
approximator is parameterized by a weight vector with a fixed number of real components (column vector):</li> </ul> \[\mathbf{w} \doteq (w_1, w_2, w_3, \ldots, w_d)^T\] <ul> <li>In SGD, we update the weight vector at each time step by moving it in the direction that minimizes the error most quickly for the example shown:</li> </ul> \[\mathbf{w}_{t+1} \doteq \mathbf{w}_t - \frac{1}{2}\alpha \nabla\!\left[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\right]^2\] \[\boxed{ \mathbf{w}_{t+1}= \mathbf{w}_t + \alpha\!\left[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)}\] \[\begin{aligned} \text{where} \\ \hat{v}(s, \mathbf{w}) &amp;= \text{approximate value function (differentiable in } \mathbf{w},\ \forall s \in S) \\ \alpha &amp;= \text{positive step-size parameter} \\ \nabla f(\mathbf{w}) &amp;= \text{column vector of partial derivatives of } f(\mathbf{w}) \text{ WRT components of } \mathbf{w} \end{aligned}\] <ul> <li>The equation for $\nabla f(\mathbf{w})$, the gradient of $f$ WRT $\mathbf{w}$, is:</li> </ul> \[\nabla f(\mathbf{w}) \doteq \left(\frac{\partial f(\mathbf{w})}{\partial w_1},\ \frac{\partial f(\mathbf{w})}{\partial w_2},\ \frac{\partial f(\mathbf{w})}{\partial w_3},\ \ldots,\ \frac{\partial f(\mathbf{w})}{\partial w_d}\right)^T\] <ul> <li>SGD methods are <strong><em>“gradient descent”</em></strong> methods because the overall step in $\mathbf{w}_t$ is proportional to the negative gradient of the example’s squared error. 
This is the direction in which the error falls most rapidly.</li> <li>Gradient descent methods are called <strong><em>“stochastic”</em></strong> because the update is done on only a single example, which might have been selected stochastically.</li> <li>If $\alpha$ decreases as expected in satisfaction of the standard stochastic approximation conditions, then SGD is guaranteed to converge to a local optimum.</li> </ul> <h3 id="true-value-estimates">True Value Estimates</h3> <ul> <li>When the true value function $v_\pi(S_t)$ is unknown, we can approximate it by substituting $U_t$ in place of $v_\pi(S_t)$.</li> <li>This yields the following general SGD method for state-value prediction:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[U_t - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)}\] <ul> <li>If $U_t$ is an <strong><em><u>unbiased</u></em></strong> estimate, that is, if \(\mathbb{E}[U_t \vert S_t = s] = v_\pi(s)\) for each $t$, then $\mathbf{w}_t$ is guaranteed to converge to a local optimum under the usual stochastic approximation conditions for decreasing $\alpha$.</li> <li>An example of an unbiased estimator is the <strong>Monte Carlo</strong> estimate for state $S_t$:</li> </ul> \[U_t = G_t\] \[\mathbf{w} \leftarrow \mathbf{w} + \alpha\!\left[G_t - \hat{v}(S_t, \mathbf{w})\right] \nabla \hat{v}(S_t, \mathbf{w}), \quad \alpha &gt; 0\] <ul> <li><strong>Bootstrapping methods</strong> are <strong><em><u>biased</u></em></strong> in that their target is dependent on the current value of the weights $\mathbf{w}$. 
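The general SGD update above can be sketched in a few lines. This is an illustrative toy (the one-hot feature map, step size, and target values are my own assumptions, not from the book), using an unbiased Monte Carlo return as the target $U_t$:

```python
import numpy as np

def x(s, d=4):
    """One-hot feature vector for a small discrete state space (illustrative)."""
    v = np.zeros(d)
    v[s] = 1.0
    return v

def sgd_update(w, s, U, alpha=0.1):
    """w <- w + alpha * [U - v_hat(s, w)] * grad v_hat(s, w).
    With a linear approximator, grad v_hat(s, w) = x(s)."""
    v_hat = w @ x(s)
    return w + alpha * (U - v_hat) * x(s)

w = np.zeros(4)
for _ in range(100):
    w = sgd_update(w, s=2, U=1.0)  # train state 2 toward a return G_t = 1.0
print(round(w[2], 3))              # the estimate approaches the target, 1.0
```

Because the features here are one-hot, this reduces to the tabular update; with richer feature vectors the same two functions produce genuine generalization across states.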
<ul> <li><strong>Semi-gradient</strong> (bootstrapping) methods converge reliably in linear cases.</li> <li>Bootstrapping targets could be <strong>$n$-step returns</strong> or <strong>dynamic programming (DP)</strong> targets.</li> </ul> </li> </ul> \[\begin{align*} (\text{semi-gradient}) \quad &amp; U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) \\ (n\text{-step}) \quad &amp; U_t = G_{t:t+n} \\ (\text{DP}) \quad &amp; U_t = \sum_{a,s',r} \pi(a \vert S_t)\, p(s', r \vert S_t, a)\!\left[r + \gamma \hat{v}(s', \mathbf{w}_t)\right] \end{align*}\] <h3 id="state-aggregation">State Aggregation</h3> <ul> <li>A simple form of generalizing function approximation in which states are grouped together, with one estimated value (one component of the weight vector $\mathbf{w}$) for each group.</li> <li>A special SGD case in which:</li> </ul> \[\nabla \hat{v}(S_t, \mathbf{w}_t) = \left\{ \begin{array}{ll} 1 &amp; \text{for } S_t\text{'s group's component} \\ 0 &amp; \text{for the other components} \end{array} \right\}\] <hr /> <h2 id="94-linear-methods">9.4 Linear Methods</h2> <ul> <li>One of the simplest cases for function approximation is when the approximate value function is a <strong>linear combination</strong> of the weight vector.</li> <li>Linear methods approximate the state-value function by the inner product between $\mathbf{w}$ and $\mathbf{x}(s)$:</li> </ul> \[\hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^T \mathbf{x}(s) \doteq \sum_{i=1}^{d} w_i x_i(s)\] \[\begin{aligned} \\ \text{where } \quad \mathbf{x}(s) &amp;= \textbf{feature vector } \text{representing state } s \\ \mathbf{x}(s) &amp;\doteq \bigl(x_1(s),\ x_2(s),\ x_3(s),\ \ldots,\ x_d(s)\bigr)^T \\ x_i(s) &amp;= \text{value of function } x_i : S \to \mathbb{R} \end{aligned}\] <ul> <li>For linear methods, features are <strong>basis functions</strong> because they form a linear basis for the set of approximate functions.</li> <li>For linear methods, the gradient of the approximate value function WRT $\mathbf{w}$ 
is:</li> </ul> \[\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)\] <ul> <li>Thus in the linear case, the general SGD update reduces to:</li> </ul> \[\boxed{ \mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[U_t - \hat{v}(S_t, \mathbf{w}_t)\right] \mathbf{x}(S_t)}\] <ul> <li>The semi-gradient TD(0) algorithm converges under linear function approximation:</li> </ul> \[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left(R_{t+1} + \gamma \mathbf{w}_t^T \mathbf{x}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t\right) \mathbf{x}_t\] \[= \mathbf{w}_t + \alpha\!\left(R_{t+1} \mathbf{x}_t - \mathbf{x}_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right)^T \mathbf{w}_t\right)\] \[\begin{aligned} \\ \text{where} \quad \mathbf{x}_t = \mathbf{x}(S_t) \end{aligned}\] <ul> <li>Once the system has reached steady state, for any given $\mathbf{w}_t$ the expected next weight vector can be represented as:</li> </ul> \[\mathbb{E}\!\left[\mathbf{w}_{t+1} \vert \mathbf{w}_t\right] = \mathbf{w}_t + \alpha\!\left(\mathbf{b} - \mathbf{A}\mathbf{w}_t\right)\] \[\begin{aligned} \\ \text{where} \quad b &amp;\doteq \mathbb{E}\!\left[R_{t+1}\, \mathbf{x}_t\right] \in \mathbb{R}^d \\ A &amp;\doteq \mathbb{E}\!\left[\mathbf{x}_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right)^T\right] \in \mathbb{R}^{d \times d} \end{aligned}\] <ul> <li>It is clear that, if the system converges, it must converge to the weight vector $\mathbf{w}_\text{TD}$ at which:</li> </ul> \[\mathbf{b} - \mathbf{A}\mathbf{w}_\text{TD} = \mathbf{0}\] \[\mathbf{b} = \mathbf{A}\mathbf{w}_\text{TD}\] \[\mathbf{w}_\text{TD} \doteq \mathbf{A}^{-1}\mathbf{b}\] <ul> <li>$\mathbf{w}_\text{TD}$ is called the <strong>TD fixed point</strong>. It is the point that linear semi-gradient TD(0) converges to.</li> <li> <p>$\mathbf{A}$ needs to be <strong><em><u>positive definite</u></em></strong> to ensure that $\mathbf{A}^{-1}$ exists. 
\(y^T \mathbf{A} y &gt; 0, \text{for any } y \neq 0\)</p> </li> <li>The semi-gradient $n$-step TD algorithm is the natural extension of the tabular $n$-step TD algorithm in Ch. 7 to semi-gradient function approximation:</li> </ul> <!-- $$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t < T$$ $$\text{where } \hspace{0.75em} G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad 0 \leq t \leq T-n$$ --> \[\boxed{ \begin{aligned} \mathbf{w}_{t+n} &amp;\doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t &lt; T \\ \\ G_{t:t+n} &amp;\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad 0 \leq t \leq T-n \end{aligned} }\] <h3 id="bounded-expansion">Bounded Expansion</h3> <ul> <li>At the TD fixed point, $\overline{\text{VE}}$ is within a bounded expansion of the lowest possible error:</li> </ul> \[\boxed{\overline{\text{VE}}(\mathbf{w}_\text{TD}) \leq \frac{1}{1-\gamma} \min_{\mathbf{w}} \overline{\text{VE}}(\mathbf{w})}\] <ul> <li>That is, the asymptotic error of the TD method is no more than $\frac{1}{1-\gamma}$ times the smallest possible error (attained in the limit by the MC method).</li> <li>$\gamma$ is often near 1, so the expansion factor can be quite large; therefore there is substantial potential loss in asymptotic performance with the TD method.</li> <li>A bound analogous to TD’s fixed point &amp; bound above on $\overline{\text{VE}}(\mathbf{w}_\text{TD})$ applies to other on-policy bootstrapping methods as well.</li> <li>One-step <strong>action-value</strong> methods such as <strong><em>semi-gradient Sarsa(0)</em></strong> converge to an analogous fixed point and an analogous 
bound.</li> <li>For <strong>episodic</strong> tasks, there exists an analogous bound [Bertsekas &amp; Tsitsiklis, 1996].</li> </ul> <hr /> <h2 id="95-feature-construction-for-linear-methods">9.5 Feature Construction for Linear Methods</h2> <ul> <li>Let’s discuss different ways of constructing features, an important step for linear function approximation methods.</li> </ul> <h3 id="951-polynomials">9.5.1 Polynomials</h3> <ul> <li>We may need to design higher-order features so that interactions between state dimensions can be captured by linear methods.</li> <li>Suppose, for example, a state has 2 numerical dimensions, $s_1, s_2 \in \mathbb{R}$. It is insufficient to represent this state as $\mathbf{x}(s) = (s_1, s_2)^T$ because this takes no account of interactions between the 2 dimensions.</li> <li>We can overcome this limitation by instead representing $s$ by the 4-D feature vector:</li> </ul> \[\mathbf{x}(s) = \bigl(1,\ s_1,\ s_2,\ s_1 s_2\bigr)^T\] <ul> <li>More generally, suppose each state $s$ corresponds to $k$ numbers, $s_1, s_2, s_3, \ldots, s_k \text{ with each } s_i \in \mathbb{R}$. Then each order-$n$ polynomial-basis feature $x_i$ can be written as:</li> </ul> \[\boxed{x_i(s) = \prod_{j=1}^{k} s_j^{c_{i,j}}}\] \[\begin{aligned} \text{where} \\ c_{i,j} &amp;= \text{integer in } \{0, 1, 2, 3, \ldots, n\}, \quad \text{ for } n \geq 0 \\ (n+1)^k &amp;= \text{no. 
of distinct features for a } k\text{-dimensional state space} \end{aligned}\] <h3 id="952-fourier-basis">9.5.2 Fourier Basis</h3> <ul> <li>Fourier series express periodic functions as weighted sums of sine and cosine basis functions of different frequencies, with a function $f$ being periodic if:</li> </ul> \[f(x) = f(x + \tau) \quad \forall x, \text{ and for some period } \tau\] <ul> <li>The 1-D, order-$n$ Fourier cosine basis consists of the $n+1$ features:</li> </ul> \[x_i(s) = \cos(i \pi s), \quad s \in [0, 1] \quad \text{for } i = 0, 1, 2, \ldots, n\] <ul> <li>For a $k$-dimensional state space, the $i$-th feature in the order-$n$ Fourier cosine basis is:</li> </ul> \[\boxed{x_i(s) = \cos\!\left(\pi \mathbf{s}^T \mathbf{c}^i\right)}\] \[\begin{aligned} \text{where} \\ \mathbf{c}^i &amp;= \bigl(c_1^i, c_2^i, \ldots, c_k^i\bigr)^T \\ c_j^i &amp;\in \{0, 1, 2, \ldots, n\} \quad \text{for } j = 1, 2, \ldots, k \text{ and } i = 1, 2, \ldots, (n+1)^k \\ i &amp;\equiv \text{features} \\ j &amp;\equiv \text{dimensions} \end{aligned}\] <h3 id="953-coarse-coding">9.5.3 Coarse Coding</h3> <ul> <li>In 2 dimensions, each feature corresponds to a circle (a receptive field) in the state space.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch09-9-5-3-coarse-coding.png" alt="coarse coding" /></p> <!-- >**Coarse coding:** Generalization from state $s$ to state $s'$ depends on the number of their features whose receptive fields (in this case, circles) overlap. These states have one feature in common, so there will be slight generalization between them. --> <div class="callout callout--note"> <div class="callout__title"> <strong>Coarse Coding</strong> </div> <div class="callout__body"> <p>Generalization from state $s$ to state $s’$ depends on the number of their features whose receptive fields (in this case, circles) overlap. 
These states have one feature in common, so there will be slight generalization between them.</p> </div> </div> <ul> <li>In the diagram above, if the state is inside a circle, then the corresponding feature has the value of 1 and is said to be <strong>present</strong>; otherwise the feature is 0 and is said to be <strong>absent</strong>.</li> <li>This kind of 1-0 valued feature is called a <strong>binary feature</strong>.</li> <li>Representing a state with features that overlap in this way is known as <strong>coarse coding</strong>.</li> <li>If we train at one state (a point in the space), then the weights of all circles intersecting that state will be affected (each circle has a corresponding weight, a single $\mathbf{w}$ component).</li> <li>Generalization in linear function approximation methods is determined by the <strong>sizes</strong> and <strong>shapes</strong> of the features’ receptive fields.</li> </ul> <h3 id="954-tile-coding">9.5.4 Tile Coding</h3> <ul> <li>Tile coding is a form of coarse coding for multi-dimensional continuous spaces that is flexible and computationally efficient.</li> <li>It may be the most practical feature representation for modern sequential digital computers.</li> <li>In tile coding, the receptive fields of the features are grouped into partitions of the state space.</li> <li>Each such partition is called a <strong>tiling</strong>, and each element of the partition is called a <strong>tile</strong>.</li> <li>These tiles can be sets that are overlapping, uniform, or asymmetrically distributed.</li> <li>The tiles do not need to be squares; they can be irregular shapes, horizontal/vertical/log lines.</li> <li><strong>Advantages:</strong> <ul> <li>Because tile coding works with partitions, the overall number of features that are active at one time is the <strong>same</strong> for any state.</li> <li>Because tile coding uses binary feature vectors, computing the approximate value function reduces to simply summing the $n$ weight 
components corresponding to the $n$ active tiles, rather than performing $d$ multiplications.</li> </ul> </li> <li>Generalization extends to states sharing tiles with the trained state, proportional to <strong>tiles in common.</strong> <ul> <li><strong>Uniform offsets</strong> can introduce directional artifacts (e.g. diagonal bias);</li> <li><strong>Asymmetric offsets</strong> produce better-centered patterns.</li> </ul> </li> <li>Tilings are offset by $\frac{w}{n}$ (the fundamental unit). <ul> <li><strong>Uniform offsets</strong> use displacement vector $(1,1)$;</li> <li><strong>Asymmetric offsets</strong> use $(1,3)$ for superior generalization.</li> </ul> </li> <li>Miller &amp; Glanz (1996) recommend displacement vectors of <strong>first odd integers</strong> $(1, 3, 5, \ldots, 2k-1)$ for $k$ dimensions, with $n$ set to a <strong>power of 2</strong> $\geq 4k$.</li> <li>Tile <strong>number &amp; size</strong> determine resolution, while tile <strong>shape</strong> determines generalization. <ul> <li>Square tiles generalize equally</li> <li>Elongated tiles generalize along their longer dimension</li> <li>Diagonal tiles generalize along a diagonal.</li> </ul> </li> <li><strong>Mixed tile shapes</strong> (horizontal, vertical, conjunctive) allow per-dimension generalization while still learning values for specific conjunctions. Tile choice fully determines generalization behavior.</li> <li><strong>Hashing</strong> pseudorandomly collapses a large tiling into a much smaller set of tiles (each consisting of noncontiguous, disjoint regions), drastically reducing memory requirements with little performance loss. 
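The hashing trick above can be sketched as follows (an illustrative sketch; `MEMORY`, `tile_index`, and the example coordinates are made-up names, not from the book). Each conceptual tile, identified by its integer coordinates plus its tiling number, is hashed pseudorandomly into a fixed-size weight table:

```python
MEMORY = 512   # size of the weight table; fixed regardless of dimensionality

def tile_index(coords, tiling):
    """Hash a tile's integer coordinates (one per state dimension) and its
    tiling number into one of MEMORY slots; distinct tiles may collide."""
    return hash((tiling,) + tuple(coords)) % MEMORY

# An 8-D task would naively need tiles_per**8 weights per tiling; hashed,
# every tiling shares the same 512-slot table.
idx = tile_index(coords=(3, 1, 4, 1, 5, 9, 2, 6), tiling=0)
print(0 <= idx < MEMORY)   # True
```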
<ul> <li>This sidesteps the curse of dimensionality since memory need only match the task’s real demands rather than grow exponentially with dimensions.</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch09-9-5-4-hashing-border.png" alt="hashing" /></p> <!-- >**Hashing:** 4 subtiles collapse into 1 tile in the above diagram --> <div class="callout callout--note"> <div class="callout__title"> <strong>Hashing</strong> </div> <div class="callout__body"> <p>4 subtiles collapse into 1 tile in the above diagram.</p> </div> </div> <h3 id="955-radial-basis-functions-rbf">9.5.5 Radial Basis Functions (RBF)</h3> <ul> <li>RBFs are the natural extension/generalization of coarse coding to continuous-valued features. <ul> <li>Rather than a feature being 0 or 1 (binary), it can take any value in the interval $[0, 1]$.</li> </ul> </li> <li>A typical RBF feature, $x_i$, has a Gaussian (bell-shaped) response $x_i(s)$ dependent only on the distance between the state $s$ and the feature’s center state $c_i$, and relative to the feature’s width $\sigma_i$:</li> </ul> \[\boxed{x_i(s) \doteq \exp\!\left(-\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2}\right)}\] <ul> <li>The figure below shows a 1-D example with a Euclidean distance metric.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch09-9-5-5-1dim-rbf.png" alt="1D RBF" /></p> <!-- >**1-D RBF**: One-dimensional radial basis function --> <div class="callout callout--note"> <div class="callout__title"> <strong>1-D RBF</strong> </div> <div class="callout__body"> <p>A one-dimensional example with a Euclidean distance metric.</p> </div> </div> <ul> <li>The primary advantage of RBFs over binary features is that they produce approximate functions that vary smoothly and are differentiable.</li> <li>An <strong>RBF network</strong> is a linear function approximator using RBFs for its features.</li> <li>Some learning methods for RBF networks change the features’ centers and widths, making them nonlinear function 
approximators.</li> <li>Nonlinear methods are much more precise in fitting the target functions, but (nonlinear) RBF networks are <strong><em>much more computationally complex</em></strong> and require more manual tuning before learning.</li> </ul> <hr /> <h2 id="96-selecting-step-size-parameters-manually">9.6 Selecting Step-Size Parameters Manually</h2> <ul> <li>What is the best way to select $\alpha$ for function approximation?</li> <li>So far, <ul> <li>The theory of stochastic approximation gives us conditions on a slowly decreasing step-size sequence that are sufficient to guarantee convergence, but these tend to result in learning that is <strong>too slow.</strong></li> <li>The sample-average classical method $\alpha_t = \frac{1}{t}$ is <strong>not appropriate</strong> for TD methods, for nonstationary problems, or for any function approximation method.</li> <li>For linear methods, there are recursive least-square methods that set an optimal matrix step size which can be extended to LSTD methods (seen in Section 9.8), but these require $O(d^2)$ step-size parameters, or $d$ times more parameters than we are learning. 
Therefore, it is inappropriate for function approximation.</li> </ul> </li> <li>A good rule of thumb for setting the step-size parameter of linear SGD models is:</li> </ul> \[\boxed{\alpha \doteq \left(\tau\, \mathbb{E}\!\left[\mathbf{x}^T \mathbf{x}\right]\right)^{-1}}\] \[\begin{aligned} \text{where} \\ \mathbf{x} &amp;= \text{random feature vector chosen from the same distribution as input vectors in the SGD} \\ \tau &amp;= \text{number of experiences within which learning converges} \end{aligned}\] <ul> <li>This rule of thumb works best if the feature vectors do not vary greatly in length; ideally $\mathbf{x}^T \mathbf{x}$ is a constant.</li> </ul> <hr /> <h2 id="97-nonlinear-function-approximation-artificial-neural-networks-anns">9.7 Nonlinear Function Approximation: Artificial Neural Networks (ANNs)</h2> <ul> <li>ANNs are widely used for nonlinear function approximation.</li> <li>An ANN is a network of interconnected units that have some of the properties of neurons, the main components of nervous systems.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch09-9-7-ann.png" alt="Generic ANN" /></p> <!-- >**ANN**: A generic feedforward ANN with 4 input units, 2 output units, and 2 hidden layers. 
--> <div class="callout callout--note"> <div class="callout__title"> <strong>ANN</strong> </div> <div class="callout__body"> <p>A generic feedforward Artificial Neural Network with 4 input units, 2 output units, and 2 hidden layers.</p> </div> </div> <ul> <li>The units (circles in the figure above) compute a weighted sum of their input signals, and then apply a nonlinear function, called the <strong>activation function</strong>, to the result.</li> <li> <p>Some activation functions include: \(\begin{aligned} \text{Sigmoid:} \quad &amp; f(x) = \frac{1}{1 + e^{-x}} \\ \\ \text{ReLU:} \quad &amp; f(x) = \max(0, x) \\ \\ \text{Binary step:} \quad &amp; f(x) = \left\{ \begin{array}{ll} 1 &amp; \text{if } x \geq 0 \\ 0 &amp; \text{if } x &lt; 0 \end{array} \right\} \end{aligned}\)</p> </li> <li>ANNs can use TD errors to learn value functions, or they can aim to maximize expected reward as in a gradient bandit (Section 2.8) or a policy gradient algorithm (Chapter 13).</li> <li>Overfitting in ANNs is a problem for function approximation that can be reduced through the <strong>dropout</strong> method [Srivastava, Hinton, Krizhevsky, Sutskever &amp; Salakhutdinov, 2014].</li> <li><strong>Deep belief networks</strong> were a major step toward solving the problem of training the deep layers of an ANN [Hinton, Osindero &amp; Teh, 2006].</li> <li><strong>Batch normalization</strong> makes it easier to train deep ANNs by normalizing the output of deep layers before they feed into the following layer. 
This speeds up the learning of deep ANNs [Ioffe &amp; Szegedy, 2015].</li> <li><strong>Deep residual learning</strong> is another technique useful for training deep ANNs [He, Zhang, Ren &amp; Sun, 2016].</li> <li><strong>Deep Convolutional Networks</strong> have proven to be very successful in impressive RL applications [LeCun, Bottou, Bengio &amp; Haffner, 1998].</li> <li>In summary, advances in the design and training of ANNs, of which we have only mentioned a few, all contribute to RL.</li> <li>Although current RL theory is mostly limited to methods using tabular representations or linear function approximation, the impressive performances of notable RL applications owe much of their success to nonlinear function approximation by multi-layer ANNs.</li> </ul> <hr /> <h2 id="98-least-squares-td-lstd">9.8 Least-Squares TD (LSTD)</h2> <ul> <li>As established earlier, TD(0) with a linear function approximator converges asymptotically (for appropriately decreasing step sizes) to the TD fixed point:</li> </ul> \[\boxed{\mathbf{w}_\text{TD} = \mathbf{A}^{-1}\mathbf{b}}\] \[\begin{aligned} \text{where} \\ \mathbf{A} &amp;\doteq \mathbb{E}\!\left[\mathbf{x}_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right)^T\right] \\ \mathbf{b} &amp;\doteq \mathbb{E}\!\left[R_{t+1}\, \mathbf{x}_t\right] \end{aligned}\] <ul> <li>Instead of computing $\mathbf{w}_\text{TD}$ iteratively, let’s compute estimates of $\mathbf{A}$ and $\mathbf{b}$ and then directly compute the TD fixed point. This is exactly what <strong>Least-Squares TD (LSTD)</strong> does.
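In code, the whole procedure is short. A minimal batch-style NumPy sketch (my own illustration, not the book's pseudocode) that accumulates the two estimates defined in this section and solves directly for the fixed point:

```python
import numpy as np

def lstd(transitions, d, gamma=0.9, eps=1e-3):
    """Least-Squares TD: estimate w = A_hat^{-1} b_hat from a batch of transitions.

    transitions: iterable of (x, r, x_next) with feature vectors x, x_next in R^d.
    The eps * I term keeps A_hat invertible before enough data has accumulated.
    """
    A_hat = eps * np.eye(d)
    b_hat = np.zeros(d)
    for x, r, x_next in transitions:
        A_hat += np.outer(x, x - gamma * x_next)
        b_hat += r * x
    return np.linalg.solve(A_hat, b_hat)  # estimate of w_TD
```

An incremental O(d²)-per-step version would instead maintain the inverse of A_hat directly via the Sherman-Morrison update given below.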
It forms the natural estimates:</li> </ul> \[\mathbf{\hat{A}_t} \doteq \sum_{k=0}^{t-1} \mathbf{x_k}\!\left(\mathbf{x}_k - \gamma \mathbf{x}_{k+1}\right)^T + \varepsilon \mathbf{I}\] \[\mathbf{\hat{b}_t} \doteq \sum_{k=0}^{t-1} R_{k+1}\, \mathbf{x}_k\] \[\begin{aligned} \text{where} \\ \mathbf{I} &amp;= \text{identity matrix, and} \\ \varepsilon \mathbf{I}, &amp;\ \text{for some small } \varepsilon &gt; 0 \text{, ensures that } \mathbf{\hat{A}_t} \text{ is always invertible} \end{aligned}\] <ul> <li>The LSTD estimated TD fixed point is:</li> </ul> \[\boxed{\mathbf{w}_t \doteq \mathbf{\hat{A}_t}^{-1} \mathbf{\hat{b}_t}}\] <h3 id="computational-complexity">Computational Complexity</h3> <ul> <li>Computing the inverse of $\mathbf{\hat{A}_t}$ costs $O(d^3)$, but this can be reduced to $O(d^2)$ using the <strong>Sherman-Morrison</strong> formula:</li> </ul> \[\mathbf{\hat{A}_t^{-1}} = \left(\mathbf{\hat{A}_{t-1}} + \mathbf{x_{t-1}}\!\left(\mathbf{x_{t-1}} - \gamma \mathbf{x}_t\right)^T\right)^{-1}\] \[\boxed{\mathbf{\hat{A}_t^{-1}} = \mathbf{\hat{A}_{t-1}^{-1}} - \frac{\mathbf{\hat{A}_{t-1}^{-1}} \mathbf{x}_{t-1}\!\left(\mathbf{x}_{t-1} - \gamma \mathbf{x}_t\right)^T \mathbf{\hat{A}_{t-1}^{-1}}}{1 + \left(\mathbf{x}_{t-1} - \gamma \mathbf{x}_t\right)^T \mathbf{\hat{A}_{t-1}^{-1}} \mathbf{x}_{t-1}}, \quad \text{for } t &gt; 0, \text{ with } \mathbf{\hat{A}_0} \doteq \varepsilon \mathbf{I}}\] <ul> <li>LSTD does <strong>not</strong> require a step-size parameter, but as a consequence it never forgets, which is sometimes desirable but often not in RL problems where both the policy and the environment change over time.</li> <li>However, in control applications, LSTD typically has to be combined with some other mechanism to induce forgetting, negating its advantage of not requiring a step-size parameter.</li> </ul> <hr /> <h2 id="99-memory-based-function-approximation">9.9 Memory-based Function Approximation</h2> <ul> <li>So far we have been parametrising functions that approximate our
value function, and these parameters are updated as we see more data. This is called the <strong>parametric</strong> approach.</li> <li>Memory-based function approximators store training examples in memory as they arrive without updating any parameters, and retrieve them from memory whenever a state’s value estimate is queried.</li> <li>Memory-based function approximators are examples of <strong>non-parametric</strong> methods.</li> <li>Memory-based function approximation is sometimes called <strong>lazy learning</strong> because processing training examples is postponed until the system is queried to provide an output.</li> <li>Non-parametric methods can produce increasingly accurate approximations of any target function as the number of training examples accumulating in memory grows.</li> <li>There are different memory-based methods, but let’s focus on <strong>local learning</strong>: <ul> <li>Local learning methods approximate a value function only locally in the neighborhood of the current query state, by retrieving states in memory via a distance metric between the query state and training example states.</li> <li>Local learning methods discard the local approximation after the query state is assigned a value.</li> </ul> </li> <li>Examples of local learning methods: <ul> <li><strong>Nearest Neighbor</strong>: retrieve from memory the closest state to the query state and return that example’s value as the local approximation of the query state.</li> <li><strong>Weighted Average</strong>: retrieve a set of nearest neighbor examples and return a weighted average of their target values (with weights determined by some distance metric).</li> <li><strong>Locally Weighted Regression</strong>: similar to weighted average, but fits a surface to the values of a set of nearest states based on a parametric approximation method.</li> </ul> </li> <li>Pros of non-parametric, memory-based methods: <ul> <li>Do not limit approximations to pre-specified functional forms.</li> <li>The
more data accumulated, the better the accuracy.</li> <li>Allow for relatively immediate effect on value estimates in the neighborhood of the current state.</li> <li>Handle/address the curse of dimensionality, which is a big problem for global approximation. For example, for a state space with $K$ dimensions, <ul> <li>A <strong>tabular method</strong> storing a global approximation requires memory <strong><em>exponential in $K$,</em></strong> while</li> <li>Storing examples in a <strong>memory-based method</strong> requires only memory <strong><em>proportional to $K$, or linear in the number of examples $n$.</em></strong></li> </ul> </li> </ul> </li> </ul> <hr /> <h2 id="910-kernel-based-function-approximation">9.10 Kernel-based Function Approximation</h2> <ul> <li>Weighted average and locally weighted regression depend on assigning weights based on some distance metric between examples $s’$ and a query state $s$.</li> <li>The function that assigns these weights is called a <strong>kernel function</strong> or simply a <strong>kernel</strong>.</li> <li>A kernel function $k : \mathbb{R} \to \mathbb{R}$ assigns weights to distances between states, but more generally weights do not have to depend on distances; they can also depend on some similarity measure:</li> </ul> \[k : S \times S \to \mathbb{R}\] <ul> <li>$k(s, s’)$ is a measure of the strength of generalization from $s’$ to $s$.</li> <li>Kernel functions numerically express <strong>how relevant</strong> knowledge about any state is to any other state.</li> <li><strong>Kernel regression</strong> is the memory-based method that computes a kernel weighted average of the targets of <strong>all examples</strong> stored in memory:</li> </ul> \[\boxed{\hat{v}(s, D) = \sum_{s' \in D} k(s, s')\, g(s')}\] \[\begin{aligned} \text{where} \\ \hat{v}(s, D) &amp;= \text{value function approximation for query state } s \text{ over stored examples } D \\ D &amp;= \text{set of stored examples} \\ g(s') &amp;= \text{target for 
state } s' \text{ in a stored example} \end{aligned}\] <ul> <li>A common kernel is the Gaussian Radial Basis Function (RBF) discussed earlier in Section 9.5.5.</li> <li>Kernel regression with RBFs differs from a linear parametric method with RBF features in 2 ways: <ul> <li>It is <strong>memory-based</strong> $\Rightarrow$ RBFs are centered on the states of the stored examples.</li> <li>It is <strong>non-parametric</strong> $\Rightarrow$ no parameters to learn.</li> </ul> </li> <li>We can recast any linear parametric regression method with feature vectors $\mathbf{x}(s) = (x_1(s), x_2(s), x_3(s), \ldots, x_d(s))^T$ into a kernel regression as:</li> </ul> \[k(s, s') = \mathbf{x}(s)^T \mathbf{x}(s')\] <ul> <li>Kernel methods allow us to work in high-dimensional feature spaces using only stored examples, without ever constructing an explicit parametric model. This is the so-called <strong>kernel trick</strong> that forms the basis for many machine learning methods.</li> </ul> <hr /> <h2 id="911-looking-deeper-at-on-policy-learning-interest--emphasis">9.11 Looking Deeper at On-Policy Learning: Interest &amp; Emphasis</h2> <ul> <li>So far we have treated all encountered states with equal importance; however, we often have more interest in some states than in others.</li> <li>For this scenario, we introduce 2 new concepts: <strong>Interest</strong> and <strong>Emphasis</strong>.</li> </ul> <h3 id="interest">Interest</h3> <ul> <li>Interest is the degree to which we value an accurate estimate of a given state’s value.
How interested are we in the accurate valuation of a given state at time $t$?</li> </ul> \[\mathcal{I}_t \in [0, 1]\] <h3 id="emphasis">Emphasis</h3> <ul> <li>Emphasis is a non-negative scalar random variable that multiplies the learning update and thus emphasizes or de-emphasizes the learning done at time $t$.</li> <li>The general $n$-step learning update is:</li> </ul> \[\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha M_t\!\left[G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t &lt; T\] \[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad 0 \leq t \leq T-n\] <ul> <li>The emphasis is determined recursively from the interest:</li> </ul> \[\boxed{M_t = \mathcal{I}_t + \gamma^n M_{t-n}, \quad 0 \leq t &lt; T}\] \[\text{with } M_t \doteq 0, \quad \forall t &lt; 0\] <hr /> <h2 id="912-summary">9.12 Summary</h2> <ul> <li><strong>Generalization</strong> is a must for RL systems intended for AI applications, and <strong>supervised-learning function approximation</strong> helps achieve it.</li> <li>The prediction objective $\overline{\text{VE}}(\mathbf{w})$, called the <strong>mean squared value error,</strong> gives us a clear way to rank different value-function approximations in the on-policy case.</li> <li>Most techniques use stochastic gradient descent (SGD) to find the set of weight parameters that minimize $\overline{\text{VE}}(\mathbf{w})$.</li> <li>Linear methods converge to the global optimum under certain conditions.</li> <li>Features constructed for linear function approximation include: <strong>polynomials, Fourier bases, coarse codings, tile codings, and radial basis functions (RBFs).</strong></li> <li>Much of the success of notable RL applications can be attributed to multi-layer ANNs as nonlinear function approximators.</li> <li>Non-parametric models help us <strong>avoid the curse of dimensionality.</strong></li>
<li><strong>Interest and Emphasis</strong> enable us to focus the function approximation on the states we’re more interested in.</li> </ul> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026RLsuttonBartoCh09notes</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Sutton &amp; Barto, Ch. 09: On-Policy Prediction with Approximation (Personal Notes)"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Feb"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/02/27/rl-sutton-barto-notes-ch009/"</span> <span class="p">}</span> </code></pre></div></div> <hr /> Fri, 27 Feb 2026 00:00:00 +0000 https://chizkidd.github.io//2026/02/27/rl-sutton-barto-notes-ch009/ Sutton & Barto, Ch. 08: Planning & Learning with Tabular Methods (Personal Notes) <ul> <li><strong>Model-Based RL methods</strong> require a model of the environment and rely on <strong>planning</strong> as their primary component. <ul> <li>Dynamic Programming (DP), Heuristic Search</li> </ul> </li> <li><strong>Model-Free RL methods</strong> don’t require a model of the environment and primarily rely on <strong>learning</strong>.
<ul> <li>Monte Carlo (MC), Temporal-Difference (TD)</li> </ul> </li> <li>Both methods still use value functions and both make backups to state values based on future returns.</li> </ul> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ul> <li><a href="#81-models--planning">8.1 Models &amp; Planning</a></li> <li><a href="#82-dyna-integrated-planning-acting-and-learning">8.2 Dyna: Integrated Planning, Acting, and Learning</a></li> <li><a href="#83-when-the-model-is-wrong">8.3 When the Model is Wrong</a></li> <li><a href="#84-prioritized-sweeping">8.4 Prioritized Sweeping</a></li> <li><a href="#85-expected-vs-sample-updates">8.5 Expected vs Sample Updates</a></li> <li><a href="#86-trajectory-sampling">8.6 Trajectory Sampling</a></li> <li><a href="#87-real-time-dynamic-programming-rtdp">8.7 Real-Time Dynamic Programming (RTDP)</a></li> <li><a href="#88-planning-at-decision-time">8.8 Planning at Decision Time</a></li> <li><a href="#89-heuristic-search">8.9 Heuristic Search</a></li> <li><a href="#810-rollout-algorithms">8.10 Rollout Algorithms</a></li> <li><a href="#811-monte-carlo-tree-search-mcts">8.11 Monte Carlo Tree Search (MCTS)</a></li> <li><a href="#812-summary">8.12 Summary</a></li> <li><a href="#813-summary-of-part-i-dimensions">8.13 Summary of Part I: Dimensions</a></li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li><a href="#citation">Citation</a></li> </ul> <hr /> <h2 id="81-models--planning">8.1 Models &amp; Planning</h2> <ul> <li>A <strong>model</strong> is anything the agent can use to predict the environment’s behavior. <ul> <li>Given a state and an action, a model predicts the next state and the next reward.</li> <li>If the model is stochastic, then there are several possible next states and next rewards. 
The model produces: <ul> <li>All the possibilities and probabilities; these are <strong>distribution models</strong>.</li> <li>One of the possibilities, sampled according to the probabilities; these are <strong>sample models</strong>.</li> </ul> </li> </ul> </li> <li> <p>Models are used to simulate the environment and thus produce <strong>simulated experiences</strong>.</p> </li> <li><strong>Planning</strong> refers to any computational process that takes a model as input and produces or improves a policy for interacting with the modelled environment.</li> </ul> \[\text{model} \xrightarrow{\text{Planning}} \text{policy}\] <ul> <li>There are 2 distinct approaches to planning: <ul> <li><strong>State-space planning</strong>: search through the state space for an optimal policy. Actions cause transitions from state to state, and value functions are computed over states.</li> <li><strong>Plan-space planning</strong>: search through the space of plans. Operators transform one plan into another, and value functions, if any, are defined over the space of plans. e.g. Evolutionary methods, partial-order planning.</li> </ul> </li> <li><strong>State-space planning common structure:</strong></li> </ul> \[\text{model} \to \text{simulated experience} \xrightarrow{\text{backups}} \text{values} \to \text{policy}\] <ul> <li>All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy.</li> <li> <p>They compute value functions by updates or backup operations applied to simulated experience.</p> </li> <li>Dynamic programming methods make sweeps through the space of states, generating for each state the distribution of possible transitions. 
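To make the distribution-model vs sample-model distinction above concrete, here is a toy sketch (my own example; the 80/20 dynamics are hypothetical):

```python
import random

def distribution_model(state, action):
    """Distribution model: every (next_state, reward) outcome with its probability."""
    # toy dynamics (hypothetical): the move succeeds 80% of the time
    return [((state + 1, 1.0), 0.8), ((state, 0.0), 0.2)]

def sample_model(state, action):
    """Sample model: one (next_state, reward), drawn according to those probabilities."""
    outcomes, probs = zip(*distribution_model(state, action))
    return random.choices(outcomes, weights=probs, k=1)[0]
```

A distribution model supports expected updates (as in DP), while a sample model supports only sample updates.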
Each distribution is then used to compute a backed-up value (update target) and update the state’s estimated value.</li> <li>At the heart of both learning and planning is the estimation of value functions by backing-up update operations.</li> <li>The difference however is that: <ul> <li>Planning uses <strong>simulated</strong> experience generated by a model.</li> <li>Learning uses <strong>real</strong> experience generated by the environment.</li> </ul> </li> <li>Despite the differences, a learning algorithm can be substituted for the key update step of a planning method, because it applies just as well to simulated experience.</li> </ul> <h3 id="random-sample-one-step-tabular-q-learning">Random-sample One-step Tabular Q-Learning</h3> <ul> <li>A planning method based on one-step tabular Q-learning and on random samples from a sample model.</li> <li>It converges to the optimal policy for the model under the same conditions as that for the real environment.</li> </ul> <h3 id="pseudocode-random-sample-one-step-tabular-q-learning">Pseudocode: Random-sample One-step Tabular Q-Learning</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loop forever: 1. Select a state, S ∈ S, and an action, A ∈ A(S), at random 2. Send S, A to a sample model, and obtain a sample next reward, R, and a sample next state, S' 3. 
Apply one-step tabular Q-learning to S, A, R, S': Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)] </code></pre></div></div> <ul> <li>Planning in very small, incremental steps may be the most <strong>efficient</strong> approach especially in large scale problems.</li> </ul> <hr /> <h2 id="82-dyna-integrated-planning-acting-and-learning">8.2 Dyna: Integrated Planning, Acting, and Learning</h2> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-2-title-border.png" alt="Dyna" /></p> <blockquote> <p><strong>Dyna-Q interactions and General Dyna architecture</strong></p> </blockquote> <ul> <li>When planning is done online, while interacting with the environment, a number of interesting issues arise: <ul> <li>New information gained from the interaction may change the model (and thus the planning).</li> <li>How do we divide the computational resources available between decision making and model learning?</li> </ul> </li> <li> <p><strong>Dyna-Q</strong>: a simple architecture integrating the major functions needed in an online planning agent.</p> </li> <li>Within a planning agent, there are at least 2 roles for real experience: <ul> <li><strong>Model learning</strong>: to improve the model (to make it more accurately match the real environment).</li> <li><strong>Direct RL</strong>: to directly improve the value function and policy.</li> <li><strong>Indirect RL</strong>: to indirectly improve the value functions and policies via the model (planning).</li> </ul> </li> <li>Both direct and indirect methods have advantages and disadvantages: <ul> <li>Indirect methods often make fuller use of a limited amount of experience.</li> <li>Direct methods are much simpler and are not affected by biases in the design of the model.</li> </ul> </li> <li> <p><strong>Dyna-Q includes all of the RL processes in the interactions diagram shown above occurring continuously: planning, acting, model-learning, and direct RL.</strong></p> </li> <li>The planning method is the random-sample, 
one-step tabular Q-planning.</li> <li>The model-learning method is also table-based and assumes the environment is deterministic. <ul> <li>After each transition $S_t, A_t \to R_{t+1}, S_{t+1}$, the model records in its table entry for $S_t, A_t$ the prediction that $R_{t+1}, S_{t+1}$ will deterministically follow.</li> <li>Thus, if the model is queried with a state-action pair that has been experienced before, it simply returns the last $S_{t+1}, R_{t+1}$ experienced as its prediction.</li> <li>During planning, the Q-learning algorithm randomly samples only from state-action pairs that have previously been experienced.</li> </ul> </li> <li>Based on the overall architecture of Dyna agents: <ul> <li><strong>Search control</strong>: the process that selects the starting states &amp; actions for the simulated experiences.</li> <li>Planning is achieved by applying RL methods to the simulated experiences just as if they had really happened.</li> </ul> </li> <li>Typically, as in Dyna-Q, the same RL method is used both for learning from real experience and for planning from simulated experience.</li> <li>Learning and planning share almost all the <strong>same</strong> machinery, differing only in the source of their experience.</li> <li>Conceptually, planning, acting, direct RL, and model learning occur simultaneously and in parallel in Dyna agents. 
For computational concreteness and implementation, we specify the order of occurrence within a time step: <ul> <li>Acting, model-learning and direct RL processes require little time.</li> <li>Planning takes the remaining time in each step because it is inherently computationally intensive.</li> </ul> </li> </ul> <h3 id="tabular-dyna-q-pseudocode">Tabular Dyna-Q Pseudocode</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Initialize Q(s,a) and Model(s,a) for all s ∈ S, a ∈ A(s) Loop forever: (a) S ← current (non-terminal) state (b) A ← ε-greedy(S, Q) (c) Take action A; observe resultant reward, R, and state, S' (d) Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)] (e) Model(S,A) ← R, S' (assuming deterministic environment) (f) Loop repeat n times: S ← random previously observed state A ← random action previously taken in S R, S' ← Model(S, A) Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)] </code></pre></div></div> <ul> <li>In the pseudocode algorithm for Dyna-Q above: <ul> <li>$\text{Model}(S,a)$ denotes the contents of the model (predicted $S_{t+1}$ &amp; $R_{t+1}$) for state-action pair.</li> <li>Direct RL, model-learning and planning are steps <strong>(d), (e)</strong> and <strong>(f)</strong> respectively.</li> <li>If <strong>(e)</strong> and <strong>(f)</strong> were omitted, the algorithm becomes one-step tabular Q-learning.</li> </ul> </li> <li>A simple maze example shows that adding planning ($n &gt; 0$) dramatically improves the agent’s behavior.</li> <li>Due to the incremental nature of planning, it is trivial to intermix planning &amp; learning. Both proceed quickly. 
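The Tabular Dyna-Q pseudocode above translates almost line-for-line into Python. A minimal sketch (my own implementation, not from the book; the `env` object with `reset()`, `step(s, a)`, and `actions` is an assumed interface, and the hyperparameters are illustrative):

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=50, n=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL (step d), model learning (e), planning (f)."""
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0
    model = {}              # model[(state, action)] -> (reward, next_state)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # (b) epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            # (c) take action in the real environment
            s2, r, done = env.step(s, a)
            # (d) direct RL: one-step tabular Q-learning
            # (terminal states are never acted from, so their Q stays 0 and
            #  bootstrapping through them contributes nothing)
            Q[(s, a)] += alpha * (
                r + gamma * max(Q[(s2, act)] for act in env.actions) - Q[(s, a)])
            # (e) model learning (deterministic environment assumed)
            model[(s, a)] = (r, s2)
            # (f) planning: n simulated one-step Q-learning updates
            for _ in range(n):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (
                    pr + gamma * max(Q[(ps2, act)] for act in env.actions) - Q[(ps, pa)])
            s = s2
    return Q
```

Dropping steps (e) and (f), i.e. setting `n = 0` and skipping the model, recovers plain one-step tabular Q-learning, as noted above.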
<ul> <li>The agent is always reactive and deliberative, and yet always planning and model-learning in the background.</li> </ul> </li> </ul> <hr /> <h2 id="83-when-the-model-is-wrong">8.3 When the Model is Wrong</h2> <ul> <li>Models may be incorrect because: <ul> <li>The environment is <strong>stochastic</strong> and only a limited number of samples have been observed.</li> <li>The model was learned using function approximation that has <strong>generalized imperfectly.</strong></li> <li>The environment has <strong>changed</strong> and its new behavior has not yet been observed.</li> </ul> </li> <li>In some scenarios, the suboptimal policy computed by planning quickly leads to the discovery and correction of the modeling error. <ul> <li>This tends to happen when the model aims to predict a greater reward than is possible.</li> <li>The planned policy attempts to exploit these opportunities and discovers that they do not exist.</li> </ul> </li> <li>Issues arise when the environment changes to become better than it was before, and yet the formerly correct policy does not reveal the improvement. 
<ul> <li>In these cases the current optimal policy remains unchanged, but an even better policy exists that the agent may never discover, because it has no reason to doubt its previously learned optimal policy.</li> <li>To address this <strong>exploration/exploitation conflict</strong>, there is no solution that is both perfect and practical, but there are simple heuristics that can be effective.</li> <li>One proposed algorithm that used one such heuristic was the <strong>Dyna-Q+</strong> agent.</li> </ul> </li> </ul> <h3 id="dyna-q">Dyna-Q+</h3> <ul> <li>The agent keeps track of the number of time steps $\tau$ that have elapsed since each state-action pair was last selected; if sufficient time has elapsed, it is presumed that the dynamics of the environment from that state have <strong>changed</strong> and that the model of it is incorrect.</li> <li>A special bonus reward ($k\sqrt{\tau}$) is added to simulated experience involving these actions to encourage <strong>exploratory behavior</strong> towards long-untried actions.</li> <li>The modelled reward for each state-action pair now becomes $r + k\sqrt{\tau}$ for some small $k$. <ul> <li>$r$ is the initial unchanged model reward without any heuristic applied.</li> </ul> </li> <li>This encourages the agent to keep testing all accessible state transitions, and even though this may be computationally costly, it’s <strong>well worth it.</strong></li> </ul> <hr /> <h2 id="84-prioritized-sweeping">8.4 Prioritized Sweeping</h2> <ul> <li>In Dyna agents, simulated transitions are started in state-action pairs selected uniformly at random from all previously experienced state-action pairs.
<ul> <li>Uniform selection is usually not the best; focusing on particular state-action pairs can be much more efficient for planning.</li> </ul> </li> <li> <p><strong>Prioritized sweeping</strong> optimizes Dyna-style planning by selectively updating state-action pairs based on the expected magnitude of their value change, rather than by uniform random selection.</p> <ul> <li><u>Steps:</u> <ol> <li>Keep a priority queue of which state-action pairs need updating most.</li> <li>Update the ones with the biggest potential changes first.</li> <li>Work backwards from important states (like the goal).</li> </ol> </li> <li><u>Mechanism:</u> <ol> <li>Maintain a priority queue of $(s,a)$ pairs ranked by Bellman error magnitude.</li> <li>Propagate updates <strong>backward</strong> from states with changed values.</li> <li>Queue predecessors weighted by: $\vert R + \gamma \max_{a'} Q(s', a') - Q(s,a) \vert$</li> </ol> </li> <li><u>Key advantages:</u> <ol> <li><strong>Efficiency</strong>: avoid wasteful updates (such as $0 \to 0$ reward transitions).</li> <li><strong>Convergence speed</strong>: dramatic empirical improvements.</li> <li><strong>Backward focusing</strong>: value propagation follows the reverse trajectory from changed states.</li> </ol> </li> </ul> </li> <li>Extensions of prioritized sweeping to <strong>stochastic environments</strong> are straightforward: <ul> <li><strong>Expected updates</strong>: enumerate all $s'$ with their transition probabilities, which is comprehensive but computationally expensive on low-probability transitions.</li> <li><strong>Sample updates</strong>: cheaper per update; their finer granularity enables selective focus on high-impact transitions.</li> <li>Essentially, when outcomes are random, one can either update based on all possibilities (slow but thorough) or sample specific outcomes (faster, focuses effort).</li> </ul> </li> <li>All planning entails sequences of value updates varying in: <ul> <li><u>Update type</u> $\Rightarrow$ expected/sample, full/partial
backup.</li> <li><u>Update ordering</u> $\Rightarrow$ backward/forward focusing, prioritization heuristic.</li> </ul> </li> <li>Forward focusing prioritizes states by reachability under current policy rather than backward value propagation.</li> </ul> <hr /> <h2 id="85-expected-vs-sample-updates">8.5 Expected vs Sample Updates</h2> <ul> <li>So far in the book we’ve discussed dynamic programming (DP) as a way of conducting policy evaluation and policy improvement given a distribution model of the environment.</li> <li>We’ve also discussed sampling methods like Monte Carlo (MC), temporal-difference (TD), and $n$-step bootstrapping to estimate value functions in the absence of a model.</li> <li>Given a fixed computational budget, are expected or sample updates more efficient for planning?</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-5-one-step.png" alt="One-Step Update backup diagrams" /></p> <blockquote> <p><strong>Backup diagrams for one-step updates</strong>: showing 7 out of 8 possible cases for one-step updates (sample and expected) for state-value and action-value functions and their optimal versions too.</p> </blockquote> <table> <thead> <tr> <th>Value estimated</th> <th>Expected updates (DP)</th> <th>Sample updates (one-step TD)</th> </tr> </thead> <tbody> <tr> <td>$v_\pi(s)$</td> <td>Policy evaluation<br />(full branching over actions &amp; next states)</td> <td>TD(0)<br />(single sampled transition)</td> </tr> <tr> <td>$v_*(s)$</td> <td>Value iteration (max over actions, full branching)</td> <td>—</td> </tr> <tr> <td>$q_\pi(s,a)$</td> <td>$q$-policy evaluation</td> <td>Sarsa</td> </tr> <tr> <td>$q_*(s,a)$</td> <td>$q$-value iteration</td> <td>Q-learning</td> </tr> </tbody> </table> <ul> <li>We have considered many value function updates. 
If we focus on one-step updates, they vary along 3 binary dimensions: <ul> <li>Whether they update state values or action values.</li> <li>Whether they estimate the value for the optimal policy or for an arbitrary given policy.</li> <li>Whether the updates are <strong>expected updates</strong> (consider all possible events that might happen) or <strong>sample updates</strong> (consider a single sample of what might happen).</li> </ul> </li> <li> <p>These 3 binary dimensions give rise to 8 cases, 7 of which are shown in the figure above. The 8th case does not seem to correspond to any useful update.</p> </li> <li>Any of these one-step updates can be used in planning methods: <ul> <li><strong>Dyna-Q</strong> uses $q_*$ sample updates, but could also use $q_*$ expected updates, or either expected or sample $q_\pi$ updates.</li> <li><strong>Dyna-AC</strong> uses $v_\pi$ sample updates together with a learning policy structure.</li> <li>For stochastic problems, prioritized sweeping is always done using one of the expected updates.</li> </ul> </li> <li>In the absence of a distribution model, expected updates are impossible, but sample updates can still be done. <ul> <li>So which is better? Expectation or Sampling?
<ul> <li>Expected updates yield a better estimate because they are uncorrupted by sampling error.</li> <li>Expected updates require more computation, and computation is often the limiting resource in planning.</li> </ul> </li> </ul> </li> </ul> <h3 id="computational-requirements--formal-comparison-given-discrete-statesactions">Computational Requirements &amp; Formal Comparison (given discrete states/actions)</h3> <ul> <li><u>Model:</u> $\hat{p}(s', r \vert s, a)$ known.</li> <li><u>Goal:</u> Approximate $q_{*}$ (optimal action values).</li> <li> <p><u>Branching factor:</u> $b = \vert \{ s' : p(s' \vert s,a) &gt; 0 \} \vert$ (effective stochasticity).</p> </li> <li><strong>Expected Update (exact):</strong> <ul> <li><u>Computational complexity:</u> $O(b)$.</li> </ul> </li> </ul> \[\boxed{Q(s,a) \leftarrow \sum_{s', r} \hat{p}(s', r | s, a)\left[r + \gamma \max_{a'} Q(s', a')\right]}\] <ul> <li><strong>Sample Update (stochastic):</strong> <ul> <li><u>Computational complexity:</u> $O(1)$.</li> </ul> </li> </ul> \[\boxed{Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma \max_{a'} Q(S', a') - Q(s,a)\right]}\] <ul> <li>In the time budget required for 1 expected update, you can instead perform $b$ sample updates.</li> </ul> <h3 id="theoretical-comparison-empirical-analysis">Theoretical Comparison (Empirical Analysis)</h3> <ul> <li>Assume all $b$ successors are equiprobable, and initial $\vert \text{error} \vert = 1$ at $(s,a)$; successor values are assumed already correct.</li> <li><u>Expected update:</u> error $= 0$ after one update (cost: $\sim b$ units).</li> <li><u>Sample updates (assuming sample averages, i.e. $\alpha = \frac{1}{t}$):</u> error $\approx \sqrt{\frac{b-1}{bt}}$. <ul> <li> <p>For moderate $b$ (e.g.
$b = 10$) and large $b$, the error falls dramatically with only a tiny fraction of $b$ sample updates.</p> </li> <li> <p>For large $b$, the error drops sharply over the early sample updates, allowing broad updates across many $(s,a)$ pairs in the same time as one expected update:</p> \[\text{error} = \sqrt{\frac{b-1}{bt}}\] </li> <li> <p>For large $b$: $\quad \text{error} \approx \sqrt{\frac{1}{t}}$, and $\lim_{t \to \infty} \sqrt{\frac{1}{t}} = 0$.</p> \[\begin{aligned} b=1:&amp; \quad \text{error} = 0 \quad \text{for } t \geq 1 \\[6pt] b=2:&amp; \quad \text{error} = \frac{1}{\sqrt{2t}} \implies \text{error}(t=1) \approx 0.707,\ \text{error}(t=2) = 0.5 \\[6pt] b=100:&amp; \quad \text{error} \approx \frac{1}{\sqrt{t}} \implies \text{error}(t=1) \approx 0.995,\ \text{error}(t=10) \approx 0.316,\ \text{error}(t=100) \approx 0.1 \end{aligned}\] </li> </ul> </li> <li>Pros of sample updates: <ul> <li><strong>Breadth vs depth</strong>: cover more of the state space per unit of computation.</li> <li><strong>Bootstrap benefits</strong>: earlier updates improve successor value estimates for subsequent backups.</li> <li><strong>Diminishing returns</strong>: the marginal value of incorporating low-probability branches is low.</li> </ul> </li> <li>Sample updates dominate in large-scale stochastic domains where exhaustive sweeping is intractable.</li> <li>Expected updates are preferable only when: <ul> <li>Small branching factor ($b \leq 3$).</li> <li>Small state space (exact solution feasible).</li> <li>Deterministic dynamics ($b = 1$, where the two updates are equivalent).</li> </ul> </li> </ul> <hr /> <h2 id="86-trajectory-sampling">8.6 Trajectory Sampling</h2> <ul> <li>Let’s compare two ways of distributing updates: <ul> <li><strong>Exhaustive Sweeps</strong>: classical DP approach that performs sweeps through the entire state (or state-action) space, updating each state (or state-action pair) once per sweep.
Computationally inefficient, especially on large tasks (there may be no time to complete even one full sweep).</li> <li><strong>Trajectory Sampling</strong>: generate simulated trajectories by rolling out the current policy in the model, performing one-step backups along the trajectory.</li> </ul> </li> <li>2 common sampling distributions: <ul> <li>Uniform sampling of states or state-action pairs.</li> <li>On-policy distribution.</li> </ul> </li> <li>For planning updates in tabular RL, should state-action pairs be selected uniformly or according to the on-policy distribution?</li> </ul> <h3 id="formal-comparison">Formal Comparison</h3> <ul> <li><strong>Uniform distribution:</strong> <ul> <li>Cycle systematically through all $\vert S \vert \times \vert A \vert$ state-action pairs.</li> <li>Each pair receives equal computational resources.</li> <li>Complete coverage regardless of policy.</li> <li>Starting state distribution $\approx$ uniform or some fixed distribution $\mu(S_t)$.</li> </ul> </li> <li><strong>On-policy trajectory sampling:</strong> <ul> <li>Sample states $S_t \sim d^\pi$ where $d^\pi$ is the on-policy state distribution under policy $\pi$.</li> <li>Select actions $A_t \sim \pi(\cdot \vert S_t)$.</li> <li>Generate trajectories $\{S_0, A_0, S_1, A_1, \ldots\}$ following the current policy.</li> <li>Update only visited state-action pairs.</li> </ul> </li> <li><strong>Advantages of trajectory sampling:</strong> <ul> <li><strong>Computational focusing</strong>: for large state spaces where $\vert S \vert \gg$ the number of states reachable under $\pi$, trajectory sampling concentrates updates on the reachable subset.</li> <li><strong>Irrelevant state pruning</strong>: 3 categories of states emerge: <ul> <li>Initial states (starting distribution).</li> <li>States reachable under optimal control.</li> <li>Irrelevant states (never visited optimally).</li> </ul> </li> </ul> </li> <li><strong>Disadvantages/Limitations of trajectory sampling:</strong> <ul> <li>Requires a generative model to simulate
trajectories.</li> <li>May miss important states early in learning if the policy is poor.</li> <li>Can be sample-inefficient if trajectories are long.</li> </ul> </li> <li>On-policy distribution sampling is most useful for problems with large state spaces and small branching factors (cf. prioritized sweeping).</li> <li> <p>Trajectory sampling is orthogonal to prioritization. The former addresses which states to update, while the latter addresses in what order. They can be combined.</p> </li> <li>Trajectory sampling anticipates importance sampling concepts in off-policy learning and naturally extends to continuous state spaces with function approximation.</li> </ul> <hr /> <h2 id="87-real-time-dynamic-programming-rtdp">8.7 Real-Time Dynamic Programming (RTDP)</h2> <ul> <li>RTDP is an on-policy trajectory-sampling version of the value-iteration algorithm of dynamic programming (DP).</li> <li>RTDP is an asynchronous DP algorithm; <ul> <li>async DP algorithms are not organized in terms of systematic sweeps of the state set; instead,</li> <li>they update state values in any order whatsoever, using whatever values of other states happen to be available.</li> </ul> </li> <li>RTDP is basically the combination of 3 ideas: <ul> <li><strong>On-policy trajectory sampling</strong> $\Rightarrow$ follow the current greedy policy.</li> <li><strong>Asynchronous updates</strong> $\Rightarrow$ update only the states you actually visit, in any order.</li> <li><strong>Focused learning</strong> $\Rightarrow$ concentrate on <strong>“relevant states”</strong> (states on good paths to the goal).</li> </ul> </li> <li><strong>RTDP update rule:</strong></li> </ul> \[\boxed{V(S_t) \leftarrow \max_{a \in A}\left(R^a_{S_t} + \gamma \sum_{s'} P^a_{S_t s'}\, V(s')\right)}\] <ul> <li>RTDP’s relationship to other methods: <ul> <li><strong>Value Iteration</strong>: <ul> <li>VI updates all states per iteration;</li> <li>RTDP updates only states on sampled trajectories.</li> </ul> </li> <li><strong>Prioritized
Sweeping</strong>: <ul> <li>PS uses model to work backward from goals;</li> <li>RTDP follows forward trajectories.</li> </ul> </li> <li><strong>Trajectory Sampling</strong>: <ul> <li>TS can use any policy;</li> <li>RTDP uses greedy policy for sampling.</li> </ul> </li> </ul> </li> <li> <p><strong>Computational Complexity:</strong></p> <ul> <li>Traditional Value Iteration $\Rightarrow O(\vert S \vert ^2 \vert A \vert)$ per iteration.</li> <li>RTDP Trial $\Rightarrow O(L)$ where $L$ = episode length, typically $L \ll \vert S \vert$.</li> </ul> </li> <li>RTDP bridges pure planning and pure learning (focusing on relevant state space regions).</li> <li>RTDP is guaranteed to find an optimal policy for the relevant states under certain conditions: <ol> <li>The initial value of every goal state is zero.</li> <li>There exists at least one policy that guarantees that a goal state will be reached with probability one from any start state.</li> <li>All rewards for transitions from non-goal states are strictly negative.</li> <li>All the initial values are equal to or greater than their optimal values (which can be satisfied by simply setting the initial values of all states to zero).</li> </ol> </li> <li>Tasks with these properties are usually called <strong>stochastic optimal path problems</strong>. <ul> <li>RTDP can find optimal policies for these tasks with approximately 50% of the computation required by traditional sweep-based value iteration (i.e. 
dynamic programming).</li> <li>These kinds of problems are usually expressed in terms of cost minimization rather than reward maximization.</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-7-RTDP.png" alt="RTDP" /></p> <blockquote> <p><strong>State space diagram</strong>: Start states on the left, irrelevant states (unreachable from any start state under any optimal policy) in the outer region, and relevant states (reachable from some start state under some optimal policy) in the inner region.</p> </blockquote> <h3 id="heuristic-initialization-optimistic">Heuristic Initialization (Optimistic)</h3> <ul> <li>RTDP typically initializes $V$ with an admissible heuristic $h(s)$ where $h(s) \geq V^*(s)$.</li> <li>This provides optimistic values that guide exploration towards goal states.</li> </ul> <h3 id="considerations">Considerations</h3> <ul> <li>Most effective domains for RTDP are domains with: <ul> <li>Large state spaces.</li> <li>Sparse goal states.</li> <li>Clearly defined initial distribution.</li> <li>Deterministic or near-deterministic dynamics.</li> </ul> </li> </ul> <hr /> <h2 id="88-planning-at-decision-time">8.8 Planning at Decision Time</h2> <ul> <li>Planning can be used in at least 2 ways: <ul> <li><strong>Background planning</strong>: planning that occurs independently of and prior to the need for action. Here planning is used to gradually improve a policy or value function on the basis of simulated experience from a model; an action is then selected via lookup.</li> <li><strong>Decision-time planning</strong>: planning that occurs at the moment an action is required, after encountering the current state. It’s essentially planning at the time of action selection.</li> </ul> </li> <li>Decision-time planning is useful when fast response is not required and the state space is large.
When low-latency action selection is required, background planning is the better choice.</li> <li>Decision-time planning is <strong>memoryless</strong> (it discards its updates after selecting an action), but background planning is <strong>persistent</strong> (it permanently stores and accumulates learned values).</li> </ul> <hr /> <h2 id="89-heuristic-search">8.9 Heuristic Search</h2> <ul> <li>The classical state-space planning methods are decision-time planning methods collectively known as <strong>heuristic search</strong>.</li> <li>In heuristic search, for each state encountered, a large tree of possible continuations is considered. <ul> <li>The search evaluates the leaf nodes at the end of the search and backs up their values through the state-action nodes toward the current state at the root.</li> <li>The value-maximizing action from the current state is found and then selected (the values are usually discarded).</li> </ul> </li> <li>This kind of planning is effective because it focuses only on pertinent next states and actions, concentrating resources on choosing the best next action.</li> <li>Heuristic search extends the greedy policy from one-step to multi-step lookahead to obtain better action selections.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-9-Heuristic-Search.png" alt="heuristic-search" /></p> <blockquote> <p><strong>Heuristic Search diagram (selective depth-first search)</strong>: a large tree rooted at the current state, with branches for each action and subtrees for each successor.
The tree policy traverses the tree greedily, evaluating and backing up values from the leaf nodes toward the root.</p> </blockquote> <h3 id="theoretical-properties">Theoretical Properties</h3> <ol> <li><strong>Optimality horizon</strong>: for sufficiently large depth $K$ where $\gamma^K \approx 0$, the selected action approaches the optimal action $a^*(s)$.</li> <li><strong>Computational complexity</strong>: <ul> <li><u>Full tree expansion:</u> $O(b^K)$ where $b$ = branching factor.</li> <li><u>With pruning/selection:</u> $O(f(b, K))$ where $f(b, K) &lt; b^K$.</li> </ul> </li> <li><strong>Memory</strong>: $O(bK)$ with a depth-first implementation.</li> <li><strong>Backed-up value interpretation</strong>: $v_{\text{tree}}(s)$ estimates the $K$-step optimal value starting from $s$.</li> </ol> <hr /> <h2 id="810-rollout-algorithms">8.10 Rollout Algorithms</h2> <ul> <li>Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state.</li> <li> <p>They estimate action values for a given policy by averaging the returns of many simulated trajectories that start with each possible action and then follow the given policy.</p> </li> <li><strong>Key characteristics:</strong> <ul> <li>It is memoryless; it doesn’t store/update a permanent value table (no persistence).</li> <li>The rollout policy $\Pi$ is usually the current greedy policy.</li> <li>Relies on MC averaging to reduce variance (more rollouts $\to$ better estimates).</li> <li>It’s quite simple: no tree-building, no backups during rollouts.</li> </ul> </li> <li><strong>Goal</strong>: <ul> <li>Not to estimate a complete $q_*$ or $q_\pi$ for a given policy $\pi$; instead the focus is on MC estimates of action values only for each current state and for a given fixed policy called the <strong>rollout policy</strong>.</li> <li>Improve upon the rollout policy, not find the optimal policy.</li> </ul> </li> <li>Rollout algorithms
follow the <strong>policy improvement theorem</strong> by acting greedily w.r.t. $\hat{Q}(s,a)$: <ul> <li>If $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for all states $s$, then $\pi' \geq \pi$ (i.e. $\pi'$ is as good as or better than $\pi$).</li> </ul> </li> </ul> <h3 id="computational-complexity-quite-expensive-due-to-many-full-episodes">Computational Complexity (quite expensive due to many full episodes)</h3> <ul> <li>Per decision $\Rightarrow$: <ul> <li>$\vert A(s) \vert$ = number of actions to evaluate,</li> <li>$n$ = rollouts per action,</li> <li>$L$ = average episode length.</li> </ul> </li> </ul> \[\text{total cost} = O\left(|A(s)| \cdot n \cdot L\right)\] <ul> <li>The computation per decision thus depends on several factors, and balancing them is important and challenging. To handle the challenge: <ul> <li>Run many trials in parallel on separate processors (the MC trials are independent of one another).</li> <li>Truncate the simulated trajectories short of complete episodes, correcting the truncated returns by means of a stored evaluation function.</li> <li>Prune away candidate actions that are unlikely to be the best.</li> </ul> </li> <li>Rollout algorithms aren’t learning algorithms because they maintain no long-term memory of values or policies. <ul> <li>But they still use RL techniques: <strong>Monte Carlo control + Policy Improvement Theorem.</strong></li> <li>Use MC control to estimate action values via averaging the returns of a collection of sample trajectories.</li> <li>Take advantage of the policy improvement property by acting greedily w.r.t.
the estimated action values.</li> </ul> </li> </ul> <hr /> <h2 id="811-monte-carlo-tree-search-mcts">8.11 Monte Carlo Tree Search (MCTS)</h2> <ul> <li>MCTS is a rollout algorithm that is enhanced by the addition of a means for accumulating value estimates obtained from the MC simulations in order to successively direct simulations toward more highly-rewarding trajectories.</li> <li>It is a best-first search algorithm that builds a decision tree incrementally by iteratively sampling trajectories through an MDP, using statistical confidence bounds for exploration-exploitation balance.</li> <li>When the environment changes to a new state, MCTS executes as many iterations as possible before an action needs to be selected, incrementally building a tree whose root node represents the current state.</li> </ul> <p><strong>Each iteration consists of 4 operations:</strong></p> <h3 id="1-selection">1. Selection</h3> <ul> <li>Starting at the root node, a tree policy based on the action values attached to the edges of the tree traverses the tree to select a leaf node.</li> <li>Traverse tree using tree policy (typically Upper Confidence bounds for Trees, <strong>UCT</strong>):</li> </ul> \[\boxed{\text{UCT}(s,a) = \underbrace{\frac{W(s,a)}{N(s,a)}}_{\text{exploitation}} + c\underbrace{\sqrt{\frac{\ln(N(s))}{N(s,a)}}}_{\text{exploration}} = Q(s,a) + c\sqrt{\frac{\ln(N(s))}{N(s,a)}}}\] <ul> <li>where: <ul> <li>$Q(s,a) = \frac{W(s,a)}{N(s,a)}$ = average return from $(s,a)$</li> <li>$W(s,a)$ = total reward accumulated through $(s,a)$</li> <li>$N(s,a)$ = visit count for $(s,a)$</li> <li>$N(s)$ = visit count for parent state $s$</li> <li>$c$ = exploration constant (typically $\sqrt{2}$); <ul> <li>$c &lt; 0.5 \Rightarrow$ more exploitation</li> <li>$c &gt; 2 \Rightarrow$ more exploration</li> </ul> </li> </ul> </li> <li>Selection terminates when: <ul> <li>Leaf node is reached (not fully expanded).</li> <li>Terminal state is reached.</li> </ul> </li> </ul> <h3 id="2-expansion">2.
Expansion</h3> <ul> <li>On some iterations, the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions.</li> <li>Expansion strategies: <ul> <li>Single child per iteration (standard).</li> <li>All children at once (batch expansion).</li> <li>Progressive widening for continuous actions.</li> </ul> </li> </ul> <h3 id="3-simulation">3. Simulation</h3> <ul> <li>From the selected node, or from one of its newly-added child nodes (if any), simulation of a complete episode is run with actions selected by the rollout policy (actions are selected first by the tree policy and beyond the tree by the rollout policy).</li> </ul> <h3 id="4-backup">4. Backup</h3> <ul> <li>The return generated by the simulated episode is backed up to update, or to initialize, the action values attached to the edges of the tree traversed by the tree policy in this iteration of MCTS.</li> <li>Propagate simulation result up the path and update all nodes/edges on the path.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-11-MCTS.png" alt="MCTS" /></p> <blockquote> <p><strong>MCTS diagram</strong>: 4 stages shown left to right: Selection (tree policy traverses with blue arrows to a leaf), Expansion (leaf expanded), Simulation (rollout policy runs from expanded node to terminal $\Delta$), Backup (return propagated back up with blue arrows).</p> </blockquote> <ul> <li> <p>MCTS continues executing these 4 steps, starting each time at the tree’s root node, until no more time is left, or some other computational resource is exhausted.</p> </li> <li> <p>Then finally, an action from the root node (representative of the environment’s current state) is selected according to some mechanism that depends on the accumulated statistics in the tree (action with largest action value or action with largest visit count to avoid outliers).</p> </li> </ul> <h3 id="mcts-pseudocode">MCTS Pseudocode</h3> <p><strong>Main 
loop:</strong></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Initialize: root = current state S_0
for i = 1 to num_simulations:
    node = Selection(root)
    node = Expansion(node)
    Δ = Simulation(node)
    Backup(node, Δ)
return argmax_a N(S_0, a)
</code></pre></div></div> <p><strong>Expansion:</strong></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>If non-terminal leaf reached:
    select unvisited action a' ∈ A(S) \ {visited actions}
    create child node S' ~ p(·|S, a')
    add to tree
    return S'
</code></pre></div></div> <p><strong>Simulation:</strong></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S ← expanded_node
G ← 0
t ← 0
while S non-terminal:
    a ~ Π_default(·|S)
    r, S' ~ p(·|S, a)
    G ← G + γᵗ r
    S ← S'
    t ← t + 1
return G
</code></pre></div></div> <p><strong>Backup:</strong></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while node ≠ null:
    N(node) ← N(node) + 1
    W(node) ← W(node) + Δ
    node ← node.parent
</code></pre></div></div> <h3 id="computational-complexity">Computational Complexity</h3> <ul> <li>Per iteration (where $d$ = tree depth, $L$ = episode length): <ul> <li>Selection: $O(d)$,</li> <li>Expansion: $O(1)$,</li> <li>Simulation: $O(L)$,</li> <li>Backup: $O(d)$</li> </ul> </li> </ul> \[\Rightarrow \text{total: } O\!\left(n \cdot (d + L)\right) \text{ for } n \text{ simulations}\] <h3 id="summary-of-mcts">Summary of MCTS</h3> <ul> <li>MCTS is a decision-time planning algorithm based on Monte Carlo control applied to simulations that start from the root state.</li> <li>MCTS benefits from online, incremental, sample-based value estimation &amp; policy improvement.</li> <li>MCTS saves action-value estimates attached to the tree edges and updates them using RL’s sample updates.</li> <li>MCTS, via incremental tree expansion, effectively grows a
lookup table to store a partial action-value function.</li> <li>MCTS thus avoids the problem of globally approximating an action-value function while it retains the benefit of using past experience to guide exploration.</li> </ul> <h3 id="pros--cons-of-mcts">Pros &amp; Cons of MCTS</h3> <ul> <li><strong>Pros</strong>: <ul> <li>anytime algorithm,</li> <li>asymmetric tree growth,</li> <li>no domain heuristic required,</li> <li>handles high branching factors.</li> </ul> </li> <li><strong>Cons</strong>: <ul> <li>high computational cost,</li> <li>may miss deep forced sequences,</li> <li>random rollouts can be weak in tactical domains,</li> <li>finite simulations miss long-term consequences.</li> </ul> </li> </ul> <hr /> <h2 id="812-summary">8.12 Summary</h2> <ul> <li>Planning requires a model of the environment. <ul> <li>A <strong>distribution model</strong> consists of the probabilities of next states and rewards for possible actions. Dynamic Programming requires a distribution model because it uses expected updates, which involve computing expectations over all the possible next states and rewards.</li> <li>A <strong>sample model</strong> is required for simulating interaction with the environment using sample updates.</li> <li>Sample models are generally much easier to obtain than distribution models.</li> </ul> </li> <li>There exists a close relationship between planning optimal behavior and learning optimal behavior: <ul> <li>Both involve estimating the same value functions.</li> <li>Both naturally update the estimates incrementally, in a long series of small backup operations.</li> <li>Any of the learning methods can be converted into planning methods simply by applying them to simulated rather than real experience (model-based rather than model-free).</li> </ul> </li> <li> <p>It is straightforward to integrate incremental planning methods with acting and model-learning.
Planning, acting and model-learning interact in a circular fashion, each producing what the others need to improve. All processes naturally proceed asynchronously and in parallel.</p> </li> <li><strong>Dimensions of variation among state-space planning methods:</strong> <ul> <li><strong>Size of updates</strong>: the smaller the updates, the more incremental the planning methods can be. One-step updates, as in Dyna, are the smallest updates.</li> <li><strong>Distribution of updates</strong>: primarily regarded as the focus of search. <ul> <li><strong>Prioritized sweeping</strong> focuses backward on the predecessors of recently changed states.</li> <li><strong>On-policy trajectory sampling</strong> focuses on states/state-action pairs that are likely to be encountered under the current policy.</li> </ul> </li> </ul> </li> <li> <p><strong>Real-time DP (RTDP),</strong> an on-policy trajectory sampling version of value iteration, illustrates some of the advantages that focusing on the relevant regions of the state space has over conventional sweep-based value iteration (exhaustive sweeps).</p> </li> <li>Planning can also focus forward from pertinent states, such as states actually encountered during an agent-environment interaction, and the most important form of this is when <strong>planning is done at decision time</strong> as part of the action-selection process.
<ul> <li>Another example of this is <strong>classical heuristic search.</strong></li> <li>Other examples are <strong>rollout algorithms &amp; Monte Carlo Tree Search (MCTS)</strong> that both benefit from online, incremental, sample-based value estimation and policy improvement.</li> </ul> </li> </ul> <hr /> <h2 id="813-summary-of-part-i-dimensions">8.13 Summary of Part I: Dimensions</h2> <p><strong>Two axes for the update diagram:</strong></p> <blockquote> <p>\(\text{HORIZONTAL (L to R): sample backups} \xrightarrow{\text{width of update}} \text{full/expected backups}\) \(\text{VERTICAL (Top to Bottom): shallow backups} \xrightarrow{\text{depth/length of update}} \text{deep backups}\)</p> </blockquote> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-13-summary-unified-rl.png" alt="Unified View of RL depicting a slice through the space of RL methods" /></p> <blockquote> <p><strong>Unified View of RL</strong> depicting a slice through the space of RL methods</p> </blockquote> <ul> <li>Each RL idea presented so far can be viewed as a dimension along which methods vary. 
The set of the dimensions spans a large space of possible methods (quasi-infinite possibilities).</li> <li>All the methods discussed so far in this book have 3 key ideas in common: <ul> <li>They all seek to estimate value functions.</li> <li>They all operate by backing up values along actual or possible state trajectories.</li> <li>They all follow the general strategy of <strong>generalized policy iteration (GPI).</strong> This means they maintain an approximate value function and an approximate policy, and they continually try to improve each on the basis of the other.</li> </ul> </li> <li>3 important <strong>dimensions</strong> along which RL methods vary: <ul> <li><strong>Width of updates</strong>: sample updates (based on a sample trajectory) vs expected updates (based on a distribution of possible trajectories).</li> <li><strong>Depth of updates</strong>: degree of bootstrapping ($\lambda$).</li> <li><strong>On-policy vs off-policy methods</strong>.</li> </ul> </li> <li><strong>Other dimensions along which RL methods vary:</strong> <ul> <li><strong>Definition of return</strong>: is the task episodic or continuing, discounted or undiscounted?</li> <li><strong>Action values vs state values vs afterstate values</strong>.</li> <li><strong>Action selection/exploration</strong>: how are actions selected to ensure a suitable exploration/exploitation tradeoff? Simple ways considered are $\varepsilon$-greedy, optimistic initialization, soft-max, upper confidence bound (UCB).</li> <li><strong>Synchronous vs asynchronous</strong>: are the updates for all states performed simultaneously or one-by-one in some order?</li> <li><strong>Real vs simulated experience</strong>.</li> <li><strong>Location of updates</strong>: what states or state-action pairs should be updated? 
Model-free methods can choose only among encountered states, but model-based methods can choose arbitrarily.</li> <li><strong>Timing of updates</strong>: should updates be done as part of action selection, or only afterward?</li> <li><strong>Memory for updates</strong>: how long should updated values be retained? <ul> <li>Should they be retained permanently? (<strong>persistence</strong>)</li> <li>Or only while computing an action selection and then discarded? (<strong>memoryless</strong>)</li> </ul> </li> </ul> </li> <li>These dimensions are <strong>neither exhaustive nor mutually exclusive.</strong> e.g. Dyna methods use both real and simulated experience to affect the same value function.</li> <li> <p>These dimensions constitute a coherent set of ideas for description &amp; exploration of a wide space of possible methods.</p> </li> <li>The most important dimension not mentioned or covered yet is that of <strong>function approximation</strong>: <ul> <li>Function approximation can be viewed as an orthogonal spectrum of possibilities ranging from <strong>tabular methods</strong> at one extreme through <strong>state aggregation</strong>, a variety of <strong>linear methods</strong>, and then a diverse set of <strong>non-linear methods</strong>.</li> </ul> </li> </ul> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026RLsuttonBartoCh08notes</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Sutton &amp; Barto, Ch. 
08: Planning &amp; Learning with Tabular Methods (Personal Notes)"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Feb"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/02/24/rl-sutton-barto-notes-ch008/"</span> <span class="p">}</span> </code></pre></div></div> <hr /> Tue, 24 Feb 2026 00:00:00 +0000 https://chizkidd.github.io//2026/02/24/rl-sutton-barto-notes-ch008/ https://chizkidd.github.io//2026/02/24/rl-sutton-barto-notes-ch008/ Architectural and Mathematical Foundations of Machine Learning: A Rigorous Synthesis of Theory, Geometry, and Implementation <hr /> <blockquote> <p><strong>Abstract:</strong> The maturation of machine learning from a subfield of heuristic-driven statistics into a cornerstone of modern computational science has necessitated a re-evaluation of its pedagogical foundations. Modern practitioners often rely on high-level libraries that abstract away the underlying mathematics, but as evidenced by critical reviews from the research community, this abstraction often leads to a superficial understanding of model dynamics, failure modes, and optimization bottlenecks.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> A robust understanding of machine learning is not merely a collection of isolated equations but a synthesis of linear algebra, information theory, multivariate calculus, and probabilistic estimation. 
This report provides an exhaustive analysis of these mathematical pillars, correcting common technical misconceptions and bridging the gap between theoretical derivation and numerically stable implementation.</p> </blockquote> <hr /> <h2 id="the-geometry-of-representation-linear-and-affine-transformations">The Geometry of Representation: Linear and Affine Transformations</h2> <p>In the discourse of deep learning, the term “linear layer” is frequently used as a shorthand for the fundamental operation of weight-input multiplication followed by a bias shift. However, a rigorous geometric analysis reveals that the operations defining neural networks are more accurately described as affine transformations.<sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> This distinction is not merely semantic; it defines the topology of the latent space and the constraints of the optimization landscape.</p> <h3 id="defining-the-affine-space">Defining the Affine Space</h3> <p>A linear transformation between two vector spaces must satisfy two core properties: additivity and homogeneity. Geometrically, this requires that the transformation fixes the origin; the zero vector in the input space must map to the zero vector in the output space. In a standard neural network layer, the operation is defined as $y = Ax + b$. While the term $Ax$ represents a linear transformation (scaling, rotating, or shearing the input $x$), the addition of the bias vector $b$ shifts the resulting vector away from the origin.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p> <p>This shift renders the transformation affine. An affine transformation is the composition of a linear mapping followed by a translation. 
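</p>

<p>The distinction can be checked numerically; the following is a minimal sketch using NumPy, with arbitrary random matrices standing in for a trained layer’s weights:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # weight matrix (linear part)
b = rng.standard_normal(3)        # bias vector (translation)
x = rng.standard_normal(4)
y = rng.standard_normal(4)

def linear(v):
    return A @ v

def affine(v):
    return A @ v + b

# A linear map fixes the origin and preserves addition.
assert np.allclose(linear(np.zeros(4)), np.zeros(3))
assert np.allclose(linear(x + y), linear(x) + linear(y))

# The affine map sends the origin to b and breaks additivity
# (affine(x + y) differs from affine(x) + affine(y) by exactly b).
assert np.allclose(affine(np.zeros(4)), b)
assert not np.allclose(affine(x + y), affine(x) + affine(y))
```

<p>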
In high-dimensional spaces, the bias term $b$ is what allows a hyperplane to exist in any position within the space, rather than being forced to pass through the coordinate center.<sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> Without this capability, the expressive power of a neural network would be severely diminished, as it would be unable to model datasets where the decision boundary does not intersect the origin.</p> <table> <thead> <tr> <th>Feature</th> <th>Linear Transformation ($Ax$)</th> <th>Affine Transformation ($Ax+b$)</th> </tr> </thead> <tbody> <tr> <td>Origin Preservation</td> <td>Maps zero vector to zero vector ($f(0) = 0$)</td> <td>Shifts the origin by the vector $b$ <br />($f(0) = b$)</td> </tr> <tr> <td>Algebraic Properties</td> <td>Satisfies $f(x+y) = f(x) + f(y)$ and $f(cx) = cf(x)$</td> <td>Violates both unless $b = 0$<sup id="fnref:2:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></td> </tr> <tr> <td>Geometric Action</td> <td>Rotation, scaling, reflection, shearing</td> <td>Rotation/Scaling followed by Translation</td> </tr> <tr> <td>Machine Learning Role</td> <td>Feature interaction and dimensionality change</td> <td>Decision boundary positioning and normalization</td> </tr> </tbody> </table> <h3 id="spectral-decomposition-and-the-warping-of-space">Spectral Decomposition and the Warping of Space</h3> <p>Beyond simple layer operations, the internal structure of data matrices is analyzed through spectral decomposition. Eigendecomposition and Singular Value Decomposition (SVD) provide the mathematical tools to understand how a model “views” the variance of its input. 
An eigenvector $v$ of a square matrix $A$ is a characteristic direction that, under the transformation $A$, is only scaled by a factor $\lambda$, termed the eigenvalue: $Av = \lambda v$.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> Geometrically, if we align our coordinate system with these eigenvectors, the matrix $A$ simply acts as a scaling factor along those axes.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p> <p>However, eigendecomposition is limited to square matrices and often lacks orthogonality in the basis vectors unless the matrix is symmetric.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> Singular Value Decomposition (SVD) generalizes this concept to any $m \times n$ matrix $A$, decomposing it into $A = U\Sigma V^T$. This decomposition reveals a three-step geometric process:</p> <ol> <li><strong>Input Rotation</strong>: The matrix $V^T$ rotates the input space to align with the principal axes of the data.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></li> <li><strong>Stretching</strong>: The diagonal matrix $\Sigma$ scales the data along these axes according to the singular values $\sigma_i$.<sup id="fnref:4:1" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></li> <li><strong>Output Rotation</strong>: The matrix $U$ rotates the scaled data into the output coordinate system.<sup id="fnref:6:1" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></li> </ol> <p>In the context of dimensionality reduction, SVD allows for the optimal projection of data onto a lower-dimensional subspace. 
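</p>

<p>This three-step picture can be checked directly against NumPy’s SVD routine; the rectangular matrix below is random and purely illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # arbitrary rectangular matrix

# Thin SVD: A = U @ diag(s) @ Vt, with s sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

x = rng.standard_normal(3)

step1 = Vt @ x     # 1. input rotation
step2 = s * step1  # 2. axis-aligned stretching by the singular values
step3 = U @ step2  # 3. output rotation

# The composition reproduces the original transformation.
assert np.allclose(step3, A @ x)
assert np.all(s[:-1] >= s[1:])

# Truncating to the top k singular values gives the best rank-k
# approximation in the Frobenius norm; the residual is exactly the
# norm of the discarded singular values.
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]
assert np.isclose(np.linalg.norm(A - A_k), np.sqrt(np.sum(s[k:] ** 2)))
```

<p>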
By retaining only the largest $k$ singular values in $\Sigma$ and setting the rest to zero, we minimize the reconstruction error in terms of the Frobenius norm, a principle that underlies both Principal Component Analysis (PCA) and modern matrix completion algorithms.<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p> <h2 id="information-theory-as-the-metric-of-learning">Information Theory as the Metric of Learning</h2> <p>While linear algebra defines the transformations, information theory provides the objective functions used to measure the “success” of those transformations. In machine learning, the goal is often to minimize the distance between a predicted probability distribution and the true data distribution.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p> <h3 id="surprisal-and-the-derivation-of-entropy">Surprisal and the Derivation of Entropy</h3> <p>The fundamental unit of information theory is “surprisal” or self-information. Intuitively, an event that is certain carries no information, whereas a rare event provides significant insight when it occurs. This is quantified by the negative logarithm of the probability $p$ of an event: $I(x) = -\log p(x)$.<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> The logarithmic form is essential because it ensures that information is additive for independent events: $I(x,y) = I(x) + I(y)$.<sup id="fnref:7:1" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p> <p>Entropy $H(P)$ is the expected value of surprisal across an entire distribution $P$. 
It represents the average amount of uncertainty or “average surprise” inherent in the distribution<sup id="fnref:7:2" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>:</p> \[H(P) = -\sum_{x} P(x) \log P(x)\] <p>A uniform distribution maximizes entropy, as it represents a state where every outcome is equally likely, providing the highest level of average uncertainty. In decision trees, for example, entropy is used to measure the “purity” of a node; a node with low entropy contains samples mostly from a single class, indicating high certainty in the prediction.<sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">9</a></sup></p> <h3 id="cross-entropy-the-cost-of-misaligned-models">Cross-Entropy: The Cost of Misaligned Models</h3> <p>When we train a model, we generate a predicted distribution $Q$ intended to approximate the true distribution $P$. Cross-entropy $H(P,Q)$ measures the average surprisal we experience if we encode data from $P$ using the “codebook” optimized for $Q$.<sup id="fnref:8:1" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> Mathematically:</p> \[H(P, Q) = -\sum_{x} P(x) \log Q(x)\] <p>Cross-entropy is a staple loss function in classification tasks. 
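</p>

<p>These definitions translate into a few lines of NumPy. The distributions below are made up for illustration, and the $\lim_{p \to 0} p \log p = 0$ limit is handled explicitly by masking out zero-probability outcomes:</p>

```python
import numpy as np

def entropy(p):
    """Average surprisal of p, using the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # drop zero-probability outcomes explicitly
    return -np.sum(nz * np.log(nz))

def cross_entropy(p, q):
    """Average surprisal when data from p is coded with model q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = np.array([0.5, 0.25, 0.25])  # "true" distribution (illustrative)
q = np.array([0.4, 0.4, 0.2])    # model's imperfect guess

# Gibbs' inequality: coding with the wrong model can only cost more,
# so H(P, Q) >= H(P), with equality when q matches p.
assert cross_entropy(p, q) >= entropy(p)
assert np.isclose(cross_entropy(p, p), entropy(p))

# A certain outcome carries no information.
assert entropy([1.0, 0.0]) == 0.0
```

<p>Note that <code class="language-plaintext highlighter-rouge">cross_entropy(p, q)</code> and <code class="language-plaintext highlighter-rouge">cross_entropy(q, p)</code> generally differ, as the definition is not symmetric in its arguments.</p>

<p>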
It is inherently asymmetric ($H(P,Q) \neq H(Q,P)$), a property that reflects the physical reality of communication: the cost of using a wrong model depends on the direction in which the error occurs.<sup id="fnref:7:3" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup> Specifically, if the model $Q$ predicts a zero probability for an event that actually occurs in $P$, the cross-entropy becomes infinite, reflecting “infinite surprise” and forcing the model to never be “certainly wrong”.<sup id="fnref:7:4" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p> <h3 id="kullback-leibler-kl-divergence">Kullback-Leibler (KL) Divergence</h3> <p>KL Divergence $D_{KL}(P \| Q)$ isolates the “extra” surprisal caused by the model’s inaccuracy. It is defined as the difference between cross-entropy and the inherent entropy of the data<sup id="fnref:8:2" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>:</p> \[D_{KL}(P \| Q) = H(P, Q) - H(P) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}\] <p>Since the entropy of the data $H(P)$ is constant with respect to the model parameters, minimizing cross-entropy is functionally identical to minimizing KL divergence.<sup id="fnref:7:5" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup> This relationship is the backbone of Maximum Likelihood Estimation (MLE), as minimizing the divergence between the data and the model is equivalent to finding the parameters that make the observed data most probable.<sup id="fnref:7:6" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p> <table> <thead> <tr> <th>Metric</th> <th>Formula</th> <th>Intuition</th> <th>Application</th> </tr> </thead> <tbody> <tr> <td>Entropy</td> <td>$H(P) = -\sum P(x) \log P(x)$</td> <td>Average uncertainty of a single source</td> <td>Data compression, decision tree splitting</td> </tr> <tr> <td>Cross-Entropy</td> <td>$H(P,Q) = -\sum P(x) \log Q(x)$</td> 
<td>Total cost of using model $Q$ for data $P$</td> <td>Loss function for classifiers<sup id="fnref:8:3" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup></td> </tr> <tr> <td>KL Divergence</td> <td>$D_{KL}(P | Q) = \sum P(x) \log \frac{P(x)}{Q(x)}$</td> <td>“Distance” or extra cost between distributions</td> <td>Variational inference, GANs, RL regularization</td> </tr> </tbody> </table> <h2 id="optimization-dynamics-jacobians-hessians-and-the-curvature-of-loss">Optimization Dynamics: Jacobians, Hessians, and the Curvature of Loss</h2> <p>Optimization in machine learning is essentially a navigation problem through a high-dimensional landscape. While the gradient provides the direction of the slope, higher-order derivatives provide the context of that slope: its sensitivity and its curvature.<sup id="fnref:15" role="doc-noteref"><a href="#fn:15" class="footnote" rel="footnote">10</a></sup></p> <h3 id="the-jacobian-first-order-sensitivity">The Jacobian: First-Order Sensitivity</h3> <p>For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix $J$ contains all first-order partial derivatives. Each element $J_{ij} = \frac{\partial f_i}{\partial x_j}$ represents how the $i$-th output changes with respect to the $j$-th input.<sup id="fnref:17" role="doc-noteref"><a href="#fn:17" class="footnote" rel="footnote">11</a></sup> In neural network training, the Jacobian is the fundamental object of backpropagation.</p> <p>A common misunderstanding in technical literature is the classification of backpropagation itself. 
As noted in expert feedback, backpropagation is not an optimization algorithm like Gradient Descent; rather, it is a computationally efficient method for calculating the Jacobian through the application of the chain rule.<sup id="fnref:1:2" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> The efficiency of backpropagation stems from its ability to reuse intermediate partial derivatives, avoiding the combinatorial explosion that would occur if each path through the network were differentiated independently.<sup id="fnref:17:1" role="doc-noteref"><a href="#fn:17" class="footnote" rel="footnote">11</a></sup></p> <h3 id="the-hessian-and-the-topology-of-generalization">The Hessian and the Topology of Generalization</h3> <p>While the Jacobian tells us where to move, the Hessian matrix $H$ (the second derivative) tells us the shape of the area we are moving through. The Hessian is a square matrix of second-order partial derivatives: $H_{ij} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}$.<sup id="fnref:16" role="doc-noteref"><a href="#fn:16" class="footnote" rel="footnote">12</a></sup> The eigenvalues of the Hessian at a local minimum define the “sharpness” or “flatness” of that minimum.<sup id="fnref:19" role="doc-noteref"><a href="#fn:19" class="footnote" rel="footnote">13</a></sup></p> <p>The “Flat Minimum” hypothesis suggests that minima with low curvature (low Hessian eigenvalues) are more likely to generalize to unseen data.<sup id="fnref:21" role="doc-noteref"><a href="#fn:21" class="footnote" rel="footnote">14</a></sup> The intuition is that a flat minimum represents a region of parameter space where small perturbations in the weights (caused by noise in the data or finite precision) do not significantly increase the loss. 
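</p>

<p>A toy illustration of this intuition, using two quadratic bowls whose Hessian eigenvalues are chosen schematically (100 for a “sharp” minimum, 0.1 for a “flat” one):</p>

```python
import numpy as np

# Quadratic losses L(w) = 0.5 * w^T H w, both minimized at w = 0 with loss 0.
H_sharp = np.diag([100.0, 100.0])  # large eigenvalues: steep valley
H_flat = np.diag([0.1, 0.1])       # small eigenvalues: broad plateau

def loss(H, w):
    return 0.5 * w @ H @ w

# Perturb the minimizer slightly, as data noise or finite precision might.
eps = np.array([0.1, 0.1])

sharp_penalty = loss(H_sharp, eps)  # loss jumps to 1.0
flat_penalty = loss(H_flat, eps)    # loss barely moves (0.001)

assert sharp_penalty > 100 * flat_penalty
```

<p>The same weight perturbation costs a thousand times more loss in the sharp bowl, which is exactly the sensitivity the eigenvalue table below summarizes.</p>

<p>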
In contrast, a “sharp” minimum is highly sensitive; a slight shift in the data distribution might move the “true” minimum slightly, causing the loss for the sharp-minimum parameters to skyrocket.<sup id="fnref:21:1" role="doc-noteref"><a href="#fn:21" class="footnote" rel="footnote">14</a></sup></p> <table> <thead> <tr> <th>Hessian Eigenvalue Status</th> <th>Geometric Interpretation</th> <th>Generalization Outcome</th> </tr> </thead> <tbody> <tr> <td>Large Eigenvalues</td> <td>Sharp, steep “valley”</td> <td>High sensitivity, prone to overfitting<sup id="fnref:19:1" role="doc-noteref"><a href="#fn:19" class="footnote" rel="footnote">13</a></sup></td> </tr> <tr> <td>Small Eigenvalues</td> <td>Broad, flat “plateau”</td> <td>Robust to noise, better generalization<sup id="fnref:20" role="doc-noteref"><a href="#fn:20" class="footnote" rel="footnote">15</a></sup></td> </tr> <tr> <td>Negative Eigenvalues</td> <td>Surface curves downward (Max/Saddle)</td> <td>Unstable, gradient descent will move away</td> </tr> <tr> <td>Zero Eigenvalues</td> <td>Function is locally linear</td> <td>Inconclusive; often indicates overparameterization</td> </tr> </tbody> </table> <p>Advanced optimization algorithms, such as Sharpness-Aware Minimization (SAM), target this curvature without forming the Hessian explicitly: they seek parameter values whose entire neighborhood has low loss, rather than just a single point.<sup id="fnref:22" role="doc-noteref"><a href="#fn:22" class="footnote" rel="footnote">16</a></sup> This shift from point-wise optimization to neighborhood optimization marks a significant trend in improving the robustness of Large Language Models (LLMs).<sup id="fnref:19:2" role="doc-noteref"><a href="#fn:19" class="footnote" rel="footnote">13</a></sup></p> <h2 id="statistical-frameworks-mle-map-and-the-bayesian-paradigm">Statistical Frameworks: MLE, MAP, and the Bayesian Paradigm</h2> <p>The process of “learning” from data is fundamentally an exercise in statistical estimation. 
Machine learning models typically operate under one of two paradigms: Frequentist or Bayesian.<sup id="fnref:24" role="doc-noteref"><a href="#fn:24" class="footnote" rel="footnote">17</a></sup></p> <h3 id="maximum-likelihood-estimation-mle">Maximum Likelihood Estimation (MLE)</h3> <p>MLE is the Frequentist’s primary tool. It assumes that there is a fixed, “true” parameter $\theta$ and seeks the value that makes the observed data $D$ most probable:</p> \[\hat{\theta}_{MLE} = \arg\max_{\theta} P(D|\theta)\] <p>In practice, we maximize the log-likelihood to transform product-based probabilities into summation-based losses, which are easier to differentiate and less prone to numerical underflow.<sup id="fnref:26" role="doc-noteref"><a href="#fn:26" class="footnote" rel="footnote">18</a></sup> MLE is effective for large datasets where the data itself provides enough signal to overcome initial uncertainty, but it is notoriously prone to overfitting in high-dimensional settings with sparse data.<sup id="fnref:26:1" role="doc-noteref"><a href="#fn:26" class="footnote" rel="footnote">18</a></sup></p> <h3 id="maximum-a-posteriori-map-and-regularization">Maximum A Posteriori (MAP) and Regularization</h3> <p>MAP estimation adopts a Bayesian stance, treating the parameter $\theta$ as a random variable with its own prior distribution $P(\theta)$. 
Using Bayes’ Theorem:</p> \[P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}\] \[\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta|D) = \arg\max_{\theta} [P(D|\theta)P(\theta)]\] <p>The inclusion of the prior $P(\theta)$ acts as a “regularizer.” For instance, assuming a Gaussian prior centered at zero is mathematically equivalent to $L_2$ regularization (Weight Decay), while a Laplacian prior yields $L_1$ regularization (Sparsity).<sup id="fnref:26:2" role="doc-noteref"><a href="#fn:26" class="footnote" rel="footnote">18</a></sup> MAP provides a bridge between pure data-driven learning and the incorporation of domain knowledge, acting as the “experienced analyst” who balances new evidence against historical trends.<sup id="fnref:28" role="doc-noteref"><a href="#fn:28" class="footnote" rel="footnote">19</a></sup></p> <table> <thead> <tr> <th>Aspect</th> <th>Maximum Likelihood Estimation (MLE)</th> <th>Maximum A Posteriori (MAP)</th> </tr> </thead> <tbody> <tr> <td>Philosophy</td> <td>Frequentist: $\theta$ is fixed but unknown</td> <td>Bayesian: $\theta$ is a random variable</td> </tr> <tr> <td>Prior Used?</td> <td>No<sup id="fnref:24:1" role="doc-noteref"><a href="#fn:24" class="footnote" rel="footnote">17</a></sup></td> <td>Yes ($P(\theta)$)<sup id="fnref:25" role="doc-noteref"><a href="#fn:25" class="footnote" rel="footnote">20</a></sup></td> </tr> <tr> <td>Regularization</td> <td>None (unless explicit)</td> <td>Implicit via the prior distribution<sup id="fnref:26:3" role="doc-noteref"><a href="#fn:26" class="footnote" rel="footnote">18</a></sup></td> </tr> <tr> <td>Data Sensitivity</td> <td>High; prone to overfitting on small sets</td> <td>Lower; the prior stabilizes estimates<sup id="fnref:28:1" role="doc-noteref"><a href="#fn:28" class="footnote" rel="footnote">19</a></sup></td> </tr> <tr> <td>Convergence</td> <td>Converges to MAP as data size $\to \infty$</td> <td>Incorporates prior knowledge for small $n$</td> </tr> </tbody> </table> <h2 
id="architecture-dynamics-softmax-attention-and-implicit-mappings">Architecture Dynamics: Softmax, Attention, and Implicit Mappings</h2> <p>Modern neural architectures rely on specific functional forms to control the flow of information and the stability of gradients. Two of the most critical are the Softmax activation and the Attention mechanism.</p> <h3 id="the-jacobian-of-softmax-and-the-backpropagation-fusion">The Jacobian of Softmax and the Backpropagation Fusion</h3> <p>The Softmax function $\sigma(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ is the standard output for multi-class classification. A common technical oversight in tutorial literature is failing to explain the Jacobian of Softmax. Because each output $\sigma_i$ depends on every input $z_j$ (due to the shared denominator), the derivative is not a simple vector but a matrix.<sup id="fnref:13" role="doc-noteref"><a href="#fn:13" class="footnote" rel="footnote">21</a></sup></p> <ul> <li><strong>Diagonal elements</strong>: $\frac{\partial \sigma_i}{\partial z_i} = \sigma_i(1 - \sigma_i)$</li> <li><strong>Off-diagonal elements</strong>: $\frac{\partial \sigma_i}{\partial z_j} = -\sigma_i \sigma_j$</li> </ul> <p>However, when Softmax is combined with the Categorical Cross-Entropy loss, the gradient of the entire block with respect to the input $z$ simplifies to $\sigma - y$, where $y$ is the one-hot encoded ground truth.<sup id="fnref:13:1" role="doc-noteref"><a href="#fn:13" class="footnote" rel="footnote">21</a></sup> This simplicity is why the combination is ubiquitous in deep learning frameworks; it provides a clean, linear error signal $(\sigma - y)$ that directly represents the model’s confidence error.<sup id="fnref:18" role="doc-noteref"><a href="#fn:18" class="footnote" rel="footnote">22</a></sup></p> <h3 id="the-attention-mechanism-a-retrieval-framework">The Attention Mechanism: A Retrieval Framework</h3> <p>The Attention mechanism, particularly in the Transformer architecture, revolutionized 
sequential modeling by replacing fixed-length memory with a dynamic retrieval system. This system is defined by three vectors: Query ($Q$), Key ($K$), and Value ($V$).<sup id="fnref:30" role="doc-noteref"><a href="#fn:30" class="footnote" rel="footnote">23</a></sup></p> <p>The intuition is analogous to a library search:</p> <ul> <li><strong>Query ($Q$)</strong>: The search term or information you are currently looking for.<sup id="fnref:30:1" role="doc-noteref"><a href="#fn:30" class="footnote" rel="footnote">23</a></sup></li> <li><strong>Key ($K$)</strong>: The metadata or “index” on the spine of every book.<sup id="fnref:31" role="doc-noteref"><a href="#fn:31" class="footnote" rel="footnote">24</a></sup></li> <li><strong>Value ($V$)</strong>: The actual content or “knowledge” inside the book.<sup id="fnref:32" role="doc-noteref"><a href="#fn:32" class="footnote" rel="footnote">25</a></sup></li> </ul> <p>The attention weight is computed by measuring the compatibility (dot product) between $Q$ and $K$. After scaling and applying Softmax, these weights determine how much of each $V$ is aggregated into the final representation.<sup id="fnref:31:1" role="doc-noteref"><a href="#fn:31" class="footnote" rel="footnote">24</a></sup> The “Scaled” Dot-Product Attention includes a factor of $\frac{1}{\sqrt{d_k}}$ to prevent the dot products from growing so large that the Softmax function enters a region of near-zero gradients, which would stall learning.<sup id="fnref:32:1" role="doc-noteref"><a href="#fn:32" class="footnote" rel="footnote">25</a></sup></p> <h3 id="kernel-machines-and-the-mapping-paradox">Kernel Machines and the Mapping Paradox</h3> <p>Support Vector Machines (SVMs) and kernel-based methods provide an alternative to deep learning’s explicit feature engineering. 
The “Kernel Trick” allows a model to operate in an implicitly high-dimensional space without ever actually computing the coordinates in that space.<sup id="fnref:35" role="doc-noteref"><a href="#fn:35" class="footnote" rel="footnote">26</a></sup></p> <p>By reformulating the optimization problem into its “Dual Form,” the objective depends only on the dot products between inputs: $\langle x_i, x_j \rangle$.<sup id="fnref:37" role="doc-noteref"><a href="#fn:37" class="footnote" rel="footnote">27</a></sup> Replacing this dot product with a kernel function $K(x_i, x_j)$ effectively maps the data into a high-dimensional feature space where it may be linearly separable.<sup id="fnref:35:1" role="doc-noteref"><a href="#fn:35" class="footnote" rel="footnote">26</a></sup> For example, the Radial Basis Function (RBF) kernel corresponds to an infinite-dimensional feature space, yet it can be computed with a simple exponential function in the original input space.<sup id="fnref:36" role="doc-noteref"><a href="#fn:36" class="footnote" rel="footnote">28</a></sup></p> <table> <thead> <tr> <th>Kernel Type</th> <th>Function $K(x,y)$</th> <th>Geometric Space</th> </tr> </thead> <tbody> <tr> <td>Linear</td> <td>$x^T y$</td> <td>Original input space</td> </tr> <tr> <td>Polynomial</td> <td>$(x^T y + c)^d$</td> <td>Finite-dimensional feature combinations</td> </tr> <tr> <td>Gaussian RBF</td> <td>$\exp(-\gamma |x-y|^2)$</td> <td>Infinite-dimensional space<sup id="fnref:38" role="doc-noteref"><a href="#fn:38" class="footnote" rel="footnote">29</a></sup></td> </tr> <tr> <td>Sigmoid</td> <td>$\tanh(\alpha x^T y + c)$</td> <td>Relates SVMs to Neural Networks</td> </tr> </tbody> </table> <h2 id="generative-modeling-variational-inference-and-diffusion">Generative Modeling: Variational Inference and Diffusion</h2> <p>The frontier of machine learning math is currently dominated by generative models, which require estimating the underlying probability density of high-dimensional data.</p> <h3 
id="variational-autoencoders-vaes-and-the-elbo">Variational Autoencoders (VAEs) and the ELBO</h3> <p>VAEs treat generation as a latent variable problem: we assume data $x$ is generated from a hidden code $z$. The true posterior $P(z \mid x)$ is intractable, so we approximate it with $Q(z \mid x)$ (the encoder).<sup id="fnref:39" role="doc-noteref"><a href="#fn:39" class="footnote" rel="footnote">30</a></sup> To train this, we maximize the Evidence Lower Bound (ELBO)<sup id="fnref:12" role="doc-noteref"><a href="#fn:12" class="footnote" rel="footnote">31</a></sup>:</p> \[\text{ELBO} = \mathbb{E}_{Q(z|x)}[\log P(x|z)] - D_{KL}(Q(z|x) \| P(z))\] <p>The first term is the <strong>Reconstruction Error</strong>, ensuring the decoder can recreate the input from the code. The second is the <strong>KL Regularizer</strong>, which forces the latent codes to follow a standard Gaussian distribution.<sup id="fnref:41" role="doc-noteref"><a href="#fn:41" class="footnote" rel="footnote">32</a></sup> This ensures the latent space is well-behaved, allowing us to sample new points and generate realistic data.</p> <h3 id="diffusion-models-score-based-generative-dynamics">Diffusion Models: Score-Based Generative Dynamics</h3> <p>Diffusion models represent a paradigm shift. Rather than learning a mapping or a lower bound, they learn to reverse a stochastic process.<sup id="fnref:43" role="doc-noteref"><a href="#fn:43" class="footnote" rel="footnote">33</a></sup> The forward process gradually destroys data by adding Gaussian noise until the sample is pure noise.<sup id="fnref:45" role="doc-noteref"><a href="#fn:45" class="footnote" rel="footnote">34</a></sup></p> <p>The model is trained to predict the noise $\epsilon$ that was added at any given step $t$. 
By knowing how to remove the noise, the model can iteratively “denoise” a random sample into a high-quality data point.<sup id="fnref:45:1" role="doc-noteref"><a href="#fn:45" class="footnote" rel="footnote">34</a></sup> Mathematically, this is governed by the Stochastic Differential Equation (SDE):</p> \[dx = f(x, t)dt + g(t)dw\] <p>The reverse process involves the “score function” $\nabla_x \log p(x)$, which points in the direction of increasing data density.<sup id="fnref:44" role="doc-noteref"><a href="#fn:44" class="footnote" rel="footnote">35</a></sup> Modern diffusion models essentially learn this score function, providing a robust mathematical way to sample from complex, high-dimensional manifolds.<sup id="fnref:43:1" role="doc-noteref"><a href="#fn:43" class="footnote" rel="footnote">33</a></sup></p> <h2 id="numerical-pragmatism-the-gap-between-math-and-machine">Numerical Pragmatism: The Gap Between Math and Machine</h2> <p>One of the most persistent failures in machine learning development is the “theoretical success, numerical failure” trap. Mathematical equations assume infinite precision, but hardware operates on finite-precision floating-point numbers.<sup id="fnref:49" role="doc-noteref"><a href="#fn:49" class="footnote" rel="footnote">36</a></sup></p> <h3 id="the-softmax-instability">The Softmax Instability</h3> <p>The Softmax function is mathematically robust but numerically fragile. Large logits cause the exponential function to overflow into <code class="language-plaintext highlighter-rouge">inf</code>, while large negative logits cause underflow to <code class="language-plaintext highlighter-rouge">0</code>, resulting in <code class="language-plaintext highlighter-rouge">NaN</code> gradients.<sup id="fnref:49:1" role="doc-noteref"><a href="#fn:49" class="footnote" rel="footnote">36</a></sup> The standard solution is the <strong>Translation Invariance Trick</strong>: subtracting the maximum value from all logits before exponentiating. 
This ensures that the largest exponent is $e^0 = 1$, preventing overflow and guaranteeing numerical stability.<sup id="fnref:51" role="doc-noteref"><a href="#fn:51" class="footnote" rel="footnote">37</a></sup></p> <h3 id="the-logsumexp-trick">The LogSumExp Trick</h3> <p>In the calculation of cross-entropy, we often encounter the log of a sum of exponentials. A naive implementation would calculate the exponentials, sum them, and then take the log, which is prone to overflow. The stable approach uses the LogSumExp identity:</p> \[\log \sum_i e^{x_i} = \alpha + \log \sum_i e^{x_i - \alpha}\] <p>where $\alpha = \max_i x_i$.<sup id="fnref:49:2" role="doc-noteref"><a href="#fn:49" class="footnote" rel="footnote">36</a></sup> This ensures that intermediate computations stay within the representable range of floating-point numbers.</p> <h3 id="correct-implementation-of-entropy">Correct Implementation of Entropy</h3> <p>Critiques of earlier implementations highlighted that naive entropy calculations using <code class="language-plaintext highlighter-rouge">np.log(p, where=p &gt; 0)</code> can be dangerous if the output is not properly initialized, as it leaves the results at $p=0$ locations as uninitialized garbage values.<sup id="fnref:1:3" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> A robust implementation must explicitly handle the limit $\lim_{p \to 0} p \log p = 0$ to ensure consistency and correctness across the entire domain.<sup id="fnref:1:4" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p> <table> <thead> <tr> <th>Mathematical Operation</th> <th>Potential Numerical Failure</th> <th>Robust Implementation Strategy</th> </tr> </thead> <tbody> <tr> <td>Softmax</td> <td>Overflow ($e^{1000} = \infty$)</td> <td>Subtract maximum logit before $\exp$<sup id="fnref:51:1" role="doc-noteref"><a href="#fn:51" class="footnote" rel="footnote">37</a></sup></td> </tr> <tr> <td>Cross-Entropy</td> <td>Underflow 
($\log(0) = -\infty$)</td> <td>Use LogSoftmax and LogSumExp fusion<sup id="fnref:49:3" role="doc-noteref"><a href="#fn:49" class="footnote" rel="footnote">36</a></sup></td> </tr> <tr> <td>Information Entropy</td> <td>Uninitialized memory at $p=0$</td> <td>Use <code class="language-plaintext highlighter-rouge">np.where</code> with default initialization to zero<sup id="fnref:1:5" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></td> </tr> <tr> <td>Hessian Calculation</td> <td>High memory cost/Instability</td> <td>Use Hessian-Vector Products (HVP)<sup id="fnref:19:3" role="doc-noteref"><a href="#fn:19" class="footnote" rel="footnote">13</a></sup></td> </tr> <tr> <td>SVD</td> <td>Convergence failure on singular matrices</td> <td>Use Moore-Penrose pseudo-inverse via SVD<sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></td> </tr> </tbody> </table> <h2 id="advanced-theoretical-integration-kernels-attention-and-manifold-learning">Advanced Theoretical Integration: Kernels, Attention, and Manifold Learning</h2> <p>The synthesis of these concepts reveals the deeper structure of modern machine learning. For instance, the Attention mechanism can be viewed as a data-dependent kernel where the weights are dynamically computed for each input pair.<sup id="fnref:31:2" role="doc-noteref"><a href="#fn:31" class="footnote" rel="footnote">24</a></sup> Similarly, the success of Diffusion models is intrinsically linked to the spectral properties of the data manifold: the model learns to project noise back onto the low-dimensional manifold where data resides.<sup id="fnref:44:1" role="doc-noteref"><a href="#fn:44" class="footnote" rel="footnote">35</a></sup></p> <p>The distinction between “Sharp” and “Flat” minima provides a bridge between the optimization dynamics of the Hessian and the statistical requirements of generalization. 
A flat minimum is not just a point of low loss; it is a region of high local entropy in the parameter space, suggesting that the solution is not a lucky “overfit” but a robust feature of the data distribution.<sup id="fnref:20:1" role="doc-noteref"><a href="#fn:20" class="footnote" rel="footnote">15</a></sup></p> <h2 id="synthesis-and-recommendations-for-practitioners">Synthesis and Recommendations for Practitioners</h2> <p>The evolution of machine learning mathematics demonstrates that technical robustness is achieved only through the rigorous application of foundational principles. The fundamentals-first, derivation-driven approach used here addresses the community’s concerns regarding “LLM slop” and technical vapidity.<sup id="fnref:1:6" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> By explicitly connecting surprisal to KL divergence, and the Jacobian of Softmax to the cross-entropy gradient, we move from rote memorization to functional understanding.</p> <p>For practitioners looking to improve their models, the focus should be on three critical areas:</p> <ol> <li> <p><strong>Numerical Integrity</strong>: Always use fused loss functions and log-domain calculations to avoid the silent corruption of gradients.<sup id="fnref:49:4" role="doc-noteref"><a href="#fn:49" class="footnote" rel="footnote">36</a></sup></p> </li> <li> <p><strong>Geometric Awareness</strong>: Recognize that model operations are affine, and prioritize architectures that allow for flexible decision boundary placement.<sup id="fnref:2:3" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p> </li> <li> <p><strong>Curvature Monitoring</strong>: In high-stakes applications, move beyond monitoring simple training loss. 
Analyzing the Hessian spectrum or the local flatness of the solution provides the only reliable indicator of how a model will perform on unseen, real-world data.<sup id="fnref:21:2" role="doc-noteref"><a href="#fn:21" class="footnote" rel="footnote">14</a></sup></p> </li> </ol> <p>The future of machine learning lies in this intersection of physics-inspired dynamics (Diffusion), information theory (Entropy), and the geometry of high-dimensional spaces (SVD/Kernels). As models continue to scale, the mathematical “shortcuts” of the past will increasingly fail, leaving only those who understand the foundational rigor of the field capable of driving its next major breakthroughs.<sup id="fnref:1:7" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026MLmathfoundations</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Architectural and Mathematical Foundations of Machine Learning: A Rigorous Synthesis of Theory, Geometry, and Implementation"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Feb"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/02/09/mathematical-machine-learning-foundations/"</span> <span class="p">}</span> 
</code></pre></div></div> <hr /> <h2 id="references">References</h2> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1" role="doc-endnote"> <p><a href="https://news.ycombinator.com/item?id=45050931">Important machine learning equations</a>. Hacker News. 2025. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:1:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:1:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:1:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:1:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a> <a href="#fnref:1:7" class="reversefootnote" role="doc-backlink">&#8617;<sup>8</sup></a></p> </li> <li id="fn:2" role="doc-endnote"> <p>Ian Quah. <a href="https://ianq.ai/Hessian-Jacobian/">Fundamentals Part 2: Hessians and Jacobians</a>. ianq.ai. 2018. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:2:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:3" role="doc-endnote"> <p><a href="https://towardsdatascience.com/eigen-intuitions-understanding-eigenvectors-and-eigenvalues-630e9ef1f719/">Eigen Intuitions: Understanding Eigenvectors and Eigenvalues</a>. Towards Data Science. 2022. 
<a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:4" role="doc-endnote"> <p><a href="https://www.math.utoronto.ca/mpugh/Teaching/MAT267_19/Geometric_description_of_SVD.pdf">The geometry of linear transformations</a>. Department of Mathematics, University of Toronto. 2019. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:5" role="doc-endnote"> <p><a href="https://math.stackexchange.com/questions/320220/intuitively-what-is-the-difference-between-eigendecomposition-and-singular-valu">Intuitively, what is the difference between Eigendecomposition and Singular Value Decomposition?</a> Mathematics Stack Exchange. 2013. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:6" role="doc-endnote"> <p><a href="https://math.stackexchange.com/questions/1450097/geometrical-interpretations-of-svd">Geometrical interpretations of SVD</a>. Mathematics Stack Exchange. 2018. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:7" role="doc-endnote"> <p>Sidharth SS. <a href="https://medium.com/@sidharth.ss/entropy-cross-entropy-and-kl-divergence-mathematical-foundations-and-applications-6a6f23da5ef1">Entropy, Cross-Entropy, and KL Divergence: Mathematical Foundations and Applications</a>. Medium. 2025. 
<a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:7:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:7:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:7:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:7:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:7:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:7:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a></p> </li> <li id="fn:8" role="doc-endnote"> <p>Eli Bendersky. <a href="https://eli.thegreenplace.net/2025/cross-entropy-and-kl-divergence/">Cross-entropy and KL divergence</a>. eli.thegreenplace. 2025. <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:8:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:8:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:8:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:10" role="doc-endnote"> <p><a href="https://www.reddit.com/r/MachineLearning/comments/7vhmp7/d_a_short_introduction_to_entropy_crossentropy/">A Short Introduction to Entropy, Cross-Entropy and KL-Divergence</a>. r/MachineLearning. Reddit. 2018. <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:15" role="doc-endnote"> <p><a href="https://www.reddit.com/r/quant/comments/1muhmro/why_are_the_hessian_and_jacobian_matrices/">Why are the Hessian and Jacobian matrices important?</a>. r/quant. Reddit. 2025. <a href="#fnref:15" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:17" role="doc-endnote"> <p><a href="https://www.geeksforgeeks.org/engineering-mathematics/jacobian-and-hessian-matrices/">Jacobian and Hessian Matrices</a>. GeeksforGeeks. 2025. 
<a href="#fnref:17" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:17:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:16" role="doc-endnote"> <p><a href="https://www.datacamp.com/tutorial/hessian-matrix">Hessian Matrix: A Guide to Second-Order Derivatives in Optimization and Beyond</a>. DataCamp. 2025. <a href="#fnref:16" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:19" role="doc-endnote"> <p>Dayal Singh Kalra, et al. <a href="https://arxiv.org/abs/2601.16979">A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs</a>. arXiv:2601.16979 (2026). <a href="#fnref:19" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:19:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:19:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:19:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:21" role="doc-endnote"> <p>Ferenc Huszár. <a href="https://www.inference.vc/sharp-vs-flat-minima-are-still-a-mystery-to-me/">The Generalization Mystery: Sharp vs Flat Minima</a>. inFERENCe.vc. 2018. <a href="#fnref:21" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:21:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:21:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p> </li> <li id="fn:20" role="doc-endnote"> <p><a href="https://www.emergentmind.com/topics/flat-minima-and-generalization">Flat Minima and Generalization</a>. Emergent Mind. 2025. <a href="#fnref:20" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:20:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:22" role="doc-endnote"> <p>Tuan-Anh Bui. 
<a href="https://tuananhbui89.github.io/blog/2024/sharpness/">Connection between Flatness and Generalization</a>. tuananhbui89.github.io. 2024. <a href="#fnref:22" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:24" role="doc-endnote"> <p>Stanford CS109. <a href="https://web.stanford.edu/class/archive/cs/cs109/cs109.1218/files/student_drive/7.5.pdf">7.5: Maximum A Posteriori Estimation</a>. 2018. <a href="#fnref:24" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:24:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:26" role="doc-endnote"> <p><a href="https://www.geeksforgeeks.org/data-science/mle-vs-map/">MLE vs MAP</a>. GeeksforGeeks. 2025. <a href="#fnref:26" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:26:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:26:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:26:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:28" role="doc-endnote"> <p>Bohsun Chen. <a href="https://medium.com/@devcharlie2698619/the-intuition-behind-maximum-likelihood-estimation-mle-and-maximum-a-posteriori-estimation-map-b8ba1ba1078f">The Intuition behind Maximum Likelihood Estimation (MLE) and Maximum A Posteriori Estimation (MAP)</a>. Medium. 2024. <a href="#fnref:28" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:28:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:25" role="doc-endnote"> <p>Agustinus Kristiadi. <a href="https://agustinus.kristia.de/blog/mle-vs-map/">MLE vs MAP: the connection between Maximum Likelihood and Maximum A Posteriori Estimation</a>. agustinus.kristia.de. 2017. <a href="#fnref:25" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:13" role="doc-endnote"> <p>Thomas Kurbiel. 
<a href="https://medium.com/data-science/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1">Derivative of the Softmax Function and the Categorical Cross-Entropy Loss</a>. Medium. 2021. <a href="#fnref:13" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:13:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:18" role="doc-endnote"> <p><a href="https://www.mldawn.com/back-propagation-with-cross-entropy-and-softmax/">Back-propagation with Cross-Entropy and Softmax</a>. MLDawn Academy. <a href="#fnref:18" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:30" role="doc-endnote"> <p>Nitin Mittapally. <a href="https://medium.com/@nitinmittapally/understanding-attention-in-transformers-a-visual-guide-df416bfe495a">Understanding Attention in Transformers: A Visual Guide</a>. Medium. 2025. <a href="#fnref:30" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:30:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:31" role="doc-endnote"> <p>Michael Brenndoerfer. <a href="https://mbrenndoerfer.com/writing/query-key-value-attention-mechanism">Query, Key, Value: The Foundation of Transformer Attention</a>. mbrenndoerfer.com. 2025. <a href="#fnref:31" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:31:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:31:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p> </li> <li id="fn:32" role="doc-endnote"> <p>Lili Jiang. <a href="https://medium.com/data-science/how-gpt-works-a-metaphoric-explanation-of-key-value-query-in-attention-using-a-tale-of-potion-8c66ace1f470">How GPT works: A Metaphoric Explanation of Key, Value, Query in Attention</a>. Medium. 2023. 
<a href="#fnref:32" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:32:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:35" role="doc-endnote"> <p>Nguyen Ha Thai Son. <a href="https://medium.com/data-science-collective/kernel-trick-under-the-hood-246ca9b36bae">Kernel Trick Under The Hood</a>. Medium. 2025. <a href="#fnref:35" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:35:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:37" role="doc-endnote"> <p><a href="http://www.columbia.edu/~mh2078/MachineLearningORFE/SVMs_MasterSlides.pdf">Support Vector Machines (and the Kernel Trick)</a>. Columbia University. <a href="#fnref:37" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:36" role="doc-endnote"> <p><a href="https://stats.stackexchange.com/questions/152897/how-to-intuitively-explain-what-a-kernel-is">How to intuitively explain what a kernel is?</a> Cross Validated (Stack Exchange). 2018. <a href="#fnref:36" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:38" role="doc-endnote"> <p>Sanghavi Harsh. <a href="https://medium.com/@sanghaviharsh666/mastering-svm-kernel-tricks-a-comprehensive-guide-to-dual-problems-and-kernel-functions-612bfff2061e">Mastering SVM Kernel Tricks: A Comprehensive Guide to Dual Problems and Kernel Functions</a>. Medium. 2024. <a href="#fnref:38" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:39" role="doc-endnote"> <p>Tony Duan. <a href="https://github.com/tonyduan/variational-autoencoders">Variational autoencoder implemented in PyTorch</a>. GitHub. <a href="#fnref:39" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:12" role="doc-endnote"> <p>Matthew N. Bernstein. <a href="https://mbernste.github.io/posts/elbo/">The evidence lower bound (ELBO)</a>. mbernste.github.io. 2020. 
<a href="#fnref:12" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:41" role="doc-endnote"> <p><a href="https://en.wikipedia.org/wiki/Evidence_lower_bound">Evidence lower bound</a>. Wikipedia. <a href="#fnref:41" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:43" role="doc-endnote"> <p>Diederik P. Kingma, Ruiqi Gao. <a href="https://openreview.net/forum?id=NnMEadcdyD">Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation</a>. OpenReview. 2023. <a href="#fnref:43" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:43:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:45" role="doc-endnote"> <p>Yazhou Li. <a href="https://flaneur2020.github.io/posts/2024-07-22-diffusion-model/">Notes on Diffusion Model: Intuition</a>. flaneur2020.github.io. 2024. <a href="#fnref:45" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:45:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:44" role="doc-endnote"> <p>Katie Keegan. <a href="https://katiekeegan.org/2025/08/11/diffeqs.html">Diffusion Models and (Many) Differential Equations</a>. katiekeegan.org. 2025. <a href="#fnref:44" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:44:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:49" role="doc-endnote"> <p><a href="https://www.marktechpost.com/2026/01/06/implementing-softmax-from-scratch-avoiding-the-numerical-stability-trap/">Implementing Softmax From Scratch: Avoiding the Numerical Stability Trap</a>. MarktechPost. 2026. 
<a href="#fnref:49" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:49:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:49:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:49:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:49:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a></p> </li> <li id="fn:51" role="doc-endnote"> <p>Jay Mody. <a href="https://jaykmody.com/blog/stable-softmax/">Numerically Stable Softmax and Cross Entropy</a>. jaykmody.com. 2022. <a href="#fnref:51" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:51:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> </ol> </div> Mon, 09 Feb 2026 00:00:00 +0000 https://chizkidd.github.io//2026/02/09/mathematical-machine-learning-foundations/ https://chizkidd.github.io//2026/02/09/mathematical-machine-learning-foundations/ A Complete Guide to Neural Network Optimizers <hr /> <blockquote> <p><strong>TLDR:</strong> This guide covers 8 neural network optimizers, from SGD to Muon. <strong>For most tasks, start with Adam or AdamW</strong>; they’re robust and require minimal tuning. <strong>For large language models, consider Muon</strong> for up to 2x faster training. <strong>For computer vision with proper learning rate scheduling, SGD+Momentum often achieves the best final accuracy</strong>. 
Each optimizer addresses the limitations of its predecessors, from basic SGD through adaptive methods (Adam/AdamW) to modern matrix-aware approaches (Muon).</p> </blockquote> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ol> <li><a href="#quick-reference-optimizer-comparison">Quick Reference: Optimizer Comparison</a></li> <li><a href="#when-to-use-which-optimizer">When to Use Which Optimizer</a></li> <li><a href="#optimizers-explained">Optimizers Explained</a> <ul> <li><a href="#1-stochastic-gradient-descent-sgd">Stochastic Gradient Descent (SGD)</a></li> <li><a href="#2-momentum">Momentum</a></li> <li><a href="#3-nesterov-momentum">Nesterov Momentum</a></li> <li><a href="#4-adagrad">AdaGrad</a></li> <li><a href="#5-rmsprop">RMSProp</a></li> <li><a href="#6-adam-adaptive-moment-estimation">Adam (Adaptive Moment Estimation)</a></li> <li><a href="#7-adamw">AdamW</a></li> <li><a href="#8-muon-momentum-orthogonalized-by-newton-schulz">Muon (MomentUm Orthogonalized by Newton-Schulz)</a></li> </ul> </li> <li><a href="#detailed-technical-comparison">Detailed Technical Comparison</a></li> <li><a href="#hyperparameter-reference">Hyperparameter Reference</a></li> <li><a href="#common-pitfalls-and-how-to-avoid-them">Common Pitfalls and How to Avoid Them</a></li> <li><a href="#conclusion">Conclusion</a></li> </ol> <hr /> <p>Training neural networks is fundamentally an optimization problem: we’re searching for the best set of weights that minimize our loss function. While the concept sounds straightforward, the path from random initialization to a well-trained model is rarely a smooth descent. The landscape of loss functions in high-dimensional spaces is filled with valleys, plateaus, and saddle points that can trap or slow down naive optimization approaches.</p> <p>This is where optimization algorithms come in. Over the years, researchers have developed increasingly sophisticated methods to navigate these challenging landscapes more efficiently. 
Each optimizer addresses the limitations of its predecessors, introducing new mechanisms to accelerate convergence, handle sparse gradients, or adapt to different learning scenarios.</p> <p>In this guide, we’ll explore eight key optimization techniques: SGD, Momentum, Nesterov Momentum, AdaGrad, RMSProp, Adam, AdamW, and Muon. We’ll examine how each one works, what problems it solves, and when you might want to use it.</p> <hr /> <h2 id="quick-reference-optimizer-comparison">Quick Reference: Optimizer Comparison</h2> <table> <thead> <tr> <th>Optimizer</th> <th>Key Feature</th> <th>Improves On</th> <th>Pros</th> <th>Cons</th> </tr> </thead> <tbody> <tr> <td>SGD</td> <td>Simple gradient descent</td> <td>N/A</td> <td>Easy to implement</td> <td>Oscillation, fixed learning rate</td> </tr> <tr> <td>Momentum</td> <td>Gradient accumulation</td> <td>SGD</td> <td>Reduces oscillations</td> <td>No anticipation of future trends</td> </tr> <tr> <td>Nesterov</td> <td>Lookahead gradients</td> <td>Momentum</td> <td>Better convergence</td> <td>Slightly higher computation</td> </tr> <tr> <td>AdaGrad</td> <td>Adaptive learning rates</td> <td>Nesterov</td> <td>Handles sparse gradients</td> <td>Learning rate decays too fast</td> </tr> <tr> <td>RMSProp</td> <td>Smoothed adaptive learning rates</td> <td>AdaGrad</td> <td>Stabilizes learning rates</td> <td>Sensitive to hyperparameters</td> </tr> <tr> <td>Adam</td> <td>Momentum + RMSProp</td> <td>RMSProp</td> <td>Combines best features</td> <td>May converge to suboptimal minima</td> </tr> <tr> <td>AdamW</td> <td>Decoupled weight decay</td> <td>Adam</td> <td>Better generalization</td> <td>Requires tuning decay parameter</td> </tr> <tr> <td>Muon</td> <td>Matrix orthogonalization</td> <td>AdamW</td> <td>33% less memory, automatic LR transfer, faster convergence</td> <td>Only for 2D matrices, requires hybrid approach</td> </tr> </tbody> </table> <hr /> <h2 id="when-to-use-which-optimizer">When to Use Which Optimizer</h2> <p>The flowchart 
below will help you quickly choose the right optimizer for your task:</p> <div class="mermaid"> graph TD Start([Choose Your Optimizer]) --&gt; Q1{What are you training?} Q1 --&gt;|Large Language Model<br />Transformer| Q2{Model size?} Q1 --&gt;|Computer Vision<br />CNN/ResNet| Q3{Priority?} Q1 --&gt;|Other/Mixed/Unsure| Default["<b><font size="5">AdamW</font></b><br />LR=0.001, <br />weight decay=0.01<br />"] Q2 --&gt;|&lt; 1B parameters| Adam1["<b><font size="5">AdamW</font></b><br />LR=3e-4<br />"] Q2 --&gt;|&gt; 1B parameters| Q4{Can implement<br />hybrid setup?} Q4 --&gt;|Yes| Muon1["<b><font size="5">Muon + AdamW</font></b><br />"] Q4 --&gt;|No| Adam1 Q3 --&gt;|Speed/Prototyping| Adam2["<b><font size="5">Adam</font></b><br />LR=0.001<br />"] Q3 --&gt;|Best Final Accuracy| Q5{Can tune learning<br />rate schedule?} Q5 --&gt;|Yes| SGD1["<b><font size="5">SGD + Momentum</font></b><br />LR=0.01 to 0.1, momentum=0.9<br />+ Cosine/Step schedule<br />"] Q5 --&gt;|No| Adam2 style Start fill:#4a90e2,color:#fff style Default fill:none,stroke:#2ecc71,stroke-width:3px style Adam1 fill:none,stroke:#2ecc71,stroke-width:3px style Adam2 fill:none,stroke:#2ecc71,stroke-width:3px style Muon1 fill:none,stroke:#f39c12,stroke-width:3px style SGD1 fill:none,stroke:#f39c12,stroke-width:3px classDef question fill:#e8f4f8,stroke:#4a90e2,stroke-width:2px class Q1,Q2,Q3,Q4,Q5 question </div> <p><strong>Key for Flowchart:</strong></p> <ul> <li><strong>Blue-filled</strong>: Starting point and decision questions</li> <li><strong>Green Border</strong>: Recommended safe defaults, works well out-of-the-box</li> <li><strong>Orange Border</strong>: Advanced options with higher payoff but more tuning</li> </ul> <h3 id="detailed-guidance">Detailed Guidance</h3> <p><strong>For Large Language Models (LLMs):</strong></p> <ul> <li>Models &lt; 1B params: <code class="language-plaintext highlighter-rouge">AdamW</code> (lr=3e-4, betas=(0.9, 0.95))</li> <li>Models &gt; 1B params: <code 
class="language-plaintext highlighter-rouge">Muon</code> + <code class="language-plaintext highlighter-rouge">AdamW</code> hybrid (possible 2x speedup)</li> </ul> <p><strong>For Computer Vision:</strong></p> <ul> <li>Quick prototyping: <code class="language-plaintext highlighter-rouge">Adam</code> (lr=0.001)</li> <li>Best accuracy: <code class="language-plaintext highlighter-rouge">SGD</code> + <code class="language-plaintext highlighter-rouge">Momentum</code> + <code class="language-plaintext highlighter-rouge">LR scheduling</code> (lr=0.01-0.1)</li> </ul> <p><strong>Special Cases:</strong></p> <ul> <li>NLP with Sparse features: <code class="language-plaintext highlighter-rouge">Adam</code> or <code class="language-plaintext highlighter-rouge">AdaGrad</code> (lr=0.001-0.01)</li> <li>Memory constrained: <code class="language-plaintext highlighter-rouge">Muon</code> or <code class="language-plaintext highlighter-rouge">SGD+Momentum</code></li> <li>Fast experimentation: <code class="language-plaintext highlighter-rouge">Adam/AdamW</code></li> </ul> <p><strong>When in doubt:</strong> Start with <code class="language-plaintext highlighter-rouge">AdamW</code> (lr=0.001, weight_decay=0.01). It’s a solid default choice for almost any task.</p> <hr /> <h2 id="1-stochastic-gradient-descent-sgd">1. 
<a href="https://projecteuclid.org/euclid.aoms/1177729586">Stochastic Gradient Descent (SGD)</a></h2> <p><strong>How It Works:</strong> Updates weights by calculating gradients using a small batch of data.</p> \[w_t = w_{t-1} - \eta \nabla f(w_{t-1})\] <p><strong>Pros:</strong></p> <ul> <li>Simple and computationally efficient</li> <li>Works well with large datasets</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Can oscillate or converge slowly, especially in narrow valleys or near saddle points</li> <li>Learning rate (η) is fixed, leading to potential overshooting or slow convergence</li> </ul> <p><strong>Code:</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch.optim</span> <span class="k">as</span> <span class="n">optim</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="2-momentum">2. 
<a href="https://hengshuaiyao.github.io/papers/polyak64.pdf">Momentum</a></h2> <p><strong>How It Works:</strong> Accumulates gradients to build momentum in directions with consistent gradients.</p> <p>\(v_t = \beta v_{t-1} - \eta \nabla f(w_{t-1})\)<br /> \(w_t = w_{t-1} + v_t\)</p> <p><strong>Pros:</strong></p> <ul> <li>Speeds up convergence in shallow but consistent directions (e.g., valleys)</li> <li>Reduces oscillations compared to SGD</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Still overshoots if the learning rate is too high</li> <li>Cannot predict future gradient directions</li> </ul> <p><strong>Improvement Over SGD:</strong> Addresses oscillation and slow convergence by incorporating past gradients.</p> <p><strong>Code:</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">momentum</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="3-nesterov-momentum">3. 
<a href="https://proceedings.mlr.press/v28/sutskever13.pdf">Nesterov Momentum</a></h2> <p><strong>How It Works:</strong> Looks ahead by computing gradients at the projected position.</p> <p>\(v_t = \beta v_{t-1} - \eta \nabla f(w_{t-1} + \beta v_{t-1})\)<br /> \(w_t = w_{t-1} + v_t\)</p> <p><strong>Pros:</strong></p> <ul> <li>More precise updates by considering where the momentum is leading</li> <li>Accelerates convergence further compared to vanilla momentum</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Slightly more computationally expensive due to gradient computation at the lookahead point</li> </ul> <p><strong>Improvement Over Momentum:</strong> Anticipates future gradient directions, resulting in better convergence.</p> <p><strong>Code:</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">momentum</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">nesterov</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="4-adagrad">4. 
<a href="https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">AdaGrad</a></h2> <p><strong>How It Works:</strong> Adjusts the learning rate for each parameter based on the magnitude of past gradients.</p> <p>\(g_t = \nabla f(w_{t-1})\)<br /> \(w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t, \quad G_t = \sum_{i=1}^t g_i^2\)</p> <p><strong>Pros:</strong></p> <ul> <li>Works well for sparse gradients (e.g., NLP tasks)</li> <li>Automatically adapts learning rates for each parameter</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Learning rate diminishes too quickly due to cumulative gradient sum, leading to potential underfitting</li> </ul> <p><strong>Improvement Over Nesterov Momentum:</strong> Introduces adaptive learning rates to handle sparse gradients.</p> <hr /> <h2 id="5-rmsprop">5. <a href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">RMSProp</a></h2> <p><strong>How It Works:</strong> Modifies AdaGrad by using an exponentially weighted moving average of past squared gradients instead of a cumulative sum.</p> <p>\(v_t = \beta v_{t-1} + (1 - \beta)(\nabla f(w_{t-1}))^2\)<br /> \(w_t = w_{t-1} - \frac{\eta}{\sqrt{v_t + \epsilon}} \nabla f(w_{t-1})\)</p> <p><strong>Pros:</strong></p> <ul> <li>Prevents the learning rate from diminishing too quickly</li> <li>Suitable for non-stationary objectives</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Sensitive to hyperparameter choices (e.g., β)</li> </ul> <p><strong>Improvement Over AdaGrad:</strong> Stabilizes learning rates by introducing an exponentially weighted average of squared gradients.</p> <hr /> <h2 id="6-adam-adaptive-moment-estimation">6. 
<a href="https://arxiv.org/pdf/1412.6980">Adam</a> (Adaptive Moment Estimation)</h2> <p><strong>How It Works:</strong> Combines Momentum (first moment) and RMSProp (second moment).</p> <ul> <li> <p>Update rules:<br /> \(m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla f(w_{t-1})\)<br /> \(v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla f(w_{t-1}))^2\)<br /></p> </li> <li> <p>Bias corrections:<br /> \(\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\)<br /></p> </li> <li> <p>Update step:<br /> \(w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\)</p> </li> </ul> <p><strong>Pros:</strong></p> <ul> <li>Combines the benefits of Momentum and RMSProp</li> <li>Automatically adjusts learning rates for each parameter</li> <li>Bias correction ensures stability in early training</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>May converge to suboptimal solutions in some scenarios (e.g., small datasets or high regularization)</li> <li>Hyperparameter tuning can be challenging</li> </ul> <p><strong>Improvement Over RMSProp:</strong> Adds momentum and bias correction to handle noisy gradients and early instability.</p> <p><strong>Code:</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">betas</span><span class="o">=</span><span class="p">(</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.999</span><span class="p">),</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 
id="7-adamw">7. <a href="https://arxiv.org/pdf/1711.05101">AdamW</a></h2> <p><strong>How It Works:</strong></p> <p>Decouples weight decay from the gradient update to improve generalization.</p> \[w_t = w_{t-1} - \eta \bigg( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_{t-1} \bigg)\] <p><strong>Pros:</strong></p> <ul> <li>Better generalization compared to Adam</li> <li>Retains benefits of adaptive learning rates</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Still requires careful hyperparameter tuning</li> </ul> <p><strong>Improvement Over Adam:</strong> Decouples weight decay from gradient updates, improving generalization performance.</p> <p><strong>Code (Common Settings for Transformers):</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">3e-4</span><span class="p">,</span> <span class="n">betas</span><span class="o">=</span><span class="p">(</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">),</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="8-muon-momentum-orthogonalized-by-newton-schulz">8. <a href="https://kellerjordan.github.io/posts/muon/">Muon</a> (MomentUm Orthogonalized by Newton-Schulz)</h2> <p><strong>How It Works:</strong></p> <p>Muon is designed specifically for 2D weight matrices in neural network hidden layers (Linear layers). 
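</p>

<p>Because Muon applies only to 2D hidden-layer weights, practical setups pair it with a second optimizer for everything else. A minimal sketch of that routing logic (the parameter names and the <code>split_params</code> helper are illustrative, not part of any official API):</p>

```python
# Illustrative Muon + AdamW parameter split. The names below are
# hypothetical; see github.com/KellerJordan/Muon for the reference setup.

def split_params(named_shapes):
    """Route 2D hidden-layer weight matrices to Muon; everything else
    (biases, norm scales, embeddings, classifier head) to AdamW."""
    muon_group, adamw_group = [], []
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        # Embeddings and the output head are conventionally excluded
        # from Muon even though they are 2D.
        is_excluded = name.startswith(("embed", "lm_head"))
        (muon_group if is_matrix and not is_excluded else adamw_group).append(name)
    return muon_group, adamw_group

params = [
    ("embed.weight",        (50304, 768)),  # embedding       -> AdamW
    ("blocks.0.mlp.weight", (3072, 768)),   # hidden Linear   -> Muon
    ("blocks.0.mlp.bias",   (3072,)),       # 1D bias         -> AdamW
    ("blocks.0.ln.weight",  (768,)),        # norm scale      -> AdamW
    ("lm_head.weight",      (50304, 768)),  # classifier head -> AdamW
]
muon_group, adamw_group = split_params(params)
```

<p>In a real training loop, each group would then be handed to its own optimizer instance (Muon and AdamW, respectively).</p>

<p>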
Unlike traditional optimizers that treat each parameter independently, Muon leverages the geometric structure of weight matrices by orthogonalizing gradients using the Newton-Schulz iteration.</p> <p>The optimizer formulates weight updates as a constrained optimization problem in the RMS-to-RMS operator norm space:</p> \[\min_{\Delta W} \langle G, \Delta W \rangle \quad \text{subject to} \quad \|\Delta W\|_{op,RMS} \leq \beta\] <p>Here $G$ is the gradient matrix. The solution projects the gradient onto the set of (semi-)orthogonal matrices, which standardizes all singular values to 1 while preserving gradient directions. Its GitHub implementation can be found <a href="https://github.com/KellerJordan/Muon">here</a>.</p> <p><strong>Update Rules:</strong></p> <ol> <li> <p><strong>Momentum accumulation:</strong> \(V_t = \mu V_{t-1} + G_t\)</p> </li> <li> <p><strong>Newton-Schulz orthogonalization (5 iterations):</strong> \(Z_0 = \frac{V_t}{\|V_t\|_F}\) \(Z_{i+1} = aZ_i + bZ_i^3 + cZ_i^5\)</p> <p>Default coefficients: $(a, b, c) = (3.4445, -4.775, 2.0315)$. The odd matrix powers are computed as $Z^3 = (ZZ^T)Z$ and $Z^5 = (ZZ^T)^2 Z$.</p> </li> <li> <p><strong>Weight update:</strong> \(W_t = W_{t-1} - \eta \cdot Z_{\text{final}} - \eta \lambda W_{t-1}\)</p> </li> </ol> <p><strong>Important:</strong> Muon should <strong>only</strong> be applied to 2D weight matrices (hidden layer Linear layers).
All other parameters (embeddings, biases, normalization layers, classifier heads) must use a standard optimizer like AdamW.</p> <p><strong>Pros:</strong></p> <ul> <li><strong>Memory efficient:</strong> Only tracks momentum (no second moment statistics like Adam), reducing memory by ~33% compared to Adam</li> <li><strong>Automatic learning rate transfer:</strong> Learning rates transfer across different network widths without retuning</li> <li><strong>Superior convergence:</strong> Faster training than Adam/AdamW, especially for transformers and large models <ul> <li>Improved CIFAR-10 training speed record from 3.3 to 2.6 A100-seconds for 94% accuracy</li> <li>Improved NanoGPT speedrunning record by 1.35x</li> <li>Trained 1.5B transformer to GPT-2 XL performance in 10 hours vs 13.3 hours with AdamW</li> </ul> </li> <li><strong>Better saddle point handling:</strong> Orthogonalization helps escape saddle points more effectively</li> <li><strong>Scalable:</strong> Performance improvements increase with model size</li> </ul> <p><strong>Cons:</strong></p> <ul> <li><strong>Hybrid approach required:</strong> Must use AdamW or another optimizer for non-2D parameters</li> <li><strong>Higher computational cost:</strong> Newton-Schulz iterations add ~5% overhead (though Turbo-Muon reduces this to ~1%)</li> <li><strong>Implementation complexity:</strong> More complex than standard optimizers</li> <li><strong>Limited to dense layers:</strong> Only applicable to Linear layers with dense activations</li> </ul> <p><strong>Improvement Over AdamW:</strong> Exploits the matrix structure of neural network weights rather than treating parameters independently. This geometric approach provides automatic scaling properties and faster convergence while using less memory. 
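</p>

<p>To make the orthogonalization step concrete, here is a minimal NumPy sketch of the Newton-Schulz iteration described above, using the default coefficients $(3.4445, -4.775, 2.0315)$. This is an illustrative translation for clarity, not the reference implementation; the official Muon repo runs the same iteration on PyTorch tensors (in bfloat16) inside the optimizer step.</p>

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately project a matrix onto the set of (semi-)orthogonal
    matrices: singular values are pushed toward 1 while the singular
    vectors (the gradient directions) are preserved."""
    a, b, c = 3.4445, -4.775, 2.0315   # default quintic coefficients
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize so every singular value is <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:                     # iterate in the wide orientation so X @ X.T stays small
        X = X.T
    for _ in range(steps):
        A = X @ X.T                    # odd polynomial in X: a*X + b*(XX^T)X + c*(XX^T)^2 X
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

<p>Applied to a random gradient-shaped matrix, five iterations bring the singular values into a band around 1, which is what gives Muon its well-conditioned updates. The iteration only converges approximately, but that is sufficient in practice.</p>

<p>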
Particularly effective for transformer architectures and language model pre-training.</p> <p><strong>Code:</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">muon</span> <span class="kn">import</span> <span class="n">MuonWithAuxAdam</span> <span class="c1"># Separate parameters by type </span><span class="n">hidden_weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">body</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">ndim</span> <span class="o">&gt;=</span> <span class="mi">2</span><span class="p">]</span> <span class="n">hidden_gains_biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">body</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">ndim</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">]</span> <span class="n">nonhidden_params</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">model</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="o">*</span><span class="n">model</span><span class="p">.</span><span class="n">embed</span><span class="p">.</span><span class="n">parameters</span><span class="p">()]</span> <span class="c1"># Create parameter groups </span><span 
class="n">param_groups</span> <span class="o">=</span> <span class="p">[</span> <span class="nb">dict</span><span class="p">(</span><span class="n">params</span><span class="o">=</span><span class="n">hidden_weights</span><span class="p">,</span> <span class="n">use_muon</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.02</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">),</span> <span class="nb">dict</span><span class="p">(</span><span class="n">params</span><span class="o">=</span><span class="n">hidden_gains_biases</span><span class="o">+</span><span class="n">nonhidden_params</span><span class="p">,</span> <span class="n">use_muon</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="mf">3e-4</span><span class="p">,</span> <span class="n">betas</span><span class="o">=</span><span class="p">(</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">),</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">),</span> <span class="p">]</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">MuonWithAuxAdam</span><span class="p">(</span><span class="n">param_groups</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="detailed-technical-comparison">Detailed Technical Comparison</h2> <table> <thead> <tr> <th>Method</th> <th>Working Mechanism</th> <th>Pros</th> <th>Cons</th> <th>Improvement Over Prior Method</th> </tr> </thead> <tbody> <tr> <td><strong>SGD</strong></td> <td>Updates weights using gradients calculated on mini-batches. 
$w_t = w_{t-1} - \eta\nabla f(w_{t-1})$</td> <td>Simple, computationally efficient</td> <td>Oscillates, slow convergence, fixed learning rate</td> <td>-</td> </tr> <tr> <td><strong>Momentum</strong></td> <td>Accumulates gradients to build momentum for smoother updates. $v_t = \beta v_{t-1} - \eta\nabla f(w_{t-1})$, $w_t = w_{t-1} + v_t$</td> <td>Speeds up convergence, reduces oscillations</td> <td>May overshoot, lacks anticipation of future gradients</td> <td>Reduces oscillations and improves convergence speed</td> </tr> <tr> <td><strong>Nesterov</strong></td> <td>Looks ahead to compute gradients at a projected future position. $v_t = \beta v_{t-1} - \eta\nabla f(w_{t-1} + \beta v_{t-1})$, $w_t = w_{t-1} + v_t$</td> <td>More precise updates, faster convergence</td> <td>Slightly more computationally expensive</td> <td>Anticipates future gradient directions</td> </tr> <tr> <td><strong>AdaGrad</strong></td> <td>Adjusts learning rates based on accumulated squared gradients. $w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}}g_t$, $G_t = \sum g_i^2$</td> <td>Adapts learning rates, good for sparse gradients</td> <td>Learning rate diminishes too quickly, potential underfitting</td> <td>Introduces adaptive learning rates for sparse features</td> </tr> <tr> <td><strong>RMSProp</strong></td> <td>Uses exponentially weighted moving averages of squared gradients. $v_t = \beta v_{t-1} + (1-\beta)g_t^2$, $w_t = w_{t-1} - \frac{\eta}{\sqrt{v_t + \epsilon}}g_t$</td> <td>Prevents learning rate decay, handles non-stationary objectives</td> <td>Sensitive to hyperparameters (e.g., β)</td> <td>Stabilizes learning rates using moving averages</td> </tr> <tr> <td><strong>Adam</strong></td> <td>Combines Momentum (1st moment) and RMSProp (2nd moment) with bias correction. 
$w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$</td> <td>Fast convergence, handles noisy gradients</td> <td>May converge to suboptimal minima in some cases</td> <td>Combines momentum and adaptive learning rates</td> </tr> <tr> <td><strong>AdamW</strong></td> <td>Decouples weight decay from gradient updates. $w_t = w_{t-1} - \eta[\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_{t-1}]$</td> <td>Better generalization, retains Adam’s benefits</td> <td>Requires tuning of decay parameter</td> <td>Improves generalization by decoupling weight decay</td> </tr> <tr> <td><strong>Muon</strong></td> <td>Accumulates momentum, then orthogonalizes the momentum buffer $V_t$ of each weight matrix via the Newton-Schulz iteration. $V_t = \mu V_{t-1} + G_t$, $W_t = W_{t-1} - \eta \cdot \text{NS}(V_t) - \eta\lambda W_{t-1}$</td> <td>Fast convergence, memory efficient, automatic LR transfer across model sizes</td> <td>Only for 2D parameters, requires hybrid approach with AdamW</td> <td>Leverages matrix geometry for better conditioning and faster training</td> </tr> </tbody> </table> <hr /> <h2 id="hyperparameter-reference">Hyperparameter Reference</h2> <table> <thead> <tr> <th>Method</th> <th>Hyperparameter</th> <th>Meaning</th> <th>Typical Values</th> <th>Tuning Suggestions</th> </tr> </thead> <tbody> <tr> <td><strong>SGD</strong></td> <td>Learning rate ($\eta$)</td> <td>Step size for updating weights</td> <td>0.01 to 0.1</td> <td>Start with a smaller value and adjust based on convergence</td> </tr> <tr> <td><strong>Momentum</strong></td> <td>Momentum coefficient ($\beta$)</td> <td>Controls the contribution of past gradients to the current update</td> <td>0.9</td> <td>Keep fixed at 0.9 or tune slightly</td> </tr> <tr> <td><strong>Nesterov</strong></td> <td>Momentum coefficient ($\beta$)</td> <td>Same as Momentum, with anticipation of future gradients</td> <td>0.9</td> <td>Same as Momentum</td> </tr> <tr> <td><strong>AdaGrad</strong></td> <td>Learning
rate ($\eta$)</td> <td>Base learning rate scaled by the inverse square root of accumulated squared gradients</td> <td>0.01</td> <td>Lower than SGD learning rates to avoid overshooting</td> </tr> <tr> <td><strong>RMSProp</strong></td> <td>Learning rate ($\eta$)</td> <td>Similar to AdaGrad, with smoothing via an exponential moving average</td> <td>0.001 to 0.01</td> <td>Tune for stability based on loss</td> </tr> <tr> <td> </td> <td>Decay rate ($\beta$)</td> <td>Smoothing parameter for the moving average of squared gradients</td> <td>0.9</td> <td>Commonly fixed at 0.9</td> </tr> <tr> <td><strong>Adam</strong></td> <td>Learning rate ($\eta$)</td> <td>Base learning rate for parameter updates</td> <td>0.001</td> <td>Often works well without much tuning</td> </tr> <tr> <td> </td> <td>$\beta_1$</td> <td>Decay rate for the first moment (mean of gradients)</td> <td>0.9</td> <td>Usually fixed</td> </tr> <tr> <td> </td> <td>$\beta_2$</td> <td>Decay rate for the second moment (variance of gradients)</td> <td>0.999</td> <td>Keep fixed or tune slightly for sensitivity</td> </tr> <tr> <td> </td> <td>$\epsilon$</td> <td>Small value to avoid division by zero</td> <td>$10^{-8}$ (PyTorch default)</td> <td>Rarely changed</td> </tr> <tr> <td><strong>AdamW</strong></td> <td>Learning rate ($\eta$)</td> <td>Same as Adam</td> <td>0.001</td> <td>Same as Adam</td> </tr> <tr> <td> </td> <td>$\beta_1$, $\beta_2$, $\epsilon$</td> <td>Same as Adam</td> <td>0.9, 0.999, $10^{-8}$</td> <td>Same as Adam</td> </tr> <tr> <td> </td> <td>Weight decay ($\lambda$)</td> <td>Regularization parameter to control overfitting by penalizing large weights</td> <td>$10^{-4}$ to $10^{-2}$</td> <td>Start small and increase if overfitting is observed</td> </tr> <tr> <td><strong>Muon</strong></td> <td>Learning rate ($\eta$)</td> <td>Base learning rate for matrix updates</td> <td>0.02 (can be 5-10x larger than Adam)</td> <td>Start with 0.02, can use much larger values than Adam</td> </tr> <tr> <td> </td> <td>Momentum
($\mu$)</td> <td>Momentum coefficient</td> <td>0.95</td> <td>Usually fixed at 0.95</td> </tr> <tr> <td> </td> <td>Weight decay ($\lambda$)</td> <td>Regularization parameter</td> <td>0.01</td> <td>Same as AdamW</td> </tr> <tr> <td> </td> <td>Nesterov</td> <td>Whether to use Nesterov momentum</td> <td>True</td> <td>Typically enabled</td> </tr> <tr> <td> </td> <td>NS coefficients $(a,b,c)$</td> <td>Newton-Schulz polynomial coefficients</td> <td>(3.4445, -4.775, 2.0315)</td> <td>Rarely changed, but can be tuned for specific architectures</td> </tr> <tr> <td> </td> <td><strong>For non-2D params</strong></td> <td>Use AdamW with standard settings</td> <td>$\eta$ = 3e-4, $\beta_1$ = 0.9, $\beta_2$ = 0.95</td> <td>Keep separate learning rate for embeddings/biases</td> </tr> </tbody> </table> <hr /> <h2 id="common-pitfalls-and-how-to-avoid-them">Common Pitfalls and How to Avoid Them</h2> <p>Even with the right optimizer, certain mistakes can derail your training. Here are the most common issues:</p> <h3 id="1-using-adam-without-learning-rate-decay">1. Using Adam without Learning Rate Decay</h3> <p><strong>Problem:</strong> Adam can fail to converge to optimal solutions without learning rate scheduling.</p> <p><strong>Solution:</strong> Always use a learning rate scheduler with Adam/AdamW, especially for long training runs.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scheduler</span> <span class="o">=</span> <span class="n">CosineAnnealingLR</span><span class="p">(</span><span class="n">optimizer</span><span class="p">,</span> <span class="n">T_max</span><span class="o">=</span><span class="n">epochs</span><span class="p">)</span> </code></pre></div></div> <h3 id="2-sgd-learning-rate-too-high">2. 
SGD Learning Rate Too High</h3> <p><strong>Problem:</strong> Divergence, exploding gradients, NaN losses.</p> <p><strong>Solution:</strong> Start with a conservative learning rate (0.01-0.1) and use warmup:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Warmup for first 5 epochs </span><span class="k">if</span> <span class="n">epoch</span> <span class="o">&lt;</span> <span class="mi">5</span><span class="p">:</span> <span class="n">lr</span> <span class="o">=</span> <span class="n">base_lr</span> <span class="o">*</span> <span class="p">(</span><span class="n">epoch</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="mi">5</span> <span class="k">else</span><span class="p">:</span> <span class="n">lr</span> <span class="o">=</span> <span class="n">base_lr</span> </code></pre></div></div> <h3 id="3-confusing-adam-and-adamw">3. Confusing Adam and AdamW</h3> <p><strong>Problem:</strong> Using <code class="language-plaintext highlighter-rouge">torch.optim.Adam</code> when you meant to use weight decay.</p> <p><strong>Critical:</strong> In PyTorch, <code class="language-plaintext highlighter-rouge">Adam</code> with <code class="language-plaintext highlighter-rouge">weight_decay</code> parameter is <strong>NOT</strong> the same as <code class="language-plaintext highlighter-rouge">AdamW</code>!</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># WRONG - This is L2 regularization, not weight decay </span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span 
class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span> <span class="c1"># CORRECT - Use AdamW for proper weight decay </span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span> </code></pre></div></div> <h3 id="4-not-separating-parameter-groups-for-muon">4. Not Separating Parameter Groups for Muon</h3> <p><strong>Problem:</strong> Applying Muon to all parameters (embeddings, biases, etc.) causes training instability.</p> <p><strong>Solution:</strong> Only use Muon for 2D weight matrices. 
Use AdamW for everything else:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Correctly separate parameters </span><span class="n">hidden_weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">ndim</span> <span class="o">&gt;=</span> <span class="mi">2</span><span class="p">]</span> <span class="n">other_params</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">ndim</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">]</span> </code></pre></div></div> <h3 id="5-forgetting-gradient-clipping">5. 
Forgetting Gradient Clipping</h3> <p><strong>Problem:</strong> Training instability, especially with RNNs, transformers, or high learning rates.</p> <p><strong>Solution:</strong> Add gradient clipping before optimizer step:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">max_norm</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span> </code></pre></div></div> <h3 id="6-using-adagrad-for-long-training">6. Using AdaGrad for Long Training</h3> <p><strong>Problem:</strong> Learning rate diminishes to nearly zero, causing training to stall.</p> <p><strong>Solution:</strong> Use RMSProp or Adam instead for long training runs. AdaGrad works best for shorter, sparse gradient scenarios.</p> <h3 id="7-ignoring-batch-size-effects">7. 
Ignoring Batch Size Effects</h3> <p><strong>Problem:</strong> Optimizer performance varies dramatically with batch size.</p> <p><strong>Key Rule:</strong> Larger batch sizes often require larger learning rates:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Linear scaling rule (approximate) </span><span class="n">lr</span> <span class="o">=</span> <span class="n">base_lr</span> <span class="o">*</span> <span class="p">(</span><span class="n">batch_size</span> <span class="o">/</span> <span class="n">base_batch_size</span><span class="p">)</span> </code></pre></div></div> <h3 id="8-not-using-different-optimizers-for-different-parameters">8. Not Using Different Optimizers for Different Parameters</h3> <p><strong>Problem:</strong> Embeddings and classifier heads may need different learning rates than the main network.</p> <p><strong>Solution:</strong> Use parameter groups:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">([</span> <span class="p">{</span><span class="s">'params'</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">embedding</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="s">'lr'</span><span class="p">:</span> <span class="mf">1e-3</span><span class="p">},</span> <span class="p">{</span><span class="s">'params'</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">encoder</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="s">'lr'</span><span class="p">:</span> <span class="mf">3e-4</span><span class="p">},</span> <span class="p">{</span><span class="s">'params'</span><span 
class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="s">'lr'</span><span class="p">:</span> <span class="mf">5e-4</span><span class="p">}</span> <span class="p">])</span> </code></pre></div></div> <h3 id="9-misunderstanding-momentum-hyperparameters">9. Misunderstanding Momentum Hyperparameters</h3> <p><strong>Problem:</strong> Using $\beta_1 = 0.9$ for both Adam and SGD without understanding the difference.</p> <p><strong>Key Insight:</strong></p> <ul> <li>SGD Momentum: 0.9 is standard</li> <li>Adam $\beta_1$ : 0.9 is standard</li> <li>But they behave differently! Adam’s momentum is applied to normalized gradients.</li> </ul> <h3 id="10-not-validating-optimizer-setup">10. Not Validating Optimizer Setup</h3> <p><strong>Problem:</strong> Subtle bugs in optimizer configuration go unnoticed until poor results.</p> <p><strong>Solution:</strong> Always verify your setup:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Check which parameters are being optimized </span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Optimizing </span><span class="si">{</span><span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">param_groups</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">'params'</span><span class="p">])</span><span class="si">}</span><span class="s"> parameters"</span><span class="p">)</span> <span class="c1"># Verify learning rates </span><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span 
class="n">group</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">optimizer</span><span class="p">.</span><span class="n">param_groups</span><span class="p">):</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Group </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">: lr=</span><span class="si">{</span><span class="n">group</span><span class="p">[</span><span class="s">'lr'</span><span class="p">]</span><span class="si">}</span><span class="s">, params=</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">group</span><span class="p">[</span><span class="s">'params'</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="conclusion">Conclusion</h2> <p>Choosing the right optimizer can dramatically impact your model’s training efficiency and final performance. While there’s no universal “best” optimizer, understanding the strengths and weaknesses of each approach helps you make informed decisions for your specific use case.</p> <p>For most modern deep learning applications, <strong>Adam</strong> and <strong>AdamW</strong> have emerged as go-to choices due to their robust performance across diverse tasks with minimal hyperparameter tuning. Adam’s combination of momentum and adaptive learning rates makes it particularly effective for handling noisy gradients and training deep networks, while AdamW’s improved weight decay mechanism often leads to better generalization.</p> <p><strong>Muon</strong> represents a paradigm shift in optimization by explicitly leveraging the matrix structure of neural network weights. For large-scale language model training, Muon has demonstrated consistent speed improvements over AdamW while using significantly less memory. 
Its ability to automatically transfer learning rates across model sizes makes it particularly valuable for scaling experiments. However, its requirement for a hybrid approach (using AdamW for non-matrix parameters) adds implementation complexity. If you’re training large transformers and have the engineering resources to implement it properly, Muon is worth serious consideration.</p> <p>Regardless of which optimizer you choose, <strong>learning rate scheduling</strong> is crucial for achieving optimal results. Modern training almost always combines an optimizer with a schedule like cosine annealing, step decay, or warmup-then-decay. The Adam paper’s promise of “little tuning required” applies to the optimizer’s internal hyperparameters ($\beta_1$, $\beta_2$), but you should still tune the learning rate and use scheduling for best results.</p> <p>However, don’t overlook the classics. <strong>SGD with Momentum</strong> remains highly competitive, especially for computer vision tasks, and often achieves better final test accuracy when combined with proper learning rate scheduling. For problems with sparse gradients, such as natural language processing with large vocabularies, <strong>AdaGrad</strong> or <strong>RMSProp</strong> might be more appropriate.</p> <p>The key takeaway is that optimizer selection should be guided by your problem’s characteristics: dataset size, gradient sparsity, computational budget, and generalization requirements. Start with a well-established baseline (Adam is usually a safe bet), monitor your training dynamics, and don’t hesitate to experiment with alternatives if you’re not seeing the convergence behavior you expect.</p> <p>As the field continues to evolve, new optimizers and variants will undoubtedly emerge. But the fundamental principles underlying these eight methods (managing learning rates, leveraging momentum, adapting to gradient statistics, and combining optimizers where needed) will remain central to training neural networks effectively.
At the same time, new optimizers like Muon (2024) show that there’s still room for innovation. Stay curious, read the papers linked throughout this guide as well as new ones as they appear, and don’t be afraid to experiment with different optimizers for your specific use case.</p> Thu, 22 Jan 2026 00:00:00 +0000 https://chizkidd.github.io//2026/01/22/neural-net-optimizers/