Chizoba Obasi blog Exploring Deep Learning. https://chizkidd.github.io// Wed, 25 Mar 2026 18:59:49 +0000 Jekyll v3.10.0 Inkcast: A Free, Browser-Based Audiobook Player <p>Earlier this year, I decided to force myself to read more. Not a New Year’s resolution, because those never last. The reason is that growing up as a child and young teenager, reading often felt like punishment. My mum required my siblings and me to read a certain number of pages from a designated book every day throughout elementary school. Missing a day meant mandatory punishment. In boarding secondary school, this eventually led to a stubborn, subconscious resistance to non-essential reading. Over the six years I spent there, I probably read only five to ten non-academic fiction books (though <em>Artemis Fowl</em> was a delight). So it is not hard to see where my indifference to reading came from.</p> <p>During the COVID-19 pandemic, however, I fell deep into podcasts. As an avid sports fan and TV show buff, I listened to everything: sports recaps, tech podcasts, expert interviews, the works (meeting Walter White at Anfield would be the ultimate dream). So when I decided to read more this year, audiobooks felt like the natural bridge. I already had a few EPUBs in the Apple Books app on my reading list and wondered: <em>Can I listen to these EPUBs using Apple Dictation’s two-finger swipe-down feature?</em> Unfortunately, it only works for the current page. It is quite janky, not very user-friendly, and frankly does not work well for my use case.</p> <p>Recently, I worked on <a href="https://chizkidd.github.io/2026/03/01/tonal-fidelity-multilingual-asr/">evaluating how a Facebook state-of-the-art (SOTA) automatic speech recognition (ASR) model handles Igbo tones</a>, trying to see whether it actually “listens” properly. So I have been dabbling with audio quite a bit this year. You could say I have been thinking about listening a lot. 
In the past, I also experimented with WaveNet (a generative model for raw audio) and its fundamental building block, <a href="https://chizkidd.github.io/Karpathy-Neural-Networks-Zero-to-Hero/006_makemore_WaveNet/makemore_WaveNet.html">the dilated causal convolution</a>.</p> <p>With these experiences in mind, I wondered: <em>Can I build an iPhone Shortcut that lets me listen to EPUBs properly?</em> That question eventually led to <a href="https://chizkidd.github.io/inkcast/"><strong>Inkcast</strong></a>. The goal was not to build a Speechify competitor. I simply wanted to solve a personal problem. My aim was to create a low-effort, frictionless tool for personal use, so I made a GitHub repository and started building. Within a few days, I had a working website that could take EPUBs and PDFs and let users listen to the content organized by chapters in a sidebar with one-tap navigation. It included basic controls such as play/pause, rewind (15 seconds), forward (30 seconds), playback speed control (0.75× to 2×), and voice selection.</p> <p>It worked well on desktop, so I used the URL to create a Shortcut on my iPhone. On mobile it also worked, but there was one problem: the reader voices sounded robotic and monotone, which is not ideal for long-form listening. The irony was that only weeks earlier I had been evaluating how machines handle speech. So I went back to the drawing board to figure out how to get more natural-sounding reader voices on Inkcast.</p> <p>While researching audiobook-quality text-to-speech, I came across several APIs (OpenAI, ElevenLabs, Google Cloud). None were free, and for a personal project I wanted something that required no subscriptions or API keys. Most resources suggested that human-quality narration requires a dedicated TTS service. Eventually I discovered that the Web Speech API can access premium voices already installed on the device. These voices are free, require no API keys, and remain available offline. 
They are not state of the art, but they are surprisingly good. Many people do not realize that higher-quality Siri voices can be downloaded. The voice quality improved, but the project also started evolving in another direction.</p> <p>I have always wanted to work through Paul Graham’s essays properly. There are 229 of them, and they read almost like long-form podcasts. But they live on webpages, which raised another question: Why limit the input to EPUBs and PDFs? So I added URL support. I pasted Paul Graham’s archive page into Inkcast, and it automatically pulled all 200+ essays into the sidebar. That was the moment I realized the idea actually worked.</p> <p>The entire project lives in a single HTML file. There are no accounts, no installations, and files never leave the user’s device. Because the app has no server dependencies, it ended up functioning as a privacy-preserving tool by default. In a way, I started the year studying whether machines listen well. Along the way, I realized that humans do not have many good free tools for listening either. Speechify costs $139 per year and Audible requires a subscription, so I built something that worked for me.</p> <p>If you find it useful, please <a href="https://chizkidd.github.io/inkcast/">try it</a>, <a href="https://github.com/chizkidd/inkcast">star it</a>, or <a href="https://buymeacoffee.com/cobasi">buy me a coffee</a> if it saves you a Speechify subscription.</p> Mon, 16 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/16/inkcast/ Sutton & Barto, Ch. 
12: Eligibility Traces (Personal Notes) <ul> <li>Eligibility traces are one of the basic mechanisms of RL that unify and generalize TD and Monte Carlo (MC) methods.</li> <li>TD methods augmented with eligibility traces produce a family of methods spanning a range from MC methods at one end ($\lambda = 1$) to one-step TD (TD(0)) methods at the other end ($\lambda = 0$).</li> <li>With eligibility traces, MC methods can be implemented online and on continuing problems.</li> <li>$n$-step methods also unify TD and MC methods but are not as elegant algorithmically as eligibility traces (ET).</li> <li>The eligibility-trace mechanism works as follows: <ul> <li>First, we have a short-term memory vector, the <strong>eligibility trace</strong> $\mathbf{z}_t \in \mathbb{R}^d$, that parallels the long-term weight vector $\mathbf{w}_t \in \mathbb{R}^d$.</li> <li>Then, when a component of $\mathbf{w}_t$ participates in producing an estimated value, the corresponding component of $\mathbf{z}_t$ is bumped up and then begins to fade away.</li> <li>Learning occurs in that component of $\mathbf{w}_t$ if a non-zero TD error occurs before the trace fades back to zero.</li> <li>The trace-decay parameter $\lambda \in [0,1]$ determines the rate at which the trace falls.</li> </ul> </li> <li>Advantages of ET over $n$-step methods: <ul> <li>Requires only a single trace vector $\mathbf{z}_t$ rather than storing the last $n$ feature vectors.</li> <li>Learning occurs continually and uniformly in time rather than being delayed and playing “catch up” at episode end. 
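The bump-and-fade trace dynamics described above can be sketched in a few lines. This is a toy illustration, not the book's pseudocode: the values of $\gamma$, $\lambda$, and the one-hot features are made up, and for a linear $\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s)$ the gradient with respect to $\mathbf{w}$ is simply $\mathbf{x}(s)$.

```python
# Toy sketch of the bump-and-fade eligibility-trace dynamics, for a linear
# value function v_hat(s, w) = dot(w, x(s)) whose gradient w.r.t. w is x(s).
# gamma, lam, and the one-hot features are made-up illustrative values.
gamma, lam = 0.9, 0.8

def trace_step(z, x):
    """Fade the whole trace by gamma*lam, then bump it by the gradient x(s)."""
    return [gamma * lam * z_i + x_i for z_i, x_i in zip(z, x)]

z = [0.0, 0.0, 0.0]
# Each step, a different component "participates" in producing the value.
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    z = trace_step(z, x)

# The component bumped first has now faded twice: (gamma * lam) ** 2
print(z)
```

Each component decays geometrically at rate $\gamma\lambda$ from the moment it last participated, which is exactly the "recency" used for credit assignment later in the chapter.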
Learning therefore affects behavior immediately rather than after a delay.</li> <li>ET is a <strong>backward view</strong> algorithm, unlike $n$-step methods, which are <strong>forward view</strong> algorithms; the backward view is simpler to implement.</li> </ul> </li> <li><strong>Forward view</strong> algorithms are based on looking forward from the updated state, and each update depends on all the future rewards.</li> <li><strong>Backward view</strong> algorithms use the current TD error, looking backward to recently visited states, to achieve nearly the same updates as the forward view.</li> <li>We start with state values and prediction, then extend to action values and control, first on-policy and then off-policy. Our focus is linear function approximation (which covers the tabular and state-aggregation cases).</li> </ul> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ul> <li><a href="#121-the--return">12.1 The $\lambda$-return</a></li> <li><a href="#122-td">12.2 TD($\lambda$)</a></li> <li><a href="#123--step-truncated--return-methods">12.3 $n$-step Truncated $\lambda$-return Methods</a></li> <li><a href="#124-redoing-updates-online--return-algorithm">12.4 Redoing Updates: Online $\lambda$-return Algorithm</a></li> <li><a href="#125-true-online-td">12.5 True Online TD($\lambda$)</a></li> <li><a href="#126-dutch-traces-in-monte-carlo-learning">12.6 Dutch Traces in Monte Carlo Learning</a></li> <li><a href="#127-sarsa">12.7 Sarsa($\lambda$)</a></li> <li><a href="#128-variable--and">12.8 Variable $\lambda$ and $\gamma$</a></li> <li><a href="#129-off-policy-traces-with-control-variates">12.9 Off-Policy Traces with Control Variates</a></li> <li><a href="#1210-watkinss-q-to-tree-backup">12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$)</a></li> <li><a href="#1211-stable-off-policy-methods-with-traces">12.11 Stable Off-Policy Methods with Traces</a></li> <li><a href="#1212-implementation-issues">12.12 Implementation 
Issues</a></li> <li><a href="#1213-conclusions">12.13 Conclusions</a></li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li><a href="#citation">Citation</a></li> </ul> <hr /> <h2 id="121-the-lambda-return-">12.1 The $\lambda$-return <a name="121-the--return"></a></h2> <ul> <li>Recall in Chapter 7 we defined an $n$-step return as the sum of the first $n$ rewards plus the estimated value of the state reached in $n$ steps, each appropriately discounted:</li> </ul> \[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad\quad 0 \leq t \leq T - n\] <ul> <li>A valid update can be done not just towards any $n$-step return, but also towards any average of $n$-step returns. <ul> <li>E.g. average the 2-step and 4-step return: $\frac{1}{2} G_{t:t+2} + \frac{1}{2} G_{t:t+4}$</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-1-the-2-and-4-step-returns.png" alt="compound update of 2-step and 4-step" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for Compound Update</strong> </div> <div class="callout__body"> <p>The compound update mixing half of a two-step return and half of a four-step return.</p> </div> </div> <ul> <li>Any set of $n$-step returns can be averaged, even an infinite set, as long as the weights on the component returns are positive and sum to $1$.</li> <li>What if instead of using one $n$-step return, we use a weighted average of all $n$-step returns? 
Such averaging produces a substantial new range of algorithms. E.g., <ol> <li>Averaging one-step and infinite-step returns to interrelate TD and MC methods.</li> <li>Averaging experience-based updates with Dynamic Programming (DP) updates to obtain a single combination of experience-based and model-based methods.</li> </ol> </li> <li>An update that averages simpler component updates is called a <strong>compound update</strong>; the <strong>$\lambda$-return</strong> is one particular compound return.</li> <li>The TD($\lambda$) algorithm is one way of averaging $n$-step updates, each weighted proportionally to $\lambda^{n-1}$ (where $\lambda \in [0,1]$) and normalized by a factor of $(1-\lambda)$ to ensure that the weights sum to $1$.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-1-td-lambda.png" alt="TD(lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for TD($\lambda$)</strong> </div> <div class="callout__body"> <p>If $\lambda = 0$, then the overall update reduces to its first component, the <strong>TD(0)</strong> update, whereas if $\lambda = 1$, then the overall update reduces to its last component, the <strong>MC</strong> update.</p> </div> </div> <ul> <li>Essentially, the $\lambda$-return $G_t^\lambda$ combines all $n$-step returns $G_{t:t+n}$, weighting the $n$-step return by $(1-\lambda)\lambda^{n-1}$, and is defined in its state-based form by:</li> </ul> \[\boxed{G_t^\lambda \doteq (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}}\] <h3 id="tdlambda-weighting">TD($\lambda$) Weighting</h3> <ul> <li>The TD($\lambda$) weighting function diagram illustrates the weighting on the sequence of $n$-step returns in the $\lambda$-return: <ul> <li>$1$-step return gets the largest weight, $1-\lambda$</li> <li>$2$-step return gets the next (2nd) largest weight, $(1-\lambda)\lambda$</li> <li>$3$-step return gets the 3rd largest weight, $(1-\lambda)\lambda^2$</li> <li>$n$-step return gets the $n$-th largest 
(smallest) weight, $(1-\lambda)\lambda^{n-1}$</li> <li>The weight fades by $\lambda$ with each additional step.</li> <li>After a terminal state has been reached, all subsequent $n$-step returns are equal to the conventional return $G_t$.</li> <li>So essentially, based on the TD($\lambda$) weighting function diagram, we can decompose $G_t^\lambda$ into a pre-termination sum and a post-termination term:</li> </ul> \[G_t^\lambda = \underbrace{(1-\lambda) \sum\nolimits_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n}}_{\text{pre-termination}} + \underbrace{\lambda^{T-t-1} G_t}_{\text{post-termination}}\] <ul> <li>So now we can see the impact of $\lambda$ more clearly:</li> </ul> \[\begin{aligned} \text{if } \lambda = 1: \quad &amp; G_t^\lambda = G_t \quad \text{(MC)} \\[6pt] \text{if } \lambda = 0: \quad &amp; G_t^\lambda = G_{t:t+1} \quad \text{(TD(0); the weight } (1-\lambda)\lambda^{n-1} \text{ is } 1 \text{ for } n = 1 \text{ and } 0 \text{ for } n &gt; 1\text{)} \end{aligned}\] </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-1-td-lambda-weighting-function.png" alt="TD(lambda) weighting" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>TD($\lambda$) Weighting</strong> </div> <div class="callout__body"> <p>Weighting given in the $\lambda$-return to each of the $n$-step returns.</p> </div> </div> <ul> <li>Our first learning algorithm based on the $\lambda$-return is the <strong>off-line $\lambda$-return algorithm</strong>, which waits until the end of an episode to make updates. 
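The finite-episode decomposition of the $\lambda$-return above can be sanity-checked numerically. A minimal sketch with made-up $n$-step returns (the helper name `lambda_return` and all values are illustrative, not from the book): with $\lambda = 0$ it recovers the one-step return, and with $\lambda = 1$ the full return.

```python
# Numerical check of the finite-episode form of the lambda-return.
# n_step_returns[n-1] holds a hypothetical G_{t:t+n}; the last entry is the
# full return G_t (the episode terminates T - t = len(n_step_returns) steps on).
def lambda_return(n_step_returns, lam):
    T_minus_t = len(n_step_returns)
    pre = (1 - lam) * sum(lam ** (n - 1) * n_step_returns[n - 1]
                          for n in range(1, T_minus_t))
    post = lam ** (T_minus_t - 1) * n_step_returns[-1]
    return pre + post

G = [2.0, 1.5, 1.0, 3.0]          # made-up n-step returns; G[-1] is G_t
print(lambda_return(G, 0.0))      # the one-step return G_{t:t+1}, i.e. TD(0)
print(lambda_return(G, 1.0))      # the full return G_t, i.e. MC
```

Feeding in all-ones returns for any $\lambda$ returns exactly $1.0$, confirming the weights sum to one.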
Its semi-gradient update toward the $\lambda$-return target, for $t = 0, 1, 2, \ldots, T-1$, is:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G_t^\lambda - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)}\] <ul> <li>The $\lambda$-return allows us to move smoothly between MC and TD(0) methods, just as $n$-step returns do.</li> </ul> <h3 id="forward-view">Forward View</h3> <ul> <li>This approach is called the theoretical, forward view of a learning algorithm: <ul> <li>Update the value function towards the $\lambda$-return.</li> <li>Look forward in time to all the future rewards to compute $G_t^\lambda$.</li> <li>Like MC, it can only be computed once the complete return is known.</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-1-forward-view-td-lambda.png" alt="Forward view" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Forward View</strong> </div> <div class="callout__body"> <p>We decide how to update each state by looking forward to future rewards and states.</p> </div> </div> <hr /> <h2 id="122-tdlambda-">12.2 TD($\lambda$) <a name="122-td"></a></h2> <ul> <li>TD($\lambda$) was the first algorithm that showed a formal relationship between a forward view and a backward view using eligibility traces.</li> <li>TD($\lambda$) improves over the off-line $\lambda$-return algorithm in three ways: <ul> <li>It updates the weight vector on every step of an episode, not just at the end.</li> <li>Its computations are distributed equally in time rather than concentrated at the episode’s end.</li> <li>It can be applied to continuing problems, not just episodic ones.</li> </ul> </li> <li>Let’s focus on the <strong>semi-gradient version of TD($\lambda$)</strong> with function approximation: <ul> <li>The <strong>eligibility trace $\mathbf{z}_t$</strong> has the same number of components as $\mathbf{w}_t$.</li> <li>$\mathbf{z}$ is initialized to $\mathbf{0}$, incremented on each time step by the value gradient, and 
then fades away by $\gamma\lambda$:</li> </ul> \[\begin{align*} \mathbf{z}_{-1} &amp;\doteq \mathbf{0} \\ \mathbf{z}_t &amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t), \quad 0 \leq t \leq T \end{align*}\] \[\text{where } \lambda \equiv \text{trace decay parameter and } \gamma \equiv \text{discount rate}\] </li> <li>The eligibility trace keeps track of which components of $\mathbf{w}_t$ have contributed, positively or negatively, to recent state valuations.</li> <li>This is the <strong>recency heuristic</strong> used for <strong>credit assignment,</strong> where more credit is assigned to the most recent states. <strong>Recent</strong> is measured in terms of $\gamma\lambda$.</li> <li> <p>The TD error for state-value prediction is:</p> \[\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\] <p>and the weight vector update in TD($\lambda$) is proportional to the scalar TD error and the vector eligibility trace:</p> </li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t}\] <h3 id="backward-view">Backward View</h3> <ul> <li>The forward view provides the theory; the backward view provides the (practical) mechanism: we update online, at every step, from incomplete sequences.</li> <li>Keep an eligibility trace $z_t(s)$ for every state $s$.</li> <li>Update the value $V(s)$ of every state $s$ in proportion to the TD error $\delta_t$ and the eligibility trace $z_t(s)$:</li> </ul> \[\begin{aligned} \delta_t &amp;= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\ V(s) &amp;\leftarrow V(s) + \alpha\, \delta_t\, z_t(s) \end{aligned}\] <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-2-backward-view-td-lambda.png" alt="Backward TD(lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backward View of TD($\lambda$)</strong> </div> <div class="callout__body"> <p>In the backward or mechanistic view of TD($\lambda$), each update depends on the current TD error 
combined with the current eligibility traces of past events.</p> </div> </div> <ul> <li>Let’s look at the effect of $\lambda$ to understand the backward view of TD($\lambda$):</li> </ul> \[\begin{align*} \text{if } \lambda = 0: \quad &amp; \mathbf{z}_t = \nabla \hat{v}(S_t, \mathbf{w}_t) \\ &amp; \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{TD(0)} \\[6pt] \text{if } 0 &lt; \lambda &lt; 1: \quad &amp; \text{earlier states are given less credit for the TD error} \\[6pt] \text{if } \lambda = 1: \quad &amp; \mathbf{z}_t = \gamma \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{credit for earlier states falls by } \gamma \text{ per step} \\[6pt] \text{if } \lambda = 1,\ \gamma = 1: \quad &amp; \mathbf{z}_t = \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{MC-like behavior (no time decay for ET)} \end{align*}\] <ul> <li> <p>In summary, $\lambda = 1$ yields TD(1).</p> </li> <li>TD(1) implements MC algorithms in a more general way and with wider applicability: <ul> <li>Not limited to episodic tasks; can be applied to discounted continuing tasks.</li> <li>Can be performed <strong>incrementally and online.</strong></li> <li>Learns <strong>immediately</strong> and, for control methods, alters behavior during an episode if something good or bad happens.</li> </ul> </li> <li>Linear TD($\lambda$) converges in the on-policy case if the step-size parameter $\alpha$ is reduced over time according to the stochastic approximation conditions.</li> <li>The convergence of linear TD($\lambda$) is not to the minimum-error weight vector but to a nearby weight vector that depends on $\lambda$.</li> <li>The bound on solution quality, generalized to any $\lambda$ for the continuing, discounted case, is:</li> </ul> \[\overline{\text{VE}}(\mathbf{w}_\infty) \leq \frac{1 - \gamma\lambda}{1 - \gamma} \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w})\] <ul> <li>That is, the asymptotic error is no more than $\dfrac{1-\gamma\lambda}{1-\gamma}$ times the smallest possible error for TD($\lambda$):</li> </ul> \[\begin{align*} \text{as } \lambda \to 1: \quad &amp; \overline{\text{VE}}(\mathbf{w}_\infty) \to \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w}) \\[6pt] \text{as } \lambda \to 0: \quad &amp; \overline{\text{VE}}(\mathbf{w}_\infty) \to \frac{1}{1-\gamma} \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w}) \quad \text{(the TD(0) bound on } \overline{\text{VE}}(\mathbf{w}_\text{TD})\text{)} \end{align*}\] <ul> <li>However, in practice $\lambda = 1$ is often the poorest choice.</li> </ul> <hr /> <h2 id="123-n-step-truncated-lambda-return-methods-">12.3 $n$-step Truncated $\lambda$-return Methods <a name="123--step-truncated--return-methods"></a></h2> <ul> <li>The off-line $\lambda$-return algorithm is of limited use because the $\lambda$-return is not known until the episode ends; in the continuing case it is never known, since it depends on $n$-step returns for arbitrarily large $n$.</li> <li>Hence, a natural approximation is to truncate the sequence after a <strong>fixed</strong> number of steps. 
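As a quick numeric check of this truncation idea (all values are toy choices of mine, not from the book): the first $h-t-1$ returns keep their weights $(1-\lambda)\lambda^{n-1}$, the residual weight $\lambda^{h-t-1}$ goes to the longest return available at the horizon, and the weights still sum to one.

```python
# Weights placed on G_{t:t+1}, ..., G_{t:t+steps} by a lambda-return truncated
# at horizon h, where steps = h - t (lam and steps here are illustrative).
def truncated_weights(lam, steps):
    w = [(1 - lam) * lam ** (n - 1) for n in range(1, steps)]
    w.append(lam ** (steps - 1))   # residual weight on the longest return G_{t:h}
    return w

w = truncated_weights(0.9, 5)
print(w)         # approx [0.1, 0.09, 0.081, 0.0729, 0.6561]
print(sum(w))    # 1.0 up to floating point
```

Note how, for $\lambda$ close to 1, most of the weight piles onto the final, longest return, which is why larger truncation horizons are needed as $\lambda$ grows.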
This handles the continuing case.</li> <li>The truncated $\lambda$-return for time $t$, given data only up to some later horizon $h$, is:</li> </ul> \[G_{t:h}^\lambda \doteq (1-\lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{h-t-1} G_{t:h}, \quad 0 \leq t &lt; h \leq T\] \[\begin{aligned} \text{where } h &amp;\equiv \text{horizon (plays same role as time of termination } T\text{)} \end{aligned}\] <ul> <li>Here the <strong>residual weighting</strong> is given to the longest available $n$-step return $G_{t:h}$.</li> <li>The truncated $\lambda$-return gives rise to a family of $n$-step $\lambda$-return algorithms, known in the state-value case as <strong>Truncated TD($\lambda$)</strong> or <strong>TTD($\lambda$)</strong>.</li> <li>TTD($\lambda$) is defined for $0 \leq t &lt; T$ by:</li> </ul> \[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \!\left[G_{t:t+n}^\lambda - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1})}\] <ul> <li>Efficient implementation of TTD($\lambda$) relies on the $k$-step $\lambda$-return:</li> </ul> \[\boxed{G_{t:t+k}^\lambda = \hat{v}(S_t, \mathbf{w}_{t-1}) + \sum_{i=t}^{t+k-1} (\gamma\lambda)^{i-t} \delta_i'}\] \[\begin{aligned} \text{where } \delta_i' &amp;\doteq R_{i+1} + \gamma \hat{v}(S_{i+1}, \mathbf{w}_i) - \hat{v}(S_i, \mathbf{w}_{i-1}) \end{aligned}\] <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-3-truncated-td-lambda.png" alt="TTD(lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for Truncated TD($\lambda$)</strong> </div> <div class="callout__body"> <p>The truncated $\lambda$-return gives rise to a family of $n$-step $\lambda$-return algorithms called <strong>TTD($\lambda$)</strong>.</p> </div> </div> <hr /> <h2 id="124-redoing-updates-online-lambda-return-algorithm-">12.4 Redoing Updates: Online $\lambda$-return Algorithm <a name="124-redoing-updates-online--return-algorithm"></a></h2> <ul> <li>How do we 
choose the truncation parameter $n$ in TTD($\lambda$)?</li> <li>It involves a tradeoff: <ul> <li>$n$ should be large so that TTD($\lambda$) closely approximates the off-line $\lambda$-return, but</li> <li>$n$ should also be small so that the updates can be made sooner and can influence behavior sooner.</li> </ul> </li> <li>In principle, we can achieve both via the <strong>online $\lambda$-return algorithm</strong>, at the cost of computational complexity.</li> <li>Essentially, at each time step we go back and redo all the updates since the beginning of the episode as we gather each new increment of data: <ul> <li>The new updates are better than the old ones because they now account for the time step’s new data.</li> <li>This conceptual algorithm involves multiple passes over the episode, one at each horizon, each generating a different sequence of weight vectors.</li> </ul> </li> <li>Let’s distinguish the weight vectors computed at the different horizons by writing out the first 3 sequences:</li> </ul> \[\begin{align*} h=1: \quad &amp; \mathbf{w}_1^1 \doteq \mathbf{w}_0^1 + \alpha \!\left[G_{0:1}^\lambda - \hat{v}(S_0, \mathbf{w}_0^1)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^1) \\[6pt] h=2: \quad &amp; \mathbf{w}_1^2 \doteq \mathbf{w}_0^2 + \alpha \!\left[G_{0:2}^\lambda - \hat{v}(S_0, \mathbf{w}_0^2)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^2) \\ &amp; \mathbf{w}_2^2 \doteq \mathbf{w}_1^2 + \alpha \!\left[G_{1:2}^\lambda - \hat{v}(S_1, \mathbf{w}_1^2)\right] \nabla \hat{v}(S_1, \mathbf{w}_1^2) \\[6pt] h=3: \quad &amp; \mathbf{w}_1^3 \doteq \mathbf{w}_0^3 + \alpha \!\left[G_{0:3}^\lambda - \hat{v}(S_0, \mathbf{w}_0^3)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^3) \\ &amp; \mathbf{w}_2^3 \doteq \mathbf{w}_1^3 + \alpha \!\left[G_{1:3}^\lambda - \hat{v}(S_1, \mathbf{w}_1^3)\right] \nabla \hat{v}(S_1, \mathbf{w}_1^3) \\ &amp; \mathbf{w}_3^3 \doteq \mathbf{w}_2^3 + \alpha \!\left[G_{2:3}^\lambda - \hat{v}(S_2, \mathbf{w}_2^3)\right] \nabla \hat{v}(S_2, \mathbf{w}_2^3) \end{align*}\] \[\begin{aligned} \text{where} \\ \mathbf{w}_t^h &amp;\equiv \text{weights used to generate the value at time } t \text{ in the sequence up to horizon } h \\ \mathbf{w}_0^h &amp;\equiv \text{1st weight vector in each sequence, inherited from the previous episode} \\ \mathbf{w}_h^h &amp;\equiv \text{last weight vector in each sequence; } \mathbf{w}_t^t \text{ forms the algorithm’s ultimate weight-vector sequence} \end{aligned}\] <ul> <li>The general form of the <strong>online $\lambda$-return update</strong> for $0 \leq t &lt; h \leq T$ is:</li> </ul> \[\boxed{\mathbf{w}_{t+1}^h \doteq \mathbf{w}_t^h + \alpha \!\left[G_{t:h}^\lambda - \hat{v}(S_t, \mathbf{w}_t^h)\right] \nabla \hat{v}(S_t, \mathbf{w}_t^h)}\] \[\mathbf{w}_t \doteq \mathbf{w}_t^t\] <hr /> <h2 id="125-true-online-tdlambda-">12.5 True Online TD($\lambda$) <a name="125-true-online-td"></a></h2> <ul> <li>The online $\lambda$-return algorithm of Section 12.4 is the ideal, but it is computationally very complex.</li> <li>Using eligibility traces, we can invert this forward-view algorithm into an efficient backward-view algorithm that, in the linear case, produces exactly the same sequence of weight vectors. 
This is called the <strong>True Online TD($\lambda$)</strong>.</li> <li>The trick is that, of the triangular array of weight vectors produced by the online $\lambda$-return algorithm, we only ever need the diagonal (the sequence $\mathbf{w}_t^t$), and it can be computed directly and efficiently.</li> </ul> \[\begin{aligned} \begin{bmatrix} \mathbf{w}_0^0 &amp; &amp; &amp; &amp; \\ \mathbf{w}_0^1 &amp; \mathbf{w}_1^1 &amp; &amp; &amp; \\ \mathbf{w}_0^2 &amp; \mathbf{w}_1^2 &amp; \mathbf{w}_2^2 &amp; &amp; \\ \mathbf{w}_0^3 &amp; \mathbf{w}_1^3 &amp; \mathbf{w}_2^3 &amp; \mathbf{w}_3^3 &amp; \\ \vdots &amp; \vdots &amp; \vdots &amp; \vdots &amp; \ddots \\ \mathbf{w}_0^T &amp; \mathbf{w}_1^T &amp; \mathbf{w}_2^T &amp; \mathbf{w}_3^T &amp; \cdots &amp; \mathbf{w}_T^T \end{bmatrix} &amp;\longrightarrow \begin{bmatrix} \mathbf{w}_0^0 \\ &amp; \mathbf{w}_1^1 \\ &amp; &amp; \mathbf{w}_2^2 \\ &amp; &amp; &amp; \mathbf{w}_3^3 \\ &amp; &amp; &amp; &amp; \ddots \\ &amp; &amp; &amp; &amp; &amp; \mathbf{w}_T^T \end{bmatrix} \\ \end{aligned}\] \[\text{Online } \lambda\text{-return} \hspace{8em} \text{True Online TD}(\lambda)\] <ul> <li>For the linear case in which $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$, the true online TD($\lambda$) algorithm is:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t + \alpha \!\left(\mathbf{w}_t^T \mathbf{x}_t - \mathbf{w}_{t-1}^T \mathbf{x}_t\right)\!\left(\mathbf{z}_t - \mathbf{x}_t\right)}\] \[\begin{aligned} \text{where} \\ \mathbf{w}_t &amp;\doteq \mathbf{w}_t^t \\ \mathbf{x}_t &amp;\doteq \mathbf{x}(S_t) \\ \mathbf{z}_t &amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + (1 - \alpha\gamma\lambda\, \mathbf{z}_{t-1}^T \mathbf{x}_t)\, \mathbf{x}_t \end{aligned}\] <ul> <li>The per-step computational complexity of true online TD($\lambda$) is the same as TD($\lambda$), $O(d)$.</li> <li>$\mathbf{z}_t$ used in true online TD($\lambda$) is called a <strong>dutch trace</strong>, unlike that of TD($\lambda$) which is 
called an <strong>accumulating trace</strong>.</li> <li>Earlier work used a third kind of trace called the <strong>replacing trace</strong>, defined only for the tabular case or for binary feature vectors (tile coding). It is defined componentwise by:</li> </ul> \[\tilde{z}_{i,t} \doteq \left\{ \begin{array}{ll} 1 &amp; \text{if } x_{i,t} = 1 \\ \gamma\lambda\, \tilde{z}_{i,t-1} &amp; \text{otherwise} \end{array} \right\}\] <ul> <li>Nowadays, dutch traces usually perform better than replacing traces and have a clearer theoretical basis.</li> <li>Accumulating traces remain of interest for nonlinear function approximations where dutch traces are unavailable.</li> </ul> <hr /> <h2 id="126-dutch-traces-in-monte-carlo-learning">12.6 Dutch Traces in Monte Carlo Learning</h2> <ul> <li>Despite their close historical association, eligibility traces have nothing specifically to do with TD learning.</li> <li>Eligibility traces arise even in Monte Carlo learning.</li> <li>Using dutch traces, we can invert the forward-view MC algorithm into an equivalent, yet computationally cheaper, backward-view algorithm.</li> <li> <p>This is the only equivalence of forward and backward views that is explicitly demonstrated in the book.</p> </li> <li>The linear, gradient MC prediction algorithm makes the following sequence of updates, one for each time step of the episode:</li> </ul> \[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G - \mathbf{w}_t^T \mathbf{x}_t\right] \mathbf{x}_t, \quad 0 \leq t &lt; T\] <ul> <li>For simplicity, assume that the return $G$ is a single reward received at the end of the episode (hence no time subscript) and that there is no discounting.</li> <li>This is known as the <strong>Least Mean Square (LMS)</strong> rule.</li> <li> <p>We introduce an additional vector memory, the <strong>eligibility trace,</strong> that keeps a summary of all the feature vectors seen so far. 
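The equivalence this section goes on to derive can be verified numerically. A minimal sketch, in which $\alpha$, $G$, $\mathbf{w}_0$, and the feature vectors are all made-up toy data: the forward LMS sweep, which needs $G$ at every step, produces the same final weights as the incremental $\mathbf{a}/\mathbf{z}$ recursion, which touches $G$ only once at the end.

```python
# Toy check: forward-view LMS updates vs. the incremental backward view
# w_T = a + alpha * G * z. All numbers are made up; the episode has T = 3 steps.
alpha, G = 0.1, 2.0
w0 = [0.5, -0.3]
xs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # feature vectors x_0, x_1, x_2

def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

# Forward view: one LMS update per step, each needing the final return G.
w = list(w0)
for x in xs:
    err = G - dot(w, x)
    w = [w_i + alpha * err * x_i for w_i, x_i in zip(w, x)]

# Backward view: maintain a (fading weights) and the dutch trace z with O(d)
# work per step, using G only once when the episode ends.
a, z = list(w0), None
for x in xs:
    ax = dot(x, a)
    a = [a_i - alpha * ax * x_i for a_i, x_i in zip(a, x)]   # a <- (I - alpha x x^T) a
    if z is None:
        z = list(x)                                          # z_0 = x_0
    else:
        zx = dot(z, x)
        z = [z_i + (1 - alpha * zx) * x_i for z_i, x_i in zip(z, x)]

w_backward = [a_i + alpha * G * z_i for a_i, z_i in zip(a, z)]
print(w)           # forward-view weights after the episode
print(w_backward)  # identical up to floating-point error
```

The backward pass never stores past feature vectors and never looks ahead, which is exactly the point of the dutch-trace construction.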
The overall update will be the same as the MC updates’ sequence shown above and is:</p> \[\begin{align*} \mathbf{w}_T &amp;= \mathbf{w}_{T-1} + \alpha \!\left(G - \mathbf{w}_{T-1}^T \mathbf{x}_{T-1}\right) \mathbf{x}_{T-1} \\ &amp;= \mathbf{w}_{T-1} + \alpha \mathbf{x}_{T-1}\!\left(-\mathbf{x}_{T-1}^T \mathbf{w}_{T-1}\right) + \alpha G \mathbf{x}_{T-1} \\ &amp;= \!\left(\mathbf{I} - \alpha \mathbf{x}_{T-1} \mathbf{x}_{T-1}^T\right) \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1} \\ &amp;= \mathbf{F}_{T-1}\, \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1} \end{align*}\] \[\begin{aligned} \text{where} \\ \mathbf{F}_t &amp;\doteq \mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T \equiv \text{a forgetting or fading matrix} \end{aligned}\] \[\therefore\quad \mathbf{w}_{T-1} = \mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G \mathbf{x}_{T-2}\] <p>Now recursing:</p> \[\begin{align*} \mathbf{w}_T &amp;= \mathbf{F}_{T-1}\, \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1} \\ &amp;= \mathbf{F}_{T-1}\!\left(\mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G \mathbf{x}_{T-2}\right) + \alpha G \mathbf{x}_{T-1} \\ &amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G\!\left(\mathbf{F}_{T-1} \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\ &amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2}\!\left(\mathbf{F}_{T-3}\, \mathbf{w}_{T-3} + \alpha G\, \mathbf{x}_{T-3}\right) + \alpha G\!\left(\mathbf{F}_{T-1}\, \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\ &amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2} \mathbf{F}_{T-3}\, \mathbf{w}_{T-3} + \alpha G\!\left(\mathbf{F}_{T-1} \mathbf{F}_{T-2}\, \mathbf{x}_{T-3} + \mathbf{F}_{T-1}\, \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\ &amp;\quad \vdots \\ &amp;= \underbrace{\mathbf{F}_{T-1} \mathbf{F}_{T-2} \cdots \mathbf{F}_0\, \mathbf{w}_0}_{\mathbf{a}_{T-1}} + \alpha G \underbrace{\sum\nolimits_{k=0}^{T-1} \mathbf{F}_{T-1} \mathbf{F}_{T-2} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k}_{\mathbf{z}_{T-1}} \\ &amp;= \mathbf{a}_{T-1} + \alpha G\, \mathbf{z}_{T-1} 
\end{align*}\] \[\begin{aligned} \text{where} \\ \mathbf{a}_{T-1}\ \&amp;\ \mathbf{z}_{T-1} &amp;\equiv \text{values at time } T-1 \text{ of 2 auxiliary memory vectors that can be updated} \\ &amp;\phantom{{}\equiv{}} \text{incrementally w/o knowledge of } G \text{ and with } O(d) \text{ complexity per time step} \\ \mathbf{z}_t &amp;\equiv \text{dutch-style eligibility trace, initialized to } \mathbf{z}_0 = \mathbf{x}_0 \end{aligned}\] </li> <li> <p>The $\mathbf{z}_t$ vector is in fact a dutch-style eligibility trace, initialized to $\mathbf{z}_0 = \mathbf{x}_0$, that can be updated according to:</p> \[\begin{align*} \mathbf{z}_t &amp;= \sum\nolimits_{k=0}^{t} \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k, \quad 1 \leq t &lt; T \\ &amp;= \sum\nolimits_{k=0}^{t-1} \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k + \mathbf{x}_t \\ &amp;= \mathbf{F}_t \sum\nolimits_{k=0}^{t-1} \mathbf{F}_{t-1} \mathbf{F}_{t-2} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k + \mathbf{x}_t \\ &amp;= \mathbf{F}_t\, \mathbf{z}_{t-1} + \mathbf{x}_t \\ &amp;= \!\left(\mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T\right) \mathbf{z}_{t-1} + \mathbf{x}_t \\ &amp;= \mathbf{z}_{t-1} - \alpha\!\left(\mathbf{z}_{t-1}^T \mathbf{x}_t\right) \mathbf{x}_t + \mathbf{x}_t \\ &amp;\boxed{= \mathbf{z}_{t-1} + \!\left(1 - \alpha\, \mathbf{z}_{t-1}^T \mathbf{x}_t\right) \mathbf{x}_t} \end{align*}\] <p>which is the dutch trace for $\gamma\lambda = 1$.</p> </li> <li> <p>The $\mathbf{a}_t$ auxiliary vector is initialized to $\mathbf{a}_0 = \mathbf{F}_0\, \mathbf{w}_0$ (so that the product form below holds at $t=0$) and then updated according to:</p> \[\begin{align*} \mathbf{a}_t &amp;= \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_0\, \mathbf{w}_0, \quad 1 \leq t &lt; T \\ &amp;= \mathbf{F}_t\, \mathbf{a}_{t-1} \\ &amp;= \!\left(\mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T\right) \mathbf{a}_{t-1} \\ &amp;\boxed{= \mathbf{a}_{t-1} - \alpha\!\left(\mathbf{x}_t^T \mathbf{a}_{t-1}\right) \mathbf{x}_t} \end{align*}\] </li> </ul> <h3
id="takeaways">Takeaways</h3> <ul> <li>The auxiliary vectors, $\mathbf{a}_t$ and $\mathbf{z}_t$, are updated on each time step $t &lt; T$ and then, at time $T$ when $G$ is observed, they are used to compute:</li> </ul> \[\boxed{\mathbf{w}_T = \mathbf{a}_{T-1} + \alpha G\, \mathbf{z}_{T-1}}\] <ul> <li>The time and memory complexity per step is $O(d)$.</li> <li>This is surprising and intriguing because eligibility traces are at work in a non-TD setting (ET arise wherever long-term predictions need to be learned efficiently).</li> </ul> <hr /> <h2 id="127-sarsalambda-">12.7 Sarsa($\lambda$) <a name="127-sarsa"></a></h2> <ul> <li>Now let’s extend eligibility traces to action-value methods.</li> <li> <p>First, let’s recall the action-value form of the <strong>$n$-step</strong> return:</p> \[G_{t:t+n} \doteq R_{t+1} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad t+n &lt; T\] \[\text{with} \quad G_{t:t+n} = G_t \quad \text{ if } t+n \geq T.\] </li> <li>With this, for $t = 0, \ldots, T-1,$ let’s form the action-value form of the <strong>off-line $\lambda$-return</strong> algorithm which uses $\hat{q}$ rather than $\hat{v}$:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G_t^\lambda - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\] \[\begin{aligned} \text{where} \quad G_t^\lambda &amp;\doteq G_{t:\infty}^\lambda \end{aligned}\] <ul> <li>For the forward view shown in the figure below, which is similar to TD($\lambda$), the updates are: <ul> <li><strong>1st update:</strong> one full-step lookahead</li> <li><strong>2nd update:</strong> two-step lookahead</li> <li><strong>Final update:</strong> complete return.</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-7-sarsa-lambda.png" alt="Sarsa(lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for Sarsa($\lambda$)</strong> </div> <div class="callout__body">
<p>The first update looks ahead one full step, to the next state–action pair, the second looks ahead two steps, to the second state–action pair, and so on. A final update is based on the complete return.</p> </div> </div> <ul> <li>The weighting of each $n$-step update in the $\lambda$-return is the same as in TD($\lambda$) and the $\lambda$-return algorithm.</li> <li>The backward-view TD method for action values, Sarsa($\lambda$), has the same update rule as TD($\lambda$):</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t}\] <ul> <li>The action-value form of the TD error is used:</li> </ul> \[\delta_t \doteq R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\] <ul> <li>The action-value form of the eligibility trace is:</li> </ul> \[\begin{align*} \mathbf{z}_{-1} &amp;\doteq \mathbf{0} \\ \mathbf{z}_t &amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), \quad 0 \leq t \leq T \end{align*}\] <ul> <li>There exists an action-value version of our ideal TD method, the online $\lambda$-return algorithm and its efficient implementation as true online TD($\lambda$): <ul> <li><strong>Section <a href="#124-redoing-updates-online--return-algorithm">12.4</a>:</strong> Everything there holds here except for using the action-value form of the $n$-step return, $G_{t:t+n}$.</li> <li><strong>Sections <a href="#125-true-online-td">12.5</a> &amp; <a href="#126-dutch-traces-in-monte-carlo-learning">12.6</a>:</strong> Everything holds here except for using state-action feature vectors $\mathbf{x}_t = \mathbf{x}(S_t, A_t)$ instead of state feature vectors $\mathbf{x}_t = \mathbf{x}(S_t)$.</li> <li>The resulting efficient backward algorithm obtained from using the eligibility trace to invert the action-value form of the forward view, online $\lambda$-return is called <strong>True Online Sarsa($\lambda$)</strong>.</li> </ul> </li> <li>There is also a truncated version of Sarsa($\lambda$), called
<strong>Forward Sarsa($\lambda$)</strong>, which appears to be a model-free control method for use in conjunction with multi-layer ANNs.</li> </ul> <hr /> <h2 id="128-variable-lambda-and-gamma-">12.8 Variable $\lambda$ and $\gamma$ <a name="128-variable--and"></a></h2> <ul> <li>To get the most general forms of the final TD algorithms, it is vital to generalize the degree of bootstrapping and discounting beyond constant parameters to functions dependent on the state and action:</li> </ul> \[\begin{align*} \lambda_t &amp;\doteq \lambda(S_t, A_t), \quad &amp; \lambda &amp;: S \times A \to [0,1] \\ \gamma_t &amp;\doteq \gamma(S_t), \quad &amp; \gamma &amp;: S \to [0,1] \end{align*}\] <ul> <li>$\gamma_t$ is the termination function and is significant because it changes the return $G_t$, which is now more generally defined as:</li> </ul> \[\begin{align*} G_t &amp;\doteq R_{t+1} + \gamma_{t+1} G_{t+1} \\ &amp;= R_{t+1} + \gamma_{t+1} R_{t+2} + \gamma_{t+1} \gamma_{t+2} R_{t+3} + \gamma_{t+1} \gamma_{t+2} \gamma_{t+3} R_{t+4} + \ldots \\ &amp;= \sum_{k=t}^{\infty} \left(\prod_{i=t+1}^{k} \gamma_i\right) R_{k+1} \end{align*}\] \[\begin{aligned} \text{where } \prod_{k=t}^{\infty} \gamma_k &amp;= 0 \text{ with probability 1 for all } t, \text{ to assure the sums are finite} \end{aligned}\] <ul> <li>This general return $G_t$ definition enables episodic settings to become a single stream of experience, without special terminal states, start distributions, or termination times <ul> <li>A terminal state just becomes a state with $\gamma(s) = 0$ that transitions to the start distribution.</li> </ul> </li> <li>Generalization to variable bootstrapping yields a new state-based $\lambda$-return:</li> </ul> \[\boxed{G_t^{\lambda s} \doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{v}(S_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s}\right]}\] <ul> <li>Action-based $\lambda$-return is either the <strong>Sarsa form</strong>:</li> </ul> \[\boxed{G_t^{\lambda a}
\doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda a}\right]}\] <ul> <li>or the <strong>Expected Sarsa form</strong>:</li> </ul> \[\boxed{G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \bar{V}_t(S_{t+1}) + \lambda_{t+1}\, G_{t+1}^{\lambda a}\right]}\] \[\begin{aligned} \text{where } \bar{V}_t(s) \doteq \sum_a \pi(a \vert s)\, \hat{q}(s, a, \mathbf{w}_t) \\ \end{aligned}\] <h3 id="superscripts-notation-for-i-in-g_tlambda-i">Superscript notation for $i$ in $G_t^{\lambda i}$</h3> \[\begin{aligned} \text{"s"} &amp;: \text{bootstraps from state values} \\ \text{"a"} &amp;: \text{bootstraps from action values} \end{aligned}\] <hr /> <h2 id="129-off-policy-traces-with-control-variates">12.9 Off-Policy Traces with Control Variates</h2> <ul> <li>To generalize to off-policy, we need to incorporate importance sampling using eligibility traces.</li> <li>Let’s focus on the bootstrapping generalization of per-decision importance sampling with control variates <strong>(Section 7.4).</strong></li> <li>The new state-based $\lambda$-return in <strong>Section <a href="#128-variable--and">12.8</a></strong> generalizes, following the model of the off-policy, control-variate, $n$-step return (ending at horizon $h$), to:</li> </ul> \[\boxed{G_t^{\lambda s} \doteq \rho_t \!\left(R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{v}(S_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s}\right]\right) + (1 - \rho_t)\, \hat{v}(S_t, \mathbf{w}_t)}\] \[\begin{aligned} \text{where } \rho_t &amp;= \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)} \end{aligned}\] <ul> <li>The final $\lambda$-return can be approximated in terms of sums of the state-based TD error $\delta_t^s$, with the approximation becoming exact if the approximate value function does not change:</li> </ul> \[\begin{align*} G_t^{\lambda s} &amp;\approx \hat{v}(S_t, \mathbf{w}_t) + \rho_t \sum_{k=t}^{\infty} \delta_k^s
\prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\ \delta_t^s &amp;\doteq R_{t+1} + \gamma_{t+1}\, \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \end{align*}\] <ul> <li>The forward view update of the approximate $\lambda$-return is:</li> </ul> \[\begin{align*} \mathbf{w}_{t+1} &amp;= \mathbf{w}_t + \alpha \!\left[G_t^{\lambda s} - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t) \\ &amp;\boxed{\approx \mathbf{w}_t + \alpha \rho_t \!\left(\sum_{k=t}^{\infty} \delta_k^s \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i\right) \nabla \hat{v}(S_t, \mathbf{w}_t)} \end{align*}\] <ul> <li>We’re interested in the equivalence (approximately) between the forward-view update summed over time and a backward-view update summed over time. The equivalence is approximate because we ignore changes in the value function.</li> <li>The sum of the forward-view update over time is:</li> </ul> \[\begin{align*} \sum_{t=0}^{\infty} \!\left(\mathbf{w}_{t+1} - \mathbf{w}_t\right) &amp;\approx \sum_{t=0}^{\infty} \sum_{k=t}^{\infty} \alpha \rho_t\, \delta_k^s \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\ &amp;= \sum_{k=0}^{\infty} \sum_{t=0}^{k} \alpha \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t)\, \delta_k^s \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\ &amp;\quad \left(\text{using the summation rule: } \sum_{t=x}^{y} \sum_{k=t}^{y} = \sum_{k=x}^{y} \sum_{t=x}^{k}\right) \\ &amp;= \sum_{k=0}^{\infty} \alpha\, \delta_k^s \sum_{t=0}^{k} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \end{align*}\] <ul> <li>If the entire expression from the 2nd sum on could be written and updated incrementally as an eligibility trace, then the sum of the forward-view update over time would be in the form of the sum of a backward-view TD update. 
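The incremental trace form that makes this possible (derived next) can be sanity-checked against the explicit sum. Here is a small NumPy sketch with arbitrary synthetic sequences for $\rho$, $\gamma$, $\lambda$, and the (linear-case) gradients, all invented for illustration:

```python
import numpy as np

# Check that the incremental recursion
#   z_k = rho_k * (gamma_k * lambda_k * z_{k-1} + grad_k)
# reproduces the explicit sum
#   z_k = sum_{t<=k} rho_t * grad_t * prod_{i=t+1}^{k} gamma_i lambda_i rho_i.
rng = np.random.default_rng(1)
K, d = 8, 3
grads = rng.normal(size=(K, d))          # stands in for grad v_hat(S_t, w_t)
gam = rng.uniform(0.5, 1.0, size=K)
lam = rng.uniform(0.5, 1.0, size=K)
rho = rng.uniform(0.5, 1.5, size=K)

def z_explicit(k):
    total = np.zeros(d)
    for t in range(k + 1):
        decay = np.prod([gam[i] * lam[i] * rho[i] for i in range(t + 1, k + 1)])
        total += rho[t] * grads[t] * decay
    return total

z = np.zeros(d)                          # z_{-1} = 0
for k in range(K):
    z = rho[k] * (gam[k] * lam[k] * z + grads[k])
    assert np.allclose(z, z_explicit(k))
```

The recursion and the explicit sum agree exactly at every step, which is the algebraic content of the derivation in this section.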
<ul> <li>That is, if this expression was the trace at time $k$, then we could update it from its value at time $k-1$ by:</li> </ul> </li> </ul> \[\begin{align*} \mathbf{z}_k &amp;= \sum_{t=0}^{k} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\ &amp;= \sum_{t=0}^{k-1} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i + \rho_k \nabla \hat{v}(S_k, \mathbf{w}_k) \\ &amp;= \gamma_k \lambda_k \rho_k \underbrace{\sum_{t=0}^{k-1} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k-1} \gamma_i \lambda_i \rho_i}_{\mathbf{z}_{k-1}} + \rho_k \nabla \hat{v}(S_k, \mathbf{w}_k) \end{align*}\] \[\boxed{\mathbf{z}_k = \rho_k \!\left[\gamma_k \lambda_k\, \mathbf{z}_{k-1} + \nabla \hat{v}(S_k, \mathbf{w}_k)\right]}\] <ul> <li>If we change the index from $k$ to $t$ of the $\mathbf{z}_k$ equation above, we get the <strong>general accumulating trace</strong> update for state values:</li> </ul> \[\boxed{\mathbf{z}_t \doteq \rho_t \!\left[\gamma_t \lambda_t\, \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t)\right]}\] <ul> <li>This eligibility trace combined with the usual semi-gradient TD($\lambda$) parameter-update rule <strong>(Section <a href="#122-td">12.2</a>)</strong> forms a <strong>general TD($\lambda$)</strong> algorithm that can be applied to either on-policy or off-policy data: <ul> <li>In on-policy, the algorithm is exactly TD($\lambda$) because $\rho_t = 1$ always and the ET above becomes the usual accumulating trace for variable $\lambda$ and $\gamma$:</li> </ul> \[\mathbf{z}_t \doteq \gamma_t \lambda_t\, \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t)\] <ul> <li>In off-policy, the algorithm stays as it is, although not guaranteed to be stable as a semi-gradient method.</li> <li>For off-policy, we’ll consider extensions that guarantee stability in the next few sections.</li> </ul> </li> <li>Let’s derive the off-policy ET for <strong>action-value</strong> methods and corresponding 
general Sarsa($\lambda$) algorithms. <ul> <li>Starting with either recursive general action-based $\lambda$-return (Sarsa or Expected Sarsa), $G_t^{\lambda a}$, in <strong>Section <a href="#128-variable--and">12.8</a></strong> (Expected Sarsa works out to be simpler), we can extend the Expected Sarsa $G_t^{\lambda a}$ following the model of the action-based, off-policy, control-variate, $n$-step return:</li> </ul> \[\boxed{\begin{align*} G_t^{\lambda a} &amp;\doteq R_{t+1} + \gamma_{t+1}\!\left(\!\left[1 - \lambda_{t+1}\right] \bar{V}_t(S_{t+1}) + \lambda_{t+1}\!\left[\rho_{t+1} G_{t+1}^{\lambda a} + \bar{V}_t(S_{t+1}) - \rho_{t+1}\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right) \\ &amp;= R_{t+1} + \gamma_{t+1}\!\left(\bar{V}_t(S_{t+1}) + \lambda_{t+1} \rho_{t+1} \!\left[G_{t+1}^{\lambda a} - \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right) \end{align*}}\] \[\begin{aligned} \text{where } \bar{V}_t(S_{t+1}) &amp;= \sum_a \pi(a \vert S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) \end{aligned}\] <ul> <li>The $\lambda$-return, approximately as the sum of TD errors, is:</li> </ul> \[\begin{align*} G_t^{\lambda a} &amp;\approx \hat{q}(S_t, A_t, \mathbf{w}_t) + \sum_{k=t}^{\infty} \delta_k^a \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\ \delta_t^a &amp;= R_{t+1} + \gamma_{t+1} \bar{V}_t(S_{t+1}) - \hat{q}(S_t, A_t, \mathbf{w}_t) \end{align*}\] <ul> <li>Using steps analogous to those for the state case earlier in this section, we write a forward-view update based on the action-based $\lambda$-return $G_t^{\lambda a}$ above, transform the sum of the updates using the summation rule, and finally derive the eligibility trace for action values:</li> </ul> \[\boxed{\mathbf{z}_t \doteq \gamma_t \lambda_t \rho_t\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\] <ul> <li>This ET combined with the action-based, expected TD error $\delta_t^a$ and the usual semi-gradient TD($\lambda$) parameter-update rule <strong>(Section
<a href="#122-td">12.2</a>)</strong> forms an elegant, efficient <strong>Expected Sarsa($\lambda$)</strong> algorithm that can be applied to either on-policy or off-policy data: <ul> <li><strong><u>On-policy case</u>:</strong> The algorithm becomes the Sarsa($\lambda$) algorithm given constant $\lambda$ and $\gamma$, and the usual state-action TD error:</li> </ul> \[\begin{aligned} &amp;\quad \rho_t = 1, \quad \lambda_t = \lambda, \quad \gamma_t = \gamma \\ &amp;\boxed{\mathbf{z}_t \doteq \gamma\lambda\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)} \end{aligned}\] </li> <li>At $\lambda = 1$, these algorithms become closely related to corresponding Monte Carlo algorithms.</li> <li>No episode-by-episode equivalence of updates exists, only of their expectations, even under the most favorable conditions. <ul> <li>Methods have been proposed recently <strong>[Sutton, Mahmood, Precup &amp; van Hasselt, 2014]</strong> that do achieve an exact equivalence.</li> <li>These methods require an additional vector of <strong>“provisional weights”</strong> that keep track of executed updates but may need to be retracted/emphasized depending on future actions taken.</li> <li>The state and state-action versions of these methods are called <strong>PTD($\lambda$) and PQ($\lambda$)</strong> respectively, where the ‘P’ stands for Provisional.</li> </ul> </li> <li> <p>If $\lambda &lt; 1$, then all these off-policy algorithms involve bootstrapping and <strong>the deadly triad</strong> applies, meaning that they can be guaranteed stable only for the tabular case, state aggregation, and other limited forms of function approximation.</p> </li> <li>Recall that the challenge of off-policy learning has 2 parts.
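As a runnable illustration of the on-policy special case ($\rho_t = 1$ with constant $\gamma$ and $\lambda$), here is a minimal tabular TD($\lambda$) sketch with accumulating traces; the 5-state chain, step size, and episode count are all invented for illustration:

```python
import numpy as np

# Tabular (one-hot linear) semi-gradient TD(lambda) with accumulating traces:
# the on-policy special case (rho_t = 1) of the general algorithm above.
# Deterministic 5-state chain, reward 1 only on the final transition, so the
# true values are v(s) = gamma ** (4 - s).
n_states, gamma, lam, alpha = 5, 0.9, 0.8, 0.1
w = np.zeros(n_states)            # one-hot features => w[s] == v_hat(s)

for episode in range(1000):
    z = np.zeros(n_states)        # eligibility trace, reset each episode
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0
        v_next = 0.0 if s == n_states - 1 else w[s + 1]
        delta = r + gamma * v_next - w[s]        # TD error
        z = gamma * lam * z                      # decay all traces
        z[s] += 1.0                              # accumulate: grad = one-hot(s)
        w = w + alpha * delta * z                # semi-gradient TD(lambda) step

true_v = gamma ** (4 - np.arange(n_states))
assert np.max(np.abs(w - true_v)) < 1e-3
```

With $\rho_t = 1$ the general trace reduces to the usual accumulating trace, and the learned values converge to the exact discounted returns of the chain.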
Off-policy eligibility traces deal effectively with the 1st part, correcting for the expected value of the targets, but not with the 2nd part, which has to do with the distribution of updates (matching off-policy to on-policy).</li> <li>Algorithmic strategies for handling the 2nd part of the off-policy learning challenge with eligibility traces are summarized in <strong>Section <a href="#1211-stable-off-policy-methods-with-traces">12.11</a>.</strong></li> </ul> <hr /> <h2 id="1210-watkinss-qlambda-to-tree-backuplambda-">12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$) <a name="1210-watkinss-q-to-tree-backup"></a></h2> <h3 id="watkinss-qlambda">Watkins’s Q($\lambda$)</h3> <ul> <li>Watkins’s Q($\lambda$) is the original method for extending Q-learning to eligibility traces.</li> <li>It decays the ET in the usual way as long as greedy actions are taken, then cuts the traces to 0 at the 1st non-greedy action.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-10-watkins-q-lambda.png" alt="Watkins's Q(lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for Watkins's Q($\lambda$)</strong> </div> <div class="callout__body"> <p>The series of component updates ends either with the end of the episode or with the first nongreedy action, whichever comes first.</p> </div> </div> <h3 id="tree-backuplambda">Tree-Backup($\lambda$)</h3> <ul> <li>Let’s look at the eligibility trace version of Tree Backup, which is called <strong>Tree-Backup($\lambda$)</strong> or <strong>TB($\lambda$)</strong>.</li> <li>TB($\lambda$) is the <strong>true successor</strong> to Q-learning because it has no importance sampling.</li> <li>The TB($\lambda$) concept is straightforward: <ul> <li> <p>The tree-backup updates of each length (Section 7.5) are weighted depending on the bootstrapping parameter $\lambda$.</p> </li> <li> <p>Using the recursive form of the action-based $\lambda$-return for Expected Sarsa and then
expanding the bootstrapping target after the model of the tree-backup $n$-step return (Section 7.5):</p> \[\boxed{\begin{align*} G_t^{\lambda a} &amp;\doteq R_{t+1} + \gamma_{t+1}\!\left(\!\left[1 - \lambda_{t+1}\right] \bar{V}_t(S_{t+1}) + \lambda_{t+1}\!\left[\sum_{a \neq A_{t+1}} \pi(a \vert S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) + \pi(A_{t+1} \vert S_{t+1}) G_{t+1}^{\lambda a}\right]\right) \\ &amp;= R_{t+1} + \gamma_{t+1}\!\left(\bar{V}_t(S_{t+1}) + \lambda_{t+1} \pi(A_{t+1} \vert S_{t+1}) \!\left[G_{t+1}^{\lambda a} - \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right) \end{align*}}\] </li> </ul> </li> <li> <p>$G_t^{\lambda a}$ can be approximated (ignoring changes in approx. value function) as a sum of TD errors:</p> \[\begin{align*} G_t^{\lambda a} &amp;\approx \hat{q}(S_t, A_t, \mathbf{w}_t) + \sum_{k=t}^{\infty} \delta_k^a \prod_{i=t+1}^{k} \gamma_i \lambda_i \pi(A_i \vert S_i) \\ \delta_t^a &amp;= R_{t+1} + \gamma_{t+1} \bar{V}_t(S_{t+1}) - \hat{q}(S_t, A_t, \mathbf{w}_t) \end{align*}\] </li> <li> <p>As always, using the same steps as in the previous section, we get a special eligibility trace update involving the target-policy probabilities of the selected actions:</p> \[\boxed{\mathbf{z}_t \doteq \gamma_t \lambda_t \pi(A_t \vert S_t)\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\] </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch12-12-10-tree-backup-q-lambda.png" alt="Tree Backup (lambda)" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Backup Diagram for Tree Backup($\lambda$)</strong> </div> <div class="callout__body"> <p>The tree-backup updates of each length are weighted in the usual way, depending on the bootstrapping parameter $\lambda$.</p> </div> </div> <ul> <li>The ET above combined with the usual semi-gradient TD($\lambda$) parameter-update rule defines the TB($\lambda$) algorithm.</li> <li>Like all semi-gradient algorithms, TB($\lambda$) is not guaranteed to be stable when used
with off-policy data and a powerful function approximator <strong>(the deadly triad).</strong></li> </ul> <hr /> <h2 id="1211-stable-off-policy-methods-with-traces">12.11 Stable Off-Policy Methods with Traces</h2> <ul> <li>Let’s look at 4 of the most important methods that achieve stable off-policy training using eligibility traces.</li> <li>All 4 are based on either <strong>Gradient-TD or Emphatic TD</strong> methods and linear function approximation.</li> </ul> <h3 id="gtdlambda">GTD($\lambda$)</h3> <ul> <li> <p>GTD($\lambda$) is analogous to TDC and aims to learn a parameter $\mathbf{w}_{t}$ such that $\hat{v}(s, \mathbf{w}_t) \doteq \mathbf{w}_{t}^T \mathbf{x}(s) \approx v_{\pi}(s)$ even from data due to following another policy $b$. Its update is:</p> \[\begin{aligned} \mathbf{w}_{t+1} &amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t - \alpha \gamma_{t+1}(1 - \lambda_{t+1})\!\left(\mathbf{z}_t^T \mathbf{v}_t\right) \mathbf{x}_{t+1} \\ \mathbf{v}_{t+1} &amp;\doteq \mathbf{v}_t + \beta\, \delta_t^s\, \mathbf{z}_t - \beta\!\left(\mathbf{v}_t^T \mathbf{x}_t\right) \mathbf{x}_t \end{aligned}\] \[\begin{aligned} \text{where} \\ \mathbf{v} &amp;\in \mathbb{R}^d \equiv \text{a vector of the same dimension as } \mathbf{w}, \text{ initialized to } \mathbf{v}_0 = \mathbf{0} \\ \beta &amp;&gt; 0 \equiv \text{a 2nd step-size parameter} \end{aligned}\] </li> </ul> <h3 id="gqlambda">GQ($\lambda$)</h3> <ul> <li>Gradient-TD algorithm for action values with eligibility traces.</li> <li>GQ($\lambda$) aims to learn $\mathbf{w}_{t}$ such that $\hat{q}(s, a, \mathbf{w}_{t}) \doteq \mathbf{w}_{t}^T \mathbf{x}(s,a) \approx q_{\pi}(s,a)$ from off-policy data.</li> <li>If the target policy is $\varepsilon$-greedy, or otherwise biased towards the greedy policy for $\hat{q}$, then GQ($\lambda$) can be used as a control algorithm.</li> <li> <p>The GQ($\lambda$) update is:</p> \[\begin{aligned} \mathbf{w}_{t+1} &amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^a\, \mathbf{z}_t - \alpha
\gamma_{t+1}(1 - \lambda_{t+1})\!\left(\mathbf{z}_t^T \mathbf{v}_t\right) \bar{\mathbf{x}}_{t+1} \\ \bar{\mathbf{x}}_t &amp;\doteq \sum_a \pi(a \vert S_t)\, \mathbf{x}(S_t, a) \\ \delta_t^a &amp;\doteq R_{t+1} + \gamma_{t+1}\, \mathbf{w}_t^T \bar{\mathbf{x}}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t \\ \mathbf{z}_t &amp;\doteq \gamma_t \lambda_t \rho_t\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) \end{aligned}\] \[\begin{aligned} \text{where} \\ \bar{\mathbf{x}}_t &amp;\equiv \text{average feature vector for } S_t \text{ under the target policy} \\ \delta_t^a &amp;\equiv \text{expectation form of the TD error} \end{aligned}\] </li> </ul> <h3 id="htdlambda">HTD($\lambda$)</h3> <ul> <li>Hybrid TD($\lambda$) state-value algorithm combines aspects of GTD($\lambda$) and TD($\lambda$).</li> <li> <p>HTD($\lambda$) is a strict generalization of TD($\lambda$) to the off-policy setting, meaning it reduces exactly to TD($\lambda$) when the behavior and target policies coincide; a property GTD($\lambda$) does not share:</p> \[b(A_t \vert S_t) = \pi(A_t \vert S_t), \quad \rho_t = 1 \implies \text{HTD}(\lambda) = \text{TD}(\lambda)\] </li> <li> <p>HTD($\lambda$) is defined by:</p> \[\begin{aligned} \mathbf{w}_{t+1} &amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t + \alpha\!\left[\!\left(\mathbf{z}_t - \mathbf{z}_t^b\right)^T \mathbf{v}_t\right]\!\left(\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1}\right) \\ \mathbf{v}_{t+1} &amp;\doteq \mathbf{v}_t + \beta\, \delta_t^s\, \mathbf{z}_t - \beta\!\left(\mathbf{z}_t^T \mathbf{v}_t\right)\!\left(\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1}\right), \quad &amp; \mathbf{v}_0 \doteq \mathbf{0} \\ \mathbf{z}_t &amp;\doteq \rho_t \!\left(\gamma_t \lambda_t\, \mathbf{z}_{t-1} + \mathbf{x}_t\right), \quad &amp; \mathbf{z}_{-1} \doteq \mathbf{0} \\ \mathbf{z}_t^b &amp;\doteq \gamma_t \lambda_t\, \mathbf{z}_{t-1}^b + \mathbf{x}_t, \quad &amp; \mathbf{z}_{-1}^b \doteq \mathbf{0} \end{aligned}\] </li> <li>We get <ul> <li>a 2nd 
set of weights, $\mathbf{v}_t$.</li> <li>a 2nd set of eligibility traces, $\mathbf{z}_t^b$, <strong>conventional accumulating traces</strong> for the behavior policy.</li> </ul> \[\begin{aligned} \mathbf{z}_t^b = \mathbf{z}_t \text{ if all } \rho_t = 1 &amp;\implies \left(\mathbf{z}_t - \mathbf{z}_t^b\right)^T = \mathbf{0} \\ &amp;\implies \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t \quad \text{(TD(}\lambda\text{))} \end{aligned}\] </li> </ul> <h3 id="emphatic-tdlambda">Emphatic TD($\lambda$)</h3> <ul> <li>Extension of one-step Emphatic TD (Sections 9.11 &amp; 11.8) to eligibility traces.</li> <li>The resulting algorithm: <ul> <li>(+) retains strong off-policy convergence guarantees</li> <li>(+) enables any degree of bootstrapping</li> <li>(-) has high variance</li> <li>(-) potentially slow convergence.</li> </ul> </li> <li> <p>Emphatic TD($\lambda$) is defined by:</p> \[\begin{aligned} \mathbf{w}_{t+1} &amp;\doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t \\ \delta_t &amp;\doteq R_{t+1} + \gamma_{t+1}\, \mathbf{w}_t^T \mathbf{x}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t \\ \mathbf{z}_t &amp;\doteq \rho_t \!\left(\gamma_t \lambda_t\, \mathbf{z}_{t-1} + M_t \mathbf{x}_t\right), \quad &amp; \mathbf{z}_{-1} \doteq \mathbf{0} \\ M_t &amp;\doteq \lambda_t \mathcal{I}_t + (1 - \lambda_t) F_t \\ F_t &amp;\doteq \rho_{t-1} \gamma_t F_{t-1} + \mathcal{I}_t, \quad &amp; F_0 \doteq \mathcal{I}_0 \end{aligned}\] \[\begin{aligned} \text{where} \\ M_t &amp;\geq 0 \equiv \text{emphasis} \\ F_t &amp;\geq 0 \equiv \text{followon trace} \\ \mathcal{I}_t &amp;\geq 0 \equiv \text{interest} \end{aligned}\] </li> <li>In the on-policy case ($\rho_t = 1$ for all $t$), Emphatic TD($\lambda$) is similar to conventional TD($\lambda$), but still significantly different: <ul> <li>Emphatic TD($\lambda$) is guaranteed to converge for all state-dependent $\lambda$ functions; TD($\lambda$) is not (TD($\lambda$) is guaranteed only for constant $\lambda$).</li> <li>See
Yu’s counterexample <strong>[Ghiassian, Rafiee &amp; Sutton, 2016].</strong></li> </ul> </li> </ul> <hr /> <h2 id="1212-implementation-issues">12.12 Implementation Issues</h2> <ul> <li><strong>Naive implementation seems expensive:</strong> Updating eligibility traces for every state at every time step appears computationally costly on serial computers.</li> <li><strong>Practical optimization:</strong> Most ET are nearly 0; only recently visited states have significant traces, so implementations can track and update only these few states.</li> <li><strong>Computational cost:</strong> With this optimization, tabular methods with traces are only a few times more expensive than one-step methods.</li> <li><strong>Function approximation reduces overhead:</strong> When using neural networks, ET typically only double memory and computation per step (much less overhead than in the tabular case).</li> <li><strong>Tabular is the worst case:</strong> The tabular setting represents the highest computational complexity for ET relative to simpler methods.</li> </ul> <hr /> <h2 id="1213-conclusions">12.13 Conclusions</h2> <ul> <li><strong>Eligibility traces</strong> provide an efficient, incremental way to interpolate between TD and MC methods.</li> <li>ET offer advantages over $n$-step methods in terms of <strong>generality and computational trade-offs.</strong></li> <li>Empirically, <strong>an intermediate mix works best:</strong> ET should move towards MC but not all the way, since pure MC performance degrades sharply.</li> <li>ET are the <strong>first defense against long-delayed rewards and non-Markov tasks,</strong> used with TD methods to make them behave more like MC methods without full bootstrapping.</li> <li>Use traces <strong>when data is scarce and online learning is required,</strong> as they provide faster learning per sample despite higher computational cost per step.</li> <li><strong>Avoid traces in offline settings</strong> with cheap, abundant data (maximum data
processing speed matters more than learning efficiency per sample).</li> <li><strong>True online methods</strong> achieve ideal $\lambda$-return performance while maintaining $O(d)$ computational efficiency.</li> <li>Forward-to-backward view derivations provide <strong>computationally efficient, mechanistic,</strong> practical implementations of theory.</li> </ul> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026RLsuttonBartoCh12notes</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Sutton &amp; Barto, Ch. 12: Eligibility Traces (Personal Notes)"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Mar"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/03/13/rl-sutton-barto-notes-ch012/"</span> <span class="p">}</span> </code></pre></div></div> <hr /> Fri, 13 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/13/rl-sutton-barto-notes-ch012/ https://chizkidd.github.io//2026/03/13/rl-sutton-barto-notes-ch012/ Sutton & Barto, Ch. 11: Off-Policy Methods with Approximation (Personal Notes) <ul> <li>Let’s discuss the extension of off-policy methods from the tabular case (Ch. 
6 &amp; 7) to function approximation.</li> <li>We’ll explore the convergence problems, the theory of linear function approximation, the notion of learnability, and off-policy algorithms with stronger convergence guarantees.</li> <li>Off-policy learning with function approximation has 2 challenges: <ol> <li>Finding the target of the update.</li> <li>The off-policy distribution of updates does not match the on-policy distribution.</li> </ol> </li> </ul> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ul> <li><a href="#111-semi-gradient-methods">11.1 Semi-gradient Methods</a></li> <li><a href="#112-examples-of-off-policy-divergence">11.2 Examples of Off-Policy Divergence</a></li> <li><a href="#113-the-deadly-triad">11.3 The Deadly Triad</a></li> <li><a href="#114-linear-value-function-geometry">11.4 Linear Value-Function Geometry</a></li> <li><a href="#115-gradient-descent-in-the-bellman-error">11.5 Gradient Descent in the Bellman Error</a></li> <li><a href="#116-the-bellman-error-is-not-learnable">11.6 The Bellman Error is Not Learnable</a></li> <li><a href="#117-gradient-td-methods">11.7 Gradient-TD Methods</a></li> <li><a href="#118-emphatic-td-methods">11.8 Emphatic-TD Methods</a></li> <li><a href="#119-reducing-variance">11.9 Reducing Variance</a></li> <li><a href="#1110-summary">11.10 Summary</a></li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li><a href="#citation">Citation</a></li> </ul> <hr /> <h2 id="111-semi-gradient-methods">11.1 Semi-gradient Methods</h2> <ul> <li>Let’s discuss the extension of previous off-policy methods to function approximation as semi-gradient methods.</li> <li> <p>This is how we find the update target (or change it) to address the first challenge.</p> </li> <li>Recall the semi-gradient update:</li> </ul> \[\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\!\left[U_t - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)\] \[U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)\] <ul> <li>In the tabular case, we update the
array ($V$ or $Q$), but now we update the weight vector $\mathbf{w}$.</li> <li>Many off-policy algorithms use the per-step importance sampling ratio:</li> </ul> \[\rho_t \doteq \rho_{t:t} = \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)}\] <ul> <li>The off-policy, semi-gradient <strong>TD(0)</strong> is same as that of the on-policy TD(0) except for the addition of the $\rho_t$ term:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)}\] \[\begin{align*} \text{(episodic)} \quad \delta_t &amp;= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\ \text{(continuing)} \quad \delta_t &amp;= R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \end{align*}\] <ul> <li>For action values, the off-policy, semi-gradient <strong>Expected Sarsa</strong> update rule is (no importance sampling):</li> </ul> \[\boxed{\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\] \[\begin{align*} \text{(episodic)} \quad \delta_t &amp;= R_{t+1} + \gamma \sum_a \pi(a \vert S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \\ \text{(continuing)} \quad \delta_t &amp;= R_{t+1} - \bar{R}_t + \sum_a \pi(a \vert S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \end{align*}\] <ul> <li>The lack of use of importance sampling in Expected Sarsa is an unclear choice since we might want to weight different state-action pairs <strong>differently</strong> once they all contribute to the same overall approximation. This issue can only be properly resolved by more thorough understanding of the <strong>theory of function approximation</strong> in RL.</li> <li>In the multi-step generalizations of the algorithms, both the state-value and action-value algorithms involve importance sampling. 
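To make the boxed off-policy semi-gradient TD(0) update concrete, here is a minimal NumPy sketch; the linear features, step size, and policy probabilities are illustrative assumptions, not anything fixed by the text:

```python
import numpy as np

def off_policy_td0_step(w, x_t, x_tp1, r, gamma, alpha, pi_a, b_a):
    """One off-policy semi-gradient TD(0) step with linear features:
    w <- w + alpha * rho_t * delta_t * x_t  (the gradient of v_hat is x_t)."""
    rho = pi_a / b_a                          # per-step importance sampling ratio
    delta = r + gamma * w @ x_tp1 - w @ x_t   # episodic TD error
    return w + alpha * rho * delta * x_t

# Worked check against the two-state fragment of Section 11.2: features
# (1,) and (2,) give v_hat = w and 2w, R = 0, rho = 1. Each step then
# multiplies w by 1 + alpha*(2*gamma - 1), which exceeds 1 for gamma > 1/2,
# so w grows without bound.
w = np.array([1.0])
for _ in range(10):
    w = off_policy_td0_step(w, np.array([1.0]), np.array([2.0]),
                            r=0.0, gamma=0.9, alpha=0.1, pi_a=1.0, b_a=1.0)
# w is now 1.08**10, roughly 2.16
```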
For example, the off-policy, semi-gradient $\mathbf{n}$<strong>-step Sarsa</strong> update is:</li> </ul> \[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \rho_{t+1} \cdots \rho_{t+n}\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})}\] \[\begin{align*} \text{(episodic)} \quad G_{t:t+n} &amp;= R_{t+1} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}) \\ \text{(continuing)} \quad G_{t:t+n} &amp;= R_{t+1} - \bar{R}_t + \ldots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}) \end{align*}\] \[\text{where } \rho_k = 1 \hspace{0.5em} \text{ for } k \geq T \quad \text{and} \quad G_{t:t+n} = G_t \hspace{0.5em} \text{ if } t+n \geq T\] <ul> <li>The off-policy, semi-gradient $\mathbf{n}$<strong>-step backup tree</strong> (no importance sampling) algorithm is:</li> </ul> \[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})}\] \[G_{t:t+n} \doteq \hat{q}(S_t, A_t, \mathbf{w}_{t+n}) + \sum_{k=t}^{t+n-1} \delta_k \prod_{i=t+1}^{k} \gamma \pi(A_i \vert S_i)\] \[\text{where } \delta_t \text{ is the Expected Sarsa TD error defined earlier in this section.}\] <hr /> <h2 id="112-examples-of-off-policy-divergence">11.2 Examples of Off-Policy Divergence</h2> <ul> <li>Now let’s discuss the 2nd off-policy function approximation challenge.</li> <li>We’ll look at some instructive counterexamples where the semi-gradient algorithm diverges.</li> </ul> <h3 id="example-1">Example 1</h3> <p>Consider part of a larger MDP with 2 states whose estimated values are $w$ and $2w$:</p> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-2-example1-DOWNSPACED.png" alt="Counterexample" /></p> <blockquote> <p><strong>Simple Counterexample:</strong> 2-state part of an MDP.</p> </blockquote> <ul> <li>$w$ updates will diverge to infinity, since the 
transition will always look good (higher next-state estimated value than current state estimated value).</li> <li>The TD error on a transition between the 2 states is:</li> </ul> \[\begin{align*} \delta_t &amp;= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\ &amp;= 0 + \gamma \cdot 2w_t - w_t \\ &amp;= (2\gamma - 1)\, w_t \end{align*}\] <ul> <li>The off-policy, semi-gradient TD(0) update is:</li> </ul> \[\begin{align*} w_{t+1} &amp;= w_t + \alpha \rho_t\, \delta_t \nabla \hat{v}(S_t, w_t) \\ &amp;= w_t + (\alpha)(1)\!\left[(2\gamma - 1) w_t\right](1) \\ &amp;= w_t\!\left[1 + \alpha(2\gamma - 1)\right] \end{align*}\] \[\begin{aligned} \Rightarrow \quad &amp; 1 + \alpha(2\gamma - 1) &gt; 1 \\ &amp; \alpha(2\gamma - 1) &gt; 0 \\ &amp; 2\gamma - 1 &gt; 0 \\ &amp; \gamma &gt; \tfrac{1}{2} \quad \longrightarrow \quad w \to \pm\infty \end{aligned}\] <h3 id="example-2-bairds-counterexample">Example 2 (Baird’s Counterexample)</h3> <ul> <li>Now let’s look at an entire complete system with instability (divergence).</li> <li>Consider the episodic 7-state, 2-action MDP shown below.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-2-bairds-counterexample.png" alt="Baird's Counterexample" /></p> <blockquote> <p><strong>Baird’s Counterexample:</strong> Episodic 7-state, 2-action MDP.</p> </blockquote> <ul> <li><strong>Assumptions/knowns:</strong> <ul> <li>$b(\text{dashed}\,\vert\,\cdot) = 6/7$</li> <li>$b(\text{solid}\,\vert\,\cdot) = 1/7$</li> <li>$\pi(a\,\vert\,\cdot) = \pi(\text{solid}\,\vert\,\cdot) = 1$</li> <li>$R = 0$ (on all transitions)</li> <li>$\gamma = 0.99$</li> <li>The state values are estimated via linear parametrization.</li> </ul> </li> <li>The estimated value of the leftmost state is $2w_1 + w_8$, which corresponds to a feature vector for the 1st state being:</li> </ul> \[\mathbf{x}(1) = (2, 0, 0, 0, 0, 0, 0, 1)^T\] \[R = 0 \quad \therefore\quad v_\pi(s) = 0 \; \forall s, \text{ which can be exactly approximated 
if } \mathbf{w} = \mathbf{0}\] <ul> <li>Since there are 8 components of the weight vector (more than the 7 non-terminal states), there exist many solutions.</li> <li>Applying semi-gradient TD(0) to this problem will cause the weights to diverge to infinity. This also applies for the dynamic programming (DP) case.</li> <li>The semi-gradient DP update is:</li> </ul> \[\mathbf{w}_{k+1} \doteq \mathbf{w}_k + \frac{\alpha}{\vert S \vert} \sum_s \left(\mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_k) \mid S_t = s\right] - \hat{v}(s, \mathbf{w}_k)\right) \nabla \hat{v}(s, \mathbf{w}_k)\] <ul> <li>This example shows that even the simplest combination of bootstrapping and function approximation can be unstable in the off-policy case. <ul> <li><u>Simplest bootstrapping</u>: DP and TD.</li> <li><u>Simplest function approximation</u>: linear, semi-gradient descent method.</li> </ul> </li> </ul> <h3 id="example-3-tsitsiklis--van-roys-counterexample">Example 3 (Tsitsiklis &amp; Van Roy’s Counterexample)</h3> <p>This extends Example 1 with a terminal state and $R = 0$:</p> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-2-tsitsiklis-van-roy-counterexample.png" alt="Tsitsiklis &amp; Van Roy's Counterexample" /></p> <blockquote> <p><strong>Tsitsiklis &amp; Van Roy’s Counterexample:</strong> Extension of Example 1 with probability $\varepsilon$ of transitioning to the terminal state (shaded).</p> </blockquote> <ul> <li>Let’s find $w_{k+1}$ at each step that minimizes the $\overline{\text{VE}}$ between the estimated value and the expected one-step return:</li> </ul> \[\begin{align*} w_{k+1} &amp;= \arg\min_{w \in \mathbb{R}} \sum_{s \in S} \left(\hat{v}(s, w) - \mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, w_k) \mid S_t = s\right]\right)^2 \\[6pt] &amp;= \arg\min_{w \in \mathbb{R}} \left(w - \gamma \cdot 2w_k\right)^2 + \left(2w - (1 - \varepsilon)\gamma \cdot 2w_k\right)^2 \\[6pt] &amp;= \left(\frac{6 - 4\varepsilon}{5}\right) \gamma w_k 
\end{align*}\] <ul> <li>The sequence $\{w_k\}$ diverges when $\gamma &gt; \dfrac{5}{6 - 4\varepsilon}$ and $w_0 \neq 0$.</li> </ul> <h3 id="takeaways">Takeaways</h3> <ul> <li>Instability can be prevented by using special methods for function approximation.</li> <li>These special methods guarantee stability because they do not extrapolate from the observed targets. They are called <strong>averagers</strong>.</li> <li>Averagers include: <ol> <li>Nearest neighbor methods</li> <li>Locally weighted regression</li> </ol> Popular methods such as tile coding and artificial neural networks (ANNs) extrapolate beyond the observed targets and are therefore <strong>not</strong> averagers.</li> </ul> <hr /> <h2 id="113-the-deadly-triad">11.3 The Deadly Triad</h2> <ul> <li>The danger of instability and divergence arises when we combine these 3 elements, which make up the <strong>deadly triad</strong>: <ul> <li>Function approximation</li> <li>Bootstrapping</li> <li>Off-policy training</li> </ul> </li> <li>Instability can be avoided if one of the elements is absent: <ul> <li>Function approximation cannot be given up (it is needed for large-scale problems).</li> <li>Bootstrapping can be given up, but at the cost of computational and data efficiency.</li> <li>Off-policy training can be given up (replace Q-learning with Sarsa).</li> <li>There is no perfect solution, as we still need off-policy learning for planning and parallel learning.</li> </ul> </li> </ul> <hr /> <h2 id="114-linear-value-function-geometry">11.4 Linear Value-Function Geometry</h2> <ul> <li>To better understand the stability challenge of off-policy learning, let’s think about value-function approximation <strong>more abstractly and independently</strong> of how learning is done.</li> <li>Let’s consider the case with 3 states $S = \{s_1, s_2, s_3\}$ and 2 parameters $\mathbf{w} = (w_1, w_2)^T$.
<ul> <li>All value functions exist in a 3-D space, however the parameters provide a 2-D subspace.</li> <li>Any weight vector $\mathbf{w} = (w_1, w_2)^T$ is a point in the 2-D subspace and thus also a complete value function $v_\mathbf{w}$ that assigns values to all 3 states.</li> <li>In linear value-function approximation, the subspace is a simple plane.</li> </ul> </li> <li>How do we represent $v_\pi$ in the $d$-dimensional space? <ul> <li>We need to perform a projection operation.</li> <li>TD methods present other solutions.</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-4-linear-value-func-approx-geometry-DOWNSPACED.png" alt="Linear value-func. approx. geometry" /></p> <!-- >**The Geometry of Linear Value-Function Approximation:** Shown is the 3D space of all value functions over three states, while shown as a plane is the subspace of all value functions representable by a linear function approximator with parameter $\mathbf{w} = (w_1, w_2)^T$. The true value function $v_\pi$ is in the larger space and can be projected down (into the subspace, using a projection operator $\Pi$) to its best approximation in the value error ($\text{VE}$) sense. The best approximators in the Bellman error ($\text{BE}$), projected Bellman error ($\text{PBE}$), and temporal difference error ($\text{TDE}$) senses are all potentially different and are shown in the lower right. --> <div class="callout callout--note"> <div class="callout__title"> <strong>The Geometry of Linear Value-Function Approximation</strong> </div> <div class="callout__body"> <p>Shown is the 3D space of all value functions over three states, while shown as a plane is the subspace of all value functions representable by a linear function approximator with parameter $\mathbf{w} = (w_1, w_2)^T$. 
The true value function $v_\pi$ is in the larger space and can be projected down (into the subspace, using a projection operator $\Pi$) to its best approximation in the value error ($\text{VE}$) sense. The best approximators in the Bellman error ($\text{BE}$), projected Bellman error ($\text{PBE}$), and temporal difference error ($\text{TDE}$) senses are all potentially different and are shown in the lower right.</p> </div> </div> <h3 id="projection-operation">Projection Operation</h3> <ul> <li>For the projection operation, the distance between value functions using the norm is:</li> </ul> \[\begin{align*} \lVert v \rVert_\mu^2 &amp;\doteq \sum_{s \in S} \mu(s)\, v(s)^2 \\[6pt] \overline{\text{VE}}(\mathbf{w}) &amp;= \lVert v_\mathbf{w} - v_\pi \rVert_\mu^2 \\[6pt] \Pi\, v &amp;\doteq v_\mathbf{w} \quad \\[6pt] \text{where } \mathbf{w} = \arg\min_{\mathbf{w} \in \mathbb{R}^d} \lVert v - v_\mathbf{w} \rVert_\mu^2 &amp; \hspace{0.8em} \text{and} \hspace{0.5em} \Pi \equiv \text{projection operator} \end{align*}\] <ul> <li> <p>The representable value function that is closest to the true value function $V_\pi$ is its projection $\Pi V_\pi$ (MC method asymptotic solution).</p> </li> <li> <p><strong>Projection matrix</strong>: with $\mathbf{D} \equiv \vert S \vert \times \vert S \vert$ diagonal matrix with $\mu(s)$ on the diagonal and $\mathbf{X} \equiv \vert S \vert \times d$ matrix whose rows are the feature vectors $\mathbf{x}(s)^T$:</p> \[\Pi \doteq \mathbf{X}\!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\] <p>If the inverse does not exist, the pseudoinverse is substituted. 
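For a small example the projection matrix can be computed directly. A minimal NumPy sketch (the three-state distribution $\mu$ and the feature matrix $\mathbf{X}$ below are made-up numbers for illustration):

```python
import numpy as np

# Made-up example: 3 states, d = 2 linear features.
mu = np.array([0.5, 0.3, 0.2])      # state weighting (diagonal of D)
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])          # rows are the feature vectors x(s)^T
D = np.diag(mu)

# Pi = X (X^T D X)^{-1} X^T D, using the pseudoinverse for safety.
Pi = X @ np.linalg.pinv(X.T @ D @ X) @ X.T @ D

v = np.array([3.0, -1.0, 2.0])      # an arbitrary value function
v_proj = Pi @ v                     # nearest representable value fn in the mu-norm

# Sanity checks: projecting twice changes nothing, and a value function
# already in the subspace (v = X w) is left untouched.
assert np.allclose(Pi @ v_proj, v_proj)
assert np.allclose(Pi @ (X @ np.array([0.3, -0.7])), X @ np.array([0.3, -0.7]))
```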
Using these matrices, the squared norm of a vector can be written as:</p> \[\lVert v \rVert_\mu^2 = v^T \mathbf{D}\, v\] <p>and the approximate linear value function written as:</p> \[v_\mathbf{w} = \mathbf{X}\mathbf{w}\] </li> </ul> <h3 id="td-solutions">TD Solutions</h3> <p><strong>Bellman Error</strong></p> <ul> <li>The true value function $v_\pi$ solves the Bellman equation exactly.</li> <li>The <strong>Bellman error</strong> shows how far off $v_\mathbf{w}$ is from $v_\pi$. The Bellman error at state $s$ is:</li> </ul> \[\begin{align*} \bar{\delta}_\mathbf{w}(s) &amp;\doteq \left(\sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a)\!\left[r + \gamma v_\mathbf{w}(s')\right]\right) - v_\mathbf{w}(s) \\ &amp;= \mathbb{E}_\pi\!\left[R_{t+1} + \gamma v_\mathbf{w}(S_{t+1}) - v_\mathbf{w}(S_t) \mid S_t = s, A_t \sim \pi\right] \end{align*}\] <ul> <li>The Bellman error is the expectation of the TD error.</li> <li>The vector of all the Bellman errors, at all states, $\bar{\delta}_\mathbf{w} \in \mathbb{R}^{\vert S \vert}$, is called the <strong>Bellman error vector</strong> ($\text{BE}$).</li> <li>The overall size of $\text{BE}$ is the <strong>Mean Squared Bellman Error</strong>, $\overline{\text{BE}}$:</li> </ul> \[\overline{\text{BE}}(\mathbf{w}) = \lVert \bar{\delta}_\mathbf{w} \rVert_\mu^2\] <ul> <li>The <strong>Bellman operator</strong> $B_\pi : \mathbb{R}^{\vert S \vert} \to \mathbb{R}^{\vert S \vert}$ is defined by:</li> </ul> \[\begin{align*} (B_\pi v)(s) &amp;\doteq \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a)\!\left[r + \gamma v(s')\right], \quad \forall s \in S \text{ and } v : S \to \mathbb{R} \\[6pt] \bar{\delta}_\mathbf{w} &amp;= B_\pi v_\mathbf{w} - v_\mathbf{w} \\[6pt] v_\pi &amp;= B_\pi v_\pi \end{align*}\] <ul> <li>The projection of the Bellman error vector back into the representable space creates the <strong>Projected Bellman Error $(\text{PBE})$</strong> vector:</li> </ul> \[\text{PBE} = \Pi\, \bar{\delta}_\mathbf{w}\] <ul> <li>The size 
of $\text{PBE}$, in the norm, is another measure of error in the approximate value function, called the <strong>Mean Square Projected Bellman Error</strong>, $\overline{\text{PBE}}$:</li> </ul> \[\overline{\text{PBE}}(\mathbf{w}) = \lVert \Pi\, \bar{\delta}_\mathbf{w} \rVert_\mu^2\] <ul> <li>With linear function approximation, there always exists an approximate value function (within the subspace) with zero $\overline{\text{PBE}}$; this is the TD fixed point $\mathbf{w}_\text{TD}$.</li> </ul> <hr /> <h2 id="115-gradient-descent-in-the-bellman-error">11.5 Gradient Descent in the Bellman Error</h2> <ul> <li>Let’s apply the SGD approach to the challenge of stability in off-policy learning.</li> </ul> <h3 id="td-error-naive-residual-gradient-algorithm">TD Error (Naive Residual-Gradient Algorithm)</h3> <ul> <li>Let’s take the minimization of the expected square of the one-step TD error, the TD(0) error, as the objective:</li> </ul> \[\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\] <ul> <li>Using the TD error, we can define the Mean Squared TD error objective function, $\overline{\text{TDE}}$:</li> </ul> \[\begin{align*} \overline{\text{TDE}}(\mathbf{w}) &amp;= \sum_{s \in S} \mu(s)\, \mathbb{E}\!\left[\delta_t^2 \mid S_t = s, A_t \sim \pi\right] \\ &amp;= \sum_{s \in S} \mu(s)\, \mathbb{E}\!\left[\rho_t\, \delta_t^2 \mid S_t = s, A_t \sim b\right] \\ &amp;= \mathbb{E}_b\!\left[\rho_t\, \delta_t^2\right] \end{align*}\] <ul> <li>Following the standard SGD approach, the per-step update based on a sample of this expected value is:</li> </ul> \[\begin{align*} \mathbf{w}_{t+1} &amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\rho_t\, \delta_t^2\right) \\ &amp;= \mathbf{w}_t - \alpha \rho_t\, \delta_t \nabla \delta_t \\ &amp;= \mathbf{w}_t + \alpha \rho_t\, \delta_t\!\left(\nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)\right) \end{align*}\] <ul> <li>This is the same as the semi-gradient TD algorithm except
for the additional final term.</li> <li>This method is <strong>naive</strong> because it achieves temporal smoothing-like behavior rather than accurate prediction by penalizing all TD errors.</li> </ul> <h3 id="bellman-error-residual-gradient-algorithm">Bellman Error (Residual-Gradient Algorithm)</h3> <ul> <li>Consider the minimization of the Bellman error (if the exact values are learned, the Bellman error is zero everywhere).</li> <li> <p>This yields the <strong>residual gradient algorithm</strong>:</p> \[\begin{align*} \mathbf{w}_{t+1} &amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\mathbb{E}_\pi\!\left[\delta_t\right]^2\right) \\ &amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\mathbb{E}_b\!\left[\rho_t\, \delta_t\right]^2\right) \\ &amp;= \mathbf{w}_t - \alpha\, \mathbb{E}_b\!\left[\rho_t\, \delta_t\right] \nabla \mathbb{E}_b\!\left[\rho_t\, \delta_t\right] \\ &amp;= \mathbf{w}_t - \alpha\, \mathbb{E}_b\!\left[\rho_t\!\left(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right)\right] \mathbb{E}_b\!\left[\rho_t \nabla \delta_t\right] \\ &amp;= \mathbf{w}_t + \alpha\!\left[\mathbb{E}_b\!\left[\rho_t\!\left(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\right)\right] - \hat{v}(S_t, \mathbf{w})\right]\!\left[\nabla \hat{v}(S_t, \mathbf{w}) - \gamma\, \mathbb{E}_b\!\left[\rho_t \nabla \hat{v}(S_{t+1}, \mathbf{w})\right]\right] \end{align*}\] </li> <li>Two ways to make the residual-gradient algorithm work: <ul> <li>In the case of deterministic environments.</li> <li>Obtain 2 independent samples of the next state $S_{t+1}$ from $S_t$.</li> </ul> </li> <li> <p>In both ways above, the algorithm is guaranteed to converge to a minimum of the $\overline{\text{BE}}$ under the usual conditions on the step-size parameter.</p> </li> <li>However, there are at least 3 ways in which the convergence of the residual-gradient algorithm is unsatisfactory: <ul> <li>Very slow.</li> <li>Converges to the wrong values.</li> <li>A problem with the 
$\overline{\text{BE}}$ objective covered in the next section.</li> </ul> </li> </ul> <hr /> <h2 id="116-the-bellman-error-is-not-learnable">11.6 The Bellman Error is Not Learnable</h2> <ul> <li>The Bellman error is not learnable from the observed sequence of feature vectors, actions, and rewards.</li> <li>Since the Bellman error objective cannot be learned from the observable data, this is the strongest reason not to seek it.</li> <li>Examples of non-learnable Markov Reward Processes (MRPs):</li> </ul> <h3 id="example-1-1">Example 1</h3> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-6-example1.png" alt="VE learnability Counterexample" /></p> <blockquote> <p><strong>Value Error (VE) Learnability Counterexample:</strong> Deterministic MRP pair with an endless stream of $0$s and $2$s</p> </blockquote> <ul> <li>These MRPs have a deterministic reward with observable data of an endless stream of 0s and 2s.</li> <li>We cannot learn if the MRP has one state or two, or is stochastic or deterministic.</li> <li>The pair of MRPs shows that the $\overline{\text{VE}}$ objective is not learnable:</li> </ul> \[\overline{\text{VE}}(\mathbf{w}) \doteq \sum_{s \in S} \mu(s)\!\left[v_\pi(s) - \hat{v}(s, \mathbf{w})\right]^2\] <ul> <li>The $\overline{\text{VE}}$ is not learnable, but the parameter that optimizes it is!</li> <li>We introduce a learnable natural objective function that is always observable. This is the error between the value estimate at each time and the return from that time, called the <strong>return error</strong>. 
The <strong>Mean Square Return Error</strong> $(\overline{\text{RE}})$ is the expectation, under $\mu$, of the square of this return error.</li> <li>$\overline{\text{RE}}$ in the on-policy case is:</li> </ul> \[\begin{align*} \overline{\text{RE}}(\mathbf{w}) &amp;= \mathbb{E}\!\left[\left(G_t - \hat{v}(S_t, \mathbf{w})\right)^2\right] \\ &amp;= \overline{\text{VE}}(\mathbf{w}) + \mathbb{E}\!\left[\left(G_t - v_\pi(S_t)\right)^2\right] \end{align*}\] <ul> <li>The $\overline{\text{BE}}$ can be computed from knowledge of the MDP but is not learnable from data, and its minimum solution is not learnable.</li> </ul> <h3 id="example-2">Example 2</h3> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-6-example2.png" alt="BE learnability Counterexample" /></p> <blockquote> <p><strong>Bellman Error (BE) Learnability Counterexample:</strong> Deterministic MRP pair with the same data distribution but different minimizing parameter vectors</p> </blockquote> <ul> <li>The example above serves as a counterexample to the learnability of the Bellman error.</li> <li>The 2 MRPs generate the same data distribution but have different minimizing parameter vectors, proving that the optimal parameter vector is not a function of the data and thus cannot be learned from it.</li> <li>Other bootstrapping objectives, like $\overline{\text{PBE}}$ and $\overline{\text{TDE}}$, are learnable from data and yield optimal solutions different from each other and that of $\overline{\text{BE}}$.</li> <li>$\overline{\text{BE}}$ is limited to model-based settings, therefore $\overline{\text{PBE}}$ is preferred.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch11-11-6-causal-relationships-mdps-datadistr-errors.png" alt="MDPs-data distribution-objectives causal relationships" /></p> <div class="callout callout--note"> <div class="callout__title"> <strong>Causal
Relationships among the data distribution, MDPs &amp; various objectives</strong> </div> <div class="callout__body"> <p><strong>Left, Monte Carlo objectives:</strong> Two different MDPs can produce the same data distribution yet also produce different $\overline{\text{VE}}$s, proving that the $\overline{\text{VE}}$ objective cannot be determined from data and is not learnable. However, all such $\overline{\text{VE}}$s must have the same optimal parameter vector, $\mathbf{w}^{*}$! Moreover, this same $\mathbf{w}^{*}$ can be determined from another objective, the $\overline{\text{RE}}$, which is uniquely determined from the data distribution. Thus $\mathbf{w}^{*}$ and the $\overline{\text{RE}}$ are learnable even though the $\overline{\text{VE}}$s are not. <br /> <strong>Right, Bootstrapping objectives:</strong> Two different MDPs can produce the same data distribution yet also produce different $\overline{\text{BE}}$s <em>and</em> have different minimizing parameter vectors; these are not learnable from the data distribution. 
The $\overline{\text{PBE}}$ and $\overline{\text{TDE}}$ objectives and their (different) minima can be directly determined from data and thus are learnable.</p> </div> </div> <hr /> <h2 id="117-gradient-td-methods">11.7 Gradient-TD Methods</h2> <ul> <li>Let’s consider SGD methods for minimizing the $\overline{\text{PBE}}$.</li> <li>True SGD methods, <strong>Gradient-TD methods</strong>, have robust convergence properties even under off-policy training and nonlinear function approximation.</li> <li>In the linear case, there exists an exact solution, the TD fixed point $\mathbf{w}_\text{TD}$, at which the $\overline{\text{PBE}}$ is zero.</li> <li>This solution via <strong>least-squares</strong> methods yields a $O(d^2)$ complexity; however, we want an SGD method with $O(d)$ that converges robustly.</li> <li> <p>Let’s derive an SGD method for the $\overline{\text{PBE}}$ assuming linear function approximation:</p> \[\begin{align*} \overline{\text{PBE}}(\mathbf{w}) &amp;= \lVert \Pi\, \bar{\delta}_\mathbf{w} \rVert_\mu^2 \\ &amp;= \left(\Pi\, \bar{\delta}_\mathbf{w}\right)^T \mathbf{D}\, \Pi\, \bar{\delta}_\mathbf{w} \\ &amp;= \bar{\delta}_\mathbf{w}^T \Pi^T \mathbf{D}\, \Pi\, \bar{\delta}_\mathbf{w} \\ &amp;= \bar{\delta}_\mathbf{w}^T \mathbf{D} \mathbf{X}\!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w} \\ &amp;= \left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right)^T \!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \!\left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right) \end{align*}\] <p>$\quad \left(\text{using } \Pi = \mathbf{X}!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D} \text{ and the identity } \Pi^T \mathbf{D} \Pi = \mathbf{D} \mathbf{X}!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\right)$</p> </li> <li>The gradient of the $\overline{\text{PBE}}$ w.r.t $\mathbf{w}$ is:</li> </ul> \[\nabla 
\overline{\text{PBE}}(\mathbf{w}) = 2\, \nabla\!\left[\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right]^T \!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \!\left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right)\] <ul> <li>Let’s turn this into an SGD method via converting the 3 factors above into <strong>expectations</strong> under this distribution:</li> </ul> \[\begin{aligned} \mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w} &amp;= \sum_s \mu(s)\, \mathbf{x}(s)\, \bar{\delta}_\mathbf{w}(s) = \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\[6pt] \nabla\!\left[\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right] &amp;= \nabla \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\ &amp;= \mathbb{E}\!\left[\rho_t \nabla \delta_t^T\, \mathbf{x}_t^T\right] \\ &amp;= \mathbb{E}\!\left[\rho_t \nabla\!\left(R_{t+1} + \gamma \mathbf{w}^T \mathbf{x}_{t+1} - \mathbf{w}^T \mathbf{x}_t\right) \mathbf{x}_t^T\right] \\ &amp;= \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \\[6pt] \mathbf{X}^T \mathbf{D} \mathbf{X} &amp;= \sum_s \mu(s)\, \mathbf{x}_s\, \mathbf{x}_s^T = \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right] \end{aligned}\] <p>Substituting these expectations for the three factors into $\nabla \overline{\text{PBE}}$:</p> \[\nabla \overline{\text{PBE}}(\mathbf{w}) = 2\, \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\] <ul> <li>The 1st and last terms are not independent (<strong>biased gradient estimate</strong>).</li> <li>Could estimate all 3 terms separately and combine (<strong>unbiased gradient estimate</strong>) but too computationally expensive.</li> </ul> <h3 id="gradient-td">Gradient-TD</h3> <ul> <li>Estimate and store the product of the last 2 terms of $\nabla \overline{\text{PBE}}(\mathbf{w})$ 
(product of a $d \times d$ matrix and a $d$-vector yields a $d$-vector like $\mathbf{w}$ itself):</li> </ul> \[\mathbf{v} \approx \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\] <ul> <li>In linear supervised learning, this is the solution to the linear least-squares problem of approximating $\rho_t\, \delta_t$ from the features.</li> <li>The standard SGD method for incrementally finding the $\mathbf{v}$ that minimizes the expected squared error $\left(\mathbf{v}^T \mathbf{x}_t - \rho_t\, \delta_t\right)^2$ is known as the <strong>Least Mean Square (LMS)</strong> rule (here with an added importance sampling ratio):</li> </ul> \[\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta \rho_t\!\left(\delta_t - \mathbf{v}_t^T \mathbf{x}_t\right) \mathbf{x}_t\] \[\begin{aligned} \text{where} \\ \beta &amp;&gt; 0 \equiv \text{another step-size parameter} \\ \rho_t &amp;\equiv \text{importance sampling ratio} \\ O(d) &amp;\equiv \text{storage \&amp; per-step computational complexity} \end{aligned}\] <h3 id="gtd2">GTD2</h3> <ul> <li> <p>With $\mathbf{v}_t$, we can update $\mathbf{w}_t$ using SGD:</p> \[\begin{align*} \mathbf{w}_{t+1} &amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla \overline{\text{PBE}}(\mathbf{w}_t) \\ &amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha\!\left(2\, \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\right) \\ &amp;= \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\ &amp;\approx \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbf{v}_t \\ &amp;\approx \mathbf{w}_t + \alpha \rho_t\!\left(\mathbf{x}_t - \gamma
\mathbf{x}_{t+1}\right) \mathbf{x}_t^T \mathbf{v}_t \end{align*}\] <p>where $O(d)$ per-step computational complexity of $(\mathbf{x}_t^T \mathbf{v}_t)$ is done first.</p> </li> </ul> <h3 id="td0-with-gradient-correction-gtd0-or-tdc">TD(0) with Gradient Correction (GTD(0) or TDC)</h3> <ul> <li> <p>Let’s look at another analytical algorithm called TD(0) with gradient correction, <strong>TDC</strong>:</p> \[\begin{align*} \mathbf{w}_{t+1} &amp;= \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\ &amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\rho_t\, \mathbf{x}_t\, \mathbf{x}_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right]\right) \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\ &amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right]\right) \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\ &amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \rho_t\, \delta_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\right) \\ &amp;\approx \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \rho_t\, \delta_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right] \mathbf{v}_t\right) \\ &amp;\approx \mathbf{w}_t + \alpha \rho_t\!\left(\delta_t\, \mathbf{x}_t - \gamma \mathbf{x}_{t+1}\, \mathbf{x}_t^T \mathbf{v}_t\right) \end{align*}\] <p>with $O(d)$ complexity if the final product 
$(\mathbf{x}_t^T \mathbf{v}_t)$ is done first.</li> </ul> <h3 id="takeaways-1">Takeaways</h3> <ul> <li>GTD2 and TDC both involve 2 learning processes: a primary one for $\mathbf{w}$ and a secondary one for $\mathbf{v}$.</li> <li>Asymmetrical dependence ($\mathbf{w}$ depends on $\mathbf{v}$ but $\mathbf{v}$ does not depend on $\mathbf{w}$) is referred to as a <strong>cascade</strong>.</li> <li>Gradient-TD methods are the most well-understood and widely used stable off-policy methods.</li> <li>GTD methods have been extended to: <ol> <li>Action values and control: <strong>GQ [Maei et al., 2010]</strong></li> <li>Eligibility traces: <strong>GTD($\lambda$), GQ($\lambda$) [Maei, 2011; Maei &amp; Sutton, 2010]</strong></li> <li>Nonlinear function approximation <strong>[Maei et al., 2009]</strong></li> </ol> </li> <li>Hybrid algorithms include: <ol> <li>Midway between semi-gradient TD and gradient TD <strong>[Hackman, 2012; White &amp; White, 2016]</strong></li> <li>GTD + proximal methods &amp; control variates <strong>[Mahadevan et al., 2014; Du et al., 2017]</strong></li> </ol> </li> </ul> <hr /> <h2 id="118-emphatic-td-methods">11.8 Emphatic-TD Methods</h2> <ul> <li>Let’s explore a major strategy for obtaining a cheap and efficient off-policy learning method with function approximation.</li> <li>Recall that linear semi-gradient TD methods are stable when trained under the on-policy distribution.</li> <li>The match between the on-policy state distribution $\mu_\pi$ and the state-transition probabilities $p(s' \vert s, a)$ under the target policy does not exist in off-policy learning.</li> <li><strong>Mismatch Fix</strong>: <ul> <li>Re-weight the states, emphasizing some and de-emphasizing others, so as to return the distribution of the updates to the on-policy distribution.</li> <li>Then there would be a match, and convergence and stability would be achieved.
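This re-weighting can be sketched as an update whose step is scaled by an "emphasis" $M_t$ driven by an "interest" $\mathcal{I}_t$, as in one-step emphatic TD(0). A minimal NumPy sketch; the features, constant interest, and step size are illustrative assumptions:

```python
import numpy as np

def emphatic_td0_step(w, M_prev, x_t, x_tp1, r, gamma, alpha,
                      rho_t, rho_prev, interest=1.0):
    """One-step emphatic TD(0) with linear features:
       M_t = gamma * rho_{t-1} * M_{t-1} + I_t   (emphasis, M_{-1} = 0)
       w  <- w + alpha * M_t * rho_t * delta_t * x_t
    """
    M = gamma * rho_prev * M_prev + interest
    delta = r + gamma * w @ x_tp1 - w @ x_t      # episodic TD error
    return w + alpha * M * rho_t * delta * x_t, M

# With constant interest 1 and all rho = 1, the emphasis recursion
# M = gamma * M + 1 settles at 1 / (1 - gamma); gamma = 0.5 gives M -> 2.
# (alpha = 0 here just isolates the emphasis recursion.)
w, M = np.zeros(1), 0.0
x = np.array([1.0])
for _ in range(60):
    w, M = emphatic_td0_step(w, M, x, x, r=0.0, gamma=0.5, alpha=0.0,
                             rho_t=1.0, rho_prev=1.0)
```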
This is the idea of <strong>Emphatic-TD methods</strong>.</li> </ul> </li> <li>The <strong>one-step Emphatic-TD algorithm</strong> for learning episodic state values is defined by:</li> </ul> \[\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\] \[\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha M_t \rho_t\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)\] \[M_t = \gamma \rho_{t-1} M_{t-1} + \mathcal{I}_t\] \[\begin{aligned} \text{where} \\ \mathcal{I}_t &amp;\equiv \text{the interest} \\ M_t &amp;\equiv \text{the emphasis} \quad (M_{-1} = 0) \end{aligned}\] <ul> <li>Applying Emphatic TD to Baird’s counterexample yields results with very high variance (it is impossible to obtain consistent results in experiments).</li> <li>The next section focuses on how to reduce the variance of all these algorithms.</li> </ul> <hr /> <h2 id="119-reducing-variance">11.9 Reducing Variance</h2> <ul> <li>Off-policy learning inherently has greater variance than on-policy learning.</li> <li>The raison d’être of off-policy learning is to enable generalization to the vast number of related-but-not-identical policies.</li> <li>Why is variance control critical in off-policy learning based on importance sampling? 
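<p>To see the danger concretely, here is a quick simulation (a minimal sketch; the two-action setup and the policies $\pi$ and $b$ are invented purely for illustration). Each per-step ratio has expectation 1, yet the product over a trajectory is wildly variable:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-action setup: behavior b is uniform, target pi prefers action 0.
pi = np.array([0.9, 0.1])
b = np.array([0.5, 0.5])

T = 10            # trajectory length
n_traj = 100_000  # number of sampled trajectories

# Sample actions from b, form per-step ratios rho_k = pi(A_k)/b(A_k),
# and multiply them along each trajectory to get rho_{t:T-1}.
actions = rng.choice(2, size=(n_traj, T), p=b)
ratios = pi[actions] / b[actions]   # each ratio is 1.8 or 0.2, expectation 1
products = ratios.prod(axis=1)

print(products.mean())  # stays near 1: the product is unbiased
print(products.var())   # explodes: E[rho^2]^T - 1, roughly 140 here
```

<p>Almost all of the product’s expectation comes from rare trajectories with huge ratios, which is exactly why these ratios are dangerous as multipliers on the SGD step.</p>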
<ul> <li>Recall importance sampling involves products of policy ratios:</li> </ul> \[\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \vert S_k)}{b(A_k \vert S_k)}\] <ul> <li>The policy ratios always have an expected value of 1, but their realized values can be very high or as low as 0:</li> </ul> \[\mathbb{E}\!\left[\frac{\pi(A_k \vert S_k)}{b(A_k \vert S_k)}\right] \doteq \sum_a b(a \vert S_k) \frac{\pi(a \vert S_k)}{b(a \vert S_k)} = \sum_a \pi(a \vert S_k) = 1\] <ul> <li>Successive ratios are uncorrelated, so their products are always 1 in expected value, but they can be of high variance.</li> <li>These ratios multiply the step size in SGD methods, so their high variance is problematic for SGD because of the occasional huge steps.</li> </ul> </li> <li>How can we keep the expected step taken by SGD small, so that high-variance ratios do not translate into occasional huge updates? Some approaches: <ul> <li>Momentum</li> <li>Polyak-Ruppert averaging</li> <li>Methods for adaptively setting separate step sizes for different components of the parameter vector</li> <li>“Importance weight aware” updates of <strong>Karampatziakis &amp; Langford (2015)</strong></li> <li>Weighted importance sampling, which is well-behaved with lower variance updates than ordinary importance sampling, but adapting it to function approximation is challenging <strong>[Mahmood &amp; Sutton, 2015]</strong></li> <li>Tree backup algorithm (off-policy, without importance sampling)</li> <li>Allow the target policy $\pi$ to be determined partly by the behavior policy $b$ to limit creating large importance sampling ratios</li> </ul> </li> </ul> <hr /> <h2 id="1110-summary">11.10 Summary</h2> <ul> <li>Off-policy learning poses a challenge that requires creating stable and efficient learning algorithms.</li> <li>Tabular Q-learning makes off-policy learning seem easy, as do its generalizations to Expected Sarsa and tree backup.</li> <li>Extending it further to function approximation (even linear) is 
challenging.</li> <li>The challenge of off-policy learning is divided into two parts: <ul> <li>Correcting the targets of learning for the behavior policy.</li> <li>Dealing with the instability of bootstrapping (mismatch between off-policy and on-policy distribution of updates).</li> </ul> </li> <li>The <strong>deadly triad</strong> arises when we try to combine these 3 elements: <strong>function approximation, off-policy learning, and bootstrapping,</strong> thereby causing instability and divergence.</li> <li>SGD in the Bellman error $\overline{\text{BE}}$ is not learnable so it does not work.</li> <li>Gradient-TD methods perform SGD in the projected Bellman error $\overline{\text{PBE}}$ and are learnable with $O(d)$ computational complexity.</li> <li>Emphatic-TD methods re-weight updates, emphasizing some and de-emphasizing others, to get the off-policy distribution of the updates to match that of on-policy.</li> <li>There are many ways of reducing high variance in off-policy learning that are centered on minimizing the step taken by SGD by using small step-size parameters to counter the multiplicative effect from the successive policy ratios.</li> </ul> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026RLsuttonBartoCh11notes</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Sutton &amp; Barto, Ch. 
11: Off-Policy Methods with Approximation (Personal Notes)"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Mar"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/03/09/rl-sutton-barto-notes-ch011/"</span> <span class="p">}</span> </code></pre></div></div> <hr /> Mon, 09 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch011/ https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch011/ Sutton & Barto, Ch. 10: On-Policy Control with Approximation (Personal Notes) <ul> <li>Let’s dive into the control problem now with parametric approximation of the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_{*}(s, a)$, where $\mathbf{w} \in \mathbb{R}^d$ is a <strong>finite-dimensional weight vector.</strong></li> <li>We’ll focus on <strong>semi-gradient Sarsa</strong>, the natural extension of semi-gradient TD(0) to action values and to on-policy control.</li> <li>We’ll look at this extension in both the episodic and continuing case.</li> <li>We’ll look at $n$-step linear Sarsa.</li> </ul> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ul> <li><a href="#101-episodic-semi-gradient-control">10.1 Episodic Semi-gradient Control</a></li> <li><a href="#102-semi-gradient-n-step-sarsa">10.2 Semi-gradient $n$-step Sarsa</a></li> <li><a href="#103-average-reward-a-new-problem-setting-for-continuing-tasks">10.3 Average Reward: A New Problem Setting for Continuing Tasks</a></li> <li><a href="#104-deprecating-the-discounted-setting">10.4 
Deprecating the Discounted Setting</a></li> <li><a href="#105-differential-semi-gradient-n-step-sarsa">10.5 Differential Semi-gradient $n$-step Sarsa</a></li> <li><a href="#106-summary">10.6 Summary</a></li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li><a href="#citation">Citation</a></li> </ul> <hr /> <h2 id="101-episodic-semi-gradient-control">10.1 Episodic Semi-gradient Control</h2> <ul> <li>The extension of the semi-gradient prediction methods of Chapter 9 to action values is straightforward.</li> <li>It is the approximate action-value function, $\hat{q} \approx q_\pi$, that is represented as a parametrized functional form with weight vector $\mathbf{w}$.</li> <li>Before, the training examples had the form $S_t \mapsto U_t$; now the examples have the form $S_t, A_t \mapsto U_t$.</li> <li>The update target $U_t$ can be any approximation of $q_\pi(S_t, A_t)$, including the usual backed-up values such as the full Monte Carlo (MC) return $G_t$ or any $n$-step Sarsa return $G_{t:t+n}$.</li> <li>The general gradient-descent update for action-value prediction is:</li> </ul> \[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[U_t - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\] <ul> <li>The update for the one-step Sarsa method is:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\] <ul> <li>This method is called <strong>episodic semi-gradient one-step Sarsa</strong>. 
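<p>In code, the boxed one-step update can be sketched as follows (a minimal illustration assuming linear function approximation with hypothetical one-hot features; this is not the book’s pseudocode):</p>

```python
import numpy as np

def sarsa_update(w, x_sa, r, x_next_sa, alpha=0.1, gamma=0.9, terminal=False):
    """One step of episodic semi-gradient Sarsa for linear q(s,a,w) = w.T x(s,a).

    With linear features the gradient of q-hat is just x(s,a), so the update is
    w += alpha * [R + gamma * q(S',A') - q(S,A)] * x(S,A).
    """
    q_sa = w @ x_sa
    q_next = 0.0 if terminal else w @ x_next_sa
    delta = r + gamma * q_next - q_sa   # one-step TD error
    return w + alpha * delta * x_sa     # semi-gradient step

# Sanity check on a toy chain: state-action pair 0 yields reward 1 and leads
# to pair 1, which yields reward 0 and terminates the episode.
w = np.zeros(4)
x0, x1 = np.eye(4)[0], np.eye(4)[1]
for _ in range(500):
    w = sarsa_update(w, x0, r=1.0, x_next_sa=x1)
    w = sarsa_update(w, x1, r=0.0, x_next_sa=x0, terminal=True)

print(w[:2])  # approaches [1.0, 0.0]
```

<p>With one-hot features this reduces to tabular Sarsa, which makes the fixed point easy to verify by hand.</p>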
For a constant policy, this method converges in the same way that TD(0) does with the same kind of error bound.</li> <li><strong>Control</strong> = action-value prediction + policy improvement &amp; action selection:</li> </ul> \[\boxed{a, S_{t+1} \longrightarrow \hat{q}(S_{t+1}, a, \mathbf{w}_t) \longrightarrow A^*_{t+1} = \arg\max_a \hat{q}(S_{t+1}, a, \mathbf{w}_t) \longrightarrow \varepsilon\text{-greedy policy improvement} \longrightarrow \varepsilon\text{-greedy action selection}}\] <ul> <li>Linear function approximation for the action-value function is:</li> </ul> \[\hat{q}(s, a, \mathbf{w}) \doteq \mathbf{w}^T \mathbf{x}(s, a) = \sum_{i=1}^{d} w_i \cdot x_i(s, a)\] <hr /> <h2 id="102-semi-gradient-n-step-sarsa">10.2 Semi-gradient $n$-step Sarsa</h2> <ul> <li>We use an $n$-step return as the update target for episodic semi-gradient $n$-step Sarsa. The $n$-step return generalizes from its tabular form to a function approximation form:</li> </ul> \[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad t+n &lt; T\] \[\text{with } G_{t:t+n} \doteq G_t \text{ if } t+n \geq T\] <ul> <li>The $n$-step update equation is:</li> </ul> \[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t &lt; T}\] <ul> <li>Performance is best if an intermediate level of bootstrapping is used ($n &gt; 1$).</li> </ul> <hr /> <h2 id="103-average-reward-a-new-problem-setting-for-continuing-tasks">10.3 Average Reward: A New Problem Setting for Continuing Tasks</h2> <ul> <li>Average reward applies to continuing problems for goal formulation in MDPs.</li> <li>Average reward uses <strong>no discounting</strong>; the agent has the same level of care for immediate and delayed rewards.</li> <li>Average reward setting is more commonly considered in dynamic programming and less commonly 
in reinforcement learning (RL).</li> <li>The discounted setting is problematic with function approximation, hence the need for average reward to replace it.</li> <li>In the average-reward setting, the quality of a policy $\pi$ is defined as the average rate of reward, or simply <strong>average reward</strong>, while following that policy, denoted as $r(\pi)$:</li> </ul> \[\begin{align*} r(\pi) &amp;\doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\ &amp;= \lim_{t \to \infty} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\ &amp;= \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a)\, r \end{align*}\] <ul> <li>The expectations in the above equations are conditioned on the initial state $S_0$, and on the subsequent actions $A_0, A_1, \ldots, A_{t-1}$, being taken according to $\pi$.</li> <li>The 2nd and 3rd equations above hold if the MDP is <strong>ergodic</strong>, i.e., if the steady-state distribution exists and is independent of the starting state $S_0$:</li> </ul> \[\mu_\pi(s) \doteq \lim_{t \to \infty} \Pr\!\left\{S_t = s \mid A_{0:t-1} \sim \pi\right\}\] <ul> <li>In an ergodic MDP, the starting state can have only a temporary effect, but in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities.</li> <li>Ergodicity is sufficient but not necessary to guarantee the existence of the limit in the $r(\pi)$ equation above.</li> <li>It may be adequate practically to simply order policies according to their average reward per time step, otherwise called the <strong>return rate</strong>.</li> <li>All policies that reach the maximal value of $r(\pi)$ are optimal.</li> <li>The steady-state distribution $\mu_\pi$ is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution, i.e., for which:</li> </ul> \[\sum_s \mu_\pi(s) \sum_a \pi(a \vert s)\, p(s' \vert s, a) = 
\mu_\pi(s')\] <ul> <li>In the average-reward setting, returns are defined in terms of differences between rewards and the average reward; this is called the <strong>differential return</strong>:</li> </ul> \[G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \ldots\] <ul> <li>The corresponding value functions for the differential return are known as <strong>differential value functions</strong>:</li> </ul> \[\begin{aligned} v_\pi(s) &amp;\doteq \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] \\ q_\pi(s, a) &amp;\doteq \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right] \end{aligned}\] <ul> <li>Differential value functions also have Bellman equations:</li> </ul> \[\begin{aligned} v_\pi(s) &amp;= \sum_a \pi(a \vert s) \sum_{r, s'} p(s', r \vert s, a)\!\left[r - r(\pi) + v_\pi(s')\right] \\[6pt] q_\pi(s, a) &amp;= \sum_{r, s'} p(s', r \vert s, a)\!\left[r - r(\pi) + \sum_{a'} \pi(a' \vert s')\, q_\pi(s', a')\right] \\[6pt] v_{*}(s) &amp;= \max_a \sum_{r, s'} p(s', r \vert s, a)\!\left[r - \max_\pi r(\pi) + v_{*}(s')\right] \\[6pt] q_{*}(s, a) &amp;= \sum_{r, s'} p(s', r \vert s, a)\!\left[r - \max_\pi r(\pi) + \max_{a'} q_{*}(s', a')\right] \end{aligned}\] <ul> <li>The differential forms of the two TD errors (state-value and action-value) are:</li> </ul> \[\begin{aligned} \delta_t &amp;\doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\ \delta_t &amp;\doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \end{aligned}\] \[\begin{aligned} \text{where} \quad \bar{R}_t &amp;= \text{average reward } r(\pi) \text{ estimate at time } t \end{aligned}\] <ul> <li>Most of the algorithms covered so far don’t change for the average-reward setting. 
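<p>The differential TD error can be made concrete with a short sketch (assumed linear state-value features and an invented two-state cycle, both chosen only for illustration; $\bar{R}$ is tracked with its own step size, as in the differential algorithms):</p>

```python
import numpy as np

def differential_td_update(w, r_bar, x, r, x_next, alpha=0.1, beta=0.01):
    """One step of differential semi-gradient TD(0) for linear v(s,w) = w.T x(s).

    delta_t = R_{t+1} - Rbar_t + v(S_{t+1}) - v(S_t); the average-reward
    estimate Rbar_t is itself learned with step size beta.
    """
    delta = r - r_bar + w @ x_next - w @ x   # differential TD error
    w = w + alpha * delta * x                # semi-gradient step on w
    r_bar = r_bar + beta * delta             # update the average-reward estimate
    return w, r_bar

# Toy two-state cycle: s0 -> s1 yields reward 2, s1 -> s0 yields reward 0,
# so the true average reward is r(pi) = 1.
w, r_bar = np.zeros(2), 0.0
x0, x1 = np.eye(2)
for _ in range(20000):
    w, r_bar = differential_td_update(w, r_bar, x0, 2.0, x1)
    w, r_bar = differential_td_update(w, r_bar, x1, 0.0, x0)

print(round(r_bar, 2))  # approaches 1.0
```

<p>At the fixed point both differential TD errors vanish, which forces $\bar{R} \to r(\pi) = 1$ and pins down only the <em>difference</em> of the two state values, as the theory predicts.</p>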
For example, the semi-gradient Sarsa average-reward version is the same as the regular version except with the differential version of the TD error:</li> </ul> \[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\] <hr /> <h2 id="104-deprecating-the-discounted-setting">10.4 Deprecating the Discounted Setting</h2> <ul> <li>For the tabular case, the continuing, discounted problem formulation is useful, but in the approximate case, is this problem formulation necessary?</li> <li>Should we use the discounted reward or average reward in continuing tasks?</li> <li>It turns out that the average of the discounted return is proportional to the average reward.</li> <li>The ordering of all policies in the average discounted return setting would be exactly the same as in the average-reward setting.</li> <li>This idea of the <strong>futility of discounting in continuing problems</strong> can be proven by the <strong>symmetry argument.</strong> <ul> <li> <p>Let’s choose an objective that saves discounting by summing discounted values over the distribution with which states occur under the policy (where $v^\gamma_\pi \equiv$ discounted value function):</p> \[\begin{align*} J(\pi) &amp;= \sum_s \mu_\pi(s)\, v^\gamma_\pi(s) \\ &amp;= \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s'} \sum_r p(s', r \vert s, a)\!\left[r + \gamma v^\gamma_\pi(s')\right] \\ &amp;= r(\pi) + \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s'} \sum_r p(s', r \vert s, a)\, \gamma v^\gamma_\pi(s') \\ &amp;= r(\pi) + \gamma \sum_{s'} v^\gamma_\pi(s') \sum_s \mu_\pi(s) \sum_a \pi(a \vert s)\, p(s' \vert s, a) \\ &amp;= r(\pi) + \gamma \sum_{s'} v^\gamma_\pi(s')\, \mu_\pi(s') \\ &amp;= r(\pi) + \gamma J(\pi) \\ &amp;= r(\pi) + \gamma\!\left(r(\pi) + \gamma J(\pi)\right) \\ &amp;= r(\pi) + \gamma r(\pi) + \gamma^2 J(\pi) \\ &amp;= r(\pi) + \gamma r(\pi) + \gamma^2 r(\pi) + \gamma^3 r(\pi) + \gamma^4 r(\pi) + \ldots \\ &amp;= r(\pi)\!\left[1 + \gamma + \gamma^2 + 
\gamma^3 + \ldots\right] \end{align*}\] \[\hspace{-6cm} \boxed{J(\pi) = \left(\frac{1}{1-\gamma}\right) r(\pi)}\] </li> <li><em>The proposed discounted objective orders policies identically to the undiscounted (average reward) objective.</em></li> <li><em>The discount rate $\gamma$ does not influence the ordering.</em></li> </ul> </li> <li>The root cause of the difficulties with the discounted control setting is that with function approximation we have lost the policy improvement theorem.</li> <li>Now if we change the policy to improve the discounted value of one state, we are no longer guaranteed to have improved the overall policy.</li> </ul> <hr /> <h2 id="105-differential-semi-gradient-n-step-sarsa">10.5 Differential Semi-gradient $n$-step Sarsa</h2> <ul> <li>We need an $n$-step version of the TD error in order to generalize to $n$-step bootstrapping.</li> <li>Let’s generalize the $n$-step return to its differential form, with function approximation:</li> </ul> \[\boxed{G_{t:t+n} \doteq R_{t+1} - \bar{R}_{t+n-1} + \ldots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})}\] \[\begin{aligned} \text{where} \quad \bar{R} &amp;\equiv \text{an estimate of } r(\pi),\quad n \geq 1\ \&amp;\ t+n &lt; T \\ G_{t:t+n} &amp;\doteq G_t \quad \text{ if } t+n \geq T \end{aligned}\] <ul> <li>The $n$-step TD error is:</li> </ul> \[\delta_t \doteq G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w})\] <hr /> <h2 id="106-summary">10.6 Summary</h2> <ul> <li>Extended parametrized function approximation &amp; semi-gradient descent to control.</li> <li>The extension is immediate for the episodic case, but dependent on a new problem formulation based on maximizing the <strong>average reward</strong> setting per time step, for the continuing case.</li> <li>The discounted formulation cannot be carried over to control in the presence of approximations.</li> <li>Most policies cannot be represented by a value function in the approximate case.</li> <li>The scalar average reward 
$r(\pi)$ provides an effective way of ranking the remaining arbitrary policies.</li> <li>The average reward formulation involves new <strong>differential</strong> versions of value functions, Bellman equations, and TD errors, but all of these parallel the old ones and the conceptual changes are small.</li> <li>The average reward setting has a new parallel set of differential algorithms.</li> </ul> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026RLsuttonBartoCh10notes</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Sutton &amp; Barto, Ch. 10: On-Policy Control with Approximation (Personal Notes)"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Mar"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/03/09/rl-sutton-barto-notes-ch010/"</span> <span class="p">}</span> </code></pre></div></div> <hr /> Mon, 09 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch010/ https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch010/ When Your Voice Assistant Can't Hear Tones: Evaluating ASR Bias in Igbo <p>I grew up in an Igbo household in Northern Nigeria that code-switched between English, Igbo, and Hausa almost unconsciously. 
Like many bilingual Nigerians, I’ve watched voice assistants and ASR systems get better and better at English while struggling with our languages. When Meta released omniASR claiming support for over 1,600 languages including Igbo, I was curious. Does “supported” mean it actually works?</p> <p>Turns out, the answer is more complicated than I expected.</p> <h2 id="the-problem-what-does-language-support-really-mean">The Problem: What Does “Language Support” Really Mean?</h2> <p>Here’s the thing about Igbo: tone changes word meaning. The difference between “akwa” (crying), “akwà” (cloth), “àkwà” (egg), and “ákwá” (bridge) isn’t just decorative accent marks. These are completely different words that happen to have the same consonants and vowels. The tone is the difference.</p> <p>So when I saw that omniASR listed Igbo among its supported languages, I wanted to know: does it actually preserve these tonal distinctions? Or does “support” just mean “we trained on some Igbo data and hope for the best”?</p> <h2 id="the-experiment-21-audio-samples">The Experiment: 21 Audio Samples</h2> <p>I designed a simple test. Using my iPhone Voice Memos app, I recorded 21 short audio clips in different categories:</p> <p><strong>Tonal minimal pairs</strong>: I said “akwa, akwa, akwa” three times with no tone, then “akwà, akwà, akwà” three times with low tone, then “àkwà, àkwà, àkwà” with low-low tone, and finally “ákwá, ákwá, ákwá” with high-high tone. Four distinct words, each repeated three times.</p> <p><strong>Code-switching</strong>: Phrases like “The ụlọ is beautiful” where I mix English and Igbo naturally, the way we actually speak.</p> <p><strong>Place names and cultural terms</strong>: Nigerian cities, Igbo food words, proverbs. The stuff that’s probably not in training data.</p> <p><strong>The smoking gun test</strong>: I spoke a sentence with deliberately flat intonation, no tonal variation at all. 
If the model is actually listening to tone in the audio, it shouldn’t add tone marks to monotone speech.</p> <p>Then I ran everything through omniASR and compared what I actually said to what it transcribed.</p> <h2 id="the-results-75-tone-loss">The Results: 75% Tone Loss</h2> <p>The numbers were worse than I expected.</p> <p>For the tonal sample after bootstrapping, the model dropped 75.5% of the tone marks. Not just a few mistakes here and there. Three out of every four tone marks, gone.</p> <p>When I said the four different “akwa” words, the model output was: “akua akua akua akua akwa akwa akwa akua akwa ọkua ọkua ọkua”. Random variations. The semantic distinctions completely lost.</p> <p>But here’s what really convinced me the model isn’t actually listening to tones: the monotone test. I spoke “O na-eri oji n’ututu” (He eats kolanut in the morning) with flat intonation, like a robot. The model transcribed it as “ọne rị ọjí nụ tútú” and added tone marks that I never spoke.</p> <p>If the model were using acoustic information to place diacritics, it shouldn’t be adding tones to flat speech. 
This suggests it’s doing something else: probably using statistical patterns from training data to guess where diacritics should go, rather than actually hearing them.</p> <div class="callout callout--note"> <div class="callout__title"> <strong>Key Diagnostic: The Monotone Test</strong> </div> <div class="callout__body"> <p><strong>File 09:</strong> Spoke “O na-eri oji n’ututu” with FLAT intonation<br /> <strong>Expected:</strong> 0 diacritics (no tonal variation in audio)<br /> <strong>Result:</strong> Model added 7 tone marks that weren’t spoken<br /> <br /> This is evidence of <strong>orthographic bias,</strong> not acoustic perception.</p> </div> </div> <h3 id="what-the-data-shows">What the Data Shows</h3> <p>I created three visualizations to make the patterns clear.</p> <p><img src="/assets/images/2026/omniASR/fig1_loss_by_category.png" alt="loss by category" /></p> <p><strong>Figure 1</strong> shows diacritic loss by category. The tonal category (in red) jumps out immediately at 61.2% raw count loss. For comparison, the domain-specific category had only 6.3% loss. But look at the cross-lingual interference category: it’s at -38.9%, which means the model was adding diacritics that don’t exist. It’s not just dropping tones, it’s hallucinating them in the wrong places.</p> <p><img src="/assets/images/2026/omniASR/fig2_cer_vs_diacritic_loss.png" alt="char error rate vs diacritic loss" /></p> <p><strong>Figure 2</strong> plots character error rate against diacritic loss for each sample. What’s interesting here is that the tonal samples (red dots) show high diacritic loss even when the overall character error rate is moderate (20-40%). This means tone errors aren’t just a consequence of the model doing poorly in general. 
The model can get most of the characters right while still completely failing on tones specifically.</p> <p><img src="/assets/images/2026/omniASR/fig3_bootstrap_ci.png" alt="bootstrap confidence interval" /></p> <p><strong>Figure 3</strong> shows the bootstrap confidence intervals. Even with only 21 samples, the error bars don’t overlap between categories. The tonal category’s worst-case lower bound is 57.1%, which is still terrible. This confirms that what I’m seeing isn’t just noise from a small sample size.</p> <h2 id="the-statistical-story">The Statistical Story</h2> <p>I’m not a statistician, but I know enough to be careful with small sample sizes. Twenty-one samples isn’t huge. So I used bootstrap resampling (basically, randomly resampling my data 10,000 times to get confidence intervals) to make sure these effects weren’t just random noise.</p> <p>Even under the most conservative estimate (the lower bound of the 95% confidence interval), tonal diacritic loss was still 57.1%. The worst-case scenario is still terrible.</p> <p>I also created a custom metric called Diacritic Error Rate (DER) because standard Character Error Rate treats tone marks the same as spacing errors. DER specifically tracks dropped tone marks versus hallucinated tone marks. Turns out the model isn’t just dropping tones. It’s also adding tones that don’t exist, which is a whole different kind of problem.</p> <h2 id="the-categories">The Categories</h2> <p>Breaking down the errors helped me understand what’s going wrong:</p> <p><strong>Cross-lingual interference</strong>: When I spoke phrases with no tone marks at all (like names), the model added incorrect diacritics 38.9% of the time. It’s probably applying orthographic patterns from other languages.</p> <p><strong>Code-switching boundary effects</strong>: The English portions of code-switched sentences were transcribed perfectly. The Igbo portions immediately adjacent to English lost their tones. 
Something about language boundaries is disrupting processing.</p> <p><strong>Domain coverage</strong>: Culturally specific terms (place names, food words) had the best diacritic preservation at only 6.3% loss, but terrible overall accuracy. The model knows the orthography but doesn’t know the words.</p> <p><strong>Tonal collapse</strong>: 75.5% loss. This is the big one.</p> <h2 id="why-this-matters">Why This Matters</h2> <p>I keep coming back to the monotone hallucination test. If I were building a voice assistant for Igbo speakers and it’s adding tones I didn’t speak, that’s not just an accuracy problem. It’s an epistemological problem. The system is presenting confident outputs that have no acoustic basis.</p> <p>Imagine you’re dictating a text message in Igbo and the system confidently transcribes “crying” when you said “cloth.” Not just a typo you can spot and fix. A completely different word that makes semantic nonsense but looks plausible.</p> <div class="callout callout--note"> <div class="callout__title"> <strong>What 75% Tonal Loss Means</strong> </div> <div class="callout__body"> <p>75.5% bootstrap diacritic loss means:<br /> <strong>3 out of 4</strong> tone marks disappear<br /> <strong>“cloth”</strong> → could mean “crying”<br /> <strong>“egg”</strong> → meaning lost entirely<br /> <strong>“bridge”</strong> → wrong word <br /><br /> In English, this would be like dropping 75% of consonants.</p> </div> </div> <p>This isn’t just about transcription accuracy. 
It’s about whether “supporting 1,600+ languages” means anything more than “we trained on data from 1,600+ languages and didn’t check if it actually works for tonal distinctions.”</p> <h2 id="the-bigger-picture-zenos-paradox-of-low-resource-languages">The Bigger Picture: Zeno’s Paradox of Low-Resource Languages</h2> <p>There’s a paper from EMNLP 2024 that talks about “The Zeno’s Paradox of Low-Resource Languages.” The basic idea: models keep claiming to support more and more languages, but the quality asymptote never actually reaches parity with high-resource languages. We get closer and closer, but never quite there.</p> <p>Igbo is interesting because by speaker population (45 million people), it’s not low-resource. But by model performance, it clearly behaves like one. The gap between coverage (we trained on Igbo data) and competence (the model preserves linguistically meaningful distinctions) is huge.</p> <div class="callout callout--note"> <div class="callout__title"> <strong>'Supported' ≠ Works Well</strong> </div> <div class="callout__body"> <p>omniASR claims support for 1,600+ languages. Igbo has 45 million speakers, but its tonal accuracy is 24.5% (only 1 in 4 tone marks preserved).<br /><br /> <strong>Coverage</strong> (in training data) ≠ <strong>Competence</strong> (preserves meaning)</p> </div> </div> <p>This makes me think about all the other languages in that 1,600+ list. How many of them have this same gap? How many communities are using systems that confidently produce nonsense because nobody with native speaker expertise has stress-tested them?</p> <h2 id="what-i-learned">What I Learned</h2> <p><strong>Small, targeted datasets can reveal problems big datasets hide.</strong> I didn’t need thousands of hours of audio. 
Twenty-one carefully designed samples were enough to show systematic failure modes.</p> <p><strong>Native speaker expertise matters.</strong> Automated metrics can’t catch when “crying” is transcribed as “cloth” because the character error rate looks fine. You need someone who speaks the language to know that the semantic content is destroyed.</p> <p><strong>Bootstrap resampling is powerful for small samples.</strong> I was worried 21 samples was too few, but bootstrap confidence intervals let me quantify uncertainty rigorously. Even the pessimistic lower bounds showed substantial effects.</p> <p><strong>The monotone test is a better diagnostic than I expected.</strong> If diacritics are added to flat speech, that’s clear evidence of orthographic bias over acoustic conditioning. One simple test that revealed the core mechanism.</p> <h2 id="the-technical-details">The Technical Details</h2> <p>For anyone interested in replicating this:</p> <ul> <li>I used my iPhone for recording (Voice Memos app, M4A format)</li> <li>Ran inference through Google Colab with omniASR’s official pipeline</li> <li>Computed bootstrap CIs with 10,000 iterations at the utterance level</li> <li>Created a custom DER metric to separate tonal errors from general transcription errors</li> <li>All code, data, and analysis is on GitHub and HuggingFace</li> </ul> <p>The whole analysis took about half a week of evening work. Most of that was iterating on the sample design and figuring out the right statistical approach. The actual recording and inference was maybe a day.</p> <h2 id="whats-next">What’s Next</h2> <p>This is really just a proof of concept. To make stronger claims, I’d need:</p> <ul> <li>Multi-speaker evaluation (10+ speakers across different Igbo dialects)</li> <li>Acoustic analysis (F0 contour tracking to verify what’s actually in the audio)</li> <li>Comparative evaluation (does Whisper do better? 
What about Google’s USM?)</li> <li>Fine-tuning experiments (can we fix this with targeted training data?)</li> </ul> <p>I have ideas for all of these, but they’re bigger projects. For now, I’m focused on documenting the blind spot and making the methodology replicable.</p> <h2 id="why-im-sharing-this">Why I’m Sharing This</h2> <p>This started as curiosity about whether “multilingual” ASR systems actually work for the languages I grew up speaking. But it turned into something bigger.</p> <p>There’s a tendency in ML to treat “supporting” a language as a checkbox. Train on some data, add it to the model card, ship it. But languages aren’t just data. They’re how people communicate, how they think, how they preserve culture.</p> <p>When voice assistants strip tone marks from Igbo, they’re not just making transcription errors. They’re normalizing a version of the language that doesn’t preserve meaning. If every voice interface does this, what happens to how people write Igbo? Do they start thinking tone marks are optional because the AI doesn’t use them?</p> <p>I don’t know the answers to these questions. But I think they’re worth asking before we claim to “support” 1,600+ languages.</p> <h2 id="resources">Resources</h2> <p>If you want to explore the data or replicate the analysis:</p> <ul> <li><strong>Dataset:</strong> <a href="https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots">HuggingFace</a></li> <li><strong>Code:</strong> <a href="https://github.com/chizkidd/igbo-asr-tonal-evaluation">GitHub</a></li> <li><strong>Audio samples:</strong> You can actually listen to the 21 clips and see the transcription failures yourself</li> </ul> <p>The dataset is CC-BY-4.0 licensed, while the code is MIT licensed. 
If this is useful for your work, feel free to use it, cite it, and build on it.</p> <h2 id="final-thoughts">Final Thoughts</h2> <p>This project taught me something important: you don’t need massive compute or huge datasets to find meaningful problems in ML systems. You just need to know where to look and what questions to ask.</p> <p>As a native Igbo speaker, I knew what questions to ask. As someone learning ML, I knew how to design tests and interpret results. That combination turned out to be more valuable than I expected.</p> <p>If you speak a language that’s “supported” by these big multilingual models, I encourage you to test them. Record some minimal pairs. Try code-switching. See if the system actually works the way you use the language, not just the way it appears in training data.</p> <p>You might be surprised what you find.</p> <h2 id="citation">Citation</h2> <p>If you found this work helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026igboasr</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"When Your Voice Assistant Can't Hear Tones: Evaluating ASR Bias in Igbo"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Mar"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/03/04/igbo-asr-tonal-evaluation/"</span> <span class="p">}</span> </code></pre></div></div> 
Wed, 04 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/04/igbo-asr-tonal-eval/ https://chizkidd.github.io//2026/03/04/igbo-asr-tonal-eval/ Tonal Fidelity in Multilingual ASR: A Diagnostic Evaluation <p>This is a brief guide to my evaluation of tonal preservation in Facebook’s omniASR-CTC-1B Automatic Speech Recognition (ASR) model for Igbo, a tonal Niger-Congo language with 45 million speakers. The model claims support for 1,600+ languages including Igbo, but what does “support” mean when tone changes word meaning? I created 21 systematically designed audio samples, ran them through the model, and measured a 75.5% bootstrapped diacritic loss rate on tonal markers. The core finding: the model appears to generate tone marks probabilistically based on orthographic priors rather than acoustic conditioning. I cannot simplify this investigation any further.</p> <p>Where to find it: The dataset with audio is on <a href="https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots">HuggingFace</a>. The code and analysis are on <a href="https://github.com/chizkidd/igbo-asr-tonal-evaluation">GitHub</a>. The full analysis notebook is available at <a href="https://github.com/chizkidd/igbo-asr-tonal-evaluation/blob/main/analysis.ipynb">analysis.ipynb</a>.</p> <p>The following is my guide to stepping through the evaluation methodology.</p> <h2 id="the-problem">The Problem</h2> <p>In Igbo, tone is phonemic. This means tone changes word meaning, not just prosody. The difference between:</p> <ul> <li>akwa (crying)</li> <li>akwà (cloth)</li> <li>àkwà (egg)</li> <li>ákwá (bridge)</li> </ul> <p>…isn’t decorative. These are four completely different words that happen to share consonants and vowels. The tone marks (diacritics) are the only thing distinguishing them. When omniASR lists Igbo as “supported,” does it preserve these tonal distinctions? 
Or does “support” just mean “we trained on some Igbo data”?</p> <h2 id="dataset-design">Dataset Design</h2> <p>I recorded 21 audio samples using my iPhone SE Voice Memos app. Each sample targets a specific failure mode across four categories.</p> <p>The first category tests cross-lingual orthographic interference. My hypothesis was that the model applies incorrect orthographic conventions from other languages to Igbo text. I recorded five samples: personal names without tone marks, formal greetings, numbers in Igbo, well-known proverbs, and a slow prosody test. I expected 0% diacritic loss since there was nothing to lose, but observed -38.9%, meaning the model added diacritics that don’t exist.</p> <p>The second category tests phonemic tone sensitivity. The hypothesis here is that the model cannot distinguish phonemically contrastive tones. I recorded six samples including minimal pairs like akwa/akwà/àkwà/ákwá and oke/òkè/ọkè, dense tone marking, a monotone control (the key diagnostic), and two Yoruba controls. I expected low loss if the model uses acoustic information, but observed 75.5% loss with a bootstrap 95% confidence interval of [57.1%, 89.7%].</p> <p>The smoking gun is file 09. I spoke “O na-eri oji n’ututu” with deliberately flat intonation, with no tonal variation at all. The model transcribed it as “ọne rị ọjí nụ tútú” and ADDED tone marks I never spoke. If the model were using acoustics, it shouldn’t hallucinate tones on monotone speech.</p> <p>The third category tests language boundary effects from code-switching. I hypothesized that switching between English and Igbo disrupts language-specific processing. Five samples test different patterns: English embedding into Igbo, Igbo embedding into English, sentence-level alternation, diacritics in English context, and Nigerian Pidgin as a control. 
The result was 14.3% diacritic loss, with English portions transcribed perfectly while adjacent Igbo lost tone marks.</p> <p>The fourth category tests domain-specific lexical coverage. The hypothesis is that culturally specific terms outside the training distribution would struggle. I recorded Nigerian place names, Igbo food terms, long proverbs, French as a high-resource control, and background noise robustness. This category showed the best diacritic preservation at only 6.3% loss, but terrible overall accuracy with 30% character error rate, indicating word-level errors.</p> <p>The data looks like this (metadata.csv):</p> <pre><code class="language-csv">file_name,ground_truth,model_output,category,character_error_rate,diacritics_expected,diacritics_produced 06_tonal_akwa.m4a,"akwa, akwa, akwa. Akwà, akwà, akwà...","akua akua akua akua akwa akwa...",tonal_diacritics,0.583,12,3 09_tonal_flat.m4a,"O na-eri oji n'ututu","ọne rị ọjí nụ tútú",tonal_diacritics,0.744,0,7 ... </code></pre> <h2 id="model-inference">Model Inference</h2> <p>I used omniASR’s official inference pipeline:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">omnilingual_asr.models.inference.pipeline</span> <span class="kn">import</span> <span class="n">ASRInferencePipeline</span> <span class="n">pipeline</span> <span class="o">=</span> <span class="n">ASRInferencePipeline</span><span class="p">(</span><span class="n">model_card</span><span class="o">=</span><span class="s">"omniASR_CTC_1B"</span><span class="p">)</span> <span class="n">transcription</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">transcribe</span><span class="p">(</span> <span class="n">inp</span><span class="o">=</span><span class="p">[</span><span class="s">"data/audio/06_tonal_akwa.m4a"</span><span class="p">],</span> <span class="n">lang</span><span class="o">=</span><span 
class="p">[</span><span class="s">"ibo_Latn"</span><span class="p">]</span> <span class="p">)</span> </code></pre></div></div> <p>The model has 975 million parameters and uses a CTC-based ASR architecture with a wav2vec2-style encoder and CTC head. It was trained on multilingual data covering over 1,600 languages and released on November 14, 2025.</p> <p>For each audio file, I extracted:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ground_truth</span> <span class="o">=</span> <span class="s">"akwa, akwa, akwa. Akwà, akwà, akwà. Àkwà, àkwà, àkwà. Ákwá, ákwá, ákwá."</span> <span class="n">model_output</span> <span class="o">=</span> <span class="n">transcription</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">'transcription'</span><span class="p">]</span> <span class="c1"># Compare and compute metrics </span></code></pre></div></div> <h2 id="metrics">Metrics</h2> <p>Standard Character Error Rate (CER) conflates spacing errors with tonal errors. 
I defined a custom metric:</p> <h3 id="diacritic-error-rate-der">Diacritic Error Rate (DER)</h3> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">diacritic_error_rate</span><span class="p">(</span><span class="n">ground_truth</span><span class="p">,</span> <span class="n">model_output</span><span class="p">):</span> <span class="n">E</span> <span class="o">=</span> <span class="n">count_diacritics</span><span class="p">(</span><span class="n">ground_truth</span><span class="p">)</span> <span class="c1"># expected </span> <span class="n">P</span> <span class="o">=</span> <span class="n">count_diacritics</span><span class="p">(</span><span class="n">model_output</span><span class="p">)</span> <span class="c1"># produced </span> <span class="n">D</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">E</span> <span class="o">-</span> <span class="n">P</span><span class="p">)</span> <span class="c1"># dropped </span> <span class="n">H</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">P</span> <span class="o">-</span> <span class="n">E</span><span class="p">)</span> <span class="c1"># hallucinated </span> <span class="k">return</span> <span class="p">(</span><span class="n">D</span> <span class="o">+</span> <span class="n">H</span><span class="p">)</span> <span class="o">/</span> <span class="n">E</span> <span class="k">if</span> <span class="n">E</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">else</span> <span class="mi">0</span> <span class="k">def</span> <span class="nf">count_diacritics</span><span class="p">(</span><span class="n">text</span><span class="p">):</span> <span class="n">diacritics</span> <span class="o">=</span> <span 
class="nb">set</span><span class="p">(</span><span class="s">'ụọịàèìòùáéíóúẹṣ'</span><span class="p">)</span> <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">text</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">diacritics</span><span class="p">)</span> </code></pre></div></div> <p>DER isolates tone-related failures:</p> <table> <thead> <tr> <th>Metric</th> <th>Formula</th> <th>What it captures</th> </tr> </thead> <tbody> <tr> <td>CER</td> <td>Levenshtein distance / length</td> <td>All character errors</td> </tr> <tr> <td>RDD (Raw Drop Rate)</td> <td>dropped / expected</td> <td>Only missing tone marks</td> </tr> <tr> <td>DER</td> <td>(dropped + hallucinated) / expected</td> <td>Total tonal deviation</td> </tr> </tbody> </table> <p>Note that DER can exceed 100% when hallucinations are substantial, because the denominator reflects ground truth expectations, not produced output.</p> <h2 id="bootstrap-uncertainty">Bootstrap Uncertainty</h2> <p>With N=21 samples, I needed to quantify uncertainty. 
I used bootstrap resampling:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">bootstrap_ci</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">stat_fn</span><span class="p">,</span> <span class="n">n_boot</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">ci</span><span class="o">=</span><span class="mf">0.95</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">):</span> <span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="c1"># Point estimate </span> <span class="n">point</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">stat_fn</span><span class="p">(</span><span class="n">data</span><span class="p">))</span> <span class="c1"># Bootstrap resampling </span> <span class="n">boots</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">empty</span><span class="p">(</span><span class="n">n_boot</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_boot</span><span class="p">):</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="n">integers</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span 
class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span> <span class="n">boots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">stat_fn</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">idx</span><span class="p">]))</span> <span class="c1"># Percentile CI </span> <span class="n">alpha</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">ci</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span> <span class="n">lo</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">quantile</span><span class="p">(</span><span class="n">boots</span><span class="p">,</span> <span class="n">alpha</span><span class="p">))</span> <span class="n">hi</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">quantile</span><span class="p">(</span><span class="n">boots</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">alpha</span><span class="p">))</span> <span class="k">return</span> <span class="p">(</span><span class="n">point</span><span class="p">,</span> <span class="n">lo</span><span class="p">,</span> <span class="n">hi</span><span class="p">)</span> </code></pre></div></div> <p>Bootstrap resampling occurs at the <strong>utterance level</strong>, not event level. This matters because diacritic distribution is uneven across samples. Some files have 0 expected tone marks, others have 12. 
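<p>To see why the mean of per-utterance rates can sit above the pooled raw rate, here is a toy sketch. The per-file counts below are invented for illustration; they are not the real dataset.</p>

```python
import numpy as np

# Invented per-utterance (expected, dropped) diacritic counts -- illustration only
counts = [(12, 9), (4, 4), (1, 1), (20, 5), (8, 7)]

# Pooled raw rate: total dropped / total expected
pooled = sum(d for _, d in counts) / sum(e for e, _ in counts)  # 26/45, about 57.8%

# Utterance-level bootstrap: resample whole utterances, then average their rates
rng = np.random.default_rng(42)
rates = np.array([d / e for e, d in counts])
boots = np.array([rates[rng.integers(0, len(rates), len(rates))].mean()
                  for _ in range(10_000)])

print(f"pooled raw rate: {pooled:.1%}")
print(f"bootstrap mean:  {boots.mean():.1%}")  # near the utterance mean, above pooled
print(f"95% CI: [{np.quantile(boots, 0.025):.1%}, {np.quantile(boots, 0.975):.1%}]")
```

<p>A file with only one expected diacritic counts as much as a diacritic-dense file under utterance-level resampling, which is what pulls the bootstrap mean away from the pooled ratio.</p>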
Resampling utterances captures this variability.</p> <p>Example result:</p> <ul> <li>Raw count: 30/49 = 61.2% drop rate</li> <li>Bootstrap mean: 75.5%</li> <li>95% CI: [57.1%, 89.7%]</li> </ul> <p>The bootstrap mean exceeds the raw percentage because resampling at utterance level gives more weight to samples with extreme loss rates. Both values are reported for transparency.</p> <div class="callout callout--note"> <div class="callout__title"> <strong>Why Bootstrap Matters</strong> </div> <div class="callout__body"> <p>With only 21 samples, we need uncertainty quantification. Bootstrap resampling (10,000 iterations) shows: <strong>Worst-case lower bound:</strong> 57.1%<br /> <strong>Even pessimistically,</strong> loss is still &gt;50%<br /> <strong>Not</strong> a small-sample fluke</p> </div> </div> <h2 id="results">Results</h2> <h3 id="quantitative-summary">Quantitative Summary</h3> <table> <thead> <tr> <th>Category</th> <th>Samples</th> <th>Diacritic Loss</th> <th>Avg CER</th> </tr> </thead> <tbody> <tr> <td><strong>Phonemic Tone Sensitivity</strong></td> <td>6</td> <td><strong>75.5%</strong></td> <td>50.6%</td> </tr> <tr> <td>Cross-lingual Interference</td> <td>5</td> <td>-38.9%</td> <td>28.8%</td> </tr> <tr> <td>Domain-Specific Coverage</td> <td>5</td> <td>6.3%</td> <td>30.1%</td> </tr> <tr> <td>Language Boundary Effects</td> <td>5</td> <td>14.3%</td> <td>20.0%</td> </tr> <tr> <td><strong>Overall</strong></td> <td><strong>21</strong></td> <td><strong>26.8%</strong></td> <td><strong>32.5%</strong></td> </tr> </tbody> </table> <h3 id="bootstrap-confidence-intervals">Bootstrap Confidence Intervals</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tonal category: 75.5% (95% CI: [57.1%, 89.7%]) Overall: 52.6% (95% CI: [30.3%, 69.7%]) </code></pre></div></div> <p>Even under the worst-case lower bound (57.1%), tonal diacritic loss remains severe.</p> <h3 id="visualizations">Visualizations</h3> <p><img 
src="/assets/images/2026/omniASR/fig1_loss_by_category.png" alt="loss by category" /> Bar chart showing 61.2% raw count loss for the tonal category (red), with negative values indicating diacritic hallucination (script interference).</p> <p><img src="/assets/images/2026/omniASR/fig2_cer_vs_diacritic_loss.png" alt="char error rate vs diacritic loss" /> Scatter plot showing tonal samples (red) have high diacritic loss even when CER is moderate.</p> <p><img src="/assets/images/2026/omniASR/fig3_bootstrap_ci.png" alt="bootstrap confidence interval" /> Forest plot showing 95% CIs for each category, with a 50% threshold line.</p> <h2 id="example-tonal-minimal-pairs">Example: Tonal Minimal Pairs</h2> <p>File 06 is the clearest demonstration:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input (what I said): "akwa, akwa, akwa. Akwà, akwà, akwà. Àkwà, àkwà, àkwà. Ákwá, ákwá, ákwá."
Model output:        "akua akua akua akua akwa akwa akwa akua akwa ọkua ọkua ọkua"

Expected diacritics: 12
Produced diacritics: 3
Loss rate: 75%
</code></pre></div></div> <p>The four distinct words collapsed into random variations. From a linguistic perspective, this is catastrophic. The word akwà meaning cloth got transcribed as akwa, which could mean crying instead. The word àkwà meaning egg got transcribed as akwa, and the meaning is completely lost. The word ákwá meaning bridge got transcribed as akua, which is wrong both in word and tone.</p> <h2 id="the-monotone-test">The Monotone Test</h2> <p>File 09 is my favorite diagnostic. Setup:</p> <ul> <li>Spoke “O na-eri oji n’ututu” (He eats kolanut in the morning)</li> <li>Deliberately flat intonation, like a robot</li> <li>Zero tonal variation in the audio</li> </ul> <p>If the model uses acoustic information to place diacritics, it should produce few or no tone marks on flat speech. 
Result:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ground truth: "O na-eri oji n'ututu"  (0 diacritics)
Model output: "ọne rị ọjí nụ tútú"  (7 diacritics)
</code></pre></div></div> <p>The model ADDED tone marks I never spoke. This is clear evidence of orthographic bias over acoustic conditioning. The model is using statistical patterns from training data to guess where diacritics should go, not listening to the audio.</p> <h2 id="statistical-analysis">Statistical Analysis</h2> <h3 id="hypothesis-testing">Hypothesis Testing</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Null hypothesis (H0): Diacritic loss in tonal category ≤ other categories
Alternative (H1):     Tonal category shows higher loss
Test: Bootstrap confidence intervals (10,000 iterations, 95% CI)

Result: The tonal bootstrap mean (75.5%) substantially exceeds every other
category; the next closest, script hallucination at 38.9%, leaves the tonal
point estimate nearly 2x higher.

Conclusion: Tonal degradation shows the highest loss rate of any category.
Its confidence interval overlaps somewhat with script hallucination's due
to the small sample size (N=21), but the effect size is large and
consistent across resamples.
</code></pre></div></div> <h3 id="robustness-check">Robustness Check</h3> <p>Even under worst-case assumptions using the lower bound of the confidence interval, tonal loss remains at 57.1%, which is still greater than 50%. Overall loss stays at 30.3%, which is still substantial. This suggests the observed tonal degradation is unlikely to be driven solely by sampling variability.</p> <h2 id="code">Code</h2> <p>The full analysis is in <code class="language-plaintext highlighter-rouge">analysis.ipynb</code>. 
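<p>To make the metric concrete, here is a minimal count-based sketch of the DER definition from the Metrics section (the count-taking signature is my simplification; the real function works on strings), applied to the reported numbers for file 06:</p>

```python
def diacritic_error_rate(expected: int, produced: int) -> float:
    """Count-based sketch of the DER from the Metrics section."""
    dropped = max(0, expected - produced)       # tone marks the model lost
    hallucinated = max(0, produced - expected)  # tone marks it invented
    return (dropped + hallucinated) / expected if expected > 0 else 0.0

# File 06 (tonal minimal pairs): 12 diacritics expected, 3 produced
print(f"{diacritic_error_rate(12, 3):.0%}")  # 75% -- matches the reported loss rate
# Hallucination-heavy case
print(f"{diacritic_error_rate(2, 5):.0%}")   # 150%
```

<p>The second call illustrates the note above that DER can exceed 100% when hallucinations dominate, since the denominator is the expected count.</p>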
The core evaluation functions handle diacritic counting, character error rate calculation, and bootstrap resampling. Diacritic counting uses a set of Igbo tone mark characters and counts occurrences in the text. Character error rate is computed using Python’s SequenceMatcher for character-level similarity. Bootstrap resampling runs 10,000 iterations on the tonal diacritics category to compute confidence intervals.</p> <p>All evaluation code is organized in the <code class="language-plaintext highlighter-rouge">src/</code> directory. The <code class="language-plaintext highlighter-rouge">evaluate.py</code> module contains metrics like DER and bootstrap confidence intervals. The <code class="language-plaintext highlighter-rouge">visualize.py</code> module has plotting functions for all three figures. The <code class="language-plaintext highlighter-rouge">utils.py</code> module handles data loading and validation.</p> <h2 id="run-it">Run it</h2> <p>Clone the repository and reproduce:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/chizkidd/igbo-asr-tonal-evaluation.git <span class="nb">cd </span>igbo-asr-tonal-evaluation pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt jupyter notebook analysis.ipynb </code></pre></div></div> <p>Or run in Google Colab: <a href="https://colab.research.google.com/github/chizkidd/igbo-asr-tonal-evaluation/blob/main/analysis.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p> <p>The notebook takes about 5-10 minutes to run on Colab with a T4 GPU. You’ll see the analysis output:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loading metadata... Total samples: 21 Categories: 4 Computing metrics... 
Overall DER: 26.8% Tonal category: 75.5% Script interference: -38.9% Code-switching: 14.3% Domain-specific: 6.3% Bootstrap resampling (10,000 iterations)... Tonal diacritics: 75.5% [57.1%, 89.7%] Overall: 52.6% [30.3%, 69.7%] Generating visualizations... Saved: results/visualizations/fig1_loss_by_category.png Saved: results/visualizations/fig2_cer_vs_loss.png Saved: results/visualizations/fig3_bootstrap_ci.png </code></pre></div></div> <div class="callout callout--note"> <div class="callout__title"> <strong>Reproducibility</strong> </div> <div class="callout__body"> <p><strong>Model:</strong> omniASR-CTC-1B (975M params)<br /> <strong>Data:</strong> 21 samples, 4 categories<br /> <strong>Metrics:</strong> Custom DER (Diacritic Error Rate)<br /> <strong>Stats:</strong> Bootstrap with utterance-level resampling<br /> <strong>Code:</strong> github.com/chizkidd/igbo-asr-tonal-evaluation</p> </div> </div> <h2 id="scope-and-limitations">Scope and Limitations</h2> <p>This study demonstrates three things. First, systematic diacritic loss in omniASR on Igbo across 21 controlled samples. Second, failure to preserve tonal minimal pairs in this evaluation setup. Third, diacritic hallucination on monotone speech, which is evidence of orthographic bias.</p> <p>This study does not claim four things. It doesn’t claim universal failure on all Igbo speech. It doesn’t claim that tone modeling is architecturally absent from the model. It doesn’t claim that Igbo is uniquely disadvantaged compared to all other low-resource languages. And it doesn’t claim that the observed error rates generalize to all dialects or all speakers.</p> <p>What would strengthen these claims? Multi-speaker evaluation with 10+ speakers across different dialects. Acoustic analysis with F0 contour extraction and pitch tracking validation. Comparative evaluation on other models like Whisper, MMS, USM, and Azure Speech. 
And controlled resynthesis experiments that isolate acoustic factors from lexical priors.</p> <div class="callout callout--note"> <div class="callout__title"> <strong>Future Work</strong> </div> <div class="callout__body"> <p><strong>Current:</strong> Single speaker, 21 samples (proof-of-concept)<br /> <strong>Next:</strong> 200 samples, 10+ speakers, 5 dialects<br /> <strong>Then:</strong> Comparative evaluation (Whisper, MMS, Azure)<br /> <strong>Finally:</strong> Fine-tuning intervention with tone-annotated data</p> </div> </div> <h2 id="real-production-systems">Real Production Systems</h2> <p>Between this evaluation and a production-grade ASR fairness audit, there is a long list of things that change:</p> <p><strong>Data.</strong> Instead of 21 samples, production evaluations use thousands of hours across multiple speakers, dialects, ages, and recording conditions.</p> <p><strong>Speakers.</strong> Instead of single-speaker, you need balanced sampling across: dialects (Owerri, Onitsha, Enugu, Nsukka, Afikpo), gender, age ranges, native vs. L2 speakers.</p> <p><strong>Acoustic analysis.</strong> Instead of just comparing transcriptions, you need F0 (fundamental frequency) tracking to verify what’s actually in the audio. Praat or similar tools extract pitch contours frame-by-frame.</p> <p><strong>Comparative evaluation.</strong> Instead of one model, you audit multiple: Whisper (OpenAI), MMS (Meta), USM (Google), Azure Speech (Microsoft). This isolates whether the problem is specific to omniASR or universal.</p> <p><strong>Fine-tuning experiments.</strong> You collect tone-annotated Igbo data (50-100 hours), fine-tune the model, and measure pre/post accuracy. This tests whether the problem is architectural or just data scarcity.</p> <p><strong>Real-world deployment.</strong> You partner with Nigerian developers building voice assistants and measure downstream impact: do users trust ASR that strips tones? 
Does it affect adoption?</p> <p>All of these are important, but if you understand this 21-sample evaluation, you understand the diagnostic methodology.</p> <h2 id="faq">FAQ</h2> <p><strong>Why only 21 samples?</strong> This is a proof-of-concept for blind spot discovery. Large datasets measure prevalence; small targeted datasets reveal failure modes. I prioritized depth (systematic coverage of error types) over breadth (statistical power).</p> <p><strong>Is 75.5% loss generalizable?</strong> Not necessarily. This is the loss rate on my voice, my dialect, my recording setup, for these specific test cases. Multi-speaker evaluation would give population estimates.</p> <p><strong>Why not use Word Error Rate?</strong> WER measures whole-word accuracy. In Igbo, “akwa” vs “akwà” counts as correct by WER (same word, different tone), but semantically these are different words. Diacritic-specific metrics capture what WER misses.</p> <p><strong>Does the model “understand” Igbo?</strong> That’s philosophical. Mechanically: it learned statistical patterns from training data. Whether assigning probability distributions to tokens constitutes “understanding” is up to you.</p> <p><strong>Why does the bootstrap mean exceed the raw percentage?</strong> Bootstrap resamples at utterance level. Samples with extreme loss rates (e.g., file 09 with 0 expected, 7 hallucinated) get resampled more in some iterations, pulling the mean up. This reflects uncertainty about which utterances are “typical.”</p> <p><strong>What’s next?</strong> Collect a 200-sample multi-speaker dataset across 5 Igbo dialects. After that: comparative model evaluation (Whisper vs MMS vs omniASR) and fine-tuning experiments with tone-annotated data.</p> <h2 id="why-this-matters">Why This Matters</h2> <p>There’s a tendency in ML to treat “supporting” a language as a checkbox. Add it to the model card, ship it. But Igbo has 45 million speakers. 
When ASR systems strip tone marks, they normalize a version of the language that doesn’t preserve meaning.</p> <p>If every voice interface does this, what happens to how people write Igbo? Do they internalize that tone marks are optional because the AI doesn’t use them? I don’t know, but these are questions worth asking before claiming to “support” 1,600+ languages.</p> <h2 id="resources">Resources</h2> <p>The dataset is available on <a href="https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots">HuggingFace</a>. The code is on <a href="https://github.com/chizkidd/igbo-asr-tonal-evaluation">GitHub</a>. The model evaluated is <a href="https://huggingface.co/facebook/omniASR-CTC-1B">facebook/omniASR-CTC-1B</a> on HuggingFace. The dataset is licensed under CC-BY-4.0 and the code under MIT. Feel free to use it, cite it, and build on it.</p> <h2 id="citation">Citation</h2> <p>If you found this evaluation helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026tonalevaluation</span><span class="p">,</span>
  <span class="na">title</span> <span class="p">=</span> <span class="s">"Tonal Fidelity in Multilingual ASR: A Diagnostic Evaluation"</span><span class="p">,</span>
  <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span>
  <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span>
  <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span>
  <span class="na">month</span> <span class="p">=</span> <span class="s">"Mar"</span><span class="p">,</span>
  <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/03/01/tonal-fidelity-multilingual-asr/"</span>
<span class="p">}</span> 
</code></pre></div></div> <p>For the dataset:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">obasi2026igbodataset</span><span class="p">,</span> <span class="na">title</span><span class="p">=</span><span class="s">{Igbo Blind Spot Dataset for omniASR-CTC-1B: Systematic Evaluation of Tonal Diacritic Loss}</span><span class="p">,</span> <span class="na">author</span><span class="p">=</span><span class="s">{Obasi, Chizoba}</span><span class="p">,</span> <span class="na">year</span><span class="p">=</span><span class="s">{2026}</span><span class="p">,</span> <span class="na">publisher</span><span class="p">=</span><span class="s">{HuggingFace}</span><span class="p">,</span> <span class="na">howpublished</span><span class="p">=</span><span class="s">{\url{https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots}}</span><span class="p">,</span> <span class="na">note</span><span class="p">=</span><span class="s">{Model evaluated: facebook/omniASR-CTC-1B (975M parameters)}</span> <span class="p">}</span> </code></pre></div></div> Sun, 01 Mar 2026 00:00:00 +0000 https://chizkidd.github.io//2026/03/01/tonal-fidelity-multilingual-asr/ https://chizkidd.github.io//2026/03/01/tonal-fidelity-multilingual-asr/ Sutton & Barto, Ch. 09: On-Policy Prediction with Approximation (Personal Notes) <ul> <li>Study of <strong>function approximation</strong> in RL by considering its use in estimating the state-value function from on-policy data, i.e. 
in approximating $v_\pi$ from experience generated using a known policy $\pi$.</li> <li>The approximate value function is represented as a parameterized functional form with <strong>weight vector</strong> $\mathbf{w} \in \mathbb{R}^d$:</li> </ul> \[\hat{v}(s, \mathbf{w}) \approx v_\pi(s)\] <ul> <li>The function above is for the approximate value of state $s$ given weight vector $\mathbf{w}$.</li> <li>$\hat{v}$ might be a linear function, a multi-layer artificial neural network, or a decision tree.</li> <li>Extending RL to function approximation makes it applicable to <strong>partially observable problems</strong>, in which the full state is unavailable to the agent.</li> <li>Function approximation cannot augment the state representation with memories of past observations.</li> </ul> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ul> <li><a href="#91-value-function-approximation">9.1 Value-Function Approximation</a></li> <li><a href="#92-the-prediction-objective-overlinetextve">9.2 The Prediction Objective (VE)</a></li> <li><a href="#93-stochastic-gradient--semi-gradient-methods">9.3 Stochastic-Gradient &amp; Semi-Gradient Methods</a></li> <li><a href="#94-linear-methods">9.4 Linear Methods</a></li> <li><a href="#95-feature-construction-for-linear-methods">9.5 Feature Construction for Linear Methods</a> <ul> <li><a href="#951-polynomials">9.5.1 Polynomials</a></li> <li><a href="#952-fourier-basis">9.5.2 Fourier Basis</a></li> <li><a href="#953-coarse-coding">9.5.3 Coarse Coding</a></li> <li><a href="#954-tile-coding">9.5.4 Tile Coding</a></li> <li><a href="#955-radial-basis-functions-rbf">9.5.5 Radial Basis Functions (RBF)</a></li> </ul> </li> <li><a href="#96-selecting-step-size-parameters-manually">9.6 Selecting Step-Size Parameters Manually</a></li> <li><a href="#97-nonlinear-function-approximation-artificial-neural-networks-anns">9.7 Nonlinear Function Approximation: Artificial Neural Networks (ANNs)</a></li> <li><a 
href="#98-least-squares-td-lstd">9.8 Least-Squares TD (LSTD)</a></li> <li><a href="#99-memory-based-function-approximation">9.9 Memory-based Function Approximation</a></li> <li><a href="#910-kernel-based-function-approximation">9.10 Kernel-based Function Approximation</a></li> <li><a href="#911-looking-deeper-at-on-policy-learning-interest--emphasis">9.11 Looking Deeper at On-Policy Learning: Interest &amp; Emphasis</a></li> <li><a href="#912-summary">9.12 Summary</a></li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li><a href="#citation">Citation</a></li> </ul> <hr /> <h2 id="91-value-function-approximation">9.1 Value-Function Approximation</h2> <ul> <li>All the prediction methods covered so far involve an update to an estimated value function that shifts its value towards a “backed-up value” (update target, $u$). Let’s denote an individual state update by $s \mapsto u$:</li> </ul> \[\begin{align*} \text{Monte-Carlo (MC):} \quad &amp; S_t \mapsto G_t \\ \text{TD(0):} \quad &amp; S_t \mapsto R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) \\ n\text{-step TD:} \quad &amp; S_t \mapsto G_{t:t+n} \\ \text{Dynamic Programming (DP):} \quad &amp; s \mapsto \mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) \,\middle\vert\, S_t = s\right] \end{align*}\] <ul> <li>Each update is interpretable as an example of the desired input-output behaviour of the value function. 
<ul> <li>Until now, value function updates have been made in the tabular setting, where an update is trivial: the estimated values of all other states are left unchanged.</li> <li>With arbitrarily complex and sophisticated update methods, an update at $s$ generalizes, so the estimated values of many other states change as well.</li> <li>Machine learning methods that learn to mimic input-output examples in this way are called <strong>supervised learning</strong> methods, and when the outputs are numbers, like $u$, the process is called <strong>function approximation</strong>.</li> </ul> </li> <li>Function approximation methods expect to receive examples of the desired input-output behavior of the function they are trying to approximate. <ul> <li>We view each update as a conventional training example.</li> <li>This allows us to use any of a wide range of existing function approximation methods for value prediction.</li> <li>Whatever function approximation method is chosen needs to be able to perform <strong>online learning</strong>.</li> </ul> </li> </ul> <hr /> <h2 id="92-the-prediction-objective-overlinetextve">9.2 The Prediction Objective ($\overline{\text{VE}}$)</h2> <ul> <li>We have more states $s$ than weights $\mathbf{w}$, so we cannot feasibly approximate the value function perfectly.</li> <li>Making one state’s estimate more accurate invariably makes the others’ less accurate.</li> <li>Therefore, it is necessary to define which states we care most about, based on a <strong>state distribution</strong>, $\mu(s)$, that represents how much we care about the error in each state $s$.</li> </ul> <!-- $$\text{state distribution, } \hspace{1em} \mu(s) \geq 0, \sum_s \mu(s) = 1$$ --> \[\mu(s) \geq 0, \quad \sum_s \mu(s) = 1\] <ul> <li>Weighting the error in each state $s$, the difference between the approximate value $\hat{v}(s, \mathbf{w})$ and the true value $v_\pi(s)$, by $\mu$ over the state space yields a 
natural objective function called the <strong>mean squared value error</strong>, denoted by $\overline{\text{VE}}$:</li> </ul> \[\boxed{\overline{\text{VE}}(\mathbf{w}) \doteq \sum_{s \in S} \mu(s) \left[v_\pi(s) - \hat{v}(s, \mathbf{w})\right]^2}\] <ul> <li>$\sqrt{\overline{\text{VE}}}$ tells us roughly how much the approximate values differ from the true values.</li> <li>Often $\mu$ is chosen to be the fraction of time spent in $s$, called the <strong>on-policy distribution</strong> under on-policy training.</li> </ul> <h3 id="on-policy-distribution">On-policy Distribution</h3> <ol> <li> <strong>Continuing tasks</strong>: the on-policy distribution is the stationary distribution under $\pi$: $$\mu(s) = \sum_{s'} \mu(s') \sum_a \pi(a \vert s')\, p(s \vert s', a), \quad \forall s \in S$$ $$ \begin{aligned} \text{where} \\ \mu(s) &amp;= \text{stationary probability of being in state } s \text{ under } \pi \\ s' &amp;= \text{preceding state} \end{aligned} $$ <div class="callout callout--note"> <div class="callout__title"> <strong>Balance Equation</strong> </div> <div class="callout__body"> <p>The probability of being in state $s$ equals the sum over all ways of arriving in $s$ from any previous state $s’$ under policy $\pi$.</p> </div> </div> </li> <li> <strong>Episodic tasks</strong>: it depends on how the initial states of episodes are chosen: $$\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \vert \bar{s})\, p(s \vert \bar{s}, a), \quad \forall s \in S$$ $$ \begin{aligned} \text{where} \\ h(s) &amp;= \text{probability that an episode begins in each state } s \\ \eta(s) &amp;= \text{number of time steps spent, on average, in state } s \text{ in a single episode} \\ \bar{s} &amp;= \text{preceding state} \end{aligned} $$ <div class="callout callout--note"> <div class="callout__title"> <strong>Visitation Equation</strong> </div> <div class="callout__body"> <p>The expected number of visits to state $s$ equals the probability of starting in state $s$ plus the 
expected number of visits to all preceding states $s’$ that transition into $s$ under policy $\pi$.</p> </div> </div> <ul> <li>This system of equations can be solved for the expected number of visits $\eta(s)$.</li> <li>The on-policy distribution is the fraction of time spent in each state, normalized to sum to 1:</li> </ul> $$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}, \quad \forall s \in S$$ <ul> <li>If discounting exists, then we redefine $\eta(s)$:</li> </ul> $$\eta(s) = h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \vert \bar{s})\, p(s \vert \bar{s}, a), \quad \forall s \in S$$ $$ \begin{aligned} \text{where} \ \gamma &amp;= \text{discount factor } \end{aligned} $$ </li> </ol> <h3 id="performance-objective">Performance Objective</h3> <ul> <li>Although $\overline{\text{VE}}$ is a good starting point, it’s not completely clear that it is the right performance objective.</li> <li>The <strong>ultimate goal</strong> (reason for learning a value function) is to find a better policy.</li> <li>If we use $\overline{\text{VE}}$, the goal is to find a <strong>global optimum</strong> (optimal weight vector, $\mathbf{w}^*$):</li> </ul> \[\mathbf{w}^* \hspace{.5em} \text{for which} \hspace{.75em} \overline{\text{VE}}(\mathbf{w}^*) \leq \overline{\text{VE}}(\mathbf{w}), \quad \forall \mathbf{w}\] <ul> <li>Complex function approximators may converge to a <strong>local optimum</strong>:</li> </ul> \[\mathbf{w}^* \hspace{.5em} \text{for which} \hspace{.75em} \overline{\text{VE}}(\mathbf{w}^*) \leq \overline{\text{VE}}(\mathbf{w}), \quad \forall \mathbf{w} \hspace{.5em} \text{in some neighborhood of} \hspace{.5em} \mathbf{w}^*\] <hr /> <h2 id="93-stochastic-gradient--semi-gradient-methods">9.3 Stochastic-Gradient &amp; Semi-Gradient Methods</h2> <h3 id="stochastic-gradient-descent-sgd">Stochastic Gradient Descent (SGD)</h3> <ul> <li>SGD methods are among the most widely used of all function approximation methods and are well suited to online RL.</li> <li>The function 
approximator is parameterized by a weight vector with a fixed number of real components (column vector):</li> </ul> \[\mathbf{w} \doteq (w_1, w_2, w_3, \ldots, w_d)^T\] <ul> <li>In SGD, we update the weight vector at each time step by moving it in the direction that minimizes the error most quickly for the example shown:</li> </ul> \[\mathbf{w}_{t+1} \doteq \mathbf{w}_t - \frac{1}{2}\alpha \nabla\!\left[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\right]^2\] \[\boxed{ \mathbf{w}_{t+1}= \mathbf{w}_t + \alpha\!\left[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)}\] \[\begin{aligned} \text{where} \\ \hat{v}(s, \mathbf{w}) &amp;= \text{approximate value function (differentiable in } \mathbf{w},\ \forall s \in S) \\ \alpha &amp;= \text{positive step-size parameter} \\ \nabla f(\mathbf{w}) &amp;= \text{column vector of partial derivatives of } f(\mathbf{w}) \text{ WRT components of } \mathbf{w} \end{aligned}\] <ul> <li>The equation for $\nabla f(\mathbf{w})$, the gradient of $f$ WRT $\mathbf{w}$, is:</li> </ul> \[\nabla f(\mathbf{w}) \doteq \left(\frac{\partial f(\mathbf{w})}{\partial w_1},\ \frac{\partial f(\mathbf{w})}{\partial w_2},\ \frac{\partial f(\mathbf{w})}{\partial w_3},\ \ldots,\ \frac{\partial f(\mathbf{w})}{\partial w_d}\right)^T\] <ul> <li>SGD methods are <strong><em>“gradient descent”</em></strong> methods because the overall step in $\mathbf{w}_t$ is proportional to the negative gradient of the example’s squared error. 
This is the direction in which the error falls most rapidly.</li> <li>Gradient descent methods are called <strong><em>“stochastic”</em></strong> because the update is done on only a single example, which might have been selected stochastically.</li> <li>If $\alpha$ decreases as expected in satisfaction of the standard stochastic approximation conditions, then SGD is guaranteed to converge to a local optimum.</li> </ul> <h3 id="true-value-estimates">True Value Estimates</h3> <ul> <li>When the true value function $v_\pi(S_t)$ is unknown, we can approximate it by substituting $U_t$ in place of $v_\pi(S_t)$.</li> <li>This yields the following general SGD method for state-value prediction:</li> </ul> \[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[U_t - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)}\] <ul> <li>If $U_t$ is an <strong><em><u>unbiased</u></em></strong> estimate, that is, if \(\mathbb{E}[U_t \vert S_t = s] = v_\pi(s)\) for each $t$, then $\mathbf{w}_t$ is guaranteed to converge to a local optimum under the usual stochastic approximation conditions for decreasing $\alpha$.</li> <li>An example of an unbiased estimator is the <strong>Monte Carlo</strong> estimate for state $S_t$:</li> </ul> \[U_t = G_t\] \[\mathbf{w} \leftarrow \mathbf{w} + \alpha\!\left[G_t - \hat{v}(S_t, \mathbf{w})\right] \nabla \hat{v}(S_t, \mathbf{w}), \quad \alpha &gt; 0\] <ul> <li><strong>Bootstrapping methods</strong> are <strong><em><u>biased</u></em></strong> in that their target is dependent on the current value of the weights $\mathbf{w}$. 
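The general SGD update above can be sketched in a few lines. This is an illustrative toy (the one-hot feature map, step size, and target values are my own assumptions, not from the book), using an unbiased Monte Carlo return as the target $U_t$:

```python
import numpy as np

def x(s, d=4):
    """One-hot feature vector for a small discrete state space (illustrative)."""
    v = np.zeros(d)
    v[s] = 1.0
    return v

def sgd_update(w, s, U, alpha=0.1):
    """w <- w + alpha * [U - v_hat(s, w)] * grad v_hat(s, w).
    With a linear approximator, grad v_hat(s, w) = x(s)."""
    v_hat = w @ x(s)
    return w + alpha * (U - v_hat) * x(s)

w = np.zeros(4)
for _ in range(100):
    w = sgd_update(w, s=2, U=1.0)  # train state 2 toward a return G_t = 1.0
print(round(w[2], 3))              # the estimate approaches the target, 1.0
```

Because the features here are one-hot, this reduces to the tabular update; with richer feature vectors the same two functions produce genuine generalization across states.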
<ul> <li><strong>Semi-gradient</strong> (bootstrapping) methods converge reliably in linear cases.</li> <li>Bootstrapping targets could be <strong>$n$-step returns</strong> or <strong>dynamic programming (DP)</strong> targets.</li> </ul> </li> </ul> \[\begin{align*} (\text{semi-gradient}) \quad &amp; U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) \\ (n\text{-step}) \quad &amp; U_t = G_{t:t+n} \\ (\text{DP}) \quad &amp; U_t = \sum_{a,s',r} \pi(a \vert S_t)\, p(s', r \vert S_t, a)\!\left[r + \gamma \hat{v}(s', \mathbf{w}_t)\right] \end{align*}\] <h3 id="state-aggregation">State Aggregation</h3> <ul> <li>A simple form of generalizing function approximation in which states are grouped together, with one estimated value (one component of the weight vector $\mathbf{w}$) for each group.</li> <li>A special SGD case in which:</li> </ul> \[\nabla \hat{v}(S_t, \mathbf{w}_t) = \left\{ \begin{array}{ll} 1 &amp; \text{for } S_t\text{'s group's component} \\ 0 &amp; \text{for the other components} \end{array} \right\}\] <hr /> <h2 id="94-linear-methods">9.4 Linear Methods</h2> <ul> <li>One of the simplest cases for function approximation is when the approximate value function is a <strong>linear combination</strong> of the weight vector.</li> <li>Linear methods approximate the state-value function by the inner product between $\mathbf{w}$ and $\mathbf{x}(s)$:</li> </ul> \[\hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^T \mathbf{x}(s) \doteq \sum_{i=1}^{d} w_i x_i(s)\] \[\begin{aligned} \\ \text{where } \quad \mathbf{x}(s) &amp;= \textbf{feature vector } \text{representing state } s \\ \mathbf{x}(s) &amp;\doteq \bigl(x_1(s),\ x_2(s),\ x_3(s),\ \ldots,\ x_d(s)\bigr)^T \\ x_i(s) &amp;= \text{value of function } x_i : S \to \mathbb{R} \end{aligned}\] <ul> <li>For linear methods, features are <strong>basis functions</strong> because they form a linear basis for the set of approximate functions.</li> <li>For linear methods, the gradient of the approximate value function WRT $\mathbf{w}$ 
is:</li> </ul> \[\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)\] <ul> <li>Thus in the linear case, the general SGD update reduces to:</li> </ul> \[\boxed{ \mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[U_t - \hat{v}(S_t, \mathbf{w}_t)\right] \mathbf{x}(S_t)}\] <ul> <li>The semi-gradient TD(0) algorithm converges under linear function approximation:</li> </ul> \[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left(R_{t+1} + \gamma \mathbf{w}_t^T \mathbf{x}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t\right) \mathbf{x}_t\] \[= \mathbf{w}_t + \alpha\!\left(R_{t+1} \mathbf{x}_t - \mathbf{x}_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right)^T \mathbf{w}_t\right)\] \[\begin{aligned} \\ \text{where} \quad \mathbf{x}_t = \mathbf{x}(S_t) \end{aligned}\] <ul> <li>Once the system has reached steady state, for any given $\mathbf{w}_t$ the expected next weight vector can be represented as:</li> </ul> \[\mathbb{E}\!\left[\mathbf{w}_{t+1} \vert \mathbf{w}_t\right] = \mathbf{w}_t + \alpha\!\left(\mathbf{b} - \mathbf{A}\mathbf{w}_t\right)\] \[\begin{aligned} \\ \text{where} \quad b &amp;\doteq \mathbb{E}\!\left[R_{t+1}\, \mathbf{x}_t\right] \in \mathbb{R}^d \\ A &amp;\doteq \mathbb{E}\!\left[\mathbf{x}_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right)^T\right] \in \mathbb{R}^{d \times d} \end{aligned}\] <ul> <li>It is clear that, if the system converges, it must converge to the weight vector $\mathbf{w}_\text{TD}$ at which:</li> </ul> \[\mathbf{b} - \mathbf{A}\mathbf{w}_\text{TD} = \mathbf{0}\] \[\mathbf{b} = \mathbf{A}\mathbf{w}_\text{TD}\] \[\mathbf{w}_\text{TD} \doteq \mathbf{A}^{-1}\mathbf{b}\] <ul> <li>$\mathbf{w}_\text{TD}$ is called the <strong>TD fixed point</strong>. It is the point that linear semi-gradient TD(0) converges to.</li> <li> <p>$\mathbf{A}$ needs to be <strong><em><u>positive definite</u></em></strong> to ensure that $\mathbf{A}^{-1}$ exists. 
\(y^T \mathbf{A} y &gt; 0, \text{for any } y \neq 0\)</p> </li> <li>The semi-gradient $n$-step TD algorithm is the natural extension of the tabular $n$-step TD algorithm in Ch. 7 to semi-gradient function approximation:</li> </ul> <!-- $$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t < T$$ $$\text{where } \hspace{0.75em} G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad 0 \leq t \leq T-n$$ --> \[\boxed{ \begin{aligned} \mathbf{w}_{t+n} &amp;\doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t &lt; T \\ \\ G_{t:t+n} &amp;\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad 0 \leq t \leq T-n \end{aligned} }\] <h3 id="bounded-expansion">Bounded Expansion</h3> <ul> <li>At the TD fixed point, $\overline{\text{VE}}$ is within a bounded expansion of the lowest possible error:</li> </ul> \[\boxed{\overline{\text{VE}}(\mathbf{w}_\text{TD}) \leq \frac{1}{1-\gamma} \min_{\mathbf{w}} \overline{\text{VE}}(\mathbf{w})}\] <ul> <li>That is, the asymptotic error of the TD method is no more than $\frac{1}{1-\gamma}$ times the smallest possible error (attained in the limit by the MC method).</li> <li>$\gamma$ is often near 1, so the expansion factor can be quite large; therefore there is substantial potential loss in asymptotic performance with the TD method.</li> <li>A bound analogous to TD’s fixed point &amp; bound above on $\overline{\text{VE}}(\mathbf{w}_\text{TD})$ applies to other on-policy bootstrapping methods as well.</li> <li>One-step <strong>action-value</strong> methods such as <strong><em>semi-gradient Sarsa(0)</em></strong> converge to an analogous fixed point and an analogous 
bound.</li> <li>For <strong>episodic</strong> tasks, there exists an analogous bound [Bertsekas &amp; Tsitsiklis, 1996].</li> </ul> <hr /> <h2 id="95-feature-construction-for-linear-methods">9.5 Feature Construction for Linear Methods</h2> <ul> <li>Let’s discuss different ways of constructing features, an important step for linear function approximation methods.</li> </ul> <h3 id="951-polynomials">9.5.1 Polynomials</h3> <ul> <li>We may need to design higher-order features so that interactions between state dimensions can be captured by linear methods.</li> <li>Suppose, for example, a state has 2 numerical dimensions, $s_1, s_2 \in \mathbb{R}$. It is insufficient to represent this state as $\mathbf{x}(s) = (s_1, s_2)^T$ because this takes no account of interactions between the 2 dimensions.</li> <li>We can overcome this limitation by instead representing $s$ by the 4-D feature vector:</li> </ul> \[\mathbf{x}(s) = \bigl(1,\ s_1,\ s_2,\ s_1 s_2\bigr)^T\] <ul> <li>More generally, suppose each state $s$ corresponds to $k$ numbers, $s_1, s_2, s_3, \ldots, s_k \text{ with each } s_i \in \mathbb{R}$. Then each order-$n$ polynomial-basis feature $x_i$ can be written as:</li> </ul> \[\boxed{x_i(s) = \prod_{j=1}^{k} s_j^{c_{i,j}}}\] \[\begin{aligned} \text{where} \\ c_{i,j} &amp;= \text{integer in } \{0, 1, 2, 3, \ldots, n\}, \quad \text{ for } n \geq 0 \\ (n+1)^k &amp;= \text{no. 
of distinct features for a } k\text{-dimensional state space} \end{aligned}\] <h3 id="952-fourier-basis">9.5.2 Fourier Basis</h3> <ul> <li>Fourier series express periodic functions as weighted sums of sine and cosine basis functions of different frequencies, with a function $f$ being periodic if:</li> </ul> \[f(x) = f(x + \tau) \quad \forall x, \text{ and for some period } \tau\] <ul> <li>The 1-D, order-$n$ Fourier cosine basis consists of the $n+1$ features:</li> </ul> \[x_i(s) = \cos(i \pi s), \quad s \in [0, 1] \quad \text{for } i = 0, 1, 2, \ldots, n\] <ul> <li>For a $k$-dimensional state space, the $i$-th feature in the order-$n$ Fourier cosine basis is:</li> </ul> \[\boxed{x_i(s) = \cos\!\left(\pi \mathbf{s}^T \mathbf{c}^i\right)}\] \[\begin{aligned} \text{where} \\ \mathbf{c}^i &amp;= \bigl(c_1^i, c_2^i, \ldots, c_k^i\bigr)^T \\ c_j^i &amp;\in \{0, 1, 2, \ldots, n\} \quad \text{for } j = 1, 2, \ldots, k \text{ and } i = 1, 2, \ldots, (n+1)^k \\ i &amp;\equiv \text{features} \\ j &amp;\equiv \text{dimensions} \end{aligned}\] <h3 id="953-coarse-coding">9.5.3 Coarse Coding</h3> <ul> <li>In 2 dimensions, each feature corresponds to a circle (a receptive field) in the state space.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch09-9-5-3-coarse-coding.png" alt="coarse coding" /></p> <!-- >**Coarse coding:** Generalization from state $s$ to state $s'$ depends on the number of their features whose receptive fields (in this case, circles) overlap. These states have one feature in common, so there will be slight generalization between them. --> <div class="callout callout--note"> <div class="callout__title"> <strong>Coarse Coding</strong> </div> <div class="callout__body"> <p>Generalization from state $s$ to state $s’$ depends on the number of their features whose receptive fields (in this case, circles) overlap. 
These states have one feature in common, so there will be slight generalization between them.</p> </div> </div> <ul> <li>In the diagram above, if the state is inside a circle, then the corresponding feature has the value of 1 and is said to be <strong>present</strong>; otherwise the feature is 0 and is said to be <strong>absent</strong>.</li> <li>This kind of 1-0 valued feature is called a <strong>binary feature</strong>.</li> <li>Representing a state with features that overlap in this way is known as <strong>coarse coding</strong>.</li> <li>If we train at one state (a point in the space), then the weights of all circles intersecting that state will be affected (each circle has a corresponding weight, a single $\mathbf{w}$ component).</li> <li>Generalization in linear function approximation methods is determined by the <strong>sizes</strong> and <strong>shapes</strong> of the features’ receptive fields.</li> </ul> <h3 id="954-tile-coding">9.5.4 Tile Coding</h3> <ul> <li>Tile coding is a form of coarse coding for multi-dimensional continuous spaces that is flexible and computationally efficient.</li> <li>It may be the most practical feature representation for modern sequential digital computers.</li> <li>In tile coding, the receptive fields of the features are grouped into partitions of the state space.</li> <li>Each such partition is called a <strong>tiling</strong>, and each element of the partition is called a <strong>tile</strong>.</li> <li>These tiles can be sets that are overlapping, uniform, or asymmetrically distributed.</li> <li>The tiles do not need to be squares; they can be irregular shapes, horizontal/vertical/log lines.</li> <li><strong>Advantages:</strong> <ul> <li>Because tile coding works with partitions, the overall number of features that are active at one time is the <strong>same</strong> for any state.</li> <li>Because tile coding uses binary feature vectors, computing the approximate value function reduces to simply summing the $n$ weight 
components corresponding to the $n$ active tiles, rather than performing $d$ multiplications.</li> </ul> </li> <li>Generalization extends to states sharing tiles with the trained state, proportional to <strong>tiles in common.</strong> <ul> <li><strong>Uniform offsets</strong> can introduce directional artifacts (e.g. diagonal bias);</li> <li><strong>Asymmetric offsets</strong> produce better-centered patterns.</li> </ul> </li> <li>Tilings are offset by $\frac{w}{n}$ (the fundamental unit). <ul> <li><strong>Uniform offsets</strong> use displacement vector $(1,1)$;</li> <li><strong>Asymmetric offsets</strong> use $(1,3)$ for superior generalization.</li> </ul> </li> <li>Miller &amp; Glanz (1996) recommend displacement vectors of <strong>first odd integers</strong> $(1, 3, 5, \ldots, 2k-1)$ for $k$ dimensions, with $n$ set to a <strong>power of 2</strong> $\geq 4k$.</li> <li>Tile <strong>number &amp; size</strong> determine resolution, while tile <strong>shape</strong> determines generalization. <ul> <li>Square tiles generalize equally</li> <li>Elongated tiles generalize along their longer dimension</li> <li>Diagonal tiles generalize along a diagonal.</li> </ul> </li> <li><strong>Mixed tile shapes</strong> (horizontal, vertical, conjunctive) allow per-dimension generalization while still learning values for specific conjunctions. Tile choice fully determines generalization behavior.</li> <li><strong>Hashing</strong> pseudorandomly collapses a large tiling into a much smaller set of tiles (each consisting of noncontiguous, disjoint regions), drastically reducing memory requirements with little performance loss. 
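The hashing trick above can be sketched as follows (an illustrative sketch; `MEMORY`, `tile_index`, and the example coordinates are made-up names, not from the book). Each conceptual tile, identified by its integer coordinates plus its tiling number, is hashed pseudorandomly into a fixed-size weight table:

```python
MEMORY = 512   # size of the weight table; fixed regardless of dimensionality

def tile_index(coords, tiling):
    """Hash a tile's integer coordinates (one per state dimension) and its
    tiling number into one of MEMORY slots; distinct tiles may collide."""
    return hash((tiling,) + tuple(coords)) % MEMORY

# An 8-D task would naively need tiles_per**8 weights per tiling; hashed,
# every tiling shares the same 512-slot table.
idx = tile_index(coords=(3, 1, 4, 1, 5, 9, 2, 6), tiling=0)
print(0 <= idx < MEMORY)   # True
```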
<ul> <li>This sidesteps the curse of dimensionality since memory need only match the task’s real demands rather than grow exponentially with dimensions.</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch09-9-5-4-hashing-border.png" alt="hashing" /></p> <!-- >**Hashing:** 4 subtiles collapse into 1 tile in the above diagram --> <div class="callout callout--note"> <div class="callout__title"> <strong>Hashing</strong> </div> <div class="callout__body"> <p>4 subtiles collapse into 1 tile in the above diagram.</p> </div> </div> <h3 id="955-radial-basis-functions-rbf">9.5.5 Radial Basis Functions (RBF)</h3> <ul> <li>RBFs are the natural extension/generalization of coarse coding to continuous-valued features. <ul> <li>Rather than a feature being 0 or 1 (binary), it can take any value in the interval $[0, 1]$.</li> </ul> </li> <li>A typical RBF feature, $x_i$, has a Gaussian (bell-shaped) response $x_i(s)$ dependent only on the distance between the state $s$ and the feature’s center state $c_i$, and relative to the feature’s width $\sigma_i$:</li> </ul> \[\boxed{x_i(s) \doteq \exp\!\left(-\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2}\right)}\] <ul> <li>The figure below shows a 1-D example with a Euclidean distance metric.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch09-9-5-5-1dim-rbf.png" alt="1D RBF" /></p> <!-- >**1-D RBF**: One-dimensional radial basis function --> <div class="callout callout--note"> <div class="callout__title"> <strong>1-D RBF</strong> </div> <div class="callout__body"> <p>A one-dimensional example with a Euclidean distance metric.</p> </div> </div> <ul> <li>The primary advantage of RBFs over binary features is that they produce approximate functions that vary smoothly and are differentiable.</li> <li>An <strong>RBF network</strong> is a linear function approximator using RBFs for its features.</li> <li>Some learning methods for RBF networks change the features’ centers and widths, making them nonlinear function 
approximators.</li> <li>Nonlinear methods are much more precise in fitting the target functions, but (nonlinear) RBF networks are <strong><em>much more computationally complex</em></strong> and require more manual tuning before learning.</li> </ul> <hr /> <h2 id="96-selecting-step-size-parameters-manually">9.6 Selecting Step-Size Parameters Manually</h2> <ul> <li>What is the best way to select $\alpha$ for function approximation?</li> <li>So far, <ul> <li>The theory of stochastic approximation gives us conditions on a slowly decreasing step-size sequence that are sufficient to guarantee convergence, but these tend to result in learning that is <strong>too slow.</strong></li> <li>The sample-average classical method $\alpha_t = \frac{1}{t}$ is <strong>not appropriate</strong> for TD methods, for nonstationary problems, or for any function approximation method.</li> <li>For linear methods, there are recursive least-square methods that set an optimal matrix step size which can be extended to LSTD methods (seen in Section 9.8), but these require $O(d^2)$ step-size parameters, or $d$ times more parameters than we are learning. 
Therefore, it is inappropriate for function approximation.</li> </ul> </li> <li>A good rule of thumb for setting the step-size parameter of linear SGD models is:</li> </ul> \[\boxed{\alpha \doteq \left(\tau\, \mathbb{E}\!\left[\mathbf{x}^T \mathbf{x}\right]\right)^{-1}}\] \[\begin{aligned} \text{where} \\ \mathbf{x} &amp;= \text{random feature vector chosen from the same distribution as input vectors in the SGD} \\ \tau &amp;= \text{number of experiences within which learning converges} \end{aligned}\] <ul> <li>This rule of thumb works best if the feature vectors do not vary greatly in length; ideally $\mathbf{x}^T \mathbf{x}$ is a constant.</li> </ul> <hr /> <h2 id="97-nonlinear-function-approximation-artificial-neural-networks-anns">9.7 Nonlinear Function Approximation: Artificial Neural Networks (ANNs)</h2> <ul> <li>ANNs are widely used for nonlinear function approximation.</li> <li>An ANN is a network of interconnected units that have some of the properties of neurons, the main components of nervous systems.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch09-9-7-ann.png" alt="Generic ANN" /></p> <!-- >**ANN**: A generic feedforward ANN with 4 input units, 2 output units, and 2 hidden layers. 
--> <div class="callout callout--note"> <div class="callout__title"> <strong>ANN</strong> </div> <div class="callout__body"> <p>A generic feedforward Artificial Neural Network with 4 input units, 2 output units, and 2 hidden layers.</p> </div> </div> <ul> <li>The units (circles in the figure above) compute a weighted sum of their input signals, and then apply a nonlinear function, called the <strong>activation function</strong>, to the result.</li> <li> <p>Some activation functions include: \(\begin{aligned} \text{Sigmoid:} \quad &amp; f(x) = \frac{1}{1 + e^{-x}} \\ \\ \text{ReLU:} \quad &amp; f(x) = \max(0, x) \\ \\ \text{Binary step:} \quad &amp; f(x) = \left\{ \begin{array}{ll} 1 &amp; \text{if } x \geq 0 \\ 0 &amp; \text{if } x &lt; 0 \end{array} \right\} \end{aligned}\)</p> </li> <li>ANNs can use TD errors to learn value functions, or they can aim to maximize expected reward as in a gradient bandit (Section 2.8) or a policy gradient algorithm (Chapter 13).</li> <li>Overfitting in ANNs is a problem for function approximation that can be reduced through the <strong>dropout</strong> method [Srivastava, Hinton, Krizhevsky, Sutskever &amp; Salakhutdinov, 2014].</li> <li><strong>Deep belief networks</strong> were a major step toward solving the problem of training the deep layers of an ANN [Hinton, Osindero &amp; Teh, 2006].</li> <li><strong>Batch normalization</strong> makes it easier to train deep ANNs by normalizing the output of deep layers before they feed into the following layer. 
This speeds up the learning of deep ANNs [Ioffe &amp; Szegedy, 2015].</li> <li><strong>Deep residual learning</strong> is another technique useful for training deep ANNs [He, Zhang, Ren &amp; Sun, 2016].</li> <li><strong>Deep Convolutional Networks</strong> have proven to be very successful in impressive RL applications [LeCun, Bottou, Bengio &amp; Haffner, 1998].</li> <li>In summary, advances in the design and training of ANNs, of which we have only mentioned a few, all contribute to RL.</li> <li>Although current RL theory is mostly limited to methods using tabular representations or linear function approximation, the impressive performances of notable RL applications owe much of their success to nonlinear function approximation by multi-layer ANNs.</li> </ul> <hr /> <h2 id="98-least-squares-td-lstd">9.8 Least-Squares TD (LSTD)</h2> <ul> <li>As established earlier, TD(0) with a linear function approximator converges asymptotically (for appropriately decreasing step sizes) to the TD fixed point:</li> </ul> \[\boxed{\mathbf{w}_\text{TD} = \mathbf{A}^{-1}\mathbf{b}}\] \[\begin{aligned} \text{where} \\ \mathbf{A} &amp;\doteq \mathbb{E}\!\left[\mathbf{x}_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right)^T\right] \\ \mathbf{b} &amp;\doteq \mathbb{E}\!\left[R_{t+1}\, \mathbf{x}_t\right] \end{aligned}\] <ul> <li>Instead of computing $\mathbf{w}_\text{TD}$ iteratively, let’s compute estimates of $\mathbf{A}$ and $\mathbf{b}$ and then directly compute the TD fixed point. This is exactly what <strong>Least-Squares TD (LSTD)</strong> does.
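In code, the whole procedure is short. A minimal batch-style NumPy sketch (my own illustration, not the book's pseudocode) that accumulates the two estimates defined in this section and solves directly for the fixed point:

```python
import numpy as np

def lstd(transitions, d, gamma=0.9, eps=1e-3):
    """Least-Squares TD: estimate w = A_hat^{-1} b_hat from a batch of transitions.

    transitions: iterable of (x, r, x_next) with feature vectors x, x_next in R^d.
    The eps * I term keeps A_hat invertible before enough data has accumulated.
    """
    A_hat = eps * np.eye(d)
    b_hat = np.zeros(d)
    for x, r, x_next in transitions:
        A_hat += np.outer(x, x - gamma * x_next)
        b_hat += r * x
    return np.linalg.solve(A_hat, b_hat)  # estimate of w_TD
```

An incremental O(d²)-per-step version would instead maintain the inverse of A_hat directly via the Sherman-Morrison update given below.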
It forms the natural estimates:</li> </ul> \[\mathbf{\hat{A}_t} \doteq \sum_{k=0}^{t-1} \mathbf{x_k}\!\left(\mathbf{x}_k - \gamma \mathbf{x}_{k+1}\right)^T + \varepsilon \mathbf{I}\] \[\mathbf{\hat{b}_t} \doteq \sum_{k=0}^{t-1} R_{k+1}\, \mathbf{x}_k\] \[\begin{aligned} \text{where} \\ \mathbf{I} &amp;= \text{identity matrix, and} \\ \varepsilon \mathbf{I}, &amp;\ \text{for some small } \varepsilon &gt; 0 \text{, ensures that } \mathbf{\hat{A}_t} \text{ is always invertible} \end{aligned}\] <ul> <li>The LSTD estimated TD fixed point is:</li> </ul> \[\boxed{\mathbf{w}_t \doteq \mathbf{\hat{A}_t}^{-1} \mathbf{\hat{b}_t}}\] <h3 id="computational-complexity">Computational Complexity</h3> <ul> <li>Computing the inverse of $\mathbf{\hat{A}_t}$ costs $O(d^3)$, but this can be reduced to $O(d^2)$ using the <strong>Sherman-Morrison</strong> formula:</li> </ul> \[\mathbf{\hat{A}_t^{-1}} = \left(\mathbf{\hat{A}_{t-1}} + \mathbf{x_{t-1}}\!\left(\mathbf{x_{t-1}} - \gamma \mathbf{x}_t\right)^T\right)^{-1}\] \[\boxed{\mathbf{\hat{A}_t^{-1}} = \mathbf{\hat{A}_{t-1}^{-1}} - \frac{\mathbf{\hat{A}_{t-1}^{-1}} \mathbf{x}_{t-1}\!\left(\mathbf{x}_{t-1} - \gamma \mathbf{x}_t\right)^T \mathbf{\hat{A}_{t-1}^{-1}}}{1 + \left(\mathbf{x}_{t-1} - \gamma \mathbf{x}_t\right)^T \mathbf{\hat{A}_{t-1}^{-1}} \mathbf{x}_{t-1}}, \quad \text{for } t &gt; 0, \text{ with } \mathbf{\hat{A}_0} \doteq \varepsilon \mathbf{I}}\] <ul> <li>LSTD does <strong>not</strong> require a step-size parameter, but as a consequence it never forgets, which is sometimes desirable but often not in RL problems where both the policy and the environment change over time.</li> <li>However, in control applications, LSTD typically has to be combined with some other mechanism to induce forgetting, negating its advantage of not requiring a step-size parameter.</li> </ul> <hr /> <h2 id="99-memory-based-function-approximation">9.9 Memory-based Function Approximation</h2> <ul> <li>So far we have been parametrising functions that approximate our
value function, and these parameters are updated as we see more data. This is called the <strong>parametric</strong> approach.</li> <li>Memory-based function approximators store training examples in memory as they arrive without updating any parameters, and retrieve them from memory whenever a state’s value estimate is queried.</li> <li>Memory-based function approximators are examples of <strong>non-parametric</strong> methods.</li> <li>Memory-based function approximation is sometimes called <strong>lazy learning</strong> because processing training examples is postponed until the system is queried to provide an output.</li> <li>Non-parametric methods can produce increasingly accurate approximations of any target function as the number of training examples accumulating in memory grows.</li> <li>There are different memory-based methods, but let’s focus on <strong>local learning</strong>: <ul> <li>Local learning methods approximate a value function only locally in the neighborhood of the current query state, by retrieving states in memory via a distance metric between the query state and training example states.</li> <li>Local learning methods discard the local approximation after the query state is assigned a value.</li> </ul> </li> <li>Examples of local learning methods: <ul> <li><strong>Nearest Neighbor</strong>: retrieve from memory the closest state to the query state and return that example’s value as the local approximation of the query state.</li> <li><strong>Weighted Average</strong>: retrieve a set of nearest neighbor examples and return a weighted average of their target values (with weights determined by some distance metric).</li> <li><strong>Locally Weighted Regression</strong>: similar to weighted average, but fits a surface to the values of a set of nearest states based on a parametric approximation method.</li> </ul> </li> <li>Pros of non-parametric, memory-based methods: <ul> <li>Do not limit approximations to pre-specified functional forms.</li> <li>The
more data accumulated, the better the accuracy.</li> <li>Allow for relatively immediate effect on value estimates in the neighborhood of the current state.</li> <li>Handle/address the curse of dimensionality, which is a big problem for global approximation. For example, for a state space with $K$ dimensions, <ul> <li>A <strong>tabular method</strong> storing a global approximation requires memory <strong><em>exponential in $K$,</em></strong> while</li> <li>Storing examples in a <strong>memory-based method</strong> requires only memory <strong><em>proportional to $K$, or linear in the number of examples $n$.</em></strong></li> </ul> </li> </ul> </li> </ul> <hr /> <h2 id="910-kernel-based-function-approximation">9.10 Kernel-based Function Approximation</h2> <ul> <li>Weighted average and locally weighted regression depend on assigning weights based on some distance metric between examples $s’$ and a query state $s$.</li> <li>The function that assigns these weights is called a <strong>kernel function</strong> or simply a <strong>kernel</strong>.</li> <li>A kernel function $k : \mathbb{R} \to \mathbb{R}$ assigns weights to distances between states, but more generally weights do not have to depend on distances; they can also depend on some similarity measure:</li> </ul> \[k : S \times S \to \mathbb{R}\] <ul> <li>$k(s, s’)$ is a measure of the strength of generalization from $s’$ to $s$.</li> <li>Kernel functions numerically express <strong>how relevant</strong> knowledge about any state is to any other state.</li> <li><strong>Kernel regression</strong> is the memory-based method that computes a kernel weighted average of the targets of <strong>all examples</strong> stored in memory:</li> </ul> \[\boxed{\hat{v}(s, D) = \sum_{s' \in D} k(s, s')\, g(s')}\] \[\begin{aligned} \text{where} \\ \hat{v}(s, D) &amp;= \text{value function approximation for query state } s \text{ over stored examples } D \\ D &amp;= \text{set of stored examples} \\ g(s') &amp;= \text{target for 
state } s' \text{ in a stored example} \end{aligned}\] <ul> <li>A common kernel is the Gaussian Radial Basis Function (RBF) discussed earlier in Section 9.5.5.</li> <li>Kernel regression with RBFs differs from a linear parametric method with RBF features in 2 ways: <ul> <li>It is <strong>memory-based</strong> $\Rightarrow$ RBFs are centered on the states of the stored examples.</li> <li>It is <strong>non-parametric</strong> $\Rightarrow$ no parameters to learn.</li> </ul> </li> <li>We can recast any linear parametric regression method with feature vectors $\mathbf{x}(s) = (x_1(s), x_2(s), x_3(s), \ldots, x_d(s))^T$ into a kernel regression as:</li> </ul> \[k(s, s') = \mathbf{x}(s)^T \mathbf{x}(s')\] <ul> <li>Kernel methods allow us to work in high-dimensional feature spaces using only stored examples, without ever constructing an explicit parametric model. This is the so-called <strong>kernel trick</strong> that forms the basis for many machine learning methods.</li> </ul> <hr /> <h2 id="911-looking-deeper-at-on-policy-learning-interest--emphasis">9.11 Looking Deeper at On-Policy Learning: Interest &amp; Emphasis</h2> <ul> <li>So far we have treated all encountered states with equal importance; however, we often have more interest in some states than in others.</li> <li>For this scenario, we introduce 2 new concepts: <strong>Interest</strong> and <strong>Emphasis</strong>.</li> </ul> <h3 id="interest">Interest</h3> <ul> <li>Interest is the degree to which we value an accurate estimate of a given state’s value.
How interested are we in the accurate valuation of a given state at time $t$?</li> </ul> \[\mathcal{I}_t \in [0, 1]\] <h3 id="emphasis">Emphasis</h3> <ul> <li>Emphasis is a non-negative scalar random variable that multiplies the learning update and thus emphasizes or de-emphasizes the learning done at time $t$.</li> <li>The general $n$-step learning update is:</li> </ul> \[\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha M_t\!\left[G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t &lt; T\] \[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad 0 \leq t \leq T-n\] <ul> <li>The emphasis is determined recursively from the interest:</li> </ul> \[\boxed{M_t = \mathcal{I}_t + \gamma^n M_{t-n}, \quad 0 \leq t &lt; T}\] \[\text{with } M_t \doteq 0, \quad \forall t &lt; 0\] <hr /> <h2 id="912-summary">9.12 Summary</h2> <ul> <li><strong>Generalization</strong> is a must for RL systems intended for AI applications, and <strong>supervised-learning function approximation</strong> helps achieve it.</li> <li>The prediction objective $\overline{\text{VE}}(\mathbf{w})$, called the <strong>mean squared value error,</strong> gives us a clear way to rank different value-function approximations in the on-policy case.</li> <li>Most techniques use stochastic gradient descent (SGD) to find the set of weight parameters that minimize $\overline{\text{VE}}(\mathbf{w})$.</li> <li>Linear methods converge to the global optimum under certain conditions.</li> <li>Features constructed for linear function approximation include: <strong>polynomials, Fourier bases, coarse codings, tile codings, and radial basis functions (RBFs).</strong></li> <li>Much of the success of notable RL applications can be attributed to multi-layer ANNs as nonlinear function approximators.</li> <li>Non-parametric models help us <strong>avoid the curse of dimensionality.</strong></li>
<li><strong>Interest and Emphasis</strong> enable us to focus the function approximation on the states we’re more interested in.</li> </ul> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026RLsuttonBartoCh09notes</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Sutton &amp; Barto, Ch. 09: On-Policy Prediction with Approximation (Personal Notes)"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Feb"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/02/27/rl-sutton-barto-notes-ch009/"</span> <span class="p">}</span> </code></pre></div></div> <hr /> Fri, 27 Feb 2026 00:00:00 +0000 https://chizkidd.github.io//2026/02/27/rl-sutton-barto-notes-ch009/ Sutton & Barto, Ch. 08: Planning & Learning with Tabular Methods (Personal Notes) <ul> <li><strong>Model-Based RL methods</strong> require a model of the environment and rely on <strong>planning</strong> as their primary component. <ul> <li>Dynamic Programming (DP), Heuristic Search</li> </ul> </li> <li><strong>Model-Free RL methods</strong> don’t require a model of the environment and primarily rely on <strong>learning</strong>.
<ul> <li>Monte Carlo (MC), Temporal-Difference (TD)</li> </ul> </li> <li>Both methods still use value functions and both make backups to state values based on future returns.</li> </ul> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ul> <li><a href="#81-models--planning">8.1 Models &amp; Planning</a></li> <li><a href="#82-dyna-integrated-planning-acting-and-learning">8.2 Dyna: Integrated Planning, Acting, and Learning</a></li> <li><a href="#83-when-the-model-is-wrong">8.3 When the Model is Wrong</a></li> <li><a href="#84-prioritized-sweeping">8.4 Prioritized Sweeping</a></li> <li><a href="#85-expected-vs-sample-updates">8.5 Expected vs Sample Updates</a></li> <li><a href="#86-trajectory-sampling">8.6 Trajectory Sampling</a></li> <li><a href="#87-real-time-dynamic-programming-rtdp">8.7 Real-Time Dynamic Programming (RTDP)</a></li> <li><a href="#88-planning-at-decision-time">8.8 Planning at Decision Time</a></li> <li><a href="#89-heuristic-search">8.9 Heuristic Search</a></li> <li><a href="#810-rollout-algorithms">8.10 Rollout Algorithms</a></li> <li><a href="#811-monte-carlo-tree-search-mcts">8.11 Monte Carlo Tree Search (MCTS)</a></li> <li><a href="#812-summary">8.12 Summary</a></li> <li><a href="#813-summary-of-part-i-dimensions">8.13 Summary of Part I: Dimensions</a></li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li><a href="#citation">Citation</a></li> </ul> <hr /> <h2 id="81-models--planning">8.1 Models &amp; Planning</h2> <ul> <li>A <strong>model</strong> is anything the agent can use to predict the environment’s behavior. <ul> <li>Given a state and an action, a model predicts the next state and the next reward.</li> <li>If the model is stochastic, then there are several possible next states and next rewards. 
The model produces: <ul> <li>All the possibilities and probabilities; these are <strong>distribution models</strong>.</li> <li>One of the possibilities, sampled according to the probabilities; these are <strong>sample models</strong>.</li> </ul> </li> </ul> </li> <li> <p>Models are used to simulate the environment and thus produce <strong>simulated experiences</strong>.</p> </li> <li><strong>Planning</strong> refers to any computational process that takes a model as input and produces or improves a policy for interacting with the modelled environment.</li> </ul> \[\text{model} \xrightarrow{\text{Planning}} \text{policy}\] <ul> <li>There are 2 distinct approaches to planning: <ul> <li><strong>State-space planning</strong>: search through the state space for an optimal policy. Actions cause transitions from state to state, and value functions are computed over states.</li> <li><strong>Plan-space planning</strong>: search through the space of plans. Operators transform one plan into another, and value functions, if any, are defined over the space of plans. e.g. Evolutionary methods, partial-order planning.</li> </ul> </li> <li><strong>State-space planning common structure:</strong></li> </ul> \[\text{model} \to \text{simulated experience} \xrightarrow{\text{backups}} \text{values} \to \text{policy}\] <ul> <li>All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy.</li> <li> <p>They compute value functions by updates or backup operations applied to simulated experience.</p> </li> <li>Dynamic programming methods make sweeps through the space of states, generating for each state the distribution of possible transitions. 
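To make the distribution-model vs sample-model distinction above concrete, here is a toy sketch (my own example; the 80/20 dynamics are hypothetical):

```python
import random

def distribution_model(state, action):
    """Distribution model: every (next_state, reward) outcome with its probability."""
    # toy dynamics (hypothetical): the move succeeds 80% of the time
    return [((state + 1, 1.0), 0.8), ((state, 0.0), 0.2)]

def sample_model(state, action):
    """Sample model: one (next_state, reward), drawn according to those probabilities."""
    outcomes, probs = zip(*distribution_model(state, action))
    return random.choices(outcomes, weights=probs, k=1)[0]
```

A distribution model supports expected updates (as in DP), while a sample model supports only sample updates.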
Each distribution is then used to compute a backed-up value (update target) and update the state’s estimated value.</li> <li>At the heart of both learning and planning is the estimation of value functions by backing-up update operations.</li> <li>The difference however is that: <ul> <li>Planning uses <strong>simulated</strong> experience generated by a model.</li> <li>Learning uses <strong>real</strong> experience generated by the environment.</li> </ul> </li> <li>Despite the differences, a learning algorithm can be substituted for the key update step of a planning method, because it applies just as well to simulated experience.</li> </ul> <h3 id="random-sample-one-step-tabular-q-learning">Random-sample One-step Tabular Q-Learning</h3> <ul> <li>A planning method based on one-step tabular Q-learning and on random samples from a sample model.</li> <li>It converges to the optimal policy for the model under the same conditions as that for the real environment.</li> </ul> <h3 id="pseudocode-random-sample-one-step-tabular-q-learning">Pseudocode: Random-sample One-step Tabular Q-Learning</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loop forever: 1. Select a state, S ∈ S, and an action, A ∈ A(S), at random 2. Send S, A to a sample model, and obtain a sample next reward, R, and a sample next state, S' 3. 
Apply one-step tabular Q-learning to S, A, R, S': Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)] </code></pre></div></div> <ul> <li>Planning in very small, incremental steps may be the most <strong>efficient</strong> approach especially in large scale problems.</li> </ul> <hr /> <h2 id="82-dyna-integrated-planning-acting-and-learning">8.2 Dyna: Integrated Planning, Acting, and Learning</h2> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-2-title-border.png" alt="Dyna" /></p> <blockquote> <p><strong>Dyna-Q interactions and General Dyna architecture</strong></p> </blockquote> <ul> <li>When planning is done online, while interacting with the environment, a number of interesting issues arise: <ul> <li>New information gained from the interaction may change the model (and thus the planning).</li> <li>How do we divide the computational resources available between decision making and model learning?</li> </ul> </li> <li> <p><strong>Dyna-Q</strong>: a simple architecture integrating the major functions needed in an online planning agent.</p> </li> <li>Within a planning agent, there are at least 2 roles for real experience: <ul> <li><strong>Model learning</strong>: to improve the model (to make it more accurately match the real environment).</li> <li><strong>Direct RL</strong>: to directly improve the value function and policy.</li> <li><strong>Indirect RL</strong>: to indirectly improve the value functions and policies via the model (planning).</li> </ul> </li> <li>Both direct and indirect methods have advantages and disadvantages: <ul> <li>Indirect methods often make fuller use of a limited amount of experience.</li> <li>Direct methods are much simpler and are not affected by biases in the design of the model.</li> </ul> </li> <li> <p><strong>Dyna-Q includes all of the RL processes in the interactions diagram shown above occurring continuously: planning, acting, model-learning, and direct RL.</strong></p> </li> <li>The planning method is the random-sample, 
one-step tabular Q-planning.</li> <li>The model-learning method is also table-based and assumes the environment is deterministic. <ul> <li>After each transition $S_t, A_t \to R_{t+1}, S_{t+1}$, the model records in its table entry for $S_t, A_t$ the prediction that $R_{t+1}, S_{t+1}$ will deterministically follow.</li> <li>Thus, if the model is queried with a state-action pair that has been experienced before, it simply returns the last $S_{t+1}, R_{t+1}$ experienced as its prediction.</li> <li>During planning, the Q-learning algorithm randomly samples only from state-action pairs that have previously been experienced.</li> </ul> </li> <li>Based on the overall architecture of Dyna agents: <ul> <li><strong>Search control</strong>: the process that selects the starting states &amp; actions for the simulated experiences.</li> <li>Planning is achieved by applying RL methods to the simulated experiences just as if they had really happened.</li> </ul> </li> <li>Typically, as in Dyna-Q, the same RL method is used both for learning from real experience and for planning from simulated experience.</li> <li>Learning and planning share almost all the <strong>same</strong> machinery, differing only in the source of their experience.</li> <li>Conceptually, planning, acting, direct RL, and model learning occur simultaneously and in parallel in Dyna agents. 
For computational concreteness and implementation, we specify the order of occurrence within a time step: <ul> <li>Acting, model-learning and direct RL processes require little time.</li> <li>Planning takes the remaining time in each step because it is inherently computationally intensive.</li> </ul> </li> </ul> <h3 id="tabular-dyna-q-pseudocode">Tabular Dyna-Q Pseudocode</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Initialize Q(s,a) and Model(s,a) for all s ∈ S, a ∈ A(s) Loop forever: (a) S ← current (non-terminal) state (b) A ← ε-greedy(S, Q) (c) Take action A; observe resultant reward, R, and state, S' (d) Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)] (e) Model(S,A) ← R, S' (assuming deterministic environment) (f) Loop repeat n times: S ← random previously observed state A ← random action previously taken in S R, S' ← Model(S, A) Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)] </code></pre></div></div> <ul> <li>In the pseudocode algorithm for Dyna-Q above: <ul> <li>$\text{Model}(S,a)$ denotes the contents of the model (predicted $S_{t+1}$ &amp; $R_{t+1}$) for state-action pair.</li> <li>Direct RL, model-learning and planning are steps <strong>(d), (e)</strong> and <strong>(f)</strong> respectively.</li> <li>If <strong>(e)</strong> and <strong>(f)</strong> were omitted, the algorithm becomes one-step tabular Q-learning.</li> </ul> </li> <li>A simple maze example shows that adding planning ($n &gt; 0$) dramatically improves the agent’s behavior.</li> <li>Due to the incremental nature of planning, it is trivial to intermix planning &amp; learning. Both proceed quickly. 
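The Tabular Dyna-Q pseudocode above translates almost line-for-line into Python. A minimal sketch (my own implementation, not from the book; the `env` object with `reset()`, `step(s, a)`, and `actions` is an assumed interface, and the hyperparameters are illustrative):

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=50, n=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL (step d), model learning (e), planning (f)."""
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0
    model = {}              # model[(state, action)] -> (reward, next_state)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # (b) epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            # (c) take action in the real environment
            s2, r, done = env.step(s, a)
            # (d) direct RL: one-step tabular Q-learning
            # (terminal states are never acted from, so their Q stays 0 and
            #  bootstrapping through them contributes nothing)
            Q[(s, a)] += alpha * (
                r + gamma * max(Q[(s2, act)] for act in env.actions) - Q[(s, a)])
            # (e) model learning (deterministic environment assumed)
            model[(s, a)] = (r, s2)
            # (f) planning: n simulated one-step Q-learning updates
            for _ in range(n):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (
                    pr + gamma * max(Q[(ps2, act)] for act in env.actions) - Q[(ps, pa)])
            s = s2
    return Q
```

Dropping steps (e) and (f), i.e. setting `n = 0` and skipping the model, recovers plain one-step tabular Q-learning, as noted above.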
<ul> <li>The agent is always reactive and deliberative, and yet always planning and model-learning in the background.</li> </ul> </li> </ul> <hr /> <h2 id="83-when-the-model-is-wrong">8.3 When the Model is Wrong</h2> <ul> <li>Models may be incorrect because: <ul> <li>The environment is <strong>stochastic</strong> and only a limited number of samples have been observed.</li> <li>The model was learned using function approximation that has <strong>generalized imperfectly.</strong></li> <li>The environment has <strong>changed</strong> and its new behavior has not yet been observed.</li> </ul> </li> <li>In some scenarios, the suboptimal policy computed by planning quickly leads to the discovery and correction of the modeling error. <ul> <li>This tends to happen when the model aims to predict a greater reward than is possible.</li> <li>The planned policy attempts to exploit these opportunities and discovers that they do not exist.</li> </ul> </li> <li>Issues arise when the environment changes to become better than it was before, and yet the formerly correct policy does not reveal the improvement. 
<ul> <li>In these cases the current optimal policy remains unchanged, but an even better policy exists that the agent may never discover, because it has no reason to doubt its previously learned optimal policy.</li> <li>To address this <strong>exploration/exploitation conflict</strong>, there is no solution that is both perfect and practical, but there are simple heuristics that can be effective.</li> <li>One proposed algorithm that used one such heuristic was the <strong>Dyna-Q+</strong> agent.</li> </ul> </li> </ul> <h3 id="dyna-q">Dyna-Q+</h3> <ul> <li>The agent keeps track of the number of time steps $\tau$ that have elapsed since each state-action pair was last selected; if sufficient time has elapsed, it is presumed that the dynamics of the environment from that state have <strong>changed</strong> and that the model of it is incorrect.</li> <li>A special bonus reward ($k\sqrt{\tau}$) is added to simulated experience involving these actions to encourage <strong>exploratory behavior</strong> towards long-untried actions.</li> <li>The modelled reward for each state-action pair now becomes $r + k\sqrt{\tau}$ for some small $k$. <ul> <li>$r$ is the initial unchanged model reward without any heuristic applied.</li> </ul> </li> <li>This encourages the agent to keep testing all accessible state transitions, and even though this may be computationally costly, it’s <strong>well worth it.</strong></li> </ul> <hr /> <h2 id="84-prioritized-sweeping">8.4 Prioritized Sweeping</h2> <ul> <li>In Dyna agents, simulated transitions are started in state-action pairs selected uniformly at random from all previously experienced state-action pairs.
<ul> <li>Uniform selection is usually not the best; focusing on particular state-action pairs can be much more efficient for planning.</li> </ul> </li> <li> <p><strong>Prioritized sweeping</strong> optimizes Dyna-style planning by selectively updating state-action pairs based on the expected magnitude of their value change, rather than by uniform random selection.</p> <ul> <li><u>Steps:</u> <ol> <li>Keep a priority queue of which state-action pairs need updating most.</li> <li>Update the ones with the biggest potential changes first.</li> <li>Work backwards from important states (like the goal).</li> </ol> </li> <li><u>Mechanism:</u> <ol> <li>Maintain a priority queue of $(s,a)$ pairs ranked by Bellman error magnitude.</li> <li>Propagate updates <strong>backward</strong> from states with changed values.</li> <li>Queue predecessors weighted by: $\vert R + \gamma \max_{a'} Q(s', a') - Q(s,a) \vert$</li> </ol> </li> <li><u>Key advantages:</u> <ol> <li><strong>Efficiency</strong>: avoid wasteful updates (such as $0 \to 0$ reward transitions).</li> <li><strong>Convergence speed</strong>: dramatic empirical improvements.</li> <li><strong>Backward focusing</strong>: value propagation follows the reverse trajectory from changed states.</li> </ol> </li> </ul> </li> <li>Extensions of prioritized sweeping to <strong>stochastic environments</strong> are straightforward: <ul> <li><strong>Expected updates</strong>: enumerate all $s'$ with their transition probabilities, which is comprehensive but computationally expensive on low-probability transitions.</li> <li><strong>Sample updates</strong>: cheaper per update; their finer granularity enables selective focus on high-impact transitions.</li> <li>Essentially, when outcomes are random, one can either update based on all possibilities (slow but thorough) or sample specific outcomes (faster, focuses effort).</li> </ul> </li> <li>All planning entails sequences of value updates varying in: <ul> <li><u>Update type</u> $\Rightarrow$ expected/sample, full/partial
backup.</li> <li><u>Update ordering</u> $\Rightarrow$ backward/forward focusing, prioritization heuristic.</li> </ul> </li> <li>Forward focusing prioritizes states by reachability under current policy rather than backward value propagation.</li> </ul> <hr /> <h2 id="85-expected-vs-sample-updates">8.5 Expected vs Sample Updates</h2> <ul> <li>So far in the book we’ve discussed dynamic programming (DP) as a way of conducting policy evaluation and policy improvement given a distribution model of the environment.</li> <li>We’ve also discussed sampling methods like Monte Carlo (MC), temporal-difference (TD), and $n$-step bootstrapping to estimate value functions in the absence of a model.</li> <li>Given a fixed computational budget, are expected or sample updates more efficient for planning?</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-5-one-step.png" alt="One-Step Update backup diagrams" /></p> <blockquote> <p><strong>Backup diagrams for one-step updates</strong>: showing 7 out of 8 possible cases for one-step updates (sample and expected) for state-value and action-value functions and their optimal versions too.</p> </blockquote> <table> <thead> <tr> <th>Value estimated</th> <th>Expected updates (DP)</th> <th>Sample updates (one-step TD)</th> </tr> </thead> <tbody> <tr> <td>$v_\pi(s)$</td> <td>Policy evaluation<br />(full branching over actions &amp; next states)</td> <td>TD(0)<br />(single sampled transition)</td> </tr> <tr> <td>$v_*(s)$</td> <td>Value iteration (max over actions, full branching)</td> <td>—</td> </tr> <tr> <td>$q_\pi(s,a)$</td> <td>$q$-policy evaluation</td> <td>Sarsa</td> </tr> <tr> <td>$q_*(s,a)$</td> <td>$q$-value iteration</td> <td>Q-learning</td> </tr> </tbody> </table> <ul> <li>We have considered many value function updates. 
If we focus on one-step updates, they vary along 3 binary dimensions: <ul> <li>Whether they update state values or action values.</li> <li>Whether they estimate the value for the optimal policy or for an arbitrary given policy.</li> <li>Whether the updates are <strong>expected updates</strong> (consider all possible events that might happen) or <strong>sample updates</strong> (consider a single sample of what might happen).</li> </ul> </li> <li> <p>These 3 binary dimensions give rise to 8 cases, 7 of which are shown in the figure above. The 8th case does not seem to correspond to any useful update.</p> </li> <li>Any of these one-step updates can be used in planning methods: <ul> <li><strong>Dyna-Q</strong> uses $q_*$ sample updates, but could also use $q_*$ expected updates, or either expected or sample $q_\pi$ updates.</li> <li><strong>Dyna-AC</strong> uses $v_\pi$ sample updates together with a learning policy structure.</li> <li>For stochastic problems, prioritized sweeping is always done using one of the expected updates.</li> </ul> </li> <li>In the absence of a distribution model, expected updates are impossible, but sample updates can still be done. <ul> <li>So which is better? Expectation or Sampling?
<ul> <li>Expected updates yield a better estimate because they are uncorrupted by sampling error.</li> <li>Expected updates require more computation, and computation is often the limiting resource in planning.</li> </ul> </li> </ul> </li> </ul> <h3 id="computational-requirements--formal-comparison-given-discrete-statesactions">Computational Requirements &amp; Formal Comparison (given discrete states/actions)</h3> <ul> <li><u>Model:</u> $\hat{p}(s', r \vert s, a)$ known.</li> <li><u>Goal:</u> Approximate $q_{*}$ (optimal action values).</li> <li> <p><u>Branching factor:</u> $b = \vert \{ s' : p(s' \vert s,a) &gt; 0 \} \vert$ (effective stochasticity).</p> </li> <li><strong>Expected Update (exact):</strong> <ul> <li><u>Computational complexity:</u> $O(b)$.</li> </ul> </li> </ul> \[\boxed{Q(s,a) \leftarrow \sum_{s', r} \hat{p}(s', r | s, a)\left[r + \gamma \max_{a'} Q(s', a')\right]}\] <ul> <li><strong>Sample Update (stochastic):</strong> <ul> <li><u>Computational complexity:</u> $O(1)$.</li> </ul> </li> </ul> \[\boxed{Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma \max_{a'} Q(S', a') - Q(s,a)\right]}\] <ul> <li>In the time budget required for 1 expected update, you can instead perform $b$ sample updates.</li> </ul> <h3 id="theoretical-comparison-empirical-analysis">Theoretical Comparison (Empirical Analysis)</h3> <ul> <li>Assume all $b$ successors are equiprobable, and initial $\vert \text{error} \vert = 1$ at $(s,a)$; successor values are assumed already correct.</li> <li><u>Expected update:</u> error $= 0$ after one update (cost: $\sim b$ units).</li> <li><u>Sample updates (assuming sample averages, i.e. $\alpha = \frac{1}{t}$):</u> error $\approx \sqrt{\frac{b-1}{bt}}$. <ul> <li> <p>For moderate $b$ (e.g.
$b = 10$) and large $b$, the error falls dramatically with only a tiny fraction of $b$ sample updates.</p> </li> <li> <p>For large $b$, the error drops sharply over the early sample updates, allowing broad updates across many $(s,a)$ pairs in the same time as one expected update:</p> \[\text{error} = \sqrt{\frac{b-1}{bt}}\] </li> <li> <p>For large $b$: $\quad \text{error} \approx \sqrt{\frac{1}{t}}$, and $\lim_{t \to \infty} \sqrt{\frac{1}{t}} = 0$.</p> \[\begin{aligned} b=1:&amp; \quad \text{error} = 0 \quad \text{for } t \geq 1 \\[6pt] b=2:&amp; \quad \text{error} = \frac{1}{\sqrt{2t}} \implies \text{error}(t=1) \approx 0.707,\ \text{error}(t=2) = 0.5 \\[6pt] b=100:&amp; \quad \text{error} \approx \frac{1}{\sqrt{t}} \implies \text{error}(t=1) \approx 0.995,\ \text{error}(t=10) \approx 0.316,\ \text{error}(t=100) \approx 0.1 \end{aligned}\] </li> </ul> </li> <li>Pros of sample updates: <ul> <li><strong>Breadth vs depth</strong>: cover more of the state space per unit of computation.</li> <li><strong>Bootstrap benefits</strong>: earlier updates improve successor value estimates for subsequent backups.</li> <li><strong>Diminishing returns</strong>: the marginal value of incorporating low-probability branches is low.</li> </ul> </li> <li>Sample updates dominate in large-scale stochastic domains where exhaustive sweeping is intractable.</li> <li>Expected updates are preferable only when: <ul> <li>Small branching factor ($b \leq 3$).</li> <li>Small state space (exact solution feasible).</li> <li>Deterministic dynamics ($b = 1$, where the two updates are equivalent).</li> </ul> </li> </ul> <hr /> <h2 id="86-trajectory-sampling">8.6 Trajectory Sampling</h2> <ul> <li>Let’s compare two ways of distributing updates: <ul> <li><strong>Exhaustive Sweeps</strong>: classical DP approach that performs sweeps through the entire state (or state-action) space, updating each state (or state-action pair) once per sweep.
Computationally inefficient, especially on large tasks (there may be no time to complete even one full sweep).</li> <li><strong>Trajectory Sampling</strong>: generate simulated trajectories by rolling out the current policy in the model, performing one-step backups along the trajectory.</li> </ul> </li> <li>2 common sampling distributions: <ul> <li>Uniform sampling of states or state-action pairs.</li> <li>On-policy distribution.</li> </ul> </li> <li>For planning updates in tabular RL, should state-action pairs be selected uniformly or according to the on-policy distribution?</li> </ul> <h3 id="formal-comparison">Formal Comparison</h3> <ul> <li><strong>Uniform distribution:</strong> <ul> <li>Cycle systematically through all $\vert S \vert \times \vert A \vert$ state-action pairs.</li> <li>Each pair receives equal computational resources.</li> <li>Complete coverage regardless of policy.</li> <li>Starting state distribution $\approx$ uniform or some fixed distribution $\mu(S_t)$.</li> </ul> </li> <li><strong>On-policy trajectory sampling:</strong> <ul> <li>Sample states $S_t \sim d^\pi$ where $d^\pi$ is the on-policy state distribution under policy $\pi$.</li> <li>Select actions $A_t \sim \pi(\cdot \vert S_t)$.</li> <li>Generate trajectories $\{S_0, A_0, S_1, A_1, \ldots\}$ following the current policy.</li> <li>Update only visited state-action pairs.</li> </ul> </li> <li><strong>Advantages of trajectory sampling:</strong> <ul> <li><strong>Computational focusing</strong>: for large state spaces where $\vert S \vert \gg$ the number of states reachable under $\pi$, trajectory sampling concentrates updates on the reachable subset.</li> <li><strong>Irrelevant state pruning</strong>: 3 categories of states emerge: <ul> <li>Initial states (starting distribution).</li> <li>States reachable under optimal control.</li> <li>Irrelevant states (never visited optimally).</li> </ul> </li> </ul> </li> <li><strong>Disadvantages/Limitations of trajectory sampling:</strong> <ul> <li>Requires a generative model to simulate
trajectories.</li> <li>May miss important states early in learning if the policy is poor.</li> <li>Can be sample-inefficient if trajectories are long.</li> </ul> </li> <li>On-policy distribution sampling is most useful for problems with large state spaces and small branching factors (cf. prioritized sweeping).</li> <li> <p>Trajectory sampling is orthogonal to prioritization. The former addresses which states to update, while the latter addresses in what order. They can be combined.</p> </li> <li>Trajectory sampling anticipates importance sampling concepts in off-policy learning and naturally extends to continuous state spaces with function approximation.</li> </ul> <hr /> <h2 id="87-real-time-dynamic-programming-rtdp">8.7 Real-Time Dynamic Programming (RTDP)</h2> <ul> <li>RTDP is an on-policy trajectory-sampling version of the value-iteration algorithm of dynamic programming (DP).</li> <li>RTDP is an asynchronous DP algorithm; <ul> <li>async DP algorithms are not organized in terms of systematic sweeps of the state set; instead,</li> <li>they update state values in any order whatsoever, using whatever values of other states happen to be available.</li> </ul> </li> <li>RTDP is basically the combination of 3 ideas: <ul> <li><strong>On-policy trajectory sampling</strong> $\Rightarrow$ follow the current greedy policy.</li> <li><strong>Asynchronous updates</strong> $\Rightarrow$ update only the states you actually visit, in any order.</li> <li><strong>Focused learning</strong> $\Rightarrow$ concentrate on <strong>“relevant states”</strong> (states on good paths to the goal).</li> </ul> </li> <li><strong>RTDP update rule:</strong></li> </ul> \[\boxed{V(S_t) \leftarrow \max_{a \in A}\left(R^a_{S_t} + \gamma \sum_{s'} P^a_{S_t s'}\, V(s')\right)}\] <ul> <li>RTDP’s relationship to other methods: <ul> <li><strong>Value Iteration</strong>: <ul> <li>VI updates all states per iteration;</li> <li>RTDP updates only states on sampled trajectories.</li> </ul> </li> <li><strong>Prioritized
Sweeping</strong>: <ul> <li>PS uses model to work backward from goals;</li> <li>RTDP follows forward trajectories.</li> </ul> </li> <li><strong>Trajectory Sampling</strong>: <ul> <li>TS can use any policy;</li> <li>RTDP uses greedy policy for sampling.</li> </ul> </li> </ul> </li> <li> <p><strong>Computational Complexity:</strong></p> <ul> <li>Traditional Value Iteration $\Rightarrow O(\vert S \vert ^2 \vert A \vert)$ per iteration.</li> <li>RTDP Trial $\Rightarrow O(L)$ where $L$ = episode length, typically $L \ll \vert S \vert$.</li> </ul> </li> <li>RTDP bridges pure planning and pure learning (focusing on relevant state space regions).</li> <li>RTDP is guaranteed to find an optimal policy for the relevant states under certain conditions: <ol> <li>The initial value of every goal state is zero.</li> <li>There exists at least one policy that guarantees that a goal state will be reached with probability one from any start state.</li> <li>All rewards for transitions from non-goal states are strictly negative.</li> <li>All the initial values are equal to or greater than their optimal values (which can be satisfied by simply setting the initial values of all states to zero).</li> </ol> </li> <li>Tasks with these properties are usually called <strong>stochastic optimal path problems</strong>. <ul> <li>RTDP can find optimal policies for these tasks with approximately 50% of the computation required by traditional sweep-based value iteration (i.e. 
dynamic programming).</li> <li>These kinds of problems are usually expressed in terms of cost minimization rather than reward maximization.</li> </ul> </li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-7-RTDP.png" alt="RTDP" /></p> <blockquote> <p><strong>State space diagram</strong>: Start states on the left, irrelevant states (unreachable from any start state under any optimal policy) in the outer region, and relevant states (reachable from some start state under some optimal policy) in the inner region.</p> </blockquote> <h3 id="heuristic-initialization-optimistic">Heuristic Initialization (Optimistic)</h3> <ul> <li>RTDP typically initializes $V$ with an admissible heuristic $h(s)$ where $h(s) \geq V^*(s)$.</li> <li>This provides optimistic values that guide exploration towards goal states.</li> </ul> <h3 id="considerations">Considerations</h3> <ul> <li>Most effective domains for RTDP are domains with: <ul> <li>Large state spaces.</li> <li>Sparse goal states.</li> <li>Clearly defined initial distribution.</li> <li>Deterministic or near-deterministic dynamics.</li> </ul> </li> </ul> <hr /> <h2 id="88-planning-at-decision-time">8.8 Planning at Decision Time</h2> <ul> <li>Planning can be used in at least 2 ways: <ul> <li><strong>Background planning</strong>: planning that occurs independently of and prior to the need for action. Here planning is used to gradually improve a policy or value function on the basis of simulated experience from a model; an action is then selected via lookup.</li> <li><strong>Decision-time planning</strong>: planning that occurs at the moment an action is required, after encountering the current state. It’s essentially planning at the time of action selection.</li> </ul> </li> <li>Decision-time planning is useful when fast response is not required and the state space is large.
When low-latency action selection is required, background planning is the better choice.</li> <li>Decision-time planning is <strong>memoryless</strong> (it discards its updates after selecting an action), but background planning is <strong>persistent</strong> (it permanently stores and accumulates learned values).</li> </ul> <hr /> <h2 id="89-heuristic-search">8.9 Heuristic Search</h2> <ul> <li>The classical state-space planning methods are decision-time planning methods collectively known as <strong>heuristic search</strong>.</li> <li>In heuristic search, for each state encountered, a large tree of possible continuations is considered. <ul> <li>The search evaluates the leaf nodes at the end of the search and backs up their values through the state-action nodes toward the current state at the root.</li> <li>The value-maximizing action from the current state is found and then selected (the values are usually discarded).</li> </ul> </li> <li>This kind of planning is effective because it focuses only on pertinent next states and actions, concentrating resources on choosing the best next action.</li> <li>Heuristic search extends the greedy policy from one-step to multi-step lookahead to obtain better action selections.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-9-Heuristic-Search.png" alt="heuristic-search" /></p> <blockquote> <p><strong>Heuristic Search diagram (selective depth-first search)</strong>: a large tree rooted at the current state, with branches for each action and subtrees for each successor.
The tree policy traverses the tree greedily, evaluating and backing up values from the leaf nodes toward the root.</p> </blockquote> <h3 id="theoretical-properties">Theoretical Properties</h3> <ol> <li><strong>Optimality horizon</strong>: for sufficiently large depth $K$ where $\gamma^K \approx 0$, the selected action approaches the optimal action $a^*(s)$.</li> <li><strong>Computational complexity</strong>: <ul> <li><u>Full tree expansion:</u> $O(b^K)$ where $b$ = branching factor.</li> <li><u>With pruning/selection:</u> $O(f(b, K))$ where $f(b, K) &lt; b^K$.</li> </ul> </li> <li><strong>Memory</strong>: $O(bK)$ with a depth-first implementation.</li> <li><strong>Backed-up value interpretation</strong>: $v_{\text{tree}}(s)$ estimates the $K$-step optimal value starting from $s$.</li> </ol> <hr /> <h2 id="810-rollout-algorithms">8.10 Rollout Algorithms</h2> <ul> <li>Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state.</li> <li> <p>They estimate action values for a given policy by averaging the returns of many simulated trajectories that start with each possible action and then follow the given policy.</p> </li> <li><strong>Key characteristics:</strong> <ul> <li>It is memoryless; it doesn’t store/update a permanent value table (no persistence).</li> <li>The rollout policy $\Pi$ is usually the current greedy policy.</li> <li>Relies on MC averaging to reduce variance (more rollouts $\to$ better estimates).</li> <li>It’s quite simple: no tree-building, no backups during rollouts.</li> </ul> </li> <li><strong>Goal</strong>: <ul> <li>Not to estimate a complete $q_*$ or $q_\pi$ for a given policy $\pi$; instead the focus is on MC estimates of action values only for each current state and for a given fixed policy called the <strong>rollout policy</strong>.</li> <li>Improve upon the rollout policy, not find the optimal policy.</li> </ul> </li> <li>Rollout algorithms
follow the <strong>policy improvement theorem</strong> by acting greedily w.r.t. $\hat{Q}(s,a)$: <ul> <li>If $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for all states $s$, then $\pi' \geq \pi$ (i.e. $\pi'$ is as good as or better than $\pi$).</li> </ul> </li> </ul> <h3 id="computational-complexity-quite-expensive-due-to-many-full-episodes">Computational Complexity (quite expensive due to many full episodes)</h3> <ul> <li>Per decision $\Rightarrow$: <ul> <li>$\vert A(s) \vert$ = number of actions to evaluate,</li> <li>$n$ = rollouts per action,</li> <li>$L$ = average episode length.</li> </ul> </li> </ul> \[\text{total cost} = O\left(|A(s)| \cdot n \cdot L\right)\] <ul> <li>The computation per decision thus depends on several factors, and balancing them is important and challenging. To handle the challenge: <ul> <li>Run many trials in parallel on separate processors (the MC trials are independent of one another).</li> <li>Truncate the simulated trajectories short of complete episodes, correcting the truncated returns by means of a stored evaluation function.</li> <li>Prune away candidate actions that are unlikely to be the best.</li> </ul> </li> <li>Rollout algorithms aren’t learning algorithms because they maintain no long-term memory of values or policies. <ul> <li>But they still use RL techniques: <strong>Monte Carlo control + Policy Improvement Theorem.</strong></li> <li>Use MC control to estimate action values via averaging the returns of a collection of sample trajectories.</li> <li>Take advantage of the policy improvement property by acting greedily w.r.t.
the estimated action values.</li> </ul> </li> </ul> <hr /> <h2 id="811-monte-carlo-tree-search-mcts">8.11 Monte Carlo Tree Search (MCTS)</h2> <ul> <li>MCTS is a rollout algorithm that is enhanced by the addition of a means for accumulating value estimates obtained from the MC simulations in order to successively direct simulations toward more highly-rewarding trajectories.</li> <li>It is a best-first search algorithm that builds a decision tree incrementally by iteratively sampling trajectories through an MDP, using statistical confidence bounds for exploration-exploitation balance.</li> <li>When the environment changes to a new state, MCTS executes as many iterations as possible before an action needs to be selected, incrementally building a tree whose root node represents the current state.</li> </ul> <p><strong>Each iteration consists of 4 operations:</strong></p> <h3 id="1-selection">1. Selection</h3> <ul> <li>Starting at the root node, a tree policy based on the action values attached to the edges of the tree traverses the tree to select a leaf node.</li> <li>Traverse tree using tree policy (typically Upper Confidence bounds for Trees, <strong>UCT</strong>):</li> </ul> \[\boxed{\text{UCT}(s,a) = \underbrace{\frac{W(s,a)}{N(s,a)}}_{\text{exploitation}} + c\underbrace{\sqrt{\frac{\ln(N(s))}{N(s,a)}}}_{\text{exploration}} = Q(s,a) + c\sqrt{\frac{\ln(N(s))}{N(s,a)}}}\] <ul> <li>where: <ul> <li>$Q(s,a) = \frac{W(s,a)}{N(s,a)}$ = average return from $(s,a)$</li> <li>$W(s,a)$ = total reward accumulated through $(s,a)$</li> <li>$N(s,a)$ = visit count for $(s,a)$</li> <li>$N(s)$ = visit count for parent state $s$</li> <li>$c$ = exploration constant (typically $\sqrt{2}$); <ul> <li>$c &lt; 0.5 \Rightarrow$ more exploitation</li> <li>$c &gt; 2 \Rightarrow$ more exploration</li> </ul> </li> </ul> </li> <li>Selection terminates when: <ul> <li>Leaf node is reached (not fully expanded).</li> <li>Terminal state is reached.</li> </ul> </li> </ul> <h3 id="2-expansion">2.
Expansion</h3> <ul> <li>On some iterations, the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions.</li> <li>Expansion strategies: <ul> <li>Single child per iteration (standard).</li> <li>All children at once (batch expansion).</li> <li>Progressive widening for continuous actions.</li> </ul> </li> </ul> <h3 id="3-simulation">3. Simulation</h3> <ul> <li>From the selected node, or from one of its newly-added child nodes (if any), simulation of a complete episode is run with actions selected by the rollout policy (actions are selected first by the tree policy and beyond the tree by the rollout policy).</li> </ul> <h3 id="4-backup">4. Backup</h3> <ul> <li>The return generated by the simulated episode is backed up to update, or to initialize, the action values attached to the edges of the tree traversed by the tree policy in this iteration of MCTS.</li> <li>Propagate simulation result up the path and update all nodes/edges on the path.</li> </ul> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-11-MCTS.png" alt="MCTS" /></p> <blockquote> <p><strong>MCTS diagram</strong>: 4 stages shown left to right: Selection (tree policy traverses with blue arrows to a leaf), Expansion (leaf expanded), Simulation (rollout policy runs from expanded node to terminal $\Delta$), Backup (return propagated back up with blue arrows).</p> </blockquote> <ul> <li> <p>MCTS continues executing these 4 steps, starting each time at the tree’s root node, until no more time is left, or some other computational resource is exhausted.</p> </li> <li> <p>Then finally, an action from the root node (representative of the environment’s current state) is selected according to some mechanism that depends on the accumulated statistics in the tree (action with largest action value or action with largest visit count to avoid outliers).</p> </li> </ul> <h3 id="mcts-pseudocode">MCTS Pseudocode</h3> <p><strong>Main 
loop:</strong></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Initialize: root = current state S_0
for i = 1 to num_simulations:
    node = Selection(root)
    node = Expansion(node)
    Δ = Simulation(node)
    Backup(node, Δ)
return argmax_a N(S_0, a)
</code></pre></div></div> <p><strong>Expansion:</strong></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>If non-terminal leaf reached:
    select unvisited action a' ∈ A(S) \ {visited actions}
    create child node S' ~ p(·|S, a')
    add to tree
    return S'
</code></pre></div></div> <p><strong>Simulation:</strong></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S ← expanded_node
G ← 0
t ← 0
while S non-terminal:
    a ~ Π_default(·|S)
    r, S' ~ p(·|S, a)
    G ← G + γᵗ r
    S ← S'
    t ← t + 1
return G
</code></pre></div></div> <p><strong>Backup:</strong></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while node ≠ null:
    N(node) ← N(node) + 1
    W(node) ← W(node) + Δ
    node ← node.parent
</code></pre></div></div> <h3 id="computational-complexity">Computational Complexity</h3> <ul> <li>Per iteration (where $d$ = tree depth, $L$ = episode length): <ul> <li>Selection: $O(d)$,</li> <li>Expansion: $O(1)$,</li> <li>Simulation: $O(L)$,</li> <li>Backup: $O(d)$</li> </ul> </li> </ul> \[\Rightarrow \text{total: } O\!\left(n \cdot (d + L)\right) \text{ for } n \text{ simulations}\] <h3 id="summary-of-mcts">Summary of MCTS</h3> <ul> <li>MCTS is a decision-time planning algorithm based on Monte Carlo control applied to simulations that start from the root state.</li> <li>MCTS benefits from online, incremental, sample-based value estimation &amp; policy improvement.</li> <li>MCTS saves action-value estimates attached to the tree edges and updates them using RL’s sample updates.</li> <li>MCTS, via incremental tree expansion, effectively grows a
lookup table to store a partial action-value function.</li> <li>MCTS thus avoids the problem of globally approximating an action-value function while it retains the benefit of using past experience to guide exploration.</li> </ul> <h3 id="pros--cons-of-mcts">Pros &amp; Cons of MCTS</h3> <ul> <li><strong>Pros</strong>: <ul> <li>anytime algorithm,</li> <li>asymmetric tree growth,</li> <li>no domain heuristic required,</li> <li>handles high branching factors.</li> </ul> </li> <li><strong>Cons</strong>: <ul> <li>high computational cost,</li> <li>may miss deep forced sequences,</li> <li>random rollouts can be weak in tactical domains,</li> <li>finite simulations miss long-term consequences.</li> </ul> </li> </ul> <hr /> <h2 id="812-summary">8.12 Summary</h2> <ul> <li>Planning requires a model of the environment. <ul> <li>A <strong>distribution model</strong> consists of the probabilities of next states and rewards for possible actions. Dynamic Programming requires a distribution model because it uses expected updates, which involve computing expectations over all the possible next states and rewards.</li> <li>A <strong>sample model</strong> is required for simulating interaction with the environment using sample updates.</li> <li>Sample models are generally much easier to obtain than distribution models.</li> </ul> </li> <li>There exists a close relationship between planning optimal behavior and learning optimal behavior: <ul> <li>Both involve estimating the same value functions.</li> <li>Both naturally update the estimates incrementally, in a long series of small backup operations.</li> <li>Any of the learning methods can be converted into planning methods simply by applying them to simulated rather than real experience (model-based rather than model-free).</li> </ul> </li> <li> <p>It is straightforward to integrate incremental planning methods with acting and model-learning.
Planning, acting and model-learning interact in a circular fashion, each producing what the others need to improve. All processes naturally proceed asynchronously and in parallel.</p> </li> <li><strong>Dimensions of variation among state-space planning methods:</strong> <ul> <li><strong>Size of updates</strong>: the smaller the updates, the more incremental the planning methods can be. One-step updates, as in Dyna, are the smallest updates.</li> <li><strong>Distribution of updates</strong>: primarily regarded as the focus of search. <ul> <li><strong>Prioritized sweeping</strong> focuses backward on the predecessors of recently changed states.</li> <li><strong>On-policy trajectory sampling</strong> focuses on states/state-action pairs that are likely to be encountered under the current policy.</li> </ul> </li> </ul> </li> <li> <p><strong>Real-time DP (RTDP),</strong> an on-policy trajectory sampling version of value iteration, illustrates some of the advantages that focusing on the relevant regions of the state space has over conventional sweep-based value iteration (exhaustive sweeps).</p> </li> <li>Planning can also focus forward from pertinent states, such as states actually encountered during an agent-environment interaction, and the most important form of this is when <strong>planning is done at decision time</strong> as part of the action-selection process.
<ul> <li>Another example of this is <strong>classical heuristic search.</strong></li> <li>Other examples are <strong>rollout algorithms &amp; Monte Carlo Tree Search (MCTS)</strong> that both benefit from online, incremental, sample-based value estimation and policy improvement.</li> </ul> </li> </ul> <hr /> <h2 id="813-summary-of-part-i-dimensions">8.13 Summary of Part I: Dimensions</h2> <p><strong>Two axes for the update diagram:</strong></p> <blockquote> <p>\(\text{HORIZONTAL (L to R): sample backups} \xrightarrow{\text{width of update}} \text{full/expected backups}\) \(\text{VERTICAL (Top to Bottom): shallow backups} \xrightarrow{\text{depth/length of update}} \text{deep backups}\)</p> </blockquote> <p><img src="/assets/images/2026/rl-sutton-barto/ch08-8-13-summary-unified-rl.png" alt="Unified View of RL depicting a slice through the space of RL methods" /></p> <blockquote> <p><strong>Unified View of RL</strong> depicting a slice through the space of RL methods</p> </blockquote> <ul> <li>Each RL idea presented so far can be viewed as a dimension along which methods vary. 
The set of the dimensions spans a large space of possible methods (quasi-infinite possibilities).</li> <li>All the methods discussed so far in this book have 3 key ideas in common: <ul> <li>They all seek to estimate value functions.</li> <li>They all operate by backing up values along actual or possible state trajectories.</li> <li>They all follow the general strategy of <strong>generalized policy iteration (GPI).</strong> This means they maintain an approximate value function and an approximate policy, and they continually try to improve each on the basis of the other.</li> </ul> </li> <li>3 important <strong>dimensions</strong> along which RL methods vary: <ul> <li><strong>Width of updates</strong>: sample updates (based on a sample trajectory) vs expected updates (based on a distribution of possible trajectories).</li> <li><strong>Depth of updates</strong>: degree of bootstrapping ($\lambda$).</li> <li><strong>On-policy vs off-policy methods</strong>.</li> </ul> </li> <li><strong>Other dimensions along which RL methods vary:</strong> <ul> <li><strong>Definition of return</strong>: is the task episodic or continuing, discounted or undiscounted?</li> <li><strong>Action values vs state values vs afterstate values</strong>.</li> <li><strong>Action selection/exploration</strong>: how are actions selected to ensure a suitable exploration/exploitation tradeoff? Simple ways considered are $\varepsilon$-greedy, optimistic initialization, soft-max, upper confidence bound (UCB).</li> <li><strong>Synchronous vs asynchronous</strong>: are the updates for all states performed simultaneously or one-by-one in some order?</li> <li><strong>Real vs simulated experience</strong>.</li> <li><strong>Location of updates</strong>: what states or state-action pairs should be updated? 
Model-free methods can choose only among encountered states, but model-based methods can choose arbitrarily.</li> <li><strong>Timing of updates</strong>: should updates be done as part of action selection, or only afterward?</li> <li><strong>Memory for updates</strong>: how long should updated values be retained? <ul> <li>Should they be retained permanently? (<strong>persistence</strong>)</li> <li>Or only while computing an action selection and then discarded? (<strong>memoryless</strong>)</li> </ul> </li> </ul> </li> <li>These dimensions are <strong>neither exhaustive nor mutually exclusive.</strong> e.g. Dyna methods use both real and simulated experience to affect the same value function.</li> <li> <p>These dimensions constitute a coherent set of ideas for description &amp; exploration of a wide space of possible methods.</p> </li> <li>The most important dimension not mentioned or covered yet is that of <strong>function approximation</strong>: <ul> <li>Function approximation can be viewed as an orthogonal spectrum of possibilities ranging from <strong>tabular methods</strong> at one extreme through <strong>state aggregation</strong>, a variety of <strong>linear methods</strong>, and then a diverse set of <strong>non-linear methods</strong>.</li> </ul> </li> </ul> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026RLsuttonBartoCh08notes</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Sutton &amp; Barto, Ch. 
08: Planning &amp; Learning with Tabular Methods (Personal Notes)"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Feb"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/02/24/rl-sutton-barto-notes-ch008/"</span> <span class="p">}</span> </code></pre></div></div> <hr /> Tue, 24 Feb 2026 00:00:00 +0000 https://chizkidd.github.io//2026/02/24/rl-sutton-barto-notes-ch008/ https://chizkidd.github.io//2026/02/24/rl-sutton-barto-notes-ch008/ Architectural and Mathematical Foundations of Machine Learning: A Rigorous Synthesis of Theory, Geometry, and Implementation <hr /> <blockquote> <p><strong>Abstract:</strong> The maturation of machine learning from a subfield of heuristic-driven statistics into a cornerstone of modern computational science has necessitated a re-evaluation of its pedagogical foundations. Modern practitioners often rely on high-level libraries that abstract away the underlying mathematics, but as evidenced by critical reviews from the research community, this abstraction often leads to a superficial understanding of model dynamics, failure modes, and optimization bottlenecks.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> A robust understanding of machine learning is not merely a collection of isolated equations but a synthesis of linear algebra, information theory, multivariate calculus, and probabilistic estimation. 
This report provides an exhaustive analysis of these mathematical pillars, correcting common technical misconceptions and bridging the gap between theoretical derivation and numerically stable implementation.</p> </blockquote> <hr /> <h2 id="the-geometry-of-representation-linear-and-affine-transformations">The Geometry of Representation: Linear and Affine Transformations</h2> <p>In the discourse of deep learning, the term “linear layer” is frequently used as a shorthand for the fundamental operation of weight-input multiplication followed by a bias shift. However, a rigorous geometric analysis reveals that the operations defining neural networks are more accurately described as affine transformations.<sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> This distinction is not merely semantic; it defines the topology of the latent space and the constraints of the optimization landscape.</p> <h3 id="defining-the-affine-space">Defining the Affine Space</h3> <p>A linear transformation between two vector spaces must satisfy two core properties: additivity and homogeneity. Geometrically, this requires that the transformation fixes the origin; the zero vector in the input space must map to the zero vector in the output space. In a standard neural network layer, the operation is defined as $y = Ax + b$. While the term $Ax$ represents a linear transformation (scaling, rotating, or shearing the input $x$), the addition of the bias vector $b$ shifts the resulting vector away from the origin.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p> <p>This shift renders the transformation affine. An affine transformation is the composition of a linear mapping followed by a translation. 
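</p>

<p>The distinction can be checked numerically; the following is a minimal sketch using NumPy, with arbitrary random matrices standing in for a trained layer’s weights:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # weight matrix (linear part)
b = rng.standard_normal(3)        # bias vector (translation)
x = rng.standard_normal(4)
y = rng.standard_normal(4)

def linear(v):
    return A @ v

def affine(v):
    return A @ v + b

# A linear map fixes the origin and preserves addition.
assert np.allclose(linear(np.zeros(4)), np.zeros(3))
assert np.allclose(linear(x + y), linear(x) + linear(y))

# The affine map sends the origin to b and breaks additivity
# (affine(x + y) differs from affine(x) + affine(y) by exactly b).
assert np.allclose(affine(np.zeros(4)), b)
assert not np.allclose(affine(x + y), affine(x) + affine(y))
```

<p>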
In high-dimensional spaces, the bias term $b$ is what allows a hyperplane to exist in any position within the space, rather than being forced to pass through the coordinate center.<sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> Without this capability, the expressive power of a neural network would be severely diminished, as it would be unable to model datasets where the decision boundary does not intersect the origin.</p> <table> <thead> <tr> <th>Feature</th> <th>Linear Transformation ($Ax$)</th> <th>Affine Transformation ($Ax+b$)</th> </tr> </thead> <tbody> <tr> <td>Origin Preservation</td> <td>Maps zero vector to zero vector ($f(0) = 0$)</td> <td>Shifts the origin by the vector $b$ <br />($f(0) = b$)</td> </tr> <tr> <td>Algebraic Properties</td> <td>Satisfies $f(x+y) = f(x) + f(y)$ and $f(cx) = cf(x)$</td> <td>Violates both unless $b = 0$<sup id="fnref:2:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></td> </tr> <tr> <td>Geometric Action</td> <td>Rotation, scaling, reflection, shearing</td> <td>Rotation/Scaling followed by Translation</td> </tr> <tr> <td>Machine Learning Role</td> <td>Feature interaction and dimensionality change</td> <td>Decision boundary positioning and normalization</td> </tr> </tbody> </table> <h3 id="spectral-decomposition-and-the-warping-of-space">Spectral Decomposition and the Warping of Space</h3> <p>Beyond simple layer operations, the internal structure of data matrices is analyzed through spectral decomposition. Eigendecomposition and Singular Value Decomposition (SVD) provide the mathematical tools to understand how a model “views” the variance of its input. 
An eigenvector $v$ of a square matrix $A$ is a characteristic direction that, under the transformation $A$, is only scaled by a factor $\lambda$, termed the eigenvalue: $Av = \lambda v$.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> Geometrically, if we align our coordinate system with these eigenvectors, the matrix $A$ simply acts as a scaling factor along those axes.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p> <p>However, eigendecomposition is limited to square matrices and often lacks orthogonality in the basis vectors unless the matrix is symmetric.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> Singular Value Decomposition (SVD) generalizes this concept to any $m \times n$ matrix $A$, decomposing it into $A = U\Sigma V^T$. This decomposition reveals a three-step geometric process:</p> <ol> <li><strong>Input Rotation</strong>: The matrix $V^T$ rotates the input space to align with the principal axes of the data.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></li> <li><strong>Stretching</strong>: The diagonal matrix $\Sigma$ scales the data along these axes according to the singular values $\sigma_i$.<sup id="fnref:4:1" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></li> <li><strong>Output Rotation</strong>: The matrix $U$ rotates the scaled data into the output coordinate system.<sup id="fnref:6:1" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></li> </ol> <p>In the context of dimensionality reduction, SVD allows for the optimal projection of data onto a lower-dimensional subspace. 
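</p>

<p>This three-step picture can be checked directly against NumPy’s SVD routine; the rectangular matrix below is random and purely illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # arbitrary rectangular matrix

# Thin SVD: A = U @ diag(s) @ Vt, with s sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

x = rng.standard_normal(3)

step1 = Vt @ x     # 1. input rotation
step2 = s * step1  # 2. axis-aligned stretching by the singular values
step3 = U @ step2  # 3. output rotation

# The composition reproduces the original transformation.
assert np.allclose(step3, A @ x)
assert np.all(s[:-1] >= s[1:])

# Truncating to the top k singular values gives the best rank-k
# approximation in the Frobenius norm; the residual is exactly the
# norm of the discarded singular values.
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]
assert np.isclose(np.linalg.norm(A - A_k), np.sqrt(np.sum(s[k:] ** 2)))
```

<p>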
By retaining only the largest $k$ singular values in $\Sigma$ and setting the rest to zero, we minimize the reconstruction error in terms of the Frobenius norm, a principle that underlies both Principal Component Analysis (PCA) and modern matrix completion algorithms.<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p> <h2 id="information-theory-as-the-metric-of-learning">Information Theory as the Metric of Learning</h2> <p>While linear algebra defines the transformations, information theory provides the objective functions used to measure the “success” of those transformations. In machine learning, the goal is often to minimize the distance between a predicted probability distribution and the true data distribution.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p> <h3 id="surprisal-and-the-derivation-of-entropy">Surprisal and the Derivation of Entropy</h3> <p>The fundamental unit of information theory is “surprisal” or self-information. Intuitively, an event that is certain carries no information, whereas a rare event provides significant insight when it occurs. This is quantified by the negative logarithm of the probability $p$ of an event: $I(x) = -\log p(x)$.<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> The logarithmic form is essential because it ensures that information is additive for independent events: $I(x,y) = I(x) + I(y)$.<sup id="fnref:7:1" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p> <p>Entropy $H(P)$ is the expected value of surprisal across an entire distribution $P$. 
It represents the average amount of uncertainty or “average surprise” inherent in the distribution<sup id="fnref:7:2" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>:</p> \[H(P) = -\sum_{x} P(x) \log P(x)\] <p>A uniform distribution maximizes entropy, as it represents a state where every outcome is equally likely, providing the highest level of average uncertainty. In decision trees, for example, entropy is used to measure the “purity” of a node; a node with low entropy contains samples mostly from a single class, indicating high certainty in the prediction.<sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">9</a></sup></p> <h3 id="cross-entropy-the-cost-of-misaligned-models">Cross-Entropy: The Cost of Misaligned Models</h3> <p>When we train a model, we generate a predicted distribution $Q$ intended to approximate the true distribution $P$. Cross-entropy $H(P,Q)$ measures the average surprisal we experience if we encode data from $P$ using the “codebook” optimized for $Q$.<sup id="fnref:8:1" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> Mathematically:</p> \[H(P, Q) = -\sum_{x} P(x) \log Q(x)\] <p>Cross-entropy is a staple loss function in classification tasks. 
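</p>

<p>These definitions translate into a few lines of NumPy. The distributions below are made up for illustration, and the $\lim_{p \to 0} p \log p = 0$ limit is handled explicitly by masking out zero-probability outcomes:</p>

```python
import numpy as np

def entropy(p):
    """Average surprisal of p, using the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # drop zero-probability outcomes explicitly
    return -np.sum(nz * np.log(nz))

def cross_entropy(p, q):
    """Average surprisal when data from p is coded with model q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = np.array([0.5, 0.25, 0.25])  # "true" distribution (illustrative)
q = np.array([0.4, 0.4, 0.2])    # model's imperfect guess

# Gibbs' inequality: coding with the wrong model can only cost more,
# so H(P, Q) >= H(P), with equality when q matches p.
assert cross_entropy(p, q) >= entropy(p)
assert np.isclose(cross_entropy(p, p), entropy(p))

# A certain outcome carries no information.
assert entropy([1.0, 0.0]) == 0.0
```

<p>Note that <code class="language-plaintext highlighter-rouge">cross_entropy(p, q)</code> and <code class="language-plaintext highlighter-rouge">cross_entropy(q, p)</code> generally differ, as the definition is not symmetric in its arguments.</p>

<p>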
It is inherently asymmetric ($H(P,Q) \neq H(Q,P)$), a property that reflects the physical reality of communication: the cost of using a wrong model depends on the direction in which the error occurs.<sup id="fnref:7:3" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup> Specifically, if the model $Q$ predicts a zero probability for an event that actually occurs in $P$, the cross-entropy becomes infinite, reflecting “infinite surprise” and forcing the model to never be “certainly wrong”.<sup id="fnref:7:4" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p> <h3 id="kullback-leibler-kl-divergence">Kullback-Leibler (KL) Divergence</h3> <p>KL Divergence $D_{KL}(P \| Q)$ isolates the “extra” surprisal caused by the model’s inaccuracy. It is defined as the difference between cross-entropy and the inherent entropy of the data<sup id="fnref:8:2" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>:</p> \[D_{KL}(P \| Q) = H(P, Q) - H(P) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}\] <p>Since the entropy of the data $H(P)$ is constant with respect to the model parameters, minimizing cross-entropy is functionally identical to minimizing KL divergence.<sup id="fnref:7:5" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup> This relationship is the backbone of Maximum Likelihood Estimation (MLE), as minimizing the divergence between the data and the model is equivalent to finding the parameters that make the observed data most probable.<sup id="fnref:7:6" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p> <table> <thead> <tr> <th>Metric</th> <th>Formula</th> <th>Intuition</th> <th>Application</th> </tr> </thead> <tbody> <tr> <td>Entropy</td> <td>$H(P) = -\sum P(x) \log P(x)$</td> <td>Average uncertainty of a single source</td> <td>Data compression, decision tree splitting</td> </tr> <tr> <td>Cross-Entropy</td> <td>$H(P,Q) = -\sum P(x) \log Q(x)$</td> 
<td>Total cost of using model $Q$ for data $P$</td> <td>Loss function for classifiers<sup id="fnref:8:3" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup></td> </tr> <tr> <td>KL Divergence</td> <td>$D_{KL}(P | Q) = \sum P(x) \log \frac{P(x)}{Q(x)}$</td> <td>“Distance” or extra cost between distributions</td> <td>Variational inference, GANs, RL regularization</td> </tr> </tbody> </table> <h2 id="optimization-dynamics-jacobians-hessians-and-the-curvature-of-loss">Optimization Dynamics: Jacobians, Hessians, and the Curvature of Loss</h2> <p>Optimization in machine learning is essentially a navigation problem through a high-dimensional landscape. While the gradient provides the direction of the slope, higher-order derivatives provide the context of that slope: its sensitivity and its curvature.<sup id="fnref:15" role="doc-noteref"><a href="#fn:15" class="footnote" rel="footnote">10</a></sup></p> <h3 id="the-jacobian-first-order-sensitivity">The Jacobian: First-Order Sensitivity</h3> <p>For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix $J$ contains all first-order partial derivatives. Each element $J_{ij} = \frac{\partial f_i}{\partial x_j}$ represents how the $i$-th output changes with respect to the $j$-th input.<sup id="fnref:17" role="doc-noteref"><a href="#fn:17" class="footnote" rel="footnote">11</a></sup> In neural network training, the Jacobian is the fundamental object of backpropagation.</p> <p>A common misunderstanding in technical literature is the classification of backpropagation itself. 
As noted in expert feedback, backpropagation is not an optimization algorithm like Gradient Descent; rather, it is a computationally efficient method for calculating the Jacobian through the application of the chain rule.<sup id="fnref:1:2" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> The efficiency of backpropagation stems from its ability to reuse intermediate partial derivatives, avoiding the combinatorial explosion that would occur if each path through the network were differentiated independently.<sup id="fnref:17:1" role="doc-noteref"><a href="#fn:17" class="footnote" rel="footnote">11</a></sup></p> <h3 id="the-hessian-and-the-topology-of-generalization">The Hessian and the Topology of Generalization</h3> <p>While the Jacobian tells us where to move, the Hessian matrix $H$ (the second derivative) tells us the shape of the area we are moving through. The Hessian is a square matrix of second-order partial derivatives: $H_{ij} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}$.<sup id="fnref:16" role="doc-noteref"><a href="#fn:16" class="footnote" rel="footnote">12</a></sup> The eigenvalues of the Hessian at a local minimum define the “sharpness” or “flatness” of that minimum.<sup id="fnref:19" role="doc-noteref"><a href="#fn:19" class="footnote" rel="footnote">13</a></sup></p> <p>The “Flat Minimum” hypothesis suggests that minima with low curvature (low Hessian eigenvalues) are more likely to generalize to unseen data.<sup id="fnref:21" role="doc-noteref"><a href="#fn:21" class="footnote" rel="footnote">14</a></sup> The intuition is that a flat minimum represents a region of parameter space where small perturbations in the weights (caused by noise in the data or finite precision) do not significantly increase the loss. 
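</p>

<p>A toy illustration of this intuition, using two quadratic bowls whose Hessian eigenvalues are chosen schematically (100 for a “sharp” minimum, 0.1 for a “flat” one):</p>

```python
import numpy as np

# Quadratic losses L(w) = 0.5 * w^T H w, both minimized at w = 0 with loss 0.
H_sharp = np.diag([100.0, 100.0])  # large eigenvalues: steep valley
H_flat = np.diag([0.1, 0.1])       # small eigenvalues: broad plateau

def loss(H, w):
    return 0.5 * w @ H @ w

# Perturb the minimizer slightly, as data noise or finite precision might.
eps = np.array([0.1, 0.1])

sharp_penalty = loss(H_sharp, eps)  # loss jumps to 1.0
flat_penalty = loss(H_flat, eps)    # loss barely moves (0.001)

assert sharp_penalty > 100 * flat_penalty
```

<p>The same weight perturbation costs a thousand times more loss in the sharp bowl, which is exactly the sensitivity the eigenvalue table below summarizes.</p>

<p>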
In contrast, a “sharp” minimum is highly sensitive; a slight shift in the data distribution might move the “true” minimum slightly, causing the loss for the sharp-minimum parameters to skyrocket.<sup id="fnref:21:1" role="doc-noteref"><a href="#fn:21" class="footnote" rel="footnote">14</a></sup></p> <table> <thead> <tr> <th>Hessian Eigenvalue Status</th> <th>Geometric Interpretation</th> <th>Generalization Outcome</th> </tr> </thead> <tbody> <tr> <td>Large Eigenvalues</td> <td>Sharp, steep “valley”</td> <td>High sensitivity, prone to overfitting<sup id="fnref:19:1" role="doc-noteref"><a href="#fn:19" class="footnote" rel="footnote">13</a></sup></td> </tr> <tr> <td>Small Eigenvalues</td> <td>Broad, flat “plateau”</td> <td>Robust to noise, better generalization<sup id="fnref:20" role="doc-noteref"><a href="#fn:20" class="footnote" rel="footnote">15</a></sup></td> </tr> <tr> <td>Negative Eigenvalues</td> <td>Surface curves downward (Max/Saddle)</td> <td>Unstable, gradient descent will move away</td> </tr> <tr> <td>Zero Eigenvalues</td> <td>Function is locally linear</td> <td>Inconclusive; often indicates overparameterization</td> </tr> </tbody> </table> <p>Advanced optimization algorithms, such as Sharpness-Aware Minimization (SAM), target this curvature without forming the Hessian explicitly: they seek parameter values whose entire neighborhood has low loss, rather than just a single point.<sup id="fnref:22" role="doc-noteref"><a href="#fn:22" class="footnote" rel="footnote">16</a></sup> This shift from point-wise optimization to neighborhood optimization marks a significant trend in improving the robustness of Large Language Models (LLMs).<sup id="fnref:19:2" role="doc-noteref"><a href="#fn:19" class="footnote" rel="footnote">13</a></sup></p> <h2 id="statistical-frameworks-mle-map-and-the-bayesian-paradigm">Statistical Frameworks: MLE, MAP, and the Bayesian Paradigm</h2> <p>The process of “learning” from data is fundamentally an exercise in statistical estimation. 
Machine learning models typically operate under one of two paradigms: Frequentist or Bayesian.<sup id="fnref:24" role="doc-noteref"><a href="#fn:24" class="footnote" rel="footnote">17</a></sup></p> <h3 id="maximum-likelihood-estimation-mle">Maximum Likelihood Estimation (MLE)</h3> <p>MLE is the Frequentist’s primary tool. It assumes that there is a fixed, “true” parameter $\theta$ and seeks the value that makes the observed data $D$ most probable:</p> \[\hat{\theta}_{MLE} = \arg\max_{\theta} P(D|\theta)\] <p>In practice, we maximize the log-likelihood to transform product-based probabilities into summation-based losses, which are easier to differentiate and less prone to numerical underflow.<sup id="fnref:26" role="doc-noteref"><a href="#fn:26" class="footnote" rel="footnote">18</a></sup> MLE is effective for large datasets where the data itself provides enough signal to overcome initial uncertainty, but it is notoriously prone to overfitting in high-dimensional settings with sparse data.<sup id="fnref:26:1" role="doc-noteref"><a href="#fn:26" class="footnote" rel="footnote">18</a></sup></p> <h3 id="maximum-a-posteriori-map-and-regularization">Maximum A Posteriori (MAP) and Regularization</h3> <p>MAP estimation adopts a Bayesian stance, treating the parameter $\theta$ as a random variable with its own prior distribution $P(\theta)$. 
Using Bayes’ Theorem:</p> \[P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}\] \[\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta|D) = \arg\max_{\theta} [P(D|\theta)P(\theta)]\] <p>The inclusion of the prior $P(\theta)$ acts as a “regularizer.” For instance, assuming a Gaussian prior centered at zero is mathematically equivalent to $L_2$ regularization (Weight Decay), while a Laplacian prior yields $L_1$ regularization (Sparsity).<sup id="fnref:26:2" role="doc-noteref"><a href="#fn:26" class="footnote" rel="footnote">18</a></sup> MAP provides a bridge between pure data-driven learning and the incorporation of domain knowledge, acting as the “experienced analyst” who balances new evidence against historical trends.<sup id="fnref:28" role="doc-noteref"><a href="#fn:28" class="footnote" rel="footnote">19</a></sup></p> <table> <thead> <tr> <th>Aspect</th> <th>Maximum Likelihood Estimation (MLE)</th> <th>Maximum A Posteriori (MAP)</th> </tr> </thead> <tbody> <tr> <td>Philosophy</td> <td>Frequentist: $\theta$ is fixed but unknown</td> <td>Bayesian: $\theta$ is a random variable</td> </tr> <tr> <td>Prior Used?</td> <td>No<sup id="fnref:24:1" role="doc-noteref"><a href="#fn:24" class="footnote" rel="footnote">17</a></sup></td> <td>Yes ($P(\theta)$)<sup id="fnref:25" role="doc-noteref"><a href="#fn:25" class="footnote" rel="footnote">20</a></sup></td> </tr> <tr> <td>Regularization</td> <td>None (unless explicit)</td> <td>Implicit via the prior distribution<sup id="fnref:26:3" role="doc-noteref"><a href="#fn:26" class="footnote" rel="footnote">18</a></sup></td> </tr> <tr> <td>Data Sensitivity</td> <td>High; prone to overfitting on small sets</td> <td>Lower; the prior stabilizes estimates<sup id="fnref:28:1" role="doc-noteref"><a href="#fn:28" class="footnote" rel="footnote">19</a></sup></td> </tr> <tr> <td>Convergence</td> <td>Converges to MAP as data size $\to \infty$</td> <td>Incorporates prior knowledge for small $n$</td> </tr> </tbody> </table> <h2 
id="architecture-dynamics-softmax-attention-and-implicit-mappings">Architecture Dynamics: Softmax, Attention, and Implicit Mappings</h2> <p>Modern neural architectures rely on specific functional forms to control the flow of information and the stability of gradients. Two of the most critical are the Softmax activation and the Attention mechanism.</p> <h3 id="the-jacobian-of-softmax-and-the-backpropagation-fusion">The Jacobian of Softmax and the Backpropagation Fusion</h3> <p>The Softmax function $\sigma(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ is the standard output for multi-class classification. A common technical oversight in tutorial literature is failing to explain the Jacobian of Softmax. Because each output $\sigma_i$ depends on every input $z_j$ (due to the shared denominator), the derivative is not a simple vector but a matrix.<sup id="fnref:13" role="doc-noteref"><a href="#fn:13" class="footnote" rel="footnote">21</a></sup></p> <ul> <li><strong>Diagonal elements</strong>: $\frac{\partial \sigma_i}{\partial z_i} = \sigma_i(1 - \sigma_i)$</li> <li><strong>Off-diagonal elements</strong>: $\frac{\partial \sigma_i}{\partial z_j} = -\sigma_i \sigma_j$</li> </ul> <p>However, when Softmax is combined with the Categorical Cross-Entropy loss, the gradient of the entire block with respect to the input $z$ simplifies to $\sigma - y$, where $y$ is the one-hot encoded ground truth.<sup id="fnref:13:1" role="doc-noteref"><a href="#fn:13" class="footnote" rel="footnote">21</a></sup> This simplicity is why the combination is ubiquitous in deep learning frameworks; it provides a clean, linear error signal $(\sigma - y)$ that directly represents the model’s confidence error.<sup id="fnref:18" role="doc-noteref"><a href="#fn:18" class="footnote" rel="footnote">22</a></sup></p> <h3 id="the-attention-mechanism-a-retrieval-framework">The Attention Mechanism: A Retrieval Framework</h3> <p>The Attention mechanism, particularly in the Transformer architecture, revolutionized 
sequential modeling by replacing fixed-length memory with a dynamic retrieval system. This system is defined by three vectors: Query ($Q$), Key ($K$), and Value ($V$).<sup id="fnref:30" role="doc-noteref"><a href="#fn:30" class="footnote" rel="footnote">23</a></sup></p> <p>The intuition is analogous to a library search:</p> <ul> <li><strong>Query ($Q$)</strong>: The search term or information you are currently looking for.<sup id="fnref:30:1" role="doc-noteref"><a href="#fn:30" class="footnote" rel="footnote">23</a></sup></li> <li><strong>Key ($K$)</strong>: The metadata or “index” on the spine of every book.<sup id="fnref:31" role="doc-noteref"><a href="#fn:31" class="footnote" rel="footnote">24</a></sup></li> <li><strong>Value ($V$)</strong>: The actual content or “knowledge” inside the book.<sup id="fnref:32" role="doc-noteref"><a href="#fn:32" class="footnote" rel="footnote">25</a></sup></li> </ul> <p>The attention weight is computed by measuring the compatibility (dot product) between $Q$ and $K$. After scaling and applying Softmax, these weights determine how much of each $V$ is aggregated into the final representation.<sup id="fnref:31:1" role="doc-noteref"><a href="#fn:31" class="footnote" rel="footnote">24</a></sup> The “Scaled” Dot-Product Attention includes a factor of $\frac{1}{\sqrt{d_k}}$ to prevent the dot products from growing so large that the Softmax function enters a region of near-zero gradients, which would stall learning.<sup id="fnref:32:1" role="doc-noteref"><a href="#fn:32" class="footnote" rel="footnote">25</a></sup></p> <h3 id="kernel-machines-and-the-mapping-paradox">Kernel Machines and the Mapping Paradox</h3> <p>Support Vector Machines (SVMs) and kernel-based methods provide an alternative to deep learning’s explicit feature engineering. 
The “Kernel Trick” allows a model to operate in an implicitly high-dimensional space without ever actually computing the coordinates in that space.<sup id="fnref:35" role="doc-noteref"><a href="#fn:35" class="footnote" rel="footnote">26</a></sup></p> <p>By reformulating the optimization problem into its “Dual Form,” the objective depends only on the dot products between inputs: $\langle x_i, x_j \rangle$.<sup id="fnref:37" role="doc-noteref"><a href="#fn:37" class="footnote" rel="footnote">27</a></sup> Replacing this dot product with a kernel function $K(x_i, x_j)$ effectively maps the data into a high-dimensional feature space where it may be linearly separable.<sup id="fnref:35:1" role="doc-noteref"><a href="#fn:35" class="footnote" rel="footnote">26</a></sup> For example, the Radial Basis Function (RBF) kernel corresponds to an infinite-dimensional feature space, yet it can be computed with a simple exponential function in the original input space.<sup id="fnref:36" role="doc-noteref"><a href="#fn:36" class="footnote" rel="footnote">28</a></sup></p> <table> <thead> <tr> <th>Kernel Type</th> <th>Function $K(x,y)$</th> <th>Geometric Space</th> </tr> </thead> <tbody> <tr> <td>Linear</td> <td>$x^T y$</td> <td>Original input space</td> </tr> <tr> <td>Polynomial</td> <td>$(x^T y + c)^d$</td> <td>Finite-dimensional feature combinations</td> </tr> <tr> <td>Gaussian RBF</td> <td>$\exp(-\gamma |x-y|^2)$</td> <td>Infinite-dimensional space<sup id="fnref:38" role="doc-noteref"><a href="#fn:38" class="footnote" rel="footnote">29</a></sup></td> </tr> <tr> <td>Sigmoid</td> <td>$\tanh(\alpha x^T y + c)$</td> <td>Relates SVMs to Neural Networks</td> </tr> </tbody> </table> <h2 id="generative-modeling-variational-inference-and-diffusion">Generative Modeling: Variational Inference and Diffusion</h2> <p>The frontier of machine learning math is currently dominated by generative models, which require estimating the underlying probability density of high-dimensional data.</p> <h3 
id="variational-autoencoders-vaes-and-the-elbo">Variational Autoencoders (VAEs) and the ELBO</h3> <p>VAEs treat generation as a latent variable problem: we assume data $x$ is generated from a hidden code $z$. The true posterior $P(z \mid x)$ is intractable, so we approximate it with $Q(z \mid x)$ (the encoder).<sup id="fnref:39" role="doc-noteref"><a href="#fn:39" class="footnote" rel="footnote">30</a></sup> To train this, we maximize the Evidence Lower Bound (ELBO)<sup id="fnref:12" role="doc-noteref"><a href="#fn:12" class="footnote" rel="footnote">31</a></sup>:</p> \[\text{ELBO} = \mathbb{E}_{Q(z|x)}[\log P(x|z)] - D_{KL}(Q(z|x) \| P(z))\] <p>The first term is the <strong>Reconstruction Error</strong>, ensuring the decoder can recreate the input from the code. The second is the <strong>KL Regularizer</strong>, which forces the latent codes to follow a standard Gaussian distribution.<sup id="fnref:41" role="doc-noteref"><a href="#fn:41" class="footnote" rel="footnote">32</a></sup> This ensures the latent space is well-behaved, allowing us to sample new points and generate realistic data.</p> <h3 id="diffusion-models-score-based-generative-dynamics">Diffusion Models: Score-Based Generative Dynamics</h3> <p>Diffusion models represent a paradigm shift. Rather than learning a mapping or a lower bound, they learn to reverse a stochastic process.<sup id="fnref:43" role="doc-noteref"><a href="#fn:43" class="footnote" rel="footnote">33</a></sup> The forward process gradually destroys data by adding Gaussian noise until the sample is pure noise.<sup id="fnref:45" role="doc-noteref"><a href="#fn:45" class="footnote" rel="footnote">34</a></sup></p> <p>The model is trained to predict the noise $\epsilon$ that was added at any given step $t$. 
By knowing how to remove the noise, the model can iteratively “denoise” a random sample into a high-quality data point.<sup id="fnref:45:1" role="doc-noteref"><a href="#fn:45" class="footnote" rel="footnote">34</a></sup> Mathematically, this is governed by the Stochastic Differential Equation (SDE):</p> \[dx = f(x, t)dt + g(t)dw\] <p>The reverse process involves the “score function” $\nabla_x \log p(x)$, which points in the direction of increasing data density.<sup id="fnref:44" role="doc-noteref"><a href="#fn:44" class="footnote" rel="footnote">35</a></sup> Modern diffusion models essentially learn this score function, providing a robust mathematical way to sample from complex, high-dimensional manifolds.<sup id="fnref:43:1" role="doc-noteref"><a href="#fn:43" class="footnote" rel="footnote">33</a></sup></p> <h2 id="numerical-pragmatism-the-gap-between-math-and-machine">Numerical Pragmatism: The Gap Between Math and Machine</h2> <p>One of the most persistent failures in machine learning development is the “theoretical success, numerical failure” trap. Mathematical equations assume infinite precision, but hardware operates on finite-precision floating-point numbers.<sup id="fnref:49" role="doc-noteref"><a href="#fn:49" class="footnote" rel="footnote">36</a></sup></p> <h3 id="the-softmax-instability">The Softmax Instability</h3> <p>The Softmax function is mathematically robust but numerically fragile. Large logits cause the exponential function to overflow into <code class="language-plaintext highlighter-rouge">inf</code>, while large negative logits cause underflow to <code class="language-plaintext highlighter-rouge">0</code>, resulting in <code class="language-plaintext highlighter-rouge">NaN</code> gradients.<sup id="fnref:49:1" role="doc-noteref"><a href="#fn:49" class="footnote" rel="footnote">36</a></sup> The standard solution is the <strong>Translation Invariance Trick</strong>: subtracting the maximum value from all logits before exponentiating. 
This ensures that the largest exponent is $e^0 = 1$, preventing overflow and guaranteeing numerical stability.<sup id="fnref:51" role="doc-noteref"><a href="#fn:51" class="footnote" rel="footnote">37</a></sup></p> <h3 id="the-logsumexp-trick">The LogSumExp Trick</h3> <p>In the calculation of cross-entropy, we often encounter the log of a sum of exponentials. A naive implementation would calculate the exponentials, sum them, and then take the log, which is prone to overflow. The stable approach uses the LogSumExp identity:</p> \[\log \sum_i e^{x_i} = \alpha + \log \sum_i e^{x_i - \alpha}\] <p>where $\alpha = \max_i x_i$.<sup id="fnref:49:2" role="doc-noteref"><a href="#fn:49" class="footnote" rel="footnote">36</a></sup> This ensures that intermediate computations stay within the representable range of floating-point numbers.</p> <h3 id="correct-implementation-of-entropy">Correct Implementation of Entropy</h3> <p>Critiques of earlier implementations highlighted that naive entropy calculations using <code class="language-plaintext highlighter-rouge">np.log(p, where=p &gt; 0)</code> can be dangerous if the output is not properly initialized, as it leaves the results at $p=0$ locations as uninitialized garbage values.<sup id="fnref:1:3" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> A robust implementation must explicitly handle the limit $\lim_{p \to 0} p \log p = 0$ to ensure consistency and correctness across the entire domain.<sup id="fnref:1:4" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p> <table> <thead> <tr> <th>Mathematical Operation</th> <th>Potential Numerical Failure</th> <th>Robust Implementation Strategy</th> </tr> </thead> <tbody> <tr> <td>Softmax</td> <td>Overflow ($e^{1000} = \infty$)</td> <td>Subtract maximum logit before $\exp$<sup id="fnref:51:1" role="doc-noteref"><a href="#fn:51" class="footnote" rel="footnote">37</a></sup></td> </tr> <tr> <td>Cross-Entropy</td> <td>Underflow 
($\log(0) = -\infty$)</td> <td>Use LogSoftmax and LogSumExp fusion<sup id="fnref:49:3" role="doc-noteref"><a href="#fn:49" class="footnote" rel="footnote">36</a></sup></td> </tr> <tr> <td>Information Entropy</td> <td>Uninitialized memory at $p=0$</td> <td>Use <code class="language-plaintext highlighter-rouge">np.where</code> with default initialization to zero<sup id="fnref:1:5" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></td> </tr> <tr> <td>Hessian Calculation</td> <td>High memory cost/Instability</td> <td>Use Hessian-Vector Products (HVP)<sup id="fnref:19:3" role="doc-noteref"><a href="#fn:19" class="footnote" rel="footnote">13</a></sup></td> </tr> <tr> <td>SVD</td> <td>Convergence failure on singular matrices</td> <td>Use Moore-Penrose pseudo-inverse via SVD<sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></td> </tr> </tbody> </table> <h2 id="advanced-theoretical-integration-kernels-attention-and-manifold-learning">Advanced Theoretical Integration: Kernels, Attention, and Manifold Learning</h2> <p>The synthesis of these concepts reveals the deeper structure of modern machine learning. For instance, the Attention mechanism can be viewed as a data-dependent kernel where the weights are dynamically computed for each input pair.<sup id="fnref:31:2" role="doc-noteref"><a href="#fn:31" class="footnote" rel="footnote">24</a></sup> Similarly, the success of Diffusion models is intrinsically linked to the spectral properties of the data manifold: the model learns to project noise back onto the low-dimensional manifold where data resides.<sup id="fnref:44:1" role="doc-noteref"><a href="#fn:44" class="footnote" rel="footnote">35</a></sup></p> <p>The distinction between “Sharp” and “Flat” minima provides a bridge between the optimization dynamics of the Hessian and the statistical requirements of generalization. 
A flat minimum is not just a point of low loss; it is a region of high local entropy in the parameter space, suggesting that the solution is not a lucky “overfit” but a robust feature of the data distribution.<sup id="fnref:20:1" role="doc-noteref"><a href="#fn:20" class="footnote" rel="footnote">15</a></sup></p> <h2 id="synthesis-and-recommendations-for-practitioners">Synthesis and Recommendations for Practitioners</h2> <p>The evolution of machine learning mathematics demonstrates that technical robustness is achieved only through the rigorous application of foundational principles. The fundamentals-first, derivation-driven approach used here addresses the community’s concerns regarding “LLM slop” and technical vapidity.<sup id="fnref:1:6" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> By explicitly connecting surprisal to KL divergence, and the Jacobian of Softmax to the cross-entropy gradient, we move from rote memorization to functional understanding.</p> <p>For practitioners looking to improve their models, the focus should be on three critical areas:</p> <ol> <li> <p><strong>Numerical Integrity</strong>: Always use fused loss functions and log-domain calculations to avoid the silent corruption of gradients.<sup id="fnref:49:4" role="doc-noteref"><a href="#fn:49" class="footnote" rel="footnote">36</a></sup></p> </li> <li> <p><strong>Geometric Awareness</strong>: Recognize that model operations are affine, and prioritize architectures that allow for flexible decision boundary placement.<sup id="fnref:2:3" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p> </li> <li> <p><strong>Curvature Monitoring</strong>: In high-stakes applications, move beyond monitoring simple training loss. 
Analyzing the Hessian spectrum or the local flatness of the solution provides the only reliable indicator of how a model will perform on unseen, real-world data.<sup id="fnref:21:2" role="doc-noteref"><a href="#fn:21" class="footnote" rel="footnote">14</a></sup></p> </li> </ol> <p>The future of machine learning lies in this intersection of physics-inspired dynamics (Diffusion), information theory (Entropy), and the geometry of high-dimensional spaces (SVD/Kernels). As models continue to scale, the mathematical “shortcuts” of the past will increasingly fail, leaving only those who understand the foundational rigor of the field capable of driving its next major breakthroughs.<sup id="fnref:1:7" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p> <hr /> <h2 id="citation">Citation</h2> <p>If you found this blog post helpful, please consider citing it:</p> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">obasi2026MLmathfoundations</span><span class="p">,</span> <span class="na">title</span> <span class="p">=</span> <span class="s">"Architectural and Mathematical Foundations of Machine Learning: A Rigorous Synthesis of Theory, Geometry, and Implementation"</span><span class="p">,</span> <span class="na">author</span> <span class="p">=</span> <span class="s">"Obasi, Chizoba"</span><span class="p">,</span> <span class="na">journal</span> <span class="p">=</span> <span class="s">"chizkidd.github.io"</span><span class="p">,</span> <span class="na">year</span> <span class="p">=</span> <span class="s">"2026"</span><span class="p">,</span> <span class="na">month</span> <span class="p">=</span> <span class="s">"Feb"</span><span class="p">,</span> <span class="na">url</span> <span class="p">=</span> <span class="s">"https://chizkidd.github.io/2026/02/09/mathematical-machine-learning-foundations/"</span> <span class="p">}</span> 
</code></pre></div></div> <hr /> <h2 id="references">References</h2> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1" role="doc-endnote"> <p><a href="https://news.ycombinator.com/item?id=45050931">Important machine learning equations</a>. Hacker News. 2025. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:1:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:1:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:1:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:1:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a> <a href="#fnref:1:7" class="reversefootnote" role="doc-backlink">&#8617;<sup>8</sup></a></p> </li> <li id="fn:2" role="doc-endnote"> <p>Ian Quah. <a href="https://ianq.ai/Hessian-Jacobian/">Fundamentals Part 2: Hessians and Jacobians</a>. ianq.ai. 2018. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:2:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:3" role="doc-endnote"> <p><a href="https://towardsdatascience.com/eigen-intuitions-understanding-eigenvectors-and-eigenvalues-630e9ef1f719/">Eigen Intuitions: Understanding Eigenvectors and Eigenvalues</a>. Towards Data Science. 2022. 
<a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:4" role="doc-endnote"> <p><a href="https://www.math.utoronto.ca/mpugh/Teaching/MAT267_19/Geometric_description_of_SVD.pdf">The geometry of linear transformations</a>. Department of Mathematics, University of Toronto. 2019. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:5" role="doc-endnote"> <p><a href="https://math.stackexchange.com/questions/320220/intuitively-what-is-the-difference-between-eigendecomposition-and-singular-valu">Intuitively, what is the difference between Eigendecomposition and Singular Value Decomposition?</a> Mathematics Stack Exchange. 2013. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:6" role="doc-endnote"> <p><a href="https://math.stackexchange.com/questions/1450097/geometrical-interpretations-of-svd">Geometrical interpretations of SVD</a>. Mathematics Stack Exchange. 2018. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:7" role="doc-endnote"> <p>Sidharth SS. <a href="https://medium.com/@sidharth.ss/entropy-cross-entropy-and-kl-divergence-mathematical-foundations-and-applications-6a6f23da5ef1">Entropy, Cross-Entropy, and KL Divergence: Mathematical Foundations and Applications</a>. Medium. 2025. 
<a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:7:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:7:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:7:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:7:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:7:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:7:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a></p> </li> <li id="fn:8" role="doc-endnote"> <p>Eli Bendersky. <a href="https://eli.thegreenplace.net/2025/cross-entropy-and-kl-divergence/">Cross-entropy and KL divergence</a>. eli.thegreenplace. 2025. <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:8:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:8:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:8:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:10" role="doc-endnote"> <p><a href="https://www.reddit.com/r/MachineLearning/comments/7vhmp7/d_a_short_introduction_to_entropy_crossentropy/">A Short Introduction to Entropy, Cross-Entropy and KL-Divergence</a>. r/MachineLearning. Reddit. 2018. <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:15" role="doc-endnote"> <p><a href="https://www.reddit.com/r/quant/comments/1muhmro/why_are_the_hessian_and_jacobian_matrices/">Why are the Hessian and Jacobian matrices important?</a>. r/quant. Reddit. 2025. <a href="#fnref:15" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:17" role="doc-endnote"> <p><a href="https://www.geeksforgeeks.org/engineering-mathematics/jacobian-and-hessian-matrices/">Jacobian and Hessian Matrices</a>. GeeksforGeeks. 2025. 
<a href="#fnref:17" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:17:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:16" role="doc-endnote"> <p><a href="https://www.datacamp.com/tutorial/hessian-matrix">Hessian Matrix: A Guide to Second-Order Derivatives in Optimization and Beyond</a>. DataCamp. 2025. <a href="#fnref:16" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:19" role="doc-endnote"> <p>Dayal Singh Kalra, et al. <a href="https://arxiv.org/abs/2601.16979">A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs</a>. arXiv:2601.16979 (2026). <a href="#fnref:19" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:19:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:19:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:19:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:21" role="doc-endnote"> <p>Ferenc Huszár. <a href="https://www.inference.vc/sharp-vs-flat-minima-are-still-a-mystery-to-me/">The Generalization Mystery: Sharp vs Flat Minima</a>. inFERENCe.vc. 2018. <a href="#fnref:21" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:21:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:21:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p> </li> <li id="fn:20" role="doc-endnote"> <p><a href="https://www.emergentmind.com/topics/flat-minima-and-generalization">Flat Minima and Generalization</a>. Emergent Mind. 2025. <a href="#fnref:20" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:20:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:22" role="doc-endnote"> <p>Tuan-Anh Bui. 
<a href="https://tuananhbui89.github.io/blog/2024/sharpness/">Connection between Flatness and Generalization</a>. tuananhbui89.github.io. 2024. <a href="#fnref:22" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:24" role="doc-endnote"> <p>Stanford CS109. <a href="https://web.stanford.edu/class/archive/cs/cs109/cs109.1218/files/student_drive/7.5.pdf">7.5: Maximum A Posteriori Estimation</a>. 2018. <a href="#fnref:24" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:24:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:26" role="doc-endnote"> <p><a href="https://www.geeksforgeeks.org/data-science/mle-vs-map/">MLE vs MAP</a>. GeeksforGeeks. 2025. <a href="#fnref:26" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:26:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:26:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:26:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:28" role="doc-endnote"> <p>Bohsun Chen. <a href="https://medium.com/@devcharlie2698619/the-intuition-behind-maximum-likelihood-estimation-mle-and-maximum-a-posteriori-estimation-map-b8ba1ba1078f">The Intuition behind Maximum Likelihood Estimation (MLE) and Maximum A Posteriori Estimation (MAP)</a>. Medium. 2024. <a href="#fnref:28" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:28:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:25" role="doc-endnote"> <p>Agustinus Kristiadi. <a href="https://agustinus.kristia.de/blog/mle-vs-map/">MLE vs MAP: the connection between Maximum Likelihood and Maximum A Posteriori Estimation</a>. agustinus.kristia.de. 2017. <a href="#fnref:25" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:13" role="doc-endnote"> <p>Thomas Kurbiel. 
<a href="https://medium.com/data-science/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1">Derivative of the Softmax Function and the Categorical Cross-Entropy Loss</a>. Medium. 2021. <a href="#fnref:13" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:13:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:18" role="doc-endnote"> <p><a href="https://www.mldawn.com/back-propagation-with-cross-entropy-and-softmax/">Back-propagation with Cross-Entropy and Softmax</a>. MLDawn Academy. <a href="#fnref:18" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:30" role="doc-endnote"> <p>Nitin Mittapally. <a href="https://medium.com/@nitinmittapally/understanding-attention-in-transformers-a-visual-guide-df416bfe495a">Understanding Attention in Transformers: A Visual Guide</a>. Medium. 2025. <a href="#fnref:30" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:30:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:31" role="doc-endnote"> <p>Michael Brenndoerfer. <a href="https://mbrenndoerfer.com/writing/query-key-value-attention-mechanism">Query, Key, Value: The Foundation of Transformer Attention</a>. mbrenndoerfer.com. 2025. <a href="#fnref:31" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:31:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:31:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p> </li> <li id="fn:32" role="doc-endnote"> <p>Lili Jiang. <a href="https://medium.com/data-science/how-gpt-works-a-metaphoric-explanation-of-key-value-query-in-attention-using-a-tale-of-potion-8c66ace1f470">How GPT works: A Metaphoric Explanation of Key, Value, Query in Attention</a>. Medium. 2023. 
<a href="#fnref:32" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:32:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:35" role="doc-endnote"> <p>Nguyen Ha Thai Son. <a href="https://medium.com/data-science-collective/kernel-trick-under-the-hood-246ca9b36bae">Kernel Trick Under The Hood</a>. Medium. 2025. <a href="#fnref:35" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:35:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:37" role="doc-endnote"> <p><a href="http://www.columbia.edu/~mh2078/MachineLearningORFE/SVMs_MasterSlides.pdf">Support Vector Machines (and the Kernel Trick)</a>. Columbia University. <a href="#fnref:37" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:36" role="doc-endnote"> <p><a href="https://stats.stackexchange.com/questions/152897/how-to-intuitively-explain-what-a-kernel-is">How to intuitively explain what a kernel is?</a> Cross Validated (Stack Exchange). 2018. <a href="#fnref:36" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:38" role="doc-endnote"> <p>Sanghavi Harsh. <a href="https://medium.com/@sanghaviharsh666/mastering-svm-kernel-tricks-a-comprehensive-guide-to-dual-problems-and-kernel-functions-612bfff2061e">Mastering SVM Kernel Tricks: A Comprehensive Guide to Dual Problems and Kernel Functions</a>. Medium. 2024. <a href="#fnref:38" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:39" role="doc-endnote"> <p>Tony Duan. <a href="https://github.com/tonyduan/variational-autoencoders">Variational autoencoder implemented in PyTorch</a>. GitHub. <a href="#fnref:39" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:12" role="doc-endnote"> <p>Matthew N. Bernstein. <a href="https://mbernste.github.io/posts/elbo/">The evidence lower bound (ELBO)</a>. mbernste.github.io. 2020. 
<a href="#fnref:12" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:41" role="doc-endnote"> <p><a href="https://en.wikipedia.org/wiki/Evidence_lower_bound">Evidence lower bound</a>. Wikipedia. <a href="#fnref:41" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:43" role="doc-endnote"> <p>Diederik P. Kingma, Ruiqi Gao. <a href="https://openreview.net/forum?id=NnMEadcdyD">Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation</a>. OpenReview. 2023. <a href="#fnref:43" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:43:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:45" role="doc-endnote"> <p>Yazhou Li. <a href="https://flaneur2020.github.io/posts/2024-07-22-diffusion-model/">Notes on Diffusion Model: Intuition</a>. flaneur2020.github.io. 2024. <a href="#fnref:45" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:45:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:44" role="doc-endnote"> <p>Katie Keegan. <a href="https://katiekeegan.org/2025/08/11/diffeqs.html">Diffusion Models and (Many) Differential Equations</a>. katiekeegan.org. 2025. <a href="#fnref:44" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:44:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:49" role="doc-endnote"> <p><a href="https://www.marktechpost.com/2026/01/06/implementing-softmax-from-scratch-avoiding-the-numerical-stability-trap/">Implementing Softmax From Scratch: Avoiding the Numerical Stability Trap</a>. MarktechPost. 2026. 
<a href="#fnref:49" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:49:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:49:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:49:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:49:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a></p> </li> <li id="fn:51" role="doc-endnote"> <p>Jay Mody. <a href="https://jaykmody.com/blog/stable-softmax/">Numerically Stable Softmax and Cross Entropy</a>. jaykmody.com. 2022. <a href="#fnref:51" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:51:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> </ol> </div> Mon, 09 Feb 2026 00:00:00 +0000 https://chizkidd.github.io//2026/02/09/mathematical-machine-learning-foundations/ https://chizkidd.github.io//2026/02/09/mathematical-machine-learning-foundations/ A Complete Guide to Neural Network Optimizers <hr /> <blockquote> <p><strong>TLDR:</strong> This guide covers 8 neural network optimizers, from SGD to Muon. <strong>For most tasks, start with Adam or AdamW</strong>; they’re robust and require minimal tuning. <strong>For large language models, consider Muon</strong> for up to 2x faster training. <strong>For computer vision with proper learning rate scheduling, SGD+Momentum often achieves the best final accuracy</strong>. 
Each optimizer addresses the limitations of its predecessors, from basic SGD through adaptive methods (Adam/AdamW) to modern matrix-aware approaches (Muon).</p> </blockquote> <hr /> <h2 id="table-of-contents">Table of Contents</h2> <ol> <li><a href="#quick-reference-optimizer-comparison">Quick Reference: Optimizer Comparison</a></li> <li><a href="#when-to-use-which-optimizer">When to Use Which Optimizer</a></li> <li><a href="#optimizers-explained">Optimizers Explained</a> <ul> <li><a href="#1-stochastic-gradient-descent-sgd">Stochastic Gradient Descent (SGD)</a></li> <li><a href="#2-momentum">Momentum</a></li> <li><a href="#3-nesterov-momentum">Nesterov Momentum</a></li> <li><a href="#4-adagrad">AdaGrad</a></li> <li><a href="#5-rmsprop">RMSProp</a></li> <li><a href="#6-adam-adaptive-moment-estimation">Adam (Adaptive Moment Estimation)</a></li> <li><a href="#7-adamw">AdamW</a></li> <li><a href="#8-muon-momentum-orthogonalized-by-newton-schulz">Muon (MomentUm Orthogonalized by Newton-Schulz)</a></li> </ul> </li> <li><a href="#detailed-technical-comparison">Detailed Technical Comparison</a></li> <li><a href="#hyperparameter-reference">Hyperparameter Reference</a></li> <li><a href="#common-pitfalls-and-how-to-avoid-them">Common Pitfalls and How to Avoid Them</a></li> <li><a href="#conclusion">Conclusion</a></li> </ol> <hr /> <p>Training neural networks is fundamentally an optimization problem: we’re searching for the best set of weights that minimize our loss function. While the concept sounds straightforward, the path from random initialization to a well-trained model is rarely a smooth descent. The landscape of loss functions in high-dimensional spaces is filled with valleys, plateaus, and saddle points that can trap or slow down naive optimization approaches.</p> <p>This is where optimization algorithms come in. Over the years, researchers have developed increasingly sophisticated methods to navigate these challenging landscapes more efficiently. 
Each optimizer addresses the limitations of its predecessors, introducing new mechanisms to accelerate convergence, handle sparse gradients, or adapt to different learning scenarios.</p> <p>In this guide, we’ll explore eight key optimization techniques: SGD, Momentum, Nesterov Momentum, AdaGrad, RMSProp, Adam, AdamW, and Muon. We’ll examine how each one works, what problems it solves, and when you might want to use it.</p> <hr /> <h2 id="quick-reference-optimizer-comparison">Quick Reference: Optimizer Comparison</h2> <table> <thead> <tr> <th>Optimizer</th> <th>Key Feature</th> <th>Improves On</th> <th>Pros</th> <th>Cons</th> </tr> </thead> <tbody> <tr> <td>SGD</td> <td>Simple gradient descent</td> <td>N/A</td> <td>Easy to implement</td> <td>Oscillation, fixed learning rate</td> </tr> <tr> <td>Momentum</td> <td>Gradient accumulation</td> <td>SGD</td> <td>Reduces oscillations</td> <td>No anticipation of future trends</td> </tr> <tr> <td>Nesterov</td> <td>Lookahead gradients</td> <td>Momentum</td> <td>Better convergence</td> <td>Slightly higher computation</td> </tr> <tr> <td>AdaGrad</td> <td>Adaptive learning rates</td> <td>Nesterov</td> <td>Handles sparse gradients</td> <td>Learning rate decays too fast</td> </tr> <tr> <td>RMSProp</td> <td>Smoothed adaptive learning rates</td> <td>AdaGrad</td> <td>Stabilizes learning rates</td> <td>Sensitive to hyperparameters</td> </tr> <tr> <td>Adam</td> <td>Momentum + RMSProp</td> <td>RMSProp</td> <td>Combines best features</td> <td>May converge to suboptimal minima</td> </tr> <tr> <td>AdamW</td> <td>Decoupled weight decay</td> <td>Adam</td> <td>Better generalization</td> <td>Requires tuning decay parameter</td> </tr> <tr> <td>Muon</td> <td>Matrix orthogonalization</td> <td>AdamW</td> <td>33% less memory, automatic LR transfer, faster convergence</td> <td>Only for 2D matrices, requires hybrid approach</td> </tr> </tbody> </table> <hr /> <h2 id="when-to-use-which-optimizer">When to Use Which Optimizer</h2> <p>The flowchart 
below will help you quickly choose the right optimizer for your task:</p> <div class="mermaid"> graph TD Start([Choose Your Optimizer]) --&gt; Q1{What are you training?} Q1 --&gt;|Large Language Model<br />Transformer| Q2{Model size?} Q1 --&gt;|Computer Vision<br />CNN/ResNet| Q3{Priority?} Q1 --&gt;|Other/Mixed/Unsure| Default["<b><font size="5">AdamW</font></b><br />LR=0.001, <br />weight decay=0.01<br />"] Q2 --&gt;|&lt; 1B parameters| Adam1["<b><font size="5">AdamW</font></b><br />LR=3e-4<br />"] Q2 --&gt;|&gt; 1B parameters| Q4{Can implement<br />hybrid setup?} Q4 --&gt;|Yes| Muon1["<b><font size="5">Muon + AdamW</font></b><br />"] Q4 --&gt;|No| Adam1 Q3 --&gt;|Speed/Prototyping| Adam2["<b><font size="5">Adam</font></b><br />LR=0.001<br />"] Q3 --&gt;|Best Final Accuracy| Q5{Can tune learning<br />rate schedule?} Q5 --&gt;|Yes| SGD1["<b><font size="5">SGD + Momentum</font></b><br />LR=0.01 to 0.1, momentum=0.9<br />+ Cosine/Step schedule<br />"] Q5 --&gt;|No| Adam2 style Start fill:#4a90e2,color:#fff style Default fill:none,stroke:#2ecc71,stroke-width:3px style Adam1 fill:none,stroke:#2ecc71,stroke-width:3px style Adam2 fill:none,stroke:#2ecc71,stroke-width:3px style Muon1 fill:none,stroke:#f39c12,stroke-width:3px style SGD1 fill:none,stroke:#f39c12,stroke-width:3px classDef question fill:#e8f4f8,stroke:#4a90e2,stroke-width:2px class Q1,Q2,Q3,Q4,Q5 question </div> <p><strong>Key for Flowchart:</strong></p> <ul> <li><strong>Blue-filled</strong>: Starting point and decision questions</li> <li><strong>Green Border</strong>: Recommended safe defaults, works well out-of-the-box</li> <li><strong>Orange Border</strong>: Advanced options with higher payoff but more tuning</li> </ul> <h3 id="detailed-guidance">Detailed Guidance</h3> <p><strong>For Large Language Models (LLMs):</strong></p> <ul> <li>Models &lt; 1B params: <code class="language-plaintext highlighter-rouge">AdamW</code> (lr=3e-4, betas=(0.9, 0.95))</li> <li>Models &gt; 1B params: <code 
class="language-plaintext highlighter-rouge">Muon</code> + <code class="language-plaintext highlighter-rouge">AdamW</code> hybrid (possible 2x speedup)</li> </ul> <p><strong>For Computer Vision:</strong></p> <ul> <li>Quick prototyping: <code class="language-plaintext highlighter-rouge">Adam</code> (lr=0.001)</li> <li>Best accuracy: <code class="language-plaintext highlighter-rouge">SGD</code> + <code class="language-plaintext highlighter-rouge">Momentum</code> + <code class="language-plaintext highlighter-rouge">LR scheduling</code> (lr=0.01-0.1)</li> </ul> <p><strong>Special Cases:</strong></p> <ul> <li>NLP with Sparse features: <code class="language-plaintext highlighter-rouge">Adam</code> or <code class="language-plaintext highlighter-rouge">AdaGrad</code> (lr=0.001-0.01)</li> <li>Memory constrained: <code class="language-plaintext highlighter-rouge">Muon</code> or <code class="language-plaintext highlighter-rouge">SGD+Momentum</code></li> <li>Fast experimentation: <code class="language-plaintext highlighter-rouge">Adam/AdamW</code></li> </ul> <p><strong>When in doubt:</strong> Start with <code class="language-plaintext highlighter-rouge">AdamW</code> (lr=0.001, weight_decay=0.01). It’s a solid default choice for almost any task.</p> <hr /> <h2 id="1-stochastic-gradient-descent-sgd">1. 
<a href="https://projecteuclid.org/euclid.aoms/1177729586">Stochastic Gradient Descent (SGD)</a></h2> <p><strong>How It Works:</strong> Updates weights by calculating gradients using a small batch of data.</p> \[w_t = w_{t-1} - \eta \nabla f(w_{t-1})\] <p><strong>Pros:</strong></p> <ul> <li>Simple and computationally efficient</li> <li>Works well with large datasets</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Can oscillate or converge slowly, especially in narrow valleys or near saddle points</li> <li>Learning rate (η) is fixed, leading to potential overshooting or slow convergence</li> </ul> <p><strong>Code:</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch.optim</span> <span class="k">as</span> <span class="n">optim</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="2-momentum">2. 
<a href="https://hengshuaiyao.github.io/papers/polyak64.pdf">Momentum</a></h2> <p><strong>How It Works:</strong> Accumulates gradients to build momentum in directions with consistent gradients.</p> <p>\(v_t = \beta v_{t-1} - \eta \nabla f(w_{t-1})\)<br /> \(w_t = w_{t-1} + v_t\)</p> <p><strong>Pros:</strong></p> <ul> <li>Speeds up convergence in shallow but consistent directions (e.g., valleys)</li> <li>Reduces oscillations compared to SGD</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Still overshoots if the learning rate is too high</li> <li>Cannot predict future gradient directions</li> </ul> <p><strong>Improvement Over SGD:</strong> Addresses oscillation and slow convergence by incorporating past gradients.</p> <p><strong>Code:</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">momentum</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="3-nesterov-momentum">3. 
<a href="https://proceedings.mlr.press/v28/sutskever13.pdf">Nesterov Momentum</a></h2> <p><strong>How It Works:</strong> Looks ahead by computing gradients at the projected position.</p> <p>\(v_t = \beta v_{t-1} - \eta \nabla f(w_{t-1} + \beta v_{t-1})\)<br /> \(w_t = w_{t-1} + v_t\)</p> <p><strong>Pros:</strong></p> <ul> <li>More precise updates by considering where the momentum is leading</li> <li>Accelerates convergence further compared to vanilla momentum</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Slightly more computationally expensive due to gradient computation at the lookahead point</li> </ul> <p><strong>Improvement Over Momentum:</strong> Anticipates future gradient directions, resulting in better convergence.</p> <p><strong>Code:</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">momentum</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">nesterov</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="4-adagrad">4. 
<a href="https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">AdaGrad</a></h2> <p><strong>How It Works:</strong> Adjusts the learning rate for each parameter based on the magnitude of past gradients.</p> <p>\(g_t = \nabla f(w_{t-1})\)<br /> \(w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t, \quad G_t = \sum_{i=1}^t g_i^2\)</p> <p><strong>Pros:</strong></p> <ul> <li>Works well for sparse gradients (e.g., NLP tasks)</li> <li>Automatically adapts learning rates for each parameter</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Learning rate diminishes too quickly due to cumulative gradient sum, leading to potential underfitting</li> </ul> <p><strong>Improvement Over Nesterov Momentum:</strong> Introduces adaptive learning rates to handle sparse gradients.</p> <hr /> <h2 id="5-rmsprop">5. <a href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">RMSProp</a></h2> <p><strong>How It Works:</strong> Modifies AdaGrad by using an exponentially weighted moving average of past squared gradients instead of a cumulative sum.</p> <p>\(v_t = \beta v_{t-1} + (1 - \beta)(\nabla f(w_{t-1}))^2\)<br /> \(w_t = w_{t-1} - \frac{\eta}{\sqrt{v_t + \epsilon}} \nabla f(w_{t-1})\)</p> <p><strong>Pros:</strong></p> <ul> <li>Prevents the learning rate from diminishing too quickly</li> <li>Suitable for non-stationary objectives</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Sensitive to hyperparameter choices (e.g., β)</li> </ul> <p><strong>Improvement Over AdaGrad:</strong> Stabilizes learning rates by introducing an exponentially weighted average of squared gradients.</p> <hr /> <h2 id="6-adam-adaptive-moment-estimation">6. 
<a href="https://arxiv.org/pdf/1412.6980">Adam</a> (Adaptive Moment Estimation)</h2> <p><strong>How It Works:</strong> Combines Momentum (first moment) and RMSProp (second moment).</p> <ul> <li> <p>Update rules:<br /> \(m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla f(w_{t-1})\)<br /> \(v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla f(w_{t-1}))^2\)<br /></p> </li> <li> <p>Bias corrections:<br /> \(\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\)<br /></p> </li> <li> <p>Update step:<br /> \(w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\)</p> </li> </ul> <p><strong>Pros:</strong></p> <ul> <li>Combines the benefits of Momentum and RMSProp</li> <li>Automatically adjusts learning rates for each parameter</li> <li>Bias correction ensures stability in early training</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>May converge to suboptimal solutions in some scenarios (e.g., small datasets or high regularization)</li> <li>Hyperparameter tuning can be challenging</li> </ul> <p><strong>Improvement Over RMSProp:</strong> Adds momentum and bias correction to handle noisy gradients and early instability.</p> <p><strong>Code:</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">betas</span><span class="o">=</span><span class="p">(</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.999</span><span class="p">),</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 
id="7-adamw">7. <a href="https://arxiv.org/pdf/1711.05101">AdamW</a></h2> <p><strong>How It Works:</strong></p> <p>Decouples weight decay from the gradient update to improve generalization.</p> \[w_t = w_{t-1} - \eta \bigg( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_{t-1} \bigg)\] <p><strong>Pros:</strong></p> <ul> <li>Better generalization compared to Adam</li> <li>Retains benefits of adaptive learning rates</li> </ul> <p><strong>Cons:</strong></p> <ul> <li>Still requires careful hyperparameter tuning</li> </ul> <p><strong>Improvement Over Adam:</strong> Decouples weight decay from gradient updates, improving generalization performance.</p> <p><strong>Code (Common Settings for Transformers):</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">3e-4</span><span class="p">,</span> <span class="n">betas</span><span class="o">=</span><span class="p">(</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">),</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="8-muon-momentum-orthogonalized-by-newton-schulz">8. <a href="https://kellerjordan.github.io/posts/muon/">Muon</a> (MomentUm Orthogonalized by Newton-Schulz)</h2> <p><strong>How It Works:</strong></p> <p>Muon is designed specifically for 2D weight matrices in neural network hidden layers (Linear layers). 
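</p>

<p>Because Muon applies only to 2D hidden-layer weights, practical setups pair it with a second optimizer for everything else. A minimal sketch of that routing logic (the parameter names and the <code>split_params</code> helper are illustrative, not part of any official API):</p>

```python
# Illustrative Muon + AdamW parameter split. The names below are
# hypothetical; see github.com/KellerJordan/Muon for the reference setup.

def split_params(named_shapes):
    """Route 2D hidden-layer weight matrices to Muon; everything else
    (biases, norm scales, embeddings, classifier head) to AdamW."""
    muon_group, adamw_group = [], []
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        # Embeddings and the output head are conventionally excluded
        # from Muon even though they are 2D.
        is_excluded = name.startswith(("embed", "lm_head"))
        (muon_group if is_matrix and not is_excluded else adamw_group).append(name)
    return muon_group, adamw_group

params = [
    ("embed.weight",        (50304, 768)),  # embedding       -> AdamW
    ("blocks.0.mlp.weight", (3072, 768)),   # hidden Linear   -> Muon
    ("blocks.0.mlp.bias",   (3072,)),       # 1D bias         -> AdamW
    ("blocks.0.ln.weight",  (768,)),        # norm scale      -> AdamW
    ("lm_head.weight",      (50304, 768)),  # classifier head -> AdamW
]
muon_group, adamw_group = split_params(params)
```

<p>In a real training loop, each group would then be handed to its own optimizer instance (Muon and AdamW, respectively).</p>

<p>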
Unlike traditional optimizers that treat each parameter independently, Muon leverages the geometric structure of weight matrices by orthogonalizing gradients using the Newton-Schulz iteration.</p> <p>The optimizer formulates weight updates as a constrained optimization problem in the RMS-to-RMS operator norm space:</p> \[\min_{\Delta W} \langle G, \Delta W \rangle \quad \text{subject to} \quad \|\Delta W\|_{op,RMS} \leq \beta\] <p>Here $G$ is the gradient matrix. The solution projects the gradient onto the set of (semi-)orthogonal matrices, which standardizes all singular values to 1 while preserving gradient directions. Its GitHub implementation can be found <a href="https://github.com/KellerJordan/Muon">here</a>.</p> <p><strong>Update Rules:</strong></p> <ol> <li> <p><strong>Momentum accumulation:</strong> \(V_t = \mu V_{t-1} + G_t\)</p> </li> <li> <p><strong>Newton-Schulz orthogonalization (5 iterations):</strong> \(Z_0 = \frac{V_t}{\|V_t\|_F}\) \(Z_{i+1} = aZ_i + bZ_i^3 + cZ_i^5\)</p> <p>Default coefficients: $(a, b, c) = (3.4445, -4.775, 2.0315)$. The odd matrix powers are computed as $Z^3 = (ZZ^T)Z$ and $Z^5 = (ZZ^T)^2 Z$.</p> </li> <li> <p><strong>Weight update:</strong> \(W_t = W_{t-1} - \eta \cdot Z_{\text{final}} - \eta \lambda W_{t-1}\)</p> </li> </ol> <p><strong>Important:</strong> Muon should <strong>only</strong> be applied to 2D weight matrices (hidden layer Linear layers).
All other parameters (embeddings, biases, normalization layers, classifier heads) must use a standard optimizer like AdamW.</p> <p><strong>Pros:</strong></p> <ul> <li><strong>Memory efficient:</strong> Only tracks momentum (no second moment statistics like Adam), reducing memory by ~33% compared to Adam</li> <li><strong>Automatic learning rate transfer:</strong> Learning rates transfer across different network widths without retuning</li> <li><strong>Superior convergence:</strong> Faster training than Adam/AdamW, especially for transformers and large models <ul> <li>Improved CIFAR-10 training speed record from 3.3 to 2.6 A100-seconds for 94% accuracy</li> <li>Improved NanoGPT speedrunning record by 1.35x</li> <li>Trained 1.5B transformer to GPT-2 XL performance in 10 hours vs 13.3 hours with AdamW</li> </ul> </li> <li><strong>Better saddle point handling:</strong> Orthogonalization helps escape saddle points more effectively</li> <li><strong>Scalable:</strong> Performance improvements increase with model size</li> </ul> <p><strong>Cons:</strong></p> <ul> <li><strong>Hybrid approach required:</strong> Must use AdamW or another optimizer for non-2D parameters</li> <li><strong>Higher computational cost:</strong> Newton-Schulz iterations add ~5% overhead (though Turbo-Muon reduces this to ~1%)</li> <li><strong>Implementation complexity:</strong> More complex than standard optimizers</li> <li><strong>Limited to dense layers:</strong> Only applicable to Linear layers with dense activations</li> </ul> <p><strong>Improvement Over AdamW:</strong> Exploits the matrix structure of neural network weights rather than treating parameters independently. This geometric approach provides automatic scaling properties and faster convergence while using less memory. 
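</p>

<p>To make the orthogonalization step concrete, here is a minimal NumPy sketch of the Newton-Schulz iteration described above, using the default coefficients $(3.4445, -4.775, 2.0315)$. This is an illustrative translation for clarity, not the reference implementation; the official Muon repo runs the same iteration on PyTorch tensors (in bfloat16) inside the optimizer step.</p>

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately project a matrix onto the set of (semi-)orthogonal
    matrices: singular values are pushed toward 1 while the singular
    vectors (the gradient directions) are preserved."""
    a, b, c = 3.4445, -4.775, 2.0315   # default quintic coefficients
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize so every singular value is <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:                     # iterate in the wide orientation so X @ X.T stays small
        X = X.T
    for _ in range(steps):
        A = X @ X.T                    # odd polynomial in X: a*X + b*(XX^T)X + c*(XX^T)^2 X
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

<p>Applied to a random gradient-shaped matrix, five iterations bring the singular values into a band around 1, which is what gives Muon its well-conditioned updates. The iteration only converges approximately, but that is sufficient in practice.</p>

<p>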
Particularly effective for transformer architectures and language model pre-training.</p> <p><strong>Code:</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">muon</span> <span class="kn">import</span> <span class="n">MuonWithAuxAdam</span> <span class="c1"># Separate parameters by type </span><span class="n">hidden_weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">body</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">ndim</span> <span class="o">&gt;=</span> <span class="mi">2</span><span class="p">]</span> <span class="n">hidden_gains_biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">body</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">ndim</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">]</span> <span class="n">nonhidden_params</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">model</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="o">*</span><span class="n">model</span><span class="p">.</span><span class="n">embed</span><span class="p">.</span><span class="n">parameters</span><span class="p">()]</span> <span class="c1"># Create parameter groups </span><span 
class="n">param_groups</span> <span class="o">=</span> <span class="p">[</span> <span class="nb">dict</span><span class="p">(</span><span class="n">params</span><span class="o">=</span><span class="n">hidden_weights</span><span class="p">,</span> <span class="n">use_muon</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.02</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">),</span> <span class="nb">dict</span><span class="p">(</span><span class="n">params</span><span class="o">=</span><span class="n">hidden_gains_biases</span><span class="o">+</span><span class="n">nonhidden_params</span><span class="p">,</span> <span class="n">use_muon</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="mf">3e-4</span><span class="p">,</span> <span class="n">betas</span><span class="o">=</span><span class="p">(</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">),</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">),</span> <span class="p">]</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">MuonWithAuxAdam</span><span class="p">(</span><span class="n">param_groups</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="detailed-technical-comparison">Detailed Technical Comparison</h2> <table> <thead> <tr> <th>Method</th> <th>Working Mechanism</th> <th>Pros</th> <th>Cons</th> <th>Improvement Over Prior Method</th> </tr> </thead> <tbody> <tr> <td><strong>SGD</strong></td> <td>Updates weights using gradients calculated on mini-batches. 
$w_t = w_{t-1} - \eta\nabla f(w_{t-1})$</td> <td>Simple, computationally efficient</td> <td>Oscillates, slow convergence, fixed learning rate</td> <td>-</td> </tr> <tr> <td><strong>Momentum</strong></td> <td>Accumulates gradients to build momentum for smoother updates. $v_t = \beta v_{t-1} - \eta\nabla f(w_{t-1})$, $w_t = w_{t-1} + v_t$</td> <td>Speeds up convergence, reduces oscillations</td> <td>May overshoot, lacks anticipation of future gradients</td> <td>Reduces oscillations and improves convergence speed</td> </tr> <tr> <td><strong>Nesterov</strong></td> <td>Looks ahead to compute gradients at a projected future position. $v_t = \beta v_{t-1} - \eta\nabla f(w_{t-1} + \beta v_{t-1})$, $w_t = w_{t-1} + v_t$</td> <td>More precise updates, faster convergence</td> <td>Slightly more computationally expensive</td> <td>Anticipates future gradient directions</td> </tr> <tr> <td><strong>AdaGrad</strong></td> <td>Adjusts learning rates based on accumulated squared gradients. $w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}}g_t$, $G_t = \sum g_i^2$</td> <td>Adapts learning rates, good for sparse gradients</td> <td>Learning rate diminishes too quickly, potential underfitting</td> <td>Introduces adaptive learning rates for sparse features</td> </tr> <tr> <td><strong>RMSProp</strong></td> <td>Uses exponentially weighted moving averages of squared gradients. $v_t = \beta v_{t-1} + (1-\beta)g_t^2$, $w_t = w_{t-1} - \frac{\eta}{\sqrt{v_t + \epsilon}}g_t$</td> <td>Prevents learning rate decay, handles non-stationary objectives</td> <td>Sensitive to hyperparameters (e.g., β)</td> <td>Stabilizes learning rates using moving averages</td> </tr> <tr> <td><strong>Adam</strong></td> <td>Combines Momentum (1st moment) and RMSProp (2nd moment) with bias correction. 
$w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$</td> <td>Fast convergence, handles noisy gradients</td> <td>May converge to suboptimal minima in some cases</td> <td>Combines momentum and adaptive learning rates</td> </tr> <tr> <td><strong>AdamW</strong></td> <td>Decouples weight decay from gradient updates. $w_t = w_{t-1} - \eta[\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_{t-1}]$</td> <td>Better generalization, retains Adam’s benefits</td> <td>Requires tuning of decay parameter</td> <td>Improves generalization by decoupling weight decay</td> </tr> <tr> <td><strong>Muon</strong></td> <td>Accumulates momentum, then orthogonalizes the momentum buffer $V_t$ of each weight matrix via the Newton-Schulz iteration. $V_t = \mu V_{t-1} + G_t$, $W_t = W_{t-1} - \eta \cdot \text{NS}(V_t) - \eta\lambda W_{t-1}$</td> <td>Fast convergence, memory efficient, automatic LR transfer across model sizes</td> <td>Only for 2D parameters, requires hybrid approach with AdamW</td> <td>Leverages matrix geometry for better conditioning and faster training</td> </tr> </tbody> </table> <hr /> <h2 id="hyperparameter-reference">Hyperparameter Reference</h2> <table> <thead> <tr> <th>Method</th> <th>Hyperparameter</th> <th>Meaning</th> <th>Typical Values</th> <th>Tuning Suggestions</th> </tr> </thead> <tbody> <tr> <td><strong>SGD</strong></td> <td>Learning rate ($\eta$)</td> <td>Step size for updating weights</td> <td>0.01 to 0.1</td> <td>Start with a smaller value and adjust based on convergence</td> </tr> <tr> <td><strong>Momentum</strong></td> <td>Momentum coefficient ($\beta$)</td> <td>Controls the contribution of past gradients to the current update</td> <td>0.9</td> <td>Keep fixed at 0.9 or tune slightly</td> </tr> <tr> <td><strong>Nesterov</strong></td> <td>Momentum coefficient ($\beta$)</td> <td>Same as Momentum, with anticipation of future gradients</td> <td>0.9</td> <td>Same as Momentum</td> </tr> <tr> <td><strong>AdaGrad</strong></td> <td>Learning
rate ($\eta$)</td> <td>Base learning rate scaled by the inverse square root of accumulated squared gradients</td> <td>0.01</td> <td>Lower than SGD learning rates to avoid overshooting</td> </tr> <tr> <td><strong>RMSProp</strong></td> <td>Learning rate ($\eta$)</td> <td>Similar to AdaGrad, with smoothing via an exponential moving average</td> <td>0.001 to 0.01</td> <td>Tune for stability based on loss</td> </tr> <tr> <td> </td> <td>Decay rate ($\beta$)</td> <td>Smoothing parameter for the moving average of squared gradients</td> <td>0.9</td> <td>Commonly fixed at 0.9</td> </tr> <tr> <td><strong>Adam</strong></td> <td>Learning rate ($\eta$)</td> <td>Base learning rate for parameter updates</td> <td>0.001</td> <td>Often works well without much tuning</td> </tr> <tr> <td> </td> <td>$\beta_1$</td> <td>Decay rate for the first moment (mean of gradients)</td> <td>0.9</td> <td>Usually fixed</td> </tr> <tr> <td> </td> <td>$\beta_2$</td> <td>Decay rate for the second moment (variance of gradients)</td> <td>0.999</td> <td>Keep fixed or tune slightly for sensitivity</td> </tr> <tr> <td> </td> <td>$\epsilon$</td> <td>Small value to avoid division by zero</td> <td>$10^{-8}$ (PyTorch default)</td> <td>Rarely changed</td> </tr> <tr> <td><strong>AdamW</strong></td> <td>Learning rate ($\eta$)</td> <td>Same as Adam</td> <td>0.001</td> <td>Same as Adam</td> </tr> <tr> <td> </td> <td>$\beta_1$, $\beta_2$, $\epsilon$</td> <td>Same as Adam</td> <td>0.9, 0.999, $10^{-8}$</td> <td>Same as Adam</td> </tr> <tr> <td> </td> <td>Weight decay ($\lambda$)</td> <td>Regularization parameter to control overfitting by penalizing large weights</td> <td>$10^{-4}$ to $10^{-2}$</td> <td>Start small and increase if overfitting is observed</td> </tr> <tr> <td><strong>Muon</strong></td> <td>Learning rate ($\eta$)</td> <td>Base learning rate for matrix updates</td> <td>0.02 (can be 5-10x larger than Adam)</td> <td>Start with 0.02, can use much larger values than Adam</td> </tr> <tr> <td> </td> <td>Momentum
($\mu$)</td> <td>Momentum coefficient</td> <td>0.95</td> <td>Usually fixed at 0.95</td> </tr> <tr> <td> </td> <td>Weight decay ($\lambda$)</td> <td>Regularization parameter</td> <td>0.01</td> <td>Same as AdamW</td> </tr> <tr> <td> </td> <td>Nesterov</td> <td>Whether to use Nesterov momentum</td> <td>True</td> <td>Typically enabled</td> </tr> <tr> <td> </td> <td>NS coefficients $(a,b,c)$</td> <td>Newton-Schulz polynomial coefficients</td> <td>(3.4445, -4.775, 2.0315)</td> <td>Rarely changed, but can be tuned for specific architectures</td> </tr> <tr> <td> </td> <td><strong>For non-2D params</strong></td> <td>Use AdamW with standard settings</td> <td>$\eta$ = 3e-4, $\beta_1$ = 0.9, $\beta_2$ = 0.95</td> <td>Keep separate learning rate for embeddings/biases</td> </tr> </tbody> </table> <hr /> <h2 id="common-pitfalls-and-how-to-avoid-them">Common Pitfalls and How to Avoid Them</h2> <p>Even with the right optimizer, certain mistakes can derail your training. Here are the most common issues:</p> <h3 id="1-using-adam-without-learning-rate-decay">1. Using Adam without Learning Rate Decay</h3> <p><strong>Problem:</strong> Adam can fail to converge to optimal solutions without learning rate scheduling.</p> <p><strong>Solution:</strong> Always use a learning rate scheduler with Adam/AdamW, especially for long training runs.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scheduler</span> <span class="o">=</span> <span class="n">CosineAnnealingLR</span><span class="p">(</span><span class="n">optimizer</span><span class="p">,</span> <span class="n">T_max</span><span class="o">=</span><span class="n">epochs</span><span class="p">)</span> </code></pre></div></div> <h3 id="2-sgd-learning-rate-too-high">2. 
SGD Learning Rate Too High</h3> <p><strong>Problem:</strong> Divergence, exploding gradients, NaN losses.</p> <p><strong>Solution:</strong> Start with a conservative learning rate (0.01-0.1) and use warmup:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Warmup for first 5 epochs </span><span class="k">if</span> <span class="n">epoch</span> <span class="o">&lt;</span> <span class="mi">5</span><span class="p">:</span> <span class="n">lr</span> <span class="o">=</span> <span class="n">base_lr</span> <span class="o">*</span> <span class="p">(</span><span class="n">epoch</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="mi">5</span> <span class="k">else</span><span class="p">:</span> <span class="n">lr</span> <span class="o">=</span> <span class="n">base_lr</span> </code></pre></div></div> <h3 id="3-confusing-adam-and-adamw">3. Confusing Adam and AdamW</h3> <p><strong>Problem:</strong> Using <code class="language-plaintext highlighter-rouge">torch.optim.Adam</code> when you meant to use weight decay.</p> <p><strong>Critical:</strong> In PyTorch, <code class="language-plaintext highlighter-rouge">Adam</code> with <code class="language-plaintext highlighter-rouge">weight_decay</code> parameter is <strong>NOT</strong> the same as <code class="language-plaintext highlighter-rouge">AdamW</code>!</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># WRONG - This is L2 regularization, not weight decay </span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span 
class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span> <span class="c1"># CORRECT - Use AdamW for proper weight decay </span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span> </code></pre></div></div> <h3 id="4-not-separating-parameter-groups-for-muon">4. Not Separating Parameter Groups for Muon</h3> <p><strong>Problem:</strong> Applying Muon to all parameters (embeddings, biases, etc.) causes training instability.</p> <p><strong>Solution:</strong> Only use Muon for 2D weight matrices. 
Use AdamW for everything else:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Correctly separate parameters </span><span class="n">hidden_weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">ndim</span> <span class="o">&gt;=</span> <span class="mi">2</span><span class="p">]</span> <span class="n">other_params</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">ndim</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">]</span> </code></pre></div></div> <h3 id="5-forgetting-gradient-clipping">5. 
Forgetting Gradient Clipping</h3> <p><strong>Problem:</strong> Training instability, especially with RNNs, transformers, or high learning rates.</p> <p><strong>Solution:</strong> Add gradient clipping before optimizer step:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">max_norm</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span> </code></pre></div></div> <h3 id="6-using-adagrad-for-long-training">6. Using AdaGrad for Long Training</h3> <p><strong>Problem:</strong> Learning rate diminishes to nearly zero, causing training to stall.</p> <p><strong>Solution:</strong> Use RMSProp or Adam instead for long training runs. AdaGrad works best for shorter, sparse gradient scenarios.</p> <h3 id="7-ignoring-batch-size-effects">7. 
Ignoring Batch Size Effects</h3> <p><strong>Problem:</strong> Optimizer performance varies dramatically with batch size.</p> <p><strong>Key Rule:</strong> Larger batch sizes often require larger learning rates:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Linear scaling rule (approximate) </span><span class="n">lr</span> <span class="o">=</span> <span class="n">base_lr</span> <span class="o">*</span> <span class="p">(</span><span class="n">batch_size</span> <span class="o">/</span> <span class="n">base_batch_size</span><span class="p">)</span> </code></pre></div></div> <h3 id="8-not-using-different-optimizers-for-different-parameters">8. Not Using Different Optimizers for Different Parameters</h3> <p><strong>Problem:</strong> Embeddings and classifier heads may need different learning rates than the main network.</p> <p><strong>Solution:</strong> Use parameter groups:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">([</span> <span class="p">{</span><span class="s">'params'</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">embedding</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="s">'lr'</span><span class="p">:</span> <span class="mf">1e-3</span><span class="p">},</span> <span class="p">{</span><span class="s">'params'</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">encoder</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="s">'lr'</span><span class="p">:</span> <span class="mf">3e-4</span><span class="p">},</span> <span class="p">{</span><span class="s">'params'</span><span 
class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="s">'lr'</span><span class="p">:</span> <span class="mf">5e-4</span><span class="p">}</span> <span class="p">])</span> </code></pre></div></div> <h3 id="9-misunderstanding-momentum-hyperparameters">9. Misunderstanding Momentum Hyperparameters</h3> <p><strong>Problem:</strong> Using $\beta_1 = 0.9$ for both Adam and SGD without understanding the difference.</p> <p><strong>Key Insight:</strong></p> <ul> <li>SGD Momentum: 0.9 is standard</li> <li>Adam $\beta_1$ : 0.9 is standard</li> <li>But they behave differently! Adam’s momentum is applied to normalized gradients.</li> </ul> <h3 id="10-not-validating-optimizer-setup">10. Not Validating Optimizer Setup</h3> <p><strong>Problem:</strong> Subtle bugs in optimizer configuration go unnoticed until poor results.</p> <p><strong>Solution:</strong> Always verify your setup:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Check which parameters are being optimized </span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Optimizing </span><span class="si">{</span><span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">param_groups</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">'params'</span><span class="p">])</span><span class="si">}</span><span class="s"> parameters"</span><span class="p">)</span> <span class="c1"># Verify learning rates </span><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span 
class="n">group</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">optimizer</span><span class="p">.</span><span class="n">param_groups</span><span class="p">):</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Group </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">: lr=</span><span class="si">{</span><span class="n">group</span><span class="p">[</span><span class="s">'lr'</span><span class="p">]</span><span class="si">}</span><span class="s">, params=</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">group</span><span class="p">[</span><span class="s">'params'</span><span class="p">])</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> </code></pre></div></div> <hr /> <h2 id="conclusion">Conclusion</h2> <p>Choosing the right optimizer can dramatically impact your model’s training efficiency and final performance. While there’s no universal “best” optimizer, understanding the strengths and weaknesses of each approach helps you make informed decisions for your specific use case.</p> <p>For most modern deep learning applications, <strong>Adam</strong> and <strong>AdamW</strong> have emerged as go-to choices due to their robust performance across diverse tasks with minimal hyperparameter tuning. Adam’s combination of momentum and adaptive learning rates makes it particularly effective for handling noisy gradients and training deep networks, while AdamW’s improved weight decay mechanism often leads to better generalization.</p> <p><strong>Muon</strong> represents a paradigm shift in optimization by explicitly leveraging the matrix structure of neural network weights. For large-scale language model training, Muon has demonstrated consistent speed improvements over AdamW while using significantly less memory. 
Its ability to automatically transfer learning rates across model sizes makes it particularly valuable for scaling experiments. However, its requirement for a hybrid approach (using AdamW for non-matrix parameters) adds implementation complexity. If you’re training large transformers and have the engineering resources to implement it properly, Muon is worth serious consideration.</p> <p>Regardless of which optimizer you choose, <strong>learning rate scheduling</strong> is crucial for achieving optimal results. Modern training almost always combines an optimizer with a schedule like cosine annealing, step decay, or warmup-then-decay. The Adam paper’s promise of “little tuning required” applies to the optimizer’s internal hyperparameters ($\beta_1$, $\beta_2$), but you should still tune the learning rate and use scheduling for best results.</p> <p>However, don’t overlook the classics. <strong>SGD with Momentum</strong> remains highly competitive, especially for computer vision tasks, and often achieves better final test accuracy when combined with proper learning rate scheduling. For problems with sparse gradients, such as natural language processing with large vocabularies, <strong>AdaGrad</strong> or <strong>RMSProp</strong> might be more appropriate.</p> <p>The key takeaway is that optimizer selection should be guided by your problem’s characteristics: dataset size, gradient sparsity, computational budget, and generalization requirements. Start with a well-established baseline (Adam is usually a safe bet), monitor your training dynamics, and don’t hesitate to experiment with alternatives if you’re not seeing the convergence behavior you expect.</p> <p>As the field continues to evolve, new optimizers and variants will undoubtedly emerge. But the fundamental principles underlying these eight methods (managing learning rates, leveraging momentum, adapting to gradient statistics, and combining optimizers where needed) will remain central to training neural networks effectively.
At the same time, new optimizers like Muon (2024) show that there’s still room for innovation. Stay curious, read the papers linked throughout this guide as well as new ones as they appear, and don’t be afraid to experiment with different optimizers for your specific use case.</p> Thu, 22 Jan 2026 00:00:00 +0000 https://chizkidd.github.io//2026/01/22/neural-net-optimizers/