Chunk-Guided Q-Learning

¹Yonsei University, ²UC Berkeley
vs. Single-step TD Learning

1-step TD suffers from bootstrapping bias, where errors compound over many backups. CGQ mitigates this with guidance from a chunked critic.

vs. Action-Chunked TD Learning

Chunked TD is poor at stitching across states within a chunk, leading to suboptimal value estimates. CGQ retains single-step TD, recovering the optimal reactive policy.

Overview

Reducing long-horizon error accumulation in single-step TD learning via chunk guidance.

  • Chunk-Guided Q-Learning (CGQ) is an offline RL method that trains a reactive single-step policy while leveraging action chunking to stabilize value learning over long horizons.
  • CGQ is stable and performant. The chunk-based critic provides a longer-range bootstrap target, reducing compounding error. The single-step critic preserves full reactivity, recovering fine-grained trajectory stitching that chunked methods cannot.

Challenges

Single-step RL

Bootstrapping Error Accumulation

Each Bellman backup bootstraps from the critic's own estimates. Small errors compound across hundreds of steps, overwhelming the learning signal on long-horizon tasks. This problem worsens as the discount factor \(\gamma \to 1\), which is typical in long-horizon settings.
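This compounding effect can be made concrete with a toy simulation (illustrative, not from the paper): tabular backups on a deterministic chain, with Gaussian noise injected into each update to stand in for function-approximation error. Error variance grows with the number of bootstrapping steps between a state and the goal.

```python
import numpy as np

# Toy illustration: a length-50 deterministic chain with reward 1 on reaching
# the goal. Each backup V(s) <- r + gamma * V(s+1) is corrupted by Gaussian
# noise (sigma = 0.05) to mimic function-approximation error. Because each
# backup bootstraps from an already-noisy estimate, error compounds with the
# number of bootstrapping steps: states far from the goal end up with much
# larger value error than states near it.
N, gamma, sigma, trials = 50, 0.99, 0.05, 500
rng = np.random.default_rng(0)

v_star = gamma ** np.arange(N - 2, -1, -1)   # V*(s) = gamma^(steps to goal - 1)
abs_err = np.zeros(N - 1)
for _ in range(trials):
    v = np.zeros(N)                          # v[N-1] is the terminal goal state
    for s in range(N - 2, -1, -1):           # one noisy backward value sweep
        r = 1.0 if s == N - 2 else 0.0
        v[s] = r + gamma * v[s + 1] + rng.normal(0.0, sigma)
    abs_err += np.abs(v[: N - 1] - v_star)
abs_err /= trials

print(f"mean |error| near goal: {abs_err[N - 2]:.3f}")
print(f"mean |error| at start:  {abs_err[0]:.3f}")
```

Even though every individual update adds the same amount of noise, the start state's error is several times larger than the near-goal error, purely because its value is the result of many chained bootstraps.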

Action-Chunked RL

Suboptimal Value Learning

Chunked TD assumes the policy executes a fixed action sequence without reacting to intermediate states. This restricts the policy class to open-loop sequences, making the value function structurally suboptimal: \(Q^*_{\text{chunk}} \leq Q^*\) in general.
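A two-step toy MDP (illustrative, not from the paper) makes the gap concrete: after the first action, the environment lands in state A or B with probability 1/2 each, and the second action earns reward 1 only if it matches the realized state. A reactive policy always matches; an open-loop chunk of length 2 must commit in advance and matches only half the time.

```python
# Toy illustration of Q*_chunk <= Q*: the closed-loop policy observes the
# intermediate state before acting, while the open-loop length-2 chunk must
# fix its second action before the stochastic transition resolves.
gamma = 0.99
states, actions = ["A", "B"], ["pick_A", "pick_B"]
p_next = {s: 0.5 for s in states}           # uniform transition after step 1
reward = lambda s, a: 1.0 if a == f"pick_{s}" else 0.0

# Closed-loop optimal value: choose the best action in each intermediate state.
v_closed = gamma * sum(p_next[s] * max(reward(s, a) for a in actions)
                       for s in states)

# Open-loop chunked value: commit to one second action before seeing the state.
v_chunk = gamma * max(sum(p_next[s] * reward(s, a) for s in states)
                      for a in actions)

print(f"closed-loop V* = {v_closed:.3f}, open-loop chunk value = {v_chunk:.3f}")
```

The open-loop chunk achieves only half the closed-loop value here, and no choice of fixed action sequence can close the gap: the suboptimality is structural, not a matter of training.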

Visualized in a Gridworld

To isolate these problems without confounding factors like function approximation, we study all three approaches in a controlled 18×18 gridworld environment. The agent must navigate from a start position to a fixed goal.

Dataset: 60 trajectories from an ε-greedy policy (ε=0.9)

Ground-truth optimal value function V*

The dataset covers diverse paths but is far from optimal, a realistic offline RL setting. We add Gaussian noise (σ=0.05) to each value update to simulate the effect of function approximation error. We then compare how each method's estimated value function evolves over training iterations.
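The propagation-speed difference between 1-step and chunked backups can also be sketched with a chain as a stand-in for the gridworld paths (illustrative assumptions: fixed forward update order, which is the worst case for backward value propagation, and reward only at the goal). An h-step target moves value information h states per sweep instead of one.

```python
import numpy as np

# Toy illustration: on a 30-state chain with value 1 at the goal, count how
# many full sweeps (in fixed forward state order) it takes for the start
# state's value to become nonzero under 1-step vs h-step (h = 4) backups.
# The h-step target V(s) <- gamma^h * V(s+h) propagates information h states
# per sweep, so chunked backups spread value much faster early in training.
def sweeps_until_start_nonzero(h, n=30, gamma=0.99):
    v = np.zeros(n)
    v[n - 1] = 1.0                           # terminal goal value
    for sweep in range(1, 10_000):
        for s in range(n - 1):               # forward order: slowest propagation
            target = min(s + h, n - 1)
            v[s] = gamma ** (target - s) * v[target]
        if v[0] > 0:
            return sweep
    return None

print("1-step backups:", sweeps_until_start_nonzero(1), "sweeps")
print("4-step backups:", sweeps_until_start_nonzero(4), "sweeps")
```

This mirrors the animations below: chunked backups spread value quickly, while 1-step backups need roughly one sweep per state of distance to the goal.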

Learned Value Function Over Iterations

Compare each method's learned V(s) against the ground-truth V*. Colors closer to those of V* indicate better estimates; regions darker than V* indicate value overestimation.

Single-step TD
Values spread, but many regions become darker than V* due to accumulated overestimation from repeated bootstrapping.

Chunked TD (h=4)
Fast early spread, but plateaus. Even states close to the goal fail to reach the correct value due to open-loop constraints.

CGQ (Ours)
Converges to match V*. Fast propagation from chunking, correct final values from single-step reactivity.

Value Estimation Error Over Iterations

The animations below show the MSE between each method's value function and V* at every iteration. Brighter regions = larger error.

Single-step TD
Error grows with distance from the goal. Farther states require more bootstrapping steps, accumulating larger estimation errors.

Chunked TD (h=4)
Error drops fast, but plateaus. The open-loop assumption prevents optimal convergence.

CGQ (Ours)
Fast and accurate, achieving the lowest final error.

Method

  • CGQ has two main components: an action-chunked critic \(Q_{\phi_c}(s, \mathbf{a})\) and a single-step critic \(Q_\phi(s, a)\).
  • The action-chunked critic \(Q_{\phi_c}(s, \mathbf{a})\) is trained with \(h\)-step TD backups to provide a stable, long-horizon value signal.
  • The single-step critic \(Q_\phi(s, a)\) is trained to minimize the standard 1-step TD loss while being regularized toward the chunked critic with the following loss:
    \[ \underbrace{\mathbb{E}_{\substack{s_t,a_t,r_t,s_{t+1} \sim \mathcal{D}\\a'_{t+1} \sim \pi}} \left[ \left( Q_{\phi}(s_t, a_t) - r_t - \gamma Q_{\bar\phi}(s_{t+1}, a'_{t+1}) \right)^2 \right]}_{\text{1-step TD loss } \mathcal{L}^{\text{TD}}(\phi)} + \beta \cdot \underbrace{\mathbb{E}_{s_t, \mathbf{a}_t \sim \mathcal{D}} \Big[ \ell_\tau \big( Q_{\phi_c}(s_t, \mathbf{a}_t) - Q_\phi(s_t, a_t) \big) \Big]}_{\text{Distillation loss } \mathcal{L}^{\text{reg}}(\phi)} \]
    where \(\beta\) is a hyperparameter that balances the two losses, \(\ell_\tau\) denotes the asymmetric expectile loss, and \(\mathbf{a}_t = (a_t, \dots, a_{t+h-1})\) is the \(h\)-step action chunk from the dataset.
  • The output of CGQ is the reactive single-step policy, extracted from \(Q_\phi(s, a)\).
  • Note that the upper-expectile loss asymmetrically biases \(Q_\phi(s, a)\) toward higher chunk-based estimates, which provides stabilization while remaining robust to the chunked critic's suboptimality.
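The combined critic objective above can be sketched in NumPy (shapes and helper names here are illustrative, not the authors' implementation): the 1-step TD loss \(\mathcal{L}^{\text{TD}}\) plus \(\beta\) times the expectile distillation term \(\mathcal{L}^{\text{reg}}\).

```python
import numpy as np

# Minimal sketch of the CGQ critic loss on a batch of precomputed critic
# outputs: L(phi) = L_TD(phi) + beta * L_reg(phi).
def expectile_loss(u, tau):
    # l_tau(u) = |tau - 1{u < 0}| * u^2 : asymmetric squared loss that, for
    # tau > 0.5, penalizes underestimating the chunked critic more heavily.
    return np.abs(tau - (u < 0.0)) * u ** 2

def cgq_loss(q, q_target_next, q_chunk, r, gamma=0.99, beta=1.0, tau=0.9):
    # q:             Q_phi(s_t, a_t) for the batch
    # q_target_next: Q_phibar(s_{t+1}, a'_{t+1}) from the target network
    # q_chunk:       Q_phi_c(s_t, a_t..a_{t+h-1}) from the chunked critic
    td_loss = np.mean((q - (r + gamma * q_target_next)) ** 2)
    reg_loss = np.mean(expectile_loss(q_chunk - q, tau))
    return td_loss + beta * reg_loss

# Dummy batch: with tau = 0.9, a chunked estimate above Q_phi (first entry)
# pulls it upward with weight 0.9, while one below (second entry) pushes
# down with weight only 0.1.
q = np.array([0.5, 0.8])
loss = cgq_loss(q, q_target_next=np.array([0.6, 0.7]),
                q_chunk=np.array([0.9, 0.6]), r=np.array([0.0, 0.0]))
print(f"combined loss: {loss:.4f}")
```

The asymmetry is the key design choice: setting \(\tau > 0.5\) lets the chunked critic lift \(Q_\phi\) toward its longer-range estimates without forcing \(Q_\phi\) to track the chunked critic's structurally suboptimal values when they fall below the single-step estimate.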

Experiments

Tasks

We evaluate on 30 manipulation and 15 locomotion tasks from OGBench.

Results

Offline RL results aggregated across 45 tasks.

Task domains: Scene, Cube, Puzzle, Antmaze, Humanoidmaze

Aggregated Results
Full Results Table

Analysis & Q&A

❓ How is CGQ different from n-step returns?

  • Both CGQ and methods using n-step returns (e.g., NFQL) learn a single-step critic. However, n-step returns still bootstrap from the same single-step critic at the tail, so they inherit the same error accumulation problem.
  • CGQ instead introduces an auxiliary action-chunked critic trained separately using chunked value backups. The single-step critic is regularized toward it, enabling stable value learning.

❓ Is CGQ the best critic design combining single-step and chunked RL?

We compared CGQ against several alternatives:

  • Distill: \(\mathcal{L}(\phi) = \mathcal{L}^{\text{reg}}(\phi)\). Removes the 1-step TD loss completely and only distills from the chunked critic. This loses the single-step compositionality and reactivity.
  • Max: Uses \(\max(r + \gamma \max_{a'} Q_{\bar{\phi}}(s', a'), Q_{\phi_c}(s, \mathbf{a}))\) as the TD target. Takes the maximum between the 1-step target and chunked Q-value. This proves to be aggressively overoptimistic.
  • Opposite: \(\mathcal{L}(\phi_c) = \mathcal{L}^{\text{TD}}(\phi_c) + \beta \cdot \mathcal{L}^{\text{reg}}(\phi_c)\). Applies the same structure but in reverse, regularizing the chunked critic toward the single-step critic.
Ablation on Critic Design

❓ How does chunk size h affect performance?

Larger chunks reduce the effective backup depth, providing a more stable regularization signal. We evaluated h ∈ {5, 10, 15, 20} and found that performance generally improves with larger h. However, we anticipate this trend will diminish for excessively large chunks, as learning an accurate chunked critic becomes more difficult, potentially leading to misleading guidance.

Impact of Chunk Size

❓ How sensitive is CGQ to the distillation coefficient β?

β controls the strength of chunk guidance. If β is too small, the single-step critic receives insufficient stabilization, while an excessively large β may cause it to over-rely on the suboptimal chunked critic. We find that β ∈ {0.1, 1.0} works well, whereas performance drops sharply when β is reduced to 0.01 or below. This highlights the necessity of providing an adequate amount of chunked guidance to stabilize learning.

Regularization sensitivity

❓ How sensitive is the expectile parameter τ?

The expectile parameter τ determines the degree of asymmetry in the distillation loss. While a larger τ places a stronger bias on the critic to match higher chunk-based estimates, we observe that CGQ is generally robust to this choice. Values ranging from 0.5 to 0.95 all lead to comparable final performance (around 60%), indicating that precise tuning of τ is not strictly necessary.

Expectile sensitivity