Back to Portfolio
Independent Research for Autonomi

harlm: hierarchical adaptive recursive language models

A Design Study · January 2026 · SSRN

A design and analytical study of four extensions to Recursive Language Models -- a DAG-based parallel scheduler, a learned adaptive router, a hot/warm/cold hierarchical memory, and an information-weighted token budget. Empirical validation is described as a planned follow-up.

Lokesh Mure|Independent Researcher|Version 2

Key Contributions

01

DAG-based parallel scheduler

Runs independent llm_query calls concurrently with optional speculative branching, relaxing the synchronous critical path.

02

Learned adaptive router

A small classifier routes each input to direct, REPL-only, or full hierarchical mode, removing uniform-strategy overhead on short inputs.

03

Hot/warm/cold hierarchical memory

A three-tier cache that bounds any single aggregation context independent of recursion depth.

04

Information-weighted token budget

Distributes a global token budget across the recursion tree by chunk size, information density, and query relevance.

Scope Note

This paper is a design study. It contains analytical arguments with explicit assumptions but no empirical measurements. The planned follow-up paper will report benchmark results against the predictions in Section 4 of the PDF.

Abstract

Recursive Language Models (RLMs; Zhang, Kraska, and Khattab, 2025) treat the user prompt as an external variable inside a Python REPL and let the model emit code to answer the query. The emitted code can include recursive llm_querycalls on substrings of the prompt. This paper is a design study. I examine the published RLM harness and identify three structural inefficiencies: synchronous execution of independent sub-queries, a single level of recursion, and uniform application of the REPL even to inputs that fit in the base model's effective window.

I propose Hierarchical Adaptive Recursive Language Models (HARLM)as a design that addresses each one while keeping the RLM interface unchanged. HARLM adds four components: (a) a DAG-based parallel scheduler that runs independent llm_query calls concurrently, with an optional speculative-branching mode for ambiguous decompositions; (b) a small learned router that sends each input to direct, REPL-only, or full hierarchical mode; (c) a three-tier hot/warm/cold memory that bounds the size of any single aggregation call as recursion deepens; and (d) an information-weighted token budget distributed across the recursion tree.

I analyze each component formally: the scheduler against a parallel-execution bound, the router against the scaling classes of Zhang et al., the memory against an explicit context-size bound, and the budget allocator against a read-at-least-once lower bound on input tokens. I state the assumptions each argument relies on (additive separability of task information, independence of sub-results, bounded router error) and discuss where they fail. Empirical validation on the Zhang et al. benchmark suite is described as a planned follow-up and is not part of this paper. The purpose of releasing the design before the experiments is to make the analysis reviewable on its own terms.

Three Observations About the Reference RLM

Reading the published RLM harness and benchmark results, three structural properties of the design stand out as inefficiencies whose cost, latency, or accuracy impact can be reasoned about without re-running the original experiments. Each property suggests a specific modification, and each HARLM component is designed to relax one of them.

  1. Synchronous sub-query execution. In the published RLM harness, generated Python code runs inside a single REPL whose llm_querycalls execute sequentially. The wall-clock time of K independent sub-queries grows linearly in K, even when those sub-queries are data-independent and could in principle run concurrently. (Proposition 1 in the paper.)
  2. Single-level recursion. Sub-calls in the published RLM are non-recursive: llm_query invokes a base model, not another RLM. This pushes all aggregation onto a single level of fan-out. Whenever the aggregated sub-results exceed the base model's effective window, the aggregator is back inside the long-context regime the RLM wrapper was designed to avoid. (Proposition 2.)
  3. Uniform inference strategy. The harness applies the same REPL-based orchestration regardless of whether the input fits in the base model's effective window. On short items, the REPL adds code-generation latency and token overhead without a corresponding accuracy benefit. (Proposition 3.)

These are not impossibility results for RLM-style systems. They are descriptions of the reference design. Each HARLM component relaxes one of them.

HARLM Architecture

Input (P, Q)

Adaptive Router

T5-large encoder, ~335M params

Simple

Direct LLM

Base model only

Code

REPL-Only

No sub-LLM calls

Complex

Full HARLM

DAG + Memory

DAG Scheduler

Hierarchical Memory

Output
Figure 1: HARLM architecture. Short or direct-solvable inputs skip the REPL; code-only aggregation goes through REPL-only; everything else enters the DAG scheduler with hierarchical memory.

HARLM extends RLMs with four integrated components, each relaxing one of the three observations above:

  1. DAG Scheduler: Runs independent llm_query calls concurrently, with optional speculative branching for ambiguous decompositions. Relaxes the synchronous critical path.
  2. Hierarchical Memory: Three-tier hot/warm/cold cache that bounds any single aggregation context independent of recursion depth. Relaxes the depth-1 aggregation pressure.
  3. Adaptive Router: A learned classifier that routes inputs to direct, REPL-only, or full HARLM mode. Relaxes the uniform-strategy overhead on short inputs.
  4. Budget Allocator: An information-weighted token budget distributed across the recursion tree. Cross-cutting resource allocator connecting the other three.

Parallel and Speculative Scheduling

When the HARLM agent emits code containing multiple llm_query calls, a conservative AST walk builds a dependency DAG: any call whose argument touches the output of another call is declared dependent even when the dependency is trivially avoidable. Overestimating dependencies gives up parallelism but cannot introduce incorrect concurrent execution. Independent batches are dispatched with bounded parallelism P. For the common map-reduce pattern (“split context into K chunks, llm_query each, aggregate”), behavior collapses to batched parallel execution.

For items where the best decomposition is ambiguous, the scheduler can run up to three candidate strategies in parallel and cancel pending ones as soon as any strategy returns a high-confidence answer. This is a latency hedge, not a cost reducer.

Learned Adaptive Routing

The router maps (prompt, query) pairs to one of three modes: Direct (skip the REPL entirely), REPL-only (execute code over context but issue no sub-model calls), or Full HARLM (engage the DAG scheduler and hierarchical memory). The proposed architecture is a T5-large encoder (24 encoder layers, ~335M params) with a 3-way classification head, fed mean-pooled token embeddings plus three scalar features (log of input length, effective-window estimate, task-class estimate).

The loss combines cross-entropy on oracle labels with a cost penalty and a performance penalty. The cost penalty is asymmetric by design: misrouting up (to a more powerful mode than needed) is penalized less than misrouting down (to a mode that cannot solve the item). A router that occasionally over-routes costs more but preserves correctness.

Hierarchical Memory

If the pipeline is allowed to recurse beyond a single level, the aggregation step at the root sees b^d sub-results for branching factor b and depth d. For b=8, d=4 this is 4096 sub-results, which by itself reintroduces the long-context problem the RLM was designed to bypass. HARLM's three-tier memory bounds the context presented to any single aggregation call independent of depth:

  • Hot (H): LRU cache of the k_H most recent sub-results at full fidelity.
  • Warm (W): LRU cache of k_W older sub-results, each compressed by a smaller model. The compression prompt preserves entities, numerals, and task-relevant predicates; loss is concentrated in paraphrasable prose.
  • Cold (C): Vector index over the remaining sub-results, retrieved with hybrid dense + BM25 search.
H

Hot Cache

LRU, full fidelity, capacity k_H

1.0x
W

Warm Cache

Small-model compression, capacity k_W

γx
C

Cold Storage

Vector + BM25 index, retrieve top-k_C per query

retr
Figure 2: Three-tier hierarchical memory. Any single aggregation call sees at most k_H full-fidelity entries, k_W compressed entries, and k_C retrieved entries, independent of recursion depth.

Information-Weighted Token Budgeting

Given a global token budget B, the allocator distributes tokens to nodes in the recursion tree in proportion to a score combining chunk size, a learned information-density estimator, and a query-relevance score. Density is a regression head on top of the router's encoder, trained jointly with the routing head; relevance is a dot product between cached sub-result embeddings and the query embedding. This allocator is a heuristic; the paper does not claim it is optimal.

Analytical Study

Each component is analyzed against the observation it is designed to relax. The assumptions each argument relies on are stated explicitly, along with where they fail.

Scheduler vs. the Sequential Bound

Partitioning K independent queries into ceil(K/P) batches of size at most P gives a wall-clock time of roughly ceil(K/P) · (mean-latency + sync-overhead), an approximate factor-P speedup over the sequential K · mean-latency bound. This relies on two non-trivial assumptions: synchronization overhead being dominated by per-call network latency (likely true in practice), and the provider's concurrency cap P_max being at least P (typical researcher-tier accounts cap at 10-20 in-flight requests, which bounds the achievable speedup on long-tail items).

Router vs. the Short-Input Overhead

The paper derives a condition on router accuracy under which routing reduces expected cost over always-full-mode. The condition is satisfied whenever the ratio between full-mode cost and the cheaper modes is large and the router has any non-trivial accuracy. On the RLM benchmarks of Zhang et al., the reported cost ratios are large, so even a moderately accurate router should recover a meaningful fraction of the short-input overhead.

Memory vs. the Depth-1 Aggregation Pressure

The context size presented to any single aggregation call is bounded by k_H · mean-sub-result-size + k_W · mean-sub-result-size / gamma + k_C · mean-retrieved-chunk-size, independent of recursion depth and total number of sub-results. The cost is information lost to compression (warm) and retrieval recall (cold).

Token Budget vs. a Read-at-Least-Once Lower Bound

Theorem 1 (informal): Under additive separability of task information across chunks, a deterministic evaluator with no auxiliary oracle, and a per-call maximum context length W, any algorithm that answers a task in scaling class C_f must perform at least ceil(f(N)/W) base-model calls, and the total number of input tokens consumed across all calls is at least f(N) (counted with multiplicity).

Proposition 6 (informal): Under additive separability, effective window W, and a router that correctly identifies C_1 items within W_eff, HARLM's expected input-token cost is within a log factor of the lower bound on three scaling classes: O(W) for constant-class tasks (S-NIAH), O(N log(N/W)) for linear tasks (OOLONG-style aggregation), and O((N²/W) log²(N/W)) for quadratic tasks (OOLONG-Pairs). The assumption that bites hardest in practice is additive separability: for tasks with long-range coreference, these bounds give the wrong units.

Where the Cost Savings Should Come From

The analysis produces a falsifiable prediction for the planned empirical follow-up:

  • The router should be the dominant contributor to cost reduction, because it directly attacks the largest constant term in the RLM wrapper's cost.
  • The parallel scheduler should be the dominant contributor to latency reduction.
  • The hierarchical memory should be the dominant contributor to accuracy on quadratic-class tasks like OOLONG-Pairs.

If the ablations show the damage distributed uniformly across components instead, the component decomposition is wrong.

Evaluation Plan

This paper contains no empirical measurements. The follow-up paper will reproduce the protocol of Zhang et al. (2025) and add HARLM alongside their baselines:

  • Reproduction baseline. Verify published RLM numbers on S-NIAH, OOLONG, OOLONG-Pairs, BrowseComp-Plus (1K), and CodeQA using GPT-5 and Qwen3-Coder-480B as base models. Any divergence becomes the calibration baseline.
  • Component ablations. Run HARLM in five configurations: full, minus router, minus scheduler, minus memory, minus speculative. Each ablation should concentrate its damage on the metric the corresponding component is designed to attack.
  • Transfer to newer base models. Separate transfer set with GPT-5.5, Claude Opus 4.7, and Qwen 3.5-MoE. Delta-cost and delta-latency ratios should persist.
  • Measurement protocol. At least 3 seeds per cell; bootstrap 95% confidence intervals rather than paired t-tests on small-n benchmarks; per-seed raw scores published alongside means; provider pricing fixed to a single snapshot; concurrency cap stated explicitly.
  • Failure-mode annotation. A subset of failures hand-labeled into sub-model, aggregation, decomposition, and routing errors, with inter-annotator agreement on a shared sample.
  • Release. Router checkpoint, DAG scheduler, memory implementation, and full evaluation harness alongside the empirical paper.

Limitations

  • Additive separability. The cost-scaling results depend on additive separability of task information across chunks. For tasks with heavy long-range coreference, this fails and the bounds give the wrong units.
  • Router requires labeled data. Training the router as described needs oracle labels produced by running all three modes on each training item. A self-supervised or bandit-style scheme would remove this cost but is not part of this design.
  • Memory compression is lossy. Warm-tier compression is a small-model paraphrase. For tasks where verbatim strings matter (legal, code, named-entity aggregation), compression will lose information the aggregator needs.
  • Provider concurrency caps bound scheduler speedup. Researcher-tier accounts cap at P_max around 10-20. On items with many sub-queries, achievable parallelism is capped below the DAG width.
  • Hardware-aware scheduling is absent. The scheduler counts parallel requests, not their placement.
  • No empirical validation. The most important limitation: analytical predictions are not yet checked against measured numbers.

Conclusion

HARLM is a narrow extension of the RLM idea rather than a reimagining of it. Each of the three observations about the reference RLM maps to one component: synchronous sub-queries to the parallel scheduler, single-level recursion to the hierarchical memory, and uniform inference to the router. The analysis produces a concrete, falsifiable prediction about which component should dominate which metric. Whether that prediction holds is an empirical question that this paper explicitly does not answer.

Citation

Mure, L. (2026). HARLM: Hierarchical Adaptive Recursive Language Models -- A Design Study. SSRN. https://dx.doi.org/10.2139/ssrn.6397658