Hybrid Context Compaction: Managing Token Growth in Agentic Loops

By Xiaoyi Zhu

Context growing too fast? LLM summarization works, but it’s slow and expensive. Here’s how we handled most cases without another API call.

We started with LLM summarization as our default for managing context. But JetBrains' research on efficient context management showed a better way: observation masking as the primary layer, with summarization as fallback. This post describes the implementation and the results.


The Problem: Summarization Has Hidden Costs

Our existing implementation used LLM summarization for context compaction. When context grew too large, we'd call a fast model to compress older messages. It worked, until usage scaled and the friction started to show:

  • Latency overhead: Every summarization call adds extra latency.
  • Token cost: Summarization requires an additional LLM call, which compounds at scale.
  • Quality trade-offs: LLM summaries occasionally lose nuance. The summarization model might miss "stopping signals" that tell the agent it has enough information.

Bottom line: Summarization works, but it's expensive and sometimes counterproductive. Could we reduce how often it triggers while still keeping context under control?


The Research: JetBrains' Findings

While looking for optimizations, I found JetBrains' research on context management strategies. Their systematic evaluation revealed something counterintuitive:

  • Observation Masking: replace older tool outputs with placeholders. Key finding: fast, no API calls, and it often outperforms summarization.
  • LLM Summarization: compress older interactions via a separate LLM call. Key finding: summaries can mask "stopping signals," causing agents to overshoot.
  • Hybrid: masking as primary, summarization as fallback. Key finding: best overall, improving solve rates by ~2.6%.

The insight: simpler is often better. LLM summaries can inadvertently remove cues that tell the agent "you have enough information, stop iterating." Observation masking preserves the structure of the conversation while reducing volume.


The Solution: Two-Tier Compaction

Based on the research, I added observation masking as a first-pass layer, running before summarization.

Tier 1: Observation Masking (New)

Applied at the start of every iteration, before the LLM call:

  • Keep the last N tool results in full detail
  • Replace older tool results with brief placeholders
  • Zero API calls, near-zero latency
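
For illustration, here's roughly what a tool message looks like before and after Tier 1 runs. The message shape and placeholder text below are assumptions; adapt them to whatever your framework uses:

// Hypothetical tool message before masking (full payload)
const beforeMasking = { role: 'tool', content: '{"results": [ /* 15 rows of query output */ ]}' }

// The same message after masking: a brief placeholder instead of the payload
const afterMasking = { role: 'tool', content: '[Masked: 15 results]' }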

Tier 2: LLM Summarization (Now Fallback)

Our existing logic now serves as a safety net:

  • Hard threshold: Context exceeds critical limit → trigger immediately
  • Soft threshold: Context moderately large AND agent running for many steps → trigger
  • Uses a fast, cheap model (e.g., Claude 4.5 Haiku)
  • Graceful degradation if summarization fails
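
As a rough sketch, the knobs for both tiers could live in a small config object like the one below. The names and numbers are illustrative, not our production values:

// Illustrative config; values are hypothetical and should be tuned per model and traffic pattern
const compactionConfig = {
  keepRecent: 5,          // tool results Tier 1 keeps in full
  hardThreshold: 150000,  // estimated tokens: summarize immediately
  softThreshold: 80000,   // estimated tokens: summarize only after many steps
  stepThreshold: 10,      // iterations before the soft threshold can trigger
}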

The key insight: masking handles the majority of cases. Summarization went from primary mechanism to rare fallback.


Technical Deep Dive

1) Observation Masking

The masking logic is straightforward:

function applyObservationMasking(messages, keepRecent) {
  // Find all tool results in the message history
  const toolMessages = messages.filter(msg => isToolMessage(msg))

  // Keep the last N in full, mask the rest
  // (guard against keepRecent === 0, where slice(0, -0) would mask nothing)
  const toMask = keepRecent > 0 ? toolMessages.slice(0, -keepRecent) : toolMessages

  for (const msg of toMask) {
    // Skip tool results that were already masked on a previous iteration
    if (!msg.content.startsWith('[Masked')) {
      msg.content = `[Masked: ${extractBriefSummary(msg.content)}]`
    }
  }
}

The extractBriefSummary function is synchronous with almost zero latency. Since you know your tool schemas, you can write simple extraction logic for each response format:

function extractBriefSummary(content) {
  // Pattern-match against your known tool response shapes
  try {
    const parsed = JSON.parse(content)
    if (Array.isArray(parsed.results)) return `${parsed.results.length} results`
    if (parsed.schema?.fields) return `schema with ${parsed.schema.fields.length} fields`
  } catch {} // not JSON; fall through to truncation

  // Fallback: truncate to the first 50 chars
  return content.slice(0, 50)
}

The power is simplicity: pattern-matching against tool outputs you control. A few conditionals, easy to extend.

2) Fallback Summarization

When masking isn't enough:

async function applyFallbackSummarization(messages, keepRecent) {
  // Separate system messages (never summarize these)
  const systemMessages = messages.filter(msg => isSystemMessage(msg))
  const nonSystemMessages = messages.filter(msg => !isSystemMessage(msg))

  // Split: keep recent messages, summarize older ones
  const recentMessages = nonSystemMessages.slice(-keepRecent)
  const olderMessages = nonSystemMessages.slice(0, -keepRecent)

  // Nothing old enough to compress yet
  if (olderMessages.length === 0) return messages

  try {
    // Call a fast LLM to summarize the older messages
    const summary = await fastLlm.summarize(olderMessages)

    // Reconstruct: system + summary + recent
    return [...systemMessages, summary, ...recentMessages]
  } catch {
    // Graceful degradation: keep the original messages if summarization fails
    return messages
  }
}

3) Integration in the Agent Loop

Compaction runs at the start of each iteration, before the LLM call:

// Simplified agent loop (inside an async function); tool handling and stop conditions elided
for (let iteration = 0; iteration < maxIterations; iteration++) {
  // Tier 1: Always apply masking first
  applyObservationMasking(messages, keepRecent)

  // Tier 2: Check whether summarization is needed
  const contextSize = estimateTokens(messages)
  const overHardLimit = contextSize > hardThreshold
  const overSoftLimit = contextSize > softThreshold && iteration > stepThreshold
  if (overHardLimit || overSoftLimit) {
    messages = await applyFallbackSummarization(messages, keepRecent)
  }

  // Proceed with the normal LLM call
  const response = await llm.invoke(messages)
  // ... append the response, run tool calls, check for completion ...
}

4) Token Estimation

For delta tracking, we use a fast character-based estimate:

function estimateTokens(messages) {
  // Sum up character count across all message content
  const totalChars = messages.reduce((sum, msg) => sum + msg.content.length, 0)

  // Rough estimate: ~4 chars per token for English
  return Math.ceil(totalChars / 4)
}

Intentionally approximate. For threshold checks, we don't need exact counts, just directionally correct estimates.


Results

After adding masking as the first-pass layer:

  1. Token savings: ~30–40% token reduction, plus avoided summarization token costs.

  2. Lower latency: Masking adds ~0ms of overhead and, for most requests, avoids the extra latency of a summarization call.

  3. Rate limit headroom: Improved rate-limit resilience due to fewer API calls during peak usage.

Summarization still exists, but it now functions as an infrequent safety net rather than the default path.


Gotchas and Lessons Learned

On Observation Masking

  • Observation masking works well for our data-query patterns because tool calls tend to be sequential dependencies: we fetch metadata, then build a query, then retrieve the actual data. Once we have the final query results, the intermediate steps (schema lookups, metadata checks) become noise rather than useful context. For tasks where the agent iteratively builds on previous observations (e.g., multi-step debugging or comparative analysis), you may need to keep more context or be selective about what gets masked.
  • Placeholders with brief summaries (e.g., [Masked: schema with 20 fields]) let the LLM know that preparatory data was retrieved without including the full payload. This keeps the focus on the actual query results while preserving enough context to avoid redundant tool calls.

On LLM Summarization

  • Slower and more expensive than masking. Demoting it to fallback was the right call.
  • Graceful degradation matters. If summarization fails, keep original messages rather than crashing.

On the Hybrid Approach

  • Adding masking in front of existing summarization was straightforward. Minimal code for meaningful gains.
  • Logging token deltas at each step helps debug context growth and validate the optimization; a minimal sketch follows this list.
  • Cumulative token tracking gives visibility into actual API costs.
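
Here is a minimal sketch of that delta logging, reusing estimateTokens from above; the log format is just an example:

// Log how many estimated tokens masking saved on this iteration
const tokensBefore = estimateTokens(messages)
applyObservationMasking(messages, keepRecent)
const tokensAfter = estimateTokens(messages)
console.log(`[compaction] masking delta: ${tokensBefore - tokensAfter} tokens (${tokensBefore} -> ${tokensAfter})`)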

Important Caveats

  • Tuned for our use case: This approach is optimized for a low-latency feature with aggressive hyperparameters. Agents built for research, exploration, or long-horizon reasoning will likely require different settings.
  • Heuristic evaluation: Quality was evaluated through manual review and user feedback rather than automated metrics. While the token reduction (~30–40%) is measured, the absence of quality degradation is observational rather than rigorously quantified.

Best Practices

Do

  • Place masking before summarization. It’s low effort and delivers outsized gains.
  • Instrument token usage. Track cumulative input tokens, masking deltas, and summarization trigger rates.
  • Test with long-running conversations. Compaction strategies only reveal their value at scale.

Don't

  • Over-compress context. Retaining too few recent or relevant messages degrades response quality.
  • Treat thresholds as universal. Tune aggressively for your specific model, traffic patterns, and use case.

Wrapping Up

If you’re already using LLM-based summarization for context compaction, adding observation masking in front of it is a straightforward win. In our case, it reduced token usage, improved response latency, and freed up rate-limit headroom with minimal code changes.

This hybrid model uses masking as the primary mechanism and summarization as a fallback. It scales more gracefully than summarization alone. Our experience aligns with JetBrains’ findings that simpler compaction strategies can outperform more complex summarization.

If you’re running agentic systems that rely on summarization for context management, consider layering masking as a first pass. The research is credible, the implementation is simple, and the upside is meaningful.

This is what worked for us. Your mileage may vary, but hopefully it saves you some trial and error.