Hybrid Context Compaction: Managing Token Growth in Agentic Loops

By Xiaoyi Zhu
Context growing too fast? LLM summarization works, but it’s slow and expensive. Here’s how we handled most cases without another API call.
We started with LLM summarization as our default for managing context. But JetBrains' research on efficient context management showed a better way: observation masking as the primary layer, with summarization as fallback. This post describes the implementation and the results.
The Problem: Summarization Has Hidden Costs
Our existing implementation used LLM summarization for context compaction. When context grew too large, we'd call a fast model to compress older messages. It worked, until usage scaled and the friction started to show:
- Latency overhead: Every summarization call adds extra latency.
- Token cost: Summarization requires an additional LLM call, which compounds at scale.
- Quality trade-offs: LLM summaries occasionally lose nuance. The summarization model might miss "stopping signals" that tell the agent it has enough information.
Bottom line: Summarization works, but it's expensive and sometimes counterproductive. Could we reduce how often it triggers while still keeping context under control?
The Research: JetBrains' Findings
While looking for optimizations, I found JetBrains' research on context management strategies. Their systematic evaluation revealed something counterintuitive:
| Approach | Description | Key Finding |
|---|---|---|
| Observation Masking | Replace older tool outputs with placeholders | Fast, no API calls, often outperforms summarization |
| LLM Summarization | Compress older interactions via separate LLM call | Can mask "stopping signals," causing agents to overshoot |
| Hybrid | Masking as primary, summarization as fallback | Best overall: improved solve rates by ~2.6% |
The insight: simpler is often better. LLM summaries can inadvertently remove cues that tell the agent "you have enough information, stop iterating." Observation masking preserves the structure of the conversation while reducing volume.
The Solution: Two-Tier Compaction
Based on the research, I added observation masking as a first-pass layer, running before summarization.
Tier 1: Observation Masking (New)
Applied at the start of every iteration, before the LLM call:
- Keep the last N tool results in full detail
- Replace older tool results with brief placeholders
- Zero API calls, near-zero latency
Tier 2: LLM Summarization (Now Fallback)
Our existing logic now serves as a safety net:
- Hard threshold: Context exceeds critical limit → trigger immediately
- Soft threshold: Context moderately large AND agent running for many steps → trigger
- Uses a fast, cheap model (e.g., Claude 4.5 Haiku)
- Graceful degradation if summarization fails
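For reference, the whole two-tier scheme reduces to a handful of knobs. The constants below are illustrative placeholders only, not the settings we run in production:

```javascript
// Illustrative placeholders only; tune for your own model, traffic, and latency budget.
const MASK_KEEP_RECENT = 3      // Tier 1: tool results kept in full detail
const SUMMARY_KEEP_RECENT = 6   // Tier 2: recent messages excluded from summarization
const HARD_THRESHOLD = 100_000  // estimated tokens: summarize immediately above this
const SOFT_THRESHOLD = 50_000   // estimated tokens: summarize only on long-running agents
const STEP_THRESHOLD = 10       // iterations before the soft threshold can kick in
```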
The key insight: masking handles the majority of cases. Summarization went from primary mechanism to rare fallback.
Technical Deep Dive
1) Observation Masking
The masking logic is straightforward:
```javascript
function applyObservationMasking(messages, keepRecent) {
  // Find all tool results in the message history
  const toolMessages = messages.filter(msg => isToolMessage(msg))

  // Keep the last N in full, mask the rest (mutates messages in place)
  const toMask = toolMessages.slice(0, -keepRecent)
  for (const msg of toMask) {
    if (!msg.content.startsWith('[Masked')) {
      msg.content = `[Masked: ${extractBriefSummary(msg.content)}]`
    }
  }
}
```
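The isToolMessage check (and the isSystemMessage check used later) depends entirely on your message schema. A hypothetical version, assuming role-tagged messages with string content:

```javascript
// Hypothetical helpers; adjust to however your framework tags messages.
function isToolMessage(msg) {
  return msg.role === 'tool'
}

function isSystemMessage(msg) {
  return msg.role === 'system'
}
```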
The extractBriefSummary function is synchronous with almost zero latency. Since you know your tool schemas, you can write simple extraction logic for each response format:
```javascript
function extractBriefSummary(content) {
  try {
    // Pattern-match against your known tool response shapes
    const parsed = JSON.parse(content)
    if (Array.isArray(parsed.results)) return `${parsed.results.length} results`
    if (parsed.schema?.fields) return `schema with ${parsed.schema.fields.length} fields`
  } catch {
    // Not JSON (or an unexpected shape): fall through to truncation
  }
  // Fallback: truncate to the first 50 characters
  return content.slice(0, 50)
}
```
The power is simplicity: pattern-matching against tool outputs you control. A few conditionals, easy to extend.
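To see the effect end to end, here is a toy example (the message contents are made up for illustration):

```javascript
// Toy history; contents are illustrative, not real tool output.
const messages = [
  { role: 'user', content: "Show me last week's signups by region" },
  { role: 'tool', content: '{"schema": {"fields": ["id", "region", "created_at"]}}' },
  { role: 'tool', content: '{"results": [{"region": "EMEA", "signups": 412}]}' },
]

applyObservationMasking(messages, 1)
// The older tool result becomes '[Masked: schema with 3 fields]';
// the most recent tool result stays in full detail.
```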
2) Fallback Summarization
When masking isn't enough:
```javascript
async function applyFallbackSummarization(messages, keepRecent) {
  // Separate system messages (never summarize these)
  const systemMessages = messages.filter(msg => isSystemMessage(msg))
  const nonSystemMessages = messages.filter(msg => !isSystemMessage(msg))

  // Split: keep recent messages, summarize older ones
  const recentMessages = nonSystemMessages.slice(-keepRecent)
  const olderMessages = nonSystemMessages.slice(0, -keepRecent)

  // Call a fast LLM to summarize the older messages
  const summary = await fastLlm.summarize(olderMessages)

  // Reconstruct: system + summary + recent
  return [...systemMessages, summary, ...recentMessages]
}
```
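The fastLlm.summarize call is just a thin wrapper around whichever cheap model you use; the chatCompletion client below is a stand-in, not a real API. A minimal sketch:

```javascript
// Hypothetical wrapper; chatCompletion() stands in for your own fast-model client.
const fastLlm = {
  async summarize(olderMessages) {
    const transcript = olderMessages
      .map(msg => `${msg.role}: ${msg.content}`)
      .join('\n')

    const summaryText = await chatCompletion('fast-cheap-model', [{
      role: 'user',
      content:
        'Summarize this agent conversation. Preserve key facts, decisions, and any ' +
        'signals that the task is already complete:\n\n' + transcript,
    }])

    // Return a single message so it slots back into the reconstructed history
    return { role: 'user', content: `[Summary of earlier context] ${summaryText}` }
  },
}
```

Asking the summary to preserve "task is already complete" signals is a direct response to the stopping-signal problem noted earlier.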
3) Integration in the Agent Loop
Compaction runs at the start of each iteration, before the LLM call:
```javascript
for (let iteration = 0; iteration < maxIterations; iteration++) {
  // Tier 1: always apply masking first
  applyObservationMasking(messages, MASK_KEEP_RECENT)

  // Tier 2: check if summarization is needed
  const contextSize = estimateTokens(messages)
  if (contextSize > HARD_THRESHOLD) {
    messages = await applyFallbackSummarization(messages, SUMMARY_KEEP_RECENT)
  } else if (contextSize > SOFT_THRESHOLD && iteration > STEP_THRESHOLD) {
    messages = await applyFallbackSummarization(messages, SUMMARY_KEEP_RECENT)
  }

  // Proceed with the normal LLM call
  const response = await llm.invoke(messages)
  // ... handle the response, append any tool results, check for completion ...
}
```
4) Token Estimation
For delta tracking, we use a fast character-based estimate:
```javascript
function estimateTokens(messages) {
  // Sum up the character count across all message content
  const totalChars = messages.reduce((sum, msg) => sum + msg.content.length, 0)

  // Rough estimate: ~4 characters per token for English text
  return Math.ceil(totalChars / 4)
}
```
Intentionally approximate. For threshold checks, we don't need exact counts, just directionally correct estimates.
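Because the estimate is this cheap, it's easy to log a delta around every compaction step (the logger here is hypothetical):

```javascript
// Hypothetical instrumentation around Tier 1; the same pattern works for Tier 2.
const tokensBefore = estimateTokens(messages)
applyObservationMasking(messages, MASK_KEEP_RECENT)
const tokensAfter = estimateTokens(messages)

logger.info('observation_masking', {
  tokensBefore,
  tokensAfter,
  tokensSaved: tokensBefore - tokensAfter,
})
```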
Results
After adding masking as the first-pass layer:
- Token savings: ~30–40% token reduction, plus avoided summarization token costs.
- Lower latency: Masking adds ~0ms overhead and avoids the additional summarization latency for most requests.
- Rate limit headroom: Improved rate-limit resilience due to fewer API calls during peak usage.
Summarization still exists, but it now functions as an infrequent safety net rather than the default path.
Gotchas and Lessons Learned
On Observation Masking
- Observation masking works well for our data-query patterns because tool calls tend to be sequential dependencies: we fetch metadata, then build a query, then retrieve the actual data. Once we have the final query results, the intermediate steps (schema lookups, metadata checks) become noise rather than useful context. For tasks where the agent iteratively builds on previous observations (e.g., multi-step debugging or comparative analysis), you may need to keep more context or be selective about what gets masked (see the sketch after this list).
- Placeholders with brief summaries (e.g., [Masked: schema with 20 fields]) let the LLM know that preparatory data was retrieved without including the full payload. This keeps the focus on the actual query results while preserving enough context to avoid redundant tool calls.
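One low-effort way to be selective is an allowlist of tools whose outputs are never masked. A hypothetical sketch (toolName is whatever field identifies the tool in your messages):

```javascript
// Hypothetical: exempt tools whose outputs the agent keeps reasoning over.
const NEVER_MASK = new Set(['run_debugger', 'compare_reports'])

function isMaskable(msg) {
  return isToolMessage(msg) && !NEVER_MASK.has(msg.toolName)
}

// Inside applyObservationMasking, filter with isMaskable(msg)
// instead of isToolMessage(msg).
```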
On LLM Summarization
- Slower and more expensive than masking. Demoting it to fallback was the right call.
- Graceful degradation matters. If summarization fails, keep original messages rather than crashing.
On the Hybrid Approach
- Adding masking in front of existing summarization was straightforward. Minimal code for meaningful gains.
- Logging token deltas at each step helps debug context growth and validate the optimization.
- Cumulative token tracking gives visibility into actual API costs.
Important Caveats
- Tuned for our use case: This approach is optimized for a low-latency feature with aggressive hyperparameters. Agents built for research, exploration, or long-horizon reasoning will likely require different settings.
- Heuristic evaluation: Quality was evaluated through manual review and user feedback rather than automated metrics. While the token reduction (~30–40%) is measured, the absence of quality degradation is observational rather than rigorously quantified.
Best Practices
Do
- Place masking before summarization. It’s low effort and delivers outsized gains.
- Instrument token usage. Track cumulative input tokens, masking deltas, and summarization trigger rates.
- Test with long-running conversations. Compaction strategies only reveal their value at scale.
Don't
- Over-compress context. Retaining too few recent or relevant messages degrades response quality.
- Treat thresholds as universal. Tune aggressively for your specific model, traffic patterns, and use case.
Wrapping Up
If you’re already using LLM-based summarization for context compaction, adding observation masking in front of it is a straightforward win. In our case, it reduced token usage, improved response latency, and freed up rate-limit headroom with minimal code changes.
This hybrid model uses masking as the primary mechanism and summarization as a fallback. It scales more gracefully than summarization alone. Our experience aligns with JetBrains’ findings that simpler compaction strategies can outperform more complex summarization.
If you’re running agentic systems that rely on summarization for context management, consider layering masking as a first pass. The research is credible, the implementation is simple, and the upside is meaningful.
This is what worked for us. Your mileage may vary, but hopefully it saves you some trial and error.