Compressing LLM Context Windows: Efficient Data Formats and Context Management
Author: Xiaoyi Zhu
Your API just returned 100 user records in JSON. Each record has 9 fields:
"id","name","email","phone","address","title","status","created_at","updated_at". That's the same 9 field names, wrapped in quotes, repeated 100 times. Before you even get to the data, you're paying for 900+ redundant tokens of structural overhead.
Sound familiar? Every repeated key costs tokens, which you pay for in inference costs and latency. LLM context windows are finite, and JSON's verbosity compounds rapidly with scale. By reformatting data and filtering fields, you can slash token usage by 20-50% without losing information. This post explores practical techniques to engineer your context for efficiency, not just your prompts.
The Headache: When Your Context Window Becomes a Liability
Without smart context management, you'll hit these walls fast:
- JSON is token-inefficient: Braces {}, brackets [], quotes, and repeated keys consume many tokens with little semantic value. A simple API response can balloon to thousands of tokens when most of it is structural overhead.
- Nested data compounds the problem: APIs often return deeply nested JSON or arrays of similar objects. Each object repeats the same keys. A list of 100 users might repeat "id", "name", "email", "phone", "address", "title", "status", "created_at", "updated_at" 100 times—900+ tokens of pure redundancy.
- Models lose focus in the middle: Even with larger context windows, LLMs exhibit "lost-in-the-middle" behavior. Critical information gets buried in irrelevant data, tanking answer accuracy.
- Cost increases linearly: More tokens mean higher costs. Every redundant bracket adds up when processing thousands of requests.
- Latency degrades user experience: Larger contexts mean slower inference and response times that users notice.
Bottom line: Naively feeding raw JSON from tools or memory into your LLM wastes tokens and money. Strategic context compression is essential.
The Simple Fix: Engineer Your Context, Not Just Your Prompts
Instead of fighting against verbose data formats, strategically compress and curate context before it reaches the model. The solution has three layers:
- Format-Level Optimization (API): If you control the backend, return LLM-friendly formats (YAML, Markdown tables, CSV) instead of verbose JSON. This can reduce tokens by 15-50% depending on data structure.
- Transform-Level Optimization (Application): When you can't change the API, transform responses before prompting. Filter fields, flatten structures, and convert to compact formats.
- Content-Level Optimization (Intelligence): Use relevance filtering, summarization, and smart retrieval to include only what matters. The best compression is including less data.
Let the format be efficient; let the transformation be smart; let the content be relevant.
Technical Deep Dive
1) Building LLM-Friendly Data Endpoints
If you own the API, this is your biggest leverage point. Four formats consistently outperform standard JSON:
YAML: Variable token reduction
YAML eliminates braces and quotes around keys, uses indentation for structure, and reads naturally. However, token savings vary significantly based on data structure:
# JSON: 187 tokens
{
  "users": [
    {
      "id": 1,
      "name": "Alice Chen",
      "email": "alice@example.com",
      "phone": "+1-555-0100",
      "address": "123 Main St",
      "title": "Engineer",
      "status": "active",
      "created_at": "2024-01-15",
      "updated_at": "2024-10-20"
    },
    {
      "id": 2,
      "name": "Bob Smith",
      "email": "bob@example.com",
      "phone": "+1-555-0200",
      "address": "456 Oak Ave",
      "title": "Designer",
      "status": "inactive",
      "created_at": "2024-02-20",
      "updated_at": "2024-10-18"
    }
  ]
}
# YAML: 147 tokens (approximately 20% reduction in this example)
users:
  - id: 1
    name: Alice Chen
    email: alice@example.com
    phone: +1-555-0100
    address: 123 Main St
    title: Engineer
    status: active
    created_at: 2024-01-15
    updated_at: 2024-10-20
  - id: 2
    name: Bob Smith
    email: bob@example.com
    phone: +1-555-0200
    address: 456 Oak Ave
    title: Designer
    status: inactive
    created_at: 2024-02-20
    updated_at: 2024-10-18
Why the savings are small here: With only 2 objects, each containing 9 fields, the JSON repeats keys just 18 times total. The format overhead (braces, quotes) is relatively small compared to the actual data content.
Where YAML really shines: In production APIs returning 100+ similar objects. Consider 100 users with 9 fields each:
- JSON: Repeats "id", "name", "email", "phone", "address", "title", "status", "created_at", "updated_at" 100 times with quotes and braces
- YAML: Same keys, no quotes, no braces—just indentation
For large arrays, YAML typically saves 20-35% tokens compared to JSON. The exact savings depend on:
- Number of objects in the array
- Number of fields per object
- Field name lengths (longer keys = more repeated overhead)
- Data nesting depth
Markdown Tables: Most effective for large arrays
Headers appear once instead of repeating for every row. This is particularly effective for lists of similar objects:
| ID | Name | Email | Phone | Address | Title | Status | Created At | Updated At |
| --- | ---------- | ----------------- | ----------- | ----------- | -------- | -------- | ---------- | ---------- |
| 1 | Alice Chen | alice@example.com | +1-555-0100 | 123 Main St | Engineer | active | 2024-01-15 | 2024-10-20 |
| 2 | Bob Smith | bob@example.com | +1-555-0200 | 456 Oak Ave | Designer | inactive | 2024-02-20 | 2024-10-18 |
With just 2 rows, this markdown table uses 143 tokens, also saving about 20%. The format overhead of the table structure (pipes, separators, aligned spacing) offsets savings for very small datasets.
The real impact comes at scale: For 100 users with the same 9 fields:
- JSON: "id", "name", "email", "phone", "address", "title", "status", "created_at", "updated_at" appear 100 times with quotes, braces, and commas → approximately 7k-8k tokens
- Markdown table: Headers appear once, data in clean rows → approximately 4k-5k tokens
That's roughly a 40% reduction, and the absolute savings grow linearly with row count because:
- JSON: Every row repeats all field names in quotes
- Markdown: Field names appear once in the header row
This is why markdown tables are highly effective for API responses returning arrays of homogeneous objects—the common case for list endpoints, search results, and bulk queries.
Columnar JSON: Best of both worlds
If you need to stay within JSON but want significant token savings, use a columnar format where keys appear once in a header, and data is represented as arrays:
{
  "columns": [
    "id",
    "name",
    "email",
    "phone",
    "address",
    "title",
    "status",
    "created_at",
    "updated_at"
  ],
  "count": 2,
  "data": [
    [
      1,
      "Alice Chen",
      "alice@example.com",
      "+1-555-0100",
      "123 Main St",
      "Engineer",
      "active",
      "2024-01-15",
      "2024-10-20"
    ],
    [
      2,
      "Bob Smith",
      "bob@example.com",
      "+1-555-0200",
      "456 Oak Ave",
      "Designer",
      "inactive",
      "2024-02-20",
      "2024-10-18"
    ]
  ]
}
This columnar format uses 165 tokens—a 12% reduction from standard JSON (187 tokens) while remaining valid JSON. For 100 users:
- Standard JSON: ~7k-8k tokens
- Columnar JSON: ~4k-4.5k tokens
- Savings: ~35%-45%
The columnar format scales particularly well because:
- Keys appear once in the columns array
- Data rows are simple arrays with no repeated keys
- No structural overhead per row beyond brackets and commas
This approach is ideal when you need JSON compatibility (for type safety, existing parsers, or API contracts) but want better token efficiency.
2) Transform-Level Optimization
When you can't change the API, transform responses before prompting:
Field filtering
Remove fields that don't contribute to the task:
def filter_fields(data, required_fields):
    """Keep only fields needed for the specific task"""
    if isinstance(data, list):
        return [
            {k: v for k, v in item.items() if k in required_fields}
            for item in data
        ]
    return {k: v for k, v in data.items() if k in required_fields}

# For query: "Which accounts have revenue over $4M?"
required_fields = {"name", "revenue", "industry"}
filtered_data = filter_fields(raw_api_response["accounts"], required_fields)
Why this works:
- Typical APIs return 10-30 fields per object
- You might only need 5-10 fields to answer the query
- Removing unnecessary fields cuts tokens by 40-60%
Format conversion
Convert JSON to compact formats:
import yaml
import pandas as pd

def json_to_yaml(data):
    """Convert JSON to YAML for 15-30% token savings"""
    return yaml.dump(data, default_flow_style=False)

def json_to_markdown_table(data, columns=None):
    """Convert array of objects to markdown table"""
    if not data:
        return ""
    df = pd.DataFrame(data)
    if columns:
        df = df[columns]  # Reorder or filter columns
    return df.to_markdown(index=False)  # to_markdown requires the tabulate package

# Example usage
yaml_context = json_to_yaml(filtered_data)
# or
table_context = json_to_markdown_table(
    filtered_data,
    columns=["name", "revenue", "industry"]
)
Structure flattening
Flatten nested structures to reduce depth:
def flatten_nested(data, parent_key="", sep="_"):
    """Flatten nested JSON to single level"""
    items = []
    for k, v in data.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_nested(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

# Before: {"account": {"metadata": {"campaign": "q1_2023"}}}
# After:  {"account_metadata_campaign": "q1_2023"}
3) Content-Level Optimization
Relevance filtering
Only include data relevant to the query:
def semantic_filter(query, documents, threshold=0.7):
    """Use embeddings to filter by relevance"""
    # get_embedding and cosine_similarity are placeholders for your
    # embedding provider and similarity function
    query_embedding = get_embedding(query)
    relevant_docs = []
    for doc in documents:
        doc_embedding = get_embedding(doc["content"])
        similarity = cosine_similarity(query_embedding, doc_embedding)
        if similarity >= threshold:
            relevant_docs.append(doc)
    return relevant_docs

# Only include documents semantically related to the query
filtered_context = semantic_filter(
    query="customer satisfaction trends",
    documents=all_documents,
    threshold=0.7
)
Summarization
Compress information while preserving key facts:
def hierarchical_summarization(documents, target_tokens=500):
    """Summarize documents to fit a token budget"""
    # chunk_documents, llm, and count_tokens are assumed helpers
    # Chunk documents into groups
    chunks = chunk_documents(documents, chunk_size=5)
    # Summarize each chunk
    summaries = []
    for chunk in chunks:
        summary = llm.summarize(
            "\n\n".join([doc["content"] for doc in chunk]),
            max_tokens=100
        )
        summaries.append(summary)
    # If still too large, summarize the summaries
    if count_tokens("\n\n".join(summaries)) > target_tokens:
        final_summary = llm.summarize(
            "\n\n".join(summaries),
            max_tokens=target_tokens
        )
        return final_summary
    return "\n\n".join(summaries)
Context offloading
Store intermediate results in the filesystem for long-running workflows:
import json
from pathlib import Path

class ContextManager:
    """Manage context by offloading to the filesystem"""

    def __init__(self, workspace_dir="/tmp/agent_context"):
        self.workspace = Path(workspace_dir)
        self.workspace.mkdir(parents=True, exist_ok=True)

    def save_intermediate_result(self, step_name, data):
        """Save step result to file"""
        file_path = self.workspace / f"{step_name}.json"
        with open(file_path, 'w') as f:
            json.dump(data, f)
        return file_path

    def load_relevant_context(self, current_step, needed_steps):
        """Load only context needed for current step"""
        context = {}
        for step in needed_steps:
            file_path = self.workspace / f"{step}.json"
            if file_path.exists():
                with open(file_path, 'r') as f:
                    context[step] = json.load(f)
        return context

    def create_context_summary(self):
        """Create compact summary of all stored context"""
        all_files = list(self.workspace.glob("*.json"))
        summary = {
            "total_steps": len(all_files),
            "available_data": [f.stem for f in all_files]
        }
        return summary

# Usage in an agentic workflow
ctx_mgr = ContextManager()

# Step 1: Research
research_results = agent.research(topic)
ctx_mgr.save_intermediate_result("research", research_results)

# Step 2: Analysis (only load research data)
relevant_context = ctx_mgr.load_relevant_context(
    "analysis",
    needed_steps=["research"]
)
analysis_results = agent.analyze(relevant_context["research"])
ctx_mgr.save_intermediate_result("analysis", analysis_results)

# Step 3: Final report (load summary, not all data)
context_summary = ctx_mgr.create_context_summary()
final_report = agent.generate_report(
    context_summary=context_summary,
    detailed_data=ctx_mgr.load_relevant_context(
        "report",
        needed_steps=["analysis"]
    )
)
Why this works:
- Keeps context window focused on current task
- Enables deep research workflows that would otherwise exceed context limits
- Reduces token costs by avoiding redundant context in every step
- Provides clear data lineage for debugging
Important caveat: This adds filesystem I/O latency, making it unsuitable for real-time applications where sub-second response times are critical. Best for batch processing, deep research, and long-running workflows where thoroughness matters more than speed.
4) Context Window Management
Sliding window
Maintain only recent conversation turns:
def sliding_window(messages, window_size=5):
    """Keep only the N most recent messages"""
    return messages[-window_size:]
Multi-agent architectures
Split complex tasks across specialized agents, each with focused context:
class AnalysisAgent:
    def __init__(self):
        self.context_schema = ["accounts", "revenue_data"]  # Limited scope

    def analyze(self, query):
        # This agent only sees financial data
        relevant_data = fetch_data(self.context_schema)
        return llm.analyze(query, context=relevant_data)

class CustomerAgent:
    def __init__(self):
        self.context_schema = ["accounts", "contacts", "activities"]

    def handle_query(self, query):
        # This agent focuses on customer relationships
        relevant_data = fetch_data(self.context_schema)
        return llm.generate(query, context=relevant_data)
Each agent maintains a smaller, more focused context window instead of one large shared context.
Worked Example: Before and After
Let's see these techniques in action. Imagine an API returns information about customer accounts:
Before: Raw JSON (~370 tokens)
{
  "accounts": [
    {
      "id": "acc_001",
      "name": "Acme Corporation",
      "industry": "Technology",
      "revenue": 5000000,
      "created_at": "2023-01-15T08:30:00Z",
      "updated_at": "2024-10-18T14:22:00Z",
      "status": "active",
      "metadata": {
        "source": "web_signup",
        "campaign": "q1_2023",
        "internal_id": "legacy_12345"
      },
      "contacts": [
        {
          "id": "con_001",
          "name": "John Smith",
          "email": "john@acme.com",
          "phone": "+1-555-0100",
          "title": "CTO"
        }
      ]
    },
    {
      "id": "acc_002",
      "name": "Beta Industries",
      "industry": "Manufacturing",
      "revenue": 3500000,
      "created_at": "2023-03-22T10:15:00Z",
      "updated_at": "2024-10-17T09:45:00Z",
      "status": "active",
      "metadata": {
        "source": "partner_referral",
        "campaign": "q2_2023",
        "internal_id": "legacy_67890"
      },
      "contacts": [
        {
          "id": "con_002",
          "name": "Jane Doe",
          "email": "jane@beta.com",
          "phone": "+1-555-0200",
          "title": "VP Engineering"
        }
      ]
    }
  ]
}
After: Markdown Table (~70 tokens)
| Name | Revenue | Industry | Primary Contact |
| ---------------- | ---------- | ------------- | ------------------------- |
| Acme Corporation | $5,000,000 | Technology | John Smith (CTO) |
| Beta Industries | $3,500,000 | Manufacturing | Jane Doe (VP Engineering) |
Result: Approximately 80% token reduction by combining format optimization with aggressive field filtering.
This represents optimal compression where:
- Most fields (IDs, timestamps, metadata, detailed contact info) are irrelevant to the task
- The query only needs summary information
- Data naturally fits a tabular structure
For queries like "Which accounts have revenue over $4M?", the markdown table contains everything needed. However, the metadata, timestamps, and detailed contact information might be essential for other use cases.
Important caveat: This is just 2 accounts. With 50 or 100 accounts, the JSON would grow quickly, while the markdown table grows much more slowly, making the savings more dramatic as array size increases.
Benchmarks & Metrics
Performance improvements vary significantly based on data structure, use case, and especially array size. These estimates assume medium to large arrays (20+ objects):
| Optimization | Token Reduction | Notes |
|---|---|---|
| JSON → YAML | 10-25% | Minimal for small arrays; grows with array size |
| JSON → Columnar JSON | 25-50% | JSON-compatible; keys appear once |
| JSON → Markdown Table | 20-40% | Highly effective for large arrays (100+ rows) |
| + Field Filtering | 30-60% | Depends on how many fields are removable |
| + Relevance Filtering | 40-70% | Highly use-case dependent |
| + Summarization | 50-80% | Risk of information loss |
Important caveats:
- Format changes alone provide minimal savings for small arrays (2-10 items). The real benefit comes from eliminating repeated keys in large arrays.
- These ranges represent observed results across different use cases, not guarantees
- Actual savings depend heavily on: array size, number of fields per object, and field name lengths
- Higher compression rates increase the risk of losing important information
- Combined optimizations don't always multiply (e.g., field filtering + format change may overlap)
- Always benchmark with your actual data and validate quality doesn't degrade
Best Practices & Caveats
Do
- Design APIs for LLMs from day one. If you're building new services, return YAML or Markdown natively.
- Measure token counts obsessively. Use tiktoken or similar libraries to profile every part of your prompt pipeline. What you can measure, you can optimize.
- Start with field filtering. The simplest optimization (removing unused fields) often provides meaningful gains with minimal risk.
- Keep a gold standard. Maintain test cases that verify compression doesn't hurt answer quality.
- Layer optimizations strategically. Combine format changes, filtering, and summarization for maximum effect.
Don't
- Don't blindly compress everything. Some use cases (detailed data analysis, debugging) require complete information. Know when to preserve fidelity.
- Don't forget parsing. If you invent a custom format, ensure downstream systems can parse it reliably. Test edge cases.
- Don't assume compression is free. Techniques like LLM-based summarization add latency and cost. Measure the trade-offs.
- Don't over-optimize prematurely. If your context is already small (<1000 tokens), optimization may not be worth the complexity.
Lessons Learned / Key Takeaways
- Context engineering matters: The format and content of your context directly impacts cost, latency, and quality.
- JSON has significant overhead for LLMs: Its verbosity makes it less efficient than more compact formats in token-constrained environments.
- Relevance filtering beats compression: Excluding irrelevant data often provides better results than compressing everything. Focus on relevance first, format second.
- Optimizations can be layered: Combining multiple techniques (format + filtering + summarization) typically yields better results than any single approach.
- Context offloading enables longer agent sessions: For deep research and long-running workflows, storing intermediate results in the filesystem prevents context overflow. However, this adds latency unsuitable for real-time applications.
- Always measure and validate: Token counts, cost metrics, and quality benchmarks should guide optimization decisions. Results vary significantly across use cases.
Wrapping Up
Simply throwing data at an LLM isn't enough for production applications. If you're building AI agents, the pattern that scales is strategic context engineering: return LLM-friendly formats from APIs, transform data before prompting, filter aggressively for relevance, and offload context for long-running workflows. Curate the right data, in the right format, at the right moment—and your agents will be faster, cheaper, and more focused.
Context windows have limits. Don't waste tokens on structural overhead and irrelevant data. The teams shipping production AI agents use these techniques daily to scale efficiently. Start with the low-hanging fruit (field filtering, markdown tables), measure improvements, and progressively layer in more sophisticated approaches like context offloading. That's how you build agents that are fast, focused, and future-ready.