Compressing LLM Context Windows: Efficient Data Formats and Context Management
Author: Xiaoyi Zhu
Your API just returned 100 user records in JSON. Each record has 9 fields:
"id","name","email","phone","address","title","status","created_at","updated_at". That's the same 9 field names, wrapped in quotes, repeated 100 times. Before you even get to the data, you're paying for 900+ redundant tokens of structural overhead.
Sound familiar? Every repeated key costs tokens, which you pay for in inference costs and latency. LLM context windows are finite, and JSON's verbosity compounds rapidly with scale. By reformatting data and filtering fields, you can slash token usage by 20-50% without losing information. This post explores practical techniques to engineer your context for efficiency, not just your prompts.
The Headache: When Your Context Window Becomes a Liability
Without smart context management, you'll hit these walls fast:
- JSON is token-inefficient: Braces {}, brackets [], quotes, and repeated keys consume many tokens with little semantic value. A simple API response can balloon to thousands of tokens when most of it is structural overhead.
- Nested data compounds the problem: APIs often return deeply nested JSON or arrays of similar objects. Each object repeats the same keys. A list of 100 users might repeat "id", "name", "email", "phone", "address", "title", "status", "created_at", "updated_at" 100 times—900+ tokens of pure redundancy.
- Models lose focus in the middle: Even with larger context windows, LLMs exhibit "lost-in-the-middle" behavior. Critical information gets buried in irrelevant data, tanking answer accuracy.
- Cost increases linearly: More tokens mean higher costs. Every redundant bracket adds up when processing thousands of requests.
- Latency degrades user experience: Larger contexts mean slower inference and response times that users notice.
Bottom line: Naively feeding raw JSON from tools or memory into your LLM wastes tokens and money. Strategic context compression is essential.
The Simple Fix: Engineer Your Context, Not Just Your Prompts
Instead of fighting against verbose data formats, strategically compress and curate context before it reaches the model. The solution has three layers:
- Format-Level Optimization (API): If you control the backend, return LLM-friendly formats (YAML, Markdown tables, CSV) instead of verbose JSON. This can reduce tokens by 15-50% depending on data structure.
- Transform-Level Optimization (Application): When you can't change the API, transform responses before prompting. Filter fields, flatten structures, and convert to compact formats.
- Content-Level Optimization (Intelligence): Use relevance filtering, summarization, and smart retrieval to include only what matters. The best compression is including less data.
Let the format be efficient; let the transformation be smart; let the content be relevant.
Technical Deep Dive
1) Building LLM-Friendly Data Endpoints
If you own the API, this is your biggest leverage point. Four formats consistently outperform standard JSON:
YAML: Variable token reduction
YAML eliminates braces and quotes around keys, uses indentation for structure, and reads naturally. However, token savings vary significantly based on data structure:
# JSON: 187 tokens
{
  "users": [
    {
      "id": 1,
      "name": "Alice Chen",
      "email": "alice@example.com",
      "phone": "+1-555-0100",
      "address": "123 Main St",
      "title": "Engineer",
      "status": "active",
      "created_at": "2024-01-15",
      "updated_at": "2024-10-20"
    },
    {
      "id": 2,
      "name": "Bob Smith",
      "email": "bob@example.com",
      "phone": "+1-555-0200",
      "address": "456 Oak Ave",
      "title": "Designer",
      "status": "inactive",
      "created_at": "2024-02-20",
      "updated_at": "2024-10-18"
    }
  ]
}
# YAML: 147 tokens (approximately 20% reduction in this example)
users:
  - id: 1
    name: Alice Chen
    email: alice@example.com
    phone: +1-555-0100
    address: 123 Main St
    title: Engineer
    status: active
    created_at: 2024-01-15
    updated_at: 2024-10-20
  - id: 2
    name: Bob Smith
    email: bob@example.com
    phone: +1-555-0200
    address: 456 Oak Ave
    title: Designer
    status: inactive
    created_at: 2024-02-20
    updated_at: 2024-10-18
Why the savings are small here: With only 2 objects, each containing 9 fields, the JSON repeats keys just 18 times total. The format overhead (braces, quotes) is relatively small compared to the actual data content.
Where YAML really shines: In production APIs returning 100+ similar objects. Consider 100 users with 9 fields each:
- JSON: Repeats "id", "name", "email", "phone", "address", "title", "status", "created_at", "updated_at" 100 times with quotes and braces
- YAML: Same keys, no quotes, no braces—just indentation
For large arrays, YAML typically saves 20-35% tokens compared to JSON. The exact savings depend on:
- Number of objects in the array
- Number of fields per object
- Field name lengths (longer keys = more repeated overhead)
- Data nesting depth
Markdown Tables: Most effective for large arrays
Headers appear once instead of repeating for every row. This is particularly effective for lists of similar objects:
| ID | Name | Email | Phone | Address | Title | Status | Created At | Updated At |
| --- | ---------- | ----------------- | ----------- | ----------- | -------- | -------- | ---------- | ---------- |
| 1 | Alice Chen | alice@example.com | +1-555-0100 | 123 Main St | Engineer | active | 2024-01-15 | 2024-10-20 |
| 2 | Bob Smith | bob@example.com | +1-555-0200 | 456 Oak Ave | Designer | inactive | 2024-02-20 | 2024-10-18 |
With just 2 rows, this markdown table uses 143 tokens, also saving about 20%. The format overhead of the table structure (pipes, separators, aligned spacing) offsets savings for very small datasets.
The real impact comes at scale: For 100 users with the same 9 fields:
- JSON: "id", "name", "email", "phone", "address", "title", "status", "created_at", "updated_at" appear 100 times with quotes, braces, and commas → approximately 7k-8k tokens
- Markdown table: Headers appear once, data in clean rows → approximately 4k-5k tokens
That's roughly a 40% reduction, and the absolute savings grow linearly with row count because:
- JSON: Every row repeats all field names in quotes
- Markdown: Field names appear once in the header row
This is why markdown tables are highly effective for API responses returning arrays of homogeneous objects—the common case for list endpoints, search results, and bulk queries.
Columnar JSON: Best of both worlds
If you need to stay within JSON but want significant token savings, use a columnar format where keys appear once in a header, and data is represented as arrays:
{
  "columns": [
    "id",
    "name",
    "email",
    "phone",
    "address",
    "title",
    "status",
    "created_at",
    "updated_at"
  ],
  "count": 2,
  "data": [
    [
      1,
      "Alice Chen",
      "alice@example.com",
      "+1-555-0100",
      "123 Main St",
      "Engineer",
      "active",
      "2024-01-15",
      "2024-10-20"
    ],
    [
      2,
      "Bob Smith",
      "bob@example.com",
      "+1-555-0200",
      "456 Oak Ave",
      "Designer",
      "inactive",
      "2024-02-20",
      "2024-10-18"
    ]
  ]
}
This columnar format uses 165 tokens—a 12% reduction from standard JSON (187 tokens) while remaining valid JSON. For 100 users:
- Standard JSON: ~7k-8k tokens
- Columnar JSON: ~4k-4.5k tokens
- Savings: ~35%-45%
The columnar format scales particularly well because:
- Keys appear once in the columns array
- Data rows are simple arrays with no repeated keys
- No structural overhead per row beyond brackets and commas
This approach is ideal when you need JSON compatibility (for type safety, existing parsers, or API contracts) but want better token efficiency.
2) Transform-Level Optimization
When you can't change the API, transform responses before prompting:
Field filtering
Remove fields that don't contribute to the task:
def filter_fields(data, required_fields):
    """Keep only fields needed for the specific task"""
    if isinstance(data, list):
        return [
            {k: v for k, v in item.items() if k in required_fields}
            for item in data
        ]
    return {k: v for k, v in data.items() if k in required_fields}

# For query: "Which accounts have revenue over $4M?"
required_fields = {"name", "revenue", "industry"}
filtered_data = filter_fields(raw_api_response["accounts"], required_fields)
Why this works:
- Typical APIs return 10-30 fields per object
- You might only need 5-10 fields to answer the query
- Removing unnecessary fields cuts tokens by 40-60%
Format conversion
Convert JSON to compact formats:
import yaml
import pandas as pd

def json_to_yaml(data):
    """Convert JSON to YAML for 15-30% token savings"""
    return yaml.dump(data, default_flow_style=False)

def json_to_markdown_table(data, columns=None):
    """Convert array of objects to markdown table"""
    if not data:
        return ""
    df = pd.DataFrame(data)
    if columns:
        df = df[columns]  # Reorder or filter columns
    return df.to_markdown(index=False)  # to_markdown requires the tabulate package

# Example usage
yaml_context = json_to_yaml(filtered_data)
# or
table_context = json_to_markdown_table(
    filtered_data,
    columns=["name", "revenue", "industry"]
)
Structure flattening
Flatten nested structures to reduce depth:
def flatten_nested(data, parent_key="", sep="_"):
    """Flatten nested JSON to single level"""
    items = []
    for k, v in data.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_nested(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

# Before: {"account": {"metadata": {"campaign": "q1_2023"}}}
# After:  {"account_metadata_campaign": "q1_2023"}
3) Content-Level Optimization
Relevance filtering
Only include data relevant to the query:
def semantic_filter(query, documents, threshold=0.7):
    """Use embeddings to filter by relevance"""
    # get_embedding and cosine_similarity are placeholders for your
    # embedding provider and similarity function
    query_embedding = get_embedding(query)
    relevant_docs = []
    for doc in documents:
        doc_embedding = get_embedding(doc["content"])
        similarity = cosine_similarity(query_embedding, doc_embedding)
        if similarity >= threshold:
            relevant_docs.append(doc)
    return relevant_docs

# Only include documents semantically related to the query
filtered_context = semantic_filter(
    query="customer satisfaction trends",
    documents=all_documents,
    threshold=0.7
)
Summarization
Compress information while preserving key facts:
def hierarchical_summarization(documents, target_tokens=500):
    """Summarize documents to fit a token budget"""
    # chunk_documents, llm, and count_tokens are assumed helpers
    # Chunk documents into groups
    chunks = chunk_documents(documents, chunk_size=5)
    # Summarize each chunk
    summaries = []
    for chunk in chunks:
        summary = llm.summarize(
            "\n\n".join([doc["content"] for doc in chunk]),
            max_tokens=100
        )
        summaries.append(summary)
    # If still too large, summarize the summaries
    if count_tokens("\n\n".join(summaries)) > target_tokens:
        final_summary = llm.summarize(
            "\n\n".join(summaries),
            max_tokens=target_tokens
        )
        return final_summary
    return "\n\n".join(summaries)
Context offloading
Store intermediate results in the filesystem for long-running workflows:
import json
from pathlib import Path

class ContextManager:
    """Manage context by offloading to the filesystem"""

    def __init__(self, workspace_dir="/tmp/agent_context"):
        self.workspace = Path(workspace_dir)
        self.workspace.mkdir(parents=True, exist_ok=True)

    def save_intermediate_result(self, step_name, data):
        """Save step result to file"""
        file_path = self.workspace / f"{step_name}.json"
        with open(file_path, 'w') as f:
            json.dump(data, f)
        return file_path

    def load_relevant_context(self, current_step, needed_steps):
        """Load only context needed for current step"""
        context = {}
        for step in needed_steps:
            file_path = self.workspace / f"{step}.json"
            if file_path.exists():
                with open(file_path, 'r') as f:
                    context[step] = json.load(f)
        return context

    def create_context_summary(self):
        """Create compact summary of all stored context"""
        all_files = list(self.workspace.glob("*.json"))
        summary = {
            "total_steps": len(all_files),
            "available_data": [f.stem for f in all_files]
        }
        return summary

# Usage in an agentic workflow
ctx_mgr = ContextManager()

# Step 1: Research
research_results = agent.research(topic)
ctx_mgr.save_intermediate_result("research", research_results)

# Step 2: Analysis (only load research data)
relevant_context = ctx_mgr.load_relevant_context(
    "analysis",
    needed_steps=["research"]
)
analysis_results = agent.analyze(relevant_context["research"])
ctx_mgr.save_intermediate_result("analysis", analysis_results)

# Step 3: Final report (load summary, not all data)
context_summary = ctx_mgr.create_context_summary()
final_report = agent.generate_report(
    context_summary=context_summary,
    detailed_data=ctx_mgr.load_relevant_context(
        "report",
        needed_steps=["analysis"]
    )
)
Why this works:
- Keeps context window focused on current task
- Enables deep research workflows that would otherwise exceed context limits
- Reduces token costs by avoiding redundant context in every step
- Provides clear data lineage for debugging
Important caveat: This adds filesystem I/O latency, making it unsuitable for real-time applications where sub-second response times are critical. Best for batch processing, deep research, and long-running workflows where thoroughness matters more than speed.
4) Context Window Management
Sliding window
Maintain only recent conversation turns:
def sliding_window(messages, window_size=5):
    """Keep only the N most recent messages"""
    return messages[-window_size:]
Multi-agent architectures
Split complex tasks across specialized agents, each with focused context:
class AnalysisAgent:
    def __init__(self):
        self.context_schema = ["accounts", "revenue_data"]  # Limited scope

    def analyze(self, query):
        # This agent only sees financial data
        relevant_data = fetch_data(self.context_schema)
        return llm.analyze(query, context=relevant_data)

class CustomerAgent:
    def __init__(self):
        self.context_schema = ["accounts", "contacts", "activities"]

    def handle_query(self, query):
        # This agent focuses on customer relationships
        relevant_data = fetch_data(self.context_schema)
        return llm.generate(query, context=relevant_data)
Each agent maintains a smaller, more focused context window instead of one large shared context.
Worked Example: Before and After
Let's see these techniques in action. Imagine an API returns information about customer accounts:
Before: Raw JSON (~370 tokens)
{
  "accounts": [
    {
      "id": "acc_001",
      "name": "Acme Corporation",
      "industry": "Technology",
      "revenue": 5000000,
      "created_at": "2023-01-15T08:30:00Z",
      "updated_at": "2024-10-18T14:22:00Z",
      "status": "active",
      "metadata": {
        "source": "web_signup",
        "campaign": "q1_2023",
        "internal_id": "legacy_12345"
      },
      "contacts": [
        {
          "id": "con_001",
          "name": "John Smith",
          "email": "john@acme.com",
          "phone": "+1-555-0100",
          "title": "CTO"
        }
      ]
    },
    {
      "id": "acc_002",
      "name": "Beta Industries",
      "industry": "Manufacturing",
      "revenue": 3500000,
      "created_at": "2023-03-22T10:15:00Z",
      "updated_at": "2024-10-17T09:45:00Z",
      "status": "active",
      "metadata": {
        "source": "partner_referral",
        "campaign": "q2_2023",
        "internal_id": "legacy_67890"
      },
      "contacts": [
        {
          "id": "con_002",
          "name": "Jane Doe",
          "email": "jane@beta.com",
          "phone": "+1-555-0200",
          "title": "VP Engineering"
        }
      ]
    }
  ]
}
After: Markdown Table (~70 tokens)
| Name | Revenue | Industry | Primary Contact |
| ---------------- | ---------- | ------------- | ------------------------- |
| Acme Corporation | $5,000,000 | Technology | John Smith (CTO) |
| Beta Industries | $3,500,000 | Manufacturing | Jane Doe (VP Engineering) |
Result: Approximately 80% token reduction by combining format optimization with aggressive field filtering.
This represents optimal compression where:
- Most fields (IDs, timestamps, metadata, detailed contact info) are irrelevant to the task
- The query only needs summary information
- Data naturally fits a tabular structure
For queries like "Which accounts have revenue over $4M?", the markdown table contains everything needed. However, the metadata, timestamps, and detailed contact information might be essential for other use cases.
Important caveat: This is just 2 accounts. With 50 or 100 accounts, the JSON would grow quickly, while the markdown table grows much more slowly, making the savings more dramatic as array size increases.
Benchmarks & Metrics
Performance improvements vary significantly based on data structure, use case, and especially array size. These estimates assume medium to large arrays (20+ objects):
| Optimization | Token Reduction | Notes |
|---|---|---|
| JSON → YAML | 10-25% | Minimal for small arrays; grows with array size |
| JSON → Columnar JSON | 25-50% | JSON-compatible; keys appear once |
| JSON → Markdown Table | 20-40% | Highly effective for large arrays (100+ rows) |
| + Field Filtering | 30-60% | Depends on how many fields are removable |
| + Relevance Filtering | 40-70% | Highly use-case dependent |
| + Summarization | 50-80% | Risk of information loss |
Important caveats:
- Format changes alone provide minimal savings for small arrays (2-10 items). The real benefit comes from eliminating repeated keys in large arrays.
- These ranges represent observed results across different use cases, not guarantees
- Actual savings depend heavily on: array size, number of fields per object, and field name lengths
- Higher compression rates increase the risk of losing important information
- Combined optimizations don't always multiply (e.g., field filtering + format change may overlap)
- Always benchmark with your actual data and validate quality doesn't degrade
Best Practices & Caveats
Do
- Design APIs for LLMs from day one. If you're building new services, return YAML or Markdown natively.
- Measure token counts obsessively. Use tiktoken or similar libraries to profile every part of your prompt pipeline. What you can measure, you can optimize.
- Start with field filtering. The simplest optimization (removing unused fields) often provides meaningful gains with minimal risk.
- Keep a gold standard. Maintain test cases that verify compression doesn't hurt answer quality.
- Layer optimizations strategically. Combine format changes, filtering, and summarization for maximum effect.
Don't
- Don't blindly compress everything. Some use cases (detailed data analysis, debugging) require complete information. Know when to preserve fidelity.
- Don't forget parsing. If you invent a custom format, ensure downstream systems can parse it reliably. Test edge cases.
- Don't assume compression is free. Techniques like LLM-based summarization add latency and cost. Measure the trade-offs.
- Don't over-optimize prematurely. If your context is already small (<1000 tokens), optimization may not be worth the complexity.
Lessons Learned / Key Takeaways
- Context engineering matters: The format and content of your context directly impacts cost, latency, and quality.
- JSON has significant overhead for LLMs: Its verbosity makes it less efficient than more compact formats in token-constrained environments.
- Relevance filtering beats compression: Excluding irrelevant data often provides better results than compressing everything. Focus on relevance first, format second.
- Optimizations can be layered: Combining multiple techniques (format + filtering + summarization) typically yields better results than any single approach.
- Context offloading enables longer agent sessions: For deep research and long-running workflows, storing intermediate results in the filesystem prevents context overflow. However, this adds latency unsuitable for real-time applications.
- Always measure and validate: Token counts, cost metrics, and quality benchmarks should guide optimization decisions. Results vary significantly across use cases.
Wrapping Up
Simply throwing data at an LLM isn't enough for production applications. If you're building AI agents, the pattern that scales is strategic context engineering: return LLM-friendly formats from APIs, transform data before prompting, filter aggressively for relevance, and offload context for long-running workflows. Curate the right data, in the right format, at the right moment—and your agents will be faster, cheaper, and more focused.
Context windows have limits. Don't waste tokens on structural overhead and irrelevant data. The teams shipping production AI agents use these techniques daily to scale efficiently. Start with the low-hanging fruit (field filtering, markdown tables), measure improvements, and progressively layer in more sophisticated approaches like context offloading. That's how you build agents that are fast, focused, and future-ready.