Multi-Hop RAG with DSPy: When Simple Retrieval Isn't Enough

By Xiaoyi Zhu

"What college did the composer of Run Run Run go to?"

I've been seeing a lot of talk about multi-hop RAG and DSPy lately. The concepts sounded interesting—systems that can retrieve information iteratively, identify gaps, and piece together distributed facts. But I wanted to actually build something to understand how it works, not just read about it.

So I built a demo using the MuSiQue dataset, which has 25,000 Wikipedia-based questions specifically designed to require multi-hop reasoning. Questions like the one above need two distinct retrieval steps: first identifying who composed the song (Lou Reed from The Velvet Underground), then finding which college he attended (Syracuse University). Simple RAG consistently fails on these because the information is genuinely distributed across different Wikipedia articles.

This post shares what I learned from the experiment. The short version: multi-hop RAG with synthesis-driven iteration can significantly improve answer quality for complex questions, but there are real trade-offs around latency and cost. It's not a magic solution, just another tool with specific use cases.


Why I Wanted to Try This

I've been working with standard RAG systems, and they work well for a lot of questions. But I kept reading about cases where simple retrieval comes up short—when information is scattered across multiple sources and you need to synthesize facts from different documents.

The academic papers make multi-hop RAG sound powerful: systems that can reason about what's missing, generate follow-up queries, retrieve iteratively. DSPy was being discussed as a framework that makes this easier to implement. I wanted to see for myself how true that was.

So the goal was simple: build a working multi-hop RAG demo, compare it to simple RAG on the same dataset, and see what the actual differences are. Not to solve a specific problem, but to learn the concepts through hands-on implementation.


Patterns Where Simple RAG Struggles

While building the demo, I tested both approaches on various questions from the dataset. Here's what I observed about where simple RAG tends to have limitations:

  • Information is naturally distributed: Complex questions often require facts from multiple sources. Simple RAG retrieves chunks based on similarity to your query, but if the top results focus on just one aspect, you get an incomplete picture.

  • Single-query bias: When you embed a complex question, the vector tends to favor documents mentioning all entities together. Articles focusing on individual pieces might rank lower, even though connecting those pieces answers the question.

  • Context window trade-offs: You can only fit so many chunks in a prompt. Simple RAG picks the top-k by similarity. If those k chunks happen to come from articles about the same sub-topic, you've wasted context space on redundant information.

  • No awareness of gaps: Simple RAG retrieves once and generates an answer. If critical information isn't in the top-k chunks, the answer will be incomplete—with no mechanism to notice what's missing.

  • No verification mechanism: The system can't tell if it actually has enough information to answer confidently. It generates an answer regardless of context quality.

These aren't dealbreakers—simple RAG works great for many use cases. But for questions genuinely requiring synthesis across sources, I wanted to see if multi-hop retrieval with adaptive iteration could make a difference.


The Breakthrough: Synthesis-in-the-Loop

The key insight that made multi-hop RAG actually work came from a simple but important shift:

Instead of deciding whether to continue based on heuristics, actually try to answer the question after each retrieval hop.

Here's the pattern:

For each hop (up to 3):
  1. Retrieve relevant contexts
  2. Try to synthesize an answer from accumulated contexts
  3. If answer is sufficient → Stop and return
  4. If answer is insufficient → Generate targeted follow-up query
  5. Continue to next hop

The system stops as soon as it CAN answer, not when it THINKS it can.
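
In code, the loop is roughly the following. This is a minimal Python sketch: retrieve, try_synthesize, and generate_followup are hypothetical placeholders standing in for the retrieval and DSPy modules described later, not the demo's actual functions.

# Minimal sketch of synthesis-in-the-loop; the three helpers are placeholders.
MAX_HOPS = 3

def multi_hop_answer(question: str) -> str:
    contexts: list[str] = []
    query = question
    result = None
    for hop in range(MAX_HOPS):
        # 1. Retrieve contexts for the current query and accumulate them
        contexts.extend(retrieve(query))
        # 2. Try to synthesize an answer from everything gathered so far
        result = try_synthesize(question, contexts)
        # 3. Stop as soon as the answer is actually sufficient
        if result.sufficient:
            return result.answer
        # 4. Otherwise turn the identified gap into a targeted follow-up query
        query = generate_followup(question, contexts, result.missing_info)
    # Max hops reached: return the best attempt so far
    return result.answer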

Why This Matters

The principle: let answer quality drive the iteration, not heuristics.

After each retrieval hop, the system attempts to answer the question. If it can construct a complete answer from the accumulated contexts, it stops. If information is still missing, it identifies what's needed and retrieves more.

Example flow:

Hop 1: Retrieve contexts → Try to answer
  → "INSUFFICIENT_INFORMATION: Missing Lou Reed's college"
  → Generate follow-up: "What college did Lou Reed attend?"

Hop 2: Retrieve more contexts → Try to answer again
  → "SUCCESS: Lou Reed attended Syracuse University"
  → Return final answer and stop

This makes the system adaptive: it automatically uses 1 hop for simple questions and 2-3 hops for complex ones, based on whether it can actually answer—not based on guessing whether it has enough information.


The Multi-Hop Pattern I Implemented

Here's the complete flow with synthesis-in-the-loop:

High-Level Architecture

User Query: "What college did the composer of Run Run Run go to?"
        ↓
STEP 1: Query Decomposition
  Complex query → "Who composed Run Run Run?"
        ↓
ITERATIVE LOOP (adaptive: 1-3 hops)

  Hop N: Retrieve Contexts
    • Embed query (384-dim) + search vector DB
    • Optional: query expansion on first hop
    • Smart deduplication (ID + content similarity)
        ↓
  Try to Synthesize Answer
    "Can I answer from accumulated contexts?"
    ├─ YES (sufficient) → Return Answer and Stop
    └─ NO (insufficient) → Generate Follow-Up Query
         (e.g. "What college did Lou Reed attend?")
         then loop back, or stop if max hops (3) reached
        ↓
(after loop terminates)

Rerank Final Contexts
  Cross-encoder scores relevance → Select top-10
        ↓
Final Answer: "Lou Reed attended Syracuse University"

Key Differences

  • Synthesis-in-the-Loop: System tries to answer after EACH hop, not just at the end
  • Adaptive Hop Count: Automatically uses 1-3 hops based on synthesis success
  • Query Decomposition: Breaks complex questions into focused sub-queries
  • Smart Deduplication: Prevents redundant contexts across hops (a sketch follows this list)
  • Cross-Encoder Reranking: Final relevance scoring before synthesis
  • Synthesis Transparency: Track all attempts to show iterative reasoning
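
On the deduplication point, the idea is to combine exact ID checks with a simple content-similarity filter so a hop doesn't re-add paragraphs it already has. This is a rough sketch under my own assumptions (difflib similarity with a 0.9 threshold), not necessarily what the demo does:

from difflib import SequenceMatcher

def deduplicate(existing: list[dict], new: list[dict], threshold: float = 0.9) -> list[dict]:
    # Keep only new contexts that aren't already present by ID or near-identical text
    kept = list(existing)
    seen_ids = {c["id"] for c in existing}
    for ctx in new:
        if ctx["id"] in seen_ids:
            continue
        if any(SequenceMatcher(None, ctx["text"], c["text"]).ratio() >= threshold for c in kept):
            continue
        kept.append(ctx)
        seen_ids.add(ctx["id"])
    return kept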

Why DSPy Made This Cleaner

When I started implementing this, manual prompt engineering quickly became messy. DSPy's value proposition is letting you focus on what you want rather than how to ask for it.

Instead of writing complex prompt templates with examples and formatting instructions, you declare the task structure: define your inputs (original query, accumulated contexts, missing info) and outputs (follow-up query). DSPy automatically handles chain-of-thought reasoning, input/output formatting, and lets you compose complex pipelines from simple, reusable pieces.
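
For example, the follow-up-query step can be declared as a DSPy signature along these lines. Field names here are illustrative, not necessarily what the demo uses, and an LM is assumed to have already been configured via dspy.configure:

import dspy

class GenerateFollowUp(dspy.Signature):
    """Identify what is still missing and produce one focused follow-up query."""
    original_query = dspy.InputField(desc="the user's original question")
    contexts = dspy.InputField(desc="contexts accumulated across previous hops")
    missing_info = dspy.InputField(desc="what the last synthesis attempt lacked")
    followup_query = dspy.OutputField(desc="a single, targeted retrieval query")

# ChainOfThought wraps the signature with intermediate reasoning before the output
generate_followup = dspy.ChainOfThought(GenerateFollowUp)
# prediction = generate_followup(original_query=..., contexts=..., missing_info=...)
# prediction.followup_query holds the generated query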

It's not magic, but it made the implementation significantly cleaner and more maintainable than my initial manual prompt engineering approach.


What I Built

The Dataset

I used the MuSiQue dataset from this TACL 2022 paper:

  • ~25,000 questions requiring 2-4 reasoning hops
  • ~7,676 unique Wikipedia paragraphs from diverse topics
  • ~30% F1 improvement for multi-hop vs simple RAG
  • Question types: 68% two-hop, 24% three-hop, 8% four-hop

Why this dataset? Wikipedia information is naturally distributed across separate articles—facts about related entities live in different pages. This forces real multi-hop reasoning rather than just retrieving one comprehensive document.

Tech Stack

  • Framework: DSPy for reasoning modules
  • LLM: Google Gemini 2.5 Flash Lite (fast, cost-effective)
  • Embeddings: all-MiniLM-L6-v2 (384-dim, lightweight)
  • Vector DB: Qdrant Cloud (semantic search)
  • Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2 (relevance scoring)
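
To make the stack concrete, a single retrieval hop plus reranking might look roughly like the sketch below. The collection name, payload field, and top-k values are illustrative, not the demo's actual configuration:

from sentence_transformers import SentenceTransformer, CrossEncoder
from qdrant_client import QdrantClient

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # 384-dim embeddings
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # relevance scoring
client = QdrantClient(url="https://<your-cluster-url>", api_key="<your-api-key>")

def retrieve(query: str, k: int = 10) -> list[str]:
    # Embed the query and run a semantic search over the paragraph collection
    vector = embedder.encode(query).tolist()
    hits = client.search(collection_name="musique_paragraphs", query_vector=vector, limit=k)
    return [hit.payload["text"] for hit in hits]

def rerank(query: str, passages: list[str], top_n: int = 10) -> list[str]:
    # Cross-encoder scores each (query, passage) pair; keep the highest-scoring ones
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]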

Testing: Real-World Example

Here's a concrete example from the MuSiQue dataset showing the difference.

User Query

"What college did the composer of Run Run Run go to?"

This requires two hops: (1) Who composed "Run Run Run"? (Lou Reed from The Velvet Underground), (2) What college did Lou Reed attend? (Syracuse University).

Simple RAG

Retrieved Contexts (top-5):

  1. Run Run Run (The Velvet Underground song) - Score: 0.585

    • Mentions Lou Reed wrote the song
    • Details drug-themed lyrics and NYC setting
    • No mention of college or education
  2. Run (TV series) - Score: 0.492

    • British miniseries (completely irrelevant)
  3. A Time to Run - Score: 0.488

    • Political novel (completely irrelevant)
  4. The Running Fight - Score: 0.458

    • 1915 silent film (completely irrelevant)
  5. Love on the Run (1979 film) - Score: 0.451

    • François Truffaut film (completely irrelevant)

Generated Answer:

The provided contexts do not contain information about the college the composer of 'Run Run Run' attended. Context 1 mentions that Lou Reed wrote the song, but it does not state where he went to college.

Problem: Vector similarity matched on "Run Run Run" but retrieved mostly irrelevant results (TV shows, films with "Run" in title). Only 1 out of 5 contexts was relevant, and it didn't contain the educational information needed. Single-query retrieval couldn't connect the composer identity to his college.

Multi-Hop RAG with Synthesis-in-the-Loop

Decomposed Query: "Who composed the song Run Run Run?"

Hop 1 - Retrieval (Top contexts):

  1. Run Run Run (The Velvet Underground song) - Score: 0.723

    • Song written by Lou Reed
    • Details about lyrics and NYC setting
    • No college information
  2. Fox on the Run (Manfred Mann) - Score: 0.580 (irrelevant)

  3. Born to Run (Emmylou Harris) - Score: 0.554 (irrelevant)

  4. Love on the Run (film) - Score: 0.551 (irrelevant)

Hop 1 - Synthesis Attempt:

⚠️ INSUFFICIENT → Continue

"The user is asking about the college attended by the composer of 'Run Run Run'. Context 1 mentions that 'Run Run Run' is a song by The Velvet Underground, written by Lou Reed. However, none of the provided contexts mention which college Lou Reed attended."

Follow-Up Query Generated: "What college did Lou Reed attend?"

Hop 2 - Retrieval (Top contexts):

  1. Lou Reed - Score: 0.700 ✓

    • "Reed resumed his education at Syracuse University in 1960, studying journalism, film directing, and creative writing"
    • This is the missing piece!
  2. Michael Reed (American football) - Score: 0.596 (irrelevant)

  3. Jerry Reed (baseball) - Score: 0.555 (irrelevant)

Hop 2 - Synthesis Attempt:

SUFFICIENT → Stop

"Lou Reed, the composer of 'Run Run Run', attended Syracuse University."

The Difference:

  • Hop 1 identified Lou Reed as composer but lacked college info → generated targeted follow-up
  • Hop 2 retrieved Lou Reed's educational background, completing the picture
  • Synthesis-in-the-loop ensured system only stopped when it could actually answer
  • Final answer correctly connects both pieces of information across 2 hops

Try the Demo Yourself

I deployed the demo so others can experiment: Multi-Hop RAG Demo

Three Approaches to Compare

1. Simple RAG

  • Single retrieval pass
  • Fastest (1-2 seconds)
  • Baseline performance

2. Multi-Hop RAG

  • Adaptive 1-3 hops
  • Synthesis-driven iteration
  • Best for distributed information

3. Query Expansion + Multi-Hop

  • Expand query into variations
  • Rank fusion combines results (one common option is sketched below)
  • Maximum coverage (but slower)
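
One common way to do the rank fusion is reciprocal rank fusion (RRF), which rewards documents that appear near the top of multiple result lists. A minimal sketch, assuming each expanded query returns an ordered list of document IDs (k=60 is the conventional constant):

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) from every list it appears in; sum and sort
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)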

Sample Questions to Try

Entertainment & Media:

"What college did the composer of Run Run Run go to?"
"Which season of American Idol featured the Part of Me composer as a guest judge?"
"Who was the songwriter of Someday writing about in We Belong Together?"

Historical & Geographic:

"In what Ukrainian province would you find the birthplace of Marjana Gaponenko?"
"What was the former name of Song dynasty's capital?"
"When did WWII end in the country where the screenwriter of E adesso sesso is a citizen?"

Multi-Step Reasoning:

"Who is the sibling of the composer of Pocahontas?"
"What is the enrollment at the owner of Benficence?"
"When was the Commander-in-Chief position abolished in the city that held the Tricorn Centre?"

The demo includes 150 sample questions from the MuSiQue dataset. Use the shuffle button to explore different question types!


What I Learned

What Worked Well

  • Synthesis-in-the-loop is crucial: Trying to answer after each hop prevented the system from stopping too early
  • Adaptive hop count saves cost: Simple questions use 1 hop, complex ones use 2-3
  • Transparency builds trust: Showing synthesis attempts helps users understand the reasoning
  • Score normalization matters: Converting raw reranker scores (e.g. -4.5) to probabilities (0.01) improved UX significantly (a sketch follows this list)
  • Query decomposition helps: Breaking complex questions into focused sub-queries improves first-hop quality
  • DSPy reduces boilerplate: Signatures are clearer than manual prompt templates once you get used to the pattern
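
On the score-normalization bullet: the fix can be as simple as passing the raw cross-encoder logit through a sigmoid so it reads as a 0-1 relevance probability. A minimal sketch:

import math

def normalize_score(logit: float) -> float:
    # Map a raw cross-encoder logit to a 0-1 relevance probability via a sigmoid
    return 1.0 / (1.0 + math.exp(-logit))

print(round(normalize_score(-4.5), 3))  # ~0.011, roughly the 0.01 shown in the UI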

Challenges & Learnings

  • Multi-hop isn't free: 2-3x latency increase, higher token costs
  • Error propagation: Poor first hop affects follow-up query quality
  • Prompt engineering still matters: DSPy signatures need careful design for good outputs
  • Query expansion parsing: LLMs sometimes return malformed JSON, so robust fallbacks were needed (sketched after this list)
  • Reranker scores need context: Raw logits confused users, normalization essential
  • UI complexity: Balancing information density with clarity took iteration
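
For the malformed-JSON issue above, the fallback logic is roughly: attempt a strict parse, then try to salvage an array embedded in surrounding prose, and finally fall back to the original query. A sketch, not the demo's exact code:

import json
import re

def parse_expansions(raw: str, original_query: str) -> list[str]:
    # Strict parse first: expect a JSON list of query strings
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
            return parsed
    except json.JSONDecodeError:
        pass
    # Salvage attempt: the LLM often wraps the array in extra prose
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, list):
                return [str(q) for q in parsed]
        except json.JSONDecodeError:
            pass
    # Last resort: use the original query unexpanded
    return [original_query]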

Things to Watch For

  • Multi-hop isn't needed for simple factoid questions
  • Token costs add up quickly with multiple LLM calls per query
  • Follow-up query quality is critical—vague queries waste hops
  • More hops doesn't always mean better answers (diminishing returns)
  • Evaluation is hard without test sets (building ground truth is time-consuming)

Trade-offs & Future Improvements

Current Limitations:

  • Latency: 3-4 seconds for multi-hop vs 1-2 seconds for simple RAG
  • Cost: 3-5x token usage (multiple synthesis attempts + reasoning)
  • Complexity: More moving parts means more potential failure points
  • Error propagation: First hop mistakes compound in follow-up queries

Reflections

A few things that stood out while working on this:

  • Building beats reading: Papers describe concepts, but implementation teaches you what actually matters
  • Synthesis-driven iteration was the breakthrough: Stopping based on actual answer attempts (not heuristics) made the biggest difference
  • Transparency is underrated: Showing synthesis attempts improved both debugging and user trust
  • Small UX details matter: Score normalization, color-coded attempt boxes, collapsible sections—these add up
  • Trade-offs are real: Multi-hop isn't free. Be selective about when to use it
  • DSPy's value is composability: The win isn't just cleaner code, it's being able to compose complex reasoning pipelines from reusable pieces

Wrapping Up

I built this demo to understand how multi-hop RAG works and whether synthesis-driven iteration actually helps. The main takeaway: for questions where information is scattered across sources, adaptive multi-hop with synthesis-in-the-loop significantly improves answer quality.

The key insights:

  1. Try to answer after each hop (synthesis drives iteration)
  2. Make the process transparent (show synthesis attempts)
  3. Normalize scores for humans (probabilities > raw logits)
  4. Use adaptive hop counts (save cost on simple questions)

It's not a silver bullet—latency and cost increase, and it's not needed for all question types. But for knowledge-intensive applications where you genuinely need to synthesize distributed information, the pattern seems worth considering.

Check out the Multi-Hop RAG Demo to experiment yourself. It's fun to play around with, and it's interesting to see which questions actually need multi-hop reasoning and which don't.