Stop Prompting, Start Programming: How DSPy Makes Building with LLMs Actually Work

Authors
  • Xiaoyi Zhu

If you've built anything with an LLM, you know the feeling. You start with a brilliant idea, but quickly find yourself lost in a maze of prompts. You tweak a word here, rephrase a sentence there, and pray the output doesn’t break on the next run. For complex workflows, maintaining dozens of these fragile prompts becomes a developer's nightmare, where every new feature or edge case means another round of painful prompt wrestling.

This isn’t ideal. We need a more robust way to build AI systems.

Introduction

Enter DSPy, a framework from the Stanford AI Lab that proposes a radical shift: Programming—not prompting—LMs. DSPy lets you define your AI’s behavior using modular Python code with clear inputs and outputs, rather than getting bogged down in writing brittle prompt strings. Under the hood, DSPy automatically handles the hard parts: generating, optimizing, and executing the prompts for you.

This matters because it transforms AI development from an ad-hoc, trial-and-error art form into a structured and reliable software engineering discipline. DSPy makes LLM-powered components behave more like regular functions: you define what you want (the inputs and outputs), and the framework figures out how to get it (the exact prompt wording and format). This approach can dramatically speed up development, as you spend less time guessing at prompts and more time building your application's high-level logic.

The Headache: The Fragile Art of Prompt Engineering

Developing a sophisticated LLM application, like a RAG system, with traditional tools often feels like building a house of cards. Developers using frameworks like LangGraph consistently run into the same walls:

  • Fragile Prompts: LLM prompts are surprisingly brittle. A prompt that works beautifully with one model version might completely fail with the next update. Even tiny phrasing differences can lead to wildly different outputs: one might give you bullet points, the other a dense paragraph. This unpredictability makes it hard to build systems you can trust.

  • Complex Boilerplate & Glue Code: You often have to manually manage the state and logic for every single step. This involves parsing the output from one LLM call, formatting it as the input for the next, and handling all the intermediate logic. While frameworks provide abstractions, you still write significant "glue code" to chain prompts, manage state, and handle errors. This boilerplate is not only tedious but also a prime source of bugs.

  • The Endless Cycle of Prompt Tuning: Improving an application often descends into an endless loop of trial-and-error. Developers spend hours tweaking wording, adding few-shot examples, or adjusting instructions just to coax a slightly better output from the model. When requirements change, the entire tuning process starts over. This makes development slow and hard to improve systematically.

  • Lack of Modularity and Testing: Prompt-based logic is often trapped in lengthy, monolithic strings that resist unit testing and reuse. How do you verify that one part of a prompt pipeline produces output in the exact format the next part expects? You can't. The valuable software engineering practices we take for granted—like modular design and reliable testing—are absent.

The root cause of this pain is simple: we're treating LLMs as unpredictable black boxes that can only be controlled through conversational voodoo. This approach conflates the program's logic (what we want to achieve) with the model's instructions (how we ask the LLM to do it). Frameworks like LangGraph help structure these interactions but don't solve the core problem. You are still responsible for hand-crafting the prompts and defining the control flow, which is the source of all the complexity and fragility.

The Simple Fix: Program, Don't Just Prompt

DSPy offers a clean solution by reframing the task as writing a program with well-defined components, not just a series of prompts. It introduces a few core concepts that treat LLM pipelines like modern deep learning models:

  1. Signatures: These are simple, declarative statements that define the input/output behavior of an LLM task. You specify the data fields you need (e.g., context, question -> response) instead of wrestling with a prompt template. DSPy uses this signature to generate the right prompt automatically.
  2. Modules: These are the building blocks of your program, analogous to layers in a neural network. Modules like dspy.Predict, dspy.ReAct or dspy.ChainOfThought take a Signature and handle all the messy interaction with the LLM. You can compose them to build complex logic.
  3. Optimizers: This is where the magic happens. A DSPy optimizer is an algorithm that automatically tunes your program—including the prompts themselves—to maximize a metric you define. It can generate few-shot examples, refine instructions, and even help fine-tune model weights, all without manual intervention.

By separating the program's logic from the LLM's instructions, DSPy's compiler can translate your high-level Python code into an optimized, high-performing prompt pipeline tailored to your specific data and LLM.
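
To see how little code this requires, here is a minimal sketch of the first two concepts in action. The model name is only an example; configure whichever LM you use.

import dspy

# Point DSPy at an LM (any supported provider/model string works here).
dspy.configure(lm=dspy.LM("openai/gpt-4.1-mini"))

# Signature: declare the inputs and outputs you need.
# Module: ChainOfThought turns that signature into an LLM call with intermediate reasoning.
qa = dspy.ChainOfThought("context, question -> response")

prediction = qa(
    context="DSPy is a framework for programming, rather than prompting, language models.",
    question="What does DSPy let developers do?",
)
print(prediction.response)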

Technical Deep Dive

Let's make this concrete by comparing how you'd build a simple RAG agent using LangGraph versus DSPy. The goal is to take a user's query, retrieve relevant documents, and generate an answer based on that context.

The LangGraph/LangChain Approach

With LangGraph, you define each step as a node in a state graph and manually manage the flow. The following snippet reveals the inherent verbosity of this approach:

  • Explicit State Management: The RAGState TypedDict explicitly defines every piece of data that must be passed between nodes.
  • Manual Node Implementation: You must write a Python function for every single step: processing the query, retrieving documents, assembling the context, generating the response, and processing the final output.
  • Boilerplate Code: Notice the boilerplate in _build_graph, where we need to manually construct the graph structure by calling add_node and add_edge.

# Assumed imports; Config and PineconeVectorStore are project-local helpers not shown here.
import re
from typing import Annotated, Dict, List, Optional, TypedDict

from langchain_core.messages import AIMessage, BaseMessage, HumanMessage
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages
from openai import OpenAI


class RAGState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]
    query: str
    processed_query: str
    retrieved_docs: List[Dict]
    context: str
    llm_response: str
    final_response: str
    error: Optional[str]


class RAGWorkflow:
    def __init__(self, config: Config = None):
        self.config = config or Config()
        self.vector_store = PineconeVectorStore()
        self.openai_client = OpenAI(api_key=self.config.OPENAI_API_KEY)
        self.graph = self._build_graph()

    def _build_graph(self) -> StateGraph:
        g = StateGraph(RAGState)
        g.add_node("query_processing", self._process_query_node)
        g.add_node("document_retrieval", self._retrieve_documents_node)
        g.add_node("context_assembly", self._assemble_context_node)
        g.add_node("llm_generation", self._generate_response_node)
        g.add_node("response_processing", self._process_response_node)
        g.add_edge(START, "query_processing")
        g.add_edge("query_processing", "document_retrieval")
        g.add_edge("document_retrieval", "context_assembly")
        g.add_edge("context_assembly", "llm_generation")
        g.add_edge("llm_generation", "response_processing")
        g.add_edge("response_processing", END)
        return g.compile()

    def _process_query_node(self, state: RAGState) -> RAGState:
        try:
            q = state.get("query", "").strip()
            if not q:
                raise ValueError("empty query")
            pq = re.sub(r"[^\w\s\-\.]", " ", q)
            pq = re.sub(r"\s+", " ", pq).strip()
            if len(pq.split()) < 2:
                pq = f"information about {pq}"
            state["processed_query"] = pq
        except Exception as e:
            state["error"] = str(e)
        return state

    def _retrieve_documents_node(self, state: RAGState) -> RAGState:
        try:
            pq = state.get("processed_query", "")
            if not pq:
                raise ValueError("no processed query")
            state["retrieved_docs"] = self.vector_store.similarity_search(
                query=pq,
                top_k=self.config.MAX_RETRIEVAL_DOCS,
                score_threshold=0.3,
            )
        except Exception as e:
            state["error"] = str(e)
        return state

    def _assemble_context_node(self, state: RAGState) -> RAGState:
        try:
            docs = state.get("retrieved_docs", [])
            q = state.get("query", "")
            if not docs:
                ctx = "No relevant documents found for this query."
            else:
                parts = []
                for i, d in enumerate(docs[: self.config.MAX_RETRIEVAL_DOCS]):
                    content = d["content"].strip()
                    if not content:
                        continue
                    title = d["title"] or f"Doc {i+1}"
                    parts.append(f"Source {i+1} ({title}):\n{content}\n")
                body = "\n".join(parts)
                ctx = f"""Based on the following information, answer the user's question: \"{q}\"\n\nRetrieved Information:\n{body}"""
            state["context"] = ctx
        except Exception as e:
            state["error"] = str(e)
        return state

    def _generate_response_node(self, state: RAGState) -> RAGState:
        try:
            ctx = state.get("context", "")
            if not ctx:
                raise ValueError("no context")
            messages = [
                {
                    "role": "system",
                    "content": "Answer user question using the provided documents. Be accurate and keep your response short and to the point.",
                },
                {"role": "user", "content": ctx},
            ]
            res = self.openai_client.chat.completions.create(
                model="gpt-4.1-mini",
                messages=messages,
            )
            state["llm_response"] = res.choices[0].message.content.strip()
        except Exception as e:
            state["error"] = str(e)
        return state

    def _process_response_node(self, state: RAGState) -> RAGState:
        try:
            resp = state.get("llm_response", "")
            if not resp:
                raise ValueError("no llm response")
            state["final_response"] = resp
            state.setdefault("messages", []).append(AIMessage(content=resp))
        except Exception as e:
            state["error"] = str(e)
        return state

    def process_query(
        self, query: str, conversation_history: Optional[List[BaseMessage]] = None
    ) -> Dict:
        state: RAGState = {
            "messages": conversation_history or [],
            "query": query,
            "processed_query": "",
            "retrieved_docs": [],
            "context": "",
            "llm_response": "",
            "final_response": "",
            "error": None,
        }
        state["messages"].append(HumanMessage(content=query))
        final = self.graph.invoke(state)
        if final.get("error"):
            return {
                "success": False,
                "error": final["error"],
                "response": "Error processing your question.",
            }
        return {
            "success": True,
            "response": final.get("final_response", ""),
        }

This is just a simple RAG agent, but to make it work, we need to maintain a state dictionary, write boilerplate for every node, manually format prompts, and wire everything together in a graph. This creates a lot of code that is tightly coupled and difficult to change.

The DSPy Approach: Simple and Declarative

Now, let's look at the same task implemented with DSPy. The code is dramatically cleaner and more concise.

# Assumed imports; Config and PineconeVectorStore are project-local helpers not shown here.
from typing import Any, Dict, List

import dspy


class DSPyRAGWorkflow:
    def __init__(self, config: Config = None):
        self.config = config or Config()
        self.vector_store = PineconeVectorStore()
        self._configure_dspy()
        self.react_agent = self._create_react_agent()

    def _configure_dspy(self):
        lm = dspy.LM(
            model="openai/gpt-4.1-mini",
            api_key=self.config.OPENAI_API_KEY,
        )
        dspy.configure(lm=lm)

    def search_documents(self, query: str) -> List[str]:
        try:
            return self.vector_store.similarity_search(
                query=query, top_k=self.config.MAX_RETRIEVAL_DOCS, score_threshold=0.3
            )
        except Exception as e:
            return [f"Error retrieving documents: {str(e)}"]

    def _create_react_agent(self) -> dspy.ReAct:
        signature = dspy.Signature("question -> response")
        return dspy.ReAct(
            signature=signature,
            tools=[self.search_documents],
            max_iters=3,
        )

    def process_query(self, query: str) -> Dict[str, Any]:
        try:
            if not query or not query.strip():
                return {
                    "success": False,
                    "error": "Empty query provided",
                    "response": "Please provide a valid question.",
                }

            result = self.react_agent(question=query.strip())
            # dspy.inspect_history()
            response = result.response if hasattr(result, "response") else str(result)

            return {
                "success": True,
                "response": response,
            }
        except Exception:
            return {
                "success": False,
                "error": "An error occurred during processing.",
                "response": "I encountered an error while processing your question. Please try again.",
            }

Look at what's missing:

  • No Manual Graph Building: There are no explicit nodes or edges. DSPy abstracts away the control flow.
  • No Manual Prompt Formatting: We don't write a single prompt template. The dspy.ReAct module, guided by the simple question -> response signature, figures out how to use the search_documents tool and formulate the final answer.
  • Simplified Tool Definition: Our search_documents tool is just a regular Python function. The dspy.ReAct agent handles when to call it and how to incorporate its results into the reasoning process.

The entire complex RAG logic is handled by a standard, optimizable DSPy module. This radically reduces the amount of code you need to write and maintain.
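
Using it is equally straightforward. The snippet below is a hypothetical usage sketch, assuming Config carries your API keys and Pinecone settings as in the code above:

# Hypothetical usage; Config and PineconeVectorStore are project-local helpers.
workflow = DSPyRAGWorkflow(config=Config())
result = workflow.process_query("What does OpenAI do?")
if result["success"]:
    print(result["response"])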

Optimization and Debugging

This is where DSPy truly shines.

Freedom from Prompt Tweaking & Reliable Optimization

Imagine you want to improve the RAG agent's performance. In the LangGraph version, you'd start tweaking the prompt in _assemble_context_node and hope for the best.

In DSPy, you can use a built-in optimizer like dspy.MIPROv2, which bootstraps candidate examples, proposes diverse instructions, and uses Bayesian optimization to discover the highest-performing prompts. You just need to provide a few example questions and answers, along with a metric (e.g., "does the generated response match the gold-standard answer?"). The optimizer then works its magic.
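
As a rough sketch, compiling the ReAct agent might look like the following; the metric, training examples, and optimization budget are placeholders you would adapt to your own data:

# Illustrative only: a tiny metric and trainset for MIPROv2.
def answer_matches(example, prediction, trace=None):
    # Placeholder metric: does the gold answer appear in the generated response?
    return example.response.lower() in prediction.response.lower()

trainset = [
    dspy.Example(
        question="What does OpenAI do?",
        response="OpenAI is an AI research and deployment company.",
    ).with_inputs("question"),
    # ... a handful more examples ...
]

optimizer = dspy.MIPROv2(metric=answer_matches, auto="light")
optimized_agent = optimizer.compile(workflow.react_agent, trainset=trainset)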

Clear and Easy Debugging

When a LangGraph chain goes wrong, you have to manually inspect the state at each step. DSPy provides a much simpler way to trace execution. Just call dspy.inspect_history() after a run.

# Example output from dspy.inspect_history()
System message:

Your input fields are:
1. `question` (str)
2. `trajectory` (str)

Your output fields are:
1. `reasoning` (str)
2. `response` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## trajectory ## ]]
{trajectory}

[[ ## reasoning ## ]]
{reasoning}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        Given the fields `question`, produce the fields `response`.


User message:

[[ ## question ## ]]
OpenAI is often associated with both the machine learning and artificial intelligence industries. Explain what OpenAI does, how its activities place it within these industries, and why it is recognized as a key player in this field.

[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
To provide a comprehensive explanation of what OpenAI does, how its activities place it within the machine learning and artificial intelligence industries, and why it is recognized as a key player, I need to gather detailed information about OpenAI's mission, projects, technologies, and industry impact.

[[ ## tool_name_0 ## ]]
search_documents

[[ ## tool_args_0 ## ]]
{"query": "OpenAI overview, mission, projects, technologies, impact in machine learning and artificial intelligence industries"}

[[ ## observation_0 ## ]]
[1] «««
    ...document content retrieved from the vector store...

[[ ## thought_1 ## ]]
I have gathered detailed information about OpenAI's mission, history, key projects, technologies, and its role in the AI and machine learning industries. Now I will synthesize this information to explain what OpenAI does, how its activities place it within these industries, and why it is recognized as a key player.

[[ ## tool_name_1 ## ]]
finish

[[ ## tool_args_1 ## ]]
{"kwargs": {"response": "OpenAI is an American artificial intelligence research...

[[ ## observation_1 ## ]]
Completed.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## response ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

This output gives you a crystal-clear, step-by-step view of the agent's reasoning process, making it incredibly easy to see what it did and why. What's more, you can also see how DSPy orchestrates the Modules and Signatures into the underlying LLM prompts for you.

Best Practices & Caveats

  • Do start with Signatures: Define the inputs and outputs of your task clearly. This is the foundation of your DSPy program.
  • Do compose Modules: Build complex behaviors by combining simpler modules. This keeps your code clean and modular. For example, you can combine dspy.ReAct and dspy.ChainOfThought to build a more advanced multi-hop RAG system; see the sketch after this list.
  • Do use Optimizers: Don't just run your DSPy programs. Compile them! Even with a small dataset of 5-10 examples, optimizers can lead to significant performance gains.
  • Don't over-specify in Signatures: Avoid putting complex instructions in the signature's docstring initially. Let the optimizer discover the best instructions for you.
  • Don't forget the cost: Optimization can be computationally intensive and may incur costs if you're using proprietary models. Start with small datasets and cheaper models to iterate quickly.
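
To make the composition point concrete, here is a rough sketch of a multi-hop RAG module built from simpler pieces; the hop count, search function, and field names are illustrative assumptions rather than a prescribed design:

# A sketch of composing DSPy modules into a simple multi-hop RAG program.
class GenerateQuery(dspy.Signature):
    """Generate a search query that fills gaps in the current context."""

    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    query: str = dspy.OutputField()


class MultiHopRAG(dspy.Module):
    def __init__(self, search_fn, hops: int = 2):
        super().__init__()
        self.generate_query = dspy.ChainOfThought(GenerateQuery)
        self.answer = dspy.ChainOfThought("context, question -> response")
        self.search_fn = search_fn
        self.hops = hops

    def forward(self, question: str):
        context: list[str] = []
        for _ in range(self.hops):
            query = self.generate_query(context=context, question=question).query
            context.extend(self.search_fn(query))
        return self.answer(context=context, question=question)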

Key Takeaways

  • Shift from Imperative to Declarative: Stop telling the LLM how to do something with a specific prompt. Instead, declare what you need using a DSPy Signature and let the framework handle the how.
  • Modularity is Key: DSPy's components (Signatures, Modules, Optimizers) allow you to build complex systems from simple, reusable, and independently optimizable parts.
  • Optimization is a First-Class Citizen: DSPy makes performance optimization a core, automated part of the development workflow, moving beyond guesswork and manual tuning.

Wrapping Up

The era of manual prompt engineering is quickly fading. It's simply too slow, fragile, and unreliable for building production-grade AI systems, especially as applications grow in complexity. While our initial focus was on crafting the perfect prompt, the industry is increasingly realizing the paramount importance of context engineering—ensuring the LLM receives the precise, relevant information it needs to perform a task. DSPy is at the forefront of this shift, allowing you to program LLMs rather than just prompt them. It provides a robust, structured, and optimizable framework that inherently facilitates better context engineering. If you're tired of being a prompt whisperer and want to be a software engineer again, focusing on robust system design and effective context utilization, it's time to give DSPy a try.