How to Test Your LLM Prompts

With pytest, no API calls, no mocking.


The Problem

You have a prompt that tells an LLM to return JSON. Your code parses that JSON. Someone edits the prompt — renames a field, changes a value from string to int, removes a key. Your code breaks at runtime. In production.

This happens constantly. Prompts live in markdown files. Code lives in Python. Nothing connects them. The prompt says "score" but the code reads "rating". The prompt shows "status": "done" but the code checks for "ready". The prompt has 5 criteria fields but the code expects 6.

The gap between what your prompt tells the LLM to produce and what your code expects to receive is a contract. If nobody tests that contract, it will break.

The Fix: Prompt Contract Tests

Put JSON examples in your prompts. Extract them in tests. Validate the structure against what your code actually reads.

No API calls. No mocking. No LLM involved. Just regex + JSON parsing + assertions. Tests run in milliseconds.

Minimal Example

Say you have a prompt file that tells an LLM to analyze customer feedback:

<!-- prompts/analyze_feedback.md -->

# Analyze Customer Feedback

Review the customer feedback and classify it.

## Output Format

Return JSON only:

```json
{
  "status": "ready",
  "sentiment": "positive",
  "topics": ["pricing", "support"],
  "urgency": 3,
  "suggestions": {
    "summary": "Customer is happy with support but concerned about pricing.",
    "follow_up": true
  }
}
```

If you need more context:

```json
{
  "status": "needs_info",
  "questions": ["What product is the customer using?", "When did the issue start?"],
  "suggestions": {}
}
```

And your code does something like:

# app/feedback.py

def process_feedback(llm_response: dict):
    status = llm_response["status"]         # must exist

    if status == "ready":
        sentiment = llm_response["sentiment"]   # must be string
        topics = llm_response["topics"]         # must be list
        urgency = llm_response["urgency"]       # must be int 1-5
        summary = llm_response["suggestions"]["summary"]  # must exist

    elif status == "needs_info":
        questions = llm_response["questions"]   # must be non-empty list

Here is the test that connects them:

# tests/test_prompt_contracts.py

import json
import re
from pathlib import Path

PROMPTS_DIR = Path(__file__).resolve().parent.parent / "prompts"


def _read_prompt(filename: str) -> str:
    path = PROMPTS_DIR / filename
    assert path.exists(), f"Missing: {path}"
    return path.read_text(encoding="utf-8")


def _extract_json_blocks(text: str) -> list[dict]:
    """Pull every ```json ... ``` block from a markdown file."""
    pattern = r"```json\s*\n(.*?)```"
    matches = re.findall(pattern, text, re.DOTALL)
    results = []
    for match in matches:
        results.append(json.loads(match.strip()))
    return results


# --- Does the prompt have parseable JSON at all? ---

def test_prompt_has_output_format():
    text = _read_prompt("analyze_feedback.md")
    assert "## Output Format" in text

def test_json_blocks_are_valid():
    text = _read_prompt("analyze_feedback.md")
    examples = _extract_json_blocks(text)
    assert len(examples) >= 2, "Need at least a 'ready' and 'needs_info' example"

# --- Do the examples match what the code reads? ---

def test_status_values_are_valid():
    text = _read_prompt("analyze_feedback.md")
    for example in _extract_json_blocks(text):
        assert example["status"] in {"ready", "needs_info"}, (
            f"Code checks for 'ready' or 'needs_info', got '{example['status']}'"
        )

def test_ready_example_has_required_fields():
    text = _read_prompt("analyze_feedback.md")
    ready = [e for e in _extract_json_blocks(text) if e["status"] == "ready"]
    assert len(ready) > 0, "No 'ready' example in prompt"

    for example in ready:
        # These are the exact fields process_feedback() reads
        assert "sentiment" in example
        assert isinstance(example["sentiment"], str)
        assert "topics" in example
        assert isinstance(example["topics"], list)
        assert "urgency" in example
        assert isinstance(example["urgency"], int)
        assert 1 <= example["urgency"] <= 5
        assert "summary" in example["suggestions"]

def test_needs_info_has_questions():
    text = _read_prompt("analyze_feedback.md")
    needs_info = [e for e in _extract_json_blocks(text) if e["status"] == "needs_info"]
    assert len(needs_info) > 0, "No 'needs_info' example in prompt"

    for example in needs_info:
        assert "questions" in example
        assert isinstance(example["questions"], list)
        assert len(example["questions"]) > 0, "Empty questions list"

Run it:

pytest tests/test_prompt_contracts.py -v
test_prompt_has_output_format       PASSED
test_json_blocks_are_valid          PASSED
test_status_values_are_valid        PASSED
test_ready_example_has_required_fields  PASSED
test_needs_info_has_questions       PASSED

What This Catches

Now imagine someone edits the prompt and renames "urgency" to "priority". Or changes "status": "ready" to "status": "complete". Or removes the "summary" field from suggestions.

The tests fail immediately. Before anyone deploys. Before the LLM ever sees the new prompt. Before a customer hits the bug.

Real examples of prompt drift this pattern catches:

| What changed in the prompt | What breaks in code | Test that catches it |
| --- | --- | --- |
| Renamed urgency → priority | KeyError: 'urgency' | test_ready_example_has_required_fields |
| Changed "ready" → "complete" | if status == "ready" never matches | test_status_values_are_valid |
| Removed suggestions.summary | KeyError: 'summary' | test_ready_example_has_required_fields |
| Changed urgency from int to string | Code does urgency > 3 on a string | test_ready_example_has_required_fields |
| Deleted the JSON example entirely | LLM returns unpredictable format | test_json_blocks_are_valid |

Scaling It

When you have multiple prompts, parametrize:

import pytest

PROMPTS = {
    "analyze_feedback.md": "feedback",
    "classify_ticket.md": "ticket",
    "summarize_report.md": "report",
}

@pytest.mark.parametrize("filename", PROMPTS.keys())
def test_has_output_format(filename):
    text = _read_prompt(filename)
    assert "## Output Format" in text

@pytest.mark.parametrize("filename", PROMPTS.keys())
def test_json_is_valid(filename):
    text = _read_prompt(filename)
    for block in _extract_json_blocks(text):
        assert "status" in block  # every prompt must return status

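If maintaining the PROMPTS dict by hand bothers you, you can also parametrize over the directory itself. A sketch, assuming every .md file under prompts/ follows the shared contract (test_every_prompt_has_valid_json is a new name, not from the tests above):

ALL_PROMPTS = sorted(PROMPTS_DIR.glob("*.md"))

@pytest.mark.parametrize("path", ALL_PROMPTS, ids=lambda p: p.name)
def test_every_prompt_has_valid_json(path):
    # _extract_json_blocks() raises if any block is malformed JSON
    examples = _extract_json_blocks(path.read_text(encoding="utf-8"))
    assert examples, f"{path.name} has no JSON example blocks"
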
Then add per-prompt test classes for the fields that are specific to each prompt:

class TestFeedbackContract:
    def test_has_sentiment(self):
        ...

class TestTicketContract:
    def test_has_priority_level(self):
        ...
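For example, the feedback class can reuse the helpers from the minimal example and pin down the fields that only analyze_feedback.md promises. A sketch; the assertions mirror test_ready_example_has_required_fields above:

class TestFeedbackContract:
    """Fields specific to analyze_feedback.md."""

    def _ready_examples(self):
        text = _read_prompt("analyze_feedback.md")
        return [e for e in _extract_json_blocks(text) if e["status"] == "ready"]

    def test_has_sentiment(self):
        for example in self._ready_examples():
            assert isinstance(example["sentiment"], str)

    def test_urgency_is_in_range(self):
        for example in self._ready_examples():
            assert isinstance(example["urgency"], int)
            assert 1 <= example["urgency"] <= 5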

The Principle

Your prompt's JSON examples are the specification. Your code's field reads are the implementation. The test connects them.

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│ prompt.md   │────▶│  test.py     │◀────│  app.py     │
│ (JSON spec) │     │ (contract)   │     │ (reads JSON)│
└─────────────┘     └──────────────┘     └─────────────┘

If the prompt changes, the test fails. If the code changes what it reads, you update the test, which forces you to check the prompt. The contract stays in sync.

What This Does NOT Test

This pattern validates structure, not quality. It answers:

  • Does the prompt show the LLM what fields to return? Yes/no.
  • Do those fields match what the code parses? Yes/no.
  • Are the types correct (string, int, list)? Yes/no.
  • Are enum values valid ("ready" not "complete")? Yes/no.

It does not answer:

  • Does the LLM actually follow the prompt well? (Need integration tests for that.)
  • Is the prompt well-written? (Need humans for that.)
  • Does the LLM hallucinate field values? (Need output validation in code for that; a sketch follows below.)
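
For that last point, the same expectations can be enforced at runtime on the real model output. A minimal sketch (validate_ready_response and its error messages are mine, not part of the article's code):

# app/validation.py (hypothetical runtime guard mirroring the contract tests)

def validate_ready_response(llm_response: dict) -> None:
    # Reject a "ready" response whose fields don't match what process_feedback() reads
    if not isinstance(llm_response.get("sentiment"), str):
        raise ValueError("sentiment must be a string")
    if not isinstance(llm_response.get("topics"), list):
        raise ValueError("topics must be a list")
    urgency = llm_response.get("urgency")
    if not isinstance(urgency, int) or not 1 <= urgency <= 5:
        raise ValueError("urgency must be an int from 1 to 5")
    if "summary" not in llm_response.get("suggestions", {}):
        raise ValueError("suggestions.summary is required")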

But structural drift is the #1 cause of prompt-related production bugs, and this catches it with zero cost and zero latency. Run it in CI, run it on every commit, run it before every deploy.

Try It

  1. Add ## Output Format with JSON examples to your prompt files
  2. Write _extract_json_blocks() (about ten lines; copy it from above)
  3. Assert that the fields your code reads exist in the examples
  4. Run in CI

That is the whole pattern. No framework, no library, no dependency. Just regex, JSON, and pytest.


This pattern is used in production at LobOut across all refinement prompts — team verification, project briefs, pitch review, and scoring. 34 contract tests run on every commit, catching prompt drift before it reaches the evaluation engine.