# How to Test Your LLM Prompts
With pytest, no API calls, no mocking.
## The Problem
You have a prompt that tells an LLM to return JSON. Your code parses that JSON. Someone edits the prompt — renames a field, changes a value from string to int, removes a key. Your code breaks at runtime. In production.
This happens constantly. Prompts live in markdown files. Code lives in Python. Nothing connects them. The prompt says "score" but the code reads "rating". The prompt shows "status": "done" but the code checks for "ready". The prompt has 5 criteria fields but the code expects 6.
The gap between what your prompt tells the LLM to produce and what your code expects to receive is a contract. If nobody tests that contract, it will break.
## The Fix: Prompt Contract Tests
Put JSON examples in your prompts. Extract them in tests. Validate the structure against what your code actually reads.
No API calls. No mocking. No LLM involved. Just regex + JSON parsing + assertions. Tests run in milliseconds.
## Minimal Example
Say you have a prompt file that tells an LLM to analyze customer feedback:
````markdown
<!-- prompts/analyze_feedback.md -->

# Analyze Customer Feedback

Review the customer feedback and classify it.

## Output Format

Return JSON only:

```json
{
  "status": "ready",
  "sentiment": "positive",
  "topics": ["pricing", "support"],
  "urgency": 3,
  "suggestions": {
    "summary": "Customer is happy with support but concerned about pricing.",
    "follow_up": true
  }
}
```

If you need more context:

```json
{
  "status": "needs_info",
  "questions": ["What product is the customer using?", "When did the issue start?"],
  "suggestions": {}
}
```
````
And your code does something like:
```python
# app/feedback.py

def process_feedback(llm_response: dict):
    status = llm_response["status"]  # must exist
    if status == "ready":
        sentiment = llm_response["sentiment"]             # must be string
        topics = llm_response["topics"]                   # must be list
        urgency = llm_response["urgency"]                 # must be int 1-5
        summary = llm_response["suggestions"]["summary"]  # must exist
    elif status == "needs_info":
        questions = llm_response["questions"]             # must be non-empty list
```
Here is the test that connects them:
```python
# tests/test_prompt_contracts.py
import json
import re
from pathlib import Path

PROMPTS_DIR = Path(__file__).resolve().parent.parent / "prompts"


def _read_prompt(filename: str) -> str:
    path = PROMPTS_DIR / filename
    assert path.exists(), f"Missing: {path}"
    return path.read_text(encoding="utf-8")


def _extract_json_blocks(text: str) -> list[dict]:
    """Pull every ```json ... ``` block from a markdown file."""
    pattern = r"```json\s*\n(.*?)```"
    matches = re.findall(pattern, text, re.DOTALL)
    results = []
    for match in matches:
        results.append(json.loads(match.strip()))
    return results


# --- Does the prompt have parseable JSON at all? ---

def test_prompt_has_output_format():
    text = _read_prompt("analyze_feedback.md")
    assert "## Output Format" in text


def test_json_blocks_are_valid():
    text = _read_prompt("analyze_feedback.md")
    examples = _extract_json_blocks(text)
    assert len(examples) >= 2, "Need at least a 'ready' and 'needs_info' example"


# --- Do the examples match what the code reads? ---

def test_status_values_are_valid():
    text = _read_prompt("analyze_feedback.md")
    for example in _extract_json_blocks(text):
        assert example["status"] in {"ready", "needs_info"}, (
            f"Code checks for 'ready' or 'needs_info', got '{example['status']}'"
        )


def test_ready_example_has_required_fields():
    text = _read_prompt("analyze_feedback.md")
    ready = [e for e in _extract_json_blocks(text) if e["status"] == "ready"]
    assert len(ready) > 0, "No 'ready' example in prompt"
    for example in ready:
        # These are the exact fields process_feedback() reads
        assert "sentiment" in example
        assert isinstance(example["sentiment"], str)
        assert "topics" in example
        assert isinstance(example["topics"], list)
        assert "urgency" in example
        assert isinstance(example["urgency"], int)
        assert 1 <= example["urgency"] <= 5
        assert "summary" in example["suggestions"]


def test_needs_info_has_questions():
    text = _read_prompt("analyze_feedback.md")
    needs_info = [e for e in _extract_json_blocks(text) if e["status"] == "needs_info"]
    assert len(needs_info) > 0, "No 'needs_info' example in prompt"
    for example in needs_info:
        assert "questions" in example
        assert isinstance(example["questions"], list)
        assert len(example["questions"]) > 0, "Empty questions list"
```
Run it:
```
test_prompt_has_output_format PASSED
test_json_blocks_are_valid PASSED
test_status_values_are_valid PASSED
test_ready_example_has_required_fields PASSED
test_needs_info_has_questions PASSED
```
## What This Catches
Now imagine someone edits the prompt and renames "urgency" to "priority". Or changes "status": "ready" to "status": "complete". Or removes the "summary" field from suggestions.
The tests fail immediately. Before anyone deploys. Before the LLM ever sees the new prompt. Before a customer hits the bug.
Real examples of prompt drift this pattern catches:
| What changed in the prompt | What breaks in code | Test that catches it |
|---|---|---|
| Renamed `urgency` → `priority` | `KeyError: 'urgency'` | `test_ready_example_has_required_fields` |
| Changed `"ready"` → `"complete"` | `if status == "ready"` never matches | `test_status_values_are_valid` |
| Removed `suggestions.summary` | `KeyError: 'summary'` | `test_ready_example_has_required_fields` |
| Changed `urgency` from int to string | Code does `urgency > 3` on a string | `test_ready_example_has_required_fields` |
| Deleted the JSON example entirely | LLM returns unpredictable format | `test_json_blocks_are_valid` |
## Scaling It
When you have multiple prompts, parametrize:
```python
import pytest

PROMPTS = {
    "analyze_feedback.md": "feedback",
    "classify_ticket.md": "ticket",
    "summarize_report.md": "report",
}


@pytest.mark.parametrize("filename", PROMPTS.keys())
def test_has_output_format(filename):
    text = _read_prompt(filename)
    assert "## Output Format" in text


@pytest.mark.parametrize("filename", PROMPTS.keys())
def test_json_is_valid(filename):
    text = _read_prompt(filename)
    for block in _extract_json_blocks(text):
        assert "status" in block  # every prompt must return a status
```
Then add per-prompt test classes for the fields that are specific to each prompt:
```python
class TestFeedbackContract:
    def test_has_sentiment(self):
        ...


class TestTicketContract:
    def test_has_priority_level(self):
        ...
```
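A fleshed-out version of one of those classes might look like this (a sketch reusing the `_read_prompt` and `_extract_json_blocks` helpers from above; the assertions mirror the fields `process_feedback()` reads):

```python
class TestFeedbackContract:
    """Contract tests specific to analyze_feedback.md."""

    def _examples(self) -> list[dict]:
        return _extract_json_blocks(_read_prompt("analyze_feedback.md"))

    def test_has_sentiment(self):
        ready = [e for e in self._examples() if e["status"] == "ready"]
        assert ready, "No 'ready' example in prompt"
        for example in ready:
            assert "sentiment" in example
            assert isinstance(example["sentiment"], str)

    def test_urgency_is_in_range(self):
        for example in self._examples():
            if example["status"] == "ready":
                assert isinstance(example["urgency"], int)
                assert 1 <= example["urgency"] <= 5
```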
## The Principle
Your prompt's JSON examples are the specification. Your code's field reads are the implementation. The test connects them.
```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  prompt.md  │────▶│   test.py    │◀────│   app.py    │
│ (JSON spec) │     │  (contract)  │     │ (reads JSON)│
└─────────────┘     └──────────────┘     └─────────────┘
```
If the prompt changes, the test fails. If the code changes what it reads, you update the test, which forces you to check the prompt. The contract stays in sync.
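For example, suppose `process_feedback()` starts reading a new field, say `llm_response["language"]` (a hypothetical field, purely for illustration). The code change forces a new contract test, which stays red until the prompt's `ready` example actually shows the field:

```python
def test_ready_example_includes_language():
    # Mirrors the (hypothetical) new read in process_feedback():
    #     language = llm_response["language"]
    text = _read_prompt("analyze_feedback.md")
    for example in _extract_json_blocks(text):
        if example["status"] == "ready":
            assert "language" in example
            assert isinstance(example["language"], str)
```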
## What This Does NOT Test
This pattern validates structure, not quality. It answers:
- Does the prompt show the LLM what fields to return? Yes/no.
- Do those fields match what the code parses? Yes/no.
- Are the types correct (string, int, list)? Yes/no.
- Are enum values valid ("ready" not "complete")? Yes/no.
It does not answer:
- Does the LLM actually follow the prompt well? (Need integration tests for that.)
- Is the prompt well-written? (Need humans for that.)
- Does the LLM hallucinate field values? (Need output validation in code for that.)
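For that last point, the guard belongs in application code rather than in the test suite. A minimal sketch of what such runtime validation might look like, reusing the field rules from `process_feedback()` above (`validate_feedback` is a hypothetical helper, not part of the contract tests):

```python
def validate_feedback(llm_response: dict) -> None:
    """Reject malformed LLM output before process_feedback() touches it."""
    status = llm_response.get("status")
    if status not in {"ready", "needs_info"}:
        raise ValueError(f"Unexpected status: {status!r}")
    if status == "ready":
        urgency = llm_response.get("urgency")
        if not isinstance(urgency, int) or not 1 <= urgency <= 5:
            raise ValueError(f"urgency must be an int between 1 and 5, got {urgency!r}")
        if "summary" not in llm_response.get("suggestions", {}):
            raise ValueError("suggestions.summary is missing")
    elif status == "needs_info" and not llm_response.get("questions"):
        raise ValueError("questions must be a non-empty list")
```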
But structural drift is the #1 cause of prompt-related production bugs, and this catches it with zero cost and zero latency. Run it in CI, run it on every commit, run it before every deploy.
## Try It
- Add `## Output Format` with JSON examples to your prompt files
- Write `_extract_json_blocks()` (15 lines, copy it from above)
- Assert that the fields your code reads exist in the examples
- Run in CI
That is the whole pattern. No framework, no new library, no extra dependency: just regex, the standard-library `json` module, and the pytest you already run.
This pattern is used in production at LobOut across all refinement prompts — team verification, project briefs, pitch review, and scoring. 34 contract tests run on every commit, catching prompt drift before it reaches the evaluation engine.