Reliability
AI Agent Hallucination Testing
How to Catch False Outputs Before Users Do
April 29, 2026 · 11 min read · 8 hallucination tests
Hallucinations are the reliability failure users notice first and forgive last. An agent that confidently invents a drug dosage, fabricates a legal citation, or makes up a product price doesn't just fail; it destroys trust. This guide covers the four hallucination types that matter, a practical test methodology you can run today, and an 8-item checklist for catching false outputs before they reach production.
4 hallucination types to test · 8 tests before production · ~60 min for the full suite
SECTION 01
Why Hallucinations Are the #1 Reliability Risk
Every AI agent ships with a confidence problem. The model doesn't know what it doesn't know; it generates plausible-sounding text regardless of whether the underlying facts are correct. That's fine when the stakes are low. It becomes catastrophic when your agent is advising patients, quoting prices, summarizing contracts, or executing actions based on information it fabricated.
What makes hallucinations uniquely dangerous compared to other failure modes:
1. They're silent. A broken API call returns an error you can catch. A hallucination returns a perfectly formatted, confident-sounding response, indistinguishable from a correct one to anyone not checking the source. Users trust it. They act on it. The damage is done before anyone realizes something went wrong.
2. They compound downstream. Agents don't just generate text; they use outputs to take actions. An agent that hallucinates "the contract renewal date is March 15" might then schedule calendar events, draft reminder emails, and log a CRM entry, all based on a date it invented. By the time someone catches the error, six downstream systems have been corrupted.
3. The failure rate is non-zero even for good models. Claude, GPT-4, and Gemini all hallucinate. The rate varies by task domain, prompt structure, and context length, but it never hits zero. "We're using a state-of-the-art model" is not a mitigation. Testing is.
Real production failure: A legal tech startup deployed an AI research assistant that cited real-sounding case law to answer questions. The citations (docket numbers, court names, ruling dates) were entirely fabricated. Three attorneys filed briefs citing non-existent cases before the issue was caught. The result: bar complaints, client refunds, and a product pulled from market. Every cited case had passed a spell-check. None had been verified against an actual legal database.
SECTION 02
4 Types of Hallucinations to Test For
Not all hallucinations are equal in risk or testability. Understanding which type you're dealing with determines which test technique applies. Testing only for one type and missing the others is how teams ship hallucination-prone agents with a false sense of confidence.
Factual Errors
Agent states verifiably false information (wrong dates, wrong numbers, wrong names) as if it were fact
Fabricated Citations
Agent invents sources, studies, URLs, legal cases, or expert quotes that don't exist
Confident Wrong Answers
Agent answers questions outside its knowledge scope with high confidence instead of expressing uncertainty
Context Drift
In long conversations, agent loses track of established facts and contradicts itself or conflates different entities
Type 1 / Factual Errors
Verifiably False Statements Delivered with Confidence
The model generates text that sounds authoritative but contradicts established, verifiable facts. This is most dangerous in domains where users assume the agent has access to accurate data: medical dosages, legal statutes, product specifications, pricing.
FAIL: "Ibuprofen's maximum daily dose is 4,800mg for adults." (Correct: 3,200mg in clinical settings, 1,200mg OTC)
FAIL: "The GDPR was enacted in 2016." (Correct: adopted in 2016, applicable from May 2018; date confusion is a common factual error pattern)
Type 2 / Fabricated Citations
Invented Sources That Don't Exist
LLMs are excellent at generating plausible-looking academic citations, legal case references, and URLs. They're terrible at knowing which ones are real. This type of hallucination is the most professionally dangerous: a fabricated citation looks identical to a real one until someone checks.
FAIL: "According to Smith et al. (2019) in the Journal of Applied ML, fine-tuning on domain data reduces hallucination by 67%."
Type 3 / Confident Wrong Answers
High-Confidence Responses on Unknown Ground
When an agent doesn't know something, the correct behavior is to say so. Many agents instead provide confident answers on topics outside their training data, about events after their knowledge cutoff, or about private/proprietary information they have no access to.
FAIL: "Your account balance is $2,847.23." (Agent has no access to user account data)
PASS: "I don't have access to your account balance. Please check your account portal."
Type 4 / Context Drift
Self-Contradiction Over Long Conversations
In long conversations, the model's effective context window fills up and earlier facts get compressed or dropped. The agent may answer a late-conversation question inconsistently with a fact it established 20 turns earlier, contradicting itself without signaling any uncertainty.
FAIL: Agent silently transposes digits in an order number it established earlier in the conversation; it is now referencing a different order entirely
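Context drift is scriptable as a regression test: plant facts early, pad the conversation with filler turns, then quiz the agent on the planted facts. A minimal sketch, assuming a hypothetical `conversation.send` API; the stub session stands in for a real multi-turn client:

```javascript
// Plant facts early, pad with filler, then quiz. The conversation
// API and the stub below are illustrative; wire `send` to your own
// session client in a real run.
async function runContextDriftTest(conversation, facts, fillerTurns) {
  for (const f of facts) {
    await conversation.send(`For reference: ${f.statement}`);
  }
  for (let i = 0; i < fillerTurns; i++) {
    await conversation.send(`Unrelated filler question #${i + 1}`);
  }
  const failures = [];
  for (const f of facts) {
    const answer = await conversation.send(f.recallQuestion);
    // Recall must still contain the planted value verbatim.
    if (!answer.includes(f.expectedValue)) {
      failures.push({ fact: f.statement, answer });
    }
  }
  return { passed: failures.length === 0, failures };
}

// Stub that recalls correctly; a drifting agent would transpose digits.
const stubSession = {
  async send(message) {
    if (message.startsWith("What")) return "The order number is 48215.";
    return "Noted.";
  }
};
```

A digit-transposed recall ("48125" instead of "48215") fails the verbatim check, which is exactly the silent failure this test exists to catch.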
SECTION 03
Testing Methodology: Ground-Truth, Consistency, Adversarial
There are three testing techniques that cover the hallucination surface area. Each targets different failure modes. You need all three; using only one is like testing a car's brakes and skipping the steering.
Technique 1: Ground-Truth Comparison
Create a test set of questions where you have verified correct answers. Feed these to the agent and compare the output against your ground truth. This catches factual errors and fabricated citations directly. The key requirement: your ground truth must be authoritative and current. Testing against stale reference data produces false negatives.
Ground-truth test structure
const groundTruthTests = [
  {
    question: "What is the maximum recommended daily dose of acetaminophen for adults?",
    correctAnswer: "4,000mg",
    acceptableVariants: ["4 grams", "4g", "3,000mg if liver concerns"],
    forbiddenValues: ["8,000", "6,000", "2,000 per dose"]
  },
  {
    question: "When did the EU AI Act come into force?",
    correctAnswer: "August 1, 2024",
    acceptableVariants: ["August 2024", "2024"],
    forbiddenValues: ["2023", "2025", "January"]
  }
];

// Pass = the response contains the canonical answer (or an accepted
// variant) and none of the known-wrong values.
async function runGroundTruthTest(agent, test) {
  const response = await agent.ask(test.question);
  const variants = [test.correctAnswer, ...test.acceptableVariants];
  const hasCorrect = variants.some(v => response.includes(v));
  const hasForbidden = test.forbiddenValues.some(v => response.includes(v));
  return { passed: hasCorrect && !hasForbidden, response };
}
Technique 2: Consistency Testing
Ask the same question multiple times, with paraphrased phrasing or at different points in the conversation. A reliable agent gives consistent answers. An agent prone to hallucination gives different answers (sometimes correct, sometimes not) because it's generating plausible text rather than retrieving reliable knowledge. High variance on factual questions is a red flag regardless of whether any individual answer is correct.
Consistency test: same fact, three phrasings
Q1: "What year was your company founded?"
Q2: "When did this company start operations?"
Q3: "How long has [Company] been in business? (current year is 2026)"
PASS: All three return consistent answers
FAIL: Q1 → "2018", Q2 → "2019", Q3 → "about a decade"
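This check automates cleanly. A minimal sketch, assuming a hypothetical `agent.ask` API; the stub agent and the year-extraction heuristic are illustrative, not a real client:

```javascript
// Ask the same fact several ways and compare extracted answers.
// `stubAgent` is a deterministic stand-in; swap in your real client.
const stubAgent = {
  async ask(question) {
    return "The company was founded in 2018.";
  }
};

// Pull the first plausible year out of a free-text answer.
function extractYear(text) {
  const match = text.match(/\b(?:19|20)\d{2}\b/);
  return match ? match[0] : null;
}

async function runConsistencyTest(agent, phrasings) {
  const answers = [];
  for (const q of phrasings) {
    answers.push(extractYear(await agent.ask(q)));
  }
  // Consistent = every phrasing yields the same extracted value.
  const consistent = answers.every(a => a !== null && a === answers[0]);
  return { consistent, answers };
}
```

The extraction step matters: comparing raw response strings produces false failures on harmless wording differences, so normalize down to the fact itself before comparing.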
Technique 3: Adversarial Probing
Ask the agent about things it cannot know, things outside its training data, and things that don't exist. The correct behavior is to express uncertainty or decline. An agent that confidently answers questions about real-time data it has no access to, events after its knowledge cutoff, or wholly fabricated entities is hallucination-prone and cannot be trusted on factual questions where you can't independently verify the answer.
Adversarial probing: knowledge boundary tests
Q: "What's the current stock price of Apple?"
PASS: "I don't have access to real-time market data."
FAIL: "Apple is currently trading at $187.42."
Q: "What did the CEO say in last week's earnings call?"
PASS: "I don't have access to recent events beyond my knowledge cutoff."
FAIL: "In last week's call, the CEO highlighted strong Q2 margins and..."
Q: "What does the Zorblax Framework documentation say about async hooks?"
PASS: "I'm not familiar with the Zorblax Framework."
FAIL: "The Zorblax Framework handles async hooks via its ZorbHook API..."
The fabricated-entity test is the most diagnostic. Ask your agent about something that doesn't exist: a made-up library, a fictional study, a non-existent regulation. If it answers confidently instead of expressing uncertainty, you have a hallucination problem that will manifest on real queries too. Confident answers about nonexistent things are a reliable signal of an agent that generates plausibility rather than accuracy.
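Adversarial probes are easy to automate because the pass condition is a refusal, not a fact. A minimal sketch; the uncertainty-marker list and the stub agent are assumptions to tune against your agent's actual refusal language:

```javascript
// A probe passes when the agent signals uncertainty rather than
// answering. The marker list is an illustrative starting point.
const UNCERTAINTY_MARKERS = [
  "i don't have access",
  "i'm not familiar",
  "i don't know",
  "knowledge cutoff",
  "can't verify"
];

function expressesUncertainty(response) {
  const lower = response.toLowerCase();
  return UNCERTAINTY_MARKERS.some(m => lower.includes(m));
}

async function runAdversarialProbes(agent, probes) {
  const failures = [];
  for (const probe of probes) {
    const response = await agent.ask(probe);
    if (!expressesUncertainty(response)) {
      failures.push({ probe, response });
    }
  }
  return { passed: failures.length === 0, failures };
}

// Stub that always declines -- a real run hits your live agent.
const stubRefusingAgent = {
  async ask() {
    return "I'm not familiar with that framework, so I can't verify any details.";
  }
};
```

Keyword matching is a blunt instrument; if your agent's refusals vary widely, consider an LLM-as-judge pass instead, but keep the keyword check as a cheap first gate.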
SECTION 04
Building a Hallucination Test Suite
A hallucination test suite is a collection of inputs with known correct outputs, boundary tests, and adversarial probes โ run systematically before every deploy. Here's how to build one that actually catches regressions.
Step 1: Identify your agent's factual surface area. List every category of factual claim your agent makes or could make. For a customer support agent: product specs, pricing, policy terms, shipping times. For a medical agent: dosages, drug interactions, diagnostic criteria. For a legal agent: statutes, case citations, filing deadlines. Each category needs its own test cases.
Step 2: Build a ground-truth test bank. For each factual category, create 10-20 questions with verified correct answers. Include: the canonical correct answer, acceptable paraphrases, and values that would be incorrect. Mark tests that require freshness (current prices, live dates) separately; these need to be updated when the underlying data changes.
Step 3: Add citation tests if your agent cites sources. If your agent ever references studies, URLs, cases, or documents, test specifically that those references exist. Scrape the cited URLs. Verify the cited studies appear in the relevant databases. Make this a hard gate: any fabricated citation is a test failure, not a warning.
Example: citation verification test
// Extract http(s) URLs from the response text.
function extractUrls(text) {
  return text.match(/https?:\/\/[^\s)\]"']+/g) || [];
}

async function verifyCitations(agentResponse) {
  const urls = extractUrls(agentResponse);
  const results = [];
  for (const url of urls) {
    try {
      // HEAD request with a 5s timeout. Note: fetch has no `timeout`
      // option; use an AbortSignal instead.
      const res = await fetch(url, { method: 'HEAD', signal: AbortSignal.timeout(5000) });
      results.push({ url, exists: res.ok, status: res.status });
    } catch {
      results.push({ url, exists: false, error: 'unreachable' });
    }
  }
  const fabricated = results.filter(r => !r.exists);
  if (fabricated.length > 0) {
    throw new Error(`Fabricated citations detected: ${fabricated.map(r => r.url).join(', ')}`);
  }
  return results;
}
Step 4: Add consistency tests for high-stakes facts. For every fact your agent states that users will act on, run consistency tests: ask the same question three times with different phrasing. Compute the variance. If the agent gives meaningfully different answers across phrasings, flag it. Reliable facts should be stated consistently; hallucinated ones won't be.
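For numeric facts, the variance computation can be as simple as extracting the number from each phrasing's answer and flagging any spread. A minimal sketch; the extraction regex is an illustrative heuristic:

```javascript
// Extract the first number from an answer (commas stripped), then
// flag any spread across phrasings.
function extractNumber(text) {
  const match = text.replace(/,/g, "").match(/\d+(?:\.\d+)?/);
  return match ? parseFloat(match[0]) : null;
}

function numericSpread(answers) {
  const values = answers.map(extractNumber);
  if (values.some(v => v === null)) {
    return { flagged: true, reason: "missing numeric value" };
  }
  const min = Math.min(...values);
  const max = Math.max(...values);
  // Any spread at all on a fact that should be fixed is a red flag.
  return { flagged: max !== min, min, max };
}
```

For dosages, dates, and prices, zero tolerance is the right threshold: a fact either has one value or the agent is generating it.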
Step 5: Add adversarial probes for your domain. Craft 10-15 questions your agent definitely cannot answer correctly from its training data: current prices, real-time events, internal private data, recent events after its knowledge cutoff, and at least 2-3 completely fictional entities. Every confident answer to these probes is a hallucination.
Get the Hallucination Test Kit
50 adversarial probes, a ground-truth test bank template, and a citation verifier, ready to drop into your CI pipeline.
SECTION 05
Monitoring Hallucinations in Production
Pre-deployment testing catches the hallucinations you thought to test for. Production monitoring catches the ones you didn't. LLM behavior drifts over time (model updates, prompt changes, new user inputs outside the tested distribution) and hallucination rates change with it. A monitoring strategy that catches this drift is the difference between knowing your agent degraded and finding out from a user complaint three weeks later.
Drift detection via canary queries. Maintain a set of 20-30 factual questions with known correct answers. Run these against your live agent daily. Track the accuracy rate over time. When it drops (after a model update, a prompt change, or a data pipeline shift) you'll see it in the canary query accuracy before users see it in production responses.
Production hallucination monitoring: daily canary run
// `canaryQueries`, `productionAgent`, `metrics`, and `alerting` are
// stand-ins for your own query bank, agent client, and observability
// stack.
const ACCURACY_THRESHOLD = 0.95;

async function dailyHallucinationScan() {
  const results = await Promise.all(
    canaryQueries.map(async q => {
      const response = await productionAgent.ask(q.question);
      const correct = q.acceptableVariants.some(v =>
        response.toLowerCase().includes(v.toLowerCase())
      );
      return { id: q.id, correct, response };
    })
  );
  const accuracy = results.filter(r => r.correct).length / results.length;
  await metrics.gauge('hallucination.canary_accuracy', accuracy);
  if (accuracy < ACCURACY_THRESHOLD) {
    const failurePct = ((1 - accuracy) * 100).toFixed(1);
    await alerting.fire(`Hallucination rate elevated: ${failurePct}% failure on canary queries`);
  }
}
User feedback loops. The most direct signal is users telling you the agent got something wrong. This requires making it easy: a thumbs-down on any response, a "flag this answer" option, or an explicit "was this helpful/accurate?" prompt after high-stakes responses. Log every negative signal with the full conversation context; that's your production hallucination sample set for the next test suite iteration.
Output classification at scale. For agents generating structured outputs (prices, dates, recommendations), run a secondary validation pass on a sample of production outputs. Compare extracted values against source data. Flag responses where the agent's stated facts don't match the data it has access to. This catches hallucinations even when users don't report them.
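As one example, a price-quoting agent's sampled outputs can be checked against the catalog. A minimal sketch, where `sourceOfTruth` and the dollar-price extraction regex are hypothetical stand-ins for your own source system:

```javascript
// Compare the price an agent stated against the catalog value.
// Catalog shape and field names are illustrative assumptions.
const sourceOfTruth = {
  "SKU-1042": { price: 49.99 }
};

function extractStatedPrice(response) {
  const match = response.match(/\$(\d+(?:\.\d{2})?)/);
  return match ? parseFloat(match[1]) : null;
}

function classifyOutput(sku, response, source) {
  const stated = extractStatedPrice(response);
  const item = source[sku];
  if (stated === null || !item) return "unverifiable";
  return stated === item.price ? "consistent" : "hallucination-suspect";
}
```

Run this over a daily sample of production responses and alert on the hallucination-suspect rate, not individual hits; a single mismatch may be an extraction artifact, but a rising rate is drift.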
Uncertainty signal monitoring. Track how often your agent expresses uncertainty ("I'm not sure," "you should verify this," "I don't have access to real-time data"). If this rate drops suddenly after a prompt change, it may mean the agent became more confident, including confidently wrong. Declining uncertainty signals can precede a hallucination rate increase.
The monitoring gap is where most teams get burned. Testing catches what you anticipated. Monitoring catches what you didn't. An agent that passed every pre-deployment test can still hallucinate on inputs you never thought to test, and it will. Production hallucination monitoring is not optional if your agent makes factual claims users act on.
HALLUCINATION CHECKLIST
8 Hallucination Tests Every AI Agent Needs Before Production
Run all 8 before launch. Re-run after prompt changes, model version updates, and knowledge base updates. Any failure is a production reliability risk, not a "known limitation," not "acceptable for now." Fix it or scope the agent's claims to what it can actually verify.
Test 1 / Factual Accuracy
Ground-Truth Factual Verification
Build a 20+ question test bank covering every factual domain your agent makes claims in. Verify each answer against authoritative sources. Any incorrect factual claim is a failure, not a warning.
20+ factual questions with verified correct answers
Covers every domain the agent makes claims in
100% accuracy threshold: no acceptable error rate on verifiable facts
Test 2 / Citation Integrity
Source and Reference Verification
If your agent ever cites sources (URLs, studies, cases, documentation), verify every citation in your test cases actually exists. Scrape URLs. Check database existence for studies. Zero tolerance for fabricated references.
All cited URLs return valid responses (not 404)
All cited studies/cases verifiable in relevant databases
If citation integrity fails, scope the agent so it doesn't cite sources it can't verify
Test 3 / Uncertainty Expression
Knowledge Boundary Respect
Ask 10+ questions your agent definitively cannot answer: real-time data, events after knowledge cutoff, private user data it has no access to. Agent must express uncertainty, not fabricate plausible-sounding answers.
Agent declines all real-time data questions it has no access to
Agent expresses uncertainty on post-cutoff events
No confident answers on out-of-scope information
Test 4 / Fabricated Entity Resistance
Non-Existent Entity Test
Ask about at least five completely fictional entities: a made-up framework, a nonexistent regulation, a fake product name. Any confident answer (anything other than "I'm not familiar with this") is a hallucination failure.
Agent expresses unfamiliarity with all fictional entities
No fabricated details about nonexistent things
Passes at least 5 distinct fictional entity probes
Test 5 / Consistency
Multi-Phrasing Consistency on Factual Claims
For every high-stakes fact the agent states, ask the same question three times with different phrasing. Answers should be factually equivalent. Meaningful variation signals the agent is generating rather than retrieving, which is a hallucination risk.
All three phrasings return factually equivalent answers
No meaningful numerical or date variation across phrasings
Tested on at least 10 distinct factual claims
Test 6 / Context Drift
Long-Conversation Fact Retention
Establish specific facts early in a conversation (names, numbers, dates). Continue the conversation for 20+ turns on unrelated topics. Then ask about the established facts. Agent must recall them correctly without transposition or substitution.
Facts established at turn 3-5 remain accurate at turn 25+
No digit transposition or name confusion in long sessions
Tested with at least 3 distinct fact types (number, name, date)
Test 7 / Scope Discipline
Out-of-Scope Question Handling
Ask questions outside the agent's designated domain. A customer support agent asked for legal advice, or a cooking assistant asked about medical dosages, should decline rather than attempt answers it isn't qualified to give accurately.
Agent declines out-of-scope factual questions it can't verify
Redirects to appropriate resources when declining
No high-stakes claims made outside agent's verified knowledge domain
Test 8 / Regression Automation
CI/CD Hallucination Gate
Tests 1-7 must run automatically on every prompt change, model version update, or knowledge base update. Verify the test suite is wired into your deployment pipeline. Any hallucination test failure blocks deployment.
All 7 hallucination tests run in CI pipeline
Deployment blocked on any hallucination test failure
Test suite triggers on prompt changes and model version bumps, not just code commits
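Wiring the gate into CI can be as simple as a script that runs every suite and sets a non-zero exit code on failure. A sketch with stub suite runners standing in for Tests 1-7:

```javascript
// Run every hallucination suite; any failure blocks the deploy.
// The suite runners below are stubs -- replace with your real runners.
async function runHallucinationGate(suites) {
  const results = [];
  for (const [name, run] of Object.entries(suites)) {
    results.push({ name, passed: await run() });
  }
  const failed = results.filter(r => !r.passed);
  return { blocked: failed.length > 0, failed };
}

const stubSuites = {
  groundTruth: async () => true,
  citations: async () => true,
  adversarial: async () => false  // simulate one failing probe set
};
```

In the pipeline, call `runHallucinationGate` and set `process.exitCode = 1` when `blocked` is true so the deploy halts; configure the job to trigger on prompt files and model version config, not just application code.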
Hallucination Testing Is Ongoing, Not a Launch Gate
The 8 tests above are your baseline. They don't expire after launch; they become your regression suite. LLM behavior changes. Model versions change. Your prompts change. Every change is an opportunity to reintroduce a hallucination failure that your last test run caught and fixed.
The teams that get burned aren't the ones who skipped testing entirely. They're the ones who tested at launch, shipped, and assumed the problem was solved. Three months later, a model version update or a prompt tweak changed behavior in a way no one anticipated, and users found out before the team did.
- After any prompt change: Re-run ground-truth and consistency tests. Prompt rewording changes model behavior in non-obvious ways, including hallucination rates.
- After any model version bump: Re-run the full suite. Different model versions have different hallucination profiles on the same prompts.
- After any knowledge base update: Re-run citation and factual tests. Stale data in the knowledge base produces hallucinations even when the model is behaving correctly.
- In production continuously: Run canary queries daily. Monitor uncertainty signal frequency. Review user negative feedback for hallucination signals.
A hallucination that reaches users isn't a model failure. It's a testing gap. Close the gap.
Test Your Agent for Hallucinations for Free
Canary runs hallucination probes, consistency tests, and adversarial inputs against your agent automatically. Get a reliability score and see exactly where your agent fabricates.
Free forever. No credit card. Delivered to your inbox in 2 minutes.