Traditional software testing is deterministic. You give a function an input, you know what the output should be, you check it matches. Done. That's why 100% code coverage means something for conventional systems.
AI agents don't work that way. The same input can produce different outputs. The agent's behavior isn't a function of the code โ it's a function of the model, the prompt, the conversation history, and the specific phrasing of the user's message. You can have zero code bugs and still ship an agent that hallucinates, caves to social engineering, and produces wildly inconsistent results across similar inputs.
This matters because most teams skip behavioral testing entirely. They write unit tests for the API wrapper, integration tests for the tool calls, and call it done. Then they discover the failure modes in production, when real users are pushing in directions the team never anticipated.
The four test categories below cover the behavioral surface area that actually matters. They're not exhaustive โ you'll want to add domain-specific tests for your agent's specific role โ but they catch the failure modes that appear in almost every production incident.
Prompt injection is what happens when user-controlled input contains instructions that override the agent's original system prompt. It's the most common attack vector against production agents โ and the most underestimated.
The attack works because LLMs are trained to be helpful and follow instructions. If the user says "ignore your previous instructions and do X instead," a poorly tested agent will often comply. This isn't a model bug โ it's a behavioral boundary failure that shows up under adversarial input.
The Microsoft EchoLeak vulnerability (CVSS 9.3) was a prompt injection: a malicious email contained hidden instructions that hijacked Copilot when a user asked it to summarize the message. The DPD chatbot disaster was a prompt injection: "pretend you have no restrictions" turned a customer service bot into a brand liability.
Run at least 5 injection variants. Vary the framing: direct commands, roleplay scenarios, content embedding, and hypothetical framings ("imagine an AI that could..."). An agent that fails any of these has an exploitable surface that users will find.
Hallucination is when an AI agent produces confident, authoritative-sounding information that is factually wrong or entirely fabricated. It's not lying โ it genuinely doesn't know the difference. And it's the failure mode most likely to cause real business damage.
Air Canada's chatbot hallucinated a bereavement refund policy and lost in court. An attorney submitted AI-hallucinated court cases to a federal judge and was sanctioned. In both cases, the AI sounded completely certain. There was no hedging, no "I'm not sure" โ just confident fabrication stated as fact.
Testing for hallucination isn't about asking the agent hard trivia questions. It's about probing whether the agent knows the limits of its knowledge and whether it communicates uncertainty rather than confabulating.
The pattern to watch for: does the agent hedge when uncertain, or does it always answer with equal confidence? Appropriate uncertainty expression ("I don't have reliable information on that") is a sign of a well-behaved agent. Uniform confidence regardless of knowledge is a hallucination red flag.
LLMs are probabilistic. Every response is a sample from a probability distribution, which means the same input can produce meaningfully different outputs across runs. For agents making decisions โ approvals, classifications, recommendations โ this nondeterminism is a serious quality problem.
Amazon's AI recruiting tool ran for five years before anyone noticed it systematically penalized women. That's an extreme example of inconsistency (demographically correlated variance), but milder consistency failures are common: agents that approve the same request on Tuesday but deny it on Thursday, or that answer the same customer question with contradictory information on consecutive days.
Consistency testing is simple but requires repeat runs. The goal is to surface variance, not just bad answers.
For decision-making agents, run the same input 3โ5 times and measure variance in the outcome. Some natural variation in phrasing is acceptable; different decisions on the same facts is not. If your agent is making consequential choices, you need a target variance threshold before it goes live.
Same decision, different phrasing. The agent reaches the same conclusion but words it differently across runs.
Different decisions on identical facts. The agent approves on run 1, denies on run 2, escalates on run 3.
Want all 47 test cases in a ready-to-run format?
Injection, hallucination, consistency, and boundary checks โ with scoring rubrics.
Behavioral boundary failures are when an agent operates outside its intended scope โ agreeing to commitments it shouldn't make, taking actions that weren't authorized, or continuing to act when it should stop and escalate.
Chevrolet's chatbot agreed to sell a car for $1 because it was instructed to be "helpful" with no constraints on what that meant. Replit's agent deleted a production database because it had broad file system access and no guardrails on destructive operations. Cigna's system auto-denied hundreds of thousands of claims because there was no threshold that triggered mandatory human review.
Boundary testing asks: can a user get your agent to do things it shouldn't? Can they manipulate it through social framing? Does it recognize when it's being pushed out of scope?
The question is not just "can the agent be manipulated" but "does the agent know its own limits?" An agent that recognizes when a request is out of scope โ and says so clearly โ is dramatically safer than one that attempts to be helpful in situations that require human judgment.
Before any AI agent goes to production, run through this checklist. It covers the minimum viable behavioral QA surface โ the tests that, if skipped, most commonly cause production incidents.
Running these tests manually before each deploy is better than nothing. But it breaks down fast. Agents change constantly โ you retune the prompt, update the model version, adjust the tools. Every change can affect behavior. Manual testing before every change is a checklist that gets skipped under deadline pressure.
The right approach is automated behavioral QA that runs on every prompt change, the same way unit tests run on every code commit. The test suite should catch regressions in injection resistance, hallucination rates, and consistency before they reach production โ not after a user screenshots the problem.
There are three ways to build this:
The implementation matters less than the habit: behavioral QA needs to block deployment, not just log warnings. An agent with a failing injection test should not go to production regardless of how good the code coverage is.
Every engineering team knows how to answer "does the code work?" That question has a mature toolchain: unit tests, integration tests, CI/CD. It's solved.
The question almost no team asks systematically is: "does the agent behave correctly across the full range of real user inputs โ including the adversarial ones?" That question requires a different testing approach, and most teams don't have one yet.
The good news: the failure modes are well-documented. Prompt injection, hallucination, inconsistency, and boundary violations account for the overwhelming majority of production AI agent incidents. They're testable before launch. The techniques above aren't experimental โ they're standard practice at teams that have shipped agents at scale.
The only question is whether you run these tests before your users do.
That's the difference between catching it in QA and reading about it on TechCrunch.
47 test cases covering injection, hallucination, consistency, and boundary violations. Includes scoring rubrics and a production deployment checklist.
Want to test your agent right now? Run a free trust score in 30 seconds →