Evaluation template
AI Product Behavior Audit
A template for systematically reviewing how an AI product behaves in production — what's working, what's broken, and what needs attention.
A behavior audit is a periodic, structured review of how an AI product is actually behaving in production. It’s broader than an evaluation run on a fixed test set: an audit looks at real conversations, real complaints, real refusals, and asks whether the behavior the team specified is the behavior users are getting.
Run an audit on a regular cadence (monthly or quarterly) and after any major change — new model, new prompt, new feature, new user segment.
Part 1: Audit metadata
- Product / feature:
- Audit period: [date range covered]
- Date completed:
- Auditor(s):
- Sample size: [conversations / responses reviewed]
- Sampling method: [random, stratified by intent, weighted to refusals, etc.]
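If the sampling method is stratified, the draw can be sketched as below. This is a minimal illustration, not a prescribed tool: the function name `stratified_sample`, the `stratum_of` callback, and the shape of a conversation record are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(conversations, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` conversations from each stratum.

    `stratum_of` is a hypothetical callback that maps a conversation
    to its stratum label (for example "refusal" or "escalation").
    """
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    buckets = defaultdict(list)
    for conv in conversations:
        buckets[stratum_of(conv)].append(conv)

    sample = {}
    for name, items in sorted(buckets.items()):
        # A stratum smaller than the quota is taken whole.
        k = min(per_stratum, len(items))
        sample[name] = rng.sample(items, k)
    return sample
```

Recording the seed and the per-stratum quota alongside the sampling method makes it possible to redraw the same sample in a later audit.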
Part 2: Behavior under review
What did the team say the model should do? Pull this directly from the behavior specification so the audit is grounded in the written commitment.
- Behavioral mission:
- In-scope behaviors:
- Out-of-scope behaviors:
- Tone and persona:
- Escalation triggers:
Part 3: What the production data shows
Volume and shape
| Metric | Value | Trend vs. last audit |
|---|---|---|
| Total conversations | | |
| Total turns | | |
| Median conversation length | | |
| Conversations involving an escalation | | |
| Conversations involving a refusal | | |
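The values in this table could be computed from conversation logs along these lines. This is a sketch under assumptions: the record fields `turns`, `escalated`, and `refused` are hypothetical and would need to be mapped onto whatever the logging pipeline actually stores.

```python
from statistics import median


def volume_metrics(conversations):
    """Summarize volume and shape for the audit period.

    Each conversation is assumed to be a dict with a `turns` list and
    optional boolean `escalated` / `refused` flags (hypothetical fields).
    """
    lengths = [len(c["turns"]) for c in conversations]
    return {
        "total_conversations": len(conversations),
        "total_turns": sum(lengths),
        "median_length": median(lengths) if lengths else 0,
        "with_escalation": sum(1 for c in conversations if c.get("escalated")),
        "with_refusal": sum(1 for c in conversations if c.get("refused")),
    }
```

Running the same function over the previous audit period gives the numbers needed for the trend column.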
Top intents
| Intent | % of conversations | Resolution rate |
|---|---|---|
Part 4: Behavior assessment by dimension
For each dimension in the evaluation rubric, score the production sample and record how the scores are distributed.
| Dimension | Mean score | % at full | % below threshold | Change vs. last audit |
|---|---|---|---|---|
| Task completion | | | | |
| Accuracy | | | | |
| Tone | | | | |
| Scope adherence | | | | |
| Escalation appropriateness | | | | |
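The per-dimension columns can be derived from raw rubric scores roughly as follows. The sketch assumes a scale with a maximum of 3 and a pass threshold of 2; both are illustrative defaults to be replaced with the values from the team's own rubric, and `summarize_dimension` is a hypothetical helper name.

```python
from statistics import mean


def summarize_dimension(scores, full_score=3, threshold=2):
    """Return mean score, % at full score, and % below threshold.

    `full_score` and `threshold` are assumed defaults, not values
    taken from any particular rubric.
    """
    if not scores:
        raise ValueError("no scores to summarize")
    n = len(scores)
    at_full = sum(1 for s in scores if s == full_score)
    below = sum(1 for s in scores if s < threshold)
    return {
        "mean": round(mean(scores), 2),
        "pct_full": round(100 * at_full / n, 1),
        "pct_below": round(100 * below / n, 1),
    }
```

Reporting the share below threshold alongside the mean matters because a healthy mean can hide a small cluster of serious failures.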
Part 5: Patterns
What’s working
- [Pattern 1 — specific, with example]
- [Pattern 2]
What’s breaking
For each pattern, file (or reference) a failure mode report.
| Pattern | Failure mode | Frequency | Severity | Owner |
|---|---|---|---|---|
Surprises
Behaviors the team didn’t anticipate, in either direction.
- [Positive surprise]
- [Negative surprise]
Part 6: User signal
Pull from the channels where users actually tell the team how it’s going.
- Top complaints:
- Common confusions:
- Repeated requests the model can’t currently handle:
- Praise (what users say worked):
Part 7: Drift check
Behavior can drift even when no one set out to change it: a model upgrade, a small prompt tweak, or a shifting user mix can each move it.
- Has the underlying model version changed since last audit? [yes / no — if yes, link to release notes]
- Has the system prompt changed since last audit? [yes / no, with a link to the behavior change log]
- Has the user mix changed? [yes / no — describe]
- Behaviors that look different from last audit:
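One way to surface behaviors that look different from the last audit is to compare per-dimension mean scores across the two audits and flag large moves. The sketch below assumes each audit is summarized as a mapping from dimension name to mean score; the helper name `flag_drift` and the tolerance of 0.2 points are arbitrary illustrative choices, not recommended values.

```python
def flag_drift(previous, current, tolerance=0.2):
    """Return dimensions whose mean score moved by at least `tolerance`.

    `previous` and `current` map dimension name -> mean score.
    Dimensions present in only one audit are skipped.
    """
    flagged = {}
    for dim in sorted(previous.keys() & current.keys()):
        delta = current[dim] - previous[dim]
        if abs(delta) >= tolerance:
            flagged[dim] = round(delta, 2)  # negative means the score fell
    return flagged
```

A flagged dimension is a prompt to read conversations, not a verdict; the questions above about model, prompt, and user mix help explain why the score moved.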
Part 8: Recommended actions
| Action | Driven by | Owner | Priority | Target |
|---|---|---|---|---|
Part 9: Sign-off
| Reviewer | Role | Sign-off |
|---|---|---|
| Behavior team | | |
| Safety / policy | | |
| Product | | |
Example: April audit for Aria (Meridian Bank support)
Audit metadata
- Period: April 1 – April 30
- Sample: 200 conversations, stratified — 50 normal, 50 refusals, 50 escalations, 50 long (>10 turns)
Patterns
Working: Aria is escalating cleanly on suspected fraud (98% of in-sample fraud cases). Tone scores are strong on short conversations (mean 2.85 / 3).
Breaking:
| Pattern | Failure mode | Frequency |
|---|---|---|
| Aria gives investment-shaped advice when users ask “what would you do?” | Under-refusal | 6% of advice-adjacent conversations |
| Tone slips to casual after turn 8–10 | Persona drift | 22% of long conversations |
| Aria refuses basic financial-term definitions (“what’s APR?”) | Over-refusal | 4% of definition requests |
Drift check
- Model version: unchanged.
- System prompt: changed twice this month (see change log).
- User mix: roughly stable; small uptick in mortgage-adjacent inquiries.
Recommended actions
| Action | Driven by | Owner | Priority |
|---|---|---|---|
| Tighten “no investment advice even hypothetically” instruction | Under-refusal pattern | Behavior team | High |
| Add tone re-injection at turn 10 | Persona drift on long conversations | Behavior team | Medium |
| Add allow-list instruction for financial-term definitions | Over-refusal | Behavior team | High |