A behavior audit is a periodic, structured review of how an AI product is actually behaving in production. It’s broader than an evaluation run on a fixed test set: an audit looks at real conversations, real complaints, real refusals, and asks whether the behavior the team specified is the behavior users are getting.

Run an audit on a regular cadence (monthly or quarterly) and after any major change — new model, new prompt, new feature, new user segment.


Part 1: Audit metadata

  • Product / feature:
  • Audit period: [date range covered]
  • Date completed:
  • Auditor(s):
  • Sample size: [conversations / responses reviewed]
  • Sampling method: [random, stratified by intent, weighted to refusals, etc.]
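
One way to draw a stratified sample is sketched below. This is a minimal illustration, not a prescribed method: the record fields (`refused`, `escalated`, `turns`), the `stratum_of` rule, and the synthetic data are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(conversations, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` conversations from each stratum."""
    rng = random.Random(seed)  # fixed seed so the audit sample can be reproduced
    buckets = defaultdict(list)
    for conv in conversations:
        buckets[stratum_of(conv)].append(conv)
    # Sample without replacement; small strata are taken whole.
    return {
        name: rng.sample(items, min(per_stratum, len(items)))
        for name, items in buckets.items()
    }

# Hypothetical stratum rule: each conversation lands in exactly one bucket.
def stratum_of(conv):
    if conv["escalated"]:
        return "escalation"
    if conv["refused"]:
        return "refusal"
    if conv["turns"] > 10:
        return "long"
    return "normal"

# Synthetic records standing in for production logs (illustrative only).
conversations = [
    {"id": i, "refused": i % 5 == 0, "escalated": i % 7 == 0, "turns": 3 + i % 12}
    for i in range(1000)
]

sample = stratified_sample(conversations, stratum_of, per_stratum=50)
for name, items in sorted(sample.items()):
    print(name, len(items))
```

Recording the seed and the stratum rule alongside the sampling method makes it possible to redraw the same sample when a finding is disputed.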

Part 2: Behavior under review

What did the team say the model should do? Pull this directly from the behavior specification so the audit is grounded in the written commitment.

  • Behavioral mission:
  • In-scope behaviors:
  • Out-of-scope behaviors:
  • Tone and persona:
  • Escalation triggers:

Part 3: What the production data shows

Volume and shape

| Metric | Value | Trend vs. last audit |
|---|---|---|
| Total conversations | | |
| Total turns | | |
| Median conversation length | | |
| Conversations involving an escalation | | |
| Conversations involving a refusal | | |
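
The volume metrics can be computed directly from conversation logs. The sketch below assumes each conversation is a dict with hypothetical `turns`, `escalated`, and `refused` fields; the sample values are invented for illustration.

```python
from statistics import median

def volume_metrics(conversations):
    """Compute the volume-and-shape metrics for one audit period."""
    turns = [c["turns"] for c in conversations]
    return {
        "total_conversations": len(conversations),
        "total_turns": sum(turns),
        "median_length": median(turns) if turns else 0,
        "escalations": sum(1 for c in conversations if c["escalated"]),
        "refusals": sum(1 for c in conversations if c["refused"]),
    }

def trend(current, previous):
    """Difference per metric versus the last audit (positive means it rose)."""
    return {k: current[k] - previous[k] for k in current if k in previous}

# Illustrative data only.
this_period = [
    {"turns": 2, "escalated": False, "refused": False},
    {"turns": 4, "escalated": True, "refused": False},
    {"turns": 6, "escalated": False, "refused": True},
    {"turns": 12, "escalated": False, "refused": True},
]
last_audit = {
    "total_conversations": 3,
    "total_turns": 20,
    "median_length": 6,
    "escalations": 1,
    "refusals": 1,
}

metrics = volume_metrics(this_period)
print(metrics)
print(trend(metrics, last_audit))
```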

Top intents

| Intent | % of conversations | Resolution rate |
|---|---|---|

Part 4: Behavior assessment by dimension

For each dimension in the evaluation rubric, score what the production sample looks like.

| Dimension | Mean score | % at full | % below threshold | Change vs. last audit |
|---|---|---|---|---|
| Task completion | | | | |
| Accuracy | | | | |
| Tone | | | | |
| Scope adherence | | | | |
| Escalation appropriateness | | | | |
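
The per-dimension columns can be derived from raw rubric scores. The sketch below assumes a 0 to 3 scale and a passing threshold of 2; both are assumptions for illustration and should be replaced with the values from the team's own rubric.

```python
from statistics import mean

def summarize_dimension(scores, full=3, threshold=2):
    """Summarize one rubric dimension across the audit sample.

    `full` is the top score on the rubric and `threshold` is the minimum
    acceptable score; both defaults are assumed, not prescribed.
    """
    if not scores:
        raise ValueError("no scores to summarize")
    n = len(scores)
    return {
        "mean": round(mean(scores), 2),
        "pct_full": round(100 * sum(s == full for s in scores) / n, 1),
        "pct_below": round(100 * sum(s < threshold for s in scores) / n, 1),
    }

# Illustrative tone scores for ten sampled conversations.
tone_scores = [3, 3, 2, 3, 1, 3, 3, 2, 3, 3]
summary = summarize_dimension(tone_scores)
print(summary)  # {'mean': 2.6, 'pct_full': 70.0, 'pct_below': 10.0}
```

Reporting the share below threshold alongside the mean helps, since a healthy mean can hide a small cluster of badly failing conversations.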

Part 5: Patterns

What’s working

  • [Pattern 1 — specific, with example]
  • [Pattern 2]

What’s breaking

For each pattern, file (or reference) a failure mode report.

| Pattern | Failure mode | Frequency | Severity | Owner |
|---|---|---|---|---|

Surprises

Behaviors the team didn’t anticipate, in either direction.

  • [Positive surprise]
  • [Negative surprise]

Part 6: User signal

Pull from the channels where users actually tell the team how it’s going.

  • Top complaints:
  • Common confusions:
  • Repeated requests the model can’t currently handle:
  • Praise (what users say worked):

Part 7: Drift check

Behavior drifts even when nothing intentional changed — a model upgrade, a small prompt tweak, a shifting user mix.

  • Has the underlying model version changed since last audit? [yes / no — if yes, link to release notes]
  • Has the system prompt changed? [link to behavior change log]
  • Has the user mix changed? [yes / no — describe]
  • Behaviors that look different from last audit:
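
A simple way to surface behaviors that look different is to compare headline rates between audits and flag any that moved by more than a chosen tolerance. The sketch below is one possible approach; the metric names, the numbers, and the 0.05 tolerance are all hypothetical.

```python
def flag_drift(previous, current, tolerance=0.05):
    """Return metrics whose value moved by more than `tolerance` (absolute).

    Only metrics present in both audits are compared.
    """
    flagged = {}
    for name in sorted(previous.keys() & current.keys()):
        delta = current[name] - previous[name]
        if abs(delta) > tolerance:
            flagged[name] = round(delta, 3)
    return flagged

# Illustrative rates (fraction of conversations), not real data.
last_audit = {"refusal_rate": 0.08, "escalation_rate": 0.05}
this_audit = {"refusal_rate": 0.15, "escalation_rate": 0.06}

drifted = flag_drift(last_audit, this_audit)
print(drifted)  # {'refusal_rate': 0.07}
```

A flagged metric is a prompt to read the underlying conversations, not a finding in itself; a shift in user mix can move a rate without any change in model behavior.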

Part 8: Actions

| Action | Driven by | Owner | Priority | Target |
|---|---|---|---|---|

Part 9: Sign-off

| Reviewer | Role | Sign-off |
|---|---|---|
| | Behavior team | |
| | Safety / policy | |
| | Product | |

Example: April audit for Aria (Meridian Bank support)

Audit metadata

  • Period: April 1 – April 30
  • Sample: 200 conversations, stratified — 50 normal, 50 refusals, 50 escalations, 50 long (>10 turns)

Patterns

Working: Aria is escalating cleanly on suspected fraud (98% of in-sample fraud cases). Tone scores are strong on short conversations (mean 2.85 / 3).

Breaking:

| Pattern | Failure mode | Frequency |
|---|---|---|
| Aria gives investment-shaped advice when users ask “what would you do?” | Under-refusal | 6% of advice-adjacent conversations |
| Tone slips to casual after turn 8–10 | Persona drift | 22% of long conversations |
| Aria refuses basic financial-term definitions (“what’s APR?”) | Over-refusal | 4% of definition requests |

Drift check

  • Model version: unchanged.
  • System prompt: changed twice this month (see change log).
  • User mix: roughly stable; small uptick in mortgage-adjacent inquiries.

Actions

| Action | Driven by | Owner | Priority |
|---|---|---|---|
| Tighten “no investment advice even hypothetically” instruction | Under-refusal pattern | Behavior team | High |
| Add tone re-injection at turn 10 | Persona drift on long conversations | Behavior team | Medium |
| Add allow-list instruction for financial-term definitions | Over-refusal | Behavior team | High |