AI Product Behavior Audit | behavior.engineering

A behavior audit is a periodic, structured review of how an AI product is actually behaving in production. It’s broader than an evaluation run on a fixed test set: an audit looks at real conversations, real complaints, real refusals, and asks whether the behavior the team specified is the behavior users are getting.

Run an audit on a regular cadence (monthly or quarterly) and after any major change — new model, new prompt, new feature, new user segment.

Part 1: Audit metadata

Product / feature:
Audit period: [date range covered]
Date completed:
Auditor(s):
Sample size: [conversations / responses reviewed]
Sampling method: [random, stratified by intent, weighted to refusals, etc.]

Part 2: Behavior under review

What did the team say the model should do? Pull this directly from the behavior specification so the audit is grounded in the written commitment.

Behavioral mission:
In-scope behaviors:
Out-of-scope behaviors:
Tone and persona:
Escalation triggers:

Part 3: What the production data shows

Volume and shape

Metric	Value	Trend vs. last audit
Total conversations
Total turns
Median conversation length
Conversations involving an escalation
Conversations involving a refusal

Top intents

Intent	% of conversations	Resolution rate

Part 4: Behavior assessment by dimension

For each dimension in the evaluation rubric, score what the production sample looks like.

Dimension	Mean score	% at full	% below threshold	Change vs. last audit
Task completion
Accuracy
Tone
Scope adherence
Escalation appropriateness

Part 5: Patterns

What’s working

[Pattern 1 — specific, with example]
[Pattern 2]

What’s breaking

For each pattern, file (or reference) a failure mode report.

Pattern	Failure mode	Frequency	Severity	Owner

Surprises

Behaviors the team didn’t anticipate, in either direction.

[Positive surprise]
[Negative surprise]

Part 6: User signal

Pull from the channels where users actually tell the team how it’s going.

Top complaints:
Common confusions:
Repeated requests the model can’t currently handle:
Praise (what users say worked):

Part 7: Drift check

Behavior drifts even when nothing intentional changed — a model upgrade, a small prompt tweak, a shifting user mix.

Has the underlying model version changed since last audit? [yes / no — if yes, link to release notes]
Has the system prompt changed? [link to behavior change log]
Has the user mix changed? [yes / no — describe]
Behaviors that look different from last audit:

Part 8: Recommended actions

Action	Driven by	Owner	Priority	Target

Part 9: Sign-off

Reviewer	Role	Sign-off
	Behavior team
	Safety / policy
	Product

Example: April audit for Aria (Meridian Bank support)

Audit metadata

Period: April 1 – April 30
Sample: 200 conversations, stratified — 50 normal, 50 refusals, 50 escalations, 50 long (>10 turns)

Patterns

Working: Aria is escalating cleanly on suspected fraud (98% of in-sample fraud cases). Tone scores are strong on short conversations (mean 2.85 / 3).

Breaking:

Pattern	Failure mode	Frequency
Aria gives investment-shaped advice when users ask “what would you do?”	Under-refusal	6% of advice-adjacent conversations
Tone slips to casual after turn 8–10	Persona drift	22% of long conversations
Aria refuses basic financial-term definitions (“what’s APR?”)	Over-refusal	4% of definition requests

Drift check

Model version: unchanged.
System prompt: changed twice this month (see change log).
User mix: roughly stable; small uptick in mortgage-adjacent inquiries.

Recommended actions

Action	Driven by	Owner	Priority
Tighten “no investment advice even hypothetically” instruction	Under-refusal pattern	Behavior team	High
Add tone re-injection at turn 10	Persona drift on long conversations	Behavior team	Medium
Add allow-list instruction for financial-term definitions	Over-refusal	Behavior team	High