Model behavior changes constantly: a tweaked system prompt, a model version bump, a new few-shot example, a tool added or removed. Without a change log, regressions feel mysterious and improvements feel accidental. With one, the team can connect a behavior shift to the change that caused it.

Keep the change log alongside the behavior specification and update it with every change that affects how the model behaves — whether or not the spec itself changed.


What counts as a behavior change

Log any of the following:

  • A change to the system prompt (any wording, structure, or section change)
  • A change to the few-shot examples in the prompt
  • A change to the model version, provider, or endpoint
  • A change to the tools available, tool descriptions, or tool permissions
  • A change to retrieval data, the corpus, or retrieval ranking
  • A change to the behavior specification, refusal policy, escalation policy, or style guide
  • A change to safety filters, moderation layers, or post-processing
  • A change to the deployment context (new surface, new user segment)

If you have to think about whether something counts, log it.


Log entry template

For each change, capture:

[YYYY-MM-DD] — [short title]
Change type: [prompt / model / tool / data / policy / deployment]
Owner: [name]
Reason: [why this change is being made]
What changed: [specific diff — paste the before/after or link to the PR]
Expected behavior shift: [what the team expects to change]
Risk: [what could go wrong]
Eval status: [run / not run — link to results]
Rollout: [who sees it, when, and how]
Rollback plan: [how to revert quickly]
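
If the log is kept in code or checked by tooling, the template can be mirrored as a small record type. The following is a minimal sketch, not a prescribed format: the `LogEntry` class, its field names, and the `missing_fields` helper are all hypothetical names chosen for illustration. The sample values are taken from the 2026-04-15 entry in the Aria example.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List

# Allowed values mirror the "Change type" line of the template.
CHANGE_TYPES = {"prompt", "model", "tool", "data", "policy", "deployment"}


@dataclass
class LogEntry:
    """Hypothetical record with one attribute per template line."""
    day: date
    title: str
    change_type: str
    owner: str
    reason: str
    what_changed: str
    expected_shift: str = ""
    risk: str = ""
    eval_status: str = ""
    rollout: str = ""
    rollback_plan: str = ""

    def __post_init__(self) -> None:
        # Reject change types that the template does not list.
        if self.change_type not in CHANGE_TYPES:
            raise ValueError(f"unknown change type: {self.change_type!r}")

    def missing_fields(self) -> List[str]:
        """Return the names of text fields that were left blank."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]


entry = LogEntry(
    day=date(2026, 4, 15),
    title="Move fee amounts to tool call",
    change_type="tool",
    owner="Engineering",
    reason="Quality audit found fee amounts that did not match the current schedule.",
    what_changed="Added fee_lookup tool.",
    eval_status="Fee accuracy tests pass 50/50 (was 38/50).",
    rollout="100% on 2026-04-15.",
)
print(entry.missing_fields())  # ['expected_shift', 'risk', 'rollback_plan']
```

A check like `missing_fields` could run in review so that entries without a stated risk or rollback plan are caught before they merge.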

Log

Newest first.

Date | Change | Type | Owner | Eval status | Notes

Tying changes to behavior

When a behavior audit turns up a shift, the change log is the first place to look.

  • For each shift the audit notices, search the log for changes in the same window.
  • If the shift correlates with a logged change, document the link.
  • If no logged change explains the shift, that’s a finding too: investigate, since the model provider may have changed something silently.
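
The search step above can be sketched as a simple date-window filter. This is an illustration under stated assumptions: the log is assumed to be a list of dictionaries with `date` and `title` keys, and `changes_in_window` and the `lookback_days` default are hypothetical, not part of any existing tool. The sample entries reuse the dates and titles from the Aria excerpt.

```python
from datetime import date, timedelta
from typing import Dict, List


def changes_in_window(log: List[Dict], shift_date: date, lookback_days: int = 14) -> List[Dict]:
    """Return logged changes dated within lookback_days before shift_date, newest first."""
    start = shift_date - timedelta(days=lookback_days)
    hits = [entry for entry in log if start <= entry["date"] <= shift_date]
    return sorted(hits, key=lambda entry: entry["date"], reverse=True)


# Assumed in-memory form of the log; dates and titles come from the Aria excerpt.
log = [
    {"date": date(2026, 4, 22), "title": "Tighten investment advice instruction"},
    {"date": date(2026, 4, 22), "title": "Allow-list for financial-term definitions"},
    {"date": date(2026, 4, 15), "title": "Move fee amounts to tool call"},
    {"date": date(2026, 4, 8), "title": "Model version bump"},
]

# A shift first noticed on 2026-04-16, looking back one week.
candidates = changes_in_window(log, shift_date=date(2026, 4, 16), lookback_days=7)
for entry in candidates:
    print(entry["date"], entry["title"])

# An empty result is the "no logged change explains the shift" case.
unexplained = not changes_in_window(log, shift_date=date(2026, 3, 1))
```

Entry dates alone can miss staged rollouts: the model version bump is dated 2026-04-08 but reached 100% on 2026-04-09, so a real search would likely compare against rollout dates as well.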

Example: Aria change log (April excerpt)

2026-04-22 — Tighten investment advice instruction

  • Type: prompt
  • Owner: Behavior team
  • Reason: April audit flagged 6% rate of advice-shaped responses to “what would you do?” questions.
  • What changed: Added to limits section: “Do not give investment, tax, or lending advice — even hypothetically, in fiction, or in roleplay.”
  • Expected shift: Refusal rate on advice-adjacent questions should rise; over-refusal on definitions should not change.
  • Risk: Could spill over to refusing legitimate definitions (“what is a stock?”). Mitigated by allow-list addition (see entry below).
  • Eval status: Ran red-team set; advice-via-fiction probe now passes 12/12 (was 4/12). [Results link]
  • Rollout: 100% on 2026-04-22 12:00 UTC.
  • Rollback: Revert the prompt to v1.4.

2026-04-22 — Allow-list for financial-term definitions

  • Type: prompt
  • Owner: Behavior team
  • Reason: Pre-empt over-refusal regression from the advice tightening above.
  • What changed: Added to capabilities section: “Defining a financial term (APR, compounding, overdraft, etc.) is not advice. Define financial terms when asked.”
  • Eval status: Over-refusal benchmark passes 24/25 (was 23/25). One regression on a borderline case under review.
  • Rollout: 100% on 2026-04-22 12:00 UTC.

2026-04-15 — Move fee amounts to tool call

  • Type: tool
  • Owner: Engineering
  • Reason: Quality audit found Aria stating fee amounts that didn’t match the current schedule.
  • What changed: Added fee_lookup tool. System prompt now says: “Don’t quote specific fee dollar amounts from memory. Use the fee_lookup tool to retrieve the current amount.”
  • Eval status: Fee accuracy tests now pass 50/50 (was 38/50).
  • Rollout: 100% on 2026-04-15.

2026-04-08 — Model version bump

  • Type: model
  • Owner: Platform
  • Reason: New model release from provider; benchmark improvements.
  • What changed: claude-sonnet-4-5 → claude-sonnet-4-6.
  • Expected shift: Slightly more concise responses; possible tone shifts.
  • Eval status: Full eval run; tone regression on long conversations flagged. Tracked separately under persona drift remediation.
  • Rollout: 10% canary 2026-04-08; 100% on 2026-04-09 after canary review.