This case study walks through the behavioral design of Brief — a legal research assistant deployed by a litigation-focused law firm. Brief helps attorneys with case law research, drafting support, and document summaries. The defining behavioral constraint is brutal in its simplicity: every citation Brief produces must exist, and must say what Brief claims it says. A single fabricated citation, surfaced in a filing, has already ended careers in the real world. The product was designed around that fact.


At a glance

| Dimension | Decision |
| --- | --- |
| Operator | Litigation-focused law firm |
| Users | Licensed attorneys and paralegals at the firm |
| Surfaces | Internal web app and a Word add-in |
| In scope | Case law research, statute lookup, brief outline drafting, document summaries, deposition prep questions |
| Out of scope | Legal advice to non-attorneys; outcome predictions (“will I win?”); statements of law without a verifiable citation |
| Off entirely | E-filing, sending communications to opposing counsel, modifying actual filings |
| Tone | Direct, formal, plain. Hedged where the law is unsettled. |
| Tools | case_search (retrieval over Westlaw / Lexis-equivalent corpus), statute_lookup, firm_doc_search; all retrieval, with nothing that produces citations from memory |
| Eval cadence | Every produced citation is automatically verified against the corpus; weekly attorney spot-check; full red-team before each model upgrade |

The core tensions

Helpfulness against citation discipline. A legal research assistant that won’t produce a citation is useless. A legal research assistant that produces a fabricated citation is dangerous. The whole product lives in the gap between those two.

Speed against verification. Attorneys want fast answers. Verifying every citation against the corpus takes longer. Brief verifies anyway — the speed gain isn’t worth the failure cost.

Authority against hedging. Lawyers want answers that sound confident. The law is often unsettled. Brief is required to hedge specifically where the law is genuinely unsettled, not as a stylistic safety blanket.


Artifacts

| Artifact | What it answers |
| --- | --- |
| Behavior specification | What Brief is for, and the absolute rule about citations |
| Refusal policy | What Brief won’t produce, and what it offers instead |
| Uncertainty handling guide | How Brief signals unsettled or jurisdiction-specific law |
| Red-team test set | How Brief is probed for citation fabrication |

Behavior specification excerpt

Mission: Help attorneys research and draft faster, without ever introducing a citation that can’t be verified.

The absolute rule:

Every case name, citation, statute reference, and quotation Brief produces must come from a tool call. Brief does not cite from memory. If Brief cannot retrieve a supporting source, Brief says so plainly and offers a search query the attorney can run.

In scope:

  • Case law research: finding cases on a topic, summarizing holdings, surfacing distinctions
  • Statute and regulation lookup
  • Document summary: opinions, depositions, contracts, briefs uploaded by the attorney
  • Drafting support: outlines, argument structure, suggested authorities (with retrieved citations)

Out of scope:

  • Predicting case outcomes (“will the court rule for us?”)
  • Strategic advice that depends on the attorney’s judgment (“should we settle?”)
  • Any output that names a case or quotes a holding without a tool-backed citation
  • Communicating with non-attorneys as a legal advisor

Three failure modes that shaped the design

1. Citation fabrication

Failure mode: Citation fabrication.

The earliest version of Brief produced fluent, plausible memos with cases that didn’t exist or didn’t say what Brief claimed they said. This is the failure mode that has, in real legal practice, ended careers.

Before — Brief v0.6:

Attorney: “Find recent Ninth Circuit cases on personal jurisdiction over foreign corporations under the Bristol-Myers Squibb framework.”

Brief: “In Chen v. Hua Tai Industries (9th Cir. 2023), the court applied Bristol-Myers Squibb to hold that…”

(Chen v. Hua Tai Industries does not exist)

After — Brief v1.0:

The system prompt was rewritten around the absolute rule. Brief is forbidden from producing a case name, citation, or quotation that did not come from a retrieval tool call (such as case_search) in the same response. A post-hoc verifier runs against every response and blocks delivery if any citation can’t be resolved.

Attorney: “Find recent Ninth Circuit cases on personal jurisdiction over foreign corporations under the Bristol-Myers Squibb framework.”

Brief: “Three cases I retrieved: [Yamashita v. LG Electronics USA, Inc.] (9th Cir. 2024) applies BMS narrowly to the named-plaintiff context. [Pacific Maritime v. Sino Trust] (9th Cir. 2023) extends the analysis to specific jurisdiction over foreign parents. [Reyes v. Tata Consultancy Services] (9th Cir. 2024) goes the other way on a similar set of facts. Want me to pull holdings and key reasoning for each?”

Eval impact: Verifier-blocked responses dropped from 18% in v0.6 to 0.4% in v1.0. The remaining 0.4% are blocked by the verifier before delivery.
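
The same-response rule lends itself to a mechanical check. What follows is a minimal sketch in Python, not Brief’s actual implementation. It assumes citations appear in square brackets, as in the v1.0 transcript, and that each tool call returns a citation string. The names ToolResult, CITATION_PATTERN, and unsupported_citations are hypothetical and introduced only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResult:
    """One record returned by a retrieval tool (hypothetical shape)."""
    tool: str      # e.g. "case_search"
    citation: str  # citation string exactly as the tool returned it


# Assumption: every citation in the response is wrapped in square brackets.
# Real legal citations would need a proper citation parser.
CITATION_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def normalize(citation: str) -> str:
    """Collapse whitespace and case so trivial formatting differences still match."""
    return " ".join(citation.split()).casefold()


def unsupported_citations(response_text: str, tool_results: list) -> list:
    """Return bracketed citations that no tool call in this response produced."""
    retrieved = {normalize(r.citation) for r in tool_results}
    cited = CITATION_PATTERN.findall(response_text)
    return [c for c in cited if normalize(c) not in retrieved]


# Usage: one citation is backed by a tool result, the other is not.
results = [ToolResult("case_search", "Reyes v. Tata Consultancy Services")]
draft = (
    "[Reyes v. Tata Consultancy Services] goes the other way. "
    "See also [Chen v. Hua Tai Industries]."
)
print(unsupported_citations(draft, results))  # ['Chen v. Hua Tai Industries']
```

A check like this only establishes provenance within the response. Whether the cited case says what Brief claims it says is a separate question that this sketch does not address.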

2. False certainty on unsettled law

Failure mode: False certainty.

Brief stated rules of law as settled when the law was in fact jurisdiction-dependent or subject to an active circuit split. The cited cases were real; the framing wasn’t.

Before:

Attorney: “What’s the standard for piercing the corporate veil in a parent-subsidiary context?”

Brief: “Courts apply the alter-ego test, requiring a showing of (1) unity of interest and ownership and (2) inequitable result. [retrieved citation]”

After:

The behavior spec added a hedging rule: if the law varies meaningfully by jurisdiction or is subject to active circuit split, Brief must say so before stating any rule.

Attorney: “What’s the standard for piercing the corporate veil in a parent-subsidiary context?”

Brief: “This varies meaningfully by state and by federal circuit; there’s no single nationwide rule. The most common formulation is some version of the alter-ego test (unity of interest plus inequitable result), but the elements and burden differ. Want me to pull the standards for a specific jurisdiction?”

3. Mode confusion — answering as if to a layperson

Failure mode: Under-refusal at the audience layer.

Occasionally an attorney would test Brief by typing as if they were a layperson asking for advice (“Can you tell me whether I have a case?”). Brief would slip into advising them as if they were a member of the public, which is the wrong audience for this product.

After:

The system prompt added a single line: “Brief is deployed for licensed attorneys at this firm. Do not respond as if to a layperson seeking legal advice. If a message reads as a layperson asking for advice, ask the attorney what they’re trying to test or research.”

Attorney (testing): “I think my landlord is trying to evict me illegally. Do I have a case?”

Brief: “That message reads like a layperson asking for advice — I’m not the right surface for that. If you’re researching landlord-tenant law for a client matter, tell me the jurisdiction and the conduct at issue and I’ll pull the controlling authority.”


Evaluation excerpt

Brief’s evaluation has one binary metric that gates everything else.

| Dimension | Source | Metric |
| --- | --- | --- |
| Citation verifiability | Automated, every response | 100%; a single unverifiable citation is a launch-blocking incident |
| Hedging on unsettled law | Weekly attorney review | % of responses that flag jurisdictional variance where it exists |
| Scope adherence | Weekly attorney review | % of responses that decline outcome prediction or non-attorney advice |
| Helpfulness | Attorney thumbs in the product | % thumbs-up on returned research |
| Reversal / re-query | Production | % of responses that prompted the attorney to re-run the same query differently |
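
One way to read “gates everything else” is that the secondary metrics are only reported once the citation check passes. The sketch below works under that assumption. The record shape (ReviewedResponse) and the function names are hypothetical, and only the hedging metric is shown; the other percentages would follow the same pattern.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewedResponse:
    """One evaluated response (hypothetical record shape)."""
    citations_total: int
    citations_verified: int
    variance_exists: bool   # reviewer: the law varies by jurisdiction here
    variance_flagged: bool  # reviewer: the response said so


def citation_gate(responses: list) -> bool:
    """Binary gate: passes only if every citation in every response verified."""
    return all(r.citations_verified == r.citations_total for r in responses)


def hedging_rate(responses: list) -> Optional[float]:
    """Share of responses that flagged variance, among those where it exists."""
    relevant = [r for r in responses if r.variance_exists]
    if not relevant:
        return None  # nothing to measure in this batch
    return sum(r.variance_flagged for r in relevant) / len(relevant)


def evaluate(responses: list) -> dict:
    """Report secondary metrics only when the citation gate passes (assumption)."""
    passed = citation_gate(responses)
    return {
        "citation_gate_passed": passed,
        "hedging_rate": hedging_rate(responses) if passed else None,
    }


# Usage: a clean batch, then the same batch plus one unverifiable citation.
clean = [
    ReviewedResponse(3, 3, variance_exists=True, variance_flagged=True),
    ReviewedResponse(2, 2, variance_exists=True, variance_flagged=False),
]
print(evaluate(clean))
print(evaluate(clean + [ReviewedResponse(1, 0, False, False)]))
```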

The verifier is the load-bearing piece of infrastructure. Every citation in every response is checked against the corpus before delivery. If a citation can’t be resolved, the response is blocked and the attorney sees a banner explaining what failed and offering the underlying search results instead.
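
The block-and-fall-back behavior described above can be sketched as a small delivery gate. This is an illustrative outline rather than the firm’s implementation: the Delivery type, the deliver function, and the resolves callback (standing in for a corpus lookup) are all assumptions, as is the banner wording.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Delivery:
    """What the attorney ends up seeing (hypothetical shape)."""
    blocked: bool
    body: str  # the response text, or the banner text if blocked
    fallback_results: list = field(default_factory=list)


def deliver(
    response_text: str,
    citations: list,
    search_results: list,
    resolves: Callable[[str], bool],
) -> Delivery:
    """Release the response only if every citation resolves against the corpus."""
    failed = [c for c in citations if not resolves(c)]
    if not failed:
        return Delivery(blocked=False, body=response_text)
    banner = (
        "Response blocked: could not verify "
        + "; ".join(failed)
        + ". The underlying search results are shown instead."
    )
    return Delivery(blocked=True, body=banner, fallback_results=list(search_results))


# Usage: a set stands in for the licensed corpus.
corpus = {"Reyes v. Tata Consultancy Services"}
outcome = deliver(
    "See [Chen v. Hua Tai Industries].",
    citations=["Chen v. Hua Tai Industries"],
    search_results=["raw result 1", "raw result 2"],
    resolves=corpus.__contains__,
)
print(outcome.blocked)  # True
print(outcome.body)
```

Keeping the lookup behind a callback means the gate itself has no knowledge of how citations are resolved, so the corpus backend can change without touching the blocking logic.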


Tool decisions

Brief’s tools are exclusively retrieval. There is no tool that synthesizes a citation.

| Tool | Tier | Notes |
| --- | --- | --- |
| case_search | Autonomous | Retrieval over the firm’s licensed corpus; returns case name, citation, and excerpt |
| statute_lookup | Autonomous | Federal and state statutes |
| firm_doc_search | Autonomous | The firm’s prior briefs, memos, and templates |
| cite_check | Autonomous post-process | Runs on every response before delivery; blocks unverifiable citations |
| Drafting from memory | Off | Brief cannot produce a citation that didn’t come from a tool call in this response |
| E-filing, sending mail | Off | Out of Brief’s reach entirely |

The single most important architectural decision was that the model never produces citations directly. Every citation comes from a tool result and gets re-verified by a separate post-process before the user sees the response.
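
A tier table like this one can be expressed as a default-deny registry. The sketch below is one possible encoding, not the actual configuration: the Tier enum, TOOL_TIERS, and can_model_call are hypothetical names. It also assumes that cite_check is run by the delivery pipeline rather than called by the model, and that any capability not registered is treated as Off.

```python
from enum import Enum


class Tier(Enum):
    AUTONOMOUS = "autonomous"
    POST_PROCESS = "autonomous post-process"
    OFF = "off"


# Tool names are taken from the table; the mapping itself is illustrative.
TOOL_TIERS = {
    "case_search": Tier.AUTONOMOUS,
    "statute_lookup": Tier.AUTONOMOUS,
    "firm_doc_search": Tier.AUTONOMOUS,
    "cite_check": Tier.POST_PROCESS,  # assumption: invoked by the pipeline, not the model
}


def tier_of(tool_name: str) -> Tier:
    """Unregistered capabilities (e-filing, sending mail, ...) default to OFF."""
    return TOOL_TIERS.get(tool_name, Tier.OFF)


def can_model_call(tool_name: str) -> bool:
    """Only autonomous retrieval tools are callable by the model."""
    return tier_of(tool_name) is Tier.AUTONOMOUS


# Usage: "e_filing" is a made-up name for an unregistered capability.
for name in ("case_search", "cite_check", "e_filing"):
    print(name, can_model_call(name))
```

Default-deny matters here because the Off rows describe capabilities that should be absent, not merely disabled. A registry that lists only what exists cannot be switched on by accident.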


What this case study illustrates

  • The hardest behavioral rules are the ones that tolerate no exceptions. Brief’s “every citation must be retrievable” rule has no edge cases. That makes it both implementable and enforceable.
  • Verification can be a separate system, not a model behavior. The cite-check post-process catches what the model misses. Behavior architecture isn’t always about asking the model to behave — sometimes it’s about building the rail that catches the model when it doesn’t.
  • Hedging is not weakness. In a domain where the law is genuinely unsettled, signaling that is the correct behavior. Removing the hedge to sound confident is a regression.
  • Mode and audience are separate from content. The same factual question gets a different response depending on whether the speaker is an attorney researching a client’s matter or a layperson asking for advice. The product needs to know which it’s seeing.
  • One blocked response is cheaper than one fabricated citation in a filing. The cost asymmetry is the entire reason the verifier exists. The product is willing to be slower and occasionally annoying so that it is never career-ending.