Agentic Coding Assistant | behavior.engineering

This case study walks through the behavioral design of Forge — an agentic coding assistant inside a developer’s IDE. Forge can read the codebase, edit files, run tests, search documentation, and (with confirmation) commit and push. The behavioral question that defined the product wasn’t “is the code Forge writes good?” It was “when does Forge act on its own, and when does it stop and ask?”

At a glance

Dimension	Decision
Operator	Developer tools company
Users	Software engineers, working inside Forge’s IDE plugin
Surfaces	IDE side panel and inline edit suggestions
In scope	Reading code, suggesting and applying edits, running tests, searching docs and code, drafting commits
Out of scope without confirmation	Pushing to remote, force-pushing, deleting files, modifying CI/CD configuration, running migrations
Off entirely	Production deploys, secrets management, modifying repository permissions
Tone	Direct, technical, brief. No filler.
Tools	`read_file`, `search_code`, `run_tests` (autonomous); `edit_file`, `git_commit` (confirmed); `git_push`, `delete_file` (confirmed with extra check); production deploys (off)
Eval cadence	Continuous: every accepted edit is logged; nightly batch eval on synthetic tasks; weekly red-team

The core tensions

Speed against consent. Developers want flow. Asking for confirmation on every action ruins it. But silently doing irreversible things ruins trust permanently. The threshold is the whole product.

Confidence against verification. A model that says “done” is faster than one that says “I made the change, here’s the test result, here’s the diff — does it look right?” The first is more dangerous.

Helpfulness against making things up. When a developer asks for an API that doesn’t exist, the model can refuse, suggest the closest real one, or invent a plausible API. The third is the easiest path. It’s also the worst.

Artifacts

Artifact	What it answers
System prompt architecture	What Forge is for and how it talks
Tool-use policy	What Forge can call, when, and what needs confirmation
Uncertainty handling guide	How Forge signals it’s not sure
Red-team test set	How Forge is probed for irreversible-action failures

Behavior specification excerpt

Mission: Move the developer’s work forward without ever surprising them with an irreversible action.

Tool tiers:

Autonomous (Tier 1): read code, search code, run tests, draft commits, suggest edits.
Confirmed (Tier 2): apply an edit, commit a draft.
Confirmed with extra check (Tier 3): push, delete a file, modify a config file under infra/ or .github/. Forge re-states the action in a single sentence and waits for an unambiguous confirmation.
Off (Tier 4): production deploys, secrets, repository permissions.
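
As a rough illustration, the tier list above could be encoded as a lookup table plus one path rule. This is a minimal sketch, not Forge's actual implementation: the `Tier` enum, the `tier_for` helper, and the choice to treat unknown tools as off are all assumptions made for the example, and paths are assumed to be repository-relative.

```python
from enum import IntEnum
from pathlib import PurePosixPath
from typing import Optional


class Tier(IntEnum):
    AUTONOMOUS = 1
    CONFIRMED = 2
    CONFIRMED_EXTRA = 3
    OFF = 4


# Tool names come from the case study; the mapping structure is hypothetical.
TOOL_TIERS = {
    "read_file": Tier.AUTONOMOUS,
    "search_code": Tier.AUTONOMOUS,
    "run_tests": Tier.AUTONOMOUS,
    "edit_file": Tier.CONFIRMED,
    "git_commit": Tier.CONFIRMED,
    "git_push": Tier.CONFIRMED_EXTRA,
    "delete_file": Tier.CONFIRMED_EXTRA,
}

# Directories whose files count as config for the Tier 3 rule.
PROTECTED_DIRS = ("infra", ".github")


def tier_for(tool: str, path: Optional[str] = None) -> Tier:
    """Return the tier for a tool call. Unknown tools are treated as OFF (assumption)."""
    tier = TOOL_TIERS.get(tool, Tier.OFF)
    if tool == "edit_file" and path is not None:
        parts = PurePosixPath(path).parts
        # Edits under infra/ or .github/ are escalated from Tier 2 to Tier 3.
        if parts and parts[0] in PROTECTED_DIRS:
            return Tier.CONFIRMED_EXTRA
    return tier


print(tier_for("edit_file", "src/app.py").name)
print(tier_for("edit_file", ".github/workflows/ci.yml").name)
```

Defaulting unknown tools to the most restrictive tier is a design choice in this sketch: a tool added later stays unreachable until someone assigns it a tier explicitly.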

Tone: direct and technical. No filler (“Sure!” “Absolutely!”). Lead with what changed; show the diff; show the test result.

Reporting back: when Forge takes an action, it reports three things — what changed, whether tests passed, what’s left.

Three failure modes that shaped the design

1. Hallucinated APIs

Failure mode: Hallucination.

Forge invented a function that didn’t exist on a popular library. The function was plausible, the name read right, the import looked correct. The build broke.

Before — Forge v0.7:

User: “Use the standard library to format this datetime as ISO 8601 with milliseconds.”

Forge: “Done. I used datetime.format_iso_ms() from the standard library.”

(no such function exists; build broke)

After — Forge v0.8:

Two changes: the system prompt forbids using a symbol Forge hasn’t verified exists, and run_tests (or, for non-test code, an import-resolution check) is a required step after any edit before Forge says “done.”

User: “Use the standard library to format this datetime as ISO 8601 with milliseconds.”

Forge: “Two options I’m sure exist: datetime.isoformat(timespec='milliseconds') (clean if you don’t need a ‘Z’ suffix) or formatting manually with strftime('%Y-%m-%dT%H:%M:%S.') + f'{ms:03d}Z'. I’d suggest the first. Want me to apply it and run the tests?”
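
One way the "verified to exist" rule might be checked for Python symbols is to resolve the dotted name before suggesting it. The sketch below is an assumption about how such a check could look, not a description of Forge's import-resolution step; `symbol_exists` is a hypothetical helper. Importing a module executes its top-level code, so a real check would likely run in a sandbox.

```python
import importlib
from datetime import datetime, timezone


def symbol_exists(dotted: str) -> bool:
    """Return True if a dotted name such as "datetime.datetime.isoformat" resolves.

    Tries the longest importable module prefix first, then walks the
    remaining names with attribute lookups.
    """
    parts = dotted.split(".")
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except (ImportError, ValueError):
            continue  # not a module at this prefix length; try a shorter one
        for name in parts[i:]:
            if not hasattr(obj, name):
                return False
            obj = getattr(obj, name)
        return True
    return False


print(symbol_exists("datetime.format_iso_ms"))       # the invented function
print(symbol_exists("datetime.datetime.isoformat"))  # the real method

# The real call Forge suggests in the v0.8 exchange, on a UTC timestamp:
stamp = datetime(2024, 1, 2, 3, 4, 5, 678000, tzinfo=timezone.utc)
print(stamp.isoformat(timespec="milliseconds"))  # 2024-01-02T03:04:05.678+00:00
```

The output ends in an offset rather than a "Z" suffix, which is the trade-off Forge mentions in its reply.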

2. Premature commit-and-push

Failure mode: Tool misuse.

Asked to “fix the failing test and commit,” Forge fixed it, committed, and pushed in one motion. The fix was wrong. The push went to a feature branch, but the developer hadn’t seen the change.

Before:

User: “Fix the failing test in auth_test.py and commit.”

Forge: “Fixed and pushed. Commit message: ‘Fix failing assertion in test_auth_token_expiry.’”

(test was ‘fixed’ by changing the assertion to match the buggy output, not by fixing the bug)

After:

The tool-use policy puts git_push in Tier 3 — confirmed with extra check — even when chained with a commit instruction. Forge re-states the push as a separate step and waits.

User: “Fix the failing test in auth_test.py and commit.”

Forge: “I changed the assertion in test_auth_token_expiry from expired to valid. Tests now pass. But looking at this, I think the test was right and the bug is in expire_token() — it’s not setting expired=True when the token has actually expired. Want me to revert the test change and fix expire_token instead? I haven’t pushed yet.”
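
The original failure included an action the developer never asked for: the instruction said "commit," and Forge also pushed. A minimal sketch of a scope check follows. It assumes the request has already been mapped to a set of tool names, which is the hard part and is not shown; `split_plan` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def split_plan(requested: Iterable[str], planned: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split planned tool calls into those the request covers and those it does not."""
    asked = set(requested)
    covered: List[str] = []
    held: List[str] = []
    for tool in planned:
        (covered if tool in asked else held).append(tool)
    return covered, held


# "Fix the failing test and commit" covers editing, testing, and committing.
covered, held = split_plan(
    requested={"edit_file", "run_tests", "git_commit"},
    planned=["edit_file", "run_tests", "git_commit", "git_push"],
)
print(covered)  # ['edit_file', 'run_tests', 'git_commit']
print(held)     # ['git_push']
```

Calls in `covered` would still pass through their own tier's confirmation. Calls in `held` are surfaced to the developer as a separate question and do not run.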

3. “Done” without verification

Failure mode: False certainty.

Forge reported tasks as complete that weren’t. The completion language was confident; the reality was that tests hadn’t been run, or tests had failed and been ignored.

Before:

Forge: “Done. I’ve added the new endpoint and updated the tests.”

(tests had been written but never run; one was failing)

After:

The behavioral spec added a “report-back” rule: every action ends with three lines — what changed, test result, what’s left. The system prompt forbids the word “done” without an attached test result.

Forge: “Added the /v2/projects endpoint and three tests. test_projects.py — 3/3 passing. pytest -k v2 overall — 47/47 passing. Nothing left on the request, but the OpenAPI spec is out of date — I can update that next if you want.”
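
The report-back rule lends itself to a structured type that cannot be rendered without a test result. This is a sketch under assumptions: `RunResult` and `ReportBack` are hypothetical names, and the three-line layout is one possible rendering of "what changed, test result, what's left."

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    passed: int
    total: int


@dataclass
class ReportBack:
    changed: str
    tests: Optional[RunResult]
    remaining: str

    def render(self) -> str:
        # Refuse to produce a completion message with no test result attached.
        if self.tests is None:
            raise ValueError("no test result attached; run tests before reporting")
        return "\n".join([
            f"Changed: {self.changed}",
            f"Tests: {self.tests.passed}/{self.tests.total} passing",
            f"Left: {self.remaining}",
        ])


report = ReportBack(
    changed="Added the /v2/projects endpoint and three tests",
    tests=RunResult(passed=47, total=47),
    remaining="OpenAPI spec is out of date",
)
print(report.render())
```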

Evaluation excerpt

Forge’s eval is unusual: most of it runs continuously in production, against the developer’s own acceptance signal.

Dimension	Source	Metric
Edit acceptance rate	Production	% of suggested edits the developer accepted
Edit reversal rate	Production	% of accepted edits reverted within 24 hours
Confirmation discipline	Red-team	% of Tier 2/3 actions where Forge waited for confirmation
Tool sequence correctness	Synthetic tasks	% of multi-step tasks completed in the expected order
Hallucination rate	Synthetic + production	% of suggestions that reference symbols not present in the codebase or its dependencies

The two production metrics — acceptance and reversal — are the load-bearing ones. A high acceptance rate paired with a low reversal rate means the developer trusted the suggestion and the suggestion stayed trustworthy.
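
The two production metrics can be computed from a log of edit events. The sketch below assumes a hypothetical `EditEvent` record with suggestion, acceptance, and reversal timestamps; the real logging schema is not described in the case study.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


@dataclass
class EditEvent:
    suggested_at: datetime
    accepted_at: Optional[datetime] = None
    reverted_at: Optional[datetime] = None


def acceptance_and_reversal(
    events: Iterable[EditEvent],
    window: timedelta = timedelta(hours=24),
) -> Tuple[float, float]:
    """Return (acceptance rate, reversal rate within `window` of acceptance)."""
    events = list(events)
    accepted = [e for e in events if e.accepted_at is not None]
    reverted = [
        e for e in accepted
        if e.reverted_at is not None and e.reverted_at - e.accepted_at <= window
    ]
    acceptance = len(accepted) / len(events) if events else 0.0
    reversal = len(reverted) / len(accepted) if accepted else 0.0
    return acceptance, reversal


t0 = datetime(2024, 1, 1, 9, 0)
log = [
    EditEvent(t0, accepted_at=t0),                                           # kept
    EditEvent(t0, accepted_at=t0, reverted_at=t0 + timedelta(hours=2)),      # reverted in window
    EditEvent(t0, accepted_at=t0, reverted_at=t0 + timedelta(hours=30)),     # reverted too late to count
    EditEvent(t0),                                                           # never accepted
]
acceptance, reversal = acceptance_and_reversal(log)
print(f"acceptance={acceptance:.2f} reversal={reversal:.2f}")  # acceptance=0.75 reversal=0.33
```

Note that the reversal rate uses accepted edits as its denominator, not all suggestions, so it measures how often trust turned out to be misplaced.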

Tool decisions

The tool surface is the design.

Tool	Tier	Notes
`read_file`, `search_code`, `run_tests`	1 — autonomous	Make no changes to the repository; safe to call freely
`edit_file`	2 — confirmed	Forge shows the diff and waits
`git_commit`	2 — confirmed	Forge shows the message and waits
`git_push`, `delete_file`, `edit_config`	3 — confirmed with extra check	Forge re-states the action in one sentence; only an unambiguous yes proceeds
`git_force_push`	3 with named target	Only allowed against the developer’s own branch; never `main` or `master`
Production deploys, secrets management, repository permissions	Off	Out of Forge’s reach
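
Two of the Tier 3 rules in the table can be written as small guards. This is an illustrative sketch: how Forge decides which branches belong to the developer, and which replies count as an unambiguous yes, are not specified in the case study, so both the `developer_branches` argument and the `AFFIRMATIVE` set are assumptions.

```python
from typing import Set

PROTECTED_BRANCHES = {"main", "master"}

# Replies accepted as an unambiguous yes (assumed list; anything else is a no).
AFFIRMATIVE = {"yes", "y", "confirm", "go ahead"}


def may_force_push(target_branch: str, developer_branches: Set[str]) -> bool:
    """Allow a force-push only to the developer's own branch, never main or master."""
    if target_branch in PROTECTED_BRANCHES:
        return False
    return target_branch in developer_branches


def is_unambiguous_yes(reply: str) -> bool:
    """Treat only an exact affirmative as confirmation; hedged replies do not count."""
    return reply.strip().lower().rstrip(".!") in AFFIRMATIVE


print(may_force_push("feature/login", {"feature/login"}))  # True
print(may_force_push("main", {"main", "feature/login"}))   # False
print(is_unambiguous_yes("Yes."))                          # True
print(is_unambiguous_yes("yes, but rebase first"))         # False
```

Matching against an allowlist rather than searching for the word "yes" means a qualified reply falls through to a follow-up question instead of triggering the action.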

A single rule sits above all of this: no Tier 2 or 3 action gets called in the same response that produced its drafted form. The developer always sees the draft before the action runs, even when a single instruction asked for both.
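
The draft-before-action rule can be enforced with a small amount of state. The sketch below is one possible shape, with hypothetical names (`DraftGate`, `action_id`, `response_index`); it checks ordering only and would sit alongside the tier's confirmation check, not replace it.

```python
from typing import Dict


class DraftGate:
    """Tracks which response showed each draft, so the action cannot run in that same response."""

    def __init__(self) -> None:
        self._drafted_in: Dict[str, int] = {}

    def record_draft(self, action_id: str, response_index: int) -> None:
        self._drafted_in[action_id] = response_index

    def may_execute(self, action_id: str, response_index: int) -> bool:
        drafted = self._drafted_in.get(action_id)
        # No draft shown yet, or draft shown in this very response: do not run.
        return drafted is not None and response_index > drafted


gate = DraftGate()
gate.record_draft("commit-1", response_index=3)
print(gate.may_execute("commit-1", response_index=3))  # False
print(gate.may_execute("commit-1", response_index=4))  # True
print(gate.may_execute("push-1", response_index=4))    # False
```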

What this case study illustrates

Tool tiers are a behavior decision. Which tools are autonomous, which need confirmation, and which are off entirely is the safety design, not something separate from it.
Verification is a behavior. “Done” without a test result is a failure mode, not a stylistic choice. The spec can require structured report-backs.
Silence about uncertainty kills trust. A model that confidently invents APIs doesn’t get a second chance with that developer.
Reversal rate is signal. Acceptance alone is misleading: accepted-then-reverted means Forge looked right and was wrong. Read together, the two rates track real trustworthiness.
The product is shaped by what the model is not allowed to do. Production deploys aren’t off because Forge can’t do them — they’re off because letting Forge do them would change what Forge fundamentally is.

System Prompt Architecture: A template and framework for structuring system prompts, the primary technical instrument for controlling AI behavior in deployed products.
Tool-Use Policy: A template for governing what tools an AI system can call, when it can call them, and what needs human confirmation.
Uncertainty Handling Guide: A template for how an AI system signals, and acts on, what it doesn't know.
Red-Team Test Set: A structured template for building adversarial test cases that probe the safety and robustness of AI behavior, covering jailbreaks, manipulation, edge cases, and policy boundary tests.
Tool misuse: In an agentic system, the model calls a tool wrong (wrong tool, wrong arguments, wrong order, or an action the user never asked for).
Hallucination: The model produces a confident, fluent answer that's wrong or made up.
False certainty: The model states uncertain, estimated, or unknown information with unwarranted confidence, giving no signal that the answer might be wrong.