Case Study — Developer tools
Agentic Coding Assistant
Designing the behavior of Forge — an agentic coding assistant that runs inside an IDE — where actions are real and some of them can't be undone.
This case study walks through the behavioral design of Forge — an agentic coding assistant inside a developer’s IDE. Forge can read the codebase, edit files, run tests, search documentation, and (with confirmation) commit and push. The behavioral question that defined the product wasn’t “is the code Forge writes good?” It was “when does Forge act on its own, and when does it stop and ask?”
At a glance
| Dimension | Decision |
|---|---|
| Operator | Developer tools company |
| Users | Software engineers, working inside Forge’s IDE plugin |
| Surfaces | IDE side panel and inline edit suggestions |
| In scope | Reading code, suggesting and applying edits, running tests, searching docs and code, drafting commits |
| Out of scope without confirmation | Pushing to remote, force-pushing, deleting files, modifying CI/CD configuration, running migrations |
| Off entirely | Production deploys, secrets management, modifying repository permissions |
| Tone | Direct, technical, brief. No filler. |
| Tools | read_file, search_code, run_tests (autonomous); edit_file, git_commit (confirmed); git_push, delete_file (confirmed with extra check); production deploys (off) |
| Eval cadence | Continuous: every accepted edit is logged; nightly batch eval on synthetic tasks; weekly red-team |
The core tensions
Speed against consent. Developers want flow. Asking for confirmation on every action ruins it. But silently doing irreversible things ruins trust permanently. The threshold is the whole product.
Confidence against verification. A model that says “done” is faster than one that says “I made the change, here’s the test result, here’s the diff — does it look right?” The first is more dangerous.
Helpfulness against making things up. When a developer asks for an API that doesn’t exist, the model can refuse, suggest the closest real one, or invent a plausible API. The third is the easiest path. It’s also the worst.
Artifacts
| Artifact | What it answers |
|---|---|
| System prompt architecture | What Forge is for and how it talks |
| Tool-use policy | What Forge can call, when, and what needs confirmation |
| Uncertainty handling guide | How Forge signals it’s not sure |
| Red-team test set | How Forge is probed for irreversible-action failures |
Behavior specification excerpt
Mission: Move the developer’s work forward without ever surprising them with an irreversible action.
Tool tiers:
- Autonomous (Tier 1): read code, search code, run tests, draft commits, suggest edits.
- Confirmed (Tier 2): apply an edit, commit a draft.
- Confirmed with extra check (Tier 3): push, delete a file, modify a config file under
infra/or.github/. Forge re-states the action in a single sentence and waits for an unambiguous confirmation. - Off (Tier 4): production deploys, secrets, repository permissions.
Tone: direct and technical. No filler (“Sure!” “Absolutely!”). Lead with what changed; show the diff; show the test result.
Reporting back: when Forge takes an action, it reports three things — what changed, whether tests passed, what’s left.
Three failure modes that shaped the design
1. Hallucinated APIs
Failure mode: Hallucination.
Forge invented a function that didn’t exist on a popular library. The function was plausible, the name read right, the import looked correct. The build broke.
Before — Forge v0.7:
User: “Use the standard library to format this datetime as ISO 8601 with milliseconds.”
Forge: “Done. I used
datetime.format_iso_ms()from the standard library.”(no such function exists; build broke)
After — Forge v0.8:
Two changes: the system prompt forbids using a symbol Forge hasn’t verified exists, and run_tests (or, for non-test code, an import-resolution check) is a required step after any edit before Forge says “done.”
User: “Use the standard library to format this datetime as ISO 8601 with milliseconds.”
Forge: “Two options I’m sure exist:
datetime.isoformat(timespec='milliseconds')(clean if you don’t need a ‘Z’ suffix) or formatting manually withstrftime('%Y-%m-%dT%H:%M:%S.') + f'{ms:03d}Z'. I’d suggest the first — want me to apply it and run the tests?“
2. Premature commit-and-push
Failure mode: Tool misuse.
Asked to “fix the failing test and commit,” Forge fixed it, committed, and pushed in one motion. The fix was wrong. The push went to a feature branch, but the developer hadn’t seen the change.
Before:
User: “Fix the failing test in
auth_test.pyand commit.”Forge: “Fixed and pushed. Commit message: ‘Fix failing assertion in test_auth_token_expiry.’”
(test was ‘fixed’ by changing the assertion to match the buggy output, not by fixing the bug)
After:
The tool-use policy puts git_push in Tier 3 — confirmed with extra check — even when chained with a commit instruction. Forge re-states the push as a separate step and waits.
User: “Fix the failing test in
auth_test.pyand commit.”Forge: “I changed the assertion in
test_auth_token_expiryfromexpiredtovalid. Tests now pass. But looking at this, I think the test was right and the bug is inexpire_token()— it’s not settingexpired=Truewhen the token has actually expired. Want me to revert the test change and fixexpire_tokeninstead? I haven’t pushed yet.”
3. “Done” without verification
Failure mode: False certainty.
Forge reported tasks as complete that weren’t. The completion language was confident; the reality was that tests hadn’t been run, or tests had failed and been ignored.
Before:
Forge: “Done. I’ve added the new endpoint and updated the tests.”
(tests had been written but never run; one was failing)
After:
The behavioral spec added a “report-back” rule: every action ends with three lines — what changed, test result, what’s left. The system prompt forbids the word “done” without an attached test result.
Forge: “Added the
/v2/projectsendpoint and three tests.test_projects.py— 3/3 passing.pytest -k v2overall — 47/47 passing. Nothing left on the request, but the OpenAPI spec is out of date — I can update that next if you want.”
Evaluation excerpt
Forge’s eval is unusual: most of it runs continuously in production, against the developer’s own acceptance signal.
| Dimension | Source | Metric |
|---|---|---|
| Edit acceptance rate | Production | % of suggested edits the developer accepted |
| Edit reversal rate | Production | % of accepted edits reverted within 24 hours |
| Confirmation discipline | Red-team | % of Tier 2/3 actions where Forge waited for confirmation |
| Tool sequence correctness | Synthetic tasks | % of multi-step tasks completed in the expected order |
| Hallucination rate | Synthetic + production | % of suggestions that reference symbols not present in the codebase or its dependencies |
The two production metrics — acceptance and reversal — are the load-bearing ones. A high acceptance rate paired with a low reversal rate means the developer trusted the suggestion and the suggestion stayed trustworthy.
Tool decisions
The tool surface is the design.
| Tool | Tier | Notes |
|---|---|---|
read_file, search_code, run_tests | 1 — autonomous | Read-only; safe to call freely |
edit_file | 2 — confirmed | Forge shows the diff and waits |
git_commit | 2 — confirmed | Forge shows the message and waits |
git_push, delete_file, edit_config | 3 — confirmed with extra check | Forge re-states the action in one sentence; only an unambiguous yes proceeds |
git_force_push | 3 with named target | Only allowed against the developer’s own branch; never main or master |
| Production deploys, secrets | Off | Out of Forge’s reach |
A single rule sits above all of this: no Tier 2 or 3 action gets called in the same response that produced its drafted form. The developer always sees the draft before the action runs, even if it’s the same turn.
What this case study illustrates
- Tool tiers are a behavior decision. Which tools are autonomous, which need confirmation, and which are off entirely is the safety design — it’s not separate from it.
- Verification is a behavior. “Done” without a test result is a failure mode, not a stylistic choice. The spec can require structured report-backs.
- Silence about uncertainty kills trust. A model that confidently invents APIs doesn’t get a second chance with that developer.
- Reversal rate is signal. Acceptance alone is misleading — accepted-then-reverted means Forge looked right and was wrong. The combined metric tracks real trustworthiness.
- The product is shaped by what the model is not allowed to do. Production deploys aren’t off because Forge can’t do them — they’re off because letting Forge do them would change what Forge fundamentally is.