Capability Evaluation | behavior.engineering

Capability evaluation asks “What can this model actually do?” rather than “How well does it behave?”, though the two questions are related. Capability evaluations probe a model’s performance across tasks and domains: Can it write code? Translate between languages? Reason about math problems? Understand complex instructions? The results define the ceiling of what is possible for a product built on the model. For behavior architects, capability evaluations provide essential context for behavioral design: a behavioral strategy that depends on a capability the model does not reliably have will fail in ways that look like behavioral problems but are actually capability limitations. Distinguishing between the two helps teams invest in the right fix, for example choosing a more capable model or narrowing the task rather than repeatedly reworking the behavioral design.
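One way to make the capability-versus-behavior distinction concrete is to measure pass rates per domain before designing behavior around a capability. The following is a minimal sketch, not a prescribed method: `capability_report`, `unreliable_domains`, the stub model, the three test cases, and the 0.9 threshold are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical eval case: (domain, prompt, checker that returns True on a pass).
EvalCase = tuple[str, str, Callable[[str], bool]]


def capability_report(
    model: Callable[[str], str], cases: list[EvalCase]
) -> dict[str, float]:
    """Run every case through the model and return the pass rate per domain."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for domain, prompt, check in cases:
        total[domain] += 1
        if check(model(prompt)):
            passed[domain] += 1
    return {domain: passed[domain] / total[domain] for domain in total}


def unreliable_domains(report: dict[str, float], threshold: float = 0.9) -> list[str]:
    """List domains whose pass rate falls below the (assumed) reliability threshold."""
    return sorted(domain for domain, rate in report.items() if rate < threshold)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; returns canned answers for the demo."""
    answers = {
        "2 + 2 = ?": "4",
        "3 * 3 = ?": "9",
        "Translate 'hello' to French": "hola",  # deliberately wrong
    }
    return answers.get(prompt, "")


cases: list[EvalCase] = [
    ("math", "2 + 2 = ?", lambda out: out.strip() == "4"),
    ("math", "3 * 3 = ?", lambda out: out.strip() == "9"),
    ("translation", "Translate 'hello' to French",
     lambda out: out.strip().lower() == "bonjour"),
]

report = capability_report(stub_model, cases)
print(report)
print(unreliable_domains(report))
```

In this toy setup, a product behavior that relies on translation would fail no matter how the behavior is specified, because the underlying capability is missing. A real evaluation would need many more cases per domain and a threshold chosen for the product's risk tolerance.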