Glossary
Latency Optimization
Techniques and engineering practices that reduce the time it takes for a model to return a response.
Latency optimization is the engineering practice of making model responses faster — through techniques like caching frequent requests, streaming responses so the user sees the first tokens quickly even while generation continues, reducing the size of the context sent to the model, or choosing a faster but smaller model for latency-sensitive tasks. For user experience, latency is as important as quality: a user waiting more than a few seconds for a response will often abandon the interaction or assume something is broken. For behavior architects, latency optimization intersects with behavioral design in concrete ways — a more complex chain-of-thought reasoning strategy might improve response quality but add unacceptable latency, forcing a tradeoff between behavioral thoroughness and speed.