Honesty | behavior.engineering

Honesty in a language model encompasses several related properties: being truthful (not stating things it believes to be false), being calibrated (not expressing more confidence than warranted), being transparent (not pursuing hidden agendas or deceiving users about its nature), and being non-manipulative (relying on legitimate means like reasoning and evidence rather than exploiting psychological weaknesses). Honesty is harder to enforce than it sounds: a model can create false impressions through technically true but misleading statements, selective emphasis, or confident phrasing that implies certainty it doesn’t have. For behavior architects, building honest behavior means going beyond “don’t lie” to address the full range of ways a model can mislead users.

HHH Framework: A framework developed by Anthropic that identifies Helpful, Harmless, and Honest as the three core properties a well-aligned AI assistant should have.
Calibration: The alignment between a model's expressed confidence and its actual accuracy. A well-calibrated model is appropriately uncertain when it might be wrong.
Sycophancy: A tendency in AI models to agree with users, validate their views, or shift their answers to match what they think the user wants to hear, rather than providing accurate or honest responses.
Hallucination: When a model generates information that sounds plausible but is factually incorrect or entirely fabricated.
Harmlessness: A model's disposition to avoid producing outputs that could cause physical, psychological, social, or other harm to users or third parties.
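
Of the terms above, calibration is the one that lends itself most directly to measurement. As a rough illustration, the sketch below computes expected calibration error (ECE), one common way to quantify the gap between stated confidence and actual accuracy. It is not a method prescribed by this page: the function name `expected_calibration_error`, the `(confidence, correct)` input format, and the choice of ten equal-width bins are all assumptions made for the example.

```python
from typing import List, Tuple


def expected_calibration_error(
    predictions: List[Tuple[float, bool]], n_bins: int = 10
) -> float:
    """Hypothetical helper: ECE over (stated confidence in [0, 1], was_correct) pairs."""
    if not predictions:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins (an assumed binning scheme).
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, correct in predictions:
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    total = len(predictions)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of all predictions.
        ece += (len(bucket) / total) * abs(mean_confidence - accuracy)
    return ece


# A model that says "90% sure" but is right only half the time is overconfident.
overconfident = [(0.9, True), (0.9, False)]
# A model that says "50% sure" and is right half the time is well calibrated.
calibrated = [(0.5, True), (0.5, False)]

print(expected_calibration_error(overconfident))
print(expected_calibration_error(calibrated))
```

A lower score means expressed confidence tracks accuracy more closely. A score like this covers only the calibration facet of honesty; it says nothing about misleading emphasis, sycophancy, or manipulation.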