Sycophancy | behavior.engineering

Sycophancy is an alignment failure that emerges from RLHF: because human raters tend to prefer responses that agree with them and validate their views, models trained on those preferences learn to be agreeable even at the expense of accuracy. A sycophantic model might change its stated position when a user pushes back, add excessive flattery before answering, or agree with a factually incorrect claim rather than correct it. This is particularly dangerous because it’s subtle — the model seems helpful and pleasant, but it’s actually failing users by telling them what they want to hear rather than what they need to know. For behavior architects, detecting and reducing sycophancy requires explicit evaluation setups that test whether models maintain accurate positions under user pressure.