A/B testing is the practice of running two variants simultaneously, with users randomly split between them, and measuring which one performs better in the real world. In AI product work, you might A/B test a new system prompt against the current one, or a fine-tuned model against the baseline, letting real user interactions surface differences that lab evaluations miss. The key strength of A/B testing is ecological validity: you’re measuring actual user behavior, not proxy scores. The challenges are that it requires real traffic, takes time to accumulate enough data for statistical significance, and can expose some users to a worse experience while the test runs. For behavior architects, A/B testing is most valuable for validating a change before fully committing to it, especially when automated evaluations don’t clearly capture its real-world impact.
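To make the "statistical significance" point concrete, here is a minimal sketch of one common way to compare two variants: a two-sided two-proportion z-test on a binary success metric. The metric (for example, whether a session received a positive rating), the function name `two_proportion_z_test`, and the counts in the usage example are all assumptions made for illustration. Real experiments often need more care, such as fixing the sample size in advance and not stopping the test the first time the p-value dips below a threshold.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (rate_b - rate_a, z statistic, p-value).
    """
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # Every session succeeded or every session failed: no evidence of a difference.
        return rate_b - rate_a, 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, z, p_value


if __name__ == "__main__":
    # Hypothetical counts: variant A is the current system prompt, B is the new one.
    diff, z, p = two_proportion_z_test(100, 1000, 130, 1000)
    print(f"lift={diff:.3f}  z={z:.2f}  p={p:.4f}")
```

A small p-value suggests the observed difference is unlikely to come from chance alone. It says nothing about whether the difference is large enough to matter for the product, so the size of the lift should be read alongside it.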