Experiment tracking is the discipline of recording everything relevant about a model run or behavioral test: which model version was used, what prompt configuration was in place, what dataset was tested, what parameters were set, and what results were produced. Without it, it’s easy to run a promising test, get good results, and then be unable to reproduce the exact conditions that produced them. Tools like MLflow and Weights & Biases, along with many eval platforms, include experiment tracking functionality. For behavior architects, experiment tracking is what makes behavioral iteration cumulative rather than circular: you can compare this week’s test against last month’s, see what changed between them, and build on results rather than constantly re-exploring the same ground.
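
To make the idea concrete, here is a minimal, tool-agnostic sketch of what a tracked run might look like, using only the Python standard library. It is not the API of MLflow, Weights & Biases, or any eval platform. The names `RunRecord`, `log_run`, `load_runs`, and `diff_runs`, and the choice of a JSON Lines file as the log, are all assumptions made for illustration. The record holds the five things listed above (model version, prompt configuration, dataset, parameters, results), plus a timestamp and a short hash of the prompt configuration so that two runs can be compared at a glance.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


@dataclass
class RunRecord:
    """One tracked run. Field names are illustrative, not a standard schema."""

    model_version: str
    prompt_config: dict[str, Any]
    dataset: str
    params: dict[str, Any]
    results: dict[str, float]
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def prompt_hash(self) -> str:
        # Serialize with sorted keys so the same config always hashes the same.
        canonical = json.dumps(self.prompt_config, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def log_run(record: RunRecord, path: Path) -> None:
    """Append one run to a JSON Lines file (one JSON object per line)."""
    row = asdict(record)
    row["prompt_hash"] = record.prompt_hash()
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, sort_keys=True) + "\n")


def load_runs(path: Path) -> list[dict[str, Any]]:
    """Read every logged run back, in the order it was written."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def diff_runs(old: dict[str, Any], new: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return the run conditions that differ, as {field: (old, new)}."""
    keys = ("model_version", "prompt_hash", "dataset", "params")
    return {k: (old[k], new[k]) for k in keys if old[k] != new[k]}
```

In use, you would call `log_run` once per test, then pass two rows from `load_runs` to `diff_runs` to see which conditions changed between, say, last month's run and this week's. An append-only file is about the simplest thing that makes results comparable over time; a dedicated tracking tool adds search, dashboards, and artifact storage on top of the same basic record.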