Glossary
184 key terms in model behavior architecture, defined for practitioners at every level.
A
- A/B Testing
- Comparing two versions of a model or prompt by splitting traffic between them and measuring outcomes on real users.
- Ablation Study
- An experiment that systematically removes or varies individual components of a system to understand each component's contribution to overall performance.
- Adversarial Examples
- Inputs carefully designed to cause a model to fail or produce unintended outputs, often by exploiting specific vulnerabilities in its training or architecture.
- Adversarial Prompting
- Crafting inputs specifically designed to cause a model to behave in unintended, harmful, or policy-violating ways.
- AI Ethics
- The study and application of ethical principles — fairness, accountability, transparency, harm avoidance — to the design and deployment of AI systems.
- AI Governance
- The frameworks, policies, processes, and oversight mechanisms that guide how AI is developed, deployed, and monitored within an organization or across society.
- AI Observability
- The ability to monitor, inspect, and understand what an AI system is doing in production — including inputs, outputs, errors, and behavioral patterns over time.
- AI Safety
- The field concerned with ensuring that AI systems behave in ways that are safe, controllable, and aligned with human values — especially as they become more capable.
- Alignment Science
- The research field focused on developing methods, theories, and techniques for making AI systems reliably pursue intended goals and values.
- Alignment Tax
- The performance costs a model may incur when trained to be safer or more value-aligned, such as reduced capability or increased refusals.
- Annotation
- The process of adding labels, ratings, or structured information to data so a model can learn from it.
- Annotation Platform
- Software designed to help teams collect, manage, and quality-control labeled data from human raters.
- API Integration
- Connecting an AI model to a product or system through a programming interface, allowing it to send requests and receive model responses programmatically.
- Applied Ethics
- The application of ethical theory and principles to specific real-world domains and practical decisions.
- Applied Finetuning
- The practice of finetuning models in a product or business context to improve behavior for a specific use case, distinct from academic or research finetuning.
- Automated Evaluation
- Using software, scripts, or models to score outputs without requiring human review for each individual case.
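The A/B Testing entry above describes measuring outcomes across two traffic splits. As a minimal sketch of one way to judge whether the difference is larger than chance, the snippet below applies a two-proportion z-test. The choice of test and the function name `two_proportion_z_test` are assumptions made for illustration, not something the glossary prescribes.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two_sided_p) for the difference in success rates, B minus A.

    Assumes both groups are non-empty and the pooled rate is strictly
    between 0 and 1 (otherwise the standard error is zero).
    """
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled success rate under the null hypothesis that A and B are identical.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal CDF, computed with math.erf.
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value


# Hypothetical counts: 400 of 1000 users rated variant A helpful, 460 of 1000 for B.
z, p = two_proportion_z_test(400, 1000, 460, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the two variants really differ on the chosen outcome; it says nothing about whether that outcome is the right one to optimize.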
B
- Batch Testing
- Running a large set of prompts through a model in a single run rather than one at a time, to evaluate behavior at scale.
- Behavior Design
- The practice of intentionally defining and shaping how an AI model acts across a range of situations, rather than leaving behavior to emerge by default.
- Behavioral Audit
- A systematic review of a model's behavior across a defined set of scenarios to assess whether it meets expected standards.
- Behavioral Consistency
- The degree to which a model produces similar outputs for similar inputs across different sessions, users, or contexts.
- Behavioral Drift
- A gradual, unintended change in how a model behaves over time, often across model updates, prompt changes, or accumulating context.
- Behavioral Economics
- A field that combines psychology and economics to study how real people make decisions, often in ways that deviate from purely rational models.
- Behavioral Regression
- When a model update or change causes behavior that was previously working correctly to degrade or break.
- Behavioral Specification
- A written document or set of guidelines that defines how a model is expected to behave across different situations.
- Behavioral Taxonomy
- A hierarchical classification system that organizes model behaviors into categories and subcategories for analysis, evaluation, and communication.
- Benchmark
- A standardized test or dataset used to compare model performance across versions or across different models.
- Bias Detection
- The process of identifying systematic patterns in model behavior that produce unfair or unequal outcomes across groups of people.
- Boundary Exploration
- The systematic practice of probing the edges of a model's behavioral constraints to understand where and how its policies apply.
- Boundary Setting
- The practice of defining and communicating the limits of what a model will do — what it considers out of scope, inappropriate, or harmful.
- Braintrust
- An evaluation and observability platform for AI products that focuses on running experiments, logging traces, and comparing prompt or model variants.
C
- Calibration
- The alignment between a model's expressed confidence and its actual accuracy — a well-calibrated model is appropriately uncertain when it might be wrong.
- Capability Evaluation
- Assessment of what a model is able to do — the range and level of tasks it can perform successfully under defined conditions.
- Categorical Thinking
- Reasoning by placing things into defined categories with clear rules, rather than reasoning about each case on its individual merits.
- Chain-of-Thought Prompting
- A prompting technique that encourages a model to work through a problem step by step before producing a final answer.
- ChainForge
- An open-source visual tool for testing and comparing LLM prompts and responses across models and configurations.
- Character Consistency
- The degree to which a model maintains a stable persona, voice, and set of values across different conversations and contexts.
- Chatbot Arena
- A public platform developed by LMSYS where users compare responses from anonymous AI models side by side and vote for the better one, generating a crowd-sourced Elo leaderboard.
- Cognitive Bias
- Systematic patterns in human thinking that lead to errors in judgment, often unconsciously — including biases that can affect annotation, evaluation, and behavioral design.
- Confabulation
- A more precise term for hallucination, emphasizing that the model generates plausible-sounding but false information to fill gaps rather than lying intentionally.
- Confidence
- The degree of certainty a model expresses — or implies — when producing an output.
- Consequentialism
- A moral framework that judges actions by their outcomes — the right action is the one that produces the best overall results.
- Constitutional AI
- An approach developed by Anthropic where a model is trained to critique and revise its own outputs based on a written set of principles.
- Content Policy
- A documented set of rules governing what types of content a model will and will not produce.
- Context Engineering
- The broader practice of designing what information a model has access to at inference time — including instructions, memory, tools, and retrieved content.
- Context Strategy
- A deliberate plan for what information to include in a model's context window, how to structure it, and what to exclude given space and quality constraints.
- Context Window
- The maximum amount of text — measured in tokens — that a model can read and reason over in a single interaction.
- Conversation Transcripts
- Records of full multi-turn interactions between users and a model, used to analyze behavior in context.
- Cost Benchmarking
- Measuring and comparing the financial cost of running a model across different providers, configurations, or usage patterns.
- Coverage Matrix
- A structured map that shows which behavioral scenarios, use cases, or risk categories are represented in an evaluation suite, and which are missing.
- Cross-Functional Collaboration
- Working across organizational functions — engineering, design, policy, research, safety — to align on behavioral goals and coordinate the work needed to achieve them.
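The Calibration entry above compares expressed confidence with actual accuracy. One common way to put a number on that gap is expected calibration error (ECE), sketched below. The equal-width binning scheme, the default of ten bins, and the function name are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1], one per model answer.
    correct: booleans, True where the matching answer was right.
    """
    total = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Each bin contributes in proportion to how many answers fall in it.
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical data: the model claims certainty on four answers but gets two right.
print(expected_calibration_error([1.0, 1.0, 1.0, 1.0], [True, True, False, False]))
```

A value near 0 means stated confidence tracks accuracy closely; larger values indicate over- or under-confidence.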
D
- Data Generation Pipelines
- Automated systems for producing, filtering, and formatting training or evaluation data at scale.
- Data Guideline Authorship
- Writing clear, detailed instructions for annotators that define what good and bad model responses look like and how to evaluate them consistently.
- Data Pipeline
- A series of automated steps that collect, process, transform, and deliver data from one system to another.
- Data Quality
- The degree to which training or evaluation data is accurate, consistent, relevant, and representative of the desired behavior.
- Data Versioning
- The practice of tracking changes to datasets over time so that training and evaluation runs are reproducible and changes can be traced.
- Decision-Making Frameworks
- Structured approaches for reasoning through complex choices, especially when values conflict or outcomes are uncertain.
- Deontological Ethics
- A moral framework that judges actions as right or wrong based on rules and duties, regardless of their consequences.
- Discourse Structure
- The organization and coherence of language across multiple sentences or turns — how ideas connect, flow, and build on each other in extended text or conversation.
- Domain Evaluation
- Evaluating a model's performance specifically within a defined subject area or application context, rather than in general.
- Dual-Use Risk
- The risk that a capability or piece of information provided by a model could be used both for legitimate purposes and to cause harm.
E
- Edge Case Behavior
- How a model responds to unusual, ambiguous, or boundary-pushing inputs that fall outside the common range of expected use.
- Edge Case Construction
- The deliberate process of designing inputs that test a model's behavior at the boundaries of expected usage.
- Edge Case Testing
- Evaluating model behavior on unusual, extreme, or boundary-pushing inputs that are unlikely but consequential when they occur.
- Elo Ranking
- A system, borrowed from competitive chess, for rating models or outputs based on the results of head-to-head comparisons; each result shifts both ratings according to how unexpected the outcome was.
- Empirical Research
- Research based on observation and evidence rather than theory alone — drawing conclusions from data collected through experiments or systematic observation.
- Ethical Judgment
- The capacity to reason through situations where values conflict, weigh competing interests, and arrive at a principled decision about what is right.
- Eval Framework
- A structured approach or set of tools for designing, running, and interpreting model evaluations.
- Eval Platform
- Software infrastructure for running, tracking, and comparing model evaluations systematically.
- Eval Suite
- A comprehensive, organized collection of evaluation datasets and test cases that together cover the full range of behavioral requirements for a model.
- Evaluation Dataset
- A curated collection of inputs and expected outputs used to measure model performance in a consistent and repeatable way.
- Evaluation Pipeline
- An end-to-end system for consistently measuring model behavior across a defined set of inputs and criteria.
- Experiment Tracking
- Recording the configuration, inputs, and outcomes of model experiments so results can be compared and reproduced.
- Experimental Design
- The planning of a study or test — defining variables, controls, sample sizes, and measurement methods — so that results are valid and interpretable.
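The Elo Ranking entry above can be illustrated with the standard chess-style update rule. This is a minimal sketch: the starting rating of 1000 and the K-factor of 32 are assumed values, and public leaderboards may use different constants or a different fitting procedure altogether.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; model A wins one comparison.
print(update_elo(1000, 1000, 1.0))
```

An upset (a low-rated model beating a high-rated one) moves both ratings more than an expected result does.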
F
- Failure Mode
- A specific, recurring pattern in which a model produces incorrect, harmful, or otherwise unacceptable outputs.
- Failure Mode Analysis
- A systematic process for identifying, categorizing, and understanding the ways a model can behave incorrectly or harmfully.
- Failure Taxonomy
- An organized classification system that categorizes the different ways a model can fail, enabling systematic tracking and prioritization.
- Fairness
- The property of a model treating different individuals and groups equitably and without unjustified discrimination.
- Few-Shot Prompting
- Providing a model with a small number of examples of the desired input-output pattern before asking it to complete a new task.
- Finetuning
- Further training a model on a specific dataset to adjust its behavior, style, or knowledge for a particular purpose.
- Format Adherence
- A model's ability to consistently follow specified output formats, such as JSON, markdown, bullet lists, or length constraints.
- Freeplay
- An AI product development platform designed for testing prompts, managing model configurations, and iterating on AI features collaboratively.
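The Format Adherence entry above lends itself to an automated check. The sketch below assumes the specified format is a JSON object with a set of required keys; the helper names and the sample outputs are hypothetical.

```python
import json


def check_json_format(output, required_keys):
    """Return (ok, reason) for a single model output."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as err:
        return False, f"invalid JSON: {err.msg}"
    if not isinstance(parsed, dict):
        return False, "top-level value is not an object"
    missing = [key for key in required_keys if key not in parsed]
    if missing:
        return False, f"missing keys: {missing}"
    return True, "ok"


def adherence_rate(outputs, required_keys):
    """Fraction of outputs that satisfy the format check."""
    passed = sum(1 for out in outputs if check_json_format(out, required_keys)[0])
    return passed / len(outputs)


# Hypothetical model outputs for a prompt that asked for JSON only.
samples = [
    '{"answer": "yes", "confidence": 0.8}',   # passes
    'Sure! {"answer": "yes"}',                # extra prose breaks parsing
    '{"answer": "no"}',                       # valid JSON, missing a key
]
print(adherence_rate(samples, ["answer", "confidence"]))
```

Tracking this rate across model or prompt versions is one way to spot format regressions early.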
G
- Goodhart's Law
- The principle that when a measure becomes a target, it ceases to be a good measure.
- Ground Truth
- The accepted correct answer or label for a data example, used as the standard against which model outputs are measured.
H
- Hallucination
- When a model generates information that sounds plausible but is factually incorrect or entirely fabricated.
- Harm Avoidance
- The practice of designing model behavior to minimize the risk of producing outputs that cause physical, psychological, social, or financial harm.
- Harmlessness
- A model's disposition to avoid producing outputs that could cause physical, psychological, social, or other harm to users or third parties.
- Hedging
- The use of qualifying language — like "it depends," "I'm not sure," or "you may want to consult an expert" — to soften or add uncertainty to a model's response.
- Helpfulness
- A model's ability to genuinely assist users in accomplishing their goals in a way that's accurate, clear, and appropriately complete.
- HHH Framework
- A framework developed by Anthropic that identifies Helpful, Harmless, and Honest as the three core properties a well-aligned AI assistant should have.
- Honesty
- A model's disposition to tell the truth, accurately represent its uncertainty, and avoid creating false impressions in users' minds.
- Human Evaluation
- Assessment of model outputs by people, used to measure quality dimensions that automated systems can't reliably capture.
- Human Feedback
- Ratings, comparisons, or corrections provided by people that are used to guide model training and improve behavior.
- Hypothesis Testing
- A research method for evaluating whether observed data supports a specific claim, by defining a hypothesis and testing it against evidence.
I
- In-Context Learning
- A model's ability to adapt its behavior or improve at a task based on examples and information provided in the prompt, without any change to its underlying weights.
- Inference Infrastructure
- The systems and compute resources that host a deployed model and serve its responses to users in production.
- Instruction Ambiguity
- The quality of an instruction or prompt that allows for multiple reasonable interpretations, potentially leading to inconsistent or incorrect model responses.
- Instruction Following
- A model's ability to accurately understand and comply with the directions given to it in a prompt.
- Inter-Annotator Agreement
- A measure of how consistently different human raters label or evaluate the same data, used to assess annotation reliability.
- Issue Reproduction
- The process of reliably recreating a reported behavioral failure so it can be analyzed and fixed.
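The Inter-Annotator Agreement entry above is often quantified with Cohen's kappa, which corrects raw agreement between two raters for the agreement expected by chance. The sketch below is one such measure among several; the function name and the example labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


rater_1 = ["good", "good", "bad", "bad"]
rater_2 = ["good", "bad", "bad", "bad"]
print(cohens_kappa(rater_1, rater_2))
```

A kappa of 1 means perfect agreement and a kappa near 0 means agreement no better than chance, which usually signals unclear annotation guidelines.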
J
- Jailbreaking
- Techniques users employ to get a model to bypass its safety guidelines and produce outputs it has been trained or instructed not to produce.
K
- KL Divergence
- A measure of how different one probability distribution is from another, used in model training to keep updated behavior from drifting too far from the original.
- Knowledge Sharing
- The practice of systematically documenting and distributing behavioral insights, findings, and lessons learned across teams and disciplines.
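The KL Divergence entry above can be made concrete for discrete distributions, where KL(P || Q) is the sum over outcomes of P(x) * log(P(x) / Q(x)). The sketch below computes it in nats. The two example distributions are hypothetical next-token probabilities, not values from any real model.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same outcomes."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # terms with P(x) = 0 contribute nothing
        if q_i == 0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical next-token probabilities before and after an update.
original = [0.5, 0.3, 0.2]
updated = [0.6, 0.3, 0.1]
print(kl_divergence(updated, original))
print(kl_divergence(original, updated))  # a different value: KL is not symmetric
```

Because the measure is asymmetric, the order of the arguments matters when it is used as a penalty during training.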
L
- Label Studio
- An open-source data labeling platform that supports annotation for text, images, audio, and other modalities used in AI training.
- Labeling
- Assigning categories, tags, or classifications to data examples to indicate what they represent or how a model should treat them.
- LangSmith
- A platform from LangChain for tracing, evaluating, and monitoring LLM applications in development and production.
- Latency Benchmarking
- Measuring how quickly a model produces responses under various conditions to evaluate its suitability for real-time use.
- Latency Optimization
- Techniques and engineering practices that reduce the time it takes for a model to return a response.
- Linguistic Analysis
- The systematic study of language features in model outputs — such as vocabulary, syntax, tone, and discourse structure — to understand behavioral patterns.
- Literature Review
- A systematic survey of existing research and writing on a topic to understand what is already known before designing new work.
- LLM-as-Judge
- Using a language model to evaluate the quality of another model's outputs, often as a scalable alternative to human review.
- Log Analysis
- Reviewing records of model interactions to identify patterns, failures, and opportunities for improvement.
M
- Meta-Prompt
- A prompt designed to generate or improve other prompts, rather than directly produce the final task output.
- Model Behavior
- The observable outputs, responses, and actions of an AI model as experienced by users and systems interacting with it.
- Model Card
- A standardized document that describes a model's intended use cases, known limitations, evaluation results, and potential risks.
- Model Comparison
- The systematic evaluation of two or more models against the same criteria to understand their relative strengths, weaknesses, and behavioral tradeoffs.
- Model Judgment
- A model's ability to reason through ambiguous or novel situations and arrive at contextually appropriate decisions without explicit instructions for every case.
- Model Launch Support
- The behavioral work required to prepare a model for public release — including pre-launch evaluation, policy review, red-teaming, and documentation.
- Model Playground
- An interactive interface for experimenting with model prompts, settings, and outputs without building a full application.
- Model Quality
- The overall degree to which a model meets the behavioral, accuracy, and user experience standards required for its intended use.
- Model Registry
- A centralized repository that stores, versions, and tracks deployed model artifacts and their associated metadata.
- Model Rollout
- The process of gradually deploying a new model version or behavioral update to users, typically in staged phases to monitor for unexpected issues.
- Model Strategy
- The deliberate plan for which AI models to use, how to configure them, and how to sequence improvements to achieve product and organizational goals.
- Monitoring
- Ongoing observation of a deployed model's behavior over time to detect problems, measure quality, and track changes.
- Moral Philosophy
- The branch of philosophy concerned with questions about what is right and wrong, what we owe each other, and how to live a good life.
- Moral Psychology
- The scientific study of how people actually form moral judgments, make ethical decisions, and reason about right and wrong.
N
- Normative Ethics
- The branch of moral philosophy concerned with establishing principles and frameworks for determining right and wrong action.
O
- Output Distribution
- The range and relative frequency of different types of responses a model produces across a given set of inputs.
- Output Quality
- How well a model's response meets the requirements of accuracy, helpfulness, appropriateness, and format for a given task.
P
- Policy Gradient
- A family of reinforcement learning algorithms that improve a model's behavior by adjusting the probability of actions that lead to higher rewards.
- Policy Writing
- The craft of authoring clear, principled documents that define what a model will and won't do, and why — serving as guidance for training, evaluation, and deployment decisions.
- Pragmatics
- The branch of linguistics concerned with how context shapes meaning — what people communicate beyond the literal content of their words.
- Preference Learning
- A training approach where models learn from comparisons between outputs rather than from single labeled examples.
- Preference-Based Evaluation
- An evaluation method where raters compare outputs and indicate which they prefer, rather than scoring each output independently.
- Product Design Collaboration
- The working relationship between behavior architects and product designers to ensure that model behavior and user experience are designed in alignment with each other.
- Production Data
- Real data generated from actual user interactions with a deployed model, as opposed to synthetic or evaluation data.
- Prompt Chaining
- A technique where the output of one model call is fed as input into a subsequent call, breaking a complex task into sequential steps.
- Prompt Engineering
- The practice of crafting and refining the text given to a model — instructions, examples, context — to reliably produce desired outputs.
- Prompt Injection
- An attack where malicious instructions embedded in user input or external content override a model's intended behavior.
- Prompt Robustness
- The degree to which a prompt continues to produce reliable, appropriate outputs even when inputs vary, are ambiguous, or are adversarial.
- Prompt Sensitivity
- The degree to which small changes in wording, format, or phrasing affect model outputs in significant ways.
- Promptfoo
- An open-source tool for testing and evaluating LLM outputs, focused on comparing prompts and detecting regressions.
- Prototyping
- Building quick, low-fidelity versions of a behavioral design to test assumptions and learn before committing to a full implementation.
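The Prompt Chaining entry above describes feeding one call's output into the next call. A minimal sketch of that control flow follows. `fake_model` is a hypothetical stand-in that simply upper-cases its prompt so the example runs without any provider SDK; a real chain would pass in a function that calls an actual model.

```python
def run_chain(templates, call_model, initial_input):
    """Run each prompt template in order, passing each output to the next step.

    templates: strings containing an {input} placeholder.
    call_model: any function that maps a prompt string to a response string.
    Returns the final output and a trace of (prompt, output) pairs.
    """
    current = initial_input
    trace = []
    for template in templates:
        prompt = template.format(input=current)
        current = call_model(prompt)
        trace.append((prompt, current))
    return current, trace


def fake_model(prompt):
    # Hypothetical stand-in for a real model call.
    return prompt.upper()


steps = ["Summarize: {input}", "Translate: {input}"]
final, trace = run_chain(steps, fake_model, "hello")
print(final)
```

Keeping the trace makes it easier to see which step introduced a problem when the final output is wrong.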
Q
- Qualitative Analysis
- Close, interpretive review of model outputs and interactions to understand nuanced behavioral patterns that numbers alone don't capture.
- Qualitative Research
- A research approach that seeks to understand phenomena through detailed, interpretive analysis rather than numerical measurement.
- Quality Metrics
- Specific, measurable indicators used to assess how well a model is performing across dimensions that matter for a given use case.
- Quantitative Metrics
- Numerical measurements used to track model performance across dimensions like accuracy, refusal rate, response length, or user satisfaction.
- Quantitative Research
- A research approach that measures phenomena numerically and uses statistical analysis to identify patterns and test hypotheses.
R
- Reasoning Chain
- The sequence of intermediate steps a model uses to work through a problem before arriving at a final answer.
- Red-Teaming
- Deliberately attempting to find failure modes, safety vulnerabilities, and policy violations in a model by acting as an adversarial user.
- Refusal Behavior
- The patterns and decisions behind when and how a model declines to fulfill a request.
- Regression Identification
- The process of detecting when a change to a model or system has caused previously acceptable behavior to degrade.
- Regression Testing
- Running a consistent set of test cases after a change to verify that previously working behavior hasn't broken.
- Reinforcement Learning
- A machine learning approach where a model learns by receiving rewards or penalties based on the quality of its actions.
- Research Iteration
- The cyclical process of formulating questions, running experiments, analyzing results, and using findings to inform the next round of investigation.
- Responsible AI
- A framework and practice for developing and deploying AI systems in ways that are safe, fair, transparent, and accountable.
- Reward Hacking
- When a model finds ways to score well on a reward signal without actually achieving the underlying goal the reward was meant to measure.
- Reward Modeling
- Training a separate model to predict human preferences so it can be used to score outputs during reinforcement learning.
- RLAIF (Reinforcement Learning from AI Feedback)
- A variation of RLHF where another AI model provides the preference judgments instead of human raters.
- RLHF (Reinforcement Learning from Human Feedback)
- A way of training AI models by having humans rate or compare outputs, then using those ratings to reinforce better behavior over time.
- Root Cause Analysis
- A structured investigation to identify the underlying reason a failure occurred, rather than treating only its surface-level symptoms.
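The Regression Testing and Regression Identification entries above amount to rerunning a fixed set of cases after a change and flagging those that used to pass but no longer do. The sketch below shows that comparison. The case structure (`id`, `prompt`, `check`) and the stubbed candidate model are assumptions made for illustration.

```python
def find_regressions(test_cases, baseline_results, run_candidate):
    """Return ids of cases that passed on the baseline but fail on the candidate.

    test_cases: dicts with "id", "prompt", and a "check" function (output -> bool).
    baseline_results: dict mapping case id to whether the baseline passed.
    run_candidate: function mapping a prompt to the candidate's output.
    """
    regressions = []
    for case in test_cases:
        output = run_candidate(case["prompt"])
        passed_now = case["check"](output)
        if baseline_results.get(case["id"], False) and not passed_now:
            regressions.append(case["id"])
    return regressions


cases = [
    {"id": "greet", "prompt": "Say hi", "check": lambda out: "hi" in out.lower()},
    {"id": "json", "prompt": "Return JSON", "check": lambda out: out.strip().startswith("{")},
]
baseline = {"greet": True, "json": True}


def candidate(prompt):
    # Hypothetical updated model that now wraps JSON in extra prose.
    return "Hi there" if prompt == "Say hi" else "Here is the JSON: {}"


print(find_regressions(cases, baseline, candidate))
```

Cases that failed on the baseline are deliberately ignored here, since a regression by definition concerns behavior that previously worked.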
S
- Semantics
- The study of meaning in language — what words, phrases, and sentences refer to and signify.
- Semi-Automated Evaluation
- An evaluation approach that combines automated scoring with human review at key decision points.
- Sensitive Topics
- Subject areas that require extra care in handling because of their potential to cause harm, offense, or controversy — such as mental health, politics, religion, and violence.
- Signal-to-Noise Analysis
- The practice of separating meaningful patterns or indicators of behavioral problems from irrelevant or random variation in data.
- Stakeholder Communication
- Translating behavioral findings, tradeoffs, and recommendations into clear language for audiences outside the behavior team, including leadership, engineering, legal, and product.
- Steerability
- How easily and reliably a model's behavior can be adjusted through prompts, instructions, or system-level configuration.
- Supervised Finetuning
- A type of finetuning where the model learns from a dataset of input-output pairs that represent the desired behavior.
- Sycophancy
- A tendency in AI models to agree with users, validate their views, or shift answers toward what the user seems to want to hear, rather than providing accurate or honest responses.
- Synthetic Data
- Training or evaluation data generated by a model rather than collected from real human interactions.
- Synthetic Testing Environment
- A controlled, artificially constructed environment for evaluating model behavior, separate from real production usage.
- System Prompt
- Instructions provided to a model at the start of a session, before any user input, that establish its role, behavior, and constraints.
T
- Task Decomposition
- Breaking a complex task into smaller, more manageable subtasks that can be addressed sequentially or in parallel.
- Test-Driven Development (for AI)
- An approach to AI product development where evaluation criteria and test cases are defined before prompts or models are changed, so success can be measured objectively.
- Tone
- The emotional register and relational quality of a model's responses — whether it comes across as warm, formal, playful, cautious, authoritative, and so on.
- Tool Prompt
- Instructions that describe available tools or functions to a model, telling it when and how to use them.
- Training Signal
- Any information fed back to a model during training to indicate whether its behavior is on the right track.
- Trust and Safety
- The organizational function responsible for protecting users and the platform from harm — including abuse, policy violations, and misuse of AI capabilities.
- Trust and Safety Team
- The organizational team responsible for detecting, preventing, and responding to harmful or policy-violating uses of an AI product.
U
- Usage Policy
- A set of rules, typically broader than a content policy, governing how a model or AI product may and may not be used, often focused on prohibited applications rather than individual outputs.
- User Feedback
- Explicit or implicit signals from users that indicate whether they found a model response helpful, harmful, or otherwise notable.
V
- Value Alignment
- The degree to which a model's behavior reflects human values, intentions, and goals rather than optimizing for narrow objectives that miss the point.
- Verbosity
- The tendency of a model to produce responses that are longer than necessary for the task at hand.
- Virtue Ethics
- A moral framework that focuses on character rather than rules or outcomes — asking what a person of good character would do.
Z
- Zero-Shot Prompting
- Asking a model to complete a task using only instructions, with no examples provided.