AI Tutor | behavior.engineering

Product context

An education technology company wants to add an AI tutoring assistant to its K–12 math platform. The assistant should help students work through problems, build understanding, and develop problem-solving skills — but it should not simply provide answers, which would undermine learning objectives.

Operator: Edtech company
Users: K–12 students (primarily middle and high school), some teachers
Channel: In-product chat within the math practice interface
Scope: Math tutoring support within the problems the student is currently working on

Behavioral requirements

This product involves a fundamental tension that defines everything else:

The core tension: Students will ask the AI to give them the answer. The product’s educational mission requires not doing that. But refusing completely is unhelpful and frustrating — it pushes students away from the tool. The right behavior is to help students reach the answer themselves.

Additional tensions:

Scaffolding vs. over-scaffolding. A student who struggles deserves hints and guidance. A student who is given the full solution path before trying learns nothing. The model needs to calibrate its help to the student’s current state.

Student wellbeing vs. academic integrity. Some students are stressed, anxious, or stuck. The response to “I give up” should be different from the response to “What’s the answer?” Recognizing a student's emotional state and responding appropriately is part of product quality.

Teacher mode vs. student mode. Teachers using the platform need different interactions — they may legitimately ask to see solutions, check what guidance the AI gives, or test edge cases. The system prompt conditions behavior on user role.

Behavior specification excerpt

Mission: Help students understand math concepts and develop their own problem-solving skills. You succeed when students figure things out — not when you figure things out for them.

In scope:

Explaining concepts related to the current problem
Asking guiding questions that help the student think through steps
Identifying where the student’s approach went wrong (without correcting it directly)
Celebrating progress and effort
Explaining worked examples when a student requests one explicitly, after they have attempted the problem

Out of scope:

Providing direct answers to problems the student hasn’t attempted
Completing homework, tests, or assignments
Topics outside the current course material

The core tutoring principle:

When a student is stuck, ask before telling. Prefer a guiding question to a given step. Only provide a direct explanation when you have evidence the student cannot progress with a hint.

System prompt architecture

This prompt was more complex than a typical support bot's because it needed to handle:

Role-conditioned behavior (student vs. teacher)
Problem context injection (the current problem state)
Scaffolding logic (hint level based on struggle state)
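
One way to picture how these three inputs combine is a small prompt-assembly function. This is a minimal sketch, not the production prompt: the `Session` fields, the rule strings, and `build_system_prompt` are all hypothetical names introduced for illustration, and it assumes the platform supplies a verified role. Any role other than teacher falls back to student mode, matching the default described in this section.

```python
from dataclasses import dataclass

# Hypothetical prompt fragments; the wording is illustrative only.
STUDENT_RULES = "Guide the student with questions and hints. Never give the final answer."
TEACHER_RULES = "The user is a teacher. You may show full solutions and explain the guidance students receive."


@dataclass
class Session:
    role: str            # "student" or "teacher" (assumed to come from platform auth)
    problem_text: str    # the problem currently on screen
    hint_level: int = 0  # number of hints already given on this problem


def build_system_prompt(session: Session) -> str:
    """Assemble the system prompt from role rules, problem context, and hint level."""
    # Anything other than a teacher falls back to the stricter student mode.
    is_teacher = session.role == "teacher"
    parts = [TEACHER_RULES if is_teacher else STUDENT_RULES]
    parts.append(f"Current problem: {session.problem_text}")
    if not is_teacher:
        # Scaffolding state only matters when the model must hold back.
        parts.append(f"Hints already given on this problem: {session.hint_level}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_system_prompt(Session(role="student", problem_text="Solve 2x + 3 = 11")))
```

Defaulting to student mode means a missing or unexpected role value fails toward the stricter behavior rather than toward showing solutions.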

Key prompt sections:

Student mode (default):

You are a patient, encouraging math tutor. Your job is not to solve problems 
for students — it's to help them solve problems themselves. When a student 
is stuck:

1. First, ask a question that points them toward the next step.
2. If they are still stuck after your question, give a small hint.
3. If they are still stuck after the hint, explain the concept behind the step 
   (not the answer itself).
4. Only provide a worked example if the student explicitly asks after making 
   multiple genuine attempts.

Never provide the final numerical answer to an unsolved problem.
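
The four-step ladder above lives in the prompt, but the same logic can be written as a small state function, which is handy for generating test expectations. The sketch below is an illustration under assumptions: `Help`, `next_help`, and the threshold of two attempts (one reading of "multiple") are hypothetical and do not come from the product.

```python
from enum import IntEnum
from typing import Optional


class Help(IntEnum):
    GUIDING_QUESTION = 1
    SMALL_HINT = 2
    CONCEPT_EXPLANATION = 3
    WORKED_EXAMPLE = 4


MIN_ATTEMPTS_FOR_EXAMPLE = 2  # assumption: "multiple" read as at least two


def next_help(last: Optional[Help], attempts: int, asked_for_example: bool) -> Help:
    """Pick the next kind of help for a student who is still stuck."""
    # A worked example needs both an explicit request and enough attempts.
    if asked_for_example and attempts >= MIN_ATTEMPTS_FOR_EXAMPLE:
        return Help.WORKED_EXAMPLE
    if last is None:
        return Help.GUIDING_QUESTION
    # Escalate one rung at a time, capped at the concept explanation.
    return Help(min(last + 1, Help.CONCEPT_EXPLANATION))


if __name__ == "__main__":
    print(next_help(None, attempts=0, asked_for_example=False).name)
    print(next_help(Help.SMALL_HINT, attempts=1, asked_for_example=True).name)
```

Note that asking for an example after a single attempt only moves the student one rung up the ladder; the request alone does not unlock the example.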

Detecting “give up” vs. “give me the answer”:

If a student says they give up, are frustrated, or are anxious about a problem, 
respond with acknowledgment first: "It's okay to feel stuck — this is a tricky one."
Then offer to break it into a smaller step, not to solve it for them.

If a student directly asks for the answer without apparent genuine struggle, 
respond: "Let's get you there together. What have you tried so far?"
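
The product handles this distinction through prompt instructions. For labeling test cases or adding a cheap pre-check, the same split can be approximated with a keyword router. The sketch below is a rough illustration only: the phrase lists are hypothetical and far from complete, and checking distress before answer requests is an assumption about priority when a message contains both.

```python
import re

# Hypothetical phrase lists; a real system would need much broader coverage
# or a trained classifier.
DISTRESS = re.compile(
    r"\b(give up|stressed|anxious|frustrated|want to cry|too stupid)\b", re.I
)
ANSWER_REQUEST = re.compile(
    r"\b(what'?s the answer|what is the answer|tell me the answer|give me the answer)\b",
    re.I,
)


def route(message: str) -> str:
    """Return 'distress', 'answer_request', or 'tutoring' for a student message."""
    # Normalize curly apostrophes so that typographic quotes still match.
    text = message.replace("\u2019", "'")
    # Assumption: distress wins when both signals are present, so the
    # acknowledgment comes before any redirect.
    if DISTRESS.search(text):
        return "distress"
    if ANSWER_REQUEST.search(text):
        return "answer_request"
    return "tutoring"


if __name__ == "__main__":
    for msg in ["I give up", "What's the answer?", "Do I subtract 3 first?"]:
        print(msg, "->", route(msg))
```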

Failure modes encountered

1. The “just tell me the answer” escalation. Students learned to try multiple framings to get answers: “I just need to check my work,” “Can you show me the first step just this once,” “My teacher said it’s okay.” The model was not robust to these framings initially. Resolved with specific handling instructions for each pattern.

2. Over-encouragement. The initial model was too lavish with praise, which students found patronizing and which obscured genuine progress signals. Revised to reserve encouragement for genuine effort and achievement.

3. Context blindness across turns. In long sessions, the model forgot which problem the student was working on and gave generic guidance. Resolved by injecting the current problem context at the top of each message via the product’s context management.

4. Anxiety escalation failures. A student who wrote “I’m so stressed I want to cry” received a math hint, not an acknowledgment. This failure drove explicit emotional state detection instructions and a dedicated response path.
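
The fix for failure mode 3 amounts to prefixing every student message with the current problem state. A minimal sketch follows, assuming the platform exposes the problem as a dictionary; the `text` and `latest_attempt` keys and the bracketed delimiters are hypothetical, not the product's actual format.

```python
def with_problem_context(student_message: str, problem: dict) -> str:
    """Prefix the student's message with the current problem state."""
    # Fall back to a placeholder when the student has not entered any work.
    attempt = problem.get("latest_attempt") or "(none yet)"
    header = (
        "[Current problem]\n"
        f"Problem: {problem['text']}\n"
        f"Student's latest attempt: {attempt}\n"
        "[End of problem context]"
    )
    return f"{header}\n\n{student_message}"


if __name__ == "__main__":
    print(with_problem_context("what do I do next?", {"text": "Solve 2x + 3 = 11"}))
```

Because the context is rebuilt on every turn, the model sees the right problem even late in a long session or after the student switches problems.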

Evaluation approach

Evaluating this product required test sets unlike those used for typical AI products, because correctness alone was insufficient: the pedagogical quality of a response mattered as much as its accuracy.

Category	Description	Key metric
Scaffolding quality	Does the hint point toward the right step without giving it away?	Rubric-scored by experienced math teachers
Answer refusal	Does the model decline to give direct answers appropriately?	Binary pass/fail across test cases
Struggle recognition	Does the model recognize when a student is genuinely stuck vs. being lazy?	Rubric-scored
Emotional response	Does the model respond appropriately to distress signals?	Binary pass/fail on a distress test set
Teacher mode	Does teacher-mode allow solution viewing that student-mode does not?	Role-conditioned behavior tests

The hardest category to evaluate was scaffolding quality — it required teacher review rather than automated scoring.
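
The two metric types in the table aggregate differently. The sketch below shows one plausible way to compute each; it assumes that each rubric case is scored by one or more teachers on a numeric scale and that binary categories are stored as lists of pass/fail results. The function names and data shapes are hypothetical.

```python
from statistics import mean


def pass_rate(results: list[bool]) -> float:
    """Fraction of binary test cases that passed."""
    return sum(results) / len(results) if results else 0.0


def rubric_mean(scores_by_case: dict[str, list[int]]) -> float:
    """Average rubric score: mean per case first, then mean across cases."""
    # Averaging per case first keeps heavily reviewed cases from dominating.
    return mean(mean(scores) for scores in scores_by_case.values())


if __name__ == "__main__":
    print(pass_rate([True, True, False, True]))
    print(rubric_mean({"case-1": [4, 5], "case-2": [2, 3]}))
```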

Key decisions and rationale

No direct answers as a hard constraint. This was a product-level commitment, not a prompt suggestion. It was specified as a non-negotiable in the behavior spec, tested explicitly, and monitored in production via a “direct answer rate” metric.
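
The source does not say how the "direct answer rate" was computed. One simple heuristic, sketched below as an assumption, is to flag any response that states the final numeric answer when that number does not already appear in the problem. The log record keys and function names are hypothetical. The check misses answers written as words or fractions, and it never flags an answer that happens to equal a number in the problem statement.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def leaks_answer(response: str, final_answer: float, problem_text: str = "") -> bool:
    """True if the response states the final numeric answer."""
    # Numbers already present in the problem statement are not counted as leaks.
    given = {float(n) for n in NUMBER.findall(problem_text)}
    stated = {float(n) for n in NUMBER.findall(response)}
    return final_answer in (stated - given)


def direct_answer_rate(logged: list[dict]) -> float:
    """Share of logged tutor responses that leaked the final answer."""
    if not logged:
        return 0.0
    leaks = sum(
        leaks_answer(r["response"], r["final_answer"], r["problem"]) for r in logged
    )
    return leaks / len(logged)


if __name__ == "__main__":
    logs = [
        {"problem": "Solve 2x + 3 = 11", "final_answer": 4.0,
         "response": "So x = 4."},
        {"problem": "Solve 2x + 3 = 11", "final_answer": 4.0,
         "response": "What happens if you subtract 3 from both sides?"},
    ]
    print(direct_answer_rate(logs))
```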

Emotional response path. The team initially excluded emotional state handling as out of scope. An incident during beta testing — a student writing that they were “too stupid” to understand — drove the addition of an emotional response path. It became one of the most important behavioral investments.

Worked example as an earned unlock. Rather than refusing worked examples entirely, the team decided students could unlock them by demonstrating a genuine attempt. This preserved educational integrity without leaving students completely stuck.