
How AI Detects Where Students Stall

By Selena Ortiz | March 11, 2025 | 6 min read


When a student stalls on a problem, the behavior looks roughly the same from the outside: they stop. But what "stopped" means in terms of their understanding can vary widely. They might have the right approach but have made an arithmetic slip. They might have no idea where to begin. They might have a nearly correct mental model with one missing piece. They might have understood the previous ten problems and then hit something categorically different. They might be out of time, distracted, or fatigued.

A human tutor who has watched the same student for months has a lot of context to read these signals. They know which topics this student has historically found harder, they watched how the student approached the last five problems, and they have a conversation they can use to probe understanding directly. An adaptive practice system has to infer the same things — not through conversation, but through the behavioral and performance data that the practice session itself generates.

Here's what that actually looks like.

Time Patterns: Not Just How Long, But When

Response time is one of the most information-dense signals in a practice session, but raw time per problem tells only part of the story. What matters more is the pattern: where in the problem does time accumulate?

A student who submits quickly and gets a problem wrong is displaying a different kind of error than a student who spends three minutes on a problem and gets it wrong. The quick incorrect submission often indicates a misclassification — the student recognized the problem surface as one type and applied the wrong approach confidently. The prolonged attempt with an incorrect result often indicates partial understanding: the student knew enough to try, got somewhere into the procedure, and couldn't complete it.
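As a rough illustration of the distinction above, here is a minimal sketch that turns response time and correctness into a coarse hypothesis. The `Attempt` type, the label strings, and both thresholds are assumptions made for this example; a real system would presumably calibrate thresholds per problem type and per student.

```python
from dataclasses import dataclass

# Hypothetical thresholds in seconds, chosen only for illustration.
FAST_SECONDS = 20.0
SLOW_SECONDS = 120.0

@dataclass
class Attempt:
    seconds: float   # time from problem display to submission
    correct: bool

def timing_hypothesis(attempt: Attempt) -> str:
    """Return a coarse hypothesis based only on time and correctness."""
    if attempt.correct:
        return "no_stall"
    if attempt.seconds <= FAST_SECONDS:
        # Quick and wrong: the problem type was probably misread.
        return "likely_misclassification"
    if attempt.seconds >= SLOW_SECONDS:
        # Slow and wrong: the student got partway and could not finish.
        return "likely_partial_understanding"
    return "ambiguous"
```

The output is a hypothesis, not a diagnosis. The middle band is deliberately labeled ambiguous so that later signals can settle it.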

More granular still is what happens in the middle of a problem. If a system tracks a student's intermediate work or can observe when they re-read a problem (in a digital interface, this might manifest as scroll-back or time spent on specific parts of the problem), the pause structure reveals where in the procedure the breakdown occurs — not just that it occurred.

A student working through a multi-step equation problem who takes 20 seconds to read, then works quickly through the algebraic manipulation, then pauses for 90 seconds before submitting an incorrect answer, has likely stalled at interpretation of the result — not at the algebraic steps. That's a fundamentally different gap from a student who pauses at the start of the manipulation and makes an error in the first move.
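One way to make this concrete is to compare the time spent in each phase against a baseline for that phase and flag the largest overrun. The sketch below assumes the interface can already segment a problem into phases; the phase names and baseline times are hypothetical.

```python
def stall_phase(observed: dict[str, float], expected: dict[str, float]) -> str:
    """Return the phase whose observed time is largest relative to its expected time.

    Both dicts map a phase name to seconds; expected values must be positive.
    """
    return max(observed, key=lambda phase: observed[phase] / expected[phase])

# Hypothetical baselines for one problem type, and one student's session.
expected = {"read": 20.0, "manipulate": 60.0, "interpret": 15.0}
observed = {"read": 20.0, "manipulate": 45.0, "interpret": 90.0}
print(stall_phase(observed, expected))  # interpret
```

Using a ratio rather than raw seconds keeps a naturally long phase from always looking like the stall point.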

Error Taxonomy: What Kind of Wrong

Not all errors are equally informative. A well-designed adaptive system distinguishes between error types because they correspond to different gaps and require different follow-up practice.

Procedural errors — wrong sign, arithmetic mistake, a step performed out of order — are often identifiable by the fact that the student's answer is systematically offset from the correct one. If a student consistently gets the magnitude right but the sign wrong in a class of problems, that's a procedural pattern. It suggests they understand the concept but have an error in one specific operation they're applying to it.
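The sign-flip pattern is simple enough to check directly. The following is a minimal sketch, assuming numeric answers are available as (given, expected) pairs; the function names are illustrative and not part of any particular system.

```python
import math

def is_sign_error(given: float, expected: float) -> bool:
    """True when the magnitude matches but the sign is flipped."""
    return expected != 0 and math.isclose(given, -expected, rel_tol=1e-9)

def sign_error_rate(pairs: list[tuple[float, float]]) -> float:
    """Fraction of the wrong answers that are pure sign flips."""
    wrong = [(g, e) for g, e in pairs if not math.isclose(g, e, rel_tol=1e-9)]
    if not wrong:
        return 0.0
    return sum(is_sign_error(g, e) for g, e in wrong) / len(wrong)
```

A high rate within one class of problems would point to a procedural pattern rather than a conceptual one.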

Conceptual errors look different. They often produce answers that are plausible but reflect a wrong model: confusing the perimeter formula with the area formula, applying a rule for triangles to a different polygon, treating a proportional relationship as additive. These errors tend to cluster by problem type, not by where in the procedure the mistake occurs. The student isn't slipping up on the execution — they're applying a mismatched mental model.

Setup errors — where the student identifies the wrong approach from the start — are often the hardest to detect from the answer alone. The answer may be computed correctly from a wrong premise. In a multiple-choice format, setup errors sometimes produce one of the distractor options (tests like the SAT are deliberately constructed so that common setup errors lead to specific wrong answers, not arbitrary ones). A system that tracks not just the final answer but which answer was selected can use this to infer which misunderstanding is operating.
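If each distractor is tagged with the misunderstanding that would produce it, the selected option becomes evidence about which misunderstanding is in play. This sketch assumes such tags exist; the item ids and tag names are invented for the example.

```python
from collections import Counter

# Hypothetical items. Each tag names the misunderstanding a distractor is
# assumed to reveal; None marks the correct option.
DISTRACTOR_TAGS = {
    "rect_area_3x5": {"A": None, "B": "used_perimeter_formula", "C": "added_side_lengths"},
    "rect_area_2x7": {"A": "used_perimeter_formula", "B": None, "C": "added_side_lengths"},
}

def inferred_misconceptions(responses):
    """Count tagged misunderstandings across (item_id, selected_option) pairs."""
    counts = Counter()
    for item_id, choice in responses:
        tag = DISTRACTOR_TAGS.get(item_id, {}).get(choice)
        if tag is not None:
            counts[tag] += 1
    return counts

responses = [("rect_area_3x5", "B"), ("rect_area_2x7", "A"), ("rect_area_2x7", "B")]
print(inferred_misconceptions(responses).most_common(1))
```

A tag that recurs across different items is stronger evidence than a single wrong selection.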

Skip Behavior as Signal

When a student skips a problem, especially in a practice context where they could have taken a guess, the skip itself carries information. In a timed context, skipping is a rational strategy: move past what's difficult and come back. But in untimed practice, the decision to skip often indicates something more specific: the student looked at the problem, didn't know how to approach it, and chose not to attempt it rather than try and fail.

The nature of the skip matters too. A student who skips after a 30-second look is making a recognition decision: "I don't know how to do this." A student who spends two minutes partially working a problem before abandoning it has at least identified enough structure to try. These are different states of understanding, and they call for different levels of support in whatever comes next.

Skip clustering — when a student skips multiple problems in a row — is a stronger signal still. It indicates a sustained breakdown rather than an isolated difficult problem. A cluster of skips around problems involving rational expressions, for example, suggests a systematic gap in that area rather than a single bad question.
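Skip clusters can be found by scanning the session for runs of consecutive skips. The sketch below assumes each event is a (topic, skipped) pair in presentation order; the minimum run length of three is an arbitrary choice for the example.

```python
from itertools import groupby

def skip_clusters(events, min_run=3):
    """Return the topics of each run of at least min_run consecutive skips.

    events: list of (topic, skipped) tuples in the order problems were shown.
    """
    clusters = []
    for skipped, run in groupby(events, key=lambda event: event[1]):
        run = list(run)
        if skipped and len(run) >= min_run:
            clusters.append([topic for topic, _ in run])
    return clusters

events = [
    ("linear", False),
    ("rational", True), ("rational", True), ("rational", True),
    ("linear", False),
    ("quadratic", True),  # isolated skip, not a cluster
]
print(skip_clusters(events))  # [['rational', 'rational', 'rational']]
```

Returning the topics of each run makes it easy to see whether a cluster concentrates on one area, as in the rational expressions example.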

Session-Level Patterns: Fatigue vs. Structural Gap

One of the harder detection challenges is distinguishing a structural knowledge gap from session-level degradation. Error rates tend to rise and response times tend to lengthen toward the end of a long practice session. That is usually fatigue, not a newly discovered gap. A system that doesn't account for session position will treat end-of-session errors as more significant than they are.

This is where cross-session comparison becomes important. A gap that appears at the end of one session and then appears again at the beginning of the next, when the student is fresh, is most likely a genuine structural gap. Degraded performance that appears only in the last few problems of a long session, and doesn't replicate when those problem types appear early in a subsequent session, is more likely a fatigue effect. Acting on it with a heavy barrage of targeted follow-up would be over-correction.
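The cross-session check can be sketched as a small rule. This version assumes attempts on a single skill are recorded as (session, position, correct) tuples, where sessions are numbered chronologically and position runs from 0.0 at the start of a session to 1.0 at the end. The cutoffs and labels are assumptions for illustration.

```python
def classify_late_errors(observations, early_cutoff=0.33, late_cutoff=0.67):
    """Classify late-session errors on one skill as fatigue or structural gap.

    observations: list of (session, position, correct) tuples for one skill.
    """
    late_error_sessions = {
        s for s, pos, ok in observations if pos >= late_cutoff and not ok
    }
    if not late_error_sessions:
        return "no_late_errors"
    first = min(late_error_sessions)
    # Attempts made early in any later session, when the student is fresh.
    fresh = [ok for s, pos, ok in observations if s > first and pos <= early_cutoff]
    if not fresh:
        return "undetermined"  # the skill has not been seen fresh yet
    return "likely_fatigue" if all(fresh) else "likely_structural_gap"
```

The "undetermined" result matters: until the skill shows up early in a later session, the late errors alone don't justify heavy follow-up.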

We're not saying session-level fatigue signals should be ignored — they carry useful information about pacing and session length calibration. But they need to be weighted differently from persistent cross-session error patterns when deciding what a student's next targeted practice should focus on.

From Signal to Question Generation

Detection is only half of the loop. The other half is: given this stall signal, what should the next practice question be?

The answer isn't simply "more problems of the same type." That's the volume approach. The answer is: problems that target the specific skill node where the stall occurred, at a difficulty level calibrated to the student's current error rate on that node — slightly harder than what they're currently getting right, but not so hard that they can't make any progress at all.
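A simple way to express "slightly harder than what they're currently getting right" is to step one level above the highest difficulty the student answers reliably. The sketch below assumes integer difficulty levels per skill node and an 80% reliability threshold; both are illustrative choices, not a description of any specific product.

```python
def next_difficulty(accuracy_by_level: dict[int, float],
                    mastered: float = 0.8,
                    max_level: int = 5) -> int:
    """Return one level above the highest level answered reliably.

    accuracy_by_level maps a difficulty level to accuracy (0.0 to 1.0)
    on a single skill node.
    """
    mastered_levels = [lvl for lvl, acc in accuracy_by_level.items() if acc >= mastered]
    if not mastered_levels:
        # Nothing reliable yet: start at the easiest level seen, or level 1.
        return min(accuracy_by_level, default=1)
    return min(max(mastered_levels) + 1, max_level)
```

For a student at 95% on level 1, 85% on level 2, and 40% on level 3, this returns level 3: the first level they are not yet getting right consistently.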

In practice this looks like a branching structure. If a student stalls on a problem involving the law of cosines, the system needs to determine: did they stall because they applied the wrong law (law of sines vs. cosines — a classification problem), or because they correctly set up the law of cosines but made an algebraic error solving for the unknown side (an execution problem)? The first calls for classification-practice problems. The second calls for multi-step algebraic manipulation problems within a trigonometric context.

Getting this right requires that the stall detection be specific enough to drive the branching decision. A system that only knows "the student missed a trig problem" can't make that distinction. A system that knows they selected the wrong formula in the setup step, quickly, before working any algebra, can.
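The branching decision for the law of cosines example might look like the following. It assumes the stall record already captures which formula the student set up; the `StallRecord` fields and the returned labels are hypothetical names for this sketch.

```python
from dataclasses import dataclass

@dataclass
class StallRecord:
    chosen_formula: str        # what the student set up, e.g. "law_of_sines"
    expected_formula: str      # what the problem called for
    final_answer_correct: bool

def next_practice(record: StallRecord) -> str:
    """Pick the kind of follow-up practice from where the stall occurred."""
    if record.final_answer_correct:
        return "advance"
    if record.chosen_formula != record.expected_formula:
        # Wrong law chosen: a classification problem.
        return "classification_practice"
    # Right law, wrong result: an execution problem.
    return "algebraic_manipulation_in_trig_context"
```

Without the `chosen_formula` field, both failure modes collapse into "missed a trig problem" and the branch cannot be taken.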

That specificity — detecting not just that a stall occurred, but where within the problem it occurred and what kind of reasoning failure it reflects — is what makes it possible to generate the next question at the right level, on the right sub-skill, so that practice is actually working on the gap rather than working around it.
