QLM evaluated against external benchmarks.

Misconception detection tested against Vanderbilt's MAP expert annotations (9,860 labeled student explanations). Tutoring quality tested against MathTutorBench (1,150 expert-revised conversations). Discourse classification tested against the NTO TalkMoves FloorBenchmark (7,602 labeled teacher utterances).

Quantum Learning Machines · May 2026

External benchmarks are the honest mirror. Internal evaluations tell you what you built. External evaluations tell you whether what you built actually works when confronted with data you did not design for, annotated by experts you did not train. These are QLM's results against three published academic benchmarks — presented without cherry-picking.

Methodology Disclosure

QLM's misconception detection and discourse classification use dedicated classifiers fine-tuned on the respective training datasets. Both classifiers were evaluated with 5-fold stratified cross-validation plus an independent held-out test set (seed ≠ training seed) to guard against overfitting.
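The protocol described above (stratified folds plus a held-out set drawn with a different seed) can be sketched in a few lines. This is a minimal illustration, not QLM's actual pipeline: the function names `stratified_folds` and `holdout_split` are hypothetical, and the 15% test fraction is an assumption (15% of the 9,860 labeled MAP explanations is 1,479, which matches the MAP test size reported in section 1, but the source does not say how the split was made). The seeds 42 and 99 are the ones quoted for MAP.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=42):
    """Assign each example index to one of k folds with similar label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of this label.
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)
        # Rotate the starting fold so small classes do not all land in fold 0.
        offset += len(indices)
    return folds


def holdout_split(labels, test_fraction=0.15, seed=99):
    """Carve out a stratified test set using a seed different from the CV seed.

    test_fraction=0.15 is an assumption, not a value stated in the source.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        cut = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return train, test
```

Each fold then serves once as the validation set while the other four are used for training. The held-out indices from `holdout_split` would be set aside before any cross-validation is run.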

QLM's tutor model was evaluated with 95% confidence intervals: Socratic rate, answer avoidance, relevance, and grade appropriateness on n=200, scaffolding quality on n=50, answer-leak rate on n=100. All scoring is heuristic (keyword and pattern matching, no LLM-as-judge); human evaluation with math teachers is pending. The evaluation used a production-format system prompt that includes mission context, key vocabulary, and misconception hints, matching what students experience in the live product.

QLM's tutor responses are generated from a template-based scaffolding system. The templates produce pedagogically structured responses but lack the contextual specificity of expert human tutors or large language model tutors.

1. MAP Misconception Benchmark

The Mathematics Assessment Project (MAP) dataset, developed at Vanderbilt University, contains 36,696 student explanations of their math reasoning, with 9,860 carrying expert-labeled misconception categories across 34 types. This is the gold standard for misconception detection in K–12 mathematics.

QLM's misconception taxonomy was built independently from pedagogical research. We mapped the majority of MAP's categories to QLM equivalents to evaluate alignment.

Overall Results

Metric	Value	Notes
Micro F1	0.978	Fine-tuned classifier, all predictions pooled
5-Fold CV	97.8% ± 0.4%	Stratified cross-validation
Independent Test	97.8%	n=1,479, seed=99 (training seed=42)

QLM's misconception classifier, fine-tuned on the MAP training data, achieves 97.8% micro-F1 across 5-fold stratified cross-validation (± 0.4% std). On a held-out independent test set (n=1,479, seed=99 — different from the training seed 42), the classifier achieves 97.8% micro-F1. This represents a 5.3x improvement over the previous pattern-matching baseline (18.5%). Per-class F1 exceeds 0.90 for 23 of 34 categories; 11 categories with fewer than 10 test samples are flagged as unreliable.
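For readers less familiar with the metrics, the sketch below shows how micro-F1 (all predictions pooled), macro-F1 (unweighted mean over classes), and the 5.3x ratio can be computed. It is an illustrative implementation with a hypothetical function name, `f1_scores`, and is not QLM's evaluation code. For single-label classification, micro-F1 equals plain accuracy.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1, per_class_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    classes = sorted(set(y_true) | set(y_pred))
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    # Micro-F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages per-class F1, so rare classes weigh as much as common ones.
    macro = sum(per_class.values()) / len(per_class)
    return micro, macro, per_class


# Ratio behind the "5.3x" figure, from the two micro-F1 values reported above.
improvement = 0.978 / 0.185
```

Because micro-F1 is dominated by high-support classes, a strong pooled score can coexist with weak performance on rare categories, which is why the low-support caveat matters.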

Per-Category Performance (Top 10 by Support)

The table below shows the 10 categories with the most test samples (≥40 support). Per-class F1 is at least 0.97 for all of these high-support categories.

MAP Category	Support	Precision	Recall	F1
Incomplete	218	0.99	0.99	0.99
Additive	139	0.99	1.00	0.99
Duplication	106	0.96	1.00	0.98
Wrong_Fraction	104	0.99	0.99	0.99
Subtraction	93	1.00	0.98	0.99
Positive	85	0.99	0.95	0.97
Wrong_term	84	1.00	0.98	0.99
Irrelevant	75	1.00	0.96	0.98
Inversion	62	1.00	0.94	0.97
Mult	53	0.96	0.98	0.97

Coverage Note

11 categories with fewer than 10 test samples are excluded from per-class reporting. Full results, including all 34 categories, are available on request.

What the MAP Results Tell Us

Where QLM excels: The fine-tuned classifier achieves near-perfect F1 (≥0.97) across all high-support categories. Incomplete (0.99 F1), Additive (0.99 F1), Wrong_Fraction (0.99 F1), and Wrong_term (0.99 F1) are all classified with production-grade reliability. This represents a fundamental step-change from the previous pattern-matching baseline, which scored 0.185 micro-F1.

Where caution is needed: 11 of 34 categories have fewer than 10 test samples. Performance on these low-support categories cannot be reliably estimated from this evaluation. The model may underperform on rare misconception types not well-represented in the training data.

QLM's fine-tuned classifier achieves 97.8% micro-F1 on MAP — a 5.3x improvement over the pattern-matching baseline. 23 of 34 categories exceed 0.90 F1. The remaining 11 low-support categories require further evaluation with larger test sets.

2. MathTutorBench Scaffolding Evaluation

MathTutorBench provides 1,150 tutoring conversations from MathDial, each featuring a math word problem, a student's incorrect solution, and an expert teacher's Socratic intervention. We evaluated QLM on three axes: error detection, scaffolding quality, and mistake identification. The counts and percentages below are computed over 1,149 conversations.

Solution Correctness Detection

Metric	Value	Notes
Error Detection Accuracy	86.3%	992 of 1,149 conversations
Correct Detections	992	Student error correctly identified
Missed Errors	157	Student error not caught

QLM correctly identified that the student's answer was wrong in 86.3% of conversations. In the remaining 13.7%, QLM's pattern matching found the correct final answer embedded in the student's response (the student mentioned the right number but arrived at it through incorrect reasoning). This is a real limitation: detecting that an answer is wrong is easier than detecting why correct-looking numbers appear in flawed reasoning.
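The failure mode described above can be illustrated with a deliberately naive checker. This is a hypothetical sketch, not QLM's detector: the function `naive_is_correct` and the sample student response are invented for illustration. It shows how matching the expected number anywhere in the text lets a flawed solution pass as correct.

```python
import re


def naive_is_correct(student_response: str, correct_answer: float) -> bool:
    """Hypothetical pattern-matching check: accept the response if the expected
    final answer appears anywhere in the student's text."""
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", student_response)]
    return any(abs(n - correct_answer) < 1e-9 for n in numbers)


# Made-up student response: the expected value (12) appears mid-reasoning,
# but the student's final answer (18) is wrong.
response = "I multiplied 3 by 4 and got 12, then I added 6 so the answer is 18."
missed = naive_is_correct(response, correct_answer=12)  # error goes undetected

# Detection rate implied by the counts reported above.
detection_rate = 992 / 1149
```

A more robust check would need to locate the student's stated final answer and compare only that, which requires some parsing of the reasoning rather than a number search.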

Scaffolding Quality

Scaffolding quality measures whether the tutor response promotes thinking (Socratic) or bypasses it (answer-giving). QLM's template system is architecturally biased toward Socratic responses — the scaffolding starts with questions by design.

Average scaffolding score (0.0 = answer-giving, 1.0 = Socratic): QLM 84.3%, Expert 74.8%

Metric	QLM	Expert
Average scaffolding score	0.843	0.748
Socratic response rate	84.5%	65.8%
Hint responses	15 (1.3%)	58 (5.0%)
Encouragement	93 (8.1%)	32 (2.8%)
Answer-giving	0 (0.0%)	8 (0.7%)

QLM scores higher on scaffolding quality (0.843 vs. 0.748) and has a higher Socratic response rate (84.5% vs. 65.8%). This advantage is structural, not intelligent. QLM's template system is hard-coded to default to Socratic questions. Expert tutors are more flexible: they sometimes give direct hints or partial answers when they judge the student needs it. This flexibility is pedagogically appropriate — sometimes a student needs a direct nudge, not another question.

QLM never gives the answer directly. This is a design constraint, not a tutoring strategy. A more complete tutor would adaptively choose when Socratic questioning helps and when it frustrates.

Mistake Identification

Metric	Value	Notes
Avg Mistake ID Score	0.312	0.0 = no identification, 1.0 = precise
High Score (0.8–1.0)	157	13.7% of conversations
Low Score (0.0–0.2)	961	83.6% of conversations

This is QLM's weakest result. In 83.6% of conversations, QLM's response failed to specifically identify where the student went wrong. The template system generates generic Socratic questions ("What happens to both sides when you solve this problem?") rather than targeted questions about the specific error ("You multiplied 3 times 4 to get 12 tires, but what does the problem actually say about the tires?").

The expert responses excel here. Expert tutors pinpoint the exact step, the exact number, the exact reasoning error. QLM's templates cannot do this without understanding the mathematical content of the student's response. This is the clearest gap between pattern-based and model-based tutoring.

3. NTO TalkMoves FloorBenchmark

The National Tutoring Observatory (NTO) FloorBenchmark evaluates discourse classification on the TalkMoves taxonomy — a six-category framework for teacher talk moves developed from classroom observation research. The ground truth dataset contains 23,250 teacher utterances, of which 7,602 carry expert-labeled TalkMoves categories: Pressing for Reasoning, Pressing for Accuracy, Restating, Revoicing, Getting Students to Relate to Another's Ideas, and Keeping Everyone Together.

We classified each labeled utterance using QLM's dedicated classifier fine-tuned on labeled teacher utterances from the NTO dataset and compared against LLM baselines from the NTO's own evaluation pipeline.

Classification Results

Metric	Value	Notes
Micro F1	0.903	Independent test, n=1,040
5-Fold CV	90.8% ± 0.7%	Stratified cross-validation
Macro F1	0.904	Balanced across all 6 categories

Comparison to LLM Baselines

Model	Accuracy	Cohen's Kappa	Method
Claude 4.5 Opus	75.5%	0.523	LLM zero-shot
Gemini 3 Pro	74.0%	0.504	LLM zero-shot
GPT-5	72.8%	0.491	LLM zero-shot
o3	70.2%	0.466	LLM zero-shot
QLM	90.3%	0.904	QLM classifier (fine-tuned)

Per-Category Performance

TalkMove Category	Support	Precision	Recall	F1
Pressing for Accuracy	422	0.95	0.84	0.90
Pressing for Reasoning	219	0.88	0.95	0.92
Revoicing	149	0.83	0.91	0.87
Keeping Everyone Together	141	0.92	0.96	0.94
Restating	57	0.87	0.91	0.89
Getting Students to Relate	52	0.85	0.98	0.91
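Cohen's kappa corrects accuracy for agreement expected by chance, so it can be cross-checked approximately from the per-category table. The sketch below reconstructs the confusion-matrix margins from support, precision, and recall. Since those inputs are rounded to two decimals, the result is only an estimate, and the variable names are illustrative.

```python
# (support, precision, recall) per category, copied from the table above.
rows = {
    "Pressing for Accuracy": (422, 0.95, 0.84),
    "Pressing for Reasoning": (219, 0.88, 0.95),
    "Revoicing": (149, 0.83, 0.91),
    "Keeping Everyone Together": (141, 0.92, 0.96),
    "Restating": (57, 0.87, 0.91),
    "Getting Students to Relate": (52, 0.85, 0.98),
}

n = sum(support for support, _, _ in rows.values())

observed_agreement = 0.0  # estimated fraction of utterances labeled correctly
expected_agreement = 0.0  # agreement expected by chance from the two marginals
for support, precision, recall in rows.values():
    true_positives = recall * support
    predicted = true_positives / precision  # estimated predictions of this class
    observed_agreement += true_positives / n
    expected_agreement += (support / n) * (predicted / n)

kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
```

With these rounded inputs the estimate comes out at roughly 0.87. The 0.904 shown for QLM in the Cohen's Kappa column coincides with the macro F1, so that entry may be worth confirming against the raw predictions.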

TalkMoves Distribution: QLM vs Expert Teachers

QLM's template system produces a distribution heavily weighted toward Pressing for Reasoning (60% of classified templates), reflecting its design as a Socratic 1-on-1 tutor. Expert classroom teachers use a markedly different distribution dominated by Pressing for Accuracy (40.5%) and Keeping Everyone Together (40.4%) — moves specific to the classroom context.

Dominant TalkMove category by source: QLM 60% Reasoning, Expert 40.5% Accuracy

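This architectural divergence is intentional. In 1-on-1 adaptive tutoring, classroom management moves (Keeping Everyone Together) are irrelevant. The 30 real 1-on-1 tutoring conversations from Upchieve are described as showing a distribution closer to QLM's than to the classroom expert distribution, which would partially validate QLM's design rationale. The reported Jensen–Shannon divergences (QLM-Upchieve 0.298 vs. Upchieve-Expert 0.045) need confirmation, however: lower divergence means more similar distributions, so as printed they place Upchieve closer to the expert distribution, and the two values may be transposed.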
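Jensen–Shannon divergence measures how far apart two probability distributions are: 0 means identical, and larger values mean less similar. The sketch below is a minimal base-2 implementation. The three distributions are hypothetical and are not the benchmark data, and the source does not state which log base was used or whether it reports the divergence or its square root (the Jensen–Shannon distance).

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Returns 0.0 for identical distributions and at most 1.0.
    """
    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture of the two distributions
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical six-category distributions, NOT the benchmark data.
tutor_like = [0.60, 0.15, 0.10, 0.10, 0.05, 0.00]
similar = [0.55, 0.20, 0.10, 0.10, 0.05, 0.00]
classroom = [0.05, 0.40, 0.05, 0.05, 0.05, 0.40]

near = js_divergence(tutor_like, similar)   # small: distributions resemble each other
far = js_divergence(tutor_like, classroom)  # larger: distributions differ markedly
```

When comparing two pairs of distributions, the pair with the smaller divergence is the more similar one.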

QLM's discourse classification reaches 90.3% accuracy — surpassing zero-shot LLM baselines (Claude 4.5 Opus: 75.5%, GPT-5: 72.8%) while running locally at sub-millisecond latency with full auditability of every classification decision.

Methodology Note

QLM's TalkMoves classifier is a dedicated model fine-tuned on labeled teacher utterances from the NTO dataset. It runs locally at sub-millisecond latency, and every classification decision is traceable and auditable. The LLM baselines from the NTO pipeline used zero-shot prompting on the same dataset.

TalkMoves taxonomy from the National Tutoring Observatory.

4. Live Model Evaluation

The tutor model was evaluated in live conversations with simulated students drawn from a pool of student profiles spanning grades K–12. All metrics are reported with 95% confidence intervals; intervals for proportions are computed via the Clopper-Pearson method.

Key Results

Metric	Result	95% CI	Notes
Socratic rate	100%	98–100%	n=200, every response contains a guiding question
Answer avoidance	96%	92–98%	n=200, response does not reveal the answer directly
Answer leak rate	1%	0.2–5.4%	n=100, rate at which model inadvertently reveals the answer
Scaffolding quality	1.24 / 2.0	1.10–1.38	n=50; 28% targeted Socratic, 68% generic question, 4% irrelevant
Relevance to specific student error	74.5%	68–80%	n=200, response addresses the student's specific error
Grade appropriateness	100%	98–100%	n=200, language and complexity match the student's grade level
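The proportion intervals in the table can be reproduced with standard binomial methods. The sketch below implements the exact Clopper-Pearson interval (by bisection on the binomial CDF) alongside the Wilson score interval for comparison. It uses only the standard library, and the function names are illustrative rather than taken from QLM's code.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% at z=1.96)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(successes, n, alpha=0.05):
    """Exact (Clopper-Pearson) interval, found by bisection on the binomial CDF."""
    def solve(below_root):
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if below_root(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound solves P(X >= successes | p) = alpha/2, which increases with p.
    lower = 0.0 if successes == 0 else solve(
        lambda p: 1 - binom_cdf(successes - 1, n, p) < alpha / 2)
    # Upper bound solves P(X <= successes | p) = alpha/2, which decreases with p.
    upper = 1.0 if successes == n else solve(
        lambda p: binom_cdf(successes, n, p) > alpha / 2)
    return lower, upper
```

For 1 leak in 100 responses, `clopper_pearson(1, 100)` gives a lower bound near 0.03%, while `wilson_interval(1, 100)` gives one near 0.2%; both upper bounds are near 5.4%. The answer-leak interval in the table therefore appears closer to the Wilson form, while the n=200 rows are consistent with either method after rounding.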

The model achieves a 100% Socratic rate and 100% grade appropriateness in this sample, with strong answer avoidance (96%) and a low answer leak rate (1%). Scaffolding quality has improved but remains a gap. The 1.24/2.0 scaffolding score reflects that 28% of responses are targeted Socratic questions addressing the student's specific misconception, up from 16% in the bare-model evaluation. The 74.5% relevance rate shows that the production context (vocabulary hints and misconception context) substantially improves specificity, though the remaining 25.5% of responses still ask relevant but non-specific guiding questions.
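The 1.24/2.0 scaffolding score can be recovered from the category breakdown in the table. The sketch below assumes a scoring scale of 2 for a targeted Socratic question, 1 for a generic question, and 0 for an irrelevant response. That scale is inferred from the numbers, not stated in the source. Because the metric is a mean rather than a proportion, the sketch uses a normal-approximation interval, which is also an assumption.

```python
import math

n = 50
# Assumed scale (inferred): 2 = targeted Socratic, 1 = generic question, 0 = irrelevant.
# Counts correspond to 28%, 68%, and 4% of n=50.
counts = {2: 14, 1: 34, 0: 2}

mean = sum(score * c for score, c in counts.items()) / n

# Sample variance of the per-response scores (n - 1 in the denominator).
variance = sum(c * (score - mean) ** 2 for score, c in counts.items()) / (n - 1)

# Normal-approximation 95% interval for the mean score.
half_width = 1.96 * math.sqrt(variance / n)
interval = (mean - half_width, mean + half_width)
```

Under these assumptions the mean is 1.24 and the interval rounds to 1.10–1.38, matching the values in the table.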

Honest Limitation

Relevance scoring uses word-overlap heuristics. Relevance improved from 50% (bare model) to 74.5% (with production context including key vocabulary and misconception hints). The remaining gap represents cases where the model asks relevant but non-specific guiding questions. Answer avoidance decreased slightly from 99.5% to 96% in production context — the richer prompt occasionally leads the model to be more explicit. Human evaluation with math teachers is planned.

Student: “I added 1/3 + 1/4 and got 2/7”
Tutor: “What happens when you add fractions? Can you explain your thinking?”
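A word-overlap heuristic of the kind mentioned above might look like the sketch below. This is a hypothetical illustration, not QLM's scorer: the stopword list, the tokenizer, and the `overlap_relevance` function are all assumptions, and the "targeted" tutor response is invented for contrast with the exchange shown above.

```python
import re

# Small illustrative stopword list (an assumption, not QLM's list).
STOPWORDS = {"i", "and", "the", "a", "an", "to", "you", "your", "can", "what",
             "when", "got", "is", "it", "of", "do", "did"}


def content_words(text: str) -> set[str]:
    """Lowercase tokens (words, integers, simple fractions) minus stopwords."""
    tokens = re.findall(r"\d+/\d+|[a-z]+|\d+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def overlap_relevance(student: str, tutor: str) -> float:
    """Fraction of the student's content words that the tutor response reuses."""
    student_words = content_words(student)
    if not student_words:
        return 0.0
    return len(student_words & content_words(tutor)) / len(student_words)


student = "I added 1/3 + 1/4 and got 2/7"
generic = "What happens when you add fractions? Can you explain your thinking?"
# Invented targeted response, for contrast only.
targeted = "You added 1/3 and 1/4 and got 2/7. What do the denominators 3 and 4 tell you?"

generic_score = overlap_relevance(student, generic)
targeted_score = overlap_relevance(student, targeted)
```

The generic response scores zero here even though it is topically related ("add" does not match "added" without stemming), which shows why overlap heuristics are a rough proxy and why human evaluation is still needed.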

5. What This Means

Where QLM is Strong

Misconception detection. The fine-tuned classifier achieves 97.8% micro-F1 on MAP — a 5.3x improvement over the pattern-matching baseline. 23 of 34 categories exceed 0.90 F1. This is production-grade classification for the majority of misconception types encountered in K–12 mathematics.

Discourse classification. The TalkMoves classifier achieves 90.3% micro-F1, surpassing zero-shot LLM baselines while running locally at sub-millisecond latency. All six TalkMoves categories reach at least 0.87 F1.

Scaffolding architecture. QLM's scaffolding architecture produces consistently high-quality pedagogical structure. The system rarely gives away answers (96% answer avoidance, 1% leak rate), maintains productive struggle, and achieves a 100% Socratic rate and 100% grade appropriateness.

Error detection accuracy. 86.3% accuracy on detecting student errors in word problems is a solid baseline for the MathTutorBench evaluation.

Where QLM is Weak

Relevance and specificity. The 74.5% relevance rate and 1.24/2.0 scaffolding score show meaningful improvement over the bare-model baseline (50% relevance, 1.12/2.0). Production context (vocabulary and misconception hints) helps, but 68% of scaffolding remains generic rather than targeted. This is still the clearest gap between QLM's current capability and expert human tutoring.

Mistake specificity. An average mistake identification score of 0.312 on MathTutorBench means QLM's responses are pedagogically structured but content-generic. The tutor asks good questions but not the right questions for the specific error. Expert tutors score higher because they understand the math, not just the pedagogy.

Low-support misconception categories. 11 of MAP's 34 categories have fewer than 10 test samples. Performance on rare misconception types cannot be reliably estimated from this evaluation, and the classifier may underperform on them in practice.

What Comes Next

These three benchmarks establish strong classifier foundations and reveal the tutor model's primary gap. The next steps are clear:

Targeted scaffolding. Production context improved relevance from 50% to 74.5% and targeted scaffolding from 16% to 28%, but 68% of scaffolding remains generic. Training on real student conversation data (not synthetic templates) and deeper integration of the misconception classifier's output should continue this trajectory. Target: relevance above 90%, targeted scaffolding above 50%.

Context-aware response generation. Improving training data quality to increase response specificity within QLM's 5-tier scaffolding framework. Target: MathTutorBench mistake identification score above 0.70.

TalkMoves-informed scaffolding. Using discourse classification to dynamically select scaffolding strategies based on the conversational context, not just the student's error state.

Cognitive transparency metrics. We are building measurement for four capabilities — reasoning explanation, interpretive defense, ambiguity tolerance, and process visibility — that represent stronger indicators of human thinking than output quality alone.

Validation partnerships. QLM launched in April 2026. We have zero student validation data. We are actively seeking research partners to evaluate QLM with real students in classroom settings. The synthetic benchmarks above are necessary but not sufficient — the only evaluation that ultimately matters is whether students learn.

External Validation: Socratic AI Tutoring Research

QLM's Socratic tutoring approach, asking questions rather than giving answers, is supported by recent studies, including randomized controlled trials:

Published Studies Supporting Socratic AI

Socratic AI in K–12 Science Classrooms (2025). A randomized controlled trial found that Socratic AI improved critical thinking, motivation, and self-regulation in K–12 science. Students in the AI-supported group demonstrated significantly greater learning gains and reported higher engagement than controls. ResearchSquare rs-8118546

AI Tutoring RCT in UK Classrooms (2025). An exploratory RCT in UK secondary schools found that “AI tutoring can safely and effectively support students.” The study demonstrated learning gains across all conditions, with AI tutoring producing comparable outcomes to human tutoring for procedural skills. arXiv:2512.23633

SocraticAI: Scaffolded Interaction (2025). Students using a Socratic AI tutor progressed from vague help-seeking to sophisticated problem decomposition within 2–3 weeks, with over 75% producing substantive reflections. arXiv:2512.03501

AI-Powered Metacognitive Calibration (CHI 2025). Real-time AI-predicted scores helped students correct miscalibration, improving metacognitive accuracy. Students receiving AI-powered intervention improved calibration more than control groups. ACM CHI 2025

QLM has not yet conducted its own RCT. These studies support the Socratic approach that QLM implements but do not evaluate QLM itself. We are seeking school partners for controlled evaluation.

Continuous Improvement

The scoring and training pipeline is continuously improving. We evaluate every model iteration against the same rigorous benchmarks with 95% confidence intervals and publish updated results as they become available.

Research Partnership

We are seeking K–12 school partners and research institutions to validate QLM with real students. If your district or lab is interested in a controlled evaluation, let's talk.

research@quantumlearningmachines.com