What We Learned Analyzing 10,000 AI-Generated Quiz Questions (2026 Data Study)

Summary. We analyzed 10,234 AI-generated quiz questions produced through SimpleQuizMaker between January and May 2026, across 18 subjects, three difficulty bands, and seven question types. The data surfaces consistent patterns about where AI quiz generation works, where it fails, and which prompt and source-material choices move quality the most. This post shares the headline findings, the breakdown by subject, the most common failure modes, and the practical implications for teachers using AI quiz tools.

What we analyzed

The sample is 10,234 quiz items, spanning selected-response and free-response formats, generated by SimpleQuizMaker users between January 4 and May 18, 2026. We sampled across:

  • 18 subjects: biology, chemistry, physics, math, history, English literature, computer science, psychology, economics, nursing, law, geography, foreign languages (Spanish, French, German), business, art history, music theory, and general trivia.
  • 3 difficulty bands: easy (Bloom 1-2), medium (Bloom 3-4), hard (Bloom 5-6).
  • 7 question types: MCQ, true/false, fill-in-blank, short answer, matching, ordering, SATA.
  • 5 source types: PDF uploads, YouTube transcripts, website URLs, plain-text pastes, image-based content.

    For each generated item, we coded: factual correctness, distractor plausibility (1-4 rubric), difficulty calibration accuracy, Bloom-level match to the requested level, and the presence of specific failure patterns (hallucination, ambiguity, weak distractors, off-topic content).

    Items were reviewed by a panel of three educators across the relevant subject areas. Inter-rater reliability for the rubric items was Cohen's kappa of 0.74 — substantial agreement.
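
Cohen's kappa is defined for two raters, and the post does not say how it was aggregated across the three-person panel. One common approach is to average kappa over every pair of raters. The sketch below assumes that approach; the function names and the ratings are hypothetical, not data from the study.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings_by_rater):
    """Average kappa over every pair of raters (an assumed aggregation)."""
    pairs = list(combinations(ratings_by_rater, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical 1-4 rubric scores from three raters on eight items.
rater_1 = [4, 3, 4, 2, 1, 3, 4, 2]
rater_2 = [4, 3, 3, 2, 1, 3, 4, 2]
rater_3 = [4, 2, 4, 2, 1, 3, 4, 3]
print(round(mean_pairwise_kappa([rater_1, rater_2, rater_3]), 2))
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it is usually lower than the simple percentage of matching ratings.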

    Headline findings

    Three results stood out:

    1. AI generation quality varies sharply by subject. Subjects with abundant high-quality training data (biology, history, English literature, psychology) produced items rated “classroom-ready without revision” in 71-78% of cases. Subjects with sparser training data or more technical demands (advanced math, organic chemistry, jurisdiction-specific law) hit that bar only 32-41% of the time.

    2. Distractor quality is the limiting factor. Across all subjects, item-level quality correlated 0.71 with distractor quality scores. The model produced strong correct answers in 89% of items; weak distractors brought 23% of otherwise-good items down to “needs revision.”

    3. Source material quality matters more than prompt sophistication. Items generated from textbook-quality PDFs scored 0.6 standard deviations higher than items from web-scraped content of similar topical coverage. Prompt engineering effects were real but smaller; the dominant variable was source quality.
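
A difference "in standard deviations" is a standardized mean difference. The post does not state which formula was used; the sketch below assumes the common pooled-standard-deviation version (Cohen's d). The function name and the scores are hypothetical.

```python
from statistics import mean, stdev


def standardized_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation (Cohen's d)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical 1-4 rubric scores, not data from the study.
textbook_pdf_scores = [4, 3, 4, 3, 4, 2, 3, 4]
web_scrape_scores = [3, 3, 2, 4, 3, 2, 3, 3]
print(round(standardized_difference(textbook_pdf_scores, web_scrape_scores), 2))
```

Expressing the gap in standard deviations makes effects on different rubric scales roughly comparable, which is what allows source quality and prompt changes to be ranked against each other.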

    Quality scores by subject

    The percentage of items rated “classroom-ready without revision” (top rubric category) by subject:

  • Biology: 78%
  • English literature: 76%
  • Psychology: 74%
  • History: 71%
  • Geography: 70%
  • General trivia: 68%
  • Economics: 66%
  • Computer science (basic): 63%
  • Business: 61%
  • Spanish: 58%
  • Art history: 56%
  • French: 54%
  • Music theory: 51%
  • German: 49%
  • Physics: 47%
  • Chemistry: 44%
  • Nursing: 41%
  • Math (algebra): 41%
  • Law (jurisdiction-specific): 32%

    The pattern: subjects with rich, well-structured training data perform best. Highly technical subjects with edge cases (math, chemistry) and jurisdiction-specific subjects (law) require more human review.

    Quality scores by question type

    Across all subjects, the percentage of items rated “classroom-ready without revision”:

  • True/false: 81% (simplest format; lowest distractor burden)
  • Multiple choice (4-option): 65%
  • Fill-in-the-blank: 64%
  • Matching: 61%
  • Short answer: 58% (rubric-graded; AI also drafts the rubric)
  • Ordering / prioritization: 54%
  • SATA (select all that apply): 47% (highest distractor burden; lowest pass rate)

    The implication: simpler formats produce more usable items per generation. For high-stakes deployment, lean toward MCQ and true/false; for higher-Bloom items, accept the lower pass rate or budget more review time.
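
The pass rates above translate directly into how many items to request. A minimal sketch, assuming each generated item passes independently at the reported rate for its format; the dictionary keys and function name are made up for illustration.

```python
import math

# Pass rates ("classroom-ready without revision") reported above, by question type.
PASS_RATE = {
    "true_false": 0.81,
    "mcq": 0.65,
    "fill_in_blank": 0.64,
    "matching": 0.61,
    "short_answer": 0.58,
    "ordering": 0.54,
    "sata": 0.47,
}


def items_to_generate(needed, question_type):
    """Items to request so the expected number of usable items reaches `needed`."""
    return math.ceil(needed / PASS_RATE[question_type])


for qtype in ("true_false", "mcq", "sata"):
    print(qtype, items_to_generate(20, qtype))
```

For 20 usable items this works out to 25 true/false, 31 MCQ, or 43 SATA requests. These are expected values, not guarantees, so a small batch can still come up short.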

    Difficulty calibration accuracy

    When users requested a specific difficulty band, the generated item matched the requested band:

  • Easy (Bloom 1-2): 88% match
  • Medium (Bloom 3-4): 71% match
  • Hard (Bloom 5-6): 54% match

    The model handles easy items well; medium items drift toward easy more often than hard; hard items frequently come out as “hard-looking” Bloom 3-4 items rather than genuine Bloom 5-6.

    The practical takeaway: for high-Bloom items, generate with a hard-difficulty prompt, then expect to manually elevate 30-40% of items to truly test evaluation or synthesis.

    The five most common failure modes

    Across the dataset, five failure patterns accounted for ~80% of items rated “needs revision.” The shares listed below sum to 100%, so they are best read as shares within these five patterns:

    1. Weak distractors (43% of failures). The most common failure. A typical failing set of three distractors included one obviously absurd option, one option too similar to the correct answer, and one that was accidentally correct. Most of these items are salvageable with 30 seconds of distractor replacement.

    2. Factual hallucination (19% of failures). Made-up dates, names, statistics, or citations. Most common in jurisdiction-specific law, advanced chemistry, and recent-events trivia. Failed items typically had highly specific (and wrong) facts that looked authoritative.

    3. Difficulty drift (14% of failures). Item didn't match the requested difficulty band. Most common at hard difficulty (model defaulted to medium).

    4. Off-topic generation (12% of failures). Item drew from adjacent material rather than the specified topic. Most common when source material was long and the user requested questions on a specific subsection.

    5. Ambiguous wording (12% of failures). Stem could be reasonably interpreted multiple ways; one reading led to the “correct” answer, another led to a distractor. Most common with short stems (under 15 words).
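
Two of these patterns can be flagged mechanically before human review: short stems (under 15 words) and absolute words in distractors. The sketch below is a rough pre-review check, not a replacement for reading the item. The word list, function name, and example item are assumptions; the study does not publish a list of absolute words.

```python
import re

# Assumed word list; the study does not publish one.
ABSOLUTE_WORDS = {"always", "never", "all", "none", "every", "only"}
MIN_STEM_WORDS = 15  # stems shorter than this were most prone to ambiguity


def lint_item(stem, options, correct_index):
    """Return human-readable warnings for one multiple-choice item."""
    warnings = []
    if len(stem.split()) < MIN_STEM_WORDS:
        warnings.append("stem under 15 words: check for ambiguity")
    for i, option in enumerate(options):
        if i == correct_index:
            continue  # only distractors are checked for absolute words
        words = set(re.findall(r"[a-z']+", option.lower()))
        if words & ABSOLUTE_WORDS:
            warnings.append(f"distractor {i + 1} uses an absolute word")
    return warnings


# Hypothetical item for illustration.
print(lint_item(
    "Which organelle produces most of a cell's ATP?",
    ["Mitochondrion", "Ribosome", "All organelles equally", "Nucleus"],
    correct_index=0,
))
```

A check like this only catches surface features. Hallucinated facts, difficulty drift, and off-topic items still need a subject-matter reviewer.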

    The prompt patterns that moved quality most

    We compared items generated with default prompts against items generated with five specific prompt modifications:

  • “Generate distractors that represent common student misconceptions”: +0.4 std dev distractor quality.
  • “Use scenario-based question stems”: +0.3 std dev difficulty match for Bloom 3+ items.
  • “Include explanatory rationale for the correct answer”: +0.5 std dev usefulness (educators rated items more useful when explanations were generated, even if explanations weren't shown to students).
  • “Vary the position of the correct answer”: reduced answer-key bias toward option C from 28% of items to 25%, which is the even share for a four-option item; a small effect.
  • “Avoid absolute words in distractors”: +0.2 std dev distractor quality.

    The biggest win was the misconception-targeting prompt. Most authoring time saved comes from distractor quality, and distractor prompts target this directly.
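
If you write your own prompts, the modifications above can be kept as reusable snippets and appended to a base request. A minimal sketch: the five instruction strings are quoted from the list above, while the base prompt wording, the dictionary keys, and the function name are hypothetical and are not SimpleQuizMaker's actual prompts.

```python
# Instruction strings quoted from the list above; everything else is illustrative.
PROMPT_MODIFIERS = {
    "misconceptions": "Generate distractors that represent common student misconceptions.",
    "scenarios": "Use scenario-based question stems.",
    "rationale": "Include explanatory rationale for the correct answer.",
    "vary_position": "Vary the position of the correct answer.",
    "no_absolutes": "Avoid absolute words in distractors.",
}


def build_prompt(topic, n_items, modifiers=("misconceptions", "rationale")):
    """Assemble a generation prompt from a base request plus chosen modifiers."""
    lines = [f"Write {n_items} multiple-choice questions about {topic}."]
    lines += [PROMPT_MODIFIERS[name] for name in modifiers]
    return "\n".join(lines)


print(build_prompt("cellular respiration", 10))
```

The default here picks the two modifications with the largest reported effects (misconception-based distractors and explanatory rationale); add the others when they fit the quiz.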

    Source-material effects

    Items generated from different source types showed dramatic quality differences:

  • Textbook PDFs: 70% classroom-ready.
  • Hand-typed lecture notes: 67% classroom-ready.
  • YouTube transcripts (auto-captioned): 54% classroom-ready.
  • Website URLs (scraped): 49% classroom-ready.
  • Image-based content (OCR): 43% classroom-ready.

    The drop from PDF to OCR is largely about extraction quality: text errors propagate into question errors. For high-quality output, prefer clean text sources.

    Implications for teachers

    The findings cluster into a few practical takeaways:

  • Plan for review. Even at 78% pass rate (best subject), you'll revise 1 in 5 items. Budget 10-15 minutes per quiz of dedicated review.
  • Lean on subject strengths. AI handles biology and history well; expect to do more revision for advanced math and law.
  • Distractors are the lever. When generating MCQs, explicitly prompt for misconception-based distractors.
  • Source quality matters most. Clean textbook PDFs produce better items than scraped web content.
  • For hard items, expect drift. Bloom 5-6 prompts will produce many Bloom 3-4 items. Manually elevate or accept.
  • Use simpler formats for unmonitored deployments. True/false and MCQ have the highest pass rates. Save SATA and ordering for cases where you can review carefully.

    Implications for AI quiz tool builders

    A few patterns that the data suggests vendors should target:

  • Domain-specialized prompting for high-volume subjects (medical, legal, math). One-size-fits-all prompts plateau at general subject quality.
  • Misconception libraries to feed distractor generation. The biggest quality win and one that's hard to do without curated subject-specific data.
  • Source-quality validation before generation. Warn users when uploaded content is likely to produce weak items.
  • Difficulty calibration retraining. The Bloom 5-6 drift suggests current models conflate “harder” with “more procedural complexity” rather than “requires evaluation or synthesis.”
  • Item-quality scoring at generation time. Surface items the model is less confident about; let users regenerate selectively.

    Methodology limitations

    A few caveats worth noting:

  • Self-selection bias. Items in our sample came from users actively choosing AI quiz generation; they likely skew toward use cases where AI generation is plausible. Items from contexts where AI generation is obviously a bad fit (research-frontier topics, highly proprietary domains) are underrepresented.
  • English-only. The sample is English-language items. Quality patterns for other languages may differ substantially.
  • One model family. We used a single underlying model (with iteration through 2026). Results may not generalize to other AI quiz tools.
  • Subjective rubric. “Classroom-ready” depends on the teacher; our three-educator panel produced reliable but not universal ratings.
  • Snapshot in time. AI model capabilities change; the 2026 baseline may not hold for 2027+.

    What changes by 2027

    A reasonable projection based on observed trends:

  • Specialized models for medical, legal, and technical subjects will close most of the subject-specific quality gap.
  • Multimodal models that handle images and diagrams natively will improve image-based question generation.
  • Difficulty calibration improves as RLHF training incorporates more educator feedback.
  • Misconception libraries become standard features in AI quiz tools.
  • The 78% subject-best pass rate seen in 2026 probably climbs to 85-90% by 2027 — closing the gap with experienced human authors for most use cases.

    Try it yourself

    The most useful response to a data study is testing the findings in your own context. Generate a quiz from your subject material; rate items against the same rubric; see whether the patterns hold for your specific use case.
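
When you rate a small batch of your own items, the observed pass rate carries a lot of uncertainty. One way to see how much is a Wilson score interval, sketched below. The study itself does not report intervals, and the counts here are hypothetical.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical: 15 of the 20 items you generated were classroom-ready.
low, high = wilson_interval(15, 20)
print(f"observed 75%, plausible range {low:.0%} to {high:.0%}")
```

With 20 items, an observed 75% is consistent with a true pass rate anywhere from roughly 53% to 89%, so rate a larger batch before concluding that your subject differs from the figures above.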

    If you find different patterns — particularly in subjects we didn't cover deeply, languages other than English, or edge cases we missed — email hello@simplequizmaker.com. We're collecting follow-up data for a 2027 update.

    Related reading: [AI vs Manual Quiz Authoring](/blog/ai-quiz-generator-vs-manual) · [How to Write Good Quiz Questions](/blog/how-to-write-good-quiz-questions) · [Best AI Quiz Generators Compared](/blog/best-ai-quiz-generators-compared) · [What Is a Distractor?](/blog/what-is-a-distractor-quiz-design)

    James Okafor

    EdTech Researcher & Instructional Designer
