Short answer. Item difficulty (denoted *p*) is the proportion of students who answered a given quiz question correctly. A question that 80% of students got right has *p* = 0.80. The metric is poorly named — *higher* difficulty value = *easier* question — but it's a foundational stat for evaluating quiz quality.
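The sketch below computes *p* for each question as the mean of that question's column. It assumes responses are already scored 1 for correct and 0 for incorrect or blank, with one row per student. That layout and the function name are illustrative, not any platform's export format.

```python
# Item difficulty (p) for each question in a 0/1 score matrix.
# Assumed (hypothetical) layout: one row per student, one column per
# question, 1 = correct and 0 = incorrect or blank.


def item_difficulty(scores: list[list[int]]) -> list[float]:
    """Return p, the proportion answering correctly, for every question."""
    if not scores:
        raise ValueError("need at least one student's responses")
    n_students = len(scores)
    n_items = len(scores[0])
    # Sum each column and divide by the number of students.
    return [sum(row[j] for row in scores) / n_students for j in range(n_items)]


# Five students, three questions.
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
]
print(item_difficulty(scores))  # [0.8, 0.8, 0.2]
```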
What the values mean
- p = 1.0: Every student got it right. Either trivially easy or a freebie. Not useful for distinguishing students.
- p = 0.80-0.90: Easy. Good for confidence-building and foundational checks.
- p = 0.50-0.70: Medium. Most useful for assessment, because these items discriminate between students who know the material and students who don't.
- p = 0.20-0.40: Hard. Good for stretch questions; use sparingly.
- p < 0.20: Either too hard for the cohort, or a broken question.
- p = 0: No one got it right. Almost certainly a broken question.

How to use it
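One way to apply these bands is a small labeling function, sketched below. It is a hypothetical helper, not a feature of any particular platform. The ranges above leave gaps (0.40-0.50, 0.70-0.80, 0.90-1.0), so its cut points are an assumption: each gap is folded into the next-harder band. It also checks a whole quiz against the targets for a typical quiz that follow.

```python
# Label items with the difficulty bands and check a quiz against the
# quiz-level targets. Assumption: gaps between the listed ranges are
# folded into the next-harder band.


def difficulty_band(p: float) -> str:
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p == 1.0:
        return "everyone right"
    if p >= 0.80:  # includes the 0.90-1.0 gap
        return "easy"
    if p >= 0.50:  # includes the 0.70-0.80 gap
        return "medium"
    if p >= 0.20:  # includes the 0.40-0.50 gap
        return "hard"
    if p > 0.0:
        return "too hard or broken"
    return "no one right (likely broken)"


def quiz_check(ps: list[float]) -> dict[str, object]:
    """Compare a quiz's item difficulties with the targets for a typical quiz."""
    mean_p = sum(ps) / len(ps)
    return {
        "mean_p": round(mean_p, 2),
        "mean_in_target": 0.5 <= mean_p <= 0.7,  # aim for 0.5-0.7 on average
        "all_easy": all(p > 0.85 for p in ps),  # quiz won't differentiate
        "all_hard": all(p < 0.4 for p in ps),  # quiz frustrates without measuring
    }


ps = [0.92, 0.85, 0.62, 0.55, 0.30, 0.0]
for p in ps:
    print(f"p = {p:.2f}: {difficulty_band(p)}")
print(quiz_check(ps))
```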
For a typical quiz:
- Aim for an average difficulty of 0.5-0.7 across all items.
- Don't make every item easy (p > 0.85): the quiz won't differentiate between students.
- Don't make every item hard (p < 0.4): the quiz frustrates students without measuring much.
- Mix difficulty levels so the quiz can separate students across the whole range of ability.

Item difficulty + item discrimination
Item difficulty alone is incomplete. A question with p = 0.50 might be:
- A great question (the top half of the class gets it right, the bottom half doesn't): high [item discrimination](/blog/what-is-item-discrimination)
- A coin-flip question (every student has 50/50 odds): low item discrimination
- A *reverse-discrimination* broken question (the bottom half gets it right, the top half misses it): negative discrimination

Use both metrics together. A good quiz has medium-difficulty items with positive discrimination.
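To see how three items with identical difficulty can differ, here is a sketch using the upper-lower discrimination index: p among the top half of the class minus p among the bottom half, with students ranked by total quiz score. This is one common definition among several, and the student totals and responses are made-up illustration data.

```python
# Upper-lower discrimination index. Splitting the class into halves
# follows the description above; many references use the top and
# bottom 27% instead.


def discrimination_index(item: list[int], totals: list[float]) -> float:
    """item: 0/1 scores on one question; totals: each student's quiz total."""
    if len(item) != len(totals) or len(item) < 2:
        raise ValueError("need matching lists with at least two students")
    # Rank students from highest to lowest total (ties keep input order).
    order = sorted(range(len(item)), key=lambda i: totals[i], reverse=True)
    half = len(order) // 2  # with an odd count, the middle student is skipped
    top, bottom = order[:half], order[-half:]
    p_top = sum(item[i] for i in top) / half
    p_bottom = sum(item[i] for i in bottom) / half
    return p_top - p_bottom


totals = [95, 90, 85, 80, 60, 55, 50, 45]  # students listed strongest first
items = {
    "great": [1, 1, 1, 1, 0, 0, 0, 0],
    "coin flip": [1, 0, 1, 0, 1, 0, 1, 0],
    "reverse": [0, 0, 0, 0, 1, 1, 1, 1],
}
for name, item in items.items():
    p = sum(item) / len(item)
    d = discrimination_index(item, totals)
    print(f"{name}: p = {p:.2f}, D = {d:+.2f}")
# great: p = 0.50, D = +1.00
# coin flip: p = 0.50, D = +0.00
# reverse: p = 0.50, D = -1.00
```

All three items have p = 0.50; only the discrimination index tells them apart.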
Most LMSs and quiz platforms show item difficulty after each quiz administration:
- Canvas, Blackboard, Moodle: the quiz statistics or item analysis report (Moodle labels the value the *facility index*)
- SimpleQuizMaker: per-question analytics across all submissions

After every quiz, scan the difficulty distribution. Items with p > 0.95 (nearly everyone right) and p < 0.15 (almost no one right) are usually the ones to review or rewrite.
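A minimal sketch of that scan, assuming you already have each item's p (for example, copied from a statistics report). The function name is hypothetical, the thresholds are parameters whose defaults match the 0.95 and 0.15 cutoffs, and the p values are made-up illustration data.

```python
# Print a text histogram of p in 0.1-wide bins, then return the items
# outside the review thresholds.


def scan_difficulty(ps: dict[str, float], high: float = 0.95, low: float = 0.15) -> list[str]:
    bins = [0] * 10
    for p in ps.values():
        bins[min(int(p * 10), 9)] += 1  # p = 1.0 lands in the top bin
    for i, count in enumerate(bins):
        print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} | {'#' * count}")
    # Items that nearly everyone or almost no one answered correctly.
    return [name for name, p in ps.items() if p > high or p < low]


ps = {"Q1": 0.98, "Q2": 0.72, "Q3": 0.55, "Q4": 0.10, "Q5": 0.64}
print(scan_difficulty(ps))  # ['Q1', 'Q4']
```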
Common mistakes
- Confusing the name. Higher value = easier. It's not intuitive, and many teachers reverse it in their heads.
- Ignoring difficulty data when grading. The "everyone missed Q5" pattern usually means Q5 is broken, not that everyone failed to learn.
- Treating low p as a teaching problem when it's a question problem. Verify that the question is well-written before concluding that students don't know the material.

Item difficulty across high-stakes exams
For comparison, typical average item difficulty on high-stakes exams:
- AP exams: target around p = 0.60, calibrated to differentiate scores of 1-5.
- SAT: average p around 0.65, with harder items toward the back of each section.
- GRE: adaptive by section, so the difficulty of the second section of each measure depends on performance on the first.
- NCLEX: adaptive item by item, selecting each question so the expected correct rate stays near 50%. (The MCAT, by contrast, is not adaptive.)
- Bar exam, USMLE: average p around 0.55, typical of high-stakes filtering exams.

Classroom unit tests typically run easier (p around 0.75) because the goal is documenting that most students learned the material, not differentiating among them.
A diagnostic protocol after a quiz
After the first 10-20 students take a new quiz:
1. **Sort items by p.** Anything with p > 0.95 or p < 0.20 gets reviewed.
2. **For high-p items**: was this a freebie? If yes, fine. If you intended it to be hard, the wording probably gave the answer away.
3. **For low-p items**: read the question aloud and work it yourself. If you struggle or notice ambiguity, the item is broken. If it's truly hard and well-written, keep it but note the difficulty.
4. **Check the discrimination index** on mid-difficulty items (p around 0.40-0.60). High discrimination = useful item; low or negative = revise.

After 5-10 administrations, you have a refined item bank with known difficulty. AP/MCAT/NCLEX item banks went through this process for years before reaching production.
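The protocol above can be sketched as a single review pass. This assumes each item's p and discrimination index are already available (for example, exported from an LMS item-analysis report); the `ItemStats` record and `review_items` function are hypothetical names. The `min_d` cutoff of 0.20 for "high enough" discrimination is an assumed rule of thumb, not a value given in this article.

```python
from dataclasses import dataclass


@dataclass
class ItemStats:
    name: str
    p: float  # item difficulty: proportion answering correctly
    d: float  # discrimination index (upper-lower or similar)


def review_items(items: list[ItemStats], min_d: float = 0.20) -> list[tuple[str, float, str]]:
    report = []
    # Step 1: sort by p so the extremes sit at either end of the report.
    for item in sorted(items, key=lambda it: it.p):
        if item.p > 0.95:  # Step 2: freebie, or did the wording give it away?
            action = "review: high p"
        elif item.p < 0.20:  # Step 3: work it yourself and check for ambiguity
            action = "review: low p"
        elif 0.40 <= item.p <= 0.60:  # Step 4: check discrimination
            action = "keep" if item.d >= min_d else "revise: low or negative discrimination"
        else:
            action = "no flag"
        report.append((item.name, item.p, action))
    return report


items = [
    ItemStats("Q1", 0.98, 0.02),
    ItemStats("Q2", 0.12, -0.10),
    ItemStats("Q3", 0.55, 0.45),
    ItemStats("Q4", 0.50, 0.05),
    ItemStats("Q5", 0.75, 0.30),
]
for name, p, action in review_items(items):
    print(f"{name}  p = {p:.2f}  {action}")
```

Items outside the 0.40-0.60 window are not checked for discrimination here, mirroring step 4. A stricter pass could apply the `min_d` check to every item that isn't already flagged.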
Why item difficulty matters less than discrimination
Two questions with the same difficulty (p = 0.60) can have very different value:
- One is a "good 60%": top students mostly right, bottom students mostly wrong (high discrimination).
- One is a "noise 60%": a random mix of students right (low or zero discrimination).

The first is a productive item; the second is a coin flip dressed up as a quiz question. Always check both metrics; never use difficulty alone.
- [What Is Item Discrimination?](/blog/what-is-item-discrimination)
- [How to Write Good Quiz Questions](/blog/how-to-write-good-quiz-questions)
- [How to Write Hard Quiz Questions](/blog/how-to-write-hard-quiz-questions)
- [Quiz Analytics — Teacher Guide](/blog/quiz-analytics-teacher-guide)
- [What Is a Distractor (in Quiz Design)?](/blog/what-is-a-distractor-quiz-design)

Generate a quiz and see per-question difficulty after the first 5 submissions.