Interpreting Situational Judgment Test Results: Scores, Patterns, and What to Do With Them
The score without context is noise
An SJT score without an anchor is meaningless. A candidate scores 72%. Is that strong? Average? Weak? You do not know.
Context depends on:
- Your scoring methodology (most-effective vs. distance-based)
- Your comparison group (internal benchmarks vs. external norms)
- The role specificity of the assessment (generic vs. customized)
- The master ranking (how you defined "right")
A candidate who ranks "investigate alone first" as the top option on an incident response SJT might score 100% at a company that values autonomy and 40% at a company that values escalation discipline. Neither score is wrong. Each measures alignment with what that company values.
This guide walks through how to interpret SJT results so you can use them correctly.
Scoring methodologies: most-effective vs. distance-based
Most-effective (MD) scoring
The candidate scores a point only if their top-ranked option matches the expert master ranking first choice.
Example:
- Expert ranking: E > D > A > C > B
- Candidate ranking: E > D > C > A > B
- Score: 1 point (their top choice, E, matches the expert's top choice)
- Result: 1 of 1 possible point on this question; the swap of A and C lower in the ranking has no effect
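The rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `most_effective_score` and `percent_matched` are hypothetical, and it assumes each scenario is worth exactly one point.

```python
def most_effective_score(expert: list[str], candidate: list[str]) -> int:
    """Return 1 if the candidate's top-ranked option matches the expert's, else 0."""
    return 1 if candidate[0] == expert[0] else 0


def percent_matched(pairs: list[tuple[list[str], list[str]]]) -> float:
    """Percentage of scenarios where candidate and expert agree on the top option.

    Each item in `pairs` is (expert_ranking, candidate_ranking) for one scenario.
    """
    points = sum(most_effective_score(expert, cand) for expert, cand in pairs)
    return 100.0 * points / len(pairs)


expert = ["E", "D", "A", "C", "B"]      # expert ranking from the example
candidate = ["E", "D", "C", "A", "B"]   # candidate ranking from the example

# Both rank E first, so this scenario earns the point.
print(most_effective_score(expert, candidate))
```

Only position one is inspected, so everything below the top choice is ignored. That is what makes the method easy to explain and also what makes it all-or-nothing.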
Advantages:
- Binary, defensible. Either they chose the most effective option or they did not.
- Matches your hiring standard: "Do they make the choice we would make?"
- No subjectivity in scoring.
- Easy to explain to candidates and stakeholders.
Disadvantages:
- No partial credit. A candidate who ranks D first and E second (you want E > D > ...) gets zero credit, even though they identified the two best options.
- All-or-nothing can feel harsh on edge cases.
Use most-effective scoring when: You want to hire managers or leaders who consistently align with your judgment standards. You have low tolerance for deviation from those standards. You want the assessment to differentiate clearly.
Distance-based scoring
The candidate's full ranking is compared to the expert ranking using a distance metric (e.g., sum of absolute differences between positions).
Example:
- Expert ranking: E(1) > D(2) > A(3) > C(4) > B(5)
- Candidate ranking: E(1) > D(2) > C(3) > A(4) > B(5)
- Distance: |1-1| + |2-2| + |4-3| + |3-4| + |5-5| = 0 + 0 + 1 + 1 + 0 = 2
- Normalized score (lower distance = higher score): 1 - 2/12, or about 83%, where 12 is the maximum possible distance for five options (a fully reversed ranking)
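The arithmetic in this example can be checked with a short Python sketch. It assumes the sum-of-absolute-position-differences metric used above and one possible normalization, dividing by the largest distance any ranking can reach. The function names are hypothetical, and other normalizations are equally defensible.

```python
def footrule_distance(expert: list[str], candidate: list[str]) -> int:
    """Sum of absolute differences between each option's position in the two rankings."""
    expert_pos = {opt: i for i, opt in enumerate(expert, start=1)}
    candidate_pos = {opt: i for i, opt in enumerate(candidate, start=1)}
    return sum(abs(expert_pos[opt] - candidate_pos[opt]) for opt in expert)


def normalized_score(expert: list[str], candidate: list[str]) -> float:
    """Map distance to a 0..1 score: 1.0 is identical, 0.0 is a fully reversed ranking."""
    n = len(expert)
    max_distance = (n * n) // 2  # largest possible distance for n options (12 when n = 5)
    return 1 - footrule_distance(expert, candidate) / max_distance


expert = ["E", "D", "A", "C", "B"]
candidate = ["E", "D", "C", "A", "B"]

print(footrule_distance(expert, candidate))           # 2
print(round(normalized_score(expert, candidate), 3))  # 0.833
```

Whatever normalization you pick, document it. The same raw distance of 2 reads very differently depending on what you divide by.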
Advantages:
- Rewards partial alignment. A candidate who is "mostly right but with one option flipped" gets credit.
- More granular. Captures nuance in reasoning.
- Forgiving of edge cases where two options are very close in quality.
Disadvantages:
- More complex to calculate and explain.
- Requires clear definition of "distance" (Kendall tau, Spearman correlation, other metrics).
- A small difference in top choice can have large scoring impact depending on how you weight it.
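To show why the choice of metric needs to be pinned down, here is a sketch of a Kendall tau style distance, which counts how many pairs of options the candidate orders differently from the expert. The function names are hypothetical, and the normalization (dividing by the total number of pairs) is one reasonable assumption among several.

```python
from itertools import combinations


def kendall_tau_distance(expert: list[str], candidate: list[str]) -> int:
    """Number of option pairs that the two rankings order differently."""
    cand_pos = {opt: i for i, opt in enumerate(candidate)}
    # combinations() yields (a, b) with a ranked above b by the expert;
    # the pair is discordant when the candidate ranks a below b.
    return sum(1 for a, b in combinations(expert, 2) if cand_pos[a] > cand_pos[b])


def kendall_score(expert: list[str], candidate: list[str]) -> float:
    """Map distance to a 0..1 score by dividing by the total number of pairs."""
    n = len(expert)
    max_pairs = n * (n - 1) // 2  # 10 pairs when n = 5
    return 1 - kendall_tau_distance(expert, candidate) / max_pairs


expert = ["E", "D", "A", "C", "B"]
candidate = ["E", "D", "C", "A", "B"]

print(kendall_tau_distance(expert, candidate))  # 1 (only the A/C pair is swapped)
print(kendall_score(expert, candidate))
```

The same pair of rankings that sits at distance 2 under the sum-of-absolute-differences metric sits at distance 1 here, so scores produced by different metrics are not directly comparable.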
Use distance-based scoring when: You want to hire individual contributors where reasonable disagreement is valuable. You want to see the shape of their judgment, not just the top choice. You have high tolerance for diversity of approach.
Comparing candidates: internal benchmarks vs. external norms
Internal benchmarks (recommended)
Give the SJT to your current top performers in the role. Document their average score. Use that as your comparison point for candidates.
Example:
- Your five best engineers average 78% on your custom engineering incident-response SJT.
- Candidate A scores 82%.
- Candidate B scores 71%.
Interpretation: Candidate A aligns well with your top performers. Candidate B deviates—either they have different judgment patterns (which could be good or bad) or they do not understand your context yet.
Why internal benchmarks work:
- They measure alignment with your definition of good judgment, not generic definitions.
- They let you say "we are hiring for people who think like our best performers on these dimensions."
- They surface subculture (if your top performers disagree with each other, that is interesting data too).
How to create internal benchmarks:
- Pick 5–10 high performers who have been with you 2+ years (enough to prove themselves).
- Give them the SJT (if your assessment is new, they can do it retrospectively: "How would you rank these?").
- Calculate their average score.
- Calculate individual variability (do they agree or is there debate?).
High internal variability is useful data: "Our top performers think differently about this." This might mean:
- The scenario is genuinely ambiguous (good—it should be)
- You have different subcultures within high performers (not necessarily bad, but interesting)
- Your master ranking is not representative (revisit it)
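The benchmark steps above can be sketched with Python's standard `statistics` module. The five individual scores below are hypothetical, chosen only so that they average the 78% used in the earlier example. `gap_in_sds` is an illustrative helper, not an established metric.

```python
from statistics import mean, stdev

# Hypothetical scores for five top performers (they average 78%).
top_performer_scores = [74, 76, 78, 80, 82]

benchmark_mean = mean(top_performer_scores)  # average score of the benchmark group
benchmark_sd = stdev(top_performer_scores)   # sample standard deviation (variability)


def gap_in_sds(candidate_score: float) -> float:
    """How far a candidate sits from the benchmark mean, in benchmark standard deviations."""
    return (candidate_score - benchmark_mean) / benchmark_sd


print(benchmark_mean)             # 78
print(round(benchmark_sd, 2))     # 3.16
print(round(gap_in_sds(82), 2))   # Candidate A: 1.26
print(round(gap_in_sds(71), 2))   # Candidate B: -2.21
```

With only 5 to 10 people in the benchmark group, the standard deviation is a rough estimate. Treat the gap as a prompt for a conversation rather than a cutoff.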
External norms (use cautiously)
Some commercial SJT vendors (SHL, CEB Talent, others) have published norms: "For a software engineer role, the 50th percentile score is 64%." You can compare your candidate to that distribution.
Why this is tricky:
- External norms assume the assessment is generic or industry-standard.
- Your custom-designed SJT will not have published norms.
- A candidate scoring at the 80th percentile on an external SJT might score at the 40th percentile on your internal benchmark if your definition of "good judgment" is different.
Use external norms for:
- Sanity-checking your assessments (if everyone scores above the 90th percentile, your assessment is probably too easy)
- Red-flag detection (if a candidate is below the 20th percentile, something is off)
- Transparency (you can tell candidates "for this role, the average score is...")
Do not use external norms alone. Always pair with internal benchmarks if possible.
Interpreting patterns, not just scores
Two candidates both score 76% under distance-based scoring. The pattern of their choices still matters.
Candidate A's rankings by scenario:
- Incident response: E first (matches expert)
- Customer conflict: D first (matches expert)
- Team friction: A first (expert ranked B first)
- Delegation: B first (expert ranked B first)
- Prioritization: C first (expert ranked D first)
Pattern: Mostly matches the expert ranking (three of five top choices). Deviates on team friction, a people-focused scenario, and on prioritization. Hypothesis: strong technical judgment, weaker on people judgment.
Candidate B's rankings:
- Incident response: B first (expert E)
- Customer conflict: E first (expert D)
- Team friction: D first (expert B)
- Delegation: A first (expert B)
- Prioritization: D first (expert D)
Pattern: Top choice differs from the expert on four of five scenarios, with no domain standing out as a strength. Hypothesis: either does not understand your context, or has a fundamentally different judgment philosophy.
Both score 76%. But Candidate A reveals a weakness you can coach (people judgment). Candidate B reveals either lack of understanding or misalignment that is harder to fix.
Track patterns by domain:
- Technical judgment (incident response, debugging, architecture)
- People judgment (conflict, delegation, feedback)
- Execution judgment (prioritization, resource allocation, trade-offs)
- Risk management (escalation, when to slow down)
This granularity lets you say: "We would hire them for role X but not role Y" based on their pattern.
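A per-domain tally like the one described above might look like the following sketch. The top choices come from the Candidate A and Candidate B example. The mapping of scenarios to domains is an assumption based on the domain list, and `matches_by_domain` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed mapping of each scenario to a judgment domain.
DOMAIN = {
    "incident response": "technical",
    "customer conflict": "people",
    "team friction": "people",
    "delegation": "people",
    "prioritization": "execution",
}

# Expert first choice per scenario, taken from the example above.
EXPERT_TOP = {
    "incident response": "E",
    "customer conflict": "D",
    "team friction": "B",
    "delegation": "B",
    "prioritization": "D",
}


def matches_by_domain(candidate_top: dict[str, str]) -> dict[str, tuple[int, int]]:
    """Return {domain: (top choices matching the expert, scenarios in that domain)}."""
    tally = defaultdict(lambda: [0, 0])
    for scenario, choice in candidate_top.items():
        domain = DOMAIN[scenario]
        tally[domain][1] += 1
        if choice == EXPERT_TOP[scenario]:
            tally[domain][0] += 1
    return {domain: (hits, total) for domain, (hits, total) in tally.items()}


candidate_a = {"incident response": "E", "customer conflict": "D",
               "team friction": "A", "delegation": "B", "prioritization": "C"}
candidate_b = {"incident response": "B", "customer conflict": "E",
               "team friction": "D", "delegation": "A", "prioritization": "D"}

print(matches_by_domain(candidate_a))  # technical 1/1, people 2/3, execution 0/1
print(matches_by_domain(candidate_b))  # technical 0/1, people 0/3, execution 1/1
```

With one or two scenarios per domain, a single miss swings the ratio heavily, so read these tallies as hypotheses to probe in the interview rather than as conclusions.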
SJT score + interview coherence
A strong SJT score means the candidate theoretically aligns with your judgment standards. An interview validates that they can execute on that judgment. Use your hiring rubric to ensure consistency across all interviewers.
Strong SJT + strong interview: Aligned on judgment and can articulate examples. High confidence hire.
Strong SJT + weak interview: They "know" the right judgment in the abstract but cannot back it with examples or their examples feel rehearsed. Red flag. Probe: "Tell me about a time you chose to escalate early instead of investigating alone. What was the situation?"
Weak SJT + strong interview: They do not score well on your test but their past decisions align with your judgment standards. This often means one of two things: they did not understand your context in the SJT (for example, they are new to the industry), or your assessment is not measuring what you think it measures. Do not filter them out automatically. Understand why the mismatch exists.
Weak SJT + weak interview: Consistent signal. Judgment does not align or is not strong. Less likely to be a fit.
When SJT scores do not predict performance
SJTs are good for judgment measurement, but they do not predict everything. They predict:
- Decision quality under ambiguity
- Problem-solving approach
- Escalation discipline
- People judgment (for management roles)
They do not predict:
- Execution speed (a candidate might make great decisions but be slow to act)
- Persistence through setback (they might know the right call but give up when it is hard)
- Learning velocity (they might understand your judgment standards but need time to internalize them)
- Communication ability (they might think well but struggle to explain)
- Technical skill (for roles where technical depth matters alongside judgment)
If you only use an SJT, you are missing these dimensions. Pair it with:
- Coding or work sample assessments for technical skill
- Behavioral interviews for past execution and resilience
- Structured interviews with rubrics for communication and depth
Red flags in SJT interpretation
Red flag 1: Everyone scores the same.
If all candidates score 82%, or all score 45%, your assessment is not differentiating. Likely causes:
- Assessment is too easy or too hard
- Master ranking is not representative
- Candidates are not understanding the scenarios
Revise the assessment. Pilot with 3–5 people and iterate.
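One rough way to spot this red flag is to look at the spread of scores across a batch of candidates. The sketch below is illustrative only: the function name, the sample scores, and the 5-point threshold are all assumptions, and the threshold should be tuned to your own scoring scale.

```python
from statistics import pstdev


def lacks_differentiation(scores: list[float], min_spread: float = 5.0) -> bool:
    """True when the standard deviation of scores falls below `min_spread` points.

    `min_spread` is a hypothetical threshold, not an established standard.
    """
    return pstdev(scores) < min_spread


clustered = [82, 82, 81, 83, 82]  # hypothetical batch: everyone scores about the same
spread = [55, 68, 74, 82, 91]     # hypothetical batch: scores separate candidates

print(lacks_differentiation(clustered))  # True
print(lacks_differentiation(spread))     # False
```

A low spread tells you that the assessment is not separating candidates, not why. The causes listed above still need to be checked by hand.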
Red flag 2: Scores do not track seniority.
If junior candidates routinely score higher than your senior hires, something is wrong. Either:
- The assessment is measuring something other than what you think
- Your scoring is inconsistent
- You are comparing against the wrong benchmarks
Investigate by asking high and low scorers: "Tell me why you ranked that option first." Do their explanations match your expectations?
Red flag 3: Demographic groups score significantly differently.
If women consistently score 10+ points lower than men, or one ethnic group scores systematically lower, your assessment may have bias. Causes:
- Scenarios reflect cultural assumptions
- Language is not equally accessible
- Scenarios privilege certain types of experience
Review for bias (fairness in assessment design) and test with diverse groups.
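A first-pass screen for this red flag can compare average scores across groups. The sketch below uses the 10-point gap mentioned above as its threshold. The group labels, the scores, and the `group_gaps` helper are hypothetical, and a raw difference in means on a small sample is a prompt to investigate, not a statistical test.

```python
from statistics import mean


def group_gaps(scores_by_group: dict[str, list[float]],
               threshold: float = 10.0) -> dict[str, float]:
    """Return groups whose mean score trails the highest group mean by `threshold` or more."""
    means = {group: mean(scores) for group, scores in scores_by_group.items()}
    top = max(means.values())
    return {group: top - m for group, m in means.items() if top - m >= threshold}


# Hypothetical scores, for illustration only.
scores = {
    "group_1": [78, 82, 75, 80],
    "group_2": [66, 70, 64, 68],
}

print(group_gaps(scores))  # {'group_2': 11.75}
```

If a group is flagged, go back to the individual scenarios to see where the gap concentrates before changing the assessment as a whole.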
Communicating scores to candidates
Be transparent about what the score means. Do not say "you scored 72%." Say:
"On our situational judgment assessment, you ranked the top choice consistently with our top performers on 3 of 5 scenarios. Your judgment on [domain] aligned well with our standards. Your approach to [domain] differs from our norm—this could be a strength (fresh perspective) or could require adaptation to our culture."
This reframes the score as a pattern of judgment rather than a pass/fail grade. It signals that:
- You are measuring something specific
- You understand context
- You are open to learning their reasoning
Using SJT scores in the hiring decision
SJT scores are one signal among many. Use them as:
- Screening filter: Strong SJT + resume fit moves forward. Weak SJT but interesting background warrants investigation.
- Interview probe: Use the assessment as a springboard for behavioral questions. "I noticed on the escalation scenario you ranked X first. Tell me about a time you escalated."
- Tiebreaker: Two candidates with similar interviews? The one with stronger SJT alignment is likely to adapt better to your culture.
- Onboarding data: For hired candidates, track their SJT patterns in onboarding to identify mentorship focus areas.
Do not use SJT as a knockout filter for borderline candidates. Use it as context.
For comprehensive assessment strategy, layer SJTs with coding assessments, structured interviews, and reference checks. Each measures different dimensions of fit.
ClarityHire's assessment platform includes automated scoring, benchmarking against your internal top performers, and pattern analysis to simplify interpretation.