2026 BOY Numeracy Scoring Decisions

Internal release-candidate discussion memo. Updated 2026-06-19 06:57 UTC. Aggregate evidence only.

Purpose. This page structures the scoring discussion, walks through the current model results, shows score-agreement plots, and ends with the choices that need a decision.

Working recommendation. If we need a near-term live score, use the unidim + testlet screener-index result as the 2026 BOY Numeracy Screener Index. Keep D/trailing-zero scoring only for timed-form performance, exclude STPM from math achievement, and use hierarchical modelling for subtest profiles where they are reportable.

Questions to discuss

question | why it matters
What should the single global score mean? | Determines whether we use a Trusted-subtest composite, unidim + testlet screener index, balanced score, or future fluency model.
Should timed unreached trailing items count as zero or missing? | Zero supports timed-form performance. Missing supports pure reached-item accuracy.
Should response time or pace be part of the 2026 live score? | Current recommendation is shadow/developmental only. Fluency means accurate and efficient, not fast alone.
Which subtests can be reported as standalone or profile evidence? | Strong subtests may support profiles. Weak/moderate subtests need hierarchical shrinkage, caveats, or internal-only status.
Should weak but construct-relevant probes stay in the global score? | Exclude them for a Trusted-subtest composite; retain or downweight them for screener-index/balanced-score claims.
What score movement is acceptable when switching models? | Use correlations, percentile shifts, risk-band movement, and high-cut movement to decide release readiness.
What remains unresolved before release-candidate lock? | Year 1 BNL cleanup, weighted-vs-unweighted adjudication, hierarchical reporting status, and later outcome/fairness checks.
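
The zero-versus-missing question can be made concrete with a small sketch. The function names, the policy labels, and the use of `None` for "no response" are illustrative assumptions, not the production scoring code. It also assumes that a skipped item inside the reached span counts as incorrect under both policies, which the memo does not specify.

```python
from typing import List, Optional, Sequence, Tuple


def split_trailing_unreached(
    responses: Sequence[Optional[int]],
) -> Tuple[List[Optional[int]], int]:
    """Split a timed response vector into the reached span and the
    count of trailing unreached items (None after the last attempt)."""
    last = len(responses)
    while last > 0 and responses[last - 1] is None:
        last -= 1
    return list(responses[:last]), len(responses) - last


def proportion_correct(
    responses: Sequence[Optional[int]], policy: str
) -> Optional[float]:
    """Proportion correct under one of two hypothetical policy labels.

    "trailing_zero": trailing unreached items count as incorrect.
    "reached_only":  trailing unreached items are treated as missing.
    """
    reached, n_trailing = split_trailing_unreached(responses)
    # Assumption: a None inside the reached span is scored as incorrect.
    correct = sum(1 for r in reached if r == 1)
    if policy == "trailing_zero":
        denominator = len(reached) + n_trailing
    elif policy == "reached_only":
        denominator = len(reached)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return correct / denominator if denominator else None


# A student who answers 3 of the first 4 items correctly, then runs out of time.
example = [1, 1, 0, 1, None, None, None, None]
timed_form = proportion_correct(example, "trailing_zero")
reached_accuracy = proportion_correct(example, "reached_only")
```

The same student scores 0.375 under trailing-zero and 0.75 under reached-only, which is why the two policies support different claims (timed-form performance versus pure reached-item accuracy).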

Review model results

Agreement and shift columns compare each candidate with the unidim + testlet screener-index candidate on a 0–100 percentile scale.

year | model | n | agreement vs screener | median abs shift (pp) | p95 abs shift (pp) | release read
Foundation | Trusted-subtest composite | 993 | 0.635 | 15.4 | 49.4 | Clean trusted-marker option; narrower construct
Foundation | Unidim + testlet screener index | 997 | reference | 0 | 0 | Primary candidate if claim is 2026 BOY Numeracy Screener Index
Foundation | Hierarchical global | 997 | 0.904 | 6.9 | 26.2 | Shadow/internal for global; useful for subscore pooling
Foundation | Equal-subtest composite | 974 | 0.956 | 4.7 | 18.6 | Balanced-score challenger; not same claim as screener index
Foundation | Reached-only composite | 974 | 0.897 | 7.2 | 27.2 | Policy comparator for pure reached accuracy, not timed-form performance
Year 1 | Trusted-subtest composite | 1,198 | 0.706 | 13.7 | 44.8 | Clean trusted-marker option; narrower construct
Year 1 | Unidim + testlet screener index | 1,221 | reference | 0 | 0 | Primary candidate if claim is 2026 BOY Numeracy Screener Index
Year 1 | Hierarchical global | 1,221 | 0.836 | 9.4 | 34.3 | Shadow/internal for global; useful for subscore pooling
Year 1 | Equal-subtest composite | 1,178 | 0.885 | 8.3 | 28.1 | Balanced-score challenger; not same claim as screener index
Year 1 | Reached-only composite | 1,178 | 0.861 | 9.0 | 30.9 | Policy comparator for pure reached accuracy, not timed-form performance
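
For readers who want to see how columns of this kind can be computed, here is a minimal sketch. It assumes that agreement is a rank correlation on within-sample percentiles and that shifts are absolute differences in percentile points. The memo does not state the exact definitions, so the mid-rank percentile rule, the interpolated 95th percentile, and all function names are assumptions.

```python
from math import sqrt
from statistics import mean, median


def percentile_ranks(scores):
    """Mid-rank percentile (0-100) of each score within its own sample."""
    n = len(scores)
    ranks = []
    for s in scores:
        below = sum(1 for other in scores if other < s)
        tied = sum(1 for other in scores if other == s)  # includes s itself
        ranks.append(100.0 * (below + 0.5 * tied) / n)
    return ranks


def pearson(x, y):
    """Pearson correlation; applied to ranks it gives Spearman's rho."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def quantile(values, q):
    """Quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def agreement_summary(candidate, reference):
    """Compare a candidate score with the reference on the percentile scale."""
    if len(candidate) != len(reference):
        raise ValueError("scores must be paired student by student")
    cand_pct = percentile_ranks(candidate)
    ref_pct = percentile_ranks(reference)
    shifts = [abs(c - r) for c, r in zip(cand_pct, ref_pct)]
    return {
        "rho": pearson(cand_pct, ref_pct),
        "median_abs_shift_pp": median(shifts),
        "p95_abs_shift_pp": quantile(shifts, 0.95),
    }


# Toy data: five students, with the two lowest swapped by the candidate score.
reference_scores = [1, 2, 3, 4, 5]
candidate_scores = [2, 1, 3, 4, 5]
summary = agreement_summary(candidate_scores, reference_scores)
```

The toy example shows why the median and p95 shift are both worth reporting: a high correlation can coexist with large percentile moves for a minority of students.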

Subtest evidence at a glance

This is the short version needed for the decision discussion. The collapsed appendix has the full standalone-modelling table.

year | subtest | items | standalone precision | coherence | floor / ceiling | trailing unreached | risk flags | release candidate role
Foundation | MQ1-20 | 19 | weak (0.6) | ρ 0.43 | 0.04 / 0 | 0.83 | weak_standalone_reliability; sparse_nonconstant_items_retained | Hierarchical/descriptive only; avoid standalone high-stakes subscore
Foundation | MC0-20 | 50 | strong (0.93) | ρ 0.53 | 0.01 / 0 | 0.74 | sparse_nonconstant_items_retained | Include if construct claim includes this probe; profile candidate
Foundation | MNC0-20 | 24 | strong (0.88) | ρ 0.6 | 0.06 / 0 | 0.75 | sparse_nonconstant_items_retained | Include if construct claim includes this probe; profile candidate
Foundation | DMT10_2026 | 6 | weak (0.61) | ρ 0.43 | 0.01 / 0.09 | n/a | weak_standalone_reliability; few_calibration_items | Hierarchical/descriptive only; not a standalone global driver
Foundation | BNL0-20 | 10 | weak (0.67) | ρ 0.35 | 0.01 / 0 | n/a | weak_standalone_reliability; number_line_policy_sensitive | Hierarchical/descriptive only; avoid standalone high-stakes subscore
Year 1 | MC0-100 | 34 | strong (0.94) | ρ 0.69 | 0.02 / 0 | 0.76 | sparse_nonconstant_items_retained | Include if construct claim includes this probe; profile candidate
Year 1 | MNC0-100 | 22 | strong (0.89) | ρ 0.76 | 0.03 / 0 | 0.71 | sparse_nonconstant_items_retained | Include if construct claim includes this probe; profile candidate
Year 1 | AAMC | 38 | strong (0.9) | ρ 0.73 | 0.04 / 0 | 0.78 | sparse_nonconstant_items_retained | Include if construct claim includes this probe; profile candidate
Year 1 | ASMC | 25 | moderate (0.84) | ρ 0.62 | 0.13 / 0 | 0.76 | moderate_reliability; floor_rate_ge_10pct; sparse_nonconstant_items_retained | Profile with hierarchical shrinkage; caveat standalone interpretation
Year 1 | BNL0-100 | 13 | moderate (0.73) | ρ 0.55 | 0 / 0 | n/a | moderate_reliability; number_line_policy_sensitive | Profile with hierarchical shrinkage; caveat standalone interpretation
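
The evidence bands and the floor flag in this table look threshold-driven, so a short sketch may help when discussing where the lines sit. The cut points 0.70 and 0.85 below are hypothetical: they were chosen only because they reproduce the bands shown above, and the real rule is not stated in this memo. The 10% floor threshold is inferred from the flag name `floor_rate_ge_10pct`.

```python
# Hypothetical cut points, chosen only to be consistent with the table.
MODERATE_CUT = 0.70
STRONG_CUT = 0.85
FLOOR_RATE_CUT = 0.10  # inferred from the flag name floor_rate_ge_10pct


def evidence_band(reliability: float) -> str:
    """Map a standalone reliability to weak / moderate / strong."""
    if reliability >= STRONG_CUT:
        return "strong"
    if reliability >= MODERATE_CUT:
        return "moderate"
    return "weak"


def has_floor_flag(floor_rate: float) -> bool:
    """True when the share of students at the floor is at least 10%."""
    return floor_rate >= FLOOR_RATE_CUT


# (subtest, standalone reliability, floor rate) copied from the table.
SUBTESTS = [
    ("MQ1-20", 0.60, 0.04),
    ("MC0-20", 0.93, 0.01),
    ("MNC0-20", 0.88, 0.06),
    ("DMT10_2026", 0.61, 0.01),
    ("BNL0-20", 0.67, 0.01),
    ("MC0-100", 0.94, 0.02),
    ("MNC0-100", 0.89, 0.03),
    ("AAMC", 0.90, 0.04),
    ("ASMC", 0.84, 0.13),
    ("BNL0-100", 0.73, 0.00),
]

bands = {name: evidence_band(rel) for name, rel, _ in SUBTESTS}
floor_flagged = [name for name, _, floor in SUBTESTS if has_floor_flag(floor)]
```

Under these assumed cut points, ASMC (0.84) sits just below the strong band and BNL0-20 (0.67) just below moderate, so small changes to either threshold would change their reporting role.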

Correlations and scatterplots: how much do scoring choices change scores?

Plots are aggregate SVG summaries generated at build time. No student-level data are published. “Hierarchical trailing-zero” means the hierarchical subtest score fit from the policy-locked frame where timed non-NL probes use D/trailing-zero scoring.

Global score agreement

[Figure: four aggregate jittered scatter panels; n values match the Foundation rows of the model results table. x-axis: Unidim + testlet screener-index percentile (0–100); y-axis: candidate percentile (0–100).]
Trusted-subtest composite vs screener index: ρ=0.64, n=993
Hierarchical global vs screener index: ρ=0.90, n=997
Equal-subtest composite vs screener index: ρ=0.96, n=974
Reached-only composite vs screener index: ρ=0.90, n=974

[Figure: four aggregate jittered scatter panels; n values match the Year 1 rows of the model results table. Same axes.]
Trusted-subtest composite vs screener index: ρ=0.71, n=1198
Hierarchical global vs screener index: ρ=0.84, n=1221
Equal-subtest composite vs screener index: ρ=0.89, n=1178
Reached-only composite vs screener index: ρ=0.86, n=1178

Reached/valid-only accuracy vs D/trailing-zero

Foundation

[Figure: three aggregate jittered scatter panels. x-axis: Reached/valid-only percentile (0–100); y-axis: D/trailing-zero percentile (0–100).]
MQ1-20: ρ=0.86, n=997
MC0-20: ρ=0.72, n=995
MNC0-20: ρ=0.90, n=993

Year 1

[Figure: four aggregate jittered scatter panels. x-axis: Reached/valid-only percentile (0–100); y-axis: D/trailing-zero percentile (0–100).]
MC0-100: ρ=0.78, n=1221
MNC0-100: ρ=0.90, n=1211
AAMC: ρ=0.88, n=1205
ASMC: ρ=0.92, n=1198

Standalone trailing-zero score vs hierarchical trailing-zero subtest score

Foundation

[Figure: five aggregate jittered scatter panels. x-axis: Standalone trailing-zero/operational percentile (0–100); y-axis: Hierarchical trailing-zero subtest percentile (0–100).]
MQ1-20: ρ=0.93, n=997
MC0-20: ρ=1.00, n=995
MNC0-20: ρ=1.00, n=993
DMT10_2026: ρ=0.96, n=988
BNL0-20: ρ=0.94, n=974

Year 1

[Figure: five aggregate jittered scatter panels. x-axis: Standalone trailing-zero/operational percentile (0–100); y-axis: Hierarchical trailing-zero subtest percentile (0–100).]
MC0-100: ρ=1.00, n=1221
MNC0-100: ρ=0.99, n=1211
AAMC: ρ=0.99, n=1205
ASMC: ρ=0.99, n=1198
BNL0-100: ρ=0.96, n=1178
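
In the panels above, the subtests with weaker standalone reliability tend to show the lowest standalone-versus-hierarchical agreement, which is the pattern one would expect if shrinkage scales with unreliability. The sketch below illustrates that idea with Kelley's regressed-score formula. This is a swapped-in, single-subtest analogue for discussion only, not the hierarchical model fitted for this memo; the function name and the example numbers are hypothetical.

```python
def kelley_shrink(observed: float, group_mean: float, reliability: float) -> float:
    """Kelley's regressed score: pull an observed score toward the group
    mean in proportion to (1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return reliability * observed + (1.0 - reliability) * group_mean


# Hypothetical student scoring 80 on a scale where the group mean is 50.
observed_score = 80.0
group_mean = 50.0

# Reliabilities of roughly the size reported for a weak and a strong subtest.
shrunk_weak = kelley_shrink(observed_score, group_mean, 0.60)
shrunk_strong = kelley_shrink(observed_score, group_mean, 0.93)
```

With reliability 0.60 the score moves from 80 to about 68; with 0.93 it moves only to about 77.9. A hierarchical model pools toward what the other subtests imply rather than toward a flat mean, so the actual adjustments will differ.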

Discussions

Use this block to choose the construct claim first. The model and response-process policy follow from that choice.

if the team chooses | then use | response policy | tradeoff
Trusted-subtest composite | Independent trusted subtest scores only | Timed trailing unreached treated as missing for pure accuracy | Cleanest psychometric marker, but narrower construct coverage
2026 BOY Numeracy Screener Index | Unidim + testlet item-level model | D/trailing-zero for timed non-NL, locked NL ordinal policy, STPM excluded | Best near-term operational path, but do not claim balanced broad numeracy
Balanced broad numeracy | Equal-subtest weighted IRT or equal-subtest composite | Same scoring policies, but subtest influence balanced by design | Construct claim is cleaner, but requires weighted-vs-unweighted adjudication
Fluency | Future accuracy × pace model | Model correctness and pace jointly; do not use speed alone | Not the current live score; keep as development/shadow evidence
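
The difference between the screener-index and balanced claims can be illustrated with simple arithmetic on the Foundation item counts. In the sketch below, each subtest's share of items stands in as a crude proxy for its influence on an item-level score. That proxy is an assumption: actual influence in an IRT model also depends on item discrimination and testlet structure, and the memo does not confirm that all five subtests enter both scores.

```python
# Foundation item counts from the subtest evidence table.
FOUNDATION_ITEMS = {
    "MQ1-20": 19,
    "MC0-20": 50,
    "MNC0-20": 24,
    "DMT10_2026": 6,
    "BNL0-20": 10,
}


def item_share_weights(item_counts):
    """Weight each subtest by its share of items (proxy for item-level pooling)."""
    total = sum(item_counts.values())
    return {name: count / total for name, count in item_counts.items()}


def equal_subtest_weights(item_counts):
    """Give every subtest the same weight, regardless of length."""
    n = len(item_counts)
    return {name: 1.0 / n for name in item_counts}


item_weights = item_share_weights(FOUNDATION_ITEMS)
equal_weights = equal_subtest_weights(FOUNDATION_ITEMS)

for name in FOUNDATION_ITEMS:
    print(f"{name:12s} item share {item_weights[name]:.2f}  equal {equal_weights[name]:.2f}")
```

On this proxy, MC0-20 carries about 46% of the items against 20% under equal weighting, while DMT10_2026 carries under 6%. That gap is one reason the screener index should not be described as a balanced broad-numeracy score.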

Recommended wording if the screener-index path is selected

Technical appendices

Collapsed by default. These are support materials, not the main read.

Model run review: Global Stan diagnostics and current caveats

Open full model run review

Detailed subtest evidence table: Standalone modelling and hierarchical readiness by subtest
year | subtest | policy | model | students | release items | standalone fit items | excluded items | standalone reliability | evidence band | coherence with other subtests | floor / ceiling | trailing unreached | itemfit flag rate | profile modelling posture | release candidate role | risk flags
Foundation | MQ1-20 | D/trailing-zero | Rasch 1PL | 1,006 | 19 | 19 | 11 | 0.6 | weak | 0.43 | 0.04 / 0 | 0.83 | 0.68 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | Hierarchical/descriptive only; avoid standalone high-stakes subscore | weak_standalone_reliability; sparse_nonconstant_items_retained
Foundation | MC0-20 | D/trailing-zero | Rasch 1PL | 1,005 | 50 | 50 | 10 | 0.93 | strong | 0.53 | 0.01 / 0 | 0.74 | 0.96 | strong standalone signal; still prefer hierarchical coherence with global score | Include if construct claim includes this probe; profile candidate | sparse_nonconstant_items_retained
Foundation | MNC0-20 | D/trailing-zero | Rasch 1PL | 1,003 | 24 | 24 | 6 | 0.88 | strong | 0.6 | 0.06 / 0 | 0.75 | 0.88 | strong standalone signal; still prefer hierarchical coherence with global score | Include if construct claim includes this probe; profile candidate | sparse_nonconstant_items_retained
Foundation | DMT10_2026 | Valid-only | Rasch 1PL | 1,002 | 6 target items | 8 | 3 | 0.61 | weak | 0.43 | 0.01 / 0.09 | n/a | 0.12 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | Hierarchical/descriptive only; not a standalone global driver | weak_standalone_reliability; few_calibration_items
Foundation | BNL0-20 | NL2 ordinal | PCM | 974 | 10 | 10 | 0 | 0.67 | weak | 0.35 | 0.01 / 0 | n/a | 0 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | Hierarchical/descriptive only; avoid standalone high-stakes subscore | weak_standalone_reliability; number_line_policy_sensitive
Year 1 | MC0-100 | D/trailing-zero | Rasch 1PL | 1,235 | 34 | 34 | 26 | 0.94 | strong | 0.69 | 0.02 / 0 | 0.76 | 0.91 | strong standalone signal; still prefer hierarchical coherence with global score | Include if construct claim includes this probe; profile candidate | sparse_nonconstant_items_retained
Year 1 | MNC0-100 | D/trailing-zero | Rasch 1PL | 1,229 | 22 | 22 | 7 | 0.89 | strong | 0.76 | 0.03 / 0 | 0.71 | 0.91 | strong standalone signal; still prefer hierarchical coherence with global score | Include if construct claim includes this probe; profile candidate | sparse_nonconstant_items_retained
Year 1 | AAMC | D/trailing-zero | Rasch 1PL | 1,227 | 38 | 38 | 2 | 0.9 | strong | 0.73 | 0.04 / 0 | 0.78 | 0.97 | strong standalone signal; still prefer hierarchical coherence with global score | Include if construct claim includes this probe; profile candidate | sparse_nonconstant_items_retained
Year 1 | ASMC | D/trailing-zero | Rasch 1PL | 1,223 | 25 | 25 | 5 | 0.84 | moderate | 0.62 | 0.13 / 0 | 0.76 | 0.72 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic | Profile with hierarchical shrinkage; caveat standalone interpretation | moderate_reliability; floor_rate_ge_10pct; sparse_nonconstant_items_retained
Year 1 | BNL0-100 | NL2 ordinal | PCM | 1,178 | 13 | 13 | 0 | 0.73 | moderate | 0.55 | 0 / 0 | n/a | 0.08 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic | Profile with hierarchical shrinkage; caveat standalone interpretation | moderate_reliability; number_line_policy_sensitive

Item audit: Expanded BOY subtest/item diagnostics

The full item audit remains separate because it is long and figure-heavy.

Open item audit appendix

Subscore readiness CSV: Source aggregate table for hierarchical/reportability decisions

Download subscore readiness CSV