2026 BOY Numeracy Scoring Decisions
Purpose. This page structures the scoring discussion, walks through the current model results, shows score-agreement plots, and ends with the choices that need a decision.
Working recommendation. If we need a near-term live score: use the unidim + testlet screener-index result as the 2026 BOY Numeracy Screener Index; interpret D/trailing-zero scoring only as timed-form performance; exclude STPM from math achievement; and report modelled subtest profiles, where reportable, from the hierarchical model.
Questions to discuss
| question | why it matters |
|---|---|
| What should the single global score mean? | Determines whether we use a Trusted-subtest composite, unidim + testlet screener index, balanced score, or future fluency model. |
| Should timed unreached trailing items count as zero or missing? | Zero supports timed-form performance. Missing supports pure reached-item accuracy. |
| Should response time or pace be part of the 2026 live score? | Current recommendation is shadow/developmental only. Fluency means accurate and efficient, not fast alone. |
| Which subtests can be reported as standalone or profile evidence? | Strong subtests may support profiles. Weak/moderate subtests need hierarchical shrinkage, caveats, or internal-only status. |
| Should weak but construct-relevant probes stay in the global score? | Exclude them for a Trusted-subtest composite; retain or downweight them for screener-index/balanced-score claims. |
| What score movement is acceptable when switching models? | Use correlations, percentile shifts, risk-band movement, and high-cut movement to decide release readiness. |
| What remains unresolved before release-candidate lock? | Year 1 BNL cleanup, weighted-vs-unweighted adjudication, hierarchical reporting status, and later outcome/fairness checks. |
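The zero-versus-missing question in the table above can be made concrete with a small sketch. It scores one hypothetical timed response vector under both policies. The function name, the `None` coding for unreached items, and the example student are assumptions for illustration; this is raw proportion correct, not the Rasch scoring used in the models.

```python
from typing import Optional, Sequence


def score_timed_form(responses: Sequence[Optional[int]], trailing_as_zero: bool) -> float:
    """Proportion correct on a timed form.

    responses: 1 = correct, 0 = incorrect, None = trailing item not reached.
    trailing_as_zero=True  -> unreached items count as incorrect
                              (timed-form performance).
    trailing_as_zero=False -> unreached items are dropped
                              (reached-item accuracy).
    """
    if trailing_as_zero:
        scored = [0 if r is None else r for r in responses]
    else:
        scored = [r for r in responses if r is not None]
    if not scored:
        raise ValueError("no scorable items")
    return sum(scored) / len(scored)


# Hypothetical student: reaches 8 of 20 items and gets 6 of them right.
responses = [1, 1, 0, 1, 1, 1, 0, 1] + [None] * 12

timed_form = score_timed_form(responses, trailing_as_zero=True)     # 6 / 20
reached_only = score_timed_form(responses, trailing_as_zero=False)  # 6 / 8
```

The same student lands at 0.30 under trailing-zero and 0.75 under reached-only, which is why the two policies support different score claims.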
Review model results
Agreement and shift columns compare each candidate with the unidim + testlet screener-index candidate on a 0–100 percentile scale. Shifts are reported in percentile points (pp).
| year | model | n | agreement vs screener | median abs shift pp | p95 abs shift pp | release read |
|---|---|---|---|---|---|---|
| Foundation | Trusted-subtest composite | 993 | 0.635 | 15.4 | 49.4 | Clean trusted-marker option; narrower construct |
| Foundation | Unidim + testlet screener index | 997 | reference | 0 | 0 | Primary candidate if claim is 2026 BOY Numeracy Screener Index |
| Foundation | Hierarchical global | 997 | 0.904 | 6.9 | 26.2 | Shadow/internal for global; useful for subscore pooling |
| Foundation | Equal-subtest composite | 974 | 0.956 | 4.7 | 18.6 | Balanced-score challenger; not same claim as screener index |
| Foundation | Reached-only composite | 974 | 0.897 | 7.2 | 27.2 | Policy comparator for pure reached accuracy, not timed-form performance |
| Year 1 | Trusted-subtest composite | 1,198 | 0.706 | 13.7 | 44.8 | Clean trusted-marker option; narrower construct |
| Year 1 | Unidim + testlet screener index | 1,221 | reference | 0 | 0 | Primary candidate if claim is 2026 BOY Numeracy Screener Index |
| Year 1 | Hierarchical global | 1,221 | 0.836 | 9.4 | 34.3 | Shadow/internal for global; useful for subscore pooling |
| Year 1 | Equal-subtest composite | 1,178 | 0.885 | 8.3 | 28.1 | Balanced-score challenger; not same claim as screener index |
| Year 1 | Reached-only composite | 1,178 | 0.861 | 9 | 30.9 | Policy comparator for pure reached accuracy, not timed-form performance |
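As a rough guide to how columns like these can be derived, the sketch below converts two score vectors to within-cohort percentiles and summarizes the absolute shifts. It rests on assumptions not stated on this page: agreement is taken to be a Pearson correlation of percentile ranks, p95 uses the nearest-rank rule, and both vectors cover the same students in the same order. The function names and toy scores are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right
from statistics import median
from typing import Dict, List, Sequence


def percentile_ranks(scores: Sequence[float]) -> List[float]:
    """Mid-rank percentile (0-100) of each score within its own cohort."""
    ordered = sorted(scores)
    n = len(ordered)
    ranks = []
    for s in scores:
        below = bisect_left(ordered, s)
        ties = bisect_right(ordered, s) - below
        ranks.append(100.0 * (below + 0.5 * ties) / n)
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation (assumed agreement metric)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shift_summary(candidate: Sequence[float], reference: Sequence[float]) -> Dict[str, float]:
    """Agreement and absolute percentile shifts of candidate vs reference."""
    cand_pct = percentile_ranks(candidate)
    ref_pct = percentile_ranks(reference)
    shifts = sorted(abs(c - r) for c, r in zip(cand_pct, ref_pct))
    p95_index = max(0, math.ceil(0.95 * len(shifts)) - 1)  # nearest-rank rule
    return {
        "agreement": pearson(cand_pct, ref_pct),
        "median_abs_shift_pp": median(shifts),
        "p95_abs_shift_pp": shifts[p95_index],
    }


# Toy cohort of five students; the two highest scorers swap places.
reference = [10.0, 20.0, 30.0, 40.0, 50.0]
candidate = [10.0, 20.0, 30.0, 50.0, 40.0]
summary = shift_summary(candidate, reference)
```

Because the candidates in the table have different n, a real comparison would first restrict both score sets to the students they share.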
Subtest evidence at a glance
This is the short version needed for the decision discussion. The collapsed appendix has the full standalone-modelling table.
| year | subtest | items | standalone precision | coherence | floor / ceiling | trailing unreached | risk flags | release candidate role |
|---|---|---|---|---|---|---|---|---|
| Foundation | MQ1-20 | 19 | weak (0.6) | ρ 0.43 | 0.04 / 0 | 0.83 | weak_standalone_reliability; sparse_nonconstant_items_retained | Hierarchical/descriptive only; avoid standalone high-stakes subscore |
| Foundation | MC0-20 | 50 | strong (0.93) | ρ 0.53 | 0.01 / 0 | 0.74 | sparse_nonconstant_items_retained | Include if construct claim includes this probe; profile candidate |
| Foundation | MNC0-20 | 24 | strong (0.88) | ρ 0.6 | 0.06 / 0 | 0.75 | sparse_nonconstant_items_retained | Include if construct claim includes this probe; profile candidate |
| Foundation | DMT10_2026 | 6 | weak (0.61) | ρ 0.43 | 0.01 / 0.09 | n/a | weak_standalone_reliability; few_calibration_items | Hierarchical/descriptive only; not a standalone global driver |
| Foundation | BNL0-20 | 10 | weak (0.67) | ρ 0.35 | 0.01 / 0 | n/a | weak_standalone_reliability; number_line_policy_sensitive | Hierarchical/descriptive only; avoid standalone high-stakes subscore |
| Year 1 | MC0-100 | 34 | strong (0.94) | ρ 0.69 | 0.02 / 0 | 0.76 | sparse_nonconstant_items_retained | Include if construct claim includes this probe; profile candidate |
| Year 1 | MNC0-100 | 22 | strong (0.89) | ρ 0.76 | 0.03 / 0 | 0.71 | sparse_nonconstant_items_retained | Include if construct claim includes this probe; profile candidate |
| Year 1 | AAMC | 38 | strong (0.9) | ρ 0.73 | 0.04 / 0 | 0.78 | sparse_nonconstant_items_retained | Include if construct claim includes this probe; profile candidate |
| Year 1 | ASMC | 25 | moderate (0.84) | ρ 0.62 | 0.13 / 0 | 0.76 | moderate_reliability; floor_rate_ge_10pct; sparse_nonconstant_items_retained | Profile with hierarchical shrinkage; caveat standalone interpretation |
| Year 1 | BNL0-100 | 13 | moderate (0.73) | ρ 0.55 | 0 / 0 | n/a | moderate_reliability; number_line_policy_sensitive | Profile with hierarchical shrinkage; caveat standalone interpretation |
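The precision bands and the reliability and floor flags in the table above follow a simple threshold pattern, sketched below. The reliability cut points (0.70 and 0.85) are hypothetical: they were picked only because they reproduce the bands shown here and are not confirmed pipeline settings. The 0.10 floor cut is implied by the flag name `floor_rate_ge_10pct`. The other flags (`sparse_nonconstant_items_retained`, `few_calibration_items`, `number_line_policy_sensitive`) depend on information not shown on this page and are not derived.

```python
from typing import List

# Hypothetical cut points that happen to reproduce the bands in the table.
WEAK_BELOW = 0.70
STRONG_FROM = 0.85
FLOOR_FLAG_FROM = 0.10  # implied by the flag name floor_rate_ge_10pct


def evidence_band(reliability: float) -> str:
    """Map standalone reliability to a weak/moderate/strong band."""
    if reliability < WEAK_BELOW:
        return "weak"
    if reliability < STRONG_FROM:
        return "moderate"
    return "strong"


def risk_flags(reliability: float, floor_rate: float) -> List[str]:
    """Derive only the reliability and floor flags; other flags are out of scope."""
    flags = []
    band = evidence_band(reliability)
    if band == "weak":
        flags.append("weak_standalone_reliability")
    elif band == "moderate":
        flags.append("moderate_reliability")
    if floor_rate >= FLOOR_FLAG_FROM:
        flags.append("floor_rate_ge_10pct")
    return flags


# Values taken from the table rows for Year 1 ASMC and Foundation MQ1-20.
asmc_flags = risk_flags(reliability=0.84, floor_rate=0.13)
mq_flags = risk_flags(reliability=0.60, floor_rate=0.04)
```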
Correlations and scatterplots: how much do choices change scores?
Plots are aggregate SVG summaries generated at build time. No student-level data are published. “Hierarchical trailing-zero” means the hierarchical subtest score fitted from the policy-locked frame, where timed non-NL (non-Number Line) probes use D/trailing-zero scoring.
Figure: Global score agreement.
Figure: Reached/valid-only accuracy vs D/trailing-zero (panels: Foundation, Year 1).
Figure: Standalone trailing-zero score vs hierarchical trailing-zero subtest score (panels: Foundation, Year 1).
Discussions
Use this block to choose the construct claim first. The model and response-process policy follow from that choice.
| if the team chooses | then use | response policy | tradeoff |
|---|---|---|---|
| Trusted-subtest composite | Independent trusted subtest scores only | Timed trailing unreached treated as missing for pure accuracy | Cleanest psychometric marker, but narrower construct coverage |
| 2026 BOY Numeracy Screener Index | Unidim + testlet item-level model | D/trailing-zero for timed non-NL, locked NL ordinal policy, STPM excluded | Best near-term operational path, but do not claim balanced broad numeracy |
| Balanced broad numeracy | Equal-subtest weighted IRT or equal-subtest composite | Same scoring policies, but subtest influence balanced by design | Construct claim is cleaner, but requires weighted-vs-unweighted adjudication |
| Fluency | Future accuracy × pace model | Model correctness and pace jointly; do not use speed alone | Not the current live score; keep as development/shadow evidence |
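To illustrate what "subtest influence balanced by design" changes, the sketch below compares an item-pooled composite with an equal-subtest composite on raw proportion correct. This is a deliberate simplification of the IRT-based options in the table. The function names and the student's scores are hypothetical; the item counts (50 and 6) follow the Foundation release items listed for MC0-20 and DMT10_2026.

```python
from typing import Dict, Tuple

# subtest name -> (items correct, items administered)
SubtestScores = Dict[str, Tuple[int, int]]


def pooled_composite(subtests: SubtestScores) -> float:
    """Item-weighted: total correct over total items, so long subtests dominate."""
    correct = sum(c for c, _ in subtests.values())
    items = sum(n for _, n in subtests.values())
    return correct / items


def equal_subtest_composite(subtests: SubtestScores) -> float:
    """Mean of per-subtest proportions, so every subtest has equal influence."""
    proportions = [c / n for c, n in subtests.values()]
    return sum(proportions) / len(proportions)


# Hypothetical student: strong on the 50-item probe, middling on the 6-item probe.
student = {"MC0-20": (40, 50), "DMT10_2026": (3, 6)}

pooled = pooled_composite(student)          # 43 / 56
balanced = equal_subtest_composite(student)  # (0.8 + 0.5) / 2
```

The pooled score sits close to the long subtest's result, while the balanced score moves halfway toward the short subtest. That difference is the construct-claim tradeoff the table describes.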
Recommended wording if the screener-index path is selected
- Use 2026 BOY Numeracy Screener Index, not “final broad numeracy score”.
- Explain timed non-NL scoring as timed-form performance, not pure reached-item accuracy.
- Explain hierarchical subtest scores as shrunken estimates of performance on that skill, borrowing strength from the global score.
- Keep the Year 1 Number Line caveat visible until residual-zero / active-mask cleanup is adjudicated.
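The "shrunken estimate" wording can be illustrated with a reliability-weighted blend in the style of Kelley's formula. This is a sketch of the general idea only, not the hierarchical model itself, whose pooling structure is not specified on this page. It assumes scores on a z-scale and uses the global score directly as the value being shrunk toward. The function name and example scores are hypothetical; the reliabilities 0.60 and 0.93 are the weakest and one of the strongest standalone values in the subtest evidence table.

```python
def shrunken_subtest_score(standalone_z: float, global_z: float, reliability: float) -> float:
    """Blend a standalone subtest score with the global score.

    The lower the subtest reliability, the more the estimate is pulled
    toward the global score (illustrative Kelley-style weighting).
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return reliability * standalone_z + (1.0 - reliability) * global_z


# Hypothetical student: average global score, low standalone subtest score.
weak_subtest = shrunken_subtest_score(-1.5, 0.0, reliability=0.60)
strong_subtest = shrunken_subtest_score(-1.5, 0.0, reliability=0.93)
```

With the same raw evidence, the weak subtest's estimate moves much further toward the global score than the strong subtest's estimate, which is the behaviour the wording is meant to convey.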
Technical appendices
Collapsed by default. These are support materials, not the main read.
Model run review: Global Stan diagnostics and current caveats
| year | model | n | agreement vs screener | median abs shift pp | p95 abs shift pp | release read |
|---|---|---|---|---|---|---|
| Foundation | Trusted-subtest composite | 993 | 0.635 | 15.4 | 49.4 | Clean trusted-marker option; narrower construct |
| Foundation | Unidim + testlet screener index | 997 | reference | 0 | 0 | Primary candidate if claim is 2026 BOY Numeracy Screener Index |
| Foundation | Hierarchical global | 997 | 0.904 | 6.9 | 26.2 | Shadow/internal for global; useful for subscore pooling |
| Foundation | Equal-subtest composite | 974 | 0.956 | 4.7 | 18.6 | Balanced-score challenger; not same claim as screener index |
| Foundation | Reached-only composite | 974 | 0.897 | 7.2 | 27.2 | Policy comparator for pure reached accuracy, not timed-form performance |
| Year 1 | Trusted-subtest composite | 1,198 | 0.706 | 13.7 | 44.8 | Clean trusted-marker option; narrower construct |
| Year 1 | Unidim + testlet screener index | 1,221 | reference | 0 | 0 | Primary candidate if claim is 2026 BOY Numeracy Screener Index |
| Year 1 | Hierarchical global | 1,221 | 0.836 | 9.4 | 34.3 | Shadow/internal for global; useful for subscore pooling |
| Year 1 | Equal-subtest composite | 1,178 | 0.885 | 8.3 | 28.1 | Balanced-score challenger; not same claim as screener index |
| Year 1 | Reached-only composite | 1,178 | 0.861 | 9 | 30.9 | Policy comparator for pure reached accuracy, not timed-form performance |
Detailed subtest evidence table: Standalone modelling and hierarchical readiness by subtest
| year | subtest | policy model | students | release items | standalone fit items | excluded items | standalone reliability | evidence band | coherence with other subtests | floor / ceiling | trailing unreached | itemfit flag rate | profile modelling posture | release candidate role | risk flags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Foundation | MQ1-20 | D/trailing-zero Rasch 1PL | 1,006 | 19 | 19 | 11 | 0.6 | weak | 0.43 | 0.04 / 0 | 0.83 | 0.68 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | Hierarchical/descriptive only; avoid standalone high-stakes subscore | weak_standalone_reliability; sparse_nonconstant_items_retained |
| Foundation | MC0-20 | D/trailing-zero Rasch 1PL | 1,005 | 50 | 50 | 10 | 0.93 | strong | 0.53 | 0.01 / 0 | 0.74 | 0.96 | strong standalone signal; still prefer hierarchical coherence with global score | Include if construct claim includes this probe; profile candidate | sparse_nonconstant_items_retained |
| Foundation | MNC0-20 | D/trailing-zero Rasch 1PL | 1,003 | 24 | 24 | 6 | 0.88 | strong | 0.6 | 0.06 / 0 | 0.75 | 0.88 | strong standalone signal; still prefer hierarchical coherence with global score | Include if construct claim includes this probe; profile candidate | sparse_nonconstant_items_retained |
| Foundation | DMT10_2026 | Valid-only Rasch 1PL | 1,002 | 6 target items | 8 | 3 | 0.61 | weak | 0.43 | 0.01 / 0.09 | n/a | 0.12 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | Hierarchical/descriptive only; not a standalone global driver | weak_standalone_reliability; few_calibration_items |
| Foundation | BNL0-20 | NL2 ordinal PCM | 974 | 10 | 10 | 0 | 0.67 | weak | 0.35 | 0.01 / 0 | n/a | 0 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | Hierarchical/descriptive only; avoid standalone high-stakes subscore | weak_standalone_reliability; number_line_policy_sensitive |
| Year 1 | MC0-100 | D/trailing-zero Rasch 1PL | 1,235 | 34 | 34 | 26 | 0.94 | strong | 0.69 | 0.02 / 0 | 0.76 | 0.91 | strong standalone signal; still prefer hierarchical coherence with global score | Include if construct claim includes this probe; profile candidate | sparse_nonconstant_items_retained |
| Year 1 | MNC0-100 | D/trailing-zero Rasch 1PL | 1,229 | 22 | 22 | 7 | 0.89 | strong | 0.76 | 0.03 / 0 | 0.71 | 0.91 | strong standalone signal; still prefer hierarchical coherence with global score | Include if construct claim includes this probe; profile candidate | sparse_nonconstant_items_retained |
| Year 1 | AAMC | D/trailing-zero Rasch 1PL | 1,227 | 38 | 38 | 2 | 0.9 | strong | 0.73 | 0.04 / 0 | 0.78 | 0.97 | strong standalone signal; still prefer hierarchical coherence with global score | Include if construct claim includes this probe; profile candidate | sparse_nonconstant_items_retained |
| Year 1 | ASMC | D/trailing-zero Rasch 1PL | 1,223 | 25 | 25 | 5 | 0.84 | moderate | 0.62 | 0.13 / 0 | 0.76 | 0.72 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic | Profile with hierarchical shrinkage; caveat standalone interpretation | moderate_reliability; floor_rate_ge_10pct; sparse_nonconstant_items_retained |
| Year 1 | BNL0-100 | NL2 ordinal PCM | 1,178 | 13 | 13 | 0 | 0.73 | moderate | 0.55 | 0 / 0 | n/a | 0.08 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic | Profile with hierarchical shrinkage; caveat standalone interpretation | moderate_reliability; number_line_policy_sensitive |
Item audit: Expanded BOY subtest/item diagnostics
The full item audit remains separate because it is long and figure-heavy.