2026 BOY Numeracy Scoring Decisions

Internal release-candidate discussion memo. Updated 2026-06-19 06:57 UTC. Aggregate evidence only.

Purpose. This page structures the scoring discussion, walks through the current model results, shows score-agreement plots, and ends with the choices that need a decision.

Working recommendation. If we need a near-term live score, use the unidim + testlet screener-index result as the 2026 BOY Numeracy Screener Index. Keep D/trailing-zero scoring for timed-form performance only, exclude STPM from math achievement, and use hierarchical modelling for modelled subtest profiles where they are reportable.

Questions to discuss

question	why it matters
What should the single global score mean?	Determines whether we use a Trusted-subtest composite, unidim + testlet screener index, balanced score, or future fluency model.
Should timed unreached trailing items count as zero or missing?	Zero supports timed-form performance. Missing supports pure reached-item accuracy.
Should response time or pace be part of the 2026 live score?	Current recommendation is shadow/developmental only. Fluency means accurate and efficient, not fast alone.
Which subtests can be reported as standalone or profile evidence?	Strong subtests may support profiles. Weak/moderate subtests need hierarchical shrinkage, caveats, or internal-only status.
Should weak but construct-relevant probes stay in the global score?	Exclude them for a Trusted-subtest composite; retain or downweight them for screener-index/balanced-score claims.
What score movement is acceptable when switching models?	Use correlations, percentile shifts, risk-band movement, and high-cut movement to decide release readiness.
What remains unresolved before release-candidate lock?	Year 1 BNL cleanup, weighted-vs-unweighted adjudication, hierarchical reporting status, and later outcome/fairness checks.
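
Two of the movement checks named in the table, risk-band movement and high-cut movement, reduce to counting students whose classification changes between two scorings. The sketch below is one minimal way to compute them. The band cut-points (10 and 25), the band labels, the high cut of 90, and the function names are illustrative assumptions, not values from this memo.

```python
from bisect import bisect_right

# Hypothetical cut-points on the 0-100 percentile scale (assumption).
BAND_CUTS = [10, 25]
BAND_NAMES = ["high risk", "some risk", "low risk"]


def risk_band(percentile):
    """Label a 0-100 percentile: <10 high risk, 10 to <25 some risk, else low risk."""
    return BAND_NAMES[bisect_right(BAND_CUTS, percentile)]


def band_movement(reference, candidate):
    """Share of students whose risk band differs between two scorings."""
    if len(reference) != len(candidate):
        raise ValueError("score lists must cover the same students")
    moved = sum(risk_band(r) != risk_band(c) for r, c in zip(reference, candidate))
    return moved / len(reference)


def cut_movement(reference, candidate, cut):
    """Share of students who sit on different sides of a single cut."""
    if len(reference) != len(candidate):
        raise ValueError("score lists must cover the same students")
    moved = sum((r >= cut) != (c >= cut) for r, c in zip(reference, candidate))
    return moved / len(reference)


# Toy percentiles for five students under two scoring models.
reference = [5, 12, 30, 60, 24]
candidate = [8, 9, 28, 55, 26]
print(band_movement(reference, candidate))     # 0.4
print(cut_movement(reference, candidate, 90))  # 0.0
```

In this toy data two of the five students change band, so the band movement is 0.4, while nobody crosses the hypothetical high cut.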

Review model results

The agreement and shift columns compare each candidate with the unidim + testlet screener-index candidate. Shifts are absolute differences in percentile points (pp) on a 0–100 percentile scale.

year	model	n	agreement vs screener	median abs shift pp	p95 abs shift pp	release read
Foundation	Trusted-subtest composite	993	0.635	15.4	49.4	Clean trusted-marker option; narrower construct
Foundation	Unidim + testlet screener index	997	reference	0	0	Primary candidate if claim is 2026 BOY Numeracy Screener Index
Foundation	Hierarchical global	997	0.904	6.9	26.2	Shadow/internal for global; useful for subscore pooling
Foundation	Equal-subtest composite	974	0.956	4.7	18.6	Balanced-score challenger; not same claim as screener index
Foundation	Reached-only composite	974	0.897	7.2	27.2	Policy comparator for pure reached accuracy, not timed-form performance
Year 1	Trusted-subtest composite	1,198	0.706	13.7	44.8	Clean trusted-marker option; narrower construct
Year 1	Unidim + testlet screener index	1,221	reference	0	0	Primary candidate if claim is 2026 BOY Numeracy Screener Index
Year 1	Hierarchical global	1,221	0.836	9.4	34.3	Shadow/internal for global; useful for subscore pooling
Year 1	Equal-subtest composite	1,178	0.885	8.3	28.1	Balanced-score challenger; not same claim as screener index
Year 1	Reached-only composite	1,178	0.861	9	30.9	Policy comparator for pure reached accuracy, not timed-form performance
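
As a rough illustration of how the three comparison columns could be derived, the sketch below converts two scorings of the same students to within-cohort percentiles and summarises the differences. It rests on several assumptions: the memo does not state which agreement statistic is used, so a Pearson correlation on percentiles stands in for it; the p95 uses a nearest-rank rule; and each comparison is taken to cover only students scored under both models. All function names are hypothetical.

```python
import math
import statistics
from bisect import bisect_left, bisect_right


def percentile_ranks(scores):
    """Mid-rank percentile (0-100) of each score within its own cohort."""
    ordered = sorted(scores)
    n = len(ordered)
    return [
        100.0 * (bisect_left(ordered, s) + bisect_right(ordered, s)) / (2 * n)
        for s in scores
    ]


def pearson(x, y):
    """Plain Pearson correlation; assumes neither input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shift_summary(reference, candidate):
    """Compare two scorings of the same students, given in the same order."""
    ref_pct = percentile_ranks(reference)
    cand_pct = percentile_ranks(candidate)
    shifts = sorted(abs(r - c) for r, c in zip(ref_pct, cand_pct))
    p95_index = math.ceil(0.95 * len(shifts)) - 1  # nearest-rank 95th percentile
    return {
        "agreement": pearson(ref_pct, cand_pct),
        "median_abs_shift_pp": statistics.median(shifts),
        "p95_abs_shift_pp": shifts[p95_index],
    }


# Toy data: four students, two of whom swap order under the candidate model.
summary = shift_summary([10, 20, 30, 40], [20, 10, 30, 40])
print(summary)
```

Working on percentiles rather than raw scores keeps the shift columns comparable across models whose raw scales differ.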

Subtest evidence at a glance

This is the short version needed for the decision discussion. The collapsed appendix has the full standalone-modelling table.

year	subtest	items	standalone precision	coherence	floor / ceiling	trailing unreached	risk flags	release candidate role
Foundation	MQ1-20	19	weak (0.6)	ρ 0.43	0.04 / 0	0.83	weak_standalone_reliability; sparse_nonconstant_items_retained	Hierarchical/descriptive only; avoid standalone high-stakes subscore
Foundation	MC0-20	50	strong (0.93)	ρ 0.53	0.01 / 0	0.74	sparse_nonconstant_items_retained	Include if construct claim includes this probe; profile candidate
Foundation	MNC0-20	24	strong (0.88)	ρ 0.6	0.06 / 0	0.75	sparse_nonconstant_items_retained	Include if construct claim includes this probe; profile candidate
Foundation	DMT10_2026	6	weak (0.61)	ρ 0.43	0.01 / 0.09	n/a	weak_standalone_reliability; few_calibration_items	Hierarchical/descriptive only; not a standalone global driver
Foundation	BNL0-20	10	weak (0.67)	ρ 0.35	0.01 / 0	n/a	weak_standalone_reliability; number_line_policy_sensitive	Hierarchical/descriptive only; avoid standalone high-stakes subscore
Year 1	MC0-100	34	strong (0.94)	ρ 0.69	0.02 / 0	0.76	sparse_nonconstant_items_retained	Include if construct claim includes this probe; profile candidate
Year 1	MNC0-100	22	strong (0.89)	ρ 0.76	0.03 / 0	0.71	sparse_nonconstant_items_retained	Include if construct claim includes this probe; profile candidate
Year 1	AAMC	38	strong (0.9)	ρ 0.73	0.04 / 0	0.78	sparse_nonconstant_items_retained	Include if construct claim includes this probe; profile candidate
Year 1	ASMC	25	moderate (0.84)	ρ 0.62	0.13 / 0	0.76	moderate_reliability; floor_rate_ge_10pct; sparse_nonconstant_items_retained	Profile with hierarchical shrinkage; caveat standalone interpretation
Year 1	BNL0-100	13	moderate (0.73)	ρ 0.55	0 / 0	n/a	moderate_reliability; number_line_policy_sensitive	Profile with hierarchical shrinkage; caveat standalone interpretation
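
The precision bands in the table can be read as a simple threshold rule on standalone reliability. The sketch below is one such rule. The cut-offs of 0.70 and 0.85 are assumptions chosen only because they reproduce the bands shown above; the memo does not state the actual thresholds, and the function name is hypothetical.

```python
# Assumed cut-offs (not stated in the memo); they happen to match the table.
WEAK_BELOW = 0.70
STRONG_FROM = 0.85


def evidence_band(reliability):
    """Map a standalone reliability estimate to weak / moderate / strong."""
    if reliability < WEAK_BELOW:
        return "weak"
    if reliability < STRONG_FROM:
        return "moderate"
    return "strong"


# Reliabilities copied from the table above.
table = {
    "MQ1-20": 0.60, "MC0-20": 0.93, "MNC0-20": 0.88, "DMT10_2026": 0.61,
    "BNL0-20": 0.67, "MC0-100": 0.94, "MNC0-100": 0.89, "AAMC": 0.90,
    "ASMC": 0.84, "BNL0-100": 0.73,
}
for subtest, reliability in table.items():
    print(subtest, evidence_band(reliability))
```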

Correlations and scatterplots: how much do choices change scores?

Plots are aggregate SVG summaries generated at build time. No student-level data are published. “Hierarchical trailing-zero” means the hierarchical subtest score fit from the policy-locked frame where timed non-NL probes use D/trailing-zero scoring.
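
To make the difference between D/trailing-zero and reached-only scoring concrete, the sketch below scores one timed response vector both ways. It is an illustration under assumptions, not the production scoring code: responses are coded 1, 0, or None; "reached" means every item up to and including the last attempted one; and omits before that point count as incorrect under both policies. The function and policy names are hypothetical.

```python
def score_timed_form(responses, policy):
    """Proportion correct on one timed form.

    responses: list in item order with 1 (correct), 0 (incorrect),
               or None (no response recorded).
    policy:    "trailing_zero" or "reached_only".
    """
    attempted = [i for i, r in enumerate(responses) if r is not None]
    n_reached = attempted[-1] + 1 if attempted else 0
    # Assumption: omits before the last attempted item count as incorrect
    # under both policies; only the trailing unreached run differs.
    n_correct = sum(1 for r in responses[:n_reached] if r == 1)

    if policy == "trailing_zero":
        # Unreached trailing items stay in the denominator and score 0.
        return n_correct / len(responses)
    if policy == "reached_only":
        # Unreached trailing items are treated as missing.
        return n_correct / n_reached if n_reached else None
    raise ValueError(f"unknown policy: {policy}")


# A student who reaches 4 of 8 items and gets 3 of them right.
form = [1, 1, 0, 1, None, None, None, None]
print(score_timed_form(form, "trailing_zero"))  # 0.375
print(score_timed_form(form, "reached_only"))   # 0.75
```

The same student scores 0.375 as timed-form performance and 0.75 as reached-item accuracy, which is the kind of gap the reached/valid-only vs D/trailing-zero plots summarise.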

[Figure: Global score agreement. Panels: Foundation, Year 1.]

[Figure: Reached/valid-only accuracy vs D/trailing-zero. Panels: Foundation, Year 1.]

[Figure: Standalone trailing-zero score vs hierarchical trailing-zero subtest score. Panels: Foundation, Year 1.]

Decisions

Use this block to choose the construct claim first. The model and response-process policy follow from that choice.

if the team chooses	then use	response policy	tradeoff
Trusted-subtest composite	Independent trusted subtest scores only	Timed trailing unreached treated as missing for pure accuracy	Cleanest psychometric marker, but narrower construct coverage
2026 BOY Numeracy Screener Index	Unidim + testlet item-level model	D/trailing-zero for timed non-NL, locked NL ordinal policy, STPM excluded	Best near-term operational path, but do not claim balanced broad numeracy
Balanced broad numeracy	Equal-subtest weighted IRT or equal-subtest composite	Same scoring policies, but subtest influence balanced by design	Construct claim is cleaner, but requires weighted-vs-unweighted adjudication
Fluency	Future accuracy × pace model	Model correctness and pace jointly; do not use speed alone	Not the current live score; keep as development/shadow evidence

Recommended wording if the screener-index path is selected

Use 2026 BOY Numeracy Screener Index, not “final broad numeracy score”.
Explain timed non-NL scoring as timed-form performance, not pure reached-item accuracy.
Explain hierarchical subtest scores as shrunken estimates of performance on that skill, borrowing strength from the global score.
Keep the Year 1 Number Line caveat visible until residual-zero / active-mask cleanup is adjudicated.
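
For readers who want intuition for "shrunken estimates", the sketch below uses a simple reliability-weighted average (Kelley's regression estimate) that pulls a subtest score toward the value predicted from the global score. This is an explanatory stand-in under assumptions, not the hierarchical model fitted for this release; the function name and the example values on a standardised scale are hypothetical.

```python
def shrink_toward_global(standalone, predicted_from_global, reliability):
    """Reliability-weighted blend of a subtest score and its global prediction.

    reliability near 1 keeps the standalone score almost unchanged;
    reliability near 0 returns the prediction from the global score.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return reliability * standalone + (1.0 - reliability) * predicted_from_global


# Same standalone score (-1.0) and global prediction (0.0), two reliabilities
# taken from the subtest evidence table: a weak and a strong subtest.
print(shrink_toward_global(-1.0, 0.0, 0.60))  # pulled well toward 0
print(shrink_toward_global(-1.0, 0.0, 0.93))  # stays close to -1
```

The weaker the standalone evidence, the more the reported subtest score leans on the global score, which is why weak subtests are flagged as hierarchical/descriptive only.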

Technical appendices

Collapsed by default. These are support materials, not the main read.

Model run review: Global Stan diagnostics and current caveats

Open full model run review

Detailed subtest evidence table: Standalone modelling and hierarchical readiness by subtest

year	subtest	policy model	students	release items	standalone fit items	excluded items	standalone reliability	evidence band	coherence with other subtests	floor / ceiling	trailing unreached	itemfit flag rate	profile modelling posture	release candidate role	risk flags
Foundation	MQ1-20	D/trailing-zero Rasch 1PL	1,006	19	19	11	0.6	weak	0.43	0.04 / 0	0.83	0.68	hierarchical_shrinkage_required; avoid standalone high-stakes subscore	Hierarchical/descriptive only; avoid standalone high-stakes subscore	weak_standalone_reliability; sparse_nonconstant_items_retained
Foundation	MC0-20	D/trailing-zero Rasch 1PL	1,005	50	50	10	0.93	strong	0.53	0.01 / 0	0.74	0.96	strong standalone signal; still prefer hierarchical coherence with global score	Include if construct claim includes this probe; profile candidate	sparse_nonconstant_items_retained
Foundation	MNC0-20	D/trailing-zero Rasch 1PL	1,003	24	24	6	0.88	strong	0.6	0.06 / 0	0.75	0.88	strong standalone signal; still prefer hierarchical coherence with global score	Include if construct claim includes this probe; profile candidate	sparse_nonconstant_items_retained
Foundation	DMT10_2026	Valid-only Rasch 1PL	1,002	6 target items	8	3	0.61	weak	0.43	0.01 / 0.09	n/a	0.12	hierarchical_shrinkage_required; avoid standalone high-stakes subscore	Hierarchical/descriptive only; not a standalone global driver	weak_standalone_reliability; few_calibration_items
Foundation	BNL0-20	NL2 ordinal PCM	974	10	10	0	0.67	weak	0.35	0.01 / 0	n/a	0	hierarchical_shrinkage_required; avoid standalone high-stakes subscore	Hierarchical/descriptive only; avoid standalone high-stakes subscore	weak_standalone_reliability; number_line_policy_sensitive
Year 1	MC0-100	D/trailing-zero Rasch 1PL	1,235	34	34	26	0.94	strong	0.69	0.02 / 0	0.76	0.91	strong standalone signal; still prefer hierarchical coherence with global score	Include if construct claim includes this probe; profile candidate	sparse_nonconstant_items_retained
Year 1	MNC0-100	D/trailing-zero Rasch 1PL	1,229	22	22	7	0.89	strong	0.76	0.03 / 0	0.71	0.91	strong standalone signal; still prefer hierarchical coherence with global score	Include if construct claim includes this probe; profile candidate	sparse_nonconstant_items_retained
Year 1	AAMC	D/trailing-zero Rasch 1PL	1,227	38	38	2	0.9	strong	0.73	0.04 / 0	0.78	0.97	strong standalone signal; still prefer hierarchical coherence with global score	Include if construct claim includes this probe; profile candidate	sparse_nonconstant_items_retained
Year 1	ASMC	D/trailing-zero Rasch 1PL	1,223	25	25	5	0.84	moderate	0.62	0.13 / 0	0.76	0.72	hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic	Profile with hierarchical shrinkage; caveat standalone interpretation	moderate_reliability; floor_rate_ge_10pct; sparse_nonconstant_items_retained
Year 1	BNL0-100	NL2 ordinal PCM	1,178	13	13	0	0.73	moderate	0.55	0 / 0	n/a	0.08	hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic	Profile with hierarchical shrinkage; caveat standalone interpretation	moderate_reliability; number_line_policy_sensitive

Item audit: Expanded BOY subtest/item diagnostics

The full item audit remains separate because it is long and figure-heavy.

Open item audit appendix

Subscore readiness CSV: Source aggregate table for hierarchical/reportability decisions

Download subscore readiness CSV