Global modelling

Builds, diagnostics, sensitivity tests

Review of inclusive, hard-item-filtered, and targeted sensitivity Stan jobs for the 2026 BOY operational accuracy + Number Line candidate.


13 Stan jobs reviewed incl. Batch A

0 divergences / treedepth hits

Y1 BNL residual-zero check complete

Operational status: Hard-item-filtered is the working 2026 candidate baseline. Year 1 BNL0-100 residual-zero is the cleaner candidate after Batch A. M0/current remains live until explicit promotion and outcome validation.

Batch A action-plan results

Targeted post-review checks for Year 1 BNL residual-zero, hierarchical global score and subscores, and outcome-validation readiness.

Y1 BNL residual-zero

stable

Score movement vs hard-filtered baseline is tiny; BNL items are retained while the weak local residual component is removed.

Hierarchical global

not yet

Hierarchical global changes risk classification materially enough that it should not replace the operational global score without outcome validation.

Subscores

candidate

Hierarchical shrunken subscores are preferred for teacher-facing development; standalone subtest IRT remains diagnostic/audit evidence.

comparison	year level	n matched	theta spearman	median abs percentile shift	p95 abs percentile shift	risk 3band exact agreement	very low overlap	very low n base	low or very low overlap	low or very low n base
H1_global_vs_hard_filtered	foundation	997	0.9043	6.921	26.18	0.8024	116	150	282	349
H1_global_vs_hard_filtered	year1	1,221	0.8362	9.419	34.32	0.7617	124	183	333	427
BNL_residual_zero_vs_hard_filtered	year1	1,221	0.9985	0.9009	3.276	0.9836	180	183	420	427
H1_global_vs_BNL_residual_zero	year1	1,221	0.8072	10.48	37.35	0.7494	121	183	326	427

BNL empirical local-dependence screen

year level	test subgroup	n persons with any bnl	n items	n item pairs	p95 abs residual corr	max abs residual corr	n pairs abs corr gt 0 20	n pairs abs corr gt 0 30
year1	BNL0-100	1,178	13	78	0.2533	0.2914	22	0
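
The columns of this screen can be reproduced from a persons-by-items residual matrix. The following is a minimal sketch, not the pipeline's own code: the function name, the residual definition (observed minus model-expected item score), and the pairwise-complete handling of missing responses are all assumptions. It does show why 13 items yield 78 item pairs (13 × 12 / 2).

```python
import numpy as np

def local_dependence_screen(residuals):
    """Summarise pairwise residual correlations for one subtest.

    residuals: (n_persons, n_items) array; NaN marks a missing response.
    How the residuals are defined is an assumption of this sketch.
    """
    residuals = np.asarray(residuals, dtype=float)
    n_items = residuals.shape[1]
    abs_corrs = []
    for i in range(n_items):
        for j in range(i + 1, n_items):
            # Use only persons who answered both items (pairwise complete).
            both = ~np.isnan(residuals[:, i]) & ~np.isnan(residuals[:, j])
            r = np.corrcoef(residuals[both, i], residuals[both, j])[0, 1]
            abs_corrs.append(abs(r))
    abs_corrs = np.array(abs_corrs)
    return {
        "n_items": n_items,
        "n_item_pairs": len(abs_corrs),
        "p95_abs_residual_corr": float(np.percentile(abs_corrs, 95)),
        "max_abs_residual_corr": float(abs_corrs.max()),
        "n_pairs_abs_corr_gt_0_20": int((abs_corrs > 0.20).sum()),
        "n_pairs_abs_corr_gt_0_30": int((abs_corrs > 0.30).sum()),
    }

# Synthetic independent residuals, not project data.
rng = np.random.default_rng(0)
fake_residuals = rng.normal(size=(500, 13))
summary = local_dependence_screen(fake_residuals)
```

On independent synthetic residuals the counts above 0.20 and 0.30 should be zero, which gives a baseline for reading the 22 of 78 pairs above 0.20 reported for BNL0-100.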

Hierarchical subscore candidate decision table

year level	test subgroup	standalone tam eap reliability	h1 sigma delta mean	h1 subscore global spearman	h1 median subscore posterior sd	recommendation
foundation	BNL0-20	0.6743	0.8737	0.6252	0.4507	h1_shrunken_subscore_candidate_with_uncertainty_pending_validation
foundation	DMT10_2026	0.6093	0.8297	0.7465	0.6626	diagnostic_only_or_strong_caveat_pending_validation
foundation	MC0-20	0.9269	1.932	0.8217	0.6462	h1_shrunken_subscore_candidate_high_confidence_pending_validation
foundation	MNC0-20	0.8813	1.546	0.849	0.7292	h1_shrunken_subscore_candidate_high_confidence_pending_validation
foundation	MQ1-20	0.6021	0.8226	0.7961	0.7241	diagnostic_only_or_strong_caveat_pending_validation
year1	AAMC	0.9004	1.123	0.8889	0.6585	h1_shrunken_subscore_candidate_high_confidence_pending_validation
year1	ASMC	0.8415	1.213	0.8243	0.6981	h1_shrunken_subscore_candidate_with_uncertainty_pending_validation
year1	BNL0-100	0.7272	1.327	0.709	0.4076	h1_shrunken_subscore_candidate_with_uncertainty_pending_validation
year1	MC0-100	0.9404	1.709	0.873	0.6396	h1_shrunken_subscore_candidate_high_confidence_pending_validation
year1	MNC0-100	0.8912	1.014	0.9167	0.6973	h1_shrunken_subscore_candidate_high_confidence_pending_validation

Number Line policy adjudication

NL-only policy evidence now separates measurement-policy adjudication from full-battery operational promotion.

Current readout: Current .85/.95 remains the reference. Year 1 .80/.90/.95 is the strongest ordinal challenger; .80/.90 remains a secondary challenger; strict/binary policies are not prioritised.

year level	policy id	tam nl only reliability	delta reliability vs current	tam nl only spearman with non nl composite	observed alpha numeric categories	recommendation
foundation	nl_80_90_relaxed_3cat	0.6839	0.0084	0.3351	0.7264	secondary_challenger_monitor
foundation	nl_85_95_current_3cat	0.6755	0	0.3388	0.7058	retain_as_current_reference
foundation	nl_90_97_strict_3cat	0.6014	-0.0741	0.2957	0.6297	not_prioritised_for_stan
foundation	nl_binary_95	0.5153	-0.1602	0.2801	0.5532	not_prioritised_for_stan
foundation	nl_80_90_95_4cat	0.6893	0.0138	0.3309	0.7223	secondary_challenger_monitor
year1	nl_80_90_relaxed_3cat	0.7578	0.0299	0.5877	0.7816	secondary_challenger_monitor
year1	nl_85_95_current_3cat	0.7279	0	0.5498	0.7442	retain_as_current_reference
year1	nl_90_97_strict_3cat	0.6712	-0.0567	0.5105	0.6884	not_prioritised_for_stan
year1	nl_binary_95	0.54	-0.1879	0.4152	0.5655	not_prioritised_for_stan
year1	nl_80_90_95_4cat	0.7579	0.03	0.5656	0.7697	serious_challenger_consider_full_battery_stan_if_outcome_validity_supports

Observed coordinate-derived NL policy metrics

year level	policy id	n persons	n items	complete case alpha numeric categories	spearman with non nl composite	person mean score floor rate	person mean score ceiling rate
foundation	nl_80_90_95_4cat	974	10	0.7223	0.3288	0.0021	0.0021
foundation	nl_80_90_relaxed_3cat	974	10	0.7264	0.3285	0.0021	0.0216
foundation	nl_85_95_current_3cat	974	10	0.7058	0.3333	0.0062	0.0021
foundation	nl_90_97_strict_3cat	974	10	0.6297	0.2924	0.0175	0.001
foundation	nl_binary_95	974	10	0.5532	0.2853	0.0883	0.0021
year1	nl_80_90_95_4cat	1,178	13	0.7697	0.5588	0.0017	0.0008
year1	nl_80_90_relaxed_3cat	1,178	13	0.7816	0.5805	0.0017	0.0076
year1	nl_85_95_current_3cat	1,178	13	0.7442	0.5438	0.0042	0.0025
year1	nl_90_97_strict_3cat	1,178	13	0.6884	0.5026	0.0136	0.0008
year1	nl_binary_95	1,178	13	0.5655	0.4111	0.0942	0.0008
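
The "complete case alpha numeric categories" column is read here as Cronbach's alpha computed on the numeric category codes, using only persons with a score on every Number Line item. The sketch below illustrates that reading; the function name and the listwise-deletion rule are assumptions, not the pipeline's confirmed method.

```python
import numpy as np

def complete_case_alpha(scores):
    """Cronbach's alpha on persons with no missing item scores.

    scores: (n_persons, n_items) array of numeric category codes,
    with NaN for a missing response.
    """
    scores = np.asarray(scores, dtype=float)
    complete = scores[~np.isnan(scores).any(axis=1)]  # listwise deletion
    k = complete.shape[1]
    item_variance_sum = complete.var(axis=0, ddof=1).sum()
    total_variance = complete.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variance_sum / total_variance)

# Three perfectly consistent items plus one incomplete row (toy data).
toy = [
    [0, 0, 0],
    [1, 1, 1],
    [2, 2, 2],
    [1, np.nan, 2],  # dropped: not a complete case
]
alpha = complete_case_alpha(toy)
```

Because alpha depends on how many categories each item has, comparisons across policies with different category counts (binary versus 3- or 4-category) are best read alongside the model-based reliabilities rather than on their own.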

Continuous Number Line prototype screen

Continuous coordinate-derived accuracy contains additional signal, but remains a model-development challenger rather than an operational replacement.

year level	continuous metric	alpha delta vs current ordinal	non nl spearman delta vs current ordinal	spearman vs current ordinal	p95 percentile shift vs current ordinal	recommendation
foundation	continuous_accuracy	0.1009	0.0353	0.9311	21.82	promising_continuous_challenger_consider_model_based_fit
foundation	negative_absolute_scaled_error	0.1009	0.0353	0.9311	21.84	promising_continuous_challenger_consider_model_based_fit
foundation	absolute_signed_error_scaled_negative	0.1009	0.0353	0.9311	21.84	promising_continuous_challenger_consider_model_based_fit
foundation	signed_error_scaled	-0.0029	-0.5315			diagnostic_bias_only_not_primary_score
year1	continuous_accuracy	0.0929	0.0608	0.9306	21.82	promising_continuous_challenger_consider_model_based_fit
year1	negative_absolute_scaled_error	0.0929	0.0609	0.9305	21.78	promising_continuous_challenger_consider_model_based_fit
year1	absolute_signed_error_scaled_negative	0.0929	0.0609	0.9305	21.78	promising_continuous_challenger_consider_model_based_fit
year1	signed_error_scaled	0.0438	-0.8672			diagnostic_bias_only_not_primary_score

Continuous vs ordinal metric summary

year level	nl metric	metric family	complete case alpha	spearman with non nl composite	person score floor rate	person score ceiling rate
foundation	absolute_signed_error_scaled_negative	continuous	0.8067	0.3685	0.001	0.001
foundation	continuous_accuracy	continuous	0.8067	0.3686	0.001	0.001
foundation	negative_absolute_scaled_error	continuous	0.8067	0.3685	0.001	0.001
foundation	nl_80_90_95_4cat	ordinal_policy	0.7223	0.3288	0.0021	0.0021
foundation	nl_80_90_relaxed_3cat	ordinal_policy	0.7264	0.3285	0.0021	0.0216
foundation	nl_85_95_current_3cat	ordinal_policy	0.7058	0.3333	0.0062	0.0021
foundation	signed_error_scaled	continuous	0.7029	-0.1982	0.001	0.001
year1	absolute_signed_error_scaled_negative	continuous	0.8371	0.6047	0.0008	0.0008
year1	continuous_accuracy	continuous	0.8371	0.6047	0.0008	0.0008
year1	negative_absolute_scaled_error	continuous	0.8371	0.6047	0.0008	0.0008
year1	nl_80_90_95_4cat	ordinal_policy	0.7697	0.5588	0.0017	0.0008
year1	nl_80_90_relaxed_3cat	ordinal_policy	0.7816	0.5805	0.0017	0.0076
year1	nl_85_95_current_3cat	ordinal_policy	0.7442	0.5438	0.0042	0.0025
year1	signed_error_scaled	continuous	0.7881	-0.3234	0.0008	0.0008

2025→2026 crosswalk, drift, and prior eligibility

Historical evidence supports weak/drift-inflated priors for strong common-item cases, especially BNL. DMT remains the main no-hard-anchor caveat.

year level	subtest 2026	n crosswalk items	n high confidence matches	n prior eligible items	median abs logit drift	p90 abs logit drift	recommendation
foundation	BNL0-20	10	10	10	0.1688	0.4296	historical_item_priors_supported_with_drift_inflation
foundation	DMT10_2026	11	7	0	0.269	3.335	use_2025_as_subtest_context_not_hard_item_anchor
foundation	MC0-20	60	0	0	6.902	9.5	no_item_level_historical_prior
foundation	MNC0-20	30	30	5	4.845	8.828	selective_item_priors_after_manual_review
foundation	MQ1-20	30	30	3	6.462	9.234	selective_item_priors_after_manual_review
year1	AAMC	40	40	1	4.995	9.864	selective_item_priors_after_manual_review
year1	ASMC	30	30	3	4.797	8.616	selective_item_priors_after_manual_review
year1	BNL0-100	13	13	13	0.1409	0.2853	historical_item_priors_supported_with_drift_inflation
year1	MC0-100	60	1	1	8.868	10.31	no_item_level_historical_prior
year1	MNC0-100	29	29	4	4.775	8.225	selective_item_priors_after_manual_review

2025 restricted T3/T4 reliability context

year level	term scope	item scope	engine	fit status	n person	n items	reliability	converged
foundation	term3	kept_crosswalk	TAM_PCM_1D	ok	1,440	93	0.9034	TRUE
foundation	term3	all_crosswalk	TAM_PCM_1D	ok	1,440	101.0	0.904	TRUE
foundation	term4	kept_crosswalk	TAM_PCM_1D	ok	1,096	84	0.8891	TRUE
foundation	term4	all_crosswalk	TAM_PCM_1D	ok	1,096	90	0.8905	TRUE
foundation	term3_4_pooled	kept_crosswalk	TAM_PCM_1D	ok	2,536	100.0	0.9048	TRUE
foundation	term3_4_pooled	all_crosswalk	TAM_PCM_1D	ok	2,536	118.0	0.9057	TRUE
year1	term3	kept_crosswalk	TAM_PCM_1D	ok	1,501	121.0	0.9448	TRUE
year1	term3	all_crosswalk	TAM_PCM_1D	ok	1,501	134.0	0.9454	TRUE
year1	term4	kept_crosswalk	TAM_PCM_1D	ok	1,067	105.0	0.9159	TRUE
year1	term4	all_crosswalk	TAM_PCM_1D	ok	1,067	109.0	0.9166	TRUE
year1	term3_4_pooled	kept_crosswalk	TAM_PCM_1D	ok	2,568	132.0	0.949	TRUE
year1	term3_4_pooled	all_crosswalk	TAM_PCM_1D	ok	2,568	148.0	0.9496	TRUE
foundation	term3	all_crosswalk	item_count_context	context		0
foundation	term3	kept_crosswalk	item_count_context	context		0
foundation	term3_4_pooled	all_crosswalk	item_count_context	context		0
foundation	term3_4_pooled	kept_crosswalk	item_count_context	context		0
foundation	term4	all_crosswalk	item_count_context	context		0
foundation	term4	kept_crosswalk	item_count_context	context		0
year1	term3	all_crosswalk	item_count_context	context		0
year1	term3	kept_crosswalk	item_count_context	context		0
year1	term3_4_pooled	all_crosswalk	item_count_context	context		0
year1	term3_4_pooled	kept_crosswalk	item_count_context	context		0
year1	term4	all_crosswalk	item_count_context	context		0
year1	term4	kept_crosswalk	item_count_context	context		0

Premodelling audit: subscores, Number Line policy, speed

New aggregate audit work documents the evidence base needed before fitting hierarchical global+subscore models, Number Line cutoff challengers, or accuracy-response-time models.

Hierarchical subscores

Hierarchical

Recommended next Stan challenger: global numeracy plus shrunken subtest deviations. This is preferred over unrelated standalone subscores.

Number Line policy

Cutoff policies are now cell-count audited: relaxed, current, strict, binary, and 4-category ordinal options.

Speed / RT

Shadow

RT remains QC and response-process context. Timed D already encodes reach/time pressure, so speed should not alter live bands yet.

Subscore readiness

Standalone subtest evidence is uneven; weaker/moderate subscores are the main reason to use hierarchical shrinkage.

year level	test subgroup	n items keep hard filter	standalone eap reliability or alpha proxy	reliability band	spearman with other subtest composite	hierarchical subscore posture	premodel risk flags
foundation	MQ1-20	19	0.6021	weak	0.4256	hierarchical_shrinkage_required; avoid standalone high-stakes subscore	weak_standalone_reliability;sparse_nonconstant_items_retained
foundation	MC0-20	50	0.9269	strong	0.5315	strong standalone signal; still prefer hierarchical coherence with global score	sparse_nonconstant_items_retained
foundation	MNC0-20	24	0.8813	strong	0.6039	strong standalone signal; still prefer hierarchical coherence with global score	sparse_nonconstant_items_retained
foundation	DMT10_2026	8	0.6093	weak	0.4302	hierarchical_shrinkage_required; avoid standalone high-stakes subscore	weak_standalone_reliability;few_calibration_items
foundation	BNL0-20	10	0.6743	weak	0.3535	hierarchical_shrinkage_required; avoid standalone high-stakes subscore	weak_standalone_reliability;number_line_policy_sensitive
year1	MC0-100	34	0.9404	strong	0.6945	strong standalone signal; still prefer hierarchical coherence with global score	sparse_nonconstant_items_retained
year1	MNC0-100	22	0.8912	strong	0.7621	strong standalone signal; still prefer hierarchical coherence with global score	sparse_nonconstant_items_retained
year1	AAMC	38	0.9004	strong	0.7288	strong standalone signal; still prefer hierarchical coherence with global score	sparse_nonconstant_items_retained
year1	ASMC	25	0.8415	moderate	0.6173	hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic	moderate_reliability;floor_rate_ge_10pct;sparse_nonconstant_items_retained
year1	BNL0-100	13	0.7272	moderate	0.5481	hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic	moderate_reliability;number_line_policy_sensitive

Number Line cutoff policy audit

Raw coordinate-derived category counts by candidate policy. Current .85/.95 remains the benchmark.

year level	test subgroup	policy id	cutoffs	n items	items all categories cell ok	share items cell ok	median min category pct	median top category pct	median entropy normalized	premodel policy posture
foundation	BNL0-20	nl_80_90_95_4cat	0.8;0.9;0.95	10	9	0.9	0.1565	0.2526	0.9547	higher-resolution_challenger; use_only_if_item_category_cells_are_stable
foundation	BNL0-20	nl_80_90_relaxed_3cat	0.8;0.9	10	9	0.9	0.206	0.4803	0.9384	relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets
foundation	BNL0-20	nl_85_95_current_3cat	0.85;0.95	10	10	1	0.1822	0.2526	0.9296	benchmark_current_policy; keep as reference in all modelling
foundation	BNL0-20	nl_90_97_strict_3cat	0.9;0.97	10	10	1	0.136	0.1464	0.8726	strict_challenger; reject_if_top_category_sparse_or_validation_not_better
foundation	BNL0-20	nl_binary_95	0.95	10	10	1	0.2526	0.2526	0.8154	modelable_if_cells_ok_but_loses_partial-credit_information
year1	BNL0-100	nl_80_90_95_4cat	0.8;0.9;0.95	13	13	1	0.1944	0.2643	0.9894	higher-resolution_challenger; use_only_if_item_category_cells_are_stable
year1	BNL0-100	nl_80_90_relaxed_3cat	0.8;0.9	13	13	1	0.2441	0.4873	0.9536	relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets
year1	BNL0-100	nl_85_95_current_3cat	0.85;0.95	13	13	1	0.2643	0.2643	0.9799	benchmark_current_policy; keep as reference in all modelling
year1	BNL0-100	nl_90_97_strict_3cat	0.9;0.97	13	13	1	0.1624	0.1624	0.9129	strict_challenger; reject_if_top_category_sparse_or_validation_not_better
year1	BNL0-100	nl_binary_95	0.95	13	13	1	0.2643	0.2643	0.8331	modelable_if_cells_ok_but_loses_partial-credit_information
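
Each policy in the table maps a continuous coordinate-derived accuracy onto ordered categories by its cutoffs. The sketch below shows one way to do that and to compute per-item category shares and normalized entropy. The policy ids and cutoffs come from the table; the function names, the rule that accuracy equal to a cutoff counts as meeting it, and the entropy definition (Shannon entropy divided by the log of the number of categories) are assumptions.

```python
import numpy as np

# Cutoffs as listed in the policy audit table.
POLICIES = {
    "nl_80_90_95_4cat": [0.80, 0.90, 0.95],
    "nl_80_90_relaxed_3cat": [0.80, 0.90],
    "nl_85_95_current_3cat": [0.85, 0.95],
    "nl_90_97_strict_3cat": [0.90, 0.97],
    "nl_binary_95": [0.95],
}

def score_categories(accuracy, cutoffs):
    """Return 0..len(cutoffs): the number of cutoffs each accuracy meets.

    side="right" means an accuracy exactly equal to a cutoff meets it
    (assumed >= rule).
    """
    return np.searchsorted(np.asarray(cutoffs), np.asarray(accuracy), side="right")

def category_summary(categories, n_categories):
    """Category shares and normalized entropy for one item."""
    counts = np.bincount(categories, minlength=n_categories)
    shares = counts / counts.sum()
    nonzero = shares[shares > 0]
    entropy = -(nonzero * np.log(nonzero)).sum() / np.log(n_categories)
    return {
        "min_category_pct": float(shares.min()),
        "top_category_pct": float(shares[-1]),
        "entropy_normalized": float(entropy),
    }

cutoffs = POLICIES["nl_85_95_current_3cat"]
cats = score_categories([0.50, 0.85, 0.949, 0.95, 1.00], cutoffs)
item_summary = category_summary(cats, n_categories=len(cutoffs) + 1)
```

Under this definition a binary item with a top-category share of 0.2643 has normalized entropy of about 0.833, which is in line with the Year 1 `nl_binary_95` row; that agreement is suggestive, not a confirmation of the audit's exact formula.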

Number Line PCM cutoff sensitivity findings

TAM screens were run on cisbox for all ordinal/binary cutoff policies. Bounded mirt 1D screens were run for the current, relaxed, and 4-category policies as a secondary check.

Current readout: The full-battery global score is robust to reasonable PCM cutoff changes. Cutoffs matter more for Number-Line-only subscores: relaxed .80/.90 and 4-category .80/.90/.95 look like the best ordinal challengers; binary >= .95 loses partial-credit information; continuous Number Line remains a formal Stan challenger, not yet a replacement.

TAM full-battery

stable

Current, relaxed, strict, binary, and 4-category screens all fit; global movement vs current is modest.

NL-only reliability

↑

Relaxed and 4-category policies improve Number-Line-only reliability relative to current in both years; strict/binary weaken it.

mirt 1D

caution

Bounded mirt was mostly non-converged within 300 EM cycles, so it is a sensitivity check only; extracted movement was tiny.

TAM cutoff fit summary

year level	scope	policy id	status	n persons	n items	eap reliability	AIC	BIC
foundation	full_battery	nl_80_90_relaxed_3cat	fit_ok	997.0	111.0	0.9157	62,638	63,236
foundation	number_line_only	nl_80_90_relaxed_3cat	fit_ok	974.0	10	0.6839	16,351	16,453
foundation	full_battery	nl_85_95_current_3cat	fit_ok	997.0	111.0	0.9139	63,905	64,503
foundation	number_line_only	nl_85_95_current_3cat	fit_ok	974.0	10	0.6755	17,664	17,766
foundation	full_battery	nl_90_97_strict_3cat	fit_ok	997.0	111.0	0.9077	63,363	63,962
foundation	number_line_only	nl_90_97_strict_3cat	fit_ok	974.0	10	0.6014	16,979	17,081
foundation	full_battery	nl_binary_95	fit_ok	997.0	111.0	0.9171	55,120	55,670
foundation	number_line_only	nl_binary_95	fit_ok	974.0	10	0.5153	10,024	10,078
foundation	full_battery	nl_80_90_95_4cat	fit_ok	997.0	111.0	0.9061	69,656	70,304
foundation	number_line_only	nl_80_90_95_4cat	fit_ok	974.0	10	0.6893	22,246	22,397
year1	full_battery	nl_80_90_relaxed_3cat	fit_ok	1,221	132.0	0.9563	90,385	91,130
year1	number_line_only	nl_80_90_relaxed_3cat	fit_ok	1,178	13	0.7578	29,088	29,225
year1	full_battery	nl_85_95_current_3cat	fit_ok	1,221	132.0	0.9518	92,022	92,768
year1	number_line_only	nl_85_95_current_3cat	fit_ok	1,178	13	0.7279	29,992	30,129
year1	full_battery	nl_90_97_strict_3cat	fit_ok	1,221	132.0	0.949	89,902	90,647
year1	number_line_only	nl_90_97_strict_3cat	fit_ok	1,178	13	0.6712	27,566	27,703
year1	full_battery	nl_binary_95	fit_ok	1,221	132.0	0.9529	75,826	76,505
year1	number_line_only	nl_binary_95	fit_ok	1,178	13	0.54	15,845	15,916
year1	full_battery	nl_80_90_95_4cat	fit_ok	1,221	132.0	0.9488	102,866	103,678
year1	number_line_only	nl_80_90_95_4cat	fit_ok	1,178	13	0.7579	38,492	38,695

TAM cutoff score movement vs current .85/.95

year level	scope	comparison	n	spearman theta	median abs pctile shift	p95 abs pctile shift	band exact agreement	very low jaccard
foundation	full_battery	nl_80_90_relaxed_3cat vs nl_85_95_current_3cat	997.0	0.9909	0.0226	0.081	0.9398	0.8742
foundation	full_battery	nl_90_97_strict_3cat vs nl_85_95_current_3cat	997.0	0.991	0.0211	0.0813	0.9438	0.8861
foundation	full_battery	nl_binary_95 vs nl_85_95_current_3cat	997.0	0.9813	0.0326	0.1129	0.9178	0.8395
foundation	full_battery	nl_80_90_95_4cat vs nl_85_95_current_3cat	997.0	0.9882	0.0241	0.0928	0.9238	0.8395
foundation	number_line_only	nl_80_90_relaxed_3cat vs nl_85_95_current_3cat	974.0	0.9288	0.0675	0.2232	0.8542	0.6627
foundation	number_line_only	nl_90_97_strict_3cat vs nl_85_95_current_3cat	974.0	0.9262	0.0647	0.2266	0.8501	0.6686
foundation	number_line_only	nl_binary_95 vs nl_85_95_current_3cat	974.0	0.9026	0.0688	0.2599	0.7793	0.4703
foundation	number_line_only	nl_80_90_95_4cat vs nl_85_95_current_3cat	974.0	0.9718	0.0416	0.1439	0.8973	0.7711
year1	full_battery	nl_80_90_relaxed_3cat vs nl_85_95_current_3cat	1,221	0.9948	0.0172	0.0622	0.9419	0.8579
year1	full_battery	nl_90_97_strict_3cat vs nl_85_95_current_3cat	1,221	0.995	0.016	0.0581	0.9484	0.8769
year1	full_battery	nl_binary_95 vs nl_85_95_current_3cat	1,221	0.9898	0.0242	0.0852	0.9263	0.7902
year1	full_battery	nl_80_90_95_4cat vs nl_85_95_current_3cat	1,221	0.9936	0.0192	0.0672	0.9345	0.8535
year1	number_line_only	nl_80_90_relaxed_3cat vs nl_85_95_current_3cat	1,178	0.9436	0.0577	0.2055	0.8659	0.6635
year1	number_line_only	nl_90_97_strict_3cat vs nl_85_95_current_3cat	1,178	0.9384	0.0641	0.2017	0.8489	0.6479
year1	number_line_only	nl_binary_95 vs nl_85_95_current_3cat	1,178	0.8975	0.0781	0.2681	0.7674	0.542
year1	number_line_only	nl_80_90_95_4cat vs nl_85_95_current_3cat	1,178	0.9751	0.0352	0.1317	0.8973	0.7241

Bounded mirt 1D cutoff fit summary

year level	policy id	scope	status	converged	n persons	n items	notes
foundation	nl_80_90_relaxed_3cat	full_battery	fit_ok	FALSE	997.0	111.0	1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
foundation	nl_85_95_current_3cat	full_battery	fit_ok	FALSE	997.0	111.0	1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
foundation	nl_80_90_95_4cat	full_battery	fit_ok	FALSE	997.0	111.0	1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
year1	nl_80_90_relaxed_3cat	full_battery	fit_ok	FALSE	1,221	132.0	1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
year1	nl_85_95_current_3cat	full_battery	fit_ok	FALSE	1,221	132.0	1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
year1	nl_80_90_95_4cat	full_battery	fit_ok	TRUE	1,221	132.0	1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model

Bounded mirt 1D cutoff score movement vs current .85/.95

year level	scope	comparison	n	spearman theta	median abs pctile shift	p95 abs pctile shift	band exact agreement	very low jaccard
foundation	full_battery	nl_80_90_relaxed_3cat vs nl_85_95_current_3cat	997.0	0.9999	0.002	0.009	0.992	0.9735
foundation	full_battery	nl_80_90_95_4cat vs nl_85_95_current_3cat	997.0	1	0.001	0.006	0.998	1
year1	full_battery	nl_80_90_relaxed_3cat vs nl_85_95_current_3cat	1,221	0.9994	0.0049	0.0221	0.9771	0.9365
year1	full_battery	nl_80_90_95_4cat vs nl_85_95_current_3cat	1,221	0.9998	0.0033	0.0131	0.9836	0.9572

Accuracy-speed readiness

RT is available for observed/reached rows in timed subtests; the high presented-row missingness there largely reflects trailing unreached D-zero rows.

year level	test subgroup	role	is timed	observed or coordinate rt missing rate	presented row rt missing or negative rate	trailing nonresponse rate	row rt p50	pct rt lt 1	initial joint model role	rt readiness flags
foundation	BNL0-20	achievement_primary	False	0	0.0199	0	7	0.0077	nl_rt_context_only_initially_not_accuracy_speed_scoring	none_obvious_from_row_rt_audit
foundation	DMT10_2026	achievement_primary	False	0	0.0146	0	16	0.0002	untimed_or_other_context_only_initially	none_obvious_from_row_rt_audit
foundation	MC0-20	achievement_primary	True	0	0.7501	0.7402	6	0.0042	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
foundation	MNC0-20	achievement_primary	True	0	0.7611	0.7511	12	0.0045	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
foundation	MQ1-20	achievement_primary	True	0	0.8417	0.8328	20	0.0082	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
foundation	STPM	shadow_speed_only	True		0.0625	0.0521	8	0.0013	shadow_speed_only_exclude_from_math_achievement	presented_row_rt_missing_or_negative_gt_5pct
year1	AAMC	achievement_primary	True	0	0.8024	0.7845	9	0.0053	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1	ASMC	achievement_primary	True	0	0.7762	0.7559	12	0.005	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1	BNL0-100	achievement_primary	False	0	0.0377	0	5	0.0065	nl_rt_context_only_initially_not_accuracy_speed_scoring	none_obvious_from_row_rt_audit
year1	MC0-100	achievement_primary	True	0	0.7711	0.7599	6	0.0044	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1	MNC0-100	achievement_primary	True	0	0.729	0.7145	11	0.0047	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1	STPM	shadow_speed_only	True		0.0641	0.0443	6	0.001	shadow_speed_only_exclude_from_math_achievement	presented_row_rt_missing_or_negative_gt_5pct

Rapid-row descriptive audit

year level	test subgroup	j2b style rapid rate	mean accuracy rapid rows	mean accuracy nonrapid rows	rapid minus nonrapid accuracy	interpretation
foundation	MC0-20	0.0597	0.6363	0.9188	-0.2826	rapid rows should remain diagnostic/shadow unless validated; not a motivation label
foundation	MNC0-20	0.0396	0.0877	0.7537	-0.666	rapid rows should remain diagnostic/shadow unless validated; not a motivation label
foundation	MQ1-20	0.0308	0.0544	0.6558	-0.6014	rapid rows should remain diagnostic/shadow unless validated; not a motivation label
year1	AAMC	0.0436	0.1844	0.8026	-0.6182	rapid rows should remain diagnostic/shadow unless validated; not a motivation label
year1	ASMC	0.0424	0.1322	0.6265	-0.4943	rapid rows should remain diagnostic/shadow unless validated; not a motivation label
year1	MC0-100	0.0527	0.5666	0.8974	-0.3308	rapid rows should remain diagnostic/shadow unless validated; not a motivation label
year1	MNC0-100	0.0403	0.0874	0.834	-0.7466	rapid rows should remain diagnostic/shadow unless validated; not a motivation label

Profile-deviation spread by subtest

year level	test subgroup	n profile deviation	profile deviation sd z	profile deviation p10 z	profile deviation p90 z	pct abs profile deviation gt 1z
foundation	MQ1-20	1,005	0.9337	-1.043	1.115	0.2239
foundation	MC0-20	1,005	0.8424	-0.9794	1.032	0.202
foundation	MNC0-20	1,003	0.7871	-0.9912	1.018	0.2044
foundation	DMT10_2026	1,002	0.9381	-1.21	1.155	0.2735
foundation	BNL0-20	974	0.9978	-1.275	1.231	0.3142
year1	MC0-100	1,229	0.7369	-0.8171	0.9086	0.1448
year1	MNC0-100	1,229	0.6395	-0.7395	0.797	0.105
year1	AAMC	1,227	0.6877	-0.8228	0.8138	0.1084
year1	ASMC	1,223	0.7863	-0.9902	0.9733	0.1905
year1	BNL0-100	1,178	0.9017	-1.064	1.119	0.2326

Next-step gates

Decision gates for the next round of modelling.

stream	next action	must check before fit	must check after fit
hierarchical_subscores	fit H1 Stan global+subtest-deviation model on hard-filtered operational frame	subtest score reliability/correlation/readiness table; avoid writing nuisance residuals as teacher subscores	HMC diagnostics, posterior SD by subscore, shrinkage size, global score movement, risk-band movement, subgroup movement, profile interpretability
number_line_policy	run frequentist ordinal cutoff screens using audited candidate policies; keep current .85/.95 as reference	item-by-target category counts and ECDF; reject policies with sparse/empty categories before Stan	threshold behaviour, reliability, score/risk movement, validation/fairness; continuous challenger only after proxy screen
accuracy_speed_joint	treat RT as QC/shadow; candidate families are timed non-NL achievement subtests only at first	RT missingness, row RT quantiles, rapid-row accuracy, STPM exclusion, D/trailing-zero double-count risk	tau construct validity, rapid effect direction, theta robustness, admin/subgroup artefacts, no operational risk-band changes

Interactive diagnostics

Charts are rendered in-browser from aggregate JSON. No student-level scores are published in the chart data.

Downloads

Aggregate CSV/Markdown artifacts used to build this page.

⬇ stan_job_diagnostic_summary.csv ⬇ stan_score_movement_comparisons.csv ⬇ stan_testlet_sigma_summary_long.csv ⬇ stan_u_residual_diagnostic_summary.csv ⬇ stan_item_difficulty_extreme_or_diagnostic_flags.csv ⬇ 2026_boy_model_review_findings.md ⬇ stan_review_summary.md ⬇ 2026_boy_next_model_action_plan_readout.md ⬇ 2026_boy_numberline_policy_adjudication.md ⬇ 2026_boy_numberline_continuous_adjudication.md ⬇ 2025_2026_common_item_drift_prior_audit.md ⬇ 2026_boy_premodeling_audit_hierarchical_nl_speed.md ⬇ 2026_boy_next_model_spec_hierarchical_nl_speed.md ⬇ 2026_boy_hierarchical_subscore_readiness.csv ⬇ 2026_boy_subtest_score_correlations.csv ⬇ 2026_boy_subtest_composite_correlations.csv ⬇ 2026_boy_subtest_profile_deviation_summary.csv ⬇ 2026_boy_nl_accuracy_distribution_by_item.csv ⬇ 2026_boy_nl_policy_item_cell_counts.csv ⬇ 2026_boy_nl_policy_overall_summary.csv ⬇ 2026_boy_rt_readiness_by_subtest.csv ⬇ 2026_boy_j2b_style_rapid_row_audit.csv ⬇ 2026_boy_speed_accuracy_correlations.csv ⬇ 2026_boy_hierarchical_model_ladder.csv ⬇ 2026_boy_number_line_policy_ladder.csv ⬇ 2026_boy_accuracy_speed_model_ladder.csv ⬇ 2026_boy_premodeling_decision_gates.csv ⬇ 2026_boy_tam77_nl_cutoff_fit_summary.csv ⬇ 2026_boy_tam77_nl_cutoff_item_counts.csv ⬇ 2026_boy_tam77_nl_cutoff_score_movement.csv ⬇ 2026_boy_mirt78_nl_cutoff_fit_summary.csv ⬇ 2026_boy_mirt78_nl_cutoff_score_movement.csv ⬇ next_model_score_movement_summary.csv ⬇ next_model_risk_band_movement.csv ⬇ bnl_item_difficulty_stability_summary.csv ⬇ bnl_numberline_step_stability.csv ⬇ bnl_empirical_category_functioning.csv ⬇ bnl_empirical_local_dependence_summary.csv ⬇ bnl_empirical_local_dependence_top_pairs.csv ⬇ h1_movement_subtest_drivers.csv ⬇ subscore_reporting_decision_table.csv ⬇ nl_observed_policy_metrics.csv ⬇ nl_tam_mirt_policy_summary.csv ⬇ nl_full_battery_screen_movement.csv ⬇ nl_policy_decision_grid.csv ⬇ nl_continuous_metric_summary.csv ⬇ nl_continuous_vs_ordinal_comparison.csv ⬇ nl_continuous_target_residuals.csv ⬇ nl_continuous_decision_grid.csv ⬇ common_item_drift_prior_eligibility.csv ⬇ common_item_drift_prior_summary.csv ⬇ restricted_2025_t3t4_reliability_context.csv

Stan job diagnostic summary

Completion, verification, sampler, theta, item, and testlet-level summary.

key	family	variant	stan exitcode	postprocess status	verify ok	verify total	min ebfmi	theta max rhat	testlet max rhat	testlet flags
foundation_inclusive	inclusive	inclusive	0	completed	155	155	0.7052	1.004	1.006
year1_inclusive	inclusive	inclusive	0	completed	155	155	0.6144	1.004	1.023	BNL0-100:rhat=1.023,ess=109.1,mean=0.258
foundation_hard	hard_filtered	hard_item_filtered	0	completed	1,955	1,955	0.6765	1.003	1.003
year1_hard	hard_filtered	hard_item_filtered	0	completed	1,955	1,955	0.6462	1.006	1.068	BNL0-100:rhat=1.068,ess=78.4,mean=0.195
foundation_no_DMT10_2026	sensitivity	foundation_no_DMT10_2026	0	completed	2,104	2,104	0.6938	1.003	1.009
foundation_no_MQ1_20_no_DMT10_2026	sensitivity	foundation_no_MQ1_20_no_DMT10_2026	0	completed	2,104	2,104	0.7273	1.002	1.007
foundation_no_BNL0_20	sensitivity	foundation_no_BNL0_20	1	recovered_after_postprocess_failure	2,098	2,098	0.6673	1.002	1.004
year1_no_MC0_100	sensitivity	year1_no_MC0_100	0	completed	2,104	2,104	0.6466	1.002	1.004
year1_no_BNL0_100	sensitivity	year1_no_BNL0_100	1	recovered_after_postprocess_failure	2,098	2,098	0.5985	1.003	1.003
year1_core_no_MC_no_NL	sensitivity	year1_core_no_MC_no_NL	1	recovered_after_postprocess_failure	2,098	2,098	0.5676	1.002	1.003
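
The residual diagnostic on this page counts parameters whose R-hat exceeds 1.01 or whose effective sample size falls below 400. A minimal sketch of that tally follows, assuming per-parameter R-hat and ESS arrays have already been extracted from the fit; the function name and the toy inputs are hypothetical, and only the two thresholds come from the page's column names.

```python
import numpy as np

def residual_flags(rhat, ess, rhat_max=1.01, ess_min=400.0):
    """Count parameters breaching the R-hat or ESS threshold."""
    rhat = np.asarray(rhat, dtype=float)
    ess = np.asarray(ess, dtype=float)
    high_rhat = rhat > rhat_max
    low_ess = ess < ess_min
    return {
        "n": int(rhat.size),
        "rhat_gt_1_01": int(high_rhat.sum()),
        "ess_lt_400": int(low_ess.sum()),
        "either": int((high_rhat | low_ess).sum()),
        "max_rhat": float(rhat.max()),
        "min_ess": float(ess.min()),
    }

# Toy values for four parameters, not project output.
flags = residual_flags(
    rhat=[1.002, 1.015, 1.026, 1.004],
    ess=[5200.0, 610.0, 279.0, 3100.0],
)
```

Keeping "either" as a union rather than a sum avoids double-counting a parameter that fails both checks.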

Score movement and risk-band stability

All comparisons are against the hard-item-filtered baseline for the matching year.

comparison id	year	n common	spearman theta	median abs pctile shift	p95 abs pctile shift	exact 3band agreement	very low jaccard	low or very low jaccard	very low moved out	very low moved in	low or very low moved out	low or very low moved in
inclusive_vs_hard_foundation	foundation	997	0.9997	0.003	0.014	0.99	0.9735	0.9829	2	2	3	3
inclusive_vs_hard_year1	year1	1,221	0.999	0.0074	0.0262	0.9853	0.9677	0.9723	3	3	6	6
foundation_no_DMT10_2026	foundation	997	0.9346	0.0572	0.21	0.8516	0.7029	0.7576	26	26	48	48
foundation_no_MQ1_20_no_DMT10_2026	foundation	995	0.8254	0.0995	0.3568	0.7608	0.5204	0.6415	47	47	76	76
foundation_no_BNL0_20	foundation	997	0.8648	0.0802	0.3232	0.7753	0.5051	0.6611	49	49	71	71
year1_no_MC0_100	year1	1,211	0.9926	0.0182	0.0677	0.962	0.9053	0.9359	9	9	14	14
year1_no_BNL0_100	year1	1,221	0.7679	0.1188	0.3931	0.7027	0.4023	0.5471	78	78	125	125
year1_core_no_MC_no_NL	year1	1,211	0.739	0.1346	0.4042	0.6846	0.3817	0.5189	81	81	134	134
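
The movement columns above can be sketched from two vectors of person scores for the same students. This is an illustration under stated assumptions rather than the pipeline's code: the function names are hypothetical, percentiles are taken as rank / n within each model, ties are broken by position, and the band cutpoints of 0.15 and 0.35 are chosen only because the base counts reported in the Batch A table (for example 150 and 349 of 997) are roughly 15% and 35% of matched students.

```python
import numpy as np

def _ranks(x):
    """1-based ranks; ties are broken by position (a simplification)."""
    return np.argsort(np.argsort(x, kind="stable"), kind="stable") + 1

def _bands(pct, very_low_cut, low_cut):
    """0 = very low, 1 = low, 2 = neither."""
    return np.where(pct <= very_low_cut, 0, np.where(pct <= low_cut, 1, 2))

def score_movement(theta_base, theta_alt, very_low_cut=0.15, low_cut=0.35):
    """Compare a challenger score against the baseline for the same persons."""
    theta_base = np.asarray(theta_base, dtype=float)
    theta_alt = np.asarray(theta_alt, dtype=float)
    n = len(theta_base)
    r_base, r_alt = _ranks(theta_base), _ranks(theta_alt)
    pct_base, pct_alt = r_base / n, r_alt / n
    shift = np.abs(pct_alt - pct_base)
    b_base = _bands(pct_base, very_low_cut, low_cut)
    b_alt = _bands(pct_alt, very_low_cut, low_cut)
    vl_base, vl_alt = b_base == 0, b_alt == 0
    return {
        "n_common": n,
        # Spearman correlation = Pearson correlation of the ranks.
        "spearman_theta": float(np.corrcoef(r_base, r_alt)[0, 1]),
        "median_abs_pctile_shift": float(np.median(shift)),
        "p95_abs_pctile_shift": float(np.percentile(shift, 95)),
        "exact_3band_agreement": float((b_base == b_alt).mean()),
        "very_low_jaccard": float((vl_base & vl_alt).sum() / (vl_base | vl_alt).sum()),
        "very_low_moved_out": int((vl_base & ~vl_alt).sum()),
        "very_low_moved_in": int((~vl_base & vl_alt).sum()),
    }

# Synthetic thetas, not project data.
rng = np.random.default_rng(1)
base = rng.normal(size=1000)
alt = base + rng.normal(scale=0.05, size=1000)
result = score_movement(base, alt)
unchanged = score_movement(base, base)
```

With percentile-defined bands each band has a fixed size, so every student who leaves the very-low band is replaced by one who enters it. That is consistent with the equal "moved out" and "moved in" counts in the table, although the table alone does not confirm how the operational bands are defined.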

Year 1 residual/testlet caveat

Auxiliary latent residual diagnostic showing the localized BNL0-100 issue: in year1_hard, 1,193 of 1,221 BNL0-100 residuals have R-hat above 1.01.

job	testlet index	test subgroup	n	rhat gt 1.01	ess lt 400	either	max rhat	min ess
foundation_inclusive	1	MQ1-20	997	0	0	0	1.003	5,816
foundation_inclusive	2	MC0-20	997	0	0	0	1.002	6,015
foundation_inclusive	3	MNC0-20	997	0	0	0	1.003	6,773
foundation_inclusive	4	DMT10_2026	997	0	0	0	1.003	7,920
foundation_inclusive	5	BNL0-20	997	0	0	0	1.002	5,589
year1_inclusive	1	MC0-100	1,221	0	0	0	1.003	2,759
year1_inclusive	2	MNC0-100	1,221	0	0	0	1.003	2,981
year1_inclusive	3	AAMC	1,221	0	0	0	1.003	2,173
year1_inclusive	4	ASMC	1,221	0	0	0	1.003	2,453
year1_inclusive	5	BNL0-100	1,221	0	3	3	1.009	320.1
foundation_hard	1	MQ1-20	997	0	0	0	1.003	5,781
foundation_hard	2	MC0-20	997	0	0	0	1.002	5,019
foundation_hard	3	MNC0-20	997	0	0	0	1.003	5,633
foundation_hard	4	DMT10_2026	997	0	0	0	1.003	7,825
foundation_hard	5	BNL0-20	997	0	0	0	1.002	5,096
year1_hard	1	MC0-100	1,221	0	0	0	1.003	4,635
year1_hard	2	MNC0-100	1,221	0	0	0	1.003	6,077
year1_hard	3	AAMC	1,221	0	0	0	1.003	4,001
year1_hard	4	ASMC	1,221	0	0	0	1.003	4,934
year1_hard	5	BNL0-100	1,221	1,193	2	1,193	1.026	279.0
foundation_no_DMT10_2026	1	MQ1-20	997	0	0	0	1.003	3,839
foundation_no_DMT10_2026	2	MC0-20	997	0	0	0	1.002	4,958
foundation_no_DMT10_2026	3	MNC0-20	997	0	0	0	1.002	5,529
foundation_no_DMT10_2026	4	BNL0-20	997	0	0	0	1.003	2,890
foundation_no_MQ1_20_no_DMT10_2026	1	MC0-20	995	0	0	0	1.003	6,494
foundation_no_MQ1_20_no_DMT10_2026	2	MNC0-20	995	0	0	0	1.003	7,203
foundation_no_MQ1_20_no_DMT10_2026	3	BNL0-20	995	0	0	0	1.007	6,786
foundation_no_BNL0_20	1	MQ1-20	997	0	0	0	1.004	3,875
foundation_no_BNL0_20	2	MC0-20	997	0	0	0	1.003	5,855
foundation_no_BNL0_20	3	MNC0-20	997	0	0	0	1.003	7,600
foundation_no_BNL0_20	4	DMT10_2026	997	0	0	0	1.003	7,424
year1_no_MC0_100	1	MNC0-100	1,211	0	0	0	1.003	6,104
year1_no_MC0_100	2	AAMC	1,211	0	0	0	1.003	5,836
year1_no_MC0_100	3	ASMC	1,211	0	0	0	1.003	5,684
year1_no_MC0_100	4	BNL0-100	1,211	0	0	0	1.005	6,171
year1_no_BNL0_100	1	MC0-100	1,221	0	0	0	1.002	4,308
year1_no_BNL0_100	2	MNC0-100	1,221	0	0	0	1.003	4,397
year1_no_BNL0_100	3	AAMC	1,221	0	0	0	1.002	3,863
year1_no_BNL0_100	4	ASMC	1,221	0	0	0	1.002	4,374
year1_core_no_MC_no_NL	1	MNC0-100	1,211	0	0	0	1.002	5,522
year1_core_no_MC_no_NL	2	AAMC	1,211	0	0	0	1.002	4,994
year1_core_no_MC_no_NL	3	ASMC	1,211	0	0	0	1.003	5,165
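
The flag columns in the residual diagnostic table can be read as counts over one testlet's residual parameters. The sketch below assumes each row summarises one residual parameter per person within a testlet (suggested by n matching the person counts, but not confirmed here) and takes the 1.01 and 400 thresholds from the column names. The function name is hypothetical.

```python
def diagnostic_flags(rhats, ess_values, rhat_max=1.01, ess_min=400.0):
    """Count parameters failing the R-hat or ESS threshold for one testlet."""
    if len(rhats) != len(ess_values):
        raise ValueError("rhats and ess_values must have the same length")
    rhat_flag = [r > rhat_max for r in rhats]
    ess_flag = [e < ess_min for e in ess_values]
    return {
        "n": len(rhats),
        "rhat_gt_1_01": sum(rhat_flag),
        "ess_lt_400": sum(ess_flag),
        # "either" is the size of the union, so it is at least the larger
        # of the two counts and at most their sum.
        "either": sum(a or b for a, b in zip(rhat_flag, ess_flag)),
        "max_rhat": max(rhats),
        "min_ess": min(ess_values),
    }
```

Under this reading, the year1_hard BNL0-100 row (1,193 R-hat flags, 2 ESS flags, 1,193 either) means both low-ESS parameters also exceeded the R-hat threshold.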

Full testlet sigma table

key	test subgroup	variable	mean	sd	q5	q95	rhat	ess bulk	ess tail
foundation_inclusive	MQ1-20	sigma_u[1]	0.8397	0.0613	0.7372	0.9408	1.006	1,465	2,999
foundation_inclusive	MC0-20	sigma_u[2]	2.189	0.0714	2.075	2.31	1.002	1,844	3,930
foundation_inclusive	MNC0-20	sigma_u[3]	1.926	0.0711	1.812	2.046	1.003	2,447	4,306
foundation_inclusive	DMT10_2026	sigma_u[4]	0.7777	0.0561	0.6856	0.8697	1.001	1,884	3,879
foundation_inclusive	BNL0-20	sigma_u[5]	0.5988	0.0423	0.5285	0.6682	1.005	1,173	2,206
year1_inclusive	MC0-100	sigma_u[1]	2.322	0.0722	2.207	2.443	1.001	598.9	2,367
year1_inclusive	MNC0-100	sigma_u[2]	2.055	0.0745	1.935	2.179	1.005	541.0	2,220
year1_inclusive	AAMC	sigma_u[3]	2.05	0.0712	1.937	2.171	1.003	453.1	2,161
year1_inclusive	ASMC	sigma_u[4]	1.636	0.0629	1.533	1.74	1.005	519.2	2,078
year1_inclusive	BNL0-100	sigma_u[5]	0.2584	0.0978	0.0637	0.3936	1.023	109.1	237.0
foundation_hard	MQ1-20	sigma_u[1]	0.922	0.0597	0.8248	1.021	1.002	2,015	4,281
foundation_hard	MC0-20	sigma_u[2]	2.239	0.0696	2.125	2.355	1.001	1,602	3,564
foundation_hard	MNC0-20	sigma_u[3]	1.995	0.0725	1.877	2.116	1	2,183	3,927
foundation_hard	DMT10_2026	sigma_u[4]	0.7915	0.0564	0.6992	0.8843	1.003	1,658	4,175
foundation_hard	BNL0-20	sigma_u[5]	0.5939	0.0417	0.5256	0.6609	1.002	1,081	2,044
year1_hard	MC0-100	sigma_u[1]	2.576	0.0731	2.458	2.699	1.004	1,030	2,777
year1_hard	MNC0-100	sigma_u[2]	2.196	0.0739	2.077	2.318	1.005	1,090	2,789
year1_hard	AAMC	sigma_u[3]	2.091	0.0681	1.979	2.205	1.009	826.3	2,277
year1_hard	ASMC	sigma_u[4]	1.696	0.0623	1.594	1.799	1.006	900.3	2,518
year1_hard	BNL0-100	sigma_u[5]	0.1952	0.0927	0.0296	0.3417	1.068	78.37	319.6
foundation_no_DMT10_2026	MQ1-20	sigma_u[1]	0.9541	0.0677	0.8417	1.065	1.003	1,066	2,554
foundation_no_DMT10_2026	MC0-20	sigma_u[2]	2.263	0.071	2.148	2.383	1.001	1,903	3,560
foundation_no_DMT10_2026	MNC0-20	sigma_u[3]	2.07	0.076	1.945	2.197	1.003	1,781	3,846
foundation_no_DMT10_2026	BNL0-20	sigma_u[4]	0.4973	0.0596	0.3965	0.5908	1.009	481.8	814.6
foundation_no_MQ1_20_no_DMT10_2026	MC0-20	sigma_u[1]	2.408	0.0729	2.289	2.529	1.003	1,659	3,447
foundation_no_MQ1_20_no_DMT10_2026	MNC0-20	sigma_u[2]	2.219	0.0762	2.096	2.346	1.001	2,540	4,204
foundation_no_MQ1_20_no_DMT10_2026	BNL0-20	sigma_u[3]	0.0681	0.0513	0.005	0.1653	1.007	585.5	844.9
foundation_no_BNL0_20	MQ1-20	sigma_u[1]	0.7303	0.0742	0.6062	0.8476	1.004	1,221	2,465
foundation_no_BNL0_20	MC0-20	sigma_u[2]	2.127	0.0697	2.014	2.248	1.002	2,272	4,443
foundation_no_BNL0_20	MNC0-20	sigma_u[3]	1.866	0.0714	1.75	1.984	1	2,937	4,933
foundation_no_BNL0_20	DMT10_2026	sigma_u[4]	0.8036	0.0648	0.6963	0.9101	1.001	1,450	2,960
year1_no_MC0_100	MNC0-100	sigma_u[1]	2.283	0.0725	2.166	2.404	1.002	2,235	4,019
year1_no_MC0_100	AAMC	sigma_u[2]	2.164	0.0653	2.06	2.273	1.002	1,867	3,469
year1_no_MC0_100	ASMC	sigma_u[3]	1.76	0.0587	1.666	1.859	1.002	2,551	4,791
year1_no_MC0_100	BNL0-100	sigma_u[4]	0.0533	0.04	0.0044	0.1296	1.004	670.5	897.4
year1_no_BNL0_100	MC0-100	sigma_u[1]	2.26	0.0679	2.15	2.373	1.003	2,048	3,661
year1_no_BNL0_100	MNC0-100	sigma_u[2]	1.756	0.0707	1.641	1.873	1.002	1,715	3,122
year1_no_BNL0_100	AAMC	sigma_u[3]	1.608	0.0638	1.505	1.714	1.002	1,892	2,740
year1_no_BNL0_100	ASMC	sigma_u[4]	1.245	0.0588	1.149	1.344	1	1,668	3,369
year1_core_no_MC_no_NL	MNC0-100	sigma_u[1]	1.969	0.0751	1.848	2.095	1.002	2,157	3,843
year1_core_no_MC_no_NL	AAMC	sigma_u[2]	1.753	0.0678	1.644	1.866	1.003	1,925	4,099
year1_core_no_MC_no_NL	ASMC	sigma_u[3]	1.283	0.0635	1.179	1.388	1.001	1,991	4,106

Item difficulty extreme/diagnostic flags table

key	variable	mean	sd	q5	q95	rhat	ess bulk	ess tail
foundation_inclusive	b[81]	7.829	0.5637	6.978	8.828	1	10,514	4,900
foundation_inclusive	b[61]	7.828	0.5537	6.986	8.796	1.002	12,140	5,255
foundation_inclusive	b[64]	7.826	0.5673	6.959	8.805	1.001	11,474	4,865
foundation_inclusive	b[63]	7.826	0.5509	6.973	8.777	1.001	11,666	5,063
foundation_inclusive	b[80]	7.825	0.5688	6.966	8.81	1.001	11,620	5,078
foundation_inclusive	b[65]	7.824	0.5555	6.965	8.786	1	10,609	5,412
foundation_inclusive	b[62]	7.823	0.5623	6.948	8.798	1	12,400	5,798
foundation_inclusive	b[68]	7.821	0.56	6.964	8.803	1	11,187	4,942
foundation_inclusive	b[69]	7.821	0.5668	6.948	8.806	0.9998	10,863	5,251
foundation_inclusive	b[79]	7.817	0.5511	6.974	8.78	1	10,663	5,172
foundation_inclusive	b[102]	7.7	0.5616	6.851	8.667	1	11,007	5,262
foundation_inclusive	b[110]	7.699	0.556	6.855	8.675	1	10,556	5,517
foundation_inclusive	b[109]	7.691	0.5449	6.869	8.635	1	10,619	5,544
foundation_inclusive	b[111]	7.689	0.553	6.846	8.656	1	11,379	4,307
foundation_inclusive	b[106]	7.687	0.5456	6.852	8.654	1	11,355	5,515
foundation_inclusive	b[108]	7.685	0.5583	6.829	8.656	1	11,662	5,305
foundation_inclusive	b[58]	7.565	0.5183	6.763	8.463	1.002	10,619	5,099
foundation_inclusive	b[75]	7.556	0.5175	6.763	8.446	1.001	12,080	4,752
foundation_inclusive	b[73]	7.555	0.5255	6.735	8.466	1.003	10,646	4,896
foundation_inclusive	b[72]	7.555	0.5083	6.767	8.418	1.001	11,396	5,494
year1_inclusive	b[164]	8.307	0.5422	7.484	9.259	1.001	10,601	5,368
year1_inclusive	b[40]	8.302	0.5642	7.431	9.289	1.002	13,359	5,382
year1_inclusive	b[170]	8.302	0.5416	7.463	9.234	1.002	12,387	5,139
year1_inclusive	b[168]	8.301	0.5358	7.477	9.24	1.001	10,278	5,653
year1_inclusive	b[34]	8.301	0.5614	7.447	9.288	1.001	13,806	4,939
year1_inclusive	b[171]	8.299	0.5325	7.478	9.221	1.002	11,647	5,517
year1_inclusive	b[166]	8.298	0.5464	7.466	9.256	1	12,309	5,572
year1_inclusive	b[169]	8.296	0.5284	7.491	9.223	1	10,930	5,443
year1_inclusive	b[172]	8.292	0.5326	7.48	9.213	1.001	13,538	5,169
year1_inclusive	b[120]	8.254	0.5375	7.431	9.171	1.001	12,951	5,834
year1_inclusive	b[136]	8.253	0.539	7.445	9.223	1.001	11,104	4,681
year1_inclusive	b[133]	8.251	0.5415	7.428	9.169	1	10,203	4,544
year1_inclusive	b[129]	8.251	0.5442	7.423	9.203	1	12,194	5,691
year1_inclusive	b[141]	8.251	0.5426	7.42	9.2	1	13,052	5,230
year1_inclusive	b[134]	8.251	0.5345	7.434	9.19	1.001	10,492	4,616
year1_inclusive	b[140]	8.25	0.5361	7.426	9.185	0.9999	11,556	4,875
year1_inclusive	b[125]	8.25	0.5321	7.435	9.178	1	12,733	5,602
year1_inclusive	b[123]	8.25	0.5398	7.416	9.191	1.001	13,588	5,578
year1_inclusive	b[142]	8.249	0.5375	7.421	9.193	1.001	13,551	5,404
year1_inclusive	b[128]	8.248	0.538	7.424	9.196	1.002	12,262	5,445
foundation_hard	b[67]	8.006	0.5185	7.206	8.902	1.001	10,703	5,245
foundation_hard	b[61]	8.005	0.5234	7.192	8.93	1	10,271	4,354
foundation_hard	b[59]	8.003	0.5155	7.209	8.901	1.001	11,504	4,907
foundation_hard	b[57]	8.001	0.5108	7.207	8.876	1	11,286	5,709
foundation_hard	b[66]	8.001	0.5202	7.206	8.9	1	11,874	4,432
foundation_hard	b[68]	8.001	0.5146	7.201	8.883	1.001	11,490	5,480
foundation_hard	b[64]	8	0.5054	7.213	8.872	1.001	10,755	5,631
foundation_hard	b[65]	7.999	0.5171	7.192	8.887	1	11,646	4,453
foundation_hard	b[58]	7.999	0.4969	7.221	8.85	1.001	10,610	5,527
foundation_hard	b[55]	7.998	0.5119	7.189	8.883	1	9,513	5,119
foundation_hard	b[62]	7.996	0.5156	7.194	8.896	1.001	10,896	5,401
foundation_hard	b[52]	7.995	0.5068	7.206	8.863	1	10,763	5,164
foundation_hard	b[56]	7.994	0.5129	7.207	8.897	1.001	10,798	5,588
foundation_hard	b[60]	7.994	0.5065	7.215	8.875	1.001	12,241	4,915
foundation_hard	b[54]	7.993	0.5138	7.199	8.888	1	11,508	5,796
foundation_hard	b[63]	7.991	0.5112	7.184	8.878	1.002	11,372	5,727
foundation_hard	b[53]	7.774	0.4765	7.021	8.583	1.001	10,998	5,387
foundation_hard	b[51]	7.766	0.4816	7.028	8.605	1	11,810	5,249
foundation_hard	b[92]	7.741	0.5019	6.961	8.61	1	11,551	5,223
foundation_hard	b[90]	7.741	0.4976	6.965	8.603	1	10,416	5,919
year1_hard	b[107]	9.497	0.5015	8.705	10.36	1.001	11,483	5,821
year1_hard	b[110]	9.493	0.5009	8.711	10.35	1	8,816	5,481
year1_hard	b[108]	9.49	0.502	8.713	10.35	1.001	9,981	4,805
year1_hard	b[109]	9.488	0.5143	8.682	10.37	1.001	11,363	6,024
year1_hard	b[106]	9.278	0.4784	8.536	10.11	1	10,215	6,011
year1_hard	b[104]	9.275	0.4658	8.551	10.06	1.001	9,470	5,165
year1_hard	b[105]	9.274	0.4675	8.539	10.09	1.001	10,277	5,493
year1_hard	b[131]	8.564	0.4876	7.814	9.401	1	10,202	4,515
year1_hard	b[132]	8.355	0.4509	7.639	9.122	1	12,065	5,390
year1_hard	b[130]	8.352	0.458	7.636	9.137	1	9,747	5,523
year1_hard	b[103]	8.263	0.3503	7.704	8.857	1	8,026	6,194
year1_hard	b[38]	8.205	0.529	7.39	9.121	1.001	10,887	5,070
year1_hard	b[27]	8.205	0.5226	7.389	9.123	1.001	10,901	5,201
year1_hard	b[37]	8.203	0.5094	7.421	9.066	1.001	11,727	5,798
year1_hard	b[30]	8.202	0.5279	7.399	9.118	1.001	12,492	4,902
year1_hard	b[22]	8.201	0.5256	7.389	9.115	1	12,550	5,470
year1_hard	b[35]	8.201	0.5215	7.397	9.098	1	11,661	4,768
year1_hard	b[25]	8.198	0.518	7.408	9.096	1.001	11,833	5,432
year1_hard	b[29]	8.197	0.523	7.399	9.101	1.002	11,422	5,258
year1_hard	b[32]	8.197	0.5085	7.422	9.082	1.001	11,232	5,871
foundation_no_DMT10_2026	b[56]	8.003	0.5149	7.2	8.903	1.001	11,245	5,544
foundation_no_DMT10_2026	b[50]	8.001	0.5014	7.23	8.884	1	9,058	4,869
foundation_no_DMT10_2026	b[53]	8	0.4953	7.231	8.864	1	9,980	5,239
foundation_no_DMT10_2026	b[51]	7.999	0.5205	7.203	8.897	1.001	10,170	5,438
foundation_no_DMT10_2026	b[44]	7.999	0.5222	7.199	8.898	1.001	10,974	5,023
foundation_no_DMT10_2026	b[48]	7.999	0.5089	7.201	8.885	1.001	11,335	5,408
foundation_no_DMT10_2026	b[58]	7.998	0.5133	7.21	8.885	1	10,047	5,687
foundation_no_DMT10_2026	b[54]	7.998	0.5168	7.206	8.897	1	10,213	5,567
foundation_no_DMT10_2026	b[46]	7.998	0.5092	7.211	8.864	1.001	10,190	5,756
foundation_no_DMT10_2026	b[52]	7.997	0.5123	7.207	8.894	1.001	10,015	5,462
foundation_no_DMT10_2026	b[57]	7.996	0.512	7.203	8.865	1.001	9,261	5,040
foundation_no_DMT10_2026	b[49]	7.996	0.5146	7.177	8.877	1	10,706	5,121
foundation_no_DMT10_2026	b[59]	7.994	0.5154	7.184	8.883	1.001	10,213	5,237
foundation_no_DMT10_2026	b[55]	7.991	0.49	7.221	8.838	1	10,062	5,853
foundation_no_DMT10_2026	b[60]	7.99	0.5223	7.175	8.895	1	11,202	5,325
foundation_no_DMT10_2026	b[47]	7.99	0.5131	7.198	8.867	1.001	10,746	4,765
foundation_no_DMT10_2026	b[43]	7.766	0.4771	7.011	8.595	1	10,443	5,506
foundation_no_DMT10_2026	b[45]	7.763	0.4732	7.033	8.58	1.002	8,990	4,577
foundation_no_DMT10_2026	b[81]	7.763	0.5075	6.995	8.648	1.001	9,655	5,295
foundation_no_DMT10_2026	b[80]	7.757	0.5023	6.994	8.623	1.002	9,873	5,455
foundation_no_MQ1_20_no_DMT10_2026	b[46]	8.038	0.516	7.234	8.926	1.001	12,235	4,708
foundation_no_MQ1_20_no_DMT10_2026	b[48]	8.035	0.5203	7.226	8.914	0.9999	10,550	5,122
foundation_no_MQ1_20_no_DMT10_2026	b[58]	8.034	0.52	7.217	8.946	1.002	12,559	4,526
foundation_no_MQ1_20_no_DMT10_2026	b[53]	8.034	0.5098	7.226	8.919	1.001	11,948	5,302
foundation_no_MQ1_20_no_DMT10_2026	b[49]	8.033	0.5137	7.238	8.925	1.001	11,860	5,433
foundation_no_MQ1_20_no_DMT10_2026	b[51]	8.033	0.51	7.233	8.893	1	12,133	5,543
foundation_no_MQ1_20_no_DMT10_2026	b[44]	8.031	0.5105	7.241	8.925	1.001	11,852	5,579
foundation_no_MQ1_20_no_DMT10_2026	b[56]	8.031	0.512	7.243	8.924	1.001	11,160	5,375
foundation_no_MQ1_20_no_DMT10_2026	b[52]	8.031	0.5062	7.249	8.916	1.001	12,815	6,038
foundation_no_MQ1_20_no_DMT10_2026	b[55]	8.03	0.5088	7.247	8.904	1.001	12,143	5,536
foundation_no_MQ1_20_no_DMT10_2026	b[50]	8.028	0.5175	7.227	8.941	1.001	12,795	4,985
foundation_no_MQ1_20_no_DMT10_2026	b[57]	8.027	0.5102	7.245	8.908	1.001	12,697	5,756
foundation_no_MQ1_20_no_DMT10_2026	b[60]	8.027	0.5199	7.22	8.927	1	12,049	5,128
foundation_no_MQ1_20_no_DMT10_2026	b[59]	8.026	0.5175	7.226	8.899	1	13,260	5,505
foundation_no_MQ1_20_no_DMT10_2026	b[54]	8.026	0.5159	7.228	8.908	1	11,114	5,088
foundation_no_MQ1_20_no_DMT10_2026	b[47]	8.024	0.4992	7.241	8.878	1.001	12,673	5,252
foundation_no_MQ1_20_no_DMT10_2026	b[81]	7.837	0.5043	7.062	8.705	1.002	13,099	5,769
foundation_no_MQ1_20_no_DMT10_2026	b[83]	7.834	0.5063	7.056	8.707	1	11,811	5,498
foundation_no_MQ1_20_no_DMT10_2026	b[80]	7.833	0.4987	7.074	8.708	1	11,850	5,260
foundation_no_MQ1_20_no_DMT10_2026	b[82]	7.832	0.5045	7.059	8.704	1	11,609	5,007
foundation_no_BNL0_20	b[48]	8.014	0.5103	7.229	8.895	1.002	14,235	6,076
foundation_no_BNL0_20	b[51]	8.014	0.5109	7.221	8.906	1.001	15,427	5,038
foundation_no_BNL0_20	b[52]	8.014	0.5101	7.228	8.899	1	14,431	5,512
foundation_no_BNL0_20	b[49]	8.013	0.524	7.202	8.931	1	14,002	5,477
foundation_no_BNL0_20	b[57]	8.013	0.5218	7.191	8.903	1.001	14,127	4,955
foundation_no_BNL0_20	b[56]	8.012	0.5246	7.192	8.923	1	14,244	4,541
foundation_no_BNL0_20	b[44]	8.011	0.4994	7.236	8.883	1.001	14,004	5,163
foundation_no_BNL0_20	b[50]	8.011	0.5116	7.224	8.881	1.001	13,465	5,542
foundation_no_BNL0_20	b[42]	8.011	0.5205	7.205	8.914	1	15,335	5,563
foundation_no_BNL0_20	b[47]	8.01	0.5221	7.202	8.908	1	13,570	5,075
foundation_no_BNL0_20	b[55]	8.009	0.5153	7.211	8.87	1	15,560	5,714
foundation_no_BNL0_20	b[58]	8.009	0.5	7.227	8.871	1	13,975	5,664
foundation_no_BNL0_20	b[45]	8.009	0.5089	7.212	8.888	1	15,748	5,994
foundation_no_BNL0_20	b[46]	8.008	0.5252	7.192	8.918	1	14,678	5,234
foundation_no_BNL0_20	b[53]	8.008	0.4992	7.235	8.856	1.001	14,424	5,539
foundation_no_BNL0_20	b[54]	8.006	0.508	7.218	8.893	1	13,131	5,241
foundation_no_BNL0_20	b[43]	7.783	0.4763	7.031	8.597	1.001	14,751	4,931
foundation_no_BNL0_20	b[41]	7.771	0.4693	7.034	8.579	1.001	15,584	5,620
foundation_no_BNL0_20	b[82]	7.734	0.4993	6.966	8.601	1.001	13,522	5,042
foundation_no_BNL0_20	b[81]	7.734	0.5011	6.967	8.61	1.001	14,061	5,130
year1_no_MC0_100	b[97]	8.641	0.497	7.877	9.482	1	11,447	5,361
year1_no_MC0_100	b[96]	8.434	0.4632	7.715	9.24	1.001	9,911	5,566
year1_no_MC0_100	b[98]	8.425	0.4599	7.703	9.209	1	10,199	5,925
year1_no_MC0_100	b[95]	8.251	0.4338	7.571	8.984	1	10,062	5,038
year1_no_MC0_100	b[29]	8.245	0.5165	7.433	9.15	0.9998	9,955	4,727
year1_no_MC0_100	b[31]	8.241	0.522	7.433	9.155	1.002	13,982	4,732
year1_no_MC0_100	b[35]	8.239	0.5331	7.427	9.168	1	11,445	4,974
year1_no_MC0_100	b[27]	8.238	0.5253	7.427	9.136	1.001	11,992	5,104
year1_no_MC0_100	b[28]	8.238	0.531	7.406	9.155	1.001	11,655	4,696
year1_no_MC0_100	b[34]	8.238	0.5258	7.427	9.143	1	12,380	5,543
year1_no_MC0_100	b[33]	8.237	0.5083	7.448	9.118	1	10,989	5,127
year1_no_MC0_100	b[22]	8.236	0.5217	7.439	9.127	1.001	12,780	5,294
year1_no_MC0_100	b[37]	8.235	0.515	7.44	9.127	1.002	11,521	5,609
year1_no_MC0_100	b[30]	8.231	0.5191	7.428	9.134	1	12,678	5,976
year1_no_MC0_100	b[36]	8.231	0.5155	7.426	9.115	1	11,650	5,620
year1_no_MC0_100	b[38]	8.229	0.5135	7.442	9.119	1.001	12,111	5,432
year1_no_MC0_100	b[32]	8.226	0.5062	7.454	9.102	1	11,546	5,832
year1_no_MC0_100	b[25]	8.225	0.5081	7.443	9.105	1	12,807	6,109
year1_no_MC0_100	b[26]	8.004	0.4745	7.274	8.815	1.001	9,905	4,832
year1_no_MC0_100	b[24]	8	0.4777	7.261	8.831	1	12,510	5,384
year1_no_BNL0_100	b[94]	9.423	0.5009	8.651	10.3	1	9,369	5,296
year1_no_BNL0_100	b[95]	9.422	0.5108	8.619	10.31	1.001	9,269	5,245
year1_no_BNL0_100	b[97]	9.418	0.5061	8.627	10.29	1	8,492	4,980
year1_no_BNL0_100	b[96]	9.416	0.5095	8.623	10.29	1	10,151	5,918
year1_no_BNL0_100	b[93]	9.201	0.4816	8.453	10.03	1	9,815	5,523
year1_no_BNL0_100	b[92]	9.197	0.4731	8.458	10.02	1	9,506	5,795
year1_no_BNL0_100	b[91]	9.196	0.4633	8.472	9.994	1	8,598	5,210
year1_no_BNL0_100	b[118]	8.432	0.4934	7.658	9.291	1.001	10,748	5,381
year1_no_BNL0_100	b[119]	8.233	0.4579	7.517	9.012	1.001	10,359	5,764
year1_no_BNL0_100	b[117]	8.228	0.4523	7.517	8.997	1.001	10,423	4,968
year1_no_BNL0_100	b[90]	8.194	0.3467	7.635	8.771	1	8,119	6,229
year1_no_BNL0_100	b[35]	8.163	0.5124	7.376	9.058	1.001	10,185	5,326
year1_no_BNL0_100	b[25]	8.162	0.5209	7.357	9.077	1	11,495	5,309
year1_no_BNL0_100	b[33]	8.161	0.5213	7.361	9.067	1	11,323	5,937
year1_no_BNL0_100	b[37]	8.16	0.5077	7.373	9.055	1.001	9,620	5,385
year1_no_BNL0_100	b[34]	8.16	0.5078	7.382	9.052	1	12,275	6,146
year1_no_BNL0_100	b[38]	8.158	0.5211	7.362	9.072	1	10,685	5,194
year1_no_BNL0_100	b[29]	8.158	0.5113	7.37	9.047	1	11,494	5,758
year1_no_BNL0_100	b[30]	8.158	0.5073	7.376	9.021	1	10,514	4,896
year1_no_BNL0_100	b[32]	8.156	0.5073	7.36	9.037	1	10,190	5,583
year1_core_no_MC_no_NL	b[84]	8.57	0.4904	7.811	9.415	1	12,285	5,696
year1_core_no_MC_no_NL	b[83]	8.367	0.4658	7.646	9.167	1	11,942	5,113
year1_core_no_MC_no_NL	b[85]	8.364	0.4601	7.644	9.15	1.001	12,817	5,931
year1_core_no_MC_no_NL	b[33]	8.21	0.5118	7.417	9.091	1	13,919	4,888
year1_core_no_MC_no_NL	b[31]	8.207	0.53	7.395	9.146	1.001	13,711	4,756
year1_core_no_MC_no_NL	b[29]	8.205	0.5111	7.419	9.082	1	16,253	5,420
year1_core_no_MC_no_NL	b[25]	8.203	0.5118	7.416	9.096	1	13,262	5,692
year1_core_no_MC_no_NL	b[34]	8.202	0.5202	7.391	9.104	1	16,277	5,645
year1_core_no_MC_no_NL	b[28]	8.2	0.519	7.403	9.09	1.001	15,230	5,049
year1_core_no_MC_no_NL	b[22]	8.2	0.5184	7.396	9.09	0.9998	14,382	5,210
year1_core_no_MC_no_NL	b[37]	8.2	0.5089	7.414	9.087	1.001	14,801	5,424
year1_core_no_MC_no_NL	b[38]	8.2	0.5151	7.401	9.099	1.002	12,909	4,308
year1_core_no_MC_no_NL	b[30]	8.199	0.4956	7.421	9.051	1	13,262	5,654
year1_core_no_MC_no_NL	b[27]	8.199	0.5191	7.392	9.115	1.001	14,735	4,717
year1_core_no_MC_no_NL	b[35]	8.199	0.5183	7.396	9.092	1	15,503	4,965
year1_core_no_MC_no_NL	b[36]	8.197	0.5026	7.425	9.065	1.001	15,112	5,976
year1_core_no_MC_no_NL	b[32]	8.194	0.4991	7.426	9.043	1.001	14,008	5,710
year1_core_no_MC_no_NL	b[82]	8.175	0.4304	7.505	8.919	1	12,572	5,793
year1_core_no_MC_no_NL	b[26]	7.967	0.4787	7.222	8.789	1	14,407	5,566
year1_core_no_MC_no_NL	b[24]	7.966	0.4803	7.226	8.793	1	14,103	5,606

Batch A action-plan memo

Rendered from the public-safe aggregate Markdown action-plan artifact.

2026 BOY next-model action-plan readout

Created: 2026-06-15T00:54:00Z

Provenance / compute lock

Batch A run: 2026-boy-next-stan-full-20260614T114524Z.
Current compute state was checked before this package was assembled: no pending, running, or stopping EC2 instances were found; the Batch A instance is stopped and recorded in the internal provenance JSON.
Essential Batch A artifacts are aggregate/postprocessed outputs; full raw CmdStan draws were not needed for this aggregate review and remain an internal recovery-only caveat.
No further AWS compute should be launched without explicit approval.

Score movement / operational global

comparison	year_level	n_matched	theta_spearman	median_abs_percentile_shift	p95_abs_percentile_shift	risk_3band_exact_agreement	very_low_overlap	very_low_n_base	low_or_very_low_overlap	low_or_very_low_n_base
H1_global_vs_hard_filtered	foundation	997	0.904	6.921	26.179	0.802	116	150	282	349
H1_global_vs_hard_filtered	year1	1221	0.836	9.419	34.316	0.762	124	183	333	427
BNL_residual_zero_vs_hard_filtered	year1	1221	0.998	0.901	3.276	0.984	180	183	420	427
H1_global_vs_BNL_residual_zero	year1	1221	0.807	10.483	37.346	0.749	121	183	326	427

BNL residual-zero decision

Year 1 BNL residual-zero is score-stable against the hard-filtered baseline: Spearman 0.998, median absolute percentile shift 0.901 pp, p95 shift 3.276 pp, and 3-band agreement 98.4%. The case for fixing the BNL residual/testlet component at zero is operational parsimony. The estimated BNL residual sigma in the hard baseline was weakly identified (sigma_u[5] mean 0.195, rhat 1.068, bulk ESS 78.4), and fixing that nuisance component leaves global ranking and risk classification almost unchanged while retaining the BNL items. The empirical BNL double-centred residual screen found a maximum |residual correlation| of 0.291 across 78 item pairs; 22 pairs exceeded .20 and none exceeded .30. This is a screen, not a posterior-predictive proof.
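
A double-centred residual screen of this kind can be sketched as follows. This is an illustration under stated assumptions, not the audit script: it assumes a complete persons-by-items score matrix with no missing responses, and the function names are hypothetical. With 13 BNL items it yields 13 x 12 / 2 = 78 item pairs, matching the count reported above.

```python
from itertools import combinations


def double_centre(matrix):
    """Subtract row and column means and add back the grand mean."""
    n, k = len(matrix), len(matrix[0])
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n)) / n for j in range(k)]
    grand = sum(row_means) / n
    return [
        [matrix[i][j] - row_means[i] - col_means[j] + grand for j in range(k)]
        for i in range(n)
    ]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # treat a constant residual column as uncorrelated
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sxx * syy) ** 0.5


def local_dependence_screen(matrix, thresholds=(0.20, 0.30)):
    """Summarise absolute residual correlations over all item pairs."""
    resid = double_centre(matrix)
    k = len(resid[0])
    cols = [[row[j] for row in resid] for j in range(k)]
    abs_corrs = [abs(pearson(cols[a], cols[b])) for a, b in combinations(range(k), 2)]
    return {
        "n_item_pairs": len(abs_corrs),
        "max_abs_residual_corr": max(abs_corrs),
        "n_pairs_gt": {t: sum(c > t for c in abs_corrs) for t in thresholds},
    }
```

Double-centring removes person and item main effects only, so the residual correlations are descriptive; they do not replace a model-based posterior-predictive check.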

H1 global decision

H1 materially changes global ranking: Foundation Spearman 0.904 / 3-band agreement 80.2%; Year 1 Spearman 0.836 / 3-band agreement 76.2%. Therefore H1 global should not replace the operational global score unless outcome/risk validation clearly offsets this reclassification. Use H1 primarily for subscore development at this stage.
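
The overlap counts in the score-movement table above can be converted to retention rates to make the reclassification concrete. The sketch assumes "overlap" means baseline-flagged students who are still flagged under the comparison model; that reading comes from the column names and is not confirmed here. The counts are copied from the table.

```python
# (comparison, year_level, very_low_overlap, very_low_n_base,
#  low_or_very_low_overlap, low_or_very_low_n_base)
ROWS = [
    ("H1_global_vs_hard_filtered", "foundation", 116, 150, 282, 349),
    ("H1_global_vs_hard_filtered", "year1", 124, 183, 333, 427),
    ("BNL_residual_zero_vs_hard_filtered", "year1", 180, 183, 420, 427),
    ("H1_global_vs_BNL_residual_zero", "year1", 121, 183, 326, 427),
]


def retention_rates(rows):
    """Share of baseline-flagged students retained in each risk group."""
    rates = {}
    for comparison, year, vl_overlap, vl_base, lvl_overlap, lvl_base in rows:
        rates[(comparison, year)] = {
            "very_low": vl_overlap / vl_base,
            "low_or_very_low": lvl_overlap / lvl_base,
        }
    return rates


for key, value in retention_rates(ROWS).items():
    print(key, {k: round(v, 3) for k, v in value.items()})
```

On these counts, H1 global retains about 77% of Foundation and about 68% of Year 1 very-low students from the hard-filtered baseline, whereas BNL residual-zero retains about 98% of the Year 1 very-low group.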

Teacher-facing subscores

year_level	test_subgroup	standalone_tam_eap_reliability	h1_sigma_delta_mean	h1_subscore_global_spearman	h1_median_subscore_posterior_sd	recommendation
foundation	BNL0-20	0.674	0.874	0.625	0.451	h1_shrunken_subscore_candidate_with_uncertainty_pending_validation
foundation	DMT10_2026	0.609	0.83	0.746	0.663	diagnostic_only_or_strong_caveat_pending_validation
foundation	MC0-20	0.927	1.932	0.822	0.646	h1_shrunken_subscore_candidate_high_confidence_pending_validation
foundation	MNC0-20	0.881	1.546	0.849	0.729	h1_shrunken_subscore_candidate_high_confidence_pending_validation
foundation	MQ1-20	0.602	0.823	0.796	0.724	diagnostic_only_or_strong_caveat_pending_validation
year1	AAMC	0.9	1.123	0.889	0.659	h1_shrunken_subscore_candidate_high_confidence_pending_validation
year1	ASMC	0.841	1.213	0.824	0.698	h1_shrunken_subscore_candidate_with_uncertainty_pending_validation
year1	BNL0-100	0.727	1.327	0.709	0.408	h1_shrunken_subscore_candidate_with_uncertainty_pending_validation
year1	MC0-100	0.94	1.709	0.873	0.64	h1_shrunken_subscore_candidate_high_confidence_pending_validation
year1	MNC0-100	0.891	1.014	0.917	0.697	h1_shrunken_subscore_candidate_high_confidence_pending_validation

Recommendation: teacher-facing subscore development should prioritise hierarchical/shrunken H1 candidate subscores with uncertainty labels, not standalone subtest IRT scores as primary reporting quantities. Final teacher-facing promotion still requires outcome/use-case validation.

Outcome validation status

status	message
not_run_no_outcome_csvs_supplied	Set OUTCOME_CSVS to comma-separated internal person-level outcome CSV paths when 2026 outcome data are available.

Outcome validation remains the promotion gate. If 2026 PAT/teacher/later screener CSVs are supplied internally via OUTCOME_CSVS, this same script will emit matched aggregate correlations.
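
How the gate behaves when no outcome files are supplied can be illustrated with a small sketch. It assumes OUTCOME_CSVS is read as an environment variable holding comma-separated paths; the action-plan script's actual handling is not shown on this page, and the function names and the "ready" label are hypothetical. Only the not_run_no_outcome_csvs_supplied status string comes from the table above.

```python
import os


def parse_outcome_csvs(raw):
    """Split a comma-separated path list, dropping blanks and whitespace."""
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


def outcome_validation_status(environ=None):
    """Return (status, paths) for the outcome-validation step."""
    environ = os.environ if environ is None else environ
    paths = parse_outcome_csvs(environ.get("OUTCOME_CSVS", ""))
    if not paths:
        return "not_run_no_outcome_csvs_supplied", paths
    # "ready" is a placeholder label, not a status emitted by the real script.
    return "ready", paths
```

Passing a plain dictionary as environ keeps the check testable without touching the real process environment.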

Output tables

tables/model_review/action_plan/provenance_lock.json
tables/model_review/action_plan/next_model_score_movement_summary.csv
tables/model_review/action_plan/next_model_risk_band_movement.csv
tables/model_review/action_plan/bnl_item_difficulty_stability.csv
tables/model_review/action_plan/bnl_item_difficulty_stability_summary.csv
tables/model_review/action_plan/bnl_numberline_step_stability.csv
tables/model_review/action_plan/bnl_empirical_category_functioning.csv
tables/model_review/action_plan/bnl_empirical_local_dependence_summary.csv
tables/model_review/action_plan/bnl_empirical_local_dependence_top_pairs.csv
tables/model_review/action_plan/h1_movement_subtest_drivers.csv
tables/model_review/action_plan/subscore_reporting_decision_table.csv
tables/model_review/action_plan/optional_outcome_validation_summary.csv

Number Line policy memo

Rendered from the public-safe aggregate NL policy artifact.

2026 BOY Number Line policy adjudication

Created: 2026-06-15T00:46:49Z

Interpretation

This is an NL-only adjudication layer. It can nominate scoring policies, but it does not by itself promote a full operational global model. Full-battery Stan and outcome validation remain the promotion gates.

Decision grid

year_level	policy_id	tam_nl_only_reliability	delta_reliability_vs_current	tam_nl_only_spearman_with_non_nl_composite	observed_alpha_numeric_categories	tam_nl_only_vs_current_p95_pctile_shift	recommendation
foundation	nl_80_90_relaxed_3cat	0.684	0.008	0.335	0.726	0.223	secondary_challenger_monitor
foundation	nl_85_95_current_3cat	0.675	0	0.339	0.706		retain_as_current_reference
foundation	nl_90_97_strict_3cat	0.601	-0.074	0.296	0.63	0.227	not_prioritised_for_stan
foundation	nl_binary_95	0.515	-0.16	0.28	0.553	0.26	not_prioritised_for_stan
foundation	nl_80_90_95_4cat	0.689	0.014	0.331	0.722	0.144	secondary_challenger_monitor
year1	nl_80_90_relaxed_3cat	0.758	0.03	0.588	0.782	0.205	secondary_challenger_monitor
year1	nl_85_95_current_3cat	0.728	0	0.55	0.744		retain_as_current_reference
year1	nl_90_97_strict_3cat	0.671	-0.057	0.51	0.688	0.202	not_prioritised_for_stan
year1	nl_binary_95	0.54	-0.188	0.415	0.565	0.268	not_prioritised_for_stan
year1	nl_80_90_95_4cat	0.758	0.03	0.566	0.77	0.132	serious_challenger_consider_full_battery_stan_if_outcome_validity_supports

Observed coordinate-derived metrics

year_level	policy_id	n_persons	n_items	complete_case_alpha_numeric_categories	spearman_with_non_nl_composite	person_mean_score_floor_rate	person_mean_score_ceiling_rate
foundation	nl_80_90_95_4cat	974	10	0.722	0.329	0.002	0.002
foundation	nl_80_90_relaxed_3cat	974	10	0.726	0.328	0.002	0.022
foundation	nl_85_95_current_3cat	974	10	0.706	0.333	0.006	0.002
foundation	nl_90_97_strict_3cat	974	10	0.63	0.292	0.017	0.001
foundation	nl_binary_95	974	10	0.553	0.285	0.088	0.002
year1	nl_80_90_95_4cat	1178	13	0.77	0.559	0.002	0.001
year1	nl_80_90_relaxed_3cat	1178	13	0.782	0.581	0.002	0.008
year1	nl_85_95_current_3cat	1178	13	0.744	0.544	0.004	0.003
year1	nl_90_97_strict_3cat	1178	13	0.688	0.503	0.014	0.001
year1	nl_binary_95	1178	13	0.565	0.411	0.094	0.001
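
For reference, the complete-case alpha column can be computed along the lines below. This is the standard Cronbach's alpha formula applied after dropping any person with a missing item score; whether the audit script uses exactly this missing-data rule is an assumption, and the function name is hypothetical.

```python
from statistics import variance


def complete_case_alpha(rows):
    """Cronbach's alpha over persons with no missing item scores.

    rows: one list of numeric category scores per person; None marks missing.
    """
    complete = [r for r in rows if all(v is not None for v in r)]
    k = len(complete[0])
    if k < 2 or len(complete) < 2:
        raise ValueError("need at least two items and two complete persons")
    item_vars = [variance([r[j] for r in complete]) for j in range(k)]
    total_var = variance([sum(r) for r in complete])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Alpha computed this way depends on the numeric values assigned to the categories, so the 3-category and 4-category policies are not on a strictly like-for-like footing.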

Current recommendation

.85/.95 remains the current reference and is defensible as an operational-compatible benchmark.
.80/.90 and .80/.90/.95 should be treated as serious challengers where they show higher NL-only reliability without destabilising full-battery movement.
Do not launch all full-battery Stan challengers reflexively. Use this grid plus outcome validation to nominate at most one or two.
Continuous Number Line is handled separately in script 87; it should be judged first as an NL-only measurement model before a full-battery Stan launch.
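
The policy identifiers suggest that each ordinal policy maps a continuous accuracy value to a category by counting how many cut-points it meets. The sketch below works under that assumption only: the cut-points are read off the policy names, accuracy is assumed to lie on a 0-1 scale, and treating a value exactly at a cut-point as meeting it is a guess.

```python
from bisect import bisect_right

# Cut-points inferred from the policy identifiers (an assumption).
POLICY_CUTS = {
    "nl_80_90_relaxed_3cat": (0.80, 0.90),
    "nl_85_95_current_3cat": (0.85, 0.95),
    "nl_90_97_strict_3cat": (0.90, 0.97),
    "nl_binary_95": (0.95,),
    "nl_80_90_95_4cat": (0.80, 0.90, 0.95),
}


def ordinal_score(accuracy, policy_id):
    """Category = number of cut-points that the accuracy meets or exceeds."""
    return bisect_right(POLICY_CUTS[policy_id], accuracy)


def score_all_policies(accuracy):
    """Score one response under every policy for side-by-side comparison."""
    return {policy: ordinal_score(accuracy, policy) for policy in POLICY_CUTS}
```

For example, an accuracy of 0.92 would score 1 under the current .85/.95 policy, 2 under the relaxed .80/.90 policy, and 0 under the binary .95 policy, which shows how the relaxed policy shifts responses toward the top category.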

Output tables

outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/numberline_policy_adjudication/nl_observed_policy_metrics.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/numberline_policy_adjudication/nl_tam_mirt_policy_summary.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/numberline_policy_adjudication/nl_full_battery_screen_movement.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/numberline_policy_adjudication/nl_policy_decision_grid.csv

Continuous Number Line memo

Rendered from the public-safe aggregate continuous-NL screen.

2026 BOY continuous Number Line adjudication

Created: 2026-06-15T00:46:55Z

What this does and does not decide

This is a non-IRT continuous-coordinate screen. It asks whether continuous accuracy/error contains enough extra stable signal to justify a formal model-based NL-only or full-battery Stan challenger. It does not by itself replace ordinal NL scoring.
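
The continuous metrics named in the grid below are not defined on this page. One plausible set of definitions, offered purely as an assumption, scales the placement error by the length of the number line; the function and argument names are hypothetical.

```python
def signed_error_scaled(estimate, target, line_min, line_max):
    """Signed placement error as a fraction of the line length."""
    return (estimate - target) / (line_max - line_min)


def negative_absolute_scaled_error(estimate, target, line_min, line_max):
    """Absolute scaled error, negated so that higher is better."""
    return -abs(signed_error_scaled(estimate, target, line_min, line_max))


def continuous_accuracy(estimate, target, line_min, line_max):
    """One minus the absolute scaled error; 1.0 is a perfect placement."""
    return 1.0 - abs(signed_error_scaled(estimate, target, line_min, line_max))
```

Under these definitions the accuracy and negative-absolute-error metrics differ only by a constant, which would be consistent with their matching alpha and Spearman values in the grid, while the signed error captures direction of bias rather than precision.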

Continuous decision grid

year_level	continuous_metric	alpha_delta_vs_current_ordinal	non_nl_spearman_delta_vs_current_ordinal	spearman_vs_current_ordinal	p95_percentile_shift_vs_current_ordinal	recommendation
foundation	continuous_accuracy	0.101	0.035	0.931	21.82	promising_continuous_challenger_consider_model_based_fit
foundation	negative_absolute_scaled_error	0.101	0.035	0.931	21.838	promising_continuous_challenger_consider_model_based_fit
foundation	absolute_signed_error_scaled_negative	0.101	0.035	0.931	21.838	promising_continuous_challenger_consider_model_based_fit
foundation	signed_error_scaled	-0.003	-0.531			diagnostic_bias_only_not_primary_score
year1	continuous_accuracy	0.093	0.061	0.931	21.817	promising_continuous_challenger_consider_model_based_fit
year1	negative_absolute_scaled_error	0.093	0.061	0.931	21.781	promising_continuous_challenger_consider_model_based_fit
year1	absolute_signed_error_scaled_negative	0.093	0.061	0.931	21.781	promising_continuous_challenger_consider_model_based_fit
year1	signed_error_scaled	0.044	-0.867			diagnostic_bias_only_not_primary_score

Metric summary

year_level	nl_metric	metric_family	complete_case_alpha	spearman_with_non_nl_composite	person_score_floor_rate	person_score_ceiling_rate
foundation	absolute_signed_error_scaled_negative	continuous	0.807	0.369	0.001	0.001
foundation	continuous_accuracy	continuous	0.807	0.369	0.001	0.001
foundation	negative_absolute_scaled_error	continuous	0.807	0.369	0.001	0.001
foundation	nl_80_90_95_4cat	ordinal_policy	0.722	0.329	0.002	0.002
foundation	nl_80_90_relaxed_3cat	ordinal_policy	0.726	0.328	0.002	0.022
foundation	nl_85_95_current_3cat	ordinal_policy	0.706	0.333	0.006	0.002
foundation	signed_error_scaled	continuous	0.703	-0.198	0.001	0.001
year1	absolute_signed_error_scaled_negative	continuous	0.837	0.605	0.001	0.001
year1	continuous_accuracy	continuous	0.837	0.605	0.001	0.001
year1	negative_absolute_scaled_error	continuous	0.837	0.605	0.001	0.001
year1	nl_80_90_95_4cat	ordinal_policy	0.77	0.559	0.002	0.001
year1	nl_80_90_relaxed_3cat	ordinal_policy	0.782	0.581	0.002	0.008
year1	nl_85_95_current_3cat	ordinal_policy	0.744	0.544	0.004	0.003
year1	signed_error_scaled	continuous	0.788	-0.323	0.001	0.001

Recommendation

Continuous NL should be retained as a serious research/diagnostic contender if it improves reliability or non-NL composite alignment materially. If the gains are small and the continuous scores remain highly rank-correlated with the current ordinal score, prioritise ordinal .80/.90 or 4-category challengers before an expensive continuous full-battery Stan run.

Output tables

outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/numberline_continuous_adjudication/nl_continuous_metric_summary.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/numberline_continuous_adjudication/nl_continuous_vs_ordinal_comparison.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/numberline_continuous_adjudication/nl_continuous_target_residuals.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/numberline_continuous_adjudication/nl_continuous_decision_grid.csv

Historical prior/drift memo

Rendered from the public-safe aggregate crosswalk/prior artifact.

2025→2026 common-item drift and historical-prior audit

Created: 2026-06-15T00:47:05Z

Subtest summary

year_level	subtest_2026	n_crosswalk_items	n_high_confidence_matches	n_prior_eligible_items	median_abs_logit_drift	p90_abs_logit_drift	recommendation
foundation	BNL0-20	10	10	10	0.169	0.43	historical_item_priors_supported_with_drift_inflation
foundation	DMT10_2026	11	7	0	0.269	3.335	use_2025_as_subtest_context_not_hard_item_anchor
foundation	MC0-20	60	0	0	6.902	9.5	no_item_level_historical_prior
foundation	MNC0-20	30	30	5	4.845	8.828	selective_item_priors_after_manual_review
foundation	MQ1-20	30	30	3	6.462	9.234	selective_item_priors_after_manual_review
year1	AAMC	40	40	1	4.995	9.864	selective_item_priors_after_manual_review
year1	ASMC	30	30	3	4.797	8.616	selective_item_priors_after_manual_review
year1	BNL0-100	13	13	13	0.141	0.285	historical_item_priors_supported_with_drift_inflation
year1	MC0-100	60	1	1	8.868	10.309	no_item_level_historical_prior
year1	MNC0-100	29	29	4	4.775	8.225	selective_item_priors_after_manual_review

2025 restricted reliability context

year_level	term_scope	item_scope	fit_status	n_person	n_items	reliability	converged
foundation	term3	kept_crosswalk	ok	1440	93	0.903	TRUE
foundation	term3	all_crosswalk	ok	1440	101	0.904	TRUE
foundation	term4	kept_crosswalk	ok	1096	84	0.889	TRUE
foundation	term4	all_crosswalk	ok	1096	90	0.89	TRUE
foundation	term3_4_pooled	kept_crosswalk	ok	2536	100	0.905	TRUE
foundation	term3_4_pooled	all_crosswalk	ok	2536	118	0.906	TRUE
year1	term3	kept_crosswalk	ok	1501	121	0.945	TRUE
year1	term3	all_crosswalk	ok	1501	134	0.945	TRUE
year1	term4	kept_crosswalk	ok	1067	105	0.916	TRUE
year1	term4	all_crosswalk	ok	1067	109	0.917	TRUE
year1	term3_4_pooled	kept_crosswalk	ok	2568	132	0.949	TRUE
year1	term3_4_pooled	all_crosswalk	ok	2568	148	0.95	TRUE

Recommendation

Use 2025 as prior evidence only where item identity is strong/moderate and proxy drift is acceptable. Treat DMT10_2026 as the main caveat: it should receive at most weak subtest-level context, not hard item anchoring. For strong BNL and other exact-ID subtests, weak/commensurate priors with drift inflation are defensible if final validation supports Bayesian updating.
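
One possible reading of "weak/commensurate priors with drift inflation" is a normal prior centred on the 2025 item difficulty, with its standard deviation widened by an allowance for drift. The Python sketch below illustrates that reading only. The function name, the root-sum-of-squares inflation rule, and the example inputs are assumptions for illustration, not the audit's implementation.

```python
import math

def drift_inflated_prior(b_2025, se_2025, drift_sd):
    """Return (mean, sd) of a normal prior for a 2026 item difficulty.

    b_2025   : 2025 difficulty estimate in logits (hypothetical input)
    se_2025  : its standard error
    drift_sd : extra uncertainty allowed for 2025-to-2026 drift (assumed rule)
    """
    # Root-sum-of-squares: drift allowance is treated as independent of estimation error.
    prior_sd = math.sqrt(se_2025 ** 2 + drift_sd ** 2)
    return b_2025, prior_sd

# Illustrative values only; a larger drift allowance gives a weaker prior.
mean, sd = drift_inflated_prior(b_2025=0.2, se_2025=0.3, drift_sd=0.4)
print(mean, round(sd, 3))
```

Under this rule an item with no drift allowance keeps its 2025 standard error, and the prior flattens as the drift allowance grows, which matches the intent of using 2025 as soft context rather than a hard anchor.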

Output tables

outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/historical_prior_drift/common_item_drift_prior_eligibility.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/historical_prior_drift/common_item_drift_prior_summary.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/historical_prior_drift/restricted_2025_t3t4_reliability_context.csv

Premodelling audit memo

Rendered from the aggregate-only Markdown premodelling artifact.

2026 BOY premodelling audit: hierarchical subscores, Number Line policy, and accuracy-speed

Generated: 2026-06-14 08:57:14Z

This is a dependency-light, aggregate-only audit. It does not publish raw student identifiers or person-level score files.

Executive readout

Hierarchical subscores are justified as a modelling direction: standalone subtest evidence is uneven, so teacher-facing profiles should be shrunken/coherent with the global score rather than independent standalone IRT scores.
The first hierarchical Stan challenger should be H1_global_plus_subtest_deviations: global numeracy plus reportable subtest deviations, not the current nuisance testlet u residuals as subscores.
Number Line cutoff changes should be screened from raw coordinate-derived accuracy distributions first. This audit writes item-by-target ECDF/category-count tables for .80/.90, .85/.95, .90/.97, binary >=.95, and a 4-category .80/.90/.95 option.
Accuracy-speed remains shadow/QC-first. Timed D/trailing-zero already encodes reach/speed pressure, so RT must not be allowed to double-count speed in achievement bands without validation.

Hierarchical subscore readiness

year	subtest	keep_items	rel	band	global_r	posture
foundation	MQ1-20	19	0.602	weak	0.426	hierarchical_shrinkage_required; avoid standalone high-stakes subscore
foundation	MC0-20	50	0.927	strong	0.532	strong standalone signal; still prefer hierarchical coherence with global score
foundation	MNC0-20	24	0.881	strong	0.604	strong standalone signal; still prefer hierarchical coherence with global score
foundation	DMT10_2026	8	0.609	weak	0.43	hierarchical_shrinkage_required; avoid standalone high-stakes subscore
foundation	BNL0-20	10	0.674	weak	0.354	hierarchical_shrinkage_required; avoid standalone high-stakes subscore
year1	MC0-100	34	0.94	strong	0.694	strong standalone signal; still prefer hierarchical coherence with global score
year1	MNC0-100	22	0.891	strong	0.762	strong standalone signal; still prefer hierarchical coherence with global score
year1	AAMC	38	0.9	strong	0.729	strong standalone signal; still prefer hierarchical coherence with global score
year1	ASMC	25	0.841	moderate	0.617	hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic
year1	BNL0-100	13	0.727	moderate	0.548	hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic

Key implication: several subtests are not ideal standalone reporting scores, especially where reliability is weak/moderate or item counts are small. That is an argument *for* hierarchical shrinkage, not against subscores.

Current-policy subtest relationships

year	subtest_1	subtest_2	n	rho	band
foundation	MQ1-20	MC0-20	1005	0.405	moderate
foundation	MQ1-20	MNC0-20	1003	0.405	moderate
foundation	MQ1-20	DMT10_2026	1002	0.275	low
foundation	MQ1-20	BNL0-20	974	0.171	low
foundation	MC0-20	MNC0-20	1003	0.561	moderate
foundation	MC0-20	DMT10_2026	1002	0.283	low
foundation	MC0-20	BNL0-20	974	0.245	low
foundation	MNC0-20	DMT10_2026	1002	0.393	low
foundation	MNC0-20	BNL0-20	974	0.304	low
foundation	DMT10_2026	BNL0-20	974	0.302	low
year1	MC0-100	MNC0-100	1229	0.683	high
year1	MC0-100	AAMC	1227	0.595	moderate
year1	MC0-100	ASMC	1223	0.485	moderate
year1	MC0-100	BNL0-100	1178	0.468	moderate
year1	MNC0-100	AAMC	1227	0.671	high
year1	MNC0-100	ASMC	1223	0.561	moderate
year1	MNC0-100	BNL0-100	1178	0.504	moderate
year1	AAMC	ASMC	1223	0.595	moderate
year1	AAMC	BNL0-100	1178	0.469	moderate
year1	ASMC	BNL0-100	1178	0.379	low

Profile-deviation spread

year	subtest	n	sd_dev_z	p10	p90	%>
foundation	MQ1-20	1005	0.934	-1.04	1.11	22.4%
foundation	MC0-20	1005	0.842	-0.98	1.03	20.2%
foundation	MNC0-20	1003	0.787	-0.99	1.02	20.4%
foundation	DMT10_2026	1002	0.938	-1.21	1.15	27.3%
foundation	BNL0-20	974	0.998	-1.27	1.23	31.4%
year1	MC0-100	1229	0.737	-0.82	0.91	14.5%
year1	MNC0-100	1229	0.64	-0.74	0.8	10.5%
year1	AAMC	1227	0.688	-0.82	0.81	10.8%
year1	ASMC	1223	0.786	-0.99	0.97	19.1%
year1	BNL0-100	1178	0.902	-1.06	1.12	23.3%

Number Line cutoff policy premodelling

year	subtest	policy	items_ok	median_min_pct	median_top_pct	entropy	posture
foundation	BNL0-20	nl_80_90_95_4cat	9/10	15.6%	25.3%	0.955	higher-resolution_challenger; use_only_if_item_category_cells_are_stable
foundation	BNL0-20	nl_80_90_relaxed_3cat	9/10	20.6%	48.0%	0.938	relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets
foundation	BNL0-20	nl_85_95_current_3cat	10/10	18.2%	25.3%	0.93	benchmark_current_policy; keep as reference in all modelling
foundation	BNL0-20	nl_90_97_strict_3cat	10/10	13.6%	14.6%	0.873	strict_challenger; reject_if_top_category_sparse_or_validation_not_better
foundation	BNL0-20	nl_binary_95	10/10	25.3%	25.3%	0.815	modelable_if_cells_ok_but_loses_partial-credit_information
year1	BNL0-100	nl_80_90_95_4cat	13/13	19.4%	26.4%	0.989	higher-resolution_challenger; use_only_if_item_category_cells_are_stable
year1	BNL0-100	nl_80_90_relaxed_3cat	13/13	24.4%	48.7%	0.954	relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets
year1	BNL0-100	nl_85_95_current_3cat	13/13	26.4%	26.4%	0.98	benchmark_current_policy; keep as reference in all modelling
year1	BNL0-100	nl_90_97_strict_3cat	13/13	16.2%	16.2%	0.913	strict_challenger; reject_if_top_category_sparse_or_validation_not_better
year1	BNL0-100	nl_binary_95	13/13	26.4%	26.4%	0.833	modelable_if_cells_ok_but_loses_partial-credit_information

Interpretation rule: a policy can be *modelable* from cell counts but still not promotable. Promotion requires validation, risk-band movement, fairness/subgroup checks, and interpretability. Current .85/.95 remains the benchmark.
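
As a minimal illustration of how a cutoff policy turns a coordinate-derived accuracy value into an ordinal category, the sketch below uses the cutoffs named in the policy IDs. The dictionary name, the function names, and the normalised Shannon entropy are assumptions for illustration; the audit's own entropy definition is not stated in this memo.

```python
import math
from bisect import bisect_right
from collections import Counter

# Cutoffs read off the policy IDs; each cutoff is an inclusive lower bound
# (for example, the top category under .85/.95 is accuracy >= .95).
POLICY_CUTOFFS = {
    "nl_80_90_relaxed_3cat": [0.80, 0.90],
    "nl_85_95_current_3cat": [0.85, 0.95],
    "nl_90_97_strict_3cat": [0.90, 0.97],
    "nl_binary_95": [0.95],
    "nl_80_90_95_4cat": [0.80, 0.90, 0.95],
}

def categorise(accuracy, policy):
    """Return the ordinal category (0 = lowest) for one accuracy value."""
    return bisect_right(POLICY_CUTOFFS[policy], accuracy)

def category_shares(accuracies, policy):
    """Share of responses in each category for one item or target."""
    n_cat = len(POLICY_CUTOFFS[policy]) + 1
    counts = Counter(categorise(a, policy) for a in accuracies)
    return [counts.get(k, 0) / len(accuracies) for k in range(n_cat)]

def normalised_entropy(shares):
    """Shannon entropy divided by its maximum, so 1.0 means evenly used categories."""
    h = -sum(p * math.log(p) for p in shares if p > 0)
    return h / math.log(len(shares))

shares = category_shares([0.50, 0.90, 0.99, 0.96], "nl_85_95_current_3cat")
print(shares, normalised_entropy(shares))
```

A per-item version of `category_shares` is the kind of cell-count screen that can show a policy is modelable; it says nothing about whether the policy is promotable.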

Accuracy-speed / RT readiness

year	subtest	role	timed	obs_rt_miss	presented_miss	trailing	rt_p50	<1s	model_role	flags
foundation	BNL0-20	achievement_primary	False	0.00%	2.0%	0.0%	7	0.8%	nl_rt_context_only_initially_not_accuracy_speed_scoring	none_obvious_from_row_rt_audit
foundation	DMT10_2026	achievement_primary	False	0.00%	1.5%	0.0%	16	0.0%	untimed_or_other_context_only_initially	none_obvious_from_row_rt_audit
foundation	MC0-20	achievement_primary	True	0.00%	75.0%	74.0%	6	0.4%	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
foundation	MNC0-20	achievement_primary	True	0.00%	76.1%	75.1%	12	0.4%	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
foundation	MQ1-20	achievement_primary	True	0.00%	84.2%	83.3%	20	0.8%	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
foundation	STPM	shadow_speed_only	True		6.2%	5.2%	8	0.1%	shadow_speed_only_exclude_from_math_achievement	presented_row_rt_missing_or_negative_gt_5pct
year1	AAMC	achievement_primary	True	0.00%	80.2%	78.5%	9	0.5%	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1	ASMC	achievement_primary	True	0.00%	77.6%	75.6%	12	0.5%	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1	BNL0-100	achievement_primary	False	0.00%	3.8%	0.0%	5	0.7%	nl_rt_context_only_initially_not_accuracy_speed_scoring	none_obvious_from_row_rt_audit
year1	MC0-100	achievement_primary	True	0.00%	77.1%	76.0%	6	0.4%	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1	MNC0-100	achievement_primary	True	0.00%	72.9%	71.4%	11	0.5%	initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context	presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1	STPM	shadow_speed_only	True		6.4%	4.4%	6	0.1%	shadow_speed_only_exclude_from_math_achievement	presented_row_rt_missing_or_negative_gt_5pct

J2b-style rapid-row descriptive check

year	subtest	rapid_rate	rapid_acc	nonrapid_acc	delta
foundation	MC0-20	5.97%	0.636	0.919	-0.283
foundation	MNC0-20	3.96%	0.088	0.754	-0.666
foundation	MQ1-20	3.08%	0.054	0.656	-0.601
year1	AAMC	4.36%	0.184	0.803	-0.618
year1	ASMC	4.24%	0.132	0.627	-0.494
year1	MC0-100	5.27%	0.567	0.897	-0.331
year1	MNC0-100	4.03%	0.087	0.834	-0.747
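
The columns above can be reproduced from row-level data along the following lines. This is a sketch under stated assumptions: the rapid-response threshold used in the J2b-style check is not given in this memo, so `rapid_threshold_sec` is a hypothetical parameter the caller must supply, and the function name and row layout are illustrative.

```python
def rapid_row_summary(rows, rapid_threshold_sec):
    """Summarise accuracy for rapid vs non-rapid observed rows.

    rows                : iterable of (rt_sec, correct) pairs, correct in {0, 1}
    rapid_threshold_sec : rows faster than this count as rapid (hypothetical cutoff)
    """
    rows = list(rows)
    rapid = [c for rt, c in rows if rt < rapid_threshold_sec]
    nonrapid = [c for rt, c in rows if rt >= rapid_threshold_sec]
    rapid_acc = sum(rapid) / len(rapid) if rapid else None
    nonrapid_acc = sum(nonrapid) / len(nonrapid) if nonrapid else None
    delta = None
    if rapid_acc is not None and nonrapid_acc is not None:
        # Negative delta means rapid rows are less accurate, as in the table.
        delta = rapid_acc - nonrapid_acc
    return {
        "rapid_rate": len(rapid) / len(rows),
        "rapid_acc": rapid_acc,
        "nonrapid_acc": nonrapid_acc,
        "delta": delta,
    }

example = [(0.5, 0), (0.6, 1), (3.0, 1), (4.0, 1), (5.0, 0), (6.0, 1)]
print(rapid_row_summary(example, rapid_threshold_sec=1.0))
```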

Person-level speed/reach correlations with current-policy scores

year	subtest	metric	n	rho	note
foundation	STPM	median_item_rt_sec	1016	-0.437	rt_context_not_achievement_adjustment
foundation	STPM	n_reached_or_valid_count	1024	0.819	reach_count_is_partly_scoring_policy_for_timed_D
foundation	STPM	n_trailing_nonresponse_rows	1024	-0.771	reach_count_is_partly_scoring_policy_for_timed_D
foundation	MQ1-20	median_item_rt_sec	998	-0.462	rt_context_not_achievement_adjustment
foundation	MQ1-20	n_reached_or_valid_count	1006	0.711	reach_count_is_partly_scoring_policy_for_timed_D
foundation	MQ1-20	n_trailing_nonresponse_rows	1006	-0.656	reach_count_is_partly_scoring_policy_for_timed_D
foundation	MC0-20	median_item_rt_sec	995	-0.83	rt_context_not_achievement_adjustment
foundation	MC0-20	n_reached_or_valid_count	1005	0.932	reach_count_is_partly_scoring_policy_for_timed_D
foundation	MC0-20	n_trailing_nonresponse_rows	1005	-0.873	reach_count_is_partly_scoring_policy_for_timed_D
foundation	MNC0-20	median_item_rt_sec	993	-0.673	rt_context_not_achievement_adjustment
foundation	MNC0-20	n_reached_or_valid_count	1003	0.778	reach_count_is_partly_scoring_policy_for_timed_D
foundation	MNC0-20	n_trailing_nonresponse_rows	1003	-0.721	reach_count_is_partly_scoring_policy_for_timed_D
foundation	DMT10_2026	median_item_rt_sec	988	0.061	rt_context_not_achievement_adjustment
foundation	DMT10_2026	n_reached_or_valid_count	1002	0.206	coverage_or_valid_count_context_not_timed_D_speed
foundation	DMT10_2026	n_trailing_nonresponse_rows	1002		coverage_or_valid_count_context_not_timed_D_speed
foundation	BNL0-20	median_item_rt_sec	974	-0.009	rt_context_not_achievement_adjustment
foundation	BNL0-20	n_reached_or_valid_count	974	0.345	coverage_or_valid_count_context_not_timed_D_speed
foundation	BNL0-20	n_trailing_nonresponse_rows	974		coverage_or_valid_count_context_not_timed_D_speed
year1	STPM	median_item_rt_sec	1235	-0.432	rt_context_not_achievement_adjustment
year1	STPM	n_reached_or_valid_count	1256	0.821	reach_count_is_partly_scoring_policy_for_timed_D
year1	STPM	n_trailing_nonresponse_rows	1256	-0.719	reach_count_is_partly_scoring_policy_for_timed_D
year1	MC0-100	median_item_rt_sec	1221	-0.84	rt_context_not_achievement_adjustment
year1	MC0-100	n_reached_or_valid_count	1235	0.932	reach_count_is_partly_scoring_policy_for_timed_D
year1	MC0-100	n_trailing_nonresponse_rows	1235	-0.865	reach_count_is_partly_scoring_policy_for_timed_D
year1	MNC0-100	median_item_rt_sec	1212	-0.704	rt_context_not_achievement_adjustment
year1	MNC0-100	n_reached_or_valid_count	1229	0.816	reach_count_is_partly_scoring_policy_for_timed_D
year1	MNC0-100	n_trailing_nonresponse_rows	1229	-0.729	reach_count_is_partly_scoring_policy_for_timed_D
year1	AAMC	median_item_rt_sec	1205	-0.79	rt_context_not_achievement_adjustment
year1	AAMC	n_reached_or_valid_count	1227	0.872	reach_count_is_partly_scoring_policy_for_timed_D
year1	AAMC	n_trailing_nonresponse_rows	1227	-0.768	reach_count_is_partly_scoring_policy_for_timed_D
year1	ASMC	median_item_rt_sec	1199	-0.584	rt_context_not_achievement_adjustment
year1	ASMC	n_reached_or_valid_count	1223	0.708	reach_count_is_partly_scoring_policy_for_timed_D
year1	ASMC	n_trailing_nonresponse_rows	1223	-0.599	reach_count_is_partly_scoring_policy_for_timed_D
year1	BNL0-100	median_item_rt_sec	1178	-0.046	rt_context_not_achievement_adjustment
year1	BNL0-100	n_reached_or_valid_count	1178	0.225	coverage_or_valid_count_context_not_timed_D_speed
year1	BNL0-100	n_trailing_nonresponse_rows	1178		coverage_or_valid_count_context_not_timed_D_speed
foundation	STPM_vs_composite	score	1006	0.232	STPM_is_shadow_non_math_exclude_from_math_score
foundation	STPM_vs_composite	median_item_rt_sec	1004	-0.389	STPM_is_shadow_non_math_exclude_from_math_score
foundation	STPM_vs_composite	total_rt_sec	1004	-0.34	STPM_is_shadow_non_math_exclude_from_math_score
year1	STPM_vs_composite	score	1235	0.234	STPM_is_shadow_non_math_exclude_from_math_score
year1	STPM_vs_composite	median_item_rt_sec	1228	-0.387	STPM_is_shadow_non_math_exclude_from_math_score
year1	STPM_vs_composite	total_rt_sec	1228	-0.284	STPM_is_shadow_non_math_exclude_from_math_score

Reach/trailing correlations are partly mechanical under timed D/trailing-zero scoring. This is exactly why RT/tau should initially remain a shadow response-process layer rather than a direct achievement-band adjustment.

Recommended model ladders

Hierarchical global/subscore ladder

model_id	purpose	latent_structure	subscores	premodel_status	promotion_gate
H0_current_operational_candidate	existing global score anchor	one global theta + subtest/testlet residuals u	not teacher-facing; u is nuisance/local-dependence residual	already fitted for inclusive/hard-filtered/sensitivities	retain as anchor while subscore challengers are tested
H1_global_plus_subtest_deviations	coherent teacher-facing global score + subscores	global theta; subtest score = global theta + shrunken subtest deviation; no separate per-subtest nuisance residual initially	yes: report global, subtest posterior means/intervals, and relative deviation labels	recommended first Stan hierarchical subscore challenger	clean HMC, stable subscore posterior SDs, sensible shrinkage, better coherence than standalone subtest IRT, no harmful risk-band movement
H2_global_plus_NL_specific_deviation	target Year 1 BNL influence before full subtest expansion	global theta + Number Line-specific deviation/factor; optionally BNL residual fixed/omitted	global + NL profile only	recommended focused challenger if H1 is too broad or BNL remains unstable	keeps BNL contribution without weak BNL residual pathology; validates at least as well as H0
H3_correlated_subtest_thetas	diagnostic upper-bound profile model	one correlated theta per subtest; global score is derived composite	yes but global must be defined after fitting	diagnostic only until feasibility improves; mirt/TAM high-dimensional screens were resource-burdened	only proceed if H1/H2 insufficient and dimensions are stable/interpretable

Number Line policy ladder

policy_id	role	model_family	premodel_gate	promotion_gate
nl_85_95_current_3cat	benchmark/operational-compatible current policy	ordinal PCM/GPCM; categories 0: <.85, 1: .85 to <.95, 2: >=.95	must be included as reference in all screens	already lockable as NL2 unless challenger clearly improves validation/fairness/classification
nl_80_90_relaxed_3cat	cutoff sensitivity challenger	ordinal 3-category PCM/GPCM	cell counts and target distributions acceptable	less harmful hard-target penalisation plus equal/better validation and risk classification
nl_90_97_strict_3cat	strict challenger	ordinal 3-category PCM/GPCM	top category not too sparse item-by-item	only if validation gain offsets expected sparsity/precision loss
nl_binary_95	simple mastery-like sensitivity	binary Rasch/2PL screen	both classes present by item	unlikely to promote unless it improves decision validity despite information loss
nl_80_90_95_4cat	higher-resolution ordinal sensitivity	4-category PCM/GPCM	all item categories have stable counts; thresholds ordered/usable	improved validation/precision without sparse-category pathology
continuous_abs_error_logitnormal_or_beta	formal continuous challenger, not TAM/mirt-faithful	mixed response Stan: binary/non-NL accuracy + continuous bounded NL accuracy/error	raw distributions and coordinate calibration pass; proxy validation competitive	material validation/classification/fairness gain over NL2 and clean HMC/PPC

Accuracy-speed ladder

model_id	purpose	status	uses_for_score	gate
RT0_QC_manifest_speed_descriptives	data-quality, rapid-response, timing-unit, and admin/device checks	recommended before any scoring use	none	no severe RT missingness/unit anomalies in candidate families
RT1_selected_family_speed_shadow	selected timed-family tau/pace research with accuracy anchor protected	supported by prior J2b work; rerun on 2026 BOY candidate families if needed	shadow only	tau aligns with RT/rapid behaviour; theta/risk bands not changed operationally
RT2_hierarchical_tau_shadow	overall response pace + family residual pace, coherent with teacher profile idea	Stan skeleton exists (J3b hierarchical tau)	shadow only	clean HMC; no subgroup/admin artefact; no achievement-band changes
RT3_joint_global_subscore_accuracy_speed	future integrated model after H1 subscore and RT2 pace models are separately stable	not first next fit	research only until validation burden is met	must add information beyond D/trailing-zero and not double-count speed/reach

Decision gates / next actions

stream	next_action	must_check_before_fit	must_check_after_fit
hierarchical_subscores	fit H1 Stan global+subtest-deviation model on hard-filtered operational frame	subtest score reliability/correlation/readiness table; avoid writing nuisance residuals as teacher subscores	HMC diagnostics, posterior SD by subscore, shrinkage size, global score movement, risk-band movement, subgroup movement, profile interpretability
number_line_policy	run frequentist ordinal cutoff screens using audited candidate policies; keep current .85/.95 as reference	item-by-target category counts and ECDF; reject policies with sparse/empty categories before Stan	threshold behaviour, reliability, score/risk movement, validation/fairness; continuous challenger only after proxy screen
accuracy_speed_joint	treat RT as QC/shadow; candidate families are timed non-NL achievement subtests only at first	RT missingness, row RT quantiles, rapid-row accuracy, STPM exclusion, D/trailing-zero double-count risk	tau construct validity, rapid effect direction, theta robustness, admin/subgroup artefacts, no operational risk-band changes

Written aggregate artifacts

tables/premodeling/2026_boy_hierarchical_subscore_readiness.csv
tables/premodeling/2026_boy_subtest_score_correlations.csv
tables/premodeling/2026_boy_subtest_composite_correlations.csv
tables/premodeling/2026_boy_subtest_profile_deviation_summary.csv
tables/premodeling/2026_boy_nl_accuracy_distribution_by_item.csv
tables/premodeling/2026_boy_nl_policy_item_cell_counts.csv
tables/premodeling/2026_boy_nl_policy_overall_summary.csv
tables/premodeling/2026_boy_rt_readiness_by_subtest.csv
tables/premodeling/2026_boy_j2b_style_rapid_row_audit.csv
tables/premodeling/2026_boy_speed_accuracy_correlations.csv
model-ladder and decision-gate CSVs in the same folder
TAM cutoff screen runner: analysis/modeling/v2_response_process_program/77_2026_boy_premodel_tam_cutoff_screens.R (requires TAM; intended for cisbox/AWS)

Next-model specification note

Concrete model ladders and gate checks for the next round.

2026 BOY next-model specification notes

Status: pre-fit design note generated after aggregate premodelling audit. Do not treat as an operational scoring decision.

1. Hierarchical global + subscore model

Goal

Produce a coherent global score and teacher-facing subtest subscores, avoiding unrelated standalone subtest IRT scales.

First challenger: `H1_global_plus_subtest_deviations`

For student p and subtest/domain s:

g_p ~ broad numeracy level
z_ps ~ standard normal residual profile component
delta_ps = sigma_delta_s * z_ps, centered across subtests within student
theta_ps = g_p + delta_ps

Binary non-NL item j (timed or untimed) in subtest s[j]:

y_pj ~ Bernoulli_logit(theta_p,s[j] - b_j)

Ordinal Number Line item j under a PCM-style policy:

eta_1 = 0
eta_k = eta_{k-1} + theta_p,s[j] - (b_j + step_j,k-1)
y_pj ~ categorical_logit(eta)
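
To make the PCM-style recursion concrete, the following sketch computes the category probabilities implied by eta for a single student-item pair. It is an illustration in Python, not the Stan program; the function name and the example parameter values are assumptions.

```python
import math

def pcm_category_probs(theta, b, steps):
    """Category probabilities for one student-item pair under the recursion above.

    theta : the student's subtest-level ability theta_p,s[j]
    b     : item location b_j
    steps : step offsets step_j,1 .. step_j,K-1, one per category boundary
    """
    eta = [0.0]  # eta_1 = 0 for the lowest category
    for step in steps:
        eta.append(eta[-1] + theta - (b + step))
    # Softmax over eta, subtracting the maximum first for numerical stability.
    m = max(eta)
    weights = [math.exp(e - m) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative values: a 3-category item seen by a lower and a higher ability student.
print(pcm_category_probs(theta=-1.0, b=0.0, steps=[-0.5, 0.5]))
print(pcm_category_probs(theta=1.0, b=0.0, steps=[-0.5, 0.5]))
```

Applying softmax to eta is what `categorical_logit` does internally, so the returned list corresponds to the probabilities of categories 1..K for that response.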

Identification/regularisation:

Center each student's delta_ps across subtests so g_p remains the broad level.
Center item difficulties within year/model as in current Stan practice.
Start without a separate nuisance testlet u for every subtest; otherwise the reportable subtest deviation and nuisance residual compete for the same signal.
Keep posterior intervals for subtest deviations; do not report profile differences smaller than measurement uncertainty.
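
The centring constraint can be illustrated for a single student as follows. This sketch assumes the deviations are scaled first and then mean-centred across subtests, which is one reading of "centered across subtests within student"; the function name and numeric values are hypothetical.

```python
def student_subtest_thetas(g, z, sigma_delta):
    """Return (theta_ps, delta_ps) for one student.

    g           : broad numeracy level g_p
    z           : standard-normal profile components z_ps, one per subtest
    sigma_delta : subtest deviation scales sigma_delta_s, same order as z
    """
    raw = [s * zi for s, zi in zip(sigma_delta, z)]
    mean_raw = sum(raw) / len(raw)
    # Centre within student so the deviations sum to zero and g stays the broad level.
    delta = [r - mean_raw for r in raw]
    theta = [g + d for d in delta]
    return theta, delta

theta, delta = student_subtest_thetas(g=0.4, z=[1.0, -0.5, 0.2], sigma_delta=[0.5, 0.8, 0.3])
print(theta, sum(delta))
```

Because the deviations sum to zero, the mean of a student's subtest thetas equals g_p, so a relative-strength label describes the profile shape rather than overall level.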

Primary post-fit checks:

1. HMC: 0 divergences, no max-treedepth hits, Rhat/ESS acceptable for g, theta_ps, sigma_delta_s, item parameters.
2. Global movement vs hard-filtered H0: Spearman, median/p95 percentile shift, <15 and 15-35 risk-band movement.
3. Subscore quality: posterior SD by subtest, shrinkage size, profile-deviation stability.
4. Teacher-facing coherence: subscore intervals and relative-strength labels agree with observed subtest evidence without overclaiming.
5. Subgroup/admin movement: no adverse subgroup artefacts.

2. Year 1 BNL residual surgical sensitivity

Keep BNL0-100 items in the global/hierarchical score but do not give BNL an extra nuisance residual variance if the current sigma_u[BNL0-100] remains weakly identified.

Data-side option:

active_testlet_idx[BNL0-100] = 0
active_testlet_idx[other_subtests] = 1..K_active

Likelihood option:

resid = 0 if active_testlet_idx == 0
resid = sigma_u[k] * u_z[p,k] otherwise
theta_eff = theta + resid

This tests whether the issue is the BNL residual component, not the BNL items themselves.
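
The switch described in the likelihood option can be sketched for one student and one item as below. It is plain Python rather than Stan, and the function and argument names are illustrative; it assumes index 0 means "residual switched off" and indices 1..K_active are 1-based positions into the active testlet arrays.

```python
def effective_theta(theta, active_testlet_idx, sigma_u, u_z_row):
    """Return theta_eff for one student on one item.

    theta              : the student's global theta
    active_testlet_idx : 0 for items whose testlet residual is fixed at zero
                         (BNL0-100 here), otherwise a 1-based active testlet index
    sigma_u            : residual scales for the active testlets
    u_z_row            : the student's standardised residuals, one per active testlet
    """
    if active_testlet_idx == 0:
        return theta  # residual fixed at zero; the item still informs theta
    k = active_testlet_idx - 1  # convert the 1-based index to a Python position
    return theta + sigma_u[k] * u_z_row[k]

# BNL0-100 item: no residual. Second active testlet: residual of 0.4 * -1.0 is added.
print(effective_theta(0.5, 0, [0.3, 0.4], [1.0, -1.0]))
print(effective_theta(0.5, 2, [0.3, 0.4], [1.0, -1.0]))
```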

3. Number Line policy ladder

Premodelling audit outputs:

tables/premodeling/2026_boy_nl_accuracy_distribution_by_item.csv
tables/premodeling/2026_boy_nl_policy_item_cell_counts.csv
tables/premodeling/2026_boy_nl_policy_overall_summary.csv

Frequentist screens before Stan:

nl_80_90_relaxed_3cat
nl_85_95_current_3cat
nl_90_97_strict_3cat
nl_binary_95
nl_80_90_95_4cat

Promotion burden:

current .85/.95 remains the reference;
challenger must have stable cells/thresholds;
challenger must improve or match validation/risk classification;
challenger must not cause unacceptable subgroup or risk-band movement;
continuous NL requires Stan or another mixed continuous-response framework, not TAM/mirt alone.

Continuous challenger sketch:

accuracy = 1 - absolute_error / scale_range
accuracy_squeezed = clamp/Smithson-Verkuilen transform into (0,1)
logit(mu_pj) = alpha_j + theta_p,s[j]
accuracy_squeezed_pj ~ Beta(mu_pj * phi_j, (1 - mu_pj) * phi_j)
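
A small numeric sketch of the continuous challenger follows. It assumes the squeeze is the Smithson-Verkuilen form (y * (n - 1) + 0.5) / n with n taken as the number of observations, and it uses the mean-precision Beta parameterisation from the sketch above. Function names and example values are hypothetical.

```python
import math

def squeezed_accuracy(abs_error, scale_range, n):
    """Accuracy in [0, 1] moved strictly inside (0, 1) so a Beta density is defined."""
    accuracy = 1.0 - abs_error / scale_range
    accuracy = min(max(accuracy, 0.0), 1.0)  # clamp errors larger than the scale
    return (accuracy * (n - 1) + 0.5) / n

def beta_shapes(alpha_j, theta, phi_j):
    """Beta shape parameters from logit(mu) = alpha_j + theta and precision phi_j."""
    mu = 1.0 / (1.0 + math.exp(-(alpha_j + theta)))
    return mu * phi_j, (1.0 - mu) * phi_j

def beta_logpdf(y, a, b):
    """Log density of Beta(a, b) at y in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)

# Illustrative response: 8 units off target on a 0-100 line, 1000 observations.
y = squeezed_accuracy(abs_error=8.0, scale_range=100.0, n=1000)
a, b = beta_shapes(alpha_j=1.5, theta=0.3, phi_j=12.0)
print(y, a, b, beta_logpdf(y, a, b))
```

The squeeze matters because perfect placements give accuracy exactly 1, where the Beta log density is undefined for most shape values.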

Optional signed-error diagnostic, not first scoring model:

signed_error_scaled_pj ~ Normal(target_bias_j + method_bias_family + ability_slope_j * theta, sigma_j)

4. Accuracy-speed joint modelling ladder

Operational posture: RT is shadow/QC first. Timed D/trailing-zero already encodes reach/time-pressure, so response time can double-count speed if added naively.

Initial 2026 BOY data rule:

Achievement accuracy model may continue to use D/trailing-zero for timed non-NL.
RT likelihood should use observed/reached item rows only.
Trailing unreached rows contribute to D accuracy/reach context, not item-level logRT.
STPM remains shadow/non-math and is excluded from math achievement.
Number Line RT is context-only initially.

Candidate shadow model:

y_pj ~ Bernoulli_logit(theta_p - b_j + gamma_family * rapid_pj)
logRT_pj ~ Normal(beta0 + beta_j - tau_p,family[j], sigma_rt_family)

Hierarchical pace extension:

tau_p,f = tau_overall_p + tau_residual_p,f
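
The pace decomposition can be illustrated numerically as below. The sketch assumes response time is lognormal, that is, its logarithm is normally distributed around beta0 + beta_j - tau_p,f, so a larger tau means faster responding. Function names and values are hypothetical, and this is not the J3b Stan skeleton.

```python
import math

def log_rt_logpdf(log_rt, beta0, beta_j, tau_overall, tau_residual, sigma_rt):
    """Normal log density of an observed log response time."""
    tau = tau_overall + tau_residual  # tau_p,f = overall pace + family residual pace
    mu = beta0 + beta_j - tau
    z = (log_rt - mu) / sigma_rt
    return -0.5 * z * z - math.log(sigma_rt) - 0.5 * math.log(2.0 * math.pi)

def median_rt_seconds(beta0, beta_j, tau_overall, tau_residual):
    """Median response time implied by the lognormal assumption."""
    return math.exp(beta0 + beta_j - (tau_overall + tau_residual))

# A student with higher overall pace has a shorter implied median RT on the same item.
print(median_rt_seconds(2.0, 0.0, 0.0, 0.0))
print(median_rt_seconds(2.0, 0.0, 0.5, 0.0))
```

Only observed or reached rows would enter this density; trailing unreached rows have no response time to evaluate.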

Pre-fit checks already written:

tables/premodeling/2026_boy_rt_readiness_by_subtest.csv
tables/premodeling/2026_boy_j2b_style_rapid_row_audit.csv
tables/premodeling/2026_boy_speed_accuracy_correlations.csv

Do not use RT/tau to alter risk bands unless later evidence shows robust validation gain, no subgroup/admin artefact, and added information beyond D/reach.

Full modelling review memo

Rendered from the saved Markdown decision artifact.

2026 BOY operational accuracy + Number Line candidate — modelled job review

Review timestamp: 2026-06-14 UTC

Compute / sync status

All AWS model jobs are complete. There are no active EC2 instances matching the 2026 BOY operational Number Line model tags, no active cisbox rsync sessions, and the local sensitivity monitor was stopped after all six sensitivity .done markers were present.

The final outstanding run (year1_no_BNL0_100) is synced, checksum-verified, and recovered from the known no-NL post-processing failure; its EC2 instance has been terminated.

Reviewed jobs

The review covers 10 Stan jobs:

1. Foundation inclusive baseline.
2. Year 1 inclusive baseline.
3. Foundation hard-item-filtered baseline.
4. Year 1 hard-item-filtered baseline.
5. Foundation sensitivity: no DMT10_2026.
6. Foundation sensitivity: no MQ1-20 and no DMT10_2026.
7. Foundation sensitivity: no BNL0-20.
8. Year 1 sensitivity: no MC0-100.
9. Year 1 sensitivity: no BNL0-100.
10. Year 1 sensitivity: core model with no MC and no NL.

Source output base:

[internal artifact path redacted]

Local review artifacts:

outputs/runs/irt-2026-boy-subtest-audit/latest/reports/model_review/stan_review_summary.md
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_job_diagnostic_summary.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_score_movement_comparisons.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_testlet_sigma_summary_long.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_item_difficulty_extreme_or_diagnostic_flags.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_u_residual_diagnostic_summary.csv

Completion and sampler diagnostics

All 10 jobs have successful MCMC sampling evidence:

0 divergences in every job.
0 max-treedepth hits in every job.
Minimum EBFMI across jobs: 0.568 (year1_core_no_MC_no_NL), acceptable.
Student theta summaries are clean: max theta Rhat <= 1.006 and theta ESS_bulk is comfortably high in all jobs.
Item difficulty summaries are clean by diagnostics: no item difficulty has Rhat > 1.01 or ESS_bulk < 400.

Three no-NL-style jobs exited with Stan runner exit code 1. The cause was the known post-processing bug for empty/missing NL lookup files, not a sampler failure:

foundation_no_BNL0_20
year1_no_BNL0_100
year1_core_no_MC_no_NL

All three were recovered from QC summaries and now have final score, item, testlet, and fit-readout files.

Job-level diagnostic table

job	exit	postprocess	verify	min EBFMI	theta max Rhat / min ESS	testlet max Rhat / min ESS	note
Foundation inclusive	0	completed	155/155	0.705	1.004 / 5066	1.006 / 1173	clean
Year 1 inclusive	0	completed	155/155	0.614	1.004 / 941	1.023 / 109	weak `BNL0-100` testlet sigma
Foundation hard-filtered	0	completed	1955/1955	0.677	1.003 / 4051	1.003 / 1081	clean
Year 1 hard-filtered	0	completed	1955/1955	0.646	1.006 / 1504	1.068 / 78	weak `BNL0-100` testlet sigma
Foundation no `DMT10_2026`	0	completed	2104/2104	0.694	1.003 / 3337	1.009 / 482	clean
Foundation no `MQ1-20`/no `DMT10_2026`	0	completed	2104/2104	0.727	1.002 / 6115	1.007 / 585	clean
Foundation no `BNL0-20`	1	recovered	2098/2098	0.667	1.002 / 4894	1.004 / 1221	sampling clean; postprocess recovered
Year 1 no `MC0-100`	0	completed	2104/2104	0.647	1.002 / 4875	1.004 / 670	clean
Year 1 no `BNL0-100`	1	recovered	2098/2098	0.598	1.003 / 3313	1.003 / 1668	sampling clean; postprocess recovered
Year 1 no MC/no NL	1	recovered	2098/2098	0.568	1.002 / 4612	1.003 / 1925	sampling clean; postprocess recovered

Main diagnostic finding

The global Year 1 baseline is usable from a sampler perspective, but the BNL0-100 testlet residual scale is weakly identified:

Inclusive Year 1 BNL0-100 sigma: Rhat ~1.023, ESS_bulk ~109.
Hard-filtered Year 1 BNL0-100 sigma: Rhat ~1.068, ESS_bulk ~78.

This issue is local to the BNL0-100 residual/testlet component. It does not show up as divergent transitions, treedepth failures, poor theta mixing, or item-difficulty non-convergence. It does show up in the latent residuals for the same component: in the hard-filtered Year 1 run, u[,5] corresponds to BNL0-100, and 1193/1221 residual terms had Rhat > 1.01, with max Rhat ~1.026. The likely interpretation is that the residual BNL0-100 testlet variance is near a boundary/small value and is hard for the sampler to estimate, while the BNL0-100 items themselves carry substantial global-theta information.

Auxiliary `u` residual diagnostic

job	testlet	residual terms	Rhat > 1.01	ESS < 400	max Rhat	min ESS	interpretation
Year 1 inclusive	`BNL0-100`	1221	0	3	1.009	320	minor low-ESS nuisance terms
Year 1 hard-filtered	`BNL0-100`	1221	1193	2	1.026	279	broad residual-component mixing issue tied to BNL testlet

No other job/testlet had u residual terms with Rhat > 1.01 or ESS_bulk < 400. This reinforces that the caveat is localized to Year 1 BNL0-100 dependence modelling, not to the global theta score or item difficulty estimates.
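
The flagging rule used in this review (Rhat > 1.01 or ESS_bulk < 400) can be expressed as a small check. The function name and dictionary keys are illustrative; the numeric inputs are the worst-case testlet values from the job-level diagnostic table.

```python
def flag_parameters(summaries, rhat_max=1.01, ess_min=400):
    """Return {name: [reasons]} for summaries breaching either threshold.

    summaries : dict mapping a label to (rhat, ess_bulk)
    """
    flagged = {}
    for name, (rhat, ess_bulk) in summaries.items():
        reasons = []
        if rhat > rhat_max:
            reasons.append("rhat")
        if ess_bulk < ess_min:
            reasons.append("ess_bulk")
        if reasons:
            flagged[name] = reasons
    return flagged

# Testlet max Rhat / min ESS per job, taken from the job-level table.
testlet_worst = {
    "foundation_hard_filtered": (1.003, 1081),
    "foundation_no_DMT10_2026": (1.009, 482),
    "year1_inclusive": (1.023, 109),
    "year1_hard_filtered": (1.068, 78),
}
print(flag_parameters(testlet_worst))
```

On these inputs only the two Year 1 baselines are flagged, which matches the notes column of the job-level table.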

Hard-filtered vs inclusive baseline

The hard-item filter removes the 70 predeclared no-information items and has negligible impact on student ranking/risk classification.

comparison	n	Spearman	median abs percentile shift	p95 shift	exact 3-band agreement	very-low Jaccard	low+very-low Jaccard	moved out/in, very-low	moved out/in, low+very-low
Foundation inclusive vs hard-filtered	997	1.000	0.30 pp	1.40 pp	99.0%	0.974	0.983	2 / 2	3 / 3
Year 1 inclusive vs hard-filtered	1221	0.999	0.74 pp	2.62 pp	98.5%	0.968	0.972	3 / 3	6 / 6

Conclusion: hard-item-filtered should be the working operational baseline. The inclusive runs are useful historical evidence but should not be promoted over the filtered version.
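
For readers who want to recompute the movement columns from two matched score vectors, a minimal sketch follows. It assumes percentiles are mid-rank percentiles within the matched sample and that the risk bands are below the 15th percentile (very low) and 15th to below 35th (low); the review's exact percentile convention is not stated here, and all function names are illustrative.

```python
import math
import statistics

def percentile_ranks(scores):
    """Mid-rank percentiles in (0, 100); tied scores share their average rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # 1-based average rank of the tie group
        i = j + 1
    return [100.0 * (r - 0.5) / n for r in ranks]

def risk_band(pct):
    return "very_low" if pct < 15 else "low" if pct < 35 else "not_low"

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def movement_summary(base_scores, alt_scores):
    """Compare two score vectors for the same students, in the same order."""
    base_pct, alt_pct = percentile_ranks(base_scores), percentile_ranks(alt_scores)
    shifts = sorted(abs(a - b) for a, b in zip(alt_pct, base_pct))
    base_band = [risk_band(p) for p in base_pct]
    alt_band = [risk_band(p) for p in alt_pct]
    base_vl = {i for i, b in enumerate(base_band) if b == "very_low"}
    alt_vl = {i for i, b in enumerate(alt_band) if b == "very_low"}
    union = base_vl | alt_vl
    return {
        # Spearman is the Pearson correlation of ranks; percentiles are a linear map of ranks.
        "spearman": pearson(base_pct, alt_pct),
        "median_abs_percentile_shift": statistics.median(shifts),
        "p95_abs_percentile_shift": shifts[math.ceil(0.95 * len(shifts)) - 1],
        "exact_3band_agreement": sum(a == b for a, b in zip(base_band, alt_band)) / len(base_band),
        "very_low_jaccard": len(base_vl & alt_vl) / len(union) if union else 1.0,
        "moved_out_in_very_low": (len(base_vl - alt_vl), len(alt_vl - base_vl)),
    }

print(movement_summary([0.1, 0.5, -1.2, 0.9, -0.3], [0.2, 0.4, -1.0, 1.1, -0.6]))
```

Because both vectors are ranked within the same matched sample, the number of students moving out of a band equals the number moving in when there are no ties, as in the paired out/in counts of the comparison table.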

Sensitivity findings vs hard-filtered baseline

Foundation

sensitivity	n	Spearman	median shift	p95 shift	3-band agreement	very-low Jaccard	low+very-low Jaccard	interpretation
no `DMT10_2026`	997	0.935	5.72 pp	21.00 pp	85.2%	0.703	0.758	DMT contributes materially; removal is not classification-stable.
no `MQ1-20` and no `DMT10_2026`	995	0.825	9.95 pp	35.68 pp	76.1%	0.520	0.642	Removing both early quantity/decomposition content substantially changes the score.
no `BNL0-20`	997	0.865	8.02 pp	32.32 pp	77.5%	0.505	0.661	Foundation Number Line is highly influential and improves precision.

Foundation interpretation:

The hard-filtered Foundation baseline is sampler-clean.
BNL0-20 is important to the global score; dropping it causes large risk-band movement.
DMT10_2026 also matters; despite being untimed, it contributes meaningfully to the Foundation global trait.
Foundation supports retaining the full hard-filtered operational accuracy + NL candidate, subject to external validation and reporting review.

Year 1

sensitivity	n	Spearman	median shift	p95 shift	3-band agreement	very-low Jaccard	low+very-low Jaccard	interpretation
no `MC0-100`	1211	0.993	1.82 pp	6.77 pp	96.2%	0.905	0.936	Removing MC has modest impact; MC is not the main source of instability.
no `BNL0-100`	1221	0.768	11.88 pp	39.31 pp	70.3%	0.402	0.547	Removing BNL radically changes rankings/risk bands and greatly increases uncertainty.
no MC/no NL	1211	0.739	13.46 pp	40.42 pp	68.5%	0.382	0.519	Core-only score differs substantially from the full hard-filtered candidate.

Year 1 interpretation:

MC0-100 is not a major concern; the no-MC sensitivity remains close to the hard-filtered baseline.
BNL0-100 is the key decision point. It is highly influential for Year 1 risk classification and precision.
The weak BNL0-100 sigma diagnostic should not be read as evidence to drop BNL. The no-BNL sensitivity shows the opposite: dropping it materially changes the construct coverage and low-achievement identification.
The most defensible reading is: retain BNL0-100 as a strong candidate, but resolve/report the localized testlet-sigma issue before final operational promotion.
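The Spearman, median-shift, and p95-shift columns in the sensitivity tables above summarise how far each student's within-cohort percentile moves between two runs. The sketch below shows one way such a summary could be computed. The mid-rank percentile definition, the linear-interpolation quantile, and all function names are assumptions for illustration; the report does not state which conventions were used.

```python
from math import sqrt
from statistics import median


def percentile_ranks(scores):
    """Mid-rank percentile (0-100) of each score within its own cohort."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average 1-based rank across ties
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        i = j + 1
    return [100.0 * (r - 0.5) / n for r in ranks]


def quantile(values, q):
    """Linear-interpolation quantile for q in [0, 1]."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def movement_summary(theta_base, theta_alt):
    """Compare matched students' scores from a baseline and a sensitivity run."""
    p_base = percentile_ranks(theta_base)
    p_alt = percentile_ranks(theta_alt)
    shifts = [abs(a - b) for a, b in zip(p_base, p_alt)]
    return {
        # Spearman is the Pearson correlation of ranks; percentiles are a
        # linear function of ranks, so the correlation is unchanged.
        "spearman": pearson(p_base, p_alt),
        "median_shift_pp": median(shifts),
        "p95_shift_pp": quantile(shifts, 0.95),
    }


# Toy matched scores for five students (assumed values, not study data).
summary = movement_summary(
    theta_base=[-1.2, -0.4, 0.0, 0.7, 1.5],
    theta_alt=[-1.0, 0.1, -0.3, 0.9, 1.4],
)
print(summary)
```

Shifts are expressed in percentage points of within-cohort rank, which matches the "pp" unit used in the tables.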

Frequentist model-rung context

Frequentist pre-screening remains consistent with the Stan review:

TAM 1D PCM was the only clean frequentist baseline across both years.
Foundation reliability ~0.914.
Year 1 reliability ~0.952.
mirt correlated subtest factors failed due to quadrature burden (more than 20,000 quadrature points).
mirt flexible/bifactor screens did not provide stable enough evidence to justify multidimensional Stan challengers.
Refined frequentist sensitivities also flagged Foundation no-BNL and Year 1 no-BNL/no-MC variants as the main score-movement cases.

Therefore, the current Stan evidence should be interpreted within a 1D+testlet operational-candidate frame, not as support for immediate multidimensional/bifactor escalation.

Recommendations

1. Promote the hard-item-filtered model frame as the working baseline for final reporting comparisons. The hard filter removes no-information items with near-zero impact on student scores/risk bands.

2. Foundation: keep BNL0-20 and DMT10_2026 in the operational candidate. Both materially affect risk identification; the Foundation hard-filtered Stan run is diagnostically clean.

3. Year 1: do not drop BNL0-100 based on the sigma diagnostic alone. Removing it causes major movement and loss of precision. Treat the issue as a localized residual-scale estimation problem, not a failed global score.

4. Run or design one surgical Year 1 sensitivity if final promotion requires clearing the sigma caveat: keep BNL0-100 items in the global score but omit/fix the BNL0-100 testlet residual scale. This directly tests whether the weak sigma parameter is harmless. This is more informative than a no-BNL model, which changes both construct coverage and precision.
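To make the contrast in recommendation 4 concrete, the sketch below computes category probabilities for a single person-item pair with and without a testlet residual. It assumes a partial credit model in which the residual `u` is added to the global theta for items in the testlet; the actual Stan parameterisation is not given in this review, so the function name, the step difficulties, and the residual value are all hypothetical.

```python
import math


def pcm_category_probs(theta, u_testlet, step_difficulties):
    """Partial credit model category probabilities for one person-item pair.

    theta             : global ability
    u_testlet         : person-specific testlet residual (0.0 removes it)
    step_difficulties : one step parameter per category boundary
    """
    eta = theta + u_testlet  # ASSUMED additive residual on the logit scale
    logits = [0.0]           # category 0 is the reference
    for delta in step_difficulties:
        logits.append(logits[-1] + (eta - delta))
    peak = max(logits)       # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


steps = [-0.5, 0.4]  # hypothetical three-category BNL0-100 item

# Testlet model: the item informs theta and also carries a residual.
with_residual = pcm_category_probs(theta=0.2, u_testlet=0.3, step_difficulties=steps)

# Residual-zero sensitivity: the item is retained, the residual is fixed at 0.
residual_zero = pcm_category_probs(theta=0.2, u_testlet=0.0, step_difficulties=steps)

print(with_residual)
print(residual_zero)
```

In the residual-zero variant the item still contributes to theta through the same step parameters, whereas a no-BNL model would remove the item's likelihood contribution altogether.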

5. Complete external validation and subgroup movement checks before final operational lock-in. Compare hard-filtered baseline and key sensitivities against PAT/teacher outcomes and demographic/school subgroup stability, with priority on the <15th and 15th–35th percentile bands.

6. Update the audit/report package. Add sections for item eligibility, hard-filtered vs inclusive comparison, frequentist model rungs, Stan sensitivity results, and the Year 1 BNL0-100 decision caveat.

Proposed immediate next steps

1. Add the generated model-review tables to the unified audit HTML/report.

2. Build a final score-movement table with student-level risk-band transitions for the hard baseline vs the three most important sensitivity contrasts:

Foundation no BNL0-20.
Year 1 no BNL0-100.
Year 1 no MC0-100.

3. Run outcome validation comparisons for the hard baseline and sensitivity variants.

4. Review Year 1 BNL0-100 item-level diagnostics:

coordinate audit;
empirical item curves by theta bin;
target-value distribution;
ordinal threshold behavior;
response category sparsity.

5. Decide whether to run the surgical Year 1 BNL-included/no-BNL-testlet-residual Stan sensitivity.

6. Draft the operational recommendation:

hard-filtered baseline as default candidate;
Foundation BNL retained;
Year 1 BNL retained as candidate pending sigma caveat resolution and validation;
MC0-100 not a major exclusion pressure.

