Global modelling

Builds, diagnostics, sensitivity tests

Review of inclusive, hard-item-filtered, and targeted sensitivity Stan jobs for the 2026 BOY operational accuracy + Number Line candidate.

10 modelled jobs reviewed
0 divergences / treedepth hits
Y1 NL: BNL0-100 residual caveat
Operational status: Hard-item-filtered is the working 2026 candidate baseline. M0/current remains live until explicit promotion. Year 1 BNL0-100 needs a targeted residual/testlet sensitivity test before final lock-in.

Premodelling audit: subscores, Number Line policy, speed

New aggregate audit work documents the evidence base needed before fitting hierarchical global+subscore models, Number Line cutoff challengers, or accuracy-response-time models.

Hierarchical subscores

H1

Recommended next Stan challenger: global numeracy plus shrunken subtest deviations. This is preferred over unrelated standalone subscores.

Number Line policy

5

Cutoff policies are now cell-count audited: relaxed, current, strict, binary, and 4-category ordinal options.

Speed / RT

Shadow

RT remains QC and response-process context. Timed D already encodes reach/time pressure, so speed should not alter live bands yet.

Subscore readiness

Standalone subtest evidence is uneven; weaker/moderate subscores are the main reason to use hierarchical shrinkage.

year level | test subgroup | n items kept (hard filter) | standalone EAP reliability or alpha proxy | reliability band | Spearman with other-subtest composite | hierarchical subscore posture | premodel risk flags
foundation | MQ1-20 | 19 | 0.6021 | weak | 0.4256 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | weak_standalone_reliability;sparse_nonconstant_items_retained
foundation | MC0-20 | 50 | 0.9269 | strong | 0.5315 | strong standalone signal; still prefer hierarchical coherence with global score | sparse_nonconstant_items_retained
foundation | MNC0-20 | 24 | 0.8813 | strong | 0.6039 | strong standalone signal; still prefer hierarchical coherence with global score | sparse_nonconstant_items_retained
foundation | DMT10_2026 | 8 | 0.6093 | weak | 0.4302 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | weak_standalone_reliability;few_calibration_items
foundation | BNL0-20 | 10 | 0.6743 | weak | 0.3535 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | weak_standalone_reliability;number_line_policy_sensitive
year1 | MC0-100 | 34 | 0.9404 | strong | 0.6945 | strong standalone signal; still prefer hierarchical coherence with global score | sparse_nonconstant_items_retained
year1 | MNC0-100 | 22 | 0.8912 | strong | 0.7621 | strong standalone signal; still prefer hierarchical coherence with global score | sparse_nonconstant_items_retained
year1 | AAMC | 38 | 0.9004 | strong | 0.7288 | strong standalone signal; still prefer hierarchical coherence with global score | sparse_nonconstant_items_retained
year1 | ASMC | 25 | 0.8415 | moderate | 0.6173 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic | moderate_reliability;floor_rate_ge_10pct;sparse_nonconstant_items_retained
year1 | BNL0-100 | 13 | 0.7272 | moderate | 0.5481 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic | moderate_reliability;number_line_policy_sensitive
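
To illustrate why weak standalone reliability argues for shrinkage, the sketch below applies classical reliability-weighted (Kelley-style) shrinkage to a subtest score, pulling it toward the global score. It is not the H1 Stan model, which would estimate the amount of pooling from the data. The function name and the example z-scores are hypothetical; only the two reliabilities are taken from the readiness table.

```python
def shrink_toward_global(subtest_z: float, global_z: float, reliability: float) -> float:
    """Pull a subtest z-score toward the global z-score.

    The raw deviation (subtest minus global) is multiplied by the subtest's
    reliability, so unreliable subtests keep less of their deviation.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    raw_deviation = subtest_z - global_z
    return global_z + reliability * raw_deviation


# Hypothetical student: global z = 0.0, observed subtest z = -1.0 on both subtests.
mq_shrunk = shrink_toward_global(-1.0, 0.0, 0.6021)  # foundation MQ1-20 (weak band)
mc_shrunk = shrink_toward_global(-1.0, 0.0, 0.9269)  # foundation MC0-20 (strong band)
print(round(mq_shrunk, 4), round(mc_shrunk, 4))
```

Under these assumptions the same raw deviation is retained almost fully for MC0-20 but cut by roughly 40% for MQ1-20, which is the intuition behind preferring shrunken deviations over unrelated standalone subscores.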

Number Line cutoff policy audit

Raw coordinate-derived category counts by candidate policy. Current .85/.95 remains the benchmark.

year level | test subgroup | policy id | cutoffs | n items | items with all category cells ok | share of items cell ok | median min category pct | median top category pct | median normalized entropy | premodel policy posture
foundation | BNL0-20 | nl_80_90_95_4cat | 0.8;0.9;0.95 | 10 | 9 | 0.9 | 0.1565 | 0.2526 | 0.9547 | higher-resolution_challenger; use_only_if_item_category_cells_are_stable
foundation | BNL0-20 | nl_80_90_relaxed_3cat | 0.8;0.9 | 10 | 9 | 0.9 | 0.206 | 0.4803 | 0.9384 | relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets
foundation | BNL0-20 | nl_85_95_current_3cat | 0.85;0.95 | 10 | 10 | 1 | 0.1822 | 0.2526 | 0.9296 | benchmark_current_policy; keep as reference in all modelling
foundation | BNL0-20 | nl_90_97_strict_3cat | 0.9;0.97 | 10 | 10 | 1 | 0.136 | 0.1464 | 0.8726 | strict_challenger; reject_if_top_category_sparse_or_validation_not_better
foundation | BNL0-20 | nl_binary_95 | 0.95 | 10 | 10 | 1 | 0.2526 | 0.2526 | 0.8154 | modelable_if_cells_ok_but_loses_partial-credit_information
year1 | BNL0-100 | nl_80_90_95_4cat | 0.8;0.9;0.95 | 13 | 13 | 1 | 0.1944 | 0.2643 | 0.9894 | higher-resolution_challenger; use_only_if_item_category_cells_are_stable
year1 | BNL0-100 | nl_80_90_relaxed_3cat | 0.8;0.9 | 13 | 13 | 1 | 0.2441 | 0.4873 | 0.9536 | relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets
year1 | BNL0-100 | nl_85_95_current_3cat | 0.85;0.95 | 13 | 13 | 1 | 0.2643 | 0.2643 | 0.9799 | benchmark_current_policy; keep as reference in all modelling
year1 | BNL0-100 | nl_90_97_strict_3cat | 0.9;0.97 | 13 | 13 | 1 | 0.1624 | 0.1624 | 0.9129 | strict_challenger; reject_if_top_category_sparse_or_validation_not_better
year1 | BNL0-100 | nl_binary_95 | 0.95 | 13 | 13 | 1 | 0.2643 | 0.2643 | 0.8331 | modelable_if_cells_ok_but_loses_partial-credit_information
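
The sketch below shows one way the per-item quantities in this audit could be derived from coordinate-based accuracy values. Several details are assumptions rather than the audit's confirmed definitions: accuracy is taken to lie in [0, 1], a response's category is the number of cutoffs it meets (accuracy >= cutoff, following the "binary >=.95" wording used on this page), entropy is normalized by log(number of categories), and `min_count` is a hypothetical cell-size threshold. The cutoffs and policy ids are copied from the table; the example accuracies are made up.

```python
import math
from bisect import bisect_right
from collections import Counter

# Cutoffs copied from the policy table; keys are the table's policy ids.
POLICIES = {
    "nl_80_90_relaxed_3cat": (0.80, 0.90),
    "nl_85_95_current_3cat": (0.85, 0.95),
    "nl_90_97_strict_3cat": (0.90, 0.97),
    "nl_binary_95": (0.95,),
    "nl_80_90_95_4cat": (0.80, 0.90, 0.95),
}


def categorize(accuracy, cutoffs):
    """Return how many cutoffs are met (accuracy >= cutoff): 0..len(cutoffs)."""
    return bisect_right(cutoffs, accuracy)


def item_cell_summary(accuracies, cutoffs, min_count=2):
    """Summarize category cells for one item (min_count is a hypothetical threshold)."""
    n_cat = len(cutoffs) + 1
    counts = Counter(categorize(a, cutoffs) for a in accuracies)
    per_cat = [counts.get(k, 0) for k in range(n_cat)]
    shares = [c / len(accuracies) for c in per_cat]
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return {
        "counts": per_cat,
        "min_category_pct": min(shares),
        "top_category_pct": shares[-1],  # share in the highest category
        "entropy_normalized": entropy / math.log(n_cat),
        "cells_ok": all(c >= min_count for c in per_cat),
    }


# Made-up accuracies for a single item, scored under the current .85/.95 policy.
example = [0.50, 0.82, 0.88, 0.91, 0.95, 0.99, 0.70, 0.96]
summary = item_cell_summary(example, POLICIES["nl_85_95_current_3cat"])
print(summary)
```

The table's medians would then be taken across items within each year level and policy.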

Number Line PCM cutoff sensitivity findings

TAM screens were run on cisbox for all ordinal/binary cutoff policies. Bounded mirt 1D screens were run for the current, relaxed, and 4-category policies as a secondary check.

Current readout: The full-battery global score is robust to reasonable PCM cutoff changes. Cutoffs matter more for Number-Line-only subscores. Relaxed .80/.90 and 4-category .80/.90/.95 look like the best ordinal challengers, binary >=.95 loses partial-credit information, and continuous Number Line remains a formal Stan challenger, not yet a replacement.

TAM full-battery

stable

Current, relaxed, strict, binary, and 4-category screens all fit; global movement vs current is modest.

NL-only reliability

Relaxed and 4-category policies improve Number-Line-only reliability relative to current in both years; strict/binary weaken it.

mirt 1D

caution

Five of the six bounded mirt screens did not converge within 300 EM cycles (only the year 1 4-category screen converged), so mirt is a sensitivity check only; the extracted score movement was tiny.

TAM cutoff fit summary
year level scope policy id status n persons n items eap reliability AIC BIC notes
foundation full_battery nl_80_90_relaxed_3cat fit_ok 997.0 111.0 0.9157 62,638 63,236
foundation number_line_only nl_80_90_relaxed_3cat fit_ok 974.0 10 0.6839 16,351 16,453
foundation full_battery nl_85_95_current_3cat fit_ok 997.0 111.0 0.9139 63,905 64,503
foundation number_line_only nl_85_95_current_3cat fit_ok 974.0 10 0.6755 17,664 17,766
foundation full_battery nl_90_97_strict_3cat fit_ok 997.0 111.0 0.9077 63,363 63,962
foundation number_line_only nl_90_97_strict_3cat fit_ok 974.0 10 0.6014 16,979 17,081
foundation full_battery nl_binary_95 fit_ok 997.0 111.0 0.9171 55,120 55,670
foundation number_line_only nl_binary_95 fit_ok 974.0 10 0.5153 10,024 10,078
foundation full_battery nl_80_90_95_4cat fit_ok 997.0 111.0 0.9061 69,656 70,304
foundation number_line_only nl_80_90_95_4cat fit_ok 974.0 10 0.6893 22,246 22,397
year1 full_battery nl_80_90_relaxed_3cat fit_ok 1,221 132.0 0.9563 90,385 91,130
year1 number_line_only nl_80_90_relaxed_3cat fit_ok 1,178 13 0.7578 29,088 29,225
year1 full_battery nl_85_95_current_3cat fit_ok 1,221 132.0 0.9518 92,022 92,768
year1 number_line_only nl_85_95_current_3cat fit_ok 1,178 13 0.7279 29,992 30,129
year1 full_battery nl_90_97_strict_3cat fit_ok 1,221 132.0 0.949 89,902 90,647
year1 number_line_only nl_90_97_strict_3cat fit_ok 1,178 13 0.6712 27,566 27,703
year1 full_battery nl_binary_95 fit_ok 1,221 132.0 0.9529 75,826 76,505
year1 number_line_only nl_binary_95 fit_ok 1,178 13 0.54 15,845 15,916
year1 full_battery nl_80_90_95_4cat fit_ok 1,221 132.0 0.9488 102,866 103,678
year1 number_line_only nl_80_90_95_4cat fit_ok 1,178 13 0.7579 38,492 38,695
TAM cutoff score movement vs current .85/.95
year level scope comparison n spearman theta median abs pctile shift p95 abs pctile shift band exact agreement very low jaccard
foundation full_battery nl_80_90_relaxed_3cat vs nl_85_95_current_3cat 997.0 0.9909 0.0226 0.081 0.9398 0.8742
foundation full_battery nl_90_97_strict_3cat vs nl_85_95_current_3cat 997.0 0.991 0.0211 0.0813 0.9438 0.8861
foundation full_battery nl_binary_95 vs nl_85_95_current_3cat 997.0 0.9813 0.0326 0.1129 0.9178 0.8395
foundation full_battery nl_80_90_95_4cat vs nl_85_95_current_3cat 997.0 0.9882 0.0241 0.0928 0.9238 0.8395
foundation number_line_only nl_80_90_relaxed_3cat vs nl_85_95_current_3cat 974.0 0.9288 0.0675 0.2232 0.8542 0.6627
foundation number_line_only nl_90_97_strict_3cat vs nl_85_95_current_3cat 974.0 0.9262 0.0647 0.2266 0.8501 0.6686
foundation number_line_only nl_binary_95 vs nl_85_95_current_3cat 974.0 0.9026 0.0688 0.2599 0.7793 0.4703
foundation number_line_only nl_80_90_95_4cat vs nl_85_95_current_3cat 974.0 0.9718 0.0416 0.1439 0.8973 0.7711
year1 full_battery nl_80_90_relaxed_3cat vs nl_85_95_current_3cat 1,221 0.9948 0.0172 0.0622 0.9419 0.8579
year1 full_battery nl_90_97_strict_3cat vs nl_85_95_current_3cat 1,221 0.995 0.016 0.0581 0.9484 0.8769
year1 full_battery nl_binary_95 vs nl_85_95_current_3cat 1,221 0.9898 0.0242 0.0852 0.9263 0.7902
year1 full_battery nl_80_90_95_4cat vs nl_85_95_current_3cat 1,221 0.9936 0.0192 0.0672 0.9345 0.8535
year1 number_line_only nl_80_90_relaxed_3cat vs nl_85_95_current_3cat 1,178 0.9436 0.0577 0.2055 0.8659 0.6635
year1 number_line_only nl_90_97_strict_3cat vs nl_85_95_current_3cat 1,178 0.9384 0.0641 0.2017 0.8489 0.6479
year1 number_line_only nl_binary_95 vs nl_85_95_current_3cat 1,178 0.8975 0.0781 0.2681 0.7674 0.542
year1 number_line_only nl_80_90_95_4cat vs nl_85_95_current_3cat 1,178 0.9751 0.0352 0.1317 0.8973 0.7241
Bounded mirt 1D cutoff fit summary
year level policy id scope status converged n persons n items notes
foundation nl_80_90_relaxed_3cat full_battery fit_ok FALSE 997.0 111.0 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
foundation nl_85_95_current_3cat full_battery fit_ok FALSE 997.0 111.0 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
foundation nl_80_90_95_4cat full_battery fit_ok FALSE 997.0 111.0 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
year1 nl_80_90_relaxed_3cat full_battery fit_ok FALSE 1,221 132.0 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
year1 nl_85_95_current_3cat full_battery fit_ok FALSE 1,221 132.0 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
year1 nl_80_90_95_4cat full_battery fit_ok TRUE 1,221 132.0 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model
Bounded mirt 1D cutoff score movement vs current .85/.95
year level scope comparison n spearman theta median abs pctile shift p95 abs pctile shift band exact agreement very low jaccard
foundation full_battery nl_80_90_relaxed_3cat vs nl_85_95_current_3cat 997.0 0.9999 0.002 0.009 0.992 0.9735
foundation full_battery nl_80_90_95_4cat vs nl_85_95_current_3cat 997.0 1 0.001 0.006 0.998 1
year1 full_battery nl_80_90_relaxed_3cat vs nl_85_95_current_3cat 1,221 0.9994 0.0049 0.0221 0.9771 0.9365
year1 full_battery nl_80_90_95_4cat vs nl_85_95_current_3cat 1,221 0.9998 0.0033 0.0131 0.9836 0.9572

Accuracy-speed readiness

Observed/reached timed rows have RT available; high presented-row missingness is largely trailing unreached D-zero rows.

year level test subgroup role is timed observed or coordinate rt missing rate presented row rt missing or negative rate trailing nonresponse rate row rt p50 pct rt lt 1 initial joint model role rt readiness flags
foundation BNL0-20 achievement_primary False 0 0.0199 0 7 0.0077 nl_rt_context_only_initially_not_accuracy_speed_scoring none_obvious_from_row_rt_audit
foundation DMT10_2026 achievement_primary False 0 0.0146 0 16 0.0002 untimed_or_other_context_only_initially none_obvious_from_row_rt_audit
foundation MC0-20 achievement_primary True 0 0.7501 0.7402 6 0.0042 initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
foundation MNC0-20 achievement_primary True 0 0.7611 0.7511 12 0.0045 initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
foundation MQ1-20 achievement_primary True 0 0.8417 0.8328 20 0.0082 initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
foundation STPM shadow_speed_only True 0.0625 0.0521 8 0.0013 shadow_speed_only_exclude_from_math_achievement presented_row_rt_missing_or_negative_gt_5pct
year1 AAMC achievement_primary True 0 0.8024 0.7845 9 0.0053 initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1 ASMC achievement_primary True 0 0.7762 0.7559 12 0.005 initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1 BNL0-100 achievement_primary False 0 0.0377 0 5 0.0065 nl_rt_context_only_initially_not_accuracy_speed_scoring none_obvious_from_row_rt_audit
year1 MC0-100 achievement_primary True 0 0.7711 0.7599 6 0.0044 initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1 MNC0-100 achievement_primary True 0 0.729 0.7145 11 0.0047 initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows
year1 STPM shadow_speed_only True 0.0641 0.0443 6 0.001 shadow_speed_only_exclude_from_math_achievement presented_row_rt_missing_or_negative_gt_5pct
Rapid-row descriptive audit
year level test subgroup j2b style rapid rate mean accuracy rapid rows mean accuracy nonrapid rows rapid minus nonrapid accuracy interpretation
foundation MC0-20 0.0597 0.6363 0.9188 -0.2826 rapid rows should remain diagnostic/shadow unless validated; not a motivation label
foundation MNC0-20 0.0396 0.0877 0.7537 -0.666 rapid rows should remain diagnostic/shadow unless validated; not a motivation label
foundation MQ1-20 0.0308 0.0544 0.6558 -0.6014 rapid rows should remain diagnostic/shadow unless validated; not a motivation label
year1 AAMC 0.0436 0.1844 0.8026 -0.6182 rapid rows should remain diagnostic/shadow unless validated; not a motivation label
year1 ASMC 0.0424 0.1322 0.6265 -0.4943 rapid rows should remain diagnostic/shadow unless validated; not a motivation label
year1 MC0-100 0.0527 0.5666 0.8974 -0.3308 rapid rows should remain diagnostic/shadow unless validated; not a motivation label
year1 MNC0-100 0.0403 0.0874 0.834 -0.7466 rapid rows should remain diagnostic/shadow unless validated; not a motivation label
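
The columns in the rapid-row audit can be reproduced from row-level data along the lines of the sketch below. The j2b-style rapid rule itself is not specified on this page, so the 1-second threshold, the assumption that RT is in seconds, the function name, and the toy rows are all placeholders for illustration.

```python
from statistics import mean


def rapid_row_audit(rows, rapid_threshold=1.0):
    """Compare accuracy on rapid vs non-rapid rows.

    rows: list of (rt, correct) pairs for observed rows with a valid RT,
    where correct is 0 or 1. rapid_threshold is a hypothetical cutoff.
    """
    rapid = [correct for rt, correct in rows if rt < rapid_threshold]
    nonrapid = [correct for rt, correct in rows if rt >= rapid_threshold]
    if not rapid or not nonrapid:
        raise ValueError("need at least one rapid and one non-rapid row")
    acc_rapid, acc_nonrapid = mean(rapid), mean(nonrapid)
    return {
        "rapid_rate": len(rapid) / len(rows),
        "mean_accuracy_rapid_rows": acc_rapid,
        "mean_accuracy_nonrapid_rows": acc_nonrapid,
        "rapid_minus_nonrapid_accuracy": acc_rapid - acc_nonrapid,
    }


# Toy rows: three rapid responses (one correct), seven slower ones (six correct).
toy_rows = [(0.6, 0), (0.8, 1), (3.0, 1), (4.5, 1), (5.0, 0),
            (6.2, 1), (7.0, 1), (2.5, 1), (0.9, 0), (8.0, 1)]
audit = rapid_row_audit(toy_rows)
print(audit)
```

A negative difference, as in every row of the table, says only that rapid rows are less accurate; as the interpretation column notes, it is not by itself a motivation label.
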
Profile-deviation spread by subtest
year level test subgroup n profile deviation profile deviation sd z profile deviation p10 z profile deviation p90 z pct abs profile deviation gt 1z
foundation MQ1-20 1,005 0.9337 -1.043 1.115 0.2239
foundation MC0-20 1,005 0.8424 -0.9794 1.032 0.202
foundation MNC0-20 1,003 0.7871 -0.9912 1.018 0.2044
foundation DMT10_2026 1,002 0.9381 -1.21 1.155 0.2735
foundation BNL0-20 974.0 0.9978 -1.275 1.231 0.3142
year1 MC0-100 1,229 0.7369 -0.8171 0.9086 0.1448
year1 MNC0-100 1,229 0.6395 -0.7395 0.797 0.105
year1 AAMC 1,227 0.6877 -0.8228 0.8138 0.1084
year1 ASMC 1,223 0.7863 -0.9902 0.9733 0.1905
year1 BNL0-100 1,178 0.9017 -1.064 1.119 0.2326

Next-step gates

Decision gates for the next round of modelling.

stream | next action | must check before fit | must check after fit
hierarchical_subscores | fit H1 Stan global+subtest-deviation model on hard-filtered operational frame | subtest score reliability/correlation/readiness table; avoid writing nuisance residuals as teacher subscores | HMC diagnostics, posterior SD by subscore, shrinkage size, global score movement, risk-band movement, subgroup movement, profile interpretability
number_line_policy | run frequentist ordinal cutoff screens using audited candidate policies; keep current .85/.95 as reference | item-by-target category counts and ECDF; reject policies with sparse/empty categories before Stan | threshold behaviour, reliability, score/risk movement, validation/fairness; continuous challenger only after proxy screen
accuracy_speed_joint | treat RT as QC/shadow; candidate families are timed non-NL achievement subtests only at first | RT missingness, row RT quantiles, rapid-row accuracy, STPM exclusion, D/trailing-zero double-count risk | tau construct validity, rapid effect direction, theta robustness, admin/subgroup artefacts, no operational risk-band changes

Interactive diagnostics

Charts are rendered in-browser from aggregate JSON. No student-level scores are published in the chart data.

Downloads

Aggregate CSV/Markdown artifacts used to build this page.

stan_job_diagnostic_summary.csv
stan_score_movement_comparisons.csv
stan_testlet_sigma_summary_long.csv
stan_u_residual_diagnostic_summary.csv
stan_item_difficulty_extreme_or_diagnostic_flags.csv
2026_boy_model_review_findings.md
stan_review_summary.md
2026_boy_premodeling_audit_hierarchical_nl_speed.md
2026_boy_next_model_spec_hierarchical_nl_speed.md
2026_boy_hierarchical_subscore_readiness.csv
2026_boy_subtest_score_correlations.csv
2026_boy_subtest_composite_correlations.csv
2026_boy_subtest_profile_deviation_summary.csv
2026_boy_nl_accuracy_distribution_by_item.csv
2026_boy_nl_policy_item_cell_counts.csv
2026_boy_nl_policy_overall_summary.csv
2026_boy_rt_readiness_by_subtest.csv
2026_boy_j2b_style_rapid_row_audit.csv
2026_boy_speed_accuracy_correlations.csv
2026_boy_hierarchical_model_ladder.csv
2026_boy_number_line_policy_ladder.csv
2026_boy_accuracy_speed_model_ladder.csv
2026_boy_premodeling_decision_gates.csv
2026_boy_tam77_nl_cutoff_fit_summary.csv
2026_boy_tam77_nl_cutoff_item_counts.csv
2026_boy_tam77_nl_cutoff_score_movement.csv
2026_boy_mirt78_nl_cutoff_fit_summary.csv
2026_boy_mirt78_nl_cutoff_score_movement.csv

Stan job diagnostic summary

Completion, verification, sampler, theta, item, and testlet-level summary.

key | family | variant | stan exitcode | postprocess status | verify ok | verify total | divergences | max treedepth hits | min ebfmi | theta max rhat | testlet max rhat | testlet flags
foundation_inclusive | inclusive | inclusive | 0 | completed | 155 | 155 | 0 | 0 | 0.7052 | 1.004 | 1.006 |
year1_inclusive | inclusive | inclusive | 0 | completed | 155 | 155 | 0 | 0 | 0.6144 | 1.004 | 1.023 | BNL0-100:rhat=1.023,ess=109.1,mean=0.258
foundation_hard | hard_filtered | hard_item_filtered | 0 | completed | 1,955 | 1,955 | 0 | 0 | 0.6765 | 1.003 | 1.003 |
year1_hard | hard_filtered | hard_item_filtered | 0 | completed | 1,955 | 1,955 | 0 | 0 | 0.6462 | 1.006 | 1.068 | BNL0-100:rhat=1.068,ess=78.4,mean=0.195
foundation_no_DMT10_2026 | sensitivity | foundation_no_DMT10_2026 | 0 | completed | 2,104 | 2,104 | 0 | 0 | 0.6938 | 1.003 | 1.009 |
foundation_no_MQ1_20_no_DMT10_2026 | sensitivity | foundation_no_MQ1_20_no_DMT10_2026 | 0 | completed | 2,104 | 2,104 | 0 | 0 | 0.7273 | 1.002 | 1.007 |
foundation_no_BNL0_20 | sensitivity | foundation_no_BNL0_20 | 1 | recovered_after_postprocess_failure | 2,098 | 2,098 | 0 | 0 | 0.6673 | 1.002 | 1.004 |
year1_no_MC0_100 | sensitivity | year1_no_MC0_100 | 0 | completed | 2,104 | 2,104 | 0 | 0 | 0.6466 | 1.002 | 1.004 |
year1_no_BNL0_100 | sensitivity | year1_no_BNL0_100 | 1 | recovered_after_postprocess_failure | 2,098 | 2,098 | 0 | 0 | 0.5985 | 1.003 | 1.003 |
year1_core_no_MC_no_NL | sensitivity | year1_core_no_MC_no_NL | 1 | recovered_after_postprocess_failure | 2,098 | 2,098 | 0 | 0 | 0.5676 | 1.002 | 1.003 |
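
The testlet flags above can be reproduced with a simple threshold screen over the posterior summary, sketched below. The thresholds (rhat above 1.01, ESS below 400) are the ones named in the residual diagnostic table's column headers; whether the pipeline applies exactly this rule to the testlet sigma parameters is an assumption, and the function name is hypothetical. The four example rows are copied from the testlet sigma table on this page, using bulk ESS.

```python
RHAT_MAX = 1.01  # flag when rhat exceeds this value
ESS_MIN = 400.0  # flag when bulk ESS falls below this value

# (job, test subgroup, rhat, bulk ESS) copied from the testlet sigma table.
SIGMA_ROWS = [
    ("year1_inclusive", "BNL0-100", 1.023, 109.1),
    ("year1_hard", "BNL0-100", 1.068, 78.37),
    ("year1_hard", "MC0-100", 1.004, 1030.0),
    ("foundation_hard", "BNL0-20", 1.002, 1081.0),
]


def convergence_flags(rows, rhat_max=RHAT_MAX, ess_min=ESS_MIN):
    """Return (job, subgroup, reasons) for every row breaching either threshold."""
    flagged = []
    for job, subgroup, rhat, ess in rows:
        reasons = []
        if rhat > rhat_max:
            reasons.append(f"rhat={rhat}")
        if ess < ess_min:
            reasons.append(f"ess={ess}")
        if reasons:
            flagged.append((job, subgroup, ",".join(reasons)))
    return flagged


flags = convergence_flags(SIGMA_ROWS)
for job, subgroup, why in flags:
    print(f"{job} {subgroup}: {why}")
```

With these rows only the two Year 1 BNL0-100 entries are flagged, matching the testlet flags column.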

Score movement and risk-band stability

All comparisons are against the hard-item-filtered baseline for the matching year.

comparison id | year | n common | spearman theta | median abs pctile shift | p95 abs pctile shift | exact 3-band agreement | very low jaccard | low or very low jaccard | very low moved out | very low moved in | low or very low moved out | low or very low moved in
inclusive_vs_hard_foundation | foundation | 997 | 0.9997 | 0.003 | 0.014 | 0.99 | 0.9735 | 0.9829 | 2 | 2 | 3 | 3
inclusive_vs_hard_year1 | year1 | 1,221 | 0.999 | 0.0074 | 0.0262 | 0.9853 | 0.9677 | 0.9723 | 3 | 3 | 6 | 6
foundation_no_DMT10_2026 | foundation | 997 | 0.9346 | 0.0572 | 0.21 | 0.8516 | 0.7029 | 0.7576 | 26 | 26 | 48 | 48
foundation_no_MQ1_20_no_DMT10_2026 | foundation | 995 | 0.8254 | 0.0995 | 0.3568 | 0.7608 | 0.5204 | 0.6415 | 47 | 47 | 76 | 76
foundation_no_BNL0_20 | foundation | 997 | 0.8648 | 0.0802 | 0.3232 | 0.7753 | 0.5051 | 0.6611 | 49 | 49 | 71 | 71
year1_no_MC0_100 | year1 | 1,211 | 0.9926 | 0.0182 | 0.0677 | 0.962 | 0.9053 | 0.9359 | 9 | 9 | 14 | 14
year1_no_BNL0_100 | year1 | 1,221 | 0.7679 | 0.1188 | 0.3931 | 0.7027 | 0.4023 | 0.5471 | 78 | 78 | 125 | 125
year1_core_no_MC_no_NL | year1 | 1,211 | 0.739 | 0.1346 | 0.4042 | 0.6846 | 0.3817 | 0.5189 | 81 | 81 | 134 | 134
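
For readers who want to recompute these comparisons, the sketch below shows one plausible definition of the main columns from two aligned theta vectors. The band cutoffs (very low at or below the 10th percentile rank, low at or below the 25th) are hypothetical placeholders, as are the function names; the page does not state the operational band rules. Defining bands by within-model percentile rank is itself an assumption, although it would be consistent with the moved-out and moved-in counts being equal in every row.

```python
from statistics import median


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def score_movement(theta_base, theta_alt, very_low_pct=0.10, low_pct=0.25):
    """Compare two theta vectors for the same students, in the same order."""
    n = len(theta_base)
    rank_b, rank_a = average_ranks(theta_base), average_ranks(theta_alt)
    pct_b = [r / n for r in rank_b]
    pct_a = [r / n for r in rank_a]

    def band(p):  # 0 = very low, 1 = low, 2 = other (hypothetical cutoffs)
        return 0 if p <= very_low_pct else (1 if p <= low_pct else 2)

    band_b, band_a = [band(p) for p in pct_b], [band(p) for p in pct_a]
    vl_b = {i for i, b in enumerate(band_b) if b == 0}
    vl_a = {i for i, b in enumerate(band_a) if b == 0}
    union = vl_b | vl_a
    return {
        "spearman_theta": pearson(rank_b, rank_a),  # Spearman = Pearson on ranks
        "median_abs_pctile_shift": median(abs(a - b) for a, b in zip(pct_a, pct_b)),
        "exact_3band_agreement": sum(x == y for x, y in zip(band_b, band_a)) / n,
        "very_low_jaccard": len(vl_b & vl_a) / len(union) if union else 1.0,
        "very_low_moved_out": len(vl_b - vl_a),
        "very_low_moved_in": len(vl_a - vl_b),
    }


# Toy example: 20 students; the alternative model swaps students 1 and 5.
base = [float(i) for i in range(20)]
alt = list(base)
alt[1], alt[5] = alt[5], alt[1]
moved = score_movement(base, alt)
print(moved)
```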

Year 1 residual/testlet caveat

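Auxiliary latent residual diagnostics by testlet. The issue is localized to Year 1 BNL0-100: in year1_hard, 1,193 of 1,221 residuals have rhat above 1.01 (max rhat 1.026, min ESS 279), and in year1_inclusive 3 residuals have ESS below 400. No other testlet is flagged in any job.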

job testlet index test subgroup n rhat gt 1.01 ess lt 400 either max rhat min ess
foundation_inclusive 1 MQ1-20 997.0 0 0 0 1.003 5,816
foundation_inclusive 2 MC0-20 997.0 0 0 0 1.002 6,015
foundation_inclusive 3 MNC0-20 997.0 0 0 0 1.003 6,773
foundation_inclusive 4 DMT10_2026 997.0 0 0 0 1.003 7,920
foundation_inclusive 5 BNL0-20 997.0 0 0 0 1.002 5,589
year1_inclusive 1 MC0-100 1,221 0 0 0 1.003 2,759
year1_inclusive 2 MNC0-100 1,221 0 0 0 1.003 2,981
year1_inclusive 3 AAMC 1,221 0 0 0 1.003 2,173
year1_inclusive 4 ASMC 1,221 0 0 0 1.003 2,453
year1_inclusive 5 BNL0-100 1,221 0 3 3 1.009 320.1
foundation_hard 1 MQ1-20 997.0 0 0 0 1.003 5,781
foundation_hard 2 MC0-20 997.0 0 0 0 1.002 5,019
foundation_hard 3 MNC0-20 997.0 0 0 0 1.003 5,633
foundation_hard 4 DMT10_2026 997.0 0 0 0 1.003 7,825
foundation_hard 5 BNL0-20 997.0 0 0 0 1.002 5,096
year1_hard 1 MC0-100 1,221 0 0 0 1.003 4,635
year1_hard 2 MNC0-100 1,221 0 0 0 1.003 6,077
year1_hard 3 AAMC 1,221 0 0 0 1.003 4,001
year1_hard 4 ASMC 1,221 0 0 0 1.003 4,934
year1_hard 5 BNL0-100 1,221 1,193 2 1,193 1.026 279.0
foundation_no_DMT10_2026 1 MQ1-20 997.0 0 0 0 1.003 3,839
foundation_no_DMT10_2026 2 MC0-20 997.0 0 0 0 1.002 4,958
foundation_no_DMT10_2026 3 MNC0-20 997.0 0 0 0 1.002 5,529
foundation_no_DMT10_2026 4 BNL0-20 997.0 0 0 0 1.003 2,890
foundation_no_MQ1_20_no_DMT10_2026 1 MC0-20 995.0 0 0 0 1.003 6,494
foundation_no_MQ1_20_no_DMT10_2026 2 MNC0-20 995.0 0 0 0 1.003 7,203
foundation_no_MQ1_20_no_DMT10_2026 3 BNL0-20 995.0 0 0 0 1.007 6,786
foundation_no_BNL0_20 1 MQ1-20 997.0 0 0 0 1.004 3,875
foundation_no_BNL0_20 2 MC0-20 997.0 0 0 0 1.003 5,855
foundation_no_BNL0_20 3 MNC0-20 997.0 0 0 0 1.003 7,600
foundation_no_BNL0_20 4 DMT10_2026 997.0 0 0 0 1.003 7,424
year1_no_MC0_100 1 MNC0-100 1,211 0 0 0 1.003 6,104
year1_no_MC0_100 2 AAMC 1,211 0 0 0 1.003 5,836
year1_no_MC0_100 3 ASMC 1,211 0 0 0 1.003 5,684
year1_no_MC0_100 4 BNL0-100 1,211 0 0 0 1.005 6,171
year1_no_BNL0_100 1 MC0-100 1,221 0 0 0 1.002 4,308
year1_no_BNL0_100 2 MNC0-100 1,221 0 0 0 1.003 4,397
year1_no_BNL0_100 3 AAMC 1,221 0 0 0 1.002 3,863
year1_no_BNL0_100 4 ASMC 1,221 0 0 0 1.002 4,374
year1_core_no_MC_no_NL 1 MNC0-100 1,211 0 0 0 1.002 5,522
year1_core_no_MC_no_NL 2 AAMC 1,211 0 0 0 1.002 4,994
year1_core_no_MC_no_NL 3 ASMC 1,211 0 0 0 1.003 5,165
Full testlet sigma table
key test subgroup variable mean sd q5 q95 rhat ess bulk ess tail
foundation_inclusive MQ1-20 sigma_u[1] 0.8397 0.0613 0.7372 0.9408 1.006 1,465 2,999
foundation_inclusive MC0-20 sigma_u[2] 2.189 0.0714 2.075 2.31 1.002 1,844 3,930
foundation_inclusive MNC0-20 sigma_u[3] 1.926 0.0711 1.812 2.046 1.003 2,447 4,306
foundation_inclusive DMT10_2026 sigma_u[4] 0.7777 0.0561 0.6856 0.8697 1.001 1,884 3,879
foundation_inclusive BNL0-20 sigma_u[5] 0.5988 0.0423 0.5285 0.6682 1.005 1,173 2,206
year1_inclusive MC0-100 sigma_u[1] 2.322 0.0722 2.207 2.443 1.001 598.9 2,367
year1_inclusive MNC0-100 sigma_u[2] 2.055 0.0745 1.935 2.179 1.005 541.0 2,220
year1_inclusive AAMC sigma_u[3] 2.05 0.0712 1.937 2.171 1.003 453.1 2,161
year1_inclusive ASMC sigma_u[4] 1.636 0.0629 1.533 1.74 1.005 519.2 2,078
year1_inclusive BNL0-100 sigma_u[5] 0.2584 0.0978 0.0637 0.3936 1.023 109.1 237.0
foundation_hard MQ1-20 sigma_u[1] 0.922 0.0597 0.8248 1.021 1.002 2,015 4,281
foundation_hard MC0-20 sigma_u[2] 2.239 0.0696 2.125 2.355 1.001 1,602 3,564
foundation_hard MNC0-20 sigma_u[3] 1.995 0.0725 1.877 2.116 1 2,183 3,927
foundation_hard DMT10_2026 sigma_u[4] 0.7915 0.0564 0.6992 0.8843 1.003 1,658 4,175
foundation_hard BNL0-20 sigma_u[5] 0.5939 0.0417 0.5256 0.6609 1.002 1,081 2,044
year1_hard MC0-100 sigma_u[1] 2.576 0.0731 2.458 2.699 1.004 1,030 2,777
year1_hard MNC0-100 sigma_u[2] 2.196 0.0739 2.077 2.318 1.005 1,090 2,789
year1_hard AAMC sigma_u[3] 2.091 0.0681 1.979 2.205 1.009 826.3 2,277
year1_hard ASMC sigma_u[4] 1.696 0.0623 1.594 1.799 1.006 900.3 2,518
year1_hard BNL0-100 sigma_u[5] 0.1952 0.0927 0.0296 0.3417 1.068 78.37 319.6
foundation_no_DMT10_2026 MQ1-20 sigma_u[1] 0.9541 0.0677 0.8417 1.065 1.003 1,066 2,554
foundation_no_DMT10_2026 MC0-20 sigma_u[2] 2.263 0.071 2.148 2.383 1.001 1,903 3,560
foundation_no_DMT10_2026 MNC0-20 sigma_u[3] 2.07 0.076 1.945 2.197 1.003 1,781 3,846
foundation_no_DMT10_2026 BNL0-20 sigma_u[4] 0.4973 0.0596 0.3965 0.5908 1.009 481.8 814.6
foundation_no_MQ1_20_no_DMT10_2026 MC0-20 sigma_u[1] 2.408 0.0729 2.289 2.529 1.003 1,659 3,447
foundation_no_MQ1_20_no_DMT10_2026 MNC0-20 sigma_u[2] 2.219 0.0762 2.096 2.346 1.001 2,540 4,204
foundation_no_MQ1_20_no_DMT10_2026 BNL0-20 sigma_u[3] 0.0681 0.0513 0.005 0.1653 1.007 585.5 844.9
foundation_no_BNL0_20 MQ1-20 sigma_u[1] 0.7303 0.0742 0.6062 0.8476 1.004 1,221 2,465
foundation_no_BNL0_20 MC0-20 sigma_u[2] 2.127 0.0697 2.014 2.248 1.002 2,272 4,443
foundation_no_BNL0_20 MNC0-20 sigma_u[3] 1.866 0.0714 1.75 1.984 1 2,937 4,933
foundation_no_BNL0_20 DMT10_2026 sigma_u[4] 0.8036 0.0648 0.6963 0.9101 1.001 1,450 2,960
year1_no_MC0_100 MNC0-100 sigma_u[1] 2.283 0.0725 2.166 2.404 1.002 2,235 4,019
year1_no_MC0_100 AAMC sigma_u[2] 2.164 0.0653 2.06 2.273 1.002 1,867 3,469
year1_no_MC0_100 ASMC sigma_u[3] 1.76 0.0587 1.666 1.859 1.002 2,551 4,791
year1_no_MC0_100 BNL0-100 sigma_u[4] 0.0533 0.04 0.0044 0.1296 1.004 670.5 897.4
year1_no_BNL0_100 MC0-100 sigma_u[1] 2.26 0.0679 2.15 2.373 1.003 2,048 3,661
year1_no_BNL0_100 MNC0-100 sigma_u[2] 1.756 0.0707 1.641 1.873 1.002 1,715 3,122
year1_no_BNL0_100 AAMC sigma_u[3] 1.608 0.0638 1.505 1.714 1.002 1,892 2,740
year1_no_BNL0_100 ASMC sigma_u[4] 1.245 0.0588 1.149 1.344 1 1,668 3,369
year1_core_no_MC_no_NL MNC0-100 sigma_u[1] 1.969 0.0751 1.848 2.095 1.002 2,157 3,843
year1_core_no_MC_no_NL AAMC sigma_u[2] 1.753 0.0678 1.644 1.866 1.003 1,925 4,099
year1_core_no_MC_no_NL ASMC sigma_u[3] 1.283 0.0635 1.179 1.388 1.001 1,991 4,106
Item difficulty extreme/diagnostic flags table
key variable mean sd q5 q95 rhat ess bulk ess tail
foundation_inclusive b[81] 7.829 0.5637 6.978 8.828 1 10,514 4,900
foundation_inclusive b[61] 7.828 0.5537 6.986 8.796 1.002 12,140 5,255
foundation_inclusive b[64] 7.826 0.5673 6.959 8.805 1.001 11,474 4,865
foundation_inclusive b[63] 7.826 0.5509 6.973 8.777 1.001 11,666 5,063
foundation_inclusive b[80] 7.825 0.5688 6.966 8.81 1.001 11,620 5,078
foundation_inclusive b[65] 7.824 0.5555 6.965 8.786 1 10,609 5,412
foundation_inclusive b[62] 7.823 0.5623 6.948 8.798 1 12,400 5,798
foundation_inclusive b[68] 7.821 0.56 6.964 8.803 1 11,187 4,942
foundation_inclusive b[69] 7.821 0.5668 6.948 8.806 0.9998 10,863 5,251
foundation_inclusive b[79] 7.817 0.5511 6.974 8.78 1 10,663 5,172
foundation_inclusive b[102] 7.7 0.5616 6.851 8.667 1 11,007 5,262
foundation_inclusive b[110] 7.699 0.556 6.855 8.675 1 10,556 5,517
foundation_inclusive b[109] 7.691 0.5449 6.869 8.635 1 10,619 5,544
foundation_inclusive b[111] 7.689 0.553 6.846 8.656 1 11,379 4,307
foundation_inclusive b[106] 7.687 0.5456 6.852 8.654 1 11,355 5,515
foundation_inclusive b[108] 7.685 0.5583 6.829 8.656 1 11,662 5,305
foundation_inclusive b[58] 7.565 0.5183 6.763 8.463 1.002 10,619 5,099
foundation_inclusive b[75] 7.556 0.5175 6.763 8.446 1.001 12,080 4,752
foundation_inclusive b[73] 7.555 0.5255 6.735 8.466 1.003 10,646 4,896
foundation_inclusive b[72] 7.555 0.5083 6.767 8.418 1.001 11,396 5,494
year1_inclusive b[164] 8.307 0.5422 7.484 9.259 1.001 10,601 5,368
year1_inclusive b[40] 8.302 0.5642 7.431 9.289 1.002 13,359 5,382
year1_inclusive b[170] 8.302 0.5416 7.463 9.234 1.002 12,387 5,139
year1_inclusive b[168] 8.301 0.5358 7.477 9.24 1.001 10,278 5,653
year1_inclusive b[34] 8.301 0.5614 7.447 9.288 1.001 13,806 4,939
year1_inclusive b[171] 8.299 0.5325 7.478 9.221 1.002 11,647 5,517
year1_inclusive b[166] 8.298 0.5464 7.466 9.256 1 12,309 5,572
year1_inclusive b[169] 8.296 0.5284 7.491 9.223 1 10,930 5,443
year1_inclusive b[172] 8.292 0.5326 7.48 9.213 1.001 13,538 5,169
year1_inclusive b[120] 8.254 0.5375 7.431 9.171 1.001 12,951 5,834
year1_inclusive b[136] 8.253 0.539 7.445 9.223 1.001 11,104 4,681
year1_inclusive b[133] 8.251 0.5415 7.428 9.169 1 10,203 4,544
year1_inclusive b[129] 8.251 0.5442 7.423 9.203 1 12,194 5,691
year1_inclusive b[141] 8.251 0.5426 7.42 9.2 1 13,052 5,230
year1_inclusive b[134] 8.251 0.5345 7.434 9.19 1.001 10,492 4,616
year1_inclusive b[140] 8.25 0.5361 7.426 9.185 0.9999 11,556 4,875
year1_inclusive b[125] 8.25 0.5321 7.435 9.178 1 12,733 5,602
year1_inclusive b[123] 8.25 0.5398 7.416 9.191 1.001 13,588 5,578
year1_inclusive b[142] 8.249 0.5375 7.421 9.193 1.001 13,551 5,404
year1_inclusive b[128] 8.248 0.538 7.424 9.196 1.002 12,262 5,445
foundation_hard b[67] 8.006 0.5185 7.206 8.902 1.001 10,703 5,245
foundation_hard b[61] 8.005 0.5234 7.192 8.93 1 10,271 4,354
foundation_hard b[59] 8.003 0.5155 7.209 8.901 1.001 11,504 4,907
foundation_hard b[57] 8.001 0.5108 7.207 8.876 1 11,286 5,709
foundation_hard b[66] 8.001 0.5202 7.206 8.9 1 11,874 4,432
foundation_hard b[68] 8.001 0.5146 7.201 8.883 1.001 11,490 5,480
foundation_hard b[64] 8 0.5054 7.213 8.872 1.001 10,755 5,631
foundation_hard b[65] 7.999 0.5171 7.192 8.887 1 11,646 4,453
foundation_hard b[58] 7.999 0.4969 7.221 8.85 1.001 10,610 5,527
foundation_hard b[55] 7.998 0.5119 7.189 8.883 1 9,513 5,119
foundation_hard b[62] 7.996 0.5156 7.194 8.896 1.001 10,896 5,401
foundation_hard b[52] 7.995 0.5068 7.206 8.863 1 10,763 5,164
foundation_hard b[56] 7.994 0.5129 7.207 8.897 1.001 10,798 5,588
foundation_hard b[60] 7.994 0.5065 7.215 8.875 1.001 12,241 4,915
foundation_hard b[54] 7.993 0.5138 7.199 8.888 1 11,508 5,796
foundation_hard b[63] 7.991 0.5112 7.184 8.878 1.002 11,372 5,727
foundation_hard b[53] 7.774 0.4765 7.021 8.583 1.001 10,998 5,387
foundation_hard b[51] 7.766 0.4816 7.028 8.605 1 11,810 5,249
foundation_hard b[92] 7.741 0.5019 6.961 8.61 1 11,551 5,223
foundation_hard b[90] 7.741 0.4976 6.965 8.603 1 10,416 5,919
year1_hard b[107] 9.497 0.5015 8.705 10.36 1.001 11,483 5,821
year1_hard b[110] 9.493 0.5009 8.711 10.35 1 8,816 5,481
year1_hard b[108] 9.49 0.502 8.713 10.35 1.001 9,981 4,805
year1_hard b[109] 9.488 0.5143 8.682 10.37 1.001 11,363 6,024
year1_hard b[106] 9.278 0.4784 8.536 10.11 1 10,215 6,011
year1_hard b[104] 9.275 0.4658 8.551 10.06 1.001 9,470 5,165
year1_hard b[105] 9.274 0.4675 8.539 10.09 1.001 10,277 5,493
year1_hard b[131] 8.564 0.4876 7.814 9.401 1 10,202 4,515
year1_hard b[132] 8.355 0.4509 7.639 9.122 1 12,065 5,390
year1_hard b[130] 8.352 0.458 7.636 9.137 1 9,747 5,523
year1_hard b[103] 8.263 0.3503 7.704 8.857 1 8,026 6,194
year1_hard b[38] 8.205 0.529 7.39 9.121 1.001 10,887 5,070
year1_hard b[27] 8.205 0.5226 7.389 9.123 1.001 10,901 5,201
year1_hard b[37] 8.203 0.5094 7.421 9.066 1.001 11,727 5,798
year1_hard b[30] 8.202 0.5279 7.399 9.118 1.001 12,492 4,902
year1_hard b[22] 8.201 0.5256 7.389 9.115 1 12,550 5,470
year1_hard b[35] 8.201 0.5215 7.397 9.098 1 11,661 4,768
year1_hard b[25] 8.198 0.518 7.408 9.096 1.001 11,833 5,432
year1_hard b[29] 8.197 0.523 7.399 9.101 1.002 11,422 5,258
year1_hard b[32] 8.197 0.5085 7.422 9.082 1.001 11,232 5,871
foundation_no_DMT10_2026 b[56] 8.003 0.5149 7.2 8.903 1.001 11,245 5,544
foundation_no_DMT10_2026 b[50] 8.001 0.5014 7.23 8.884 1 9,058 4,869
foundation_no_DMT10_2026 b[53] 8 0.4953 7.231 8.864 1 9,980 5,239
foundation_no_DMT10_2026 b[51] 7.999 0.5205 7.203 8.897 1.001 10,170 5,438
foundation_no_DMT10_2026 b[44] 7.999 0.5222 7.199 8.898 1.001 10,974 5,023
foundation_no_DMT10_2026 b[48] 7.999 0.5089 7.201 8.885 1.001 11,335 5,408
foundation_no_DMT10_2026 b[58] 7.998 0.5133 7.21 8.885 1 10,047 5,687
foundation_no_DMT10_2026 b[54] 7.998 0.5168 7.206 8.897 1 10,213 5,567
foundation_no_DMT10_2026 b[46] 7.998 0.5092 7.211 8.864 1.001 10,190 5,756
foundation_no_DMT10_2026 b[52] 7.997 0.5123 7.207 8.894 1.001 10,015 5,462
foundation_no_DMT10_2026 b[57] 7.996 0.512 7.203 8.865 1.001 9,261 5,040
foundation_no_DMT10_2026 b[49] 7.996 0.5146 7.177 8.877 1 10,706 5,121
foundation_no_DMT10_2026 b[59] 7.994 0.5154 7.184 8.883 1.001 10,213 5,237
foundation_no_DMT10_2026 b[55] 7.991 0.49 7.221 8.838 1 10,062 5,853
foundation_no_DMT10_2026 b[60] 7.99 0.5223 7.175 8.895 1 11,202 5,325
foundation_no_DMT10_2026 b[47] 7.99 0.5131 7.198 8.867 1.001 10,746 4,765
foundation_no_DMT10_2026 b[43] 7.766 0.4771 7.011 8.595 1 10,443 5,506
foundation_no_DMT10_2026 b[45] 7.763 0.4732 7.033 8.58 1.002 8,990 4,577
foundation_no_DMT10_2026 b[81] 7.763 0.5075 6.995 8.648 1.001 9,655 5,295
foundation_no_DMT10_2026 b[80] 7.757 0.5023 6.994 8.623 1.002 9,873 5,455
foundation_no_MQ1_20_no_DMT10_2026 b[46] 8.038 0.516 7.234 8.926 1.001 12,235 4,708
foundation_no_MQ1_20_no_DMT10_2026 b[48] 8.035 0.5203 7.226 8.914 0.9999 10,550 5,122
foundation_no_MQ1_20_no_DMT10_2026 b[58] 8.034 0.52 7.217 8.946 1.002 12,559 4,526
foundation_no_MQ1_20_no_DMT10_2026 b[53] 8.034 0.5098 7.226 8.919 1.001 11,948 5,302
foundation_no_MQ1_20_no_DMT10_2026 b[49] 8.033 0.5137 7.238 8.925 1.001 11,860 5,433
foundation_no_MQ1_20_no_DMT10_2026 b[51] 8.033 0.51 7.233 8.893 1 12,133 5,543
foundation_no_MQ1_20_no_DMT10_2026 b[44] 8.031 0.5105 7.241 8.925 1.001 11,852 5,579
foundation_no_MQ1_20_no_DMT10_2026 b[56] 8.031 0.512 7.243 8.924 1.001 11,160 5,375
foundation_no_MQ1_20_no_DMT10_2026 b[52] 8.031 0.5062 7.249 8.916 1.001 12,815 6,038
foundation_no_MQ1_20_no_DMT10_2026 b[55] 8.03 0.5088 7.247 8.904 1.001 12,143 5,536
foundation_no_MQ1_20_no_DMT10_2026 b[50] 8.028 0.5175 7.227 8.941 1.001 12,795 4,985
foundation_no_MQ1_20_no_DMT10_2026 b[57] 8.027 0.5102 7.245 8.908 1.001 12,697 5,756
foundation_no_MQ1_20_no_DMT10_2026 b[60] 8.027 0.5199 7.22 8.927 1 12,049 5,128
foundation_no_MQ1_20_no_DMT10_2026 b[59] 8.026 0.5175 7.226 8.899 1 13,260 5,505
foundation_no_MQ1_20_no_DMT10_2026 b[54] 8.026 0.5159 7.228 8.908 1 11,114 5,088
foundation_no_MQ1_20_no_DMT10_2026 b[47] 8.024 0.4992 7.241 8.878 1.001 12,673 5,252
foundation_no_MQ1_20_no_DMT10_2026 b[81] 7.837 0.5043 7.062 8.705 1.002 13,099 5,769
foundation_no_MQ1_20_no_DMT10_2026 b[83] 7.834 0.5063 7.056 8.707 1 11,811 5,498
foundation_no_MQ1_20_no_DMT10_2026 b[80] 7.833 0.4987 7.074 8.708 1 11,850 5,260
foundation_no_MQ1_20_no_DMT10_2026 b[82] 7.832 0.5045 7.059 8.704 1 11,609 5,007
foundation_no_BNL0_20 b[48] 8.014 0.5103 7.229 8.895 1.002 14,235 6,076
foundation_no_BNL0_20 b[51] 8.014 0.5109 7.221 8.906 1.001 15,427 5,038
foundation_no_BNL0_20 b[52] 8.014 0.5101 7.228 8.899 1 14,431 5,512
foundation_no_BNL0_20 b[49] 8.013 0.524 7.202 8.931 1 14,002 5,477
foundation_no_BNL0_20 b[57] 8.013 0.5218 7.191 8.903 1.001 14,127 4,955
foundation_no_BNL0_20 b[56] 8.012 0.5246 7.192 8.923 1 14,244 4,541
foundation_no_BNL0_20 b[44] 8.011 0.4994 7.236 8.883 1.001 14,004 5,163
foundation_no_BNL0_20 b[50] 8.011 0.5116 7.224 8.881 1.001 13,465 5,542
foundation_no_BNL0_20 b[42] 8.011 0.5205 7.205 8.914 1 15,335 5,563
foundation_no_BNL0_20 b[47] 8.01 0.5221 7.202 8.908 1 13,570 5,075
foundation_no_BNL0_20 b[55] 8.009 0.5153 7.211 8.87 1 15,560 5,714
foundation_no_BNL0_20 b[58] 8.009 0.5 7.227 8.871 1 13,975 5,664
foundation_no_BNL0_20 b[45] 8.009 0.5089 7.212 8.888 1 15,748 5,994
foundation_no_BNL0_20 b[46] 8.008 0.5252 7.192 8.918 1 14,678 5,234
foundation_no_BNL0_20 b[53] 8.008 0.4992 7.235 8.856 1.001 14,424 5,539
foundation_no_BNL0_20 b[54] 8.006 0.508 7.218 8.893 1 13,131 5,241
foundation_no_BNL0_20 b[43] 7.783 0.4763 7.031 8.597 1.001 14,751 4,931
foundation_no_BNL0_20 b[41] 7.771 0.4693 7.034 8.579 1.001 15,584 5,620
foundation_no_BNL0_20 b[82] 7.734 0.4993 6.966 8.601 1.001 13,522 5,042
foundation_no_BNL0_20 b[81] 7.734 0.5011 6.967 8.61 1.001 14,061 5,130
year1_no_MC0_100 b[97] 8.641 0.497 7.877 9.482 1 11,447 5,361
year1_no_MC0_100 b[96] 8.434 0.4632 7.715 9.24 1.001 9,911 5,566
year1_no_MC0_100 b[98] 8.425 0.4599 7.703 9.209 1 10,199 5,925
year1_no_MC0_100 b[95] 8.251 0.4338 7.571 8.984 1 10,062 5,038
year1_no_MC0_100 b[29] 8.245 0.5165 7.433 9.15 0.9998 9,955 4,727
year1_no_MC0_100 b[31] 8.241 0.522 7.433 9.155 1.002 13,982 4,732
year1_no_MC0_100 b[35] 8.239 0.5331 7.427 9.168 1 11,445 4,974
year1_no_MC0_100 b[27] 8.238 0.5253 7.427 9.136 1.001 11,992 5,104
year1_no_MC0_100 b[28] 8.238 0.531 7.406 9.155 1.001 11,655 4,696
year1_no_MC0_100 b[34] 8.238 0.5258 7.427 9.143 1 12,380 5,543
year1_no_MC0_100 b[33] 8.237 0.5083 7.448 9.118 1 10,989 5,127
year1_no_MC0_100 b[22] 8.236 0.5217 7.439 9.127 1.001 12,780 5,294
year1_no_MC0_100 b[37] 8.235 0.515 7.44 9.127 1.002 11,521 5,609
year1_no_MC0_100 b[30] 8.231 0.5191 7.428 9.134 1 12,678 5,976
year1_no_MC0_100 b[36] 8.231 0.5155 7.426 9.115 1 11,650 5,620
year1_no_MC0_100 b[38] 8.229 0.5135 7.442 9.119 1.001 12,111 5,432
year1_no_MC0_100 b[32] 8.226 0.5062 7.454 9.102 1 11,546 5,832
year1_no_MC0_100 b[25] 8.225 0.5081 7.443 9.105 1 12,807 6,109
year1_no_MC0_100 b[26] 8.004 0.4745 7.274 8.815 1.001 9,905 4,832
year1_no_MC0_100 b[24] 8 0.4777 7.261 8.831 1 12,510 5,384
year1_no_BNL0_100 b[94] 9.423 0.5009 8.651 10.3 1 9,369 5,296
year1_no_BNL0_100 b[95] 9.422 0.5108 8.619 10.31 1.001 9,269 5,245
year1_no_BNL0_100 b[97] 9.418 0.5061 8.627 10.29 1 8,492 4,980
year1_no_BNL0_100 b[96] 9.416 0.5095 8.623 10.29 1 10,151 5,918
year1_no_BNL0_100 b[93] 9.201 0.4816 8.453 10.03 1 9,815 5,523
year1_no_BNL0_100 b[92] 9.197 0.4731 8.458 10.02 1 9,506 5,795
year1_no_BNL0_100 b[91] 9.196 0.4633 8.472 9.994 1 8,598 5,210
year1_no_BNL0_100 b[118] 8.432 0.4934 7.658 9.291 1.001 10,748 5,381
year1_no_BNL0_100 b[119] 8.233 0.4579 7.517 9.012 1.001 10,359 5,764
year1_no_BNL0_100 b[117] 8.228 0.4523 7.517 8.997 1.001 10,423 4,968
year1_no_BNL0_100 b[90] 8.194 0.3467 7.635 8.771 1 8,119 6,229
year1_no_BNL0_100 b[35] 8.163 0.5124 7.376 9.058 1.001 10,185 5,326
year1_no_BNL0_100 b[25] 8.162 0.5209 7.357 9.077 1 11,495 5,309
year1_no_BNL0_100 b[33] 8.161 0.5213 7.361 9.067 1 11,323 5,937
year1_no_BNL0_100 b[37] 8.16 0.5077 7.373 9.055 1.001 9,620 5,385
year1_no_BNL0_100 b[34] 8.16 0.5078 7.382 9.052 1 12,275 6,146
year1_no_BNL0_100 b[38] 8.158 0.5211 7.362 9.072 1 10,685 5,194
year1_no_BNL0_100 b[29] 8.158 0.5113 7.37 9.047 1 11,494 5,758
year1_no_BNL0_100 b[30] 8.158 0.5073 7.376 9.021 1 10,514 4,896
year1_no_BNL0_100 b[32] 8.156 0.5073 7.36 9.037 1 10,190 5,583
year1_core_no_MC_no_NL b[84] 8.57 0.4904 7.811 9.415 1 12,285 5,696
year1_core_no_MC_no_NL b[83] 8.367 0.4658 7.646 9.167 1 11,942 5,113
year1_core_no_MC_no_NL b[85] 8.364 0.4601 7.644 9.15 1.001 12,817 5,931
year1_core_no_MC_no_NL b[33] 8.21 0.5118 7.417 9.091 1 13,919 4,888
year1_core_no_MC_no_NL b[31] 8.207 0.53 7.395 9.146 1.001 13,711 4,756
year1_core_no_MC_no_NL b[29] 8.205 0.5111 7.419 9.082 1 16,253 5,420
year1_core_no_MC_no_NL b[25] 8.203 0.5118 7.416 9.096 1 13,262 5,692
year1_core_no_MC_no_NL b[34] 8.202 0.5202 7.391 9.104 1 16,277 5,645
year1_core_no_MC_no_NL b[28] 8.2 0.519 7.403 9.09 1.001 15,230 5,049
year1_core_no_MC_no_NL b[22] 8.2 0.5184 7.396 9.09 0.9998 14,382 5,210
year1_core_no_MC_no_NL b[37] 8.2 0.5089 7.414 9.087 1.001 14,801 5,424
year1_core_no_MC_no_NL b[38] 8.2 0.5151 7.401 9.099 1.002 12,909 4,308
year1_core_no_MC_no_NL b[30] 8.199 0.4956 7.421 9.051 1 13,262 5,654
year1_core_no_MC_no_NL b[27] 8.199 0.5191 7.392 9.115 1.001 14,735 4,717
year1_core_no_MC_no_NL b[35] 8.199 0.5183 7.396 9.092 1 15,503 4,965
year1_core_no_MC_no_NL b[36] 8.197 0.5026 7.425 9.065 1.001 15,112 5,976
year1_core_no_MC_no_NL b[32] 8.194 0.4991 7.426 9.043 1.001 14,008 5,710
year1_core_no_MC_no_NL b[82] 8.175 0.4304 7.505 8.919 1 12,572 5,793
year1_core_no_MC_no_NL b[26] 7.967 0.4787 7.222 8.789 1 14,407 5,566
year1_core_no_MC_no_NL b[24] 7.966 0.4803 7.226 8.793 1 14,103 5,606

Premodelling audit memo

Rendered from the aggregate-only Markdown premodelling artifact.

2026 BOY premodelling audit: hierarchical subscores, Number Line policy, and accuracy-speed

Generated: 2026-06-14 08:57:14Z

This is a dependency-light, aggregate-only audit. It does not publish raw student identifiers or person-level score files.

Executive readout

Hierarchical subscore readiness

| year | subtest | keep_items | rel | band | global_r | posture |
|---|---|---|---|---|---|---|
| foundation | MQ1-20 | 19 | 0.602 | weak | 0.426 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore |
| foundation | MC0-20 | 50 | 0.927 | strong | 0.532 | strong standalone signal; still prefer hierarchical coherence with global score |
| foundation | MNC0-20 | 24 | 0.881 | strong | 0.604 | strong standalone signal; still prefer hierarchical coherence with global score |
| foundation | DMT10_2026 | 8 | 0.609 | weak | 0.43 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore |
| foundation | BNL0-20 | 10 | 0.674 | weak | 0.354 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore |
| year1 | MC0-100 | 34 | 0.94 | strong | 0.694 | strong standalone signal; still prefer hierarchical coherence with global score |
| year1 | MNC0-100 | 22 | 0.891 | strong | 0.762 | strong standalone signal; still prefer hierarchical coherence with global score |
| year1 | AAMC | 38 | 0.9 | strong | 0.729 | strong standalone signal; still prefer hierarchical coherence with global score |
| year1 | ASMC | 25 | 0.841 | moderate | 0.617 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic |
| year1 | BNL0-100 | 13 | 0.727 | moderate | 0.548 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic |

Key implication: several subtests are not ideal standalone reporting scores, especially where reliability is weak/moderate or item counts are small. That is an argument *for* hierarchical shrinkage, not against subscores.

Current-policy subtest relationships

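| year | subtest_1 | subtest_2 | n | rho | band |
|---|---|---|---|---|---|
| foundation | MQ1-20 | MC0-20 | 1005 | 0.405 | moderate |
| foundation | MQ1-20 | MNC0-20 | 1003 | 0.405 | moderate |
| foundation | MQ1-20 | DMT10_2026 | 1002 | 0.275 | low |
| foundation | MQ1-20 | BNL0-20 | 974 | 0.171 | low |
| foundation | MC0-20 | MNC0-20 | 1003 | 0.561 | moderate |
| foundation | MC0-20 | DMT10_2026 | 1002 | 0.283 | low |
| foundation | MC0-20 | BNL0-20 | 974 | 0.245 | low |
| foundation | MNC0-20 | DMT10_2026 | 1002 | 0.393 | low |
| foundation | MNC0-20 | BNL0-20 | 974 | 0.304 | low |
| foundation | DMT10_2026 | BNL0-20 | 974 | 0.302 | low |
| year1 | MC0-100 | MNC0-100 | 1229 | 0.683 | high |
| year1 | MC0-100 | AAMC | 1227 | 0.595 | moderate |
| year1 | MC0-100 | ASMC | 1223 | 0.485 | moderate |
| year1 | MC0-100 | BNL0-100 | 1178 | 0.468 | moderate |
| year1 | MNC0-100 | AAMC | 1227 | 0.671 | high |
| year1 | MNC0-100 | ASMC | 1223 | 0.561 | moderate |
| year1 | MNC0-100 | BNL0-100 | 1178 | 0.504 | moderate |
| year1 | AAMC | ASMC | 1223 | 0.595 | moderate |
| year1 | AAMC | BNL0-100 | 1178 | 0.469 | moderate |
| year1 | ASMC | BNL0-100 | 1178 | 0.379 | low |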

Profile-deviation spread

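| year | subtest | n | sd_dev_z | p10 | p90 | %>1z |
|---|---|---|---|---|---|---|
| foundation | MQ1-20 | 1005 | 0.934 | -1.04 | 1.11 | 22.4% |
| foundation | MC0-20 | 1005 | 0.842 | -0.98 | 1.03 | 20.2% |
| foundation | MNC0-20 | 1003 | 0.787 | -0.99 | 1.02 | 20.4% |
| foundation | DMT10_2026 | 1002 | 0.938 | -1.21 | 1.15 | 27.3% |
| foundation | BNL0-20 | 974 | 0.998 | -1.27 | 1.23 | 31.4% |
| year1 | MC0-100 | 1229 | 0.737 | -0.82 | 0.91 | 14.5% |
| year1 | MNC0-100 | 1229 | 0.64 | -0.74 | 0.8 | 10.5% |
| year1 | AAMC | 1227 | 0.688 | -0.82 | 0.81 | 10.8% |
| year1 | ASMC | 1223 | 0.786 | -0.99 | 0.97 | 19.1% |
| year1 | BNL0-100 | 1178 | 0.902 | -1.06 | 1.12 | 23.3% |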

Number Line cutoff policy premodelling

| year | subtest | policy | items_ok | median_min_pct | median_top_pct | entropy | posture |
|---|---|---|---|---|---|---|---|
| foundation | BNL0-20 | nl_80_90_95_4cat | 9/10 | 15.6% | 25.3% | 0.955 | higher-resolution_challenger; use_only_if_item_category_cells_are_stable |
| foundation | BNL0-20 | nl_80_90_relaxed_3cat | 9/10 | 20.6% | 48.0% | 0.938 | relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets |
| foundation | BNL0-20 | nl_85_95_current_3cat | 10/10 | 18.2% | 25.3% | 0.93 | benchmark_current_policy; keep as reference in all modelling |
| foundation | BNL0-20 | nl_90_97_strict_3cat | 10/10 | 13.6% | 14.6% | 0.873 | strict_challenger; reject_if_top_category_sparse_or_validation_not_better |
| foundation | BNL0-20 | nl_binary_95 | 10/10 | 25.3% | 25.3% | 0.815 | modelable_if_cells_ok_but_loses_partial-credit_information |
| year1 | BNL0-100 | nl_80_90_95_4cat | 13/13 | 19.4% | 26.4% | 0.989 | higher-resolution_challenger; use_only_if_item_category_cells_are_stable |
| year1 | BNL0-100 | nl_80_90_relaxed_3cat | 13/13 | 24.4% | 48.7% | 0.954 | relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets |
| year1 | BNL0-100 | nl_85_95_current_3cat | 13/13 | 26.4% | 26.4% | 0.98 | benchmark_current_policy; keep as reference in all modelling |
| year1 | BNL0-100 | nl_90_97_strict_3cat | 13/13 | 16.2% | 16.2% | 0.913 | strict_challenger; reject_if_top_category_sparse_or_validation_not_better |
| year1 | BNL0-100 | nl_binary_95 | 13/13 | 26.4% | 26.4% | 0.833 | modelable_if_cells_ok_but_loses_partial-credit_information |

Interpretation rule: a policy can be *modelable* from cell counts but still not promotable. Promotion requires validation, risk-band movement, fairness/subgroup checks, and interpretability. Current .85/.95 remains the benchmark.

Accuracy-speed / RT readiness

| year | subtest | role | timed | obs_rt_miss | presented_miss | trailing | rt_p50 | <1s | model_role | flags |
|---|---|---|---|---|---|---|---|---|---|---|
| foundation | BNL0-20 | achievement_primary | False | 0.00% | 2.0% | 0.0% | 7 | 0.8% | nl_rt_context_only_initially_not_accuracy_speed_scoring | none_obvious_from_row_rt_audit |
| foundation | DMT10_2026 | achievement_primary | False | 0.00% | 1.5% | 0.0% | 16 | 0.0% | untimed_or_other_context_only_initially | none_obvious_from_row_rt_audit |
| foundation | MC0-20 | achievement_primary | True | 0.00% | 75.0% | 74.0% | 6 | 0.4% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | MNC0-20 | achievement_primary | True | 0.00% | 76.1% | 75.1% | 12 | 0.4% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | MQ1-20 | achievement_primary | True | 0.00% | 84.2% | 83.3% | 20 | 0.8% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | STPM | shadow_speed_only | True | | 6.2% | 5.2% | 8 | 0.1% | shadow_speed_only_exclude_from_math_achievement | presented_row_rt_missing_or_negative_gt_5pct |
| year1 | AAMC | achievement_primary | True | 0.00% | 80.2% | 78.5% | 9 | 0.5% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | ASMC | achievement_primary | True | 0.00% | 77.6% | 75.6% | 12 | 0.5% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | BNL0-100 | achievement_primary | False | 0.00% | 3.8% | 0.0% | 5 | 0.7% | nl_rt_context_only_initially_not_accuracy_speed_scoring | none_obvious_from_row_rt_audit |
| year1 | MC0-100 | achievement_primary | True | 0.00% | 77.1% | 76.0% | 6 | 0.4% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | MNC0-100 | achievement_primary | True | 0.00% | 72.9% | 71.4% | 11 | 0.5% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | STPM | shadow_speed_only | True | | 6.4% | 4.4% | 6 | 0.1% | shadow_speed_only_exclude_from_math_achievement | presented_row_rt_missing_or_negative_gt_5pct |

J2b-style rapid-row descriptive check

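| year | subtest | rapid_rate | rapid_acc | nonrapid_acc | delta |
|---|---|---|---|---|---|
| foundation | MC0-20 | 5.97% | 0.636 | 0.919 | -0.283 |
| foundation | MNC0-20 | 3.96% | 0.088 | 0.754 | -0.666 |
| foundation | MQ1-20 | 3.08% | 0.054 | 0.656 | -0.601 |
| year1 | AAMC | 4.36% | 0.184 | 0.803 | -0.618 |
| year1 | ASMC | 4.24% | 0.132 | 0.627 | -0.494 |
| year1 | MC0-100 | 5.27% | 0.567 | 0.897 | -0.331 |
| year1 | MNC0-100 | 4.03% | 0.087 | 0.834 | -0.747 |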
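
The delta column appears to be rapid accuracy minus non-rapid accuracy (for example 0.636 - 0.919 = -0.283 for Foundation MC0-20). A minimal sketch of that summary follows. The function name, the `(is_rapid, correct)` row layout, and the toy rows are illustrative assumptions; how a row is flagged as rapid is not specified here, so the flag is taken as given.

```python
def rapid_row_summary(rows):
    """rows: iterable of (is_rapid, correct) pairs, one per observed item response."""
    rapid = [c for r, c in rows if r]
    nonrapid = [c for r, c in rows if not r]
    if not rapid or not nonrapid:
        raise ValueError("need at least one rapid and one non-rapid row")
    rapid_acc = sum(rapid) / len(rapid)
    nonrapid_acc = sum(nonrapid) / len(nonrapid)
    return {
        "rapid_rate": len(rapid) / (len(rapid) + len(nonrapid)),
        "rapid_acc": rapid_acc,
        "nonrapid_acc": nonrapid_acc,
        # Negative delta: rapid rows are answered less accurately.
        "delta": rapid_acc - nonrapid_acc,
    }

# Toy rows for illustration only, not audit data.
rows = [(True, 0), (True, 1), (False, 1), (False, 1), (False, 0), (False, 1)]
summary = rapid_row_summary(rows)
```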

Person-level speed/reach correlations with current-policy scores

| year | subtest | metric | n | rho | note |
|---|---|---|---|---|---|
| foundation | STPM | median_item_rt_sec | 1016 | -0.437 | rt_context_not_achievement_adjustment |
| foundation | STPM | n_reached_or_valid_count | 1024 | 0.819 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | STPM | n_trailing_nonresponse_rows | 1024 | -0.771 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MQ1-20 | median_item_rt_sec | 998 | -0.462 | rt_context_not_achievement_adjustment |
| foundation | MQ1-20 | n_reached_or_valid_count | 1006 | 0.711 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MQ1-20 | n_trailing_nonresponse_rows | 1006 | -0.656 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MC0-20 | median_item_rt_sec | 995 | -0.83 | rt_context_not_achievement_adjustment |
| foundation | MC0-20 | n_reached_or_valid_count | 1005 | 0.932 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MC0-20 | n_trailing_nonresponse_rows | 1005 | -0.873 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MNC0-20 | median_item_rt_sec | 993 | -0.673 | rt_context_not_achievement_adjustment |
| foundation | MNC0-20 | n_reached_or_valid_count | 1003 | 0.778 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MNC0-20 | n_trailing_nonresponse_rows | 1003 | -0.721 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | DMT10_2026 | median_item_rt_sec | 988 | 0.061 | rt_context_not_achievement_adjustment |
| foundation | DMT10_2026 | n_reached_or_valid_count | 1002 | 0.206 | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | DMT10_2026 | n_trailing_nonresponse_rows | 1002 | | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | BNL0-20 | median_item_rt_sec | 974 | -0.009 | rt_context_not_achievement_adjustment |
| foundation | BNL0-20 | n_reached_or_valid_count | 974 | 0.345 | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | BNL0-20 | n_trailing_nonresponse_rows | 974 | | coverage_or_valid_count_context_not_timed_D_speed |
| year1 | STPM | median_item_rt_sec | 1235 | -0.432 | rt_context_not_achievement_adjustment |
| year1 | STPM | n_reached_or_valid_count | 1256 | 0.821 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | STPM | n_trailing_nonresponse_rows | 1256 | -0.719 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MC0-100 | median_item_rt_sec | 1221 | -0.84 | rt_context_not_achievement_adjustment |
| year1 | MC0-100 | n_reached_or_valid_count | 1235 | 0.932 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MC0-100 | n_trailing_nonresponse_rows | 1235 | -0.865 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MNC0-100 | median_item_rt_sec | 1212 | -0.704 | rt_context_not_achievement_adjustment |
| year1 | MNC0-100 | n_reached_or_valid_count | 1229 | 0.816 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MNC0-100 | n_trailing_nonresponse_rows | 1229 | -0.729 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | AAMC | median_item_rt_sec | 1205 | -0.79 | rt_context_not_achievement_adjustment |
| year1 | AAMC | n_reached_or_valid_count | 1227 | 0.872 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | AAMC | n_trailing_nonresponse_rows | 1227 | -0.768 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | ASMC | median_item_rt_sec | 1199 | -0.584 | rt_context_not_achievement_adjustment |
| year1 | ASMC | n_reached_or_valid_count | 1223 | 0.708 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | ASMC | n_trailing_nonresponse_rows | 1223 | -0.599 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | BNL0-100 | median_item_rt_sec | 1178 | -0.046 | rt_context_not_achievement_adjustment |
| year1 | BNL0-100 | n_reached_or_valid_count | 1178 | 0.225 | coverage_or_valid_count_context_not_timed_D_speed |
| year1 | BNL0-100 | n_trailing_nonresponse_rows | 1178 | | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | STPM_vs_composite | score | 1006 | 0.232 | STPM_is_shadow_non_math_exclude_from_math_score |
| foundation | STPM_vs_composite | median_item_rt_sec | 1004 | -0.389 | STPM_is_shadow_non_math_exclude_from_math_score |
| foundation | STPM_vs_composite | total_rt_sec | 1004 | -0.34 | STPM_is_shadow_non_math_exclude_from_math_score |
| year1 | STPM_vs_composite | score | 1235 | 0.234 | STPM_is_shadow_non_math_exclude_from_math_score |
| year1 | STPM_vs_composite | median_item_rt_sec | 1228 | -0.387 | STPM_is_shadow_non_math_exclude_from_math_score |
| year1 | STPM_vs_composite | total_rt_sec | 1228 | -0.284 | STPM_is_shadow_non_math_exclude_from_math_score |

Reach/trailing correlations are partly mechanical under timed D/trailing-zero scoring. This is exactly why RT/tau should initially remain a shadow response-process layer rather than a direct achievement-band adjustment.

Recommended model ladders

Hierarchical global/subscore ladder

| model_id | purpose | latent_structure | subscores | premodel_status | promotion_gate |
|---|---|---|---|---|---|
| H0_current_operational_candidate | existing global score anchor | one global theta + subtest/testlet residuals u | not teacher-facing; u is nuisance/local-dependence residual | already fitted for inclusive/hard-filtered/sensitivities | retain as anchor while subscore challengers are tested |
| H1_global_plus_subtest_deviations | coherent teacher-facing global score + subscores | global theta; subtest score = global theta + shrunken subtest deviation; no separate nuisance residual for every same subtest initially | yes: report global, subtest posterior means/intervals, and relative deviation labels | recommended first Stan hierarchical subscore challenger | clean HMC, stable subscore posterior SDs, sensible shrinkage, better coherence than standalone subtest IRT, no harmful risk-band movement |
| H2_global_plus_NL_specific_deviation | target Year 1 BNL influence before full subtest expansion | global theta + Number Line-specific deviation/factor; optionally BNL residual fixed/omitted | global + NL profile only | recommended focused challenger if H1 is too broad or BNL remains unstable | keeps BNL contribution without weak BNL residual pathology; validates at least as well as H0 |
| H3_correlated_subtest_thetas | diagnostic upper-bound profile model | one correlated theta per subtest; global score is derived composite | yes but global must be defined after fitting | diagnostic only until feasibility improves; mirt/TAM high-dimensional screens were resource-burdened | only proceed if H1/H2 insufficient and dimensions are stable/interpretable |

Number Line policy ladder

| policy_id | role | model_family | premodel_gate | promotion_gate |
|---|---|---|---|---|
| nl_85_95_current_3cat | benchmark/operational-compatible current policy | ordinal PCM/GPCM categories 0=<.85, 1=.85-.95, 2=>=.95 | must be included as reference in all screens | already lockable as NL2 unless challenger clearly improves validation/fairness/classification |
| nl_80_90_relaxed_3cat | cutoff sensitivity challenger | ordinal 3-category PCM/GPCM | cell counts and target distributions acceptable | less harmful hard-target penalisation plus equal/better validation and risk classification |
| nl_90_97_strict_3cat | strict challenger | ordinal 3-category PCM/GPCM | top category not too sparse item-by-item | only if validation gain offsets expected sparsity/precision loss |
| nl_binary_95 | simple mastery-like sensitivity | binary Rasch/2PL screen | both classes present by item | unlikely to promote unless it improves decision validity despite information loss |
| nl_80_90_95_4cat | higher-resolution ordinal sensitivity | 4-category PCM/GPCM | all item categories have stable counts; thresholds ordered/usable | improved validation/precision without sparse-category pathology |
| continuous_abs_error_logitnormal_or_beta | formal continuous challenger, not TAM/mirt-faithful | mixed response Stan: binary/non-NL accuracy + continuous bounded NL accuracy/error | raw distributions and coordinate calibration pass; proxy validation competitive | material validation/classification/fairness gain over NL2 and clean HMC/PPC |
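
The categorical policies in this ladder can be sketched as a lookup from a Number Line accuracy to an ordinal category, plus per-item cell counts for the premodel gate. Only the current policy's cut points (0=<.85, 1=.85-.95, 2=>=.95) are stated in the ladder; the cut points for the other policies are assumptions read off the policy names, and `POLICIES`, `nl_category`, and `category_counts` are hypothetical names.

```python
from bisect import bisect_right

# Only the .85/.95 cut points are stated in the ladder; the rest are
# assumptions inferred from the policy names.
POLICIES = {
    "nl_85_95_current_3cat": (0.85, 0.95),
    "nl_80_90_relaxed_3cat": (0.80, 0.90),
    "nl_90_97_strict_3cat": (0.90, 0.97),
    "nl_binary_95": (0.95,),
    "nl_80_90_95_4cat": (0.80, 0.90, 0.95),
}

def nl_category(accuracy, policy):
    """Category = number of cut points at or below the accuracy (lower bound inclusive)."""
    return bisect_right(POLICIES[policy], accuracy)

def category_counts(accuracies, policy):
    """Cell counts per category for one item, to screen for sparse or empty cells."""
    counts = [0] * (len(POLICIES[policy]) + 1)
    for a in accuracies:
        counts[nl_category(a, policy)] += 1
    return counts

# Toy accuracies for one item, for illustration only.
accs = [0.5, 0.84, 0.85, 0.9, 0.95, 1.0]
current_counts = category_counts(accs, "nl_85_95_current_3cat")
strict_counts = category_counts(accs, "nl_90_97_strict_3cat")
```

A stricter policy moves mass out of the top category, which is why the strict challenger's gate is about top-category sparsity.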

Accuracy-speed ladder

| model_id | purpose | status | uses_for_score | gate |
|---|---|---|---|---|
| RT0_QC_manifest_speed_descriptives | data-quality, rapid-response, timing-unit, and admin/device checks | recommended before any scoring use | none | no severe RT missingness/unit anomalies in candidate families |
| RT1_selected_family_speed_shadow | selected timed-family tau/pace research with accuracy anchor protected | supported by prior J2b work; rerun on 2026 BOY candidate families if needed | shadow only | tau aligns with RT/rapid behaviour; theta/risk bands not changed operationally |
| RT2_hierarchical_tau_shadow | overall response pace + family residual pace, coherent with teacher profile idea | Stan skeleton exists (J3b hierarchical tau) | shadow only | clean HMC; no subgroup/admin artefact; no achievement-band changes |
| RT3_joint_global_subscore_accuracy_speed | future integrated model after H1 subscore and RT2 pace models are separately stable | not first next fit | research only until validation burden is met | must add information beyond D/trailing-zero and not double-count speed/reach |

Decision gates / next actions

| stream | next_action | must_check_before_fit | must_check_after_fit |
|---|---|---|---|
| hierarchical_subscores | fit H1 Stan global+subtest-deviation model on hard-filtered operational frame | subtest score reliability/correlation/readiness table; avoid writing nuisance residuals as teacher subscores | HMC diagnostics, posterior SD by subscore, shrinkage size, global score movement, risk-band movement, subgroup movement, profile interpretability |
| number_line_policy | run frequentist ordinal cutoff screens using audited candidate policies; keep current .85/.95 as reference | item-by-target category counts and ECDF; reject policies with sparse/empty categories before Stan | threshold behaviour, reliability, score/risk movement, validation/fairness; continuous challenger only after proxy screen |
| accuracy_speed_joint | treat RT as QC/shadow; candidate families are timed non-NL achievement subtests only at first | RT missingness, row RT quantiles, rapid-row accuracy, STPM exclusion, D/trailing-zero double-count risk | tau construct validity, rapid effect direction, theta robustness, admin/subgroup artefacts, no operational risk-band changes |

Written aggregate artifacts

Next-model specification note

Concrete model ladders and gate checks for the next round.

2026 BOY next-model specification notes

Status: pre-fit design note generated after aggregate premodelling audit. Do not treat as an operational scoring decision.

1. Hierarchical global + subscore model

Goal

Produce a coherent global score and teacher-facing subtest subscores, avoiding unrelated standalone subtest IRT scales.

First challenger: H1_global_plus_subtest_deviations

For student p and subtest/domain s:

g_p ~ broad numeracy level
z_ps ~ standard normal residual profile component
delta_ps = sigma_delta_s * z_ps, centered across subtests within student
theta_ps = g_p + delta_ps
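
A minimal sketch of the deviation structure above for one student. It assumes that "centered across subtests within student" means subtracting the student's mean scaled deviation, so the subtest thetas average back to the global level; the note does not say whether centring is applied before or after scaling. The function name and the numbers are illustrative.

```python
def subtest_thetas(g, z, sigma_delta):
    """g: global level g_p; z: standard-normal draws z_ps per subtest;
    sigma_delta: deviation scale sigma_delta_s per subtest."""
    raw = [s * zi for s, zi in zip(sigma_delta, z)]
    mean_raw = sum(raw) / len(raw)
    # Assumption: centre the scaled deviations within student so they sum to zero.
    delta = [d - mean_raw for d in raw]
    theta = [g + d for d in delta]
    return theta, delta

# Illustrative values only.
theta, delta = subtest_thetas(g=0.3, z=[1.0, -0.5, 0.2], sigma_delta=[0.5, 0.4, 0.6])
```

Under this reading the mean of `theta` equals `g`, so the global score and the subscore profile stay on one scale. A small `sigma_delta_s` shrinks that subtest's score towards the global level.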

Binary/timed or untimed non-NL item j in subtest s[j]:

y_pj ~ Bernoulli_logit(theta_p,s[j] - b_j)

Ordinal Number Line item j under a PCM-style policy:

eta_1 = 0
eta_k = eta_{k-1} + theta_p,s[j] - (b_j + step_j,k-1)
y_pj ~ categorical_logit(eta)
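
The PCM-style recursion can be checked numerically with a short sketch: build the cumulative `eta` vector and apply a softmax, which is what `categorical_logit` does to its argument. The function name and example values are illustrative assumptions.

```python
import math

def pcm_probs(theta, b, steps):
    """Category probabilities for one ordinal item.
    theta: theta_p,s[j]; b: item location b_j; steps: step_j,1..K-1."""
    eta = [0.0]  # eta_1 = 0
    for step in steps:
        eta.append(eta[-1] + theta - (b + step))
    m = max(eta)  # subtract the max for a numerically stable softmax
    w = [math.exp(e - m) for e in eta]
    total = sum(w)
    return [x / total for x in w]

# Illustrative 3-category item with theta equal to b and zero steps.
probs = pcm_probs(theta=0.5, b=0.5, steps=[0.0, 0.0])
```

With a single zero step the top-category probability reduces to the inverse logit of `theta - b`, so the ordinal item sits on the same scale as the Bernoulli items.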

Identification/regularisation:

Primary post-fit checks:

1. HMC: 0 divergences, no max-treedepth hits, Rhat/ESS acceptable for g, theta_ps, sigma_delta_s, item parameters.
2. Global movement vs hard-filtered H0: Spearman, median/p95 percentile shift, <15 and 15-35 risk-band movement.
3. Subscore quality: posterior SD by subtest, shrinkage size, profile-deviation stability.
4. Teacher-facing coherence: subscore intervals and relative-strength labels agree with observed subtest evidence without overclaiming.
5. Subgroup/admin movement: no adverse subgroup artefacts.

2. Year 1 BNL residual surgical sensitivity

Keep BNL0-100 items in the global/hierarchical score but do not give BNL an extra nuisance residual variance if the current sigma_u[BNL0-100] remains weak.

Data-side option:

active_testlet_idx[BNL0-100] = 0
active_testlet_idx[other_subtests] = 1..K_active

Likelihood option:

resid = 0 if active_testlet_idx == 0
resid = sigma_u[k] * u_z[p,k] otherwise
theta_eff = theta + resid

This tests whether the issue is the BNL residual component, not the BNL items themselves.
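
The likelihood option can be sketched as below. The sketch assumes `active_testlet_idx` is 0 for a switched-off testlet and otherwise a 1-based index into per-student `sigma_u` and `u_z` values, as the data-side option suggests; the function name and numbers are illustrative.

```python
def theta_eff(theta, active_testlet_idx, sigma_u, u_z):
    """Effective ability for one response.
    active_testlet_idx: 0 = no testlet residual (e.g. BNL0-100 in this sensitivity);
    k >= 1 = 1-based index into sigma_u and this student's u_z."""
    if active_testlet_idx == 0:
        return theta  # items stay in the score, residual term is dropped
    k = active_testlet_idx - 1
    return theta + sigma_u[k] * u_z[k]

# Illustrative values: two active testlets for one student.
sigma_u = [0.4, 0.25]
u_z = [1.0, -2.0]
bnl_value = theta_eff(0.1, 0, sigma_u, u_z)
other_value = theta_eff(0.1, 2, sigma_u, u_z)
```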

3. Number Line policy ladder

Premodelling audit outputs:

Frequentist screens before Stan:

nl_80_90_relaxed_3cat
nl_85_95_current_3cat
nl_90_97_strict_3cat
nl_binary_95
nl_80_90_95_4cat

Promotion burden:

Continuous challenger sketch:

accuracy = 1 - absolute_error / scale_range
accuracy_squeezed = clamp/Smithson-Verkuilen transform into (0,1)
logit(mu_pj) = alpha_j + theta_p,s[j]
accuracy_pj ~ Beta(mu_pj * phi_j, (1 - mu_pj) * phi_j)
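
A minimal sketch of the continuous challenger's likelihood for one response. It uses the Smithson-Verkuilen squeeze `(y * (n - 1) + 0.5) / n` rather than a clamp, and treats `n` as the number of observations entering the transform. That choice, the function names, and the example values are assumptions, not the fitted specification.

```python
import math

def nl_accuracy(response, target, scale_range):
    """accuracy = 1 - absolute_error / scale_range."""
    return 1.0 - abs(response - target) / scale_range

def squeeze(y, n):
    """Smithson-Verkuilen transform: maps [0, 1] into the open interval (0, 1)."""
    return (y * (n - 1) + 0.5) / n

def beta_logpdf(y, mu, phi):
    """Log density of Beta(mu * phi, (1 - mu) * phi) at y."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def nl_loglik(response, target, scale_range, n, alpha_j, theta, phi_j):
    y = squeeze(nl_accuracy(response, target, scale_range), n)
    mu = 1.0 / (1.0 + math.exp(-(alpha_j + theta)))  # logit(mu) = alpha_j + theta
    return beta_logpdf(y, mu, phi_j)

# Illustrative: a response of 45 to target 50 on a 0-100 line.
acc = nl_accuracy(45, 50, 100)
ll = nl_loglik(45, 50, 100, n=1000, alpha_j=2.0, theta=0.5, phi_j=10.0)
```

The squeeze matters because perfect placements give accuracy exactly 1, where the Beta log density is undefined.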

Optional signed-error diagnostic, not first scoring model:

signed_error_scaled_pj ~ Normal(target_bias_j + method_bias_family + ability_slope_j * theta, sigma_j)

4. Accuracy-speed joint modelling ladder

Operational posture: RT is shadow/QC first. Timed D/trailing-zero already encodes reach/time-pressure, so response time can double-count speed if added naively.

Initial 2026 BOY data rule:

Candidate shadow model:

y_pj ~ Bernoulli_logit(theta_p - b_j + gamma_family * rapid_pj)
logRT_pj ~ Normal(beta0 + beta_j - tau_p,family[j], sigma_rt_family)

Hierarchical pace extension:

tau_p,f = tau_overall_p + tau_residual_p,f
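
A minimal sketch of the response-time side of the shadow model, reading the RT line as a normal density on log RT (equivalently a log-normal on RT) with pace split into overall and family-residual parts. Function names and values are illustrative assumptions. The density is on the log scale and omits the Jacobian term.

```python
import math

def expected_log_rt(beta0, beta_j, tau_overall, tau_residual):
    """Mean of log RT: beta0 + beta_j - tau_p,f with tau_p,f = tau_overall_p + tau_residual_p,f."""
    return beta0 + beta_j - (tau_overall + tau_residual)

def log_rt_logpdf(rt_sec, beta0, beta_j, tau_overall, tau_residual, sigma_rt):
    """Normal log density of log(rt_sec) around the expected log RT."""
    mu = expected_log_rt(beta0, beta_j, tau_overall, tau_residual)
    z = (math.log(rt_sec) - mu) / sigma_rt
    return -0.5 * z * z - math.log(sigma_rt) - 0.5 * math.log(2.0 * math.pi)

# Illustrative: a faster student (higher tau) has a lower median RT on the same item.
slow_median = math.exp(expected_log_rt(2.0, 0.3, 0.0, 0.0))
fast_median = math.exp(expected_log_rt(2.0, 0.3, 0.5, 0.2))
```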

Pre-fit checks already written:

Do not use RT/tau to alter risk bands unless later evidence shows robust validation gain, no subgroup/admin artefact, and added information beyond D/reach.

Full modelling review memo

Rendered from the saved Markdown decision artifact.

2026 BOY operational accuracy + Number Line candidate — modelled job review

Review timestamp: 2026-06-14 UTC

Compute / sync status

All AWS model jobs are complete. There are no active EC2 instances matching the 2026 BOY operational Number Line model tags, no active cisbox rsync sessions, and the local sensitivity monitor was stopped after all six sensitivity .done markers were present.

Final outstanding run (year1_no_BNL0_100) is synced, checksum-verified, recovered from the known no-NL post-processing failure, and its EC2 instance was terminated.

Reviewed jobs

The review covers 10 Stan jobs:

1. Foundation inclusive baseline.
2. Year 1 inclusive baseline.
3. Foundation hard-item-filtered baseline.
4. Year 1 hard-item-filtered baseline.
5. Foundation sensitivity: no DMT10_2026.
6. Foundation sensitivity: no MQ1-20 and no DMT10_2026.
7. Foundation sensitivity: no BNL0-20.
8. Year 1 sensitivity: no MC0-100.
9. Year 1 sensitivity: no BNL0-100.
10. Year 1 sensitivity: core model with no MC and no NL.

Source output base:

/data/numeracy-screening-models/irt/2026_boy_operational_accuracy_nl_candidate

Local review artifacts:

outputs/runs/irt-2026-boy-subtest-audit/latest/reports/model_review/stan_review_summary.md
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_job_diagnostic_summary.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_score_movement_comparisons.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_testlet_sigma_summary_long.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_item_difficulty_extreme_or_diagnostic_flags.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_u_residual_diagnostic_summary.csv

Completion and sampler diagnostics

All 10 jobs have successful MCMC sampling evidence:

Three no-NL-style jobs exited with Stan runner exitcode 1 because of the known post-processing bug for empty/missing NL lookup files, not because of sampler failure:

All three were recovered from QC summaries and now have final score, item, testlet, and fit-readout files.

Job-level diagnostic table

| job | exit | postprocess | verify | div | treedepth hits | min EBFMI | theta max Rhat / min ESS | testlet max Rhat / min ESS | note |
|---|---|---|---|---|---|---|---|---|---|
| Foundation inclusive | 0 | completed | 155/155 | 0 | 0 | 0.705 | 1.004 / 5066 | 1.006 / 1173 | clean |
| Year 1 inclusive | 0 | completed | 155/155 | 0 | 0 | 0.614 | 1.004 / 941 | 1.023 / 109 | weak BNL0-100 testlet sigma |
| Foundation hard-filtered | 0 | completed | 1955/1955 | 0 | 0 | 0.677 | 1.003 / 4051 | 1.003 / 1081 | clean |
| Year 1 hard-filtered | 0 | completed | 1955/1955 | 0 | 0 | 0.646 | 1.006 / 1504 | 1.068 / 78 | weak BNL0-100 testlet sigma |
| Foundation no DMT10_2026 | 0 | completed | 2104/2104 | 0 | 0 | 0.694 | 1.003 / 3337 | 1.009 / 482 | clean |
| Foundation no MQ1-20/no DMT10_2026 | 0 | completed | 2104/2104 | 0 | 0 | 0.727 | 1.002 / 6115 | 1.007 / 585 | clean |
| Foundation no BNL0-20 | 1 | recovered | 2098/2098 | 0 | 0 | 0.667 | 1.002 / 4894 | 1.004 / 1221 | sampling clean; postprocess recovered |
| Year 1 no MC0-100 | 0 | completed | 2104/2104 | 0 | 0 | 0.647 | 1.002 / 4875 | 1.004 / 670 | clean |
| Year 1 no BNL0-100 | 1 | recovered | 2098/2098 | 0 | 0 | 0.598 | 1.003 / 3313 | 1.003 / 1668 | sampling clean; postprocess recovered |
| Year 1 no MC/no NL | 1 | recovered | 2098/2098 | 0 | 0 | 0.568 | 1.002 / 4612 | 1.003 / 1925 | sampling clean; postprocess recovered |

Main diagnostic finding

The global Year 1 baseline is usable from a sampler perspective, but the BNL0-100 testlet residual scale is weakly identified:

This issue is local to the BNL0-100 residual/testlet component. It does not show up as divergent transitions, treedepth failures, poor theta mixing, or item-difficulty non-convergence. It does show up in the latent residuals for the same component: in the hard-filtered Year 1 run, u[,5] corresponds to BNL0-100, and 1193/1221 residual terms had Rhat > 1.01, with max Rhat ~1.026. The likely interpretation is that the residual BNL0-100 testlet variance is near a boundary/small value and is hard for the sampler to estimate, while the BNL0-100 items themselves carry substantial global-theta information.

Auxiliary u residual diagnostic

job | testlet | residual terms | Rhat > 1.01 | ESS < 400 | max Rhat | min ESS | interpretation
Year 1 inclusive | BNL0-100 | 1221 | 0 | 3 | 1.009 | 320 | minor low-ESS nuisance terms
Year 1 hard-filtered | BNL0-100 | 1221 | 1193 | 2 | 1.026 | 279 | broad residual-component mixing issue tied to BNL testlet

No other job/testlet had u residual terms with Rhat > 1.01 or ESS_bulk < 400. This reinforces that the caveat is localized to Year 1 BNL0-100 dependence modelling, not to the global theta score or item difficulty estimates.

Hard-filtered vs inclusive baseline

The hard-item filter removes the 70 predeclared no-information items and has negligible impact on student ranking/risk classification.

comparison | n | Spearman | median abs percentile shift | p95 shift | exact 3-band agreement | very-low Jaccard | low+very-low Jaccard | moved out/in, very-low | moved out/in, low+very-low
Foundation inclusive vs hard-filtered | 997 | 1.000 | 0.30 pp | 1.40 pp | 99.0% | 0.974 | 0.983 | 2 / 2 | 3 / 3
Year 1 inclusive vs hard-filtered | 1221 | 0.999 | 0.74 pp | 2.62 pp | 98.5% | 0.968 | 0.972 | 3 / 3 | 6 / 6

Conclusion: hard-item-filtered should be the working operational baseline. The inclusive runs are useful historical evidence but should not be promoted over the filtered version.
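
The comparison columns above can be sketched as follows. This is an illustration, not the project's scoring code: it assumes "very-low" means below the 15th cohort percentile and "low" means the 15th to 35th percentile (the bands named in the recommendations), uses mid-rank percentiles within each run, and takes the p95 shift by the nearest-rank method. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from math import ceil, sqrt
from statistics import median

VERY_LOW_CUT = 15.0  # assumption: very-low = below the 15th percentile
LOW_CUT = 35.0       # assumption: low = 15th to 35th percentile


def percentile_ranks(scores):
    """Mid-rank percentile (0-100) of each score within its own cohort."""
    ordered = sorted(scores)
    n = len(ordered)
    return [50.0 * (bisect_left(ordered, s) + bisect_right(ordered, s)) / n
            for s in scores]


def band(pct):
    """Map a percentile to one of three risk bands."""
    if pct < VERY_LOW_CUT:
        return "very-low"
    if pct < LOW_CUT:
        return "low"
    return "other"


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def jaccard(a, b):
    """Jaccard overlap of two sets of student indices."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compare(base_scores, alt_scores):
    """Score-movement summary for the same students under two model runs."""
    p0, p1 = percentile_ranks(base_scores), percentile_ranks(alt_scores)
    shifts = sorted(abs(a - b) for a, b in zip(p0, p1))
    b0, b1 = [band(p) for p in p0], [band(p) for p in p1]
    vl0 = {i for i, b in enumerate(b0) if b == "very-low"}
    vl1 = {i for i, b in enumerate(b1) if b == "very-low"}
    lo0 = {i for i, b in enumerate(b0) if b != "other"}
    lo1 = {i for i, b in enumerate(b1) if b != "other"}
    return {
        "spearman": pearson(p0, p1),  # Pearson on mid-ranks equals Spearman
        "median_shift_pp": median(shifts),
        "p95_shift_pp": shifts[ceil(0.95 * len(shifts)) - 1],  # nearest rank
        "band_agreement": sum(x == y for x, y in zip(b0, b1)) / len(b0),
        "very_low_jaccard": jaccard(vl0, vl1),
        "low_or_very_low_jaccard": jaccard(lo0, lo1),
        "moved_out_in_very_low": (len(vl0 - vl1), len(vl1 - vl0)),
    }


# Toy data: 20 students, with two students swapping places at the 15% cut.
base = [float(i) for i in range(20)]
alt = list(base)
alt[2], alt[3] = alt[3], alt[2]
result = compare(base, alt)
print(result["very_low_jaccard"], result["moved_out_in_very_low"])
```

In the toy data one student leaves the very-low band and one enters it, so the very-low Jaccard drops to 0.5 while the low+very-low Jaccard stays at 1.0. The same mechanism explains why a small out/in count in the table still lowers the Jaccard value slightly.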

Sensitivity findings vs hard-filtered baseline

Foundation

sensitivity | n | Spearman | median shift | p95 shift | 3-band agreement | very-low Jaccard | low+very-low Jaccard | interpretation
no DMT10_2026 | 997 | 0.935 | 5.72 pp | 21.00 pp | 85.2% | 0.703 | 0.758 | DMT contributes materially; removal is not classification-stable.
no MQ1-20 and no DMT10_2026 | 995 | 0.825 | 9.95 pp | 35.68 pp | 76.1% | 0.520 | 0.642 | Removing both early quantity/decomposition content substantially changes the score.
no BNL0-20 | 997 | 0.865 | 8.02 pp | 32.32 pp | 77.5% | 0.505 | 0.661 | Foundation Number Line is highly influential and improves precision.

Foundation interpretation:

Year 1

sensitivity | n | Spearman | median shift | p95 shift | 3-band agreement | very-low Jaccard | low+very-low Jaccard | interpretation
no MC0-100 | 1211 | 0.993 | 1.82 pp | 6.77 pp | 96.2% | 0.905 | 0.936 | Removing MC has modest impact; MC is not the main source of instability.
no BNL0-100 | 1221 | 0.768 | 11.88 pp | 39.31 pp | 70.3% | 0.402 | 0.547 | Removing BNL radically changes rankings/risk bands and greatly increases uncertainty.
no MC/no NL | 1211 | 0.739 | 13.46 pp | 40.42 pp | 68.5% | 0.382 | 0.519 | Core-only score differs substantially from the full hard-filtered candidate.

Year 1 interpretation:

Frequentist model-rung context

Frequentist pre-screening remains consistent with the Stan review:

Therefore, the current Stan evidence should be interpreted within a 1D+testlet operational-candidate frame, not as support for immediate multidimensional/bifactor escalation.

Recommendations

1. Promote the hard-item-filtered model frame as the working baseline for final reporting comparisons. The hard filter removes no-information items with near-zero impact on student scores/risk bands.

2. Foundation: keep BNL0-20 and DMT10_2026 in the operational candidate. Both materially affect risk identification; the Foundation hard-filtered Stan run is diagnostically clean.

3. Year 1: do not drop BNL0-100 based on the sigma diagnostic alone. Removing it causes major movement and loss of precision. Treat the issue as a localized residual-scale estimation problem, not a failed global score.

4. Run or design one surgical Year 1 sensitivity if final promotion requires clearing the sigma caveat: keep BNL0-100 items in the global score but omit/fix the BNL0-100 testlet residual scale. This directly tests whether the weak sigma parameter is harmless. This is more informative than a no-BNL model, which changes both construct coverage and precision.

5. Complete external validation and subgroup movement checks before final operational lock-in. Compare hard-filtered baseline and key sensitivities against PAT/teacher outcomes and demographic/school subgroup stability, with priority on the <15th and 15th–35th percentile bands.

6. Update the audit/report package. Add sections for item eligibility, hard-filtered vs inclusive comparison, frequentist model rungs, Stan sensitivity results, and the Year 1 BNL0-100 decision caveat.
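
The contrast proposed in recommendation 4 can be written out as a sketch. The source does not state the exact likelihood, so the Rasch-type form for binary items, the logit link, the absence of discrimination parameters, and all symbols below are assumptions made for illustration only. The testlet index 5 follows the u[,5] labelling of BNL0-100 in the hard-filtered Year 1 run.

$$
\text{Baseline:}\quad \operatorname{logit}\Pr(y_{ij}=1) = \theta_i - b_j + u_{i,t(j)}, \qquad u_{i,t} \sim \mathcal{N}(0,\ \sigma_t^2)
$$

$$
\text{Surgical sensitivity:}\quad u_{i,5} \equiv 0 \ \ (\text{equivalently } \sigma_5 = 0), \qquad \text{all BNL0-100 items } j \text{ with } t(j)=5 \text{ retained}
$$

Here $\theta_i$ is the global score for student $i$, $b_j$ is the difficulty of item $j$, and $t(j)$ is the testlet containing item $j$. The "fix" variant would instead hold $\sigma_5$ at a small constant. Because the BNL0-100 items stay in the likelihood, any score movement relative to the hard-filtered baseline can be attributed to the residual term rather than to a change in construct coverage.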

Proposed immediate next steps

1. Add the generated model-review tables to the unified audit HTML/report.

2. Build a final score-movement table with student-level risk-band transitions for the hard baseline vs the three most important sensitivity contrasts:

3. Run outcome validation comparisons for the hard baseline and sensitivity variants.

4. Review Year 1 BNL0-100 item-level diagnostics:

5. Decide whether to run the surgical Year 1 BNL-included/no-BNL-testlet-residual Stan sensitivity.

6. Draft the operational recommendation: