Review of inclusive, hard-item-filtered, and targeted sensitivity Stan jobs for the 2026 BOY operational accuracy + Number Line candidate.
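BNL0-100 needs a targeted residual/testlet sensitivity run before final lock-in: its testlet SD mixes poorly in the year1 jobs (R-hat 1.023 inclusive and 1.068 hard-filtered, with bulk ESS of about 109 and 78).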
New aggregate audit work documents the evidence base needed before fitting hierarchical global+subscore models, Number Line cutoff challengers, or accuracy-response-time models.
Hierarchical subscores
Recommended next Stan challenger: global numeracy plus shrunken subtest deviations. This is preferred over unrelated standalone subscores.
Number Line policy
Cutoff policies are now cell-count audited: relaxed, current, strict, binary, and 4-category ordinal options.
Speed / RT
RT remains QC and response-process context. Timed D already encodes reach/time pressure, so speed should not alter live bands yet.
Standalone subtest evidence is uneven; weaker/moderate subscores are the main reason to use hierarchical shrinkage.
| year level | test subgroup | n items keep hard filter | standalone eap reliability or alpha proxy | reliability band | spearman with other subtest composite | hierarchical subscore posture | premodel risk flags |
|---|---|---|---|---|---|---|---|
| foundation | MQ1-20 | 19 | 0.6021 | weak | 0.4256 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | weak_standalone_reliability;sparse_nonconstant_items_retained |
| foundation | MC0-20 | 50 | 0.9269 | strong | 0.5315 | strong standalone signal; still prefer hierarchical coherence with global score | sparse_nonconstant_items_retained |
| foundation | MNC0-20 | 24 | 0.8813 | strong | 0.6039 | strong standalone signal; still prefer hierarchical coherence with global score | sparse_nonconstant_items_retained |
| foundation | DMT10_2026 | 8 | 0.6093 | weak | 0.4302 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | weak_standalone_reliability;few_calibration_items |
| foundation | BNL0-20 | 10 | 0.6743 | weak | 0.3535 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore | weak_standalone_reliability;number_line_policy_sensitive |
| year1 | MC0-100 | 34 | 0.9404 | strong | 0.6945 | strong standalone signal; still prefer hierarchical coherence with global score | sparse_nonconstant_items_retained |
| year1 | MNC0-100 | 22 | 0.8912 | strong | 0.7621 | strong standalone signal; still prefer hierarchical coherence with global score | sparse_nonconstant_items_retained |
| year1 | AAMC | 38 | 0.9004 | strong | 0.7288 | strong standalone signal; still prefer hierarchical coherence with global score | sparse_nonconstant_items_retained |
| year1 | ASMC | 25 | 0.8415 | moderate | 0.6173 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic | moderate_reliability;floor_rate_ge_10pct;sparse_nonconstant_items_retained |
| year1 | BNL0-100 | 13 | 0.7272 | moderate | 0.5481 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic | moderate_reliability;number_line_policy_sensitive |
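
The case for shrinkage in the table above can be illustrated with a Kelley-style regressed score. This is only a minimal sketch, not the H1 Stan model: it assumes the standalone reliability can be used directly as the shrinkage weight on a subtest's deviation from the global score, whereas a hierarchical model would estimate the amount of shrinkage from the data. The function name and the example student are hypothetical; the two reliabilities are taken from the MQ1-20 and MC0-20 rows.

```python
def shrink_deviation(sub_z, global_z, reliability):
    """Kelley-style shrinkage of a subtest deviation toward zero.

    Assumption (illustrative only): the standalone reliability is used as the
    shrinkage weight; a fitted hierarchical model would learn this weight.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be in [0, 1]")
    return reliability * (sub_z - global_z)


# Hypothetical student: global z = -0.5, standalone subtest z = -1.5.
weak = shrink_deviation(-1.5, -0.5, 0.6021)    # MQ1-20 reliability (weak band)
strong = shrink_deviation(-1.5, -0.5, 0.9269)  # MC0-20 reliability (strong band)
print(round(weak, 4), round(strong, 4))
```

The same raw gap of one z-unit is pulled much closer to zero for the weak subtest than for the strong one, which is the behaviour the "hierarchical_shrinkage_required" posture is asking for.
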
Raw coordinate-derived category counts by candidate policy. Current .85/.95 remains the benchmark.
| year level | test subgroup | policy id | cutoffs | n items | items all categories cell ok | share items cell ok | median min category pct | median top category pct | median entropy normalized | premodel policy posture |
|---|---|---|---|---|---|---|---|---|---|---|
| foundation | BNL0-20 | nl_80_90_95_4cat | 0.8;0.9;0.95 | 10 | 9 | 0.9 | 0.1565 | 0.2526 | 0.9547 | higher-resolution_challenger; use_only_if_item_category_cells_are_stable |
| foundation | BNL0-20 | nl_80_90_relaxed_3cat | 0.8;0.9 | 10 | 9 | 0.9 | 0.206 | 0.4803 | 0.9384 | relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets |
| foundation | BNL0-20 | nl_85_95_current_3cat | 0.85;0.95 | 10 | 10 | 1 | 0.1822 | 0.2526 | 0.9296 | benchmark_current_policy; keep as reference in all modelling |
| foundation | BNL0-20 | nl_90_97_strict_3cat | 0.9;0.97 | 10 | 10 | 1 | 0.136 | 0.1464 | 0.8726 | strict_challenger; reject_if_top_category_sparse_or_validation_not_better |
| foundation | BNL0-20 | nl_binary_95 | 0.95 | 10 | 10 | 1 | 0.2526 | 0.2526 | 0.8154 | modelable_if_cells_ok_but_loses_partial-credit_information |
| year1 | BNL0-100 | nl_80_90_95_4cat | 0.8;0.9;0.95 | 13 | 13 | 1 | 0.1944 | 0.2643 | 0.9894 | higher-resolution_challenger; use_only_if_item_category_cells_are_stable |
| year1 | BNL0-100 | nl_80_90_relaxed_3cat | 0.8;0.9 | 13 | 13 | 1 | 0.2441 | 0.4873 | 0.9536 | relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets |
| year1 | BNL0-100 | nl_85_95_current_3cat | 0.85;0.95 | 13 | 13 | 1 | 0.2643 | 0.2643 | 0.9799 | benchmark_current_policy; keep as reference in all modelling |
| year1 | BNL0-100 | nl_90_97_strict_3cat | 0.9;0.97 | 13 | 13 | 1 | 0.1624 | 0.1624 | 0.9129 | strict_challenger; reject_if_top_category_sparse_or_validation_not_better |
| year1 | BNL0-100 | nl_binary_95 | 0.95 | 13 | 13 | 1 | 0.2643 | 0.2643 | 0.8331 | modelable_if_cells_ok_but_loses_partial-credit_information |
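
As a rough illustration of how one cutoff policy turns a coordinate-derived accuracy into categories, the sketch below scores a single item and reports the per-item quantities that the table summarises across items. It assumes accuracy is a proportion in [0, 1] and that a response earns one category step for each cutoff it meets or exceeds; the function names and the example accuracies are hypothetical, and the audit's exact definitions may differ.

```python
import math
from bisect import bisect_right
from collections import Counter


def categorize(accuracy, cutoffs):
    """Ordinal category = number of cutoffs the accuracy meets (>=)."""
    return bisect_right(sorted(cutoffs), accuracy)


def category_summary(accuracies, cutoffs):
    """Per-item category shares, smallest share, top share, normalized entropy."""
    n_cat = len(cutoffs) + 1
    counts = Counter(categorize(a, cutoffs) for a in accuracies)
    shares = [counts.get(k, 0) / len(accuracies) for k in range(n_cat)]
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return {
        "shares": shares,
        "min_category_pct": min(shares),
        "top_category_pct": shares[-1],
        "entropy_normalized": entropy / math.log(n_cat),
    }


# Hypothetical accuracies for one item under the current .85/.95 policy.
item = [0.50, 0.80, 0.86, 0.90, 0.95, 0.99]
summary = category_summary(item, cutoffs=(0.85, 0.95))
print(summary)
```

Under this reading, a policy fails the cell check for an item when `min_category_pct` falls below whatever minimum cell size the audit enforces, and the binary policy always has a normalized entropy computed over two categories only.
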
TAM screens were run on cisbox for all ordinal/binary cutoff policies. Bounded mirt 1D screens were run for the current, relaxed, and 4-category policies as a secondary check.
The relaxed .80/.90 and 4-category .80/.90/.95 policies look like the best ordinal challengers; binary >=.95 loses partial-credit information; the continuous Number Line model remains a formal Stan challenger, not yet a replacement.
TAM full-battery
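Current, relaxed, strict, binary, and 4-category screens all fit; global movement vs current is modest (full-battery Spearman theta of at least 0.98 and exact band agreement of at least 0.91 in both years).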
NL-only reliability
Relaxed and 4-category policies improve Number-Line-only reliability relative to current in both years; strict/binary weaken it.
mirt 1D
Most bounded mirt fits (5 of 6) did not converge within 300 EM cycles, so they serve as a sensitivity check only; the extracted score movement was tiny.
| year level | scope | policy id | status | n persons | n items | eap reliability | AIC | BIC | notes |
|---|---|---|---|---|---|---|---|---|---|
| foundation | full_battery | nl_80_90_relaxed_3cat | fit_ok | 997 | 111 | 0.9157 | 62,638 | 63,236 | |
| foundation | number_line_only | nl_80_90_relaxed_3cat | fit_ok | 974 | 10 | 0.6839 | 16,351 | 16,453 | |
| foundation | full_battery | nl_85_95_current_3cat | fit_ok | 997 | 111 | 0.9139 | 63,905 | 64,503 | |
| foundation | number_line_only | nl_85_95_current_3cat | fit_ok | 974 | 10 | 0.6755 | 17,664 | 17,766 | |
| foundation | full_battery | nl_90_97_strict_3cat | fit_ok | 997 | 111 | 0.9077 | 63,363 | 63,962 | |
| foundation | number_line_only | nl_90_97_strict_3cat | fit_ok | 974 | 10 | 0.6014 | 16,979 | 17,081 | |
| foundation | full_battery | nl_binary_95 | fit_ok | 997 | 111 | 0.9171 | 55,120 | 55,670 | |
| foundation | number_line_only | nl_binary_95 | fit_ok | 974 | 10 | 0.5153 | 10,024 | 10,078 | |
| foundation | full_battery | nl_80_90_95_4cat | fit_ok | 997 | 111 | 0.9061 | 69,656 | 70,304 | |
| foundation | number_line_only | nl_80_90_95_4cat | fit_ok | 974 | 10 | 0.6893 | 22,246 | 22,397 | |
| year1 | full_battery | nl_80_90_relaxed_3cat | fit_ok | 1,221 | 132 | 0.9563 | 90,385 | 91,130 | |
| year1 | number_line_only | nl_80_90_relaxed_3cat | fit_ok | 1,178 | 13 | 0.7578 | 29,088 | 29,225 | |
| year1 | full_battery | nl_85_95_current_3cat | fit_ok | 1,221 | 132 | 0.9518 | 92,022 | 92,768 | |
| year1 | number_line_only | nl_85_95_current_3cat | fit_ok | 1,178 | 13 | 0.7279 | 29,992 | 30,129 | |
| year1 | full_battery | nl_90_97_strict_3cat | fit_ok | 1,221 | 132 | 0.949 | 89,902 | 90,647 | |
| year1 | number_line_only | nl_90_97_strict_3cat | fit_ok | 1,178 | 13 | 0.6712 | 27,566 | 27,703 | |
| year1 | full_battery | nl_binary_95 | fit_ok | 1,221 | 132 | 0.9529 | 75,826 | 76,505 | |
| year1 | number_line_only | nl_binary_95 | fit_ok | 1,178 | 13 | 0.54 | 15,845 | 15,916 | |
| year1 | full_battery | nl_80_90_95_4cat | fit_ok | 1,221 | 132 | 0.9488 | 102,866 | 103,678 | |
| year1 | number_line_only | nl_80_90_95_4cat | fit_ok | 1,178 | 13 | 0.7579 | 38,492 | 38,695 | |
| year level | scope | comparison | n | spearman theta | median abs pctile shift | p95 abs pctile shift | band exact agreement | very low jaccard |
|---|---|---|---|---|---|---|---|---|
| foundation | full_battery | nl_80_90_relaxed_3cat vs nl_85_95_current_3cat | 997 | 0.9909 | 0.0226 | 0.081 | 0.9398 | 0.8742 |
| foundation | full_battery | nl_90_97_strict_3cat vs nl_85_95_current_3cat | 997 | 0.991 | 0.0211 | 0.0813 | 0.9438 | 0.8861 |
| foundation | full_battery | nl_binary_95 vs nl_85_95_current_3cat | 997 | 0.9813 | 0.0326 | 0.1129 | 0.9178 | 0.8395 |
| foundation | full_battery | nl_80_90_95_4cat vs nl_85_95_current_3cat | 997 | 0.9882 | 0.0241 | 0.0928 | 0.9238 | 0.8395 |
| foundation | number_line_only | nl_80_90_relaxed_3cat vs nl_85_95_current_3cat | 974 | 0.9288 | 0.0675 | 0.2232 | 0.8542 | 0.6627 |
| foundation | number_line_only | nl_90_97_strict_3cat vs nl_85_95_current_3cat | 974 | 0.9262 | 0.0647 | 0.2266 | 0.8501 | 0.6686 |
| foundation | number_line_only | nl_binary_95 vs nl_85_95_current_3cat | 974 | 0.9026 | 0.0688 | 0.2599 | 0.7793 | 0.4703 |
| foundation | number_line_only | nl_80_90_95_4cat vs nl_85_95_current_3cat | 974 | 0.9718 | 0.0416 | 0.1439 | 0.8973 | 0.7711 |
| year1 | full_battery | nl_80_90_relaxed_3cat vs nl_85_95_current_3cat | 1,221 | 0.9948 | 0.0172 | 0.0622 | 0.9419 | 0.8579 |
| year1 | full_battery | nl_90_97_strict_3cat vs nl_85_95_current_3cat | 1,221 | 0.995 | 0.016 | 0.0581 | 0.9484 | 0.8769 |
| year1 | full_battery | nl_binary_95 vs nl_85_95_current_3cat | 1,221 | 0.9898 | 0.0242 | 0.0852 | 0.9263 | 0.7902 |
| year1 | full_battery | nl_80_90_95_4cat vs nl_85_95_current_3cat | 1,221 | 0.9936 | 0.0192 | 0.0672 | 0.9345 | 0.8535 |
| year1 | number_line_only | nl_80_90_relaxed_3cat vs nl_85_95_current_3cat | 1,178 | 0.9436 | 0.0577 | 0.2055 | 0.8659 | 0.6635 |
| year1 | number_line_only | nl_90_97_strict_3cat vs nl_85_95_current_3cat | 1,178 | 0.9384 | 0.0641 | 0.2017 | 0.8489 | 0.6479 |
| year1 | number_line_only | nl_binary_95 vs nl_85_95_current_3cat | 1,178 | 0.8975 | 0.0781 | 0.2681 | 0.7674 | 0.542 |
| year1 | number_line_only | nl_80_90_95_4cat vs nl_85_95_current_3cat | 1,178 | 0.9751 | 0.0352 | 0.1317 | 0.8973 | 0.7241 |
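
The movement columns in this comparison table can be read through a small sketch like the one below. It is a hedged illustration, not the pipeline's code: it assumes percentiles are mid-rank percentiles within the common sample and uses hypothetical band cuts (below the 10th percentile for "very low", below the 25th for "low"), since the operational cuts are not stated on this page. All function names are hypothetical.

```python
from statistics import median


def percentile_ranks(values):
    """Mid-rank percentile in (0, 1) for each value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [(r - 0.5) / len(values) for r in ranks]


def band(pctile, very_low=0.10, low=0.25):
    # Hypothetical cuts, chosen only to make the example concrete.
    return "very_low" if pctile < very_low else "low" if pctile < low else "typical"


def compare(theta_a, theta_b):
    """Score movement between two scorings of the same students (same order)."""
    pa, pb = percentile_ranks(theta_a), percentile_ranks(theta_b)
    bands_a, bands_b = [band(p) for p in pa], [band(p) for p in pb]
    vl_a = {i for i, b in enumerate(bands_a) if b == "very_low"}
    vl_b = {i for i, b in enumerate(bands_b) if b == "very_low"}
    union = vl_a | vl_b
    return {
        "median_abs_pctile_shift": median(abs(x - y) for x, y in zip(pa, pb)),
        "band_exact_agreement": sum(a == b for a, b in zip(bands_a, bands_b)) / len(pa),
        "very_low_jaccard": len(vl_a & vl_b) / len(union) if union else 1.0,
    }


# Hypothetical thetas: the two lowest students swap places under the challenger.
current = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
challenger = [-1.5, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
result = compare(current, challenger)
print(result)
```

The toy case shows why the Jaccard column is the most sensitive one: a single swap at the bottom leaves the median shift at zero and band agreement high, yet the very-low sets no longer overlap at all.
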
| year level | policy id | scope | status | converged | n persons | n items | notes |
|---|---|---|---|---|---|---|---|
| foundation | nl_80_90_relaxed_3cat | full_battery | fit_ok | FALSE | 997 | 111 | 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model |
| foundation | nl_85_95_current_3cat | full_battery | fit_ok | FALSE | 997 | 111 | 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model |
| foundation | nl_80_90_95_4cat | full_battery | fit_ok | FALSE | 997 | 111 | 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model |
| year1 | nl_80_90_relaxed_3cat | full_battery | fit_ok | FALSE | 1,221 | 132 | 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model |
| year1 | nl_85_95_current_3cat | full_battery | fit_ok | FALSE | 1,221 | 132 | 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model |
| year1 | nl_80_90_95_4cat | full_battery | fit_ok | TRUE | 1,221 | 132 | 1D flexible 2PL/GPCM bounded cutoff screen; not a multidimensional promotion model |
| year level | scope | comparison | n | spearman theta | median abs pctile shift | p95 abs pctile shift | band exact agreement | very low jaccard |
|---|---|---|---|---|---|---|---|---|
| foundation | full_battery | nl_80_90_relaxed_3cat vs nl_85_95_current_3cat | 997 | 0.9999 | 0.002 | 0.009 | 0.992 | 0.9735 |
| foundation | full_battery | nl_80_90_95_4cat vs nl_85_95_current_3cat | 997 | 1 | 0.001 | 0.006 | 0.998 | 1 |
| year1 | full_battery | nl_80_90_relaxed_3cat vs nl_85_95_current_3cat | 1,221 | 0.9994 | 0.0049 | 0.0221 | 0.9771 | 0.9365 |
| year1 | full_battery | nl_80_90_95_4cat vs nl_85_95_current_3cat | 1,221 | 0.9998 | 0.0033 | 0.0131 | 0.9836 | 0.9572 |
Observed/reached timed rows have RT available; high presented-row missingness is largely trailing unreached D-zero rows.
| year level | test subgroup | role | is timed | observed or coordinate rt missing rate | presented row rt missing or negative rate | trailing nonresponse rate | row rt p50 | pct rt lt 1 | initial joint model role | rt readiness flags |
|---|---|---|---|---|---|---|---|---|---|---|
| foundation | BNL0-20 | achievement_primary | False | 0 | 0.0199 | 0 | 7 | 0.0077 | nl_rt_context_only_initially_not_accuracy_speed_scoring | none_obvious_from_row_rt_audit |
| foundation | DMT10_2026 | achievement_primary | False | 0 | 0.0146 | 0 | 16 | 0.0002 | untimed_or_other_context_only_initially | none_obvious_from_row_rt_audit |
| foundation | MC0-20 | achievement_primary | True | 0 | 0.7501 | 0.7402 | 6 | 0.0042 | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | MNC0-20 | achievement_primary | True | 0 | 0.7611 | 0.7511 | 12 | 0.0045 | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | MQ1-20 | achievement_primary | True | 0 | 0.8417 | 0.8328 | 20 | 0.0082 | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | STPM | shadow_speed_only | True | | 0.0625 | 0.0521 | 8 | 0.0013 | shadow_speed_only_exclude_from_math_achievement | presented_row_rt_missing_or_negative_gt_5pct |
| year1 | AAMC | achievement_primary | True | 0 | 0.8024 | 0.7845 | 9 | 0.0053 | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | ASMC | achievement_primary | True | 0 | 0.7762 | 0.7559 | 12 | 0.005 | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | BNL0-100 | achievement_primary | False | 0 | 0.0377 | 0 | 5 | 0.0065 | nl_rt_context_only_initially_not_accuracy_speed_scoring | none_obvious_from_row_rt_audit |
| year1 | MC0-100 | achievement_primary | True | 0 | 0.7711 | 0.7599 | 6 | 0.0044 | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | MNC0-100 | achievement_primary | True | 0 | 0.729 | 0.7145 | 11 | 0.0047 | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | STPM | shadow_speed_only | True | | 0.0641 | 0.0443 | 6 | 0.001 | shadow_speed_only_exclude_from_math_achievement | presented_row_rt_missing_or_negative_gt_5pct |
| year level | test subgroup | j2b style rapid rate | mean accuracy rapid rows | mean accuracy nonrapid rows | rapid minus nonrapid accuracy | interpretation |
|---|---|---|---|---|---|---|
| foundation | MC0-20 | 0.0597 | 0.6363 | 0.9188 | -0.2826 | rapid rows should remain diagnostic/shadow unless validated; not a motivation label |
| foundation | MNC0-20 | 0.0396 | 0.0877 | 0.7537 | -0.666 | rapid rows should remain diagnostic/shadow unless validated; not a motivation label |
| foundation | MQ1-20 | 0.0308 | 0.0544 | 0.6558 | -0.6014 | rapid rows should remain diagnostic/shadow unless validated; not a motivation label |
| year1 | AAMC | 0.0436 | 0.1844 | 0.8026 | -0.6182 | rapid rows should remain diagnostic/shadow unless validated; not a motivation label |
| year1 | ASMC | 0.0424 | 0.1322 | 0.6265 | -0.4943 | rapid rows should remain diagnostic/shadow unless validated; not a motivation label |
| year1 | MC0-100 | 0.0527 | 0.5666 | 0.8974 | -0.3308 | rapid rows should remain diagnostic/shadow unless validated; not a motivation label |
| year1 | MNC0-100 | 0.0403 | 0.0874 | 0.834 | -0.7466 | rapid rows should remain diagnostic/shadow unless validated; not a motivation label |
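
For readers checking how the columns above relate, here is a minimal sketch of the rapid-row comparison. The rule behind the "j2b style" rapid flag is not described on this page, so the sketch takes the threshold as an explicit, hypothetical parameter; the row data and function name are likewise invented for illustration.

```python
def rapid_summary(rows, rapid_threshold):
    """rows: (rt, correct) pairs for observed rows; correct is 0 or 1.

    rapid_threshold is a hypothetical cut in the same time units as rt;
    the audit's actual rapid-response rule is not reproduced here.
    """
    rapid = [c for rt, c in rows if rt < rapid_threshold]
    nonrapid = [c for rt, c in rows if rt >= rapid_threshold]
    if not rapid or not nonrapid:
        raise ValueError("need at least one rapid and one non-rapid row")
    acc_rapid = sum(rapid) / len(rapid)
    acc_nonrapid = sum(nonrapid) / len(nonrapid)
    return {
        "rapid_rate": len(rapid) / len(rows),
        "mean_accuracy_rapid_rows": acc_rapid,
        "mean_accuracy_nonrapid_rows": acc_nonrapid,
        "rapid_minus_nonrapid_accuracy": acc_rapid - acc_nonrapid,
    }


# Hypothetical observed rows for one subtest.
rows = [(0.8, 0), (0.9, 1), (5.0, 1), (6.0, 1), (7.0, 0), (12.0, 1)]
stats = rapid_summary(rows, rapid_threshold=2.0)
print(stats)
```

A negative difference only says that rapid rows are less accurate on average; as the interpretation column notes, it does not by itself label a student as unmotivated.
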
| year level | test subgroup | n profile deviation | profile deviation sd z | profile deviation p10 z | profile deviation p90 z | pct abs profile deviation gt 1z |
|---|---|---|---|---|---|---|
| foundation | MQ1-20 | 1,005 | 0.9337 | -1.043 | 1.115 | 0.2239 |
| foundation | MC0-20 | 1,005 | 0.8424 | -0.9794 | 1.032 | 0.202 |
| foundation | MNC0-20 | 1,003 | 0.7871 | -0.9912 | 1.018 | 0.2044 |
| foundation | DMT10_2026 | 1,002 | 0.9381 | -1.21 | 1.155 | 0.2735 |
| foundation | BNL0-20 | 974 | 0.9978 | -1.275 | 1.231 | 0.3142 |
| year1 | MC0-100 | 1,229 | 0.7369 | -0.8171 | 0.9086 | 0.1448 |
| year1 | MNC0-100 | 1,229 | 0.6395 | -0.7395 | 0.797 | 0.105 |
| year1 | AAMC | 1,227 | 0.6877 | -0.8228 | 0.8138 | 0.1084 |
| year1 | ASMC | 1,223 | 0.7863 | -0.9902 | 0.9733 | 0.1905 |
| year1 | BNL0-100 | 1,178 | 0.9017 | -1.064 | 1.119 | 0.2326 |
Decision gates for the next round of modelling.
| stream | next action | must check before fit | must check after fit |
|---|---|---|---|
| hierarchical_subscores | fit H1 Stan global+subtest-deviation model on hard-filtered operational frame | subtest score reliability/correlation/readiness table; avoid writing nuisance residuals as teacher subscores | HMC diagnostics, posterior SD by subscore, shrinkage size, global score movement, risk-band movement, subgroup movement, profile interpretability |
| number_line_policy | run frequentist ordinal cutoff screens using audited candidate policies; keep current .85/.95 as reference | item-by-target category counts and ECDF; reject policies with sparse/empty categories before Stan | threshold behaviour, reliability, score/risk movement, validation/fairness; continuous challenger only after proxy screen |
| accuracy_speed_joint | treat RT as QC/shadow; candidate families are timed non-NL achievement subtests only at first | RT missingness, row RT quantiles, rapid-row accuracy, STPM exclusion, D/trailing-zero double-count risk | tau construct validity, rapid effect direction, theta robustness, admin/subgroup artefacts, no operational risk-band changes |
Charts are rendered in-browser from aggregate JSON. No student-level scores are published in the chart data.
Aggregate CSV/Markdown artifacts used to build this page.
Completion, verification, sampler, theta, item, and testlet-level summary.
| key | family | variant | stan exitcode | postprocess status | verify ok | verify total | divergences | max treedepth hits | min ebfmi | theta max rhat | testlet max rhat | testlet flags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| foundation_inclusive | inclusive | inclusive | 0 | completed | 155 | 155 | 0 | 0 | 0.7052 | 1.004 | 1.006 | |
| year1_inclusive | inclusive | inclusive | 0 | completed | 155 | 155 | 0 | 0 | 0.6144 | 1.004 | 1.023 | BNL0-100:rhat=1.023,ess=109.1,mean=0.258 |
| foundation_hard | hard_filtered | hard_item_filtered | 0 | completed | 1,955 | 1,955 | 0 | 0 | 0.6765 | 1.003 | 1.003 | |
| year1_hard | hard_filtered | hard_item_filtered | 0 | completed | 1,955 | 1,955 | 0 | 0 | 0.6462 | 1.006 | 1.068 | BNL0-100:rhat=1.068,ess=78.4,mean=0.195 |
| foundation_no_DMT10_2026 | sensitivity | foundation_no_DMT10_2026 | 0 | completed | 2,104 | 2,104 | 0 | 0 | 0.6938 | 1.003 | 1.009 | |
| foundation_no_MQ1_20_no_DMT10_2026 | sensitivity | foundation_no_MQ1_20_no_DMT10_2026 | 0 | completed | 2,104 | 2,104 | 0 | 0 | 0.7273 | 1.002 | 1.007 | |
| foundation_no_BNL0_20 | sensitivity | foundation_no_BNL0_20 | 1 | recovered_after_postprocess_failure | 2,098 | 2,098 | 0 | 0 | 0.6673 | 1.002 | 1.004 | |
| year1_no_MC0_100 | sensitivity | year1_no_MC0_100 | 0 | completed | 2,104 | 2,104 | 0 | 0 | 0.6466 | 1.002 | 1.004 | |
| year1_no_BNL0_100 | sensitivity | year1_no_BNL0_100 | 1 | recovered_after_postprocess_failure | 2,098 | 2,098 | 0 | 0 | 0.5985 | 1.003 | 1.003 | |
| year1_core_no_MC_no_NL | sensitivity | year1_core_no_MC_no_NL | 1 | recovered_after_postprocess_failure | 2,098 | 2,098 | 0 | 0 | 0.5676 | 1.002 | 1.003 | |
All comparisons are against the hard-item-filtered baseline for the matching year.
| comparison id | year | n common | spearman theta | median abs pctile shift | p95 abs pctile shift | exact 3band agreement | very low jaccard | low or very low jaccard | very low moved out | very low moved in | low or very low moved out | low or very low moved in |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| inclusive_vs_hard_foundation | foundation | 997 | 0.9997 | 0.003 | 0.014 | 0.99 | 0.9735 | 0.9829 | 2 | 2 | 3 | 3 |
| inclusive_vs_hard_year1 | year1 | 1,221 | 0.999 | 0.0074 | 0.0262 | 0.9853 | 0.9677 | 0.9723 | 3 | 3 | 6 | 6 |
| foundation_no_DMT10_2026 | foundation | 997 | 0.9346 | 0.0572 | 0.21 | 0.8516 | 0.7029 | 0.7576 | 26 | 26 | 48 | 48 |
| foundation_no_MQ1_20_no_DMT10_2026 | foundation | 995 | 0.8254 | 0.0995 | 0.3568 | 0.7608 | 0.5204 | 0.6415 | 47 | 47 | 76 | 76 |
| foundation_no_BNL0_20 | foundation | 997 | 0.8648 | 0.0802 | 0.3232 | 0.7753 | 0.5051 | 0.6611 | 49 | 49 | 71 | 71 |
| year1_no_MC0_100 | year1 | 1,211 | 0.9926 | 0.0182 | 0.0677 | 0.962 | 0.9053 | 0.9359 | 9 | 9 | 14 | 14 |
| year1_no_BNL0_100 | year1 | 1,221 | 0.7679 | 0.1188 | 0.3931 | 0.7027 | 0.4023 | 0.5471 | 78 | 78 | 125 | 125 |
| year1_core_no_MC_no_NL | year1 | 1,211 | 0.739 | 0.1346 | 0.4042 | 0.6846 | 0.3817 | 0.5189 | 81 | 81 | 134 | 134 |
Auxiliary latent residual diagnostic showing the localized BNL0-100 issue.
| job | testlet index | test subgroup | n | rhat gt 1.01 | ess lt 400 | either | max rhat | min ess |
|---|---|---|---|---|---|---|---|---|
| foundation_inclusive | 1 | MQ1-20 | 997 | 0 | 0 | 0 | 1.003 | 5,816 |
| foundation_inclusive | 2 | MC0-20 | 997 | 0 | 0 | 0 | 1.002 | 6,015 |
| foundation_inclusive | 3 | MNC0-20 | 997 | 0 | 0 | 0 | 1.003 | 6,773 |
| foundation_inclusive | 4 | DMT10_2026 | 997 | 0 | 0 | 0 | 1.003 | 7,920 |
| foundation_inclusive | 5 | BNL0-20 | 997 | 0 | 0 | 0 | 1.002 | 5,589 |
| year1_inclusive | 1 | MC0-100 | 1,221 | 0 | 0 | 0 | 1.003 | 2,759 |
| year1_inclusive | 2 | MNC0-100 | 1,221 | 0 | 0 | 0 | 1.003 | 2,981 |
| year1_inclusive | 3 | AAMC | 1,221 | 0 | 0 | 0 | 1.003 | 2,173 |
| year1_inclusive | 4 | ASMC | 1,221 | 0 | 0 | 0 | 1.003 | 2,453 |
| year1_inclusive | 5 | BNL0-100 | 1,221 | 0 | 3 | 3 | 1.009 | 320.1 |
| foundation_hard | 1 | MQ1-20 | 997 | 0 | 0 | 0 | 1.003 | 5,781 |
| foundation_hard | 2 | MC0-20 | 997 | 0 | 0 | 0 | 1.002 | 5,019 |
| foundation_hard | 3 | MNC0-20 | 997 | 0 | 0 | 0 | 1.003 | 5,633 |
| foundation_hard | 4 | DMT10_2026 | 997 | 0 | 0 | 0 | 1.003 | 7,825 |
| foundation_hard | 5 | BNL0-20 | 997 | 0 | 0 | 0 | 1.002 | 5,096 |
| year1_hard | 1 | MC0-100 | 1,221 | 0 | 0 | 0 | 1.003 | 4,635 |
| year1_hard | 2 | MNC0-100 | 1,221 | 0 | 0 | 0 | 1.003 | 6,077 |
| year1_hard | 3 | AAMC | 1,221 | 0 | 0 | 0 | 1.003 | 4,001 |
| year1_hard | 4 | ASMC | 1,221 | 0 | 0 | 0 | 1.003 | 4,934 |
| year1_hard | 5 | BNL0-100 | 1,221 | 1,193 | 2 | 1,193 | 1.026 | 279.0 |
| foundation_no_DMT10_2026 | 1 | MQ1-20 | 997 | 0 | 0 | 0 | 1.003 | 3,839 |
| foundation_no_DMT10_2026 | 2 | MC0-20 | 997 | 0 | 0 | 0 | 1.002 | 4,958 |
| foundation_no_DMT10_2026 | 3 | MNC0-20 | 997 | 0 | 0 | 0 | 1.002 | 5,529 |
| foundation_no_DMT10_2026 | 4 | BNL0-20 | 997 | 0 | 0 | 0 | 1.003 | 2,890 |
| foundation_no_MQ1_20_no_DMT10_2026 | 1 | MC0-20 | 995 | 0 | 0 | 0 | 1.003 | 6,494 |
| foundation_no_MQ1_20_no_DMT10_2026 | 2 | MNC0-20 | 995 | 0 | 0 | 0 | 1.003 | 7,203 |
| foundation_no_MQ1_20_no_DMT10_2026 | 3 | BNL0-20 | 995 | 0 | 0 | 0 | 1.007 | 6,786 |
| foundation_no_BNL0_20 | 1 | MQ1-20 | 997 | 0 | 0 | 0 | 1.004 | 3,875 |
| foundation_no_BNL0_20 | 2 | MC0-20 | 997 | 0 | 0 | 0 | 1.003 | 5,855 |
| foundation_no_BNL0_20 | 3 | MNC0-20 | 997 | 0 | 0 | 0 | 1.003 | 7,600 |
| foundation_no_BNL0_20 | 4 | DMT10_2026 | 997 | 0 | 0 | 0 | 1.003 | 7,424 |
| year1_no_MC0_100 | 1 | MNC0-100 | 1,211 | 0 | 0 | 0 | 1.003 | 6,104 |
| year1_no_MC0_100 | 2 | AAMC | 1,211 | 0 | 0 | 0 | 1.003 | 5,836 |
| year1_no_MC0_100 | 3 | ASMC | 1,211 | 0 | 0 | 0 | 1.003 | 5,684 |
| year1_no_MC0_100 | 4 | BNL0-100 | 1,211 | 0 | 0 | 0 | 1.005 | 6,171 |
| year1_no_BNL0_100 | 1 | MC0-100 | 1,221 | 0 | 0 | 0 | 1.002 | 4,308 |
| year1_no_BNL0_100 | 2 | MNC0-100 | 1,221 | 0 | 0 | 0 | 1.003 | 4,397 |
| year1_no_BNL0_100 | 3 | AAMC | 1,221 | 0 | 0 | 0 | 1.002 | 3,863 |
| year1_no_BNL0_100 | 4 | ASMC | 1,221 | 0 | 0 | 0 | 1.002 | 4,374 |
| year1_core_no_MC_no_NL | 1 | MNC0-100 | 1,211 | 0 | 0 | 0 | 1.002 | 5,522 |
| year1_core_no_MC_no_NL | 2 | AAMC | 1,211 | 0 | 0 | 0 | 1.002 | 4,994 |
| year1_core_no_MC_no_NL | 3 | ASMC | 1,211 | 0 | 0 | 0 | 1.003 | 5,165 |
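
The flag columns in this diagnostic table follow the thresholds named in its header (R-hat above 1.01, ESS below 400). A minimal sketch of that counting logic is below; the function name and input layout are hypothetical, and the three example rows reuse R-hat/ESS values that appear elsewhere on this page purely as illustration, mixing testlet-SD and person-level summaries.

```python
def count_flags(summaries, rhat_max=1.01, ess_min=400.0):
    """Count parameters breaching the R-hat or ESS threshold, or either one."""
    rhat_bad = sum(s["rhat"] > rhat_max for s in summaries)
    ess_bad = sum(s["ess"] < ess_min for s in summaries)
    either = sum(s["rhat"] > rhat_max or s["ess"] < ess_min for s in summaries)
    return {"rhat_gt_1_01": rhat_bad, "ess_lt_400": ess_bad, "either": either}


# Illustrative inputs only; names are labels, not pipeline identifiers.
summaries = [
    {"name": "year1_hard BNL0-100", "rhat": 1.068, "ess": 78.4},
    {"name": "year1_inclusive BNL0-100", "rhat": 1.023, "ess": 109.1},
    {"name": "foundation_hard BNL0-20", "rhat": 1.002, "ess": 5096.0},
]
flags = count_flags(summaries)
print(flags)
```

Because "either" is a union rather than a sum, it can equal the larger of the two counts, which is what the year1_hard BNL0-100 row shows (1,193 R-hat flags, 2 ESS flags, 1,193 either).
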
| key | test subgroup | variable | mean | sd | q5 | q95 | rhat | ess bulk | ess tail |
|---|---|---|---|---|---|---|---|---|---|
| foundation_inclusive | MQ1-20 | sigma_u[1] | 0.8397 | 0.0613 | 0.7372 | 0.9408 | 1.006 | 1,465 | 2,999 |
| foundation_inclusive | MC0-20 | sigma_u[2] | 2.189 | 0.0714 | 2.075 | 2.31 | 1.002 | 1,844 | 3,930 |
| foundation_inclusive | MNC0-20 | sigma_u[3] | 1.926 | 0.0711 | 1.812 | 2.046 | 1.003 | 2,447 | 4,306 |
| foundation_inclusive | DMT10_2026 | sigma_u[4] | 0.7777 | 0.0561 | 0.6856 | 0.8697 | 1.001 | 1,884 | 3,879 |
| foundation_inclusive | BNL0-20 | sigma_u[5] | 0.5988 | 0.0423 | 0.5285 | 0.6682 | 1.005 | 1,173 | 2,206 |
| year1_inclusive | MC0-100 | sigma_u[1] | 2.322 | 0.0722 | 2.207 | 2.443 | 1.001 | 598.9 | 2,367 |
| year1_inclusive | MNC0-100 | sigma_u[2] | 2.055 | 0.0745 | 1.935 | 2.179 | 1.005 | 541.0 | 2,220 |
| year1_inclusive | AAMC | sigma_u[3] | 2.05 | 0.0712 | 1.937 | 2.171 | 1.003 | 453.1 | 2,161 |
| year1_inclusive | ASMC | sigma_u[4] | 1.636 | 0.0629 | 1.533 | 1.74 | 1.005 | 519.2 | 2,078 |
| year1_inclusive | BNL0-100 | sigma_u[5] | 0.2584 | 0.0978 | 0.0637 | 0.3936 | 1.023 | 109.1 | 237.0 |
| foundation_hard | MQ1-20 | sigma_u[1] | 0.922 | 0.0597 | 0.8248 | 1.021 | 1.002 | 2,015 | 4,281 |
| foundation_hard | MC0-20 | sigma_u[2] | 2.239 | 0.0696 | 2.125 | 2.355 | 1.001 | 1,602 | 3,564 |
| foundation_hard | MNC0-20 | sigma_u[3] | 1.995 | 0.0725 | 1.877 | 2.116 | 1 | 2,183 | 3,927 |
| foundation_hard | DMT10_2026 | sigma_u[4] | 0.7915 | 0.0564 | 0.6992 | 0.8843 | 1.003 | 1,658 | 4,175 |
| foundation_hard | BNL0-20 | sigma_u[5] | 0.5939 | 0.0417 | 0.5256 | 0.6609 | 1.002 | 1,081 | 2,044 |
| year1_hard | MC0-100 | sigma_u[1] | 2.576 | 0.0731 | 2.458 | 2.699 | 1.004 | 1,030 | 2,777 |
| year1_hard | MNC0-100 | sigma_u[2] | 2.196 | 0.0739 | 2.077 | 2.318 | 1.005 | 1,090 | 2,789 |
| year1_hard | AAMC | sigma_u[3] | 2.091 | 0.0681 | 1.979 | 2.205 | 1.009 | 826.3 | 2,277 |
| year1_hard | ASMC | sigma_u[4] | 1.696 | 0.0623 | 1.594 | 1.799 | 1.006 | 900.3 | 2,518 |
| year1_hard | BNL0-100 | sigma_u[5] | 0.1952 | 0.0927 | 0.0296 | 0.3417 | 1.068 | 78.37 | 319.6 |
| foundation_no_DMT10_2026 | MQ1-20 | sigma_u[1] | 0.9541 | 0.0677 | 0.8417 | 1.065 | 1.003 | 1,066 | 2,554 |
| foundation_no_DMT10_2026 | MC0-20 | sigma_u[2] | 2.263 | 0.071 | 2.148 | 2.383 | 1.001 | 1,903 | 3,560 |
| foundation_no_DMT10_2026 | MNC0-20 | sigma_u[3] | 2.07 | 0.076 | 1.945 | 2.197 | 1.003 | 1,781 | 3,846 |
| foundation_no_DMT10_2026 | BNL0-20 | sigma_u[4] | 0.4973 | 0.0596 | 0.3965 | 0.5908 | 1.009 | 481.8 | 814.6 |
| foundation_no_MQ1_20_no_DMT10_2026 | MC0-20 | sigma_u[1] | 2.408 | 0.0729 | 2.289 | 2.529 | 1.003 | 1,659 | 3,447 |
| foundation_no_MQ1_20_no_DMT10_2026 | MNC0-20 | sigma_u[2] | 2.219 | 0.0762 | 2.096 | 2.346 | 1.001 | 2,540 | 4,204 |
| foundation_no_MQ1_20_no_DMT10_2026 | BNL0-20 | sigma_u[3] | 0.0681 | 0.0513 | 0.005 | 0.1653 | 1.007 | 585.5 | 844.9 |
| foundation_no_BNL0_20 | MQ1-20 | sigma_u[1] | 0.7303 | 0.0742 | 0.6062 | 0.8476 | 1.004 | 1,221 | 2,465 |
| foundation_no_BNL0_20 | MC0-20 | sigma_u[2] | 2.127 | 0.0697 | 2.014 | 2.248 | 1.002 | 2,272 | 4,443 |
| foundation_no_BNL0_20 | MNC0-20 | sigma_u[3] | 1.866 | 0.0714 | 1.75 | 1.984 | 1 | 2,937 | 4,933 |
| foundation_no_BNL0_20 | DMT10_2026 | sigma_u[4] | 0.8036 | 0.0648 | 0.6963 | 0.9101 | 1.001 | 1,450 | 2,960 |
| year1_no_MC0_100 | MNC0-100 | sigma_u[1] | 2.283 | 0.0725 | 2.166 | 2.404 | 1.002 | 2,235 | 4,019 |
| year1_no_MC0_100 | AAMC | sigma_u[2] | 2.164 | 0.0653 | 2.06 | 2.273 | 1.002 | 1,867 | 3,469 |
| year1_no_MC0_100 | ASMC | sigma_u[3] | 1.76 | 0.0587 | 1.666 | 1.859 | 1.002 | 2,551 | 4,791 |
| year1_no_MC0_100 | BNL0-100 | sigma_u[4] | 0.0533 | 0.04 | 0.0044 | 0.1296 | 1.004 | 670.5 | 897.4 |
| year1_no_BNL0_100 | MC0-100 | sigma_u[1] | 2.26 | 0.0679 | 2.15 | 2.373 | 1.003 | 2,048 | 3,661 |
| year1_no_BNL0_100 | MNC0-100 | sigma_u[2] | 1.756 | 0.0707 | 1.641 | 1.873 | 1.002 | 1,715 | 3,122 |
| year1_no_BNL0_100 | AAMC | sigma_u[3] | 1.608 | 0.0638 | 1.505 | 1.714 | 1.002 | 1,892 | 2,740 |
| year1_no_BNL0_100 | ASMC | sigma_u[4] | 1.245 | 0.0588 | 1.149 | 1.344 | 1 | 1,668 | 3,369 |
| year1_core_no_MC_no_NL | MNC0-100 | sigma_u[1] | 1.969 | 0.0751 | 1.848 | 2.095 | 1.002 | 2,157 | 3,843 |
| year1_core_no_MC_no_NL | AAMC | sigma_u[2] | 1.753 | 0.0678 | 1.644 | 1.866 | 1.003 | 1,925 | 4,099 |
| year1_core_no_MC_no_NL | ASMC | sigma_u[3] | 1.283 | 0.0635 | 1.179 | 1.388 | 1.001 | 1,991 | 4,106 |
| key | variable | mean | sd | q5 | q95 | rhat | ess bulk | ess tail |
|---|---|---|---|---|---|---|---|---|
| foundation_inclusive | b[81] | 7.829 | 0.5637 | 6.978 | 8.828 | 1 | 10,514 | 4,900 |
| foundation_inclusive | b[61] | 7.828 | 0.5537 | 6.986 | 8.796 | 1.002 | 12,140 | 5,255 |
| foundation_inclusive | b[64] | 7.826 | 0.5673 | 6.959 | 8.805 | 1.001 | 11,474 | 4,865 |
| foundation_inclusive | b[63] | 7.826 | 0.5509 | 6.973 | 8.777 | 1.001 | 11,666 | 5,063 |
| foundation_inclusive | b[80] | 7.825 | 0.5688 | 6.966 | 8.81 | 1.001 | 11,620 | 5,078 |
| foundation_inclusive | b[65] | 7.824 | 0.5555 | 6.965 | 8.786 | 1 | 10,609 | 5,412 |
| foundation_inclusive | b[62] | 7.823 | 0.5623 | 6.948 | 8.798 | 1 | 12,400 | 5,798 |
| foundation_inclusive | b[68] | 7.821 | 0.56 | 6.964 | 8.803 | 1 | 11,187 | 4,942 |
| foundation_inclusive | b[69] | 7.821 | 0.5668 | 6.948 | 8.806 | 0.9998 | 10,863 | 5,251 |
| foundation_inclusive | b[79] | 7.817 | 0.5511 | 6.974 | 8.78 | 1 | 10,663 | 5,172 |
| foundation_inclusive | b[102] | 7.7 | 0.5616 | 6.851 | 8.667 | 1 | 11,007 | 5,262 |
| foundation_inclusive | b[110] | 7.699 | 0.556 | 6.855 | 8.675 | 1 | 10,556 | 5,517 |
| foundation_inclusive | b[109] | 7.691 | 0.5449 | 6.869 | 8.635 | 1 | 10,619 | 5,544 |
| foundation_inclusive | b[111] | 7.689 | 0.553 | 6.846 | 8.656 | 1 | 11,379 | 4,307 |
| foundation_inclusive | b[106] | 7.687 | 0.5456 | 6.852 | 8.654 | 1 | 11,355 | 5,515 |
| foundation_inclusive | b[108] | 7.685 | 0.5583 | 6.829 | 8.656 | 1 | 11,662 | 5,305 |
| foundation_inclusive | b[58] | 7.565 | 0.5183 | 6.763 | 8.463 | 1.002 | 10,619 | 5,099 |
| foundation_inclusive | b[75] | 7.556 | 0.5175 | 6.763 | 8.446 | 1.001 | 12,080 | 4,752 |
| foundation_inclusive | b[73] | 7.555 | 0.5255 | 6.735 | 8.466 | 1.003 | 10,646 | 4,896 |
| foundation_inclusive | b[72] | 7.555 | 0.5083 | 6.767 | 8.418 | 1.001 | 11,396 | 5,494 |
| year1_inclusive | b[164] | 8.307 | 0.5422 | 7.484 | 9.259 | 1.001 | 10,601 | 5,368 |
| year1_inclusive | b[40] | 8.302 | 0.5642 | 7.431 | 9.289 | 1.002 | 13,359 | 5,382 |
| year1_inclusive | b[170] | 8.302 | 0.5416 | 7.463 | 9.234 | 1.002 | 12,387 | 5,139 |
| year1_inclusive | b[168] | 8.301 | 0.5358 | 7.477 | 9.24 | 1.001 | 10,278 | 5,653 |
| year1_inclusive | b[34] | 8.301 | 0.5614 | 7.447 | 9.288 | 1.001 | 13,806 | 4,939 |
| year1_inclusive | b[171] | 8.299 | 0.5325 | 7.478 | 9.221 | 1.002 | 11,647 | 5,517 |
| year1_inclusive | b[166] | 8.298 | 0.5464 | 7.466 | 9.256 | 1 | 12,309 | 5,572 |
| year1_inclusive | b[169] | 8.296 | 0.5284 | 7.491 | 9.223 | 1 | 10,930 | 5,443 |
| year1_inclusive | b[172] | 8.292 | 0.5326 | 7.48 | 9.213 | 1.001 | 13,538 | 5,169 |
| year1_inclusive | b[120] | 8.254 | 0.5375 | 7.431 | 9.171 | 1.001 | 12,951 | 5,834 |
| year1_inclusive | b[136] | 8.253 | 0.539 | 7.445 | 9.223 | 1.001 | 11,104 | 4,681 |
| year1_inclusive | b[133] | 8.251 | 0.5415 | 7.428 | 9.169 | 1 | 10,203 | 4,544 |
| year1_inclusive | b[129] | 8.251 | 0.5442 | 7.423 | 9.203 | 1 | 12,194 | 5,691 |
| year1_inclusive | b[141] | 8.251 | 0.5426 | 7.42 | 9.2 | 1 | 13,052 | 5,230 |
| year1_inclusive | b[134] | 8.251 | 0.5345 | 7.434 | 9.19 | 1.001 | 10,492 | 4,616 |
| year1_inclusive | b[140] | 8.25 | 0.5361 | 7.426 | 9.185 | 0.9999 | 11,556 | 4,875 |
| year1_inclusive | b[125] | 8.25 | 0.5321 | 7.435 | 9.178 | 1 | 12,733 | 5,602 |
| year1_inclusive | b[123] | 8.25 | 0.5398 | 7.416 | 9.191 | 1.001 | 13,588 | 5,578 |
| year1_inclusive | b[142] | 8.249 | 0.5375 | 7.421 | 9.193 | 1.001 | 13,551 | 5,404 |
| year1_inclusive | b[128] | 8.248 | 0.538 | 7.424 | 9.196 | 1.002 | 12,262 | 5,445 |
| foundation_hard | b[67] | 8.006 | 0.5185 | 7.206 | 8.902 | 1.001 | 10,703 | 5,245 |
| foundation_hard | b[61] | 8.005 | 0.5234 | 7.192 | 8.93 | 1 | 10,271 | 4,354 |
| foundation_hard | b[59] | 8.003 | 0.5155 | 7.209 | 8.901 | 1.001 | 11,504 | 4,907 |
| foundation_hard | b[57] | 8.001 | 0.5108 | 7.207 | 8.876 | 1 | 11,286 | 5,709 |
| foundation_hard | b[66] | 8.001 | 0.5202 | 7.206 | 8.9 | 1 | 11,874 | 4,432 |
| foundation_hard | b[68] | 8.001 | 0.5146 | 7.201 | 8.883 | 1.001 | 11,490 | 5,480 |
| foundation_hard | b[64] | 8 | 0.5054 | 7.213 | 8.872 | 1.001 | 10,755 | 5,631 |
| foundation_hard | b[65] | 7.999 | 0.5171 | 7.192 | 8.887 | 1 | 11,646 | 4,453 |
| foundation_hard | b[58] | 7.999 | 0.4969 | 7.221 | 8.85 | 1.001 | 10,610 | 5,527 |
| foundation_hard | b[55] | 7.998 | 0.5119 | 7.189 | 8.883 | 1 | 9,513 | 5,119 |
| foundation_hard | b[62] | 7.996 | 0.5156 | 7.194 | 8.896 | 1.001 | 10,896 | 5,401 |
| foundation_hard | b[52] | 7.995 | 0.5068 | 7.206 | 8.863 | 1 | 10,763 | 5,164 |
| foundation_hard | b[56] | 7.994 | 0.5129 | 7.207 | 8.897 | 1.001 | 10,798 | 5,588 |
| foundation_hard | b[60] | 7.994 | 0.5065 | 7.215 | 8.875 | 1.001 | 12,241 | 4,915 |
| foundation_hard | b[54] | 7.993 | 0.5138 | 7.199 | 8.888 | 1 | 11,508 | 5,796 |
| foundation_hard | b[63] | 7.991 | 0.5112 | 7.184 | 8.878 | 1.002 | 11,372 | 5,727 |
| foundation_hard | b[53] | 7.774 | 0.4765 | 7.021 | 8.583 | 1.001 | 10,998 | 5,387 |
| foundation_hard | b[51] | 7.766 | 0.4816 | 7.028 | 8.605 | 1 | 11,810 | 5,249 |
| foundation_hard | b[92] | 7.741 | 0.5019 | 6.961 | 8.61 | 1 | 11,551 | 5,223 |
| foundation_hard | b[90] | 7.741 | 0.4976 | 6.965 | 8.603 | 1 | 10,416 | 5,919 |
| year1_hard | b[107] | 9.497 | 0.5015 | 8.705 | 10.36 | 1.001 | 11,483 | 5,821 |
| year1_hard | b[110] | 9.493 | 0.5009 | 8.711 | 10.35 | 1 | 8,816 | 5,481 |
| year1_hard | b[108] | 9.49 | 0.502 | 8.713 | 10.35 | 1.001 | 9,981 | 4,805 |
| year1_hard | b[109] | 9.488 | 0.5143 | 8.682 | 10.37 | 1.001 | 11,363 | 6,024 |
| year1_hard | b[106] | 9.278 | 0.4784 | 8.536 | 10.11 | 1 | 10,215 | 6,011 |
| year1_hard | b[104] | 9.275 | 0.4658 | 8.551 | 10.06 | 1.001 | 9,470 | 5,165 |
| year1_hard | b[105] | 9.274 | 0.4675 | 8.539 | 10.09 | 1.001 | 10,277 | 5,493 |
| year1_hard | b[131] | 8.564 | 0.4876 | 7.814 | 9.401 | 1 | 10,202 | 4,515 |
| year1_hard | b[132] | 8.355 | 0.4509 | 7.639 | 9.122 | 1 | 12,065 | 5,390 |
| year1_hard | b[130] | 8.352 | 0.458 | 7.636 | 9.137 | 1 | 9,747 | 5,523 |
| year1_hard | b[103] | 8.263 | 0.3503 | 7.704 | 8.857 | 1 | 8,026 | 6,194 |
| year1_hard | b[38] | 8.205 | 0.529 | 7.39 | 9.121 | 1.001 | 10,887 | 5,070 |
| year1_hard | b[27] | 8.205 | 0.5226 | 7.389 | 9.123 | 1.001 | 10,901 | 5,201 |
| year1_hard | b[37] | 8.203 | 0.5094 | 7.421 | 9.066 | 1.001 | 11,727 | 5,798 |
| year1_hard | b[30] | 8.202 | 0.5279 | 7.399 | 9.118 | 1.001 | 12,492 | 4,902 |
| year1_hard | b[22] | 8.201 | 0.5256 | 7.389 | 9.115 | 1 | 12,550 | 5,470 |
| year1_hard | b[35] | 8.201 | 0.5215 | 7.397 | 9.098 | 1 | 11,661 | 4,768 |
| year1_hard | b[25] | 8.198 | 0.518 | 7.408 | 9.096 | 1.001 | 11,833 | 5,432 |
| year1_hard | b[29] | 8.197 | 0.523 | 7.399 | 9.101 | 1.002 | 11,422 | 5,258 |
| year1_hard | b[32] | 8.197 | 0.5085 | 7.422 | 9.082 | 1.001 | 11,232 | 5,871 |
| foundation_no_DMT10_2026 | b[56] | 8.003 | 0.5149 | 7.2 | 8.903 | 1.001 | 11,245 | 5,544 |
| foundation_no_DMT10_2026 | b[50] | 8.001 | 0.5014 | 7.23 | 8.884 | 1 | 9,058 | 4,869 |
| foundation_no_DMT10_2026 | b[53] | 8 | 0.4953 | 7.231 | 8.864 | 1 | 9,980 | 5,239 |
| foundation_no_DMT10_2026 | b[51] | 7.999 | 0.5205 | 7.203 | 8.897 | 1.001 | 10,170 | 5,438 |
| foundation_no_DMT10_2026 | b[44] | 7.999 | 0.5222 | 7.199 | 8.898 | 1.001 | 10,974 | 5,023 |
| foundation_no_DMT10_2026 | b[48] | 7.999 | 0.5089 | 7.201 | 8.885 | 1.001 | 11,335 | 5,408 |
| foundation_no_DMT10_2026 | b[58] | 7.998 | 0.5133 | 7.21 | 8.885 | 1 | 10,047 | 5,687 |
| foundation_no_DMT10_2026 | b[54] | 7.998 | 0.5168 | 7.206 | 8.897 | 1 | 10,213 | 5,567 |
| foundation_no_DMT10_2026 | b[46] | 7.998 | 0.5092 | 7.211 | 8.864 | 1.001 | 10,190 | 5,756 |
| foundation_no_DMT10_2026 | b[52] | 7.997 | 0.5123 | 7.207 | 8.894 | 1.001 | 10,015 | 5,462 |
| foundation_no_DMT10_2026 | b[57] | 7.996 | 0.512 | 7.203 | 8.865 | 1.001 | 9,261 | 5,040 |
| foundation_no_DMT10_2026 | b[49] | 7.996 | 0.5146 | 7.177 | 8.877 | 1 | 10,706 | 5,121 |
| foundation_no_DMT10_2026 | b[59] | 7.994 | 0.5154 | 7.184 | 8.883 | 1.001 | 10,213 | 5,237 |
| foundation_no_DMT10_2026 | b[55] | 7.991 | 0.49 | 7.221 | 8.838 | 1 | 10,062 | 5,853 |
| foundation_no_DMT10_2026 | b[60] | 7.99 | 0.5223 | 7.175 | 8.895 | 1 | 11,202 | 5,325 |
| foundation_no_DMT10_2026 | b[47] | 7.99 | 0.5131 | 7.198 | 8.867 | 1.001 | 10,746 | 4,765 |
| foundation_no_DMT10_2026 | b[43] | 7.766 | 0.4771 | 7.011 | 8.595 | 1 | 10,443 | 5,506 |
| foundation_no_DMT10_2026 | b[45] | 7.763 | 0.4732 | 7.033 | 8.58 | 1.002 | 8,990 | 4,577 |
| foundation_no_DMT10_2026 | b[81] | 7.763 | 0.5075 | 6.995 | 8.648 | 1.001 | 9,655 | 5,295 |
| foundation_no_DMT10_2026 | b[80] | 7.757 | 0.5023 | 6.994 | 8.623 | 1.002 | 9,873 | 5,455 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[46] | 8.038 | 0.516 | 7.234 | 8.926 | 1.001 | 12,235 | 4,708 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[48] | 8.035 | 0.5203 | 7.226 | 8.914 | 0.9999 | 10,550 | 5,122 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[58] | 8.034 | 0.52 | 7.217 | 8.946 | 1.002 | 12,559 | 4,526 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[53] | 8.034 | 0.5098 | 7.226 | 8.919 | 1.001 | 11,948 | 5,302 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[49] | 8.033 | 0.5137 | 7.238 | 8.925 | 1.001 | 11,860 | 5,433 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[51] | 8.033 | 0.51 | 7.233 | 8.893 | 1 | 12,133 | 5,543 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[44] | 8.031 | 0.5105 | 7.241 | 8.925 | 1.001 | 11,852 | 5,579 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[56] | 8.031 | 0.512 | 7.243 | 8.924 | 1.001 | 11,160 | 5,375 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[52] | 8.031 | 0.5062 | 7.249 | 8.916 | 1.001 | 12,815 | 6,038 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[55] | 8.03 | 0.5088 | 7.247 | 8.904 | 1.001 | 12,143 | 5,536 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[50] | 8.028 | 0.5175 | 7.227 | 8.941 | 1.001 | 12,795 | 4,985 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[57] | 8.027 | 0.5102 | 7.245 | 8.908 | 1.001 | 12,697 | 5,756 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[60] | 8.027 | 0.5199 | 7.22 | 8.927 | 1 | 12,049 | 5,128 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[59] | 8.026 | 0.5175 | 7.226 | 8.899 | 1 | 13,260 | 5,505 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[54] | 8.026 | 0.5159 | 7.228 | 8.908 | 1 | 11,114 | 5,088 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[47] | 8.024 | 0.4992 | 7.241 | 8.878 | 1.001 | 12,673 | 5,252 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[81] | 7.837 | 0.5043 | 7.062 | 8.705 | 1.002 | 13,099 | 5,769 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[83] | 7.834 | 0.5063 | 7.056 | 8.707 | 1 | 11,811 | 5,498 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[80] | 7.833 | 0.4987 | 7.074 | 8.708 | 1 | 11,850 | 5,260 |
| foundation_no_MQ1_20_no_DMT10_2026 | b[82] | 7.832 | 0.5045 | 7.059 | 8.704 | 1 | 11,609 | 5,007 |
| foundation_no_BNL0_20 | b[48] | 8.014 | 0.5103 | 7.229 | 8.895 | 1.002 | 14,235 | 6,076 |
| foundation_no_BNL0_20 | b[51] | 8.014 | 0.5109 | 7.221 | 8.906 | 1.001 | 15,427 | 5,038 |
| foundation_no_BNL0_20 | b[52] | 8.014 | 0.5101 | 7.228 | 8.899 | 1 | 14,431 | 5,512 |
| foundation_no_BNL0_20 | b[49] | 8.013 | 0.524 | 7.202 | 8.931 | 1 | 14,002 | 5,477 |
| foundation_no_BNL0_20 | b[57] | 8.013 | 0.5218 | 7.191 | 8.903 | 1.001 | 14,127 | 4,955 |
| foundation_no_BNL0_20 | b[56] | 8.012 | 0.5246 | 7.192 | 8.923 | 1 | 14,244 | 4,541 |
| foundation_no_BNL0_20 | b[44] | 8.011 | 0.4994 | 7.236 | 8.883 | 1.001 | 14,004 | 5,163 |
| foundation_no_BNL0_20 | b[50] | 8.011 | 0.5116 | 7.224 | 8.881 | 1.001 | 13,465 | 5,542 |
| foundation_no_BNL0_20 | b[42] | 8.011 | 0.5205 | 7.205 | 8.914 | 1 | 15,335 | 5,563 |
| foundation_no_BNL0_20 | b[47] | 8.01 | 0.5221 | 7.202 | 8.908 | 1 | 13,570 | 5,075 |
| foundation_no_BNL0_20 | b[55] | 8.009 | 0.5153 | 7.211 | 8.87 | 1 | 15,560 | 5,714 |
| foundation_no_BNL0_20 | b[58] | 8.009 | 0.5 | 7.227 | 8.871 | 1 | 13,975 | 5,664 |
| foundation_no_BNL0_20 | b[45] | 8.009 | 0.5089 | 7.212 | 8.888 | 1 | 15,748 | 5,994 |
| foundation_no_BNL0_20 | b[46] | 8.008 | 0.5252 | 7.192 | 8.918 | 1 | 14,678 | 5,234 |
| foundation_no_BNL0_20 | b[53] | 8.008 | 0.4992 | 7.235 | 8.856 | 1.001 | 14,424 | 5,539 |
| foundation_no_BNL0_20 | b[54] | 8.006 | 0.508 | 7.218 | 8.893 | 1 | 13,131 | 5,241 |
| foundation_no_BNL0_20 | b[43] | 7.783 | 0.4763 | 7.031 | 8.597 | 1.001 | 14,751 | 4,931 |
| foundation_no_BNL0_20 | b[41] | 7.771 | 0.4693 | 7.034 | 8.579 | 1.001 | 15,584 | 5,620 |
| foundation_no_BNL0_20 | b[82] | 7.734 | 0.4993 | 6.966 | 8.601 | 1.001 | 13,522 | 5,042 |
| foundation_no_BNL0_20 | b[81] | 7.734 | 0.5011 | 6.967 | 8.61 | 1.001 | 14,061 | 5,130 |
| year1_no_MC0_100 | b[97] | 8.641 | 0.497 | 7.877 | 9.482 | 1 | 11,447 | 5,361 |
| year1_no_MC0_100 | b[96] | 8.434 | 0.4632 | 7.715 | 9.24 | 1.001 | 9,911 | 5,566 |
| year1_no_MC0_100 | b[98] | 8.425 | 0.4599 | 7.703 | 9.209 | 1 | 10,199 | 5,925 |
| year1_no_MC0_100 | b[95] | 8.251 | 0.4338 | 7.571 | 8.984 | 1 | 10,062 | 5,038 |
| year1_no_MC0_100 | b[29] | 8.245 | 0.5165 | 7.433 | 9.15 | 0.9998 | 9,955 | 4,727 |
| year1_no_MC0_100 | b[31] | 8.241 | 0.522 | 7.433 | 9.155 | 1.002 | 13,982 | 4,732 |
| year1_no_MC0_100 | b[35] | 8.239 | 0.5331 | 7.427 | 9.168 | 1 | 11,445 | 4,974 |
| year1_no_MC0_100 | b[27] | 8.238 | 0.5253 | 7.427 | 9.136 | 1.001 | 11,992 | 5,104 |
| year1_no_MC0_100 | b[28] | 8.238 | 0.531 | 7.406 | 9.155 | 1.001 | 11,655 | 4,696 |
| year1_no_MC0_100 | b[34] | 8.238 | 0.5258 | 7.427 | 9.143 | 1 | 12,380 | 5,543 |
| year1_no_MC0_100 | b[33] | 8.237 | 0.5083 | 7.448 | 9.118 | 1 | 10,989 | 5,127 |
| year1_no_MC0_100 | b[22] | 8.236 | 0.5217 | 7.439 | 9.127 | 1.001 | 12,780 | 5,294 |
| year1_no_MC0_100 | b[37] | 8.235 | 0.515 | 7.44 | 9.127 | 1.002 | 11,521 | 5,609 |
| year1_no_MC0_100 | b[30] | 8.231 | 0.5191 | 7.428 | 9.134 | 1 | 12,678 | 5,976 |
| year1_no_MC0_100 | b[36] | 8.231 | 0.5155 | 7.426 | 9.115 | 1 | 11,650 | 5,620 |
| year1_no_MC0_100 | b[38] | 8.229 | 0.5135 | 7.442 | 9.119 | 1.001 | 12,111 | 5,432 |
| year1_no_MC0_100 | b[32] | 8.226 | 0.5062 | 7.454 | 9.102 | 1 | 11,546 | 5,832 |
| year1_no_MC0_100 | b[25] | 8.225 | 0.5081 | 7.443 | 9.105 | 1 | 12,807 | 6,109 |
| year1_no_MC0_100 | b[26] | 8.004 | 0.4745 | 7.274 | 8.815 | 1.001 | 9,905 | 4,832 |
| year1_no_MC0_100 | b[24] | 8 | 0.4777 | 7.261 | 8.831 | 1 | 12,510 | 5,384 |
| year1_no_BNL0_100 | b[94] | 9.423 | 0.5009 | 8.651 | 10.3 | 1 | 9,369 | 5,296 |
| year1_no_BNL0_100 | b[95] | 9.422 | 0.5108 | 8.619 | 10.31 | 1.001 | 9,269 | 5,245 |
| year1_no_BNL0_100 | b[97] | 9.418 | 0.5061 | 8.627 | 10.29 | 1 | 8,492 | 4,980 |
| year1_no_BNL0_100 | b[96] | 9.416 | 0.5095 | 8.623 | 10.29 | 1 | 10,151 | 5,918 |
| year1_no_BNL0_100 | b[93] | 9.201 | 0.4816 | 8.453 | 10.03 | 1 | 9,815 | 5,523 |
| year1_no_BNL0_100 | b[92] | 9.197 | 0.4731 | 8.458 | 10.02 | 1 | 9,506 | 5,795 |
| year1_no_BNL0_100 | b[91] | 9.196 | 0.4633 | 8.472 | 9.994 | 1 | 8,598 | 5,210 |
| year1_no_BNL0_100 | b[118] | 8.432 | 0.4934 | 7.658 | 9.291 | 1.001 | 10,748 | 5,381 |
| year1_no_BNL0_100 | b[119] | 8.233 | 0.4579 | 7.517 | 9.012 | 1.001 | 10,359 | 5,764 |
| year1_no_BNL0_100 | b[117] | 8.228 | 0.4523 | 7.517 | 8.997 | 1.001 | 10,423 | 4,968 |
| year1_no_BNL0_100 | b[90] | 8.194 | 0.3467 | 7.635 | 8.771 | 1 | 8,119 | 6,229 |
| year1_no_BNL0_100 | b[35] | 8.163 | 0.5124 | 7.376 | 9.058 | 1.001 | 10,185 | 5,326 |
| year1_no_BNL0_100 | b[25] | 8.162 | 0.5209 | 7.357 | 9.077 | 1 | 11,495 | 5,309 |
| year1_no_BNL0_100 | b[33] | 8.161 | 0.5213 | 7.361 | 9.067 | 1 | 11,323 | 5,937 |
| year1_no_BNL0_100 | b[37] | 8.16 | 0.5077 | 7.373 | 9.055 | 1.001 | 9,620 | 5,385 |
| year1_no_BNL0_100 | b[34] | 8.16 | 0.5078 | 7.382 | 9.052 | 1 | 12,275 | 6,146 |
| year1_no_BNL0_100 | b[38] | 8.158 | 0.5211 | 7.362 | 9.072 | 1 | 10,685 | 5,194 |
| year1_no_BNL0_100 | b[29] | 8.158 | 0.5113 | 7.37 | 9.047 | 1 | 11,494 | 5,758 |
| year1_no_BNL0_100 | b[30] | 8.158 | 0.5073 | 7.376 | 9.021 | 1 | 10,514 | 4,896 |
| year1_no_BNL0_100 | b[32] | 8.156 | 0.5073 | 7.36 | 9.037 | 1 | 10,190 | 5,583 |
| year1_core_no_MC_no_NL | b[84] | 8.57 | 0.4904 | 7.811 | 9.415 | 1 | 12,285 | 5,696 |
| year1_core_no_MC_no_NL | b[83] | 8.367 | 0.4658 | 7.646 | 9.167 | 1 | 11,942 | 5,113 |
| year1_core_no_MC_no_NL | b[85] | 8.364 | 0.4601 | 7.644 | 9.15 | 1.001 | 12,817 | 5,931 |
| year1_core_no_MC_no_NL | b[33] | 8.21 | 0.5118 | 7.417 | 9.091 | 1 | 13,919 | 4,888 |
| year1_core_no_MC_no_NL | b[31] | 8.207 | 0.53 | 7.395 | 9.146 | 1.001 | 13,711 | 4,756 |
| year1_core_no_MC_no_NL | b[29] | 8.205 | 0.5111 | 7.419 | 9.082 | 1 | 16,253 | 5,420 |
| year1_core_no_MC_no_NL | b[25] | 8.203 | 0.5118 | 7.416 | 9.096 | 1 | 13,262 | 5,692 |
| year1_core_no_MC_no_NL | b[34] | 8.202 | 0.5202 | 7.391 | 9.104 | 1 | 16,277 | 5,645 |
| year1_core_no_MC_no_NL | b[28] | 8.2 | 0.519 | 7.403 | 9.09 | 1.001 | 15,230 | 5,049 |
| year1_core_no_MC_no_NL | b[22] | 8.2 | 0.5184 | 7.396 | 9.09 | 0.9998 | 14,382 | 5,210 |
| year1_core_no_MC_no_NL | b[37] | 8.2 | 0.5089 | 7.414 | 9.087 | 1.001 | 14,801 | 5,424 |
| year1_core_no_MC_no_NL | b[38] | 8.2 | 0.5151 | 7.401 | 9.099 | 1.002 | 12,909 | 4,308 |
| year1_core_no_MC_no_NL | b[30] | 8.199 | 0.4956 | 7.421 | 9.051 | 1 | 13,262 | 5,654 |
| year1_core_no_MC_no_NL | b[27] | 8.199 | 0.5191 | 7.392 | 9.115 | 1.001 | 14,735 | 4,717 |
| year1_core_no_MC_no_NL | b[35] | 8.199 | 0.5183 | 7.396 | 9.092 | 1 | 15,503 | 4,965 |
| year1_core_no_MC_no_NL | b[36] | 8.197 | 0.5026 | 7.425 | 9.065 | 1.001 | 15,112 | 5,976 |
| year1_core_no_MC_no_NL | b[32] | 8.194 | 0.4991 | 7.426 | 9.043 | 1.001 | 14,008 | 5,710 |
| year1_core_no_MC_no_NL | b[82] | 8.175 | 0.4304 | 7.505 | 8.919 | 1 | 12,572 | 5,793 |
| year1_core_no_MC_no_NL | b[26] | 7.967 | 0.4787 | 7.222 | 8.789 | 1 | 14,407 | 5,566 |
| year1_core_no_MC_no_NL | b[24] | 7.966 | 0.4803 | 7.226 | 8.793 | 1 | 14,103 | 5,606 |
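
The rows above can be screened mechanically before they are read one by one. The following is a minimal sketch, assuming the last three columns are R-hat, bulk ESS, and tail ESS, and using the conventional thresholds R-hat <= 1.01 and ESS >= 400. Those thresholds, and the names `ParamSummary`, `parse_row`, and `flag`, are illustrative assumptions rather than this audit's stated gates.

```python
from dataclasses import dataclass


@dataclass
class ParamSummary:
    job: str
    param: str
    rhat: float
    ess_bulk: float  # assumed meaning of the second-to-last column
    ess_tail: float  # assumed meaning of the last column


def to_num(cell: str) -> float:
    """Convert a table cell such as '10,609' or '0.9998' to a float."""
    return float(cell.replace(",", ""))


def parse_row(line: str) -> ParamSummary:
    """Parse one Markdown table row of the posterior summary table."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    return ParamSummary(cells[0], cells[1],
                        to_num(cells[-3]), to_num(cells[-2]), to_num(cells[-1]))


def flag(row: ParamSummary, rhat_max: float = 1.01, ess_min: float = 400.0) -> list:
    """Return the list of diagnostics that fail the assumed thresholds."""
    reasons = []
    if row.rhat > rhat_max:
        reasons.append("rhat")
    if min(row.ess_bulk, row.ess_tail) < ess_min:
        reasons.append("ess")
    return reasons


# Two rows copied from the table above.
rows = [parse_row(line) for line in [
    "| foundation_inclusive | b[65] | 7.824 | 0.5555 | 6.965 | 8.786 | 1 | 10,609 | 5,412 |",
    "| year1_hard | b[103] | 8.263 | 0.3503 | 7.704 | 8.857 | 1 | 8,026 | 6,194 |",
]]
flags = {r.param: flag(r) for r in rows}
```

An empty list for a parameter means it passes both assumed checks. Tightening `rhat_max` or raising `ess_min` turns the same function into a stricter gate.
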
Rendered from the aggregate-only Markdown premodelling artifact.
Generated: 2026-06-14 08:57:14Z
This is a dependency-light, aggregate-only audit. It does not publish raw student identifiers or person-level score files.
H1_global_plus_subtest_deviations: global numeracy plus reportable subtest deviations, not the current nuisance testlet u residuals as subscores.
Number Line cutoff policies audited: .80/.90, .85/.95, .90/.97, binary >=.95, and a 4-category .80/.90/.95 option.

| year | subtest | keep_items | rel | band | global_r | posture |
|---|---|---|---|---|---|---|
| foundation | MQ1-20 | 19 | 0.602 | weak | 0.426 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore |
| foundation | MC0-20 | 50 | 0.927 | strong | 0.532 | strong standalone signal; still prefer hierarchical coherence with global score |
| foundation | MNC0-20 | 24 | 0.881 | strong | 0.604 | strong standalone signal; still prefer hierarchical coherence with global score |
| foundation | DMT10_2026 | 8 | 0.609 | weak | 0.43 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore |
| foundation | BNL0-20 | 10 | 0.674 | weak | 0.354 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore |
| year1 | MC0-100 | 34 | 0.94 | strong | 0.694 | strong standalone signal; still prefer hierarchical coherence with global score |
| year1 | MNC0-100 | 22 | 0.891 | strong | 0.762 | strong standalone signal; still prefer hierarchical coherence with global score |
| year1 | AAMC | 38 | 0.9 | strong | 0.729 | strong standalone signal; still prefer hierarchical coherence with global score |
| year1 | ASMC | 25 | 0.841 | moderate | 0.617 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic |
| year1 | BNL0-100 | 13 | 0.727 | moderate | 0.548 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic |
Key implication: several subtests are not ideal standalone reporting scores, especially where reliability is weak/moderate or item counts are small. That is an argument *for* hierarchical shrinkage, not against subscores.
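
The shrinkage argument can be illustrated with a classical reliability-weighted (Kelley-style) estimate. This is a minimal sketch and not the H1 Stan model: it assumes the standalone reliability from the table can stand in for the shrinkage weight, and the observed deviation of 1.0 z is a hypothetical input. The function name `kelley_shrink` is also illustrative.

```python
def kelley_shrink(observed_dev_z: float, reliability: float,
                  mean_dev_z: float = 0.0) -> float:
    """Pull an observed subtest deviation toward the group mean.

    The weight on the observed value is the reliability, so weak subtests
    are shrunk harder than strong ones.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return mean_dev_z + reliability * (observed_dev_z - mean_dev_z)


# Reliabilities taken from the foundation rows of the table above.
reliability = {"MQ1-20": 0.602, "MC0-20": 0.927, "BNL0-20": 0.674}

# Hypothetical student who sits 1.0 z above their global level on each subtest.
shrunk = {name: round(kelley_shrink(1.0, rel), 3)
          for name, rel in reliability.items()}
```

Under these assumptions the same observed deviation is reported as roughly 0.6 z on MQ1-20 but about 0.93 z on MC0-20, which is the behaviour the hierarchical posture asks for: weak subtests still contribute a subscore, but a more cautious one.
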
| year | subtest_1 | subtest_2 | n | rho | band |
|---|---|---|---|---|---|
| foundation | MQ1-20 | MC0-20 | 1005 | 0.405 | moderate |
| foundation | MQ1-20 | MNC0-20 | 1003 | 0.405 | moderate |
| foundation | MQ1-20 | DMT10_2026 | 1002 | 0.275 | low |
| foundation | MQ1-20 | BNL0-20 | 974 | 0.171 | low |
| foundation | MC0-20 | MNC0-20 | 1003 | 0.561 | moderate |
| foundation | MC0-20 | DMT10_2026 | 1002 | 0.283 | low |
| foundation | MC0-20 | BNL0-20 | 974 | 0.245 | low |
| foundation | MNC0-20 | DMT10_2026 | 1002 | 0.393 | low |
| foundation | MNC0-20 | BNL0-20 | 974 | 0.304 | low |
| foundation | DMT10_2026 | BNL0-20 | 974 | 0.302 | low |
| year1 | MC0-100 | MNC0-100 | 1229 | 0.683 | high |
| year1 | MC0-100 | AAMC | 1227 | 0.595 | moderate |
| year1 | MC0-100 | ASMC | 1223 | 0.485 | moderate |
| year1 | MC0-100 | BNL0-100 | 1178 | 0.468 | moderate |
| year1 | MNC0-100 | AAMC | 1227 | 0.671 | high |
| year1 | MNC0-100 | ASMC | 1223 | 0.561 | moderate |
| year1 | MNC0-100 | BNL0-100 | 1178 | 0.504 | moderate |
| year1 | AAMC | ASMC | 1223 | 0.595 | moderate |
| year1 | AAMC | BNL0-100 | 1178 | 0.469 | moderate |
| year1 | ASMC | BNL0-100 | 1178 | 0.379 | low |
| year | subtest | n | sd_dev_z | p10 | p90 | %>\|1z\| |
|---|---|---|---|---|---|---|
| foundation | MQ1-20 | 1005 | 0.934 | -1.04 | 1.11 | 22.4% |
| foundation | MC0-20 | 1005 | 0.842 | -0.98 | 1.03 | 20.2% |
| foundation | MNC0-20 | 1003 | 0.787 | -0.99 | 1.02 | 20.4% |
| foundation | DMT10_2026 | 1002 | 0.938 | -1.21 | 1.15 | 27.3% |
| foundation | BNL0-20 | 974 | 0.998 | -1.27 | 1.23 | 31.4% |
| year1 | MC0-100 | 1229 | 0.737 | -0.82 | 0.91 | 14.5% |
| year1 | MNC0-100 | 1229 | 0.64 | -0.74 | 0.8 | 10.5% |
| year1 | AAMC | 1227 | 0.688 | -0.82 | 0.81 | 10.8% |
| year1 | ASMC | 1223 | 0.786 | -0.99 | 0.97 | 19.1% |
| year1 | BNL0-100 | 1178 | 0.902 | -1.06 | 1.12 | 23.3% |
| year | subtest | policy | items_ok | median_min_pct | median_top_pct | entropy | posture |
|---|---|---|---|---|---|---|---|
| foundation | BNL0-20 | nl_80_90_95_4cat | 9/10 | 15.6% | 25.3% | 0.955 | higher-resolution_challenger; use_only_if_item_category_cells_are_stable |
| foundation | BNL0-20 | nl_80_90_relaxed_3cat | 9/10 | 20.6% | 48.0% | 0.938 | relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets |
| foundation | BNL0-20 | nl_85_95_current_3cat | 10/10 | 18.2% | 25.3% | 0.93 | benchmark_current_policy; keep as reference in all modelling |
| foundation | BNL0-20 | nl_90_97_strict_3cat | 10/10 | 13.6% | 14.6% | 0.873 | strict_challenger; reject_if_top_category_sparse_or_validation_not_better |
| foundation | BNL0-20 | nl_binary_95 | 10/10 | 25.3% | 25.3% | 0.815 | modelable_if_cells_ok_but_loses_partial-credit_information |
| year1 | BNL0-100 | nl_80_90_95_4cat | 13/13 | 19.4% | 26.4% | 0.989 | higher-resolution_challenger; use_only_if_item_category_cells_are_stable |
| year1 | BNL0-100 | nl_80_90_relaxed_3cat | 13/13 | 24.4% | 48.7% | 0.954 | relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets |
| year1 | BNL0-100 | nl_85_95_current_3cat | 13/13 | 26.4% | 26.4% | 0.98 | benchmark_current_policy; keep as reference in all modelling |
| year1 | BNL0-100 | nl_90_97_strict_3cat | 13/13 | 16.2% | 16.2% | 0.913 | strict_challenger; reject_if_top_category_sparse_or_validation_not_better |
| year1 | BNL0-100 | nl_binary_95 | 13/13 | 26.4% | 26.4% | 0.833 | modelable_if_cells_ok_but_loses_partial-credit_information |
Interpretation rule: a policy can be *modelable* from cell counts but still not promotable. Promotion requires validation, risk-band movement, fairness/subgroup checks, and interpretability. Current .85/.95 remains the benchmark.
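
The cell-count side of this rule can be sketched as follows. The cutoffs are the ones listed for each policy. The accuracy values are hypothetical, and the helper names (`categorise`, `cell_proportions`, `normalised_entropy`) are illustrative. Treating the `entropy` column as Shannon entropy divided by its maximum is an assumption; it is consistent with the binary rows, since a 25.3% top category gives about 0.816.

```python
import math
from bisect import bisect_right
from collections import Counter

# Lower bounds of each category above category 0, per policy.
POLICIES = {
    "nl_80_90_relaxed_3cat": (0.80, 0.90),
    "nl_85_95_current_3cat": (0.85, 0.95),
    "nl_90_97_strict_3cat": (0.90, 0.97),
    "nl_binary_95": (0.95,),
    "nl_80_90_95_4cat": (0.80, 0.90, 0.95),
}


def categorise(accuracy: float, cuts: tuple) -> int:
    """Category = number of cutoffs at or below the accuracy (>= cut moves up)."""
    return bisect_right(cuts, accuracy)


def cell_proportions(accuracies: list, cuts: tuple) -> list:
    """Share of responses falling in each category, including empty ones."""
    counts = Counter(categorise(a, cuts) for a in accuracies)
    return [counts.get(c, 0) / len(accuracies) for c in range(len(cuts) + 1)]


def normalised_entropy(props: list) -> float:
    """Shannon entropy divided by its maximum, log(number of categories)."""
    raw = -sum(p * math.log(p) for p in props if p > 0)
    return raw / math.log(len(props))


# Hypothetical accuracies for one item, for illustration only.
item_accuracy = [0.70, 0.82, 0.86, 0.91, 0.95, 0.96, 0.99, 0.88]
current = cell_proportions(item_accuracy, POLICIES["nl_85_95_current_3cat"])
strict = cell_proportions(item_accuracy, POLICIES["nl_90_97_strict_3cat"])
```

A policy whose smallest cell falls below a chosen floor for some item would be rejected at this stage. Passing the floor only makes the policy modelable; the promotion checks named above still apply.
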
| year | subtest | role | timed | obs_rt_miss | presented_miss | trailing | rt_p50 | <1s | model_role | flags |
|---|---|---|---|---|---|---|---|---|---|---|
| foundation | BNL0-20 | achievement_primary | False | 0.00% | 2.0% | 0.0% | 7 | 0.8% | nl_rt_context_only_initially_not_accuracy_speed_scoring | none_obvious_from_row_rt_audit |
| foundation | DMT10_2026 | achievement_primary | False | 0.00% | 1.5% | 0.0% | 16 | 0.0% | untimed_or_other_context_only_initially | none_obvious_from_row_rt_audit |
| foundation | MC0-20 | achievement_primary | True | 0.00% | 75.0% | 74.0% | 6 | 0.4% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | MNC0-20 | achievement_primary | True | 0.00% | 76.1% | 75.1% | 12 | 0.4% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | MQ1-20 | achievement_primary | True | 0.00% | 84.2% | 83.3% | 20 | 0.8% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | STPM | shadow_speed_only | True | | 6.2% | 5.2% | 8 | 0.1% | shadow_speed_only_exclude_from_math_achievement | presented_row_rt_missing_or_negative_gt_5pct |
| year1 | AAMC | achievement_primary | True | 0.00% | 80.2% | 78.5% | 9 | 0.5% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | ASMC | achievement_primary | True | 0.00% | 77.6% | 75.6% | 12 | 0.5% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | BNL0-100 | achievement_primary | False | 0.00% | 3.8% | 0.0% | 5 | 0.7% | nl_rt_context_only_initially_not_accuracy_speed_scoring | none_obvious_from_row_rt_audit |
| year1 | MC0-100 | achievement_primary | True | 0.00% | 77.1% | 76.0% | 6 | 0.4% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | MNC0-100 | achievement_primary | True | 0.00% | 72.9% | 71.4% | 11 | 0.5% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | STPM | shadow_speed_only | True | | 6.4% | 4.4% | 6 | 0.1% | shadow_speed_only_exclude_from_math_achievement | presented_row_rt_missing_or_negative_gt_5pct |
| year | subtest | rapid_rate | rapid_acc | nonrapid_acc | delta |
|---|---|---|---|---|---|
| foundation | MC0-20 | 5.97% | 0.636 | 0.919 | -0.283 |
| foundation | MNC0-20 | 3.96% | 0.088 | 0.754 | -0.666 |
| foundation | MQ1-20 | 3.08% | 0.054 | 0.656 | -0.601 |
| year1 | AAMC | 4.36% | 0.184 | 0.803 | -0.618 |
| year1 | ASMC | 4.24% | 0.132 | 0.627 | -0.494 |
| year1 | MC0-100 | 5.27% | 0.567 | 0.897 | -0.331 |
| year1 | MNC0-100 | 4.03% | 0.087 | 0.834 | -0.747 |
| year | subtest | metric | n | rho | note |
|---|---|---|---|---|---|
| foundation | STPM | median_item_rt_sec | 1016 | -0.437 | rt_context_not_achievement_adjustment |
| foundation | STPM | n_reached_or_valid_count | 1024 | 0.819 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | STPM | n_trailing_nonresponse_rows | 1024 | -0.771 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MQ1-20 | median_item_rt_sec | 998 | -0.462 | rt_context_not_achievement_adjustment |
| foundation | MQ1-20 | n_reached_or_valid_count | 1006 | 0.711 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MQ1-20 | n_trailing_nonresponse_rows | 1006 | -0.656 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MC0-20 | median_item_rt_sec | 995 | -0.83 | rt_context_not_achievement_adjustment |
| foundation | MC0-20 | n_reached_or_valid_count | 1005 | 0.932 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MC0-20 | n_trailing_nonresponse_rows | 1005 | -0.873 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MNC0-20 | median_item_rt_sec | 993 | -0.673 | rt_context_not_achievement_adjustment |
| foundation | MNC0-20 | n_reached_or_valid_count | 1003 | 0.778 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MNC0-20 | n_trailing_nonresponse_rows | 1003 | -0.721 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | DMT10_2026 | median_item_rt_sec | 988 | 0.061 | rt_context_not_achievement_adjustment |
| foundation | DMT10_2026 | n_reached_or_valid_count | 1002 | 0.206 | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | DMT10_2026 | n_trailing_nonresponse_rows | 1002 | | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | BNL0-20 | median_item_rt_sec | 974 | -0.009 | rt_context_not_achievement_adjustment |
| foundation | BNL0-20 | n_reached_or_valid_count | 974 | 0.345 | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | BNL0-20 | n_trailing_nonresponse_rows | 974 | | coverage_or_valid_count_context_not_timed_D_speed |
| year1 | STPM | median_item_rt_sec | 1235 | -0.432 | rt_context_not_achievement_adjustment |
| year1 | STPM | n_reached_or_valid_count | 1256 | 0.821 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | STPM | n_trailing_nonresponse_rows | 1256 | -0.719 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MC0-100 | median_item_rt_sec | 1221 | -0.84 | rt_context_not_achievement_adjustment |
| year1 | MC0-100 | n_reached_or_valid_count | 1235 | 0.932 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MC0-100 | n_trailing_nonresponse_rows | 1235 | -0.865 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MNC0-100 | median_item_rt_sec | 1212 | -0.704 | rt_context_not_achievement_adjustment |
| year1 | MNC0-100 | n_reached_or_valid_count | 1229 | 0.816 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MNC0-100 | n_trailing_nonresponse_rows | 1229 | -0.729 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | AAMC | median_item_rt_sec | 1205 | -0.79 | rt_context_not_achievement_adjustment |
| year1 | AAMC | n_reached_or_valid_count | 1227 | 0.872 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | AAMC | n_trailing_nonresponse_rows | 1227 | -0.768 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | ASMC | median_item_rt_sec | 1199 | -0.584 | rt_context_not_achievement_adjustment |
| year1 | ASMC | n_reached_or_valid_count | 1223 | 0.708 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | ASMC | n_trailing_nonresponse_rows | 1223 | -0.599 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | BNL0-100 | median_item_rt_sec | 1178 | -0.046 | rt_context_not_achievement_adjustment |
| year1 | BNL0-100 | n_reached_or_valid_count | 1178 | 0.225 | coverage_or_valid_count_context_not_timed_D_speed |
| year1 | BNL0-100 | n_trailing_nonresponse_rows | 1178 | | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | STPM_vs_composite | score | 1006 | 0.232 | STPM_is_shadow_non_math_exclude_from_math_score |
| foundation | STPM_vs_composite | median_item_rt_sec | 1004 | -0.389 | STPM_is_shadow_non_math_exclude_from_math_score |
| foundation | STPM_vs_composite | total_rt_sec | 1004 | -0.34 | STPM_is_shadow_non_math_exclude_from_math_score |
| year1 | STPM_vs_composite | score | 1235 | 0.234 | STPM_is_shadow_non_math_exclude_from_math_score |
| year1 | STPM_vs_composite | median_item_rt_sec | 1228 | -0.387 | STPM_is_shadow_non_math_exclude_from_math_score |
| year1 | STPM_vs_composite | total_rt_sec | 1228 | -0.284 | STPM_is_shadow_non_math_exclude_from_math_score |
Reach/trailing correlations are partly mechanical under timed D/trailing-zero scoring. This is exactly why RT/tau should initially remain a shadow response-process layer rather than a direct achievement-band adjustment.
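
The mechanical part of that correlation can be shown with a deterministic toy case. This is a sketch under stated assumptions: four hypothetical students answer with identical 75% accuracy on the items they reach, unreached trailing items are scored zero, and the subtest has 20 items. The helper names are illustrative.

```python
import math

N_ITEMS = 20  # hypothetical timed subtest length


def trailing_zero_score(reached_responses: list) -> int:
    """Total score when every unreached trailing item counts as 0."""
    return sum(reached_responses)


def pearson(x: list, y: list) -> float:
    """Plain Pearson correlation, written out to stay dependency-free."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Each student repeats the pattern right, right, right, wrong on reached items.
reach = [4, 8, 12, 16]
responses = [[1, 1, 1, 0] * (r // 4) for r in reach]

scores = [trailing_zero_score(resp) for resp in responses]
accuracy_on_reached = [sum(resp) / len(resp) for resp in responses]
r_reach_score = pearson(reach, scores)
```

Accuracy on reached items is 0.75 for every student, yet reach and score correlate perfectly because the score already contains reach. Adding a separate speed adjustment on top of such a score would count the same information twice.
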
| model_id | purpose | latent_structure | subscores | premodel_status | promotion_gate |
|---|---|---|---|---|---|
| H0_current_operational_candidate | existing global score anchor | one global theta + subtest/testlet residuals u | not teacher-facing; u is nuisance/local-dependence residual | already fitted for inclusive/hard-filtered/sensitivities | retain as anchor while subscore challengers are tested |
| H1_global_plus_subtest_deviations | coherent teacher-facing global score + subscores | global theta; subtest score = global theta + shrunken subtest deviation; initially no separate nuisance residual for those same subtests | yes: report global, subtest posterior means/intervals, and relative deviation labels | recommended first Stan hierarchical subscore challenger | clean HMC, stable subscore posterior SDs, sensible shrinkage, better coherence than standalone subtest IRT, no harmful risk-band movement |
| H2_global_plus_NL_specific_deviation | target Year 1 BNL influence before full subtest expansion | global theta + Number Line-specific deviation/factor; optionally BNL residual fixed/omitted | global + NL profile only | recommended focused challenger if H1 is too broad or BNL remains unstable | keeps BNL contribution without weak BNL residual pathology; validates at least as well as H0 |
| H3_correlated_subtest_thetas | diagnostic upper-bound profile model | one correlated theta per subtest; global score is derived composite | yes but global must be defined after fitting | diagnostic only until feasibility improves; mirt/TAM high-dimensional screens were resource-burdened | only proceed if H1/H2 insufficient and dimensions are stable/interpretable |
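
The H1 latent structure in the table can be written out as a short sketch. The notation is assumed for illustration: $\theta_p$ is the global numeracy of person $p$, $\delta_{ps}$ the deviation on subtest $s$, $\eta_{ps}$ the reported subtest score, and $d_{ps}$ an observed deviation estimate with standard error $\mathrm{se}_{ps}$. The normal priors and the normal-normal approximation in the last line are assumptions, not the fitted specification.

```latex
\begin{aligned}
\eta_{ps} &= \theta_p + \delta_{ps}, \\
\theta_p &\sim \mathcal{N}(0, 1), \\
\delta_{ps} &\sim \mathcal{N}(0, \sigma_s^2), \\
\mathbb{E}[\delta_{ps} \mid d_{ps}] &\approx
  \frac{\sigma_s^2}{\sigma_s^2 + \mathrm{se}_{ps}^2}\, d_{ps}.
\end{aligned}
```

The last line is where the shrinkage comes from: a subtest with few items has a large $\mathrm{se}_{ps}$, so its deviation is pulled toward zero and the reported subscore stays close to the global score.
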
| policy_id | role | model_family | premodel_gate | promotion_gate |
|---|---|---|---|---|
| nl_85_95_current_3cat | benchmark/operational-compatible current policy | ordinal PCM/GPCM with categories 0: <.85, 1: .85 to <.95, 2: >=.95 | must be included as reference in all screens | already lockable as NL2 unless challenger clearly improves validation/fairness/classification |
| nl_80_90_relaxed_3cat | cutoff sensitivity challenger | ordinal 3-category PCM/GPCM | cell counts and target distributions acceptable | less harmful hard-target penalisation plus equal/better validation and risk classification |
| nl_90_97_strict_3cat | strict challenger | ordinal 3-category PCM/GPCM | top category not too sparse item-by-item | only if validation gain offsets expected sparsity/precision loss |
| nl_binary_95 | simple mastery-like sensitivity | binary Rasch/2PL screen | both classes present by item | unlikely to promote unless it improves decision validity despite information loss |
| nl_80_90_95_4cat | higher-resolution ordinal sensitivity | 4-category PCM/GPCM | all item categories have stable counts; thresholds ordered/usable | improved validation/precision without sparse-category pathology |
| continuous_abs_error_logitnormal_or_beta | formal continuous challenger, not TAM/mirt-faithful | mixed response Stan: binary/non-NL accuracy + continuous bounded NL accuracy/error | raw distributions and coordinate calibration pass; proxy validation competitive | material validation/classification/fairness gain over NL2 and clean HMC/PPC |

| model_id | purpose | status | uses_for_score | gate |
|---|---|---|---|---|
| RT0_QC_manifest_speed_descriptives | data-quality, rapid-response, timing-unit, and admin/device checks | recommended before any scoring use | none | no severe RT missingness/unit anomalies in candidate families |
| RT1_selected_family_speed_shadow | selected timed-family tau/pace research with accuracy anchor protected | supported by prior J2b work; rerun on 2026 BOY candidate families if needed | shadow only | tau aligns with RT/rapid behaviour; theta/risk bands not changed operationally |
| RT2_hierarchical_tau_shadow | overall response pace + family residual pace, coherent with teacher profile idea | Stan skeleton exists (J3b hierarchical tau) | shadow only | clean HMC; no subgroup/admin artefact; no achievement-band changes |
| RT3_joint_global_subscore_accuracy_speed | future integrated model after H1 subscore and RT2 pace models are separately stable | not first next fit | research only until validation burden is met | must add information beyond D/trailing-zero and not double-count speed/reach |

| stream | next_action | must_check_before_fit | must_check_after_fit |
|---|---|---|---|
| hierarchical_subscores | fit H1 Stan global+subtest-deviation model on hard-filtered operational frame | subtest score reliability/correlation/readiness table; avoid writing nuisance residuals as teacher subscores | HMC diagnostics, posterior SD by subscore, shrinkage size, global score movement, risk-band movement, subgroup movement, profile interpretability |
| number_line_policy | run frequentist ordinal cutoff screens using audited candidate policies; keep current .85/.95 as reference | item-by-target category counts and ECDF; reject policies with sparse/empty categories before Stan | threshold behaviour, reliability, score/risk movement, validation/fairness; continuous challenger only after proxy screen |
| accuracy_speed_joint | treat RT as QC/shadow; candidate families are timed non-NL achievement subtests only at first | RT missingness, row RT quantiles, rapid-row accuracy, STPM exclusion, D/trailing-zero double-count risk | tau construct validity, rapid effect direction, theta robustness, admin/subgroup artefacts, no operational risk-band changes |

- tables/premodeling/2026_boy_hierarchical_subscore_readiness.csv
- tables/premodeling/2026_boy_subtest_score_correlations.csv
- tables/premodeling/2026_boy_subtest_composite_correlations.csv
- tables/premodeling/2026_boy_subtest_profile_deviation_summary.csv
- tables/premodeling/2026_boy_nl_accuracy_distribution_by_item.csv
- tables/premodeling/2026_boy_nl_policy_item_cell_counts.csv
- tables/premodeling/2026_boy_nl_policy_overall_summary.csv
- tables/premodeling/2026_boy_rt_readiness_by_subtest.csv
- tables/premodeling/2026_boy_j2b_style_rapid_row_audit.csv
- tables/premodeling/2026_boy_speed_accuracy_correlations.csv
- analysis/modeling/v2_response_process_program/77_2026_boy_premodel_tam_cutoff_screens.R (requires TAM; intended for cisbox/AWS)

Concrete model ladders and gate checks for the next round.
Status: pre-fit design note generated after aggregate premodelling audit. Do not treat as an operational scoring decision.
Produce a coherent global score and teacher-facing subtest subscores, avoiding unrelated standalone subtest IRT scales.
H1_global_plus_subtest_deviations

For student p and subtest/domain s:
g_p ~ broad numeracy level
z_ps ~ standard normal residual profile component
delta_ps = sigma_delta_s * z_ps, centered across subtests within student
theta_ps = g_p + delta_ps
Binary/timed or untimed non-NL item j in subtest s[j]:
y_pj ~ Bernoulli_logit(theta_p,s[j] - b_j)
Ordinal Number Line item j under a PCM-style policy:
eta_1 = 0
eta_k = eta_{k-1} + theta_p,s[j] - (b_j + step_j,k-1)
y_pj ~ categorical_logit(eta)
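
The H1 structure above can be checked numerically before any Stan fit. The following is a minimal NumPy sketch, not the project's Stan program: the function names `centered_deltas` and `pcm_probs`, the array shapes, and all numeric values are illustrative assumptions. It shows that centring the scaled deviations within student leaves the mean of theta_ps equal to g_p, and how the cumulative PCM linear predictor turns into category probabilities.

```python
import numpy as np

def centered_deltas(z, sigma_delta):
    """delta_ps = sigma_delta_s * z_ps, centred across subtests within student."""
    delta = z * sigma_delta                      # (P, S) * (S,) broadcasts over students
    return delta - delta.mean(axis=1, keepdims=True)

def pcm_probs(theta, b, steps):
    """Category probabilities for one PCM-style item.

    eta_1 = 0 and eta_k = eta_{k-1} + theta - (b + step_{k-1}),
    followed by a softmax over eta (what categorical_logit applies).
    """
    increments = theta - (b + np.asarray(steps, dtype=float))
    eta = np.concatenate(([0.0], np.cumsum(increments)))
    eta = eta - eta.max()                        # stabilise the softmax
    weights = np.exp(eta)
    return weights / weights.sum()

# Illustrative values only (hypothetical sizes and scales).
rng = np.random.default_rng(0)
P, S = 4, 3
g = rng.normal(size=P)                           # broad numeracy level
z = rng.normal(size=(P, S))                      # standard normal profile component
sigma_delta = np.array([0.3, 0.5, 0.2])          # per-subtest deviation scale
delta = centered_deltas(z, sigma_delta)
theta = g[:, None] + delta                       # theta_ps = g_p + delta_ps
probs = pcm_probs(theta[0, 2], b=0.1, steps=[-0.4, 0.4])
```

Because the deviations sum to zero within each student, the subtest profile can only redistribute ability around g_p; it cannot shift the broad level itself.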
Identification/regularisation:
- Center delta_ps across subtests so g_p remains the broad level.
- Do not also estimate a nuisance testlet residual u for every subtest; otherwise the reportable subtest deviation and nuisance residual compete for the same signal.

Primary post-fit checks:
1. HMC: 0 divergences, no max-treedepth hits, Rhat/ESS acceptable for g, theta_ps, sigma_delta_s, item parameters.
2. Global movement vs hard-filtered H0: Spearman, median/p95 percentile shift, <15 and 15-35 risk-band movement.
3. Subscore quality: posterior SD by subtest, shrinkage size, profile-deviation stability.
4. Teacher-facing coherence: subscore intervals and relative-strength labels agree with observed subtest evidence without overclaiming.
5. Subgroup/admin movement: no adverse subgroup artefacts.
Keep BNL0-100 items in the global/hierarchical score but do not give BNL an extra nuisance residual variance if the current sigma_u[BNL0-100] remains weak.
Data-side option:
active_testlet_idx[BNL0-100] = 0
active_testlet_idx[other_subtests] = 1..K_active
Likelihood option:
resid = 0 if active_testlet_idx == 0
resid = sigma_u[k] * u_z[p,k] otherwise
theta_eff = theta + resid
This tests whether the issue is the BNL residual component, not the BNL items themselves.
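
The likelihood option can be illustrated outside Stan. This is a hedged NumPy sketch under stated assumptions: the function name `effective_theta`, the array layout, and the three-subtest example are hypothetical, and the real model would apply the same masking inside the Stan likelihood. It shows that a subtest with index 0 gets theta_eff equal to theta, while its items would still enter the likelihood.

```python
import numpy as np

def effective_theta(theta, u_z, sigma_u, active_testlet_idx):
    """Return theta_eff with shape (P, S).

    theta              : (P,) global ability
    u_z                : (P, K_active) standardised testlet residuals
    sigma_u            : (K_active,) residual scales
    active_testlet_idx : (S,) ints; 0 = no residual, 1..K_active = residual slot
    """
    idx = np.asarray(active_testlet_idx)
    resid = np.zeros((theta.shape[0], idx.shape[0]))
    active = idx > 0
    slot = idx[active] - 1                       # 1-based slot -> 0-based column
    resid[:, active] = sigma_u[slot] * u_z[:, slot]
    return theta[:, None] + resid

# Hypothetical layout: three subtests, the middle one standing in for BNL0-100.
rng = np.random.default_rng(1)
theta = rng.normal(size=5)
u_z = rng.normal(size=(5, 2))
sigma_u = np.array([0.4, 0.6])
theta_eff = effective_theta(theta, u_z, sigma_u, active_testlet_idx=[1, 0, 2])
```

The masked column carries no nuisance variance, so any change in global scores relative to the baseline can be attributed to the residual component rather than to item content.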
Premodelling audit outputs:
- tables/premodeling/2026_boy_nl_accuracy_distribution_by_item.csv
- tables/premodeling/2026_boy_nl_policy_item_cell_counts.csv
- tables/premodeling/2026_boy_nl_policy_overall_summary.csv

Frequentist screens before Stan:
nl_80_90_relaxed_3cat
nl_85_95_current_3cat
nl_90_97_strict_3cat
nl_binary_95
nl_80_90_95_4cat
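
A minimal sketch of how these policies could be turned into ordinal categories and screened for sparse cells. Only the .85/.95 cut points are stated explicitly in the policy table; the other cut values are inferred from the policy names and should be treated as assumptions, as should the helper names and the `min_count` threshold.

```python
import numpy as np

# Cut points per policy. Only 0.85/0.95 is stated explicitly in the policy table;
# the remaining values are read off the policy names (assumption).
POLICY_CUTS = {
    "nl_80_90_relaxed_3cat": [0.80, 0.90],
    "nl_85_95_current_3cat": [0.85, 0.95],
    "nl_90_97_strict_3cat": [0.90, 0.97],
    "nl_binary_95": [0.95],
    "nl_80_90_95_4cat": [0.80, 0.90, 0.95],
}

def categorise(accuracy, cuts):
    """Map accuracy to 0..len(cuts); each cut is the inclusive lower bound of a category."""
    return np.digitize(accuracy, cuts)

def cell_counts(accuracy, item_ids, cuts):
    """Category counts per item: {item_id: array of length len(cuts) + 1}."""
    cats = categorise(np.asarray(accuracy, dtype=float), cuts)
    item_ids = np.asarray(item_ids)
    n_cat = len(cuts) + 1
    return {
        str(item): np.bincount(cats[item_ids == item], minlength=n_cat)
        for item in np.unique(item_ids)
    }

def sparse_items(counts, min_count):
    """Items with at least one category below min_count (hypothetical screening rule)."""
    return [item for item, c in counts.items() if c.min() < min_count]

# Toy data: two hypothetical items.
acc = np.array([0.84, 0.85, 0.949, 0.95, 1.0, 0.99, 0.97, 0.96])
items = np.array(["a", "a", "a", "a", "a", "b", "b", "b"])
counts = cell_counts(acc, items, POLICY_CUTS["nl_85_95_current_3cat"])
flagged = sparse_items(counts, min_count=1)
```

In this toy example item "b" has no responses below .95, so it is flagged under the current policy; the same check run per policy gives the reject-before-Stan list.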
Promotion burden:
- .85/.95 remains the reference;

Continuous challenger sketch:
accuracy = 1 - absolute_error / scale_range
accuracy_squeezed = clamp/Smithson-Verkuilen transform into (0,1)
logit(mu_pj) = alpha_j + theta_p,s[j]
accuracy_pj ~ Beta(mu_pj * phi_j, (1 - mu_pj) * phi_j)
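
The transform and mean-precision parameterisation above can be sketched as follows. This is illustrative only: the helper names are hypothetical, and the squeeze shown is the commonly used Smithson-Verkuilen form y' = (y(n - 1) + 0.5) / n, which is assumed here to be the intended variant.

```python
import numpy as np

def nl_accuracy(abs_error, scale_range):
    """accuracy = 1 - absolute_error / scale_range."""
    return 1.0 - np.asarray(abs_error, dtype=float) / scale_range

def squeeze(y, n=None):
    """Smithson-Verkuilen squeeze of [0, 1] into (0, 1): (y * (n - 1) + 0.5) / n."""
    y = np.asarray(y, dtype=float)
    n = y.size if n is None else n
    return (y * (n - 1) + 0.5) / n

def beta_shapes(alpha_j, theta, phi_j):
    """Shape parameters of Beta(mu * phi, (1 - mu) * phi) with logit(mu) = alpha_j + theta."""
    mu = 1.0 / (1.0 + np.exp(-(alpha_j + theta)))
    return mu * phi_j, (1.0 - mu) * phi_j

# Toy example on a hypothetical 0-20 line: a perfect, a moderate, and a worst-case response.
acc = nl_accuracy([0.0, 5.0, 20.0], scale_range=20.0)
acc_squeezed = squeeze(acc)
a, b = beta_shapes(alpha_j=0.0, theta=0.0, phi_j=10.0)
```

The squeeze matters because exact 0 and 1 accuracies have zero Beta density, so unsqueezed perfect responses would make the likelihood undefined.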
Optional signed-error diagnostic, not first scoring model:
signed_error_scaled_pj ~ Normal(target_bias_j + method_bias_family + ability_slope_j * theta, sigma_j)
Operational posture: RT is shadow/QC first. Timed D/trailing-zero already encodes reach/time-pressure, so response time can double-count speed if added naively.
Initial 2026 BOY data rule:
Candidate shadow model:
y_pj ~ Bernoulli_logit(theta_p - b_j + gamma_family * rapid_pj)
RT_pj ~ LogNormal(beta0 + beta_j - tau_p,family[j], sigma_rt_family)
Hierarchical pace extension:
tau_p,f = tau_overall_p + tau_residual_p,f
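
A small numeric sketch of the pace structure, assuming response time is lognormal so that log RT is normal with mean beta0 + beta_j - tau. The function names, shapes, and values are hypothetical and are not taken from the J3b Stan skeleton.

```python
import numpy as np

def pace(tau_overall, tau_residual):
    """tau_{p,f} = tau_overall_p + tau_residual_{p,f}; returns shape (P, F)."""
    return tau_overall[:, None] + tau_residual

def log_rt_mean(beta0, beta_item, tau, item_family):
    """Mean of log RT per (student, item): beta0 + beta_j - tau_{p, family[j]}."""
    return beta0 + beta_item[None, :] - tau[:, item_family]

def lognormal_loglik(rt, mean_log, sigma_log):
    """Pointwise log density of RT ~ LogNormal(mean_log, sigma_log)."""
    log_rt = np.log(rt)
    z = (log_rt - mean_log) / sigma_log
    return -log_rt - np.log(sigma_log) - 0.5 * np.log(2.0 * np.pi) - 0.5 * z**2

# Two hypothetical students, two families, three items.
tau = pace(np.array([0.0, 1.0]), np.zeros((2, 2)))
item_family = np.array([0, 1, 1])                # family index per item
beta_item = np.array([0.0, 0.2, -0.1])           # item time intensity
mean_log = log_rt_mean(1.0, beta_item, tau, item_family)
ll = lognormal_loglik(1.0, 0.0, 1.0)
```

A higher tau lowers the expected log RT, so tau reads as pace (faster responding), which is the direction the shadow gate "tau aligns with RT/rapid behaviour" is meant to confirm.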
Pre-fit checks already written:
- tables/premodeling/2026_boy_rt_readiness_by_subtest.csv
- tables/premodeling/2026_boy_j2b_style_rapid_row_audit.csv
- tables/premodeling/2026_boy_speed_accuracy_correlations.csv

Do not use RT/tau to alter risk bands unless later evidence shows robust validation gain, no subgroup/admin artefact, and added information beyond D/reach.
Rendered from the saved Markdown decision artifact.
Review timestamp: 2026-06-14 UTC
All AWS model jobs are complete. There are no active EC2 instances matching the 2026 BOY operational Number Line model tags, no active cisbox rsync sessions, and the local sensitivity monitor was stopped after all six sensitivity .done markers were present.
Final outstanding run (year1_no_BNL0_100) is synced, checksum-verified, recovered from the known no-NL post-processing failure, and its EC2 instance was terminated.
The review covers 10 Stan jobs:
1. Foundation inclusive baseline.
2. Year 1 inclusive baseline.
3. Foundation hard-item-filtered baseline.
4. Year 1 hard-item-filtered baseline.
5. Foundation sensitivity: no DMT10_2026.
6. Foundation sensitivity: no MQ1-20 and no DMT10_2026.
7. Foundation sensitivity: no BNL0-20.
8. Year 1 sensitivity: no MC0-100.
9. Year 1 sensitivity: no BNL0-100.
10. Year 1 sensitivity: core model with no MC and no NL.
Source output base:
/data/numeracy-screening-models/irt/2026_boy_operational_accuracy_nl_candidate
Local review artifacts:
outputs/runs/irt-2026-boy-subtest-audit/latest/reports/model_review/stan_review_summary.md
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_job_diagnostic_summary.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_score_movement_comparisons.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_testlet_sigma_summary_long.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_item_difficulty_extreme_or_diagnostic_flags.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_u_residual_diagnostic_summary.csv
All 10 jobs have successful MCMC sampling evidence:
- 0 divergent transitions and 0 max-treedepth hits in every job.
- Minimum EBFMI is 0.568 (year1_core_no_MC_no_NL), acceptable.

Three no-NL-style jobs exited with Stan runner exit code 1 because of the known post-processing bug for empty/missing NL lookup files, not because of sampler failure:

- foundation_no_BNL0_20
- year1_no_BNL0_100
- year1_core_no_MC_no_NL

All three were recovered from QC summaries and now have final score, item, testlet, and fit-readout files.

| job | exit | postprocess | verify | div | treedepth hits | min EBFMI | theta max Rhat / min ESS | testlet max Rhat / min ESS | note |
|---|---|---|---|---|---|---|---|---|---|
| Foundation inclusive | 0 | completed | 155/155 | 0 | 0 | 0.705 | 1.004 / 5066 | 1.006 / 1173 | clean |
| Year 1 inclusive | 0 | completed | 155/155 | 0 | 0 | 0.614 | 1.004 / 941 | 1.023 / 109 | weak BNL0-100 testlet sigma |
| Foundation hard-filtered | 0 | completed | 1955/1955 | 0 | 0 | 0.677 | 1.003 / 4051 | 1.003 / 1081 | clean |
| Year 1 hard-filtered | 0 | completed | 1955/1955 | 0 | 0 | 0.646 | 1.006 / 1504 | 1.068 / 78 | weak BNL0-100 testlet sigma |
| Foundation no DMT10_2026 | 0 | completed | 2104/2104 | 0 | 0 | 0.694 | 1.003 / 3337 | 1.009 / 482 | clean |
| Foundation no MQ1-20/no DMT10_2026 | 0 | completed | 2104/2104 | 0 | 0 | 0.727 | 1.002 / 6115 | 1.007 / 585 | clean |
| Foundation no BNL0-20 | 1 | recovered | 2098/2098 | 0 | 0 | 0.667 | 1.002 / 4894 | 1.004 / 1221 | sampling clean; postprocess recovered |
| Year 1 no MC0-100 | 0 | completed | 2104/2104 | 0 | 0 | 0.647 | 1.002 / 4875 | 1.004 / 670 | clean |
| Year 1 no BNL0-100 | 1 | recovered | 2098/2098 | 0 | 0 | 0.598 | 1.003 / 3313 | 1.003 / 1668 | sampling clean; postprocess recovered |
| Year 1 no MC/no NL | 1 | recovered | 2098/2098 | 0 | 0 | 0.568 | 1.002 / 4612 | 1.003 / 1925 | sampling clean; postprocess recovered |
The global Year 1 baseline is usable from a sampler perspective, but the BNL0-100 testlet residual scale is weakly identified:
- Year 1 inclusive BNL0-100 sigma: Rhat ~1.023, ESS_bulk ~109.
- Year 1 hard-filtered BNL0-100 sigma: Rhat ~1.068, ESS_bulk ~78.

This issue is local to the BNL0-100 residual/testlet component. It does not show up as divergent transitions, treedepth failures, poor theta mixing, or item-difficulty non-convergence. It does show up in the latent residuals for the same component: in the hard-filtered Year 1 run, u[,5] corresponds to BNL0-100, and 1193/1221 residual terms had Rhat > 1.01, with max Rhat ~1.026. The likely interpretation is that the residual BNL0-100 testlet variance is near a boundary/small value and is hard for the sampler to estimate, while the BNL0-100 items themselves carry substantial global-theta information.
u residual diagnostic

| job | testlet | residual terms | Rhat > 1.01 | ESS < 400 | max Rhat | min ESS | interpretation |
|---|---|---|---|---|---|---|---|
| Year 1 inclusive | BNL0-100 | 1221 | 0 | 3 | 1.009 | 320 | minor low-ESS nuisance terms |
| Year 1 hard-filtered | BNL0-100 | 1221 | 1193 | 2 | 1.026 | 279 | broad residual-component mixing issue tied to BNL testlet |
No other job/testlet had u residual terms with Rhat > 1.01 or ESS_bulk < 400. This reinforces that the caveat is localized to Year 1 BNL0-100 dependence modelling, not to the global theta score or item difficulty estimates.
The hard-item filter removes the 70 predeclared no-information items and has negligible impact on student ranking/risk classification.
| comparison | n | Spearman | median abs percentile shift | p95 shift | exact 3-band agreement | very-low Jaccard | low+very-low Jaccard | moved out/in, very-low | moved out/in, low+very-low |
|---|---|---|---|---|---|---|---|---|---|
| Foundation inclusive vs hard-filtered | 997 | 1.000 | 0.30 pp | 1.40 pp | 99.0% | 0.974 | 0.983 | 2 / 2 | 3 / 3 |
| Year 1 inclusive vs hard-filtered | 1221 | 0.999 | 0.74 pp | 2.62 pp | 98.5% | 0.968 | 0.972 | 3 / 3 | 6 / 6 |
Conclusion: hard-item-filtered should be the working operational baseline. The inclusive runs are useful historical evidence but should not be promoted over the filtered version.
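
The movement metrics in the comparison table can be reproduced from two score vectors along the following lines. This is a sketch under assumptions, not the review's actual code: percentile ranks are taken as mid-rank percentages within the comparison sample, ties are not specially handled, and the bands are very-low (<15), low (15-35), and other, matching the risk bands named in the post-fit checks.

```python
import numpy as np

def percentile_rank(scores):
    """Within-sample percentile in (0, 100): 100 * (rank + 0.5) / n (ties broken arbitrarily)."""
    scores = np.asarray(scores, dtype=float)
    ranks = np.argsort(np.argsort(scores))
    return 100.0 * (ranks + 0.5) / scores.size

def jaccard(mask_a, mask_b):
    """|A and B| / |A or B| for two boolean membership masks."""
    union = np.logical_or(mask_a, mask_b).sum()
    return np.nan if union == 0 else np.logical_and(mask_a, mask_b).sum() / union

def movement_summary(score_a, score_b):
    """Rank agreement, percentile shift, and risk-band overlap between two score sets."""
    pa, pb = percentile_rank(score_a), percentile_rank(score_b)
    shift = np.abs(pa - pb)
    band_a = np.digitize(pa, [15.0, 35.0])       # 0 = very-low, 1 = low, 2 = other
    band_b = np.digitize(pb, [15.0, 35.0])
    return {
        "spearman": np.corrcoef(pa, pb)[0, 1],   # Pearson on ranks = Spearman without ties
        "median_abs_shift_pp": np.median(shift),
        "p95_shift_pp": np.percentile(shift, 95),
        "band_agreement": np.mean(band_a == band_b),
        "very_low_jaccard": jaccard(band_a == 0, band_b == 0),
        "low_or_very_low_jaccard": jaccard(band_a <= 1, band_b <= 1),
    }

# Toy check: identical scores give perfect agreement; reversed scores give Spearman -1.
scores = np.linspace(-2.0, 2.0, 100)
same = movement_summary(scores, scores)
flipped = movement_summary(scores, -scores)
```

Reporting the Jaccard overlap alongside Spearman is useful because rank correlation can stay near 1 while membership of the lowest band still changes at the margin.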
| sensitivity | n | Spearman | median shift | p95 shift | 3-band agreement | very-low Jaccard | low+very-low Jaccard | interpretation |
|---|---|---|---|---|---|---|---|---|
| no DMT10_2026 | 997 | 0.935 | 5.72 pp | 21.00 pp | 85.2% | 0.703 | 0.758 | DMT contributes materially; removal is not classification-stable. |
| no MQ1-20 and no DMT10_2026 | 995 | 0.825 | 9.95 pp | 35.68 pp | 76.1% | 0.520 | 0.642 | Removing both early quantity/decomposition content substantially changes the score. |
| no BNL0-20 | 997 | 0.865 | 8.02 pp | 32.32 pp | 77.5% | 0.505 | 0.661 | Foundation Number Line is highly influential and improves precision. |
Foundation interpretation:
- BNL0-20 is important to the global score; dropping it causes large risk-band movement.
- DMT10_2026 also matters; despite being untimed, it contributes meaningfully to the Foundation global trait.

| sensitivity | n | Spearman | median shift | p95 shift | 3-band agreement | very-low Jaccard | low+very-low Jaccard | interpretation |
|---|---|---|---|---|---|---|---|---|
| no MC0-100 | 1211 | 0.993 | 1.82 pp | 6.77 pp | 96.2% | 0.905 | 0.936 | Removing MC has modest impact; MC is not the main source of instability. |
| no BNL0-100 | 1221 | 0.768 | 11.88 pp | 39.31 pp | 70.3% | 0.402 | 0.547 | Removing BNL radically changes rankings/risk bands and greatly increases uncertainty. |
| no MC/no NL | 1211 | 0.739 | 13.46 pp | 40.42 pp | 68.5% | 0.382 | 0.519 | Core-only score differs substantially from the full hard-filtered candidate. |
Year 1 interpretation:
- MC0-100 is not a major concern; the no-MC sensitivity remains close to the hard-filtered baseline.
- BNL0-100 is the key decision point. It is highly influential for Year 1 risk classification and precision.
- The BNL0-100 sigma diagnostic should not be read as evidence to drop BNL. The no-BNL sensitivity shows the opposite: dropping it materially changes the construct coverage and low-achievement identification.
- Keep BNL0-100 as a strong candidate, but resolve/report the localized testlet-sigma issue before final operational promotion.

Frequentist pre-screening remains consistent with the Stan review:
Greater than 20000 quadrature points).

Therefore, the current Stan evidence should be interpreted within a 1D+testlet operational-candidate frame, not as support for immediate multidimensional/bifactor escalation.

1. Promote the hard-item-filtered model frame as the working baseline for final reporting comparisons. The hard filter removes no-information items with near-zero impact on student scores/risk bands.
2. Foundation: keep BNL0-20 and DMT10_2026 in the operational candidate. Both materially affect risk identification; the Foundation hard-filtered Stan run is diagnostically clean.
3. Year 1: do not drop BNL0-100 based on the sigma diagnostic alone. Removing it causes major movement and loss of precision. Treat the issue as a localized residual-scale estimation problem, not a failed global score.
4. Run or design one surgical Year 1 sensitivity if final promotion requires clearing the sigma caveat: keep BNL0-100 items in the global score but omit/fix the BNL0-100 testlet residual scale. This directly tests whether the weak sigma parameter is harmless. This is more informative than a no-BNL model, which changes both construct coverage and precision.
5. Complete external validation and subgroup movement checks before final operational lock-in. Compare hard-filtered baseline and key sensitivities against PAT/teacher outcomes and demographic/school subgroup stability, with priority on the <15th and 15th–35th percentile bands.
6. Update the audit/report package. Add sections for item eligibility, hard-filtered vs inclusive comparison, frequentist model rungs, Stan sensitivity results, and the Year 1 BNL0-100 decision caveat.

1. Add the generated model-review tables to the unified audit HTML/report.
2. Build a final score-movement table with student-level risk-band transitions for the hard baseline vs the three most important sensitivity contrasts:
   - no BNL0-20.
   - no BNL0-100.
   - no MC0-100.
3. Run outcome validation comparisons for the hard baseline and sensitivity variants.
4. Review Year 1 BNL0-100 item-level diagnostics:
5. Decide whether to run the surgical Year 1 BNL-included/no-BNL-testlet-residual Stan sensitivity.
6. Draft the operational recommendation: