# 2026 BOY premodelling audit: hierarchical subscores, Number Line policy, and accuracy-speed

Generated: 2026-06-14 08:57:14Z

This is a dependency-light, aggregate-only audit. It does not publish raw student identifiers or person-level score files.

## Executive readout

- Hierarchical subscores are justified as a modelling direction: standalone subtest evidence is uneven, so teacher-facing profiles should be shrunken toward, and coherent with, the global score rather than reported as independent standalone IRT scores.
- The first hierarchical Stan challenger should be `H1_global_plus_subtest_deviations`: global numeracy plus reportable subtest deviations. The current nuisance testlet residuals `u` should not be repurposed as subscores.
- Number Line cutoff changes should be screened from raw coordinate-derived accuracy distributions first. This audit writes item-by-target ECDF/category-count tables for `.80/.90`, `.85/.95`, `.90/.97`, binary `>=.95`, and a 4-category `.80/.90/.95` option.
- Accuracy-speed remains shadow/QC-first. Timed D/trailing-zero scoring already encodes reach/speed pressure, so RT must not be allowed to count speed a second time in achievement bands until that use has been validated.

## Hierarchical subscore readiness

| year | subtest | keep_items | rel | band | global_r | posture |
| --- | --- | --- | --- | --- | --- | --- |
| foundation | MQ1-20 | 19 | 0.602 | weak | 0.426 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore |
| foundation | MC0-20 | 50 | 0.927 | strong | 0.532 | strong standalone signal; still prefer hierarchical coherence with global score |
| foundation | MNC0-20 | 24 | 0.881 | strong | 0.604 | strong standalone signal; still prefer hierarchical coherence with global score |
| foundation | DMT10_2026 | 8 | 0.609 | weak | 0.43 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore |
| foundation | BNL0-20 | 10 | 0.674 | weak | 0.354 | hierarchical_shrinkage_required; avoid standalone high-stakes subscore |
| year1 | MC0-100 | 34 | 0.94 | strong | 0.694 | strong standalone signal; still prefer hierarchical coherence with global score |
| year1 | MNC0-100 | 22 | 0.891 | strong | 0.762 | strong standalone signal; still prefer hierarchical coherence with global score |
| year1 | AAMC | 38 | 0.9 | strong | 0.729 | strong standalone signal; still prefer hierarchical coherence with global score |
| year1 | ASMC | 25 | 0.841 | moderate | 0.617 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic |
| year1 | BNL0-100 | 13 | 0.727 | moderate | 0.548 | hierarchical_shrinkage_recommended; standalone subscore only as caveated diagnostic |

Key implication: several subtests are not ideal standalone reporting scores, especially where reliability is weak/moderate or item counts are small. That is an argument *for* hierarchical shrinkage, not against subscores.
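To make the shrinkage argument concrete, the sketch below applies the standard normal-normal posterior formula to a single subtest deviation. It is an illustration only, not the H1 Stan model: `observed_dev`, `se`, and `prior_sd` are hypothetical inputs, and a fitted hierarchical model would estimate the deviation spread jointly rather than take it as given.

```python
def shrink_deviation(observed_dev, se, prior_sd):
    """Posterior mean/SD of a subtest deviation under a normal-normal model.

    observed_dev: standalone subtest estimate minus global estimate (hypothetical input)
    se:           standard error of that observed deviation (hypothetical input)
    prior_sd:     assumed population SD of true deviations (hypothetical hyperparameter)
    """
    if se <= 0 or prior_sd <= 0:
        raise ValueError("se and prior_sd must be positive")
    # Weight on the data: close to 1 for precise subtests, close to 0 for noisy ones.
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    post_mean = weight * observed_dev
    # Posterior variance = 1 / (1/prior_sd^2 + 1/se^2) = weight * se^2.
    post_sd = (weight * se ** 2) ** 0.5
    return post_mean, post_sd, weight


# A precise subtest keeps most of its deviation; a noisy one is pulled toward 0.
precise = shrink_deviation(observed_dev=1.0, se=0.5, prior_sd=1.0)
noisy = shrink_deviation(observed_dev=1.0, se=2.0, prior_sd=1.0)
```

Under these assumptions a weak, short subtest contributes a small, heavily shrunken deviation, while a strong subtest keeps most of its standalone signal. That is the behaviour the readiness table argues for.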

### Current-policy subtest relationships

| year | subtest_1 | subtest_2 | n | rho | band |
| --- | --- | --- | --- | --- | --- |
| foundation | MQ1-20 | MC0-20 | 1005 | 0.405 | moderate |
| foundation | MQ1-20 | MNC0-20 | 1003 | 0.405 | moderate |
| foundation | MQ1-20 | DMT10_2026 | 1002 | 0.275 | low |
| foundation | MQ1-20 | BNL0-20 | 974 | 0.171 | low |
| foundation | MC0-20 | MNC0-20 | 1003 | 0.561 | moderate |
| foundation | MC0-20 | DMT10_2026 | 1002 | 0.283 | low |
| foundation | MC0-20 | BNL0-20 | 974 | 0.245 | low |
| foundation | MNC0-20 | DMT10_2026 | 1002 | 0.393 | low |
| foundation | MNC0-20 | BNL0-20 | 974 | 0.304 | low |
| foundation | DMT10_2026 | BNL0-20 | 974 | 0.302 | low |
| year1 | MC0-100 | MNC0-100 | 1229 | 0.683 | high |
| year1 | MC0-100 | AAMC | 1227 | 0.595 | moderate |
| year1 | MC0-100 | ASMC | 1223 | 0.485 | moderate |
| year1 | MC0-100 | BNL0-100 | 1178 | 0.468 | moderate |
| year1 | MNC0-100 | AAMC | 1227 | 0.671 | high |
| year1 | MNC0-100 | ASMC | 1223 | 0.561 | moderate |
| year1 | MNC0-100 | BNL0-100 | 1178 | 0.504 | moderate |
| year1 | AAMC | ASMC | 1223 | 0.595 | moderate |
| year1 | AAMC | BNL0-100 | 1178 | 0.469 | moderate |
| year1 | ASMC | BNL0-100 | 1178 | 0.379 | low |


### Profile-deviation spread

| year | subtest | n | sd_dev_z | p10 | p90 | % \|dev\| > 1z |
| --- | --- | --- | --- | --- | --- | --- |
| foundation | MQ1-20 | 1005 | 0.934 | -1.04 | 1.11 | 22.4% |
| foundation | MC0-20 | 1005 | 0.842 | -0.98 | 1.03 | 20.2% |
| foundation | MNC0-20 | 1003 | 0.787 | -0.99 | 1.02 | 20.4% |
| foundation | DMT10_2026 | 1002 | 0.938 | -1.21 | 1.15 | 27.3% |
| foundation | BNL0-20 | 974 | 0.998 | -1.27 | 1.23 | 31.4% |
| year1 | MC0-100 | 1229 | 0.737 | -0.82 | 0.91 | 14.5% |
| year1 | MNC0-100 | 1229 | 0.64 | -0.74 | 0.8 | 10.5% |
| year1 | AAMC | 1227 | 0.688 | -0.82 | 0.81 | 10.8% |
| year1 | ASMC | 1223 | 0.786 | -0.99 | 0.97 | 19.1% |
| year1 | BNL0-100 | 1178 | 0.902 | -1.06 | 1.12 | 23.3% |
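The spread columns above could be reproduced from a vector of per-person deviation z-scores along the following lines. This is a hedged sketch: how the audit defines the deviation itself is not stated here, so `dev_z` is taken as a given input, and the choice of sample SD and inclusive quantile interpolation is an assumption.

```python
import statistics


def summarise_deviations(dev_z):
    """Summarise a list of profile deviations expressed in z units.

    dev_z is assumed to be precomputed (e.g. subtest z minus global z);
    that definition is NOT confirmed by the audit text.
    """
    values = sorted(dev_z)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two deviations")
    # Nine cut points; index 0 is the 10th percentile, index 8 the 90th.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    share_large = sum(abs(v) > 1 for v in values) / n
    return {
        "n": n,
        "sd_dev_z": statistics.stdev(values),  # sample SD (assumption)
        "p10": deciles[0],
        "p90": deciles[8],
        "pct_abs_gt_1": 100.0 * share_large,
    }


summary = summarise_deviations([-2.0, -1.0, 0.0, 1.0, 2.0])
```

Only aggregates leave the function, which matches the aggregate-only posture of this audit.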


## Number Line cutoff policy premodelling

| year | subtest | policy | items_ok | median_min_pct | median_top_pct | entropy | posture |
| --- | --- | --- | --- | --- | --- | --- | --- |
| foundation | BNL0-20 | nl_80_90_95_4cat | 9/10 | 15.6% | 25.3% | 0.955 | higher-resolution_challenger; use_only_if_item_category_cells_are_stable |
| foundation | BNL0-20 | nl_80_90_relaxed_3cat | 9/10 | 20.6% | 48.0% | 0.938 | relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets |
| foundation | BNL0-20 | nl_85_95_current_3cat | 10/10 | 18.2% | 25.3% | 0.93 | benchmark_current_policy; keep as reference in all modelling |
| foundation | BNL0-20 | nl_90_97_strict_3cat | 10/10 | 13.6% | 14.6% | 0.873 | strict_challenger; reject_if_top_category_sparse_or_validation_not_better |
| foundation | BNL0-20 | nl_binary_95 | 10/10 | 25.3% | 25.3% | 0.815 | modelable_if_cells_ok_but_loses_partial-credit_information |
| year1 | BNL0-100 | nl_80_90_95_4cat | 13/13 | 19.4% | 26.4% | 0.989 | higher-resolution_challenger; use_only_if_item_category_cells_are_stable |
| year1 | BNL0-100 | nl_80_90_relaxed_3cat | 13/13 | 24.4% | 48.7% | 0.954 | relaxed_challenger; useful_if_current_policy_over-penalises_hard_targets |
| year1 | BNL0-100 | nl_85_95_current_3cat | 13/13 | 26.4% | 26.4% | 0.98 | benchmark_current_policy; keep as reference in all modelling |
| year1 | BNL0-100 | nl_90_97_strict_3cat | 13/13 | 16.2% | 16.2% | 0.913 | strict_challenger; reject_if_top_category_sparse_or_validation_not_better |
| year1 | BNL0-100 | nl_binary_95 | 13/13 | 26.4% | 26.4% | 0.833 | modelable_if_cells_ok_but_loses_partial-credit_information |


Interpretation rule: a policy can be *modelable* from cell counts but still not promotable. Promotion additionally requires validation evidence, acceptable risk-band movement, fairness/subgroup checks, and interpretability. Current `.85/.95` remains the benchmark.
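The cell-count screen can be sketched for a single item as follows. The cutoffs mirror the policy names in the table. The boundary convention (an accuracy exactly on a cutoff goes to the higher category) follows the `0=<.85, 1=.85-.95, 2=>=.95` description of the current policy and is assumed to apply to the other policies too. Normalised Shannon entropy is likewise an assumption about how an entropy column of this kind might be computed, and the function names are illustrative.

```python
from bisect import bisect_right
from collections import Counter
from math import log

# Ascending accuracy cutoffs per candidate policy (taken from the policy ids).
POLICY_CUTOFFS = {
    "nl_80_90_relaxed_3cat": (0.80, 0.90),
    "nl_85_95_current_3cat": (0.85, 0.95),
    "nl_90_97_strict_3cat": (0.90, 0.97),
    "nl_binary_95": (0.95,),
    "nl_80_90_95_4cat": (0.80, 0.90, 0.95),
}


def categorise(accuracy, cutoffs):
    """Ordinal category = number of cutoffs the accuracy meets or exceeds."""
    return bisect_right(cutoffs, accuracy)


def cell_summary(accuracies, cutoffs):
    """Category counts, smallest/top category share, normalised entropy for one item."""
    if not accuracies:
        raise ValueError("no accuracies supplied")
    n_cat = len(cutoffs) + 1
    counts = Counter(categorise(a, cutoffs) for a in accuracies)
    cells = [counts.get(k, 0) for k in range(n_cat)]
    shares = [c / len(accuracies) for c in cells]
    # Entropy scaled to [0, 1]; 1 means perfectly balanced categories.
    entropy = -sum(p * log(p) for p in shares if p > 0) / log(n_cat)
    return {
        "cells": cells,
        "min_pct": 100.0 * min(shares),
        "top_pct": 100.0 * shares[-1],
        "entropy": entropy,
    }


toy = [0.50, 0.86, 0.90, 0.96]  # made-up accuracies, not audit data
by_policy = {name: cell_summary(toy, cuts) for name, cuts in POLICY_CUTOFFS.items()}
```

A policy with an empty or near-empty cell for any item would fail the premodel gate before any ordinal model is fitted. The table above reports medians across items, whereas this sketch summarises one item at a time.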

## Accuracy-speed / RT readiness

| year | subtest | role | timed | obs_rt_miss | presented_miss | trailing | rt_p50 | <1s | model_role | flags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| foundation | BNL0-20 | achievement_primary | False | 0.00% | 2.0% | 0.0% | 7 | 0.8% | nl_rt_context_only_initially_not_accuracy_speed_scoring | none_obvious_from_row_rt_audit |
| foundation | DMT10_2026 | achievement_primary | False | 0.00% | 1.5% | 0.0% | 16 | 0.0% | untimed_or_other_context_only_initially | none_obvious_from_row_rt_audit |
| foundation | MC0-20 | achievement_primary | True | 0.00% | 75.0% | 74.0% | 6 | 0.4% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | MNC0-20 | achievement_primary | True | 0.00% | 76.1% | 75.1% | 12 | 0.4% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | MQ1-20 | achievement_primary | True | 0.00% | 84.2% | 83.3% | 20 | 0.8% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| foundation | STPM | shadow_speed_only | True |  | 6.2% | 5.2% | 8 | 0.1% | shadow_speed_only_exclude_from_math_achievement | presented_row_rt_missing_or_negative_gt_5pct |
| year1 | AAMC | achievement_primary | True | 0.00% | 80.2% | 78.5% | 9 | 0.5% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | ASMC | achievement_primary | True | 0.00% | 77.6% | 75.6% | 12 | 0.5% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | BNL0-100 | achievement_primary | False | 0.00% | 3.8% | 0.0% | 5 | 0.7% | nl_rt_context_only_initially_not_accuracy_speed_scoring | none_obvious_from_row_rt_audit |
| year1 | MC0-100 | achievement_primary | True | 0.00% | 77.1% | 76.0% | 6 | 0.4% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | MNC0-100 | achievement_primary | True | 0.00% | 72.9% | 71.4% | 11 | 0.5% | initial_joint_accuracy_rt_candidate_timed_non_nl_observed_rows_plus_D_reach_context | presented_row_rt_missing_high_expected_from_trailing_unreached_D_rows |
| year1 | STPM | shadow_speed_only | True |  | 6.4% | 4.4% | 6 | 0.1% | shadow_speed_only_exclude_from_math_achievement | presented_row_rt_missing_or_negative_gt_5pct |


### J2b-style rapid-row descriptive check

| year | subtest | rapid_rate | rapid_acc | nonrapid_acc | delta |
| --- | --- | --- | --- | --- | --- |
| foundation | MC0-20 | 5.97% | 0.636 | 0.919 | -0.283 |
| foundation | MNC0-20 | 3.96% | 0.088 | 0.754 | -0.666 |
| foundation | MQ1-20 | 3.08% | 0.054 | 0.656 | -0.601 |
| year1 | AAMC | 4.36% | 0.184 | 0.803 | -0.618 |
| year1 | ASMC | 4.24% | 0.132 | 0.627 | -0.494 |
| year1 | MC0-100 | 5.27% | 0.567 | 0.897 | -0.331 |
| year1 | MNC0-100 | 4.03% | 0.087 | 0.834 | -0.747 |
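The columns in this check can be derived from observed rows as sketched below, with `delta` defined as rapid accuracy minus non-rapid accuracy, which agrees with the table values to rounding. The rapid-response threshold used by the J2b work is not stated in this audit, so `rapid_threshold_sec` is a hypothetical parameter and the rows are made up.

```python
def rapid_row_check(rows, rapid_threshold_sec):
    """Compare accuracy on rapid vs non-rapid observed rows.

    rows: iterable of (rt_seconds, correct) pairs with correct in {0, 1}
    rapid_threshold_sec: hypothetical cutoff; the audit does not state its value
    """
    rows = list(rows)
    rapid = [c for rt, c in rows if rt < rapid_threshold_sec]
    nonrapid = [c for rt, c in rows if rt >= rapid_threshold_sec]
    if not rapid or not nonrapid:
        raise ValueError("need at least one rapid and one non-rapid row")
    rapid_acc = sum(rapid) / len(rapid)
    nonrapid_acc = sum(nonrapid) / len(nonrapid)
    return {
        "rapid_rate": len(rapid) / len(rows),
        "rapid_acc": rapid_acc,
        "nonrapid_acc": nonrapid_acc,
        "delta": rapid_acc - nonrapid_acc,  # negative: rapid rows are less accurate
    }


toy_rows = [(0.5, 0), (0.8, 1), (3.0, 1), (4.0, 1), (5.0, 0), (6.0, 1)]
check = rapid_row_check(toy_rows, rapid_threshold_sec=1.0)
```

A strongly negative `delta` is descriptive evidence that rapid rows behave differently. On its own it does not justify changing achievement scores.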


### Person-level speed/reach correlations with current-policy scores

| year | subtest | metric | n | rho | note |
| --- | --- | --- | --- | --- | --- |
| foundation | STPM | median_item_rt_sec | 1016 | -0.437 | rt_context_not_achievement_adjustment |
| foundation | STPM | n_reached_or_valid_count | 1024 | 0.819 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | STPM | n_trailing_nonresponse_rows | 1024 | -0.771 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MQ1-20 | median_item_rt_sec | 998 | -0.462 | rt_context_not_achievement_adjustment |
| foundation | MQ1-20 | n_reached_or_valid_count | 1006 | 0.711 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MQ1-20 | n_trailing_nonresponse_rows | 1006 | -0.656 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MC0-20 | median_item_rt_sec | 995 | -0.83 | rt_context_not_achievement_adjustment |
| foundation | MC0-20 | n_reached_or_valid_count | 1005 | 0.932 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MC0-20 | n_trailing_nonresponse_rows | 1005 | -0.873 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MNC0-20 | median_item_rt_sec | 993 | -0.673 | rt_context_not_achievement_adjustment |
| foundation | MNC0-20 | n_reached_or_valid_count | 1003 | 0.778 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | MNC0-20 | n_trailing_nonresponse_rows | 1003 | -0.721 | reach_count_is_partly_scoring_policy_for_timed_D |
| foundation | DMT10_2026 | median_item_rt_sec | 988 | 0.061 | rt_context_not_achievement_adjustment |
| foundation | DMT10_2026 | n_reached_or_valid_count | 1002 | 0.206 | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | DMT10_2026 | n_trailing_nonresponse_rows | 1002 |  | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | BNL0-20 | median_item_rt_sec | 974 | -0.009 | rt_context_not_achievement_adjustment |
| foundation | BNL0-20 | n_reached_or_valid_count | 974 | 0.345 | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | BNL0-20 | n_trailing_nonresponse_rows | 974 |  | coverage_or_valid_count_context_not_timed_D_speed |
| year1 | STPM | median_item_rt_sec | 1235 | -0.432 | rt_context_not_achievement_adjustment |
| year1 | STPM | n_reached_or_valid_count | 1256 | 0.821 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | STPM | n_trailing_nonresponse_rows | 1256 | -0.719 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MC0-100 | median_item_rt_sec | 1221 | -0.84 | rt_context_not_achievement_adjustment |
| year1 | MC0-100 | n_reached_or_valid_count | 1235 | 0.932 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MC0-100 | n_trailing_nonresponse_rows | 1235 | -0.865 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MNC0-100 | median_item_rt_sec | 1212 | -0.704 | rt_context_not_achievement_adjustment |
| year1 | MNC0-100 | n_reached_or_valid_count | 1229 | 0.816 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | MNC0-100 | n_trailing_nonresponse_rows | 1229 | -0.729 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | AAMC | median_item_rt_sec | 1205 | -0.79 | rt_context_not_achievement_adjustment |
| year1 | AAMC | n_reached_or_valid_count | 1227 | 0.872 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | AAMC | n_trailing_nonresponse_rows | 1227 | -0.768 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | ASMC | median_item_rt_sec | 1199 | -0.584 | rt_context_not_achievement_adjustment |
| year1 | ASMC | n_reached_or_valid_count | 1223 | 0.708 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | ASMC | n_trailing_nonresponse_rows | 1223 | -0.599 | reach_count_is_partly_scoring_policy_for_timed_D |
| year1 | BNL0-100 | median_item_rt_sec | 1178 | -0.046 | rt_context_not_achievement_adjustment |
| year1 | BNL0-100 | n_reached_or_valid_count | 1178 | 0.225 | coverage_or_valid_count_context_not_timed_D_speed |
| year1 | BNL0-100 | n_trailing_nonresponse_rows | 1178 |  | coverage_or_valid_count_context_not_timed_D_speed |
| foundation | STPM_vs_composite | score | 1006 | 0.232 | STPM_is_shadow_non_math_exclude_from_math_score |
| foundation | STPM_vs_composite | median_item_rt_sec | 1004 | -0.389 | STPM_is_shadow_non_math_exclude_from_math_score |
| foundation | STPM_vs_composite | total_rt_sec | 1004 | -0.34 | STPM_is_shadow_non_math_exclude_from_math_score |
| year1 | STPM_vs_composite | score | 1235 | 0.234 | STPM_is_shadow_non_math_exclude_from_math_score |
| year1 | STPM_vs_composite | median_item_rt_sec | 1228 | -0.387 | STPM_is_shadow_non_math_exclude_from_math_score |
| year1 | STPM_vs_composite | total_rt_sec | 1228 | -0.284 | STPM_is_shadow_non_math_exclude_from_math_score |


Reach/trailing correlations are partly mechanical under timed D/trailing-zero scoring. This is exactly why RT/tau should initially remain a shadow response-process layer rather than a direct achievement-band adjustment.
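A toy sketch of why the correlation is partly mechanical: it assumes that trailing-zero scoring means unreached trailing items count as 0, and it encodes an unanswered item as `None`. Both are illustrative assumptions about the scoring policy, not the operational implementation.

```python
def trailing_nonresponse_count(responses):
    """Number of consecutive unanswered items (None) at the end of a timed form."""
    count = 0
    for r in reversed(responses):
        if r is not None:
            break
        count += 1
    return count


def trailing_zero_score(responses):
    """Raw score when every unanswered item, including unreached ones, scores 0."""
    return sum(1 for r in responses if r == 1)


# Two hypothetical children who answer identically on the first four items.
slow = [1, 1, 0, 1, None, None]  # ran out of time after item 4
fast = [1, 1, 0, 1, 1, 1]        # reached every item

slow_summary = (trailing_nonresponse_count(slow), trailing_zero_score(slow))
fast_summary = (trailing_nonresponse_count(fast), trailing_zero_score(fast))
```

Because each unreached row lowers the maximum attainable score, reach and score move together by construction. An RT-based adjustment layered on top would risk counting the same pace information twice.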

## Recommended model ladders

### Hierarchical global/subscore ladder

| model_id | purpose | latent_structure | subscores | premodel_status | promotion_gate |
| --- | --- | --- | --- | --- | --- |
| H0_current_operational_candidate | existing global score anchor | one global theta + subtest/testlet residuals u | not teacher-facing; u is nuisance/local-dependence residual | already fitted for inclusive/hard-filtered/sensitivities | retain as anchor while subscore challengers are tested |
| H1_global_plus_subtest_deviations | coherent teacher-facing global score + subscores | global theta; subtest score = global theta + shrunken subtest deviation; initially no separate nuisance residual for the same subtest | yes: report global, subtest posterior means/intervals, and relative deviation labels | recommended first Stan hierarchical subscore challenger | clean HMC, stable subscore posterior SDs, sensible shrinkage, better coherence than standalone subtest IRT, no harmful risk-band movement |
| H2_global_plus_NL_specific_deviation | target Year 1 BNL influence before full subtest expansion | global theta + Number Line-specific deviation/factor; optionally BNL residual fixed/omitted | global + NL profile only | recommended focused challenger if H1 is too broad or BNL remains unstable | keeps BNL contribution without weak BNL residual pathology; validates at least as well as H0 |
| H3_correlated_subtest_thetas | diagnostic upper-bound profile model | one correlated theta per subtest; global score is derived composite | yes but global must be defined after fitting | diagnostic only until feasibility improves; mirt/TAM high-dimensional screens were resource-burdened | only proceed if H1/H2 insufficient and dimensions are stable/interpretable |


### Number Line policy ladder

| policy_id | role | model_family | premodel_gate | promotion_gate |
| --- | --- | --- | --- | --- |
| nl_85_95_current_3cat | benchmark/operational-compatible current policy | ordinal PCM/GPCM categories 0=<.85, 1=.85-.95, 2=>=.95 | must be included as reference in all screens | already lockable as NL2 unless challenger clearly improves validation/fairness/classification |
| nl_80_90_relaxed_3cat | cutoff sensitivity challenger | ordinal 3-category PCM/GPCM | cell counts and target distributions acceptable | less harmful hard-target penalisation plus equal/better validation and risk classification |
| nl_90_97_strict_3cat | strict challenger | ordinal 3-category PCM/GPCM | top category not too sparse item-by-item | only if validation gain offsets expected sparsity/precision loss |
| nl_binary_95 | simple mastery-like sensitivity | binary Rasch/2PL screen | both classes present by item | unlikely to promote unless it improves decision validity despite information loss |
| nl_80_90_95_4cat | higher-resolution ordinal sensitivity | 4-category PCM/GPCM | all item categories have stable counts; thresholds ordered/usable | improved validation/precision without sparse-category pathology |
| continuous_abs_error_logitnormal_or_beta | formal continuous challenger, not TAM/mirt-faithful | mixed response Stan: binary/non-NL accuracy + continuous bounded NL accuracy/error | raw distributions and coordinate calibration pass; proxy validation competitive | material validation/classification/fairness gain over NL2 and clean HMC/PPC |


### Accuracy-speed ladder

| model_id | purpose | status | uses_for_score | gate |
| --- | --- | --- | --- | --- |
| RT0_QC_manifest_speed_descriptives | data-quality, rapid-response, timing-unit, and admin/device checks | recommended before any scoring use | none | no severe RT missingness/unit anomalies in candidate families |
| RT1_selected_family_speed_shadow | selected timed-family tau/pace research with accuracy anchor protected | supported by prior J2b work; rerun on 2026 BOY candidate families if needed | shadow only | tau aligns with RT/rapid behaviour; theta/risk bands not changed operationally |
| RT2_hierarchical_tau_shadow | overall response pace + family residual pace, coherent with teacher profile idea | Stan skeleton exists (J3b hierarchical tau) | shadow only | clean HMC; no subgroup/admin artefact; no achievement-band changes |
| RT3_joint_global_subscore_accuracy_speed | future integrated model after H1 subscore and RT2 pace models are separately stable | not first next fit | research only until validation burden is met | must add information beyond D/trailing-zero and not double-count speed/reach |


## Decision gates / next actions

| stream | next_action | must_check_before_fit | must_check_after_fit |
| --- | --- | --- | --- |
| hierarchical_subscores | fit H1 Stan global+subtest-deviation model on hard-filtered operational frame | subtest score reliability/correlation/readiness table; avoid writing nuisance residuals as teacher subscores | HMC diagnostics, posterior SD by subscore, shrinkage size, global score movement, risk-band movement, subgroup movement, profile interpretability |
| number_line_policy | run frequentist ordinal cutoff screens using audited candidate policies; keep current .85/.95 as reference | item-by-target category counts and ECDF; reject policies with sparse/empty categories before Stan | threshold behaviour, reliability, score/risk movement, validation/fairness; continuous challenger only after proxy screen |
| accuracy_speed_joint | treat RT as QC/shadow; candidate families are timed non-NL achievement subtests only at first | RT missingness, row RT quantiles, rapid-row accuracy, STPM exclusion, D/trailing-zero double-count risk | tau construct validity, rapid effect direction, theta robustness, admin/subgroup artefacts, no operational risk-band changes |


## Written aggregate artifacts

- `tables/premodeling/2026_boy_hierarchical_subscore_readiness.csv`
- `tables/premodeling/2026_boy_subtest_score_correlations.csv`
- `tables/premodeling/2026_boy_subtest_composite_correlations.csv`
- `tables/premodeling/2026_boy_subtest_profile_deviation_summary.csv`
- `tables/premodeling/2026_boy_nl_accuracy_distribution_by_item.csv`
- `tables/premodeling/2026_boy_nl_policy_item_cell_counts.csv`
- `tables/premodeling/2026_boy_nl_policy_overall_summary.csv`
- `tables/premodeling/2026_boy_rt_readiness_by_subtest.csv`
- `tables/premodeling/2026_boy_j2b_style_rapid_row_audit.csv`
- `tables/premodeling/2026_boy_speed_accuracy_correlations.csv`
- model-ladder and decision-gate CSVs in the same folder
- TAM cutoff screen runner: `analysis/modeling/v2_response_process_program/77_2026_boy_premodel_tam_cutoff_screens.R` (requires TAM; intended for cisbox/AWS)