# 2026 BOY operational accuracy + Number Line candidate — modelled job review

Review timestamp: 2026-06-14 UTC

## Compute / sync status

All AWS model jobs are complete. There are no active EC2 instances matching the 2026 BOY operational Number Line model tags, no active cisbox rsync sessions, and the local sensitivity monitor was stopped after all six sensitivity `.done` markers were present.

The final outstanding run (`year1_no_BNL0_100`) has been synced, checksum-verified, and recovered from the known no-NL post-processing failure, and its EC2 instance has been terminated.
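
As an illustration of what a checksum verification pass over a synced run could look like, here is a minimal sketch. It assumes a `sha256sum`-style manifest (one `<hex digest>  <relative path>` line per file); the manifest format and the function names `sha256_of` and `verify_manifest` are hypothetical and are not taken from the project's sync tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, root: Path) -> tuple[int, int, list[str]]:
    """Check every '<hex digest>  <relative path>' line of a manifest.

    Returns (files verified, files listed, relative paths that failed).
    The manifest layout is an assumption, not the project's actual format.
    """
    verified, failed = 0, []
    lines = [ln for ln in manifest.read_text().splitlines() if ln.strip()]
    for line in lines:
        expected, rel_path = line.split(maxsplit=1)
        # sha256sum prefixes the path with '*' in binary mode; drop it.
        target = root / rel_path.lstrip("*")
        if target.is_file() and sha256_of(target) == expected.lower():
            verified += 1
        else:
            failed.append(rel_path)
    return verified, len(lines), failed
```

The first two return values have the same shape as a "verified/listed" count, so a fully verified run would report equal numbers and an empty failure list.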

## Reviewed jobs

The review covers 10 Stan jobs:

1. Foundation inclusive baseline.
2. Year 1 inclusive baseline.
3. Foundation hard-item-filtered baseline.
4. Year 1 hard-item-filtered baseline.
5. Foundation sensitivity: no `DMT10_2026`.
6. Foundation sensitivity: no `MQ1-20` and no `DMT10_2026`.
7. Foundation sensitivity: no `BNL0-20`.
8. Year 1 sensitivity: no `MC0-100`.
9. Year 1 sensitivity: no `BNL0-100`.
10. Year 1 sensitivity: core model with no MC and no NL.

Source output base:

```text
/data/numeracy-screening-models/irt/2026_boy_operational_accuracy_nl_candidate
```

Local review artifacts:

```text
outputs/runs/irt-2026-boy-subtest-audit/latest/reports/model_review/stan_review_summary.md
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_job_diagnostic_summary.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_score_movement_comparisons.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_testlet_sigma_summary_long.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_item_difficulty_extreme_or_diagnostic_flags.csv
outputs/runs/irt-2026-boy-subtest-audit/latest/tables/model_review/stan_u_residual_diagnostic_summary.csv
```

## Completion and sampler diagnostics

All 10 jobs have successful MCMC sampling evidence:

- 0 divergences in every job.
- 0 max-treedepth hits in every job.
- Minimum EBFMI across jobs: 0.568 (`year1_core_no_MC_no_NL`), acceptable.
- Student theta summaries are clean: max theta Rhat <= 1.006 and min theta ESS_bulk >= 941 in all jobs.
- Item difficulty summaries are clean by diagnostics: no item difficulty has Rhat > 1.01 or ESS_bulk < 400.

Three no-NL-style jobs exited with Stan runner exit code 1 because of the known post-processing bug for empty/missing NL lookup files, not because of sampler failure:

- `foundation_no_BNL0_20`
- `year1_no_BNL0_100`
- `year1_core_no_MC_no_NL`

All three were recovered from QC summaries and now have final score, item, testlet, and fit-readout files.
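
The failure mode described above (post-processing that assumes an NL lookup file exists and is non-empty) can be avoided with a small guard. The sketch below is illustrative only: the file layout, the column names `item_id` and `target`, and the functions `load_nl_lookup` and `attach_nl_targets` are hypothetical and do not describe the actual runner.

```python
import csv
from pathlib import Path


def load_nl_lookup(path: Path) -> list[dict[str, str]]:
    """Read an NL lookup CSV, returning [] when the file is missing or empty.

    A no-NL job has nothing to look up, so this case should not raise.
    """
    if not path.is_file() or path.stat().st_size == 0:
        return []
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


def attach_nl_targets(
    items: list[dict[str, str]], lookup: list[dict[str, str]]
) -> list[dict[str, str]]:
    """Add the NL target to each item row; items without an entry get ''."""
    targets = {row["item_id"]: row["target"] for row in lookup}
    return [{**item, "target": targets.get(item["item_id"], "")} for item in items]
```

With a guard of this kind, a job without NL items would pass an empty lookup through post-processing instead of exiting non-zero after a clean sampling run.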

## Job-level diagnostic table

| job | exit | postprocess | verify | div | treedepth hits | min EBFMI | theta max Rhat / min ESS | testlet max Rhat / min ESS | note |
|---|---:|---|---:|---:|---:|---:|---|---|---|
| Foundation inclusive | 0 | completed | 155/155 | 0 | 0 | 0.705 | 1.004 / 5066 | 1.006 / 1173 | clean |
| Year 1 inclusive | 0 | completed | 155/155 | 0 | 0 | 0.614 | 1.004 / 941 | 1.023 / 109 | weak `BNL0-100` testlet sigma |
| Foundation hard-filtered | 0 | completed | 1955/1955 | 0 | 0 | 0.677 | 1.003 / 4051 | 1.003 / 1081 | clean |
| Year 1 hard-filtered | 0 | completed | 1955/1955 | 0 | 0 | 0.646 | 1.006 / 1504 | 1.068 / 78 | weak `BNL0-100` testlet sigma |
| Foundation no `DMT10_2026` | 0 | completed | 2104/2104 | 0 | 0 | 0.694 | 1.003 / 3337 | 1.009 / 482 | clean |
| Foundation no `MQ1-20`/no `DMT10_2026` | 0 | completed | 2104/2104 | 0 | 0 | 0.727 | 1.002 / 6115 | 1.007 / 585 | clean |
| Foundation no `BNL0-20` | 1 | recovered | 2098/2098 | 0 | 0 | 0.667 | 1.002 / 4894 | 1.004 / 1221 | sampling clean; postprocess recovered |
| Year 1 no `MC0-100` | 0 | completed | 2104/2104 | 0 | 0 | 0.647 | 1.002 / 4875 | 1.004 / 670 | clean |
| Year 1 no `BNL0-100` | 1 | recovered | 2098/2098 | 0 | 0 | 0.598 | 1.003 / 3313 | 1.003 / 1668 | sampling clean; postprocess recovered |
| Year 1 no MC/no NL | 1 | recovered | 2098/2098 | 0 | 0 | 0.568 | 1.002 / 4612 | 1.003 / 1925 | sampling clean; postprocess recovered |
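
The notes in the last column can be reproduced mechanically from the testlet columns. The sketch below copies the testlet max Rhat and min ESS values from the table above and applies the Rhat > 1.01 and ESS_bulk < 400 cut-offs used elsewhere in this review; applying those same cut-offs to the testlet sigma summaries is an assumption, and `flag_jobs` is a hypothetical helper name.

```python
RHAT_MAX = 1.01  # Rhat cut-off used in this review
ESS_MIN = 400    # ESS_bulk cut-off used in this review

# (job, testlet max Rhat, testlet min ESS_bulk), copied from the table above.
TESTLET_SUMMARY = [
    ("Foundation inclusive", 1.006, 1173),
    ("Year 1 inclusive", 1.023, 109),
    ("Foundation hard-filtered", 1.003, 1081),
    ("Year 1 hard-filtered", 1.068, 78),
    ("Foundation no DMT10_2026", 1.009, 482),
    ("Foundation no MQ1-20/no DMT10_2026", 1.007, 585),
    ("Foundation no BNL0-20", 1.004, 1221),
    ("Year 1 no MC0-100", 1.004, 670),
    ("Year 1 no BNL0-100", 1.003, 1668),
    ("Year 1 no MC/no NL", 1.003, 1925),
]


def flag_jobs(rows, rhat_max=RHAT_MAX, ess_min=ESS_MIN):
    """Return (job, reasons) for every row that breaches either cut-off."""
    flagged = []
    for job, max_rhat, min_ess in rows:
        reasons = []
        if max_rhat > rhat_max:
            reasons.append(f"max Rhat {max_rhat} > {rhat_max}")
        if min_ess < ess_min:
            reasons.append(f"min ESS {min_ess} < {ess_min}")
        if reasons:
            flagged.append((job, reasons))
    return flagged


for job, reasons in flag_jobs(TESTLET_SUMMARY):
    print(f"{job}: {'; '.join(reasons)}")
```

Only the two Year 1 jobs that still contain `BNL0-100` are flagged, which matches the notes column.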

## Main diagnostic finding

The global Year 1 baseline is usable from a sampler perspective, but the `BNL0-100` testlet residual scale is weakly identified:

- Inclusive Year 1 `BNL0-100` sigma: Rhat ~1.023, ESS_bulk ~109.
- Hard-filtered Year 1 `BNL0-100` sigma: Rhat ~1.068, ESS_bulk ~78.

This issue is local to the `BNL0-100` residual/testlet component. It does not show up as divergent transitions, treedepth failures, poor theta mixing, or item-difficulty non-convergence. It does show up in the latent residuals for the same component: in the hard-filtered Year 1 run, `u[,5]` corresponds to `BNL0-100`, and 1193 of its 1221 residual terms (about 98%) had Rhat > 1.01, with max Rhat ~1.026. The likely interpretation is that the residual `BNL0-100` testlet variance is small, close to the zero boundary, and therefore hard for the sampler to estimate, while the `BNL0-100` items themselves carry substantial global-theta information.

### Auxiliary `u` residual diagnostic

| job | testlet | residual terms | Rhat > 1.01 | ESS < 400 | max Rhat | min ESS | interpretation |
|---|---|---:|---:|---:|---:|---:|---|
| Year 1 inclusive | `BNL0-100` | 1221 | 0 | 3 | 1.009 | 320 | minor low-ESS nuisance terms |
| Year 1 hard-filtered | `BNL0-100` | 1221 | 1193 | 2 | 1.026 | 279 | broad residual-component mixing issue tied to BNL testlet |

No other job/testlet had `u` residual terms with Rhat > 1.01 or ESS_bulk < 400. This reinforces that the caveat is localized to Year 1 `BNL0-100` dependence modelling, not to the global theta score or item difficulty estimates.

## Hard-filtered vs inclusive baseline

The hard-item filter removes the 70 predeclared no-information items and has negligible impact on student ranking/risk classification.

| comparison | n | Spearman | median abs percentile shift | p95 shift | exact 3-band agreement | very-low Jaccard | low+very-low Jaccard | moved out/in, very-low | moved out/in, low+very-low |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Foundation inclusive vs hard-filtered | 997 | 1.000 | 0.30 pp | 1.40 pp | 99.0% | 0.974 | 0.983 | 2 / 2 | 3 / 3 |
| Year 1 inclusive vs hard-filtered | 1221 | 0.999 | 0.74 pp | 2.62 pp | 98.5% | 0.968 | 0.972 | 3 / 3 | 6 / 6 |

Conclusion: the hard-item-filtered runs should be the working operational baseline. The inclusive runs are useful historical evidence but should not be promoted over the filtered versions.
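
For readers who want to reproduce comparisons of this kind, the sketch below shows one way the movement metrics could be computed from two sets of student scores. It is not the code that produced the tables. Several choices are assumptions: percentiles are mid-rank percentiles within the shared students, higher theta means higher achievement, and the band cut-offs are the 15th and 35th percentiles mentioned in the recommendations. All function names are hypothetical.

```python
from statistics import median


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; applied to ranks it gives Spearman's rho."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def band(pct, very_low=15.0, low=35.0):
    """Assign a risk band from a percentile (cut-offs are assumptions)."""
    if pct < very_low:
        return "very_low"
    return "low" if pct < low else "other"


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def movement_summary(base, alt):
    """Compare two {student_id: theta} dicts on their shared students."""
    ids = sorted(set(base) & set(alt))
    n = len(ids)
    rank_base = average_ranks([base[i] for i in ids])
    rank_alt = average_ranks([alt[i] for i in ids])
    pct_base = [100 * (r - 0.5) / n for r in rank_base]
    pct_alt = [100 * (r - 0.5) / n for r in rank_alt]
    band_base = [band(p) for p in pct_base]
    band_alt = [band(p) for p in pct_alt]
    vl_base = {i for i, b in zip(ids, band_base) if b == "very_low"}
    vl_alt = {i for i, b in zip(ids, band_alt) if b == "very_low"}
    lvl_base = {i for i, b in zip(ids, band_base) if b != "other"}
    lvl_alt = {i for i, b in zip(ids, band_alt) if b != "other"}
    return {
        "n": n,
        "spearman": pearson(rank_base, rank_alt),
        "median_shift_pp": median(abs(a - b) for a, b in zip(pct_base, pct_alt)),
        "band_agreement": sum(a == b for a, b in zip(band_base, band_alt)) / n,
        "very_low_jaccard": jaccard(vl_base, vl_alt),
        "low_or_very_low_jaccard": jaccard(lvl_base, lvl_alt),
        "moved_out_in_very_low": (len(vl_base - vl_alt), len(vl_alt - vl_base)),
    }
```

Restricting the comparison to shared student ids means `n` can differ between contrasts when a sensitivity run drops students who have no remaining items.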

## Sensitivity findings vs hard-filtered baseline

### Foundation

| sensitivity | n | Spearman | median shift | p95 shift | 3-band agreement | very-low Jaccard | low+very-low Jaccard | interpretation |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| no `DMT10_2026` | 997 | 0.935 | 5.72 pp | 21.00 pp | 85.2% | 0.703 | 0.758 | DMT contributes materially; removal is not classification-stable. |
| no `MQ1-20` and no `DMT10_2026` | 995 | 0.825 | 9.95 pp | 35.68 pp | 76.1% | 0.520 | 0.642 | Removing both early quantity/decomposition content substantially changes the score. |
| no `BNL0-20` | 997 | 0.865 | 8.02 pp | 32.32 pp | 77.5% | 0.505 | 0.661 | Foundation Number Line is highly influential and improves precision. |

Foundation interpretation:

- The hard-filtered Foundation baseline is sampler-clean.
- `BNL0-20` is important to the global score; dropping it causes large risk-band movement.
- `DMT10_2026` also matters; despite being untimed, it contributes meaningfully to the Foundation global trait.
- Foundation supports retaining the full hard-filtered operational accuracy + NL candidate, subject to external validation and reporting review.

### Year 1

| sensitivity | n | Spearman | median shift | p95 shift | 3-band agreement | very-low Jaccard | low+very-low Jaccard | interpretation |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| no `MC0-100` | 1211 | 0.993 | 1.82 pp | 6.77 pp | 96.2% | 0.905 | 0.936 | Removing MC has modest impact; MC is not the main source of instability. |
| no `BNL0-100` | 1221 | 0.768 | 11.88 pp | 39.31 pp | 70.3% | 0.402 | 0.547 | Removing BNL radically changes rankings/risk bands and greatly increases uncertainty. |
| no MC/no NL | 1211 | 0.739 | 13.46 pp | 40.42 pp | 68.5% | 0.382 | 0.519 | Core-only score differs substantially from the full hard-filtered candidate. |

Year 1 interpretation:

- `MC0-100` is not a major concern; the no-MC sensitivity remains close to the hard-filtered baseline.
- `BNL0-100` is the key decision point. It is highly influential for Year 1 risk classification and precision.
- The weak `BNL0-100` sigma diagnostic should not be read as evidence to drop BNL. The no-BNL sensitivity shows the opposite: dropping it materially changes the construct coverage and low-achievement identification.
- The most defensible reading is: retain `BNL0-100` as a strong candidate, but resolve/report the localized testlet-sigma issue before final operational promotion.

## Frequentist model-rung context

Frequentist pre-screening remains consistent with the Stan review:

- TAM 1D PCM was the only clean frequentist baseline across both years.
  - Foundation reliability ~0.914.
  - Year 1 reliability ~0.952.
- mirt correlated subtest factors failed due to quadrature burden (`Greater than 20000 quadrature points`).
- mirt flexible/bifactor screens did not provide stable enough evidence to justify multidimensional Stan challengers.
- Refined frequentist sensitivities also flagged Foundation no-BNL and Year 1 no-BNL/no-MC variants as the main score-movement cases.

Therefore, the current Stan evidence should be interpreted within a 1D+testlet operational-candidate frame, not as support for immediate multidimensional/bifactor escalation.

## Recommendations

1. **Promote the hard-item-filtered model frame as the working baseline for final reporting comparisons.**  
   The hard filter removes no-information items with near-zero impact on student scores/risk bands.

2. **Foundation: keep `BNL0-20` and `DMT10_2026` in the operational candidate.**  
   Both materially affect risk identification; the Foundation hard-filtered Stan run is diagnostically clean.

3. **Year 1: do not drop `BNL0-100` based on the sigma diagnostic alone.**  
   Removing it causes major movement and loss of precision. Treat the issue as a localized residual-scale estimation problem, not a failed global score.

4. **Run or design one surgical Year 1 sensitivity if final promotion requires clearing the sigma caveat:**  
   keep the `BNL0-100` items in the global score, but either omit the `BNL0-100` testlet residual or fix its scale. This directly tests whether the weak sigma parameter is harmless, and it is more informative than a no-BNL model, which changes both construct coverage and precision.

5. **Complete external validation and subgroup movement checks before final operational lock-in.**  
   Compare hard-filtered baseline and key sensitivities against PAT/teacher outcomes and demographic/school subgroup stability, with priority on the <15th and 15th–35th percentile bands.

6. **Update the audit/report package.**  
   Add sections for item eligibility, hard-filtered vs inclusive comparison, frequentist model rungs, Stan sensitivity results, and the Year 1 `BNL0-100` decision caveat.
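
The surgical Year 1 sensitivity in recommendation 4 can be written schematically as follows. This is a generic testlet-model sketch, not the project's Stan code: the linear predictor $\eta_{ij}$, the testlet index $t(j)$, and the constant $c$ are assumed notation, and the item response function that maps $\eta_{ij}$ to category probabilities is deliberately left unspecified.

```latex
% Candidate model (schematic): student i, item j, testlet t(j)
\eta_{ij} = \theta_i + u_{i,t(j)}, \qquad u_{i,t} \sim \mathcal{N}(0, \sigma_t^2)

% Surgical sensitivity, option A: omit the BNL0-100 residual
u_{i,\mathrm{BNL}} = 0 \quad \text{for all } i
\;\Longrightarrow\; \eta_{ij} = \theta_i \quad \text{for } j \in \text{BNL0-100}

% Surgical sensitivity, option B: fix the BNL0-100 residual scale
\sigma_{\mathrm{BNL}} = c \quad \text{(a chosen constant, not estimated)}
```

In both options the `BNL0-100` items still inform $\theta_i$, so any score movement relative to the hard-filtered baseline would be attributable to the residual component rather than to a change in construct coverage.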

## Proposed immediate next steps

1. Add the generated model-review tables to the unified audit HTML/report.
2. Build a final score-movement table with student-level risk-band transitions for the hard baseline vs the three most important sensitivity contrasts:
   - Foundation no `BNL0-20`.
   - Year 1 no `BNL0-100`.
   - Year 1 no `MC0-100`.
3. Run outcome validation comparisons for the hard baseline and sensitivity variants.
4. Review Year 1 `BNL0-100` item-level diagnostics:
   - coordinate audit;
   - empirical item curves by theta bin;
   - target-value distribution;
   - ordinal threshold behavior;
   - response category sparsity.
5. Decide whether to run the surgical Year 1 BNL-included/no-BNL-testlet-residual Stan sensitivity.
6. Draft the operational recommendation:
   - hard-filtered baseline as default candidate;
   - Foundation BNL retained;
   - Year 1 BNL retained as candidate pending sigma caveat resolution and validation;
   - `MC0-100` not a major candidate for exclusion.
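
A student-level risk-band transition table of the kind described in step 2 could be built along the following lines. This is a minimal sketch: the band labels and the helper names `band_transitions` and `format_transitions` are hypothetical, and the inputs are assumed to be `{student_id: band}` mappings produced upstream.

```python
from collections import Counter

BANDS = ("very_low", "low", "other")  # assumed band labels


def band_transitions(base_bands, alt_bands):
    """Count students by (baseline band, sensitivity band) on shared ids."""
    shared = set(base_bands) & set(alt_bands)
    return Counter((base_bands[s], alt_bands[s]) for s in shared)


def format_transitions(counts, bands=BANDS):
    """Render the counts as a Markdown table with baseline bands as rows."""
    header = "| baseline \\ sensitivity | " + " | ".join(bands) + " |"
    rule = "|---|" + "---:|" * len(bands)
    rows = [
        f"| {row} | " + " | ".join(str(counts[(row, col)]) for col in bands) + " |"
        for row in bands
    ]
    return "\n".join([header, rule, *rows])
```

The diagonal cells hold students whose band is unchanged; off-diagonal cells in the `very_low` row and column correspond to the moved out/in counts reported in the comparison tables.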