# 2026 BOY next-model specification notes

Status: pre-fit design note generated after aggregate premodelling audit. Do not treat as an operational scoring decision.

## 1. Hierarchical global + subscore model

### Goal
Produce a coherent global score and teacher-facing subtest subscores, avoiding unrelated standalone subtest IRT scales.

### First challenger: `H1_global_plus_subtest_deviations`

For student `p` and subtest/domain `s`:

```text
g_p ~ broad numeracy level
z_ps ~ standard normal residual profile component
delta_ps = sigma_delta_s * z_ps, centered across subtests within student
theta_ps = g_p + delta_ps
```
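A minimal Python sketch of the construction above for a single student, not the Stan implementation. The function name `subtest_thetas` and the example values are hypothetical. It assumes centering is done by subtracting the within-student mean of the scaled deviations.

```python
import random

def subtest_thetas(g, z, sigma_delta):
    """Build centered deviations delta_ps and theta_ps for one student.

    g           : broad numeracy level g_p
    z           : standard-normal draws z_ps, one per subtest
    sigma_delta : deviation scales sigma_delta_s, one per subtest
    """
    raw = [s * zi for s, zi in zip(sigma_delta, z)]
    mean_raw = sum(raw) / len(raw)
    # Center across subtests within the student so g stays the broad level.
    delta = [r - mean_raw for r in raw]
    theta = [g + d for d in delta]
    return delta, theta

rng = random.Random(0)
sigma_delta = [0.3, 0.5, 0.4]  # illustrative values only
z = [rng.gauss(0.0, 1.0) for _ in sigma_delta]
delta, theta = subtest_thetas(g=0.2, z=z, sigma_delta=sigma_delta)
```

After centering, the deviations sum to zero within the student, so the mean of `theta` equals `g`. Note that subtracting the mean also changes the marginal spread of each `delta_ps`, so `sigma_delta_s` is no longer exactly its standard deviation.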

Binary non-NL item `j` (timed or untimed) in subtest `s[j]`:

```text
y_pj ~ Bernoulli_logit(theta_p,s[j] - b_j)
```

Ordinal Number Line item `j` under a PCM-style policy:

```text
eta_1 = 0
eta_k = eta_{k-1} + theta_p,s[j] - (b_j + step_j,k-1)   for k = 2..K_j
y_pj ~ categorical_logit(eta)
```
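The recursion above can be checked numerically with a short Python sketch. The function name `pcm_probs` and the example values are hypothetical. It assumes `steps` holds the `K - 1` step offsets `step_j,1 .. step_j,K-1` and that `categorical_logit` means a softmax over `eta`.

```python
import math

def pcm_probs(theta, b, steps):
    """Category probabilities for one ordinal item under the PCM-style recursion."""
    eta = [0.0]  # eta_1 = 0
    for step in steps:
        # eta_k = eta_{k-1} + theta - (b + step_{k-1})
        eta.append(eta[-1] + theta - (b + step))
    # Softmax with max-subtraction for numerical stability.
    m = max(eta)
    weights = [math.exp(e - m) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]

probs = pcm_probs(theta=0.5, b=0.0, steps=[-0.4, 0.4])
```

When `theta` equals `b` and all steps are zero, every `eta_k` is zero and the categories are equally likely, which is a quick sanity check on any implementation.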

Identification/regularisation:

- Center each student's `delta_ps` across subtests so `g_p` remains the broad level.
- Center item difficulties within year/model as in current Stan practice.
- Start without a separate nuisance testlet `u` for every subtest; otherwise the reportable subtest deviation and nuisance residual compete for the same signal.
- Keep posterior intervals for subtest deviations; do not report profile differences smaller than measurement uncertainty.

Primary post-fit checks:

1. HMC: 0 divergences, no max-treedepth hits, Rhat/ESS acceptable for `g`, `theta_ps`, `sigma_delta_s`, item parameters.
2. Global movement vs hard-filtered H0: Spearman, median/p95 percentile shift, <15 and 15-35 risk-band movement.
3. Subscore quality: posterior SD by subtest, shrinkage size, profile-deviation stability.
4. Teacher-facing coherence: subscore intervals and relative-strength labels agree with observed subtest evidence without overclaiming.
5. Subgroup/admin movement: no adverse subgroup artefacts.
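
Check 2 can be sketched in Python as follows. This is an illustration only: the names `risk_band` and `movement_summary` are hypothetical, the bands are assumed to be percentile ranks with `<15` and `15-35` treated as half-open intervals, and p95 uses a nearest-rank rule. The operational definitions may differ.

```python
import math
from collections import Counter
from statistics import median

def risk_band(pct):
    """Assumed band edges: <15, 15 to <35, and everything else."""
    if pct < 15:
        return "<15"
    if pct < 35:
        return "15-35"
    return ">=35"

def movement_summary(pct_h0, pct_h1):
    """Compare per-student percentiles under H0 and a challenger."""
    shifts = sorted(abs(a - b) for a, b in zip(pct_h0, pct_h1))
    # Nearest-rank 95th percentile of the absolute shifts.
    p95 = shifts[math.ceil(0.95 * len(shifts)) - 1]
    moves = Counter((risk_band(a), risk_band(b)) for a, b in zip(pct_h0, pct_h1))
    return {"median_shift": median(shifts), "p95_shift": p95, "band_moves": moves}

summary = movement_summary([10, 20, 50, 14], [12, 18, 55, 16])
```

The `band_moves` counter is a transition table keyed by `(H0 band, challenger band)`, so off-diagonal keys are the students whose risk classification changed.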

## 2. Year 1 BNL residual surgical sensitivity

Keep `BNL0-100` items in the global/hierarchical score but do not give BNL an extra nuisance residual variance if the current `sigma_u[BNL0-100]` remains weak.

Data-side option:

```text
active_testlet_idx[BNL0-100] = 0
active_testlet_idx[other_subtests] = 1..K_active
```

Likelihood option:

```text
resid = 0 if active_testlet_idx == 0
resid = sigma_u[k] * u_z[p,k] otherwise
theta_eff = theta + resid
```
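A small Python sketch of the likelihood option, assuming `active_testlet_idx` is 1-based with `0` reserved for "no residual". The function name `effective_theta` and the example values are hypothetical.

```python
def effective_theta(theta, active_idx, sigma_u, u_z_p):
    """Return theta_eff for one student-item row.

    active_idx : 0 for subtests without a nuisance residual (here BNL0-100),
                 otherwise a 1-based index into sigma_u and u_z_p
    sigma_u    : residual SDs for the K_active testlets
    u_z_p      : this student's standardized residuals, one per active testlet
    """
    if active_idx == 0:
        return theta  # resid = 0
    k = active_idx - 1  # convert the 1-based index to a Python offset
    return theta + sigma_u[k] * u_z_p[k]

theta_bnl = effective_theta(0.3, 0, sigma_u=[0.5], u_z_p=[2.0])
theta_other = effective_theta(0.3, 1, sigma_u=[0.5], u_z_p=[2.0])
```

With this layout, `sigma_u` and `u_z` only need `K_active` columns, so the BNL residual is removed from the parameter set rather than fixed at zero.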

This isolates whether the problem lies in the BNL residual component rather than in the BNL items themselves.

## 3. Number Line policy ladder

Premodelling audit outputs:

- `tables/premodeling/2026_boy_nl_accuracy_distribution_by_item.csv`
- `tables/premodeling/2026_boy_nl_policy_item_cell_counts.csv`
- `tables/premodeling/2026_boy_nl_policy_overall_summary.csv`

Frequentist screens before Stan:

```text
nl_80_90_relaxed_3cat
nl_85_95_current_3cat
nl_90_97_strict_3cat
nl_binary_95
nl_80_90_95_4cat
```
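A Python sketch of how each policy could map a continuous accuracy to an ordinal category. Only the `.85/.95` cuts are stated in this note. The other cut points are read off the policy names, and treating a value exactly at a cut as reaching the higher category is an assumption. `POLICY_CUTS` and `score_nl` are hypothetical names.

```python
from bisect import bisect_right

# Cut points inferred from the policy names (assumption, except .85/.95).
POLICY_CUTS = {
    "nl_80_90_relaxed_3cat": (0.80, 0.90),
    "nl_85_95_current_3cat": (0.85, 0.95),
    "nl_90_97_strict_3cat": (0.90, 0.97),
    "nl_binary_95": (0.95,),
    "nl_80_90_95_4cat": (0.80, 0.90, 0.95),
}

def score_nl(accuracy, policy):
    """Return the 0-based category: the number of cut points <= accuracy."""
    return bisect_right(POLICY_CUTS[policy], accuracy)

category = score_nl(0.91, "nl_85_95_current_3cat")
```

Running every policy through one scoring function makes the cell-count audit comparable across the ladder, since only the cut tuple changes.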

Promotion burden:

- current `.85/.95` remains the reference;
- challenger must have stable cells/thresholds;
- challenger must improve or match validation/risk classification;
- challenger must not cause unacceptable subgroup or risk-band movement;
- continuous NL requires Stan or another mixed continuous-response framework, not TAM/mirt alone.

Continuous challenger sketch:

```text
accuracy = 1 - absolute_error / scale_range
accuracy_squeezed = accuracy moved into the open interval (0,1) by clamping or a Smithson-Verkuilen squeeze
logit(mu_pj) = alpha_j + theta_p,s[j]
accuracy_pj ~ Beta(mu_pj * phi_j, (1 - mu_pj) * phi_j)
```
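A Python sketch of the continuous challenger's response side, not a fitted model. It uses the Smithson-Verkuilen form `(y * (n - 1) + 0.5) / n`. Which `n` to use (for example, the number of responses) is an assumption to settle before fitting. The function names are hypothetical.

```python
import math

def accuracy(absolute_error, scale_range):
    """accuracy = 1 - absolute_error / scale_range"""
    return 1.0 - absolute_error / scale_range

def squeeze(y, n):
    """Smithson-Verkuilen squeeze: maps [0, 1] into (0, 1)."""
    return (y * (n - 1) + 0.5) / n

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def beta_logpdf(y, mu, phi):
    """Log density of Beta(mu * phi, (1 - mu) * phi) at y in (0, 1)."""
    a = mu * phi
    b = (1.0 - mu) * phi
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

# One response: 4 units of error on a 0-100 line, illustrative parameters.
y = squeeze(accuracy(4.0, 100.0), n=1000)
mu = inv_logit(1.5 + 0.3)  # logit(mu_pj) = alpha_j + theta
logp = beta_logpdf(y, mu, phi=20.0)
```

The squeeze matters because the Beta density is not usable at exactly 0 or 1, and perfect Number Line placements would otherwise produce an accuracy of exactly 1.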

Optional signed-error diagnostic (not a first scoring model):

```text
signed_error_scaled_pj ~ Normal(target_bias_j + method_bias_family + ability_slope_j * theta, sigma_j)
```

## 4. Accuracy-speed joint modelling ladder

Operational posture: RT is shadow/QC first. Timed D/trailing-zero already encodes reach/time-pressure, so response time can double-count speed if added naively.

Initial 2026 BOY data rule:

- Achievement accuracy model may continue to use D/trailing-zero for timed non-NL.
- RT likelihood should use observed/reached item rows only.
- Trailing unreached rows contribute to D accuracy/reach context, not item-level logRT.
- STPM remains shadow/non-math and is excluded from math achievement.
- Number Line RT is context-only initially.

Candidate shadow model:

```text
y_pj ~ Bernoulli_logit(theta_p - b_j + gamma_family * rapid_pj)
logRT_pj ~ Normal(beta0 + beta_j - tau_p,family[j], sigma_rt_family)
```
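A Python sketch of the two shadow-model pieces for a single row. It reads the RT line as log RT being normally distributed, which is equivalent to RT itself being lognormal. The function names and example rows are hypothetical, and the reached-rows filter follows the data rule listed earlier in this section.

```python
import math

def accuracy_logit(theta, b, gamma_family, rapid):
    """Linear predictor for accuracy: theta - b + gamma_family * rapid."""
    return theta - b + gamma_family * rapid

def log_rt_logpdf(rt, beta0, beta_j, tau_pf, sigma_rt):
    """Normal log density of log(rt) with mean beta0 + beta_j - tau_pf."""
    mean = beta0 + beta_j - tau_pf
    z = (math.log(rt) - mean) / sigma_rt
    return -0.5 * z * z - math.log(sigma_rt) - 0.5 * math.log(2.0 * math.pi)

rows = [
    {"rt": 2.0, "reached": True},
    {"rt": None, "reached": False},  # trailing unreached row: no RT term
]
# Only observed/reached rows contribute to the RT likelihood.
total = sum(
    log_rt_logpdf(r["rt"], beta0=1.0, beta_j=0.2, tau_pf=0.1, sigma_rt=0.5)
    for r in rows
    if r["reached"]
)
```

Because `tau` enters with a negative sign, a larger `tau_p,family[j]` means a faster student in that family: expected log RT goes down as `tau` goes up.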

Hierarchical pace extension:

```text
tau_p,f = tau_overall_p + tau_residual_p,f
```

Pre-fit checks already written:

- `tables/premodeling/2026_boy_rt_readiness_by_subtest.csv`
- `tables/premodeling/2026_boy_j2b_style_rapid_row_audit.csv`
- `tables/premodeling/2026_boy_speed_accuracy_correlations.csv`

Do not use RT/tau to alter risk bands unless later evidence shows robust validation gain, no subgroup/admin artefact, and added information beyond D/reach.