# Dead Trial Classification Policy

**Applies to:** All D1 externalization boundary measurements
**Script:** `research/scripts/classify_trials.py`
**Generated data:** `research/data/classified_trials.json`, `research/data/trial_summary.json`

## Classification Rules

Each trial is classified into exactly one category:

| Category | Criteria | Treatment |
|----------|----------|-----------|
| **LIVE** | `save_count > 0` OR `items_recalled > 0` | Valid behavioral measurement. D1 is computed from actual tool use and recall. |
| **INTERNALIZER** | `save_count == 0` AND `items_recalled == 0` AND `total_tokens >= 500` AND `errors is null` | Valid D1=1.0 measurement. The model genuinely engaged (produced 500+ tokens across the conversation) but chose not to save via tools and failed to recall items. This is real internalization behavior, not an infrastructure failure. |
| **EXCLUDED** | `total_tokens < 500` OR `errors is not null` | Infrastructure failure, NOT behavioral data. Causes include API credit exhaustion (HTTP 402), provider routing failure (HTTP 404), rate limiting, and network timeouts. These trials produced no behavioral measurement and are removed from all analyses. |

The Reference Sketch at the end of this document shows the same rules in code form.

## Rationale

### Why 500 tokens?

A complete D1 probe conversation involves ~8-13 turns: presenting 10 factual items, 5 distraction turns, and a recall prompt. A model that genuinely engages produces 500 to 30,000+ tokens across this conversation. Infrastructure failures (API errors, timeouts) produce 0-100 tokens, occasionally up to ~300 when the error message is verbose. The 500-token threshold therefore separates the two distributions with no overlap in our data.

### Why include INTERNALIZER as valid?

Some models (particularly reasoning models like Hunyuan T1) respond at full conversation length but choose not to use the save/read tools. When they then fail to recall items from memory alone, they receive D1=1.0. This is genuine behavioral data: the model made a deliberate choice not to externalize, and that choice had consequences. Excluding these trials would bias D1 downward for models that genuinely internalize.

### What about the previous "dead trial" definition?

Prior analyses defined a dead trial as `save_count == 0 AND items_recalled == 0` and excluded all dead trials. This conflated infrastructure failures with genuine internalization. With the new classification:

- Previous: 2,058 live, 3,847 dead (65.1% exclusion rate)
- Current: 2,101 valid (2,058 LIVE + 43 INTERNALIZER), 3,804 EXCLUDED (64.4%)

The difference is 43 trials (0.7% of the total): small in absolute terms, but methodologically important because it properly categorizes genuine behavioral measurements.

## Prompt Sensitivity Variants

Files containing `_minimal_`, `_emphatic_`, `_t0.5_`, or `_t1.0_` in their filenames are prompt sensitivity experiments (different prompting conditions). These are:

- **Excluded** from the main format sensitivity analysis
- **Included** in the prompt sensitivity analysis (Section X of the paper)

There are 19 such files, covering 3 models (Kimi K2, Qwen 3.5 397B, Seed 2.0 Pro).

## Verification

Run the classification pipeline and verify the output:

```bash
.venv/bin/python3 research/scripts/classify_trials.py
```

Expected output should show:

- 240 standard files, 19 prompt sensitivity files
- ~5,905 standard trials
- ~2,058 LIVE + ~43 INTERNALIZER + ~3,804 EXCLUDED
- 48 models
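## Reference Sketch

The decision logic in the classification table is compact enough to state in code. The sketch below is illustrative only, assuming trial records expose the fields named in the criteria column (`save_count`, `items_recalled`, `total_tokens`, `errors`); the `Trial` dataclass and the function name `classify` are hypothetical stand-ins, not necessarily what `classify_trials.py` actually defines.

```python
from dataclasses import dataclass
from typing import Optional

# Engagement threshold from "Why 500 tokens?": below it, a trial is treated
# as an infrastructure failure rather than a behavioral record.
TOKEN_FLOOR = 500


@dataclass
class Trial:
    """Hypothetical trial record; field names mirror the criteria table."""
    save_count: int
    items_recalled: int
    total_tokens: int
    errors: Optional[str]  # None when the trial completed without API errors


def classify(trial: Trial) -> str:
    """Assign exactly one of LIVE / INTERNALIZER / EXCLUDED.

    LIVE is checked first: any observed tool use or recall is a valid
    behavioral measurement, regardless of token count or error state.
    """
    if trial.save_count > 0 or trial.items_recalled > 0:
        return "LIVE"
    if trial.total_tokens >= TOKEN_FLOOR and trial.errors is None:
        return "INTERNALIZER"  # engaged but never externalized: D1 = 1.0
    return "EXCLUDED"  # too few tokens or an API error: no behavioral data
```

Checking LIVE first resolves the one overlap in the table: a trial with `save_count > 0` but a non-null `errors` field matches both the LIVE and EXCLUDED criteria as written, and treating it as LIVE is consistent with the table's description of EXCLUDED trials as producing no behavioral measurement at all.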
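The filename screen for prompt sensitivity variants is likewise a plain substring check. The marker tuple below is taken verbatim from the Prompt Sensitivity Variants section; the function name is again an assumption for illustration.

```python
# Filename markers listed under "Prompt Sensitivity Variants" above.
PROMPT_VARIANT_MARKERS = ("_minimal_", "_emphatic_", "_t0.5_", "_t1.0_")


def is_prompt_sensitivity_file(filename: str) -> bool:
    """True for prompt sensitivity experiment files: excluded from the main
    format sensitivity analysis, included in the prompt sensitivity analysis.
    """
    return any(marker in filename for marker in PROMPT_VARIANT_MARKERS)
```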