# Dead Trial Classification Policy

**Applies to:** All D1 externalization boundary measurements
**Script:** `research/scripts/classify_trials.py`
**Generated data:** `research/data/classified_trials.json`, `research/data/trial_summary.json`

## Classification Rules

Each trial is classified into exactly one category (a code sketch of these rules follows the table):

| Category | Criteria | Treatment |
|----------|----------|-----------|
| **LIVE** | `save_count > 0` OR `items_recalled > 0` | Valid behavioral measurement. D1 computed from actual tool use and recall. |
| **INTERNALIZER** | `save_count == 0` AND `items_recalled == 0` AND `total_tokens >= 500` AND `errors is null` | Valid D1=1.0 measurement. The model genuinely engaged (produced 500+ tokens across the conversation) but chose not to save via tools and failed to recall items. This is real internalization behavior, not an infrastructure failure. |
| **EXCLUDED** | `total_tokens < 500` OR `errors is not null` | Infrastructure failure. NOT behavioral data. Causes include: API credit exhaustion (HTTP 402), provider routing failure (HTTP 404), rate limiting, network timeout. These trials produced no behavioral measurement and are removed from all analyses. |

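The rules above can be expressed as a single classification function. This is a minimal sketch, not the actual logic in `classify_trials.py`; the field names follow the criteria column, and the precedence (LIVE checked before EXCLUDED) is an assumption for records that would otherwise match both rows.

```python
# Minimal sketch of the rules in the table above; not the implementation in
# classify_trials.py. Trial records are assumed to be dicts carrying the
# fields named in the criteria column.
TOKEN_THRESHOLD = 500

def classify_trial(trial: dict) -> str:
    """Return exactly one of "LIVE", "INTERNALIZER", or "EXCLUDED"."""
    if trial["save_count"] > 0 or trial["items_recalled"] > 0:
        return "LIVE"          # valid behavioral measurement
    if trial["total_tokens"] >= TOKEN_THRESHOLD and trial.get("errors") is None:
        return "INTERNALIZER"  # engaged, but neither saved nor recalled: D1 = 1.0
    return "EXCLUDED"          # infrastructure failure, dropped from all analyses
```
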
## Rationale

### Why 500 tokens?

A complete D1 probe conversation involves ~8-13 turns: presenting 10 factual items, 5 distraction turns, and a recall prompt. A model that genuinely engages produces 500-30,000+ tokens across this conversation. Infrastructure failures (API errors, timeouts) produce 0-100 tokens, occasionally up to ~300 if the error message is verbose. The 500-token threshold separates these distributions with no overlap in our data.

### Why include INTERNALIZER as valid?

Some models (particularly reasoning models like Hunyuan T1) engage for the full conversation length but choose not to use the save/read tools. When they then fail to recall items from memory alone, they receive D1=1.0. This is genuine behavioral data: the model made a deliberate choice not to externalize, and that choice had consequences. Excluding these trials would bias D1 downward for models that genuinely internalize.

### What about the previous "dead trial" definition?

Prior analyses defined dead trials as `save_count == 0 AND items_recalled == 0` and excluded all of them. This conflated infrastructure failures with genuine internalization. With the new classification:

- Previous: 2,058 live, 3,847 dead (65.2% exclusion rate)
- Current: 2,101 valid (2,058 LIVE + 43 INTERNALIZER), 3,804 EXCLUDED (64.4%)

The difference is 43 trials (0.7% of total) — small in absolute terms but methodologically important because it properly categorizes genuine behavioral measurements.
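
A quick arithmetic check on these counts (a sketch; the figures are copied from the bullets above):

```python
# Consistency check on the counts reported above (figures copied from the text).
prev_live, prev_dead = 2_058, 3_847
live, internalizer, excluded = 2_058, 43, 3_804

total = prev_live + prev_dead                      # 5,905 standard trials
assert live + internalizer == 2_101                # new "valid" count
assert prev_dead - internalizer == excluded        # reclassified trials
print(f"{excluded / total:.1%} excluded")          # 64.4%
print(f"{internalizer / total:.1%} reclassified")  # 0.7%
```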

## Prompt Sensitivity Variants

Files containing `_minimal_`, `_emphatic_`, `_t0.5_`, or `_t1.0_` in their filename are prompt sensitivity experiments (different prompting conditions). These are:
- **Excluded** from the main format sensitivity analysis
- **Included** in the prompt sensitivity analysis (Section X of the paper)
- 19 files, covering 3 models (Kimi K2, Qwen 3.5 397B, Seed 2.0 Pro); a filename check is sketched below
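
The check is a simple substring match on the filename. A minimal sketch (the marker strings come from the paragraph above; the helper name is illustrative, not part of `classify_trials.py`):

```python
# Illustrative helper for flagging prompt sensitivity variant files.
# The marker substrings come from the policy above; the function name is
# hypothetical and not part of classify_trials.py.
PROMPT_SENSITIVITY_MARKERS = ("_minimal_", "_emphatic_", "_t0.5_", "_t1.0_")

def is_prompt_sensitivity_file(filename: str) -> bool:
    """True if a results file belongs to the prompt sensitivity experiments."""
    return any(marker in filename for marker in PROMPT_SENSITIVITY_MARKERS)
```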

## Verification

Run the classification pipeline and verify:
```bash
.venv/bin/python3 research/scripts/classify_trials.py
```

Expected output should show:
- 240 standard files, 19 prompt sensitivity files
- ~5,905 standard trials
- ~2,058 LIVE + ~43 INTERNALIZER + ~3,804 EXCLUDED
- 48 models
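
If the summary file is available, the same expectations can be checked programmatically. This is a sketch only: the key names below are assumptions about the `trial_summary.json` schema, not documented fields.

```python
# Hypothetical sanity check against the generated summary. The key names are
# assumptions about trial_summary.json's schema, not documented fields.
import json
from pathlib import Path

summary = json.loads(Path("research/data/trial_summary.json").read_text())

assert summary["n_models"] == 48, "expected 48 models"
assert abs(summary["n_live"] - 2_058) < 50, "LIVE count far from the expected ~2,058"
assert abs(summary["n_internalizer"] - 43) < 10, "INTERNALIZER count far from ~43"
assert abs(summary["n_excluded"] - 3_804) < 50, "EXCLUDED count far from ~3,804"
print("trial_summary.json matches the expected classification counts")
```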