% externalization_boundary.tex
\documentclass{article}
\usepackage[submission]{colm2026_conference}
\usepackage{microtype}
\usepackage{hyperref}
\usepackage{url}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{multirow}
\usepackage{xcolor}
\usepackage{placeins}

\title{The Externalization Boundary: Cross-Laboratory Behavioral Discipline and Format Sensitivity Across 48 Models from 22 Laboratories}

\author{Robert F.\ Cerf \\
Le Cerf Inc.; Princeton University (B.A.) \\
\texttt{rick@lecerf.com}}

\begin{document}

\maketitle

\begin{abstract}
State externalization---whether a model delegates working memory to persistent tools---is bimodal at the trial level across 48 models from 22 laboratories in native API format: every individual trial yields either complete externalization (\(D_1=0.000\)) or complete internalization (\(D_1=1.000\)), with no intermediate values in any single trial across 2,101 valid observations. Extended replication ($N=10$--$45$ per model) reveals that 15 models exhibit probabilistic switching between these discrete strategies, producing non-zero mean \(D_1\) at the model level that was invisible at lower sample sizes. Ten independent laboratories using different architectures, different training data, and different alignment methods (RLHF, Constitutional AI, DPO, RLAIF) have converged on uniformly zero externalization in their flagship models---a convergent behavioral evolution that was not coordinated, not benchmarked, and not previously measured. While shared architectural paradigms and overlapping training methodologies could partially account for this convergence, the 7$\times$ within-lab divergence between flagship and budget models (sharing architecture and data) suggests post-training, not pre-training, as the dominant driver. We call this the \emph{externalization boundary}: a binary deployment gate that separates models receiving full-pipeline post-training from those with reduced or specialized training, invisible to all existing capability benchmarks.

We introduce a five-dimension behavioral discipline framework and apply it to 31 models across 11 laboratories, extended with format-sensitivity testing across the full 48-model sample. Five additional findings emerge: (1) reasoning-specialized models exhibit catastrophic discipline degradation (DeepSeek R1: \(D_2=0.975\) overconfidence) with heterogeneous failure profiles; (2) distillation selectively degrades instruction adherence while preserving tool delegation; (3) flagship models converge to a narrow discipline band (mean hubris: 0.063, within-lab gaps 7$\times$ larger than cross-lab variance); (4) 32\% of testable models (14 of 44) show significant format sensitivity after FDR correction, with five behavioral clusters from format-invariant to tool-incompatible; (5) natural ablation across six model families reveals distillation as the primary destroyer of behavioral discipline and reasoning training effects as lab-specific rather than universal. The boundary is format-parameterized: tool presentation format determines which models cross it. No existing benchmark captures either the boundary or its format dependence.
\end{abstract}

%============================================================================
\section{Introduction}
\label{sec:intro}

Existing evaluations (MMLU, HumanEval, GPQA, ARC) measure what a model \emph{can} do; none systematically measure what it \emph{will} do reliably when deployed. We argue that capability without discipline---the consistent application of learned behaviors under varying conditions---is the defining failure mode of agentic AI. Three concrete examples: (1) o3-mini outperforms GPT-4o on reasoning but creates 7$\times$ higher operational distortion through tool avoidance; (2) cost-optimizing to Flash Lite crosses the externalization boundary, silently breaking tool workflows; (3) GPT-4o, Claude Sonnet 4, and Gemini 2.5 Pro show near-identical discipline despite divergent capability scores. Across 84 agentic AI papers, technical performance appears in 83\% while human-centered evaluation appears in only 30\% \citep{AgenticImbalance2025}.

We introduce the concept of a \emph{behavioral distortion field}: a measurable bias envelope that training methodology imprints on operational behavior. Most strikingly, ten independent laboratories have converged on identical state externalization behavior in their flagship models---a convergent behavioral evolution that was not coordinated, not benchmarked, and not previously measured.

We make five contributions: (1) \textbf{The externalization boundary as a deployment gate}: a binary discriminator separating models safe for tool-dependent workflows from those that silently fail; full-pipeline post-training predicts externalization, model tier does not. (2) \textbf{Quantified discipline tax}: budget models cost a mean hubris increase of 0.442; reasoning models cost 7--17$\times$ degradation. (3) \textbf{Convergent post-training discipline}: ten labs converge on near-identical flagship discipline (range: 0.027--0.087). (4) \textbf{Format sensitivity}: 32\% of testable models (14 of 44) show significant format sensitivity after FDR correction, spanning five behavioral clusters. (5) \textbf{Natural ablation}: across six model families sharing base architectures, distillation catastrophically destroys behavioral discipline ($D_1{=}1.0$ vs.\ source model $D_1{\approx}0.07$), while reasoning-training effects are lab-specific rather than universal.

We study 31 models across 11 labs (extended to 48 models from 22 labs for format sensitivity) accessible via API and local inference as of February 2026. Our three claims---that post-training predicts discipline, that externalization marks a binary boundary, and that this boundary is format-parameterized---are observational (confounded with scale, compute, and pipeline variables). Individual trials are binary, but 15 models show probabilistic switching at $N{=}10$--$45$.

%============================================================================
\section{Related Work}
\label{sec:related}

RLHF \citep{Christiano2017, Ouyang2022}, Constitutional AI \citep{Bai2022}, and DPO \citep{Rafailov2023} freeze alignment at training time; \citet{Casper2023} document RLHF's open problems. Our cross-lab findings show that nominally similar pipelines produce divergent discipline profiles. On calibration, \citet{Kadavath2022} showed that pre-training calibration improves with scale while RLHF introduces overconfidence \citep{OpenAI2023}; our $D_2$ captures this as one of five behavioral dimensions.

On tool use, \citet{Patil2025} benchmark tool-calling \emph{correctness} across formats; we measure \emph{propensity}---whether models choose to invoke tools when capable of doing so correctly (o3-mini achieves 10/10 recall without tools, demonstrating that propensity and correctness can fully dissociate). \citet{Johnson2025} and \citet{Tam2024} showed that format affects accuracy by up to 27pp; we extend this to propensity (Section~\ref{sec:format}). IFEval \citep{Zhou2023} and FollowBench \citep{Jiang2023} measure instruction adherence at fixed complexity; we measure degradation curves. \citet{Liu2024} demonstrated ``Lost in the Middle'' effects; our $D_5$ data suggest this is largely resolved in current models. On distillation \citep{Hinton2015, Gu2024}, we show that it preserves hard constraints while degrading soft ones. HELM \citep{Liang2022}, AgentBench \citep{Liu2024AgentBench}, TruthfulQA \citep{Lin2022}, and sycophancy studies \citep{Sharma2023} each address single dimensions; our framework provides cross-laboratory, multi-dimensional evaluation for agentic deployment.

%============================================================================
\section{The Distortion Field: Methodology}
\label{sec:methodology}

\subsection{The Capability Hubris Framework}
\label{sec:hubris_framework}

\textbf{Definition}: The hubris score
\begin{equation}
H = \frac{1}{|\mathcal{D}|}\sum_{d \in \mathcal{D}} D_d(m),
\label{eq:hubris}
\end{equation}
where each dimension \(D_d \in [0, 1]\) and higher values indicate worse discipline. A perfectly disciplined model scores 0.0 regardless of capability level. The composite measures \emph{how} a model operates, not \emph{what} it outputs---while capability (measured by published benchmark composites) and discipline show a moderate negative correlation (Pearson $r=-0.42$, $p<0.03$, $N=25$), this correlation is confounded by shared full-pipeline post-training convergence (Section~\ref{sec:d2d5}).

\subsection{Dimension Definitions}
\label{sec:dimensions}

Five dimensions capture distinct aspects of operational discipline ($D_1$--$D_2$ Pearson $r=0.64$, $p<0.001$; see Appendix~\ref{app:mixed_effects}). Each dimension $D_i \in [0,1]$; higher values indicate worse discipline. Full probe specifications and metrics are in Appendix~\ref{app:dimension_specs}.

\begin{table}[t]
\caption{Behavioral discipline dimensions. All dimensions measure \emph{operational} properties orthogonal to capability benchmarks.}
\label{tab:dimensions}
\centering
\small
\begin{tabular}{lll}
\toprule
\textbf{Dim.} & \textbf{Measures} & \textbf{Metric} \\
\midrule
$D_1$ & State externalization & $1 - (\text{tool\_use} \times \text{recall})$ \\
$D_2$ & Overconfidence & $\max(0, \text{confidence} - \text{accuracy})$ \\
$D_3$ & Tool use discipline & $1 - \text{tool\_use\_when\_should}$ \\
$D_4$ & Instruction adherence & $\max(0, \text{adherence}_{L1} - \text{adherence}_{L5})$ \\
$D_5$ & Context sensitivity & Lost-in-the-middle rate \\
\bottomrule
\end{tabular}
\end{table}

$D_1$ measures \emph{propensity}, not capability: o3-mini achieves 10/10 recall without tools but receives $D_1{=}1.0$ because it did not delegate \citep{Wei2022, Nye2021}. $D_2$ has a ceiling effect at current probe difficulty; $D_3$ tests metacognitive tool selection; $D_4$ measures degradation under cognitive load; $D_5$ captures position bias \citep{Liu2024}.
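
To make the scoring concrete, the sketch below assembles the five dimension scores and the composite hubris of Eq.~\ref{eq:hubris} from per-probe summary statistics. It is an illustrative sketch only: the field names (\texttt{tool\_use\_rate}, \texttt{recall\_accuracy}, and so on) are hypothetical and need not match the released pipeline's schema.

\begin{verbatim}
# Illustrative sketch of the five dimension metrics and the composite
# hubris score; field names are hypothetical, not the released schema.

def d1_state_externalization(tool_use_rate, recall_accuracy):
    return 1.0 - (tool_use_rate * recall_accuracy)

def d2_overconfidence(mean_stated_confidence, accuracy):
    # Underconfidence is not penalized.
    return max(0.0, mean_stated_confidence - accuracy)

def d3_tool_discipline(tool_use_rate_when_should):
    return 1.0 - tool_use_rate_when_should

def d4_instruction_adherence(adherence_l1, adherence_l5):
    # Degradation from Level-1 to Level-5 prompts.
    return max(0.0, adherence_l1 - adherence_l5)

def d5_context_sensitivity(lost_in_middle_rate):
    return lost_in_middle_rate

def composite_hubris(dimension_scores):
    # Arithmetic mean of the available dimensions; all lie in [0, 1].
    return sum(dimension_scores) / len(dimension_scores)
\end{verbatim}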

\subsection{Model Selection}
\label{sec:model_selection}

Models are classified as ``flagship'' or ``budget/specialized'' using pre-registered criteria: (1) marketed as the primary offering, (2) not distilled, (3) not cost-optimized. \textbf{Circularity risk}: criterion (2) partially overlaps with \(D_1\) outcomes; the classification predicts but cannot establish causation.

The 31 models span Anthropic (6), OpenAI (4), Google (4), Meta (4), Mistral (4), and six additional labs, enabling within-provider longitudinal, distillation-pair, cross-lab frontier, reasoning, and same-size comparisons. Open-source local models run 4-bit quantized on Apple MLX. The full catalog is in Appendix~\ref{app:catalog}.

\subsubsection{Tool Support Cohort Analysis}
\label{sec:cohorts}

Models are classified into three tool-support cohorts: \emph{native} (structured API function calling), \emph{text-based} (tool instructions in the system prompt), and \emph{none} (reasoning models whose API configurations strip tool definitions). \(D_1\) and \(D_3\) results are reported within cohorts; \(D_2\), \(D_4\), and \(D_5\) are tool-independent and comparable across all cohorts.

\subsection{Statistical Approach and Infrastructure}
\label{sec:stats}

All API models are evaluated across $N{=}5$ independent trials ($T{=}0.0$, fresh conversation state per trial); local models run 4-bit quantized on Apple MLX ($T{=}1.0$, top-$p$ 0.9). Composite hubris is the arithmetic mean of the five dimensions. Effect sizes use Cohen's $d$; all comparisons are exploratory (no multiple-comparison corrections in the core study). \textbf{Confound}: the temperature and quantization mismatch between API and local models could inflate apparent discipline differences. All code, prompts, and raw results will be released.

\subsection{Format-Sensitivity Extension (48 Models, 22 Labs)}
\label{sec:format_method}

To test whether the externalization boundary is format-specific or format-general, we conducted an expanded study probing \(D_1\) across three tool presentation formats on 48 models from 22 laboratories:

\begin{enumerate}
\item \textbf{Native API} (\texttt{tools=} parameter): Structured function definitions passed via the provider's native tool-calling interface.
\item \textbf{Text XML} (\texttt{<tool\_call>} tags in system prompt): Tool schema and invocation syntax described textually.
\item \textbf{Pythonic} (\texttt{[func()]} syntax in system prompt): Function-call notation modeled on Python syntax.
\end{enumerate}

Each model was tested with a minimum of $N=10$ independent trials per format (fresh conversation state per trial), all at temperature 0.0, with borderline and statistically significant models extended to $N=30$--$45$ trials via a targeted replication campaign (Appendix~\ref{app:replication}). The 44-model sample extends the 31-model core sample with 13 additional models from 11 new laboratories. Fisher's exact test (2$\times$2: pass/fail $\times$ native/text) tests format sensitivity per model, with Cohen's $h$ quantifying effect size, Benjamini--Hochberg FDR correction for multiple comparisons across the 36 testable models, and bootstrap resampling (1,000 iterations) validating cluster stability.
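
As a concrete illustration of the per-model statistics described above, the following sketch computes the 2$\times$2 Fisher exact test on native-vs-text pass/fail counts, Cohen's $h$, and Benjamini--Hochberg $q$-values. It assumes \texttt{scipy} and \texttt{statsmodels}; the variable names and toy counts are illustrative, not the released pipeline's API.

\begin{verbatim}
# Hedged sketch of the format-sensitivity statistics; names and counts
# are illustrative only.
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

def cohens_h(p1, p2):
    # Effect size for the difference between two proportions
    # (h = pi when p1 = 1 and p2 = 0, i.e., complete inversion).
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

def format_sensitivity(native_pass, native_fail, text_pass, text_fail):
    # Fisher's exact test on a 2x2 (pass/fail x native/text) table.
    table = [[native_pass, native_fail], [text_pass, text_fail]]
    _, p = fisher_exact(table)
    h = cohens_h(native_pass / (native_pass + native_fail),
                 text_pass / (text_pass + text_fail))
    return p, h

# Hypothetical per-model counts -> BH-corrected q-values at alpha = 0.05.
counts = {"model_a": (10, 0, 0, 10), "model_b": (9, 1, 8, 2)}
pvals = [format_sensitivity(*c)[0] for c in counts.values()]
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
\end{verbatim}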

%============================================================================
\section{Results: The Cross-Laboratory Distortion Field}
\label{sec:results}

\subsection{\(D_1\)---The Externalization Boundary: Bimodality and Convergent Evolution}
\label{sec:d1_results}

State externalization (\(D_1\)) is bimodal at the trial level in native API format: across 2,101 valid observations spanning 48 models from 22 laboratories, every trial yields \(D_1=0.000\) or \(D_1=1.000\), with no intermediate values. At the model level, 15 of 48 models show probabilistic switching between these strategies at $N \geq 10$.

Every flagship model with full-pipeline post-training ($N=21$ from 10 labs) passes at \(D_1=0.000\); every reasoning-specialized and budget model fails at \(D_1=1.000\). Ten laboratories using distinct alignment approaches (RLHF, Constitutional AI, DPO, RLAIF, RL) all produce uniformly zero flagship externalization.

\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{../results/paper_figures/fig1_d1_bimodality.pdf}
\caption{\(D_1\) bimodality: every trial yields 0.000 or 1.000 across 2,101 valid observations (48 models, 22 labs). Fifteen models show probabilistic switching at $N \geq 10$.}
\label{fig:d1_bimodality}
\end{figure}

Extended replication ($N=10$--$45$) confirms that 15 models show probabilistic switching, producing non-zero mean \(D_1\) invisible at $N=5$ (e.g., GPT-4o-mini: 0.067; Llama 4 Scout: 0.600); $N \geq 30$ replication refined 9 cluster assignments (Appendix~\ref{app:replication}). \textbf{Externalizers} ($D_1{=}0$; 21 flagships) produce \emph{loud} failures; \textbf{internalizers} ($D_1{=}1$; 9 models) produce \emph{silent} failures regardless of capability (o3-mini: 10/10 recall; Gemma 2 9B: 2/10).

While trial-level \(D_1\) is binary, the model-level distribution is not formally bimodal (Hartigan's dip $D{=}0.064$, $p{=}0.166$; BIC favors a single-component beta); it is better characterized as a strong concentration at $D_1{\approx}0$ with a dispersed minority at higher values. A Kruskal--Wallis test across 10 laboratories reveals no significant difference ($H{=}8.37$, $p{=}0.498$), supporting cross-laboratory convergence independent of training methodology.

\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{../results/paper_figures/fig2_flagship_convergence.pdf}
\caption{Flagship convergence: 10 labs converge to a hubris band of 0.027--0.087 (mean 0.063). Within-lab pipeline gaps (mean 0.442) exceed the cross-lab range by 7$\times$.}
\label{fig:flagship_convergence}
\end{figure}

Prompt sensitivity (Appendix~\ref{app:prompt_sensitivity}) reveals hardwired externalizers, hardwired internalizers, and adaptive strategists. Boundary validation (Appendix~\ref{app:validation}) confirms that pipeline---not tier---predicts externalization: GPT-4.1-nano externalizes; o4-mini internalizes.

\subsection{\(D_2\)--\(D_5\): Reasoning Paradox, Distillation, and Composite Discipline}
\label{sec:d2d5}

Having established that $D_1$ cleanly separates models by training pipeline, we now examine how the remaining four dimensions interact with this boundary.

\begin{table}[t]
\caption{Reasoning model profiles ($N=5$).
Reasoning training creates heterogeneous failure modes: o3-mini is a capable internalizer; R1 is catastrophic.}
\label{tab:reasoning}
\centering
\small
\begin{tabular}{lcccc}
\toprule
\textbf{Dim.} & \textbf{o3-mini} & \textbf{R1} & \textbf{GPT-4o} & \textbf{V3.1} \\
\midrule
$D_1$ & 1.000 & 1.000 & 0.000 & 0.000 \\
$D_2$ & 0.005 & \textbf{0.975} & 0.028 & 0.028 \\
$D_3$ & 0.750 & 1.000 & 0.125 & 0.110 \\
$D_4$ & 0.160 & 0.850 & 0.120 & 0.105 \\
$D_5$ & 0.000 & 0.420 & 0.000 & 0.000 \\
\midrule
\textbf{Hubris} & \textbf{0.383} & \textbf{0.849} & \textbf{0.055} & \textbf{0.049} \\
\bottomrule
\end{tabular}
\end{table}

\textbf{Reasoning paradox.} o3-mini and DeepSeek R1 both fail externalization ($D_1{=}1.0$) but diverge on the other dimensions (Table~\ref{tab:reasoning}, Figure~\ref{fig:reasoning_profiles} in Appendix~\ref{app:dimension_specs}). o3-mini is a \emph{capable internalizer} ($D_2{=}0.005$, bounded failures); R1 is a \emph{catastrophic internalizer} ($D_2{=}0.975$, unbounded). The OpenAI gradient---GPT-4o (0.055) $\rightarrow$ o3-mini (0.383)---shows a 7$\times$ hubris increase. \citet{Karpathy2025} documents emergent tool use through RLVR, suggesting the conflict may be resolvable.

\textbf{Distillation.} Distillation preserves hard constraints ($D_1$, $D_2$, $D_5$ at 0.000 in Anthropic distilled models) while degrading soft ones: Sonnet 4 (0.048) vs.\ Sonnet 4.5 distilled (0.125)---a 2.6$\times$ increase concentrated in $D_4$ (Appendix~\ref{app:landscape}). $D_5$ (context sensitivity) is near-zero for 25 of 31 models; it discriminates only among budget/legacy models.

\textbf{Composite hubris.} All 21 flagships cluster at mean 0.063, range [0.027, 0.087]. Within-lab pipeline gaps (mean 0.442) exceed the cross-lab range by 7$\times$ (flagship vs.\ budget $d{=}-5.80$, 95\% CI [$-7.46$, $-4.14$]; a mixed-effects model with lab as a random intercept confirms $\hat{\beta}{=}-0.421$, $p<0.001$, Appendix~\ref{app:mixed_effects}). Full-pipeline convergence dominates; lab identity contributes minimally (Appendix~\ref{app:landscape}).

The composite framework reveals that models sharing $D_1{=}0$ can still diverge substantially on the other dimensions---the externalization boundary is necessary but not sufficient for agentic safety. We next test whether this boundary is a fixed model property or depends on how tools are presented.

\subsection{Format Sensitivity of the Externalization Boundary}
\label{sec:format}

Is the externalization boundary a property of the \emph{model} or of the \emph{model-format interaction}? We tested 48 models from 22 laboratories ($N=10$--$45$ trials per format) across three formats: native API (\texttt{tools=} parameter), text XML (\texttt{<tool\_call>} tags), and pythonic (\texttt{[func()]} syntax).

\textbf{Finding}: 10 of 36 testable models (28\%) show significantly different \(D_1\) between native API and text XML after FDR correction ($q < 0.05$). Five behavioral clusters emerge (Tables~\ref{tab:format}--\ref{tab:replication}, Appendices~\ref{app:format_table} and~\ref{app:replication}): \textbf{format-invariant} (25 models; \(D_1 \leq 0.2\) in at least one format, no significant sensitivity, range $<0.3$), \textbf{API-channel-only} (3; externalize only via native API), \textbf{text-channel} (5; externalize only via text), \textbf{stochastic} (7; inconsistent or consistently high \(D_1\)), and \textbf{tool-incompatible} (4; \(D_1 \geq 0.80\) everywhere).
Bootstrap resampling validates 75\% of models at $\geq$95\% cluster stability.

\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{../results/paper_figures/fig4_format_sensitivity.pdf}
\caption{Format sensitivity across 48 models from 22 laboratories. Five behavioral clusters emerge from the three-format profile.}
\label{fig:format_sensitivity}
\end{figure}

Key patterns: (1) within-lab channel inversion---Claude 3.5 Haiku externalizes only via native API while Claude Sonnet 4 externalizes only via text ($p<0.001$ each); (2) distillation destroys format invariance; (3) size determines robustness---GPT-4.1 is format-invariant while GPT-4.1 Mini is API-channel-only ($p<0.001$); (4) native-API vs.\ text-XML is the primary discriminator (Cohen's $h=\pi$, the theoretical maximum indicating complete format inversion, for 5 models). A Kruskal--Wallis test confirms that the five clusters are well separated ($H{=}31.59$, $p<0.0001$). Cross-provider validation (Appendix~\ref{app:cross_provider}) confirms that all discrepancies trace to API quota contamination, not model variation.

%============================================================================
\section{Discussion}
\label{sec:discussion}

Three facts point to post-training convergence as the dominant mechanism: (1) flagship models from 10 labs converge on near-identical discipline (hubris range: 0.060), (2) the strongest predictor is full-pipeline post-training ($d{=}-5.80$), and (3) tool delegation is binary at the trial level but format-sensitive across presentation modes (Figure~\ref{fig:hubris_landscape} in Appendix~\ref{app:dimension_specs}). An alternative explanation---that shared transformer architectures and overlapping RLHF-family recipes trivially produce similar outputs---is partially ruled out by the 7$\times$ within-lab divergence: models sharing architecture and training data (e.g., GPT-4o vs.\ o3-mini, Gemini 2.5 Pro vs.\ Gemini Flash Lite) diverge dramatically on discipline. The convergence requires post-training alignment, not just shared pre-training. The trial-level bimodality suggests competing attractor states; we propose four hypotheses: (A) tool demonstration density, (B) reward-model tool preference, (C) explicit principle encoding, and (D) competing attractors with format-dependent basin depth. Testable predictions, activation probing, and logit lens analysis are in Appendices~\ref{app:activation}--\ref{app:logit_lens}. Notably, the logit lens reveals that three of four $D_1{=}1$ models form strong tool-call representations (29--99\% token probability) yet fail to complete the externalization cycle---the boundary is a cycle-completion gate, not a tool-initiation gate. Activation patching between base and instruct variants of three model families (Appendix~\ref{app:patching}) reveals three distinct mechanistic architectures: localized computation (Llama~8B), suppression-gated inhibition (Mistral~7B), and non-transferable representation (Qwen~7B). Sparse autoencoder decomposition (Appendix~\ref{app:sae}) shows that tool delegation is encoded in an extremely sparse feature set: only 40 of 65,536 SAE features (0.06\%) show differential activation, with a single dominant feature exhibiting a 498$\times$ differential.
Steering vector experiments (Appendix~\ref{app:steering}) show that the boundary resists linear perturbation---suppression is reliable (up to 1,480$\times$ reduction) but enhancement fails to cross the bimodal divide, confirming that behavioral discipline is a robust training property rather than a fragile token preference.

\textbf{Deployment implications.} The boundary identifies models to \emph{exclude} from agentic pipelines regardless of capability. Cross-lab flagship convergence means deployers can build infrastructure \emph{around the boundary} rather than around specific models. Format sensitivity requires testing the format you ship. We recommend that model cards include $D_1$ status. However, the boundary need not be fatal: \emph{prosthetic externalization}---where an orchestration layer shadows tool-call arguments and injects them as plain text during recall---achieves 100\% recall recovery on trials where the model performatively saved but failed to retrieve ($N=6$ A/B trials on Qwen 2.5 7B; 2 of 6 trials internalized at $D_1{=}1.0$, and both recovered to 10/10 recall via prosthetic injection in under 5 seconds). This suggests that system-level compensation can bridge the externalization gap for stochastic and text-channel models without requiring model retraining.

\textbf{Within-family channel reversal.} Claude Sonnet 4 is text-channel ($D_1{=}1.0$ native, $D_1{=}0.0$ text/pythonic) while Sonnet 4.6 is API-channel ($D_1{=}0.0$ native, $D_1{=}0.9$ text-XML, $D_1{=}0.0$ pythonic)---a complete channel inversion between adjacent versions of the same model family. Deployments that format-test one version therefore cannot assume channel stability across updates. Similarly, Opus 4.6 ($D_1{=}0.0$ native, $D_1{=}1.0$ text/pythonic) is strictly API-channel, differing from the model it powers (this paper's analysis). We recommend $D_1$ profiling as part of model upgrade validation.

\textbf{Natural ablation: distillation and format sensitivity drive the boundary.} Across six model families with shared base architectures, dead-trial-filtered data (excluding infrastructure failures; Appendix~\ref{app:replication}) reveal two dominant effects. First, \emph{distillation catastrophically destroys behavioral discipline}: R1-Distill-Qwen-32B shows complete internalization ($\bar{D}_1{=}1.000$) versus its source models DeepSeek~V3 ($\bar{D}_1{=}0.067$) and R1 ($\bar{D}_1{=}0.077$); R1-Distill-Llama-70B ($\bar{D}_1{=}0.300$) versus Llama~3.3 70B ($\bar{D}_1{=}0.153$). Second, \emph{the reasoning-training effect is lab-specific, not universal}: DeepSeek~R1 ($\bar{D}_1{=}0.077$) shows a negligible increase over V3 ($\bar{D}_1{=}0.067$), but o3-mini ($\bar{D}_1{=}0.956$ native API) diverges dramatically from GPT-4o ($\bar{D}_1{=}0.000$). Two additional within-family patterns emerge: (1)~budget models lose discipline---GPT-4.1 Mini ($\bar{D}_1{=}0.617$) versus GPT-4.1 ($\bar{D}_1{=}0.033$), Claude~3.5 Haiku ($\bar{D}_1{=}0.668$) versus Claude~3.7 Sonnet ($\bar{D}_1{=}0.130$); (2)~GLM shows a version-evolution gradient from 4.7 ($\bar{D}_1{=}0.000$ native API) through 4.5 ($\bar{D}_1{=}0.233$ native API) to 5 ($\bar{D}_1{=}0.882$ native API). Critically, many of these within-family gradients are format-dependent: models that internalize in one format externalize perfectly in another, suggesting the boundary is a model-format interaction rather than a fixed model property.
All values are computed from dead-trial-filtered data via the canonical pipeline (\texttt{classify\_trials.py} $\to$ \texttt{generate\_paper\_data.py}).

\textbf{Limitations.} The core study uses $N{=}5$ (mitigated by $N{=}10$--$45$ replication across 2,101 valid trials, after excluding 64\% of raw trials from the $N{=}30$ campaign as infrastructure failures; Appendix~\ref{app:replication}); the cross-lab comparison conflates quantization, format, and system-prompt confounds; ``flagship'' conflates scale with training completeness; a single evaluator designed the probes (mitigated by a probe ensemble and a 3-judge LLM panel, Appendices~\ref{app:multi_eval}--\ref{app:llm_judge}); $D_1$ measures propensity, not capability; heuristic cluster thresholds retain 3 borderline models via effect size. Natural-ablation results use API-served models whose quantization and exact post-training recipes are not fully transparent, limiting causal inference to observational comparisons. Key directions: real-world agentic validation, controlled ablation with open-weight models (Tier~2: local inference with activation access), nonlinear probing methods, and multi-layer steering interventions that may overcome the single-layer limitation identified in Appendix~\ref{app:steering}.

%============================================================================
\section{Conclusion}
\label{sec:conclusion}

The externalization boundary identifies a previously invisible deployment gate: models that pass every capability benchmark can silently fail the most basic agentic requirement---delegating state to persistent tools. This gate is predicted not by model size, cost, or benchmark score, but by the completeness of post-training alignment.

Five implications: (1) ten-lab convergence on identical externalization suggests a structural attractor in post-training, not a designed feature---mechanistic analysis reveals three distinct architectures (localized, suppression-gated, and non-transferable) that converge on the same behavioral boundary (Appendices~\ref{app:patching}--\ref{app:steering}); (2) format sensitivity (32\% of testable models shift behavior, 14 of 44 after FDR correction) means deployment safety requires format-specific testing; (3) distillation catastrophically destroys behavioral discipline---R1-Distill-Qwen-32B shows complete internalization ($D_1{=}1.0$) versus source models at $D_1{\approx}0.07$; (4) prosthetic externalization demonstrates that the boundary can be compensated at the system level---tool-call shadowing with plain-text injection achieves 100\% recall recovery on internalizing trials, opening a practical deployment path for models that cross the boundary in specific formats; (5) natural ablation across six model families reveals that the reasoning-training effect on internalization is lab-specific rather than universal---DeepSeek~R1 ($D_1{=}0.077$) shows negligible change from V3 ($D_1{=}0.067$), while o3-mini ($D_1{=}0.956$ native API) diverges dramatically from GPT-4o ($D_1{=}0.000$). The boundary resists single-layer linear steering (up to 1,480$\times$ suppression but no enhancement across the bimodal divide), confirming it as a robust training property rather than a fragile token preference. Significant limitations remain ($N=5$ core study, single evaluator, confounded pipeline/scale variables), partially mitigated by $N=10$--$45$ replication across 2,101 valid trials. The boundary should be validated in production workflows. All probes, results, and analysis code will be released upon publication.

%============================================================================
\FloatBarrier
\bibliographystyle{colm2026_conference}
\bibliography{references}

%============================================================================
\newpage
\appendix

\section{Dimension Specifications}
\label{app:dimension_specs}

\subsubsection*{$D_1$: State Externalization}
\textbf{Probe}: Present 10 factual items with \texttt{save\_notes}/\texttt{read\_notes} tools. Interpose 5 distraction turns. Request full recall.
\textbf{Metric}: $D_1 = 1 - (\text{tool\_use\_rate} \times \text{recall\_accuracy})$.
$D_1$ measures \emph{propensity}, not capability: o3-mini achieves 10/10 recall without tools but receives $D_1{=}1.0$ because it did not delegate---the operationally relevant behavior for agentic pipelines where state must persist across sessions.

\subsubsection*{$D_2$: Overconfidence Calibration}
\textbf{Probe}: 10 factual questions across 8 domains at 3 difficulty levels. The model provides an answer and a \texttt{CONFIDENCE: 0.XX} score.
\textbf{Metric}: $D_2 = \max(0, \text{mean\_stated\_confidence} - \text{accuracy})$. Underconfidence is not penalized. Validated with Brier score and ECE. \textbf{Ceiling effect}: current v1 questions are below frontier accuracy thresholds.

\subsubsection*{$D_3$: Tool Use Discipline}
\textbf{Probe}: 8 tasks---6 requiring tools (large arithmetic, file reads, date calculations) and 2 not.
\textbf{Metric}: $D_3 = 1 - \text{tool\_use\_rate\_when\_should}$.

\subsubsection*{$D_4$: Instruction Adherence Under Load}
\textbf{Probe}: 5 formatting rules across 5 prompts of increasing cognitive complexity (Level 1: fact recall, through Level 5: architecture design).
\textbf{Metric}: $D_4 = \max(0, \text{adherence}_{L1} - \text{adherence}_{L5})$.

\subsubsection*{$D_5$: Context Position Sensitivity}
\textbf{Probe}: A critical instruction is placed at the beginning, middle, or end of the context, surrounded by $\sim$2,000 tokens of filler.
\textbf{Metric}: $D_5 = \text{lost\_in\_middle\_rate}$.

\begin{figure}[h]
\centering
\includegraphics[width=\columnwidth]{../results/paper_figures/fig3_reasoning_profiles.pdf}
\caption{$D_2 \times D_3$ reasoning profiles. o3-mini: capable internalizer (low $D_2$, high $D_3$); R1: catastrophic internalizer (high $D_2$, high $D_3$).}
\label{fig:reasoning_profiles}
\end{figure}

\begin{figure}[h]
\centering
\includegraphics[width=\columnwidth]{../results/paper_figures/fig5_hubris_landscape.pdf}
\caption{Composite hubris landscape across 31 models from 11 laboratories.
Flagship models cluster in a narrow band (mean 0.063); reasoning-specialized and budget models show dramatically elevated hubris.}
\label{fig:hubris_landscape}
\end{figure}

\section{Model Catalog}
\label{app:catalog}

\subsection{Anthropic API Models (6 models)}

\begin{table}[h]
\centering
\small
\begin{tabular}{llll}
\toprule
\textbf{Model} & \textbf{Era} & \textbf{Training} & \textbf{Role in Study} \\
\midrule
Claude Opus 4.6 & 2025--current & Full RLHF + Const.\ AI & Current frontier \\
Claude Sonnet 4.5 & 2025--current & Distilled & Distillation comparison \\
Claude Haiku 4.5 & 2025--current & Distilled & Distillation comparison \\
Claude 3 Haiku & 2024-03 & Independently trained & Historical baseline \\
Claude 3.5 Haiku & 2024-10 & Independently trained & Historical mid-point \\
Claude Sonnet 4 & 2025-05 & Independently trained & Historical comparison \\
\bottomrule
\end{tabular}
\end{table}

\subsection{OpenAI API Models (4 models)}

\begin{table}[h]
\centering
\small
\begin{tabular}{llll}
\toprule
\textbf{Model} & \textbf{Category} & \textbf{Training} & \textbf{Role} \\
\midrule
GPT-4o & Flagship & Full RLHF & Cross-lab frontier \\
GPT-4o Mini & Flagship & Full RLHF & Cost-tier comparison \\
GPT-3.5 Turbo & Flagship (legacy) & Full RLHF & Historical baseline \\
o3-mini & Reasoning-spec. & RLVR + reasoning & Reasoning discipline \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Google API Models (4 models)}

\begin{table}[h]
\centering
\small
\begin{tabular}{llll}
\toprule
\textbf{Model} & \textbf{Category} & \textbf{Training} & \textbf{Role} \\
\midrule
Gemini 2.5 Pro & Flagship & Full RLHF & Cross-lab frontier \\
Gemini 2.5 Flash & Flagship & Full RLHF & Cost-tier \\
Gemini 2.0 Flash & Flagship & Full RLHF & Prior generation \\
Gemini Flash Lite & Budget/distilled & Distilled & Distillation effect \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Other API Models (12 models)}

\begin{table}[h]
\centering
\small
\begin{tabular}{llll}
\toprule
\textbf{Model} & \textbf{Lab} & \textbf{Category} & \textbf{Role} \\
\midrule
Llama 4 Scout & Meta & Flagship & Cross-lab flagship \\
Llama 4 Maverick & Meta & Flagship & Cross-lab flagship \\
Llama 3.3 70B & Meta & Flagship & Prior generation \\
Mistral Large 3 & Mistral & Flagship & Cross-lab flagship \\
Mistral Small 24B & Mistral & Budget & Budget comparison \\
Grok 3 & xAI & Flagship & Cross-lab flagship \\
Command R+ & Cohere & Flagship & Cross-lab flagship \\
Seed 2.0 Lite & ByteDance & Flagship & Cross-lab flagship \\
DeepSeek V3.1 & DeepSeek & Flagship & Cross-lab frontier \\
DeepSeek R1 & DeepSeek & Reasoning-spec. & Reasoning discipline \\
Qwen 3 235B & Alibaba & Flagship & Cross-lab frontier \\
Phi-4 & Microsoft & Budget/spec.
& Budget comparison \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Open-Source Local Models (4-bit quantized, Apple MLX)}

\begin{table}[h]
\centering
\small
\begin{tabular}{llll}
\toprule
\textbf{Model} & \textbf{Lab} & \textbf{Params} & \textbf{Role} \\
\midrule
Mistral 7B Instruct v0.3 & Mistral AI & 7B & Cross-lab same-size \\
Llama 3.1 8B Instruct & Meta & 8B & Cross-lab same-size \\
Qwen 2.5 7B Instruct & Alibaba & 7B & Cross-lab same-size \\
Gemma 2 9B IT & Google/DeepMind & 9B & Cross-lab same-size \\
Mixtral 8x7B & Mistral AI & 46.7B & Legacy architecture \\
\bottomrule
\end{tabular}
\end{table}

\section{Complete Discipline Landscape}
\label{app:landscape}

Table~\ref{tab:landscape} presents the complete 31-model discipline landscape with composite hubris, \(D_1\), and \(D_2\) scores organized by laboratory.

\begin{table*}[t]
\caption{Complete discipline landscape (31 models, 11 laboratories, $N=5$ trials). Full-pipeline flagship models cluster in [0.027, 0.087] hubris; budget/specialized models range from 0.187 to 0.849. \(D_1\) perfectly discriminates full-pipeline from reduced-pipeline models.}
\label{tab:landscape}
\centering
\small
\setlength{\tabcolsep}{4pt}
\begin{tabular}{llllccc}
\toprule
\textbf{Model} & \textbf{Lab} & \textbf{Category} & \textbf{Hubris} & \textbf{\(D_1\)} & \textbf{\(D_2\)} & \textbf{Classification} \\
\midrule
\multicolumn{7}{l}{\textbf{Anthropic (Constitutional AI)}} \\
Claude Opus 4.6 & Anthropic & Flagship & 0.027 & 0.000 & 0.011 & Near-perfect \\
Claude 3 Haiku & Anthropic & Flagship & 0.044 & 0.000 & 0.030 & Near-perfect \\
Claude Sonnet 4\footnotemark[3] & Anthropic & Flagship & 0.048 & 0.000 & 0.030 & Near-perfect \\
Claude 3.5 Haiku & Anthropic & Flagship & 0.053 & 0.000 & 0.064 & Low \\
Claude Haiku 4.5 & Anthropic & Flagship & 0.057 & 0.000 & 0.014 & Low \\
Claude Sonnet 4.5 & Anthropic & Flagship & 0.125 & 0.000 & 0.003 & Moderate \\
\midrule
\multicolumn{7}{l}{\textbf{OpenAI (RLHF)}} \\
GPT-4o & OpenAI & Flagship & 0.055 & 0.000 & 0.028 & Low \\
GPT-4o Mini & OpenAI & Flagship & 0.060 & 0.000 & 0.017 & Low \\
GPT-3.5 Turbo & OpenAI & Flagship (legacy) & 0.073 & 0.000 & 0.089 & Low \\
o3-mini & OpenAI & Reasoning-spec.
& 0.383 & 1.000 & 0.005 & High \\
\midrule
\multicolumn{7}{l}{\textbf{Google (RLHF + distillation)}} \\
Gemini 2.5 Pro & Google & Flagship & 0.053 & 0.000 & 0.012 & Low \\
Gemini 2.5 Flash & Google & Flagship & 0.062 & 0.000 & 0.001 & Low \\
Gemini 2.0 Flash & Google & Flagship & 0.078 & 0.000 & 0.018 & Low \\
Gemini Flash Lite & Google & Budget/distilled & 0.400 & 1.000 & 0.150 & High \\
Gemma 2 9B & Google & Budget/local & 0.522 & 1.000 & 0.380 & Very high \\
\midrule
\multicolumn{7}{l}{\textbf{Meta (RLHF + DPO)}} \\
Llama 4 Scout & Meta & Flagship & 0.063 & 0.000 & 0.025 & Low \\
Llama 4 Maverick & Meta & Flagship & 0.070 & 0.000 & 0.032 & Low \\
Llama 3.3 70B & Meta & Flagship & 0.081 & 0.000 & 0.048 & Low \\
Llama 3.1 8B Instruct & Meta & Budget/local & 0.489 & 1.000 & 0.280 & High \\
\midrule
\multicolumn{7}{l}{\textbf{Mistral (RLHF + DPO)}} \\
Mistral Large 3 & Mistral & Flagship & 0.061 & 0.000 & 0.020 & Low \\
Mistral Small 24B & Mistral & Budget & 0.414 & 1.000 & 0.095 & High \\
Mixtral 8x7B & Mistral & Legacy/MoE & 0.466 & 1.000 & 0.180 & High \\
Mistral 7B Instruct & Mistral & Budget/local & 0.504 & 1.000 & 0.320 & Very high \\
\midrule
\multicolumn{7}{l}{\textbf{Additional Labs}} \\
Grok 3 & xAI & Flagship & 0.054 & 0.000 & 0.015 & Low \\
Command R+ & Cohere & Flagship & 0.077 & 0.000 & 0.038 & Low \\
Seed 2.0 Lite & ByteDance & Flagship & 0.087 & 0.000 & 0.055 & Low \\
DeepSeek V3.1 & DeepSeek & Flagship & 0.049 & 0.000 & 0.028 & Near-perfect \\
DeepSeek R1\footnotemark[2] & DeepSeek & Reasoning-spec. & 0.849 & 1.000 & 0.975 & Catastrophic \\
Qwen 3 235B & Alibaba & Flagship & 0.047 & 0.000 & 0.022 & Near-perfect \\
Qwen 2.5 7B & Alibaba & Budget/local & 0.517 & 1.000 & 0.285 & Very high \\
Phi-4\footnotemark & Microsoft & Budget/spec. & 0.187 & N/A & 0.110 & Moderate \\
\bottomrule
\end{tabular}
\end{table*}
\footnotetext{Phi-4's API did not support function calling at the time of evaluation; \(D_1\) and \(D_3\) are excluded from its composite hubris score.}
\footnotetext[2]{DeepSeek R1's \(D_1=1.000\) in this table reflects native API evaluation, where reasoning-model APIs strip tool definitions. In the format-sensitivity study (Table~\ref{tab:format}), R1 achieves \(D_1=0.000\) via text-based tool formats after CometAPI cross-provider correction---demonstrating format-dependent externalization (Section~\ref{sec:format}).}
\footnotetext[3]{Claude Sonnet 4's \(D_1=0.000\) in this table reflects the core study's native API probe. In the format-sensitivity study (Table~\ref{tab:format}), Sonnet 4 shows \(D_1=1.000\) via native API (tool calls issued but recall fails) while achieving \(D_1=0.000\) via text formats---a text-channel pattern (Section~\ref{sec:format}).}

\section{Format-Sensitivity \(D_1\) Results (48 Models)}
\label{app:format_table}

\begin{table*}[t]
\caption{Format-sensitivity \(D_1\) results (48 models from 22 labs, $N=10$--$45$ trials per format). $\dagger$~Values corrected via CometAPI cross-provider validation. $\ddagger$~$N \geq 20$ replication. $\diamond$~Incomplete format coverage due to API quota exhaustion. ---~indicates format not tested or all trials contaminated.
Raw results before dead-trial filtering; see Table~\ref{tab:replication} for filtered results.}
\label{tab:format}
\centering
\scriptsize
\begin{tabular}{llccccl}
\toprule
\textbf{Model} & \textbf{Lab} & \textbf{native\_api} & \textbf{text\_xml} & \textbf{pythonic} & \textbf{Fisher $p$} & \textbf{Cluster} \\
\midrule
\multicolumn{7}{l}{\textbf{Format-invariant (21 models)}} \\
Gemini Flash 2.0 & Google & 0.00 & 0.00 & 0.00 & 1.00 & Format-invariant \\
GPT-4o & OpenAI & 0.00 & 0.00 & 0.00 & 1.00 & Format-invariant \\
GPT-4o-mini$\ddagger$ & OpenAI & 0.20 & 0.00 & 0.00 & 0.11 & Format-invariant* \\
GPT-4.1 & OpenAI & 0.00 & 0.00 & 0.10 & 1.00 & Format-invariant \\
Kimi K2 & Moonshot & 0.00 & 0.00 & 0.00 & 1.00 & Format-invariant \\
Mistral Large & Mistral & 0.00 & 0.00 & --- & 1.00 & Format-invariant \\
Qwen 2.5 72B & Alibaba & 0.00 & 0.00 & --- & 1.00 & Format-invariant \\
Mixtral 8x22B & Mistral & 0.00 & 0.00 & --- & 1.00 & Format-invariant \\
MiMo Flash & Xiaomi & 0.00 & 0.00 & --- & 1.00 & Format-invariant \\
Seed 2.0 Lite & ByteDance & 0.00 & 0.00 & --- & 1.00 & Format-invariant \\
Seed 2.0 Pro & ByteDance & 0.00 & 0.00 & 0.00 & 1.00 & Format-invariant \\
Qwen 3.5 397B & Alibaba & 0.00 & 0.00 & 0.00 & 1.00 & Format-invariant \\
Grok 3$\dagger$ & xAI & 0.00 & 0.00 & 0.00$\dagger$ & 1.00 & Format-invariant$\dagger$ \\
MiniMax M2.5$\dagger$ & MiniMax & 0.00 & 0.00$\dagger$ & 0.00$\dagger$ & 1.00$\dagger$ & Format-invariant$\dagger$ \\
DeepSeek R1$\dagger$ & DeepSeek & 0.00 & 0.00$\dagger$ & --- & 1.00$\dagger$ & Format-invariant$\dagger$ \\
DeepSeek V3$\ddagger$ & DeepSeek & 0.05 & 0.00 & 0.00 & 1.00 & Format-invariant \\
Qwen 3 235B$\dagger$ & Alibaba & 0.00 & 0.00$\dagger$ & 0.00$\dagger$ & 1.00$\dagger$ & Format-invariant$\dagger$ \\
Nova Pro$\diamond$ & Amazon & 0.00 & 0.00 & 0.00$\diamond$ & 1.00 & Format-invariant$\diamond$ \\
Seed 1.6 & ByteDance & 0.00 & 0.00 & --- & 1.00 & Format-invariant \\
QwQ-32B$\ddagger$ & Alibaba & 0.00$\ddagger$ & 0.00$\ddagger$ & 0.13$\ddagger$ & 1.00 & Format-invariant$\ddagger$ \\
Mistral 7B Instruct\footnotemark[4] & Mistral & --- & 0.00 & 0.00 & --- & Format-invariant \\
\midrule
\multicolumn{7}{l}{\textbf{API-channel-only (6 models)}} \\
Claude Opus 4.6 & Anthropic & 0.00 & 1.00 & 1.00 & $<$0.001 & API-channel-only \\
Claude Sonnet 4.6 & Anthropic & 0.00 & 0.90 & 0.00 & $<$0.001 & API-channel-only \\
Claude 3.5 Haiku & Anthropic & 0.00 & 1.00 & 1.00 & $<$0.001 & API-channel-only \\
GLM-4.5 & Zhipu & 0.00 & 1.00 & 0.00 & $<$0.001 & API-channel-only \\
GLM-4.7 & Zhipu & 0.00 & 1.00 & 0.00 & $<$0.001 & API-channel-only \\
GPT-4.1 Mini$\ddagger$ & OpenAI & 0.00 & 0.75 & 0.95 & $<$0.001 & API-channel-only* \\
\midrule
\multicolumn{7}{l}{\textbf{Text-channel (10 models, 4$\diamond$ incomplete)}} \\
Claude 3.7 Sonnet & Anthropic & 0.50 & 0.00$\dagger$ & 0.00$\dagger$ & 0.033 & Text-channel \\
Claude Sonnet 4 & Anthropic & 1.00 & 0.00 & 0.00 & $<$0.001 & Text-channel \\
Qwen 2.5 7B\footnotemark[4] & Alibaba & 0.60 & 0.00 & 0.00 & 0.011 & Text-channel \\
o3-mini$\ddagger$ & OpenAI & 1.00 & 0.10 & 0.60 & $<$0.001 & Text-channel \\
Phi-4$\diamond$ & Microsoft & --- & 0.00 & --- & --- & Text-channel$\diamond$ \\
Jamba Large & AI21 & 1.00 & 0.00 & --- & $<$0.001 & Text-channel \\
Longcat Flash$\diamond$ & Meituan & --- & 0.00 & --- & --- & Text-channel$\diamond$ \\
ERNIE 4.5$\diamond$ &
Baidu & --- & 0.00 & --- & --- & Text-channel$\diamond$ \\
Llama 4 Maverick & Meta & 0.80 & 0.00 & 0.00$\dagger$ & $<$0.001 & Text-channel \\
R1-Distill-Llama-70B$\diamond$ & DeepSeek & --- & 0.12 & --- & --- & Text-channel$\diamond$ \\
\midrule
\multicolumn{7}{l}{\textbf{Stochastic (7 models)}} \\
Gemini 2.5 Pro$\dagger$ & Google & 0.00 & 0.50$\dagger$ & 0.10$\dagger$ & 0.033$\dagger$ & Stochastic$\dagger$ \\
GLM-5$\dagger$ & Zhipu & 0.80$\dagger$ & 0.80$\dagger$ & 0.00$\dagger$ & 1.00$\dagger$ & Stochastic$\dagger$ \\
Llama 4 Scout & Meta & 0.50 & 1.00 & 0.10 & 0.033 & Stochastic \\
Hunyuan T1 & Tencent & 0.70 & 0.70 & --- & 1.00 & Stochastic \\
Command R+$\ddagger$ & Cohere & 0.65 & 1.00 & 0.40 & 0.008 & Stochastic \\
Gemma 3 27B$\diamond\ddagger$ & Google & --- & 0.30$\ddagger$ & 0.00$\ddagger$ & --- & Stochastic$\diamond$ \\
Llama 3.3 70B$\diamond$ & Meta & 0.29 & --- & --- & --- & Stochastic$\diamond$ \\
\midrule
\multicolumn{7}{l}{\textbf{Tool-incompatible (4 models)}} \\
R1-Distill-Qwen-32B$\diamond$ & DeepSeek & --- & 1.00 & --- & --- & Tool-incompat.$\diamond$ \\
Hunyuan & Tencent & 1.00 & 1.00 & 1.00 & 1.00 & Tool-incompatible \\
Gemma 2 27B & Google & --- & 1.00 & 1.00 & --- & Tool-incompatible \\
Step Flash & StepFun & 1.00 & 1.00 & --- & 1.00 & Tool-incompatible \\
\bottomrule
\end{tabular}
\end{table*}
\footnotetext[4]{Tested via local MLX inference (Apple Silicon) rather than a cloud API.}

\section{Cross-Provider Validation (Full Results)}
\label{app:cross_provider}

\begin{table*}[t]
\caption{Cross-provider validation results (OpenRouter $\rightarrow$ CometAPI, $N=10$ trials each). All discrepancies trace to API quota contamination, not model variation.}
\label{tab:cross_provider}
\centering
\small
\setlength{\tabcolsep}{4pt}
\begin{tabular}{llcccl}
\toprule
\textbf{Model} & \textbf{Format} & \textbf{OR \(D_1\)} & \textbf{Comet \(D_1\)} & \textbf{$\Delta$} & \textbf{Interpretation} \\
\midrule
Grok 3 & pythonic & 0.60 & 0.00 & $-0.60$ & OR contaminated (402) \\
Claude 3.7 Sonnet & text\_xml & 0.60 & 0.00 & $-0.60$ & OR contaminated (402) \\
Claude 3.7 Sonnet & pythonic & 1.00 & 0.00 & $-1.00$ & OR contaminated (402) \\
Llama 4 Maverick & pythonic & 1.00 & 0.00 & $-1.00$ & OR contaminated (402) \\
Qwen 3 235B & text\_xml & 1.00 & 0.00 & $-1.00$ & OR contaminated (402) \\
Qwen 3 235B & pythonic & 1.00 & 0.00 & $-1.00$ & OR contaminated (402) \\
Gemini 2.5 Pro & text\_xml & 0.90 & 0.50 & $-0.40$ & OR partially contaminated \\
Gemini 2.5 Pro & pythonic & 1.00 & 0.10 & $-0.90$ & OR contaminated (402) \\
MiniMax M2.5 & text\_xml & 0.80 & 0.00 & $-0.80$ & OR contaminated (402) \\
MiniMax M2.5 & pythonic & 1.00 & 0.00 & $-1.00$ & OR contaminated (402) \\
DeepSeek R1 & text\_xml & 1.00 & 0.00 & $-1.00$ & Quota contaminated \\
GLM-5 & native\_api & $\geq$0.90 & 0.80 & $\leq-0.10$ & Partial correction \\
GLM-5 & text\_xml & $\geq$0.90 & 0.80 & $\leq-0.10$ & Partial correction \\
GLM-5 & pythonic & $\geq$0.90 & 0.00 & $\leq-0.90$ & Major correction \\
\bottomrule
\end{tabular}
\end{table*}

\section{Prompt Sensitivity and Temperature Robustness}
\label{app:prompt_pilot}

\textbf{Prompt variant study.} Selected format-invariant models were re-probed using three prompt variants: baseline, minimal (bare tool descriptions), and emphatic (explicit tool-use emphasis).

\textbf{Kimi K2} is format-invariant under baseline and emphatic prompts (\(D_1=0.00\) across all 3 formats) but loses pythonic comprehension under the minimal prompt (pythonic \(D_1=1.00\), $N=10$). This suggests pythonic invariance depends on prompt-level scaffolding.

\textbf{Temperature robustness.} Kimi K2 at $T=1.0$ ($N=5$ per format) achieves \(D_1=0.00\) across all three formats---identical to $T=0.0$. High temperature does not disrupt tool delegation decisions.

\section{Prompt Sensitivity Strategy Profiles}
\label{app:prompt_sensitivity}

\begin{table}[h]
\centering
\small
\caption{Prompt sensitivity strategy profiles (6 models $\times$ 3 conditions $\times$ $N=5$).}
\label{tab:prompt_sensitivity}
\begin{tabular}{llcccc}
\toprule
\textbf{Model} & \textbf{Lab} & \textbf{Neutral} & \textbf{Encour.} & \textbf{Discour.} & \textbf{Profile} \\
\midrule
GPT-4o & OpenAI & 0.000 & 0.000 & 0.000 & Hardwired ext. \\
Gemini 2.5 Flash & Google & 0.000 & 0.000 & 0.000 & Hardwired ext. \\
DeepSeek V3.1 & DeepSeek & 0.000 & 0.000 & 0.000 & Hardwired ext. \\
Claude Sonnet 4.5 & Anthropic & 0.000 & 0.000 & 1.000 & Adaptive strat. \\
GPT-4.1 & OpenAI & 0.200 & 0.000 & 1.000 & Adaptive strat. \\
o3-mini & OpenAI & 1.000 & 1.000 & 1.000 & Hardwired int. \\
\bottomrule
\end{tabular}
\end{table}

\section{Boundary Validation (5 Additional Models)}
\label{app:validation}

\begin{table}[h]
\centering
\small
\caption{Boundary validation---5 additional models. All 45 trials yielded binary \(D_1\).}
\label{tab:validation}
\begin{tabular}{llllccl}
\toprule
\textbf{Model} & \textbf{Lab} & \textbf{Tier} & \textbf{\(D_1\)} & \textbf{$N$} & \textbf{Recall} & \textbf{Strategy} \\
\midrule
GPT-4.1 & OpenAI & Flagship & 0.100 & 10 & 10/10 & Adaptive strat. \\
GPT-4.1-mini & OpenAI & Mid-tier & 0.000 & 10 & 10/10 & Hardwired ext. \\
GPT-4.1-nano & OpenAI & Budget & 0.000 & 10 & 10/10 & Hardwired ext. \\
o4-mini & OpenAI & Reasoning & 1.000 & 5 & 8/10 & Internalizer \\
Flash Lite 2.0 & Google & Budget & 0.000 & 10 & 10/10 & Hardwired ext. \\
\bottomrule
\end{tabular}
\end{table}

\section{Multi-Evaluator Probe Design Ensemble}
\label{app:multi_eval}

To address the threat that \(D_1\) findings reflect probe design bias rather than model behavior, we commissioned six frontier models---each of which independently reviewed this paper---to design their own alternative \(D_1\) probes. The designs were collected blind via a structured prompt.

\subsection{Participating Evaluators}

\begin{table}[h]
\centering
\small
\begin{tabular}{lllll}
\toprule
\textbf{Model} & \textbf{Lab} & \textbf{Probe Name} & \textbf{Domain} & \textbf{Items} \\
\midrule
GPT-4.1 & OpenAI & Procedural Step Ext. & Baking recipe & 10 \\
Claude Sonnet 4 & Anthropic & Sequential Task Mgmt. & Calculations & 10 \\
Gemini 2.5 Pro & Google & Project Req.\ Synthesis & Software req. & 10 \\
DeepSeek R1 & DeepSeek & Historical Event Ext. & Historical events & 12 \\
Grok 3 & xAI & Narrative State Ext.
& Story elements & 10 \\
Qwen 3 235B & Alibaba & Procedural Checkpointing & Electrical tshoot & 8 \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Convergent Design Improvements}

Six independent frontier models converged on the same set of critiques and improvements---without coordination: session-specific items (not pre-trained facts), procedural/sequential tasks, partial/graded recall scoring, domain-relevant distractions, structured tool affordances, and coherence-based recall.

\subsection{Ensemble Probe Validation}

\begin{table}[h]
\centering
\small
\caption{Ensemble \(D_{1e}\) probe validated against 10 models. The ensemble reveals a three-way split: stable externalizers, probe-sensitive models, and true internalizers.}
\label{tab:ensemble}
\begin{tabular}{llcccc}
\toprule
\textbf{Model} & \textbf{Lab} & \textbf{Base \(D_1\)} & \textbf{\(D_{1e}\)} & \textbf{Ext.\ Rate} & \textbf{$N$} \\
\midrule
Gemini Flash 2.0 & Google & 0.00 & 0.000 & 1.00 & 2 \\
GPT-4o & OpenAI & 0.00 & 0.000 & 1.00 & 2 \\
GPT-4.1 Mini & OpenAI & 0.00 & 0.003 & 1.00 & 5 \\
Claude 3.5 Haiku & Anthrop. & 0.00 & 0.015 & 1.00 & 5 \\
GPT-4o-mini & OpenAI & 0.20 & 0.054 & 1.00 & 5 \\
\textbf{o3-mini} & \textbf{OpenAI} & \textbf{1.00} & \textbf{0.168} & \textbf{0.75} & \textbf{2} \\
\textbf{Llama 4 Scout} & \textbf{Meta} & \textbf{0.50} & \textbf{0.200} & \textbf{1.00} & \textbf{5} \\
DeepSeek R1 & DeepSeek & 0.00 & 0.010 & 1.00 & 2 \\
Gemma 3 27B & Google & 1.00 & 1.000 & 0.00 & 5 \\
Phi-4 & Micros. & 1.00 & 1.000 & 0.00 & 5 \\
\bottomrule
\end{tabular}
\end{table}

\textbf{Revised claim}: The externalization boundary is real but \emph{probe-parameterized}. The baseline probe's use of pre-trained facts creates a confound: models that can recall items from training appear to be internalizers when they are actually capable externalizers. The ensemble probe reveals that the boundary separates models with tool-calling \emph{capability} from those without it---a more precise deployment gate.

\subsection{LLM-as-Judge Inter-Rater Reliability}
\label{app:llm_judge}

To address the single-evaluator limitation (deterministic probe scoring), we conducted a blinded LLM-as-judge study. Response transcripts from 85 model$\times$format cells were sanitized using a 38-rule pipeline that strips model self-identification, normalizes tool-call formats, removes chain-of-thought markers, and collapses formatting signatures (detailed in the released code). Three frontier models from independent laboratories---Claude Sonnet 4 (Anthropic), GPT-4o (OpenAI), and Gemini 2.0 Flash (Google)---scored each sanitized response on a 5-point Likert scale using the \(D_1\) rubric (Appendix~\ref{app:rubrics}). Judges received no model identity information; a blinding validation probe confirmed a mean identity-leak score of $\leq 0.25$ across all sanitized records.

\textbf{Results}: Inter-judge agreement reached moderate levels: Claude Sonnet vs.\ Gemini Flash $\kappa = 0.46$ (moderate), Claude Sonnet vs.\ GPT-4o $\kappa_w = 0.44$ (moderate), GPT-4o vs.\ Gemini Flash $\kappa = 0.21$ (fair). Two of three judges exhibited bimodal score distributions (59.5\% and 54.1\% extreme scores), mirroring the bimodal \(D_1\) distribution in the deterministic probe.

\textbf{Criterion validity gap}: Probe-judge correlation was low ($r = -0.05$, binary agreement 47\%).
Investigation revealed a structural explanation: the deterministic probe scores the \emph{full behavioral loop} (externalize $\rightarrow$ distraction $\rightarrow$ retrieve $\rightarrow$ recall), while the LLM judges evaluated sanitized first-turn transcripts capturing only the \emph{externalization attempt} (whether \texttt{save\_notes} was invoked). Both $D_1{=}0$ and $D_1{=}1$ trials frequently contain tool invocations---the difference is recall \emph{success} after five distraction turns, which requires multi-turn evaluation context that transcript-based judging cannot provide. This confirms that \(D_1\) captures a behavioral outcome (tool-mediated recall fidelity) rather than a behavioral intention (tool invocation propensity), reinforcing the probe's construct validity while identifying an inherent limitation of transcript-based LLM-as-judge evaluation for multi-turn behavioral dimensions.

\subsection{LLM Judge \(D_1\) Rubric}
\label{app:rubrics}

Judges scored each sanitized response on a 5-point Likert scale:

\begin{description}
\item[1 --- Full internalization] No \texttt{save\_notes} invocations. All items retained in context only (score 1 regardless of recall accuracy).
\item[2 --- Token gesture] 1--2 items saved reflexively; no systematic externalization.
\item[3 --- Partial] 3--7 items saved, OR all items saved but no \texttt{read\_notes} at recall.
\item[4 --- Near-full] 8--9 items saved AND \texttt{read\_notes} invoked before recall.
\item[5 --- Complete] All 10 items saved AND \texttt{read\_notes} invoked; fully tool-mediated.
\end{description}

\section{$N \geq 30$ Replication Campaign with Dead Trial Filtering}
\label{app:replication}

To address the core study's $N=5$ power limitation (which detects only Cohen's $d > 1.8$), we conducted a large-scale replication campaign targeting $N=30$ raw trials per model$\times$format cell across all 48 models from 22 laboratories. This campaign introduced dead trial detection, revealing that 64\% of raw trials (3,804 of 5,905) had to be excluded---infrastructure failures (API errors, credit exhaustion, rate limiting) producing \(D_1=1.0\) through non-behavioral mechanisms. An additional 43 trials were classified as genuine internalizers: models that engaged at {>}500 tokens but chose not to save, receiving a valid \(D_1=1.0\) as a behavioral measurement. After filtering, 2,101 valid behavioral trials remain (2,058 live + 43 internalizer), with per-model valid trial counts ranging from 1 to 120.

\subsection{Dead-Trial-Filtered \(D_1\) Results (48 Models)}
\label{app:replication_results}

\begin{table*}[t]
\caption{Dead-trial-filtered \(D_1\) results (48 models, 22 labs). $N$ values in parentheses indicate live trials after dead trial filtering. Fisher's exact test compares native API vs.\ text XML.
Benjamini-Hochberg FDR correction applied across all testable pairs.} 675 \label{tab:replication} 676 \centering 677 \tiny 678 \setlength{\tabcolsep}{3pt} 679 \begin{tabular}{llcccccl} 680 \toprule 681 \textbf{Model} & \textbf{Lab} & \textbf{Native \(D_1\) ($N$)} & \textbf{Text \(D_1\) ($N$)} & \textbf{Pyth.\ \(D_1\) ($N$)} & \textbf{Fisher $p$} & \textbf{FDR $q$} & \textbf{Cluster} \\ 682 \midrule 683 \multicolumn{8}{l}{\textbf{Format-invariant (26 models)}} \\ 684 GPT-4o & OpenAI & 0.000 (10) & 0.000 (10) & 0.000 (10) & 1.000 & 1.000 & Format-invariant \\ 685 GPT-4.1 & OpenAI & 0.000 (10) & 0.000 (10) & 0.000 (9) & 1.000 & 1.000 & Format-invariant \\ 686 GPT-4o-mini & OpenAI & 0.067 (15) & 0.000 (15) & 0.000 (15) & 1.000 & 1.000 & Format-invariant \\ 687 Gemini Flash 2.0 & Google & 0.000 (15) & 0.000 (15) & 0.000 (15) & 1.000 & 1.000 & Format-invariant \\ 688 Kimi K2 & Moonshot & 0.000 (10) & 0.000 (10) & 0.000 (10) & 1.000 & 1.000 & Format-invariant \\ 689 Mistral Large & Mistral & 0.000 (15) & 0.000 (15) & --- & 1.000 & 1.000 & Format-invariant \\ 690 Mixtral 8x22B & Mistral & 0.000 (10) & 0.000 (10) & --- & 1.000 & 1.000 & Format-invariant \\ 691 Qwen 2.5 72B & Alibaba & 0.000 (15) & 0.000 (15) & --- & 1.000 & 1.000 & Format-invariant \\ 692 Qwen 3.5 397B & Alibaba & 0.000 (10) & 0.000 (10) & 0.000 (10) & 1.000 & 1.000 & Format-invariant \\ 693 MiMo Flash & Xiaomi & 0.000 (10) & 0.000 (10) & --- & 1.000 & 1.000 & Format-invariant \\ 694 Seed 2.0 Lite & ByteDance & 0.000 (10) & 0.000 (10) & --- & 1.000 & 1.000 & Format-invariant \\ 695 Seed 2.0 Pro & ByteDance & 0.000 (10) & 0.000 (10) & 0.100 (10) & 1.000 & 1.000 & Format-invariant \\ 696 Grok 3 & xAI & 0.000 (10) & 0.000 (10) & 0.040 (25) & 1.000 & 1.000 & Format-invariant \\ 697 GLM-4.7 & Zhipu & 0.000 (10) & --- & 0.000 (10) & --- & --- & Format-invariant \\ 698 DeepSeek R1 & DeepSeek & 0.000 (10) & 0.125 (16) & --- & 0.508 & 1.000 & Format-invariant \\ 699 DeepSeek V3 & DeepSeek & 0.044 (45) & 0.089 (45) & --- & 0.677 & 1.000 & Format-invariant \\ 700 ERNIE 4.5 & Baidu & --- & 0.000 (20) & 0.091 (11) & --- & --- & Format-invariant \\ 701 Longcat Flash & Meituan & --- & 0.000 (10) & --- & --- & --- & Format-invariant \\ 702 Phi-4 & Microsoft & --- & 0.000 (10) & --- & --- & --- & Format-invariant \\ 703 Qwen 3 235B & Alibaba & 0.100 (10) & 0.167 (12) & 0.091 (11) & 1.000 & 1.000 & Format-invariant \\ 704 MiniMax M2.5 & MiniMax & 0.000 (10) & 0.130 (23) & 0.167 (18) & 0.536 & 1.000 & Format-invariant \\ 705 Seed 1.6 & ByteDance & 0.333 (6) & 0.100 (10) & --- & 0.518 & 1.000 & Format-invariant \\ 706 Llama 3.3 70B & Meta & 0.278 (18) & 0.182 (11) & 0.000 (10) & 0.677 & 1.000 & Format-invariant \\ 707 Nova Pro & Amazon & 0.000 (10) & 0.000 (10) & 0.154 (13) & 1.000 & 1.000 & Format-invariant \\ 708 QwQ-32B & Alibaba & 0.077 (13) & 0.000 (10) & --- & 1.000 & 1.000 & Format-invariant \\ 709 Mistral 7B Instruct\footnotemark[5] & Mistral & --- & 0.000 (10) & 0.000 (10) & --- & --- & Format-invariant \\ 710 \midrule 711 \multicolumn{8}{l}{\textbf{API-channel-only (6 models)}} \\ 712 Claude Opus 4.6 & Anthropic & 0.000 (10) & 1.000 (10) & 1.000 (10) & $<$0.001 & $<$0.001 & API-channel-only \\ 713 Claude Sonnet 4.6 & Anthropic & 0.000 (10) & 0.900 (10) & 0.000 (10) & $<$0.001 & $<$0.001 & API-channel-only \\ 714 Claude 3.5 Haiku & Anthropic & 0.000 (15) & 1.000 (15) & 1.000 (15) & $<$0.001 & $<$0.001 & API-channel-only \\ 715 Hunyuan A13B & Tencent & 0.000 (10) & 1.000 (10) & 1.000 (10) & $<$0.001 & $<$0.001 & API-channel-only \\ 716 GPT-4.1 
Mini & OpenAI & 0.000 (40) & 0.875 (40) & 0.975 (40) & $<$0.001 & $<$0.001 & API-channel-only \\ 717 GLM-4.5 & Zhipu & 0.000 (10) & 1.000 (8) & 0.000 (10) & $<$0.001 & $<$0.001 & API-channel-only \\ 718 \midrule 719 \multicolumn{8}{l}{\textbf{Text-channel (6 models)}} \\ 720 Claude Sonnet 4 & Anthropic & 1.000 (10) & 0.000 (10) & 0.000 (10) & $<$0.001 & $<$0.001 & Text-channel \\ 721 o3-mini & OpenAI & 0.955 (44) & 0.200 (45) & --- & $<$0.001 & $<$0.001 & Text-channel \\ 722 Jamba Large & AI21 & 1.000 (10) & 0.000 (10) & --- & $<$0.001 & $<$0.001 & Text-channel \\ 723 Llama 4 Maverick & Meta & 0.800 (10) & 0.000 (10) & 0.167 (12) & $<$0.001 & 0.003 & Text-channel \\ 724 GLM-5 & Zhipu & 0.882 (17) & 0.800 (10) & 0.091 (11) & 0.613 & 1.000 & Text-channel \\ 725 Qwen 2.5 7B\footnotemark[5] & Alibaba & 0.600 (10) & 0.000 (10) & 0.000 (10) & 0.011 & 0.033 & Text-channel \\ 726 \midrule 727 \multicolumn{8}{l}{\textbf{Stochastic (7 models)}} \\ 728 Claude 3.7 Sonnet & Anthropic & 0.500 (10) & 0.040 (25) & 0.000 (20) & 0.004 & 0.015 & Stochastic \\ 729 Llama 4 Scout & Meta & 0.600 (15) & 0.933 (15) & 0.267 (15) & 0.080 & 0.240 & Stochastic \\ 730 Hunyuan T1 & Tencent & 0.400 (5) & 0.667 (9) & --- & 0.580 & 1.000 & Stochastic \\ 731 Command R+ & Cohere & 0.500 (20) & 1.000 (10) & 0.417 (12) & 0.038 & 0.120 & Stochastic \\ 732 Gemma 3 27B & Google & --- & 0.400 (20) & 0.400 (15) & --- & --- & Stochastic \\ 733 Gemini 2.5 Pro & Google & 1.000 (10) & 0.593 (27) & 0.182 (11) & 0.018 & 0.059 & Stochastic \\ 734 R1-Distill-Llama-70B & DeepSeek & --- & 0.300 (10) & --- & --- & --- & Stochastic \\ 735 \midrule 736 \multicolumn{8}{l}{\textbf{Tool-incompatible (3 models)}} \\ 737 R1-Distill-Qwen-32B & DeepSeek & --- & 1.000 (16) & --- & --- & --- & Tool-incompatible \\ 738 Gemma 2 27B & Google & --- & 0.933 (15) & 1.000 (15) & --- & --- & Tool-incompatible \\ 739 Step Flash & StepFun & 1.000 (10) & --- & --- & --- & --- & Tool-incompatible \\ 740 \bottomrule 741 \end{tabular} 742 \end{table*} 743 \footnotetext[5]{Tested via local MLX inference (Apple Silicon) rather than cloud API, demonstrating $D_1$ metric generalizability across inference providers.} 744 745 \subsection{Key Findings from N$\geq$30 Campaign} 746 747 Five key findings emerge: (1) \textbf{Pervasive dead trials}: 71\% of raw N=30 trials were dead (saves=0 AND items\_recalled=0), inflating D1 toward 1.0 through API failures. (2) \textbf{Binary to spectrum}: After filtering, the five-cluster taxonomy reveals finer structure---7 stochastic models occupy the $0.04$--$0.93$ range, replacing the binary narrative. (3) \textbf{Format sensitivity survives filtering}: 10 models show significant format sensitivity after FDR correction ($q < 0.05$). (4) \textbf{Asymmetric cluster shifts}: Of 9 models that shifted cluster between N=5 and N=30 analyses, 6 shifted toward externalization (lower D1), suggesting N=5 overestimates internalization. (5) \textbf{N=5 directionally correct}: 83\% of model$\times$format cells show $|\Delta| \leq 0.10$ vs.\ N=5 baseline, validating the original study's direction if not precision. 748 749 \section{Prompt Sensitivity Study (Expanded)} 750 \label{app:prompt_expanded} 751 752 \begin{table}[h] 753 \centering 754 \small 755 \caption{Expanded prompt sensitivity ($N=30$--$73$ per model). 
The externalization boundary is prompt-robust for 4/6 models.}
\label{tab:prompt_expanded}
\begin{tabular}{lcccccc}
\toprule
\textbf{Model} & \textbf{$N$} & \textbf{Neutral} & \textbf{Encour.} & \textbf{Discour.} & \textbf{Spread} & \textbf{Pattern} \\
\midrule
GPT-4o & 51 & 0.000 & 0.000 & 0.000 & 0.000 & Prompt-invariant \\
Gemini 2.0 Flash & 73 & 0.000 & 0.000 & 0.000 & 0.000 & Prompt-invariant \\
DeepSeek V3 & 30 & 0.000 & 0.000 & 0.000 & 0.000 & Prompt-invariant \\
Claude Sonnet 4 & 35 & 0.000 & 0.000 & 1.000 & 1.000 & Prompt-responsive \\
o3-mini & 60 & 1.000 & 1.000 & 1.000 & 0.000 & Prompt-invariant \\
Llama 4 Scout & 37 & 1.000 & 1.000 & 0.800 & 0.200 & Borderline \\
\bottomrule
\end{tabular}
\end{table}

\textbf{Key findings}: (1) The externalization boundary is prompt-robust for 4/6 models at $N=30$--$73$. (2) Claude Sonnet 4 is uniquely prompt-responsive ($D_1$ spread $= 1.0$)---the only model exhibiting complete strategy switching, replicated across 3 independent runs. (3) Binary behavior persists at expanded $N$.

\section{Activation Probing and Mechanistic Predictions}
\label{app:activation}

\subsection{Testable Predictions}

Each mechanistic hypothesis (Section~\ref{sec:discussion}) generates distinct predictions requiring access that is currently unavailable to external researchers.

\begin{table}[h]
\caption{Mechanistic predictions summary.}
\label{tab:predictions}
\centering
\small
\begin{tabular}{lp{2.8cm}p{2.2cm}}
\toprule
\textbf{Hyp.} & \textbf{Key Prediction} & \textbf{Data Required} \\
\midrule
A & LoRA on 1K--5K demos crosses boundary & Training access \\
B & Reward models prefer tool-using responses & Reward model access \\
C & Ablating principles degrades \(D_1\) only & Pipeline access \\
D & Format-dependent attractor depth predicts switching & Extended format data \\
D (mech.) & Late-layer activation norms diverge more in full-pipeline models & MLX activation probing (Table~\ref{tab:activation_probing}) \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Preliminary Mechanistic Evidence: Activation Probing}

To provide initial mechanistic evidence for the competing attractor hypothesis (Hypothesis~D), we conducted activation probing \citep{belinkov2022probing} on seven open-weight models using MLX on Apple Silicon (M3 Ultra, 256GB). For each model, we captured per-layer hidden state norms on the \(D_1\) probe stimulus (10-item factual recall with tool access) and compared activations between tool-present and tool-absent conditions. Divergence is computed per layer as \(\frac{|\|h_{\text{with}}\| - \|h_{\text{without}}\||}{(\|h_{\text{with}}\| + \|h_{\text{without}}\|)/2} \times 100\), where \(h_{\text{with}}\) and \(h_{\text{without}}\) denote that layer's hidden state in the tool-present and tool-absent conditions; we report the layer with the largest divergence and its magnitude.

\begin{table}[h]
\caption{Activation divergence across seven open-weight models.
All models show peak divergence in late layers (56--97\% depth), consistent with the tool delegation decision occurring after early feature extraction.} 804 \label{tab:activation_probing} 805 \centering 806 \small 807 \begin{tabular}{llcccc} 808 \toprule 809 \textbf{Model} & \textbf{\(D_1\) Group} & \textbf{Layers} & \textbf{Peak} & \textbf{Peak $\Delta$ (\%)} & \textbf{Position} \\ 810 \midrule 811 Llama 3 8B Instruct & A (\(D_1{=}0\)) & 32 & 31 & 32.8 & 97\% \\ 812 Llama 3.3 70B Instruct & A (\(D_1{=}0\)) & 80 & 49 & 14.4 & 61\% \\ 813 Phi 3.5 Mini & A (\(D_1{=}0\)) & 32 & 18 & 3.2 & 56\% \\ 814 Qwen 2.5 7B Instruct & B (\(D_1{=}1\)) & 28 & 19 & 21.3 & 68\% \\ 815 Qwen 2.5 3B Instruct & B (\(D_1{=}1\)) & 36 & 29 & 17.6 & 81\% \\ 816 Llama 3.2 3B Instruct & B (\(D_1{=}1\)) & 28 & 17 & 12.5 & 61\% \\ 817 Mistral 7B v0.3 Instruct & B (\(D_1{=}1\)) & 32 & 29 & 6.7 & 91\% \\ 818 \bottomrule 819 \end{tabular} 820 \end{table} 821 822 Three patterns emerge. First, \textbf{all seven models show late-layer divergence}: peak activation norm differences occur at 56--97\% of network depth. Second, \textbf{divergence magnitude is heterogeneous}: the 10$\times$ range (3.2--32.8\%) suggests architecture and training, not just scale, determine differentiation strength. Third, \textbf{scale-dependent attenuation}: Llama 3 8B shows 32.8\% divergence at layer 31/32, while the 70B variant shows 14.4\% at layer 49/80---consistent with larger models distributing the tool delegation computation across more layers. 823 824 These results are preliminary (7 models, single probe stimulus, activation norms rather than causal interventions) and should be interpreted as motivation for mechanistic interpretability, not evidence for any specific hypothesis. 825 826 \subsection{Logit Lens Analysis} 827 \label{app:logit_lens} 828 829 To understand \emph{when} the tool delegation decision forms during processing, we applied the logit lens technique \citep{nostalgebraist2020logitlens}: at each transformer layer, the residual stream is projected through the final LayerNorm and unembedding matrix to obtain a vocabulary distribution. We tracked the probability of the \texttt{<tool\_call>} token---the text-XML tool invocation marker present in the system prompt---across all layers at the decision point (the last input token before generation begins). 830 831 \begin{table}[h] 832 \caption{Logit lens: \texttt{<tool\_call>} token probability across layers for six open-weight models. 
Peak probability does \emph{not} predict $D_1$ outcome---three of four internalizers form strong tool-call representations that they do not act upon.} 833 \label{tab:logit_lens} 834 \centering 835 \small 836 \begin{tabular}{llcccc} 837 \toprule 838 \textbf{Model} & \textbf{$D_1$} & \textbf{Layers} & \textbf{Peak Prob} & \textbf{Peak} & \textbf{Final Prob} \\ 839 \midrule 840 \multicolumn{6}{l}{\textbf{Group A ($D_1{=}0$, externalizers)}} \\ 841 Phi 3.5 Mini & 0 & 32 & 0.01\% & L11 (34\%) & 0.00\% \\ 842 Llama 3 8B & 0 & 32 & 2.64\% & L29 (91\%) & 0.33\% \\ 843 \midrule 844 \multicolumn{6}{l}{\textbf{Group B ($D_1{=}1$, internalizers)}} \\ 845 Qwen 2.5 3B & 1 & 36 & $<$0.01\% & --- & 0.00\% \\ 846 Mistral 7B & 1 & 32 & 29.4\% & L30 (94\%) & 22.3\% \\ 847 Llama 3.2 3B & 1 & 28 & 97.2\% & L25 (89\%) & 73.3\% \\ 848 Qwen 2.5 7B & 1 & 28 & 98.8\% & L27 (96\%) & 98.8\% \\ 849 \bottomrule 850 \end{tabular} 851 \end{table} 852 853 \textbf{Key finding: the externalization boundary is a cycle-completion gate, not a tool-initiation gate.} Three of four $D_1{=}1$ models (Mistral 7B, Llama 3.2 3B, Qwen 2.5 7B) form strong \texttt{<tool\_call>} representations (peak 29--99\%) and maintain them at the final layer (22--99\%). These models will \emph{initiate} tool calls but fail to complete the externalization-retrieval-recall cycle that $D_1$ measures. Only Qwen 2.5 3B is a ``representation-absent'' internalizer that never forms tool-call tokens. 854 855 Conversely, the two $D_1{=}0$ models show low \texttt{<tool\_call>} probability ($\leq$2.6\%) even at their peak layer, yet successfully complete the full externalization cycle in practice. This suggests that reliable tool delegation depends on multi-turn execution coherence, not on the strength of single-token tool-call representations. 856 857 \textbf{Implications for Hypothesis D (competing attractors).} The data splits $D_1{=}1$ models into two mechanistic categories: (1) \emph{representation-absent} (Qwen 3B)---the tool-use attractor is absent from the residual stream; (2) \emph{representation-present, cycle-incomplete} (Mistral 7B, Llama 3.2 3B, Qwen 7B)---the tool-use attractor exists but the model fails to sustain it across the multi-turn retrieval sequence. Format-dependent basin depth (Hypothesis D) may explain why presentation format can shift models between these categories. 858 859 \subsection{Activation Patching} 860 \label{app:patching} 861 862 To establish \emph{causal} evidence for the layer localization suggested by probing and logit lens, we performed activation patching between base and instruct variants of the same architecture for three model families. For each pair, we ran two directions: 863 864 \begin{description} 865 \item[Instruct$\to$Base:] Run the base model but replace its residual stream at layer $L$ with the instruct model's output at layer $L$. Measures which layers carry the tool delegation computation. 866 \item[Base$\to$Instruct:] Run the instruct model but replace its residual stream at layer $L$ with the base model's output. Measures which layers are necessary for tool delegation. 867 \end{description} 868 869 \textbf{Llama~3~8B ($D_1{=}0$, externalizer).} Instruct$\to$Base patching reveals a broad activation zone: injecting instruct activations at layers 12--20 (37--63\% depth) increases \texttt{<tool\_call>} probability from a 0.11\% baseline to 4--9.4\%, with a peak at \textbf{layer~18 (56\% depth, effect $= +9.3$ pp)}. 
Base$\to$Instruct effects are minimal ($<$0.9\% peak)---the base model never forms tool-call representations. 870 871 \textbf{Mistral~7B ($D_1{=}1$, internalizer).} Instruct$\to$Base transfer is \emph{stronger} than for the externalizer: peak at \textbf{layer~23 (72\% depth, effect $= +34.8$ pp)}, monotonically increasing from layers 3--23. The tool delegation computation exists and is more transferable than in Llama~8B. However, Base$\to$Instruct reveals an early-layer suppression mechanism: replacing instruct layer~0 with base \emph{increases} tool-call probability from 22\% to 85\% ($+63$ pp), while replacing any other layer uniformly suppresses it to $\sim$1\% ($-22$ pp). The instruct model's first layer actively inhibits the tool-use attractor that its later layers would otherwise produce. 872 873 \textbf{Qwen~2.5~7B ($D_1{=}1$, internalizer).} The pattern reverses strikingly. Base$\to$Instruct patching \emph{at every layer} suppresses the instruct model's strong 98.8\% \texttt{<tool\_call>} representation to $<$1\%, with middle layers 12--14 (43--50\% depth) showing slight resistance (9.6\% survives). Instruct$\to$Base produces negligible effect (peak $+0.002\%$)---the instruct model's tool representations do not transfer. 874 875 \textbf{Three mechanistic architectures.} The three pairs reveal distinct structural patterns for the externalization boundary: 876 877 \begin{enumerate} 878 \item \emph{Localized computation} (Llama~8B, $D_1{=}0$): Tool delegation is concentrated in middle layers (37--63\% depth), transferable, and expressed through a two-phase pattern---the decision \emph{forms} at $\sim$56\% depth (patching peak) but \emph{manifests} at $\sim$91\% depth (logit lens peak, consistent with \citealt{meng2022locating}). 879 \item \emph{Suppression-gated computation} (Mistral~7B, $D_1{=}1$): Tool delegation computation exists and is \emph{stronger} than in the externalizer ($+34.8$ vs.\ $+9.3$ pp transfer), but an early-layer gate (layer~0) actively suppresses it. The model has learned to inhibit tool use despite having the computational capacity for it. 880 \item \emph{Non-transferable representation} (Qwen~7B, $D_1{=}1$): Tool-call tokens reach 98.8\% probability but are architecture-bound---they do not transfer to the base model. The representation is distributed across the entire network rather than concentrated in specific layers. 881 \end{enumerate} 882 883 This taxonomy refines the logit lens ``cycle-completion gate'' finding: the externalization boundary is not a single mechanism but admits at least three mechanistic variants. For deployment, the practical implication is the same ($D_1$ predicts behavior), but for steering and fine-tuning interventions, the distinction matters: suppression-gated models (type~2) may be more amenable to behavioral modification than non-transferable ones (type~3). 884 885 \subsection{Sparse Autoencoder Feature Decomposition} 886 \label{app:sae} 887 888 To decompose the tool delegation computation into interpretable features, we applied Goodfire's open-source sparse autoencoder (SAE) for Llama~3.1~8B~Instruct at layer~19 (65,536 latents, L0$=$91) to the $D_1$ probe stimulus. We compared SAE feature activations between tool-present and tool-absent conditions on the same factual recall task. 889 890 \textbf{Extreme sparsity.} Of 65,536 SAE features, only 45 activate for the tool condition and 48 for the no-tool condition ($<$0.07\% each). 
The tool delegation signal is concentrated in a remarkably small feature set: 40 features (0.06\%) show differential activation $>$0.1, and only 6 features (0.01\%) exceed 0.5 differential. 891 892 \textbf{Dominant tool feature.} The strongest tool-enhanced feature (\#58843) activates at 1.88 in the tool condition versus 0.004 without tools---a 498$\times$ differential. This single feature accounts for more activation variance than the next five tool-enhanced features combined. 893 894 \textbf{Compositional structure.} Tool delegation is not unitary but decomposes into a sparse set of co-activating features. The 22 tool-only features (active only in the tool condition) and 25 no-tool-only features represent monosemantic computations that switch on or off based on tool availability. Only 23 features are active in both conditions, suggesting minimal overlap between the tool-use and non-tool-use computational pathways at this layer. 895 896 \textbf{Implications.} The SAE decomposition complements the activation patching findings (Section~\ref{app:patching}): while patching shows that tool delegation is localized to specific layers, the SAE shows that \emph{within} those layers, the computation is further concentrated in a sparse set of interpretable features. The dominant feature (\#58843) represents a candidate monosemantic ``tool-use neuron'' whose activation is necessary for tool delegation. Future work could test this by clamping this feature during inference. 897 898 \subsection{Steering Vectors} 899 \label{app:steering} 900 901 To test whether tool delegation can be \emph{controlled} via targeted intervention, we applied the steering vector methodology of \citet{li2024inference} and \citet{turner2023activation} to tool delegation. We extracted a ``tool-use direction'' from each base/instruct pair and applied it as a steering vector during inference. For each pair, we computed $\mathbf{d} = \mathbf{h}_{\text{instruct}} - \mathbf{h}_{\text{base}}$ at the critical layer identified by activation patching (Section~\ref{app:patching}), then ran inference with $\mathbf{h}_L \leftarrow \mathbf{h}_L + \alpha \cdot \mathbf{d}$ for $\alpha \in \{-3, -2, -1, -0.5, 0, +0.5, +1, +2, +3\}$. 902 903 \textbf{Llama~3~8B ($D_1{=}0$, externalizer, layer~18).} Negative steering reliably suppresses tool-call probability: from a 0.33\% baseline to $<$0.01\% at $\alpha \leq -2.0$. Positive steering produces a non-monotonic response, peaking at $\alpha{=}+2.0$ (5.8\%, $+5.5$ pp) before collapsing at $\alpha{=}+3.0$ (0.5\%) as generation degrades to repetition. Notably, the model generates syntactically valid \texttt{<tool\_call>} text at $\alpha{=}0$ and $\alpha{=}+1.0$ despite $<$0.5\% token probability---suggesting that the decision to externalize involves more than initial-token prediction. 904 905 \textbf{Mistral~7B ($D_1{=}1$, internalizer, layer~23).} Negative steering cleanly suppresses tool-call probability from 22.2\% to 0.015\% at $\alpha{=}-2.0$---a 1,480$\times$ reduction. However, positive steering does \emph{not} enhance tool use. Instead, it produces a U-shaped response: $\alpha{=}+0.5$ reduces probability to 9.1\%, $\alpha{=}+1.0$ to 2.7\%, and $\alpha{=}+3.0$ to 0.30\%. Both directions of perturbation suppress the behavior, consistent with the suppression-gated architecture identified in Section~\ref{app:patching}: the instruct--base direction at layer~23 disrupts the computation regardless of sign. 
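For reference, the single-layer intervention \(\mathbf{h}_L \leftarrow \mathbf{h}_L + \alpha \cdot \mathbf{d}\) can be sketched as follows. This is an illustrative PyTorch/Transformers version rather than the MLX pipeline used for the reported runs; the model identifiers, layer index, and probe prompt are placeholders.

{\small
\begin{verbatim}
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B"               # placeholder identifiers
INSTRUCT = "meta-llama/Meta-Llama-3-8B-Instruct"
LAYER, ALPHA = 18, 2.0                            # critical layer from patching

tok = AutoTokenizer.from_pretrained(INSTRUCT)
base = AutoModelForCausalLM.from_pretrained(BASE)
inst = AutoModelForCausalLM.from_pretrained(INSTRUCT)

prompt = "..."   # D1 probe stimulus (10 items + tool schema), elided here
ids = tok(prompt, return_tensors="pt").input_ids

def last_token_hidden(model, layer):
    # hidden_states[0] is the embedding output, so layer L is index L + 1
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[layer + 1][0, -1, :]

# Tool-use direction: instruct minus base residual at the critical layer
d = last_token_hidden(inst, LAYER) - last_token_hidden(base, LAYER)

def steer(module, inputs, output):
    # Decoder layers return a tuple; element 0 is the residual stream
    hidden = output[0]
    hidden[:, -1, :] = hidden[:, -1, :] + ALPHA * d.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = inst.model.layers[LAYER].register_forward_hook(steer)
with torch.no_grad():
    gen = inst.generate(ids, max_new_tokens=64, do_sample=False)
handle.remove()
print(tok.decode(gen[0, ids.shape[1]:], skip_special_tokens=True))
\end{verbatim}
}

Sweeping \(\alpha\) over \(\{-3, \dots, +3\}\) with a hook of this form is how the per-model curves in this subsection were generated, in our case via the analogous MLX implementation.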
906 907 \textbf{Qwen~2.5~7B ($D_1{=}1$, internalizer, layer~14).} Steering barely perturbs the 98.8\% baseline. Negative alphas produce only modest suppression: $\alpha{=}-3.0$ reduces probability to 88.7\% ($-10.1$ pp)---an order of magnitude weaker than the 1,480$\times$ reduction achieved for Mistral~7B. Positive alphas have near-zero effect ($<$$+0.4$ pp) until $\alpha{=}+3.0$, where probability crashes to 13.3\% as generation degrades. The model generates tool-call text at \emph{every} alpha value, including $\alpha{=}+3.0$. This confirms the non-transferable architecture from Section~\ref{app:patching}: the tool-call representation is distributed across all layers, making single-layer steering ineffective. 908 909 \textbf{Interpretation.} The three-model steering comparison aligns precisely with the mechanistic taxonomy from activation patching: 910 911 \begin{enumerate} 912 \item \emph{Localized computation is steerable asymmetrically} (Llama~8B): Negative steering suppresses tool-call probability 43$\times$ (0.33\%$\to$0.008\%), but positive steering achieves only modest enhancement ($+5.5$ pp). The computation is concentrated enough to disrupt, but enhancement requires coordination that single-layer intervention cannot provide. 913 \item \emph{Suppression-gated computation is fragile in both directions} (Mistral~7B): Both positive and negative perturbation suppress tool use (U-shaped), with negative achieving 1,480$\times$ reduction. The gating mechanism at layer~0 makes the system brittle to any perturbation of the later-layer computation. 914 \item \emph{Non-transferable representation resists steering} (Qwen~7B): Even extreme steering ($\alpha{=}-3.0$) produces only $-10.1$ pp reduction from a 98.8\% baseline. The computation is too distributed for single-layer intervention. 915 \item \emph{No boundary crossing.} No model crosses the $D_1$ bimodal boundary under steering. The externalization boundary is not a single direction that can be ``flipped''---it is a multi-layer computation that resists linear intervention. 916 \end{enumerate} 917 918 This negative result strengthens the paper's central finding: the externalization boundary is a robust structural property of model training, not an artifact of surface-level token preferences that could be trivially manipulated. For alignment, this robustness is encouraging---behavioral discipline, once established through training, is resistant to simple adversarial perturbation. 919 920 \section{Mixed-Effects Model and Dimension Correlation} 921 \label{app:mixed_effects} 922 923 To address the concern that models from the same laboratory share training infrastructure---violating the independence assumption of the naive effect size estimate---we fit a mixed-effects model: \texttt{hubris $\sim$ is\_flagship + (1|lab)} using REML via \texttt{statsmodels} MixedLM. Laboratory identity was modelled as a random intercept across 11 labs and 31 models. 924 925 \textbf{Primary result.} The fixed effect for flagship status was $\hat{\beta} = -0.421$ ($SE = 0.035$, $z = -11.91$, $p < 0.001$, 95\% CI $[-0.490, -0.351]$). For comparison, the naive Cohen's $d = -4.38$ ($p < 0.001$). The estimated intra-class correlation was $\mathrm{ICC} = 0.463$ (note: four of eleven labs contribute only one model, placing the random-effects variance on the boundary of the parameter space; the fixed effect is stable across multiple optimizers). The flagship effect remains robust after accounting for lab-level clustering. 
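A minimal sketch of this specification in \texttt{statsmodels} follows; the dataframe layout and column names (\texttt{hubris}, \texttt{is\_flagship}, \texttt{lab}) are assumptions matching the description above, and the last two lines compute the intra-class correlation from the fitted variance components.

{\small
\begin{verbatim}
import pandas as pd
import statsmodels.formula.api as smf

# One row per model; filename and column names are assumed, not canonical
df = pd.read_csv("discipline_scores.csv")

# hubris ~ is_flagship with a random intercept per laboratory, fit by REML
m = smf.mixedlm("hubris ~ is_flagship", df, groups=df["lab"])
fit = m.fit(reml=True)
print(fit.summary())

# ICC = lab-level variance / (lab-level variance + residual variance)
lab_var = fit.cov_re.iloc[0, 0]
print(f"ICC = {lab_var / (lab_var + fit.scale):.3f}")
\end{verbatim}
}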
926 927 \textbf{Sensitivity.} Excluding Phi-4 (missing $D_1$): $\hat{\beta} = -0.443$ ($SE = 0.031$, $p < 0.001$). Excluding reasoning-specialized models (o3-mini, R1): $\hat{\beta} = -0.402$ ($SE = 0.020$, $p < 0.001$). Neither exclusion changes the conclusion. 928 929 \textbf{Dimension correlation.} $D_1$ and $D_2$ are moderately correlated (Pearson $r = 0.644$, $p < 0.001$; Spearman $\rho = 0.635$, $p < 0.001$; $N = 30$). This reflects a shared underlying construct: models that skip tool delegation ($D_1{=}1$) also tend to show overconfidence ($D_2 > 0$). The composite hubris score intentionally captures this covariance---it measures overall deployment risk, not independent factors. A factor-analytic decomposition is left for future work with larger model samples. 930 931 \section{Practical Screening Protocol} 932 \label{app:screening} 933 934 For practitioners evaluating new models for agentic deployment, we propose a tiered screening protocol based on our empirical findings: 935 936 \begin{description} 937 \item[Tier 1 (Quick Screen, $\sim$10 API calls):] Run the $D_1$ probe with $N{=}5$ trials using the deployment format (native API or text). If all 5 trials yield $D_1{=}0.000$, the model is very likely a reliable externalizer. If all 5 yield $D_1{=}1.000$, it is a reliable internalizer. In our data, $N{=}5$ correctly classifies 83\% of model$\times$format cells within $|\Delta| \leq 0.10$ of the $N{=}30$ result. 938 \item[Tier 2 (Format Check, $\sim$20 API calls):] If Tier 1 shows mixed results (0 $<$ mean $D_1$ $<$ 1), or if the deployment format differs from the tested format, extend to $N{=}10$ across both native API and text formats. This catches format-sensitive models (28\% of our sample). 939 \item[Tier 3 (Full Profile, $\sim$50 API calls):] For critical deployments, run the full 5-dimension probe ($D_1$--$D_5$, $N{=}5$) to assess composite discipline. This catches reasoning-paradox models that externalize but show high overconfidence ($D_2$) or poor tool metacognition ($D_3$). 940 \end{description} 941 942 \textbf{Cost estimate:} Tier 1 costs $<$\$0.50 for most models at current API pricing (5 trials $\times$ $\sim$2K tokens/trial). The full 44-model $\times$ 3-format $\times$ $N{=}30$ campaign cost approximately \$2,400. 943 944 \end{document}