\documentclass{article}
\usepackage[submission]{colm2026_conference}
\usepackage{microtype}
\usepackage{hyperref}
\usepackage{url}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{multirow}
\usepackage{xcolor}
\usepackage{placeins}

\title{The Externalization Boundary: Cross-Laboratory Behavioral Discipline and Format Sensitivity Across 48 Models from 22 Laboratories}

\author{Robert F.\ Cerf \\
Le Cerf Inc.; Princeton University (B.A.) \\
\texttt{rick@lecerf.com}}

\begin{document}

\maketitle

\begin{abstract}
State externalization---whether a model delegates working memory to persistent tools---is bimodal at the trial level across 48 models from 22 laboratories in native API format: every individual trial yields either complete externalization (\(D_1=0.000\)) or complete internalization (\(D_1=1.000\)), with no intermediate values across 2,101 valid observations. Extended replication ($N=10$--$45$ per model) reveals that 15 models exhibit probabilistic switching between these discrete strategies, producing non-zero mean \(D_1\) at the model level that was invisible at lower sample sizes. Ten independent laboratories using different architectures, different training data, and different alignment methods (RLHF, Constitutional AI, DPO, RLAIF) have converged on uniformly zero externalization in their flagship models---a convergent behavioral evolution that was not coordinated, not benchmarked, and not previously measured. While shared architectural paradigms and overlapping training methodologies could partially account for this convergence, the 7$\times$ within-lab divergence between flagship and budget models (sharing architecture and data) suggests post-training, not pre-training, as the dominant driver. We call this the \emph{externalization boundary}: a binary deployment gate that separates models receiving full-pipeline post-training from those with reduced or specialized training, invisible to all existing capability benchmarks.

We introduce a five-dimension behavioral discipline framework and apply it to 31 models across 11 laboratories, extended with format-sensitivity testing across the full 48-model sample. Five additional findings emerge: (1) reasoning-specialized models exhibit catastrophic discipline degradation (DeepSeek R1: \(D_2=0.975\) overconfidence) with heterogeneous failure profiles; (2) distillation selectively degrades instruction adherence while preserving tool delegation; (3) flagship models converge to a narrow discipline band (mean hubris: 0.063, within-lab gaps 7$\times$ larger than cross-lab variance); (4) 32\% of testable models (14 of 44) show significant format sensitivity after FDR correction, with five behavioral clusters from format-invariant to tool-incompatible; (5) natural ablation across six model families reveals distillation as the primary destroyer of behavioral discipline and reasoning training effects as lab-specific rather than universal. The boundary is format-parameterized: tool presentation format determines which models cross it. No existing benchmark captures either the boundary or its format dependence.
\end{abstract}

%============================================================================
\section{Introduction}
\label{sec:intro}

Existing evaluations (MMLU, HumanEval, GPQA, ARC) measure what a model \emph{can} do; none systematically measure what it \emph{will} do reliably when deployed. We argue that capability without discipline---the consistent application of learned behaviors under varying conditions---is the defining failure mode of agentic AI. Three concrete examples: (1) o3-mini outperforms GPT-4o on reasoning but creates 7$\times$ higher operational distortion through tool avoidance; (2) cost-optimizing to Flash Lite crosses the externalization boundary, silently breaking tool workflows; (3) GPT-4o, Claude Sonnet 4, and Gemini 2.5 Pro show near-identical discipline despite divergent capability scores. Across 84 agentic AI papers, technical performance appears in 83\% while human-centered evaluation appears in only 30\% \citep{AgenticImbalance2025}.

We introduce the concept of a \emph{behavioral distortion field}: a measurable bias envelope that training methodology imprints on operational behavior. Most strikingly, ten independent laboratories have converged on identical state externalization behavior in their flagship models---a convergent behavioral evolution that was not coordinated, not benchmarked, and not previously measured.

We make five contributions: (1) \textbf{The externalization boundary as a deployment gate}: a binary discriminator separating models safe for tool-dependent workflows from those that silently fail; full-pipeline post-training predicts externalization, model tier does not. (2) \textbf{Quantified discipline tax}: budget models cost mean 0.442 hubris increase; reasoning models cost 7--17$\times$ degradation. (3) \textbf{Convergent post-training discipline}: ten labs converge on near-identical flagship discipline (range: 0.027--0.087). (4) \textbf{Format sensitivity}: 32\% of testable models (14 of 44) show significant format sensitivity after FDR correction, with five behavioral clusters. (5) \textbf{Natural ablation}: across six model families sharing base architectures, distillation catastrophically destroys behavioral discipline ($D_1{=}1.0$ vs.\ source model $D_1{\approx}0.07$), while reasoning training effects are lab-specific rather than universal.

We study 31 models across 11 labs (extended to 48 models, 22 labs for format sensitivity) accessible via API and local inference as of February 2026. Our three claims---that post-training predicts discipline, that externalization marks a binary boundary, and that this boundary is format-parameterized---are observational (confounded with scale, compute, and pipeline variables). Individual trials are binary, but 15 models show probabilistic switching at $N{=}10$--$45$.

%============================================================================
\section{Related Work}
\label{sec:related}

RLHF \citep{Christiano2017, Ouyang2022}, Constitutional AI \citep{Bai2022}, and DPO \citep{Rafailov2023} freeze alignment at training time; \citet{Casper2023} document RLHF's open problems. Our cross-lab findings show that nominally similar pipelines produce divergent discipline profiles. On calibration, \citet{Kadavath2022} showed pre-training calibration improves with scale while RLHF introduces overconfidence \citep{OpenAI2023}; our $D_2$ captures this as one of five complementary dimensions.

On tool use, \citet{Patil2025} benchmark tool calling \emph{correctness} across formats; we measure \emph{propensity}---whether models choose to invoke tools when capable of doing so correctly (o3-mini achieves 10/10 recall without tools, demonstrating that propensity and correctness can fully dissociate). \citet{Johnson2025} and \citet{Tam2024} showed format affects accuracy by up to 27pp; we extend this to propensity (Section~\ref{sec:format}). IFEval \citep{Zhou2023} and FollowBench \citep{Jiang2023} measure instruction adherence at fixed complexity; we measure degradation curves. \citet{Liu2024} demonstrated ``Lost in the Middle'' effects; our $D_5$ data suggests this is resolved in current models. On distillation \citep{Hinton2015, Gu2024}, we show it preserves hard constraints while degrading soft ones. HELM \citep{Liang2022}, AgentBench \citep{Liu2024AgentBench}, TruthfulQA \citep{Lin2022}, and sycophancy studies \citep{Sharma2023} each address single dimensions; our framework provides cross-laboratory, multi-dimensional evaluation for agentic deployment.

%============================================================================
\section{The Distortion Field: Methodology}
\label{sec:methodology}

\subsection{The Capability Hubris Framework}
\label{sec:hubris_framework}

\textbf{Definition}: The hubris score
\begin{equation}
H = \frac{1}{|\mathcal{D}|}\sum_{d \in \mathcal{D}} D_d(m)
\label{eq:hubris}
\end{equation}
where each dimension \(D_d(m) \in [0, 1]\) and higher values indicate worse discipline. A perfectly disciplined model scores 0.0 regardless of capability level. The composite measures \emph{how} a model operates, not \emph{what} it outputs---while capability (measured by published benchmark composites) and discipline show a moderate negative correlation (Pearson $r=-0.42$, $p<0.03$, $N=25$), this is confounded by shared full-pipeline post-training convergence (Section~\ref{sec:d2d5}).
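As a worked example, using the o3-mini dimension scores reported later in Table~\ref{tab:reasoning}:

```latex
\begin{equation*}
H_{\text{o3-mini}}
  = \tfrac{1}{5}\bigl(1.000 + 0.005 + 0.750 + 0.160 + 0.000\bigr)
  = 0.383,
\end{equation*}
```

matching the composite hubris reported for o3-mini in Section~\ref{sec:d2d5}.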

\subsection{Dimension Definitions}
\label{sec:dimensions}

Five dimensions capture largely distinct aspects of operational discipline; the strongest pairwise correlation is $D_1$--$D_2$ (Pearson $r=0.64$, $p<0.001$; see Appendix~\ref{app:mixed_effects}). Each dimension $D_i \in [0,1]$; higher values indicate worse discipline. Full probe specifications and metrics are in Appendix~\ref{app:dimension_specs}.

\begin{table}[t]
\caption{Behavioral discipline dimensions. All dimensions measure \emph{operational} properties orthogonal to capability benchmarks.}
\label{tab:dimensions}
\centering
\small
\begin{tabular}{lll}
\toprule
\textbf{Dim.} & \textbf{Measures} & \textbf{Metric} \\
\midrule
$D_1$ & State externalization & $1 - (\text{tool\_use} \times \text{recall})$ \\
$D_2$ & Overconfidence & $\max(0, \text{confidence} - \text{accuracy})$ \\
$D_3$ & Tool use discipline & $1 - \text{tool\_use\_when\_should}$ \\
$D_4$ & Instruction adherence & $\max(0, \text{adherence}_{L1} - \text{adherence}_{L5})$ \\
$D_5$ & Context sensitivity & Lost-in-the-middle rate \\
\bottomrule
\end{tabular}
\end{table}

$D_1$ measures \emph{propensity}, not capability: o3-mini achieves 10/10 recall without tools but receives $D_1{=}1.0$ because it did not delegate \citep{Wei2022, Nye2021}. $D_2$ has a ceiling effect at current probe difficulty; $D_3$ tests metacognitive tool selection; $D_4$ measures degradation under cognitive load; $D_5$ captures position bias \citep{Liu2024}.

\subsection{Model Selection}
\label{sec:model_selection}

Models are classified as ``flagship'' or ``budget/specialized'' using pre-registered criteria: (1) marketed as primary offering, (2) not distilled, (3) not cost-optimized. \textbf{Circularity risk}: criterion (2) partially overlaps with \(D_1\) outcomes; the classification predicts but cannot establish causation.

The 31 models span Anthropic (6), OpenAI (4), Google (4), Meta (4), Mistral (4), and six additional labs, enabling within-provider longitudinal, distillation pair, cross-lab frontier, reasoning, and same-size comparisons. Open-source local models run 4-bit quantized on Apple MLX. Full catalog in Appendix~\ref{app:catalog}.

\subsubsection{Tool Support Cohort Analysis}
\label{sec:cohorts}

Models are classified into three tool support cohorts: \emph{native} (structured API function calling), \emph{text-based} (tool instructions in system prompt), and \emph{none} (reasoning models where API configurations strip tool definitions). \(D_1\) and \(D_3\) results are reported within cohorts; \(D_2\), \(D_4\), and \(D_5\) are tool-independent and comparable across all cohorts.

\subsection{Statistical Approach and Infrastructure}
\label{sec:stats}

All API models are evaluated across $N{=}5$ independent trials ($T{=}0.0$, fresh conversation state per trial); local models run 4-bit quantized on Apple MLX ($T{=}1.0$, top-$p$ 0.9). Composite hubris is the arithmetic mean of 5 dimensions. Effect sizes use Cohen's $d$; all comparisons are exploratory (no multiple-comparison corrections in the core study). \textbf{Confound}: temperature and quantization mismatch between API and local models could inflate apparent discipline differences. All code, prompts, and raw results will be released.

\subsection{Format-Sensitivity Extension (44 Models, 21 Labs)}
\label{sec:format_method}

To test whether the externalization boundary is format-specific or format-general, we conducted an expanded study probing \(D_1\) across three tool presentation formats on 48 models from 22 laboratories:

\begin{enumerate}
\item \textbf{Native API} (\texttt{tools=} parameter): Structured function definitions passed via the provider's native tool-calling interface.
\item \textbf{Text XML} (\texttt{<tool\_call>} tags in system prompt): Tool schema and invocation syntax described textually.
\item \textbf{Pythonic} (\texttt{[func()]} syntax in system prompt): Function-call notation modeled on Python syntax.
\end{enumerate}

Each model was tested with a minimum of $N=10$ independent trials per format (fresh conversation state per trial), all at temperature 0.0, with borderline and statistically significant models extended to $N=30$--$45$ trials via a targeted replication campaign (Appendix~\ref{app:replication}). The 44-model sample extends the 31-model core sample with 13 additional models from 11 new laboratories. Fisher's exact test (2$\times$2: pass/fail $\times$ native/text) tests format sensitivity per model, with Cohen's $h$ quantifying effect size, Benjamini-Hochberg FDR correction for multiple comparisons across 36 testable models, and bootstrap resampling (1,000 iterations) validating cluster stability.
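The statistical machinery here is standard; the following self-contained sketch shows the three steps (stdlib re-implementations for illustration only---the helper names are ours, and a production pipeline would typically call library routines instead):

```python
import math

def hypergeom_pmf(k, row1, row2, col1):
    """P(2x2 table with first cell k | fixed margins)."""
    return (math.comb(row1, k) * math.comb(row2, col1 - k)
            / math.comb(row1 + row2, col1))

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    kmin, kmax = max(0, col1 - row2), min(row1, col1)
    p_obs = hypergeom_pmf(a, row1, row2, col1)
    # Sum probabilities of every table at least as extreme as observed.
    return sum(p for k in range(kmin, kmax + 1)
               if (p := hypergeom_pmf(k, row1, row2, col1)) <= p_obs + 1e-12)

def cohens_h(p1, p2):
    """Effect size for two proportions; |h| is capped at pi."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def benjamini_hochberg(pvals, q=0.05):
    """Step-up BH procedure; returns which hypotheses are rejected."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k_star = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_star
    return reject
```

Note that a model passing 10/10 trials in one format and 0/10 in the other yields $p_1=1$, $p_2=0$, and hence $h=\pi$, the theoretical maximum reported for the fully format-inverted models.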

%============================================================================
\section{Results: The Cross-Laboratory Distortion Field}
\label{sec:results}

\subsection{\(D_1\)---The Externalization Boundary: Bimodality and Convergent Evolution}
\label{sec:d1_results}

State externalization (\(D_1\)) is bimodal at the trial level in native API format: across 2,101 valid observations spanning 48 models from 22 laboratories, every trial yields \(D_1=0.000\) or \(D_1=1.000\), with no intermediate values. At the model level, 15 of 48 models show probabilistic switching between these strategies at $N \geq 10$.

Every flagship model with full-pipeline post-training ($N=21$ from 10 labs) passes at \(D_1=0.000\); every reasoning-specialized and budget model fails at \(D_1=1.000\). Ten laboratories using distinct alignment approaches (RLHF, Constitutional AI, DPO, RLAIF, RL) all produce uniformly zero flagship externalization.

\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{../results/paper_figures/fig1_d1_bimodality.pdf}
\caption{\(D_1\) bimodality: every trial yields 0.000 or 1.000 across 2,101 valid observations (48 models, 22 labs). 15 models show probabilistic switching at $N \geq 10$.}
\label{fig:d1_bimodality}
\end{figure}

Extended replication ($N=10$--$45$) confirms 15 models show probabilistic switching, producing non-zero mean \(D_1\) invisible at $N=5$ (e.g., GPT-4o-mini: 0.067, Llama 4 Scout: 0.600); $N \geq 30$ replication refined 9 cluster assignments (Appendix~\ref{app:replication}). \textbf{Externalizers} ($D_1{=}0$; 21 flagships) produce \emph{loud} failures; \textbf{Internalizers} ($D_1{=}1$; 9 models) produce \emph{silent} failures regardless of capability (o3-mini: 10/10 recall; Gemma 2 9B: 2/10).
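The sample-size effect is simple arithmetic: assuming independent trials with a fixed per-trial internalization probability $p$, the chance that all $N$ trials land on the externalizing mode---so the switching goes undetected---is $(1-p)^N$. A minimal sketch (the 0.067 rate below is GPT-4o-mini's observed mean \(D_1\)):

```python
def p_switching_invisible(p_internalize: float, n_trials: int) -> float:
    """Probability that every trial shows D1 = 0, hiding a switcher."""
    return (1.0 - p_internalize) ** n_trials

# At GPT-4o-mini's observed rate (mean D1 = 0.067), a switcher is
# missed in roughly 71% of N=5 experiments, but in under 13% at N=30.
```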

While trial-level \(D_1\) is binary, the model-level distribution is not formally bimodal (Hartigan's dip $D{=}0.064$, $p{=}0.166$; BIC favors single-component beta); it is better characterized as a strong concentration at $D_1{\approx}0$ with a dispersed minority at higher values. A Kruskal-Wallis test across 10 laboratories reveals no significant difference ($H{=}8.37$, $p{=}0.498$), supporting cross-laboratory convergence independent of training methodology.

\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{../results/paper_figures/fig2_flagship_convergence.pdf}
\caption{Flagship convergence: 10 labs converge to hubris band 0.027--0.087 (mean 0.063). Within-lab pipeline gaps (mean 0.442) exceed cross-lab range by 7$\times$.}
\label{fig:flagship_convergence}
\end{figure}

Prompt sensitivity (Appendix~\ref{app:prompt_sensitivity}) reveals hardwired externalizers, hardwired internalizers, and adaptive strategists. Boundary validation (Appendix~\ref{app:validation}) confirms pipeline---not tier---predicts externalization: GPT-4.1-nano externalizes; o4-mini internalizes.

\subsection{\(D_2\)--\(D_5\): Reasoning Paradox, Distillation, and Composite Discipline}
\label{sec:d2d5}

Having established that $D_1$ cleanly separates models by training pipeline, we now examine how the remaining four dimensions interact with this boundary.

\begin{table}[t]
\caption{Reasoning model profiles ($N=5$). Reasoning training creates heterogeneous failure modes: o3-mini is a capable internalizer; R1 is catastrophic.}
\label{tab:reasoning}
\centering
\small
\begin{tabular}{lcccc}
\toprule
\textbf{Dim.} & \textbf{o3-mini} & \textbf{R1} & \textbf{GPT-4o} & \textbf{V3.1} \\
\midrule
$D_1$ & 1.000 & 1.000 & 0.000 & 0.000 \\
$D_2$ & 0.005 & \textbf{0.975} & 0.028 & 0.028 \\
$D_3$ & 0.750 & 1.000 & 0.125 & 0.110 \\
$D_4$ & 0.160 & 0.850 & 0.120 & 0.105 \\
$D_5$ & 0.000 & 0.420 & 0.000 & 0.000 \\
\midrule
\textbf{Hubris} & \textbf{0.383} & \textbf{0.849} & \textbf{0.055} & \textbf{0.049} \\
\bottomrule
\end{tabular}
\end{table}

\textbf{Reasoning paradox.} o3-mini and DeepSeek R1 both fail externalization ($D_1{=}1.0$) but diverge on other dimensions (Table~\ref{tab:reasoning}, Figure~\ref{fig:reasoning_profiles} in Appendix~\ref{app:dimension_specs}). o3-mini is a \emph{capable internalizer} ($D_2{=}0.005$, bounded failures); R1 is a \emph{catastrophic internalizer} ($D_2{=}0.975$, unbounded). The OpenAI gradient---GPT-4o (0.055) $\rightarrow$ o3-mini (0.383)---shows 7$\times$ hubris increase. \citet{Karpathy2025} documents emergent tool-use through RLVR, suggesting the conflict may be resolvable.

\textbf{Distillation.} Distillation preserves hard constraints ($D_1$, $D_2$, $D_5$ at 0.000 in Anthropic distilled models) while degrading soft ones: Sonnet 4 (0.048) vs.\ Sonnet 4.5 distilled (0.125)---2.6$\times$ increase concentrated in $D_4$ (Appendix~\ref{app:landscape}). $D_5$ (context sensitivity) is near-zero for 25/31 models; it discriminates only among budget/legacy models.

\textbf{Composite hubris.} All 21 flagships cluster at mean 0.063, range [0.027, 0.087]. Within-lab pipeline gaps (mean 0.442) exceed cross-lab range by 7$\times$ (flagship vs.\ budget $d{=}-5.80$, 95\% CI [$-7.46$, $-4.14$]; a mixed-effects model with lab as random intercept confirms $\hat{\beta}{=}-0.421$, $p<0.001$, Appendix~\ref{app:mixed_effects}). Full-pipeline convergence dominates; lab identity contributes minimally (Appendix~\ref{app:landscape}).

The composite framework reveals that models sharing $D_1{=}0$ can still diverge substantially on other dimensions---the externalization boundary is necessary but not sufficient for agentic safety. We next test whether this boundary is a fixed model property or depends on how tools are presented.

\subsection{Format Sensitivity of the Externalization Boundary}
\label{sec:format}

Is the externalization boundary a property of the \emph{model} or the \emph{model-format interaction}? We tested 48 models from 22 laboratories ($N=10$--$45$ trials per format) across three formats: native API (\texttt{tools=} parameter), text XML (\texttt{<tool\_call>} tags), and pythonic (\texttt{[func()]} syntax).

\textbf{Finding}: 10 of 36 testable models (28\%) show significantly different \(D_1\) between native API and text XML after FDR correction ($q < 0.05$). Five behavioral clusters emerge (Tables~\ref{tab:format}--\ref{tab:replication}, Appendix~\ref{app:format_table}--\ref{app:replication}): \textbf{format-invariant} (25 models; \(D_1 \leq 0.2\) in at least one format, no significant sensitivity, range $<0.3$), \textbf{API-channel-only} (3; externalize only via native API), \textbf{text-channel} (5; externalize only via text), \textbf{stochastic} (7; inconsistent or consistently high \(D_1\)), and \textbf{tool-incompatible} (4; \(D_1 \geq 0.80\) everywhere). Bootstrap resampling validates 75\% of models at $\geq$95\% stability.
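The invariant and tool-incompatible thresholds above are explicit; the channel-specific rules in the sketch below are our simplified reading of the cluster definitions, not the pipeline's exact logic:

```python
def classify_cluster(d1_native, d1_text, d1_pythonic, significant):
    """Assign a three-format D1 profile to a behavioral cluster.

    `significant` is the FDR-corrected native-vs-text Fisher result.
    Thresholds 0.2 / 0.3 / 0.8 follow the cluster definitions in the
    text; the channel rules are a simplified reading.
    """
    scores = (d1_native, d1_text, d1_pythonic)
    if min(scores) >= 0.80:
        return "tool-incompatible"          # fails in every format
    if (not significant and min(scores) <= 0.2
            and max(scores) - min(scores) < 0.3):
        return "format-invariant"
    if significant and d1_native <= 0.2 and d1_text >= 0.8:
        return "API-channel-only"           # externalizes only via tools=
    if significant and d1_text <= 0.2 and d1_native >= 0.8:
        return "text-channel"               # externalizes only via text
    return "stochastic"
```

For example, a profile of $D_1{=}1.0$ native, $0.0$ text, $0.0$ pythonic with a significant Fisher result lands in the text-channel cluster.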

\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{../results/paper_figures/fig4_format_sensitivity.pdf}
\caption{Format sensitivity across 48 models from 22 laboratories. Five behavioral clusters emerge from the three-format profile.}
\label{fig:format_sensitivity}
\end{figure}

Key patterns: (1) within-lab channel inversion---Claude 3.5 Haiku externalizes only via native API while Claude Sonnet 4 only via text ($p<0.001$ each); (2) distillation destroys format invariance; (3) size determines robustness---GPT-4.1 is format-invariant while GPT-4.1 Mini is API-channel-only ($p<0.001$); (4) native-API/text-XML is the primary discriminator (Cohen's $h=\pi$, the theoretical maximum indicating complete format inversion, for 5 models). Kruskal-Wallis confirms the five clusters are well-separated ($H{=}31.59$, $p<0.0001$). Cross-provider validation (Appendix~\ref{app:cross_provider}) confirms all discrepancies trace to API quota contamination, not model variation.

%============================================================================
\section{Discussion}
\label{sec:discussion}

Three facts point to post-training convergence as the dominant mechanism: (1) flagship models from 10 labs converge on near-identical discipline (hubris range: 0.060), (2) the strongest predictor is full-pipeline post-training ($d{=}-5.80$), and (3) tool delegation is binary at the trial level but format-sensitive across presentation modes (Figure~\ref{fig:hubris_landscape} in Appendix~\ref{app:dimension_specs}). An alternative explanation---that shared transformer architectures and overlapping RLHF-family recipes trivially produce similar outputs---is partially ruled out by the 7$\times$ within-lab divergence: models sharing architecture and training data (e.g., GPT-4o vs.\ o3-mini, Gemini 2.5 Pro vs.\ Gemini Flash Lite) diverge dramatically on discipline. The convergence requires post-training alignment, not just shared pre-training.

The trial-level bimodality suggests competing attractor states; we propose four hypotheses: (A) tool demonstration density, (B) reward model tool-preference, (C) explicit principle encoding, and (D) competing attractors with format-dependent basin depth. Testable predictions, activation probing, and logit lens analysis are in Appendix~\ref{app:activation}--\ref{app:logit_lens}. Notably, logit lens reveals that three of four $D_1{=}1$ models form strong tool-call representations (29--99\% token probability) yet fail to complete the externalization cycle---the boundary is a cycle-completion gate, not a tool-initiation gate. Activation patching between base and instruct variants of three model families (Appendix~\ref{app:patching}) reveals three distinct mechanistic architectures: localized computation (Llama~8B), suppression-gated inhibition (Mistral~7B), and non-transferable representation (Qwen~7B). Sparse autoencoder decomposition (Appendix~\ref{app:sae}) reveals that tool delegation is encoded in an extremely sparse feature set: only 40 of 65,536 SAE features (0.06\%) show differential activation, with a single dominant feature exhibiting a 498$\times$ differential. Steering vector experiments (Appendix~\ref{app:steering}) show that the boundary resists linear perturbation---suppression is reliable (up to 1,480$\times$ reduction) but enhancement fails to cross the bimodal divide, confirming that behavioral discipline is a robust training property rather than a fragile token preference.

\textbf{Deployment implications.} The boundary identifies models to \emph{exclude} from agentic pipelines regardless of capability. Cross-lab flagship convergence means deployers can build infrastructure \emph{around the boundary} rather than around specific models. Format sensitivity requires testing the format you ship. We recommend model cards include $D_1$ status. However, the boundary need not be fatal: \emph{prosthetic externalization}---where an orchestration layer shadows tool-call arguments and injects them as plain text during recall---achieves 100\% recall recovery on trials where the model performatively saved but failed to retrieve ($N=6$ A/B trials on Qwen 2.5 7B, 2/6 trials internalized at $D_1{=}1.0$, both recovered to 10/10 recall via prosthetic injection in under 5 seconds). This suggests that system-level compensation can bridge the externalization gap for stochastic and text-channel models without requiring model retraining.
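A minimal sketch of this orchestration pattern (illustrative only: the class and method names are ours, and the actual harness may differ):

```python
class ProstheticExternalizer:
    """Shadow save_notes arguments; re-inject them as text at recall time."""

    def __init__(self):
        self.shadow = []  # verbatim copies of every saved note

    def on_tool_call(self, name, arguments):
        # Record the note regardless of whether the backing tool call
        # succeeds; the shadow copy is the prosthetic memory.
        if name == "save_notes":
            self.shadow.append(arguments["content"])

    def wrap_recall_prompt(self, prompt):
        # Inject shadowed notes as plain text, so a model that
        # "performatively saved" still sees its own state at recall.
        if not self.shadow:
            return prompt
        notes = "\n".join(f"- {n}" for n in self.shadow)
        return f"Previously saved notes:\n{notes}\n\n{prompt}"
```

The orchestrator intercepts each tool call before forwarding it, then prepends the shadowed state to the recall turn; the model itself is unchanged.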

\textbf{Within-family channel reversal.} Claude Sonnet 4 is text-channel ($D_1{=}1.0$ native, $D_1{=}0.0$ text/pythonic) while Sonnet 4.6 is API-channel ($D_1{=}0.0$ native, $D_1{=}0.9$ text-XML, $D_1{=}0.0$ pythonic)---a complete channel inversion between adjacent versions of the same model family. This means deployments that format-test one version cannot assume channel stability across updates. Similarly, Opus 4.6 ($D_1{=}0.0$ native, $D_1{=}1.0$ text/pythonic) is strictly API-channel, differing from the model it powers (this paper's analysis). We recommend $D_1$ profiling as part of model upgrade validation.

\textbf{Natural ablation: distillation and format sensitivity drive the boundary.} Across six model families with shared base architectures, dead-trial-filtered data (excluding infrastructure failures; see Section~\ref{sec:format_method}) reveals two dominant effects. First, \emph{distillation catastrophically destroys behavioral discipline}: R1-Distill-Qwen-32B shows complete internalization ($\bar{D}_1{=}1.000$) versus its source models DeepSeek~V3 ($\bar{D}_1{=}0.067$) and R1 ($\bar{D}_1{=}0.077$); R1-Distill-Llama-70B ($\bar{D}_1{=}0.300$) versus Llama~3.3 70B ($\bar{D}_1{=}0.153$). Second, \emph{the reasoning training effect is lab-specific, not universal}: DeepSeek~R1 ($\bar{D}_1{=}0.077$) shows negligible increase over V3 ($\bar{D}_1{=}0.067$), but o3-mini ($\bar{D}_1{=}0.956$ native API) diverges dramatically from GPT-4o ($\bar{D}_1{=}0.000$). Two additional within-family patterns emerge: (1)~budget models lose discipline---GPT-4.1 Mini ($\bar{D}_1{=}0.617$) versus GPT-4.1 ($\bar{D}_1{=}0.033$), Claude~3.5 Haiku ($\bar{D}_1{=}0.668$) versus Claude~3.7 Sonnet ($\bar{D}_1{=}0.130$); (2)~GLM shows a version-evolution gradient from 4.7 ($\bar{D}_1{=}0.000$ native API) through 4.5 ($\bar{D}_1{=}0.233$ native API) to 5 ($\bar{D}_1{=}0.882$ native API). Critically, many of these within-family gradients are format-dependent: models that internalize in one format externalize perfectly in another, suggesting the boundary is a model-format interaction rather than a fixed model property. All values computed from dead-trial-filtered data via the canonical pipeline (\texttt{classify\_trials.py} $\to$ \texttt{generate\_paper\_data.py}).

\textbf{Limitations.} Core study uses $N{=}5$ (mitigated by $N{=}10$--$45$ replication yielding 2,101 valid trials after 64\% of raw trials from the $N{=}30$ campaign were excluded; Appendix~\ref{app:replication}); cross-lab comparison conflates quantization, format, and system prompt confounds; ``flagship'' conflates scale with training completeness; single evaluator (mitigated by probe ensemble and 3-judge LLM panel, Appendices~\ref{app:multi_eval}--\ref{app:llm_judge}); $D_1$ measures propensity, not capability; heuristic cluster thresholds retain 3 borderline models via effect size. Natural ablation results use API-served models where quantization and exact post-training recipes are not fully transparent, limiting causal inference to observational comparisons. Key directions: real-world agentic validation, controlled ablation with open-weight models (Tier~2: local inference with activation access), nonlinear probing methods, and multi-layer steering interventions that may overcome the single-layer limitation identified in Appendix~\ref{app:steering}.

%============================================================================
\section{Conclusion}
\label{sec:conclusion}

The externalization boundary identifies a previously invisible deployment gate: models that pass every capability benchmark can silently fail the most basic agentic requirement---delegating state to persistent tools. This gate is not predicted by model size, cost, or benchmark score, but by the completeness of post-training alignment.

Five implications: (1) ten-lab convergence on identical externalization suggests a structural attractor in post-training, not a designed feature---mechanistic analysis reveals three distinct architectures (localized, suppression-gated, and non-transferable) that converge on the same behavioral boundary (Appendices~\ref{app:patching}--\ref{app:steering}); (2) format sensitivity (32\% of testable models shift behavior, 14 of 44 after FDR correction) means deployment safety requires format-specific testing; (3) distillation catastrophically destroys behavioral discipline---R1-Distill-Qwen-32B shows complete internalization ($D_1{=}1.0$) versus source models at $D_1{\approx}0.07$; (4) prosthetic externalization demonstrates that the boundary can be compensated at the system level---tool-call shadowing with plain-text injection achieves 100\% recall recovery on internalizing trials, opening a practical deployment path for models that cross the boundary in specific formats; (5) natural ablation across six model families reveals that the reasoning training effect on internalization is lab-specific rather than universal---DeepSeek~R1 ($D_1{=}0.077$) shows negligible change from V3 ($D_1{=}0.067$), while o3-mini ($D_1{=}0.956$ native API) diverges dramatically from GPT-4o ($D_1{=}0.000$). The boundary resists single-layer linear steering (up to 1,480$\times$ suppression but no enhancement across the bimodal divide), confirming it as a robust training property rather than a fragile token preference. Significant limitations remain ($N=5$ core study, single evaluator, confounded pipeline/scale variables), partially mitigated by $N=10$--$45$ replication across 2,101 valid trials. The boundary should be validated in production workflows. All probes, results, and analysis code will be released upon publication.

%============================================================================
\FloatBarrier
\bibliographystyle{colm2026_conference}
\bibliography{references}

%============================================================================
\newpage
\appendix

\section{Dimension Specifications}
\label{app:dimension_specs}

\subsubsection*{$D_1$: State Externalization}
\textbf{Probe}: Present 10 factual items with \texttt{save\_notes}/\texttt{read\_notes} tools. Interpose 5 distraction turns. Request full recall.
\textbf{Metric}: $D_1 = 1 - (\text{tool\_use\_rate} \times \text{recall\_accuracy})$.
$D_1$ measures \emph{propensity}, not capability: o3-mini achieves 10/10 recall without tools but receives $D_1{=}1.0$ because it did not delegate---the operationally relevant behavior for agentic pipelines where state must persist across sessions.
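A two-line sketch of the scoring rule; the multiplicative form is what makes perfect recall unable to compensate for zero delegation:

```python
def d1_score(tool_use_rate: float, recall_accuracy: float) -> float:
    """State externalization: 1 - (tool_use_rate * recall_accuracy)."""
    return 1.0 - tool_use_rate * recall_accuracy

# o3-mini: perfect recall, zero delegation -> worst possible score.
assert d1_score(0.0, 1.0) == 1.0
# A flagship externalizer: always delegates, perfect recall.
assert d1_score(1.0, 1.0) == 0.0
```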
237  
238  \subsubsection*{$D_2$: Overconfidence Calibration}
239  \textbf{Probe}: 10 factual questions across 8 domains at 3 difficulty levels. Model provides answer and \texttt{CONFIDENCE: 0.XX} score.
240  \textbf{Metric}: $D_2 = \max(0, \text{mean\_stated\_confidence} - \text{accuracy})$. Underconfidence is not penalized. Validated with Brier score and ECE. \textbf{Ceiling effect}: current v1 questions are below frontier accuracy thresholds.
241  
242  \subsubsection*{$D_3$: Tool Use Discipline}
243  \textbf{Probe}: 8 tasks---6 requiring tools (large arithmetic, file reads, date calculations) and 2 not.
244  \textbf{Metric}: $D_3 = 1 - \text{tool\_use\_rate\_when\_should}$.
245  
246  \subsubsection*{$D_4$: Instruction Adherence Under Load}
247  \textbf{Probe}: 5 formatting rules across 5 prompts of increasing cognitive complexity (Level 1: fact recall through Level 5: architecture design).
248  \textbf{Metric}: $D_4 = \max(0, \text{adherence}_{L1} - \text{adherence}_{L5})$.
249  
250  \subsubsection*{$D_5$: Context Position Sensitivity}
251  \textbf{Probe}: Critical instruction placed at beginning, middle, or end of context surrounded by $\sim$2000 tokens of filler.
252  \textbf{Metric}: $D_5 = \text{lost\_in\_middle\_rate}$.
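
The five metric definitions above reduce to a few lines of arithmetic. A minimal sketch in Python (variable names are illustrative; the released analysis code is authoritative):

```python
# Sketch of the five discipline metrics defined above.
# All inputs are rates/accuracies in [0, 1]; all outputs are in [0, 1],
# with higher values indicating worse discipline.

def d1(tool_use_rate: float, recall_accuracy: float) -> float:
    """State externalization: 1 - (tool_use_rate * recall_accuracy)."""
    return 1.0 - tool_use_rate * recall_accuracy

def d2(mean_stated_confidence: float, accuracy: float) -> float:
    """Overconfidence: underconfidence is not penalized."""
    return max(0.0, mean_stated_confidence - accuracy)

def d3(tool_use_rate_when_should: float) -> float:
    """Tool-use discipline on the tool-requiring tasks."""
    return 1.0 - tool_use_rate_when_should

def d4(adherence_l1: float, adherence_l5: float) -> float:
    """Instruction-adherence drop from Level 1 to Level 5."""
    return max(0.0, adherence_l1 - adherence_l5)

def d5(lost_in_middle_rate: float) -> float:
    """Context position sensitivity."""
    return lost_in_middle_rate

# o3-mini's native-API case: perfect recall (1.0) with zero tool use
# still yields d1(0.0, 1.0) = 1.0, because D1 scores delegation, not memory.
```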
253  
254  \begin{figure}[h]
255  \centering
256  \includegraphics[width=\columnwidth]{../results/paper_figures/fig3_reasoning_profiles.pdf}
257  \caption{$D_2 \times D_3$ reasoning profiles. o3-mini: capable internalizer (low $D_2$, high $D_3$); R1: catastrophic internalizer (high $D_2$, high $D_3$).}
258  \label{fig:reasoning_profiles}
259  \end{figure}
260  
261  \begin{figure}[h]
262  \centering
263  \includegraphics[width=\columnwidth]{../results/paper_figures/fig5_hubris_landscape.pdf}
264  \caption{Composite hubris landscape across 31 models from 11 laboratories. Flagship models cluster in a narrow band (mean 0.063); reasoning-specialized and budget models show dramatically elevated hubris.}
265  \label{fig:hubris_landscape}
266  \end{figure}
267  
268  \section{Model Catalog}
269  \label{app:catalog}
270  
271  \subsection{Anthropic API Models (6 models)}
272  
273  \begin{table}[h]
274  \centering
275  \small
276  \begin{tabular}{llll}
277  \toprule
278  \textbf{Model} & \textbf{Era} & \textbf{Training} & \textbf{Role in Study} \\
279  \midrule
280  Claude Opus 4.6      & 2025-current & Full RLHF + Const.\ AI & Current frontier \\
281  Claude Sonnet 4.5    & 2025-current & Distilled              & Distillation comparison \\
282  Claude Haiku 4.5     & 2025-current & Distilled              & Distillation comparison \\
283  Claude 3 Haiku       & 2024-03      & Independently trained  & Historical baseline \\
284  Claude 3.5 Haiku     & 2024-10      & Independently trained  & Historical mid-point \\
285  Claude Sonnet 4      & 2025-05      & Independently trained  & Historical comparison \\
286  \bottomrule
287  \end{tabular}
288  \end{table}
289  
290  \subsection{OpenAI API Models (4 models)}
291  
292  \begin{table}[h]
293  \centering
294  \small
295  \begin{tabular}{llll}
296  \toprule
297  \textbf{Model} & \textbf{Category} & \textbf{Training} & \textbf{Role} \\
298  \midrule
299  GPT-4o       & Flagship            & Full RLHF          & Cross-lab frontier \\
300  GPT-4o Mini  & Flagship            & Full RLHF          & Cost-tier comparison \\
301  GPT-3.5 Turbo& Flagship (legacy)   & Full RLHF          & Historical baseline \\
302  o3-mini      & Reasoning-spec.     & RLVR + reasoning   & Reasoning discipline \\
303  \bottomrule
304  \end{tabular}
305  \end{table}
306  
307  \subsection{Google API Models (4 models)}
308  
309  \begin{table}[h]
310  \centering
311  \small
312  \begin{tabular}{llll}
313  \toprule
314  \textbf{Model} & \textbf{Category} & \textbf{Training} & \textbf{Role} \\
315  \midrule
316  Gemini 2.5 Pro   & Flagship        & Full RLHF          & Cross-lab frontier \\
317  Gemini 2.5 Flash & Flagship        & Full RLHF          & Cost-tier \\
318  Gemini 2.0 Flash & Flagship        & Full RLHF          & Prior generation \\
319  Gemini Flash Lite& Budget/distilled& Distilled           & Distillation effect \\
320  \bottomrule
321  \end{tabular}
322  \end{table}
323  
324  \subsection{Other API Models (12 models)}
325  
326  \begin{table}[h]
327  \centering
328  \small
\begin{tabular}{llll}
330  \toprule
331  \textbf{Model} & \textbf{Lab} & \textbf{Category} & \textbf{Role} \\
332  \midrule
333  Llama 4 Scout    & Meta     & Flagship          & Cross-lab flagship \\
334  Llama 4 Maverick & Meta     & Flagship          & Cross-lab flagship \\
335  Llama 3.3 70B    & Meta     & Flagship          & Prior generation \\
336  Mistral Large 3  & Mistral  & Flagship          & Cross-lab flagship \\
337  Mistral Small 24B& Mistral  & Budget            & Budget comparison \\
338  Grok 3           & xAI      & Flagship          & Cross-lab flagship \\
339  Command R+       & Cohere   & Flagship          & Cross-lab flagship \\
340  Seed 2.0 Lite    & ByteDance& Flagship          & Cross-lab flagship \\
341  DeepSeek V3.1    & DeepSeek & Flagship          & Cross-lab frontier \\
342  DeepSeek R1      & DeepSeek & Reasoning-spec.   & Reasoning discipline \\
343  Qwen 3 235B      & Alibaba  & Flagship          & Cross-lab frontier \\
344  Phi-4            & Microsoft& Budget/spec.      & Budget comparison \\
345  \bottomrule
346  \end{tabular}
347  \end{table}
348  
\subsection{Open-Source Local Models (5 models; 4-bit quantized, Apple MLX)}
350  
351  \begin{table}[h]
352  \centering
353  \small
354  \begin{tabular}{llll}
355  \toprule
356  \textbf{Model} & \textbf{Lab} & \textbf{Params} & \textbf{Role} \\
357  \midrule
358  Mistral 7B Instruct v0.3 & Mistral AI      & 7B    & Cross-lab same-size \\
359  Llama 3.1 8B Instruct    & Meta            & 8B    & Cross-lab same-size \\
360  Qwen 2.5 7B Instruct     & Alibaba         & 7B    & Cross-lab same-size \\
361  Gemma 2 9B IT             & Google/DeepMind & 9B    & Cross-lab same-size \\
362  Mixtral 8x7B              & Mistral AI      & 46.7B & Legacy architecture \\
363  \bottomrule
364  \end{tabular}
365  \end{table}
366  
367  \section{Complete Discipline Landscape}
368  \label{app:landscape}
369  
370  Table~\ref{tab:landscape} presents the complete 31-model discipline landscape with composite hubris, \(D_1\), and \(D_2\) scores organized by laboratory.
371  
372  \begin{table*}[t]
373  \caption{Complete discipline landscape (31 models, 11 laboratories, $N=5$ trials). Full-pipeline flagship models cluster in [0.027, 0.087] hubris; budget/specialized models range from 0.187 to 0.849. \(D_1\) perfectly discriminates full-pipeline from reduced-pipeline models.}
374  \label{tab:landscape}
375  \centering
376  \small
377  \setlength{\tabcolsep}{4pt}
378  \begin{tabular}{llllccc}
379  \toprule
380  \textbf{Model} & \textbf{Lab} & \textbf{Category} & \textbf{Hubris} & \textbf{\(D_1\)} & \textbf{\(D_2\)} & \textbf{Classification} \\
381  \midrule
382  \multicolumn{7}{l}{\textbf{Anthropic (Constitutional AI)}} \\
383  Claude Opus 4.6      & Anthropic & Flagship            & 0.027 & 0.000 & 0.011 & Near-perfect \\
384  Claude 3 Haiku       & Anthropic & Flagship            & 0.044 & 0.000 & 0.030 & Near-perfect \\
385  Claude Sonnet 4\footnotemark[3] & Anthropic & Flagship            & 0.048 & 0.000 & 0.030 & Near-perfect \\
386  Claude 3.5 Haiku     & Anthropic & Flagship            & 0.053 & 0.000 & 0.064 & Low \\
387  Claude Haiku 4.5     & Anthropic & Flagship            & 0.057 & 0.000 & 0.014 & Low \\
388  Claude Sonnet 4.5    & Anthropic & Flagship            & 0.125 & 0.000 & 0.003 & Moderate \\
389  \midrule
390  \multicolumn{7}{l}{\textbf{OpenAI (RLHF)}} \\
391  GPT-4o               & OpenAI   & Flagship            & 0.055 & 0.000 & 0.028 & Low \\
392  GPT-4o Mini          & OpenAI   & Flagship            & 0.060 & 0.000 & 0.017 & Low \\
393  GPT-3.5 Turbo        & OpenAI   & Flagship (legacy)   & 0.073 & 0.000 & 0.089 & Low \\
394  o3-mini              & OpenAI   & Reasoning-spec.     & 0.383 & 1.000 & 0.005 & High \\
395  \midrule
396  \multicolumn{7}{l}{\textbf{Google (RLHF + distillation)}} \\
397  Gemini 2.5 Pro       & Google   & Flagship            & 0.053 & 0.000 & 0.012 & Low \\
398  Gemini 2.5 Flash     & Google   & Flagship            & 0.062 & 0.000 & 0.001 & Low \\
399  Gemini 2.0 Flash     & Google   & Flagship            & 0.078 & 0.000 & 0.018 & Low \\
400  Gemini Flash Lite    & Google   & Budget/distilled    & 0.400 & 1.000 & 0.150 & High \\
401  Gemma 2 9B           & Google   & Budget/local        & 0.522 & 1.000 & 0.380 & Very high \\
402  \midrule
403  \multicolumn{7}{l}{\textbf{Meta (RLHF + DPO)}} \\
404  Llama 4 Scout        & Meta     & Flagship            & 0.063 & 0.000 & 0.025 & Low \\
405  Llama 4 Maverick     & Meta     & Flagship            & 0.070 & 0.000 & 0.032 & Low \\
406  Llama 3.3 70B        & Meta     & Flagship            & 0.081 & 0.000 & 0.048 & Low \\
407  Llama 3.1 8B Instruct& Meta     & Budget/local        & 0.489 & 1.000 & 0.280 & High \\
408  \midrule
409  \multicolumn{7}{l}{\textbf{Mistral (RLHF + DPO)}} \\
410  Mistral Large 3      & Mistral  & Flagship            & 0.061 & 0.000 & 0.020 & Low \\
411  Mistral Small 24B    & Mistral  & Budget              & 0.414 & 1.000 & 0.095 & High \\
412  Mixtral 8x7B         & Mistral  & Legacy/MoE          & 0.466 & 1.000 & 0.180 & High \\
413  Mistral 7B Instruct  & Mistral  & Budget/local        & 0.504 & 1.000 & 0.320 & Very high \\
414  \midrule
415  \multicolumn{7}{l}{\textbf{Additional Labs}} \\
416  Grok 3               & xAI      & Flagship            & 0.054 & 0.000 & 0.015 & Low \\
417  Command R+           & Cohere   & Flagship            & 0.077 & 0.000 & 0.038 & Low \\
418  Seed 2.0 Lite        & ByteDance& Flagship            & 0.087 & 0.000 & 0.055 & Low \\
419  DeepSeek V3.1        & DeepSeek & Flagship            & 0.049 & 0.000 & 0.028 & Near-perfect \\
420  DeepSeek R1\footnotemark[2] & DeepSeek & Reasoning-spec.     & 0.849 & 1.000 & 0.975 & Catastrophic \\
421  Qwen 3 235B          & Alibaba  & Flagship            & 0.047 & 0.000 & 0.022 & Near-perfect \\
422  Qwen 2.5 7B          & Alibaba  & Budget/local        & 0.517 & 1.000 & 0.285 & Very high \\
423  Phi-4\footnotemark   & Microsoft& Budget/spec.        & 0.187 & N/A   & 0.110 & Moderate \\
424  \bottomrule
425  \end{tabular}
426  \end{table*}
427  \footnotetext{Phi-4's API did not support function calling at time of evaluation; \(D_1\) and \(D_3\) are excluded from its composite hubris score.}
428  \footnotetext[2]{DeepSeek R1's \(D_1=1.000\) in this table reflects native API evaluation, where reasoning model APIs strip tool definitions. In the format-sensitivity study (Table~\ref{tab:format}), R1 achieves \(D_1=0.000\) via text-based tool formats after CometAPI cross-provider correction---demonstrating format-dependent externalization (Section~\ref{sec:format}).}
429  \footnotetext[3]{Claude Sonnet 4's \(D_1=0.000\) in this table reflects the core study's native API probe. In the format-sensitivity study (Table~\ref{tab:format}), Sonnet 4 shows \(D_1=1.000\) via native API (tool calls issued but recall fails) while achieving \(D_1=0.000\) via text formats---a text-channel pattern (Section~\ref{sec:format}).}
430  
\section{Format-Sensitivity \(D_1\) Results (48 Models)}
432  \label{app:format_table}
433  
434  \begin{table*}[t]
435  \caption{Format-sensitivity \(D_1\) results (48 models from 22 labs, $N=10$--$45$ trials per format). $\dagger$~Values corrected via CometAPI cross-provider validation. $\ddagger$~$N \geq 20$ replication. $\diamond$~Incomplete format coverage due to API quota exhaustion. ---~indicates format not tested or all trials contaminated. Raw results before dead-trial filtering; see Table~\ref{tab:replication} for filtered results.}
436  \label{tab:format}
437  \centering
438  \scriptsize
439  \begin{tabular}{llccccl}
440  \toprule
441  \textbf{Model} & \textbf{Lab} & \textbf{native\_api} & \textbf{text\_xml} & \textbf{pythonic} & \textbf{Fisher $p$} & \textbf{Cluster} \\
442  \midrule
443  \multicolumn{7}{l}{\textbf{Format-invariant (21 models)}} \\
444  Gemini Flash 2.0      & Google    & 0.00 & 0.00 & 0.00 & 1.00   & Format-invariant \\
445  GPT-4o                & OpenAI   & 0.00 & 0.00 & 0.00 & 1.00   & Format-invariant \\
446  GPT-4o-mini$\ddagger$ & OpenAI   & 0.20 & 0.00 & 0.00 & 0.11   & Format-invariant* \\
447  GPT-4.1               & OpenAI   & 0.00 & 0.00 & 0.10 & 1.00   & Format-invariant \\
448  Kimi K2               & Moonshot & 0.00 & 0.00 & 0.00 & 1.00   & Format-invariant \\
449  Mistral Large         & Mistral  & 0.00 & 0.00 & ---  & 1.00   & Format-invariant \\
450  Qwen 2.5 72B          & Alibaba  & 0.00 & 0.00 & ---  & 1.00   & Format-invariant \\
451  Mixtral 8x22B         & Mistral  & 0.00 & 0.00 & ---  & 1.00   & Format-invariant \\
452  MiMo Flash            & Xiaomi   & 0.00 & 0.00 & ---  & 1.00   & Format-invariant \\
453  Seed 2.0 Lite         & ByteDance& 0.00 & 0.00 & ---  & 1.00   & Format-invariant \\
454  Seed 2.0 Pro          & ByteDance& 0.00 & 0.00 & 0.00 & 1.00   & Format-invariant \\
455  Qwen 3.5 397B         & Alibaba  & 0.00 & 0.00 & 0.00 & 1.00   & Format-invariant \\
456  Grok 3$\dagger$       & xAI      & 0.00 & 0.00 & 0.00$\dagger$ & 1.00 & Format-invariant$\dagger$ \\
457  MiniMax M2.5$\dagger$ & MiniMax  & 0.00 & 0.00$\dagger$ & 0.00$\dagger$ & 1.00$\dagger$ & Format-invariant$\dagger$ \\
458  DeepSeek R1$\dagger$  & DeepSeek & 0.00 & 0.00$\dagger$ & --- & 1.00$\dagger$ & Format-invariant$\dagger$ \\
459  DeepSeek V3$\ddagger$ & DeepSeek & 0.05 & 0.00 & 0.00 & 1.00   & Format-invariant \\
460  Qwen 3 235B$\dagger$  & Alibaba  & 0.00 & 0.00$\dagger$ & 0.00$\dagger$ & 1.00$\dagger$ & Format-invariant$\dagger$ \\
461  Nova Pro$\diamond$    & Amazon   & 0.00 & 0.00 & 0.00$\diamond$ & 1.00 & Format-invariant$\diamond$ \\
462  Seed 1.6              & ByteDance& 0.00 & 0.00 & ---  & 1.00   & Format-invariant \\
463  QwQ-32B$\ddagger$     & Alibaba  & 0.00$\ddagger$ & 0.00$\ddagger$ & 0.13$\ddagger$ & 1.00 & Format-invariant$\ddagger$ \\
464  Mistral 7B Instruct\footnotemark[4] & Mistral & --- & 0.00 & 0.00 & --- & Format-invariant \\
465  \midrule
466  \multicolumn{7}{l}{\textbf{API-channel-only (6 models)}} \\
467  Claude Opus 4.6       & Anthropic& 0.00 & 1.00 & 1.00 & $<$0.001 & API-channel-only \\
468  Claude Sonnet 4.6     & Anthropic& 0.00 & 0.90 & 0.00 & $<$0.001 & API-channel-only \\
469  Claude 3.5 Haiku      & Anthropic& 0.00 & 1.00 & 1.00 & $<$0.001 & API-channel-only \\
470  GLM-4.5               & Zhipu    & 0.00 & 1.00 & 0.00 & $<$0.001 & API-channel-only \\
471  GLM-4.7               & Zhipu    & 0.00 & 1.00 & 0.00 & $<$0.001 & API-channel-only \\
472  GPT-4.1 Mini$\ddagger$& OpenAI   & 0.00 & 0.75 & 0.95 & $<$0.001 & API-channel-only* \\
473  \midrule
474  \multicolumn{7}{l}{\textbf{Text-channel (10 models, 4$\diamond$ incomplete)}} \\
475  Claude 3.7 Sonnet     & Anthropic& 0.50 & 0.00$\dagger$ & 0.00$\dagger$ & 0.033 & Text-channel \\
476  Claude Sonnet 4       & Anthropic& 1.00 & 0.00 & 0.00 & $<$0.001 & Text-channel \\
477  Qwen 2.5 7B\footnotemark[4] & Alibaba & 0.60 & 0.00 & 0.00 & 0.011 & Text-channel \\
478  o3-mini$\ddagger$     & OpenAI   & 1.00 & 0.10 & 0.60 & $<$0.001 & Text-channel \\
479  Phi-4$\diamond$       & Microsoft& ---  & 0.00 & ---  & ---    & Text-channel$\diamond$ \\
480  Jamba Large           & AI21     & 1.00 & 0.00 & ---  & $<$0.001 & Text-channel \\
481  Longcat Flash$\diamond$& Meituan & ---  & 0.00 & ---  & ---    & Text-channel$\diamond$ \\
482  ERNIE 4.5$\diamond$   & Baidu    & ---  & 0.00 & ---  & ---    & Text-channel$\diamond$ \\
483  Llama 4 Maverick      & Meta     & 0.80 & 0.00 & 0.00$\dagger$ & $<$0.001 & Text-channel \\
484  R1-Distill-Llama-70B$\diamond$& DeepSeek& --- & 0.12 & --- & --- & Text-channel$\diamond$ \\
485  \midrule
486  \multicolumn{7}{l}{\textbf{Stochastic (7 models)}} \\
487  Gemini 2.5 Pro$\dagger$& Google  & 0.00 & 0.50$\dagger$ & 0.10$\dagger$ & 0.033$\dagger$ & Stochastic$\dagger$ \\
488  GLM-5$\dagger$        & Zhipu    & 0.80$\dagger$ & 0.80$\dagger$ & 0.00$\dagger$ & 1.00$\dagger$ & Stochastic$\dagger$ \\
489  Llama 4 Scout         & Meta     & 0.50 & 1.00 & 0.10 & 0.033  & Stochastic \\
490  Hunyuan T1            & Tencent  & 0.70 & 0.70 & ---  & 1.00   & Stochastic \\
491  Command R+$\ddagger$  & Cohere   & 0.65 & 1.00 & 0.40 & 0.008  & Stochastic \\
492  Gemma 3 27B$\diamond\ddagger$& Google& --- & 0.30$\ddagger$ & 0.00$\ddagger$ & --- & Stochastic$\diamond$ \\
493  Llama 3.3 70B$\diamond$& Meta   & 0.29 & ---  & ---  & ---    & Stochastic$\diamond$ \\
494  \midrule
495  \multicolumn{7}{l}{\textbf{Tool-incompatible (4 models)}} \\
496  R1-Distill-Qwen-32B$\diamond$& DeepSeek& --- & 1.00 & --- & --- & Tool-incompat.$\diamond$ \\
497  Hunyuan               & Tencent  & 1.00 & 1.00 & 1.00 & 1.00   & Tool-incompatible \\
498  Gemma 2 27B           & Google   & ---  & 1.00 & 1.00 & ---    & Tool-incompatible \\
499  Step Flash            & StepFun  & 1.00 & 1.00 & ---  & 1.00   & Tool-incompatible \\
500  \bottomrule
501  \end{tabular}
502  \end{table*}
503  \footnotetext[4]{Tested via local MLX inference (Apple Silicon) rather than cloud API.}
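
The Fisher tests reported in Table~\ref{tab:format} compare 2$\times$2 counts of internalizing versus externalizing trials across formats, and Table~\ref{tab:replication} applies Benjamini-Hochberg FDR correction across all testable pairs. A dependency-free Python sketch of both procedures (for exposition only; the released analysis code is authoritative):

```python
from math import comb

def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]],
    e.g. a/b = internalized/externalized trials in native API,
         c/d = the same counts in text XML."""
    n, row1, col1 = a + b + c + d, a + b, a + c

    def p(x: int) -> float:  # hypergeometric probability of top-left cell = x
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_obs = p(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum over all tables (same margins) no more likely than the observed one.
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs * (1 + 1e-9))

def bh_fdr(pvals: list) -> list:
    """Benjamini-Hochberg q-values for a list of p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q

# A 0/10 vs 10/10 internalization split (an API-channel-only pattern)
# is significant at p < 0.001, matching the table entries.
```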
504  
505  \section{Cross-Provider Validation (Full Results)}
506  \label{app:cross_provider}
507  
508  \begin{table*}[t]
509  \caption{Cross-provider validation results (OpenRouter $\rightarrow$ CometAPI, $N=10$ trials each). All discrepancies trace to API quota contamination, not model variation.}
510  \label{tab:cross_provider}
511  \centering
512  \small
513  \setlength{\tabcolsep}{4pt}
514  \begin{tabular}{llcccl}
515  \toprule
516  \textbf{Model} & \textbf{Format} & \textbf{OR \(D_1\)} & \textbf{Comet \(D_1\)} & \textbf{$\Delta$} & \textbf{Interpretation} \\
517  \midrule
518  Grok 3            & pythonic & 0.60 & 0.00 & $-0.60$ & OR contaminated (402) \\
519  Claude 3.7 Sonnet & text\_xml & 0.60 & 0.00 & $-0.60$ & OR contaminated (402) \\
520  Claude 3.7 Sonnet & pythonic & 1.00 & 0.00 & $-1.00$ & OR contaminated (402) \\
521  Llama 4 Maverick  & pythonic & 1.00 & 0.00 & $-1.00$ & OR contaminated (402) \\
522  Qwen 3 235B       & text\_xml & 1.00 & 0.00 & $-1.00$ & OR contaminated (402) \\
523  Qwen 3 235B       & pythonic & 1.00 & 0.00 & $-1.00$ & OR contaminated (402) \\
524  Gemini 2.5 Pro    & text\_xml & 0.90 & 0.50 & $-0.40$ & OR partially contaminated \\
525  Gemini 2.5 Pro    & pythonic & 1.00 & 0.10 & $-0.90$ & OR contaminated (402) \\
526  MiniMax M2.5      & text\_xml & 0.80 & 0.00 & $-0.80$ & OR contaminated (402) \\
527  MiniMax M2.5      & pythonic & 1.00 & 0.00 & $-1.00$ & OR contaminated (402) \\
528  DeepSeek R1       & text\_xml & 1.00 & 0.00 & $-1.00$ & Quota contaminated \\
529  GLM-5             & native\_api & $\geq$0.90 & 0.80 & $\leq-0.10$ & Partial correction \\
530  GLM-5             & text\_xml   & $\geq$0.90 & 0.80 & $\leq-0.10$ & Partial correction \\
531  GLM-5             & pythonic    & $\geq$0.90 & 0.00 & $\leq-0.90$ & Major correction \\
532  \bottomrule
533  \end{tabular}
534  \end{table*}
535  
536  \section{Prompt Sensitivity and Temperature Robustness}
537  \label{app:prompt_pilot}
538  
539  \textbf{Prompt variant study.} Selected format-invariant models were re-probed using three prompt variants: baseline, minimal (bare tool descriptions), and emphatic (explicit tool-use emphasis).
540  
541  \textbf{Kimi K2} is format-invariant under baseline and emphatic prompts (\(D_1=0.00\) across all 3 formats) but loses pythonic comprehension under the minimal prompt (pythonic \(D_1=1.00\), $N=10$). This suggests pythonic invariance depends on prompt-level scaffolding.
542  
543  \textbf{Temperature robustness.} Kimi K2 at $T=1.0$ ($N=5$ per format) achieves \(D_1=0.00\) across all three formats---identical to $T=0.0$. High temperature does not disrupt tool delegation decisions.
544  
545  \section{Prompt Sensitivity Strategy Profiles}
546  \label{app:prompt_sensitivity}
547  
548  \begin{table}[h]
549  \centering
550  \small
551  \caption{Prompt sensitivity strategy profiles (6 models $\times$ 3 conditions $\times$ $N=5$).}
552  \label{tab:prompt_sensitivity}
553  \begin{tabular}{llcccc}
554  \toprule
555  \textbf{Model} & \textbf{Lab} & \textbf{Neutral} & \textbf{Encour.} & \textbf{Discour.} & \textbf{Profile} \\
556  \midrule
557  GPT-4o          & OpenAI   & 0.000 & 0.000 & 0.000 & Hardwired ext. \\
558  Gemini 2.5 Flash& Google   & 0.000 & 0.000 & 0.000 & Hardwired ext. \\
559  DeepSeek V3.1   & DeepSeek & 0.000 & 0.000 & 0.000 & Hardwired ext. \\
560  Claude Sonnet 4.5& Anthropic& 0.000 & 0.000 & 1.000 & Adaptive strat. \\
561  GPT-4.1         & OpenAI   & 0.200 & 0.000 & 1.000 & Adaptive strat. \\
562  o3-mini         & OpenAI   & 1.000 & 1.000 & 1.000 & Hardwired int. \\
563  \bottomrule
564  \end{tabular}
565  \end{table}
566  
567  \section{Boundary Validation (5 Additional Models)}
568  \label{app:validation}
569  
570  \begin{table}[h]
571  \centering
572  \small
573  \caption{Boundary validation---5 additional models. All 45 trials yielded binary \(D_1\).}
574  \label{tab:validation}
575  \begin{tabular}{llllccl}
576  \toprule
577  \textbf{Model} & \textbf{Lab} & \textbf{Tier} & \textbf{\(D_1\)} & \textbf{$N$} & \textbf{Recall} & \textbf{Strategy} \\
578  \midrule
579  GPT-4.1       & OpenAI & Flagship  & 0.100 & 10 & 10/10 & Adaptive strat. \\
580  GPT-4.1-mini  & OpenAI & Mid-tier  & 0.000 & 10 & 10/10 & Hardwired ext. \\
581  GPT-4.1-nano  & OpenAI & Budget    & 0.000 & 10 & 10/10 & Hardwired ext. \\
582  o4-mini       & OpenAI & Reasoning & 1.000 & 5  & 8/10  & Internalizer \\
583  Flash Lite 2.0& Google & Budget    & 0.000 & 10 & 10/10 & Hardwired ext. \\
584  \bottomrule
585  \end{tabular}
586  \end{table}
587  
588  \section{Multi-Evaluator Probe Design Ensemble}
589  \label{app:multi_eval}
590  
591  To address the threat that \(D_1\) findings reflect probe design bias rather than model behavior, we commissioned six frontier models---each of which independently reviewed this paper---to design their own alternative \(D_1\) probes. The designs were collected blind via a structured prompt.
592  
593  \subsection{Participating Evaluators}
594  
595  \begin{table}[h]
596  \centering
597  \small
598  \begin{tabular}{lllll}
599  \toprule
600  \textbf{Model} & \textbf{Lab} & \textbf{Probe Name} & \textbf{Domain} & \textbf{Items} \\
601  \midrule
602  GPT-4.1        & OpenAI   & Procedural Step Ext.    & Baking recipe     & 10 \\
603  Claude Sonnet 4& Anthropic& Sequential Task Mgmt.   & Calculations      & 10 \\
604  Gemini 2.5 Pro & Google   & Project Req.\ Synthesis & Software req.     & 10 \\
605  DeepSeek R1    & DeepSeek & Historical Event Ext.   & Historical events & 12 \\
606  Grok 3         & xAI      & Narrative State Ext.    & Story elements    & 10 \\
Qwen 3 235B    & Alibaba  & Procedural Checkpointing& Electrical troubleshooting & 8 \\
608  \bottomrule
609  \end{tabular}
610  \end{table}
611  
612  \subsection{Convergent Design Improvements}
613  
614  Six independent frontier models converged on the same set of critiques and improvements---without coordination: session-specific items (not pre-trained facts), procedural/sequential tasks, partial/graded recall scoring, domain-relevant distractions, structured tool affordances, and coherence-based recall.
615  
616  \subsection{Ensemble Probe Validation}
617  
618  \begin{table}[h]
619  \centering
620  \small
621  \caption{Ensemble \(D_{1e}\) probe validated against 10 models. The ensemble reveals a three-way split: stable externalizers, probe-sensitive models, and true internalizers.}
622  \label{tab:ensemble}
623  \begin{tabular}{llcccc}
624  \toprule
625  \textbf{Model} & \textbf{Lab} & \textbf{Base \(D_1\)} & \textbf{\(D_{1e}\)} & \textbf{Ext.\ Rate} & \textbf{$N$} \\
626  \midrule
627  Gemini Flash 2.0 & Google  & 0.00 & 0.000 & 1.00 & 2 \\
628  GPT-4o           & OpenAI  & 0.00 & 0.000 & 1.00 & 2 \\
629  GPT-4.1 Mini     & OpenAI  & 0.00 & 0.003 & 1.00 & 5 \\
630  Claude 3.5 Haiku & Anthrop.& 0.00 & 0.015 & 1.00 & 5 \\
631  GPT-4o-mini      & OpenAI  & 0.20 & 0.054 & 1.00 & 5 \\
632  \textbf{o3-mini} & \textbf{OpenAI} & \textbf{1.00} & \textbf{0.168} & \textbf{0.75} & \textbf{2} \\
633  \textbf{Llama 4 Scout} & \textbf{Meta} & \textbf{0.50} & \textbf{0.200} & \textbf{1.00} & \textbf{5} \\
634  DeepSeek R1      & DeepSeek& 0.00 & 0.010 & 1.00 & 2 \\
635  Gemma 3 27B      & Google  & 1.00 & 1.000 & 0.00 & 5 \\
636  Phi-4            & Micros. & 1.00 & 1.000 & 0.00 & 5 \\
637  \bottomrule
638  \end{tabular}
639  \end{table}
640  
641  \textbf{Revised claim}: The externalization boundary is real but \emph{probe-parameterized}. The baseline probe's use of pre-trained facts creates a confound: models that can recall items from training appear to be internalizers when they are actually capable externalizers. The ensemble probe reveals that the boundary separates models with tool-calling \emph{capability} from those without it---a more precise deployment gate.
642  
643  \subsection{LLM-as-Judge Inter-Rater Reliability}
644  \label{app:llm_judge}
645  
646  To address the single-evaluator limitation (deterministic probe scoring), we conducted a blinded LLM-as-judge study. Response transcripts from 85 model$\times$format cells were sanitized using a 38-rule pipeline that strips model self-identification, normalizes tool-call formats, removes chain-of-thought markers, and collapses formatting signatures (detailed in released code). Three frontier models from independent laboratories---Claude Sonnet 4 (Anthropic), GPT-4o (OpenAI), and Gemini 2.0 Flash (Google)---scored each sanitized response on a 5-point Likert scale using the \(D_1\) rubric (Appendix~\ref{app:rubrics}). Judges received no model identity information; a blinding validation probe confirmed mean identity-leak score $\leq 0.25$ across all sanitized records.
647  
648  \textbf{Results}: Inter-judge agreement reached moderate levels: Claude Sonnet vs.\ Gemini Flash $\kappa = 0.46$ (moderate), Claude Sonnet vs.\ GPT-4o $\kappa_w = 0.44$ (moderate), GPT-4o vs.\ Gemini Flash $\kappa = 0.21$ (fair). Two of three judges exhibited bimodal score distributions (59.5\% and 54.1\% extreme scores), mirroring the bimodal \(D_1\) distribution in the deterministic probe.
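
The pairwise agreement statistics above can be reproduced from the judges' score vectors with a short Cohen's kappa sketch in Python (linear disagreement weights for $\kappa_w$; the score vectors here are placeholders, not study data):

```python
def cohens_kappa(r1, r2, k=5, weighted=False):
    """Cohen's kappa between two raters scoring on a 1..k Likert scale.
    With weighted=True, uses linear disagreement weights (kappa_w)."""
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    obs = [[0] * k for _ in range(k)]                 # observed confusion counts
    for a, b in zip(r1, r2):
        obs[a - 1][b - 1] += 1
    m1 = [sum(row) for row in obs]                    # rater-1 marginals
    m2 = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # rater-2 marginals
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1) if weighted else float(i != j)
            num += w * obs[i][j]                      # observed disagreement
            den += w * m1[i] * m2[j] / n              # chance disagreement
    return 1.0 - num / den if den else 1.0

# Perfect agreement gives kappa = 1.0; chance-level agreement gives ~0.
```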
649  
\textbf{Criterion validity gap}: Probe-judge correlation was low ($r = -0.05$, binary agreement 47\%). Investigation revealed a structural explanation: the deterministic probe scores the \emph{full behavioral loop} (externalize $\rightarrow$ distraction $\rightarrow$ retrieve $\rightarrow$ recall), while LLM judges evaluated sanitized first-turn transcripts capturing only the \emph{externalization attempt} (whether \texttt{save\_notes} was invoked). Both \(D_1{=}0\) and \(D_1{=}1\) trials frequently contain tool invocations---the difference is recall \emph{success} after five distraction turns, which requires multi-turn evaluation context that transcript-based judging cannot provide. This confirms that \(D_1\) captures a behavioral outcome (tool-mediated recall fidelity) rather than a behavioral intention (tool invocation propensity), reinforcing the probe's construct validity while identifying an inherent limitation of transcript-based LLM-as-judge evaluation for multi-turn behavioral dimensions.
651  
652  \subsection{LLM Judge \(D_1\) Rubric}
653  \label{app:rubrics}
654  
655  Judges scored each sanitized response on a 5-point Likert scale:
656  
657  \begin{description}
658  \item[1 — Full internalization] No \texttt{save\_notes} invocations. All items retained in context only (score 1 regardless of recall accuracy).
659  \item[2 — Token gesture] 1--2 items saved reflexively; no systematic externalization.
660  \item[3 — Partial] 3--7 items saved, OR all items saved but no \texttt{read\_notes} at recall.
661  \item[4 — Near-full] 8--9 items saved AND \texttt{read\_notes} invoked before recall.
662  \item[5 — Complete] All 10 items saved AND \texttt{read\_notes} invoked; fully tool-mediated.
663  \end{description}
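
The rubric is mechanical enough to state as a function. A Python sketch (the case of 8--9 items saved without \texttt{read\_notes}, which the rubric does not name explicitly, is mapped to score 3 here as our assumption):

```python
def judge_likert_d1(items_saved: int, read_notes_invoked: bool, total: int = 10) -> int:
    """Map observed externalization behavior to the 5-point judge rubric."""
    if items_saved == 0:
        return 1  # full internalization
    if items_saved <= 2:
        return 2  # token gesture
    if items_saved == total and read_notes_invoked:
        return 5  # complete, fully tool-mediated
    if items_saved >= total - 2 and read_notes_invoked:
        return 4  # near-full: 8-9 saved with retrieval before recall
    return 3      # partial: 3-7 saved, or saved without read_notes at recall
```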
664  
\section{$N \geq 30$ Replication Campaign with Dead-Trial Filtering}
666  \label{app:replication}
667  
To address the core study's $N=5$ power limitation (which detects only effects of Cohen's $d > 1.8$), we conducted a large-scale replication campaign targeting $N=30$ raw trials per model$\times$format cell across all 48 models from 22 laboratories. This campaign introduced dead-trial detection (Section~\ref{sec:format_method}), revealing that 64\% of raw trials (3,804 of 5,905) had to be excluded as infrastructure failures (API errors, credit exhaustion, rate limiting) that produced \(D_1=1.0\) through non-behavioral mechanisms. An additional 43 trials were classified as genuine internalizers: models that engaged at {>}500 tokens but chose not to save, and therefore receive a valid behavioral measurement of \(D_1=1.0\). After filtering, 2,101 valid behavioral trials remain (2,058 live + 43 internalizer), with per-model valid trial counts ranging from 1 to 120.
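
The filtering logic reduces to a three-way trial classification. A Python sketch (the 500-token engagement threshold follows the text; the argument names are illustrative):

```python
def classify_trial(api_error: bool, output_tokens: int, save_calls: int) -> str:
    """Three-way split applied before computing model-level D1.
    'dead' trials are excluded; 'internalizer' trials count as valid D1 = 1.0;
    'live' trials are scored by the full probe (tool use x recall)."""
    if api_error or output_tokens == 0:
        return "dead"          # infrastructure failure, not behavior
    if save_calls == 0:
        # Engaged substantively but chose not to externalize.
        return "internalizer" if output_tokens > 500 else "dead"
    return "live"
```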
669  
\subsection{Dead-Trial-Filtered \(D_1\) Results (48 Models)}
671  \label{app:replication_results}
672  
673  \begin{table*}[t]
674  \caption{Dead-trial-filtered \(D_1\) results (48 models, 22 labs). $N$ values in parentheses indicate live trials after dead trial filtering. Fisher's exact test compares native API vs.\ text XML. Benjamini-Hochberg FDR correction applied across all testable pairs.}
675  \label{tab:replication}
676  \centering
677  \tiny
678  \setlength{\tabcolsep}{3pt}
679  \begin{tabular}{llcccccl}
680  \toprule
681  \textbf{Model} & \textbf{Lab} & \textbf{Native \(D_1\) ($N$)} & \textbf{Text \(D_1\) ($N$)} & \textbf{Pyth.\ \(D_1\) ($N$)} & \textbf{Fisher $p$} & \textbf{FDR $q$} & \textbf{Cluster} \\
682  \midrule
683  \multicolumn{8}{l}{\textbf{Format-invariant (26 models)}} \\
684  GPT-4o               & OpenAI    & 0.000 (10) & 0.000 (10) & 0.000 (10) & 1.000  & 1.000 & Format-invariant \\
GPT-4.1              & OpenAI    & 0.000 (10) & 0.000 (10) & 0.000 (9)  & 1.000  & 1.000 & Format-invariant \\
GPT-4o-mini          & OpenAI    & 0.067 (15) & 0.000 (15) & 0.000 (15) & 1.000  & 1.000 & Format-invariant \\
Gemini Flash 2.0     & Google    & 0.000 (15) & 0.000 (15) & 0.000 (15) & 1.000  & 1.000 & Format-invariant \\
Kimi K2              & Moonshot  & 0.000 (10) & 0.000 (10) & 0.000 (10) & 1.000  & 1.000 & Format-invariant \\
Mistral Large        & Mistral   & 0.000 (15) & 0.000 (15) & ---        & 1.000  & 1.000 & Format-invariant \\
Mixtral 8x22B        & Mistral   & 0.000 (10) & 0.000 (10) & ---        & 1.000  & 1.000 & Format-invariant \\
Qwen 2.5 72B         & Alibaba   & 0.000 (15) & 0.000 (15) & ---        & 1.000  & 1.000 & Format-invariant \\
Qwen 3.5 397B        & Alibaba   & 0.000 (10) & 0.000 (10) & 0.000 (10) & 1.000  & 1.000 & Format-invariant \\
MiMo Flash           & Xiaomi    & 0.000 (10) & 0.000 (10) & ---        & 1.000  & 1.000 & Format-invariant \\
Seed 2.0 Lite        & ByteDance & 0.000 (10) & 0.000 (10) & ---        & 1.000  & 1.000 & Format-invariant \\
Seed 2.0 Pro         & ByteDance & 0.000 (10) & 0.000 (10) & 0.100 (10) & 1.000  & 1.000 & Format-invariant \\
Grok 3               & xAI       & 0.000 (10) & 0.000 (10) & 0.040 (25) & 1.000  & 1.000 & Format-invariant \\
GLM-4.7              & Zhipu     & 0.000 (10) & ---        & 0.000 (10) & ---    & ---   & Format-invariant \\
DeepSeek R1          & DeepSeek  & 0.000 (10) & 0.125 (16) & ---        & 0.508  & 1.000 & Format-invariant \\
DeepSeek V3          & DeepSeek  & 0.044 (45) & 0.089 (45) & ---        & 0.677  & 1.000 & Format-invariant \\
ERNIE 4.5            & Baidu     & ---        & 0.000 (20) & 0.091 (11) & ---    & ---   & Format-invariant \\
Longcat Flash        & Meituan   & ---        & 0.000 (10) & ---        & ---    & ---   & Format-invariant \\
Phi-4                & Microsoft & ---        & 0.000 (10) & ---        & ---    & ---   & Format-invariant \\
Qwen 3 235B          & Alibaba   & 0.100 (10) & 0.167 (12) & 0.091 (11) & 1.000  & 1.000 & Format-invariant \\
MiniMax M2.5         & MiniMax   & 0.000 (10) & 0.130 (23) & 0.167 (18) & 0.536  & 1.000 & Format-invariant \\
Seed 1.6             & ByteDance & 0.333 (6)  & 0.100 (10) & ---        & 0.518  & 1.000 & Format-invariant \\
Llama 3.3 70B        & Meta      & 0.278 (18) & 0.182 (11) & 0.000 (10) & 0.677  & 1.000 & Format-invariant \\
Nova Pro             & Amazon    & 0.000 (10) & 0.000 (10) & 0.154 (13) & 1.000  & 1.000 & Format-invariant \\
QwQ-32B              & Alibaba   & 0.077 (13) & 0.000 (10) & ---        & 1.000  & 1.000 & Format-invariant \\
Mistral 7B Instruct\footnotemark[5] & Mistral & --- & 0.000 (10) & 0.000 (10) & ---  & --- & Format-invariant \\
\midrule
\multicolumn{8}{l}{\textbf{API-channel-only (6 models)}} \\
Claude Opus 4.6      & Anthropic & 0.000 (10) & 1.000 (10) & 1.000 (10) & $<$0.001 & $<$0.001 & API-channel-only \\
Claude Sonnet 4.6    & Anthropic & 0.000 (10) & 0.900 (10) & 0.000 (10) & $<$0.001 & $<$0.001 & API-channel-only \\
Claude 3.5 Haiku     & Anthropic & 0.000 (15) & 1.000 (15) & 1.000 (15) & $<$0.001 & $<$0.001 & API-channel-only \\
Hunyuan A13B         & Tencent   & 0.000 (10) & 1.000 (10) & 1.000 (10) & $<$0.001 & $<$0.001 & API-channel-only \\
GPT-4.1 Mini         & OpenAI    & 0.000 (40) & 0.875 (40) & 0.975 (40) & $<$0.001 & $<$0.001 & API-channel-only \\
GLM-4.5              & Zhipu     & 0.000 (10) & 1.000 (8)  & 0.000 (10) & $<$0.001 & $<$0.001 & API-channel-only \\
\midrule
\multicolumn{8}{l}{\textbf{Text-channel (6 models)}} \\
Claude Sonnet 4      & Anthropic & 1.000 (10) & 0.000 (10) & 0.000 (10) & $<$0.001 & $<$0.001 & Text-channel \\
o3-mini              & OpenAI    & 0.955 (44) & 0.200 (45) & ---        & $<$0.001 & $<$0.001 & Text-channel \\
Jamba Large          & AI21      & 1.000 (10) & 0.000 (10) & ---        & $<$0.001 & $<$0.001 & Text-channel \\
Llama 4 Maverick     & Meta      & 0.800 (10) & 0.000 (10) & 0.167 (12) & $<$0.001 & 0.003   & Text-channel \\
GLM-5                & Zhipu     & 0.882 (17) & 0.800 (10) & 0.091 (11) & 0.613  & 1.000   & Text-channel \\
Qwen 2.5 7B\footnotemark[5] & Alibaba & 0.600 (10) & 0.000 (10) & 0.000 (10) & 0.011 & 0.033 & Text-channel \\
\midrule
\multicolumn{8}{l}{\textbf{Stochastic (7 models)}} \\
Claude 3.7 Sonnet    & Anthropic & 0.500 (10) & 0.040 (25) & 0.000 (20) & 0.004  & 0.015   & Stochastic \\
Llama 4 Scout        & Meta      & 0.600 (15) & 0.933 (15) & 0.267 (15) & 0.080  & 0.240   & Stochastic \\
Hunyuan T1           & Tencent   & 0.400 (5)  & 0.667 (9)  & ---        & 0.580  & 1.000   & Stochastic \\
Command R+           & Cohere    & 0.500 (20) & 1.000 (10) & 0.417 (12) & 0.038  & 0.120   & Stochastic \\
Gemma 3 27B          & Google    & ---        & 0.400 (20) & 0.400 (15) & ---    & ---     & Stochastic \\
Gemini 2.5 Pro       & Google    & 1.000 (10) & 0.593 (27) & 0.182 (11) & 0.018  & 0.059   & Stochastic \\
R1-Distill-Llama-70B & DeepSeek  & ---        & 0.300 (10) & ---        & ---    & ---     & Stochastic \\
\midrule
\multicolumn{8}{l}{\textbf{Tool-incompatible (3 models)}} \\
R1-Distill-Qwen-32B  & DeepSeek  & ---        & 1.000 (16) & ---        & ---    & ---     & Tool-incompatible \\
Gemma 2 27B          & Google    & ---        & 0.933 (15) & 1.000 (15) & ---    & ---     & Tool-incompatible \\
Step Flash           & StepFun   & 1.000 (10) & ---        & ---        & ---    & ---     & Tool-incompatible \\
\bottomrule
\end{tabular}
\end{table*}
\footnotetext[5]{Tested via local MLX inference (Apple Silicon) rather than cloud API, demonstrating $D_1$ metric generalizability across inference providers.}

\subsection{Key Findings from the $N{\geq}30$ Campaign}

Five key findings emerge: (1) \textbf{Pervasive dead trials}: 71\% of raw $N{=}30$ trials were dead (saves $= 0$ and items\_recalled $= 0$), inflating $D_1$ toward 1.0 through API failures. (2) \textbf{Binary to spectrum}: after filtering, the five-cluster taxonomy reveals finer structure---7 stochastic models occupy the $0.04$--$0.93$ range, replacing the binary narrative. (3) \textbf{Format sensitivity survives filtering}: 10 models remain significantly format-sensitive after FDR correction ($q < 0.05$). (4) \textbf{Asymmetric cluster shifts}: of the 9 models that changed cluster between the $N{=}5$ and $N{=}30$ analyses, 6 shifted toward externalization (lower $D_1$), suggesting $N{=}5$ overestimates internalization. (5) \textbf{$N{=}5$ directionally correct}: 83\% of model$\times$format cells show $|\Delta| \leq 0.10$ vs.\ the $N{=}5$ baseline, validating the original study's direction if not its precision.

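The dead-trial filter in finding (1) reduces to a few lines. The sketch below is illustrative, not released analysis code; the dict field names (\texttt{saves}, \texttt{items\_recalled}, \texttt{d1}) mirror the filtering rule stated above.

```python
def filter_dead_trials(trials):
    """Drop dead trials (saves == 0 and items_recalled == 0), which
    reflect API failures rather than genuine internalization."""
    return [t for t in trials
            if not (t["saves"] == 0 and t["items_recalled"] == 0)]

def mean_d1(trials):
    """Model-level D1: mean of per-trial D1 scores over valid trials."""
    valid = filter_dead_trials(trials)
    return sum(t["d1"] for t in valid) / len(valid) if valid else None

trials = [
    {"saves": 0, "items_recalled": 0, "d1": 1.0},   # dead trial: excluded
    {"saves": 3, "items_recalled": 10, "d1": 0.0},  # externalized
    {"saves": 0, "items_recalled": 10, "d1": 1.0},  # internalized
]
```

Without the filter, the dead trial above would count toward $D_1{=}1$; with it, the model-level mean is computed over the two valid trials only.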
\section{Prompt Sensitivity Study (Expanded)}
\label{app:prompt_expanded}

\begin{table}[h]
\centering
\small
\caption{Expanded prompt sensitivity ($N=30$--$73$ per model). The externalization boundary is prompt-robust for 4/6 models.}
\label{tab:prompt_expanded}
\begin{tabular}{lcccccc}
\toprule
\textbf{Model} & \textbf{$N$} & \textbf{Neutral} & \textbf{Encour.} & \textbf{Discour.} & \textbf{Spread} & \textbf{Pattern} \\
\midrule
GPT-4o           & 51 & 0.000 & 0.000 & 0.000 & 0.000 & Prompt-invariant \\
Gemini 2.0 Flash & 73 & 0.000 & 0.000 & 0.000 & 0.000 & Prompt-invariant \\
DeepSeek V3      & 30 & 0.000 & 0.000 & 0.000 & 0.000 & Prompt-invariant \\
Claude Sonnet 4  & 35 & 0.000 & 0.000 & 1.000 & 1.000 & Prompt-responsive \\
o3-mini          & 60 & 1.000 & 1.000 & 1.000 & 0.000 & Prompt-invariant \\
Llama 4 Scout    & 37 & 1.000 & 1.000 & 0.800 & 0.200 & Borderline \\
\bottomrule
\end{tabular}
\end{table}

\textbf{Key findings}: (1) The externalization boundary is prompt-robust for 4/6 models at $N=30$--$73$. (2) Claude Sonnet 4 is uniquely prompt-responsive ($D_1$ spread $= 1.0$)---the only model exhibiting complete strategy switching, replicated across 3 independent runs. (3) Binary behavior persists at expanded $N$.

\section{Activation Probing and Mechanistic Predictions}
\label{app:activation}

\subsection{Testable Predictions}

Each mechanistic hypothesis (Section~\ref{sec:discussion}) generates distinct predictions requiring access currently unavailable to external researchers.

\begin{table}[h]
\caption{Mechanistic predictions summary.}
\label{tab:predictions}
\centering
\small
\begin{tabular}{lp{2.8cm}p{2.2cm}}
\toprule
\textbf{Hyp.} & \textbf{Key Prediction} & \textbf{Data Required} \\
\midrule
A & LoRA on 1K--5K demos crosses boundary & Training access \\
B & Reward models prefer tool-using responses & Reward model access \\
C & Ablating principles degrades \(D_1\) only & Pipeline access \\
D & Format-dependent attractor depth predicts switching & Extended format data \\
D (mech.) & Late-layer activation norms diverge more in full-pipeline models & MLX activation probing (Table~\ref{tab:activation_probing}) \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Preliminary Mechanistic Evidence: Activation Probing}

To provide initial mechanistic evidence for the competing attractor hypothesis (Hypothesis~D), we conducted activation probing \citep{belinkov2022probing} on seven open-weight models using MLX on Apple Silicon (M3 Ultra, 256~GB). For each model, we captured per-layer hidden-state norms on the \(D_1\) probe stimulus (10-item factual recall with tool access) and compared activations between tool-present and tool-absent conditions. Peak divergence is computed as \(\frac{|\|h_{\text{with}}\| - \|h_{\text{without}}\||}{(\|h_{\text{with}}\| + \|h_{\text{without}}\|)/2} \times 100\).

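As a concrete reading of this formula, the per-layer scan can be sketched as follows. This is an illustrative implementation, not the measurement harness itself, and it assumes the Position column of Table~\ref{tab:activation_probing} is the 1-indexed peak layer divided by total layer count.

```python
def layer_divergence(norm_with, norm_without):
    """Symmetric percent difference between hidden-state norms in the
    tool-present vs. tool-absent conditions at one layer."""
    mean = (norm_with + norm_without) / 2
    return abs(norm_with - norm_without) / mean * 100

def peak_divergence(norms_with, norms_without):
    """Scan all layers; return (peak divergence %, 1-indexed peak layer,
    depth % = layer / n_layers)."""
    divs = [layer_divergence(a, b) for a, b in zip(norms_with, norms_without)]
    idx = max(range(len(divs)), key=divs.__getitem__)
    layer = idx + 1
    return divs[idx], layer, round(100 * layer / len(divs))
```

For a 32-layer model with its largest norm gap at layer 31, this yields the 97\% depth figure reported for Llama 3 8B.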
\begin{table}[h]
\caption{Activation divergence across seven open-weight models. All models show peak divergence in late layers (56--97\% depth), consistent with the tool delegation decision occurring after early feature extraction.}
\label{tab:activation_probing}
\centering
\small
\begin{tabular}{llcccc}
\toprule
\textbf{Model} & \textbf{\(D_1\) Group} & \textbf{Layers} & \textbf{Peak} & \textbf{Peak $\Delta$ (\%)} & \textbf{Position} \\
\midrule
Llama 3 8B Instruct      & A (\(D_1{=}0\)) & 32 & 31 & 32.8 & 97\% \\
Llama 3.3 70B Instruct   & A (\(D_1{=}0\)) & 80 & 49 & 14.4 & 61\% \\
Phi 3.5 Mini             & A (\(D_1{=}0\)) & 32 & 18 & 3.2  & 56\% \\
Qwen 2.5 7B Instruct     & B (\(D_1{=}1\)) & 28 & 19 & 21.3 & 68\% \\
Qwen 2.5 3B Instruct     & B (\(D_1{=}1\)) & 36 & 29 & 17.6 & 81\% \\
Llama 3.2 3B Instruct    & B (\(D_1{=}1\)) & 28 & 17 & 12.5 & 61\% \\
Mistral 7B v0.3 Instruct & B (\(D_1{=}1\)) & 32 & 29 & 6.7  & 91\% \\
\bottomrule
\end{tabular}
\end{table}

Three patterns emerge. First, \textbf{all seven models show late-layer divergence}: peak activation norm differences occur at 56--97\% of network depth. Second, \textbf{divergence magnitude is heterogeneous}: the 10$\times$ range (3.2--32.8\%) suggests architecture and training, not just scale, determine differentiation strength. Third, \textbf{scale-dependent attenuation}: Llama 3 8B shows 32.8\% divergence at layer 31/32, while the 70B variant shows 14.4\% at layer 49/80---consistent with larger models distributing the tool delegation computation across more layers.

These results are preliminary (7 models, single probe stimulus, activation norms rather than causal interventions) and should be interpreted as motivation for mechanistic interpretability, not evidence for any specific hypothesis.

\subsection{Logit Lens Analysis}
\label{app:logit_lens}

To understand \emph{when} the tool delegation decision forms during processing, we applied the logit lens technique \citep{nostalgebraist2020logitlens}: at each transformer layer, the residual stream is projected through the final LayerNorm and unembedding matrix to obtain a vocabulary distribution. We tracked the probability of the \texttt{<tool\_call>} token---the text-XML tool invocation marker present in the system prompt---across all layers at the decision point (the last input token before generation begins).

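A minimal logit-lens pass over cached hidden states can be sketched as below; the module names (\texttt{final\_norm}, \texttt{lm\_head}) are generic stand-ins for whichever final normalization and unembedding modules a given architecture exposes, not the exact code used here.

```python
import torch

@torch.no_grad()
def logit_lens_probs(hidden_states, final_norm, lm_head, token_id):
    """Project each layer's residual stream through the final LayerNorm and
    the unembedding matrix, returning P(token_id) at the last input position
    (the decision point, just before generation begins) for every layer."""
    probs = []
    for h in hidden_states:                  # each h: [batch, seq, d_model]
        logits = lm_head(final_norm(h[:, -1, :]))
        probs.append(logits.softmax(dim=-1)[0, token_id].item())
    return probs
```

With a Hugging Face-style model, \texttt{hidden\_states} would come from a forward pass with \texttt{output\_hidden\_states=True}.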
\begin{table}[h]
\caption{Logit lens: \texttt{<tool\_call>} token probability across layers for six open-weight models. Peak probability does \emph{not} predict $D_1$ outcome---three of four internalizers form strong tool-call representations that they do not act upon.}
\label{tab:logit_lens}
\centering
\small
\begin{tabular}{llcccc}
\toprule
\textbf{Model} & \textbf{$D_1$} & \textbf{Layers} & \textbf{Peak Prob} & \textbf{Peak} & \textbf{Final Prob} \\
\midrule
\multicolumn{6}{l}{\textbf{Group A ($D_1{=}0$, externalizers)}} \\
Phi 3.5 Mini   & 0 & 32 & 0.01\% & L11 (34\%) & 0.00\% \\
Llama 3 8B     & 0 & 32 & 2.64\% & L29 (91\%) & 0.33\% \\
\midrule
\multicolumn{6}{l}{\textbf{Group B ($D_1{=}1$, internalizers)}} \\
Qwen 2.5 3B    & 1 & 36 & $<$0.01\% & --- & 0.00\% \\
Mistral 7B     & 1 & 32 & 29.4\% & L30 (94\%) & 22.3\% \\
Llama 3.2 3B   & 1 & 28 & 97.2\% & L25 (89\%) & 73.3\% \\
Qwen 2.5 7B    & 1 & 28 & 98.8\% & L27 (96\%) & 98.8\% \\
\bottomrule
\end{tabular}
\end{table}

\textbf{Key finding: the externalization boundary is a cycle-completion gate, not a tool-initiation gate.} Three of four $D_1{=}1$ models (Mistral 7B, Llama 3.2 3B, Qwen 2.5 7B) form strong \texttt{<tool\_call>} representations (peak 29--99\%) and maintain them at the final layer (22--99\%). These models will \emph{initiate} tool calls but fail to complete the externalization-retrieval-recall cycle that $D_1$ measures. Only Qwen 2.5 3B is a ``representation-absent'' internalizer that never forms tool-call tokens.

Conversely, the two $D_1{=}0$ models show low \texttt{<tool\_call>} probability ($\leq$2.6\%) even at their peak layer, yet successfully complete the full externalization cycle in practice. This suggests that reliable tool delegation depends on multi-turn execution coherence, not on the strength of single-token tool-call representations.

\textbf{Implications for Hypothesis D (competing attractors).} The data splits $D_1{=}1$ models into two mechanistic categories: (1) \emph{representation-absent} (Qwen 3B)---the tool-use attractor is absent from the residual stream; (2) \emph{representation-present, cycle-incomplete} (Mistral 7B, Llama 3.2 3B, Qwen 7B)---the tool-use attractor exists but the model fails to sustain it across the multi-turn retrieval sequence. Format-dependent basin depth (Hypothesis D) may explain why presentation format can shift models between these categories.

\subsection{Activation Patching}
\label{app:patching}

To establish \emph{causal} evidence for the layer localization suggested by probing and logit lens, we performed activation patching between base and instruct variants of the same architecture for three model families. For each pair, we ran patching in two directions:

\begin{description}
\item[Instruct$\to$Base:] Run the base model but replace its residual stream at layer $L$ with the instruct model's output at layer $L$. Measures which layers carry the tool delegation computation.
\item[Base$\to$Instruct:] Run the instruct model but replace its residual stream at layer $L$ with the base model's output. Measures which layers are necessary for tool delegation.
\end{description}

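The per-layer swap can be rendered as a PyTorch forward hook, as sketched below. This is an illustrative mechanism only (the experiments here ran on the MLX stack), and the \texttt{blocks} argument stands in for whatever list of transformer layers a given model exposes.

```python
import torch

def patch_residual(blocks, layer_idx, cached_source):
    """Replace one layer's output with an activation cached from the other
    model in the pair (instruct->base or base->instruct). `blocks` is the
    model's list of transformer layers; the returned handle must be removed
    after the patched forward pass."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):        # decoder blocks return tuples
            return (cached_source,) + output[1:]
        return cached_source
    return blocks[layer_idx].register_forward_hook(hook)
```

In use, \texttt{cached\_source} is captured from a forward pass of the source model on the same prompt, and the effect is read off as the change in \texttt{<tool\_call>} probability.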
\textbf{Llama~3~8B ($D_1{=}0$, externalizer).} Instruct$\to$Base patching reveals a broad activation zone: injecting instruct activations at layers 12--20 (37--63\% depth) increases \texttt{<tool\_call>} probability from a 0.11\% baseline to 4--9.4\%, with a peak at \textbf{layer~18 (56\% depth, effect $= +9.3$ pp)}. Base$\to$Instruct effects are minimal ($<$0.9\% peak)---the base model never forms tool-call representations.

\textbf{Mistral~7B ($D_1{=}1$, internalizer).} Instruct$\to$Base transfer is \emph{stronger} than for the externalizer: peak at \textbf{layer~23 (72\% depth, effect $= +34.8$ pp)}, monotonically increasing from layers 3--23. The tool delegation computation exists and is more transferable than in Llama~8B. However, Base$\to$Instruct reveals an early-layer suppression mechanism: replacing instruct layer~0 with base \emph{increases} tool-call probability from 22\% to 85\% ($+63$ pp), while replacing any other layer uniformly suppresses it to $\sim$1\% ($-22$ pp). The instruct model's first layer actively inhibits the tool-use attractor that its later layers would otherwise produce.

\textbf{Qwen~2.5~7B ($D_1{=}1$, internalizer).} The pattern reverses strikingly. Base$\to$Instruct patching \emph{at every layer} suppresses the instruct model's strong 98.8\% \texttt{<tool\_call>} representation to $<$1\%, with middle layers 12--14 (43--50\% depth) showing slight resistance (9.6\% survives). Instruct$\to$Base produces negligible effect (peak $+0.002\%$)---the instruct model's tool representations do not transfer.

\textbf{Three mechanistic architectures.} The three pairs reveal distinct structural patterns for the externalization boundary:

\begin{enumerate}
\item \emph{Localized computation} (Llama~8B, $D_1{=}0$): Tool delegation is concentrated in middle layers (37--63\% depth), transferable, and expressed through a two-phase pattern---the decision \emph{forms} at $\sim$56\% depth (patching peak) but \emph{manifests} at $\sim$91\% depth (logit lens peak, consistent with \citealt{meng2022locating}).
\item \emph{Suppression-gated computation} (Mistral~7B, $D_1{=}1$): Tool delegation computation exists and is \emph{stronger} than in the externalizer ($+34.8$ vs.\ $+9.3$ pp transfer), but an early-layer gate (layer~0) actively suppresses it. The model has learned to inhibit tool use despite having the computational capacity for it.
\item \emph{Non-transferable representation} (Qwen~7B, $D_1{=}1$): Tool-call tokens reach 98.8\% probability but are architecture-bound---they do not transfer to the base model. The representation is distributed across the entire network rather than concentrated in specific layers.
\end{enumerate}

This taxonomy refines the logit lens ``cycle-completion gate'' finding: the externalization boundary is not a single mechanism but admits at least three mechanistic variants. For deployment, the practical implication is the same ($D_1$ predicts behavior), but for steering and fine-tuning interventions, the distinction matters: suppression-gated models (type~2) may be more amenable to behavioral modification than non-transferable ones (type~3).

\subsection{Sparse Autoencoder Feature Decomposition}
\label{app:sae}

To decompose the tool delegation computation into interpretable features, we applied Goodfire's open-source sparse autoencoder (SAE) for Llama~3.1~8B~Instruct at layer~19 (65,536 latents, L0$=$91) to the $D_1$ probe stimulus. We compared SAE feature activations between tool-present and tool-absent conditions on the same factual recall task.

\textbf{Extreme sparsity.} Of 65,536 SAE features, only 45 activate for the tool condition and 48 for the no-tool condition ($<$0.07\% each). The tool delegation signal is concentrated in a remarkably small feature set: 40 features (0.06\%) show differential activation $>$0.1, and only 6 features (0.01\%) exceed 0.5 differential.

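The sparsity bookkeeping above reduces to a set comparison over per-feature activation vectors. The sketch below is illustrative (array layout and function name are ours); the threshold matches the 0.1 differential used in the text.

```python
import numpy as np

def compare_sae_features(acts_tool, acts_no_tool, diff_thresh=0.1):
    """Partition SAE latents by condition: active only with tools, active
    only without, active in both, and differentially active above a
    threshold. acts_*: 1-D arrays with one activation per latent."""
    on_t, on_n = acts_tool > 0, acts_no_tool > 0
    return {
        "tool_only": int(np.sum(on_t & ~on_n)),
        "no_tool_only": int(np.sum(~on_t & on_n)),
        "shared": int(np.sum(on_t & on_n)),
        "differential": int(np.sum(np.abs(acts_tool - acts_no_tool) > diff_thresh)),
    }
```

Applied to the full 65,536-latent vectors, this partition yields the 22/25/23 split reported below.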
\textbf{Dominant tool feature.} The strongest tool-enhanced feature (\#58843) activates at 1.88 in the tool condition versus 0.004 without tools---a 498$\times$ differential. This single feature accounts for more activation variance than the next five tool-enhanced features combined.

\textbf{Compositional structure.} Tool delegation is not unitary but decomposes into a sparse set of co-activating features. The 22 tool-only features (active only in the tool condition) and 25 no-tool-only features represent monosemantic computations that switch on or off based on tool availability. Only 23 features are active in both conditions, suggesting minimal overlap between the tool-use and non-tool-use computational pathways at this layer.

\textbf{Implications.} The SAE decomposition complements the activation patching findings (Section~\ref{app:patching}): while patching shows that tool delegation is localized to specific layers, the SAE shows that \emph{within} those layers, the computation is further concentrated in a sparse set of interpretable features. The dominant feature (\#58843) is a candidate monosemantic ``tool-use'' feature whose activation may be necessary for tool delegation. Future work could test this causally by clamping the feature during inference.

\subsection{Steering Vectors}
\label{app:steering}

To test whether tool delegation can be \emph{controlled} via targeted intervention, we applied the steering vector methodology of \citet{li2024inference} and \citet{turner2023activation} to tool delegation. We extracted a ``tool-use direction'' from each base/instruct pair and applied it as a steering vector during inference. For each pair, we computed $\mathbf{d} = \mathbf{h}_{\text{instruct}} - \mathbf{h}_{\text{base}}$ at the critical layer identified by activation patching (Section~\ref{app:patching}), then ran inference with $\mathbf{h}_L \leftarrow \mathbf{h}_L + \alpha \cdot \mathbf{d}$ for $\alpha \in \{-3, -2, -1, -0.5, 0, +0.5, +1, +2, +3\}$.

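The intervention $\mathbf{h}_L \leftarrow \mathbf{h}_L + \alpha \cdot \mathbf{d}$ can be sketched as a forward hook, shown below in PyTorch as an illustrative rendering rather than the exact harness used; \texttt{blocks} again stands in for the model's list of transformer layers.

```python
import torch

def add_steering(blocks, layer_idx, direction, alpha):
    """Shift the residual stream at one layer by alpha * direction, where
    direction = h_instruct - h_base at the critical layer. Returns a hook
    handle; remove it to restore unsteered behavior."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + alpha * direction,) + output[1:]
        return output + alpha * direction
    return blocks[layer_idx].register_forward_hook(hook)
```

Sweeping $\alpha$ over $\{-3, \dots, +3\}$ while recording \texttt{<tool\_call>} probability yields a dose-response curve per model.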
\textbf{Llama~3~8B ($D_1{=}0$, externalizer, layer~18).} Negative steering reliably suppresses tool-call probability: from a 0.33\% baseline to $<$0.01\% at $\alpha \leq -2.0$. Positive steering produces a non-monotonic response, peaking at $\alpha{=}+2.0$ (5.8\%, $+5.5$ pp) before collapsing at $\alpha{=}+3.0$ (0.5\%) as generation degrades to repetition. Notably, the model generates syntactically valid \texttt{<tool\_call>} text at $\alpha{=}0$ and $\alpha{=}+1.0$ despite $<$0.5\% token probability---suggesting that the decision to externalize involves more than initial-token prediction.

\textbf{Mistral~7B ($D_1{=}1$, internalizer, layer~23).} Negative steering cleanly suppresses tool-call probability from 22.2\% to 0.015\% at $\alpha{=}-2.0$---a 1,480$\times$ reduction. However, positive steering does \emph{not} enhance tool use. Instead, it produces a U-shaped response: $\alpha{=}+0.5$ reduces probability to 9.1\%, $\alpha{=}+1.0$ to 2.7\%, and $\alpha{=}+3.0$ to 0.30\%. Both directions of perturbation suppress the behavior, consistent with the suppression-gated architecture identified in Section~\ref{app:patching}: the instruct--base direction at layer~23 disrupts the computation regardless of sign.

\textbf{Qwen~2.5~7B ($D_1{=}1$, internalizer, layer~14).} Steering barely perturbs the 98.8\% baseline. Negative alphas produce only modest suppression: $\alpha{=}-3.0$ reduces probability to 88.7\% ($-10.1$ pp)---an order of magnitude weaker than the 1,480$\times$ reduction achieved for Mistral~7B. Positive alphas have near-zero effect ($<$$+0.4$ pp) until $\alpha{=}+3.0$, where probability crashes to 13.3\% as generation degrades. The model generates tool-call text at \emph{every} alpha value, including $\alpha{=}+3.0$. This confirms the non-transferable architecture from Section~\ref{app:patching}: the tool-call representation is distributed across all layers, making single-layer steering ineffective.

\textbf{Interpretation.} The three-model steering comparison aligns precisely with the mechanistic taxonomy from activation patching:

\begin{enumerate}
\item \emph{Localized computation is steerable asymmetrically} (Llama~8B): Negative steering suppresses tool-call probability 43$\times$ (0.33\%$\to$0.008\%), but positive steering achieves only modest enhancement ($+5.5$ pp). The computation is concentrated enough to disrupt, but enhancement requires coordination that single-layer intervention cannot provide.
\item \emph{Suppression-gated computation is fragile in both directions} (Mistral~7B): Both positive and negative perturbation suppress tool use (U-shaped), with negative achieving 1,480$\times$ reduction. The gating mechanism at layer~0 makes the system brittle to any perturbation of the later-layer computation.
\item \emph{Non-transferable representation resists steering} (Qwen~7B): Even extreme steering ($\alpha{=}-3.0$) produces only $-10.1$ pp reduction from a 98.8\% baseline. The computation is too distributed for single-layer intervention.
\item \emph{No boundary crossing.} No model crosses the $D_1$ bimodal boundary under steering. The externalization boundary is not a single direction that can be ``flipped''---it is a multi-layer computation that resists linear intervention.
\end{enumerate}

This negative result strengthens the paper's central finding: the externalization boundary is a robust structural property of model training, not an artifact of surface-level token preferences that could be trivially manipulated. For alignment, this robustness is encouraging---behavioral discipline, once established through training, is resistant to simple adversarial perturbation.

\section{Mixed-Effects Model and Dimension Correlation}
\label{app:mixed_effects}

To address the concern that models from the same laboratory share training infrastructure---violating the independence assumption of the naive effect size estimate---we fit a mixed-effects model: \texttt{hubris $\sim$ is\_flagship + (1|lab)} using REML via \texttt{statsmodels} MixedLM. Laboratory identity was modelled as a random intercept across 11 labs and 31 models.

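This specification can be reproduced with \texttt{statsmodels} as sketched below. The data frame here is synthetic stand-in data with a planted $-0.42$ flagship effect (the real 31-model frame is not reproduced); only the model call mirrors the analysis above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
labs = rng.choice(list("ABCDEFGHIJK"), size=31)      # 11 labs, 31 models
flagship = rng.integers(0, 2, size=31)
lab_fx = {lab: rng.normal(0.0, 0.1) for lab in sorted(set(labs))}
hubris = (np.array([lab_fx[l] for l in labs])
          - 0.42 * flagship                           # planted flagship effect
          + rng.normal(0.0, 0.05, size=31))
df = pd.DataFrame({"hubris": hubris, "is_flagship": flagship, "lab": labs})

# Random intercept per lab; REML is the statsmodels default for MixedLM.
fit = smf.mixedlm("hubris ~ is_flagship", df, groups=df["lab"]).fit(reml=True)
beta = fit.params["is_flagship"]
```

On the synthetic frame the fitted fixed effect recovers a negative flagship coefficient near the planted value, mirroring the $\hat{\beta} = -0.421$ reported below for the real data.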
\textbf{Primary result.} The fixed effect for flagship status was $\hat{\beta} = -0.421$ ($SE = 0.035$, $z = -11.91$, $p < 0.001$, 95\% CI $[-0.490, -0.351]$). For comparison, the naive Cohen's $d = -4.38$ ($p < 0.001$). The estimated intra-class correlation was $\mathrm{ICC} = 0.463$ (note: four of eleven labs contribute only one model, placing the random-effects variance on the boundary of the parameter space; the fixed effect is stable across multiple optimizers). The flagship effect remains robust after accounting for lab-level clustering.

\textbf{Sensitivity.} Excluding Phi-4 (missing $D_1$): $\hat{\beta} = -0.443$ ($SE = 0.031$, $p < 0.001$). Excluding reasoning-specialized models (o3-mini, R1): $\hat{\beta} = -0.402$ ($SE = 0.020$, $p < 0.001$). Neither exclusion changes the conclusion.

\textbf{Dimension correlation.} $D_1$ and $D_2$ are moderately correlated (Pearson $r = 0.644$, $p < 0.001$; Spearman $\rho = 0.635$, $p < 0.001$; $N = 30$). This reflects a shared underlying construct: models that skip tool delegation ($D_1{=}1$) also tend to show overconfidence ($D_2 > 0$). The composite hubris score intentionally captures this covariance---it measures overall deployment risk, not independent factors. A factor-analytic decomposition is left for future work with larger model samples.

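The two correlation statistics are the standard SciPy ones; the sketch below runs on toy data rather than the 30 real per-model $(D_1, D_2)$ pairs, purely to make the computation explicit.

```python
from scipy import stats

def dimension_correlation(d1, d2):
    """Pearson r and Spearman rho between per-model D1 and D2 scores."""
    r, p_r = stats.pearsonr(d1, d2)
    rho, p_rho = stats.spearmanr(d1, d2)
    return {"pearson": (r, p_r), "spearman": (rho, p_rho)}

# Toy illustration: models that internalize (high D1) also overclaim (high D2).
d1 = [0.0, 0.0, 0.1, 0.5, 1.0, 1.0]
d2 = [0.05, 0.0, 0.2, 0.4, 0.7, 0.9]
```

On the toy data both coefficients are strongly positive; on the real pairs they are the moderate $r = 0.644$ and $\rho = 0.635$ reported above.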
\section{Practical Screening Protocol}
\label{app:screening}

For practitioners evaluating new models for agentic deployment, we propose a tiered screening protocol based on our empirical findings:

\begin{description}
\item[Tier 1 (Quick Screen, $\sim$10 API calls):] Run the $D_1$ probe with $N{=}5$ trials using the deployment format (native API or text). If all 5 trials yield $D_1{=}0.000$, the model is very likely a reliable externalizer. If all 5 yield $D_1{=}1.000$, it is a reliable internalizer. In our data, $N{=}5$ correctly classifies 83\% of model$\times$format cells within $|\Delta| \leq 0.10$ of the $N{=}30$ result.
\item[Tier 2 (Format Check, $\sim$20 API calls):] If Tier 1 shows mixed results (0 $<$ mean $D_1$ $<$ 1), or if the deployment format differs from the tested format, extend to $N{=}10$ across both native API and text formats. This catches format-sensitive models (28\% of our sample).
\item[Tier 3 (Full Profile, $\sim$50 API calls):] For critical deployments, run the full 5-dimension probe ($D_1$--$D_5$, $N{=}5$) to assess composite discipline. This catches reasoning-paradox models that externalize but show high overconfidence ($D_2$) or poor tool metacognition ($D_3$).
\end{description}

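The Tier~1 decision rule reduces to a few lines; the sketch below is illustrative, with per-trial $D_1$ scores (each 0.0 or 1.0, per the bimodality result) as input.

```python
def tier1_screen(d1_trials):
    """Tier 1 quick screen over N=5 per-trial D1 scores collected in the
    deployment format. Returns a classification or an escalation to Tier 2."""
    if all(t == 0.0 for t in d1_trials):
        return "reliable externalizer"
    if all(t == 1.0 for t in d1_trials):
        return "reliable internalizer"
    return "mixed: run Tier 2 (N=10, both native API and text formats)"
```

A mixed result, or a deployment format that differs from the tested one, escalates to the Tier 2 format check.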
\textbf{Cost estimate:} Tier 1 costs $<$\$0.50 for most models at current API pricing (5 trials $\times$ $\sim$2K tokens/trial). The full 44-model $\times$ 3-format $\times$ $N{=}30$ campaign cost approximately \$2,400.

\end{document}