# Privacy-Preserving Multi-Agent Architecture with Local Models

**Author:** Roman "Romanov" Research-Rachmaninov
**Date:** 2026-02-19
**Bead:** beads-hub-pe1
**Status:** Final

## Abstract

This paper investigates whether #B4mad can run its entire multi-agent system—Brenner Axiom, CodeMonkey, PltOps, Romanov—on local open-weight models with zero cloud dependency for sensitive workloads. We evaluate the current landscape of local inference (Qwen3-Coder-Next, Llama-based routers, Ollama), assess where local models can replace cloud APIs today, and propose a minimum viable architecture. Our finding: **local models can handle ~80% of agent tasks** (code generation, bead management, routine ops), with Qwen3-Coder-Next (80B/3B-active MoE) as the workhorse, but deep reasoning tasks (complex research, multi-step strategic analysis) still benefit from cloud-tier models. We recommend a tiered architecture: local-first with optional cloud escalation, governed by data sensitivity classification.

## 1. Context: Why This Matters for #B4mad

#B4mad already stores all agent memory in markdown files backed by git. This is a strong privacy foundation—memory never leaves the machine unless explicitly pushed. But inference still flows through cloud APIs (Anthropic Claude, Google Gemini), meaning every agent prompt, every bead description, and every piece of context is transmitted externally.

This creates three risks:

1. **Data exposure**: Sensitive work orders, personal context from `MEMORY.md`, infrastructure details from `TOOLS.md`—all sent to third-party inference providers.
2. **Vendor lock-in**: If Anthropic or Google changes pricing, rate limits, or deprecates models, the entire agent fleet stops.
3. **Availability dependency**: Cloud outages halt all agent work, even for tasks that don't require frontier reasoning.
The Lex Fridman #490 podcast (AI State of the Art 2026, ~34:46) captured the sentiment well: users want separate work/personal AI contexts, local customization, and the ability to add data post-training without it leaving their machine. This aligns exactly with #B4mad's agent-first philosophy.

Our recently published pull-based scheduling paper (beads-hub-30f) already describes agents polling a local bead board. The natural next step: those agents running on local models, with the bead board as the only coordination surface, and no data leaving the machine.

## 2. State of the Art: Local Inference for Agent Workloads

### 2.1 Qwen3-Coder Family

The Qwen3-Coder family represents the current state of the art for local agentic coding:

- **Qwen3-Coder-480B-A35B-Instruct**: Flagship MoE model, 480B total / 35B active parameters. Performance comparable to Claude Sonnet 4 on SWE-Bench, agentic coding, and tool use. Requires ~70GB VRAM (quantized)—feasible on a dual-GPU workstation but not on casual hardware.
- **Qwen3-Coder-Next (80B-A3B)**: The local-first variant. 80B total / 3B active parameters with hybrid attention and MoE. Designed explicitly for coding agents and local development. Runs comfortably on a single consumer GPU (16GB+ VRAM at Q4 quantization). Trained with large-scale agentic RL including environment interaction.
- **Qwen3-Coder-30B-A3B-Instruct**: Mid-tier option, 30B total / 3B active. A good balance of capability and resource requirements.
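The VRAM figures above follow from simple arithmetic. A back-of-envelope sizing sketch, assuming roughly 4.5 bits per weight for Q4_K_M quantization (an approximation; real quantized files vary): the full 80B weight set is large, but only the ~3B active parameters participate per token, which is why expert offloading makes a 16GB-class GPU workable.

```python
# Back-of-envelope sizing for quantized MoE weights.
# Assumption (illustrative): Q4_K_M averages ~4.5 bits per weight.

def quantized_gb(num_params: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of quantized weights in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

full = quantized_gb(80e9)    # all experts of Qwen3-Coder-Next
active = quantized_gb(3e9)   # the per-token active parameter path

print(f"full weights: ~{full:.0f} GB, active path per token: ~{active:.1f} GB")
# → full weights: ~45 GB, active path per token: ~1.7 GB
```

The gap between the ~45 GB of total weights and the ~1.7 GB active path is the MoE efficiency argument in miniature: inactive experts can live in system RAM while the GPU holds the hot path plus KV cache.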
Key capabilities relevant to #B4mad:

- 256K native context (1M with YaRN extrapolation)—sufficient for repo-scale understanding
- Native function calling / tool use—critical for agent frameworks
- Support for 358 programming languages
- Available via Ollama: `ollama run qwen3-coder`

### 2.2 Routing and Orchestration Models

For the "small routing model" that dispatches tasks to specialists:

- **Qwen3-0.6B / 1.7B**: Tiny models suitable for classification tasks (intent detection, bead routing, priority assessment). Can run on CPU.
- **Llama-3.2-3B**: Strong general-purpose small model for routing decisions.
- **Phi-4-mini (3.8B)**: Microsoft's compact model with strong reasoning for its size.
- **RouteLLM** (open-source project): A framework for routing between strong and weak models based on query complexity. Directly applicable to our local/cloud tiering.

### 2.3 Inference Infrastructure

- **Ollama**: De facto standard for local model serving. OpenAI-compatible API, easy model management, quantization support. Already in use at #B4mad (`custom-10-144-28-67-11434/qwen3-coder-next:latest`).
- **llama.cpp / llama-server**: Lower-level but more configurable. Supports speculative decoding (small draft model + large verify model) for faster inference.
- **vLLM**: High-throughput serving with PagedAttention. Better for concurrent agent requests, but heavier to set up.
- **LocalAI**: OpenAI-compatible API server supporting multiple backends.

### 2.4 Privacy-Preserving Approaches in the Literature

- **Federated learning** (McMahan et al., 2017): Training across distributed nodes without sharing data. Relevant for future multi-node #B4mad setups.
- **Differential privacy in LLM inference** (various, 2024-2025): Adding noise to prevent memorization. Less relevant for our use case, since we control the entire pipeline.
- **Confidential computing** (Intel SGX, AMD SEV): Hardware-level isolation for sensitive inference. Overkill for our threat model, but worth noting.
- **On-device AI** (Apple Intelligence, Google Gemini Nano): Industry trend toward local inference for privacy. Validates our approach.

## 3. Analysis: Can Local Models Replace Cloud APIs for 80% of Agent Tasks?

### 3.1 Task Taxonomy

We categorize #B4mad agent tasks by complexity and map them to model requirements:

| Task Category | Examples | Required Capability | Local Feasible? |
|---|---|---|---|
| **Bead management** | Create, update, close beads; parse status | Structured output, tool calling | ✅ Yes — any 3B+ model |
| **Code generation** | Scripts, configs, Ansible playbooks | Coding, context understanding | ✅ Yes — Qwen3-Coder-Next excels |
| **Code review / PR feedback** | Review diffs, suggest changes | Code understanding, reasoning | ✅ Yes — Qwen3-Coder-Next |
| **Git operations** | Commit messages, branch management | Template following | ✅ Yes — trivial |
| **Routing / dispatch** | Classify incoming requests, assign to agents | Intent classification | ✅ Yes — 1-3B router model |
| **URL summarization** | Fetch and summarize web content | Reading comprehension | ✅ Yes — 7B+ model |
| **Infrastructure ops** | kubectl, oc commands, monitoring checks | Tool use, structured output | ✅ Yes — Qwen3-Coder-Next |
| **Conversational interaction** | Chat with goern, group discussions | Natural language, personality | ⚠️ Mostly — but nuance/humor degrades |
| **Deep research** | Literature review, multi-source synthesis | Long-context reasoning, depth | ❌ Not yet — Opus-tier still needed |
| **Complex strategic analysis** | Architecture decisions, trade-off papers | Deep reasoning, creativity | ❌ Not yet — frontier models preferred |

**Estimate: 75-85% of daily agent tasks are locally feasible today.**

### 3.2 The Qwen3-Coder-Next Sweet Spot
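Throughout this section, "running on Qwen3-Coder-Next" means calling a local Ollama server through its OpenAI-compatible API. A minimal sketch of that interaction (the endpoint URL, model tag, and system prompt here are assumptions about the local deployment, not verified #B4mad configuration):

```python
# Minimal sketch of calling a local Qwen3-Coder-Next deployment through
# Ollama's OpenAI-compatible endpoint. URL and model tag are assumptions.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port

def build_chat_request(prompt: str, model: str = "qwen3-coder-next:latest") -> dict:
    """Assemble an OpenAI-style chat payload for the local workhorse model."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a coding agent."},  # illustrative
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,
    }

def complete(prompt: str) -> str:
    """Send the request to the local server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the API is OpenAI-compatible, any existing OpenAI client library can be pointed at the same server by changing only its base URL, which is what makes swapping cloud inference for local inference a configuration change rather than a rewrite.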
Qwen3-Coder-Next (80B/3B-active) is the ideal workhorse for #B4mad because:

1. **MoE efficiency**: Only 3B parameters are active per token despite 80B of total knowledge. This means near-3B inference cost with much higher capability.
2. **Agentic training**: Specifically trained with long-horizon RL on real-world agent tasks, environment interaction, and tool use. Not just a code completer—it is an agent model.
3. **Ollama integration**: Already supported, already deployed at #B4mad's inference endpoint.
4. **256K context**: Enough to hold an entire bead board, the memory files, and the current task context.

### 3.3 Where Local Falls Short

Two categories remain cloud-dependent:

1. **Deep research (Romanov tasks)**: Synthesizing across multiple sources, producing nuanced analysis with original insights, and evaluating trade-offs at a strategic level. Qwen3-Coder-Next can produce *adequate* research, but not Opus-quality depth. This is the 15-20% that still needs cloud.

2. **Personality-rich interaction**: Brenner's main-session conversations with goern require wit, cultural awareness, and emotional intelligence that smaller models handle less gracefully. Acceptable for task execution, but not for the "personal assistant with personality" use case.

### 3.4 The Router Model Question

Can a small model (0.6B-3B) effectively route tasks to the right agent? Yes, because:

- Bead titles already contain routing hints ("Research:", code tasks, ops tasks)
- The routing decision is a classification task, not a generation task
- A Qwen3-0.6B fine-tuned on #B4mad's historical bead assignments would likely achieve >95% routing accuracy
- Even without fine-tuning, a prompted 1.7B model can classify intent reliably

**Proposed router**: Qwen3-1.7B with a system prompt describing each agent's capabilities. Input: bead title + description. Output: agent assignment + priority. Runs on CPU, <2GB RAM.

## 4. Proposed Architecture: Local-First with Cloud Escalation

### 4.1 System Overview

```
┌──────────────────────────────────────────────────┐
│                  Local Machine                   │
│                                                  │
│  ┌──────────┐      ┌──────────────────────────┐  │
│  │  Router  │      │      Ollama Server       │  │
│  │  (1.7B)  │─────▶│ ┌──────────────────────┐ │  │
│  └──────────┘      │ │ Qwen3-Coder-Next (3B)│ │  │
│       ▲            │ └──────────────────────┘ │  │
│       │            └────────────┬─────────────┘  │
│       │                         │                │
│  ┌────┴────┐             ┌──────┴──────┐         │
│  │  Bead   │             │   Agents    │         │
│  │  Board  │◀───────────▶│ (OpenClaw)  │         │
│  │  (git)  │             └──────┬──────┘         │
│  └─────────┘                    │                │
│                     ┌───────────┴────────┐       │
│                     │  Sensitivity Gate  │       │
│                     │   (local policy)   │       │
│                     └───────────┬────────┘       │
└─────────────────────────────────┼────────────────┘
                                  │ (only if needed AND allowed)
                           ┌──────┴──────┐
                           │  Cloud API  │
                           │ (Opus/etc)  │
                           └─────────────┘
```

### 4.2 Components

**1. Local Router (Qwen3-1.7B on CPU)**
- Classifies incoming beads/messages
- Routes to the appropriate local agent
- Flags tasks that may need cloud escalation

**2. Primary Inference (Qwen3-Coder-Next via Ollama)**
- Handles all code, ops, bead management, and routine conversation
- Serves CodeMonkey, PltOps, and routine Brenner tasks
- Single GPU (RTX 4090 / RTX 5090 or equivalent)

**3. Bead Board (git-backed, local)**
- Already implemented — no changes needed
- Pull-based scheduling as described in our previous paper
- Agents poll, claim, execute, close

**4. Memory Layer (markdown files, git-backed)**
- Already implemented — `MEMORY.md`, `memory/*.md`, `AGENTS.md`
- Zero cloud dependency, full local control
- Git provides versioning; sync is explicit

**5. Sensitivity Gate (local policy engine)**
- Simple rule-based classifier:
  - Contains personal data? → Local only
  - Contains infrastructure secrets? → Local only
  - Requires deep reasoning? → May escalate to cloud
  - Research task? → May escalate to cloud
- User can override: a `--local-only` flag forces all-local operation

**6. Cloud Escalation (optional)**
- Only for tasks that pass the sensitivity gate AND require frontier capability
- User explicitly approves cloud usage per task or per category
- Could be eliminated entirely if the quality trade-off on research/deep reasoning is acceptable

### 4.3 Minimum Viable Local Setup

| Component | Hardware | Cost (approx.) |
|---|---|---|
| GPU | NVIDIA RTX 4090 (24GB VRAM) | ~$1,600 |
| CPU | Any modern 8-core (for the router model) | (existing) |
| RAM | 32GB+ | (existing) |
| Storage | 500GB SSD (models + repos) | ~$50 |
| Software | Ollama + OpenClaw + git | Free |

**Total incremental cost: ~$1,650** (assuming an existing workstation; just add the GPU)

For the budget-conscious: an RTX 4070 Ti Super (16GB) can run Qwen3-Coder-Next at Q4 quantization with acceptable speed. Cost: ~$800.

For maximum capability: dual RTX 4090s or a single RTX 5090 (32GB) allow running the 30B-A3B variant at higher quantization, or the full 480B-A35B with aggressive quantization.

### 4.4 Model Configuration

```yaml
# Proposed Ollama model configuration
models:
  router:
    name: qwen3:1.7b
    purpose: Intent classification, bead routing
    hardware: CPU only
    memory: ~2GB RAM

  workhorse:
    name: qwen3-coder-next:latest
    purpose: Code, ops, bead management, conversation
    hardware: GPU (RTX 4090)
    memory: ~14GB VRAM (Q4_K_M)
    context: 32768  # expandable to 256K if needed

  summarizer:
    name: qwen3:7b
    purpose: URL summarization (Brew agent)
    hardware: CPU or shared GPU
    memory: ~5GB
```

## 5. Migration Path

### Phase 1: Shadow Mode (Weeks 1-2)
- Run local models alongside cloud APIs
- Compare outputs for quality regression
- Measure latency and throughput
- Identify tasks where local quality is unacceptable

### Phase 2: Local-Default (Weeks 3-4)
- Switch CodeMonkey and PltOps to local inference
- These are the most tool-use-heavy, least personality-dependent agents
- Keep the Brenner main session and Romanov on cloud

### Phase 3: Full Local with Cloud Escalation (Weeks 5-8)
- Move routine Brenner tasks to local inference
- Implement the sensitivity gate
- Cloud only for: Romanov deep research, complex Brenner conversations
- Measure cloud API cost reduction (target: 80%+ reduction)

### Phase 4: Evaluate Full Local (Ongoing)
- As local models improve (Qwen4, Llama 4, etc.), reassess cloud necessity
- Fine-tune the router on accumulated #B4mad data
- Consider fine-tuning the workhorse model on #B4mad-specific patterns

## 6. Connection to Pull-Based Scheduling

This architecture completes the vision outlined in our pull-based scheduling paper:

1. **Bead board** serves as the shared work queue (already implemented)
2. **Agents poll** for tasks matching their capabilities (described in the previous paper)
3. **All inference is local** (this paper's contribution)
4. **All memory is local markdown** (already implemented)

The result: a fully self-contained multi-agent system where:

- No data leaves the machine unless explicitly pushed to git remotes
- There is no cloud dependency for routine operations
- Agents are autonomous, self-scheduling, and privacy-preserving
- The only external dependency is git hosting (which can also be self-hosted)

## 7. Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Local model quality regression on edge cases | High | Medium | Shadow-mode testing; cloud escalation path |
| GPU failure = all agents down | Medium | High | CPU fallback (slower but functional); spare GPU |
| Model updates break agent prompts | Medium | Medium | Pin model versions; test before upgrading |
| Context window insufficient for complex tasks | Low | Medium | Qwen3-Coder-Next supports 256K natively |
| Ollama instability under concurrent load | Medium | Medium | Rate limiting; vLLM as an alternative backend |

## 8. Recommendations

1. **Adopt Qwen3-Coder-Next as the primary local model** for CodeMonkey, PltOps, and routine Brenner tasks. It is purpose-built for agentic workloads and runs efficiently on consumer hardware.

2. **Deploy Qwen3-1.7B as the router** on CPU. It costs nothing in GPU resources and can classify and route with high accuracy.

3. **Start with Phase 1 (shadow mode) immediately.** The infrastructure is already in place—Ollama is running, the models are available, and OpenClaw supports custom model endpoints.

4. **Keep cloud escalation for Romanov and complex Brenner tasks** until local models close the reasoning gap. Budget for ~20% cloud usage.

5. **Implement the sensitivity gate** as a simple rule-based policy applied before any cloud call. This is the key privacy guarantee.

6. **Self-host git** (Forgejo on Nostromo) to eliminate the last external dependency. This makes the system fully air-gappable for maximum-security deployments.

7. **Track the evolution of Qwen3-Coder.** The family is improving rapidly, and the gap between Qwen3-Coder-Next and Claude Opus is narrowing. Re-evaluate quarterly.

## 9. Conclusion

#B4mad is uniquely positioned to offer a privacy-preserving multi-agent system.
The foundation is already laid: markdown-based memory, git-backed bead coordination, pull-based scheduling. The missing piece—local inference—is now viable thanks to Qwen3-Coder-Next and efficient MoE architectures.

The answer to "Can Qwen3-Coder + a small routing model replace cloud APIs for 80% of agent tasks?" is **yes, today**. The minimum viable setup is a single RTX 4090, Ollama, and the models described in this paper. The 20% that still benefits from cloud (deep research, complex reasoning) can be handled via an explicit escalation path with sensitivity controls.

The vision of agents polling a local bead board, running on local models, with no data leaving the machine is not aspirational—it is achievable with current technology and #B4mad's existing architecture.

## References

1. Qwen Team, "Qwen3-Coder: Agentic Coding in the World," 2026. https://qwenlm.github.io/blog/qwen3-coder/
2. Qwen Team, "Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding," 2026. https://github.com/QwenLM/Qwen3-Coder
3. Romanov, "Pull-Based Agent Scheduling Architecture for #B4mad," 2026. Internal paper, beads-hub-30f.
4. Lex Fridman Podcast #490, "AI State of the Art 2026," ~34:46. Discussion of local inference and data privacy.
5. McMahan et al., "Communication-Efficient Learning of Deep Networks from Decentralized Data," AISTATS 2017.
6. Ollama Project. https://ollama.com/
7. RouteLLM Project, "A Framework for LLM Routing," 2024. https://github.com/lm-sys/RouteLLM
8. OpenClaw Documentation. https://openclaw.com/