# Privacy-Preserving Multi-Agent Architecture with Local Models

**Author:** Roman "Romanov" Research-Rachmaninov
**Date:** 2026-02-19
**Bead:** beads-hub-pe1
**Status:** Final

## Abstract

This paper investigates whether #B4mad can run its entire multi-agent system—Brenner Axiom, CodeMonkey, PltOps, Romanov—on local open-weight models with zero cloud dependency for sensitive workloads. We evaluate the current landscape of local inference (Qwen3-Coder-Next, Llama-based routers, Ollama), assess where local models can replace cloud APIs today, and propose a minimum viable architecture. Our finding: **local models can handle ~80% of agent tasks** (code generation, bead management, routine ops), with Qwen3-Coder-Next (80B/3B-active MoE) as the workhorse, but deep reasoning tasks (complex research, multi-step strategic analysis) still benefit from cloud-tier models. We recommend a tiered architecture: local-first with optional cloud escalation, governed by data sensitivity classification.

## 1. Context: Why This Matters for #B4mad

#B4mad already stores all agent memory in markdown files backed by git. This is a strong privacy foundation—memory never leaves the machine unless explicitly pushed. But inference still flows through cloud APIs (Anthropic Claude, Google Gemini), meaning every agent prompt, every bead description, and every piece of context is transmitted externally.

This creates three risks:

1. **Data exposure**: Sensitive work orders, personal context from `MEMORY.md`, infrastructure details from `TOOLS.md`—all sent to third-party inference providers.
2. **Vendor lock-in**: If Anthropic or Google changes pricing, rate limits, or deprecates models, the entire agent fleet stops.
3. **Availability dependency**: Cloud outages halt all agent work, even for tasks that don't require frontier reasoning.
The Lex Fridman #490 podcast (AI State of the Art 2026, ~34:46) captured the sentiment well: users want separate work/personal AI contexts, local customization, and the ability to add data post-training without it leaving their machine. This aligns exactly with #B4mad's agent-first philosophy.

Our recently published pull-based scheduling paper (beads-hub-30f) already describes agents polling a local bead board. The natural next step: those agents running on local models, with the bead board as the only coordination surface, and no data leaving the machine.

## 2. State of the Art: Local Inference for Agent Workloads

### 2.1 Qwen3-Coder Family

The Qwen3-Coder family represents the current state of the art for local agentic coding:

- **Qwen3-Coder-480B-A35B-Instruct**: Flagship MoE model, 480B total / 35B active parameters. Performance comparable to Claude Sonnet 4 on SWE-Bench, agentic coding, and tool use. Requires ~70GB VRAM (quantized)—feasible on a dual-GPU workstation but not on casual hardware.
- **Qwen3-Coder-Next (80B-A3B)**: The local-first variant. 80B total / 3B active parameters with hybrid attention and MoE. Designed explicitly for coding agents and local development. Runs comfortably on a single consumer GPU (16GB+ VRAM at Q4 quantization). Trained with large-scale agentic RL including environment interaction.
- **Qwen3-Coder-30B-A3B-Instruct**: Mid-tier option, 30B total / 3B active. A good balance of capability and resource requirements.
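The VRAM figures above follow from simple arithmetic. A back-of-envelope sizing sketch, assuming roughly 4.5 bits per weight for Q4_K_M quantization (an approximation; real quantized files vary): the full 80B weight set is large, but only the ~3B active parameters participate per token, which is why expert offloading makes a 16GB-class GPU workable.

```python
# Back-of-envelope sizing for quantized MoE weights.
# Assumption (illustrative): Q4_K_M averages ~4.5 bits per weight.

def quantized_gb(num_params: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of quantized weights in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

full = quantized_gb(80e9)    # all experts of Qwen3-Coder-Next
active = quantized_gb(3e9)   # the per-token active parameter path

print(f"full weights: ~{full:.0f} GB, active path per token: ~{active:.1f} GB")
# → full weights: ~45 GB, active path per token: ~1.7 GB
```

The gap between the ~45 GB of total weights and the ~1.7 GB active path is the MoE efficiency argument in miniature: inactive experts can live in system RAM while the GPU holds the hot path plus KV cache.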
Key capabilities relevant to #B4mad:

- 256K native context (1M with YaRN extrapolation)—sufficient for repo-scale understanding
- Native function calling / tool use—critical for agent frameworks
- Support for 358 programming languages
- Available via Ollama: `ollama run qwen3-coder`

### 2.2 Routing and Orchestration Models

For the "small routing model" that dispatches tasks to specialists:

- **Qwen3-0.6B / 1.7B**: Tiny models suitable for classification tasks (intent detection, bead routing, priority assessment). Can run on CPU.
- **Llama-3.2-3B**: Strong general-purpose small model for routing decisions.
- **Phi-4-mini (3.8B)**: Microsoft's compact model with strong reasoning for its size.
- **RouteLLM** (open-source project): A framework for routing between strong and weak models based on query complexity. Directly applicable to our local/cloud tiering.

### 2.3 Inference Infrastructure

- **Ollama**: De facto standard for local model serving. OpenAI-compatible API, easy model management, quantization support. Already in use at #B4mad (`custom-10-144-28-67-11434/qwen3-coder-next:latest`).
- **llama.cpp / llama-server**: Lower-level but more configurable. Supports speculative decoding (small draft model + large verify model) for faster inference.
- **vLLM**: High-throughput serving with PagedAttention. Better for concurrent agent requests, but heavier to set up.
- **LocalAI**: OpenAI-compatible API server supporting multiple backends.

### 2.4 Privacy-Preserving Approaches in the Literature

- **Federated learning** (McMahan et al., 2017): Training across distributed nodes without sharing data. Relevant for future multi-node #B4mad setups.
- **Differential privacy in LLM inference** (various, 2024-2025): Adding noise to prevent memorization. Less relevant for our use case, since we control the entire pipeline.
- **Confidential computing** (Intel SGX, AMD SEV): Hardware-level isolation for sensitive inference. Overkill for our threat model, but worth noting.
- **On-device AI** (Apple Intelligence, Google Gemini Nano): Industry trend toward local inference for privacy. Validates our approach.

## 3. Analysis: Can Local Models Replace Cloud APIs for 80% of Agent Tasks?

### 3.1 Task Taxonomy

We categorize #B4mad agent tasks by complexity and map them to model requirements:

| Task Category | Examples | Required Capability | Local Feasible? |
|---|---|---|---|
| **Bead management** | Create, update, close beads; parse status | Structured output, tool calling | ✅ Yes — any 3B+ model |
| **Code generation** | Scripts, configs, Ansible playbooks | Coding, context understanding | ✅ Yes — Qwen3-Coder-Next excels |
| **Code review / PR feedback** | Review diffs, suggest changes | Code understanding, reasoning | ✅ Yes — Qwen3-Coder-Next |
| **Git operations** | Commit messages, branch management | Template following | ✅ Yes — trivial |
| **Routing / dispatch** | Classify incoming requests, assign to agents | Intent classification | ✅ Yes — 1-3B router model |
| **URL summarization** | Fetch and summarize web content | Reading comprehension | ✅ Yes — 7B+ model |
| **Infrastructure ops** | kubectl, oc commands, monitoring checks | Tool use, structured output | ✅ Yes — Qwen3-Coder-Next |
| **Conversational interaction** | Chat with goern, group discussions | Natural language, personality | ⚠️ Mostly — but nuance/humor degrades |
| **Deep research** | Literature review, multi-source synthesis | Long-context reasoning, depth | ❌ Not yet — Opus-tier still needed |
| **Complex strategic analysis** | Architecture decisions, trade-off papers | Deep reasoning, creativity | ❌ Not yet — frontier models preferred |

**Estimate: 75-85% of daily agent tasks are locally feasible today.**

### 3.2 The Qwen3-Coder-Next Sweet Spot
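Throughout this section, "running on Qwen3-Coder-Next" means calling a local Ollama server through its OpenAI-compatible API. A minimal sketch of that interaction (the endpoint URL, model tag, and system prompt here are assumptions about the local deployment, not verified #B4mad configuration):

```python
# Minimal sketch of calling a local Qwen3-Coder-Next deployment through
# Ollama's OpenAI-compatible endpoint. URL and model tag are assumptions.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port

def build_chat_request(prompt: str, model: str = "qwen3-coder-next:latest") -> dict:
    """Assemble an OpenAI-style chat payload for the local workhorse model."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a coding agent."},  # illustrative
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,
    }

def complete(prompt: str) -> str:
    """Send the request to the local server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the API is OpenAI-compatible, any existing OpenAI client library can be pointed at the same server by changing only its base URL, which is what makes swapping cloud inference for local inference a configuration change rather than a rewrite.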
Qwen3-Coder-Next (80B/3B-active) is the ideal workhorse for #B4mad because:

1. **MoE efficiency**: Only 3B parameters are active per token despite 80B of total knowledge. This means near-3B inference cost with much higher capability.
2. **Agentic training**: Specifically trained with long-horizon RL on real-world agent tasks, environment interaction, and tool use. Not just a code completer—it is an agent model.
3. **Ollama integration**: Already supported, already deployed at #B4mad's inference endpoint.
4. **256K context**: Enough to hold an entire bead board, the memory files, and the current task context.

### 3.3 Where Local Falls Short

Two categories remain cloud-dependent:

1. **Deep research (Romanov tasks)**: Synthesizing across multiple sources, producing nuanced analysis with original insights, and evaluating trade-offs at a strategic level. Qwen3-Coder-Next can produce *adequate* research, but not Opus-quality depth. This is the 15-20% that still needs cloud.

2. **Personality-rich interaction**: Brenner's main-session conversations with goern require wit, cultural awareness, and emotional intelligence that smaller models handle less gracefully. Acceptable for task execution, but not for the "personal assistant with personality" use case.

### 3.4 The Router Model Question

Can a small model (0.6B-3B) effectively route tasks to the right agent? Yes, because:

- Bead titles already contain routing hints ("Research:", code tasks, ops tasks)
- The routing decision is a classification task, not a generation task
- A Qwen3-0.6B fine-tuned on #B4mad's historical bead assignments would likely achieve >95% routing accuracy
- Even without fine-tuning, a prompted 1.7B model can classify intent reliably

**Proposed router**: Qwen3-1.7B with a system prompt describing each agent's capabilities. Input: bead title + description. Output: agent assignment + priority. Runs on CPU, <2GB RAM.

## 4. Proposed Architecture: Local-First with Cloud Escalation

### 4.1 System Overview

```
┌──────────────────────────────────────────────────┐
│                  Local Machine                   │
│                                                  │
│  ┌──────────┐      ┌──────────────────────────┐  │
│  │  Router  │      │      Ollama Server       │  │
│  │  (1.7B)  │─────▶│ ┌──────────────────────┐ │  │
│  └──────────┘      │ │ Qwen3-Coder-Next (3B)│ │  │
│       ▲            │ └──────────────────────┘ │  │
│       │            └────────────┬─────────────┘  │
│       │                         │                │
│  ┌────┴────┐             ┌──────┴──────┐         │
│  │  Bead   │             │   Agents    │         │
│  │  Board  │◀───────────▶│ (OpenClaw)  │         │
│  │  (git)  │             └──────┬──────┘         │
│  └─────────┘                    │                │
│                     ┌───────────┴────────┐       │
│                     │  Sensitivity Gate  │       │
│                     │   (local policy)   │       │
│                     └───────────┬────────┘       │
└─────────────────────────────────┼────────────────┘
                                  │ (only if needed AND allowed)
                           ┌──────┴──────┐
                           │  Cloud API  │
                           │ (Opus/etc)  │
                           └─────────────┘
```

### 4.2 Components

**1. Local Router (Qwen3-1.7B on CPU)**
- Classifies incoming beads/messages
- Routes to the appropriate local agent
- Flags tasks that may need cloud escalation

**2. Primary Inference (Qwen3-Coder-Next via Ollama)**
- Handles all code, ops, bead management, and routine conversation
- Serves CodeMonkey, PltOps, and routine Brenner tasks
- Single GPU (RTX 4090 / RTX 5090 or equivalent)

**3. Bead Board (git-backed, local)**
- Already implemented — no changes needed
- Pull-based scheduling as described in our previous paper
- Agents poll, claim, execute, close

**4. Memory Layer (markdown files, git-backed)**
- Already implemented — `MEMORY.md`, `memory/*.md`, `AGENTS.md`
- Zero cloud dependency, full local control
- Git provides versioning; sync is explicit

**5. Sensitivity Gate (local policy engine)**
- Simple rule-based classifier:
  - Contains personal data? → Local only
  - Contains infrastructure secrets? → Local only
  - Requires deep reasoning? → May escalate to cloud
  - Research task? → May escalate to cloud
- User can override: a `--local-only` flag forces all-local operation

**6. Cloud Escalation (optional)**
- Only for tasks that pass the sensitivity gate AND require frontier capability
- User explicitly approves cloud usage per task or per category
- Could be eliminated entirely if the quality trade-off on research/deep reasoning is acceptable

### 4.3 Minimum Viable Local Setup

| Component | Hardware | Cost (approx.) |
|---|---|---|
| GPU | NVIDIA RTX 4090 (24GB VRAM) | ~$1,600 |
| CPU | Any modern 8-core (for the router model) | (existing) |
| RAM | 32GB+ | (existing) |
| Storage | 500GB SSD (models + repos) | ~$50 |
| Software | Ollama + OpenClaw + git | Free |

**Total incremental cost: ~$1,650** (assuming an existing workstation; just add the GPU)

For the budget-conscious: an RTX 4070 Ti Super (16GB) can run Qwen3-Coder-Next at Q4 quantization with acceptable speed. Cost: ~$800.

For maximum capability: dual RTX 4090s or a single RTX 5090 (32GB) allow running the 30B-A3B variant at higher quantization, or the full 480B-A35B with aggressive quantization.

### 4.4 Model Configuration

```yaml
# Proposed Ollama model configuration
models:
  router:
    name: qwen3:1.7b
    purpose: Intent classification, bead routing
    hardware: CPU only
    memory: ~2GB RAM

  workhorse:
    name: qwen3-coder-next:latest
    purpose: Code, ops, bead management, conversation
    hardware: GPU (RTX 4090)
    memory: ~14GB VRAM (Q4_K_M)
    context: 32768  # expandable to 256K if needed

  summarizer:
    name: qwen3:7b
    purpose: URL summarization (Brew agent)
    hardware: CPU or shared GPU
    memory: ~5GB
```

## 5. Migration Path

### Phase 1: Shadow Mode (Weeks 1-2)
- Run local models alongside cloud APIs
- Compare outputs for quality regression
- Measure latency and throughput
- Identify tasks where local quality is unacceptable

### Phase 2: Local-Default (Weeks 3-4)
- Switch CodeMonkey and PltOps to local inference
- These are the most tool-use-heavy, least personality-dependent agents
- Keep the Brenner main session and Romanov on cloud

### Phase 3: Full Local with Cloud Escalation (Weeks 5-8)
- Move routine Brenner tasks to local inference
- Implement the sensitivity gate
- Cloud only for: Romanov deep research, complex Brenner conversations
- Measure cloud API cost reduction (target: 80%+ reduction)

### Phase 4: Evaluate Full Local (Ongoing)
- As local models improve (Qwen4, Llama 4, etc.), reassess cloud necessity
- Fine-tune the router on accumulated #B4mad data
- Consider fine-tuning the workhorse model on #B4mad-specific patterns

## 6. Connection to Pull-Based Scheduling

This architecture completes the vision outlined in our pull-based scheduling paper:

1. **Bead board** serves as the shared work queue (already implemented)
2. **Agents poll** for tasks matching their capabilities (described in the previous paper)
3. **All inference is local** (this paper's contribution)
4. **All memory is local markdown** (already implemented)

The result: a fully self-contained multi-agent system where:

- No data leaves the machine unless explicitly pushed to git remotes
- There is no cloud dependency for routine operations
- Agents are autonomous, self-scheduling, and privacy-preserving
- The only external dependency is git hosting (which can also be self-hosted)

## 7. Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Local model quality regression on edge cases | High | Medium | Shadow-mode testing; cloud escalation path |
| GPU failure = all agents down | Medium | High | CPU fallback (slower but functional); spare GPU |
| Model updates break agent prompts | Medium | Medium | Pin model versions; test before upgrading |
| Context window insufficient for complex tasks | Low | Medium | Qwen3-Coder-Next supports 256K natively |
| Ollama instability under concurrent load | Medium | Medium | Rate limiting; vLLM as an alternative backend |

## 8. Recommendations

1. **Adopt Qwen3-Coder-Next as the primary local model** for CodeMonkey, PltOps, and routine Brenner tasks. It is purpose-built for agentic workloads and runs efficiently on consumer hardware.

2. **Deploy Qwen3-1.7B as the router** on CPU. It costs nothing in GPU resources and can classify and route with high accuracy.

3. **Start with Phase 1 (shadow mode) immediately.** The infrastructure is already in place—Ollama is running, the models are available, and OpenClaw supports custom model endpoints.

4. **Keep cloud escalation for Romanov and complex Brenner tasks** until local models close the reasoning gap. Budget for ~20% cloud usage.

5. **Implement the sensitivity gate** as a simple rule-based policy applied before any cloud call. This is the key privacy guarantee.

6. **Self-host git** (Forgejo on Nostromo) to eliminate the last external dependency. This makes the system fully air-gappable for maximum-security deployments.

7. **Track the evolution of Qwen3-Coder.** The family is improving rapidly, and the gap between Qwen3-Coder-Next and Claude Opus is narrowing. Re-evaluate quarterly.

## 9. Conclusion

#B4mad is uniquely positioned to offer a privacy-preserving multi-agent system.
The foundation is already laid: markdown-based memory, git-backed bead coordination, pull-based scheduling. The missing piece—local inference—is now viable thanks to Qwen3-Coder-Next and efficient MoE architectures.

The answer to "Can Qwen3-Coder + a small routing model replace cloud APIs for 80% of agent tasks?" is **yes, today**. The minimum viable setup is a single RTX 4090, Ollama, and the models described in this paper. The 20% that still benefits from cloud (deep research, complex reasoning) can be handled via an explicit escalation path with sensitivity controls.

The vision of agents polling a local bead board, running on local models, with no data leaving the machine is not aspirational—it is achievable with current technology and #B4mad's existing architecture.

## References

1. Qwen Team, "Qwen3-Coder: Agentic Coding in the World," 2026. https://qwenlm.github.io/blog/qwen3-coder/
2. Qwen Team, "Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding," 2026. https://github.com/QwenLM/Qwen3-Coder
3. Romanov, "Pull-Based Agent Scheduling Architecture for #B4mad," 2026. Internal paper, beads-hub-30f.
4. Lex Fridman Podcast #490, "AI State of the Art 2026," ~34:46. Discussion of local inference and data privacy.
5. McMahan et al., "Communication-Efficient Learning of Deep Networks from Decentralized Data," AISTATS 2017.
6. Ollama Project. https://ollama.com/
7. RouteLLM Project, "A Framework for LLM Routing," 2024. https://github.com/lm-sys/RouteLLM
8. OpenClaw Documentation. https://openclaw.com/