# Fine-Tuning Open Models for Agent Workflows: A #B4mad Feasibility Study

**Author:** Roman "Romanov" Research-Rachmaninov
**Date:** 2026-02-19
**Bead:** beads-hub-1pq

## Abstract

This paper investigates the feasibility of fine-tuning open-weight language models — specifically Qwen3 and DeepSeek — for #B4mad's agent-specific workflows: MCP tool calling, beads task coordination, and multi-agent delegation. We evaluate LoRA and QLoRA as parameter-efficient fine-tuning (PEFT) methods suitable for our local RTX 4090 (24GB VRAM) infrastructure. Our conclusion: a #B4mad-tuned agent model is not only feasible but strategically valuable, though the primary challenge is dataset curation rather than compute.

## 1. Context: Why This Matters for #B4mad

#B4mad Industries runs a multi-agent architecture where specialized agents (Brenner, Romanov, PLTops, Lotti, etc.) coordinate via the beads task system, call tools through MCP (Model Context Protocol), and delegate sub-tasks to each other. Today, this runs on commercial frontier models (Claude Opus, GPT-4). A fine-tuned open model would provide:

- **Technological sovereignty** — No dependency on API providers for core agent capabilities
- **Cost reduction** — Local inference at ~$0/token vs. $15-75/M tokens for frontier APIs
- **Latency improvement** — Local inference eliminates network round-trips
- **Customization depth** — Models that natively understand #B4mad's tool schemas, bead lifecycle, and delegation patterns
- **Privacy** — Sensitive workflows never leave our infrastructure

The Lex Fridman podcast (#490, ~32:33) discussion between Sebastian Raschka and Nathan Lambert reinforces that the differentiator in 2026 is no longer model architecture (ideas diffuse rapidly across labs) but rather the *application-specific tuning and deployment* that organizations build on top of open weights.

## 2. State of the Art

### 2.1 Open Model Landscape (February 2026)

The open-weight model ecosystem has matured dramatically:

| Model | Parameters | Architecture | License | Tool Calling | Context |
|-------|-----------|--------------|---------|--------------|---------|
| **Qwen3-30B-A3B** | 30B (3B active) | MoE, 128 experts | Apache 2.0 | Native | 128K |
| **Qwen3-8B** | 8B | Dense | Apache 2.0 | Native | 128K |
| **Qwen3-4B** | 4B | Dense | Apache 2.0 | Native | 32K |
| **DeepSeek-R1** | 671B (37B active) | MoE | MIT | Via fine-tune | 128K |
| **DeepSeek-V3** | 671B (37B active) | MoE | MIT | Native | 128K |
| **Llama 3.3** | 70B | Dense | Llama License | Community | 128K |

**Qwen3 is our recommended base model family.** The Qwen3-30B-A3B MoE model achieves performance rivaling QwQ-32B with only 3B activated parameters — meaning it runs efficiently on consumer hardware while maintaining strong reasoning. Qwen3-8B and Qwen3-4B are viable for development and testing. All are Apache 2.0 licensed, permitting commercial fine-tuning and deployment.

### 2.2 Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning of even an 8B model requires ~60GB+ VRAM: in fp16, weights and gradients each take ~16GB (2 bytes per parameter), and Adam optimizer states add at least as much again. PEFT methods solve this:

**LoRA (Low-Rank Adaptation):** Decomposes weight update matrices into low-rank factors. For a weight matrix W ∈ ℝ^(d×k), LoRA learns A ∈ ℝ^(d×r) and B ∈ ℝ^(r×k) where r << min(d,k); only A and B are trained, so the learned update is ΔW = AB while W itself stays frozen. Typical ranks are r = 16-64, yielding adapters of 10-100MB vs. multi-GB full models.

**QLoRA:** Combines 4-bit NormalFloat (NF4) quantization of the base model with LoRA adapters trained in 16-bit. Key innovations:
- 4-bit NF4 quantization (information-theoretically optimal for normally distributed weights)
- Double quantization (quantizing the quantization constants themselves)
- Paged optimizers for memory spike management

QLoRA enables fine-tuning a 65B-parameter model on a single 48GB GPU with no performance loss vs. full 16-bit fine-tuning (Dettmers et al., 2023).
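To ground the two methods, here is a minimal QLoRA setup sketch on the HuggingFace transformers + bitsandbytes + peft stack. The model id `Qwen/Qwen3-8B` and the `target_modules` list are assumptions (checkpoint names and projection-layer names vary by architecture); rank 32 matches the configuration proposed in Section 3.3.

```python
# QLoRA setup sketch: 4-bit NF4 base model + 16-bit LoRA adapters.
# Assumes transformers, bitsandbytes, and peft are installed; the model id
# "Qwen/Qwen3-8B" is illustrative, not a verified checkpoint name.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4: optimal for normally distributed weights
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",                     # assumed checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=32,                                # low-rank dimension r
    lora_alpha=64,                       # update scaled by alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections; names vary by model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # LoRA adapters are a small fraction of total parameters
```

Only the A and B factors train; the 4-bit base stays frozen, which is why the resulting adapter checkpoint is tens of MB rather than multiple GB.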
### 2.3 Agent-Specific Fine-Tuning Approaches

Several projects have demonstrated fine-tuning for tool use and agent behavior:

- **Gorilla** (Berkeley): Fine-tuned LLaMA for API calling with retrieval-augmented generation
- **ToolLLM** (Tsinghua): Fine-tuned on 16K+ real-world APIs with tool-use trajectories
- **AgentTuning** (Tsinghua): General-purpose agent tuning using interaction trajectories from 6 agent tasks
- **FireAct** (Princeton): Fine-tuned agents using ReAct-style trajectories with tool use

The common pattern: **the training data is structured interaction traces** — sequences of (observation, thought, action, tool_call, tool_result) tuples.

## 3. Analysis: A #B4mad-Tuned Agent Model

### 3.1 Target Capabilities

A #B4mad-tuned model needs three core capabilities:

**1. MCP Tool Calling:** Structured JSON tool invocations following the Model Context Protocol schema. The model must generate valid tool call JSON, handle tool results, and chain multiple tool calls.

**2. Beads Task Coordination:** Understanding the bead lifecycle (create → assign → progress → close), parsing bead IDs, updating status, and reasoning about task dependencies and priorities.

**3. Multi-Agent Delegation:** Knowing when to delegate vs. handle directly, formulating clear sub-agent task descriptions, and synthesizing results from delegated work.

### 3.2 Dataset Strategy

This is the hard part. We need high-quality training data in three forms (a concrete record format is sketched after this list):

**A. Synthetic Trajectories from Existing Agents**
- Instrument our current Claude-powered agents to log full interaction traces
- Each trace: system prompt → user message → tool calls → results → response
- Estimated: 500-2000 high-quality traces needed for meaningful fine-tuning
- Timeline: 2-4 weeks of normal operation with logging enabled

**B. Curated Tool-Use Examples**
- Hand-craft 100-200 gold-standard examples of each pattern:
  - MCP tool call generation and result parsing
  - Bead creation, querying, updating, closing
  - Sub-agent task formulation and result synthesis
- These serve as the quality anchor for the dataset

**C. Rejection Sampling / DPO Pairs**
- Run the base model on #B4mad tasks, collect both successful and failed completions
- Use these as preference pairs for Direct Preference Optimization (DPO)
- This teaches the model our specific quality bar
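What does one such trace look like as a training record? Below is a sketch of a single SFT example flattened into chat-format messages. The tool name `beads_update`, its argument schema, and the prompts are hypothetical placeholders, not #B4mad's actual MCP definitions:

```python
# Sketch of one SFT trace record in chat-message form. The tool "beads_update"
# and its schema are hypothetical, standing in for a real MCP tool definition.
import json

trace = {
    "messages": [
        {"role": "system", "content": "You are a #B4mad agent. Tools: beads_update(id, status)."},
        {"role": "user", "content": "Close bead beads-hub-1pq; the eval suite is done."},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "beads_update",
                    "arguments": json.dumps({"id": "beads-hub-1pq", "status": "closed"}),
                },
            }],
        },
        {"role": "tool", "name": "beads_update", "content": '{"ok": true}'},
        {"role": "assistant", "content": "Closed beads-hub-1pq."},
    ]
}

# Datasets are typically stored as JSONL: one record like this per line.
print(json.dumps(trace))
```

DPO preference pairs (strategy C) reuse the same shape: identical context, with a "chosen" and a "rejected" final assistant turn.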
### 3.3 Recommended Training Pipeline

```
Phase 1: SFT (Supervised Fine-Tuning)
  Base:          Qwen3-8B (or Qwen3-30B-A3B for production)
  Method:        QLoRA (4-bit base + LoRA rank 32)
  Data:          1000-2000 curated interaction traces
  Hardware:      RTX 4090 (24GB) — sufficient for QLoRA on 8B
  Framework:     Unsloth or Axolotl + HuggingFace PEFT
  Training time: ~4-8 hours for 8B, ~12-24 hours for 30B-A3B

Phase 2: DPO (Direct Preference Optimization)
  Data:          500+ preference pairs from rejection sampling
  Method:        QLoRA DPO on Phase 1 checkpoint
  Training time: ~2-4 hours

Phase 3: Evaluation & Iteration
  Benchmarks: Custom #B4mad agent eval suite
    - Tool call accuracy (valid JSON, correct tool selection)
    - Bead lifecycle completion rate
    - Delegation appropriateness scoring
    - End-to-end task success on held-out beads
```
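In code, Phase 1 reduces to a short script. This sketch assumes trl's `SFTTrainer` on top of the QLoRA-wrapped model from the Section 2.2 example; the file name and hyperparameters are illustrative, and trl's exact argument names drift between versions:

```python
# Phase 1 sketch: supervised fine-tuning on JSONL trace records with trl.
# Assumes `model` is the QLoRA-wrapped model built earlier; recent trl
# versions apply the tokenizer's chat template to a "messages" column.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="b4mad_traces.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen3-8b-b4mad-sft",
        num_train_epochs=2,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,  # effective batch of 16 on one 4090
        learning_rate=2e-4,              # common starting point for LoRA
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()
trainer.save_model("qwen3-8b-b4mad-sft/final")  # saves the adapter, not the base model
```

Phase 2 follows the same pattern with trl's `DPOTrainer`, pointed at the Phase 1 adapter checkpoint and the preference-pair dataset.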
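Phase 3 only works if the checks are automatable. Here is a minimal sketch of the "tool call accuracy" metric, reusing the hypothetical `beads_update` schema from the Section 3.2 example:

```python
# Eval-harness sketch: is the completion valid tool-call JSON, and does it
# select the expected tool? Real cases would come from held-out beads.
import json

def score_tool_call(completion: str, expected_tool: str) -> dict:
    """Score one model completion for JSON validity and tool selection."""
    try:
        call = json.loads(completion)
    except json.JSONDecodeError:
        return {"valid_json": False, "correct_tool": False}
    return {
        "valid_json": True,
        "correct_tool": call.get("name") == expected_tool,
    }

completion = '{"name": "beads_update", "arguments": {"id": "beads-hub-1pq", "status": "closed"}}'
print(score_tool_call(completion, expected_tool="beads_update"))
# -> {'valid_json': True, 'correct_tool': True}
```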
### 3.4 Hardware Feasibility

Our RTX 4090 (24GB VRAM) is well-suited for QLoRA fine-tuning:

| Model | QLoRA VRAM | Feasible? | Inference VRAM (4-bit) |
|-------|-----------|-----------|------------------------|
| Qwen3-4B | ~8GB | ✅ Easy | ~3GB |
| Qwen3-8B | ~14GB | ✅ Comfortable | ~6GB |
| Qwen3-14B | ~20GB | ✅ Tight | ~9GB |
| Qwen3-30B-A3B | ~16GB* | ✅ Good (MoE) | ~10GB* |
| Qwen3-32B | ~28GB | ❌ Too large | ~18GB |

*MoE caveat: only ~3B parameters are active per token, which keeps compute and activation memory low, but all 128 experts must still be stored (roughly 15GB at 4-bit). The starred figures therefore assume partial expert offloading and should be validated empirically.

The sweet spot for #B4mad is **Qwen3-8B for development/testing** and **Qwen3-30B-A3B for production**, both trainable on our single RTX 4090.

### 3.5 Risks and Limitations

1. **Catastrophic forgetting:** Fine-tuning on narrow agent tasks may degrade general capabilities. Mitigation: LoRA's parameter isolation naturally preserves base model knowledge; also mix in general instruction data during SFT.

2. **Dataset quality:** Garbage in, garbage out. Our biggest risk is insufficient or low-quality training data. Mitigation: Start with curated gold examples, expand gradually.

3. **Evaluation difficulty:** Agent task success is hard to measure automatically. Mitigation: Build a structured eval suite before training, not after.

4. **Maintenance burden:** Models need retraining as our tool schemas and agent patterns evolve. Mitigation: Keep training pipelines automated and modular.

5. **Capability ceiling:** A fine-tuned 8B model won't match Claude Opus on complex reasoning. Mitigation: Use the fine-tuned model for routine agent tasks; escalate to frontier models for complex reasoning.

## 4. Recommendations

### Immediate (Week 1-2)
1. **Instrument agent logging:** Add structured trace collection to all #B4mad agents (Brenner, PLTops, Lotti, Romanov). Every tool call, every bead operation, every delegation — logged as training data.
2. **Define eval suite:** Create 50+ test cases covering MCP tool calling, bead operations, and delegation scenarios. This yardstick must exist before any training begins.

### Short-term (Week 3-6)
3. **Curate gold dataset:** Hand-craft 200 gold-standard examples. Run the Qwen3-8B base model on these tasks to establish baseline performance.
4. **First QLoRA training run:** Fine-tune Qwen3-8B on the curated dataset using Unsloth + PEFT. Evaluate against the test suite. This is the proof of concept.

### Medium-term (Month 2-3)
5. **Scale to Qwen3-30B-A3B:** Once the pipeline is validated on 8B, move to the MoE model for production-quality results.
6. **DPO pass:** Collect preference data from real agent runs, apply DPO for quality refinement.
7. **A/B test in production:** Run the fine-tuned model alongside Claude for a subset of routine tasks. Measure success rates, latency, and cost.

### Strategic
8. **Hybrid architecture:** Use the #B4mad-tuned model for 80% of routine agent operations (tool calling, bead management, simple delegation) and frontier models for the remaining 20% (complex reasoning, novel tasks). This could cut API costs by 80%+ while maintaining quality.

## 5. Conclusion

A #B4mad-tuned agent model is feasible, valuable, and achievable with our current hardware. The Qwen3 family — particularly the 8B dense and 30B-A3B MoE models — provides an excellent foundation. QLoRA makes training practical on a single RTX 4090.

The critical path is **not compute but data**: instrumenting our agents to collect high-quality interaction traces, curating gold-standard examples, and building a rigorous evaluation suite. With 4-6 weeks of focused effort, we could have a proof-of-concept model that handles routine agent tasks locally, reducing our dependence on frontier API providers and advancing #B4mad's mission of technological sovereignty.

The question isn't whether we *can* build a #B4mad-tuned model. It's whether we have the discipline to collect great training data first.

## References

1. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314.
2. Hu, E. J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685.
3. Qwen Team (2025). "Qwen3: Think Deeper, Act Faster." https://qwenlm.github.io/blog/qwen3/
4. Patil, S., et al. (2023). "Gorilla: Large Language Model Connected with Massive APIs." arXiv:2305.15334.
5. Qin, Y., et al. (2023). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." arXiv:2307.16789.
6. Zeng, A., et al. (2023). "AgentTuning: Enabling Generalized Agent Abilities for LLMs." arXiv:2310.12823.
7. Chen, B., et al. (2023). "FireAct: Toward Language Agent Fine-tuning." arXiv:2310.05915.
8. HuggingFace PEFT Library. https://github.com/huggingface/peft
9. Fridman, L. (2026). "State of AI in 2026." Podcast #490, with Sebastian Raschka & Nathan Lambert. https://lexfridman.com/ai-sota-2026-transcript
10. Raschka, S. (2025). "Build a Large Language Model from Scratch." Manning Publications.