# Fine-Tuning Open Models for Agent Workflows: A #B4mad Feasibility Study

**Author:** Roman "Romanov" Research-Rachmaninov  
**Date:** 2026-02-19  
**Bead:** beads-hub-1pq  

## Abstract

This paper investigates the feasibility of fine-tuning open-weight language models — specifically Qwen3 and DeepSeek — for #B4mad's agent-specific workflows: MCP tool calling, beads task coordination, and multi-agent delegation. We evaluate LoRA and QLoRA as parameter-efficient fine-tuning (PEFT) methods suitable for our local RTX 4090 (24GB VRAM) infrastructure. Our conclusion: a #B4mad-tuned agent model is not only feasible but strategically valuable, though the primary challenge is dataset curation rather than compute.

## 1. Context: Why This Matters for #B4mad

#B4mad Industries runs a multi-agent architecture where specialized agents (Brenner, Romanov, PLTops, Lotti, etc.) coordinate via the beads task system, call tools through MCP (Model Context Protocol), and delegate sub-tasks to each other. Today, this runs on commercial frontier models (Claude Opus, GPT-4). A fine-tuned open model would provide:

- **Technological sovereignty** — No dependency on API providers for core agent capabilities
- **Cost reduction** — Local inference at near-zero marginal cost vs. $15-75/M tokens for frontier APIs
- **Latency improvement** — Local inference eliminates network round-trips
- **Customization depth** — Models that natively understand #B4mad's tool schemas, bead lifecycle, and delegation patterns
- **Privacy** — Sensitive workflows never leave our infrastructure

The Lex Fridman podcast (#490, ~32:33) discussion between Sebastian Raschka and Nathan Lambert reinforces that the differentiator in 2026 is no longer model architecture (ideas diffuse rapidly across labs) but rather the *application-specific tuning and deployment* that organizations build on top of open weights.

## 2. State of the Art

### 2.1 Open Model Landscape (February 2026)

The open-weight model ecosystem has matured dramatically:

| Model | Parameters | Architecture | License | Tool Calling | Context |
|-------|-----------|-------------|---------|-------------|---------|
| **Qwen3-30B-A3B** | 30B (3B active) | MoE, 128 experts | Apache 2.0 | Native | 128K |
| **Qwen3-8B** | 8B | Dense | Apache 2.0 | Native | 128K |
| **Qwen3-4B** | 4B | Dense | Apache 2.0 | Native | 32K |
| **DeepSeek-R1** | 671B (37B active) | MoE | MIT | Via fine-tune | 128K |
| **DeepSeek-V3** | 671B (37B active) | MoE | MIT | Native | 128K |
| **Llama 3.3** | 70B | Dense | Llama License | Community | 128K |

**Qwen3 is our recommended base model family.** The Qwen3-30B-A3B MoE model achieves performance rivaling QwQ-32B with only 3B activated parameters — meaning it runs efficiently on consumer hardware while maintaining strong reasoning. Qwen3-8B and Qwen3-4B are viable for development and testing. All are Apache 2.0 licensed, permitting commercial fine-tuning and deployment.

### 2.2 Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning of even an 8B model requires ~60GB+ VRAM (model + gradients + optimizer states in fp16). PEFT methods solve this:

**LoRA (Low-Rank Adaptation):** Decomposes weight updates into low-rank factors. For a weight matrix W ∈ ℝ^(d×k), LoRA learns A ∈ ℝ^(d×r) and B ∈ ℝ^(r×k) with r ≪ min(d,k), so the learned update is ΔW = AB while W stays frozen. Only A and B are trained. Typical ranks are r = 16-64, yielding adapters of 10-100MB vs. multi-GB full models.
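
A minimal PyTorch sketch of this decomposition follows. The wrapper class, names, and initialization are ours for illustration; in practice the PEFT library (Section 3.3) handles this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = base(x) + (alpha/r) * x @ A @ B.

    The pretrained weight W (d x k) is frozen; only the low-rank
    factors A (d x r) and B (r x k) receive gradients.
    """
    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze W (and bias)
        d, k = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(r, k))         # zero init: ΔW = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A @ self.B)

# Usage: wrap a frozen attention projection, e.g.
# layer.q_proj = LoRALinear(layer.q_proj, r=32)
```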

**QLoRA:** Combines 4-bit NormalFloat (NF4) quantization of the base model with LoRA adapters trained in 16-bit. Key innovations:
- 4-bit NF4 quantization (information-theoretically optimal for normal distributions)
- Double quantization (quantizing quantization constants)
- Paged optimizers for memory spike management

QLoRA enables fine-tuning a 65B parameter model on a single 48GB GPU with no performance loss vs. full 16-bit fine-tuning (Dettmers et al., 2023).
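
As a concrete sketch, the recipe maps onto the HuggingFace stack roughly as follows. The hub id, hyperparameters, and Qwen-style target-module names are plausible starting points, not tuned values:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 base with double quantization: the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# 16-bit LoRA adapters on the attention and MLP projections.
lora_config = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of total
```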

### 2.3 Agent-Specific Fine-Tuning Approaches

Several projects have demonstrated fine-tuning for tool use and agent behavior:

- **Gorilla** (Berkeley): Fine-tuned LLaMA for API calling with retrieval-augmented generation
- **ToolLLM** (Tsinghua): Fine-tuned on 16K+ real-world APIs with tool-use trajectories
- **AgentTuning** (Tsinghua): General-purpose agent tuning using interaction trajectories from 6 agent tasks
- **FireAct** (Princeton): Fine-tuned agents using ReAct-style trajectories with tool use

The common pattern: **the training data is structured interaction traces** — sequences of (observation, thought, action, tool_call, tool_result) tuples.

## 3. Analysis: A #B4mad-Tuned Agent Model

### 3.1 Target Capabilities

A #B4mad-tuned model needs three core capabilities:

**1. MCP Tool Calling:** Structured JSON tool invocations following the Model Context Protocol schema. The model must generate valid tool call JSON, handle tool results, and chain multiple tool calls.
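
As a sketch of what the model must emit, here is a tool invocation in MCP's JSON-RPC `tools/call` envelope; the tool name and arguments are hypothetical, not a real #B4mad schema:

```python
import json

# Hypothetical MCP tool invocation (JSON-RPC 2.0 "tools/call" request).
tool_call = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {
        "name": "beads_update",           # illustrative tool name
        "arguments": {"bead_id": "beads-hub-1pq", "status": "in_progress"},
    },
}

# The model must produce this as parseable JSON; a strict round-trip
# check is the cheapest validity test.
assert json.loads(json.dumps(tool_call))["params"]["name"] == "beads_update"
```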

**2. Beads Task Coordination:** Understanding the bead lifecycle (create → assign → progress → close), parsing bead IDs, updating status, and reasoning about task dependencies and priorities.

**3. Multi-Agent Delegation:** Knowing when to delegate vs. handle directly, formulating clear sub-agent task descriptions, and synthesizing results from delegated work.

### 3.2 Dataset Strategy

This is the hard part. We need high-quality training data in three forms:

**A. Synthetic Trajectories from Existing Agents**
- Instrument our current Claude-powered agents to log full interaction traces
- Each trace: system prompt → user message → tool calls → results → response (see the sketch below)
- Estimated: 500-2000 high-quality traces needed for meaningful fine-tuning
- Timeline: 2-4 weeks of normal operation with logging enabled
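
A single logged trace might look like the sketch below. Field names are our assumption, modeled on common chat-format SFT datasets rather than an existing #B4mad schema:

```python
import json

# One hypothetical trace in chat-message form, stored as one JSONL row.
trace = {
    "messages": [
        {"role": "system", "content": "You are PLTops, a #B4mad ops agent."},
        {"role": "user", "content": "Close bead beads-hub-1pq when CI is green."},
        {"role": "assistant", "content": None,
         "tool_calls": [{"name": "ci_status", "arguments": {"repo": "hub"}}]},
        {"role": "tool", "name": "ci_status", "content": "{\"state\": \"green\"}"},
        {"role": "assistant", "content": None,
         "tool_calls": [{"name": "beads_close",
                         "arguments": {"bead_id": "beads-hub-1pq"}}]},
        {"role": "tool", "name": "beads_close", "content": "{\"ok\": true}"},
        {"role": "assistant", "content": "Closed beads-hub-1pq; CI was green."},
    ]
}

# Append-only JSONL keeps logging cheap and the dataset streamable.
with open("traces.jsonl", "a") as f:
    f.write(json.dumps(trace) + "\n")
```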

**B. Curated Tool-Use Examples**
- Hand-craft 100-200 gold-standard examples of each pattern:
  - MCP tool call generation and result parsing
  - Bead creation, querying, updating, closing
  - Sub-agent task formulation and result synthesis
- These serve as the quality anchor for the dataset

**C. Rejection Sampling / DPO Pairs**
- Run the base model on #B4mad tasks, collect both successful and failed completions
- Use these as preference pairs for Direct Preference Optimization (DPO), as sketched below
- This teaches the model our specific quality bar
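
Each pair would be stored in the prompt/chosen/rejected layout that TRL's `DPOTrainer` consumes; the completions below are illustrative:

```python
# One preference pair: the chosen completion emits a valid tool call,
# the rejected one claims success without acting.
pair = {
    "prompt": "Update bead beads-hub-1pq to blocked and explain why.",
    "chosen": ('{"name": "beads_update", "arguments": '
               '{"bead_id": "beads-hub-1pq", "status": "blocked", '
               '"reason": "waiting on upstream MCP schema change"}}'),
    "rejected": "I have updated the bead for you.",   # no tool call emitted
}
```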

### 3.3 Recommended Training Pipeline

```
Phase 1: SFT (Supervised Fine-Tuning)
  Base: Qwen3-8B (or Qwen3-30B-A3B for production)
  Method: QLoRA (4-bit base + LoRA rank 32)
  Data: 1000-2000 curated interaction traces
  Hardware: RTX 4090 (24GB) — sufficient for QLoRA on 8B
  Framework: Unsloth or Axolotl + HuggingFace PEFT
  Training time: ~4-8 hours for 8B, ~12-24 hours for 30B-A3B

Phase 2: DPO (Direct Preference Optimization)
  Data: 500+ preference pairs from rejection sampling
  Method: QLoRA DPO on Phase 1 checkpoint
  Training time: ~2-4 hours

Phase 3: Evaluation & Iteration
  Benchmarks: Custom #B4mad agent eval suite
  - Tool call accuracy (valid JSON, correct tool selection)
  - Bead lifecycle completion rate
  - Delegation appropriateness scoring
  - End-to-end task success on held-out beads
```
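
Continuing the QLoRA sketch from Section 2.2, Phase 1 could be driven by TRL's `SFTTrainer`. Paths and hyperparameters here are assumptions, not tuned values:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# `model` is the 4-bit PEFT model built in the Section 2.2 sketch; recent
# TRL versions apply the tokenizer's chat template to "messages"-style rows.
dataset = load_dataset("json", data_files="traces.jsonl", split="train")

args = SFTConfig(
    output_dir="qwen3-8b-b4mad-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,       # effective batch of 16 on one GPU
    num_train_epochs=2,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(model=model, args=args, train_dataset=dataset)
trainer.train()
trainer.save_model("qwen3-8b-b4mad-sft")  # saves only the LoRA adapter
```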

### 3.4 Hardware Feasibility

Our RTX 4090 (24GB VRAM) is well-suited for QLoRA fine-tuning:

| Model | QLoRA VRAM | Feasible? | Inference VRAM (4-bit) |
|-------|-----------|-----------|----------------------|
| Qwen3-4B | ~8GB | ✅ Easy | ~3GB |
| Qwen3-8B | ~14GB | ✅ Comfortable | ~6GB |
| Qwen3-14B | ~20GB | ✅ Tight | ~9GB |
| Qwen3-30B-A3B | ~16GB* | ✅ Good (MoE) | ~10GB* |
| Qwen3-32B | ~28GB | ❌ Too large | ~18GB |

*All 30B parameters are loaded, but at 4-bit they fit in roughly 15GB; since only ~3B parameters are active per token, compute and activation memory stay low, which makes the 30B-A3B surprisingly efficient.

The sweet spot for #B4mad is **Qwen3-8B for development/testing** and **Qwen3-30B-A3B for production**, both trainable on our single RTX 4090.

### 3.5 Risks and Limitations

1. **Catastrophic forgetting:** Fine-tuning on narrow agent tasks may degrade general capabilities. Mitigation: LoRA's parameter isolation naturally preserves base model knowledge; also mix in general instruction data during SFT.

2. **Dataset quality:** Garbage in, garbage out. Our biggest risk is insufficient or low-quality training data. Mitigation: Start with curated gold examples, expand gradually.

3. **Evaluation difficulty:** Agent task success is hard to measure automatically. Mitigation: Build a structured eval suite before training, not after.

4. **Maintenance burden:** Models need retraining as our tool schemas and agent patterns evolve. Mitigation: Keep training pipelines automated and modular.

5. **Capability ceiling:** A fine-tuned 8B model won't match Claude Opus on complex reasoning. Mitigation: Use the fine-tuned model for routine agent tasks; escalate to frontier models for complex reasoning.

## 4. Recommendations

### Immediate (Week 1-2)
1. **Instrument agent logging:** Add structured trace collection to all #B4mad agents (Brenner, PLTops, Lotti, Romanov). Every tool call, every bead operation, every delegation — logged as training data.
2. **Define eval suite:** Create 50+ test cases covering MCP tool calling, bead operations, and delegation scenarios. This is the yardstick before any training begins; a minimal harness is sketched below.
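
A minimal harness for the tool-call portion might look like this sketch; the case structure, tool names, and `run_model` hook are hypothetical:

```python
import json

# One test case per expected tool call; real cases would also cover
# bead lifecycle operations and delegation scenarios.
CASES = [
    {"prompt": "Mark bead beads-hub-1pq as closed.",
     "expect_tool": "beads_update",
     "expect_args": {"bead_id": "beads-hub-1pq", "status": "closed"}},
]

def score_tool_call(raw_output: str, case: dict) -> bool:
    """Pass iff output is valid JSON, names the right tool, and
    includes every expected argument with the expected value."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False                      # invalid JSON fails outright
    if call.get("name") != case["expect_tool"]:
        return False
    args = call.get("arguments", {})
    return all(args.get(k) == v for k, v in case["expect_args"].items())

# accuracy = mean(score_tool_call(run_model(c["prompt"]), c) for c in CASES)
```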

### Short-term (Week 3-6)
3. **Curate gold dataset:** Hand-craft 200 gold-standard examples. Run the Qwen3-8B base model on these tasks to establish baseline performance.
4. **First QLoRA training run:** Fine-tune Qwen3-8B on the curated dataset using Unsloth + PEFT. Evaluate against the test suite. This is the proof of concept.

### Medium-term (Month 2-3)
5. **Scale to Qwen3-30B-A3B:** Once the pipeline is validated on 8B, move to the MoE model for production-quality results.
6. **DPO pass:** Collect preference data from real agent runs, apply DPO for quality refinement.
7. **A/B test in production:** Run the fine-tuned model alongside Claude for a subset of routine tasks. Measure success rates, latency, and cost.

### Strategic
8. **Hybrid architecture:** Use the #B4mad-tuned model for 80% of routine agent operations (tool calling, bead management, simple delegation) and frontier models for the remaining 20% (complex reasoning, novel tasks). This could cut API costs by 80%+ while maintaining quality. A routing sketch follows below.
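
A deliberately simple sketch of that routing decision; the task taxonomy and backend names are placeholders:

```python
# Route routine agent operations to the local fine-tuned model and
# escalate everything else to a frontier API.
ROUTINE = {"tool_call", "bead_update", "bead_query", "simple_delegation"}

def route(task_type: str) -> str:
    """Return the backend that should handle a task of this type."""
    return "local:qwen3-b4mad" if task_type in ROUTINE else "api:frontier"

assert route("bead_update") == "local:qwen3-b4mad"
assert route("complex_reasoning") == "api:frontier"
```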

## 5. Conclusion

A #B4mad-tuned agent model is feasible, valuable, and achievable with our current hardware. The Qwen3 family — particularly the 8B dense and 30B-A3B MoE models — provides an excellent foundation. QLoRA makes training practical on a single RTX 4090.

The critical path is **not compute but data**: instrumenting our agents to collect high-quality interaction traces, curating gold-standard examples, and building a rigorous evaluation suite. With 4-6 weeks of focused effort, we could have a proof-of-concept model that handles routine agent tasks locally, reducing our dependence on frontier API providers and advancing #B4mad's mission of technological sovereignty.

The question isn't whether we *can* build a #B4mad-tuned model. It's whether we have the discipline to collect great training data first.

## References

1. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314.
2. Hu, E.J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685.
3. Qwen Team (2025). "Qwen3: Think Deeper, Act Faster." https://qwenlm.github.io/blog/qwen3/
4. Patil, S., et al. (2023). "Gorilla: Large Language Model Connected with Massive APIs." arXiv:2305.15334.
5. Qin, Y., et al. (2023). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." arXiv:2307.16789.
6. Zeng, A., et al. (2023). "AgentTuning: Enabling Generalized Agent Abilities for LLMs." arXiv:2310.12823.
7. Chen, B., et al. (2023). "FireAct: Toward Language Agent Fine-tuning." arXiv:2310.05915.
8. HuggingFace PEFT Library. https://github.com/huggingface/peft
9. Fridman, L. (2026). "State of AI in 2026," Podcast #490, with Sebastian Raschka & Nathan Lambert. https://lexfridman.com/ai-sota-2026-transcript
10. Raschka, S. (2025). "Build a Large Language Model from Scratch." Manning Publications.