# Sovereign OS Compute Stack

## The Hierarchy (Bottom to Top)

```
┌─────────────────────────────────────────────────────────────────┐
│  LAYER 7: MULTI-AGENT ORCHESTRATION                             │
│  Multiple Claude instances, consensus, parallel exploration     │
│  Cost: $$$$$  Latency: High   Flexibility: Maximum              │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 6: LARGE MODELS (Cloud)                                  │
│  Claude, GPT-4 - Complex reasoning, novel problems              │
│  Cost: $$$$   Latency: 1-10s   Flexibility: Very High           │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 5: MEDIUM MODELS (Local GPU)                             │
│  7-30B params - Reasoning, code generation, analysis            │
│  Cost: $$     Latency: 1-5s   Flexibility: High                 │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 4: SMALL MODELS (Edge)                                   │
│  1-7B params - Summarization, Q&A, formatting                   │
│  Cost: $      Latency: 100ms-1s   Flexibility: Medium           │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 3: TINY MODELS (Embedded)                                │
│  <1B params - Transcription, classification, EEG, embeddings    │
│  Cost: ¢      Latency: 10-100ms   Flexibility: Low-Medium       │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 2: DETERMINISTIC CODE                                    │ ← INVERSION
│  Traditional software - Always same output for same input       │   POINT
│  Cost: ~0     Latency: <1ms   Flexibility: None                 │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 1: HARDWARE                                              │
│  CPU, GPU, NPU, FPGA, custom silicon                            │
│  Cost: Fixed  Latency: ns-μs   Flexibility: None                │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 0: ENERGY                                                │
│  Electricity, thermal management, physical substrate            │
│  Cost: Base   Latency: N/A   Flexibility: N/A                   │
└─────────────────────────────────────────────────────────────────┘
```
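For programmatic use, the hierarchy above can be modeled as a small lookup table. A minimal sketch, with the cost, latency, and flexibility strings transcribed from the diagram (the `Layer` type and `layer()` helper are illustrative, not an existing component):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    number: int
    name: str
    cost: str
    latency: str
    flexibility: str


# One entry per layer of the diagram, bottom to top.
STACK = [
    Layer(0, "Energy", "base", "n/a", "n/a"),
    Layer(1, "Hardware", "fixed", "ns-μs", "none"),
    Layer(2, "Deterministic code", "~0", "<1ms", "none"),
    Layer(3, "Tiny models (embedded)", "¢", "10-100ms", "low-medium"),
    Layer(4, "Small models (edge)", "$", "100ms-1s", "medium"),
    Layer(5, "Medium models (local GPU)", "$$", "1-5s", "high"),
    Layer(6, "Large models (cloud)", "$$$$", "1-10s", "very high"),
    Layer(7, "Multi-agent orchestration", "$$$$$", "high", "maximum"),
]


def layer(n: int) -> Layer:
    """Look up a layer by its number in the stack."""
    return STACK[n]
```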

## The Inversion Principle

> "At some point, AI becomes cheaper than writing code."

### Traditional View (Code > AI)
```python
def handle(task):
    if can_write_code(task):
        return use_code(task)  # cheaper, faster
    return use_ai(task)        # fallback for novel tasks
```

### Emerging View (AI > Code)
```python
def handle(task):
    if is_hot_path(task) and is_stable(task):
        return compile_to_code(task)  # optimization for known patterns
    return use_ai(task)               # default for everything else
```

### The Crossover Point

| Task Type | 2020 | 2023 | 2025 | Future |
|-----------|------|------|------|--------|
| Simple transform | Code | Code | Code | Code (always) |
| Text classification | Code | Edge AI | Edge AI | Embedded AI |
| Summarization | Code\* | Cloud AI | Edge AI | Edge AI |
| Code generation | Human | Cloud AI | Cloud AI | Edge AI? |
| Novel reasoning | Human | Cloud AI | Cloud AI | Cloud AI |

\*Rule-based summarization required complex, brittle heuristics.

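The inversion can be prototyped as a hot-path cache: memoize model outputs and, once an input has recurred enough times, promote the mapping to a deterministic lookup (Layer 2). A minimal sketch, where `ai_fn` stands in for any model call and the class and threshold are illustrative, not an existing component:

```python
from collections import Counter


class HotPathCompiler:
    """Promote repeated, stable AI task results to a deterministic lookup.

    `ai_fn` is a stand-in for any model invocation (API call or
    local inference).
    """

    def __init__(self, ai_fn, promote_after=3):
        self.ai_fn = ai_fn
        self.promote_after = promote_after
        self.hits = Counter()   # how often each input has recurred
        self.compiled = {}      # promoted, deterministic input -> output

    def __call__(self, task_input):
        if task_input in self.compiled:
            return self.compiled[task_input]   # Layer 2: no model call
        result = self.ai_fn(task_input)        # Layer 3+: model call
        self.hits[task_input] += 1
        if self.hits[task_input] >= self.promote_after:
            self.compiled[task_input] = result  # stable enough: "compile"
        return result
```

A production version would also verify that repeated calls actually return the same output before promoting, and expire entries when the task distribution drifts.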
## Task Routing by Layer

### Layer 3: Tiny Models (Embedded)
**Characteristics:**
- Always on
- <10 watts
- Millisecond latency
- Runs on phone, watch, IoT

**Tasks:**
| Task | Model | Why This Layer |
|------|-------|----------------|
| Voice transcription | Whisper-tiny | Real-time, always listening |
| EEG interpretation | Custom CNN | Low-latency biofeedback |
| Wake word detection | TinyML | Battery-efficient |
| Gaze classification | MobileNet | Real-time attention |
| Text embeddings | all-MiniLM | Local semantic search |
| Sentiment/intent | DistilBERT | Quick classification |

### Layer 4: Small Models (Edge)
**Characteristics:**
- On-demand
- 10-50 watts
- Sub-second latency
- Runs on laptop, desktop

**Tasks:**
| Task | Model | Why This Layer |
|------|-------|----------------|
| Document summary | Qwen-3B | Privacy, speed |
| Simple Q&A | Phi-3 | No cloud needed |
| Format conversion | Qwen-3B | Deterministic enough |
| Draft writing | Mistral-7B | Interactive speed |
| Code completion | CodeQwen | IDE integration |

### Layer 5: Medium Models (Local GPU)
**Characteristics:**
- Batch or interactive
- 100-300 watts
- 1-5 second latency
- Needs GPU

**Tasks:**
| Task | Model | Why This Layer |
|------|-------|----------------|
| Code review | CodeLlama-34B | Needs context |
| Research summary | Mixtral | Long context |
| Translation | NLLB | Quality matters |
| Image analysis | LLaVA | Multimodal |

### Layer 6: Large Models (Cloud)
**Characteristics:**
- On-demand
- Pay per token
- Seconds latency
- Unlimited scale

**Tasks:**
| Task | Model | Why This Layer |
|------|-------|----------------|
| Complex reasoning | Claude | Novel problems |
| Architecture design | Claude | Needs judgment |
| Code implementation | Claude | Context + quality |
| This conversation | Claude | Meta-cognition |

### Layer 7: Multi-Agent
**Characteristics:**
- Parallel exploration
- High cost
- Minutes latency
- Consensus building

**Tasks:**
| Task | Pattern | Why This Layer |
|------|---------|----------------|
| System design | Debate | Multiple perspectives |
| Bug hunting | Parallel search | Coverage |
| Research | Divide & conquer | Breadth |

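Taken together, the tables above define a routing table from task category to the cheapest adequate layer. A minimal sketch (category names and layer assignments transcribed from the tables; the dict and function are illustrative, not an existing component):

```python
# Task category -> lowest layer that handles it well,
# per the layer tables above.
ROUTING_TABLE = {
    "transcription": 3,
    "wake_word": 3,
    "embeddings": 3,
    "summarization": 4,
    "qa": 4,
    "code_completion": 4,
    "code_review": 5,
    "translation": 5,
    "complex_reasoning": 6,
    "architecture_design": 6,
    "system_design": 7,
}


def route_task(category: str, default: int = 6) -> int:
    """Return the target layer for a task category.

    Unknown categories escalate to the cloud layer by default.
    """
    return ROUTING_TABLE.get(category, default)
```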
## Energy-Aware Routing

```python
from dataclasses import dataclass
from enum import Enum, auto


class Policy(Enum):
    TINY_ONLY = auto()
    PREFER_SMALL = auto()
    PREFER_LOCAL = auto()
    NORMAL = auto()
    ALLOW_MEDIUM = auto()
    ALLOW_LARGE = auto()


@dataclass
class EnergyState:
    on_battery: bool
    battery_percent: float
    is_solar_peak: bool = False
    grid_carbon_intensity: float = 400.0  # gCO2/kWh


class EnergyAwareRouter:
    """
    Route tasks considering energy budget.

    On battery: Prefer tiny/small, cache aggressively
    On power: Can use medium/large freely
    Solar peak: Batch heavy tasks
    """

    def route(self, task, energy_state):
        if energy_state.on_battery:
            if energy_state.battery_percent < 20:
                return Policy.TINY_ONLY
            elif energy_state.battery_percent < 50:
                return Policy.PREFER_SMALL
            else:
                return Policy.NORMAL

        if energy_state.is_solar_peak:
            # Great time for batch processing
            return Policy.ALLOW_MEDIUM

        if energy_state.grid_carbon_intensity < 100:
            # Clean grid, can use cloud
            return Policy.ALLOW_LARGE

        return Policy.PREFER_LOCAL
```

## The Dictation Example

Your current dictation flow demonstrates the stack:

```
Voice (Physical)
    → Microphone (Hardware)
    → Audio buffer (Code)
    → Whisper-tiny (Tiny Model) - Transcription
    → Text stream
    → Claude (Large Model) - Interpretation
    → Structured actions
```

**Optimization opportunities:**

1. **Local intent detection** (Tiny)
   - "Schedule meeting" → direct action, skip Claude
   - "What's the weather" → edge API call

2. **Context caching** (Code)
   - Recent transcripts in memory
   - Phoenix state pre-loaded

3. **Batch interpretation** (Medium)
   - Collect 30s of speech
   - Summarize locally
   - Only escalate decisions to Claude

4. **Async background** (Large)
   - Non-urgent analysis queued
   - Processed when convenient/cheap

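Opportunity 1, local intent detection, can start even simpler than a tiny model: a keyword matcher that short-circuits obvious commands and returns `None` for anything that should escalate to Claude. A sketch (the intents and patterns are illustrative):

```python
import re

# Illustrative intent patterns; a production version would use a tiny
# classifier (e.g. DistilBERT) rather than regexes.
INTENT_PATTERNS = {
    "schedule_meeting": re.compile(r"\b(schedule|book)\b.*\bmeeting\b", re.I),
    "weather_query": re.compile(r"\bweather\b", re.I),
}


def detect_intent(transcript: str):
    """Return a locally handleable intent, or None to escalate to Claude."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(transcript):
            return intent
    return None
```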
## Implementation Priority

### Phase 1: Edge Foundation
- [ ] Install Ollama on all devices
- [ ] Deploy Whisper-tiny for always-on transcription
- [ ] Build local embedding index (Obsidian, Git)
- [ ] Simple intent classifier

### Phase 2: Smart Routing
- [x] Complexity estimator (done)
- [ ] Energy state monitoring
- [ ] Cost tracking per layer
- [ ] Automatic escalation/de-escalation

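The Phase 2 complexity estimator can begin life as a crude heuristic over the prompt text. A sketch (the thresholds and marker words are illustrative assumptions, not the existing estimator):

```python
def estimate_layer(prompt: str) -> int:
    """Crude heuristic mapping a prompt to a target layer number.

    Thresholds are illustrative; a real estimator would learn them
    from the routing outcomes tracked in Phase 2.
    """
    words = prompt.split()
    # Words that suggest the task needs judgment, not just transformation.
    reasoning_markers = {"why", "design", "architect", "tradeoff", "prove"}
    lowered = {w.lower() for w in words}
    if len(words) < 8 and not (reasoning_markers & lowered):
        return 4  # short and simple -> small edge model
    if reasoning_markers & lowered:
        return 6  # needs judgment -> cloud model
    return 5      # default to local GPU
```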
### Phase 3: Inversion
- [ ] Identify hot-path AI tasks
- [ ] Compile to deterministic code where stable
- [ ] AI generates its own optimizations
- [ ] Self-modifying routing rules

### Phase 4: Multi-Agent
- [ ] Parallel Claude instances for exploration
- [ ] Consensus protocols
- [ ] Debate framework for decisions
- [ ] Distributed inference across devices

## Cost Model

```
Annual cost at current usage patterns:

Layer 7 (Multi-Agent):  $500/year   (10 sessions × $50)
Layer 6 (Claude):       $200/year   (heavy daily use)
Layer 5 (Medium Local): $50/year    (electricity)
Layer 4 (Small Local):  $20/year    (electricity)
Layer 3 (Tiny):         $5/year     (negligible)
Layer 2 (Code):         ~$0         (already running)
Layer 1 (Hardware):     Amortized   (already own)
Layer 0 (Energy):       Base cost   (unavoidable)
                        ─────────
                        ~$800/year  total AI compute

With optimization (shift 50% of cloud spend, L6 and L7, to L4/L5):
                        ~$450/year  (roughly a 45% reduction)
```
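The totals are simple arithmetic over the per-layer figures. A sketch (costs copied from the table above; the shift function ignores the small extra local electricity the move would add):

```python
# Per-layer annual costs from the table above (USD/year).
COSTS = {"L7": 500, "L6": 200, "L5": 50, "L4": 20, "L3": 5}


def total(costs):
    """Sum the annual spend across layers."""
    return sum(costs.values())


def shift_cloud_local(costs, fraction=0.5):
    """Move `fraction` of cloud spend (L6 + L7) off the bill.

    Ignores the small extra local electricity the shifted work would
    add, so the result is a lower bound on the new total.
    """
    new = dict(costs)
    new["L6"] *= (1 - fraction)
    new["L7"] *= (1 - fraction)
    return new
```

`total(COSTS)` gives 775, the "~$800/year" in the table; halving cloud spend lands near the optimized figure before local electricity is added back.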

## Philosophy

> The goal is not to minimize AI usage, but to maximize AI effectiveness per joule and per dollar.

1. **Use the right layer** - Don't send "what time is it" to Claude
2. **Cache aggressively** - Yesterday's insight is today's prior
3. **Compile patterns** - Repeated AI tasks become code
4. **Escalate gracefully** - Local failure → cloud success
5. **Learn from routing** - Track what works at each layer