/ ops / status.md
status.md
  1  # Operational Status — Bob
  2  
  3  > Last updated: 2026-04-09 (Session 12 — final handoff)
  4  
  5  ## Session Handoff Protocol (MANDATORY)
  6  
  7  Before your session ends or when you complete significant work, update:
  8  1. **What's Working** — current capabilities
  9  2. **What's Next** — reprioritized next steps
 10  3. Sync to `ops/changelog.md` with what you did
 11  
 12  ## Current Status
 13  
 14  ### What's Working
 15  
 16  **Infrastructure (Epic 01)**
 17  - NixOS 26.05 on rig.lan — systemd-boot, Thunderbolt auto-auth, 3x RTX 3090 survive reboot
 18  - NixOS flake (`nix/`) — NVIDIA CDI, Docker, NATS, Caddy, SSH hardening, Prometheus, disko
 19  - Agentic harness (`.harness/`) — 6 agent prompts, workflow v2.0, BMAD agents aligned
 20  
 21  **LLM Inference (Epic 02)**
 22  - vLLM serving **Qwen3-32B AWQ** — :8000, llm.rig.lan
 23    - TP=2 (GPU 0+1), ~40 tok/s, 16K context, 5.7x concurrency, native tool calling
 24  - Embedding model (BAAI/bge-m3) — :8080, GPU 2
 25  
 26  **Knowledge Store (Epic 03)**
 27  - Oxigraph SPARQL — :7878, sparql.rig.lan, 14,152 triples (BFO + CCO + bob-family.ttl — reloaded Session 12)
 28  
 29  **Knowledge Graph (Epic 04)**
 30  - Neo4j 5 Community — :7474/:7687, APOC enabled (for Graphiti agent memory)
 31  - TrustGraph 2.1.26 — 48 containers, Workbench :8888, API Gateway :8088
 32    - vLLM backend, HuggingFace embeddings, Cassandra, Qdrant, PDF decode
 33  
 34  **Home Awareness (Epic 05)**
 35  - NATS JetStream — :4222/:8222/:1883, native NixOS service
 36  - HomeAssistant — :8123, https://home.genexergy.org, onboarded
 37  - HA ↔ NATS MQTT connected (localhost:1883)
 38  - HA→NATS bridge — publishing state_changed events to `bob.home.state.{domain}.{entity_id}`
 39  
 40  **Voice Interface (Epic 06)**
 41  - faster-whisper STT — :10300, Whisper large-v3 INT8, GPU 2
 42  - Kokoro TTS — :10400, 54 voices, GPU 2 (fallback voice)
 43  - **Fish Speech v1.5** — :10600, GPU 2, Ray Porter voice clone (Bob persona)
 44  - openWakeWord — :10500, CPU, Wyoming protocol, **custom "hey bob" wake word**
 45  - **Pipecat voice agent** — :10700, fully working with tool calling:
 46    - Pipeline: mic → **wake word gate** → diarization → STT → **fast-path** → **coordinator router** → **per-user context injection** → LLM (with tools) → Fish Speech streaming TTS → audio out
 47    - Wake word: "hey bob" via Wyoming protocol to openWakeWord, 15s idle timeout back to waiting
 48    - VAD: Silero, confidence 0.85, 0.4s start, 0.8s stop (tuned to reduce false triggers)
 49    - Voice: Ray Porter clone via Fish Speech v1.5 (Kokoro `bf_emma` as fallback via `TTS_ENGINE=kokoro`)
 50    - Tools: weather (Open-Meteo), HA state/control, knowledge graph (SPARQL), REPL sandbox, recall_memory, get_news, create_automation
 51    - **Per-user context** (S12-05): speaker identified → Graphiti memory + calendar events + user profile injected into system prompt. Profiles: Cam (concise/technical), AJ (warm/practical), Hailen (friendly/age-appropriate).
 52    - Face avatar with lip-sync (toggle between orb and face views)
 53    - Web client at voice.genexergy.org (Keycloak-protected)
 54  - Web voice client — voice.rig.lan
 55  
 56  **Operational Agents (Epic 08-09)**
 57  - **Agent Scheduler** — cron-based trigger service, NATS JetStream stream `BOB_AGENTS` (6 subject patterns: agent, calendar, alert, announce, session, news)
 58  - **Home Keeper** — hourly infrastructure health checks (Docker, GPU, services, network, disk/RAM)
 59  - **Morning Coordinator** — daily 7:45 AM briefing (weather, clothing, **news headlines**, system health, alerts)
 60  - **Evening Coordinator** — daily 8:00 PM summary (tomorrow weather, agent activity, **news headlines**, alerts)
 61  - **News Aggregator** — every 2 hours, RSS feeds (NPR, BBC, Tampa Bay Times) + NWS weather alerts
 62  - **Knowledge Gardener** — nightly 2:00 AM consolidation + **real-time session storage** + **memory pruning/dedup** (agent results, voice sessions → Graphiti, 90-day retention, duplicate removal, knowledge graph stats, daily/weekly digests)
 63  - **System Sentinel** — every 15 min deep monitoring (Prometheus metrics, Docker logs, disk, remote SSH inventory)
 64  - **Alert Bridge** — Prometheus Alertmanager → NATS webhook bridge, triggers System Sentinel on critical alerts
 65  - **Alertmanager** — NixOS-native, :9093, 8 alert rules (memory, disk, CPU, load, systemd failures)
 66  - **Device Health** — every 4 hours, SSH health checks across managed devices (rig, kairos, reMarkable)
 67  - **Network Discovery** — nmap subnet scans, known MAC registry, new/missing device detection
 68  - **Service reliability** — all containers have `Restart=always`, pipecat idle timeout disabled
 69  - **Calendar Bridge** — ICS feed poller, 4 feeds active (Cam Personal, Hailen Personal, Noble Hunt, GeoRobotix), publishes to `bob.calendar.events` every 30 min
 70  - **Graphiti Temporal Memory** — Neo4j-backed, Knowledge Gardener stores agent results as episodes
 71  
 72  **Everything Agent (Epic 12)**
 73  - **REPL Sandbox** — :10900, sandboxed Python execution as LLM tool (Docker, Prometheus, HA, Oxigraph access)
 74  - **Session Consolidation** — full chain working: voice conversations → LLM summary → NATS → Knowledge Gardener (real-time) → Graphiti/Neo4j
 75  - **Memory Recall** — `recall_memory` tool queries Neo4j for past conversations and entity facts
 76  - **Context Compaction** — auto-summarize older turns at 12K token threshold
 77  - **Fast-Path Queries** — time/date answered without LLM invocation
 78  - **Speaker-Aware UI** — transcript shows identified speaker names (Cam, AJ, Hailen)
 79  - **Diarization** — diart (streaming) + CAM++ (512-dim embeddings) on GPU 2, identifies enrolled speakers
 80  
 81  **Monitoring (Epic 07)**
 82  - Prometheus + node-exporter — :9090/:9100, NixOS native, 30-day retention
 83  - Grafana — :3000, grafana.rig.lan (credentials in sops) — **FIXED Session 11**
 84  
 85  ### Resource Usage (as of 2026-04-09)
 86  - **GPU 0+1**: 20.9 GB each (vLLM Qwen3-32B), 2.7 GB free each
 87  - **GPU 2**: 16.8 GB used (classifier + Embeddings + STT + TTS + Diarization), **6.9 GB free**
 88  - **RAM**: ~24 GB / 78 GB (54 GB available)
 89  - **Disk**: ~290 GB / 3.6 TB (includes 33 GB restic backup)
 90  - **Containers**: 80 total (35 Bob + 45 TrustGraph), all NixOS-managed
 91  
 92  **Coordinator Agent + Model Tiering (Epic 13)** — **FIXED Session 11, fully operational**
 93  - **Qwen3-8B AWQ** classifier — vLLM on GPU 2 (:8001), ~11 GB VRAM, NixOS-managed
 94  - **Coordinator service** — NORMAL mode, 3-tier classification verified (deterministic/simple/complex)
 95  - **Voice pipeline integration** — CoordinatorRouter active, `COORDINATOR_ENABLED=true`
 96  - **Think tag stripping**: 8B model responses cleaned before TTS (`/no_think` + regex strip)
 97  - **ADR-020**: Full architecture in `_bmad/architecture.md`
 98  
 99  **Semantic Home Automations (Epic 05, FR-15)**
100  - **Home Automations** service — 3-tier automation engine, subscribes to `bob.home.state.>` via NATS
101  - **Tier 1** (rules): YAML-defined rules — sunset lights, all-away lights off, late night dim, AC comfort alert
102  - **Tier 2** (patterns): Tracks state history, detects recurring time-based patterns and multi-entity correlations. Runs nightly at 3 AM. Publishes to `bob.home.automation.patterns`
103  - **Tier 3** (LLM): Anomaly detection every 6 hours via Qwen3-32B (safety/optimization/anomaly). Natural language rule creation via voice (`create_automation` tool). Publishes to `bob.home.automation.anomalies`
104  - Voice rule creation: "Turn off kitchen lights at 11 PM" → LLM parses to YAML rule → added to live rule set
105  
106  **News Aggregation (Epic 09, S09-07)**
107  - **News Aggregator** service — fetches headlines from RSS feeds (AP, NPR, BBC World, BBC Tech, Tampa Bay Times, WFLA Tampa) + Guardian API + NWS weather alerts
108  - Publishes to `bob.news.headlines` on NATS JetStream every 2 hours
109  - **Morning/Evening briefings** now include top 3 news headlines + weather alerts
110  - **`get_news` tool** in voice pipeline + coordinator — Bob can answer "what's in the news?"
111  - Optional NewsAPI.org support via `NEWSAPI_KEY` env var
112  
113  **Family Digital Steward (Epic 14)** — Partially deployed
114  - **Restic Backup** — encrypted nightly backups to /srv/backup (7 daily, 4 weekly, 3 monthly retention)
115  - **Syncthing** — deployed on rig + kairos, NixOS-managed — **FIXED Session 11**
116  - **Device Config Repos** — kairos config snapshot, rig in git
117  - **DR Plan** — documented at ops/disaster-recovery.md (5-9 hour estimated recovery)
118  - **Firefly III** — financial management at :8181, `get_finances` voice tool, NixOS-managed — **FIXED Session 11**
119  - **Ollama** — Qwen2.5-VL-3B vision model on GPU 2 (on-demand, 30s keepalive), NixOS-managed
120  
121  **Per-User Personalization (S12-05) — All 4 Phases**
122  - **Phase 1+2 (voice)**: Speaker identified via diarization → Graphiti memory + calendar + profile injected into system prompt
123  - **Phase 3 (web auth)**: Keycloak login at voice.genexergy.org → user identity passed via WebSocket → per-user context on first connection
124  - **Phase 4 (Reticulum)**: bob-lxmf-bridge service receives text messages from Reticulum mesh, routes through coordinator, sends replies. Bob's LXMF address: `<63760538e0f92ef78915f5ab38d91a60>`
125  - Profiles: Cam (concise/technical), AJ (warm/practical), Hailen (friendly/age-appropriate)
126  
127  **Reticulum Mesh Network**
128  - Transport node `bob-transport` — TCP gateway :4242, connected to MichMesh Hub + RMAP World
129  - Propagation node `noblehunt_transport` — LXMF propagation enabled, peered with 9+ nodes
130  - 7,052 transport paths, 7,239 cached announces
131  - reMarkable 2: NomadNet + LXMF + rnsd, connected to rig transport
132  
133  ### What's Next (Priority Order)
134  
135  **Residential Proxy (S00-02) — LIVE**
136  - Squid on nuclide-amd.lan:3128, externally accessible at proxy.genexergy.org:3128
137  - Authenticated (basic auth, glean user), privacy headers stripped, fail2ban active
138  - Monitored: sentinel SSH checks every 15m, home-keeper functional test hourly, voice tool
139  
140  **Remaining**
141  3. NFR-05 gap: LUKS disk encryption (physical theft threat model decision)
142  4. FR-19 full playbook framework (YAML playbooks, dry-run, approval gates — multi-week)
143  5. FR-15 pattern→rule feedback loop (convert tier-2 pattern detections into tier-1 rules automatically)
144  6. FR-14 TAK/CoT interop (Not Started — needs BMAD discovery)
145  7. FR-16 CRDT family sync / Automerge (Not Started)
146  8. FR-27 Distributed household compute (Not Started)
147  
148  **Epic 14: Family Digital Steward** (Session 10 complete)
149  - S14-01 Disaster Recovery — Done
150  - S14-02 Syncthing — Done (fixed Session 11)
151  - S14-03 Device Config Repos — Done
152  - S14-04 Restic Backup — Done
153  - S14-05 Device Health Agent — Done
154  
155  ### Session 10 Completed (2026-04-06)
156  - Restic encrypted backup initialized (rig → /srv/backup)
157  - Syncthing deployed on rig + kairos (but now crash-looping — perms)
158  - Firefly III + MariaDB deployed
159  - Device Health agent + Network Discovery agent deployed
160  - Kairos config snapshot + kiosk configuration
161  - BFO/CCO ontology loaded into Oxigraph (12,843 triples)
162  - Ollama + Qwen2.5-VL-3B vision model deployed
163  - Keycloak users added (hailen, greatroom, garage)
164  - Firefly III voice tool (get_finances) — 9 voice tools total
165  - reMarkable 2 Reticulum stack (NomadNet, LXMF, rnsd)
166  - Safari AudioContext fix + automation rule persistence
167  
168  ### Session 12 Completed (2026-04-08) — Major cleanup, integration, verification session
169  - **haven/ directory deleted** — ported coordinator tools, archived + removed, adversarial review caught inverted diff
170  - **Firefly III + Calendar Bridge → sops-nix** — all secrets encrypted, 4 ICS feeds live (Proton x2, MS365, Google)
171  - **Timezone bug fixed** — croniter naive datetime, all `datetime.now()` calls timezone-aware across all services
172  - **Browser automation** — playwright-cli verified, Evaluator agent updated with self-service testing mandate
173  - **System audit** — 4 parallel evaluator agents, 12 doc corrections, 5 issues found + fixed (NATS subjects, Oxigraph reload, announce-player, Firefly health check, TrustGraph workbench)
174  - **Quick wins** — sudo PATH fix, traceability matrix updated, sentinel→keeper alert bridge wired
175  - **NFR verification** — 9 NFRs tested: 6 PASS, 2 PARTIAL PASS, 1 CONDITIONAL. All knowledge queries <1s, all GPU services coexist, zero data egress.
176  - **Restic backup initialized** — 33.4 GiB first backup, password moved to sops
177  - **TrustGraph workbench connected** — API key configured, Graph RAG assistant Online
178  
179  ### Session 11 Completed (2026-04-07) — System Audit & Fixes
180  - Comprehensive system audit — identified 7 issues, prioritized, fixed 6
181  - **Classifier vLLM deployed** — Qwen3-8B-AWQ on GPU 2 (:8001), NixOS-managed, coordinator NORMAL
182  - **Coordinator enabled** in voice pipeline — `COORDINATOR_ENABLED=true`, 3-tier classification verified
183  - **Grafana fixed** — chown data dir to UID 472, added tmpfiles rule
184  - **Syncthing fixed** — chown config dir to 1000:100, added to containers.nix
185  - **Firefly III fixed** — host networking, NixOS-managed (firefly-db + firefly-iii)
186  - **All containers NixOS-managed** — ollama, syncthing, firefly-db, firefly-iii added to containers.nix
187  - Doc reconciliation — server.md, status.md, architecture.md, known-issues.md, traceability.md all updated
188  - 35 Bob containers running, 0 failed, 10 key services healthy
189  
190  ### Session 10 Completed (2026-04-06)
191  - Epic 14: Restic backup, Syncthing, device health, network discovery, DR plan
192  - Firefly III + Ollama vision + BFO/CCO ontology + Keycloak users + reMarkable Reticulum
193  
194  ### Session 9 Completed (2026-04-04/05)
195  - Coordinator voice E2E — 100% classification (25/25), all 4 tiers verified via live voice
196  - Session consolidation + News aggregation + NATS stream fix + Reticulum
197  
198  ### Blockers
199  - TrustGraph requires interactive web configurator (trustgraph.ai/builder)