status.md
# Operational Status — Bob

> Last updated: 2026-04-09 (Session 12 — final handoff)

## Session Handoff Protocol (MANDATORY)

Before your session ends or when you complete significant work, update:
1. **What's Working** — current capabilities
2. **What's Next** — reprioritized next steps
3. Sync to `ops/changelog.md` with what you did

## Current Status

### What's Working

**Infrastructure (Epic 01)**
- NixOS 26.05 on rig.lan — systemd-boot, Thunderbolt auto-auth, 3x RTX 3090 survive reboot
- NixOS flake (`nix/`) — NVIDIA CDI, Docker, NATS, Caddy, SSH hardening, Prometheus, disko
- Agentic harness (`.harness/`) — 6 agent prompts, workflow v2.0, BMAD agents aligned

**LLM Inference (Epic 02)**
- vLLM serving **Qwen3-32B AWQ** — :8000, llm.rig.lan
  - TP=2 (GPU 0+1), ~40 tok/s, 16K context, 5.7x concurrency, native tool calling
- Embedding model (BAAI/bge-m3) — :8080, GPU 2

**Knowledge Store (Epic 03)**
- Oxigraph SPARQL — :7878, sparql.rig.lan, 14,152 triples (BFO + CCO + bob-family.ttl — reloaded Session 12)

**Knowledge Graph (Epic 04)**
- Neo4j 5 Community — :7474/:7687, APOC enabled (for Graphiti agent memory)
- TrustGraph 2.1.26 — 48 containers, Workbench :8888, API Gateway :8088
  - vLLM backend, HuggingFace embeddings, Cassandra, Qdrant, PDF decode

**Home Awareness (Epic 05)**
- NATS JetStream — :4222/:8222/:1883, native NixOS service
- HomeAssistant — :8123, https://home.genexergy.org, onboarded
- HA ↔ NATS MQTT connected (localhost:1883)
- HA→NATS bridge — publishing state_changed events to `bob.home.state.{domain}.{entity_id}`

**Voice Interface (Epic 06)**
- faster-whisper STT — :10300, Whisper large-v3 INT8, GPU 2
- Kokoro TTS — :10400, 54 voices, GPU 2 (fallback voice)
- **Fish Speech v1.5** — :10600, GPU 2, Ray Porter voice clone (Bob persona)
- openWakeWord — :10500, CPU, Wyoming protocol, **custom "hey bob" wake word**
- **Pipecat voice agent** — :10700, fully working with tool calling:
  - Pipeline: mic → **wake word gate** → diarization → STT → **fast-path** → **coordinator router** → **per-user context injection** → LLM (with tools) → Fish Speech streaming TTS → audio out
  - Wake word: "hey bob" via Wyoming protocol to openWakeWord, 15s idle timeout back to waiting
  - VAD: Silero, confidence 0.85, 0.4s start, 0.8s stop (tuned to reduce false triggers)
  - Voice: Ray Porter clone via Fish Speech v1.5 (Kokoro `bf_emma` as fallback via `TTS_ENGINE=kokoro`)
  - Tools: weather (Open-Meteo), HA state/control, knowledge graph (SPARQL), REPL sandbox, recall_memory, get_news, create_automation
  - **Per-user context** (S12-05): speaker identified → Graphiti memory + calendar events + user profile injected into system prompt. Profiles: Cam (concise/technical), AJ (warm/practical), Hailen (friendly/age-appropriate).
  - Face avatar with lip-sync (toggle between orb and face views)
  - Web client at voice.genexergy.org (Keycloak-protected)
- Web voice client — voice.rig.lan

**Operational Agents (Epic 08-09)**
- **Agent Scheduler** — cron-based trigger service, NATS JetStream stream `BOB_AGENTS` (6 subject patterns: agent, calendar, alert, announce, session, news)
- **Home Keeper** — hourly infrastructure health checks (Docker, GPU, services, network, disk/RAM)
- **Morning Coordinator** — daily 7:45 AM briefing (weather, clothing, **news headlines**, system health, alerts)
- **Evening Coordinator** — daily 8:00 PM summary (tomorrow weather, agent activity, **news headlines**, alerts)
- **News Aggregator** — every 2 hours, RSS feeds (NPR, BBC, Tampa Bay Times) + NWS weather alerts
- **Knowledge Gardener** — nightly 2:00 AM consolidation + **real-time session storage** + **memory pruning/dedup** (agent results, voice sessions → Graphiti, 90-day retention, duplicate removal, knowledge graph stats, daily/weekly digests)
- **System Sentinel** — every 15 min deep monitoring (Prometheus metrics, Docker logs, disk, remote SSH inventory)
- **Alert Bridge** — Prometheus Alertmanager → NATS webhook bridge, triggers System Sentinel on critical alerts
- **Alertmanager** — NixOS-native, :9093, 8 alert rules (memory, disk, CPU, load, systemd failures)
- **Device Health** — every 4 hours, SSH health checks across managed devices (rig, kairos, reMarkable)
- **Network Discovery** — nmap subnet scans, known MAC registry, new/missing device detection
- **Service reliability** — all containers have `Restart=always`, pipecat idle timeout disabled
- **Calendar Bridge** — ICS feed poller, 4 feeds active (Cam Personal, Hailen Personal, Noble Hunt, GeoRobotix), publishes to `bob.calendar.events` every 30 min
- **Graphiti Temporal Memory** — Neo4j-backed, Knowledge Gardener stores agent results as episodes

**Everything Agent (Epic 12)**
- **REPL Sandbox** — :10900, sandboxed Python execution as LLM tool (Docker, Prometheus, HA, Oxigraph access)
- **Session Consolidation** — full chain working: voice conversations → LLM summary → NATS → Knowledge Gardener (real-time) → Graphiti/Neo4j
- **Memory Recall** — `recall_memory` tool queries Neo4j for past conversations and entity facts
- **Context Compaction** — auto-summarize older turns at 12K token threshold
- **Fast-Path Queries** — time/date answered without LLM invocation
- **Speaker-Aware UI** — transcript shows identified speaker names (Cam, AJ, Hailen)
- **Diarization** — diart (streaming) + CAM++ (512-dim embeddings) on GPU 2, identifies enrolled speakers

**Monitoring (Epic 07)**
- Prometheus + node-exporter — :9090/:9100, NixOS native, 30-day retention
- Grafana — :3000, grafana.rig.lan (credentials in sops) — **FIXED Session 11**

### Resource Usage (as of 2026-04-09)
- **GPU 0+1**: 20.9 GB each (vLLM Qwen3-32B), 2.7 GB free each
- **GPU 2**: 16.8 GB used (classifier + Embeddings + STT + TTS + Diarization), **6.9 GB free**
- **RAM**: ~24 GB / 78 GB (54 GB available)
- **Disk**: ~290 GB / 3.6 TB (includes 33 GB restic backup)
- **Containers**: 80 total (35 Bob + 45 TrustGraph), all NixOS-managed

**Coordinator Agent + Model Tiering (Epic 13)** — **FIXED Session 11, fully operational**
- **Qwen3-8B AWQ** classifier — vLLM on GPU 2 (:8001), ~11 GB VRAM, NixOS-managed
- **Coordinator service** — NORMAL mode, 3-tier classification verified (deterministic/simple/complex)
- **Voice pipeline integration** — CoordinatorRouter active, `COORDINATOR_ENABLED=true`
- **Think tag stripping**: 8B model responses cleaned before TTS (`/no_think` + regex strip)
- **ADR-020**: Full architecture in `_bmad/architecture.md`

**Semantic Home Automations (Epic 05, FR-15)**
- **Home Automations** service — 3-tier automation engine, subscribes to `bob.home.state.>` via NATS
- **Tier 1** (rules): YAML-defined rules — sunset lights, all-away lights off, late night dim, AC comfort alert
- **Tier 2** (patterns): Tracks state history, detects recurring time-based patterns and multi-entity correlations. Runs nightly at 3 AM. Publishes to `bob.home.automation.patterns`
- **Tier 3** (LLM): Anomaly detection every 6 hours via Qwen3-32B (safety/optimization/anomaly). Natural language rule creation via voice (`create_automation` tool). Publishes to `bob.home.automation.anomalies`
- Voice rule creation: "Turn off kitchen lights at 11 PM" → LLM parses to YAML rule → added to live rule set

**News Aggregation (Epic 09, S09-07)**
- **News Aggregator** service — fetches headlines from RSS feeds (AP, NPR, BBC World, BBC Tech, Tampa Bay Times, WFLA Tampa) + Guardian API + NWS weather alerts
- Publishes to `bob.news.headlines` on NATS JetStream every 2 hours
- **Morning/Evening briefings** now include top 3 news headlines + weather alerts
- **`get_news` tool** in voice pipeline + coordinator — Bob can answer "what's in the news?"
- Optional NewsAPI.org support via `NEWSAPI_KEY` env var

**Family Digital Steward (Epic 14)** — Partially deployed
- **Restic Backup** — encrypted nightly backups to /srv/backup (7 daily, 4 weekly, 3 monthly retention)
- **Syncthing** — deployed on rig + kairos, NixOS-managed — **FIXED Session 11**
- **Device Config Repos** — kairos config snapshot, rig in git
- **DR Plan** — documented at ops/disaster-recovery.md (5-9 hour estimated recovery)
- **Firefly III** — financial management at :8181, `get_finances` voice tool, NixOS-managed — **FIXED Session 11**
- **Ollama** — Qwen2.5-VL-3B vision model on GPU 2 (on-demand, 30s keepalive), NixOS-managed

**Per-User Personalization (S12-05) — All 4 Phases**
- **Phase 1+2 (voice)**: Speaker identified via diarization → Graphiti memory + calendar + profile injected into system prompt
- **Phase 3 (web auth)**: Keycloak login at voice.genexergy.org → user identity passed via WebSocket → per-user context on first connection
- **Phase 4 (Reticulum)**: bob-lxmf-bridge service receives text messages from Reticulum mesh, routes through coordinator, sends replies. Bob's LXMF address: `<63760538e0f92ef78915f5ab38d91a60>`
- Profiles: Cam (concise/technical), AJ (warm/practical), Hailen (friendly/age-appropriate)

**Reticulum Mesh Network**
- Transport node `bob-transport` — TCP gateway :4242, connected to MichMesh Hub + RMAP World
- Propagation node `noblehunt_transport` — LXMF propagation enabled, peered with 9+ nodes
- 7,052 transport paths, 7,239 cached announces
- reMarkable 2: NomadNet + LXMF + rnsd, connected to rig transport

### What's Next (Priority Order)

**Residential Proxy (S00-02) — LIVE**
- Squid on nuclide-amd.lan:3128, externally accessible at proxy.genexergy.org:3128
- Authenticated (basic auth, glean user), privacy headers stripped, fail2ban active
- Monitored: sentinel SSH checks every 15m, home-keeper functional test hourly, voice tool

**Remaining**
3. NFR-05 gap: LUKS disk encryption (physical theft threat model decision)
4. FR-19 full playbook framework (YAML playbooks, dry-run, approval gates — multi-week)
5. FR-15 pattern→rule feedback loop (convert tier-2 pattern detections into tier-1 rules automatically)
6. FR-14 TAK/CoT interop (Not Started — needs BMAD discovery)
7. FR-16 CRDT family sync / Automerge (Not Started)
8. FR-27 Distributed household compute (Not Started)

**Epic 14: Family Digital Steward** (Session 10 complete)
- S14-01 Disaster Recovery — Done
- S14-02 Syncthing — Done (fixed Session 11)
- S14-03 Device Config Repos — Done
- S14-04 Restic Backup — Done
- S14-05 Device Health Agent — Done

### Session 10 Completed (2026-04-06)
- Restic encrypted backup initialized (rig → /srv/backup)
- Syncthing deployed on rig + kairos (but now crash-looping — perms)
- Firefly III + MariaDB deployed
- Device Health agent + Network Discovery agent deployed
- Kairos config snapshot + kiosk configuration
- BFO/CCO ontology loaded into Oxigraph (12,843 triples)
- Ollama + Qwen2.5-VL-3B vision model deployed
- Keycloak users added (hailen, greatroom, garage)
- Firefly III voice tool (get_finances) — 9 voice tools total
- reMarkable 2 Reticulum stack (NomadNet, LXMF, rnsd)
- Safari AudioContext fix + automation rule persistence

### Session 12 Completed (2026-04-08) — Major cleanup, integration, verification session
- **haven/ directory deleted** — ported coordinator tools, archived + removed, adversarial review caught inverted diff
- **Firefly III + Calendar Bridge → sops-nix** — all secrets encrypted, 4 ICS feeds live (Proton x2, MS365, Google)
- **Timezone bug fixed** — croniter naive datetime, all `datetime.now()` calls timezone-aware across all services
- **Browser automation** — playwright-cli verified, Evaluator agent updated with self-service testing mandate
- **System audit** — 4 parallel evaluator agents, 12 doc corrections, 5 issues found + fixed (NATS subjects, Oxigraph reload, announce-player, Firefly health check, TrustGraph workbench)
- **Quick wins** — sudo PATH fix, traceability matrix updated, sentinel→keeper alert bridge wired
- **NFR verification** — 9 NFRs tested: 6 PASS, 2 PARTIAL PASS, 1 CONDITIONAL. All knowledge queries <1s, all GPU services coexist, zero data egress.
- **Restic backup initialized** — 33.4 GiB first backup, password moved to sops
- **TrustGraph workbench connected** — API key configured, Graph RAG assistant Online

### Session 11 Completed (2026-04-07) — System Audit & Fixes
- Comprehensive system audit — identified 7 issues, prioritized, fixed 6
- **Classifier vLLM deployed** — Qwen3-8B-AWQ on GPU 2 (:8001), NixOS-managed, coordinator NORMAL
- **Coordinator enabled** in voice pipeline — `COORDINATOR_ENABLED=true`, 3-tier classification verified
- **Grafana fixed** — chown data dir to UID 472, added tmpfiles rule
- **Syncthing fixed** — chown config dir to 1000:100, added to containers.nix
- **Firefly III fixed** — host networking, NixOS-managed (firefly-db + firefly-iii)
- **All containers NixOS-managed** — ollama, syncthing, firefly-db, firefly-iii added to containers.nix
- Doc reconciliation — server.md, status.md, architecture.md, known-issues.md, traceability.md all updated
- 35 Bob containers running, 0 failed, 10 key services healthy

### Session 10 Completed (2026-04-06)
- Epic 14: Restic backup, Syncthing, device health, network discovery, DR plan
- Firefly III + Ollama vision + BFO/CCO ontology + Keycloak users + reMarkable Reticulum

### Session 9 Completed (2026-04-04/05)
- Coordinator voice E2E — 100% classification (25/25), all 4 tiers verified via live voice
- Session consolidation + News aggregation + NATS stream fix + Reticulum

### Blockers
- TrustGraph requires interactive web configurator (trustgraph.ai/builder)
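
The think-tag stripping listed under Coordinator Agent + Model Tiering (`/no_think` plus a regex strip) could look roughly like the sketch below. It assumes the 8B classifier wraps its reasoning in `<think>...</think>` tags, as Qwen3 models do. The function name `strip_think` and the exact patterns are illustrative and not taken from the Bob codebase.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>, the Qwen3 convention.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# An unterminated block (for example, generation cut off) runs to end of text.
_THINK_OPEN = re.compile(r"<think>.*\Z", re.DOTALL | re.IGNORECASE)


def strip_think(text: str) -> str:
    """Remove reasoning blocks so only the spoken answer reaches TTS."""
    text = _THINK_BLOCK.sub("", text)  # drop complete blocks, including empty ones
    text = _THINK_OPEN.sub("", text)   # drop a dangling, unclosed block
    return text.strip()
```

Stripping is still useful with `/no_think`, because the model may emit an empty think block that would otherwise be sent to the TTS engine.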
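
The HA→NATS bridge publishes to `bob.home.state.{domain}.{entity_id}`, and the Home Automations service subscribes to `bob.home.state.>`. The sketch below illustrates why that one subscription receives every state event, using NATS wildcard semantics (`*` matches exactly one token, `>` matches one or more trailing tokens). The helper names and the example entity are hypothetical, and the sketch assumes the entity ID token contains no dots. In a real deployment the NATS server does this matching, not client code.

```python
def state_subject(domain: str, entity_id: str) -> str:
    """Build a subject following the bob.home.state.{domain}.{entity_id} pattern."""
    return f"bob.home.state.{domain}.{entity_id}"


def subject_matches(pattern: str, subject: str) -> bool:
    """Token-wise NATS-style match supporting the '*' and '>' wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one token to match
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    # Without a trailing '>', token counts must be equal
    return len(p_tokens) == len(s_tokens)


if __name__ == "__main__":
    subject = state_subject("light", "kitchen")  # hypothetical entity
    print(subject, subject_matches("bob.home.state.>", subject))
```

A narrower subscription such as `bob.home.state.light.*` would receive only light events, which is one reason to keep the domain as its own subject token.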