bob-operational-agents-research.md
1 # Bob Operational Agents: Extended Research & Design Rationale 2 3 > Date: 2026-04-02 4 > Status: Research complete, ready for BMAD integration 5 > Sources: Perplexity Deep Research, Google Deep Research, BMAD gap analysis, existing Bob architecture 6 7 ## 1. Problem Statement 8 9 Bob has mature development-time agents (BMAD pipeline) and real-time reactive capabilities (voice, home automation). What's missing is the **operational layer** — agents that run autonomously on schedules, monitor infrastructure, maintain knowledge, and support family daily living without being explicitly asked. 10 11 ## 2. Research-Informed Architecture Decisions 12 13 ### ADR-010: Hybrid Workflow-Agent Architecture 14 **Decision**: Bob's operational layer uses hybrid workflow-agent architecture — deterministic workflows for well-understood tasks (health checks, weather fetch) with agentic autonomy for novel situations (anomaly interpretation, multi-step planning). 15 16 **Rationale**: Research shows multi-agent systems improve parallelizable tasks by +81% but degrade sequential tasks by 39-70%. Bob's operational tasks are a mix: health checks are parallelizable (check all services concurrently), but remediation is sequential (diagnose → fix → verify). The hybrid approach matches agent coordination strategy to task structure. 17 18 ### ADR-011: Blank Context Operational Agents 19 **Decision**: All operational agents start with blank context. They receive only: (a) role definition, (b) task description, (c) lookup tools for gathering additional context. 20 21 **Rationale**: Both research papers confirm blank context subagents outperform inherited-context agents. Context pollution degrades reasoning quality. Bob's operational agents must be stateless-per-invocation but persistent-across-invocations via structured artifacts (state files, knowledge graph entries). 22 23 ### ADR-012: REPL-First for Monitoring Agents 24 **Decision**: Monitoring and maintenance agents receive a Python REPL environment instead of dozens of discrete tools. The REPL has access to: SSH connections, Docker API, NATS client, Prometheus API, and HA API. 25 26 **Rationale**: Research shows REPL-based agents scale to 10M+ tokens and handle novel problems that discrete tools can't anticipate. Infrastructure monitoring frequently requires ad-hoc investigation (e.g., "why is GPU 2 at 95% — is it a specific container or a leak?"). A REPL lets the agent write diagnostic scripts on the fly. 27 28 ### ADR-013: NATS-Based Agent Coordination 29 **Decision**: Operational agents coordinate via NATS subjects under `bob.agent.*`. Agent state, task assignments, and results flow through JetStream for persistence and replay. 30 31 **Rationale**: Bob already has NATS JetStream as its event backbone. Adding agent coordination to NATS (rather than a separate system like Redis or a database) keeps the architecture simple and leverages existing infrastructure. NATS subjects provide natural topic-based routing (e.g., `bob.agent.morning.result`, `bob.agent.health.alert`). 32 33 ### ADR-014: Scheduled Agent Activation via Cron + NATS 34 **Decision**: Agent schedules defined in a YAML config file. A lightweight scheduler service publishes activation events to NATS on schedule. Agents are triggered by NATS messages, not by polling. 35 36 **Rationale**: Event-driven activation via NATS decouples scheduling from execution. Agents don't need to know their schedule — they just respond to activation events. This enables both cron-based triggers and event-driven triggers (e.g., HA state change → activate Home Keeper agent). 37 38 ### ADR-015: Memory Consolidation Daemon (Bob's "AutoDream") 39 **Decision**: A background consolidation process runs during idle periods (late night). It scans interaction logs, knowledge graph changes, and agent execution results, then synthesizes summaries, prunes stale data, and updates persistent context files. 40 41 **Rationale**: Directly inspired by Claude Code's AutoDream architecture. Prevents long-term context decay and ensures each new agent invocation starts with lean, accurate context. 42 43 ## 3. Operational Agent Roles 44 45 ### Role: Morning Coordinator 46 - **Trigger**: Daily at configurable time (default 6:30 AM ET) 47 - **Purpose**: Assemble daily briefing for the family 48 - **Inputs**: Weather API, family calendar, HA device states, overnight event log 49 - **Output**: Structured briefing (text + optional voice announcement) 50 - **Tools**: Weather API, Calendar bridge, NATS event query, Knowledge graph query 51 52 ### Role: Home Keeper 53 - **Trigger**: Hourly health check + event-driven (HA state changes, NATS alerts) 54 - **Purpose**: Monitor and maintain home infrastructure 55 - **Inputs**: Docker container status, NixOS service health, GPU/CPU/RAM/disk metrics, network device inventory 56 - **Output**: Health report, automated remediation actions, maintenance recommendations 57 - **Tools**: Python REPL with SSH, Docker API, Prometheus API, systemctl access 58 - **Escalation**: Critical issues → NATS alert → voice announcement + family notification 59 60 ### Role: Knowledge Gardener 61 - **Trigger**: Nightly (2 AM ET) + weekly deep consolidation 62 - **Purpose**: Maintain and consolidate Bob's knowledge base 63 - **Inputs**: Interaction logs, Graphiti temporal memory, Oxigraph triples, TrustGraph extractions 64 - **Output**: Updated knowledge graph, pruned stale entries, weekly family digest 65 - **Tools**: SPARQL queries, Graphiti API, log analysis, summarization 66 67 ### Role: Family Planner 68 - **Trigger**: On-demand (voice request) + weekly planning session 69 - **Purpose**: Help with family logistics — meal planning, shopping lists, event coordination 70 - **Inputs**: Family calendar, dietary preferences, budget constraints, pantry inventory (future) 71 - **Output**: Plans, lists, reminders, calendar entries 72 - **Tools**: Calendar API, Knowledge graph query, web search (recipes, prices) 73 74 ### Role: System Sentinel 75 - **Trigger**: Continuous via Prometheus alerts + 15-minute polling 76 - **Purpose**: Deep infrastructure monitoring with automated remediation 77 - **Inputs**: Prometheus metrics, Docker logs, NixOS journal, network scans 78 - **Output**: Automated restarts, configuration fixes, security alerts, capacity warnings 79 - **Tools**: Python REPL with full system access (sandboxed), NixOS rebuild capability 80 - **Constraints**: Destructive actions require human approval (NATS confirmation + voice prompt) 81 82 ### Role: Evening Coordinator 83 - **Trigger**: Daily at configurable time (default 8:00 PM ET) 84 - **Purpose**: Daily summary, next-day preparation, maintenance scheduling 85 - **Inputs**: Day's interaction log, health reports, calendar for tomorrow 86 - **Output**: Daily summary, maintenance schedule, next-day prep items 87 - **Tools**: Same as Morning Coordinator + Home Keeper results 88 89 ## 4. Implementation Phases 90 91 ### Phase 1: Agent Scheduler + Home Keeper (Epic 08) 92 - Build agent scheduler service (Python, NATS-triggered) 93 - Implement Home Keeper agent (REPL-based, health checks) 94 - Define agent state protocol (activation → execution → result → archive) 95 - NATS subject hierarchy: `bob.agent.{role}.{event}` 96 97 ### Phase 2: Morning/Evening Coordinators + Weather (Epic 09) 98 - Implement weather integration (Open-Meteo, cached locally) 99 - Build calendar bridge (Google Calendar initially) 100 - Implement Morning Coordinator agent 101 - Implement Evening Coordinator agent 102 - Voice announcement capability (scheduled TTS output to specific rooms) 103 104 ### Phase 3: Knowledge Gardener + Memory Consolidation (Epic 10) 105 - Implement nightly consolidation daemon 106 - Build interaction log analysis pipeline 107 - Implement knowledge graph pruning/merging 108 - Weekly family digest generation 109 110 ### Phase 4: System Sentinel + Distributed Compute (Epic 11) 111 - Implement Prometheus alert → agent trigger pipeline 112 - Build automated remediation playbooks 113 - Satellite device management (kairos-macbook-1.lan, future RPi nodes) 114 - Edge inference exploration (Bonsai 8B, lighter models) 115 116 ## 5. Technology Choices 117 118 | Component | Choice | Rationale | 119 |-----------|--------|-----------| 120 | Agent scheduler | Python + NATS | Lightweight, event-driven, leverages existing NATS | 121 | Agent runtime | Claude Code subagents OR vLLM Qwen3-32B | REPL-capable, tool calling, local inference | 122 | Agent state store | NATS JetStream + filesystem | Persistent, replayable, no new infrastructure | 123 | Monitoring REPL | Python with paramiko (SSH), docker-py, prometheus-api-client | Direct system access without custom tools | 124 | Weather API | Open-Meteo (free, no key) | Already used in voice tools | 125 | Calendar bridge | Google Calendar API or CalDAV | Family already uses Google Calendar (assumption) | 126 | Memory consolidation | Python + Graphiti API + SPARQL | Leverages existing knowledge infrastructure | 127 | Alert routing | Prometheus alertmanager → NATS webhook | Standard, well-supported |