# Bob Operational Agents: Extended Research & Design Rationale

> Date: 2026-04-02
> Status: Research complete, ready for BMAD integration
> Sources: Perplexity Deep Research, Google Deep Research, BMAD gap analysis, existing Bob architecture

## 1. Problem Statement

Bob has mature development-time agents (BMAD pipeline) and real-time reactive capabilities (voice, home automation). What's missing is the **operational layer**: agents that run autonomously on schedules, monitor infrastructure, maintain knowledge, and support family daily living without being explicitly asked.

## 2. Research-Informed Architecture Decisions

### ADR-010: Hybrid Workflow-Agent Architecture
**Decision**: Bob's operational layer uses a hybrid workflow-agent architecture: deterministic workflows for well-understood tasks (health checks, weather fetch), with agentic autonomy for novel situations (anomaly interpretation, multi-step planning).

**Rationale**: Research shows multi-agent systems improve parallelizable tasks by 81% but degrade sequential tasks by 39-70%. Bob's operational tasks are a mix: health checks are parallelizable (check all services concurrently), but remediation is sequential (diagnose → fix → verify). The hybrid approach matches agent coordination strategy to task structure.
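
The split between a parallel check phase and a sequential remediation phase can be sketched as follows. This is a minimal illustration, not Bob's implementation: the service names, `check_service`, and `remediate` are hypothetical stand-ins, and in the real design the remediation steps would be driven by an agent rather than a fixed loop.

```python
import asyncio

# Hypothetical service names; real checks would query Docker, systemd, etc.
SERVICES = ["nats", "vllm", "home-assistant"]


async def check_service(name: str) -> tuple[str, bool]:
    """Stand-in health check that pretends 'vllm' is unhealthy."""
    await asyncio.sleep(0)  # placeholder for real I/O
    return name, name != "vllm"


async def run_health_checks(services: list[str]) -> dict[str, bool]:
    """Parallel phase: every check runs concurrently."""
    results = await asyncio.gather(*(check_service(s) for s in services))
    return dict(results)


async def remediate(service: str) -> list[str]:
    """Sequential phase: diagnose, then fix, then verify, strictly in order."""
    steps = []
    for step in ("diagnose", "fix", "verify"):
        await asyncio.sleep(0)  # each step waits for the previous one
        steps.append(f"{step}:{service}")
    return steps


async def operational_cycle(services: list[str] = SERVICES):
    health = await run_health_checks(services)
    failed = [name for name, ok in health.items() if not ok]
    actions: list[str] = []
    for name in failed:  # remediation is deliberately not parallelized
        actions.extend(await remediate(name))
    return health, actions


if __name__ == "__main__":
    print(asyncio.run(operational_cycle()))
```

Only the check phase uses `asyncio.gather`. Remediation stays a plain loop so that each step can depend on the outcome of the one before it.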

### ADR-011: Blank Context Operational Agents
**Decision**: All operational agents start with blank context. They receive only: (a) role definition, (b) task description, (c) lookup tools for gathering additional context.

**Rationale**: Both research papers confirm that blank-context subagents outperform inherited-context agents. Context pollution degrades reasoning quality. Bob's operational agents must be stateless per invocation but persistent across invocations via structured artifacts (state files, knowledge graph entries).
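
One way to picture "stateless per invocation, persistent across invocations" is a context object that carries only role, task, and tool names, plus a state file read and written outside that context. The sketch below assumes a JSON state file. `AgentContext`, `run_invocation`, and the `lookup_state` tool name are hypothetical, not part of Bob's codebase.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AgentContext:
    """Everything an agent starts with: no history, no inherited transcript."""
    role: str
    task: str
    tools: tuple[str, ...] = ()


def load_state(path: Path) -> dict:
    """Read the persistent artifact, or start empty on first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2, sort_keys=True))


def run_invocation(state_path: Path, role: str, task: str) -> AgentContext:
    """Build a fresh context each time; only the state file carries over."""
    ctx = AgentContext(role=role, task=task, tools=("lookup_state",))
    state = load_state(state_path)  # what a lookup tool would fetch on demand
    state["runs"] = state.get("runs", 0) + 1
    save_state(state_path, state)
    return ctx


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "home_keeper.json"
        run_invocation(path, "home_keeper", "hourly health check")
        run_invocation(path, "home_keeper", "hourly health check")
        print(load_state(path))
```

Both invocations receive an identical context. The only difference between them lives in the state file.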

### ADR-012: REPL-First for Monitoring Agents
**Decision**: Monitoring and maintenance agents receive a Python REPL environment instead of dozens of discrete tools. The REPL has access to: SSH connections, Docker API, NATS client, Prometheus API, and HA API.

**Rationale**: Research shows REPL-based agents scale to 10M+ tokens and handle novel problems that discrete tools can't anticipate. Infrastructure monitoring frequently requires ad-hoc investigation (e.g., "why is GPU 2 at 95%? Is it a specific container or a leak?"). A REPL lets the agent write diagnostic scripts on the fly.
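
A minimal sketch of the REPL idea, assuming the agent submits Python source as text and gets captured output back. `ReplSession` and the preloaded `container_gpu_mem` helper are hypothetical. A real session would preload SSH, Docker, NATS, Prometheus, and HA clients instead, and plain `exec` provides no sandboxing on its own.

```python
import contextlib
import io
import traceback


class ReplSession:
    """Keeps one namespace alive so later snippets can reuse earlier results."""

    def __init__(self, **preloaded):
        # Preloaded names play the role of the clients the REPL exposes.
        self.namespace = dict(preloaded)

    def run(self, source: str) -> str:
        """Execute a snippet and return whatever it printed (or its traceback)."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)  # NOT a sandbox; isolation is separate
            except Exception:
                buffer.write(traceback.format_exc())
        return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical helper returning GPU memory (GiB) per container.
    session = ReplSession(container_gpu_mem=lambda: {"vllm": 21.5, "whisper": 2.0})
    print(session.run("usage = container_gpu_mem()\nprint(max(usage, key=usage.get))"))
    print(session.run("print(usage['whisper'])"))  # 'usage' persists between calls
```

Returning tracebacks as ordinary output lets the agent see its own errors and adjust the next snippet, which is the main behavioral difference from a fixed tool list.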

### ADR-013: NATS-Based Agent Coordination
**Decision**: Operational agents coordinate via NATS subjects under `bob.agent.*`. Agent state, task assignments, and results flow through JetStream for persistence and replay.

**Rationale**: Bob already has NATS JetStream as its event backbone. Adding agent coordination to NATS (rather than a separate system like Redis or a database) keeps the architecture simple and leverages existing infrastructure. NATS subjects provide natural topic-based routing (e.g., `bob.agent.morning.result`, `bob.agent.health.alert`).
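
To make the routing concrete, the sketch below builds subjects of the form `bob.agent.{role}.{event}` and reimplements NATS wildcard matching locally (`*` matches exactly one token, `>` matches one or more trailing tokens). It is only an illustration of the subject semantics: `agent_subject` and `subject_matches` are hypothetical helpers, not NATS client APIs, and actual matching is done by the NATS server.

```python
def agent_subject(role: str, event: str) -> str:
    """Build a subject such as 'bob.agent.health.alert'."""
    for token in (role, event):
        # Tokens must be non-empty and free of separators and wildcards.
        if not token or any(ch in token for ch in ".*> \t"):
            raise ValueError(f"invalid subject token: {token!r}")
    return f"bob.agent.{role}.{event}"


def subject_matches(pattern: str, subject: str) -> bool:
    """Local model of NATS subscription matching, for illustration only."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last token and cover at least one subject token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


if __name__ == "__main__":
    subject = agent_subject("health", "alert")
    print(subject, subject_matches("bob.agent.*.alert", subject))
```

Under these rules a subscriber on `bob.agent.*.alert` sees alerts from every role, while `bob.agent.>` sees all agent traffic.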

### ADR-014: Scheduled Agent Activation via Cron + NATS
**Decision**: Agent schedules are defined in a YAML config file. A lightweight scheduler service publishes activation events to NATS on schedule. Agents are triggered by NATS messages, not by polling.

**Rationale**: Event-driven activation via NATS decouples scheduling from execution. Agents don't need to know their schedule; they just respond to activation events. This enables both cron-based triggers and event-driven triggers (e.g., HA state change → activate Home Keeper agent).
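
A rough sketch of the scheduler's core decision, "which activation subjects are due at this minute". The `SCHEDULES` dict mirrors what a YAML loader might return. Its role keys, the cron expressions, and the `activate` event name are assumptions. The matcher is deliberately simplified: step values on `*` are anchored at 0, and day-of-month and day-of-week are combined with AND, unlike standard cron.

```python
from datetime import datetime

# Hypothetical schedule config, shaped like the parsed YAML file.
SCHEDULES = {
    "morning": "30 6 * * *",     # daily 06:30
    "health": "0 * * * *",       # hourly
    "sentinel": "*/15 * * * *",  # every 15 minutes
    "gardener": "0 2 * * *",     # nightly 02:00
}


def _field_matches(expr: str, value: int) -> bool:
    """Match one cron field: supports '*', 'n', 'a-b', 'a,b' and '/step'."""
    for part in expr.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            if value % step == 0:  # simplification: steps anchored at 0
                return True
        elif "-" in part:
            low, high = map(int, part.split("-"))
            if low <= value <= high and (value - low) % step == 0:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (when.weekday() + 1) % 7  # cron counts Sunday as 0
    return (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(dom, when.day)
        and _field_matches(month, when.month)
        and _field_matches(dow, cron_dow)
    )


def due_activations(schedules: dict[str, str], when: datetime) -> list[str]:
    """Subjects the scheduler would publish to at this minute."""
    return [
        f"bob.agent.{role}.activate"
        for role, expr in schedules.items()
        if cron_matches(expr, when)
    ]


if __name__ == "__main__":
    print(due_activations(SCHEDULES, datetime(2026, 4, 2, 6, 30)))
```

The scheduler loop itself would call `due_activations` once per minute and publish each subject with a NATS client. Event-driven triggers bypass this path and publish the same activation subjects directly.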

### ADR-015: Memory Consolidation Daemon (Bob's "AutoDream")
**Decision**: A background consolidation process runs during idle periods (late night). It scans interaction logs, knowledge graph changes, and agent execution results, then synthesizes summaries, prunes stale data, and updates persistent context files.

**Rationale**: This is directly inspired by Claude Code's AutoDream architecture. It prevents long-term context decay and ensures each new agent invocation starts with lean, accurate context.
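
The prune-then-summarize shape of a consolidation pass can be sketched as below. `LogEntry`, the 30-day cutoff, and the per-source count "summary" are all assumptions made for illustration. The synthesis step described above would presumably use a model rather than a counter.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    source: str  # e.g. "voice", "agent"; hypothetical labels
    text: str


def consolidate(entries, now, max_age=timedelta(days=30)):
    """Drop entries older than max_age and summarize what remains."""
    cutoff = now - max_age
    fresh = [e for e in entries if e.timestamp >= cutoff]
    counts = Counter(e.source for e in fresh)
    # Stand-in for a synthesized summary: entry counts per source.
    summary = ", ".join(f"{src}: {n}" for src, n in sorted(counts.items()))
    return fresh, {"pruned": len(entries) - len(fresh), "summary": summary}


if __name__ == "__main__":
    now = datetime(2026, 4, 2)
    log = [
        LogEntry(now - timedelta(days=1), "voice", "asked about weather"),
        LogEntry(now - timedelta(days=45), "voice", "old request"),
    ]
    print(consolidate(log, now)[1])
```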

## 3. Operational Agent Roles

### Role: Morning Coordinator
- **Trigger**: Daily at configurable time (default 6:30 AM ET)
- **Purpose**: Assemble daily briefing for the family
- **Inputs**: Weather API, family calendar, HA device states, overnight event log
- **Output**: Structured briefing (text + optional voice announcement)
- **Tools**: Weather API, Calendar bridge, NATS event query, Knowledge graph query

### Role: Home Keeper
- **Trigger**: Hourly health check + event-driven (HA state changes, NATS alerts)
- **Purpose**: Monitor and maintain home infrastructure
- **Inputs**: Docker container status, NixOS service health, GPU/CPU/RAM/disk metrics, network device inventory
- **Output**: Health report, automated remediation actions, maintenance recommendations
- **Tools**: Python REPL with SSH, Docker API, Prometheus API, systemctl access
- **Escalation**: Critical issues → NATS alert → voice announcement + family notification

### Role: Knowledge Gardener
- **Trigger**: Nightly (2 AM ET) + weekly deep consolidation
- **Purpose**: Maintain and consolidate Bob's knowledge base
- **Inputs**: Interaction logs, Graphiti temporal memory, Oxigraph triples, TrustGraph extractions
- **Output**: Updated knowledge graph, pruned stale entries, weekly family digest
- **Tools**: SPARQL queries, Graphiti API, log analysis, summarization

### Role: Family Planner
- **Trigger**: On-demand (voice request) + weekly planning session
- **Purpose**: Help with family logistics: meal planning, shopping lists, event coordination
- **Inputs**: Family calendar, dietary preferences, budget constraints, pantry inventory (future)
- **Output**: Plans, lists, reminders, calendar entries
- **Tools**: Calendar API, Knowledge graph query, web search (recipes, prices)

### Role: System Sentinel
- **Trigger**: Continuous via Prometheus alerts + 15-minute polling
- **Purpose**: Deep infrastructure monitoring with automated remediation
- **Inputs**: Prometheus metrics, Docker logs, NixOS journal, network scans
- **Output**: Automated restarts, configuration fixes, security alerts, capacity warnings
- **Tools**: Python REPL with full system access (sandboxed), NixOS rebuild capability
- **Constraints**: Destructive actions require human approval (NATS confirmation + voice prompt)
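
The System Sentinel constraint amounts to an approval gate in front of the executor. A minimal sketch, assuming actions are tagged by verb: the `DESTRUCTIVE_VERBS` set and the `Action` type are hypothetical, and `request_approval` stands in for the NATS confirmation plus voice prompt round trip.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical classification; which verbs count as destructive is a policy choice.
DESTRUCTIVE_VERBS = {"delete", "rebuild", "reboot", "prune"}


@dataclass(frozen=True)
class Action:
    verb: str
    target: str


def is_destructive(action: Action) -> bool:
    return action.verb in DESTRUCTIVE_VERBS


def execute_with_gate(
    action: Action,
    execute: Callable[[Action], None],
    request_approval: Callable[[Action], bool],
) -> str:
    """Run safe actions immediately; ask a human before destructive ones."""
    if is_destructive(action) and not request_approval(action):
        return "denied"
    execute(action)
    return "executed"


if __name__ == "__main__":
    done: list[Action] = []
    # A restart goes straight through; a rebuild is blocked when approval is refused.
    print(execute_with_gate(Action("restart", "vllm"), done.append, lambda a: False))
    print(execute_with_gate(Action("rebuild", "nixos"), done.append, lambda a: False))
```

Keeping the gate outside the agent's REPL means the agent cannot skip it by writing its own script, provided the REPL has no direct path to the destructive operations.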

### Role: Evening Coordinator
- **Trigger**: Daily at configurable time (default 8:00 PM ET)
- **Purpose**: Daily summary, next-day preparation, maintenance scheduling
- **Inputs**: Day's interaction log, health reports, calendar for tomorrow
- **Output**: Daily summary, maintenance schedule, next-day prep items
- **Tools**: Same as Morning Coordinator + Home Keeper results

## 4. Implementation Phases

### Phase 1: Agent Scheduler + Home Keeper (Epic 08)
- Build agent scheduler service (Python, NATS-triggered)
- Implement Home Keeper agent (REPL-based, health checks)
- Define agent state protocol (activation → execution → result → archive)
- NATS subject hierarchy: `bob.agent.{role}.{event}`
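
The agent state protocol listed above can be modeled as a strictly linear state machine. In this sketch the state names come from the protocol, but reusing them as the `{event}` token of the subject is an assumption, as are the `advance` and `event_subject` helpers.

```python
from enum import Enum


class AgentState(Enum):
    ACTIVATION = "activation"
    EXECUTION = "execution"
    RESULT = "result"
    ARCHIVE = "archive"


# Each state has exactly one successor; ARCHIVE is terminal.
_NEXT = {
    AgentState.ACTIVATION: AgentState.EXECUTION,
    AgentState.EXECUTION: AgentState.RESULT,
    AgentState.RESULT: AgentState.ARCHIVE,
}


def advance(state: AgentState) -> AgentState:
    """Move to the next protocol state, refusing to leave the terminal state."""
    try:
        return _NEXT[state]
    except KeyError:
        raise ValueError(f"{state.name} is terminal") from None


def event_subject(role: str, state: AgentState) -> str:
    """Assumed mapping from protocol state to the subject's event token."""
    return f"bob.agent.{role}.{state.value}"


if __name__ == "__main__":
    state = AgentState.ACTIVATION
    while True:
        print(event_subject("health", state))
        if state is AgentState.ARCHIVE:
            break
        state = advance(state)
```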

### Phase 2: Morning/Evening Coordinators + Weather (Epic 09)
- Implement weather integration (Open-Meteo, cached locally)
- Build calendar bridge (Google Calendar initially)
- Implement Morning Coordinator agent
- Implement Evening Coordinator agent
- Voice announcement capability (scheduled TTS output to specific rooms)
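
For the "Open-Meteo, cached locally" item, one possible shape is a URL builder plus a small time-to-live cache around the fetch. The sketch assumes the public Open-Meteo forecast endpoint and its `latitude`, `longitude`, `current`, and `timezone` query parameters. The chosen weather variables, the 15-minute TTL, and the `TTLCache` class are illustrative assumptions.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.open-meteo.com/v1/forecast"


def forecast_url(lat: float, lon: float, timezone: str = "America/New_York") -> str:
    """Build a current-conditions request URL; variables chosen for illustration."""
    params = {
        "latitude": lat,
        "longitude": lon,
        "current": "temperature_2m,precipitation",
        "timezone": timezone,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_json(url: str) -> dict:
    """Default fetcher; performs a real HTTP request when called."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


class TTLCache:
    """Caches fetch results per URL for ttl_seconds."""

    def __init__(self, fetch=fetch_json, ttl_seconds: float = 900, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without waiting
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, url: str) -> dict:
        now = self._clock()
        hit = self._entries.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no network call
        value = self._fetch(url)
        self._entries[url] = (now, value)
        return value
```

A Morning Coordinator run and a voice query a few minutes apart would then share one upstream request.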

### Phase 3: Knowledge Gardener + Memory Consolidation (Epic 10)
- Implement nightly consolidation daemon
- Build interaction log analysis pipeline
- Implement knowledge graph pruning/merging
- Weekly family digest generation

### Phase 4: System Sentinel + Distributed Compute (Epic 11)
- Implement Prometheus alert → agent trigger pipeline
- Build automated remediation playbooks
- Satellite device management (kairos-macbook-1.lan, future RPi nodes)
- Edge inference exploration (Bonsai 8B, lighter models)
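
The Prometheus alert → agent trigger pipeline could start from a small translation step that turns an Alertmanager webhook payload into NATS messages. The sketch relies on the webhook's `alerts` list with per-alert `status`, `labels`, `annotations`, and `startsAt` fields. The `sentinel` role token, the `resolved` event name, and the trimmed message body are assumptions. Publishing is left to whatever NATS client the service uses.

```python
import json


def alerts_to_messages(payload: dict, role: str = "sentinel") -> list[tuple[str, bytes]]:
    """Translate one Alertmanager webhook payload into (subject, body) pairs."""
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # Firing alerts go to '.alert'; anything else is treated as resolved.
        event = "alert" if alert.get("status") == "firing" else "resolved"
        body = {
            "alertname": labels.get("alertname", "unknown"),
            "severity": labels.get("severity", "unknown"),
            "summary": annotations.get("summary", ""),
            "starts_at": alert.get("startsAt"),
        }
        subject = f"bob.agent.{role}.{event}"
        messages.append((subject, json.dumps(body, sort_keys=True).encode()))
    return messages


if __name__ == "__main__":
    sample = {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "GpuMemoryHigh", "severity": "critical"},
                "annotations": {"summary": "GPU 2 at 95%"},
                "startsAt": "2026-04-02T06:00:00Z",
            }
        ],
    }
    for subject, body in alerts_to_messages(sample):
        print(subject, body.decode())
```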

## 5. Technology Choices

| Component | Choice | Rationale |
|-----------|--------|-----------|
| Agent scheduler | Python + NATS | Lightweight, event-driven, leverages existing NATS |
| Agent runtime | Claude Code subagents OR vLLM Qwen3-32B | REPL-capable, tool calling, local inference |
| Agent state store | NATS JetStream + filesystem | Persistent, replayable, no new infrastructure |
| Monitoring REPL | Python with paramiko (SSH), docker-py, prometheus-api-client | Direct system access without custom tools |
| Weather API | Open-Meteo (free, no key) | Already used in voice tools |
| Calendar bridge | Google Calendar API or CalDAV | Family already uses Google Calendar (assumption) |
| Memory consolidation | Python + Graphiti API + SPARQL | Leverages existing knowledge infrastructure |
| Alert routing | Prometheus alertmanager → NATS webhook | Standard, well-supported |