# Epic 08: Agent Scheduler & Home Keeper

> Status: Done | Last updated: 2026-04-02

## Goal

Build the foundational agent scheduler service and the first operational agent (Home Keeper). The scheduler publishes activation events to NATS on schedule; Home Keeper monitors Docker containers, NixOS services, GPU/CPU/RAM/disk, and network devices.

## Dependencies
- Depends on: Epic 01 (NixOS infrastructure, NATS), Epic 05 (NATS JetStream event bus)
- Blocks: Epic 09 (daily coordinators depend on scheduler), Epic 10 (knowledge gardener depends on scheduler), Epic 11 (sentinel uses scheduler and agent protocol)

## Context

Bob's autonomous capabilities require a reliable, observable mechanism for triggering agent execution on schedule. Instead of relying on ad-hoc cron jobs, a centralized agent scheduler publishes activation events to NATS, which keeps agents loosely coupled, independently deployable, and fully auditable. Home Keeper is the first agent built on this framework; it replaces manual infrastructure monitoring with continuous, automated health checks and remediation. This pattern establishes the runtime contract that all subsequent agents (Morning Coordinator, Knowledge Gardener, System Sentinel) will follow.

## Stories

| ID | Story | Status | OpenSpec Refs |
|----|-------|--------|---------------|
| S08-01 | Agent scheduler service (Python, NATS-triggered, cron-style schedule config) | Backlog | REQ-SCHED-001 |
| S08-02 | Agent state protocol (activation → execution → result → archive via NATS JetStream) | Backlog | REQ-SCHED-002 |
| S08-03 | Home Keeper agent: infrastructure health checks (REPL-based with Docker API, Prometheus API, SSH) | Backlog | REQ-SCHED-003 |
| S08-04 | Home Keeper automated remediation: container restart, service restart, disk cleanup | Backlog | REQ-SCHED-004 |
| S08-05 | NATS subject hierarchy for agents: `bob.agent.{role}.{trigger\|result\|alert\|state}` | Backlog | REQ-SCHED-005 |
| S08-06 | Agent execution history and result storage | Backlog | REQ-SCHED-006 |

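
The subject hierarchy in S08-05 could be enforced with a small helper that builds and parses subjects, so that no publisher assembles them by hand. The following is a minimal sketch, not the project's implementation: the function names, the role-naming rule, and the `home-keeper` role string are assumptions made for illustration.

```python
import re

# The four message kinds named in the subject pattern from story S08-05.
KINDS = ("trigger", "result", "alert", "state")

# Assumption: a role is a single lowercase NATS token (letters, digits, "-", "_").
# The epic does not define role naming, so this rule is only illustrative.
_ROLE_RE = re.compile(r"[a-z0-9][a-z0-9_-]*")
_SUBJECT_RE = re.compile(
    r"bob\.agent\.(?P<role>[a-z0-9][a-z0-9_-]*)\.(?P<kind>trigger|result|alert|state)"
)


def agent_subject(role: str, kind: str) -> str:
    """Build `bob.agent.{role}.{kind}`, rejecting anything outside the hierarchy."""
    if not _ROLE_RE.fullmatch(role):
        raise ValueError(f"invalid agent role: {role!r}")
    if kind not in KINDS:
        raise ValueError(f"invalid message kind: {kind!r}")
    return f"bob.agent.{role}.{kind}"


def parse_agent_subject(subject: str) -> tuple[str, str]:
    """Split a subject into (role, kind); raise ValueError if it does not conform."""
    match = _SUBJECT_RE.fullmatch(subject)
    if match is None:
        raise ValueError(f"not an agent subject: {subject!r}")
    return match["role"], match["kind"]


if __name__ == "__main__":
    print(agent_subject("home-keeper", "trigger"))
    print(parse_agent_subject("bob.agent.home-keeper.alert"))
```

Rejecting dots in the role keeps every subject at exactly four tokens, so a wildcard subscription such as `bob.agent.*.alert` matches alerts from every agent and nothing else.
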
## Acceptance Criteria
- [ ] Scheduler service runs as a Docker container, reads schedule config (YAML/TOML), and publishes activation events to NATS on time
- [ ] Agent state transitions (activation → execution → result → archive) are published to JetStream and can be replayed
- [ ] Home Keeper queries Docker API, Prometheus metrics, and SSH endpoints for health status
- [ ] Home Keeper automatically restarts failed containers and runs disk cleanup when thresholds are breached
- [ ] NATS subject hierarchy `bob.agent.*` is documented and enforced
- [ ] Execution history is persisted and queryable (last N runs, failure rate, duration)
- [ ] Agent failures produce alerts on `bob.agent.{role}.alert`

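
The "persisted and queryable" criterion can be illustrated with an in-memory SQLite database covering the three named queries: last N runs, failure rate, and duration. This is a sketch under assumptions. The epic does not define a schema, so the `agent_runs` table, its columns, and the function names are hypothetical, and a JetStream KV store would look different.

```python
import sqlite3

# Hypothetical schema: the epic does not define table or column names.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_runs (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    role        TEXT NOT NULL,
    started_at  REAL NOT NULL,   -- Unix timestamp (seconds)
    finished_at REAL NOT NULL,   -- Unix timestamp (seconds)
    ok          INTEGER NOT NULL -- 1 = success, 0 = failure
)
"""


def record_run(db, role, started_at, finished_at, ok):
    """Store one finished run for the given agent role."""
    db.execute(
        "INSERT INTO agent_runs (role, started_at, finished_at, ok) VALUES (?, ?, ?, ?)",
        (role, started_at, finished_at, int(ok)),
    )


def last_runs(db, role, n):
    """Return the most recent n runs as (started_at, duration_seconds, ok) tuples."""
    rows = db.execute(
        "SELECT started_at, finished_at - started_at, ok FROM agent_runs "
        "WHERE role = ? ORDER BY started_at DESC LIMIT ?",
        (role, n),
    ).fetchall()
    return [(started, duration, bool(ok)) for started, duration, ok in rows]


def failure_rate(db, role, n):
    """Fraction of failed runs among the most recent n, or None if there are no runs."""
    runs = last_runs(db, role, n)
    if not runs:
        return None
    return sum(1 for _, _, ok in runs if not ok) / len(runs)


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    record_run(db, "home-keeper", 100.0, 102.5, True)
    record_run(db, "home-keeper", 200.0, 201.0, False)
    print(last_runs(db, "home-keeper", 5))
    print(failure_rate(db, "home-keeper", 5))
```

Computing the failure rate over a window of recent runs, instead of over all history, keeps the figure responsive to current behavior. The window size is a choice the epic leaves open.
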
## Technical Notes
- Scheduler is a lightweight Python service using `nats-py` and `APScheduler` or an equivalent cron library
- Schedule config is declarative (YAML) and hot-reloadable without a service restart
- Agent state protocol uses NATS JetStream for durable delivery of activation and result messages; JetStream delivers at least once by default, so exactly-once handling relies on publish-side de-duplication (the `Nats-Msg-Id` header) together with double-acknowledged or idempotent consumers
- Home Keeper uses the Docker SDK for Python, the Prometheus HTTP API, and paramiko/asyncssh for NixOS service checks
- Remediation actions are gated by configurable policies (max restart attempts, cooldown periods, escalation to alert)
- Execution history is stored in JetStream KV or a lightweight SQLite database
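
The policy gate on remediation actions can be sketched as a small decision function that is consulted before every restart. Everything named here is an assumption for illustration: the class names, the `max_attempts` and `cooldown_seconds` parameters and their defaults, and the three decision strings are not taken from the epic, which only says that such limits are configurable.

```python
from dataclasses import dataclass, field


@dataclass
class RemediationPolicy:
    # Hypothetical parameter names and defaults; the epic only says such knobs exist.
    max_attempts: int = 3           # restarts allowed before escalating
    cooldown_seconds: float = 60.0  # minimum gap between restarts of one target


@dataclass
class RemediationGate:
    policy: RemediationPolicy
    # Maps a target name to the timestamps of restarts already performed.
    _history: dict = field(default_factory=dict)

    def decide(self, target: str, now: float) -> str:
        """Return "restart", "wait" (cooling down), or "escalate" (budget spent)."""
        attempts = self._history.setdefault(target, [])
        if len(attempts) >= self.policy.max_attempts:
            return "escalate"
        if attempts and now - attempts[-1] < self.policy.cooldown_seconds:
            return "wait"
        attempts.append(now)  # record the restart that the caller is about to perform
        return "restart"

    def reset(self, target: str) -> None:
        """Forget past attempts, for example once the target is healthy again."""
        self._history.pop(target, None)


if __name__ == "__main__":
    gate = RemediationGate(RemediationPolicy(max_attempts=2, cooldown_seconds=60.0))
    for now in (0.0, 10.0, 70.0, 200.0):
        print(now, gate.decide("example-container", now))
```

In this sketch an `"escalate"` decision is the point at which Home Keeper would stop retrying and publish to `bob.agent.{role}.alert`. Passing `now` in explicitly, instead of reading the clock inside `decide`, makes the gate straightforward to unit test.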