# Discovery Brief: Proxy Monitoring as Agentic DevOps

> **Analyst**: Claude (Discovery role)
> **Date**: 2026-04-09
> **Status**: Complete
> **Scope**: How Bob should manage, monitor, and secure the residential Squid proxy on nuclide-amd.lan as part of its agentic devops responsibilities

## Context

Bob has a residential HTTP proxy (Squid on nuclide-amd.lan:3128) used by Glean to access APIs that require residential IPs. This is the first externally exposed service managed by Bob that isn't a web UI: it is infrastructure-as-a-service with real security implications (open to the internet, authenticated, handling third-party traffic).

Bob's existing agentic devops stack handles local infrastructure well (35 containers on rig.lan, 8 Prometheus alerts, hourly health checks, 15-minute deep monitoring). But the proxy introduces new challenges:

1. **Remote host**: Squid runs on nuclide, not rig. Bob monitors nuclide via SSH but doesn't check Docker containers there.
2. **Internet-facing**: First service exposed to the public internet (beyond Caddy's reverse proxies). The attack surface includes brute-force auth attempts, abuse, and bandwidth exhaustion.
3. **Third-party dependency**: Glean depends on this proxy working. Downtime affects an external workflow, not just the family.
4. **Privacy sensitivity**: All traffic through the proxy exits from the family's residential IP. Abuse could trigger ISP issues.

## Current Monitoring Gaps

| Capability | rig.lan Services | nuclide Squid Proxy |
|-----------|-----------------|---------------------|
| Container health | Home Keeper checks every hour | **Not checked** |
| Process status | `docker ps` on rig | **No visibility** |
| HTTP health check | 9 services checked via curl | **Not checked** |
| Log analysis | Sentinel scans 6 containers | **No access** |
| Auth failure detection | N/A (internal services) | **Not monitored** |
| Bandwidth tracking | N/A | **Not tracked** |
| Functional test (end-to-end) | Coordinator NATS test | **Not tested** |
| Prometheus metrics | 8 alert rules | **No scraping** |
| Auto-remediation | Container restart + cleanup | **Not possible remotely** |

## Architecture: How Proxy Fits Into Bob's Agent Layer

```
                    ┌───────────────────────┐
                    │   Agent Scheduler     │
                    │   (cron triggers)     │
                    └──────┬────────────────┘
                           │ bob.agent.*.trigger
                           ▼
              ┌─────────────────────────────┐
              │     System Sentinel         │
              │  (every 15 min, deep)       │
              │                             │
              │  rig.lan checks (existing)  │
              │  + nuclide SSH checks       │
              │  + Squid container status   │◄── NEW
              │  + Squid log analysis       │◄── NEW
              │  + Squid auth failure count │◄── NEW
              └──────┬──────────────────────┘
                     │ bob.agent.system_sentinel.alert
                     ▼
              ┌─────────────────────────────┐
              │      Home Keeper            │
              │  (hourly, + alert bridge)   │
              │                             │
              │  + Proxy functional test    │◄── NEW (curl through proxy → ipify.org)
              │  + Remote container restart │◄── NEW (ssh nuclide docker restart squid-proxy)
              └─────────────────────────────┘
```

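## Proposed Implementation: 3 Phases

### Phase 1: SSH-Based Health Checks (extend System Sentinel)
Add to the sentinel's 15-minute cycle:
- `docker ps --filter name=squid` via SSH to nuclide → container up/down
- `docker logs --since 15m squid-proxy 2>&1 | grep -c 'TCP_DENIED/407'` → auth failure count
- `docker logs --since 15m squid-proxy 2>&1 | grep -ci 'error'` → error count
- Alert if: container down, >10 auth failures in 15 min, >5 errors

`--since 15m` matches the log window to the sentinel cycle, and matching `TCP_DENIED/407` keeps ACL blocks (`TCP_DENIED/403`) out of the auth failure count. Squid also logs `TCP_DENIED/407` for the initial challenge sent to a client that has not yet supplied credentials, so the threshold of 10 may need tuning against normal traffic.
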
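The Phase 1 checks could be sketched roughly as follows. This is a minimal sketch, not the sentinel's actual code: the function names, the SSH host alias `nuclide`, and the alert strings are assumptions for illustration, and only the Docker commands and thresholds come from the list above. It reads logs with `docker logs --since 15m` so the window matches the 15-minute cycle, and it counts matches in Python rather than with `grep -c`, which exits non-zero when the count is 0.

```python
import re
import subprocess

AUTH_FAILURE_LIMIT = 10  # alert when more than 10 auth failures in 15 min
ERROR_LIMIT = 5          # alert when more than 5 error lines in 15 min


def run_on_nuclide(command: str, host: str = "nuclide") -> str:
    """Run a shell command on the remote host over SSH and return its stdout.

    The host alias is an assumption; it must resolve via ~/.ssh/config.
    """
    result = subprocess.run(
        ["ssh", host, command], capture_output=True, text=True, timeout=30
    )
    return result.stdout


def count_matches(log_text: str, pattern: str, ignore_case: bool = False) -> int:
    """Count log lines containing the literal pattern (the job of grep -c)."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(re.escape(pattern), flags)
    return sum(1 for line in log_text.splitlines() if regex.search(line))


def evaluate_squid_checks(ps_output: str, log_text: str) -> list[str]:
    """Turn raw command output into alert messages (empty list = healthy)."""
    alerts = []
    if "squid" not in ps_output:
        alerts.append("squid container is down")
    auth_failures = count_matches(log_text, "TCP_DENIED/407")
    if auth_failures > AUTH_FAILURE_LIMIT:
        alerts.append(f"{auth_failures} auth failures in the last 15 min")
    errors = count_matches(log_text, "error", ignore_case=True)
    if errors > ERROR_LIMIT:
        alerts.append(f"{errors} error lines in the last 15 min")
    return alerts


def check_squid() -> list[str]:
    """Collect both inputs over SSH and evaluate them."""
    ps_output = run_on_nuclide("docker ps --filter name=squid --format '{{.Names}}'")
    log_text = run_on_nuclide("docker logs --since 15m squid-proxy 2>&1")
    return evaluate_squid_checks(ps_output, log_text)
```

Keeping `evaluate_squid_checks` free of SSH calls means the thresholds can be tested against canned log text without touching nuclide.
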
### Phase 2: Functional Proxy Test (extend Home Keeper)
Add an end-to-end proxy test to the hourly health check:
- `curl -sS --max-time 15 --proxy http://glean:PASS@nuclide-amd.lan:3128 https://api.ipify.org` → should return `47.205.28.88` (`PASS` is a placeholder for the credential)
- If the request fails: alert "Proxy functional test failed"
- If the returned IP differs: alert "Residential IP has changed" (the ISP assigns a dynamic IP)
- Credential stored in sops, read from `/run/secrets/` at runtime

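A rough sketch of the functional test, under stated assumptions: it uses Python's standard `urllib` in place of `curl` so the password never appears in a process list, and the secret file name `squid_glean_password` and all function names are hypothetical. The user name, proxy address, test URL, and expected IP are taken from the steps above.

```python
import urllib.request
from pathlib import Path
from urllib.parse import quote

EXPECTED_IP = "47.205.28.88"
# Hypothetical secret file name; the brief only specifies /run/secrets/.
SECRET_PATH = Path("/run/secrets/squid_glean_password")


def fetch_exit_ip(password: str, timeout: float = 15.0) -> str:
    """Ask api.ipify.org for the public IP, routed through the Squid proxy."""
    # Percent-encode the password so characters like '@' or ':' stay intact.
    proxy_url = f"http://glean:{quote(password, safe='')}@nuclide-amd.lan:3128"
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"https": proxy_url})
    )
    with opener.open("https://api.ipify.org", timeout=timeout) as response:
        return response.read().decode().strip()


def classify_result(observed_ip, expected_ip=EXPECTED_IP):
    """Map a test outcome to an alert message, or None when all is well.

    observed_ip is None when the request through the proxy failed.
    """
    if observed_ip is None:
        return "Proxy functional test failed"
    if observed_ip != expected_ip:
        return f"Residential IP has changed: {expected_ip} -> {observed_ip}"
    return None


def run_functional_test():
    """Read the credential at runtime, run the test, return an alert or None."""
    password = SECRET_PATH.read_text().strip()
    try:
        observed_ip = fetch_exit_ip(password)
    except OSError:  # urllib.error.URLError and socket timeouts are OSErrors
        observed_ip = None
    return classify_result(observed_ip)
```

Because the residential IP is dynamic, the expected value would likely need to live in configuration or state rather than in a constant, so that an acknowledged IP change stops alerting.
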
### Phase 3: Abuse Detection + Auto-Remediation
- Parse Squid access logs for abuse patterns: >100 requests/min from a single IP, connections to suspicious domains, bandwidth spikes
- LLM-based anomaly detection (tier 3 pattern, similar to home-automations)
- Auto-remediation: restart Squid if hung, block an IP after N failures (iptables via SSH)
- Voice alerting: "Bob, the proxy is seeing unusual traffic from IP X"

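The request-rate rule (>100 requests/min from a single IP) could be sketched like this. It is an illustration only: the function name is hypothetical, and it assumes Squid's native access.log format, in which the first field is a Unix timestamp and the third is the client IP.

```python
from collections import Counter

REQUESTS_PER_MINUTE_LIMIT = 100


def find_noisy_clients(log_lines, limit=REQUESTS_PER_MINUTE_LIMIT):
    """Return {(client_ip, minute_start_epoch): count} for buckets over the limit.

    Requests are grouped into fixed one-minute buckets per client IP.
    """
    buckets = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or truncated line
        try:
            minute_start = int(float(fields[0])) // 60 * 60
        except ValueError:
            continue  # not an access.log entry (e.g. a startup message)
        buckets[(fields[2], minute_start)] += 1
    return {key: count for key, count in buckets.items() if count > limit}
```

Fixed buckets are simple but can miss a burst that straddles a minute boundary; a sliding window would catch that at the cost of a little more state.
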
## Security Requirements for Monitoring

1. **Proxy password must be in sops**: monitoring agents read it from `/run/secrets/`; it is never hardcoded
2. **Functional test must use a dedicated test credential** (or the production `glean` credential via sops)
3. **SSH access to nuclide is already established** (System Sentinel + Device Health use it today)
4. **Log data stays on LAN**: Squid logs are parsed via SSH and summaries are published to NATS; raw logs never leave nuclide

## What This Enables (Voice Integration)

Once monitoring is wired, Bob can answer:
- "Bob, is the proxy working?" → functional test result
- "Bob, any auth failures on the proxy?" → log analysis
- "Bob, what's the proxy bandwidth this hour?" → Squid stats
- "Bob, restart the proxy" → SSH to nuclide, `docker restart squid-proxy`

These become tools in the coordinator's toolkit, following the same pattern as `get_home_state`, `execute_code`, etc.

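One way such tools might be wired is sketched below. The coordinator's real tool interface is not shown in this brief, so the tool names, the table layout, and the `nuclide` SSH alias are all assumptions; only `docker restart squid-proxy` and `docker ps --filter name=squid` come from the text.

```python
import subprocess

# Hypothetical tool table: maps a coordinator tool name to the remote command
# it runs. The names on the left are illustrative, not existing tools.
PROXY_TOOLS = {
    "restart_proxy": ["docker", "restart", "squid-proxy"],
    "proxy_container_status": ["docker", "ps", "--filter", "name=squid"],
}


def build_ssh_command(tool_name, host="nuclide"):
    """Return the argv list for a known tool; unknown names raise KeyError.

    Commands come from a fixed table, so a voice request can only trigger
    one of these predefined commands, never arbitrary shell input.
    """
    return ["ssh", host, *PROXY_TOOLS[tool_name]]


def run_tool(tool_name):
    """Execute the tool over SSH and return (exit_code, combined_output)."""
    result = subprocess.run(
        build_ssh_command(tool_name), capture_output=True, text=True, timeout=60
    )
    return result.returncode, result.stdout + result.stderr
```

A restart is disruptive to Glean, so a confirmation step before `restart_proxy` runs may be worth considering.
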
## Monitoring Tools Available (from research)

**Squid native stats** (no exporter needed for Phase 1):
- `curl http://localhost:3128/squid-internal-mgr/info` → overall status
- `curl http://localhost:3128/squid-internal-mgr/counters` → request/byte counters
- Requires ACL (one directive per line in squid.conf): `acl monitoring src 127.0.0.1`, `http_access allow monitoring manager`, `http_access deny manager`. Current Squid releases predefine the `manager` ACL, so `acl manager proto cache_object` is only needed on old releases.

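The counters report is plain text with one `name = value` pair per line, so it can be reduced to a dictionary with a few lines of code. A minimal sketch follows; the function names are hypothetical, and it assumes the `client_http.kbytes_out` counter is the one of interest for the bandwidth question.

```python
def parse_counters(text):
    """Parse 'name = value' lines from the counters report into floats."""
    counters = {}
    for line in text.splitlines():
        name, separator, value = line.partition("=")
        if not separator:
            continue  # no '=' on this line
        tokens = value.split()
        if not tokens:
            continue
        try:
            # Only the first token is numeric; sample_time carries a
            # human-readable date in parentheses after the number.
            counters[name.strip()] = float(tokens[0])
        except ValueError:
            continue  # non-numeric value
    return counters


def kbytes_out_between(earlier, later):
    """KB sent to clients between two samples of the cumulative counter."""
    key = "client_http.kbytes_out"
    return later[key] - earlier[key]
```

The counters are cumulative since Squid started, so a negative difference between two samples indicates a restart in between rather than negative traffic.
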
**Squid access.log format** (native):
```
timestamp  elapsed  client_ip  result_code/status  bytes  method  URL  username  hierarchy/peer  content_type
```
Key codes: `TCP_MISS/200` (success), `TCP_TUNNEL/200` (successful HTTPS `CONNECT` tunnel), `TCP_DENIED/407` (auth failure), `TCP_DENIED/403` (ACL block)

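A line in this format can be split on whitespace and mapped onto named fields. The sketch below is illustrative: the `AccessLogEntry` type, the function names, and the category labels are assumptions, and it only handles the native format shown above.

```python
from typing import NamedTuple, Optional


class AccessLogEntry(NamedTuple):
    timestamp: float
    elapsed_ms: int
    client_ip: str
    result_code: str
    status: int
    size_bytes: int
    method: str
    url: str
    username: str


def parse_access_line(line: str) -> Optional[AccessLogEntry]:
    """Parse one native-format line; return None if it does not fit."""
    fields = line.split()
    if len(fields) < 10:
        return None  # blank, truncated, or a different log format
    result_code, _, status = fields[3].partition("/")
    try:
        return AccessLogEntry(
            timestamp=float(fields[0]),
            elapsed_ms=int(fields[1]),
            client_ip=fields[2],
            result_code=result_code,
            status=int(status),
            size_bytes=int(fields[4]),
            method=fields[5],
            url=fields[6],
            username=fields[7],
        )
    except ValueError:
        return None


def classify(entry: AccessLogEntry) -> str:
    """Bucket an entry using the key codes listed above."""
    if entry.result_code == "TCP_DENIED" and entry.status == 407:
        return "auth_failure"
    if entry.result_code == "TCP_DENIED" and entry.status == 403:
        return "acl_block"
    if 200 <= entry.status < 400:
        return "success"
    return "other"
```
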
**Prometheus exporter**: `boynux/squid-exporter` on `:9301` exports request rates, bytes, cache hits, and connection counts. Pre-built Grafana dashboards are available (#14394).

**Brute force protection**: fail2ban with a custom filter matching `TCP_DENIED/407`, banning an IP after 5 failures in 10 min.

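The ban rule itself (5 failures within 10 minutes) is a sliding-window count per IP. The sketch below models that rule in Python to make the thresholds concrete; it is not fail2ban's implementation or configuration, and the function name is hypothetical.

```python
from collections import defaultdict, deque

MAX_FAILURES = 5
WINDOW_SECONDS = 600  # 10 minutes


def ips_to_ban(failures, max_failures=MAX_FAILURES, window=WINDOW_SECONDS):
    """Return the set of IPs with max_failures inside any window-second span.

    failures is an iterable of (timestamp, client_ip) pairs for
    TCP_DENIED/407 entries, in chronological order.
    """
    recent = defaultdict(deque)  # per-IP timestamps still inside the window
    banned = set()
    for timestamp, ip in failures:
        times = recent[ip]
        times.append(timestamp)
        # Drop failures that have aged out of the window.
        while times and timestamp - times[0] > window:
            times.popleft()
        if len(times) >= max_failures:
            banned.add(ip)
    return banned
```
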
**Security note**: CVE-2025-62168 (CVSS 10.0) affects Squid and leaks auth data. Must verify we're running Squid >= 7.2.

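The version check can be automated by parsing the first line of `squid -v`, which has the form `Squid Cache: Version X.Y`. A minimal sketch, with hypothetical function names; how the output is obtained (for example via `docker exec` over SSH) is left out and would be an assumption about the container setup.

```python
import re

MIN_SAFE_VERSION = (7, 2)


def parse_squid_version(version_output):
    """Extract (major, minor) from `squid -v` output, or None if not found."""
    match = re.search(r"Version (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_patched(version_output, minimum=MIN_SAFE_VERSION):
    """True only when a version was found and it is at least the minimum."""
    version = parse_squid_version(version_output)
    return version is not None and version >= minimum
```

Comparing integer tuples avoids the string-comparison trap where `"6.10"` would sort after `"6.9"` incorrectly. Distribution packages sometimes backport fixes without bumping the upstream version, so a failing check is a prompt to inspect the package changelog, not proof of exposure.
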
## Dependencies

- System Sentinel already connects to nuclide over SSH (confirmed in code)
- Home Keeper has the remediation framework for container restarts
- Squid proxy password needs to be added to sops for monitoring agents
- Router port forward (3128) must be in place before external-facing monitoring makes sense

## Recommendation

Start with Phase 1 (sentinel SSH checks): it's 30-40 lines of code in the existing sentinel agent. Phase 2 (functional test) follows naturally once the proxy credential is in sops. Phase 3 (abuse detection) can wait until the proxy has real traffic to analyze.