# Discovery Brief: Proxy Monitoring as Agentic DevOps

> **Analyst**: Claude (Discovery role)
> **Date**: 2026-04-09
> **Status**: Complete
> **Scope**: How Bob should manage, monitor, and secure the residential Squid proxy on nuclide-amd.lan as part of its agentic devops responsibilities

## Context

Bob has a residential HTTP proxy (Squid on nuclide-amd.lan:3128) used by Glean to access APIs that require residential IPs. This is the first externally exposed service managed by Bob that isn't a web UI: it is infrastructure-as-a-service with real security implications (open to the internet, authenticated, handling third-party traffic).

Bob's existing agentic devops stack handles local infrastructure well (35 containers on rig.lan, 8 Prometheus alerts, hourly health checks, 15-min deep monitoring). But the proxy introduces new challenges:

1. **Remote host**: Squid runs on nuclide, not rig. Bob monitors nuclide via SSH but doesn't check Docker containers there.
2. **Internet-facing**: First service exposed to the public internet (beyond Caddy's reverse proxies). Attack surface includes brute-force auth, abuse, bandwidth exhaustion.
3. **Third-party dependency**: Glean depends on this proxy working. Downtime affects an external workflow, not just the family.
4. **Privacy sensitivity**: All traffic through the proxy exits from the family's residential IP. Abuse could trigger ISP issues.

## Current Monitoring Gaps

| Capability | rig.lan Services | nuclide Squid Proxy |
|-----------|-----------------|---------------------|
| Container health | Home Keeper checks every hour | **Not checked** |
| Process status | `docker ps` on rig | **No visibility** |
| HTTP health check | 9 services checked via curl | **Not checked** |
| Log analysis | Sentinel scans 6 containers | **No access** |
| Auth failure detection | N/A (internal services) | **Not monitored** |
| Bandwidth tracking | N/A | **Not tracked** |
| Functional test (end-to-end) | Coordinator NATS test | **Not tested** |
| Prometheus metrics | 8 alert rules | **No scraping** |
| Auto-remediation | Container restart + cleanup | **Not possible remotely** |

## Architecture: How Proxy Fits Into Bob's Agent Layer

```
┌──────────────────────┐
│ Agent Scheduler      │
│ (cron triggers)      │
└──────┬───────────────┘
       │ bob.agent.*.trigger
       ▼
┌────────────────────────────┐
│ System Sentinel            │
│ (every 15 min, deep)       │
│                            │
│ rig.lan checks (existing)  │
│ + nuclide SSH checks       │
│ + Squid container status   │◄── NEW
│ + Squid log analysis       │◄── NEW
│ + Squid auth failure count │◄── NEW
└──────┬─────────────────────┘
       │ bob.agent.system_sentinel.alert
       ▼
┌────────────────────────────┐
│ Home Keeper                │
│ (hourly, + alert bridge)   │
│                            │
│ + Proxy functional test    │◄── NEW (curl through proxy → ipify.org)
│ + Remote container restart │◄── NEW (ssh nuclide docker restart squid-proxy)
└────────────────────────────┘
```

## Proposed Implementation: 3 Phases

### Phase 1: SSH-Based Health Checks (extend System Sentinel)
Add to the sentinel's 15-minute cycle:
- `docker ps --filter name=squid` via SSH to nuclide → container up/down
- `docker logs --since 15m squid-proxy 2>&1 | grep -c 'TCP_DENIED/407'` → auth failure count (matching only `/407` keeps `TCP_DENIED/403` ACL blocks out of the count; `--since 15m` matches the alert window, where `--tail 50` would cap the count at 50 regardless of time)
- `docker logs --since 15m squid-proxy 2>&1 | grep -ci 'error'` → error count
- Alert if: container down, >10 auth failures in 15 min, >5 errors

### Phase 2: Functional Proxy Test (extend Home Keeper)
Add an end-to-end proxy test to the hourly health check:
- `curl --proxy http://glean:PASS@nuclide-amd.lan:3128 https://api.ipify.org` → should return `47.205.28.88`
- If the request fails: alert "Proxy functional test failed"
- If the returned IP differs: alert "Residential IP has changed" (ISP dynamic IP)
- Credential stored in sops, read from `/run/secrets/` at runtime

### Phase 3: Abuse Detection + Auto-Remediation
- Parse Squid access logs for patterns: >100 requests/min from a single IP, connections to suspicious domains, bandwidth spikes
- LLM-based anomaly detection (tier 3 pattern, similar to home-automations)
- Auto-remediation: restart the squid-proxy container if hung, block an IP after N failures (iptables via SSH)
- Voice alerting: "Bob, the proxy is seeing unusual traffic from IP X"

## Security Requirements for Monitoring

1. **Proxy password must be in sops**: monitoring agents read it from `/run/secrets/`, never from hardcoded values
2. **Functional test should use a dedicated test credential**; if it reuses the production `glean` credential instead, that credential must also come from sops
3. **SSH access to nuclide is already established** (System Sentinel + Device Health use it today)
4. **Log data stays on LAN**: Squid logs are parsed via SSH and summaries published to NATS; raw logs never leave nuclide

## What This Enables (Voice Integration)

Once monitoring is wired, Bob can answer:
- "Bob, is the proxy working?" → functional test result
- "Bob, any auth failures on the proxy?" → log analysis
- "Bob, what's the proxy bandwidth this hour?" → Squid stats
- "Bob, restart the proxy" → SSH to nuclide, `docker restart squid-proxy`

These become tools in the coordinator's toolkit, following the same pattern as `get_home_state`, `execute_code`, etc.
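
As an illustration of the Phase 2 functional test that would back the "is the proxy working?" question, here is a minimal Python sketch. It assumes the agent is written in Python, which the brief does not state. The secret file name `squid_proxy_password` and the function names are hypothetical; only the proxy host, the `glean` user, the `/run/secrets/` directory, the expected IP, and the two alert messages come from the brief.

```python
from __future__ import annotations

import ipaddress
import subprocess
from pathlib import Path
from typing import Optional

EXPECTED_IP = "47.205.28.88"  # residential IP quoted in Phase 2
PROXY = "http://nuclide-amd.lan:3128"
# Hypothetical file name; the brief only says the credential lives under /run/secrets/.
SECRET_FILE = Path("/run/secrets/squid_proxy_password")


def build_curl_command(user: str, password: str) -> list[str]:
    """Build the curl argv for the end-to-end test (credential passed via --proxy-user)."""
    return [
        "curl", "--silent", "--show-error", "--max-time", "15",
        "--proxy", PROXY,
        "--proxy-user", f"{user}:{password}",
        "https://api.ipify.org",
    ]


def classify_result(returncode: int, stdout: str,
                    expected_ip: str = EXPECTED_IP) -> Optional[str]:
    """Return an alert message, or None when the proxy looks healthy."""
    body = stdout.strip()
    if returncode != 0:
        return "Proxy functional test failed"
    try:
        ipaddress.ip_address(body)  # reject error pages and empty responses
    except ValueError:
        return "Proxy functional test failed"
    if body != expected_ip:
        return f"Residential IP has changed: {body}"
    return None


def run_functional_test(user: str = "glean") -> Optional[str]:
    """Read the credential at runtime, run curl, and classify the outcome."""
    password = SECRET_FILE.read_text().strip()
    try:
        proc = subprocess.run(build_curl_command(user, password),
                              capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        return "Proxy functional test failed"
    return classify_result(proc.returncode, proc.stdout)
```

Keeping `classify_result` free of I/O means the alert logic can be unit-tested without touching the proxy. One caveat of this sketch: a password passed on the curl command line is visible in the local process list while curl runs, so a curl config file may be preferable in practice.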

## Monitoring Tools Available (from research)

**Squid native stats** (no exporter needed for Phase 1):
- `curl http://localhost:3128/squid-internal-mgr/info`: overall status
- `curl http://localhost:3128/squid-internal-mgr/counters`: request/byte counters
- Requires an ACL that lets localhost reach the cache manager, one directive per line in squid.conf: `acl monitoring src 127.0.0.1` and `http_access allow monitoring manager`, placed before any `http_access deny manager` rule. The `manager` ACL is built into Squid 3.2 and later, so the older `acl manager proto cache_object` definition should not be added.

**Squid access.log format** (native):
```
timestamp elapsed client_ip result_code/status bytes method URL username hierarchy/peer content_type
```
Key codes: `TCP_MISS/200` (success), `TCP_DENIED/407` (auth failure), `TCP_DENIED/403` (ACL block)

**Prometheus exporter**: `boynux/squid-exporter` on `:9301` exports request rates, bytes, cache hits, and connection counts. Pre-built Grafana dashboards are available (#14394).

**Brute force protection**: fail2ban with a custom filter matching `TCP_DENIED/407`, banning after 5 failures in 10 min.

**Security note**: CVE-2025-62168 (CVSS 10.0) affects Squid and leaks auth data. Must verify we're running Squid >= 7.2.

## Dependencies

- System Sentinel already SSHes to nuclide (confirmed in code)
- Home Keeper has the remediation framework for container restarts
- Squid proxy password needs to be added to sops for monitoring agents
- Router port forward (3128) must be done before external-facing monitoring makes sense

## Recommendation

Start with Phase 1 (sentinel SSH checks): it's 30-40 lines of code in the existing sentinel agent. Phase 2 (functional test) follows naturally once the proxy credential is in sops. Phase 3 (abuse detection) can wait until the proxy has real traffic to analyze.
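
To give a feel for the size of the Phase 1 change, here is a rough Python sketch of the threshold logic, not the sentinel's actual code. It assumes the sentinel is written in Python, that Squid's access log reaches the container's stdout, and that `docker ps` is run with `--format '{{.Names}}'` so the output is just container names. `SquidSnapshot`, `parse_snapshot`, `alerts_for`, and `REMOTE_COMMANDS` are hypothetical names; the thresholds and the `squid-proxy` container name come from the brief.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

AUTH_FAILURE_LIMIT = 10  # alert on more than 10 auth failures per 15-min cycle
ERROR_LIMIT = 5          # alert on more than 5 error lines per cycle

# Commands the sentinel would run over SSH (argv form, one remote command string each).
REMOTE_COMMANDS = {
    "ps": ["ssh", "nuclide", "docker ps --filter name=squid --format '{{.Names}}'"],
    "logs": ["ssh", "nuclide", "docker logs --since 15m squid-proxy 2>&1"],
}


@dataclass
class SquidSnapshot:
    container_up: bool
    auth_failures: int
    errors: int


def parse_snapshot(ps_output: str, log_text: str) -> SquidSnapshot:
    """Turn raw `docker ps` and `docker logs` output into counts."""
    lines = log_text.splitlines()
    return SquidSnapshot(
        container_up="squid-proxy" in ps_output.split(),
        # TCP_DENIED/407 is an auth failure; TCP_DENIED/403 (ACL block) is not counted.
        auth_failures=sum("TCP_DENIED/407" in line for line in lines),
        # Crude case-insensitive match, equivalent to `grep -ci error`.
        errors=sum(bool(re.search("error", line, re.IGNORECASE)) for line in lines),
    )


def alerts_for(snapshot: SquidSnapshot) -> list[str]:
    """Apply the Phase 1 thresholds and return alert messages (empty when healthy)."""
    if not snapshot.container_up:
        return ["squid-proxy container is down"]
    alerts = []
    if snapshot.auth_failures > AUTH_FAILURE_LIMIT:
        alerts.append(f"{snapshot.auth_failures} proxy auth failures in the last 15 min")
    if snapshot.errors > ERROR_LIMIT:
        alerts.append(f"{snapshot.errors} error lines in Squid logs")
    return alerts
```

In use, the sentinel would run the two `REMOTE_COMMANDS` entries with its existing SSH helper, pass their output to `parse_snapshot`, and publish whatever `alerts_for` returns on `bob.agent.system_sentinel.alert`. Note that clients which do not send credentials preemptively receive a 407 challenge on their first request, so the auth failure threshold may need tuning once real traffic is visible.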