0015-phase7-distribution-and-production-hardening.md
# RFC-0015 — Phase 7 Distribution and Production Hardening

Status: Accepted

## Summary

Implement the Phase 7 production-hardening baseline:

- supervision tree guidance model
- metrics and tracing primitives
- rate limiting hooks
- CSRF/session protection hooks
- deterministic load testing harness
- security review documentation
- deployment guide documentation

## Motivation

Runtime behavior is stable through Phase 6, but production operation requires explicit controls and documentation for failure isolation, observability, abuse control, and rollout safety.

## Design

### Supervision guidance

Add `lightspeed/ops/supervision` with:

- typed child spec guidance
- restart/strategy labels
- default plan builder
- validation for required children and duplicate IDs

### Metrics and tracing

Add `lightspeed/ops/telemetry` with:

- mapping from session telemetry events to counters/gauges
- receive-span model and status labels
- log-friendly metric line serializer

### Rate limiting and protection hooks

Extend transport contract with:

- `ProtectionHook` for CSRF/session binding and related policy
- `RateLimitHook` for event throttling policy
- explicit adapter errors for protection and rate-limit failures

Extend websocket adapter with:

- `connect_with_hooks` / `reconnect_with_hooks`
- `receive_with_hooks`
- default wrappers preserving earlier function shape

### Load testing harness

Add `lightspeed/ops/load_harness` with deterministic scenarios covering:

- reconnect behavior
- crash/restart behavior
- slow-client ack behavior
- repeatability checks

### Documentation

Add:

- `docs/supervision_tree.md`
- `docs/security_review.md`
- `docs/deployment_guide.md`

Update top-level docs to reference the new Phase 7 assets.
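
To make the hook-based transport extension from "Rate limiting and protection hooks" more concrete, here is a minimal sketch in Python. It is illustrative only: this RFC does not show the project's real signatures, so everything except the names `ProtectionHook`, `RateLimitHook`, and `receive_with_hooks` is a hypothetical stand-in (the result types, `allow_all`, `require_csrf_token`, `max_events`, and the dict-shaped event). The sketch assumes each hook returns either "allow" or a rejection reason, and that rejections surface as distinct, explicit error values.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union

# Assumed hook shape (hypothetical): return None to allow the event,
# or a short reason string to reject it.
ProtectionHook = Callable[[dict], Optional[str]]
RateLimitHook = Callable[[dict], Optional[str]]


@dataclass(frozen=True)
class Accepted:
    event: dict


@dataclass(frozen=True)
class ProtectionRejected:
    reason: str


@dataclass(frozen=True)
class RateLimited:
    reason: str


ReceiveResult = Union[Accepted, ProtectionRejected, RateLimited]


def allow_all(_event: dict) -> Optional[str]:
    """Default hook: never rejects."""
    return None


def receive_with_hooks(
    event: dict, protection: ProtectionHook, rate_limit: RateLimitHook
) -> ReceiveResult:
    """Run the protection hook first, then the rate-limit hook."""
    reason = protection(event)
    if reason is not None:
        return ProtectionRejected(reason)
    reason = rate_limit(event)
    if reason is not None:
        return RateLimited(reason)
    return Accepted(event)


def receive(event: dict) -> ReceiveResult:
    """Default wrapper: keeps the hook-free call shape by allowing everything."""
    return receive_with_hooks(event, allow_all, allow_all)


def require_csrf_token(expected: str) -> ProtectionHook:
    """Hypothetical policy: reject events whose 'csrf' field does not match."""
    def hook(event: dict) -> Optional[str]:
        return None if event.get("csrf") == expected else "csrf_mismatch"
    return hook


def max_events(limit: int) -> RateLimitHook:
    """Hypothetical policy: allow only the first `limit` events it sees.

    Counting events instead of reading a clock keeps rejections deterministic.
    """
    seen = 0

    def hook(_event: dict) -> Optional[str]:
        nonlocal seen
        seen += 1
        return None if seen <= limit else "rate_limit_exceeded"
    return hook
```

Keeping policy in caller-supplied hooks, rather than hardcoding a limiter or CSRF scheme, leaves the transport contract unchanged for callers that never pass hooks. Because protection runs first in this sketch, events it rejects never reach the rate-limit hook.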

## API impact

New public modules:

- `lightspeed/ops/supervision`
- `lightspeed/ops/telemetry`
- `lightspeed/ops/load_harness`

Transport contract public API expands with protection/rate hook types and helpers.

## Protocol impact

No frame shape changes.

## ISA impact

No instruction set changes.

## Security impact

- explicit CSRF/session protection policy surface
- explicit rate-limit policy surface
- deterministic rejection paths for blocked client traffic

## Alternatives

- defer all hardening to adapter/application layer only
- add hardcoded rate-limiter/CSRF policy instead of hooks
- prioritize deployment docs without typed guidance modules

## Unresolved questions

- standardized distributed limiter backend API
- optional shared replay-nonce guard in core
- future structured export formats for metrics/traces