# Radicle Seed Node Monitoring Setup
## Project Auxo Inc.

### Overview

This document describes the monitoring infrastructure for Project Auxo's Radicle seed nodes. It covers node health, performance, and availability.

## Monitoring Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      Monitoring Stack                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐  │
│  │   Seed 1    │      │   Seed 2    │      │   Seed 3    │  │
│  │ Prometheus  │      │ Prometheus  │      │ Prometheus  │  │
│  │ Node Export │      │ Node Export │      │ Node Export │  │
│  │ Radicle Met │      │ Radicle Met │      │ Radicle Met │  │
│  └──────┬──────┘      └──────┬──────┘      └──────┬──────┘  │
│         │                    │                    │         │
│         └────────────────────┼────────────────────┘         │
│                              │                              │
│                    ┌─────────▼─────────┐                    │
│                    │    Prometheus     │                    │
│                    │  Central Server   │                    │
│                    └─────────┬─────────┘                    │
│                              │                              │
│                    ┌─────────▼─────────┐                    │
│                    │      Grafana      │                    │
│                    │    Dashboards     │                    │
│                    └─────────┬─────────┘                    │
│                              │                              │
│                    ┌─────────▼─────────┐                    │
│                    │   AlertManager    │                    │
│                    │   Slack / Email   │                    │
│                    └───────────────────┘                    │
└─────────────────────────────────────────────────────────────┘
```

## Components

### 1. Node-Level Monitoring

Each seed node runs:

- **Prometheus Node Exporter**: System metrics (CPU, memory, disk, network)
- **Custom Radicle Exporter**: Radicle-specific metrics
- **Log Aggregation**: Systemd journal forwarding

### 2. Central Monitoring Server

- **Prometheus**: Time-series database and scraping
- **Grafana**: Visualization and dashboards
- **AlertManager**: Alert routing and notifications

## Installation Guide

### Step 1: Install Prometheus (Central Server)

```bash
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus

# Create directories first so the copy steps below have a destination
sudo mkdir -p /etc/prometheus /var/lib/prometheus

# Download and install
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo cp prometheus-2.45.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo cp -r prometheus-2.45.0.linux-amd64/{consoles,console_libraries} /etc/prometheus/

# Set ownership once all files are in place
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
```

### Step 2: Configure Prometheus

Create `/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alert rules defined in the "Alert Rules" section
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'seed-nodes'
    static_configs:
      - targets:
          - 'seed1.auxo.dev:9100'
          - 'seed2.auxo.dev:9100'
          - 'seed3.auxo.dev:9100'
        labels:
          group: 'radicle-seeds'

  # The custom Radicle exporter serves /metrics (the default path) on port 9101
  - job_name: 'radicle-metrics'
    static_configs:
      - targets:
          - 'seed1.auxo.dev:9101'
          - 'seed2.auxo.dev:9101'
          - 'seed3.auxo.dev:9101'
```

### Step 3: Create Prometheus Service

```bash
sudo tee /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
```

### Step 4: Install Grafana

```bash
# Add the Grafana signing key, then the repository (apt-key is deprecated)
sudo apt-get install -y software-properties-common wget gnupg
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install
sudo apt-get update
sudo apt-get install -y grafana

# Start service
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
```

### Step 5: Configure AlertManager

Create `/etc/alertmanager/alertmanager.yml`:

```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-alerts'

  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true

receivers:
  - name: 'team-alerts'
    slack_configs:
      - channel: '#radicle-alerts'
        title: 'Radicle Seed Node Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@auxo.dev'
        from: 'alerts@auxo.dev'
        smarthost: 'smtp.auxo.dev:587'
        auth_username: 'alerts@auxo.dev'
        # Placeholder: substitute the real credential and keep this file out of version control
        auth_password: 'YOUR_SMTP_PASSWORD'
```

## Alert Rules

Create `/etc/prometheus/alerts.yml`:

```yaml
groups:
  - name: radicle_alerts
    interval: 30s
    rules:

      - alert: NodeDown
        expr: up{job="seed-nodes"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Seed node {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."

      # rate() is used instead of irate(): it averages over the whole window,
      # which is steadier for alerting
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Free disk space is below 20% (current value: {{ $value }}%)"

      - alert: RadicleNodeNotRunning
        expr: radicle_node_running == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Radicle node not running on {{ $labels.instance }}"
          description: "The Radicle node service has been down for more than 5 minutes."

      - alert: NoPeersConnected
        expr: radicle_peer_count == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No peers connected on {{ $labels.instance }}"
          description: "Radicle node has had no peer connections for 15 minutes."
```

## Grafana Dashboards

### Dashboard 1: Seed Node Overview

Key panels:
- Node status (up/down)
- CPU usage per node
- Memory usage per node
- Disk usage per node
- Network traffic in/out
- Peer connections
- Repository count

### Dashboard 2: Radicle Metrics

Key panels:
- Total repositories across network
- Sync operations per hour
- Peer discovery events
- Git operations (fetch/push)
- Error rates
- Latency between seeds

### Dashboard 3: System Health

Key panels:
- System load averages
- Process count
- Open file descriptors
- Network connections
- Systemd service status
- Log error rates

## Custom Radicle Metrics Exporter

Create `/usr/local/bin/radicle-exporter.py`:

```python
#!/usr/bin/env python3
"""Expose basic Radicle node metrics for Prometheus on port 9101."""
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# Define metrics
node_running = Gauge('radicle_node_running', 'Whether Radicle node is running')
peer_count = Gauge('radicle_peer_count', 'Number of connected peers')
repo_count = Gauge('radicle_repo_count', 'Number of repositories')
# Declared but not populated by collect_metrics() yet; it reports 0 until a lag source is added
sync_lag = Gauge('radicle_sync_lag_seconds', 'Sync lag in seconds')


def run_rad(*args):
    """Run a `rad` subcommand; return its stdout, or None if it fails or times out.

    The subcommands and the output strings matched below should be checked
    against the installed `rad` version.
    """
    try:
        result = subprocess.run(['rad', *args], capture_output=True,
                                text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout if result.returncode == 0 else None


def count_lines(output, marker):
    """Count output lines containing marker; 0 if the command failed."""
    if output is None:
        return 0
    return sum(1 for line in output.splitlines() if marker in line)


def collect_metrics():
    # Check if node is running ("not running" also contains "running", so exclude it)
    status = run_rad('node', 'status')
    running = (status is not None and 'running' in status
               and 'not running' not in status)
    node_running.set(1 if running else 0)

    # Count peers
    peer_count.set(count_lines(run_rad('node', 'peers'), 'connected'))

    # Count repositories
    repo_count.set(count_lines(run_rad('repo', 'list'), 'rad:'))


if __name__ == '__main__':
    # Start HTTP server for Prometheus to scrape
    start_http_server(9101)

    # Collect metrics every 30 seconds
    while True:
        collect_metrics()
        time.sleep(30)
```

## Monitoring Checklist

### Daily Checks
- [ ] Review Grafana dashboards for anomalies
- [ ] Check AlertManager for any firing alerts
- [ ] Verify all nodes show as "up" in Prometheus
- [ ] Review sync status between seeds

### Weekly Checks
- [ ] Analyze performance trends
- [ ] Review disk usage growth
- [ ] Check for any error patterns in logs
- [ ] Verify backup completion

### Monthly Checks
- [ ] Review and tune alert thresholds
- [ ] Update Grafana dashboards as needed
- [ ] Capacity planning based on trends
- [ ] Security patches for monitoring stack

## Troubleshooting

### Common Issues

1. **Prometheus can't scrape targets**
   - Check firewall rules (port 9100)
   - Verify node exporter is running
   - Test connectivity: `curl http://seed1.auxo.dev:9100/metrics`

2. **No Radicle metrics**
   - Check that the Radicle exporter (`/usr/local/bin/radicle-exporter.py`) is running on the seed
   - Verify the endpoint responds: `curl http://seed1.auxo.dev:9101/metrics`
   - Check that the exporter's user can run the `rad` CLI and that port 9101 is open to the monitoring server

3. **Alerts not firing**
   - Verify AlertManager is running
   - Check Slack webhook is valid
   - Review alert rules syntax and confirm `alerts.yml` is listed under `rule_files`
   - Check Prometheus logs

## Security Considerations

1. **Access Control**
   - Prometheus: Bind to localhost only
   - Grafana: Enable authentication
   - Node Exporter and Radicle exporter (ports 9100 and 9101): Firewall to monitoring server only

2. **TLS/SSL**
   - Use HTTPS for Grafana
   - Consider TLS for Prometheus scraping
   - Encrypt AlertManager webhooks

3. **Data Retention**
   - Set appropriate retention policies
   - Regular backup of Prometheus data
   - Archive old metrics to object storage

---

**Next**: Create operations runbook for daily management
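
For the "No Radicle metrics" item under Troubleshooting, it can help to check which of the exporter's gauges an endpoint actually serves. The following is a minimal sketch, assuming the exporter answers on port 9101 as in the exporter script; `metric_names`, `missing_metrics`, and `fetch_metrics` are hypothetical helper names, not part of any Prometheus or Radicle tooling, and the parser handles only the simple `name value` and `name{labels} value` sample lines.

```python
from urllib.request import urlopen

# Metric names that the exporter script in this document populates.
EXPECTED = {"radicle_node_running", "radicle_peer_count", "radicle_repo_count"}


def metric_names(exposition_text):
    """Return the set of sample names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split()[0])
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """Return the expected metric names that are absent, sorted for stable output."""
    return sorted(set(expected) - metric_names(exposition_text))


def fetch_metrics(url, timeout=5):
    """Download the raw metrics page (hypothetical helper; makes a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `missing_metrics(fetch_metrics("http://seed1.auxo.dev:9101/metrics"))` from the monitoring server should return an empty list when the exporter is healthy. A non-empty list names the gauges that are not being exported.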