# Radicle Seed Node Monitoring Setup
## Project Auxo Inc.

### Overview

This document describes the monitoring infrastructure for Project Auxo's Radicle seed nodes. Each node exposes system and Radicle-specific metrics, a central Prometheus server scrapes and stores them, Grafana visualizes them, and AlertManager routes alerts to Slack and email.

## Monitoring Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      Monitoring Stack                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│ │     Seed 1     │  │     Seed 2     │  │     Seed 3     │  │
│ │ Node Exporter  │  │ Node Exporter  │  │ Node Exporter  │  │
│ │Radicle Exporter│  │Radicle Exporter│  │Radicle Exporter│  │
│ └────────┬───────┘  └────────┬───────┘  └────────┬───────┘  │
│          │                   │                   │          │
│          └───────────────────┼───────────────────┘          │
│                              │                              │
│                    ┌─────────▼─────────┐                    │
│                    │    Prometheus     │                    │
│                    │  Central Server   │                    │
│                    └────┬─────────┬────┘                    │
│                         │         │                         │
│      ┌──────────────────▼──┐   ┌──▼──────────────────┐      │
│      │       Grafana       │   │    AlertManager     │      │
│      │     Dashboards      │   │    Slack / Email    │      │
│      └─────────────────────┘   └─────────────────────┘      │
└─────────────────────────────────────────────────────────────┘
```

## Components

### 1. Node-Level Monitoring

Each seed node runs:

- **Prometheus Node Exporter**: System metrics (CPU, memory, disk, network), served on port 9100
- **Custom Radicle Exporter**: Radicle-specific metrics, served on port 9101
- **Log Aggregation**: Systemd journal forwarding

### 2. Central Monitoring Server

- **Prometheus**: Time-series database and scraping
- **Grafana**: Visualization and dashboards
- **AlertManager**: Alert routing and notifications

## Installation Guide

The steps below cover the central server. They assume Node Exporter is already installed on each seed node and listening on port 9100.

### Step 1: Install Prometheus (Central Server)

```bash
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus

# Create directories before copying files into them
sudo mkdir -p /etc/prometheus /var/lib/prometheus

# Download and install
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo cp prometheus-2.45.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo cp -r prometheus-2.45.0.linux-amd64/{consoles,console_libraries} /etc/prometheus/

# Set ownership once the files are in place
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
```

### Step 2: Configure Prometheus

Create `/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alert rules from the "Alert Rules" section
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'seed-nodes'
    static_configs:
      - targets:
        - 'seed1.auxo.dev:9100'
        - 'seed2.auxo.dev:9100'
        - 'seed3.auxo.dev:9100'
        labels:
          group: 'radicle-seeds'

  # Custom Radicle exporter; metrics_path is an HTTP path, not a file on disk
  - job_name: 'radicle-metrics'
    metrics_path: '/metrics'
    static_configs:
      - targets:
        - 'seed1.auxo.dev:9101'
        - 'seed2.auxo.dev:9101'
        - 'seed3.auxo.dev:9101'
```
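
To make the scrape settings concrete, the sketch below builds the URL Prometheus would request for each target of a job: the scheme, then the target, then `metrics_path`. It assumes the Prometheus defaults of `http` and `/metrics` when a job sets neither. The `scrape_urls` helper and the dict layout are illustrative only and are not part of Prometheus.

```python
def scrape_urls(job):
    """Return the URLs Prometheus would request for one scrape job.

    `job` is a plain dict mirroring a scrape_configs entry. This helper is
    a hypothetical illustration, not a Prometheus API.
    """
    scheme = job.get("scheme", "http")            # Prometheus default
    path = job.get("metrics_path", "/metrics")    # Prometheus default
    urls = []
    for static in job.get("static_configs", []):
        for target in static.get("targets", []):
            urls.append(f"{scheme}://{target}{path}")
    return urls


# Mirrors the 'seed-nodes' job from prometheus.yml
seed_nodes_job = {
    "job_name": "seed-nodes",
    "static_configs": [
        {"targets": ["seed1.auxo.dev:9100",
                     "seed2.auxo.dev:9100",
                     "seed3.auxo.dev:9100"]},
    ],
}

for url in scrape_urls(seed_nodes_job):
    print(url)
```

Because `metrics_path` is appended to an HTTP URL, it has to name a path served by the exporter rather than a location on the node's filesystem.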

### Step 3: Create Prometheus Service

```bash
# Quote EOF so the shell keeps the backslash line continuations in the unit file
sudo tee /etc/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
```
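
Once the service is started, one way to confirm it is serving requests is to poll Prometheus's `/-/healthy` endpoint. The sketch below is a minimal check using only the standard library. The function name and the default base URL `http://localhost:9090` are assumptions for illustration.

```python
import urllib.error
import urllib.request


def prometheus_healthy(base_url="http://localhost:9090", timeout=5):
    """Return True if Prometheus answers /-/healthy with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/healthy", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False
```

Calling `prometheus_healthy()` on the central server should return `True` shortly after `systemctl start prometheus`. A `False` result is a cue to inspect `journalctl -u prometheus`.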

### Step 4: Install Grafana

```bash
# Add the Grafana signing key and APT repository (apt-key is deprecated)
sudo apt-get install -y apt-transport-https software-properties-common wget gnupg
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install
sudo apt-get update
sudo apt-get install -y grafana

# Start service
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
```

### Step 5: Configure AlertManager

This step assumes AlertManager is already installed and listening on `localhost:9093`, the address referenced in `prometheus.yml`.

Create `/etc/alertmanager/alertmanager.yml`:

```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-alerts'

  routes:
  - matchers:
      - severity="critical"
    receiver: 'critical-alerts'
    continue: true

receivers:
- name: 'team-alerts'
  slack_configs:
  - channel: '#radicle-alerts'
    title: 'Radicle Seed Node Alert'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

- name: 'critical-alerts'
  email_configs:
  - to: 'oncall@auxo.dev'
    from: 'alerts@auxo.dev'
    smarthost: 'smtp.auxo.dev:587'
    auth_username: 'alerts@auxo.dev'
    # Placeholder: never commit a real password to the repository
    auth_password: 'YOUR_SMTP_PASSWORD'
```
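
The routing tree above decides which receiver gets each alert. The sketch below is a simplified model of AlertManager's documented routing behavior: child routes are tried in order, matching stops at the first matching child unless it sets `continue`, and an alert matched by no child is handled by the current node. It supports equality matching only, and the function names and dict layout are assumptions for illustration.

```python
def matches(route, labels):
    """True if every label the route requires equals the alert's label."""
    return all(labels.get(k) == v for k, v in route.get("match", {}).items())


def receivers_for(route, labels):
    """Walk the routing tree and return the receivers that get the alert."""
    selected = []
    for child in route.get("routes", []):
        if matches(child, labels):
            selected.extend(receivers_for(child, labels))
            if not child.get("continue", False):
                break  # stop at the first matching child
    # An alert matched by no child route is handled by the current node
    return selected or [route["receiver"]]


# Mirrors the route section of alertmanager.yml
routing_tree = {
    "receiver": "team-alerts",
    "routes": [
        {"match": {"severity": "critical"},
         "receiver": "critical-alerts",
         "continue": True},
    ],
}

print(receivers_for(routing_tree, {"alertname": "NodeDown", "severity": "critical"}))
print(receivers_for(routing_tree, {"alertname": "HighCPUUsage", "severity": "warning"}))
```

Under this model a critical alert reaches only `critical-alerts` (email). `continue: true` lets it keep matching later sibling routes, but there are none here, so it does not also reach the Slack receiver on the parent route. If critical alerts should go to both, adding a second child route that points at `team-alerts` is one option.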

## Alert Rules

Create `/etc/prometheus/alerts.yml` and make sure it is listed under `rule_files` in `prometheus.yml`, otherwise Prometheus never loads the rules:

```yaml
groups:
- name: radicle_alerts
  interval: 30s
  rules:

  - alert: NodeDown
    expr: up{job="seed-nodes"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Seed node {{ $labels.instance }} is down"
      description: "{{ $labels.instance }} has been down for more than 5 minutes."

  # rate() is used instead of irate() because it is smoother and better suited to alerting
  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% (current value: {{ $value }}%)"

  - alert: HighMemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 85% (current value: {{ $value }}%)"

  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Disk space is below 20% (current value: {{ $value }}%)"

  - alert: RadicleNodeNotRunning
    expr: radicle_node_running == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Radicle node not running on {{ $labels.instance }}"
      description: "The Radicle node service has been down for more than 5 minutes."

  - alert: NoPeersConnected
    expr: radicle_peer_count == 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "No peers connected on {{ $labels.instance }}"
      description: "Radicle node has no peer connections for 15 minutes."
```
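
As a worked example of the threshold arithmetic in the CPU, memory, and disk rules, the sketch below evaluates the same formulas on sample readings. The input values are hypothetical, and the helper names are not part of Prometheus.

```python
def cpu_usage_percent(idle_rates):
    """100 minus the average idle fraction across cores, as a percentage.

    `idle_rates` holds the per-core, per-second rate of
    node_cpu_seconds_total{mode="idle"}, each between 0 and 1.
    """
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Share of memory in use, from MemAvailable and MemTotal."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


def disk_free_percent(avail_bytes, size_bytes):
    """Share of the filesystem that is still available."""
    return avail_bytes / size_bytes * 100


GIB = 1024 ** 3

# Hypothetical sample readings for one node
cpu = cpu_usage_percent([0.10, 0.20, 0.15, 0.15])   # four cores, mostly busy
mem = memory_usage_percent(2 * GIB, 16 * GIB)
disk = disk_free_percent(30 * GIB, 100 * GIB)

print(f"CPU {cpu:.1f}% -> HighCPUUsage condition: {cpu > 80}")
print(f"Memory {mem:.1f}% -> HighMemoryUsage condition: {mem > 85}")
print(f"Disk free {disk:.1f}% -> DiskSpaceLow condition: {disk < 20}")
```

A true condition alone does not fire the alert: each of these rules also requires the condition to hold continuously for the `for:` duration (10 minutes).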

## Grafana Dashboards

### Dashboard 1: Seed Node Overview

Key panels:
- Node status (up/down)
- CPU usage per node
- Memory usage per node
- Disk usage per node
- Network traffic in/out
- Peer connections
- Repository count

### Dashboard 2: Radicle Metrics

Key panels:
- Total repositories across network
- Sync operations per hour
- Peer discovery events
- Git operations (fetch/push)
- Error rates
- Latency between seeds

### Dashboard 3: System Health

Key panels:
- System load averages
- Process count
- Open file descriptors
- Network connections
- Systemd service status
- Log error rates
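
Several of these panels can be backed by expressions that already appear in the alert rules and the custom exporter. The sketch below pairs a few panel names with candidate PromQL and builds the matching request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The panel-to-query mapping and the base URL are assumptions, not a dashboard definition.

```python
from urllib.parse import urlencode

# Candidate PromQL for a few panels, reusing metrics named elsewhere in this guide
PANEL_QUERIES = {
    "Node status (up/down)": 'up{job="seed-nodes"}',
    "Memory usage per node": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
    "Peer connections": "radicle_peer_count",
    "Repository count": "radicle_repo_count",
}


def instant_query_url(expr, base_url="http://localhost:9090"):
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


for panel, expr in PANEL_QUERIES.items():
    print(f"{panel}: {instant_query_url(expr)}")
```

Trying an expression against the API this way before pasting it into a Grafana panel makes it easier to tell a query problem from a dashboard problem.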

## Custom Radicle Metrics Exporter

Create `/usr/local/bin/radicle-exporter.py`:

```python
#!/usr/bin/env python3
"""Expose basic Radicle node metrics for Prometheus on port 9101."""
import subprocess
import time

from prometheus_client import start_http_server, Gauge

# Define metrics
node_running = Gauge('radicle_node_running', 'Whether Radicle node is running')
peer_count = Gauge('radicle_peer_count', 'Number of connected peers')
repo_count = Gauge('radicle_repo_count', 'Number of repositories')
# Declared for dashboards but not populated by this script, so it always reads 0
sync_lag = Gauge('radicle_sync_lag_seconds', 'Sync lag in seconds')


def run_rad(*args):
    """Run a `rad` subcommand and return its stdout, or '' on any failure."""
    try:
        result = subprocess.run(['rad', *args], capture_output=True,
                                text=True, timeout=20)
    except (OSError, subprocess.SubprocessError):
        # rad is missing from PATH, or the command timed out
        return ''
    return result.stdout


def collect_metrics():
    # The subcommands and matched strings below depend on the installed rad
    # version; check them against the actual CLI output.

    # Check if node is running (guard against output such as "not running")
    status = run_rad('node', 'status')
    node_running.set(1 if 'running' in status and 'not running' not in status else 0)

    # Count peers
    peers = [l for l in run_rad('node', 'peers').splitlines() if 'connected' in l]
    peer_count.set(len(peers))

    # Count repositories
    repos = [l for l in run_rad('repo', 'list').splitlines() if 'rad:' in l]
    repo_count.set(len(repos))


if __name__ == '__main__':
    # Start HTTP server for Prometheus to scrape
    start_http_server(9101)

    # Collect metrics every 30 seconds
    while True:
        collect_metrics()
        time.sleep(30)
```
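
To check what the exporter actually publishes, its `/metrics` response can be parsed and inspected. The sketch below reads simple `name value` samples from the Prometheus text exposition format using only the standard library. The sample text is a hypothetical excerpt, and `parse_metrics` is an illustrative helper rather than part of `prometheus_client`.

```python
def parse_metrics(text):
    """Parse simple `name value` samples from Prometheus text exposition output.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. Samples that
    carry labels or timestamps are ignored to keep the sketch short.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) == 2 and "{" not in parts[0]:
            samples[parts[0]] = float(parts[1])
    return samples


# Hypothetical excerpt of what the exporter might return on /metrics
sample_output = """\
# HELP radicle_node_running Whether Radicle node is running
# TYPE radicle_node_running gauge
radicle_node_running 1.0
# HELP radicle_peer_count Number of connected peers
# TYPE radicle_peer_count gauge
radicle_peer_count 4.0
"""

print(parse_metrics(sample_output))
```

On a seed node, the same function could be fed the body of `http://localhost:9101/metrics` to confirm that the `radicle_*` gauges are present and plausible before Prometheus scrapes them.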

## Monitoring Checklist

### Daily Checks
- [ ] Review Grafana dashboards for anomalies
- [ ] Check AlertManager for any firing alerts
- [ ] Verify all nodes show as "up" in Prometheus
- [ ] Review sync status between seeds

### Weekly Checks
- [ ] Analyze performance trends
- [ ] Review disk usage growth
- [ ] Check for any error patterns in logs
- [ ] Verify backup completion

### Monthly Checks
- [ ] Review and tune alert thresholds
- [ ] Update Grafana dashboards as needed
- [ ] Capacity planning based on trends
- [ ] Security patches for monitoring stack
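
The daily "all nodes up" check can be scripted against the Prometheus HTTP API. The sketch below takes the decoded JSON body of an instant query for `up` (from `GET /api/v1/query?query=up`) and lists the instances reporting 0. The example response is hypothetical but follows the API's instant-vector shape, and the function name is an assumption.

```python
def down_instances(query_response):
    """Return instances whose `up` sample is 0 in an instant-query response."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for series in query_response["data"]["result"]:
        _timestamp, value = series["value"]   # the value arrives as a string
        if float(value) == 0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return sorted(down)


# Hypothetical response, shaped like the API's instant-vector result
example_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "seed-nodes", "instance": "seed1.auxo.dev:9100"},
             "value": [1700000000.0, "1"]},
            {"metric": {"job": "seed-nodes", "instance": "seed2.auxo.dev:9100"},
             "value": [1700000000.0, "0"]},
            {"metric": {"job": "seed-nodes", "instance": "seed3.auxo.dev:9100"},
             "value": [1700000000.0, "1"]},
        ],
    },
}

print(down_instances(example_response))
```

An empty list means every scraped target answered its last scrape. Anything else is worth checking before relying on the dashboards.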
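
## Troubleshooting

### Common Issues

1. **Prometheus can't scrape targets**
   - Check firewall rules (port 9100 for Node Exporter, port 9101 for the Radicle exporter)
   - Verify node exporter is running
   - Test connectivity: `curl http://seed1.auxo.dev:9100/metrics`

2. **No Radicle metrics**
   - Check that the Radicle exporter (`/usr/local/bin/radicle-exporter.py`) is running
   - Verify it serves metrics on the seed node: `curl http://localhost:9101/metrics`
   - Check that the user running the exporter can execute `rad` and read `/home/seed/.radicle`
   - Confirm the `radicle-metrics` job in `prometheus.yml` targets the exporter's port

3. **Alerts not firing**
   - Verify AlertManager is running
   - Check Slack webhook is valid
   - Review alert rules syntax: `promtool check rules /etc/prometheus/alerts.yml`
   - Confirm `alerts.yml` is listed under `rule_files` in `prometheus.yml`
   - Check Prometheus logs: `journalctl -u prometheus`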
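
For the connectivity checks above, a plain TCP probe separates "port blocked or service down" from problems higher up the stack. The sketch below is a minimal version using the standard library. The function names and the 3-second timeout are assumptions.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


def check_targets(targets):
    """Map each 'host:port' target to whether its port is reachable."""
    results = {}
    for target in targets:
        host, _, port = target.rpartition(":")
        results[target] = port_open(host, int(port))
    return results
```

Run from the monitoring server, `check_targets(["seed1.auxo.dev:9100", "seed1.auxo.dev:9101"])` shows whether the firewall lets both exporters through. If a port is open but the scrape still fails, the next step is the `curl` test against `/metrics`.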

## Security Considerations

1. **Access Control**
   - Prometheus: Bind to localhost only (`--web.listen-address=127.0.0.1:9090`)
   - Grafana: Enable authentication
   - Node Exporter and Radicle exporter: Firewall to monitoring server only

2. **TLS/SSL**
   - Use HTTPS for Grafana
   - Consider TLS for Prometheus scraping
   - Use HTTPS endpoints for AlertManager webhooks

3. **Data Retention**
   - Set a retention period with `--storage.tsdb.retention.time`
   - Regular backup of Prometheus data
   - Archive old metrics to object storage
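
The "bind to localhost only" item can be checked mechanically for any `host:port` listen address, such as the value passed to Prometheus's `--web.listen-address` flag. The sketch below classifies an address as loopback-only or not. The helper name is an assumption, and treating the hostname `localhost` as loopback presumes a standard hosts file.

```python
import ipaddress


def binds_to_loopback_only(listen_address):
    """Return True if a 'host:port' listen address is reachable only locally.

    An empty host (":9090"), 0.0.0.0, or :: means all interfaces.
    """
    host, _, _port = listen_address.rpartition(":")
    host = host.strip("[]")          # IPv6 literals are written as [::1]:9090
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Empty host or a hostname other than localhost
        return False


for addr in ("127.0.0.1:9090", "[::1]:9090", "0.0.0.0:9090", ":9090"):
    print(addr, binds_to_loopback_only(addr))
```

Node Exporter and the Radicle exporter cannot use a loopback bind, since the central server scrapes them remotely. That is why the list relies on firewall rules for those ports instead.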

---

**Next**: Create operations runbook for daily management