# Disaster Recovery Plan: Bob

> Status: Draft | Last updated: 2026-04-06

## Scenario: Complete Rig Failure

If the rig server dies (disk failure, power surge, or other hardware failure), follow these steps to rebuild Bob from scratch.

## Prerequisites

- Fresh NixOS machine with 3x NVIDIA GPUs (or equivalent)
- Access to the external backup drive (Seagate BUP Slim BK, 1.8TB, ext4, label `dev_2`)
- Internet access for NixOS packages and model downloads

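The prerequisites can be checked up front with a short preflight script. This is a minimal sketch, not part of the existing tooling: the function names are hypothetical, it assumes udev exposes the drive under `/dev/disk/by-label/`, and the GPU check only works once the NVIDIA driver (and `nvidia-smi`) is installed.

```bash
#!/usr/bin/env bash
# Preflight sketch: confirm the backup drive and GPUs are visible.
# Assumption: udev exposes labelled filesystems under /dev/disk/by-label/.

BACKUP_LABEL="dev_2"   # label from the prerequisites list
REQUIRED_GPUS=3

# Print the real device path for a filesystem label; fail if it is absent.
backup_device() {
  local dev="/dev/disk/by-label/$1"
  [ -e "$dev" ] && readlink -f "$dev"
}

# Succeed when the detected GPU count meets the requirement.
enough_gpus() {
  local found="$1" required="$2"
  [ "$found" -ge "$required" ]
}

preflight() {
  local dev gpus
  dev="$(backup_device "$BACKUP_LABEL")" || { echo "backup drive not found" >&2; return 1; }
  echo "backup drive: $dev"
  # nvidia-smi -L prints one line per GPU
  gpus="$(nvidia-smi -L 2>/dev/null | wc -l)"
  enough_gpus "$gpus" "$REQUIRED_GPUS" || { echo "only $gpus GPU(s) detected" >&2; return 1; }
  echo "GPUs: $gpus"
}
# Usage: preflight
```

Resolving the drive by label avoids guessing which `/dev/sdX1` node the external drive received on the new machine.
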
## Recovery Steps

### 1. Install NixOS Base (30 min)

```bash
# Boot from the NixOS USB installer
# Partition the disk (or use the existing disko config)
nix-shell -p git
git clone <bob-repo-url> /tmp/bob
# Or restore from the backup drive instead of cloning:
sudo mkdir -p /mnt/backup
sudo mount /dev/sdX1 /mnt/backup  # The external drive (label dev_2)
# NOTE: this path only exists if a plain-file copy is kept on the drive.
# A restic repository cannot be browsed with cp; if the path is missing,
# restore the repo with restic as shown in step 2.
cp -r /mnt/backup/bob-backups/rig/latest/home/rig/bob /tmp/bob

# Apply NixOS config
sudo nixos-install --flake /tmp/bob/nix#rig
```

### 2. Restore from Restic Backup (1-2 hours)

```bash
# Mount backup drive
sudo mkdir -p /srv/backup
sudo mount /dev/sdX1 /srv/backup

# Install restic
nix-shell -p restic

# List snapshots
export RESTIC_REPOSITORY=/srv/backup/bob-backups/rig
# The secret file only exists once secrets are provisioned on the new machine;
# otherwise supply the repository password manually.
export RESTIC_PASSWORD="$(sudo cat /run/secrets/restic_password)"
restic snapshots

# Restore Bob data (writing to / requires root; -E keeps the RESTIC_* variables)
sudo -E "$(command -v restic)" restore latest --target /
# This restores /home/rig/bob and /srv/bob
```

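Restoring straight to `/` overwrites whatever is already on disk, so it can be worth staging one path first and inspecting it. The sketch below assumes `RESTIC_REPOSITORY` and `RESTIC_PASSWORD` are exported as in step 2; the staging directory and function names are hypothetical.

```bash
# Sketch: inspect a snapshot and restore a single path into a scratch directory.
STAGING_DIR="/tmp/restore-check"   # hypothetical scratch location

# Emit the restic arguments for a staged, single-path restore, one per line.
staged_restore_args() {
  local path="$1" target="$2"
  printf '%s\n' restore latest --target "$target" --include "$path"
}

stage_restore() {
  local path="$1" args
  # Peek at what the latest snapshot holds under this path.
  restic ls latest "$path" | head -n 20
  mapfile -t args < <(staged_restore_args "$path" "$STAGING_DIR")
  # Restored files land under $STAGING_DIR followed by the original path.
  restic "${args[@]}"
}
# Usage: stage_restore /home/rig/bob
```

Once the staged copy looks right, the full restore to `/` can proceed as in the step above.
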
### 3. Rebuild NixOS from Flake (30 min)

```bash
cd /home/rig/bob/nix
sudo nixos-rebuild switch --flake .#rig
```

This recreates the Docker containers declared in the NixOS config, along with NATS, Prometheus, Caddy, and the systemd services. Containers that are not in the NixOS config are started manually in step 6.

### 4. Rebuild Docker Images (1-2 hours)

```bash
# Rebuild all Bob service images (one image per directory with a Dockerfile)
cd /home/rig/bob/services
for svc in */; do
  svc="${svc%/}"  # strip the trailing slash from the glob match
  if [ -f "$svc/Dockerfile" ]; then
    echo "Building $svc..."
    DOCKER_BUILDKIT=0 docker build --no-cache -t "bob-$svc" "$svc"
  fi
done

# Rebuild pipecat-agent specifically
cd pipecat-agent
DOCKER_BUILDKIT=0 docker build --no-cache -t bob-pipecat-agent .
```

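With many service directories, one failed build is easy to miss in the scrollback. A variant of the loop above that collects failures and reports them at the end might look like this. It is a sketch: `image_name_for` and `build_all` are hypothetical helper names, and the `bob-<dir>` tag convention is taken from the loop above.

```bash
# Sketch: build every service image and list the ones that failed.
SERVICES_DIR="${SERVICES_DIR:-/home/rig/bob/services}"

# Map a service directory (with or without a trailing slash) to its image tag.
image_name_for() {
  local dir="${1%/}"
  echo "bob-${dir##*/}"
}

build_all() {
  local failed=() svc
  for svc in "$SERVICES_DIR"/*/; do
    # Skip directories without a Dockerfile (and the unexpanded glob, if any).
    [ -f "${svc}Dockerfile" ] || continue
    echo "Building $(image_name_for "$svc")..."
    DOCKER_BUILDKIT=0 docker build --no-cache -t "$(image_name_for "$svc")" "$svc" \
      || failed+=("$svc")
  done
  if [ "${#failed[@]}" -gt 0 ]; then
    printf 'FAILED: %s\n' "${failed[@]}" >&2
    return 1
  fi
}
# Usage: build_all
```
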
### 5. Download LLM Models (2-4 hours)

```bash
# Models are NOT backed up (too large). Re-download:
# Qwen3-32B AWQ (~20GB)
# Qwen3-8B AWQ (~5GB)
# faster-whisper large-v3 (~3GB)
# Fish Speech v1.5 (~1GB)
# BGE-large-en-v1.5 (~1GB)

# vLLM will download on first start via HuggingFace
# Other models download via their respective containers
```

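Because the downloads happen implicitly on first start, a full disk shows up only as a container that never becomes healthy. A quick free-space check before starting can rule that out. This sketch uses the approximate sizes listed above; the 10 GB headroom and the function names are assumptions, and the path to check depends on where the containers keep their model caches.

```bash
# Sketch: confirm there is room for the models before the first start.
# Sizes are the approximate figures from the list above, in GB.
MODEL_SIZES_GB=(20 5 3 1 1)

total_model_gb() {
  local sum=0 size
  for size in "${MODEL_SIZES_GB[@]}"; do
    sum=$((sum + size))
  done
  echo "$sum"
}

# Free space, in whole GB, on the filesystem holding the given path.
free_gb() {
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

has_room_for_models() {
  local path="$1" headroom_gb="${2:-10}"   # headroom is an arbitrary safety margin
  [ "$(free_gb "$path")" -ge $(( $(total_model_gb) + headroom_gb )) ]
}
# Usage (the path is a placeholder for the real model cache location):
#   has_room_for_models /srv/bob || echo "free up disk space first"
```
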
### 6. Start Manual Containers (10 min)

These containers are NOT in the NixOS config and must be started manually. Replace each `[see ops/changelog for full env vars]` placeholder with the actual variables before running. Device Health mounts `/tmp/bob-ssh`, which is outside the paths restored in step 2, so recreate it first; Network Discovery and Syncthing need the backup drive mounted at `/srv/backup`.

```bash
# Coordinator
docker run -d --name bob-coordinator --network=host \
  -e [see ops/changelog for full env vars] \
  --restart=unless-stopped bob-coordinator

# News Aggregator
docker run -d --name bob-news-aggregator --network=host \
  -e NATS_URL=nats://127.0.0.1:4222 \
  --restart=unless-stopped bob-news-aggregator

# Home Automations
docker run -d --name bob-home-automations --network=host \
  -v /srv/bob/home-automations:/data \
  -e [see ops/changelog for full env vars] \
  --restart=unless-stopped bob-home-automations

# Device Health
docker run -d --name bob-device-health --network=host \
  -v /tmp/bob-ssh:/root/.ssh:ro \
  -e NATS_URL=nats://127.0.0.1:4222 \
  --restart=unless-stopped bob-device-health

# Network Discovery
docker run -d --name bob-network-discovery --network=host \
  -v /srv/backup/bob-backups/network:/data \
  -e NATS_URL=nats://127.0.0.1:4222 \
  --restart=unless-stopped bob-network-discovery

# Syncthing
docker run -d --name syncthing --network=host \
  -v /srv/backup/syncthing:/var/syncthing/config \
  -v /home/rig/bob:/data/bob \
  --restart=unless-stopped syncthing/syncthing:latest
```

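After starting the six containers, it helps to confirm each one is actually running rather than stuck in a restart loop. A minimal sketch, using the container names from the commands above (the helper names are hypothetical):

```bash
# Sketch: confirm every manually started container is running.
MANUAL_CONTAINERS=(
  bob-coordinator bob-news-aggregator bob-home-automations
  bob-device-health bob-network-discovery syncthing
)

# Succeed only if the named container exists and is in the running state.
is_running() {
  [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}

check_manual_containers() {
  local name missing=0
  for name in "${MANUAL_CONTAINERS[@]}"; do
    if is_running "$name"; then
      echo "ok       $name"
    else
      echo "MISSING  $name"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
# Usage: check_manual_containers || docker ps -a
```
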
### 7. Verify (30 min)

```bash
# Check all containers (-q prints one ID per line, so the header is not counted)
docker ps -q | wc -l  # Should be ~80

# Check voice pipeline
curl -s http://127.0.0.1:8003/health  # Coordinator
curl -s http://127.0.0.1:8002/metrics  # Metrics

# Check NATS
nats stream ls  # Should show BOB_AGENTS, BOB_COORDINATOR

# Test voice
# Open voice.genexergy.org, say "Hey Bob, what time is it?"
```

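Services can take a while to come up after a rebuild, so a single `curl` right after startup may fail even when nothing is wrong. A small retry helper makes the health checks less flaky. This is a generic sketch; the attempt count and delay in the usage lines are arbitrary.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Run COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    # Sleep between attempts, but not after the final one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}
# Usage (-f makes curl exit non-zero on HTTP error responses):
#   retry 10 3 curl -fsS -o /dev/null http://127.0.0.1:8003/health
#   retry 10 3 curl -fsS -o /dev/null http://127.0.0.1:8002/metrics
```
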
## Backup Schedule

- **Restic**: Nightly at 3 AM to external drive (`/srv/backup/bob-backups/rig`)
- **Syncthing**: Real-time replication of Bob config to kairos
- **Device configs**: Backed up to `/srv/backup/bob-backups/{device}/config/`

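A nightly schedule only helps if it is actually running. One way to notice a silently failing job is to compare the newest snapshot's timestamp against the clock. The sketch below assumes `jq` and GNU `date` are available and that `RESTIC_REPOSITORY` and `RESTIC_PASSWORD` are set; the 26-hour threshold and the function names are assumptions.

```bash
# Sketch: flag a stale Restic backup.
MAX_AGE_HOURS=26   # nightly schedule plus slack; the threshold is an assumption

# Whole hours between two epoch timestamps.
age_hours() {
  local earlier="$1" later="$2"
  echo $(( (later - earlier) / 3600 ))
}

# Epoch time of the most recent snapshot in the repository.
latest_snapshot_epoch() {
  local ts
  ts="$(restic snapshots --json --latest 1 | jq -r 'max_by(.time) | .time')" || return 1
  date -d "$ts" +%s
}

backup_is_fresh() {
  local snap_epoch
  snap_epoch="$(latest_snapshot_epoch)" || return 1
  [ "$(age_hours "$snap_epoch" "$(date +%s)")" -le "$MAX_AGE_HOURS" ]
}
# Usage: backup_is_fresh || echo "WARNING: latest snapshot is older than ${MAX_AGE_HOURS}h"
```
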
## What's NOT Backed Up

- LLM model weights (re-download from HuggingFace, ~30GB total)
- TrustGraph data (48 containers, would need separate backup strategy)
- Docker images (rebuilt from Dockerfiles)
- Neo4j transaction logs (knowledge graph data IS backed up via Restic)

## Recovery Time Estimate

| Step | Time |
|------|------|
| NixOS install | 30 min |
| Restic restore | 1-2 hours |
| NixOS rebuild | 30 min |
| Docker image builds | 1-2 hours |
| Model downloads | 2-4 hours |
| Manual containers + verify | 40 min |
| **Total** | **~6-10 hours** |

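As a sanity check on the total, the per-step estimates from the step headings can be summed directly, counting steps 6 and 7 as 10 and 30 minutes. The arithmetic below is only a worked example of that sum.

```bash
# Per-step estimates in minutes for steps 1 through 7 (best and worst case).
STEP_MIN=(30 60 30 60 120 10 30)
STEP_MAX=(30 120 30 120 240 10 30)

# Add up any number of minute values.
sum_minutes() {
  local total=0 m
  for m in "$@"; do
    total=$((total + m))
  done
  echo "$total"
}

# Format a minute count as hours and minutes.
format_hm() {
  printf '%dh %02dm\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

echo "best case:  $(format_hm "$(sum_minutes "${STEP_MIN[@]}")")"   # 5h 40m
echo "worst case: $(format_hm "$(sum_minutes "${STEP_MAX[@]}")")"   # 9h 40m
```
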
## Backup Verification

Test monthly: run `restic check` to verify repository integrity and `restic stats` to confirm the repository size looks plausible.
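
The monthly check can be wrapped in a small script so both commands always run and any failure is visible in the exit status. This is a sketch assuming `RESTIC_REPOSITORY` and `RESTIC_PASSWORD` are exported as in step 2; `run_step` and `verify_backup` are hypothetical names.

```bash
# Sketch: monthly backup verification run.

# Run a named step, report PASS/FAIL, and pass the failure through.
run_step() {
  local name="$1"
  shift
  if "$@"; then
    echo "PASS  $name"
  else
    echo "FAIL  $name" >&2
    return 1
  fi
}

verify_backup() {
  local rc=0
  # Run both steps even if the first fails, then report the combined result.
  run_step "repository integrity" restic check || rc=1
  run_step "repository stats" restic stats || rc=1
  return "$rc"
}
# Usage: verify_backup || echo "backup verification failed" >&2
```

A plain `restic check` verifies the repository structure; adding `--read-data` also reads every pack file, which is slower but catches corrupted data on the drive itself.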