disaster-recovery.md
# Disaster Recovery Plan: Bob

> Status: Draft | Last updated: 2026-04-06

## Scenario: Complete Rig Failure

If the rig server dies (disk failure, power surge, or other hardware failure), here's how to rebuild Bob from scratch.

## Prerequisites

- Fresh NixOS machine with 3x NVIDIA GPUs (or equivalent)
- Access to external backup drive (Seagate BUP Slim BK, 1.8TB, ext4, label `dev_2`)
- Internet access for NixOS packages and model downloads

## Recovery Steps

### 1. Install NixOS Base (30 min)

```bash
# Boot from NixOS USB installer
# Partition disk (or use existing disko config)
nix-shell -p git restic
git clone <bob-repo-url> /tmp/bob

# Or restore the repo from the backup drive. The backup is a Restic
# repository (see step 2), so extract the files with restic, not cp:
sudo mkdir -p /mnt/backup
sudo mount /dev/sdX1 /mnt/backup    # The external drive (or /dev/disk/by-label/dev_2)
restic -r /mnt/backup/bob-backups/rig restore latest \
  --target /tmp/restore --include /home/rig/bob    # Prompts for the repository password
cp -r /tmp/restore/home/rig/bob /tmp/bob

# Apply NixOS config
sudo nixos-install --flake /tmp/bob/nix#rig
```

### 2. Restore from Restic Backup (1-2 hours)

```bash
# Mount backup drive
sudo mkdir -p /srv/backup
sudo mount /dev/sdX1 /srv/backup

# Install restic
nix-shell -p restic

# List snapshots
# (assumes /run/secrets/restic_password exists on the rebuilt machine;
# otherwise supply the password from wherever it is kept offline)
export RESTIC_REPOSITORY=/srv/backup/bob-backups/rig
export RESTIC_PASSWORD="$(sudo cat /run/secrets/restic_password)"
restic snapshots

# Restore Bob data (run as root so files under / keep their ownership)
restic restore latest --target /
# This restores /home/rig/bob and /srv/bob
```

### 3. Rebuild NixOS from Flake (30 min)

```bash
cd /home/rig/bob/nix
sudo nixos-rebuild switch --flake .#rig
```

This recreates the Docker containers, NATS, Prometheus, Caddy, and systemd services declared in the NixOS config. The containers listed in step 6 are not part of that config and are started manually.

### 4. Rebuild Docker Images (1-2 hours)

```bash
# Rebuild all Bob service images
cd /home/rig/bob/services
for svc in */; do
    # $svc already ends with a slash, e.g. "pipecat-agent/"
    if [ -f "${svc}Dockerfile" ]; then
        echo "Building ${svc%/}..."
        DOCKER_BUILDKIT=0 docker build --no-cache -t "bob-${svc%/}" "$svc"
    fi
done

# Rebuild pipecat-agent specifically
cd pipecat-agent
DOCKER_BUILDKIT=0 docker build --no-cache -t bob-pipecat-agent .
```

### 5. Download LLM Models (2-4 hours)

```bash
# Models are NOT backed up (too large). Re-download:
#   Qwen3-32B AWQ             (~20GB)
#   Qwen3-8B AWQ              (~5GB)
#   faster-whisper large-v3   (~3GB)
#   Fish Speech v1.5          (~1GB)
#   BGE-large-en-v1.5         (~1GB)

# vLLM will download on first start via HuggingFace
# Other models download via their respective containers
```

### 6. Start Manual Containers (10 min)

These containers are NOT in the NixOS config and must be started manually. The bracketed `-e [...]` entries are placeholders: substitute the actual environment variables from ops/changelog before running those commands.

```bash
# Coordinator
docker run -d --name bob-coordinator --network=host \
  -e [see ops/changelog for full env vars] \
  --restart=unless-stopped bob-coordinator

# News Aggregator
docker run -d --name bob-news-aggregator --network=host \
  -e NATS_URL=nats://127.0.0.1:4222 \
  --restart=unless-stopped bob-news-aggregator

# Home Automations
docker run -d --name bob-home-automations --network=host \
  -v /srv/bob/home-automations:/data \
  -e [see ops/changelog for full env vars] \
  --restart=unless-stopped bob-home-automations

# Device Health
docker run -d --name bob-device-health --network=host \
  -v /tmp/bob-ssh:/root/.ssh:ro \
  -e NATS_URL=nats://127.0.0.1:4222 \
  --restart=unless-stopped bob-device-health

# Network Discovery
docker run -d --name bob-network-discovery --network=host \
  -v /srv/backup/bob-backups/network:/data \
  -e NATS_URL=nats://127.0.0.1:4222 \
  --restart=unless-stopped bob-network-discovery

# Syncthing
docker run -d --name syncthing --network=host \
  -v /srv/backup/syncthing:/var/syncthing/config \
  -v /home/rig/bob:/data/bob \
  --restart=unless-stopped syncthing/syncthing:latest
```

### 7. Verify (30 min)

```bash
# Check all containers (-q prints one ID per line, so the header is not counted)
docker ps -q | wc -l  # Should be ~80

# Check voice pipeline
curl -s http://127.0.0.1:8003/health   # Coordinator
curl -s http://127.0.0.1:8002/metrics  # Metrics

# Check NATS
nats stream ls  # Should show BOB_AGENTS, BOB_COORDINATOR

# Test voice
# Open voice.genexergy.org, say "Hey Bob, what time is it?"
```

## Backup Schedule

- **Restic**: Nightly at 3 AM to external drive (`/srv/backup/bob-backups/rig`)
- **Syncthing**: Real-time replication of Bob config to kairos
- **Device configs**: Backed up to `/srv/backup/bob-backups/{device}/config/`

## What's NOT Backed Up

- LLM model weights (re-download from HuggingFace, ~30GB total)
- TrustGraph data (48 containers, would need separate backup strategy)
- Docker images (rebuilt from Dockerfiles)
- Neo4j transaction logs (knowledge graph data IS backed up via Restic)

## Recovery Time Estimate

| Step | Time |
|------|------|
| NixOS install | 30 min |
| Restic restore | 1-2 hours |
| NixOS rebuild | 30 min |
| Docker image builds | 1-2 hours |
| Model downloads | 2-4 hours |
| Manual containers + verify | 40 min |
| **Total** | **5 h 40 min to 9 h 40 min** |

## Backup Verification

Test monthly: run `restic check` to verify the integrity of the repository, and `restic stats` to confirm that the backup size and file counts look plausible.
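
The monthly check could be wrapped in a small script. The sketch below is one possible shape, not the plan's actual tooling. It assumes the repository path and password file from step 2. `RESTIC_PASSWORD_FILE` is restic's standard environment variable for reading the password from a file. `RESTIC_BIN`, `verify_backup`, and the `--run` flag are hypothetical names introduced here for illustration.

```bash
#!/usr/bin/env bash
# Monthly backup verification sketch.
# Names marked "hypothetical" are illustrative and not part of the original plan.

# Repository path and password file as used in step 2; override via the
# environment if they differ on your machine.
export RESTIC_REPOSITORY="${RESTIC_REPOSITORY:-/srv/backup/bob-backups/rig}"
export RESTIC_PASSWORD_FILE="${RESTIC_PASSWORD_FILE:-/run/secrets/restic_password}"

# Hypothetical knob: which restic executable to call (allows a dry run with a stub).
RESTIC_BIN="${RESTIC_BIN:-restic}"

# Hypothetical helper: runs both checks and returns non-zero if either fails.
verify_backup() {
    failed=0

    echo "== restic check (repository integrity) =="
    "$RESTIC_BIN" check || failed=1

    echo "== restic stats (size and file counts) =="
    "$RESTIC_BIN" stats || failed=1

    return "$failed"
}

# Only touch the repository when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
    verify_backup
    exit $?
fi
```

Saved under a name of your choosing (for example `verify-backup.sh`), it would be invoked with `--run` while the backup drive is mounted at `/srv/backup`. Step 2 reads the password file with `sudo`, so the script probably needs root as well. Running both commands even when the first fails keeps the size report available for diagnosis, while the exit status still signals the failure.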