# Server & Infrastructure — Bob

> Last updated: 2026-04-09 (Session 12 final)

## Access

- **Hostname**: rig.lan (192.168.1.137)
- **SSH**: `ssh rig@rig.lan` (key: `~/.ssh/id_rsa` from NH-OneXPlayer)
- **User**: rig (passwordless sudo)

## Hardware

| Component | Spec |
|-----------|------|
| CPU | AMD Ryzen 9 8945HS (8 cores / 16 threads, up to 5.26 GHz) |
| RAM | 80 GB |
| GPU 0 | NVIDIA GeForce RTX 3090 (24 GB, bus 01:00.0) — internal PCIe x4 |
| GPU 1 | NVIDIA GeForce RTX 3090 (24 GB, bus 08:00.0) — Razer Core X via TB3 |
| GPU 2 | NVIDIA GeForce RTX 3090 (24 GB, bus 68:00.0) — Razer Core X via TB3 |
| Total VRAM | 72 GB |
| eGPU enclosures | 2x Razer Core X (Thunderbolt 3, Intel JHL6340) |
| Storage | 3.6 TB NVMe (nvme0n1) |
| Network | enp3s0 (Ethernet, DHCP) |
| PSU | **Unknown: verify ≥ 1200W** (3x 350W TDP = 1050W for the GPUs alone) |

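The PSU estimate in the table is plain addition, and a small helper makes it easy to redo if the GPU count or TDP changes. This is only a sketch: `power_budget_w` is a hypothetical name, and the 150 W allowance for CPU and the rest of the system is an assumption inferred from the ~1200W figure in the Notes section, not a measured value.

```bash
#!/usr/bin/env bash
# Rough power-budget helper (hypothetical name, illustrative only).
# Usage: power_budget_w <gpu_count> <gpu_tdp_w> <cpu_and_system_w>
power_budget_w() {
  local gpus=$1 tdp=$2 rest=$3
  echo $(( gpus * tdp + rest ))
}

power_budget_w 3 350 0     # GPUs alone: prints 1050
power_budget_w 3 350 150   # assumed 150 W for CPU + system: prints 1200
```
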
## Current OS

- **NixOS 26.05 (Yarara)** — installed 2026-03-24, boot repaired 2026-03-29
- Kernel: 6.18.19
- NVIDIA Driver: 590.48.01 (proprietary), CUDA 13.1
- Bootloader: systemd-boot
- Thunderbolt: auto-authorize via udev rule + bolt.service
- Flake: `nix/` in this repo

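Two of the three GPUs sit behind Thunderbolt enclosures, so one way to confirm that auto-authorization worked after a reboot is to check that all three bus IDs from the Hardware table are visible to the driver. A minimal sketch, assuming `nvidia-smi` is on the PATH; `missing_gpus` is a hypothetical helper name, and the substring match on the bus ID is a simplification.

```bash
# Bus IDs copied from the Hardware table.
expected_buses="01:00.0 08:00.0 68:00.0"

# Reads nvidia-smi bus IDs on stdin and prints every expected bus that is absent.
missing_gpus() {
  local seen bus
  seen=$(cat)
  for bus in $expected_buses; do
    case $seen in
      *"$bus"*) ;;            # found, nothing to report
      *) echo "$bus" ;;       # expected GPU is not visible
    esac
  done
}

# Usage on rig (no output means all three GPUs are present):
# nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader | missing_gpus
```

If a bus is reported missing, `boltctl list` on rig shows whether the corresponding enclosure was authorized.
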
## Disk Layout

```
nvme0n1        3.6T
├─nvme0n1p1    512M  vfat   /boot     (PARTLABEL=disk-nvme-esp)
├─nvme0n1p2      8G  swap             (PARTLABEL=swap, not active)
└─nvme0n1p3    3.6T  ext4   /         (PARTLABEL=disk-nvme-root)
```

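The layout marks the swap partition as not active. The check below is a sketch of how that state could be confirmed from the kernel's swap table; `swap_is_active` is a hypothetical helper, and the `swapon` hint assumes udev exposes the partition as `/dev/disk/by-partlabel/swap` (which follows from the PARTLABEL shown above). Enabling swap this way lasts only until reboot; a permanent change would belong in the NixOS config.

```bash
# Succeeds if the swap table lists at least one active area.
# /proc/swaps has a one-line header; every further line is an active swap area.
swap_is_active() {
  local table=${1:-/proc/swaps}
  [ "$(tail -n +2 "$table" 2>/dev/null | wc -l)" -gt 0 ]
}

if swap_is_active; then
  echo "swap active"
else
  echo "swap inactive; enable for this boot with: sudo swapon /dev/disk/by-partlabel/swap"
fi
```
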
## Services — Running

| Service | Port | Runtime | Notes |
|---------|------|---------|-------|
| vLLM (primary) | 8000 | Docker (GPU 0+1, TP=2) | Qwen3-32B AWQ, ~40 tok/s, tool calling, llm.rig.lan |
| Embeddings (TEI) | 8080 | Docker (GPU 2) | BAAI/bge-m3 |
| faster-whisper | 10300 | Docker (GPU 2) | Whisper large-v3 INT8, STT, stt.rig.lan |
| Kokoro TTS | 10400 | Docker (GPU 2) | 54 voices, <300ms latency, tts.rig.lan |
| Fish Speech | 10600 | Docker (GPU 2) | v1.5, Ray Porter voice clone (primary TTS) |
| openWakeWord | 10500 | Docker (CPU) | Wyoming protocol, custom "hey bob" wake word |
| Ollama | 11434 | Docker (GPU 2) | Qwen2.5-VL-3B vision, time-shared, 30s keepalive |
| NATS JetStream | 4222/8222/1883 | NixOS native | Event bus + MQTT bridge, nats.rig.lan |
| HomeAssistant | 8123 | Docker (CPU, host network) | Onboarded, https://home.genexergy.org |
| Pipecat Agent | 10700 | Docker (CPU, host network) | Voice pipeline + 9 tools, wake word, diarization |
| Diarization | — | Docker (GPU 2) | diart + CAM++ streaming, 3 speakers enrolled |
| Oxigraph | 7878 | Docker (CPU) | SPARQL endpoint, ~122 triples (BFO+CCO data lost, needs reload) |
| Neo4j | 7474/7687 | Docker (CPU) | Graph DB for Graphiti agent memory |
| HA→NATS Bridge | — | Docker (CPU) | Publishes HA state changes to NATS |
| Caddy | 80/443 | NixOS native | Reverse proxy for *.rig.lan + voice.rig.lan |
| Prometheus | 9090 | NixOS native | Metrics, 30d retention, prometheus.rig.lan |
| Node Exporter | 9100 | NixOS native | System metrics |
| Alertmanager | 9093 | NixOS native | 8 alert rules |
| Reticulum | 4242 | Docker (CPU) | Transport node, TCP gateway, MichMesh + RMAP peered |
| Syncthing | 8384/22000 | Docker (CPU) | Running, rig + kairos synced |
| Firefly III | 8181 | Docker (CPU) | Financial management, sops credentials |
| Firefly DB | 3306 | Docker (CPU) | MariaDB for Firefly III |
| TrustGraph | 8888/8088 | Docker Compose (~44 containers) | Workbench :8888 + API :8088, authenticated (API key configured) |
| Squid Proxy | 3128 | Docker (on nuclide-amd.lan) | Authenticated forward proxy, residential IP egress. Router port forward still pending. |

| Bob Agent | Schedule | Notes |
|-----------|----------|-------|
| Agent Scheduler | always-on | Cron trigger service, NATS JetStream |
| Home Keeper | hourly | Infrastructure health checks |
| Morning Coordinator | 7:45 AM ET | Daily briefing (weather, news, health) |
| Evening Coordinator | 8:00 PM ET | Daily summary |
| Knowledge Gardener | 2:00 AM ET | Consolidation + real-time session storage + pruning |
| System Sentinel | every 15 min | Deep monitoring (Prometheus, Docker, SSH inventory) |
| News Aggregator | every 2 hours | RSS feeds + NWS + Guardian API |
| Device Health | every 4 hours | SSH checks across managed devices |
| Alert Bridge | always-on | Alertmanager → NATS webhook bridge |
| Calendar Bridge | always-on | ICS feed poller, 4 feeds active (Proton x2, MS365, Google) |
| Coordinator | always-on | Request classifier + model router |
| Home Automations | always-on | 3-tier rule engine (YAML, pattern, LLM) |
| Network Discovery | always-on | Subnet scanner |
| Announce Player | always-on | TTS announcements on speakers |
| REPL Sandbox | always-on | Sandboxed Python execution |
| Voice Enrollment | always-on | Speaker enrollment training |

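The port column of the services table lends itself to a quick listener sweep. The sketch below uses bash's built-in `/dev/tcp` redirection, so it needs nothing beyond bash and coreutils `timeout`; `port_open` and `check_services` are hypothetical names, and the labels in the usage line are arbitrary shorthand. An open port only shows that something is listening, not that the service is healthy.

```bash
# Probe a TCP port via bash's /dev/tcp redirection; succeeds if a listener answers.
port_open() {
  local host=$1 port=$2
  timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Takes a host followed by name:port pairs and prints one status line per pair.
check_services() {
  local host=$1; shift
  local entry name port
  for entry in "$@"; do
    name=${entry%%:*}
    port=${entry##*:}
    if port_open "$host" "$port"; then
      echo "OK   $name ($port)"
    else
      echo "DOWN $name ($port)"
    fi
  done
}

# Usage on rig (ports taken from the table above):
# check_services localhost vllm:8000 tei:8080 whisper:10300 kokoro:10400 nats:4222
```
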
## Services — DOWN or Degraded

| Service | Issue | Impact |
|---------|-------|--------|
| TrustGraph (7 of ~48) | Exited: workbench-ui, loki, grafana, prometheus, ddg-mcp, garage, mcp-server | TG monitoring + some UI unavailable |

> All Bob containers are now NixOS-managed. No unmanaged containers remain.

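To tell the known TrustGraph exits apart from new failures, the list of exited containers can be filtered against the names in the table. This is a sketch under one assumption: actual container names probably carry a compose prefix or suffix, so the filter matches on substrings. `unexpected_exits` is a hypothetical helper.

```bash
# Known-exited services, copied from the table above.
expected_down="workbench-ui loki grafana prometheus ddg-mcp garage mcp-server"

# Reads container names on stdin and prints those NOT on the known-down list.
unexpected_exits() {
  local name known hit
  while read -r name; do
    [ -n "$name" ] || continue
    hit=0
    for known in $expected_down; do
      case $name in *"$known"*) hit=1 ;; esac
    done
    [ "$hit" -eq 1 ] || echo "$name"
  done
}

# Usage on rig (no output means nothing new has exited):
# sudo docker ps -a --filter status=exited --format '{{.Names}}' | unexpected_exits
```
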
## GPU Allocation

| GPU | Bus | Services | VRAM Used | VRAM Free |
|-----|-----|----------|-----------|-----------|
| 0 | 01:00.0 (internal) | vLLM TP rank 0 (Qwen3-32B AWQ) | 20.9 GB | 2.7 GB |
| 1 | 08:00.0 (Razer Core X) | vLLM TP rank 1 (Qwen3-32B AWQ) | 20.9 GB | 2.7 GB |
| 2 | 68:00.0 (Razer Core X) | Classifier + Embeddings + STT + TTS (Fish+Kokoro) + Diarization + Ollama (on-demand) | 15.6 GB | 8.2 GB |

> **Note:** GPU 2 has ~8 GB free with the classifier running. Ollama loads Qwen2.5-VL-3B on demand (~3.2 GB, 30s keepalive), which temporarily reduces free VRAM to ~5 GB.

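The headroom arithmetic in the note (about 8.2 GB free minus about 3.2 GB for Ollama leaves about 5 GB) can be rerun against live numbers. A sketch, assuming `nvidia-smi` reports MiB with `nounits` and that the 3.2 GB figure is close enough to 3277 MiB; `vram_headroom` is a hypothetical helper.

```bash
# Reads "index, used_MiB, total_MiB" CSV lines on stdin and prints free MiB per GPU,
# after subtracting an extra reservation from one GPU.
# Usage: vram_headroom <gpu_index> <reserve_mib>
vram_headroom() {
  local gpu=$1 reserve_mib=$2
  awk -F', *' -v gpu="$gpu" -v r="$reserve_mib" '{
    free = $3 - $2
    if ($1 == gpu) free -= r
    printf "GPU %s: %d MiB free\n", $1, free
  }'
}

# Usage on rig (reserve room on GPU 2 for the on-demand Ollama model):
# nvidia-smi --query-gpu=index,memory.used,memory.total \
#   --format=csv,noheader,nounits | vram_headroom 2 3277
```
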
## Credentials

<!-- Secrets managed via sops-nix after NixOS install -->
| Account | Username | Password / Secret | Used For |
|---------|----------|-------------------|----------|
| rig SSH | rig | key-based | System access |
| Neo4j | neo4j | sops: `neo4j_password` | Graph DB |
| Grafana | admin | sops: `grafana_admin_password` | Monitoring dashboards |
| Firefly III DB | firefly | sops: `firefly_db_password` | MariaDB |
| Firefly III | — | sops: `firefly_app_key` | Laravel APP_KEY |
| Firefly DB root | root | sops: `firefly_db_root_password` | MariaDB root |
| Calendar Bridge | — | sops: `calendar_ics_urls` | 4 ICS feeds (Proton x2, MS365, Google) |
| HomeAssistant | — | sops: `ha_token` | Long-lived access token |
| HuggingFace | — | sops: `hf_token` | Gated model access (pyannote) |
| Residential Proxy | glean | sops: `proxy_password` | Squid proxy auth (proxy.genexergy.org:3128) |
| Restic Backup | — | sops: `restic_password` | Encrypted backup repo |

**Keycloak Users** (realm: hydra-ops, auth.genexergy.org):

| Username | Name | Role | Notes |
|----------|------|------|-------|
| cam | Cameron Hunt | haven | Dad — primary admin |
| aj | Adriane Hunt | haven | Mom |
| hailen | Hailen Hunt | haven | Son — email: hailen.n.hunt@outlook.com (verified) |
| operator | — | haven | Service account |
| greatroom | — | haven | Kiosk location account |
| garage | — | haven | Kiosk location account |

> Secrets managed by sops-nix. Decrypt with: `SOPS_AGE_KEY_FILE=/var/lib/sops-nix/key.txt sops nix/secrets/secrets.yaml`

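The command above opens the whole secrets file. When only one value is needed, `sops -d --extract` can print a single key instead. The sketch below assumes the sops key names in the Credentials table are top-level keys in `secrets.yaml`, which this document does not confirm; `sops_path` and `read_secret` are hypothetical helpers.

```bash
# Build the --extract path for a top-level key: neo4j_password -> ["neo4j_password"]
sops_path() {
  printf '["%s"]' "$1"
}

# Decrypt and print one secret; run from the repo root so the relative path resolves.
read_secret() {
  SOPS_AGE_KEY_FILE=/var/lib/sops-nix/key.txt \
    sops -d --extract "$(sops_path "$1")" nix/secrets/secrets.yaml
}

# Usage:
# read_secret neo4j_password
```
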
## Common Operations

```bash
# SSH into rig
ssh rig@rig.lan

# Check GPU status
ssh rig@rig.lan "nvidia-smi"

# Check running containers
ssh rig@rig.lan "sudo docker ps"

# Test LLM inference
ssh rig@rig.lan 'curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"Qwen/Qwen3-32B-AWQ\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":64}"'

# Test embeddings
ssh rig@rig.lan 'curl -s http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d "{\"inputs\":\"test\"}" | jq ".[0][:3]"'

# Test TTS → WAV
ssh rig@rig.lan 'curl -s http://localhost:10400/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"kokoro\",\"input\":\"Hello\",\"voice\":\"af_heart\"}" -o /tmp/test.wav'

# Test STT (transcribe a WAV)
ssh rig@rig.lan 'curl -s http://localhost:10300/v1/audio/transcriptions \
  -F "file=@/tmp/test.wav" -F "model=Systran/faster-whisper-large-v3"'

# Check NATS JetStream
ssh rig@rig.lan "curl -s http://localhost:8222/varz | jq .jetstream.config"

# Deploy NixOS config changes
rsync -avz nix/ rig@rig.lan:/tmp/haven-nix/
ssh rig@rig.lan "sudo nixos-rebuild switch --flake /tmp/haven-nix#rig"
```

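The deploy step switches the live system directly. One possible variation is to preview the activation with `nixos-rebuild dry-activate` before switching. The wrapper below is a sketch, not the established workflow here: `deploy_rig` and the `RUN` variable are hypothetical, and setting `RUN=echo` prints the commands instead of running them.

```bash
# Sync the flake, preview the activation, then switch.
# RUN=echo deploy_rig   prints the three commands without executing anything.
deploy_rig() {
  local run=${RUN:-}
  $run rsync -avz nix/ rig@rig.lan:/tmp/haven-nix/ &&
  $run ssh rig@rig.lan "sudo nixos-rebuild dry-activate --flake /tmp/haven-nix#rig" &&
  $run ssh rig@rig.lan "sudo nixos-rebuild switch --flake /tmp/haven-nix#rig"
}

# Usage (from the repo root):
# RUN=echo deploy_rig   # inspect first
# deploy_rig            # then run for real
```

Because the steps are chained with `&&`, a failed sync or a failed preview stops the wrapper before the switch.
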
## Notes

- **RAM**: 78 GB total reported (the Hardware table lists 80 GB), ~21 GB used, 56 GB available. Adequate for current stack.
- **Disk**: 250 GB / 3.6 TB used (8%). 3.2 TB free.
- **PSU**: Wattage still unverified. 3x RTX 3090 at full load (1050W) plus CPU and system comes to ~1200W minimum.
- **Docker containers fully declarative**: 35 Bob containers are managed via NixOS `containers.nix`. TrustGraph has its own docker-compose (~44 containers).
- **Model cache**: `/srv/bob/vllm` holds HuggingFace model downloads. Persistent across container restarts.
- **Single code directory**: All service code lives in `/home/rig/bob/services/`. The legacy `haven/` directory was removed in Session 12 (archived at `/home/rig/haven-archive-20260407.tar.gz`).
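
The 8% in the Disk note can look high next to 250 GB of 3.6 TB (just under 7%). If the figure came from `df`, that is expected: GNU `df` derives Use% from used / (used + available) and rounds up, and available space excludes filesystem-reserved blocks. The sketch below reproduces that arithmetic with the rounded figures from the note; `use_pct` is a hypothetical helper, and reading the 8% as `df` output is an assumption.

```bash
# Percentage of used / (used + available), rounded up to the next integer.
# Usage: use_pct <used> <available>   (same unit for both)
use_pct() {
  local used=$1 avail=$2
  echo $(( (used * 100 + used + avail - 1) / (used + avail) ))
}

use_pct 250 3200   # GB figures from the Disk note: prints 8
```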