# Known Issues & Lessons Learned: Bob

> Last updated: 2026-03-30

## Active Issues

### BGE-M3 embedding model download hangs
- **Symptom**: The HuggingFace TEI container hangs indefinitely while downloading `pytorch_model.bin` for `BAAI/bge-m3`
- **Workaround**: Use `BAAI/bge-large-en-v1.5` instead (English-only, ships safetensors weights)
- **Impact**: No multilingual embedding support until resolved
- **Fix**: Pre-download the model weights with `huggingface-cli` and mount them as a local volume, or wait for BGE-M3 to publish weights in safetensors format
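
The pre-download fix can be sketched as two commands assembled in Python. This is a minimal sketch, not a tested deployment: the host directory, the container mount point, and the image tag are hypothetical placeholders, and it assumes that `huggingface-cli download ... --local-dir` and TEI's `--model-id` with a local path behave as documented upstream. GPU flags and port mappings are left out.

```python
import shlex
from pathlib import Path

MODEL_ID = "BAAI/bge-m3"


def download_cmd(local_dir: Path) -> list[str]:
    """Command that fetches the model repo into local_dir ahead of time."""
    return ["huggingface-cli", "download", MODEL_ID, "--local-dir", str(local_dir)]


def tei_run_cmd(local_dir: Path, image: str, mount: str = "/data/bge-m3") -> list[str]:
    """Command that starts TEI against the pre-downloaded copy.

    `image` and `mount` are placeholders (assumptions); substitute the TEI
    image and container path used in your deployment.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{local_dir}:{mount}",
        image,
        "--model-id", mount,
    ]


if __name__ == "__main__":
    weights = Path("/srv/models/bge-m3")  # hypothetical host path
    image = "ghcr.io/huggingface/text-embeddings-inference:<tag>"  # tag left unspecified
    print(shlex.join(download_cmd(weights)))
    print(shlex.join(tei_run_cmd(weights, image)))
```

Building the commands as argument lists keeps quoting explicit; the script only prints them, so nothing is downloaded or started until they are run by hand.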

### vLLM CUDA graph OOM on RTX 3090
- **Symptom**: vLLM crash-loops after the CUDA graph profiling phase, even with ample KV cache memory
- **Workaround**: Pass `--enforce-eager` to skip CUDA graph capture
- **Impact**: ~10-15% lower throughput than with CUDA graphs, but stable operation
- **Root cause**: CUDA graph capture temporarily allocates additional memory, which pushes usage past the 24 GB per-GPU limit
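
To make the root cause concrete, the headroom left outside vLLM's memory budget can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: it treats `--gpu-memory-utilization` (default 0.9) as the fraction of the card budgeted for weights, activations, and KV cache, and it takes 3 GiB as a worst-case CUDA graph overhead, based on the 1 to 3 GiB range that some vLLM versions mention in their startup warning. Exact accounting differs between vLLM versions, so the numbers are illustrative only.

```python
MIB_PER_GIB = 1024


def headroom_mib(total_mib: float, gpu_memory_utilization: float = 0.9) -> float:
    """Memory left outside the budgeted fraction of the GPU, in MiB."""
    return total_mib * (1.0 - gpu_memory_utilization)


def graph_capture_fits(
    total_mib: float,
    gpu_memory_utilization: float = 0.9,
    graph_overhead_mib: float = 3 * MIB_PER_GIB,  # assumed worst case
) -> bool:
    """True if the assumed capture overhead fits in the unbudgeted headroom."""
    return headroom_mib(total_mib, gpu_memory_utilization) >= graph_overhead_mib


if __name__ == "__main__":
    rtx3090_mib = 24 * MIB_PER_GIB
    for util in (0.90, 0.85):
        free = headroom_mib(rtx3090_mib, util)
        verdict = "fits" if graph_capture_fits(rtx3090_mib, util) else "does not fit"
        print(f"utilization={util:.2f}: {free:.0f} MiB headroom, capture {verdict}")
```

Under these assumptions the default budget leaves roughly 2.4 GiB on a 24 GB card, below the assumed 3 GiB worst case, which is consistent with the crash. `--enforce-eager` sidesteps the capture entirely; lowering the utilization is a second lever that has not been tested here.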

## Lessons Learned

### nixos-anywhere + kexec: kernel partition table staleness
The kexec NixOS installer environment can hold a stale kernel partition table after disko runs. If `sgdisk` or `partprobe` warns about "kernel still using old partition table", do NOT attempt manual repartitioning, because the on-disk GPT and the kernel's state will diverge. Instead, let disko handle everything, or reboot first to sync kernel state.

### disko swap partition creation failure
In the kexec environment, `sgdisk` failed to create the swap partition (partition 2) after creating the ESP (partition 1), likely because of a race between `sgdisk`, `partprobe`, and `udevadm settle`. Workaround: use a swapfile instead of a swap partition, or create all partitions in a single `sgdisk` invocation.
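
The single-invocation workaround can be sketched as one `sgdisk` argument list that carries every partition, so no `partprobe` or `udevadm settle` round-trip happens between partitions. The partition sizes, the third (root) partition, and the device path are hypothetical; the `-n number:start:end` and `-t number:typecode` options and the `ef00`/`8200`/`8300` type codes are standard `sgdisk` usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    number: int
    size: str      # sgdisk end spec: "+512M" is a relative size, "0" means rest of disk
    typecode: str  # GPT type code: ef00 (ESP), 8200 (Linux swap), 8300 (Linux filesystem)


# Hypothetical layout for illustration only.
EXAMPLE_LAYOUT = [
    Part(1, "+512M", "ef00"),
    Part(2, "+8G", "8200"),
    Part(3, "0", "8300"),
]


def sgdisk_argv(device: str, parts: list[Part]) -> list[str]:
    """Build one sgdisk command that creates and types all partitions."""
    argv = ["sgdisk"]
    for p in parts:
        # Start sector 0 lets sgdisk pick the first free aligned sector.
        argv += ["-n", f"{p.number}:0:{p.size}", "-t", f"{p.number}:{p.typecode}"]
    argv.append(device)
    return argv


if __name__ == "__main__":
    print(" ".join(sgdisk_argv("/dev/sdX", EXAMPLE_LAYOUT)))
```

The script only prints the command; run it against a real device deliberately, since `sgdisk` rewrites the partition table.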

### Thunderbolt eGPUs require explicit authorization on NixOS
The default Thunderbolt security level on NixOS blocks unapproved devices. GPUs behind Thunderbolt PCIe switches (Razer Core X) stay invisible until authorized. Fix: add a udev rule for auto-authorization, plus a `thunderbolt-pci-rescan` systemd service that runs before nvidia-persistenced and the CDI generator.
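
For debugging before the udev rule is in place, the same authorization can be done by hand through sysfs. The sketch below is not the udev rule or the `thunderbolt-pci-rescan` service from the fix; it is a manual equivalent that assumes the standard Linux sysfs layout (`/sys/bus/thunderbolt/devices/*/authorized` and `/sys/bus/pci/rescan`), a security level that accepts a plain `1`, and root privileges. The `sysfs_root` parameter is an addition so the logic can be exercised against a fake tree.

```python
from pathlib import Path


def authorize_thunderbolt(sysfs_root: Path = Path("/sys")) -> list[str]:
    """Write 1 to every unauthorized Thunderbolt device; return their names."""
    newly_authorized = []
    devices = sysfs_root / "bus" / "thunderbolt" / "devices"
    for attr in sorted(devices.glob("*/authorized")):
        if attr.read_text().strip() == "0":
            attr.write_text("1")
            newly_authorized.append(attr.parent.name)
    return newly_authorized


def rescan_pci(sysfs_root: Path = Path("/sys")) -> None:
    """Ask the kernel to rescan the PCI bus so newly tunneled GPUs show up."""
    (sysfs_root / "bus" / "pci" / "rescan").write_text("1")


if __name__ == "__main__":
    changed = authorize_thunderbolt()
    if changed:
        rescan_pci()
    print("authorized:", changed)
```

Rescanning only after a device was actually authorized mirrors the ordering in the fix: the GPU has to be visible on the PCI bus before nvidia-persistenced and the CDI generator start.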

### NATS JetStream + NixOS ProtectSystem=strict
The NixOS NATS module sets `ProtectSystem=strict` and `ReadWritePaths=/var/lib/nats`. A custom `store_dir` outside `/var/lib/nats` fails with "read-only file system". Always keep JetStream storage under `/var/lib/nats`.
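
A pre-deploy check for this rule might look like the sketch below. The function name is hypothetical and the check is purely lexical: it does not resolve symlinks or read the actual unit file, it only tests whether a configured `store_dir` sits at or beneath the one writable path.

```python
from pathlib import PurePosixPath

NATS_STATE_DIR = PurePosixPath("/var/lib/nats")


def store_dir_is_writable(store_dir: str, allowed: PurePosixPath = NATS_STATE_DIR) -> bool:
    """True if store_dir is the allowed directory or lives beneath it."""
    path = PurePosixPath(store_dir)
    # Reject relative paths and ".." components instead of guessing where they lead.
    if not path.is_absolute() or ".." in path.parts:
        return False
    return path == allowed or allowed in path.parents


if __name__ == "__main__":
    for candidate in ("/var/lib/nats/jetstream", "/data/nats", "/var/lib/nats-other"):
        print(candidate, store_dir_is_writable(candidate))
```

Comparing path components rather than string prefixes keeps `/var/lib/nats-other` from being mistaken for a subdirectory of `/var/lib/nats`.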

### 70B Llama models cannot use TP=3
The Llama 70B architecture has 64 attention heads. Tensor parallelism splits the heads evenly across GPUs, and 64 is not divisible by 3, so TP=3 fails. Use TP=2 (64/2 = 32 heads per GPU) or TP=1 with quantization. This applies to any model built on the Llama 70B architecture, including DeepSeek R1 Distill.
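
The divisibility rule is easy to check before launching a server. The helper below is a minimal sketch with a hypothetical name; it only tests the attention-head constraint described above, and a real inference engine may impose further constraints that it does not model.

```python
LLAMA_70B_ATTENTION_HEADS = 64


def valid_tp_sizes(num_attention_heads: int, max_gpus: int) -> list[int]:
    """Tensor-parallel sizes up to max_gpus that divide the head count evenly."""
    return [tp for tp in range(1, max_gpus + 1) if num_attention_heads % tp == 0]


if __name__ == "__main__":
    for tp in valid_tp_sizes(LLAMA_70B_ATTENTION_HEADS, max_gpus=3):
        heads_per_gpu = LLAMA_70B_ATTENTION_HEADS // tp
        print(f"TP={tp}: {heads_per_gpu} heads per GPU")
```

With three GPUs available, only TP=1 and TP=2 pass the check, so the third GPU is left free for other workloads.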

### CDI nvidia-ctk crashes with malloc corruption on NixOS
Running `nvidia-ctk cdi generate` (nvidia-container-toolkit 1.18.2) directly crashes with `malloc(): corrupted top size` on NixOS because `/etc/ld.so.cache` is missing. The NixOS-managed CDI generator service (`nvidia-container-toolkit-cdi-generator.service`) works correctly because it uses the proper library paths.

### Voice pipeline: FIXED (9 issues found and resolved)
- Full pipeline working: browser mic → WebSocket → Pipecat → STT → LLM → TTS → gapless audio playback + live transcript
- Key lessons:
  - Pipecat requires an explicit serializer
  - Kokoro outputs 24 kHz audio (not 16 kHz)
  - `VALID_VOICES` must be patched for non-OpenAI TTS
  - `ThinkTagFilter` must pass system frames through
  - Audio scheduling must be gapless
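
The sample-rate and gapless-scheduling lessons can be illustrated with a small timing model. This is a sketch in Python, not the browser playback code: the function names are hypothetical, and it only shows the arithmetic of starting each chunk exactly where the previous one ends, using the 24 kHz rate to compute chunk durations.

```python
KOKORO_SAMPLE_RATE = 24_000  # Hz, per the lesson above


def chunk_duration(num_samples: int, sample_rate: int = KOKORO_SAMPLE_RATE) -> float:
    """Playback length of a chunk in seconds."""
    return num_samples / sample_rate


def schedule_chunks(
    chunks: list[tuple[float, int]],
    sample_rate: int = KOKORO_SAMPLE_RATE,
) -> list[float]:
    """Return a start time for each (arrival_time_s, num_samples) chunk.

    A chunk starts when the previous one ends, or at its own arrival time
    if it shows up late (an underrun, which is the only source of gaps).
    """
    starts = []
    next_free = 0.0
    for arrival, num_samples in chunks:
        start = max(arrival, next_free)
        starts.append(start)
        next_free = start + chunk_duration(num_samples, sample_rate)
    return starts


if __name__ == "__main__":
    incoming = [(0.0, 24_000), (0.2, 12_000), (2.0, 24_000)]
    print(schedule_chunks(incoming))
    # The same 24,000 samples interpreted at 16 kHz would play 1.5x too long:
    print(chunk_duration(24_000, 16_000))
```

Scheduling from the end of the previous chunk, rather than from the moment each chunk arrives, is what removes audible gaps; treating 24 kHz audio as 16 kHz stretches every chunk and throws the schedule off.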

### TrustGraph Grafana/Loki/Prometheus restarting
- **Symptom**: `trustgraph-grafana-1`, `trustgraph-loki-1`, and `trustgraph-prometheus-1` are stuck in a restart loop
- **Root cause**: Port conflicts with TrustGraph's internal container networking after port remapping
- **Impact**: Low. TrustGraph's own monitoring is unavailable, but Bob's Grafana/Prometheus work fine
- **Fix**: Debug TrustGraph's docker-compose network config, or use Bob's monitoring stack instead