known-issues.md
# Known Issues & Lessons Learned: Bob

> Last updated: 2026-03-30

## Active Issues

### BGE-M3 embedding model download hangs
- **Symptom**: HuggingFace TEI container hangs indefinitely while downloading `pytorch_model.bin` for `BAAI/bge-m3`
- **Workaround**: Use `BAAI/bge-large-en-v1.5` (English-only, has safetensors) instead
- **Impact**: No multilingual embedding support until resolved
- **Fix**: Pre-download the model weights via `huggingface-cli` and mount them as a local volume, or wait for BGE-M3 to publish weights in safetensors format

### vLLM CUDA graph OOM on RTX 3090
- **Symptom**: vLLM crash-loops after the CUDA graph profiling phase, even with ample KV cache memory
- **Workaround**: Use the `--enforce-eager` flag to skip CUDA graph compilation
- **Impact**: ~10-15% lower throughput vs. CUDA graphs, but stable operation
- **Root cause**: CUDA graph capture temporarily allocates additional memory that exceeds the 24 GB per-GPU limit

## Lessons Learned

### nixos-anywhere + kexec: kernel partition table staleness
The kexec NixOS installer environment can be left with a stale kernel partition table after disko runs. If `sgdisk` or `partprobe` warns about "kernel still using old partition table", do NOT attempt manual repartitioning: the on-disk GPT and the kernel's state will diverge. Instead, let disko handle everything, or reboot first to sync the kernel state.

### disko swap partition creation failure
`sgdisk` in the kexec environment failed to create the swap partition (partition 2) after creating the ESP (partition 1), likely because of a race condition between `sgdisk`, `partprobe`, and `udevadm settle`. Workaround: use a swapfile instead of a swap partition, or create all partitions in a single `sgdisk` invocation.

### Thunderbolt eGPUs require explicit authorization on NixOS
The default NixOS Thunderbolt security level blocks unapproved devices. GPUs behind Thunderbolt PCIe switches (Razer Core X) are invisible until authorized.

Fix: add a udev rule for auto-authorization, plus a `thunderbolt-pci-rescan` systemd service that runs before nvidia-persistenced and the CDI generator.

### NATS JetStream + NixOS ProtectSystem=strict
The NixOS NATS module sets `ProtectSystem=strict` and `ReadWritePaths=/var/lib/nats`. Custom `store_dir` paths outside `/var/lib/nats` will fail with "read-only file system". Always use `/var/lib/nats` for JetStream storage.

### 70B Llama models cannot use TP=3
The Llama 70B architecture has 64 attention heads. 64 is not divisible by 3, so tensor parallelism across 3 GPUs fails. Use TP=2 (64/2=32 heads per GPU) or TP=1 with quantization. This applies to any model built on the Llama 70B architecture, including DeepSeek R1 Distill.

### CDI nvidia-ctk crashes with malloc corruption on NixOS
The `nvidia-ctk cdi generate` command (nvidia-container-toolkit 1.18.2) crashes with `malloc(): corrupted top size` on NixOS because `/etc/ld.so.cache` is missing. The NixOS-managed CDI generator service (`nvidia-container-toolkit-cdi-generator.service`) works correctly because it uses the proper library paths.

### Voice pipeline: FIXED (9 issues found and resolved)
- Full pipeline working: browser mic → WebSocket → Pipecat → STT → LLM → TTS → gapless audio playback + live transcript
- Key lessons: Pipecat requires an explicit serializer, Kokoro outputs 24 kHz audio (not 16 kHz), `VALID_VOICES` must be patched for non-OpenAI TTS, `ThinkTagFilter` must pass system frames, and audio scheduling must be gapless

### TrustGraph Grafana/Loki/Prometheus restarting
- **Symptom**: `trustgraph-grafana-1`, `trustgraph-loki-1`, and `trustgraph-prometheus-1` are stuck in a restart loop
- **Root cause**: Port conflicts with TG's internal container networking after port remapping
- **Impact**: Low. TrustGraph's own monitoring is unavailable, but Bob's Grafana/Prometheus work fine
- **Fix**: Debug TG's docker-compose network config, or use Bob's monitoring stack instead
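
The divisibility rule from the "70B Llama models cannot use TP=3" lesson can be sketched as a quick pre-flight check before launching a model. This is a minimal illustration, not vLLM's own validation logic: the helper name `valid_tp_sizes` and the GPU count of 4 are hypothetical, and only the 64-head figure comes from the notes above. Real inference engines may impose further constraints beyond head count.

```python
def valid_tp_sizes(num_attention_heads: int, max_gpus: int) -> list[int]:
    """Return the tensor-parallel sizes (1..max_gpus) that evenly divide the head count."""
    return [tp for tp in range(1, max_gpus + 1) if num_attention_heads % tp == 0]


# Head count for Llama 70B as stated in the lesson above.
LLAMA_70B_HEADS = 64

# With up to 4 GPUs available (hypothetical), 3 is excluded because 64 % 3 != 0.
print(valid_tp_sizes(LLAMA_70B_HEADS, 4))  # prints [1, 2, 4]
```

Running a check like this before choosing a `TP` value avoids discovering the mismatch only after the model has started loading.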