pxe-netboot.md
# PXE Netboot for DGX Spark

This document describes how to PXE boot NixOS on the DGX Spark using GRUB over TFTP.

## Overview

The DGX Spark supports network booting via UEFI PXE. We use:

- **GRUB** as the bootloader (iPXE doesn't work properly - initrd not passed to kernel)
- **TFTP** for serving GRUB, kernel, and initrd
- **NixOS** netboot image built with the dgx-forge flake

## Architecture

```
┌─────────────────┐         ┌──────────────────────────────────────┐
│   DGX Spark     │         │  PXE Server (ultraviolence)          │
│  192.168.6.208  │         │  192.168.4.25                        │
├─────────────────┤         ├──────────────────────────────────────┤
│                 │         │                                      │
│  UEFI PXE Boot  │◄───────►│  dnsmasq (DHCP proxy, port 4011)     │
│       │         │         │                                      │
│       v         │         │                                      │
│  grubnetaa64    │◄───TFTP─│  atftpd (port 69, /srv/netboot)      │
│  .efi.signed    │         │                                      │
│       │         │         │                                      │
│       v         │         │                                      │
│  grub.cfg       │◄───TFTP─│                                      │
│       │         │         │                                      │
│       v         │         │                                      │
│  nixos-kernel   │◄───TFTP─│                                      │
│  nixos-initrd   │         │                                      │
│       │         │         │                                      │
│       v         │         │                                      │
│  NixOS Stage 1  │         │                                      │
│  (initrd)       │         │                                      │
│       │         │         │                                      │
│       v         │         │                                      │
│  NixOS Stage 2  │         │                                      │
│  (systemd)      │         │                                      │
│                 │         │                                      │
└─────────────────┘         └──────────────────────────────────────┘
```

## Server Setup

### Prerequisites

On the PXE server (NixOS):

```nix
# In your NixOS configuration
services.atftpd = {
  enable = true;
  root = "/srv/netboot";
};

# Or run manually:
# sudo atftpd --daemon --no-fork --logfile /tmp/atftp.log /srv/netboot
```

### dnsmasq Configuration

Create `/tmp/dnsmasq-pxe.conf`:

```ini
# PXE proxy mode - responds to PXE requests without full DHCP
port=0
interface=enp4s0
bind-interfaces

# Enable dnsmasq's built-in TFTP server.
# Skip these two lines if atftpd already serves /srv/netboot:
# running both on the same address conflicts on UDP port 69.
enable-tftp
tftp-root=/srv/netboot

# DHCP proxy on port 4011
dhcp-range=192.168.4.0,proxy

# PXE boot options for UEFI ARM64
pxe-service=ARM64_EFI,"NixOS Netboot",grubnetaa64.efi.signed

# Log for debugging
log-dhcp
log-queries
```

Run: `sudo dnsmasq -d -C /tmp/dnsmasq-pxe.conf`

### Directory Structure

```
/srv/netboot/
├── grubnetaa64.efi.signed  # Ubuntu signed GRUB for ARM64 UEFI
├── grub/
│   └── grub.cfg            # GRUB menu configuration
├── grub.cfg                # Fallback location
├── nixos-kernel            # NixOS kernel (Image, ~60MB)
└── nixos-initrd            # NixOS initrd (~500MB-1.3GB)
```

### GRUB Configuration

`/srv/netboot/grub/grub.cfg`:

```grub
insmod net
insmod tftp
set net_default_server=192.168.4.25
set root=(tftp,192.168.4.25)

set timeout=5
set default=0

menuentry 'NixOS Netboot' {
    linux /nixos-kernel init=/nix/store/<toplevel-hash>-nixos-system-.../init \
        nouveau.modeset=0 \
        console=ttyS0,921600 \
        console=tty0 \
        sbsa_gwdt.action=1 \
        pci=pcie_bus_safe \
        loglevel=7
    initrd /nixos-initrd
}
```

**Important kernel parameters:**

| Parameter | Purpose |
|-----------|---------|
| `init=/nix/store/.../init` | Points to NixOS stage 2 init in the squashfs |
| `nouveau.modeset=0` | Disable nouveau (conflicts with nvidia) |
| `console=ttyS0,921600` | Serial console at 921600 baud |
| `sbsa_gwdt.action=1` | Watchdog behavior: panic on the first-stage timeout instead of ignoring it |
| `pci=pcie_bus_safe` | Safe PCIe enumeration: use the largest MPS that every device below the root complex supports |

## Building the NixOS Netboot Image

### Flake Configuration

In `dgx-forge/nix/configurations/netboot.nix`:

```nix
{ config, lib, pkgs, modulesPath, ... }:

{
  imports = [
    (modulesPath + "/installer/netboot/netboot-minimal.nix")
    ../modules/nixos/dgx-spark.nix
  ];

  # Netboot-specific settings
  boot.loader.grub.enable = false;

  # Enable SSH for remote access
  services.openssh.enable = true;

  # Minimal system for debugging
  environment.systemPackages = with pkgs; [
    vim
    htop
    pciutils
    usbutils
  ];
}
```

### Building

```bash
# Build on ARM64 builder
nix build .#nixosConfigurations.spark-netboot-kexec.config.system.build.netboot \
  --builders 'ssh://nix-builder-arm64' \
  --max-jobs 0

# Result contains:
ls -la result/
# Image    -> kernel
# initrd   -> initrd (gzipped, concatenated cpio archives)
# toplevel -> NixOS system closure
```

### Deploying to TFTP Server

```bash
# Copy kernel
sudo cp result/Image /srv/netboot/nixos-kernel

# Copy initrd (this needs special handling - see below)
sudo cp result/initrd /srv/netboot/nixos-initrd
```

## NixOS Initrd Structure

The NixOS netboot initrd is **three concatenated cpio archives**:

1. **First archive** (~4MB): Microcode and early init
2. **Second archive** (~87MB): Stage 1 init scripts, busybox, extra-utils
3. **Third archive** (~1.2GB): `nix-store.squashfs` containing the full NixOS closure

### initrd Contents

```
/
├── init                    # Stage 1 init script (shell script)
├── bin/                    # Symlink or busybox binaries
├── nix/
│   └── store/
│       ├── <extra-utils>/  # Busybox, blkid, modprobe, etc.
│       ├── <modules>/      # Kernel modules
│       └── <fsinfo>/       # Mount configuration
├── nix-store.squashfs      # Symlink to squashfs.img
└── lib -> /nix/store/.../lib
```

### Stage 1 Boot Flow

1. Kernel unpacks initrd and executes `/init`
2. `/init` is a shell script with shebang `#!/bin/ash` (busybox)
3. Mounts a tmpfs as the root filesystem (`/`)
4. Mounts `nix-store.squashfs` on `/nix/.ro-store`
5. Mounts tmpfs on `/nix/.rw-store`
6. Creates overlay: `/nix/store` = ro-store + rw-store
7. Runs `switch_root` to `/mnt-root` with stage 2 init

### Common Initrd Issues

#### "No working init found"

**Cause**: Kernel can't execute `/init`

**Solutions**:

1. Ensure `/init` has execute permissions
2. Ensure interpreter (`/bin/ash` or `/nix/store/.../ash`) is executable
3. Check if busybox is corrupted (compare SHA256 with original)
4. For scripts with nix store shebang, create `/bin` symlink or real copy

```bash
# Fix: Create /bin with real busybox copy
mkdir -p /tmp/build-initrd/bin
cp /nix/store/<extra-utils>/bin/busybox /tmp/build-initrd/bin/
chmod 755 /tmp/build-initrd/bin/busybox
ln -s busybox /tmp/build-initrd/bin/ash
ln -s busybox /tmp/build-initrd/bin/sh

# Update init shebang
sed -i '1s|.*|#!/bin/ash|' /tmp/build-initrd/init
```

#### "couldn't execute it (error -5)"

**Cause**: EIO - binary is corrupted

**Solution**: Re-copy busybox from nix store and verify checksum:

```bash
cp /nix/store/<extra-utils>/bin/busybox /tmp/build-initrd/bin/
sha256sum /tmp/build-initrd/bin/busybox
# Compare with: sha256sum /nix/store/<extra-utils>/bin/busybox
```

#### "stage 2 init script not found"

**Cause**: Stage 1 completed but can't find stage 2 init at `/mnt-root/init`

**Solution**: Pass correct init path in kernel cmdline:

```grub
linux /nixos-kernel init=/nix/store/<toplevel>/init ...
```

Find the toplevel hash:
```bash
# Mount the squashfs and look for nixos-system-*
mkdir -p /tmp/sqmount
sudo mount -o loop,ro /tmp/build-initrd/nix/store/*-squashfs.img /tmp/sqmount
ls /tmp/sqmount/ | grep nixos-system
```

## Rebuilding the Initrd

If you need to modify the initrd (fix permissions, add files):

```bash
# 1. Decompress
zcat /nix/store/<initrd>/initrd > /tmp/original-initrd.cpio

# 2. Find archive boundaries
grep -boa 'TRAILER!!!' /tmp/original-initrd.cpio
# Output shows offsets of each cpio archive end

# 3. Extract to working directory
mkdir -p /tmp/build-initrd && cd /tmp/build-initrd
zcat /nix/store/<initrd>/initrd | cpio -idm

# 4. Extract third archive (squashfs) separately
tail -c +<offset> /tmp/original-initrd.cpio > /tmp/third-archive.cpio
cpio -idm "nix-store.squashfs" "nix/store/*-squashfs.img" < /tmp/third-archive.cpio

# 5. Make fixes (permissions, symlinks, etc.)
chmod 755 bin/busybox
sed -i '1s|.*|#!/bin/ash|' init

# 6. Rebuild single cpio archive
find . -print0 | cpio --null -o -H newc | gzip -1 > /tmp/nixos-initrd-new.gz

# 7. Deploy
sudo cp /tmp/nixos-initrd-new.gz /srv/netboot/nixos-initrd
```

## Troubleshooting

### Enable Verbose Boot

Add to kernel cmdline:
```
loglevel=7 boot.shell_on_fail
```

This gives you a shell if stage 1 fails.
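
Before rebooting the target, it can help to confirm that the debug parameters actually made it into the config GRUB will fetch. The following is a minimal sketch, not part of the original setup: `check_debug_params` is a hypothetical helper name, and it only does a plain substring search, so it cannot tell whether a parameter sits on the `linux` line or in a comment.

```shell
#!/bin/sh
# Hypothetical helper: report which debug parameters are absent from a GRUB config.
# Usage: check_debug_params /path/to/grub.cfg
check_debug_params() {
    cfg="$1"
    missing=""
    # Parameters taken from the "Enable Verbose Boot" section above.
    for param in loglevel=7 boot.shell_on_fail; do
        # -F: fixed-string match, -q: quiet, --: end of options
        if ! grep -qF -- "$param" "$cfg"; then
            missing="$missing $param"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}
```

For example, `check_debug_params /srv/netboot/grub/grub.cfg` prints `ok` when both parameters are present, and otherwise lists the missing ones and returns a non-zero status.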

### Serial Console Access

Connect to DGX Spark serial port at **921600 baud**:
```bash
screen /dev/ttyUSB0 921600
# or
minicom -D /dev/ttyUSB0 -b 921600
```

### TFTP Debugging

Check TFTP server logs:
```bash
tail -f /tmp/atftp.log
```

Test TFTP manually:
```bash
tftp 192.168.4.25
tftp> get grub/grub.cfg
```

### GRUB Debugging

From GRUB prompt:
```grub
set root=(tftp,192.168.4.25)
ls /
configfile /grub/grub.cfg
```

## Files Reference

| File | Location | Purpose |
|------|----------|---------|
| `grubnetaa64.efi.signed` | `/srv/netboot/` | Ubuntu-signed GRUB for ARM64 |
| `grub.cfg` | `/srv/netboot/grub/` | Boot menu |
| `nixos-kernel` | `/srv/netboot/` | NixOS kernel Image |
| `nixos-initrd` | `/srv/netboot/` | NixOS initrd (merged cpio) |
| `dnsmasq-pxe.conf` | `/tmp/` | DHCP proxy config |

## Network Information

| Host | IP | Role |
|------|-----|------|
| ultraviolence | 192.168.4.25 | PXE/TFTP server |
| DGX Spark | 192.168.6.208 | Target (MAC: 4c:bb:47:2d:90:1a) |

## Why Not iPXE?

iPXE was tested but doesn't work properly on DGX Spark:
- Loads kernel successfully
- Fails with "Please append a correct root="
- The initrd isn't being passed to the kernel correctly

GRUB works reliably for ARM64 UEFI netboot.