# PXE Netboot for DGX Spark

This document describes how to PXE boot NixOS on the DGX Spark using GRUB over TFTP.

## Overview

The DGX Spark supports network booting via UEFI PXE. We use:

- **GRUB** as the bootloader (iPXE loads the kernel but fails to pass the initrd to it)
- **TFTP** for serving GRUB, the kernel, and the initrd
- **NixOS** netboot image built with the dgx-forge flake

## Architecture

```
┌─────────────────┐         ┌──────────────────────────────────────┐
│  DGX Spark      │         │  PXE Server (ultraviolence)          │
│  192.168.6.208  │         │  192.168.4.25                        │
├─────────────────┤         ├──────────────────────────────────────┤
│                 │         │                                      │
│  UEFI PXE Boot  │◄───────►│  dnsmasq (DHCP proxy, port 4011)     │
│       │         │         │                                      │
│       v         │         │                                      │
│  grubnetaa64    │◄───TFTP─│  atftpd (port 69, /srv/netboot)      │
│  .efi.signed    │         │                                      │
│       │         │         │                                      │
│       v         │         │                                      │
│  grub.cfg       │◄───TFTP─│                                      │
│       │         │         │                                      │
│       v         │         │                                      │
│  nixos-kernel   │◄───TFTP─│                                      │
│  nixos-initrd   │         │                                      │
│       │         │         │                                      │
│       v         │         │                                      │
│  NixOS Stage 1  │         │                                      │
│  (initrd)       │         │                                      │
│       │         │         │                                      │
│       v         │         │                                      │
│  NixOS Stage 2  │         │                                      │
│  (systemd)      │         │                                      │
│                 │         │                                      │
└─────────────────┘         └──────────────────────────────────────┘
```

## Server Setup

### Prerequisites

On the PXE server (NixOS):

```nix
# In your NixOS configuration
services.atftpd = {
  enable = true;
  root = "/srv/netboot";
};

# Or run manually:
# sudo atftpd --daemon --no-fork --logfile /tmp/atftp.log /srv/netboot
```

### dnsmasq Configuration

Create `/tmp/dnsmasq-pxe.conf`:

```ini
# PXE proxy mode: answers PXE requests without handing out IP addresses.
# port=0 disables dnsmasq's DNS server.
port=0
interface=enp4s0
bind-interfaces

# Enable dnsmasq's built-in TFTP server.
# Omit these two lines if atftpd is already serving /srv/netboot;
# the two servers would otherwise compete for UDP port 69.
enable-tftp
tftp-root=/srv/netboot

# DHCP proxy on port 4011
dhcp-range=192.168.4.0,proxy

# PXE boot options for UEFI ARM64
pxe-service=ARM64_EFI,"NixOS Netboot",grubnetaa64.efi.signed

# Log for debugging
log-dhcp
log-queries
```

Run in the foreground so the logs appear on the terminal: `sudo dnsmasq -d -C /tmp/dnsmasq-pxe.conf`

### Directory Structure

```
/srv/netboot/
├── grubnetaa64.efi.signed    # Ubuntu signed GRUB for ARM64 UEFI
├── grub/
│   └── grub.cfg              # GRUB menu configuration
├── grub.cfg                  # Fallback location
├── nixos-kernel              # NixOS kernel (Image, ~60MB)
└── nixos-initrd              # NixOS initrd (~500MB-1.3GB)
```
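
Before attempting a boot, it can help to confirm that the TFTP root matches the layout above. The following is a minimal sketch; `check_netboot_root` is a hypothetical helper name, and the file list is simply copied from the tree above. It only tests readability for the current user, so it does not prove that the TFTP daemon's own user can read the files.

```bash
# Hypothetical helper: report which expected boot files are missing
# (or unreadable) under a TFTP root. Defaults to /srv/netboot.
check_netboot_root() {
  local root="${1:-/srv/netboot}" missing=0 f
  for f in grubnetaa64.efi.signed grub/grub.cfg nixos-kernel nixos-initrd; do
    if [ ! -r "$root/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}
```

Usage: `check_netboot_root /srv/netboot && echo "TFTP root looks complete"`.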

### GRUB Configuration

`/srv/netboot/grub/grub.cfg`:

```grub
insmod net
insmod tftp
set net_default_server=192.168.4.25
set root=(tftp,192.168.4.25)

set timeout=5
set default=0

menuentry 'NixOS Netboot' {
   linux /nixos-kernel init=/nix/store/<toplevel-hash>-nixos-system-.../init \
         nouveau.modeset=0 \
         console=ttyS0,921600 \
         console=tty0 \
         sbsa_gwdt.action=1 \
         pci=pcie_bus_safe \
         loglevel=7
   initrd /nixos-initrd
}
```

**Important kernel parameters:**

| Parameter | Purpose |
|-----------|---------|
| `init=/nix/store/.../init` | Path to the NixOS stage 2 init inside the squashfs store |
| `nouveau.modeset=0` | Disables nouveau modesetting (nouveau conflicts with the NVIDIA driver) |
| `console=ttyS0,921600` | Serial console at 921600 baud |
| `sbsa_gwdt.action=1` | SBSA watchdog: panic on the first-stage timeout instead of ignoring it |
| `pci=pcie_bus_safe` | Sets every device's PCIe MPS to the largest value supported by all devices below the root complex |
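
Because the `init=` path changes with every rebuild of the system closure, regenerating `grub.cfg` from a template avoids hand-editing the hash. The sketch below is one way to do that; `render_grub_cfg` is a hypothetical helper, and it emits the same parameters as the configuration above, with the `linux` command on a single line for simplicity.

```bash
# Hypothetical helper: print a grub.cfg for the given TFTP server address
# and NixOS toplevel store path (e.g. the /nix/store/...-nixos-system-... path).
render_grub_cfg() {
  local server="$1" toplevel="$2"
  cat <<EOF
insmod net
insmod tftp
set net_default_server=${server}
set root=(tftp,${server})

set timeout=5
set default=0

menuentry 'NixOS Netboot' {
   linux /nixos-kernel init=${toplevel}/init nouveau.modeset=0 console=ttyS0,921600 console=tty0 sbsa_gwdt.action=1 pci=pcie_bus_safe loglevel=7
   initrd /nixos-initrd
}
EOF
}
```

The output can be redirected to `/srv/netboot/grub/grub.cfg` (and to the fallback `/srv/netboot/grub.cfg`) after each build.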

## Building the NixOS Netboot Image

### Flake Configuration

In `dgx-forge/nix/configurations/netboot.nix`:

```nix
{ config, lib, pkgs, modulesPath, ... }:

{
  imports = [
    (modulesPath + "/installer/netboot/netboot-minimal.nix")
    ../modules/nixos/dgx-spark.nix
  ];

  # Netboot-specific settings
  boot.loader.grub.enable = false;

  # Enable SSH for remote access
  services.openssh.enable = true;

  # Minimal system for debugging
  environment.systemPackages = with pkgs; [
    vim
    htop
    pciutils
    usbutils
  ];
}
```

### Building

```bash
# Build on ARM64 builder
nix build .#nixosConfigurations.spark-netboot-kexec.config.system.build.netboot \
  --builders 'ssh://nix-builder-arm64' \
  --max-jobs 0

# Result contains:
ls -la result/
# Image     -> kernel
# initrd    -> initrd (gzipped, concatenated cpio archives)
# toplevel  -> NixOS system closure
```

### Deploying to TFTP Server

```bash
# Copy kernel
sudo cp result/Image /srv/netboot/nixos-kernel

# Copy initrd (it may need repacking first; see "Rebuilding the Initrd")
sudo cp result/initrd /srv/netboot/nixos-initrd
```
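
Since a truncated or stale copy of a file this large is easy to miss, the copy step can be followed by a byte-for-byte comparison. This is a minimal sketch; `deploy_netboot` is a hypothetical helper, and it assumes the `result/` layout shown in the build output above. Run it with sufficient privileges to write to the TFTP root.

```bash
# Hypothetical helper: copy kernel and initrd from a build result directory
# into the TFTP root, then verify that each copy is identical to its source.
deploy_netboot() {
  local result="$1" root="$2"
  # -L follows the symlinks in result/, -f replaces read-only destinations.
  cp -fL "$result/Image"  "$root/nixos-kernel" || return 1
  cp -fL "$result/initrd" "$root/nixos-initrd" || return 1
  # cmp -s is silent and exits non-zero on the first differing byte.
  cmp -s "$result/Image"  "$root/nixos-kernel" || return 1
  cmp -s "$result/initrd" "$root/nixos-initrd" || return 1
  echo "deployed to $root"
}
```

Usage: `sudo bash -c 'source ./deploy.sh; deploy_netboot result /srv/netboot'`, where `deploy.sh` is whatever file holds the function.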

## NixOS Initrd Structure

The NixOS netboot initrd is **three concatenated cpio archives**:

1. **First archive** (~4MB): Microcode and early init
2. **Second archive** (~87MB): Stage 1 init scripts, busybox, extra-utils
3. **Third archive** (~1.2GB): `nix-store.squashfs` containing the full NixOS closure
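
To check how many archives a decompressed initrd actually contains, one option is to count the `TRAILER!!!` entries, since every cpio archive ends with an entry of that name. The sketch below assumes GNU grep; `count_cpio_archives` is a hypothetical helper. The count is only a heuristic, because the same byte sequence could in principle also occur inside file data.

```bash
# Hypothetical helper: count cpio end-of-archive markers in a decompressed
# initrd. -o prints each match on its own line, -a treats binary data as text.
count_cpio_archives() {
  grep -oa 'TRAILER!!!' "$1" | wc -l | tr -d ' '
}
```

For the image described above, `count_cpio_archives /tmp/original-initrd.cpio` would be expected to print 3.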

### initrd Contents

```
/
├── init                    # Stage 1 init script (shell script)
├── bin/                    # Symlink or busybox binaries
├── nix/
│   └── store/
│       ├── <extra-utils>/  # Busybox, blkid, modprobe, etc.
│       ├── <modules>/      # Kernel modules
│       └── <fsinfo>/       # Mount configuration
├── nix-store.squashfs      # Symlink to squashfs.img
└── lib -> /nix/store/.../lib
```

### Stage 1 Boot Flow

1. The kernel unpacks the initrd and executes `/init`
2. `/init` is a shell script with the shebang `#!/bin/ash` (busybox)
3. Mounts a tmpfs as the new root `/` (under `/mnt-root` during stage 1; the paths in steps 4-6 are relative to it)
4. Mounts `nix-store.squashfs` on `/nix/.ro-store`
5. Mounts a tmpfs on `/nix/.rw-store`
6. Creates an overlay: `/nix/store` = ro-store (lower layer) + rw-store (upper layer)
7. Runs `switch_root` to `/mnt-root` and starts the stage 2 init
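
The mount sequence in steps 3 to 6 can be approximated by the commands below. This is an illustrative sketch only: the real stage 1 script is generated by NixOS and differs in detail, `stage1_mount_plan` is a hypothetical name, and the `store` and `work` subdirectories of the rw-store are assumptions. The function only prints the commands, so it is safe to run on any machine.

```bash
# Hypothetical dry run: print (do not execute) mount commands that
# approximate steps 3-6 of the stage 1 boot flow.
stage1_mount_plan() {
  local root="${1:-/mnt-root}"
  cat <<EOF
mount -t tmpfs tmpfs ${root}
mkdir -p ${root}/nix/.ro-store ${root}/nix/.rw-store ${root}/nix/store
mount -t squashfs -o loop,ro /nix-store.squashfs ${root}/nix/.ro-store
mount -t tmpfs tmpfs ${root}/nix/.rw-store
mkdir -p ${root}/nix/.rw-store/store ${root}/nix/.rw-store/work
mount -t overlay overlay -o lowerdir=${root}/nix/.ro-store,upperdir=${root}/nix/.rw-store/store,workdir=${root}/nix/.rw-store/work ${root}/nix/store
EOF
}
```

Comparing this plan with the output of `mount` in a `boot.shell_on_fail` shell shows which step a failed boot reached.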

### Common Initrd Issues

#### "No working init found"

**Cause**: The kernel can't execute `/init`

**Solutions**:
1. Ensure `/init` has execute permissions
2. Ensure the interpreter (`/bin/ash` or `/nix/store/.../ash`) is executable
3. Check whether busybox is corrupted (compare its SHA256 with the original)
4. If `/init` has a Nix store shebang, create `/bin/ash` as a symlink to, or a real copy of, busybox and point the shebang at it

```bash
# Fix: Create /bin with real busybox copy
mkdir -p /tmp/build-initrd/bin
cp /nix/store/<extra-utils>/bin/busybox /tmp/build-initrd/bin/
chmod 755 /tmp/build-initrd/bin/busybox
ln -s busybox /tmp/build-initrd/bin/ash
ln -s busybox /tmp/build-initrd/bin/sh

# Update init shebang
sed -i '1s|.*|#!/bin/ash|' /tmp/build-initrd/init
```
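
The first two checks in the list above can be automated against an extracted initrd tree before repacking it. The following is a minimal sketch; `check_init` is a hypothetical helper. It resolves the interpreter path relative to the extracted tree, so an absolute symlink inside the tree would be followed on the host rather than inside the tree.

```bash
# Hypothetical helper: verify that <tree>/init is executable and that the
# interpreter named on its shebang line exists and is executable in the tree.
check_init() {
  local tree="$1" interp
  [ -x "$tree/init" ] || { echo "init is missing or not executable"; return 1; }
  # Take the first word after "#!" on line 1 (empty if there is no shebang).
  interp=$(sed -n '1s/^#![[:space:]]*\([^[:space:]]*\).*/\1/p' "$tree/init")
  [ -n "$interp" ] || { echo "init has no shebang line"; return 1; }
  [ -x "$tree$interp" ] || { echo "interpreter $interp is missing or not executable"; return 1; }
  echo "ok: $interp"
}
```

Usage: `check_init /tmp/build-initrd`.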

#### "couldn't execute it (error -5)"

**Cause**: Error 5 is `EIO` (I/O error), which here means the binary is corrupted

**Solution**: Re-copy busybox from the Nix store and verify its checksum:

```bash
cp /nix/store/<extra-utils>/bin/busybox /tmp/build-initrd/bin/
sha256sum /tmp/build-initrd/bin/busybox
# Compare with: sha256sum /nix/store/<extra-utils>/bin/busybox
```
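
Comparing two long hashes by eye is error-prone, so the comparison can be scripted. This is a small sketch using only `sha256sum` and `cut`; `same_sha256` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed only if both files exist and have the same
# SHA-256 digest.
same_sha256() {
  local a b
  a=$(sha256sum "$1" 2>/dev/null | cut -d' ' -f1)
  b=$(sha256sum "$2" 2>/dev/null | cut -d' ' -f1)
  # An empty digest means sha256sum could not read the file.
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Usage: `same_sha256 /nix/store/<extra-utils>/bin/busybox /tmp/build-initrd/bin/busybox || echo "busybox copy differs"`, with the store path filled in.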

#### "stage 2 init script not found"

**Cause**: Stage 1 completed but can't find the stage 2 init at `/mnt-root/init`

**Solution**: Pass the correct init path on the kernel command line:

```grub
linux /nixos-kernel init=/nix/store/<toplevel>/init ...
```

Find the toplevel hash:
```bash
# Mount the squashfs and look for nixos-system-*
sudo mkdir -p /tmp/sqmount
sudo mount -o loop,ro /tmp/build-initrd/nix/store/*-squashfs.img /tmp/sqmount
ls /tmp/sqmount/ | grep nixos-system
```
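
Once the squashfs is mounted, the `init=` argument can be derived directly from the directory listing. The sketch below assumes, as the `ls` command above does, that the squashfs root holds the store path names directly; `toplevel_init_arg` is a hypothetical helper. If more than one `nixos-system` entry exists, it picks the first one listed, which may not be the one you want.

```bash
# Hypothetical helper: print the init= kernel argument for the first
# *-nixos-system-* entry found in a mounted nix-store squashfs.
toplevel_init_arg() {
  local mnt="$1" name
  name=$(ls "$mnt" | grep -- '-nixos-system-' | head -n 1)
  [ -n "$name" ] || { echo "no nixos-system entry in $mnt" >&2; return 1; }
  printf 'init=/nix/store/%s/init\n' "$name"
}
```

Usage: `toplevel_init_arg /tmp/sqmount`, then paste the result into the `linux` line of `grub.cfg`.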

## Rebuilding the Initrd

If you need to modify the initrd (fix permissions, add files):

```bash
# 1. Decompress
zcat /nix/store/<initrd>/initrd > /tmp/original-initrd.cpio

# 2. Find archive boundaries
grep -boa 'TRAILER!!!' /tmp/original-initrd.cpio
# Output shows the byte offset of each archive's trailer entry

# 3. Extract to working directory
# Note: GNU cpio stops reading at the first TRAILER!!! entry, so archives
# after the first may need the same treatment as step 4.
mkdir -p /tmp/build-initrd && cd /tmp/build-initrd
zcat /nix/store/<initrd>/initrd | cpio -idm

# 4. Extract third archive (squashfs) separately
# <offset> is the start of the third archive; tail -c +N counts bytes from 1.
tail -c +<offset> /tmp/original-initrd.cpio > /tmp/third-archive.cpio
cpio -idm "nix-store.squashfs" "nix/store/*-squashfs.img" < /tmp/third-archive.cpio

# 5. Make fixes (permissions, symlinks, etc.)
chmod 755 bin/busybox
sed -i '1s|.*|#!/bin/ash|' init

# 6. Rebuild single cpio archive
find . -print0 | cpio --null -o -H newc | gzip -1 > /tmp/nixos-initrd-new.gz

# 7. Deploy
sudo cp /tmp/nixos-initrd-new.gz /srv/netboot/nixos-initrd
```
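
Before deploying a rebuilt image, a quick sanity check can catch a broken pipeline. The sketch below verifies that the file is a valid gzip stream and that the payload starts with `070701`, the magic number of the newc cpio format written by `cpio -H newc`. `check_initrd_gz` is a hypothetical helper, and passing this check does not guarantee that the initrd will boot.

```bash
# Hypothetical helper: check that a rebuilt initrd is valid gzip and that
# its payload begins with the newc cpio magic "070701".
check_initrd_gz() {
  local f="$1" magic
  gzip -t "$f" 2>/dev/null || { echo "not a valid gzip stream"; return 1; }
  # Only the first 6 bytes of the decompressed data are needed.
  magic=$(gzip -dc "$f" 2>/dev/null | head -c 6)
  [ "$magic" = "070701" ] || { echo "payload does not start with a newc cpio header"; return 1; }
  echo "ok"
}
```

Usage: `check_initrd_gz /tmp/nixos-initrd-new.gz`.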

## Troubleshooting

### Enable Verbose Boot

Add to the kernel command line:
```
loglevel=7 boot.shell_on_fail
```

This gives you a shell if stage 1 fails.

### Serial Console Access

Connect to the DGX Spark serial port at **921600 baud**:
```bash
screen /dev/ttyUSB0 921600
# or
minicom -D /dev/ttyUSB0 -b 921600
```

### TFTP Debugging

Check the TFTP server logs (this path matches the manual `atftpd` invocation in Prerequisites):
```bash
tail -f /tmp/atftp.log
```

Test TFTP manually:
```bash
tftp 192.168.4.25
tftp> get grub/grub.cfg
```

### GRUB Debugging

From the GRUB prompt:
```grub
set root=(tftp,192.168.4.25)
ls /
configfile /grub/grub.cfg
```

## Files Reference

| File | Location | Purpose |
|------|----------|---------|
| `grubnetaa64.efi.signed` | `/srv/netboot/` | Ubuntu-signed GRUB for ARM64 |
| `grub.cfg` | `/srv/netboot/grub/` | Boot menu |
| `nixos-kernel` | `/srv/netboot/` | NixOS kernel Image |
| `nixos-initrd` | `/srv/netboot/` | NixOS initrd (merged cpio) |
| `dnsmasq-pxe.conf` | `/tmp/` | DHCP proxy config |

## Network Information

| Host | IP | Role |
|------|-----|------|
| ultraviolence | 192.168.4.25 | PXE/TFTP server |
| DGX Spark | 192.168.6.208 | Target (MAC: 4c:bb:47:2d:90:1a) |

## Why Not iPXE?

iPXE was tested but doesn't work properly on the DGX Spark:
- It loads the kernel successfully
- The boot then fails with "Please append a correct root="
- This indicates that the initrd isn't being passed to the kernel correctly

GRUB has worked reliably for ARM64 UEFI netboot in this setup.