# DGX Forge Documentation

Comprehensive reverse-engineered documentation for NVIDIA DGX systems.

## Status

This documentation is generated through forensic analysis of factory recovery
images. It is not official NVIDIA documentation.

## Platforms

| Platform | Status | Source |
|----------|--------|--------|
| [DGX Spark](./spark/) | Active | FastOS 1.105.17 recovery image |
| [Blackwell](./blackwell/) | Planned | - |

## Documentation Tree

```
docs/
├── spark/
│   ├── README.md                       # Platform overview, quick reference
│   │
│   ├── hardware/
│   │   ├── README.md                   # Hardware architecture overview
│   │   ├── soc-th500.md                # TH500 Thor SoC details
│   │   ├── gpu-gb10.md                 # GB10 Blackwell GPU (1CTA/sm_120)
│   │   ├── memory-architecture.md      # LPDDR5X, unified memory, NVLink-C2C
│   │   ├── ec-mec-n1x.md               # Microchip MEC N1X embedded controller
│   │   ├── tpm-nuvoton.md              # TPM 2.0 implementation
│   │   ├── usb-pd-ccg8.md              # Infineon CCG8DF USB-PD controller
│   │   ├── networking.md               # ConnectX-7, WiFi 6E, Bluetooth 5.3
│   │   ├── storage.md                  # NVMe, eMMC boot
│   │   └── device-tree.md              # DTB analysis, hardware topology
│   │
│   ├── firmware/
│   │   ├── README.md                   # Firmware chain overview
│   │   ├── boot-sequence.md            # Power-on to Linux timeline
│   │   ├── socfw-cap.md                # SoC firmware capsule format
│   │   ├── ec-firmware.md              # EC firmware (Zephyr RTOS)
│   │   ├── ec-fused-vs-nofuse.md       # Fuse state variants
│   │   ├── tpm-firmware.md             # TPM capsule analysis
│   │   ├── usbpd-firmware.md           # USB-PD controller firmware
│   │   ├── mlnx-connectx7.md           # Mellanox NIC firmware
│   │   ├── capsule-format.md           # UEFI capsule structure, signatures
│   │   ├── signing-chain.md            # X.509 certificate hierarchy
│   │   └── update-mechanism.md         # fwupd integration, rollback
│   │
│   ├── boot/
│   │   ├── README.md                   # Boot architecture overview
│   │   ├── uefi.md                     # UEFI firmware, variables
│   │   ├── secureboot.md               # Secure Boot chain, MOK
│   │   ├── shim.md                     # SHIM 15.8 analysis
│   │   ├── grub.md                     # GRUB config, kernel params
│   │   ├── initrd.md                   # initramfs contents, hooks
│   │   ├── kernel-cmdline.md           # Boot parameters
│   │   └── recovery-mode.md            # FastOS recovery boot
│   │
│   ├── kernel/
│   │   ├── README.md                   # Kernel overview
│   │   ├── config.md                   # Kernel config analysis
│   │   ├── tegra241-cmdqv.md           # Tegra CMDQV driver
│   │   ├── nvidia-modules.md           # nvidia.ko, nvidia-uvm.ko, etc
│   │   ├── driver-580.md               # Open driver 580.95.05
│   │   ├── device-drivers.md           # Platform-specific drivers
│   │   └── patches.md                  # Ubuntu/NVIDIA kernel patches
│   │
│   ├── os/
│   │   ├── README.md                   # OS layer overview
│   │   ├── ubuntu-base.md              # Ubuntu 24.04.3 Noble
│   │   ├── package-manifest.md         # All 2,298 packages
│   │   ├── apt-repositories.md         # DGX repo, NVIDIA repo, Ubuntu Pro
│   │   ├── systemd-services.md         # Service dependency graph
│   │   ├── users-groups.md             # System accounts, permissions
│   │   ├── filesystem-layout.md        # Partition scheme, mount points
│   │   └── update-policy.md            # APT pinning, unattended upgrades
│   │
│   ├── nvidia-stack/
│   │   ├── README.md                   # NVIDIA software stack overview
│   │   ├── cuda-13.md                  # CUDA 13.0 toolkit
│   │   ├── driver-architecture.md      # Open vs proprietary, GSP
│   │   ├── nvml.md                     # NVIDIA Management Library
│   │   ├── persistenced.md             # nvidia-persistenced
│   │   ├── container-toolkit.md        # nvidia-container-toolkit, CDI
│   │   ├── fabric-manager.md           # NVLink/NVSwitch (if applicable)
│   │   └── dcgm.md                     # Data Center GPU Manager
│   │
│   ├── services/
│   │   ├── README.md                   # DGX services overview
│   │   │
│   │   ├── oobe/
│   │   │   ├── README.md               # OOBE system overview
│   │   │   ├── architecture.md         # Component diagram
│   │   │   ├── state-machine.md        # OOBE flow states
│   │   │   ├── oobe-service.md         # Main Go service (port 80)
│   │   │   ├── oobe-admin.md           # Admin service (D-Bus)
│   │   │   ├── oobe-desktop.md         # Electron kiosk app
│   │   │   ├── hotspot.md              # WiFi AP setup
│   │   │   ├── hostname-generation.md  # Hostname algorithm
│   │   │   ├── bluetooth-pairing.md    # HID device pairing
│   │   │   ├── network-config.md       # Network provisioning
│   │   │   ├── user-creation.md        # Initial user setup
│   │   │   ├── eula-flow.md            # EULA acceptance
│   │   │   ├── first-run.md            # Post-OOBE first login
│   │   │   └── dbus-interface.md       # com.nvidia.dgx.oobe.admin1
│   │   │
│   │   ├── dashboard/
│   │   │   ├── README.md               # Dashboard overview
│   │   │   ├── architecture.md         # Go backend + React frontend
│   │   │   ├── api-reference.md        # REST API endpoints
│   │   │   ├── authentication.md       # Login, sessions
│   │   │   ├── gpu-telemetry.md        # GPU metrics API
│   │   │   ├── system-info.md          # System info API
│   │   │   ├── update-management.md    # Software update API
│   │   │   ├── jupyterlab.md           # JupyterLab integration
│   │   │   └── frontend.md             # React app structure
│   │   │
│   │   ├── telemetry/
│   │   │   ├── README.md               # Telemetry system overview
│   │   │   ├── consent-model.md        # User/device consent
│   │   │   ├── sol-service.md          # Sign of Life service
│   │   │   ├── endpoints.md            # Phone-home URLs
│   │   │   ├── data-collected.md       # What gets transmitted
│   │   │   ├── localized-config.md     # Config API protocol
│   │   │   └── opt-out.md              # How to disable
│   │   │
│   │   └── system/
│   │       ├── README.md               # System services
│   │       ├── nv-scripts.md           # /usr/local/sbin/nv_scripts/
│   │       ├── first-boot.md           # First boot services
│   │       ├── power-management.md     # Suspend/hibernate
│   │       └── platform-detection.md   # get_platform_info.bash
│   │
│   ├── networking/
│   │   ├── README.md                   # Network architecture
│   │   ├── network-manager.md          # NetworkManager config
│   │   ├── avahi-mdns.md               # mDNS/Bonjour
│   │   ├── wifi.md                     # WiFi 6E configuration
│   │   ├── bluetooth.md                # Bluetooth stack
│   │   └── connectx7.md                # Mellanox NIC config
│   │
│   ├── security/
│   │   ├── README.md                   # Security model
│   │   ├── secure-boot-chain.md        # Full chain analysis
│   │   ├── certificates.md             # X.509 certs, NVIDIA CA
│   │   ├── tpm-integration.md          # Measured boot
│   │   ├── apparmor.md                 # AppArmor profiles
│   │   ├── firewall.md                 # nftables rules
│   │   └── ubuntu-pro.md               # ESM, Livepatch
│   │
│   ├── recovery/
│   │   ├── README.md                   # Recovery system overview
│   │   ├── fastos.md                   # FastOS installer
│   │   ├── usb-image-format.md         # USB recovery image structure
│   │   ├── partition-scheme.md         # GPT layout
│   │   ├── installer-flow.md           # Installation sequence
│   │   └── factory-reset.md            # Reset procedure
│   │
│   └── internals/
│       ├── README.md                   # Internal NVIDIA details
│       ├── codenames.md                # Marlin, Wagyu, Jade, Yukon
│       ├── gitlab-paths.md             # Internal GitLab structure
│       ├── artifactory.md              # urm.nvidia.com paths
│       ├── confluence.md               # Internal wiki references
│       ├── build-system.md             # How FastOS is built
│       └── version-strings.md          # All version identifiers
│
├── blackwell/                          # Future: B200/B300/GB200
│   └── README.md                       # Placeholder
│
├── common/
│   ├── README.md                       # Cross-platform concepts
│   ├── capsule-format.md               # UEFI capsule spec
│   ├── nvlink.md                       # NVLink protocol
│   └── gpu-architectures.md            # Hopper to Blackwell evolution
│
└── glossary/
    ├── README.md                       # Terminology guide
    ├── codenames.md                    # Product to codename mappings
    ├── gpu-arch.md                     # 1CTA/2CTA, sm_xxx mappings
    ├── internal-urls.md                # NVIDIA internal infrastructure
    └── package-prefixes.md             # dgx-*, nvidia-*, nv-* naming
```

## TODO

- [x] Create directory structure
- [x] docs/spark/README.md - Platform overview
- [x] docs/spark/hardware/\* - Hardware documentation
- [x] docs/spark/firmware/\* - Firmware documentation
- [x] docs/spark/boot/\* - Boot chain documentation
- [x] docs/spark/kernel/\* - Kernel documentation
- [x] docs/spark/os/\* - OS layer documentation
- [x] docs/spark/nvidia-stack/\* - NVIDIA stack documentation
- [x] docs/spark/services/oobe/\* - OOBE documentation
- [ ] docs/spark/services/dashboard/\* - Dashboard documentation
- [x] docs/spark/services/telemetry/\* - Telemetry documentation
- [ ] docs/spark/services/system/\* - System services documentation
- [ ] docs/spark/networking/\* - Network documentation
- [ ] docs/spark/security/\* - Security documentation
- [ ] docs/spark/recovery/\* - Recovery documentation
- [ ] docs/spark/internals/\* - Internal NVIDIA documentation
- [x] docs/glossary/\* - Terminology glossary
- [ ] docs/common/\* - Cross-platform documentation

## Data Sources

- FastOS 1.105.17 recovery image (`dgx-spark-recovery-image-1.105.17.tar.gz`)
- Extracted partition mounted at `/mnt/fastos`
- artifact.json software BOM (2,298 packages)
- Firmware capsules (socfw.cap, ec\_\*.cap, tpm.cap, usbpd.cap)
- Kernel 6.14.0-1013-nvidia
- NVIDIA driver 580.95.05 (open)
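
To reproduce the firmware capsule inventory from the Data Sources list, a minimal sketch like the following may help. It walks a mounted tree and collects files whose names match the capsule patterns listed above. The function name `find_capsules` and the constant `CAPSULE_PATTERNS` are hypothetical helpers introduced for illustration, and where the capsules actually live under `/mnt/fastos` is an assumption, so the whole tree is searched.

```python
import fnmatch
import os

# Capsule file name patterns taken from the Data Sources list.
CAPSULE_PATTERNS = ("socfw.cap", "ec_*.cap", "tpm.cap", "usbpd.cap")


def find_capsules(root):
    """Return sorted paths, relative to root, of files matching CAPSULE_PATTERNS."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # fnmatchcase keeps matching case-sensitive on every platform.
            if any(fnmatch.fnmatchcase(name, pat) for pat in CAPSULE_PATTERNS):
                full = os.path.join(dirpath, name)
                hits.append(os.path.relpath(full, root))
    return sorted(hits)


if __name__ == "__main__":
    # Mount point from the Data Sources list; adjust if mounted elsewhere.
    for path in find_capsules("/mnt/fastos"):
        print(path)
```

Matching on file names only keeps the sketch independent of the image layout; it does not validate that a match is a well-formed UEFI capsule.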