# DGX Forge Documentation

Comprehensive reverse-engineered documentation for NVIDIA DGX systems.

## Status

This documentation is generated through forensic analysis of factory recovery
images. It is not official NVIDIA documentation.

## Platforms

| Platform | Status | Source |
|----------|--------|--------|
| [DGX Spark](./spark/) | Active | FastOS 1.105.17 recovery image |
| [Blackwell](./blackwell/) | Planned | - |

## Documentation Tree

```
docs/
├── spark/
│   ├── README.md                           # Platform overview, quick reference
│   │
│   ├── hardware/
│   │   ├── README.md                       # Hardware architecture overview
│   │   ├── soc-th500.md                    # TH500 Thor SoC details
│   │   ├── gpu-gb10.md                     # GB10 Blackwell GPU (1CTA/sm_120)
│   │   ├── memory-architecture.md          # LPDDR5X, unified memory, NVLink-C2C
│   │   ├── ec-mec-n1x.md                   # Microchip MEC N1X embedded controller
│   │   ├── tpm-nuvoton.md                  # TPM 2.0 implementation
│   │   ├── usb-pd-ccg8.md                  # Infineon CCG8DF USB-PD controller
│   │   ├── networking.md                   # ConnectX-7, WiFi 6E, Bluetooth 5.3
│   │   ├── storage.md                      # NVMe, eMMC boot
│   │   └── device-tree.md                  # DTB analysis, hardware topology
│   │
│   ├── firmware/
│   │   ├── README.md                       # Firmware chain overview
│   │   ├── boot-sequence.md                # Power-on to Linux timeline
│   │   ├── socfw-cap.md                    # SoC firmware capsule format
│   │   ├── ec-firmware.md                  # EC firmware (Zephyr RTOS)
│   │   ├── ec-fused-vs-nofuse.md           # Fuse state variants
│   │   ├── tpm-firmware.md                 # TPM capsule analysis
│   │   ├── usbpd-firmware.md               # USB-PD controller firmware
│   │   ├── mlnx-connectx7.md               # Mellanox NIC firmware
│   │   ├── capsule-format.md               # UEFI capsule structure, signatures
│   │   ├── signing-chain.md                # X.509 certificate hierarchy
│   │   └── update-mechanism.md             # fwupd integration, rollback
│   │
│   ├── boot/
│   │   ├── README.md                       # Boot architecture overview
│   │   ├── uefi.md                         # UEFI firmware, variables
│   │   ├── secureboot.md                   # Secure Boot chain, MOK
│   │   ├── shim.md                         # SHIM 15.8 analysis
│   │   ├── grub.md                         # GRUB config, kernel params
│   │   ├── initrd.md                       # initramfs contents, hooks
│   │   ├── kernel-cmdline.md               # Boot parameters
│   │   └── recovery-mode.md                # FastOS recovery boot
│   │
│   ├── kernel/
│   │   ├── README.md                       # Kernel overview
│   │   ├── config.md                       # Kernel config analysis
│   │   ├── tegra241-cmdqv.md               # Tegra CMDQV driver
│   │   ├── nvidia-modules.md               # nvidia.ko, nvidia-uvm.ko, etc
│   │   ├── driver-580.md                   # Open driver 580.95.05
│   │   ├── device-drivers.md               # Platform-specific drivers
│   │   └── patches.md                      # Ubuntu/NVIDIA kernel patches
│   │
│   ├── os/
│   │   ├── README.md                       # OS layer overview
│   │   ├── ubuntu-base.md                  # Ubuntu 24.04.3 Noble
│   │   ├── package-manifest.md             # All 2,298 packages
│   │   ├── apt-repositories.md             # DGX repo, NVIDIA repo, Ubuntu Pro
│   │   ├── systemd-services.md             # Service dependency graph
│   │   ├── users-groups.md                 # System accounts, permissions
│   │   ├── filesystem-layout.md            # Partition scheme, mount points
│   │   └── update-policy.md                # APT pinning, unattended upgrades
│   │
│   ├── nvidia-stack/
│   │   ├── README.md                       # NVIDIA software stack overview
│   │   ├── cuda-13.md                      # CUDA 13.0 toolkit
│   │   ├── driver-architecture.md          # Open vs proprietary, GSP
│   │   ├── nvml.md                         # NVIDIA Management Library
│   │   ├── persistenced.md                 # nvidia-persistenced
│   │   ├── container-toolkit.md            # nvidia-container-toolkit, CDI
│   │   ├── fabric-manager.md               # NVLink/NVSwitch (if applicable)
│   │   └── dcgm.md                         # Data Center GPU Manager
│   │
│   ├── services/
│   │   ├── README.md                       # DGX services overview
│   │   │
│   │   ├── oobe/
│   │   │   ├── README.md                   # OOBE system overview
│   │   │   ├── architecture.md             # Component diagram
│   │   │   ├── state-machine.md            # OOBE flow states
│   │   │   ├── oobe-service.md             # Main Go service (port 80)
│   │   │   ├── oobe-admin.md               # Admin service (D-Bus)
│   │   │   ├── oobe-desktop.md             # Electron kiosk app
│   │   │   ├── hotspot.md                  # WiFi AP setup
│   │   │   ├── hostname-generation.md      # Hostname algorithm
│   │   │   ├── bluetooth-pairing.md        # HID device pairing
│   │   │   ├── network-config.md           # Network provisioning
│   │   │   ├── user-creation.md            # Initial user setup
│   │   │   ├── eula-flow.md                # EULA acceptance
│   │   │   ├── first-run.md                # Post-OOBE first login
│   │   │   └── dbus-interface.md           # com.nvidia.dgx.oobe.admin1
│   │   │
│   │   ├── dashboard/
│   │   │   ├── README.md                   # Dashboard overview
│   │   │   ├── architecture.md             # Go backend + React frontend
│   │   │   ├── api-reference.md            # REST API endpoints
│   │   │   ├── authentication.md           # Login, sessions
│   │   │   ├── gpu-telemetry.md            # GPU metrics API
│   │   │   ├── system-info.md              # System info API
│   │   │   ├── update-management.md        # Software update API
│   │   │   ├── jupyterlab.md               # JupyterLab integration
│   │   │   └── frontend.md                 # React app structure
│   │   │
│   │   ├── telemetry/
│   │   │   ├── README.md                   # Telemetry system overview
│   │   │   ├── consent-model.md            # User/device consent
│   │   │   ├── sol-service.md              # Sign of Life service
│   │   │   ├── endpoints.md                # Phone-home URLs
│   │   │   ├── data-collected.md           # What gets transmitted
│   │   │   ├── localized-config.md         # Config API protocol
│   │   │   └── opt-out.md                  # How to disable
│   │   │
│   │   └── system/
│   │       ├── README.md                   # System services
│   │       ├── nv-scripts.md               # /usr/local/sbin/nv_scripts/
│   │       ├── first-boot.md               # First boot services
│   │       ├── power-management.md         # Suspend/hibernate
│   │       └── platform-detection.md       # get_platform_info.bash
│   │
│   ├── networking/
│   │   ├── README.md                       # Network architecture
│   │   ├── network-manager.md              # NetworkManager config
│   │   ├── avahi-mdns.md                   # mDNS/Bonjour
│   │   ├── wifi.md                         # WiFi 6E configuration
│   │   ├── bluetooth.md                    # Bluetooth stack
│   │   └── connectx7.md                    # Mellanox NIC config
│   │
│   ├── security/
│   │   ├── README.md                       # Security model
│   │   ├── secure-boot-chain.md            # Full chain analysis
│   │   ├── certificates.md                 # X.509 certs, NVIDIA CA
│   │   ├── tpm-integration.md              # Measured boot
│   │   ├── apparmor.md                     # AppArmor profiles
│   │   ├── firewall.md                     # nftables rules
│   │   └── ubuntu-pro.md                   # ESM, Livepatch
│   │
│   ├── recovery/
│   │   ├── README.md                       # Recovery system overview
│   │   ├── fastos.md                       # FastOS installer
│   │   ├── usb-image-format.md             # USB recovery image structure
│   │   ├── partition-scheme.md             # GPT layout
│   │   ├── installer-flow.md               # Installation sequence
│   │   └── factory-reset.md                # Reset procedure
│   │
│   └── internals/
│       ├── README.md                       # Internal NVIDIA details
│       ├── codenames.md                    # Marlin, Wagyu, Jade, Yukon
│       ├── gitlab-paths.md                 # Internal GitLab structure
│       ├── artifactory.md                  # urm.nvidia.com paths
│       ├── confluence.md                   # Internal wiki references
│       ├── build-system.md                 # How FastOS is built
│       └── version-strings.md              # All version identifiers
│
├── blackwell/                              # Future: B200/B300/GB200
│   └── README.md                           # Placeholder
│
├── common/
│   ├── README.md                           # Cross-platform concepts
│   ├── capsule-format.md                   # UEFI capsule spec
│   ├── nvlink.md                           # NVLink protocol
│   └── gpu-architectures.md                # Hopper to Blackwell evolution
│
└── glossary/
    ├── README.md                           # Terminology guide
    ├── codenames.md                        # Product to codename mappings
    ├── gpu-arch.md                         # 1CTA/2CTA, sm_xxx mappings
    ├── internal-urls.md                    # NVIDIA internal infrastructure
    └── package-prefixes.md                 # dgx-*, nvidia-*, nv-* naming
```
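
The tree can also be treated as data, for example to check which planned files exist yet. The following is a minimal sketch, assuming each nesting level is indented by exactly four characters and comments start with `#`, as in the listing above. `tree_to_paths` is a hypothetical helper and not part of this repository.

```python
import re

# Optional indentation groups ("│   " or four spaces), then a branch
# connector ("├── " or "└── "), then the entry name up to the first space.
ENTRY = re.compile(r"^((?:[│ ]   )*)[├└]── (\S+)")


def tree_to_paths(tree_text):
    """Convert a box-drawing directory tree into a list of file paths."""
    paths = []
    stack = []  # directory names from the root down to the current level
    for line in tree_text.splitlines():
        match = ENTRY.match(line)
        if not match:
            stripped = line.strip()
            # The first bare line (e.g. "docs/") is taken as the root;
            # spacer lines made only of "│" and blank lines are skipped.
            if stripped and not stripped.startswith("│") and not stack:
                stack = [stripped.rstrip("/")]
            continue
        depth = len(match.group(1)) // 4 + 1  # root sits at depth 0
        name = match.group(2)
        del stack[depth:]  # drop directories deeper than this entry
        if name.endswith("/"):
            stack.append(name.rstrip("/"))
        else:
            paths.append("/".join(stack + [name]))
    return paths
```

Feeding each returned path to `pathlib.Path(p).exists()` would give a rough per-file progress report to cross-check against the TODO list.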

## TODO

- [x] Create directory structure
- [x] docs/spark/README.md - Platform overview
- [x] docs/spark/hardware/\* - Hardware documentation
- [x] docs/spark/firmware/\* - Firmware documentation
- [x] docs/spark/boot/\* - Boot chain documentation
- [x] docs/spark/kernel/\* - Kernel documentation
- [x] docs/spark/os/\* - OS layer documentation
- [x] docs/spark/nvidia-stack/\* - NVIDIA stack documentation
- [x] docs/spark/services/oobe/\* - OOBE documentation
- [ ] docs/spark/services/dashboard/\* - Dashboard documentation
- [x] docs/spark/services/telemetry/\* - Telemetry documentation
- [ ] docs/spark/services/system/\* - System services documentation
- [ ] docs/spark/networking/\* - Network documentation
- [ ] docs/spark/security/\* - Security documentation
- [ ] docs/spark/recovery/\* - Recovery documentation
- [ ] docs/spark/internals/\* - Internal NVIDIA documentation
- [x] docs/glossary/\* - Terminology glossary
- [ ] docs/common/\* - Cross-platform documentation

206  ## Data Sources
207  
208  - FastOS 1.105.17 recovery image (`dgx-spark-recovery-image-1.105.17.tar.gz`)
209  - Extracted partition mounted at `/mnt/fastos`
210  - artifact.json software BOM (2,298 packages)
211  - Firmware capsules (socfw.cap, ec\_\*.cap, tpm.cap, usbpd.cap)
212  - Kernel 6.14.0-1013-nvidia
213  - NVIDIA driver 580.95.05 (open)