2026-01-22-ci-investigation-summary.cspec
1 # CI Investigation Summary - 2026-01-22 2 3 **Date**: 2026-01-22 20:40 UTC 4 **Status**: 🔍 ROOT CAUSE IDENTIFIED, CI STILL FAILING 5 **Session**: Continuation from PoW removal + /tmp infrastructure fix 6 7 ## Root Cause Identified 8 9 ### The Problem 10 **`/tmp` filesystem has issues even with low entry counts** 11 12 Symptoms: 13 ``` 14 error: couldn't create a temp dir: No such file or directory (os error 2) 15 at path "/tmp/rustcVAIpSt" 16 ``` 17 18 ### The Solution 19 **Use `TMPDIR=/var/tmp` instead of `/tmp`** 20 21 Local Verification (ALL PASSING): 22 ```bash 23 cd alphavm 24 TMPDIR=/var/tmp cargo check --workspace # ✅ PASS (13.88s) 25 TMPDIR=/var/tmp cargo build --release # ✅ PASS (45.84s) 26 TMPDIR=/var/tmp cargo test --workspace --lib # ✅ PASS (1046 tests, 50.46s) 27 ``` 28 29 ## Configuration Status 30 31 ### CI Workflow (`alphavm/.forgejo/workflows/ci.yml`) 32 ```yaml 33 env: 34 TMPDIR: /var/tmp # ✅ CONFIGURED 35 TEMP: /var/tmp # ✅ CONFIGURED 36 TMP: /var/tmp # ✅ CONFIGURED 37 ``` 38 39 ### Justfile (`alphavm/justfile`) 40 ```just 41 export TMPDIR := "/var/tmp" # ✅ CONFIGURED 42 export TEMP := "/var/tmp" # ✅ CONFIGURED 43 export TMP := "/var/tmp" # ✅ CONFIGURED 44 ``` 45 46 ### System Environment 47 ```bash 48 SCCACHE_DIR=/opt/ci/sccache # Compiler cache directory 49 SCCACHE_CACHE_SIZE=40G # Cache size limit 50 RUSTC_WRAPPER=sccache # Using sccache for builds 51 ``` 52 53 ## /tmp Cleanup History 54 55 | Time | Action | Before | After | Change | 56 |------|--------|--------|-------|--------| 57 | 18:30 | Initial cleanup | 1,491 | 803 | -688 (-46%) | 58 | 19:58 | Aggressive cleanup (>30min) | 837 | 653 | -184 | 59 | 20:03 | Very aggressive cleanup (>2hr) | 653 | 5 | -648 (-99%) | 60 61 Current: **5-6 entries** (extremely clean) 62 63 ## CI Runs Status 64 65 ### AlphaVM 66 | Run | Status | Title | 67 |-----|--------|-------| 68 | 2057 | ❌ FAILED | Debug logging added | 69 | 2056 | 🔄 RUNNING | Debug logging added | 70 | 2052 | ❌ FAILED | TMPDIR=/var/tmp confirmed | 71 | 2051 | 🔄 RUNNING | TMPDIR=/var/tmp confirmed | 72 | 2045 | ❌ FAILED | Formatting fixes | 73 74 ### DeltaVM 75 | Run | Status | Title | 76 |-----|--------|-------| 77 | 2054 | ❌ FAILED | TMPDIR=/var/tmp confirmed | 78 | 2053 | ❌ FAILED | TMPDIR=/var/tmp confirmed | 79 | 2043 | ❌ FAILED | Verify PoW removal | 80 81 ### SDK 82 | Run | Status | Title | 83 |-----|--------|-------| 84 | 2055 | 🔄 RUNNING | /tmp fix | 85 | 2041 | ❌ FAILED | /tmp fix | 86 87 **Pattern**: All CI runs fail quickly (< 1 minute), suggesting early failure 88 89 ## Mystery: Why CI Still Fails 90 91 ### Configuration ✅ CORRECT 92 - Workflow: TMPDIR=/var/tmp ✅ 93 - Justfile: TMPDIR=/var/tmp ✅ 94 - Local builds: ALL PASSING ✅ 95 - /tmp: CLEAN (5 entries) ✅ 96 97 ### Possible Causes (Unconfirmed) 98 99 1. **Checkout Failure** 100 - Maybe failing during git clone of acdc-core? 101 - Credentials issue with CI_READ secret? 102 103 2. **Runner Configuration** 104 - 6 runners all labeled "native" 105 - Maybe multiple runners picking same job? 106 - Maybe runners have conflicting environments? 107 108 3. **sccache Issues** 109 - sccache might still try /tmp despite TMPDIR 110 - Compiler cache conflicts between runners? 111 112 4. **Hidden Environment Override** 113 - Something in runner service config? 114 - systemd unit file overriding TMPDIR? 115 116 5. **Other Workflow Files** 117 - dead-code.yml (dead code analysis) 118 - nightly.yml (nightly tests) 119 - Maybe triggering simultaneously? 120 121 ## Attempts Made 122 123 ### 1. Initial /tmp Cleanup ✅ 124 - Removed 688 old entries 125 - Result: Still failing 126 127 ### 2. CI Repair Agents ⚠️ 128 - alphavm agent (a88cab5): Applied formatting fixes 129 - deltavm agent (a9ac6d4): Verified code correctness 130 - Result: Fixes applied but CI still failing 131 132 ### 3. Aggressive /tmp Cleanup ✅ 133 - Reduced to 5 entries (99% reduction) 134 - Result: Local builds work, CI still failing 135 136 ### 4. Debug Logging Added 📊 137 - Added environment debug step to CI workflow 138 - Prints TMPDIR, TEMP, TMP variables 139 - Counts /tmp and /var/tmp entries 140 - Result: Cannot see output (MCP tools don't show step logs) 141 142 ## Blocking Issues 143 144 ### Cannot See Actual CI Logs ❌ 145 146 **MCP Tool Limitations**: 147 - `ci_status`: Shows run status and job names only 148 - `ci_logs`: Shows job names (PASSED/FAILED) only 149 - NO ACCESS to actual command output or error messages 150 151 **What We Need**: 152 - Actual step-by-step execution logs 153 - Command output from failing steps 154 - Environment variables at runtime 155 - Error messages from cargo/rustc 156 157 **Where Logs Might Be**: 158 - Forgejo web UI: `https://source.ac-dc.network/alpha-delta-network/alphavm/actions/runs/2057` 159 - Runner systemd logs: `journalctl -u forgejo-runner-1` (no detailed output) 160 - Runner workspace: `/var/lib/forgejo-runner/workdir` (empty) 161 - API endpoint: Requires authentication 162 163 ## Files Created/Modified 164 165 ### Created 166 1. `sessions/2026-01-22-tmp-infrastructure-fix.cspec` - Initial /tmp fix documentation 167 2. `sessions/2026-01-22-ci-debugging-status.cspec` - Debugging investigation 168 3. `fresh-restart-with-logging.sh` - Local CI reproduction script 169 4. `components/_plans/tmp-cleanup-maintenance.sh` - Automated /tmp cleanup 170 171 ### Modified 172 1. `alphavm/.forgejo/workflows/ci.yml` - Added debug logging step 173 2. `alphavm/justfile` - Already had TMPDIR=/var/tmp 174 3. Multiple commits for retrying CI 175 176 ## Commits Since Last Success 177 178 ### AlphaVM (7 commits) 179 ``` 180 cef98f520 ci: add debug logging to diagnose environment issues 181 4e6b9c116 ci: retrigger after confirming TMPDIR=/var/tmp fix 182 3928a5409 fix(ci): resolve unused import warning and apply formatting 183 e5411dc91 ci: retrigger after /tmp infrastructure fix 184 cf03f6498 fix(ci): apply formatting and fix warnings after PoW removal 185 4f77f6512 fix(ci): repair CI workflows and fix TMPDIR issues 186 c865ab1db feat: remove PoW/coinbase mining system 187 ``` 188 189 ### DeltaVM (2 commits) 190 ``` 191 0f102c1 ci: retrigger after confirming TMPDIR=/var/tmp fix 192 0bb35ab ci: verify CI workflow after PoW removal changes 193 ``` 194 195 ### SDK (1 commit) 196 ``` 197 4f128348 ci: retrigger after /tmp infrastructure fix 198 ``` 199 200 ## Next Steps (Requires User Decision) 201 202 ### Option A: Access Forgejo Web UI 203 **Recommended** - Get actual CI logs 204 - Navigate to: https://source.ac-dc.network 205 - View run logs directly 206 - See actual error messages 207 208 ### Option B: SSH Into Runner During Job 209 - Trigger a long-running CI job 210 - SSH to ci.ac-dc.network during execution 211 - Watch `/var/lib/forgejo-runner/*/work/` in real-time 212 - See actual commands and output 213 214 ### Option C: Simplify CI Workflow 215 - Create minimal test workflow: 216 ```yaml 217 - run: env | sort 218 - run: ls -la /tmp /var/tmp 219 - run: echo "test" > /tmp/test-write 220 - run: rustc --version 221 - run: cargo version 222 ``` 223 - Verify environment is correct 224 - Add steps one by one until failure 225 226 ### Option D: Check Runner Configuration 227 - Review systemd unit files for runners 228 - Check `/etc/forgejo-runner/` config 229 - Verify runner registration and labels 230 - Ensure no environment overrides 231 232 ### Option E: Disable Parallelism 233 - Temporarily run only 1 runner 234 - See if issue is related to concurrent builds 235 - Eliminate race condition possibility 236 237 ## Summary 238 239 **What We Know**: 240 - ✅ Root cause: /tmp filesystem issues with rustc 241 - ✅ Solution: TMPDIR=/var/tmp works locally 242 - ✅ Configuration: CI workflows correctly set TMPDIR=/var/tmp 243 - ✅ Code: All PoW removal changes correct 244 - ✅ Tests: 1046/1046 passing locally 245 246 **What We Don't Know**: 247 - ❌ Why CI still fails despite correct configuration 248 - ❌ What the actual CI error message is 249 - ❌ Where the failure happens (checkout? build? test?) 250 - ❌ If environment variables are actually set in runners 251 - ❌ If there's a runner configuration issue 252 253 **Blocker**: Cannot proceed without actual CI log access 254 255 --- 256 257 **Recommendation**: User should access Forgejo web UI to view actual CI logs for run 2057, which will show the debug environment output and actual failure message.