TESTNET-ISSUES-2026-01-22.md
# Testnet Infrastructure Issues - 2026-01-22

## Summary

During Section 12b (Governance Testing) implementation, we encountered a critical persistent state issue that prevents testnet validators from producing blocks after fresh deployments. This document catalogs all discovered issues and proposed fixes.

## Issue 1: Persistent Proposal Cache Survives Cleanup

### Symptom
```
Cannot propose a batch for round 1 - the latest proposal cache round is 5236
```

### Root Cause
The BFT primary loads a **proposal cache file** on startup that persists across restarts:

**File Location**: `/root/.current-proposal-cache-{network}-{dev_id}`
**Example**: `/root/.current-proposal-cache-1-0` for testnet001 (dev 0)

**Code Reference**: `alphaos/node/bft/src/helpers/proposal_cache.rs:32-48`
```rust
pub fn proposal_cache_path(network: u16, storage_mode: &StorageMode) -> PathBuf {
    const PROPOSAL_CACHE_FILE_NAME: &str = "current-proposal-cache";
    let mut path = alpha_ledger_dir(network, storage_mode);
    if !matches!(storage_mode, StorageMode::Custom(..)) {
        path.pop();
    }
    match storage_mode.dev() {
        Some(id) => path.push(format!(".{PROPOSAL_CACHE_FILE_NAME}-{network}-{id}")),
        None => path.push(format!("{PROPOSAL_CACHE_FILE_NAME}-{network}")),
    }
    path
}
```

**Load Behavior**: `alphaos/node/bft/src/primary.rs:173-188`
```rust
async fn load_proposal_cache(&self) -> Result<()> {
    match ProposalCache::<N>::exists(&self.storage_mode) {
        true => match ProposalCache::<N>::load(self.gateway.account().address(), &self.storage_mode) {
            Ok(proposal_cache) => {
                let (latest_certificate_round, proposed_batch, signed_proposals, pending_certificates) =
                    proposal_cache.into();
                // CRITICAL LINE: Sets propose_lock to cached round
                *self.propose_lock.lock().await = latest_certificate_round;
                // ...
            }
        }
    }
}
```

**Impact**: Validators cannot propose new batches because the current round (1) is lower than the cached round (5236).

### Files Involved
1. **Proposal Cache**: `/root/.current-proposal-cache-1-{0..4}` (one per validator)
2. **RocksDB Storage**: `/root/.ledger-1-{0..4}/` (BFT storage, contains certificates)
3. **Blockchain Ledgers**: `/root/.alpha/` and `/root/.delta/` (block data)

### Why Standard Cleanup Fails
1. **Hidden files**: Proposal cache files start with `.` (easy to miss)
2. **Recreated from RocksDB**: If only the cache file is deleted, it is regenerated from RocksDB on the next start
3. **RocksDB persistence**: SST files can survive a basic `rm -rf` if the validator process still has file handles open

### Successful Cleanup Procedure
```bash
# 1. Stop validators
sudo systemctl stop alphaos-validator

# 2. Wait for file locks to release
sleep 15

# 3. Delete ALL state (in order of importance)
sudo rm -f /root/.current-proposal-cache-*  # Proposal cache (CRITICAL)
sudo rm -rf /root/.ledger-*                 # BFT storage
sudo rm -rf /root/.alpha /root/.delta       # Blockchain ledgers

# 4. Verify deletion
sudo ls -la /root/ | grep -E 'proposal|ledger|alpha|delta'
# Should return empty or "No such file"

# 5. Start validator
sudo systemctl start alphaos-validator
```

## Issue 2: RocksDB Proposal Cache Source Unknown

### Symptom
Even after deleting all files and restarting fresh, validators STILL load round 5236 from somewhere.

### Investigation
- Deleted `/root/.current-proposal-cache-*` ✓
- Deleted `/root/.ledger-*` ✓
- Deleted `/root/.alpha` and `/root/.delta` ✓
- **Result**: Cache file recreated with round 5236 on next start

### Hypothesis
The round 5236 state originates from one of:
1. **Hardcoded in binary**: Dev mode deterministically generates this state
2. **Network sync**: Validators fetch historical state from peers on startup
3. **Genesis file**: Embedded genesis contains blocks/certificates up to round 5236
4. **External storage**: Some other storage location we haven't found

### Evidence
After comprehensive cleanup (all files verified deleted), validator logs still show:
```
Loaded the proposal cache from /root/.current-proposal-cache-1-0 at round 5236
```

This means the file was RECREATED with the same round 5236 value, suggesting deterministic regeneration from either genesis or dev mode parameters.

### Recommended Investigation
1. Check dev mode genesis generation code for hardcoded proposals
2. Verify what happens when `--dev 0 --dev-num-validators 5` initializes the ledger
3. Test with `--genesis-file` pointing to a fresh genesis (bypassing dev mode)

## Issue 3: No Fresh Start Without Full Rebuild

### Symptom
Cannot test new features (like governance) on testnet because validators are stuck in old state.

### Impact
- Section 12b governance testing blocked indefinitely
- Cannot verify proposal activation at height 100
- Cannot test vote-based execution flow

### Workaround Options
1. **Deploy fresh testnet**: New servers with clean state
2. **Use unit tests**: Verify governance logic without live testnet
3. **Fix proposal cache loading**: Add `--ignore-proposal-cache` flag
4. **Fix dev mode**: Ensure dev mode always starts from height 1, round 1

## Proposed Fixes

### Fix 1: Add `--fresh-start` Flag
**File**: `alphaos/cli/src/commands/start.rs`
```rust
#[clap(long = "fresh-start")]
pub fresh_start: bool,

// In parse_node():
if self.fresh_start {
    // Delete proposal cache before starting
    let cache_path = proposal_cache_path(N::ID, &storage_mode);
    if cache_path.exists() {
        std::fs::remove_file(&cache_path)?;
        info!("Deleted proposal cache for fresh start");
    }
}
```

### Fix 2: Skip Proposal Cache in Dev Mode
**File**: `alphaos/node/bft/src/primary.rs`
```rust
async fn load_proposal_cache(&self) -> Result<()> {
    // Skip cache loading in dev mode
    if self.storage_mode.dev().is_some() {
        info!("Skipping proposal cache in dev mode");
        return Ok(());
    }
    // ... existing cache loading logic
}
```

### Fix 3: Proposal Cache Validation
**File**: `alphaos/node/bft/src/helpers/proposal_cache.rs`
```rust
pub fn load(...) -> Result<Self> {
    // ... load cache file

    // Validate cache is not stale.
    // NOTE: `ledger` is not a parameter of the current `load` call shown in Issue 1
    // and would need to be passed in. Rounds and block heights are different
    // counters, so the +100 margin is only a rough heuristic.
    let current_height = ledger.latest_height();
    if proposal_cache.latest_round > current_height + 100 {
        bail!("Proposal cache round {} is far ahead of ledger height {} - ignoring stale cache",
            proposal_cache.latest_round, current_height);
    }

    Ok(proposal_cache)
}
```

### Fix 4: Cleanup Script Integration
Add `alphaos clean` subcommand:
```rust
// alphaos/cli/src/commands/clean.rs
use std::fs;

#[derive(Debug, clap::Parser)]
pub struct Clean {
    #[clap(long)]
    pub network: u16,

    #[clap(long)]
    pub confirm: bool,
}

impl Clean {
    pub fn run(&self) -> Result<()> {
        if !self.confirm {
            bail!("Add --confirm to proceed with cleanup");
        }

        // NOTE: assuming `StorageMode::default()` is the non-dev mode, this does not
        // touch the per-validator dev paths listed in Issue 1.
        let cache_path = proposal_cache_path(self.network, &StorageMode::default());
        let ledger_path = alpha_ledger_dir(self.network, &StorageMode::default());

        // Delete files
        if cache_path.exists() { fs::remove_file(&cache_path)?; }
        if ledger_path.exists() { fs::remove_dir_all(&ledger_path)?; }

        info!("Cleanup complete");
        Ok(())
    }
}
```

## Issue 4: Dev Mode Genesis Determinism Unclear

### Symptom
Using `--dev 0 --dev-num-validators 5` creates a deterministic genesis, but the initial state is unclear.

### Questions
1. Does dev mode genesis include pre-generated blocks?
2. Does dev mode genesis include pre-generated certificates up to round 5236?
3. Why does dev mode create state with round 5236 specifically?

### Recommended Documentation
Add documentation explaining exactly what dev mode initializes:
- Initial block height
- Initial round number
- Pre-generated certificates (if any)
- Expected proposal cache state

## Issue 5: Testnet Deployment Lacks State Reset Procedure

### Symptom
No documented procedure to reset testnet to clean state without rebuilding servers.

### Impact
- Testing new features requires infrastructure changes
- CI/CD pipelines cannot reset the test environment
- Manual SSH operations are required for cleanup

### Recommended Solution
Create an official state reset procedure in `docs/operations/testnet-reset.md`:
1. Automated script (already created: `testnet-cleanup-script.sh`)
2. Verification steps
3. Expected behavior after reset
4. Troubleshooting guide

## Codebase Issues Summary

| Issue | File | Severity | Fix Priority |
|-------|------|----------|--------------|
| Proposal cache loads stale round | `node/bft/src/primary.rs:173-188` | HIGH | 1 |
| No --fresh-start flag | `cli/src/commands/start.rs` | MEDIUM | 2 |
| Dev mode loads cache | `node/bft/src/primary.rs:173` | MEDIUM | 2 |
| No cache validation | `node/bft/src/helpers/proposal_cache.rs` | LOW | 3 |
| No alphaos clean command | Missing file | LOW | 4 |

## Testing Recommendations

### Unit Test Coverage Needed
1. Test proposal cache loading with stale round
2. Test dev mode with --fresh-start flag
3. Test cache validation with ledger height mismatch

### Integration Test Scenarios
1. Start 5 validators fresh → verify round 1
2. Restart validators → verify round preservation
3. Delete cache + restart → verify fresh start
4. Mix of fresh + restarted validators → verify sync

### Testnet Procedures
1. Document state reset procedure
2. Automate cleanup with script
3. Add verification checks
4. Create troubleshooting runbook

## Conclusion

The persistent proposal cache issue is the root cause of the testnet being stuck. The code implementation for Section 12b (Rust-native governance) is correct and complete, but cannot be tested live due to this infrastructure issue.

**Immediate Actions**:
1. ✅ Document issues (this file)
2. ✅ Create cleanup script (`testnet-cleanup-script.sh`)
3. ⏭️ Implement `--fresh-start` flag
4. ⏭️ Skip proposal cache in dev mode
5. ⏭️ Deploy fresh testnet OR fix cache loading

**Status**: Section 12b implementation is **CODE COMPLETE** but **TESTING BLOCKED** by infrastructure.
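
For the verification checks listed under Testnet Procedures, the naming rule quoted in Issue 1 and the paths under "Files Involved" can be combined into a small read-only check. The following is a minimal sketch, not the node's actual code: `Mode`, `cache_file_name`, and `cleanup_targets` are hypothetical names introduced here, and the `/root` base directory, network `1`, and five validators are taken from the examples in this document. It only reports which state paths still exist and deletes nothing.

```rust
use std::path::PathBuf;

/// Hypothetical, simplified stand-in for the node's `StorageMode`;
/// only the dev and non-dev cases discussed in this document are modeled.
#[derive(Debug, Clone, Copy)]
enum Mode {
    /// Non-dev storage (no dev id).
    Production,
    /// Dev-mode storage for validator `id`.
    Development(u16),
}

/// Mirrors the naming rule quoted in Issue 1: dev-mode cache files are
/// hidden (leading dot) and carry the dev id; otherwise neither applies.
fn cache_file_name(network: u16, mode: Mode) -> String {
    const NAME: &str = "current-proposal-cache";
    match mode {
        Mode::Development(id) => format!(".{NAME}-{network}-{id}"),
        Mode::Production => format!("{NAME}-{network}"),
    }
}

/// Lists the state paths named under "Files Involved" for `validators`
/// dev validators. `base` is an assumption (this document's hosts use `/root`).
fn cleanup_targets(base: &str, network: u16, validators: u16) -> Vec<PathBuf> {
    let base = PathBuf::from(base);
    let mut targets = Vec::new();
    for id in 0..validators {
        // Per-validator proposal cache and BFT storage directory.
        targets.push(base.join(cache_file_name(network, Mode::Development(id))));
        targets.push(base.join(format!(".ledger-{network}-{id}")));
    }
    // Shared blockchain ledger directories.
    targets.push(base.join(".alpha"));
    targets.push(base.join(".delta"));
    targets
}

fn main() {
    println!("non-dev cache name: {}", cache_file_name(1, Mode::Production));
    for path in cleanup_targets("/root", 1, 5) {
        // Report only; deletion is left to the operator.
        let state = if path.exists() { "PRESENT" } else { "absent" };
        println!("{state:8} {}", path.display());
    }
}
```

Run after step 3 of the cleanup procedure, any `PRESENT` line points at state that survived deletion. Keeping the check read-only is deliberate, since the recreation behavior described in Issue 2 is still unexplained.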