# Testnet Infrastructure Issues - 2026-01-22

## Summary

During Section 12b (Governance Testing) implementation, we encountered a critical persistent-state issue that prevents testnet validators from producing blocks after fresh deployments. This document catalogs the issues discovered and the proposed fixes.

## Issue 1: Persistent Proposal Cache Survives Cleanup

### Symptom
```
Cannot propose a batch for round 1 - the latest proposal cache round is 5236
```

### Root Cause
The BFT primary loads a **proposal cache file** on startup that persists across restarts:

**File Location**: `/root/.current-proposal-cache-{network}-{dev_id}`
**Example**: `/root/.current-proposal-cache-1-0` for testnet001 (dev 0)

**Code Reference**: `alphaos/node/bft/src/helpers/proposal_cache.rs:32-48`
```rust
pub fn proposal_cache_path(network: u16, storage_mode: &StorageMode) -> PathBuf {
    const PROPOSAL_CACHE_FILE_NAME: &str = "current-proposal-cache";
    let mut path = alpha_ledger_dir(network, storage_mode);
    if !matches!(storage_mode, StorageMode::Custom(..)) {
        path.pop();
    }
    match storage_mode.dev() {
        Some(id) => path.push(format!(".{PROPOSAL_CACHE_FILE_NAME}-{network}-{id}")),
        None => path.push(format!("{PROPOSAL_CACHE_FILE_NAME}-{network}")),
    }
    path
}
```
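To make the path resolution concrete, here is a minimal, self-contained sketch of the same naming logic. The `StorageMode` enum below is a simplified stand-in, and `sketch_cache_path` is a hypothetical helper that takes the ledger directory as an argument instead of calling `alpha_ledger_dir`; neither is the real alphaos type or function.

```rust
use std::path::{Path, PathBuf};

/// Simplified stand-in for the real `StorageMode` (hypothetical, for illustration only).
#[derive(Debug, Clone)]
pub enum StorageMode {
    Production,
    Development(u16),
    Custom(PathBuf),
}

impl StorageMode {
    /// Returns the dev ID if the node runs in dev mode.
    pub fn dev(&self) -> Option<u16> {
        match self {
            StorageMode::Development(id) => Some(*id),
            _ => None,
        }
    }
}

/// Mirrors the naming logic of `proposal_cache_path`, with the ledger
/// directory passed in explicitly (assumption: it is what `alpha_ledger_dir` returns).
pub fn sketch_cache_path(ledger_dir: &Path, network: u16, mode: &StorageMode) -> PathBuf {
    const NAME: &str = "current-proposal-cache";
    let mut path = ledger_dir.to_path_buf();
    // For non-custom modes the cache sits next to the ledger directory, not inside it.
    if !matches!(mode, StorageMode::Custom(..)) {
        path.pop();
    }
    match mode.dev() {
        // Dev-mode caches are hidden files (leading dot) suffixed with the dev ID.
        Some(id) => path.push(format!(".{NAME}-{network}-{id}")),
        None => path.push(format!("{NAME}-{network}")),
    }
    path
}
```

With a ledger directory of `/root/.ledger-1-0`, network 1, and `StorageMode::Development(0)`, the sketch yields `/root/.current-proposal-cache-1-0`, which matches the example path for testnet001.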

**Load Behavior**: `alphaos/node/bft/src/primary.rs:173-188`
```rust
async fn load_proposal_cache(&self) -> Result<()> {
    match ProposalCache::<N>::exists(&self.storage_mode) {
        true => match ProposalCache::<N>::load(self.gateway.account().address(), &self.storage_mode) {
            Ok(proposal_cache) => {
                let (latest_certificate_round, proposed_batch, signed_proposals, pending_certificates) =
                    proposal_cache.into();
                // CRITICAL LINE: Sets propose_lock to cached round
                *self.propose_lock.lock().await = latest_certificate_round;
                // ...
            }
            // ... (error arm elided)
        },
        // ... (`false` arm elided)
    }
}
```

**Impact**: Validators cannot propose new batches because the current round (1) is lower than the cached round (5236) that was loaded into `propose_lock`.
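The blocking condition can be sketched as a single comparison. This is a hypothetical stand-in, not the code in `primary.rs`: it assumes the primary rejects any round strictly below the value held in `propose_lock`, which is consistent with the log message in the symptom but is not shown in the excerpt above.

```rust
/// Hypothetical stand-in for the round check behind the symptom.
/// Assumption: a round strictly below the propose lock is rejected.
pub fn check_propose_round(round: u64, propose_lock_round: u64) -> Result<(), String> {
    if round < propose_lock_round {
        // Same wording as the log line quoted in the symptom.
        return Err(format!(
            "Cannot propose a batch for round {round} - the latest proposal cache round is {propose_lock_round}"
        ));
    }
    Ok(())
}
```

Under this assumption, `check_propose_round(1, 5236)` returns the error message from the symptom, and a fresh validator would stay blocked until its round reaches the cached value.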

### Files Involved
1. **Proposal Cache**: `/root/.current-proposal-cache-1-{0..4}` (one per validator)
2. **RocksDB Storage**: `/root/.ledger-1-{0..4}/` (BFT storage, contains certificates)
3. **Blockchain Ledgers**: `/root/.alpha/` and `/root/.delta/` (block data)

### Why Standard Cleanup Fails
1. **Hidden files**: Proposal cache file names start with `.`, so they do not appear in a plain `ls` and are easy to miss
2. **Recreated from RocksDB**: If only the cache file is deleted, it is regenerated from RocksDB on the next start
3. **RocksDB persistence**: SST files can survive a basic `rm -rf` if the validator process still holds open file handles

### Successful Cleanup Procedure
```bash
# 1. Stop validators
sudo systemctl stop alphaos-validator

# 2. Wait for file locks to release
sleep 15

# 3. Delete ALL state (in order of importance)
# Run the globs inside a root shell: if the calling user cannot read /root,
# the patterns reach rm unexpanded and `rm -f` silently deletes nothing.
sudo sh -c 'rm -f /root/.current-proposal-cache-*'   # Proposal cache (CRITICAL)
sudo sh -c 'rm -rf /root/.ledger-*'                  # BFT storage
sudo rm -rf /root/.alpha /root/.delta                # Blockchain ledgers

# 4. Verify deletion
sudo ls -la /root/ | grep -E 'proposal|ledger|alpha|delta'
# Should print nothing

# 5. Start validator
sudo systemctl start alphaos-validator
```
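The verification step can also be expressed as a small check that a reset script or test harness could call. The sketch below is an illustration under assumptions: `leftover_state` is a hypothetical helper, and the prefixes and names are taken from the "Files Involved" list rather than from any alphaos API.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Name prefixes of the per-validator state entries (from "Files Involved").
const STATE_PREFIXES: [&str; 2] = [".current-proposal-cache-", ".ledger-"];
/// Exact names of the blockchain ledger directories (from "Files Involved").
const STATE_NAMES: [&str; 2] = [".alpha", ".delta"];

/// Returns every entry directly under `dir` that looks like validator state.
/// An empty result means the cleanup removed everything it was meant to.
pub fn leftover_state(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut leftovers = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let file_name = entry.file_name();
        let name = file_name.to_string_lossy();
        let is_state = STATE_PREFIXES.iter().any(|p| name.starts_with(*p))
            || STATE_NAMES.iter().any(|n| name == *n);
        if is_state {
            leftovers.push(entry.path());
        }
    }
    // Sort for stable, reproducible output.
    leftovers.sort();
    Ok(leftovers)
}
```

Unlike the `grep` pipeline, this matches `.alpha` and `.delta` by exact name, so unrelated entries that merely contain those strings are not reported.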

## Issue 2: RocksDB Proposal Cache Source Unknown

### Symptom
Even after deleting all state files and restarting, validators still load round 5236 from an unidentified source.

### Investigation
- Deleted `/root/.current-proposal-cache-*` ✓
- Deleted `/root/.ledger-*` ✓
- Deleted `/root/.alpha` and `/root/.delta` ✓
- **Result**: Cache file recreated with round 5236 on next start

### Hypothesis
The round 5236 state originates from one of:
1. **Hardcoded in binary**: Dev mode deterministically generates this state
2. **Network sync**: Validators fetch historical state from peers on startup
3. **Genesis file**: Embedded genesis contains blocks/certificates up to round 5236
4. **External storage**: Some other storage location we haven't found

### Evidence
After comprehensive cleanup (all files verified deleted), validator logs still show:
```
Loaded the proposal cache from /root/.current-proposal-cache-1-0 at round 5236
```

This means the file was recreated with the same round value (5236), which suggests deterministic regeneration from either the genesis or the dev mode parameters.
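For automated checks, the cached round can be pulled out of that log line. The following is a minimal sketch that assumes the log format is exactly as quoted above; `cached_round_from_log` is a hypothetical helper, and real log lines may carry timestamps or color codes that need stripping first.

```rust
/// Extracts the round from a line shaped like
/// "Loaded the proposal cache from <path> at round <N>".
/// Returns `None` if the line does not match that shape.
pub fn cached_round_from_log(line: &str) -> Option<u64> {
    if !line.contains("Loaded the proposal cache") {
        return None;
    }
    // Split at the last " at round " so that paths containing spaces do not confuse the parse.
    let (_, tail) = line.rsplit_once(" at round ")?;
    tail.trim().parse().ok()
}
```

A reset check could then fail whenever a freshly started validator reports any round from this line, since a clean start is expected to have no cache to load.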

### Recommended Investigation
1. Check the dev mode genesis generation code for hardcoded proposals
2. Verify what happens when `--dev 0 --dev-num-validators 5` initializes the ledger
3. Test with `--genesis-file` pointing to a fresh genesis (bypassing dev mode)

## Issue 3: No Fresh Start Without Full Rebuild

### Symptom
New features (such as governance) cannot be tested on the testnet because validators are stuck in the old state.

### Impact
- Section 12b governance testing blocked indefinitely
- Cannot verify proposal activation at height 100
- Cannot test vote-based execution flow

### Workaround Options
1. **Deploy fresh testnet**: New servers with clean state
2. **Use unit tests**: Verify governance logic without live testnet
3. **Fix proposal cache loading**: Add `--ignore-proposal-cache` flag
4. **Fix dev mode**: Ensure dev mode always starts from height 1, round 1

## Proposed Fixes

### Fix 1: Add `--fresh-start` Flag
**File**: `alphaos/cli/src/commands/start.rs`
```rust
#[clap(long = "fresh-start")]
pub fresh_start: bool,

// In parse_node():
if self.fresh_start {
    // Delete proposal cache before starting
    let cache_path = proposal_cache_path(N::ID, &storage_mode);
    if cache_path.exists() {
        std::fs::remove_file(&cache_path)?;
        info!("Deleted proposal cache for fresh start");
    }
}
```
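The `exists()` check followed by `remove_file` leaves a small window in which the file can disappear between the two calls. A possible variant, sketched here with a hypothetical helper name, attempts the removal directly and treats "not found" as a non-error:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Removes `path` if present. Returns `Ok(true)` if a file was deleted and
/// `Ok(false)` if there was nothing to delete; other I/O errors are propagated.
pub fn remove_file_if_present(path: &Path) -> io::Result<bool> {
    match fs::remove_file(path) {
        Ok(()) => Ok(true),
        // A missing file is the desired end state, so it is not an error here.
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(err) => Err(err),
    }
}
```

The boolean return value lets the caller log "Deleted proposal cache for fresh start" only when a file was actually removed.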

### Fix 2: Skip Proposal Cache in Dev Mode
**File**: `alphaos/node/bft/src/primary.rs`
```rust
async fn load_proposal_cache(&self) -> Result<()> {
    // Skip cache loading in dev mode
    if self.storage_mode.dev().is_some() {
        info!("Skipping proposal cache in dev mode");
        return Ok(());
    }
    // ... existing cache loading logic
}
```

### Fix 3: Proposal Cache Validation
**File**: `alphaos/node/bft/src/helpers/proposal_cache.rs`
```rust
pub fn load(...) -> Result<Self> {
    // ... load cache file

    // Validate cache is not stale.
    // NOTE: `load` currently receives only the address and storage mode,
    // so the ledger (or its height) would have to be passed in by the caller.
    let current_height = ledger.latest_height();
    if proposal_cache.latest_round > current_height + 100 {
        bail!("Proposal cache round {} is far ahead of ledger height {} - ignoring stale cache",
              proposal_cache.latest_round, current_height);
    }

    Ok(proposal_cache)
}
```
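A standalone version of the staleness check is easier to unit test. The sketch below is hypothetical: it assumes rounds are `u64` and heights are `u32` (the actual alphaos types are not shown in this document), and it makes the tolerance of 100 a parameter.

```rust
/// Hypothetical staleness check: rejects a cached round that is more than
/// `max_gap` ahead of the ledger height.
/// Assumption: rounds are `u64` and heights are `u32`.
pub fn validate_cache_round(cached_round: u64, ledger_height: u32, max_gap: u64) -> Result<(), String> {
    // Widen the height before adding so the sum cannot overflow.
    let limit = u64::from(ledger_height).saturating_add(max_gap);
    if cached_round > limit {
        return Err(format!(
            "Proposal cache round {cached_round} is far ahead of ledger height {ledger_height} - ignoring stale cache"
        ));
    }
    Ok(())
}
```

With the values from this incident, `validate_cache_round(5236, 0, 100)` returns an error, while a cache at round 100 or lower on an empty ledger passes. Comparing a round against a height is a heuristic, so the tolerance would need tuning against how quickly rounds advance relative to blocks.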

### Fix 4: Cleanup Script Integration
Add `alphaos clean` subcommand:
```rust
// alphaos/cli/src/commands/clean.rs
pub struct Clean {
    #[clap(long)]
    pub network: u16,

    #[clap(long)]
    pub confirm: bool,
}

impl Clean {
    pub fn run(&self) -> Result<()> {
        if !self.confirm {
            bail!("Add --confirm to proceed with cleanup");
        }

        // NOTE: `StorageMode::default()` is a placeholder. Dev-mode validators would
        // need their dev ID here so that the `-{dev_id}` paths are resolved.
        let cache_path = proposal_cache_path(self.network, &StorageMode::default());
        let ledger_path = alpha_ledger_dir(self.network, &StorageMode::default());

        // Delete files
        if cache_path.exists() { fs::remove_file(&cache_path)?; }
        if ledger_path.exists() { fs::remove_dir_all(&ledger_path)?; }

        info!("Cleanup complete");
        Ok(())
    }
}
```
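Because the testnet state files carry a dev ID suffix, a clean command would need to cover every validator on the host. The sketch below enumerates the per-validator paths using the naming from "Files Involved"; `dev_state_paths` is a hypothetical helper, and the root directory is an explicit parameter rather than something derived from alphaos.

```rust
use std::path::{Path, PathBuf};

/// Lists the per-validator state paths under `root` for dev IDs `0..num_validators`,
/// following the naming shown under "Files Involved" (an assumption about layout).
pub fn dev_state_paths(root: &Path, network: u16, num_validators: u16) -> Vec<PathBuf> {
    let mut paths = Vec::new();
    for id in 0..num_validators {
        // Proposal cache file, then the BFT storage directory, for each validator.
        paths.push(root.join(format!(".current-proposal-cache-{network}-{id}")));
        paths.push(root.join(format!(".ledger-{network}-{id}")));
    }
    paths
}
```

For `root = /root`, network 1, and five validators, this produces ten paths, from `/root/.current-proposal-cache-1-0` through `/root/.ledger-1-4`.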

## Issue 4: Dev Mode Genesis Determinism Unclear

### Symptom
Using `--dev 0 --dev-num-validators 5` creates a deterministic genesis, but the resulting initial state is not documented.

### Questions
1. Does dev mode genesis include pre-generated blocks?
2. Does dev mode genesis include pre-generated certificates up to round 5236?
3. Why does dev mode create state with round 5236 specifically?

### Recommended Documentation
Add documentation that states exactly what dev mode initializes:
- Initial block height
- Initial round number
- Pre-generated certificates (if any)
- Expected proposal cache state

## Issue 5: Testnet Deployment Lacks State Reset Procedure

### Symptom
There is no documented procedure to reset the testnet to a clean state without rebuilding servers.

### Impact
- Testing new features requires infrastructure changes
- CI/CD pipelines cannot reset the test environment
- Manual SSH operations are required for cleanup

### Recommended Solution
Create an official state reset procedure in `docs/operations/testnet-reset.md` that covers:
1. Automated script (already created: `testnet-cleanup-script.sh`)
2. Verification steps
3. Expected behavior after reset
4. Troubleshooting guide

## Codebase Issues Summary

| Issue | File | Severity | Fix Priority |
|-------|------|----------|--------------|
| Proposal cache loads stale round | `node/bft/src/primary.rs:173-188` | HIGH | 1 |
| No --fresh-start flag | `cli/src/commands/start.rs` | MEDIUM | 2 |
| Dev mode loads cache | `node/bft/src/primary.rs:173` | MEDIUM | 2 |
| No cache validation | `node/bft/src/helpers/proposal_cache.rs` | LOW | 3 |
| No alphaos clean command | Missing file | LOW | 4 |

## Testing Recommendations

### Unit Test Coverage Needed
1. Test proposal cache loading with stale round
2. Test dev mode with --fresh-start flag
3. Test cache validation with ledger height mismatch

### Integration Test Scenarios
1. Start 5 validators fresh → verify round 1
2. Restart validators → verify round preservation
3. Delete cache + restart → verify fresh start
4. Mix of fresh + restarted validators → verify sync

### Testnet Procedures
1. Document state reset procedure
2. Automate cleanup with script
3. Add verification checks
4. Create troubleshooting runbook

## Conclusion

The persistent proposal cache is the immediate cause of the testnet being stuck, although the origin of the round 5236 state has not yet been identified (see Issue 2). The code for Section 12b (Rust-native governance) is complete, but it cannot be tested live until this infrastructure issue is resolved.

**Immediate Actions**:
1. ✅ Document issues (this file)
2. ✅ Create cleanup script (`testnet-cleanup-script.sh`)
3. ⏭️ Implement `--fresh-start` flag
4. ⏭️ Skip proposal cache in dev mode
5. ⏭️ Deploy fresh testnet OR fix cache loading

**Status**: Section 12b implementation is **CODE COMPLETE** but **TESTING BLOCKED** by infrastructure.