testnet-proposal-cache-fixes.cspec
id: testnet-proposal-cache-fixes
title: Fix Persistent Proposal Cache Issues in Dev Mode
status: proposed
priority: high
estimated_effort: 4-6 hours
created: 2026-01-22
author: Claude Sonnet 4.5

## Context

During Section 12b governance testing, we discovered that the BFT proposal cache persists across testnet restarts, leaving validators stuck and unable to propose new batches.

**Root Cause**: The proposal cache file restores `latest_certificate_round` from the previous session, so the validator refuses to propose for round 1 when the cache shows round 5236.

**Impact**: New features cannot be tested on testnet without a full infrastructure rebuild.

## Problem Statement

**Files Involved**:
- `alphaos/node/bft/src/primary.rs:173-188` - Loads proposal cache unconditionally
- `alphaos/node/bft/src/helpers/proposal_cache.rs:32-48` - Cache file path generation
- `alphaos/cli/src/commands/start.rs` - No fresh-start option

**Current Behavior**:
1. Validator starts with `--dev 0 --dev-num-validators 5`
2. BFT primary calls `load_proposal_cache()`
3. Loads `/root/.current-proposal-cache-1-0` if it exists
4. Sets `propose_lock` to the cached round (e.g., 5236)
5. Refuses to propose for round 1 because 1 < 5236
6. **Result**: Testnet stuck forever

**Error Message**:
```
Cannot propose a batch for round 1 - the latest proposal cache round is 5236
```

## Proposed Solution

### Fix 1: Add --fresh-start Flag

**File**: `alphaos/cli/src/commands/start.rs`
**Lines**: Add around line 112 (with other flags)

```rust
/// Start with fresh state (delete proposal cache)
#[clap(long = "fresh-start")]
pub fresh_start: bool,
```

**Integration** (in `parse_node()` around line 650):
```rust
// Before starting consensus, check for fresh start
if self.fresh_start {
    info!("⚠️ Fresh start requested - deleting proposal cache");

    use acdc_std::StorageMode;
    use alphaos_node_bft::helpers::proposal_cache_path;

    let storage_mode = if let Some(dev_id) = self.dev {
        StorageMode::Development(dev_id)
    } else {
        StorageMode::Production
    };

    let cache_path = proposal_cache_path(N::ID, &storage_mode);

    if cache_path.exists() {
        std::fs::remove_file(&cache_path)
            .context("Failed to delete proposal cache")?;
        info!("✅ Deleted proposal cache: {}", cache_path.display());
    } else {
        info!("No proposal cache found");
    }
}
```

**Usage**:
```bash
alphaos start --network 1 --validator --dev 0 --dev-num-validators 5 --fresh-start
```

**Testing**:
1. Start validator normally, let it run to round 100
2. Stop validator
3. Restart with `--fresh-start`
4. Verify no cache file exists
5. Verify validator starts at round 1

---

### Fix 2: Skip Proposal Cache in Dev Mode

**File**: `alphaos/node/bft/src/primary.rs`
**Function**: `load_proposal_cache()` (lines 172-212)

**Change**:
```rust
/// Load the proposal cache file and update the Primary state with the stored data.
async fn load_proposal_cache(&self) -> Result<()> {
    // IMPORTANT: Skip proposal cache in dev mode to allow fresh starts
    if self.storage_mode.dev().is_some() {
        info!("Skipping proposal cache in dev mode (allows fresh restarts)");
        return Ok(());
    }

    // Fetch the signed proposals from the file system if it exists.
    match ProposalCache::<N>::exists(&self.storage_mode) {
        // ... rest of existing logic
    }
}
```

**Rationale**:
- Dev mode is for testing and should allow easy resets
- Production mode still benefits from the proposal cache for crash recovery
- Explicit opt-in for cache behavior

**Side Effects**:
- Dev mode validators will lose proposal state on restart (acceptable for testing)
- Slightly slower recovery after a crash in dev mode (acceptable trade-off)

**Testing**:
1. Start dev mode validator, let it run to round 100
2. Stop validator (cache file should exist)
3. Restart validator
4. Verify it starts at round 1 (cache ignored)
5. Test that production mode still loads the cache

---

### Fix 3: Proposal Cache Validation

**File**: `alphaos/node/bft/src/helpers/proposal_cache.rs`
**Function**: `load()` (around line 92)

**Add validation after loading**:
```rust
pub fn load(signer: Address<N>, storage_mode: &StorageMode) -> Result<Self> {
    // Construct the proposal cache file system path.
    let path = proposal_cache_path(N::ID, storage_mode);

    // Attempt to read the proposal cache file.
    let proposal_cache = match fs::read(&path) {
        Ok(bytes) => match Self::from_bytes_le(&bytes[..]) {
            Ok(cache) => cache,
            Err(err) => bail!("Failed to deserialize the proposal cache - {err}"),
        },
        Err(err) => bail!("Failed to read the proposal cache from {} - {err}", path.display()),
    };

    // Validate the proposal cache.
    ensure!(proposal_cache.is_valid(signer), "The proposal cache is invalid");

    // NEW: Validate cache is not stale
    // If cache round is suspiciously high, it's likely from an old session
    if proposal_cache.latest_round > 10000 {
        warn!(
            "Proposal cache round {} is very high - this may be from an old session. \
             Consider using --fresh-start to reset state.",
            proposal_cache.latest_round
        );
    }

    info!("Loaded the proposal cache from {} at round {}", path.display(), proposal_cache.latest_round);

    Ok(proposal_cache)
}
```

**Alternative** (stricter validation):
```rust
// Compare cache round with current ledger state
// Requires access to ledger, so would need API change
pub fn load(
    signer: Address<N>,
    storage_mode: &StorageMode,
    current_height: u32, // NEW parameter
) -> Result<Self> {
    // ... load cache

    // Validate cache round is reasonable compared to ledger height
    // Typically round ≈ height, so if round >> height, cache is stale
    if proposal_cache.latest_round > current_height + 1000 {
        bail!(
            "Proposal cache round {} is far ahead of ledger height {} - \
             refusing to load stale cache. Use --fresh-start to reset.",
            proposal_cache.latest_round,
            current_height
        );
    }

    Ok(proposal_cache)
}
```

**Testing**:
1. Create cache file with round 5236, ledger at height 10
2. Attempt to load cache
3. Verify validation triggers
4. Verify helpful error message

---

### Fix 4: Add alphaos clean Subcommand

**New File**: `alphaos/cli/src/commands/clean.rs`

```rust
// Copyright (c) 2025-2026 ACDC Network
// ... standard header

use acdc_std::{alpha_ledger_dir, StorageMode};
use alphaos_node_bft::helpers::proposal_cache_path;
use alphavm::prelude::{anyhow::Result, bail, info};
use clap::Parser;
use std::fs;

/// Clean validator state (proposal cache, BFT storage)
#[derive(Debug, Parser)]
pub struct Clean {
    /// Network ID (0=mainnet, 1=testnet, 2=devnet)
    #[clap(long, default_value = "1")]
    pub network: u16,

    /// Development node ID (0-4 for dev mode)
    #[clap(long)]
    pub dev: Option<u8>,

    /// Confirm deletion (required for safety)
    #[clap(long)]
    pub confirm: bool,

    /// Clean blockchain ledgers too (dangerous)
    #[clap(long)]
    pub clean_ledger: bool,
}

impl Clean {
    pub fn run(self) -> Result<()> {
        if !self.confirm {
            println!("⚠️ This will delete validator state files.");
            println!(" Add --confirm to proceed.");
            return Ok(());
        }

        let storage_mode = if let Some(dev_id) = self.dev {
            StorageMode::Development(dev_id)
        } else {
            StorageMode::Production
        };

        // Delete proposal cache
        let cache_path = proposal_cache_path(self.network, &storage_mode);
        if cache_path.exists() {
            fs::remove_file(&cache_path)?;
            info!("✅ Deleted proposal cache: {}", cache_path.display());
        } else {
            info!("No proposal cache found at {}", cache_path.display());
        }

        // Delete BFT storage
        let ledger_dir = alpha_ledger_dir(self.network, &storage_mode);
        if ledger_dir.exists() {
            fs::remove_dir_all(&ledger_dir)?;
            info!("✅ Deleted BFT storage: {}", ledger_dir.display());
        } else {
            info!("No BFT storage found at {}", ledger_dir.display());
        }

        // Optionally delete blockchain ledgers
        if self.clean_ledger {
            // TODO: Add ledger deletion
            // This is more dangerous and should require extra confirmation
            info!("⚠️ Blockchain ledger cleaning not yet implemented");
        }

        println!("✅ Cleanup complete!");
        Ok(())
    }
}
```

**Integration**: `alphaos/cli/src/commands/mod.rs`
```rust
pub mod clean;
pub use clean::Clean;
```

**CLI Integration**: `alphaos/cli/src/main.rs`
```rust
#[derive(Debug, Parser)]
pub enum Command {
    Start(Start),
    Clean(Clean), // NEW
    // ... other commands
}

// In main():
match cli.command {
    Command::Start(start) => start.parse().await,
    Command::Clean(clean) => clean.run(),
}
```

**Usage**:
```bash
# Clean dev mode validator 0
alphaos clean --network 1 --dev 0 --confirm

# Clean production testnet validator
alphaos clean --network 1 --confirm

# Clean everything including ledger
alphaos clean --network 1 --dev 0 --confirm --clean-ledger
```

**Testing**:
1. Start validator, let it run
2. Stop validator
3. Run `alphaos clean --network 1 --dev 0 --confirm`
4. Verify cache and storage deleted
5. Start validator again, verify fresh start

---

## Implementation Plan

### Phase 1: Quick Fix (1-2 hours)
1. Implement Fix 2 (skip cache in dev mode)
2. Test on local dev environment
3. Deploy to testnet for verification

### Phase 2: User-Friendly Fix (2-3 hours)
1. Implement Fix 1 (--fresh-start flag)
2. Update systemd service files to document the flag
3. Add to testnet deployment scripts

### Phase 3: Robustness (1-2 hours)
1. Implement Fix 3 (cache validation)
2. Add unit tests for validation logic
3. Test with various cache scenarios

### Phase 4: Developer Experience (2-3 hours)
1. Implement Fix 4 (alphaos clean subcommand)
2. Write documentation
3. Update testnet operations guide

## Testing Strategy

### Unit Tests
**File**: `alphaos/node/bft/src/helpers/proposal_cache_tests.rs`

```rust
#[test]
fn test_cache_validation_stale_round() {
    // Create cache with round 5236
    let cache = ProposalCache::new(5236, None, Default::default(), Default::default());

    // Attempt to load with current height 10
    let result = cache.validate_against_height(10);

    assert!(result.is_err());
    assert!(result.unwrap_err().to_string().contains("stale cache"));
}

// Awaiting the lock requires an async test; the tokio runtime is assumed here.
#[tokio::test]
async fn test_dev_mode_skips_cache() {
    // Start primary in dev mode
    let primary = Primary::new(..., StorageMode::Development(0));

    // Verify cache not loaded
    assert_eq!(*primary.propose_lock.lock().await, 0);
}
```

### Integration Tests
1. **Test fresh-start flag**:
   - Start validator, run to round 100, stop
   - Restart with --fresh-start
   - Verify round resets to 1

2. **Test clean command**:
   - Start validator, create state
   - Run `alphaos clean --confirm`
   - Verify all state deleted

3. **Test cache validation**:
   - Manually create stale cache file
   - Attempt to start validator
   - Verify validation prevents startup with clear error

### Testnet Verification
1. Deploy updated binary to all 5 validators
2. Test fresh restart procedure
3. Verify governance testing can proceed
4. Document new operational procedures

## Success Criteria

- [ ] Dev mode validators can restart fresh without manual cleanup
- [ ] `--fresh-start` flag works correctly
- [ ] `alphaos clean` command implemented and tested
- [ ] Cache validation prevents stale state loading
- [ ] Section 12b governance testing unblocked
- [ ] Documentation updated with new procedures

## Dependencies

- None (all changes self-contained in alphaos)

## Risks

- **Data loss**: Clean command could delete important state if misused
  - Mitigation: Require --confirm flag, clear warnings
- **Behavior change**: Skipping cache in dev mode changes restart behavior
  - Mitigation: Only affects dev mode, production unchanged
- **Testing gaps**: Cache validation logic needs thorough testing
  - Mitigation: Comprehensive unit tests before deployment

## Documentation Updates

### Files to Update
1. `alpha-delta-context/docs/operations/testnet-reset.md` - Add clean procedures
2. `alpha-delta-context/docs/operations/validator-onboarding.md` - Mention --fresh-start
3. `alphaos/cli/README.md` - Document clean subcommand

### New Documentation
1. Troubleshooting guide for "Cannot propose a batch" error
2. Best practices for dev mode testing
3. State management in production vs dev mode

## Related Issues

- Section 12b governance testing (blocked by this issue)
- Future testnet deployments (need reliable reset procedure)
- CI/CD integration testing (needs clean state between runs)

## Notes

This is a HIGH priority fix because it blocks testing of all new consensus features. The immediate workaround (Fix 2) can be implemented in < 1 hour and unblocks governance testing.

The proposal cache is valuable for production crash recovery, but becomes a liability in dev/test environments. The solution is to make cache behavior configurable and provide tools for state management.
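
The unit test `test_cache_validation_stale_round` calls `validate_against_height`, which this spec does not define. As a minimal sketch, assuming the check reduces to comparing the cached round against the ledger height with the slack of 1000 rounds used in Fix 3's stricter alternative, it could look like the following. The free-function form, the `MAX_ROUND_LEAD` name, the `u64`/`u32` parameter types, and the `String` error type are illustrative assumptions; the real method would presumably live on `ProposalCache` and return the crate's own `Result`.

```rust
/// Hypothetical slack between the cached round and the ledger height.
/// The value 1000 mirrors the threshold in Fix 3's stricter alternative.
const MAX_ROUND_LEAD: u64 = 1000;

/// Returns an error when the cached round is implausibly far ahead of the
/// ledger height, which suggests the cache comes from an earlier session.
fn validate_against_height(latest_round: u64, current_height: u32) -> Result<(), String> {
    // Widen the height to u64 before adding, so the sum cannot overflow u32.
    let limit = u64::from(current_height).saturating_add(MAX_ROUND_LEAD);

    if latest_round > limit {
        return Err(format!(
            "Proposal cache round {latest_round} is far ahead of ledger height {current_height} - \
             refusing to load stale cache. Use --fresh-start to reset."
        ));
    }

    Ok(())
}
```

With the scenario from this spec, `validate_against_height(5236, 10)` returns an error whose message contains "stale cache", while a round within the slack, such as `validate_against_height(1010, 10)`, is accepted. A fixed slack is the simplest option; whether 1000 is the right margin depends on how rounds relate to heights in practice, which this sketch does not try to settle.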