TESTNET-POSTMORTEM.md
# Testnet Setup Script - Issue & Resolution

## Date: 2026-01-20

## Summary
During testing of the automated testnet setup script on testnet002.ac-dc.network, a race condition was discovered that resulted in loss of SSH access to the server.

## What Happened

### Timeline
1. **19:29:26** - Setup script started on testnet002.ac-dc.network
2. **19:29:26-27** - Steps 1-5 completed successfully:
   - SSH connection verified
   - Passwordless SSH configured
   - Security check script installed
   - Systemd service created and enabled
3. **19:29:27** - Step 6: Initial security check ran
   - UFW installed and configured
   - Firewall rules created for ports: 2584 (SSH), 4130/5000/3030 (AlphaOS), 4131/5001/3031 (DeltaOS)
   - **Critical Issue:** Port 22 was NOT included in allowed ports
   - UFW enabled and set to deny all incoming except listed ports
4. **19:29:27** - Step 7: SSH port migration started
   - Script attempted to change SSH from port 22 → 2584
   - SSH service restart initiated
   - **Connection lost:** Cannot reconnect on either port 22 or 2584
5. **19:29:30+** - Server inaccessible via SSH

### Current State
- testnet002.ac-dc.network is running but SSH inaccessible
- AlphaOS validator may still be running but cannot be verified
- Requires console/physical access to recover

## Root Cause

**Race Condition in Script Design:**

The security check script (step 6) configured the firewall to ONLY allow port 2584 BEFORE the SSH port migration (step 7) could complete. This created a scenario where:

1. Firewall blocks port 22 immediately
2. SSH is still running on port 22
3. Script tries to change SSH to port 2584
4. Connection lost before SSH restarts on new port
5. No access on either port

**Code Issue:**
```bash
# testnet-security-check.sh (BEFORE FIX)
SSH_PORT=2584  # Only allowed this port
```

This did not account for servers that are still on port 22 during initial setup.

## Fixes Implemented

### 1. Security Check Script (`testnet-security-check.sh`)

**Changed:**
```bash
# BEFORE
SSH_PORT=2584

# AFTER
SSH_PORTS=(22 2584)  # Allow both during migration
```

**Rationale:**
- Allows both ports 22 and 2584 to be open simultaneously
- Prevents lockout during SSH port migration
- Port 22 can be manually removed after confirming port 2584 works

### 2. Setup Script (`setup-testnet-server.sh`)

**Changed:**
```bash
# BEFORE
# Remove old port from firewall
ssh -p "$TARGET_SSH_PORT" "$SERVER_URL" "sudo ufw delete allow 22/tcp"

# AFTER
# Keep port 22 for safety
log_warn "Note: Port 22 is still allowed in firewall for safety"
log_warn "Remove manually: sudo ufw delete allow 22/tcp"
```

**Rationale:**
- Does not automatically remove port 22 from firewall
- Allows admin to verify port 2584 works before removing port 22
- Provides clear instructions for manual removal

## Recovery Process for testnet002

**Option 1: Console Access (Recommended)**
If you have console/KVM access to the server:

1. Login via console
2. Check SSH service status:
   ```bash
   systemctl status sshd
   grep Port /etc/ssh/sshd_config
   ```
3. Check firewall status:
   ```bash
   ufw status numbered
   ```
4. Add port 22 to firewall:
   ```bash
   ufw allow 22/tcp
   ```
5. Restart SSH on port 2584:
   ```bash
   sed -i 's/^#*Port.*/Port 2584/' /etc/ssh/sshd_config
   systemctl restart sshd
   ```
6. Test SSH from devops machine:
   ```bash
   ssh -p 2584 testnet002.ac-dc.network
   ```
7. If working, remove port 22:
   ```bash
   ufw delete allow 22/tcp
   ```

**Option 2: Server Reboot**
If the server has remote reboot capability:

1. Reboot the server
2. The security check service will run on boot and open both ports 22 and 2584
3. Connect on port 22:
   ```bash
   ssh testnet002.ac-dc.network
   ```
4. Fix SSH configuration:
   ```bash
   sudo sed -i 's/^#*Port.*/Port 2584/' /etc/ssh/sshd_config
   sudo systemctl restart sshd
   ```
5. Reconnect on port 2584
6. Remove port 22 from firewall

**Option 3: Provider Console**
If using a hosting provider (Hetzner, AWS, etc.):

1. Access web-based console/VNC
2. Follow Option 1 steps

## Lessons Learned

### What Went Well
- Steps 1-5 of setup script worked perfectly
- Passwordless SSH configuration succeeded
- Security script installation succeeded
- Systemd service creation and activation succeeded
- Firewall configuration (UFW) worked as designed

### What Went Wrong
- Did not account for SSH port migration edge case
- Security check ran before port migration completed
- No rollback mechanism if port migration fails
- Insufficient testing on live servers

### Improvements Made
1. ✅ Allow both SSH ports (22 and 2584) in firewall
2. ✅ Do not auto-remove port 22 after migration
3. ✅ Add warnings about manual port 22 removal
4. ✅ Updated documentation with recovery procedures

### Additional Improvements Needed
1. ⏳ Add preflight check: verify current SSH port before migration
2. ⏳ Add SSH connection keepalive during port migration
3. ⏳ Add rollback mechanism if new port doesn't respond
4. ⏳ Test script in dry-run mode on live servers first
5. ⏳ Add option to skip port migration entirely
6. ⏳ Consider using SSH multiplexing for persistent connections

## Testing Recommendations

### Before Production Use:
1. **Test in VM/local environment first**
   - Use VirtualBox/QEMU VM
   - Simulate complete setup process
   - Verify recovery procedures work

2. **Test with console access available**
   - Have KVM/IPMI access ready
   - Test on non-critical server first
   - Have monitoring in place

3. **Test incrementally**
   - Run steps 1-6 only (skip port migration)
   - Verify firewall and services work
   - Manually test port 2584 before migration
   - Then complete step 7 if needed

4. **Use dry-run mode**
   - Add `--dry-run` flag to script
   - Show what would be changed
   - Require explicit confirmation

## Impact Assessment

### testnet002.ac-dc.network
- **Status:** Inaccessible via SSH, requires console access
- **Validator:** Unknown status, may still be running
- **Network:** 4 of 5 validators operational (testnet001, 004, 005, and one more)
- **Priority:** Medium (testnet has redundancy)
- **Recovery Time:** 10-15 minutes with console access

### Other Testnet Servers
- **testnet001:** Already on port 2584 (from earlier session)
- **testnet003:** Already had SSH issues (not related)
- **testnet004, 005:** Running validators successfully

### Script Availability
- **Status:** Fixed and ready for future use
- **Safety:** Much safer with both ports allowed
- **Testing:** Needs validation on new server

## Action Items

- [x] Fix security check script to allow both ports
- [x] Update setup script to keep port 22 open
- [x] Verify script syntax
- [x] Document issue and recovery
- [ ] Recover testnet002 using console access
- [ ] Test updated script on new server (testnet006+)
- [ ] Add dry-run mode to setup script
- [ ] Add preflight validation checks
- [ ] Create automated testing suite

## Conclusion

A race condition in the SSH port migration logic caused loss of access to testnet002. The issue has been identified and fixed by allowing both SSH ports (22 and 2584) to remain open, preventing future lockouts. The server can be recovered via console access, and the updated scripts are safer for production use.

**Status:** RESOLVED (scripts fixed), testnet002 requires manual recovery.
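
One of the pending improvements listed above is a preflight check that verifies the current SSH port before migration. A minimal sketch of such a check follows, assuming the port is set by a plain `Port` directive in `/etc/ssh/sshd_config` (no `Include` files, no `Port=value` syntax, no socket activation). The function names `current_ssh_ports` and `ports_to_allow` are hypothetical and do not exist in the current scripts.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers; not part of the existing setup scripts.

# Print the port(s) sshd is configured to listen on, one per line.
# Commented-out directives such as "#Port 22" are ignored.
# Falls back to 22 (the OpenSSH default) when no Port directive is found
# or the config file cannot be read.
current_ssh_ports() {
    local config="${1:-/etc/ssh/sshd_config}"
    local ports
    ports=$(awk 'tolower($1) == "port" && $2 ~ /^[0-9]+$/ { print $2 }' "$config" 2>/dev/null)
    if [ -z "$ports" ]; then
        echo 22
    else
        printf '%s\n' "$ports"
    fi
}

# Print the de-duplicated set of SSH ports the firewall should allow
# while migrating to the target port: every currently configured port
# plus the target itself.
ports_to_allow() {
    local target="$1"
    local config="${2:-/etc/ssh/sshd_config}"
    { current_ssh_ports "$config"; echo "$target"; } | sort -n -u
}
```

The output could then drive the firewall step, for example `for p in $(ports_to_allow 2584); do sudo ufw allow "${p}/tcp"; done`, so that whichever port sshd is actually using stays open until the new port has been confirmed to work.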