# Testnet Setup Script - Issue & Resolution

## Date: 2026-01-20

## Summary
During testing of the automated testnet setup script on testnet002.ac-dc.network, a race condition between firewall configuration and the SSH port migration resulted in loss of SSH access to the server.

## What Happened

### Timeline
1. **19:29:26** - Setup script started on testnet002.ac-dc.network
2. **19:29:26-27** - Steps 1-5 completed successfully:
   - SSH connection verified
   - Passwordless SSH configured
   - Security check script installed
   - Systemd service created and enabled
3. **19:29:27** - Step 6: Initial security check ran
   - UFW installed and configured
   - Firewall rules created for ports: 2584 (SSH), 4130/5000/3030 (AlphaOS), 4131/5001/3031 (DeltaOS)
   - **Critical Issue:** Port 22 was NOT included in allowed ports
   - UFW enabled and set to deny all incoming except listed ports
4. **19:29:27** - Step 7: SSH port migration started
   - Script attempted to change SSH from port 22 → 2584
   - SSH service restart initiated
   - **Connection lost:** Cannot reconnect on either port 22 or 2584
5. **19:29:30+** - Server inaccessible via SSH

### Current State
- testnet002.ac-dc.network is running but inaccessible via SSH
- AlphaOS validator may still be running, but this cannot be verified
- Console or physical access is required to recover

## Root Cause

**Race Condition in Script Design:**

The security check script (step 6) configured the firewall to allow SSH only on port 2584 before the SSH port migration (step 7) had moved sshd off port 22. This created the following sequence:

1. Firewall blocks port 22 immediately
2. SSH is still running on port 22
3. Script tries to change SSH to port 2584
4. Connection is lost before SSH restarts on the new port
5. No access on either port: port 22 is blocked by the firewall, and port 2584 does not respond

**Code Issue:**
```bash
# testnet-security-check.sh (BEFORE FIX)
SSH_PORT=2584  # Only this port was allowed
```

This did not account for servers that are still on port 22 during initial setup.

## Fixes Implemented

### 1. Security Check Script (`testnet-security-check.sh`)

**Changed:**
```bash
# BEFORE
SSH_PORT=2584

# AFTER
SSH_PORTS=(22 2584)  # Allow both during migration
```

**Rationale:**
- Allows both ports 22 and 2584 to be open simultaneously
- Prevents lockout during SSH port migration
- Port 22 can be manually removed after confirming port 2584 works

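For illustration, here is a minimal sketch of how the security check script could turn the `SSH_PORTS` array into firewall rules. The rest of `testnet-security-check.sh` is not shown in this document, so the helper name `ssh_allow_rules` is hypothetical, and the sketch only prints the `ufw` commands instead of running them.

```bash
#!/usr/bin/env bash
# Sketch only: prints the ufw commands rather than executing them.

SSH_PORTS=(22 2584)  # both ports stay open during migration

# Emit one "ufw allow" command per configured SSH port (hypothetical helper).
ssh_allow_rules() {
    local port
    for port in "${SSH_PORTS[@]}"; do
        echo "ufw allow ${port}/tcp"
    done
}

ssh_allow_rules
```

Iterating over the array means that dropping port 22 later is a one-line change to `SSH_PORTS` rather than an edit to the rule logic.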
### 2. Setup Script (`setup-testnet-server.sh`)

**Changed:**
```bash
# BEFORE
# Remove old port from firewall
ssh -p "$TARGET_SSH_PORT" "$SERVER_URL" "sudo ufw delete allow 22/tcp"

# AFTER
# Keep port 22 for safety
log_warn "Note: Port 22 is still allowed in firewall for safety"
log_warn "Remove manually: sudo ufw delete allow 22/tcp"
```

**Rationale:**
- Does not automatically remove port 22 from firewall
- Allows admin to verify port 2584 works before removing port 22
- Provides clear instructions for manual removal

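The manual verification step could be scripted along these lines. This is a sketch, not part of `setup-testnet-server.sh`: the helper names `port_reachable` and `close_old_port_if_safe` are hypothetical, the probe relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command, and the removal command is only printed so that an admin still runs it deliberately.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
port_reachable() {
    local host=$1 port=$2
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print the command that removes the old port, but only once the new
# port accepts connections; otherwise report why nothing was suggested.
close_old_port_if_safe() {
    local host=$1 new_port=$2 old_port=$3
    if port_reachable "$host" "$new_port"; then
        echo "sudo ufw delete allow ${old_port}/tcp"
    else
        echo "port ${new_port} not reachable on ${host}; keeping port ${old_port} open" >&2
        return 1
    fi
}
```

For example, `close_old_port_if_safe testnet002.ac-dc.network 2584 22` would print the removal command only when port 2584 answers. A successful TCP connect shows that something is listening, not that SSH login works, so an actual `ssh -p 2584` login remains the real confirmation.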
## Recovery Process for testnet002

**Option 1: Console Access (Recommended)**
If you have console/KVM access to the server (the commands below assume a root shell):

1. Log in via console
2. Check SSH service status:
   ```bash
   systemctl status sshd
   grep -E '^#?Port[[:space:]]' /etc/ssh/sshd_config
   ```
3. Check firewall status:
   ```bash
   ufw status numbered
   ```
4. Add port 22 to firewall:
   ```bash
   ufw allow 22/tcp
   ```
5. Restart SSH on port 2584:
   ```bash
   sed -i 's/^#*Port.*/Port 2584/' /etc/ssh/sshd_config
   sshd -t                 # validate the config before restarting
   systemctl restart sshd
   ```
6. Test SSH from devops machine:
   ```bash
   ssh -p 2584 testnet002.ac-dc.network
   ```
7. If working, remove port 22:
   ```bash
   ufw delete allow 22/tcp
   ```

**Option 2: Server Reboot**
If the server has remote reboot capability:

1. Reboot the server
2. The security check service will run on boot. With the updated script it opens both ports 22 and 2584; note that the copy installed on testnet002 predates the fix, so it may reopen only port 2584
3. Connect on port 22, or on port 2584 if port 22 is still blocked and `sshd_config` was already changed before the connection dropped:
   ```bash
   ssh testnet002.ac-dc.network
   ```
4. Fix SSH configuration:
   ```bash
   sudo sed -i 's/^#*Port.*/Port 2584/' /etc/ssh/sshd_config
   sudo systemctl restart sshd
   ```
5. Reconnect on port 2584
6. Remove port 22 from firewall once port 2584 is confirmed working

**Option 3: Provider Console**
If using a hosting provider (Hetzner, AWS, etc.):

1. Access web-based console/VNC
2. Follow Option 1 steps

## Lessons Learned

### What Went Well
- Steps 1-5 of the setup script completed without errors
- Passwordless SSH configuration succeeded
- Security script installation succeeded
- Systemd service creation and activation succeeded
- UFW applied the configured firewall rules as written

### What Went Wrong
- Did not account for the SSH port migration edge case
- Security check ran before port migration completed
- No rollback mechanism if port migration fails
- Insufficient testing on live servers

### Improvements Made
1. ✅ Allow both SSH ports (22 and 2584) in firewall
2. ✅ Do not auto-remove port 22 after migration
3. ✅ Add warnings about manual port 22 removal
4. ✅ Updated documentation with recovery procedures

### Additional Improvements Needed
1. ⏳ Add preflight check: verify current SSH port before migration
2. ⏳ Add SSH connection keepalive during port migration
3. ⏳ Add rollback mechanism if new port doesn't respond
4. ⏳ Test script in dry-run mode on live servers first
5. ⏳ Add option to skip port migration entirely
6. ⏳ Consider using SSH multiplexing for persistent connections

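The preflight check in item 1 might look something like the sketch below. None of these functions exist in the current scripts: `tcp_probe`, `detect_ssh_port`, and `preflight` are hypothetical names, and the probe assumes bash's `/dev/tcp` redirection plus the coreutils `timeout` command. The idea is to find out which port sshd actually answers on before the migration step changes anything.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Return 0 if host:port accepts a TCP connection within 3 seconds.
tcp_probe() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Print the first candidate port that answers; fail if none does.
detect_ssh_port() {
    local host=$1 port
    shift
    for port in "$@"; do
        if tcp_probe "$host" "$port"; then
            echo "$port"
            return 0
        fi
    done
    return 1
}

# Decide what the migration step should do before touching sshd.
preflight() {
    local host=$1 target=$2 current
    if ! current=$(detect_ssh_port "$host" "$target" 22); then
        echo "abort: no SSH port reachable on ${host}" >&2
        return 1
    fi
    if [ "$current" = "$target" ]; then
        echo "skip: ${host} already on port ${target}"
    else
        echo "migrate: ${host} from port ${current} to ${target}"
    fi
}
```

Calling `preflight testnet002.ac-dc.network 2584` would report `skip`, `migrate`, or `abort`. Probing the target port first means a server that has already been migrated is left alone, which also covers item 5 for that case.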
## Testing Recommendations

### Before Production Use:
1. **Test in VM/local environment first**
   - Use VirtualBox/QEMU VM
   - Simulate complete setup process
   - Verify recovery procedures work

2. **Test with console access available**
   - Have KVM/IPMI access ready
   - Test on non-critical server first
   - Have monitoring in place

3. **Test incrementally**
   - Run steps 1-6 only (skip port migration)
   - Verify firewall and services work
   - Manually test port 2584 before migration
   - Then complete step 7 if needed

4. **Use dry-run mode**
   - Add `--dry-run` flag to script
   - Show what would be changed
   - Require explicit confirmation

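One possible shape for the dry-run mode is sketched below. The setup script does not have this flag yet, so `parse_args`, `run`, and `confirm` are hypothetical names, and only `--dry-run` is handled. Each state-changing command would be routed through `run`, which prints it instead of executing it when the flag is set.

```bash
#!/usr/bin/env bash
# Sketch only: flag handling and helper names are hypothetical.

DRY_RUN=0

# Parse command-line flags; only --dry-run is recognised here.
parse_args() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            --dry-run) DRY_RUN=1 ;;
            *) echo "unknown option: ${arg}" >&2; return 2 ;;
        esac
    done
}

# Execute a command, or only print it when dry-run mode is on.
run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "[dry-run] $*"
    else
        "$@"
    fi
}

# Ask for explicit confirmation; only the literal answer "yes" proceeds.
confirm() {
    local answer
    read -r -p "$1 [yes/no] " answer
    [ "$answer" = "yes" ]
}
```

A script using this sketch would call `parse_args "$@"` once at startup, then write steps as `run sudo ufw allow 2584/tcp` and guard the migration with `confirm "Migrate SSH to port 2584?" || exit 1`. Requiring the full word `yes` is a deliberate choice so that an accidental Enter does not start the migration.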
## Impact Assessment

### testnet002.ac-dc.network
- **Status:** Inaccessible via SSH, requires console access
- **Validator:** Unknown status, may still be running
- **Network:** 4 of 5 validators operational (testnet001, 004, 005, and one more)
- **Priority:** Medium (testnet has redundancy)
- **Recovery Time:** Estimated 10-15 minutes with console access

### Other Testnet Servers
- **testnet001:** Already on port 2584 (from earlier session)
- **testnet003:** Already had SSH issues (not related)
- **testnet004, 005:** Running validators successfully

### Script Availability
- **Status:** Fixed and ready for future use
- **Safety:** Lockout risk reduced by keeping both ports allowed
- **Testing:** Needs validation on new server

## Action Items

- [x] Fix security check script to allow both ports
- [x] Update setup script to keep port 22 open
- [x] Verify script syntax
- [x] Document issue and recovery
- [ ] Recover testnet002 using console access
- [ ] Test updated script on new server (testnet006+)
- [ ] Add dry-run mode to setup script
- [ ] Add preflight validation checks
- [ ] Create automated testing suite

## Conclusion

A race condition in the SSH port migration logic caused loss of access to testnet002. The issue has been identified and fixed by keeping both SSH ports (22 and 2584) open in the firewall, which should prevent this lockout from recurring. The server can be recovered via console access, and the updated scripts still need validation on a new server before production use.

**Status:** RESOLVED (scripts fixed), testnet002 requires manual recovery.