231221-1502.wiki
%title The Fly.io Reboot Plan
:work:priv:
%date 2023-12-21 15:02
%update 2023-12-21 17:05

Related to [[231212-1501|Infra Questions for Fly.io]] [[231221-1246|Fly.io's Answers to my Questions]]

= The Plan =
The plan is to do this per region, across roughly 40 regions. Another assumption I'm making is that we want these reboots done sooner rather than later.

- Gather information about each region
    - Have someone comb through ~2 weeks of metrics to determine the following:
        - How many network-attached volumes we can expect
        - When the region is generally under its lowest utilization
    - Identify the worker servers that are to be rebooted in each region
    - Use this information to come up with a time frame in which reboots would minimize disruption
- Develop a tool (using `bash` + `parallel`, or `ansible`, or `fcm`) that is capable of automating the reboot process (see the orchestrator sketch after this list)
    - Ensure this tool / server has access to reboot all the servers in the region
    - Ensure this tool can only be accessed by those authorized to use it (using SSH access restricted to a specific user, or a specific bastion host shared across all regions; see the access-control sketch after this list)
    - Create the necessary backups in case of failure
        - Backups in this case are for the server, not the VMs running on the server
        - Document the process and create a run book for executing rollbacks:
            - Where are the backups stored?
            - What's the command to execute the rollback?
            - How do we connect over WireGuard to get BMC access to the machine?
                - (This was another question I missed :/)
        - If servers boot from a local drive: use a tool like `dd` to create a backup of the entire drive (see the backup sketch after this list)
        - If servers boot from a networked drive: ensure the "old" configuration is still around to boot from
    - Load the list of servers to reboot (only those in this region), and the number of "hold-over" servers we can use to parallelize the reboot process
    - It should be able to drain currently operating VMs onto our "hold-over" server(s), including VMs with network drives attached (see the drain sketch after this list)
        - It was noted in the Q&A that Fly does not currently have a tool to move network drives en masse
        - Such a tool could work by mounting a read-only replica of the volume on the "hold-over" VM, and then remounting it as `rw` once the source's shutdown has begun
        - Ensure that `flyd` and `fly-proxy` are able to re-route traffic to the new temporary server
    - Reboot the server
    - Verify server operation using `system-check`
        - If the server does not come back after 10 minutes, mark it as a failure and try again. After 3 failures, roll back the server
    - Reinstate all previously running VMs on the rebooted server
    - Generate metrics for the in-progress reboot procedure so that it can be monitored: number of servers currently rebooted, stage in the process, number of failures, reason for failure, etc. (see the metrics sketch after this list)
- Peer review the tool, and notify team members of the scheduled reboot times for each region
- Dry run the tool in a staging environment that resembles the production regions, to verify our expectations
    - Simulate catastrophic failure in staging to test the backup procedures
- Spin up / purchase new server(s) in the region, which will host VMs drained from the servers being rebooted, and will act as the reboot coordinator of the region
    - The number of new servers we use determines how quickly we can reboot servers without downtime
    - Given there are about 5 servers per region, I would say we should have 1/5th of the total number of servers to be rebooted as hold-overs per region, which ensures all reboots in the region can happen in roughly an hour
    - If there are multiple, only one will coordinate reboots; the rest will host "hold-over" VMs
    - If we're rebooting multiple servers at once, we want to stagger / batch the process with some randomization so they're not all rebooting at the same time
    - (Sidenote: I realized I forgot to ask this question in the Q&A stage, but I'm assuming here that it's better to optimize for less downtime over lower cost)
- Dedicate a team or individual to be responsible for each region during the reboot process, watching the monitoring and logs, ready to act on failure by following the previously created run book
- Open up a communication channel (per region if necessary) for the reboot process
- Put up a status page for the rebooting services, and notify customers
- Execute the reboot once the previously selected time frame comes around
- Once complete and verified, spin down the coordination / hold-over servers
- Document the outcome of the reboots
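A minimal sketch of the access-control idea: an OpenSSH forced command pins each operator's key on the coordinator to the reboot tool, so SSH access to that user can run the tool and nothing else. The `rebooter` user and the tool path are placeholders I'm making up here.

{{{
# /home/rebooter/.ssh/authorized_keys on the coordinator
# ("rebooter" and /usr/local/bin/reboot-tool are hypothetical placeholders)
# The forced command means this key can only invoke the reboot tool.
command="/usr/local/bin/reboot-tool",no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-pty ssh-ed25519 AAAA... operator@laptop
}}}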
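For the local-boot case, the backup could be as simple as streaming the drive image off the host before its reboot window. A rough sketch, assuming root SSH from the coordinator and a `/backups` volume mounted there; the device path and compression choice are assumptions to confirm per host.

{{{bash
#!/usr/bin/env bash
# Back up a worker's boot drive before the reboot (paths are assumptions).
# Run from the coordinator; assumes root SSH access to the worker and a
# backup volume mounted at /backups.
set -euo pipefail

host="$1"                       # e.g. the worker's hostname
disk="/dev/nvme0n1"             # boot drive; confirm per host
out="/backups/${host}-$(date +%Y%m%d).img.zst"

# Stream the raw drive over SSH, compressing on the fly.
# Note: only a consistent image if the host is quiesced (VMs drained,
# writes stopped) before this runs.
ssh "root@${host}" "dd if=${disk} bs=4M status=progress | zstd -3 -c" > "${out}"

# The run-book rollback is the reverse, run from a rescue environment:
#   zstd -d -c "${out}" | ssh "root@${host}" "dd of=${disk} bs=4M"
}}}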
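The drain step is the piece Fly doesn't have off-the-shelf tooling for yet (per the Q&A), so this is purely a shape sketch: every command below is a hypothetical placeholder for tooling that would have to be built on top of `flyd`.

{{{bash
#!/usr/bin/env bash
# Drain sketch: all helpers here are hypothetical placeholders.
set -euo pipefail

src="$1"        # worker being rebooted
holdover="$2"   # hold-over server receiving its VMs

for vm in $(list_vms "$src"); do
  # 1. expose a read-only replica of the VM's volume on the hold-over
  attach_ro_replica "$vm" "$holdover"
  # 2. move the VM itself; flyd / fly-proxy must start routing to the hold-over
  migrate_vm "$vm" "$src" "$holdover"
done

# 3. once the source has begun shutting down (no more writers), promote the
#    read-only replicas to rw on the hold-over
begin_shutdown "$src"
for vm in $(list_vms "$holdover"); do
  remount_rw "$vm"
done
}}}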
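Rough shape of what the `bash` + `parallel` version of the orchestrator could look like, wiring together the stagger, the 10-minute / 3-attempt verification policy, and the rollback. `rollback_server` stands in for the run-book rollback and `servers.txt` for the region's server list; both are assumptions.

{{{bash
#!/usr/bin/env bash
# Reboot orchestrator sketch: batched, staggered reboots with retry and
# rollback. rollback_server is a hypothetical hook into the run book.
set -euo pipefail

BATCH=2    # concurrent reboots; bounded by hold-over capacity

reboot_one() {
  local host="$1"
  for attempt in 1 2 3; do
    sleep "$((RANDOM % 60))"                 # randomized stagger within the batch
    ssh "root@${host}" reboot || true        # connection drops when it reboots
    # Wait roughly 10 minutes for the host to come back and pass system-check.
    for _ in $(seq 60); do
      sleep 10
      if ssh -o ConnectTimeout=5 "root@${host}" system-check; then
        return 0
      fi
    done
    echo "attempt ${attempt} failed for ${host}" >&2
  done
  rollback_server "$host"                    # hypothetical: run-book rollback
  return 1
}
export -f reboot_one

# GNU parallel runs at most $BATCH reboots at once; --joblog records outcomes.
parallel --jobs "$BATCH" --joblog reboot.log reboot_one :::: servers.txt
}}}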
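For the monitoring bullet, the simplest thing that could work is pushing gauges somewhere a dashboard can read them. A sketch assuming a Prometheus Pushgateway is reachable on the internal network; the hostname and metric names are made up for this note.

{{{bash
#!/usr/bin/env bash
# Emit progress metrics for the in-flight reboot (assumed Pushgateway host).
set -euo pipefail

push() {
  local region="$1" stage="$2" rebooted="$3" failed="$4"
  cat <<EOF | curl -s --data-binary @- "http://pushgateway.internal:9091/metrics/job/reboot/region/${region}"
# TYPE reboot_servers_rebooted gauge
reboot_servers_rebooted ${rebooted}
# TYPE reboot_servers_failed gauge
reboot_servers_failed ${failed}
# TYPE reboot_stage gauge
reboot_stage{stage="${stage}"} 1
EOF
}

push ams drain 3 0    # example: region ams, drain stage, 3 done, 0 failures
}}}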
Some other things that could go wrong include:
- Mass POSTing (every machine power-cycling at once) tripping a fuse in the data centre; do we have access, or a contact, in the event this happens?
- NTP clock drift in the VMs and / or the server (see the drift-check sketch below)
- Hardware failure on machines that have not been rebooted in a long time
- General software failures, which is why we have a comprehensive rollback plan
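A quick way to quantify the clock-drift concern after a reboot, assuming the hosts run chrony (adjust for ntpd / systemd-timesyncd):

{{{bash
#!/usr/bin/env bash
# Report each rebooted worker's offset from NTP time (assumes chrony).
set -euo pipefail

while read -r host; do
  # The "System time" line of `chronyc tracking` reports the offset from NTP.
  offset=$(ssh "root@${host}" chronyc tracking | awk '/^System time/ {print $4}')
  echo "${host}: ${offset}s offset"
done < servers.txt
}}}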