231221-1502.wiki
%title The Fly.io Reboot Plan
:work:priv:
%date 2023-12-21 15:02
%update 2023-12-21 17:05

Related to [[231212-1501|Infra Questions for Fly.io]] [[231221-1246|Fly.io's Answers to my Questions]]

= The Plan =
The plan is to do this per region, across roughly 40 regions. Another assumption I'm making is that we want these reboots done sooner rather than later.

- Gather information about each region
    - Have someone comb through ~2 weeks of metrics to determine the following:
        - How many network-attached volumes we can expect
        - When the region is generally under its lowest utilization
    - Identify the worker servers that are to be rebooted in each region
    - Use this information to come up with a time frame in which reboots would minimize disruption
- Develop a tool (using `bash` + `parallel`, or `ansible`, or `fcm`) that is capable of automating the reboot process (see the orchestrator sketch after this list)
    - Ensure this tool / server has access to reboot all the servers in the region
    - Ensure this tool can only be accessed by those authorized to use it (using SSH access restricted to a specific user, or a specific bastion host shared across all regions; see the access-control sketch after this list)
    - Create the necessary backups in case of failure
        - Backups in this case are for the server, not the VMs running on the server
        - Document the process and create a run book for executing rollbacks:
            - Where are the backups stored?
            - What's the command to execute the rollback?
            - How do we connect over WireGuard to get BMC access to the machine?
                - (This was another question I missed :/)
        - If servers boot from a local drive: use a tool like `dd` to create a backup of the entire drive (see the backup sketch after this list)
        - If servers boot from a networked drive: ensure the "old" configuration is still around to boot from
    - Load the list of servers to reboot (only those in this region), and the number of "hold-over" servers we can use to parallelize the reboot process
    - It should be able to drain currently operating VMs onto our "hold-over" server(s), including VMs with network drives attached (see the drain sketch after this list)
        - It was noted in the Q&A that Fly does not currently have a tool to move network drives en masse
        - Such a tool could work by mounting a read-only replica of the volume on the "hold-over" VM, and then remounting it as `rw` once the source's shutdown has begun
        - Ensure that `flyd` and `fly-proxy` are able to re-route traffic to the new temporary server
    - Reboot the server
    - Verify server operation using `system-check`
        - If the server does not come back after 10 minutes, mark it as a failure and try again. After 3 failures, roll back the server
    - Reinstate all previously running VMs on the rebooted server
    - Generate metrics for the in-progress reboot procedure so that it can be monitored: number of servers currently rebooted, stage in the process, number of failures, reason for failure, etc. (see the metrics sketch after this list)
- Peer review the tool, and notify team members of the scheduled reboot times for each region
- Dry run the tool in a staging environment that resembles the production regions, to verify our expectations
    - Simulate catastrophic failure in staging to test the backup procedures
- Spin up / purchase new server(s) in the region, which will host VMs drained from the servers being rebooted, and will act as the reboot coordinator of the region
    - The number of new servers we use determines how quickly we can reboot servers without downtime
    - Given there are about 5 servers per region, I would say we should have 1/5th of the total number of servers to be rebooted as hold-overs per region, which ensures all reboots in the region can happen in roughly an hour
    - If there are multiple, only one will coordinate reboots; the rest will host "hold-over" VMs
    - If we're rebooting multiple servers at once, we want to stagger / batch the process with some randomization so they're not all rebooting at the same time
    - (Sidenote: I realized I forgot to ask this question in the Q&A stage, but I'm assuming here that it's better to optimize for less downtime over lower cost)
- Dedicate a team or individual to be responsible for each region during the reboot process, watching the monitoring and logs, ready to act on failure by following the previously created run book
- Open up a communication channel (per region if necessary) for the reboot process
- Put up a status page for the rebooting services, and notify customers
- Execute the reboot once the previously selected time frame comes around
- Once complete and verified, spin down the coordination / hold-over servers
- Document the outcome of the reboots
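A minimal sketch of the access-control idea: an OpenSSH forced command pins each operator's key on the coordinator to the reboot tool, so SSH access to that user can run the tool and nothing else. The `rebooter` user and the tool path are placeholders I'm making up here.

{{{
# /home/rebooter/.ssh/authorized_keys on the coordinator
# ("rebooter" and /usr/local/bin/reboot-tool are hypothetical placeholders)
# The forced command means this key can only invoke the reboot tool.
command="/usr/local/bin/reboot-tool",no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-pty ssh-ed25519 AAAA... operator@laptop
}}}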
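For the local-boot case, the backup could be as simple as streaming the drive image off the host before its reboot window. A rough sketch, assuming root SSH from the coordinator and a `/backups` volume mounted there; the device path and compression choice are assumptions to confirm per host.

{{{bash
#!/usr/bin/env bash
# Back up a worker's boot drive before the reboot (paths are assumptions).
# Run from the coordinator; assumes root SSH access to the worker and a
# backup volume mounted at /backups.
set -euo pipefail

host="$1"                       # e.g. the worker's hostname
disk="/dev/nvme0n1"             # boot drive; confirm per host
out="/backups/${host}-$(date +%Y%m%d).img.zst"

# Stream the raw drive over SSH, compressing on the fly.
# Note: only a consistent image if the host is quiesced (VMs drained,
# writes stopped) before this runs.
ssh "root@${host}" "dd if=${disk} bs=4M status=progress | zstd -3 -c" > "${out}"

# The run-book rollback is the reverse, run from a rescue environment:
#   zstd -d -c "${out}" | ssh "root@${host}" "dd of=${disk} bs=4M"
}}}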
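The drain step is the piece Fly doesn't have off-the-shelf tooling for yet (per the Q&A), so this is purely a shape sketch: every command below is a hypothetical placeholder for tooling that would have to be built on top of `flyd`.

{{{bash
#!/usr/bin/env bash
# Drain sketch: all helpers here are hypothetical placeholders.
set -euo pipefail

src="$1"        # worker being rebooted
holdover="$2"   # hold-over server receiving its VMs

for vm in $(list_vms "$src"); do
  # 1. expose a read-only replica of the VM's volume on the hold-over
  attach_ro_replica "$vm" "$holdover"
  # 2. move the VM itself; flyd / fly-proxy must start routing to the hold-over
  migrate_vm "$vm" "$src" "$holdover"
done

# 3. once the source has begun shutting down (no more writers), promote the
#    read-only replicas to rw on the hold-over
begin_shutdown "$src"
for vm in $(list_vms "$holdover"); do
  remount_rw "$vm"
done
}}}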
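Rough shape of what the `bash` + `parallel` version of the orchestrator could look like, wiring together the stagger, the 10-minute / 3-attempt verification policy, and the rollback. `rollback_server` stands in for the run-book rollback and `servers.txt` for the region's server list; both are assumptions.

{{{bash
#!/usr/bin/env bash
# Reboot orchestrator sketch: batched, staggered reboots with retry and
# rollback. rollback_server is a hypothetical hook into the run book.
set -euo pipefail

BATCH=2    # concurrent reboots; bounded by hold-over capacity

reboot_one() {
  local host="$1"
  for attempt in 1 2 3; do
    sleep "$((RANDOM % 60))"                 # randomized stagger within the batch
    ssh "root@${host}" reboot || true        # connection drops when it reboots
    # Wait roughly 10 minutes for the host to come back and pass system-check.
    for _ in $(seq 60); do
      sleep 10
      if ssh -o ConnectTimeout=5 "root@${host}" system-check; then
        return 0
      fi
    done
    echo "attempt ${attempt} failed for ${host}" >&2
  done
  rollback_server "$host"                    # hypothetical: run-book rollback
  return 1
}
export -f reboot_one

# GNU parallel runs at most $BATCH reboots at once; --joblog records outcomes.
parallel --jobs "$BATCH" --joblog reboot.log reboot_one :::: servers.txt
}}}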
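For the monitoring bullet, the simplest thing that could work is pushing gauges somewhere a dashboard can read them. A sketch assuming a Prometheus Pushgateway is reachable on the internal network; the hostname and metric names are made up for this note.

{{{bash
#!/usr/bin/env bash
# Emit progress metrics for the in-flight reboot (assumed Pushgateway host).
set -euo pipefail

push() {
  local region="$1" stage="$2" rebooted="$3" failed="$4"
  cat <<EOF | curl -s --data-binary @- "http://pushgateway.internal:9091/metrics/job/reboot/region/${region}"
# TYPE reboot_servers_rebooted gauge
reboot_servers_rebooted ${rebooted}
# TYPE reboot_servers_failed gauge
reboot_servers_failed ${failed}
# TYPE reboot_stage gauge
reboot_stage{stage="${stage}"} 1
EOF
}

push ams drain 3 0    # example: region ams, drain stage, 3 done, 0 failures
}}}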
Some other things that could go wrong include:
- Mass POSTing (every machine power-cycling at once) tripping a fuse in the data centre; do we have access, or a contact, in the event this happens?
- NTP clock drift in the VMs and / or the server (see the drift-check sketch below)
- Hardware failure on machines that have not been rebooted in a long time
- General software failures, which is why we have a comprehensive rollback plan
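A quick way to quantify the clock-drift concern after a reboot, assuming the hosts run chrony (adjust for ntpd / systemd-timesyncd):

{{{bash
#!/usr/bin/env bash
# Report each rebooted worker's offset from NTP time (assumes chrony).
set -euo pipefail

while read -r host; do
  # The "System time" line of `chronyc tracking` reports the offset from NTP.
  offset=$(ssh "root@${host}" chronyc tracking | awk '/^System time/ {print $4}')
  echo "${host}: ${offset}s offset"
done < servers.txt
}}}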