# Simulating failures in Arti

This document explains how to simulate different kinds of bootstrapping and
network failures in Arti.

The main reason for simulating failures is to ensure that Arti's
behavior is "generally reasonable" when the network is down or
misbehaving, when the local host is set up in a confusing way, etc.

Here "generally reasonable" should mean that we aren't making a huge
number of connections to the network or wasting a huge amount of
bandwidth. Similarly, we shouldn't be using huge amounts of CPU, or
filling up the logs at level `info` or higher.

It's an extra benefit if we can ensure that our bootstrap reporting
mechanisms give us accurate feedback in these cases, and diagnose the
problem accurately.

Most of the examples here will use the `arti-testing` tool. Some will
also use a small Chutney network. In either case, you'll need an
explicit client configuration, since `arti-testing` doesn't want you to
use the default; I'll assume you've put its location in `${ARTI_CONF}`.

Note that you shouldn't _need_ to use chutney in these cases if Arti is
in fact well-behaved. However, it's courteous to do so if you think
there might be problems in Arti's behavior: you wouldn't want to flood
the real network.

I'll be assuming that you have a Linux environment.

## What to look at

The output from `arti-testing` will tell you whether bootstrapping
succeeded or failed. If bootstrapping is not expected to succeed, try
adding `--timeout ${DELAY} --expect timeout` to indicate that the
operation isn't supposed to succeed, and should eventually time out.

If bootstrapping or connecting succeeds when it shouldn't, then the test
was wrong: we were trying to make success impossible, but somehow it
succeeded anyway.

When we're done, `arti-testing` will tell us some statistics about TCP
connections and log messages.
Here is an example of a not-too-bad
attempt to bootstrap over 30 seconds:

```
TCP stats: TcpCount { n_connect_attempt: 1, n_connect_ok: 1, n_accept: 0, n_bytes_send: 17223, n_bytes_recv: 59092 }
Total events: Trace: 159, Debug: 14, Info: 16, Warn: 8, Error: 0
```

And here's an example of obviously problematic behavior over a similar
period:

```
Timeout occurred [as expected]
TCP stats: TcpCount { n_connect_attempt: 1220, n_connect_ok: 1220, n_accept: 0, n_bytes_send: 1394460, n_bytes_recv: 4267636 }
Total events: Trace: 13431, Debug: 2088, Info: 2383, Warn: 15, Error: 0
```


## Failures related to time

These require the [`faketime`] tool.

### System clock set wrong, no directory cached

Start with an empty cache. Optionally, start with an empty state file.
Then run:

`faketime ${WHEN} arti-testing bootstrap -c ${ARTI_CONF} --timeout 30`


Try this with different values of `WHEN`:
  * '4 hours ago'
  * '1 day ago'
  * '1 month ago'
  * '1 day'
  * '1 month'
  * '1 year'

### System clock set wrong, live directory cached

Start with an empty cache. Optionally, start with an empty state file.
Then run:

`arti-testing bootstrap -c ${ARTI_CONF}`

This should succeed. Now run:

```
faketime ${WHEN} arti-testing connect -c ${ARTI_CONF} \
     --target www.torproject.org:80 \
     --timeout 30 --retry 0
```

Try this with different values of `WHEN` as above. This simulates a
case where we previously bootstrapped with a reasonably live directory,
but we wound up with a wrong clock when we restarted.

### System clock set wrong, obsolete directory cached

You can simulate this with a directory that you made before, then
copied into your cache directory. Use `faketime` to set the current
time to a point at which the directory was valid, or recently valid.
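
The copy-then-skew procedure for an obsolete cached directory could be scripted roughly as follows. This is only a sketch: the `restore_cache` helper and the `SAVED_CACHE` and `ARTI_CACHE` variables are hypothetical names, not part of Arti or `arti-testing`. `ARTI_CACHE` stands for whatever cache directory your `${ARTI_CONF}` points at, and the timestamp is a placeholder for a time at which your saved directory was valid.

```shell
# Copy a previously saved directory cache into the live cache directory.
# Files with the same name are overwritten; other files are left alone,
# so start from an empty cache if you want an exact copy.
restore_cache() {
    saved="$1"   # hypothetical: where you kept the old cache
    cache="$2"   # hypothetical: the cache directory named in your config
    if [ ! -d "$saved" ]; then
        echo "no saved cache at '$saved'" >&2
        return 1
    fi
    mkdir -p "$cache"
    # The trailing "/." copies the directory's contents, dotfiles included.
    cp -a "$saved"/. "$cache"/
}

# Example use (not run here; the date is a placeholder):
#
#   restore_cache "${SAVED_CACHE}" "${ARTI_CACHE}"
#   faketime '2022-03-01 12:00:00' arti-testing connect -c "${ARTI_CONF}" \
#        --target www.torproject.org:80 \
#        --timeout 30 --retry 0
```

Keeping the copy step in a function makes it easy to repeat the same run with several different fake times against an identical starting cache.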

Note that this test won't work well with chutney, since chutney
directory lifetimes are very short.

TODO: Describe better ways to do this.

## Failures related to the network

The `arti-testing` tool can simulate multiple kinds of errors:
  * connections fail immediately (or after a little while)
    (`--tcp-failure error --tcp-failure-delay 1`)
  * connections time out and never succeed (`--tcp-failure timeout`)
  * connections succeed, but drop all data and say
    nothing. (`--tcp-failure blackhole`)

You can arrange for these failures to start in the bootstrap phase
(`--tcp-failure-stage bootstrap`) or in the connect stage
(`--tcp-failure-stage connect`).

With these options, you can simulate different kinds of failures by
starting with an empty directory cache (and optionally empty state).
The bootstrap phase failures correspond to failures on your fallback
directories; the connect-phase failures correspond to failures on the
live network.

(TODO: There's an issue here where if you have open connections to the
fallbacks, the TCP-failure code won't yet make them start failing when
you connect to the network. As a workaround, bootstrap in a separate
`arti-testing` call, then connect with TCP failures enabled.)

Here's an example of failing during bootstrapping. (Clear your cache
first.)

`arti-testing bootstrap -c ${ARTI_CONF} --timeout 30 --tcp-failure error`

Here's an example of failing after bootstrapping. (Clear your cache
before the first command.)

```
# This one should succeed
arti-testing bootstrap -c ${ARTI_CONF}

# This will fail.
arti-testing connect -c ${ARTI_CONF} \
     --target www.torproject.org:80 \
     --timeout 30 --retry 0 \
     --tcp-failure blackhole
```

## Partial network blocking

You can make the above network failures conditional, to simulate
different kinds of broken local networks. Try `--tcp-failure-on v4` to
simulate an IPv6-only network, or `--tcp-failure-on non443` to simulate
a network that blocks everything but HTTPS.

(These won't work with chutney networks, since a typical chutney
network's relays are all on IPv4 with high ports.)


## Network identity mismatch

One way to get an interesting set of failures is to mix-and-match the
`arti.toml` files from two different chutney networks. You can find older
chutney networks in subdirectories of `${CHUTNEY_PATH}/net/` other than
`nodes`.

If you use an older set of fallback directories, you'll simulate the
case where the client can't actually connect to any fallback
directories because its beliefs about their identities are all wrong.

If you keep the running set of fallback directories, but use the older
set of authorities, you'll simulate the case where the client fetches a
directory, but doesn't believe in any authorities that signed it.

(For both of these cases, start with an empty cache and use the
`arti-testing bootstrap` command.)


# TODO


arti-testing:
  - Ability to clear cache and/or state.
  - Fresh client for connecting.
  - Ability to close after a little while.
  - Directory munger.
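
Until `arti-testing` can clear the cache itself, the "start with an empty cache" step has to be done by hand. Here is a minimal sketch of one way to do that. The `clear_dir` helper and the `ARTI_CACHE` variable are hypothetical names, not part of Arti; `ARTI_CACHE` stands for whatever cache directory your `${ARTI_CONF}` points at.

```shell
# Remove everything inside a directory, but keep the directory itself.
# Refuses an empty argument or "/" so that a missing variable cannot
# turn into a destructive command.
clear_dir() {
    dir="$1"
    case "$dir" in
        ""|"/")
            echo "refusing to clear '$dir'" >&2
            return 1
            ;;
    esac
    # Nothing to do if the directory does not exist yet.
    [ -d "$dir" ] || return 0
    # -mindepth 1 skips the directory itself; -delete removes children first.
    find "$dir" -mindepth 1 -delete
}

# Example use (not run here), before a bootstrap that is expected to fail:
#
#   clear_dir "${ARTI_CACHE}"
#   arti-testing bootstrap -c "${ARTI_CONF}" --timeout 30 --tcp-failure error
```

The same helper can be pointed at the state directory for the runs that also want empty state.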