# Simulating failures in Arti

This document explains how to simulate different kinds of bootstrapping and
network failures in Arti.

The main reason for simulating failures is to ensure that Arti's
behavior is "generally reasonable" when the network is down or
misbehaving, when the local host is set up in a confusing way, etc.

Here "generally reasonable" should mean that we aren't making a huge
number of connections to the network or wasting a huge amount of
bandwidth.  Similarly, we shouldn't be using huge amounts of CPU, or
filling up the logs at level `info` or higher.

It's an extra benefit if we can ensure that our bootstrap reporting
mechanisms give us accurate feedback in these cases, and diagnose the
problem accurately.

Most of the examples here will use the `arti-testing` tool.  Some will
also use a small Chutney network.  In either case, you'll need an
explicit client configuration, since `arti-testing` doesn't want you to
use the default; I'll assume you've put its location in `${ARTI_CONF}`.

Note that you shouldn't _need_ to use chutney in these cases if Arti is
in fact well-behaved.  However, it's courteous to do so if you think
there might be problems in Arti's behavior: you wouldn't want to flood
the real network.

I'll be assuming that you have a Linux environment.

## What to look at

The output from `arti-testing` will tell you whether bootstrapping
succeeded or failed.  If bootstrapping is not expected to succeed, try
adding `--timeout ${DELAY} --expect timeout` to indicate that the
operation isn't supposed to succeed, and should eventually time out.

If bootstrapping or connecting succeeds when it shouldn't, then the test
was wrong: we were trying to make success impossible, but somehow it
succeeded anyway.

When we're done, `arti-testing` will tell us some statistics about TCP
connections and log messages.  Here is an example of a not-too-bad
attempt to bootstrap over 30 seconds:

```
TCP stats: TcpCount { n_connect_attempt: 1, n_connect_ok: 1, n_accept: 0, n_bytes_send: 17223, n_bytes_recv: 59092 }
Total events: Trace: 159, Debug: 14, Info: 16, Warn: 8, Error: 0
```

And here's an example of obviously problematic behavior over a similar
period:

```
Timeout occurred [as expected]
TCP stats: TcpCount { n_connect_attempt: 1220, n_connect_ok: 1220, n_accept: 0, n_bytes_send: 1394460, n_bytes_recv: 4267636 }
Total events: Trace: 13431, Debug: 2088, Info: 2383, Warn: 15, Error: 0
```
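
When comparing many runs, it can help to pull individual counters out of
the statistics line instead of reading them by eye.  Here is a minimal
sketch, assuming the `TCP stats:` line keeps the format shown above.  The
helper names `tcp_stat` and `connect_attempts_ok` are hypothetical, and
the limit you pass is your own judgment call, not something
`arti-testing` defines.

```shell
#!/bin/sh
# Print one counter (for example n_connect_attempt) from the "TCP stats:"
# line of arti-testing output read on standard input.
tcp_stat() {
    sed -n "s/^TCP stats:.*[{ ]$1: \([0-9][0-9]*\).*/\1/p"
}

# Succeed only if a "TCP stats:" line was found and its number of
# connection attempts is at or below the given limit.
connect_attempts_ok() {
    limit=$1
    attempts=$(tcp_stat n_connect_attempt)
    [ -n "$attempts" ] && [ "$attempts" -le "$limit" ]
}
```

If you save a run's output to a file (say `run.log`, a name chosen here
for illustration), then `connect_attempts_ok 50 < run.log` exits with
status zero for the first example above and nonzero for the second.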

## Failures related to time

These require the [`faketime`] tool.

### System clock set wrong, no directory cached

Start with an empty cache.  Optionally, start with an empty state file.
Then run:

`faketime "${WHEN}" arti-testing bootstrap -c ${ARTI_CONF} --timeout 30`

(Keep the quotes around `${WHEN}`: most of the values below contain
spaces, and `faketime` expects the timestamp as a single argument.)

Try this with different values of `WHEN`:
 * '4 hours ago'
 * '1 day ago'
 * '1 month ago'
 * '1 day'
 * '1 month'
 * '1 year'
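
To run through the whole list in one go, you could wrap the command in a
loop.  This is only a sketch: `clock_skew_bootstrap` is a hypothetical
helper name, and `CACHE_DIR` is an assumption that must point at the
cache directory named in your `${ARTI_CONF}`.  It clears the cache
before every run but leaves the state file alone.

```shell
#!/bin/sh
# Run the empty-cache bootstrap test once for each clock offset.
# CACHE_DIR is an assumption: set it to the cache directory from ${ARTI_CONF}.
clock_skew_bootstrap() {
    for when in '4 hours ago' '1 day ago' '1 month ago' \
                '1 day' '1 month' '1 year'; do
        echo "== clock offset: ${when} =="
        # Start each run with an empty cache; refuse to run if CACHE_DIR is unset.
        rm -rf "${CACHE_DIR:?set CACHE_DIR to the cache directory}"/*
        faketime "${when}" arti-testing bootstrap -c "${ARTI_CONF}" \
                --timeout 30 || echo "(exit status $?)"
    done
}
```

After each run, compare the TCP and log statistics against the examples
in "What to look at".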

### System clock set wrong, live directory cached

Start with an empty cache.  Optionally, start with an empty state file.
Then run:

`arti-testing bootstrap -c ${ARTI_CONF}`

This should succeed.  Now run:

```
faketime "${WHEN}" arti-testing connect -c ${ARTI_CONF} \
        --target www.torproject.org:80 \
        --timeout 30 --retry 0
```

Try this with different values of `WHEN` as above.  This simulates a
case where we previously bootstrapped with a reasonably live directory,
but we wound up with a wrong clock when we restarted.

### System clock set wrong, obsolete directory cached

You can simulate this with a directory that you made before, then
copied into your cache directory.  Use `faketime` to set the current
time to a point at which the directory was valid, or recently valid.

Note that this test won't work well with chutney, since chutney
directory lifetimes are very short.

TODO: Describe better ways to do this.

## Failures related to the network

The `arti-testing` tool can simulate multiple kinds of errors:
 * connections fail immediately (or after a little while)
   (`--tcp-failure error --tcp-failure-delay 1`)
 * connections time out and never succeed (`--tcp-failure timeout`)
 * connections succeed, but drop all data and say
   nothing. (`--tcp-failure blackhole`)

You can arrange for these failures to start in the bootstrap phase
(`--tcp-failure-stage bootstrap`) or in the connect stage
(`--tcp-failure-stage connect`).

With these options, you can simulate different kinds of failures by
starting with an empty directory cache (and optionally empty state).
The bootstrap phase failures correspond to failures on your fallback
directories; the connect-phase failures correspond to failures on the
live network.

(TODO: There's an issue here where if you have open connections to the
fallbacks, the TCP-failure code won't yet make them start failing when
you connect to the network.  As a workaround, bootstrap in a separate
`arti-testing` call, then connect with TCP failures enabled.)

Here's an example of failing during bootstrapping.  (Clear your cache
first.)

`arti-testing bootstrap -c ${ARTI_CONF} --timeout 30 --tcp-failure error`

Here's an example of failing after bootstrapping.  (Clear your cache
before the first command.)

```
# This one should succeed
arti-testing bootstrap -c ${ARTI_CONF}

# This will fail.
arti-testing connect -c ${ARTI_CONF} \
        --target www.torproject.org:80 \
        --timeout 30 --retry 0 \
        --tcp-failure blackhole
```
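
The same two-step pattern can be repeated for each failure mode listed
above.  The following is a sketch rather than a supported workflow:
`connect_failure_runs` is a hypothetical helper name, `ARTI_TESTING` is a
hypothetical variable for pointing at the binary, and `CACHE_DIR` is an
assumption that must match the cache directory in your `${ARTI_CONF}`.

```shell
#!/bin/sh
# Name of the test binary; override it if arti-testing is not on your PATH.
ARTI_TESTING=${ARTI_TESTING:-arti-testing}

# For each failure mode, bootstrap cleanly in one call, then try to
# connect with TCP failures enabled in a second call.
# CACHE_DIR is an assumption: set it to the cache directory from ${ARTI_CONF}.
connect_failure_runs() {
    for mode in error timeout blackhole; do
        echo "== tcp-failure: ${mode} =="
        rm -rf "${CACHE_DIR:?set CACHE_DIR to the cache directory}"/*
        # This one should succeed; give up if it does not.
        "${ARTI_TESTING}" bootstrap -c "${ARTI_CONF}" || return 1
        # This one should fail.
        "${ARTI_TESTING}" connect -c "${ARTI_CONF}" \
                --target www.torproject.org:80 \
                --timeout 30 --retry 0 \
                --tcp-failure "${mode}" || echo "(exit status $?)"
    done
}
```

Bootstrapping in a separate call each time also sidesteps the
open-connection issue described in the TODO above.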

## Partial network blocking

You can make the above network failures conditional, to simulate
different kinds of broken local networks.  Try `--tcp-failure-on v4` to
simulate a network where IPv4 connections fail (leaving only IPv6), or
`--tcp-failure-on non443` to simulate a network that blocks everything
but port 443 (HTTPS).

(These won't work with chutney networks, since a typical chutney
network's relays are all on IPv4 with high ports.)
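
As an illustration, the conditions can be combined with any of the
failure modes from the previous section.  This sketch pairs them with
`--tcp-failure error`, which is an arbitrary choice; `partial_block_runs`,
`ARTI_TESTING`, and `CACHE_DIR` are hypothetical names, and `CACHE_DIR`
must match the cache directory in your `${ARTI_CONF}`.

```shell
#!/bin/sh
# Name of the test binary; override it if arti-testing is not on your PATH.
ARTI_TESTING=${ARTI_TESTING:-arti-testing}

# Bootstrap once per condition, failing only the matching connections.
# CACHE_DIR is an assumption: set it to the cache directory from ${ARTI_CONF}.
partial_block_runs() {
    for cond in v4 non443; do
        echo "== tcp-failure-on: ${cond} =="
        rm -rf "${CACHE_DIR:?set CACHE_DIR to the cache directory}"/*
        "${ARTI_TESTING}" bootstrap -c "${ARTI_CONF}" --timeout 30 \
                --tcp-failure error --tcp-failure-on "${cond}" \
            || echo "(exit status $?)"
    done
}
```

Whether a given run bootstraps depends on which relays and fallbacks
remain reachable under that condition, so the sketch reports the exit
status instead of assuming one.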

## Network identity mismatch

One way to get an interesting set of failures is to mix-and-match the
`arti.toml` files from two different chutney networks.  You can find older
chutney networks in subdirectories of `${CHUTNEY_PATH}/net/` other than
`nodes`.

If you use an older set of fallback directories, you'll simulate the
case where the client can't actually connect to any fallback
directories because its beliefs about their identities are all wrong.

If you keep the running set of fallback directories, but use the older
set of authorities, you'll simulate the case where the client fetches a
directory, but doesn't believe in any authorities that signed it.

(For both of these cases, start with an empty cache and use the
`arti-testing bootstrap` command.)

# TODO

`arti-testing`:
- Ability to clear cache and/or state.
- Fresh client for connecting.
- Ability to close after a little while.
- Directory munger.