# The Go Performance Optimization Loop: From Benchmarks to Zero Allocations

When optimizing Go code for performance, particularly in hot paths like
cryptographic operations or protocol handling, the journey from identifying
bottlenecks to achieving zero-allocation code follows a well-defined
methodology. This document walks through the complete optimization loop using
Go's built-in tooling, demonstrating how to systematically eliminate allocations
and improve performance.

## Understanding the Performance Baseline

The first step in any optimization effort is establishing a measurable baseline.
Go's benchmark framework provides the foundation for this measurement. When
writing benchmarks for allocation-sensitive code, call `b.ReportAllocs()` so the
benchmark reports allocation statistics alongside timing, and call
`b.ResetTimer()` after setup so that setup costs are excluded from the
measurements.

Consider a benchmark that exercises a cryptographic write path with the largest
possible message size to stress-test allocations:

```go
func BenchmarkWriteMessage(b *testing.B) {
    // Setup code here: build the cipher state and the largest
    // allowed payload before any measurement begins...

    b.ReportAllocs() // essential for tracking allocations
    b.ResetTimer()   // exclude the setup above from the timings

    for i := 0; i < b.N; i++ {
        // Hot path being measured
    }
}
```

Running the benchmark with `go test -bench=BenchmarkWriteMessage -benchmem
-count=10` provides statistical confidence through multiple runs. The
`-benchmem` flag is redundant for benchmarks that already call
`b.ReportAllocs()`, but it enables allocation reporting for every benchmark in
the run. The output reveals three critical metrics: nanoseconds per operation,
bytes allocated per operation, and the number of heap allocations per
operation.
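
A typical run prints one line per benchmark run with the three metrics in that
order (the figures below are illustrative, echoing the pre-optimization numbers
used later in this document):

```
BenchmarkWriteMessage-16    23718    50340 ns/op    73788 B/op    5 allocs/op
```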

## Profiling Memory Allocations

Once you have baseline measurements showing undesirable allocations, the next
phase involves profiling to understand where these allocations originate.
Generate memory profiles during benchmark execution using:

```
go test -bench=BenchmarkWriteMessage -memprofile=mem.prof -cpuprofile=cpu.prof -count=1
```

The resulting profile can be analyzed through several lenses. To see which
functions allocate the most memory by total bytes, use
`go tool pprof -alloc_space -top mem.prof`.

However, for understanding allocation frequency rather than size,
`go tool pprof -alloc_objects -top mem.prof` often provides more actionable
insights, especially when hunting small but frequent allocations.

Here's what the allocation object analysis might reveal:

```
$ go tool pprof -alloc_objects -top mem.prof | head -20
File: brontide.test
Type: alloc_objects
Time: Aug 30, 2024 at 2:07pm (WEST)
Showing nodes accounting for 39254, 100% of 39272 total
Dropped 32 nodes (cum <= 196)
      flat  flat%   sum%        cum   cum%
     32768 83.44% 83.44%      32768 83.44%  github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
      5461 13.91% 97.34%       5461 13.91%  runtime.acquireSudog
      1025  2.61%   100%       1025  2.61%  runtime.allocm
```

This output immediately shows that `cipherState.Encrypt` is responsible for 83%
of allocations by count, which narrows the investigation considerably.

The most powerful profiling technique involves examining allocations at the
source line level. Running `go tool pprof -list 'FunctionName' mem.prof` shows
exactly which lines within a function trigger heap allocations:

```
$ go tool pprof -list 'cipherState.*Encrypt' mem.prof
Total: 8.73MB
ROUTINE ======================== github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
  512.01kB   512.01kB (flat, cum)  5.73% of Total
         .          .    111:func (c *cipherState) Encrypt(associatedData, cipherText, plainText []byte) []byte {
         .          .    112:	defer func() {
         .          .    113:		c.nonce++
         .          .    114:
         .          .    115:		if c.nonce == keyRotationInterval {
         .          .    116:			c.rotateKey()
         .          .    117:		}
         .          .    118:	}()
         .          .    119:
  512.01kB   512.01kB    120:	var nonce [12]byte
         .          .    121:	binary.LittleEndian.PutUint64(nonce[4:], c.nonce)
         .          .    122:
         .          .    123:	return c.cipher.Seal(cipherText, nonce[:], plainText, associatedData)
```

This granular view reveals that line 120, a seemingly innocent stack array
declaration, is allocating 512KB total across all benchmark iterations.

## CPU Profiling for Hot Spots

While memory allocations often dominate optimization efforts, CPU profiling
reveals where computational time is spent. The CPU profile generated alongside
the memory profile provides complementary insights:

```
$ go tool pprof -top cpu.prof | head -15
File: brontide.test
Type: cpu
Time: Aug 30, 2024 at 2:07pm (WEST)
Duration: 1.8s, Total samples = 1.71s (94.40%)
Showing nodes accounting for 1.65s, 96.49% of 1.71s total
      flat  flat%   sum%        cum   cum%
     0.51s 29.82% 29.82%      0.51s 29.82%  vendor/golang.org/x/crypto/chacha20poly1305.(*chacha20poly1305).sealGeneric
     0.28s 16.37% 46.20%      0.28s 16.37%  vendor/golang.org/x/crypto/internal/poly1305.updateGeneric
     0.24s 14.04% 60.23%      0.24s 14.04%  vendor/golang.org/x/crypto/chacha20.(*Cipher).XORKeyStream
     0.19s 11.11% 71.35%      0.19s 11.11%  runtime.memmove
     0.12s  7.02% 78.36%      0.86s 50.29%  github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
```

This profile shows that cryptographic operations dominate CPU usage, which is
expected. However, note the presence of `runtime.memmove` at 11%: this often
indicates unnecessary copying that could be eliminated through careful buffer
management.

For line-level CPU analysis of a specific function:

```
$ go tool pprof -list 'WriteMessage' cpu.prof
Total: 1.71s
ROUTINE ======================== github.com/lightningnetwork/lnd/brontide.(*Machine).WriteMessage
      10ms      1.21s (flat, cum) 70.76% of Total
         .          .    734:func (b *Machine) WriteMessage(p []byte) error {
         .          .    735:	if len(p) > math.MaxUint16 {
         .          .    736:		return ErrMaxMessageLengthExceeded
         .          .    737:	}
         .          .    738:
         .       10ms    739:	if len(b.nextHeaderSend) > 0 || len(b.nextBodySend) > 0 {
         .          .    740:		return ErrMessageNotFlushed
         .          .    741:	}
         .          .    742:
      10ms       10ms    743:	fullLength := uint16(len(p))
         .          .    744:	var pktLen [2]byte
         .       10ms    745:	binary.BigEndian.PutUint16(pktLen[:], fullLength)
         .          .    746:
         .      580ms    747:	b.nextHeaderSend = b.sendCipher.Encrypt(nil, nil, pktLen[:])
         .      600ms    748:	b.nextBodySend = b.sendCipher.Encrypt(nil, nil, p)
```

This shows that the two `Encrypt` calls consume virtually all the CPU time in
`WriteMessage`, confirming that cryptographic operations are the bottleneck
rather than the message handling logic itself.

## Understanding Escape Analysis

When the profiler indicates that seemingly stack-local variables are being heap
allocated, escape analysis becomes your next investigative tool. The Go
compiler's escape analysis determines whether variables can remain on the stack
or must be moved to the heap. Variables escape to the heap when their lifetime
extends beyond the function that creates them or when the compiler cannot prove
they remain local.

To see the compiler's escape analysis decisions, build with verbose flags:

```
go build -gcflags="-m" ./...
```

For more detailed output, including the reasons for escape, use `-m=2`. The
output reveals escape flows, showing exactly why variables move to the heap.
When investigating specific escapes, you can grep for the variable in question:

```
$ go build -gcflags="-m=2" ./... 2>&1 | grep -A2 -B2 "nonce escapes"
./noise.go:183:17: &errors.errorString{...} does not escape
./noise.go:183:17: new(chacha20poly1305.chacha20poly1305) escapes to heap
./noise.go:120:6: nonce escapes to heap:
./noise.go:120:6:   flow: {heap} = &nonce:
./noise.go:120:6:     from nonce (address-of) at ./noise.go:123:40
--
./noise.go:469:21: &keychain.PrivKeyECDH{...} escapes to heap
./noise.go:483:40: []byte{} escapes to heap
./noise.go:138:6: nonce escapes to heap:
./noise.go:138:6:   flow: {heap} = &nonce:
./noise.go:138:6:     from nonce (address-of) at ./noise.go:141:39
```

This output shows the exact flow analysis: the nonce array escapes because its
address is taken when creating a slice (`nonce[:]`) and passed to a function
that the compiler cannot fully analyze.

Common causes include passing pointers to interfaces, storing references in
heap-allocated structures, or passing slices of stack arrays to functions that
might retain them. A particularly instructive example is the deceptively simple
pattern of passing a stack array to a function:

```go
var nonce [12]byte
binary.LittleEndian.PutUint64(nonce[4:], counter)
return cipher.Seal(ciphertext, nonce[:], plaintext, nil)
```

Here, `nonce[:]` creates a slice backed by the stack array, but if the compiler
cannot prove that `cipher.Seal` won't retain a reference to this slice, the
entire array escapes to the heap.
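
To see the mechanism in isolation, here is a minimal, self-contained sketch
(the `sink` variable is hypothetical, not from the original code) whose escape
the compiler reports when built with `go build -gcflags="-m"`:

```go
package main

// sink outlives any single call, so anything stored in it must live on
// the heap.
var sink []byte

func main() {
    var nonce [12]byte // reported as "moved to heap: nonce"
    sink = nonce[:]    // taking a slice publishes the array's address
}
```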

## The Optimization Strategy

Armed with profiling data and escape analysis insights, the optimization phase
begins. The general strategy for eliminating allocations follows a predictable
pattern: move temporary buffers from function scope to longer-lived structures,
typically as fields in the enclosing type. This transformation changes
allocation from per-operation to per-instance.

For the nonce example above, the optimization involves adding a buffer field to
the containing struct:

```go
type cipherState struct {
    // ... other fields ...
    nonceBuffer [12]byte // reusable buffer to avoid per-call allocations
}

func (c *cipherState) Encrypt(associatedData, cipherText, plainText []byte) []byte {
    binary.LittleEndian.PutUint64(c.nonceBuffer[4:], c.nonce)
    return c.cipher.Seal(cipherText, c.nonceBuffer[:], plainText, associatedData)
}
```
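
Re-running escape analysis confirms the change: with the array moved into the
receiver, there is no local `nonce` variable left to escape, so a check along
these lines should come back empty:

```bash
go build -gcflags="-m=2" ./... 2>&1 | grep "nonce escapes"  # no output after the fix
```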

This pattern extends to any temporary buffer. When dealing with variable-sized
data up to a known maximum, pre-allocate buffers at that maximum size and
reslice them as needed. The key insight is that slicing to zero length keeps
the backing array's full capacity available for `append`, while the three-index
slice notation lets you cap capacity independently of length:

```go
// Pre-allocated: var buffer [maxSize]byte

// Zero-length slice with full capacity, ready for append:
slice := buffer[:0] // len=0, cap=maxSize

// Three-index form caps capacity so appends past n reallocate
// instead of overwriting buffer[n:]:
prefix := buffer[:0:n] // len=0, cap=n
```
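
Putting the two ideas together, here is a hedged sketch (names and sizes are
illustrative, not from the original code) of a per-instance scratch buffer that
frames variable-sized payloads without allocating:

```go
const maxPayload = 65535 // assumed protocol maximum

type framer struct {
    scratch [2 + maxPayload]byte // length prefix plus largest possible body
}

// frame returns a length-prefixed view into the reusable scratch buffer.
// The result is only valid until the next call, which reuses the array.
func (f *framer) frame(payload []byte) []byte {
    out := f.scratch[:0] // len=0, cap=len(f.scratch): appends stay in place
    out = append(out, byte(len(payload)>>8), byte(len(payload)))
    return append(out, payload...)
}
```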

## Verification and Iteration

After implementing optimizations, the cycle returns to benchmarking. Run the
same benchmark to measure improvement, but don't stop at the aggregate numbers.
Generate new profiles to verify that specific allocations have been eliminated
and to identify any remaining allocation sites.
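
The zero-allocation property can also be guarded against regressions with a
test built on the standard library's `testing.AllocsPerRun`. A minimal sketch,
assuming a hypothetical `newTestMachine` helper that performs the same setup as
the benchmark, and flushing each message through a discarding writer:

```go
func TestWriteMessageNoAllocs(t *testing.T) {
    m := newTestMachine(t) // hypothetical setup helper
    payload := make([]byte, math.MaxUint16)

    allocs := testing.AllocsPerRun(100, func() {
        if err := m.WriteMessage(payload); err != nil {
            t.Fatal(err)
        }
        // WriteMessage refuses unflushed messages, so flush each one.
        if _, err := m.Flush(io.Discard); err != nil {
            t.Fatal(err)
        }
    })
    if allocs != 0 {
        t.Fatalf("WriteMessage allocated %v times per op, want 0", allocs)
    }
}
```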

The benchstat tool provides statistical comparison between runs:

```
go test -bench=BenchmarkWriteMessage -count=10 > old.txt
# Make optimizations
go test -bench=BenchmarkWriteMessage -count=10 > new.txt
benchstat old.txt new.txt
```
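
benchstat is not bundled with the Go distribution; it lives in the
golang.org/x/perf module:

```bash
go install golang.org/x/perf/cmd/benchstat@latest
```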

This comparison reveals not just whether performance improved, but whether the
improvement is statistically significant. A typical benchstat output after
successful optimization looks like:

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │              new.txt               │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.48µ ± 0%  -7.68% (p=0.000 n=10)

                │    old.txt     │              new.txt              │
                │      B/op      │    B/op     vs base                │
WriteMessage-16   73788.000 ± 0%   2.000 ± 0%  -100.00% (p=0.000 n=10)

                │  old.txt   │              new.txt              │
                │ allocs/op  │ allocs/op   vs base                │
WriteMessage-16   5.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
```

The key metrics to examine are:

- The percentage change (the "vs base" column) showing the magnitude of
  improvement
- The p-value (p=0.000) indicating statistical significance; values below 0.05
  suggest real improvements rather than noise
- The variance (± percentages) showing consistency across runs

This output confirms both a 7.68% speed improvement and complete elimination of
allocations, with high statistical confidence.

If allocations remain, the cycle continues. Profile again, identify the source,
understand why the allocation occurs through escape analysis, and apply the
appropriate optimization pattern. Each iteration should show measurable progress
toward the goal of zero allocations in the hot path.

## Advanced Techniques

When standard profiling doesn't reveal the allocation source, more advanced
techniques come into play. Memory profiling with different granularities can
help. Instead of looking at total allocations, examine the profile with
`go tool pprof -sample_index=alloc_objects` (for which the `-alloc_objects`
flag used earlier is shorthand) to focus on allocation count rather than size.
This distinction matters when hunting for small, frequent allocations that
might not show up prominently in byte-focused views.
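
For example:

```bash
go tool pprof -sample_index=alloc_objects -top mem.prof
```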

Additional pprof commands that prove invaluable during optimization:

```bash
# Interactive mode for exploring the profile
go tool pprof mem.prof
(pprof) top10        # Show top 10 memory consumers
(pprof) list regexp  # List functions matching regexp
(pprof) web          # Open visual graph in browser

# Serve an interactive web UI, including a flame graph view
go tool pprof -http=:8080 mem.prof

# Compare two profiles directly
go tool pprof -base=old.prof new.prof

# Show allocations only from specific packages
go tool pprof -focus=github.com/lightningnetwork/lnd/brontide mem.prof

# Select the sample type explicitly; -alloc_space, -alloc_objects,
# -inuse_space, and -inuse_objects are mutually exclusive selectors
go tool pprof -sample_index=alloc_space -top mem.prof
```

When dealing with stubborn allocations, checking what might be escaping to the
heap can be done more surgically:

```bash
# Check specific function or type for escapes
go build -gcflags="-m" 2>&1 | grep -E "(YourType|yourFunc)"

# See all heap allocations in a package
go build -gcflags="-m" 2>&1 | grep "moved to heap"

# Check which variables are confirmed to stay on the stack
go build -gcflags="-m=2" 2>&1 | grep "does not escape"
```

For particularly elusive allocations, instrumenting the code with runtime memory
statistics can provide real-time feedback:

```go
var m runtime.MemStats
runtime.ReadMemStats(&m)
before := m.TotalAlloc // cumulative bytes allocated; monotonic even across GCs

// Operation being measured

runtime.ReadMemStats(&m)
allocated := m.TotalAlloc - before
```

While this approach adds overhead and shouldn't be used in production, it can
help isolate allocations to specific code sections during development.

## The Zero-Allocation Goal

Achieving zero allocations in hot paths represents more than just a performance
optimization. It provides predictable latency, reduces garbage collection
pressure, and improves overall system behavior under load. In systems handling
thousands of operations per second, going from five allocations per operation
to zero can be the difference between smooth operation and periodic latency
spikes during garbage collection.

The journey from initial benchmark to zero-allocation code demonstrates the
power of Go's built-in tooling. By systematically applying the
benchmark-profile-optimize loop, even complex code paths can be transformed into
allocation-free implementations. The key lies not in guessing or premature
optimization, but in measuring, understanding, and methodically addressing each
allocation source.

It's best to focus optimization efforts on true hot paths identified through
production profiling or realistic load testing. The techniques described here
provide the tools to achieve zero-allocation code when it matters, but the
judgment of when to apply them remains a critical engineering decision.