# The Go Performance Optimization Loop: From Benchmarks to Zero Allocations

When optimizing Go code for performance, particularly in hot paths like cryptographic operations or protocol handling, the journey from identifying bottlenecks to achieving zero-allocation code follows a well-defined methodology. This document walks through the complete optimization loop using Go's built-in tooling, demonstrating how to systematically eliminate allocations and improve performance.

## Understanding the Performance Baseline

The first step in any optimization effort is establishing a measurable baseline. Go's benchmark framework provides the foundation for this measurement. When writing benchmarks for allocation-sensitive code, always include a call to `b.ReportAllocs()` before `b.ResetTimer()`. This ensures the benchmark reports both timing and allocation statistics without including setup costs in the measurements.

Consider a benchmark that exercises a cryptographic write path with the largest possible message size to stress-test allocations:

```go
func BenchmarkWriteMessage(b *testing.B) {
	// Setup code here...

	b.ReportAllocs() // Essential for tracking allocations
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		// Hot path being measured
	}
}
```

Running the benchmark with `go test -bench=BenchmarkWriteMessage -benchmem -count=10` provides statistical confidence through multiple runs. The `-benchmem` flag is redundant once `b.ReportAllocs()` has been called, but it doesn't hurt to include it explicitly. The output reveals three critical metrics: nanoseconds per operation, bytes allocated per operation, and the number of distinct allocations per operation.

## Profiling Memory Allocations

Once you have baseline measurements showing undesirable allocations, the next phase involves profiling to understand where these allocations originate.
Generate memory profiles during benchmark execution using:

```
go test -bench=BenchmarkWriteMessage -memprofile=mem.prof -cpuprofile=cpu.prof -count=1
```

The resulting profile can be analyzed through several lenses. To see which functions allocate the most memory by total bytes, use `go tool pprof -alloc_space -top mem.prof`. However, for understanding allocation frequency rather than size, `go tool pprof -alloc_objects -top mem.prof` often provides more actionable insight, especially when hunting small but frequent allocations.

Here's what the allocation-object analysis might reveal:

```
$ go tool pprof -alloc_objects -top mem.prof | head -20
File: brontide.test
Type: alloc_objects
Time: Aug 30, 2024 at 2:07pm (WEST)
Showing nodes accounting for 39254, 100% of 39272 total
Dropped 32 nodes (cum <= 196)
      flat  flat%   sum%        cum   cum%
     32768 83.44% 83.44%      32768 83.44%  github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
      5461 13.91% 97.34%       5461 13.91%  runtime.acquireSudog
      1025  2.61%   100%       1025  2.61%  runtime.allocm
```

This output immediately shows that `cipherState.Encrypt` is responsible for 83% of allocations by count, narrowing our investigation.

The most powerful profiling technique involves examining allocations at the source-line level. Running `go tool pprof -list 'FunctionName' mem.prof` shows exactly which lines within a function trigger heap allocations:

```
$ go tool pprof -list 'cipherState.*Encrypt' mem.prof
Total: 8.73MB
ROUTINE ======================== github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
  512.01kB   512.01kB (flat, cum)  5.73% of Total
         .          .    111:func (c *cipherState) Encrypt(associatedData, cipherText, plainText []byte) []byte {
         .          .    112:	defer func() {
         .          .    113:		c.nonce++
         .          .    114:
         .          .    115:		if c.nonce == keyRotationInterval {
         .          .    116:			c.rotateKey()
         .          .    117:		}
         .          .    118:	}()
         .          .    119:
  512.01kB   512.01kB    120:	var nonce [12]byte
         .          .    121:	binary.LittleEndian.PutUint64(nonce[4:], c.nonce)
         .          .    122:
         .          .    123:	return c.cipher.Seal(cipherText, nonce[:], plainText, associatedData)
```

This granular view reveals that line 120, a seemingly innocent stack-array declaration, accounts for 512KB of allocations across all benchmark iterations.

## CPU Profiling for Hot Spots

While memory allocations often dominate optimization efforts, CPU profiling reveals where computational time is spent. The CPU profile generated alongside the memory profile provides complementary insights:

```
$ go tool pprof -top cpu.prof | head -15
File: brontide.test
Type: cpu
Time: Aug 30, 2024 at 2:07pm (WEST)
Duration: 1.8s, Total samples = 1.71s (94.40%)
Showing nodes accounting for 1.65s, 96.49% of 1.71s total
      flat  flat%   sum%        cum   cum%
     0.51s 29.82% 29.82%      0.51s 29.82%  vendor/golang.org/x/crypto/chacha20poly1305.(*chacha20poly1305).sealGeneric
     0.28s 16.37% 46.20%      0.28s 16.37%  vendor/golang.org/x/crypto/internal/poly1305.updateGeneric
     0.24s 14.04% 60.23%      0.24s 14.04%  vendor/golang.org/x/crypto/chacha20.(*Cipher).XORKeyStream
     0.19s 11.11% 71.35%      0.19s 11.11%  runtime.memmove
     0.12s  7.02% 78.36%      0.86s 50.29%  github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
```

This profile shows that cryptographic operations dominate CPU usage, which is expected. However, note the presence of `runtime.memmove` at 11%; this often indicates unnecessary copying that could be eliminated through careful buffer management.

For line-level CPU analysis of a specific function:

```
$ go tool pprof -list 'WriteMessage' cpu.prof
Total: 1.71s
ROUTINE ======================== github.com/lightningnetwork/lnd/brontide.(*Machine).WriteMessage
      10ms      1.21s (flat, cum) 70.76% of Total
         .          .    734:func (b *Machine) WriteMessage(p []byte) error {
         .          .    735:	if len(p) > math.MaxUint16 {
         .          .    736:		return ErrMaxMessageLengthExceeded
         .          .    737:	}
         .          .    738:
         .       10ms    739:	if len(b.nextHeaderSend) > 0 || len(b.nextBodySend) > 0 {
         .          .    740:		return ErrMessageNotFlushed
         .          .    741:	}
         .          .    742:
      10ms       10ms    743:	fullLength := uint16(len(p))
         .          .    744:	var pktLen [2]byte
         .       10ms    745:	binary.BigEndian.PutUint16(pktLen[:], fullLength)
         .          .    746:
         .      580ms    747:	b.nextHeaderSend = b.sendCipher.Encrypt(nil, nil, pktLen[:])
         .      600ms    748:	b.nextBodySend = b.sendCipher.Encrypt(nil, nil, p)
```

This shows that the two `Encrypt` calls consume virtually all of the CPU time in `WriteMessage`, confirming that the cryptographic operations, rather than the message-handling logic itself, are the bottleneck.

## Understanding Escape Analysis

When the profiler indicates that seemingly stack-local variables are being heap allocated, escape analysis becomes your next investigative tool. The Go compiler's escape analysis determines whether variables can remain on the stack or must be moved to the heap. Variables escape to the heap when their lifetime extends beyond the function that creates them, or when the compiler cannot prove they remain local.

To see the compiler's escape analysis decisions, build with verbose flags:

```
go build -gcflags="-m" ./...
```

For more detailed output including the reasons for escape, use `-m=2`. The output reveals escape flows, showing exactly why variables move to the heap. When investigating specific escapes, you can grep for the variable in question:

```
$ go build -gcflags="-m=2" ./... 2>&1 | grep -A2 -B2 "nonce escapes"
./noise.go:183:17: &errors.errorString{...} does not escape
./noise.go:183:17: new(chacha20poly1305.chacha20poly1305) escapes to heap
./noise.go:120:6: nonce escapes to heap:
./noise.go:120:6:   flow: {heap} = &nonce:
./noise.go:120:6:     from nonce (address-of) at ./noise.go:123:40
--
./noise.go:469:21: &keychain.PrivKeyECDH{...} escapes to heap
./noise.go:483:40: []byte{} escapes to heap
./noise.go:138:6: nonce escapes to heap:
./noise.go:138:6:   flow: {heap} = &nonce:
./noise.go:138:6:     from nonce (address-of) at ./noise.go:141:39
```

This output shows the exact flow analysis: the nonce array escapes because its address is taken when creating a slice (`nonce[:]`) and passed to a function that the compiler cannot fully analyze.

Common causes include passing pointers to interfaces, storing references in heap-allocated structures, or passing slices of stack arrays to functions that might retain them. A particularly instructive example is the seemingly innocent pattern of passing a stack array to a function:

```go
var nonce [12]byte
binary.LittleEndian.PutUint64(nonce[4:], counter)
return cipher.Seal(ciphertext, nonce[:], plaintext, nil)
```

Here, `nonce[:]` creates a slice backed by the stack array, but if the compiler cannot prove that `cipher.Seal` won't retain a reference to this slice, the entire array escapes to the heap.

## The Optimization Strategy

Armed with profiling data and escape analysis insights, the optimization phase begins. The general strategy for eliminating allocations follows a predictable pattern: move temporary buffers from function scope to longer-lived structures, typically as fields in the enclosing type. This transformation changes allocation from per-operation to per-instance.
For the nonce example above, the optimization involves adding a buffer field to the containing struct:

```go
type cipherState struct {
	// ... other fields ...

	// nonceBuffer is a reusable scratch buffer that avoids a per-call
	// heap allocation for the nonce.
	nonceBuffer [12]byte
}

func (c *cipherState) Encrypt(associatedData, cipherText, plainText []byte) []byte {
	binary.LittleEndian.PutUint64(c.nonceBuffer[4:], c.nonce)
	return c.cipher.Seal(cipherText, c.nonceBuffer[:], plainText, associatedData)
}
```

This pattern extends to any temporary buffer. When dealing with variable-sized data up to a known maximum, pre-allocate buffers at that maximum size and slice into them as needed. The key insight is slicing to zero length while keeping the backing array's full capacity available for `append`; the three-index slice notation additionally caps capacity when a smaller limit is needed:

```go
// Pre-allocated: var buffer [maxSize]byte

// A zero-length slice with the full backing capacity for append:
slice := buffer[:0] // length=0, capacity=maxSize

// Three-index form, capping capacity below the backing array's size:
limited := buffer[0:0:limit] // length=0, capacity=limit
```

## Verification and Iteration

After implementing optimizations, the cycle returns to benchmarking. Run the same benchmark to measure improvement, but don't stop at the aggregate numbers. Generate new profiles to verify that specific allocations have been eliminated and to identify any remaining allocation sites.

The benchstat tool provides statistical comparison between runs:

```
go test -bench=BenchmarkWriteMessage -count=10 > old.txt
# Make optimizations
go test -bench=BenchmarkWriteMessage -count=10 > new.txt
benchstat old.txt new.txt
```

This comparison reveals not just whether performance improved, but whether the improvement is statistically significant.
A typical benchstat output after a successful optimization looks like:

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │              new.txt               │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.48µ ± 0%  -7.68% (p=0.000 n=10)

                │    old.txt     │              new.txt               │
                │      B/op      │    B/op     vs base                │
WriteMessage-16   73788.000 ± 0%   2.000 ± 0%  -100.00% (p=0.000 n=10)

                │   old.txt   │              new.txt               │
                │  allocs/op  │ allocs/op   vs base                │
WriteMessage-16   5.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
```

The key metrics to examine are:

- The percentage change (`vs base` column), showing the magnitude of improvement
- The p-value (`p=0.000`) indicating statistical significance; values below 0.05 suggest real improvements rather than noise
- The variance (the `±` percentages), showing consistency across runs

This output confirms both a 7.68% speed improvement and the complete elimination of allocations, with high statistical confidence.

If allocations remain, the cycle continues. Profile again, identify the source, understand why the allocation occurs through escape analysis, and apply the appropriate optimization pattern. Each iteration should show measurable progress toward the goal of zero allocations in the hot path.

## Advanced Techniques

When standard profiling doesn't reveal the allocation source, more advanced techniques come into play. Memory profiling with different granularities can help. Instead of looking at total allocated bytes, examine the profile with `go tool pprof -sample_index=alloc_objects` to focus on allocation count rather than size. This distinction matters when hunting for small, frequent allocations that might not show up prominently in byte-focused views.
Additional pprof commands that prove invaluable during optimization:

```bash
# Interactive mode for exploring the profile
go tool pprof mem.prof
# Inside the interactive session:
#   (pprof) top10       # show top 10 memory consumers
#   (pprof) list regexp # list functions matching regexp
#   (pprof) web         # open visual graph in browser

# Serve the web UI, which includes a flame graph view
go tool pprof -http=:8080 mem.prof

# Compare two profiles directly
go tool pprof -base=old.prof new.prof

# Show allocations only from specific packages
go tool pprof -focus=github.com/lightningnetwork/lnd/brontide mem.prof

# Switch sample index: cumulative allocations vs. live heap
go tool pprof -alloc_space mem.prof
go tool pprof -inuse_space mem.prof
```

When dealing with elusive allocations, checking what might be escaping to the heap can be done more surgically:

```bash
# Check a specific function or type for escapes
go build -gcflags="-m" ./... 2>&1 | grep -E "(YourType|yourFunc)"

# See all variables moved to the heap
go build -gcflags="-m" ./... 2>&1 | grep "moved to heap"

# Check which variables are confirmed to stay on the stack
go build -gcflags="-m=2" ./... 2>&1 | grep "does not escape"
```

For particularly elusive allocations, instrumenting the code with runtime memory statistics can provide real-time feedback:

```go
var m1, m2 runtime.MemStats
runtime.ReadMemStats(&m1)
// Operation being measured
runtime.ReadMemStats(&m2)
allocatedBytes := m2.TotalAlloc - m1.TotalAlloc // monotonic counter, unaffected by GC
allocCount := m2.Mallocs - m1.Mallocs
```

Note the use of the monotonic `TotalAlloc` and `Mallocs` counters rather than `Alloc`, which measures the live heap and can shrink if a garbage collection runs mid-measurement. While this approach adds overhead and shouldn't be used in production, it can help isolate allocations to specific code sections during development.

## The Zero-Allocation Goal

Achieving zero allocations in hot paths represents more than just a performance optimization. It provides predictable latency, reduces garbage collection pressure, and improves overall system behavior under load.
In systems handling thousands of operations per second, the difference between five allocations per operation and zero can mean the difference between smooth operation and periodic latency spikes during garbage collection.

The journey from initial benchmark to zero-allocation code demonstrates the power of Go's built-in tooling. By systematically applying the benchmark-profile-optimize loop, even complex code paths can be transformed into allocation-free implementations. The key lies not in guessing or premature optimization, but in measuring, understanding, and methodically addressing each allocation source.

It's best to focus optimization efforts on true hot paths identified through production profiling or realistic load testing. The techniques described here provide the tools to achieve zero-allocation code when it matters, but the judgment of when to apply them remains a critical engineering decision.