# Benchmarking Tools and Practices Across Major Software Areas

## Executive summary

Open-source projects that do benchmarking well treat it as an *engineering system* (repeatable workload definitions + controlled execution + interpretable analysis + regression feedback loops), not a one-off timing script. The best examples pair a benchmark harness with an operational workflow: baselines, statistical comparisons, CI/PR gating, artifact publication (dashboards/reports), and (when noise is high) dedicated infrastructure.

A strong pattern across ecosystems is a two-tier structure: **fast smoke benchmarks** that are safe to run on ordinary CI runners (used for catching obvious regressions) and **authoritative runs** on stable hardware (or perf clusters) used for merge decisions and long-term trend lines. This is explicit in infrastructure like Rust's compiler performance monitoring and GitHub-based continuous benchmarking actions.

Statistical awareness is the most consistent differentiator between "benchmark scripts" and "benchmark systems." Criterion.rs, for example, documents warmup/measurement phases, outlier treatment, bootstrap-based confidence intervals, and comparisons against stored baselines; this is exactly the level of rigor needed to make regression signals actionable under real-world noise.

In web/service and load-testing domains, best practice shifts away from absolute speed toward **budget/threshold enforcement**, tail-latency awareness, and explicit regression gating (fail the build if p95/p99/error budgets exceed policy). Lighthouse CI and k6 are canonical examples of this performance-as-a-contract style.

## How to read this document

Use the tables below as a retrieval set for benchmark creation:

* If the target codebase is a **library/runtime/compiler**, copy patterns from microbenchmark harnesses and perf dashboards.
* If it is a **service/web app**, copy patterns from load tools, budgets, and CI assertions.
* If it is a **database/storage/search system**, copy workload-spec, dataset, cluster-setup, and tail-latency practices.
* If it is **ML/data/cloud**, copy reproducibility, scenario definitions, telemetry, and compliance-style reporting.

---

## Microbenchmark harnesses and language-specific tools

| # | Project | Domain | Language | Benchmarking practices | Link to code |
| -: | ------- | ------ | -------- | ---------------------- | ------------ |
| 1 | Google Benchmark | C/C++ microbench | C++ | Dynamic iteration scaling, min-time runs, CPU vs wall time, JSON output, compare tooling | [google/benchmark](https://github.com/google/benchmark) |
| 2 | Nonius | C++ microbench | C++ | Minimal benchmark harness, repeated sampling, comparative benchmarking | [libnonius/nonius](https://github.com/libnonius/nonius) |
| 3 | Celero | C++ microbench | C++ | Baselines, fixtures, parameterized experiments, comparative reports | [DigitalInBlue/Celero](https://github.com/DigitalInBlue/Celero) |
| 4 | Hayai | C++ microbench | C++ | Test-like benchmark definitions, repetition and timing summaries | [nickbruun/hayai](https://github.com/nickbruun/hayai) |
| 5 | nanobench | C++ microbench | C++ | Low-overhead, stability-focused runs, single-header, easy embedding | [martinus/nanobench](https://github.com/martinus/nanobench) |
| 6 | folly Benchmark | C++ systems/libs | C++ | Macro + microbench support, tight integration into systems code | [facebook/folly](https://github.com/facebook/folly) |
| 7 | Catch2 benchmarks | C++ test+bench | C++ | Benchmark macros colocated with tests, easy adoption for Catch2 users | [catchorg/Catch2](https://github.com/catchorg/Catch2) |
| 8 | doctest benchmarks | C++ test+bench | C++ | Lightweight test-adjacent performance checks | [doctest/doctest](https://github.com/doctest/doctest) |
| 9 | Criterion (C library) | C benchmark framework | C | Benchmark harness for C projects with structured results | [Snaipe/Criterion](https://github.com/Snaipe/Criterion) |
| 10 | JMH | JVM microbench | Java | Warmup/measurement/forks, JVM-pitfall avoidance, profiler hooks | [openjdk/jmh](https://github.com/openjdk/jmh) |
| 11 | JMH Samples | JVM examples | Java | Canonical examples for warmups, profilers, state scoping | [openjdk/jmh (samples)](https://github.com/openjdk/jmh/tree/master/jmh-samples) |
| 12 | JMH JDK Microbenchmarks | JVM runtime suite | Java | Curated microbench corpus for measuring JDK performance changes | [openjdk/jmh-jdk-microbenchmarks](https://github.com/openjdk/jmh-jdk-microbenchmarks) |
| 13 | Caliper | Java benchmarking | Java | Benchmark runners, parameterization, structured result capture | [google/caliper](https://github.com/google/caliper) |
| 14 | Renaissance Benchmark Suite | JVM benchmark suite | Java/Scala | Modern concurrent workloads for JVM/JIT evaluation | [renaissance-benchmarks/renaissance](https://github.com/renaissance-benchmarks/renaissance) |
| 15 | DaCapo Benchmark Suite | JVM benchmark suite | Java | Long-lived standardized corpus for JVM comparative evaluation | [dacapobench/dacapobench](https://github.com/dacapobench/dacapobench) |
| 16 | sbt-jmh | Scala build integration | Scala | JMH plugin for sbt, standardized benchmark project layout | [sbt/sbt-jmh](https://github.com/sbt/sbt-jmh) |
| 17 | BenchmarkDotNet | .NET microbench | C# | Pilot/warmup/overhead/workload phases, isolated processes, diagnosers, baselines | [dotnet/BenchmarkDotNet](https://github.com/dotnet/BenchmarkDotNet) |
| 18 | Perfolizer | .NET stats | C# | Statistical analysis, distribution summaries, confidence-oriented comparison | [AndreyAkinshin/perfolizer](https://github.com/AndreyAkinshin/perfolizer) |
| 19 | dotnet/performance | .NET runtime suite | C# | BenchmarkDotNet-based benchmark repo for runtime regression tracking | [dotnet/performance](https://github.com/dotnet/performance) |
| 20 | Criterion.rs | Rust microbench | Rust | Bootstrap CIs, outlier detection, baseline comparison, plots | [bheisler/criterion.rs](https://github.com/bheisler/criterion.rs) |
| 21 | cargo-criterion | Rust CLI integration | Rust | Better cargo integration for Criterion workflows | [bheisler/cargo-criterion](https://github.com/bheisler/cargo-criterion) |
| 22 | critcmp | Rust comparison CLI | Rust | Compare Criterion benchmark outputs between runs | [BurntSushi/critcmp](https://github.com/BurntSushi/critcmp) |
| 23 | Divan | Rust microbench | Rust | Fast benchmark execution, counters, allocation-aware benchmarking | [nvzqz/divan](https://github.com/nvzqz/divan) |
| 24 | iai-callgrind | Rust perf analysis | Rust | Instruction-count benchmarking, callgrind-based reproducibility | [iai-callgrind/iai-callgrind](https://github.com/iai-callgrind/iai-callgrind) |
| 25 | Iai | Rust deterministic bench | Rust | One-shot instruction-level benchmarking, reduced noise vs time-based | [bheisler/iai](https://github.com/bheisler/iai) |
| 26 | cargo-benchcmp | Rust comparison | Rust | Compare saved benchmark outputs between runs | [BurntSushi/cargo-benchcmp](https://github.com/BurntSushi/cargo-benchcmp) |
| 27 | pyperf | Python benchmarking | Python | Calibration, multiprocessing, metadata capture, compare-to significance tools | [psf/pyperf](https://github.com/psf/pyperf) |
| 28 | pyperformance | Python macrobench | Python | Real-world workloads, interpreter comparison, regression-oriented suite | [python/pyperformance](https://github.com/python/pyperformance) |
| 29 | asv | Longitudinal benchmarking | Python | Benchmark-over-git-history, published dashboards, regression detection | [airspeed-velocity/asv](https://github.com/airspeed-velocity/asv) |
| 30 | pytest-benchmark | Test-integrated perf | Python | Pytest fixture, auto calibration, JSON output, compare/fail thresholds | [ionelmc/pytest-benchmark](https://github.com/ionelmc/pytest-benchmark) |
| 31 | perfplot | Python comparison | Python | N-vs-input-size comparisons, visualization of asymptotic behavior | [nschloe/perfplot](https://github.com/nschloe/perfplot) |
| 32 | timeit | Python stdlib | Python | Simple repeat loops, quick microbench smoke testing | [python/cpython (timeit)](https://github.com/python/cpython/tree/main/Lib/timeit.py) |
| 33 | nose-timer / pytest timers | Python test perf | Python | Quick timing signals inside test suites | [mahmoudimus/pytest-profiling](https://github.com/mahmoudimus/pytest-profiling) |
| 34 | golang/perf / benchstat | Go benchmarking | Go | Median/CI summaries, A/B comparisons, repeated benchmark sample analysis | [golang/perf](https://github.com/golang/perf) |
| 35 | Go testing.B | Go stdlib benchmarks | Go | Built-in benchmark loops, ns/op, allocs/op, benchmem patterns | [golang/go (testing)](https://github.com/golang/go/tree/master/src/testing) |
| 36 | x/benchmarks | Go benchmark corpora | Go | Shared benchmark suites for runtime/toolchain tracking | [golang/benchmarks](https://github.com/golang/benchmarks) |
| 37 | sweet | Go suite | Go | Standardized benchmark suite for broader Go performance work | [golang/benchmarks (sweet)](https://github.com/golang/benchmarks/tree/master/sweet) |
| 38 | go test -benchmem patterns | Go app codebases | Go | Allocation tracking paired with benchmark loops for practical regressions | [golang/go](https://github.com/golang/go) |
| 39 | Deno bench | JS/TS runtime | TypeScript/Rust | Built-in benchmarking, percentiles, JSON output, environment metadata | [denoland/deno](https://github.com/denoland/deno) |
| 40 | Node.js benchmark suite | JS runtime | JavaScript/C++ | Documented benchmark authoring, compare scripts, repeated runs, CSV outputs | [nodejs/node (benchmark)](https://github.com/nodejs/node/tree/main/benchmark) |
| 41 | tinybench | JS/TS microbench | TypeScript | Minimal harness, warmup, stats, browser/node support | [tinylibs/tinybench](https://github.com/tinylibs/tinybench) |
| 42 | Benchmark.js | JS microbench | JavaScript | High-resolution timing, statistical benchmark cycles, browser/node support | [bestiejs/benchmark.js](https://github.com/bestiejs/benchmark.js) |
| 43 | mitata | JS microbench | TypeScript | Fast modern benchmarking, comparisons, concise output | [evanwashere/mitata](https://github.com/evanwashere/mitata) |
| 44 | bun benchmark support | JS runtime | Zig/TypeScript | Built-in runtime benchmarking workflows for Bun ecosystem | [oven-sh/bun](https://github.com/oven-sh/bun) |
| 45 | kotlinx-benchmark | Kotlin multiplatform | Kotlin | Profiles, warmups, iterations, target-specific tasks, detailed reports | [Kotlin/kotlinx-benchmark](https://github.com/Kotlin/kotlinx-benchmark) |
| 46 | kotlinx-benchmark examples | Kotlin examples | Kotlin | Separate benchmark source sets, config profiles, task-driven runs | [Kotlin/kotlinx-benchmark (examples)](https://github.com/Kotlin/kotlinx-benchmark/tree/master/examples) |
| 47 | BenchmarkTools.jl | Julia microbench | Julia | Explicit sample/eval/trial model, saved params, regression comparisons | [JuliaCI/BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl) |
| 48 | BaseBenchmarks.jl | Julia runtime suite | Julia | Versioned manifests, CI tracking for Julia runtime performance | [JuliaCI/BaseBenchmarks.jl](https://github.com/JuliaCI/BaseBenchmarks.jl) |
| 49 | PkgBenchmark.jl | Julia package perf | Julia | Compare package benchmarks across commits/versions | [JuliaCI/PkgBenchmark.jl](https://github.com/JuliaCI/PkgBenchmark.jl) |
| 50 | Nanosoldier.jl | Julia perf bot | Julia | Bot-driven CI perf runs, report publishing, PR-native workflow | [JuliaCI/Nanosoldier.jl](https://github.com/JuliaCI/Nanosoldier.jl) |
| 51 | AirspeedVelocity.jl | Julia history benchmarks | Julia | Longitudinal benchmark tracking over revisions | [MilesCranmer/AirspeedVelocity.jl](https://github.com/MilesCranmer/AirspeedVelocity.jl) |
| 52 | PHPBench | PHP benchmarking | PHP | Revolutions/iterations, process isolation, assertions, baselines, HTML/CSV/Markdown reports | [phpbench/phpbench](https://github.com/phpbench/phpbench) |
| 53 | Pest benchmarks + PHPBench | PHP testing/perf | PHP | Keep perf checks close to test workflows | [pestphp/pest](https://github.com/pestphp/pest) |
| 54 | benchmark-ips | Ruby microbench | Ruby | Iterations-per-second, compare implementations, memory optionality | [evanphx/benchmark-ips](https://github.com/evanphx/benchmark-ips) |
| 55 | benchmark-memory | Ruby memory perf | Ruby | Memory-focused benchmarking alongside execution speed | [michaelherold/benchmark-memory](https://github.com/michaelherold/benchmark-memory) |
| 56 | benchmark-driver | Ruby benchmark runner | Ruby | Generated scripts, comparison mode, graphs, multi-runtime comparison | [benchmark-driver/benchmark-driver](https://github.com/benchmark-driver/benchmark-driver) |
| 57 | package-benchmark | Swift benchmarking | Swift | CPU, ARC, malloc, memory, OS resource metrics, CI-friendly reports | [ordo-one/package-benchmark](https://github.com/ordo-one/package-benchmark) |
| 58 | swift-benchmark | Swift microbench | Swift | Benchmark suites for library/runtime performance tracking | [google/swift-benchmark](https://github.com/google/swift-benchmark) |
| 59 | Criterion (Haskell) | Haskell benchmarking | Haskell | Statistical benchmarking, configurable measurement, report generation | [bos/criterion](https://github.com/bos/criterion) |
| 60 | gauge | Haskell benchmarking | Haskell | Modernized Criterion-like measurement and reports | [vincenthz/hs-gauge](https://github.com/vincenthz/hs-gauge) |
| 61 | Core_bench | OCaml benchmarking | OCaml | Stable benchmarks, allocation and runtime observations | [janestreet/core_bench](https://github.com/janestreet/core_bench) |
| 62 | Bechamel | OCaml microbench | OCaml | Statistical benchmark toolkit for OCaml code | [mirage/bechamel](https://github.com/mirage/bechamel) |
| 63 | benchee | Elixir/Erlang benchmarking | Elixir | Parallel benchmarks, memory/time/reduction metrics, HTML output | [bencheeorg/benchee](https://github.com/bencheeorg/benchee) |
| 64 | benny | Elixir simple benchmark | Elixir | Small benchmark runner on top of Benchee-style workflow | [CultivateHQ/benny](https://github.com/CultivateHQ/benny) |
| 65 | Criterium | Clojure benchmarking | Clojure | Warmup, GC handling, statistical summaries | [hugoduncan/criterium](https://github.com/hugoduncan/criterium) |
| 66 | bench (R) | R microbench | R | High-precision timing primitives for R | [r-lib/bench](https://github.com/r-lib/bench) |
| 67 | microbenchmark (R) | R microbench | R | Repeated evaluation with distribution awareness | [joshuaulrich/microbenchmark](https://github.com/joshuaulrich/microbenchmark) |
| 68 | hyperfine | CLI benchmarking | Rust | Auto run count, warmups, prepare commands, shell overhead correction, JSON/Markdown export | [sharkdp/hyperfine](https://github.com/sharkdp/hyperfine) |
| 69 | gauge.js / CPCBenchmark | Misc examples | Various | Cross-implementation comparison and reproducibility templates | [benchmarko/CPCBenchmark](https://github.com/benchmarko/CPCBenchmark) |
| 70 | RepoTransBench | Repository-level code benchmark | Multi-language | Executable test-based evaluation across real repositories | [DeepSoftwareAnalytics/RepoTransBench](https://github.com/DeepSoftwareAnalytics/RepoTransBench) |

## Web, frontend, API, and load testing

These tools illustrate a different pattern: not just timing loops, but **budgets, thresholds, repeated runs, request-shape control, and CI pass/fail semantics**. Lighthouse CI emphasizes historical report tracking and assertions, k6 emphasizes thresholds/SLO-style gates, and wrk2 explicitly addresses coordinated omission for tail-latency correctness.
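The budget/threshold pattern can be made concrete without committing to any one tool. Below is a minimal sketch in Python, assuming a flat list of request latencies and a hypothetical budget dict: the field names `p95_ms` and `max_error_rate` are invented for illustration, and real tools such as k6 and Lighthouse CI express the same idea in their own config formats.

```python
"""Hypothetical CI latency gate: fail when p95 or error rate exceeds a budget.

Budget names and the nearest-rank percentile choice are illustrative
assumptions, not the behavior of any specific load-testing tool.
"""

def percentile(samples, p):
    """Nearest-rank percentile; adequate for a pass/fail gate sketch."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def check_budgets(latencies_ms, errors, budgets):
    """Return a list of human-readable violations; an empty list means pass."""
    violations = []
    p95 = percentile(latencies_ms, 95)
    if p95 > budgets["p95_ms"]:
        violations.append(f"p95 {p95:.1f} ms exceeds budget {budgets['p95_ms']} ms")
    error_rate = errors / max(1, len(latencies_ms))
    if error_rate > budgets["max_error_rate"]:
        violations.append(f"error rate {error_rate:.3f} exceeds {budgets['max_error_rate']}")
    return violations

if __name__ == "__main__":
    latencies = [12.0, 15.0, 14.0, 18.0, 22.0, 35.0, 16.0, 13.0, 250.0, 17.0]
    budgets = {"p95_ms": 100, "max_error_rate": 0.01}
    # A CI wrapper would exit non-zero when violations are present.
    print(check_budgets(latencies, errors=0, budgets=budgets))
```

The key design point mirrors the tools in the table below: the benchmark run produces raw measurements, and a separate, declarative budget decides pass/fail, so the policy can be reviewed and versioned independently of the workload.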
| # | Project | Domain | Language | Benchmarking practices | Link to code |
| -: | ------- | ------ | -------- | ---------------------- | ------------ |
| 71 | Lighthouse CI | Web perf CI | JavaScript | Repeated Lighthouse runs, budgets/assertions, PR status checks, report uploads | [GoogleChrome/lighthouse-ci](https://github.com/GoogleChrome/lighthouse-ci) |
| 72 | Lighthouse | Frontend audit | JavaScript | Standardized perf audits, lab metrics, report generation | [GoogleChrome/lighthouse](https://github.com/GoogleChrome/lighthouse) |
| 73 | js-framework-benchmark | Frontend frameworks | JavaScript | Fixed UI operations, weighted scores, browser-consistent comparisons | [krausest/js-framework-benchmark](https://github.com/krausest/js-framework-benchmark) |
| 74 | k6 | Load testing | Go/JavaScript | Tests-as-code, thresholds, percentile gates, CI-friendly exit codes | [grafana/k6](https://github.com/grafana/k6) |
| 75 | k6-operator | Kubernetes load testing | Go | Kubernetes operator for distributed k6 runs, declarative CRDs | [grafana/k6-operator](https://github.com/grafana/k6-operator) |
| 76 | Locust | Load testing | Python | User-behavior scripting, distributed workers, scenario realism | [locustio/locust](https://github.com/locustio/locust) |
| 77 | Vegeta | HTTP load testing | Go | Constant-rate attack model, latency histograms, reproducible traffic shapes | [tsenart/vegeta](https://github.com/tsenart/vegeta) |
| 78 | wrk | HTTP benchmarking | C/LuaJIT | Multithreaded load generation, scriptable request shaping | [wg/wrk](https://github.com/wg/wrk) |
| 79 | wrk2 | HTTP load benchmarking | C | Constant-throughput model, coordinated-omission awareness, HdrHistogram percentiles | [giltene/wrk2](https://github.com/giltene/wrk2) |
| 80 | autocannon | Node HTTP benchmarking | JavaScript | High-throughput HTTP/1.1 tests, pipelining, CLI/API usage | [mcollina/autocannon](https://github.com/mcollina/autocannon) |
| 81 | hey | HTTP load testing | Go | Simple constant-concurrency request benchmarking | [rakyll/hey](https://github.com/rakyll/hey) |
| 82 | oha | HTTP load testing | Rust | CLI load generation with modern TUI output and HTTP/2 support | [hatoo/oha](https://github.com/hatoo/oha) |
| 83 | bombardier | HTTP benchmarking | Go | Fast benchmark tool, latency distribution reporting, CLI usability | [codesenberg/bombardier](https://github.com/codesenberg/bombardier) |
| 84 | Gatling | Load testing | Scala | Scenario DSL, detailed reports, user-journey modeling | [gatling/gatling](https://github.com/gatling/gatling) |
| 85 | JMeter | Performance testing | Java | Thread groups, protocol breadth, report dashboards, scripting-heavy workloads | [apache/jmeter](https://github.com/apache/jmeter) |
| 86 | artillery | API/web perf | JavaScript | YAML/JS scenarios, CI automation, arrival-rate oriented traffic | [artilleryio/artillery](https://github.com/artilleryio/artillery) |
| 87 | siege | Web load test | C | Simple repeatable HTTP load generation and concurrency tests | [JoeDog/siege](https://github.com/JoeDog/siege) |
| 88 | fortio | Service benchmarking | Go | Latency histograms, gRPC/HTTP load generation, server + client modes | [fortio/fortio](https://github.com/fortio/fortio) |
| 89 | ghz | gRPC load testing | Go | gRPC-specific request shaping, reporting, scripting | [bojand/ghz](https://github.com/bojand/ghz) |
| 90 | NBomber | Load testing | F# / .NET | Scenario modeling, thresholds, reports, distributed load for .NET stacks | [PragmaticFlow/NBomber](https://github.com/PragmaticFlow/NBomber) |
| 91 | Tsung | Distributed load test | Erlang | Clustered load generation, protocol support, soak/stress testing | [processone/tsung](https://github.com/processone/tsung) |
| 92 | Nighthawk | L7 characterization | C++ | Envoy-centric HTTP/2+ performance investigations, config-driven scenarios | [envoyproxy/nighthawk](https://github.com/envoyproxy/nighthawk) |
| 93 | Drill | HTTP load testing | Rust | YAML-defined scenarios, scenario-driven load testing | [fcsonline/drill](https://github.com/fcsonline/drill) |
| 94 | sitespeed.io | Web performance analysis | JavaScript | Real-browser metrics, reporting, Docker-friendly automation | [sitespeedio/sitespeed.io](https://github.com/sitespeedio/sitespeed.io) |
| 95 | bundlesize | Bundle-size budgets | JavaScript | PR checks, config file budgets, fail PR if exceeds budget | [siddharthkp/bundlesize](https://github.com/siddharthkp/bundlesize) |
| 96 | size-limit | Performance budgets | JavaScript | "Cost to run" + size metrics, error in PR if exceeds limit | [ai/size-limit](https://github.com/ai/size-limit) |
| 97 | webpack-bundle-analyzer | Bundle visualization | JavaScript | Interactive treemap UI for bundle composition analysis | [webpack/webpack-bundle-analyzer](https://github.com/webpack/webpack-bundle-analyzer) |

## Databases, storage, search, and analytics

Benchmark quality in this category depends on **workload specs, datasets, cluster provisioning, query sets, and tail-latency reporting**. ClickBench emphasizes transparent scripts and datasets, and db-benchmarks emphasizes reproducibility and coefficient-of-variation control.
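Coefficient-of-variation control boils down to: repeat each workload several times, and distrust any result whose spread is too wide relative to its mean. A hedged sketch in Python of that acceptance check follows; the 5% cap and the `{query: timings}` input shape are assumptions for illustration, not the actual implementation of the db-benchmarks project.

```python
"""Run-stability sketch: accept a benchmark result only when the
coefficient of variation (stdev / mean) across repeats stays under a cap.
The 5% default cap is an illustrative assumption."""

from statistics import mean, stdev

def coefficient_of_variation(samples):
    """stdev / mean as a fraction; requires at least two samples."""
    return stdev(samples) / mean(samples)

def stable_results(runs, cov_cap=0.05):
    """Split {name: [timings]} into (accepted, rejected) dicts of rounded CoV."""
    accepted, rejected = {}, {}
    for name, timings in runs.items():
        cov = coefficient_of_variation(timings)
        (accepted if cov <= cov_cap else rejected)[name] = round(cov, 4)
    return accepted, rejected

if __name__ == "__main__":
    runs = {
        "q1_count_star": [101.0, 99.5, 100.2, 100.8],   # tight spread: keep
        "q2_group_by":   [210.0, 150.0, 330.0, 190.0],  # noisy: rerun
    }
    ok, noisy = stable_results(runs)
    print("accepted:", ok, "rerun:", noisy)
```

In a real harness the "rejected" set would trigger additional repetitions (or a flag in the published results) rather than silently dropping the query.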
| # | Project | Domain | Language | Benchmarking practices | Link to code |
| --: | ------- | ------ | -------- | ---------------------- | ------------ |
| 98 | YCSB | DB / KV serving | Java | Standard CRUD workloads, configurable bindings, throughput + tail latency | [brianfrankcooper/YCSB](https://github.com/brianfrankcooper/YCSB) |
| 99 | go-ycsb | DB / KV serving | Go | Go rewrite of YCSB patterns with modern bindings | [pingcap/go-ycsb](https://github.com/pingcap/go-ycsb) |
| 100 | BenchBase (formerly OLTPBench) | SQL benchmarking | Java | Multi-DB apples-to-apples SQL workload harness | [cmu-db/benchbase](https://github.com/cmu-db/benchbase) |
| 101 | pgbench | PostgreSQL | C | TPS/latency, concurrency scaling, bundled with DB for repeatable checks | [postgres/postgres (pgbench)](https://github.com/postgres/postgres/tree/master/src/bin/pgbench) |
| 102 | sysbench | OLTP systems | C/Lua | Repeatable DB + CPU/memory/file benchmarks, scripted workloads | [akopytov/sysbench](https://github.com/akopytov/sysbench) |
| 103 | fio | Storage / I/O | C | Job files, IOPS/latency/bandwidth logs, workload modeling, tail-latency focus | [axboe/fio](https://github.com/axboe/fio) |
| 104 | RocksDB db_bench | KV storage | C++ | Product-bundled benchmark tool, config-sensitive workload runs | [facebook/rocksdb](https://github.com/facebook/rocksdb) |
| 105 | LevelDB db_bench | KV storage | C++ | Bundled benchmark tool, throughput/latency summaries | [google/leveldb](https://github.com/google/leveldb) |
| 106 | db-benchmarks | DB/search benchmarking | Python | Reproducibility, fair conditions, coefficient-of-variation control | [db-benchmarks/db-benchmarks](https://github.com/db-benchmarks/db-benchmarks) |
| 107 | Rally | Elasticsearch benchmarking | Python | Versioned tracks, setup/teardown, telemetry, compare results | [elastic/rally](https://github.com/elastic/rally) |
| 108 | rally-tracks | Elasticsearch workloads | JSON/Python | Versioned benchmark tracks aligned with product versions | [elastic/rally-tracks](https://github.com/elastic/rally-tracks) |
| 109 | OpenSearch Benchmark | OpenSearch benchmarking | Python | Open benchmarking harness for search with community-driven methodology | [opensearch-project/opensearch-benchmark](https://github.com/opensearch-project/opensearch-benchmark) |
| 110 | ClickBench | OLAP benchmarking | Shell/SQL | Published dataset, fixed query set, install/run scripts, limits documented | [ClickHouse/ClickBench](https://github.com/ClickHouse/ClickBench) |
| 111 | TSBS | Time-series DB benchmarking | Go | Domain-specific workloads, ingestion/query scenarios, structured generators | [timescale/tsbs](https://github.com/timescale/tsbs) |
| 112 | OpenMLDB benchmark tools | Feature/serving DB | C++/Java | DB-engine-centric benchmark suites with workload realism | [4paradigm/OpenMLDB](https://github.com/4paradigm/OpenMLDB) |
| 113 | TPC-C derivatives | OLTP | Various | Transactional workload modeling, scale-factor benchmarking | [apavlo/py-tpcc](https://github.com/apavlo/py-tpcc) |
| 114 | TPC-H derivatives | Analytics | SQL/C++ | Standard query corpus, scale factor, analytical reproducibility | [electrum/tpch-dbgen](https://github.com/electrum/tpch-dbgen) |
| 115 | TPC-DS tooling | Analytics | C/SQL | More complex query corpus, warehouse-style workloads | [gregrahn/tpcds-kit](https://github.com/gregrahn/tpcds-kit) |
| 116 | sqlite speedtest1 | Embedded DB | C | Product-shipped speed test embedded in engineering workflow | [sqlite/sqlite](https://github.com/sqlite/sqlite) |
| 117 | DuckDB benchmark suite | Analytics DB | C++ | Product benchmark corpus and analytical query testing | [duckdb/duckdb](https://github.com/duckdb/duckdb) |
| 118 | CockroachDB roachtest | Distributed DB | Go | Cluster scenarios, scale/perf validation, fault + performance integration | [cockroachdb/cockroach (roachtest)](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/roachtest) |
| 119 | MongoDB benchRun | Document DB | JavaScript/C++ | Scriptable benchmark execution for DB operations | [mongodb/mongo](https://github.com/mongodb/mongo) |
| 120 | mongo-perf | MongoDB perf tools | Python/JS | Dedicated performance tooling and harness-based suites | [mongodb/mongo-perf](https://github.com/mongodb/mongo-perf) |
| 121 | genny | MongoDB load generation | C++/YAML | Workload-as-code patterns for repeatable DB performance testing | [mongodb/genny](https://github.com/mongodb/genny) |
| 122 | Redis benchmark | In-memory DB | C | Throughput/latency checks, protocol-aware benchmarking, environment guidance | [redis/redis](https://github.com/redis/redis) |
| 123 | memtier_benchmark | Cache / key-value | C | Multi-threaded cache benchmarking, pipelining, latency histograms | [redis/memtier_benchmark](https://github.com/redis/memtier_benchmark) |

## Search, stream, messaging, and distributed systems

| # | Project | Domain | Language | Benchmarking practices | Link to code |
| --: | ------- | ------ | -------- | ---------------------- | ------------ |
| 124 | OpenMessaging Benchmark | Messaging systems | Java | Standard pub/sub workloads, cloud-message broker comparisons | [openmessaging/benchmark](https://github.com/openmessaging/benchmark) |
| 125 | Kafka performance tools | Streaming/messaging | Java/Scala | Producer/consumer perf scripts, throughput and latency checks | [apache/kafka](https://github.com/apache/kafka) |
| 126 | NATS benchmark | Messaging | Go | Pub/sub throughput benchmarking for NATS clusters | [nats-io/nats-server (scripts)](https://github.com/nats-io/nats-server/tree/main/scripts) |
| 127 | Pulsar performance tools | Messaging | Java | Built-in producer/consumer benchmarking utilities | [apache/pulsar](https://github.com/apache/pulsar) |
| 128 | RabbitMQ perf-test | Messaging | Java | Broker-specific throughput/latency benchmarking under varied configs | [rabbitmq/rabbitmq-perf-test](https://github.com/rabbitmq/rabbitmq-perf-test) |
| 129 | Cassandra stress | Distributed DB | Java | Product-bundled workload generator, schema-aware stress tests | [apache/cassandra](https://github.com/apache/cassandra) |
| 130 | ScyllaDB test/perf tools | Distributed DB | C++/Python | Latency-sensitive DB benchmarking and comparative workloads | [scylladb/scylladb](https://github.com/scylladb/scylladb) |
| 131 | scylla-bench | Distributed DB | Go | Purpose-built benchmark tool for Scylla-like workloads | [scylladb/scylla-bench](https://github.com/scylladb/scylla-bench) |
| 132 | Aerospike benchmark tool | Distributed KV | Java | Cluster-aware read/write latency benchmarking | [aerospike/aerospike-client-java](https://github.com/aerospike/aerospike-client-java) |
| 133 | FoundationDB benchmark tools | Distributed KV | C++ | Internal stress/performance tools for transactional systems | [apple/foundationdb](https://github.com/apple/foundationdb) |
| 134 | TiDB sysbench/ycsb integrations | Distributed SQL | Go | Public benchmark recipes based on standard suites | [pingcap/tidb](https://github.com/pingcap/tidb) |

## ML, data, AI, and scientific computing

ML benchmarking introduces another mature pattern: **scenario definitions, datasets, compliance rules, hardware disclosure, and throughput/latency tradeoffs**. MLPerf is the clearest example.
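The throughput/latency tradeoff these suites report can be summarized from raw per-request records. A minimal sketch in Python, assuming hypothetical `(latency_s, tokens)` tuples per completed request; MLPerf's serving scenario and vLLM's benchmark scripts define these metrics far more formally, so treat this only as the shape of the report:

```python
"""Serving-benchmark summary sketch: aggregate per-request records into
throughput and tail-latency numbers. The record shape (latency_s, tokens)
and the nearest-rank p99 are illustrative assumptions."""

def serving_summary(records, wall_clock_s):
    """records: list of (latency_s, tokens_generated) per completed request."""
    latencies = sorted(lat for lat, _ in records)
    total_tokens = sum(tokens for _, tokens in records)
    p99_idx = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "requests_per_s": len(records) / wall_clock_s,
        "tokens_per_s": total_tokens / wall_clock_s,
        "p50_latency_s": latencies[len(latencies) // 2],
        "p99_latency_s": latencies[p99_idx],
    }

if __name__ == "__main__":
    # 99 fast requests plus one slow straggler over a 10 s window.
    records = [(0.1, 10)] * 99 + [(2.0, 10)]
    print(serving_summary(records, wall_clock_s=10.0))
```

The point of pairing the two axes is that raising batch size usually raises `tokens_per_s` while also raising `p99_latency_s`; a report that shows only one axis hides that tradeoff.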
| # | Project | Domain | Language | Benchmarking practices | Link to code |
| --: | ------- | ------ | -------- | ---------------------- | ------------ |
| 135 | MLPerf Inference | ML inference | Python/C++ | Formal scenarios, compliance rules, throughput/latency comparability | [mlcommons/inference](https://github.com/mlcommons/inference) |
| 136 | MLPerf Training | ML training | Python | Standard training tasks, throughput/time-to-train, compliance workflows | [mlcommons/training](https://github.com/mlcommons/training) |
| 137 | TorchBench | PyTorch perf | Python | Real models/workloads, stack/regression tracking for framework evolution | [pytorch/benchmark](https://github.com/pytorch/benchmark) |
| 138 | lm-evaluation-harness | LLM eval | Python | Standardized task harness, model-vs-model comparison, reproducible prompts/tasks | [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) |
| 139 | HELM | LLM benchmarking | Python | Scenario-based model evaluation, standardized reporting | [stanford-crfm/helm](https://github.com/stanford-crfm/helm) |
| 140 | MTEB | Embedding benchmarks | Python | Broad task suite for embeddings, standardized leaderboard-style evaluation | [embeddings-benchmark/mteb](https://github.com/embeddings-benchmark/mteb) |
| 141 | BigCode Evaluation Harness | Code model eval | Python | Benchmark tasks for code generation models | [bigcode-project/bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) |
| 142 | DeepSpeed | LLM/training perf | Python/C++/CUDA | Throughput, memory, scale-out benchmarking patterns | [deepspeedai/DeepSpeed](https://github.com/deepspeedai/DeepSpeed) |
| 143 | DeepSpeedExamples | LLM/training examples | Python | Benchmark scripts for distributed training workflows | [microsoft/DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples) |
| 144 | Megatron-LM benchmarks | Large-scale training | Python | Scaling benchmarks, throughput, distributed systems measurements | [NVIDIA/Megatron-LM](https://github.com/NVIDIA/Megatron-LM) |
| 145 | vLLM benchmark tools | LLM serving | Python | Request throughput, token latency, serving-scale comparisons | [vllm-project/vllm](https://github.com/vllm-project/vllm) |
| 146 | TensorRT-LLM benchmarks | LLM inference systems | C++/Python | Batch-size, tokens/sec, latency under deployment constraints | [NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) |
| 147 | DeepBench | DL kernel benchmarks | C++/CUDA | Operation-level performance across hardware for DL primitives | [baidu-research/DeepBench](https://github.com/baidu-research/DeepBench) |
| 148 | ONNX Runtime | ML runtime perf | C++/Python | Inference performance testing, profiler support, build-over-build comparison | [microsoft/onnxruntime](https://github.com/microsoft/onnxruntime) |
| 149 | CUTLASS | GPU kernel benchmarks | C++/CUDA | Linear algebra kernel throughput/latency benchmarking | [NVIDIA/cutlass](https://github.com/NVIDIA/cutlass) |
| 150 | Triton | Compiler/kernel benchmarks | Python/C++ | Kernel-compiler benchmark scripts, GPU profiling integration | [triton-lang/triton](https://github.com/triton-lang/triton) |
| 151 | Transformers | ML library benchmarks | Python | Training/inference timing, multi-model performance measurement | [huggingface/transformers](https://github.com/huggingface/transformers) |
| 152 | JAX benchmarks | Numeric/ML | Python | Reproducible kernel/model benchmarks inside framework ecosystem | [jax-ml/jax](https://github.com/jax-ml/jax) |
| 153 | NumPy benchmarks | Numeric kernels | Python | asv-based history benchmarking over library revisions | [numpy/numpy (benchmarks)](https://github.com/numpy/numpy/tree/main/benchmarks) |
| 154 | pandas asv_bench | Dataframes | Python | Realistic dataframe workloads over git-history regressions | [pandas-dev/pandas (asv_bench)](https://github.com/pandas-dev/pandas/tree/main/asv_bench) |
| 155 | SciPy benchmarks | Scientific computing | Python | asv-based regressions for numerical kernels and algorithms | [scipy/scipy (benchmarks)](https://github.com/scipy/scipy/tree/main/benchmarks) |
| 156 | RAPIDS benchmarks | GPU data science | Python/C++ | End-to-end dataframe/ML perf on accelerated pipelines | [rapidsai/cudf](https://github.com/rapidsai/cudf) |
| 157 | Apache Arrow benchmarks | Data systems | C++/Python | archery benchmark runs, JSON capture, cross-revision comparisons | [apache/arrow](https://github.com/apache/arrow) |
| 158 | Polars benchmarks | Dataframes | Rust/Python | Dataframe-operation micro + macro comparisons against alternatives | [pola-rs/polars](https://github.com/pola-rs/polars) |

## Cloud, infra, OS, and platform benchmarking

| # | Project | Domain | Language | Benchmarking practices | Link to code |
| --: | ------- | ------ | -------- | ---------------------- | ------------ |
| 159 | PerfKitBenchmarker | Cloud benchmarking | Python | Provision-run-teardown workflow, cloud-neutral benchmark automation | [GoogleCloudPlatform/PerfKitBenchmarker](https://github.com/GoogleCloudPlatform/PerfKitBenchmarker) |
| 160 | Phoronix Test Suite | System benchmarking | PHP/Shell | Versioned test profiles, automated install/run/report pipeline | [phoronix-test-suite/phoronix-test-suite](https://github.com/phoronix-test-suite/phoronix-test-suite) |
| 161 | UnixBench | OS/system benchmark | C/Shell | Classic system benchmark battery across CPU/filesystem/process metrics | [kdlucas/byte-unixbench](https://github.com/kdlucas/byte-unixbench) |
| 162 | lmbench | OS/microbenchmark | C | Kernel/OS latency and bandwidth microbenchmarks | [intel/lmbench](https://github.com/intel/lmbench) |
| 163 | stress-ng | System stress/perf | C | Controlled stressors with measurable system behavior | [ColinIanKing/stress-ng](https://github.com/ColinIanKing/stress-ng) |
| 164 | STREAM | Memory bandwidth | C | Canonical memory bandwidth benchmark, standardized kernels | [jeffhammond/STREAM](https://github.com/jeffhammond/STREAM) |
| 165 | sysstat tools | System observability | C | Benchmark-adjacent measurement of CPU/I/O behavior during runs | [sysstat/sysstat](https://github.com/sysstat/sysstat) |
| 166 | perf | Linux profiling/perf | C | Hardware-counter and profile-assisted benchmark diagnosis | [torvalds/linux (perf)](https://github.com/torvalds/linux/tree/master/tools/perf) |
| 167 | fio benchmark profiles | Storage infra | C | Reusable job specs for storage infra reproducibility | [axboe/fio (examples)](https://github.com/axboe/fio/tree/master/examples) |
| 168 | kubernetes/perf-tests | Kubernetes benchmarking | Go/YAML | Cluster performance measurements, workload-as-code | [kubernetes/perf-tests](https://github.com/kubernetes/perf-tests) |
| 169 | iperf3 | Network bandwidth | C | Canonical open network bandwidth measurement tool | [esnet/iperf](https://github.com/esnet/iperf) |
| 170 | netperf | Network throughput/latency | C | Throughput and latency tests across multiple networking types | [HewlettPackard/netperf](https://github.com/HewlettPackard/netperf) |
| 171 | sockperf | Network latency | C | Specialized for latency-sensitive network benchmarking | [Mellanox/sockperf](https://github.com/Mellanox/sockperf) |

## Frameworks, browser engines, compilers, and large project perf suites

| # | Project | Domain | Language | Benchmarking practices | Link to code |
| --: | ------- | ------ | -------- | ---------------------- | ------------ |
| 172 | rustc-perf | Compiler perf | Rust | Dedicated perf infra, commit-vs-commit comparison, dashboard, PR feedback | [rust-lang/rustc-perf](https://github.com/rust-lang/rustc-perf) |
| 173 | LLVM LNT | Compiler perf infra | Python | Submission, storage,
dashboards, machine metadata, trend tracking | [llvm/llvm-lnt](https://github.com/llvm/llvm-lnt) | 237 | 174 | LLVM Test Suite | Compiler benchmark corpus | C/C++ | Standard benchmark corpus for compiler/runtime performance tracking | [llvm/llvm-test-suite](https://github.com/llvm/llvm-test-suite) | 238 | 175 | Chromium Telemetry | Browser perf | Python | Page sets, repeat runs, benchmark stories, result pipelines | [catapult-project/catapult](https://github.com/catapult-project/catapult) | 239 | 176 | WebPageTest | Web systems | C++/Python/JS | Repeated page runs, lab metrics, regression-oriented analysis | [catchpoint/WebPageTest](https://github.com/catchpoint/WebPageTest) | 240 | 177 | JetStream benchmark | Browser JS | JavaScript | Standard browser JavaScript performance corpus | [WebKit/JetStream](https://github.com/WebKit/JetStream) | 241 | 178 | Speedometer | Browser/web app perf | JavaScript | Standardized web-app responsiveness benchmark | [WebKit/Speedometer](https://github.com/WebKit/Speedometer) | 242 | 179 | ARES-6 | Browser JS | JavaScript | JavaScript engine computational benchmark | [WebKit/ARES-6](https://github.com/WebKit/ARES-6) | 243 | 180 | Octane (archived) | JS engines | JavaScript | Historical benchmark corpus still useful as reference material | [chromium/octane](https://github.com/chromium/octane) | 244 | 181 | AreWeFastYet | JS VM perf tracking | JavaScript | Continuous VM performance tracking dashboards | [nbp/arewefastyet](https://github.com/nbp/arewefastyet) | 245 | 182 | V8 benchmark suite | JS runtime | C++/JS | Engine perf corpora and regression tracking | [v8/v8](https://github.com/v8/v8) | 246 | 183 | CPython performance suite | Runtime perf | Python/C | pyperformance-driven language evolution checks | [python/cpython](https://github.com/python/cpython) | 247 | 184 | HHVM benchmarks | Runtime/JIT | C++/Hack | Runtime-oriented benchmark suites and perf tracking | [facebook/hhvm](https://github.com/facebook/hhvm) | 248 249 ### 
Profiling and visualization tools 250 251 Profiling is the natural complement to benchmarking — once a benchmark identifies a regression, profiling tools diagnose the root cause. The best performance workflows tightly couple benchmark harnesses with profiling capture and comparison. 252 253 | # | Project | Domain | Language | Benchmarking practices | Link to code | 254 | --: | ------- | ------ | -------- | ---------------------- | ------------ | 255 | 185 | FlameGraph | Stack visualization | Perl | Iconic flamegraph SVGs from sampled stack traces, repeatable comparison | [brendangregg/FlameGraph](https://github.com/brendangregg/FlameGraph) | 256 | 186 | speedscope | Profile visualization | TypeScript | Interactive web viewer for profile exploration and sharing | [jlfwong/speedscope](https://github.com/jlfwong/speedscope) | 257 | 187 | Perfetto | Tracing platform | C++/TypeScript | Production-grade tracing, trace configs, analysis tools, web UI | [google/perfetto](https://github.com/google/perfetto) | 258 | 188 | pprof | Profile analysis | Go | CPU/heap profiles, visualization and analysis tooling | [google/pprof](https://github.com/google/pprof) | 259 | 189 | bcc | eBPF tooling | Python/C | Low-overhead kernel/user-space eBPF measurements | [iovisor/bcc](https://github.com/iovisor/bcc) | 260 | 190 | bpftrace | eBPF tracing | C++ | Dynamic system measurement for benchmark root-cause analysis | [bpftrace/bpftrace](https://github.com/bpftrace/bpftrace) | 261 | 191 | Pyroscope | Continuous profiling | Go | CPU/memory/I/O profiling, continuous capture, time-series analysis | [grafana/pyroscope](https://github.com/grafana/pyroscope) | 262 | 192 | Parca | Continuous profiling | Go | CPU/memory usage down to line number over time, aggregate queries | [parca-dev/parca](https://github.com/parca-dev/parca) | 263 | 193 | async-profiler | JVM profiling | Java/C++ | Low-overhead CPU/heap sampling for JVM benchmark diagnosis | 
[async-profiler/async-profiler](https://github.com/async-profiler/async-profiler) | 264 | 194 | py-spy | Python profiling | Rust | Sampling profiler without code instrumentation, fast benchmark→profile loops | [benfred/py-spy](https://github.com/benfred/py-spy) | 265 | 195 | rbspy | Ruby profiling | Rust | Low-overhead sampling profiling for Ruby performance workflows | [rbspy/rbspy](https://github.com/rbspy/rbspy) | 266 | 196 | gperftools | C++ profiling/allocators | C++ | CPU/heap profiling toolkit, often paired with benchmark harnesses | [gperftools/gperftools](https://github.com/gperftools/gperftools) | 267 | 197 | heaptrack | Heap profiling | C++ | Heap allocation profiles, memory regression comparison across versions | [KDE/heaptrack](https://github.com/KDE/heaptrack) | 268 | 198 | Tracy | Instrumentation profiler | C++ | High-resolution interactive profiling for games/systems benchmarks | [wolfpld/tracy](https://github.com/wolfpld/tracy) | 269 | 199 | samply | Sampling profiler | Rust | Modern cross-platform sampling profiler for benchmark regression diagnosis | [mstange/samply](https://github.com/mstange/samply) | 270 | 200 | cargo-flamegraph | Rust profiling CLI | Rust | Bridges Rust benchmarking and profiling with minimal friction | [flamegraph-rs/flamegraph](https://github.com/flamegraph-rs/flamegraph) | 271 | 201 | yappi | Python profiling | C/Python | Thread/async-aware function profiling for concurrent benchmarks | [sumerc/yappi](https://github.com/sumerc/yappi) | 272 273 ### CI and benchmark orchestration 274 275 | # | Project | Domain | Language | Benchmarking practices | Link to code | 276 | --: | ------- | ------ | -------- | ---------------------- | ------------ | 277 | 202 | github-action-benchmark | CI benchmark tracking | Multi-language | Persist benchmark history in GitHub Pages, compare commits | [benchmark-action/github-action-benchmark](https://github.com/benchmark-action/github-action-benchmark) | 278 | 203 | codspeed | CI benchmarking | 
Multi-language | PR regression detection, stable virtualized measurements, CI integration | [CodSpeedHQ/codspeed](https://github.com/CodSpeedHQ/codspeed) | 279 | 204 | Bencher | CI benchmark tracking | Multi-language | Benchmark result storage, alerts, regression gating, dashboards | [bencherdev/bencher](https://github.com/bencherdev/bencher) | 280 | 205 | BenchExec | Reproducible benchmarking | Python | Controlled execution, resource isolation, fair comparisons | [sosy-lab/benchexec](https://github.com/sosy-lab/benchexec) | 281 | 206 | ReBench | Benchmark orchestration | Python | Experiment definitions, result collection, reproducibility | [smarr/ReBench](https://github.com/smarr/ReBench) | 282 283 --- 284 285 ## Practical synthesis for a benchmark-generation skill 286 287 ### 1. Classify the repo first 288 289 Library, CLI, service, DB, frontend, compiler/runtime, ML, or cloud infra. 290 291 ### 2. Choose the benchmark style by repo type 292 293 * Library/runtime: microbench harness 294 * Service/web: load + percentile/SLO checks 295 * DB/search/storage: workload suite + dataset + environment capture 296 * ML: scenario/task suite + throughput/latency/compliance metadata 297 298 ### 3. Always generate these artifacts 299 300 * benchmark entrypoints 301 * reproducible input/workload definitions 302 * baseline storage 303 * compare script 304 * CI workflow 305 * machine-readable output 306 * human-readable summary 307 308 ### 4. Prefer these practices by default 309 310 * warmup 311 * repeated runs 312 * controlled environment notes 313 * percentile reporting for services 314 * baseline comparison 315 * regression thresholds 316 * raw JSON/CSV export 317 * optional profiling hooks 318 319 ### 5. Version the workload, not just the code 320 321 The best projects treat tracks, job files, page sets, benchmark corpora, and datasets as first-class versioned artifacts. 
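One lightweight way to make the workload a versioned artifact is to fingerprint both the scenario spec and the dataset and stamp that fingerprint into every result file. A minimal sketch (the `workload_fingerprint` helper and the spec fields are illustrative, not from any project above):

```python
import hashlib
import json

def workload_fingerprint(spec: dict, dataset_bytes: bytes) -> dict:
    """Return identifying metadata to attach to every benchmark result.

    Hashing both the scenario spec and the dataset means any silent change
    to either shows up as a different fingerprint, so results produced from
    different workload versions are never compared as if they were equal.
    """
    spec_json = json.dumps(spec, sort_keys=True).encode()
    return {
        "spec_sha256": hashlib.sha256(spec_json).hexdigest(),
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        # Embed the spec itself so result files are self-describing.
        "spec": spec,
    }

# Hypothetical workload: a bulk-insert scenario over a tiny CSV fixture.
spec = {"name": "bulk_insert", "rows": 100_000, "batch_size": 512}
meta = workload_fingerprint(spec, b"col_a,col_b\n1,2\n")
print(meta["spec_sha256"][:12], meta["dataset_sha256"][:12])
```

A compare script can then refuse to diff two result files whose fingerprints disagree, which is the same discipline Rally tracks and ClickBench dataset scripts enforce at larger scale.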
---

## High-value patterns your skill should copy

| Pattern | Good examples |
| ------- | ------------- |
| Warmup + calibrated loops | JMH, pyperf, BenchmarkDotNet, Criterion.rs |
| Stats-aware comparisons | Criterion.rs, benchstat, BenchmarkDotNet, asv |
| CI regression gates | Lighthouse CI, k6, github-action-benchmark, CodSpeed |
| Workload versioning | Rally, fio job files, ClickBench, js-framework-benchmark |
| Perf dashboards over time | rustc-perf, asv, LNT, AreWeFastYet |
| Bundled product benchmarks | pgbench, db_bench, redis-benchmark, sqlite speedtest1 |
| Realistic macrobench suites | pyperformance, TorchBench, MLPerf, PerfKitBenchmarker |
| Request-shape control | Vegeta, wrk2, Locust, k6 |
| Tail-latency focus | YCSB, k6, Vegeta, fortio, ghz |
| Provision-run-teardown infra | PerfKitBenchmarker, Rally, Phoronix Test Suite |

---

## Generalized benchmarking skill template

This template blends what the exemplary projects above actually do: warmups + repeated sampling, explicit regression thresholds, artifact publication, and (for noisy domains) infrastructure separation.

### Checklist

- Define the *decision* the benchmark must support (PR gate? release acceptance? capacity planning?).
- Choose benchmark type:
  - micro (tight loop) for code-level deltas (e.g., Google Benchmark, JMH, Criterion.rs).
  - macro (end-to-end) for user experience and systems behavior (e.g., Rally, ClickBench).
- Specify workloads as versioned artifacts:
  - code (bench definitions),
  - data generation scripts or fixed datasets,
  - scenario configs (YAML/JSON), and
  - a "how to run" doc.
- Encode performance contracts:
  - thresholds/budgets (fail the run when violated) for CI gating.
- Manage noise:
  - warmup and repeated sampling,
  - stable environments for authoritative runs,
  - record system metadata.
- Publish artifacts:
  - raw machine-readable output (JSON/CSV),
  - human report (HTML/markdown),
  - optional time-series dashboard (asv/rustc-perf/CI actions).

### Workflow diagram

```mermaid
flowchart TD
    A[Define question + budget] --> B[Workload spec + dataset + success criteria]
    B --> C[Harness design: warmup, sample size, correctness checks]
    C --> D[Run smoke benchmarks in CI]
    D --> E[Compare vs baseline; enforce budgets]
    E -->|Fail| F[Profile & diagnose]
    F --> C
    E -->|Pass| G[Publish artifacts: JSON/CSV + HTML + trends]
    G --> H[Authoritative scheduled/stable-hardware runs]
    H --> E
```

### CI snippet

Example GitHub Actions setup with **PR smoke** + **authoritative** runs (scheduled or manually triggered). This mirrors the industry pattern of separating fast/noisy CI checks from stable, infrastructure-backed performance decisions.

```yaml
name: performance

on:
  pull_request:
  workflow_dispatch:
  schedule:
    - cron: "0 3 * * *"

jobs:
  pr-smoke:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build (Release)
        run: ./ci/build_release.sh
      - name: Run smoke benchmarks
        run: ./ci/run_bench_smoke.sh --output smoke.json
      - name: Compare to baseline thresholds
        run: ./ci/compare_budget.py --input smoke.json --budget ./ci/budget.json
      - uses: actions/upload-artifact@v4
        with:
          name: smoke-bench
          path: smoke.json

  authoritative:
    if: github.event_name != 'pull_request'
    runs-on: [self-hosted, perf]
    steps:
      - uses: actions/checkout@v4
      - name: Pin environment
        run: ./ci/pin_env.sh
      - name: Build (Release)
        run: ./ci/build_release.sh
      - name: Run full benchmarks
        run: ./ci/run_bench_full.sh --output full.json
      - name: Compare vs last-known-good
        run: ./ci/compare_against_lkg.py --input full.json --lkg ./ci/lkg.json
      - uses: actions/upload-artifact@v4
        with:
          name: full-bench
          path: full.json
```

### Example microbenchmark scripts

**C++ (Google Benchmark):**

```cpp
#include <benchmark/benchmark.h>
#include <vector>
#include <numeric>
#include <algorithm>

static void BM_Sort(benchmark::State& state) {
  std::vector<int> v(state.range(0));
  for (auto _ : state) {
    std::iota(v.begin(), v.end(), 0);
    std::reverse(v.begin(), v.end());
    benchmark::DoNotOptimize(v);
    std::sort(v.begin(), v.end());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(BM_Sort)->Range(1<<10, 1<<20);
BENCHMARK_MAIN();
```

**Java (JMH):**

```java
import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ParseBench {

    @Benchmark
    @Warmup(iterations = 5, time = 1)
    @Measurement(iterations = 5, time = 1)
    @Fork(3)
    public int parseIntBench() {
        return Integer.parseInt("12345");
    }
}
```

**Rust (Criterion.rs):**

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fib(n: u64) -> u64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}

fn bench_fib(c: &mut Criterion) {
    // black_box keeps the constant input from being folded away.
    c.bench_function("fib 20", |b| b.iter(|| fib(black_box(20))));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);
```

**Python (pyperf):**

```python
import pyperf

def work():
    s = 0
    for i in range(10000):
        s += i * i
    return s

runner = pyperf.Runner()
runner.bench_func("work", work)
```

**Deno (deno bench):**

```ts
Deno.bench("url parse", () => {
  new URL("https://deno.land");
});
```

### Recommended tools per language

| Language | Preferred microbenchmark harness | Longitudinal/regression tracking | Notes |
| -------- | -------------------------------- | -------------------------------- | ----- |
| C/C++ | Google Benchmark | github-action-benchmark, custom dashboards | Pair with profilers (Perfetto/FlameGraph) for diagnosis |
| Java/JVM | JMH | CI artifacts + suites (Renaissance/DaCapo) | Always use warmup/forks; profile with async-profiler |
| .NET | BenchmarkDotNet | dotnet/performance patterns | Built-in plots + summaries help enforce discipline |
| Rust | Criterion.rs | rustc-perf style for large infra | Use cargo-flamegraph for diagnosis |
| Python | pyperf / pytest-benchmark | asv dashboards | Prefer representative suites (pyperformance) for macro |
| Go | `testing.B` + golang/perf tools | golang/benchmarks + dashboards | Profile with pprof |
| JS/TS | Runtime-integrated benches (Deno) | Lighthouse CI for web budgets | For web perf: budgets + reports (Lighthouse/WebPageTest) |
| Databases/Search | Rally, YCSB, ClickBench | Store artifacts + compare across versions | Explicitly version workloads/tracks and datasets |
| Load testing | k6, wrk2, Locust | Thresholds + CI gating | Prefer percentile-based contracts to avoid "avg-only" regressions |

### Reproducibility matrix

Use this as a checklist for what must be pinned or recorded to make results comparable.
| Category | Pin/record | Example "gold standard" practice |
| -------- | ---------- | -------------------------------- |
| Code identity | Baseline + contender commit hashes | Continuous benchmarking actions center workflows around commit comparisons |
| Benchmark definition | Versioned benchmark suite + parameters | Criterion stores prior-run statistics for comparisons |
| Dataset/workload | Dataset version/digest + workload files | ClickBench publishes dataset + scripts; YCSB defines workload files |
| Build/runtime | Compiler/runtime version + flags | BenchmarkDotNet builds isolated Release binaries; JMH uses forks/warmups |
| Environment | CPU model, governors, OS, container/VM flags | Systems and macrobench tools require explicit environment control |
| Measurement policy | Warmup, sample size, run duration | Criterion documents warmup/measurement phases and noise considerations |
| Regression policy | Budgets/thresholds + fail conditions | k6 thresholds and Lighthouse CI assertions encode pass/fail contracts |
| Artifacts | Raw JSON/CSV + human report | asv web dashboards and Lighthouse reports publish durable artifacts |

---

## Pitfalls and best practices

The most common failure mode is treating benchmarks as "single numbers" instead of **distributions under uncertainty**. Tools like Criterion.rs explicitly warn about outliers/noise and use robust methods (bootstrap, outlier classification, comparisons vs stored baselines) to avoid chasing randomness.

Latency measurement pitfalls are especially severe in service/load testing; coordinated omission can make reported latencies look better than reality. wrk2 is exemplary precisely because it is explicit about constant-throughput load generation and histogram-based tail-latency recording.
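Coordinated omission is easy to demonstrate with a tiny simulation. The sketch below (the numbers and the simulation itself are illustrative, not taken from wrk2) models a closed-loop load generator that intends to send one request every 10 ms against a service that stalls once for 500 ms. Measuring latency from the *actual* send time hides the stall behind a single bad sample; measuring from the *intended* send time reveals the queueing delay every scheduled request actually experienced:

```python
def latencies(intended_interval_ms, service_times_ms, corrected):
    """Simulate a single-threaded closed-loop load generator.

    Requests are intended to start every `intended_interval_ms`, but the
    generator cannot issue request i+1 until request i finishes. With
    corrected=False, latency is measured from the actual send time (the
    classic coordinated-omission mistake); with corrected=True, it is
    measured from the intended send time, as a real user would see it.
    """
    samples = []
    now = 0.0
    for i, svc in enumerate(service_times_ms):
        intended = i * intended_interval_ms
        send = max(now, intended)   # generator may still be stuck waiting
        finish = send + svc
        start = intended if corrected else send
        samples.append(finish - start)
        now = finish
    return samples

# 100 requests: 1 ms service time, except one 500 ms stall.
svc = [1.0] * 100
svc[10] = 500.0

uncorrected = latencies(10.0, svc, corrected=False)
corrected = latencies(10.0, svc, corrected=True)
print("uncorrected samples > 100 ms:", sum(1 for s in uncorrected if s > 100))
print("corrected samples > 100 ms:  ", sum(1 for s in corrected if s > 100))
```

The uncorrected view records exactly one sample above 100 ms, while the corrected view shows dozens of intended requests delayed behind the stall, which is why high percentiles from closed-loop tools without correction are untrustworthy.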
Budgets/thresholds are a best practice when you need CI gating: they force teams to encode *what matters* (e.g., "p95 < X" or "bundle cost < Y") rather than debating noisy deltas after the fact. Lighthouse CI and k6 document exactly this style of regression prevention.

Finally, macrobenchmarks are only as credible as their workload definitions and environment control. Benchmark suites like Rally (tracks/telemetry) and ClickBench (dataset + scripts) show that reproducibility requires the *entire pipeline* to be treated as an artifact, not only the system-under-test.

---

## Minimal recommendation set for a universal benchmark skill

If the skill must start with a small "starter brain," seed it with these 12 first:

1. Google Benchmark
2. JMH
3. BenchmarkDotNet
4. Criterion.rs
5. pyperf
6. asv
7. benchstat
8. hyperfine
9. Lighthouse CI
10. k6
11. YCSB
12. PerfKitBenchmarker

They cover the main benchmark archetypes: microbench, long-term regression, CI gating, service/load, data/store, and infra automation.
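Whatever the starter set, the budget-comparison gate that CI workflows invoke (the `compare_budget.py` step in the CI snippet earlier) can be very small. A minimal sketch, with illustrative JSON shapes (metric names and file formats are assumptions, not from any specific project):

```python
import json
import sys

def check_budgets(results: dict, budgets: dict) -> list:
    """Return a list of human-readable violations; empty means the gate passes.

    `results` maps metric name -> measured value (e.g. p95 latency in ms);
    `budgets` maps metric name -> maximum allowed value. Metrics present in
    `results` but absent from `budgets` are ignored, so adding a new
    benchmark never breaks the gate; a budgeted metric with no measurement
    is itself a violation, so gates cannot silently rot.
    """
    violations = []
    for metric, limit in budgets.items():
        value = results.get(metric)
        if value is None:
            violations.append(f"{metric}: no measurement found")
        elif value > limit:
            violations.append(f"{metric}: {value} exceeds budget {limit}")
    return violations

if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: compare_budget.py results.json budget.json
    results = json.load(open(sys.argv[1]))
    budgets = json.load(open(sys.argv[2]))
    problems = check_budgets(results, budgets)
    for p in problems:
        print("BUDGET VIOLATION:", p)
    sys.exit(1 if problems else 0)
```

The non-zero exit code on violation is what lets plain CI steps act as regression gates without any benchmark-specific tooling.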