# Benchmarking Tools and Practices Across Major Software Areas

## Executive summary

Open-source projects that do benchmarking well treat it as an *engineering system* (repeatable workload definitions + controlled execution + interpretable analysis + regression feedback loops), not a one-off timing script. The best examples pair a benchmark harness with an operational workflow: baselines, statistical comparisons, CI/PR gating, artifact publication (dashboards/reports), and (when noise is high) dedicated infrastructure.

A strong pattern across ecosystems is a two-tier structure: **fast smoke benchmarks** that are safe to run on ordinary CI runners (used for catching obvious regressions) and **authoritative runs** on stable hardware (or perf clusters) used for merge decisions and long-term trend lines. This is explicit in infrastructure like Rust's compiler performance monitoring and GitHub-based continuous benchmarking actions.

Statistical awareness is the most consistent differentiator between "benchmark scripts" and "benchmark systems." Criterion.rs, for example, documents warmup/measurement phases, outlier treatment, bootstrap-based confidence intervals, and comparisons against stored baselines; this is exactly the level of rigor needed to make regression signals actionable under real-world noise.

In web/service and load-testing domains, best practice shifts away from absolute speed toward **budget/threshold enforcement**, tail-latency awareness, and explicit regression gating (fail the build if p95/p99/error budgets exceed policy). Lighthouse CI and k6 are canonical examples of this performance-as-a-contract style.

## How to read this document

Use the tables below as a retrieval set for benchmark creation:

* If the target codebase is a **library/runtime/compiler**, copy patterns from microbenchmark harnesses and perf dashboards.
* If it is a **service/web app**, copy patterns from load tools, budgets, and CI assertions.
* If it is a **database/storage/search system**, copy workload-spec, dataset, cluster-setup, and tail-latency practices.
* If it is **ML/data/cloud**, copy reproducibility, scenario definitions, telemetry, and compliance-style reporting.

---

## Microbenchmark harnesses and language-specific tools

| # | Project | Domain | Language | Benchmarking practices | Link to code |
| -: | ------- | ------ | -------- | ---------------------- | ------------ |
| 1 | Google Benchmark | C/C++ microbench | C++ | Dynamic iteration scaling, min-time runs, CPU vs wall time, JSON output, compare tooling | [google/benchmark](https://github.com/google/benchmark) |
| 2 | Nonius | C++ microbench | C++ | Minimal benchmark harness, repeated sampling, comparative benchmarking | [libnonius/nonius](https://github.com/libnonius/nonius) |
| 3 | Celero | C++ microbench | C++ | Baselines, fixtures, parameterized experiments, comparative reports | [DigitalInBlue/Celero](https://github.com/DigitalInBlue/Celero) |
| 4 | Hayai | C++ microbench | C++ | Test-like benchmark definitions, repetition and timing summaries | [nickbruun/hayai](https://github.com/nickbruun/hayai) |
| 5 | nanobench | C++ microbench | C++ | Low-overhead, stability-focused runs, single-header, easy embedding | [martinus/nanobench](https://github.com/martinus/nanobench) |
| 6 | folly Benchmark | C++ systems/libs | C++ | Macro + microbench support, tight integration into systems code | [facebook/folly](https://github.com/facebook/folly) |
| 7 | Catch2 benchmarks | C++ test+bench | C++ | Benchmark macros colocated with tests, easy adoption for Catch2 users | [catchorg/Catch2](https://github.com/catchorg/Catch2) |
| 8 | doctest benchmarks | C++ test+bench | C++ | Lightweight test-adjacent performance checks | [doctest/doctest](https://github.com/doctest/doctest) |
| 9 | Criterion (C library) | C benchmark framework | C | Benchmark harness for C projects with structured results | [Snaipe/Criterion](https://github.com/Snaipe/Criterion) |
| 10 | JMH | JVM microbench | Java | Warmup/measurement/forks, JVM-pitfall avoidance, profiler hooks | [openjdk/jmh](https://github.com/openjdk/jmh) |
| 11 | JMH Samples | JVM examples | Java | Canonical examples for warmups, profilers, state scoping | [openjdk/jmh (samples)](https://github.com/openjdk/jmh/tree/master/jmh-samples) |
| 12 | JMH JDK Microbenchmarks | JVM runtime suite | Java | Curated microbench corpus for measuring JDK performance changes | [openjdk/jmh-jdk-microbenchmarks](https://github.com/openjdk/jmh-jdk-microbenchmarks) |
| 13 | Caliper | Java benchmarking | Java | Benchmark runners, parameterization, structured result capture | [google/caliper](https://github.com/google/caliper) |
| 14 | Renaissance Benchmark Suite | JVM benchmark suite | Java/Scala | Modern concurrent workloads for JVM/JIT evaluation | [renaissance-benchmarks/renaissance](https://github.com/renaissance-benchmarks/renaissance) |
| 15 | DaCapo Benchmark Suite | JVM benchmark suite | Java | Long-lived standardized corpus for JVM comparative evaluation | [dacapobench/dacapobench](https://github.com/dacapobench/dacapobench) |
| 16 | sbt-jmh | Scala build integration | Scala | JMH plugin for sbt, standardized benchmark project layout | [sbt/sbt-jmh](https://github.com/sbt/sbt-jmh) |
| 17 | BenchmarkDotNet | .NET microbench | C# | Pilot/warmup/overhead/workload phases, isolated processes, diagnosers, baselines | [dotnet/BenchmarkDotNet](https://github.com/dotnet/BenchmarkDotNet) |
| 18 | Perfolizer | .NET stats | C# | Statistical analysis, distribution summaries, confidence-oriented comparison | [AndreyAkinshin/perfolizer](https://github.com/AndreyAkinshin/perfolizer) |
| 19 | dotnet/performance | .NET runtime suite | C# | BenchmarkDotNet-based benchmark repo for runtime regression tracking | [dotnet/performance](https://github.com/dotnet/performance) |
| 20 | Criterion.rs | Rust microbench | Rust | Bootstrap CIs, outlier detection, baseline comparison, plots | [bheisler/criterion.rs](https://github.com/bheisler/criterion.rs) |
| 21 | cargo-criterion | Rust CLI integration | Rust | Better cargo integration for Criterion workflows | [bheisler/cargo-criterion](https://github.com/bheisler/cargo-criterion) |
| 22 | critcmp | Rust comparison CLI | Rust | Compare Criterion benchmark outputs between runs | [BurntSushi/critcmp](https://github.com/BurntSushi/critcmp) |
| 23 | Divan | Rust microbench | Rust | Fast benchmark execution, counters, allocation-aware benchmarking | [nvzqz/divan](https://github.com/nvzqz/divan) |
| 24 | iai-callgrind | Rust perf analysis | Rust | Instruction-count benchmarking, callgrind-based reproducibility | [iai-callgrind/iai-callgrind](https://github.com/iai-callgrind/iai-callgrind) |
| 25 | Iai | Rust deterministic bench | Rust | One-shot instruction-level benchmarking, reduced noise vs time-based | [bheisler/iai](https://github.com/bheisler/iai) |
| 26 | cargo-benchcmp | Rust comparison | Rust | Compare saved benchmark outputs between runs | [BurntSushi/cargo-benchcmp](https://github.com/BurntSushi/cargo-benchcmp) |
| 27 | pyperf | Python benchmarking | Python | Calibration, multiprocessing, metadata capture, compare-to significance tools | [psf/pyperf](https://github.com/psf/pyperf) |
| 28 | pyperformance | Python macrobench | Python | Real-world workloads, interpreter comparison, regression-oriented suite | [python/pyperformance](https://github.com/python/pyperformance) |
| 29 | asv | Longitudinal benchmarking | Python | Benchmark-over-git-history, published dashboards, regression detection | [airspeed-velocity/asv](https://github.com/airspeed-velocity/asv) |
| 30 | pytest-benchmark | Test-integrated perf | Python | Pytest fixture, auto calibration, JSON output, compare/fail thresholds | [ionelmc/pytest-benchmark](https://github.com/ionelmc/pytest-benchmark) |
| 31 | perfplot | Python comparison | Python | N-vs-input-size comparisons, visualization of asymptotic behavior | [nschloe/perfplot](https://github.com/nschloe/perfplot) |
| 32 | timeit | Python stdlib | Python | Simple repeat loops, quick microbench smoke testing | [python/cpython (timeit)](https://github.com/python/cpython/tree/main/Lib/timeit.py) |
| 33 | nose-timer / pytest timers | Python test perf | Python | Quick timing signals inside test suites | [mahmoudimus/pytest-profiling](https://github.com/mahmoudimus/pytest-profiling) |
| 34 | golang/perf / benchstat | Go benchmarking | Go | Median/CI summaries, A/B comparisons, repeated benchmark sample analysis | [golang/perf](https://github.com/golang/perf) |
| 35 | Go testing.B | Go stdlib benchmarks | Go | Built-in benchmark loops, ns/op, allocs/op, benchmem patterns | [golang/go (testing)](https://github.com/golang/go/tree/master/src/testing) |
| 36 | x/benchmarks | Go benchmark corpora | Go | Shared benchmark suites for runtime/toolchain tracking | [golang/benchmarks](https://github.com/golang/benchmarks) |
| 37 | sweet | Go suite | Go | Standardized benchmark suite for broader Go performance work | [golang/benchmarks (sweet)](https://github.com/golang/benchmarks/tree/master/sweet) |
| 38 | go test -benchmem patterns | Go app codebases | Go | Allocation tracking paired with benchmark loops for practical regressions | [golang/go](https://github.com/golang/go) |
| 39 | Deno bench | JS/TS runtime | TypeScript/Rust | Built-in benchmarking, percentiles, JSON output, environment metadata | [denoland/deno](https://github.com/denoland/deno) |
| 40 | Node.js benchmark suite | JS runtime | JavaScript/C++ | Documented benchmark authoring, compare scripts, repeated runs, CSV outputs | [nodejs/node (benchmark)](https://github.com/nodejs/node/tree/main/benchmark) |
| 41 | tinybench | JS/TS microbench | TypeScript | Minimal harness, warmup, stats, browser/node support | [tinylibs/tinybench](https://github.com/tinylibs/tinybench) |
| 42 | Benchmark.js | JS microbench | JavaScript | High-resolution timing, statistical benchmark cycles, browser/node support | [bestiejs/benchmark.js](https://github.com/bestiejs/benchmark.js) |
| 43 | mitata | JS microbench | TypeScript | Fast modern benchmarking, comparisons, concise output | [evanwashere/mitata](https://github.com/evanwashere/mitata) |
| 44 | bun benchmark support | JS runtime | Zig/TypeScript | Built-in runtime benchmarking workflows for Bun ecosystem | [oven-sh/bun](https://github.com/oven-sh/bun) |
| 45 | kotlinx-benchmark | Kotlin multiplatform | Kotlin | Profiles, warmups, iterations, target-specific tasks, detailed reports | [Kotlin/kotlinx-benchmark](https://github.com/Kotlin/kotlinx-benchmark) |
| 46 | kotlinx-benchmark examples | Kotlin examples | Kotlin | Separate benchmark source sets, config profiles, task-driven runs | [Kotlin/kotlinx-benchmark (examples)](https://github.com/Kotlin/kotlinx-benchmark/tree/master/examples) |
| 47 | BenchmarkTools.jl | Julia microbench | Julia | Explicit sample/eval/trial model, saved params, regression comparisons | [JuliaCI/BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl) |
| 48 | BaseBenchmarks.jl | Julia runtime suite | Julia | Versioned manifests, CI tracking for Julia runtime performance | [JuliaCI/BaseBenchmarks.jl](https://github.com/JuliaCI/BaseBenchmarks.jl) |
| 49 | PkgBenchmark.jl | Julia package perf | Julia | Compare package benchmarks across commits/versions | [JuliaCI/PkgBenchmark.jl](https://github.com/JuliaCI/PkgBenchmark.jl) |
| 50 | Nanosoldier.jl | Julia perf bot | Julia | Bot-driven CI perf runs, report publishing, PR-native workflow | [JuliaCI/Nanosoldier.jl](https://github.com/JuliaCI/Nanosoldier.jl) |
| 51 | AirspeedVelocity.jl | Julia history benchmarks | Julia | Longitudinal benchmark tracking over revisions | [MilesCranmer/AirspeedVelocity.jl](https://github.com/MilesCranmer/AirspeedVelocity.jl) |
| 52 | PHPBench | PHP benchmarking | PHP | Revolutions/iterations, process isolation, assertions, baselines, HTML/CSV/Markdown reports | [phpbench/phpbench](https://github.com/phpbench/phpbench) |
| 53 | Pest benchmarks + PHPBench | PHP testing/perf | PHP | Keep perf checks close to test workflows | [pestphp/pest](https://github.com/pestphp/pest) |
| 54 | benchmark-ips | Ruby microbench | Ruby | Iterations-per-second, compare implementations, memory optionality | [evanphx/benchmark-ips](https://github.com/evanphx/benchmark-ips) |
| 55 | benchmark-memory | Ruby memory perf | Ruby | Memory-focused benchmarking alongside execution speed | [michaelherold/benchmark-memory](https://github.com/michaelherold/benchmark-memory) |
| 56 | benchmark-driver | Ruby benchmark runner | Ruby | Generated scripts, comparison mode, graphs, multi-runtime comparison | [benchmark-driver/benchmark-driver](https://github.com/benchmark-driver/benchmark-driver) |
| 57 | package-benchmark | Swift benchmarking | Swift | CPU, ARC, malloc, memory, OS resource metrics, CI-friendly reports | [ordo-one/package-benchmark](https://github.com/ordo-one/package-benchmark) |
| 58 | swift-benchmark | Swift microbench | Swift | Benchmark suites for library/runtime performance tracking | [google/swift-benchmark](https://github.com/google/swift-benchmark) |
| 59 | Criterion (Haskell) | Haskell benchmarking | Haskell | Statistical benchmarking, configurable measurement, report generation | [bos/criterion](https://github.com/bos/criterion) |
| 60 | gauge | Haskell benchmarking | Haskell | Modernized Criterion-like measurement and reports | [vincenthz/hs-gauge](https://github.com/vincenthz/hs-gauge) |
| 61 | Core_bench | OCaml benchmarking | OCaml | Stable benchmarks, allocation and runtime observations | [janestreet/core_bench](https://github.com/janestreet/core_bench) |
| 62 | Bechamel | OCaml microbench | OCaml | Statistical benchmark toolkit for OCaml code | [mirage/bechamel](https://github.com/mirage/bechamel) |
| 63 | benchee | Elixir/Erlang benchmarking | Elixir | Parallel benchmarks, memory/time/reduction metrics, HTML output | [bencheeorg/benchee](https://github.com/bencheeorg/benchee) |
| 64 | benny | Elixir simple benchmark | Elixir | Small benchmark runner on top of Benchee-style workflow | [CultivateHQ/benny](https://github.com/CultivateHQ/benny) |
| 65 | Criterium | Clojure benchmarking | Clojure | Warmup, GC handling, statistical summaries | [hugoduncan/criterium](https://github.com/hugoduncan/criterium) |
| 66 | bench (R) | R microbench | R | High-precision timing primitives for R | [r-lib/bench](https://github.com/r-lib/bench) |
| 67 | microbenchmark (R) | R microbench | R | Repeated evaluation with distribution awareness | [joshuaulrich/microbenchmark](https://github.com/joshuaulrich/microbenchmark) |
| 68 | hyperfine | CLI benchmarking | Rust | Auto run count, warmups, prepare commands, shell overhead correction, JSON/Markdown export | [sharkdp/hyperfine](https://github.com/sharkdp/hyperfine) |
| 69 | gauge.js / CPCBenchmark | Misc examples | Various | Cross-implementation comparison and reproducibility templates | [benchmarko/CPCBenchmark](https://github.com/benchmarko/CPCBenchmark) |
| 70 | RepoTransBench | Repository-level code benchmark | Multi-language | Executable test-based evaluation across real repositories | [DeepSoftwareAnalytics/RepoTransBench](https://github.com/DeepSoftwareAnalytics/RepoTransBench) |

## Web, frontend, API, and load testing

These tools illustrate a different pattern: not just timing loops, but **budgets, thresholds, repeated runs, request-shape control, and CI pass/fail semantics**. Lighthouse CI emphasizes historical report tracking and assertions, k6 emphasizes thresholds/SLO-style gates, and wrk2 explicitly addresses coordinated omission for tail-latency correctness.
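The budget/threshold pattern can be made concrete without committing to any one tool. Below is a minimal sketch in Python, assuming a flat list of request latencies and a hypothetical budget dict: the field names `p95_ms` and `max_error_rate` are invented for illustration, and real tools such as k6 and Lighthouse CI express the same idea in their own config formats.

```python
"""Hypothetical CI latency gate: fail when p95 or error rate exceeds a budget.

Budget names and the nearest-rank percentile choice are illustrative
assumptions, not the behavior of any specific load-testing tool.
"""

def percentile(samples, p):
    """Nearest-rank percentile; adequate for a pass/fail gate sketch."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def check_budgets(latencies_ms, errors, budgets):
    """Return a list of human-readable violations; an empty list means pass."""
    violations = []
    p95 = percentile(latencies_ms, 95)
    if p95 > budgets["p95_ms"]:
        violations.append(f"p95 {p95:.1f} ms exceeds budget {budgets['p95_ms']} ms")
    error_rate = errors / max(1, len(latencies_ms))
    if error_rate > budgets["max_error_rate"]:
        violations.append(f"error rate {error_rate:.3f} exceeds {budgets['max_error_rate']}")
    return violations

if __name__ == "__main__":
    latencies = [12.0, 15.0, 14.0, 18.0, 22.0, 35.0, 16.0, 13.0, 250.0, 17.0]
    budgets = {"p95_ms": 100, "max_error_rate": 0.01}
    # A CI wrapper would exit non-zero when violations are present.
    print(check_budgets(latencies, errors=0, budgets=budgets))
```

The key design point mirrors the tools in the table below: the benchmark run produces raw measurements, and a separate, declarative budget decides pass/fail, so the policy can be reviewed and versioned independently of the workload.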
| # | Project | Domain | Language | Benchmarking practices | Link to code |
| -: | ------- | ------ | -------- | ---------------------- | ------------ |
| 71 | Lighthouse CI | Web perf CI | JavaScript | Repeated Lighthouse runs, budgets/assertions, PR status checks, report uploads | [GoogleChrome/lighthouse-ci](https://github.com/GoogleChrome/lighthouse-ci) |
| 72 | Lighthouse | Frontend audit | JavaScript | Standardized perf audits, lab metrics, report generation | [GoogleChrome/lighthouse](https://github.com/GoogleChrome/lighthouse) |
| 73 | js-framework-benchmark | Frontend frameworks | JavaScript | Fixed UI operations, weighted scores, browser-consistent comparisons | [krausest/js-framework-benchmark](https://github.com/krausest/js-framework-benchmark) |
| 74 | k6 | Load testing | Go/JavaScript | Tests-as-code, thresholds, percentile gates, CI-friendly exit codes | [grafana/k6](https://github.com/grafana/k6) |
| 75 | k6-operator | Kubernetes load testing | Go | Kubernetes operator for distributed k6 runs, declarative CRDs | [grafana/k6-operator](https://github.com/grafana/k6-operator) |
| 76 | Locust | Load testing | Python | User-behavior scripting, distributed workers, scenario realism | [locustio/locust](https://github.com/locustio/locust) |
| 77 | Vegeta | HTTP load testing | Go | Constant-rate attack model, latency histograms, reproducible traffic shapes | [tsenart/vegeta](https://github.com/tsenart/vegeta) |
| 78 | wrk | HTTP benchmarking | C/LuaJIT | Multithreaded load generation, scriptable request shaping | [wg/wrk](https://github.com/wg/wrk) |
| 79 | wrk2 | HTTP load benchmarking | C | Constant-throughput model, coordinated-omission awareness, HdrHistogram percentiles | [giltene/wrk2](https://github.com/giltene/wrk2) |
| 80 | autocannon | Node HTTP benchmarking | JavaScript | High-throughput HTTP/1.1 tests, pipelining, CLI/API usage | [mcollina/autocannon](https://github.com/mcollina/autocannon) |
| 81 | hey | HTTP load testing | Go | Simple constant-concurrency request benchmarking | [rakyll/hey](https://github.com/rakyll/hey) |
| 82 | oha | HTTP load testing | Rust | CLI load generation with modern TUI output and HTTP/2 support | [hatoo/oha](https://github.com/hatoo/oha) |
| 83 | bombardier | HTTP benchmarking | Go | Fast benchmark tool, latency distribution reporting, CLI usability | [codesenberg/bombardier](https://github.com/codesenberg/bombardier) |
| 84 | Gatling | Load testing | Scala | Scenario DSL, detailed reports, user-journey modeling | [gatling/gatling](https://github.com/gatling/gatling) |
| 85 | JMeter | Performance testing | Java | Thread groups, protocol breadth, report dashboards, scripting-heavy workloads | [apache/jmeter](https://github.com/apache/jmeter) |
| 86 | artillery | API/web perf | JavaScript | YAML/JS scenarios, CI automation, arrival-rate oriented traffic | [artilleryio/artillery](https://github.com/artilleryio/artillery) |
| 87 | siege | Web load test | C | Simple repeatable HTTP load generation and concurrency tests | [JoeDog/siege](https://github.com/JoeDog/siege) |
| 88 | fortio | Service benchmarking | Go | Latency histograms, gRPC/HTTP load generation, server + client modes | [fortio/fortio](https://github.com/fortio/fortio) |
| 89 | ghz | gRPC load testing | Go | gRPC-specific request shaping, reporting, scripting | [bojand/ghz](https://github.com/bojand/ghz) |
| 90 | NBomber | Load testing | F# / .NET | Scenario modeling, thresholds, reports, distributed load for .NET stacks | [PragmaticFlow/NBomber](https://github.com/PragmaticFlow/NBomber) |
| 91 | Tsung | Distributed load test | Erlang | Clustered load generation, protocol support, soak/stress testing | [processone/tsung](https://github.com/processone/tsung) |
| 92 | Nighthawk | L7 characterization | C++ | Envoy-centric HTTP/2+ performance investigations, config-driven scenarios | [envoyproxy/nighthawk](https://github.com/envoyproxy/nighthawk) |
| 93 | Drill | HTTP load testing | Rust | YAML-defined scenarios, scenario-driven load testing | [fcsonline/drill](https://github.com/fcsonline/drill) |
| 94 | sitespeed.io | Web performance analysis | JavaScript | Real-browser metrics, reporting, Docker-friendly automation | [sitespeedio/sitespeed.io](https://github.com/sitespeedio/sitespeed.io) |
| 95 | bundlesize | Bundle-size budgets | JavaScript | PR checks, config file budgets, fail PR if exceeds budget | [siddharthkp/bundlesize](https://github.com/siddharthkp/bundlesize) |
| 96 | size-limit | Performance budgets | JavaScript | "Cost to run" + size metrics, error in PR if exceeds limit | [ai/size-limit](https://github.com/ai/size-limit) |
| 97 | webpack-bundle-analyzer | Bundle visualization | JavaScript | Interactive treemap UI for bundle composition analysis | [webpack/webpack-bundle-analyzer](https://github.com/webpack/webpack-bundle-analyzer) |

## Databases, storage, search, and analytics

Benchmark quality in this category depends on **workload specs, datasets, cluster provisioning, query sets, and tail-latency reporting**. ClickBench emphasizes transparent scripts and datasets, and db-benchmarks emphasizes reproducibility and coefficient-of-variation control.
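Coefficient-of-variation control boils down to: repeat each workload several times, and distrust any result whose spread is too wide relative to its mean. A hedged sketch in Python of that acceptance check follows; the 5% cap and the `{query: timings}` input shape are assumptions for illustration, not the actual implementation of the db-benchmarks project.

```python
"""Run-stability sketch: accept a benchmark result only when the
coefficient of variation (stdev / mean) across repeats stays under a cap.
The 5% default cap is an illustrative assumption."""

from statistics import mean, stdev

def coefficient_of_variation(samples):
    """stdev / mean as a fraction; requires at least two samples."""
    return stdev(samples) / mean(samples)

def stable_results(runs, cov_cap=0.05):
    """Split {name: [timings]} into (accepted, rejected) dicts of rounded CoV."""
    accepted, rejected = {}, {}
    for name, timings in runs.items():
        cov = coefficient_of_variation(timings)
        (accepted if cov <= cov_cap else rejected)[name] = round(cov, 4)
    return accepted, rejected

if __name__ == "__main__":
    runs = {
        "q1_count_star": [101.0, 99.5, 100.2, 100.8],   # tight spread: keep
        "q2_group_by":   [210.0, 150.0, 330.0, 190.0],  # noisy: rerun
    }
    ok, noisy = stable_results(runs)
    print("accepted:", ok, "rerun:", noisy)
```

In a real harness the "rejected" set would trigger additional repetitions (or a flag in the published results) rather than silently dropping the query.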
| # | Project | Domain | Language | Benchmarking practices | Link to code |
| --: | ------- | ------ | -------- | ---------------------- | ------------ |
| 98 | YCSB | DB / KV serving | Java | Standard CRUD workloads, configurable bindings, throughput + tail latency | [brianfrankcooper/YCSB](https://github.com/brianfrankcooper/YCSB) |
| 99 | go-ycsb | DB / KV serving | Go | Go rewrite of YCSB patterns with modern bindings | [pingcap/go-ycsb](https://github.com/pingcap/go-ycsb) |
| 100 | BenchBase (formerly OLTPBench) | SQL benchmarking | Java | Multi-DB apples-to-apples SQL workload harness | [cmu-db/benchbase](https://github.com/cmu-db/benchbase) |
| 101 | pgbench | PostgreSQL | C | TPS/latency, concurrency scaling, bundled with DB for repeatable checks | [postgres/postgres (pgbench)](https://github.com/postgres/postgres/tree/master/src/bin/pgbench) |
| 102 | sysbench | OLTP systems | C/Lua | Repeatable DB + CPU/memory/file benchmarks, scripted workloads | [akopytov/sysbench](https://github.com/akopytov/sysbench) |
| 103 | fio | Storage / I/O | C | Job files, IOPS/latency/bandwidth logs, workload modeling, tail-latency focus | [axboe/fio](https://github.com/axboe/fio) |
| 104 | RocksDB db_bench | KV storage | C++ | Product-bundled benchmark tool, config-sensitive workload runs | [facebook/rocksdb](https://github.com/facebook/rocksdb) |
| 105 | LevelDB db_bench | KV storage | C++ | Bundled benchmark tool, throughput/latency summaries | [google/leveldb](https://github.com/google/leveldb) |
| 106 | db-benchmarks | DB/search benchmarking | Python | Reproducibility, fair conditions, coefficient-of-variation control | [db-benchmarks/db-benchmarks](https://github.com/db-benchmarks/db-benchmarks) |
| 107 | Rally | Elasticsearch benchmarking | Python | Versioned tracks, setup/teardown, telemetry, compare results | [elastic/rally](https://github.com/elastic/rally) |
| 108 | rally-tracks | Elasticsearch workloads | JSON/Python | Versioned benchmark tracks aligned with product versions | [elastic/rally-tracks](https://github.com/elastic/rally-tracks) |
| 109 | OpenSearch Benchmark | OpenSearch benchmarking | Python | Open benchmarking harness for search with community-driven methodology | [opensearch-project/opensearch-benchmark](https://github.com/opensearch-project/opensearch-benchmark) |
| 110 | ClickBench | OLAP benchmarking | Shell/SQL | Published dataset, fixed query set, install/run scripts, limits documented | [ClickHouse/ClickBench](https://github.com/ClickHouse/ClickBench) |
| 111 | TSBS | Time-series DB benchmarking | Go | Domain-specific workloads, ingestion/query scenarios, structured generators | [timescale/tsbs](https://github.com/timescale/tsbs) |
| 112 | OpenMLDB benchmark tools | Feature/serving DB | C++/Java | DB-engine-centric benchmark suites with workload realism | [4paradigm/OpenMLDB](https://github.com/4paradigm/OpenMLDB) |
| 113 | TPC-C derivatives | OLTP | Various | Transactional workload modeling, scale-factor benchmarking | [apavlo/py-tpcc](https://github.com/apavlo/py-tpcc) |
| 114 | TPC-H derivatives | Analytics | SQL/C++ | Standard query corpus, scale factor, analytical reproducibility | [electrum/tpch-dbgen](https://github.com/electrum/tpch-dbgen) |
| 115 | TPC-DS tooling | Analytics | C/SQL | More complex query corpus, warehouse-style workloads | [gregrahn/tpcds-kit](https://github.com/gregrahn/tpcds-kit) |
| 116 | sqlite speedtest1 | Embedded DB | C | Product-shipped speed test embedded in engineering workflow | [sqlite/sqlite](https://github.com/sqlite/sqlite) |
| 117 | DuckDB benchmark suite | Analytics DB | C++ | Product benchmark corpus and analytical query testing | [duckdb/duckdb](https://github.com/duckdb/duckdb) |
| 118 | CockroachDB roachtest | Distributed DB | Go | Cluster scenarios, scale/perf validation, fault + performance integration | [cockroachdb/cockroach (roachtest)](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/roachtest) |
| 119 | MongoDB benchRun | Document DB | JavaScript/C++ | Scriptable benchmark execution for DB operations | [mongodb/mongo](https://github.com/mongodb/mongo) |
| 120 | mongo-perf | MongoDB perf tools | Python/JS | Dedicated performance tooling and harness-based suites | [mongodb/mongo-perf](https://github.com/mongodb/mongo-perf) |
| 121 | genny | MongoDB load generation | C++/YAML | Workload-as-code patterns for repeatable DB performance testing | [mongodb/genny](https://github.com/mongodb/genny) |
| 122 | Redis benchmark | In-memory DB | C | Throughput/latency checks, protocol-aware benchmarking, environment guidance | [redis/redis](https://github.com/redis/redis) |
| 123 | memtier_benchmark | Cache / key-value | C | Multi-threaded cache benchmarking, pipelining, latency histograms | [redis/memtier_benchmark](https://github.com/redis/memtier_benchmark) |

## Search, stream, messaging, and distributed systems

| # | Project | Domain | Language | Benchmarking practices | Link to code |
| --: | ------- | ------ | -------- | ---------------------- | ------------ |
| 124 | OpenMessaging Benchmark | Messaging systems | Java | Standard pub/sub workloads, cloud-message broker comparisons | [openmessaging/benchmark](https://github.com/openmessaging/benchmark) |
| 125 | Kafka performance tools | Streaming/messaging | Java/Scala | Producer/consumer perf scripts, throughput and latency checks | [apache/kafka](https://github.com/apache/kafka) |
| 126 | NATS benchmark | Messaging | Go | Pub/sub throughput benchmarking for NATS clusters | [nats-io/nats-server (scripts)](https://github.com/nats-io/nats-server/tree/main/scripts) |
| 127 | Pulsar performance tools | Messaging | Java | Built-in producer/consumer benchmarking utilities | [apache/pulsar](https://github.com/apache/pulsar) |
| 128 | RabbitMQ perf-test | Messaging | Java | Broker-specific throughput/latency benchmarking under varied configs | [rabbitmq/rabbitmq-perf-test](https://github.com/rabbitmq/rabbitmq-perf-test) |
| 129 | Cassandra stress | Distributed DB | Java | Product-bundled workload generator, schema-aware stress tests | [apache/cassandra](https://github.com/apache/cassandra) |
| 130 | ScyllaDB test/perf tools | Distributed DB | C++/Python | Latency-sensitive DB benchmarking and comparative workloads | [scylladb/scylladb](https://github.com/scylladb/scylladb) |
| 131 | scylla-bench | Distributed DB | Go | Purpose-built benchmark tool for Scylla-like workloads | [scylladb/scylla-bench](https://github.com/scylladb/scylla-bench) |
| 132 | Aerospike benchmark tool | Distributed KV | Java | Cluster-aware read/write latency benchmarking | [aerospike/aerospike-client-java](https://github.com/aerospike/aerospike-client-java) |
| 133 | FoundationDB benchmark tools | Distributed KV | C++ | Internal stress/performance tools for transactional systems | [apple/foundationdb](https://github.com/apple/foundationdb) |
| 134 | TiDB sysbench/ycsb integrations | Distributed SQL | Go | Public benchmark recipes based on standard suites | [pingcap/tidb](https://github.com/pingcap/tidb) |

## ML, data, AI, and scientific computing

ML benchmarking introduces another mature pattern: **scenario definitions, datasets, compliance rules, hardware disclosure, and throughput/latency tradeoffs**. MLPerf is the clearest example.
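The throughput/latency tradeoff these suites report can be summarized from raw per-request records. A minimal sketch in Python, assuming hypothetical `(latency_s, tokens)` tuples per completed request; MLPerf's serving scenario and vLLM's benchmark scripts define these metrics far more formally, so treat this only as the shape of the report:

```python
"""Serving-benchmark summary sketch: aggregate per-request records into
throughput and tail-latency numbers. The record shape (latency_s, tokens)
and the nearest-rank p99 are illustrative assumptions."""

def serving_summary(records, wall_clock_s):
    """records: list of (latency_s, tokens_generated) per completed request."""
    latencies = sorted(lat for lat, _ in records)
    total_tokens = sum(tokens for _, tokens in records)
    p99_idx = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "requests_per_s": len(records) / wall_clock_s,
        "tokens_per_s": total_tokens / wall_clock_s,
        "p50_latency_s": latencies[len(latencies) // 2],
        "p99_latency_s": latencies[p99_idx],
    }

if __name__ == "__main__":
    # 99 fast requests plus one slow straggler over a 10 s window.
    records = [(0.1, 10)] * 99 + [(2.0, 10)]
    print(serving_summary(records, wall_clock_s=10.0))
```

The point of pairing the two axes is that raising batch size usually raises `tokens_per_s` while also raising `p99_latency_s`; a report that shows only one axis hides that tradeoff.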
| # | Project | Domain | Language | Benchmarking practices | Link to code |
| --: | ------- | ------ | -------- | ---------------------- | ------------ |
| 135 | MLPerf Inference | ML inference | Python/C++ | Formal scenarios, compliance rules, throughput/latency comparability | [mlcommons/inference](https://github.com/mlcommons/inference) |
| 136 | MLPerf Training | ML training | Python | Standard training tasks, throughput/time-to-train, compliance workflows | [mlcommons/training](https://github.com/mlcommons/training) |
| 137 | TorchBench | PyTorch perf | Python | Real models/workloads, stack/regression tracking for framework evolution | [pytorch/benchmark](https://github.com/pytorch/benchmark) |
| 138 | lm-evaluation-harness | LLM eval | Python | Standardized task harness, model-vs-model comparison, reproducible prompts/tasks | [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) |
| 139 | HELM | LLM benchmarking | Python | Scenario-based model evaluation, standardized reporting | [stanford-crfm/helm](https://github.com/stanford-crfm/helm) |
| 140 | MTEB | Embedding benchmarks | Python | Broad task suite for embeddings, standardized leaderboard-style evaluation | [embeddings-benchmark/mteb](https://github.com/embeddings-benchmark/mteb) |
| 141 | BigCode Evaluation Harness | Code model eval | Python | Benchmark tasks for code generation models | [bigcode-project/bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) |
| 142 | DeepSpeed | LLM/training perf | Python/C++/CUDA | Throughput, memory, scale-out benchmarking patterns | [deepspeedai/DeepSpeed](https://github.com/deepspeedai/DeepSpeed) |
| 143 | DeepSpeedExamples | LLM/training examples | Python | Benchmark scripts for distributed training workflows | [microsoft/DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples) |
| 144 | Megatron-LM benchmarks | Large-scale training | Python | Scaling benchmarks, throughput, distributed systems measurements | [NVIDIA/Megatron-LM](https://github.com/NVIDIA/Megatron-LM) |
| 145 | vLLM benchmark tools | LLM serving | Python | Request throughput, token latency, serving-scale comparisons | [vllm-project/vllm](https://github.com/vllm-project/vllm) |
| 146 | TensorRT-LLM benchmarks | LLM inference systems | C++/Python | Batch-size, tokens/sec, latency under deployment constraints | [NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) |
| 147 | DeepBench | DL kernel benchmarks | C++/CUDA | Operation-level performance across hardware for DL primitives | [baidu-research/DeepBench](https://github.com/baidu-research/DeepBench) |
| 148 | ONNX Runtime | ML runtime perf | C++/Python | Inference performance testing, profiler support, build-over-build comparison | [microsoft/onnxruntime](https://github.com/microsoft/onnxruntime) |
| 149 | CUTLASS | GPU kernel benchmarks | C++/CUDA | Linear algebra kernel throughput/latency benchmarking | [NVIDIA/cutlass](https://github.com/NVIDIA/cutlass) |
| 150 | Triton | Compiler/kernel benchmarks | Python/C++ | Kernel-compiler benchmark scripts, GPU profiling integration | [triton-lang/triton](https://github.com/triton-lang/triton) |
| 151 | Transformers | ML library benchmarks | Python | Training/inference timing, multi-model performance measurement | [huggingface/transformers](https://github.com/huggingface/transformers) |
| 152 | JAX benchmarks | Numeric/ML | Python | Reproducible kernel/model benchmarks inside framework ecosystem | [jax-ml/jax](https://github.com/jax-ml/jax) |
| 153 | NumPy benchmarks | Numeric kernels | Python | asv-based history benchmarking over library revisions | [numpy/numpy (benchmarks)](https://github.com/numpy/numpy/tree/main/benchmarks) |
| 154 | pandas asv_bench | Dataframes | Python | Realistic dataframe workloads over git-history regressions | [pandas-dev/pandas (asv_bench)](https://github.com/pandas-dev/pandas/tree/main/asv_bench) |
| 155 | SciPy benchmarks | Scientific computing | Python | asv-based regressions for numerical kernels and algorithms | [scipy/scipy (benchmarks)](https://github.com/scipy/scipy/tree/main/benchmarks) |
| 156 | RAPIDS benchmarks | GPU data science | Python/C++ | End-to-end dataframe/ML perf on accelerated pipelines | [rapidsai/cudf](https://github.com/rapidsai/cudf) |
| 157 | Apache Arrow benchmarks | Data systems | C++/Python | archery benchmark runs, JSON capture, cross-revision comparisons | [apache/arrow](https://github.com/apache/arrow) |
| 158 | Polars benchmarks | Dataframes | Rust/Python | Dataframe-operation micro + macro comparisons against alternatives | [pola-rs/polars](https://github.com/pola-rs/polars) |

## Cloud, infra, OS, and platform benchmarking

| # | Project | Domain | Language | Benchmarking practices | Link to code |
| --: | ------- | ------ | -------- | ---------------------- | ------------ |
| 159 | PerfKitBenchmarker | Cloud benchmarking | Python | Provision-run-teardown workflow, cloud-neutral benchmark automation | [GoogleCloudPlatform/PerfKitBenchmarker](https://github.com/GoogleCloudPlatform/PerfKitBenchmarker) |
| 160 | Phoronix Test Suite | System benchmarking | PHP/Shell | Versioned test profiles, automated install/run/report pipeline | [phoronix-test-suite/phoronix-test-suite](https://github.com/phoronix-test-suite/phoronix-test-suite) |
| 161 | UnixBench | OS/system benchmark | C/Shell | Classic system benchmark battery across CPU/filesystem/process metrics | [kdlucas/byte-unixbench](https://github.com/kdlucas/byte-unixbench) |
| 162 | lmbench | OS/microbenchmark | C | Kernel/OS latency and bandwidth microbenchmarks | [intel/lmbench](https://github.com/intel/lmbench) |
| 163 | stress-ng | System stress/perf | C | Controlled stressors with measurable system behavior | [ColinIanKing/stress-ng](https://github.com/ColinIanKing/stress-ng) |
| 164 | STREAM | Memory bandwidth | C | Canonical memory bandwidth benchmark, standardized kernels | [jeffhammond/STREAM](https://github.com/jeffhammond/STREAM) |
| 165 | sysstat tools | System observability | C | Benchmark-adjacent measurement of CPU/I/O behavior during runs | [sysstat/sysstat](https://github.com/sysstat/sysstat) |
| 166 | perf | Linux profiling/perf | C | Hardware-counter and profile-assisted benchmark diagnosis | [torvalds/linux (perf)](https://github.com/torvalds/linux/tree/master/tools/perf) |
| 167 | fio benchmark profiles | Storage infra | C | Reusable job specs for storage infra reproducibility | [axboe/fio (examples)](https://github.com/axboe/fio/tree/master/examples) |
| 168 | kubernetes/perf-tests | Kubernetes benchmarking | Go/YAML | Cluster performance measurements, workload-as-code | [kubernetes/perf-tests](https://github.com/kubernetes/perf-tests) |
| 169 | iperf3 | Network bandwidth | C | Canonical open network bandwidth measurement tool | [esnet/iperf](https://github.com/esnet/iperf) |
| 170 | netperf | Network throughput/latency | C | Throughput and latency tests across multiple networking types | [HewlettPackard/netperf](https://github.com/HewlettPackard/netperf) |
| 171 | sockperf | Network latency | C | Specialized for latency-sensitive network benchmarking | [Mellanox/sockperf](https://github.com/Mellanox/sockperf) |

## Frameworks, browser engines, compilers, and large project perf suites

| # | Project | Domain | Language | Benchmarking practices | Link to code |
| --: | ------- | ------ | -------- | ---------------------- | ------------ |
| 172 | rustc-perf | Compiler perf | Rust | Dedicated perf infra, commit-vs-commit comparison, dashboard, PR feedback | [rust-lang/rustc-perf](https://github.com/rust-lang/rustc-perf) |
| 173 | LLVM LNT | Compiler perf infra | Python | Submission, storage,
dashboards, machine metadata, trend tracking | [llvm/llvm-lnt](https://github.com/llvm/llvm-lnt) | 237 | 174 | LLVM Test Suite | Compiler benchmark corpus | C/C++ | Standard benchmark corpus for compiler/runtime performance tracking | [llvm/llvm-test-suite](https://github.com/llvm/llvm-test-suite) | 238 | 175 | Chromium Telemetry | Browser perf | Python | Page sets, repeat runs, benchmark stories, result pipelines | [catapult-project/catapult](https://github.com/catapult-project/catapult) | 239 | 176 | WebPageTest | Web systems | C++/Python/JS | Repeated page runs, lab metrics, regression-oriented analysis | [catchpoint/WebPageTest](https://github.com/catchpoint/WebPageTest) | 240 | 177 | JetStream benchmark | Browser JS | JavaScript | Standard browser JavaScript performance corpus | [WebKit/JetStream](https://github.com/WebKit/JetStream) | 241 | 178 | Speedometer | Browser/web app perf | JavaScript | Standardized web-app responsiveness benchmark | [WebKit/Speedometer](https://github.com/WebKit/Speedometer) | 242 | 179 | ARES-6 | Browser JS | JavaScript | JavaScript engine computational benchmark | [WebKit/ARES-6](https://github.com/WebKit/ARES-6) | 243 | 180 | Octane (archived) | JS engines | JavaScript | Historical benchmark corpus still useful as reference material | [chromium/octane](https://github.com/chromium/octane) | 244 | 181 | AreWeFastYet | JS VM perf tracking | JavaScript | Continuous VM performance tracking dashboards | [nbp/arewefastyet](https://github.com/nbp/arewefastyet) | 245 | 182 | V8 benchmark suite | JS runtime | C++/JS | Engine perf corpora and regression tracking | [v8/v8](https://github.com/v8/v8) | 246 | 183 | CPython performance suite | Runtime perf | Python/C | pyperformance-driven language evolution checks | [python/cpython](https://github.com/python/cpython) | 247 | 184 | HHVM benchmarks | Runtime/JIT | C++/Hack | Runtime-oriented benchmark suites and perf tracking | [facebook/hhvm](https://github.com/facebook/hhvm) | 248 249 ### 
Profiling and visualization tools 250 251 Profiling is the natural complement to benchmarking — once a benchmark identifies a regression, profiling tools diagnose the root cause. The best performance workflows tightly couple benchmark harnesses with profiling capture and comparison. 252 253 | # | Project | Domain | Language | Benchmarking practices | Link to code | 254 | --: | ------- | ------ | -------- | ---------------------- | ------------ | 255 | 185 | FlameGraph | Stack visualization | Perl | Iconic flamegraph SVGs from sampled stack traces, repeatable comparison | [brendangregg/FlameGraph](https://github.com/brendangregg/FlameGraph) | 256 | 186 | speedscope | Profile visualization | TypeScript | Interactive web viewer for profile exploration and sharing | [jlfwong/speedscope](https://github.com/jlfwong/speedscope) | 257 | 187 | Perfetto | Tracing platform | C++/TypeScript | Production-grade tracing, trace configs, analysis tools, web UI | [google/perfetto](https://github.com/google/perfetto) | 258 | 188 | pprof | Profile analysis | Go | CPU/heap profiles, visualization and analysis tooling | [google/pprof](https://github.com/google/pprof) | 259 | 189 | bcc | eBPF tooling | Python/C | Low-overhead kernel/user-space eBPF measurements | [iovisor/bcc](https://github.com/iovisor/bcc) | 260 | 190 | bpftrace | eBPF tracing | C++ | Dynamic system measurement for benchmark root-cause analysis | [bpftrace/bpftrace](https://github.com/bpftrace/bpftrace) | 261 | 191 | Pyroscope | Continuous profiling | Go | CPU/memory/I/O profiling, continuous capture, time-series analysis | [grafana/pyroscope](https://github.com/grafana/pyroscope) | 262 | 192 | Parca | Continuous profiling | Go | CPU/memory usage down to line number over time, aggregate queries | [parca-dev/parca](https://github.com/parca-dev/parca) | 263 | 193 | async-profiler | JVM profiling | Java/C++ | Low-overhead CPU/heap sampling for JVM benchmark diagnosis | 
[async-profiler/async-profiler](https://github.com/async-profiler/async-profiler) | 264 | 194 | py-spy | Python profiling | Rust | Sampling profiler without code instrumentation, fast benchmark→profile loops | [benfred/py-spy](https://github.com/benfred/py-spy) | 265 | 195 | rbspy | Ruby profiling | Rust | Low-overhead sampling profiling for Ruby performance workflows | [rbspy/rbspy](https://github.com/rbspy/rbspy) | 266 | 196 | gperftools | C++ profiling/allocators | C++ | CPU/heap profiling toolkit, often paired with benchmark harnesses | [gperftools/gperftools](https://github.com/gperftools/gperftools) | 267 | 197 | heaptrack | Heap profiling | C++ | Heap allocation profiles, memory regression comparison across versions | [KDE/heaptrack](https://github.com/KDE/heaptrack) | 268 | 198 | Tracy | Instrumentation profiler | C++ | High-resolution interactive profiling for games/systems benchmarks | [wolfpld/tracy](https://github.com/wolfpld/tracy) | 269 | 199 | samply | Sampling profiler | Rust | Modern cross-platform sampling profiler for benchmark regression diagnosis | [mstange/samply](https://github.com/mstange/samply) | 270 | 200 | cargo-flamegraph | Rust profiling CLI | Rust | Bridges Rust benchmarking and profiling with minimal friction | [flamegraph-rs/flamegraph](https://github.com/flamegraph-rs/flamegraph) | 271 | 201 | yappi | Python profiling | C/Python | Thread/async-aware function profiling for concurrent benchmarks | [sumerc/yappi](https://github.com/sumerc/yappi) | 272 273 ### CI and benchmark orchestration 274 275 | # | Project | Domain | Language | Benchmarking practices | Link to code | 276 | --: | ------- | ------ | -------- | ---------------------- | ------------ | 277 | 202 | github-action-benchmark | CI benchmark tracking | Multi-language | Persist benchmark history in GitHub Pages, compare commits | [benchmark-action/github-action-benchmark](https://github.com/benchmark-action/github-action-benchmark) | 278 | 203 | codspeed | CI benchmarking | 
Multi-language | PR regression detection, stable virtualized measurements, CI integration | [CodSpeedHQ/codspeed](https://github.com/CodSpeedHQ/codspeed) | 279 | 204 | Bencher | CI benchmark tracking | Multi-language | Benchmark result storage, alerts, regression gating, dashboards | [bencherdev/bencher](https://github.com/bencherdev/bencher) | 280 | 205 | BenchExec | Reproducible benchmarking | Python | Controlled execution, resource isolation, fair comparisons | [sosy-lab/benchexec](https://github.com/sosy-lab/benchexec) | 281 | 206 | ReBench | Benchmark orchestration | Python | Experiment definitions, result collection, reproducibility | [smarr/ReBench](https://github.com/smarr/ReBench) | 282 283 --- 284 285 ## Practical synthesis for a benchmark-generation skill 286 287 ### 1. Classify the repo first 288 289 Library, CLI, service, DB, frontend, compiler/runtime, ML, or cloud infra. 290 291 ### 2. Choose the benchmark style by repo type 292 293 * Library/runtime: microbench harness 294 * Service/web: load + percentile/SLO checks 295 * DB/search/storage: workload suite + dataset + environment capture 296 * ML: scenario/task suite + throughput/latency/compliance metadata 297 298 ### 3. Always generate these artifacts 299 300 * benchmark entrypoints 301 * reproducible input/workload definitions 302 * baseline storage 303 * compare script 304 * CI workflow 305 * machine-readable output 306 * human-readable summary 307 308 ### 4. Prefer these practices by default 309 310 * warmup 311 * repeated runs 312 * controlled environment notes 313 * percentile reporting for services 314 * baseline comparison 315 * regression thresholds 316 * raw JSON/CSV export 317 * optional profiling hooks 318 319 ### 5. Version the workload, not just the code 320 321 The best projects treat tracks, job files, page sets, benchmark corpora, and datasets as first-class versioned artifacts. 
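One lightweight way to make the workload a versioned artifact is to fingerprint both the scenario spec and the dataset and stamp that fingerprint into every result file. A minimal sketch (the `workload_fingerprint` helper and the spec fields are illustrative, not from any project above):

```python
import hashlib
import json

def workload_fingerprint(spec: dict, dataset_bytes: bytes) -> dict:
    """Return identifying metadata to attach to every benchmark result.

    Hashing both the scenario spec and the dataset means any silent change
    to either shows up as a different fingerprint, so results produced from
    different workload versions are never compared as if they were equal.
    """
    spec_json = json.dumps(spec, sort_keys=True).encode()
    return {
        "spec_sha256": hashlib.sha256(spec_json).hexdigest(),
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        # Embed the spec itself so result files are self-describing.
        "spec": spec,
    }

# Hypothetical workload: a bulk-insert scenario over a tiny CSV fixture.
spec = {"name": "bulk_insert", "rows": 100_000, "batch_size": 512}
meta = workload_fingerprint(spec, b"col_a,col_b\n1,2\n")
print(meta["spec_sha256"][:12], meta["dataset_sha256"][:12])
```

A compare script can then refuse to diff two result files whose fingerprints disagree, which is the same discipline Rally tracks and ClickBench dataset scripts enforce at larger scale.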
---

## High-value patterns your skill should copy

| Pattern | Good examples |
| ------- | ------------- |
| Warmup + calibrated loops | JMH, pyperf, BenchmarkDotNet, Criterion.rs |
| Stats-aware comparisons | Criterion.rs, benchstat, BenchmarkDotNet, asv |
| CI regression gates | Lighthouse CI, k6, github-action-benchmark, CodSpeed |
| Workload versioning | Rally, fio job files, ClickBench, js-framework-benchmark |
| Perf dashboards over time | rustc-perf, asv, LNT, AreWeFastYet |
| Bundled product benchmarks | pgbench, db_bench, redis-benchmark, sqlite speedtest1 |
| Realistic macrobench suites | pyperformance, TorchBench, MLPerf, PerfKitBenchmarker |
| Request-shape control | Vegeta, wrk2, Locust, k6 |
| Tail-latency focus | YCSB, k6, Vegeta, fortio, ghz |
| Provision-run-teardown infra | PerfKitBenchmarker, Rally, Phoronix Test Suite |

---

## Generalized benchmarking skill template

This template blends what the exemplary projects above actually do: warmups + repeated sampling, explicit regression thresholds, artifact publication, and (for noisy domains) infrastructure separation.

### Checklist

- Define the *decision* the benchmark must support (PR gate? release acceptance? capacity planning?).
- Choose benchmark type:
  - micro (tight loop) for code-level deltas (e.g., Google Benchmark, JMH, Criterion.rs).
  - macro (end-to-end) for user experience and systems behavior (e.g., Rally, ClickBench).
- Specify workloads as versioned artifacts:
  - code (bench definitions),
  - data generation scripts or fixed datasets,
  - scenario configs (YAML/JSON), and
  - a "how to run" doc.
- Encode performance contracts:
  - thresholds/budgets (fail the run when violated) for CI gating.
- Manage noise:
  - warmup and repeated sampling,
  - stable environments for authoritative runs,
  - record system metadata.
- Publish artifacts:
  - raw machine-readable output (JSON/CSV),
  - human report (HTML/markdown),
  - optional time-series dashboard (asv/rustc-perf/CI actions).

### Workflow diagram

```mermaid
flowchart TD
    A[Define question + budget] --> B[Workload spec + dataset + success criteria]
    B --> C[Harness design: warmup, sample size, correctness checks]
    C --> D[Run smoke benchmarks in CI]
    D --> E[Compare vs baseline; enforce budgets]
    E -->|Fail| F[Profile & diagnose]
    F --> C
    E -->|Pass| G[Publish artifacts: JSON/CSV + HTML + trends]
    G --> H[Authoritative scheduled/stable-hardware runs]
    H --> E
```

### CI snippet

Example GitHub Actions setup with **PR smoke** + **authoritative** runs (scheduled or manually triggered). This mirrors the industry pattern of separating fast/noisy CI checks from stable, infrastructure-backed performance decisions.

```yaml
name: performance

on:
  pull_request:
  workflow_dispatch:
  schedule:
    - cron: "0 3 * * *"

jobs:
  pr-smoke:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build (Release)
        run: ./ci/build_release.sh
      - name: Run smoke benchmarks
        run: ./ci/run_bench_smoke.sh --output smoke.json
      - name: Compare to baseline thresholds
        run: ./ci/compare_budget.py --input smoke.json --budget ./ci/budget.json
      - uses: actions/upload-artifact@v4
        with:
          name: smoke-bench
          path: smoke.json

  authoritative:
    if: github.event_name != 'pull_request'
    runs-on: [self-hosted, perf]
    steps:
      - uses: actions/checkout@v4
      - name: Pin environment
        run: ./ci/pin_env.sh
      - name: Build (Release)
        run: ./ci/build_release.sh
      - name: Run full benchmarks
        run: ./ci/run_bench_full.sh --output full.json
      - name: Compare vs last-known-good
        run: ./ci/compare_against_lkg.py --input full.json --lkg ./ci/lkg.json
      - uses: actions/upload-artifact@v4
        with:
          name: full-bench
          path: full.json
```

### Example microbenchmark scripts

**C++ (Google Benchmark):**

```cpp
#include <benchmark/benchmark.h>
#include <vector>
#include <numeric>
#include <algorithm>

static void BM_Sort(benchmark::State& state) {
  std::vector<int> v(state.range(0));
  for (auto _ : state) {
    std::iota(v.begin(), v.end(), 0);
    std::reverse(v.begin(), v.end());
    benchmark::DoNotOptimize(v);
    std::sort(v.begin(), v.end());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(BM_Sort)->Range(1<<10, 1<<20);
BENCHMARK_MAIN();
```

**Java (JMH):**

```java
import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ParseBench {

    @Benchmark
    @Warmup(iterations = 5, time = 1)
    @Measurement(iterations = 5, time = 1)
    @Fork(3)
    public int parseIntBench() {
        return Integer.parseInt("12345");
    }
}
```

**Rust (Criterion.rs):**

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fib(n: u64) -> u64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}

fn bench_fib(c: &mut Criterion) {
    // black_box keeps the constant input from being folded away.
    c.bench_function("fib 20", |b| b.iter(|| fib(black_box(20))));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);
```

**Python (pyperf):**

```python
import pyperf

def work():
    s = 0
    for i in range(10000):
        s += i * i
    return s

runner = pyperf.Runner()
runner.bench_func("work", work)
```

**Deno (deno bench):**

```ts
Deno.bench("url parse", () => {
  new URL("https://deno.land");
});
```

### Recommended tools per language

| Language | Preferred microbenchmark harness | Longitudinal/regression tracking | Notes |
| -------- | -------------------------------- | -------------------------------- | ----- |
| C/C++ | Google Benchmark | github-action-benchmark, custom dashboards | Pair with profilers (Perfetto/FlameGraph) for diagnosis |
| Java/JVM | JMH | CI artifacts + suites (Renaissance/DaCapo) | Always use warmup/forks; profile with async-profiler |
| .NET | BenchmarkDotNet | dotnet/performance patterns | Built-in plots + summaries help enforce discipline |
| Rust | Criterion.rs | rustc-perf style for large infra | Use cargo-flamegraph for diagnosis |
| Python | pyperf / pytest-benchmark | asv dashboards | Prefer representative suites (pyperformance) for macro |
| Go | `testing.B` + golang/perf tools | golang/benchmarks + dashboards | Profile with pprof |
| JS/TS | Runtime-integrated benches (Deno) | Lighthouse CI for web budgets | For web perf: budgets + reports (Lighthouse/WebPageTest) |
| Databases/Search | Rally, YCSB, ClickBench | Store artifacts + compare across versions | Explicitly version workloads/tracks and datasets |
| Load testing | k6, wrk2, Locust | Thresholds + CI gating | Prefer percentile-based contracts to avoid "avg-only" regressions |

### Reproducibility matrix

Use this as a checklist for what must be pinned or recorded to make results comparable.
| Category | Pin/record | Example "gold standard" practice |
| -------- | ---------- | -------------------------------- |
| Code identity | Baseline + contender commit hashes | Continuous benchmarking actions center workflows around commit comparisons |
| Benchmark definition | Versioned benchmark suite + parameters | Criterion stores prior-run statistics for comparisons |
| Dataset/workload | Dataset version/digest + workload files | ClickBench publishes dataset + scripts; YCSB defines workload files |
| Build/runtime | Compiler/runtime version + flags | BenchmarkDotNet builds isolated Release binaries; JMH uses forks/warmups |
| Environment | CPU model, governors, OS, container/VM flags | Systems and macrobench tools require explicit environment control |
| Measurement policy | Warmup, sample size, run duration | Criterion documents warmup/measurement phases and noise considerations |
| Regression policy | Budgets/thresholds + fail conditions | k6 thresholds and Lighthouse CI assertions encode pass/fail contracts |
| Artifacts | Raw JSON/CSV + human report | asv web dashboards and Lighthouse reports publish durable artifacts |

---

## Pitfalls and best practices

The most common failure mode is treating benchmarks as "single numbers" instead of **distributions under uncertainty**. Tools like Criterion.rs explicitly warn about outliers/noise and use robust methods (bootstrap, outlier classification, comparisons vs stored baselines) to avoid chasing randomness.

Latency measurement pitfalls are especially severe in service/load testing; coordinated omission can make reported latencies look better than reality. wrk2 is exemplary precisely because it is explicit about constant-throughput load generation and histogram-based tail-latency recording.
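Coordinated omission is easy to demonstrate with a tiny simulation. The sketch below (the numbers and the simulation itself are illustrative, not taken from wrk2) models a closed-loop load generator that intends to send one request every 10 ms against a service that stalls once for 500 ms. Measuring latency from the *actual* send time hides the stall behind a single bad sample; measuring from the *intended* send time reveals the queueing delay every scheduled request actually experienced:

```python
def latencies(intended_interval_ms, service_times_ms, corrected):
    """Simulate a single-threaded closed-loop load generator.

    Requests are intended to start every `intended_interval_ms`, but the
    generator cannot issue request i+1 until request i finishes. With
    corrected=False, latency is measured from the actual send time (the
    classic coordinated-omission mistake); with corrected=True, it is
    measured from the intended send time, as a real user would see it.
    """
    samples = []
    now = 0.0
    for i, svc in enumerate(service_times_ms):
        intended = i * intended_interval_ms
        send = max(now, intended)   # generator may still be stuck waiting
        finish = send + svc
        start = intended if corrected else send
        samples.append(finish - start)
        now = finish
    return samples

# 100 requests: 1 ms service time, except one 500 ms stall.
svc = [1.0] * 100
svc[10] = 500.0

uncorrected = latencies(10.0, svc, corrected=False)
corrected = latencies(10.0, svc, corrected=True)
print("uncorrected samples > 100 ms:", sum(1 for s in uncorrected if s > 100))
print("corrected samples > 100 ms:  ", sum(1 for s in corrected if s > 100))
```

The uncorrected view records exactly one sample above 100 ms, while the corrected view shows dozens of intended requests delayed behind the stall, which is why high percentiles from closed-loop tools without correction are untrustworthy.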
Budgets/thresholds are a best practice when you need CI gating: they force teams to encode *what matters* (e.g., "p95 < X" or "bundle cost < Y") rather than debating noisy deltas after the fact. Lighthouse CI and k6 document exactly this style of regression prevention.

Finally, macrobenchmarks are only as credible as their workload definitions and environment control. Benchmark suites like Rally (tracks/telemetry) and ClickBench (dataset + scripts) show that reproducibility requires the *entire pipeline* to be treated as an artifact, not only the system-under-test.

---

## Minimal recommendation set for a universal benchmark skill

If the skill must start with a small "starter brain," seed it with these 12 first:

1. Google Benchmark
2. JMH
3. BenchmarkDotNet
4. Criterion.rs
5. pyperf
6. asv
7. benchstat
8. hyperfine
9. Lighthouse CI
10. k6
11. YCSB
12. PerfKitBenchmarker

They cover the main benchmark archetypes: microbench, long-term regression, CI gating, service/load, data/store, and infra automation.
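Whatever the starter set, the budget-comparison gate that CI workflows invoke (the `compare_budget.py` step in the CI snippet earlier) can be very small. A minimal sketch, with illustrative JSON shapes (metric names and file formats are assumptions, not from any specific project):

```python
import json
import sys

def check_budgets(results: dict, budgets: dict) -> list:
    """Return a list of human-readable violations; empty means the gate passes.

    `results` maps metric name -> measured value (e.g. p95 latency in ms);
    `budgets` maps metric name -> maximum allowed value. Metrics present in
    `results` but absent from `budgets` are ignored, so adding a new
    benchmark never breaks the gate; a budgeted metric with no measurement
    is itself a violation, so gates cannot silently rot.
    """
    violations = []
    for metric, limit in budgets.items():
        value = results.get(metric)
        if value is None:
            violations.append(f"{metric}: no measurement found")
        elif value > limit:
            violations.append(f"{metric}: {value} exceeds budget {limit}")
    return violations

if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: compare_budget.py results.json budget.json
    results = json.load(open(sys.argv[1]))
    budgets = json.load(open(sys.argv[2]))
    problems = check_budgets(results, budgets)
    for p in problems:
        print("BUDGET VIOLATION:", p)
    sys.exit(1 if problems else 0)
```

The non-zero exit code on violation is what lets plain CI steps act as regression gates without any benchmark-specific tooling.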