avx512-keccak-bench
AVX-512 SHA3-256 fixed-prefix throughput benchmark with A2+A1 round-level optimizations
rad:z3PfFA3CHj64RkyY8tRkieX7mk94f
Visibility
public
Delegates
did:key:z6MkqJbkwYkmjH3tRyNTNKjkEicZJGRUxc6QXjzkfsriaKwu
Default branch
main → b392677e8da1d0e96cdf7ebd0493c82b2b62d23f (Sat Apr 18 18:23:55 2026)
# avx512-keccak-bench

**A fast SHA3-256 benchmark for fixed-prefix preimage throughput measurement
on AVX-512 hardware.** Demonstrates two microarchitectural tricks that push
a 16-core Xeon 8488C (Sapphire Rapids) from `146 MH/s` (XKCP baseline,
8 workers) through `181 MH/s` (+24% with direct state init, 16 workers) to
`195 MH/s` (+7.7% more with A2 + A1 round-level optimizations on top).

License: [MIT](LICENSE). Pure CPU, no crypto mined, not a real miner — just
a throughput harness for SHA3-256 preimage-style workloads (fixed prefix +
variable suffix, 24-byte messages, single-block absorb).

## What this measures

Repeatedly hash `SHA3-256(PREFIX || salt || counter)` with `PREFIX` = 11
bytes, `salt` = 5 bytes fixed per worker, `counter` = 8 bytes LE varying per
iteration. At each new lane-0 minimum, emit a `best` JSON record. This is
the typical shape of:

- Proof-of-work benchmarks where the hash input is a fixed header plus a
  nonce (relevant whenever the inner hash is SHA3-256 with a fixed prefix).
- Preimage-golf puzzles where you want the hash of a 24-byte input to
  have the most leading zero bits.
- Microbench harnesses exploring Keccak throughput on a new CPU.

It is NOT a cryptocurrency miner. There is no network protocol, no wallet,
no reward mechanism; the only effect of a new best is a `best` record on
stdout carrying the input and hash in hex.
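
As a rough illustration of the workload shape described above, here is a minimal Python sketch using `hashlib`. It is not the benchmark's hot loop. The prefix `bench_sha3:` is the one shown in the startup record. The salt value and the helper names `message` and `scan` are hypothetical, and "lane-0 minimum" is assumed to mean the lexicographic minimum of the first 8 digest bytes.

```python
import hashlib
import struct

PREFIX = b"bench_sha3:"              # 11 bytes, as in the startup record
SALT = bytes.fromhex("0102030405")   # 5 bytes; placeholder value (assumption)

def message(counter: int) -> bytes:
    """PREFIX || salt || counter (8 bytes little-endian) = 24 bytes."""
    return PREFIX + SALT + struct.pack("<Q", counter)

def scan(start: int, count: int):
    """Yield (counter, digest) whenever the first 8 digest bytes set a new minimum."""
    best = None
    for counter in range(start, start + count):
        digest = hashlib.sha3_256(message(counter)).digest()
        head = digest[:8]            # the bytes produced by state lane (0,0)
        if best is None or head < best:
            best = head
            yield counter, digest

# Print each new best over a small counter range.
for counter, digest in scan(0, 20_000):
    print(message(counter).hex(), digest.hex())
```

The same loop doubles as a slow reference for checking which counters in a range should produce a `best` record.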

## Headline result

| Binary | MH/s | Workers | vs baseline |
|---|---:|---:|---:|
| XKCP `KeccakP1600times8_AVX512_PermuteAll_24rounds` | ~146 | 8 | baseline |
| `keccak_bench` (direct state init, no A2/A1) | ~181 | 16 | +24% |
| `keccak_bench_lto` (A2 + A1) | **~195** | 16 | **+33% overall, +7.7% vs 16-worker direct init** |

All measurements: Xeon 8488C @ 3.32 GHz sustained all-core AVX-512, one
pthread per worker (8 for the XKCP baseline, 16 otherwise), 60 s warmup +
300 s steady state, aggregated over all workers. See
[`docs/benchmarks.md`](docs/benchmarks.md) for methodology and raw data.
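
The percentages in the table follow directly from the rounded MH/s figures. A small sketch of that arithmetic, with the helper name `gain` being illustrative only:

```python
# Rounded MH/s figures from the headline table.
baseline, direct, a2_a1 = 146.0, 181.0, 195.0

def gain(new: float, old: float) -> float:
    """Relative throughput gain in percent."""
    return (new / old - 1.0) * 100.0

print(f"direct init vs baseline: {gain(direct, baseline):+.1f}%")
print(f"A2 + A1 vs direct init:  {gain(a2_a1, direct):+.1f}%")
print(f"A2 + A1 vs baseline:     {gain(a2_a1, baseline):+.1f}%")
print(f"per-worker rate:         {a2_a1 / 16:.2f} MH/s")
```

Because the inputs are themselves approximate, the last digit of each percentage should be read loosely.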

## Quick start

```sh
# 1. Install xsltproc (XKCP build dependency).
sudo apt-get install -y xsltproc

# 2. Build XKCP AVX-512 static lib (one-time).
git clone --recurse-submodules https://github.com/XKCP/XKCP.git /tmp/XKCP
make -C /tmp/XKCP AVX512/libXKCP.a

# 3. Build keccak_bench.
make

# 4. Run the 60-second bench.
make bench

# 5. Run the correctness test suite (3 checks, ~45s).
make test
```

For Docker users:

```sh
docker build -t avx512-keccak-bench .
docker run --rm avx512-keccak-bench make bench
```

Requires a CPU with AVX-512F + AVX-512VL, including `vpternlogq`
(Skylake-X or newer). On Sapphire Rapids you'll see the full headline
number; Ice Lake / Rocket Lake will be proportionally slower. The binary
will SIGILL on hardware without AVX-512. For reproducible numbers, run
inside the container built from the provided Dockerfile on an
AVX-512-enabled EC2 `m7i.*`, GCP `c3-*`, or a modern workstation.

## What's novel

### A2 — prefix-fixed round-1 theta partial precomputation

Keccak round 1's theta uses 5 column parities and 5 deltas. For a
fixed-prefix workload the state absorb leaves 24 of 25 state lanes constant
per worker (only the counter lane varies). That means 4 of 5 column parities
and 3 of 5 deltas are constants and can be computed once at worker startup.
In the hot loop, only 1 column parity and 2 deltas need recomputing.
Savings: ~28 ops per iteration. See [`docs/design.md`](docs/design.md) for
the derivation.
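
The counting argument can be checked with a short Python sketch. It builds the padded single block for two different counters and reports which lanes, column parities, and deltas differ. The salt value and the helper names are hypothetical, and this models only the round-1 theta inputs, not the AVX-512 code.

```python
import struct

PREFIX = b"bench_sha3:"              # 11 bytes
SALT = bytes.fromhex("0102030405")   # 5 bytes; placeholder value (assumption)
RATE = 136                           # SHA3-256 rate in bytes
MASK = (1 << 64) - 1

def rotl(v, n):
    """Rotate a 64-bit value left by n bits (0 < n < 64)."""
    return ((v << n) | (v >> (64 - n))) & MASK

def initial_lanes(counter):
    """Padded block as 25 little-endian lanes, index = x + 5*y."""
    block = bytearray(200)
    msg = PREFIX + SALT + struct.pack("<Q", counter)
    block[:len(msg)] = msg
    block[len(msg)] ^= 0x06          # SHA-3 domain bits + first pad bit
    block[RATE - 1] ^= 0x80          # final pad bit
    return list(struct.unpack("<25Q", block))

def theta_terms(lanes):
    """Column parities C[x] and deltas D[x] of the theta step."""
    c = [lanes[x] ^ lanes[x + 5] ^ lanes[x + 10] ^ lanes[x + 15] ^ lanes[x + 20]
         for x in range(5)]
    d = [c[(x + 4) % 5] ^ rotl(c[(x + 1) % 5], 1) for x in range(5)]
    return c, d

def differing(a, b):
    return [i for i in range(len(a)) if a[i] != b[i]]

lanes_a, lanes_b = initial_lanes(1), initial_lanes(2)
(c_a, d_a), (c_b, d_b) = theta_terms(lanes_a), theta_terms(lanes_b)
print("lanes that differ:   ", differing(lanes_a, lanes_b))   # -> [2]
print("parities that differ:", differing(c_a, c_b))           # -> [2]
print("deltas that differ:  ", differing(d_a, d_b))           # -> [1, 3]
```

The 8-byte counter occupies message bytes 16..23, which is exactly lane 2, so only `C[2]` and the two deltas that read it (`D[1]` and `D[3]`) change between iterations.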

### A1 — round-24 lane-(0,0)-only short-circuit

SHA3-256's output bytes 0..7 come from state lane (0,0). For the
"does this hash beat my current best?" comparison we only need the first
8 bytes in practice — they decide the lex order >99.999% of the time.

Round 24 still needs the full theta: all 25 lanes feed the five column
parities, and the deltas for columns 0, 1, and 2 together draw on all
five parities, so skipping any lane would make lane (0,0) wrong. But
after theta, we only need rho/pi/chi/iota for the 3 pre-pi lanes feeding
post-pi row-0 columns 0/1/2 (diagonals (0,0), (1,1), (2,2)); everything
else is dead code.

Savings: ~56 ops per iteration. When a candidate does beat the current
best (extremely rare), we re-init the state with the same counter and call
XKCP's standard 24-round permute to recover the full 32-byte digest for
emission.

A1 and A2 compose: their critical paths are different, and throughput
ratios multiply. Modeled expectation was `~3-7%`; measured was `+7.7%`.
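
A scalar Python sketch of the A1 idea follows. It runs 23 full rounds, then a last round that computes only lane (0,0), and compares the result against `hashlib`. This is a readable reference model under the single-block layout described above, not the AVX-512 implementation. All function names are illustrative, and the all-zero salt is a placeholder.

```python
import hashlib
import struct

MASK = (1 << 64) - 1
RATE = 136  # SHA3-256 rate in bytes

def rotl(v, n):
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK

def round_constants():
    """The 24 iota constants, generated with the standard LFSR."""
    out, r = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            r = ((r << 1) ^ ((r >> 7) * 0x71)) % 256
            if r & 2:
                rc ^= 1 << ((1 << j) - 1)
        out.append(rc)
    return out

RC = round_constants()

def theta_deltas(a):
    c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
    return [c[(x + 4) % 5] ^ rotl(c[(x + 1) % 5], 1) for x in range(5)]

def full_round(a, rc):
    d = theta_deltas(a)
    a = [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]   # theta
    x, y = 1, 0
    current = a[x][y]
    for t in range(24):                                          # rho + pi
        x, y = y, (2 * x + 3 * y) % 5
        current, a[x][y] = a[x][y], rotl(current, (t + 1) * (t + 2) // 2)
    for y in range(5):                                           # chi
        row = [a[x][y] for x in range(5)]
        for x in range(5):
            a[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
    a[0][0] ^= rc                                                # iota
    return a

def last_round_lane00(a, rc):
    """Final round restricted to the three lanes that feed lane (0,0)."""
    d = theta_deltas(a)                  # needs all 25 lanes
    b0 = a[0][0] ^ d[0]                  # rho offset 0, stays at (0,0)
    b1 = rotl(a[1][1] ^ d[1], 44)        # pi moves (1,1) to (1,0)
    b2 = rotl(a[2][2] ^ d[2], 43)        # pi moves (2,2) to (2,0)
    return (b0 ^ (~b1 & b2) ^ rc) & MASK # chi + iota for lane (0,0) only

def digest_head(msg):
    """First 8 bytes of SHA3-256(msg) for a message shorter than one block."""
    block = bytearray(200)
    block[:len(msg)] = msg
    block[len(msg)] ^= 0x06
    block[RATE - 1] ^= 0x80
    flat = struct.unpack("<25Q", block)
    a = [[flat[x + 5 * y] for y in range(5)] for x in range(5)]
    for rnd in range(23):
        a = full_round(a, RC[rnd])
    return struct.pack("<Q", last_round_lane00(a, RC[23]))

msg = b"bench_sha3:" + bytes(5) + struct.pack("<Q", 12345)
assert digest_head(msg) == hashlib.sha3_256(msg).digest()[:8]
```

The rotation offsets 44 and 43 are the standard rho offsets for lanes (1,1) and (2,2).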

### What didn't work (negative results)

Several "should have worked" optimizations did not:

- **cpuminer-opt lane-complement trick**: obsolete post-`vpternlogq`
  (-11% in microbench on SPR).
- **PGO**: +0.08% (noise). Inner loop is branch-free AVX-512, no branch
  miss for PGO to fix.
- **BOLT**: -1.9% on a VM host that can't expose LBR (`perf record -j any`
  is denied). The LBR prerequisite was not checked before the run, so this
  result was measured without branch-record profiles.
- **gcc 13 vs clang 18 full-LTO or thin-LTO**: within ±0.5%.
- **iTLB hugepages**: hot code fits in L1i (measured iTLB miss rate
  0.00025%, far below any threshold worth a page-map change).
- **2-stream ILP (pipeline 2 independent Keccak states per thread)**:
  -0.35% at 8-threads-no-SMT on SPR. Too few ZMM registers to schedule
  the extra stream.
- **Round-23 short-circuit**: symbolic op-count analysis showed +3
  `vpternlogq` + 6 `vpxorq` over the standard path → ~0.1-0.2% regression
  on SPR (vpternlogq is the port 0/5 bottleneck).

See [`docs/design.md`](docs/design.md) for the full ruleout matrix.

## JSON-lines output schema

The miner prints one JSON record per event to stdout:

```json
{"kind":"startup","workers":16,"prefix":"bench_sha3:","base_salt_hex":"…","source":"…","variant":"a2_a1"}
{"kind":"stats","worker":3,"total":4194304,"rate":12058624}
{"kind":"best","input_hex":"…13 bytes…","hash_hex":"…32 bytes…","worker":3,"counter":123}
{"kind":"canary_fail","worker":0,"expected":"cf0d…","got":"300d…","msg":"HARDWARE_ERROR_CANARY_FAIL"}
{"kind":"shutdown","total":81801609464}
```

`canary_fail` is emitted if the hardware-error canary (one known-answer
SHA3-256 check every `2^30` iterations per worker) finds a mismatch. In
that case the miner sets an internal stop flag and exits; a supervisor
(e.g., `systemd Restart=always`) should restart it. See
[`docs/correctness.md`](docs/correctness.md).
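
A consumer of this stream can be quite small. The sketch below is one possible reader. The field names come from the sample records above, while the `summarize` helper, the summary layout, and the reading of `rate` as hashes per second per worker are assumptions.

```python
import json

def summarize(lines):
    """Fold JSON-lines records into a summary; stop at the first canary_fail."""
    summary = {"workers": None, "rates": {}, "bests": [],
               "canary_fail": None, "total": None}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        kind = rec.get("kind")
        if kind == "startup":
            summary["workers"] = rec["workers"]
        elif kind == "stats":
            summary["rates"][rec["worker"]] = rec["rate"]   # latest rate per worker
        elif kind == "best":
            summary["bests"].append((rec["worker"], rec["counter"], rec["hash_hex"]))
        elif kind == "canary_fail":
            summary["canary_fail"] = rec                    # later output is suspect
            break
        elif kind == "shutdown":
            summary["total"] = rec["total"]
    return summary

sample = [
    '{"kind":"startup","workers":16,"prefix":"bench_sha3:","variant":"a2_a1"}',
    '{"kind":"stats","worker":3,"total":4194304,"rate":12058624}',
    '{"kind":"best","input_hex":"00","hash_hex":"ff","worker":3,"counter":123}',
    '{"kind":"shutdown","total":81801609464}',
]
result = summarize(sample)
print(result["workers"], sum(result["rates"].values()), result["total"])
```

To read live output, pass `sys.stdin` to `summarize` and pipe the binary's stdout into the script.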

## Correctness

Three test layers, all invoked by `make test` (~45 s):

1. **Fixed-vector parity** — 1000 random inputs, hashed both via
   `keccak_bench --validate-hex` (which routes through XKCP's canonical
   permute + our SHA3 padding) and via Python `hashlib.sha3_256`. Must
   agree.
2. **Hot-path beat parity** — 5 seconds of live mining, verify every
   emitted `best` record re-hashes correctly with `hashlib`.
3. **A1 no-missed-beats** — 2 million counters, single worker: compute
   the reference set of lane-0 beats locally with `hashlib`, run the
   miner over the same counter range, assert the emitted set is
   bit-identical. This is the test that directly exercises the A1
   short-circuit claim.

See [`docs/correctness.md`](docs/correctness.md).

## File layout

```
README.md              — this file
LICENSE                — MIT (+ XKCP CC0 attribution footnote)
Makefile               — top-level `make bench`, `make test`, etc.
Dockerfile             — reproducible build container
docs/
  design.md            — A2 + A1 derivation; op-count analysis
  benchmarks.md        — MH/s measurements + methodology
  correctness.md       — validation approach
  architecture.md      — SPR AVX-512 microarch notes
src/
  keccak_bench.c       — the miner
  Makefile             — builds `keccak_bench` + `keccak_bench_lto`
  validate.py          — 3-check correctness harness
test/
  vectors.json         — known-answer test vectors (hashlib ground truth)
  test_bench.py        — pytest shim around validate.py
tools/
  leak_audit.sh        — repo-hygiene grep for banned strings; run as
                         pre-commit hook
  bench.sh             — 60s bench runner
backends/
  cuda/                — NVIDIA GPU port (A2 + A1 on scalar Keccak)
  hip/                 — AMD ROCm/HIP port (A2 + A1; shares keccak_scalar.h via symlink)
  sve2/                — ARM SVE2 port (A2 + A1 with EOR3/XAR/BCAX)
  wasm/                — WebAssembly port (Node CLI + browser demo;
                         live URL in `backends/wasm/LIVE_DEMO.md`)
docs/articles/
  preprint/            — LaTeX + PDF + abstract + Zenodo metadata
                         for the two technical articles
.github/workflows/
  ci.yml               — GitHub Actions CI (build + test, x86 AVX-512)
  cuda-ci.yml          — CUDA backend CI (nvcc compile-check + CPU parity)
  sve2-ci.yml          — SVE2 backend CI (cross-compile + QEMU parity)
  hip-ci.yml           — HIP backend CI (hipcc compile-check + CPU parity)
  wasm-ci.yml          — WASM backend CI (emscripten build + Node parity)
```

## Backends

The AVX-512 implementation lives in `src/keccak_bench.c`. Ports to
other ISAs live under `backends/<arch>/` and share the same JSON-lines
output schema + `bench_sha3:` prefix + 24-byte single-block absorb
layout, so they're drop-in replacements for each other.

| Backend | Path | Status |
|---|---|---|
| **AVX-512 (x86_64, Sapphire Rapids / Granite Rapids / etc.)** | `src/keccak_bench.c` | Live (195 MH/s on Xeon 8488C, 16 workers) |
| **CUDA (NVIDIA GPUs, sm_70+)** | `backends/cuda/` | Compiles + hashlib-parity on CPU; GPU runtime validation deferred |
| **HIP / ROCm (AMD GPUs, gfx906 / gfx90a / gfx1100)** | `backends/hip/` | Compiles under `hipcc` + hashlib-parity on CPU; AMD GPU runtime validation deferred |
| **ARM SVE2 (Neoverse-N2 / Cortex-A710 / Cobalt 100 / Graviton 4)** | `backends/sve2/` | Cross-compiles + QEMU-SVE2-parity at VL∈{2,4,8}; native runtime validation deferred |
| **WebAssembly (browser + Node)** | `backends/wasm/` | Node CLI + browser WebWorker demo; hashlib parity via Node (4 tests) + Playwright-headless-Chromium smoke. **Served live** — see `backends/wasm/LIVE_DEMO.md` |

Each backend directory has its own `README.md` + `CORRECTNESS.md` +
build/test instructions.

## Live demo (WASM backend)

A public browser demo of the WASM backend is served at a
`trycloudflare.com` URL — see
[`backends/wasm/LIVE_DEMO.md`](backends/wasm/LIVE_DEMO.md) for the
current URL and a note on its ephemerality (the subdomain rotates on
tunnel restart). Open it in a modern browser, click **Start**, and a
WebWorker will benchmark SHA3-256 preimage throughput in your tab. The
page carries an opt-in disclosure banner and a visible "no
cryptocurrency mining" statement; nothing runs until clicked, and
nothing is sent to a server.

## Contributing

This is a small focused benchmark repo. Happy to take:

- Fixes for correctness bugs (rare; report with a reproducer).
- Extensions for other AVX-512-class CPUs (Icelake-SP, Sierra Forest,
  Granite Rapids) with measured numbers.
- Non-x86 AVX-512-analogue ports: additional targets for
  `backends/<arch>/` — RISC-V Zvkng, WebAssembly relaxed-SIMD,
  Apple Silicon AMX / SME2 (once ACLE 2024 lands in mainline
  clang). Keep them as sibling directories under `backends/`.

Open an issue or PR. For larger changes, an RFC-style issue first is
welcome.

## Articles

Long-form write-ups in `docs/articles/`:

- [`article_a2_round1_theta_precompute.md`](docs/articles/article_a2_round1_theta_precompute.md)
  — the A2 + A1 derivation in detail, what didn't work alongside them
  (PGO / BOLT / 2-stream ILP / iTLB hugepages / round-23
  short-circuit), and the reproducer.
- [`article_cpuminer_opt_lane_complement_obsolete.md`](docs/articles/article_cpuminer_opt_lane_complement_obsolete.md)
  — negative-result article: cpuminer-opt's 8-way AVX-512 Keccak is
  11-14% slower than XKCP on SPR because the pre-`vpternlogq`
  "lane-complement" χ trick has become strictly obsolete.

## Upstream

A proposal to contribute A2 + A1 upstream to XKCP is planned; if/when it
lands, this repo's `src/keccak_bench.c` will shrink to a thin wrapper.