<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Cradicle Explorer</title>
    <link href="/css/bootstrap/bootstrap.min.css" rel="stylesheet">
    <style>
      .form-control-dark::placeholder {
          color: #aaa;
          opacity: 1;
      }
    </style>
    <link rel="stylesheet" href="/assets/fontawesome/css/all.min.css">
    <link rel="icon" type="image/png" href="/favicon.png">


                <link href="/css/dashboard.css" rel="stylesheet">
                </head>
                <body>
                <header class="navbar navbar-dark sticky-top bg-dark flex-md-nowrap p-0 shadow">
                  <a class="navbar-brand col-md-3 col-lg-2 me-0 px-3 fs-6" href="/">Cradicle Explorer</a>
                  <button class="navbar-toggler position-absolute d-md-none collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#sidebarMenu" aria-controls="sidebarMenu" aria-expanded="false" aria-label="Toggle navigation">
                    <span class="navbar-toggler-icon"></span>
                  </button>
                  <form method="get" action="/cgi-bin/main" style="width:100%;"><input class="form-control form-control-dark w-100 rounded-0 border-0" type="text" name="q" placeholder="Search repos" aria-label="Search"></form>
                  <div class="navbar-nav flex-row">
                    <div class="nav-item text-nowrap">
                      <a class="nav-link px-3 active" href="/cgi-bin/repo?id=z3PfFA3CHj64RkyY8tRkieX7mk94f">avx512-keccak-bench</a>
                    </div>
                  </div>
                </header>
                <div class="container-fluid">
                  <div class="row">
                    <nav id="sidebarMenu" class="col-md-3 col-lg-2 d-md-block bg-dark sidebar collapse">
                      <div class="position-sticky pt-3 sidebar-sticky">
                        <ul class="nav flex-column">
                          <li class="nav-item">
                            <a class="nav-link active" href="/cgi-bin/repo?id=z3PfFA3CHj64RkyY8tRkieX7mk94f">
                              <i class="align-text-bottom fa-solid fa-info"></i>
                              Info
                            </a>
                          </li>
                          <li class="nav-item">
                            <a class="nav-link" href="/cgi-bin/repo?id=z3PfFA3CHj64RkyY8tRkieX7mk94f&issue=list">
                              <i class="align-text-bottom fa-solid fa-layer-group"></i>
                              Issues
                            </a>
                          </li>
                          <li class="nav-item">
                            <a class="nav-link" href="/cgi-bin/repo?id=z3PfFA3CHj64RkyY8tRkieX7mk94f&patch=list">
                              <i class="align-text-bottom fa-solid fa-vest-patches"></i>
                              Patches
                            </a>
                          </li>
                          <li class="nav-item">
                            <a class="nav-link" href="/cgi-bin/repo?id=z3PfFA3CHj64RkyY8tRkieX7mk94f&wallet=list">
                              <i class="align-text-bottom fa-solid fa-wallet"></i>
                              Wallets
                            </a>
                          </li>
                          <li class="nav-item">
                            <a class="nav-link" href="/cgi-bin/repo?id=z3PfFA3CHj64RkyY8tRkieX7mk94f&source=.">
                              <i class="align-text-bottom fa-solid fa-code"></i>
                              Source
                            </a>
                          </li>
                        </ul>
                        <h6 class="sidebar-heading d-flex justify-content-between align-items-center px-3 mt-4 mb-1 text-muted text-uppercase">
                          <span></span>
                        </h6>
                        <ul class="nav flex-column mb-2">
                        
                        </ul>
                      </div>
                    </nav>
                <main class="col-md-9 ms-sm-auto col-lg-10">
                  <div class="container px-1 py-3">
        

    <div class="list-group">
    <div class="list-group-item">
    <div style="font-size:1.3rem;">avx512-keccak-bench</div>
    <div class="repo-item">AVX-512 SHA3-256 fixed-prefix throughput benchmark with A2+A1 round-level optimizations</div>
    <div>rad:z3PfFA3CHj64RkyY8tRkieX7mk94f</div>
    </div>
    <div class="list-group-item">
    <div>Visibility</div>
    <div class="repo-item">public</div>
    </div>
    <div class="list-group-item">
    <div>Delegates</div><div class="repo-item">did:key:z6MkqJbkwYkmjH3tRyNTNKjkEicZJGRUxc6QXjzkfsriaKwu</div>
    </div>
    <div class="list-group-item">
    <div>Default branch</div>
    <div><span class="repo-item">main &#8594; b392677e8da1d0e96cdf7ebd0493c82b2b62d23f</span> (Sat Apr 18 18:23:55 2026)</div>
    </div>
    <div class="list-group-item">
    <div>Threshold</div>
    <div class="repo-item">1</div>
    </div>
    </div>
    
        <div class="list-group mt-3">
        <div class="list-group-item">
        <div class="mb-2" style="font-weight:bold;"><i class="fa-solid fa-book"></i> README.md</div>
        <pre style="margin:0; font-size:0.85rem; overflow-x:auto; color:#fafafa;"># avx512-keccak-bench

**A fast SHA3-256 benchmark for fixed-prefix preimage throughput measurement
on AVX-512 hardware.** On a 16-core Xeon 8488C (Sapphire Rapids), direct
state init takes the XKCP baseline from `146 MH/s` (8 workers) to `181 MH/s`
(+24%, 16 workers), and two round-level optimizations, A2 and A1, add a
further +7.7% on top for `195 MH/s`.

License: [MIT](LICENSE). Pure CPU, no crypto mined, not a real miner — just
a throughput harness for SHA3-256 preimage-style workloads (fixed prefix +
variable suffix, 24-byte messages, single-block absorb).

## What this measures

Repeatedly hash `SHA3-256(PREFIX || salt || counter)` with `PREFIX` = 11
bytes, `salt` = 5 bytes fixed per worker, `counter` = 8 bytes LE varying per
iteration. At each new lane-0 minimum, emit a `best` JSON record. This is
the typical shape of:

- Proof-of-work benchmarks where the hash input is a fixed header plus a
  nonce (relevant whenever the inner hash is SHA3-256 with a fixed prefix).
- Preimage-golf puzzles where you want the hash of a 24-byte input to
  have the most leading zero bits.
- Microbench harnesses exploring Keccak throughput on a new CPU.

It is NOT a cryptocurrency miner. There is no network protocol, no wallet,
no reward mechanism — just `printf(&quot;%s %s\n&quot;, hex_input, hex_hash)` when a
new best is found.
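
As a rough illustration of the input layout described above, here is a
minimal Python sketch built on `hashlib`. The salt value, the helper names,
and the byte-wise comparison of the first 8 digest bytes are assumptions made
for illustration; `src/keccak_bench.c` may order lane 0 differently.

```python
import hashlib

PREFIX = b"bench_sha3:"             # 11-byte fixed prefix (see the startup record)
SALT = bytes.fromhex("0102030405")  # hypothetical 5-byte per-worker salt


def bench_input(counter: int) -> bytes:
    """PREFIX || salt || counter, with the counter as 8 little-endian bytes."""
    return PREFIX + SALT + counter.to_bytes(8, "little")


def scan(start: int, count: int):
    """Return (counter, digest) for every new minimum of the first 8 digest bytes."""
    best = None
    beats = []
    for counter in range(start, start + count):
        digest = hashlib.sha3_256(bench_input(counter)).digest()
        head = digest[:8]  # output bytes 0..7, i.e. state lane (0,0)
        if best is None or best > head:  # assumed ordering: byte-wise compare
            best = head
            beats.append((counter, digest))
    return beats


if __name__ == "__main__":
    for counter, digest in scan(0, 100_000):
        print(bench_input(counter).hex(), digest.hex())
```

A pure-Python loop like this is orders of magnitude slower than the AVX-512
binary; it is only meant to pin down what is being hashed.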

## Headline result

| Binary | MH/s | Workers | vs baseline |
|---|---:|---:|---:|
| XKCP `KeccakP1600times8_AVX512_PermuteAll_24rounds`, 8 workers | ~146 | 8 | baseline |
| `keccak_bench` (direct state init, no A2/A1), 16 workers | ~181 | 16 | +24% |
| `keccak_bench_lto` (A2 + A1), 16 workers | **~195** | 16 | **+33% overall, +7.7% vs 16w direct** |

All measurements: Xeon 8488C @ 3.32 GHz sustained all-core AVX-512, one
pthread per worker (8 or 16, as listed in the table), 60 s warmup + 300 s
steady state, aggregate over all workers. See
[`docs/benchmarks.md`](docs/benchmarks.md) for methodology and raw data.
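
The percentages in the table follow directly from the rounded MH/s figures.
A small Python sketch of that arithmetic (the constant and function names are
illustrative only):

```python
# Throughput figures from the table above, in MH/s (rounded, as published).
BASELINE_8W = 146.0
DIRECT_16W = 181.0
A2_A1_16W = 195.0


def gain_percent(new: float, old: float) -> float:
    """Relative throughput gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    print(f"direct init vs baseline: {gain_percent(DIRECT_16W, BASELINE_8W):+.1f}%")
    print(f"A2 + A1 vs direct init:  {gain_percent(A2_A1_16W, DIRECT_16W):+.1f}%")
    print(f"A2 + A1 vs baseline:     {gain_percent(A2_A1_16W, BASELINE_8W):+.1f}%")
    print(f"per-worker rate at 16 workers: {A2_A1_16W / 16:.2f} MH/s")
```

Because the inputs are rounded, the computed gains can differ from the table
in the last digit.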

## Quick start

```sh
# 1. Install xsltproc (XKCP build dependency).
sudo apt-get install -y xsltproc

# 2. Build XKCP AVX-512 static lib (one-time).
git clone --recurse-submodules https://github.com/XKCP/XKCP.git /tmp/XKCP
make -C /tmp/XKCP AVX512/libXKCP.a

# 3. Build keccak_bench.
make

# 4. Run the 60-second bench.
make bench

# 5. Run the correctness test suite (3 checks, ~45s).
make test
```

For Docker users:

```sh
docker build -t avx512-keccak-bench .
docker run --rm avx512-keccak-bench make bench
```

Requires a CPU with AVX-512F + AVX-512VL, which provide `vpternlogq`
(Skylake-X or newer). On Sapphire Rapids you&#x27;ll see the full headline
number; Ice Lake / Rocket Lake will be proportionally slower. The binary
will SIGILL on pre-AVX-512 hardware. For reproducible numbers, run inside
the container built from the provided Dockerfile on an AVX-512-enabled EC2
`m7i.*`, GCP `c3-*`, or a modern workstation.

## What&#x27;s novel

### A2 — prefix-fixed round-1 theta partial precomputation

Keccak round 1&#x27;s theta uses 5 column parities and 5 deltas. For a
fixed-prefix workload the state absorb leaves 24 of 25 state lanes constant
per worker (only the counter lane varies). That means 4 of 5 column parities
and 3 of 5 deltas are constants and can be computed once at worker startup.
In the hot loop, only 1 column parity and 2 deltas need recomputing.
Savings: ~28 ops per iteration. See [`docs/design.md`](docs/design.md) for
the derivation.
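
The lane counts above can be checked with a short Python sketch. It pads two
24-byte messages that differ only in the counter into the 136-byte SHA3-256
rate block, diffs the resulting lanes, and propagates the difference through
the theta parity and delta definitions. The all-zero salt and the helper
names are hypothetical; the padding bytes and the theta definitions come from
the standard SHA-3 specification, not from this repo.

```python
RATE_BYTES = 136   # SHA3-256 rate: 1088 bits
STATE_LANES = 25   # 5 x 5 lanes of 64 bits


def absorbed_lanes(message: bytes) -> list:
    """Lanes of the state after XORing one padded block into the zero state."""
    block = bytearray(RATE_BYTES)
    block[:len(message)] = message
    block[len(message)] ^= 0x06      # SHA-3 domain separation + first pad bit
    block[RATE_BYTES - 1] ^= 0x80    # final pad bit
    state = bytes(block) + bytes(8 * STATE_LANES - RATE_BYTES)  # capacity is zero
    return [int.from_bytes(state[8 * i:8 * i + 8], "little")
            for i in range(STATE_LANES)]


def varying_sets(prefix: bytes, salt: bytes):
    """Lanes, column parities C[x] and deltas D[x] that depend on the counter."""
    lo = absorbed_lanes(prefix + salt + bytes(8))
    hi = absorbed_lanes(prefix + salt + b"\xff" * 8)
    lanes = {i for i in range(STATE_LANES) if lo[i] != hi[i]}
    # Lane index i = x + 5*y, so column x = i % 5; C[x] XORs the lanes of column x.
    parities = {i % 5 for i in lanes}
    # D[x] = C[x-1] ^ rol(C[x+1], 1): it varies if either input parity varies.
    deltas = {x for x in range(5)
              if {(x - 1) % 5, (x + 1) % 5}.intersection(parities)}
    return lanes, parities, deltas


if __name__ == "__main__":
    lanes, parities, deltas = varying_sets(b"bench_sha3:", bytes(5))
    print("varying lanes:", sorted(lanes))
    print("varying parities:", sorted(parities))
    print("varying deltas:", sorted(deltas))
```

This only tracks which values can change, not their contents, so it confirms
the 1-parity / 2-delta count without reproducing the precomputed constants.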

### A1 — round-24 lane-(0,0)-only short-circuit

SHA3-256&#x27;s output bytes 0..7 come from state lane (0,0). For the
&quot;does this hash beat my current best?&quot; comparison we only need the first
8 bytes in practice — they decide the lex order &gt;99.999% of the time.

Round 24 still needs the full theta (all 25 lanes contribute to the column
parities behind the theta deltas, else lane (0,0) would be wrong). But after
theta, we only need rho/pi/chi/iota for the 3 pre-pi lanes feeding post-pi
row-0 columns 0/1/2 (diagonals (0,0), (1,1), (2,2)); everything else is
dead code.

Savings: ~56 ops per iteration. On hash-beat (extremely rare) we re-init
the state with the same counter and call XKCP&#x27;s standard 24-round permute
to recover the full 32-byte digest for emission.

A1 and A2 compose: their critical paths are different, and throughput
ratios multiply. Modeled expectation was `~3-7%`; measured was `+7.7%`.
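
To see where the three diagonal lanes in A1 come from, the following Python
sketch enumerates the pi step and the rho offsets from the standard Keccak
definition and lists the pre-pi lanes that chi needs for output lane (0,0).
Function names are illustrative and not taken from `src/keccak_bench.c`.

```python
def rho_offsets() -> dict:
    """Rotation offsets r[(x, y)] for 64-bit lanes, per the Keccak definition."""
    offsets = {(0, 0): 0}
    x, y = 1, 0
    for t in range(24):
        offsets[(x, y)] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return offsets


def pi_destination(x: int, y: int) -> tuple:
    """pi moves lane (x, y) to position (y, 2x + 3y mod 5)."""
    return (y, (2 * x + 3 * y) % 5)


def lane00_sources() -> dict:
    """Pre-pi lanes (with rho offsets) feeding chi for output lane (0,0).

    chi computes out[0] = B[0] ^ (~B[1] AND B[2]) on row 0, so only the
    post-pi positions (0,0), (1,0) and (2,0) are live.
    """
    offsets = rho_offsets()
    live = {(0, 0), (1, 0), (2, 0)}
    return {
        (x, y): offsets[(x, y)]
        for x in range(5)
        for y in range(5)
        if pi_destination(x, y) in live
    }


if __name__ == "__main__":
    for lane, rot in sorted(lane00_sources().items()):
        print(lane, "rotate left by", rot)
```

The remaining 22 lanes still enter round 24 through the theta parities, which
is why theta cannot be trimmed the same way.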

### What didn&#x27;t work (negative results)

Several &quot;should have worked&quot; optimizations did not:

- **cpuminer-opt lane-complement trick**: obsolete post-`vpternlogq`
  (-11% in microbench on SPR).
- **PGO**: +0.08% (noise). Inner loop is branch-free AVX-512, no branch
  miss for PGO to fix.
- **BOLT**: -1.9% on a VM host that can&#x27;t expose LBR (`perf record -j
  any` denied), so the LBR profiling prerequisite was not met.
- **gcc 13 vs clang 18 full-LTO or thin-LTO**: within ±0.5%.
- **iTLB hugepages**: hot code fits in L1i (measured iTLB miss rate
  0.00025%, far below any threshold worth a page-map change).
- **2-stream ILP (pipeline 2 independent Keccak states per thread)**:
  -0.35% at 8-threads-no-SMT on SPR. Too few ZMM registers to schedule
  the extra stream.
- **Round-23 short-circuit**: symbolic op-count analysis showed +3
  `vpternlogq` + 6 `vpxorq` over the standard path → ~0.1-0.2% regression
  on SPR (vpternlogq is the port 0/5 bottleneck).

See [`docs/design.md`](docs/design.md) for the full ruleout matrix.

## JSON-lines output schema

The benchmark prints one JSON record per event to stdout:

```json
{&quot;kind&quot;:&quot;startup&quot;,&quot;workers&quot;:16,&quot;prefix&quot;:&quot;bench_sha3:&quot;,&quot;base_salt_hex&quot;:&quot;…&quot;,&quot;source&quot;:&quot;…&quot;,&quot;variant&quot;:&quot;a2_a1&quot;}
{&quot;kind&quot;:&quot;stats&quot;,&quot;worker&quot;:3,&quot;total&quot;:4194304,&quot;rate&quot;:12058624}
{&quot;kind&quot;:&quot;best&quot;,&quot;input_hex&quot;:&quot;…13 bytes…&quot;,&quot;hash_hex&quot;:&quot;…32 bytes…&quot;,&quot;worker&quot;:3,&quot;counter&quot;:123}
{&quot;kind&quot;:&quot;canary_fail&quot;,&quot;worker&quot;:0,&quot;expected&quot;:&quot;cf0d…&quot;,&quot;got&quot;:&quot;300d…&quot;,&quot;msg&quot;:&quot;HARDWARE_ERROR_CANARY_FAIL&quot;}
{&quot;kind&quot;:&quot;shutdown&quot;,&quot;total&quot;:81801609464}
```

`canary_fail` is emitted if the hardware-error canary (one known-answer
SHA3-256 check every `2^30` iterations per worker) finds a mismatch. In
that case the benchmark sets an internal stop flag and exits; a supervisor
(e.g., a systemd unit with `Restart=always`) should restart it. See
[`docs/correctness.md`](docs/correctness.md).
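
A supervisor or log consumer might process this stream along the following
lines. This is a hedged Python sketch: the record fields are taken from the
sample above, while `summarize`, the reading of `rate` as hashes per second,
and the choice to raise on `canary_fail` are assumptions, not part of the
repo.

```python
import json


def summarize(lines):
    """Fold a JSON-lines stream into the latest rate per worker and the total."""
    rates = {}     # worker id -> most recent rate (assumed hashes per second)
    total = None   # grand total from the shutdown record, if seen
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        kind = record.get("kind")
        if kind == "stats":
            rates[record["worker"]] = record["rate"]
        elif kind == "canary_fail":
            # Hypothetical policy: surface the hardware error to the caller.
            raise RuntimeError(f"worker {record['worker']}: {record['msg']}")
        elif kind == "shutdown":
            total = record["total"]
    return rates, total


if __name__ == "__main__":
    sample = [
        '{"kind":"stats","worker":3,"total":4194304,"rate":12058624}',
        '{"kind":"shutdown","total":81801609464}',
    ]
    rates, total = summarize(sample)
    print(sum(rates.values()) / 1e6, "MH/s across", len(rates), "worker(s)")
    print("total hashes:", total)
```

Records of other kinds (`startup`, `best`) are ignored here; a real consumer
would likely log them.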

## Correctness

Three test layers, all invoked by `make test` (~45 s):

1. **Fixed-vector parity** — 1000 random inputs, hashed both via
   `keccak_bench --validate-hex` (which routes through XKCP&#x27;s canonical
   permute + our SHA3 padding) and via Python `hashlib.sha3_256`. Must
   agree.
2. **Hot-path beat parity** — 5 seconds of live mining, verify every
   emitted `best` record re-hashes correctly with `hashlib`.
3. **A1 no-missed-beats** — 2 million counters, single worker: compute
   the reference set of lane-0 beats locally with `hashlib`, run the
   miner over the same counter range, assert the emitted set is bit-
   identical. This is the test that directly exercises the A1
   short-circuit claim.

See [`docs/correctness.md`](docs/correctness.md).

## File layout

```
README.md              — this file
LICENSE                — MIT (+ XKCP CC0 attribution footnote)
Makefile               — top-level `make bench`, `make test`, etc.
Dockerfile             — reproducible build container
docs/
  design.md            — A2 + A1 derivation; op-count analysis
  benchmarks.md        — MH/s measurements + methodology
  correctness.md       — validation approach
  architecture.md      — SPR AVX-512 microarch notes
src/
  keccak_bench.c       — the miner
  Makefile             — builds `keccak_bench` + `keccak_bench_lto`
  validate.py          — 3-check correctness harness
test/
  vectors.json         — known-answer test vectors (hashlib ground truth)
  test_bench.py        — pytest shim around validate.py
tools/
  leak_audit.sh        — repo-hygiene grep for banned strings; run as
                         pre-commit hook
  bench.sh             — 60s bench runner
backends/
  cuda/                — NVIDIA GPU port (A2 + A1 on scalar Keccak)
  hip/                 — AMD ROCm/HIP port (A2 + A1; shares keccak_scalar.h via symlink)
  sve2/                — ARM SVE2 port (A2 + A1 with EOR3/XAR/BCAX)
  wasm/                — WebAssembly port (Node CLI + browser demo;
                         live demo URL in `backends/wasm/LIVE_DEMO.md`)
docs/articles/
  preprint/            — LaTeX + PDF + abstract + Zenodo metadata
                         for the two technical articles
.github/workflows/
  ci.yml               — GitHub Actions CI (build + test, x86 AVX-512)
  cuda-ci.yml          — CUDA backend CI (nvcc compile-check + CPU parity)
  sve2-ci.yml          — SVE2 backend CI (cross-compile + QEMU parity)
  hip-ci.yml           — HIP backend CI (hipcc compile-check + CPU parity)
  wasm-ci.yml          — WASM backend CI (emscripten build + Node parity)
```

## Backends

The AVX-512 implementation lives in `src/keccak_bench.c`. Ports to
other ISAs live under `backends/&lt;arch&gt;/` and share the same JSON-lines
output schema + `bench_sha3:` prefix + 24-byte single-block absorb
layout, so they&#x27;re drop-in replacements for each other.

| Backend | Path | Status |
|---|---|---|
| **AVX-512 (x86_64, Sapphire Rapids / Granite Rapids / etc.)** | `src/keccak_bench.c` | Live (195 MH/s on Xeon 8488C, 16 workers) |
| **CUDA (NVIDIA GPUs, sm_70+)** | `backends/cuda/` | Compiles + hashlib-parity on CPU; GPU runtime validation deferred |
| **HIP / ROCm (AMD GPUs, gfx906 / gfx90a / gfx1100)** | `backends/hip/` | Compiles under `hipcc` + hashlib-parity on CPU; AMD GPU runtime validation deferred |
| **ARM SVE2 (Neoverse-N2 / Cortex-A710 / Cobalt 100 / Graviton 4)** | `backends/sve2/` | Cross-compiles + QEMU-SVE2-parity at VL∈{2,4,8}; native runtime validation deferred |
| **WebAssembly (browser + Node)** | `backends/wasm/` | Node CLI + browser WebWorker demo; hashlib parity via Node (4 tests) + Playwright-headless-Chromium smoke. **Served live** — see `backends/wasm/LIVE_DEMO.md` |

Each backend directory has its own `README.md` + `CORRECTNESS.md` +
build/test instructions.

## Live demo (WASM backend)

A public browser demo of the WASM backend is served at a
`trycloudflare.com` URL — see
[`backends/wasm/LIVE_DEMO.md`](backends/wasm/LIVE_DEMO.md) for the
current URL and a note on its ephemerality (the subdomain rotates on
tunnel restart). Open it in a modern browser, click **Start**, and a
WebWorker will benchmark SHA3-256 preimage throughput in your tab. The
page carries an opt-in disclosure banner and a visible &quot;no
cryptocurrency mining&quot; statement; nothing runs until clicked, and
nothing is sent to a server.

## Contributing

This is a small focused benchmark repo. Happy to take:

- Fixes for correctness bugs (rare; report with a reproducer).
- Extensions for other AVX-512-class CPUs (Ice Lake-SP, Granite Rapids)
  with measured numbers.
- Non-x86 AVX-512-analogue ports: additional targets for
  `backends/&lt;arch&gt;/` — RISC-V Zvkng, WebAssembly relaxed-SIMD,
  Apple Silicon AMX / SME2 (once ACLE 2024 lands in mainline
  clang). Keep them as sibling directories under `backends/`.

Open an issue or PR. For larger changes, an RFC-style issue first is
welcome.

## Articles

Long-form write-ups in `docs/articles/`:

- [`article_a2_round1_theta_precompute.md`](docs/articles/article_a2_round1_theta_precompute.md)
  — the A2 + A1 derivation in detail, what didn&#x27;t work alongside them
  (PGO / BOLT / 2-stream ILP / iTLB hugepages / round-23 short-
  circuit), and the reproducer.
- [`article_cpuminer_opt_lane_complement_obsolete.md`](docs/articles/article_cpuminer_opt_lane_complement_obsolete.md)
  — negative-result article: cpuminer-opt&#x27;s 8-way AVX-512 Keccak is
  11-14 % slower than XKCP on SPR because the pre-`vpternlogq` &quot;lane-
  complement&quot; χ trick has become strictly obsolete.

## Upstream

A proposal to contribute A2 + A1 upstream to XKCP is planned; if/when it
lands, this repo&#x27;s `src/keccak_bench.c` will shrink to a thin wrapper.
</pre>
        </div>
        </div>

</div>
</main>
</div>
</div>


</body>
</html>

