guides/hll_performance.md

Select File
guides/hll_performance.md

# HLL Hot-Path Optimization

This document explains the HLL hot-path architecture introduced (and
extended) by `ex_data_sketch` v0.8.0 Phase 3.

## Why HLL throughput matters

HyperLogLog is the workhorse cardinality sketch. In production it is
called once per stream event, so its update path runs on every record
in the system. A 10x improvement in `update_many/2` directly translates
into a 10x improvement in the upstream pipeline's ceiling. CMS, ULL,
and Theta share the same hot-path structure and benefit from the same
optimizations.

## Why the BEAM-side hot path is slow

For an HLL with `p = 14` (16,384 registers), the steady-state per-item
work is:

1. Encode the item as a binary (`term_to_binary` if not already binary).
2. Compute a 64-bit hash.
3. Split the hash into bucket index (top `p` bits) + remaining bits.
4. Count leading zeros in the remaining bits.
5. Conditionally update one byte in the register array.

Steps 1–4 produce **per-item heap garbage** on the BEAM. Even the
"optimal" pure Elixir path allocates a binary for the hash, a tuple for
the bucket+rank pair, and a few intermediate small integers per item.
At 1 M items the GC pressure dominates everything else, and the bench
numbers in this document confirm it: Pure HLL caps at ~2 M items/sec
while Rust HLL exceeds 30 M items/sec on the same hardware.

## v0.8.0 hot-path architecture

There are four hot paths exposed by v0.8.0:

| # | Path | Backend | Hashing | Used by |
|---|------|---------|---------|---------|
| 1 | Pure Elixir | `Backend.Pure` | Elixir (`phash2 / xxh3 / murmur3`) | NIF-less builds, BEAM-only test envs |
| 2 | Rust non-raw NIF | `Backend.Rust` | Elixir → pass hashes_bin to Rust | Pre-hashed inputs, `:phash2`, `:custom` |
| 3 | Rust raw NIF (XXH3) | `Backend.Rust` | Rust XXH3 inside the NIF | Default `:xxhash3` path |
| 4 | Rust raw_h NIF (Murmur3) | `Backend.Rust` | Rust Murmur3 inside the NIF | `:murmur3` path (new in v0.8.0) |

The four sketches that benefit from this architecture are **HLL, ULL,
Theta, CMS** — the cardinality / point-count family that hashes every
input item.

### Path dispatch logic

Each of the four sketches' `update_many/2` runs this decision tree:

```text
if backend == Backend.Pure:
    # Path 1: pure Elixir, batched
    hashes = chunk |> Enum.map(&hash_item/2)
    Backend.Pure.<sketch>_update_many(state, hashes, opts)

elif backend == Backend.Rust:
    if opts[:hash_fn] != nil:
        # Custom closure: must run on the BEAM. Path 2.
        hashes = chunk |> Enum.map(&hash_item/2)
        Backend.Rust.<sketch>_update_many(state, hashes, opts)

    elif opts[:hash_strategy] == :phash2:
        # phash2 not implemented in Rust. Path 2.
        hashes = chunk |> Enum.map(&hash_item/2)
        Backend.Rust.<sketch>_update_many(state, hashes, opts)

    else:
        # :xxhash3 or :murmur3 → Rust hashing. Path 3 or 4.
        Backend.Rust.<sketch>_update_many_raw(state, chunk, opts)
```

The `<sketch>_update_many_raw/3` Elixir function then dispatches at the
Rust-call level:

- `:xxhash3` → legacy `_raw_nif` (binary-compatible with v0.7.1).
- `:murmur3` → new `_raw_h_nif` with `algorithm=2`.
- Both gain a Dirty-CPU variant once the chunk size exceeds the
  configured threshold (default 10,000 items).

### Why chunking matters

Items arrive at `update_many/2` as an arbitrary enumerable. The
implementation chunks them at 10,000 items per chunk. This serves three
purposes:

1. **Bounded per-call cost** — even with 100M items, each NIF call
   stays bounded to ~30ms wall-clock, well within Erlang's 1 ms
   reduction budget for the dirty scheduler.
2. **Steady allocation** — a 10,000-item `ListIterator` is small enough
   to fit in L2 cache on most platforms.
3. **Predictable scheduler behavior** — the chunk size is the unit
   compared against the configured dirty threshold to choose normal vs
   dirty NIF.

### Why `update_many` is the only acceleration target

The `update/2` (single-item) path is intentionally NOT NIF-accelerated.
A single NIF call has fixed overhead of ~200 ns; for a single item
update this dominates the actual register write (~30 ns). Callers that
want per-item throughput either batch with `update_many/2` or accept
the BEAM-side single-item path. This matches Apache DataSketches' own
single-item-vs-batch trade-off.

## v0.8.0 Phase 3 additions

The Phase 3 work in v0.8.0 is:

1. **Generalized in-Rust hashing.** v0.7.1 shipped a `_raw_nif` family
   that hardcoded XXH3. Phase 3 adds a parallel `_raw_h_nif` family
   that accepts an algorithm byte (`1 = XXH3`, `2 = Murmur3`). The
   legacy XXH3-only NIFs stay so v0.7.x callers see zero regression.
2. **End-to-end Murmur3 support.** `HLL.new(hash_strategy: :murmur3)`
   (and the same for ULL/Theta/CMS) now wires through to the new
   `_raw_h_nif` Rust path. Murmur3 callers no longer fall off the fast
   path. Cross-language interop with Apache DataSketches becomes a
   one-line configuration change at sketch creation time.
3. **`ExDataSketch.Hash.resolve_strategy/1`** — single source of truth
   for sketch constructors. Centralizes the user-opt → strategy mapping
   that all four sketches previously open-coded inconsistently.

What Phase 3 explicitly does NOT do (per the v0.8.0 prompt's
non-goals):

- No SIMD intrinsics (deferred to v0.9 / v1.0).
- No 6-bit register packing (deferred — would require a sketch-family
  version bump and is incompatible with the existing 8-bit-register
  state binary).
- No sparse/dense layout redesign.
- No ARM-specific tuning.
- No raw-NIF migration for membership filters (Bloom/Cuckoo/Quotient/
  CQF/XorFilter/IBLT) — those already cross the NIF boundary with
  pre-hashed integers via `put_many` and would need a separate design
  pass.

## Scheduler safety

Every hot-path NIF is paired with a Dirty-CPU variant:

```
hll_update_many_nif            (normal)   small batches
hll_update_many_dirty_nif      (dirty)    batches above threshold

hll_update_many_raw_nif        (normal)   small batches, XXH3 inside Rust
hll_update_many_raw_dirty_nif  (dirty)    batches above threshold

hll_update_many_raw_h_nif      (normal)   small batches, dispatched hash
hll_update_many_raw_h_dirty_nif (dirty)   batches above threshold
```

The dispatcher in `Backend.Rust` picks dirty when `length(chunk) >
threshold`. Defaults are listed in `lib/ex_data_sketch/backend/rust.ex`
under `@default_thresholds`. They can be overridden per-call (`opts[:dirty_threshold]`)
or globally (`config :ex_data_sketch, :dirty_thresholds, %{...}`).

See [`plans/hll_scheduler_safety.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/hll_scheduler_safety.md) for the full scheduler discussion.

## Measured throughput (Phase 3 benchmark)

The full benchmark is `bench/hll_hot_path_bench.exs`. Below is a
representative run on Apple Silicon (M1 / OTP 27 / Elixir 1.19); your
hardware will vary. Units are items per second computed from the
`ips * batch_size` column.

| Path                | 10,000 items | 100,000 items | 1,000,000 items |
|---------------------|--------------|---------------|------------------|
| Pure Elixir phash2  | ~1.67 M/s    | ~1.81 M/s     | ~2.02 M/s        |
| Pure Elixir xxhash3 | ~1.72 M/s    | ~1.89 M/s     | ~2.10 M/s        |
| Rust raw XXH3       | ~25.8 M/s    | ~29.4 M/s     | ~34.3 M/s        |
| Rust raw_h Murmur3  | ~23.8 M/s    | ~27.8 M/s     | ~31.1 M/s        |

Headline numbers:

- **Rust XXH3 vs Pure xxhash3**: ~15× throughput.
- **Rust XXH3 vs Pure phash2**: ~17× throughput.
- **Rust Murmur3 vs Rust XXH3**: ~92% throughput. The 8% slowdown is
  the cost of Murmur3's full finalizer arithmetic, which is acceptable
  given Murmur3's interop value.
- **Memory**: Rust paths allocate ~10% less than Pure paths because no
  intermediate hashes list is materialized in Elixir.

## How these numbers compare to external HLL implementations

We deliberately do NOT re-benchmark external HLL implementations in
this release. Throughput numbers from upstream documentation:

- **Apache DataSketches HLL (Java)** — ~10–20 M items/sec for `p=14`
  with Java's MurmurHash3. Reference: their JMH suite.
- **`hyperloglog-rs` (Rust)** — ~80–120 M items/sec for `p=14` with
  XxHash3 and SIMD. Reference: crate README.
- **`axiomhq/hyperloglog-go`** — ~25 M items/sec for `p=14`. Reference:
  the Axiom blog post that accompanied the library release.
- **RedisBloom HLL** — single-thread bound by the Redis event loop;
  upstream throughput is on the order of 1 M items/sec including
  network round-trip.

**Our position in the stack:** v0.8.0's Rust HLL is within 2–3× of the
fastest in-process Rust implementation (which uses SIMD and a custom
packed register layout), substantially ahead of Java and Go for the
single-threaded case, and ~30× ahead of RedisBloom when network is
included.

The 2–3× headroom against `hyperloglog-rs` is **deliberate**: closing
that gap requires SIMD-tuned register updates and 6-bit packing, both
of which are deferred to v0.9 / v1.0 per the v0.8.0 prompt's scope.

## Reproducing these numbers

```sh
EX_DATA_SKETCH_BUILD=1 MIX_ENV=dev mix run bench/hll_hot_path_bench.exs
```

Outputs JSON to `bench/output/hll_hot_path_bench.json` for further
analysis.

## References

- `lib/ex_data_sketch/hll.ex`, `lib/ex_data_sketch/ull.ex`,
  `lib/ex_data_sketch/theta.ex`, `lib/ex_data_sketch/cms.ex` — high-
  level `update_many/2` dispatch.
- `lib/ex_data_sketch/backend/rust.ex` — Rust-side dispatch
  (`hll_update_many_raw/3` and friends).
- `native/ex_data_sketch_nif/src/hll.rs` — Rust HLL implementation.
- `native/ex_data_sketch_nif/src/ull.rs`,
  `native/ex_data_sketch_nif/src/theta.rs`,
  `native/ex_data_sketch_nif/src/cms.rs` — sibling sketches.
- `bench/hll_hot_path_bench.exs` — this document's measurement source.
- [`plans/hll_scheduler_safety.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/hll_scheduler_safety.md) — companion scheduler discussion.
- Heule et al. "HyperLogLog in Practice." Google, 2013.