# v0.8.0 Architectural Summary

This document is the consolidated architectural overview of
`ex_data_sketch` v0.8.0 ("Deterministic Foundations"). It is intended
for contributors who want to understand the whole-system design after
five phases of focused work, and for downstream maintainers who need to
reason about the library's guarantees.

For per-phase detail see the individual plan documents (linked at the
bottom). For risk tracking see [`plans/0.8.0-risks.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/0.8.0-risks.md). For the
user-facing change log see `CHANGELOG.md`.

## The thesis behind v0.8.0

`ex_data_sketch` v0.1.0 through v0.7.1 grew the surface area of the
library: HLL, then CMS, then Theta, then KLL, then DDSketch,
FrequentItems, Bloom, Cuckoo, Quotient, CQF, XorFilter, IBLT, REQ,
MisraGries, ULL — 15 sketch families in 7 releases. Each release added
a sketch; none invested in the substrate.

v0.8.0 inverts the priority. It adds **zero new sketch families** and
instead establishes the production substrate they all share:

1. **Deterministic hashing.** A documented, validated, byte-stable
   hash layer used by every sketch.
2. **Binary stability.** A versioned, corruption-checked wire format
   that every sketch agrees on.
3. **Hot-path performance.** In-Rust hashing for HLL, ULL, Theta, and
   CMS, across both XXH3 and Murmur3.
4. **Installation reliability.** Precompiled NIFs across an 8-target
   matrix.
5. **Probabilistic correctness validation.** Property-based locking
   of the algebraic and probabilistic guarantees every sketch claims.

The thesis: a library can ship 15 algorithms and still be hobby-grade
if its substrate is undocumented. v0.8.0 turns `ex_data_sketch` from
"a collection of sketches" into "a probabilistic runtime on the BEAM".

## Layered architecture

```
                ┌─────────────────────────────────────────────┐
                │   User code: Stream.transform / Broadway /  │
                │   Phoenix / etc.                            │
                └────────────────┬────────────────────────────┘
                                 │
                ┌────────────────▼────────────────────────────┐
                │   Sketch modules (15)                       │
                │   HLL, ULL, KLL, REQ, Theta, CMS,           │
                │   DDSketch, FrequentItems, MisraGries,      │
                │   Bloom, Cuckoo, Quotient, CQF, XorFilter,  │
                │   IBLT, FilterChain                         │
                └────────────────┬────────────────────────────┘
                                 │
                ┌────────────────▼────────────────────────────┐
                │   Binary facade                             │
                │   ExDataSketch.Binary                       │
                │       encode/3, decode/1, peek_version/1,   │
                │       build_payload/2, metadata_from_opts/3 │
                └─────────┬─────────────────────┬─────────────┘
                          │                     │
              ┌───────────▼────────────┐   ┌────▼────────────────┐
              │ Binary v2 (default)    │   │ Codec v1 (legacy)   │
              │ Header / Validator/CRC │   │ ExDataSketch.Codec  │
              └───────────┬────────────┘   └─────────────────────┘
                          │
              ┌───────────▼────────────┐
              │ Hash.Metadata block    │
              │ (algorithm, seed,      │
              │  family, family_ver,   │
              │  backend, extension)   │
              └───────────┬────────────┘
                          │
              ┌───────────▼────────────────────────────────────┐
              │ Hash registry and validators                   │
              │ ExDataSketch.Hash                              │
              │   default_algorithm/0, algorithm_info/1,       │
              │   resolve_strategy/1                           │
              │ ExDataSketch.Hash.Validation                   │
              │   validate_options!/3, validate_metadata!/3    │
              └───────────┬────────────────────────────────────┘
                          │
              ┌───────────▼─────────────────────────────────────┐
              │ Hash implementations                            │
              │   ExDataSketch.Hash.XXH3       (NIF only)       │
              │   ExDataSketch.Hash.Murmur3    (Pure + NIF)     │
              │   ExDataSketch.Hash.* (phash2 + mix64 fallback) │
              └───────────┬─────────────────────────────────────┘
                          │
              ┌───────────▼─────────────────────────────────────┐
              │ Backends                                        │
              │   ExDataSketch.Backend.Pure   (always present)  │
              │   ExDataSketch.Backend.Rust   (NIF dispatcher)  │
              └───────────┬─────────────────────────────────────┘
                          │
              ┌───────────▼─────────────────────────────────────┐
              │ Rust NIF (native/ex_data_sketch_nif/src/)       │
              │   hash.rs   — xxhash3, murmur3                  │
              │   crc.rs    — crc32c                            │
              │   hll.rs    — register update + raw_h dispatch  │
              │   ull.rs    — register update + raw_h dispatch  │
              │   theta.rs  — BTreeSet ops   + raw_h dispatch   │
              │   cms.rs    — counter update + raw_h dispatch   │
              │   {bloom, cuckoo, quotient, cqf, xor, iblt, fi, │
              │    kll, ddsketch}.rs — sketch-specific          │
              └─────────────────────────────────────────────────┘
```

### Layer responsibilities (rules)

1. **Sketch modules own the algorithm.** They know their state layout,
   their parameter semantics, and their estimator math. They do NOT
   know about wire format or hash algorithm details.
2. **The Binary facade owns the wire format.** Sketch modules call
   `Binary.encode/3` and `Binary.decode/1`. They never write magic
   bytes, version bytes, or CRCs themselves.
3. **The Hash registry owns hash identity.** Sketch modules call
   `Hash.resolve_strategy/1` at construction and pass the resulting
   `:hash_strategy` opt through their operations. They never compare
   hash strategies themselves; that is `Hash.Validation`'s job.
4. **Backends own execution.** Sketch modules dispatch operations
   through `Backend.Pure` or `Backend.Rust`. They do not invoke NIFs
   directly.
5. **The Rust NIF owns hot loops.** Everything inside a NIF is
   stateless and operates on input bytes. Sketch state is BEAM-owned;
   the NIF receives a binary, computes a new binary, and returns it.

These rules are the architectural invariants that v0.8.0 establishes.
They are enforced by structure rather than by lint rules: a violation
shows up as an obvious smell in code review.
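
As a rough illustration of rules 1, 4, and 5, the sketch below separates a toy sketch module from a toy backend behaviour. Every name here (`ToyBackend`, `ToyBackend.Pure`, `ToySketch`) and the 16-byte state layout are hypothetical and are not the library's API; `:erlang.phash2/1` merely stands in for the real hash layer.

```elixir
# Hypothetical modules for illustration only; not ex_data_sketch's API.

defmodule ToyBackend do
  # Backends own execution: binary state in, binary state out.
  @callback update(state :: binary(), hashes :: [non_neg_integer()]) :: binary()
end

defmodule ToyBackend.Pure do
  @behaviour ToyBackend

  @impl true
  def update(state, hashes) do
    # Mark one byte-wide slot per hash; the state stays a plain binary.
    Enum.reduce(hashes, state, fn hash, acc ->
      idx = rem(hash, byte_size(acc))
      <<head::binary-size(idx), _old, tail::binary>> = acc
      <<head::binary, 1, tail::binary>>
    end)
  end
end

defmodule ToySketch do
  # The sketch module owns state layout and estimator math, nothing else.
  defstruct state: <<0::size(128)>>, backend: ToyBackend.Pure

  def update_many(%__MODULE__{} = sketch, items) do
    hashes = Enum.map(items, &:erlang.phash2/1)
    %{sketch | state: sketch.backend.update(sketch.state, hashes)}
  end

  # Toy "estimator": the number of occupied slots.
  def occupied(%__MODULE__{state: state}) do
    state |> :binary.bin_to_list() |> Enum.sum()
  end
end

sketch = ToySketch.update_many(%ToySketch{}, ["a", "b", "a"])
IO.inspect(ToySketch.occupied(sketch))
```

Swapping `backend:` for another module that implements the same behaviour changes where the loop runs without touching `ToySketch`, which is the point of rule 4.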

## Phase-by-phase contribution

### Phase 1 — Deterministic Hashing

**What it added.** Four new submodules (`Hash.XXH3`, `Hash.Murmur3`,
`Hash.Metadata`, `Hash.Validation`) plus the registry API on
`ExDataSketch.Hash` and the byte-identical pure-Elixir/Rust Murmur3
parity.

**Why it matters.** Every probabilistic merge depends on hash
identity. Without a documented, validated, versioned hash layer, the
library cannot promise that a sketch produced on Node A will merge
correctly with a sketch produced on Node B six months later. Phase 1
is the foundation everything else stands on.

**Key invariant.** Two sketches may be merged only when their hash
algorithm, hash seed, sketch family, and sketch family version
agree. Backend (Pure vs Rust) is intentionally NOT part of this
equivalence — the parity tests guarantee both backends produce
byte-identical output.
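
A minimal sketch of that equivalence check, assuming metadata is available as a plain map. The module name `MergeCompat`, the map keys, and the error shape are illustrative assumptions; the library's own check lives in `Hash.Validation` and may look different.

```elixir
# Illustrative only: the key names and return values are assumptions.
defmodule MergeCompat do
  # Fields that must agree before a merge. :backend is deliberately absent.
  @identity_keys [:algorithm, :seed, :family, :family_version]

  # Returns :ok, or {:error, {:mismatch, keys}} listing the fields that differ.
  def check(meta_a, meta_b) do
    case Enum.filter(@identity_keys, &(Map.get(meta_a, &1) != Map.get(meta_b, &1))) do
      [] -> :ok
      keys -> {:error, {:mismatch, keys}}
    end
  end
end

a = %{algorithm: :xxh3, seed: 0, family: :hll, family_version: 1, backend: :pure}
b = %{a | backend: :rust}
c = %{a | seed: 42}

IO.inspect(MergeCompat.check(a, b)) # :ok, only the backend differs
IO.inspect(MergeCompat.check(a, c)) # {:error, {:mismatch, [:seed]}}
```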

### Phase 2 — Binary Stability & Corruption Detection

**What it added.** The `ExDataSketch.Binary` facade and three
submodules (`Binary.Header`, `Binary.Validator`, `Binary.CRC`). EXSK
v2 wire format. v1 reader backward compatibility. Regenerated golden
vectors with `test/vectors_v1/` preserved as a regression corpus.
60+ new tests including a 200-mutation bit-flip fuzz suite.

**Why it matters.** Pre-v0.8 EXSK had no checksum. A bit-flip in
persisted state would silently corrupt the next merge or estimate.
v2 closes that gap with CRC32C (Castagnoli, hardware-accelerated).
It also embeds the Phase 1 hash metadata into every frame so the
merge invariant from Phase 1 has somewhere to live on the wire.

**Key invariant.** Every persisted sketch carries its own hash
identity. The serializer cannot lie about it; the deserializer
cannot ignore it.
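
The corruption-detection idea can be sketched with a toy frame. The layout below (magic, version byte, length, payload, trailing checksum) is an assumption made for this example and is not the real EXSK v2 layout; `:erlang.crc32/1` (CRC32 IEEE) is used as a stand-in for CRC32C to keep the example dependency-free.

```elixir
# Toy frame for illustration; NOT the real EXSK v2 layout.
defmodule ToyFrame do
  import Bitwise

  # Assumed layout: "EXSK", version byte 2, 32-bit payload length,
  # payload, then a 32-bit checksum over everything before it.
  def encode(payload) when is_binary(payload) do
    body = <<"EXSK", 2, byte_size(payload)::32, payload::binary>>
    <<body::binary, :erlang.crc32(body)::32>>
  end

  def decode(<<"EXSK", 2, len::32, payload::binary-size(len), crc::32>> = frame) do
    body = binary_part(frame, 0, byte_size(frame) - 4)
    if :erlang.crc32(body) == crc, do: {:ok, payload}, else: {:error, :checksum_mismatch}
  end

  def decode(_other), do: {:error, :malformed}

  # Flip a single bit, as a corruption fuzzer would.
  def flip_bit(bin, bit_index) do
    byte_index = div(bit_index, 8)
    <<head::binary-size(byte_index), byte, tail::binary>> = bin
    <<head::binary, bxor(byte, 1 <<< rem(bit_index, 8)), tail::binary>>
  end
end

frame = ToyFrame.encode("sketch state")
{:ok, "sketch state"} = ToyFrame.decode(frame)

# Every single-bit mutation must be rejected, never silently accepted.
all_rejected? =
  Enum.all?(0..(bit_size(frame) - 1), fn i ->
    match?({:error, _}, frame |> ToyFrame.flip_bit(i) |> ToyFrame.decode())
  end)

IO.inspect(all_rejected?)
```

Flips in the magic, version, or length fail the structural match; flips in the payload or checksum fail the checksum comparison.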

### Phase 3 — HLL Hot-Path Optimization

**What it added.** 8 new Rust NIFs (`_raw_h_nif` family) for HLL,
ULL, Theta, CMS. Each accepts an `algorithm: u8` parameter (XXH3 or
Murmur3) and dispatches at the per-NIF-call boundary, not per-item.
`Hash.resolve_strategy/1` opens the `:hash_strategy` opt to user
selection. `bench/hll_hot_path_bench.exs` measures all four paths
across three batch sizes.

**Why it matters.** v0.7.1 introduced in-Rust hashing for XXH3.
Phase 3 generalizes to Murmur3 so the new `:murmur3` strategy from
Phase 1 doesn't fall off the fast path. Net effect: ~15x throughput
over Pure Elixir, ~8% slowdown for Murmur3 vs XXH3 (intrinsic to
the algorithm).

**Key invariant.** Sketch state is BEAM-owned. The NIF receives a
binary and returns a binary. In steady state, per-item work inside
the NIF creates no Elixir terms.
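
The "dispatch per call, not per item" pattern can be sketched in Elixir, even though the real dispatch happens inside the Rust NIFs. In this hypothetical `BatchDispatch` module the hash functions are salted `:erlang.phash2/1` stand-ins (not real XXH3 or Murmur3), algorithm byte 1 for XXH3 follows this guide's `algo=1` note, and byte 2 for Murmur3 is an assumption.

```elixir
# Illustration of resolving the algorithm once per batch; names are hypothetical.
defmodule BatchDispatch do
  @xxh3 1
  # Assumed value; the guide does not state Murmur3's algorithm byte.
  @murmur3 2

  # Resolve the hash function once per call, then run the hot loop with it.
  def update_many(state, items, algorithm) when is_binary(state) do
    hash_fun = resolve(algorithm)

    Enum.reduce(items, state, fn item, acc ->
      idx = rem(hash_fun.(item), byte_size(acc))
      <<head::binary-size(idx), _old, tail::binary>> = acc
      <<head::binary, 1, tail::binary>>
    end)
  end

  # Stand-in hashes: salted phash2, NOT real XXH3 or Murmur3.
  defp resolve(@xxh3), do: fn item -> :erlang.phash2({:xxh3_stand_in, item}) end
  defp resolve(@murmur3), do: fn item -> :erlang.phash2({:murmur3_stand_in, item}) end

  defp resolve(other) do
    raise ArgumentError, "unknown algorithm byte: #{inspect(other)}"
  end
end

# Binary in, binary out: the state keeps its size across the call.
state = BatchDispatch.update_many(<<0::size(64)>>, ["a", "b", "c"], 1)
IO.inspect(byte_size(state))
```

An unknown algorithm byte fails once, before any item is processed, instead of being re-checked inside the loop.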

### Phase 4 — Precompiled NIF Validation

**What it added.** Two Windows targets (`x86_64-pc-windows-msvc`,
`aarch64-pc-windows-msvc`) bringing the matrix to 8 × 2 = 16
artifacts per release. `mix test.nif_on` / `mix test.nif_off` aliases
for local NIF mode flips. 18 NIF-availability contract tests.

**Why it matters.** Adoption friction. Pre-v0.8 Windows users had to
install Rust to build the NIF. Phase 4 removes that step. Apple
Silicon, Linux glibc/musl, Linux ARM64, Windows x86_64, and Windows
ARM64 all install via Hex with no toolchain.

**Key invariant.** `EX_DATA_SKETCH_SKIP_NIF=true` (NIF stubs only)
and `EX_DATA_SKETCH_BUILD=true` (source build) are independent
escape hatches. The default precompiled-download path is the
user-visible recommended install.

### Phase 5 — Property-Based Validation

**What it added.** `test/property_guarantees_test.exs` with 14 new
StreamData properties locking the algebraic and probabilistic
guarantees enumerated in the release brief
(`prompts/0.8.0_prompt.md`): HLL/ULL monotonicity and RSE bounds,
KLL/REQ rank consistency and quantile inversion, CMS
overestimation-only, Bloom/XOR/Cuckoo no-false-negative, and the
guarantee that Binary v2 bit-flip corruption never silently
propagates.

**Why it matters.** Example-based tests check one trajectory. The
substrate built in Phases 1-4 is only worth the effort if its
guarantees hold across the distribution of inputs. Property-based
testing closes that gap.

**Key invariant.** Line coverage ≥ 70% (current: 92.7%). The property
suite adds < 1 s on top of the example suite. Every property carries
a prose justification of its tolerance or slack.
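
To show the shape of such a property without pulling in a dependency, here is a hand-rolled randomized check of the no-false-negative guarantee on a toy Bloom filter. `ToyBloom` and `NoFalseNegatives` are hypothetical names, the filter is not the library's implementation, and the seeded `:rand` loop is only a stand-in for a StreamData generator.

```elixir
# Toy filter and hand-rolled property loop; not the library's code.
defmodule ToyBloom do
  # m bits tracked as a MapSet of set positions, k salted hashes per item.
  defstruct m: 1024, k: 3, bits: MapSet.new()

  def put(%__MODULE__{} = filter, item) do
    %{filter | bits: Enum.into(positions(filter, item), filter.bits)}
  end

  def member?(%__MODULE__{} = filter, item) do
    Enum.all?(positions(filter, item), &MapSet.member?(filter.bits, &1))
  end

  defp positions(%__MODULE__{m: m, k: k}, item) do
    for i <- 1..k, do: :erlang.phash2({i, item}, m)
  end
end

defmodule NoFalseNegatives do
  # Property: every inserted item is reported as a member.
  def holds?(runs \\ 100) do
    # Fixed seed so a failure would be reproducible.
    :rand.seed(:exsss, {1, 2, 3})

    Enum.all?(1..runs, fn _run ->
      items = for _ <- 1..:rand.uniform(50), do: :rand.uniform(1_000_000)
      filter = Enum.reduce(items, %ToyBloom{}, &ToyBloom.put(&2, &1))
      Enum.all?(items, &ToyBloom.member?(filter, &1))
    end)
  end
end

IO.inspect(NoFalseNegatives.holds?())
```

This property needs no tolerance because the guarantee is exact; probabilistic bounds such as RSE need explicit slack, which is why each real property documents its own.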

## Numbers that matter

### Test suite

| Metric | v0.7.1 | v0.8.0 | Delta |
|--------|--------|--------|-------|
| Tests (NIF on) | 1,186 | 1,317 | +131 |
| Tests (NIF off) | ~1,000 | 1,088 | +88 |
| Doctests | 169 | 202 | +33 |
| Properties (NIF on) | 152 | 171 | +19 |
| Properties (NIF off) | 116 | 128 | +12 |
| Line coverage | 88% | 92.7% | +4.7 pp |
| `mix credo --strict` issues | 0 | 0 | — |

### Performance

| Path (HLL p=14) | v0.7.1 throughput | v0.8.0 throughput | Notes |
|-----------------|-------------------|-------------------|-------|
| Pure phash2 | ~1.7 M items/sec | ~1.7 M items/sec | unchanged |
| Pure xxhash3 | ~1.9 M items/sec | ~1.9 M items/sec | unchanged |
| Rust raw XXH3 | ~30 M items/sec | ~30 M items/sec | unchanged |
| Rust raw_h Murmur3 | — | ~28 M items/sec | new in v0.8.0 |

### Code surface

| Asset | v0.7.1 | v0.8.0 | Delta |
|-------|--------|--------|-------|
| Elixir modules in `lib/` | ~30 | ~40 | +10 |
| Rust NIF functions | 47 | 58 | +11 |
| Plans / design docs | 47 | 62 | +15 |
| Precompiled NIF targets | 6 | 8 | +2 (Windows MSVC) |
| Artifacts per release | 12 | 16 | +4 |

### Wire format

| Sketch (empty unless noted) | v1 size | v2 size | Overhead |
|----------------|---------|---------|----------|
| HLL p=4 | 18 bytes | 50 bytes | +32 (2.8x) |
| HLL p=14 | 16,398 bytes | 16,430 bytes | +32 (0.2%) |
| KLL k=200 (populated) | ~3-5 KB | ~3-5 KB | +32 (~1%) |

## Design decisions worth re-reading

A handful of decisions shaped the v0.8.0 architecture and deserve
explicit documentation here so future maintainers don't relitigate
them.

### Why two-layer versioning (frame + metadata block)?

The EXSK v2 frame has its own `serialization_version` byte. The
embedded `Hash.Metadata` block has its own `block_version` byte. Two
independent axes.

The rationale is fine-grained evolution:

- Adding a new hash algorithm: claim a new wire byte. No version
  bump on either axis.
- Adding a new metadata field: append to the metadata block's
  `extension` trailer. Readers that only know `block_version` 1
  preserve unknown extension bytes verbatim on re-encode. No version
  bump.
- Restructuring the metadata block layout itself: bump
  `block_version`. Frame version stays at 2.
- Restructuring the frame layout (e.g., changing the magic, the CRC
  algorithm, the header field order): bump `serialization_version`
  to 3.

Single-axis versioning would force every change to either be
backward-incompatible or to crowd into a single ever-larger version
namespace. Two axes give us 16+ years of additive evolution before
either runs out of room.
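
The two axes can be sketched with binary pattern matching. The byte layout below is an assumption made for this example (the real layout is defined in the binary contract documents), and `TwoAxisVersions` is a hypothetical module name.

```elixir
# Assumed toy layout, NOT the real EXSK v2 layout:
#   frame: "EXSK", serialization_version (1 byte), 16-byte metadata block, payload
#   block: block_version, algorithm, family, family_version (1 byte each),
#          seed (32 bits), extension (8 opaque bytes)
defmodule TwoAxisVersions do
  def parse(<<"EXSK", 2, block::binary-size(16), payload::binary>>) do
    parse_block(block, payload)
  end

  def parse(<<"EXSK", 2, _::binary>>), do: {:error, :truncated}

  def parse(<<"EXSK", version, _::binary>>) do
    {:error, {:unsupported_serialization_version, version}}
  end

  def parse(_other), do: {:error, :bad_magic}

  # Axis 2: the block layout is versioned independently of the frame.
  defp parse_block(
         <<1, algorithm, family, family_version, seed::32, extension::binary-size(8)>>,
         payload
       ) do
    meta = %{
      algorithm: algorithm,
      family: family,
      family_version: family_version,
      seed: seed,
      # Kept opaque so unknown extension bytes survive a re-encode.
      extension: extension
    }

    {:ok, meta, payload}
  end

  defp parse_block(<<block_version, _::binary>>, _payload) do
    {:error, {:unsupported_block_version, block_version}}
  end
end

block = <<1, 7, 1, 1, 0::32, 0::64>>
IO.inspect(TwoAxisVersions.parse(<<"EXSK", 2, block::binary, "state">>))
```

A new algorithm byte (7 in the example) parses without touching either version, a block with `block_version` 2 is rejected while the frame version stays at 2, and a frame with `serialization_version` 3 is rejected before the block is inspected.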

### Why CRC32C (Castagnoli), not CRC32 (IEEE) or xxhash3-32?

- CRC32C has hardware acceleration on most modern CPUs (the x86
  `crc32` instruction from SSE 4.2; the ARMv8 CRC extension, mandatory
  from ARMv8.1). With acceleration it is substantially faster than a
  table-driven CRC32 (IEEE); without it, it is in the same speed
  class.
- CRC32C is the standard checksum in iSCSI, Btrfs, SCTP, and the
  Snappy framing format. The algorithm is settled; the wire bytes are
  stable across implementations. Cross-language interop is trivial.
- xxhash3-32 is faster but is NOT a CRC. It has different
  error-detection guarantees. For storage integrity (the primary use
  case) the CRC32 family is the right tool.

Full rationale in [`plans/corruption_detection.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/corruption_detection.md).
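
For reference, CRC32C fits in a few lines when written bit-at-a-time. The module below is a slow, illustrative sketch with a hypothetical name (`Crc32cSketch`); it is not the library's `Binary.CRC` or the Rust `crc.rs`, which are free to use table-driven or hardware paths. The standard check value for the ASCII string `"123456789"` makes it easy to compare against implementations in other languages.

```elixir
# Bitwise reference sketch of CRC32C (Castagnoli); illustrative, not optimized.
defmodule Crc32cSketch do
  import Bitwise

  # Reflected form of the Castagnoli polynomial 0x1EDC6F41.
  @poly 0x82F63B78

  def checksum(data) when is_binary(data) do
    crc =
      for <<byte <- data>>, reduce: 0xFFFFFFFF do
        acc -> update_byte(bxor(acc, byte))
      end

    # Final XOR, as in other reflected CRC-32 variants.
    bxor(crc, 0xFFFFFFFF)
  end

  # Process the low 8 bits one at a time, least significant bit first.
  defp update_byte(crc) do
    Enum.reduce(1..8, crc, fn _bit, c ->
      if band(c, 1) == 1, do: bxor(c >>> 1, @poly), else: c >>> 1
    end)
  end
end

# Standard CRC-32C check value.
0xE3069283 = Crc32cSketch.checksum("123456789")

# CRC32 (IEEE), as computed by :erlang.crc32/1, gives a different value.
IO.inspect(Crc32cSketch.checksum("123456789") == :erlang.crc32("123456789"))
```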

### Why preserve the legacy `_raw_nif` family alongside `_raw_h_nif`?

The v0.7.1 `_raw_nif` family hardcoded XXH3. Phase 3 added the
generalized `_raw_h_nif` family with an algorithm byte. The two
families are now functionally equivalent for XXH3.

We preserve the legacy NIFs because:

1. They are part of the v0.7.x ABI. Removing them is a breaking
   change reserved for v1.0.
2. They serve as a regression baseline: `_raw_nif` and `_raw_h_nif`
   (algo=1) are property-tested for byte-identical output, locking
   the equivalence and catching any drift.

A v1.0 deprecation could remove the legacy family.

### Why a 16-byte fixed metadata block when sketches differ in family?

The block could vary per sketch family. We chose fixed for three
reasons:

1. **Cross-family validation.** The merge validator can compare two
   metadata blocks without knowing which sketch family they belong
   to. Useful for generic tooling.
2. **Forward compatibility.** The fixed length means a v0.8 reader
   can skip a future metadata block of unknown internal structure
   and still successfully parse the surrounding frame.
3. **Smallest worst-case.** For HLL p=4 the overhead is 32 bytes. For
   any production-sized sketch (p >= 12) the overhead is < 1%.

Variable-length metadata was rejected as a premature optimization
that would have made the binary contract harder to validate.

### Why is `Backend.default/0` still `Pure`?

This follows the "no silent default change" guarantee from v0.7.x:
users who benchmarked the library and chose `Pure` for some reason
should not have their default flip under them on a minor-version
upgrade.

This is locked by `test/ex_data_sketch/nif_availability_test.exs`
and documented in `precompiled_nifs.md`.

The trade-off: users adopting the library for the first time may
benchmark with the wrong backend. We accept that as the smaller
risk.

A future major-version bump (v1.0) is the appropriate moment to
revisit.

## What v0.8.0 does NOT do

Explicit non-goals to prevent scope creep in maintenance and to
document the boundary for v0.9.0 planning:

- **No new sketch families.** CPC, Tuple, MinHash, VarOpt — all
  deferred (v0.11+).
- **No Apache DataSketches binary interop** beyond Theta CompactSketch
  which already existed. KLL and HLL interop deferred (v0.10).
- **No streaming integrations.** Broadway, Flow, GenStage —
  deferred (v0.9).
- **No persistence layers.** ETS, DETS, CubDB — deferred (v0.9).
- **No telemetry / OpenTelemetry.** Deferred (v0.9).
- **No SIMD intrinsics.** The HLL hot path uses scalar Rust;
  hyperloglog-rs uses SIMD and is 2-3x faster. Deferred (v1.0).
- **No 6-bit register packing.** HLL stores 1 byte per register,
  wasting 25%. Deferred (v1.0).
- **No raw-NIF path for membership filters.** Bloom, Cuckoo,
  Quotient, CQF, XorFilter, IBLT still hash in Elixir. Deferred
  (v0.9 candidate).
- **No SBOM / SLSA / reproducible builds.** Deferred (v1.0).

## See also

- [`prompts/0.8.0_prompt.md`](https://github.com/thanos/ex_data_sketch/blob/main/prompts/0.8.0_prompt.md) — original release brief.
- [`plans/next_steps.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/next_steps.md) — strategic roadmap (v0.8.0 through v1.0).
- [`plans/0.8.0_implementation_plan.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/0.8.0_implementation_plan.md) — master tracker.
- `hash_strategies.md`, [`plans/hash_binary_contract.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/hash_binary_contract.md) — Phase 1 deep dives.
- [`plans/binary_contract.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/binary_contract.md), [`plans/corruption_detection.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/corruption_detection.md) — Phase 2 deep dives.
- `hll_performance.md`, [`plans/hll_scheduler_safety.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/hll_scheduler_safety.md) — Phase 3 deep dives.
- `precompiled_nifs.md` — Phase 4 deep dive.
- [`plans/property_testing.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/property_testing.md) — Phase 5 deep dive.
- [Phase 1-5 reviewer checklists](https://github.com/thanos/ex_data_sketch/blob/main/plans/) — per-phase checklists.
- `v0.8.0_migration_notes.md` — downstream upgrade guide.
- `serialization_compatibility.md` — wire-format stability contract.
- `roadmap.md` — next release preview.
- [`plans/0.8.0-risks.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/0.8.0-risks.md) — open risks at release time.
- `CHANGELOG.md` — full v0.8.0 change log.