# v0.8.0 Architectural Summary
This document is the consolidated architectural overview of
`ex_data_sketch` v0.8.0 ("Deterministic Foundations"). It is intended
for contributors who want to understand the whole-system design after
five phases of focused work, and for downstream maintainers who need to
reason about the library's guarantees.
For per-phase detail see the individual plan documents (linked at the
bottom). For risk tracking see [`plans/0.8.0-risks.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/0.8.0-risks.md). For the
user-facing change log see `CHANGELOG.md`.
## The thesis behind v0.8.0
`ex_data_sketch` v0.1.0 through v0.7.1 grew the surface area of the
library: HLL, then CMS, then Theta, then KLL, then DDSketch,
FrequentItems, Bloom, Cuckoo, Quotient, CQF, XorFilter, IBLT, REQ,
MisraGries, ULL — 15 sketch families in 7 releases. Each release added
a sketch; none invested in the substrate.
v0.8.0 inverts the priority. It adds **zero new sketch families** and
instead establishes the production substrate they all share:
1. **Deterministic hashing.** A documented, validated, byte-stable
hash layer used by every sketch.
2. **Binary stability.** A versioned, corruption-checked wire format
that every sketch agrees on.
3. **Hot-path performance.** In-Rust hashing for every cardinality
sketch, across both XXH3 and Murmur3.
4. **Installation reliability.** Precompiled NIFs across an 8-target
matrix.
5. **Probabilistic correctness validation.** Property-based locking
of the algebraic and probabilistic guarantees every sketch claims.
The thesis: a library can ship 15 algorithms and still be hobby-grade
if its substrate is undocumented. v0.8.0 turns `ex_data_sketch` from
"a collection of sketches" into "a probabilistic runtime on the BEAM".
## Layered architecture
```
┌─────────────────────────────────────────────────┐
│ User code: Stream.transform / Broadway /        │
│            Phoenix / etc.                       │
└────────────────────────┬────────────────────────┘
                         │
┌────────────────────────▼────────────────────────┐
│ Sketch modules (15)                             │
│   HLL, ULL, KLL, REQ, Theta, CMS,               │
│   DDSketch, FrequentItems, MisraGries,          │
│   Bloom, Cuckoo, Quotient, CQF, XorFilter,      │
│   IBLT, FilterChain                             │
└────────────────────────┬────────────────────────┘
                         │
┌────────────────────────▼────────────────────────┐
│ Binary facade                                   │
│   ExDataSketch.Binary                           │
│     encode/3, decode/1, peek_version/1,         │
│     build_payload/2, metadata_from_opts/3       │
└───────────┬──────────────────────────┬──────────┘
            │                          │
┌───────────▼────────────┐  ┌──────────▼──────────┐
│ Binary v2 (default)    │  │ Codec v1 (legacy)   │
│ Header / Validator/CRC │  │ ExDataSketch.Codec  │
└───────────┬────────────┘  └─────────────────────┘
            │
┌───────────▼────────────┐
│ Hash.Metadata block    │
│ (algorithm, seed,      │
│  family, family_ver,   │
│  backend, extension)   │
└───────────┬────────────┘
            │
┌───────────▼─────────────────────────────────────┐
│ Hash registry and validators                    │
│   ExDataSketch.Hash                             │
│     default_algorithm/0, algorithm_info/1,      │
│     resolve_strategy/1                          │
│   ExDataSketch.Hash.Validation                  │
│     validate_options!/3, validate_metadata!/3   │
└───────────┬─────────────────────────────────────┘
            │
┌───────────▼─────────────────────────────────────┐
│ Hash implementations                            │
│   ExDataSketch.Hash.XXH3 (NIF only)             │
│   ExDataSketch.Hash.Murmur3 (Pure + NIF)        │
│   ExDataSketch.Hash.* (phash2 + mix64 fallback) │
└───────────┬─────────────────────────────────────┘
            │
┌───────────▼─────────────────────────────────────┐
│ Backends                                        │
│   ExDataSketch.Backend.Pure (always present)    │
│   ExDataSketch.Backend.Rust (NIF dispatcher)    │
└───────────┬─────────────────────────────────────┘
            │
┌───────────▼─────────────────────────────────────┐
│ Rust NIF (native/ex_data_sketch_nif/src/)       │
│   hash.rs  - xxhash3, murmur3                   │
│   crc.rs   - crc32c                             │
│   hll.rs   - register update + raw_h dispatch   │
│   ull.rs   - register update + raw_h dispatch   │
│   theta.rs - BTreeSet ops + raw_h dispatch      │
│   cms.rs   - counter update + raw_h dispatch    │
│   {bloom, cuckoo, quotient, cqf, xor, iblt, fi, │
│    kll, ddsketch}.rs - sketch-specific          │
└─────────────────────────────────────────────────┘
```
### Layer responsibilities (rules)
1. **Sketch modules own the algorithm.** They know their state layout,
their parameter semantics, and their estimator math. They do NOT
know about wire format or hash algorithm details.
2. **The Binary facade owns the wire format.** Sketch modules call
`Binary.encode/3` and `Binary.decode/1`. They never write magic
bytes, version bytes, or CRCs themselves.
3. **The Hash registry owns hash identity.** Sketch modules call
`Hash.resolve_strategy/1` at construction and pass the resulting
`:hash_strategy` opt through their operations. They never compare
hash strategies themselves; that is `Hash.Validation`'s job.
4. **Backends own execution.** Sketch modules dispatch operations
through `Backend.Pure` or `Backend.Rust`. They do not invoke NIFs
directly.
5. **The Rust NIF owns hot loops.** Everything inside a NIF is
stateless and operates on input bytes. Sketch state is BEAM-owned;
the NIF receives a binary, computes a new binary, and returns it.
These rules are the architectural invariants that v0.8.0 establishes.
They are enforced by structure rather than by lint: a violation shows
up as an obvious smell in code review.
## Phase-by-phase contribution
### Phase 1 — Deterministic Hashing
**What it added.** Four new submodules (`Hash.XXH3`, `Hash.Murmur3`,
`Hash.Metadata`, `Hash.Validation`), the registry API on
`ExDataSketch.Hash`, and byte-identical parity between the pure-Elixir
and Rust Murmur3 implementations.
**Why it matters.** Every probabilistic merge depends on hash
identity. Without a documented, validated, versioned hash layer, the
library cannot promise that a sketch produced on Node A will merge
correctly with a sketch produced on Node B six months later. Phase 1
is the foundation everything else stands on.
**Key invariant.** Two sketches may be merged only when their hash
algorithm, hash seed, sketch family, and sketch family version
agree. Backend (Pure vs Rust) is intentionally NOT part of this
equivalence — the parity tests guarantee both backends produce
byte-identical output.
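
The merge rule can be sketched as a predicate over the four identity fields. The snippet below is a minimal illustration in Rust (the language of the NIF layer), not the library's implementation: `HashIdentity`, its field widths, and `mergeable` are hypothetical names chosen for this example. This document assigns the real check to `Hash.Validation` on the Elixir side.

```rust
/// Execution backend. Deliberately excluded from the merge check below.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Pure,
    Rust,
}

/// Hypothetical in-memory view of a sketch's hash identity.
/// Field names and widths are illustrative, not the library's layout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HashIdentity {
    pub algorithm: u8,
    pub seed: u64,
    pub family: u8,
    pub family_version: u8,
    pub backend: Backend,
}

/// Two sketches may be merged only when algorithm, seed, family, and
/// family version agree. The backend field is ignored on purpose.
pub fn mergeable(a: &HashIdentity, b: &HashIdentity) -> bool {
    a.algorithm == b.algorithm
        && a.seed == b.seed
        && a.family == b.family
        && a.family_version == b.family_version
}
```

Under this sketch, two identities that differ only in `backend` are mergeable, while a differing seed or family version is rejected.
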
### Phase 2 — Binary Stability & Corruption Detection
**What it added.** The `ExDataSketch.Binary` facade and three
submodules (`Binary.Header`, `Binary.Validator`, `Binary.CRC`). EXSK
v2 wire format. v1 reader backward compatibility. Regenerated golden
vectors with `test/vectors_v1/` preserved as a regression corpus.
60+ new tests including a 200-mutation bit-flip fuzz suite.
**Why it matters.** Pre-v0.8 EXSK had no checksum. A bit-flip in
persisted state would silently corrupt the next merge or estimate.
v2 closes that gap with CRC32C (Castagnoli, hardware-accelerated).
It also embeds the Phase 1 hash metadata into every frame so the
merge invariant from Phase 1 has somewhere to live on the wire.
**Key invariant.** Every persisted sketch carries its own hash
identity. The serializer cannot lie about it; the deserializer
cannot ignore it.
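
To make the corruption check concrete, here is a minimal bitwise CRC32C in Rust together with a toy frame that appends the checksum to a payload. This is a sketch under stated assumptions: the trailer position, the little-endian encoding, and the `seal` / `open` names are hypothetical and do not describe the actual EXSK v2 layout. A production build would normally use a hardware-accelerated implementation rather than this bit-by-bit loop.

```rust
/// Bitwise CRC32C (Castagnoli) in reflected form.
pub fn crc32c(data: &[u8]) -> u32 {
    const POLY: u32 = 0x82F6_3B78; // reflected Castagnoli polynomial
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (POLY & mask);
        }
    }
    !crc
}

/// Hypothetical framing: payload followed by a little-endian CRC32C trailer.
pub fn seal(payload: &[u8]) -> Vec<u8> {
    let mut frame = payload.to_vec();
    frame.extend_from_slice(&crc32c(payload).to_le_bytes());
    frame
}

/// Returns the payload only if the stored checksum matches.
pub fn open(frame: &[u8]) -> Result<&[u8], &'static str> {
    if frame.len() < 4 {
        return Err("frame too short");
    }
    let (payload, trailer) = frame.split_at(frame.len() - 4);
    let stored = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if crc32c(payload) == stored {
        Ok(payload)
    } else {
        Err("checksum mismatch")
    }
}
```

Because a CRC detects every single-bit error, flipping any one bit of a sealed frame makes `open` return an error instead of a silently corrupted payload.
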
### Phase 3 — HLL Hot-Path Optimization
**What it added.** 8 new Rust NIFs (`_raw_h_nif` family) for HLL,
ULL, Theta, CMS. Each accepts an `algorithm: u8` parameter (XXH3 or
Murmur3) and dispatches at the per-NIF-call boundary, not per-item.
`Hash.resolve_strategy/1` opens the `:hash_strategy` opt to user
selection. `bench/hll_hot_path_bench.exs` measures all four paths
across three batch sizes.
**Why it matters.** v0.7.1 introduced in-Rust hashing for XXH3.
Phase 3 generalizes to Murmur3 so the new `:murmur3` strategy from
Phase 1 doesn't fall off the fast path. Net effect: ~15x throughput
over Pure Elixir, ~8% slowdown for Murmur3 vs XXH3 (intrinsic to
the algorithm).
**Key invariant.** Sketch state is BEAM-owned. The NIF receives a
binary and returns a binary. In steady state the per-item loop runs
entirely inside the NIF and creates no Elixir terms.
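
One way to picture "dispatch per call, not per item" together with the binary-in, binary-out rule is the Rust sketch below. It is illustrative only and makes several assumptions: `stand_in_a` and `stand_in_b` are FNV-style placeholders, not XXH3 or Murmur3; the value 2 for the second algorithm byte is a guess (this document only mentions `algo=1` for XXH3); and the state is taken to be one byte per HLL register with nothing else in the buffer.

```rust
/// Placeholder 64-bit hash (seeded FNV-1a). NOT XXH3.
fn stand_in_a(item: &[u8], seed: u64) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325_u64 ^ seed;
    for &b in item {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Placeholder 64-bit hash (seeded FNV-1). NOT Murmur3.
fn stand_in_b(item: &[u8], seed: u64) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325_u64 ^ seed.rotate_left(32);
    for &b in item {
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
        h ^= u64::from(b);
    }
    h
}

/// Inner loop, monomorphized per hash so no algorithm check runs per item.
fn update_all<H>(registers: &mut [u8], p: u32, items: &[&[u8]], seed: u64, hash: H)
where
    H: Fn(&[u8], u64) -> u64,
{
    for &item in items {
        let h = hash(item, seed);
        let idx = (h >> (64 - p)) as usize; // top p bits pick the register
        let rest = h << p; // remaining bits feed the rank
        let rho = if rest == 0 { 64 - p + 1 } else { rest.leading_zeros() + 1 };
        registers[idx] = registers[idx].max(rho as u8);
    }
}

/// Binary in, binary out: the caller's state is never mutated.
pub fn update_batch(
    state: &[u8],
    items: &[&[u8]],
    algorithm: u8,
    seed: u64,
) -> Result<Vec<u8>, String> {
    let m = state.len();
    if m < 16 || !m.is_power_of_two() {
        return Err(format!("register count {} is not a power of two >= 16", m));
    }
    let p = m.trailing_zeros();
    let mut out = state.to_vec();
    // The algorithm byte is inspected once per call, outside the item loop.
    match algorithm {
        1 => update_all(&mut out, p, items, seed, stand_in_a),
        2 => update_all(&mut out, p, items, seed, stand_in_b),
        other => return Err(format!("unknown hash algorithm byte {}", other)),
    }
    Ok(out)
}
```

Since each register only ever takes the maximum rank seen, feeding the same batch twice yields the same output bytes as feeding it once.
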
### Phase 4 — Precompiled NIF Validation
**What it added.** Two Windows targets (`x86_64-pc-windows-msvc`,
`aarch64-pc-windows-msvc`) bringing the matrix to 8 × 2 = 16
artifacts per release. `mix test.nif_on` / `mix test.nif_off` aliases
for local NIF mode flips. 18 NIF-availability contract tests.
**Why it matters.** Adoption friction. Pre-v0.8 Windows users had to
install Rust to build the NIF. Phase 4 removes that step. Apple
Silicon, Linux glibc/musl, Linux ARM64, Windows x86_64, and Windows
ARM64 all install via Hex with no toolchain.
**Key invariant.** `EX_DATA_SKETCH_SKIP_NIF=true` (NIF stubs only)
and `EX_DATA_SKETCH_BUILD=true` (source build) are independent
escape hatches. The default precompiled-download path remains the
recommended install for users.
### Phase 5 — Property-Based Validation
**What it added.** `test/property_guarantees_test.exs` with 14 new
StreamData properties locking the algebraic and probabilistic
guarantees enumerated in the release brief
(`prompts/0.8.0_prompt.md`): HLL/ULL monotonicity and RSE bounds,
KLL/REQ rank consistency and quantile inversion, CMS
overestimation-only, Bloom/XOR/Cuckoo no-false-negative, and Binary v2
bit-flip corruption never propagating silently.
**Why it matters.** Example-based tests check one trajectory. The
production substrate that Phases 1-4 build is only worth the effort
if its guarantees hold across the distribution of inputs.
Property-based testing closes that gap.
**Key invariant.** Coverage ≥ 70% (current: 92.7%). Property suite
runs in < 1 s on top of the example suite. Every property carries
prose justification of its tolerance / slack.
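
The project's properties are written with StreamData in Elixir. The Rust sketch below only illustrates the shape of one of them, the Bloom no-false-negative guarantee, using a toy filter and a SplitMix64 generator in place of a property-testing framework. `Bloom`, `no_false_negatives`, and the filter parameters are hypothetical and unrelated to the library's own Bloom module.

```rust
/// SplitMix64 step, used both as the test PRNG and as the item mixer.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Toy Bloom filter over u64 items.
pub struct Bloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl Bloom {
    pub fn new(bit_count: usize, hashes: u64) -> Self {
        assert!(bit_count > 0 && hashes > 0);
        Bloom { bits: vec![false; bit_count], hashes }
    }

    /// Bit position for the i-th hash of an item.
    fn position(&self, item: u64, i: u64) -> usize {
        let mut state = item ^ i.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        (splitmix64(&mut state) % self.bits.len() as u64) as usize
    }

    pub fn insert(&mut self, item: u64) {
        for i in 0..self.hashes {
            let pos = self.position(item, i);
            self.bits[pos] = true;
        }
    }

    pub fn contains(&self, item: u64) -> bool {
        (0..self.hashes).all(|i| self.bits[self.position(item, i)])
    }
}

/// Property: every inserted item is reported as present, for many
/// pseudo-random item sets derived from `seed`.
pub fn no_false_negatives(seed: u64, trials: usize, items_per_trial: usize) -> bool {
    let mut rng = seed;
    for _ in 0..trials {
        let mut filter = Bloom::new(1024, 3);
        let items: Vec<u64> = (0..items_per_trial).map(|_| splitmix64(&mut rng)).collect();
        for &item in &items {
            filter.insert(item);
        }
        if !items.iter().all(|&item| filter.contains(item)) {
            return false;
        }
    }
    true
}
```

The property says nothing about false positives. A saturated filter still passes, which is why the real suite pairs guarantees like this one with separate error-bound properties.
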
## Numbers that matter
### Test suite
| Metric | v0.7.1 | v0.8.0 | Delta |
|--------|--------|--------|-------|
| Tests (NIF on) | 1,186 | 1,317 | +131 |
| Tests (NIF off) | ~1,000 | 1,088 | +88 |
| Doctests | 169 | 202 | +33 |
| Properties (NIF on) | 152 | 171 | +19 |
| Properties (NIF off) | 116 | 128 | +12 |
| Line coverage | 88% | 92.7% | +4.7 pp |
| `mix credo --strict` issues | 0 | 0 | — |
### Performance
| Path (HLL p=14) | v0.7.1 throughput | v0.8.0 throughput | Notes |
|-----------------|-------------------|-------------------|-------|
| Pure phash2 | ~1.7 M items/sec | ~1.7 M items/sec | unchanged |
| Pure xxhash3 | ~1.9 M items/sec | ~1.9 M items/sec | unchanged |
| Rust raw XXH3 | ~30 M items/sec | ~30 M items/sec | unchanged |
| Rust raw_h Murmur3 | — | ~28 M items/sec | new in v0.8.0 |
### Code surface
| Asset | v0.7.1 | v0.8.0 | Delta |
|-------|--------|--------|-------|
| Elixir modules in `lib/` | ~30 | ~40 | +10 |
| Rust NIF functions | 47 | 58 | +11 |
| Plans / design docs | 47 | 62 | +15 |
| Precompiled NIF targets | 6 | 8 | +2 (Windows MSVC) |
| Artifacts per release | 12 | 16 | +4 |
### Wire format
| Sketch (empty) | v1 size | v2 size | Overhead |
|----------------|---------|---------|----------|
| HLL p=4 | 18 bytes | 50 bytes | +32 (2.8x) |
| HLL p=14 | 16,398 bytes | 16,430 bytes | +32 (0.2%) |
| KLL k=200 (populated) | ~3-5 KB | ~3-5 KB | +32 (~1%) |
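
The overhead column follows from a constant 32-byte cost per frame. A small Rust sketch of that arithmetic (the constant is read off the table above; the function names are illustrative):

```rust
/// Fixed per-frame cost of v2 over v1 in bytes, as listed in the table.
pub const V2_OVERHEAD_BYTES: usize = 32;

/// Size of a v2 frame given the size of the equivalent v1 frame.
pub fn v2_size(v1_size: usize) -> usize {
    v1_size + V2_OVERHEAD_BYTES
}

/// How many times larger the v2 frame is than the v1 frame.
pub fn growth_factor(v1_size: usize) -> f64 {
    v2_size(v1_size) as f64 / v1_size as f64
}

/// Overhead expressed as a percentage of the v1 size.
pub fn overhead_percent(v1_size: usize) -> f64 {
    V2_OVERHEAD_BYTES as f64 * 100.0 / v1_size as f64
}
```

For HLL p=4 this gives 18 + 32 = 50 bytes, a factor of about 2.8. For p=14 the same 32 bytes is roughly 0.2% of 16,398.
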
## Design decisions worth re-reading
A handful of decisions shaped the v0.8.0 architecture and deserve
explicit documentation here so future maintainers don't relitigate
them.
### Why two-layer versioning (frame + metadata block)?
The EXSK v2 frame has its own `serialization_version` byte. The
embedded `Hash.Metadata` block has its own `block_version` byte. Two
independent axes.
The rationale is fine-grained evolution:
- Adding a new hash algorithm: claim a new wire byte. No version
bump on either axis.
- Adding a new metadata field: append to the metadata block's
`extension` trailer. v1 readers preserve unknown extension bytes
verbatim on re-encode. No version bump.
- Restructuring the metadata block layout itself: bump
`block_version`. Frame version stays at 2.
- Restructuring the frame layout (e.g., changing the magic, the CRC
algorithm, the header field order): bump `serialization_version`
to 3.
Single-axis versioning would force every change to either be
backward-incompatible or to crowd into a single ever-larger version
namespace. Two axes give us 16+ years of additive evolution before
either runs out of room.
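
The four cases above can be summarized as a function from a kind of change to the resulting version pair. This is an illustrative Rust sketch only: the `Change` enum and the `after` function are hypothetical, and any starting `block_version` value is an assumption (this document only fixes the frame version at 2).

```rust
/// The two independent version axes described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Versions {
    pub serialization_version: u8,
    pub block_version: u8,
}

/// Kinds of format evolution listed in this section.
#[derive(Debug, Clone, Copy)]
pub enum Change {
    NewHashAlgorithm,
    NewMetadataField,
    MetadataBlockLayout,
    FrameLayout,
}

/// Version pair after applying one change.
pub fn after(v: Versions, change: Change) -> Versions {
    match change {
        // Additive changes: no bump on either axis.
        Change::NewHashAlgorithm | Change::NewMetadataField => v,
        // Block restructuring bumps only the block axis.
        Change::MetadataBlockLayout => Versions {
            block_version: v.block_version + 1,
            ..v
        },
        // Frame restructuring bumps only the frame axis.
        Change::FrameLayout => Versions {
            serialization_version: v.serialization_version + 1,
            ..v
        },
    }
}
```

Starting from frame version 2, only `Change::FrameLayout` moves `serialization_version` to 3. Every other change leaves it untouched.
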
### Why CRC32C (Castagnoli), not CRC32 (IEEE) or xxhash3-32?
- CRC32C has hardware acceleration on most modern CPUs (Intel SSE
4.2, ARMv8.1+). It is in the same speed class as CRC32 IEEE on
hardware that accelerates both, and substantially faster on hardware
that accelerates only CRC32C.
- CRC32C is the standard checksum in iSCSI, Btrfs, SCTP, and the
Snappy framing format. The algorithm is settled; the wire bytes are
stable across implementations. Cross-language interop is trivial.
- xxhash3-32 is faster but is NOT a CRC, so it has different
error-detection guarantees. For storage integrity (the primary use
case) the CRC32 family is the right tool.
Full rationale in [`plans/corruption_detection.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/corruption_detection.md).
### Why preserve the legacy `_raw_nif` family alongside `_raw_h_nif`?
The v0.7.1 `_raw_nif` family hardcoded XXH3. Phase 3 added the
generalized `_raw_h_nif` family with an algorithm byte. The two
families are now functionally equivalent for XXH3.
We preserve the legacy NIFs because:
1. They are part of the v0.7.x ABI. Removing them is a breaking
change reserved for v1.0.
2. They serve as a regression baseline: `_raw_nif` and `_raw_h_nif
(algo=1)` are property-tested for byte-identical output, locking
the equivalence and catching any drift.
A v1.0 deprecation could remove the legacy family.
### Why a 16-byte fixed metadata block when sketches differ in family?
The block could vary per sketch family. We chose fixed for three
reasons:
1. **Cross-family validation.** The merge validator can compare two
metadata blocks without knowing which sketch family they belong
to. Useful for generic tooling.
2. **Forward compatibility.** The fixed length means a v0.8 reader
can skip a future metadata block of unknown internal structure
and still successfully parse the surrounding frame.
3. **Smallest worst-case.** For HLL p=4 the overhead is 32 bytes. For
any production-sized sketch (p >= 12, where the register array alone
is 4,096 bytes) the overhead is < 1%.
Variable-length metadata was rejected as a premature optimization
that would have made the binary contract harder to validate.
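
A fixed length is what lets a reader step over the block without interpreting it. The Rust sketch below shows that skip in isolation. `split_metadata` is a hypothetical helper and assumes the buffer begins at the metadata block; where the block sits inside the real frame is not specified here.

```rust
/// Length of the fixed metadata block in bytes, per this section.
pub const METADATA_LEN: usize = 16;

/// Splits a buffer that begins with the metadata block into
/// (block, remainder) without looking inside the block.
pub fn split_metadata(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    if buf.len() < METADATA_LEN {
        return None; // truncated input: no complete block to skip
    }
    Some(buf.split_at(METADATA_LEN))
}
```

A reader that does not understand the block's internal structure can still hand the remainder to the rest of the parser, and two blocks can be compared byte for byte without knowing the sketch family.
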
### Why is `Backend.default/0` still `Pure`?
This follows from the "no silent default change" guarantee in v0.7.x.
Users who benchmarked the library and chose `Pure` for some reason
should not have their default flip under them on a minor-version
upgrade.
This is locked by `test/ex_data_sketch/nif_availability_test.exs`
and documented in `precompiled_nifs.md`.
The trade-off: users adopting the library for the first time may
benchmark with the wrong backend. We accept that as the smaller
risk.
A future major-version bump (v1.0) is the appropriate moment to
revisit.
## What v0.8.0 does NOT do
Explicit non-goals to prevent scope creep in maintenance and to
document the boundary for v0.9.0 planning:
- **No new sketch families.** CPC, Tuple, MinHash, VarOpt — all
deferred (v0.11+).
- **No Apache DataSketches binary interop** beyond Theta CompactSketch
which already existed. KLL and HLL interop deferred (v0.10).
- **No streaming integrations.** Broadway, Flow, GenStage —
deferred (v0.9).
- **No persistence layers.** ETS, DETS, CubDB — deferred (v0.9).
- **No telemetry / OpenTelemetry.** Deferred (v0.9).
- **No SIMD intrinsics.** The HLL hot path uses scalar Rust;
hyperloglog-rs uses SIMD and is 2-3x faster. Deferred (v1.0).
- **No 6-bit register packing.** HLL stores 1 byte per register,
wasting 25%. Deferred (v1.0).
- **No raw-NIF path for membership filters.** Bloom, Cuckoo,
Quotient, CQF, XorFilter, IBLT still hash in Elixir. Deferred
(v0.9 candidate).
- **No SBOM / SLSA / reproducible builds.** Deferred (v1.0).
## See also
- [`prompts/0.8.0_prompt.md`](https://github.com/thanos/ex_data_sketch/blob/main/prompts/0.8.0_prompt.md) — original release brief.
- [`plans/next_steps.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/next_steps.md) — strategic roadmap (v0.8.0 through v1.0).
- [`plans/0.8.0_implementation_plan.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/0.8.0_implementation_plan.md) — master tracker.
- `hash_strategies.md`, [`plans/hash_binary_contract.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/hash_binary_contract.md) — Phase 1 deep dives.
- [`plans/binary_contract.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/binary_contract.md), [`plans/corruption_detection.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/corruption_detection.md) — Phase 2 deep dives.
- `hll_performance.md`, [`plans/hll_scheduler_safety.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/hll_scheduler_safety.md) — Phase 3 deep dives.
- `precompiled_nifs.md` — Phase 4 deep dive.
- [`plans/property_testing.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/property_testing.md) — Phase 5 deep dive.
- [Phase 1-5 reviewer checklists](https://github.com/thanos/ex_data_sketch/blob/main/plans/) — per-phase checklists.
- `v0.8.0_migration_notes.md` — downstream upgrade guide.
- `serialization_compatibility.md` — wire-format stability contract.
- `roadmap.md` — next release preview.
- [`plans/0.8.0-risks.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/0.8.0-risks.md) — open risks at release time.
- `CHANGELOG.md` — full v0.8.0 change log.