# Migrating to ex_data_sketch v0.8.0

This document explains what changes between v0.7.x and v0.8.0, who is
affected, and how to roll out the upgrade safely.

## TL;DR

- **No code changes required for most users.** All sketch public APIs
  are source-compatible with v0.7.x.
- **EXSK serialization bumped from v1 to v2** (adds CRC32C trailer and
  embedded hash metadata block). v0.8.0 still reads v1 binaries.
  **v0.7.x cannot read v2 binaries.**
- **Two Windows precompiled NIF targets added** (`x86_64-pc-windows-msvc`,
  `aarch64-pc-windows-msvc`); installation on Windows no longer
  requires a local Rust toolchain.
- **`:murmur3` is now a user-selectable hash strategy** for HLL / ULL /
  Theta / CMS, enabling Apache DataSketches interop. Default remains
  `:xxhash3`.
- Upgrade order for distributed deployments: **readers first, producers
  second**.

## Who is affected and how

| User profile | Action required |
|--------------|-----------------|
| Pure in-process user, no persistence | None. `mix deps.update ex_data_sketch` and continue. |
| Persists sketches to disk / Redis / etc. | Plan the rollout: deploy v0.8.0 readers everywhere first; once stable, deploy producers. v0.8.0 readers handle both v1 and v2 binaries; v0.7.x readers cannot handle v2. |
| Multi-node distributed (sketches travel between nodes) | Same staged rollout as above. Mixed v0.7.x and v0.8.0 nodes work as long as producers stay on v0.7.x until all readers are upgraded. |
| Linux glibc/musl x86_64/aarch64, macOS | Precompiled NIF available; no Rust toolchain needed. |
| Windows x86_64 or ARM64 | NEW: precompiled NIF available (was source-build only). |
| FreeBSD / NetBSD / RISC-V | Source build still required: `EX_DATA_SKETCH_BUILD=1 mix deps.compile`. |
| Custom `:hash_fn` callers | None. The `:hash_fn` path is preserved verbatim. |
| Apache DataSketches interop | Optional: switch to `hash_strategy: :murmur3` for native compatibility. |

## What's new in v0.8.0

### Hash strategy selection

`HLL.new/1`, `ULL.new/1`, `Theta.new/1`, and `CMS.new/1` now honor a
user-supplied `:hash_strategy`. In v0.7.x the option was accepted but
silently overridden with `:xxhash3`.

```elixir
# v0.7.x — silently used :xxhash3 regardless of the :hash_strategy opt.
sketch = ExDataSketch.HLL.new(p: 14, hash_strategy: :murmur3)
# sketch.opts[:hash_strategy] was :xxhash3 (BUG).

# v0.8.0 — honors the requested strategy.
sketch = ExDataSketch.HLL.new(p: 14, hash_strategy: :murmur3)
# sketch.opts[:hash_strategy] == :murmur3.
```

Supported strategies:

- `:xxhash3` (default when the Rust NIF is loaded) — fastest, stable
  across platforms.
- `:murmur3` — Apache DataSketches compatibility. ~8% slower than XXH3
  in the Rust hot path. Pure Elixir fallback bundled.
- `:phash2` — BEAM-only. Not portable across OTP major versions; use
  only for offline / single-OTP workloads.
- `:custom` — pass `:hash_fn` for a caller-supplied closure. Sketches
  built with `:custom` are NEVER merge-compatible with any other
  sketch.

Cross-strategy merges are rejected with
`ExDataSketch.Errors.IncompatibleSketchesError`. The error itself dates
from v0.7.1; v0.8.0 applies the check more strictly, so a v0.8.0
reader catches more mismatch cases.

### EXSK v2 binary format

Every sketch's `serialize/1` now produces an EXSK v2 frame. The new
layout adds:

- A 16-byte `Hash.Metadata` block recording the exact hash identity
  used to produce the sketch.
- A `family_version` byte for per-sketch internal-state evolution.
- A `flags` byte (reserved in v2).
- A trailing CRC32C checksum over the entire preceding frame.

An empty HLL sketch grows from ~18 bytes (v1) to ~50 bytes (v2), about
32 bytes of added framing. For any sketch larger than ~1 KB the
overhead is negligible (< 5%).

```elixir
# v0.7.x produced an EXSK v1 frame:
v1 = HLL.serialize(sketch)
<<"EXSK", 1, _rest::binary>> = v1  # version byte = 1

# v0.8.0 produces an EXSK v2 frame:
v2 = HLL.serialize(sketch)
<<"EXSK", 2, _rest::binary>> = v2  # version byte = 2
```

v0.8.0's `deserialize/1` accepts both:

```elixir
# Both succeed in v0.8.0:
{:ok, _} = HLL.deserialize(v1)
{:ok, _} = HLL.deserialize(v2)

# Only v1 succeeds in v0.7.x:
# v0.7.x will return {:error, %DeserializationError{}} on v2 input.
```
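
During a staged rollout it can help to know which frame version a persisted binary carries before handing it to a reader. The following is a minimal sketch that assumes only what the examples above show: a 4-byte `"EXSK"` magic followed by a one-byte version. `FrameVersion` is a hypothetical helper module, not part of ex_data_sketch.

```elixir
defmodule FrameVersion do
  # Hypothetical helper (not part of ex_data_sketch). Reads the version
  # byte without decoding the rest of the frame.
  # Assumes the prefix shown in this guide: "EXSK" magic, then one byte.
  def detect(<<"EXSK", version, _rest::binary>>) when version in [1, 2],
    do: {:ok, version}

  def detect(<<"EXSK", version, _rest::binary>>),
    do: {:error, {:unknown_version, version}}

  def detect(_other), do: {:error, :not_exsk}

  # Counts frames per version, for example to confirm that a store holds
  # no v2 frames before a v0.7.x reader is pointed at it.
  def tally(frames) do
    Enum.frequencies_by(frames, fn frame ->
      case detect(frame) do
        {:ok, version} -> version
        {:error, _reason} -> :invalid
      end
    end)
  end
end
```

`FrameVersion.tally/1` returns a map keyed by `1`, `2`, and `:invalid`, which is usually enough to decide whether a store is safe for a given reader version.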

If you need to write v1 frames from v0.8.0 during a staged rollout,
the legacy codec is still available:

```elixir
# Reads v2:
{:ok, decoded} = ExDataSketch.Binary.decode(v2_binary)

# Writes v1 (legacy, for backward-compat producers only):
v1_binary = ExDataSketch.Codec.encode(
  ExDataSketch.Codec.sketch_id_hll(),
  1,                                    # version 1
  <<14, 1>>,                            # params (p + hash strategy)
  sketch.state                          # state binary
)
```

No public sketch API exposes the v1 writer; use `Codec.encode/4`
directly only as an interim staged-rollout escape hatch.

### Corruption detection

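EXSK v2 frames carry a trailing CRC32C (Castagnoli,
hardware-accelerated on modern x86 and ARM). Any single-bit error in
the frame is guaranteed to be detected (a property of CRCs), and random
multi-bit corruption goes undetected with probability of roughly
`2^-32` (detection probability `> 99.99999%`). Detected corruption
surfaces as a structured error: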

```elixir
case HLL.deserialize(possibly_corrupted) do
  {:ok, sketch} -> use_it(sketch)
  {:error, %ExDataSketch.Errors.DeserializationError{message: msg}} ->
    Logger.warning("corrupt HLL: #{msg}")
end
```

v0.7.x had no corruption detection — bit-flips in the sketch state
would silently produce wrong estimates.
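
To illustrate why a trailing checksum catches bit-flips, here is a minimal sketch. It is not the EXSK v2 codec: it uses `:erlang.crc32/1` (the IEEE CRC-32 that ships with OTP) as a stand-in for CRC32C, and the `ChecksumDemo` module, the big-endian 32-bit trailer, and the payload are all assumptions made for illustration.

```elixir
defmodule ChecksumDemo do
  # Stand-in only: :erlang.crc32/1 is IEEE CRC-32, NOT the CRC32C
  # (Castagnoli) polynomial that EXSK v2 uses. The trailer layout here
  # (big-endian, 32 bits) is also an assumption.

  # Appends a checksum computed over the whole body.
  def seal(body) when is_binary(body) do
    <<body::binary, :erlang.crc32(body)::unsigned-big-32>>
  end

  # Recomputes the checksum and compares it with the stored trailer.
  def open(frame) when is_binary(frame) and byte_size(frame) >= 4 do
    body_size = byte_size(frame) - 4
    <<body::binary-size(body_size), stored::unsigned-big-32>> = frame

    if :erlang.crc32(body) == stored do
      {:ok, body}
    else
      {:error, :checksum_mismatch}
    end
  end

  def open(_frame), do: {:error, :too_short}

  # Flips the lowest bit of the byte at `index` to simulate corruption.
  def flip_bit(frame, index) do
    <<head::binary-size(index), byte, tail::binary>> = frame
    <<head::binary, Bitwise.bxor(byte, 1), tail::binary>>
  end
end

frame = ChecksumDemo.seal("sketch state bytes")
{:ok, "sketch state bytes"} = ChecksumDemo.open(frame)
{:error, :checksum_mismatch} = frame |> ChecksumDemo.flip_bit(3) |> ChecksumDemo.open()
```

A flip inside the trailer itself is caught the same way, because the recomputed value no longer matches the stored one.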

### Precompiled NIF: Windows support

The precompiled NIF matrix now covers 8 target triples (was 6):

- Linux glibc x86_64 / ARM64
- Linux musl x86_64 / ARM64
- macOS Intel / Apple Silicon
- Windows MSVC x86_64 / ARM64 (NEW in v0.8.0)

Each target has 2 NIF versions (2.16, 2.17), so 16 artifacts per
release. Hex installation on any of these platforms requires no
Rust toolchain.

For FreeBSD / NetBSD / RISC-V or other unsupported platforms:

```sh
EX_DATA_SKETCH_BUILD=1 mix deps.compile ex_data_sketch
```

(Requires a local Rust toolchain.)

## Upgrade procedure

### Step 1: Local dev environment

```sh
# Bump the version in your mix.exs deps list to:
#   {:ex_data_sketch, "~> 0.8.0"}

# Update and compile:
mix deps.update ex_data_sketch
mix deps.compile ex_data_sketch
```

The precompiled NIF will be downloaded automatically. If you are on
an unsupported platform, set `EX_DATA_SKETCH_BUILD=1`.

### Step 2: Verify your test suite

Run your existing tests. All sketch public APIs are source-compatible
with v0.7.x. The only behavior differences visible from sketch APIs are:

1. Serialized binaries have a new version byte (2 instead of 1).
2. Sketches built with `hash_strategy: :murmur3` (or `:phash2`)
   actually use that strategy now. v0.7.x silently used `:xxhash3`.

If your tests assert byte-identical equality with hardcoded v1 frames
(unlikely), update them to expect v2 frames.

### Step 3: Plan the staged rollout (if you persist sketches)

For deployments that share persisted sketches across process / node /
machine boundaries:

1. **Deploy v0.8.0 to all readers first.** v0.8.0 reads both v1 and
   v2 frames. v0.7.x readers cannot read v2.
2. **Verify reader stability for at least one deploy cycle.**
3. **Deploy v0.8.0 to producers.** They now emit v2 frames.
4. **Optional rollback drill.** If you need to roll back a producer
   to v0.7.x while v2 frames are in flight, you can:
   - Re-serialize affected sketches with `Codec.encode/4` (the
     escape hatch shown above), or
   - Accept temporary data loss for the v2-only sketches.
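
The readers-first ordering can be expressed as a simple gate that a deploy script consults before upgrading producers. This is a sketch under stated assumptions: `RolloutGate` is a hypothetical module, and how you collect each reader node's ex_data_sketch version is deployment-specific and not shown.

```elixir
defmodule RolloutGate do
  # Hypothetical helper, not part of ex_data_sketch.
  # First release whose readers accept EXSK v2 frames, per this guide.
  @first_v2_reader "0.8.0"

  # True only when every reader can already decode v2 frames.
  # `reader_versions` is a list of version strings, one per reader node.
  def producers_may_upgrade?(reader_versions) do
    Enum.all?(reader_versions, &reads_v2?/1)
  end

  # Readers that would fail on v2 frames and therefore block step 3.
  def blocking_readers(reader_versions) do
    Enum.reject(reader_versions, &reads_v2?/1)
  end

  defp reads_v2?(version) do
    Version.compare(version, @first_v2_reader) != :lt
  end
end

RolloutGate.producers_may_upgrade?(["0.8.0", "0.7.4"])
# false, because one reader is still on v0.7.x
```

Note that an empty list passes the gate, so a real check should also confirm that the reader inventory is complete.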

### Step 4: Adopt new features (optional)

- **Hash strategy.** High-throughput callers should stay on
  `hash_strategy: :xxhash3` (the default; no change required). Switch
  Apache DataSketches interop callers to `hash_strategy: :murmur3`.
- **Custom hash.** The `:hash_fn` opt is unchanged.

## Behavior changes that may surprise

### `HLL.new(p: 14, hash_strategy: :murmur3)` now actually uses Murmur3

In v0.7.x, this option was silently overridden to `:xxhash3`. If your
v0.7.x code relied on this silent override (e.g., by passing
`:hash_strategy` from a config that was never validated), the v0.8.0
behavior may produce DIFFERENT estimates than v0.7.x for the same
input.

The estimates are still mathematically correct. The difference is
that two sketches that used to hash identically (both XXH3 in
practice) now hash differently if one is built with `:xxhash3` and
another with `:murmur3`.

**Mitigation.** Audit every `hash_strategy:` option in your codebase.
If you intended `:xxhash3` (the most common case), you can drop the
option: `:xxhash3` is already the default when the NIF is loaded. If
you intended `:murmur3` for interop, v0.8.0 gives you a real interop
path, but the resulting Murmur3 sketches cannot be merged with
sketches built under v0.7.x, which in practice were hashed with XXH3.
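
One way to run such an audit is to validate the option at the point where it enters from config, before it reaches a sketch constructor. The sketch below is hypothetical: `StrategyAudit` is not part of ex_data_sketch, the strategy list is copied from the "Supported strategies" section, and the rule that `:custom` needs a `:hash_fn` function is an assumption based on that section's description.

```elixir
defmodule StrategyAudit do
  # Hypothetical helper, not part of ex_data_sketch. The strategy list
  # mirrors the "Supported strategies" section of this guide.
  @supported [:xxhash3, :murmur3, :phash2, :custom]

  # Returns {:ok, :default} when the option is absent, so callers can
  # tell an explicit choice from reliance on the library default.
  def check(opts) when is_list(opts) do
    case Keyword.fetch(opts, :hash_strategy) do
      :error ->
        {:ok, :default}

      {:ok, :custom} ->
        # Assumption: :custom is only meaningful with a :hash_fn closure.
        if is_function(Keyword.get(opts, :hash_fn)) do
          {:ok, :custom}
        else
          {:error, :custom_without_hash_fn}
        end

      {:ok, strategy} when strategy in @supported ->
        {:ok, strategy}

      {:ok, other} ->
        {:error, {:unsupported_hash_strategy, other}}
    end
  end
end
```

Logging every non-default result during the first v0.8.0 deploy gives a quick inventory of callers whose estimates may differ from v0.7.x.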

### Merge of sketches built with different hash strategies now fails fast

v0.7.1 introduced merge validation. v0.8.0 inherits and extends it.
A merge between two sketches with mismatched `:hash_strategy` or
`:seed` raises `IncompatibleSketchesError`:

```elixir
xxh3_sketch = HLL.from_enumerable(items, hash_strategy: :xxhash3)
murm_sketch = HLL.from_enumerable(items, hash_strategy: :murmur3)

HLL.merge(xxh3_sketch, murm_sketch)
# ** (ExDataSketch.Errors.IncompatibleSketchesError)
#    HLL hash strategy mismatch: xxhash3 vs murmur3
```

This is intended; merging would produce a corrupt result.
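
When sketches arrive from mixed sources, grouping them by hash identity before merging avoids hitting the exception at all. This is a minimal sketch: `MergeGroups` is a hypothetical module, plain maps stand in for real sketch structs, and reading the seed from `opts[:seed]` is an assumption (only `opts[:hash_strategy]` appears in the examples above).

```elixir
defmodule MergeGroups do
  # Hypothetical helper, not part of ex_data_sketch. Groups sketches so
  # that every group shares one {hash_strategy, seed} pair; merging is
  # then only attempted within a group.
  def by_hash_identity(sketches) do
    Enum.group_by(sketches, fn sketch ->
      {sketch.opts[:hash_strategy], sketch.opts[:seed]}
    end)
  end
end

# Plain maps stand in for real sketch structs here.
sketches = [
  %{id: :a, opts: [hash_strategy: :xxhash3, seed: 0]},
  %{id: :b, opts: [hash_strategy: :murmur3, seed: 0]},
  %{id: :c, opts: [hash_strategy: :xxhash3, seed: 0]}
]

groups = MergeGroups.by_hash_identity(sketches)
# groups has two keys: {:xxhash3, 0} and {:murmur3, 0}
```

Each group could then be folded with `HLL.merge/2`. Sketches built with `:custom` are the exception: as noted earlier, they are never merge-compatible, so they should be left unmerged even when they share a group key.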

### v2 frame size grows for tiny sketches

Empty HLL: 18 bytes -> 50 bytes (~2.8x). For any production-sized
sketch (p ≥ 8, KLL k ≥ 100) the overhead is < 5%.

If you persist millions of tiny sketches, audit storage. The
trade-off is documented in [`plans/binary_contract.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/binary_contract.md).
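
A rough storage estimate can be derived from the approximate sizes quoted above (about 18 bytes for an empty v1 frame and about 50 for v2, so about 32 bytes added per frame). The sketch assumes the added bytes are a fixed per-frame cost, which this guide does not state explicitly; `FrameOverhead` is a hypothetical module.

```elixir
defmodule FrameOverhead do
  # Approximate figures quoted in this guide for an empty HLL frame.
  @v1_empty_bytes 18
  @v2_empty_bytes 50
  # Assumption: the added bytes are a fixed per-frame cost.
  @added_bytes @v2_empty_bytes - @v1_empty_bytes

  # Extra storage, in bytes, for `count` persisted frames.
  def extra_bytes(count), do: count * @added_bytes

  # Relative growth, in percent, of a frame that was `v1_size` bytes in v1.
  def growth_percent(v1_size), do: @added_bytes / v1_size * 100
end

FrameOverhead.extra_bytes(1_000_000)
# 32_000_000 bytes, about 32 MB for a million frames

FrameOverhead.growth_percent(1024)
# 3.125, in line with the "< 5%" figure for sketches above ~1 KB
```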

### `Backend.default/0` is still `Pure`

Even when the Rust NIF is loaded, the default backend is
`ExDataSketch.Backend.Pure`. To opt into the Rust hot paths:

```elixir
# Per-sketch:
sketch = HLL.new(p: 14, backend: ExDataSketch.Backend.Rust)

# Or application-wide:
config :ex_data_sketch, backend: ExDataSketch.Backend.Rust
```

This is unchanged from v0.7.x and is intentional. The "no silent
default change" guarantee is documented in `precompiled_nifs.md`.

## Test-side changes

If you maintain a test suite that exercises ex_data_sketch:

### NIF mode switching

If you flip `EX_DATA_SKETCH_BUILD` between local test runs, use the
new aliases:

```sh
EX_DATA_SKETCH_BUILD=1 mix test.nif_on
EX_DATA_SKETCH_SKIP_NIF=true mix test.nif_off
```

These automatically reset the per-env `rustler_precompiled` state.
The bare `mix test` invocation works for single-mode CI but trips
the compile-vs-runtime env check when used to flip modes locally.

### Hardcoded v1 frame assertions

If your tests do something like:

```elixir
# v0.7.x style:
assert <<"EXSK", 1, _rest::binary>> = HLL.serialize(sketch)
```

…update to v2:

```elixir
# v0.8.0 style:
assert <<"EXSK", 2, _rest::binary>> = HLL.serialize(sketch)
```

Three sketch-internal tests in this repo were updated this way (KLL,
REQ, MisraGries). External users likely don't have such assertions.

## Rolling back from v0.8.0 to v0.7.x

Rollback is possible but requires care:

1. Stop producers and drain the v2-frame queue.
2. Roll producers back to v0.7.x.
3. Any v2 frame that survives the drain will fail to deserialize on
   v0.7.x with `DeserializationError`. The persisted state binary
   inside the v2 frame is recoverable by hand-parsing the EXSK v2
   layout (see [`plans/binary_contract.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/binary_contract.md)).

In practice, rolling back a binary-format upgrade is painful. Plan
the v0.8.0 upgrade as one-way and test it thoroughly in staging.

## Known issues at release time

Tracked in [`plans/0.8.0-risks.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/0.8.0-risks.md). Highlights:

- **ULL accuracy at low `p` and high cardinality** (5-R1). ULL at
  `p < 12` produces large over-estimates when `n / 2^p` exceeds ~2.
  Production guidance: use `p ≥ 12`. Investigation tracked as a
  follow-up issue.
- **HLL memory profile at very high cardinality** (X-R1). Streaming
  10M items into a single HLL allocates ~1.86 GB of transient
  Elixir state due to `Stream.chunk_every/2` lifecycle. Investigation
  tracked as a follow-up issue. Workaround: smaller chunk sizes or
  custom enumerable batching.
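
The ULL guidance above can be turned into a pre-flight check on sketch parameters. This is a heuristic sketch only: `ULLGuard` is a hypothetical module, and the threshold of 2 is the approximate figure from the note, not a precise bound.

```elixir
defmodule ULLGuard do
  # Hypothetical helper, not part of ex_data_sketch. Encodes the
  # known-issue note: precision below 12 combined with an expected load
  # n / 2^p above roughly 2. The threshold is approximate.
  @min_safe_p 12
  @max_load 2

  def risky?(p, expected_cardinality)
      when is_integer(p) and p >= 0 and is_integer(expected_cardinality) do
    p < @min_safe_p and expected_cardinality / Integer.pow(2, p) > @max_load
  end
end

ULLGuard.risky?(10, 5_000)
# true: 5_000 / 1_024 is about 4.9, above the threshold

ULLGuard.risky?(12, 10_000_000)
# false: p meets the production guidance of p >= 12
```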

## See also

- [`plans/binary_contract.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/binary_contract.md) — full EXSK v2 layout specification.
- [`plans/corruption_detection.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/corruption_detection.md) — CRC32C rationale and error
  taxonomy.
- `hash_strategies.md` — hash algorithm selection guide.
- `hll_performance.md` — performance characteristics of each path.
- `precompiled_nifs.md` — platform support details.
- [`plans/0.8.0-risks.md`](https://github.com/thanos/ex_data_sketch/blob/main/plans/0.8.0-risks.md) — open risk register at release time.
- `CHANGELOG.md` — full v0.8.0 change log.