Skip to main content

README.md

# ExDataSketch

Production-grade streaming data sketching algorithms for Elixir.

ExDataSketch provides probabilistic data structures for approximate counting,
frequency estimation, quantile computation, heavy-hitter detection, membership
testing with deletion, and set reconciliation on streaming data. Stream-native
integration with Elixir's `Collectable`, `GenStage`, `Broadway`, and `Flow`,
plus `:telemetry`/OpenTelemetry instrumentation, persistence backends (ETS,
DETS, CubDB, Mnesia, Ecto), and nine production-oriented Livebooks.

[![CI](https://github.com/thanos/ex_data_sketch/actions/workflows/ci.yml/badge.svg)](https://github.com/thanos/ex_data_sketch/actions/workflows/ci.yml)
[![Hex version](https://img.shields.io/hexpm/v/ex_data_sketch.svg)](https://hex.pm/packages/ex_data_sketch)
[![Hex docs](https://img.shields.io/badge/docs-hexdocs.pm-blue)](https://hexdocs.pm/ex_data_sketch)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Coverage Status](https://coveralls.io/repos/github/thanos/ex_data_sketch/badge.svg?branch=main)](https://coveralls.io/github/thanos/ex_data_sketch?branch=main)



## Supported Algorithms

| Algorithm | Purpose | Status |
|-----------|---------|--------|
| HyperLogLog (HLL) | Cardinality estimation | Implemented (Pure + Rust) |
| Count-Min Sketch (CMS) | Frequency estimation | Implemented (Pure + Rust) |
| Theta Sketch | Set operations on cardinalities | Implemented (Pure + Rust) |
| KLL Quantiles | Rank and quantile estimation | Implemented (Pure + Rust) |
| DDSketch | Relative-error quantile estimation | Implemented (Pure + Rust) |
| FrequentItems (SpaceSaving) | Heavy-hitter / top-k detection | Implemented (Pure + Rust) |
| Bloom Filter | Probabilistic membership testing | Implemented (Pure + Rust) |
| Cuckoo Filter | Membership testing with deletion | Implemented (Pure + Rust) |
| Quotient Filter | Membership with deletion and merge | Implemented (Pure + Rust) |
| CQF (Counting Quotient) | Multiset membership with counting | Implemented (Pure + Rust) |
| XorFilter | Static immutable membership testing | Implemented (Pure + Rust) |
| IBLT | Set reconciliation | Implemented (Pure + Rust) |
| REQ Sketch | Relative-error quantile estimation | Implemented (Pure) |
| Misra-Gries | Deterministic heavy-hitter detection | Implemented (Pure) |
| UltraLogLog (ULL) | Improved cardinality estimation | Implemented (Pure + Rust) |

### Capability Matrix

| Structure | insert | delete | merge | count | serialize | static | reconciliation |
|-----------|--------|--------|-------|-------|-----------|--------|----------------|
| Bloom | yes | -- | yes | yes* | yes | -- | -- |
| Cuckoo | yes | yes | -- | yes | yes | -- | -- |
| Quotient | yes | yes | yes | yes | yes | -- | -- |
| CQF | yes | yes | yes | yes | yes | -- | -- |
| XorFilter | -- | -- | -- | yes | yes | yes | -- |
| IBLT | yes | yes | yes | yes | yes | -- | yes |

*Bloom count is a popcount-based cardinality estimate.

### When to Choose

- **Bloom** -- Default choice for membership testing. No deletion needed, mergeable.
- **Cuckoo** -- Need to delete items. Better space efficiency than Bloom at low FPR.
- **Quotient** -- Need deletion and merge. Good when filter resizing may be needed.
- **CQF** -- Need to count how many times each item was inserted (multiset).
- **XorFilter** -- Static set known at build time. Smallest footprint, fastest lookups.
- **IBLT** -- Need to find differences between two sets (reconciliation).

## Installation

Add `ex_data_sketch` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:ex_data_sketch, "~> 0.9.0"}
  ]
end
```

## Quick Start

```elixir
# HLL: count distinct elements
hll = ExDataSketch.HLL.new() |> ExDataSketch.HLL.update_many(1..100_000)
ExDataSketch.HLL.estimate(hll)  # ~100_000

# Stream API: build from lazy enumerables
sketch = 1..100_000 |> Stream.map(&to_string/1) |> ExDataSketch.Stream.hll(p: 14)

# Collectable: Enum.into works with any mergeable sketch
sketch = Enum.into(1..50_000, ExDataSketch.ULL.new(p: 14))

# Partitioned parallel reduction
sketch = ExDataSketch.Stream.reduce_partitioned(1..1_000_000, ExDataSketch.HLL, partitions: 8, p: 14)

# Persistence: save and load from ETS
:ets.new(:sketches, [:set, :public, :named_table])
ExDataSketch.Storage.ETS.save(sketch, :sketches, "daily:2024-01-15")
{:ok, loaded} = ExDataSketch.Storage.ETS.load(ExDataSketch.HLL, :sketches, "daily:2024-01-15")

# KLL: quantile estimation
kll = ExDataSketch.KLL.new() |> ExDataSketch.KLL.update_many(1..100_000)
ExDataSketch.KLL.quantile(kll, 0.5)   # approximate median (~50_000)
ExDataSketch.KLL.quantile(kll, 0.99)  # 99th percentile (~99_000)

# Bloom: membership testing
bloom = ExDataSketch.Bloom.new(capacity: 100_000)
bloom = ExDataSketch.Bloom.put_many(bloom, 1..50_000)
ExDataSketch.Bloom.member?(bloom, 42)      # true
ExDataSketch.Bloom.member?(bloom, 99_999)  # false (probably)
```

See the [Quick Start Guide](guides/quick_start.md) for more examples.

## Livebooks

Nine production-oriented Livebooks demonstrate real-world patterns, from
basic stream consumption to distributed merge semantics and AI workload
analytics. See [Livebooks Guide](guides/livebooks.md) for the recommended
reading order and what each Livebook teaches.

## Documentation

Full documentation is available at [HexDocs](https://hexdocs.pm/ex_data_sketch).

## Architecture

- **Binary state**: All sketch state is canonical Elixir binaries. No opaque NIF resources.
- **Backend system**: Pure Elixir reference implementation with optional Rust NIF acceleration. The Rust backend falls back to Pure automatically when unavailable.
- **Streaming native**: `ExDataSketch.Stream`, `Collectable`, and `from_enumerable/2` for lazy stream consumption. Broadway, GenStage, and Flow integration for production pipelines.
- **Persistence**: ETS, DETS, CubDB, Mnesia, and Ecto backends with atomic `save/load/merge/delete` operations. All storage uses EXSK v2 binary format with CRC32C checksum.
- **Observability**: Structured `:telemetry` events at batch boundaries, optional OpenTelemetry span bridge, category-based enable/disable.
- **Serialization**: ExDataSketch-native format (EXSK) for all sketches, plus Apache DataSketches CompactSketch interop for Theta. v1 format escape hatch for backward compatibility.
- **Deterministic hashing**: Stable 64-bit hash (`ExDataSketch.Hash`) for reproducible results.
- **Backend parity**: Both backends produce byte-identical serialized output for the same inputs.

## Compatibility and Stability

The following guarantees apply within the v0.x release series:

- **EXSK serialization**: The ExDataSketch-native binary format is stable. Binaries produced by any v0.x release can be deserialized by any other v0.x release.
- **Pure vs Rust parity**: Given identical inputs, both backends produce byte-identical serialized state and identical estimates.
- **Deterministic output**: The same input sequence always produces the same sketch state and estimate, regardless of backend.
- **Backward compatibility**: v0.7.x EXSK v1 frames are decodable by v0.9.0. `HLL.serialize(sketch, format: :v1)` produces backward-compatible v0.7.x output (requires `:phash2` hash strategy).

Not guaranteed:

- **ULL estimate stability**: ULL estimates at very low cardinalities (p < 12, n < 500) differ between v0.8.0 and v0.9.0 due to accuracy corrections. The v0.9.0 estimates are more accurate.
- **Cross-language interop**: Only Theta supports Apache DataSketches CompactSketch format. HLL and CMS DataSketches interop is not implemented.
- **Performance stability**: Benchmark results may vary across hardware and OTP versions.
- **EXSK format across major versions**: The binary format may change in future major releases.

## Development

```bash
# Get dependencies
mix deps.get

# Run tests with coverage
mix test --cover

# Run lints
mix lint

# Run benchmarks
mix bench

# Generate docs
mix docs
```

## Roadmap

| Version | Focus | Status |
|---------|-------|--------|
| v0.1.0 | Core sketches (HLL, CMS, Theta) + Rust NIFs | Released |
| v0.2.0 | KLL quantiles | Released |
| v0.2.1 | DDSketch relative-error quantiles | Released |
| v0.3.0 | FrequentItems (SpaceSaving) | Released |
| v0.4.0 | Bloom filter (membership testing) | Released |
| v0.5.0 | Advanced membership filters (Cuckoo, Quotient, CQF, XorFilter, IBLT, FilterChain) | Released |
| v0.6.0 | REQ sketch, Misra-Gries, XXHash3, Rust NIF parity for all membership filters | Released |
| v0.7.0 | ULL (UltraLogLog) -- improved cardinality estimation with Pure + Rust NIF | Released |
| v0.7.1 | NIF batch hashing, hash customization, quotient filter fix, merge safety | Released |
| v0.8.0 | Deterministic Foundations -- pluggable hash registry (XXHash3 + Murmur3), binary stability and corruption detection, HLL hot-path optimization, precompiled NIFs, property-based validation | Released |
| v0.9.0 | Streaming Integrations -- Stream/Collectable API, Broadway/GenStage/Flow integration, persistence (ETS/DETS/CubDB/Mnesia/Ecto), telemetry + OpenTelemetry, ULL accuracy fix, v1 serialization escape hatch | Released |
| v0.10.0 | Apache Interoperability -- full cross-language KLL and HLL exchange, golden binary corpus, validation suite | Planned |
| v0.11.0 | New Sketch Families -- CPC (Compressed Probabilistic Counting), Tuple Sketch (weighted distinct counting) | Planned |
| v0.12.0 | Similarity & Sampling -- MinHash, Weighted MinHash, VarOpt sampling | Planned |
| v1.0.0 | Stable Binary Contract -- locked EXSK format, full benchmark suite, Nx / Arrow ecosystem integrations | Planned |

See [`guides/roadmap.md`](guides/roadmap.md) for the next-release preview.
The long-form strategic roadmap is in [`plans/next_steps.md`](plans/next_steps.md).

## License

MIT License. See [LICENSE](LICENSE) for details.