# Bloomy 🌸
**High-performance Bloom Filter library for Elixir, powered by Nx**
Bloomy provides probabilistic data structures for efficient set membership testing with minimal memory usage. Built on Nx tensors for blazing-fast performance with optional EXLA/GPU acceleration.
[](https://elixir-lang.org/)
[](https://github.com/elixir-nx/nx)
## Features
- 🚀 **High Performance** - Nx tensor operations with EXLA/GPU acceleration
- 🎯 **Multiple Types** - Standard, Counting (with deletion), Scalable (auto-grows), Learned (ML-enhanced)
- 💾 **Persistence** - Serialization with optional compression
- 🔀 **Merge Operations** - Union/intersection for distributed systems
- 📊 **Rich Statistics** - Comprehensive monitoring and metrics
- 🎓 **Well Documented** - Extensive docs and examples
## Installation
Add `bloomy` to your dependencies in `mix.exs`:
```elixir
def deps do
[
{:bloomy, "~> 0.1.0"},
{:nx, "~> 0.10.0"},
{:exla, "~> 0.10.0"} # Optional but recommended for performance
]
end
```
## Quick Start
```elixir
# Create a bloom filter for 10,000 items with 1% false positive rate
filter = Bloomy.new(10_000, false_positive_rate: 0.01)
# Add items
filter = filter
|> Bloomy.add("user@example.com")
|> Bloomy.add("alice@example.com")
|> Bloomy.add("bob@example.com")
# Check membership
Bloomy.member?(filter, "user@example.com") # => true
Bloomy.member?(filter, "other@example.com") # => false
# Get statistics
Bloomy.info(filter)
# => %{
# type: :standard,
# capacity: 10000,
# items_count: 3,
# fill_ratio: 0.0021,
# false_positive_rate: 0.01,
# ...
# }
```
## Bloom Filter Types
### Standard Bloom Filter
Classic space-efficient probabilistic set membership test.
```elixir
filter = Bloomy.new(10_000)
filter = Bloomy.add(filter, "apple")
Bloomy.member?(filter, "apple") # => true
```
### Counting Bloom Filter
Supports deletion operations using counters instead of bits.
```elixir
filter = Bloomy.new(10_000, type: :counting)
filter = Bloomy.add(filter, "item")
Bloomy.member?(filter, "item") # => true
filter = Bloomy.remove(filter, "item")
Bloomy.member?(filter, "item") # => false
```
### Scalable Bloom Filter
Automatically grows to maintain target false positive rate.
```elixir
filter = Bloomy.new(1_000, type: :scalable)
# Add millions of items - it will automatically scale!
filter = Enum.reduce(1..1_000_000, filter, fn i, f ->
Bloomy.add(f, "item_#{i}")
end)
info = Bloomy.info(filter)
info.slices_count # => Multiple slices created automatically
```
### Learned Bloom Filter
ML-enhanced filter with lower false positive rates.
```elixir
filter = Bloomy.new(10_000, type: :learned)
# Train with positive and negative examples
training_data = %{
positive: ["valid_user_1", "valid_user_2", "valid_user_3"],
negative: ["spam_1", "spam_2", "spam_3"]
}
filter = Bloomy.train(filter, training_data)
filter = Bloomy.add(filter, "valid_user_1")
Bloomy.member?(filter, "valid_user_1") # => true (uses ML model + backup filter)
```
## Performance: Using EXLA Backend
### Why EXLA?
EXLA provides significant performance improvements (2-20x faster) by:
- JIT compilation of numerical operations
- CPU/GPU acceleration
- Optimized tensor operations
### Option 1: Set Default Backend (Recommended)
In your `config/config.exs`:
```elixir
# Use EXLA with CPU
config :nx, default_backend: EXLA.Backend
# Or use EXLA with GPU (if CUDA is available)
config :nx, default_backend: {EXLA.Backend, client: :cuda}
# Or use EXLA with specific device
config :nx, default_backend: {EXLA.Backend, client: :cuda, device_id: 0}
```
### Option 2: Per-Filter Backend
Specify backend when creating filters:
```elixir
# Use EXLA backend for this filter
filter = Bloomy.new(100_000, backend: EXLA.Backend)
# Use BinaryBackend (default, no compilation)
filter = Bloomy.new(100_000, backend: Nx.BinaryBackend)
```
### Option 3: Runtime Backend Selection
```elixir
# Set backend at runtime
Nx.default_backend(EXLA.Backend)
# All filters created after this will use EXLA
filter = Bloomy.new(100_000)
```
### GPU Acceleration
To use GPU acceleration (requires CUDA):
```elixir
# In config/config.exs
config :nx, default_backend: {EXLA.Backend, client: :cuda}
# Or at runtime
Nx.default_backend({EXLA.Backend, client: :cuda})
# Create filter - will automatically use GPU
filter = Bloomy.new(1_000_000)
```
### Backend Performance Comparison
Run the backend comparison benchmark:
```bash
mix run benchmarks/backend_comparison.exs
```
**Typical Performance Gains:**
- Small filters (<10K): ~1.5x faster
- Medium filters (10K-100K): 2-5x faster
- Large filters (>100K): 5-10x faster
- Batch operations: 10-20x faster
- GPU (large filters): 20-50x faster
## Operations
### Batch Operations
```elixir
# Add multiple items efficiently
filter = Bloomy.add_all(filter, ["apple", "banana", "orange"])
# Batch membership testing
results = Bloomy.batch_member?(filter, ["apple", "grape", "banana"])
# => %{"apple" => true, "grape" => false, "banana" => true}
# Create from list
filter = Bloomy.from_list(["a", "b", "c", "d"])
```
### Merge Operations
```elixir
# Union (merge) filters
f1 = Bloomy.new(10_000) |> Bloomy.add("apple")
f2 = Bloomy.new(10_000) |> Bloomy.add("banana")
merged = Bloomy.union(f1, f2)
Bloomy.member?(merged, "apple") # => true
Bloomy.member?(merged, "banana") # => true
# Union multiple filters
filters = [filter1, filter2, filter3]
merged = Bloomy.union_all(filters)
# Intersection
intersected = Bloomy.intersect_all([filter1, filter2])
```
### Similarity Metrics
```elixir
# Calculate Jaccard similarity
similarity = Bloomy.jaccard_similarity(filter1, filter2)
# => 0.75 (75% similar)
# Calculate overlap coefficient
overlap = Bloomy.Operations.overlap_coefficient(filter1, filter2)
```
## Persistence
### Binary Serialization
```elixir
# Serialize to binary
binary = Bloomy.to_binary(filter)
# With compression (recommended for large filters)
binary = Bloomy.to_binary(filter, compress: true)
# Deserialize
{:ok, loaded_filter} = Bloomy.from_binary(binary)
```
### File I/O
```elixir
# Save to file
:ok = Bloomy.save(filter, "my_filter.bloom")
# Save with compression
:ok = Bloomy.save(filter, "my_filter.bloom", compress: true)
# Load from file
{:ok, filter} = Bloomy.load("my_filter.bloom")
```
## Benchmarking
Run the included benchmarks to see performance on your system:
```bash
# Basic operations (add, member?, batch operations)
mix run benchmarks/basic_operations.exs
# Compare filter types (Standard, Counting, Scalable)
mix run benchmarks/filter_comparison.exs
# Merge operations (union, intersection, similarity)
mix run benchmarks/operations.exs
# Serialization and persistence
mix run benchmarks/serialization.exs
# Backend comparison (BinaryBackend vs EXLA)
mix run benchmarks/backend_comparison.exs
```
## Configuration Options
### Creating Filters
```elixir
Bloomy.new(capacity, opts)
```
**Common Options:**
- `:type` - Filter type: `:standard`, `:counting`, `:scalable`, `:learned` (default: `:standard`)
- `:false_positive_rate` - Target false positive rate (default: `0.01` or 1%)
- `:backend` - Nx backend: `Nx.BinaryBackend`, `EXLA.Backend` (default: `Nx.default_backend()`)
**Counting Filter Options:**
- `:counter_width` - Bits per counter: `8`, `16`, `32` (default: `8`)
**Scalable Filter Options:**
- `:growth_factor` - Capacity multiplier for new slices (default: `2`)
- `:error_tightening_ratio` - Error rate multiplier (default: `0.8`)
**Learned Filter Options:**
- `:confidence_threshold` - Model confidence threshold (default: `0.7`)
### Examples
```elixir
# Standard with custom false positive rate
filter = Bloomy.new(10_000, false_positive_rate: 0.001)
# Counting with 16-bit counters and EXLA
filter = Bloomy.new(10_000,
type: :counting,
counter_width: 16,
backend: EXLA.Backend
)
# Scalable with custom growth
filter = Bloomy.new(1_000,
type: :scalable,
growth_factor: 3,
error_tightening_ratio: 0.5
)
```
## Use Cases
### Web Application - User Session Tracking
```elixir
# Track active sessions
defmodule SessionTracker do
use GenServer
def start_link(_) do
GenServer.start_link(__MODULE__, nil, name: __MODULE__)
end
def init(_) do
# 1M sessions, 0.1% false positive rate, use EXLA for speed
filter = Bloomy.new(1_000_000,
false_positive_rate: 0.001,
backend: EXLA.Backend
)
{:ok, filter}
end
def track_session(session_id) do
GenServer.cast(__MODULE__, {:track, session_id})
end
def active?(session_id) do
GenServer.call(__MODULE__, {:check, session_id})
end
def handle_cast({:track, session_id}, filter) do
{:noreply, Bloomy.add(filter, session_id)}
end
def handle_call({:check, session_id}, _from, filter) do
{:reply, Bloomy.member?(filter, session_id), filter}
end
end
```
### Distributed System - Seen Items Cache
```elixir
# Track which items have been processed across nodes
defmodule DistributedCache do
def merge_caches(node_filters) do
# Collect filters from all nodes
filters = Enum.map(node_filters, & &1)
# Merge into single filter
merged = Bloomy.union_all(filters)
# Distribute back to all nodes
distribute_filter(merged)
end
def sync_filter(local_filter) do
# Save filter to shared storage
Bloomy.save(local_filter, "/shared/cache.bloom", compress: true)
end
def load_filter do
case Bloomy.load("/shared/cache.bloom") do
{:ok, filter} -> filter
{:error, _} -> Bloomy.new(1_000_000)
end
end
end
```
### Data Pipeline - Deduplication
```elixir
defmodule Deduplicator do
def process_stream(stream) do
initial_filter = Bloomy.new(10_000_000, type: :scalable)
Stream.transform(stream, initial_filter, fn item, filter ->
if Bloomy.member?(filter, item.id) do
# Skip duplicate
{[], filter}
else
# New item, add to filter and pass through
{[item], Bloomy.add(filter, item.id)}
end
end)
end
end
```
## Performance Tips
1. **Use EXLA Backend** - Significant performance improvement (2-20x faster)
```elixir
config :nx, default_backend: EXLA.Backend
```
2. **Choose Appropriate Capacity** - Overestimate slightly for better performance
```elixir
# If expecting ~10K items, use 15K capacity
filter = Bloomy.new(15_000)
```
3. **Batch Operations** - Use `add_all` and `batch_member?` for multiple items
```elixir
# Fast: single batch operation
filter = Bloomy.add_all(filter, items)
# Slower: many individual operations
filter = Enum.reduce(items, filter, &Bloomy.add(&2, &1))
```
4. **Use Scalable for Unknown Sizes** - Automatically adapts to data volume
```elixir
filter = Bloomy.new(1_000, type: :scalable)
```
5. **Serialize with Compression** - Reduces storage/network overhead
```elixir
Bloomy.save(filter, "cache.bloom", compress: true)
```
## How It Works
### Bloom Filters
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set:
- **False positives** are possible (item might be in set)
- **False negatives** are impossible (if not found, definitely not in set)
- Uses multiple hash functions to set/check bits in an array
### Mathematical Foundation
- **Optimal bit array size**: `m = -(n * ln(p)) / (ln(2)^2)`
- **Optimal hash functions**: `k = (m / n) * ln(2)`
- **False positive rate**: `p = (1 - e^(-kn/m))^k`
Where:
- `n` = number of items
- `p` = false positive rate
- `m` = bit array size
- `k` = number of hash functions
### Implementation Details
- **Hash Functions**: MurmurHash3 + FNV-1a with double hashing technique
- **Storage**: Nx tensors (`:u8` for bits, `:u8/:u16/:u32` for counters)
- **Acceleration**: EXLA JIT compilation for vectorized operations
- **ML**: Simple neural network for learned filters
## API Reference
### Main Functions
- `new/2` - Create new bloom filter
- `add/2` - Add item to filter
- `add_all/2` - Add multiple items
- `member?/2` - Check if item might be in set
- `remove/2` - Remove item (counting filters only)
- `clear/1` - Reset filter to empty state
- `info/1` - Get statistics and metadata
### Operations
- `union/2`, `union_all/1` - Merge filters
- `intersect_all/1` - Intersect filters
- `batch_member?/2` - Test multiple items
- `from_list/2` - Create filter from list
- `jaccard_similarity/2` - Calculate similarity
### Persistence
- `save/3` - Save to file
- `load/2` - Load from file
- `to_binary/2` - Serialize to binary
- `from_binary/2` - Deserialize from binary
### ML (Learned Filters)
- `train/3` - Train model on examples
## Testing
Run the test suite:
```bash
mix test
```
Run the integration tests:
```bash
mix run test_bloomy.exs
```
## License
MIT License
## Acknowledgments
- Built with [Nx](https://github.com/elixir-nx/nx) - Numerical computing library
- Uses [EXLA](https://github.com/elixir-nx/nx/tree/main/exla) - Accelerated Linear Algebra
- ML features powered by [Scholar](https://github.com/elixir-nx/scholar)
- Inspired by classical bloom filter research and "The Case for Learned Index Structures"
## Resources
- [Bloom Filter on Wikipedia](https://en.wikipedia.org/wiki/Bloom_filter)
- [Nx Documentation](https://hexdocs.pm/nx)
- [EXLA Documentation](https://hexdocs.pm/exla)
- [Space/Time Trade-offs in Hash Coding with Allowable Errors](https://dl.acm.org/doi/10.1145/362686.362692) - Original paper
---
Made with ❤️ and Elixir