guides/posting_lists.md

# Posting Lists

Posting lists are used for inverted indexes, search engines, and document tagging systems. erlang-rocksdb provides a specialized merge operator for efficient posting list management with built-in support for set operations and roaring bitmap indexes.

## Overview

A posting list stores a set of keys (e.g., document IDs) associated with a term. The merge operator allows efficient append operations and handles key deletion using tombstones that are automatically cleaned up during merge operations (reads and compaction).

**Key Features:**
- Sorted keys (lexicographic order)
- Roaring bitmap for fast existence checks and set operations
- Automatic V1 to V2 format migration
- Efficient intersection, union, and difference operations

## Binary Formats

### V2 Format (Current)

V2 is the default format as of version 2.5.0. It stores keys sorted and includes a roaring64 bitmap for fast operations.

| Field | Size | Description |
|-------|------|-------------|
| Version | 1 byte | 0x02 for V2 |
| BitmapSize | 4 bytes (big-endian) | Size of serialized roaring bitmap |
| BitmapData | BitmapSize bytes | Serialized roaring64 bitmap |
| KeyCount | 4 bytes (big-endian) | Number of keys |
| Keys | Variable | Sorted entries: `<Len:32/big><Key:Len>...` |

### V1 Format (Legacy)

V1 format is still readable but will be automatically upgraded to V2 on the next merge operation.

| Field | Size | Description |
|-------|------|-------------|
| KeyLength | 4 bytes (big-endian) | Length of the key data |
| Flag | 1 byte | 0 = normal, non-zero = tombstone |
| KeyData | KeyLength bytes | The key binary |

## Setup

Open a database with the posting list merge operator:

```erlang
{ok, Db} = rocksdb:open("mydb", [
    {create_if_missing, true},
    {merge_operator, posting_list_merge_operator}
]).
```

## Adding Keys

Use merge with `{posting_add, Key}`:

```erlang
ok = rocksdb:merge(Db, <<"term:erlang">>, {posting_add, <<"doc1">>}, []),
ok = rocksdb:merge(Db, <<"term:erlang">>, {posting_add, <<"doc2">>}, []),
ok = rocksdb:merge(Db, <<"term:erlang">>, {posting_add, <<"doc3">>}, []).
```

## Deleting Keys

Use merge with `{posting_delete, Key}` to add a tombstone:

```erlang
ok = rocksdb:merge(Db, <<"term:erlang">>, {posting_delete, <<"doc2">>}, []).
```

The key is logically deleted and removed during the merge operation.

## Reading Posting Lists

```erlang
{ok, Binary} = rocksdb:get(Db, <<"term:erlang">>, []).
```

## Helper Functions

### Get Active Keys (Sorted)

Returns deduplicated keys in lexicographic order:

```erlang
Keys = rocksdb:posting_list_keys(Binary),
%% [<<"doc1">>, <<"doc3">>]  % sorted
```

### Check if Key is Active

```erlang
true = rocksdb:posting_list_contains(Binary, <<"doc1">>),
false = rocksdb:posting_list_contains(Binary, <<"doc2">>).  % deleted
```

### Find Key

Returns `{ok, IsTombstone}` or `not_found`:

```erlang
{ok, false} = rocksdb:posting_list_find(Binary, <<"doc1">>),
not_found = rocksdb:posting_list_find(Binary, <<"unknown">>).
```

### Count Active Keys

```erlang
Count = rocksdb:posting_list_count(Binary).
```

### Get Format Version

```erlang
2 = rocksdb:posting_list_version(Binary).  % V2 format
```

### Convert to Map

Get the full state as a map:

```erlang
Map = rocksdb:posting_list_to_map(Binary),
%% #{<<"doc1">> => active, <<"doc3">> => active}
```

### Decode to List (V1 only)

For V1 format binaries:

```erlang
Entries = rocksdb:posting_list_decode(Binary),
%% [{<<"doc1">>, false}, {<<"doc2">>, true}]
```

### Fold Over Entries

```erlang
Count = rocksdb:posting_list_fold(
    fun(_Key, _IsTombstone, Acc) -> Acc + 1 end,
    0,
    Binary
).
```

## Set Operations

V2 posting lists support efficient set operations using roaring bitmaps:

### Intersection

Find keys present in both posting lists:

```erlang
{ok, Bin1} = rocksdb:get(Db, <<"term:erlang">>, []),
{ok, Bin2} = rocksdb:get(Db, <<"term:otp">>, []),
ResultBin = rocksdb:posting_list_intersection(Bin1, Bin2),
CommonKeys = rocksdb:posting_list_keys(ResultBin).
```

### Union

Combine all keys from both posting lists:

```erlang
ResultBin = rocksdb:posting_list_union(Bin1, Bin2),
AllKeys = rocksdb:posting_list_keys(ResultBin).
```

### Difference

Find keys in the first list but not in the second:

```erlang
ResultBin = rocksdb:posting_list_difference(Bin1, Bin2),
UniqueKeys = rocksdb:posting_list_keys(ResultBin).
```

### Fast Intersection Count

Get cardinality without materializing the result (uses roaring bitmap):

```erlang
Count = rocksdb:posting_list_intersection_count(Bin1, Bin2).
```

### Multi-List Intersection

Intersect multiple posting lists efficiently (processes smallest first):

```erlang
ResultBin = rocksdb:posting_list_intersect_all([Bin1, Bin2, Bin3]).
```

### Bitmap Contains (Fast Lookup)

Fast hash-based lookup using the embedded bitmap:

```erlang
true = rocksdb:posting_list_bitmap_contains(Binary, <<"doc1">>).
```

Note: May have rare false positives due to hash collisions. Use `posting_list_contains/2` for exact checks.

## Postings Resource API

For repeated lookups on the same posting list, use the resource-based API. Parse once, lookup many times with O(1) performance.

### Open/Parse Posting List

```erlang
{ok, Binary} = rocksdb:get(Db, <<"term:erlang">>, []),
{ok, Postings} = rocksdb:postings_open(Binary).
```

### Fast Contains Lookup

```erlang
%% Exact match (O(log n) using sorted set)
true = rocksdb:postings_contains(Postings, <<"doc1">>),

%% Hash-based lookup (O(1), rare false positives)
true = rocksdb:postings_bitmap_contains(Postings, <<"doc1">>).
```

### Count and Keys

```erlang
Count = rocksdb:postings_count(Postings),
Keys = rocksdb:postings_keys(Postings).  %% sorted
```

### Set Operations on Resources

Set operations accept both binaries and resources, returning a resource:

```erlang
%% Intersection (AND)
{ok, Result} = rocksdb:postings_intersection(Postings1, Postings2),

%% Union (OR)
{ok, Result} = rocksdb:postings_union(Postings1, Postings2),

%% Difference (A - B)
{ok, Result} = rocksdb:postings_difference(Postings1, Postings2),

%% Fast intersection count
Count = rocksdb:postings_intersection_count(Postings1, Postings2),

%% Multi-way intersection
{ok, Result} = rocksdb:postings_intersect_all([P1, P2, P3]).
```

### Convert Back to Binary

```erlang
Binary = rocksdb:postings_to_binary(Postings).
```

### Performance Comparison

| Method | Lookup Time |
|--------|-------------|
| `postings_contains/2` | ~0.1-0.2 us |
| `postings_bitmap_contains/2` | ~0.1 us |
| `posting_list_to_map/1` + `maps:is_key/2` | ~0.1 us |
| `posting_list_contains/2` (binary) | ~1-10 us |

Use the resource API for batch lookups (e.g., checking many document IDs against a term's posting list).

## Tombstone Cleanup

Tombstones are automatically cleaned up by the merge operator:

- **During reads**: When reading a key, the merge operator combines all entries and removes tombstoned keys from the result.
- **During compaction**: The merge operator consolidates entries, keeping only active (non-tombstoned) keys.

This means you don't need a separate compaction filter for tombstone removal - it's built into the merge operator.

## Format Migration

V1 data is automatically migrated to V2 format on the next merge operation. No manual migration is needed. You can check the format version with:

```erlang
Version = rocksdb:posting_list_version(Binary).
%% 1 = V1 (legacy), 2 = V2 (current)
```

## Use Cases

- **Inverted Index**: Map terms to document IDs for full-text search
- **Tagging System**: Map tags to item IDs
- **Graph Adjacency**: Store outgoing edges for each node
- **Set Membership**: Efficient set operations via roaring bitmaps
- **Query Intersection**: Fast AND queries across multiple terms

## Performance Tips

1. **Batch Writes**: Use write_batch to add multiple keys atomically
2. **Periodic Compaction**: Run compaction to reclaim space and optimize layout
3. **Large Lists**: For very large posting lists (100K+), consider sharding by key prefix
4. **Use Set Operations**: For multi-term queries, use `postings_intersection/2` or `postings_intersect_all/1` instead of manual iteration
5. **Intersection Count**: Use `postings_intersection_count/2` when you only need cardinality
6. **Use Resources for Batch Lookups**: For multiple contains checks, use `postings_open/1` once then `postings_contains/2` for each lookup (~0.1 us vs ~1-10 us per lookup)
7. **NIF Functions**: All helper functions are implemented as NIFs for efficiency

## Example: Inverted Index with Query

```erlang
%% Index a document
index_document(Db, DocId, Terms) ->
    lists:foreach(fun(Term) ->
        ok = rocksdb:merge(Db, <<"term:", Term/binary>>, {posting_add, DocId}, [])
    end, Terms).

%% Remove document from index
remove_document(Db, DocId, Terms) ->
    lists:foreach(fun(Term) ->
        ok = rocksdb:merge(Db, <<"term:", Term/binary>>, {posting_delete, DocId}, [])
    end, Terms).

%% Search for documents containing a term
search(Db, Term) ->
    case rocksdb:get(Db, <<"term:", Term/binary>>, []) of
        {ok, Binary} -> rocksdb:posting_list_keys(Binary);
        not_found -> []
    end.

%% Search for documents containing ALL terms (AND query)
search_all(Db, Terms) ->
    Postings = lists:filtermap(fun(Term) ->
        case rocksdb:get(Db, <<"term:", Term/binary>>, []) of
            {ok, Binary} ->
                {ok, P} = rocksdb:postings_open(Binary),
                {true, P};
            not_found -> false
        end
    end, Terms),
    case Postings of
        [] -> [];
        _ ->
            {ok, Result} = rocksdb:postings_intersect_all(Postings),
            rocksdb:postings_keys(Result)
    end.

%% Search for documents containing ANY term (OR query)
search_any(Db, Terms) ->
    Postings = lists:filtermap(fun(Term) ->
        case rocksdb:get(Db, <<"term:", Term/binary>>, []) of
            {ok, Binary} ->
                {ok, P} = rocksdb:postings_open(Binary),
                {true, P};
            not_found -> false
        end
    end, Terms),
    case Postings of
        [] -> [];
        [Single] -> rocksdb:postings_keys(Single);
        [First | Rest] ->
            {ok, Result} = lists:foldl(fun(P, {ok, Acc}) ->
                rocksdb:postings_union(Acc, P)
            end, {ok, First}, Rest),
            rocksdb:postings_keys(Result)
    end.

%% Check if document contains term (single lookup)
has_term(Db, Term, DocId) ->
    case rocksdb:get(Db, <<"term:", Term/binary>>, []) of
        {ok, Binary} -> rocksdb:posting_list_contains(Binary, DocId);
        not_found -> false
    end.

%% Check if document contains term (batch lookups - use resource)
has_term_batch(Postings, DocId) ->
    rocksdb:postings_contains(Postings, DocId).

%% Count documents matching AND query
count_matches(Db, Terms) ->
    Postings = lists:filtermap(fun(Term) ->
        case rocksdb:get(Db, <<"term:", Term/binary>>, []) of
            {ok, Binary} ->
                {ok, P} = rocksdb:postings_open(Binary),
                {true, P};
            not_found -> false
        end
    end, Terms),
    case Postings of
        [] -> 0;
        [Single] -> rocksdb:postings_count(Single);
        [First, Second | Rest] ->
            %% Use fast intersection count
            lists:foldl(fun(P, Count) ->
                min(Count, rocksdb:postings_intersection_count(First, P))
            end, rocksdb:postings_intersection_count(First, Second), Rest)
    end.

%% Filter documents by multiple terms (batch contains check)
filter_docs(Db, Term, DocIds) ->
    case rocksdb:get(Db, <<"term:", Term/binary>>, []) of
        {ok, Binary} ->
            {ok, Postings} = rocksdb:postings_open(Binary),
            [DocId || DocId <- DocIds,
                      rocksdb:postings_contains(Postings, DocId)];
        not_found -> []
    end.
```