# Vettore

Vettore is a high-performance Elixir library for in-memory operations on vector (embedding) data. It uses a Rust backend via Rustler to store and manipulate vectors efficiently in a concurrency-safe `HashMap`.

## Features

* **Collections**: Create named sets of embeddings with a fixed dimension and a choice of similarity metric.
* **CRUD operations**: Insert, batch-insert, retrieve, and delete embeddings by ID or by vector.
* **Similarity search**: Nearest-neighbor search with customizable `:limit` and optional metadata filtering.
* **Reranking**: Maximal Marginal Relevance (MMR) reranker for diversity-aware result reordering.
* **Distance helpers**: Standalone Euclidean, Cosine, Dot, and Hamming metrics, plus binary compression for ultra-fast comparisons.

## Installation

Add `vettore` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:vettore, "~> 0.2.2"}
  ]
end
```

Then fetch and compile:

```bash
mix deps.get
mix compile
```

*Note*: The first compile will build the Rust crate; ensure you have a recent Rust toolchain installed.

## Quickstart

```elixir
# 1. Start a new in-memory database reference
db = Vettore.new()

# 2. Create a collection named "my_collection" with 3-dimensional vectors
{:ok, _name} = Vettore.create_collection(db, "my_collection", 3, :euclidean)

# 3. Insert a single embedding
embedding = %Vettore.Embedding{
  value: "item_1",
  vector: [1.0, 2.0, 3.0],
  metadata: %{"note" => "first vector"}
}
{:ok, _value} = Vettore.insert(db, "my_collection", embedding)

# 4. Retrieve by value (the embedding's unique ID)
{:ok, emb} = Vettore.get_by_value(db, "my_collection", "item_1")
IO.inspect(emb.vector, label: "Vector")

# 5. Similarity search (top 2 nearest neighbors)
{:ok, results} = Vettore.similarity_search(db, "my_collection", [1.5, 1.5, 1.5], limit: 2)
IO.inspect(results, label: "Top-2 Results")

# 6. Rerank with MMR for diversity (alpha = 0.7)
{:ok, reranked} = Vettore.rerank(db, "my_collection", results, limit: 2, alpha: 0.7)
IO.inspect(reranked, label: "MMR Reranked")
```

## API Reference

### `Vettore.new/0`

```elixir
@spec new() :: reference()
```

Allocates and returns an in-memory database handle backed by Rust.

---

### `Vettore.create_collection/5`

```elixir
@spec create_collection(
        db :: reference(),
        name :: String.t(),
        dim :: pos_integer(),
        metric :: :euclidean | :cosine | :dot | :hnsw | :binary,
        opts :: [keep_embeddings: boolean()]
      ) :: {:ok, String.t()} | {:error, String.t()}
```

* **name**: Collection identifier
* **dim**: Dimensionality of vectors
* **metric**: Similarity measure
* **opts**:

  * `:keep_embeddings` (default: `true`) — whether to retain embeddings on deletion
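
For example (a minimal sketch; the collection name, dimension, and option value are illustrative, and `keep_embeddings: true` is simply the documented default):

```elixir
db = Vettore.new()

# 384-dimensional cosine collection; returns {:ok, binary} on success (see the spec above).
{:ok, _name} = Vettore.create_collection(db, "docs", 384, :cosine, keep_embeddings: true)
```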

---

### `Vettore.insert/3`

```elixir
@spec insert(
        db :: reference(),
        collection :: String.t(),
        embedding :: Vettore.Embedding.t()
      ) :: {:ok, String.t()} | {:error, String.t()}
```

Insert a single `%Vettore.Embedding{}` struct into the named collection.
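
A short sketch of a single insert into the Quickstart's 3-dimensional collection (the value, vector, and metadata below are example data):

```elixir
embedding = %Vettore.Embedding{
  value: "doc-42",
  vector: [0.1, 0.2, 0.3],
  metadata: %{"lang" => "en"}
}

# Returns {:ok, binary} on success and {:error, reason} otherwise (see the spec above).
{:ok, _} = Vettore.insert(db, "my_collection", embedding)
```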

---

### `Vettore.batch/3`

```elixir
@spec batch(
        db :: reference(),
        collection :: String.t(),
        embeddings :: [Vettore.Embedding.t()]
      ) :: {:ok, [String.t()]} | {:error, String.t()}
```

Batch-insert multiple embeddings at once; non-embedding elements are ignored.
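
A sketch of a batch insert, continuing the Quickstart's `db` and `"my_collection"` (the values and vectors are illustrative):

```elixir
embeddings = [
  %Vettore.Embedding{value: "item_2", vector: [0.9, 0.1, 0.0], metadata: %{"note" => "second"}},
  %Vettore.Embedding{value: "item_3", vector: [0.0, 0.2, 0.8], metadata: %{"note" => "third"}}
]

# Returns {:ok, list_of_values} on success (see the spec above).
{:ok, _inserted} = Vettore.batch(db, "my_collection", embeddings)
```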

---

### Retrieval and Deletion

* `Vettore.get_by_value/3` — fetch an embedding by its `value` (unique ID)
* `Vettore.get_by_vector/3` — fetch an embedding by an exactly matching vector
* `Vettore.get_all/2` — return all embeddings as `{value, vector, metadata}` tuples
* `Vettore.delete/3` — delete an embedding by its `value`
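
Continuing the Quickstart data, a minimal sketch of these calls (the `{:ok, _}` shapes for `get_by_vector/3`, `get_all/2`, and `delete/3` are assumptions based on the patterns above):

```elixir
# Fetch by value (the embedding's unique ID).
{:ok, emb} = Vettore.get_by_value(db, "my_collection", "item_1")

# Fetch by an exactly matching vector.
{:ok, _same} = Vettore.get_by_vector(db, "my_collection", [1.0, 2.0, 3.0])

# List every {value, vector, metadata} tuple in the collection.
{:ok, _all} = Vettore.get_all(db, "my_collection")

# Delete by value.
{:ok, _} = Vettore.delete(db, "my_collection", "item_1")
```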

---

### `Vettore.similarity_search/4`

```elixir
@spec similarity_search(
        db :: reference(),
        collection :: String.t(),
        query :: [number()],
        opts :: [limit: pos_integer(), filter: map()]
      ) :: {:ok, [{String.t(), float()}]} | {:error, String.t()}
```

* **limit** (default: `10`)
* **filter**: metadata map to pre-filter embeddings
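
For instance, restricting the search to embeddings whose metadata matches a filter map (a sketch continuing the Quickstart; the filter key and value are illustrative):

```elixir
{:ok, hits} =
  Vettore.similarity_search(db, "my_collection", [1.5, 1.5, 1.5],
    limit: 5,
    filter: %{"note" => "first vector"}
  )

# hits is a list of {value, score} tuples, best match first.
```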

---

### `Vettore.rerank/4` (MMR)

```elixir
@spec rerank(
        db :: reference(),
        collection :: String.t(),
        initial :: [{String.t(), number()}],
        opts :: [limit: pos_integer(), alpha: float()]
      ) :: {:ok, [{String.t(), number()}]} | {:error, String.t()}
```

* **alpha**: `0.0..1.0` balance between relevance and diversity
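
A common pattern is to over-fetch with `similarity_search/4` and then let MMR trim the list while promoting diversity (a sketch continuing the Quickstart; the limits and alpha value are illustrative):

```elixir
{:ok, candidates} =
  Vettore.similarity_search(db, "my_collection", [1.5, 1.5, 1.5], limit: 20)

# Re-order the 20 candidates and keep the 5 most relevant-yet-diverse ones.
{:ok, top5} = Vettore.rerank(db, "my_collection", candidates, limit: 5, alpha: 0.7)
```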

---

## Distance Helpers (`Vettore.Distance`)

You can call these functions without creating a DB or collection:

```elixir
Vettore.Distance.euclidean([1.0, 2.0], [2.0, 3.0])    # => 1 / (1 + L2)
Vettore.Distance.cosine([1.0, 0.0], [0.0, 1.0])       # => (dot + 1) / 2
Vettore.Distance.dot_product([1.0, 2.0], [3.0, 4.0])  # => raw dot

bits1 = Vettore.Distance.compress_f32_vector([0.1, 0.4])
bits2 = Vettore.Distance.compress_f32_vector([0.3, 0.2])
Vettore.Distance.hamming(bits1, bits2)                # => Hamming distance

# MMR re-ranker standalone (collection-agnostic); example data shown
initial = [{"id1", 0.9}, {"id2", 0.85}]
embeds  = [{"id1", [0.1, 0.2, 0.3]}, {"id2", [0.2, 0.1, 0.4]}]
Vettore.Distance.mmr_rerank(initial, embeds, "cosine", 0.5, 5)
```

## Similarity Search

The `similarity_search` function works as follows:

1. It retrieves the target collection and verifies that the query vector’s dimension matches.
2. Depending on the chosen distance metric, it selects an appropriate function:
   - **Euclidean:** Computes the standard Euclidean distance.
   - **Cosine / DotProduct:** Computes the dot product (with normalization applied for Cosine).
   - **HNSW:** Uses a graph-based approach for approximate nearest neighbor search.
   - **Binary:** Compresses the query vector into a binary signature and computes the Hamming distance between this signature and those of all stored embeddings.
3. For every embedding in the collection, it calculates a “score” between the stored vector (or its compressed representation) and the query.
4. The results are sorted:
   - For **Euclidean distance**, lower scores (closer to zero) are better.
   - For **Cosine/DotProduct**, higher scores are considered more similar.
   - For **Binary**, a lower Hamming distance means the vectors are more similar.
5. Finally, the top-k results are returned as a list of `{embedding_id, score}` tuples.
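
For example, an HNSW-backed collection is created and queried through the same API as the exact metrics; only the metric atom changes (a minimal sketch with illustrative names and dimensions):

```elixir
db = Vettore.new()
{:ok, _} = Vettore.create_collection(db, "ann_demo", 128, :hnsw)

vector = List.duplicate(0.5, 128)
embedding = %Vettore.Embedding{value: "a", vector: vector, metadata: %{"tag" => "demo"}}
{:ok, _} = Vettore.insert(db, "ann_demo", embedding)

# Results are approximate nearest neighbors, returned as {value, score} tuples.
{:ok, results} = Vettore.similarity_search(db, "ann_demo", vector, limit: 10)
```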

| Technique/Algorithm    | Measures                                                | Magnitude Sensitive? | Scale Invariant?    | Best Use Cases                                                             | Pros                                                                     | Cons                                                                            |
| ---------------------- | ------------------------------------------------------- | --------------------- | ------------------- | -------------------------------------------------------------------------- | ------------------------------------------------------------------------ | ------------------------------------------------------------------------------- |
| **Euclidean Distance** | Straight-line distance                                  | Yes                   | No                  | Dense data where both magnitude & direction are important                  | Intuitive, widely used, captures magnitude differences                   | Sensitive to scale differences, high dimensionality issues                      |
| **Cosine Similarity**  | Directional similarity (angle)                          | No                    | Yes                 | Text or high-dimensional data where scale invariance is desired            | Insensitive to magnitude, works well with normalized vectors             | Ignores magnitude differences                                                   |
| **Dot Product**        | Combination of direction & magnitude                    | Yes                   | No                  | Applications where both direction & magnitude matter                       | Computationally efficient, captures both aspects                         | Sensitive to vector magnitudes                                                  |
| **HNSW Indexing**      | Graph-based Approximate Nearest Neighbor Search         | Dependent on Metric   | Dependent on Metric | Large datasets, real-time search when approximate results are acceptable   | **Sublinear search time**, good speed-accuracy trade-off, scalable       | Approximate results, index build time and memory overhead                       |
| **Binary (Hamming)**   | Fast binary signature comparison using Hamming distance | No                    | Yes                 | Applications requiring ultra‑fast approximate searches on large-scale data | Extremely fast comparison via bit-level operations, low memory footprint | Loses precision due to compression, less suited when exact distances are needed |

---

## Performance Notes

- **HNSW** can speed up searches significantly for large datasets but comes with higher memory usage for the index.
- **Binary** distance uses bit-level compression and Hamming distance for extremely fast approximate similarity checks (especially beneficial for large or high-dimensional vectors, though it trades off some precision).
- **Cosine** normalizes vectors once on insertion, so at query time similarity reduces to a straightforward dot product.
- **Dot Product** directly multiplies corresponding elements.
- **Euclidean** uses a SIMD approach (`wide::f32x4`) for partial vectorization.

## Contributing

Contributions are welcome! Please open an issue or submit a PR.

1. Fork the repo
2. Create a feature branch
3. Add tests in `test/`
4. Submit a PR

## License

Apache 2.0. See [LICENSE](LICENSE).