Skip to main content

guides/sharded-duckdb.md

# Sharded DuckDB semantics

Sharded DuckDB indexes split a corpus across multiple independent DuckDB databases and query them through `Exograph.ShardedIndex`. This is useful for large local corpora because it reduces single-file write/MERGE contention during ingestion.

Sharding is currently an opt-in architecture, not the default DuckDB backend.

## CLI usage

Build a sharded Hex corpus by choosing a shard count, shard directory, and manifest path:

```bash
mix exograph.index.hex \
  --backend duckdb \
  --entries-file bench-results/fixed-top2000-20260614/entries.ndjson \
  --tarball-dir /tmp/exograph-top2000-tarballs \
  --concurrency 8 \
  --duckdb-shards 4 \
  --shard-concurrency 2 \
  --shard-pool-size 2 \
  --duckdb-threads 2 \
  --shard-dir data/hex-shards \
  --manifest-path data/hex-shards/manifest.term
```

Open a sharded corpus in the web UI with the manifest:

```bash
mix exograph.web \
  --manifest-path data/hex-shards/manifest.term \
  --duckdb-threads 2 \
  --shard-pool-size 1 \
  --shard-port-base 19700
```

The manifest stores shard file paths and package/version ownership metadata. Keep it with the shard database files; moving shard files requires updating or rebuilding the manifest.

## What is intended to match single-DB behavior

Package/version-scoped queries should route to the shard that owns the package version and should match the corresponding single-package results. Use package/version filters with the same identity stored in the manifest. Filters may be maps/keywords with `:name` and `:version`, or `Exograph.PackageVersion` structs with `:package_name` and `:version`:

```elixir
Exograph.search(index, "def handle_call(_, _, _) do ... end",
  package_version: %{name: "my_package", version: "1.2.3"}
)
```

The shard manifest records package/version ownership so `Exograph` can avoid querying unrelated shards.

## What is intentionally not transparent yet

Shard-local tables keep local fragment IDs, term IDs, and content de-duplication. As a result, global row counts and broad unscoped searches are not guaranteed to match a single logical DuckDB index.

Known differences from the fixed top500 probe:

- fragment/content/term de-duplication is shard-local;
- global `fragments`, `terms`, and `fragment_terms` counts can be higher than single-DB counts;
- broad structural queries can over-count;
- a simple fanout dedup by fragment hash did not restore parity.

Do not present sharded DuckDB as an invisible performance flag until global search semantics are redesigned.

## QuackDB boundary

Exograph owns sharded corpus semantics: package-to-shard ownership, shard-local versus global fragment identity, search fanout, ranking, and result aggregation. QuackDB should not grow generic application-level sharded search semantics at this stage.

If repeated infrastructure appears, only low-level primitives should move down to QuackDB later, such as managed server groups, connection-group lifecycle helpers, fanout execution helpers, or telemetry aggregation across connections.

## Product choices before making sharding default

A sharded read architecture needs explicit product semantics for global queries. Possible directions:

1. Keep shard-local identity and document global counts/search as shard-local aggregates.
2. Add a global result identity/ranking model for fanout search.
3. Use sharding only as a build/staging strategy, then finalize into one logical global index.

Until one of those is chosen, single-DuckDB online MERGE remains the default correctness baseline.