Skip to main content

guides/architecture.md

# Architecture

Exograph is built around one principle: storage and indexes are advisory;
ExAST remains the semantic authority for structural matches.

## Components

- ExAST extracts structural terms, comments, symbols, and verifies patterns
- ExDNA provides structural fingerprints for fragments and similarity search
- Reach optionally extracts call graph facts
- Ecto/Postgres stores normalized files, fragments, facts, package scope, and graph facts
- ParadeDB optionally accelerates text and code-fact retrieval

## Indexing pipeline

```txt
source files
  ├── ExAST extractor
  │   ├── fragments
  │   ├── comments
  │   ├── definitions
  │   └── references
  ├── Reach extractor (optional)
  │   ├── graph nodes
  │   └── call edges
  └── Postgres stores
      ├── files
      ├── fragments
      ├── facts
      └── package/version scope
```

For Hex.pm indexing, an outer streaming loop wraps the pipeline:

```txt
Hex registry
  └── for each package (concurrent, bounded)
        ├── download tarball (HTTP, mirror round-robin)
        ├── detect Elixir files (skip non-Elixir before disk write)
        ├── extract to tmpdir
        ├── indexing pipeline (above)
        └── rm -rf tmpdir
```

## Storage model

`Exograph.Index` separates execution by concern:

- Postgres inverted index: structural term candidate retrieval from fragment rows
- fragment store: AST blobs, ExDNA hashes, symbols, and file joins
- source files: source text and aggregated comment text stored once per file
- code facts: normalized comments, definitions, references, graph nodes, and call edges
- tree access: derived lazily from stored AST fragments
- verifier: `ExAST.Pattern` / `ExAST.Query`
- similarity: ExDNA structural reranking

## Query execution

Structural queries are planned into candidate retrieval plus verification:

```txt
ExAST selector
  ├── required/advisory terms
  ├── Postgres candidate scan
  ├── hydrate fragments/source
  └── ExAST verification
```

DSL queries add relational candidate filters before structural verification:

```txt
Exograph.DSL.Query
  ├── Exograph.DSL.Plan validation
  ├── Ecto query over fragments/facts/calls
  ├── containing-function join semantics
  └── ExAST verification for fragment matches
```

## Lateral joins for line-range containment

The "containing function" join — find the `def` that contains a given fragment
at line N — uses a SQL `LATERAL` subquery rather than a self-join. The lateral
join evaluates the subquery once per outer row and uses the `(file_id, line,
end_line)` index to locate the enclosing fragment in O(log n) per row. This
keeps the containing-function semantic available without materializing a closure
table.

## Advisory locks for concurrent term insertion

When multiple workers index packages concurrently, term insertion into the
inverted index can deadlock on duplicate-key conflicts. Exograph acquires a
Postgres advisory lock keyed on `hash(term_text)` before inserting or looking up
a term record. This serializes conflicting inserts per term without locking the
entire terms table, and retries automatically on the rare case where two workers
hash-collide to different lock IDs.

## `(kind, name, arity)` btree index

Most structural patterns extract kind, name, and arity at query planning time
(e.g. `def handle_call(_, _, _) do ... end` → kind=`def`, name=`handle_call`,
arity=3). A btree index on `(kind, name, arity)` on the fragments table lets
these queries bypass the GIN term index entirely and go to a btree range scan,
which is significantly faster at high fragment counts. The GIN term index is
used only when the pattern has no extractable kind/name (e.g. `_ + _`).

## File-first text search with lateral fragment lookup

Text and regex search operate file-first rather than fragment-first:

```txt
text query
  ├── scan files.source with pg_trgm ILIKE (or BM25 ranking)
  ├── collect matching file IDs
  └── LATERAL join: for each file, find fragments containing the match line
```

This avoids storing duplicated source text per fragment and keeps `files.source`
as the single source of truth. The lateral join uses the `(file_id, line,
end_line)` btree index to locate the containing fragment efficiently.

## Why Postgres

Postgres gives Exograph:

- durable local/self-hosted indexes
- Ecto schemas and migrations
- package/version scope
- joins across structural and semantic facts
- optional ParadeDB BM25 indexes
- a natural substrate for tools that already run inside Elixir applications

## Raw SQL boundary

Exograph uses Ecto where possible. Raw SQL is limited to extension/backend
features Ecto cannot express directly, especially ParadeDB index creation,
tokenizer casts, BM25 operators, and scoring.