# Performance

How fast is exgit at what it's designed for — agent workflows that
lazy-clone a repo, prefetch the trees, and do many reads (grep,
read_path, walk)?

**Short answer:** on `anomalyco/opencode` (4,645 files, ~30 MB
fetch pack), a cold clone + prefetch completes in **~7 s** and
steady-state `grep` runs in **130–160 ms** for literal patterns
(Boyer-Moore) or **~335 ms** for case-insensitive regex. On the
436-file `adafruit/Adafruit_CircuitPython_Bundle`, steady-state
`grep` runs in **~2 ms**.

## TL;DR

| Fixture | Files | Fetch pack | Clone + prefetch | Grep (literal) | Grep (regex/ci) |
|---|---:|---:|---:|---:|---:|
| [`ivarvong/pyex`](https://github.com/ivarvong/pyex) | 275 | 1.2 MB | ~700 ms | ~11 ms | ~11 ms |
| [`cloudflare/agents`](https://github.com/cloudflare/agents) | 1,418 | 4 MB | ~8 s | ~58 ms | ~58 ms |
| [`anomalyco/opencode`](https://github.com/anomalyco/opencode) | 4,645 | ~30 MB | **~7 s** | **~140 ms** | **~335 ms** |
| [`adafruit/Adafruit_CircuitPython_Bundle`](https://github.com/adafruit/Adafruit_CircuitPython_Bundle) | 436 | ~1 MB | ~2 s | **~2 ms** | ~15 ms |

"Clone + prefetch" = `Exgit.clone(url, lazy: true)` +
`Exgit.FS.prefetch(repo, "HEAD", blobs: true)`.

"Grep (literal)" = case-sensitive, no regex metacharacters;
routes through `:binary.matches` (Boyer-Moore). "Grep (regex/ci)"
= case-insensitive `%Regex{}` scan.

Both grep numbers are steady-state (warm CPU, all objects in the
Memory store). The first grep after prefetch is the same cost —
unlike an older lazy-fetch path, prefetch now pre-populates
everything.

## What we measure

Four fixtures, each a real public GitHub repo. Picked to cover a
small-to-large size range with repos that have submodules, many
binary assets, and diverse layouts:

- `ivarvong/pyex` — **owned by the exgit maintainer**. Guaranteed
  against surprise force-push. Small (275 files), good for
  validating algorithmic baselines.
- `cloudflare/agents` — ~1.4k files. Medium real-world project.
- `anomalyco/opencode` — ~4.6k files, 26 MB blob in the pack. Large.
- `adafruit/Adafruit_CircuitPython_Bundle` — 436 files, uses git
  submodules (`.gitmodules` present). Previously crashed on prefetch
  due to an over-eager reserved-name check. Included as a regression
  fixture.

The benchmark harness (`bench/review_bench.exs`) does:

1. `Exgit.clone(url, lazy: true)` — refs only, no objects.
2. `Exgit.FS.prefetch(repo, "HEAD", blobs: true)` — stream the full
   blob pack directly into the Memory store via `Pack.StreamParser`.
3. **One "cold" grep**: the first grep after prefetch. It costs the same as a warm one, because prefetch has already populated every object.
4. **Five "warm" greps** — report the median.
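
The measurement loop in steps 3 and 4 could be sketched roughly as below. This is not the code in `bench/review_bench.exs`; `BenchSketch`, `time_ms/1`, `median/1`, and `run/2` are hypothetical names, and the timed function stands in for a real grep call.

```elixir
defmodule BenchSketch do
  # Wall-clock time of a zero-arity function, in milliseconds.
  def time_ms(fun) when is_function(fun, 0) do
    {micros, _result} = :timer.tc(fun)
    micros / 1000
  end

  # Median of a non-empty list of numbers.
  def median(samples) when is_list(samples) and samples != [] do
    sorted = Enum.sort(samples)
    n = length(sorted)
    mid = div(n, 2)

    if rem(n, 2) == 1 do
      Enum.at(sorted, mid)
    else
      (Enum.at(sorted, mid - 1) + Enum.at(sorted, mid)) / 2
    end
  end

  # One "cold" run, then `warm_runs` warm runs; reports the warm median.
  def run(fun, warm_runs \\ 5) do
    cold = time_ms(fun)
    warm = for _ <- 1..warm_runs, do: time_ms(fun)
    %{cold_ms: cold, warm_median_ms: median(warm)}
  end
end

# Stand-in workload; a real harness would call the grep under test here.
BenchSketch.run(fn -> Enum.sum(1..100_000) end)
```

Reporting the median rather than the mean keeps a single slow run (for example, one hit by a GC pause) from skewing the figure.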

## Benchmarks

All numbers are medians measured on a MacBook over a home internet
connection. `transport.fetch` varies 30–50% run-to-run due to
network; `pack.stream_parse`, `fs.grep`, and `fs.walk` are stable.

### anomalyco/opencode (4,645 files)

```
Phase                                    measured
--------------------------------------------------------------
clone(url, lazy: true)                   0.26 s
prefetch(blobs: true)                    6.4 s    ← streaming parser
grep "scd"             literal           158 ms
grep "TODO"            literal           130 ms
grep "useState"        literal           138 ms
grep "export default"  literal           132 ms
grep "anthropic"       regex/ci          333 ms
```

**Grep phase breakdown (4,645 blobs, 82 MB raw text; shares are relative to the ~333 ms regex/ci grep, and the two scan rows are alternatives):**

| Phase | Time | Share |
|---|---:|---:|
| Tree walk | 13 ms | 4% |
| `zlib.uncompress` × 4,645 blobs | 140 ms | 43% |
| Boyer-Moore scan (literal) | ~3 ms | 1% |
| PCRE scan (regex/ci) | ~150 ms | 46% |
| Line lookup + result alloc | ~15–65 ms | ~10% |

Literal patterns spend almost no time in the scan phase; the
bottleneck is `zlib.uncompress` in the Memory store. Case-insensitive
regex pays both the decompress and a slower PCRE scan.
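
To make the phases concrete, here is a minimal single-blob sketch using only OTP's `:zlib` and `:binary` modules. It is an illustration, not exgit's implementation: `GrepPhases` and `grep_blob/2` are hypothetical names, and the real line lookup is likely more efficient than the linear count used here.

```elixir
defmodule GrepPhases do
  # Decompress one stored blob, scan it for a literal, and map each
  # match offset to a 1-based line number.
  def grep_blob(compressed, literal) do
    # Decompress phase
    raw = :zlib.uncompress(compressed)

    # Scan phase: byte offsets of every literal match
    offsets = for {pos, _len} <- :binary.matches(raw, literal), do: pos

    # Line lookup phase: offsets of every newline, then count per match
    newlines = for {pos, _len} <- :binary.matches(raw, "\n"), do: pos
    Enum.map(offsets, &line_number(&1, newlines))
  end

  # Line number = 1 + number of newlines that precede the match offset.
  defp line_number(offset, newlines) do
    1 + Enum.count(newlines, &(&1 < offset))
  end
end

blob = :zlib.compress("alpha\nTODO one\nbeta\nTODO two\n")
GrepPhases.grep_blob(blob, "TODO")
# => [2, 4]
```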

### adafruit/Adafruit_CircuitPython_Bundle (436 files)

```
Phase                                    measured
--------------------------------------------------------------
clone(url, lazy: true)                   1.16 s
prefetch(blobs: true)                    0.65 s
grep "scd"             literal           2.2 ms   (14 hits)
grep "scd"             literal           2.5 ms
grep "scd"             literal           2.3 ms
```

436 files fit entirely in L3 cache after one prefetch pass;
Boyer-Moore through the full repo takes 2 ms and barely registers.

### Scaling

| Files | Grep / literal (ms) | Per-file (µs) |
|---:|---:|---:|
| 275 | 11 | 40 |
| 436 | 2 | 5 |
| 1,418 | 58 | 41 |
| 4,645 | 140 | 30 |

Per-file cost on opencode is **lower** than on the smaller repos
because its blobs are larger, which means fewer `zlib.uncompress`
calls per byte scanned. On adafruit it is 5 µs because the
compressed blobs stay warm in CPU cache after the first grep.
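
The per-file column is plain arithmetic on the first two columns. A short check, using the table's rounded values:

```elixir
# {files, literal grep in ms}, copied from the scaling table
rows = [{275, 11}, {436, 2}, {1_418, 58}, {4_645, 140}]

per_file_us =
  for {files, grep_ms} <- rows do
    # ms -> µs, divided by file count, rounded to the nearest µs
    {files, round(grep_ms * 1000 / files)}
  end

# => [{275, 40}, {436, 5}, {1418, 41}, {4645, 30}]
```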

## Architecture: end-to-end streaming pipeline

The biggest structural change since the original benchmarks is the
replacement of the buffered pack pipeline with a fully streaming
one. The old shape:

```
HTTP response → full binary in heap
              → Pack.Reader.parse (binary + resolved objects in heap simultaneously)
              → import_objects (another copy into the store)
```

Peak memory for opencode's 135 MB pack: ~400 MB (pack binary +
decoded object list + compressed store).

The new shape:

```
HTTP chunks → PktLine.Decoder → sideband demux
           → Pack.StreamParser.ingest/2 (one chunk at a time)
                ├── type/size header decode
                ├── zlib inflate port (open across ingest calls)
                ├── streaming deflate → ObjectStore directly
                └── OFS/REF delta resolved through store
           → StreamParser.finalize/1 (checksum verify)
```

Peak memory: one HTTP chunk (~4 KB) + one object's compressed
bytes in the write handle + the compressed store. The pack binary
never exists as a whole.

**opencode prefetch: 57 s → 6.4 s** — most of the 57 s was the
old Pack.Reader holding 135 MB of binary and the object list
simultaneously, triggering multiple major GC cycles. The streaming
parser never triggers that pressure.
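
The core idea, a zlib handle that stays open while input arrives in pieces, can be sketched with OTP's `:zlib` module alone. This is a simplified illustration and not `Pack.StreamParser`: `StreamInflate`, `inflate_chunks/1`, and `chunk/2` are hypothetical names, and the sketch collects the output into one binary where the real pipeline is described as writing it to the object store as it arrives.

```elixir
defmodule StreamInflate do
  # Inflate a zlib stream delivered as a list of arbitrarily sized chunks.
  # The handle stays open across chunks, so the compressed input never
  # has to exist as a single binary.
  def inflate_chunks(chunks) do
    z = :zlib.open()
    :ok = :zlib.inflateInit(z)

    try do
      out = Enum.map(chunks, fn chunk -> :zlib.inflate(z, chunk) end)
      # inflateEnd raises if the stream was truncated.
      :ok = :zlib.inflateEnd(z)
      IO.iodata_to_binary(out)
    after
      :zlib.close(z)
    end
  end

  # Split a binary into chunks of at most `size` bytes.
  def chunk(bin, size) when size > 0 and byte_size(bin) <= size, do: [bin]

  def chunk(bin, size) when size > 0 do
    <<head::binary-size(size), rest::binary>> = bin
    [head | chunk(rest, size)]
  end
end

data = String.duplicate("exgit streaming ", 1_000)
chunks = StreamInflate.chunk(:zlib.compress(data), 7)
StreamInflate.inflate_chunks(chunks) == data
# => true
```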

### Adversarial hardening in the parser

`Pack.StreamParser.new/2` accepts limits enforced per-object
during the streaming parse:

```
max_object_bytes:   100 MB   — rejects before allocating
max_inflate_ratio:  1000×    — zip-bomb defence (raw/compressed ratio)
max_delta_depth:    50       — OFS/REF delta chain cap (same as git)
max_objects:        10 M     — rejects absurd pack headers
deadline:           nil      — monotonic cutoff; returns :deadline_exceeded
```

These fire during streaming, not as a post-parse check, so a
hostile pack stops consuming CPU/memory immediately.
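
A per-object check of this kind might look like the sketch below. Everything here is an assumption made for illustration: the `LimitSketch` module, the shape of the `progress` map, the error atoms other than `:deadline_exceeded`, and the reading of "100 MB" as 100 MiB. The real parser is described as rejecting oversized objects before allocating, which this running-total check does not model.

```elixir
defmodule LimitSketch do
  @defaults %{
    max_object_bytes: 100 * 1024 * 1024,
    max_inflate_ratio: 1000,
    deadline: nil
  }

  # `progress` holds running totals for the object being inflated:
  # bytes of compressed input consumed and bytes of raw output produced.
  def check(progress, limits \\ %{}) do
    limits = Map.merge(@defaults, limits)
    %{compressed_in: cin, raw_out: rout} = progress

    cond do
      rout > limits.max_object_bytes -> {:error, :object_too_large}
      cin > 0 and rout / cin > limits.max_inflate_ratio -> {:error, :inflate_ratio_exceeded}
      expired?(limits.deadline) -> {:error, :deadline_exceeded}
      true -> :ok
    end
  end

  # The deadline is a monotonic timestamp in milliseconds, or nil for none.
  defp expired?(nil), do: false
  defp expired?(deadline_ms), do: System.monotonic_time(:millisecond) > deadline_ms
end

LimitSketch.check(%{compressed_in: 10, raw_out: 20_000})
# => {:error, :inflate_ratio_exceeded}
```

Calling a check like this after every inflated chunk is what lets a hostile object be rejected partway through, rather than after it has been fully expanded.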

## Grep: literal pattern fast path

`FS.grep/4` and `FS.multi_grep/4` detect case-sensitive literal
patterns (no PCRE metacharacters) at compile time and route them
through `:binary.matches` (Boyer-Moore, implemented natively in the
BEAM runtime) instead of `Regex.scan`:

```elixir
# case-sensitive, no metacharacters → :binary.matches (9.5× faster)
FS.grep(repo, "HEAD", "useState")

# case-insensitive or metacharacters → Regex.scan
FS.grep(repo, "HEAD", "useState", case_insensitive: true)
FS.grep(repo, "HEAD", "use.*State")
```

Measured on 7.4 MB of synthetic text:

| Engine | Time | Relative to literal |
|---|---:|---:|
| `:binary.matches` (literal) | 8.6 ms | **1×** (baseline) |
| `Regex.scan` (literal regex) | 82 ms | 9.5× slower |
| `Regex.scan` (ci regex) | >10 s | >>100× slower at high hit density |

For typical code-search patterns (function names, import paths,
identifiers), the literal path is the default. Most agent queries
hit it without any caller changes.
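
The routing decision can be sketched as follows. This is an assumption about the shape of the check, not exgit's code: `GrepRoute`, `literal?/2`, `scan/3`, and the exact metacharacter list are hypothetical.

```elixir
defmodule GrepRoute do
  # Characters that give a pattern regex meaning (assumed list).
  @metachars ["\\", "^", "$", ".", "[", "]", "|", "(", ")", "?", "*", "+", "{", "}"]

  # True when the pattern can be matched byte-for-byte.
  def literal?(pattern, opts \\ []) do
    not Keyword.get(opts, :case_insensitive, false) and
      not String.contains?(pattern, @metachars)
  end

  # Returns [{byte_offset, length}] for every match in `text`.
  def scan(text, pattern, opts \\ []) do
    if literal?(pattern, opts) do
      # Fast path: plain substring search
      :binary.matches(text, pattern)
    else
      # Slow path: compile a regex and keep the whole-match index
      flags = if Keyword.get(opts, :case_insensitive, false), do: [:caseless], else: []
      regex = Regex.compile!(pattern, flags)
      regex |> Regex.scan(text, return: :index) |> Enum.map(&hd/1)
    end
  end
end

text = "const [a] = useState(0)\nconst [b] = usestate(1)\n"
GrepRoute.scan(text, "useState")
# => [{12, 8}]
GrepRoute.scan(text, "useState", case_insensitive: true)
# => [{12, 8}, {36, 8}]
```

Both paths return the same `{offset, length}` shape, so the caller does not need to know which engine ran.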

## Parallelism: still a net loss

An earlier attempt parallelized `FS.grep` across blobs via
`Task.async_stream`. Result on opencode:

```
sequential (default):   340 ms
parallel (16 workers):  1550 ms   ← 4.5× SLOWER
```

The cause: `zlib.uncompress` is a regular (non-dirty) NIF. Sixteen
concurrent calls, each allocating large binaries, create severe GC
pressure: the 74 MB of heap allocated per grep is spread across 16
processes at once, which fragments memory and triggers stop-the-world
GC. The sequential path avoids this because each blob's bytes are
allocated, used, and collected before the next blob is touched.

`max_concurrency: :schedulers` remains available for callers with
workloads where per-file work is substantial (large blobs, I/O-bound
stores). For typical code search on a Memory-backed repo, leave it
at the default of `1`.
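
For readers who want to reproduce the comparison on their own data, the two shapes look roughly like this. It is a stand-alone sketch over a list of zlib-compressed binaries, not `FS.grep` itself; `GrepConcurrency` and its functions are hypothetical names.

```elixir
defmodule GrepConcurrency do
  # Count literal hits across compressed blobs, one blob at a time.
  def sequential(blobs, literal) do
    blobs |> Enum.map(&count_hits(&1, literal)) |> Enum.sum()
  end

  # The same work fanned out across processes with Task.async_stream.
  def parallel(blobs, literal, workers \\ System.schedulers_online()) do
    blobs
    |> Task.async_stream(&count_hits(&1, literal), max_concurrency: workers, ordered: false)
    |> Enum.reduce(0, fn {:ok, n}, acc -> acc + n end)
  end

  # Decompress, scan, and discard: the raw bytes are garbage after each call.
  defp count_hits(compressed, literal) do
    compressed |> :zlib.uncompress() |> :binary.matches(literal) |> length()
  end
end

blobs = for i <- 1..50, do: :zlib.compress(String.duplicate("line #{i} TODO\n", 20))
GrepConcurrency.sequential(blobs, "TODO")
# => 1000
GrepConcurrency.parallel(blobs, "TODO", 4)
# => 1000
```

Wrapping each call in `:timer.tc/1` shows which variant wins for a given blob size and count. The numbers above suggest the sequential one does for typical source files.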

## Bug fixes in this cycle

### `.gitmodules` blocked legitimate repos

`Tree.decode/1` was rejecting `.gitmodules` as a reserved entry
name, treating it the same as `.git` (CVE-2014-9390 class). The
comment even noted it was pre-emptive: "URL-injection vector for
submodule handling *if/when we add submodules*."

The consequence: any repo that uses git submodules — including
`adafruit/Adafruit_CircuitPython_Bundle` — crashed on prefetch
with `{:tree_entry_name_reserved, ".gitmodules"}`.

**Fix:** `.gitmodules` is now accepted. The URL-injection concern
only applies if we process submodule URLs, which exgit does not.
`.git` remains rejected (CVE-2014-9390 is real on case-insensitive
filesystems even for read-only clients).
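
The resulting rule can be sketched as below. `TreeNameSketch.check_name/1` is a hypothetical stand-in for the check inside `Tree.decode/1`, and the real check may cover more variants of `.git` than the simple case-folding shown here. Only the error tuple shape is taken from the crash message above.

```elixir
defmodule TreeNameSketch do
  # Reject ".git" in any letter case; accept every other name,
  # including ".gitmodules".
  def check_name(name) when is_binary(name) do
    if String.downcase(name) == ".git" do
      {:error, {:tree_entry_name_reserved, name}}
    else
      :ok
    end
  end
end

TreeNameSketch.check_name(".gitmodules")
# => :ok
TreeNameSketch.check_name(".GIT")
# => {:error, {:tree_entry_name_reserved, ".GIT"}}
```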

### Earlier bugs (still in history)

Three compounding bugs in the original hot path are documented here
for historical context (fixes landed in commit `550100d`):

1. **`FS.walk` discarded the updated repo** after `resolve_tree`,
   re-fetching the same commit from GitHub on every `walk` call.
   7.7 s → 2 ms on cloudflare/agents.

2. **Promisor cache accounting counted decompressed bytes** while
   the store held compressed bytes; eviction fired 3–10× too early
   and dropped commits that were immediately needed. Fixed by
   tracking compressed sizes.

3. **`:max_resolved_bytes` default of 500 MiB** rejected
   opencode's ~524 MiB resolved set. Raised to 2 GiB.
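
The accounting mismatch in bug 2 is easy to see in miniature. The sketch below is a toy byte-budgeted map, not the Promisor cache. `CacheAccounting` and its fields are hypothetical, and it ignores replacement of an existing key.

```elixir
defmodule CacheAccounting do
  # Store the compressed bytes and charge the budget for what is
  # actually held, not for the decompressed size.
  def put(%{entries: entries, used: used} = cache, key, raw) do
    compressed = :zlib.compress(raw)
    %{cache | entries: Map.put(entries, key, compressed), used: used + byte_size(compressed)}
  end

  def over_budget?(%{used: used, budget: budget}), do: used > budget
end

cache = %{entries: %{}, used: 0, budget: 10_000}
raw = String.duplicate("a", 100_000)
cache = CacheAccounting.put(cache, "sha1", raw)

CacheAccounting.over_budget?(cache)
# => false (charging byte_size(raw) instead would have blown the budget)
```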

## Optimizations that matter (shipped)

In order of impact:

1. **Streaming pack parser** (`Pack.StreamParser`) — replaces the
   buffered `Pack.Reader` in all fetch/prefetch paths. Eliminates
   the O(pack_size) binary + object list from the heap; bounded to
   one chunk + one object at a time. opencode prefetch: 57 s → 6.4 s.

2. **Streaming object-store writes** — `open_write/write_chunk/close_write`
   protocol on `ObjectStore`; Memory and Disk stores stream
   compressed output as inflate output arrives. Raw content never
   coexists with compressed form in the heap.

3. **Walk state threading** — updated repo threaded through the
   walk `Stream.resource` state, eliminating per-walk network
   fetches on lazy repos. 3,800× faster on cloudflare/agents.

4. **Literal grep fast path** — `:binary.matches` (Boyer-Moore)
   for case-sensitive literal patterns. 9.5× faster scan per blob;
   visible at adafruit scale (2 ms grep) and meaningful at opencode
   scale (dominant cost shifts to `zlib.uncompress`, not scan).

5. **Adler32 probe for pack zlib tracking** — finds the end of each
   zlib stream in O(1) instead of O(log N) binary-search probes.
   2.6× faster `Pack.Reader.parse` (still used for Disk store
   random-access lookups).

6. **Sequential grep as default** — avoids Task.async_stream GC
   pressure on typical workloads.

## What we're not doing

- **Decompressed-blob cache.** The 140 ms `zlib.uncompress` tax is
  paid on every grep call. A `repo.blob_cache: %{sha => binary}`
  field on the Repository struct, populated by a `FS.warm/2` call,
  would reduce repeated greps to near-zero. The design is correct
  (state on the struct, caller opts in, GC'd with the repo) but
  deferred until a measured workload asks for it. We explicitly
  ruled out ETS, Process dictionary, and persistent_term — any
  cache must be caller-visible and scoped to the repo value.

- **NIF-based zlib / libdeflate.** Would reduce `zlib.uncompress`
  cost 3–5×, taking the 140 ms down to ~30 ms. Undercuts the
  "pure Elixir, no NIFs" positioning; not doing this without a
  concrete workload and a clear tradeoff decision.

- **Parallel pack parsing.** OFS_DELTA chains impose a sequential
  dependency (base must precede delta in the forward walk). A
  two-pass design could unlock parallelism for the inflate phase;
  left for when a workload demonstrates the need.

- **Chunked parallel grep.** Per-task `Task.async_stream` at file
  granularity is net-negative (4.5× slower). A chunked variant
  batching 200–500 files per task would amortize spawn overhead and
  likely win on 10k+ file repos. Needs a measured workload.
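
For the first item, the deferred design (cache on the struct, caller opts in, collected with the repo value) might look like the sketch below. None of this exists in exgit: `BlobCacheSketch` stands in for the Repository struct, `store` is a plain map of `sha => compressed`, and `warm/1` and `read/2` are hypothetical.

```elixir
defmodule BlobCacheSketch do
  defstruct store: %{}, blob_cache: %{}

  # Decompress every blob once and keep the result on the struct.
  # The caller holds the returned value; nothing is stored globally.
  def warm(%__MODULE__{store: store} = repo) do
    cache = Map.new(store, fn {sha, compressed} -> {sha, :zlib.uncompress(compressed)} end)
    %{repo | blob_cache: cache}
  end

  # Prefer the warmed copy; fall back to decompressing on demand.
  def read(%__MODULE__{} = repo, sha) do
    case repo.blob_cache do
      %{^sha => raw} -> raw
      _ -> :zlib.uncompress(Map.fetch!(repo.store, sha))
    end
  end
end

repo = %BlobCacheSketch{store: %{"abc" => :zlib.compress("hello\n")}}
warmed = BlobCacheSketch.warm(repo)
BlobCacheSketch.read(warmed, "abc")
# => "hello\n"
```

Because the cache lives on the value returned by `warm/1`, dropping that value drops the cache, which matches the requirement that any cache be caller-visible and scoped to the repo.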

## Running the benchmark yourself

```sh
# Clone + prefetch + grep workflow (all fixtures, 30 runs each)
mix run bench/review_bench.exs

# Filter to one fixture
mix run bench/review_bench.exs 10 opencode

# Local pack parse: StreamParser vs Pack.Reader head-to-head
# (requires local opencode .git pack files)
mix run bench/local_pack_eval.exs

# Pack parse scaling (synthetic, no network)
mix run bench/pack_parse_bench.exs

# Agent-session simulation: multi_grep + grep+context + blame + read_lines
mix run bench/agent_session_bench.exs
```

## Memory model summary

| Component | Bound |
|---|---|
| HTTP transport | One pkt-line per ingest chunk |
| Pack buffer | One object's compressed bytes |
| In-flight inflate | O(zlib_window) per chunk |
| Streaming write handle | O(compressed output chunks) |
| offset_to_sha map | ~35 bytes × N objects |
| sha_to_depth map | ~30 bytes × N objects |
| raw_cache (delta resolution) | 64 MB budget (plain map in StreamParser state) |
| Object store (Memory) | All objects compressed — inherent minimum |

The object store is the floor: if you fetch a 135 MB pack and store
it in a Memory backend, you'll hold however many bytes the compressed
objects take. Exgit does not add overhead on top of that minimum.
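
The two per-object maps are the only rows that grow with object count on top of the store itself, so their cost is easy to estimate from the table's approximate per-entry sizes. `IndexMemory` is a hypothetical helper for the arithmetic only.

```elixir
defmodule IndexMemory do
  # Approximate per-entry sizes from the table above
  @offset_to_sha_bytes 35
  @sha_to_depth_bytes 30

  # Approximate bytes held by the two per-object maps for `n` objects.
  def bytes(n) when is_integer(n) and n >= 0 do
    n * (@offset_to_sha_bytes + @sha_to_depth_bytes)
  end
end

IndexMemory.bytes(100_000)
# => 6_500_000 (about 6.5 MB)
IndexMemory.bytes(10_000_000)
# => 650_000_000 (about 650 MB at the max_objects cap)
```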

## Correctness oracle

`FS.grep` output is validated against `git grep` via
`test/exgit/fs_grep_git_parity_test.exs`. The test builds a small
real-git repo, runs both `git grep -n` and `Exgit.FS.grep` against
a set of representative patterns, and asserts the two agree on the
`(path, line_number)` match set. Tagged `:real_git` and `:slow`.
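
The comparison step can be sketched as follows. This is not the code in the test file: `GrepParity` is a hypothetical helper, the `:path` and `:line_number` keys on our side are an assumed result shape, and the parser assumes paths without a `:` in them.

```elixir
defmodule GrepParity do
  # Parse `git grep -n` output ("path:line:content" per line) into a
  # MapSet of {path, line_number}.
  def parse_git_grep(output) do
    output
    |> String.split("\n", trim: true)
    |> MapSet.new(fn line ->
      [path, line_no, _content] = String.split(line, ":", parts: 3)
      {path, String.to_integer(line_no)}
    end)
  end

  # `matches` is a list of maps with :path and :line_number keys.
  def same_matches?(git_output, matches) do
    ours = MapSet.new(matches, &{&1.path, &1.line_number})
    MapSet.equal?(parse_git_grep(git_output), ours)
  end
end

git_output = "lib/a.ex:3:  # TODO fix\nlib/b.ex:10:TODO: later\n"

GrepParity.same_matches?(git_output, [
  %{path: "lib/b.ex", line_number: 10},
  %{path: "lib/a.ex", line_number: 3}
])
# => true
```

Comparing sets rather than lists keeps the assertion independent of the order in which either tool reports matches.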

## History

See [`CHANGELOG.md`](../CHANGELOG.md) for the feature-level history.
Key perf commits:

- Streaming pack parser, streaming writes, literal grep, `.gitmodules` fix — current PR
- `550100d` — walk state threading; cache accounting fix; Adler32 probe
- `9bb1256` — partial clone haves bug fix
- `8678b0d` — initial Adler32 probe; code-quality gates