PERFORMANCE.md

# gen_durable — performance

Performance is a first-class constraint for this engine, not an afterthought. This
document is the cost model: what one step costs, the EXPLAIN plans of every hot-path
statement, the indexes that back them, throughput estimates, and the known limits.

All `EXPLAIN (ANALYZE, BUFFERS)` output below is **real**, captured on **Postgres 17**
(the devcontainer image) against a seeded dataset of **1,000,000 `gen_durable` rows**
(295k `runnable`, 5k `executing`, 700k terminal) and **50k `signals`**, over a local
Unix socket. Reproduction steps are at the end. Numbers are warm-cache; absolute ms
will differ on your hardware, but the *plan shapes* and *row counts* are what matter.

---

## 1. The cost model of one step

A step runs **between two short database transactions**, with the user's code in the
middle, outside any transaction (see the architecture notes in `gen_durable_plan.md`):

```
 pick (claim+lease)         user step/2 (no DB, no txn)        outcome (commit)
 ───────────────────►  ····························  ───────────────────►
 TX1: short, batched        side effects live here            TX2: short
 amortized over the batch    (idempotency is yours)           consume + update
```

The database work per step, and where it goes:

| Phase | Statements | Round-trips | Notes |
|---|---|---|---|
| **pick** | 1 (`UPDATE … RETURNING` over a window-dedup claim) | ~`1/B` per step | one pick claims a batch of `B`; the feeder amortizes it |
| **load signals** | 1 `SELECT` | 1 | always-on today; gated by an FSM flag is the remaining F4 step |
| **load childs** | 1 `SELECT` | 1 | same |
| **user `step/2`** | 0 | 0 | runs outside any transaction |
| **outcome** | 1 CTE (`consumed` DELETE + outcome UPDATE [+ parent-join]) | **1** | folded from a 4–5-stmt transaction into one statement (F4) |

So a plain `:next` step is now **~3 round-trips** (2 loads + 1 outcome), with the pick
amortized to near-zero by batching. The outcome used to be a 4-statement
`BEGIN`/consume/`UPDATE`/`COMMIT` transaction (+1 for `:done`/`:stop`'s parent-join);
folding it into a single data-modifying CTE made it **1** — measured at exactly one
statement (`test/perf_test.exs`), and consistently faster than the transaction form on
wall-clock (the round-trips are the cost). Gating the two loads behind an FSM capability
flag is the last F4 step → **~1 round-trip/step**. See §7.

Round-trips, not query execution time, dominate: every statement below executes in
**well under 1 ms** at the database, but each client↔Postgres hop is a network round-trip
(~0.3–1 ms across hosts). The cost is the count of hops, which is why collapsing the
outcome from 4 hops to 1 matters more than any single statement's plan.

---

## 2. The picker (the one query that must scale)

The picker runs on every poll and every completion-driven refill, so it is the query
that most has to stay cheap as the table grows. It does three things in one statement:
claim a batch atomically, serialize partitions, and dedup partition keys (spec §6).

### 2.1 The decisive detail: `queue = $1`, never `ANY`

Each scheduler owns exactly one queue, so the picker filters `queue = $1` (equality).
This is **not cosmetic** — it is what lets the `gen_durable_pick (queue, priority,
eligible_at) WHERE status='runnable'` index supply rows already ordered, so the `LIMIT`
stops after a handful. With `queue = ANY($1)` the planner cannot trust the index order
and falls back to **scanning the entire runnable set and top-N sorting it**:

```
-- queue = ANY('{default}')  ❌
->  Index Scan using gen_durable_pick on gen_durable g_2 (rows=280000)   ← whole runnable set
      Filter: ((partition_key IS NULL) OR (NOT ...))
->  Sort  Sort Method: top-N heapsort
Execution Time: 613.578 ms
```

```
-- queue = 'default'  ✅
->  Limit (rows=50)
      ->  Index Scan using gen_durable_pick on gen_durable g_2 (rows=50)  ← stops at the batch
            Index Cond: ((queue = 'default') AND (eligible_at <= now()))
Execution Time: 0.725 ms
```

**~850× faster, and O(batch) instead of O(runnable backlog).** Same data, one operator.

### 2.2 Full plan of the picker (`batch = 50`)

The picker is the canonical Postgres claim — one `SELECT … FOR UPDATE SKIP LOCKED LIMIT`,
then one `UPDATE` — with the partition dedup folded in as a window function over the
locked set, so there is **exactly one nested loop** (the `UPDATE` join):

```
Update on gen_durable g (actual time=0.400..0.983 rows=50 loops=1)
  Buffers: shared hit=868
  CTE locked
    ->  WindowAgg (rows=50)                                     ← dedup: row_number() per key
          ->  Sort  Sort Key: COALESCE(partition_key, id::text), priority, eligible_at  (27kB)
                ->  Limit (rows=50)
                      ->  LockRows (rows=50)                    ← FOR UPDATE SKIP LOCKED, in-scan
                            ->  Index Scan using gen_durable_pick (rows=50)
                                  Index Cond: ((queue = 'default') AND (eligible_at <= now()))
                                  Filter: ((partition_key IS NULL) OR (NOT (… hashed SubPlan 2)))
                                  SubPlan 2
                                    ->  Index Scan using gen_durable_lease (never executed)
  ->  Nested Loop (rows=50)                                     ← the one, optimal join
        ->  CTE Scan on locked l (rows=50)  Filter: (rn = 1)
        ->  Index Scan using gen_durable_pkey on gen_durable g (rows=1 loops=50)
Planning Time: 0.582 ms
Execution Time: 1.122 ms
```

What to read here:

- **`gen_durable_pick` index scan stops at the `LIMIT`** — the scan touches ~`batch`
  rows, not the 295k runnable rows. This is the whole game.
- **The lock happens in that one scan** (`LockRows`), and `row_number()` dedups the
  *already-locked* set — so there is no separate re-lock pass. The earlier design
  (`DISTINCT ON` → re-lock → update) had a second nested loop that scaled per-row with
  the batch; folding the dedup into a window function removed it (measured −19% at
  batch 5000, §2.4).
- **`SubPlan 2` (the partition `NOT EXISTS`) was `never executed`** here because the top
  50 rows were non-partitioned (`partition_key IS NULL` short-circuits the guard). A
  non-partitioned queue pays **nothing** for the partition machinery. When partitioned
  rows *are* in the window, the guard probes `gen_durable_partition_active` — a partial
  index over only the executing rows with a **non-null** `partition_key`, so a
  non-partitioned claim never even writes to it.
- **The single `Nested Loop` is the `UPDATE` join, and it is optimal** — outer = `batch`
  rows, inner = one primary-key point lookup each (`loops=50, rows=1`). That is O(batch)
  point updates, the textbook-best way to update N rows by id; see §2.5 for the proof
  that forcing it off is ~10× slower. The per-key losers (`rn > 1`) were locked but not
  updated, so they stay `runnable` and their lock releases at commit — no advisory bounce.

### 2.3 The bounded-window trade-off

The scan is `LIMIT $2` (the batch size). A cluster of same-key rows filling that window
dedups down to one, so a single pick can return fewer than `batch` — completion-driven
refill closes the gap on the next pick. This keeps the hot path index-cheap at the cost
of a little fill latency under heavy single-key load. See §6 for the degenerate case.

### 2.4 Large batches and the per-row floor

Crank `prefetch` and the pick claims a big batch in one statement. The cost is **linear
in the batch** — every component (index scan, dedup sort, lock, the `UPDATE`) is per-row:

| batch | pick (median of 5, warm) | per row |
|---|---|---|
| 50 | ~1.1 ms | ~22 µs |
| 100 | ~1.9 ms | ~19 µs |
| 1000 | ~16 ms | ~16 µs |
| 5000 | ~57 ms | ~11 µs |

The fixed cost (planning + the `NOT EXISTS` hash build over the executing set) amortizes
away with bigger batches; the **~11–16 µs/row marginal cost does not**. That floor is the
actual work of *claiming*: flipping each row `runnable → executing` is a heap write plus
moving the row out of the `runnable` partial index and into the two `executing` partial
indexes (`lease`, `partition_active`) — plus WAL for all of it. **You cannot claim a row
without writing it**, so ~11–16 µs/row is close to the floor for this schema.

Two consequences worth designing around:

- A pick of 5000 is a **70 ms synchronous statement** in the scheduler GenServer and holds
  `FOR UPDATE` over 5000 rows. Past ~a few hundred, batch size buys no amortization (the
  fixed cost is already gone) and only adds blocking. If you push `prefetch` very high,
  prefer many medium picks over one giant one.
- Per-row cost is **aggregate DB CPU**: at 10k steps/s that floor is ~0.15 s/s of CPU on
  picking alone. The way to cut it is fewer, larger logical steps — not a faster pick.

### 2.5 The one nested loop is optimal (proof)

The `UPDATE` join is a `Nested Loop` and that is exactly right. "Update these N rows by id"
wants N primary-key point lookups — a nested loop with the PK index on the inner side,
O(batch). It only looks scary when the inner side is *unindexed* (then it is O(N·M)); here
it is `loops=batch, rows=1` against `gen_durable_pkey`. Forcing the planner off it
(`SET enable_nestloop = off`) makes it fall back to a hash join that **Seq Scans the whole
million-row table** to find the batch:

```
->  Hash Join  (Hash Cond: g.id = l.id)
      ->  Seq Scan on gen_durable g  (rows=1000000)     ← scans everything to find `batch` rows
```

| batch | nested loop (default) | forced off → hash join + seq scan |
|---|---|---|
| 1000 | 17 ms | 160 ms |
| 5000 | 56 ms | 195 ms |

~10× slower. The nested loop is the canonical Postgres-queue claim, and it is the floor:
`UPDATE` cannot `LIMIT`, so the claim must be `SELECT … FOR UPDATE … LIMIT` then `UPDATE`
joined by id; and updating N rows means writing N rows regardless of how they are reached.

### 2.6 What did *not* help (measured, rejected)

Restructurings prototyped against the seeded dataset and **rejected by measurement** —
recorded so they are not re-attempted blind:

- **Single-scan dedup** (correlated `NOT EXISTS` "I am the most-urgent runnable of my key"
  + `FOR UPDATE` in the scan, backed by a new `(partition_key, priority, eligible_at)`
  partial index). A wash on wall-clock and **worse at batch 5000** — the new partial index
  adds write amplification to every claim *and* every return-to-`runnable`. The window
  function over the locked set (the shipped picker) gets the same dedup with no new index.
- **`ctid`-join instead of `id`-join** for the `UPDATE` re-touch (TID scan instead of a PK
  descent). Cut **logical buffer hits ~28%** but **warm-cache wall-clock did not move**
  (17.0 vs 17.0 ms at 1000) and was **slower at 5000** (79 vs 70). Those buffers are
  nanosecond cache hits; time is dominated by heap writes + WAL, which `ctid` does not
  touch — consistent with §2.5 (the cost is the write, not the lookup).

**Lesson:** on a warm queue the picker is at its floor; fewer logical page touches do not
translate to time when they are all cache hits. The real lever is **round-trips per step**
(§7), not the pick query. Beware benchmarking on a bloated table — rolled-back
`EXPLAIN ANALYZE` runs accumulate dead tuples and inflate timings by ~20%; `VACUUM` and
take the median of several runs before believing a delta.

---

## 3. The outcome and point queries (all O(1) by primary key)

Every outcome and every signal/insert touches rows **by primary key or a covering index** —
constant work regardless of table size.

### The collapsed outcome (F4) — the plan, not the statement count

Each `complete_*` is one statement (§1): the signal consume rides as a leading `consumed`
CTE, and `:done`/`:stop` carry the parent-join as the main `UPDATE` after a `terminal` CTE.
Counting "one statement" only proves the *round-trip* — the plan has to prove the single
statement is *cheap*. Here is the heaviest outcome, `complete_done` on a **child** (all
three parts: consume + done + parent decrement), measured with **50,000 children present**
so the parent path is real, not a one-row fixture:

```
Update on gen_durable p                                          ← the parent decrement
  CTE consumed
    ->  Delete on signals s   (Index Scan signals_target → gen_durable_pkey)
  CTE terminal
    ->  Update on gen_durable  ->  Index Scan using gen_durable_pkey  (id = $1)
  ->  Nested Loop
        ->  Index Scan using gen_durable_pkey on c   (id = $1)            ← child by PK
        ->  Index Scan using gen_durable_pkey on p   (id = c.parent_id)   ← parent by PK
Execution Time: 0.268 ms
```

**Every node is a primary-key index scan — no `Seq Scan` on `gen_durable`**, even with 50k
rows in the parent index. The `c.id = $1` equality is the selective path so the planner
stays on the PK (a *one-row* fixture can fool it onto the partial parent index with an
`id` filter — harmless there, and it does not happen once stats are real). For a non-child
`:done` the parent-join is a clean PK no-op (`p` matches 0 rows).

And it is not only fewer round-trips: the single statement is **less DB execution time**
than the old three separate statements — median **~0.25 ms vs ~0.42 ms** (EXPLAIN ANALYZE,
300k-row table), before even counting the `BEGIN`/`COMMIT` round-trips the transaction form
also paid. Consume is by received id (spec §5): a progressing outcome deletes exactly the
`ctx.awaited` ids the step saw (`id = ANY($consumed)`, a PK lookup), so latecomers and
never-awaited signals survive; a terminal outcome drops the whole inbox (`target_id = $id`).

(The statement-count assertions in `test/perf_test.exs` guard the *round-trip* count; this
EXPLAIN is the *execution-cost* evidence. Different claims, both checked.)

### Point queries

**`deliver_signal` — wake by id+status+awaits:**
```
Update on gen_durable
  ->  Index Scan using gen_durable_pkey (rows=0)
        Filter: ((status = 'awaiting_signal') AND (awaits = 'go'))
Execution Time: 0.029 ms
```

**`insert` — single row, dedup via the partial unique index:**
```
Insert on gen_durable
  Conflict Arbiter Indexes: gen_durable_unique
  Tuples Inserted: 1
Execution Time: 0.468 ms
```

**`load_signals` — the inbox by target:**
```
Sort  Sort Key: id
  ->  Index Scan using signals_target on signals (rows=1)   Index Cond: (target_id = 700001)
Execution Time: 0.072 ms
```

All sub-millisecond, all index-driven. None of them grow with the table.

---

## 4. The reaper (proportional to *expired* rows only)

The reaper sweeps expired leases via the `gen_durable_lease (lease_expires_at) WHERE
status='executing'` partial index — it never scans live or terminal rows:

```
Update on gen_durable (rows=5000)
  ->  Index Scan using gen_durable_lease (rows=5000)
        Index Cond: (lease_expires_at < now())
        Filter: (status = 'executing')
Execution Time: 42.002 ms   ← 5000 rows reaped at once (mass-crash scenario)
```

The index scan itself is ~2 ms; the 42 ms is the heap update of 5000 rows. **In steady
state almost nothing is expired**, so a sweep finds 0 rows and costs an index probe. Cost
scales with the size of the *expired* set (a mass worker death), not the table — exactly
what you want.

---

## 5. Index coverage

Every hot statement is served by a partial or primary-key index; the big terminal-row
bulk (700k `done`/`failed`) sits in none of the partial indexes.

| Query | Index | Kind |
|---|---|---|
| `pick` scan | `gen_durable_pick (queue, priority, eligible_at) WHERE status='runnable'` | partial, ordered |
| `pick` partition guard | `gen_durable_partition_active (partition_key) WHERE status='executing' AND partition_key IS NOT NULL` | partial, index-only probe |
| `pick` lock / outcomes / signal | `gen_durable_pkey (id)` | primary key |
| `reap` | `gen_durable_lease (lease_expires_at) WHERE status='executing'` | partial |
| `insert` dedup | `gen_durable_unique (unique_guard) WHERE unique_guard IS NOT NULL` | partial unique |
| child join | `gen_durable_parent (parent_id) WHERE parent_id IS NOT NULL` | partial |
| `load_signals` | `signals_target (target_id, name)` | covering |

The partial predicates matter: `gen_durable_pick` indexes only the ~295k runnable rows,
not the 700k terminal ones, so the working set stays small as completed work accumulates.

---

## 6. Throughput model and known limits

### Round-trip-bound throughput

Query *execution* is sub-millisecond, so steady-state throughput is set by **round-trips
per step × round-trip latency**, divided across concurrency.

Let `R` = round-trips/step, `L` = client↔Postgres round-trip latency, `U` = user
`step/2` time. One serial chain does `1 / (R·L + U)` steps/s; with `C` concurrent workers
(pool and DB permitting), throughput ≈ `C / (R·L + U)`.

| | round-trips `R` | per-step DB latency at `L`=0.2 ms (local) | at `L`=1 ms (cloud) |
|---|---|---|---|
| today (2 loads + 4-stmt outcome txn) | ~6 | ~1.2 ms | ~6 ms |
| F4 (gate loads, single-CTE outcome) | ~1–2 | ~0.2–0.4 ms | ~1–2 ms |

So a trivial-step workload at `C = 50`, local, today: `50 / 1.2 ms ≈ 40k steps/s`,
DB permitting; F4 would roughly triple-to-quintuple that. These are model figures from
the measured per-query costs, not an end-to-end benchmark — treat them as order-of-
magnitude. The pick is amortized out by the feeder batch (`0.7 ms / 50 rows ≈ 14 µs/row`).

### Tuning levers (feeder knobs — see `GenDurable.Scheduler`)

- **`prefetch`** — claims a batch ahead into memory; turns many small picks into few fat
  ones (amortizes the pick round-trip). Buffered rows are heartbeated, so depth is safe
  w.r.t. `lease_ttl`. Cost of depth: cross-node fairness, crash blip (bounded by TTL).
- **`min_demand`** — batch gate: fetch fat, not one row per freed slot.
- **`poll_interval` / `max_poll_interval`** — idle backoff cuts the polling load on an
  empty queue to near-zero without hurting busy-queue latency.

### Known limits / pathologies (honest list)

1. **A single hot partition key with a large runnable backlog.** While that key is
   `executing`, the picker's `NOT EXISTS` excludes each of its runnable siblings — one
   cheap index probe each — but if those siblings dominate the top of the queue by
   priority, the scan walks past many of them per pick. The window `LIMIT` bounds the
   *output*, not how far it skips. Mitigation (future): picker sharding by `hashtext(key)`
   so one key maps to one scheduler, or a per-key "next eligible" side structure. A single
   key monopolizing a queue is itself a modeling smell.
2. **Partitioned steps pin a connection.** `partition_key` serialization holds a session
   advisory lock on a checked-out connection for the *whole* step (user code included), so
   in-flight partitioned steps consume pool connections 1:1. Size the pool for peak
   partitioned concurrency. Non-partitioned steps grab/release per statement.
3. **`load_signals` + `load_childs` run on every step**, even for FSMs that never await or
   spawn. Two wasted round-trips for the common machine; gate them behind
   `use GenDurable.FSM, awaits: true, childs: true`. The last remaining F4 step.

---

## 7. Optimization backlog — ordered by payoff

1. ✅ **Collapse the outcome transaction into one CTE statement** — done (F4): a 4–5-stmt
   `BEGIN … COMMIT` became one data-modifying CTE, ~4 round-trips → 1, the biggest single
   win on the hot path. Asserted single-statement in `test/perf_test.exs`.
2. **Gate `load_signals` / `load_childs` on FSM capability flags** — ~2 round-trips → 0
   for plain machines. Takes a plain `:next` step from ~3 round-trips to ~1.
3. **Picker sharding by key hash** — removes the hot-key skip cost (limit #1).

Together (1)+(2) take a plain `:next` step from ~6 round-trips to ~1–2.

---

## 8. Reproducing these numbers

The dataset and every `EXPLAIN` above were produced in the devcontainer Postgres. To
regenerate: bring up the stack (`make up`), then in `docker compose … exec db psql -U
postgres`, create a scratch database, apply the v1 DDL from
`lib/gen_durable/migration.ex` verbatim, seed with `generate_series` (700k terminal /
200k runnable non-partitioned / 80k distinct-key / 15k hot-key / 5k executing / 50k
signals), `ANALYZE`, and run `EXPLAIN (ANALYZE, BUFFERS)` on each statement from
`lib/gen_durable/queries.ex`. Wrap mutating statements in `BEGIN; … ; ROLLBACK;` so the
plan executes without changing the dataset. Run each twice and read the second (warm).