# Changelog
All notable changes to `gen_durable` are documented here. The format follows
[Keep a Changelog](https://keepachangelog.com/); this project is pre-1.0 and makes
**no backward-compatibility guarantees** — there is one schema version and migrations are
edited in place until the MVP settles.
## 0.2.0
A design-review hardening release: two correctness races closed, outcome commits made
ownership-safe, the engine made multi-instance, and the per-step round-trip count cut to ~1.
Findings and their resolutions are tracked in `ISSUES.md`.
### Fixed
- **Lost-wakeup race** between parking (`:await`) and signal delivery under READ COMMITTED:
a signal racing the park could strand a parked instance with its wake-up already in the
inbox. Delivery now always takes the instance row lock (the flip condition moved into the
statement, not its WHERE), and parking rechecks the inbox under the same lock — `:await`
is now deliberately a two-statement transaction.
- **Cross-pick deadlock on rate buckets**: bucket rows are now locked in key order, and
every multi-row bucket writer (ensure CTEs, config upsert) also uses a deterministic order.
- **Stale-worker overwrite**: every outcome now commits only while the worker still owns
the claim (`locked_by` + `executing` guard). An orphaned task (its scheduler crashed;
the row was reclaimed) gets its late commit **dropped** — observable as
`[:gen_durable, :outcome, :stale]` — instead of rewinding step/state, silencing the new
claimant's heartbeat, or double-firing terminal side effects (inbox purge, parent join
decrement).
- **Double-encoded jsonb**: `state`/`result`/signal payloads were stored as jsonb *scalar
strings* (invisible to `->>` and jsonb indexes). All JSON parameters are now parsed
server-side (`::text::jsonb`) and stored as objects; rows in the old format still decode.
- **Batch-size ceiling**: `insert_all` and `schedule_childs` children are now passed as
parallel arrays expanded server-side with `unnest`; the old per-row placeholders hit the
wire protocol's 65535-parameter cap at ~5400 rows.
- **Arbiter-order deadlock on concurrent batch inserts**: batches insert in
`correlation_key` order (server-side), so two nodes racing the same new keys can no
longer deadlock on the dedup index. Ids are consequently assigned in key order, not
entry order.
- An uncaught `throw` in a step routes to `handle/2` as `{:throw, value}` instead of
crashing the task and waiting out a full lease.
- **A second adversarial review pass** (findings 13–21 in `ISSUES.md`):
  - the pick self-heals a swept rate bucket (a slept-past-refill row can no longer stall
    forever and starve the queue);
  - re-awaiting with already-presented signals parks cleanly instead of spinning (the
    accumulate-a-pack pattern no longer busy-loops);
  - unserializable outcomes (an unencodable `:done` result, a bad child spec) route to
    `handle/2` instead of looping through the reaper forever;
  - startup reclaim requires a stale lease, so claim-prefix collisions (containers with
    identical hostnames, BEAM as pid 1) can never release a live VM's claims;
  - all multi-row maintenance statements claim rows via ordered `FOR UPDATE SKIP LOCKED`
    (no maintenance-vs-maintenance deadlocks);
  - a numeric `concurrency_key` can no longer collide with a row id in the dedup window;
  - `min_demand` is clamped to the claim ceiling;
  - `:await` resets `attempt` like `:next`.
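
To illustrate the batch-size ceiling entry above, here is a minimal sketch of the placeholder arithmetic and of reshaping rows into parallel arrays. The module name `BatchParamsSketch`, the figure of 12 placeholders per row (inferred from 65535 / ~5400, not stated in this changelog), and the example row fields are assumptions for illustration, not the library's code.

```elixir
defmodule BatchParamsSketch do
  # Parameter cap cited in the changelog entry.
  @max_params 65_535

  # Largest batch that fits when every row spends `params_per_row` placeholders.
  def max_rows(params_per_row) when params_per_row > 0 do
    div(@max_params, params_per_row)
  end

  # Transpose row maps into one list per column: the shape a query could bind
  # as `unnest($1::text[], $2::int[], ...)`, one placeholder per column.
  def to_columns(rows, keys) do
    Enum.map(keys, fn key -> Enum.map(rows, &Map.fetch!(&1, key)) end)
  end
end

# Hypothetical rows; the real column set is not listed in this changelog.
rows = [%{key: "a", priority: 1}, %{key: "b", priority: 2}]
IO.inspect(BatchParamsSketch.max_rows(12))
# prints 5461
IO.inspect(BatchParamsSketch.to_columns(rows, [:key, :priority]))
# prints [["a", "b"], [1, 2]]
```

With arrays, the placeholder count depends on the number of columns rather than the number of rows, so the batch size is no longer limited by the parameter cap.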
### Changed (behavior)
- `GenDurable.signal/4` returns `{:error, :no_target}` for a terminal or nonexistent
target — for integer ids too (previously `:ok` with a durably stored orphan signal, or
an FK violation for a missing id).
- The default supervisor is now registered as `GenDurable` (was `GenDurable.Supervisor`),
and the `child_spec` id changes to match.
- Worker ids are now `<instance>:<queue>@<vm>-<uniq>` (opaque, stored in `locked_by`).
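
Callers that previously treated every `GenDurable.signal/4` call as successful may want to branch on the new return value. The sketch below only classifies the two results named above; the module `SignalResultSketch` and the atoms it returns are hypothetical, and the changelog does not say whether other error tuples exist.

```elixir
defmodule SignalResultSketch do
  # Map a result of GenDurable.signal/4 to an action for the caller.
  # The returned atoms are illustrative, not part of the library.
  def classify(:ok), do: :delivered

  # Terminal or nonexistent target: the signal was not stored.
  def classify({:error, :no_target}), do: :target_gone

  # Any other error is passed through untouched (assumption: such errors may exist).
  def classify({:error, other}), do: {:unexpected, other}
end

IO.inspect(SignalResultSketch.classify({:error, :no_target}))
# prints :target_gone
```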
### Added
- **Await timeouts**: `{:await, names, next_step, state, timeout: ms}` wakes the instance
after the deadline even if no signal arrived. This is a wake, not a failure (`attempt` is
untouched): a fresh await recognizes the timeout by an empty `ctx.awaited`, while the
accumulate pattern proceeds with its partial pack. Timeout resolution is bounded by
`:reap_interval`. Adds an `await_deadline` column and a partial index (v1 DDL edited in
place, per the pre-1.0 stance, so re-create the schema). Telemetry:
`[:gen_durable, :await, :timeout]`.
- **Multi-instance engines**: give each a `:name` (default `GenDurable`) and route API
calls with `name:` — config, task supervisor, and FSM registry are per-instance; a
duplicate name fails with `:already_started`.
- **Startup reclaim**: a starting scheduler releases claims left by a dead predecessor
(same instance+queue+VM) instead of letting them wait out the lease
(`[:gen_durable, :scheduler, :reclaimed]`).
- **Rate-bucket GC**: the GC sweep prunes buckets idle past their refill horizon and
buckets whose named limit was removed (`[:gen_durable, :gc, :swept]` gained a `buckets`
measurement) — partitioned keys no longer grow the bucket table without bound.
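
The await-timeout entry above can be sketched as follows. The outcome tuple is the shape given in this changelog; the module `AwaitTimeoutSketch`, the signal names, the step name `:decide`, and the assumption that `ctx` exposes an `:awaited` collection as a field are illustrative only. The functions are plain helpers, not the library's `step/2` callback.

```elixir
defmodule AwaitTimeoutSketch do
  # Outcome that parks the instance until :approved or :rejected arrives,
  # or until 30 seconds have passed.
  def park(state) do
    {:await, [:approved, :rejected], :decide, state, timeout: 30_000}
  end

  # Assumption: `ctx.awaited` holds the matched signals. After a fresh await,
  # an empty collection means the wake came from the timeout.
  def woken_by(ctx) do
    if Enum.empty?(ctx.awaited), do: :timeout, else: :signal
  end
end

IO.inspect(AwaitTimeoutSketch.woken_by(%{awaited: []}))
# prints :timeout
```

In the accumulate pattern a non-empty `ctx.awaited` does not rule out a timeout, so that case needs its own check.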
### Performance
- **~1 round-trip per step**: the pick batch-loads signal inboxes and children for its
whole claim set (3 statements per batch), removing the two per-step loads; the inbox
snapshot is taken at pick time (consumption stays exact — it deletes the ids the step
actually saw).
- **`deliver_signal` is one statement** (was a transaction of up to 5 round trips).
- **Prepared-statement caching** for every query (`cache_statement:`): parse + plan once
per connection. Hosts behind a transaction-pooling proxy set `prepare: :unnamed`.
- Advisory locks hash concurrency keys with `hashtextextended` (64-bit).
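
For hosts behind a transaction-pooling proxy, the setting might look like the fragment below. It assumes the host application reaches Postgres through an Ecto repo; `:my_app` and `MyApp.Repo` are placeholder names, and where exactly `gen_durable` picks up its connection options is not stated in this changelog.

```elixir
# config/runtime.exs (hypothetical host application)
import Config

config :my_app, MyApp.Repo,
  # Unnamed prepared statements do not persist on the server connection,
  # which a transaction-pooling proxy may hand to another client.
  prepare: :unnamed
```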
## 0.1.8
### Added
- **Rate limiting — token bucket, per step (spec §12).** A step opts into a named limit by
returning `{:next, step, state, rate_limit: :stripe}` (or `{:stripe, partition}` for a bucket
per tenant). Configured at start: `rate_limits: [stripe: [allowed: 100, period: {1, :minute}]]`
(`burst` defaults to `allowed`). Enforced in the picker (one statement) by a per-bucket token
counter locked with `FOR UPDATE` — correct across nodes, and measured to cost nothing on the
common path (NULL `rate_limit` short-circuits; the rate CTEs are `never executed`).
- **Weighted steps.** `{:next, step, state, rate_limit: :stripe, weight: 50}` — a step may consume
more than one budget unit. Grants take the urgency prefix whose cumulative weight fits (strict
order, free head-of-line reservation). `weight ≤ burst` is the caller's responsibility and is
**not** validated — a too-fat step freezes its bucket; split the step instead.
- New tables `gen_durable_rate_configs` / `gen_durable_rate_buckets`; new `gen_durable.rate_limit`
and `gen_durable.weight` columns. `:next` now normalizes to a 4-tuple carrying a per-transition
opts map (`rate_limit`, `weight`). `insert`, `insert_all`, and `schedule_childs` children all
carry the columns and ensure their buckets.
- Telemetry: `[:gen_durable, :rate_limit, :throttled]` (a bucket granted fewer than wanted) and
`[:gen_durable, :rate_limit, :unknown]` (a step named an unconfigured rate-limit).
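
The token-bucket and weighted-step behavior described above can be sketched in plain Elixir. This is one common token-bucket formulation with continuous refill, not the engine's SQL; the module `TokenBucketSketch` and its function names are hypothetical.

```elixir
defmodule TokenBucketSketch do
  # Tokens available after `elapsed_ms`, refilled at `allowed` per `period_ms`
  # and capped at `burst`.
  def refill(tokens, elapsed_ms, allowed, period_ms, burst) do
    min(burst, tokens + allowed * elapsed_ms / period_ms)
  end

  # Longest prefix of urgency-ordered weights whose running total fits in
  # `tokens`. Strict order: the first weight that does not fit stops the grant,
  # so later, lighter steps cannot jump ahead of it.
  def grant_prefix(weights, tokens) do
    {granted, _left} =
      Enum.reduce_while(weights, {[], tokens}, fn weight, {acc, left} ->
        if weight <= left do
          {:cont, {[weight | acc], left - weight}}
        else
          {:halt, {acc, left}}
        end
      end)

    Enum.reverse(granted)
  end
end

IO.inspect(TokenBucketSketch.grant_prefix([50, 30, 40], 100))
# prints [50, 30]
IO.inspect(TokenBucketSketch.grant_prefix([150, 1], 100))
# prints []
```

The second call shows why `weight ≤ burst` matters: a step heavier than the bucket can ever hold blocks everything queued behind it.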
### Deliberately not added (settled decisions)
- **Weighted-step poison guard** (`weight ≤ burst`): a too-fat step freezes its bucket — the caller's
responsibility (split the step), consistent with the engine's "you own correctness" stance.
- **Sliding/fixed-window rate algorithms**: token bucket only (its `rate`+`burst` knobs cover the
spectrum; sliding-log breaks single-row locking).
- **Boot-time validation / `on_unknown` policy**: an unknown rate-limit key stalls the row and emits
`[:gen_durable, :rate_limit, :unknown]`. That is the chosen v1 behavior (no fail-fast at boot, no
`:run`/`:stop`/`:defer` knob).
## 0.1.7
### Changed
- Renamed `partition_key` → `concurrency_key` (column, insert option, SQL, the
`gen_durable_concurrency_active` index, and the `[:gen_durable, :concurrency, :contended]`
telemetry event). Pure rename; the SQL `PARTITION BY` window keyword is untouched.
## 0.1.6
### Changed
- Dropped the `:unique` policy enum (`:live`/`:global`). `correlation_scope` (a `durable_status[]`)
is now passed directly, defaulting to the non-terminal statuses. This also removes the surprising
built-in `:global` behavior where a finished instance silently swallowed signals.
## 0.1.5
### Changed
- Renamed the same-step outcome `:replay` → `:retry` (it redoes the step with `attempt += 1` after a
delay; the old name read like event-sourcing replay).
- Merged addressing and uniqueness into one `correlation_key` (Temporal/DBOS workflow-id model): the
business key you `signal/4` by is the same key the engine deduplicates on. Replaces the separate
`external_id` (addressing) and `unique_key`/`unique_scope` (dedup). One partial unique index does
both jobs.
- Dropped the misleading "on top of GenServer" framing: an FSM is a row, not a process — there is no
GenServer per instance; the runtime backbone (scheduler/reaper/GC) is a small set of GenServers.
## 0.1.4
### Added
- Built-in GC of terminal (`done`/`failed`) rows: `GenDurable.GC`, configurable `:gc_interval` /
`:gc_retention` (default 1 day) / `:gc_batch`; `[:gen_durable, :gc, :swept]` telemetry. The delete
is O(batch) (select ids, then `DELETE … WHERE id = ANY`), not O(table).
## 0.1.3
### Changed
- `await` waits on a **set** of signal names; the woken step sees the matched subset as `ctx.awaited`
(full inbox in `ctx.all`). Consumption is by received id on progress (latecomers survive; packs can
accumulate via re-await); a terminal outcome clears the whole inbox.
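
A minimal sketch of the matching and consumption rules described above, using plain lists. The module `InboxSketch`, the signal maps with `:id` and `:name` fields, and the choice to consume the whole snapshot (rather than only the matched subset) are assumptions; the changelog does not spell out the engine's data shapes.

```elixir
defmodule InboxSketch do
  # Signals whose name is in the awaited set (the idea behind `ctx.awaited`).
  def matched(inbox, names), do: Enum.filter(inbox, &(&1.name in names))

  # Drop only the ids the step received; anything delivered later stays.
  def consume(inbox, received_ids), do: Enum.reject(inbox, &(&1.id in received_ids))
end

snapshot = [%{id: 1, name: :paid}, %{id: 2, name: :note}]
IO.inspect(InboxSketch.matched(snapshot, [:paid, :shipped]))
# prints [%{id: 1, name: :paid}]

# A signal that arrives while the step is still running.
latecomer = %{id: 3, name: :paid}
received_ids = Enum.map(snapshot, & &1.id)
IO.inspect(InboxSketch.consume(snapshot ++ [latecomer], received_ids))
# prints [%{id: 3, name: :paid}]
```

Deleting by id rather than by name is what lets the latecomer survive even though it shares a name with a consumed signal.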
## 0.1.2
### Added
- Job form: define `perform/1`/`perform/2` instead of `step/2` for a one-shot durable job with
built-in retry/backoff. Folded into `GenDurable.FSM`.
## 0.1.1
### Added
- Nested `State` embedded-schema adopted by convention (no `state:` option needed).
## 0.1.0
### Added
- Initial durable FSM engine: Postgres-backed, state committed before each step proceeds
(at-least-once, whole-step re-execution). Steps and outcomes (`:next`/`:retry`/`:await`/`:done`/
`:stop`), `schedule_childs` fan-out + fan-in barrier (§11), durable signals/await (§5), queues with
concurrency, priority, scheduling sugar, lease + reaper crash recovery, `concurrency_key`
serialization, uniqueness, single-round-trip outcomes, feeder backpressure, graceful drain, broad
telemetry, dynamic FSM resolution, library-owned migration.