# DGen
`dgen_server` brings the gen_server programming model to a distributed system. It uses the same callback interface — `init/1`, `handle_call/3`, `handle_cast/2` — but unlike a gen_server, it is not a single Erlang process. A `dgen_server` is a virtual entity that exists as long as its state does in FoundationDB, independent of any running process. What makes it *distributed* rather than merely durable is its message queue: multiple consumers on any number of nodes can consume from the same queue simultaneously, with serialization and exactly-once delivery guaranteed by the backend.
## Motivation
I love gen_server. There are only 2 things stopping me from writing my entire app with them:
1. Durability: The state is lost when the process goes down.
2. High availability: The functionality is unavailable when the process goes down.
Let's try to solve this with a distributed system, and find out if an app actually can be
written with only gen_servers.
## Getting Started
<!-- tabs-open -->
### Erlang
The simplest distributed server is just a regular gen_server with `dgen_server` behaviour:
```erlang
-module(counter).
-behavior(dgen_server).
-export([start/1, increment/1, value/1]).
-export([init/1, handle_call/3]).
start(Tenant) ->
dgen_server:start(?MODULE, [], [{tenant, Tenant}]).
increment(Pid) ->
dgen_server:call(Pid, increment).
value(Pid) ->
dgen_server:call(Pid, value).
init([]) ->
{ok, 0}.
handle_call(increment, _From, State) ->
{reply, ok, State + 1};
handle_call(value, _From, State) ->
{reply, State, State}.
```
Start it inside a FoundationDB directory, and the state persists across restarts:
```erlang
Tenant = dgen_erlfdb:sandbox_open(<<"demo">>, <<"counter">>),
{ok, Pid} = counter:start(Tenant),
counter:increment(Pid),
counter:increment(Pid),
2 = counter:value(Pid),
%% Restart the process
dgen_server:stop(Pid),
{ok, Pid2} = counter:start(Tenant),
2 = counter:value(Pid2). %% State persisted!
```
### Elixir
The simplest distributed server is just a regular GenServer with `use DGenServer`:
```elixir
defmodule Counter do
use DGenServer
def start(tenant), do: DGenServer.start(__MODULE__, [], tenant: tenant)
def increment(pid), do: DGenServer.cast(pid, :increment)
def value(pid), do: DGenServer.call(pid, :value)
@impl true
def init([]), do: {:ok, 0}
@impl true
def handle_call(:value, _from, state), do: {:reply, state, state}
@impl true
def handle_cast(:increment, state), do: {:noreply, state + 1}
end
```
Start it inside a FoundationDB directory, and the state persists across restarts:
```elixir
tenant = :dgen_erlfdb.sandbox_open("demo", "counter")
{:ok, pid} = Counter.start(tenant)
Counter.increment(pid)
Counter.increment(pid)
2 = Counter.value(pid)
# Restart the process
GenServer.stop(pid)
{:ok, pid2} = Counter.start(tenant)
2 = Counter.value(pid2) # State persisted!
```
<!-- tabs-close -->
## Installation
DGen can be installed by adding `dgen` to your list of dependencies in `mix.exs`:
```elixir
def deps do
[
{:dgen, "~> 0.1.0"}
]
end
```
The docs can be found at <https://hexdocs.pm/dgen>.
## API Contract
### Message Processing
`dgen_server` provides different message paths with different guarantees:
**Standard messages** (`call`, `cast`):
- Processed with strict serializability via the durable queue
- Execute within a database transaction (subject to FDB transaction limits)
- Must not include side effects. Callbacks must be pure with respect to external systems
- Respect the lock (see Locking below)
**Priority messages** (`priority_call`, `priority_cast`, `handle_info`):
- Skip the durable queue and execute immediately
- Still execute within a database transaction (subject to FDB transaction limits)
- Must not include side effects
- Do not respect the lock. Always execute even when locked
### Actions
Callbacks may return `{reply, Reply, State, Actions}` or `{noreply, State, Actions}` where `Actions` is a list of 1-arity functions. These functions:
- Execute after the transaction commits
- Receive the committed `State` as their argument, but cannot modify it
- Are the correct place for side effects: logging, telemetry, publishing to external systems
- Can return `halt` to stop processing actions, or any other value to continue
### Locking
A callback may return `{lock, State}` to enter locked mode. When locked:
- Standard `call` and `cast` messages are queued but not processed
- Priority messages and `handle_info` continue to execute
- The `handle_locked/3` callback is invoked outside of a transaction
- Not subject to FDB transaction limits
- Side effects are permitted
- Can modify state, which is written back to the database
Use locking for long-running operations that would exceed transaction time limits, such as calling external APIs or performing extended computations.
## Persisted State
### Encoder/Decoder
State is persisted to the key-value store using a structured encoding scheme that optimizes for partial updates. Three encoding types are supported:
1. **Assigns map**: Maps with all atom keys are split across separate keys, one per entry. No ordering guarantees.
#{
mykey => <<"my value">>,
otherkey => 42
}
2. **Components list**: Lists where every item is a map with an atom `id` key containing a binary value. Each item is stored separately with ordering maintained via fractional indexing in the storage key.
[
#{id => "item1", value => 1},
#{id => "item2", value => 2}
]
3. **Term**: All other terms use `term_to_binary` and are chunked into 100KB values.
{this, is <<"some">>, term, 4.5, %{3 => 2}}
4. **Nesting**: The encoder handles nested structures recursively. For example, an assigns map containing a components list will nest both encodings in the key path.
#{
mykey => <<"my value">>,
mylist => [
#{id => "item1", value => 1},
#{id => "item2", value => 2}
]
}
When writing updates, diffs are generated by comparing old and new state:
- **Assigns map**: Only changed entries are written; removed entries are cleared
- **Components list**: Only changed items are written; ordering changes update fractional indices
- **Term**: Full rewrite (no diffing)
If the encoding type changes between updates, the old keys are cleared and the new encoding is written in full.
### Caching
Each consumer process can maintain an in-memory cache of the state paired with its versionstamp. On subsequent messages, if the cached versionstamp matches the current database version, the state is reused without a read operation. This eliminates redundant reads when processing multiple messages in sequence.
The cache is invalidated when the process detects that another consumer has modified the state.
### Crashing
DGenServer has well-defined behavior during crashes.
**Key guarantee:** Standard `call` and `cast` messages are processed **exactly-once** under normal operation. In the event of a crash before the transaction commits, the message will be retried — so in crash scenarios, delivery is **at-least-once**. When `dead_letter_threshold` is set, retries are bounded by that limit. Design your callbacks to be idempotent when possible.
**During `init/1`:**
- If the first `init/1` crashes, the gen_server process exits before any state is persisted
- When restarted `init/1` runs again from scratch
- No durable state exists yet, so there's nothing to recover
**During transactional callbacks (`handle_call`, `handle_cast`, `handle_info`):**
- The database transaction is automatically aborted — no state changes are committed
- For `call` and `cast`: the message remains in the durable queue and will be retried by the next consumer. If `dead_letter_threshold` is set, each failed attempt increments a counter embedded in the message envelope; once the counter reaches the threshold, the message is moved to the dead-letter queue instead of being retried (see below).
- For `priority_call` and `priority_cast`: the message is lost (it never entered the queue)
- For `handle_info`: the Erlang message is lost (info messages are not durable)
- State remains unchanged from before the callback was invoked
**Dead-letter queue:**
A *poison message* is a queue message that consistently crashes consumers. To enable bounded retries, set `dead_letter_threshold` to a positive integer. Each message envelope carries an attempt counter; when the counter reaches the threshold, the message is moved to the dead-letter queue (DLQ) in FoundationDB rather than being retried:
- For `call` messages, the blocked caller raises `{dead_letter, N}` where `N` is the attempt count.
- The optional `handle_dead_letter/2` callback is invoked with the original message and attempt count, if defined.
- A warning is logged.
Dead-lettering is **disabled by default** (`dead_letter_threshold: infinity`). Enable it with the start option:
<!-- tabs-open -->
### Erlang
```erlang
dgen_server:start(MyMod, [], [{tenant, Tenant}, {dead_letter_threshold, 3}])
```
### Elixir
```elixir
DGenServer.start_link(MyMod, [], tenant: tenant, dead_letter_threshold: 3)
```
<!-- tabs-close -->
**Coordinating with the supervisor's restart intensity:**
Each consumer crash counts as one restart from the supervisor's perspective. OTP supervisors enforce a *restart intensity* — a maximum number of restarts allowed within a sliding time window (`max_restarts` / `max_seconds` in Erlang; `max_restarts` / `max_seconds` in Elixir, defaulting to `3` restarts in `5` seconds). If the supervisor reaches this limit before `dead_letter_threshold` is hit, the supervisor itself terminates rather than the message being dead-lettered.
To ensure dead-lettering takes effect, configure the supervisor so that it tolerates at least `dead_letter_threshold` restarts within the window. A practical approach is to set `max_restarts >= dead_letter_threshold` with a `max_seconds` value long enough to cover the expected crash-restart cycle time:
<!-- tabs-open -->
### Erlang
```erlang
%% Allow up to 5 restarts in 60 seconds — enough headroom for a threshold of 3.
{ok, _} = supervisor:start_link({local, my_sup}, my_sup, []),
%% In the supervisor's init/1:
SupFlags = #{strategy => one_for_one, intensity => 5, period => 60},
```
### Elixir
```elixir
# Allow up to 5 restarts in 60 seconds — enough headroom for a threshold of 3.
Supervisor.start_link(children,
strategy: :one_for_one,
max_restarts: 5,
max_seconds: 60
)
```
<!-- tabs-close -->
With `dead_letter_threshold: infinity` (the default), poison messages produce an unbounded crash loop. The supervisor will eventually exhaust its restart intensity and terminate, which is standard OTP crash-loop behavior. Set a finite threshold to bound the loop and keep the supervisor alive.
**Inspecting and managing the dead-letter queue:**
Dead-lettered messages accumulate in FoundationDB and are not automatically cleared. An operator can inspect and manage them from a remote shell using functions in `dgen_queue`. The `Quid` for a server is obtained with `dgen_server:get_quid/1` (Erlang) or `:dgen_server.get_quid/1` (Elixir), passing the `tuid` the server was started with.
<!-- tabs-open -->
### Erlang
```erlang
Quid = dgen_server:get_quid(Tuid),
%% Inspect — returns [{Key, Envelope, AttemptCount, TimestampMs}]
Entries = dgen_queue:peek_dlq(Tenant, Quid),
%% Count without decoding
dgen_queue:dlq_length(Tenant, Quid),
%% Requeue one entry (resets attempt count to 0, atomic with DLQ delete)
{Key, _Envelope, _N, _Ts} = hd(Entries),
dgen_queue:requeue_dlq_entry(Tenant, Quid, Key), %% ok | {error, not_found}
%% Discard one entry permanently
dgen_queue:delete_dlq_entry(Tenant, Key),
%% Discard all entries for the queue
dgen_queue:purge_dlq(Tenant, Quid).
```
### Elixir
```elixir
quid = :dgen_server.get_quid(tuid)
# Inspect — returns [{key, envelope, attempt_count, timestamp_ms}]
entries = :dgen_queue.peek_dlq(tenant, quid)
# Count without decoding
:dgen_queue.dlq_length(tenant, quid)
# Requeue one entry (resets attempt count to 0, atomic with DLQ delete)
{key, _envelope, _n, _ts} = hd(entries)
:dgen_queue.requeue_dlq_entry(tenant, quid, key) # :ok | {:error, :not_found}
# Discard one entry permanently
:dgen_queue.delete_dlq_entry(tenant, key)
# Discard all entries for the queue
:dgen_queue.purge_dlq(tenant, quid)
```
<!-- tabs-close -->
`requeue_dlq_entry/3` is atomic: it pushes the envelope back onto the main queue with the attempt count reset to zero and deletes the DLQ entry in a single FDB transaction. If the root cause of the crashes has been fixed and the server has been redeployed, requeueing allows the message to be retried from a clean state.
**During `handle_locked`:**
- `handle_locked` executes outside a transaction, so previous state changes have already been persisted
- If the crash is an Erlang/Elixir throw, then the lock is cleared before the process exits
- If the crash is a system disruption such as SegFault, OOM, or sudden power loss, the lock is not cleared and the dgen_server is deadlocked. Manual intervention is required to clear the lock.
- In either case, the triggering message has been consumed from the queue, so it will not be retried
**During action execution:**
- Actions run after the transaction commits, so state changes are already persisted
- If an action crashes, the state update succeeds but remaining actions are not executed
- The message has been consumed from the queue and will not be retried
**Supervisor restart:**
- When a dgen_server is restarted by a supervisor, it reads existing state from the database
- If state exists, `init/1` is called, but the initial state is ignored. The server resumes with the persisted state
- The process immediately begins consuming any queued messages, if it's configured to do so
- Multiple processes can safely consume from the same queue; they coordinate via database transactions