# phi_accrual
A source-agnostic φ-accrual failure detector for Elixir/OTP, built on
Hayashibara et al. 2004 with a dual-α EWMA estimator, head-of-line and
local-pause awareness, and a telemetry-first API.
> ⚠️ **Alpha — `v0.1.x`.** The API and configuration surface may change
> before `v1.0`. The **telemetry event schema is already stable** (see
> [Versioning](#versioning)), but everything else is subject to tuning
> based on real-deployment feedback. Production use at your own risk;
> please open issues as you find rough edges.

> **Observability-grade, not decision-grade.** Designed for dashboards,
> alerting, and operator intuition — not for automated routing, quorum,
> or correctness decisions. See [limitations](#limitations) for why.
## Quick start
```elixir
# mix.exs
def deps do
  [{:phi_accrual, "~> 0.1"}]
end
```
The application auto-starts. Feed in heartbeat arrivals from anywhere
your code already receives cross-node traffic, and read out φ on demand:
```elixir
# Call this whenever you receive evidence that a peer is alive —
# a GenServer reply, a :pg broadcast, an :rpc response, a custom ping.
# First call for an unknown node auto-tracks it with defaults.
PhiAccrual.observe(:"peer@host")
# Query φ at any time.
PhiAccrual.phi(:"peer@host")
#=> {:ok, 0.42, :steady}
```
That's the whole core loop: **feed in arrivals, read out φ.** Everything
below is about making it useful in production — reference heartbeat
sources if you have none of your own, telemetry wiring for Prometheus,
thresholding with hysteresis, and honest limitations.
## What it does
Given a stream of heartbeat arrivals from a remote node, the detector
maintains an EWMA estimate of the inter-arrival distribution (mean and
variance, independently smoothed) and emits a continuous suspicion
value φ. φ is calibrated so that `φ ≈ -log₁₀(P(arrival still pending))`:
| φ value | Rough meaning                                            |
| ------- | -------------------------------------------------------- |
| 1       | suspecting now carries a 1-in-10 chance of being wrong   |
| 3       | a 1-in-1 000 chance of being wrong                       |
| 8       | a 1-in-100 000 000 chance; the node is very likely down  |
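Under a Gaussian model of inter-arrival times, that calibration can be sketched in a few lines (a standalone illustration, not the library's internal code; `PhiSketch` is a made-up name):

```elixir
defmodule PhiSketch do
  # φ(elapsed) = -log₁₀(P(a heartbeat arrives later than `elapsed`)),
  # with the inter-arrival distribution modelled as N(mean, σ²).
  def phi(elapsed_ms, mean_ms, std_dev_ms) do
    z = (elapsed_ms - mean_ms) / std_dev_ms
    p_later = 1.0 - cdf(z)               # probability the silence is innocent
    -:math.log10(max(p_later, 1.0e-15))  # clamp to avoid -log10(0)
  end

  # Standard normal CDF via erf (available in OTP's :math module).
  defp cdf(z), do: 0.5 * (1.0 + :math.erf(z / :math.sqrt(2.0)))
end

# An arrival exactly on schedule (elapsed = mean) has P ≈ 0.5,
# so φ ≈ log₁₀(2) ≈ 0.3, the on-schedule baseline.
PhiSketch.phi(1_000, 1_000, 100)
```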
**Thresholding is a consumer concern.** The detector does not decide
whether a node is up or down; it publishes φ, and you (or the optional
`PhiAccrual.Threshold` module) decide what crosses what line.
## Why another failure detector?
The Elixir/OTP ecosystem has plenty of cluster-management libraries
(`libcluster`, `swarm`, `horde`, `partisan`), but all of them use
binary up/down detectors or entangle detection with membership.
`phi_accrual` is the thing that goes alongside them: a pure detector,
unopinionated about who sends heartbeats, what the topology looks like,
or what to do when φ gets high.
## Usage — bring your own signal
Anything that arrives from a remote node is evidence of liveness. If
your app already has cross-node traffic, call `observe/2` from the
receive path — no extra network cost:
```elixir
defmodule MyApp.Chatter do
  use GenServer

  def handle_info({:reply_from, node}, state) do
    PhiAccrual.observe(node)
    {:noreply, state}
  end
end
```
Then pattern-match on `phi/1` to handle every result state:
```elixir
case PhiAccrual.phi(:"node_a@host") do
  {:ok, phi, :steady}     -> # warm estimator, normal
  {:ok, phi, :recovering} -> # warm estimator, absorbing a recent gap
  {:insufficient_data, n} -> # still in bootstrap, `n` samples remaining
  {:stale, elapsed_ms}    -> # no arrival for > stale_after_ms
  {:error, :not_tracked}  -> # never observed
end
```
Call `PhiAccrual.track(node, opts)` **before** your first `observe` if
you need custom per-node estimator options; otherwise the first
`observe` auto-tracks with defaults.
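For instance, using option names from the table under [Configuration](#configuration) (the peer name and values here are illustrative):

```elixir
# A peer on a jittery link: widen the σ floor and the stale window.
PhiAccrual.track(:"noisy_peer@host",
  min_std_dev_ms: 100.0,
  stale_after_ms: 120_000
)

PhiAccrual.observe(:"noisy_peer@host")
```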
## Usage — reference source
If you have no existing cross-node chatter, enable the bundled
`DistributionPing` source in config:
```elixir
# config/runtime.exs
config :phi_accrual,
  distribution_ping: [interval_ms: 1_000, auto_track: true]
```
Each node then pings every peer every `interval_ms` over BEAM
distribution. Each ping is cheap, but the cluster-wide cost is O(N²):
at 50 nodes and a 1 s interval that is 50 × 49 = 2 450 pings/second of
distribution traffic.
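The quadratic cost is easy to budget for up front; a quick back-of-envelope helper (hypothetical, not part of the library):

```elixir
# Cluster-wide pings/second for a full mesh of `nodes` peers.
pings_per_sec = fn nodes, interval_ms ->
  nodes * (nodes - 1) * 1_000 / interval_ms
end

pings_per_sec.(50, 1_000)
#=> 2450.0
```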
**This source inherits HoL blocking** — see
[limitations](#limitations). The v2 `UdpSource` will escape it.
## What happens when a node fails
Suppose `:node_a@host` has been heartbeating every ~1 s for a few
minutes. Its estimator has mean ≈ 1 000 ms, σ ≈ 50 ms, and φ hovers
around 0.3 (the median for an on-schedule arrival).
Then the node goes dark. Here is the timeline, using the default
options and a threshold instance configured at `suspect_at: 4.0`,
`recover_at: 3.0`:
```
t=0s    last heartbeat arrives. φ ≈ 0.3.
        → [:phi_accrual, :sample, :observed] (interval_ms: ~1000)

t=1s    no new heartbeat. φ ≈ 0.3 (still on-schedule).
        → [:phi_accrual, :phi, :computed] (periodic gauge tick)

t=2s    φ ≈ 3.5. starting to get suspicious.
        → [:phi_accrual, :phi, :computed]

t=3s    φ crosses 4.0.
        → [:phi_accrual, :phi, :computed]
        → [:phi_accrual, :threshold, :suspected]

t=10s   φ very high. state still :steady (stale_after_ms default 60 s).
        → [:phi_accrual, :phi, :computed]

t=60s   elapsed > stale_after_ms.
        → [:phi_accrual, :phi, :computed] (state: :stale)
```
If `:node_a@host` comes back at t=15s and resumes heartbeating, the
first-arrival interval of 15 000 ms exceeds `recovering_threshold_ms`
(default 10 000). The state transitions to `:recovering` for the next
3 samples while the EWMA absorbs the outlier. Once φ drops below 3.0:
```
t=15s   first heartbeat after outage. interval = 15 000 ms.
        → [:phi_accrual, :sample, :observed]
        state becomes :recovering.

t=16s   next heartbeat. φ has fallen sharply (elapsed is small).
        → [:phi_accrual, :phi, :computed] (state: :recovering)
        → [:phi_accrual, :threshold, :recovered] (φ crossed 3.0 downward)

t=19s   three samples since the outlier.
        → state returns to :steady.
```
**Nowhere in this flow does the library decide the node is "down."**
It just publishes φ and state labels; the `Threshold` module (or your
own consumer) decides what to do. That separation is why the detector
can be wired to a dashboard, an alert, and an automated-routing policy
simultaneously with different thresholds.
## Telemetry schema (v1.x stable)
Event names, measurement keys, and metadata keys are a contract.
**Breaking changes only in v2.**
```
[:phi_accrual, :sample, :observed]
  measurements: %{interval_ms}
  metadata:     %{node, local_pause?}

[:phi_accrual, :phi, :computed]          # periodic gauge stream
  measurements: %{phi, elapsed_ms}
  metadata:     %{node, state, local_pause?, confidence}
  # state ∈ [:steady, :recovering, :insufficient_data, :stale]

[:phi_accrual, :local_pause, :start]     # rising edge
  metadata:     %{kind}                  # :long_gc | :long_schedule | :busy_dist_port

[:phi_accrual, :local_pause, :stop]      # falling edge

[:phi_accrual, :overload, :shed]
  measurements: %{mailbox_len}
  metadata:     %{node}

[:phi_accrual, :source, :started]
  metadata:     %{source, interval_ms}

[:phi_accrual, :threshold, :suspected]   # emitted by Threshold module
[:phi_accrual, :threshold, :recovered]
  measurements: %{phi}
  metadata:     %{node, instance, threshold, confidence, detector_state}
```
Pipe these to Prometheus via `telemetry_metrics_prometheus`, to logs,
or to your own alerting (see next section).
## Wiring telemetry to Prometheus
Pull in [`telemetry_metrics_prometheus`](https://hex.pm/packages/telemetry_metrics_prometheus)
(or your preferred `telemetry_metrics` reporter) and declare the
metrics you care about:
```elixir
# mix.exs — add dependency
{:telemetry_metrics_prometheus, "~> 1.1"}

# In your supervision tree
children = [
  {TelemetryMetricsPrometheus,
   metrics: [
     # φ as a gauge — one series per (node, state, confidence) combination.
     Telemetry.Metrics.last_value(
       "phi_accrual.phi.computed.phi",
       event_name: [:phi_accrual, :phi, :computed],
       measurement: :phi,
       tags: [:node, :state, :confidence]
     ),

     # Counter of every heartbeat observed.
     Telemetry.Metrics.counter(
       "phi_accrual.sample.observed.count",
       event_name: [:phi_accrual, :sample, :observed],
       tags: [:node]
     ),

     # Local-pause events — correlate noise in φ with GC / HoL.
     Telemetry.Metrics.counter(
       "phi_accrual.local_pause.start.count",
       event_name: [:phi_accrual, :local_pause, :start],
       tags: [:kind]
     ),

     # Overload shedding — if this is ever non-zero in steady state,
     # tune α instead of raising :shed_threshold.
     Telemetry.Metrics.counter(
       "phi_accrual.overload.shed.count",
       event_name: [:phi_accrual, :overload, :shed],
       tags: [:node]
     ),

     # Discrete alert events from the Threshold module.
     Telemetry.Metrics.counter(
       "phi_accrual.threshold.suspected.count",
       event_name: [:phi_accrual, :threshold, :suspected],
       tags: [:node, :instance]
     ),
     Telemetry.Metrics.counter(
       "phi_accrual.threshold.recovered.count",
       event_name: [:phi_accrual, :threshold, :recovered],
       tags: [:node, :instance]
     )
   ]}
]
```
For ad-hoc logging, attach a handler directly:
```elixir
:telemetry.attach_many(
  "phi-accrual-logger",
  [
    [:phi_accrual, :threshold, :suspected],
    [:phi_accrual, :threshold, :recovered]
  ],
  &MyApp.PhiLogger.handle/4,
  nil
)

defmodule MyApp.PhiLogger do
  require Logger

  def handle([:phi_accrual, :threshold, kind], %{phi: phi}, %{node: node}, _) do
    Logger.warning("node=#{node} #{kind} phi=#{Float.round(phi, 2)}")
  end
end
```
## Thresholding (optional)
`PhiAccrual.Threshold` converts the φ gauge stream into discrete
`:suspected` / `:recovered` events with hysteresis:
```elixir
# In your supervision tree
children = [
  {PhiAccrual.Threshold, name: :dash, suspect_at: 4.0, recover_at: 3.0},
  {PhiAccrual.Threshold, name: :route, suspect_at: 8.0, recover_at: 7.0}
]
```
Multiple instances coexist — one for dashboards at φ=4, another for
automated routing at φ=8. Skip the module entirely if you want to roll
your own.
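Rolling your own is mostly a two-threshold state machine; a minimal sketch (module name and state atoms hypothetical), assuming you poll `PhiAccrual.phi/1` on whatever cadence suits you:

```elixir
defmodule MyApp.Hysteresis do
  @suspect_at 4.0
  @recover_at 3.0

  # New belief from the previous belief and a fresh φ reading.
  # The gap between the two thresholds prevents flapping near the line.
  def step(:up, phi) when phi >= @suspect_at, do: :suspected
  def step(:suspected, phi) when phi <= @recover_at, do: :up
  def step(state, _phi), do: state
end
```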
## Configuration
```elixir
# config/runtime.exs
config :phi_accrual,
  # Enable the node-global :erlang.system_monitor hook (default: true).
  # Disable if another library already subscribes.
  pause_monitor: true,

  # Back-pressure threshold — observe/2 sheds samples when the mailbox
  # exceeds this count and emits [:overload, :shed] telemetry.
  shed_threshold: 10_000,

  # Bundled reference source — off by default, opt in:
  distribution_ping: [interval_ms: 1_000, auto_track: true]
```
Per-node estimator options (passed to `PhiAccrual.track/2`):
| Option | Default | Notes |
| ---------------------------- | -------- | --------------------------------------------- |
| `:alpha_mean` | `0.125` | EWMA smoothing for mean |
| `:alpha_var`                 | `0.125`  | EWMA smoothing for variance; often set lower  |
| `:min_std_dev_ms` | `50.0` | Floor on σ — prevents singular distribution |
| `:min_samples` | `8` | Bootstrap gate before φ is reported |
| `:stale_after_ms` | `60_000` | Elapsed past which state becomes `:stale` |
| `:recovering_threshold_ms` | `10_000` | Large-gap detection for `:recovering` tag |
| `:recovering_grace_samples` | `3` | Samples the `:recovering` tag persists for |
| `:initial_interval_ms` | `1_000` | Prior mean before any observation |
| `:initial_std_dev_ms` | `500` | Prior σ (variance = σ²) |
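The dual-α update behind `:alpha_mean` and `:alpha_var` is the standard EWMA recurrence; a sketch of the math (not the library's internal code; `EwmaSketch` is a made-up name):

```elixir
defmodule EwmaSketch do
  # One estimator step: mean and variance smoothed independently,
  # each with its own α (the "dual-α" in the description above).
  def step({mean, var}, interval_ms, alpha_mean \\ 0.125, alpha_var \\ 0.125) do
    dev      = interval_ms - mean
    new_mean = mean + alpha_mean * dev
    new_var  = (1.0 - alpha_var) * var + alpha_var * dev * dev
    {new_mean, new_var}
  end
end
```

A lower `:alpha_var` makes σ react more slowly than the mean, which damps φ spikes caused by a single late heartbeat.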
## Limitations
Read these before wiring φ to anything that takes irreversible action.
**Head-of-line blocking (primary v1 caveat).** `DistributionPing` and
any source that travels over BEAM distribution share a TCP socket
with user traffic. A large GenServer reply or `:pg` broadcast can
delay heartbeats for arbitrary periods. `PauseMonitor` subscribes to
`:busy_dist_port` so you can *observe* this (pause telemetry +
`confidence: false` on φ events), but the underlying problem cannot be
fixed by this library while the source is distribution-based. The v2
`UdpSource` solves it by using a dedicated socket.
**Local-pause suppression is best-effort.** `:erlang.system_monitor`
fires on `:long_gc`, `:long_schedule`, and `:busy_dist_port`. The
monitor marks φ output with `local_pause?: true` and
`confidence: false` for a short lockout window after any event. It
does **not** freeze φ or widen the variance — we decided the silent-
detector failure mode is worse than noisy φ. Consumers are expected to
filter on the confidence flag (the `Threshold` module passes it
through in metadata).
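In practice that filtering is one conditional in your `:telemetry` handler; a sketch (the `MyApp.Alerts.consider/2` call is a stand-in for whatever your consumer does):

```elixir
def handle([:phi_accrual, :phi, :computed], %{phi: phi}, meta, _config) do
  if meta.confidence do
    MyApp.Alerts.consider(meta.node, phi)  # hypothetical downstream consumer
  else
    :ok  # reading taken during a GC / HoL lockout window; ignore it
  end
end
```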
**Gaussian assumption misbehaves under bimodal distributions.** BEAM
GC produces intermittent large pauses that, combined with normal
intervals, yield a bimodal inter-arrival distribution. A Gaussian EWMA
is a poor fit and will over-alert. Correlate φ with
`:erlang.statistics(:garbage_collection)` before acting on high φ. A
non-parametric estimator (Satzger or a two-component mixture) is a v2
consideration once we have real traces from deployments.
**One `:erlang.system_monitor` per node.** Only one subscription can
exist at a time; if two libraries each install their own, whichever
subscribes last silently wins. Disable `pause_monitor` in config and feed pause
state to `PhiAccrual.PauseMonitor.put_state/1` yourself if you need
coexistence.
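Coexistence might look like the following, forwarding from your own `:erlang.system_monitor` subscription (the argument to `put_state/1` is assumed here for illustration; check the module docs for the actual shape):

```elixir
# Elsewhere: :erlang.system_monitor(self(), [:long_gc, :busy_dist_port])
def handle_info({:monitor, _pid, :long_gc, _info}, state) do
  PhiAccrual.PauseMonitor.put_state(:paused)  # assumed argument shape
  {:noreply, state}
end
```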
## Testing strategy
Failure detectors are hard to test against wall-clock. This project:
* Uses [`StreamData`](https://hex.pm/packages/stream_data) for
property-based tests of estimator math (`test/phi_accrual/core_test.exs`).
* Injects clocks into `PhiAccrual.Estimator` via the `:clock_fn`
option — no `Process.sleep` in unit tests.
* Integration tests against live distribution (`:peer`-based,
multi-node) are planned for v2 alongside the `UdpSource` work.
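The injected-clock pattern is small enough to show in full; a fake clock can be a plain Agent (this sketch is independent of the library's exact `:clock_fn` wiring):

```elixir
{:ok, clock} = Agent.start_link(fn -> 0 end)

clock_fn = fn -> Agent.get(clock, & &1) end              # what the estimator reads
advance  = fn ms -> Agent.update(clock, &(&1 + ms)) end  # what the test drives

clock_fn.()      #=> 0
advance.(1_000)
clock_fn.()      #=> 1000
```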
## Versioning
v1.x is **telemetry-schema-stable**: event names, measurement keys,
and metadata keys will not change until v2. Per-node option defaults
may be tuned within v1.x.
## Roadmap
### v1 (shipped)
- Dual-α EWMA estimator with bootstrap / stale / recovering states
- `PauseMonitor` with `:busy_dist_port` tracking
- Per-node estimator GenServer + `DynamicSupervisor` + `Registry`
- Overload shedding with telemetry
- Bring-your-own-signal API + `DistributionPing` reference source
- Optional `Threshold` module with hysteresis
- Committed telemetry event schema
### v2 (planned)
- `UdpSource` — dedicated UDP socket for heartbeats, escapes HoL,
makes the detector decision-grade
- Evidence-based evaluation of non-parametric / mixture estimators
- `:peer`-based multi-node integration tests
- Optional `phi_accrual_libcluster` companion package
### Related ideas
This library is the first of three composable primitives:
φ-accrual → HLC + causal broadcast → SWIM-Lifeguard standalone.
## License
Apache-2.0. See LICENSE.