# Firebreak

**Find the process coupling that crosses your OTP supervision tree — the
synchronous dependencies a restart turns into `:noproc`/`:timeout`, which the
supervision tree itself can't show.**

Your supervision tree declares how your app is *structured*; your code declares
how your processes *actually talk to each other*. Those are two different graphs,
and the dangerous bugs live in the gap: a process in one branch of the tree
synchronously depends on a process in another, so a restart the tree calls
"contained" surfaces an error somewhere that looks unrelated.

That's firebreak's headline finding. Run it on **Livebook**, for instance, and it
flags that `HomeLive`/`OpenLive`/`SessionLive` call `NotebookManager`
synchronously inside `mount/3` — but `NotebookManager` lives in a *different*
branch of the supervision tree, so if it restarts mid-request the page fails to
render. One static run, no app boot.

It reads the tree the way OTP does — by calling each supervisor's `init/1` (child
specs are runtime data; `init/1` *returns* them without starting anything) — and
falls back to static AST parsing for code it can't load, marking which. The
coupling graph is always static. By default: no app boot, no LLM, no runtime
tracing — fast and deterministic enough for a CI gate.
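
To illustrate why calling `init/1` is safe, here is a minimal sketch. The modules `InitSketch.Cache` and `InitSketch.Sup` are hypothetical and not part of Firebreak. `Supervisor.init/2` only builds the `{flags, child_specs}` tuple, so invoking the callback directly yields the tree as plain data and starts no process.

```elixir
defmodule InitSketch.Cache do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts), do: {:ok, opts}
end

defmodule InitSketch.Sup do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    # Builds supervisor flags and child specs; starts nothing.
    Supervisor.init([InitSketch.Cache], strategy: :one_for_all)
  end
end

# Call the callback directly, as a supervisor process would on startup.
{:ok, {flags, child_specs}} = InitSketch.Sup.init([])

IO.inspect(flags.strategy, label: "strategy")
IO.inspect(Enum.map(child_specs, & &1.id), label: "children")
```

Neither `InitSketch.Sup` nor `InitSketch.Cache` is running after this; only the returned data was read.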

> **What it is / isn't.** Best-effort *static* analysis, not a type system: it
> surfaces *hazards*, not certainties, and metaprogramming or runtime-computed
> names can still hide an edge. The default report leads with the high-confidence
> coupling/correctness findings and collapses the advisory ones to a count
> (`--all` shows them). For exact runtime truth — live `DynamicSupervisor`
> children, recovered names — the opt-in `--observe` mode attaches to a running
> node.

## What it finds

```
firebreak - supervision & coupling analysis
  1 files | 7 modules | 3 supervisors (2 exact, 1 static) | 1 coupling edges

Supervision roots:
  - Demo.App

Findings (3):
  [HIGH] missing_trap_exit (best-effort)  Demo.Worker1 (demo.ex:40)
         links a process (task_start_link) but does not trap exits; a crash in the linked process takes Demo.Worker1 down with it. Use a supervised Task or set Process.flag(:trap_exit, true).
  [MED ] one_for_all_blast_radius (exact)  Demo.SupA (demo.ex:9)
         :one_for_all with 3 children - any single child crash restarts all 3. If these children aren't genuinely interdependent, a narrower strategy contains the blast.
  [LOW ] cross_tree_coupling (best-effort)  Demo.SupA (demo.ex:9)
         1 module(s) outside Demo.SupA's subtree depend on processes inside it (Demo.Api); restarting it can surface :timeout/:noproc in those callers - coupling the supervision tree does not show.

Summary: 1 high, 1 medium, 1 low, 0 info
```

The `(2 exact, 1 static)` header counts how each supervisor's tree was read:
`Demo.SupA`/`Demo.SupB` via `init/1` (exact), the `Application` statically (its
`start/2` boots the tree, so Firebreak never calls it). Each finding is tagged
`(exact)` or `(best-effort)` so you know whether it rests on the real tree or on
a static read.

That last finding is the one no other tool gives you. `Demo.Api` lives under
`Demo.SupB`, but it synchronously calls `Demo.Cache`, which lives under
`Demo.SupA`. The supervision tree says these subtrees are independent. The
coupling graph says a `Demo.SupA` restart can surface `:timeout`/`:noproc`
inside `Demo.Api` — a failure path the tree alone would call "contained."
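
The crossing described above can be sketched as follows. This is not the actual `demo.ex`; the `CrossingSketch.*` names are hypothetical, and the caller is a plain module rather than a supervised process to keep the example short. The point is that the caller contains a synchronous call into a process it does not own, so that process being down becomes the caller's problem.

```elixir
defmodule CrossingSketch.Cache do
  use GenServer

  # Imagine this child sits under one supervisor (the "SupA" side).
  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:get, key}, _from, state), do: {:reply, Map.get(state, key), state}
end

defmodule CrossingSketch.Api do
  # Imagine this code runs under a different supervisor (the "SupB" side).
  # The synchronous call exits if the cache is down or slow to come back.
  def fetch(key) do
    GenServer.call(CrossingSketch.Cache, {:get, key})
  catch
    :exit, {:noproc, _} -> {:error, :cache_down}
    :exit, {:timeout, _} -> {:error, :cache_timeout}
  end
end
```

Calling `CrossingSketch.Api.fetch(:a)` while the cache is not running returns `{:error, :cache_down}` here; without the `catch`, the same call would crash the caller.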

### Tier 1 — structural checks

Facts you can read straight off each module, no cross-module graph needed:

| Check | What it flags |
|---|---|
| `one_for_all_blast_radius` | `:one_for_all` over many children — one crash restarts every sibling |
| `missing_trap_exit` | a GenServer that links a process but doesn't trap exits |
| `shutdown_exceeds_intensity_window` | a child shutdown timeout that can burn the supervisor's whole restart budget |
| `default_restart_intensity` | a large supervisor (5+ children) on a tight restart budget — the default `3 restarts / 5s`, or less |
| `start_link_in_callback` | a process started with a direct `start_link` inside an OTP callback (`handle_*`/`init`) — re-spawned on every invocation, so it leaks duplicates or fails on `:already_started`. Should be a supervised child |
| `lookup_or_create_race` | a non-atomic registry test-and-set: one function that reads the registry (`whereis`/`Registry.lookup`/`:global.whereis_name`) and then creates/registers in the same body. Two callers can both miss and both create; the loser crashes (a raising `register`) or gets `{:error, {:already_started, pid}}`, and its just-spawned process leaks as a ghost (Christakis & Sagonas, PADL 2010) |
| `unhandled_port_exit` | a `GenServer`/`:gen_statem` that opens a port (`Port.open`/`:erlang.open_port`) but no `handle_info` clause handles the port's termination (`{port, {:exit_status, _}}` / `{:EXIT, port, _}`) — the external program can die without the owner noticing, or take it down via the linked exit |
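
As an illustration of the `lookup_or_create_race` shape, the sketch below contrasts a lookup-then-start function with one that lets name registration pick the winner. `RaceSketch.Session` and `RaceSketch.Registry` are hypothetical names, and the direct `start_link` is used only for brevity; a real app would more likely start the process under a `DynamicSupervisor`.

```elixir
defmodule RaceSketch.Session do
  use GenServer

  def via(id), do: {:via, Registry, {RaceSketch.Registry, id}}

  def start_link(id), do: GenServer.start_link(__MODULE__, id, name: via(id))

  # Racy shape: two callers can both see [] and both try to start.
  def racy_get_or_start(id) do
    case Registry.lookup(RaceSketch.Registry, id) do
      [{pid, _value}] -> {:ok, pid}
      [] -> start_link(id)
    end
  end

  # Safer shape: attempt the start and treat "already started" as success.
  def get_or_start(id) do
    case start_link(id) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
    end
  end

  @impl true
  def init(id), do: {:ok, id}
end

{:ok, _} = Registry.start_link(keys: :unique, name: RaceSketch.Registry)
{:ok, pid} = RaceSketch.Session.get_or_start("s1")
# A second caller for the same key gets the existing process back.
{:ok, ^pid} = RaceSketch.Session.get_or_start("s1")
```

In the racy version the losing caller receives `{:error, {:already_started, pid}}` and has to handle it itself; the second version folds that case into the normal return value.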

### Tier 2 — coupling across the tree

The differentiator. Resolves the process-to-process coupling graph
(`GenServer.call`/`cast`, `:gen_server`, `:gen_statem`, registered names,
`Registry`, `:global`, `Process.whereis`, `:ets`, `Phoenix.PubSub`, `:pg`),
maps each call to the module that owns the target, then reasons about the gap
between that graph and the supervision forest:

| Check | What it flags |
|---|---|
| `cross_tree_coupling` | a module *outside* a supervisor's subtree depends on a process *inside* it — the restart the tree calls "contained" surfaces `:timeout`/`:noproc` in the outside caller. Synchronous callers from several modules rank highest |
| `supervisor_subtree_blast` | `:one_for_all`/`:rest_for_one` *over child supervisors* — a crash restarts whole sibling subtrees, not just a worker |
| `dynamic_supervisor_restart_blast` | a `DynamicSupervisor`/`Registry` get-or-start race where a restart drops the registration callers rely on |
| `boot_order_dependency` | an `init/1` that synchronously calls a sibling started *after* it — the dependency isn't alive yet on first boot |
| `crash_cascade` | failure simulation: "if this process crashes now, who blocks?" — follows the restart closure (`:one_for_all`/`:rest_for_one`) so a crash that co-restarts a depended-on sibling is caught even when the coupling is invisible from the call sites alone |
| `cyclic_coupling` | a cycle of synchronous calls (A→B→A) — a deadlock hazard: each can block in `handle_call` awaiting the other |
| `boot_order_cycle` | a cycle of synchronous calls made *inside `init/1`* — the tree can't start: on boot each `init` waits on a peer that isn't running yet. The sharper, `:high` sibling of `cyclic_coupling` |
| `orphaned_stateful_process` | a `GenServer`/`:gen_statem`/`GenStage`/`Agent` in no supervisor's subtree. Sharpened with evidence: *supervised via a `child_spec` builder* or *via `DynamicSupervisor.start_child`* (relabelled, likely fine), *hand-started* via a direct `start_link` outside any supervisor (the exact call site is named), or genuinely unknown |
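
The "restart closure" idea behind `crash_cascade` can be sketched as a pure function. This is an illustration of OTP's strategy semantics under a simplified model (one supervisor, children in start order, synchronous edges as `{caller, target}` pairs), not Firebreak's implementation; `ClosureSketch` and its function names are hypothetical.

```elixir
defmodule ClosureSketch do
  @doc "Children that go down together when `crashed` dies, per strategy."
  def closure(:one_for_one, _children, crashed), do: [crashed]
  def closure(:one_for_all, children, _crashed), do: children

  def closure(:rest_for_one, children, crashed) do
    # The crashed child and every child started after it.
    Enum.drop_while(children, &(&1 != crashed))
  end

  @doc "Callers outside the closure that synchronously depend on something inside it."
  def blocked_callers(strategy, children, crashed, sync_edges) do
    down = closure(strategy, children, crashed)

    for {caller, target} <- sync_edges,
        target in down,
        caller not in down,
        uniq: true,
        do: caller
  end
end

# A Worker crash under :one_for_all also restarts Cache, so Api is affected
# even though Api never calls Worker.
ClosureSketch.blocked_callers(:one_for_all, [:cache, :worker], :worker, [{:api, :cache}])
|> IO.inspect(label: "blocked")
```

Under `:one_for_one` the same crash yields an empty list, which is why the closure has to be followed rather than just the call sites.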

Crossings are weighted by **synchronicity**: only a synchronous caller
(`GenServer.call`/`whereis`) needs the target alive at that moment, so only it
is exposed to `:noproc`/`:timeout` when the target is down. Async-only coupling
(`cast`/`send`/`pubsub`) is therefore rated lower. Per-entity `:via`/`:syn`
targets (`{:via, _, {Reg, {Owner, id}}}`) resolve to the keyed owner module, not
the shared registry.

The text report **leads with these coupling/correctness findings** and groups
the structural/advisory ones (blast-radius strategies, orphan heuristics — real,
but often by-design) beneath them.

**Wrapper-call coupling.** Apps rarely scatter `GenServer.call(Server, …)`; they
wrap it in a public API (`Server.fetch(id)`). Firebreak does first-level
inter-procedural analysis: when `A` calls `M.f(...)` and `M.f` itself couples to
a process, it synthesises the edge from `A` onto that process — so a dependency
routed through an API module isn't invisible.
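
A minimal sketch of the wrapper shape, with hypothetical modules `WrapperSketch.Server` and `WrapperSketch.Caller`: the caller contains no `GenServer.call`, yet it blocks on the server through the public API, which is the edge first-level inter-procedural analysis is meant to recover.

```elixir
defmodule WrapperSketch.Server do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # Public API: the synchronous call is hidden behind this wrapper.
  def fetch(id), do: GenServer.call(__MODULE__, {:fetch, id})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:fetch, id}, _from, state), do: {:reply, {:item, id}, state}
end

defmodule WrapperSketch.Caller do
  # No GenServer.call appears here, but this function still depends
  # synchronously on WrapperSketch.Server being alive.
  def show(id), do: {:shown, WrapperSketch.Server.fetch(id)}
end
```

Searching `WrapperSketch.Caller` for `GenServer.call` finds nothing; the dependency only appears once `fetch/1` is followed one level down.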

### Runtime observation (`--observe`)

`mix firebreak --observe app@host` attaches to a *running* node over distributed
Erlang and folds its real shape into the analysis:

- live `DynamicSupervisor` children become part of the forest;
- registered names static analysis couldn't bind are recovered;
- a `runtime_fanout` finding reports supervisors running far more children than
  the source models;
- a `runtime_mailbox_backlog` finding flags a process with a deep mailbox
  (≥1000 queued) that something calls *synchronously*: a live back-pressure
  chokepoint where callers block on the backlog.

All of these carry `:exact` confidence, since they are observed reality. The
target needs nothing installed; reads use standard-library `:rpc` calls only.

`mix firebreak --observe app@host --format overlay` projects the *join*: every
synchronous cross-tree crossing from the static IR, annotated with its target's
live state — alive?, mailbox depth, instance count. It answers the question
neither view answers alone: *of my static crossings, which targets are hot right
now?* (The judgements stay in the `runtime_fanout`/`runtime_mailbox_backlog`
findings; the overlay is the structured ground-truth layer they're read from.)
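
To give a feel for what a standard-library read of live state can look like, here is a hedged sketch that fetches a registered process's mailbox depth over `:rpc`. `ObserveSketch` is a hypothetical module, and which exact calls Firebreak makes is an assumption not confirmed by this document; the sketch uses only `:erlang.whereis/1` and `:erlang.process_info/2`, which exist on any BEAM node.

```elixir
defmodule ObserveSketch do
  @doc """
  Mailbox depth of the process registered as `name` on `node`,
  or nil if it is not alive or the node is unreachable.
  """
  def mailbox_depth(node, name) do
    case :rpc.call(node, :erlang, :whereis, [name]) do
      pid when is_pid(pid) ->
        # process_info/2 runs on the target node, where the pid is local.
        case :rpc.call(node, :erlang, :process_info, [pid, :message_queue_len]) do
          {:message_queue_len, depth} -> depth
          _dead_or_badrpc -> nil
        end

      _undefined_or_badrpc ->
        nil
    end
  end
end
```

Passing `node()` as the first argument runs the same reads against the local node, which is a convenient way to try the function without setting up distribution.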

### Formal specs (experimental): `mix firebreak.spec`

Firebreak can project its findings into a verified **supervision model** and
generate a TLA+ lifecycle spec per supervisor — turning the separate static
warnings into a single, model-checkable failure scenario with a counterexample
trace.

```sh
mix firebreak --format model        # the model IR: per-supervisor strategy,
                                     # intensity, ordered children (+restart type),
                                     # parent, and inbound crossings (sync/async)
mix firebreak.spec --out specs/      # one <Supervisor>.tla + .cfg per supervisor
# then, with TLC (tla2tools.jar):
java -cp tla2tools.jar tlc2.TLC -deadlock -config <Sup>.cfg <Sup>.tla

# or generate the same model as Quint:
mix firebreak.spec --lang quint     # one <Supervisor>.qnt per supervisor
quint verify --invariant SupNeverDies <Supervisor>.qnt
```

Each generated spec is a **pure function of `--format model`** — nothing is
hand-written. It models the restart-intensity budget and escalation, and (only
where firebreak found a *synchronous* crossing) the external caller's permanent
`:noproc` after escalation. So a supervisor with no crossing gets only the
`SupNeverDies` (budget) property; one with a real sync crossing also gets
`ExtNeverStuck`. TLC then composes findings the report lists separately — e.g.
"four `:one_for_all` child crashes exhaust the 3-in-5s budget, the supervisor
escalates, and a cross-tree caller is left permanently `:noproc`" — into one
proven trace.

Sharper cases the model captures:

- **`:temporary` target** — never restarted, so a *single* crash permanently
  breaks the caller (TLC shows it in two steps, supervisor still alive).
- **`:one_for_all` transient amplification** (`TargetTransientlySafe`) — *any*
  one child crash transiently downs every child, so a cross-tree caller is hit
  even by an unrelated sibling's crash. `:one_for_one` isolates and gets no such
  property — the contrast is the signal.
- **Real `max_seconds` window** — the budget is spent within a time window
  (`Tick` ages it; it resets after `max_seconds`), so an escalation trace shows
  the crashes were a *fast burst*, not spread out. (A fixed-window approximation
  of OTP's sliding window.)
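
The fixed-window budget described in the last bullet can be sketched as a small state machine. This is an Elixir illustration of the idea, not the generated TLA+ or Quint; `BudgetSketch` and its field names are hypothetical, and the defaults mirror the `3 restarts / 5s` budget mentioned earlier.

```elixir
defmodule BudgetSketch do
  defstruct max_restarts: 3, max_seconds: 5, count: 0, age: 0, alive: true

  def new(opts \\ []), do: struct!(__MODULE__, opts)

  # A child crash spends one restart; exceeding the budget escalates
  # (the supervisor itself dies).
  def crash(%__MODULE__{alive: false} = s), do: s

  def crash(%__MODULE__{} = s) do
    count = s.count + 1
    %{s | count: count, alive: count <= s.max_restarts}
  end

  # One second passes; the window resets once it reaches max_seconds.
  def tick(%__MODULE__{alive: false} = s), do: s

  def tick(%__MODULE__{} = s) do
    age = s.age + 1
    if age >= s.max_seconds, do: %{s | count: 0, age: 0}, else: %{s | age: age}
  end
end

# A burst of four crashes inside one window exhausts a 3-restart budget.
burst = Enum.reduce(1..4, BudgetSketch.new(), fn _, s -> BudgetSketch.crash(s) end)
IO.inspect(burst.alive, label: "alive after burst")
```

Spreading the same crashes across windows (three crashes, five ticks, three more crashes) leaves the supervisor alive in this model, which is the distinction the escalation trace makes visible.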

Scope (honest): `:one_for_all`/`:one_for_one`/`:rest_for_one` are templated. It
verifies the *declared* topology, so it inherits firebreak's static blind spots;
`--observe` narrows those.

The `--format model` output is a **versioned, documented contract** — TLA+ is just
one consumer. If you want to build your own backend (a different model checker, a
diagram, a `lockstep` scenario), see [`notes/model-ir-contract.md`](notes/model-ir-contract.md):
the schema, serialization, design law, and a backend-author guide.
`Firebreak.Model.valid?/1` checks a projection against it.

### Reproduce it dynamically: `mix firebreak.lockstep`

The dynamic counterpart to `firebreak.spec`. For each *synchronous cross-tree
crossing*, it generates a [lockstep](https://hex.pm/packages/lockstep) `ctest`
scaffold — the starting point for a test that reproduces the `:noproc` failure in
the running BEAM, not just in a model. It names the two processes, sets up the
harness, and marks the app-specific TODOs (start the target, drive the call,
assert it's handled). Static finding → proof (TLA+) → executable regression test
(lockstep), all from the same model IR.

## Usage

```sh
# human-readable report for the current project
mix firebreak

# point it at another project
mix firebreak ../some_app

# JSON artifact (CI handoff / tooling)
mix firebreak --format json

# graph the supervision forest + coupling (crossing edges highlighted)
mix firebreak --format dot | dot -Tsvg -o firebreak.svg
mix firebreak --format mermaid          # paste into a Markdown doc
mix firebreak --format html > report.html   # findings + graph in one page
mix firebreak --format failure          # Mermaid of just the failure modes (who :noproc-blocks)

# a structural supervision-risk score + per-supervisor ranking (dashboards/trend)
mix firebreak --format score

# join the static crossings against a live node's observed state (needs --observe)
mix firebreak --observe app@host --format overlay

# CI: emit GitHub Actions annotations (one per finding, on the PR diff)
mix firebreak --format github

# fold in a running node's real runtime shape
mix firebreak --observe my_app@127.0.0.1 --cookie secret

# only show medium and above
mix firebreak --min-severity medium

# extra source globs (repeatable)
mix firebreak --path "test/support/**/*.ex"

# skip compilation and analyse statically only
mix firebreak --no-compile
```

`mix firebreak` compiles the current project first so it can read supervision
trees exactly from `init/1`; in CI, where the app is already built, that's a
no-op. It never *starts* your tree — `init/1` only returns child specs. Pass
`--no-compile` to stay purely static (best-effort), or point Firebreak at
another project (`mix firebreak ../some_app`), where it uses that project's
`_build` artifacts if present and falls back to static parsing otherwise.

## CI gate

Fail the build when a new high-severity finding lands:

```yaml
# .github/workflows/firebreak.yml
- run: mix firebreak --fail-on high
```

`--fail-on <severity>` exits non-zero if any finding at or above that severity
is present. Pair it with `--format json` if you want to archive the full report
as a build artifact.

There's a bundled **GitHub Action** (`action.yml`) and an example workflow in
`.github/workflows/firebreak.yml`: it runs `--format github` to annotate the PR
diff with each finding, then gates the job on `--fail-on`. Copy the workflow
into your project, or `uses: b-erdem/firebreak@main` once it's published.

### Suppression and baselines

For an existing codebase with a backlog, gate on *new* coupling rather than the
whole pile:

```sh
# accept a reviewed finding forever: commit a .firebreak.exs in the project root
#   %{suppress: [
#     %{check: :cross_tree_coupling, module: MyApp.Cache.Supervisor},
#     "boot_order_dependency:MyApp.Early/MyApp.Late"   # or an exact signature
#   ]}

# snapshot today's findings once, on a green commit
mix firebreak --write-baseline .firebreak_baseline.exs

# thereafter, fail only on findings absent from the baseline
mix firebreak --baseline .firebreak_baseline.exs --fail-on info
```

Suppressions and baselines both match findings by a stable signature
(`check:module`, as in `boot_order_dependency:MyApp.Early/MyApp.Late`) that
ignores line numbers and message wording, so the allowlist doesn't churn as
unrelated code moves. `--config` overrides the default `.firebreak.exs` path.

### Topology conformance (`--expect`)

The baseline pins the *findings* you've accepted; conformance pins the *shape* of
the tree you designed. Snapshot the intended supervision topology once, commit
it, and fail the build when the tree drifts from it — a strategy quietly flipped
to `:one_for_all`, a child dropped out of a supervisor, the restart intensity
loosened:

```sh
# snapshot the intended topology on a green commit, and commit the file
mix firebreak --write-expected config/expected_topology.exs

# thereafter, report topology_drift findings (and gate on them) when the tree changes
mix firebreak --expect config/expected_topology.exs --fail-on medium
```

Drift findings carry a stable `topology_drift:<sup>/<subtype>` signature, so they
suppress and baseline like any other finding. The spec is a plain Elixir term
(read like `.firebreak.exs`) — diff it in code review to *see* the topology
change a PR introduces.

## Installation

Add `firebreak` to your dev/test dependencies:

```elixir
def deps do
  [
    {:firebreak, "~> 0.2.0", only: [:dev, :test], runtime: false}
  ]
end
```

## How it works

1. **Parse** every source file to AST and collect module facts — `use`/behaviours,
   supervisor strategy and intensity, child specs, name registrations, links and
   spawns, and outbound calls.
2. **Resolve the tree exactly** where possible: for each loadable supervisor, call
   `Mod.init/1` (the same call OTP's `supervisor` makes) to read the real
   `{flags, child_specs}` — without starting a thing — and replace the static
   guess. Un-loadable modules and `Application` roots keep the static read.
3. **Resolve** the coupling graph: map each call target (a module, or a
   registered name) to the module that owns it, and add first-level wrapper edges
   (a caller of a public API that itself couples to a process). Always static.
4. **Build the forest**: supervisors, roots, and each supervisor's subtree.
5. **Check**: run the Tier-1 structural rules and the Tier-2 cross-tree pass
   (including failure simulation and the orphan check), tagging each finding
   `exact` or `best-effort`.
6. **Optionally observe** (`--observe`): attach to a live node and fold its real
   runtime shape into the analysis before the checks run.
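
The parse step can be illustrated with a small AST walk. This is a sketch of the general technique using `Code.string_to_quoted/1` and `Macro.prewalk/3`, not Firebreak's collector; `ParseSketch` is a hypothetical module, and it only recognises `GenServer.call` with a literal module alias as the first argument.

```elixir
defmodule ParseSketch do
  @doc "Modules named as the first argument of GenServer.call/2,3 in `source`."
  def call_targets(source) do
    {:ok, ast} = Code.string_to_quoted(source)

    {_ast, targets} =
      Macro.prewalk(ast, [], fn
        # Matches GenServer.call(Some.Alias, ...)
        {{:., _, [{:__aliases__, _, [:GenServer]}, :call]}, _,
         [{:__aliases__, _, parts} | _rest]} = node,
        acc ->
          {node, [Module.concat(parts) | acc]}

        node, acc ->
          {node, acc}
      end)

    Enum.reverse(targets)
  end
end

source = """
defmodule Demo.Api do
  def fetch(key), do: GenServer.call(Demo.Cache, {:get, key})
end
"""

IO.inspect(ParseSketch.call_targets(source), label: "call targets")
```

A target passed as a variable or computed at runtime would not match this pattern, which is the kind of edge a static read can miss.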

The coupling graph is a best-effort static read, not a type system: runtime name
computation and metaprogramming can hide an edge, and when in doubt Firebreak
stays quiet rather than guessing. The supervision tree, when read from `init/1`,
is exact.

## License

MIT