# Incident Response Playbook
This guide is the operator-facing reference for responding to SAML incidents
when a Relyra deployment is live. Every login resolves to a verified trust
path or a typed rejection; this playbook is what to do when the rejection
recurs, when a metadata refresh suspends, when a signing cert rotates, or
when a replay storm hits the ACS endpoint. Relyra owns the typed-rejection
contract and the six evidence surfaces (telemetry, audit ledger, LiveView
admin, Mix tasks, login trace, troubleshooting decoder); the host owns the operational
response. The reference table below is the centerpiece — every scenario
runbook in this guide points back into it rather than restating where the
evidence lives.
## Relyra owns / Host owns
## Relyra owns
- Emitting structured `[:relyra, :saml, ...]` telemetry events at every stage
of the trust pipeline (`lib/relyra/telemetry.ex`).
- Writing append-only audit rows to the `relyra_audit_events` table for every
trust-state mutation: connection, metadata, certificate, and mapping.
- Publishing the `/relyra/admin` LiveView routes for connection / metadata /
certificate / mapping inspection and mutation.
- Shipping the 8 `mix relyra.*` operator hand-tools.
- Documenting the canonical `Relyra.Error` atom taxonomy in
[`../troubleshooting.md`](../troubleshooting.md).
## Host owns
- Telemetry storage and alerting (Relyra emits; the host's observability stack
retains, indexes, and pages on-call).
- Audit-row access control and retention (the host's policy decides who reads
`relyra_audit_events` and for how long).
- Authentication and authorization for the `/relyra/admin` LiveView surface
(Relyra exposes the routes; the host's pipeline gates them).
- Scheduling Mix-task invocation — for example, a cron entry that runs
`mix relyra.refresh_due` on a cadence appropriate to the deployment.
- The human operator playbook. This guide is a starting point; institutional
procedures (paging, escalation, communication) wrap it.
## Evidence surfaces
When an incident hits, walk this table top to bottom. Each row names the
exact code anchor — Relyra's strict-defaults posture means the operator
should never need to grep the source to find the right evidence.
| Surface | What it tells you | Exact code anchor |
| --- | --- | --- |
| Telemetry events | Structured `:start` / `:stop` / `:exception` spans for every trust-pipeline stage, plus auto-refresh state-transition events and certificate-expiry warnings | `lib/relyra/telemetry.ex` |
| Audit ledger | Append-only `relyra_audit_events` rows recording every trust-state mutation (connection / metadata / certificate / mapping) | `lib/relyra/ecto/audit_event.ex` |
| LiveView admin routes | Operator UI for connection, metadata, certificate, and mapping inspection and mutation | `lib/relyra/live_admin/router.ex` |
| Mix tasks | 8 operator hand-tools for drift checks, diagnostic bundles, login-trace inspection, scaffolding, metadata pinning, scheduled refresh, and security-review evidence | `lib/mix/tasks/relyra.*.ex` |
| Troubleshooting decoder | Per-atom decoder with the four-field micro-block (Means / Likely root cause / Operator action / Source) for every `Relyra.Error` atom | [`../troubleshooting.md`](../troubleshooting.md) |
| Login trace | Per-login-attempt step timeline (`response.decode` → `response.validate` → `signature.verify` → `replay.check` → `user.map` → `session.establish`) with `:error_code` on failed stages; browser UI or headless CLI | `lib/relyra/live_admin/connection_trace_live.ex`, `lib/mix/tasks/relyra.trace.ex`, `lib/relyra/live_admin/query.ex` |
### Login trace vs audit ledger
Login traces persist as `domain: :login` rows in `relyra_audit_events` — they
record per-attempt pipeline steps, not trust-state mutations. Trust mutations
use the audit vocabulary in the table above (`:connection`, `:metadata`,
`:certificate`, `:mapping`). Replays do not mutate trust state and write no
audit row; replay outcomes appear in login trace and telemetry only.
### Telemetry event catalog
Cite these event names verbatim when attaching `:telemetry.attach/4`
handlers. Each span event below is a `:start` / `:stop` / `:exception`
triplet — the `:stop` event carries `duration_ms` and an `:outcome` /
`:error_code` pair when the stage failed.
Span events:
- `[:relyra, :saml, :login]`
- `[:relyra, :saml, :authn_request]`
- `[:relyra, :saml, :response, :decode]`
- `[:relyra, :saml, :response, :validate]`
- `[:relyra, :saml, :signature, :verify]`
- `[:relyra, :saml, :replay, :check]`
- `[:relyra, :saml, :user, :map]`
- `[:relyra, :saml, :session, :establish]`
- `[:relyra, :saml, :metadata, :refresh]`
- `[:relyra, :saml, :metadata, :import]`
- `[:relyra, :saml, :metadata, :auto_refresh]`
Auto-refresh state-transition events (one-shot, not span-bracketed):
- `[:relyra, :saml, :metadata, :auto_refresh, :degraded]`
- `[:relyra, :saml, :metadata, :auto_refresh, :suspended]`
- `[:relyra, :saml, :metadata, :auto_refresh, :recovered]`
- `[:relyra, :saml, :metadata, :auto_refresh, :validity_warning]`
- `[:relyra, :saml, :metadata, :auto_refresh, :skipped]`
Certificate expiry warning event (one-shot):
- `[:relyra, :saml, :certificate, :expiring]`
### Audit vocabulary
Filter `relyra_audit_events` by these atom literals — they are the exact
values stored in the `:domain` and `:action` columns
(`lib/relyra/ecto/audit_event.ex:13-26`).
```
@domain_values = [:connection, :metadata, :certificate, :mapping]
@action_values = [:created, :updated, :enabled, :disabled, :applied,
:refreshed, :staged, :activated, :retired, :replaced,
:deleted]
```
Trust mutations co-commit an audit row inside the same Ecto transaction —
if a connection edit or a metadata apply succeeded, the row is present.
Replays do not mutate trust state and therefore write no audit row; see
Scenario 3 below.
### LiveView admin routes
The route parameter is `:connection_id` everywhere a connection is
referenced — match against this exact name in your host's authorization
plug. The path prefix `/relyra/admin` is configurable when you mount the
routes, but the suffix shapes are fixed.
| Path | Module / action |
| --- | --- |
| `/relyra/admin/` | `Relyra.LiveAdmin.ConnectionsLive` `:index` |
| `/relyra/admin/connections/new` | `Relyra.LiveAdmin.ConnectionsLive` `:new` |
| `/relyra/admin/connections/:connection_id` | `Relyra.LiveAdmin.ConnectionsLive` `:show` |
| `/relyra/admin/connections/:connection_id/edit` | `Relyra.LiveAdmin.ConnectionsLive` `:edit` |
| `/relyra/admin/connections/:connection_id/metadata` | `Relyra.LiveAdmin.ConnectionMetadataLive` `:metadata` |
| `/relyra/admin/connections/:connection_id/trace` | `Relyra.LiveAdmin.ConnectionTraceLive` `:trace` |
| `/relyra/admin/diagnostic/bundle` | `Relyra.Phoenix.Controllers.DiagnosticController` `:download` |
Operators reach the trace page from connection detail **View Login Trace**
(`lib/relyra/live_admin/components/connection_detail.ex`); the path prefix
remains configurable via host mount, suffix shape fixed.
### Mix tasks
These are the 8 Relyra operator hand-tools. `hex.audit` is a third-party
Hex task and is NOT a Relyra hand-tool — do not include it in operator
runbooks alongside these.
| Task | Purpose |
| --- | --- |
| `mix relyra.batteries_included` | Generate or drift-check BATTERIES_INCLUDED.md. |
| `mix relyra.conformance` | Generate or drift-check CONFORMANCE.md. |
| `mix relyra.diagnostic` | Generate a Relyra diagnostic bundle. |
| `mix relyra.trace` | Print redacted per-login step timelines for a connection (headless; same data as LiveView trace). Requires `--repo` and `--connection`; optional `--last` (default **20**). |
| `mix relyra.install` | Scaffold the minimal Relyra integration surface. |
| `mix relyra.metadata.pin` | Pin a SHA-256 metadata trust fingerprint on a connection. |
| `mix relyra.refresh_due` | Refresh any metadata sources whose schedule is due. |
| `mix relyra.security_review` | Generate or drift-check SECURITY_REVIEW_EVIDENCE.md. |
```bash
mix relyra.trace --repo MyApp.Repo --connection CONNECTION_ID
mix relyra.trace --repo MyApp.Repo --connection CONNECTION_ID --last 5
```
Use **LiveView** (`/relyra/admin/connections/:connection_id/trace`) for browser
triage during an incident; use **`mix relyra.trace`** for SSH/headless on-call
without a browser. Both call `Query.get_login_traces/4` with shared redaction
(`Relyra.LoginTrace.Export`).
## Scenario 1: Certificate expiry imminent
Symptom: no `Relyra.Error` atom — Relyra warns proactively before any
signature failure occurs. The first signal is telemetry, not a rejection.
1. Triage — the `[:relyra, :saml, :certificate, :expiring]` event (see the
telemetry catalog in the surface table) fires with `:days_until_expiry`
and `:fingerprint_sha256` measurements. Confirm the `:connection_id`
identifies the connection at risk. If your observability stack pages on
this event, this is the moment to acknowledge.
2. Diagnose — open `/relyra/admin/connections/:connection_id` and review
the certificate inventory section for the affected connection. Query
the audit ledger for `domain = :certificate` rows with `action` in
`[:staged, :activated]` to see prior rollover history on this
connection. If you need a redacted snapshot to share with the IdP
vendor, run `mix relyra.diagnostic`.
3. Recover — stage the new IdP signing certificate via the admin UI
(audit row `domain = :certificate`, `action = :staged` will record the
stage). When the IdP cuts over, activate the staged certificate
(`action = :activated`) and retire the old one (`action = :retired`).
If the IdP publishes the new cert via metadata, run
`mix relyra.refresh_due` first to pick it up.
## Scenario 2: Metadata drift after IdP change
Symptom: [`:metadata_drift_requires_review`](../troubleshooting.md#metadata_drift_requires_review)
returns from metadata-apply attempts; auto-refresh telemetry may have
transitioned the source to `:degraded` or `:suspended`.
1. Triage — check the auto-refresh state-transition events from the
telemetry catalog: `[:relyra, :saml, :metadata, :auto_refresh, :degraded]`
or `:suspended` indicate the auto-refresh worker has flagged the
source. The audit ledger will show `domain = :metadata` rows with
`action = :applied` carrying a drift flag.
2. Diagnose — open `/relyra/admin/connections/:connection_id/metadata`
for a side-by-side of the staged-vs-current trust state. Query the
audit ledger for `domain = :metadata` rows with `action` in
`[:applied, :refreshed]` over the last 24 hours to see what the
auto-refresh worker has been doing on this source. Cross-reference
the atom decoder at
[`../troubleshooting.md#metadata_drift_requires_review`](../troubleshooting.md#metadata_drift_requires_review).
3. Recover — if the drift is legitimate (the IdP rotated its signing
cert or moved its SSO endpoint), run `mix relyra.refresh_due` to
force the worker to retry, then approve the drift via the admin UI
(audit row `domain = :metadata`, `action = :applied` will record the
approval). If the deployment requires byte-exact metadata identity,
run `mix relyra.metadata.pin` with the new SHA-256 fingerprint to
harden the connection against future drift.
## Scenario 3: Replay storm
Symptom: [`:replayed_assertion`](../troubleshooting.md#replayed_assertion)
in logs, repeating across many login attempts in a short window.
1. Triage — the `[:relyra, :saml, :replay, :check]` telemetry event
(see the catalog) fires with repeated `:stop` events whose
`:outcome` is `:error` and whose `:error_code` is
`:replayed_assertion`. **Replays do not mutate trust state, so no
audit row is written** — `lib/relyra/replay_store/ecto.ex` and
`lib/relyra/replay_store/ets.ex` contain zero `AuditWriter.append_event`
calls. Operators rely on `[:relyra, :saml, :replay, :check]`
telemetry alone for replay-storm detection; the audit ledger will
not corroborate.
2. Diagnose — With triage confirming repeated `:replayed_assertion` on
this connection, open `/relyra/admin/connections/:connection_id/trace`
(or `mix relyra.trace --repo MyApp.Repo --connection CONNECTION_ID`)
and inspect **`replay.check`** step outcomes and volume across recent
attempts. Because replays write no audit rows, login trace and
telemetry remain the primary evidence surfaces before you tune
Recover-rate limits. Use host logs for source-IP grouping as a
secondary signal; `:connection_id` in telemetry metadata localizes the
storm to one connection when possible.
3. Recover — apply host-app-level rate limiting on the ACS endpoint;
replays are protocol-correct messages that the SAML library cannot
refuse to receive (only refuse to accept). If the storm correlates
with a single connection, disable that connection via
`/relyra/admin/connections/:connection_id/edit` while you
investigate (audit row `domain = :connection`, `action = :disabled`
will record the disable). Re-enable
(`domain = :connection`, `action = :enabled`) when the source is
confirmed legitimate or rate-limited.
## Scenario 4: Signature regression after IdP key rotation
Symptom: [`:digest_mismatch`](../troubleshooting.md#digest_mismatch),
[`:invalid_signature`](../troubleshooting.md#invalid_signature), or
[`:trust_anchor_mismatch`](../troubleshooting.md#trust_anchor_mismatch)
returning from every login attempt for a specific connection.
1. Triage — the `[:relyra, :saml, :signature, :verify]` telemetry event
(see the catalog) fires `:stop` with `:outcome` `:error` and the
exact atom in `:error_code`. Group by `:connection_id` to confirm
the regression is isolated to one connection — a fleet-wide signature
failure indicates an algorithm-policy change, not a key rotation.
2. Diagnose — With triage isolating signature failures to one
connection, open `/relyra/admin/connections/:connection_id` and review
the signing certificate inventory — the IdP may have rotated to a key
Relyra does not yet trust. Open login trace
(`/relyra/admin/connections/:connection_id/trace` or
`mix relyra.trace --repo MyApp.Repo --connection CONNECTION_ID`) and
inspect **`signature.verify`** for the exact `:error_code`
(`:digest_mismatch`, `:invalid_signature`, `:trust_anchor_mismatch`)
before or alongside cert review. Run `mix relyra.diagnostic` to bundle
connection state plus a redacted failing response for the IdP vendor.
Cross-reference
[`../troubleshooting.md#digest_mismatch`](../troubleshooting.md#digest_mismatch)
and
[`../troubleshooting.md#trust_anchor_mismatch`](../troubleshooting.md#trust_anchor_mismatch)
for the distinction between failure modes — align the cert inventory
before Recover staging.
3. Recover — stage the new IdP signing certificate via the admin UI
cert inventory (audit row `domain = :certificate`,
`action = :staged`); confirm the next login succeeds against the
staged cert; activate it (`action = :activated`) and retire the old
one (`action = :retired`). The signing source is configured IdP
certs only — Relyra will never trust a document `KeyInfo` no matter
how convenient it would be at this moment.
## Scenario 5: ACS misconfiguration at provisioning
Symptom: [`:destination_mismatch`](../troubleshooting.md#destination_mismatch),
[`:recipient_mismatch`](../troubleshooting.md#recipient_mismatch), or
[`:in_response_to_mismatch`](../troubleshooting.md#in_response_to_mismatch)
on every login for a new or freshly reconfigured connection.
1. Triage — the `[:relyra, :saml, :response, :validate]` telemetry event
(see the catalog) fires `:stop` with `:outcome` `:error` and the atom
in `:error_code`. The atom names exactly which field mismatched —
`:destination_mismatch` is the `Destination` attribute on the
response, `:recipient_mismatch` is the `SubjectConfirmationData
Recipient`, and `:in_response_to_mismatch` is the response's
`InResponseTo` attribute against the request store.
2. Diagnose — With triage naming the exact validation atom, open
`/relyra/admin/connections/:connection_id/edit` and verify `acs_url`,
`sp_entity_id`, `idp_entity_id`, and `idp_sso_url` match the IdP's
published metadata. Use login trace **`response.validate`** to see which
field failed (`:destination_mismatch`, `:recipient_mismatch`,
`:in_response_to_mismatch`) before or alongside field verification —
the IdP-side error message reveals the mismatched value.
Cross-reference
[`../troubleshooting.md#destination_mismatch`](../troubleshooting.md#destination_mismatch)
for field semantics, then apply Recover field updates once canonical
values are confirmed.
3. Recover — update the affected connection field via the admin UI; the
audit row `domain = :connection`, `action = :updated` records the
change. If your deployment uses metadata-driven configuration, fix
the canonical value at the source (your metadata generator or IdP
admin console) and run `mix relyra.refresh_due` to pick it up.
## Scenario 6: Attribute mapping breakage
Symptom: [`:invalid_audience`](../troubleshooting.md#invalid_audience)
or mapping-stage errors surfacing at the host-app boundary (the user
mapper rejects the principal, or the host session adapter rejects the
attribute set).
1. Triage — the `[:relyra, :saml, :user, :map]` telemetry event (see
the catalog) fires `:stop` with `:outcome` `:error`. Query the
audit ledger for `domain = :mapping` rows with `action` in
`[:created, :updated]` to see recent mapping changes on the
connection — a recent mapping edit is the most common cause.
2. Diagnose — With triage pointing at recent `domain = :mapping` audit
rows, open `/relyra/admin/connections/:connection_id` and review the
mapping section plus the audit timeline. When mapping-stage telemetry
is insufficient, open login trace and inspect **`user.map`** for
mapper-stage outcome and error detail. Cross-reference
[`../troubleshooting.md#invalid_audience`](../troubleshooting.md#invalid_audience)
for audience mismatch. If the failure is in the host `UserMapper`
callback rather than connection mapping configuration, the fix belongs
in host application code — see
[`../identity_mapping_and_provisioning.md`](../identity_mapping_and_provisioning.md)
— before Recover updates the admin mapping UI.
3. Recover — update the mapping via the admin UI (audit row
`domain = :mapping`, `action = :updated`); confirm the next login
for a representative user succeeds. If the host application owns
the failing logic in its `UserMapper`, ship the fix there and
redeploy — Relyra emits the attribute set unchanged, so it never
needs a library change for mapping-policy edits.
## When in doubt
Every login resolves to a verified trust path or a typed rejection. Two
operator tools answer different questions — do not conflate them:
- **`mix relyra.diagnostic`** — redacted diagnostic **bundle** for IdP/vendor
handoff, security review, or Relyra issue filing (connection state + recent
audit + failing response snapshot). Safe to share externally after redaction.
Route the bundle to your IdP support contact when the symptom is a
signing-cert regression, to your security review queue when an active attack
is suspected, or attach it to a Relyra issue when you suspect a library bug.
The diagnostic bundle redactor strips secret material before the archive is written.
- **Login trace** (LiveView `/relyra/admin/connections/:connection_id/trace`
or `mix relyra.trace --repo MyApp.Repo --connection CONNECTION_ID`) —
per-attempt **step timeline** answering "which pipeline stage failed and
with what `:error_code`?" Use during active triage before escalating with a
diagnostic bundle.