# Parapet Operator UI Guide

The Parapet Operator UI is an optional, generated LiveView workbench that sits inside your host application. Rather than offering another dashboard with raw telemetry, it provides a strictly controlled surface for initiating actionable mitigations when an SLO is burning, with an immutable audit trail for every action.

The workbench remains host-owned and evidence-first: a summary comes first to establish current truth, the canonical timeline follows immediately to establish chronology, and risky controls appear only after enough context is visible.

Phase 6 extends that boundary with fault-domain triage for async and delivery incidents. The workbench now treats a compact evidence-backed triage block as the current-state index and the incident chronology as the authoritative source of sequence.

## Prerequisites

- Phoenix and LiveView installed in your host app
- Parapet installed and configured (`mix parapet.install`)
- A router with an existing authenticated pipeline or `live_session`

## Installation

The UI remains optional. If you want the installer to compose it for you, use:

```bash
mix parapet.install --with-ui
```

To keep the core install path but explicitly skip the UI step (for example, in automation), use:

```bash
mix parapet.install --skip-ui
```

You can also run the generator directly from the root of your project:

```bash
mix parapet.gen.ui
```

This scaffolds three files into your `lib/my_app_web/live/parapet/` directory:

- `operator_live.ex` (the main workbench view)
- `operator_detail_live.ex` (the detailed view of a specific SLO or incident)
- `operator_components.ex` (reusable UI components)

### Mounting the Operator UI

The generated files belong to your application. The UI is only relevant when Phoenix LiveView is present, and Parapet does **not** provide its own authentication system. You must mount the operator routes inside your application's authenticated scope to ensure the UI is secured according to your app's existing authorization policies.

Update your `router.ex` to include the Parapet routes within a protected area:

```elixir
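# lib/my_app_web/router.ex

# Module names inside this scope are prefixed with MyAppWeb, so adjust the
# two `live` entries below to match the modules the generator wrote.
scope "/admin", MyAppWeb do
  # The browser pipeline plus your app's existing authentication plug.
  pipe_through [:browser, :require_authenticated_user]

  # The on_mount hook runs each time a LiveView in this session mounts.
  live_session :parapet_operator,
    on_mount: [{MyAppWeb.UserAuth, :ensure_authenticated}] do

    live "/parapet", Parapet.OperatorLive.Index, :index
    live "/parapet/:id", Parapet.OperatorDetailLive.Show, :show
  end
end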
```

## Security and Verification

The Parapet Doctor includes a dedicated check to verify that your operator UI is securely mounted.

Run the doctor task to ensure the UI is not exposed publicly:

```bash
mix parapet.doctor
```

If the doctor detects that `OperatorLive` or `OperatorDetailLive` is mounted outside an authenticated scope, it reports a `warn` finding (`Unsecured operator UI LiveView found`).

A local doctor run fails only on `error` findings. In CI, pass `--ci` to treat warnings as blocking:

```bash
mix parapet.doctor --ci
```

To inspect live cluster facts for the same install, use:

```bash
mix parapet.doctor cluster
```

That runtime mode reports evidence-backed live facts, but it still does not prove distributed correctness on its own.

## Phase 3 Performance Proof Lane

Phase 3 keeps the generated incident queue bounded and operator-paced under large-installation load.

- The default queue remains active-only (`open` and `investigating`).
- Queue refresh is explicit. Background changes should surface a visible refresh affordance instead of silently reordering the list while an operator is reading it.
- Queue-side `Resolve` is a real lifecycle transition through `Parapet.Operator.resolve_incident/2`, not a UI-only note shortcut.
- Phase 3 remains the canonical runtime proof owner for this seam through the named `generated resolve-flow proof lane`.
- Performance proof is layered: bounded queue telemetry in `Parapet.Operator`, deterministic queue tests, and an opt-in advisory benchmark lane.
- The `generated resolve-flow proof lane` stays in the targeted `mix test test/parapet/generated_operator_live_paging_test.exs test/parapet/operator_ui_integration_test.exs test/mix/tasks/parapet.gen.ui_test.exs` lane rather than a heavier browser harness.
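
The explicit-refresh rule above can be sketched as a small pure data structure. This is a minimal illustration, not the generated LiveView code: the module name `QueueRefreshSketch`, its fields, and its function names are all hypothetical. In a real LiveView the `fetch_page` function would presumably wrap `Parapet.Operator.list_incident_queue/1`, whose arguments are not documented in this guide.

```elixir
defmodule QueueRefreshSketch do
  @moduledoc "Hypothetical sketch of an operator-paced queue; not generated code."

  defstruct rows: [], stale: false

  def new(rows) when is_list(rows), do: %__MODULE__{rows: rows}

  # A background change only marks the queue stale. The visible rows keep
  # their order so the list does not move while an operator is reading it.
  def note_background_change(%__MODULE__{} = queue), do: %{queue | stale: true}

  # Rows are replaced only when the operator explicitly asks for a refresh.
  def refresh(%__MODULE__{} = queue, fetch_page) when is_function(fetch_page, 0) do
    %{queue | rows: fetch_page.(), stale: false}
  end
end
```

A template could render the refresh affordance whenever `stale` is `true` and call `refresh/2` from the corresponding click handler.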

### Advisory 50k+ Benchmark

Run the advisory proof lane with:

```bash
mix run bench/operator_ui_perf.exs
```

What the script does:

- Bootstraps a deterministic in-memory dataset of exactly `50,120` incidents.
- Uses `50,000` active incidents plus `120` resolved incidents so the active queue and resolved history both exist without changing the default queue scope.
- Measures the bounded `Parapet.Operator.list_incident_queue/1` fetch for one `30`-row page.
- Compiles the generated `OperatorLive` and `OperatorComponents` templates, mounts the generated LiveView path adopters use, and measures the first render for that same bounded page.
- Verifies the rendered first page still shows `30` active rows, excludes resolved incidents, and reports that additional pages remain.

What success looks like:

- The script exits `0`.
- Output includes `queue.visible_rows=30` and `render.visible_rows=30`.
- Output includes finite `queue_fetch_ms=` and `first_render_ms=` summaries.
- Output includes `advisory=true` and `merge_gate=disabled`, confirming this lane is reproducible but not part of the default `mix test` merge gate.

This lane is intentionally advisory. It is for reproducible operator-UI proof at scale, not a default CI blocker.
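
If you want to check the success markers programmatically, one possible approach is a substring check over the script's output. The helper below is a hypothetical sketch: `BenchOutputCheckSketch` is not part of Parapet, it only looks for the markers listed above, and it does not verify that the timing values are finite.

```elixir
defmodule BenchOutputCheckSketch do
  @moduledoc "Hypothetical helper: checks benchmark output for the documented markers."

  # Substrings this guide says a successful run prints.
  @markers [
    "queue.visible_rows=30",
    "render.visible_rows=30",
    "queue_fetch_ms=",
    "first_render_ms=",
    "advisory=true",
    "merge_gate=disabled"
  ]

  # Returns the markers that are absent; an empty list means all were found.
  def missing_markers(output) when is_binary(output) do
    Enum.reject(@markers, &String.contains?(output, &1))
  end
end

# Usage, from the project root (System.cmd/2 returns {output, exit_status}):
#   {output, 0} = System.cmd("mix", ["run", "bench/operator_ui_perf.exs"])
#   [] = BenchOutputCheckSketch.missing_markers(output)
```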

## Evidence-First Design

The Parapet operator workbench adheres to strict evidence-first design principles (D-05, D-07 to D-12, D-17 to D-19):

1. **Grafana/Runbooks are External:** (D-05) The Parapet UI does not attempt to replace Grafana for telemetry exploration or Notion/Confluence for runbooks. It provides focused, context-aware links to these external tools rather than duplicating their functionality.
2. **First-Class Actions:** (D-07 to D-09) The UI surface is explicitly limited to initiating predefined, safe mitigation actions. It is not an arbitrary admin console.
3. **Immutable Factual Timelines:** (D-10 to D-12) Any events or incidents viewed within the UI reflect immutable facts stored in the evidence spine. The UI reads these facts but cannot alter history.
4. **Required Audit Context:** (D-17 to D-19) Every mutating action triggered from the workbench automatically captures audit context, including the actor's identity and the rationale. This ensures every operational change leaves a durable, queryable trace.
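
To make the audit-context requirement concrete, here is a minimal sketch of a value that cannot be built without both an actor and a rationale. The module `AuditContextSketch` and its field names are assumptions for illustration; this guide does not document how Parapet represents audit context internally.

```elixir
defmodule AuditContextSketch do
  @moduledoc "Hypothetical sketch: audit context that must name an actor and a rationale."

  @enforce_keys [:actor, :rationale]
  defstruct [:actor, :rationale]

  # Both values must be non-blank strings, otherwise no context is built
  # and the caller has nothing to attach to a mutating action.
  def new(actor, rationale) when is_binary(actor) and is_binary(rationale) do
    if String.trim(actor) == "" or String.trim(rationale) == "" do
      {:error, :missing_audit_context}
    else
      {:ok, %__MODULE__{actor: actor, rationale: rationale}}
    end
  end

  def new(_actor, _rationale), do: {:error, :missing_audit_context}
end
```

Requiring the context up front, rather than filling it in afterwards, is one way to guarantee that no mutation is recorded without a trace of who acted and why.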

## Phase 4 Escalation Surfacing

Phase 4 extends the generated detail view with escalation-aware operator surfacing without widening Parapet into a control-plane console.

The generated detail page should render:

1. A summary-first escalation status block that answers current state, next derived step, the active escalation chain, time-until-next-escalation when durable truth supports it, suppression state, and whether the system already acted.
2. The canonical timeline directly underneath that summary, with typed entries that make system automation, operator actions, and external evidence visibly distinct.
3. Bounded manual controls only after the summary and chronology are visible.

### Durable Escalation Truth

- Escalation status in the UI is a derived projection over durable incident state and timeline evidence.
- The active escalation chain and countdown are read-only projections over bounded durable fields such as current step and next escalation timestamp.
- The canonical timeline remains the authoritative sequence of what happened.
- System-executed mitigations and escalation actions stay inside that single chronology; there is no second automation narrative.

### Bounded Manual Controls

- `Trigger Next Escalation` records operator intent through the public `Parapet.Operator` API.
- `Suppress Pending Escalation` records a durable, expiring suppression window through the same audited seam.
- Escalation controls should only be offered while the incident is still open; investigating and resolved incidents remain read-oriented.
- Suppression is not scheduler surgery, direct Oban job manipulation, or hidden UI-only state. Workers remain the final truth gate.
- Generated LiveView code should refresh `Parapet.Operator.incident_detail/1` after those actions rather than maintain its own escalation state machine.
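
The read-only projections and the open-only rule for controls can be sketched as pure functions over incident data. This is an illustration under stated assumptions: the `:status` and `:next_escalation_at` keys, the atom status values, and the module name `EscalationViewSketch` are hypothetical stand-ins for whatever `Parapet.Operator.incident_detail/1` actually returns.

```elixir
defmodule EscalationViewSketch do
  @moduledoc "Hypothetical projections over durable incident fields; no local state machine."

  # Escalation controls are offered only while the incident is open.
  def controls_visible?(%{status: :open}), do: true
  def controls_visible?(%{status: _other}), do: false

  # Countdown derived from the durable timestamp, clamped at zero.
  def seconds_until_next(%{next_escalation_at: %DateTime{} = at}, %DateTime{} = now) do
    max(DateTime.diff(at, now, :second), 0)
  end

  # Without a durable timestamp there is no countdown to show.
  def seconds_until_next(_incident, _now), do: nil
end
```

Because both functions are derived from the incident on every call, refreshing the incident after an action is enough to update the view.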

## Phase 6 Triage Contract

For async and delivery incidents, the generated detail view should render:

1. A compact triage block derived from durable evidence only.
2. The normalized chronology immediately underneath it.
3. External links outward to provider consoles, Grafana, and runbooks.
4. Exact action items only when one concrete object needs manual follow-up.

The triage block is sourced from the incident summary in `runbook_data["triage"]` and the latest `triage_snapshot` timeline entry. It should answer:

- Observed symptom
- Likely fault plane
- Why we think that, using 2-4 bounded evidence facts
- Safe next step

The detail page should not infer fault planes by parsing titles, should not treat `runbook_data` as a hidden timeline, and should not attempt provider-console-style forensics.
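
A minimal sketch of building the triage block from durable data might look like the following. Only `runbook_data["triage"]` comes from this guide. The inner keys (`"symptom"`, `"fault_plane"`, `"evidence"`, `"next_step"`) and the module name are assumptions, and the sketch ignores the `triage_snapshot` timeline entry for brevity.

```elixir
defmodule TriageBlockSketch do
  @moduledoc "Hypothetical sketch: triage block read from durable evidence only."

  # The fault plane is read from stored triage data, never parsed from a title.
  def from_incident(%{runbook_data: %{"triage" => triage}}) when is_map(triage) do
    evidence = Map.get(triage, "evidence", [])

    # The block should carry 2-4 bounded evidence facts.
    if is_list(evidence) and length(evidence) in 2..4 do
      {:ok,
       %{
         symptom: triage["symptom"],
         fault_plane: triage["fault_plane"],
         evidence: evidence,
         next_step: triage["next_step"]
       }}
    else
      {:error, :unbounded_evidence}
    end
  end

  # No durable triage data means no triage block, rather than a guess.
  def from_incident(_incident), do: {:error, :no_triage}
end
```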

## Exact Follow-Up Only

`ActionItem`s remain a narrow exact-object seam. They are appropriate when one concrete async or delivery object needs operator attention, such as a suppressed delivery, dead-lettered job, stale workflow, or orphaned callback. They are not generic investigation todos, ownership queues, or SLA-tracked incident tasks.

## Phase 7 Preview-First Recovery

Phase 7 introduces a formal recovery model built on top of the Phase 6 triage foundation. The workbench moves from evidence display to guided recovery, while maintaining strict safety boundaries.

### Safe Recovery Principles

The operator workbench adheres to these recovery principles (D-20, D-21):

1. **Chronology First:** Investigation always starts with the chronological evidence. Recovery actions are only considered after the operator has reviewed the triage facts and timeline.
2. **Preview Before Mutation:** Destructive or mutating recovery actions (e.g., retrying jobs, clearing suppressions) must be previewed. The UI renders the exact scope of the change, warnings, and idempotency caveats before asking for confirmation.
3. **Bounded Recovery:** Recovery is not a broad admin console. It is limited to the specific capabilities and runbook steps defined for the burning SLO.
4. **Exact-Item Preference:** Scoped recovery targeting specific `ActionItem`s (exact-item recovery) is preferred over bulk replays or opaque automation.

### Recovery Flow

The generated UI implements a three-state recovery flow for runbook steps:

- **Guidance:** For steps that are purely informational or not yet wired to a host capability. These render with guidance text and no action button.
- **Preview:** For executable steps, the operator first clicks "Preview". This triggers a call to the host capability to calculate the effect of the recovery (e.g., "This will retry 42 suppressed deliveries").
- **Confirm:** After reviewing the preview, warnings, and target scope, the operator confirms the action. This execution is recorded in the immutable timeline with a unique idempotency key.
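
The three states above can be sketched as a small state machine in which confirmation is unreachable without a preview. This is an illustration, not the generated code: the module name, the state atoms, and the shape of a step (`%{capability: ...}`) are assumptions, and the random idempotency key is just one way to produce a unique value.

```elixir
defmodule RecoveryFlowSketch do
  @moduledoc "Hypothetical sketch of the guidance / preview / confirm flow."

  # Steps without a wired capability are guidance-only: no action button.
  def start(%{capability: nil}), do: :guidance
  def start(%{capability: _name}), do: :needs_preview

  # Preview asks the capability for the effect without mutating anything.
  def preview(:needs_preview, preview_fun) when is_function(preview_fun, 0) do
    {:previewed, preview_fun.()}
  end

  # Confirm is only reachable from a previewed state. The key is passed to
  # the executor so the action can be recorded and deduplicated.
  def confirm({:previewed, preview}, execute_fun) when is_function(execute_fun, 2) do
    key = Base.encode16(:crypto.strong_rand_bytes(16), case: :lower)
    {:confirmed, key, execute_fun.(preview, key)}
  end

  def confirm(_other, _execute_fun), do: {:error, :preview_required}
end
```

Passing the previewed value into `confirm/2` means the executed scope is the one the operator actually reviewed.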

### Named Capabilities

Recovery actions are backed by named host capabilities. These capabilities are responsible for:
- Validating preconditions.
- Generating a time-bounded preview.
- Executing the mutation idempotently.
- Reporting success or failure back to the evidence spine.

By naming and bounding these capabilities, the host application maintains control over what the operator can do, ensuring the workbench remains a safe environment for high-stakes incident response.
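
One way a host application might express such a capability is an Elixir behaviour. The contract below is a hypothetical sketch: the behaviour name, callback names, and return shapes are assumptions, not Parapet's documented API. The example implementation is a stub that performs no real retry, and reporting to the evidence spine is left to the caller.

```elixir
defmodule RecoveryCapabilitySketch do
  @moduledoc "Hypothetical contract for a named host recovery capability."

  @callback validate(target :: map()) :: :ok | {:error, term()}
  @callback preview(target :: map(), now :: DateTime.t()) :: {:ok, map()}
  @callback execute(target :: map(), idempotency_key :: String.t()) ::
              {:ok, term()} | {:error, term()}
end

defmodule RetryDeliveriesSketch do
  @moduledoc "Stub capability for illustration; it does not touch any real deliveries."
  @behaviour RecoveryCapabilitySketch

  # Precondition: there must be at least one delivery to retry.
  @impl true
  def validate(%{delivery_ids: [_ | _]}), do: :ok
  def validate(_target), do: {:error, :nothing_to_retry}

  # The preview is time-bounded: here it expires five minutes after `now`.
  @impl true
  def preview(%{delivery_ids: ids}, %DateTime{} = now) do
    {:ok,
     %{
       summary: "This will retry #{length(ids)} suppressed deliveries",
       expires_at: DateTime.add(now, 300, :second)
     }}
  end

  # A real implementation would use the key to skip work already done.
  @impl true
  def execute(%{delivery_ids: ids}, idempotency_key) when is_binary(idempotency_key) do
    {:ok, %{retried: length(ids), idempotency_key: idempotency_key}}
  end
end
```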