Skip to main content

docs/HISTORY.md

# Parapet Milestone History

This document records the development milestones for Parapet v0.1–v0.9.
These are planning tranches, not Hex release versions — the package was not
published to hex.pm during this period.

For the changelog of published Hex releases (v0.10+), see [CHANGELOG.md](../CHANGELOG.md).

---

## v0.9 — Performance, Scale & DX (2026-05-23)

- Shipped proactive TSDB cardinality protection: a `mix parapet.doctor cardinality` static
  analyzer plus a compile-time Parapet.Metrics.Validator enforcing a 10-label ceiling per metric,
  applied across all built-in metrics and adapter SLIs.
- Delivered database scale and pruning: composite indexes for Incident, TimelineEntry, and
  ToolAudit at over 100k rows, a Parapet.Evidence.Archiver with resolved-only retention, and
  a `mix parapet.archive` task plus Oban cron worker that never prunes active investigating work.
- Made the Operator UI responsive under load with bounded queue paging, index-aware Operator
  queries, and a 50k+ incident benchmark — and repaired the generated resolve flow so the
  active-to-resolved lifecycle is true again.
- Unified the Day-1 experience under `mix parapet.install`, a deterministic Igniter orchestrator
  that chains spine/prometheus/ui with explicit opt-in extras, backed by severity-aware multi-node
  `mix parapet.doctor` checks (for example, Oban uniqueness).
- Proved multi-node safety with Ecto-backed action claims and circuit breakers under concurrency
  simulation, plus an environment-conditional peer-node canary that skips cleanly without
  distributed Erlang.
- Hardened milestone closure: phases 6–14 backfilled milestone-grade verification surfaces,
  reconciled planning-artifact drift, tightened archive retention, and added a
  regression-catching closure-proof chain for the generated operator UI.

**Stats:** ~20,274 LOC (Elixir/EEx, lib+priv+test) · Phases 1–14 (5 core, 9 closure) · 36 plans · 88 commits · 2026-05-19 → 2026-05-23

---

## v0.8 — Deterministic Escalation & Bounded Mitigation (2026-05-19)

- Built a durable Oban-backed escalation engine (Parapet.Escalation.Worker) that routes incidents
  to next tiers unless acknowledged or resolved.
- Implemented system-identity execution for Bounded Runbooks to safely perform auto-mitigations
  using the Parapet.Operator API, logging all actions under the `:system` URN identity.
- Created an Ecto-backed circuit breaker leveraging ToolAudit histories to prevent mitigation
  flap-loops: once a mitigation has run N times, the breaker trips and escalates instead.
- Updated the LiveView Operator UI to visualize escalation chains and distinctively style
  system-executed mitigations with manual trigger overrides for operator control.

**Stats:** ~13,900 LOC (Elixir/EEx) · Phases 1–4 · 8 plans · 2026-05-19

---

## v0.7 — Async & Delivery Reliability (2026-05-18)

- Established safe telemetry contracts for Mailglass, Chimeway, and Rindle integrations to emit
  bounded async and delivery events using normalized event semantics for diverse external providers.
- Implemented out-of-the-box provider-first SLOs for async pipeline health and provider delivery
  states, including multi-burn-rate PromQL alerts.
- Created explicit fault-domain triage enrichment for async and delivery incidents, leveraging
  durable evidence — triage snapshot chronology — over UI-derived heuristics.
- Added safe, host-wired recovery runbook templates for stalled async work, covering dead-letter
  handling, provider outage recovery, stalled job cleanup, and callback delay flows.

**Stats:** ~13,401 LOC (Elixir/EEx) · Phases 4–7 · 12 plans · 2026-05-18

---

## v0.6 — Change Correlation & Audit Trailing (2026-05-17)

- Implemented OpenTelemetry trace exemplar extraction from events and process dictionaries,
  appending trace identifiers to generated Prometheus metrics.
- Added trace identifier storage to Ecto Incident schemas and dynamically formatted trace links
  within the Operator UI for one-click navigation to external trace backends.
- Consumed Rulestead feature flag toggles via telemetry, creating durable timeline entries and
  suspect change markers to instantly correlate feature flag changes with SLO burn rates.
- Highlighted recent proximate system changes (like flag toggles) on active incidents in the
  Operator UI, distinguishing them visually from human actions.
- Implemented Parapet.Integrations.Threadline for compliance sync, mirroring Operator audit
  actions to Threadline event logs.
- Added dual audit modes (`:threadline_deferred` and `:dual_write`) to satisfy strict compliance
  constraints, including bypassing internal Parapet storage entirely when deferred.

**Stats:** 8,968 LOC (Elixir/EEx) · Phases 1–3 · 9 plans · 2026-05-17

---

## v0.5 — Proactive Resilience & Copilot Triage (2026-05-16)

- Implemented Parapet.Probe for defining and scheduling active synthetic canaries via
  NativeScheduler and ObanScheduler, enabling proactive health detection before alerts fire.
- Expanded Sigra and Accrue integrations to emit explicit login, signup, and checkout SLIs for
  business-critical journey monitoring.
- Built a Parapet MCP server to allow AI agents to safely read incident data and act as triage
  copilots, providing structured access without write permissions.
- Resolved compilation and type warnings across the project, achieving a clean zero-warning
  compilation state.

**Stats:** ~8,500 LOC (Elixir/EEx) · Phases 1–3 · 9 plans · 2026-05-16

---

## v0.4 — Scoria AI Integration (2026-05-15)

- Implemented telemetry translation consuming Scoria.SRE.Telemetry events and producing Parapet
  Prometheus metrics and durable Ecto Incidents for AI infrastructure observability.
- Built Parapet.SLO.ScoriaEval to define and alert on Eval-Driven SLOs based on Scoria
  deterministic evaluation scores, with Grafana visualization for SLO error budget correlation.
- Added native tracking of AI Config Changes (scorer_version, baseline_version, model) to
  correlate configuration drift with SLO degradation.
- Monitored Scoria MCP tool failure modes (timeout, execution_failed, breaker_open, access_denied)
  as explicit SLIs using bounded atoms to protect Ecto from high-volume telemetry.
- Monitored Scoria workflow approval pauses as durable human-in-the-loop states, triggering alerts
  on stale requests, and extended the Operator UI with deep-links to Scoria's durable evidence.

**Stats:** 7,847 LOC (Elixir/EEx) · Phases 1–4 · 9 plans · 2026-05-15

---

## v0.3 — Runbooks & Alert Routing (2026-05-12)

- Implemented a webhook receiver endpoint for Prometheus Alertmanager, automatically routing
  "firing" and "resolved" alerts to the durable Ecto Incident lifecycle with intelligent
  deduplication and correlation by alert name and labels.
- Created a structured Parapet.Runbook DSL for defining operator-triggered mitigation steps and
  attaching them based on SLOs or alert names.
- Extended the Operator UI to interactively display attached runbooks and execute one-click
  mitigations with complete ToolAudit logging.
- Built a modular Parapet.Notifier system with out-of-the-box Slack (Block Kit) and Microsoft
  Teams (Adaptive Cards) adapters to broadcast incident state changes and record timeline entries.
- Added UI capabilities for operators to explicitly acknowledge incidents and generate comprehensive
  Markdown retrospectives automatically.

**Stats:** 6,667 LOC (Elixir/EEx) · Phases 1–4 · 12 plans · 2026-05-12

---

## v0.2 — Durable Spine and Operator UI (2026-05-11)

- Implemented the Parapet.Evidence context with Incident, TimelineEntry, and ToolAudit Ecto
  schemas for durable SRE tracking, separating ephemeral telemetry from low-volume Ecto data.
- Created `mix parapet.gen.spine` generator to scaffold evidence migrations into host applications
  safely separated from high-volume telemetry.
- Defined the Operator API with transactional audited commands and a WorkbenchContract for safe
  UI derivations.
- Created `mix parapet.gen.ui` to generate an isolated, secure, and visually responsive Phoenix
  LiveView Operator Workbench inside the host app.
- Automated structural UI tests to guarantee responsive mobile and desktop layout fidelity without
  relying on human QA or full browser end-to-end tests.
- Implemented optional integration adapters for Mailglass, Chimeway, Accrue, Rindle, Threadline,
  and Rulestead leveraging a new capability registry, with all adapters compiling out cleanly when
  their sibling libraries are absent.

**Stats:** 3,164 LOC (Elixir/EEx) · Phases 1–3 · 11 plans · 2026-05-11

---

## v0.1 — Trustworthy Spine (2026-05-10)

- Established the foundational Parapet telemetry contract, supervisor, and install generator,
  defining a documented telemetry surface treated as a public API with semver guarantees.
- Built core metrics instrumentation for HTTP, Ecto, and Oban safely via a robust API enforcing
  low-cardinality by default with explicit label contracts.
- Created an SLO DSL converting standard Elixir definitions to fully functional Prometheus
  recording and alerting rules.
- Delivered a seamless Day-1 experience with `mix parapet.doctor` health checks and Grafana
  dashboard generation.

**Stats:** 1,992 LOC (Elixir) · Phases 1–4 · 15 plans · 2026-05-10