# Parapet SLO Authoring Guide

Parapet is built around a simple conviction: an SLO should track whether users can do the things they came to your app to do, not whether the servers are breathing. A CPU gauge that stays under 80% tells you nothing about whether login is working. A journey SLO burning at 2% tells you which user task is failing and roughly how many users are affected.

This guide walks through how to decide what deserves a slice, how to use the built-in `Parapet.SLO.StarterPack.WebSaaS` slices as anchors for your own decisions, and how to handle the situations where low traffic or low volume makes naive alerting unreliable.

For the full provider and slice catalog - including built-in provider modules for Mailglass, Chimeway, Rindle, and the WebSaaS pack - see [Parapet SLO Reference](docs/slo-reference.md).

## How to decide what to slice

The decision is not about what you can measure. It is about what failing would cost a user.

Use this tree to decide whether a potential signal warrants its own journey SLO:

- **Does this failure directly prevent a user task?**
  - Yes -> this is a candidate for a journey SLO. Continue down.
    - **Is the failure observable through a metric Parapet already emits (or that your integration emits)?**
      - Yes -> define a slice against that metric.
      - No -> wire the metric first (or use a synthetic probe - see the low-traffic section below), then define the slice.
    - **Is the failure synchronous (request-time) or async (job, callback, provider-mediated)?**
      - Synchronous -> use an HTTP availability or login-journey style ratio slice.
      - Async -> use a job-success or delivery-confirmation style slice.
  - No (infrastructure-only signal, does not directly prevent a user task) -> this is not a journey SLO. Consider a system-health dashboard instead.

**Litmus:** "Does this failure directly prevent a user task?" is the one question you should always answer first.
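
The tree above can be sketched as a small function. This is an illustration only: the `SliceDecision` module, the map keys, and the returned atoms are hypothetical names, not part of Parapet.

```elixir
defmodule SliceDecision do
  @moduledoc "Illustrative encoding of the decision tree; not part of Parapet."

  # `signal` is a map with three hypothetical keys:
  #   :blocks_user_task? - does the failure directly prevent a user task?
  #   :metric_available? - is a metric for it already emitted?
  #   :mode              - :sync (request-time) or :async (job, callback, provider-mediated)

  # Infrastructure-only signals never become journey SLOs.
  def recommend(%{blocks_user_task?: false}), do: :system_health_dashboard

  def recommend(%{blocks_user_task?: true, metric_available?: available?, mode: mode}) do
    style =
      case mode do
        :sync -> :ratio_slice
        :async -> :job_success_slice
      end

    # A missing metric does not change the slice style, only the first step.
    if available?, do: {:define_slice, style}, else: {:wire_metric_first, style}
  end
end

SliceDecision.recommend(%{blocks_user_task?: true, metric_available?: true, mode: :sync})
# => {:define_slice, :ratio_slice}
```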

**Good examples** from `Parapet.SLO.StarterPack.WebSaaS`:

- `web_saas_login_journey` - a failed login directly blocks the user from entering your app. Auth failures are low-volume, high-impact, and exactly what a journey SLO is for.
- `web_saas_http_availability` - request-level availability is the baseline user expectation. A user who cannot load a page is directly blocked.
- `web_saas_oban_job_success` - Oban job failures directly affect users when the job gates a user-visible outcome (order confirmation, email delivery, image processing, billing). Wire a job-success slice for each critical async path.

**Bad example:**

- A CPU utilization or memory gauge SLO. CPU at 95% does not directly prevent a user task. You might be processing batch work, running GC, or handling a spike with headroom to spare. Alerting on raw infrastructure metrics produces noise without actionable user-impact framing.

**Real anchor:** The three `web_saas_*` slice names in `Parapet.SLO.StarterPack.WebSaaS` are the reference implementation. Each is pinned to a real Prometheus series, has a documented default objective in human terms, and is overridable. Read the source or [Parapet SLO Reference](docs/slo-reference.md) to understand the defaults before changing them.

## Writing a custom slice

When the built-in packs do not cover your journey, you define a custom provider module that returns `Parapet.SLO.SliceSpec` structs. The `SliceSpec` struct drives all generator output - you never write raw PromQL.

The minimum fields are `name`, `integration`, `kind`, `alert_class`, and `runbook`, plus a good-event metric with its matchers and a total-event metric with its matchers. Set `objective` as a percentage (e.g., `99.5`) and the Generator derives the error-rate threshold for you.
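
As a rough sketch of what such a provider might look like, the example below uses a local stand-in struct so that it runs without Parapet installed. The stand-in is not `Parapet.SLO.SliceSpec`: the four metric and matcher key names, the `:ratio` kind, the `:my_app` integration, the metric names, and the runbook path are all assumptions made for illustration. Check the `SliceSpec` moduledoc for the real field names before copying anything.

```elixir
defmodule MyApp.SLO.SliceSketch do
  # Local stand-in for illustration only. The real struct is Parapet.SLO.SliceSpec;
  # good_metric/good_matchers/total_metric/total_matchers are assumed key names.
  @enforce_keys [
    :name, :integration, :kind, :alert_class, :runbook,
    :good_metric, :good_matchers, :total_metric, :total_matchers
  ]
  # min_total_rate defaults to 0.01, matching the default described in this guide.
  defstruct @enforce_keys ++ [:objective, min_total_rate: 0.01]
end

defmodule MyApp.SLO.CheckoutJourney do
  alias MyApp.SLO.SliceSketch

  # Mirrors the shape of a provider's slos/0 callback: a flat list of slices.
  def slos do
    [
      %SliceSketch{
        name: "checkout_journey",              # hypothetical slice name
        integration: :my_app,                  # hypothetical value
        kind: :ratio,                          # hypothetical value
        alert_class: :page,
        runbook: "docs/runbooks/checkout.md",  # hypothetical path
        good_metric: "my_app_checkout_total",  # hypothetical metric
        good_matchers: [status: "ok"],
        total_metric: "my_app_checkout_total",
        total_matchers: [],
        objective: 99.5
      }
    ]
  end
end
```

Because the required keys are enforced, leaving one out fails at compile time rather than producing a slice with a missing runbook or alert class.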

Register your provider module the same way as the built-ins:

```elixir
config :parapet,
  providers: [
    Parapet.SLO.StarterPack.WebSaaS,
    MyApp.SLO.CheckoutJourney
  ]
```

Then run `mix parapet.gen.prometheus` to write the recording rules and alert expressions.

## Provider-as-bundle pattern

A `Parapet.SLO.Provider` that returns slices from multiple sub-providers is the bundle abstraction. No separate macro or base module is required: the `slos/0` callback returns a flat list, and list concatenation (`++`) is the composition primitive.

The canonical example is `Parapet.SLO.StarterPack.DeliverySaaS`, which composes three providers into one registration: the three WebSaaS slices plus conditionally-guarded Mailglass and Chimeway delivery slices. Its `slos/0` calls `WebSaaS.slos() ++ delivery_slices(Mailglass, Chimeway)`, where each delivery slice set is included only when the corresponding host library is loaded.

```elixir
defmodule MyApp.SLO.FullStack do
  @behaviour Parapet.SLO.Provider

  @impl true
  def slos do
    Parapet.SLO.StarterPack.WebSaaS.slos() ++
      (if Code.ensure_loaded?(Mailglass), do: Parapet.SLO.MailglassDelivery.slos(), else: []) ++
      my_custom_slices()
  end

  # Placeholder: return your own Parapet.SLO.SliceSpec structs here.
  defp my_custom_slices, do: []
end
```

Register the bundle provider the same way as any single provider:

```elixir
config :parapet, providers: [MyApp.SLO.FullStack]
```

**Conditional registration:** Use `Code.ensure_loaded?/1` to guard slices for optional host libraries. The bundle module itself is always loadable (passes `mix verify.public_api`) regardless of whether the guarded library is present. This is the pattern used by `Parapet.SLO.StarterPack.DeliverySaaS`; see its moduledoc for the reference implementation.
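
If a bundle guards several optional libraries, the check can be factored into a small helper. The sketch below is one way to do that, not code from Parapet: `MyApp.SLO.Optional`, `slices_if_loaded/2`, and `MyApp.SLO.FakeProvider` are hypothetical names, and the fake provider exists only so the example runs on its own.

```elixir
defmodule MyApp.SLO.Optional do
  # Returns provider.slos() only when both the host library and the provider
  # module can be loaded; otherwise contributes nothing to the bundle.
  def slices_if_loaded(host_lib, provider) do
    if Code.ensure_loaded?(host_lib) and Code.ensure_loaded?(provider) do
      provider.slos()
    else
      []
    end
  end
end

defmodule MyApp.SLO.FakeProvider do
  # Stand-in provider so the sketch runs without Parapet installed.
  def slos, do: [:fake_slice]
end

# Enum is always available, so the slices are included.
MyApp.SLO.Optional.slices_if_loaded(Enum, MyApp.SLO.FakeProvider)
# => [:fake_slice]

# A module that does not exist yields an empty list instead of a crash.
MyApp.SLO.Optional.slices_if_loaded(NoSuchHostLib, MyApp.SLO.FakeProvider)
# => []
```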

For the full built-in provider catalog and starter packs, see [Parapet SLO Reference](docs/slo-reference.md#starter-packs).

## Low-traffic and low-volume services

Low-traffic services introduce a specific failure mode: the SLO appears to burn when there is too little data to tell. One failed login out of five attempts is a 20% error ratio, enough to trip a ratio-only page alert, even though five requests are not a meaningful sample. The naive fix is to lower the objective until the noise stops. That is the wrong move.

### The denominator guard the generator renders

Every alert expression the Generator produces includes a denominator guard. For a slice named `web_saas_login_journey` with a `:page` alert class (14.4x multiplier, 5m window) and a 99.9% objective:

```
parapet:web_saas_login_journey:error_ratio:5m > 0.0144 and parapet:web_saas_login_journey:total_rate:5m > 0.01
```

The guard shape is:

```
parapet:<slice_name>:error_ratio:<window> > <threshold> and parapet:<slice_name>:total_rate:<window> > <min_total_rate>
```

The second condition - `total_rate > min_total_rate` - is the denominator guard. The alert fires only when there is enough traffic to make the error ratio meaningful. Without that guard, a single failure in a quiet window would trigger a page.

The 0.0144 threshold comes from the objective: 99.9% -> 0.001 error budget x 14.4 multiplier = 0.0144.
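
The arithmetic and the guard shape can be reproduced in a few lines. This is a sketch for checking numbers by hand, not the Generator's implementation; `GuardSketch` and its function names are hypothetical.

```elixir
defmodule GuardSketch do
  # Error-ratio threshold: the error budget (as a fraction) times the burn multiplier.
  # Rounded to 6 places to hide binary floating-point noise.
  def threshold(objective_pct, multiplier) do
    Float.round((100 - objective_pct) / 100 * multiplier * 1.0, 6)
  end

  # Renders the two-condition guard shape: the error-ratio condition and the
  # denominator guard, joined with `and`.
  def alert_expr(slice, window, threshold, min_total_rate) do
    "parapet:#{slice}:error_ratio:#{window} > #{threshold} and " <>
      "parapet:#{slice}:total_rate:#{window} > #{min_total_rate}"
  end
end

threshold = GuardSketch.threshold(99.9, 14.4)
IO.puts(GuardSketch.alert_expr("web_saas_login_journey", "5m", threshold, 0.01))
```

The same function shows why very loose objectives are dangerous: at a 90% objective the `:page` threshold becomes 1.44, a value an error ratio can never exceed.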

### The min_total_rate default and the six windows

The default `min_total_rate` is `0.01` - defined in `Parapet.SLO.SliceSpec` as the struct default and applied to every slice unless you override it. You can override it per-slice by passing `min_total_rate: <value>` when constructing a `SliceSpec`.

The Generator emits alert expressions for one window per alert class. The full set of recording rule windows is `["5m", "30m", "1h", "2h", "6h", "3d"]`. The alert window and multiplier by class are:

- `:page` - 5m window, 14.4x multiplier
- `:ticket` - 30m window, 6.0x multiplier
- `:warning` - 6h window, 1.0x multiplier

Recording rules are generated for all six windows, not only the three alert windows, so you have history for retrospectives and trend analysis at every granularity.
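
When choosing a `min_total_rate` override, it can help to translate the rate into a request count per window. The sketch below assumes the `total_rate` series is a per-second rate, as Prometheus `rate()` produces; that is an assumption about the generated rules, so confirm it against your recording-rule output. `WindowMath` is a hypothetical helper, not part of Parapet.

```elixir
defmodule WindowMath do
  @unit_seconds %{"m" => 60, "h" => 3_600, "d" => 86_400}

  # Converts a window string such as "30m" or "3d" into seconds.
  def seconds(window) do
    {count, unit} = Integer.parse(window)
    count * Map.fetch!(@unit_seconds, unit)
  end

  # Requests needed inside `window` for the average rate to reach min_total_rate,
  # assuming the rate is expressed per second.
  def min_requests(window, min_total_rate \\ 0.01) do
    Float.round(seconds(window) * min_total_rate * 1.0, 2)
  end
end

for window <- ["5m", "30m", "1h", "2h", "6h", "3d"] do
  IO.puts("#{window}: #{WindowMath.min_requests(window)} requests")
end
```

Under that assumption, the default of `0.01` corresponds to about 3 requests in a 5m window and about 216 requests in a 6h window.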

### The extended-window approach

The 6h and 3d windows the Generator already emits are naturally more tolerant of low-traffic variance: a low-volume service accumulates far more denominator data over six hours than over five minutes, so a single failure moves the ratio much less. If you are seeing false-positive `:warning` alerts on a low-volume slice, the first question is not "should I lower the objective?" It is "is the denominator guard working as intended, and am I looking at the right window?"

### Synthetic probes

When traffic is genuinely too low to produce a reliable signal even at the 6h window - for example, an internal-only workflow that runs once a week - the right tool is a synthetic probe.

`Parapet.Metrics.Probe` is the implemented fallback for this case. It emits two metrics:

- `parapet.probe.run.total` - a counter tagged with `probe` and `status`
- `parapet.probe.run.duration.ms` - a distribution for latency tracking

A synthetic probe continuously exercises the journey at a known rate, giving the SLO a stable denominator even on services with negligible organic traffic. The probe outcome then feeds into a slice the same way real traffic does - you define the slice against `parapet.probe.run.total` and the denominator guard works as intended.
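
The core of a probe is small: run the journey, classify the outcome, and time it. The sketch below shows only that core in plain Elixir and does not call `Parapet.Metrics.Probe`, whose API is not covered in this guide. `ProbeSketch`, the `:ok` and `:error` status values, and the `duration_ms` key are hypothetical; the `probe` and `status` keys follow the tag names listed above.

```elixir
defmodule ProbeSketch do
  # Runs `fun`, classifies the outcome, and reports elapsed milliseconds.
  # A raised exception counts as a failed run rather than crashing the caller.
  def run(probe_name, fun) when is_function(fun, 0) do
    started = System.monotonic_time(:millisecond)

    status =
      try do
        case fun.() do
          :ok -> :ok
          {:ok, _} -> :ok
          _ -> :error
        end
      rescue
        _ -> :error
      end

    duration_ms = System.monotonic_time(:millisecond) - started
    %{probe: probe_name, status: status, duration_ms: duration_ms}
  end
end

ProbeSketch.run("login", fn -> {:ok, :session_created} end)
```

Scheduling the probe at a fixed interval is what gives the slice a predictable denominator, so pick an interval that keeps the total rate above the slice's `min_total_rate`.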

## What not to do

These are the failure modes that produce noise instead of signal.

- **Lower the objective to silence noise.** This is the wrong move. Dropping a login-journey SLO from 99.9% to 90% because it was firing on low traffic means you will not page when 10% of your users cannot log in. The denominator guard, extended windows, and synthetic probes exist precisely so you do not have to choose between accuracy and quiet alerts.
- **Alert on infrastructure metrics as if they were journey SLOs.** CPU, memory, and disk are system-health signals. They are useful for capacity planning. They are not journey SLOs, and wiring them as SLOs produces alerts that are both noisy and unactionable.
- **Emit a new journey SLO without wiring a denominator guard.** The Generator handles this for you via the `min_total_rate` field on `SliceSpec` - but if you bypass the Generator and write raw PromQL, you need to add the guard yourself.
- **Assume "no data" means "green."** If a slice has no traffic - for example, the `web_saas_login_journey` slice before you wire the Sigra integration or another login-count emitter - the denominator guard prevents the alert from firing. That is correct behavior. But silence is not a health signal. Use `mix parapet.doctor` and check that the expected metrics are present before treating a quiet slice as a passing one.