
# Telemetry

Every HTTP request that ExAtlas makes emits a `:telemetry` event, so you
can wire the library into your existing metrics pipeline without writing
provider-specific code.

## Events

### `[:ex_atlas, <provider>, :request]`

Emitted after every REST, runtime, or GraphQL call.

**Measurements:**

| Key      | Type   | Value              |
| -------- | ------ | ------------------ |
| `status` | int    | HTTP status code   |

**Metadata:**

| Key      | Type   | Value                                       |
| -------- | ------ | ------------------------------------------- |
| `api`    | atom   | `:management` / `:runtime` / `:graphql`     |
| `method` | atom   | `:get` / `:post` / `:delete` / ...          |
| `url`    | string | Full request URL                            |
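
`status` arrives as a raw integer in `measurements`. If your reporter
groups by status class, a small pure helper (a sketch — not part of
ExAtlas) can bucket codes before tagging:

```elixir
defmodule RequestStatus do
  # Bucket raw HTTP status codes into coarse classes for metric tags.
  def class(code) when code in 200..299, do: :success
  def class(code) when code in 300..399, do: :redirect
  def class(code) when code in 400..499, do: :client_error
  def class(code) when code in 500..599, do: :server_error
  def class(_code), do: :other
end
```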

### `[:ex_atlas, :fly, :token, :acquire]` (span)

`:start` / `:stop` / `:exception` events around every
`ExAtlas.Fly.Tokens.get/1` call. Measure cache-hit rate, CLI acquisition
latency, and resolution failures.

**`:stop` metadata:**

| Key        | Type   | Value                                                                                  |
| ---------- | ------ | -------------------------------------------------------------------------------------- |
| `app`      | string | Fly app name                                                                           |
| `source`   | atom   | `:ets` / `:storage` / `:config` / `:cli` / `:manual` / `:none` (resolution failed)     |
| `acquirer` | atom   | `:facade` (cross-process ETS fast-path hit) / `:app_server` (slow path or coalesced)   |

Measurements follow the standard `:telemetry.span/3` shape (`system_time`
on `:start`, `duration` + `monotonic_time` on `:stop`).
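
`duration` arrives in native time units, so convert it before reporting.
A minimal `:stop` handler sketch (the module name is ours):

```elixir
defmodule TokenSpanHandler do
  # Handles the :stop event; :telemetry.span/3 reports :duration in
  # native time units, so convert before logging or reporting.
  def handle([:ex_atlas, :fly, :token, :acquire, :stop], measurements, metadata, _config) do
    ms = to_ms(measurements.duration)
    IO.puts("token for #{metadata.app} via #{metadata.source} in #{ms}ms")
  end

  def to_ms(native), do: System.convert_time_unit(native, :native, :millisecond)
end
```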

### Reading `source` + `acquirer` together

- `source: :ets, acquirer: :facade` — pure fast-path cache hit. No mailbox
  round-trip. This is what you want for the vast majority of requests once
  caches are warm.
- `source: :ets, acquirer: :app_server` — **coalescing success**. The caller
  entered the AppServer mailbox, and by the time `handle_call` ran, a
  concurrent first-mover had already filled ETS. Mostly seen during
  cold-start thundering herds; proves the per-app serialization is coalescing
  CLI calls.
- `source: :cli, acquirer: :app_server` — first-in-line caller doing the
  actual `fly tokens create readonly` work. One of these per app per cold
  start (plus expiries).
- `source: :storage, acquirer: :app_server` — ETS empty but DETS storage
  had a valid token. Expect a burst of these right after VM restart.
- `source: :none, acquirer: :app_server` — full resolution chain miss.
  Worth alerting on if sustained.
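
To chart these combinations, it can help to collapse the pair into a
single bounded label. A sketch based on the table above (the label names
are our own):

```elixir
defmodule TokenPath do
  # Collapse the source/acquirer pair into one tag for counters.
  def classify(:ets, :facade), do: :cache_hit
  def classify(:ets, :app_server), do: :coalesced
  def classify(:cli, :app_server), do: :cli_acquire
  def classify(:storage, :app_server), do: :storage_restore
  def classify(:none, _acquirer), do: :resolution_miss
  def classify(_source, _acquirer), do: :other
end
```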

### `[:ex_atlas, :fly, :logs, :fetch]` (span)

`:start` / `:stop` / `:exception` around `ExAtlas.Fly.Logs.Client.fetch_logs/3`.
Emitted regardless of whether you call `fetch_logs/3` directly or go
through `fetch_logs_with_retry/2`.

**`:stop` metadata:**

| Key      | Type   | Value                        |
| -------- | ------ | ---------------------------- |
| `app`    | string | Fly app name                 |
| `status` | term   | `:ok` / `{:error, reason}`   |
| `count`  | int    | Number of entries returned   |

Log line content is never included in metadata — Fly log bodies may
contain bearer tokens, and we do not want them flowing into a metrics
pipeline.
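
That makes it safe to log the metadata wholesale. A `:stop` handler
sketch that surfaces failures (the module name is ours):

```elixir
defmodule LogsFetchHandler do
  require Logger

  # Attach with:
  #   :telemetry.attach("logs-fetch", [:ex_atlas, :fly, :logs, :fetch, :stop],
  #     &LogsFetchHandler.handle/4, nil)
  def handle(_event, _measurements, %{app: app, status: status, count: count}, _config) do
    case status do
      :ok -> Logger.debug("fetched #{count} log entries for #{app}")
      {:error, reason} -> Logger.warning("log fetch for #{app} failed: #{inspect(reason)}")
    end
  end
end
```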

### `[:ex_atlas, :fly, :deploy, :line]` and `[:ex_atlas, :fly, :deploy, :exit]`

Two events from `ExAtlas.Fly.Deploy.stream_deploy/3`:

- `:line` fires once per non-empty output line with `measurements: %{count: 1}`,
  so a `counter` metric on `count` sums to the total number of lines.
- `:exit` fires once when the deploy terminates.

**`:line` metadata:**

| Key         | Type   | Value                     |
| ----------- | ------ | ------------------------- |
| `ticket_id` | string | The deploy ticket ID      |

**`:exit` metadata:**

| Key         | Type   | Value                                                      |
| ----------- | ------ | ---------------------------------------------------------- |
| `ticket_id` | string | The deploy ticket ID                                       |
| `result`    | term   | `:ok` / `{:error, :timeout}` / `{:error, {:exit_code, N}}` |

Line **content** is deliberately excluded — Fly build output can contain
bearer tokens.
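
The nested `result` term is awkward as a metric tag; flattening it first
keeps label cardinality bounded (the tag names are our own):

```elixir
defmodule DeployExit do
  # Flatten the :exit result term into a small, bounded set of tags.
  def outcome(:ok), do: :success
  def outcome({:error, :timeout}), do: :timeout
  def outcome({:error, {:exit_code, _code}}), do: :nonzero_exit
  def outcome(_other), do: :unknown
end
```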

## Wiring into Logger

```elixir
require Logger

# For production, prefer a remote function capture (&MyModule.handle/4)
# over an anonymous fn — :telemetry dispatches captures faster.
:telemetry.attach(
  "atlas-http-logger",
  [:ex_atlas, :runpod, :request],
  fn _event, measurements, metadata, _config ->
    Logger.info(
      "ExAtlas → #{metadata.api} #{metadata.method} #{metadata.url} → #{measurements.status}"
    )
  end,
  nil
)
```

## Wiring into `:telemetry_metrics`

```elixir
defmodule MyAppWeb.Telemetry do
  use Supervisor
  import Telemetry.Metrics

  def metrics do
    [
      # Count requests, tagged by API and method
      counter("atlas.runpod.request.count",
        event_name: [:ex_atlas, :runpod, :request],
        measurement: :status,
        tags: [:api, :method]
      ),

      # Watch status codes. Note that Telemetry.Metrics' :keep predicate
      # receives only the event *metadata* — `status` lives in the
      # measurements map, so filter 4xx/5xx in your reporter or in a
      # custom handler rather than with :keep here.
      distribution("atlas.runpod.request.status",
        event_name: [:ex_atlas, :runpod, :request],
        measurement: :status,
        tags: [:api, :method]
      )
    ]
  end
end
```

Plug into Grafana / Prometheus / StatsD via whichever reporter you
prefer (`TelemetryMetricsPrometheus`, `TelemetryMetricsStatsd`, ...).
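
For example, with the `telemetry_metrics_prometheus` reporter (the
dependency, reporter module, and options here are assumptions — check
the reporter's own docs), you would start it under your supervision tree
and hand it the metrics list:

```elixir
defmodule MyApp.MetricsSupervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      # Reporter choice is illustrative — swap in TelemetryMetricsStatsd etc.
      {TelemetryMetricsPrometheus, metrics: MyAppWeb.Telemetry.metrics()}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```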

## Event attachment on application start

```elixir
defmodule MyApp.AtlasTelemetry do
  @events [
    # Provider HTTP requests
    [:ex_atlas, :runpod, :request],
    [:ex_atlas, :fly, :request],
    [:ex_atlas, :lambda_labs, :request],
    [:ex_atlas, :vast, :request],
    # Fly platform ops (spans emit :start + :stop + :exception)
    [:ex_atlas, :fly, :token, :acquire, :start],
    [:ex_atlas, :fly, :token, :acquire, :stop],
    [:ex_atlas, :fly, :token, :acquire, :exception],
    [:ex_atlas, :fly, :logs, :fetch, :start],
    [:ex_atlas, :fly, :logs, :fetch, :stop],
    [:ex_atlas, :fly, :logs, :fetch, :exception],
    [:ex_atlas, :fly, :deploy, :line],
    [:ex_atlas, :fly, :deploy, :exit]
  ]

  def attach do
    :telemetry.attach_many(
      "atlas-telemetry",
      @events,
      &__MODULE__.handle/4,
      nil
    )
  end

  def handle(event, measurements, metadata, _config) do
    # Dispatch to your metrics system
  end
end

# lib/my_app/application.ex
def start(_type, _args) do
  MyApp.AtlasTelemetry.attach()
  # ...
end
```

## Orchestrator events

PubSub broadcasts from the orchestrator are covered in the README —
subscribe to `"compute:<id>"` on `ExAtlas.PubSub` for state-change
notifications. These are **PubSub messages**, not Telemetry events.

If you want Telemetry-style metrics for spawn/terminate counts, wrap
`ExAtlas.Orchestrator.spawn/1` in your own helper that emits a Telemetry
event alongside the call.
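
A sketch of such a wrapper using `:telemetry.span/3` (the event name and
module are ours, not part of ExAtlas):

```elixir
defmodule MyApp.Compute do
  # Wraps ExAtlas.Orchestrator.spawn/1 in a telemetry span so spawn
  # latency and failures show up alongside the built-in events.
  def spawn_with_telemetry(opts) do
    :telemetry.span([:my_app, :compute, :spawn], %{}, fn ->
      result = ExAtlas.Orchestrator.spawn(opts)
      {result, %{}}
    end)
  end
end
```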