# Changelog
All notable changes to this project will be documented in this file.
The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)
and ExAtlas adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## v0.5.0 — unreleased
Closes all remaining audit items. Library is now at feature parity with
the audit recommendations.
### Added
- **`ExAtlas.Fly.Supervisor`** (E3) — top-level supervisor for the Fly
sub-tree, exposed as a `child_spec/1` so hosts can embed ExAtlas under
their own supervision tree. `ExAtlas.Application` delegates to its
`fly_children/0` to avoid duplication.
- **`ExAtlas.Fly.Tokens.refresh/1`** (E5) — atomic invalidate-then-acquire.
Equivalent to `invalidate/1` + `get/1` but runs under a single
GenServer call on the AppServer, closing the race where a concurrent
caller acquires between the two.
- **`ExAtlas.Fly.Dispatcher.subscribe_with_backpressure/2`** (E6) — opt-in
eviction watchdog. Monitors the subscriber's message queue and signals
an eviction via `{:ex_atlas_fly_backpressure_evict, topic}` if the
queue exceeds a configurable threshold.
- **Proactive soft-expiry refresh** (E7) — `ExAtlas.Fly.Tokens.AppServer`
schedules a background refresh `:soft_expiry_lead_seconds` (default 3600)
before a cached token's `expires_at`. Avoids the expiry cliff where
every caller around expiry hits the CLI at once.
- **Monorepo discovery** (M4) — `ExAtlas.Fly.Deploy.discover_apps/2`
now accepts a `:max_depth` option. Default `1` preserves current
behavior; set higher for `apps/<name>/fly.toml` layouts.
- **Streamer shutdown signal** (L5) — the Streamer sends a final
`{:ex_atlas_fly_logs_stopped, app_name}` on its topic when it
terminates, so subscribers can unsubscribe themselves from the
framework-agnostic dispatcher.
### Changed
- **`ExAtlas.Fly.Deploy.deploy/2`** (M5) — now returns
`{:error, {:fly_error, :not_found, _}}` when `fly` is not on `PATH`,
matching `stream_deploy/3`. Previously raised `ErlangError` from
`System.cmd/3` on missing executables.
- **`ExAtlas.Fly.Deploy.parse_app_name/1`** (L3) — tightened regex:
quoted values must not contain whitespace (pre-fix `app = "my app"`
returned `{:ok, "my"}`). Still accepts unquoted values and
whitespace-separated inline comments on the `app =` line.
- **`ExAtlas.Fly.Logs.Streamer` L7 race fix** — until the first
subscriber registers via `subscribe_pid/2`, the Streamer advances its
cursor silently without dispatching. Previously the very first poll
could fire before a caller's `subscribe_pid/2`, dropping the first
batch onto a zero-subscriber topic.
- **`ExAtlas.Fly.Logs.Streamer.subscribe/2`** (L4) — `project_dir` is
no longer required. New `subscribe/2` arity takes keyword options
only; `subscribe/3` stays for backward compatibility with the old
positional signature.
- **`ExAtlas.Fly.Tokens.AppServer` config resolution** (M8, M9) —
`:fly_config_file_enabled` and `:cli_timeout_ms` are now resolved
once at AppServer `init/1` rather than on every `handle_call`. Uses
the consistent `Keyword.get(config, :key, default)` pattern.
- **`ExAtlas.Fly.Tokens.AppServer` structured logging** (E4) —
remaining `Logger.warning` interpolations for CLI failures now use
metadata (`app:`, `exit_code:`, `output:`, `timeout_ms:`) instead of
interpolated strings.
- **`ExAtlas.Fly.TokenStorage.Dets` mkdir fallback** (M6) — when the
explicitly-configured `:storage_path` is not writable, falls back to
`System.tmp_dir!/0` with a `:warning` log, rather than crashing on
`File.mkdir_p!/1`. Previously only the default path had the fallback.
- **`unless` → `if` throughout `deploy.ex`** (L1).
- **`deploy/2` and `stream_deploy/3` error shape typed explicitly**
(L2) — new `ExAtlas.Fly.Deploy.deploy_error/0` type spec documents
the three `:fly_error` reason variants (`:not_found`, `:timeout`,
`non_neg_integer()`).
### Installer
- **`mix ex_atlas.install` runtime.exs example** (M2) — the post-install
notice now includes a `runtime.exs` pattern for containerized deploys
that want to override `:storage_path` via an environment variable.
### Dispatcher docs (H7)
- Added a subsection describing dispatch serialization semantics and
pointing hosts with large fan-out at `:phoenix_pubsub` mode. The
per-subscriber `send/2` loop in `:registry` mode is documented as
intentional for the typical log-streaming / deploy workload.
## v0.4.1 — unreleased
### Changed — Async token persist (closes audit H3)
- `ExAtlas.Fly.Tokens.AppServer` now offloads cached-token storage
writes to a supervised `Task` under a new
`ExAtlas.Fly.Tokens.TaskSupervisor` child. The AppServer's
`handle_call` replies as soon as ETS is updated; `:dets.sync`
happens in the background.
- Net effect: a slow storage write for one app no longer blocks that
app's own subsequent token requests (and never blocked other apps',
post-E1). Callers get the token with latency gated on ETS + cmd_fn
only. Audit finding H3.
- **Manual-token persist stays synchronous.** Manual tokens are not
re-acquirable, so `ExAtlas.Fly.Tokens.set_manual/2` still returns
`{:error, {:persist_failed, reason}}` when storage raises — the
caller must know if persist failed.
- Persist failures on the cached path continue to log at `:error`
level with `{app, reason}` metadata, now emitted from the task
rather than the mailbox (contract preserved, emission point
moved).
### Added
- `ExAtlas.Fly.Tokens.TaskSupervisor` is a new child of
`ExAtlas.Fly.Tokens.Supervisor`, ordered after `ETSOwner` and
before the `DynamicSupervisor`. Tests can inject a custom name
via `:task_sup` on `Tokens.Supervisor.start_link/1`.
## v0.4.0 — unreleased
### Changed — Per-app Fly tokens (audit E1; closes H3, H4)
- Replaced the singleton `ExAtlas.Fly.Tokens.Server` with a per-app
`ExAtlas.Fly.Tokens.AppServer` supervised under
`ExAtlas.Fly.Tokens.Supervisor`. Token resolution for one app no
longer blocks resolution for any other. A thundering herd of CLI
acquisitions (e.g. post-VM-restart across N apps) now runs in
parallel rather than serialized behind a single mailbox.
- `ExAtlas.Fly.Tokens.Server` is **removed**. The documented public API
(`ExAtlas.Fly.Tokens.{get/1, invalidate/1, set_manual/2}`) is
unchanged and remains the stable entry point.
- Shared ETS table (`:ex_atlas_fly_tokens`) is now `:public` and owned
by `ExAtlas.Fly.Tokens.ETSOwner`, outliving individual AppServer
crashes. A crashed AppServer restarts with its cache intact; an
ETSOwner crash rebuilds the whole tokens subtree via `:rest_for_one`
(Registry survives, DynamicSupervisor + every AppServer restart).
- Concurrent `Tokens.get/1` calls for the **same** app coalesce at the
AppServer mailbox — only the first-in-line caller invokes the CLI;
subsequent callers re-check ETS (filled by the first) before
descending the resolution chain.
### Added
- `[:ex_atlas, :fly, :token, :acquire]` `:stop` metadata gains a new
`:acquirer` field — `:facade` for pure ETS fast-path hits (no
AppServer consulted) or `:app_server` for slow-path / coalesced
resolutions. Existing handlers that match only on `:source` are
unaffected. See `guides/telemetry.md` for the diagnostic interpretation.
- `ExAtlas.Fly.Tokens.Supervisor.whereis_app_server/2` and
`resolve_app_server/2` — lookup / resolve-or-start helpers.
Primarily for tests.
## v0.3.1 — 2026-04-22
### Added — Telemetry for Fly platform ops
- `[:ex_atlas, :fly, :token, :acquire]` span events around every
`ExAtlas.Fly.Tokens.get/1` call. `:stop` metadata includes `source:`
(`:ets` / `:storage` / `:config` / `:cli` / `:manual` / `:none`) so
operators can measure cache-hit rate and acquisition-path latency.
- `[:ex_atlas, :fly, :logs, :fetch]` span events around
`ExAtlas.Fly.Logs.Client.fetch_logs/3`. Metadata: `{app, status, count}`.
Inherited automatically by `fetch_logs_with_retry/2`.
- `[:ex_atlas, :fly, :deploy, :line]` (one per non-empty output line) +
`[:ex_atlas, :fly, :deploy, :exit]` (one per deploy termination) from
`Deploy.stream_deploy/3`. Line content is deliberately excluded — Fly
build output can contain bearer tokens.
See `guides/telemetry.md` for the full event reference.
### Added — Shared TokenStorage conformance suite
- `ExAtlas.Fly.TokenStorageConformance` — a `use`-able ExUnit macro that
any `TokenStorage` implementation can adopt to inherit the full
`get/put/delete` contract coverage across `:cached` and `:manual`
keys. Mirrors the existing `ExAtlas.Test.ProviderConformance` pattern.
- `Memory` and `Dets` both run under the shared suite now, so any
future adapter (Redis, Postgres, vault) can prove parity with one
`use` line.
## v0.3.0 — unreleased
### Changed — Fly token / streamer return contracts
- `ExAtlas.Fly.Tokens.set_manual/2` (and `Tokens.Server.set_manual_token/3`)
now return `:ok | {:error, {:persist_failed, reason}}` instead of always
`:ok`. Manual tokens are not re-acquirable, so storage failures must be
surfaced rather than silently logged. Callers that pattern-match on
`:ok` should handle the error tuple.
- `ExAtlas.Fly.subscribe_logs/3` (and `Streamer.subscribe/3`) now return
`:ok | {:error, :no_streamer}` when no streamer can be resolved
(e.g. the Fly sub-tree is disabled). Previously this case returned a
silent `:ok` with no messages ever arriving.
### Fixed — Hardening round
- `ExAtlas.Fly.Tokens.Server` `persist/3` (cached path) now returns
`:ok | {:error, {:persist_failed, reason}}` and logs failures at
`:error` level with `{app, reason}` metadata instead of `:warning`
with interpolated strings. ETS still holds a fresh token for the
session, but a silent storage outage is now operator-visible.
- `ExAtlas.Fly.Dispatcher` `:mfa` mode wraps the host MFA in
`try/rescue/catch` so a raising MFA no longer takes down the caller
(most commonly the log Streamer, whose crash drops the pagination
cursor). Failures are logged at `:error` level with the topic and MFA
identity.
- `ExAtlas.Fly.TokenStorage.Dets` refuses to auto-recreate a corrupt
`manual.dets` file on startup — manual tokens are bearer credentials
that are NOT re-acquirable. Returns `{:stop, {:manual_dets_corrupt,
path, reason}}` and preserves the file for operator intervention. The
cached-token path still recreates (re-acquirable, perf regression only).
- `ExAtlas.Fly.TokenStorage.Dets` now `chmod`s the storage dir to `0700`
and each DETS file to `0600` after open. Default umask on typical
Linux/macOS left token files world- or group-readable.
- `mix ex_atlas.install` surfaces `.gitignore` update failures as an
`Igniter.add_notice` with the exact line the user must add manually;
previously the installer silently swallowed the exception and moved on.
- `ExAtlas.Fly.TokenStorage.Memory` (test support) now catches `:exit`
from pre-init reads and returns `:error`, matching the Dets `rescue
ArgumentError` semantics so the test double is faithful to prod.
### Added
- `ExAtlas.Fly.TokenStorage.Dets.start_link/1` accepts `:name`,
`:cached_table`, `:manual_table` opts so custom-supervised /
per-test instances are possible alongside the default singleton.
- First test coverage for `ExAtlas.Fly.Dispatcher`, `TokenStorage.Dets`,
`TokenStorage.Memory`, and the `Streamer.subscribe/3` silent-failure
path.
## v0.2.0 — unreleased
### Fixed
- `ExAtlas.Fly.Logs.Client.next_start_time/1` no longer crashes the
Streamer when a log entry has a `nil` or malformed ISO-8601 timestamp;
unparseable entries are logged and skipped.
- `ExAtlas.Fly.Deploy.stream_deploy/3` cleans both the activity and
absolute timers symmetrically across all exit branches, so no stray
`{:deploy_*_timeout, _}` message leaks into a long-lived caller's
mailbox. Exposes `:activity_timeout_ms` / `:max_timeout_ms` options.
- `ExAtlas.Fly.Tokens.Server` now implements `terminate/2` to delete its
named ETS table, avoiding an `ArgumentError` on supervisor restart,
and defensively reclaims an existing table in `init/1`.
- `ExAtlas.Fly.Tokens.Server` shuts down a hung `fly` CLI task with
`:brutal_kill` so the configured `cli_timeout_ms` is actually the
mailbox blocking time, not `cli_timeout_ms + 5_000`.
- `ExAtlas.Fly.Logs.StreamerSupervisor` uses `:rest_for_one` with a
generous restart budget on the `DynamicSupervisor` so one app's
misbehaving streamer no longer tears down the registry and every
other app's pagination cursor.
### Added — Fly.io platform operations
- `ExAtlas.Fly` top-level facade for Fly.io platform ops:
`discover_apps/1`, `deploy/2`, `stream_deploy/3`, `subscribe_logs/3`,
`unsubscribe_logs/1`, `subscribe_deploy/1`, `unsubscribe_deploy/1`.
- `ExAtlas.Fly.Deploy` — `fly deploy --remote-only` with 15 min timeout
(`deploy/2`) and Port-based streaming (`stream_deploy/3`) with a
5 min activity timer and 30 min absolute cap. Dispatches
`{:ex_atlas_fly_deploy, ticket_id, line}` on each line.
- `ExAtlas.Fly.Logs.Client` — `Req`-backed client for the Fly Machines
log API (NDJSON, cursor pagination, automatic 401 retry).
- `ExAtlas.Fly.Logs.Streamer` + `StreamerSupervisor` — per-app GenServer
that polls the log API every 2 s, dispatches
`{:ex_atlas_fly_logs, app, entries}`, and stops once all subscribers
have disconnected (monitor-based).
- `ExAtlas.Fly.Tokens` + `ExAtlas.Fly.Tokens.Server` — cache-first token
resolver. Order: ETS → `TokenStorage` → `~/.fly/config.yml` →
`fly tokens create readonly` → manual override.
- `ExAtlas.Fly.TokenStorage` — pluggable behaviour for durable token
persistence. Default impl `ExAtlas.Fly.TokenStorage.Dets` is
zero-config and survives VM restarts.
- `ExAtlas.Fly.Dispatcher` — framework-agnostic broadcast. Modes:
`:registry` (default, zero-dep), `:phoenix_pubsub` (when host uses
Phoenix), or `{:mfa, {m, f, a}}` custom routing.
- `ExAtlas.Application` now supervises the Fly sub-tree by default.
Disable with `config :ex_atlas, :fly, enabled: false`.
### Added — Igniter installer
- `mix ex_atlas.install` — adds sensible `config :ex_atlas, :fly` defaults,
creates the DETS storage directory, wires `phoenix_pubsub` when
available.
- `mix ex_atlas.upgrade` — runs per-version upgraders (no-op for 0.1.x
→ 0.2.0; reserved for future migrations).
### Changed
- Description and package scope broadened from "GPU/compute SDK" to
"infrastructure SDK".
- `ExAtlas.Application`'s Fly sub-tree boots by default. The existing
orchestrator sub-tree is still opt-in via `start_orchestrator: true`.
## v0.1.0 — unreleased
Initial public release.
### Added — Core API
- `ExAtlas` top-level provider-agnostic module (`spawn_compute/1`,
`get_compute/2`, `list_compute/1`, `stop/2`, `start/2`, `terminate/2`,
`run_job/1`, `get_job/2`, `cancel_job/2`, `stream_job/2`,
`list_gpu_types/1`, `capabilities/1`).
- `ExAtlas.Provider` behaviour defining the contract every provider
implements.
- `ExAtlas.Config` — per-call > app-env > env-var resolution for provider
and API key. Supports user-defined provider modules passed directly by
name (no registration needed).
- `ExAtlas.Error` — canonical error struct with `:kind` atoms
(`:unauthorized`, `:not_found`, `:rate_limited`, `:timeout`,
`:unsupported`, `:validation`, `:provider`, `:transport`, `:unknown`)
and `from_response/3` for translating HTTP responses.
### Added — Normalized specs
- `ExAtlas.Spec.ComputeRequest` — input to `spawn_compute/1` with
`NimbleOptions`-validated fields, `:provider_opts` escape hatch.
- `ExAtlas.Spec.Compute` — normalized compute resource response.
- `ExAtlas.Spec.JobRequest` / `ExAtlas.Spec.Job` — serverless jobs.
- `ExAtlas.Spec.GpuType` — catalog entry with pricing + stock.
- `ExAtlas.Spec.GpuCatalog` — stable canonical GPU atoms
(`:h100`, `:a100_80g`, `:rtx_4090`, ...) mapped to each provider's
native identifier.
### Added — Providers
- `ExAtlas.Providers.RunPod` — full implementation covering REST management
(pods, endpoints, templates, network volumes, billing), serverless
runtime (async/sync/stream job submission, status, cancel), and the
legacy GraphQL pricing catalog. Built on `Req`.
- Sub-modules: `Client`, `GraphQL`, `Pods`, `Endpoints`, `Jobs`,
`Templates`, `NetworkVolumes`, `Billing`, `Translate`.
- `ExAtlas.Providers.Mock` — in-memory ETS-backed provider for tests and
demos. Implements every callback.
- `ExAtlas.Providers.Stub` macro — shared base for placeholder providers.
- `ExAtlas.Providers.Fly`, `ExAtlas.Providers.LambdaLabs`,
`ExAtlas.Providers.Vast` — placeholder modules reserving atoms and
capability lists for v0.2 / v0.3.
### Added — Auth
- `ExAtlas.Auth.Token` — cryptographically random 256-bit bearer tokens
with SHA-256 hashing and constant-time comparison (`Plug.Crypto`).
- `ExAtlas.Auth.SignedUrl` — S3-style HMAC-SHA256 signed URLs with
expiry, for media streams and WebSockets that can't set headers.
- Auto-injection: `auth: :bearer` on `spawn_compute/1` mints a token,
injects it into the pod as `ATLAS_PRESHARED_KEY`, and returns the
handle in `compute.auth`.
### Added — Orchestrator (opt-in)
- `ExAtlas.Orchestrator` — high-level API (`spawn/1`, `touch/1`, `info/1`,
`stop_tracked/1`, `list_ids/0`).
- `ExAtlas.Orchestrator.ComputeServer` — one GenServer per tracked
resource, traps exits, enforces `:idle_ttl_ms`, broadcasts state
changes via `ExAtlas.Orchestrator.Events`.
- `ExAtlas.Orchestrator.ComputeSupervisor` (`DynamicSupervisor`) +
`ExAtlas.Orchestrator.ComputeRegistry` (`Registry` with `:via` lookup).
- `ExAtlas.Orchestrator.Reaper` — periodic reconciliation; terminates
orphans whose `:name` matches the configurable safety-prefix.
- `ExAtlas.Application` starts the tree only when
`config :ex_atlas, start_orchestrator: true`; library-only users pay
nothing.
- Phoenix.PubSub broadcasts on `"compute:<id>"` topic as
`{:atlas_compute, id, event}` for `{:status, s}`,
`{:heartbeat, ms}`, `{:terminating, reason}`,
`{:terminate_failed, err}` events.
### Added — Phoenix LiveDashboard integration
- `ExAtlas.LiveDashboard.ComputePage` — drop-in
`Phoenix.LiveDashboard.PageBuilder` page. Host apps mount it via
`additional_pages: [atlas: ExAtlas.LiveDashboard.ComputePage]`. Live
table with Touch/Stop/Terminate row actions. Auto-refreshing;
subscribes to `ExAtlas.PubSub` for push updates when available.
- Guarded by `Code.ensure_loaded?(Phoenix.LiveDashboard.PageBuilder)` so
the module only compiles when LiveDashboard is in the host app's deps.
### Added — HTTP + observability
- Every REST / runtime / GraphQL request goes through `Req` with
`:retry :transient`, 3 retries by default, and telemetry.
- Telemetry events `[:ex_atlas, <provider>, :request]` with
`%{status: status}` measurements and `%{api, method, url}` metadata.
- Per-call `Req` overrides via `req_options:`.
### Added — Testing
- `ExAtlas.Test.ProviderConformance` — shared ExUnit suite every provider
implementation must pass. `use`-macro form accepts `:reset` MFA for
test isolation.
- Full unit coverage (68 tests, 3 doctests).
### Added — Documentation
- Comprehensive `README.md` with architecture diagram, capability
matrix, GPU mapping table, error kinds, security considerations,
FAQ, and roadmap.
- `guides/getting_started.md`, `guides/transient_pods.md`,
`guides/writing_a_provider.md`, `guides/telemetry.md`,
`guides/testing.md` — long-form deep-dives surfaced via ex_doc extras.
- Full module-level `@moduledoc` on every public module.