README.md

# nova_resilience

Production-grade resilience patterns for [Nova](https://github.com/novaframework/nova) web applications.

Bridges Nova and [Seki](https://github.com/Taure/seki) to provide dependency health checking, Kubernetes-ready probes, circuit breakers, bulkheads, deadline propagation, and ordered graceful shutdown — all via declarative configuration.

## Features

- **Health endpoints** — `/health`, `/ready`, `/live` for Kubernetes probes
- **Startup gating** — traffic held until critical dependencies are healthy
- **Circuit breakers** — stop calling failing dependencies, allow recovery
- **Bulkheads** — limit concurrent requests per dependency
- **Retry** — configurable retry with exponential backoff and jitter
- **Deadline propagation** — per-request timeouts via headers or defaults
- **Graceful shutdown** — ordered teardown with drain, priority groups, and LB coordination
- **Telemetry** — events for all resilience operations (calls, breakers, shutdown, health)
- **Pluggable adapters** — built-in support for pgo, kura, brod, or custom

## Quick start

Add to your deps:

```erlang
{deps, [
    nova,
    nova_resilience
]}.
```

Add to your `.app.src` applications:

```erlang
{applications, [kernel, stdlib, nova, nova_resilience]}.
```

Register health routes in your Nova config:

```erlang
{my_app, [
    {nova_apps, [nova_resilience]}
]}.
```

Configure dependencies in `sys.config`:

```erlang
{nova_resilience, [
    {dependencies, [
        #{name => primary_db,
          type => database,
          adapter => pgo,
          pool => default,
          critical => true,
          breaker => #{failure_threshold => 5, wait_duration => 30000},
          bulkhead => #{max_concurrent => 25},
          shutdown_priority => 2}
    ]}
]}.
```

That's it. Your app now has `/health`, `/ready`, and `/live` endpoints, automatic startup gating, circuit breakers, bulkheads, and ordered shutdown.

## How it works

### Startup

1. App starts, nova_resilience provisions seki primitives for each dependency
2. Health checks run — `/ready` returns **503** until all critical deps are healthy
3. Kubernetes readiness probe holds traffic until ready
4. Once healthy, `/ready` returns **200** and traffic flows

### Running

Wrap calls to external dependencies through the resilience stack:

```erlang
case nova_resilience:call(primary_db, fun() ->
    pgo:query(~"SELECT * FROM users WHERE id = $1", [Id])
end) of
    {ok, #{rows := Rows}} ->
        {json, #{users => Rows}};
    {error, circuit_open} ->
        {json, 503, #{}, #{error => ~"db unavailable"}};
    {error, bulkhead_full} ->
        {json, 503, #{}, #{error => ~"overloaded"}};
    {error, deadline_exceeded} ->
        {json, 504, #{}, #{error => ~"timeout"}}
end.
```

### Shutdown

On SIGTERM (or application stop):

1. `/ready` immediately returns **503** (load balancer stops sending traffic)
2. Waits `shutdown_delay` for LB health checks to propagate
3. Drains in-flight requests (monitors bulkhead occupancy)
4. Tears down dependencies in `shutdown_priority` order
5. Nova drains HTTP connections and stops

No manual `prep_stop` calls needed — shutdown is fully automatic.

## Health endpoints

| Endpoint | Purpose | Response |
|----------|---------|----------|
| `GET /health` | Full diagnostic report | `{"status":"healthy","dependencies":{...},"vm":{...}}` |
| `GET /ready` | Kubernetes readiness probe | 200 when ready, 503 when not |
| `GET /live` | Kubernetes liveness probe | 200 if process is responsive |

The `/health` endpoint returns per-dependency status with circuit breaker state, bulkhead occupancy, and VM metrics (memory, process count, run queue, uptime, node).

## Configuration

### Application environment

```erlang
{nova_resilience, [
    {dependencies, [...]},            %% List of dependency configs
    {health_check_interval, 10000},   %% ms between health checks
    {vm_checks, true},                %% Include BEAM VM info in health report
    {gate_enabled, true},             %% false to skip startup gating (dev/test)
    {gate_timeout, 30000},            %% Max ms to wait for deps on startup
    {gate_check_interval, 1000},      %% ms between gate readiness checks
    {health_severity, info},          %% critical: /health returns 503 when unhealthy
    {shutdown_delay, 5000},           %% ms to wait after marking not-ready
    {shutdown_drain_timeout, 15000},  %% Max ms to drain per priority group
    {drain_poll_interval, 100},       %% ms between drain occupancy polls
    {health_prefix, ~""}              %% Prefix for health routes (e.g. ~"/internal")
]}.
```

Unknown config keys are logged as warnings on startup to catch typos.

### Dependency config

```erlang
#{
    name => atom(),                    %% Required — unique identifier
    type => database | kafka | custom, %% Optional — infers adapter
    adapter => pgo | kura | brod | module(), %% Optional — inferred from type
    critical => boolean(),             %% Default: false — gates /ready
    shutdown_priority => non_neg_integer(), %% Default: 10 — lower = first
    default_timeout => pos_integer(),  %% Default deadline in ms
    health_check => {module(), function()}, %% Override adapter health check

    %% Circuit breaker
    breaker => #{
        failure_threshold => pos_integer(),
        wait_duration => pos_integer(),
        slow_call_duration => pos_integer(),
        half_open_requests => pos_integer()
    },

    %% Concurrency limiter
    bulkhead => #{
        max_concurrent => pos_integer()
    },

    %% Retry with backoff
    retry => #{
        max_attempts => pos_integer(),
        base_delay => non_neg_integer(),
        max_delay => non_neg_integer()
    }
}
```

## Built-in adapters

| Type | Adapter | Health check | Shutdown |
|------|---------|-------------|----------|
| `database` | `pgo` (default) | `SELECT 1` via pgo pool | no-op |
| `database` | `kura` | `SELECT 1` via kura repo | no-op |
| `kafka` | `brod` | `brod:get_partitions_count/2` | `brod:stop_client/1` |
| any | custom module | `nova_resilience_adapter` behaviour | custom |

## Guides

- [Getting Started](guides/getting-started.md) — Installation and basic setup
- [Circuit Breakers & Bulkheads](guides/resilience-patterns.md) — Protecting dependencies
- [Deadline Propagation](guides/deadlines.md) — Per-request timeout budgets
- [Adapters](guides/adapters.md) — Built-in and custom adapters
- [Graceful Shutdown](guides/shutdown.md) — Ordered teardown and Kubernetes integration
- [Telemetry](guides/telemetry.md) — Observability and monitoring

## License

Apache-2.0