
# StatusCodeTracker

An Elixir library for tracking HTTP status code rates and monitoring service health. It counts 5xx responses within a sliding time window and flags the service as unhealthy when that count reaches a configured threshold.

## Installation

Add `status_code_tracker` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:status_code_tracker, "~> 0.1.3"}
  ]
end
```

## Configuration

Configure the library in your application config:

```elixir
config :status_code_tracker, :settings,
  time_window_seconds: 60,
  error_threshold: 10,
  keep_unhealthy?: false,
  unhealthy_action: fn -> YourModule.on_unhealthy() end,
  healthy_action: fn -> YourModule.on_healthy() end,
  extra_checks: fn -> YourModule.extra_checks() end,
  unhealthy_status_code: 503,
  verbose?: true,
  unhealthy_message: "Service unhealthy due to many 5xx",
  extra_checks_error_message: "Extra checks failed"
```

### Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| `time_window_seconds` | `60` | Sliding time window (in seconds) for counting errors |
| `error_threshold` | `10` | Number of 5xx errors within the time window that triggers unhealthy state |
| `keep_unhealthy?` | `false` | If `true`, service stays unhealthy until manually reset. If `false`, service auto-recovers when errors drop below threshold |
| `unhealthy_action` | `fn -> :noop end` | Callback function triggered when service becomes unhealthy |
| `healthy_action` | `fn -> :noop end` | Callback function triggered when service recovers and becomes healthy again |
| `extra_checks` | `fn -> false end` | Custom function for additional health checks beyond error rate. Return `true` when a check fails, `false` when all checks pass |
| `unhealthy_status_code` | `503` | HTTP status code returned when service is unhealthy |
| `verbose?` | `false` | Enable detailed logging of health status changes |
| `unhealthy_message` | `"Service unhealthy due to many 5xx"` | Custom message returned when unhealthy due to error threshold |
| `extra_checks_error_message` | `"Extra checks failed"` | Custom message returned when extra checks fail |

## Usage

### Adding the Health Check Endpoint

You can add the health check endpoint to your router:

```elixir
scope "/health" do
  get("/", StatusCodeTracker.HealthPlug, [json: true, body: "{\"status\":\"success\"}"])
end
```

Or add it to your endpoint:

```elixir
plug StatusCodeTracker.HealthPlug, path: "/health"
```

### Adding the Error Tracker

Add the tracker plug to your endpoint to automatically track all 5xx errors:

```elixir
plug StatusCodeTracker.Plug
```

## How it Works

### Error Tracking

The library uses an ETS table to store timestamps of 5xx errors. When a request results in a 5xx status code, the timestamp is recorded. The health check endpoint is automatically excluded from tracking, so its own 5xx responses (such as the `unhealthy_status_code` returned while unhealthy) are not counted as errors.

### Health Evaluation

When a health check is performed:
1. The library counts errors that occurred within `time_window_seconds`
2. If the count exceeds `error_threshold`, the service is marked unhealthy
3. Optionally, the `extra_checks` function is called for additional validation
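
The counting step above can be sketched with an ETS match specification. This is a minimal illustration, not the library's actual implementation: the `WindowSketch` module, the `{:error, timestamp}` row shape, and the use of `>` for the threshold comparison (following step 2) are all assumptions.

```elixir
defmodule WindowSketch do
  @moduledoc "Hypothetical sketch of sliding-window error counting; not StatusCodeTracker's code."

  # :duplicate_bag keeps several rows with the same timestamp as separate entries.
  def new_table, do: :ets.new(:error_timestamps, [:duplicate_bag, :public])

  # Record one 5xx error that happened at `now` (Unix seconds).
  def track_error(table, now), do: :ets.insert(table, {:error, now})

  # Count rows whose timestamp is newer than `now - window_seconds`.
  def count_in_window(table, now, window_seconds) do
    cutoff = now - window_seconds
    match_spec = [{{:error, :"$1"}, [{:>, :"$1", cutoff}], [true]}]
    :ets.select_count(table, match_spec)
  end

  # Assumption: "exceeds" is read as a strict comparison.
  def threshold_exceeded?(table, now, window_seconds, threshold) do
    count_in_window(table, now, window_seconds) > threshold
  end
end

# Usage: one recent error, evaluated against a 60-second window and a threshold of 10.
table = WindowSketch.new_table()
now = System.system_time(:second)
WindowSketch.track_error(table, now - 5)
WindowSketch.threshold_exceeded?(table, now, 60, 10)
```

Passing `now` in explicitly keeps the functions easy to test with fixed timestamps.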

### Automatic Cleanup

A periodic cleanup process removes old timestamps (older than `time_window_seconds`) to prevent memory growth.
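
A periodic cleanup of this kind could look roughly like the following GenServer. It is a sketch under assumptions, not the library's implementation: the `CleanupSketch` module, its option names (`:table`, `:window_seconds`, `:interval_ms`), and the `{:error, timestamp}` row shape are hypothetical, and the table is assumed to be `:public` so that a separate process may delete from it.

```elixir
defmodule CleanupSketch do
  @moduledoc "Hypothetical periodic pruning of old error timestamps in an ETS table."
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # Delete rows whose timestamp is at or before `now - window_seconds`.
  # Returns the number of rows removed.
  def prune(table, now, window_seconds) do
    cutoff = now - window_seconds
    match_spec = [{{:error, :"$1"}, [{:"=<", :"$1", cutoff}], [true]}]
    :ets.select_delete(table, match_spec)
  end

  @impl true
  def init(opts) do
    state = %{
      table: Keyword.fetch!(opts, :table),
      window: Keyword.get(opts, :window_seconds, 60),
      interval: Keyword.get(opts, :interval_ms, 30_000)
    }

    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:prune, state) do
    prune(state.table, System.system_time(:second), state.window)
    # Re-arm the timer so pruning repeats every `interval` milliseconds.
    schedule(state.interval)
    {:noreply, state}
  end

  defp schedule(interval), do: Process.send_after(self(), :prune, interval)
end
```

Pruning only entries that have already fallen out of the window means cleanup never changes the result of a health evaluation; it only bounds the table size.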

### Health State Behavior

#### When `keep_unhealthy?: false` (default)

The health status is **transient** - re-evaluated on every health check:
- Errors below threshold → healthy (200 OK)
- Errors above threshold → unhealthy (503 by default, see `unhealthy_status_code`)
- Service automatically recovers when errors drop below threshold

```
[errors spike] → unhealthy → [errors drop] → automatically healthy
```

#### When `keep_unhealthy?: true`

The health status is **sticky** - once unhealthy, stays unhealthy:
- When errors exceed threshold → service marked unhealthy until it is manually reset
- Even if errors drop, service remains unhealthy
- Recovery requires manual intervention (e.g., calling `StatusCodeTracker.Server.update_healthy(true)`)

```
[errors spike] → unhealthy → [errors drop] → STILL unhealthy (requires manual reset)
```
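
The difference between the two modes can be summarized as a small pure function. This is only a model of the behavior described above: the `HealthStateSketch` module and its argument names are hypothetical and do not come from the library.

```elixir
defmodule HealthStateSketch do
  @moduledoc "Hypothetical model of transient vs. sticky health state."

  # Transient mode: the state depends only on the current evaluation.
  def next_healthy?(_currently_healthy?, threshold_exceeded?, false = _keep_unhealthy?) do
    not threshold_exceeded?
  end

  # Sticky mode: once unhealthy, the state stays unhealthy regardless of the
  # current error count. Only a manual reset can flip it back.
  def next_healthy?(currently_healthy?, threshold_exceeded?, true = _keep_unhealthy?) do
    currently_healthy? and not threshold_exceeded?
  end
end

# Errors have dropped, but the service was already unhealthy:
HealthStateSketch.next_healthy?(false, false, false) # transient mode recovers
HealthStateSketch.next_healthy?(false, false, true)  # sticky mode stays unhealthy
```

In sticky mode, the manual reset corresponds to calling `StatusCodeTracker.Server.update_healthy(true)`, for example from a remote console or an admin-only endpoint.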

### Action Callbacks

#### `unhealthy_action`

Triggered when the service transitions from healthy to unhealthy. Use this for:
- Sending alerts/notifications
- Logging incidents
- Triggering automated recovery procedures

```elixir
unhealthy_action: fn ->
  Logger.error("Service became unhealthy!")
  AlertService.send_alert("Service down")
end
```

#### `healthy_action`

Triggered when the service transitions from unhealthy back to healthy (only when `keep_unhealthy?: false`). Use this for:
- Sending recovery notifications
- Logging recovery events
- Resetting alert states

```elixir
healthy_action: fn ->
  Logger.info("Service recovered!")
  AlertService.send_recovery("Service recovered")
end
```

### Extra Checks

You can define custom health checks beyond error rate monitoring:

```elixir
extra_checks: fn ->
  case check_database_connection() do
    :ok -> false  # false means no issues
    :error -> true  # true means check failed
  end
end
```

The `extra_checks` function should return:
- `false` - all checks passed
- `true` - checks failed, service should be marked unhealthy
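
When several dependencies need checking, one way to produce that boolean is to run a list of check functions and report whether any of them failed. The sketch below is illustrative only: `MyApp.HealthChecks`, `any_failed?/1`, and the convention that each check returns `:ok` on success are assumptions, not part of StatusCodeTracker.

```elixir
defmodule MyApp.HealthChecks do
  @moduledoc "Hypothetical helper that combines several checks into one boolean."

  # Returns true if any check fails, matching the `extra_checks` contract
  # (true = failed, false = all passed). A check passes only if it returns :ok.
  def any_failed?(checks) when is_list(checks) do
    Enum.any?(checks, &failed?/1)
  end

  # A check that raises is treated as a failure instead of crashing the caller.
  defp failed?(check) do
    try do
      check.() != :ok
    rescue
      _ -> true
    end
  end
end

# Example wiring (the individual check functions are placeholders):
# extra_checks: fn ->
#   MyApp.HealthChecks.any_failed?([&MyApp.Database.ping/0, &MyApp.Cache.ping/0])
# end
```

Because `Enum.any?/2` stops at the first failing check, cheap checks are best placed first in the list.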

## API Reference

### StatusCodeTracker.Server

- `track_error/0` - Records a 5xx error timestamp
- `health_check_pass?/0` - Returns `true` if service is healthy
- `healthy?/0` - Returns current health state
- `update_healthy/1` - Manually set health state (useful with `keep_unhealthy?: true`)
- `error_threshold_reached?/0` - Check if errors exceed threshold

## License

MIT License. See LICENSE file for details.