# Hedged
[Hex.pm](https://hex.pm/packages/hedged)
[CI](https://github.com/yoavgeva/hedged/actions/workflows/ci.yml)
> 126 tests, zero warnings, Dialyzer + Credo strict clean.
Hedged requests for Elixir — fire a backup request after a delay, take whichever finishes first, cancel the rest. A tail-latency optimization with adaptive delay tuning.
## Why hedged requests?
A single slow backend call can dominate your p99. Retries help with failures, but they don't help when the server is just *slow*. Hedged requests solve this by racing a backup against the original — your latency becomes the *minimum* of two attempts, not the maximum.
Google's ["Tail at Scale"](https://research.google/pubs/pub40801/) paper showed that issuing a redundant request after a brief delay can reduce p99 latency by up to 50% while adding only ~5% extra load.
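The intuition is quantitative: if two attempts were fully independent, the chance that *both* land in the slow tail would be the square of the chance that one does. A back-of-the-envelope sketch (in practice attempts share a backend, so the real reduction is smaller):

```elixir
# If 1% of single attempts land in the slow tail, two independent
# attempts both land there only 0.01% of the time - a 100x reduction
# in tail probability for that latency threshold.
p_slow = 0.01
p_both_slow = p_slow * p_slow
# p_both_slow is approximately 1.0e-4
```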
No existing Elixir library does this. Go has `cristalhq/hedgedhttp` (HTTP-only) and gRPC has built-in hedging. We built a generic, composable version for any Elixir function.
### Hedging vs retries
| | Retries | Hedging |
|---|---|---|
| Trigger | Failure | Timeout (slow response) |
| Concurrent requests | No (sequential) | Yes (parallel) |
| Solves | Transient errors | Tail latency |
| Extra load | Only on failure | ~5-10% steady state |
| Latency impact | Additive (delay + retry) | Subtractive (min of two) |
They're complementary — use both. See [Composing with BackoffRetry](#composing-with-backoffretry) below.
## Design goals
- **Generic** — works with any `fn -> result end`, not just HTTP. Database queries, RPC calls, DNS lookups, file reads — anything
- **Adaptive delay** — auto-tunes from observed latency percentiles so you don't have to guess a static delay
- **Token bucket** — prevents hedge storms under sustained load (~10% hedge rate at defaults)
- **Non-fatal fast-forward** — transient errors like `:timeout` or `:econnrefused` immediately fire the next hedge without waiting
- **Race mode** — `delay: 0` fires all requests simultaneously, takes the fastest
- **Staggered dispatch** — fire up to N requests with configurable delays between them
- **Automatic cancellation** — losers are shut down immediately, no wasted work
- **Composable** — stateless mode for simple cases, supervised tracker for production
- **Testable** — injectable `now_fn` for deterministic, instant test suites
- **Observable** — built-in stats: total requests, hedge rate, hedge win rate, p50/p95/p99, current delay, token level
- **Zero runtime deps** — just Elixir/OTP
- **Supervision-ready** — `child_spec/1` and `start_link/1` for your supervision tree
## Installation
```elixir
def deps do
  [{:hedged, "~> 0.1.0"}]
end
```
## Quick start
### Stateless — fixed delay
```elixir
# Fire a backup after 100ms (the default delay) if the first hasn't responded
{:ok, body} = Hedged.run(fn -> fetch(url) end)

# With options
{:ok, body} =
  Hedged.run(fn -> fetch(url) end,
    delay: 50,
    max_requests: 3,
    timeout: 2_000
  )
```
### Adaptive — delay auto-tunes from observed latency
```elixir
# Add to your supervision tree
children = [
  {Hedged, name: MyApp.Hedged, percentile: 95, min_delay: 5, max_delay: 500}
]

# Delay adapts automatically based on p95 latency
{:ok, body} = Hedged.run(MyApp.Hedged, fn -> fetch(url) end, [])

# Check how it's doing
Hedged.Tracker.stats(MyApp.Hedged)
# => %{total_requests: 1042, hedged_requests: 98, hedge_won: 31,
#      p50: 12, p95: 45, p99: 120, current_delay: 45, tokens: 8.2}
```
## Real-world examples
### HTTP with non-fatal errors
```elixir
Hedged.run(
  fn -> HTTPClient.get(url) end,
  delay: 50,
  max_requests: 3,
  non_fatal: fn
    :timeout -> true
    :econnrefused -> true
    _ -> false
  end,
  on_hedge: fn attempt ->
    Logger.info("Firing hedge ##{attempt}")
  end
)
```
When `non_fatal` returns true, the next hedge fires *immediately* instead of waiting for the delay — you don't waste time sleeping on errors you know are transient.
### Database query with tight deadline
```elixir
Hedged.run(fn -> Repo.query("SELECT ...") end,
  delay: 20,
  timeout: 1_000
)
```
### DNS resolution — race mode
Fire all at once, take the fastest:
```elixir
Hedged.run(fn -> dns_lookup(host) end,
  delay: 0,
  max_requests: 3
)
```
### Adaptive with multiple services
```elixir
# Each service gets its own tracker with tuned settings
children = [
  {Hedged, name: MyApp.PaymentHedge, percentile: 99, max_delay: 2_000},
  {Hedged, name: MyApp.SearchHedge, percentile: 90, max_delay: 200},
  {Hedged, name: MyApp.CacheHedge, percentile: 95, min_delay: 1, max_delay: 50}
]

# In your code — hedge only idempotent calls (see "When not to hedge")
{:ok, status} = Hedged.run(MyApp.PaymentHedge, fn -> payment_status(id) end, [])
{:ok, results} = Hedged.run(MyApp.SearchHedge, fn -> search(query) end, [])
```
### Composing with BackoffRetry
Hedge the outer call, retry the inner:
```elixir
Hedged.run(
  fn ->
    BackoffRetry.retry(fn -> flaky_api_call() end,
      max_attempts: 2,
      backoff: :constant,
      base_delay: 50
    )
  end,
  delay: 200
)
```
## How adaptive delay works
The adaptive tracker solves the "what delay should I pick?" problem by learning from your traffic:
```
Request lifecycle:
1. get_config() → {delay_ms, allow_hedge?} # Tracker computes from recent latencies
2. Runner fires request #1, waits delay_ms
3. If no response, fires hedge #2
4. Winner returns, losers cancelled
5. record(%{latency_ms, hedged?, hedge_won?}) # Tracker learns from this request
6. Next request → delay has shifted
```
**Percentile-based delay**: A circular buffer (default 1000 samples) tracks recent latencies. The delay is set to the target percentile (e.g., p95) of that buffer, clamped to `[min_delay, max_delay]`. Old samples are evicted FIFO, so the delay naturally tracks recent conditions.
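As a sketch of the mechanism (using a plain list in place of the library's circular buffer, with hypothetical names):

```elixir
defmodule PercentileSketch do
  # Keep at most `max` samples; prepending then truncating evicts oldest-first.
  def record(samples, latency_ms, max \\ 1000) do
    Enum.take([latency_ms | samples], max)
  end

  # Nearest-rank percentile over the current samples.
  def percentile(samples, p) do
    sorted = Enum.sort(samples)
    index = max(ceil(length(sorted) * p / 100) - 1, 0)
    Enum.at(sorted, index)
  end

  # The adaptive delay is the target percentile, clamped to [min, max].
  def delay(samples, p, min_delay, max_delay) do
    samples |> percentile(p) |> max(min_delay) |> min(max_delay)
  end
end

# With latencies 1..100 ms observed, the p95-based delay is 95ms:
samples = Enum.reduce(1..100, [], &PercentileSketch.record(&2, &1))
PercentileSketch.delay(samples, 95, 5, 500)
# => 95
```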
```
Requests 1-9: delay = 100ms (cold start, not enough samples)
Request 50: p95 = 22ms → delay 22ms (learned from traffic)
[service degrades]
Request 200: p95 = 180ms → delay 180ms (adapted to new conditions)
[service recovers]
Request 400: p95 = 25ms → delay 25ms (old slow samples evicted)
```
**Token bucket**: Prevents hedge storms. Each request earns a small credit (default 0.1 tokens). Each hedge costs more (default 1.0 token). When tokens drop below threshold, hedging is disabled entirely. At defaults this naturally limits the hedge rate to ~10% under steady state. After a burst of hedging depletes tokens, normal traffic replenishes them.
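The arithmetic is simple enough to sketch directly (hypothetical module, using the default values stated above):

```elixir
defmodule TokenBucketSketch do
  # Defaults: capacity 10, +0.1 credit per request, -1.0 per hedge,
  # hedging allowed only while tokens >= 1.0.
  @max 10.0
  @credit 0.1
  @cost 1.0
  @threshold 1.0

  def allow_hedge?(tokens), do: tokens >= @threshold

  def after_request(tokens, hedged?) do
    tokens = min(tokens + @credit, @max)
    if hedged?, do: tokens - @cost, else: tokens
  end
end

# A burst of 12 consecutive hedged requests drains a full bucket
# (net -0.9 tokens each), so hedging gets disabled:
drained =
  Enum.reduce(1..12, 10.0, fn _, t -> TokenBucketSketch.after_request(t, true) end)

TokenBucketSketch.allow_hedge?(drained)
# => false — only plain (non-hedged) traffic can refill the bucket
```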
## Options
### Stateless (`run/2`)
| Option | Default | Description |
|---|---|---|
| `delay` | `100` | ms before firing the next hedge |
| `max_requests` | `2` | total concurrent attempts |
| `timeout` | `5_000` | overall deadline in ms |
| `non_fatal` | `fn _ -> false end` | predicate: `true` fires next hedge immediately |
| `on_hedge` | `nil` | `fn attempt -> any` callback before each hedge |
| `now_fn` | `System.monotonic_time/1` | injectable clock for testing |
### Adaptive tracker (`start_link/1`)
| Option | Default | Description |
|---|---|---|
| `name` | *required* | registered name |
| `percentile` | `95` | target percentile for adaptive delay |
| `buffer_size` | `1000` | max latency samples to keep |
| `min_delay` | `1` | floor for adaptive delay (ms) |
| `max_delay` | `5_000` | ceiling for adaptive delay (ms) |
| `initial_delay` | `100` | delay used before enough samples collected |
| `min_samples` | `10` | samples needed before adapting |
| `token_max` | `10` | token bucket capacity |
| `token_success_credit` | `0.1` | tokens earned per request |
| `token_hedge_cost` | `1.0` | tokens spent per hedge |
| `token_threshold` | `1.0` | min tokens to allow hedging |
### Tuning the token bucket
The defaults give ~10% hedge rate. To adjust:
| Desired behavior | Configuration |
|---|---|
| More aggressive hedging (~20%) | `token_success_credit: 0.2` |
| Conservative hedging (~5%) | `token_success_credit: 0.05` |
| Always allow hedging | `token_threshold: 0` |
| Disable hedging temporarily | `token_max: 0` |
| Larger hedge budget bursts | `token_max: 20` |
## When not to hedge
Hedging adds roughly 5-10% extra load at the default settings. Don't use it when:
- **Non-idempotent operations** — double-charging a credit card is bad. Only hedge reads or idempotent writes
- **Resource-constrained backends** — if your DB is at capacity, extra queries make things worse
- **Already fast** — if p99 is already acceptable, hedging adds complexity for no gain
- **Single backend instance** — hedging helps when slowness is per-request (GC pauses, network jitter), not when the entire service is slow
## Algorithm
1. Fire request #1 immediately
2. Wait `delay` ms
3. If a response has already arrived — return it (nothing else is in flight to cancel)
4. Fire request #2 (the hedge)
5. Wait for any response: first success wins, losers cancelled
6. If a failure is `non_fatal` — fire next hedge immediately (fast-forward)
7. If all attempts fail — return `{:error, last_reason}`
8. If overall `timeout` hit — cancel everything, return `{:error, :timeout}`
Key behaviors:
- If a request fails *before* the delay expires, the next hedge fires when the delay would have normally triggered (or immediately if `non_fatal`)
- If *all* pending requests have failed but `max_requests` isn't reached, the next hedge fires immediately — no point waiting
- Raises, exits, and throws in tasks are captured via `:DOWN` messages — they don't crash the caller
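Steps 1-5 can be sketched with plain `Task` primitives. This is a simplification of what the runner does — no staggered N-way dispatch, token bucket, fast-forward, or crash handling:

```elixir
defmodule RaceSketch do
  # Fire the work; if no reply within delay_ms, fire a backup and
  # take whichever replies first, killing the loser.
  def hedge(fun, delay_ms) do
    first = Task.async(fun)

    case Task.yield(first, delay_ms) do
      {:ok, result} ->
        # Original finished before the delay expired: no hedge needed.
        result

      nil ->
        backup = Task.async(fun)
        await_first([first, backup])
    end
  end

  defp await_first(tasks) do
    receive do
      {ref, result} ->
        if winner = Enum.find(tasks, &(&1.ref == ref)) do
          # Stop monitoring the winner and cancel the losers.
          Process.demonitor(ref, [:flush])
          Enum.each(tasks -- [winner], &Task.shutdown(&1, :brutal_kill))
          result
        else
          await_first(tasks)
        end
    end
  end
end

RaceSketch.hedge(fn -> Process.sleep(5); :fast end, 50)
# => :fast (original wins, no hedge fired)
```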
## Error handling
Exceptions, exits, and throws inside hedged tasks are captured — they never crash the caller:
| Source | Wrapped as |
|---|---|
| `raise "boom"` | `{:error, %RuntimeError{}}` |
| `exit(:reason)` | `{:error, {:reason, stacktrace}}` |
| `throw(:value)` | `{:error, {{:nocatch, :value}, stacktrace}}` |
If multiple tasks are in flight and one crashes, the others keep running. You still get a result as long as any task succeeds.
## Return values
| Scenario | Return |
|---|---|
| Function returns `{:ok, value}` | `{:ok, value}` |
| Bare value (e.g. `42`) | `{:ok, 42}` |
| `:ok` | `{:ok, :ok}` |
| `{:ok, {:error, _}}` | `{:ok, {:error, _}}` (inner value preserved) |
| All attempts return `{:error, r}` | `{:error, r}` (last error) |
| Overall timeout exceeded | `{:error, :timeout}` |
| All tasks raise / exit / throw | `{:error, reason}` (from `:DOWN`) |
| `:error` | `{:error, :error}` |
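The wrapping rules above amount to a small normalization step, sketched here as a hypothetical anonymous function (not the library's internal name):

```elixir
normalize = fn
  {:ok, _} = ok -> ok          # already tagged: pass through (inner value preserved)
  {:error, _} = err -> err     # tagged error: pass through
  :error -> {:error, :error}   # bare :error becomes a tagged error
  other -> {:ok, other}        # any other bare value (including :ok) is wrapped
end

normalize.(42)      # => {:ok, 42}
normalize.(:ok)     # => {:ok, :ok}
normalize.(:error)  # => {:error, :error}
```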
## Stats & observability
```elixir
Hedged.Tracker.stats(MyApp.Hedged)
```
Returns:
| Field | Description |
|---|---|
| `total_requests` | Total requests processed |
| `hedged_requests` | Requests that triggered at least one hedge |
| `hedge_won` | Times the hedge beat the original request |
| `p50` | Median observed latency (ms) |
| `p95` | 95th percentile latency (ms) |
| `p99` | 99th percentile latency (ms) |
| `current_delay` | Current adaptive delay being used (ms) |
| `tokens` | Current token bucket level |
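The counters compose into the headline rates. For example, hedge rate and hedge win rate from a snapshot shaped like the one in the quick start:

```elixir
stats = %{total_requests: 1042, hedged_requests: 98, hedge_won: 31}

hedge_rate = stats.hedged_requests / stats.total_requests
hedge_win_rate = stats.hedge_won / stats.hedged_requests

Float.round(hedge_rate * 100, 1)      # => 9.4 (% of requests that fired a hedge)
Float.round(hedge_win_rate * 100, 1)  # => 31.6 (% of hedges that beat the original)
```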
## Architecture
```
lib/
hedged.ex # Public API: run/2 (stateless), run/3 (adaptive),
# start_link/1, child_spec/1
hedged/
runner.ex # Core engine: staggered dispatch + receive loop
tracker.ex # GenServer: adaptive delay + token bucket + stats
percentile.ex # Circular buffer + percentile calculation
```
Three layers, each independently useful:
- **Runner** — pure hedging engine, no state, no GenServer
- **Tracker** — adaptive delay + throttling, plugs into Runner
- **Percentile** — data structure, no processes, usable standalone
## License
MIT