# NebulaAPI

Transparent, safe cluster-wide APIs for Elixir — compile-time verified,
zero-overhead distributed calls.

Define your functions once. The compiler decides what runs where. Calls
across nodes look and feel like local function calls.

## The model in 30 seconds

A NebulaAPI cluster is a set of **nodes** (each one an Erlang VM, e.g.
`db@db.example`). Every node carries one or more **tags** — arbitrary atoms. No atom is
special; a tag can name a role (`:db`, `:worker`), a capability (`:cache`), or a whole
deployment (`:mainframe_cluster`, with the worker off in another cloud as
`:cloud_worker_lambda`). You declare the map once, in config:

```elixir
# config/config.exs
config :nebula_api,
  nodes: [
    "api@api.example":       [:mainframe_cluster, :api, :cache],
    "db@db.example":         [:mainframe_cluster, :db, :cache],
    "worker@worker.example": [:cloud_worker_lambda, :worker]
  ]
```

In your code you pick *where* things run with two sigils — by capability, or by
name:

- **`&tag`** — *any* node carrying that tag (picking by capability). `&db` reads
  as "wherever the `:db` tag lives"; the `&` turns the tag atom `:db` into a
  selector. Tags are lowercase atoms — `&db`, `&cache`, `&mainframe_cluster`.
- **`@node`** — pick a node by name. `@worker` is the **short** name (everything
  before `@`); when several nodes share it, `@worker` targets them all — that's a
  feature, see [short vs full names](#short-vs-full-names) for pinning exactly one.

`!` negates either one: `!&legacy` is "every node *without* the `:legacy` tag",
`!@backup` is "every node except `@backup`". These are **selectors** — they tell
the compiler which nodes get the real code.

Now write functions and tag each with the selector for where its body belongs:

```elixir
defmodule MyApp.Users do
  use NebulaAPI

  # `&db` → the body is compiled only on nodes carrying the :db tag.
  # On every other node, the same call becomes transparent RPC to a :db node.
  defapi &db, find(id) do
    Repo.get(User, id)        # %User{} or nil — returned verbatim, no wrapping
  end

  # A different capability, on different nodes: the cache lives on &cache nodes.
  defapi &cache, update_cache(id, user) do
    Cachex.put(:users, id, user)
  end
end
```

On a node tagged `:db`, `find/1` is a direct `Repo.get`; on every other node the same call
dispatches over Erlang distribution to a `:db` node and hands back the identical value. The
caller never knows which node ran it — and never has to. The body's value comes back as-is,
so you branch on it like any local call:

```elixir
# Same call on any node:
case MyApp.Users.find(42) do
  %User{} = user -> MyApp.Users.update_cache(user.id, user)
  nil            -> :not_found
end
```

That `update_cache/2` call carries `&cache`, so **by default it resolves on one node** —
locally if the caller is a `&cache` node, otherwise a single `&cache` worker (the first
registered one; it's a unicast, not a broadcast and not a race). The *other* `&cache` nodes
still hold a stale copy. When you mean "reach more than one", say so explicitly:

```elixir
# every &cache node serving the method
call_on_all_nodes do
  MyApp.Users.update_cache(user.id, user)
end

# one specific node
call_on_node @db do
  MyApp.Users.update_cache(user.id, user)
end

# every &cache node except @db — multicast, space-juxtaposed selector + negation
call_on_nodes &cache !@db do
  MyApp.Users.update_cache(user.id, user)
end
```

## What you get from compile-time

NebulaAPI resolves all routing decisions at compile time. This is not a
runtime router — it's a code generator that produces different bytecode
for each node. That buys you four things:

**No unnecessary deps.** Wrap a `use`, an `import`, or a child spec in `on_nebula_nodes` so
it exists only where it belongs:

```elixir
defmodule MyApp.Cache do
  use NebulaAPI

  on_nebula_nodes &cache do
    import Cachex, only: [put: 3]   # only &cache nodes even reference Cachex
  end

  defapi &cache, update_cache(id, user), do: put(:users, id, user)
end
```

The non-matching branch is absent from the bytecode, so a non-`&cache` node never loads
Cachex (gate the dependency itself the same way and it isn't even pulled in).

**Smaller binaries.** Code that doesn't belong on a node doesn't exist in its binary — a
`defapi` body is only emitted on matching nodes. Whole dependencies fall away the same way. The
[runnable demo](https://github.com/podCloud/NebulaAPI/tree/main/demo) pins Cachex to its
`db` node (`on_nebula_nodes @db` plus a conditional dep), so only that build carries Cachex
and its dependency tree (~570 KB); every other node never compiles it and comes out
**~38% smaller — ≈860 KB vs the db node's 1.4 MB** (measured, per-node `_build` from
`mix compile`). Your web node doesn't carry FFmpeg bindings; your worker doesn't carry
Phoenix routes.

**Compile-time safety.** Reference a tag or node that isn't in your topology and the build
stops — no silent RPC into the void:

```elixir
defapi @nope, f() do ... end
```
```
** (CompileError) Unknown nodes in defapi call :
	- @nope

Available nodes :
	- @api
	- @:"api@api.example"
	- @db
	- @:"db@db.example"
	- @worker
	- @:"worker@worker.example"
```

The `:nebula` compiler goes one further: an app with `defapi` modules but no
`nebula_api_server()` wired in fails to compile, instead of silently shipping workers that
never register:

```
Found 1 module(s) using NebulaAPI with local methods in app :my_app, but no
nebula_api_server() has been found in :my_app's supervisor — their RPC workers
will never start.

   App:         :my_app
   Application: MyApp.Application
                ^------ hint: add nebula_api_server() to its supervisor's children
   Modules using NebulaAPI (with local methods on this node):
         - MyApp.Users
```

**Zero runtime overhead.** A locally-resolved call is a direct function call — no routing
table, no RPC serialization, just a couple of process-dictionary reads to check for an
active routing context. Measured, that's **~60 ns** versus **~8 ns** for a plain call (see
[Performance](#performance)) — about **0.00005 ms** of overhead, free in any practical
sense. The decision was made once, at compile time.

> **"Compile per release" — the one mental shift.** NebulaAPI produces
> different bytecode per node, so each release is its own build. For Elixir
> devs used to a single runtime artifact, that's the surprising part. In
> practice it's one extra `elixir --name node@host -S mix compile` per
> release — a few seconds of CI, paid back many times over in smaller
> binaries, fewer dependencies, and zero routing overhead.

## How it works

Same source, different bytecode. Each release is compiled with its target node name (the
compiler reads `node()`), so a `&db` body is **real code** on a node that has `:db` and an
**RPC stub** everywhere else — the stub routes through `:pg` process groups to a node that
does have the body.

<details>
<summary>📊 Diagram</summary>

```
┌─────────────────────────────────────────────────────────┐
│                    Source code (same)                    │
│                                                         │
│   defapi &db, find_user(id) do                          │
│     Repo.get(User, id)                                  │
│   end                                                   │
└────────────────────┬────────────────────────────────────┘
                     │
          ┌──────────┴──────────┐
          │  mix compile        │
          │  --name node@host   │
          ▼                     ▼
   ┌─────────────┐      ┌─────────────┐
   │   @alpha     │      │   @beta     │
   │  (has &db)   │      │  (no &db)   │
   ├─────────────┤      ├─────────────┤
   │ find_user/1 │      │ find_user/1 │
   │ → Repo.get  │      │ → RPC call  │
   │   (local)   │      │   (remote)  │
   └─────────────┘      └──────┬──────┘
                               │
                        :pg process groups
                               │
                        ┌──────▼──────┐
                        │   @alpha    │
                        │   Worker    │
                        │   Repo.get  │
                        └─────────────┘
```

</details>
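
The "same source, different bytecode" idea can be reproduced in plain Elixir with a macro that picks a branch while the module is being compiled. The sketch below is not NebulaAPI's implementation: `GateSketch`, `on_tag` and the hard-coded `@self_tags` list are hypothetical stand-ins for the real topology lookup keyed on `node()`.

```elixir
defmodule GateSketch do
  # Assumption: the tags of the node being compiled, hard-coded for the sketch.
  @self_tags [:api, :cache]

  # Runs at compile time and returns the AST of ONE branch only;
  # the other branch never reaches the bytecode.
  defmacro on_tag(tag, do: matching, else: fallback) do
    if tag in @self_tags, do: matching, else: fallback
  end
end

defmodule GateSketch.Demo do
  require GateSketch

  # :db is not in @self_tags, so only the stand-in "stub" branch is compiled.
  def find(id) do
    GateSketch.on_tag :db do
      {:local_body, id}
    else
      {:rpc_stub, id}
    end
  end

  # :cache is in @self_tags, so the "real body" branch is compiled.
  def cache_get(key) do
    GateSketch.on_tag :cache do
      {:local_body, key}
    else
      {:rpc_stub, key}
    end
  end
end
```

Changing `@self_tags` and recompiling flips which branch exists, which is the same trade the real library makes per release.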

## Reshape your topology without touching code

This is why NebulaAPI exists: the flexibility of umbrella releases, **without rewriting
code** every time you split a node out or stand up a new release. The same source ships as
one node or many — you change config and which releases you build, nothing else.

```elixir
# dev — one node wears every hat, a single release, every call local
nodes: ["dev@localhost": [:api, :db, :worker, :cache]]

# staging — pull the database onto its own node
nodes: [
  "app@app.staging": [:staging_cluster, :api, :worker, :cache],
  "db@db.staging":   [:staging_cluster, :db, :cache]
]

# prod — scale the workers out, keep one db; w3 lives in another cloud
nodes: [
  "app@app.prod":    [:mainframe_cluster, :api, :cache],
  "worker@w1.prod":  [:mainframe_cluster, :gpu],
  "worker@w2.prod":  [:alpha_cluster, :llm],
  "worker@w3.prod":  [:cloud_worker_lambda, :gpu, :storage],
  "db@db.prod":      [:mainframe_cluster, :db, :cache]
]
```

Moving `:db` off the app node, or fanning workers across three machines, is a config change
and a rebuild — never a code change. And the tags follow how you actually think about the
fleet: the three workers share the short name `worker` (so `@worker` hits all of them
without any `:worker` tag), the deployment tag varies by environment and even by node
(`worker@w3.prod` is `:cloud_worker_lambda` — off in another cloud), and the capability
tags (`:gpu`, `:llm`, `:storage`) carve out *which* worker you mean (`@worker &gpu`). A tag
is just a label; slice the cluster however suits you.

## Installation

Add `:nebula_api` to your deps — from [Hex](https://hex.pm/packages/nebula_api):

```elixir
def deps do
  [
    {:nebula_api, "~> 0.5"}
  ]
end
```

Or track the repo directly, pinned to a tag (swap `tag:` for `branch:` or `ref:` to pick up an unreleased fix):

```elixir
def deps do
  [
    {:nebula_api, git: "git@github.com:podCloud/NebulaAPI.git", tag: "v0.5.1"}
  ]
end
```

## Quick start

### 1. Define your cluster topology

```elixir
# config/config.exs
config :nebula_api,
  nodes: [
    "api@api.example": [:mainframe_cluster, :api],
    "db@db.example": [:mainframe_cluster, :db],
    "worker@worker.example": [:alpha_cluster, :worker]
  ]
```

Each key is a full node name (`short@host`); each value is a list of capability
**tags** (see [the model above](#the-model-in-30-seconds)). In selectors you can
use the short name: `@db` matches `:"db@db.example"`, `@worker` matches
`:"worker@worker.example"` — when there's no ambiguity, short names are all you
need.

### 2. Define distributed functions

```elixir
defmodule MyApp.Users do
  use NebulaAPI

  # Body compiles on &db nodes. Everywhere else: transparent RPC.
  defapi &db, find(id) do
    Repo.get!(User, id)
  end
end
```

### 3. Wire a server into each app's supervision tree

```elixir
defmodule MyApp.Application do
  use Application
  use NebulaAPI.Server

  def start(_type, _args) do
    Supervisor.start_link([nebula_api_server()], strategy: :one_for_one, name: MyApp.Sup)
  end
end
```

`use NebulaAPI.Server` brings the `nebula_api_server/0` macro into scope (plus the
`on_nebula_nodes` / `call_on_*` macros) — without the `defapi` bookkeeping, since the host
module defines none of its own. Use it on the module that wires the server, and put
`use NebulaAPI` on the modules that actually define `defapi` endpoints.

`nebula_api_server()` discovers the app's own modules that `use NebulaAPI` and starts a
supervised GenServer worker for each one that has local methods on this node; each worker
registers in `:pg` process groups for discovery across nodes. No module list to maintain —
and because the server lives in the app's own tree, its workers die with the app (so `:pg`
never holds stale entries).
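
To see why a dead worker cannot linger in discovery, here is a small sketch that uses OTP's `:pg` module directly. The group name is hypothetical (NebulaAPI's real group naming is not described here), and the "worker" is a bare process rather than a GenServer.

```elixir
defmodule PgSketch do
  # Hypothetical group name, for illustration only.
  @group {:sketch, MyApp.Users, :find, 1}

  def demo do
    # Start the default :pg scope unless something already started it.
    case :pg.start_link() do
      {:ok, _pid} -> :ok
      {:error, {:already_started, _pid}} -> :ok
    end

    worker =
      spawn(fn ->
        receive do
          :stop -> :ok
        end
      end)

    :ok = :pg.join(@group, worker)
    members_while_alive = :pg.get_members(@group)

    # Stop the worker and wait until it is really gone.
    ref = Process.monitor(worker)
    send(worker, :stop)

    receive do
      {:DOWN, ^ref, :process, _pid, _reason} -> :ok
    end

    # :pg cleans up asynchronously, so poll briefly for the empty group.
    {members_while_alive == [worker], wait_until_empty(50)}
  end

  defp wait_until_empty(0), do: :pg.get_members(@group) == []

  defp wait_until_empty(tries) do
    if :pg.get_members(@group) == [] do
      true
    else
      Process.sleep(10)
      wait_until_empty(tries - 1)
    end
  end
end
```

`PgSketch.demo/0` returns a pair of booleans: the worker was listed while alive, and the group emptied once it exited, with no explicit `leave` call.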

#### Optional: guard against forgetting it

Add the `:nebula` compiler to catch a missing `nebula_api_server()` at compile time:

```elixir
def project do
  [
    # ...
    compilers: Mix.compilers() ++ [:nebula]
  ]
end
```

If an app has modules with local methods but no `nebula_api_server()` wired into its
supervisor, `mix compile` fails with an explanatory error — the same spirit as the
compile error raised for a `defapi` targeting an unknown node.

### 4. Compile with the target node name

With the code and server in place, compile each release **as the node it will run as**.
NebulaAPI keys its codegen on `node()` at **compile time**, and you set that by passing
`--name` to the `elixir` command that launches `mix compile`:

```bash
elixir --name api@api.example -S mix compile && mix release api
```

Forget `--name` and the build stops with a clear `CompileError` (`node()` would be
`nonode@nohost` — the name isn't *unknown*, it's *unset*, so `allow_unknown_self_node`
won't paper over it). Set `allow_nonode_nohost: true` if you really mean a nameless
[generic build](#generic-nodes-serve-nothing-call-everything).

Build each release in its own stage, pinning the compile-time node name:

```dockerfile
# api release — compiled as node api@api.example
RUN elixir --name api@api.example -S mix compile && mix release api

# worker release — separate stage, compiled as node worker@worker.example
RUN elixir --name worker@worker.example -S mix compile && mix release worker
```

Then each release must **boot as that same node name**. That's a separate, *runtime*
concern, handled by [Mix release's own env vars](https://hexdocs.pm/mix/Mix.Tasks.Release.html#module-environment-variables)
— `RELEASE_NODE` (the node name) and `RELEASE_DISTRIBUTION` (`name` for fully-qualified
names across hosts; the default is `sname`):

```bash
# at run time, in the api container
RELEASE_DISTRIBUTION=name RELEASE_NODE=api@api.example bin/api start
```

The compile-time `--name` and the runtime `RELEASE_NODE` **must match** — that's the whole
contract: the routing was decided for `api@api.example` at build, so the release has to
actually be `api@api.example` when it runs. NebulaAPI enforces it: if the running node
differs from the one the release was compiled as, the server **crashes at boot** with a
clear message rather than misrouting silently — unless you opt into running it as a
[generic node](#generic-nodes-serve-nothing-call-everything). (`RELEASE_NODE` defaults to
`<release_name>@…` with short-name distribution, so set it explicitly to get the
fully-qualified name.)
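
The compile-time versus runtime contract can be pictured with a module attribute, which Elixir evaluates while compiling. This is a minimal sketch of the idea, not the library's actual boot check; `BootCheckSketch` and its error message are invented for illustration.

```elixir
defmodule BootCheckSketch do
  # Module attributes are evaluated at compile time, so this freezes the
  # node name the compiler was running under.
  @compiled_as node()

  def compiled_as, do: @compiled_as

  # Compare the frozen name with the name of the VM that is running now.
  def verify!(running_as \\ node()) do
    if running_as == @compiled_as do
      :ok
    else
      raise "compiled as #{inspect(@compiled_as)} but running as #{inspect(running_as)}"
    end
  end
end
```

Calling `BootCheckSketch.verify!()` in the VM that compiled the module returns `:ok`; passing any other node name raises, which mirrors the crash-at-boot behaviour described above.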

In dev/test, you typically don't start the VM with `--name`. Use
`default_opts` to tell the compiler which node to pretend to be:

```elixir
# config/dev.exs
config :nebula_api,
  default_opts: [self_node: :"api@api.example"]
```

### 5. Call it — local or remote, same API

```elixir
# On @db (has &db) → local Repo.get!
MyApp.Users.find(42)
#=> %User{id: 42, ...}

# On @worker (no &db) → transparent RPC to a &db node
MyApp.Users.find(42)
#=> %User{id: 42, ...}
```

## Selectors

Selectors tell the compiler which nodes get the real implementation. Every other node
gets a *stub* in its place — a generated function that forwards the call over RPC to a
node that does have the body.

| Syntax | Meaning |
|---|---|
| `&tag` | Nodes with this tag |
| `!&tag` | Nodes without this tag |
| `@node` | Specific node (short or full name) |
| `!@node` | All nodes except this one |
| *(no selector)* | Every node — the body is local everywhere |

Combine selectors by **juxtaposing them with a space** — no commas between them, no
brackets. This is the canonical NebulaAPI syntax, and it's what keeps the code readable
(`&db !@backup` reads as "a `:db` node, but not `@backup`"):

```elixir
# Nodes with the :db tag, excluding @backup
defapi &db !@backup, run_migration(version) do
  Ecto.Migrator.run(Repo, :up, to: version)
end

# Specific node only
defapi @worker, transcode(input, opts) do
  FFmpex.new_command()
  |> FFmpex.add_input_file(input)
  |> FFmpex.add_output_file(opts[:output])
  |> FFmpex.execute()
end

# No selector → the body is local on every node, each returning its own data
defapi get_node_health() do
  %{node: node(), uptime: :erlang.statistics(:wall_clock) |> elem(0)}
end
```
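
As a mental model, juxtaposed selectors behave like a filter where every selector must hold for a node to match. The sketch below resolves selectors against the topology from the config example using plain data; the tuple encoding (`{:tag, _}`, `{:node, _}`, `{:not, _}`) and the `SelectorSketch` module are hypothetical, not the library's internal representation.

```elixir
defmodule SelectorSketch do
  # Topology copied from the config example at the top of this README.
  @nodes [
    "api@api.example": [:mainframe_cluster, :api, :cache],
    "db@db.example": [:mainframe_cluster, :db, :cache],
    "worker@worker.example": [:cloud_worker_lambda, :worker]
  ]

  # &tag
  def matches?({:tag, tag}, {_node, tags}), do: tag in tags
  # @node, by full name or by short name
  def matches?({:node, name}, {node, _tags}), do: name in [node, short_name(node)]
  # !selector
  def matches?({:not, selector}, entry), do: not matches?(selector, entry)

  # A node is selected when it satisfies every selector in the list.
  # An empty list selects every node, like a defapi with no selector.
  def resolve(selectors, nodes \\ @nodes) do
    for {node, _tags} = entry <- nodes,
        Enum.all?(selectors, &matches?(&1, entry)),
        do: node
  end

  defp short_name(node) do
    node |> Atom.to_string() |> String.split("@") |> hd() |> String.to_atom()
  end
end
```

In this encoding `&cache !@db` is `[{:tag, :cache}, {:not, {:node, :db}}]`, which leaves only `:"api@api.example"`.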

### Short vs full names

In config, node names are full Erlang names — `short@host`. In a selector you can use just
the **short** part (everything before `@`), which keeps call sites readable:

```elixir
# Equivalent when only one node is named "db@…":
defapi @db, do_something() do ... end
defapi @:"db@db.example", do_something() do ... end   # full name as an atom
```

The full-name form is `@:"name@host"` (a quoted atom, because the name itself contains an
`@`); negate it with `!@:"name@host"`.

**The short name is intentionally "many": that's a feature.** A short name matches *every*
node that shares it, which is usually exactly what you want for a horizontally-scaled role.
Picture three nodes running the same `worker` release on three hosts (as the
[runnable demo](https://github.com/podCloud/NebulaAPI/tree/main/demo) does), each kitted out
differently:

```elixir
"worker@worker1.test": [:alpha_cluster, :gpu, :storage],
"worker@worker2.test": [:beta_server, :llm],
"worker@worker3.test": [:alpha_cluster, :vps]
```

`@worker` targets *all three* — every node whose release name is `worker`, across hosts,
whatever capability tags they happen to carry. To pin exactly one, reach for its full name:
`@:"worker@worker2.test"`.

### What gets generated

For each `defapi`, the macro generates:

1. **`<name>/N`** — the public router callers actually invoke.
2. **`__nbapi_remote_<name>/N`** — RPC dispatch via `APIServer`, on **every** node.
3. **`__nbapi_local_<name>/N`** — the real body, on **matching nodes only**. Elsewhere
   nothing is emitted: the router goes remote there, so there's no stub to keep.

The remote function is generated on **every** node, including nodes
that have the local implementation. This is what makes `call_on_node`
and `call_on_nodes` work from anywhere — even a `&db` node can call
other `&db` nodes remotely for quorum writes, load distribution, etc.
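
Written by hand, the three generated functions on a matching node might look roughly like this. It is a sketch under assumptions: the process-dictionary key, the `with_context/2` helper and the tagged return values are hypothetical, and the real remote function dispatches through `APIServer` rather than returning a tuple.

```elixir
defmodule GeneratedShapeSketch do
  # Hypothetical process-dictionary key holding an active routing context.
  @ctx_key :sketch_routing_context

  # 1. Public router: go local unless a routing context asks otherwise.
  def find(id) do
    case Process.get(@ctx_key) do
      nil -> __nbapi_local_find(id)
      _context -> __nbapi_remote_find(id)
    end
  end

  # 2. Remote dispatch, present on every node (here it only labels the call).
  def __nbapi_remote_find(id), do: {:remote, id}

  # 3. The real body, emitted only on matching nodes.
  def __nbapi_local_find(id), do: {:local, id}

  # Stand-in for a call_on_* block: set a context, run, then restore.
  def with_context(context, fun) do
    previous = Process.put(@ctx_key, context)

    try do
      fun.()
    after
      if previous == nil do
        Process.delete(@ctx_key)
      else
        Process.put(@ctx_key, previous)
      end
    end
  end
end
```

Because the remote function exists even where the local one does, wrapping a call in a context can send it away from a node that could have served it itself.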

## Router and priorities

The public router on each `defapi` decides where a call goes, from the default outward —
the more explicit you get, the more it wins. Take the same call, `MyApp.Cache.get(key)`:

1. **Default** — `MyApp.Cache.get(key)` runs locally if this node serves the method,
   otherwise a single remote call (unicast).
2. **Wrapped in a block** — the same call inside `call_on_nodes &cache do … end` routes per
   the block instead.
3. **Its own trailing opts win over the block** — `MyApp.Cache.get(key, multicast: true)`
   routes itself, even inside a block; a routing key set to `nil` / `false` opts the call
   back out to the default.

**Default unicast goes to the first node on the `:pg` list that serves the method — never
the others.** Concretely that's the first node serving the API that connected to NebulaAPI
(joined the method's `:pg` group); that's the only node that runs the call. No fan-out, no
load-balancing by default. Membership is live, though: if that node drops, `:pg` removes it,
so the next call simply lands on whoever is now first among the nodes still connected. (Want
several nodes at once, a specific one, a random one, or a load-aware pick? That's
[runtime routing](#runtime-routing).)
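
The three priority levels reduce to a small decision function. The sketch below models them with plain values; `RoutePrioritySketch`, the `:unset` marker and the `:default_unicast` atom are illustrative names, not part of the library.

```elixir
defmodule RoutePrioritySketch do
  # call_route:  :unset when the call carries no routing opts,
  #              nil or false when it explicitly opts out,
  #              anything else is an explicit per-call route.
  # block_route: nil when the call is not inside a call_on_* block.
  def resolve(:unset, nil), do: :default_unicast
  def resolve(:unset, block_route), do: block_route
  def resolve(off, _block_route) when off in [nil, false], do: :default_unicast
  def resolve(call_route, _block_route), do: call_route
end
```

Reading the clauses top to bottom gives the same order as the list above: default, then block, then the call's own opts, with `nil` or `false` falling back to the default even inside a block.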

## `on_nebula_nodes` — conditional compilation

Include or exclude entire blocks of code based on the current node.
Unlike `defapi`, this works at any level — module body, `use`
directives, supervision trees:

```elixir
defmodule MyApp.Repo do
  use NebulaAPI.AST

  # Only connect to the database on &db nodes.
  # Other nodes don't even load Ecto.
  on_nebula_nodes &db do
    use Ecto.Repo, otp_app: :my_app
  end
end

defmodule MyApp.Application do
  use NebulaAPI.AST

  # Start the FFmpeg pool only on worker nodes
  on_nebula_nodes &worker do
    def extra_children, do: [MyApp.TranscoderPool]
  else
    def extra_children, do: []
  end
end
```

The non-matching branch is completely absent from the compiled bytecode. A module that
does only this can `use NebulaAPI.AST` — the lightest entry point, no `defapi` bookkeeping.

## Runtime routing

The selector on a `defapi` is the *default* route. Sometimes you need to override it at
runtime — send one call to a specific node, fan it out to several, or pick a node by load.
Three macros wrap a block to do that, named after how far the call goes:

- **`call_on_node`** — *unicast*: run on exactly one node.
- **`call_on_nodes`** — *multicast*: run on every node a selector matches.
- **`call_on_all_nodes`** — *broadcast*: run on every node that serves the method.

### `call_on_node` — unicast

```elixir
# Force execution on a specific node
call_on_node @worker do
  MyApp.Jobs.transcode(file, opts)
end

# Pick a node dynamically based on runtime info — least loaded
call_on_node fn nodes_info ->
  nodes_info
  |> Enum.filter(fn {_, info} -> info.connected && info.runtime end)
  |> Enum.min_by(fn {_, info} -> info.runtime.memory_percent end)
  |> elem(0)
end do
  MyApp.HeavyTask.run()
end

# Or just pick one at random
call_on_node fn nodes_info -> nodes_info |> Map.keys() |> Enum.random() end do
  MyApp.Jobs.transcode(file, opts)
end
```

### `call_on_nodes` — multicast

```elixir
# Call all &worker nodes, wait for all results
call_on_nodes &worker, strategy: :all, timeout: 30_000 do
  MyApp.Jobs.health_check()
end

# First to respond wins
call_on_nodes &worker, strategy: :first do
  MyApp.Jobs.transcode(file, opts)
end

# Quorum: a strict majority of the configured &db nodes must succeed (the default).
# A single live node out of three configured refuses — that's the point of a quorum.
call_on_nodes &db, strategy: :quorum do
  MyApp.Users.write_replica(user)
end

# A selector function over live node info — fan out only to nodes seen recently
call_on_nodes fn nodes_info ->
  cutoff = DateTime.add(DateTime.utc_now(), -30, :second)
  nodes_info
  |> Enum.filter(fn {_, i} -> i.last_seen_at && DateTime.compare(i.last_seen_at, cutoff) == :gt end)
  |> Enum.map(&elem(&1, 0))
end, strategy: :all do
  MyApp.Cache.invalidate(:all)
end
```

### `call_on_all_nodes` — broadcast

```elixir
call_on_all_nodes timeout: 5_000 do
  MyApp.Cache.invalidate(:all)
end
```

### Multicast strategies

Results are always tagged per node — `{node, value}` on success,
`{node, {:nebula_error, reason}}` for a node whose call failed at the transport level.

| Strategy | Behavior |
|---|---|
| `:all` | Wait for every node (or timeout). Returns a list of `{node, value}`. |
| `:first` | Return the first response that counts as a success (then stop waiting on the rest — the pending tasks are brutal-killed); `{:nebula_error, :no_success, results}` if none. |
| `:quorum` | Wait for a strict majority of the quorum set, or an exact `at_least:` count. The set is the **configured** nodes serving the method (`quorum: :configured`, the default — connected or not, so a single live node can't pass a 3-node quorum) or the connected workers (`quorum: :available`). The moment the quorum is reached it stops waiting on the rest (same brutal-kill as `:first`); fails fast (`:quorum_unreachable`) when the live set can't reach it. |

> "Stops waiting" is exactly that: once you have what you asked for (a first success, or
> the quorum), the rest is just wasted waiting — so NebulaAPI kills the local tasks still
> awaiting a reply and discards their late responses. A body that already started running on
> a remote node isn't aborted — the RPC was already sent.

`:first` and `:quorum` let you define what counts as a success with a `success:` (or
`failure:`) predicate — by default, any node that responded counts:

```elixir
# A write quorum that only accepts {:ok, _} replies
call_on_nodes &replica, strategy: :quorum, success: &match?({:ok, _}, &1) do
  MyApp.Store.write(key, value)
end
```
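
The quorum arithmetic is worth spelling out. This sketch assumes "strict majority" means `div(n, 2) + 1` and that the `workers:` field of `:quorum_unreachable` reports the connected count; `QuorumSketch` and its `{:ok, required}` return are hypothetical and only model the fail-fast check, not the waiting.

```elixir
defmodule QuorumSketch do
  # Strict majority of a set of n nodes.
  def majority(n) when is_integer(n) and n >= 0, do: div(n, 2) + 1

  # configured: how many nodes are configured to serve the method
  # connected:  how many of them have a live worker right now
  # opts:       quorum: :configured (default) | :available, at_least: count
  def check(configured, connected, opts \\ []) do
    set_size =
      case Keyword.get(opts, :quorum, :configured) do
        :configured -> configured
        :available -> connected
      end

    required = Keyword.get(opts, :at_least) || majority(set_size)

    if connected < required do
      # Fail fast: the live set can never reach the required count.
      {:nebula_error, :quorum_unreachable, %{workers: connected, required: required}}
    else
      {:ok, required}
    end
  end
end
```

With three configured nodes and one connected, `check(3, 1)` is unreachable (two required), while `check(3, 1, quorum: :available)` passes with a quorum of one. That difference is the reason `:configured` is the safer default.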

## Node info and intelligent routing

`call_on_node` and `call_on_nodes` accept selector functions that
receive live runtime data about every node:

```elixir
%{
  short_name: :db,
  long_name: :"db@db.example",
  host: "db.example",
  tags: [:mainframe_cluster, :db],
  connected: true,
  last_seen_at: ~U[2024-06-15 12:00:00Z],
  runtime: %{
    memory_used_mb: 256,
    memory_total_mb: 1024,
    memory_percent: 25.0,
    process_count: 1542,
    schedulers: 8,
    otp_release: "26",
    uptime_seconds: 86400
  }
}
```

A node whose worker just registered but isn't in the background snapshot yet still appears,
with `runtime: nil` / `last_seen_at: nil` until the next refresh — so filter on
`info.runtime` before reading through it.

```elixir
# Route to the node with the most headroom
call_on_node fn nodes_info ->
  nodes_info
  |> Enum.filter(fn {_, info} -> info.connected && info.runtime end)
  |> Enum.min_by(fn {_, info} -> info.runtime.memory_percent end)
  |> elem(0)
end do
  MyApp.HeavyTask.run()
end

# Only call nodes seen in the last 30 seconds
call_on_nodes fn nodes_info ->
  cutoff = DateTime.add(DateTime.utc_now(), -30, :second)
  nodes_info
  |> Enum.filter(fn {_, info} ->
    info.last_seen_at && DateTime.compare(info.last_seen_at, cutoff) == :gt
  end)
  |> Enum.map(&elem(&1, 0))
end do
  MyApp.Cache.invalidate(:all)
end
```
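
Selector functions are ordinary functions over a map, so they can be exercised with hand-built data before they go anywhere near a cluster. The sketch below is a variant of the "most headroom" selector that returns `nil` instead of raising when no node qualifies. How NebulaAPI treats a `nil` pick is not covered here, so treat that return value as an assumption of the sketch.

```elixir
defmodule SelectorFnSketch do
  # Returns the connected node with the lowest memory_percent,
  # or nil when no node has both a connection and runtime data.
  def least_loaded(nodes_info) do
    candidates =
      Enum.filter(nodes_info, fn {_node, info} -> info.connected && info.runtime end)

    # The zero-arity fallback avoids Enum.EmptyError on an empty candidate list.
    case Enum.min_by(candidates, fn {_node, info} -> info.runtime.memory_percent end, fn -> nil end) do
      nil -> nil
      {node, _info} -> node
    end
  end

  # Hand-built sample with only the fields the selector reads.
  def sample do
    %{
      "api@api.example": %{connected: true, runtime: %{memory_percent: 60.0}},
      "db@db.example": %{connected: true, runtime: %{memory_percent: 25.0}},
      # Disconnected: ignored even though its numbers look best.
      "worker@worker.example": %{connected: false, runtime: %{memory_percent: 5.0}},
      # Just registered: no runtime snapshot yet.
      "new@new.example": %{connected: true, runtime: nil}
    }
  end
end
```

Against the sample map the selector picks `:"db@db.example"`, and against an empty map it returns `nil` rather than crashing inside the routing block.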

## Return values

NebulaAPI **never wraps** your return value. A `defapi` body returns exactly what it
computed — local or over RPC, the result is identical:

```elixir
defapi &db, find(id) do
  Repo.get(User, id)      # returns %User{} or nil
end

find(1)        #=> %User{...}
find(999)      #=> nil

# Tuples you return yourself are passed through untouched, including your own
# {:ok, _} / {:error, _}:
defapi &db, create(attrs) do
  Repo.insert(User.changeset(attrs))  # {:ok, user} or {:error, changeset}
end

create(%{name: "Ada"})   #=> {:ok, %User{...}}
create(%{})              #=> {:error, %Ecto.Changeset{...}}
```

The one value the library *does* inject is a `:nebula_error` tuple — a **library or
transport** failure (a timeout, no worker available, a crashing body, a quorum that wasn't
reached), never a business outcome. So any `:ok` / `:error` you ever see is **yours**, and
you never have to guess whether an `{:error, _}` came from your code or the framework. An
exception, throw or exit escaping a body is reported the same way — identically whether the
body ran locally or remotely.

Its shape depends on the scope of the failure. A **single-node** failure (unicast, or one
node inside a multicast result) is the 2-tuple `{:nebula_error, reason}`. A **whole-call**
multicast failure carries an extra element with the partial results — `{:nebula_error,
:no_success, results}`, `{:nebula_error, :quorum_not_reached, results}`,
`{:nebula_error, :quorum_unreachable, %{workers: n, required: m}}` (see
[Calling → multicast results](docs/calling.md#multicast-results)). Match the 3-tuples when
you handle a `:first` / `:quorum` call's top-level outcome, not just `{:nebula_error, _}`.
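
A sketch of how a caller might tell the shapes apart with pattern matching. `ResultSketch` is a hypothetical helper, and `:timeout` is used in the usage note only as a plausible example reason; the exact reason atoms are not listed in this README.

```elixir
defmodule ResultSketch do
  @whole_call [:no_success, :quorum_not_reached, :quorum_unreachable]

  # Whole-call multicast failure: 3-tuple carrying partial results or details.
  def classify({:nebula_error, kind, detail}) when kind in @whole_call,
    do: {:whole_call_failure, kind, detail}

  # Single-node library or transport failure: 2-tuple.
  def classify({:nebula_error, reason}), do: {:node_failure, reason}

  # Anything else is the body's own value, including its own {:ok, _} / {:error, _}.
  def classify(value), do: {:value, value}

  # Split a per-node result list into {successes, transport_failures}.
  def split(results) do
    Enum.split_with(results, fn
      {_node, {:nebula_error, _reason}} -> false
      {_node, _value} -> true
    end)
  end
end
```

Note that `classify({:error, :not_found})` comes back as `{:value, {:error, :not_found}}`: a business error is still just a value, and only `:nebula_error` tuples such as `{:nebula_error, :timeout}` are treated as failures of the call itself.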

## Wrap any single-node library

Here's the pattern that tends to click: **NebulaAPI turns any single-node
library into a cluster-wide one without touching the library.** No fork, no
monkey-patch — just a few lines of `defapi` that delegate to it on a chosen
node.

If you've ever thought *"I'd love to use Cachex / a counter / a cron here, but
its state is per-node, so now I need Redis / a shared DB / `:global` locks…"* —
this is the escape hatch. The library stays exactly as it is. You pin it to one
node and wrap it.

```elixir
# Cachex runs only on the @cache node; every node shares one cache through the wrapper.
defmodule MyApp.Cache do
  use NebulaAPI

  defapi @cache, get(key),        do: Cachex.get(:app_cache, key)
  defapi @cache, put(key, value), do: Cachex.put(:app_cache, key, value)
end
```

Any node calls `MyApp.Cache.get/1`; it resolves locally on `@cache` and routes
transparently everywhere else. One shared cache, no Redis. The same trick gives you
cluster-wide rate limiters, counters, run-once-per-cluster schedulers, singleton
coordinators, and feature-flag stores.

> **An honest caveat.** This is great for values read often and invalidated rarely
> (dynamic config, reference data). But for a hot path doing thousands of reads per second
> per node, every read becomes an RPC round-trip — that's the **wrong** use, and a real
> distributed cache (Redis, or `:mnesia`) stays better. NebulaAPI is the right tool when
> the access pattern fits, not a universal replacement for a distributed cache.

## Worked example: a 3-role cluster

Three nodes, three roles — an API front, a database node, and a worker:

```elixir
config :nebula_api,
  nodes: [
    "api@api.example": [:mainframe_cluster, :api],
    "db@db.example": [:alpha_server, :db],
    "worker@worker.example": [:mainframe_cluster, :gpu]
  ]
```

### Data access — `&db` nodes only

```elixir
defmodule MyApp.Users do
  use NebulaAPI

  defapi &db, get(id) do
    Repo.get(User, id)
  end

  defapi &db, list(filters \\ []) do
    User |> where_filters(filters) |> Repo.all()
  end

  # A plain def — no defapi: keep utils and pure business logic local, on every release.
  def user_name(%User{nickname: name}), do: name

  # Helper (and the Ecto.Query import it needs) only exists on &db nodes
  on_nebula_nodes &db do
    import Ecto.Query, only: [where: 3]

    defp where_filters(query, filters) do
      Enum.reduce(filters, query, fn {k, v}, q -> where(q, [u], field(u, ^k) == ^v) end)
    end
  end
end
```

### Background jobs — `@worker` only

```elixir
defmodule MyApp.Jobs do
  use NebulaAPI

  # @worker targets the worker node by its (short) name — no :worker tag needed.
  defapi @worker, transcode(input, opts) do
    FFmpex.new_command()
    |> FFmpex.add_input_file(input)
    |> FFmpex.add_output_file(opts[:output])
    |> FFmpex.execute()
  end

  # @worker AND &gpu — a faster path that only the GPU-equipped workers carry.
  defapi @worker &gpu, quick_transcode(input, opts) do
    GpuTranscoder.run(input, opts)
  end
end
```

### Conditional application setup

```elixir
defmodule MyApp.Application do
  use Application
  use NebulaAPI.Server

  def start(_type, _args) do
    # Only the &db node starts the Repo; everyone runs the nebula server.
    children =
      [nebula_api_server()] ++
        on_nebula_nodes &db do
          [MyApp.Repo]
        else
          []
        end

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Sup)
  end
end
```

### Cross-node calls from a web controller

```elixir
defmodule MyAppWeb.UserController do
  def show(conn, %{"id" => id}) do
    # "Just works" on any node. Local on @db, RPC everywhere else.
    # get/1 returns the struct (or nil) directly — no wrapping.
    case MyApp.Users.get(id) do
      %MyApp.User{} = user -> render(conn, :show, user: user)
      nil -> send_resp(conn, 404, "Not found")
    end
  end

  def transcode(conn, %{"path" => path}) do
    # Explicitly route to a worker, even if we have the code locally
    call_on_node @worker do
      MyApp.Jobs.transcode(path, output: "/tmp/out.mp3")
    end
  end
end
```

## When NOT to use NebulaAPI

Being honest about the edges:

- **External clients.** If the caller isn't a node in your Erlang cluster — a
  public web client, a non-Elixir mobile app — gRPC or REST is still the right
  boundary. NebulaAPI is for intra-cluster calls.
- **Node names unknown at build time.** NebulaAPI needs your node names and tags in
  config when you compile. The nodes themselves can come up and go down freely at
  runtime — workers register and drop through `:pg`, and selectors only ever route to
  what's actually connected. What it can't handle is a node whose *name* wasn't known at
  build time: an unbounded fleet of randomly-named pods has no compiled identity to route
  to — though a fixed, generic *caller* node is easy (see
  [generic nodes](#generic-nodes-serve-nothing-call-everything)). Scaling the count of
  *known* roles is fine; minting brand-new node identities at runtime is not.
- **Topologies whose roles change at runtime.** Adding a wholly new tag or node *name* to
  the cluster means a recompile — NebulaAPI decided the routing at build time. Bringing
  more instances of an existing role online needs nothing but starting them.

## Performance

Measured by [`bench/routing.exs`](bench/routing.exs) on OTP 26 (run it yourself with
`elixir --name bench@127.0.0.1 --cookie nebula_bench -S mix run bench/routing.exs`):

| Call | Per call |
|---|---|
| Plain local Elixir call (baseline) | ~8 ns |
| NebulaAPI, resolved local | ~60 ns |
| Cross-node round-trip, same host (loopback) | ~50 µs |

The point: a locally-resolved NebulaAPI call adds roughly 50 ns over a plain call (a couple
of process-dictionary reads and a `cond`), so it's free in any
practical sense. A cross-node call is a standard Erlang-distribution round-trip; the ~50 µs
above is loopback (same host), and over a real network you pay link latency on top
(commonly ~0.2–2 ms). Either way the rule of thumb holds: resolve local whenever you can,
and a cross-node hop costs roughly what a distributed `GenServer.call` costs — no more.
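
To sanity-check the local figures on your own hardware, a rough timing loop is enough. The sketch below is not `bench/routing.exs`: `BenchSketch` and `ns_per_call/2` are hypothetical names, and the measurement includes the cost of the anonymous-function call and the loop itself, so treat the result as an upper bound rather than a precise number.

```elixir
defmodule BenchSketch do
  @doc """
  Average nanoseconds per invocation of `fun` over `iterations` runs.
  Hypothetical helper, not part of NebulaAPI.
  """
  def ns_per_call(fun, iterations \\ 1_000_000) when is_function(fun, 0) do
    # :timer.tc/1 returns {elapsed_microseconds, result_of_fun}.
    {micros, :ok} =
      :timer.tc(fn ->
        Enum.each(1..iterations, fn _ -> fun.() end)
      end)

    micros * 1_000 / iterations
  end
end

plain = BenchSketch.ns_per_call(fn -> Kernel.abs(-1) end)
IO.puts("plain local call: ~#{Float.round(plain, 1)} ns")
```

Swapping the anonymous function for a call into one of your own `defapi` functions gives the resolved-local figure to compare against.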

## Configuration reference

```elixir
config :nebula_api,
  # Required: cluster topology — tags per node.
  # Used at compile time to decide what code goes where.
  nodes: [
    "api@api.example": [:mainframe_cluster, :api, :cache],
    "db@db.example": [:mainframe_cluster, :db, :cache],
    "worker@worker.example": [:cloud_worker_lambda, :worker]
  ],

  # Optional: override node identity for dev/test.
  # In production, compile with: elixir --name node@host -S mix compile.
  # default_opts also accepts inherited defaults for every `use NebulaAPI` module:
  # max_concurrent_calls: and default_timeout:.
  default_opts: [self_node: :"api@api.example"],

  # Optional: global default timeout (ms) for remote calls.
  # Precedence, highest first: per-call timeout: > per-module default_timeout: > this > 5000.
  default_timeout: 5_000,

  # Optional: how often (ms) each node's background NodesInfoCache rebuilds
  # the node-info snapshot served to selector functions.
  nodes_info_refresh_interval: 5_000
```
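
The timeout precedence in the comment above can be read as a simple fallback chain. The following is an illustration only, not the library's internals: `TimeoutSketch` is a hypothetical module, and it assumes the per-call option is passed as `timeout:` in a keyword list.

```elixir
defmodule TimeoutSketch do
  # The last-resort value named in the config comment.
  @fallback 5_000

  @doc """
  Picks the effective timeout (ms), highest-priority source first.
  `call_opts` and `module_opts` are keyword lists; `global` is the
  app-level `default_timeout`, or nil when it is not configured.
  """
  def effective(call_opts, module_opts, global) do
    Keyword.get(call_opts, :timeout) ||
      Keyword.get(module_opts, :default_timeout) ||
      global ||
      @fallback
  end
end

# Per-call value wins over everything else: returns 250.
TimeoutSketch.effective([timeout: 250], [default_timeout: 2_000], 5_000)

# No per-call value, so the per-module default applies: returns 2_000.
TimeoutSketch.effective([], [default_timeout: 2_000], 5_000)

# Nothing set anywhere, so the fallback applies: returns 5_000.
TimeoutSketch.effective([], [], nil)
```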

## Generic nodes: serve nothing, call everything

A release is normally tied to one node: it must run as the node it was compiled for (see
[the boot policy](#4-compile-with-the-target-node-name)). A **generic node** is the
exception — a node that serves nothing (no workers, registers nothing in `:pg`) and routes
**every** `defapi` call remotely. To actually reach the cluster it must be **distributed** (a
real `name@host`); a `nonode@nohost` build can't join a cluster (`Node.connect/1` returns
`:ignored` there), so it stays **inert**: safe, but it calls no one. Two ways to get one:

**1. A dedicated server-less build (`allow_nonode_nohost`).** Set the flag and compile
**without** `--name`, so `node()` is `nonode@nohost` and every `defapi` compiles as a pure
remote stub — no local bodies, no server, the smallest binary:

```elixir
config :nebula_api, nodes: [ ...the real cluster nodes... ], allow_nonode_nohost: true
```
```bash
mix compile && mix release console   # no --name → a generic, server-less build
```

The flag registers `nonode@nohost` as an empty, tagless node so the build compiles cleanly
(you can't list it in `:nodes` yourself — it's reserved; the flag is the only way to admit
it). Run it as `nonode@nohost` and it's inert; launch it under a **real** name to make it a
connected, calls-everything client.

**2. Any build, repurposed.** No dedicated build on hand? Boot an existing release (a
`worker`, an `api`) under a node name that *isn't* the one it was compiled for. It serves
nothing and routes every call remote just the same — you only carry the extra local bodies
that build happens to contain.

Either way, launching under a name that isn't the compiled one is a node mismatch, so you opt
in with `ALLOW_RUNTIME_NEBULA_NODE_MISMATCH=1` (keep `allow_nonode_nohost` in the build that
wants it, not the shared cluster config). The operational recipe — a prod console, a debug
shell — is in
[Calling → spawning a generic node](docs/calling.md#spawning-a-generic-node-debug-or-call-anything-remotely).
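
One way to keep the flag out of the shared cluster config is to gate it on a dedicated Mix environment. This is a sketch under assumptions: the `:console` environment name is hypothetical (you would build with `MIX_ENV=console`), and the node list is copied from the configuration reference.

```elixir
# config/config.exs
import Config

# Shared by every build: the real cluster topology.
config :nebula_api,
  nodes: [
    "api@api.example": [:mainframe_cluster, :api, :cache],
    "db@db.example": [:mainframe_cluster, :db, :cache],
    "worker@worker.example": [:cloud_worker_lambda, :worker]
  ]

# Only the generic, server-less build opts in. `:console` is a
# hypothetical Mix env name; pick whatever fits your project.
if config_env() == :console do
  config :nebula_api, allow_nonode_nohost: true
end
```

Because `config/2` merges keyword lists, the second call adds the flag without touching `:nodes`.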

## But wait — how do the nodes actually connect?

NebulaAPI decides *what code goes where*; it does **not** form the Erlang cluster. That's
deliberate — clustering is your call, and the library stays agnostic. All it needs is that
the nodes are connected Erlang nodes (so `:pg` syncs and distribution RPC flows); *how* they
find each other is entirely up to you. Anything that ends up calling `Node.connect/1` works:

- **[libcluster](https://hex.pm/packages/libcluster)** — the usual answer. Pick a strategy
  for your environment: `Gossip` on a flat network, `Kubernetes` / `Kubernetes.DNS` on k8s,
  `DNSPoll` behind a headless service, or a static `Epmd` list for a fixed fleet. Point its
  topology at the same node names you put in `config :nebula_api, :nodes`. (The
  [runnable demo](https://github.com/podCloud/NebulaAPI/tree/main/demo) does exactly this
  with libcluster's `Epmd` strategy over a Docker network.)
- **Plain epmd + `Node.connect/1`** — for a handful of known hosts, a few `Node.connect`
  calls at boot (or `-kernel sync_nodes_mandatory ...` in `vm.args`) are enough.
- **Anything else** — a custom strategy, a service-discovery hook, manual connects from a
  release `env.sh`. NebulaAPI never looks; it only ever reads `node()` and `:pg`.
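
As a sketch of the libcluster route, a static `Epmd` topology can list the same names that appear in `config :nebula_api, :nodes`. The topology key `:nebula` is a placeholder, and the option shape follows libcluster's documented `Cluster.Strategy.Epmd` config, so check it against the version you install.

```elixir
# config/config.exs
import Config

config :libcluster,
  topologies: [
    # `:nebula` is an arbitrary topology name chosen for this sketch.
    nebula: [
      strategy: Cluster.Strategy.Epmd,
      config: [
        # Same node names as in `config :nebula_api, :nodes`.
        hosts: [
          :"api@api.example",
          :"db@db.example",
          :"worker@worker.example"
        ]
      ]
    ]
  ]
```

libcluster then needs its supervisor in your tree, typically `{Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]}` with `topologies` read from `Application.get_env(:libcluster, :topologies)`; the supervisor name here is a placeholder too.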

Two practical notes: share the **same cookie** across the cluster, and use **long names**
(`name@host`, `RELEASE_DISTRIBUTION=name`) so the running node names match what you compiled
for. Once the nodes are connected, NebulaAPI's workers register in `:pg` and routing just
works.
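
When routing does not work, a quick check is whether the running node name is one you compiled for and which configured peers are actually connected. The helper below is a hypothetical sketch (`ClusterCheck` is not part of NebulaAPI); it only uses `node/0`, `Node.list/0`, and the `:nodes` keyword list from the config shown earlier.

```elixir
defmodule ClusterCheck do
  @doc """
  Compares the configured node names with what this VM currently sees.
  `configured` is a list of node-name atoms.
  """
  def report(configured, this_node \\ node(), connected \\ Node.list()) do
    %{
      # Is the running node one of the names in the config?
      self_known?: this_node in configured,
      # Configured peers that are connected right now.
      connected: Enum.filter(configured, &(&1 in connected)),
      # Configured peers that are neither this node nor connected.
      missing: Enum.reject(configured, &(&1 == this_node or &1 in connected))
    }
  end
end

configured =
  :nebula_api
  |> Application.get_env(:nodes, [])
  |> Keyword.keys()

IO.inspect(ClusterCheck.report(configured), label: "cluster")
```

A `self_known?: false` result usually points at a short-name versus long-name mismatch or a missing `--name` at boot.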

## Architecture

Two halves: a **compile-time** code generator (`AST.Parser` / `AST.Builder` / `Config`,
which fail the build on an unknown tag or node) and a small **runtime** layer
(`NebulaAPI.Server` per app starting a `Worker` per locally-served module, `APIServer`
holding the `:pg` routing and the node-info ETS cache).

<details>
<summary>📊 Diagram</summary>

```
┌─────────────────────────────────────────────────────┐
│                  Compile time                        │
│                                                      │
│  AST.Parser     parses selectors (&tag, @node, !&)   │
│  AST.Builder    generates the defapi functions        │
│  Config         resolves nodes, validates topology    │
│                 → CompileError on unknown tag/node    │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│                   Runtime                            │
│                                                      │
│  NebulaAPI.Server   per-app supervisor; starts one    │
│                     Worker per locally-served module  │
│                     (wired via nebula_api_server())   │
│  APIServer          :pg routing + node-info ETS cache │
│  APIServer.Worker   per-module GenServer; registers   │
│                     its methods in :pg                │
│  :pg groups         worker discovery across nodes     │
└─────────────────────────────────────────────────────┘
```

</details>
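
To make the `:pg` part concrete, here is a minimal sketch of group-based worker discovery using only OTP's `:pg` module. It is not NebulaAPI's implementation: the scope name `:pg_sketch`, the `{:api, module}` group key, and the `PgSketch` module are all hypothetical, and the real library may key its groups differently.

```elixir
defmodule PgSketch do
  @scope :pg_sketch

  # Start a private :pg scope so the sketch cannot clash with a real one.
  def start do
    case :pg.start_link(@scope) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
    end
  end

  # A worker announces that it serves `module` by joining a group.
  def register(module, pid \\ self()), do: :pg.join(@scope, {:api, module}, pid)

  # Prefer a worker on this node; otherwise take any member in the cluster.
  def pick(module) do
    group = {:api, module}

    case :pg.get_local_members(@scope, group) do
      [pid | _] ->
        {:local, pid}

      [] ->
        case :pg.get_members(@scope, group) do
          [pid | _] -> {:remote, pid}
          [] -> :none
        end
    end
  end
end

{:ok, _scope} = PgSketch.start()
:none = PgSketch.pick(MyApp.Users)
:ok = PgSketch.register(MyApp.Users)
{:local, pid} = PgSketch.pick(MyApp.Users)
true = pid == self()
```

On connected nodes that run the same scope, `:pg` propagates group membership, which is why the routing only needs the nodes to be connected and not any particular discovery mechanism.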

## Documentation

This README is the whole picture. The [`docs/`](docs/README.md) pages go deeper, in the order you
meet each theme:

1. [Configuration](docs/configuration.md) — nodes, tags, topology, compile-per-node, dev/test, validation
2. [Defining APIs](docs/defining.md) — the three `use` macros, `defapi`, selectors, return values, `on_nebula_nodes`, wiring the server
3. [Calling across nodes](docs/calling.md) — calling endpoints, `call_on_*`, multicast strategies, node-info routing, wrapping single-node libraries, spawning a generic node
4. [Gotchas and troubleshooting](docs/gotchas.md) — trailing opts, process scope, the `nil`-selector distinction, common errors

Deep dive:

- [AST deep-dive](docs/deep-dive/ast-deep-dive.md) — how the per-node code is generated

See also [About LLMs](ABOUT-LLMS.md) — how (and how much) LLMs were used to build this library.

## License

MIT