# ExFdbmonitor
An Elixir application that manages [FoundationDB](https://www.foundationdb.org/)
clusters using the BEAM's distributed capabilities.
<!-- MDOC !-->
ExFdbmonitor starts and supervises `fdbmonitor` (the FoundationDB management
process), bootstraps new clusters, and handles scaling operations — all
coordinated across nodes via Erlang distribution.
## How it works
1. **First node** — detects that no FDB peers exist, creates the cluster file,
writes a `foundationdb.conf`, and runs `configure new single <storage_engine>`.
2. **Subsequent nodes** — discover existing peers via `:erlang.nodes()`, copy
the cluster file, and join the cluster.
3. **Redundancy** — once enough nodes are registered, `scale_up` configures
coordinators and the declared redundancy mode (`"double"`, `"triple"`).
4. **Restarts** — on restart the bootstrap config is ignored (data files
already exist). The node re-includes itself if necessary and re-evaluates
redundancy automatically.
All mutating FDB operations are serialized through `ExFdbmonitor.MgmtServer`, a
[DGenServer](https://github.com/foundationdb-beam/dgen) backed by FDB itself.
This prevents concurrent `fdbcli` commands from interleaving across nodes.
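
To illustrate the serialization idea only, here is a minimal sketch that uses a plain `GenServer` rather than DGenServer. The module name `SerializedRunner` and its `run/2` function are hypothetical and are not part of ExFdbmonitor. A plain `GenServer` serializes work within a single BEAM node only, whereas the real `MgmtServer` coordinates across nodes through FDB.

```elixir
defmodule SerializedRunner do
  # Hypothetical illustration: every job runs inside one process, so two
  # jobs can never interleave. This is NOT ExFdbmonitor.MgmtServer.
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Runs the zero-arity `fun` inside the server process and returns its result.
  # Concurrent callers wait in the server's mailbox until earlier jobs finish.
  def run(server, fun) when is_function(fun, 0) do
    GenServer.call(server, {:run, fun}, :infinity)
  end

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_call({:run, fun}, _from, state) do
    {:reply, fun.(), state}
  end
end
```

Because `handle_call/3` processes one message at a time, a long-running job simply delays the next one instead of overlapping with it.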
## Requirements
- Elixir ~> 1.18
- FoundationDB client and server packages
([releases](https://github.com/apple/foundationdb/releases))
## Usage
See [examples/example_app/README.md](examples/example_app/README.md) for a tutorial on
using ExFdbmonitor in your application.
## Configuration
### FDB executable paths
If your FoundationDB installation is not in the default location, set the
following application configuration keys. The paths shown here are the defaults.
```elixir
config :ex_fdbmonitor,
  fdbmonitor: "/usr/local/libexec/fdbmonitor",
  fdbcli: "/usr/local/bin/fdbcli",
  fdbserver: "/usr/local/libexec/fdbserver",
  fdbdr: "/usr/local/bin/fdbdr",
  backup_agent: "/usr/local/foundationdb/backup_agent/backup_agent",
  dr_agent: "/usr/local/bin/dr_agent"
```
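
Before starting the application, it can help to confirm that the configured executables exist. The following is a rough sketch and not part of ExFdbmonitor: the `FdbPathCheck` module is hypothetical, and it assumes the keys are read with `Application.get_env/3` using the defaults listed above.

```elixir
defmodule FdbPathCheck do
  # Hypothetical helper. Defaults are copied from the list above; only three
  # keys are shown to keep the sketch short.
  @defaults [
    fdbmonitor: "/usr/local/libexec/fdbmonitor",
    fdbcli: "/usr/local/bin/fdbcli",
    fdbserver: "/usr/local/libexec/fdbserver"
  ]

  # Returns a keyword list of {key, path} pairs whose path is missing on disk.
  def missing(defaults \\ @defaults) do
    for {key, default} <- defaults,
        path = Application.get_env(:ex_fdbmonitor, key, default),
        not File.exists?(path),
        do: {key, path}
  end
end
```

An empty list from `FdbPathCheck.missing()` means every checked path exists.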
### Minimal (single-node dev)
```elixir
# config/dev.exs
import Config
config :ex_fdbmonitor,
  etc_dir: ".my_app/dev/fdb/etc",
  run_dir: ".my_app/dev/fdb/run"

config :ex_fdbmonitor,
  bootstrap: [
    conf: [
      data_dir: ".my_app/dev/fdb/data",
      log_dir: ".my_app/dev/fdb/log",
      fdbservers: [[port: 5000]]
    ]
  ]
```
### Multi-node production
```elixir
# config/runtime.exs
import Config
# Look up the first address assigned to the named network interface
addr = fn interface ->
  {:ok, addrs} = :inet.getifaddrs()

  :proplists.get_value(to_charlist(interface), addrs)[:addr]
  |> :inet.ntoa()
  |> to_string()
end

config :ex_fdbmonitor,
  etc_dir: "/var/lib/my_app/fdb/etc",
  run_dir: "/var/lib/my_app/fdb/run"

config :ex_fdbmonitor,
  bootstrap: [
    # Nodes must communicate with coordinators over the
    # network interface
    cluster: [coordinator_addr: addr.("eth0")],
    conf: [
      data_dir: "/var/lib/my_app/fdb/data",
      log_dir: "/var/lib/my_app/fdb/log",
      storage_engine: "ssd-2",
      # We're defining 2 fdbservers per node
      fdbservers: [[port: 4500], [port: 4501]],
      # When safe to do so, ex_fdbmonitor will upgrade
      # to 'double' redundancy automatically
      redundancy_mode: "double"
    ]
  ]
```
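
The `addr` helper above raises an unhelpful error when the interface does not exist, and it returns whichever address is listed first, which may be IPv6. A more defensive variant might look like the sketch below. The `IfAddr` module is hypothetical and not part of ExFdbmonitor; it relies only on `:inet.getifaddrs/0` and `:inet.ntoa/1` from Erlang/OTP.

```elixir
defmodule IfAddr do
  # Hypothetical helper. Returns {:ok, "a.b.c.d"} for the first IPv4 address
  # on `interface`, or {:error, reason} if none can be found.
  def ipv4(interface) when is_binary(interface) do
    {:ok, ifaddrs} = :inet.getifaddrs()

    case List.keyfind(ifaddrs, to_charlist(interface), 0) do
      nil ->
        {:error, :no_such_interface}

      {_name, opts} ->
        # An interface may carry several :addr entries (IPv4 and IPv6);
        # 4-tuples are IPv4 addresses, so keep only those.
        case for({:addr, {_, _, _, _} = ip} <- opts, do: ip) do
          [ip | _] -> {:ok, ip |> :inet.ntoa() |> to_string()}
          [] -> {:error, :no_ipv4_address}
        end
    end
  end
end
```

In `config/runtime.exs` this could be used as `{:ok, ip} = IfAddr.ipv4("eth0")`, which fails at boot with a readable reason if the interface is misnamed.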
### Configuration reference
| Key | Required | Description |
|-----|----------|-------------|
| `:etc_dir` | yes | Directory for `fdb.cluster` and `foundationdb.conf` |
| `:run_dir` | yes | Directory for `fdbmonitor` pid file |
| `:bootstrap` | no | Bootstrap config (ignored after first successful start) |
**Bootstrap keys:**
| Key | Description |
|-----|-------------|
| `cluster: [coordinator_addr:]` | IP address for the initial coordinator (default `"127.0.0.1"`) |
| `conf: [data_dir:]` | FDB data directory |
| `conf: [log_dir:]` | FDB log directory |
| `conf: [storage_engine:]` | Storage engine (default `"ssd-2"`) |
| `conf: [fdbservers:]` | List of `[port: N]` keyword lists, one per `fdbserver` process |
| `conf: [redundancy_mode:]` | `"single"`, `"double"`, or `"triple"` (default: `nil` / single) |
| `fdbcli:` | Extra `fdbcli` args to run at bootstrap (optional, repeatable) |
## Bootstrap flow
On application start, ExFdbmonitor runs two phases:
**Phase 1** (before any processes start):
- If the conf file and data dir are empty (first boot), write the config files.
If any node in `:erlang.nodes()` is already an FDB peer, copy its cluster file.
Otherwise, create a new cluster file and generate `configure new single <engine>`.
- If files already exist (restart), skip — use existing cluster file.
**Phase 2** (after `fdbmonitor` / `fdbserver` are running):
- Start `ExFdbmonitor.MgmtServer` (connects to FDB for distributed coordination).
- Register this node's `machine_id`.
- Call `scale_up(redundancy_mode, [node()])` — includes the node back
into FDB and configures redundancy when enough nodes are present.
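
The Phase 1 decision described above can be summarized as a small pure function. This is only a sketch of the decision logic, not the library's implementation: the `BootstrapPlan` module, its arguments, and the returned atoms are all hypothetical.

```elixir
defmodule BootstrapPlan do
  # Hypothetical sketch of the Phase 1 decision.
  #
  # files_exist? - true when conf/data files are already on disk (a restart)
  # fdb_peers    - connected nodes that already run FDB (may be empty)

  # Restart: keep the existing cluster file and skip bootstrap.
  def phase1(true, _fdb_peers), do: :use_existing_files

  # First boot with peers: join by copying a peer's cluster file.
  def phase1(false, [peer | _rest]), do: {:copy_cluster_file_from, peer}

  # First boot without peers: this node creates the cluster.
  def phase1(false, []), do: :create_new_cluster
end
```

For example, `BootstrapPlan.phase1(false, [:"app@host1"])` returns `{:copy_cluster_file_from, :"app@host1"}`.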
## Public API
### `ExFdbmonitor.leave/0`
Gracefully remove the current node from the cluster. Downgrades redundancy
if needed, reassigns coordinators, excludes the node (blocks until data is
moved), and stops the local `fdbmonitor`. To rejoin, restart the
`:ex_fdbmonitor` application.
### Redundancy modes
| Mode | Min nodes | Min coordinators |
|------|-----------|------------------|
| `"single"` | 1 | 1 |
| `"double"` | 3 | 3 |
| `"triple"` | 5 | 5 |
`scale_up` stores the declared mode as a ceiling. `scale_down`
auto-determines the highest mode the surviving nodes can support, capped
at that ceiling. This prevents a scale-down/scale-up cycle from
accidentally exceeding the operator's intent.
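
The ceiling rule can be sketched as a function that picks the highest mode the node count supports without exceeding the declared ceiling. The `RedundancyPlan` module below is hypothetical and is not ExFdbmonitor's implementation; the minimum node counts are taken from the table above, and the ceiling is assumed to be one of the three mode strings.

```elixir
defmodule RedundancyPlan do
  # Hypothetical sketch. Modes are ordered from highest to lowest, paired
  # with the minimum node counts from the table above.
  @modes [{"triple", 5}, {"double", 3}, {"single", 1}]

  # Returns the highest mode that `node_count` nodes can support,
  # capped at `ceiling` ("single", "double", or "triple").
  def highest_supported(node_count, ceiling)
      when is_integer(node_count) and node_count >= 1 do
    {_mode, ceiling_min} = List.keyfind(@modes, ceiling, 0)

    {mode, _min} =
      Enum.find(@modes, fn {_mode, min} ->
        min <= node_count and min <= ceiling_min
      end)

    mode
  end
end
```

With a ceiling of `"double"`, five surviving nodes still yield `"double"`, and with a ceiling of `"triple"`, four surviving nodes yield `"double"`.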
## Scaling example
When a node is gracefully shutting down,
```elixir
# On the departing node:
ExFdbmonitor.leave()
```
When a node is returning after a previous graceful shutdown,
```elixir
# Later, restart the :ex_fdbmonitor application to rejoin:
Application.stop(:ex_fdbmonitor)
Application.ensure_all_started(:ex_fdbmonitor)
```
## Testing
ExFdbmonitor provides sandbox modules for integration testing:
```elixir
# Single-node sandbox
sandbox = ExFdbmonitor.Sandbox.Single.checkout("my-test", starting_port: 5000)
# ... run tests ...
ExFdbmonitor.Sandbox.Single.checkin(sandbox, drop?: true)
# 3-node double-redundancy sandbox
sandbox = ExFdbmonitor.Sandbox.Double.checkout("my-test", starting_port: 5500)
# ... run tests ...
ExFdbmonitor.Sandbox.Double.checkin(sandbox, drop?: true)
```
Sandboxes start isolated `local_cluster` nodes with their own FDB
processes. Pass `drop?: true` to delete all data on checkin.