# NetRunner
[Hex](https://hex.pm/packages/net_runner)
[Docs](https://hexdocs.pm/net_runner)
Safe OS process execution for Elixir. Zero zombie processes, NIF-based backpressure, PTY support, and cgroup isolation.
NetRunner combines NIF-based async I/O (`enif_select`) with a persistent **shepherd binary** that watches over child processes. Three layers of cleanup guarantee zero zombies under every failure mode — BEAM crash, GenServer crash, or resource leak.
## Installation
```elixir
def deps do
[
{:net_runner, "~> 1.0"}
]
end
```
Requires a C compiler (`gcc` or `clang`) and `make`.
## Quick Start
```elixir
# Run a command, collect output
{output, 0} = NetRunner.run(~w(echo hello))
# => {"hello\n", 0}
# Pipe data through stdin
{output, 0} = NetRunner.run(~w(cat), input: "from elixir")
# => {"from elixir", 0}
# Nonzero exit status
{"", 1} = NetRunner.run(~w(false))
# Multi-word commands with arguments
{output, 0} = NetRunner.run(["sh", "-c", "echo $HOME"])
```
## Streaming
Lazy, demand-driven streams with backpressure — data stays in the OS pipe buffer until you consume it, so you can stream gigabytes without exhausting memory:
```elixir
# Stream stdout chunks with backpressure
NetRunner.stream!(~w(cat /usr/share/dict/words))
|> Stream.filter(&String.starts_with?(&1, "elixir"))
|> Enum.to_list()
# Pipe input through a command
NetRunner.stream!(~w(tr a-z A-Z), input: "hello world")
|> Enum.join()
# => "HELLO WORLD"
# Sort lines
NetRunner.stream!(~w(sort), input: "cherry\napple\nbanana\n")
|> Enum.join()
# => "apple\nbanana\ncherry\n"
# Process large files without loading them into memory
NetRunner.stream!(~w(grep ERROR), input: File.stream!("huge.csv"))
|> Stream.each(&IO.write/1)
|> Stream.run()
# Non-raising variant
{:ok, stream} = NetRunner.stream(~w(sort), input: "c\nb\na\n")
Enum.to_list(stream)
# => ["a\nb\nc\n"]
```
## Timeouts and Limits
```elixir
# Kill process after 500ms
{:error, :timeout} = NetRunner.run(~w(sleep 100), timeout: 500)
# Cap output size — kills process if exceeded
{:error, {:max_output_exceeded, _partial}} =
NetRunner.run(["sh", "-c", "yes"], max_output_size: 1000)
# Custom kill escalation: SIGTERM → wait 2s → SIGKILL
NetRunner.run(~w(my_server), kill_timeout: 2000, timeout: 10_000)
```
## Process API
For fine-grained control over the OS process lifecycle:
```elixir
alias NetRunner.Process, as: Proc
# Start a process
{:ok, pid} = Proc.start("cat", [])
# Write to stdin
:ok = Proc.write(pid, "hello world")
:ok = Proc.close_stdin(pid)
# Read from stdout (blocks until data is available)
{:ok, "hello world"} = Proc.read(pid)
:eof = Proc.read(pid)
# Wait for exit
{:ok, 0} = Proc.await_exit(pid)
```
### Incremental reads and writes
```elixir
{:ok, pid} = Proc.start("cat", [])
# Write in chunks — useful for feeding large data
:ok = Proc.write(pid, "chunk 1 ")
:ok = Proc.write(pid, "chunk 2 ")
:ok = Proc.write(pid, "chunk 3")
:ok = Proc.close_stdin(pid)
# Read comes back in whatever chunks the OS delivers
{:ok, data} = Proc.read(pid)
# data => "chunk 1 chunk 2 chunk 3" (may come in multiple reads)
```
### Signals and process groups
```elixir
{:ok, pid} = Proc.start("sleep", ["100"])
:ok = Proc.kill(pid, :sigterm)
{:ok, 143} = Proc.await_exit(pid) # 128 + SIGTERM(15)
# Signals kill the entire process group (catches grandchildren)
{:ok, pid} = Proc.start("sh", ["-c", "sleep 100 & sleep 100 & wait"])
:ok = Proc.kill(pid, :sigkill)
# All three processes (sh + both sleeps) are killed
```
Supported signals: `:sigterm`, `:sigkill`, `:sigint`, `:sighup`, `:sigusr1`, `:sigusr2`, `:sigstop`, `:sigcont`, `:sigquit`, `:sigpipe`.
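One use of `:sigstop`/`:sigcont` is pausing a worker without losing its state. A minimal sketch using only the API shown above (the `sleep` child stands in for a real worker):

```elixir
alias NetRunner.Process, as: Proc

{:ok, pid} = Proc.start("sleep", ["100"])

# Pause the whole process group, e.g. to shed load
:ok = Proc.kill(pid, :sigstop)
true = Proc.alive?(pid)  # stopped, not dead

# Resume where it left off
:ok = Proc.kill(pid, :sigcont)

# Clean up
:ok = Proc.kill(pid, :sigkill)
{:ok, _status} = Proc.await_exit(pid)
```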
### Checking process state
```elixir
{:ok, pid} = Proc.start("sleep", ["10"])
Proc.alive?(pid) # => true
Proc.os_pid(pid) # => 12345 (the actual OS PID)
Proc.kill(pid, :sigkill)
Proc.await_exit(pid)
Proc.alive?(pid) # => false
```
### Stats
Every process tracks I/O statistics automatically:
```elixir
{:ok, pid} = Proc.start("cat", [])
Proc.write(pid, "hello")
Proc.close_stdin(pid)
Proc.read(pid)
Proc.await_exit(pid)
stats = Proc.stats(pid)
stats.bytes_in # => 5 (bytes written to stdin)
stats.bytes_out # => 5 (bytes read from stdout)
stats.read_count # => 1 (number of read calls)
stats.write_count # => 1 (number of write calls)
stats.duration_ms # => 3 (wall-clock time)
stats.exit_status # => 0 (exit code)
```
## PTY Mode
Run commands with a pseudo-terminal for programs that require a TTY. PTY mode is designed for **interactive and long-running programs** — shells, REPLs, curses apps.
```elixir
# Programs see a real terminal
{:ok, pid} = Proc.start("python3", ["-c", "import sys; print(sys.stdout.isatty())"], pty: true)
{:ok, data} = Proc.read(pid)
# data =~ "True"
# Interactive REPL
{:ok, pid} = Proc.start("python3", ["-i"], pty: true)
Proc.write(pid, "print(1 + 2)\n")
{:ok, data} = Proc.read(pid)
# Resize the terminal window
Proc.set_window_size(pid, 40, 120)
# Clean up when done (PTY doesn't support independent stdin close)
Proc.kill(pid, :sigkill)
Proc.await_exit(pid)
```
### PTY caveats
PTY mode differs from pipe mode in important ways:
- **No independent stdin close** — the PTY is a single bidirectional FD. Use `kill/2` to terminate.
- **Echo** — the terminal echoes input back by default, so reads include what you wrote.
- **Fast-exiting commands** — if a command exits before you call `read/1` (e.g., in iex), the PTY buffer may be lost. PTY mode is meant for long-running programs. For simple commands, use pipe mode (the default).
- **Line buffering** — the terminal line discipline buffers input until `\n` by default.
```elixir
# DON'T: Use PTY for simple commands in iex (data may be lost)
{:ok, pid} = Proc.start("echo", ["hi"], pty: true)
# ... time passes while you type ...
Proc.read(pid) # => :eof (too late, PTY torn down)
# DO: Use pipe mode (default) for simple commands
{:ok, pid} = Proc.start("echo", ["hi"])
{:ok, "hi\n"} = Proc.read(pid)
# DO: Use PTY for interactive programs that need a terminal
{:ok, pid} = Proc.start("bash", [], pty: true)
Proc.write(pid, "echo hello\n")
{:ok, data} = Proc.read(pid) # works — bash stays alive
```
## Daemon Mode
Run long-lived processes under a supervision tree. The daemon automatically drains stdout/stderr so a full pipe never blocks the child:
```elixir
# In your supervisor
children = [
{NetRunner.Daemon,
cmd: "redis-server",
args: ["--port", "6380"],
on_output: :log,
name: MyApp.Redis}
]
Supervisor.start_link(children, strategy: :one_for_one)
# Interact with the daemon
NetRunner.Daemon.os_pid(MyApp.Redis) # => 12345
NetRunner.Daemon.alive?(MyApp.Redis) # => true
NetRunner.Daemon.write(MyApp.Redis, "PING\r\n")
```
Output handling options:
- `:discard` (default) — silently consume output to prevent pipe blocking
- `:log` — log each chunk via `Logger.info`
- `fun/1` — custom callback for each chunk
```elixir
# Custom output handler
{:ok, daemon} = NetRunner.Daemon.start_link(
cmd: "tail",
args: ["-f", "/var/log/system.log"],
on_output: fn chunk -> MyApp.LogIngester.ingest(chunk) end
)
```
Graceful shutdown: on `terminate/2`, the daemon sends SIGTERM, waits up to 5 seconds for the process to exit, then sends SIGKILL.
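Stopping the daemon through ordinary OTP shutdown triggers that escalation. A sketch, assuming the `MyApp.Redis` name from the example above:

```elixir
# GenServer.stop/1 invokes terminate/2: SIGTERM, a grace period, then SIGKILL
:ok = GenServer.stop(MyApp.Redis)

# The same path runs when the supervision tree shuts down,
# so the OS process never outlives your application.
```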
## cgroup Support (Linux)
Isolate child processes in a cgroup v2 hierarchy for resource control:
```elixir
{:ok, pid} = Proc.start("my_worker", [],
cgroup_path: "net_runner/job_123"
)
```
The shepherd creates the cgroup directory, moves the child into it, and cleans up on exit (kills all processes via `cgroup.kill`, then removes the directory). No-op on macOS.
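NetRunner only places the child in the cgroup; the limits themselves are set through the standard cgroup v2 control files. A sketch, assuming the default `/sys/fs/cgroup` mount and write permission to it (`memory.max` and `cpu.max` are the kernel's interface, not a NetRunner API):

```elixir
alias NetRunner.Process, as: Proc

{:ok, pid} = Proc.start("my_worker", [], cgroup_path: "net_runner/job_123")

base = "/sys/fs/cgroup/net_runner/job_123"

# Cap memory at 256 MiB; the kernel OOM-kills the group if exceeded
File.write!(Path.join(base, "memory.max"), Integer.to_string(256 * 1024 * 1024))

# Limit CPU to half a core: 50ms of runtime per 100ms period
File.write!(Path.join(base, "cpu.max"), "50000 100000")

{:ok, _status} = Proc.await_exit(pid)
```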
## Parallel Execution
Every NetRunner process is fully independent — no shared state, no singleton bottleneck:
```elixir
# Process files in parallel
files
|> Task.async_stream(fn file ->
{out, 0} = NetRunner.run(["ffprobe", "-hide_banner", file])
{file, out}
end, max_concurrency: System.schedulers_online())
|> Enum.to_list()
# Fan-out pattern
urls
|> Task.async_stream(fn url ->
{body, 0} = NetRunner.run(["curl", "-s", url], timeout: 30_000)
body
end, max_concurrency: 20)
|> Enum.to_list()
```
## Why NetRunner?
`System.cmd` uses Erlang ports, which tie the stdin/stdout lifecycle together, provide no backpressure (output floods the mailbox), and leave zombie processes when a program ignores stdin EOF. This was filed as [ERL-128](https://bugs.erlang.org/browse/ERL-128) and marked **Won't Fix**.
| Feature | System.cmd | MuonTrap | Exile | **NetRunner** |
|---------|-----------|----------|-------|---------------|
| No zombies (BEAM SIGKILL) | - | Yes | - | **Yes** |
| NIF async I/O + backpressure | - | - | Yes | **Yes** |
| Close stdin independently | - | - | Yes | **Yes** |
| Process group kills | - | Yes | - | **Yes** |
| PTY / terminal emulation | - | - | - | **Yes** |
| cgroup isolation (Linux) | - | Yes | - | **Yes** |
| Per-process stats | - | - | - | **Yes** |
| Daemon mode (supervision) | - | Yes | - | **Yes** |
## Options Reference
### `NetRunner.run/2`
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `:input` | binary \| list | `nil` | Data to write to stdin |
| `:timeout` | integer | `nil` | Wall-clock timeout in ms |
| `:max_output_size` | integer | `nil` | Max bytes to collect |
| `:stderr` | atom | `:consume` | `:consume`, `:redirect`, or `:disabled` |
| `:pty` | boolean | `false` | Use pseudo-terminal |
| `:kill_timeout` | integer | `5000` | SIGTERM→SIGKILL escalation timeout in ms |
| `:cgroup_path` | string | `nil` | cgroup v2 path (Linux only) |
### `NetRunner.Process.start/3`
Accepts all options above except `:input`, `:timeout`, and `:max_output_size`.
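As an example of an option carrying over, here is a sketch that merges stderr into the stdout stream with `stderr: :redirect`:

```elixir
alias NetRunner.Process, as: Proc

# :redirect folds stderr into the same stream as stdout
{:ok, pid} = Proc.start("sh", ["-c", "echo out; echo err 1>&2"], stderr: :redirect)
:ok = Proc.close_stdin(pid)

{:ok, data} = Proc.read(pid)
# the combined output arrives on stdout reads (possibly across multiple chunks)
{:ok, 0} = Proc.await_exit(pid)
```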
## Architecture
```
BEAM Process (NetRunner.Process GenServer)
|
|-- Port.open("priv/shepherd", [:nouse_stdio, :exit_status])
| |
| v
| Shepherd Binary (stays alive for child's lifetime)
| |-- fork() → child process (execvp)
| |-- Passes pipe FDs to BEAM via UDS + SCM_RIGHTS
| |-- poll() loop: watches UDS + signal pipe (SIGCHLD)
| |-- BEAM dies (POLLHUP) → SIGTERM → SIGKILL child
| |-- Child dies (SIGCHLD) → notify BEAM, exit
|
|-- NIF (enif_select on raw FDs, dirty IO schedulers)
| |-- Demand-driven backpressure via OS pipe buffers
|
v
NetRunner.Watcher (belt-and-suspenders with shepherd)
|-- Monitors GenServer, kills OS process if GenServer crashes
```
**Three layers of zombie prevention:**
1. **Shepherd** — detects BEAM death via POLLHUP on UDS, kills child process group
2. **Watcher** — detects GenServer crash via `Process.monitor`, kills child via NIF
3. **NIF resource destructor** — closes FDs on GC, child sees broken pipe
## Performance
Spawn overhead is ~20-25ms per process (fork + execvp + UDS handshake + FD passing). This is a one-time cost — actual I/O is sub-millisecond. For comparison, `System.cmd` is ~10-15ms (simpler setup, same fork cost).
The tradeoff: ~10ms extra spawn time buys you backpressure, zero zombies, and process group kills. For long-running processes or large data streams, the spawn cost is negligible.
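These numbers vary by machine; a rough way to measure spawn overhead locally with Erlang's `:timer.tc/1`:

```elixir
runs = 20

# Average spawn + exit round-trip over several runs, in milliseconds
{total_us, _results} =
  :timer.tc(fn ->
    for _ <- 1..runs, do: NetRunner.run(~w(true))
  end)

IO.puts("avg spawn + exit: #{Float.round(total_us / runs / 1000, 2)} ms")
```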
## Documentation
- [API Reference (HexDocs)](https://hexdocs.pm/net_runner)
- [Architecture](docs/architecture.md) — component diagrams, spawn sequence, zombie prevention
- [Protocol](docs/protocol.md) — shepherd ↔ BEAM byte protocol spec
- [Backpressure](docs/backpressure.md) — how demand-driven I/O works
- [Decisions](docs/decisions.md) — architecture decision records
- [Comparison](docs/comparison.md) — feature matrix vs alternatives
## Requirements
- Elixir ~> 1.17
- Erlang/OTP 27+
- C compiler (`gcc` or `clang`)
- `make`
- Linux or macOS
## License
Apache-2.0 — see [LICENSE](LICENSE).