# Changelog

All notable changes to this project will be documented in this file.

This project adheres to [Semantic Versioning](https://semver.org/).

## [Unreleased]

Focused code-review pass across the NIF, shepherd, and Elixir layers.
Correctness-first: closes two real-world race/leak bugs, hardens the
post-fork child window, and adds an AddressSanitizer + UBSan CI job.

### Fixed

- **FD leak in `nif_create_fd`** when `enif_mutex_create` failed
  — the destructor previously gated `close(fd)` on a non-NULL lock,
  so a failed mutex allocation leaked the file descriptor and armed
  a NULL-deref in any later `nif_close`. The mutex result is now
  checked and the destructor closes the fd unconditionally.
- **Use-after-close race in NIF read/write vs. close/down**
  — `nif_read`/`nif_write` copied `res->fd` under the mutex and
  released the lock before the syscall; a concurrent `nif_close` or
  owner-death callback could close the fd before the syscall ran,
  letting the read/write target a recycled fd. The mutex is now held
  across the syscall and the subsequent `enif_select` registration;
  the actual `close()` is deferred to the `io_resource_stop` callback
  so BEAM can drain pending selects before the fd is released.
- **Lost initial stderr chunk in `:consume` mode**
  — `kick_stderr_read` in `init/1` sent `{:stderr_data, data}` to
  `self()` but no `handle_info/2` clause matched, so the first (and
  often only) chunk of stderr for fast-exiting processes was silently
  dropped. The missing handler now appends to the stderr buffer and
  drains any remainder.
- **`write_loop` spin on `{:ok, 0}`** — if the kernel ever returned
  0 bytes on a non-empty write, the GenServer would recurse forever
  on the dirty scheduler. Bounded with a 1 ms sleep-retry.
- **Shepherd UDS command framing** — the event loop parsed only
  `buf[0]`, discarding any coalesced or tail commands (e.g.
  `CMD_CLOSE_STDIN` followed immediately by `CMD_KILL`). Frames are
  now length-dispatched per opcode with a carry-over buffer across
  `poll()` iterations.
- **Post-fork child stdio and signal safety** — replaced `fprintf` /
  `strerror` in the post-fork / pre-exec window with a `write(2)`-
  based `child_fail()` helper (async-signal-safe). Every `dup2`,
  `setsid`, and `TIOCSCTTY` return is now checked; on failure the
  child exits 127 with a diagnostic instead of running with broken
  stdio.
- **`waitpid` after SIGKILL** — replaced the unbounded
  `waitpid(child_pid, NULL, 0)` with a bounded WNOHANG loop
  (~3 s cap) so the shepherd cannot hang on a child stuck in
  uninterruptible kernel sleep (D-state).
- **SIGCHLD reap loop** — reap all pending children per SIGCHLD
  (`while waitpid(-1, ..., WNOHANG) > 0`) so a coalesced signal
  never leaks zombies.
- **Cgroup / UDS path hardening** — validate every `snprintf` return,
  reject too-long UDS paths, set `FD_CLOEXEC` on the PTY master,
  treat user-requested cgroup setup failure as fatal, and replace
  the fixed 100 ms `usleep` in `cgroup_cleanup` with a bounded
  polling `rmdir`.
- **`Stream` consumer crash cleanup** — `Stream.resource`'s `after`
  callback is only run on normal termination. A consumer crash
  orphaned the `NetRunner.Process` GenServer and its OS child.
  `NetRunner.Process.start/3` now accepts an `:owner` option that
  monitors the caller; `NetRunner.Stream.stream/3` passes `self()`,
  so a consumer crash SIGKILLs the OS process and stops the
  GenServer.
- **Watcher blocking on `Process.sleep`** — the 5 s sleep in
  `handle_info/2` left the Watcher unresponsive (even to
  supervisor shutdown). Replaced with `Process.send_after/3` and a
  new `:escalate_to_sigkill` handler.
- **Parked-caller tracking in `Operations`** — callers parked on
  EAGAIN are now `Process.monitor/1`-ed; dead callers are pruned on
  `:DOWN` instead of lingering in the pending map until process
  exit.
- **`read_uds_message` race** — replaced the `:peek` + full-recv
  pattern (which could time out if the payload arrived a moment
  after the opcode) with an opcode-first read flow and longer
  timeouts.
- **`cmd` / `args` validation** — reject non-binary, empty, or
  NUL-containing cmd and args at the spawn boundary. Passing NUL
  bytes through `Port.open`'s `args:` is undefined on the C side.
- **`NetRunner.run/2` error surface** — previously pattern-matched
  `{:ok, pid}` from `NetRunner.Process.start/3`, raising `MatchError` when
  validation failed. Now returns `{:error, reason}` cleanly.
- **`File.rm` cleanup of UDS socket** — tolerate `:enoent`
  (shepherd may have unlinked), propagate other errors.
- **`Signal.resolve` integer range** — integer signals outside
  POSIX `1..31` now return `{:error, :unknown_signal}` instead of
  being forwarded to `kill(2)`.
- **`Signal` single source of truth** — `Signal.resolve` delegates
  to the NIF for known-atom lookup instead of maintaining a duplicate
  allow-list that drifted from the C side.
- **Daemon drain resilience** — drain-task crashes used to match a
  catch-all `:DOWN` handler and silently stop draining; the pipe
  then filled until the child blocked. Narrowed to recognised refs
  with a warning log; `drain_loop` wrapped in `try/rescue/catch` so
  a reader or logger exception cannot take the daemon down through
  the linked Task.
- **`terminate/2`** now explicitly closes the shepherd `Port` after
  the UDS socket, giving a deterministic teardown order.
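
The "Shepherd UDS command framing" fix above can be illustrated with a minimal sketch. The opcode values, frame lengths, and the `drain_frames` helper are assumptions made for this example; the shepherd's actual wire format is not shown in this changelog. The idea is to dispatch every complete frame in the buffer and carry any partial tail over to the next `poll()` iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opcodes, for illustration only. */
enum { CMD_KILL = 1, CMD_CLOSE_STDIN = 2, CMD_SET_WINSIZE = 3 };

/* Total frame length (opcode byte included) per opcode; 0 means unknown. */
static size_t frame_len(unsigned char opcode)
{
    switch (opcode) {
    case CMD_KILL:        return 2; /* opcode + signal number */
    case CMD_CLOSE_STDIN: return 1; /* opcode only */
    case CMD_SET_WINSIZE: return 5; /* opcode + rows (2) + cols (2) */
    default:              return 0;
    }
}

/*
 * Walk every complete frame in buf[0..*used), record its opcode in ops[],
 * then move any partial tail to the front of buf so the next recv() can
 * append to it. Returns the number of frames found, or -1 on an unknown
 * opcode (the buffer is left untouched in that case).
 */
static int drain_frames(unsigned char *buf, size_t *used,
                        unsigned char *ops, size_t max_ops)
{
    size_t off = 0;
    size_t count = 0;

    while (off < *used && count < max_ops) {
        size_t need = frame_len(buf[off]);
        if (need == 0)
            return -1;               /* protocol error */
        if (*used - off < need)
            break;                   /* partial frame: wait for more bytes */
        ops[count++] = buf[off];     /* a real loop would dispatch here */
        off += need;
    }
    memmove(buf, buf + off, *used - off);  /* carry the tail over */
    *used -= off;
    return (int)count;
}
```

With this shape, a `CMD_CLOSE_STDIN` immediately followed by a `CMD_KILL` in one `recv()` yields two frames instead of one, and a frame split across two reads is completed on the second pass.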

### Added

- **AddressSanitizer + UBSan** — opt-in build via `SANITIZE=1 make all`
  or `make asan`. New CI job (`sanitizers`) rebuilds the NIF and
  shepherd with `-fsanitize=address,undefined`, preloads `libasan`,
  and runs the full `mix test`. The publish job depends on it.
- **Stale UDS socket sweep** in `test/test_helper.exs` (before and
  after the suite) — stops accumulation from test crashes before
  `cleanup_listener/2` runs.
- **Regression tests** for: NUL-byte validation in `cmd` and `args`,
  `Signal.resolve` range + type handling, `:owner` monitor SIGKILL
  path, stderr-only fast-exit stats, binary-with-NUL round-trip, and
  `NetRunner.run` / `NetRunner.stream` returning validation errors
  cleanly.

## [1.0.0] - 2026-02-26

Initial release.

### Core

- `NetRunner.run/2` — run a command and collect output as `{output, exit_status}`
- `NetRunner.stream!/2` / `NetRunner.stream/2` — lazy streaming I/O with backpressure
- `NetRunner.Process` — GenServer with full lifecycle control: `start/3`, `read/2`, `write/2`, `close_stdin/1`, `kill/2`, `await_exit/2`, `os_pid/1`, `alive?/1`

### Shepherd Binary (C)

- Persistent watchdog process that stays alive for the child's lifetime
- Detects BEAM death via UDS `POLLHUP`, which guarantees child cleanup even when the BEAM itself is killed with `SIGKILL`
- FD passing via `SCM_RIGHTS` over Unix domain sockets
- `poll()` event loop with self-pipe trick for `SIGCHLD` handling
- Process group kills: `setpgid(0,0)` + `kill(-pgid, sig)` catches grandchildren
- Configurable SIGTERM → SIGKILL escalation timeout (`--kill-timeout`)
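
As a rough sketch of the process-group kill and SIGTERM → SIGKILL escalation listed above, combined with a bounded `WNOHANG` wait: the helper names `wait_bounded` and `terminate_group`, the 10 ms polling tick, and the 3 s cap after SIGKILL are assumptions for illustration, not the shepherd's actual code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Poll waitpid(WNOHANG) for up to timeout_ms. Returns 1 and fills *status
 * if the child was reaped, 0 on timeout, -1 on error.
 */
static int wait_bounded(pid_t pid, int timeout_ms, int *status)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 }; /* 10 ms */

    for (int waited = 0; ; waited += 10) {
        pid_t r = waitpid(pid, status, WNOHANG);
        if (r == pid)
            return 1;
        if (r < 0 && errno != EINTR)
            return -1;
        if (waited >= timeout_ms)
            return 0;
        nanosleep(&tick, NULL);
    }
}

/*
 * SIGTERM the child's process group, wait up to kill_timeout_ms, then
 * escalate to SIGKILL and wait at most ~3 s more. Assumes the child is
 * its own group leader (setpgid(0, 0)), so -pid addresses the group.
 */
static int terminate_group(pid_t pid, int kill_timeout_ms, int *status)
{
    kill(-pid, SIGTERM);
    int r = wait_bounded(pid, kill_timeout_ms, status);
    if (r != 0)
        return r;                /* reaped (1) or error (-1) */
    kill(-pid, SIGKILL);
    return wait_bounded(pid, 3000, status);
}
```

Because every wait is bounded, a child stuck in uninterruptible sleep makes `terminate_group` return 0 rather than hang the caller.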

### NIF I/O

- `enif_select` integration with BEAM's epoll/kqueue for async I/O
- All NIF functions on dirty IO schedulers
- Demand-driven backpressure via OS pipe buffers + `EAGAIN` + `enif_select`
- Resource-based FD management with destructor/stop/down callbacks
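
The pipe-buffer backpressure mentioned above can be sketched with plain POSIX calls. This example does not use `erl_nif.h`; `set_nonblocking` and `write_some` are hypothetical helpers. In the NIF, a short write would presumably be followed by an `enif_select` registration rather than a return to a C caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Write as much of buf as the pipe will accept right now. Returns the
 * number of bytes written (0 if the pipe is already full) or -1 on a
 * real error. A short count means the kernel buffer is full (EAGAIN):
 * the caller should park until the fd becomes writable again.
 */
static ssize_t write_some(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n > 0) {
            done += (size_t)n;
            continue;
        }
        if (n == 0)
            break;                              /* treat as "try later" */
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                              /* pipe full: backpressure */
        return -1;
    }
    return (ssize_t)done;
}
```

The writer never blocks a scheduler thread: a slow reader simply leaves the pipe full, and the unwritten remainder waits until the reader drains it.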

### Zombie Prevention (3 layers)

- **Shepherd** — detects BEAM crash via UDS POLLHUP, kills child process group
- **Watcher** — detects GenServer crash via `Process.monitor`, kills child via NIF
- **NIF resource destructor** — closes FDs on GC, child sees broken pipe
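
The first layer relies on the kernel reporting a hang-up when the other end of a Unix domain socket disappears. A minimal sketch, assuming a connected `AF_UNIX` stream socket; `peer_gone` is a hypothetical helper name, not a function from the shepherd:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns 1 if the peer of a connected socket has gone away, 0 if no
 * hang-up was reported within timeout_ms, -1 on poll() error.
 */
static int peer_gone(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int r = poll(&pfd, 1, timeout_ms);

    if (r < 0)
        return -1;
    if (r == 0)
        return 0;                /* timed out: peer still connected */
    return (pfd.revents & (POLLHUP | POLLERR)) ? 1 : 0;
}
```

`POLLHUP` is reported even though it is not requested in `events`, and the kernel closes the socket however the owning process dies, so no cooperation from the dying side is needed.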

### PTY Support

- `pty: true` option for pseudo-terminal emulation
- `openpty()` with `setsid()` + `TIOCSCTTY` for controlling terminal
- `set_window_size/3` via `ioctl(TIOCSWINSZ)`
- Single bidirectional master FD, duped for independent stdin/stdout NIF resources
- Platform support: `<util.h>` on macOS, `<pty.h>` on Linux

### cgroup Support (Linux)

- `:cgroup_path` option for cgroup v2 resource isolation
- Creates cgroup directory, moves child to `cgroup.procs`
- Cleanup via `cgroup.kill` + `rmdir` on process exit
- No-op on macOS/BSD
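
A bounded polling `rmdir`, as used for cgroup cleanup, might look like the sketch below. The helper name `rmdir_bounded` and its parameters are assumptions for illustration. Retrying on `EBUSY` reflects that a cgroup directory cannot be removed while it still contains processes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Retry rmdir() while it fails with EBUSY, sleeping delay_ms between
 * attempts. Returns 0 once the directory is gone (or was already absent),
 * -1 otherwise with errno set.
 */
static int rmdir_bounded(const char *path, int attempts, int delay_ms)
{
    struct timespec delay = {
        delay_ms / 1000,
        (long)(delay_ms % 1000) * 1000000L
    };

    for (int i = 0; i < attempts; i++) {
        if (rmdir(path) == 0 || errno == ENOENT)
            return 0;
        if (errno != EBUSY)
            return -1;           /* permission or path error: give up */
        if (i + 1 < attempts)
            nanosleep(&delay, NULL);
    }
    errno = EBUSY;
    return -1;
}
```

Compared with a single fixed sleep, this returns as soon as the directory can be removed and still gives up after `attempts * delay_ms` in the worst case.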

### Daemon Mode

- `NetRunner.Daemon` — supervised long-running process for supervision trees
- Auto-drains stdout/stderr to prevent pipe blocking
- Output handling: `:discard` (default), `:log`, or custom `fun/1` callback
- Graceful shutdown: SIGTERM → 5s wait → SIGKILL

### Stats

- `NetRunner.Process.stats/1` — per-process I/O statistics
- Tracks: `bytes_in`, `bytes_out`, `bytes_err`, `read_count`, `write_count`, `duration_ms`, `exit_status`
- Plain integer counters kept in GenServer state, with negligible overhead

### Safety

- Timeout enforcement on `run/2` via `:timeout` option
- Output size limits via `:max_output_size` option
- Platform support: macOS (Darwin) and Linux