# ADR 001: Worker.Starter Supervision Pattern
**Status**: Accepted
**Date**: 2025-10-07
**Deciders**: nshkrdotcom
**Context**: Issue #2 feedback questioning supervision layer complexity
---
## Context and Problem Statement
Snakepit workers manage external OS processes (Python gRPC servers). We need to handle:
- Worker crashes and automatic restarts
- Clean resource cleanup (ports, file descriptors, child processes)
- Future potential for per-worker resource pooling (connections, caches)
**The Question** (from Issue #2):
> "What does having a :temporary worker under a Supervisor, which itself is under a DynamicSupervisor bring us (that it's worth the extra layer of complexity)?"
---
## Decision Drivers
1. **External Process Management** - Workers spawn and manage OS processes
2. **Automatic Recovery** - Workers should restart without Pool intervention
3. **Clean Lifecycle** - Worker termination must clean up all associated resources
4. **Future Extensibility** - May need per-worker connection pools, caches, etc.
5. **Separation of Concerns** - Pool manages availability, not lifecycle details
---
## Considered Options
### Option 1: Direct DynamicSupervisor (Standard Pattern)
```
DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (GenServer, :transient)
```
**Implementation**:
```elixir
def start_worker(worker_id) do
child_spec = {GRPCWorker, [id: worker_id]}
DynamicSupervisor.start_child(__MODULE__, child_spec)
end
```
**Pros**:
- ✅ Simple, standard OTP pattern
- ✅ One less process per worker
- ✅ Familiar to Elixir developers
**Cons**:
- ❌ Pool must track and handle worker restarts
- ❌ Harder to group worker + resources in future
- ❌ Less encapsulation
---
### Option 2: Worker.Starter Wrapper (Current Choice)
```
DynamicSupervisor (WorkerSupervisor)
└── Worker.Starter (Supervisor, :permanent)
└── GRPCWorker (GenServer, :transient)
```
**Implementation**:
```elixir
def start_worker(worker_id) do
child_spec = {Worker.Starter, {worker_id, GRPCWorker}}
DynamicSupervisor.start_child(__MODULE__, child_spec)
end
```
**Pros**:
- ✅ Automatic restarts without Pool intervention
- ✅ Worker.Starter can supervise multiple related processes
- ✅ Clean encapsulation (terminate Starter = terminate all)
- ✅ Extensible for future per-worker resources
**Cons**:
- ❌ Extra process per worker (~1KB memory)
- ❌ More complex process tree
- ❌ Non-standard pattern (requires explanation)
---
### Option 3: Worker as Supervisor
```
DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (Supervisor + GenServer hybrid)
└── Python Process (Port)
```
**Implementation**:
```elixir
defmodule GRPCWorker do
use Supervisor
# Also implements GenServer-like callbacks
end
```
**Pros**:
- ✅ One fewer module
- ✅ Worker directly supervises its resources
**Cons**:
- ❌ Violates single responsibility (mixing Supervisor + GenServer)
- ❌ Complex GenServer.call routing
- ❌ Harder to reason about behavior
---
### Option 4: erlexec Integration
```
DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (GenServer)
└── erlexec Port (C++ middleware)
└── Python Process (guaranteed cleanup)
```
**Implementation**:
```elixir
# Use erlexec library
{:ok, pid, os_pid} = :exec.run(
"python3 grpc_server.py",
[monitor: true, kill_timeout: 5000, kill_group: true]
)
```
**Pros**:
- ✅ Guaranteed cleanup (C++ port enforces)
- ✅ No orphans possible
- ✅ Simpler than Worker.Starter
**Cons**:
- ❌ External dependency
- ❌ Tight coupling (Python dies with BEAM)
- ❌ No independence for long-running jobs
---
## Decision Outcome
**Chosen Option**: **Option 2 (Worker.Starter Wrapper)**
### Justification
1. **External processes are special** - Not just Elixir GenServers
2. **Automatic restarts proven** - Works in production without Pool logic
3. **Future-proof** - Can add per-worker resources without refactoring
4. **Clean abstraction** - Terminating Starter atomically cleans up everything
### Trade-offs Accepted
**Memory**: +1KB per worker
- For 100 workers: +100KB (negligible on modern systems)
- Acceptable cost for cleaner architecture
**Complexity**: Non-standard pattern
- Requires documentation (this ADR)
- Benefits outweigh learning curve
**Process tree depth**: 3 levels instead of 2
- Observer shows deeper tree
- But clearer ownership of resources
---
## Consequences
### Positive
**Automatic Restart**:
```elixir
# Worker crashes
# Worker.Starter detects :DOWN
# Worker.Starter restarts worker automatically
# Pool gets :DOWN but doesn't need to act
```
**Atomic Cleanup**:
```elixir
# Stop a worker
DynamicSupervisor.terminate_child(WorkerSupervisor, starter_pid)
# → Starter stops
# → Worker stops
# → Python process stops
# → All cleaned up atomically
```
**Future Extensibility**:
```elixir
# v0.5: Add per-worker connection pool
children = [
{GRPCWorker, [id: worker_id]},
{ConnectionPool, [worker_id: worker_id]}, # Future
{MetricsCollector, [worker_id: worker_id]} # Future
]
Supervisor.init(children, strategy: :one_for_one)
```
### Negative
**Memory Overhead**:
- Each Worker.Starter process: ~1KB
- 100 workers: 100KB total
- Monitored but acceptable
**Conceptual Overhead**:
- Developers must understand pattern
- Not in typical Phoenix/Elixir apps
- Requires this ADR for explanation
**Debugging Complexity**:
- :observer shows 3-level tree
- Must understand which PID is which
- More processes to track
---
## Validation
### Tested Scenarios
✅ **Normal operation** (139 tests passing):
- Workers start under Starter
- Restart on crash works
- Clean shutdown works
✅ **High concurrency** (100 workers):
- Initialization: 3 seconds
- No resource leaks
- Clean shutdown
✅ **Crash recovery**:
- Worker crashes → Starter restarts
- Starter crashes → DynamicSupervisor restarts Starter+Worker
- Pool handles both gracefully
### Performance Impact
**Startup**:
- Direct: ~20ms for 4 workers
- With Starter: ~30ms for 4 workers
- Overhead: ~2.5ms per worker (negligible)
**Memory**:
- Per worker: +1KB (Worker.Starter process)
- 100 workers: +100KB total
- Acceptable on modern systems
**Runtime**:
- No performance difference
- Message routing: extra hop (microseconds)
---
## Alternatives for Future
### If Complexity Becomes Issue
**Option A**: Simplify to direct supervision
- Benchmark performance difference
- If negligible overhead, remove pattern
- Keep in v0.4.x, reconsider in v0.5
**Option B**: Adopt erlexec for coupled mode
- Guaranteed cleanup with less code
- Trade independence for simplicity
- See multi-mode architecture design
### If Resources Are Added
**Validation**: If per-worker resources materialize, pattern was correct
**Re-evaluation**: If no resources by v0.5, reconsider necessity
---
## Related Decisions
**Multi-Mode Architecture** (Future):
- `docs/20251007_external_process_supervision_design.md`
- Coupled mode (current): Keep Worker.Starter
- Supervised mode (systemd): Connect to existing pool
- Independent mode (ML): Heartbeat-based
- Distributed mode (k8s): Service mesh
**Process Cleanup**:
- DETS tracking (ProcessRegistry)
- ApplicationCleanup (emergency handler)
- beam_run_id for safe cleanup
---
## Review Schedule
**Initial Review**: 2025-10-07 (this ADR)
**Next Review**: 2025-Q2 (6 months)
**Criteria**: Check if per-worker resources were added
**Questions for Review**:
1. Did we add per-worker resources? (validates pattern)
2. Is memory overhead acceptable? (re-benchmark)
3. Is team comfortable with pattern? (survey)
4. Would simpler approach work? (prototype comparison)
---
## References
- **Issue #2**: ElixirForum feedback questioning pattern
- **Code**: `lib/snakepit/pool/worker_starter.ex`
- **Tests**: `test/unit/pool/worker_supervisor_test.exs`
- **Poolboy**: Uses similar wrapper patterns
- **Our analysis**: `docs/20251007_slop_cleanup_analysis/`
---
## Notes
This pattern is **intentional, not accidental**. It solves real problems with managing external OS processes from Erlang. The complexity is justified by the requirements, but requires documentation (this ADR) to communicate the reasoning.
**For new team members**: Read this ADR first before questioning the pattern. If still unclear, discuss in team review.
**For future refactoring**: Benchmark direct supervision before removing. The pattern serves a purpose.