# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.13.0] - 2026-02-06

This release is primarily internal hardening: non-blocking GenServer callbacks,
structured public API errors, centralized configuration, and a deprecation
framework for legacy optional modules. No new user-facing features are
introduced beyond instance isolation tokens and the structured error contract.

### Added

- **Instance isolation tokens** - `instance_token` configuration (and `SNAKEPIT_INSTANCE_TOKEN` env var) provides per-VM isolation when multiple Snakepit instances share a host or deployment directory. Each concurrent VM must use a unique token so cleanup logic never targets another live instance's workers.
- **Structured public API errors** - `Error.normalize_public_result/2` converts internal atom/tuple error codes (`:queue_timeout`, `:pool_saturated`, `:worker_busy`, `:session_worker_unavailable`, `:pool_not_initialized`, `:pool_not_found`, `:worker_exit`) into categorized `%Snakepit.Error{}` structs. Error returns from the public API (`Snakepit.execute/3`, `Pool.execute/3`, and `Pool.execute_stream/3`) are now always `{:error, %Snakepit.Error{}}` (see the sketch after this list).
- **Legacy module deprecation framework** - `Snakepit.Internal.Deprecation` provides telemetry-based once-per-VM deprecation events (`[:snakepit, :deprecated, :module_used]`) for legacy optional modules: `Snakepit.Compatibility`, `Snakepit.Executor`, `Snakepit.HealthMonitor`, `Snakepit.PythonVersion`, `Snakepit.Telemetry`, `Snakepit.Telemetry.GPUProfiler`, `Snakepit.Telemetry.Handlers.Logger`, and `Snakepit.Telemetry.Handlers.Metrics`.
- **Centralized timeout runner** - `Snakepit.Internal.TimeoutRunner` standardizes execution with timeouts across the executor, Python package runner, and shutdown modules using `spawn_monitor`+`receive` instead of `Task.async`/`yield`/`shutdown`.
- **Async fallback helpers** - `Snakepit.Internal.AsyncFallback` consolidates duplicated supervisor-unavailable fallback logic (`start_nolink_with_fallback/3`, `start_child_with_fallback/3`, `start_monitored/1`, `start_monitored_fire_and_forget/1`).
- **Pool RuntimeSupervisor** - pool-dependent children are grouped under a `rest_for_one` supervisor, ensuring restarts happen in dependency order.
- **Enriched pool_not_found errors** - `Dispatcher.get_pool` now returns `{:error, {:pool_not_found, pool_name}}` carrying the missing pool name for diagnostics.
- **Pre-stream telemetry buffering** - Python `TelemetryStream` buffers events emitted before the gRPC stream is attached and flushes them once the async loop is initialized, preventing dropped startup events.
- **Session quota enforcement** in `SessionStore` to protect against resource exhaustion during high-volume worker assignments.
- **Supervisor fallbacks for heartbeat and lifecycle tasks** - when `TaskSupervisor` is unavailable, heartbeat pings and lifecycle checks fall back to manually monitored processes instead of crashing.
- **Dispatch telemetry event** - `[:snakepit, :pool, :call, :dispatched]` is emitted when a request is assigned to a worker and execution begins. Metadata includes `pool`, `worker_id`, `command`, and `queued` (boolean indicating whether the request waited in the queue). Enables deterministic synchronization for contention-aware consumers.
- New runtime-configurable defaults: `process_registry_dets_flush_interval_ms`, `grpc_stream_open_timeout_ms`, `grpc_stream_control_timeout_ms`, `lifecycle_check_max_concurrency`, `lifecycle_worker_action_timeout_ms`, `grpc_worker_health_check_timeout_ms`.
- `@enforce_keys` on `Pool`, `Pool.State`, `ProcessRegistry`, `HeartbeatMonitor`, and `LifecycleManager` structs.
- Generated getters in `Defaults` for `pool_reconcile_interval_ms`, `pool_reconcile_batch_size`, and supervisor restart intensity values.
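
A minimal sketch of consuming the new error contract from the caller's side. The command name, parameters, and `pool_name` option below are illustrative, and the fields of `%Snakepit.Error{}` are not enumerated here, so the example matches on the struct as a whole:

```elixir
case Snakepit.execute("predict", %{input: [1, 2, 3]}, pool_name: :default) do
  {:ok, result} ->
    result

  {:error, %Snakepit.Error{} = error} ->
    # Queue timeouts, saturation, worker exits, etc. all arrive as the same
    # struct, so a single clause can handle every failure path.
    {:error, error}
end
```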

### Changed

- **GRPCWorker is fully non-blocking** - long-running gRPC calls execute in async tasks with an internal request queue, keeping workers responsive to health checks and state queries during active calls. `get_health` and `get_info` calls also use non-blocking async mechanisms.
- **Periodic health checks route through async RPC queue** instead of running synchronously in `handle_info`.
- **ProcessRegistry DETS persistence is now batched** - sync operations are deferred behind a configurable flush interval instead of running directly inside GenServer callbacks. Startup cleanup deferred to `handle_continue` to avoid blocking the supervisor.
- **Pool initialization uses supervised async tasks** - initialization is launched with `Task.Supervisor.async_nolink/2` with explicit crash attribution instead of `spawn_link`.
- **Lifecycle checks run off the GenServer callback path** with bounded per-worker concurrency via `async_stream_nolink`. Worker recycle operations run in supervised tasks tracked via `recycle_task_refs`. `LifecycleManager.terminate/2` cancels timers and kills tracked tasks.
- **GRPCWorker terminate cleans up pending calls** - iterates `pending_rpc_calls` and `rpc_request_queue`, killing in-flight task PIDs, demonitoring refs, and replying to waiting callers with structured shutdown errors.
- **Telemetry stream operations are asynchronous** - gRPC stream open and control operations execute in supervised tasks with explicit operation timeouts. Connection lifecycle driven by a dedicated task with stream_ready/timeout messages.
- **Heartbeat pings run in supervised tasks** instead of blocking the `HeartbeatMonitor` GenServer.
- **GPU profiling moved to asynchronous model** to prevent slow hardware queries from stalling telemetry collection.
- **Config resolution centralized** - `Snakepit.Config.adapter_module/2`, `Snakepit.Config.capacity_strategy/1`, `Snakepit.Config.adapter_args/1` resolve with explicit precedence (override -> pool -> legacy -> global -> default). All consumers delegate to these helpers.
- **Shutdown module consolidation** - `Shutdown.shutdown_reason?/1` replaces duplicated private implementations across `GRPCWorker` and `GrpcStream`. `Shutdown.stop_supervisor/2` extracted for reusable supervisor stop logic.
- **Application compile-time env replaced with runtime function** to prevent stale environment values.
- **Legacy pool_size precedence fixed** - top-level `:pool_size` now wins over `pool_config.pool_size` when both are set.
- **GRPC Client mock channel dispatch tightened** - mock response logic extracted into `ClientMock`; `Client.mock_channel?/1` no longer silently treats non-map channels as mocks.
- **ToolRegistry errors use tagged tuples** instead of string messages; `BridgeServer` formats them at the API boundary.
- **SessionStore default arguments consolidated** using Elixir default argument syntax.
- **ClientSupervisor startup race normalization** - `{:error, {:already_started, pid}}` normalized to `:ignore`.
- **TaintRegistry consume_restart atomicity** - uses `:ets.take/2` instead of lookup-then-delete for single-consumer semantics.
- **ProcessRegistry DETS access indirection** - direct `:dets` calls replaced with `persist_put/3`, `persist_delete/2`, `persist_sync/1` wrappers.

### Deprecated

- **`Snakepit.HealthMonitor`** - use worker lifecycle telemetry and host-managed health policy. Emits `[:snakepit, :deprecated, :module_used]` once per VM.
- **`Snakepit.Executor`** - use `Snakepit.RetryPolicy`, `Snakepit.CircuitBreaker`, and timeout helpers directly. Emits deprecation event once per VM.
- **`Snakepit.Compatibility`**, **`Snakepit.PythonVersion`**, **`Snakepit.Telemetry`** (legacy module), **`Snakepit.Telemetry.GPUProfiler`**, **`Snakepit.Telemetry.Handlers.Logger`**, **`Snakepit.Telemetry.Handlers.Metrics`** - all emit deprecation events on first use; see event metadata for replacement guidance.
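
A small sketch for auditing these deprecation events during an upgrade; the metadata shape is not documented here, so the handler just inspects whatever it receives:

```elixir
require Logger

:telemetry.attach(
  "snakepit-deprecation-audit",
  [:snakepit, :deprecated, :module_used],
  fn _event, _measurements, metadata, _config ->
    # Fires at most once per VM per legacy module.
    Logger.warning("deprecated Snakepit module used: #{inspect(metadata)}")
  end,
  nil
)
```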

### Fixed

- **Shutdown flag stickiness** - `mark_in_progress` now stores a `{pid, ref}` marker with owner monitoring. Stale flags from crashed processes are automatically cleared and no longer block worker startup.
- **Process.alive? TOCTOU races** removed from `HeartbeatMonitor`, `GRPCWorker`, `Application`, `WorkerSupervisor`, `Initializer`, `Listener`, `Shutdown`, and `LifecycleManager` in favor of monitor-based or catch-based patterns (a generic sketch of the monitor-based pattern follows this list).
- **CapacityStore :noproc crashes during shutdown** - all public APIs catch exits and return typed fallback values.
- **GRPCWorker orphaned monitor growth** - orphaned RPC task monitors are now cleaned from `pending_rpc_monitors` when no matching pending call exists.
- **HeartbeatMonitor stale timeout messages** - ignores `:heartbeat_timeout` when `timeout_timer` is `nil`. Demonitors ping task refs on timeout to prevent stale `:DOWN` delivery.
- **ApplicationCleanup bounded termination** - cleanup runs in a spawned process with a configurable timeout budget, preventing blocked supervision tree shutdown.
- **Listener process liveness detection** - replaced `Process.alive?/1` with monitor-and-receive to correctly detect remote node processes.
- **ProcessRegistry cleanup task lifecycle** - catches `TaskSupervisor` `:noproc`, falls back to `spawn_monitor`, drains in-flight cleanup on terminate with configurable timeout.
- **Telemetry stream callback blocking** - gRPC stream open and control operations now execute asynchronously with explicit operation timeouts.
- **Heartbeat ping callback blocking** - pings run in supervised tasks with bounded timeout handling and cleanup.
- **Heartbeat pong routing under async execution** - `notify_pong` remains backward-compatible when `ping_fun` executes in a task by routing self-targeted pongs back to the owning monitor process.
- **SessionStore callback containment** - `update_session` now catches `throw` and `exit` in addition to rescued exceptions.
- **Dynamic atom creation from telemetry config keys** - config normalization uses template-driven key matching instead of `String.to_atom`.
- **Async task monitor hygiene in Pool** - tracked async task refs are demonitor/flushed and no longer misrouted through worker `:DOWN` handling.
- **WorkerSupervisor shutdown race handling** - APIs return structured errors when the supervisor is unavailable instead of raising `:noproc`.
- **Pool initialization shutdown cleanup** - in-flight async initialization tasks are cancelled when `Pool` terminates.
- **Initialization resource delta telemetry** - baseline captured at start instead of sampling both values at completion.
- **Python telemetry events dropped during startup** - pre-stream buffering ensures events emitted before gRPC connection are preserved.
- **Thread-safe Python telemetry emission** - uses `loop.call_soon_threadsafe` with loop state checks.
- **Rogue cleanup configuration** - correctly handles explicit `false` values and string-key variations.
- **GrpcStream and Snakepit.cleanup `:noproc` tolerance** - catch exits instead of pre-checking `Process.whereis`.
- **Port reservation race in tests** - test helper table reservation tolerates ETS owner races during concurrent execution.
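
Several of the fixes above replace `Process.alive?/1` pre-checks with monitor-based liveness. A generic sketch of that pattern (not Snakepit's exact implementation):

```elixir
defmodule LivenessSketch do
  # Let the :DOWN message answer the liveness question instead of a racy
  # Process.alive?/1 check; this also works for pids on remote nodes.
  def await_down(pid, timeout \\ 5_000) do
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, ^pid, reason} -> {:down, reason}
    after
      timeout ->
        Process.demonitor(ref, [:flush])
        :still_running
    end
  end
end
```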

## [0.12.0] - 2026-01-25

### Added

- **Post-readiness process group resolution** - Workers re-check process group membership after Python signals readiness, handling cases where `os.setsid()` is called after initial spawn. Uses exponential backoff (up to 250ms) to accommodate delayed OS-level bookkeeping.
- `ProcessRegistry.update_process_group/3` to update `:pgid` and `:process_group?` metadata after worker startup, with PID mismatch protection to prevent corrupting restarted worker entries.
- `ready_workers` tracking in pool state to distinguish workers that have completed the gRPC handshake from those merely spawned.
- `init_failed` flag on pool state to mark pools that failed to start any workers.
- Global `await_ready_waiters` list for coordinating callers waiting on all pools.
- Python executable validation for `:python_executable` and `SNAKEPIT_PYTHON` overrides, checking both existence and execute permissions before use.
- **`Snakepit.Pool.await_init_complete/2`** - waits for asynchronous pool initialization to complete, separate from `await_ready/2` which returns as soon as each pool has at least one ready worker. Useful for tests and scripts that need to wait for all workers to be spawned before proceeding.
- **Pool initialization telemetry events**:
  - `[:snakepit, :pool, :init_started]` - emitted when pool initialization begins, with `total_workers` measurement.
  - `[:snakepit, :pool, :init_complete]` - emitted when initialization finishes, with `duration_ms`, `total_workers`, and `pool_workers` metadata.
  - `[:snakepit, :pool, :worker_ready]` - emitted when a worker completes the gRPC handshake, with `worker_count`, `pool_name`, and `worker_id` metadata.
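
A minimal sketch of observing pool startup with these events; measurement and metadata keys follow the descriptions above and may differ slightly in practice:

```elixir
:telemetry.attach_many(
  "snakepit-pool-init-observer",
  [
    [:snakepit, :pool, :init_started],
    [:snakepit, :pool, :init_complete],
    [:snakepit, :pool, :worker_ready]
  ],
  fn event, measurements, metadata, _config ->
    IO.inspect({event, measurements, metadata}, label: "pool init")
  end,
  nil
)
```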

### Changed

- **Pool readiness semantics** - `await_ready/2` now waits for at least one worker per pool to complete the gRPC handshake, not just for workers to be spawned. Pools report ready only when `ready_workers` is non-empty.
- Worker availability now requires both capacity headroom AND ready status. Workers are no longer marked available until they signal readiness.
- `Snakepit.execute/3` returns `{:error, :pool_not_initialized}` immediately for pools with `init_failed: true` instead of queueing requests that would eventually timeout.
- `PythonRuntime.python_version/1` returns `{:ok, version}` or `{:error, reason}` tuples instead of raw strings or `"unknown"`.
- `PythonRuntime.build_identity/1` propagates errors from version detection instead of silently returning partial identity maps.
- `PythonRuntime.runtime_identity/0` now refreshes the cached identity when the resolved Python path changes, supporting dynamic reconfiguration.
- Waiter reply logic refactored to stagger replies (2ms apart) to avoid a thundering herd on pool initialization.
- `State.ensure_worker_available/2`, `State.increment_load/2`, and `State.decrement_load/2` now gate availability on worker readiness.
- `EventHandler.remove_worker_from_pool/4` cleans up `ready_workers` set when removing workers.

### Fixed

- **Startup race for process group detection** - Previously, if Python called `os.setsid()` after Snakepit captured the initial process group ID, the worker would remain in PID-only kill mode, leaving orphaned grandchildren after termination. The bootstrap phase now retries process group resolution after readiness.
- **Pool readiness gating** - Early calls to `Snakepit.execute/3` no longer hit workers with half-closed gRPC streams. Workers must complete the handshake before receiving work.
- **Broken pool signaling** - Pools that fail to start any workers are now flagged immediately. `await_ready/2` returns `{:error, %Snakepit.Error{}}` promptly instead of blocking until timeout.
- **Python runtime override robustness** - Invalid `:python_executable` or `SNAKEPIT_PYTHON` paths now return `{:error, {:invalid_python_executable, path}}` instead of crashing the VM on first use. `runtime_env/0` returns an empty list for invalid configurations instead of raising.
- **Legacy `pool_config` now preserves all user overrides** - Previously, only `startup_batch_size`, `startup_batch_delay_ms`, and `max_workers` were extracted from legacy `pool_config` maps, silently dropping other fields like `adapter_env` and `adapter_args`. The config is now fully merged before applying defaults.

## [0.11.1] - 2026-01-23

### Changed
- `ETSOwner.ensure_table/2` is now `ensure_table/1` - table options are centralized in ETSOwner as the single source of truth for known tables.
- `ETSOwner` raises `ArgumentError` for unknown table names, preventing accidental table creation outside the managed set.
- `ETSOwner` raises a clear error when called before the Snakepit application is started.
- **`WorkerSupervisor.start_worker/5` now returns the GRPCWorker PID** instead of the starter PID. The function waits up to 1 second for the worker to register, making the returned PID immediately usable for follow-up operations.
- `CapacityStore.ensure_started/0` no longer auto-starts the GenServer; returns `{:error, :not_started}` if the process isn't running. This prevents unsupervised process spawning during shutdown.
- `GRPC.Listener` init now uses `handle_continue` instead of spawning a `Task` for listener startup, simplifying the initialization flow.

### Fixed
- ETS table ownership for taint registry and zero-copy handles is now supervised to avoid short-lived processes becoming table owners.
- Race condition in `ETSOwner.create_table/2` now properly re-raises if the table still doesn't exist after catching `ArgumentError` (distinguishes real errors from concurrent creation).
- **Shutdown race in `ProcessManager.wait_for_server_ready/3`** - Now detects `{:EXIT, _, :shutdown}` messages and checks the shutdown flag to exit early instead of timing out during application shutdown.
- **Telemetry stream task lifecycle** - `GrpcStream` now traps exits and properly cleans up stream state when tasks complete or crash, preventing orphaned entries in the streams map.
- **Thread profile resilience during shutdown** - `Thread.start_worker/5`, `stop_worker/1`, `acquire_slot/1`, `get_capacity/1`, and `get_load/1` now handle `CapacityStore` being unavailable gracefully instead of crashing.
- **Pool capacity tracking** - `track_capacity_increment/1` and `track_capacity_decrement/1` now check if `CapacityStore` is available before attempting operations, preventing crashes during shutdown.

## [0.11.0] - 2026-01-11

### Added
- **Graceful serialization fallback** for non-JSON-serializable Python objects. Instead of failing, Snakepit now:
  - Tries common conversion methods (`model_dump`, `to_dict`, `_asdict`, `tolist`, `isoformat`)
  - Falls back to a marker dict with type info for truly non-serializable objects (safe by default, repr excluded)
- **`Snakepit.Serialization` Elixir module** with helpers for detecting and inspecting unserializable markers (see the sketch after this list):
  - `unserializable?/1` - checks if a value is an unserializable marker
  - `unserializable_info/1` - extracts type and repr info from markers
- **Configurable marker detail policy** via environment variables on Python workers:
  - `SNAKEPIT_UNSERIALIZABLE_DETAIL` - controls what info is included (`none` default, `type`, `repr_truncated`, `repr_redacted_truncated`)
  - `SNAKEPIT_UNSERIALIZABLE_REPR_MAXLEN` - maximum repr length (default 500, max 2000)
- **Secret redaction** in `repr_redacted_truncated` mode - redacts common patterns (API keys, bearer tokens, passwords) from repr output.
- `GracefulJSONEncoder` class and `_orjson_default` function in `serialization.py` for both stdlib json and orjson paths.
- **Tolist size guard** (`SNAKEPIT_TOLIST_MAX_ELEMENTS` env var, default 1M) to prevent explosive sparse→dense array conversions:
  - Pre-checks numpy arrays via `isinstance()` before calling `tolist()` to avoid allocation
  - Best-effort heuristics for scipy sparse matrices and pandas DataFrames
  - Post-checks unknown types after `tolist()` with fallback to marker if oversized
- **Telemetry for marker creation** - Emits `[:snakepit, :serialization, :unserializable_marker]` events with type metadata (never repr). Deduplicated per-type-per-process with a 10K type cap to bound cardinality.
- `serialization_demo` tool in the showcase adapter demonstrating datetime, custom class, and convertible object handling.
- `graceful_serialization.exs` example showing the feature in action.
- `guides/graceful-serialization.md` comprehensive guide covering configuration, helpers, telemetry, and best practices.
- Unit tests for graceful serialization (Python: 24 tests, Elixir: 14 tests) plus policy behavior tests.
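
A sketch of handling marker values on the Elixir side. The call uses the `serialization_demo` showcase tool mentioned above; the options and result shape are illustrative:

```elixir
{:ok, result} = Snakepit.execute("serialization_demo", %{}, [])

if Snakepit.Serialization.unserializable?(result) do
  # Returns the captured type (and repr, when the worker's detail policy
  # allows it) instead of the original Python object.
  Snakepit.Serialization.unserializable_info(result)
else
  result
end
```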

## [0.10.1] - 2026-01-11

### Fixed
- `Pool.handle_call/3` now resolves string `pool_name` options to configured pool atoms via `resolve_pool_name_opt/2`, fixing routing when callers pass pool names as strings.

## [0.10.0] - 2026-01-10

### Changed
- gRPC listener defaults to internal-only mode (port 0) and now publishes its assigned port to workers via the `grpc_listener` config.
- Added explicit external binding modes (`:external`, `:external_pool`) with required host/port configuration and pooled port selection for multi-instance deployments.
- ProcessRegistry DETS paths are now namespaced by `instance_name` and `data_dir` to prevent shared-deployment collisions.

### Fixed
- **Session affinity now supports strict routing** - Requests with `session_id` can be guaranteed to route to the same worker where refs exist by enabling strict affinity modes, preventing "Unknown reference" errors for in-memory Python refs (see the sketch after this list).
  - Added `affinity: :strict_queue` to queue on the preferred worker when busy.
  - Added `affinity: :strict_fail_fast` to return `{:error, :worker_busy}` when the preferred worker is busy.
  - Kept `affinity: :hint` as the default for legacy behavior (falls back to any available worker).
- Documentation now clarifies hint vs strict affinity behavior, and the new `grpc_session_affinity_modes.exs` example demonstrates both modes in practice.
- Examples now restart Snakepit when run via `mix run` so example configs are applied consistently; README recommends `mix run --no-start` for predictable startup.
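
A sketch of the affinity modes in practice; the command name, parameters, and session id are illustrative:

```elixir
# Queue on the worker holding this session's refs, even if it is busy:
Snakepit.execute("fetch_ref", %{ref: "ref-123"},
  session_id: "user-42",
  affinity: :strict_queue
)

# Or fail fast with {:error, :worker_busy} instead of queueing:
Snakepit.execute("fetch_ref", %{ref: "ref-123"},
  session_id: "user-42",
  affinity: :strict_fail_fast
)
```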

## [0.9.1] - 2026-01-09

### Added
- `ClientSupervisor` wrapper for safe gRPC client supervision across gRPC variants.
- gRPC server request logging interceptor with optional `:grpc_request_logging` and category-aware debug output.
- `mix snakepit.python_test` task to bootstrap and run the Python test suite (supports `--no-bootstrap`).
- Pool reconciliation loop to restore minimum worker counts after crash storms (configurable via `pool_reconcile_interval_ms` and `pool_reconcile_batch_size`).
- Configurable restart intensity for worker starters and worker supervisors (`worker_starter_*` and `worker_supervisor_*` defaults).

### Changed
- gRPC client and worker stream defaults now derive from `grpc_command_timeout/0` and `stream_timeout/0`.
- Pool and worker execution now handle `:infinity` timeouts without deadline bookkeeping.
- Python gRPC server now runs sync adapter calls in worker threads by default; use `thread_sensitive` metadata or `SNAKEPIT_THREAD_SENSITIVE` to keep execution on the main thread.
- `Snakepit.Pool` metadata validation now accepts `Snakepit.Pool` as the default pool identifier.
- gRPC is pinned to `0.11.5` and protobuf is pinned to `0.16.0` (override).

### Fixed
- gRPC status code 4 now maps to `{:error, :timeout}` in the client.
- Process group shutdown waits for group exit using `/proc` or `ps`, avoiding zombie false positives.
- Test suite now tracks and terminates leaked external Python processes after runs.

## [0.9.0] - 2026-01-02

### Added
- `run_as_script/2` `:exit_mode` option and `SNAKEPIT_SCRIPT_EXIT` env var for explicit exit semantics (see the sketch after this list).
- Integration tests for external VM exit behavior and broken-pipe safety.
- `run_as_script/2` `:stop_mode` option for ownership-aware application shutdown.
- Shutdown orchestrator for script shutdown sequencing.
- Script shutdown telemetry events (`[:snakepit, :script, :shutdown, ...]`) with required metadata.
- CI docs build gate (`mix docs`) to catch documentation build errors.
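
A hedged sketch of the new script options; the work function is illustrative, and `:auto` is the documented default for `:exit_mode` (with `SNAKEPIT_SCRIPT_EXIT` able to override it):

```elixir
Snakepit.run_as_script(
  fn ->
    {:ok, _} = Snakepit.execute("ping", %{}, [])
  end,
  exit_mode: :auto
)
```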

### Changed
- Exit selection precedence now favors `:exit_mode` over legacy `:halt` and env vars.
- `Snakepit.Examples.Bootstrap.run_example/2` now defaults to `exit_mode: :auto` and respects `stop_mode`.
- `run_as_script/2` now captures cleanup targets before stopping and routes shutdown through the orchestrator.
- Documentation now aligns README/guides with `exit_mode`/`stop_mode` semantics and the Script Lifecycle tables.
- Tests now avoid timing sleeps, using deterministic polling, receive timeouts, and `Logger.flush/0` for async-safe synchronization.
- Test timing constants were tightened (heartbeat, circuit breaker, queue churn, gRPC slow-operation paths) to reduce suite runtime.
- Long-running integration and randomized flow tests are tagged `:slow`, and random worker flow iterations were trimmed.
- Pool size isolation checks now wait on pool stats instead of fixed delays.
- gRPC errors during shutdown now log at debug level to reduce noise during expected teardown.
- Refactored `Snakepit.Pool` and `Snakepit.GRPCWorker` internals into focused helpers (dispatcher/scheduler/event handler, bootstrap/instrumentation) without behavior changes.
- `Snakepit.TaskSupervisor` now starts even when pooling is disabled so queue dispatch paths can spawn tasks safely.

### Fixed
- Removed direct IO from the script exit path to avoid hangs on closed pipes.
- `run_as_script/2` no longer stops Snakepit in embedded usage unless explicitly requested.
- Script shutdown now marks shutdown-in-progress whenever cleanup runs, so cleanup-only runs (when Snakepit is already started) treat Python exits as expected.
- Shape mismatch telemetry test now filters events by operation to avoid cross-test telemetry bleed.
- Worker lifecycle memory-probe warning test now synchronizes probe failures and log capture to prevent flakes.
- BEAM run IDs now use second-resolution timestamps plus a monotonic counter to avoid collisions during rapid restarts.
- ProcessRegistry rebuilds DETS metadata when index corruption is detected, preventing stale entries after crash/restart cycles.

## [0.8.9] - 2026-01-01

### Breaking Changes

- **uv is now required** - pip support has been removed. Snakepit now requires [uv](https://docs.astral.sh/uv/) for Python package management.
  - Install uv: `curl -LsSf https://astral.sh/uv/install.sh | sh`
  - Or via Homebrew: `brew install uv`
  - The `:installer` config option has been removed (was `:auto`, `:uv`, or `:pip`)
  - uv provides 10-100x faster package operations and more reliable version resolution

### Fixed

- **Version checking now validates constraints** - `PythonPackages.check_installed/2` now properly verifies that installed package versions satisfy the version constraints in requirements (e.g., `grpcio>=1.76.0`).
  - Previously, only package existence was checked, not version satisfaction
  - This caused runtime errors when outdated packages were installed (e.g., grpcio 1.67.1 when >=1.76.0 was required)
  - Now uses `uv pip install --dry-run` for accurate PEP-440 version checking
  - Packages that need upgrading are correctly identified as "missing" and reinstalled

- **Bootstrap now uses quiet `uv pip install`** - Reduced noise from "Requirement already satisfied" messages during `mix test --include python_integration`

- **Added startup feedback** - Shows "🐍 Checking Python package requirements..." during app startup in dev/test when checking packages (once per BEAM session)

### Changed

- Removed unused configuration keys from `config/config.exs`, `config/test.exs`, and `config/grpc_test.exs` to trim dead config surface (legacy worker timeouts and unused grpc_test flags)
- Virtual environments are now created using `uv venv` for consistency with package management
- Simplified `PythonPackages` module by removing all pip-specific code paths

## [0.8.8] - 2025-12-31

### Added
- **Centralized configurable defaults** - New `Snakepit.Defaults` module provides runtime-configurable defaults for all hardcoded values
  - All 68 previously hardcoded timeout, sizing, and threshold values are now configurable via `Application.get_env/3`
  - Values are read at runtime, allowing configuration changes in `config/runtime.exs` without recompilation
  - Defaults remain unchanged from previous versions for backward compatibility
  - See `Snakepit.Defaults` module documentation for complete list of configurable keys

- **Timeout profile architecture** - New single-budget, derived deadlines, profile-based timeout system
  - Six predefined profiles: `:balanced`, `:production`, `:production_strict`, `:development`, `:ml_inference`, `:batch`
  - New user-facing API: `default_timeout/0`, `stream_timeout/0`, `queue_timeout/0`
  - Margin configuration: `worker_call_margin_ms/0` (default 1000), `pool_reply_margin_ms/0` (default 200)
  - RPC timeout derivation: `rpc_timeout/1` computes inner timeout from total budget
  - Legacy getters (`pool_request_timeout`, `grpc_command_timeout`, etc.) now derive from profile when not explicitly configured
  - Configure via: `config :snakepit, timeout_profile: :production`
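
A configuration sketch combining a profile with an explicit override, assuming the keys above are plain `:snakepit` application env keys read at runtime:

```elixir
# config/runtime.exs
import Config

config :snakepit,
  timeout_profile: :production,
  # Explicitly set legacy keys still take precedence over profile-derived values.
  pool_queue_timeout: 5_000
```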

- **Pool deadline-aware execution** - Pool.execute/3 now stores deadline_ms for queue-aware timeout handling
  - New helper: `Pool.get_default_timeout_for_call/3` for call-type-aware timeout lookup
  - New helper: `Pool.derive_rpc_timeout_from_opts/2` for deadline-aware RPC timeout derivation
  - New helper: `Pool.effective_queue_timeout_ms/2` for budget-aware queue timeout
  - GenServer.call timeout caught and returned as structured `{:error, %Snakepit.Error{}}`

### Changed
- **Pool module** - Timeout and sizing defaults now read from `Snakepit.Defaults`:
  - `pool_request_timeout`, `pool_streaming_timeout`, `pool_startup_timeout`, `pool_queue_timeout`
  - `checkout_timeout`, `default_command_timeout`, `pool_await_ready_timeout`
  - `pool_max_queue_size`, `pool_max_workers`, `pool_max_cancelled_entries`
  - `pool_startup_batch_size`, `pool_startup_batch_delay_ms`

- **GRPCWorker** - Execute and streaming timeouts now configurable:
  - `grpc_worker_execute_timeout`, `grpc_worker_stream_timeout`
  - `grpc_server_ready_timeout`, `worker_ready_timeout`
  - `grpc_worker_health_check_interval`
  - Heartbeat configuration: `heartbeat_ping_interval_ms`, `heartbeat_timeout_ms`, `heartbeat_max_missed`, `heartbeat_initial_delay_ms`

- **Fault tolerance modules** - Circuit breaker, retry policy, crash barrier, and health monitor defaults now configurable:
  - `circuit_breaker_failure_threshold`, `circuit_breaker_reset_timeout_ms`, `circuit_breaker_half_open_max_calls`
  - `retry_max_attempts`, `retry_backoff_sequence`, `retry_max_backoff_ms`, `retry_jitter_factor`
  - `crash_barrier_taint_duration_ms`, `crash_barrier_max_restarts`, `crash_barrier_backoff_ms`
  - `health_monitor_check_interval`, `health_monitor_crash_window_ms`, `health_monitor_max_crashes`

- **Session store** - Session management defaults now configurable:
  - `session_cleanup_interval`, `session_default_ttl`, `session_max_sessions`, `session_warning_threshold`

- **Process registry** - Cleanup intervals now configurable:
  - `process_registry_cleanup_interval`, `process_registry_unregister_cleanup_delay`, `process_registry_unregister_cleanup_attempts`

- **Application and gRPC** - Server configuration now configurable:
  - `grpc_port`, `grpc_num_acceptors`, `grpc_max_connections`, `grpc_socket_backlog`
  - `cleanup_on_stop_timeout_ms`, `cleanup_poll_interval_ms`

- **Config module** - Pool and worker profile defaults now configurable:
  - `default_pool_size`, `default_worker_profile`, `default_capacity_strategy`
  - `config_default_batch_size`, `config_default_batch_delay`, `config_default_threads_per_worker`

### Timeout Architecture Proposal

The following documents the design rationale for the timeout architecture implemented in this release.

#### Problem Statement

Snakepit's timeout configuration was fragmented with 7+ independent timeout keys that didn't coordinate:
- `pool_request_timeout` vs `grpc_command_timeout` - Which is outer? Which is inner?
- Queue wait time consumed part of the budget, but inner timeouts didn't account for it
- GenServer.call timeouts firing before inner timeouts produced unhandled exits instead of structured errors

#### Solution: Single-Budget, Derived Deadlines

**Core principle**: One top-level timeout budget, all inner timeouts derived from remaining time.

**Profile-based defaults** provide sensible starting points for different deployment scenarios:

| Profile | default_timeout | stream_timeout | queue_timeout |
|---------|-----------------|----------------|---------------|
| :balanced | 300_000 (5m) | 900_000 (15m) | 10_000 (10s) |
| :production | 300_000 (5m) | 900_000 (15m) | 10_000 (10s) |
| :production_strict | 60_000 (60s) | 300_000 (5m) | 5_000 (5s) |
| :development | 900_000 (15m) | 3_600_000 (60m) | 60_000 (60s) |
| :ml_inference | 900_000 (15m) | 3_600_000 (60m) | 60_000 (60s) |
| :batch | 3_600_000 (60m) | :infinity | 300_000 (5m) |

**Margin formula** ensures inner timeouts fire before outer:
```
rpc_timeout = total_timeout - worker_call_margin_ms (1000) - pool_reply_margin_ms (200)
```
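
A worked example with the `:balanced` profile defaults and the default margins:

```elixir
total_timeout = 300_000          # default_timeout for :balanced (5 minutes)
worker_call_margin_ms = 1_000
pool_reply_margin_ms = 200

rpc_timeout = total_timeout - worker_call_margin_ms - pool_reply_margin_ms
# => 298_800 ms, so the inner gRPC deadline expires before the outer
#    GenServer.call timeout and the caller sees a structured error, not an exit.
```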

**Deadline propagation** tracks remaining budget:
1. Pool.execute stores `deadline_ms = now + timeout` in opts
2. Queue handler uses `effective_queue_timeout_ms/2` to respect deadline
3. Worker execution uses `derive_rpc_timeout_from_opts/2` to compute remaining budget
4. All GenServer.call timeouts are caught and returned as structured errors

#### Backward Compatibility

- All legacy config keys (`pool_request_timeout`, `grpc_command_timeout`, etc.) still work
- When explicitly set, they take precedence over profile-derived values
- When not set, they derive from the active profile
- Default profile is `:balanced` which provides similar values to previous defaults

## [0.8.7] - 2025-12-31

### Fixed
- **Python Any encoding performance** - Avoided extra UTF-8 decode/encode round-trips in `TypeSerializer`
  - JSON payloads now stay as bytes for `google.protobuf.Any.value`
  - Stabilizes orjson benchmark expectations on large payloads
- **Test isolation** - Prevented telemetry/logging state bleed across tests
  - OOM telemetry assertions now scoped by operation ID
  - Logging tests reset global logging disable state
- **Python integration test bootstrap** - Ensure `--include python_integration` reliably provisions deps
  - CLI tag detection now triggers bootstrap and real env doctor checks
  - Test helper validates `.venv` exists after bootstrap and skips redundant deps fetches
- **HealthMonitor cleanup** - Ignore benign shutdown races in test teardown
- **Ready file race condition on CI** - Fixed flaky gRPC server startup on slow/loaded systems
  - `read_ready_file/1` now returns `:not_ready` instead of error when file is empty
  - Polling loop continues retrying instead of failing immediately
  - Resolves `{:invalid_ready_file, ""}` errors on GitHub Actions runners
  - Python already uses atomic rename (`os.replace`), but edge cases on slow filesystems could still produce empty reads

## [0.8.6] - 2025-12-31

### Added
- **Session cleanup telemetry** - Emit telemetry events for session lifecycle monitoring
  - `[:snakepit, :bridge, :session, :pruned]` - Emitted when sessions expire via TTL
  - `[:snakepit, :bridge, :session, :accumulation_warning]` - Emitted when session count exceeds thresholds

- **Strict mode for session store** - New `strict_mode: true` option for dev/test environments
  - Logs loud warnings when session count exceeds 80% of `max_sessions`
  - Helps detect session leaks during development

- **BaseAdapter session context** - Added `session_id` property and `set_session_context()` to `BaseAdapter`
  - Ensures consistent session_id handling across all adapters
  - Backward compatible with existing adapter implementations

- **Session Scoping Guide** - New documentation at `guides/session-scoping-rules.md`
  - Explains session lifecycle, reference scoping, and recommended patterns
  - Documents telemetry events and strict mode configuration

## [0.8.5] - 2025-12-31

### Fixed
- **GRPCWorker graceful shutdown** - Eliminated spurious crash logs during application shutdown
  - Added `shutting_down` flag to distinguish expected exits from unexpected crashes
  - Handle supervisor EXIT signals (`:shutdown`, `{:shutdown, _}`) explicitly
  - Detect shutdown via mailbox peek and pool liveness checks to handle message race conditions
  - Shutdown exit codes (0, 137/SIGKILL, 143/SIGTERM) logged at debug level during shutdown
  - Non-zero exits only logged as errors when not in shutdown context

- **Configurable shutdown timeouts** - Graceful shutdown timeout now configurable via `:graceful_shutdown_timeout_ms`
  - Default increased from 2s to 6s to accommodate Python's async shutdown envelope
  - `child_spec` and `Worker.Starter` derive supervisor shutdown timeout from this config
  - New `Snakepit.GRPCWorker.supervisor_shutdown_timeout/0` for custom supervision trees
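
A sketch for custom supervision trees; only the `:shutdown` override uses the documented helper, and the child argument shape is illustrative:

```elixir
worker_opts = []  # illustrative; real worker options depend on your setup

Supervisor.child_spec(
  {Snakepit.GRPCWorker, worker_opts},
  shutdown: Snakepit.GRPCWorker.supervisor_shutdown_timeout()
)
```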

- **Python server shutdown** - Improved graceful termination sequence
  - Server stop grace period increased to 2 seconds
  - `wait_for_termination` now awaited with 3s timeout before force-cancel
  - Sequential shutdown: close servicer → stop server → await termination task

- **Python dependency version mismatch** - Updated `requirements.txt` to match generated protobuf/grpc stubs
  - `grpcio`: `>=1.60.0` → `>=1.76.0`
  - `protobuf`: `>=4.25.0` → `>=6.31.1`
  - Previously, users installing minimum versions would get runtime import errors

- **Proto README documentation drift** - Rewrote `priv/proto/README.md` to match actual implementation
  - Fixed service name: `SnakepitBridge` → `BridgeService`
  - Removed non-existent methods (GetVariable, SetVariable, WatchVariables, optimization APIs)
  - Documented only implemented RPC methods
  - Added `Any` encoding convention documentation
  - Clarified binary payload format (opaque bytes, not pickle/ETF specific)
  - Moved aspirational features to "Roadmap" section

- **Streaming backpressure** - Added bounded queue (maxsize=100) to `ExecuteStreamingTool`
  - Prevents unbounded memory growth when producer outpaces consumer
  - `drain_sync` now blocks on enqueue with proper exception handling

- **Streaming cancellation handling** - Producer now stops when client disconnects
  - Added cancellation event propagation to drain loops
  - Added disconnect watcher task that polls `context.is_active()`
  - Producer task explicitly cancelled on cleanup
  - Iterator/generator properly closed via `aclose()`/`close()`

- **Adapter lifecycle cleanup** - Added `cleanup()` calls to adapter lifecycle
  - `ExecuteTool`: Calls `adapter.cleanup()` in finally block (always runs)
  - `ExecuteStreamingTool`: Calls `adapter.cleanup()` in finally block
  - Uses `inspect.isawaitable()` pattern for robust sync/async handling
  - Added `_maybe_cleanup()` and `_close_iterator()` helper functions

- **Threaded server parity** - Applied all streaming/cleanup fixes to `grpc_server_threaded.py`
  - Bounded queue, cancellation handling, iterator closing, adapter cleanup

- **CancelledError handling** - Producer now properly re-raises `CancelledError`
  - Prevents task from blocking on `queue.put()` when consumer is gone
  - On cancellation, task terminates immediately without sentinel (consumer is already gone)

- **Sentinel delivery under backpressure** - Fixed potential hang when queue is full
  - Sentinel is now `await queue.put(sentinel)` (guaranteed delivery) on normal completion
  - Previous `put_nowait` could silently drop sentinel, causing consumer to hang forever

- **Sentinel delivery on disconnect** - Fixed hang when `watch_disconnect()` sets cancelled flag
  - `watch_disconnect()` now injects sentinel directly into queue when disconnect detected
  - Drops buffered chunks if needed to make room for sentinel (consumer is gone anyway)
  - Prevents hang when producer exits normally (not via CancelledError) with cancelled flag set

- **Binary parameters handling** - Fixed unconditional `pickle.loads` security issue
  - `binary_parameters` now treated as opaque bytes by default (per proto docs)
  - Pickle only used if `metadata["binary_format:<param>"] == "pickle"`
  - Enables safe handling of images, audio, and other binary data

- **Loadtest demo formatting** - Fixed a `format_number/1` crash on nil values and corrected spacing in the output

### Added
- **CI version guard** - New `scripts/check_stub_versions.py` validates that `requirements.txt` versions match generated protobuf/grpc stubs
  - Integrated into GitHub Actions CI workflow
  - Checks protobuf, grpcio, and grpcio-tools versions
  - Prevents "works for us, breaks for users" dependency issues

- **Streaming cancellation tests** - New tests for streaming cleanup behavior
  - `test_streaming_cleanup_called_on_normal_completion`
  - `test_streaming_producer_stops_on_client_disconnect`
  - `test_async_streaming_cleanup_called`
  - `test_streaming_completes_under_backpressure` - verifies sentinel delivery with >maxsize chunks

### Changed
- **Adapter lifecycle documentation** - Clarified per-request adapter lifecycle in `base_adapter.py`
  - Documented that adapters are instantiated per-request
  - Added example showing module-level caching pattern for expensive resources
  - Explained `initialize()`/`cleanup()` semantics

- **Streaming demo modernization** - Updated `execute_streaming_tool_demo.exs` to use standard bootstrap pattern

## [0.8.4] - 2025-12-30

### Added
- **ExecuteStreamingTool Implementation** - Full gRPC streaming support in BridgeServer
  - End-to-end streaming from clients through to Python workers
  - Automatic final chunk injection if worker doesn't send one
  - Execution time metadata on final chunks
  - Proper error handling for streaming failures

### Fixed
- **Timeout Parsing Bug** - Fixed precedence issue in `tool_call_options/1` that caused string timeout values to bypass parsing
- **Binary Parameter Encoding** - Fixed remote tool execution to properly handle binary parameters without attempting JSON encoding of tuples

## [0.8.3] - 2025-12-29

### Fixed
- **Hardware Detector Cache** - Replaced ETS cache creation with `:persistent_term` to eliminate race conditions and table ownership hazards under concurrent access.

### Removed
- **Deprecated/Unused APIs** - Removed `RetryPolicy.exponential_backoff/2`, `RetryPolicy.with_circuit_breaker/2`, `HeartbeatMonitor.get_status/1`, `RunID.valid?/1`, and deprecated `ProcessRegistry.register_worker/4`.

## [0.8.2] - 2025-12-29

### Added
- **Process-Level Log Isolation** - New `Snakepit.Logger` functions for per-process log level control (see the sketch after this list)
  - `set_process_level/1` - Set log level for current process only
  - `get_process_level/0` - Get effective log level for current process
  - `clear_process_level/0` - Clear process-level override
  - `with_level/2` - Execute function with temporary log level
- **Test Helper Module** - `Snakepit.Logger.TestHelper` for test isolation
  - `setup_log_isolation/0` - Set up per-test log level isolation
  - `capture_at_level/2` - Capture logs at specific level without affecting other tests
  - `capture_at_level_with_result/2` - Capture logs and return function result
  - `suppress_logs/1` - Suppress all logs for duration of function
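
A sketch of the per-process controls above, e.g. inside a test; the argument order of `with_level/2` and the return values are assumptions:

```elixir
Snakepit.Logger.with_level(:debug, fn ->
  # Verbose diagnostics only for this process, for the duration of the call.
  Snakepit.execute("ping", %{}, [])
end)

# Or pin an override for the current process and clear it afterwards:
Snakepit.Logger.set_process_level(:error)
Snakepit.Logger.get_process_level()
Snakepit.Logger.clear_process_level()
```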

### Fixed
- **Flaky Test Race Condition** - Tests that modify log levels no longer interfere with each other when running concurrently
  - Root cause: Multiple async tests concurrently modifying the global `:snakepit, :log_level` application environment caused race conditions
  - Solution: Logger now checks process-local override first, then Elixir Logger process level, then global config

### Changed
- Log level resolution now uses priority order:
  1. Process-level override (via `set_process_level/1`) - highest priority
  2. Elixir Logger process level (via `Logger.put_process_level/2`)
  3. Application config (via `config :snakepit, log_level: ...`) - lowest priority

## [0.8.1] - 2025-12-27

### Changed
- **BREAKING**: Default log level changed from `:warning` to `:error` for silent-by-default behavior
- Centralized all logging through `Snakepit.Logger` module
- Python logging now respects `SNAKEPIT_LOG_LEVEL` environment variable
- Replaced stdout `GRPC_READY` signaling with a non-console control channel
- Removed all hardcoded `IO.puts` and Python `print()` statements

### Added
- Category-based logging: `:lifecycle`, `:pool`, `:grpc`, `:bridge`, `:worker`, `:startup`, `:shutdown`, `:telemetry`, `:general`
- `config :snakepit, log_categories: [...]` to enable specific categories
- `priv/python/snakepit_bridge/logging_config.py` for centralized Python logging

### Fixed
- Noisy startup messages no longer pollute console output
- Health-check messages suppressed by default
- gRPC server startup messages suppressed by default

### Migration Guide
If you relied on seeing startup logs, add to your config:
```elixir
config :snakepit, log_level: :info
```

## [0.8.0] - 2025-12-27

### Added

#### Hardware Abstraction Layer
- **Hardware Detection** - New `Snakepit.Hardware` module providing automatic detection of CPU, NVIDIA CUDA, Apple MPS, and AMD ROCm accelerators.
- **Hardware Detector** - `Snakepit.Hardware.Detector` with unified detection API and caching.
- **CPU Detection** - `Snakepit.Hardware.CPUDetector` with cores, threads, model, and feature detection (AVX, AVX2, SSE4.2).
- **CUDA Detection** - `Snakepit.Hardware.CUDADetector` for NVIDIA GPUs via nvidia-smi with version, driver, and memory info.
- **MPS Detection** - `Snakepit.Hardware.MPSDetector` for Apple Metal Performance Shaders on macOS.
- **ROCm Detection** - `Snakepit.Hardware.ROCmDetector` for AMD GPUs via rocm-smi.
- **Device Selection** - `Snakepit.Hardware.Selector` with automatic selection and fallback strategies.

#### Enhanced ML Telemetry
- **Telemetry Events** - `Snakepit.Telemetry.Events` defining ML-specific telemetry events for hardware, errors, circuit breaker, and GPU profiling.
- **Logger Handler** - `Snakepit.Telemetry.Handlers.Logger` for automatic logging of all ML telemetry events.
- **Metrics Handler** - `Snakepit.Telemetry.Handlers.Metrics` with Prometheus-compatible metric definitions.
- **GPU Profiler** - `Snakepit.Telemetry.GPUProfiler` GenServer for periodic GPU memory, utilization, temperature, and power sampling.
- **Span Helper** - `Snakepit.Telemetry.Span` for convenient timing of operations with automatic start/stop telemetry.

#### Structured Exception Protocol
- **Shape Errors** - `Snakepit.Error.Shape` with `ShapeMismatch` and `DTypeMismatch` exceptions with dimension detection.
- **Device Errors** - `Snakepit.Error.Device` with `DeviceMismatch` and `OutOfMemory` exceptions with recovery suggestions.
- **Error Parser** - `Snakepit.Error.Parser` for automatic parsing of Python errors with pattern detection for shape, device, and OOM errors.

#### Crash Barrier Supervision
- **Circuit Breaker** - `Snakepit.CircuitBreaker` GenServer with closed/open/half-open states for fault tolerance.
- **Health Monitor** - `Snakepit.HealthMonitor` for tracking crash patterns with rolling windows and health status.
- **Retry Policy** - `Snakepit.RetryPolicy` with configurable exponential backoff, jitter, and retriable error filtering.
- **Executor** - `Snakepit.Executor` with `execute_with_retry/2`, `execute_with_timeout/2`, `execute_with_circuit_breaker/3`, and batch execution.

#### Documentation
- New guide: `guides/hardware-detection.md` - Hardware detection usage and device selection.
- New guide: `guides/crash-recovery.md` - Circuit breaker, health monitoring, and retry patterns.
- New guide: `guides/error-handling.md` - ML-specific error types and parsing.
- New guide: `guides/ml-telemetry.md` - ML telemetry events, GPU profiling, and metrics.

### Changed
- **ExDoc Configuration** - Added new module groups for Hardware, Reliability, ML Errors, and enhanced Telemetry.
- **Telemetry Module Groups** - Expanded to include Events, GPUProfiler, Span, and Handlers submodules.

## [0.7.7] - 2025-12-26

### Changed
- Pool GenServer initialization redesigned for OTP compliance. Worker startup now uses an async `spawn_link` pattern instead of blocking `receive` in `handle_continue`, keeping the GenServer responsive to shutdown signals during batch initialization.
- Multi-pool configuration now correctly isolates `pool_size` per pool. Each pool in `:pools` config uses its own `pool_size` value; the global `pool_config[:pool_size]` is only used in legacy single-pool mode.
- Test harness improvements: `after_suite` now monitors the supervisor and waits for actual termination before returning, preventing orphaned process warnings between test runs.
- ProcessRegistry defers unregistration when external OS processes are still alive, with automatic retry cleanup after process termination.

### Fixed
- Pool no longer crashes during application shutdown when WorkerSupervisor terminates before batch initialization completes. Added supervisor health checks before starting each worker batch.
- ProcessKiller `process_alive?/1` on Linux now detects zombie processes by reading `/proc/{pid}/stat` state, preventing false positives for terminated-but-not-reaped processes.
- Test configuration pollution fixed: tests that modify `:pools` config now properly save and restore `:pool_config` to prevent pool_size leakage between tests.

### Added
- `README_TESTING.md` updated with test isolation patterns, application lifecycle documentation, and multi-pool configuration examples for integration tests.
- `REMEDIATION_PLAN.md` documenting the root cause analysis and fixes for test harness race conditions.

## [0.7.6] - 2025-12-26

### Added
- Deterministic shutdown cleanup via `Snakepit.RuntimeCleanup` and manual cleanup via `Snakepit.cleanup/0`, with cleanup telemetry events.
- Process group lifecycle support with `process_group_kill`, pgid tracking in `ProcessRegistry`, and new `ProcessKiller` helpers for group kill/pgid lookup.
- Python gRPC servers can create their own process group when `SNAKEPIT_PROCESS_GROUP` is set.
- Python package management supports isolated virtualenvs via `:python_packages` `env_dir`, auto-creating venvs and honoring command timeouts.
- Documentation suites for FFI ergonomics, Python process cleanup, and runtime hygiene (docs/20251226/*).
- New tests for runtime cleanup, logger defaults, process group kill, process registry cleanup deferrals, and uv venv integration.

### Changed
- Quiet-by-default library config: `library_mode: true`, `log_level: :warning`, `grpc_log_level: :error`, `log_python_output: false`, plus new cleanup defaults (`cleanup_on_stop`, `cleanup_on_stop_timeout_ms`, `cleanup_poll_interval_ms`, `cleanup_retry_interval_ms`, `cleanup_max_retries`).
- Application supervision always starts `Snakepit.Pool.ProcessRegistry` and `Snakepit.Pool.ApplicationCleanup` even without pooling; `Application.stop/1` now runs a cleanup pass when enabled.
- gRPC worker startup/shutdown now tracks pgid/process_group, can kill process groups, buffers startup output, suppresses Python stdout unless enabled, and passes `SNAKEPIT_PROCESS_GROUP` while extending `PYTHONPATH` with SnakeBridge priv Python.
- `Snakepit.EnvDoctor` now locates `grpc_server.py` from the project or installed app root and expands `PYTHONPATH` to include Snakepit/SnakeBridge priv Python when running checks.
- Python runtime selection now prefers explicit overrides, then `:python_packages` venv Python, then managed/system fallback; package operations resolve Python from the configured venv.
- Cleanup retry timing for worker supervisor is now read from runtime config with `_ms` suffix.
- Version references updated to 0.7.6 in `mix.exs` and README dependency docs. Updated `supertester` to `v0.4.0`.

### Fixed
- Taint registry ETS initialization now tolerates a pre-existing table.
- Process registry cleanup no longer drops entries while external OS processes remain alive, and DETS is synced on cleanup/unregister.
- Startup failure diagnostics now include buffered Python output to aid gRPC server troubleshooting.

## [0.7.5] - 2025-12-25

### Added
- `Snakepit.PythonPackages` module for uv/pip package management.
- `Snakepit.PackageError` structured error type for package operations.
- `:python_packages` application config for installer, timeout, and env settings.
- `Snakepit.PythonPackages.ensure!/2` for provisioning required packages.
- `Snakepit.PythonPackages.check_installed/2` for verifying package presence.
- `Snakepit.PythonPackages.lock_metadata/2` for lockfile package metadata.
- `Snakepit.PythonPackages.install!/2` for direct requirement installs.

## [0.7.4] - 2025-12-25

### Added
- **Zero-copy interop** – `Snakepit.ZeroCopy` + `Snakepit.ZeroCopyRef` handle DLPack/Arrow exports/imports with explicit `close/1` and telemetry for export/import/fallback flows.
- **Crash barrier** – Worker crash classification, taint tracking, and idempotent retry policy with new crash/taint/restart telemetry events.
- **Hermetic Python runtime support** – uv-managed interpreter selection, bootstrap integration, and runtime identity metadata propagation.
- **Exception translation** – Structured Python error payloads mapped into `Snakepit.Error.*` exception structs with telemetry for mapped/unmapped translations.
- **Runtime contract coverage** – Integration test coverage for `kwargs`, `call_type`, and payload version fields.

### Changed
- **gRPC bridge error payloads** – Python gRPC servers now return JSON-structured error payloads for tooling failures.
- **Telemetry catalog** – Added runtime event listings for zero-copy, crash barrier, and exception translation.

### Fixed
- **Queue resiliency** – Tainted workers no longer drive queued requests; queue dispatch selects non-tainted workers when available.

## [0.7.3] - 2025-12-25

### Fixed
- **CI test infrastructure** – Fixed `python_integration` test failures in CI by starting `GRPC.Client.Supervisor` in `PythonIntegrationCase` setup and enabling pooling in `StreamingRegressionTest` setup.
- **EnvDoctor port check race condition** – Fixed intermittent `env_doctor_test` failures caused by `:grpc_port` check reading from global Application env instead of opts. The check now accepts `grpc_port` via opts (consistent with other state values), eliminating conflicts when tests or the application bind to overlapping port ranges.

## [0.7.2] - 2025-12-25

### Changed
- **Codebase cleanup** – Removed dead code, unused modules, and obsolete files across the Elixir and Python codebases.
- **Static analysis compliance** – Resolved Dialyzer warnings and Credo issues for cleaner, more maintainable code.
- **Documentation overhaul** – Rewrote README.md and ARCHITECTURE.md for v0.7.2; consolidated DIAGS.md and DIAGS2.md into a single DIAGRAMS.md with mermaid diagrams; updated all README_* guides with version markers; removed obsolete test_bidirectional.py and remaining_handlers.txt.

## [0.7.1] - 2025-12-24

### Added
- **Script ergonomics** – `Snakepit.run_as_script/2` now supports `restart`, `await_pool`, and `halt` options plus configurable shutdown/cleanup timeouts.
- **Example runner controls** – `examples/run_all.sh` honors `SNAKEPIT_EXAMPLE_DURATION_MS` and `SNAKEPIT_RUN_TIMEOUT_MS`.
- **Examples bootstrap helper** – `Snakepit.Examples.Bootstrap.run_example/2` centralizes pool readiness and script exit behavior.

### Changed
- **Pooling defaults to opt-in** – `pooling_enabled` now defaults to `false` to avoid auto-start surprises in scripts.
- **Examples cleanup** – bidirectional and documentation-only examples now shut down cleanly under both `mix run` and `run_all.sh`.

### Fixed
- **Mix-run config drift** – examples now restart Snakepit to apply script-level env overrides, preventing port mismatches and orphaned workers.

## [0.7.0] - 2025-12-22

### Added
- **Capacity-aware scheduling** – Pool tracks per-worker load and `threads_per_worker`, with `capacity_strategy` (`:pool` default, `:profile`, `:hybrid`) configurable globally or per pool.
- **Request metadata exposure** – Python SessionContext now carries `request_metadata` for adapters; `grpc_server.py` wraps ExecuteTool/ExecuteStreamingTool in telemetry spans.
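
A hedged configuration sketch; the exact pool-config shape may differ, but `:capacity_strategy` and `:threads_per_worker` are the knobs introduced above:

```elixir
import Config

config :snakepit,
  capacity_strategy: :hybrid,
  pool_config: %{
    pool_size: 4,
    threads_per_worker: 8
  }
```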

### Changed
- **Correlation propagation** – gRPC calls now set `x-snakepit-correlation-id` headers and `ExecuteToolRequest.metadata` on execute + streaming paths; streaming calls ensure a correlation ID exists.
- **Process profile env merge** – Worker env defaults merge system thread limits with user overrides instead of replacing them.

### Fixed
- **ToolRegistry cleanup logging** – Cleanup logs now report the correct count of removed tools.

## [0.6.11] - 2025-12-20

### Added
- **Pool status CLI** – `mix snakepit.status` reports pool size, queue depth, and error counts without requiring a full dashboard stack.
- **Adapter generator** – `mix snakepit.gen.adapter` scaffolds a minimal Python adapter under `priv/python` with a ready-to-copy `adapter_args` snippet.
- **Binary gRPC results** – Bridge responses now include `binary_result` support so tools can return `{:binary, payload[, metadata]}` tuples for large outputs.
- **Examples runner** – `examples/run_all.sh` executes every example (including showcase/loadtest) via `mix run`, with auto-stop and configurable loadtest sizes.

### Changed
- **Doctor checks** – `Snakepit.EnvDoctor` validates the Elixir `grpc_port` and runs per-pool adapter import health checks via `grpc_server.py --health-check --adapter ...`.
- **Bootstrap consolidation** – scripts/docs/examples now standardize on `mix snakepit.setup` + `mix snakepit.doctor`, and examples prefer `mix run` with the shared bootstrap helper.
- **Python env defaults** – gRPC workers merge default `PYTHONPATH` and `SNAKEPIT_PYTHON` into adapter environments to keep imports predictable.
- **Docs organization** – legacy unified-bridge and unified-example design docs are archived, and install guidance now differentiates repo bootstrap from app usage.

### Fixed
- **Threaded server loop** – `grpc_server_threaded.py` now ensures a running asyncio event loop to avoid deprecation warnings.
- **Worker spawn telemetry** – gRPC worker spawn/terminate durations now use consistent monotonic units, preventing negative duration values in telemetry handlers.
- **Elixir tool decoding in Python** – `SessionContext.call_elixir_tool/2` decodes JSON/binary payloads via `TypeSerializer` instead of returning raw protobuf Any values.
- **Python ML workflow serialization** – showcase ML handlers coerce NumPy-derived stats into JSON-safe floats to avoid `orjson` errors.
- **Tool registration noise** – Python bridge caches tool registration per session and treats duplicate registrations as info, avoiding false error reports.

## [0.6.10] - 2025-11-13

### Added
- **Canonical worker metadata** – `Snakepit.Pool.Registry.metadata_keys/0` exposes the authoritative metadata keys (`:worker_module`, `:pool_name`, `:pool_identifier`, `:adapter_module`) and the surrounding docs call out how pool helpers, diagnostics, and worker profiles should treat that map as the single source of truth.
- **Telemetry catalog + filters** – `Snakepit.Telemetry.Naming.python_event_catalog/0` now documents the full event/measurement schema emitted by `snakepit_bridge`, while the Python telemetry stream implements glob-style allow/deny filters pushed from Elixir so noisy adapters can be muted without redeploying workers.
- **Async adapter registration** – `snakepit_bridge.base_adapter.BaseAdapter` adds `register_with_session_async/2` (plus regression coverage) so asyncio/aio stubs can advertise tool surfaces without blocking while the synchronous helper stays intact for classic stubs.
- **Self-managing Python tests** – `test_python.sh` now creates/updates `.venv`, fingerprints `priv/python/requirements.txt`, installs deps, regenerates protobuf stubs, and exports quiet OTEL defaults so `./test_python.sh` is a one-command pytest runner on any Linux/WSL host.

### Changed
- **Queue timeout enforcement** – Queued requests now carry their timer reference, the pool cancels those timers as soon as the request is dequeued or dropped, and statistics/logging happen in one place, preventing runaway timers when pools churn.
- **Threaded adapter guardrails** – `priv/python/grpc_server_threaded.py` refuses to boot adapters that don’t set `__thread_safe__ = True`, logging a clear remediation path and forcing unsafe adapters back to process mode.
- **Tool registration resilience** – `snakepit_bridge.base_adapter.BaseAdapter` wraps gRPC stub responses in `_coerce_stub_response/1`, unwrapping awaitables, `UnaryUnaryCall` structs, or lazy callables before checking `response.success`, which stabilizes adapters that mix sync and async gRPC stubs.
- **Heartbeat/schema documentation** – `Snakepit.Config` now ships typedocs for the normalized pool/heartbeat map shared with Python, and the architecture plus gRPC guides emphasize that BEAM is the authoritative heartbeat monitor with `SNAKEPIT_HEARTBEAT_CONFIG` kept in sync across languages.

### Fixed
- **Stale queue timeouts** – Queue timeout messages that arrive after a request has already been serviced are ignored, and clients now receive `{:error, :queue_timeout}` exactly once when their request is actually dropped.

## [0.6.9] - 2025-11-13

### Added
- **Registry helpers**: Introduced `Snakepit.Pool.Registry.fetch_worker/1` plus metadata helpers used throughout the pool, bridge server, worker profiles, and diagnostics so `worker_module`, `pool_identifier`, and `pool_name` are always looked up in a single, tested place.
- **Binary parameter validation**: `Snakepit.GRPC.BridgeServer` now rejects non-binary entries in `ExecuteToolRequest.binary_parameters`, guaranteeing local tools only ever see `{:binary, payload}` tuples while remote workers still receive the untouched proto map.
- **Slow-test workflow**: Tagged the long-running suites with `@tag :slow`, defaulted `mix test` to skip them, and documented the opt-in commands plus the 2025-11-13 slow-test inventory in `README_TESTING` and `docs/20251113/slow-test-report.md`.
- **Lifecycle observability**: Memory-based recycling now logs a warning whenever a worker cannot answer the `:get_memory_usage` probe, preventing silent configuration drift.
- **Rogue cleanup controls**: Operators can configure the exact script names and run-id markers that qualify Python processes for startup cleanup, with defaults matching `grpc_server.py`/`grpc_server_threaded.py`.
- **Memory recycle telemetry & diagnostics**: `[:snakepit, :worker, :recycled]` now emits `memory_mb`/`memory_threshold_mb`, Prometheus metrics expose `snakepit.worker.recycled` counters, and both `Snakepit.Diagnostics.ProfileInspector` plus `mix snakepit.profile_inspector` show per-pool “Memory Recycles” totals for operators.
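
For the recycle event above, a minimal handler sketch; the handler id and log line are illustrative, while the measurement and metadata keys are the documented ones:

```elixir
:telemetry.attach(
  "log-worker-recycles",
  [:snakepit, :worker, :recycled],
  fn _event, measurements, metadata, _config ->
    IO.puts(
      "worker #{metadata[:worker_id]} recycled (#{inspect(metadata[:reason])}): " <>
        "memory #{inspect(measurements[:memory_mb])} MB vs threshold " <>
        "#{inspect(measurements[:memory_threshold_mb])} MB"
    )
  end,
  nil
)
```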

### Changed
- **GRPC worker lookups**: GRPCWorker, ToolRegistry clients, pool helpers, and worker profiles call the new Registry helpers instead of `Registry.lookup/2`, ensuring metadata stays normalized and reverse lookups never crash when metadata is missing.
- **Bridge test coverage**: Added binary-parameter regression tests that prove malformed payloads are rejected before reaching Elixir tools, plus lifecycle tests that simulate failing memory probes.
- **Process killer tests**: Rogue cleanup unit tests now cover the customizable scripts/markers path so changes to the configuration surface immediately.
- **Heartbeat contract clarity**: Documented what `dependent: true|false` means, exported `SNAKEPIT_HEARTBEAT_CONFIG` expectations, and added both HeartbeatMonitor- and GRPCWorker-level regression tests so fail-fast vs independent behavior stays well defined.
- **Telemetry stream shutdown noise**: gRPC telemetry stream shutdowns that report `:normal` or `:shutdown` now log at debug level, eliminating the warning spam that buried actionable failures during slow-test runs.

### Fixed
- **Registry metadata race**: `Pool.Registry.put_metadata/2` now reports `{:error, :not_registered}` when clients attempt to attach metadata before the worker is registered and downgrades those expected attempts to debug logs, eliminating silent successes that previously returned `:ok`.
- **Heartbeat metrics stability**: The `snakepit.worker.memory_mb` summary now pulls values via `Map.get/2` and non-dependent monitors retain timeout/missed-heartbeat counters, so Telemetry/Prometheus exporters stop crashing when measurements arrive as maps and status checks reflect the real failure budget.
- **Docs parity**: README, README_GRPC, README_PROCESS_MANAGEMENT, and ARCHITECTURE now describe the binary parameter contract, registry helper usage, lifecycle behavior, and rogue cleanup assumptions introduced in this release.

## [0.6.8] - 2025-11-12

This release also rolls up the previously undocumented fail-fast docs/tests work from 074f2260f703d16ccfecf937c10af905165419f0 (heartbeat fail-fast suites, orphan cleanup stress tests, queue probe adapter, and config fail-fast coverage).

### Added
- **Bootstrap automation**: Introduced `Snakepit.Bootstrap`, `mix snakepit.setup`, and a `make bootstrap` target to install Mix deps, provision `.venv`/`.venv-py313`, install Python requirements, run `scripts/setup_test_pythons.sh`, and regenerate gRPC stubs with fully instrumented logging.
- **Environment doctor**: New `Snakepit.EnvDoctor` module plus `mix snakepit.doctor` task verify interpreter availability, `grpc` import, `.venv`/`.venv-py313`, `priv/python/grpc_server.py --health-check`, and worker port availability with actionable remediation messages.
- **Runtime guardrails**: `Snakepit.Application` now invokes `Snakepit.EnvDoctor.ensure_python!/0` before pools start, failing fast when Python prerequisites are missing. Test helpers (`test/support/fake_doctor.ex`, `test/support/bootstrap_runner.ex`, `test/support/command_runner.ex`) enable deterministic unit coverage for the bootstrap/doctor path.
- **Python-aware CI**: GitHub Actions workflow now runs bootstrap, doctor, the default suite, and `mix test --only python_integration` so bridge coverage is validated when the doctor passes.
- **New documentation**: README + README_TESTING describe the `make bootstrap → mix snakepit.doctor → mix test` workflow, explain how to run python integration tests, and highlight the new Mix tasks.
- **Lifecycle config & memory recycling**: Added `%Snakepit.Worker.LifecycleConfig{}` to capture adapter/profile/env data for every worker, wired `Snakepit.GRPCWorker` to answer `:get_memory_usage`, and extended lifecycle tests so TTL/request/memory recycling use the same canonical config.
- **Binary tool parameters**: `Snakepit.GRPC.BridgeServer`, `Snakepit.GRPC.Client`, and `Snakepit.GRPC.ClientImpl` now decode/forward `ExecuteToolRequest.binary_parameters`, exposing binaries to local tools as `{:binary, payload}` while sending the untouched map to Python workers. README.md and README_GRPC.md document the contract.
- **Worker-flow integration test**: New `Snakepit.Pool.WorkerFlowIntegrationTest` exercises the WorkerSupervisor → MockGRPCWorker path, ensuring registry/process tracking stays consistent after execution and crash/restart flows.
- **Randomized worker stress test**: `Snakepit.Pool.RandomWorkerFlowTest` throws randomized execute/kill sequences at pools to ensure Registry ↔ ProcessRegistry invariants hold under churn.

### Changed
- **Test gating**: Default `mix test` excludes `:python_integration` while Python-heavy suites (thread profile, session affinity, streaming regression, etc.) carry the tag; `test/unit/exunit_configuration_test.exs` locks the config in place.
- **Thread-profile test harness**: `Snakepit.ThreadProfilePython313Test` now uses `Snakepit.Test.PythonEnv.skip_unless_python_313/1` to skip cleanly when `.venv-py313` is unavailable.
- **Process killer regression**: Ports spawned during `kill_by_run_id/1` tests close via `safe_close_port/1`, eliminating `:port_close` race exceptions.
- **Queue saturation regression**: `Snakepit.Pool.QueueSaturationRuntimeTest` focuses on stats + agent tracking instead of brittle global ETS assertions, removing a common source of flaky failures.
- **gRPC generation script**: `priv/python/generate_grpc.sh` now prefers `.venv/bin/python3`, falling back to system `python3/python` only when the virtualenv is missing, and emits helpful logs when no interpreter is found.
- **Registry metadata semantics**: `Snakepit.GRPCWorker` now writes canonical metadata (`worker_module`, `pool_name`, `pool_identifier`) via `Snakepit.Pool.Registry.put_metadata/2`, unblocking pool-name extraction and worker-module discovery without parsing IDs. Tests cover PID→worker lookups.
- **LifecycleManager internals**: Tracking records store lifecycle structs instead of ad-hoc maps so replacement workers inherit adapter args/env, and memory thresholds now exercise the worker call path in tests.
- **Process cleanup safety**: Rogue process cleanup only targets commands containing `grpc_server.py`/`grpc_server_threaded.py` with `--snakepit-run-id/--run-id` flags, and operators can disable the sweep with `config :snakepit, :rogue_cleanup, enabled: false`. Docs explain the ownership contract.
- **Pool integration coverage**: Replaced the unstable `test/snakepit/pool/high_risk_flow_test.exs` harness with targeted unit-level integration coverage (WorkerSupervisor + MockGRPCWorker), keeping the suite reliable while still covering the critical registry/ProcessRegistry chain.
- **Worker profile metadata lookup**: Process/thread profiles now resolve worker modules via `Pool.Registry.get_worker_id_by_pid/1` + metadata lookup, so non-GRPC workers can be supported and Dialyzer warnings are gone.

### Fixed
- Shell instrumentation around bootstrap (reporting command start/finish and verbose pip output) prevents "silent hangs" and surfaces the root causes of the provisioning confusion seen previously.
- `scripts/setup_test_pythons.sh` now runs under `set -x`, streaming its progress during bootstrap.
- Rogue cleanup tests verify we no longer kill unrelated Python processes, and docs call out the run-id requirements so multi-tenant hosts stay safe.

## [0.6.7] - 2025-10-28

### Added

#### Phase 1: Type System MVP + Performance
- **6x JSON performance boost**: Integrated `orjson` for Python serialization, delivering 4-6x speedup for raw JSON operations and 1.5x improvement for large payloads (`priv/python/snakepit_bridge/serialization.py`, `priv/python/tests/test_orjson_integration.py`).
- **Structured error type**: New `Snakepit.Error` struct provides detailed context for debugging with fields including `category`, `message`, `details`, `python_traceback`, and `grpc_status` (`lib/snakepit/error.ex`, `test/unit/error_test.exs`); a usage sketch follows this list.
- **Complete type specifications**: All public API functions in `Snakepit` module now have `@spec` annotations with structured error return types for better IDE support and Dialyzer analysis.
- **Performance benchmarks**: Comprehensive benchmark suite validates 4-6x raw JSON speedup and verifies no regression on small payloads (`priv/python/tests/test_orjson_integration.py`).
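
A short sketch of matching on the new struct; the command and handling are illustrative, and the field names are the ones listed above:

```elixir
case Snakepit.execute("predict", %{"input" => "hello"}) do
  {:ok, result} ->
    result

  {:error, %Snakepit.Error{category: :timeout, message: message}} ->
    {:retry, message}

  {:error, %Snakepit.Error{python_traceback: traceback} = error} when is_binary(traceback) ->
    IO.puts("Python failure (#{error.category}): #{error.message}\n#{traceback}")
    :error
end
```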

#### Phase 2: Distributed Telemetry System
- **Bidirectional telemetry streaming**: Python workers can now emit telemetry events via gRPC that are re-emitted as Elixir `:telemetry` events for unified observability (`lib/snakepit/telemetry/grpc_stream.ex`, `priv/python/snakepit_bridge/telemetry/`).
- **Complete event catalog**: 43 telemetry events across 3 layers (Infrastructure, Python Execution, gRPC Bridge) with atom-safe event names to prevent atom table exhaustion (`lib/snakepit/telemetry/naming.ex`, `docs/20251028/telemetry/01_EVENT_CATALOG.md`).
- **Python telemetry API**: High-level Python API with `telemetry.emit()` for events and `telemetry.span()` for automatic timing, plus correlation ID propagation across the Elixir/Python boundary (`priv/python/snakepit_bridge/telemetry/__init__.py`).
- **Runtime telemetry control**: Adjust sampling rates, enable/disable telemetry, and filter events for individual workers without restarts (`lib/snakepit/telemetry/control.ex`).
- **Metadata safety**: Automatic sanitization of Python metadata to prevent atom table exhaustion from untrusted string keys (`lib/snakepit/telemetry/safe_metadata.ex`).
- **Multiple backend support**: Python telemetry supports gRPC streaming (default) and stderr backends, with extensible backend architecture (`priv/python/snakepit_bridge/telemetry/backends/`).
- **Worker lifecycle hooks**: Automatic telemetry stream registration/unregistration integrated into worker lifecycle (`lib/snakepit/grpc_worker.ex:479`, `lib/snakepit/grpc_worker.ex:783`).
- **Integration tests**: Comprehensive test suite covering event catalog, validation, sanitization, and control messages (`test/integration/telemetry_flow_test.exs`).

### Changed
- Python serialization now uses `orjson` with graceful fallback to stdlib `json` if orjson is unavailable, maintaining full backward compatibility.
- Error returns in `Snakepit.Pool` and `Snakepit` modules now use structured `Snakepit.Error` types with detailed context instead of atoms.
- `Snakepit.Pool.await_ready/2` now returns `{:error, %Snakepit.Error{category: :timeout}}` instead of `{:error, :timeout}`.
- Streaming validation errors now include adapter context in error details.
- Old `telemetry.span()` (OpenTelemetry) renamed to `telemetry.otel_span()` to avoid a naming conflict with the new telemetry streaming span.
- `Snakepit.Application` supervision tree now includes `Snakepit.Telemetry.GrpcStream` for managing bidirectional telemetry streams.

### Fixed
- Updated Dialyzer type specifications to match new structured error returns, reducing type warnings.
- Corrected `grpc_worker.ex` metadata fields for telemetry events (`state.stats.start_time`, `state.stats.requests`).

### Documentation
- **New `TELEMETRY.md`**: Complete user guide for the distributed telemetry system with usage examples, integration patterns for Prometheus/StatsD/OpenTelemetry, and troubleshooting guidance (320 lines).
- **Telemetry design docs**: 9 comprehensive design documents covering architecture, event catalog, Python integration, client guide, gRPC implementation, and backend architecture (`docs/20251028/telemetry/`).
- **New examples**: 5 comprehensive examples demonstrating v0.6.7 features with ~50KB of production-ready code:
  - `examples/telemetry_basic.exs` - Introduction to telemetry handlers and Python telemetry API
  - `examples/telemetry_advanced.exs` - Correlation tracking, performance monitoring, runtime control
  - `examples/telemetry_monitoring.exs` - Production monitoring patterns with real-time dashboard
  - `examples/telemetry_metrics_integration.exs` - Prometheus/StatsD integration patterns
  - `examples/structured_errors.exs` - New `Snakepit.Error` struct usage and pattern matching
- **Updated `examples/README.md`**: Comprehensive guide to all examples with clear learning paths and troubleshooting.
- Updated README.md with v0.6.7 release notes highlighting type system improvements, performance gains, and telemetry system.
- Updated mix.exs version to 0.6.7 with `TELEMETRY.md` in package files and docs extras.
- Added comprehensive test coverage for structured error types (12 new tests in `test/unit/error_test.exs`).

### Performance
- **Telemetry overhead**: <10μs per event, <1% CPU impact at 100% sampling, <0.1% CPU at 10% sampling.
- **Bounded resources**: Python telemetry queue limited to 1024 events (~100KB), with graceful degradation (drops events vs blocking).
- **Zero regression**: All 235+ existing tests pass with full backward compatibility maintained.

**Zero breaking changes**: All existing code continues to work. Telemetry is fully opt-in via standard `:telemetry.attach()` patterns.

## [0.6.6] - 2025-10-27

### Added
- Configurable session/program quotas now surface tagged errors when limits are exceeded, with regression coverage in `test/unit/bridge/session_store_test.exs`.
- Introduced a logger redaction helper so adapters and bridge code can log sensitive inputs safely (`test/unit/logger/redaction_test.exs`).

### Changed
- `Snakepit.GRPC.BridgeServer` reuses worker-owned gRPC channels and only dials a disposable connection when the worker has not yet published one; fallbacks are closed after each invocation.
- gRPC streaming helpers document and enforce the JSON-plus-metadata chunk envelope, clarifying `_metadata` and `raw_data_base64` handling.
- Worker startup handshake waits for the negotiated gRPC port before publishing worker metadata, eliminating transient routing failures during boot.
- `Snakepit.GRPC.ClientImpl` now returns structured `{:error, {:invalid_parameter, :json_encode_failed, message}}` tuples when parameters cannot be JSON-encoded, preventing calling processes from crashing (`test/unit/grpc/client_impl_test.exs`).
- `Snakepit.GRPC.BridgeServer.execute_streaming_tool/2` raises `UNIMPLEMENTED` with remediation guidance so callers can fall back gracefully when streaming is disabled (`test/snakepit/grpc/bridge_server_test.exs`).

### Fixed
- `Snakepit.GRPCWorker` persists the OS-assigned port discovered during startup so BridgeServer never receives `0` when routing requests (`test/unit/grpc/grpc_worker_ephemeral_port_test.exs`).
- Parameter decoding now rejects malformed protobuf payloads with descriptive `{:invalid_parameter, key, reason}` errors, preventing unexpected crashes (`test/snakepit/grpc/bridge_server_test.exs`).
- Process registry ETS tables are `:protected` and DETS handles remain private, guarding against external mutation attempts (`test/unit/pool/process_registry_security_test.exs`).
- Pool name inference prefers registry metadata and logs once when falling back to worker-id parsing, eliminating silent misroutes (`test/unit/pool/pool_registry_lookup_test.exs`).

### Documentation
- Refreshed README, gRPC guides (including the streaming and quick reference docs), and testing notes to cover port persistence, channel reuse, quota enforcement, DETS/ETS protections, streaming payload envelopes and fallbacks, metadata-driven pool routing, logging redaction guardrails, and the expanded regression suite.

## [0.6.5] - 2025-10-26

### Added
- Regression suites covering worker supervisor stop/restart flows and profile-level shutdown helpers (`test/unit/pool/worker_supervisor_test.exs`, `test/unit/worker_profile/worker_profile_stop_worker_test.exs`).

### Changed
- `Snakepit.Application` now reads the current environment from compile-time configuration instead of calling `Mix.env/0`, keeping OTP releases Mix-free.
- Introduced `Snakepit.PythonThreadLimits.resolve/1` to merge partial thread-limit overrides with defaults before applying environment variables.

### Fixed
- `Snakepit.Pool.WorkerSupervisor.stop_worker/1` targets worker starter supervisors and accepts either worker ids or pids, ensuring restarts actually decommission the old worker.
- `Snakepit.WorkerProfile.Process` and `Snakepit.WorkerProfile.Thread` resolve worker ids through the pool registry so lifecycle manager shutdowns succeed for pid handles.

## [0.6.4] - 2025-10-30

### Added
- Streaming regression guard in `test/snakepit/streaming_regression_test.exs` covering both success and adapter capability failures
- `examples/stream_progress_demo.exs` showcasing five timed streaming updates with rich progress output
- `test_python.sh` helper that regenerates protobuf stubs, activates the project virtualenv, wires `PYTHONPATH`, and forwards arguments to `pytest`

### Changed
- Python gRPC servers now bridge streaming iterators through an `asyncio.Queue`, yielding chunks as soon as they are produced and removing ad-hoc log files
- `Snakepit.Adapters.GRPCPython` consumes streaming chunks incrementally, decoding JSON payloads, surfacing metadata, and safeguarding callback failures
- Showcase `stream_progress` tool accepts `delay_ms` and reports elapsed timing so demos and diagnostics show meaningful pacing

### Fixed
- Eliminated burst delivery of streaming responses by ensuring each chunk is forwarded to Elixir immediately, restoring real-time feedback for `execute_stream/4`

---

## [0.6.3] - 2025-10-19

### Added
- **Dependent/Independent Heartbeat Mode** - New `dependent` configuration flag allows workers to optionally continue running when Elixir heartbeats fail, enabling debugging scenarios where Python workers should remain alive (sketched after this list)
- Environment variable-based heartbeat configuration via `SNAKEPIT_HEARTBEAT_CONFIG` for passing settings from Elixir to Python workers
- Python unit test coverage for dependent heartbeat termination behavior (`priv/python/tests/test_heartbeat_client.py`)
- CLI flags `--heartbeat-dependent` and `--heartbeat-independent` for Python gRPC server configuration
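
A hedged sketch of the toggle; only the `enabled` and `dependent` flags come from these entries, and the `:heartbeat` key path is an assumption:

```elixir
# Assumed key path. `dependent: false` lets Python workers keep running when
# Elixir heartbeats fail (useful for debugging); `dependent: true` keeps the
# fail-fast termination behavior described above.
config :snakepit, :heartbeat,
  enabled: true,
  dependent: false
```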

### Changed
- Default heartbeat enabled state changed from `false` to `true` for better production reliability
- `HeartbeatMonitor` now suppresses worker termination when `dependent: false` is configured, logging warnings instead
- Python `HeartbeatClient` includes default shutdown handler for dependent mode
- `Snakepit.GRPCWorker` passes heartbeat configuration to Python via environment variables
- Updated configuration tests to reflect new heartbeat defaults

### Fixed
- Heartbeat configuration now properly propagates from Elixir to Python across all code paths

---

## [0.6.2] - 2025-10-26

### Added
- End-to-end heartbeat regression suite covering monitor boot, timeout handling, and OS-level process cleanup (`test/snakepit/grpc/heartbeat_end_to_end_test.exs`)
- Long-running heartbeat stability test to guard against drift and missed ping accumulation (`test/snakepit/heartbeat_monitor_test.exs`)
- Python-side telemetry regression ensuring outbound metadata preserves correlation identifiers (`priv/python/tests/test_telemetry.py`)
- Deep-dive documentation for the heartbeat and observability stack plus consolidated testing command guide (`docs/20251019/*.md`)

### Changed
- `Snakepit.GRPCWorker` now terminates itself whenever the heartbeat monitor exits, preventing pools from keeping unhealthy workers alive
- `make test` preferentially uses the repository’s virtualenv interpreter, exports `PYTHONPATH`, and runs `mix test --color` for consistent local runs

### Fixed
- Guard against leaking heartbeat monitors by stopping the worker when the monitor crashes, ensuring registry entries and OS ports are released

---

## [0.6.1] - 2025-10-19

### Added
- Proactive worker heartbeat monitoring via `Snakepit.HeartbeatMonitor` with configurable cadence, miss thresholds, and per-pool overrides
- Comprehensive telemetry stack: `Snakepit.Telemetry.OpenTelemetry` boot hook, `Snakepit.TelemetryMetrics` Prometheus exporter, and correlation helpers for tracing spans
- Rich gRPC client utilities (`Snakepit.GRPC.ClientImpl`) covering ping, session lifecycle, heartbeats, and streaming tooling
- Python bridge instrumentation (`snakepit_bridge.heartbeat`, `snakepit_bridge.telemetry`) plus new unit tests for telemetry and threaded servers
- Default telemetry/heartbeat configuration shipped in `config/config.exs`, including OTLP environment toggles and Prometheus port selection
- Configurable logging system via the new `Snakepit.Logger` module with centralized control over verbosity (`:debug`, `:info`, `:warning`, `:error`, `:none`)

### Changed
- `Snakepit.GRPCWorker` now emits detailed telemetry, manages heartbeats, and wires correlation IDs through tracing spans
- `Snakepit.Application` activates OTLP exporters based on environment variables, registers telemetry reporters alongside pool supervisors, and routes logs through `Snakepit.Logger`
- Python gRPC servers (`grpc_server.py`, `grpc_server_threaded.py`) updated with structured logging, execution metrics, and heartbeat responses
- Examples refreshed with observability storylines, dual-mode telemetry demos, and cleaner default output through `Snakepit.Logger`
- GitHub workflows tightened to reflect new test layout and planning artifacts
- 25+ Elixir modules migrated to `Snakepit.Logger` for consistent log suppression in demos and production

### Configuration
- New `:log_level` option under the `:snakepit` application config to control internal logging
  ```elixir
  # config/config.exs
  config :snakepit,
    log_level: :warning  # Options: :debug, :info, :warning, :error, :none
  ```

### Fixed
- Hardened CI skips for `ApplicationCleanupTest` to avoid nondeterministic BEAM run IDs
- Addressed flaky test ordering through targeted cleanup helpers and telemetry-aware assertions

### Documentation
- Major rewrite of `ARCHITECTURE.md`, new `AGENTS.md`, and comprehensive design dossiers for v0.7/v0.8 feature tracks
- Added heartbeat, telemetry, and OTLP upgrade plans under `docs/2025101x/`
- README refreshed with v0.6.1 highlights, logging guidance, installation tips, and observability walkthroughs

### Notes
- Existing configurations continue to work with the default `:info` log level
- Log suppression is optional—set `log_level: :debug` to restore verbose output
- Provides cleaner logs for production deployments and demos while retaining full visibility for debugging

---

## [0.6.0] - 2025-10-11

### Added - Phase 1: Dual-Mode Architecture Foundation

- **Worker Profile System**
  - New `Snakepit.WorkerProfile` behaviour for pluggable parallelism strategies
  - `Snakepit.WorkerProfile.Process` - Multi-process profile (default, backward compatible)
  - `Snakepit.WorkerProfile.Thread` - Multi-threaded profile stub (Phase 2-3 implementation)
  - Profile abstraction enables switching between process and thread execution modes

- **Python Environment Detection**
  - New `Snakepit.PythonVersion` module for Python version detection
  - Automatic detection of Python 3.13+ free-threading support (PEP 703)
  - Profile recommendation based on Python capabilities
  - Version validation and compatibility warnings

- **Library Compatibility Matrix**
  - New `Snakepit.Compatibility` module with thread-safety database
  - Compatibility tracking for 20+ popular Python libraries (NumPy, PyTorch, Pandas, etc.)
  - Per-library thread safety status, recommendations, and workarounds
  - Automatic compatibility checking for thread profile configurations

- **Configuration System Enhancements**
  - New `Snakepit.Config` module for multi-pool configuration management
  - Support for named pools with different worker profiles (see the sketch after this list)
  - Backward-compatible legacy configuration conversion
  - Comprehensive configuration validation and normalization
  - Profile-specific defaults (process vs thread)

- **Documentation**
  - Comprehensive v0.6.0 technical plan (8,000+ words)
  - GIL removal research and dual-mode architecture design
  - Phase-by-phase implementation roadmap (10 weeks)
  - Performance benchmarks and migration strategies
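
A hedged sketch of a named multi-pool setup using the profile abstraction above; the `:pools` schema and key names are assumptions rather than the verified configuration contract:

```elixir
# config/config.exs (schema assumed for illustration)
config :snakepit,
  pooling_enabled: true,
  pools: [
    %{name: :general, worker_profile: :process, pool_size: 8},
    %{name: :numerics, worker_profile: :thread, pool_size: 2}
  ]
```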

### Changed

- **Architecture Evolution**
  - Foundation laid for Python 3.13+ free-threading support
  - Worker management abstracted to support multiple parallelism models
  - Configuration system generalized for multi-pool scenarios

### Added - Phase 2: Multi-Threaded Python Worker

- **Threaded gRPC Server**
  - New `grpc_server_threaded.py` - Multi-threaded server with ThreadPoolExecutor
  - Concurrent request handling via HTTP/2 multiplexing
  - Thread safety monitoring with `ThreadSafetyMonitor` class
  - Request tracking per thread with performance metrics
  - Automatic adapter thread safety validation on startup
  - Configurable thread pool size (--max-workers parameter)

- **Thread-Safe Adapter Infrastructure**
  - New `base_adapter_threaded.py` - Base class for thread-safe adapters
  - `ThreadSafeAdapter` with built-in locking primitives
  - `ThreadLocalStorage` manager for per-thread state
  - `RequestTracker` for monitoring concurrent requests
  - `@thread_safe_method` decorator for automatic tracking
  - Context managers for safe lock acquisition
  - Built-in statistics and performance monitoring

- **Example Implementations**
  - `threaded_showcase.py` - Comprehensive thread-safe adapter example
  - Pattern 1: Shared read-only resources (models, configurations)
  - Pattern 2: Thread-local storage (caches, buffers)
  - Pattern 3: Locked shared mutable state (counters, logs)
  - CPU-intensive workloads with NumPy integration
  - Stress testing and performance monitoring tools
  - Example tools: compute_intensive, matrix_multiply, batch_process, stress_test

- **Thread Safety Validation**
  - New `thread_safety_checker.py` - Runtime validation toolkit
  - Concurrent access detection with detailed warnings
  - Known unsafe library detection (Pandas, Matplotlib, SQLite3)
  - Thread contention monitoring and analysis
  - Performance profiling per thread
  - Automatic recommendations for detected issues
  - Global checker with strict mode option

- **Documentation**
  - New `README_THREADING.md` - Comprehensive threading guide
  - Thread safety patterns and best practices
  - Writing thread-safe adapters tutorial
  - Testing strategies for concurrent code
  - Performance optimization techniques
  - Library compatibility matrix (20+ libraries)
  - Common pitfalls and solutions
  - Advanced topics: worker recycling, monitoring, debugging

### Added - Phase 3: Elixir Thread Profile Integration

- **Complete ThreadProfile Implementation**
  - Full implementation of `Snakepit.WorkerProfile.Thread`
  - Worker capacity tracking via ETS table (`:snakepit_worker_capacity`)
  - Atomic load increment/decrement for thread-safe capacity management
  - Support for concurrent requests to same worker (HTTP/2 multiplexing)
  - Automatic script selection (threaded vs standard gRPC server)

- **Worker Capacity Management**
  - ETS-based capacity tracking: `{worker_pid, capacity, current_load}`
  - Atomic operations for thread-safe load updates
  - Capacity checking before request execution
  - Automatic load decrement after request completion (even on error)
  - Real-time capacity monitoring via `get_capacity/1` and `get_load/1`

- **Adapter Configuration Enhancement**
  - Updated `GRPCPython.script_path/0` to select correct server variant
  - Automatic detection of threaded mode from adapter args
  - Seamless switching between process and thread servers
  - Enhanced argument merging for user customization

- **Load Balancing**
  - Capacity-aware worker selection
  - Prevents over-subscription of workers
  - Returns `:worker_at_capacity` when no slots available
  - Automatic queueing handled by pool layer

- **Example Demonstration**
  - New `examples/threaded_profile_demo.exs` - Interactive demo script
  - Shows configuration patterns for threaded mode
  - Explains concurrent request handling
  - Demonstrates capacity management
  - Performance monitoring examples

### Added - Phase 4: Worker Lifecycle Management

- **LifecycleManager GenServer**
  - New `Snakepit.Worker.LifecycleManager` - Automatic worker recycling
  - TTL-based recycling (configurable: seconds/minutes/hours/days)
  - Request-count based recycling (recycle after N requests)
  - Memory threshold recycling (optional, requires worker support)
  - Periodic health checks (every 5 minutes)
  - Graceful worker replacement with zero downtime

- **Worker Tracking Infrastructure**
  - Automatic worker registration on startup
  - Per-worker metadata tracking (start time, request count, config)
  - Process monitoring for crash detection
  - Lifecycle statistics and reporting

- **Recycling Logic** (sketched after this list)
  - Configurable TTL: `{3600, :seconds}`, `{1, :hours}`, etc.
  - Max requests: `worker_max_requests: 1000`
  - Memory threshold: `memory_threshold_mb: 2048` (optional)
  - Manual recycling: `LifecycleManager.recycle_worker(pool, worker_id)`
  - Automatic replacement after recycling

- **Request Counting**
  - Automatic increment after successful request
  - Per-worker request tracking
  - Triggers recycling at configured threshold
  - Integrated with Pool's execute path

- **Telemetry Events**
  - `[:snakepit, :worker, :recycled]` - Worker recycled with reason
  - `[:snakepit, :worker, :health_check_failed]` - Health check failure
  - Rich metadata (worker_id, pool, reason, uptime, request_count)
  - Integration with Prometheus, LiveDashboard, custom monitors

- **Documentation**
  - New `docs/telemetry_events.md` - Complete telemetry reference
  - Event schemas and metadata descriptions
  - Usage examples for monitoring systems
  - Prometheus and LiveDashboard integration patterns
  - Best practices and debugging tips

- **Supervisor Integration**
  - LifecycleManager added to application supervision tree
  - Positioned after WorkerSupervisor, before Pool
  - Automatic startup with pooling enabled
  - Clean shutdown handling
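
Putting the recycling knobs above together in one hedged sketch; `worker_max_requests` and `memory_threshold_mb` are the documented names, while the TTL key and the overall placement are assumptions, and the manual call mirrors `LifecycleManager.recycle_worker(pool, worker_id)` from the list above:

```elixir
# Key placement assumed; values are the documented examples.
config :snakepit,
  worker_ttl: {1, :hours},          # the TTL key name is an assumption
  worker_max_requests: 1000,
  memory_threshold_mb: 2048

# Manual recycling (pool name and worker id are illustrative):
Snakepit.Worker.LifecycleManager.recycle_worker(:default, "worker_1")
```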

### Changed - Phase 4

- **GRPCWorker Enhanced**
  - Workers now register with LifecycleManager on startup
  - Lifecycle config passed during initialization
  - Untracking on worker shutdown

- **Pool Enhanced**
  - Request counting integrated into execute path
  - Automatic notification to LifecycleManager on success
  - Supports lifecycle management without modifications to existing flow

### Added - Phase 5: Enhanced Diagnostics and Monitoring

- **ProfileInspector Module**
  - New `Snakepit.Diagnostics.ProfileInspector` - Programmatic pool inspection
  - Functions for pool statistics, capacity analysis, and memory usage
  - Profile-aware metrics for both process and thread pools
  - `get_pool_stats/1` - Comprehensive pool statistics
  - `get_capacity_stats/1` - Capacity utilization and thread info
  - `get_memory_stats/1` - Memory usage breakdown per worker
  - `get_comprehensive_report/0` - All pools analysis
  - `check_saturation/2` - Capacity warning system
  - `get_recommendations/1` - Intelligent optimization suggestions

- **Mix Task: Profile Inspector**
  - New `mix snakepit.profile_inspector` - Interactive pool inspection tool
  - Text and JSON output formats
  - Detailed per-worker statistics with `--detailed` flag
  - Pool-specific inspection with `--pool` option
  - Optimization recommendations with `--recommendations` flag
  - Color-coded utilization indicators (🔴🟡🟢⚪)
  - Profile-specific insights (process vs thread)

- **Enhanced Scaling Diagnostics**
  - Extended `mix diagnose.scaling` with profile-aware analysis
  - New TEST 0: Pool Profile Analysis
  - Thread pool vs process pool comparison
  - Capacity utilization monitoring
  - Profile-specific recommendations
  - System-wide optimization opportunities
  - Real-time pool statistics integration

- **Telemetry Events** (a handler sketch follows this list)
  - `[:snakepit, :pool, :saturated]` - Pool queue at max capacity
    - Measurements: `queue_size`, `max_queue_size`
    - Metadata: `pool`, `available_workers`, `busy_workers`
  - `[:snakepit, :pool, :capacity_reached]` - Worker reached capacity (thread profile)
    - Measurements: `capacity`, `load`
    - Metadata: `worker_pid`, `profile`, `rejected` (optional)
  - `[:snakepit, :request, :executed]` - Request completed with duration
    - Measurements: `duration_us` (microseconds)
    - Metadata: `pool`, `worker_id`, `command`, `success`

- **Diagnostic Features**
  - Worker memory usage tracking per process
  - Thread pool utilization analysis
  - Capacity saturation warnings
  - Profile-appropriate recommendations
  - Performance duration tracking
  - Queue depth monitoring
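
A minimal sketch attaching to the three events listed above; the handler id and logging are illustrative:

```elixir
:telemetry.attach_many(
  "snakepit-pool-diagnostics",
  [
    [:snakepit, :pool, :saturated],
    [:snakepit, :pool, :capacity_reached],
    [:snakepit, :request, :executed]
  ],
  fn event, measurements, metadata, _config ->
    # e.g. queue_size/max_queue_size, capacity/load, or duration_us per the catalog above
    IO.puts("#{inspect(event)}: #{inspect(measurements)} #{inspect(metadata)}")
  end,
  nil
)
```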

### Status

- **Phase 1** ✅ Complete - Foundation modules and behaviors defined
- **Phase 2** ✅ Complete - Multi-threaded Python worker implementation
- **Phase 3** ✅ Complete - Elixir thread profile integration
- **Phase 4** ✅ Complete - Worker lifecycle management and recycling
- **Phase 5** ✅ Complete - Enhanced diagnostics and monitoring
- **Phase 6** 🔄 Pending - Documentation and examples

### Notes

- **No Breaking Changes**: All v0.5.1 configurations remain fully compatible
- **Thread Profile**: Began as a Phase 1 stub (returning `:not_implemented`) and is fully implemented by the Phase 2-3 work in this release
- **Default Behavior**: Process profile remains default for maximum stability
- **Python 3.13+**: Free-threading support enables true multi-threaded workers
- **Migration**: Existing code requires zero changes to continue working

---

## [0.5.1] - 2025-10-11

### Added
- **Diagnostic Tools**
  - New `mix diagnose.scaling` task for comprehensive bottleneck analysis
  - Captures resource metrics (ports, processes, TCP connections, memory usage)
  - Enhanced error logging with port buffer drainage

- **Configuration Enhancements**
  - Explicit gRPC port range constraint documentation and validation
  - Batched worker startup configuration (`startup_batch_size: 8`, `startup_batch_delay_ms: 750`)
  - Resource limit safeguards with `max_workers: 1000` hard limit
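
A sketch of these settings in application config; the values are the documented ones and the top-level placement is an assumption:

```elixir
config :snakepit,
  max_workers: 1000,
  startup_batch_size: 8,
  startup_batch_delay_ms: 750
```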

### Changed
- **Worker Pool Scaling Improvements**
  - Pool now reliably scales to 250+ workers (previously limited to ~105)
  - Resolved thread explosion during concurrent startup (fixed "fork bomb" issue)
  - Dynamic port allocation using OS-assigned ports (port=0) eliminates port collision races
  - Batched worker startup prevents system resource exhaustion during concurrent initialization

- **Performance Optimizations**
  - Aggressive thread limiting via environment variables for optimal pool-level parallelism:
    - `OPENBLAS_NUM_THREADS=1` (numpy/scipy)
    - `OMP_NUM_THREADS=1` (OpenMP)
    - `MKL_NUM_THREADS=1` (Intel MKL)
    - `NUMEXPR_NUM_THREADS=1` (NumExpr)
    - `GRPC_POLL_STRATEGY=poll` (single-threaded)
  - Increased GRPC server connection backlog to 512
  - Extended worker ready timeout to 30s for large pools

- **Configuration Updates**
  - Increased `port_range` to 1000 (accommodates `max_workers`)
  - Enhanced configuration comments explaining each tuning parameter
  - Resource usage tracking during pool initialization

### Fixed
- **Concurrent Startup Issues**
  - Fixed "Cannot fork" / EAGAIN errors from thread explosion during worker spawn
  - Eliminated port collision races with dynamic port allocation
  - Resolved fork bomb caused by Python scientific libraries spawning excessive threads (6,000+ threads from OpenBLAS, gRPC, MKL)

- **Resource Management**
  - Better port binding error handling in Python gRPC server
  - Improved error diagnostics during pool initialization
  - Enhanced connection management in GRPC server

### Performance
- Successfully tested with 250 workers (2.5x previous limit)
- Startup time increases with pool size (~60s for 250 workers vs ~10s for 100 workers)
- Eliminated port collision races and fork resource exhaustion
- Dynamic port allocation provides reliable scaling

### Notes
- Thread limiting optimizes for high concurrency with many small tasks
- CPU-intensive workloads that perform heavy numerical computation within a single task may need different threading configuration
- For computationally intensive per-task workloads, consider:
  - Workload-specific environment variables passed per task
  - Separate worker pools with different threading profiles
  - Dynamic thread limit adjustment based on task type
  - Allowing higher OpenBLAS threads but reducing max_workers accordingly
- See commit dc67572 for detailed technical analysis and future considerations

---

## [0.5.0] - 2025-10-10

### Added
- **Process Management & Lifecycle**
  - New `Snakepit.RunId` module for unique process run identification with nanosecond precision
  - New `Snakepit.ProcessKiller` module for robust OS-level process cleanup with SIGTERM/SIGKILL escalation
  - Enhanced `ProcessRegistry` with run_id tracking and improved cleanup logic
  - Added `scripts/setup_python.sh` for automated Python environment setup

- **Test Infrastructure Improvements**
  - Added comprehensive Supertester refactoring plan (SUPERTESTER_REFACTOR_PLAN.md)
  - Phase 1 foundation updates complete with TestableGenServer support
  - New `assert_eventually` helper for polling conditions without Process.sleep
  - Enhanced test documentation and baseline establishment
  - New worker lifecycle tests for process management validation
  - New application cleanup tests with run_id integration

- **Python Cleanup & Testing**
  - Created Python test infrastructure with `test_python.sh` script
  - Added comprehensive SessionContext test suite (15 tests)
  - Created Elixir integration tests for Python SessionContext (9 tests)
  - Python cleanup summary documentation (PYTHON_CLEANUP_SUMMARY.md)
  - Enhanced Python gRPC server with improved process management and signal handling

- **Documentation**
  - Phase 1 completion report with detailed test results
  - Python cleanup and testing infrastructure summary
  - Enhanced test planning and refactoring documentation
  - Added comprehensive process management design documents (robust_process_cleanup_with_run_id.md)
  - Added implementation summaries and debugging session reports
  - New production deployment checklist (PRODUCTION_DEPLOYMENT_CHECKLIST.md)
  - New example status documentation (EXAMPLE_STATUS_FINAL.md)
  - Enhanced README with new icons and improved organization
  - Added README_GRPC.md and README_BIDIRECTIONAL_TOOL_BRIDGE.md
  - Created docs/archive/ structure for historical analysis and design documents

- **Assets & Branding**
  - Added 29 new SVG icons for documentation (architecture, binary, book, bug, chart, etc.)
  - New snakepit-icon.svg for branding
  - Enhanced visual documentation throughout

### Changed
- **Process Management Improvements**
  - `ApplicationCleanup` rewritten with run_id-based cleanup strategy
  - `GRPCWorker` enhanced with run_id tracking and improved termination handling
  - `ProcessRegistry` cleanup optimized from O(n) to O(1) operations by keying on run_id
  - Enhanced `GRPCPython` adapter with run_id support

- **Code Cleanup**
  - Removed dead Python code
  - Deleted obsolete backup files and unused modules
  - Streamlined Python SessionContext
  - Cleaned up test infrastructure and removed duplicate code
  - Archived ~60 historical documentation files to docs/archive/

- **Examples Refactoring**
  - Simplified grpc_streaming_demo.exs
  - Refactored grpc_advanced.exs for better clarity
  - Enhanced grpc_sessions.exs with improved structure
  - Streamlined grpc_streaming.exs
  - Improved grpc_concurrent.exs with better patterns

- **Test Coverage**
  - Increased total test coverage from 27 to 51 tests (+89%)
  - 37 Elixir tests passing (27 + 9 new integration tests + 1 new helper test)
  - 15 Python SessionContext tests passing
  - Enhanced test helpers with improved synchronization and cleanup

- **Build Configuration**
  - Enhanced mix.exs with expanded documentation and package metadata
  - Updated dependencies and build configurations

### Removed
- **DSPy Integration** (as announced in v0.4.3)
  - Removed deprecated `dspy_integration.py` module
  - Removed deprecated `types.py` with VariableType enum
  - Removed `session_context.py.backup`
  - Removed obsolete `test_server.py`
  - Removed unused CLI directory referencing non-existent modules
  - All `__pycache__/` directories cleaned up

- **Variables Feature (Temporary Removal)**
  - Removed incomplete variables implementation pending future redesign:
    - `lib/snakepit/bridge/variables.ex`
    - `lib/snakepit/bridge/variables/variable.ex`
    - `lib/snakepit/bridge/variables/types.ex`
    - All variable type modules (boolean, choice, embedding, float, integer, module, string, tensor)
    - `examples/grpc_variables.exs`
    - `lib/snakepit_showcase/demos/variables_demo.ex`
    - Related test files and Python code

- **Deprecated Components**
  - Removed `lib/snakepit/bridge/serialization.ex`
  - Removed `lib/snakepit/grpc/stream_handler.ex`
  - Removed integration test infrastructure (`test/integration/` directory)
  - Removed property-based tests pending refactor
  - Removed session and serialization tests pending redesign

### Fixed
- **Process Cleanup & Lifecycle**
  - Fixed race conditions in worker cleanup and termination
  - Improved OS-level process cleanup with proper signal handling
  - Enhanced DETS cleanup with run_id-based identification
  - Fixed test flakiness with improved synchronization

- **gRPC & Session Management**
  - Improved session initialization and cleanup in Python gRPC server
  - Enhanced error handling in bidirectional tool bridge
  - Better isolation between test runs

- **Test Infrastructure**
  - Isolation level configuration documented (staying with :basic until test refactoring)
  - Test infrastructure conflicts between manual cleanup and Supertester automatic cleanup resolved
  - Enhanced debugging capabilities for test failures

### Notes
- **Breaking Changes**:
  - DSPy integration fully removed (deprecated in v0.4.3)
  - Variables feature temporarily removed pending redesign
  - Users must migrate to DSPex for DSPy functionality (see v0.4.3 migration guide)
- Test suite reliability improved with better synchronization patterns
- Foundation laid for full Supertester conformance in future releases
- Process management significantly improved with run_id tracking system
- Documentation reorganized with archive structure for historical content

---

## [0.4.3] - 2025-10-07

### Deprecated
- **DSPy Integration** (`snakepit_bridge.dspy_integration`)
  - Deprecated in favor of DSPex-native integration
  - Will be removed in v0.5.0
  - Deprecation warnings added to all DSPy-specific classes:
    - `VariableAwarePredict`
    - `VariableAwareChainOfThought`
    - `VariableAwareReAct`
    - `VariableAwareProgramOfThought`
    - `ModuleVariableResolver`
    - `create_variable_aware_program()`
  - See migration guide: https://github.com/nshkrdotcom/dspex/blob/main/docs/architecture_review_20251007/04_DECOUPLING_PLAN.md

### Changed
- **VariableAwareMixin** docstring updated to emphasize generic applicability
  - Clarified it's generic, not DSPy-specific
  - Can be used with any Python library (scikit-learn, PyTorch, Pandas, etc.)

### Documentation
- Added prominent deprecation notice to README
- Added migration guide for DSPex users
- Clarified architectural boundaries (Snakepit = infrastructure, DSPex = domain)
- Added comprehensive architecture review documents

### Notes
- **No breaking changes** - existing code continues to work with deprecation warnings
- Core Snakepit functionality unaffected
- Non-DSPy users unaffected
- Deprecation period: 3-6 months before removal in v0.5.0

---

## [0.4.2] - 2025-10-07

### Fixed
- **DETS accumulation bug** - Fixed ProcessRegistry indefinite growth (1994+ stale entries cleaned up)
- **Session creation race condition** - Implemented atomic session creation with `:ets.insert_new` to eliminate concurrent initialization errors
- **Resource cleanup race condition** - Fixed `wait_for_worker_cleanup` to check actual resources (port availability + registry cleanup) instead of dead Elixir PID
- **Test cleanup race condition** - Added proper error handling in test teardown for already-stopped workers
- **ExDoc warnings** - Fixed documentation references by moving INSTALLATION.md to guides/ and adding to ExDoc extras

### Changed
- **ApplicationCleanup simplified** - Simplified implementation, changed to emergency-only handler with telemetry
- **Worker.Starter documentation** - Added comprehensive moduledoc with ADR-001 link explaining external process management rationale
- **DETS cleanup optimization** - Changed from O(n) per-PID syscalls to O(1) beam_run_id-based cleanup
- **Process.alive? filter removed** - Eliminated redundant check (Supervisor.which_children already returns alive children only)

### Added
- **ADR-001** - Architecture Decision Record documenting Worker.Starter supervision pattern rationale
- **External Process Supervision Design** - Comprehensive 1074-line design document covering multi-mode architecture
- **Issue #2 critical review** - Detailed analysis addressing all community feedback concerns
- **Performance benchmarks** - Added baseline benchmarks showing 1400-1500 ops/sec sustained throughput
- **Telemetry in ApplicationCleanup** - Added events for tracking orphan detection and emergency cleanup

### Removed
- **Dead code cleanup** - Removed unused/aspirational code:
  - Snakepit.Python module (referenced non-existent adapter)
  - GRPCBridge adapter (never used)
  - Dead Python adapters (dspy_streaming.py, enhanced.py, grpc_streaming.py)
  - Redundant helper functions in ApplicationCleanup
  - Catch-all rescue clauses (follows "let it crash" philosophy)

### Performance
- 100 workers initialize in ~3 seconds (unchanged)
- 1400-1500 operations/second sustained (maintained)
- DETS cleanup now O(1) vs O(n) (significant improvement for large process counts)

### Documentation
- Complete installation guide with platform-specific instructions (Ubuntu, macOS, WSL, Docker)
- Marked working vs WIP examples clearly (3 working, 6 aspirational)
- Added comprehensive analysis documents (150KB total)

### Testing
- All 139/139 tests passing ✅
- No orphaned processes ✅
- Clean shutdown behavior validated ✅

## [0.4.1] - 2025-07-24

### Added
- **New `process_text` tool** - Text processing capabilities with upper, lower, reverse, and length operations
- **New `get_stats` tool** - Real-time adapter and system monitoring with memory usage, CPU usage, and system information
- **Enhanced ShowcaseAdapter** - Added missing tools (adapter_info, echo, process_text, get_stats) for complete tool bridge demonstration

### Fixed
- **gRPC tool registration issues** - Resolved async/sync mismatch causing UnaryUnaryCall objects to be returned instead of actual responses
- **Missing tool errors** - Fixed "Unknown tool: adapter_info" and "Unknown tool: echo" errors by implementing missing @tool decorated methods
- **Automatic session initialization** - Fixed "Failed to register tools: not_found" error by automatically creating sessions before tool registration
- **Remote tool dispatch** - Implemented complete bidirectional tool execution between Elixir BridgeServer and Python workers
- **Async/sync compatibility** - Added proper handling for both sync and async gRPC stubs with fallback logic for UnaryUnaryCall objects

### Changed
- **BridgeServer enhancement** - Added remote tool execution capabilities with worker port lookup and gRPC forwarding
- **Python gRPC server** - Enhanced with automatic session initialization before tool registration
- **ShowcaseAdapter refactoring** - Expanded tool set to demonstrate full bidirectional tool bridge capabilities

## [0.4.0] - 2025-07-23

### Added
- Complete gRPC bridge implementation with full bidirectional tool execution
- Tool bridge streaming support for efficient real-time communication
- Variables feature with type system (string, integer, float, boolean, choice, tensor, embedding)
- Comprehensive process management and cleanup system
- Process registry with enhanced tracking and orphan detection
- SessionStore with TTL support and automatic expiration
- BridgeServer implementation for gRPC protocol
- StreamHandler for managing gRPC streaming responses
- Telemetry module for comprehensive metrics and monitoring
- MockGRPCWorker and test infrastructure improvements
- Showcase application with multiple demo scenarios
- Binary serialization support for large data (>10KB) with 5-10x performance improvement
- Automatic binary encoding with threshold detection
- Protobuf schema updates with binary fields support
- Tool registration and discovery system
- Elixir tool exposure to Python workers
- Batch variable operations for performance
- Variable watching/reactive updates support
- Heartbeat mechanism for session health monitoring

### Changed
- Major refactoring from legacy bridge system to gRPC-only architecture
- Removed all legacy bridge implementations (V1, V2, MessagePack)
- Unified all adapters to use gRPC protocol exclusively
- Worker module completely rewritten for gRPC support
- Pool module enhanced with configurable adapter support
- ProcessRegistry rewritten with improved tracking and cleanup
- Test framework upgraded with SuperTester integration
- Examples reorganized and updated for gRPC usage
- Python client library restructured as snakepit_bridge package
- Serialization module now returns 3-tuple `{:ok, any_map, binary_data}`
- Large tensors and embeddings automatically use binary encoding
- Integration tests updated to use new infrastructure

### Fixed
- Process cleanup and orphan detection issues
- Worker termination and registry cleanup
- Module redefinition warnings in test environment
- SessionStore TTL validation and expiration timing
- Mock adapter message handling
- Integration test pool timeouts and shutdown
- GitHub Actions deprecation warnings
- Elixir version compatibility in integration tests

### Removed
- All legacy bridge implementations (generic_python.ex, generic_python_v2.ex, etc.)
- MessagePack protocol support (moved to gRPC exclusively)
- Old Python bridge scripts (generic_bridge.py, enhanced_bridge.py)
- Legacy session_context.py implementation
- V1/V2 adapter pattern in favor of unified gRPC approach

## [0.3.3] - 2025-07-20

### Added
- Support for custom adapter arguments in gRPC adapter via pool configuration
- Enhanced Python API commands (call, store, retrieve, list_stored, delete_stored) in gRPC adapter
- Dynamic command validation based on adapter type in gRPC adapter

### Changed
- GRPCPython adapter now accepts custom adapter arguments through pool_config.adapter_args
- Improved supported_commands/0 to dynamically include commands based on the adapter in use

### Fixed
- gRPC adapter now properly supports third-party Python adapters like DSPy integration

## [0.3.2] - 2025-07-20

### Fixed
- Added missing files to the repository

## [0.3.1] - 2025-07-20

### Changed
- Merged MessagePack optimizations into main codebase
- Unified documentation for gRPC and MessagePack features
- Set GenericPythonV2 as default adapter with auto-negotiation

## [0.3.0] - 2025-07-20

### Added
- Complete gRPC bridge implementation with streaming support
- MessagePack serialization protocol support
- Comprehensive gRPC integration documentation and setup guides
- Enhanced bridge documentation and examples

### Changed
- Deprecated V1 Python bridge in favor of V2 architecture
- Updated demo implementations to use V2 Python bridge
- Improved gRPC streaming bridge implementation
- Enhanced debugging capabilities and cleanup

### Fixed
- Resolved init/1 blocking issues in V2 Python bridge
- General debugging improvements and code cleanup

## [0.2.1] - 2025-07-20

### Fixed
- Eliminated "unexpected message" logs in Pool module by properly handling Task completion messages from `Task.Supervisor.async_nolink`

## [0.2.0] - 2025-07-19

### Added
- Complete Enhanced Python Bridge V2 Extension implementation
- Built-in type support for Python Bridge V2
- Test rework specifications and improved testing infrastructure
- Commercial refactoring recommendations documentation

### Changed
- Enhanced Python Bridge V2 with improved architecture and session management
- Improved debugging capabilities for V2 examples
- Better error handling and robustness in Python Bridge

### Fixed
- Bug fixes in Enhanced Python Bridge examples
- Data science example debugging improvements
- General cleanup and code improvements

## [0.1.2] - 2025-07-18

### Added
- Python Bridge V2 with improved architecture and session management
- Generalized Python bridge implementation
- Enhanced session management capabilities

### Changed
- Major architectural improvements to Python bridge
- Better integration with external Python processes

## [0.1.1] - 2025-07-18

### Added
- DIAGS.md with comprehensive Mermaid architecture diagrams
- Elixir-themed styling and proper subgraph format for diagrams
- Logo support to ExDoc and hex package
- Mermaid diagram support in documentation

### Changed
- Updated configuration to include assets and documentation
- Improved documentation structure and visual presentation

### Fixed
- README logo path for hex docs
- Asset organization (moved img/ to assets/)

## [0.1.0] - 2025-07-18

### Added
- Initial release of Snakepit
- High-performance pooling system for external processes
- Session-based execution with worker affinity
- Built-in adapters for Python and JavaScript/Node.js
- Comprehensive session management with ETS storage
- Telemetry and monitoring support
- Graceful shutdown and process cleanup
- Extensive documentation and examples

### Features
- Concurrent worker initialization (1000x faster than sequential startup)
- Session affinity for stateful operations
- Built on OTP primitives (DynamicSupervisor, Registry, GenServer)
- Adapter pattern for any external language/runtime
- Production-ready with health checks and error handling
- Configurable pool sizes and timeouts (see the sketch after this list)
- Built-in bridge scripts for Python and JavaScript
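
For a sense of how these features are consumed, a short sketch follows; the option keys and the session entry point mirror later documentation and are illustrative rather than authoritative for 0.1.0.

```elixir
# Illustrative usage; names follow later docs and may not match 0.1.0 exactly.
#
# config/config.exs
#   config :snakepit,
#     pooling_enabled: true,
#     pool_config: %{pool_size: 8}

# Stateless call: any available worker may handle it.
{:ok, _pong} = Snakepit.execute("ping", %{})

# Session-based call: repeated calls with the same session id prefer the same
# worker, so worker-local state survives across requests.
{:ok, _reply} = Snakepit.execute_in_session("user-123", "echo", %{message: "hello"})
```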

[Unreleased]: https://github.com/nshkrdotcom/snakepit/compare/v0.13.0...HEAD
[0.13.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.12.0...v0.13.0
[0.12.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.11.1...v0.12.0
[0.11.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.11.0...v0.11.1
[0.11.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.10.1...v0.11.0
[0.10.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.10.0...v0.10.1
[0.10.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.9.1...v0.10.0
[0.9.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.9.0...v0.9.1
[0.9.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.8.9...v0.9.0
[0.8.9]: https://github.com/nshkrdotcom/snakepit/compare/v0.8.8...v0.8.9
[0.8.8]: https://github.com/nshkrdotcom/snakepit/compare/v0.8.7...v0.8.8
[0.8.7]: https://github.com/nshkrdotcom/snakepit/compare/v0.8.6...v0.8.7
[0.8.6]: https://github.com/nshkrdotcom/snakepit/compare/v0.8.5...v0.8.6
[0.8.5]: https://github.com/nshkrdotcom/snakepit/compare/v0.8.4...v0.8.5
[0.8.4]: https://github.com/nshkrdotcom/snakepit/compare/v0.8.3...v0.8.4
[0.8.3]: https://github.com/nshkrdotcom/snakepit/compare/v0.8.2...v0.8.3
[0.8.2]: https://github.com/nshkrdotcom/snakepit/compare/v0.8.1...v0.8.2
[0.8.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.8.0...v0.8.1
[0.8.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.7.7...v0.8.0
[0.7.7]: https://github.com/nshkrdotcom/snakepit/compare/v0.7.6...v0.7.7
[0.7.6]: https://github.com/nshkrdotcom/snakepit/compare/v0.7.5...v0.7.6
[0.7.5]: https://github.com/nshkrdotcom/snakepit/compare/v0.7.4...v0.7.5
[0.7.4]: https://github.com/nshkrdotcom/snakepit/compare/v0.7.3...v0.7.4
[0.7.3]: https://github.com/nshkrdotcom/snakepit/compare/v0.7.2...v0.7.3
[0.7.2]: https://github.com/nshkrdotcom/snakepit/compare/v0.7.1...v0.7.2
[0.7.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.7.0...v0.7.1
[0.7.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.11...v0.7.0
[0.6.11]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.10...v0.6.11
[0.6.10]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.9...v0.6.10
[0.6.9]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.8...v0.6.9
[0.6.8]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.7...v0.6.8
[0.6.7]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.6...v0.6.7
[0.6.6]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.5...v0.6.6
[0.6.5]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.4...v0.6.5
[0.6.4]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.3...v0.6.4
[0.6.3]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.2...v0.6.3
[0.6.2]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.1...v0.6.2
[0.6.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.6.0...v0.6.1
[0.6.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.5.1...v0.6.0
[0.5.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.5.0...v0.5.1
[0.5.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.4.3...v0.5.0
[0.4.3]: https://github.com/nshkrdotcom/snakepit/compare/v0.4.2...v0.4.3
[0.4.2]: https://github.com/nshkrdotcom/snakepit/compare/v0.4.1...v0.4.2
[0.4.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.4.0...v0.4.1
[0.4.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.3.3...v0.4.0
[0.3.3]: https://github.com/nshkrdotcom/snakepit/compare/v0.3.2...v0.3.3
[0.3.2]: https://github.com/nshkrdotcom/snakepit/compare/v0.3.1...v0.3.2
[0.3.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.3.0...v0.3.1
[0.3.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.2.1...v0.3.0
[0.2.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.2.0...v0.2.1
[0.2.0]: https://github.com/nshkrdotcom/snakepit/compare/v0.1.2...v0.2.0
[0.1.2]: https://github.com/nshkrdotcom/snakepit/compare/v0.1.1...v0.1.2
[0.1.1]: https://github.com/nshkrdotcom/snakepit/compare/v0.1.0...v0.1.1
[0.1.0]: https://github.com/nshkrdotcom/snakepit/releases/tag/v0.1.0