# Entropy

[![Hex.pm](https://img.shields.io/hexpm/v/entropy.svg)](https://hex.pm/packages/entropy)

Entropy is a fault-injection tool for the Elixir/OTP runtime.

It acts as a sidecar application, stochastically selecting and suspending
("zombifying") processes to simulate Grey Failures (degradation) rather than
simple termination.

## Purpose

Standard supervisors recover from crashes (Termination). They do not recover
from hanging processes (Degradation), because a suspended process never exits
and so never triggers a restart. Entropy validates system resilience by
forcibly suspending processes for defined intervals, showing whether the host
system handles timeouts and backpressure correctly.
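
The difference can be reproduced without Entropy. The sketch below is an illustration only: `GreyFailureDemo` is a hypothetical module, not part of the library. It suspends an `Agent` with `:erlang.suspend_process/1`, which leaves the process alive but unable to answer, so a caller only notices the fault if it sets a timeout.

```elixir
defmodule GreyFailureDemo do
  # Hypothetical demo module, not part of Entropy.
  def run do
    # Unlinked, unsupervised Agent standing in for a worker process.
    {:ok, worker} = Agent.start(fn -> 0 end)

    # Freeze the worker. It stays alive, so no supervisor would restart it.
    true = :erlang.suspend_process(worker)
    alive_while_suspended = Process.alive?(worker)

    # A call with a 100 ms timeout exits instead of hanging forever.
    call_result =
      try do
        Agent.get(worker, & &1, 100)
      catch
        :exit, {:timeout, _} -> :timeout
      end

    # Resume the worker; it answers normally again.
    true = :erlang.resume_process(worker)
    after_resume = Agent.get(worker, & &1, 1_000)
    Agent.stop(worker)

    %{
      alive_while_suspended: alive_while_suspended,
      call_result: call_result,
      after_resume: after_resume
    }
  end
end
```

A caller that used `:infinity` as its timeout would block for the entire suspension, which is the class of bug this kind of fault is meant to expose.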

## Features

* **Grey Failure Simulation:** Simulates degradation (freezing) in addition to
  simple termination to validate timeout handling.
* **Stochastic Selection:** Uses weighted probabilistic selection to ensure fair
  coverage of the process tree over time.
* **Safety Circuit Breaker:** Automatically halts injection if node CPU or
  Memory exceeds configured safety thresholds.
* **Immunity:** Supports static and dynamic immunity to protect critical
  infrastructure processes.
* **Dead Man's Switch:** Guarantees that all suspended processes are
  automatically resumed if the Entropy daemon crashes.

## Installation

Add `entropy` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:entropy, "~> 0.1.0"}
  ]
end
```

Ensure the `:os_mon` application is listed under `extra_applications` in the
`application/0` function of your `mix.exs`, as Entropy relies on `:cpu_sup` and
`:memsup` for its safety checks.

```elixir
def application do
  [
    extra_applications: [:logger, :os_mon]
  ]
end
```
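
To check that the sensors are reachable on your platform, you can read them directly. The following is a minimal sketch, not Entropy's internal code: `SensorSketch` is a hypothetical module, and deriving memory utilization from `:total_memory` and `:free_memory` is an assumption about how such a percentage could be computed.

```elixir
defmodule SensorSketch do
  # Hypothetical helper, not part of Entropy.

  # Reads both sensors. Starts :os_mon first if it is not already running.
  # Note: :cpu_sup.util/0 may return {:error, reason} on unsupported platforms.
  def read do
    {:ok, _started} = Application.ensure_all_started(:os_mon)

    %{
      cpu_util_percent: :cpu_sup.util(),
      memory_util_percent: memory_util_percent(:memsup.get_system_memory_data())
    }
  end

  # Converts memsup's keyword list into a used-memory percentage.
  # Returns nil when the required keys are missing.
  def memory_util_percent(data) do
    total = Keyword.get(data, :total_memory)
    free = Keyword.get(data, :free_memory)

    if is_integer(total) and total > 0 and is_integer(free) do
      (total - free) / total * 100.0
    else
      nil
    end
  end
end
```

Calling `SensorSketch.read()` in `iex -S mix` should return a map with both values if `:os_mon` is installed and enabled.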

## Configuration

Entropy is configured via the standard application environment.

**Note:** The safety thresholds define the Circuit Breaker. If system resources
exceed these limits, Entropy halts injection to prevent cascading failure.

```elixir
# config/config.exs

config :entropy,
  # Enable/Disable the injection scheduler.
  # Default: false (Safety first)
  is_injection_enabled: false,

  # The time between injection attempts in milliseconds.
  # Default: 5000
  injection_interval_ms: 2000,

  # The frequency at which the Circuit Breaker polls system resources.
  # Lower values shorten reaction time but add system overhead.
  # Default: 1000
  safety_check_interval_ms: 1000,

  # The maximum CPU utilization (0.0 - 100.0) allowed.
  # If the host node exceeds this, injection pauses.
  # Default: 95.0
  max_cpu_util_percent: 80.0,

  # The maximum Memory utilization (0.0 - 100.0) allowed.
  # Default: 90.0
  max_memory_util_percent: 80.0,

  # The maximum number of concurrent zombies allowed.
  # Default: 50
  max_active_zombies: 25,

  # A list of atoms (application names) strictly immune to selection.
  # :kernel, :init, :logger, and :entropy are immune by default.
  # Default: []
  immune_modules: [:my_critical_app],

  # A list of atoms (application names) allowed to be targeted.
  # If empty, all applications are valid targets.
  # Default: []
  target_applications: [:my_target_app],

  # The duration range {min, max} in ms for a process suspension.
  # Default: {1000, 10_000}
  zombie_ttl_range_ms: {1000, 10_000},

  # Fault Strategy Weights
  # A keyword list defining the relative frequency of fault types.
  # Keys: :suspend, :kill
  # Default: [suspend: 10, kill: 0] = Suspension only
  fault_strategy_weights: [suspend: 9, kill: 1],

  # Cooldown period for repetitive telemetry events in ms.
  # Default: 1000
  telemetry_debounce_ms: 1000,

  # Whether the AxiomaticLogger writes to `stdout`.
  # Keep this enabled in standard operation: if the system crashes (T=0), the
  # what and why must be preserved.
  is_axiomatic_reporting_enabled: true,

  # The buffer subtracted from the max resource utilization limit required to
  # recover from an unsafe state. Prevents rapid oscillation in the circuit
  # breaker.
  # Default: 10.0
  hysteresis_padding: 10.0
```
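
To make the meaning of `fault_strategy_weights` concrete, here is a minimal sketch of weighted selection over such a keyword list. `WeightedPickSketch` is a hypothetical module and the algorithm is an assumption for illustration; Entropy's own selection logic may differ.

```elixir
defmodule WeightedPickSketch do
  # Hypothetical helper, not Entropy's actual selection code.

  # Picks a strategy at random, proportionally to its weight.
  # The weights must sum to a positive integer.
  def pick(weights) do
    pick(weights, :rand.uniform(total_weight(weights)))
  end

  # Deterministic variant: `roll` must be an integer in 1..total weight.
  def pick(weights, roll) when is_integer(roll) do
    total = total_weight(weights)

    if roll < 1 or roll > total do
      raise ArgumentError, "roll must be between 1 and #{total}"
    end

    # Walk the list, subtracting each weight until the roll falls in a bucket.
    Enum.reduce_while(weights, roll, fn {strategy, weight}, remaining ->
      if remaining <= weight do
        {:halt, strategy}
      else
        {:cont, remaining - weight}
      end
    end)
  end

  defp total_weight(weights) do
    weights |> Keyword.values() |> Enum.sum()
  end
end
```

With `[suspend: 9, kill: 1]`, rolls 1 through 9 select `:suspend` and roll 10 selects `:kill`, so roughly one fault in ten is a kill. A weight of `0` means the strategy is never selected.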

### Configuration Hierarchy

Entropy follows the standard Elixir configuration cascade:

1. **`config/config.exs`**: Sets the **Static Defaults** (Safe/Disabled) at
   compile time.
2. **`config/runtime.exs`**: Reads **Environment Variables** at boot time.
    * *Rule:* Values set here **override** the static defaults.

### Environment Variables (Runtime)

Entropy supports runtime configuration via `config/runtime.exs`.

The following environment variables override static configuration in production:

* `ENTROPY_INJECTION_ENABLED`: ("true" | "false") Toggles the injection
  scheduler.
* `ENTROPY_INJECTION_INTERVAL_MS`: (Integer) Sets the time between injection
  attempts.

Example:
```bash
export ENTROPY_INJECTION_ENABLED="true"
export ENTROPY_INJECTION_INTERVAL_MS="5000"
```
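
If you maintain your own `config/runtime.exs`, the mapping from these variables to configuration keys could look like the fragment below. This is a sketch under assumptions, not a copy of the project's file: the parsing rules (an exact match on `"true"`, and `String.to_integer/1` for the interval) are illustrative choices.

```elixir
# config/runtime.exs (illustrative sketch)
import Config

# Only override the static default when the variable is actually set.
if value = System.get_env("ENTROPY_INJECTION_ENABLED") do
  config :entropy, is_injection_enabled: value == "true"
end

if value = System.get_env("ENTROPY_INJECTION_INTERVAL_MS") do
  # String.to_integer/1 raises on malformed input, failing fast at boot.
  config :entropy, injection_interval_ms: String.to_integer(value)
end
```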

## Usage

Entropy operates as a daemon. You interact with it through the `Entropy` module
or by observing `:telemetry` events.

### 1. Verification

After deployment, confirm the daemon is active and the environment permits
injection.

```elixir
# Returns true if the Entropy supervision tree is alive.
iex> Entropy.is_alive?()
true

# Returns true if the Circuit Breaker allows injection.
# (i.e., CPU < max_cpu_util_percent AND Memory < max_memory_util_percent)
iex> Entropy.is_ready?()
true
```

### 2. Runtime Control

Configuration changes (e.g., increasing aggression) can be applied without
restarting the node.

1. Modify `config.exs` or `runtime.exs`.
2. Execute the reload:

```elixir
iex> Entropy.reload_config()
:ok
```

### 3. Dynamic Immunity

Specific processes can be temporarily granted immunity during critical
transactions.

```elixir
# Protect the current process from chaos
Entropy.State.ImmunityRegistry.register(self())

# Critical work...

# Revoke protection
Entropy.State.ImmunityRegistry.unregister(self())
```
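
If the critical work raises, the `unregister/1` call above is never reached. One way to guard against that is a small wrapper built on `try/after`. The sketch below assumes only the `register/1` and `unregister/1` functions shown above; `ImmuneSection` and `with_immunity` are hypothetical names, and the registry module is a parameter so the helper can be exercised with a stand-in.

```elixir
defmodule ImmuneSection do
  # Hypothetical wrapper, not part of Entropy.
  # `registry` is any module exposing register/1 and unregister/1.
  def with_immunity(registry \\ Entropy.State.ImmunityRegistry, fun)
      when is_atom(registry) and is_function(fun, 0) do
    registry.register(self())

    try do
      fun.()
    after
      # Runs whether fun returns, raises, throws, or exits.
      registry.unregister(self())
    end
  end
end
```

Usage would then be `ImmuneSection.with_immunity(fn -> critical_work() end)`, where `critical_work/0` stands for your own code.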

## Observability

Entropy emits structured events via `:telemetry`.

### Injection Events

* `[:entropy, :injection, :start]` - Injection attempt initiated.
* `[:entropy, :injection, :stop]` - Injection successfully completed.
    * Metadata: `%{strategy: :suspend | :kill, ...}`
* `[:entropy, :injection, :failure]` - Injection failed (e.g., target died
  before suspension).

### Safety Events

* `[:entropy, :safety, :veto]` - Circuit Breaker tripped. Injection paused.
* `[:entropy, :safety, :recovery]` - Circuit Breaker reset. Injection resumed.

### Scheduler Events

* `[:entropy, :scheduler, :skip]` - Cycle skipped (e.g., due to circuit breaker
  or zombie limit).
* `[:entropy, :scheduler, :noop]` - Cycle executed but no valid victim found.

### Configuration Events

Events emitted when `Entropy.reload_config/0` applies runtime changes.

* `[:entropy, :scheduler, :injection_interval_change]`
    * Metadata: `%{old: integer(), new: integer()}`
* `[:entropy, :circuit_breaker, :threshold_change]`
    * Metadata: `%{old: map(), new: map()}`
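
The events above can be consumed with a standard `:telemetry` handler. The following is a minimal sketch: `EntropyEventLogger` and the handler id are hypothetical, the log wording is arbitrary, and only the event names and the `:strategy` metadata key come from the lists above.

```elixir
defmodule EntropyEventLogger do
  # Hypothetical handler module. Needs the :telemetry dependency at runtime.
  require Logger

  @events [
    [:entropy, :injection, :stop],
    [:entropy, :injection, :failure],
    [:entropy, :safety, :veto],
    [:entropy, :safety, :recovery]
  ]

  # Call once at application start. Returns :ok or {:error, :already_exists}.
  def attach do
    :telemetry.attach_many("entropy-event-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  # Standard four-argument telemetry handler.
  def handle_event(event, _measurements, metadata, _config) do
    Logger.info(describe(event, metadata))
  end

  # Pure formatting, kept separate so it can be tested without :telemetry.
  def describe([:entropy, :injection, :stop], metadata) do
    "entropy injection completed (strategy: #{inspect(Map.get(metadata, :strategy))})"
  end

  def describe([:entropy, :safety, :veto], _metadata) do
    "entropy circuit breaker tripped, injection paused"
  end

  def describe([:entropy, :safety, :recovery], _metadata) do
    "entropy circuit breaker reset, injection resumed"
  end

  def describe(event, _metadata) do
    "entropy event #{inspect(event)}"
  end
end
```

Passing a captured function (`&__MODULE__.handle_event/4`) rather than an anonymous one follows the `:telemetry` recommendation for handler performance.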

## Architecture

### Circuit Breaker & Hysteresis

Entropy polls `:cpu_sup` and `:memsup` every `safety_check_interval_ms`
(Default: 1000 ms).

1. **Safety Trip:** If usage exceeds `max_cpu_util_percent` or
   `max_memory_util_percent`, the system enters a **Safety State**. Injection
   halts immediately.
2. **Hysteresis Recovery:** To prevent oscillation (flapping) between safe and
   unsafe states, Entropy applies a **Hysteresis Padding** (Default: 10.0%).
    * *Example:* If the CPU limit is **80%**, the system becomes unsafe at
      **>80%**. However, it will not return to a safe state until CPU usage
      drops below **70%** (80% - 10% padding).

This ensures the host system has genuinely recovered before chaos resumes.
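
The trip and recovery rules can be expressed as a small state function. This is a sketch of the behavior described above for a single metric, not the library's implementation; `HysteresisSketch` and `next_state/4` are hypothetical names.

```elixir
defmodule HysteresisSketch do
  # Hypothetical model of the circuit breaker for one metric.
  # state:   :safe | :unsafe
  # usage:   current utilization in percent
  # limit:   e.g. max_cpu_util_percent
  # padding: e.g. hysteresis_padding

  # Trip: a safe system becomes unsafe as soon as usage exceeds the limit.
  def next_state(:safe, usage, limit, _padding) when usage > limit, do: :unsafe
  def next_state(:safe, _usage, _limit, _padding), do: :safe

  # Recover: an unsafe system becomes safe only below (limit - padding).
  def next_state(:unsafe, usage, limit, padding) when usage < limit - padding, do: :safe
  def next_state(:unsafe, _usage, _limit, _padding), do: :unsafe
end
```

With a limit of `80.0` and padding of `10.0`, a reading of `75.0` keeps a safe system safe but does not recover an unsafe one; recovery needs a reading below `70.0`.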

### Zombie Registry

Suspended processes are tracked in an ETS table owned by
`Entropy.State.ZombieRegistry`.

* **Constraint:** If the registry process crashes, the BEAM VM automatically
  resumes all suspended processes (Dead Man's Switch).
* **Limit:** The system enforces a hard limit of `max_active_zombies` (Default:
  50) to prevent total resource starvation.
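
The Dead Man's Switch relies on documented runtime behavior: suspend counts acquired through `:erlang.suspend_process/1` are released when the suspending process terminates. The sketch below demonstrates this in isolation, assuming the suspending process is the one that crashes. `DeadMansSwitchDemo` is a hypothetical module, and the spawned process merely stands in for the registry.

```elixir
defmodule DeadMansSwitchDemo do
  # Hypothetical demo module, not part of Entropy.
  def run do
    {:ok, target} = Agent.start(fn -> :untouched end)
    parent = self()

    # Stand-in for the registry: suspends the target, then waits to crash.
    {suspender, monitor} =
      spawn_monitor(fn ->
        true = :erlang.suspend_process(target)
        send(parent, :suspended)

        receive do
          :crash -> exit(:boom)
        end
      end)

    receive do
      :suspended -> :ok
    end

    # While suspended, the target cannot answer within the timeout.
    while_suspended = try_get(target, 100)

    # Crash the suspender and wait until it is confirmed dead.
    send(suspender, :crash)

    receive do
      {:DOWN, ^monitor, :process, ^suspender, :boom} -> :ok
    end

    # Nobody called resume_process/1, yet the target responds again.
    after_crash = try_get(target, 1_000)
    Agent.stop(target)

    {while_suspended, after_crash}
  end

  defp try_get(agent, timeout) do
    Agent.get(agent, & &1, timeout)
  catch
    :exit, {:timeout, _} -> :timeout
  end
end
```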

### Census

Entropy maintains a cached snapshot of the process table to minimize overhead.
The `Entropy.Sanctuary.Census` process refreshes this list on a fixed interval
(default: 5s).

**Refresh Lifecycle:**

1. Retrieves the global process list.
2. Filters processes based on the `target_applications` allowlist (if configured).
3. Converts the result to a Tuple for `O(1)` random access.

This architecture ensures that the Scheduler performs constant-time victim
selection without blocking the VM with expensive `Process.list/0` calls during
every tick.
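
The refresh lifecycle and the constant-time pick can be sketched as follows. This is an illustration under assumptions rather than the code in `Entropy.Sanctuary.Census`: `CensusSketch` is a hypothetical module, and mapping a pid to its application with `:application.get_application/1` is one possible way to apply the allowlist.

```elixir
defmodule CensusSketch do
  # Hypothetical model of the census, not Entropy's implementation.

  # Builds a snapshot tuple. An empty allowlist means every process is a target.
  def refresh(target_applications) do
    Process.list()
    |> Enum.filter(&targeted?(&1, target_applications))
    |> List.to_tuple()
  end

  # Constant-time random pick: elem/2 on a tuple is O(1).
  def pick({}), do: nil

  def pick(snapshot) when is_tuple(snapshot) do
    index = :rand.uniform(tuple_size(snapshot)) - 1
    elem(snapshot, index)
  end

  defp targeted?(_pid, []), do: true

  defp targeted?(pid, target_applications) do
    case :application.get_application(pid) do
      {:ok, app} -> app in target_applications
      :undefined -> false
    end
  end
end
```

The expensive part (`Process.list/0` plus filtering) runs once per refresh, while each scheduler tick only pays for a random number and one `elem/2` call. A picked pid may have died since the snapshot was taken, so a consumer still has to handle that case.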

## Development

This section explains how to set up the project locally for development.

### Requirements

* Elixir `~> 1.16` (OTP 26+)
* `:os_mon` (Required for System Sensors)

### Setup

```bash
# 1. Clone the repository
## via HTTPS
git clone https://github.com/nrednav/entropy.git

## via SSH
git clone git@github.com:nrednav/entropy.git

cd entropy

# 2. Install dependencies
mix deps.get

# 3. Run the test suite
# Note: Tests use a Simulated Physics engine to avoid actual system interference.
mix test
```

### Testing Strategy

Entropy uses a **deterministic testing pattern** to eliminate race conditions.

* **Simulated Physics:** Tests run against a `Physics` simulation, not the host
  OS.
* **Manual Polling:** In the test environment, the Circuit Breaker's automatic
  polling loop is paused. You must explicitly trigger state updates.

Example test workflow:

```elixir
# 1. Set the simulated physical state
Entropy.Simulation.Physics.set_cpu_util_percent(99.9)

# 2. Force the Circuit Breaker to read the new state
Entropy.State.CircuitBreaker.force_safety_check()

# 3. Assert the system reaction
Wait.until(fn ->
  case Entropy.State.CircuitBreaker.get_safety_report() do
    {:unsafe, metrics} -> metrics.cpu_util_percent == 99.9
    _ -> false
  end
end)

# Alternatively, assert through the public API
refute Entropy.is_ready?()
```
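
The `Wait.until/1` helper used in this workflow is not shown in this README. A polling helper of that kind could look like the sketch below; `WaitSketch`, its default timeout, and its return values are assumptions, not the project's actual helper.

```elixir
defmodule WaitSketch do
  # Hypothetical polling helper for tests.
  # Re-evaluates `fun` until it returns a truthy value or the timeout elapses.
  def until(fun, timeout_ms \\ 1_000, interval_ms \\ 10) when is_function(fun, 0) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    poll(fun, deadline, interval_ms)
  end

  defp poll(fun, deadline, interval_ms) do
    cond do
      fun.() ->
        :ok

      System.monotonic_time(:millisecond) >= deadline ->
        {:error, :timeout}

      true ->
        Process.sleep(interval_ms)
        poll(fun, deadline, interval_ms)
    end
  end
end
```

In a test you would typically write `assert :ok == WaitSketch.until(fn -> ... end)` so that a timeout fails the test instead of passing silently.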

## Versioning

This project uses [Semantic Versioning](https://semver.org/).
For a list of available versions, see the [repository tag
list](https://github.com/nrednav/entropy/tags).

## Issues & Requests

If you encounter a bug or have a feature request, please [open an
issue](https://github.com/nrednav/entropy/issues) on the GitHub repository.