docs/workflow_authoring.md

Select File
docs/workflow_authoring.md

# Workflow Authoring

This guide covers the workflow contract that Squid Mesh supports today.

## Formatter Setup

Squid Mesh exports formatter rules for workflow DSL calls. Host apps can import
them from their `.formatter.exs`:

```elixir
[
  import_deps: [:squid_mesh],
  inputs: ["{mix,.formatter}.exs", "{config,lib,test}/**/*.{ex,exs}"]
]
```

## Define A Workflow

Workflows are Elixir modules that `use SquidMesh.Workflow` and declare:

- one trigger
- one payload contract
- one or more steps
- transitions between steps
- optional dependency-based `after: [...]` joins on steps that wait for other work
- optional retry policy on the steps that own side effects
- optional recovery markers for irreversible or non-compensatable side effects

```elixir
defmodule Billing.Workflows.PaymentRecovery do
  use SquidMesh.Workflow

  workflow do
    trigger :payment_recovery do
      manual()

      payload do
        field :account_id, :string
        field :invoice_id, :string
        field :attempt_id, :string
        field :gateway_url, :string
      end
    end

    step :load_invoice, Billing.Steps.LoadInvoice
    step :wait_for_settlement, :wait, duration: 5_000
    step :log_recovery_attempt, :log,
      message: "Invoice loaded, checking gateway status",
      level: :info
    step :check_gateway_status, Billing.Steps.CheckGatewayStatus,
      retry: [max_attempts: 5, backoff: [type: :exponential, min: 1_000, max: 30_000]]
    step :notify_customer, Billing.Steps.NotifyCustomer

    transition :load_invoice, on: :ok, to: :wait_for_settlement
    transition :wait_for_settlement, on: :ok, to: :log_recovery_attempt
    transition :log_recovery_attempt, on: :ok, to: :check_gateway_status
    transition :check_gateway_status, on: :ok, to: :notify_customer
    transition :notify_customer, on: :ok, to: :complete
  end
end
```

## Triggers

Triggers define how a workflow run starts.

Supported trigger types:

- `manual()`
- `cron expression, timezone: "Etc/UTC"`

Trigger names are business-oriented entrypoints such as `:payment_recovery` or
`:invoice_delivery`. The trigger type describes how that entrypoint is invoked.

Current boundary:

- trigger metadata is validated and stored in the workflow definition
- manual triggers are runnable through the public API
- cron triggers are activated by the host app's scheduler and executor

Cron workflow example:

```elixir
defmodule Content.Workflows.PostDailyDigest do
  use SquidMesh.Workflow

  workflow do
    trigger :daily_digest do
      cron "0 9 * * 1-5", timezone: "Etc/UTC"

      payload do
        field :feed_url, :string, default: "https://example.com/feed.xml"
        field :discord_webhook_url, :string
        field :posted_on, :string, default: {:today, :iso8601}
      end
    end

    step :fetch_feed, Content.Steps.FetchFeed
    step :build_digest, Content.Steps.BuildDigest
    step :post_to_discord, Content.Steps.PostToDiscord,
      retry: [max_attempts: 5, backoff: [type: :exponential, min: 1_000, max: 30_000]]

    transition :fetch_feed, on: :ok, to: :build_digest
    transition :build_digest, on: :ok, to: :post_to_discord
    transition :post_to_discord, on: :ok, to: :complete
  end
end
```

Host-app scheduler example:

```elixir
def handle_cron_tick do
  MyApp.SquidMeshExecutor.enqueue_cron(
    SquidMesh.config!(),
    MyApp.Workflows.DailyStandup,
    :daily_standup,
    []
  )
end
```

Current cron boundary:

- Squid Mesh declares cron intent in the workflow DSL
- the host app performs the actual recurring scheduling
- cron workflow registration is static at boot today

## Payload

The trigger `payload` block defines the run input contract.

```elixir
payload do
  field :account_id, :string
  field :invoice_id, :string
  field :prompt_date, :string, default: {:today, :iso8601}
end
```

Supported field types today:

- `:string`
- `:integer`
- `:float`
- `:boolean`
- `:map`
- `:list`
- `:atom`

Supported defaults today:

- literal values that match the declared field type
- `{:today, :iso8601}` for ISO-8601 dates generated at run creation time

Payload validation runs before the run is persisted.

## Steps

Each `step` is declared in the Spark-backed workflow spec and is either:

- a native Squid Mesh step module that performs domain work
- a built-in primitive supplied by the runtime
- a raw `Jido.Action` module used as an explicit interop path

Module step:

```elixir
step :load_invoice, Billing.Steps.LoadInvoice
```

Native step modules use Squid Mesh concepts only:

```elixir
defmodule Billing.Steps.LoadInvoice do
  use SquidMesh.Step,
    name: :load_invoice,
    description: "Loads invoice details",
    input_schema: [
      invoice_id: [type: :string, required: true]
    ],
    output_schema: [
      invoice: [type: :map, required: true]
    ]

  @impl true
  def run(%{invoice_id: invoice_id}, %SquidMesh.Step.Context{} = context) do
    {:ok, %{invoice: %{id: invoice_id, run_id: context.run_id}}}
  end
end
```

`SquidMesh.Step.Context` exposes durable Squid Mesh runtime data:

- `run_id`
- `workflow`
- `step`
- `attempt`
- `state`, which includes the original payload merged with accumulated run context

Native steps may return:

- `{:ok, output}` or `{:ok, output, opts}` for success
- `{:error, reason}` for terminal failure that skips workflow retries and follows failure routing
- `{:retry, reason}` or `{:retry, reason, opts}` for retryable failure governed by the workflow retry policy

When `output: :key` is declared on the workflow step, Squid Mesh stores the
native step's returned map under that key after the step returns. The
`output_schema` validates the native step return before that workflow-level
mapping is applied.

Raw `Jido.Action` modules remain supported for advanced interop. They execute
through the same runtime path, but applications should prefer `use
SquidMesh.Step` for the common authoring path.

Built-in steps:

```elixir
step :wait_for_settlement, :wait, duration: 5_000
step :log_recovery_attempt, :log, message: "Checking gateway status", level: :info
step :wait_for_approval, :pause
approval_step :wait_for_review, output: :approval
```

Built-in step options supported today:

- `:wait` requires `duration`
- `:log` requires `message` and accepts `level`
- `:pause` intentionally stops the run at that step until an operator resumes it
- `approval_step/2` pauses the run for an explicit approve/reject decision and uses `:ok` or `:error` transitions to continue
- `:wait` uses executor-delayed continuation so long waits do not block a worker slot
- `:pause` is supported in transition-based workflows; dependency-based workflows cannot declare `:pause`
- `approval_step/2` is also transition-based only; dependency-based workflows cannot declare built-in `:approval` steps

Manual approval example:

```elixir
approval_step :wait_for_approval, output: :approval
step :record_approval, Billing.Steps.RecordApproval,
  input: [:account_id, :approval],
  output: :approval

step :record_rejection, Billing.Steps.RecordRejection,
  input: [:account_id, :approval],
  output: :approval

transition :wait_for_approval, on: :ok, to: :record_approval
transition :wait_for_approval, on: :error, to: :record_rejection
transition :record_approval, on: :ok, to: :complete
transition :record_rejection, on: :ok, to: :complete
```

When a run is paused at an approval step, inspect it as usual and then approve
or reject it through the public API:

```elixir
{:ok, paused_run} = SquidMesh.inspect_run(run_id, include_history: true)
{:ok, approved_run} = SquidMesh.approve_run(run_id, %{actor: "ops_123"})
{:ok, rejected_run} = SquidMesh.reject_run(run_id, %{actor: "ops_456"})
```

With `include_history: true`, the inspected run also exposes `audit_events` so
host apps can show who paused, resumed, approved, or rejected the run and when:

```elixir
Enum.map(paused_run.audit_events, &{&1.type, &1.step})
#=> [{:paused, :wait_for_approval}]
```

Manual-review durability notes:

- `approval_step/2` is only supported in transition-based workflows
- the approval step stays `:running` while the run is `:paused`
- `approve_run/3` completes that step and advances the declared `:ok` path
- `reject_run/3` completes that step and advances the declared `:error` path
- reviewer identity, decision, timestamp, and optional review metadata are persisted in the completed step output and merged run context
- `inspect_run(..., include_history: true)` also returns durable audit events for pause, resume, approval, and rejection actions
- the resolved `:ok` and `:error` targets plus output-mapping metadata are persisted with the paused step so restart or deploy boundaries do not recompute review semantics from the current workflow definition
- host apps should apply the latest Squid Mesh migrations before using pause-resume in existing environments

## Local Repo Transactions

Use `transaction: :repo` when one module step needs to run several same-process
host repo writes under one local Ecto transaction:

```elixir
step :post_local_ledger_entries, Billing.Steps.PostLocalLedgerEntries,
  transaction: :repo
```

This option is intentionally narrower than the durable workflow. It wraps only
the custom action's `run/2` callback in `config.repo.transaction/1`. If that
callback returns `{:error, reason}` or raises, the local repo writes made inside
the callback roll back and Squid Mesh then records the failed step attempt in
its normal durable history.

The boundary is not a distributed transaction:

- Squid Mesh still persists run, step, attempt, retry, and dispatch state after
  the action returns
- downstream steps and saga compensation callbacks are outside the local
  transaction
- external systems called by the action are not atomically reversible
- built-in steps cannot declare `transaction: :repo`
- transactional steps run in the worker process so Ecto can use the same
  checked-out transaction connection

Use this for small local database groups such as "insert a parent row plus
children" or "reserve and capture two local ledger records". Use saga
compensation or explicit `:error` transitions for work that crosses process,
queue, service, or workflow-step boundaries.

## Irreversible Steps

Use recovery markers when a step performs a side effect that should not be
treated as safely repeatable or undoable.

```elixir
step(:capture_payment, Billing.Steps.CapturePayment, irreversible: true)
step(:send_receipt, Billing.Steps.SendReceipt, compensatable: false)
```

`irreversible: true` means the step's effect cannot be undone in the workflow's
domain. Squid Mesh treats it as non-compensatable. `compensatable: false` is for
steps that may not be strictly irreversible but still have no reliable
application-owned compensation path.

Both markers produce the same replay safety behavior:

- `inspect_run(..., include_history: true)` includes each step's `recovery`
  policy
- `explain_run/2` removes `:replay_run` from terminal next actions after a
  completed marked step and reports the blocking step in `details.replay`
- `replay_run/2` returns `{:error, {:unsafe_replay, details}}` by default after
  a completed marked step
- `replay_run(run_id, allow_irreversible: true)` is the explicit operator
  override when re-execution has been reviewed and accepted

These markers do not provide exactly-once delivery or external compensation.
They keep Squid Mesh honest about recovery policy so a replay cannot silently
repeat a payment capture, notification, or other non-compensatable effect.

## Saga Compensation

Use `compensate: SomeAction` when a completed step has a domain-level inverse
operation that should run if a later step fails and the workflow cannot continue.
This is rollback, not same-step fallback. Same-step fallback stays modeled as an
`:error` transition.

```elixir
step :reserve_inventory, Billing.Steps.ReserveInventory,
  compensate: Billing.Steps.ReleaseInventory

step :authorize_payment, Billing.Steps.AuthorizePayment,
  compensate: Billing.Steps.VoidAuthorization

step :capture_payment, Billing.Steps.CapturePayment, retry: [max_attempts: 2]

transition :reserve_inventory, on: :ok, to: :authorize_payment
transition :authorize_payment, on: :ok, to: :capture_payment
transition :capture_payment, on: :ok, to: :complete
```

When `:capture_payment` exhausts its retry policy and has no `:error`
transition, Squid Mesh compensates previously completed compensatable steps in
reverse completion order. In this example it voids the payment authorization,
then releases inventory. Failed steps are not compensated because their forward
effect did not complete.

Compensation callbacks are `Jido.Action` modules. They receive the original
payload, current run context, the completed step's input and output, and the
terminal failure:

```elixir
def run(%{step: %{output: %{inventory_reservation: reservation}}}, _context) do
  {:ok, %{released_inventory: Map.put(reservation, :status, "released")}}
end
```

`inspect_run(..., include_history: true)` exposes compensation status and output
under each completed step's `recovery.compensation` field. Compensation callbacks
are not governed by the forward step's retry policy; forward retries exhaust
before rollback starts, and callback failures are persisted under
`recovery.compensation` for inspection. Write callbacks to be idempotent so a
host app can safely redeliver or repair failed compensation work.

## Compensation And Undo Routes

Error transitions can declare whether the routed recovery step is compensation
or undo:

```elixir
transition(:capture_payment, on: :error, to: :issue_credit, recovery: :compensation)
transition(:reserve_inventory, on: :error, to: :release_inventory, recovery: :undo)
```

Use `recovery: :compensation` when the next step reconciles or finishes partial
work with a forward action, such as issuing a credit after a payment capture
cannot continue. Use `recovery: :undo` when the next step reverses application-
owned local work, such as releasing a reservation that the workflow can still
control.

The marker does not change retry behavior. Squid Mesh still retries the failed
step first when a retry policy exists, then routes through the error transition
only after retries are exhausted. When the route is chosen,
`inspect_run(..., include_history: true)` exposes it in the failed step's
`recovery.failure` field and adds an audit event:

```elixir
%{
  failure: %{strategy: :compensation, target: :issue_credit}
}
```

Audit event types are `:compensation_routed` and `:undo_routed`, with the
target step in event metadata.

## Step Modules

Custom steps typically use `Jido.Action` and return workflow output in a plain
map.

```elixir
defmodule Billing.Steps.CheckGatewayStatus do
  use Jido.Action,
    name: "check_gateway_status",
    description: "Checks gateway state",
    schema: [
      invoice: [type: :map, required: true],
      gateway_url: [type: :string, required: true]
    ]

  @impl true
  def run(%{invoice: invoice, gateway_url: gateway_url}, _context) do
    case SquidMesh.Tools.invoke(SquidMesh.Tools.HTTP, %{method: :get, url: gateway_url}) do
      {:ok, result} ->
        {:ok, %{gateway_check: %{invoice_id: invoice.id, status: result.payload.body}}}

      {:error, error} ->
        {:error, SquidMesh.Tools.Error.to_map(error)}
    end
  end
end
```

Step result contract:

- success: `{:ok, map()}`
- failure: `{:error, map()}`

## Data Flow Between Steps

Each run starts with its validated payload.

When a step succeeds:

- Squid Mesh merges the returned map into the run context
- the next step receives the original payload merged with the accumulated context

That means later steps can use values produced by earlier steps without manual
state persistence in the host application.

If you want a step to consume only a subset of the available data, declare an
explicit input mapping:

```elixir
step :load_account, Billing.Steps.LoadAccount, input: [:account_id], output: :account
step :send_email, Billing.Steps.SendEmail, input: [:account, :invoice_id], output: :delivery
```

In that example:

- `:load_account` receives only `%{account_id: ...}`
- its returned map is stored under `:account`
- `:send_email` receives only `%{account: ..., invoice_id: ...}`
- its returned map is stored under `:delivery`

Current boundary:

- run context is still a flat merged map
- explicit `input: [...]` lets a step declare which keys it consumes
- explicit `output: :key` lets a step namespace its returned map under one top-level key
- dependency-based workflows with parallel branches should still emit disjoint top-level keys unless they intentionally namespace outputs
- if multiple parallel branches write the same key, the result is not a stable workflow contract today

## Dependency-Based Steps

Steps can also wait on explicit dependencies instead of success transitions:

```elixir
step :load_account, Billing.Steps.LoadAccount
step :load_invoice, Billing.Steps.LoadInvoice
step :prepare_notification, Billing.Steps.PrepareNotification,
  after: [:load_account, :load_invoice]
```

Choose dependency-based steps when you want to model prerequisites and joins.
They can still express a sequential chain such as `step_2 after: [:step_1]` and
`step_3 after: [:step_2]`, but if the workflow is only a straight ordered path,
`transition/2` is usually the clearer fit because it states the next step
directly.

Use `transition/2` when the workflow is a single ordered path and each step
chooses the next step by outcome. Use `after: [...]` when a step should wait
for one or more prerequisite steps, especially when multiple root steps fan in
to a join step.

In the example above, `:load_account` and `:load_invoice` are independent root
steps. Squid Mesh does not need a transition between them because neither one
depends on the other. They may be enqueued independently, and
`:prepare_notification` becomes runnable only after both have completed.

`after: [...]` makes a step runnable only after every named dependency
completes successfully. Omit the option entirely for root steps; `after: []` is
not valid because it changes execution semantics without adding a dependency
edge. Dependency workflows do not mix with `transition/2` in this slice.

### Fan-Out And Fan-In Contract

Dependency-based workflows model static graph fan-out and fan-in. A root step is
any declared step without `after: [...]`. Multiple root steps may be scheduled
as independent runnable work for the same run. A join step is any step with one
or more dependencies; it becomes runnable only after every declared dependency
has completed successfully.

Squid Mesh treats Runic-ready work as workflow runnable intent. In the current
runtime, that intent is represented by scheduled step rows and host-executor
jobs. In the Jido-native runtime path, the same readiness maps to durable
dispatch entries before live wakeups are considered successful. Either way, the
workflow contract is the same: readiness comes from persisted step state, not
from Oban or any other specific executor's concurrency model.

Sibling behavior:

- sibling root steps may run in either order, or concurrently when the host
  executor delivers them concurrently
- a join waits while any dependency is still pending or running
- a join is not scheduled after a sibling reaches terminal failure
- a sibling retry keeps the run in retrying state until the retry is delivered
  and the dependency completes
- cancellation and terminal run transitions prevent newly unlocked join work
  from being dispatched

Inspection and explanation reflect this graph state. With history enabled,
`inspect_run/2` shows declared dependency edges and whether each step is
pending, running, completed, failed, or waiting. `explain_run/2` reports a
waiting join with the dependencies it is waiting on and their current statuses;
once the join is scheduled, the explanation points at the runnable join step
and lists the dependencies that satisfied it.

Current dependency validation requires:

- every `after:` reference names a declared step
- the dependency graph is acyclic
- workflows may define multiple entry steps when dependency execution is used
- `after: []` is rejected because it changes execution semantics without adding an edge
- dependency-based workflows cannot also declare `transition/2`
- dependency-based workflows cannot declare built-in `:pause` or `:approval`
  steps; use transition-based workflows for those manual wait points today

Current execution boundary:

- a step becomes runnable only after every dependency has completed successfully
- multiple ready root steps can be enqueued independently while later phases still respect deterministic dependency order
- the current scheduler resolves dependency readiness from persisted step history after each successful dependency step, so it is intended for small and medium graph workflows
- downstream work is only enqueued from a locked run-progression boundary, so a sibling terminal failure prevents later dispatch

## Transitions

Transitions define the path through the workflow.

```elixir
transition :check_gateway_status, on: :ok, to: :notify_customer
transition :check_gateway_status, on: :error, to: :notify_operator
transition :notify_customer, on: :ok, to: :complete
```

Current workflow validation requires:

- at least one step
- exactly one trigger
- exactly one workflow entry step for transition-based workflows
- dependency-based workflows expose `entry_steps` plus `initial_step`; the singular `entry_step` is `nil`
- transitions only use supported outcomes: `:ok` and `:error`
- transitions reference known steps
- each `{from, on}` pair is declared at most once

## Retries And Backoff

Retry policy lives on the step that owns the work:

```elixir
step :check_gateway_status, Billing.Steps.CheckGatewayStatus,
  retry: [max_attempts: 5, backoff: [type: :exponential, min: 1_000, max: 30_000]]
```

Supported retry options today:

- `max_attempts`
- `backoff: [type: :exponential, min: ..., max: ...]`

Squid Mesh resolves workflow retry policy and asks the host executor to
schedule the next step attempt. If a step also declares an `on: :error` transition, Squid Mesh
takes that route only after retries are exhausted.

## Starting Runs

If a workflow defines a single trigger, the short path is:

```elixir
SquidMesh.start_run(Billing.Workflows.PaymentRecovery, %{
  account_id: account_id,
  invoice_id: invoice_id,
  attempt_id: attempt_id,
  gateway_url: gateway_url
})
```

If you want to name the trigger explicitly:

```elixir
SquidMesh.start_run(Billing.Workflows.PaymentRecovery, :payment_recovery, %{
  account_id: account_id,
  invoice_id: invoice_id,
  attempt_id: attempt_id,
  gateway_url: gateway_url
})
```

## Current Boundaries

The current workflow contract is intentionally smaller than a full graph engine.

Supported today:

- one trigger per workflow
- sequential transitions with explicit `:ok` and `:error` outcomes
- dependency-based joins with `after: [...]`
- durable retries and replay
- built-in `:wait`, `:log`, `:pause`, and `:approval` steps

Not implemented today:

- parallel dispatch of multiple ready steps
- conditional branching beyond transition outcomes
- dynamic cron registration after boot
- custom reclaim logic for interrupted in-flight step ownership