docs/adr/0022-storage-strategy.md

# ADR-0022: Pluggable Storage Strategy

## Status

Accepted

## Context

Currently, the Docker backend uses a hardcoded working directory pattern with a default of `/workspace`, which is invalid as a host path and causes test failures on macOS/Linux. More broadly, different deployment scenarios require different storage backends:

1. **Local Development**: Ephemeral tmp directories, fast, no external dependencies
2. **Multi-Node Clusters**: Shared storage (S3, Tigris) for session state across nodes
3. **Fly.io Deployments**: Tigris Object Storage for persistent, low-latency storage
4. **Session Resume**: Persistent storage enabling conversation continuation after restarts
5. **Compliance/Audit**: Durable storage of session artifacts for record-keeping

Additionally, the Anthropic backend creates files server-side that need to be synchronized to client storage. Consumers of this library need callbacks to capture file references for their own persistence needs (e.g., associating files with users in a database).

### Forces

- Docker requires a **local mount path** - remote storage must be synced locally
- Anthropic backend creates files remotely that may need client-side persistence
- Native backend operates in-memory but may want to persist results
- Library consumers need hooks to store file metadata in their own systems
- Storage lifecycle must align with session lifecycle
- Cleanup must be reliable, even on crashes

## Decision

We will introduce a `Conjure.Storage` behaviour that abstracts storage operations, with implementations for Local, S3, and Tigris. All backends will use this abstraction for session working directories and file operations.

### Storage Behaviour

```elixir
defmodule Conjure.Storage do
  @moduledoc "Behaviour for session storage backends"

  @type session_id :: String.t()
  @type path :: String.t()
  @type content :: binary()
  @type state :: term()
  @type file_ref :: %{
          path: path(),
          size: non_neg_integer(),
          content_type: String.t() | nil,
          checksum: String.t() | nil,
          storage_url: String.t() | nil,
          created_at: DateTime.t()
        }

  # Lifecycle
  @callback init(session_id, opts :: keyword()) :: {:ok, state} | {:error, term()}
  @callback cleanup(state) :: :ok | {:error, term()}

  # File operations
  @callback write(state, path, content) :: {:ok, file_ref} | {:error, term()}
  @callback read(state, path) :: {:ok, content} | {:error, term()}
  @callback exists?(state, path) :: boolean()
  @callback delete(state, path) :: :ok | {:error, term()}
  @callback list(state) :: {:ok, [file_ref]} | {:error, term()}

  # For Docker mounting - returns a local filesystem path
  @callback local_path(state) :: {:ok, path} | {:error, :not_supported}

  # For syncing remote files (Anthropic backend)
  @callback sync_from_remote(state, remote_files :: [map()]) :: {:ok, [file_ref]} | {:error, term()}
  @callback sync_to_remote(state) :: {:ok, [file_ref]} | {:error, term()}

  # Optional callbacks
  @optional_callbacks [sync_from_remote: 2, sync_to_remote: 1]
end
```

### Implementations

#### Local Storage (Default)

```elixir
defmodule Conjure.Storage.Local do
  @behaviour Conjure.Storage

  @impl true
  def init(session_id, opts) do
    base = Keyword.get(opts, :base_path, System.tmp_dir!())
    prefix = Keyword.get(opts, :prefix, "conjure")
    path = Path.join(base, "#{prefix}_#{session_id}")

    File.mkdir_p!(path)
    {:ok, %{path: path, session_id: session_id}}
  end

  @impl true
  def local_path(%{path: path}), do: {:ok, path}

  @impl true
  def cleanup(%{path: path}) do
    File.rm_rf!(path)
    :ok
  end

  # ... other callbacks operate directly on filesystem
end
```

#### S3 Storage

```elixir
defmodule Conjure.Storage.S3 do
  @behaviour Conjure.Storage

  @impl true
  def init(session_id, opts) do
    bucket = Keyword.fetch!(opts, :bucket)
    prefix = Keyword.get(opts, :prefix, "sessions/")
    region = Keyword.get(opts, :region, "us-east-1")

    # Local cache for Docker mounting
    cache_path = Path.join(System.tmp_dir!(), "conjure_s3_#{session_id}")
    File.mkdir_p!(cache_path)

    {:ok, %{
      bucket: bucket,
      prefix: "#{prefix}#{session_id}/",
      region: region,
      cache_path: cache_path,
      session_id: session_id
    }}
  end

  @impl true
  def local_path(%{cache_path: path}), do: {:ok, path}

  @impl true
  def write(state, path, content) do
    # Write to local cache
    local_file = Path.join(state.cache_path, path)
    File.mkdir_p!(Path.dirname(local_file))
    File.write!(local_file, content)

    # Async upload to S3
    s3_key = "#{state.prefix}#{path}"
    :ok = upload_to_s3(state.bucket, s3_key, content, state.region)

    {:ok, build_file_ref(path, content, s3_url(state, path))}
  end

  @impl true
  def cleanup(state) do
    # Delete from S3
    delete_prefix(state.bucket, state.prefix, state.region)
    # Cleanup local cache
    File.rm_rf!(state.cache_path)
    :ok
  end

  # ... other callbacks
end
```

#### Tigris Storage (Fly.io)

Tigris is Fly.io's globally-distributed, S3-compatible object storage. It automatically replicates data to regions close to your Fly Machines, providing low-latency access worldwide.

**Fly.io Setup:**

```bash
# Create a Tigris bucket for your Fly app
fly storage create

# Or with a specific name
fly storage create conjure-sessions

# Credentials are automatically injected into your Fly Machines as:
# - AWS_ACCESS_KEY_ID
# - AWS_SECRET_ACCESS_KEY
# - AWS_ENDPOINT_URL_S3
# - BUCKET_NAME
```

**fly.toml configuration:**

```toml
[env]
  SKILLEX_STORAGE = "tigris"

# Tigris bucket is attached automatically after `fly storage create`
```

```elixir
defmodule Conjure.Storage.Tigris do
  @moduledoc """
  Fly.io Tigris object storage backend.

  Tigris provides:
  - Global distribution with automatic replication
  - S3-compatible API
  - Zero-config when running on Fly.io Machines
  - Low-latency access from any Fly.io region

  Credentials are automatically available when running on Fly.io.
  For local development, use `fly storage` credentials or set env vars.
  """

  @behaviour Conjure.Storage

  @impl true
  def init(session_id, opts) do
    # Fly.io injects these automatically into Machines
    bucket = Keyword.get_lazy(opts, :bucket, fn ->
      System.get_env("BUCKET_NAME") || raise "BUCKET_NAME not set"
    end)

    endpoint = Keyword.get(opts, :endpoint, System.get_env("AWS_ENDPOINT_URL_S3"))
    access_key = Keyword.get(opts, :access_key, System.get_env("AWS_ACCESS_KEY_ID"))
    secret_key = Keyword.get(opts, :secret_key, System.get_env("AWS_SECRET_ACCESS_KEY"))

    # Optional: specify region hint for read affinity
    # Fly.io sets FLY_REGION automatically
    region = Keyword.get(opts, :region, System.get_env("FLY_REGION", "auto"))

    # Local cache for Docker mounting (uses Fly volume if available)
    cache_base = Keyword.get(opts, :cache_path,
      System.get_env("FLY_VOLUME_PATH") || System.tmp_dir!()
    )
    cache_path = Path.join(cache_base, "conjure_sessions/#{session_id}")
    File.mkdir_p!(cache_path)

    {:ok, %{
      bucket: bucket,
      prefix: "sessions/#{session_id}/",
      endpoint: endpoint,
      access_key: access_key,
      secret_key: secret_key,
      region: region,
      cache_path: cache_path,
      session_id: session_id
    }}
  end

  @impl true
  def local_path(%{cache_path: path}), do: {:ok, path}

  @impl true
  def write(state, path, content) do
    # Write to local cache first (fast, synchronous)
    local_file = Path.join(state.cache_path, path)
    File.mkdir_p!(Path.dirname(local_file))
    File.write!(local_file, content)

    # Upload to Tigris (replicated globally automatically)
    s3_key = "#{state.prefix}#{path}"
    :ok = put_object(state, s3_key, content)

    {:ok, build_file_ref(path, content, public_url(state, path))}
  end

  @impl true
  def cleanup(state) do
    # Delete from Tigris
    delete_prefix(state, state.prefix)
    # Cleanup local cache
    File.rm_rf!(state.cache_path)
    :ok
  end

  # Uses ex_aws or req for S3-compatible API calls
  defp put_object(state, key, content) do
    # Implementation using S3-compatible API against state.endpoint
  end

  defp delete_prefix(state, prefix) do
    # List and delete all objects with prefix
  end

  defp public_url(state, path) do
    "#{state.endpoint}/#{state.bucket}/#{state.prefix}#{path}"
  end
end
```

**Local Development with Tigris:**

```bash
# Get credentials for local dev
fly storage auth

# Or proxy through Fly
fly proxy 4566:443 -a <your-tigris-app>
```

```elixir
# config/dev.exs
config :conjure, :storage,
  strategy: Conjure.Storage.Tigris,
  endpoint: "http://localhost:4566",  # Local proxy
  bucket: "conjure-dev"
```

**Multi-Region Session Affinity:**

Tigris automatically serves data from the nearest region. For session locality:

```elixir
# Include region in session_id for debugging/routing
session_id = "#{System.get_env("FLY_REGION")}_#{UUID.uuid4()}"
```

### Consumer Callbacks

To allow library consumers to capture file references for their own persistence:

```elixir
defmodule Conjure.Storage.Callbacks do
  @type file_event :: :created | :modified | :deleted | :synced
  @type callback :: (file_event, Conjure.Storage.file_ref(), session_id :: String.t() -> :ok)

  @callback on_file_event(callback) :: :ok
end
```

Callbacks are configured per-session:

```elixir
session = Conjure.Session.new_docker(skills,
  storage: {Conjure.Storage.S3, bucket: "my-bucket"},
  on_file_created: fn file_ref, session_id ->
    # Store reference in your database
    MyApp.Repo.insert!(%MyApp.SessionFile{
      user_id: current_user.id,
      session_id: session_id,
      path: file_ref.path,
      storage_url: file_ref.storage_url,
      size: file_ref.size
    })
  end,
  on_file_deleted: fn file_ref, session_id ->
    MyApp.Repo.delete_all(
      from f in MyApp.SessionFile,
      where: f.session_id == ^session_id and f.path == ^file_ref.path
    )
  end
)
```

### Anthropic Backend Sync

The Anthropic backend creates files server-side. Storage sync pulls these to client storage:

```elixir
defmodule Conjure.Backend.Anthropic do
  def chat(session, message, api_callback, opts) do
    # ... conversation logic ...

    # After turn completes, sync created files
    case session.storage do
      nil -> {:ok, response, session}
      storage ->
        remote_files = get_created_files_from_response(response)
        {:ok, file_refs} = Conjure.Storage.sync_from_remote(storage, remote_files)
        session = %{session | created_files: session.created_files ++ file_refs}

        # Fire callbacks
        Enum.each(file_refs, fn ref ->
          fire_callback(session, :synced, ref)
        end)

        {:ok, response, session}
    end
  end
end
```

### Session Integration

```elixir
defmodule Conjure.Session do
  defstruct [
    # ... existing fields ...
    :storage,           # Storage state
    :storage_strategy,  # Module implementing Storage behaviour
    :file_callbacks     # Map of event -> callback function
  ]

  def new_docker(skills, opts) do
    {storage_strategy, storage_opts} = resolve_storage(opts)
    session_id = generate_session_id()
    {:ok, storage} = storage_strategy.init(session_id, storage_opts)

    # Get local path for Docker mounting
    {:ok, work_dir} = storage_strategy.local_path(storage)

    %Session{
      execution_mode: :docker,
      storage: storage,
      storage_strategy: storage_strategy,
      context: %ExecutionContext{working_directory: work_dir},
      # ...
    }
  end

  defp resolve_storage(opts) do
    case Keyword.get(opts, :storage) do
      nil -> {Conjure.Storage.Local, []}
      {module, opts} -> {module, opts}
      module when is_atom(module) -> {module, []}
    end
  end
end
```

### Configuration

```elixir
# config/dev.exs - Local development
config :conjure, :storage,
  strategy: Conjure.Storage.Local,
  base_path: System.tmp_dir!(),
  prefix: "conjure"

# config/prod.exs - Fly.io with Tigris (zero-config)
# Credentials auto-injected by Fly.io Machines
config :conjure, :storage,
  strategy: Conjure.Storage.Tigris
  # bucket, endpoint, credentials all from env vars automatically

# config/prod.exs - AWS S3
config :conjure, :storage,
  strategy: Conjure.Storage.S3,
  bucket: System.get_env("S3_BUCKET"),
  region: System.get_env("AWS_REGION", "us-east-1")

# Per-session override always takes precedence
session = Conjure.Session.new_docker(skills,
  storage: {Conjure.Storage.S3, bucket: "custom-bucket"}
)
```

## Consequences

### Positive

1. **Fixes Docker Bug**: Default storage now creates valid temp directories
2. **Multi-Node Support**: S3/Tigris enable shared state across cluster nodes
3. **Fly.io Ready**: First-class Tigris support for Fly.io deployments
4. **Session Persistence**: Storage backends can persist beyond session lifetime
5. **Consumer Integration**: Callbacks enable tight integration with consumer applications
6. **Anthropic File Sync**: Remote files can be captured and persisted client-side
7. **Unified Interface**: All backends use the same storage abstraction
8. **Testable**: Easy to mock storage in tests

### Negative

1. **Complexity**: Adds new abstraction layer and 3+ modules
2. **S3 Dependency**: S3/Tigris implementations need HTTP client (optional dep)
3. **Sync Latency**: Remote storage adds latency vs local filesystem
4. **Local Cache Management**: S3/Tigris need local cache for Docker, adding cleanup complexity

### Neutral

1. **Backwards Compatible**: Default Local storage maintains current behavior (once fixed)
2. **Optional Feature**: Consumers only pay complexity cost if using cloud storage

## File Structure

```
lib/conjure/
├── storage.ex                    # Storage behaviour
├── storage/
│   ├── local.ex                  # Local filesystem (default)
│   ├── s3.ex                     # AWS S3
│   └── tigris.ex                 # Fly.io Tigris
└── session.ex                    # Updated with storage integration
```

## Alternatives Considered

### 1. Docker Volumes Instead of Mounts

Could use Docker volumes for persistence, but:
- Less portable across storage backends
- Harder to access files from host
- Doesn't solve multi-node problem

### 2. Database Storage (Ecto)

Store files directly in PostgreSQL/etc:
- Adds hard Ecto dependency
- Not suitable for large files
- Tighter coupling than callbacks

### 3. GenStage/Broadway for File Events

Use GenStage for file event streaming:
- Overkill for most use cases
- Adds significant complexity
- Simple callbacks sufficient for MVP

## References

- [Tigris Object Storage](https://www.tigrisdata.com/)
- [Fly.io Tigris Integration](https://fly.io/docs/reference/tigris/)
- [AWS S3 API Reference](https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html)
- ADR-0010: Docker Production Executor
- ADR-0020: Backend Behaviour Architecture