# Vision input

Vision input lets the model see images — a screenshot, a diagram, a
photo — alongside the user's text. ALLM exposes a single multi-modal
content shape that works across OpenAI, Anthropic, and Gemini: a list
of `%TextPart{}` and `%ImagePart{}` values as the message content.

This guide covers the part structs, image-source variants (URL, raw
bytes, file path), provider parity, and detail-level controls.

## Multi-modal content

Instead of a plain string, a `%Message{}` content can be a list of
content parts:

```elixir
import ALLM, only: [user: 1]

msg = user([
  %ALLM.TextPart{text: "What's in this picture?"},
  %ALLM.ImagePart{source: {:url, "https://example.com/photo.png"}}
])
```

The list form drops into `ALLM.request/2` and `ALLM.chat/3` exactly
like a string.

## ImagePart sources

`%ImagePart{}` accepts three source shapes:

```elixir
# Public URL
%ALLM.ImagePart{source: {:url, "https://example.com/photo.png"}}

# Raw bytes (with required mime_type)
bytes = File.read!("/path/to/photo.png")
%ALLM.ImagePart{source: {:bytes, bytes}, mime_type: "image/png"}

# File path (read at adapter time)
%ALLM.ImagePart{source: {:file, "/path/to/photo.png"}}
```

Each adapter chooses the most efficient wire shape automatically:

* **OpenAI** — accepts both URLs and base64 inline data via `image_url`.
* **Anthropic** — accepts URLs and base64 `source` blocks. For models
  that don't accept URL sources, the adapter fetches the image
  client-side and inlines it.
* **Gemini** — uploads bytes inline via `inlineData`. URL inputs are
  fetched and inlined client-side.
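
As a rough illustration of the three inline wire shapes listed above, the sketch below builds the JSON-ready map each provider expects for one image. `WireShapeSketch` and `encode/3` are hypothetical names, not part of ALLM, and the field names reflect the public provider APIs as commonly documented rather than ALLM's actual adapter code.

```elixir
defmodule WireShapeSketch do
  @moduledoc """
  Hypothetical helper, not part of ALLM: shows the map each provider
  expects for a single inline (base64) image.
  """

  # OpenAI Chat Completions: a data URL inside an `image_url` part.
  def encode(:openai, bytes, mime_type) do
    %{
      "type" => "image_url",
      "image_url" => %{"url" => "data:#{mime_type};base64,#{Base.encode64(bytes)}"}
    }
  end

  # Anthropic Messages: an `image` block carrying a base64 `source`.
  def encode(:anthropic, bytes, mime_type) do
    %{
      "type" => "image",
      "source" => %{
        "type" => "base64",
        "media_type" => mime_type,
        "data" => Base.encode64(bytes)
      }
    }
  end

  # Gemini generateContent: an `inlineData` part.
  def encode(:gemini, bytes, mime_type) do
    %{"inlineData" => %{"mimeType" => mime_type, "data" => Base.encode64(bytes)}}
  end
end
```

Calling `WireShapeSketch.encode(:gemini, File.read!("/path/to/photo.png"), "image/png")` would yield a map ready for JSON encoding. The same bytes produce three different envelopes, which is why the adapters hide this step.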

## Round-trip example

    iex> engine = ALLM.Engine.new(
    ...>   adapter: ALLM.Providers.Fake,
    ...>   adapter_opts: [script: [{:text, "A red square."}, {:finish, :stop}]]
    ...> )
    iex> msg = ALLM.user([
    ...>   %ALLM.TextPart{text: "Describe this."},
    ...>   %ALLM.ImagePart{source: {:url, "https://example.com/red.png"}}
    ...> ])
    iex> {:ok, %ALLM.Response{output_text: text}} =
    ...>   ALLM.generate(engine, ALLM.request([msg]))
    iex> text
    "A red square."

The `Fake` adapter never inspects the image; it only replays the
scripted text. With a real provider, the image is sent to the model.

## Detail levels (OpenAI)

OpenAI's vision models accept a per-image detail hint:

```elixir
%ALLM.ImagePart{
  source: {:url, "https://example.com/photo.png"},
  detail: :high  # :auto | :low | :high
}
```

* `:auto` (default) — model decides based on image dimensions.
* `:low` — fixed 512×512 representation, cheaper.
* `:high` — higher-resolution, tiled representation; costs more tokens but resolves fine detail.

Anthropic and Gemini ignore the `:detail` field — their vision tiers
don't expose an equivalent knob. ALLM passes the field along to OpenAI
unchanged and silently drops it for the others (with a `:debug`-level
log on the first drop per process).
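
The pass-through and drop-with-one-log behavior described above could look roughly like the sketch below. `DetailSketch`, `wire_detail/2`, and the process-dictionary flag are hypothetical; ALLM's adapters may track the "first drop per process" differently.

```elixir
defmodule DetailSketch do
  require Logger

  # Process-dictionary key used to remember that this process already logged.
  @flag :detail_sketch_drop_logged

  # OpenAI takes the hint as a string on the wire.
  def wire_detail(:openai, detail) when detail in [:auto, :low, :high] do
    {:ok, Atom.to_string(detail)}
  end

  # No hint set: nothing to send and nothing to drop.
  def wire_detail(_provider, nil), do: :none

  # Providers without an equivalent knob: drop the hint, log once per process.
  def wire_detail(provider, detail) when provider in [:anthropic, :gemini] do
    if Process.get(@flag) != true do
      Logger.debug("dropping :detail #{inspect(detail)} for #{inspect(provider)}")
      Process.put(@flag, true)
    end

    :dropped
  end
end
```

A second call such as `DetailSketch.wire_detail(:gemini, :low)` in the same process returns `:dropped` without logging again.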

## Provider parity

| Feature | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| URL source | yes | yes (some models) | client-side fetch |
| Raw bytes / base64 | yes | yes | yes |
| File path | yes (read client-side) | yes | yes |
| Image in `:system` role | rejected (raises) | rejected (raises) | rejected (raises) |
| `:detail` field | honored | dropped | dropped |
| Per-message multi-image | yes | yes | yes |

The "image in `:system` role" check is a pre-flight validation in every
adapter — the wire formats reject it (or behave inconsistently), so
ALLM raises a clear `ALLM.Error.ValidationError` before dispatch
instead of letting you debug an opaque 400.
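
A minimal sketch of such a pre-flight check follows. It uses plain maps as stand-ins for `%Message{}` and `%ImagePart{}`, and it returns an error tuple where ALLM raises. The module name, the `type: :image` tag, and the error shape are all assumptions made for illustration.

```elixir
defmodule SystemImageCheckSketch do
  # Messages are plain maps here: %{role: atom, content: string | list}.
  # Image parts are stand-in maps tagged with `type: :image`.
  def validate(messages) when is_list(messages) do
    offending =
      Enum.find_index(messages, fn msg ->
        msg.role == :system and has_image?(msg.content)
      end)

    case offending do
      nil -> :ok
      index -> {:error, {:image_in_system_message, index}}
    end
  end

  # List content: look for any part tagged as an image.
  defp has_image?(content) when is_list(content) do
    Enum.any?(content, &match?(%{type: :image}, &1))
  end

  # Plain-string content cannot carry an image.
  defp has_image?(_content), do: false
end
```

Reporting the index of the offending message makes the failure easy to trace back to the caller's request, which is the point of validating before dispatch.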

## Common patterns

### Screenshot OCR

```elixir
{:ok, response} = ALLM.generate(engine, ALLM.request([
  ALLM.system("Extract every word visible in the image. Reply with a JSON array of strings."),
  ALLM.user([
    %ALLM.TextPart{text: "Extract text from this screenshot:"},
    %ALLM.ImagePart{source: {:file, "/tmp/screenshot.png"}}
  ])
]))
```

### Multi-image comparison

```elixir
{:ok, response} = ALLM.generate(engine, ALLM.request([
  ALLM.user([
    %ALLM.TextPart{text: "Which of these two images has more red?"},
    %ALLM.ImagePart{source: {:url, "https://example.com/a.png"}},
    %ALLM.ImagePart{source: {:url, "https://example.com/b.png"}}
  ])
]))
```

### Streaming a vision response

`stream_generate/3` works identically with vision input — the request
shape is the same, and you get text deltas back as the model
incrementally describes the image.

## File size and MIME limits

Each provider enforces its own per-image size limit: Anthropic caps
base64 image bytes at ~5MB, OpenAI at 20MB, and Gemini at 7MB.
Adapters validate size pre-flight and raise
`ALLM.Error.ValidationError` with a clear reason if an image exceeds
the limit. Compress or resize oversized images before sending.

Supported MIME types (intersection across providers): `image/png`,
`image/jpeg`, `image/gif`, `image/webp`. Adapters reject other types
pre-flight.
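
The size and MIME checks above can be sketched as a single function. `ImageLimitsSketch` is a hypothetical module, the limits are the approximate figures quoted in this section, and for simplicity it measures the raw binary and treats 1 MB as 1024 × 1024 bytes. The real adapters may measure the base64 payload or use different thresholds.

```elixir
defmodule ImageLimitsSketch do
  # Assumption: 1 MB = 1024 * 1024 bytes.
  @mb 1024 * 1024

  # Per-image caps as quoted in this guide; treat them as approximate.
  @max_bytes %{anthropic: 5 * @mb, openai: 20 * @mb, gemini: 7 * @mb}

  # MIME types listed in this guide as supported by every provider.
  @mime_types ~w(image/png image/jpeg image/gif image/webp)

  # Returns :ok or a tagged error; the MIME type is checked before the size.
  def check(provider, bytes, mime_type) when is_binary(bytes) do
    limit = Map.fetch!(@max_bytes, provider)

    cond do
      mime_type not in @mime_types -> {:error, {:unsupported_mime_type, mime_type}}
      byte_size(bytes) > limit -> {:error, {:too_large, byte_size(bytes), limit}}
      true -> :ok
    end
  end
end
```

Running a check like this before building the request lets you resize or re-encode the image locally instead of discovering the problem from a provider error.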

## Where to next

* `image_generation.md` — the parallel `:image_adapter` slot for
  generating new images.
* `examples/12_vision_input.exs` — runnable smoke test against any of
  the three providers.
* The `ALLM.Providers.OpenAI`, `ALLM.Providers.Anthropic`, and
  `ALLM.Providers.Gemini` module docs cover the per-provider quirks.