Skip to main content

guides/multimodal.md

# Multimodal (Vision)

ExAthena supports multimodal messages — text plus images — through the
`ExAthena.Messages.ContentPart` struct. Two entry points are available: the
ergonomic `images:` shorthand for quick one-liners, and full `ContentPart`
construction for complex payloads.

## Quick start: `images:` shorthand

Pass `images: [...]` to `ExAthena.query/2`, `ExAthena.stream/3`, or
`ExAthena.run/2` alongside a prompt string:

```elixir
png = File.read!("diagram.png")

{:ok, response} =
  ExAthena.query("Describe what you see",
    provider: :ollama,
    model: "llava",
    images: [%{data: png, media_type: "image/png"}]
  )

IO.puts(response.text)
```

Each entry in the `images:` list may be one of:

| Shape | Description |
|---|---|
| `%{data: binary(), media_type: String.t()}` | Inline image bytes |
| `%{data: binary()}` | Inline image, media type defaults to `"image/png"` |
| `%{url: String.t()}` | Remote image URL |

ExAthena builds a multimodal user message with the text part first, followed
by the image parts. When no prompt is given, the images are merged into the
last user message in `:messages`, or appended as a new user message.

## Full `ContentPart` approach

For finer control — mixing text, images, and files in arbitrary order — build
`ContentPart` structs directly and pass them as the message content:

```elixir
alias ExAthena.Messages
alias ExAthena.Messages.ContentPart

png = File.read!("chart.png")
pdf = File.read!("report.pdf")

parts = [
  ContentPart.text("Summarize the chart and cross-reference the report:"),
  ContentPart.image(png, "image/png"),
  ContentPart.file(pdf, "report.pdf", "application/pdf")
]

{:ok, response} =
  ExAthena.query(nil,
    provider: :claude,
    model: "claude-opus-4-7",
    messages: [Messages.user(parts)]
  )
```

### ContentPart factory functions

| Function | Type | Fields |
|---|---|---|
| `ContentPart.text(content)` | `:text` | `text` |
| `ContentPart.image(data, media_type \\ "image/png")` | `:image` | `data`, `media_type` |
| `ContentPart.image_url(url)` | `:image_url` | `url` |
| `ContentPart.file(data, filename, media_type \\ "application/octet-stream")` | `:file` | `data`, `filename`, `media_type` |

## Provider examples

### Ollama (llava, qwen2-vl)

```elixir
# config/config.exs
config :ex_athena, :ollama,
  base_url: "http://localhost:11434",
  model: "llava"

# usage
png = File.read!("screenshot.png")

{:ok, response} =
  ExAthena.query("What is shown in this screenshot?",
    provider: :ollama,
    model: "llava",
    images: [%{data: png, media_type: "image/png"}]
  )
```

Pull a vision-capable model first:

```bash
ollama pull llava
# or
ollama pull qwen2-vl
```

Ollama vision support is model-dependent. Non-vision models will return an
error or silently ignore image parts.

### OpenAI-compatible (gpt-4o)

```elixir
{:ok, response} =
  ExAthena.query("What's in this image?",
    provider: :openai_compatible,
    model: "gpt-4o",
    images: [%{url: "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"}]
  )
```

For inline images with the OpenAI API:

```elixir
png = File.read!("photo.jpg")

{:ok, response} =
  ExAthena.query("Describe the photo",
    provider: :openai_compatible,
    model: "gpt-4o-mini",
    images: [%{data: png, media_type: "image/jpeg"}]
  )
```

### Anthropic Claude

```elixir
png = File.read!("diagram.png")

{:ok, response} =
  ExAthena.query("Explain this architecture diagram",
    provider: :claude,
    model: "claude-opus-4-7",
    images: [%{data: png, media_type: "image/png"}]
  )
```

Claude supports PNG, JPEG, GIF, and WebP. Maximum image size is 5 MB per
image.

### Google Gemini

```elixir
png = File.read!("chart.png")

{:ok, response} =
  ExAthena.query("What trend does this chart show?",
    provider: :gemini,
    model: "gemini-2.5-flash",
    images: [%{data: png, media_type: "image/png"}]
  )
```

## Using `images:` in the agent loop

`ExAthena.run/2` forwards `images:` to `Request.new/2` so the first turn
has the image attached:

```elixir
png = File.read!("codebase_diagram.png")

{:ok, result} =
  ExAthena.run("Implement the architecture shown in this diagram",
    provider: :claude,
    model: "claude-opus-4-7",
    cwd: "/path/to/project",
    images: [%{data: png, media_type: "image/png"}]
  )
```

## Image format notes

- **Inline images** are sent as base64-encoded data to the provider. The
  `req_llm` adapter handles encoding transparently.
- **Image URLs** (`%{url: ...}`) are forwarded as-is. The provider fetches
  the image at inference time. Not all providers support URL references —
  prefer inline for maximum compatibility.
- **media_type** should match the actual image format (`"image/png"`,
  `"image/jpeg"`, `"image/gif"`, `"image/webp"`). Some providers are lenient;
  others require an accurate MIME type.
- **Multiple images** in one message are supported by all major providers
  (Claude, OpenAI, Gemini). Ollama support is model-dependent.

## Vision support by provider

| Provider | Vision support | Notes |
|---|---|---|
| `:ollama` | Model-dependent | `llava`, `qwen2-vl`, `llava-phi3`, `bakllava` |
| `:openai_compatible` |`gpt-4o`, `gpt-4o-mini` | URL + inline; other OAI-compat endpoints vary |
| `:claude` | ✅ Any `claude-3`+ model | PNG, JPEG, GIF, WebP; max 5 MB per image |
| `:gemini` | ✅ Any `gemini-1.5`+ model | Inline + URL; very generous size limits |