# Image Generation

> **Interactive Demo:** Try the [Image Generation Livebook](image-generation.livemd) to compare image generation across OpenAI, xAI, and Google in parallel.

## Overview

ReqLLM provides image generation through the `ReqLLM.generate_image/3` function, which works similarly to `ReqLLM.generate_text/3`. The key difference is that the response contains image data instead of text.

### Basic Usage

```elixir
{:ok, response} = ReqLLM.generate_image(
  "openai:gpt-image-1",
  "A serene Japanese garden with cherry blossoms"
)

# Extract the image binary data
image_data = ReqLLM.Response.image_data(response)

# Save to file
File.write!("garden.png", image_data)
```

### Response Structure

Image generation returns a canonical `ReqLLM.Response` struct where the assistant message contains `ReqLLM.Message.ContentPart` entries of type `:image` (binary data) or `:image_url` (URL reference).

```elixir
# Get the first image part
image_part = ReqLLM.Response.image(response)
# => #ContentPart<:image image/png (3469636 bytes)>

# Get all images (when n > 1)
all_images = ReqLLM.Response.images(response)

# Convenience helpers
binary_data = ReqLLM.Response.image_data(response)  # First :image part's data
url = ReqLLM.Response.image_url(response)           # First :image_url part's URL
```

## Common Options

These options are supported across providers (where the model allows):

| Option | Type | Description |
|--------|------|-------------|
| `n` | integer | Number of images to generate (provider-dependent; gemini-2.5-flash-image and gemini-3-pro-image-preview reject `n`) |
| `size` | string or tuple | Image dimensions, e.g., `"1024x1024"` or `{1024, 1024}` |
| `aspect_ratio` | string | Aspect ratio, e.g., `"16:9"` or `"1:1"` |
| `output_format` | atom | Image format: `:png`, `:jpeg`, or `:webp` |
| `response_format` | atom | Return type: `:binary` (default) or `:url` |
| `quality` | atom/string | Image quality (provider-dependent) |
| `seed` | integer | Random seed for reproducibility (provider-dependent) |
| `negative_prompt` | string | What to avoid in the image (provider-dependent) |

## Discovering Available Models

```elixir
# List all models that support image generation
ReqLLM.Images.supported_models()
# => ["openai:gpt-image-1", "openai:dall-e-3", "google:gemini-2.5-flash-image", ...]

# Validate a specific model
{:ok, model} = ReqLLM.Images.validate_model("openai:gpt-image-1")
```

---

## OpenAI

OpenAI offers several image generation models through the Images API.

### Supported Models

The GPT Image family provides superior instruction following, text rendering, detailed editing, and real-world knowledge. We recommend `gpt-image-1.5` for the best quality, or `gpt-image-1-mini` for cost-effective generation when image quality isn't the priority.

| Model | Notes |
|-------|-------|
| `gpt-image-1.5` | State-of-the-art, best overall quality |
| `gpt-image-1` | High fidelity with transparency support |
| `gpt-image-1-mini` | Cost-effective option for simpler use cases |
| `dall-e-3` | Higher quality than DALL-E 2, larger resolutions (deprecated May 2026) |
| `dall-e-2` | Lower cost, supports inpainting/variations (deprecated May 2026) |

### Current Limitations

ReqLLM currently supports **image generation only** via the Images API. The following OpenAI features are not yet supported:

- **Image editing** (editing with masks via the Images API)
- **Image variations** (DALL-E 2 only)
- **Responses API image generation tool** (generates images inline during chat)

### Prompt Format

OpenAI's image generation accepts only a **single text prompt**; it does not support multi-turn conversations or image editing via context. Be descriptive in your prompt to get the best results.

```elixir
# Good: Descriptive prompt
{:ok, response} = ReqLLM.generate_image(
  "openai:gpt-image-1",
  # Concatenate the pieces so no newline or indentation ends up in the prompt
  "A cozy coffee shop interior with warm lighting, exposed brick walls, " <>
    "vintage furniture, and steam rising from ceramic cups on wooden tables"
)
```

### Size Options

**GPT Image models** (gpt-image-1.5, gpt-image-1, gpt-image-1-mini):

- `"1024x1024"` (square, fastest)
- `"1536x1024"` (landscape)
- `"1024x1536"` (portrait)
- `"auto"` (default)

**dall-e-3:**

- `"1024x1024"`
- `"1792x1024"` (landscape)
- `"1024x1792"` (portrait)

**dall-e-2:**

- `"256x256"`, `"512x512"`, `"1024x1024"`
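
The size lists above can be checked locally before a request is sent. The following is a minimal sketch, not part of ReqLLM: the `SizeCheck` module and its function names are hypothetical, and the allowed sizes are copied from the lists above. It accepts both forms of the `size` option (string or tuple).

```elixir
defmodule SizeCheck do
  @moduledoc "Hypothetical helper for illustration; not part of ReqLLM."

  @gpt_image_sizes ["1024x1024", "1536x1024", "1024x1536", "auto"]

  # Allowed sizes per model, taken from the lists in this guide.
  @sizes %{
    "gpt-image-1.5" => @gpt_image_sizes,
    "gpt-image-1" => @gpt_image_sizes,
    "gpt-image-1-mini" => @gpt_image_sizes,
    "dall-e-3" => ["1024x1024", "1792x1024", "1024x1792"],
    "dall-e-2" => ["256x256", "512x512", "1024x1024"]
  }

  # Accept both forms of the `size` option: "1024x1024" or {1024, 1024}.
  def normalize({w, h}) when is_integer(w) and is_integer(h), do: "#{w}x#{h}"
  def normalize(size) when is_binary(size), do: size

  # Returns :ok, or an error tuple that lists the sizes the model allows.
  def check(model, size) do
    case Map.fetch(@sizes, model) do
      {:ok, allowed} ->
        if normalize(size) in allowed do
          :ok
        else
          {:error, {:unsupported_size, allowed}}
        end

      :error ->
        {:error, :unknown_model}
    end
  end
end
```

For example, `SizeCheck.check("dall-e-3", {1792, 1024})` returns `:ok`, while `SizeCheck.check("dall-e-3", "512x512")` returns an error tuple containing the three sizes listed for `dall-e-3`.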

### OpenAI-Specific Options

```elixir
# gpt-image-1 with transparency
{:ok, response} = ReqLLM.generate_image(
  "openai:gpt-image-1",
  "A golden retriever puppy, isolated on transparent background",
  output_format: :png,
  provider_options: [background: "transparent"]
)

# dall-e-3 with style
{:ok, response} = ReqLLM.generate_image(
  "openai:dall-e-3",
  "A mountain landscape at sunset",
  size: "1792x1024",
  quality: :hd,
  style: :vivid  # or :natural for more realistic
)
```

**GPT Image specific options** (via `provider_options`):

| Option | Values | Description |
|--------|--------|-------------|
| `background` | `"transparent"`, `"opaque"`, `"auto"` | Background transparency (use PNG/WebP format) |
| `moderation` | `"auto"`, `"low"` | Content moderation strictness |

**dall-e-3 specific options:**

| Option | Values | Description |
|--------|--------|-------------|
| `quality` | `:standard`, `:hd` | Image detail level |
| `style` | `:vivid`, `:natural` | Artistic vs realistic style |

### Revised Prompts

DALL-E 3 may automatically rewrite your prompt before generating. The revised prompt is available in the metadata of each returned image part:

```elixir
{:ok, response} = ReqLLM.generate_image("openai:dall-e-3", "A cat")

[image_part] = ReqLLM.Response.images(response)
revised = image_part.metadata[:revised_prompt]
# => "A fluffy orange tabby cat sitting gracefully on a windowsill..."
```

---

## Google (Gemini)

Google's Gemini models support both text-to-image generation and image editing through multi-turn conversations.

### Supported Models

| Model | Alias | Notes |
|-------|-------|-------|
| `gemini-2.5-flash-image` | Nano Banana | Fast generation, good for quick iterations and standard tasks |
| `gemini-3-pro-image-preview` | Nano Banana Pro | State-of-the-art quality, advanced text rendering, professional assets |
| `imagen-4.0-generate-001` | Imagen 4 | High-quality photorealistic images |
| `imagen-4.0-fast-generate-001` | Imagen 4 Fast | Faster generation with good quality |

### Model Selection

**Choose Gemini 2.5 Flash** for:

- Quick prototyping and iteration
- Straightforward text-to-image tasks
- Speed-sensitive applications

**Choose Gemini 3 Pro Preview** for:

- Professional-grade asset production
- Complex multi-turn editing workflows
- Text-heavy designs (logos, menus, infographics, diagrams)
- Character consistency across multiple images
- High-resolution output (1K, 2K, 4K)
- Tasks requiring advanced reasoning

**Choose Imagen** for:

- High-quality photorealistic images
- When you don't need multi-turn editing capabilities

### Basic Generation

Note: `gemini-2.5-flash-image` and `gemini-3-pro-image-preview` reject `n`; specify the image count in the prompt.

```elixir
{:ok, response} = ReqLLM.generate_image(
  "google:gemini-2.5-flash-image",
  "A futuristic cityscape with flying cars and neon lights",
  aspect_ratio: "16:9"
)
```

### Generating Multiple Images

**Important:** Google's documentation states that "the model won't always follow the exact number of image outputs that the user explicitly asks for." Multi-image generation is inherently unreliable, and prompt phrasing significantly affects success rates.

**Effective prompt patterns** (higher success rate):

```elixir
# Numbered list format - works well
{:ok, response} = ReqLLM.generate_image(
  "google:gemini-2.5-flash-image",
  "Generate multiple images: 1) A white cat 2) A black cat"
)

# Sequential instructions - works well
{:ok, response} = ReqLLM.generate_image(
  "google:gemini-2.5-flash-image",
  "Generate the first image of a sunrise, then generate a second image of a sunset"
)

# Labeled scenes - works well
{:ok, response} = ReqLLM.generate_image(
  "google:gemini-2.5-flash-image",
  "Generate multiple scenes: Scene A shows a forest, Scene B shows a desert"
)

images = ReqLLM.Response.images(response)
# May return 1 or 2 images depending on model behavior
```

**Less effective prompt patterns** (often returns only 1 image):

```elixir
# Simple count requests - often fails
"Generate two images of cats"
"Create 2 pictures of a banana"

# Even with emphasis - often fails
"Create two DISTINCT and SEPARATE images"
```

The model may respond with text like "here are two images" but only deliver one. For reliable multi-image workflows, consider making multiple API calls or using the numbered list format above.
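
One way to structure the "multiple API calls" approach is to issue one request per prompt and run them concurrently. The sketch below is an illustration under stated assumptions, not a ReqLLM feature: `MultiImage` and `generate_each/3` are hypothetical names, and the generating function is passed in as an argument so the helper can be exercised with a stub. It relies only on `Task.async_stream/3` from the Elixir standard library.

```elixir
defmodule MultiImage do
  @moduledoc "Hypothetical helper for illustration; not part of ReqLLM."

  # `generate` is any 1-arity function that takes a prompt and returns
  # {:ok, response} or {:error, reason}, for example:
  #   &ReqLLM.generate_image("google:gemini-2.5-flash-image", &1)
  def generate_each(prompts, generate, opts \\ []) do
    timeout = Keyword.get(opts, :timeout, 120_000)

    prompts
    # Results come back in input order, so they can be zipped with the prompts.
    |> Task.async_stream(generate,
      max_concurrency: 2,
      timeout: timeout,
      on_timeout: :kill_task
    )
    |> Enum.zip(prompts)
    |> Enum.map(fn
      {{:ok, result}, prompt} -> {prompt, result}
      {{:exit, reason}, prompt} -> {prompt, {:error, reason}}
    end)
  end
end
```

Each element of the result is a `{prompt, result}` pair, so a failed or timed-out request can be retried individually without regenerating the images that succeeded.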

### Aspect Ratios

Google supports flexible aspect ratios:

- `"1:1"` (square)
- `"3:4"`, `"4:3"`
- `"4:5"`, `"5:4"`
- `"9:16"`, `"16:9"`
- `"2:3"`, `"3:2"`
- `"21:9"` (ultrawide)

### Image Editing with Context

Unlike ReqLLM's OpenAI integration, which is currently generation-only, Google Gemini supports **image editing** when you include an existing image in the conversation context. This enables workflows such as style transfer, object addition or removal, and iterative refinement.

```elixir
alias ReqLLM.{Context, Message}
alias ReqLLM.Message.ContentPart

# Load an existing image
{:ok, original_image} = File.read("photo.jpg")

# Create a context with the image and editing instructions
context = Context.new([
  %Message{
    role: :user,
    content: [
      ContentPart.image(original_image, "image/jpeg"),
      ContentPart.text("Add a rainbow in the sky above the mountains")
    ]
  }
])

# Generate the edited image
{:ok, response} = ReqLLM.generate_image(
  "google:gemini-2.5-flash-image",
  context,  # Pass the full context instead of a string
  aspect_ratio: "16:9"
)

edited_image = ReqLLM.Response.image_data(response)
File.write!("photo_with_rainbow.png", edited_image)
```

### Multi-Turn Image Refinement

You can iteratively refine images through conversation:

```elixir
alias ReqLLM.{Context, Message, Response}
alias ReqLLM.Message.ContentPart

# Initial generation
{:ok, response1} = ReqLLM.generate_image(
  "google:gemini-2.5-flash-image",
  "A medieval castle on a hilltop"
)

first_image = Response.image_data(response1)

# Refine: add details
context = Context.new([
  %Message{
    role: :user,
    content: [
      ContentPart.image(first_image, "image/png"),
      ContentPart.text("Add a dramatic sunset behind the castle with orange and purple clouds")
    ]
  }
])

{:ok, response2} = ReqLLM.generate_image(
  "google:gemini-2.5-flash-image",
  context
)

# Further refinement
second_image = Response.image_data(response2)

context2 = Context.new([
  %Message{
    role: :user,
    content: [
      ContentPart.image(second_image, "image/png"),
      ContentPart.text("Add a dragon flying near one of the castle towers")
    ]
  }
])

{:ok, final_response} = ReqLLM.generate_image(
  "google:gemini-2.5-flash-image",
  context2
)
```

### Style Transfer

Apply artistic styles to existing images:

```elixir
alias ReqLLM.{Context, Message}
alias ReqLLM.Message.ContentPart

{:ok, photo} = File.read("portrait.jpg")

context = Context.new([
  %Message{
    role: :user,
    content: [
      ContentPart.image(photo, "image/jpeg"),
      ContentPart.text("Transform this photo into a watercolor painting style")
    ]
  }
])

{:ok, response} = ReqLLM.generate_image(
  "google:gemini-2.5-flash-image",
  context
)
```

### Prompting Tips for Google

Google recommends describing scenes rather than listing keywords:

```elixir
# Less effective
"cat, sitting, window, sunlight, cozy"

# More effective
"A content tabby cat lounging on a sunny windowsill,
 warm afternoon light streaming through sheer curtains"
```

---

## Usage & Cost Tracking

Image generation responses include usage and cost information in `response.usage`.

### Basic Usage

```elixir
prompt = "A serene Japanese garden with cherry blossoms"
{:ok, response} = ReqLLM.generate_image("openai:gpt-image-1", prompt)

response.usage
#=> %{
#     image_usage: %{
#       generated: %{count: 1, size_class: "1024x1024"}
#     },
#     cost: %{
#       images: 0.04,
#       tokens: 0.0,
#       tools: 0.0,
#       total: 0.04
#     },
#     input_cost: 0.0,
#     output_cost: 0.04,
#     total_cost: 0.04
#   }
```

### Size Classes

Image costs vary by size. The `size_class` field indicates the resolution tier used for billing:

| Provider | Size Classes |
|----------|-------------|
| OpenAI | `"1024x1024"`, `"1536x1024"`, `"1024x1536"`, `"auto"` |
| Google | Based on aspect ratio (e.g., `"1:1"`, `"16:9"`) |

### Multiple Images

When generating multiple images, the `count` reflects the total:

```elixir
prompt = "A serene Japanese garden with cherry blossoms"
{:ok, response} = ReqLLM.generate_image("openai:dall-e-2", prompt, n: 3)

response.usage.image_usage.generated
#=> %{count: 3, size_class: "1024x1024"}
```
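
When a workflow makes several requests, the per-response usage can be combined into a single total. The sketch below assumes `response.usage` is a plain map shaped like the examples above; `UsageTotals` is a hypothetical module, not part of ReqLLM, and keys that are absent are counted as zero.

```elixir
defmodule UsageTotals do
  @moduledoc "Hypothetical helper for illustration; not part of ReqLLM."

  # Sum generated-image counts and total cost over a list of usage maps.
  # Assumes plain maps with the keys shown in this guide.
  def summarize(usages) do
    Enum.reduce(usages, %{images: 0, total_cost: 0.0}, fn usage, acc ->
      count = get_in(usage, [:image_usage, :generated, :count]) || 0
      cost = Map.get(usage, :total_cost) || 0.0

      %{images: acc.images + count, total_cost: acc.total_cost + cost}
    end)
  end
end
```

Given a list of responses, this could be called as `UsageTotals.summarize(Enum.map(responses, & &1.usage))`.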

---

## Error Handling

```elixir
prompt = "A serene Japanese garden with cherry blossoms"

case ReqLLM.generate_image("openai:gpt-image-1", prompt) do
  {:ok, response} ->
    image_data = ReqLLM.Response.image_data(response)
    File.write!("output.png", image_data)

  {:error, %ReqLLM.Error.API.Request{status: 400, response_body: body}} ->
    IO.puts("Bad request: #{inspect(body)}")

  {:error, %ReqLLM.Error.Invalid.Parameter{} = error} ->
    IO.puts("Invalid parameter: #{Exception.message(error)}")

  {:error, error} ->
    IO.puts("Error: #{inspect(error)}")
end
```

## Testing with Fixtures

Use fixtures to test image generation without making API calls:

```elixir
{:ok, response} = ReqLLM.generate_image(
  "openai:gpt-image-1",
  "A test prompt",
  fixture: "image_basic"
)
```

See the [Fixture Testing](fixture-testing.md) guide for details.