guides/zero_shot.md

# Zero-Shot Classification

`Image.ZeroShot` lets you classify an image against arbitrary labels you provide at call time, without training or fine-tuning anything. Where `Image.Classification` is constrained to whatever 1000 ImageNet labels its model was trained on, zero-shot says "here are five categories I care about right now — which fits best?".

This is enormously useful when your label space:

- Doesn't exist in standard datasets (custom product categories, brand-specific taxonomies, ad-hoc tagging)
- Changes over time (new categories appear without retraining)
- Is unknown until query time (interactive search, user-driven filtering)

Powered by [CLIP](https://openai.com/research/clip), a contrastive vision-language model that learned a shared embedding space for images and text from 400 million image-caption pairs.

## Classifying

```elixir
iex> photo = Image.open!("portrait.jpg")
iex> Image.ZeroShot.classify(photo, [
...>   "a person on a horse",
...>   "a person walking a dog",
...>   "a parked car",
...>   "an empty street"
...> ])
[
  %{label: "a person on a horse", score: 0.94},
  %{label: "a person walking a dog", score: 0.04},
  %{label: "an empty street", score: 0.01},
  %{label: "a parked car", score: 0.01}
]
```

Scores sum to `1.0` (softmax over the candidate set). Results are sorted by descending score.

## Just the best label

When you only want the winner:

```elixir
iex> Image.ZeroShot.label(photo, ["dog", "cat", "horse"])
"horse"
```

## Image-to-image similarity

CLIP's image embeddings live in the same space as its text embeddings, so two images can also be compared directly:

```elixir
iex> a = Image.open!("dog1.jpg")
iex> b = Image.open!("dog2.jpg")
iex> c = Image.open!("car.jpg")
iex>
iex> Image.ZeroShot.similarity(a, b)
0.82
iex> Image.ZeroShot.similarity(a, c)
0.41
```

Useful for "find similar images" without standing up a vector database. For larger collections, compute embeddings once and cache them.

## Prompt templates matter

CLIP was trained on natural-language captions, so it understands sentences much better than bare nouns. Wrapping each label in a simple template reliably improves accuracy. The default is `"a photo of {label}"` — every label is wrapped before tokenisation.

You can override:

```elixir
# Photography-domain template
iex> Image.ZeroShot.classify(image, ["sunset", "rain", "snow"],
...>   template: "a high-quality photograph of {label}")

# Document-domain template
iex> Image.ZeroShot.classify(scan, ["invoice", "receipt", "letter"],
...>   template: "a scanned {label}")

# Disable templating if your labels are already full sentences
iex> Image.ZeroShot.classify(image,
...>   ["a black cat sitting on a chair", "an empty room with a chair"],
...>   template: nil)
```

A general rule: if your labels are nouns, leave the default template. If they're already descriptive sentences, use `template: nil`. If they're domain-specific (medical imagery, document scans, product photography), a domain-tailored template can help.

## Choosing a model

The default is `openai/clip-vit-base-patch32` — MIT licensed, ~600 MB, broad training coverage, well-validated. For higher quality at larger size:

```elixir
iex> Image.ZeroShot.classify(image, labels, repo: "openai/clip-vit-large-patch14")
```

(About ~1.7 GB — roughly 3× the default.)

## Pre-downloading

```bash
mix image_vision.download_models --zero-shot
```

## Default model

[OpenAI CLIP ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) — MIT licensed, ~600 MB. The original CLIP, the most well-validated and broadly applicable variant. Contains both a vision encoder (ViT-B/32) and a text encoder (transformer) that produce vectors in a shared 512-dim space.

## Dependencies

Zero-shot classification requires `:bumblebee`, `:nx`, and an Nx backend such as `:exla`. Add to `mix.exs`:

```elixir
{:bumblebee, "~> 0.6"},
{:nx, "~> 0.10"},
{:exla, "~> 0.10"}
```