defmodule Pdf.Reader do
@moduledoc """
Native PDF reader — opens a PDF binary or file path and provides pure-functional
access to text runs with positions, raster images, document metadata, interactive
form fields, document outlines (bookmarks), and page annotations.
No GenServer, no mutable state; the reader is a fully lazy, immutable pipeline.
## Typical usage
{:ok, doc} = Pdf.Reader.open("report.pdf")
{:ok, [page1_text | _], doc} = Pdf.Reader.read_text(doc)
{:ok, runs, doc} = Pdf.Reader.read_text_with_positions(doc)
{:ok, meta, doc} = Pdf.Reader.read_metadata(doc)
{:ok, n} = Pdf.Reader.page_count(doc)
:ok = Pdf.Reader.close(doc)
## Outlines (bookmarks)
{:ok, outlines, _doc} = Pdf.Reader.read_outlines(doc)
# => [%Pdf.Reader.Outline{title: "Chapter 1", level: 0, dest_page: 1, children: [...]}, ...]
# Bang variant — raises Pdf.Reader.Error on failure
outlines = Pdf.Reader.read_outlines!(doc)
## Annotations
{:ok, annotations, _doc} = Pdf.Reader.read_annotations(doc)
# => [%Pdf.Reader.Annotation{type: :highlight, page: 2, rect: {x1, y1, x2, y2}, ...}, ...]
# Bang variant — raises Pdf.Reader.Error on failure
annotations = Pdf.Reader.read_annotations!(doc)
## Error recovery
`open/2` accepts a `recover: true` option that activates four orthogonal
recovery phases (R-1..R-4). Each recovery action is logged as a structured
event tuple appended to `doc.recovery_log`. Use `recovery_log/1` to inspect:
{:ok, doc} = Pdf.Reader.open(bin, recover: true)
Pdf.Reader.recovery_log(doc)
# => [] when the PDF was well-formed
# => [{:xref_recovered, 5}, {:page_failed, 2, :unresolved_ref}] on a corrupt PDF
**Closed set of recovery event tuples:**
| Tuple | Meaning |
|---|---|
| `{:xref_recovered, n}` | Linear scan recovered `n` object entries (R-3) |
| `{:eof_marker_missing, :linear_scan_used}` | `%%EOF` absent; linear scan used (R-3) |
| `{:page_failed, page_n, reason}` | Page skipped; text/images from other pages returned (R-1) |
| `{:font_skipped, page_n, font_name, reason}` | Font replaced with U+FFFD fallback (R-2) |
| `{:page_tree_recovered, n_pages}` | Catalog/Pages fallback; `n_pages` recovered (R-4) |
An empty `recovery_log` after `open/2` **guarantees** no recovery occurred.
No other tuple shapes are appended by the recovery paths.
The following errors remain fatal even with `recover: true`:
`:not_a_pdf`, `:encrypted_password_required`, `:encrypted_wrong_password`,
`:encrypted_unsupported_handler`, `{:io_error, reason}`.
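For example, a caller might branch on these event shapes (a sketch; the
tuples are exactly those in the table above):

    {:ok, doc} = Pdf.Reader.open(bin, recover: true)

    Enum.each(Pdf.Reader.recovery_log(doc), fn
      {:xref_recovered, n} -> IO.inspect(n, label: "xref entries rebuilt")
      {:page_failed, page, reason} -> IO.inspect({page, reason}, label: "page skipped")
      _other -> :ok
    end)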
**Known gaps (documented limitations):**
- **Encrypted AND corrupted PDFs** — the synthetic trailer from R-3 does not
include `/Encrypt`; decryption cannot proceed.
- **Catalog-fallback page order (R-4)** — the page list is in xref-insertion
order, NOT document order. `{:page_tree_recovered, n}` signals this.
- **R-4 probe cost** — `recover: true` triggers a full page-tree walk at
  `open/2` time (O(pages)). Acceptable for an opt-in mode, but worth noting
  in callers that open very large PDFs.
## Encryption (Phase 2)
Standard Security Handler V1/V2/V4/V5-R6 supported. Use `open/2` with the
`password:` opt:
{:ok, doc} = Pdf.Reader.open(bin, password: "secret")
Empty password is auto-tried first (covers metadata-protection cases).
Errors: `:encrypted_password_required`, `:encrypted_wrong_password`,
`:encrypted_unsupported_handler`. See `Pdf.Reader.Errors` for the full set.
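For example (a sketch; the retry policy is the caller's):

    case Pdf.Reader.open(bin, password: password) do
      {:ok, doc} -> {:ok, doc}
      {:error, :encrypted_password_required} -> {:error, :password_needed}
      {:error, :encrypted_wrong_password} -> {:error, :try_another_password}
      {:error, other} -> {:error, other}
    end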
## Form XObject recursion (Phase 3)
`Do` operators referencing `/Type /XObject /Subtype /Form` objects are
recursed into transparently — text and images inside Forms (headers,
footers, repeated logos, templated form fields) appear in `read_text*`
and `read_images/1` output. CTM is multiplied with the Form's `/Matrix`
and resources are merged (Form wins on key collision). Cycle detection
via a visited-set guards against `A → B → A` loops; recursion depth is
capped at 8 (`{:cycle_detected, ref}` and `{:max_depth_exceeded, ref}`
events are emitted internally and dropped from text output).
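Conceptually, the guard looks like this (a simplified sketch, not the
internal code; `walk_form_children/3` is a hypothetical name):

    defp walk_form(ref, _visited, depth) when depth > 8,
      do: {:skip, {:max_depth_exceeded, ref}}

    defp walk_form(ref, visited, depth) do
      if MapSet.member?(visited, ref) do
        {:skip, {:cycle_detected, ref}}
      else
        walk_form_children(ref, MapSet.put(visited, ref), depth + 1)
      end
    end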
## Known limitations
- **No CID fonts beyond ToUnicode** — CID-keyed fonts that rely on `/CIDToGIDMap`
or registry/ordering/supplement data are not decoded. Only `bfchar`/`bfrange`
sections of ToUnicode CMaps are parsed.
- **No CCITT / JBIG2 / JPEG2000 image filters** — images using `CCITTFaxDecode`,
`JBIG2Decode`, or `JPXDecode` produce `{:error, {:unsupported_filter, name}}`.
- **No OCR** — scanned PDFs with no embedded text produce an empty text list.
- **Standard-14 font metrics** — fonts without embedded `/Widths` (Standard-14
such as Helvetica, Times-Roman) produce zero-width glyph advance; only `Tc`/`Tw`
character/word spacing contribute. Hardcoded AFM metrics are a separate change.
- **No BBox clipping** — text outside a Form's `/BBox` is still extracted.
- **Annotation appearance streams not rendered** — visual rendering is out of scope.
- **Markup popup hierarchies not resolved** — popup windows are not extracted.
- **Sound/movie/screen/redact/3D annotations** — not modelled; they surface as `:unknown`.
- **AcroForm widget annotations** — covered by `read_acroform/1`, not `read_annotations/1`.
## Spec references
- PDF 1.7 § 7.7.3 — Page Tree
- PDF 1.7 § 7.7.3.4 — Inheritance of Page Attributes (resource walk, cycle guard):
https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
- PDF 1.7 § 12.3.2 — Destinations
- PDF 1.7 § 12.3.3 — Document Outline
- PDF 1.7 § 12.5 — Annotations
- PDF 1.7 § 12.5.6.x — Annotation subtypes
- PDF 1.7 § 12.6 — Actions
- PDF 1.7 § 14.3.2 — Metadata Streams (XMP merge precedence):
https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
## Error reasons
See `Pdf.Reader.Errors` for the complete documented reason set.
"""
alias Pdf.Reader.{
AcroForm,
Annotation,
Annotations,
Document,
Encryption,
Filter,
Font,
Font.Widths,
FormField,
ObjectResolver,
Outline,
Outlines,
Page,
Trailer,
Utils,
XMP,
XRef
}
alias Pdf.Reader.Encryption.StandardHandler
@type reason ::
:not_a_pdf
| :malformed
| :encrypted_password_required
| :encrypted_wrong_password
| :encrypted_unsupported_handler
| :io_error
| {:io_error, File.posix()}
| {:unsupported_filter, atom()}
| {:unresolved_ref, Document.ref()}
| {:unsupported_pdf_version, String.t()}
| {:malformed, atom(), map()}
| :no_pages
| {:unknown_shape, term()}
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
@doc """
Opens a PDF from a binary or a file path.
## Options
- `password: String.t()` — the password to use when opening an encrypted PDF.
Defaults to `""` (the empty string). The empty password is ALWAYS tried first
regardless of this option (R-ENC4). If the empty password succeeds, the PDF is
opened without requiring a non-empty password.
## Success
Returns `{:ok, %Pdf.Reader.Document{}}` with:
- `:version` — the PDF version string (e.g. `"1.7"`)
- `:xref` — merged cross-reference table (all `/Prev` chains followed)
- `:trailer` — the most-recent trailer dictionary as a plain map
- `:binary` — the full PDF binary (held for lazy object resolution)
- `:cache` — starts as `%{}`
- `:encryption` — `nil` for non-encrypted PDFs; a populated `%StandardHandler{}` once authentication succeeds
## Errors
- `{:error, :not_a_pdf}` — binary does not start with `%PDF-`
- `{:error, :malformed}` — missing `%%EOF`, invalid `startxref`, etc.
- `{:error, :encrypted_password_required}` — `/Encrypt` found; no password supplied or empty password rejected.
- `{:error, :encrypted_wrong_password}` — password supplied but authentication failed.
- `{:error, :encrypted_unsupported_handler}` — unsupported encryption handler or RC4 unavailable.
- `{:error, :io_error}` — file read failed (no detail)
- `{:error, {:io_error, posix}}` — file read failed with POSIX reason
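## Example

A combined call (illustrative input; `recover:` is described in the module docs):

    case Pdf.Reader.open("scan.pdf", password: "secret", recover: true) do
      {:ok, doc} -> {:ok, doc, Pdf.Reader.recovery_log(doc)}
      {:error, reason} -> {:error, reason}
    end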
## Spec references
- PDF 1.7 § 7.6 — Standard Security Handler:
https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
- PDF 2.0 § 7.6 — Standard Security Handler (V5/R6):
https://www.pdfa.org/wp-content/uploads/2023/04/ISO_32000_2_2020_PDF_2.0_FDIS.pdf
"""
@spec open(binary() | Path.t(), keyword()) :: {:ok, Document.t()} | {:error, reason()}
def open(path_or_binary, opts \\ [])
# Starts with %PDF- magic bytes → definitely raw PDF content
def open(<<"%PDF-", _::binary>> = binary, opts) do
do_open(binary, opts)
end
# Binary that doesn't start with %PDF-:
# - If it looks like a filesystem path (starts with / ./ ../ or ~, or contains
#   path separators and its first 50 bytes are printable), treat as a file path.
# - Otherwise treat as raw binary content (returns :not_a_pdf for non-PDF bytes).
def open(path, opts) when is_binary(path) do
if looks_like_path?(path) do
read_file_and_open(path, opts)
else
do_open(path, opts)
end
end
# File path as charlist
def open(path, opts) when is_list(path) do
read_file_and_open(path, opts)
end
# Heuristic: treat as a filesystem path if it starts with /, ./, ../, or ~,
# or if it contains path separators and its first 50 bytes are printable.
defp looks_like_path?(bin) do
String.starts_with?(bin, "/") or
String.starts_with?(bin, "./") or
String.starts_with?(bin, "../") or
String.starts_with?(bin, "~") or
(String.contains?(bin, "/") and String.printable?(String.slice(bin, 0, 50)))
end
defp read_file_and_open(path, opts) do
case File.read(path) do
{:ok, bin} -> do_open(bin, opts)
{:error, reason} -> {:error, {:io_error, reason}}
end
end
@doc """
No-op in Phase 1 (no file handle or process is held after `open/2`).
Exists to reserve the API slot for future streaming/mmap support and to
signal to callers that they may drop the `:binary` field to reclaim memory.
Always returns `:ok`. Does NOT raise.
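For example (the caller's choice; a sketch):

    :ok = Pdf.Reader.close(doc)
    doc = %{doc | binary: nil}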
"""
@spec close(Document.t()) :: :ok
def close(_doc), do: :ok
@doc """
Returns the recovery event log for a document in chronological (oldest-first) order.
An empty list guarantees that no recovery action occurred during `open/2`.
This is the canonical way for callers to inspect recovery events: the struct
field stores events newest-first, so direct access to `doc.recovery_log`
MUST NOT be used in application code.
The closed set of recovery event tuples is documented in `Pdf.Reader.Document`.
## Spec reference
PDF 1.7 § 7.5 — PDF file structure (recovery model).
"""
@spec recovery_log(Document.t()) :: [Document.recovery_event()]
def recovery_log(%Document{recovery_log: log}), do: Enum.reverse(log)
@doc """
Extracts document metadata from the Info dictionary.
Resolves the trailer's `/Info` reference and returns its key-value pairs
as a `%{String.t() => String.t()}` map. String values are decoded from
PDF literal strings (`{:string, binary}`).
Common keys: `"Title"`, `"Author"`, `"Subject"`, `"Keywords"`,
`"Creator"`, `"Producer"`, `"CreationDate"`, `"ModDate"`.
Returns `{:ok, %{}, doc}` when no `/Info` entry is present.
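## Example

Key values depend on the document (illustrative):

    {:ok, meta, doc} = Pdf.Reader.read_metadata(doc)
    Map.get(meta, "Title")
    # => "Annual Report" (or nil when the key is absent)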
## Spec reference
PDF 1.7 § 14.3.3 (Document Information Dictionary).
"""
@spec read_metadata(Document.t()) ::
{:ok, %{String.t() => String.t()}, Document.t()} | {:error, reason()}
def read_metadata(%Document{trailer: trailer} = doc) do
# Step 1: read the /Info dictionary
{info_meta, doc1} =
case Map.get(trailer, "Info") do
nil ->
{%{}, doc}
info_ref ->
case ObjectResolver.resolve(doc, info_ref) do
{:ok, info_dict, updated_doc} when is_map(info_dict) ->
meta =
info_dict
|> Enum.flat_map(fn {k, v} ->
case decode_info_value(v) do
nil -> []
str -> [{k, str}]
end
end)
|> Map.new()
{meta, updated_doc}
{:ok, _non_dict, updated_doc} ->
{%{}, updated_doc}
{:error, _} ->
{%{}, doc}
end
end
# Step 2: Read XMP stream from catalog /Metadata, merge with /Info
# XMP wins on conflict per PDF 1.7 § 14.3.2.
{xmp_meta, doc2} = read_xmp_stream(doc1)
merged = Map.merge(info_meta, xmp_meta)
{:ok, merged, doc2}
end
@doc """
Returns the total number of pages in the document.
Cross-validates the `/Count` entry in the page tree root against the
actual number of leaf page refs found by traversal. If they disagree,
returns `{:error, {:malformed, :page_tree_count_mismatch, %{declared: n, actual: m}}}`.
## Recovery mode (R-4)
When `recover_mode: true` and the page list was recovered via the catalog
fallback (xref scan), there is no `/Pages /Count` to cross-validate against.
In that case, the declared-count lookup is skipped and the actual count from
the xref scan is returned directly. This branch is signalled by
`{:page_tree_recovered, n}` in `recovery_log`.
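A caller that prefers the traversal count on mismatch might write (a sketch):

    case Pdf.Reader.page_count(doc) do
      {:ok, n} -> {:ok, n}
      {:error, {:malformed, :page_tree_count_mismatch, %{actual: actual}}} -> {:ok, actual}
      {:error, reason} -> {:error, reason}
    end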
Spec references: PDF 1.7 § 7.7.3 (Page Tree), § 7.7.3.4 (Inheritance).
"""
@spec page_count(Document.t()) :: {:ok, pos_integer()} | {:error, reason()}
def page_count(%Document{} = doc) do
with {:ok, refs, updated_doc} <- Page.list_refs(doc) do
actual = length(refs)
if actual == 0 do
{:error, :no_pages}
else
# R-4: when the page list was obtained via catalog-fallback (xref scan),
# there is no /Pages /Count to cross-validate. Detect this by checking
# if a :page_tree_recovered event was appended during list_refs/1.
# In that case return the actual count directly, bypassing the cross-check.
page_tree_recovered? =
Enum.any?(updated_doc.recovery_log, &match?({:page_tree_recovered, _}, &1))
if page_tree_recovered? do
{:ok, actual}
else
case read_declared_count(updated_doc) do
{:ok, declared_count} when declared_count == actual ->
{:ok, actual}
{:ok, declared_count} ->
{:error,
{:malformed, :page_tree_count_mismatch,
%{declared: declared_count, actual: actual}}}
{:error, _} when doc.recover_mode ->
# recover_mode active but no /Pages /Count found — return actual count
{:ok, actual}
{:error, _} = err ->
err
end
end
end
end
end
@doc """
Unified entry point — returns the entire extracted PDF in one struct.
Default shape is `%Pdf.Reader.Result{}` carrying:
- `:meta` — document-level metadata (title, author, subject,
creator, producer, dates, page_count, PDF version, encryption flag,
recovery_log, plus the raw Info+XMP map). PDF 1.7 § 14.3.
- `:pages` — `[%Pdf.Reader.Result.Page{number, meta, lines}]`. Each
page's `:lines` includes text lines AND embedded images as
synthetic lines, sorted top-to-bottom. Each line's tokens carry
`:kind` (`:text | :link | :email | :image`) and `:shape`.
## Convenience shapes
Pass `:shape` if you only want one slice without building the full struct:
- `:text` → `[String.t()]` (plain text per page)
- `:shapes` → `[%Pdf.Reader.Shape{}]` (links/emails/images flat)
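For example:

    {:ok, result, doc} = Pdf.Reader.read(doc)                  # full %Pdf.Reader.Result{}
    {:ok, texts, doc} = Pdf.Reader.read(doc, shape: :text)     # [String.t()] per page
    {:ok, shapes, _doc} = Pdf.Reader.read(doc, shape: :shapes) # flat shape list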
## Line tokenisation opts
- `:y_tolerance` (default `2.0`) — PDF point tolerance to collapse
text runs onto the same line.
- `:gap_factor` (default `1.15`) — token-split threshold as a
  multiplier on the per-line 75th-percentile inter-glyph gap.
  Forwarded to `read_lines/2`.
## Image opts
- `:image_bytes` (default `false`) — when `true`, image tokens carry
the raw decoded `:bytes` in `meta` alongside the always-present
`:data_uri`. Off by default to keep the result lightweight; turn
on if the caller needs the binary (e.g. to write images to disk
or run a QR decoder).
## Dictionary split
- `:dictionary` (default `nil`) — when set, runs an additional
post-pass that splits glued lowercase tokens at boundaries where
BOTH halves are valid dictionary words (e.g. `"iniciode"` →
`"inicio"` + `"de"`). Accepts:
- `:es` — bundled 10k Spanish wordlist
(`Pdf.Reader.Wordlist.spanish/0`, MIT-licensed)
- `%MapSet{}` — caller-supplied wordlist of lowercase strings
- `nil` — disabled
URLs/emails and tokens with digits or special chars are exempted.
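For example (actual splits depend on the wordlist contents):

    {:ok, result, doc} = Pdf.Reader.read(doc, dictionary: :es)
    custom = MapSet.new(["inicio", "de"])
    {:ok, result, _doc} = Pdf.Reader.read(doc, dictionary: custom)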
## Spec references
- PDF 1.7 § 7.7.3 — Page Tree
- PDF 1.7 § 8.9 — Images (XObject /Subtype /Image)
- PDF 1.7 § 9.4 — Text objects
- PDF 1.7 § 12.5.6.5 — Link Annotations
- PDF 1.7 § 12.6.4 — Action types (URI, GoTo, Launch, Named)
- PDF 1.7 § 14.3 — Document Information Dictionary + XMP
"""
@spec read(Document.t(), keyword()) ::
{:ok, term(), Document.t()} | {:error, reason()}
def read(%Document{} = doc, opts \\ []) do
case Keyword.get(opts, :shape, :document) do
:document -> read_full_document(doc, opts)
:text -> read_text(doc, opts)
:shapes -> read_shapes(doc)
other -> {:error, {:unknown_shape, other}}
end
end
@doc """
Bang variant of `read/2`. Raises `Pdf.Reader.Error` on failure.
"""
@spec read!(Document.t(), keyword()) :: term()
def read!(doc, opts \\ []) do
case read(doc, opts) do
{:ok, value, _doc} -> value
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
# Default shape — assembles the full %Pdf.Reader.Result{} from
# text lines + shapes + images + Info/XMP metadata, partitioned by
# page. Lines and page_count are required (the document is unusable
# without them). Metadata, shapes, and images are best-effort — if
# the PDF was hand-edited or has corrupt streams we still return
# a valid Result with empty lists for the broken layer instead of
# failing the whole extraction.
defp read_full_document(doc, opts) do
with {:ok, lines, doc1} <- read_lines(doc, opts),
{:ok, page_count} <- page_count(doc1) do
{info, doc2} = safe_read(doc1, &read_metadata/1, %{})
{shapes, doc3} = safe_read(doc2, &read_shapes/1, [])
{images, doc4} = safe_read(doc3, &read_images/1, [])
include_bytes = Keyword.get(opts, :image_bytes, false)
enriched_text_lines = attach_shapes_to_tokens(lines, shapes)
image_lines = Enum.map(images, &image_to_synthetic_line(&1, include_bytes))
all_lines =
(enriched_text_lines ++ image_lines)
|> Enum.sort_by(fn line -> {line.page, -line.y} end)
pages = build_result_pages(all_lines, page_count)
meta = build_result_meta(doc4, info, page_count)
{:ok, %Pdf.Reader.Result{meta: meta, pages: pages}, doc4}
end
end
# Run an optional read step. If it fails, return the default value
# and pass the doc through unchanged so subsequent steps still try.
defp safe_read(doc, fun, default) do
case fun.(doc) do
{:ok, value, new_doc} -> {value, new_doc}
{:error, _} -> {default, doc}
end
end
# Group lines by their :page field; emit one Result.Page per page index
# in the document, even when a page has no extractable lines.
defp build_result_pages(lines, page_count) do
by_page = Enum.group_by(lines, & &1.page)
for page_num <- 1..page_count do
page_lines = Map.get(by_page, page_num, [])
%Pdf.Reader.Result.Page{
number: page_num,
meta: %{},
lines: page_lines
}
end
end
# Document-level metadata: normalise the standard Info-dict keys
# (PDF 1.7 § 14.3.3) to atom keys, keep the raw map for vendor extras,
# and add reader-derived fields (page_count, version, encrypted, log).
defp build_result_meta(%Document{} = doc, info, page_count) do
%{
title: Map.get(info, "Title"),
author: Map.get(info, "Author"),
subject: Map.get(info, "Subject"),
keywords: Map.get(info, "Keywords"),
creator: Map.get(info, "Creator"),
producer: Map.get(info, "Producer"),
creation_date: Map.get(info, "CreationDate"),
mod_date: Map.get(info, "ModDate"),
page_count: page_count,
version: doc.version,
encrypted: doc.encryption != nil,
recovery_log: recovery_log(doc),
raw: info
}
end
# Converts a raster image into a one-token synthetic line so the unified
# `read/2` output surfaces it inline at its position. The token's
# `:shape` carries format, dimensions, and a ready-to-use `:data_uri`
# (RFC 2397) so callers can drop it into `<img src="...">` without
# further work. The raw `:bytes` are only attached when the caller
# passes `image_bytes: true` to `read/2` — by default the result is
# kept lightweight.
#
# JPEG passthrough is straightforward (bytes are a complete JFIF file).
# `:png_like` bytes are decompressed pixel data — we wrap them in a real
# PNG container (PNG 1.2 § 5: signature + IHDR + IDAT + IEND, with
# filter byte 0 prepended to each scanline and zlib-compressed) so the
# data_uri is browser-loadable. Color type is inferred from
# `byte_size / (width × height)`: 1=gray, 2=gray+alpha, 3=RGB, 4=RGBA.
defp image_to_synthetic_line(%Pdf.Reader.Image{} = img, include_bytes) do
rect = {img.x, img.y, img.x + img.render_width, img.y + img.render_height}
{data_uri, encoded_format} = build_data_uri(img)
base_meta = %{
format: img.kind,
encoded_format: encoded_format,
width: img.width,
height: img.height,
render_width: img.render_width,
render_height: img.render_height,
byte_size: byte_size(img.bytes),
data_uri: data_uri
}
meta = if include_bytes, do: Map.put(base_meta, :bytes, img.bytes), else: base_meta
shape = %Pdf.Reader.Shape{
type: :image,
page: img.page,
rect: rect,
target: img.ref,
text: nil,
source: :embedded,
meta: meta
}
token = %{
x: img.x,
text: "",
width: img.render_width,
kind: :image,
shape: shape
}
%Pdf.Reader.Line{
page: img.page,
y: img.y,
x: img.x,
text: "",
tokens: [token]
}
end
# JPEG → passthrough; png_like → wrap pixels in a real PNG.
# Returns {data_uri, encoded_format} where encoded_format is the MIME
# subtype that the data_uri actually carries.
defp build_data_uri(%Pdf.Reader.Image{kind: :jpeg, bytes: bytes}) do
{"data:image/jpeg;base64," <> Base.encode64(bytes), :jpeg}
end
defp build_data_uri(%Pdf.Reader.Image{kind: :png_like} = img) do
case encode_png(img) do
{:ok, png} -> {"data:image/png;base64," <> Base.encode64(png), :png}
:error -> {nil, nil}
end
end
defp build_data_uri(_), do: {nil, nil}
# PNG signature (PNG 1.2 § 5.2)
@png_signature <<137, 80, 78, 71, 13, 10, 26, 10>>
defp encode_png(%Pdf.Reader.Image{bytes: pixels, width: w, height: h}) when w > 0 and h > 0 do
width = trunc(w)
height = trunc(h)
total = byte_size(pixels)
case rem(total, width * height) do
0 ->
channels = div(total, width * height)
color_type = png_color_type(channels)
if color_type != nil do
ihdr =
<<width::32, height::32, 8::8, color_type::8, 0::8, 0::8, 0::8>>
row_size = width * channels
filtered = prepend_filter_bytes(pixels, row_size)
idat = :zlib.compress(filtered)
{:ok,
@png_signature <>
png_chunk("IHDR", ihdr) <>
png_chunk("IDAT", idat) <>
png_chunk("IEND", "")}
else
:error
end
_ ->
:error
end
end
defp encode_png(_), do: :error
defp png_color_type(1), do: 0
defp png_color_type(2), do: 4
defp png_color_type(3), do: 2
defp png_color_type(4), do: 6
defp png_color_type(_), do: nil
# Each scanline gets a filter byte (0x00 = None per PNG 1.2 § 9) prepended.
defp prepend_filter_bytes(pixels, row_size) do
for <<row::binary-size(row_size) <- pixels>>, into: <<>>, do: <<0::8, row::binary>>
end
# PNG chunk: length(4) + type(4) + data + CRC32(4) over type+data.
defp png_chunk(type, data) when is_binary(type) and is_binary(data) do
crc = :erlang.crc32(type <> data)
<<byte_size(data)::32, type::binary, data::binary, crc::32>>
end
@doc """
Pure helper: enriches each token in a `Line` list with `:kind` and
`:shape`. Tokens without an overlapping shape get `kind: :text` and
`shape: nil`. Tokens overlapping a shape get the shape attached and
`:kind` derived from `shape.type`:
- `:uri | :goto | :launch | :named` → `:link`
- `:email` → `:email`
A shape "contains" a token when:
- The shape and the line are on the same page.
- The shape's X range overlaps the token's X range.
- The line's Y falls within the shape's Y extent, expanded by ±2 points.
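For example, with hand-built structs (positions chosen to satisfy the
rules above):

    line = %Pdf.Reader.Line{
      page: 1, y: 700.0, x: 72.0, text: "site",
      tokens: [%{x: 72.0, text: "site", width: 40.0}]
    }
    shape = %Pdf.Reader.Shape{
      type: :uri, page: 1, rect: {70.0, 698.0, 120.0, 702.0},
      target: "https://example.com", text: nil, source: :inferred
    }
    [enriched] = Pdf.Reader.attach_shapes_to_tokens([line], [shape])
    hd(enriched.tokens).kind
    # => :link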
Spec references:
- PDF 1.7 § 12.5.6.5 — Link Annotations (rect semantics)
- PDF 1.7 § 12.6.4 — Action types (URI/GoTo/Launch/Named)
"""
@spec attach_shapes_to_tokens([Pdf.Reader.Line.t()], [Pdf.Reader.Shape.t()]) ::
[Pdf.Reader.Line.t()]
def attach_shapes_to_tokens(lines, shapes) when is_list(lines) and is_list(shapes) do
shapes_by_page = Enum.group_by(shapes, & &1.page)
Enum.map(lines, fn %Pdf.Reader.Line{} = line ->
page_shapes = Map.get(shapes_by_page, line.page, [])
new_tokens =
Enum.map(line.tokens, fn token ->
shape = find_overlapping_shape(token, line, page_shapes)
token
|> Map.put(:shape, shape)
|> Map.put(:kind, kind_from_shape(shape))
end)
%{line | tokens: new_tokens}
end)
end
# Kind derived from shape.type, mapping action-like types to :link.
defp kind_from_shape(nil), do: :text
defp kind_from_shape(%Pdf.Reader.Shape{type: :email}), do: :email
defp kind_from_shape(%Pdf.Reader.Shape{type: type}) when type in [:uri, :goto, :launch, :named],
do: :link
defp kind_from_shape(_), do: :text
defp find_overlapping_shape(token, %Pdf.Reader.Line{y: line_y}, shapes) do
token_x_start = token.x
token_x_end = token.x + Map.get(token, :width, 0.0)
Enum.find(shapes, fn shape ->
case shape.rect do
nil ->
false
{sx1, sy1, sx2, sy2} ->
x_lo = min(sx1, sx2)
x_hi = max(sx1, sx2)
y_lo = min(sy1, sy2) - 2.0
y_hi = max(sy1, sy2) + 2.0
line_y >= y_lo and line_y <= y_hi and
token_x_end >= x_lo and token_x_start <= x_hi
end
end)
end
@doc """
Returns text runs with absolute positions for all pages.
Walks each page, decodes its content stream(s), and returns a flat list of
`%Pdf.Reader.TextRun{}` structs ordered by page then appearance in the
content stream.
Returns `{:ok, [], doc}` when no text is found. `:no_text_found` is never
returned as an error; per the spec resolution, empty is valid.
The returned `doc` carries an updated `:recovery_log` when opened with
`recover: true` — callers should pass the returned doc to `recovery_log/1`
to inspect per-page failures.
Form XObjects (`Do` operators referencing `/Subtype /Form`) ARE recursed
into transparently (see "Form XObject recursion" in the module docs);
cycle detection and the depth cap of 8 apply.
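## Example

    {:ok, runs, doc} = Pdf.Reader.read_text_with_positions(doc)
    Enum.take(runs, 1)
    # => [%Pdf.Reader.TextRun{page: 1, x: 72.0, y: 708.0, text: "R", size: 10.0}]
    # (illustrative values; see `Pdf.Reader.TextRun` for the full struct)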
"""
@spec read_text_with_positions(Document.t()) ::
{:ok, [Pdf.Reader.TextRun.t()], Document.t()} | {:error, reason()}
def read_text_with_positions(%Document{} = doc) do
with {:ok, page_refs, doc2} <- Page.list_refs(doc) do
collect_text_runs(page_refs, doc2, 1, [])
end
end
@doc """
Reconstructs logical text lines from the page's `TextRun`s.
Many machine-generated PDFs (government forms, tax documents) place
glyphs individually with TJ + per-glyph kerning, producing one
`TextRun` per character. This function coalesces those runs into a
list of `Pdf.Reader.Line` structs, where each line carries:
- `:page`, `:y`, `:x` — absolute position in user space
- `:text` — the joined text with single spaces between tokens
- `:tokens` — `[%{x, text, width}]` separated by visible whitespace
The token list lets callers detect column layouts (e.g. table rows
where every line has tokens at the same X positions).
## Options
- `:y_tolerance` (default `2.0`) — runs whose Y differs by less than
this many points collapse onto the same line. PDFs often jitter
by fractional points within a line.
- `:gap_factor` (default `1.15`) — split into a new token when the
  horizontal gap between two consecutive runs exceeds the **75th
  percentile** of the inter-glyph gaps on that line, multiplied by
  `gap_factor`. Using a per-line percentile makes detection robust
  across fonts and sizes: monospace 4pt advances split at ~4.6pt,
  6pt advances split at ~6.9pt, etc. Lower factor = more splits.
  Falls back to `font_size × gap_factor` when a line has fewer than
  two runs (no gap to measure).
Returns `{:ok, [Line.t()], doc}`. Lines are ordered by page ascending,
then by Y descending (top-to-bottom in PDF user space).
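For example, token X positions can expose table-like columns (a sketch):

    {:ok, lines, _doc} = Pdf.Reader.read_lines(doc)

    lines
    |> Enum.group_by(fn line -> Enum.map(line.tokens, &Float.round(&1.x, 0)) end)
    |> Enum.filter(fn {_columns, group} -> length(group) > 2 end)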
## Spec references
- PDF 1.7 § 9.4 — Text objects
- PDF 1.7 § 9.4.4 — Text-showing operators
"""
@spec read_lines(Document.t(), keyword()) ::
{:ok, [Pdf.Reader.Line.t()], Document.t()} | {:error, reason()}
def read_lines(%Document{} = doc, opts \\ []) do
with {:ok, runs, doc2} <- read_text_with_positions(doc) do
{:ok, lines_from_runs(runs, opts), doc2}
end
end
@doc """
Bang variant of `read_lines/2`. Raises `Pdf.Reader.Error` on failure.
"""
@spec read_lines!(Document.t(), keyword()) :: [Pdf.Reader.Line.t()]
def read_lines!(doc, opts \\ []) do
case read_lines(doc, opts) do
{:ok, lines, _doc} -> lines
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
@doc """
Returns the actionable elements (link-like shapes) of the document.
Combines two sources:
- **Annotations** of subtype `/Link` (PDF 1.7 § 12.5.6.5) — real
clickable regions placed by the document author. Each becomes a
`%Pdf.Reader.Shape{source: :annotation}`.
- **Inferred shapes** — URL and email patterns appearing as plain
text in `read_lines/2` output. Common in government forms that
print `http://...` or `email@domain` without making them clickable.
Each becomes `%Pdf.Reader.Shape{source: :inferred}`.
Returns `{:ok, shapes, doc}`. Shapes are sorted by `:page` ascending,
then by `:y` descending (top-to-bottom) when a rect is available.
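For example:

    {:ok, shapes, _doc} = Pdf.Reader.read_shapes(doc)
    urls = for %Pdf.Reader.Shape{type: :uri, target: url} <- shapes, do: url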
## Spec references
- PDF 1.7 § 12.5.6.5 — Link Annotations
- PDF 1.7 § 12.6.4 — Action types (URI, GoTo, Launch, Named)
- RFC 3986 § 3 — URI Generic Syntax
- RFC 5321 § 4.1.2 — Mailbox/Domain syntax (mailto)
"""
@spec read_shapes(Document.t()) ::
{:ok, [Pdf.Reader.Shape.t()], Document.t()} | {:error, reason()}
def read_shapes(%Document{} = doc) do
with {:ok, anns, doc2} <- read_annotations(doc),
{:ok, lines, doc3} <- read_lines(doc2) do
annotation_shapes = Enum.flat_map(anns, &annotation_to_shape/1)
inferred_shapes = shapes_from_lines(lines)
shapes =
(annotation_shapes ++ inferred_shapes)
|> Enum.sort_by(fn s ->
{s.page, -shape_y(s)}
end)
{:ok, shapes, doc3}
end
end
@doc """
Bang variant of `read_shapes/1`. Raises `Pdf.Reader.Error` on failure.
"""
@spec read_shapes!(Document.t()) :: [Pdf.Reader.Shape.t()]
def read_shapes!(doc) do
case read_shapes(doc) do
{:ok, shapes, _doc} -> shapes
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
@doc """
Pure helper: scans a list of `Line` structs for URL and email patterns
and emits the inferred shapes. Exposed for callers that already have
a lines list and want the inference layer alone (no annotations).
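For example:

    {:ok, lines, _doc} = Pdf.Reader.read_lines(doc)
    inferred = Pdf.Reader.shapes_from_lines(lines)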
"""
@spec shapes_from_lines([Pdf.Reader.Line.t()]) :: [Pdf.Reader.Shape.t()]
def shapes_from_lines(lines) when is_list(lines) do
Enum.flat_map(lines, &infer_shapes_in_line/1)
end
# URI / email regexes, simplified per RFC 3986 § 3 (URI Generic Syntax)
# and RFC 5321 § 4.1.2 (Mailbox). The character classes are deliberately
# broad; trailing punctuation that almost certainly isn't part of the
# URI is stripped afterwards by trim_trailing_punct/1.
@url_regex ~r|https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+|u
@www_regex ~r|www\.[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+|u
@email_regex ~r|[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}|u
defp infer_shapes_in_line(%Pdf.Reader.Line{} = line) do
Enum.flat_map(line.tokens, fn token -> infer_shapes_in_token(token, line) end)
end
defp infer_shapes_in_token(%{text: text} = token, line) do
email = Regex.run(@email_regex, text)
url_https = Regex.run(@url_regex, text)
url_www = Regex.run(@www_regex, text)
cond do
email != nil ->
[build_inferred_shape(:email, hd(email), token, line)]
url_https != nil ->
[build_inferred_shape(:uri, hd(url_https) |> trim_trailing_punct(), token, line)]
url_www != nil ->
[build_inferred_shape(:uri, hd(url_www) |> trim_trailing_punct(), token, line)]
true ->
[]
end
end
defp build_inferred_shape(type, target, token, line) do
%Pdf.Reader.Shape{
type: type,
page: line.page,
rect: {token.x, line.y, token.x + token.width, line.y},
target: target,
text: target,
source: :inferred
}
end
# URLs commonly appear as the last word in a sentence: "see http://x.com."
# Strip trailing punctuation that is unlikely to be part of the URI.
defp trim_trailing_punct(uri) do
uri
|> String.trim_trailing(".")
|> String.trim_trailing(",")
|> String.trim_trailing(";")
|> String.trim_trailing(":")
|> String.trim_trailing(")")
|> String.trim_trailing("]")
end
# Convert a /Link annotation to one or more shapes.
# PDF 1.7 § 12.5.6.5: a Link annotation may carry a URI action (`:url`),
# a GoTo destination (`:dest_page`), or other action types we don't
# currently model. We surface what we have.
defp annotation_to_shape(%Pdf.Reader.Annotation{type: :link} = ann) do
cond do
ann.url != nil and is_binary(ann.url) ->
type = if String.contains?(ann.url, "@") and not String.contains?(ann.url, "://"),
do: :email,
else: :uri
[
%Pdf.Reader.Shape{
type: type,
page: ann.page || 1,
rect: ann.rect,
target: ann.url,
text: ann.contents,
source: :annotation
}
]
ann.dest_page != nil ->
[
%Pdf.Reader.Shape{
type: :goto,
page: ann.page || 1,
rect: ann.rect,
target: %{page: ann.dest_page},
text: ann.contents,
source: :annotation
}
]
true ->
[]
end
end
defp annotation_to_shape(_), do: []
# Y for sorting — top of rect (highest y in PDF user-space) when present.
defp shape_y(%Pdf.Reader.Shape{rect: nil}), do: 0.0
defp shape_y(%Pdf.Reader.Shape{rect: {_x1, y1, _x2, y2}}), do: max(y1, y2) * 1.0
@doc """
Pure helper: groups a flat `TextRun` list into `Line` structs.
Exposed publicly so callers who already have a runs list (from
`read_text_with_positions/1` or hand-crafted in tests) can reuse the
grouping logic without reopening the document.
See `read_lines/2` for option semantics.
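For example, with hand-crafted runs (illustrative values; only the fields
this module reads are set):

    runs = [
      %Pdf.Reader.TextRun{page: 1, x: 72.0, y: 700.0, text: "Hello", size: 10.0},
      %Pdf.Reader.TextRun{page: 1, x: 110.0, y: 700.4, text: "world", size: 10.0}
    ]
    [line] = Pdf.Reader.lines_from_runs(runs)
    line.text
    # => "Hello world"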
"""
@spec lines_from_runs([Pdf.Reader.TextRun.t()], keyword()) :: [Pdf.Reader.Line.t()]
def lines_from_runs(runs, opts \\ []) when is_list(runs) do
y_tol = Keyword.get(opts, :y_tolerance, 2.0)
# 1.15 matches the documented default in `read_lines/2` and `read/2`.
gap_factor = Keyword.get(opts, :gap_factor, 1.15)
dictionary = Pdf.Reader.Wordlist.resolve(Keyword.get(opts, :dictionary))
runs
|> Enum.reject(&(&1.text == ""))
|> Enum.group_by(& &1.page)
|> Enum.sort_by(fn {page, _} -> page end)
|> Enum.flat_map(fn {page, page_runs} ->
page_runs
|> bucket_by_y(y_tol)
|> Enum.map(&build_line(page, &1, gap_factor, dictionary))
end)
end
# Sort by Y descending (PDF Y goes up — top of page = highest Y),
# then walk the sorted list, opening a new bucket whenever the Y drop
# exceeds the tolerance.
#
# Buckets are accumulated by prepend for O(1) growth, then reversed so
# the runs within each bucket retain their parser-emit order. This is
# critical for PDFs that use a single `TJ` operator across many glyphs:
# all glyph runs share the same starting X, so the X-sort below cannot
# recover natural reading order — the parser order is the only signal.
defp bucket_by_y(runs, y_tol) do
runs
|> Enum.sort_by(& &1.y, :desc)
|> Enum.reduce([], fn run, acc ->
case acc do
[] ->
[[run]]
[[head | _] = current | rest] ->
if abs(run.y - head.y) <= y_tol do
[[run | current] | rest]
else
[[run], current | rest]
end
end
end)
|> Enum.reverse()
|> Enum.map(&Enum.reverse/1)
end
defp build_line(page, line_runs, gap_factor, dictionary) do
sorted = sort_by_x_with_parser_tiebreaker(line_runs)
tokens =
sorted
|> tokenize_runs(gap_factor)
|> expand_label_colons()
|> expand_with_dictionary(dictionary)
text = render_line_text(tokens)
first = List.first(sorted)
%Pdf.Reader.Line{
page: page,
y: first.y,
x: first.x,
text: text,
tokens: tokens
}
end
# Join tokens into a single string. The simple `Enum.join(" ")` would
# produce "personales ,puede" (extra space before comma) when the
# dict-split recursion separates punctuation. Skip the leading space
# when the next token starts with punctuation, and the trailing space
# when the previous token ends with an opening bracket / quote.
defp render_line_text([]), do: ""
defp render_line_text([first | rest]) do
Enum.reduce(rest, first.text, fn token, acc ->
cond do
acc == "" -> token.text
token.text == "" -> acc
no_space_before?(token.text) -> acc <> token.text
String.ends_with?(acc, ["¡", "¿", "(", "[", "\""]) -> acc <> token.text
true -> acc <> " " <> token.text
end
end)
end
defp no_space_before?(text) do
String.starts_with?(text, [",", ".", ";", ":", "!", "?", ")", "]", "}"])
end
# Sort runs by X but preserve parser-emit order when X positions are
# close (within ~0.75 em). This handles PDFs that emit text with small
# backward jumps — common in label/value layouts where the producer
# writes the label, then jumps slightly back to overlap the value.
# Pure X-sort would scramble the chars (e.g. "Territorial:" + "S" with
# `S.x < :.x` produced "TerritorialS:OLIDARIDAD" instead of
# "Territorial:SOLIDARIDAD"). Using a bin size proportional to the
# font size keeps column-aligned tables intact (column gaps are
# always >> 1 em apart so they fall in different bins) while
# collapsing close-X cluster jitter to parser order.
defp sort_by_x_with_parser_tiebreaker(runs) do
bin_size = bin_size_for_runs(runs)
runs
|> Enum.with_index()
|> Enum.sort_by(fn {r, idx} -> {trunc(r.x / bin_size), idx} end)
|> Enum.map(&elem(&1, 0))
end
defp bin_size_for_runs([]), do: 8.0
defp bin_size_for_runs(runs) do
size =
runs
|> Enum.map(& &1.size)
|> Enum.max(fn -> 8.0 end)
max(size * 0.75, 1.0)
end
# Split a sorted-by-X list of runs into tokens. Threshold is computed
# per-line as `p75(gaps) * gap_factor`, falling back to
# `font_size * gap_factor` when there are fewer than two runs (no
# gaps to sample). Percentile-based detection makes word-break
# detection robust across font sizes and per-glyph advance
# variations — a monospace 4pt-advance line splits at ~4.6pt
# (catching subtle word boundaries) while a 12pt-advance line
# tolerates wider intra-token gaps.
defp tokenize_runs([], _factor), do: []
defp tokenize_runs(runs, factor) do
# Threshold computed from NON-whitespace gaps only — whitespace chars
# produce systematically smaller gaps that would skew the percentile.
non_space = Enum.reject(runs, &whitespace_run?/1)
threshold = compute_split_threshold(non_space, factor)
runs
|> Enum.chunk_while(
[],
fn run, acc ->
cond do
# Whitespace-only runs (text == " ", "\t", etc.) are authoritative
# token boundaries — when the producer emits a literal space
# character we trust it absolutely, regardless of gap math.
# Drop the whitespace run itself; the boundary is its mere
# presence. This handles "OMAR ALEXIS JUAN PEREZ" where the
# PDF emits actual " " glyphs between words.
whitespace_run?(run) ->
if acc == [], do: {:cont, []}, else: {:cont, Enum.reverse(acc), []}
acc == [] ->
{:cont, [run]}
true ->
[prev | _] = acc
gap = run.x - prev.x
cond do
# Forward jump bigger than the per-line p75 threshold: word break.
gap > threshold ->
{:cont, Enum.reverse(acc), [run]}
# Backward jump > 1pt: the producer is overlapping a label
# with a value (common in label/value layouts where the
# writer emits "Territorial:" then jumps slightly back to
# start "SOLIDARIDAD" underneath the colon). Always treat
# as a token boundary — typographic kerning is at most
# ~0.4pt for 8pt fonts.
gap < -1.0 ->
{:cont, Enum.reverse(acc), [run]}
true ->
{:cont, [run | acc]}
end
end
end,
fn
[] -> {:cont, []}
acc -> {:cont, Enum.reverse(acc), []}
end
)
|> Enum.map(&run_chunk_to_token/1)
end
defp whitespace_run?(%{text: text}), do: String.trim(text) == ""
defp whitespace_run?(_), do: false
# Per-line threshold computation. Two regimes depending on whether
# the producer emits text per-glyph (one run per char, like csf.pdf
# → 2079 runs) or per-word/per-Tj (each run holds a complete word
# or phrase, like csf2.pdf → 335 runs):
#
# 1. **Per-word layout** (avg run length ≥ 3 chars): each run is
# already a logical token — adjacent runs are different words by
# definition. Use a tiny threshold (1pt) so every visible gap
# splits. The token boundaries are the run boundaries.
#
# 2. **Per-glyph layout** (avg < 3 chars): chars come one-by-one and
# we need the gap distribution to detect word boundaries within
# a stream of glyphs. Use the 75th-percentile heuristic. Outliers
# (column gaps) are clamped via `Enum.min/2` so they don't
# inflate the threshold above what works for word-level gaps.
defp compute_split_threshold(runs, factor) do
gaps =
runs
|> Enum.chunk_every(2, 1, :discard)
|> Enum.map(fn [a, b] -> b.x - a.x end)
|> Enum.filter(&(&1 > 0))
cond do
length(gaps) < 3 ->
size = (List.first(runs) || %{size: 8.0}).size
max(size, 1.0) * factor
per_word_layout?(runs) ->
# Each run is already a word — split at every visible gap.
1.0
true ->
size = (List.first(runs) || %{size: 8.0}).size
sorted = Enum.sort(gaps)
p75_idx = min(trunc(length(sorted) * 0.75), length(sorted) - 1)
p75 = Enum.at(sorted, p75_idx)
# Clamp p75 to ≤ font_size × 3 so column-gap outliers can't
# inflate the threshold above word-break range.
clamped_p75 = min(p75, size * 3.0)
max(clamped_p75, 0.5) * factor
end
end
defp per_word_layout?(runs) do
total_chars = runs |> Enum.map(&String.length(&1.text)) |> Enum.sum()
total_chars / length(runs) >= 3.0
end
defp run_chunk_to_token([first | _] = chunk) do
text = chunk |> Enum.map(& &1.text) |> Enum.join("")
last = List.last(chunk)
width = max(last.x - first.x, 0.0)
%{x: first.x, text: text, width: width}
end
# Post-process tokens to recover word boundaries that the PDF producer
# collapsed by emitting glyphs with no intervening space character. Two
# passes, both opt-out for URIs/emails so we don't shred URLs:
#
# 1. Label-colon split: "Postal:77710" → "Postal:" + "77710". Common in
# forms where label and value share a single TJ chunk.
# 2. CamelCase split: "delMunicipio" → "del" + "Municipio",
# "OriginalSello" → "Original" + "Sello". A lowercase-followed-by-
# uppercase transition is almost always a glued word boundary in
# Spanish/English content.
#
# Token X positions are split proportionally to character count so the
# downstream layout stays roughly correct.
defp expand_label_colons(tokens) do
tokens
|> Enum.flat_map(&split_label_colon/1)
|> Enum.flat_map(&split_letter_digit/1)
|> Enum.flat_map(&split_camel_case/1)
end
# Split tokens at letter↔digit boundaries: "1Asalariado" → "1" +
# "Asalariado". Protect identifiers, base64 hashes, and URLs from
# over-splitting:
#
# - `looks_like_id?` keeps Mexican RFC/CURP shapes intact
# ("XAXX010101000", "MACA961017HQRRHM06").
# - Tokens with `/`, `+`, `=` are base64 / URI fragments — skip.
# - Tokens longer than 30 chars without separators are almost
# always opaque IDs.
defp split_letter_digit(%{text: text} = token) do
cond do
uri_like?(text) -> [token]
String.match?(text, ~r/[\/+=]/) -> [token]
looks_like_id?(text) -> [token]
String.match?(text, ~r/^\d+$/) -> [token]
String.match?(text, ~r/^[A-Za-zÀ-ÿà-ÿ]+$/u) -> [token]
true -> do_split_letter_digit(token)
end
end
defp do_split_letter_digit(token) do
parts =
Regex.split(
~r/(?<=[A-Za-zÀ-ÿà-ÿ])(?=\d)|(?<=\d)(?=[A-Za-zÀ-ÿà-ÿ])/u,
token.text
)
if length(parts) <= 1, do: [token], else: parts_to_tokens(token, parts)
end
# Mexican RFC/CURP shapes and similar opaque alphanumeric IDs:
# 3-5 uppercase letters + 6 digits + 0-10 mixed alphanumeric.
defp looks_like_id?(text) do
String.match?(text, ~r/^[A-Z]{3,5}\d{6}[A-Z0-9]{0,10}$/)
end
# Split a single token's text into N tokens by character-count
# proportion, preserving X by accumulating widths along the way.
defp parts_to_tokens(token, parts) do
total_chars = max(String.length(token.text), 1)
{tokens_rev, _x_offset} =
Enum.reduce(parts, {[], 0.0}, fn part, {acc, x_offset} ->
chars = String.length(part)
width = token.width * chars / total_chars
new_token =
Map.merge(token, %{
text: part,
x: token.x + x_offset,
width: width
})
{[new_token | acc], x_offset + width}
end)
Enum.reverse(tokens_rev)
end
defp split_label_colon(%{text: text} = token) do
if uri_like?(text) do
[token]
else
case Regex.run(~r/^(.+:)([^\s:].+)$/, text) do
[_full, label, value] -> split_token_at(token, label, value)
_ -> [token]
end
end
end
defp split_camel_case(%{text: text} = token) do
cond do
uri_like?(text) ->
[token]
# Tokens that mix letters with digits or non-word chars are
# almost always identifiers, hashes, or base64 payloads — leave
# intact (e.g. "Y/RPVo/IWtn5M..." digital signatures).
String.match?(text, ~r/[0-9\/+=]/) ->
[token]
true ->
# Split at the FIRST lowercase→Uppercase transition where the
# tail is a word (Cap + lowercase), not an acronym (Cap+Cap).
# This keeps "delMunicipio" → "del Municipio" but leaves
# "idCIF" alone (CIF is an acronym).
case Regex.run(~r/^(.*?[a-zà-ÿ])([A-ZÀ-Ý][a-zà-ÿ].*)$/u, text) do
[_full, head, tail] ->
[first | rest] = split_token_at(token, head, tail)
[first | Enum.flat_map(rest, &split_camel_case/1)]
_ ->
[token]
end
end
end
# Generic token split helper: split a token's text at a boundary,
# apportioning width proportionally to character count.
defp split_token_at(token, head_text, tail_text) do
head_chars = String.length(head_text)
total_chars = max(String.length(token.text), 1)
head_width = token.width * head_chars / total_chars
tail_width = token.width - head_width
[
Map.merge(token, %{text: head_text, width: head_width}),
Map.merge(token, %{text: tail_text, x: token.x + head_width, width: tail_width})
]
end
defp uri_like?(text) do
String.contains?(text, "://") or
String.contains?(text, "@") or
String.starts_with?(text, "www.") or
String.starts_with?(text, "http")
end
# Dictionary-based split: when both halves of a glued lowercase token
# match a dictionary entry, treat that as a word boundary. Recursive
# on the tail to catch 3+ word concatenations ("tieneconsecuencias",
# "esoeslomismo" → "tiene"+"consecuencias", "eso"+"es"+"lo"+"mismo").
# Skipped for tokens with digits, slashes, special chars, or under 4
# chars — those are almost always identifiers, dates, or already-correct.
defp expand_with_dictionary(tokens, nil), do: tokens
defp expand_with_dictionary(tokens, dict) do
Enum.flat_map(tokens, &dictionary_split(&1, dict))
end
defp dictionary_split(%{text: text} = token, dict) do
cond do
uri_like?(text) ->
[token]
# base64 / armored content — leave intact. We use `+` and `=` as
# the marker (slash `/` alone appears in legitimate Spanish like
# "y/o" so we DO want those tokens processed).
String.match?(text, ~r/[+=]/) ->
[token]
# Lots of slashes → likely path or base64; leave alone.
slash_count(text) > 2 ->
[token]
# Skip ALL-UPPERCASE tokens (proper names, acronyms, all-caps
# words). Case-folding to dict can produce nonsense splits like
# "CONSTANCIA" → "CON" + "STAN" + "CIA" because random short
# all-caps subsequences happen to be in subtitle-derived
# frequency lists. Real users want proper names left alone.
not String.match?(text, ~r/[a-zà-ÿ]/u) ->
[token]
true ->
# Capture optional leading punctuation (`"`, `¡`, `,`, etc.),
# the run of letters (≥ 4), and any trailing punctuation. The
# prefix sticks to the first word of the partition; the suffix
# is recursively dict-split if it still contains letters (so
# tokens like `¡denúnciala!Siconoces...` and `personales,puedeacudir...`
# process both halves around the embedded punctuation).
case Regex.run(~r/^([^A-ZÀ-Ýa-zà-ÿ]*)([A-ZÀ-Ýa-zà-ÿ]{4,})(.*)$/u, text) do
[_full, prefix, letters, suffix] ->
do_dictionary_split(token, prefix, letters, suffix, dict)
_ ->
[token]
end
end
end
defp do_dictionary_split(token, prefix, letters, suffix, dict) do
cond do
# CRITICAL: if the letter run is already a valid word
# (case-insensitive), leave it intact. Without this guard the
# greedy split would shred "personales" → "persona" + "les",
# "desde" → "des" + "de", "queja" → "que" + "ja", "Fecha" →
# "fec" + "ha", etc. Then RECURSE on the suffix so trailing
# content is still processed (e.g. for tokens like
# `personales,puedeacudiracualquier...` we keep `personales`
# whole AND split the post-comma chunk).
Pdf.Reader.Wordlist.member?(letters, dict) ->
emit_with_recursive_suffix(token, prefix <> letters, suffix, dict)
true ->
case partition_into_dict_words(letters, dict) do
[_single] ->
emit_with_recursive_suffix(token, prefix <> letters, suffix, dict)
words when length(words) >= 2 ->
if valid_partition?(words) do
[first | rest] = words
{head_words, last_letter_word} =
case Enum.reverse(rest) do
[last | mid_rev] -> {[prefix <> first | Enum.reverse(mid_rev)], last}
end
# Last partition word + suffix forms a sub-token that
# may itself contain more dict-splittable letters.
tail_tokens = recurse_split_tail(token, last_letter_word, suffix, dict)
parts_to_tokens(token, head_words ++ tail_tokens)
else
emit_with_recursive_suffix(token, prefix <> letters, suffix, dict)
end
:none ->
emit_with_recursive_suffix(token, prefix <> letters, suffix, dict)
end
end
end
# Emit `head` as one piece, then re-process `suffix` as a fresh
# dict_split call. If the suffix has no further letter runs the
# recursion bottoms out and we just attach it back to head.
defp emit_with_recursive_suffix(token, head, suffix, dict) do
if has_letter_run?(suffix) do
tail_token = %{token | text: suffix, x: shift_x(token, head)}
tail_pieces = dictionary_split(tail_token, dict)
[%{token | text: head} | tail_pieces]
else
[%{token | text: head <> suffix}]
end
end
# When a partition succeeds, the LAST partition piece + suffix may
# still need further splitting (e.g. for "...,puedeacudir..." after
# we kept "...," with the previous piece). We return a list of
# strings that `parts_to_tokens` will turn back into tokens.
defp recurse_split_tail(_token, last_letter_word, suffix, dict) do
if has_letter_run?(suffix) do
tail_text = last_letter_word <> suffix
case Regex.run(~r/^([^A-ZÀ-Ýa-zà-ÿ]*)([A-ZÀ-Ýa-zà-ÿ]{4,})(.*)$/u, suffix) do
[_full, _p2, _l2, _s2] ->
# Run a recursive dict_split on the suffix-only sub-token
# by piggybacking on the existing function.
sub_token = %{text: suffix, x: 0.0, width: 0.0}
sub_pieces =
sub_token
|> dictionary_split(dict)
|> Enum.map(& &1.text)
[last_letter_word | sub_pieces]
_ ->
[tail_text]
end
else
[last_letter_word <> suffix]
end
end
defp has_letter_run?(text), do: Regex.match?(~r/[A-ZÀ-Ýa-zà-ÿ]{4,}/u, text)
defp shift_x(token, head_text) do
total = max(String.length(token.text), 1)
head_chars = String.length(head_text)
token.x + token.width * head_chars / total
end
defp slash_count(text) do
text |> String.graphemes() |> Enum.count(&(&1 == "/"))
end
# Full-partition: returns a list of words such that `text == join(words)`
# AND every word is in `dict`. Returns `:none` when no such partition
# exists. This is far more conservative than a 2-way recursive split:
# it only fires when the ENTIRE token cleanly decomposes into known
# words. All candidate partitions are enumerated and `partition_score/2`
# picks the best (preferring fewer, longer pieces).
defp partition_into_dict_words(text, dict) do
text_lc = String.downcase(text)
case partition_lc_lengths(text_lc, dict) do
nil ->
:none
lengths ->
# Map lengths back to original-case slices (preserves capital
# initial like "Fecha" + "de" + "último" + ...).
{words_rev, _} =
Enum.reduce(lengths, {[], 0}, fn n, {acc, offset} ->
word = String.slice(text, offset, n)
{[word | acc], offset + n}
end)
Enum.reverse(words_rev)
end
end
# Tight closed list of Spanish articles/prepositions/pronouns allowed
# as short pieces in a partition. Anything else of length 2-3 chars
# is rejected — frequency dictionaries are full of meaningful but
# spurious 2-3 char entries (`dee`, `sta`, `do`, `des`, `tad`, `aj`,
# `us`, `lo`) that produce nonsense partitions of valid words like
# "Sueldos" → "Su"+"el"+"dos", "denuncia" → "den"+"un"+"cia",
# "Estatus" → "Esta"+"tus", "Localidad" → "Lo"+"calidad",
# "deestado" → "dee"+"sta"+"do" instead of "de"+"estado".
@short_connectors MapSet.new(
~w(de el la en si son del las los una uno con por sus que fin mes año día)
)
# 1-character Spanish words allowed as partition pieces. Without these,
# tokens like "Tributariosy" can never split because "y" is a single
# character and the partition's minimum length would otherwise be 2.
@one_char_connectors MapSet.new(~w(y o a e u))
defp partition_lc_lengths(text, dict) do
case enumerate_partitions(text, dict) do
[] ->
nil
partitions ->
text_len = String.length(text)
Enum.min_by(partitions, fn lengths -> partition_score(lengths, text_len) end)
end
end
# Score a candidate partition for the "best" pick. Lower is better.
# Behaviour depends on the original token length:
#
# - **Short tokens (< 15 chars)** — prefer FEWER 1-char pieces, then
# FEWER pieces overall. This rule fixes "dela" → "de"+"la"
# instead of "del"+"a".
#
# - **Long tokens (≥ 15 chars)** — prefer the LONGEST first piece
# (more dictionary coverage at the front), then FEWER 1-char
# pieces, then FEWER pieces overall. This fixes
# "conferidasalaautoridadfiscal" → "conferidas a la autoridad
# fiscal" (5 pieces, first=10) instead of "conferida sala
# autoridad fiscal" (4 pieces, first=9, but semantically wrong —
# "sala" means "room" and isn't the right segmentation).
defp partition_score(lengths, total_length) do
ones = Enum.count(lengths, &(&1 == 1))
first = List.first(lengths) || 0
if total_length >= 15 do
{-first, ones, length(lengths)}
else
{ones, -first, length(lengths)}
end
end
# Enumerate ALL valid lengths-lists [n1, n2, ...] such that
# text == join(slices) and every slice is dict-accepted. Tokens are
# short (≤ 30 chars typical) so the exponential worst case is bounded.
defp enumerate_partitions("", _dict), do: [[]]
defp enumerate_partitions(text, dict) do
len = String.length(text)
Enum.flat_map(len..1//-1, fn i ->
head = String.slice(text, 0, i)
if piece_acceptable?(head) and Pdf.Reader.Wordlist.member?(head, dict) do
rest = String.slice(text, i, len - i)
enumerate_partitions(rest, dict)
|> Enum.map(fn tail -> [i | tail] end)
else
[]
end
end)
end
defp piece_acceptable?(word) do
case String.length(word) do
n when n >= 4 -> true
1 -> MapSet.member?(@one_char_connectors, String.downcase(word))
_ -> MapSet.member?(@short_connectors, String.downcase(word))
end
end
# A partition is valid iff:
# 1. Every piece is either ≥ 4 characters OR in `@short_connectors`.
# 2. AT LEAST ONE piece is ≥ 4 characters (the anchor), UNLESS
# every piece is in the connector list — that lets glued
# function-word pairs like "finde" → "fin de", "dela" → "de la"
# split cleanly without dragging in spurious word boundaries.
defp valid_partition?(words) do
all_connectors_or_long = Enum.all?(words, &piece_acceptable?/1)
has_anchor = Enum.any?(words, &(String.length(&1) >= 4))
all_short =
Enum.all?(words, fn w ->
len = String.length(w)
len <= 3 and piece_acceptable?(w)
end)
# Reject ONLY 2-piece partitions whose first piece is a capitalized
# 2-3 char token. In Spanish, capitalized 2-piece concatenations
# almost never begin with a short connector — splitting them
# produces false positives like "Demarcación" → "De" + "marcación"
# or "Delante" → "De" + "lante". For 3+ piece partitions the
# multi-word context disambiguates ("Lacorrupcióntieneconsecuencias"
# → "La"+"corrupción"+"tiene"+"consecuencias" is the right call).
first_capital_short_two_piece =
length(words) == 2 and
case List.first(words) do
nil -> false
first -> String.length(first) <= 3 and first =~ ~r/^[A-ZÀ-Ý]/u
end
all_connectors_or_long and (has_anchor or all_short) and
not first_capital_short_two_piece
end
@doc """
Returns the plain text for each page as a list of strings.
Options:
- `:pages` — `[pos_integer]` to filter to specific 1-indexed page numbers.
Default: all pages.
Returns `{:ok, page_strings, doc}` where each element is the concatenated
text for one page. The returned `doc` carries an updated `:recovery_log`
when opened with `recover: true`. Unresolved glyphs appear as `U+FFFD`
(already encoded by the encoding cascade layer).
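## Example

    {:ok, [first_page | _rest], doc} = Pdf.Reader.read_text(doc)
    {:ok, page_two_only, _doc} = Pdf.Reader.read_text(doc, pages: [2])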
"""
@spec read_text(Document.t(), keyword()) ::
{:ok, [String.t()], Document.t()} | {:error, reason()}
def read_text(%Document{} = doc, opts \\ []) do
with {:ok, runs, updated_doc} <- read_text_with_positions(doc) do
pages_filter = Keyword.get(opts, :pages, :all)
runs_by_page =
runs
|> Enum.group_by(& &1.page)
if map_size(runs_by_page) == 0 do
{:ok, [], updated_doc}
else
page_nums =
case pages_filter do
:all -> runs_by_page |> Map.keys() |> Enum.sort()
list when is_list(list) -> list
end
texts =
  page_nums
  |> Enum.map(fn page_num ->
    page_runs = Map.get(runs_by_page, page_num, [])
    page_runs |> Enum.map(& &1.text) |> Enum.join(" ") |> String.trim()
  end)
  |> Enum.reject(&(&1 == ""))
{:ok, texts, updated_doc}
end
end
end
@doc """
Extracts images from all pages.
For each page, resolves the XObject references from content-stream `Do`
operators and classifies them as JPEG or PNG-like based on their `/Filter`.
Returns `{:ok, [], doc}` when no images are found. The returned `doc`
carries an updated `:recovery_log` when opened with `recover: true`.
"""
@spec read_images(Document.t()) ::
{:ok, [Pdf.Reader.Image.t()], Document.t()} | {:error, reason()}
def read_images(%Document{} = doc) do
with {:ok, page_refs, doc2} <- Page.list_refs(doc) do
collect_images(page_refs, doc2, 1, [])
end
end
@doc """
Bang variant of `open/2`. Raises `Pdf.Reader.Error` on failure.
"""
@spec open!(binary() | Path.t(), keyword()) :: Document.t()
def open!(path_or_binary, opts \\ []) do
case open(path_or_binary, opts) do
{:ok, doc} -> doc
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
@doc """
Bang variant of `read_metadata/1`. Raises `Pdf.Reader.Error` on failure.
"""
@spec read_metadata!(Document.t()) :: %{String.t() => String.t()}
def read_metadata!(doc) do
{:ok, meta, _doc} = read_metadata(doc)
meta
end
@doc """
Bang variant of `page_count/1`. Raises `Pdf.Reader.Error` on failure.
"""
@spec page_count!(Document.t()) :: pos_integer()
def page_count!(doc) do
case page_count(doc) do
{:ok, n} -> n
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
@doc """
Bang variant of `read_text_with_positions/1`. Raises `Pdf.Reader.Error` on failure.
"""
@spec read_text_with_positions!(Document.t()) :: [Pdf.Reader.TextRun.t()]
def read_text_with_positions!(doc) do
case read_text_with_positions(doc) do
{:ok, runs, _doc} -> runs
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
@doc """
Bang variant of `read_text/2`. Raises `Pdf.Reader.Error` on failure.
"""
@spec read_text!(Document.t(), keyword()) :: [String.t()]
def read_text!(doc, opts \\ []) do
case read_text(doc, opts) do
{:ok, texts, _doc} -> texts
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
@doc """
Bang variant of `read_images/1`. Raises `Pdf.Reader.Error` on failure.
"""
@spec read_images!(Document.t()) :: [Pdf.Reader.Image.t()]
def read_images!(doc) do
case read_images(doc) do
{:ok, images, _doc} -> images
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
@doc """
Extracts AcroForm interactive form fields from the document.
Walks the `/AcroForm /Fields` tree depth-first, emitting only leaf fields
as a flat list of `%Pdf.Reader.FormField{}` structs. Hierarchical names
(`/T` dot-joined from ancestor path) are resolved. `/FT` is inherited
downward from the nearest ancestor that defines it.
Returns `{:ok, [], doc}` when no `/AcroForm` is present or `/Fields` is empty.
Never returns `{:error, _}` for absent or empty AcroForms.
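## Example
{:ok, fields, _doc} = Pdf.Reader.read_acroform(doc)
# => [%Pdf.Reader.FormField{...}, ...]
# (a flat list of leaf fields with dot-joined hierarchical names)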
## Spec references
- PDF 1.7 § 12.7 (Interactive Forms)
- PDF 1.7 § 12.7.3 (Field Dictionaries)
- PDF 1.7 § 12.7.4 (Field Types)
"""
@spec read_acroform(Document.t()) ::
{:ok, [FormField.t()], Document.t()} | {:error, reason()}
def read_acroform(doc), do: AcroForm.read(doc)
@doc """
Bang variant of `read_acroform/1`. Raises `Pdf.Reader.Error` on failure.
"""
@spec read_acroform!(Document.t()) :: [FormField.t()]
def read_acroform!(doc) do
case read_acroform(doc) do
{:ok, fields, _} -> fields
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
@doc """
Extracts document outline (bookmarks) from the PDF catalog's `/Outlines` tree.
Walks the `/First`/`/Next` linked list at each nesting level, threading
`/Parent` for depth. Cycle detection via `MapSet` and a depth cap of 32
prevent hangs on corrupt PDFs.
Returns `{:ok, [], doc}` when no `/Outlines` entry is present — never an error.
## Spec references
- PDF 1.7 § 12.3.3 — Document Outline
- PDF 1.7 § 12.3.2 — Destinations
"""
@spec read_outlines(Document.t()) :: {:ok, [Outline.t()], Document.t()} | {:error, reason()}
def read_outlines(doc), do: Outlines.read(doc)
@doc """
Bang variant of `read_outlines/1`. Raises `Pdf.Reader.Error` on failure.
Returns the outlines list directly on success.
"""
@spec read_outlines!(Document.t()) :: [Outline.t()]
def read_outlines!(doc) do
case read_outlines(doc) do
{:ok, outlines, _} -> outlines
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
@doc """
Extracts all annotations from all pages in the document.
Enumerates every page via `Page.list_refs/1` and, for each page, resolves
its `/Annots` array. Supports 10 annotation subtypes:
`:link`, `:text`, `:highlight`, `:underline`, `:strikeout`, `:squiggly`,
`:square`, `:circle`, `:freetext`, `:file_attachment`. Other subtypes
surface as `:unknown` with raw fields preserved in `:kind_specific`.
Returns `{:ok, [], doc}` when no page has an `/Annots` array — never an error.
## Spec references
- PDF 1.7 § 12.5 — Annotations
- PDF 1.7 § 12.5.6.x — Annotation subtypes
- PDF 1.7 § 12.6 — Actions
"""
@spec read_annotations(Document.t()) ::
{:ok, [Annotation.t()], Document.t()} | {:error, reason()}
def read_annotations(doc), do: Annotations.read(doc)
@doc """
Bang variant of `read_annotations/1`. Raises `Pdf.Reader.Error` on failure.
Returns the annotations list directly on success.
"""
@spec read_annotations!(Document.t()) :: [Annotation.t()]
def read_annotations!(doc) do
case read_annotations(doc) do
{:ok, anns, _} -> anns
{:error, reason} -> raise Pdf.Reader.Error, reason
end
end
# ---------------------------------------------------------------------------
# Internal
# ---------------------------------------------------------------------------
defp do_open(binary, opts) when is_binary(binary) do
password = Keyword.get(opts, :password, "")
recover_mode = Keyword.get(opts, :recover, false)
with :ok <- check_header(binary),
{version, _} <- extract_version(binary),
{:ok, xref_entries, trailer, recovery_events} <-
load_xref_or_recover(binary, recover_mode) do
doc = %Document{
binary: binary,
version: version,
xref: xref_entries,
trailer: trailer.dict,
cache: %{},
page_refs: nil,
encryption: nil,
recover_mode: recover_mode,
recovery_log: Enum.reverse(recovery_events)
}
# R-4: when recover_mode is true, probe the page tree immediately so that
# any catalog-fallback recovery events (e.g. {:page_tree_recovered, n}) are
# present in the doc returned from open/2. If the probe succeeds via normal
# tree walk, no events are added. If it triggers the fallback branch, the
# recovered page refs are stored in doc.page_refs and the recovery_log is
# populated. Downstream callers (page_count/1, read_text/1) then use the
# cached page_refs and see the populated recovery_log.
doc2 =
if recover_mode do
case Page.list_refs(doc) do
{:ok, refs, updated_doc} ->
%{updated_doc | page_refs: refs}
{:error, _} ->
doc
end
else
doc
end
attempt_unlock(doc2, trailer, password)
end
end
# ---------------------------------------------------------------------------
# XRef loading with optional linear-scan recovery
#
# Attempt strict XRef load first. If it fails AND recover_mode is true,
# fall back to XRef.recover/1 (linear scan). Collect recovery events.
#
# Returns {:ok, entries, trailer, [recovery_event()]} or {:error, reason}.
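#
# e.g. a PDF whose tail is truncated (no %%EOF) yields
# {:ok, entries, trailer,
#  [{:eof_marker_missing, :linear_scan_used}, {:xref_recovered, 42}]}
# (the 42 is illustrative; it is the number of objects the scan found).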
# ---------------------------------------------------------------------------
defp load_xref_or_recover(binary, recover_mode) do
case Trailer.locate_startxref(binary) do
{:ok, startxref_offset} ->
case XRef.load(binary, startxref_offset) do
{:ok, entries, trailer} ->
{:ok, entries, trailer, []}
{:error, _reason} when recover_mode ->
# Strict xref load failed with a valid %%EOF but bad offset.
do_xref_linear_scan(binary)
{:error, _reason} = err ->
err
end
{:error, _} when recover_mode ->
# %%EOF missing — linear scan needed.
# Log :eof_marker_missing PLUS :xref_recovered. `do_xref_linear_scan/1`
# is total per `XRef.recover/1`'s spec (PDF 1.7 § 7.5.4), so destructure
# directly — a defensive case-clause would be dead code per Dialyzer.
{:ok, entries, trailer, events} = do_xref_linear_scan(binary)
{:ok, entries, trailer, [{:eof_marker_missing, :linear_scan_used} | events]}
{:error, _} = err ->
err
end
end
# Run XRef.recover/1 and wrap result with recovery event tuple.
defp do_xref_linear_scan(binary) do
case XRef.recover(binary) do
{:ok, entries, trailer} ->
n_objects = map_size(entries)
{:ok, entries, trailer, [{:xref_recovered, n_objects}]}
end
end
# Check that the binary begins with %PDF-
defp check_header(<<"%PDF-", _::binary>>), do: :ok
defp check_header(_), do: {:error, :not_a_pdf}
# Extract version string like "1.4", "1.7", "2.0"
defp extract_version(<<"%PDF-", rest::binary>>) do
# Split on the first CR/LF with :binary.split/2. Grapheme-level functions
# such as String.graphemes/1 cannot be used here because PDF binary content
# is not guaranteed to be valid UTF-8.
version =
rest
|> :binary.split(["\n", "\r"])
|> hd()
|> String.trim()
{version, rest}
end
defp extract_version(_), do: {"", ""}
# ---------------------------------------------------------------------------
# Encryption bootstrap — attempt_unlock/3
#
# Branches on the presence of /Encrypt in the trailer:
# - nil / :null → non-encrypted PDF, return doc as-is.
# - {:ref, n, g} or inline dict → resolve, parse, authenticate.
#
# R-ENC1, R-ENC2, R-ENC3, R-ENC4, R-ENC5, R-ENC6, R-ENC7, R-ENC8
# ---------------------------------------------------------------------------
# Non-encrypted PDF: no /Encrypt entry (or explicit /Encrypt null).
defp attempt_unlock(doc, %Trailer{encrypt: nil}, _password), do: {:ok, doc}
defp attempt_unlock(doc, %Trailer{encrypt: :null}, _password), do: {:ok, doc}
# Encrypted PDF: resolve the Encrypt dict, parse it, and authenticate.
defp attempt_unlock(doc, %Trailer{encrypt: encrypt_ref, id: id_pair}, password)
when not is_nil(encrypt_ref) do
# Resolve the Encrypt dict (may be indirect ref or inline dict).
# Design discovery #4: trailer.encrypt can be {:ref, n, g}, inline map, etc.
# Design discovery #5: ObjectResolver.resolve returns {:ok, value, updated_doc}.
with {:ok, encrypt_dict, doc2} <- resolve_encrypt_dict(doc, encrypt_ref),
doc_id <- extract_doc_id(id_pair),
{:ok, sh0} <- StandardHandler.parse(encrypt_dict, doc_id),
:ok <- check_version_supported(sh0),
{:ok, doc3} <- try_passwords(doc2, sh0, password) do
{:ok, doc3}
end
end
# Resolve the Encrypt dict: either follow a ref or use an inline dict directly.
defp resolve_encrypt_dict(doc, {:ref, _n, _g} = ref) do
case ObjectResolver.resolve(doc, ref) do
{:ok, dict, doc2} when is_map(dict) -> {:ok, dict, doc2}
{:ok, _other, _doc2} -> {:error, :malformed}
{:error, _} = err -> err
end
end
defp resolve_encrypt_dict(doc, dict) when is_map(dict) do
{:ok, dict, doc}
end
defp resolve_encrypt_dict(_doc, _other), do: {:error, :malformed}
# Extract the first element of the /ID array as the document ID binary.
# The parser emits hex strings as `{:hex_string, bin}` and literal strings as
# `{:string, bin}`; some legacy paths feed in already-decoded plain binaries.
# All three shapes must yield a binary. The extra clauses are kept even though
# `Trailer.extract_id/1` declares `[binary()]`: the values that actually flow
# here are wider than that declared type.
defp extract_doc_id([{:hex_string, bin} | _]) when is_binary(bin), do: bin
defp extract_doc_id([{:string, bin} | _]) when is_binary(bin), do: bin
defp extract_doc_id([first | _]) when is_binary(first), do: first
defp extract_doc_id(_), do: <<>>
# R-ENC3: verify /V is in the supported set
defp check_version_supported(%StandardHandler{version: v}) when v in [1, 2, 4, 5], do: :ok
defp check_version_supported(_), do: {:error, :encrypted_unsupported_handler}
# R-ENC4: always try empty password first.
# R-ENC5: if empty fails and non-empty supplied, try the supplied password.
# R-ENC6: both fail + non-empty supplied → :encrypted_wrong_password.
# R-ENC7: both fail + no password (or empty) → :encrypted_password_required.
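#
# Summary decision table (restates R-ENC4..R-ENC7; outcomes per the code below):
#   empty password unlocks        → {:ok, doc}
#   empty fails, supplied unlocks → {:ok, doc}
#   empty fails, supplied fails   → {:error, :encrypted_wrong_password}
#   empty fails, nothing supplied → {:error, :encrypted_password_required}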
defp try_passwords(doc, sh0, password) do
# Step 1: try empty password (always, per R-ENC4)
case Encryption.unlock("", sh0, doc) do
{:ok, sh} ->
{:ok, %{doc | encryption: sh}}
:error ->
# Empty password failed. Try caller-supplied if non-empty.
if password != "" do
case Encryption.unlock(password, sh0, doc) do
{:ok, sh} ->
{:ok, %{doc | encryption: sh}}
:error ->
{:error, :encrypted_wrong_password}
{:error, _} = err ->
err
end
else
{:error, :encrypted_password_required}
end
{:error, _} = err ->
err
end
end
# ---------------------------------------------------------------------------
# Metadata helpers
# ---------------------------------------------------------------------------
# Decode a PDF value to a plain string for metadata maps.
# Literal strings come as {:string, binary} from the parser.
# Hex strings come as {:hex_string, binary}.
# Scalars (integers, names) are stringified.
defp decode_info_value({:string, binary}), do: Utils.decode_pdf_string(binary)
defp decode_info_value({:hex_string, binary}), do: binary
defp decode_info_value({:name, name}), do: name
defp decode_info_value(n) when is_integer(n), do: Integer.to_string(n)
defp decode_info_value(f) when is_float(f), do: Float.to_string(f)
defp decode_info_value(s) when is_binary(s), do: s
defp decode_info_value(_), do: nil
# Resolve the catalog's /Metadata XMP stream and parse it.
# Returns {xmp_map, updated_doc} — empty map on any failure (graceful).
# Flow: trailer["Root"] → catalog dict → catalog["Metadata"] → stream
# → decode_stream → XMP.parse
# PDF 1.7 § 14.3.2 — Metadata Streams.
defp read_xmp_stream(%Document{trailer: trailer} = doc) do
with {:ok, root_ref} <- fetch_root_ref(trailer),
{:ok, catalog, doc2} <- ObjectResolver.resolve(doc, root_ref),
true <- is_map(catalog),
{:ok, meta_ref} <- fetch_metadata_ref(catalog),
{:ok, {:stream, stream_dict, raw_bytes}, doc3} <-
ObjectResolver.resolve(doc2, meta_ref),
{:ok, decoded} <- decode_stream(stream_dict, raw_bytes),
{:ok, xmp_map} <- XMP.parse(decoded) do
{xmp_map, doc3}
else
_ -> {%{}, doc}
end
end
defp fetch_root_ref(trailer) do
case Map.get(trailer, "Root") do
nil -> :error
ref -> {:ok, ref}
end
end
defp fetch_metadata_ref(catalog) do
case Map.get(catalog, "Metadata") do
nil -> :error
{:ref, _, _} = ref -> {:ok, ref}
_ -> :error
end
end
# ---------------------------------------------------------------------------
# page_count helper
# ---------------------------------------------------------------------------
defp read_declared_count(%Document{trailer: trailer} = doc) do
case Map.get(trailer, "Root") do
nil ->
{:error, :no_pages}
root_ref ->
with {:ok, catalog, doc2} <- ObjectResolver.resolve(doc, root_ref),
{:ok, pages_ref} <- fetch_pages_ref_from_catalog(catalog),
{:ok, pages_node, _doc3} <- ObjectResolver.resolve(doc2, pages_ref) do
case Map.get(pages_node, "Count") do
n when is_integer(n) -> {:ok, n}
_ -> {:error, {:malformed, :page_tree, %{missing_count: true}}}
end
end
end
end
defp fetch_pages_ref_from_catalog(%{"Pages" => ref}), do: {:ok, ref}
defp fetch_pages_ref_from_catalog(_), do: {:error, :no_pages}
# ---------------------------------------------------------------------------
# Text extraction helpers
# ---------------------------------------------------------------------------
defp collect_text_runs([], doc, _page_num, acc) do
{:ok, Enum.reverse(acc), doc}
end
# R-1: per-page isolation — when recover_mode is true, wrap each page in
# try/rescue so that a single failing page does not abort the whole document.
# On failure, append {:page_failed, page_num, reason} to recovery_log and
# emit [] runs for that page, then continue with the next page.
# When recover_mode is false, any exception raised by the parser (e.g. a
# MatchError) is rescued and converted to {:error, :malformed} to satisfy
# the strict-mode contract.
defp collect_text_runs([page_ref | rest], doc, page_num, acc) do
# page_refs from Page.list_refs are {n, g} tuples — wrap to {:ref, n, g}
ref = ensure_ref(page_ref)
if doc.recover_mode do
{result_acc, result_doc} =
try do
case extract_page_runs(doc, ref, page_num) do
{:ok, runs, updated_doc} ->
{Enum.reverse(runs) ++ acc, updated_doc}
{:error, reason} ->
updated_doc = Document.log_recovery(doc, {:page_failed, page_num, reason})
{acc, updated_doc}
end
rescue
_ ->
updated_doc = Document.log_recovery(doc, {:page_failed, page_num, :parse_error})
{acc, updated_doc}
end
collect_text_runs(rest, result_doc, page_num + 1, result_acc)
else
try do
case extract_page_runs(doc, ref, page_num) do
{:ok, runs, updated_doc} ->
collect_text_runs(rest, updated_doc, page_num + 1, Enum.reverse(runs) ++ acc)
{:error, _} = err ->
err
end
rescue
_ -> {:error, :malformed}
end
end
end
# R-FX1, R-FX19: use do_interpret_with_doc/5 to thread doc through for Form
# XObject recursion. Raw xobject refs are passed — classification happens
# on demand inside the Do handler. The updated doc (with cache) is returned.
#
# R-2: build_decoders_for_resources now returns font_failures list. On recovery
# mode, each failure is logged as {:font_skipped, page_num, font_name, reason}.
defp extract_page_runs(doc, page_ref, page_num) do
with {:ok, page_dict, doc2} <- ObjectResolver.resolve(doc, page_ref),
{:ok, content_bytes, doc3} <- resolve_page_contents(doc2, page_dict),
{:ok, resources, doc4} <- resolve_page_resources(doc3, page_ref, page_dict),
{:ok, font_decoders, font_failures, doc5} <-
Font.build_decoders_for_resources(resources, doc4),
doc5a <- log_font_failures(doc5, font_failures, page_num),
{:ok, font_widths, doc6} <- Widths.build_widths_for_resources(resources, doc5a),
xobjects <- build_xobjects_map(resources),
{:ok, events, doc7} <-
Pdf.Reader.ContentStream.do_interpret_with_doc(
content_bytes,
&identity_decoder/1,
[xobjects: xobjects, font_decoders: font_decoders, font_widths: font_widths],
doc6,
resources
) do
runs = events_to_text_runs(events, page_num)
# doc7 carries decryption cache and Form resolution cache populated during interpretation
{:ok, runs, doc7}
end
end
# Convert font_failures list to {:font_skipped, page_num, name, reason} events
# and log them on the doc. Returns doc unchanged when failures list is empty.
defp log_font_failures(doc, [], _page_num), do: doc
defp log_font_failures(doc, failures, page_num) do
Enum.reduce(failures, doc, fn {name, reason}, acc_doc ->
Document.log_recovery(acc_doc, {:font_skipped, page_num, name, reason})
end)
end
# Default decoder used when no font-specific decoder is available.
# Passes the bytes through unchanged, paired with an empty list (no
# unresolved entries).
defp identity_decoder(bytes), do: {bytes, []}
# Resolve page /Contents — may be a single ref or an array of refs.
# Concatenate all decoded streams with a newline separator.
# Streams are passed through the filter chain (e.g., FlateDecode) before
# being returned to the content stream interpreter.
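#
# Shapes handled (per the clauses below; object numbers illustrative):
#   "Contents" => {:ref, 12, 0}                    (single content stream)
#   "Contents" => [{:ref, 12, 0}, {:ref, 13, 0}]   (array of content streams)
#   "Contents" absent                              (empty page: {:ok, <<>>, doc})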
defp resolve_page_contents(doc, page_dict) do
case Map.get(page_dict, "Contents") do
nil ->
{:ok, <<>>, doc}
{:ref, _, _} = ref ->
case ObjectResolver.resolve(doc, ref) do
{:ok, {:stream, dict, raw_bytes}, doc2} ->
case decode_stream(dict, raw_bytes) do
{:ok, decoded} -> {:ok, decoded, doc2}
{:error, _} -> {:ok, <<>>, doc2}
end
{:ok, _other, doc2} ->
{:ok, <<>>, doc2}
{:error, _} = err ->
err
end
refs when is_list(refs) ->
Enum.reduce_while(refs, {:ok, <<>>, doc}, fn ref, {:ok, acc_bytes, acc_doc} ->
case ObjectResolver.resolve(acc_doc, ref) do
{:ok, {:stream, dict, raw_bytes}, updated_doc} ->
case decode_stream(dict, raw_bytes) do
{:ok, decoded} ->
{:cont, {:ok, acc_bytes <> "\n" <> decoded, updated_doc}}
{:error, _} ->
# Skip streams we cannot decode — don't abort the whole page
{:cont, {:ok, acc_bytes, updated_doc}}
end
{:ok, _other, updated_doc} ->
{:cont, {:ok, acc_bytes, updated_doc}}
{:error, _} = err ->
{:halt, err}
end
end)
_ ->
{:ok, <<>>, doc}
end
end
# Decode a stream's raw bytes through its declared filter chain.
# Returns {:ok, decoded_bytes} or {:error, reason}.
# On filter error the caller decides whether to skip or propagate.
defp decode_stream(dict, raw_bytes) do
filter = Map.get(dict, "Filter")
parms = Map.get(dict, "DecodeParms")
if filter == nil do
{:ok, raw_bytes}
else
Filter.apply_chain(raw_bytes, filter, parms || %{})
end
end
# Resolve /Resources for a page dict.
# Checks the leaf page's own /Resources first; if absent, walks up the
# /Parent chain until resources are found or the root is reached.
# PDF 1.7 § 7.7.3 (Page Tree) and § 7.7.3.4 (Inheritance of Page Attributes).
#
# Cache: on entry, checks doc.cache for {:page_resources, {n, g}} keyed by
# the leaf page's xref ref. On return, writes the result to the cache so
# subsequent calls for the same leaf page skip the walk entirely.
#
# Cycle detection: the `visited` MapSet accumulates {n, g} xref refs seen
# during this walk. If a /Parent ref is already in `visited`, the cycle is
# silently broken and the walk returns {:ok, %{}, doc}. This protects against
# corrupt PDFs where the /Parent chain forms a loop.
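#
# e.g. after resolving the leaf page at object 7 gen 0, the cache holds
# {:page_resources, {7, 0}} => %{"Font" => ..., "XObject" => ...}
# (the dict keys shown are illustrative).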
defp resolve_page_resources(doc, leaf_ref, page_dict, visited \\ MapSet.new()) do
# Normalise the leaf ref to {n, g} for use as a cache key.
{:ref, n, g} = leaf_ref
leaf_key = {n, g}
# Cache hit: return immediately without walking.
if Map.has_key?(doc.cache, {:page_resources, leaf_key}) do
{:ok, Map.fetch!(doc.cache, {:page_resources, leaf_key}), doc}
else
{{:ok, resources}, updated_doc} =
do_resolve_page_resources(doc, leaf_key, page_dict, visited)
cached_doc = %{
updated_doc
| cache: Map.put(updated_doc.cache, {:page_resources, leaf_key}, resources)
}
{:ok, resources, cached_doc}
end
end
# Internal walker — separated from the cache/cycle guard to keep the logic clean.
defp do_resolve_page_resources(doc, leaf_key, page_dict, visited) do
case Map.get(page_dict, "Resources") do
nil ->
# Resource inheritance: try /Parent
case Map.get(page_dict, "Parent") do
nil ->
{{:ok, %{}}, doc}
{:ref, n, g} = parent_ref ->
# Extract {n, g} from the parent ref for cycle detection.
parent_key = {n, g}
# Cycle guard: if this ancestor was already visited, break the loop.
if MapSet.member?(visited, parent_key) do
{{:ok, %{}}, doc}
else
new_visited = MapSet.put(visited, parent_key)
# Also add the leaf ref itself on the first call so a page pointing
# /Parent to itself is caught on the very first ancestor resolution.
first_visited =
if leaf_key != nil,
do: MapSet.put(new_visited, leaf_key),
else: new_visited
case ObjectResolver.resolve(doc, parent_ref) do
{:ok, parent_dict, doc2} when is_map(parent_dict) ->
# Recurse without the leaf_key cache (parent is not the leaf).
# Pass nil as leaf_key so we don't cache at an intermediate node.
do_resolve_page_resources(doc2, nil, parent_dict, first_visited)
_ ->
{{:ok, %{}}, doc}
end
end
end
{:ref, _, _} = ref ->
case ObjectResolver.resolve(doc, ref) do
{:ok, resources, doc2} when is_map(resources) -> {{:ok, resources}, doc2}
{:ok, _, doc2} -> {{:ok, %{}}, doc2}
{:error, _} -> {{:ok, %{}}, doc}
end
resources when is_map(resources) ->
{{:ok, resources}, doc}
_ ->
{{:ok, %{}}, doc}
end
end
# Build the xobjects map: %{name => {:ref, n, g} | inline_dict}
# R-FX19: pass raw refs — the ContentStream interpreter classifies on demand
# by reading /Subtype. Pre-classification to :image | :form is removed.
defp build_xobjects_map(resources) do
case Map.get(resources, "XObject") do
nil ->
%{}
xobjects when is_map(xobjects) ->
# Return the map as-is; raw {:ref, n, g} refs and inline dicts are both valid.
xobjects
_ ->
%{}
end
end
# Convert content stream events to TextRun structs
defp events_to_text_runs(events, page_num) do
events
|> Enum.flat_map(fn
{:text, %{text: text, unresolved: unresolved, x: x, y: y, font: font, size: size}} ->
if text == "" do
[]
else
[
%Pdf.Reader.TextRun{
text: text,
unresolved: unresolved,
x: x,
y: y,
font: font,
size: size,
page: page_num
}
]
end
_ ->
[]
end)
end
# ---------------------------------------------------------------------------
# Image extraction helpers
# ---------------------------------------------------------------------------
defp collect_images([], doc, _page_num, acc) do
{:ok, Enum.reverse(acc), doc}
end
# R-1: per-page isolation — mirror of collect_text_runs/4 for images.
# When recover_mode is true, catch per-page failures (including raises) and continue.
# When recover_mode is false, exceptions are rescued and converted to {:error, :malformed}.
defp collect_images([page_ref | rest], doc, page_num, acc) do
ref = ensure_ref(page_ref)
if doc.recover_mode do
{result_acc, result_doc} =
try do
case extract_page_images(doc, ref, page_num) do
{:ok, images, updated_doc} ->
{Enum.reverse(images) ++ acc, updated_doc}
{:error, reason} ->
updated_doc = Document.log_recovery(doc, {:page_failed, page_num, reason})
{acc, updated_doc}
end
rescue
_ ->
updated_doc = Document.log_recovery(doc, {:page_failed, page_num, :parse_error})
{acc, updated_doc}
end
collect_images(rest, result_doc, page_num + 1, result_acc)
else
try do
case extract_page_images(doc, ref, page_num) do
{:ok, images, updated_doc} ->
collect_images(rest, updated_doc, page_num + 1, Enum.reverse(images) ++ acc)
{:error, _} = err ->
err
end
rescue
_ -> {:error, :malformed}
end
end
end
# R-FX13: image events from Form XObjects bubble up through recurse_into_form.
# Use do_interpret_with_doc/5 (same as extract_page_runs) so Form recursion
# is enabled and nested image events are included in the event list.
#
# R-2: build_decoders_for_resources now returns font_failures list. On recovery
# mode, each failure is logged as {:font_skipped, page_num, font_name, reason}.
defp extract_page_images(doc, page_ref, page_num) do
with {:ok, page_dict, doc2} <- ObjectResolver.resolve(doc, page_ref),
{:ok, resources, doc3} <- resolve_page_resources(doc2, page_ref, page_dict),
{:ok, content_bytes, doc4} <- resolve_page_contents(doc3, page_dict),
{:ok, font_decoders, font_failures, doc5} <-
Font.build_decoders_for_resources(resources, doc4),
doc5a <- log_font_failures(doc5, font_failures, page_num),
xobjects <- build_xobjects_map(resources),
{:ok, events, doc6} <-
Pdf.Reader.ContentStream.do_interpret_with_doc(
content_bytes,
&identity_decoder/1,
[xobjects: xobjects, font_decoders: font_decoders],
doc5a,
resources
) do
image_events = Enum.filter(events, &match?({:image, _}, &1))
# Image events from nested Forms carry the CTM at the point of Do inside the Form.
# resolve_image_xobject resolves the image stream from the xobjects hierarchy.
# For images inside Forms, the xobject may live in the Form's /Resources, not the
# page's /Resources. We try both page resources and a best-effort fallback.
{images, final_doc} =
Enum.reduce(image_events, {[], doc6}, fn {:image, %{name: name, ctm: ctm}},
{img_acc, acc_doc} ->
case resolve_image_xobject_deep(acc_doc, resources, name, ctm, page_num) do
{:ok, image, updated_doc} -> {[image | img_acc], updated_doc}
{:error, _} -> {img_acc, acc_doc}
end
end)
{:ok, Enum.reverse(images), final_doc}
end
end
# R-FX13: extended version of resolve_image_xobject that falls back to
# scanning the xref table for an Image XObject when the name is not in the
# page's top-level /XObject dict (e.g. when the image lives inside a Form's
# /Resources). The page-level lookup is tried first for performance; the
# xref scan only runs when the name is not found in page resources.
defp resolve_image_xobject_deep(doc, resources, name, ctm, page_num) do
xobjects = Map.get(resources, "XObject", %{})
case Map.get(xobjects, name) do
nil ->
# Image is not in page resources — scan the xref for streams with
# Subtype=Image and return the first one that classifies successfully
# (adequate for the typical Form-only-image case of a single document image).
find_image_in_xref(doc, name, ctm, page_num)
ref ->
case ObjectResolver.resolve(doc, ref) do
{:ok, {:stream, dict, raw_bytes}, doc2} ->
classify_image_stream(doc2, dict, raw_bytes, ctm, page_num, ref)
{:ok, _other, _doc2} ->
{:error, {:not_an_image, name}}
{:error, _} = err ->
err
end
end
end
# Scan the xref table entries and resolve each one; return the first Image
# XObject stream found (with Subtype=Image). This is a fallback for images
# that live inside Form /Resources rather than page /Resources.
defp find_image_in_xref(doc, name, ctm, page_num) do
xref_entries =
doc.xref
|> Enum.filter(fn
{{n, g}, _offset} when is_integer(n) and is_integer(g) and n > 0 -> true
_ -> false
end)
result =
Enum.find_value(xref_entries, fn {{n, g}, _offset} ->
ref = {:ref, n, g}
case ObjectResolver.resolve(doc, ref) do
{:ok, {:stream, dict, raw_bytes}, doc2} ->
case Map.get(dict, "Subtype") do
{:name, "Image"} ->
case classify_image_stream(doc2, dict, raw_bytes, ctm, page_num, ref) do
{:ok, _image, _doc3} = ok -> ok
_ -> nil
end
_ ->
nil
end
_ ->
nil
end
end)
case result do
nil -> {:error, {:unresolved_xobject, name}}
ok -> ok
end
end
# Decompose CTM {a, b, c, d, e, f} into image placement components.
# PDF 1.7 § 8.3.3: The image unit square [0,1]x[0,1] is mapped via the CTM.
# render_width = sqrt(a*a + b*b) (scale in x direction)
# render_height = sqrt(c*c + d*d) (scale in y direction)
# x, y = translation components (e, f)
# rotation_radians = atan2(b, a)
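#
# Worked example: a CTM of {100.0, 0.0, 0.0, 50.0, 72.0, 144.0} places an
# unrotated image at (72, 144), rendered 100 pt wide and 50 pt tall:
#   decompose_ctm({100.0, 0.0, 0.0, 50.0, 72.0, 144.0})
#   # => {72.0, 144.0, 100.0, 50.0, 0.0}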
defp decompose_ctm({a, b, c, d, e, f}) do
render_width = :math.sqrt(a * a + b * b)
render_height = :math.sqrt(c * c + d * d)
rotation = :math.atan2(b, a)
{e, f, render_width, render_height, rotation}
end
defp classify_image_stream(doc, dict, raw_bytes, ctm, page_num, ref) do
filter = Map.get(dict, "Filter")
width = to_float_dim(Map.get(dict, "Width", 0))
height = to_float_dim(Map.get(dict, "Height", 0))
ref_key = extract_ref_key(ref)
placement = decompose_ctm(ctm)
case normalize_filter_name(filter) do
:DCTDecode ->
{:ok, build_image(:jpeg, raw_bytes, width, height, ctm, placement, page_num, ref_key), doc}
:FlateDecode ->
case Pdf.Reader.Filter.Flate.decode(raw_bytes, Map.get(dict, "DecodeParms") || %{}) do
{:ok, decoded} ->
{:ok, build_image(:png_like, decoded, width, height, ctm, placement, page_num, ref_key), doc}
{:error, _} = err ->
err
end
nil ->
# No filter — raw bytes are already image data
{:ok, build_image(:png_like, raw_bytes, width, height, ctm, placement, page_num, ref_key), doc}
other_filter ->
{:error, {:unsupported_filter, other_filter}}
end
end
# Shared constructor for %Pdf.Reader.Image{} — the three success branches
# above differ only in :kind and :bytes; everything else derives from the
# stream dict, the CTM decomposition, and the xref ref.
defp build_image(kind, bytes, width, height, ctm, {x, y, render_width, render_height, rotation}, page_num, ref_key) do
%Pdf.Reader.Image{
kind: kind,
bytes: bytes,
x: x,
y: y,
width: width,
height: height,
ctm: ctm,
render_width: render_width,
render_height: render_height,
rotation_radians: rotation,
page: page_num,
ref: ref_key
}
end
defp normalize_filter_name({:name, name}) do
case name do
"DCTDecode" -> :DCTDecode
"DCT" -> :DCTDecode
"FlateDecode" -> :FlateDecode
"Fl" -> :FlateDecode
# Unknown filter names are returned as raw binaries rather than through
# String.to_atom/1; converting untrusted PDF input to atoms would let a
# hostile document grow the atom table without bound.
other -> other
end
end
defp normalize_filter_name(name) when is_binary(name) do
normalize_filter_name({:name, name})
end
defp normalize_filter_name(_), do: nil
defp to_float_dim(n) when is_integer(n), do: n * 1.0
defp to_float_dim(f) when is_float(f), do: f
defp to_float_dim(_), do: 0.0
defp extract_ref_key({:ref, n, g}), do: {n, g}
defp ensure_ref({n, g}), do: {:ref, n, g}
end