CHANGELOG.md

Select File:
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 1.0.1 — 2026-05-07 (fork: ExPDF)

First release of `ex_pdf` on Hex. Fork of
[`andrewtimberlake/elixir-pdf`](https://github.com/andrewtimberlake/elixir-pdf)
v0.7.2 (Hex package `:pdf`, ©Andrew Timberlake, MIT). The writer API is
preserved unchanged; this release adds a native PDF reader, error
recovery, AcroForm/outlines/annotations extraction, encryption support,
and many quality-of-life improvements.

### Renamed

- OTP app and Hex package: `:pdf` → `:ex_pdf`. Module names are
  unchanged (`Pdf`, `Pdf.Reader`, etc. — the namespace `Pdf.*` was kept
  to avoid breaking writer callers).
- Mixfile module: `Pdf.Mixfile` → `ExPdf.Mixfile`.
- Repo: `andrewtimberlake/elixir-pdf` → `MisaelMa/ExPDF`.
- All internal `:code.priv_dir(:pdf)` and `Application.compile_env(:pdf, …)`
  references updated to `:ex_pdf` (transparent to library users).

### Added — Native PDF reader

The reader is implemented in pure Elixir + Erlang/OTP stdlib (`:zlib`,
`:crypto`, `:binary`, `:unicode`, `:xmerl`). No new Hex runtime deps,
no system tool deps.

#### Unified entry point

- **`Pdf.Reader.read/2`** — one call returns `%Pdf.Reader.Result{meta,
  pages}` carrying document-level metadata + per-page lines with
  `:kind`-tagged tokens (`:text | :link | :email | :image`). See
  `Pdf.Reader.Result` and `Pdf.Reader.Line` for the full shape.
- Convenience shapes: `read(doc, shape: :text)` returns `[String.t()]`,
  `read(doc, shape: :shapes)` returns `[%Pdf.Reader.Shape{}]`.
- Image opt: `image_bytes: true` includes raw decoded `:bytes` in
  shape meta (off by default — `:data_uri` is always present).

#### Core extraction primitives

- `Pdf.Reader.open/2` (with optional `password:` and `recover:` opts)
- `Pdf.Reader.read_text/1` — plain text per page
- `Pdf.Reader.read_text_with_positions/1` — text runs with absolute X/Y
- `Pdf.Reader.read_lines/2` — logical lines with token tokenisation
- `Pdf.Reader.read_metadata/1` — Info dict + XMP (PDF 1.7 § 14.3)
- `Pdf.Reader.read_images/1` — embedded raster images with positions
- `Pdf.Reader.read_outlines/1` — bookmark tree
- `Pdf.Reader.read_annotations/1` — per-page annotations
- `Pdf.Reader.read_acroform/1` — interactive form fields
- `Pdf.Reader.read_shapes/1` — link-like elements (annotations + inferred)
- `Pdf.Reader.recovery_log/1` — recovery event accessor
- `Pdf.Reader.page_count/1`
- Bang variants for every read function

#### Encryption (PDF 1.7 § 7.6, PDF 2.0 § 7.6)

- Standard Security Handler V1/V2 (RC4-40, RC4-128)
- V4 with /AESV2 (AES-128) — `crypto:crypto_init_dyn/4`
- V5/R6 with /AESV3 (AES-256) — full Algorithm 2.A round trip
- Empty-password auto-try, file-key derivation per Algorithms 2/3/5/6/8

#### CID fonts (PDF 1.7 § 9.7)

- Identity-H/V composite fonts via 2-byte CID tokenisation
- 40 predefined CMaps from `adobe-type-tools/cmap-resources` bundled in
  `priv/cmap/` (UniJIS, GBK, KSC, ETen, Identity, …)
- Adobe Japan1/CNS1/Korea1/GB1 collections bundled in `priv/`
- PostScript-subset CMap parser with codespace-aware variable-length
  tokenizer
- ToUnicode CMap fallback for glyphs outside predefined ranges

#### Per-glyph widths (PDF 1.7 § 9.4.4, § 9.6.2.1, § 9.7.4.3)

- Full advance formula `tx = ((w/1000 - Tj_kern) × Tfs + Tc + Tw_space) × Th`
  applied per glyph
- Heterogeneous CIDFont `/W` parsing (Form A + Form B + interleaved)
- Standard 14 fallback to 500-unit average glyph width when `/Widths`
  is absent — restores correct positional advance for Helvetica /
  Times-Roman / Courier PDFs

#### Form XObject recursion (PDF 1.7 § 8.10)

- `Do` operator transparent recursion into `/Subtype /Form` XObjects
- CTM × Form `/Matrix` multiplication, resource merging, cycle
  detection, depth cap (8)

#### AcroForm field extraction (PDF 1.7 § 12.7)

- `Pdf.Reader.FormField` struct with `/FT` inheritance walk
- Text, button, choice, signature field types

#### Outlines and annotations (PDF 1.7 § 12.3, § 12.5)

- `Pdf.Reader.Outline` tree with destinations resolved
- `Pdf.Reader.Annotation` struct with subtype detection (link, text,
  highlight, file attachment, …)
- Annotation-source links automatically merge into the unified
  `Pdf.Reader.Shape` API

#### Image extraction (PDF 1.7 § 8.9)

- `/Subtype /Image` XObjects with absolute positions and CTM-derived
  rendered dimensions
- `:jpeg` (DCTDecode passthrough) and `:png_like` (FlateDecode +
  Predictor) classification
- **`shape.meta.data_uri`** — RFC 2397 `data:` URI ready for HTML
  embedding. JPEG is passthrough; png_like is re-encoded into a real
  PNG (PNG 1.2 § 5: signature + IHDR + IDAT + IEND with filter byte
  0 + zlib) so the URI is browser-loadable.

#### Shape inference

- `Pdf.Reader.Shape` struct unifies link annotations and pattern-
  inferred URIs/emails/images.
- URL regex per RFC 3986 § 3, email regex per RFC 5321 § 4.1.2.
- Trailing punctuation (`. , ; : ) ]`) stripped from inferred URIs.

#### Error recovery

- Opt-in `recover: true` flag with four orthogonal phases:
  - **R-1** Per-page isolation — one bad page does not kill the doc
  - **R-2** Font lenience — bad font refs fall back to U+FFFD per byte
  - **R-3** XRef linear scan — `:binary.matches/2` recovers from
    corrupted xref tables; trailer synthesis from last `trailer<<…>>`
    block or `/Type /Catalog` scan; multi-gen dedup (highest gen wins)
  - **R-4** Catalog/Pages tree fallback — `/Type /Page` xref scan when
    `/Root` or `/Pages` doesn't resolve
- Closed set of recovery event tuples observable via `recovery_log/1`:
  `:eof_marker_missing`, `:xref_recovered`, `:page_tree_recovered`,
  `:page_failed`, `:font_skipped`.
- Fatal errors (`:not_a_pdf`, `:encrypted_password_required`,
  `:encrypted_wrong_password`, `:encrypted_unsupported_handler`)
  remain hard errors even under `recover: true`.

### Added — Tooling

- **`releaser` ~> 0.0.7** dev dependency for monorepo-aware version
  bumping, changelog generation, and Hex publishing.
- Hex package metadata: maintainers, contributors (Andrew Timberlake +
  Misael Sánchez), licenses, links (GitHub + upstream + changelog).
- ExDoc `groups_for_modules` separating Reader and Writer namespaces.
- Comprehensive README documenting every reader feature with spec
  citations.

### Test suite

1180 tests, 0 failures, 30 excluded as of this release.

---

## Unreleased — pdf-reader-error-recovery

### Added

- **`Pdf.Reader.open/2` `recover: true` option** — opts in to error-recovery mode.
  When `recover: false` (the default, unchanged), all existing strict behavior is
  preserved. When `recover: true`, the reader activates four orthogonal recovery
  phases (R-1..R-4) and logs each recovery action instead of returning
  `{:error, _}`.

- **`Pdf.Reader.recovery_log/1`** — public accessor returning the recovery event
  log in chronological (oldest-first) order. An empty list after `open/2`
  guarantees that no recovery action occurred. Direct access to
  `doc.recovery_log` is discouraged in application code.

- **`Pdf.Reader.Document` struct extension** — two new fields with defaults that
  are invisible to code that does not reference them:
  - `recover_mode :: boolean()`, default `false`
  - `recovery_log :: [recovery_event()]`, default `[]`

- **PUBLIC API CHANGE — `read_text/1` and `read_images/1` return shape**.
  Both functions now return `{:ok, list, doc}` 3-tuples (doc is the updated
  document carrying the recovery log). The bang variants (`read_text!/2`,
  `read_images!/1`) are unchanged.

- **R-1 — Per-page isolation**: when `recover: true`, a failed page is logged
  as `{:page_failed, page_n, reason}` and skipped; remaining pages continue.
  Spec: PDF 1.7 § 7.7.3, § 7.8.

- **R-2 — Font decoder lenience**: when `recover: true` and a font dict fails to
  resolve, the decoder for that font is replaced with a per-byte U+FFFD identity
  decoder (`<<0xFFFD::utf8>>` per byte). The event `{:font_skipped, page_n,
  font_name, reason}` is logged. `String.valid?/1` is guaranteed true on
  recovery output. Spec: PDF 1.7 § 9.6, § 9.10.

- **R-3 — XRef linear scan**: when `recover: true` and normal xref loading fails
  (corrupt `startxref` offset, absent `%%EOF`), `XRef.recover/1` performs a
  `:binary.matches/2` scan to reconstruct the cross-reference table. A
  `{:xref_recovered, n_objects}` event is logged. When `%%EOF` is absent, an
  additional `{:eof_marker_missing, :linear_scan_used}` event is prepended.
  Spec: PDF 1.7 § 7.5.4, § 7.5.5, § 7.5.8.

- **R-4 — Catalog / Pages tree fallback**: when `recover: true` and `/Root` or
  `/Pages` cannot be resolved, the reader scans the recovered xref entries for
  objects with `/Type /Page` and (`/Contents` OR `/Parent`). A
  `{:page_tree_recovered, n_pages}` event is logged. Form XObjects (which also
  carry `/Type /Page` sometimes) are correctly excluded by the filter.
  Spec: PDF 1.7 § 7.7.2, § 7.7.3.

### Closed set of recovery event tuples

| Tuple | Meaning |
|---|---|
| `{:xref_recovered, n}` | Linear scan recovered `n` object entries |
| `{:eof_marker_missing, :linear_scan_used}` | `%%EOF` absent; linear scan was invoked |
| `{:page_failed, page_n, reason}` | Page `page_n` skipped; `reason` is an atom or term |
| `{:font_skipped, page_n, font_name, reason}` | Font replaced with U+FFFD fallback |
| `{:page_tree_recovered, n_pages}` | Catalog/Pages fallback found `n_pages` page objects |

No other tuple shapes are appended.

### Known gaps (recovery)

- **Encrypted AND corrupted PDFs** — when a PDF is both encrypted and has a
  corrupt xref/catalog, the synthetic trailer built by the linear scan does not
  include `/Encrypt`. Decryption cannot proceed; these PDFs are non-decryptable
  even with `recover: true`.
- **Catalog-fallback page order** — when R-4 triggers, the page list is in
  xref-insertion order, NOT document order. The `{:page_tree_recovered, n}`
  event signals this known limitation to callers.
- **R-4 probe cost** — with `recover: true`, `do_open/2` runs a full page-tree
  walk immediately after xref load (to surface `{:page_tree_recovered, n}` on
  the doc returned from `open/2`). This is O(pages) and measurable on large
  documents. It is opt-in by design.

### Internal

- Test suite: 1128 tests, 0 failures, 29 excluded (was 1125 before this change).
- New test file: `test/pdf/reader/recovery_test.exs` (65 tests: 16 RED, 11 GREEN,
  integration, smoke, and stress).
- Strict TDD throughout (red → green → refactor per task).
- Spec-driven via SDD (`sdd/pdf-reader-error-recovery/*` artifacts in engram).

---

## Unreleased — pdf-reader-per-glyph-widths

### Added

- **Per-glyph width support** (`Pdf.Reader.Font.Widths`): text-matrix advance
  now uses the full PDF 1.7 § 9.4.4 formula — `tx = ((w/1000) * Tfs + Tc + Tw_if_space) * Th` —
  rather than a uniform 1-em approximation. Glyph widths are loaded from:
  - `/Widths`, `/FirstChar`, `/LastChar` for simple fonts (Type1, TrueType) — § 9.6.2.1
  - `/W` Form A/B arrays and `/DW` fallback for CIDFonts (Type0) — § 9.7.4.3
- `Pdf.Reader.Font.Widths` — new module with closures of type
  `(binary() -> [non_neg_integer()])`, one per font, built alongside the existing
  decoder map in `extract_page_runs/3` and threaded through Form XObject recursion.
- `GraphicsState.widths_fn` — new field (default `nil`) storing the active font's
  width closure. Set by the `Tf` operator alongside `decoder`. (§ 9.4.4)
- `Tc`, `Tw`, `Tz`, `TL` text-state operators now correctly update `GraphicsState`.
  Previously their operands were silently dropped.
- TJ kerning shift now applies horizontal scaling (`Th`):
  `shift = -(n/1000) * Tfs * Th` (previously `Th` was omitted).

### Changed

- `GraphicsState.horizontal_scaling` default changed from `1.0` to `100.0`
  (the PDF spec unit is a percentage; `Th = horizontal_scaling / 100`).
  Existing code that reads this field directly and expects the percentage
  form is unaffected; callers that divided by 100 already will need to adjust.

### Documented gaps (not in scope)

- Vertical writing widths (`/W2`, `/DW2`) — § 9.7.4.4
- Standard-14 hardcoded AFM metrics — § 9.6.2.2 (fonts without embedded `/Widths`
  currently produce zero-width advance; Tc/Tw still apply correctly)
- Non-default `/FontMatrix` scaling on CIDType2 fonts — § 9.7.4.3

### Internal

- Test suite: 1107 tests, 0 failures, 27 excluded (was 1095 before this change).
- New file: `lib/pdf/reader/font/widths.ex`
- New test file: `test/pdf/reader/font/widths_test.exs` (25 tests)

## Unreleased — housekeeping-dialyzer-warnings

### Internal

- Removed 9 dead-code clauses flagged by Dialyzer "pattern can never match the type".
  All defensive `{:error, _}` arms in bang-wrappers and downstream pattern dispatches
  where the upstream success_typing was `{:ok, ...}`-only. No behavior change.
  Specifically: `read_metadata!/1` error branch; `extract_doc_id/1` `{:hex_string, _}`
  and `{:string, _}` patterns; `resolve_page_resources/4` dead `{n,g}` and `nil` key
  branches plus unreachable `{:error, _}` cache arm; `do_resolve_page_resources/4`
  dead `{n,g}` and `nil` parent_key branches; `font.ex` `{:error, _}` arm for
  `CID.Decoder.build/2`; `decoder.ex` `parse_registry(nil)` clause.
  Defensive `_error` / `_` fallbacks in `outlines.ex` and `annotations.ex` that guard
  against future widening of `Destination.resolve/3` return type were intentionally
  kept and annotated with comments.

## Unreleased — pdf-reader-cid-fonts-tier3

### Added

- 10 Tier 3 predefined CMaps bundled in `priv/cmap/`:
  - Adobe-Japan1: `EUC-H`, `EUC-V`
  - Adobe-CNS1: `B5-H`, `B5-V`, `ETenms-B5-H`, `ETenms-B5-V`
  - Adobe-GB1: `GB-H`, `GB-V`
  - Adobe-Korea1: `KSCms-UHC-HW-H`, `KSCms-UHC-HW-V`
  - Source: `adobe-type-tools/cmap-resources` (Apache-2.0)
- `Pdf.Reader.CID.PredefinedCMap.@bundled` set extended from 30 to 40 names.

### Internal

- Test suite: 1063 tests, 3 pre-existing failures (encryption), 0 new failures.
- Bundle size: +51.9 KB additional `priv/cmap/` data.

## Unreleased — housekeeping-mix-format

### Internal

- Auto-formatted 12 pre-existing files via `mix format` to satisfy `--check-formatted`.
  Affected files: `lib/pdf.ex`, `lib/pdf/builder.ex`, `lib/pdf/fonts.ex`, `lib/pdf/images/png.ex`,
  `lib/pdf/layout.ex`, `lib/pdf/page.ex`, `lib/pdf/styled_table.ex`,
  `test/pdf/builder_test.exs`, `test/pdf/fonts_test.exs`, `test/pdf/layout_test.exs`,
  `test/pdf/page_templates_test.exs`, `test/pdf/styled_table_test.exs`. No behavior change.

## Unreleased — pdf-reader-annotations-outlines

### Added

- **Document outlines (bookmarks)** — `Pdf.Reader.read_outlines/1` returns
  `[%Pdf.Reader.Outline{title, level, dest_page, children}]` walking
  catalog `/Outlines` linked list with cycle detection (visited MapSet)
  and depth cap 32.
- **Annotations** — `Pdf.Reader.read_annotations/1` returns
  `[%Pdf.Reader.Annotation{type, page, rect, contents, ...}]` for the
  10 in-scope subtypes: Link, Text, Highlight, Underline, StrikeOut,
  Squiggly, Square, Circle, FreeText, FileAttachment. Other subtypes
  surface as `:type :unknown` with raw fields preserved in
  `:kind_specific`.
- **`Pdf.Reader.Destination`** — resolves all 4 `/Dest` variants
  (direct array, named string, `/A /S /GoTo /D <array>`, `/A /S /GoTo /D <name>`).
  Name-tree walker handles depth-20 + cycle detection.
- **`Pdf.Reader.Utils`** — extracted shared `decode_pdf_string/1` (UTF-16BE BOM
  + hex string aware) and `parse_rect/1`. `Pdf.Reader` and
  `Pdf.Reader.AcroForm` migrated to use Utils; private duplicates removed.
- **Page index cache** — `:page_ref_index` cached once per `read_*` call
  via `Destination.ensure_page_index/1`. Avoids O(n) page-ref lookups
  per annotation/outline.

### Out of scope
- Annotation appearance streams.
- Markup popup hierarchies.
- Sound/movie/screen/redact/3D annotations.
- AcroForm widget annotations (covered by separate `pdf-reader-acroform-extraction`).

### Internal
- 1053+ tests, 0 failures.
- Strict TDD throughout.
- Spec-driven via SDD (`sdd/pdf-reader-annotations-outlines/*` artifacts in engram).

---

## Unreleased — pdf-reader-cid-fonts-cmap-resources

### Added

- **30 Adobe predefined CMaps bundled in `priv/cmap/`** (Tier 1 + Tier 2):
  - Tier 1 (16 files): `UniJIS-UTF16-H/V`, `UniJIS-UCS2-H/V`, `UniCNS-UTF16-H/V`,
    `UniCNS-UCS2-H/V`, `UniGB-UTF16-H/V`, `UniGB-UCS2-H/V`, `UniKS-UTF16-H/V`,
    `UniKS-UCS2-H/V`
  - Tier 2 (14 files): `GBK-EUC-H/V`, `GBKp-EUC-H/V`, `GBK2K-H/V`, `ETen-B5-H/V`,
    `KSCms-UHC-H/V`, `90ms-RKSJ-H/V`, `90msp-RKSJ-H/V`
  - Source: `adobe-type-tools/cmap-resources` (Apache-2.0)

- **`Pdf.Reader.CID.CMapParser`** — minimal PostScript subset parser
  (`codespacerange`, `cidchar`, `cidrange`, `notdefchar`, `notdefrange`, `usecmap`).
  Silently skips all other PS content. Returns `{:ok, cmap_fields} | {:error, reason}`.
  Never raises on malformed input.

- **`Pdf.Reader.CID.Codespace.tokenize/2`** — variable-length 1–4 byte
  codespace-aware tokenizer per PDF 1.7 § 9.7.6 shortest-match rule.
  Bytes outside all codespace ranges are silently dropped one-at-a-time.

- **`Pdf.Reader.CID.PredefinedCMap`** — lazy loader with `Document.cache`
  keyed `{:predefined_cmap, name}` and `usecmap` chain support (cycle
  detection via visited MapSet; missing/non-bundled parents fall back to
  empty CMap per discovery #182).

- **`Pdf.Reader.CID.Decoder.build_predefined/2`** — new dispatch branch
  resolves bytes → CID via codespace + CMap → Unicode via existing Adobe
  collection table. Resolution cascade: ToUnicode CMap → predefined CMap
  → Adobe registry → U+FFFD with sentinel.

- **`Pdf.Reader.Font.cid_font_type/1`** — extends the former `cid_font?/1`
  predicate to recognise bundled predefined CMap names; dispatch returns
  `:identity | {:predefined, name} | :not_cid`.

### Known Limitations

- **Tier 3 CMaps not bundled** — `EUC`, `B5`, `GB`, `ETenms-B5`, `KSCms-UHC-HW`
  and similar encodings were deferred. Shipped in `pdf-reader-cid-fonts-tier3`.
- **Adobe-{Japan1,CNS1,Korea1,GB1}-UCS2 abstract parent files do not exist**
  in `adobe-type-tools/cmap-resources`. The `usecmap` operator falls back to
  empty parent CMap if the named parent is not bundled — child's mappings still
  work standalone. Real-world `usecmap` chains are exercised via -V → -H pairs
  (e.g. `UniJIS-UTF16-V usecmap UniJIS-UTF16-H`).

### Internal

- Test suite: 909 tests, 0 failures (was 890 before this change's tests).
- Strict TDD throughout (red → green → refactor per task pair).
- Spec-driven via SDD (`sdd/pdf-reader-cid-fonts-cmap-resources/*`).

---

## Unreleased — pdf-reader-cid-fonts

### Added

- **CID composite font support** — Type0 fonts with `/Encoding /Identity-H` or
  `/Identity-V` are now dispatched to a new CID decoder path in
  `Pdf.Reader.Font.build_decoder_internal/2`. Text extraction from standard
  CJK PDFs (Japanese, Chinese Traditional/Simplified, Korean) now returns
  correct Unicode instead of `U+FFFD`.

- **Four Adobe collection modules** — compile-time CID → Unicode tables bundled
  as `@external_resource` pattern-match clauses (O(1) BEAM dispatch):
  - `Pdf.Reader.CID.AdobeJapan1` — ~9 600 entries (UniJIS-UCS2 column)
  - `Pdf.Reader.CID.AdobeCNS1` — ~18 300 entries (UniCNS-UCS2 column)
  - `Pdf.Reader.CID.AdobeKorea1` — ~17 100 entries (UniKS-UCS2 column)
  - `Pdf.Reader.CID.AdobeGB1` — ~28 700 entries (UniGB-UCS2 column)

  Source data: `adobe-type-tools/cmap-resources` repository.
  Blob SHAs committed:
  - `Adobe-Japan1-7/cid2code.txt` → `4aead36837da`
  - `Adobe-CNS1-7/cid2code.txt`   → `13ebdcb98e07`
  - `Adobe-Korea1-2/cid2code.txt` → `0b5db6b5f5c3`
  - `Adobe-GB1-6/cid2code.txt`    → `c94c7bf8c943`
  - Repository HEAD at time of normalization: `f5cf3bca7fdf`

- **`Pdf.Reader.CID.CIDToGIDMap`** — parses `/CIDToGIDMap` entries
  (`/Identity`, FlateDecode-decoded binary stream, or indirect ref). Stored for
  future glyph-rendering work; not used in the Unicode cascade.

- **`Pdf.Reader.CID.Decoder`** — resolves per-CID Unicode via cascade:
  ToUnicode CMap → Adobe registry table → `U+FFFD` with sentinel
  `{idx, "cid:0xHHHH"}`.

- **`mix.exs` `package.files`** — `"priv"` added so that `@external_resource`
  paths in the Adobe collection modules resolve correctly at Hex compile time.

### Known limitations

- **Non-Identity predefined CMaps not decoded** — fonts with
  `/Encoding /UniJIS-UTF16-H`, `/GBK-EUC-H`, etc. fall through to the
  simple-font path and emit `U+FFFD` with sentinels. Full support planned for
  future change `pdf-reader-cid-fonts-cmap-resources`.
- **Vertical writing mode** — Identity-V is dispatched to the same decoder as
  Identity-H. No positional adjustments for vertical layout.

## Unreleased — pdf-reader-acroform-extraction

### Added

- **`Pdf.Reader.read_acroform/1` and `read_acroform!/1`** — extract interactive
  AcroForm form fields from a PDF document. Returns a flat list of
  `%Pdf.Reader.FormField{}` structs with decoded names, types, values, flags,
  and rectangles. Absent `/AcroForm` returns `{:ok, [], doc}` — never an error.
- **`Pdf.Reader.FormField`** struct — carries `:name` (fully-qualified dot-path),
  `:partial_name`, `:type` (`:text | :button | :choice | :signature | :unknown`),
  `:value` (type-specific decoded value), `:default`, `:tooltip`, `:flags`
  (`%{atom => boolean}` decoded from `/Ff` bitmask), `:rect`.
- **`Pdf.Reader.AcroForm`** walker module — depth-first leaf-only walker with
  cycle detection (`MapSet` of `{n, g}` xref keys), depth cap (`@max_field_depth 8`),
  `/FT` inheritance, hierarchical naming, and widget-only annotation filtering.

## Unreleased — pdf-reader-resource-inheritance-multilevel

### Fixed

- **Cyclic /Parent infinite loop** — `resolve_page_resources/4` now carries a
  `visited` `MapSet` of `{obj_num, gen_num}` xref refs during each `/Parent`-chain
  walk. If a ref is encountered a second time (direct self-ref or transitive cycle),
  the walk is silently terminated and `%{}` is returned. Prevents corrupt PDFs from
  hanging the reader indefinitely.

### Added

- **Per-leaf-page resource cache** — resolved `/Resources` maps are now stored in
  `doc.cache` under `{:page_resources, {n, g}}` keyed by the leaf page's xref ref.
  Subsequent calls for the same page (e.g. a second `read_text/1` call on an open
  doc) skip the `/Parent`-chain walk entirely and return the cached value.

### Note

- **Moduledoc clarified** — removed the stale "Known limitations" entry that stated
  resource inheritance was limited to one level of parent-chain walk. The full
  recursive walk has been in place since Phase 1.1; only the documentation was wrong.
  Added PDF 1.7 § 7.7.3 and § 7.7.3.4 spec references.

## Unreleased — fix-writer-set-info-state

### Fixed

- **Info dict lost after page mutations** — `Pdf.set_info/2` (and its single-key
  variants `set_title/2`, `set_author/2`, etc.) stores metadata by updating
  `document.objects`. However, `Page` carries its own copy of `objects` that was
  snapshotted at page-creation time. Any subsequent page mutation (`set_font`,
  `text_at`, …) calls `sync_page/2`, which replaces `document.objects` with
  `page.objects` — silently discarding the info update. The fix propagates the
  info-dict change into `document.current.objects` inside `put_info/2`, so both
  copies stay in sync and `sync_page/2` no longer clobbers metadata.

## Unreleased — pdf-reader-form-xobject-recursion (Phase 3)

### Added — Phase 3 (pdf-reader-form-xobject-recursion)

- **Form XObject recursion** — `Do` operators referencing `/Type /XObject
  /Subtype /Form` are now recursed into transparently. Text and images inside
  Forms (headers, footers, repeated logos, templated form fields) appear in
  `Pdf.Reader.read_text/2`, `read_text_with_positions/1`, and `read_images/1`
  output. Previously these objects were emitted as `{:deferred, :form_xobject,
  name}` events and silently dropped — that behavior is REPLACED.
- **CTM × `/Matrix` inheritance** — child Form's CTM is `Form.Matrix × parent
  CTM at time of Do`. Graphics state is saved on entry and restored on exit
  (effectively `q ... Q` around the form).
- **Resource merging** — Form's `/Resources` is shallow-merged with the page's
  resources (Form wins on key collision). Per-Form font decoders are built via
  `Pdf.Reader.Font.build_decoders_for_resources/2` and benefit from the
  existing `Document.cache` `{:font_decoder, font_ref}` cache.
- **Cycle detection** — interpreter state carries a `:visited` `MapSet` of
  `{obj_num, gen_num}` xref keys, threaded forward into child states. When a
  Form references an already-visited Form (directly or transitively), an
  internal `{:cycle_detected, ref}` event is emitted and recursion is skipped.
- **Depth cap** — recursion is capped at `@max_form_depth 8`. Beyond that, an
  internal `{:max_depth_exceeded, ref}` event is emitted and the Form is
  skipped.
- **Image bubble-up** — images embedded inside Form XObjects bubble up to the
  parent's event stream and appear in `read_images/1` output, with CTM
  reflecting the full transform (Form.Matrix × parent CTM × image local CTM).
- **Internal/cycle/depth events dropped from text output** — the new
  `{:cycle_detected, _}` and `{:max_depth_exceeded, _}` event types flow
  through `Pdf.Reader.ContentStream.interpret/3`'s output but are silently
  dropped by `events_to_text_runs/2`. Public `read_text*` API surface
  unchanged.

### Modified — Phase 3

- `Pdf.Reader.ContentStream.interpret/3` — public arity and return shape
  unchanged (backward-compat). New private `do_interpret_with_doc/5` for the
  recursive path; `extract_page_runs/3` and `extract_page_images/3` now use it.
- `Pdf.Reader.Image` and `Pdf.Reader.TextRun` events from inside Forms are
  appended to the parent page's event list.
- `build_xobjects_map/1` simplified — passes raw `{:ref, n, g}` refs from
  `resources["XObject"]` instead of pre-classifying as `:form`. ContentStream
  classifies on demand inside `Do`.

### Out of scope (Phase 3)

- BBox clipping of Form contents — text outside a Form's `/BBox` is still
  extracted (presentational concern, not data extraction).
- Pattern XObject recursion — `/Type /Pattern` objects referenced via `Do`
  are skipped.
- Multi-level page-tree resource inheritance (still one-level walk only).
- AcroForm interactive field extraction.

### Internal — Phase 3

- Test suite: 756 tests, 0 failures (738 default + 18 `@tag :fixtures`).
- Strict TDD applied throughout (red → green → refactor per task).
- Spec-driven via SDD (`sdd/pdf-reader-form-xobject-recursion/*` artifacts in
  engram).
- `Pdf.Reader.ContentStream` `@moduledoc` cites PDF 1.7 § 8.10 (Form XObjects),
  § 8.10.2 (Form Dictionaries), § 8.4 (Coordinate Systems), § 8.8 (External
  Objects / Do operator) plus pdf.js + pdfminer-six reference impls.

---

## Unreleased — pdf-reader-encryption (Phase 2)

### Added — Phase 2 (pdf-reader-encryption)

- **Standard Security Handler** support — encrypted PDFs are now READABLE via
  `Pdf.Reader.open/2` when the correct password is provided (or empty for
  metadata-protection cases). Implements all four spec versions:
  - **V1 / R=2** — RC4 40-bit (legacy)
  - **V2 / R=3** — RC4 up to 128-bit (most common pre-2008)
  - **V4 / R=4** — Crypt Filters + AES-128 (PDF 1.6+)
  - **V5 / R=6** — AES-256 + SHA-256/384/512 mixing (PDF 2.0 / Acrobat X+)
- **`Pdf.Reader.open/2`** with `password: String.t()` opt (default `""`).
  - Always tries empty password first (metadata-protection auto-unlock).
  - If non-empty password supplied, tries as user → owner password.
  - `Pdf.Reader.open/1` retained — delegates to `open/2` with empty opts.
- **New error atoms** in `Pdf.Reader.reason/0`:
  - `:encrypted_password_required` — no password supplied, empty failed.
  - `:encrypted_wrong_password` — supplied password rejected as user AND owner.
  - `:encrypted_unsupported_handler` — `/Filter != /Standard`, V5/R5 (deprecated),
    or RC4 unavailable on the runtime.
  - The legacy `:encrypted` atom is REMOVED (existing test updated to assert
    the new atom).
- **`Pdf.Reader.Document` struct** gained `:encryption` field
  (`%StandardHandler{}` when encrypted, `nil` otherwise).
- **Decryption hook** integrated transparently in
  `Pdf.Reader.ObjectResolver.resolve_in_use/3` only — `resolve_compressed/3`
  is left untouched (object-stream contents are decrypted ONCE at the
  containing-stream level; double-decryption would corrupt them).
- **Per-object encryption key** derivation per PDF 1.7 § 7.6.2 for V1/V2/V4
  (file key + obj_num + gen_num + optional `sAlT` literal → MD5 → truncate).
  V5 uses the file encryption key directly.
- **Crypt Filter `/Identity`** honored — V4 streams marked `/Identity` are
  passed through plaintext (common XMP metadata pattern).
- **`/EncryptMetadata false`** honored — when set in the Encrypt dict, the
  catalog's `/Metadata` stream is read as plaintext regardless of the
  default Stream Filter.
- **`mix.exs`** — `:crypto` added to `extra_applications` (required at
  release time; the OTP `:crypto` app is stdlib, not a Hex dep).
- New modules: `Pdf.Reader.Encryption` (facade), `Pdf.Reader.Encryption.{PasswordPad, ObjectKey, StandardHandler, V1V2, V4, V5}`.

### Known Limitations (Phase 2, carried forward)

- **End-to-end V4/V5 round-trip integration tests deferred** — algorithm-level
  unit tests (73 total across V1V2/V4/V5) verify each cipher against published
  vectors from Mozilla pdf.js `crypto_spec.js`, cross-checked with Node.js.
  V2/R3 is fully covered end-to-end via `craft_rc4_v2_pdf/1` (round-trip from
  hand-crafted PDF through `open/2` → `read_text/1`). V4/V5 dispatch through
  the resolver hook is unit-validated but lacks a full hand-crafted PDF
  round-trip fixture. Planned as `pdf-reader-encryption-fixtures-handcraft`.
- **Real-world fixture PDFs not committed** — would require `qpdf` as a
  build/test dependency, which contradicts the project's "native only, zero
  external dependencies" principle. Planned as a separate optional change
  if/when the constraint is relaxed.
- **R5** (deprecated V5 variant) — unsupported by design. PDFs with `V=5 R=5`
  return `{:error, :encrypted_unsupported_handler}`.
- **Public-Key Security Handler** (X.509 cert-based, `/Filter /Adobe.PubSec`
  or similar) — not supported. Returns `:encrypted_unsupported_handler`.
- **Permission flag enforcement** — flags are read but NOT enforced. We are
  a reader; downstream tools may choose to honor `/P` bits.
- **RC4 availability** — runtime dependent on OpenSSL configuration. On
  systems where RC4 is disabled (some OpenSSL 3 builds), V1/V2 PDFs return
  `:encrypted_unsupported_handler`. AES paths (V4/V5) work everywhere.

### Internal — Phase 2

- Test suite: 726 tests, 0 failures (708 default + 18 `@tag :fixtures`).
- 73 unit tests across V1V2/V4/V5 verify algorithms 2, 4, 5, 6, 7, 8, 9,
  10 against vectors sourced from Mozilla pdf.js `test/unit/crypto_spec.js`
  (Apache-2.0). Each vector independently re-computed with Node.js `crypto`
  and `:crypto` Erlang to confirm parity.
- Strict TDD applied throughout (red → green → refactor per task).
- Spec-driven via SDD (`sdd/pdf-reader-encryption/*` artifacts in engram).
- All algorithm modules cite canonical spec URLs (PDF 1.7/2.0, NIST FIPS 197,
  NIST SP 800-38A, RFC 1321) in `@moduledoc`.

---

## Unreleased — pdf-reader-cascade-wire (Phase 1.1)

### Added — Phase 1.1 (pdf-reader-cascade-wire)

- **Encoding cascade wired** through `read_text/2` and `read_text_with_positions/1` —
  text is now decoded to Unicode (was raw bytes in Phase 1). Per-font cascade order:
  ToUnicode CMap → /Differences + AGL → base encoding (WinAnsi/MacRoman/Standard) → U+FFFD.
- **Per-font decoder construction with cache** — `Pdf.Reader.Font.build_decoder/2` builds
  closures per font dict; decoders are cached in `Document.cache` keyed by
  `{:font_decoder, font_ref}` (indirect-ref fonts only; inline font dicts are not cached).
- **`Tf` operator switches active decoder** mid-content-stream — font changes in the
  stream are respected; each text operation uses the decoder for the currently active font.
- **XMP metadata parsing** via `:xmerl` (OTP stdlib) — `read_metadata/1` merges XMP with
  `/Info`; XMP wins on conflict (PDF 1.7 § 14.3.2). Recognized namespaces: dc:, xmp:, pdf:.
  Malformed XMP falls back to `/Info`-only silently.
- **`Pdf.Reader.Image` struct** gained `:ctm`, `:render_width`, `:render_height`,
  `:rotation_radians` fields. CTM decomposition follows PDF 1.7 § 8.3.3 and § 8.9.5.
- **Resource inheritance** — one-level parent-chain walk added to `resolve_page_resources/2`
  so writer-built PDFs (which store resources on the Pages parent node, not the leaf page)
  extract text and images correctly.
- New modules: `Pdf.Reader.Font`, `Pdf.Reader.XMP`
- New fixture: `test/fixtures/images/tiny.jpg` (32×32 px, ~900 B, public-domain JPEG
  from picsum.photos — used by image CTM integration tests)

### Known Limitations (Phase 1.1, carried forward)

- **Resource inheritance** — only one level of parent-chain walk is implemented. PDFs with
  deeply nested page trees that store resources two or more levels above the leaf page may
  produce empty text. Planned as `pdf-reader-resource-inheritance` change.
- **Per-glyph advance via `/Widths`** — glyph advance is approximated as uniform
  (`char_count × font_size`). Per-glyph widths are a separate change.
- **Form XObject `Do` recursion** — content inside form XObjects (`/Type /Form`) is not
  extracted. Planned for Phase 3.
- **CID fonts beyond ToUnicode** — CID-keyed fonts without a `/ToUnicode` CMap produce
  U+FFFD substitutions. Planned for Phase 3.
- **CCITTFaxDecode, JBIG2Decode, JPXDecode** — not supported; these require third-party
  C libraries and are outside scope.

### Internal — Phase 1.1

- Test suite: 616 tests, 0 failures (598 default + 18 `@tag :fixtures`)
- Strict TDD applied throughout (red → green → refactor per task)
- Spec-driven via SDD (`sdd/pdf-reader-cascade-wire/*` artifacts in engram)

---

## Unreleased — pdf-reader-core (Phase 1)

### Added

- `Pdf.Reader.open/1`, `read_text/2`, `read_text_with_positions/1`, `read_images/1`,
  `read_metadata/1`, `page_count/1`, `close/1` — and bang variants (`open!/1`, etc.)
- Stream filter pipeline:
  - `FlateDecode` with PNG predictors 1–4 and 10–14 and TIFF Predictor 2 (horizontal differencing)
  - `ASCII85Decode` with `z` shortcut and `~>` EOD marker
  - `ASCIIHexDecode` with whitespace tolerance and `>` EOD
  - `RunLengthDecode` (128 = EOD, 0–127 = literal, 129–255 = repeat)
  - `LZWDecode` with variable-width codes (9–12 bit), EarlyChange 0 and 1
- Cross-reference table support: classic xref (PDF 1.0–1.4) AND xref streams (PDF 1.5+)
  with `/Prev` chain merging and hybrid chains (mixed classic + stream)
- Object stream (`/Type /ObjStm`) decoding via `Pdf.Reader.ObjectStream`
- Encoding cascade (per-glyph): ToUnicode CMap → /Differences + Adobe Glyph List →
  base encoding (WinAnsi / MacRoman / StandardEncoding) → `U+FFFD` with diagnostic sentinel
- Bundled Adobe Glyph List 2.0 as a compile-time module (~4 500 entries, BSD-licensed)
- Public-domain encoding tables:
  - Apple ROMAN.TXT (canonical Mac Roman mapping)
  - PDF 1.7 Annex D.2 StandardEncoding (cross-checked against Mozilla pdf.js)
- Lazy indirect-object resolver with pure `Map` cache — no GenServer, no Agent
- Pure tagged-tuple internal value model:
  `{:ref, n, g}`, `{:name, _}`, `{:string, _}`, `{:hex_string, _}`,
  `{:stream, dict, body}`, plain `%{}` for dicts, plain lists for arrays

### Known Limitations (Phase 1)

- No encryption support — encrypted PDFs return `{:error, :encrypted}` (deferred to Phase 2)
- No CID fonts beyond ToUnicode-mapped glyphs (deferred to Phase 3)
- No `CCITTFaxDecode`, `JBIG2Decode`, or JPEG 2000 image filters — these require
  third-party C libraries and are outside scope
- No AcroForm or XFA form field extraction
- No OCR or scanned-PDF text extraction — impossible without third parties
- Form XObject (`Do` operator) is recognised but not recursed; content is not extracted
- Glyph advance approximation: uses `char_count × font_size` instead of per-glyph
  `/Widths`; start-of-run position is exact, inter-run drift is possible for proportional fonts
- CMap multi-codepoint mappings (ligatures): only the first codepoint is used
- Malformed PDFs return strict `{:error, :malformed}` — no partial-recovery mode
- XMP metadata streams are not parsed; `read_metadata/1` reads only the `/Info` dictionary

### Internal

- Test suite: 550 tests, 0 failures (541 default + 9 `@tag :fixtures`)
- Strict TDD applied throughout (red → green → refactor per task)
- Spec-driven via SDD (`sdd/pdf-reader-core/*` artifacts in engram)

---

## 0.7.1 (2024-07-23)

- Fix memory leak when cleaning up a PDF process

## 0.7.0 (2024-07-12)

- Add `autoprint/1` to automatically open the print dialog in a browser

## 0.6.1 (2023-01-19)

- Fix bug with zero width strings and empty rows (also fixes [#24])
- Fix issue with nil cap height [#35]
- Raise RuntimeError when attempting to add text without a font [#36]
- Fix typespec for `text_wrap/5` [#37]

## 0.6.0 (2021-12-07)

- Add `:odd` and `:even` to `:row_style` on table with a lower precedence than indexed styles
- Fix bug where only the first non-WinAnsi character was replaced [#32]

## 0.5.0 (2020-12-02)

- Catch errors raised within the GenServer and re-raise them in the calling process

## 0.4.0 (2020-08-12)

- Add `:encoding_replacement_character` option to supply a replacement character when encoding fails
- Add `:allow_row_overflow` option to `Pdf.table/4` to allow row contents to be split across pages

## 0.3.7 (2020-04-29)

- Bug fix: Fix memory leak by stopping internal processes

## 0.3.6 (2020-04-22)

- Bug fix: Correctly handle encoded text as binary, not UTF-8 encoded string
- Bug fix: External fonts now work like built-in fonts #17
- Bug fix: Reset colours changed by attributed text
- Bug fix: Fix global options for text_at/4 when using a string #11

## 0.3.5 (2020-04-14)

- Deprecate: `Pdf.delete/1` in favour of `Pdf.cleanup/1`
- Deprecate: `Pdf.open/2` in favour of `Pdf.build/2`