# PdfEx

A pure-Elixir PDF parsing and surgery engine. No NIFs, no C bindings, no
external binaries — one runtime dependency (`:telemetry`).

PdfEx is built around a **lossless invariant**: `serialize(open(bytes)) == bytes`
for any unmodified document, and edits are appended as PDF incremental updates
so the original bytes are always a byte-for-byte prefix of the output.
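
The prefix half of the invariant can be checked from the outside without knowing anything about PdfEx internals. A minimal sketch: `PrefixCheck` is a hypothetical helper, not part of the library, and it only compares raw binaries.

```elixir
defmodule PrefixCheck do
  @moduledoc "Sketch: test whether one binary is a byte-for-byte prefix of another."

  @doc "Returns true when `original` is a byte-for-byte prefix of `edited`."
  def prefix?(original, edited) when is_binary(original) and is_binary(edited) do
    size = byte_size(original)

    # Guard the length first so binary_part/3 never reads past the end.
    byte_size(edited) >= size and binary_part(edited, 0, size) == original
  end
end
```

If `original` is the binary passed to `PdfEx.open/1`, then `PrefixCheck.prefix?(original, PdfEx.Serializer.serialize(doc))` should hold after any incremental edit.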

## What it does

- **Parse** real-world PDFs: classic xref tables, PDF 1.5+ xref streams,
  object streams, Flate + PNG predictors — with deliberate leniency for the
  malformed files real producers emit (sloppy xref entry EOLs, mispointed
  `startxref`, junk bytes in content streams).
- **Extract text** with positions, fonts, real `/Widths` metrics, and
  ToUnicode/encoding decoding.
- **Edit structure** (`PdfEx.Editor`): insert, delete, and reorder pages —
  deletions are lossless (objects are freed, never destroyed).
- **Edit text** (`PdfEx.ContentEdit`): run-level text replacement and glyph
  deletion via token-span patching, with width compensation so the rest of the
  line doesn't drift. Works on single-byte fonts and Type0 / Identity-H
  composite fonts.
- **Move text** (`PdfEx.Convert`): per-glyph stable UIDs and position mutations
  that token-patch the content stream without regenerating anything.
- **Project to HTML, both ways** (`PdfEx.Convert`): a byte-faithful **visual**
  mode and a **semantic** mode (`<h*>`/`<p>`/`<li>` with `data-uid` ranges).
  Editing the HTML maps back to per-run text ops (reverse mapping).
- **Collaborate** (`PdfEx.Session`): supervised, per-document editing sessions
  with a crash-surviving snapshot cache and operational-transform conflict
  resolution for concurrent edits.
- **Serialize** (`PdfEx.Serializer`): **incremental** by default (lossless,
  matching the source's xref style) or opt-in **full** re-serialization
  (one clean revision; not byte-lossless).
- **Subset fonts** (`PdfEx.Font.Surgery`): TrueType glyph-retaining subsetting
  (composite-glyph closure, recomputed checksums).

```elixir
original   = File.read!("report.pdf")
{:ok, doc} = PdfEx.open(original)

{:ok, n}    = PdfEx.page_count(doc)
{:ok, text} = PdfEx.extract_text(doc)

# Structural surgery — lossless, incremental
{:ok, doc}   = PdfEx.Editor.delete_page(doc, 2)
edited_bytes = PdfEx.Serializer.serialize(doc)
# byte_size(edited_bytes) > byte_size(original); original is a prefix

# Rewrite the text of a run (addressed by a stable glyph UID)
{:ok, doc} = PdfEx.ContentEdit.replace_text(doc, "p_3_g_0", "Revised heading")

# Semantic HTML, with data-uid back-references for round-tripping edits
{:ok, html} = PdfEx.Convert.to_html(doc, mode: :semantic)

# Collaborative session: reads bypass the server; writes are OT-coordinated
{:ok, id}  = PdfEx.Session.open(doc)
{:ok, _op} = PdfEx.Session.apply_op(id, %PdfEx.Op.UpdateText{uid: "p_3_g_0", text: "Hi"})
{:ok, doc} = PdfEx.Session.fetch(id)
```
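
The example above pattern-matches on `{:ok, _}`, which raises `MatchError` if a call returns an error tuple. In application code a `with` chain is usually a better fit. The following is a sketch: `OpenSketch` and its `pdf` parameter are hypothetical, and the only library calls assumed are `open/1` and `page_count/1` as shown above.

```elixir
defmodule OpenSketch do
  @moduledoc "Sketch: chain tagged-tuple calls without raising on failure."

  # `pdf` is the module to call into. It defaults to PdfEx and can be swapped
  # for a stub in tests, so this module compiles without the library present.
  def page_count(bytes, pdf \\ PdfEx) when is_binary(bytes) do
    with {:ok, doc} <- pdf.open(bytes),
         {:ok, n} <- pdf.page_count(doc) do
      {:ok, n}
    end

    # Any value that does not match {:ok, _} (for example an error tuple)
    # falls through `with` unchanged and becomes the return value.
  end
end
```

Callers can then branch on `{:ok, n}` versus the error tuple instead of rescuing exceptions.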

## Design

- **Lazy dual-AST.** Untouched objects stay as zero-copy binary references;
  only touched objects materialize. Content-stream edits patch token spans
  in place.
- **Pure functional core.** Every parse/edit/serialize API is a pure function
  over an immutable `PdfEx.Document`. Errors are tagged tuples
  (`{:error, %PdfEx.Error{}}`) — malformed input never raises. The only
  stateful component is the optional collaborative session shell (a supervised
  GenServer per document; reads still bypass it).
- **Hardened against hostile input**: atom-table exhaustion (unknown names
  stay binaries), nesting-depth bombs, circular xref/`/Length` chains,
  unbounded xref-stream ranges, refc binary pinning.
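
To illustrate the atom-table point: atoms are never garbage-collected on the BEAM, so converting attacker-controlled PDF names with `String.to_atom/1` would let a hostile file exhaust the atom table. Below is a minimal sketch of the "unknown names stay binaries" idea. `SafeName` and its allow-list are hypothetical and not taken from PdfEx's source.

```elixir
defmodule SafeName do
  @moduledoc "Sketch: intern only a fixed allow-list of PDF names as atoms."

  # Hypothetical allow-list; a real parser would list the names it dispatches on.
  @known ~w(Type Pages Page Contents Length Filter)

  # Built once at compile time, so the number of atoms created is bounded.
  @known_atoms Map.new(@known, fn name -> {name, String.to_atom(name)} end)

  @doc "Returns an atom for allow-listed names and the original binary otherwise."
  def intern(name) when is_binary(name), do: Map.get(@known_atoms, name, name)
end
```

With this shape, `SafeName.intern("Type")` yields `:Type`, while an unrecognized name comes back as the same binary and never creates a new atom.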

## Current limitations (0.1.x)

- **No encryption support** — encrypted PDFs return an error at open.
- Text/position edits require **uncompressed content streams** and patch the
  first `/Contents` stream of a page.
- Composite-font editing covers **Identity-H** only; other CMaps and CFF
  (Type0/CIDFontType0) glyph injection are out of scope. Re-encoding maps to
  glyphs already present in the font's ToUnicode (no new glyphs).
- TrueType subsetting is **glyph-retaining** (glyph IDs preserved, unused
  outlines emptied); glyph renumbering/compaction, CFF subsetting, and wiring
  the subset back into a document's `FontFile2` are future work.
- Full re-serialization (`mode: :full`) is explicitly **not** byte-lossless.

## Installation

```elixir
def deps do
  [
    {:pdf_ex, "~> 0.1.0"}
  ]
end
```

## Documentation

Generate the docs locally with [ExDoc](https://hexdocs.pm/ex_doc):

```sh
mix docs        # writes HTML to doc/
```

## Testing

```sh
mix test                          # unit + integration suite
mix test --include corpus         # also sweep real PDFs in test/fixtures/corpus/
mix dialyzer                      # static analysis
```

The corpus sweep asserts the library's hard invariants against any PDFs you
drop into `test/fixtures/corpus/` (gitignored): `open` never raises, unmodified
documents round-trip byte-for-byte, and incrementally edited output re-parses.

## License

MIT — see [LICENSE](LICENSE).