# PdfEx
A pure-Elixir PDF parsing and surgery engine. No NIFs, no C bindings, no
external binaries — one runtime dependency (`:telemetry`).
PdfEx is built around a **lossless invariant**: `serialize(open(bytes)) == bytes`
for any unmodified document, and edits are appended as PDF incremental updates
so the original bytes are always a byte-for-byte prefix of the output.
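
The prefix property can be checked without touching the library at all. Below is a minimal sketch, assuming you already hold the original file bytes and the serialized output as two binaries. `PrefixCheck` and `original_prefix?/2` are hypothetical names used for illustration and are not part of PdfEx.

```elixir
defmodule PrefixCheck do
  # Hypothetical helper (not part of PdfEx): returns true when `original`
  # is a byte-for-byte prefix of `output`, which is what an incremental
  # update is expected to preserve.
  def original_prefix?(original, output)
      when is_binary(original) and is_binary(output) do
    size = byte_size(original)
    byte_size(output) >= size and binary_part(output, 0, size) == original
  end
end

# Stand-in bytes: an incremental update only appends to the original file.
original = "%PDF-1.7\n1 0 obj\n<<>>\nendobj\n%%EOF\n"
appended = original <> "2 0 obj\n<<>>\nendobj\n%%EOF\n"

IO.inspect(PrefixCheck.original_prefix?(original, appended))
# => true
IO.inspect(PrefixCheck.original_prefix?(original, "rewritten from scratch"))
# => false
```

A check like this is cheap enough to run after every incremental save in a test suite.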
## What it does
- **Parse** real-world PDFs: classic xref tables, PDF 1.5+ xref streams,
object streams, Flate + PNG predictors — with deliberate leniency for the
malformed files real producers emit (sloppy xref entry EOLs, mispointed
`startxref`, junk bytes in content streams).
- **Extract text** with positions, fonts, real `/Widths` metrics, and
ToUnicode/encoding decoding.
- **Edit structure** (`PdfEx.Editor`): insert, delete, and reorder pages —
deletions are lossless (objects are freed, never destroyed).
- **Edit text** (`PdfEx.ContentEdit`): run-level text replacement and glyph
deletion via token-span patching, with width compensation so the rest of the
line doesn't drift. Works on single-byte fonts and Type0 / Identity-H
composite fonts.
- **Move text** (`PdfEx.Convert`): per-glyph stable UIDs and position mutations
that token-patch the content stream without regenerating anything.
- **Project to HTML, both ways** (`PdfEx.Convert`): a byte-faithful **visual**
mode and a **semantic** mode (`<h*>`/`<p>`/`<li>` with `data-uid` ranges).
Editing the HTML maps back to per-run text ops (reverse mapping).
- **Collaborate** (`PdfEx.Session`): supervised, per-document editing sessions
with a crash-surviving snapshot cache and operational-transform conflict
resolution for concurrent edits.
- **Serialize** (`PdfEx.Serializer`): **incremental** by default (lossless,
matching the source's xref style) or opt-in **full** re-serialization
(one clean revision; not byte-lossless).
- **Subset fonts** (`PdfEx.Font.Surgery`): TrueType glyph-retaining subsetting
(composite-glyph closure, recomputed checksums).
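
For readers unfamiliar with the "Flate + PNG predictors" item: xref streams are commonly written with a PNG row filter applied before compression, so each inflated row starts with a filter-type byte. The sketch below undoes that step. It assumes every row is exactly `columns` bytes long and handles only the None (0) and Up (2) filter types. `PngPredictor` is a hypothetical module for illustration, not PdfEx's internal decoder.

```elixir
defmodule PngPredictor do
  # Hypothetical sketch: reverse PNG row filtering on already-inflated data.
  # Each row is one filter-type byte followed by `columns` data bytes.
  def decode(data, columns)
      when is_binary(data) and is_integer(columns) and columns > 0 do
    if rem(byte_size(data), columns + 1) != 0 do
      {:error, :truncated_row}
    else
      # The row "above" the first row is defined as all zeros.
      decode_rows(data, columns, :binary.copy(<<0>>, columns), [])
    end
  end

  defp decode_rows(<<>>, _columns, _prev, acc) do
    {:ok, IO.iodata_to_binary(Enum.reverse(acc))}
  end

  defp decode_rows(data, columns, prev, acc) do
    <<filter, row::binary-size(columns), rest::binary>> = data

    case unfilter(filter, row, prev) do
      {:ok, decoded} -> decode_rows(rest, columns, decoded, [decoded | acc])
      {:error, _reason} = error -> error
    end
  end

  # Filter 0 (None): the row is stored as-is.
  defp unfilter(0, row, _prev), do: {:ok, row}

  # Filter 2 (Up): each byte is stored as a delta from the byte above it,
  # so decoding adds the previous decoded row back, modulo 256.
  defp unfilter(2, row, prev) do
    decoded =
      Enum.zip(:binary.bin_to_list(row), :binary.bin_to_list(prev))
      |> Enum.map(fn {cur, up} -> rem(cur + up, 256) end)
      |> :binary.list_to_bin()

    {:ok, decoded}
  end

  # Sub (1), Average (3), and Paeth (4) are left out of this sketch.
  defp unfilter(other, _row, _prev), do: {:error, {:unsupported_filter, other}}
end

IO.inspect(PngPredictor.decode(<<2, 1, 0, 16, 2, 0, 0, 5>>, 3))
# => {:ok, <<1, 0, 16, 1, 0, 21>>}
```

Errors come back as tagged tuples rather than exceptions, in keeping with the convention described under Design.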
```elixir
original = File.read!("report.pdf")
{:ok, doc} = PdfEx.open(original)
{:ok, n} = PdfEx.page_count(doc)
{:ok, text} = PdfEx.extract_text(doc)
# Structural surgery — lossless, incremental
{:ok, doc} = PdfEx.Editor.delete_page(doc, 2)
edited_bytes = PdfEx.Serializer.serialize(doc)
# byte_size(edited_bytes) > byte_size(original); original is a prefix
# Rewrite the text of a run (addressed by a stable glyph UID)
{:ok, doc} = PdfEx.ContentEdit.replace_text(doc, "p_3_g_0", "Revised heading")
# Semantic HTML, with data-uid back-references for round-tripping edits
{:ok, html} = PdfEx.Convert.to_html(doc, mode: :semantic)
# Collaborative session: reads bypass the server; writes are OT-coordinated
{:ok, id} = PdfEx.Session.open(doc)
{:ok, _op} = PdfEx.Session.apply_op(id, %PdfEx.Op.UpdateText{uid: "p_3_g_0", text: "Hi"})
{:ok, doc} = PdfEx.Session.fetch(id)
```
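
The width compensation that `PdfEx.ContentEdit` applies can be illustrated with the arithmetic of the PDF `TJ` operator, where a number in the array moves the following text left by that many thousandths of a text-space unit (scaled by the font size). The sketch below assumes glyph widths in the usual 1/1000 units of `/Widths` and ignores character and word spacing. `WidthCompensation` and its functions are hypothetical, the metrics are made up, and PdfEx may compute this differently.

```elixir
defmodule WidthCompensation do
  # Total advance of a run in 1/1000 text-space units.
  # `widths` maps character codes to widths; codes missing from the map
  # count as zero here (a real font would fall back to /MissingWidth).
  def run_width(codes, widths) do
    Enum.reduce(codes, 0, fn code, acc -> acc + Map.get(widths, code, 0) end)
  end

  # TJ number to place after the replacement string so that whatever
  # follows starts where it did before the edit. Positive values pull the
  # following text left; negative values push it right.
  def tj_adjustment(old_codes, new_codes, widths) do
    run_width(new_codes, widths) - run_width(old_codes, widths)
  end
end

# Made-up metrics for three glyphs.
widths = %{?H => 722, ?i => 278, ?o => 556}

IO.inspect(WidthCompensation.tj_adjustment(~c"Hi", ~c"Hoo", widths))
# => 834, so `[(Hi)] TJ` could become `[(Hoo) 834] TJ`
```

With the adjustment in place the net advance of the run is unchanged (1834 - 834 = 1000 units), which is why the rest of the line stays put.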
## Design
- **Lazy dual-AST.** Untouched objects stay as zero-copy binary references;
only touched objects materialize. Content-stream edits patch token spans
in place.
- **Pure functional core.** Every parse/edit/serialize API is a pure function
over an immutable `PdfEx.Document`. Errors are tagged tuples
(`{:error, %PdfEx.Error{}}`) — malformed input never raises. The only
stateful component is the optional collaborative session shell (a supervised
GenServer per document; reads still bypass it).
- **Hardened against hostile input**: atom-table exhaustion (unknown names
stay binaries), nesting-depth bombs, circular xref/`/Length` chains,
unbounded xref-stream ranges, refc binary pinning.
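
Two of the hardening points above map onto well-known BEAM techniques, sketched here under stated assumptions. `HardeningSketch`, its name list, and both functions are hypothetical and only illustrate the general idea, not PdfEx's actual code: a closed vocabulary keeps attacker-chosen names out of the atom table, and `:binary.copy/1` stops a small slice from pinning a large refc binary.

```elixir
defmodule HardeningSketch do
  # A closed vocabulary of names that may become atoms (illustrative subset).
  @known_names ~w(Type Pages Page Kids Count Contents Length Filter)
  @lookup Map.new(@known_names, fn name -> {name, String.to_atom(name)} end)

  # Known names map to atoms; anything else stays a binary, so hostile
  # input cannot grow the atom table.
  def intern_name(name) when is_binary(name), do: Map.get(@lookup, name, name)

  # A sub-binary keeps its whole parent binary alive. Copying a slice that
  # must outlive the file buffer lets the large binary be collected.
  def detach(slice) when is_binary(slice), do: :binary.copy(slice)
end

IO.inspect(HardeningSketch.intern_name("Pages"))
# => :Pages
IO.inspect(HardeningSketch.intern_name("Xq9attackerChosen"))
# => "Xq9attackerChosen"

big = :binary.copy(<<0>>, 100_000)
slice = binary_part(big, 0, 16)

IO.inspect(:binary.referenced_byte_size(slice))
# => 100000
IO.inspect(:binary.referenced_byte_size(HardeningSketch.detach(slice)))
# => 16
```

The trade-off is deliberate: zero-copy slices are what make lazy parsing cheap, so copying is only worth it for values that escape the lifetime of the source buffer.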
## Current limitations (0.1.x)
- **No encryption support** — encrypted PDFs return an error at open.
- Text/position edits require **uncompressed content streams** and patch only
the first `/Contents` stream of a page.
- Composite-font editing covers **Identity-H** only; other CMaps and CFF
(Type0/CIDFontType0) glyph injection are out of scope. Re-encoding maps to
glyphs already present in the font's ToUnicode (no new glyphs).
- TrueType subsetting is **glyph-retaining** (ids preserved, unused outlines
emptied); glyph renumbering/compaction, CFF subsetting, and wiring the
subset back into a document's `FontFile2` are future work.
- Full re-serialization (`mode: :full`) is explicitly **not** byte-lossless.
## Installation
```elixir
def deps do
[
{:pdf_ex, "~> 0.1.0"}
]
end
```
## Documentation
Generate the docs locally with [ExDoc](https://hexdocs.pm/ex_doc):
```sh
mix docs # writes HTML to doc/
```
## Testing
```sh
mix test # unit + integration suite
mix test --include corpus # also sweep real PDFs in test/fixtures/corpus/
mix dialyzer # static analysis
```
The corpus sweep asserts the library's hard invariants against any PDFs you
drop into `test/fixtures/corpus/` (gitignored): `open` never raises, unmodified
documents round-trip byte-identically, and incrementally edited output re-parses.
## License
MIT — see [LICENSE](LICENSE).