README.md

[![Hex.pm](https://img.shields.io/hexpm/v/pdf_redlines)](https://hex.pm/packages/pdf_redlines)
[![Hexdocs.pm](https://img.shields.io/badge/docs-hexdocs.pm-purple)](https://hexdocs.pm/pdf_redlines)
[![Github.com](https://github.com/EnaiaInc/pdf_redlines/actions/workflows/ci.yml/badge.svg)](https://github.com/EnaiaInc/pdf_redlines/actions)

# PDFRedlines

Fast PDF redline extraction via a Rust NIF (MuPDF).

## Usage

```elixir
{:ok, result} = PDFRedlines.extract_redlines("/path/to/document.pdf")
# %PDFRedlines.Result{redlines: [%PDFRedlines.Redline{...}, ...]}
```

## What Are Redlines?

Redlines are tracked changes embedded in PDFs, typically represented as:

- **Deletions**: colored text with a strikethrough line through the middle.
- **Insertions**: colored text with an underline below the text.

This library detects those visual signals and converts them into structured
entries (deletion, insertion, or paired change).

## Notes

- This library is intentionally small and focused on the minimal API we need.
- Precompiled NIFs are published in GitHub Releases; set `PDF_REDLINES_BUILD=1`
  to force a local build.

## Configuration

You can pass a keyword list or map to tune detection thresholds:

- `:red_r_min`
- `:red_g_max`
- `:red_b_max`
- `:blue_r_max`
- `:blue_g_max`
- `:blue_b_min`
- `:formatting_bar_height_max`
- `:formatting_bar_width_min`
- `:line_bar_height_max`
- `:line_bar_width_min`
- `:stroke_line_y_tolerance`
- `:stroke_line_width_min`
- `:line_break_height_ratio`
- `:same_line_y_tolerance`
- `:merge_x_gap_max`
- `:merge_line_height_min_ratio`
- `:merge_line_height_max_ratio`
- `:margin_end_ratio`
- `:margin_start_ratio`
- `:pair_x_gap_max`
- `:page_width_fallback`
- `:line_height_fallback`

### Tuning Guide (Quick)

- If **strikethroughs are missed**, try increasing:
  - `:formatting_bar_height_max`
  - `:stroke_line_width_min` (if lines are thicker)
- If **underlines are missed**, try increasing:
  - `:line_bar_height_max`
  - `:line_bar_width_min`
- If **colors are missed**, widen:
  - `:red_r_min` (lower for lighter reds)
  - `:blue_b_min` (lower for lighter blues)
- If **line wrapping isn’t merged**, adjust:
  - `:merge_line_height_min_ratio`, `:merge_line_height_max_ratio`
  - `:margin_start_ratio`, `:margin_end_ratio`

## Parity Test (Optional)

There is an optional parity test that compares Rust/MuPDF results against the
Python/PyMuPDF implementation. It is skipped by default.

Run it with:

```bash
TEST_PDF_REDLINES_PARITY=true mix test test/redlines_parity_test.exs
```

Inputs are read from `PDF_REDLINES_TEST_DIR` (defaults to `test/fixtures/pdfs`).

## Performance

Extraction runs entirely in a Rust NIF on a dirty scheduler. Benchmarked on
real-world documents (M1 Mac):

| Document                    | Size  | Time    |
| --------------------------- | ----- | ------- |
| 35 MB scanned PDF           | 35 MB | ~350 ms |
| 33 MB scanned/OCR'd PDF     | 33 MB | ~640 ms |
| 12 MB PDF with 362 redlines | 12 MB | ~130 ms |

Even the worst case (large scanned documents) finishes under 700 ms.

## Benchmarks

Run a basic benchmark across a folder of PDFs:

```bash
PDF_REDLINES_BUILD=1 mix pdf_redlines.bench
```

You can customize:

- `PDF_REDLINES_TEST_DIR` (default `test/fixtures/pdfs`)
- `PDF_REDLINES_BENCH_REPEATS` (default `3`)

## Development

To compile the Rust sources along with the library, you can set
the `RUSTLER_PRECOMPILED_FORCE_BUILD_ALL` env variable to `1`, e.g.:

```bash
RUSTLER_PRECOMPILED_FORCE_BUILD_ALL=1 mix compile
```

## License

MIT