# Elixir Binding for Kreuzberg

<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
  <!-- Language Bindings -->
  <a href="https://crates.io/crates/kreuzberg">
    <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
  </a>
  <a href="https://hex.pm/packages/kreuzberg">
    <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
  </a>
  <a href="https://pypi.org/project/kreuzberg/">
    <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
  </a>
  <a href="https://www.npmjs.com/package/@kreuzberg/node">
    <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
  </a>
  <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
    <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
  </a>

<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
    <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
  </a>
  <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
    <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0-*" alt="Go">
  </a>
  <a href="https://www.nuget.org/packages/Kreuzberg/">
    <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
  </a>
  <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
    <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
  </a>
  <a href="https://rubygems.org/gems/kreuzberg">
    <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
  </a>

<!-- Project Info -->

<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
    <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
  </a>
  <a href="https://docs.kreuzberg.dev">
    <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
  </a>
</div>

<img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />

<div align="center" style="margin-top: 20px;">
  <a href="https://discord.gg/pXxagNK2zN">
      <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
  </a>
</div>

Extract text, tables, images, and metadata from 56+ file formats. The Elixir binding provides an idiomatic API, native BEAM concurrency, Rustler NIF integration, OTP supervision, and comprehensive E2E testing with zero flakiness.

> **Version 4.0.0 Release Candidate**
> Kreuzberg v4.0.0 is in **Release Candidate** stage. Bugs and breaking changes are expected.
> This is a pre-release version. Please test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.

## Table of Contents

- [Installation](#installation)
- [Architecture](#architecture)
- [Quick Start](#quick-start)
- [Usage Patterns](#usage-patterns)
- [E2E Workflow](#e2e-workflow)
- [Common Use Cases](#common-use-cases)
- [Features](#features)
- [Configuration Reference](#configuration-reference)
- [OCR Support](#ocr-support)
- [Async Support](#async-support)
- [Plugin System](#plugin-system)
- [Embeddings Support](#embeddings-support)
- [NIF Integration](#nif-integration)
- [Batch Processing](#batch-processing)
- [Configuration](#configuration)
- [Testing](#testing)
- [Documentation](#documentation)
- [Troubleshooting](#troubleshooting)

## Installation

### Via Hex Package Manager

Add to your `mix.exs` dependencies:

```elixir
def deps do
  [
    {:kreuzberg, "~> 4.0"}
  ]
end
```

Then run:

```bash
mix deps.get
```

The package uses precompiled Rustler NIF binaries where available; otherwise the Rust NIF is compiled from source (see [Native Build](#native-build)).

### System Requirements

- **Elixir 1.14+** and **Erlang/OTP 24+**
- C compiler (gcc, clang, or MSVC)
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality

### Native Build

If precompiled binaries are unavailable for your platform, Rustler will automatically compile from source:

```bash
# Install Rust if not already installed
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build the project
mix compile
```

## Architecture

The Elixir binding uses Rustler to safely call high-performance Rust code from Erlang/OTP:

```
┌─────────────────────────────────────┐
│   Elixir Application (Idiomatic)    │
│  - Pattern matching on {:ok, result}│
│  - Task-based async concurrency     │
│  - OTP supervisor integration       │
└────────────┬────────────────────────┘
             │
      Rustler NIF Boundary
       (Safe term exchange)
             │
┌────────────▼────────────────────────┐
│  Rust Native Implementation         │
│  - High-performance extraction      │
│  - Memory-safe term handling        │
│  - Native concurrency support       │
└─────────────────────────────────────┘
```

Key design principles:

- **Safety**: NIF boundary crossing is automatically validated
- **Concurrency**: BEAM scheduler handles concurrent calls without blocking
- **Memory**: Rust manages memory; Elixir handles distribution and caching
- **Idiomatic**: Elixir patterns like `{:ok, result}` and `{:error, reason}` throughout
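
The supervision point above can be made concrete with a `Task.Supervisor`. A minimal sketch, assuming an application named `MyApp` (the supervisor name `MyApp.TaskSupervisor` is illustrative, not part of Kreuzberg), that keeps a crashing or timed-out extraction from taking down the calling process:

```elixir
# In your application's supervision tree (e.g. MyApp.Application.start/2)
children = [
  {Task.Supervisor, name: MyApp.TaskSupervisor}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Run an extraction under the supervisor instead of an unlinked task
task =
  Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
    Kreuzberg.extract_file("document.pdf")
  end)

case Task.yield(task, 30_000) || Task.shutdown(task) do
  {:ok, {:ok, result}} -> IO.puts("Extracted #{byte_size(result.content)} bytes")
  {:ok, {:error, reason}} -> IO.puts("Extraction failed: #{inspect(reason)}")
  {:exit, reason} -> IO.puts("Extraction crashed: #{inspect(reason)}")
  nil -> IO.puts("Extraction timed out")
end
```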

## Quick Start

### Basic Extraction with Error Handling

Extract text from files with idiomatic Elixir error handling:

```elixir
# Simple extraction with pattern matching
case Kreuzberg.extract_file("document.pdf") do
  {:ok, result} ->
    IO.puts("Content: #{result.content}")
    IO.puts("Tables: #{length(result.tables)}")
    IO.puts("Pages: #{length(result.pages)}")

  {:error, reason} ->
    IO.puts("Extraction failed: #{reason}")
end
```

### Binary Data Extraction

Extract from bytes instead of files:

```elixir
# Process binary data directly
pdf_binary = File.read!("document.pdf")

{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf")

IO.puts("Extracted #{byte_size(result.content)} bytes of content")
IO.puts("Detected language: #{inspect(result.detected_languages)}")
```

## Usage Patterns

### Pattern 1: Synchronous Extraction

Blocking call, simple and straightforward:

```elixir
{:ok, result} = Kreuzberg.extract_file("document.pdf")
# or handle error
case Kreuzberg.extract_file("document.pdf") do
  {:ok, result} -> process_result(result)
  {:error, reason} -> log_error(reason)
end
```

### Pattern 2: Asynchronous Extraction with Task

Non-blocking extraction using BEAM tasks:

```elixir
# Spawn extraction in background
task = Kreuzberg.extract_async("document.pdf")

# Do other work...
do_other_work()

# Collect result when ready
{:ok, result} = Task.await(task, 30_000)
```

### Pattern 3: Concurrent Batch Processing

Process multiple files concurrently:

```elixir
files = ["file1.pdf", "file2.pdf", "file3.pdf"]

results =
  files
  |> Enum.map(fn file -> Task.async(fn -> Kreuzberg.extract_file(file) end) end)
  |> Task.await_many(30_000)

# results is a list of {:ok, result} or {:error, reason}
successful =
  results
  |> Enum.filter(&match?({:ok, _}, &1))
  |> Enum.map(fn {:ok, result} -> result end)

IO.puts("Processed #{length(successful)}/#{length(files)} files")
```

### Pattern 4: Batch API for Optimal Performance

Use batch extraction for multiple files with internal optimization:

```elixir
files = ["file1.pdf", "file2.pdf", "file3.pdf"]

{:ok, results} = Kreuzberg.batch_extract_files(files)

Enum.each(results, fn result ->
  IO.puts("File: #{result.mime_type}")
  IO.puts("Content length: #{byte_size(result.content)}")
end)
```

## E2E Workflow

The Elixir binding includes comprehensive end-to-end tests covering real-world scenarios.

### NIF Boundary Safety

Tests verify safe Erlang term exchange across the NIF boundary:

```elixir
# Unicode, binary data, and null bytes all cross safely
unicode_text = "Hello 你好 مرحبا שלום"
{:ok, result} = Kreuzberg.extract(unicode_text, "text/plain")
assert result.content == unicode_text  # Perfect round-trip

# Large data (10MB+) handled without crashes
large_binary = String.duplicate("X", 10_000_000)
{:ok, result} = Kreuzberg.extract(large_binary, "text/plain")
assert byte_size(result.content) > 0
```

### Concurrent Safety

High concurrency tested without deadlocks:

```elixir
# 50 concurrent NIF calls complete successfully
tasks = Enum.map(1..50, fn i ->
  Task.async(fn -> Kreuzberg.extract("Task #{i}", "text/plain") end)
end)

results = Task.await_many(tasks, 60_000)
assert length(results) == 50  # All completed
assert Enum.all?(results, &match?({:ok, _}, &1))  # All successful
```

### Memory Safety

Extraction doesn't cause resource leaks or excessive memory growth:

```elixir
initial_memory = Process.info(self(), :memory) |> elem(1)

# 100 extractions
Enum.each(1..100, fn i ->
  {:ok, _result} = Kreuzberg.extract("Test #{i}", "text/plain")
end)

:erlang.garbage_collect()
final_memory = Process.info(self(), :memory) |> elem(1)

# Memory should not grow unbounded
assert final_memory <= initial_memory * 5
```

### Error Recovery

NIF errors don't crash the VM; extraction continues normally:

```elixir
# Invalid MIME type returns error
{:error, reason} = Kreuzberg.extract("data", "invalid/type")
assert is_binary(reason)

# Next call works normally
{:ok, result} = Kreuzberg.extract("valid", "text/plain")
assert result.content == "valid"
```

## Common Use Cases

### Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

**With OCR (for scanned documents):**

```elixir
alias Kreuzberg.ExtractionConfig

config = %ExtractionConfig{
  ocr: %{"enabled" => true, "backend" => "tesseract"}
}

{:ok, result} = Kreuzberg.extract_file("scanned_document.pdf", nil, config)

content = result.content
IO.puts("OCR Extracted content:")
IO.puts(content)
IO.puts("Metadata: #{inspect(result.metadata)}")
```

#### Table Extraction

See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.

#### Processing Multiple Files

```elixir
file_paths = ["document1.pdf", "document2.pdf", "document3.pdf"]

{:ok, results} = Kreuzberg.batch_extract_files(file_paths)

Enum.each(results, fn result ->
  IO.puts("File: #{result.mime_type}")
  IO.puts("Content length: #{byte_size(result.content)} characters")
  IO.puts("Tables: #{length(result.tables)}")
  IO.puts("---")
end)

IO.puts("Total files processed: #{length(results)}")
```

#### Async Processing

For non-blocking document processing, run the extraction as a task and collect the result when it is ready:

```elixir
# Extraction runs in the background while the caller keeps working
task = Kreuzberg.extract_async("document.pdf")

do_other_work()

case Task.await(task, 30_000) do
  {:ok, result} ->
    IO.puts("Content: #{result.content}")
    IO.puts("Tables: #{length(result.tables)}")

  {:error, reason} ->
    IO.puts("Extraction failed: #{inspect(reason)}")
end
```

### Next Steps

- **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
- **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
- **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
- **[Configuration Guide](https://kreuzberg.dev/configuration/)** - Advanced configuration options
- **[Troubleshooting](https://kreuzberg.dev/troubleshooting/)** - Common issues and solutions

## Features

### Advanced Usage Examples

#### 1. Table Extraction with Cell Access

Extract and analyze tables with structured cell access and header information:

```elixir
alias Kreuzberg.ExtractionConfig

# Extract with table focus
config = %ExtractionConfig{
  pages: %{"extract_tables" => true, "extract_text" => false}
}

{:ok, result} = Kreuzberg.extract_file("data_sheet.pdf", nil, config)

# Iterate over extracted tables
Enum.each(result.tables, fn table ->
  IO.puts("Table found:")
  IO.puts("Headers: #{inspect(table.headers)}")

  # Access cell data
  Enum.each(table.cells, fn row ->
    row_data = Enum.join(row, " | ")
    IO.puts("  #{row_data}")
  end)

  IO.puts("Table markdown:\n#{table.markdown}")
end)
```

#### 2. Image Extraction and Processing

Extract images from documents with preprocessing and format control:

```elixir
alias Kreuzberg.ExtractionConfig

# Configure image extraction
config = %ExtractionConfig{
  images: %{
    "enabled" => true,
    "format" => "png",
    "quality" => 90,
    "max_width" => 1920,
    "max_height" => 1080
  }
}

{:ok, result} = Kreuzberg.extract_file("document.pdf", nil, config)

# Process extracted images
case result.images do
  nil ->
    IO.puts("No image extraction enabled")

  images ->
    Enum.each(images, fn image ->
      IO.puts("Extracted image:")
      IO.puts("  Format: #{image.format}")
      IO.puts("  Size: #{byte_size(image.data)} bytes")
      IO.puts("  Width: #{image.width}px")
      IO.puts("  Height: #{image.height}px")

      # OCR text if available
      if image.ocr_text do
        IO.puts("  OCR Text: #{String.slice(image.ocr_text, 0..100)}")
      end

      # Save to file
      filename = "image_#{image.id}.#{image.format}"
      File.write!(filename, image.data)
    end)
end
```

#### 3. Keywords Extraction Configuration

Configure and extract keywords using different algorithms:

```elixir
alias Kreuzberg.ExtractionConfig

# Extract with keyword detection
config = %ExtractionConfig{
  keywords: %{
    "enabled" => true,
    "algorithm" => "yake",
    "max_keywords" => 20,
    "min_score" => 0.1,
    "ngram_range" => [1, 3]
  }
}

{:ok, result} = Kreuzberg.extract_file("article.pdf", nil, config)

# Use extracted keywords
IO.puts("Extracted Keywords:")
IO.puts("Content length: #{byte_size(result.content)} bytes")
IO.puts("Detected languages: #{inspect(result.detected_languages)}")

# Pattern match on chunked results
case result.keywords do
  nil -> IO.puts("Keywords extraction not enabled")
  keywords ->
    Enum.each(keywords, fn {keyword, score} ->
      IO.puts("  #{keyword}: #{Float.round(score, 3)}")
    end)
end
```

#### 4. Embeddings with Chunking

Generate vector embeddings for semantic search and RAG applications:

```elixir
alias Kreuzberg.ExtractionConfig

# Configure chunking and embeddings
config = %ExtractionConfig{
  chunking: %{
    "enabled" => true,
    "chunk_size" => 512,
    "overlap" => 50,
    "strategy" => "semantic"
  }
}

{:ok, result} = Kreuzberg.extract_file("knowledge_base.pdf", nil, config)

case result.chunks do
  nil ->
    IO.puts("Chunking not enabled")

  chunks ->
    IO.puts("Generated #{length(chunks)} chunks")

    Enum.with_index(chunks, fn chunk, idx ->
      IO.puts("\nChunk #{idx + 1}:")
      IO.puts("  Text: #{String.slice(chunk.text, 0..80)}...")

      # Embedding available if embeddings enabled
      if chunk.embedding do
        IO.puts("  Embedding dimension: #{length(chunk.embedding)}")
      end
    end)
end
```

#### 5. Pages Extraction Usage

Extract page-by-page content with structural information:

```elixir
alias Kreuzberg.ExtractionConfig

# Extract specific pages with metadata
config = %ExtractionConfig{
  pages: %{
    "enabled" => true,
    "start_page" => 1,
    "end_page" => 5,
    "extract_text" => true,
    "extract_tables" => true,
    "extract_headers_footers" => true
  }
}

{:ok, result} = Kreuzberg.extract_file("report.pdf", nil, config)

# Access per-page information
case result.pages do
  nil ->
    IO.puts("Page extraction not enabled")

  pages ->
    Enum.each(pages, fn page ->
      IO.puts("\nPage #{page.number}:")
      IO.puts("  Dimensions: #{page.width}x#{page.height}px")
      IO.puts("  Content length: #{byte_size(page.content)} chars")

      # Use position information if available
      if page.position do
        IO.puts("  Position: #{inspect(page.position)}")
      end

      # Content preview
      preview = String.slice(page.content, 0..150)
      IO.puts("  Content: #{preview}...")
    end)
end
```

### Supported File Formats (56+)

Kreuzberg supports 56+ file formats across the categories below, with intelligent format detection and comprehensive metadata extraction.

#### Office Documents

| Category | Formats | Capabilities |
|----------|---------|--------------|
| **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |

#### Images (OCR-Enabled)

| Category | Formats | Features |
|----------|---------|----------|
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |

#### Web & Data

| Category | Formats | Features |
|----------|---------|----------|
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |

#### Email & Archives

| Category | Formats | Features |
|----------|---------|----------|
| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |

#### Academic & Scientific

| Category | Formats | Features |
|----------|---------|----------|
| **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |

**[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
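
Because the format is detected from the file itself, the same call works across all of these formats. A small sketch (file names are illustrative):

```elixir
# One API regardless of input format; the MIME type is detected automatically
for path <- ["report.docx", "data.xlsx", "page.html", "notes.md"] do
  case Kreuzberg.extract_file(path) do
    {:ok, result} ->
      IO.puts("#{path} (#{result.mime_type}): #{byte_size(result.content)} bytes of text")

    {:error, reason} ->
      IO.puts("#{path} failed: #{inspect(reason)}")
  end
end
```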

### Key Capabilities

- **Text Extraction** - Extract all text content with position and formatting information

- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.

- **Table Extraction** - Parse tables with structure and cell content preservation

- **Image Extraction** - Extract embedded images and render page previews

- **OCR Support** - Integrate multiple OCR backends for scanned documents

- **Async/Await** - Non-blocking document processing with concurrent operations

- **Plugin System** - Extensible post-processing for custom text transformation

- **Embeddings** - Generate vector embeddings using ONNX Runtime models

- **Batch Processing** - Efficiently process multiple documents in parallel

- **Memory Efficient** - Stream large files without loading entirely into memory

- **Language Detection** - Detect and support multiple languages in documents

- **Configuration** - Fine-grained control over extraction behavior

### Performance Characteristics

| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
| **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
| **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |

## Configuration Reference

ExtractionConfig provides comprehensive control over document extraction behavior through nested configuration maps. All nested configs are optional (nil by default) and use string keys for NIF compatibility.
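
The nested configurations described below can be combined freely in a single struct. A minimal sketch combining OCR, chunking, and language detection (values are illustrative):

```elixir
config = %Kreuzberg.ExtractionConfig{
  ocr: %{"enabled" => true, "backend" => "tesseract", "languages" => ["eng"]},
  chunking: %{"enabled" => true, "chunk_size" => 512, "overlap" => 50},
  language_detection: %{"enabled" => true, "confidence_threshold" => 0.7}
}

{:ok, result} = Kreuzberg.extract_file("document.pdf", nil, config)
```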

### Chunking Configuration

Configure text chunking for semantic search and document processing:

```elixir
config = %Kreuzberg.ExtractionConfig{
  chunking: %{
    "enabled" => true,
    "chunk_size" => 512,           # Characters per chunk (integer, default: 512)
    "overlap" => 50,               # Overlap between chunks (integer, default: 50)
    "strategy" => "semantic",      # Strategy: "fixed", "semantic", or "adaptive"
    "separator" => "\n\n",         # Custom separator for chunking (string or nil)
    "preserve_headers" => true     # Keep headers in chunks (boolean, default: true)
  }
}
```

**Fields:**
- `enabled` (boolean, optional): Enable/disable chunking
- `chunk_size` (integer, 1+): Maximum characters per chunk
- `overlap` (integer, 0+): Character overlap between consecutive chunks (must be < chunk_size)
- `strategy` (string): "fixed" (simple splitting), "semantic" (respects boundaries), or "adaptive" (ML-based)
- `separator` (string/nil): Custom text separator (default: respects document structure)
- `preserve_headers` (boolean): Keep document headers when chunking

### OCR Configuration

Control OCR extraction for scanned documents and images:

```elixir
config = %Kreuzberg.ExtractionConfig{
  ocr: %{
    "enabled" => true,            # Enable OCR processing (boolean, default: false)
    "backend" => "tesseract",      # Backend: "tesseract", "easyocr", "paddleocr"
    "languages" => ["eng"],        # Language codes (list of strings, ISO 639-3 format)
    "dpi" => 300,                  # DPI for image processing (integer, default: 300)
    "psm" => 3,                    # Tesseract PSM mode (0-13, affects layout analysis)
    "oem" => 3,                    # Tesseract OEM mode (0-3, affects recognition engine)
    "confidence_threshold" => 0.5  # Minimum confidence for OCR results (0.0-1.0)
  }
}
```

**Fields:**
- `enabled` (boolean): Enable OCR for documents without searchable text
- `backend` (string): "tesseract" (fastest), "easyocr", or "paddleocr"
- `languages` (list): ISO 639-3 language codes (e.g., "eng", "fra", "deu", "spa")
- `dpi` (integer, 72+): Resolution for image processing (higher = slower but more accurate)
- `psm` (0-13): Tesseract Page Segmentation Mode (affects text layout detection)
- `oem` (0-3): Tesseract OCR Engine Mode (0=legacy, 1=neural, 2=both, 3=default)
- `confidence_threshold` (0.0-1.0): Filter low-confidence OCR results

### Language Detection Configuration

Detect and identify languages in extracted content:

```elixir
config = %Kreuzberg.ExtractionConfig{
  language_detection: %{
    "enabled" => true,                    # Enable language detection (boolean, default: true)
    "strategy" => "accurate",             # Strategy: "auto", "fast", or "accurate"
    "confidence_threshold" => 0.7,        # Minimum confidence (0.0-1.0, default: 0.7)
    "predefined_languages" => ["en", "fr"], # Restrict detection to specific languages
    "detect_mixed_languages" => true      # Detect multiple languages in single document
  }
}
```

**Fields:**
- `enabled` (boolean): Enable automatic language detection
- `strategy` (string): "auto" (default), "fast" (quick detection), or "accurate" (ML-based)
- `confidence_threshold` (0.0-1.0): Minimum confidence for language identification
- `predefined_languages` (list): ISO 639-1 codes to restrict detection (e.g., ["en", "de", "fr"])
- `detect_mixed_languages` (boolean): Allow detection of multiple languages per document

### Post-Processor Configuration

Clean and normalize extracted text:

```elixir
config = %Kreuzberg.ExtractionConfig{
  postprocessor: %{
    "enabled" => true,                    # Enable post-processing (boolean, default: true)
    "normalize_whitespace" => true,       # Collapse multiple spaces (boolean, default: true)
    "remove_duplicates" => false,         # Remove duplicate paragraphs (boolean, default: false)
    "remove_empty_lines" => true,         # Remove blank lines (boolean, default: true)
    "trim_text" => true,                  # Trim leading/trailing whitespace (boolean, default: true)
    "normalize_unicode" => true,          # Normalize Unicode characters (boolean, default: true)
    "remove_control_characters" => true,  # Remove control characters (boolean, default: true)
    "fix_punctuation" => true,            # Fix spacing around punctuation (boolean, default: false)
    "fix_hyphens" => true,                # Fix hyphenation issues (boolean, default: false)
    "convert_quotes" => "straight",       # Quote style: "straight", "curly", or nil
    "convert_dashes" => true,             # Convert hyphens to proper dashes (boolean, default: false)
    "duplicate_threshold" => 0.95,        # Similarity threshold for duplicates (0.0-1.0, default: 0.95)
    "remove_duplicate_paragraphs" => true # Remove duplicate paragraphs (boolean, default: false)
  }
}
```

**Fields:**
- `enabled` (boolean): Enable text post-processing
- `normalize_whitespace` (boolean): Collapse multiple consecutive spaces to single space
- `remove_duplicates` (boolean): Remove near-duplicate content (uses duplicate_threshold)
- `remove_empty_lines` (boolean): Strip blank lines from output
- `trim_text` (boolean): Remove leading/trailing whitespace
- `normalize_unicode` (boolean): Apply Unicode normalization (NFC)
- `remove_control_characters` (boolean): Strip control characters (except newlines/tabs)
- `fix_punctuation` (boolean): Fix spacing around punctuation marks
- `fix_hyphens` (boolean): Fix hyphenation and line breaking issues
- `convert_quotes` (string/nil): "straight", "curly", or nil (no conversion)
- `convert_dashes` (boolean): Convert hyphens to em/en dashes
- `duplicate_threshold` (0.0-1.0): Similarity score for duplicate detection (higher = stricter)
- `remove_duplicate_paragraphs` (boolean): Remove duplicate paragraphs

### Images Configuration

Control image extraction and preprocessing:

```elixir
config = %Kreuzberg.ExtractionConfig{
  images: %{
    "enabled" => true,            # Enable image extraction (boolean, default: true)
    "format" => "png",            # Output format: "png", "jpg", "webp", "bmp"
    "quality" => 90,              # JPEG quality (0-100, default: 95, only for JPEG)
    "max_width" => 1920,          # Maximum width in pixels (integer or nil)
    "max_height" => 1080,         # Maximum height in pixels (integer or nil)
    "extract_ocr_text" => true,   # Extract text from images via OCR (boolean, default: false)
    "preprocessing" => true       # Apply image preprocessing (boolean, default: false)
  }
}
```

**Fields:**
- `enabled` (boolean): Enable image extraction from documents
- `format` (string): Output format ("png", "jpg", "webp", "bmp")
- `quality` (0-100): JPEG compression quality (only applies to JPEG format)
- `max_width` (integer/nil): Resize images to max width (preserves aspect ratio)
- `max_height` (integer/nil): Resize images to max height (preserves aspect ratio)
- `extract_ocr_text` (boolean): Run OCR on extracted images
- `preprocessing` (boolean): Apply contrast/brightness adjustment

### Pages Configuration

Extract page-level content with structural information:

```elixir
config = %Kreuzberg.ExtractionConfig{
  pages: %{
    "enabled" => true,                    # Enable page extraction (boolean, default: false)
    "start_page" => 1,                    # First page to extract (integer, 1+, default: 1)
    "end_page" => nil,                    # Last page to extract (integer or nil for all)
    "page_numbers" => [1, 2, 5, 10],      # Extract specific pages (list of integers or nil)
    "exclude_pages" => [3, 7],            # Skip these pages (list of integers or nil)
    "extract_text" => true,               # Extract text per page (boolean, default: true)
    "extract_tables" => true,             # Extract tables per page (boolean, default: true)
    "extract_images" => false,            # Extract images per page (boolean, default: false)
    "extract_headers_footers" => true     # Include headers and footers (boolean, default: true)
  }
}
```

**Fields:**
- `enabled` (boolean): Enable page-level extraction (returns pages array)
- `start_page` (integer, 1+): Starting page number (1-indexed)
- `end_page` (integer/nil): Ending page number (nil = all pages)
- `page_numbers` (list/nil): Extract only these specific pages (e.g., [1, 3, 5])
- `exclude_pages` (list/nil): Skip these pages (e.g., [2, 4, 6])
- `extract_text` (boolean): Include text content per page
- `extract_tables` (boolean): Extract tables identified on each page
- `extract_images` (boolean): Extract images found on each page
- `extract_headers_footers` (boolean): Include document headers/footers

### Token Reduction Configuration

Control output size and summarization:

```elixir
config = %Kreuzberg.ExtractionConfig{
  token_reduction: %{
    "enabled" => true,                    # Enable token reduction (boolean, default: false)
    "strategy" => "summarize",            # Strategy: "none", "truncate", "summarize", "extractive"
    "target_reduction" => 0.3,            # Target reduction ratio (0.0-1.0, e.g., 0.3 = 30%)
    "max_tokens" => 2000,                 # Maximum output tokens (integer or nil)
    "keep_first_percentage" => 0.7,       # Keep first N% for truncation (0.0-1.0)
    "summary_length_percentage" => 30,    # Summary length as % of original (0-100)
    "preserve_key_sentences" => true,     # Keep important sentences in summary (boolean, default: true)
    "num_sentences" => 5,                 # Number of sentences for extractive (integer, 1+)
    "sentence_importance_threshold" => 0.6 # Importance threshold for sentences (0.0-1.0)
  }
}
```

**Fields:**
- `enabled` (boolean): Enable token reduction/summarization
- `strategy` (string): "none" (no reduction), "truncate" (cut at token limit), "summarize" (abstractive), "extractive" (key sentences)
- `target_reduction` (0.0-1.0): Percentage reduction (0.3 = reduce to 70% of original)
- `max_tokens` (integer/nil): Hard limit on output tokens
- `keep_first_percentage` (0.0-1.0): For truncation: keep first N% of document
- `summary_length_percentage` (0-100): Summarization target as percentage of original
- `preserve_key_sentences` (boolean): Include important sentences in summary
- `num_sentences` (integer, 1+): For extractive: number of sentences to extract
- `sentence_importance_threshold` (0.0-1.0): Minimum importance score for sentences

### Keywords Configuration

Extract and configure keyword/keyphrase extraction:

```elixir
config = %Kreuzberg.ExtractionConfig{
  keywords: %{
    "enabled" => true,                # Enable keyword extraction (boolean, default: false)
    "algorithm" => "yake",            # Algorithm: "yake", "rake", "tfidf", "frequency"
    "max_keywords" => 20,             # Maximum keywords to extract (integer, 1+, default: 10)
    "min_score" => 0.1,               # Minimum relevance score (0.0-1.0, default: 0.0)
    "ngram_range" => [1, 3],          # N-gram range [min, max] (e.g., [1, 3] = unigrams to trigrams)
    "language" => "en",               # Language for processing (ISO 639-1, default: auto-detect)
    "custom_keywords" => ["important", "custom", "terms"], # Boost these keywords (list or nil)
    "weight_custom" => 2.0            # Weight multiplier for custom keywords (float, default: 1.0)
  }
}
```

**Fields:**
- `enabled` (boolean): Enable keyword/keyphrase extraction
- `algorithm` (string): "yake" (unsupervised), "rake" (rapid), "tfidf" (statistical), "frequency" (simple)
- `max_keywords` (integer, 1+): Maximum number of keywords to return
- `min_score` (0.0-1.0): Filter out keywords below this score
- `ngram_range` (list [min, max]): Extract phrases of N words (e.g., [1, 3] = 1-3 word phrases)
- `language` (string/nil): ISO 639-1 code (e.g., "en", "fr", "de") or nil for auto-detection
- `custom_keywords` (list/nil): Keywords to boost in results
- `weight_custom` (float, 0.0+): Multiplier for custom keyword scores

### PDF Options Configuration

PDF-specific extraction settings:

```elixir
config = %Kreuzberg.ExtractionConfig{
  pdf_options: %{
    "use_own_pdfium" => false,            # Use bundled PDFium (boolean, default: true)
    "allow_hybrid_rendering" => true,     # Blend text and raster (boolean, default: true)
    "enable_vector_graphics" => true      # Extract vector graphics (boolean, default: true)
  }
}
```

**Fields:**
- `use_own_pdfium` (boolean): Use custom PDFium build instead of bundled version
- `allow_hybrid_rendering` (boolean): Combine searchable text with rendered images
- `enable_vector_graphics` (boolean): Extract vector graphics as images

## OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

- **Tesseract**

### OCR Configuration Example

```elixir
alias Kreuzberg.ExtractionConfig

config = %ExtractionConfig{
  ocr: %{"enabled" => true, "backend" => "tesseract"}
}

{:ok, result} = Kreuzberg.extract_file("scanned_document.pdf", nil, config)

content = result.content
IO.puts("OCR Extracted content:")
IO.puts(content)
IO.puts("Metadata: #{inspect(result.metadata)}")
```

## Async Support

This binding provides Task-based asynchronous extraction for non-blocking document processing:

```elixir
# Spawn the extraction without blocking the calling process
task = Kreuzberg.extract_async("document.pdf")

case Task.await(task, 30_000) do
  {:ok, result} ->
    IO.puts("Content: #{result.content}")
    IO.puts("Tables: #{length(result.tables)}")

  {:error, reason} ->
    IO.puts("Extraction failed: #{inspect(reason)}")
end
```

## Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit [Plugin System Guide](https://kreuzberg.dev/plugins/).

### Plugin Example

```elixir
alias Kreuzberg.Plugin

# Word Count Post-Processor Plugin
# This post-processor automatically counts words in extracted content
# and adds the word count to the metadata.

defmodule MyApp.Plugins.WordCountProcessor do
  @behaviour Kreuzberg.Plugin.PostProcessor
  require Logger

  @impl true
  def name do
    "WordCountProcessor"
  end

  @impl true
  def processing_stage do
    :post
  end

  @impl true
  def version do
    "1.0.0"
  end

  @impl true
  def initialize do
    :ok
  end

  @impl true
  def shutdown do
    :ok
  end

  @impl true
  def process(result, _options) do
    content = result["content"] || ""
    word_count =
      content
      |> String.split(~r/\s+/, trim: true)
      |> length()

    # Update metadata with word count
    metadata = Map.get(result, "metadata", %{})
    updated_metadata = Map.put(metadata, "word_count", word_count)

    {:ok, Map.put(result, "metadata", updated_metadata)}
  end
end

# Register the word count post-processor
Plugin.register_post_processor(:word_count_processor, MyApp.Plugins.WordCountProcessor)

# Example usage
result = %{
  "content" => "The quick brown fox jumps over the lazy dog. This is a sample document with multiple words.",
  "metadata" => %{
    "source" => "document.pdf",
    "pages" => 1
  }
}

case MyApp.Plugins.WordCountProcessor.process(result, %{}) do
  {:ok, processed_result} ->
    word_count = processed_result["metadata"]["word_count"]
    IO.puts("Word count added: #{word_count} words")
    IO.inspect(processed_result, label: "Processed Result")

  {:error, reason} ->
    IO.puts("Processing failed: #{reason}")
end

# List all registered post-processors
{:ok, processors} = Plugin.list_post_processors()
IO.inspect(processors, label: "Registered Post-Processors")
```

## Embeddings Support

Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.

**[Embeddings Guide](https://kreuzberg.dev/features/#embeddings)**
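
As a sketch of how the resulting vectors can be used, assuming embeddings have been enabled for the extraction so that `chunk.embedding` is populated (see the chunking example above), cosine similarity between chunks can be computed in plain Elixir:

```elixir
defmodule MyApp.Similarity do
  # Cosine similarity between two embedding vectors (lists of floats)
  def cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end
    dot / (norm.(a) * norm.(b))
  end
end

config = %Kreuzberg.ExtractionConfig{
  chunking: %{"enabled" => true, "chunk_size" => 512, "overlap" => 50}
}

{:ok, result} = Kreuzberg.extract_file("knowledge_base.pdf", nil, config)

case Enum.filter(result.chunks || [], & &1.embedding) do
  [] ->
    IO.puts("No embeddings present in the extraction result")

  [query | rest] ->
    # Rank the remaining chunks against the first chunk's embedding
    rest
    |> Enum.map(fn chunk ->
      {chunk.text, MyApp.Similarity.cosine(query.embedding, chunk.embedding)}
    end)
    |> Enum.sort_by(&elem(&1, 1), :desc)
    |> Enum.take(3)
    |> Enum.each(fn {text, score} ->
      IO.puts("#{Float.round(score, 3)}: #{String.slice(text, 0..60)}")
    end)
end
```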

## NIF Integration

### Rustler NIF Architecture

The binding is implemented as a Rustler NIF (Native Implemented Function) for safe boundary crossing:

```
Elixir Code
    ↓
Kreuzberg Module (lib/kreuzberg.ex)
    ↓
Rustler NIF Interface (native/src/lib.rs)
    ↓
Rust Implementation (kreuzberg_core)
    ↓
High-Performance Document Extraction
```

Key NIF patterns:

**Synchronous NIF calls:**
```elixir
# Directly calls Rust through NIF boundary
{:ok, result} = Kreuzberg.extract(data, mime_type, config)
```

**Error handling at boundary:**
```elixir
case Kreuzberg.extract(data, "invalid/type") do
  {:ok, result} -> result
  {:error, reason} -> IO.puts("NIF error: #{reason}")
end
```

### Memory Management Across Boundary

- **Erlang terms**: Automatically encoded/decoded by Rustler
- **Binary data**: Zero-copy where possible, bounds checked
- **Structures**: Complex structs serialized to maps for Elixir compatibility
- **GC integration**: Rust allocations cleaned up when Elixir terms are garbage collected

### Concurrent NIF Access

The NIF implementation is designed for concurrent access:

```elixir
# Multiple processes can call NIF simultaneously without blocking each other
task1 = Task.async(fn -> Kreuzberg.extract("data1", "text/plain") end)
task2 = Task.async(fn -> Kreuzberg.extract("data2", "text/plain") end)

{:ok, result1} = Task.await(task1)
{:ok, result2} = Task.await(task2)
```

## Batch Processing

Process multiple documents efficiently:

```elixir
file_paths = ["document1.pdf", "document2.pdf", "document3.pdf"]

{:ok, results} = Kreuzberg.batch_extract_files(file_paths)

Enum.each(results, fn result ->
  IO.puts("File: #{result.mime_type}")
  IO.puts("Content length: #{byte_size(result.content)} characters")
  IO.puts("Tables: #{length(result.tables)}")
  IO.puts("---")
end)

IO.puts("Total files processed: #{length(results)}")
```

## Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

**[Configuration Guide](https://kreuzberg.dev/configuration/)**

## Testing

The Elixir binding includes comprehensive test suites ensuring production-ready quality:

### Test Structure

```
test/
├── unit/                          # Unit tests (583 total)
│   ├── extraction_test.exs        # Core extraction functions
│   ├── batch_api_test.exs         # Batch operations
│   ├── async_api_test.exs         # Async Task patterns
│   ├── error_test.exs             # Error handling
│   ├── validators_test.exs        # Configuration validation
│   └── ...
├── e2e/                           # End-to-end tests
│   ├── nif_integration_test.exs   # NIF boundary safety (47 tests)
│   ├── pdf_extraction_test.exs    # Real PDF extraction
│   ├── html_extraction_test.exs   # HTML parsing
│   ├── table_extraction_test.exs  # Table detection
│   └── ...
└── support/
    └── document_fixtures.exs      # Test data generators
```

### Running Tests

```bash
# Run all tests
mix test

# Run only unit tests
mix test test/unit

# Run only E2E tests
mix test test/e2e

# Run with coverage
mix coveralls

# Run specific test file
mix test test/e2e/nif_integration_test.exs

# Run with tags
mix test --only :e2e
```

### Test Highlights

**NIF Integration Tests** (47 comprehensive tests):
- NIF boundary crossing safety with unicode, binary, and large data
- Concurrent NIF calls (up to 50 simultaneous)
- Memory safety and leak detection
- Error propagation and recovery
- OTP supervisor integration

**End-to-End Tests** (4 test suites):
- Real document extraction workflows
- Multi-format extraction (PDF, HTML, tables)
- Configuration variations and error conditions
- Performance and stability assertions

**Zero Flakiness**:
- All tests marked with `@tag :e2e` or `@tag :unit`
- Deterministic assertions without timing assumptions
- Resource cleanup and proper process management
- No external service dependencies
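
A minimal sketch of what a test in this style looks like (the module name is illustrative; the actual fixtures in the repository may differ):

```elixir
defmodule Kreuzberg.PlainTextExtractionTest do
  use ExUnit.Case, async: true

  @moduletag :unit

  test "plain text round-trips through the NIF" do
    assert {:ok, result} = Kreuzberg.extract("Hello Kreuzberg", "text/plain")
    assert result.content =~ "Hello Kreuzberg"
  end

  test "invalid MIME types return an error tuple instead of raising" do
    assert {:error, reason} = Kreuzberg.extract("data", "invalid/type")
    assert is_binary(reason)
  end
end
```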

## Documentation

- **[Official Documentation](https://kreuzberg.dev/)**
- **[API Reference](https://kreuzberg.dev/reference/api-elixir/)**
- **[Examples & Guides](https://kreuzberg.dev/guides/)**
- **[Rustler Documentation](https://hexdocs.pm/rustler/)**

## Troubleshooting

### Common Issues

**Compilation errors with NIF:**
```bash
# Clean and rebuild
mix clean
mix compile
```

**Memory usage spikes:**
- Use batch processing for large files
- Call `:erlang.garbage_collect()` after large extractions
- Monitor process memory with `Process.info(self(), :memory)`
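
Put together, a quick check for whether a workload is holding on to memory (file name is illustrative):

```elixir
memory_before = Process.info(self(), :memory) |> elem(1)

{:ok, _result} = Kreuzberg.extract_file("large.pdf")

:erlang.garbage_collect()
memory_after = Process.info(self(), :memory) |> elem(1)

IO.puts("Process memory: #{memory_before} -> #{memory_after} bytes")
```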

**Timeout errors:**
```elixir
# For large documents, run the extraction as a task and raise the await timeout
task = Kreuzberg.extract_async("large.pdf")
{:ok, result} = Task.await(task, 120_000)
```

For detailed troubleshooting: [Troubleshooting Guide](https://kreuzberg.dev/troubleshooting/)

## Contributing

Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).

Development setup:
```bash
git clone https://github.com/kreuzberg-dev/kreuzberg.git
cd kreuzberg/packages/elixir
mix deps.get
mix test
```

## License

MIT License - see LICENSE file for details.

## Support

- **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)