# RustyXML Architecture

A purpose-built Rust NIF for ultra-fast XML parsing in Elixir. Not a wrapper around an existing library—custom-built from the ground up for optimal BEAM integration with full XPath 1.0 support. Drop-in replacement for both SweetXml and Saxy.

## Key Innovations

### Purpose-Built, Not Wrapped

Unlike projects that wrap existing Rust crates (like quick-xml or roxmltree), RustyXML is **designed specifically for Elixir**:

- **Direct BEAM term construction** — Results go straight to Erlang terms, no intermediate serialization
- **ResourceArc integration** — Documents and streaming parser state managed by BEAM's garbage collector
- **Dirty scheduler awareness** — All raw-XML parse NIFs run on dirty CPU schedulers
- **Zero-copy where possible** — Span-based references into original input, only allocates for entity decoding
- **Structural index** — Cache-friendly storage with compact span structs and flat arrays

### Unified Architecture

RustyXML v0.2.0 consolidated multiple parsing strategies around a single scanner and a single document representation, the **structural index**. One `UnifiedScanner` tokenizes the input once and dispatches events to a `ScanHandler` implementation, which builds the representation each API path needs:

| Path | Description | Best For |
|------|-------------|----------|
| `parse/1` + `xpath/2` | Structural index with XPath | General XML processing |
| `stream_tags/3` | Bounded-memory streaming | Large files (GB+) |
| `sax_parse/1` | SAX event collection | Event-driven processing |

All three paths share the same SIMD-accelerated scanner and well-formedness validation.

### Memory Efficiency

- **Structural index** — Elements stored as compact span structs (32 bytes each) referencing the original input
- **Zero-copy strings** — Tag names, attribute values, and text stored as `(offset, length)` spans
- **Sub-binary returns** — BEAM sub-binaries share memory with the original input
- **Streaming bounded memory** — Process 10GB+ files with ~128 KB combined NIF + BEAM peak via zero-copy tokenization and direct BEAM binary encoding
- **mimalloc allocator** — High-performance allocator for reduced fragmentation
- **Optional memory tracking** — Opt-in profiling with zero overhead when disabled

### Validated Correctness

- **100% W3C/OASIS XML Conformance** — All 1089 applicable tests pass (218 valid + 871 not-well-formed rejections), verified individually against the official [xmlconf](https://www.w3.org/XML/Test/) suite
- **1296+ tests** including the full conformance suite, batch accessor clamping, and lazy XPath coverage
- **Cross-path validation** — All paths produce consistent output
- **SweetXml compatibility** — Verified identical behavior for common API patterns

---

## Quick Start

```elixir
import RustyXML

xml = """
<catalog>
  <book id="1"><title>Elixir in Action</title><price>45.00</price></book>
  <book id="2"><title>Programming Phoenix</title><price>50.00</price></book>
</catalog>
"""

# Get all books
RustyXML.xpath(xml, ~x"//book"l)

# Get text content
RustyXML.xpath(xml, ~x"//title/text()"s)

# Extract multiple values
RustyXML.xmap(xml, [
  titles: ~x"//title/text()"sl,
  prices: ~x"//price/text()"sl
])
```

---

## Core Architecture

### UnifiedScanner and ScanHandler

The `UnifiedScanner` is the single entry point for all XML tokenization. It uses `memchr`-based SIMD scanning to find delimiters, then dispatches events through the `ScanHandler` trait:

```
XML Input
   |
   v
UnifiedScanner (memchr SIMD tokenization)
   |
   +---> IndexBuilder (ScanHandler) ---> StructuralIndex ---> XPath
   |
   +---> SaxCollector (ScanHandler) ---> SAX Events
   |
   +---> StreamingParser ---> Complete Elements
```

The `ScanHandler` trait:

```rust
trait ScanHandler {
    fn start_element(&mut self, name: Span, attrs: &[(Span, Span)], is_empty: bool);
    fn end_element(&mut self, name: Span);
    fn text(&mut self, span: Span, needs_entity_decode: bool);
    fn cdata(&mut self, span: Span);
    fn comment(&mut self, span: Span);
    fn processing_instruction(&mut self, target: Span, data: Option<Span>);
}
```

Adding a new processing mode requires only implementing the trait—no changes to the scanner.
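
As an illustration of that claim, here is a minimal sketch of a third handler. `StatsHandler` is hypothetical and not part of RustyXML; `Span` and `ScanHandler` are repeated in reduced form so the block compiles on its own. It also assumes that a self-closing element produces no matching `end_element` call, which the source does not state.

```rust
// Reduced copies of the types above so the sketch is self-contained.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Span {
    pub offset: u32,
    pub len: u16,
}

pub trait ScanHandler {
    fn start_element(&mut self, name: Span, attrs: &[(Span, Span)], is_empty: bool);
    fn end_element(&mut self, name: Span);
    fn text(&mut self, span: Span, needs_entity_decode: bool);
    fn cdata(&mut self, span: Span);
    fn comment(&mut self, span: Span);
    fn processing_instruction(&mut self, target: Span, data: Option<Span>);
}

/// Hypothetical handler: collects simple statistics and ignores content.
#[derive(Default, Debug)]
pub struct StatsHandler {
    pub elements: usize,
    pub attributes: usize,
    pub max_depth: usize,
    depth: usize,
}

impl ScanHandler for StatsHandler {
    fn start_element(&mut self, _name: Span, attrs: &[(Span, Span)], is_empty: bool) {
        self.elements += 1;
        self.attributes += attrs.len();
        if is_empty {
            // Assumption: no end_element follows, so depth is not changed.
            self.max_depth = self.max_depth.max(self.depth + 1);
        } else {
            self.depth += 1;
            self.max_depth = self.max_depth.max(self.depth);
        }
    }

    fn end_element(&mut self, _name: Span) {
        self.depth = self.depth.saturating_sub(1);
    }

    // Events this handler does not care about are simply dropped.
    fn text(&mut self, _span: Span, _needs_entity_decode: bool) {}
    fn cdata(&mut self, _span: Span) {}
    fn comment(&mut self, _span: Span) {}
    fn processing_instruction(&mut self, _target: Span, _data: Option<Span>) {}
}
```

The scanner would own the tokenization loop and call these methods; the handler never looks at raw bytes unless it chooses to resolve a `Span`.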

### Structural Index

The structural index is the core document representation. Instead of building a DOM tree with string copies, it stores compact structs that reference byte offsets into the original input:

```rust
struct Span {
    offset: u32,
    len: u16,     // 6 bytes total
}

struct IndexElement {      // 32 bytes
    name: Span,
    ns_prefix: Option<Span>,
    parent: u32,
    children: Range<u32>,  // into flat children_data array
    attrs: Range<u32>,     // into flat attrs array
}

struct IndexText {         // 16 bytes
    span: Span,
    parent: u32,
    needs_entity_decode: bool,
}

struct IndexAttribute {    // 12 bytes
    name: Span,
    value: Span,
}
```

**Memory profile for 2.93 MB document:**
- Structural index: **12.8 MB** (4.4x input size)
- Old DOM approach: **30.2 MB** (10.3x input size)
- SweetXml/xmerl: allocated entirely on BEAM heap

The `IndexedDocumentView` implements the `DocumentAccess` trait, allowing the XPath engine to evaluate queries on the structural index without any conversion step.
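
The following sketch shows how span-based storage could be read back without copying. It is a reduced model rather than the actual `StructuralIndex`: `Element`, `Index`, `resolve`, and `child_names` are hypothetical names, and only the `name` and `children` fields are kept.

```rust
use std::ops::Range;

#[derive(Clone, Copy)]
pub struct Span {
    pub offset: u32,
    pub len: u16,
}

impl Span {
    /// Borrow the bytes this span covers; `None` if it falls outside `input`.
    pub fn resolve<'a>(&self, input: &'a [u8]) -> Option<&'a [u8]> {
        let start = self.offset as usize;
        let end = start.checked_add(self.len as usize)?;
        input.get(start..end)
    }
}

/// Reduced element record: a name plus a range into the shared child array.
pub struct Element {
    pub name: Span,
    pub children: Range<u32>,
}

pub struct Index<'a> {
    pub input: &'a [u8],
    pub elements: Vec<Element>,
    /// Child element ids for all parents, stored contiguously.
    pub children_data: Vec<u32>,
}

impl<'a> Index<'a> {
    /// Names of the direct children of element `id`, borrowed from the input.
    pub fn child_names(&self, id: u32) -> Vec<&'a [u8]> {
        let Some(el) = self.elements.get(id as usize) else {
            return Vec::new();
        };
        let range = el.children.start as usize..el.children.end as usize;
        self.children_data
            .get(range)
            .unwrap_or_default()
            .iter()
            .filter_map(|&child| self.elements.get(child as usize))
            .filter_map(|child| child.name.resolve(self.input))
            .collect()
    }
}
```

Every returned slice points into the original input buffer, so looking up names allocates only the result vector.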

### SIMD-Accelerated Scanning

Tag and content boundary detection uses `memchr` for hardware-accelerated scanning:

```rust
use memchr::{memchr, memchr2, memchr3};

// Find next tag start — SIMD accelerated
fn find_tag_start(input: &[u8], pos: usize) -> Option<usize> {
    memchr(b'<', &input[pos..]).map(|i| pos + i)
}

// Content scanning for entities and markup
fn find_content_break(input: &[u8], pos: usize) -> Option<usize> {
    // Offset the match by `pos` so the result indexes into `input`, as above
    memchr3(b'<', b'&', b']', &input[pos..]).map(|i| pos + i)
}
```

**SIMD support:** SSE2 (x86_64 default), AVX2 (runtime detect), NEON (aarch64), simd128 (wasm)
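
To show how delimiter searches drive tokenization, here is a simplified scanning loop. It is a sketch under stated assumptions: `find_byte` is a scalar stand-in for `memchr::memchr` so the block has no dependencies, `Token` and `tokenize` are hypothetical, and `>` inside attribute values, comments, and CDATA is not handled.

```rust
/// Scalar stand-in for `memchr::memchr` (same contract, no SIMD).
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

#[derive(Debug, PartialEq)]
pub enum Token<'a> {
    /// A run of character data between tags.
    Text(&'a [u8]),
    /// Everything between '<' and '>', exclusive.
    Markup(&'a [u8]),
}

/// Split input into text runs and markup by jumping from delimiter to delimiter.
pub fn tokenize(input: &[u8]) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        match find_byte(b'<', &input[pos..]) {
            // Cursor sits on '<': consume up to the closing '>'.
            Some(0) => {
                let Some(close) = find_byte(b'>', &input[pos..]) else {
                    break; // unterminated markup: stop (a real parser reports an error)
                };
                tokens.push(Token::Markup(&input[pos + 1..pos + close]));
                pos += close + 1;
            }
            // Text before the next tag.
            Some(i) => {
                tokens.push(Token::Text(&input[pos..pos + i]));
                pos += i;
            }
            // No more tags: the rest is text.
            None => {
                tokens.push(Token::Text(&input[pos..]));
                break;
            }
        }
    }
    tokens
}
```

The loop never inspects bytes one at a time itself; all the per-byte work sits in the search function, which is the part a SIMD implementation accelerates.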

---

## Parsing

### Standard Parse (`parse/1`)

All parsing flows through the structural index:

```elixir
doc = RustyXML.parse("<root><item id=\"1\"/></root>")
RustyXML.xpath(doc, ~x"//item/@id"s)
#=> "1"
```

**Best for:** Multiple XPath queries on the same document.

**Architecture:**
- `UnifiedScanner` tokenizes input with SIMD-accelerated scanning
- `IndexBuilder` collects spans into a `StructuralIndex`
- Document wrapped in `ResourceArc` for BEAM garbage collection
- XPath queries operate on the structural index via `DocumentAccess` trait

### Direct XPath (`xpath/2` with raw XML)

Parse and query in a single call:

```elixir
RustyXML.xpath("<root><item/></root>", ~x"//item"l)
```

**Best for:** Single-query scenarios; avoids holding a persistent document reference.

### Streaming Parser (`stream_tags/3`)

Bounded-memory streaming for large files:

```elixir
# High-level API
"large_file.xml"
|> RustyXML.stream_tags(:item)
|> Stream.each(fn {:item, item_xml} ->
  name = RustyXML.xpath(item_xml, ~x"./name/text()"s)
  IO.puts("Processing: #{name}")
end)
|> Stream.run()

# Works with Stream.take (no hanging like SweetXml issue #97)
"large_file.xml"
|> RustyXML.stream_tags(:item)
|> Stream.take(10)
|> Enum.to_list()
```

**Best for:** Large files (GB+), network streams, memory-constrained environments.

**Features:**
- Returns `{tag_atom, xml_string}` tuples compatible with SweetXml
- Complete XML elements that can be queried with `xpath/2`
- Handles elements split across chunk boundaries
- Tag filtering emits only matching elements and their children
- Does NOT hang with `Stream.take` (fixes SweetXml issue #97)
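
The chunk-boundary handling can be pictured with a small Rust sketch. This is not the actual `StreamingParser`: `TagExtractor` and `find` are hypothetical, and the sketch assumes the target tag is never nested inside itself, is never self-closing, and that no other tag name begins with the target name.

```rust
/// Hypothetical chunk-fed extractor for a single tag name.
pub struct TagExtractor {
    open: Vec<u8>,   // e.g. b"<item"
    close: Vec<u8>,  // e.g. b"</item>"
    buffer: Vec<u8>, // unconsumed bytes carried over between chunks
}

/// Naive substring search starting at `from`.
fn find(haystack: &[u8], needle: &[u8], from: usize) -> Option<usize> {
    if from > haystack.len() {
        return None;
    }
    haystack[from..]
        .windows(needle.len())
        .position(|window| window == needle)
        .map(|i| from + i)
}

impl TagExtractor {
    pub fn new(tag: &str) -> Self {
        Self {
            open: format!("<{tag}").into_bytes(),
            close: format!("</{tag}>").into_bytes(),
            buffer: Vec::new(),
        }
    }

    /// Append a chunk and return every element that is now complete.
    pub fn feed(&mut self, chunk: &[u8]) -> Vec<Vec<u8>> {
        self.buffer.extend_from_slice(chunk);
        let mut complete = Vec::new();
        let mut consumed = 0;
        loop {
            let Some(start) = find(&self.buffer, &self.open, consumed) else {
                // No open tag: keep only a tail that could be a split "<item".
                let keep_from = self.buffer.len().saturating_sub(self.open.len() - 1);
                consumed = consumed.max(keep_from);
                break;
            };
            let Some(end) = find(&self.buffer, &self.close, start) else {
                consumed = start; // element incomplete: keep it for the next chunk
                break;
            };
            let stop = end + self.close.len();
            complete.push(self.buffer[start..stop].to_vec());
            consumed = stop;
        }
        // Dropping consumed bytes is what keeps memory bounded.
        self.buffer.drain(..consumed);
        complete
    }
}
```

Because the buffer only ever holds one incomplete element plus a short tail, its size depends on the largest element rather than on the file size.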

### SAX Parser (`sax_parse/1`)

Event-based parsing for custom processing:

```elixir
events = RustyXML.Native.sax_parse(xml)
# Returns list of SAX events: start_element, end_element, text, etc.
```

**Best for:** Event-driven processing, custom document handling.

### Lazy XPath (`xpath_lazy/2`)

Keep XPath results in Rust memory, access on-demand:

```elixir
doc = RustyXML.parse(large_xml)

# Execute query — returns reference, not data
result = RustyXML.Native.xpath_lazy(doc, "//item")

# Access count without building terms (3x faster than regular XPath)
count = RustyXML.Native.result_count(result)

# Batch accessors for multiple items
texts = RustyXML.Native.result_texts(result, 0, 10)
ids = RustyXML.Native.result_attrs(result, "id", 0, 10)

# Extract multiple fields at once
data = RustyXML.Native.result_extract(result, 0, 10, ["id", "category"], true)
#=> [%{:name => "item", :text => "...", "id" => "1", "category" => "cat1"}, ...]
```

**Best for:** Large result sets, partial access, count-only queries.
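
On the Rust side, a batch accessor has to cope with `start` and `count` values that run past the end of the result set. The sketch below shows one way to clamp such a request; `LazyResult`, `clamp`, and `batch` are hypothetical names, and the real accessors return BEAM terms rather than slices.

```rust
use std::ops::Range;

pub type NodeId = u32;

/// Hypothetical lazy result: node ids stay in Rust until requested.
pub struct LazyResult {
    nodes: Vec<NodeId>,
}

impl LazyResult {
    pub fn new(nodes: Vec<NodeId>) -> Self {
        Self { nodes }
    }

    /// Counting needs no term construction at all.
    pub fn count(&self) -> usize {
        self.nodes.len()
    }

    /// Clamp `(start, count)` to valid indices instead of panicking.
    fn clamp(&self, start: usize, count: usize) -> Range<usize> {
        let begin = start.min(self.nodes.len());
        let end = begin.saturating_add(count).min(self.nodes.len());
        begin..end
    }

    /// Return at most `count` node ids starting at `start`.
    pub fn batch(&self, start: usize, count: usize) -> &[NodeId] {
        &self.nodes[self.clamp(start, count)]
    }
}
```

`saturating_add` matters here: a caller passing a very large `count` must not overflow the index arithmetic.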

### Parallel XPath (`xpath_parallel/2`)

Execute multiple XPath queries concurrently using Rayon:

```elixir
doc = RustyXML.parse(large_xml)
results = RustyXML.Native.xpath_parallel(doc, ["//item", "//price", "//title"])
```

**Best for:** Batch queries, `xmap` with many keys.

---

## XPath 1.0 Engine

Full XPath 1.0 implementation with recursive descent parsing:

- **All 13 axes**: child, parent, self, attribute, descendant, descendant-or-self, ancestor, ancestor-or-self, following, following-sibling, preceding, preceding-sibling, namespace
- **27+ functions**, including: position, last, count, local-name, namespace-uri, name, string, concat, starts-with, contains, substring, substring-before, substring-after, string-length, normalize-space, translate, boolean, not, true, false, lang, number, sum, floor, ceiling, round
- **Predicates**: Full predicate support with position, boolean, and comparison expressions
- **Operators**: Arithmetic (+, -, *, div, mod), comparison (=, !=, <, >, <=, >=), logical (and, or)

### Expression Caching

Compiled XPath expressions are cached in an LRU cache (256 entries). Repeated queries skip parsing and compilation entirely.
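
A minimal sketch of such a cache, using only the standard library, might look like this. `ExprCache` and `get_or_compile` are hypothetical names, `V` stands in for whatever a compiled expression is, and the linear-time `touch` is an illustrative simplification that is acceptable at a few hundred entries.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal LRU cache keyed by the XPath source string.
pub struct ExprCache<V> {
    capacity: usize,
    entries: HashMap<String, V>,
    order: VecDeque<String>, // front = least recently used
}

impl<V: Clone> ExprCache<V> {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity: capacity.max(1),
            entries: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Return the cached value, or compile, store, and return a new one.
    pub fn get_or_compile(&mut self, expr: &str, compile: impl FnOnce(&str) -> V) -> V {
        if let Some(value) = self.entries.get(expr) {
            let value = value.clone();
            self.touch(expr);
            return value;
        }
        // Evict the least recently used entry when full.
        if self.entries.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }
        let value = compile(expr);
        self.entries.insert(expr.to_owned(), value.clone());
        self.order.push_back(expr.to_owned());
        value
    }

    /// Move `expr` to the most-recently-used end of the queue.
    fn touch(&mut self, expr: &str) {
        if let Some(pos) = self.order.iter().position(|key| key.as_str() == expr) {
            if let Some(key) = self.order.remove(pos) {
                self.order.push_back(key);
            }
        }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

With `ExprCache::new(256)`, a hit costs one hash lookup plus a queue update, and the `compile` closure runs only on a miss.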

### Fast-Path Predicates

Common predicate patterns are optimized:

- `[@attr='value']` → `PredicateAttrEq` (direct attribute lookup)
- `[n]` → `PredicatePosition` (index access, no iteration)
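
The idea behind these fast paths is to recognise the pattern once, at compile time, and fall back to general evaluation otherwise. A rough sketch follows; the `Predicate` enum, `classify`, and `unquote` are hypothetical, and the variant names only mirror `PredicateAttrEq` and `PredicatePosition` loosely.

```rust
#[derive(Debug, PartialEq)]
pub enum Predicate {
    /// `[@attr='value']`: direct attribute comparison.
    AttrEq { name: String, value: String },
    /// `[n]`: 1-based position, as in XPath.
    Position(usize),
    /// Anything else goes through full expression evaluation.
    General(String),
}

/// Strip one pair of matching quotes, rejecting strings with inner quotes.
fn unquote(s: &str) -> Option<&str> {
    for q in ['\'', '"'] {
        if let Some(inner) = s.strip_prefix(q).and_then(|r| r.strip_suffix(q)) {
            if !inner.contains(q) {
                return Some(inner);
            }
        }
    }
    None
}

/// Classify the text found between `[` and `]`.
pub fn classify(body: &str) -> Predicate {
    let body = body.trim();
    // Fast path 1: a bare positive integer.
    if !body.is_empty() && body.bytes().all(|b| b.is_ascii_digit()) {
        if let Ok(n) = body.parse::<usize>() {
            if n >= 1 {
                return Predicate::Position(n);
            }
        }
    }
    // Fast path 2: a single attribute compared with a quoted literal.
    if let Some((name, quoted)) = body.strip_prefix('@').and_then(|r| r.split_once('=')) {
        let name = name.trim();
        let is_simple = !name.is_empty()
            && name
                .chars()
                .all(|c| c.is_alphanumeric() || matches!(c, '_' | '-' | '.' | ':'));
        if let (true, Some(value)) = (is_simple, unquote(quoted.trim())) {
            return Predicate::AttrEq {
                name: name.to_owned(),
                value: value.to_owned(),
            };
        }
    }
    Predicate::General(body.to_owned())
}
```

Compound predicates such as `@a='x' and @b='y'` fail the quote check and fall through to `General`, so the fast path never changes results.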

### Text Extraction Fast Path

For text extraction queries, `xpath_text_list` extracts text directly from NodeSets without building recursive BEAM element tuples—eliminating the double-walk where tuples were built then discarded.

---

## Project Structure

```
native/rustyxml/src/
├── lib.rs                 # NIF entry points, memory tracking, mimalloc
├── core/
│   ├── mod.rs             # Re-exports
│   ├── scanner.rs         # SIMD byte scanning (memchr)
│   ├── unified_scanner.rs # UnifiedScanner + ScanHandler trait
│   ├── tokenizer.rs       # State machine tokenizer
│   ├── entities.rs        # Entity decoding with Cow
│   └── attributes.rs      # Attribute parsing
├── index/
│   ├── mod.rs             # Module docs, re-exports
│   ├── structural.rs      # StructuralIndex (main data structure)
│   ├── span.rs            # Span struct (offset, length)
│   ├── element.rs         # IndexElement, IndexText, IndexAttribute
│   ├── builder.rs         # IndexBuilder (ScanHandler impl)
│   └── view.rs            # IndexedDocumentView (DocumentAccess impl)
├── dom/
│   ├── mod.rs             # DocumentAccess trait, validation
│   ├── document.rs        # Document types
│   ├── node.rs            # Node types
│   └── strings.rs         # String utilities
├── xpath/
│   ├── mod.rs             # XPath exports
│   ├── lexer.rs           # XPath tokenizer
│   ├── parser.rs          # Recursive descent parser
│   ├── compiler.rs        # Expression compiler
│   ├── eval.rs            # Evaluation engine
│   ├── axes.rs            # All 13 XPath axes
│   ├── functions.rs       # 27+ XPath 1.0 functions
│   └── value.rs           # XPath value types
├── sax/
│   ├── mod.rs             # SAX module docs
│   ├── events.rs          # CompactSaxEvent types
│   └── collector.rs       # SaxCollector (ScanHandler impl)
├── strategy/
│   ├── mod.rs             # Strategy exports
│   ├── streaming.rs       # Stateful streaming parser
│   └── parallel.rs        # Parallel XPath (DirtyCpu)
├── term.rs                # BEAM term building utilities
└── resource.rs            # ResourceArc wrappers

lib/
├── rusty_xml.ex           # Main module: xpath/2, xmap/2, stream_tags/3, parse_string/4,
│                          #   parse_stream/4, stream_events/2, encode!/2, ~x sigil
├── rusty_xml/
│   ├── native.ex          # NIF bindings (RustlerPrecompiled)
│   ├── streaming.ex       # High-level streaming interface
│   ├── handler.ex         # SAX handler behaviour (= Saxy.Handler)
│   ├── event_transformer.ex # Native event → Saxy event mapping
│   ├── partial.ex         # Incremental SAX parsing (= Saxy.Partial)
│   ├── simple_form.ex     # Tuple tree output (= Saxy.SimpleForm)
│   ├── xml.ex             # Builder DSL (= Saxy.XML)
│   ├── encoder.ex         # XML string encoding
│   └── builder.ex         # Struct→XML protocol (= Saxy.Builder)
```

---

## Performance Optimizations

| Optimization | Impact |
|--------------|--------|
| Structural index (zero-copy spans) | 65-70% memory reduction vs old DOM |
| XPath text fast path | Text extraction speedup improved from 0.74x to 1.44x |
| XML string serialization | 1.39x faster element queries |
| Complete elements streaming | 3.87x faster streaming |
| Lazy XPath API | 3x faster for partial access |
| XPath expression caching | Skip re-parsing repeated queries |
| Fast-path predicates | 23% faster for `[@attr='value']` |
| Compile-time atoms | Eliminates per-call atom lookup |
| Direct binary encoding | Faster string-to-term conversion |
| DocumentAccess trait | O(1) pre-parsed access |
| HashSet deduplication | O(n^2) → O(n) for node sets |
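
The last row can be illustrated with a few lines of Rust. This is a generic sketch of order-preserving deduplication with a `HashSet`, not the engine's code; `NodeId` as `u32` and the function name are assumptions.

```rust
use std::collections::HashSet;

pub type NodeId = u32;

/// Remove duplicate node ids while keeping first-seen order.
/// Each id is hashed once, so the expected cost is O(n) instead of the
/// O(n^2) of checking every new id against the list built so far.
pub fn dedup_nodes(nodes: Vec<NodeId>) -> Vec<NodeId> {
    let mut seen = HashSet::with_capacity(nodes.len());
    // `insert` returns false when the id was already present.
    nodes.into_iter().filter(|id| seen.insert(*id)).collect()
}
```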

### Bypassing BEAM Term Construction

For element queries, building nested Elixir tuples (`{:element, name, attrs, children}`) is expensive. `xpath_query_raw/2` bypasses this by serializing nodes to XML strings in Rust using an iterative approach with an explicit stack.
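
A sketch of iterative serialization with an explicit stack is shown below. The `Node` tree is a hypothetical stand-in for the structural index, and attributes and text escaping are left out to keep the traversal visible.

```rust
pub enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Work items on the explicit stack.
enum Step<'a> {
    Visit(&'a Node),
    Close(&'a str),
}

/// Serialize a tree without recursion, so deep documents cannot
/// overflow the native call stack.
pub fn serialize(root: &Node) -> String {
    let mut out = String::new();
    let mut stack = vec![Step::Visit(root)];
    while let Some(step) = stack.pop() {
        match step {
            Step::Close(name) => {
                out.push_str("</");
                out.push_str(name);
                out.push('>');
            }
            Step::Visit(Node::Text(text)) => out.push_str(text),
            Step::Visit(Node::Element { name, children }) => {
                out.push('<');
                out.push_str(name);
                if children.is_empty() {
                    out.push_str("/>");
                    continue;
                }
                out.push('>');
                // The close tag is pushed first so it is emitted after all children.
                stack.push(Step::Close(name));
                // Reverse order so the first child is popped first.
                for child in children.iter().rev() {
                    stack.push(Step::Visit(child));
                }
            }
        }
    }
    out
}
```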

### Lazy XPath

The regular XPath API builds BEAM terms for all results upfront. The lazy API keeps results in Rust memory as `Vec<NodeId>`:

```elixir
# Regular API: builds 1000 BEAM tuples immediately
items = RustyXML.xpath(doc, "//item")  # 104ms

# Lazy API: keeps node IDs in Rust, builds terms on-demand
result = RustyXML.Native.xpath_lazy(doc, "//item")  # 31ms
count = RustyXML.Native.result_count(result)  # instant
```

### Zero-Copy with Cow

Entity decoding uses `Cow<[u8]>` for optimal allocation:

```rust
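use std::borrow::Cow;
use memchr::memchr;

// `decode_entities` is defined elsewhere and returns the decoded bytes as a Vec<u8>.
pub fn decode_text(input: &[u8]) -> Cow<'_, [u8]> {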
    if memchr(b'&', input).is_none() {
        return Cow::Borrowed(input);  // Zero-copy!
    }
    Cow::Owned(decode_entities(input))
}
```
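
The `decode_entities` helper is not shown above. One possible shape, limited to the five predefined XML entities, is sketched here; it is an assumption about the helper, not the crate's implementation. Character references such as `&#38;` are not handled, and `decode_text` is repeated with a standard-library search so the block compiles without `memchr`.

```rust
use std::borrow::Cow;

/// Decode the five predefined XML entities.
/// Unknown or malformed references are copied through unchanged.
fn decode_entities(input: &[u8]) -> Vec<u8> {
    const TABLE: [(&str, u8); 5] = [
        ("&lt;", b'<'),
        ("&gt;", b'>'),
        ("&amp;", b'&'),
        ("&quot;", b'"'),
        ("&apos;", b'\''),
    ];
    // Decoded output is never longer than the input.
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    'outer: while i < input.len() {
        if input[i] == b'&' {
            for (entity, byte) in TABLE {
                if input[i..].starts_with(entity.as_bytes()) {
                    out.push(byte);
                    i += entity.len();
                    continue 'outer;
                }
            }
        }
        out.push(input[i]);
        i += 1;
    }
    out
}

/// Borrow when there is nothing to decode; allocate only otherwise.
pub fn decode_text(input: &[u8]) -> Cow<'_, [u8]> {
    if !input.contains(&b'&') {
        return Cow::Borrowed(input);
    }
    Cow::Owned(decode_entities(input))
}
```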

---

## Memory Management

### mimalloc Allocator

RustyXML uses [mimalloc](https://github.com/microsoft/mimalloc) as the default allocator:

```rust
#[cfg(feature = "mimalloc")]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
```

**Benefits:**
- 10-20% faster allocation for many small objects
- Reduced fragmentation
- No tracking overhead in default configuration

### Optional Memory Tracking

For profiling, enable the `memory_tracking` feature:

```toml
# In native/rustyxml/Cargo.toml
[features]
default = ["mimalloc", "memory_tracking"]
```

When enabled:
- `RustyXML.Native.get_rust_memory/0` — Current allocation
- `RustyXML.Native.get_rust_memory_peak/0` — Peak allocation
- `RustyXML.Native.reset_rust_memory_stats/0` — Reset and get stats

### Pre-allocated Vectors

All parsing paths pre-allocate vectors with capacity estimates based on input size, reducing reallocation overhead during parsing.

---

## NIF Safety

### The 1ms Rule

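A NIF running on a normal scheduler should return in under 1 ms; a longer call blocks that scheduler and delays every other process assigned to it. RustyXML addresses this in several ways: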

| Approach | Used By | Description |
|----------|---------|-------------|
| Dirty Schedulers | `parse`, `parse_strict`, `parse_and_xpath`, `xpath_with_subspecs`, `xpath_string_value`, `sax_parse` | Runs on dirty CPU scheduler |
| Chunked Processing | `streaming_*` | Returns control between chunks |
| Stateful Resource | `streaming_*` | Lets Elixir control iteration |
| Fast SIMD | all paths | Completes quickly via hardware acceleration |

### Memory Safety

- Documents wrapped in `ResourceArc` with automatic cleanup
- Streaming parsers use `Mutex<StreamingParser>` for thread safety
- All allocations tracked when memory_tracking enabled

### Panic Safety

RustyXML is designed to never crash the BEAM VM:

- **No `.unwrap()` in NIF code paths** — All fallible operations use proper error handling
- **Pre-defined atoms** — Common atoms (`ok`, `error`, `nil`, `text`, `name`) created at compile time
- **Graceful mutex handling** — Poisoned mutexes return `{:error, :mutex_poisoned}` tuples
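
The mutex handling can be sketched with the standard library alone. `NifError` and `with_state` are hypothetical; in the NIF itself the error would be encoded as the `{:error, :mutex_poisoned}` tuple rather than a Rust enum.

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum NifError {
    MutexPoisoned,
}

/// Run `f` on the guarded state. A poisoned lock becomes an error value
/// instead of the panic that `lock().unwrap()` would raise.
pub fn with_state<T, R>(
    state: &Mutex<T>,
    f: impl FnOnce(&mut T) -> R,
) -> Result<R, NifError> {
    match state.lock() {
        Ok(mut guard) => Ok(f(&mut *guard)),
        Err(_poisoned) => Err(NifError::MutexPoisoned),
    }
}
```

A lock becomes poisoned when a thread panics while holding it; returning an error lets the Elixir caller decide how to recover.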

### Atom Table Safety

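BEAM's atom table has a hard limit (1,048,576 atoms by default, adjustable with the `+t` emulator flag), and atoms are never garbage collected. Creating atoms from untrusted input can therefore exhaust the table and crash the VM, so RustyXML uses **binary keys** for user-provided values: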

```elixir
# Safe: predefined atom keys + binary attribute keys
%{:name => "item", :text => "...", "id" => "1", "category" => "cat1"}
```

| Key Type | Implementation | Atom Table Impact |
|----------|----------------|-------------------|
| `:name`, `:text`, `:error` | Pre-defined atoms | Bounded (fixed set) |
| User attribute names | Binary strings | None |

---

## The `~x` Sigil

| Modifier | Effect | Example |
|----------|--------|---------|
| `s` | Return as string | `~x"//title/text()"s` |
| `l` | Return as list | `~x"//item"l` |
| `e` | Decode entities | `~x"//content"e` |
| `o` | Optional (nil on missing) | `~x"//optional"o` |
| `i` | Cast to integer | `~x"//count"i` |
| `f` | Cast to float | `~x"//price"f` |
| `k` | Return as keyword list | `~x"//item"k` |

Modifiers can be combined, for example `~x"//items"slo` (string, list, optional).

---

## API Compatibility

RustyXML is a drop-in replacement for both SweetXml and Saxy. Both APIs coexist with no conflicts (different arities and function names).

### SweetXml-Compatible

| Function | Description | Status |
|----------|-------------|--------|
| `xpath/2,3` | Execute XPath query | Complete |
| `xmap/2,3` | Extract multiple values | Complete |
| `~x` sigil | XPath with modifiers | Complete |
| `stream_tags/2,3` | Stream specific tags | Complete |

### Saxy-Compatible

| Function / Module | Description | Status |
|-------------------|-------------|--------|
| `parse_string/4` | SAX parsing with handler | Complete |
| `parse_stream/4` | Streaming SAX with handler | Complete |
| `stream_events/2` | Lazy stream of SAX events | Complete |
| `encode!/2` | XML encoding | Complete |
| `RustyXML.Handler` | Handler behaviour (= `Saxy.Handler`) | Complete |
| `RustyXML.Partial` | Incremental parsing (= `Saxy.Partial`) | Complete |
| `RustyXML.SimpleForm` | Tuple tree (= `Saxy.SimpleForm`) | Complete |
| `RustyXML.XML` | Builder DSL (= `Saxy.XML`) | Complete |
| `RustyXML.Builder` | Struct→XML protocol (= `Saxy.Builder`) | Complete |

### Migration

```elixir
# From SweetXml — just change the import
import RustyXML  # was: import SweetXml

# From Saxy — just change the module name
RustyXML.parse_string(xml, MyHandler, [])  # was: Saxy.parse_string(...)
RustyXML.SimpleForm.parse_string(xml)      # was: Saxy.SimpleForm.parse_string(...)
```

---

## Benchmark Results

See [BENCHMARK.md](BENCHMARK.md) for detailed performance comparisons.

**vs Saxy (fairest comparison — both are properly bounded streaming parsers):**
- **SAX parsing**: ~1.3-1.8x faster
- **SimpleForm**: ~1.3-1.5x faster
- **Streaming memory**: comparable (~130 KB vs ~125 KB; varies between runs)

**vs SweetXml/xmerl:**
- **Parsing**: 8-72x faster
- **XPath queries**: 1.5-3.7x faster
- **Parse memory**: significantly less (different measurement methods; see [BENCHMARK.md](BENCHMARK.md))
- **Streaming**: 16x faster (SweetXml streaming is unbounded due to xmerl accumulator)

---

## Compliance & Validation

See [COMPLIANCE.md](COMPLIANCE.md) for full details.

- **W3C/OASIS Conformance Suite** — 100% compliance (1089/1089 tests pass)
- **W3C XML 1.0 (Fifth Edition)** — Full strict mode validation
- **XPath 1.0 Specification** — Full axis and function support (13 axes, 27+ functions)

---

## References

- [W3C XML 1.0 (Fifth Edition)](https://www.w3.org/TR/xml/) — XML specification
- [XPath 1.0](https://www.w3.org/TR/xpath-10/) — XPath specification
- [OASIS XML Conformance](https://www.oasis-open.org/committees/xml-conformance/) — Test suite
- [memchr crate](https://docs.rs/memchr/latest/memchr/) — SIMD byte searching
- [rayon crate](https://docs.rs/rayon/latest/rayon/) — Parallel iteration
- [mimalloc](https://github.com/microsoft/mimalloc) — High-performance allocator
- [SweetXml](https://github.com/kbrw/sweet_xml) — Elixir XML library (compatibility target)