README.md

# lexbor_erl

[![CI](https://github.com/f34nk/lexbor_erl/actions/workflows/ci.yml/badge.svg)](https://github.com/f34nk/lexbor_erl/actions/workflows/ci.yml)
[![lexbor_erl version](https://img.shields.io/hexpm/v/lexbor_erl.svg)](https://hex.pm/packages/lexbor_erl)
[![Hex.pm](https://img.shields.io/hexpm/dt/lexbor_erl.svg)](https://hex.pm/packages/lexbor_erl)

An Erlang wrapper for the [Lexbor](https://github.com/lexbor/lexbor) HTML parser and DOM library via a port-based architecture.

## Overview

`lexbor_erl` provides safe, fast HTML parsing, CSS selector querying, DOM manipulation, and streaming parser capabilities for Erlang applications. It wraps the high-performance Lexbor C library using a port-based worker pool architecture for isolation, safety, and parallel processing.

## Features

- **HTML5-tolerant parsing** with automatic error recovery
- **CSS selector queries** (class, ID, tag, attributes, combinators, pseudo-classes)
- **DOM manipulation** - modify attributes, text content, and tree structure
- **Streaming parser** - parse large HTML documents incrementally
- **Stateless operations** for quick one-off tasks
- **Stateful document management** for complex workflows
- **Parallel processing** - worker pool architecture for concurrent operations
- **Safe for the BEAM** - crashes in native code don't bring down the VM
- **No atom leaks** - all user input stays as binaries

## Prerequisites

- Erlang/OTP (tested with OTP 24+)
- CMake 3.10+
- [Lexbor library](https://github.com/lexbor/lexbor) installed on your system

### Installing Lexbor

On macOS with Homebrew:
```bash
brew install lexbor
```

On Ubuntu/Debian:
```bash
sudo apt-get install liblexbor-dev
```

Or build from source:
```bash
git clone https://github.com/lexbor/lexbor.git
cd lexbor
mkdir build && cd build
cmake ..
make
sudo make install
```

## Building

```bash
make
```

## Quick Start

```erlang
1> lexbor_erl:start().
ok

%% Stateless: parse and serialize
2> {ok, Html} = lexbor_erl:parse_serialize(<<"<div>Hello<span>World">>).
{ok,<<"<html><head></head><body><div>Hello<span>World</span></div></body></html>">>}

%% Stateless: select elements
3> {ok, List} = lexbor_erl:select_html(
     <<"<ul><li class=a>A</li><li class=b>B</li></ul>">>, 
     <<"li.b">>).
{ok,[<<"<li class=\"b\">B</li>">>]}

%% Stateful: parse document
4> {ok, Doc} = lexbor_erl:parse(
     <<"<div id=app><ul><li class=a>A</li><li class=b>B</li></ul></div>">>).
{ok,1}

%% Select nodes
5> {ok, Nodes} = lexbor_erl:select(Doc, <<"#app li">>).
{ok,[{node,140735108544752},{node,140735108544896}]}

%% Get node HTML
6> [lexbor_erl:outer_html(Doc, N) || N <- Nodes].
[{ok,<<"<li class=\"a\">A</li>">>},{ok,<<"<li class=\"b\">B</li>">>}]

%% DOM manipulation: modify attributes
7> {ok, [Li]} = lexbor_erl:select(Doc, <<"li.a">>).
{ok,[{node,140735108544752}]}

8> lexbor_erl:set_attribute(Doc, Li, <<"class">>, <<"modified">>).
ok

9> lexbor_erl:get_attribute(Doc, Li, <<"class">>).
{ok,<<"modified">>}

%% DOM manipulation: modify text content
10> lexbor_erl:set_text(Doc, Li, <<"New Text">>).
ok

11> lexbor_erl:get_text(Doc, Li).
{ok,<<"New Text">>}

%% Streaming parser: parse incrementally
12> {ok, Session} = lexbor_erl:parse_stream_begin().
{ok,72057594037927937}

13> ok = lexbor_erl:parse_stream_chunk(Session, <<"<div><p>He">>).
ok

14> ok = lexbor_erl:parse_stream_chunk(Session, <<"llo</p></div>">>).
ok

15> {ok, StreamDoc} = lexbor_erl:parse_stream_end(Session).
{ok,72057594037927938}

%% Release documents
16> ok = lexbor_erl:release(Doc).
ok

17> ok = lexbor_erl:release(StreamDoc).
ok

18> lexbor_erl:stop().
ok
```

See the [examples/](https://github.com/f34nk/lexbor_erl/tree/main/examples) directory for complete working examples:

## Configuration

In your `sys.config`:

```erlang
{lexbor_erl, [
  {port_cmd, "priv/lexbor_port"},
  {op_timeout_ms, 3000}
]}.
```

## Parallelism and Concurrency

`lexbor_erl` uses a **worker pool architecture** to enable true parallel processing of HTML operations:

### Architecture

- **Multiple port workers**: Configurable pool of independent C port processes
- **Smart routing**: 
  - Stateless operations (e.g., `parse_serialize/1`, `select_html/2`) use time-based hash distribution for load balancing
  - Stateful operations route by `DocId` to ensure the same worker handles all operations for a given document
- **Isolation**: Each worker process is independent with its own document registry
- **Individual supervision**: Each worker is supervised independently - if one crashes, only that worker restarts
- **Fault tolerance**: Worker crashes don't affect other workers or the BEAM VM; documents on crashed worker are lost but other workers continue serving

### Configuration

Set the pool size in your `sys.config`:

```erlang
{lexbor_erl, [
  {pool_size, 8},              % Number of parallel workers (default: scheduler count)
  {op_timeout_ms, 3000}        % Timeout per operation
]}.
```

Or via environment variable when starting the application:

```erlang
application:set_env(lexbor_erl, pool_size, 8).
```

### Thread Safety and Fault Tolerance

- **Safe by design**: Each worker is single-threaded, processing one request at a time
- **No shared state**: Documents are isolated to their respective workers
- **Concurrent operations**: Multiple workers can process different documents simultaneously
- **Deterministic routing**: A document always routes to the same worker via the worker ID encoded in the `DocId`
- **Individual worker restart**: If a worker crashes, only that worker is restarted by the supervisor
- **Limited blast radius**: Worker crashes only affect documents on that specific worker
- **Automatic recovery**: Crashed workers are automatically restarted and can accept new documents

### Performance Characteristics

- **Parallelism**: Leverages all CPU cores for concurrent HTML parsing and manipulation
- **No contention**: No locks or shared mutable state between workers
- **Linear scaling**: Performance scales linearly with the number of workers (up to CPU core count)
- **Stateless optimization**: Stateless operations (`parse_serialize`, `select_html`) can use any available worker

## License

LGPL-2.1-or-later

## Credits

Built on top of the [Lexbor](https://github.com/lexbor/lexbor) HTML parser library.