# em_filter
[![Hex.pm](https://img.shields.io/hexpm/v/em_filter.svg?color=darkgreen)](https://hex.pm/packages/em_filter)
[![Hex Docs](https://img.shields.io/badge/hex-docs-blue.svg)](https://hexdocs.pm/em_filter)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE.md)

An Erlang library for building Emergence agents connected to an `em_disco` discovery service.

## Features

- Connects your agent to one or more `em_disco` nodes over persistent WebSockets
- Automatically registers on startup and reconnects on failure
- Announces agent capabilities to the `em_disco` registry via `agent_hello`
- Optional per-agent memory passed between queries, with an ETS-backed mode that survives worker restarts
- Includes a set of HTML scraping utilities

## Concepts

Every node in the Emergence system is an **agent**. An agent has two optional features:

- **Capabilities**: a list of binaries (`<<"rss">>`, `<<"dns">>`, …) announced to `em_disco` at startup. `em_disco` uses them to route queries only to relevant agents.
- **Memory**: a map passed to `handle/2` on every query and replaced by the `NewMemory` value the handler returns.
  - `ram` (default): lives in the process state, resets to `#{}` on restart.
  - `ets`: persisted in a local ETS table, survives worker restarts within the same BEAM session.

Memory is best used for caching expensive operations (HTTP responses, DNS lookups, rate limit state).
**Do not use memory to deduplicate results** — deduplication is handled upstream by the Emquest pipeline.

### Handler contract

Every handler module must export `handle/2`:

```erlang
handle(Body :: binary(), Memory :: map()) ->
    {Result :: term(), NewMemory :: map()}
```

`Body` is the raw JSON query binary. `Result` is typically a list of embryo maps.
A stateless handler can return `Memory` unchanged as `NewMemory`.
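
As a minimal sketch of this contract, the hypothetical module below ignores the query, returns an empty result list, and uses memory only to count the queries it has seen. The module name `counting_handler` and the `query_count` key are illustrative assumptions, not part of `em_filter`.

```erlang
-module(counting_handler).
-export([handle/2]).

%% Hypothetical handler: satisfies the handle/2 contract without doing real work.
-spec handle(binary(), map()) -> {list(), map()}.
handle(_Body, Memory) ->
    %% Read the previous count (0 on the first call or after a reset) and bump it.
    Count = maps:get(query_count, Memory, 0) + 1,
    %% Return an empty result list together with the updated memory map.
    {[], Memory#{query_count => Count}}.
```

The map returned as the second tuple element is what the next call receives as `Memory`.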

### Embryo format

Agents return a list of embryo maps:

```erlang
#{
    <<"type">>       => <<"rss">>,        %% agent-defined type
    <<"properties">> => #{
        <<"url">>    => <<"https://...">>,
        <<"title">>  => <<"...">>,
        <<"resume">> => <<"...">>
    }
}
```
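
A small constructor can keep embryo maps consistent across a handler. The sketch below assumes the three properties shown above are the ones you want to fill; `embryo_sketch` and `new/4` are hypothetical names, not functions provided by `em_filter`.

```erlang
-module(embryo_sketch).
-export([new/4]).

%% Hypothetical helper: builds one embryo map in the shape shown above.
-spec new(binary(), binary(), binary(), binary()) -> map().
new(Type, Url, Title, Resume) ->
    #{
        <<"type">>       => Type,
        <<"properties">> => #{
            <<"url">>    => Url,
            <<"title">>  => Title,
            <<"resume">> => Resume
        }
    }.
```

A handler could then return, for example, `[embryo_sketch:new(<<"rss">>, <<"https://example.org/feed">>, <<"Example">>, <<"Short summary">>)]` as its `Result`.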

## Installation

Add to your `rebar.config`:

```erlang
{deps, [
    {em_filter, "1.2.0"}
]}.
```

## Usage

### Stateless agent

This agent announces its capabilities but keeps no state between queries.

```erlang
em_filter:start_agent(my_agent, my_handler, #{
    capabilities => [<<"search">>, <<"web">>]
}).
```

```erlang
-module(my_handler).
-export([handle/2]).

%% do_search/1 is a placeholder for your own lookup logic.
handle(Body, Memory) ->
    Results = do_search(Body),
    {Results, Memory}.
```

### Agent with memory (cache)

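This handler caches API results in memory, keyed by the raw query body, so a repeated query skips the API call. Starting the agent with `memory => ets` keeps the cache across worker restarts.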

```erlang
-module(my_handler).
-export([handle/2]).

%% fetch_from_api/1 is a placeholder for your own expensive call.
handle(Body, Memory) ->
    Cache = maps:get(cache, Memory, #{}),
    case maps:get(Body, Cache, undefined) of
        undefined ->
            Results  = fetch_from_api(Body),
            NewCache = Cache#{Body => Results},
            {Results, Memory#{cache => NewCache}};
        Cached ->
            {Cached, Memory}
    end.
```

```erlang
em_filter:start_agent(my_agent, my_handler, #{
    capabilities => [<<"search">>],
    memory       => ets
}).
```

## Multi-disco connectivity

An agent connects to every disco node listed in `emergence.conf`.
Each node gets its own persistent WebSocket connection and worker process.

```ini
[em_disco]
nodes = localhost:8080, em-disco.roques.me
```

With this config, `start_agent/3` spawns two workers automatically:
- `my_agent_localhost_8080_server` — connected to local disco
- `my_agent_em_disco_roques_me_443_server` — connected to public disco

Port and transport resolution:
- `localhost` / `127.0.0.1` → port 8080, plain TCP (default)
- any other host without port → port 443, TLS (default)
- explicit port 443 → TLS
- any other explicit port → plain TCP
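
The resolution rules above can be sketched as a small function. This is an illustration of the rules as listed, not the library's actual implementation; `disco_node_sketch`, `resolve/1`, and the `tcp | tls` atoms are hypothetical names. It assumes entries of the form `host` or `host:port` and does not handle IPv6 literals.

```erlang
-module(disco_node_sketch).
-export([resolve/1]).

%% Hypothetical: turn one "host" or "host:port" entry into {Host, Port, Transport}.
-spec resolve(string()) -> {string(), inet:port_number(), tcp | tls}.
resolve(Node) ->
    %% string:split/2 splits on the first ":" only, yielding one or two parts.
    case string:split(string:trim(Node), ":") of
        [Host] ->
            default_for(Host);
        [Host, PortStr] ->
            %% Crashes on a non-numeric port, which is acceptable for a sketch.
            Port = list_to_integer(PortStr),
            {Host, Port, transport_for(Port)}
    end.

%% No explicit port: loopback hosts default to 8080 over plain TCP,
%% every other host defaults to 443 over TLS.
default_for(Host) when Host =:= "localhost"; Host =:= "127.0.0.1" ->
    {Host, 8080, tcp};
default_for(Host) ->
    {Host, 443, tls}.

%% Explicit port: 443 means TLS, anything else means plain TCP.
transport_for(443) -> tls;
transport_for(_Other) -> tcp.
```

With the `nodes` line shown earlier, `resolve("localhost:8080")` gives `{"localhost", 8080, tcp}` and `resolve("em-disco.roques.me")` gives `{"em-disco.roques.me", 443, tls}`.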

## Configuration

The `em_disco` address is resolved in this order:

1. `[em_disco] nodes` in `emergence.conf` (recommended)
2. `EM_DISCO_HOST` / `EM_DISCO_PORT` environment variables (legacy, single node)
3. Default: `localhost:8080`
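
The lookup order can be pictured roughly as follows. This is a sketch of the precedence only, not the library's code: `disco_config_sketch` and `disco_nodes/1` are hypothetical, the node list from `emergence.conf` is assumed to be parsed already and passed in (empty if the file or key is missing), and falling back to port 8080 when only `EM_DISCO_HOST` is set is an assumption.

```erlang
-module(disco_config_sketch).
-export([disco_nodes/1]).

%% Hypothetical: ConfNodes is the already-parsed `nodes` list, or [] if absent.
-spec disco_nodes([string()]) -> [string()].
disco_nodes([_ | _] = ConfNodes) ->
    %% 1. emergence.conf wins whenever it lists at least one node.
    ConfNodes;
disco_nodes([]) ->
    case os:getenv("EM_DISCO_HOST") of
        false ->
            %% 3. Nothing configured: fall back to the local default.
            ["localhost:8080"];
        Host ->
            %% 2. Legacy environment variables describe a single node.
            %% Assumption: a missing EM_DISCO_PORT falls back to "8080".
            Port = os:getenv("EM_DISCO_PORT", "8080"),
            [Host ++ ":" ++ Port]
    end.
```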

`emergence.conf` locations:
- Linux/macOS: `~/.config/emergence/emergence.conf`
- Windows: `%APPDATA%\emergence\emergence.conf`
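
For tooling that needs to find the file, the two locations can be computed like this. It is a sketch under the assumption that `HOME` and `APPDATA` are set; `conf_path_sketch` and `conf_path/0` are hypothetical names, not part of `em_filter`.

```erlang
-module(conf_path_sketch).
-export([conf_path/0]).

%% Hypothetical: returns the expected emergence.conf path for the current OS.
-spec conf_path() -> file:filename().
conf_path() ->
    case os:type() of
        {win32, _} ->
            %% Windows: %APPDATA%\emergence\emergence.conf
            filename:join([os:getenv("APPDATA", ""), "emergence", "emergence.conf"]);
        _ ->
            %% Linux/macOS: ~/.config/emergence/emergence.conf
            filename:join([os:getenv("HOME", ""), ".config", "emergence", "emergence.conf"])
    end.
```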

Full example:

```ini
[em_disco]
nodes = localhost:8080, em-disco.roques.me
```

## HTML utilities

The following helpers are available for agents that scrape HTML:

| Function | Description |
|---|---|
| `strip_scripts/1` | Removes `<script>` tags |
| `extract_elements/2` | CSS-style element extraction |
| `get_text/1` | Strips all HTML tags |
| `extract_attribute/2` | Extracts a tag attribute value |
| `clean_text/3` | Strips noise and decodes entities |
| `decode_html_entities/1` | Decodes `&amp;`, `&#x...;`, `&#...;` |
| `should_skip_link/2` | Filters out unwanted URLs |

## License

Apache 2.0 — see [LICENSE.md](LICENSE.md).