README.md

# Unicode IDNA

Pure-Elixir implementation of [UTS #46](https://www.unicode.org/reports/tr46/) *Unicode IDNA Compatibility Processing*, with [RFC 3492](https://www.rfc-editor.org/rfc/rfc3492) Punycode encoding/decoding, the [RFC 5893](https://www.rfc-editor.org/rfc/rfc5893) bidi rule, and CONTEXTJ joiner rules from [RFC 5892](https://www.rfc-editor.org/rfc/rfc5892).

The library converts domain names between their Unicode and ASCII (Punycode `xn--` prefixed) representations and validates that each label conforms to IDNA 2008 as relaxed by UTS #46. It passes the full [`IdnaTestV2.txt`](https://www.unicode.org/Public/17.0.0/idna/IdnaTestV2.txt) conformance suite — 6,389 rows × 3 operations = 19,167 assertions — for Unicode 17.0.

## Installation

```elixir
def deps do
  [
    {:unicode_idna, "~> 0.1"}
  ]
end
```

## Usage

`Unicode.IDNA.to_ascii/2` and `Unicode.IDNA.to_unicode/2` accept either a full domain name as a `String.t` or a list of labels. The return value mirrors the input shape: a string in returns a string out (labels are rejoined with `.`); a list in returns the list of processed labels.

```elixir
# String in / string out — full domain
iex> Unicode.IDNA.to_ascii("bücher.de")
{:ok, "xn--bcher-kva.de"}

iex> Unicode.IDNA.to_unicode("xn--bcher-kva.de")
{:ok, "bücher.de"}

# Alternate IDNA label separators are recognised
iex> Unicode.IDNA.to_ascii("中文。中国")
{:ok, "xn--fiq228c.xn--fiqs8s"}

# List in / list out — already-split labels
iex> Unicode.IDNA.to_ascii(["bücher", "de"])
{:ok, ["xn--bcher-kva", "de"]}

iex> Unicode.IDNA.to_unicode(["xn--bcher-kva", "de"])
{:ok, ["bücher", "de"]}

# Errors
iex> Unicode.IDNA.to_ascii("not_valid")
{:error, :disallowed}

iex> Unicode.IDNA.to_ascii("not_valid", use_std3_ascii_rules: false)
{:ok, "not_valid"}
```

## Public API

* `Unicode.IDNA.to_ascii/2` — UTS #46 ToASCII. Accepts a `t:String.t/0` or a list of label strings; returns the same shape.

* `Unicode.IDNA.to_unicode/2` — UTS #46 ToUnicode, with the same dual-shape semantics.

* `Unicode.IDNA.valid_label?/2` — predicate for a single label, equivalent to `match?({:ok, _}, to_ascii([label], options))`.

* `Unicode.IDNA.Punycode.encode/1` and `Unicode.IDNA.Punycode.decode/1` — RFC 3492 primitives.

* `Unicode.IDNA.Bidi.validate/1` and `validate_in_bidi_domain/1` — RFC 5893 bidi rule.

* `Unicode.IDNA.Context.validate/1` — RFC 5892 Appendix A CONTEXTJ rules for ZWJ / ZWNJ.

## Options

| Option | Default | Meaning |
| --- | --- | --- |
| `:transitional` | `false` | UTS #46 transitional vs. non-transitional processing. The default `false` matches modern browsers (Chrome, Firefox, Safari). |
| `:check_hyphens` | `true` | Reject leading/trailing hyphens and `--` at positions 3–4 (the latter is suppressed for labels that came from a Punycode-decoded ACE form). |
| `:check_bidi` | `true` | If any label contains a right-to-left character, every label in the domain must satisfy the RFC 5893 bidi rule. |
| `:check_joiners` | `true` | Labels containing ZWJ (U+200D) or ZWNJ (U+200C) must satisfy the CONTEXTJ rules of RFC 5892 Appendix A. |
| `:use_std3_ascii_rules` | `true` | Restrict ASCII characters in a label to letters, digits and hyphen. Set `false` to allow `_` and other STD3-disallowed ASCII (e.g. for Twitter-style permissive subdomain rules). |
| `:verify_dns_length` | `true` | Reject empty labels, labels longer than 63 octets, and full domains longer than 253 octets per RFC 1035. |

## Refreshing Unicode data

```bash
mix unicode_idna.download
```

This refreshes `data/idna_mapping_table.txt` and `data/idna_test_v2.txt` (the conformance vectors) from `unicode.org`. The bundled files are committed to source control; the task exists to make version bumps reproducible.