-module(erli18n_negotiate).
-moduledoc """
Canonicalization-aware BCP-47 locale negotiation and fallback (Phase 2).
This module is the pure, total, dependency-free engine behind erli18n's
opt-in locale-fallback chain and the `Accept-Language` negotiation helpers
exposed on the `erli18n` facade (`negotiate/2`, `parse_accept_language/1`,
`canonicalize_locale/1`). It holds **no** state: no `gen_server`, no ETS,
no process dictionary, no `application:get_env`. Every function runs in the
caller's process and is property-testable in isolation.
## The problem it solves
erli18n catalogs are keyed by exact binary (`<<"pt_BR">>`). Two correctness
gaps follow from that:
1. A `pt_BR` user with only a `pt` catalog loaded gets the raw `msgid`
(English) instead of Portuguese — there is no base-language fallback.
2. HTTP delivers hyphenated, mixed-case tags (`pt-BR`, `PT_br`) and legacy
subtags (`iw` for Hebrew), none of which match the underscored catalog
key `pt_BR`.
This module closes both: `canonicalize/1` folds a tag to the catalog-key
shape, `fallback_chain/2` builds the ordered candidate list to try, and
`parse_accept_language/1` + `negotiate/2,3` / `best_match/3` pick the best
supported locale from a client preference list.
It does **not** change erli18n's default behavior. The facade only consults
this module **after** an exact-match miss and **only** when the application
env `erli18n.locale_fallback` is enabled (default `off`). The lock-free
exact-hit hot path is untouched.
## Canonicalization (`canonicalize/1`)
Target shape = erli18n catalog key = underscore-joined, RFC 5646 §2.1.1
positional casing: language lowercase, script Titlecase, region UPPERCASE
(`pt_BR`, `zh_Hant`, `zh_Hant_TW`). The transform is:
- Strip a POSIX charset/modifier suffix (`pt_BR.UTF-8`, `ca_ES@valencia`).
- Treat `-` and `_` as equivalent separators.
- Case each subtag by position (language) and byte length (2 = region,
4 = script, else lowercase).
- Map a small, **closed** set of IANA-deprecated two-letter language codes
to their preferred value, on the language subtag only.
It is **idempotent** (`canonicalize(canonicalize(X)) =:= canonicalize(X)`)
and never raises on any binary content: an empty tag, a tag over
`?MAX_TAG_BYTES`, or a tag with more than `?MAX_SUBTAGS` subtags is returned
unchanged.
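
The transform can be exercised from the shell. The session below is a sketch with illustrative inputs (none come from a real catalog); the 40-byte tag in the last step is a made-up oversized value.

```erlang
1> erli18n_negotiate:canonicalize(<<"pt_BR.UTF-8">>).
<<"pt_BR">>
2> erli18n_negotiate:canonicalize(<<"SR-latn-rs">>).
<<"sr_Latn_RS">>
3> C = erli18n_negotiate:canonicalize(<<"IW_il">>).
<<"he_IL">>
4> erli18n_negotiate:canonicalize(C) =:= C.
true
5> Big = binary:copy(<<"a-">>, 20), erli18n_negotiate:canonicalize(Big) =:= Big.
true
```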
### Legacy-alias table (the complete, IN-scope set)
| Deprecated | Preferred | Language |
|---|---|---|
| `in` | `id` | Indonesian |
| `iw` | `he` | Hebrew |
| `ji` | `yi` | Yiddish |
| `jw` | `jv` | Javanese |
| `mo` | `ro` | Moldovan → Romanian |
**Out of scope (documented non-goals):** `sh` (macrolanguage, no preferred
value), `no`/`nb`/`nn` (not deprecated), `tl`/`fil`, the script-vs-region
inference `zh_Hans` ⇄ `zh_CN` (needs the CLDR *Add Likely Subtags*
algorithm + data), and grandfathered/irregular tags (`i-klingon`). Those
pass through as ordinary (mis)canonicalized binaries that simply miss the
catalog — never special-cased.
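
The scope boundary shows up directly in the output. In this sketch (inputs are illustrative) the alias applies only when the deprecated code is the language subtag, and the out-of-scope tags come back canonically cased but otherwise untouched:

```erlang
1> erli18n_negotiate:canonicalize(<<"in-ID">>).
<<"id_ID">>
2> erli18n_negotiate:canonicalize(<<"zh-in">>).
<<"zh_IN">>
3> erli18n_negotiate:canonicalize(<<"mo-MD">>).
<<"ro_MD">>
4> erli18n_negotiate:canonicalize(<<"sh">>).
<<"sh">>
5> erli18n_negotiate:canonicalize(<<"zh-CN">>).
<<"zh_CN">>
6> erli18n_negotiate:canonicalize(<<"i-klingon">>).
<<"i_klingon">>
```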
## Fallback chain (`fallback_chain/2`)
RFC 4647 §3.4 *Lookup*: canonicalize, then progressively drop the trailing
subtag, appending the (canonicalized) default last. `pt-BR` with default
`en` yields `[<<"pt_BR">>, <<"pt">>, <<"en">>]`. The chain is deduplicated
with order preserved and capped at `?MAX_CHAIN` entries. The facade walks it
with one catalog read per candidate, short-circuiting on the first hit, so
the cost is O(chain length) extra reads **only on a miss** and zero on a hit.
Script subtags are kept during truncation (`zh_Hant_TW → zh_Hant → zh`),
matching RFC 4647 Lookup rather than CLDR's script-aware stop.
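
The facade's walk is not part of this module, but its shape can be sketched with `lists:search/2`, which stops at the first hit. `Loaded` below is a hypothetical stand-in for the set of loaded catalog locales, not an erli18n API:

```erlang
1> Chain = erli18n_negotiate:fallback_chain(<<"pt-BR">>, <<"en">>).
[<<"pt_BR">>,<<"pt">>,<<"en">>]
2> Loaded = [<<"pt">>, <<"en">>].
[<<"pt">>,<<"en">>]
3> lists:search(fun(L) -> lists:member(L, Loaded) end, Chain).
{value,<<"pt">>}
```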
## Accept-Language (`parse_accept_language/1`, `best_match/3`)
`parse_accept_language/1` parses an HTTP `Accept-Language` header
(RFC 9110 §12.5.4) into `[{Range, Q}]` with `Q` as an integer in milli-units
(`0..1000`). Absent `q` is `1000`; a well-formed `q=0` entry is dropped
("not acceptable"); the list is sorted by descending `Q` with a stable
header-order tiebreak. The output shape matches cowlib's
`cow_http_hd:parse_accept_language/1`, but this parser is total/fail-soft
(it never crashes on malformed input — cowlib does).
`best_match/3` / `negotiate/2,3` run RFC 4647 Lookup of the (already
priority-ordered) preference list against the available catalog locales,
returning the first supported match (or a default / `error`).
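
Put together, a request header flows through the parser and then through Lookup. A sketch with a made-up header and made-up catalog lists:

```erlang
1> Prefs = erli18n_negotiate:parse_accept_language(<<"fr-CH, fr;q=0.9, en;q=0.8">>).
[{<<"fr-ch">>,1000},{<<"fr">>,900},{<<"en">>,800}]
2> erli18n_negotiate:negotiate(Prefs, [<<"en">>, <<"de">>]).
{ok,<<"en">>}
3> erli18n_negotiate:best_match(Prefs, [<<"de">>], <<"de">>).
<<"de">>
```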
## Totality and anti-DoS
Consistent with `erli18n_interp` and `erli18n_plural`, all work is bounded,
fails closed, and never interns untrusted text into atoms:
- `?MAX_TAG_BYTES` (35) — a longer tag/range is returned unchanged / skipped.
- `?MAX_SUBTAGS` (8) — a tag with more subtags is returned unchanged.
- `?MAX_CHAIN` (8) — fallback chain length cap.
- `?MAX_HEADER_BYTES` (4096) — a longer `Accept-Language` header → `[]`.
- `?MAX_RAW_ELEMS` (64) — comma-split element cap (RFC 9110 §5.6.1) → `[]`.
- `?MAX_RANGES` (32) — accepted-range budget in `parse_accept_language/1`;
per-consumed-cell budget (32 cells inspected max) in `to_locale_list/2`.
No `binary_to_atom`/`list_to_atom` is used anywhere; locales stay binaries,
so a stream of distinct hostile tags cannot exhaust the atom table.
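
The caps can be observed with synthetic oversized inputs. This is a sketch; the inputs are built with `binary:copy/2` and `lists:duplicate/2` purely for illustration. The last call shows the per-consumed-cell budget: 32 wildcards exhaust it before `en` is ever inspected.

```erlang
1> erli18n_negotiate:parse_accept_language(binary:copy(<<"a">>, 4097)).
[]
2> erli18n_negotiate:parse_accept_language(binary:copy(<<"en,">>, 70)).
[]
3> length(erli18n_negotiate:parse_accept_language(binary:copy(<<"en,">>, 40))).
32
4> erli18n_negotiate:negotiate(lists:duplicate(32, <<"*">>) ++ [<<"en">>], [<<"en">>]).
error
```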
## Quickstart
```erlang
1> erli18n_negotiate:canonicalize(<<"pt-BR">>).
<<"pt_BR">>
2> erli18n_negotiate:canonicalize(<<"iw-IL">>).
<<"he_IL">>
3> erli18n_negotiate:fallback_chain(<<"pt-BR">>, <<"en">>).
[<<"pt_BR">>,<<"pt">>,<<"en">>]
4> erli18n_negotiate:parse_accept_language(<<"da, en-gb;q=0.8, en;q=0.7">>).
[{<<"da">>,1000},{<<"en-gb">>,800},{<<"en">>,700}]
5> erli18n_negotiate:negotiate([<<"pt-BR">>], [<<"pt">>, <<"en">>]).
{ok,<<"pt">>}
```
""".
-export([
canonicalize/1,
fallback_chain/2,
override_chain/3,
parse_accept_language/1,
negotiate/2,
negotiate/3,
negotiate_with_index/2,
available_index/1,
best_match/3
]).
-export_type([locale/0, language_range/0, qvalue/0, available_index/0]).
%% ===================================================================
%% Types
%% ===================================================================
-doc """
A locale tag as a binary, in erli18n catalog-key shape after
canonicalization (`<<"pt_BR">>`, `<<"zh_Hant">>`). Same semantics as
`t:erli18n_server:locale/0`.
""".
-type locale() :: binary().
-doc """
An RFC 4647 language range as it appears on the wire in an `Accept-Language`
header (`<<"en-gb">>`); may be the wildcard `<<"*">>`. ASCII-lowercased by
`parse_accept_language/1`, hyphen-separated (NOT yet canonicalized).
""".
-type language_range() :: binary().
-doc """
A quality value as an integer in milli-units, `0..1000` (`q=1` → `1000`,
`q=0.8` → `800`). Integer arithmetic avoids float parsing of untrusted text.
""".
-type qvalue() :: 0..1000.
-doc """
A prebuilt canonical→original index of an available-locale set: maps
`canonicalize(Original)` to the original `Available` casing, first occurrence
winning. Produced by `available_index/1` and consumed by
`negotiate_with_index/2`, so a caller negotiating many preference lists against
ONE available set builds the index once and reuses it.
""".
-type available_index() :: #{locale() => locale()}.
%% ===================================================================
%% Anti-DoS caps. Bound the work fail-closed, mirroring the caps in
%% `erli18n_interp`. These CLAMP/skip; they never raise.
%% ===================================================================
%% Maximum bytes of a single locale tag / language range. RFC 5646 tags are
%% short; a longer input is returned unchanged (canonicalize) or skipped
%% (a range). 35 covers every realistic language-script-region-variant tag.
-define(MAX_TAG_BYTES, 35).
%% Maximum subtag count in a tag. A tag with more is returned unchanged so a
%% pathological `a-a-a-...` cannot force per-segment allocation.
-define(MAX_SUBTAGS, 8).
%% Maximum fallback-chain length (extra catalog reads on a miss).
-define(MAX_CHAIN, 8).
%% Maximum `Accept-Language` header size; a larger header parses to `[]`
%% before any splitting.
-define(MAX_HEADER_BYTES, 4096).
%% Maximum comma-separated elements accepted before splitting work begins
%% (RFC 9110 §5.6.1 list-DoS bound); more → `[]`.
-define(MAX_RAW_ELEMS, 64).
%% One constant backing two distinct budgets, both fail-closed at 32:
%% - `parse_accept_language/1`: maximum ACCEPTED (post-filter) ranges kept
%% from one header; there a skipped/empty element does NOT consume the
%% budget (the budget counts outputs).
%% - `to_locale_list/2` (raw negotiation input): a per-CONSUMED-cell budget —
%% EVERY inspected cell (accepted, wildcard-skipped, or oversized-skipped)
%% consumes one unit, so at most 32 cells are ever inspected regardless of
%% how many are skipped (O(1) in input length on skip-heavy hostile input).
-define(MAX_RANGES, 32).
%% ===================================================================
%% Canonicalization
%% ===================================================================
-doc """
Canonicalizes ONE BCP-47 / POSIX locale tag to erli18n catalog-key shape.
Underscore-joined, RFC 5646 §2.1.1 positional casing (language lowercase,
script Titlecase, region UPPERCASE), with a charset/modifier suffix stripped
and a bounded legacy-language alias applied to the language subtag. Hyphen
and underscore are equivalent on input.
Total and idempotent: any binary input returns a binary and re-running
produces the same result. A binary over `?MAX_TAG_BYTES`, an empty binary,
or a tag with more than `?MAX_SUBTAGS` subtags is returned UNCHANGED
(fail-soft). A non-binary argument is a programmer error and raises
`function_clause` (the contract is binary-in/binary-out).
```erlang
1> erli18n_negotiate:canonicalize(<<"PT_br">>).
<<"pt_BR">>
2> erli18n_negotiate:canonicalize(<<"zh-hant-tw">>).
<<"zh_Hant_TW">>
3> erli18n_negotiate:canonicalize(<<"ca_ES@valencia">>).
<<"ca_ES">>
4> erli18n_negotiate:canonicalize(<<"iw">>).
<<"he">>
```
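
The fail-soft cases can be checked the same way. This is a sketch with made-up inputs: eight subtags are still cased, while nine exceed `?MAX_SUBTAGS` and come back unchanged.

```erlang
1> erli18n_negotiate:canonicalize(<<>>).
<<>>
2> erli18n_negotiate:canonicalize(<<"A-b-C-d-E-f-G-h">>).
<<"a_b_c_d_e_f_g_h">>
3> erli18n_negotiate:canonicalize(<<"A-b-C-d-E-f-G-h-I">>).
<<"A-b-C-d-E-f-G-h-I">>
```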
See `fallback_chain/2` (uses this) and the module doc for the alias table
and the documented non-goals (`zh_Hans` ⇄ `zh_CN` Likely Subtags).
""".
-spec canonicalize(binary()) -> binary().
canonicalize(Tag) when is_binary(Tag) ->
    Size = byte_size(Tag),
    case Size =:= 0 orelse Size > ?MAX_TAG_BYTES of
        true -> Tag;
        false -> canonicalize_bounded(Tag)
    end.
%% Internal: canonicalize a tag already known to be 1..?MAX_TAG_BYTES bytes.
-spec canonicalize_bounded(binary()) -> binary().
canonicalize_bounded(Tag0) ->
    Tag = strip_posix_suffix(Tag0),
    Parts = binary:split(Tag, [~"-", ~"_"], [global]),
    case length(Parts) > ?MAX_SUBTAGS of
        true -> Tag0;
        false -> join_underscore(case_subtags(Parts))
    end.
%% Cut at the first '.' (POSIX charset, e.g. `pt_BR.UTF-8`) or '@' (POSIX
%% modifier, e.g. `ca_ES@valencia`), keeping the head.
-spec strip_posix_suffix(binary()) -> binary().
strip_posix_suffix(Tag) ->
case binary:match(Tag, [~".", ~"@"]) of
nomatch -> Tag;
{Pos, _Len} -> binary:part(Tag, 0, Pos)
end.
%% Case each subtag by position: subtag 0 (language) lowercase + alias; the
%% rest by byte length (2 = region UPPER, 4 = script Title, else lower).
%% The only caller feeds `binary:split/3` output, which is always a non-empty
%% list (at minimum `[Tag]`), so there is no `[]` clause — a `[]` here would be
%% a contract violation and is left to crash explicitly.
-spec case_subtags([binary(), ...]) -> [binary(), ...].
case_subtags([Lang | Rest]) ->
[alias_lang(ascii_lower(Lang)) | [case_subtag(S) || S <- Rest]].
-spec case_subtag(binary()) -> binary().
case_subtag(S) ->
case byte_size(S) of
2 -> ascii_upper(S);
4 -> ascii_title(S);
_ -> ascii_lower(S)
end.
%% Closed, compile-time legacy-language alias table (IANA deprecated
%% two-letter codes carrying a Preferred-Value). Applies to the language
%% subtag only. Any other value passes through unchanged.
-spec alias_lang(binary()) -> binary().
alias_lang(~"in") -> ~"id";
alias_lang(~"iw") -> ~"he";
alias_lang(~"ji") -> ~"yi";
alias_lang(~"jw") -> ~"jv";
alias_lang(~"mo") -> ~"ro";
alias_lang(Other) -> Other.
-spec join_underscore([binary()]) -> binary().
join_underscore(Parts) ->
iolist_to_binary(lists:join(~"_", Parts)).
%% ASCII-only case folders (BCP-47 subtags are ASCII by spec). Deliberately
%% NOT `string:lowercase/1`: byte-range folding leaves every non-ASCII byte
%% untouched (no Unicode case mapping, so no Turkish-İ style surprises) and
%% allocates leanly.
-spec ascii_lower(binary()) -> binary().
ascii_lower(B) ->
<<<<(lower_byte(C))>> || <<C>> <= B>>.
-spec ascii_upper(binary()) -> binary().
ascii_upper(B) ->
<<<<(upper_byte(C))>> || <<C>> <= B>>.
%% Only called on a length-4 script subtag (`case_subtag/1`), so the input is
%% never empty; no `<<>>` clause.
-spec ascii_title(binary()) -> binary().
ascii_title(<<First, Rest/binary>>) -> <<(upper_byte(First)), (ascii_lower(Rest))/binary>>.
-spec lower_byte(byte()) -> byte().
lower_byte(C) when C >= $A, C =< $Z -> C + 32;
lower_byte(C) -> C.
-spec upper_byte(byte()) -> byte().
upper_byte(C) when C >= $a, C =< $z -> C - 32;
upper_byte(C) -> C.
%% ===================================================================
%% Fallback chain (RFC 4647 §3.4 Lookup)
%% ===================================================================
-doc """
Builds the ordered, deduplicated RFC 4647 *Lookup* fallback chain for a
locale, floored with `Default` (canonicalized) unless `Default =:= undefined`.
`Locale` is canonicalized first, then the trailing subtag is dropped
repeatedly until only the language subtag is left (`zh_Hant_TW → zh_Hant → zh`);
`Default` is appended last. The result is deduplicated with order preserved
and capped at `?MAX_CHAIN`, so a `Default` that already occurs in the
truncation path keeps that earlier position instead of being repeated at the
end. The head is the most specific candidate. Total; the returned list is
always non-empty (at minimum `[canonicalize(Locale)]`).
```erlang
1> erli18n_negotiate:fallback_chain(<<"pt-BR">>, <<"en">>).
[<<"pt_BR">>,<<"pt">>,<<"en">>]
2> erli18n_negotiate:fallback_chain(<<"zh_Hant_TW">>, <<"en">>).
[<<"zh_Hant_TW">>,<<"zh_Hant">>,<<"zh">>,<<"en">>]
3> erli18n_negotiate:fallback_chain(<<"en">>, undefined).
[<<"en">>]
```
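
Deduplication matters when `Default` is already on the truncation path. A sketch with illustrative inputs: the default is not repeated, it keeps its earlier position, and it is canonicalized like any other tag.

```erlang
1> erli18n_negotiate:fallback_chain(<<"en-US">>, <<"en">>).
[<<"en_US">>,<<"en">>]
2> erli18n_negotiate:fallback_chain(<<"zh-Hant-TW">>, <<"zh_Hant">>).
[<<"zh_Hant_TW">>,<<"zh_Hant">>,<<"zh">>]
3> erli18n_negotiate:fallback_chain(<<"pt-BR">>, <<"EN">>).
[<<"pt_BR">>,<<"pt">>,<<"en">>]
```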
The facade walks this list with one catalog read per candidate, returning on
the first hit; this is what makes a `pt_BR` user fall back to a loaded `pt`
catalog. See `canonicalize/1`.
""".
-spec fallback_chain(locale(), locale() | undefined) -> [locale(), ...].
fallback_chain(Locale, Default) when is_binary(Locale) ->
cap_with_default(dedup(base_chain(canonicalize(Locale))), Default).
-doc """
Builds an explicit-override fallback chain for `{explicit, Map}` mode: the
canonicalized `Overrides` list prefixed with `canonicalize(Locale)` and floored
with `Default`. Deduplicated with order preserved and bounded by the SAME
`?MAX_CHAIN` cap as `fallback_chain/2`. Non-binary entries in `Overrides` are
silently dropped. Total.
Exposed so the facade's explicit-map mode reuses one bounding/dedup
implementation instead of re-deriving the cap.
```erlang
1> erli18n_negotiate:override_chain(<<"de-AT">>, [<<"de">>], <<"en">>).
[<<"de_AT">>,<<"de">>,<<"en">>]
```
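
A further sketch (inputs are illustrative): overrides are canonicalized before deduplication, so a differently cased duplicate of the locale itself collapses, and passing `undefined` omits the default floor.

```erlang
1> erli18n_negotiate:override_chain(<<"de-AT">>, [<<"de-DE">>, <<"de">>, <<"DE_at">>], undefined).
[<<"de_AT">>,<<"de_DE">>,<<"de">>]
```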
""".
-spec override_chain(locale(), [locale()], locale() | undefined) -> [locale(), ...].
override_chain(Locale, Overrides, Default) when is_binary(Locale), is_list(Overrides) ->
Canon = canonicalize(Locale),
CanonOverrides = [canonicalize(X) || X <- Overrides, is_binary(X)],
cap_with_default(dedup([Canon | CanonOverrides]), Default).
%% The candidate list before the default floor: the RFC 4647 truncation
%% prefixes of the (already canonical) tag. An over-`?MAX_TAG_BYTES` tag is
%% one that `canonicalize/1` returned UNCHANGED (fail-soft), so it is treated
%% as a single opaque candidate here — never fed into the per-level
%% `truncate_one/1` scan, which would be O(n^2) on a pathological
%% many-separator tag.
-spec base_chain(binary()) -> [binary(), ...].
base_chain(Tag) ->
case byte_size(Tag) > ?MAX_TAG_BYTES of
true -> [Tag];
false -> truncations(Tag)
end.
%% [Tag, parent, grandparent, ...] down to the single language subtag.
-spec truncations(binary()) -> [binary(), ...].
truncations(Tag) ->
case truncate_one(Tag) of
nomatch -> [Tag];
Parent -> [Tag | truncations(Parent)]
end.
%% Drop the last `_`-delimited subtag; `nomatch` when there is no separator.
%% Mirrors the `erli18n_plural:base_locale/1` idiom (duplicated to keep this
%% module self-contained — `base_locale/1` is private there).
-spec truncate_one(binary()) -> binary() | nomatch.
truncate_one(Tag) ->
    case binary:matches(Tag, ~"_") of
        [] ->
            nomatch;
        Matches ->
            {Pos, _Len} = lists:last(Matches),
            binary:part(Tag, 0, Pos)
    end.
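%% Bound the chain at `?MAX_CHAIN` while ALWAYS keeping the (canonicalized)
%% `Default` floor in it: the base is cut to `?MAX_CHAIN - 1` before `Default`
%% is appended, so the floor survives even when the truncation prefix alone
%% would fill the cap. `Default` is the last element unless it already occurs
%% earlier, where `dedup/1` keeps the first occurrence.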
-spec cap_with_default([locale(), ...], locale() | undefined) -> [locale(), ...].
cap_with_default(Base, undefined) ->
lists:sublist(Base, ?MAX_CHAIN);
cap_with_default(Base, Default) when is_binary(Default) ->
dedup(lists:sublist(Base, ?MAX_CHAIN - 1) ++ [canonicalize(Default)]).
%% Order-preserving deduplication (NOT `lists:usort/1`: order is load-bearing).
-spec dedup([locale()]) -> [locale()].
dedup(List) ->
    {Acc, _Seen} = lists:foldl(
        fun(X, {AccIn, Seen}) ->
            case maps:is_key(X, Seen) of
                true -> {AccIn, Seen};
                false -> {[X | AccIn], Seen#{X => true}}
            end
        end,
        {[], #{}},
        List
    ),
    lists:reverse(Acc).
%% ===================================================================
%% Accept-Language parsing (RFC 9110 §12.5.4 + §12.4.2)
%% ===================================================================
-doc """
Parses an HTTP `Accept-Language` header into `[{Range, Q}]`.
`Range` is the ASCII-lowercased, hyphen-separated language range as on the
wire (NOT canonicalized; may be `<<"*">>`); `Q` is an integer in milli-units
(`0..1000`). An absent `q` parameter means `1000`; a well-formed `q=0` entry
is DROPPED. The list is sorted by descending `Q`, ties broken by ascending
header position (stable).
Total and fail-soft: any malformed element is skipped, never crashing.
Returns `[]` on an empty header, a header over `?MAX_HEADER_BYTES`, or one
with more than `?MAX_RAW_ELEMS` comma elements. A non-binary argument raises
`function_clause`. At most `?MAX_RANGES` ranges are returned.
The output shape matches cowlib's `cow_http_hd:parse_accept_language/1`, so a
Cowboy app may feed either source into `negotiate/2`. Unlike cowlib, this
parser never raises on hostile input.
```erlang
1> erli18n_negotiate:parse_accept_language(<<"da, en-gb;q=0.8, en;q=0.7">>).
[{<<"da">>,1000},{<<"en-gb">>,800},{<<"en">>,700}]
2> erli18n_negotiate:parse_accept_language(<<"fr;q=0, de">>).
[{<<"de">>,1000}]
```
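
Malformed elements are skipped rather than rejected. In this sketch (a made-up header) the invalid `q` falls back to full weight, the empty element and the underscored `de_DE` are dropped, and equal weights keep header order:

```erlang
1> erli18n_negotiate:parse_accept_language(<<"en;q=abc, ;;, de_DE, fr, ja;q=0.25">>).
[{<<"en">>,1000},{<<"fr">>,1000},{<<"ja">>,250}]
```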
See `best_match/3` and `negotiate/2,3`.
""".
-spec parse_accept_language(binary()) -> [{language_range(), qvalue()}].
parse_accept_language(Bin) when is_binary(Bin) ->
parse_accept_language_1(Bin).
-spec parse_accept_language_1(binary()) -> [{language_range(), qvalue()}].
parse_accept_language_1(Bin) ->
    case byte_size(Bin) > ?MAX_HEADER_BYTES of
        true ->
            [];
        false ->
            Elems = binary:split(Bin, ~",", [global]),
            case length(Elems) > ?MAX_RAW_ELEMS of
                true -> [];
                false -> stable_sort_desc_q(parse_elems(Elems, ?MAX_RANGES, 1, []))
            end
    end.
%% Fold elements in header order. The accepted-range budget and the index
%% counter advance ONLY on an accepted entry, so skipped/empty elements cost
%% nothing and the index is a dense header-order rank for the stable tiebreak.
-spec parse_elems([binary()], non_neg_integer(), pos_integer(), [
{qvalue(), pos_integer(), language_range()}
]) ->
[{qvalue(), pos_integer(), language_range()}].
parse_elems([], _Budget, _Idx, Acc) ->
Acc;
parse_elems(_Elems, 0, _Idx, Acc) ->
Acc;
parse_elems([E | Rest], Budget, Idx, Acc) ->
case parse_one(E) of
skip -> parse_elems(Rest, Budget, Idx, Acc);
{Range, Q} -> parse_elems(Rest, Budget - 1, Idx + 1, [{Q, Idx, Range} | Acc])
end.
-spec parse_one(binary()) -> {language_range(), qvalue()} | skip.
parse_one(E0) ->
case trim_ows(E0) of
<<>> ->
skip;
E ->
{RangePart, Q} = split_range_q(E),
finalize(trim_ows(RangePart), Q)
end.
%% Split a `range[;params]` element into the range and the resolved q (1000
%% when there is no `q=` parameter).
-spec split_range_q(binary()) -> {binary(), qvalue()}.
split_range_q(E) ->
case binary:split(E, ~";") of
[Range] -> {Range, 1000};
[Range, Params] -> {Range, find_q(binary:split(Params, ~";", [global]))}
end.
-spec find_q([binary()]) -> qvalue().
find_q([]) ->
1000;
find_q([Token | Rest]) ->
case trim_ows(Token) of
<<Q, $=, Val/binary>> when Q =:= $q; Q =:= $Q -> qval_to_milli(trim_ows(Val));
_ -> find_q(Rest)
end.
%% Parse a qvalue ("0" / "1" / "0.NNN" / "1.000") into milli-units. Any
%% malformation (q=abc, q=2, q=1.5, q=) clamps to full weight (1000); only a
%% well-formed zero yields 0. Never `binary_to_float`.
-spec qval_to_milli(binary()) -> qvalue().
qval_to_milli(~"0") -> 0;
qval_to_milli(~"1") -> 1000;
qval_to_milli(<<"1.", _/binary>>) -> 1000;
qval_to_milli(<<"0.", Frac/binary>>) when byte_size(Frac) =< 3 -> frac_to_milli(Frac);
qval_to_milli(_) -> 1000.
-spec frac_to_milli(binary()) -> qvalue().
frac_to_milli(Frac) ->
case all_digits(Frac) of
false -> 1000;
true -> digits3_to_int(pad3(Frac))
end.
-spec pad3(binary()) -> binary().
pad3(<<>>) -> ~"000";
pad3(<<A>>) -> <<A, $0, $0>>;
pad3(<<A, B>>) -> <<A, B, $0>>;
pad3(<<A, B, C>>) -> <<A, B, C>>.
-spec digits3_to_int(binary()) -> 0..999.
digits3_to_int(<<A, B, C>>) ->
(A - $0) * 100 + (B - $0) * 10 + (C - $0).
-spec all_digits(binary()) -> boolean().
all_digits(<<>>) -> true;
all_digits(<<C, Rest/binary>>) when C >= $0, C =< $9 -> all_digits(Rest);
all_digits(_) -> false.
%% Accept a range only if non-empty, within ?MAX_TAG_BYTES, made solely of
%% ALPHA / DIGIT / '-' / '*', and not a well-formed q=0. Lowercased on accept.
-spec finalize(binary(), qvalue()) -> {language_range(), qvalue()} | skip.
finalize(<<>>, _Q) ->
skip;
finalize(_Range, 0) ->
skip;
finalize(Range, Q) ->
case byte_size(Range) =< ?MAX_TAG_BYTES andalso valid_range_chars(Range) of
true -> {ascii_lower(Range), Q};
false -> skip
end.
-spec valid_range_chars(binary()) -> boolean().
valid_range_chars(<<>>) ->
true;
valid_range_chars(<<C, Rest/binary>>) ->
case is_range_char(C) of
true -> valid_range_chars(Rest);
false -> false
end.
-spec is_range_char(byte()) -> boolean().
is_range_char(C) ->
(C >= $a andalso C =< $z) orelse
(C >= $A andalso C =< $Z) orelse
(C >= $0 andalso C =< $9) orelse
C =:= $- orelse
C =:= $*.
%% Optional whitespace = SP / HTAB (RFC 9110 OWS).
-spec trim_ows(binary()) -> binary().
trim_ows(B) ->
trim_trailing_ows(trim_leading_ows(B)).
-spec trim_leading_ows(binary()) -> binary().
trim_leading_ows(<<C, Rest/binary>>) when C =:= $\s; C =:= $\t -> trim_leading_ows(Rest);
trim_leading_ows(B) -> B.
-spec trim_trailing_ows(binary()) -> binary().
trim_trailing_ows(<<>>) ->
<<>>;
trim_trailing_ows(B) ->
Size = byte_size(B),
case binary:at(B, Size - 1) of
C when C =:= $\s; C =:= $\t -> trim_trailing_ows(binary:part(B, 0, Size - 1));
_ -> B
end.
%% Stable sort: descending Q, then ascending header index for ties.
-spec stable_sort_desc_q([{qvalue(), pos_integer(), language_range()}]) ->
[{language_range(), qvalue()}].
stable_sort_desc_q(Acc) ->
    Sorted = lists:sort(
        fun({Q1, I1, _}, {Q2, I2, _}) ->
            case Q1 =:= Q2 of
                true -> I1 =< I2;
                false -> Q1 > Q2
            end
        end,
        Acc
    ),
    [{Range, Q} || {Q, _I, Range} <- Sorted].
%% ===================================================================
%% Negotiation (RFC 4647 Lookup against an available set)
%% ===================================================================
-doc """
Picks the best supported locale for a preference list, or `error`.
`Preferred` is an ordered preference list (priority = position): either
`[locale()]` or the `[{locale(), qvalue()}]` output of
`parse_accept_language/1` (the `Q` is ignored — order already encodes
priority, and `q=0` ranges were already dropped). `Available` is the list of
catalog locales (e.g. `erli18n:loaded_catalogs/0` locales).
Each `Preferred` entry is canonicalized and resolved through its
`fallback_chain/2` (no default) against a canonical→original index of
`Available`; the FIRST hit wins. `*` ranges are skipped. The returned locale
is the ORIGINAL `Available` casing. Total.
```erlang
1> erli18n_negotiate:negotiate([<<"pt-BR">>], [<<"pt">>, <<"en">>]).
{ok,<<"pt">>}
2> erli18n_negotiate:negotiate([<<"zh_Hant">>], [<<"en">>]).
error
```
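
Two details are easy to miss: the result carries the casing of the `Available` entry rather than of the request, and a wildcard range is skipped even when it sorts first. A sketch with illustrative inputs:

```erlang
1> erli18n_negotiate:negotiate([<<"PT-br">>], [<<"pt-BR">>, <<"en">>]).
{ok,<<"pt-BR">>}
2> Prefs = erli18n_negotiate:parse_accept_language(<<"*;q=0.9, de-CH;q=0.8">>).
[{<<"*">>,900},{<<"de-ch">>,800}]
3> erli18n_negotiate:negotiate(Prefs, [<<"de">>, <<"en">>]).
{ok,<<"de">>}
```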
See `negotiate/3` (default instead of `error`) and `best_match/3`.
""".
-spec negotiate([locale()] | [{locale(), qvalue()}], [locale()]) -> {ok, locale()} | error.
negotiate(Preferred, Available) ->
negotiate_with_index(Preferred, available_index(Available)).
-doc """
Like `negotiate/2`, but against a PREBUILT `available_index/1` instead of a raw
`Available` list — so the canonical index is built once and reused across many
preference lists (e.g. one per request source).
`negotiate(Preferred, Available)` is exactly
`negotiate_with_index(Preferred, available_index(Available))`; this variant lets a
caller hoist the `available_index/1` call out of a per-candidate loop. Semantics are
otherwise identical: each `Preferred` entry is canonicalized and resolved through
its `fallback_chain/2` against the index, first hit winning, returning the
original `Available` casing. Total.
```erlang
1> Ix = erli18n_negotiate:available_index([<<"pt">>, <<"en">>]).
2> erli18n_negotiate:negotiate_with_index([<<"pt-BR">>], Ix).
{ok,<<"pt">>}
3> erli18n_negotiate:negotiate_with_index([<<"zh_Hant">>], Ix).
error
```
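
The reuse pattern might look like the following sketch, where `Requests` is a hypothetical batch of preference lists resolved against one index:

```erlang
1> Ix = erli18n_negotiate:available_index([<<"pt">>, <<"en">>]), ok.
ok
2> Requests = [[<<"pt-BR">>], [<<"en-GB">>, <<"fr">>], [<<"ja">>]], ok.
ok
3> [erli18n_negotiate:negotiate_with_index(P, Ix) || P <- Requests].
[{ok,<<"pt">>},{ok,<<"en">>},error]
```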
See `negotiate/2` (raw-list form) and `available_index/1`.
""".
-spec negotiate_with_index([locale()] | [{locale(), qvalue()}], available_index()) ->
{ok, locale()} | error.
negotiate_with_index(Preferred, Index) when is_map(Index) ->
case match_preferred(to_locale_list(Preferred), Index) of
{ok, _Original} = Found -> Found;
nomatch -> error
end.
-doc """
Like `negotiate/2`, but returns `{ok, Default}` instead of `error` when
nothing matches. `Default` is the caller's chosen floor (the RFC 4647
*Lookup* default) and is NOT validated against `Available`. Total.
```erlang
1> erli18n_negotiate:negotiate([<<"zh_Hant">>], [<<"en">>], <<"en">>).
{ok,<<"en">>}
```
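
Because `Default` is returned as given, it is neither checked against `Available` nor canonicalized. A sketch with a deliberately odd default:

```erlang
1> erli18n_negotiate:negotiate([<<"ja">>], [<<"en">>], <<"EN-us">>).
{ok,<<"EN-us">>}
```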
""".
-spec negotiate([locale()] | [{locale(), qvalue()}], [locale()], locale()) -> {ok, locale()}.
negotiate(Preferred, Available, Default) ->
{ok, best_match(Preferred, Available, Default)}.
-doc """
The bare RFC 4647 *Lookup* primitive: like `negotiate/3` but returns the
matched (or `Default`) locale directly, never wrapped. Always succeeds
(falls to `Default`). Total.
```erlang
1> erli18n_negotiate:best_match([<<"en-US">>], [<<"en">>], <<"x">>).
<<"en">>
```
""".
-spec best_match([locale()] | [{locale(), qvalue()}], [locale()], locale()) -> locale().
best_match(Preferred, Available, Default) ->
case match_preferred(to_locale_list(Preferred), available_index(Available)) of
{ok, Original} -> Original;
nomatch -> Default
end.
%% Normalize a preference list to plain locale binaries: strip q tuples and
%% wildcard ranges, preserving order. Each entry is bounded — an entry over
%% `?MAX_TAG_BYTES` is skipped (it can never match a canonical catalog key, and
%% leaving it in would feed the truncation path an oversized tag). The
%% `?MAX_RANGES` budget here is a per-CONSUMED-cell budget: every inspected cell
%% (accepted, wildcard-skipped, or oversized-skipped) decrements it, so at most
%% 32 input cells are ever inspected. This is stricter than (and distinct from)
%% `parse_accept_language/1`'s pre-split `?MAX_RAW_ELEMS` cap and its
%% accepted-output `?MAX_RANGES` budget — so a hostile preference list cannot
%% drive unbounded negotiation work even when supplied raw (not via the parser),
%% including a list that is overwhelmingly wildcards or oversized tags.
-spec to_locale_list([locale()] | [{locale(), qvalue()}]) -> [locale()].
to_locale_list(List) ->
to_locale_list(List, ?MAX_RANGES).
-spec to_locale_list([locale()] | [{locale(), qvalue()}], non_neg_integer()) -> [locale()].
to_locale_list(_List, 0) ->
[];
to_locale_list([], _Budget) ->
[];
to_locale_list([X | Rest], Budget) ->
    case is_wildcard(X) of
        true ->
            to_locale_list(Rest, Budget - 1);
        false ->
            L = strip_q(X),
            case byte_size(L) =< ?MAX_TAG_BYTES of
                true -> [L | to_locale_list(Rest, Budget - 1)];
                false -> to_locale_list(Rest, Budget - 1)
            end
    end.
-spec strip_q(locale() | {locale(), qvalue()}) -> locale().
strip_q({L, _Q}) when is_binary(L) -> L;
strip_q(L) when is_binary(L) -> L.
-spec is_wildcard(locale() | {locale(), qvalue()}) -> boolean().
is_wildcard({~"*", _}) -> true;
is_wildcard(~"*") -> true;
is_wildcard(_) -> false.
-doc """
Builds the canonical→original index for an available-locale set, for reuse
across many `negotiate_with_index/2` calls.
Maps `canonicalize(A)` to the original `A` for each `A` in `Available`, first
occurrence winning (so the earliest entry's original catalog casing is the one
returned by a later match). This is the per-`Available` work `negotiate/2`
otherwise repeats on every call; build it once when negotiating multiple
preference lists against the same set. Total.
```erlang
1> Ix = erli18n_negotiate:available_index([<<"pt_BR">>, <<"fr">>]).
2> erli18n_negotiate:negotiate_with_index([<<"pt-BR">>], Ix).
{ok,<<"pt_BR">>}
```
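
First-occurrence-wins can be seen when two entries canonicalize to the same key. A sketch with a made-up `Available` list:

```erlang
1> Ix = erli18n_negotiate:available_index([<<"pt-BR">>, <<"pt_BR">>, <<"en">>]), ok.
ok
2> maps:get(<<"pt_BR">>, Ix).
<<"pt-BR">>
3> map_size(Ix).
2
```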
""".
-spec available_index([locale()]) -> available_index().
available_index(Available) ->
    lists:foldl(
        fun(A, Acc) ->
            K = canonicalize(A),
            case maps:is_key(K, Acc) of
                true -> Acc;
                false -> Acc#{K => A}
            end
        end,
        #{},
        Available
    ).
-spec match_preferred([locale()], available_index()) -> {ok, locale()} | nomatch.
match_preferred([], _Index) ->
nomatch;
match_preferred([P | Rest], Index) ->
case match_chain(fallback_chain(P, undefined), Index) of
{ok, _Original} = Found -> Found;
nomatch -> match_preferred(Rest, Index)
end.
-spec match_chain([locale()], available_index()) -> {ok, locale()} | nomatch.
match_chain([], _Index) ->
nomatch;
match_chain([C | Rest], Index) ->
case maps:find(C, Index) of
{ok, _Original} = Found -> Found;
error -> match_chain(Rest, Index)
end.