-module(erli18n_interp).
-moduledoc """
Pure, total, fail-soft substituter for named `%{name}` placeholders.

This is the Phase 1 interpolation engine that backs the `f`-suffix family
on `erli18n` (`gettextf`, `ngettextf`, `pgettextf`, `npgettextf` and their
`d`/`dc` aliases). It takes a resolved translation `msgstr` plus a map of
`Bindings` and produces the final binary with each `%{name}` replaced by
its bound value.
## The problem it solves
A translated string frequently needs runtime values spliced in
(`<<"Hello, %{name}!">>`). gettext itself has no interpolation; consumers
usually hand-roll `io_lib:format/2` with positional `~s`, which couples the
translation to argument ORDER and breaks the moment a translator reorders
words. Named placeholders (`%{name}`) decouple the wording from the call
site: the translator can move `%{name}` anywhere in the sentence and the
binding still resolves by name.
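
To make the coupling concrete, here is an illustrative shell session (the
sentences and the `host`/`guest` binding names are made up for this example).
Reordering the wording of a positional format string silently swaps the
roles, while the named form keeps them attached to the right values:

```erlang
1> lists:flatten(io_lib:format("~s invited ~s", ["Ana", "Bo"])).
"Ana invited Bo"
2> lists:flatten(io_lib:format("~s was invited by ~s", ["Ana", "Bo"])).
"Ana was invited by Bo"
3> erli18n_interp:format(<<"%{guest} was invited by %{host}">>,
3>                       #{host => <<"Ana">>, guest => <<"Bo">>}).
<<"Bo was invited by Ana">>
```

Line 2 now states the opposite of line 1 even though the call site did not
change; line 3 stays correct because each value is looked up by name.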
## Mental model — totality on the hot path
`format/2` runs on EVERY `gettextf`/`ngettextf` lookup, so it carries the
SAME totality bar as `erli18n_plural:evaluate/2`: it is TOTAL and
fail-soft — for ANY `msgstr` bytes and ANY `Bindings` map it NEVER raises
and ALWAYS returns a binary. There is exactly one opt-in path allowed to
raise: `format/3` with `#{on_missing => strict}`, used when a caller wants
a missing binding to be a hard error rather than a silently-retained
literal.

The substitution is a single left-to-right pass over the input:
- `"%%"` collapses to a literal `"%"` (both bytes consumed).
- `"%{<name>}"`, where `<name>` matches `[A-Za-z_][A-Za-z0-9_]*`, is
replaced by the bound value, or handled per the `on_missing` policy if
the name is unbound.
- To emit a literal `"%{name}"` un-substituted, author `"%%{name}"`: the
`"%%"` collapses to `"%"`, leaving the following `"{name}"` untouched.
- A lone `"%"` that begins neither `"%%"` nor a valid `"%{name}"`, and a
`"%{"` that never closes into a valid placeholder, are emitted
literally. Nothing crashes.
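
The rules above can be exercised in a shell session. The inputs are
illustrative; the outputs follow from the grammar as described in this
section:

```erlang
1> erli18n_interp:format(<<"%%{name}">>, #{name => <<"x">>}).
<<"%{name}">>
2> erli18n_interp:format(<<"5% of %{n}">>, #{n => 40}).
<<"5% of 40">>
3> erli18n_interp:format(<<"open %{name">>, #{name => <<"x">>}).
<<"open %{name">>
4> erli18n_interp:format(<<"%{9lives}">>, #{}).
<<"%{9lives}">>
```

Call 1 is the escape form, call 2 has a lone `%` followed by a real
placeholder, call 3 never closes its `%{`, and call 4 has a name that does
not start with a letter or underscore.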
## Binding values and atom safety
Binding keys are atoms (`#{name => <<"World">>}`). Values may be a
`binary`, a string or other chardata list, an `integer`, a `float`, or an
`atom`. Every value is coerced to UTF-8 text TOTALLY: an unknown or
malformed term renders via a bounded safe fallback rather than raising.
A placeholder name is resolved with `binary_to_existing_atom/2` wrapped in
`try`: a name that is not an already-existing atom is treated as a MISSING
binding and NEVER creates a new atom. This closes the atom-table-exhaustion
DoS that `binary_to_atom/2` would open on untrusted `msgstr`.
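
The probe can be tried directly in a shell. This is a sketch of the idiom
only; the name `never_seen_before_q7x2` is an arbitrary string assumed not to
exist as an atom in the running node:

```erlang
1> try binary_to_existing_atom(<<"never_seen_before_q7x2">>, utf8)
1> catch error:badarg -> missing end.
missing
2> binary_to_existing_atom(<<"ok">>, utf8).
ok
```

The first call fails with `badarg` instead of interning anything, which is
what lets an unknown placeholder name be reported as missing at no cost to
the atom table.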
## Anti-DoS
Consistent with the project's plural caps (see `erli18n_plural`), the work is
bounded fail-closed. Because `format/2` must stay total, the lenient path
CLAMPS rather than raises:
- `?MAX_OUTPUT_BYTES` (65536): the accumulated output is truncated once it
would exceed this size; the remaining input is dropped.
- `?MAX_EXPANSIONS` (1024): once this many placeholders have been
expanded, further `%{name}` references are emitted literally instead of
substituted.
- `?MAX_NAME_BYTES` (256): a `%{` whose name run exceeds this many bytes
before the closing `}` is treated as a malformed reference and emitted
literally (this also bounds the `binary_to_existing_atom/2` probe).
- `?MAX_VALUE_BYTES` (8192): a single coerced binding value longer than
this is truncated before it is spliced in, so one huge value cannot consume
the output budget on its own.
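
The clamping can be observed from a shell. The session below is a sketch
that uses a 100000-byte ASCII binary as an illustrative oversized input. It
shows the output cap and the per-value clamp `?MAX_VALUE_BYTES` (8192)
defined in this module:

```erlang
1> Big = binary:copy(<<"a">>, 100000), ok.
ok
2> byte_size(erli18n_interp:format(<<"%{v}">>, #{v => Big})).
8192
3> byte_size(erli18n_interp:format(Big, #{})).
65536
```

Call 2 is limited by the per-value clamp, call 3 by the overall output cap.
Neither call raises.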
## Bidi (RTL) hazard
v1 does NOT auto-insert Unicode bidi isolation marks (U+2066..U+2069)
around interpolated values. Splicing an RTL value (Arabic/Hebrew) into an
LTR sentence — or vice versa — can therefore reorder neighbouring
punctuation under the Unicode Bidirectional Algorithm. Callers that mix
directions should isolate values themselves.
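
One way a caller might do that is sketched below. The module name
`bidi_isolate_sketch` and the function `isolate/1` are hypothetical and not
part of erli18n. The helper wraps a binary in U+2068 (FIRST STRONG ISOLATE)
and U+2069 (POP DIRECTIONAL ISOLATE):

```erlang
-module(bidi_isolate_sketch).
-export([isolate/1]).

%% Hypothetical helper, not part of erli18n: wrap Value in FSI ... PDI so
%% its direction is resolved independently of the surrounding sentence.
-spec isolate(binary()) -> binary().
isolate(Value) when is_binary(Value) ->
    <<16#2068/utf8, Value/binary, 16#2069/utf8>>.
```

A call site would then pass `#{name => bidi_isolate_sketch:isolate(Name)}`
as the bindings. Each mark is 3 bytes in UTF-8, so the wrapper adds 6 bytes
per value, which count toward the caps described above.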
## Quickstart
```erlang
1> erli18n_interp:format(<<"Hello, %{name}!">>, #{name => <<"World">>}).
<<"Hello, World!">>
2> erli18n_interp:format(<<"%{a} then %{b}">>, #{a => 1, b => two}).
<<"1 then two">>
3> erli18n_interp:format(<<"100%% sure about %{x}">>, #{}).
<<"100% sure about %{x}">>
4> erli18n_interp:format(<<"need %{x}">>, #{}, #{on_missing => strict}).
** exception error: {erli18n_interp,{missing_binding,x}}
```
""".
-export([
format/2,
format/3
]).
-export_type([bindings/0, on_missing/0, opts/0]).
%% ===================================================================
%% Anti-DoS caps. Bound the work fail-closed, mirroring the plural caps
%% in `erli18n_plural`. On the lenient (total) path these CLAMP/truncate;
%% they never raise.
%% ===================================================================
%% Maximum number of UTF-8 bytes the result may reach before the pass
%% stops appending (truncation, fail-soft).
-define(MAX_OUTPUT_BYTES, 65536).
%% Maximum number of `%{name}` placeholders expanded in a single pass.
%% Beyond this, further references are emitted literally.
-define(MAX_EXPANSIONS, 1024).
%% Maximum byte length of a placeholder name run; also bounds the
%% `binary_to_existing_atom/2` probe against an atom-length DoS.
-define(MAX_NAME_BYTES, 256).
%% Maximum bytes rendered for a single coerced binding value (per-value
%% clamp so one huge value cannot blow the budget on its own).
-define(MAX_VALUE_BYTES, 8192).
-doc """
Map of placeholder bindings: atom keys to coercible values.
A key is the atom form of a `%{name}` placeholder. A value is coerced to
UTF-8 text totally and may be a `binary`, a string or other chardata list,
an `integer`, a `float`, or an `atom`. Any other term renders via a bounded
safe fallback instead of raising.
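
The coercions can be seen in a shell session. The values are illustrative;
the last call compares against the expected bytes rather than relying on how
the shell prints non-ASCII binaries:

```erlang
1> erli18n_interp:format(<<"%{v}">>, #{v => 42}).
<<"42">>
2> erli18n_interp:format(<<"%{v}">>, #{v => 1.5}).
<<"1.5">>
3> erli18n_interp:format(<<"%{v}">>, #{v => ok}).
<<"ok">>
4> erli18n_interp:format(<<"%{v}">>, #{v => "abc"}).
<<"abc">>
5> erli18n_interp:format(<<"%{v}">>, #{v => {a, 1}}).
<<"{a,1}">>
6> erli18n_interp:format(<<"%{v}">>, #{v => <<255>>}) =:= <<195,191>>.
true
```

Call 5 goes through the fallback rendering for non-text terms. Call 6 passes
a binary that is not valid UTF-8; its single byte is re-encoded as the
latin1 code point U+00FF.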
""".
-type bindings() :: #{atom() => term()}.
-doc """
Policy for a `%{name}` whose name resolves to no binding.
- `lenient` (default): the placeholder is emitted literally, unchanged
(`%{name}` stays `%{name}`), and the pass continues. `format/2` always
uses this policy.
- `strict`: the pass raises `error({erli18n_interp, {missing_binding,
Name}})`. This is the ONLY path in the module allowed to raise and is
opt-in via `format/3`.
""".
-type on_missing() :: lenient | strict.
-doc """
Options for `format/3`. Currently a single key, `on_missing`, defaulting
to `lenient` (which makes `format/3` equal to `format/2`).
""".
-type opts() :: #{on_missing => on_missing()}.
-doc """
Interpolate `%{name}` placeholders in `Msgstr` using `Bindings`, leniently.
TOTAL and fail-soft: for ANY `Msgstr` bytes and ANY `Bindings` map this
never raises and always returns a binary. A missing binding leaves its
`%{name}` literal in place. Equivalent to `format(Msgstr, Bindings,
#{on_missing => lenient})`.
See the module doc for the substitution grammar, value coercion, and the
anti-DoS caps.
## Examples
```erlang
1> erli18n_interp:format(<<"Hi %{who}">>, #{who => <<"Sam">>}).
<<"Hi Sam">>
2> erli18n_interp:format(<<"Hi %{who}">>, #{}).
<<"Hi %{who}">>
3> erli18n_interp:format(<<"50%% off">>, #{}).
<<"50% off">>
```
""".
-spec format(binary(), bindings()) -> binary().
format(Msgstr, Bindings) when is_binary(Msgstr), is_map(Bindings) ->
format(Msgstr, Bindings, #{}).
-doc """
Interpolate `%{name}` placeholders in `Msgstr` using `Bindings`, with
`Opts` controlling the missing-binding policy.
`Opts` supports `#{on_missing => lenient | strict}`. With `lenient` (the
default) this is TOTAL and equals `format/2`. With `strict`, a `%{name}`
whose name has no binding raises `error({erli18n_interp, {missing_binding,
Name}})` — the only raising path in this module.
`Name` in the error is the atom form of the placeholder when it already
exists as an atom, otherwise the raw name binary (a non-existing atom name
is never interned).
## Examples
```erlang
1> erli18n_interp:format(<<"Hi %{who}">>, #{who => <<"Sam">>},
1> #{on_missing => strict}).
<<"Hi Sam">>
2> erli18n_interp:format(<<"Hi %{who}">>, #{},
2> #{on_missing => strict}).
** exception error: {erli18n_interp,{missing_binding,who}}
```
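
A caller that wants strict checking but still needs a value to show might
wrap the call as sketched below. `strict_fallback_sketch` and
`format_or_default/3` are hypothetical names, not part of erli18n:

```erlang
-module(strict_fallback_sketch).
-export([format_or_default/3]).

%% Hypothetical wrapper, not part of erli18n: try the strict path first and
%% return a caller-supplied default when a binding is missing.
-spec format_or_default(binary(), map(), binary()) -> binary().
format_or_default(Msgstr, Bindings, Default) ->
    try
        erli18n_interp:format(Msgstr, Bindings, #{on_missing => strict})
    catch
        error:{erli18n_interp, {missing_binding, _Name}} ->
            Default
    end.
```

Matching on the full error term keeps unrelated errors visible instead of
swallowing them.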
""".
-spec format(binary(), bindings(), opts()) -> binary().
format(Msgstr, Bindings, Opts) when
is_binary(Msgstr), is_map(Bindings), is_map(Opts)
->
OnMissing = on_missing(Opts),
Acc = scan(Msgstr, Bindings, OnMissing, 0, 0, []),
iolist_to_binary(lists:reverse(Acc)).
%% ===================================================================
%% Internal — single left-to-right pass with running output-size tracking.
%%
%% `scan/6` walks the input, building a REVERSED list of binary chunks
%% (left-side accumulator, flushed via `iolist_to_binary/1` for linear
%% concatenation — same idiom as `erli18n_po`). `Count` tracks expansions
%% for the anti-DoS cap. `OutSize` tracks the running accumulated byte size
%% in O(1), so EVERY append operation (literal chunk, coerced bound value,
%% literal placeholder) updates OutSize and checks it against MAX_OUTPUT_BYTES
%% immediately, enabling fail-soft truncation with no O(k^2) traversals.
%% ===================================================================
-spec on_missing(opts()) -> on_missing().
on_missing(#{on_missing := strict}) -> strict;
on_missing(_) -> lenient.
-spec scan(binary(), bindings(), on_missing(), non_neg_integer(), non_neg_integer(), [binary()]) ->
[binary()].
%% `%%` -> literal `%`.
scan(<<$%, $%, Rest/binary>>, B, OM, Count, OutSize, Acc) ->
case append_percent(OutSize) of
{ok, NewOutSize} ->
scan(Rest, B, OM, Count, NewOutSize, [<<$%>> | Acc]);
stop ->
Acc
end;
%% `%{` -> attempt a placeholder.
scan(<<$%, ${, Rest/binary>>, B, OM, Count, OutSize, Acc) ->
handle_placeholder(Rest, B, OM, Count, OutSize, Acc);
%% Lone `%` at end of input -> literal.
scan(<<$%>>, _B, _OM, _Count, OutSize, Acc) ->
case append_percent(OutSize) of
{ok, _NewOutSize} ->
[<<$%>> | Acc];
stop ->
Acc
end;
%% Any other `%X` (X not `%` or `{`) -> emit `%` literally, continue at X.
scan(<<$%, Rest/binary>>, B, OM, Count, OutSize, Acc) ->
case append_percent(OutSize) of
{ok, NewOutSize} ->
scan(Rest, B, OM, Count, NewOutSize, [<<$%>> | Acc]);
stop ->
Acc
end;
%% Plain text run up to the next `%`: take it in one chunk (linear).
scan(Bin, B, OM, Count, OutSize, Acc) when is_binary(Bin), Bin =/= <<>> ->
{Chunk, Rest} = take_literal(Bin),
case append_and_check(Chunk, OutSize) of
{ok, NewOutSize} ->
scan(Rest, B, OM, Count, NewOutSize, [Chunk | Acc]);
{truncate, TruncBin} ->
%% TruncBin is the clamped final chunk. It can be empty when the cut backs
%% off to the start of a multibyte codepoint; appending it is still harmless.
[TruncBin | Acc];
stop ->
Acc
end;
scan(<<>>, _B, _OM, _Count, _OutSize, Acc) ->
Acc.
%% Split `Bin` into the leading run with no `%` and the remainder (which
%% starts with `%` or is empty). One pass via `binary:match/2`.
-spec take_literal(binary()) -> {binary(), binary()}.
take_literal(Bin) ->
case binary:match(Bin, <<$%>>) of
nomatch ->
{Bin, <<>>};
{Pos, _Len} ->
<<Chunk:Pos/binary, Rest/binary>> = Bin,
{Chunk, Rest}
end.
%% At this point we have consumed `%{`. Try to read a valid name run and a
%% closing `}`. On any malformation, emit `%{` literally and resume.
-spec handle_placeholder(
binary(), bindings(), on_missing(), non_neg_integer(), non_neg_integer(), [binary()]
) ->
[binary()].
handle_placeholder(Rest, B, OM, Count, OutSize, Acc) ->
case read_name(Rest, 0, []) of
{ok, NameBin, AfterRest} ->
resolve(NameBin, AfterRest, B, OM, Count, OutSize, Acc);
error ->
%% Malformed `%{...` (no valid name, unclosed, or name too
%% long): emit the `%{` literally and continue scanning at
%% `Rest` (the byte after `{`).
case append_and_check(<<$%, ${>>, OutSize) of
{ok, NewOutSize} ->
scan(Rest, B, OM, Count, NewOutSize, [<<$%, ${>> | Acc]);
{truncate, TruncBin} ->
[TruncBin | Acc];
stop ->
Acc
end
end.
%% Read `[A-Za-z_][A-Za-z0-9_]*` up to `}`. Returns the name binary and
%% the remainder AFTER the `}`. `error` on: empty name, illegal char,
%% unterminated, or name exceeding `?MAX_NAME_BYTES`.
-spec read_name(binary(), non_neg_integer(), [byte()]) ->
{ok, binary(), binary()} | error.
read_name(_Bin, Len, _RevChars) when Len > ?MAX_NAME_BYTES ->
error;
read_name(<<$}, Rest/binary>>, Len, RevChars) when Len > 0 ->
{ok, list_to_binary(lists:reverse(RevChars)), Rest};
read_name(<<C, Rest/binary>>, 0, []) when
(C >= $A andalso C =< $Z) orelse
(C >= $a andalso C =< $z) orelse
C =:= $_
->
%% First char of the name: letter or underscore only.
read_name(Rest, 1, [C]);
read_name(<<C, Rest/binary>>, Len, RevChars) when
Len > 0 andalso
((C >= $A andalso C =< $Z) orelse
(C >= $a andalso C =< $z) orelse
(C >= $0 andalso C =< $9) orelse
C =:= $_)
->
read_name(Rest, Len + 1, [C | RevChars]);
read_name(_Other, _Len, _RevChars) ->
%% Empty name, illegal char, or end of input before `}`.
error.
%% A syntactically valid `%{name}` was read. Resolve it against bindings.
-spec resolve(binary(), binary(), bindings(), on_missing(), non_neg_integer(), non_neg_integer(), [
binary()
]) ->
[binary()].
resolve(NameBin, AfterRest, B, OM, Count, OutSize, Acc) ->
case lookup_binding(NameBin, B) of
{ok, Value} ->
case Count >= ?MAX_EXPANSIONS of
true ->
%% Expansion cap reached: emit the placeholder
%% literally (fail-soft, no raise) and continue.
PlaceholderBin = literal_placeholder(NameBin),
case append_and_check(PlaceholderBin, OutSize) of
{ok, NewOutSize} ->
scan(AfterRest, B, OM, Count, NewOutSize, [PlaceholderBin | Acc]);
{truncate, TruncBin} ->
[TruncBin | Acc];
stop ->
Acc
end;
false ->
Text = coerce(Value),
case append_and_check(Text, OutSize) of
{ok, NewOutSize} ->
scan(AfterRest, B, OM, Count + 1, NewOutSize, [Text | Acc]);
{truncate, TruncBin} ->
[TruncBin | Acc];
stop ->
Acc
end
end;
missing ->
handle_missing(NameBin, AfterRest, B, OM, Count, OutSize, Acc)
end.
-spec handle_missing(
binary(), binary(), bindings(), on_missing(), non_neg_integer(), non_neg_integer(), [binary()]
) ->
[binary()].
handle_missing(NameBin, _AfterRest, _B, strict, _Count, _OutSize, _Acc) ->
erlang:error({erli18n_interp, {missing_binding, missing_name_term(NameBin)}});
handle_missing(NameBin, AfterRest, B, lenient, Count, OutSize, Acc) ->
%% Lenient: leave the placeholder literal and continue.
PlaceholderBin = literal_placeholder(NameBin),
case append_and_check(PlaceholderBin, OutSize) of
{ok, NewOutSize} ->
scan(AfterRest, B, lenient, Count, NewOutSize, [PlaceholderBin | Acc]);
{truncate, TruncBin} ->
[TruncBin | Acc];
stop ->
Acc
end.
%% Resolve `NameBin` to a binding. Uses `binary_to_existing_atom/2` inside
%% a `try` so an unknown name is treated as MISSING and never interns a
%% new atom (anti-atom-table-DoS).
-spec lookup_binding(binary(), bindings()) -> {ok, term()} | missing.
lookup_binding(NameBin, B) ->
try binary_to_existing_atom(NameBin, utf8) of
Atom ->
case B of
#{Atom := Value} -> {ok, Value};
_ -> missing
end
catch
error:badarg -> missing
end.
%% The error term for a strict miss: the atom if it already exists,
%% otherwise the raw binary (never interns a new atom).
-spec missing_name_term(binary()) -> atom() | binary().
missing_name_term(NameBin) ->
try binary_to_existing_atom(NameBin, utf8) of
Atom -> Atom
catch
error:badarg -> NameBin
end.
%% Reconstruct the literal `%{name}` for the lenient/cap paths.
-spec literal_placeholder(binary()) -> binary().
literal_placeholder(NameBin) ->
<<$%, ${, NameBin/binary, $}>>.
%% ===================================================================
%% Output-size tracking and clamping (O(1) per-append checks)
%% ===================================================================
%% Check if appending a binary would exceed the output cap. Returns:
%% - {ok, NewOutSize}: the chunk fits entirely, use NewOutSize for next call
%% - {truncate, TruncBin}: the chunk would exceed; TruncBin is its largest
%% prefix that fits on a UTF-8 codepoint boundary (possibly empty when the
%% first codepoint does not fit); append it and stop scanning
%% - stop: we're already at/over the cap, don't append anything, stop scanning
-spec append_and_check(binary(), non_neg_integer()) ->
{ok, non_neg_integer()} | {truncate, binary()} | stop.
append_and_check(Bin, OutSize) ->
ChunkSize = byte_size(Bin),
NewSize = OutSize + ChunkSize,
case NewSize > ?MAX_OUTPUT_BYTES of
false ->
%% Fits entirely.
{ok, NewSize};
true ->
%% Would exceed. Calculate how much room is left.
RoomLeft = ?MAX_OUTPUT_BYTES - OutSize,
case RoomLeft > 0 of
true ->
%% Truncate the chunk to fit, on a UTF-8 codepoint boundary so
%% the capped output never ends in a dangling partial codepoint
%% (invalid UTF-8); copy it so the small kept slice does not pin
%% the original. `byte_size(Bin) > RoomLeft` here, so this always
%% trims (never returns `Bin` whole).
TruncBin = binary:copy(truncate_utf8(Bin, RoomLeft)),
{truncate, TruncBin};
false ->
%% No room left, don't add anything.
stop
end
end.
%% Append a single literal `%` byte. A 1-byte append can never truncate: a
%% `{truncate, _}` from `append_and_check/2` needs `NewSize > MAX` (here
%% `OutSize >= MAX`) AND `RoomLeft > 0` (`OutSize < MAX`) simultaneously, which
%% is impossible. So this returns the narrower `{ok, _} | stop` (no `truncate`
%% arm to leave dead at the three single-`%` call sites), and is behaviorally
%% identical to `append_and_check(<<$%>>, OutSize)`.
-spec append_percent(non_neg_integer()) -> {ok, non_neg_integer()} | stop.
append_percent(OutSize) when OutSize < ?MAX_OUTPUT_BYTES -> {ok, OutSize + 1};
append_percent(_OutSize) -> stop.
%% ===================================================================
%% Value coercion — TOTAL. Never raises; unknown terms render via a
%% bounded safe fallback. Output clamped to `?MAX_VALUE_BYTES`.
%% ===================================================================
-spec coerce(term()) -> binary().
coerce(V) when is_binary(V) ->
clamp_value(ensure_utf8(V));
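coerce(V) when is_integer(V) ->
%% Clamp like every other value: a huge bignum can print to more than
%% `?MAX_VALUE_BYTES` digits.
clamp_value(integer_to_binary(V));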
coerce(V) when is_float(V) ->
clamp_value(safe_float(V));
coerce(V) when is_atom(V) ->
clamp_value(atom_to_binary(V, utf8));
coerce(V) when is_list(V) ->
clamp_value(safe_iolist(V));
coerce(V) ->
%% Unknown term (tuple, map, pid, ...): bounded safe fallback.
clamp_value(safe_inspect(V)).
%% A binding binary may not be valid UTF-8 (it is caller-supplied). Keep
%% valid UTF-8 verbatim; otherwise re-encode latin1 bytes so the result is
%% always valid UTF-8 and the function stays total.
-spec ensure_utf8(binary()) -> binary().
ensure_utf8(Bin) ->
case unicode:characters_to_binary(Bin, utf8, utf8) of
Out when is_binary(Out) ->
Out;
_ ->
%% Invalid UTF-8: re-encode treating the bytes as latin1. Every
%% byte (0..255) is a valid latin1 codepoint, so this is total for
%% ANY binary and always yields a binary (a non-binary return is
%% impossible, hence no fallback clause).
case unicode:characters_to_binary(Bin, latin1, utf8) of
Out2 when is_binary(Out2) -> Out2
end
end.
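%% Strings / chardata lists -> UTF-8 binary, totally. Invalid code points make
%% `unicode:characters_to_binary/3` return an error tuple, and a list that is
%% not chardata at all (e.g. `[foo]`) makes it raise `badarg`; both cases fall
%% back to `safe_inspect/1` so `coerce/1` stays total.
-spec safe_iolist(list()) -> binary().
safe_iolist(L) ->
try unicode:characters_to_binary(L, unicode, utf8) of
Out when is_binary(Out) -> Out;
_ -> safe_inspect(L)
catch
error:badarg -> safe_inspect(L)
end.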
%% `float_to_binary/2` is total for any `float()` (the only caller is
%% `coerce/1`'s `is_float` clause), so no error handling is needed.
-spec safe_float(float()) -> binary().
safe_float(F) ->
float_to_binary(F, [short]).
%% `io_lib` rendering for any non-text term. Total: `io_lib:format/2` with
%% `~tp` never raises for any term, and its output is always valid chardata
%% that `unicode:characters_to_binary/3` converts to a binary, so an
%% impossible non-binary return crashes explicitly (`case_clause`) rather
%% than being silently masked. The rendering itself is not size-limited;
%% the result is bounded afterwards by `clamp_value/1` in `coerce/1`.
-spec safe_inspect(term()) -> binary().
safe_inspect(Term) ->
Chars = io_lib:format("~tp", [Term]),
case unicode:characters_to_binary(Chars, unicode, utf8) of
B when is_binary(B) -> B
end.
-spec clamp_value(binary()) -> binary().
clamp_value(Bin) when byte_size(Bin) =< ?MAX_VALUE_BYTES ->
Bin;
clamp_value(Bin) ->
%% `binary:copy/1` so the clamped value does not pin the (much larger)
%% original binary. Truncation is codepoint-aware (see `truncate_utf8/2`) so
%% a multibyte value is never cut mid-codepoint into invalid UTF-8.
binary:copy(truncate_utf8(Bin, ?MAX_VALUE_BYTES)).
%% Largest prefix of `Bin` of at most `Max` bytes that does NOT end inside a
%% UTF-8 multibyte sequence. A raw `binary:part(Bin, 0, Max)` cut at a fixed
%% offset could split a 2-, 3- or 4-byte codepoint (neither `?MAX_VALUE_BYTES`
%% nor `?MAX_OUTPUT_BYTES` is codepoint-aligned) and leave a dangling lead or
%% continuation byte, which is invalid UTF-8. When the cut lands inside a
%% codepoint (the first dropped byte is a continuation byte) the trailing
%% partial codepoint is removed, so input that was valid UTF-8 stays valid.
%% Already-invalid bytes are kept as they are, except that a run of stray
%% continuation bytes ending at the cut is dropped with it. The result may be
%% shorter than `Max`, and empty when the first codepoint does not fit. Both
%% the per-value clamp and the output cap rely on this.
-spec truncate_utf8(binary(), non_neg_integer()) -> binary().
%% PRECONDITION: `Max < byte_size(Bin)`. Both callers only truncate when the
%% binary exceeds the cap — `clamp_value/1`'s small-value clause and
%% `append_and_check/2`'s size check own the within-cap case — so there is no
%% dead "already within Max" clause here; this function always trims.
truncate_utf8(Bin, Max) ->
%% `Max < byte_size(Bin)`, so `binary:at(Bin, Max)` is the first DROPPED byte.
case is_utf8_continuation(binary:at(Bin, Max)) of
false ->
%% The cut already falls on a codepoint boundary.
binary:part(Bin, 0, Max);
true ->
%% The cut split a codepoint: back off to that codepoint's lead byte.
binary:part(Bin, 0, codepoint_start(Bin, Max))
end.
%% Index of the lead byte of the codepoint the byte at `Pos` belongs to, found by
%% walking back over UTF-8 continuation bytes (10xxxxxx). For well-formed UTF-8
%% the walk is at most 3 steps (a sequence is at most 4 bytes); for malformed
%% input it stops at the first non-continuation byte or the start of the binary.
-spec codepoint_start(binary(), non_neg_integer()) -> non_neg_integer().
codepoint_start(_Bin, 0) ->
0;
codepoint_start(Bin, Pos) ->
case is_utf8_continuation(binary:at(Bin, Pos - 1)) of
true -> codepoint_start(Bin, Pos - 1);
false -> Pos - 1
end.
%% A UTF-8 continuation byte is `2#10xxxxxx` (0x80..0xBF).
-spec is_utf8_continuation(byte()) -> boolean().
is_utf8_continuation(B) ->
B >= 16#80 andalso B =< 16#BF.