-module(erli18n_po).
-moduledoc """
Parser and serializer for the GNU gettext PO/POT format.
Reads a `.po`/`.pot` catalog (text) and returns a structured `parsed_catalog()`;
`dump/1` is the inverse path. All the logic is hand-rolled recursive descent,
dependency-free, honoring the nine PO-semantics decisions (PSD-001..009).
## What it does and what problem it solves
Turns the raw bytes of a `.po` into data the rest of the library consumes
(`erli18n_server` calls this module at the start of the load pipeline). The nine
decisions in one sentence each:
- PSD-001: `#, fuzzy` entries are dropped by default (parity with `msgfmt`).
- PSD-002: the `Content-Type` charset is normalized to `utf8 | latin1 | us_ascii`.
- PSD-003: an empty translation (`<<>>`) is preserved; the fallback is the
responsibility of whoever does the lookup, not the parser.
- PSD-004: `Plural-Forms` is preserved raw; only `nplurals` is extracted here.
- PSD-005: a UTF-8 BOM is stripped silently before any processing.
- PSD-006: `msgctxt` is a separate field, never byte-glued to the `msgid`.
- PSD-007: obsolete entries (`#~`) are dropped.
- PSD-008: a degenerate plural (`nplurals=1`) is accepted; `validate_plural_indices/3`
treats `[0]` as the valid index set for `nplurals=1`, matching the single-form
plural rules of languages such as ja/zh/ko/vi/th.
- PSD-009: the `msgstr[N]` index set is validated against `nplurals`.
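As a rough illustration of PSD-008 and PSD-009, the index check can be thought of as comparing the `msgstr[N]` indices against `[0..nplurals-1]`. The fun `IndicesOk` below is hypothetical and written only for this sketch; it is not the module's `validate_plural_indices/3`, and sorting the indices first is an assumption of the sketch:

```erlang
%% Hypothetical sketch, not the module's code: the indices are valid when,
%% once sorted, they are exactly 0..NPlurals-1.
IndicesOk = fun(Indices, NPlurals) ->
    lists:sort(Indices) =:= lists:seq(0, NPlurals - 1)
end,
true = IndicesOk([0], 1),        %% degenerate plural (PSD-008)
true = IndicesOk([0, 1, 2], 3),
false = IndicesOk([0, 2], 3).    %% gap at index 1 (PSD-009)
```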
## Mental model
This module is PURE and STATELESS: no ETS, no process dictionary,
no application environment (`application:get_env/2`). Each `parse/2` call sees only the binary
and options you passed; `parse_file/2` just prepends a `file:read_file/1`. Errors
become data (`{error, parse_error()}`), not dead processes.
The input is UNTRUSTED (the multi-tenant threat model in `SECURITY.md`): a
tenant may upload an adversarial `.po`. Hence the contract is "parsing
errors become structured errors, never silent crashes nor unbounded
memory growth". Two concrete defenses live here:
- A cap by digit COUNT before any `binary_to_integer` over attacker input
(`?MAX_INT_DIGITS`), at the two sites that read integers from the `.po`:
the `nplurals=` of the header (`collect_digits/2`) and the `msgstr[N]` index
(`parse_msgstr_index/2`). Without it, a run of thousands of digits would
build an O(d^2) bignum or hit `system_limit`.
- `bins_to_binary/1` materializes large strings in LINEAR time (one
`lists:reverse/1` over the segment list + `iolist_to_binary/1`); the naive fold with the
accumulator on the right was Θ(n²) and stalled the loader for seconds on a single large `msgid`.
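The digit-count cap can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's `collect_digits/2` or `parse_msgstr_index/2`: the fun `TakeDigits`, its error atoms and its return shape are hypothetical, and `MaxDigits` simply mirrors the value of `?MAX_INT_DIGITS`. The point is that `binary_to_integer/1` only ever sees a run that has already been bounded:

```erlang
%% Hypothetical sketch: count digits first, convert only a bounded run.
MaxDigits = 7,
TakeDigits =
    fun F(<<C, Rest/binary>>, N, Acc) when C >= $0, C =< $9 ->
            case N < MaxDigits of
                true -> F(Rest, N + 1, <<Acc/binary, C>>);
                %% One digit too many: stop here, never build the integer.
                false -> {error, too_many_digits}
            end;
        F(_NoDigit, 0, _Acc) ->
            {error, no_digits};
        F(Rest, _N, Acc) ->
            %% Acc holds at most MaxDigits bytes at this point.
            {ok, binary_to_integer(Acc), Rest}
    end,
{ok, 3, <<"; plural=0">>} = TakeDigits(<<"3; plural=0">>, 0, <<>>),
{error, too_many_digits} = TakeDigits(binary:copy(<<"9">>, 5000), 0, <<>>).
```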
The `parse/2` pipeline is TWO-PASS (a header prepass, then the full parse), because the body charset is only
known after reading the header. It runs in three steps:
1. A prepass (`extract_header_charset/1`) reads the raw bytes (the header is
always ASCII-safe per the GNU spec) and discovers the charset.
2. `normalize_input/2` transcodes the entire body from that charset to UTF-8.
3. The line-by-line parse runs over UTF-8, with the charset still threaded so
`\xHH`/`\OOO` escapes are interpreted in the declared code space BEFORE the
UTF-8 gate (two-phase decode, `decode_quoted_string/2` +
`reassemble_field/2`). Prepass and builder use the SAME field reconciler
(`field_charset/1`), so they never diverge (that divergence was a badmatch
that took down the gen_server on a `Content-Type ` with a space before the `:`).
LF, CRLF and lone-CR (classic Mac) line endings are all accepted.
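To see why the charset has to be known before the body is decoded, consider the same bytes read under two charsets. The snippet below is only an illustration using `unicode:characters_to_binary/3` from the standard library; the variable name `Latin1` is chosen for this sketch:

```erlang
%% "café" encoded in ISO-8859-1: the last byte is 0xE9.
Latin1 = <<"caf", 16#E9>>,
%% Read as UTF-8, the trailing 0xE9 is not a complete sequence, so the
%% call returns a tuple instead of a binary.
false = is_binary(unicode:characters_to_binary(Latin1, utf8, utf8)),
%% Transcoded from latin1, the same byte becomes the two-byte UTF-8
%% encoding of U+00E9.
<<"caf", 16#C3, 16#A9>> = unicode:characters_to_binary(Latin1, latin1, utf8).
```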
## When you touch this module
- Loading a catalog: `erli18n_server` reads the file on its own
(`file:read_file/1`) and calls `parse/2` underneath — `parse_file/1,2` is a
convenience/test helper, NOT the production path. You rarely call it directly.
- Validating/inspecting a `.po` in a tool or test: `parse/1` or `parse/2`.
- Roundtrip / programmatic rewrite: `parse/1` -> edit -> `dump/1`.
## Quickstart
```erlang
1> Po = <<"msgid \"\"\n"
.. "msgstr \"Content-Type: text/plain; charset=UTF-8\\n\"\n"
.. "\n"
.. "msgid \"Hello\"\n"
.. "msgstr \"Ola\"\n">>.
2> {ok, Catalog} = erli18n_po:parse(Po).
3> maps:get(entries, Catalog).
[{singular,undefined,<<"Hello">>,<<"Ola">>}]
4> maps:get(charset, maps:get(header, Catalog)).
utf8
5> erli18n_po:parse(erli18n_po:dump(Catalog)) =:= {ok, Catalog}.
true
```
## Key functions
Input: `parse/1`, `parse/2`, `parse_file/1`, `parse_file/2`. Output: `dump/1`,
and `escape_string/1` (the PO-value escaper `dump/1` uses, exported so the
`rebar3_erli18n` plugin can serialize the metadata it owns byte-identically).
Result type: `parsed_catalog/0`; an entry is an `entry/0`; errors are a
`parse_error/0`.
""".
%% Public API.
-export([
parse/1,
parse/2,
parse_file/1,
parse_file/2,
dump/1,
escape_string/1
]).
-export_type([
parse_opts/0,
parsed_catalog/0,
header_map/0,
entry/0,
context/0,
msgid/0,
msgid_plural/0,
translation/0,
plural_index/0,
parse_error/0
]).
%% =========================
%% Types
%% =========================
-doc """
Parse options. Today there is only one key: `include_fuzzy`.
With `include_fuzzy => false` (default), entries marked `#, fuzzy` are
dropped on flush (parity with `msgfmt`). With `true`, they are kept. An
empty map `#{}` inherits all the defaults.
""".
-type parse_opts() :: #{include_fuzzy => boolean()}.
-doc """
Result of a successful parse.
`header` is the `header_map()` (always present: synthesized empty if the `.po`
had no header of its own). `entries` is in FILE ORDER. The map is exactly the
shape that `dump/1` consumes.
The roundtrip law `parse(dump(C)) =:= {ok, C}` holds for catalogs whose header
was parsed from a `.po` WITH a header of its own (`raw =/= <<>>`). When the
catalog came from an input WITHOUT a header (a synthetic header with
`raw => <<>>` and `content_type => <<>>`), `dump/1` materializes a minimal
`Content-Type`; on re-parse, that field becomes populated and the catalog
differs from the original at that point. See `dump/1` for the detail.
""".
-type parsed_catalog() :: #{
header := header_map(),
entries := [entry()]
}.
%% Per PSD-002: charset normalized to one of utf8 | latin1 | us_ascii.
%% Per PSD-004: plural_forms preserved raw for downstream evaluator.
-doc """
Catalog header, already reconciled.
`charset` is the normalized atom (PSD-002). `plural_forms` is the RAW string of
the `Plural-Forms` field (PSD-004): this module does NOT evaluate it — only
`erli18n_plural` does; here it is preserved for downstream. `content_type` is
the raw value of the field of the same name. `raw` is the entire `msgstr` text
of the header, used by `dump/1` to re-emit the header faithfully. A catalog
without a header of its own gets a synthetic header with `charset => utf8` and
the other fields empty.
""".
-type header_map() :: #{
plural_forms => binary(),
content_type => binary(),
charset => utf8 | latin1 | us_ascii,
raw => binary()
}.
%% Per PSD-006: context is a separate field, never byte-glued with msgid.
%%
%% Finding #14 (dump-drops-msgid-plural-silently): the plural shape retains
%% the `msgid_plural` form text so `dump/1` can re-emit it faithfully. A
%% catalog with no explicit `msgid_plural` (only a singular `msgid` plus
%% `msgstr[N]` lines — unusual but accepted) carries `undefined`, and the
%% dumper falls back to the singular `msgid` for that one slot.
-doc """
A catalog entry, in one of two shapes.
`{singular, Context, Msgid, Translation}` — a 1:1 translation. `Context` is
`undefined` (no `msgctxt`) or a binary. `Translation` may be `<<>>` (PSD-003:
the empty value is preserved, it does not become a fallback here).
`{plural, Context, Msgid, MsgidPlural, Forms}` — a translation with plurals.
`MsgidPlural` is the plural form from the source or `undefined` (degenerate
case: only `msgstr[N]` without an explicit `msgid_plural`). `Forms` is a list
`[{plural_index(), translation()}]` ORDERED by index, validated against
`nplurals` (PSD-009).
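A consumer typically pattern-matches on the two shapes. The sketch below is illustrative only: the fun `Describe` and the `untranslated` tag are hypothetical, and the entries are hand-written literals rather than the output of a real parse. It also shows that handling an empty translation (PSD-003) is left to the caller:

```erlang
Entries = [
    {singular, undefined, <<"Hello">>, <<"Ola">>},
    {singular, <<"menu">>, <<"Open">>, <<>>},
    {plural, undefined, <<"file">>, <<"files">>,
     [{0, <<"arquivo">>}, {1, <<"arquivos">>}]}
],
%% Hypothetical consumer: decide here, not in the parser, what an empty
%% translation means.
Describe = fun
    ({singular, _Ctx, Id, <<>>}) -> {untranslated, Id};
    ({singular, _Ctx, Id, T}) -> {Id, T};
    ({plural, _Ctx, Id, _IdPlural, Forms}) -> {Id, [T || {_N, T} <- Forms]}
end,
[{<<"Hello">>, <<"Ola">>},
 {untranslated, <<"Open">>},
 {<<"file">>, [<<"arquivo">>, <<"arquivos">>]}] = lists:map(Describe, Entries).
```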
""".
-type entry() ::
{singular, context(), msgid(), translation()}
| {plural, context(), msgid(), msgid_plural(), [{plural_index(), translation()}]}.
-type context() :: undefined | binary().
-type msgid() :: binary().
-type msgid_plural() :: undefined | binary().
-type translation() :: binary().
-type plural_index() :: non_neg_integer().
%% `file:read_file/1` returns `{error, Reason}` where Reason ranges over
%% `file:posix() | badarg | terminated | system_limit` (see file.erl
%% spec). We surface all of them under `file_error`.
-type file_read_error() ::
file:posix() | badarg | terminated | system_limit.
-doc """
Structured parse error — the only "normal" failure mode of the public API.
- `{unsupported_charset, Declared}` — the `Content-Type` declared a charset that
does not map to `utf8 | latin1 | us_ascii`.
- `{charset_conversion, Label, Detail}` — the bytes do not match the declared
charset (e.g. invalid UTF-8, a byte outside US-ASCII).
- `{plural_count_mismatch, Msgid, Expected, Got}` — the `msgstr[N]` indices do
not form exactly `[0..Expected-1]` (PSD-009).
- `{syntax_error, Line, Reason}` — malformed line; `Reason` is `term()` and
also carries the escape-decode errors (e.g. `escape_invalid_utf8`,
`octal_escape_out_of_range`) without widening the exported tuple.
- `{file_error, Posix}` — only `parse_file/1,2`: the disk read failed.
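Because failures are plain data, a caller can turn them into messages with an ordinary multi-clause fun. The fun `Explain` below is a hypothetical helper written for this sketch (the message wording is invented); it relies only on `io_lib:format/2` and `file:format_error/1` from the standard library:

```erlang
%% Hypothetical helper: one clause per parse_error() shape.
Explain = fun
    ({syntax_error, Line, Reason}) ->
        io_lib:format("line ~b: ~p", [Line, Reason]);
    ({unsupported_charset, Declared}) ->
        io_lib:format("unsupported charset ~ts", [Declared]);
    ({charset_conversion, Label, _Detail}) ->
        io_lib:format("input is not valid ~ts", [Label]);
    ({plural_count_mismatch, Msgid, Expected, Got}) ->
        io_lib:format("~ts: expected ~b plural forms, got indices ~w",
                      [Msgid, Expected, Got]);
    ({file_error, Posix}) ->
        file:format_error(Posix)
end,
Msg = lists:flatten(Explain({syntax_error, 3, {unrecognized_line, <<"???">>}})),
"line 3: " ++ _ = Msg.
```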
""".
-type parse_error() ::
{unsupported_charset, binary()}
| {charset_conversion, binary(), term()}
| {plural_count_mismatch, msgid(), Expected :: non_neg_integer(), Got :: [non_neg_integer()]}
%% The `Reason` of a `{syntax_error, Line, Reason}` is `term()`, so the
%% escape-decode failures introduced for finding #11
%% (po-hex-octal-escape-emits-invalid-utf8) — `escape_error()` below —
%% travel inside this envelope without widening the exported tuple
%% shape.
| {syntax_error, Line :: pos_integer(), Reason :: term()}
| {file_error, file_read_error()}.
%% Normalized charset (PSD-002), reused as the code space in which `\xHH`
%% / `\OOO` escape bytes are interpreted before being transcoded to UTF-8
%% (finding #11). Mirrors the `charset` key of `header_map/0`.
-type charset() :: utf8 | latin1 | us_ascii.
%% A chunk produced while decoding one quoted string, BEFORE the
%% charset->UTF-8 transcode. `{utf8, Bin}` is already valid UTF-8 (literal
%% text that survived the phase-1 gate of `normalize_input/2`, plus the
%% always-ASCII C escapes like `\n`/`\t`). `{raw, B}` is ONE byte in the
%% declared charset's code space, produced by a `\xHH` / `\OOO` escape —
%% exactly how the GNU gettext lexer stacks raw escape bytes before the
%% whole-string charset conversion.
-type chunk() :: {utf8, binary()} | {raw, byte()}.
%% Structured escape-decode errors (finding #11). Emitted as the `Reason`
%% of a `{syntax_error, Line, Reason}`; restores the UTF-8 gate as a true
%% guarantee (no `{ok, _}` carrying invalid UTF-8) and gives parity with
%% msgfmt's "invalid multibyte sequence" rejection.
%% `Rest` is whatever `unicode:characters_to_binary/3` hands back as the
%% undecodable tail — documented as `unicode:chardata()` (it may be a deep
%% iolist, not just a flat binary), so we carry that type verbatim rather
%% than narrowing to `binary()`.
-type escape_error() ::
{invalid_escape_charset, charset(), Byte :: byte()}
| {escape_invalid_utf8, Rest :: unicode:chardata()}
| {escape_incomplete_utf8, Rest :: unicode:chardata()}
| {octal_escape_out_of_range, pos_integer()}.
%% =========================
%% Internal parser state
%% =========================
%% Accumulator for a single entry being built line-by-line.
%%
%% Finding #17 (po-append-to-last-superlinear): each string field is built
%% as a REVERSED list of segments (`[binary()]`, newest first) while the
%% entry's lines stream in, never as a growing binary. A continuation line
%% prepends ONE segment in O(1) (`append_to_last/2`); the whole field is
%% joined into a binary exactly once, at finalization (`finalize_buffers/1`
%% -> `iolist_to_binary/1`), so building an n-byte field is genuinely
%% O(n) total. The old shape did `<<Prev/binary, Bin/binary>>` per
%% continuation: because `Prev` lived inside the record (more than one
%% reference), the runtime's in-place binary-append optimization did not
%% apply and each append re-copied the accumulator -> Θ(n²) on a single
%% many-continuation msgid. `undefined` still means "field never seen", so
%% the downstream pattern matches (`msgid = undefined`, `msgid = <<>>`) are
%% unchanged — they run AFTER `finalize_buffers/1` has flattened the
%% buffers back to binaries.
-record(po_st, {
context :: undefined | [binary()] | binary(),
msgid :: undefined | [binary()] | binary(),
msgid_plural :: undefined | [binary()] | binary(),
msgstr :: undefined | [binary()] | binary(),
msgstr_plurals = [] :: [{plural_index(), [binary()] | binary()}],
last_field ::
undefined
| msgctxt
| msgid
| msgid_plural
| msgstr
| {msgstr, plural_index()},
fuzzy = false :: boolean(),
obsolete = false :: boolean(),
start_line = 1 :: pos_integer()
}).
%% Global parser context. Carries already-finalized state.
-record(pst, {
include_fuzzy = false :: boolean(),
%% reversed during accumulation
entries = [] :: [entry()],
header :: undefined | header_map(),
nplurals :: undefined | non_neg_integer(),
%% Declared catalog charset (finding #11). Defaults to utf8 so any
%% legacy internal call building a `#pst{}` without it keeps the prior
%% already-UTF-8 behaviour. Threaded into every `decode_quoted_string`
%% call site so `\xHH`/`\OOO` escape bytes are transcoded through the
%% right code space.
charset = utf8 :: charset()
}).
%% Maximum number of decimal digits accepted for an attacker-controlled
%% integer run before `binary_to_integer` is called (finding #8,
%% po-plural-unbounded-binary-to-integer-bignum). Two sites read such
%% runs out of untrusted `.po` input: the `nplurals=<digits>` header
%% cross-check (`collect_digits/2`) and the `msgstr[<digits>]` index
%% (`parse_msgstr_index/2`). Both cap the run by DIGIT COUNT first, so a
%% thousands-digit adversarial run is rejected in O(1) without ever
%% building an O(d^2) bignum or reaching the >=~1.3M-digit
%% `error:system_limit` path. 7 digits (max 9_999_999) is far above any
%% legitimate plural-form count (real locales top out at 6) or msgstr
%% index.
-define(MAX_INT_DIGITS, 7).
%% =========================
%% Public API
%% =========================
-doc """
Parses a PO catalog from a binary, with default options
(`include_fuzzy => false`).
Equivalent to `parse(Bin, #{})`. Returns `{ok, parsed_catalog()}` with the
normalized header and the list of entries (in file order), or
`{error, parse_error()}` if the charset is invalid, the conversion fails, there
is a syntax error, or the plural indices diverge from `nplurals`.
```erlang
1> erli18n_po:parse(<<"msgid \"Hello\"\nmsgstr \"Ola\"\n">>).
{ok,#{header => #{charset => utf8,content_type => <<>>,
plural_forms => <<>>,raw => <<>>},
entries => [{singular,undefined,<<"Hello">>,<<"Ola">>}]}}
```
See `parse/2` for the full semantics of options and the pipeline, and `dump/1`
for the inverse path.
""".
-spec parse(binary()) -> {ok, parsed_catalog()} | {error, parse_error()}.
parse(Bin) ->
parse(Bin, #{}).
-doc """
Parses a PO catalog from a binary, honoring `Opts`.
`Bin` is the raw content of the `.po`; `Opts` is a `parse_opts()` — today only
`include_fuzzy => boolean()` (default `false`: entries marked `#, fuzzy` are
dropped, parity with `msgfmt`). The flow: (1) silent strip of the UTF-8 BOM
(PSD-005); (2) a prepass that extracts the charset from the `Content-Type`
header via the same field reconciler as `build_header/1`, ensuring that prepass
and builder never diverge (finding #5 — closes the `badmatch` on a
`Content-Type ` with a space before the `:`); (3) transcoding of the entire body from
the discovered charset to UTF-8; (4) line-by-line parse with the charset
threaded so `\xHH`/`\OOO` escapes are transcoded through the right code space.
Returns `{ok, parsed_catalog()}` (`#{header => header_map(), entries =>
[entry()]}`) or `{error, parse_error()}`. Without an explicit header, it
synthesizes an empty header with charset `utf8`. Accepts LF, CRLF and lone-CR
line endings (finding #15).
Parameters:
- `Bin` — raw content of the `.po`/`.pot`. Treated as UNTRUSTED: an
`nplurals=` or `msgstr[N]` with an absurd run of digits is rejected in O(1)
(cap by `?MAX_INT_DIGITS`), never builds a bignum.
- `Opts` — see `parse_opts/0`. `include_fuzzy` controls whether `#, fuzzy`
entries enter the result.
Failure modes (all `{error, parse_error()}`, never a crash): an unsupported
declared charset, bytes that do not match the charset, plural indices that
diverge from `nplurals`, and malformed lines (with line number).
```erlang
1> Fuzzy = <<"#, fuzzy\nmsgid \"a\"\nmsgstr \"b\"\n">>.
2> {ok, C0} = erli18n_po:parse(Fuzzy, #{}).
3> maps:get(entries, C0).
[]
4> {ok, C1} = erli18n_po:parse(Fuzzy, #{include_fuzzy => true}).
5> maps:get(entries, C1).
[{singular,undefined,<<"a">>,<<"b">>}]
6> erli18n_po:parse(<<"msgid \"a\"\nmsgstr \"b\"\n???\n">>).
{error,{syntax_error,3,{unrecognized_line,<<"???">>}}}
```
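The line-ending handling (LF, CRLF and lone CR) follows the approach of the internal `split_lines/1`: fold CRLF to LF first, then any remaining lone CR, then split on LF. The fun `SplitLines` below is a standalone restatement for illustration, written with plain binaries instead of sigils:

```erlang
SplitLines = fun(Bin) ->
    %% CRLF first, so a CRLF never becomes two line breaks.
    Norm0 = binary:replace(Bin, <<"\r\n">>, <<"\n">>, [global]),
    Norm = binary:replace(Norm0, <<"\r">>, <<"\n">>, [global]),
    binary:split(Norm, <<"\n">>, [global])
end,
[<<"a">>, <<"b">>, <<"c">>, <<"d">>] = SplitLines(<<"a\nb\r\nc\rd">>).
```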
See `parse/1` (defaults), `parse_file/2` (from disk) and `dump/1`.
""".
-spec parse(binary(), parse_opts()) ->
{ok, parsed_catalog()} | {error, parse_error()}.
parse(Bin, Opts) when is_binary(Bin), is_map(Opts) ->
%% Per PSD-005: strip UTF-8 BOM silently before any other processing.
Stripped = strip_bom(Bin),
%% Per PSD-002: header determines charset, so first pass extracts header
%% bytes (raw, treating as latin1-compatible 7-bit ASCII — header is
%% always ASCII-safe per GNU spec). The second pass uses the discovered
%% charset to convert the entire body.
case extract_header_charset(Stripped) of
{ok, Charset} ->
case normalize_input(Stripped, Charset) of
{ok, Utf8Bin} ->
%% Finding #11: thread the discovered charset into the
%% body parse so escape bytes can be transcoded through
%% it instead of being spliced raw.
do_parse(Utf8Bin, Charset, Opts);
{error, _} = Err ->
Err
end;
{error, _} = Err ->
Err
end.
-doc """
Reads and parses a `.po` file from disk, with default options.
Equivalent to `parse_file(Path, #{})`. Reads `Path` with `file:read_file/1` and
delegates to `parse/2`. Read errors become `{error, {file_error, file_read_error()}}`.
```erlang
1> erli18n_po:parse_file(<<"priv/locale/fr/LC_MESSAGES/my_domain.po">>).
{ok,#{header => #{charset => utf8, ...}, entries => [...]}}
2> erli18n_po:parse_file(<<"/does/not/exist.po">>).
{error,{file_error,enoent}}
```
See `parse_file/2` (with options) and `parse/2` (the parse semantics themselves).
""".
-spec parse_file(file:filename()) ->
{ok, parsed_catalog()} | {error, parse_error()}.
parse_file(Path) ->
parse_file(Path, #{}).
-doc """
Reads and parses a `.po` file from disk, honoring `Opts`.
Reads `Path` with `file:read_file/1`; on success it delegates the binary to
`parse/2` with `Opts` (see `parse/2` for the semantics of the options and the
return). If the read fails, it returns `{error, {file_error, Posix}}`, where
`Posix` ranges over `file:posix() | badarg | terminated | system_limit`.
Parameters:
- `Path` — file path, any `file:filename()`.
- `Opts` — passed untouched to `parse/2`; see `parse_opts/0`.
The only difference from `parse/2` is the read phase: I/O errors become
`{error, {file_error, Posix}}`; once the bytes are read, everything follows exactly the
rules of `parse/2`.
```erlang
1> erli18n_po:parse_file(<<"catalog.po">>, #{include_fuzzy => true}).
{ok,#{header => #{...}, entries => [...]}}
```
See `parse_file/1` (defaults) and `parse/2`.
""".
-spec parse_file(file:filename(), parse_opts()) ->
{ok, parsed_catalog()} | {error, parse_error()}.
parse_file(Path, Opts) ->
case file:read_file(Path) of
{ok, Bin} -> parse(Bin, Opts);
{error, Posix} -> {error, {file_error, Posix}}
end.
-doc """
Serializes a `parsed_catalog()` back to PO text (a UTF-8 binary).
Emits the header block first (`msgid ""` / `msgstr ""` plus the header `raw`,
or a minimal header `Content-Type: text/plain; charset=UTF-8` when the `raw` is
empty or absent) and then each entry. `singular` entries produce
`msgctxt`/`msgid`/`msgstr`; `plural` entries re-emit the retained
`msgid_plural` (finding #14 — when it is `undefined`, the singular `msgid`
is used as a stand-in) and one `msgstr[N]` line per form. The strings are
re-escaped (`\\`, `\"`, `\n`, `\t`, `\r`) so that `parse(dump(C))`
preserves the catalog. A total function: it always returns a `binary()`.
Parameter: `Catalog` must be a valid `parsed_catalog()` (`#{header := _,
entries := _}`) — typically the `{ok, Catalog}` from `parse/2`. A map without
the `header`/`entries` keys triggers `function_clause` (contract: it only
consumes what `parse/2` produces). Each `entry()` must have the `singular`/`plural`
shape; a tuple of any other shape falls through `dump_entry/1` and crashes.
The minimal synthetic header is NOT emitted with the `Content-Type` glued onto
the `msgstr` line: `dump_header_text/1` always emits `msgstr ""` and dumps the
header body as quoted CONTINUATION LINES (`encode_header_line/1`). So the actual
output for a catalog WITHOUT a header of its own is:
```erlang
1> {ok, C} = erli18n_po:parse(<<"msgid \"Hi\"\nmsgstr \"Oi\"\n">>).
2> erli18n_po:dump(C).
<<"msgid \"\"\nmsgstr \"\"\n"
"\"Content-Type: text/plain; charset=UTF-8\\n\"\n\n"
"msgid \"Hi\"\nmsgstr \"Oi\"\n\n">>
```
WATCH OUT for the roundtrip: the `.po` above has no header, so `C` carries a
synthetic header (`raw => <<>>`, `content_type => <<>>`). `dump/1` injects a
minimal `Content-Type`; on re-parse, that field is no longer empty, so the
catalog differs from the original and equality is FALSE:
```erlang
3> erli18n_po:parse(erli18n_po:dump(C)) =:= {ok, C}.
false
```
The law `parse(dump(C)) =:= {ok, C}` only holds for catalogs that ALREADY had a
header of their own (`raw =/= <<>>`) — exactly the case of the quickstart in the
moduledoc, which does return `true`.
Inverse path of `parse/1` / `parse/2`. See `parsed_catalog/0` and `entry/0`.
""".
-spec dump(parsed_catalog()) -> binary().
dump(#{header := Header, entries := Entries}) ->
HeaderBin = dump_header(Header),
EntriesBin = iolist_to_binary([dump_entry(E) || E <- Entries]),
<<HeaderBin/binary, EntriesBin/binary>>.
%% =========================
%% Charset detection and conversion (PSD-002)
%% =========================
%% Per PSD-005: BOM strip is the first thing the parser does. Already
%% silent — no logging, no flag.
strip_bom(<<16#EF, 16#BB, 16#BF, Rest/binary>>) -> Rest;
strip_bom(Bin) when is_binary(Bin) -> Bin.
%% Walks the raw input looking for the header entry (first non-comment
%% block starting with msgid ""). Extracts and validates the charset from
%% Content-Type. Returns the normalized charset atom or an error.
%%
%% This pass runs over raw bytes. The header (per GNU spec) is always
%% ASCII-safe, so reading it byte-by-byte is correct regardless of the
%% declared charset.
%%
%% Finding #16 (INFO): the header's `msgstr` lines are decoded here AND
%% again in the main pass. This second decode is INHERENT, not a
%% workaround: the body charset is only knowable after the header has been
%% read, and the line-by-line parse must run over the already-transcoded
%% body — so the header round-trips through the decoder twice by
%% construction. The cost is bounded: it is the HEADER only (one block,
%% ASCII-safe, a handful of short lines per the GNU spec), not the catalog
%% body, so there is no structural single-decode win to be had without
%% regressing charset detection. Left as-is deliberately.
extract_header_charset(Bin) ->
case extract_header_msgstr(Bin) of
{ok, HeaderText} -> charset_from_header(HeaderText);
no_header -> {ok, utf8};
{error, _} = Err -> Err
end.
%% Extract the msgstr text of the first entry whose msgid is empty.
%% Returns the concatenated header msgstr (with newlines preserved as in
%% the source) as a binary, or no_header if no header found.
extract_header_msgstr(Bin) ->
Lines = split_lines(Bin),
find_header(Lines, 1, []).
find_header([], _Ln, _Acc) ->
no_header;
find_header([Line | Rest], Ln, Acc) ->
Trimmed = trim_leading_ws(Line),
case classify_raw_line(Trimmed) of
blank when Acc =:= [] ->
find_header(Rest, Ln + 1, []);
comment ->
find_header(Rest, Ln + 1, Acc);
{msgid, Content} ->
%% Header has msgid "". If the first msgid in the file is
%% non-empty, there is no proper header — fallback to default.
case is_empty_string_line(Content, Rest) of
{true, RestAfterMsgid} ->
collect_header_msgstr(RestAfterMsgid, Ln + 1);
{false, _} ->
no_header
end;
_ ->
find_header(Rest, Ln + 1, [Line | Acc])
end.
%% Returns {true, RestLines} when the text after `msgid` is exactly the empty
%% quoted string `""`; otherwise {false, _}. Continuation lines are NOT
%% consumed here: the header msgid is the bare empty string, so a msgid with
%% any other content on its line is not treated as the header.
is_empty_string_line(~"\"\"", Rest) ->
{true, Rest};
is_empty_string_line(_, Rest) ->
{false, Rest}.
%% After seeing msgid "", look for the corresponding msgstr and gather it
%% (with continuation lines). The header msgstr's content is what we need.
collect_header_msgstr([], _Ln) ->
no_header;
collect_header_msgstr([Line | Rest], Ln) ->
Trimmed = trim_leading_ws(Line),
case classify_raw_line(Trimmed) of
blank ->
collect_header_msgstr(Rest, Ln + 1);
comment ->
collect_header_msgstr(Rest, Ln + 1);
{msgstr, Content} ->
case decode_quoted_string(Content) of
{ok, First} ->
{More, _Remaining} = consume_continuations(Rest),
{ok, <<First/binary, More/binary>>};
{error, Reason} ->
{error, {syntax_error, Ln, Reason}}
end;
_ ->
no_header
end.
-spec consume_continuations([binary()]) -> {binary(), [binary()]}.
consume_continuations(Lines) ->
consume_continuations(Lines, []).
-spec consume_continuations([binary()], [binary()]) ->
{binary(), [binary()]}.
consume_continuations([], Acc) ->
{bins_to_binary(Acc), []};
consume_continuations([Line | Rest] = All, Acc) ->
Trimmed = trim_leading_ws(Line),
case Trimmed of
<<$", _/binary>> ->
case decode_quoted_string(Trimmed) of
{ok, Bin} -> consume_continuations(Rest, [Bin | Acc]);
{error, _} -> {bins_to_binary(Acc), All}
end;
_ ->
{bins_to_binary(Acc), All}
end.
%% Reverse-and-concatenate a list of binaries into one binary. The list
%% comes pre-reversed from accumulator-style callers (latest element
%% first), so we reverse once and let `iolist_to_binary/1` materialize
%% the result in a single linear pass.
%%
%% This MUST stay linear in the total byte count. The previous shape —
%% a fold building `<<B/binary, Acc/binary>>` — placed the growing
%% accumulator on the RIGHT, which defeats the runtime's in-place binary
%% growth optimization (that only applies to append, `<<Acc/binary,
%% B/binary>>`, with a single reference). With the accumulator on the
%% right the whole `Acc` is re-copied on every element -> Θ(n²) to build
%% one n-byte string, so a single large msgid/msgstr stalled the loader
%% gen_server for seconds (Finding #3,
%% `po-decode-bins-to-binary-quadratic`). `iolist_to_binary/1` does the
%% same job in two linear passes (reverse + BIF) with one allocation —
%% strictly better above a few dozen bytes. `[binary()]` is a subtype of
%% `iolist()`, so the `-spec` is preserved and eqwalizer-friendly.
%%
%% Finding #17: `append_to_last/2` now ALSO accumulates continuation
%% segments as a reversed `[binary()]` list and routes them through THIS
%% same join (via `finalize_buffers/1`), so the per-field build is
%% genuinely O(total). The previous comment claimed `append_to_last/2`'s
%% left-accumulator binary append was "already O(total)" — it was not: the
%% growing accumulator lived inside the `#po_st{}` record (more than one
%% reference), defeating the runtime's in-place append optimization and
%% making a many-continuation field super-linear. Both paths now share
%% this single linear join.
-spec bins_to_binary([binary()]) -> binary().
bins_to_binary(Bins) when is_list(Bins) ->
iolist_to_binary(lists:reverse(Bins)).
%% Per PSD-002: accept utf8 (and aliases), latin1 / iso-8859-1, us-ascii.
%% Case-insensitive match per RFC 2978 (charset names are
%% case-insensitive). Anything else: hard fail.
%%
%% Finding #5 (po-header-malformed-content-type-badmatch-crash): this
%% prepass MUST agree with `build_header/1` on every input, or an
%% adversarial header (e.g. `Content-Type : ...; charset=Shift_JIS` with
%% a space before the colon) makes the two paths disagree — the prepass
%% defaulting to utf8 while `build_header` classifies and crashes on a
%% non-exhaustive match. We guarantee agreement by deriving the charset
%% from the SAME normalized field list (`parse_header_fields/1`, which
%% splits each header line on the first colon and trims/lowercases the
%% key per RFC 822 LWSP) and the SAME classifier (`field_charset/1`) that
%% `build_header/1` uses. One reconciler, one whitespace policy, no
%% divergence.
-spec charset_from_header(binary()) ->
{ok, utf8 | latin1 | us_ascii} | {error, parse_error()}.
charset_from_header(HeaderText) ->
field_charset(parse_header_fields(HeaderText)).
%% Single charset reconciler shared by the prepass (`charset_from_header/1`)
%% and the header builder (`build_header/1`). Both pass the normalized
%% field list from `parse_header_fields/1`, so they can never disagree.
-spec field_charset([{binary(), binary()}]) ->
{ok, utf8 | latin1 | us_ascii} | {error, parse_error()}.
field_charset(Fields) ->
classify_charset_from_content_type(
proplists:get_value(~"content-type", Fields, <<>>)
).
%% Identity on binaries: accepts a binary and returns it unchanged.
%% Both callers pass `string:lowercase/1` of a binary, which always returns a
%% binary, so there is no chardata clause and no call to
%% `unicode:characters_to_binary/1`: a non-binary argument would be a contract
%% violation and crashes explicitly via `function_clause`.
-spec to_binary(binary()) -> binary().
to_binary(B) when is_binary(B) -> B.
extract_charset_token(Bin) ->
extract_charset_token(Bin, <<>>).
extract_charset_token(<<>>, Acc) ->
finalize_token(Acc);
extract_charset_token(<<C, _/binary>>, Acc) when
C =:= $;; C =:= $\s; C =:= $\t; C =:= $\r; C =:= $\n
->
finalize_token(Acc);
extract_charset_token(<<C, Rest/binary>>, Acc) ->
extract_charset_token(Rest, <<Acc/binary, C>>).
finalize_token(<<>>) -> undefined;
finalize_token(Bin) -> Bin.
classify_charset(Bin) ->
case string:lowercase(Bin) of
~"utf-8" -> {ok, utf8};
~"utf8" -> {ok, utf8};
~"iso-8859-1" -> {ok, latin1};
~"iso8859-1" -> {ok, latin1};
~"latin-1" -> {ok, latin1};
~"latin1" -> {ok, latin1};
~"us-ascii" -> {ok, us_ascii};
~"ascii" -> {ok, us_ascii};
_ -> {error, {unsupported_charset, Bin}}
end.
normalize_input(Bin, utf8) ->
%% Already UTF-8; validate via unicode:characters_to_binary/3 to fail
%% loud on malformed bytes.
case unicode:characters_to_binary(Bin, utf8, utf8) of
Bin2 when is_binary(Bin2) -> {ok, Bin2};
{error, _, _} = E -> {error, {charset_conversion, ~"UTF-8", E}};
{incomplete, _, _} = E -> {error, {charset_conversion, ~"UTF-8", E}}
end;
normalize_input(Bin, us_ascii) ->
%% US-ASCII is a strict subset of UTF-8 — passthrough is correct, but
%% we validate that bytes are within 0-127.
case validate_ascii(Bin) of
ok -> {ok, Bin};
{error, _} = E -> E
end;
normalize_input(Bin, latin1) ->
%% Every byte 0..255 is a valid Latin-1 codepoint, so
%% unicode:characters_to_binary/3 with latin1 -> utf8 cannot return
%% error/incomplete. A binary result is the only possible outcome;
%% any other shape is a contract violation that will surface as a
%% badmatch crash on the pattern below.
Bin2 = unicode:characters_to_binary(Bin, latin1, utf8),
true = is_binary(Bin2),
{ok, Bin2}.
validate_ascii(<<>>) ->
ok;
validate_ascii(<<C, _/binary>>) when C > 127 ->
{error, {charset_conversion, ~"US-ASCII", non_ascii_byte}};
validate_ascii(<<_, Rest/binary>>) ->
validate_ascii(Rest).
%% =========================
%% Main parser (PO grammar, hand-rolled recursive descent)
%% =========================
-doc """
(Internal, maintainer.) The parse engine: it receives the body already converted
to UTF-8 (`Utf8Bin`) together with the discovered `Charset`, which is threaded
through for escape decoding. Runs `parse_lines/4`, accumulating entries in
REVERSE order in the `#pst{}` (hence the `lists:reverse/1` at the end; this is
the accumulator invariant of this module). If no header entry (`msgid ""`)
appeared, it synthesizes an empty header with `charset => utf8` via
`empty_header/0`. Only reached after `parse/2` has discovered the charset and
normalized the body; never called directly.
""".
-spec do_parse(binary(), charset(), parse_opts()) ->
{ok, parsed_catalog()} | {error, parse_error()}.
do_parse(Utf8Bin, Charset, Opts) ->
IncludeFuzzy = maps:get(include_fuzzy, Opts, false),
Lines = split_lines(Utf8Bin),
St0 = #pst{include_fuzzy = IncludeFuzzy, charset = Charset},
case parse_lines(Lines, 1, fresh_entry(1), St0) of
{ok, #pst{header = undefined, entries = Entries}} ->
%% No header entry — synthesize an empty header with utf8.
Header = empty_header(),
{ok, #{
header => Header,
entries => lists:reverse(Entries)
}};
{ok, #pst{header = Header, entries = Entries}} ->
{ok, #{
header => Header,
entries => lists:reverse(Entries)
}};
{error, _} = Err ->
Err
end.
empty_header() ->
#{
plural_forms => <<>>,
content_type => <<>>,
charset => utf8,
raw => <<>>
}.
fresh_entry(Ln) ->
#po_st{start_line = Ln}.
%% Line splitting that handles LF, CRLF and lone-CR line endings. We fold
%% CRLF -> LF first, then any remaining lone CR (0x0D, classic-Mac style)
%% -> LF, before splitting on LF. This matches `msgfmt -c`, which accepts
%% all three newline conventions (Finding #15). Folding CRLF first ensures
%% a CRLF is never turned into two separate line breaks.
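%%
%% Illustrative shell check (stdlib only; `Fold` is a hypothetical stand-in
%% for the two `binary:replace/4` calls below, not part of this module):
%%   1> Fold = fun(B) ->
%%          B1 = binary:replace(B, <<"\r\n">>, <<"\n">>, [global]),
%%          binary:replace(B1, <<"\r">>, <<"\n">>, [global])
%%      end.
%%   2> binary:split(Fold(<<"a\r\nb\rc\nd">>), <<"\n">>, [global]).
%%   [<<"a">>,<<"b">>,<<"c">>,<<"d">>]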
split_lines(Bin) ->
Norm0 = binary:replace(Bin, ~"\r\n", ~"\n", [global]),
Norm = binary:replace(Norm0, ~"\r", ~"\n", [global]),
binary:split(Norm, ~"\n", [global]).
parse_lines([], _Ln, Cur, St) ->
%% EOF — flush any pending entry.
finalize_entry(Cur, St);
parse_lines([Line | Rest], Ln, Cur, St) ->
Trimmed = trim_leading_ws(Line),
case classify_line(Trimmed, Cur) of
blank ->
case is_empty_entry(Cur) of
true ->
parse_lines(Rest, Ln + 1, fresh_entry(Ln + 1), St);
false ->
case finalize_entry(Cur, St) of
{ok, St1} ->
parse_lines(
Rest,
Ln + 1,
fresh_entry(Ln + 1),
St1
);
{error, _} = Err ->
Err
end
end;
skip ->
parse_lines(Rest, Ln + 1, Cur, St);
fuzzy_flag ->
parse_lines(Rest, Ln + 1, Cur#po_st{fuzzy = true}, St);
obsolete ->
%% Per PSD-007: obsolete lines are skipped entirely, but they
%% can span multiple lines forming a fake entry. Mark the
%% current entry as obsolete so it is discarded on flush.
parse_lines(Rest, Ln + 1, Cur#po_st{obsolete = true}, St);
{msgctxt, Content} ->
handle_string_field(msgctxt, Content, Rest, Ln, Cur, St);
{msgid, Content} ->
handle_string_field(msgid, Content, Rest, Ln, Cur, St);
{msgid_plural, Content} ->
handle_string_field(msgid_plural, Content, Rest, Ln, Cur, St);
{msgstr, Content} ->
handle_string_field(msgstr, Content, Rest, Ln, Cur, St);
{msgstr_n, Idx, Content} ->
handle_string_field({msgstr, Idx}, Content, Rest, Ln, Cur, St);
{continuation, Content} ->
case decode_quoted_string(Content, St#pst.charset) of
{ok, Bin} ->
Cur2 = append_to_last(Cur, Bin),
parse_lines(Rest, Ln + 1, Cur2, St);
{error, Reason} ->
{error, {syntax_error, Ln, Reason}}
end;
{syntax_error, Reason} ->
{error, {syntax_error, Ln, Reason}}
end.
handle_string_field(Field, Content, Rest, Ln, Cur, St) ->
case decode_quoted_string(Content, St#pst.charset) of
{ok, Bin} ->
Cur2 = set_field(Field, Bin, Cur),
parse_lines(Rest, Ln + 1, Cur2, St);
{error, Reason} ->
{error, {syntax_error, Ln, Reason}}
end.
%% Finding #17: each string field starts life as a one-element REVERSED
%% segment list (`[Bin]`), so a later continuation just prepends (O(1)) and
%% the whole field joins once at finalization. The `last_field` tag drives
%% which buffer `append_to_last/2` extends.
set_field(msgctxt, Bin, Cur) ->
Cur#po_st{
context = [Bin],
last_field = msgctxt
};
set_field(msgid, Bin, Cur) ->
Cur#po_st{
msgid = [Bin],
last_field = msgid
};
set_field(msgid_plural, Bin, Cur) ->
Cur#po_st{
msgid_plural = [Bin],
last_field = msgid_plural
};
set_field(msgstr, Bin, Cur) ->
Cur#po_st{
msgstr = [Bin],
last_field = msgstr
};
set_field({msgstr, Idx}, Bin, Cur) ->
Existing = Cur#po_st.msgstr_plurals,
Cur#po_st{
msgstr_plurals = [{Idx, [Bin]} | Existing],
last_field = {msgstr, Idx}
}.
%% Finding #17 (po-append-to-last-superlinear): append ONE continuation
%% segment by PREPENDING it to the field's reversed segment list (O(1)),
%% never by re-copying a growing binary. The field is joined into a binary
%% exactly once, later, in `finalize_buffers/1`, so building an n-byte
%% field over many continuation lines is genuinely O(n) total instead of
%% the old Θ(n²).
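%%
%% Illustrative shell check (stdlib only). It assumes the final join
%% reverses the segment list before concatenating, which is what a reversed
%% buffer requires; `bins_to_binary/1` itself is not shown here:
%%   1> Segs = lists:foldl(fun(S, Acc) -> [S | Acc] end, [<<"a">>], [<<"b">>, <<"c">>]).
%%   [<<"c">>,<<"b">>,<<"a">>]
%%   2> iolist_to_binary(lists:reverse(Segs)).
%%   <<"abc">>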
append_to_last(Cur, Bin) ->
%% classify_line only emits {continuation, _} when last_field =/= undefined
%% (orphan continuations are intercepted as {syntax_error,
%% unexpected_continuation}). Therefore the undefined case is
%% unreachable: hitting it would mean a contract violation and we
%% want it to crash visibly with case_clause.
%%
%% The matched `[_ | _] = Segs` pattern proves the field is a NON-EMPTY
%% reversed segment list at this point — `set_field/3` always seeds it
%% with `[Bin]` before any continuation can extend it, so a non-list /
%% empty-list field would mean `set_field/3` was bypassed (contract
%% violation, badmatch).
case Cur#po_st.last_field of
msgctxt ->
[_ | _] = Segs = Cur#po_st.context,
Cur#po_st{context = [Bin | Segs]};
msgid ->
[_ | _] = Segs = Cur#po_st.msgid,
Cur#po_st{msgid = [Bin | Segs]};
msgid_plural ->
[_ | _] = Segs = Cur#po_st.msgid_plural,
Cur#po_st{msgid_plural = [Bin | Segs]};
msgstr ->
[_ | _] = Segs = Cur#po_st.msgstr,
Cur#po_st{msgstr = [Bin | Segs]};
{msgstr, Idx} ->
[{Idx, [_ | _] = Segs} | T] = Cur#po_st.msgstr_plurals,
Cur#po_st{msgstr_plurals = [{Idx, [Bin | Segs]} | T]}
end.
is_empty_entry(#po_st{
context = undefined,
msgid = undefined,
msgid_plural = undefined,
msgstr = undefined,
msgstr_plurals = [],
fuzzy = false,
obsolete = false
}) ->
true;
is_empty_entry(_) ->
false.
%% =========================
%% Entry finalization
%% =========================
%% Finding #17: flatten the per-field reversed segment buffers
%% (`[binary()]`, built O(1)-per-continuation) back into single binaries
%% EXACTLY ONCE, here at the finalization boundary, before any of the
%% `finalize_entry_flat/2` clauses pattern-match on `msgid = undefined` /
%% `msgid = <<>>` or `emit_entry/2` reads the fields. After this call every
%% string field is `undefined | binary()` again, so all downstream matches
%% are unchanged.
finalize_entry(Cur, St) ->
finalize_entry_flat(finalize_buffers(Cur), St).
%% Join each field's reversed segment list into one binary with a single
%% linear `iolist_to_binary/1` pass (`bins_to_binary/1`), leaving
%% `undefined` (field never seen) untouched. No `-spec`: like the other
%% `#po_st{}`-consuming internals (`set_field/3`, `append_to_last/2`,
%% `emit_entry/2`) it takes the record directly, and elvis'
%% `no_spec_with_records` rule forbids naming the record in a spec.
finalize_buffers(#po_st{} = Cur) ->
Cur#po_st{
context = flatten_field(Cur#po_st.context),
msgid = flatten_field(Cur#po_st.msgid),
msgid_plural = flatten_field(Cur#po_st.msgid_plural),
msgstr = flatten_field(Cur#po_st.msgstr),
%% Plural buffers are always a reversed `[binary()]` segment list at
%% finalization (never `undefined`, never a bare binary), so join them
%% with `bins_to_binary/1` directly. The `is_list/1` guard narrows the
%% record field's `[binary()] | binary()` union to the list arm; it is
%% always true here (the buffer is built only by prepending segments),
%% so it drops nothing and the join stays a single linear pass.
msgstr_plurals = [
{Idx, bins_to_binary(Segs)}
|| {Idx, Segs} <- Cur#po_st.msgstr_plurals, is_list(Segs)
]
}.
%% `undefined` -> `undefined` (field never seen); a reversed segment list
%% -> one binary in a single linear pass.
-spec flatten_field(undefined | [binary()] | binary()) -> undefined | binary().
flatten_field(undefined) -> undefined;
flatten_field(Segs) when is_list(Segs) -> bins_to_binary(Segs).
finalize_entry_flat(#po_st{obsolete = true}, St) ->
%% Per PSD-007: drop obsolete entries silently.
{ok, St};
finalize_entry_flat(#po_st{msgid = undefined}, St) ->
%% No msgid in this block — nothing to emit (trailing blank lines,
%% comment-only blocks, etc.).
{ok, St};
finalize_entry_flat(#po_st{msgid = <<>>} = Cur, #pst{header = undefined} = St) ->
%% Header entry: msgid == "". `build_header/1` is total — the prepass
%% (parse/2) already reconciled the charset and short-circuited an
%% unsupported one before do_parse, so it returns `{ok, _}` here.
HeaderText = best_header_text(Cur),
{ok, Header} = build_header(HeaderText),
Nplurals = nplurals_from_header(Header),
{ok, St#pst{header = Header, nplurals = Nplurals}};
finalize_entry_flat(#po_st{msgid = <<>>}, St) ->
%% Duplicate header entry — preserve the first one (parity with
%% msgfmt which uses the first one). Drop silently.
{ok, St};
finalize_entry_flat(Cur, St) ->
case Cur#po_st.fuzzy andalso not St#pst.include_fuzzy of
true ->
%% Per PSD-001: fuzzy entries dropped by default.
{ok, St};
false ->
emit_entry(Cur, St)
end.
best_header_text(#po_st{msgstr = Bin}) when is_binary(Bin) -> Bin;
best_header_text(_) -> <<>>.
emit_entry(
#po_st{
msgid_plural = undefined,
msgid = Msgid,
context = Ctx,
msgstr = Msgstr
},
St
) ->
Translation =
case Msgstr of
undefined -> <<>>;
_ -> Msgstr
end,
%% Per PSD-003: parser preserves <<>> as translation; fallback is
%% lookup's responsibility.
Entry = {singular, Ctx, Msgid, Translation},
{ok, St#pst{entries = [Entry | St#pst.entries]}};
emit_entry(
#po_st{
msgid_plural = MsgidPlural,
msgid = Msgid,
context = Ctx,
msgstr_plurals = Plurals
},
St
) ->
%% Per PSD-009: validate index set against nplurals from the header
%% (when known). If the header is absent or has no nplurals, accept
%% any index set.
SortedPlurals = lists:keysort(1, Plurals),
Indices = [I || {I, _} <- SortedPlurals],
case validate_plural_indices(Msgid, St#pst.nplurals, Indices) of
ok ->
%% Finding #14: retain `msgid_plural` so `dump/1` re-emits the
%% real plural-form source text instead of substituting `Msgid`.
Entry = {plural, Ctx, Msgid, MsgidPlural, SortedPlurals},
{ok, St#pst{entries = [Entry | St#pst.entries]}};
{error, _} = Err ->
Err
end.
-doc """
(Internal, maintainer — PSD-009.) Validates the `msgstr[N]` index set of a
plural entry against the header's `Nplurals`.
If `Nplurals` is `undefined` (no header, or a header without a usable `nplurals`),
it ACCEPTS any set: a deliberate fail-open that mirrors the fail-open of
`collect_digits/2`. Otherwise the set must be EXACTLY
`[0, 1, ..., Nplurals-1]` (already sorted by the caller); a divergence becomes
`{error, {plural_count_mismatch, Msgid, Nplurals, Indices}}`, which bubbles up as
a `parse_error()`.
Anti-DoS (finding #1, po-plural-nplurals-seq-allocation-dos): `Nplurals` is
attacker-controlled — `collect_digits/2` only caps the DIGIT COUNT, so the
header may legitimately declare `nplurals=9999999`. The validation MUST NOT
size any list by that value (the old `lists:seq(0, Nplurals - 1)` allocated a
~10M-element list, ~80MB, for a 158-byte `.po`). Instead it checks the two
conditions that `Indices =:= lists:seq(0, Nplurals - 1)` decomposes into,
without ever materializing the expected sequence: (1) the present set is a
dense `[0, 1, ..., length(Indices) - 1]` prefix — sized by the bytes actually
present in the file, not the header; and (2) that length equals `Nplurals`. A
genuine count mismatch still yields the EXACT same
`{plural_count_mismatch, Msgid, Nplurals, Indices}` payload, in O(length(Indices)).
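
A minimal shell sketch of the two-condition check, using only stdlib calls.
`Check` is a hypothetical stand-in for this function that returns a boolean
instead of `ok | {error, _}`:

```erlang
Check = fun(Nplurals, Indices) ->
    Len = length(Indices),
    %% The expected prefix is sized by the indices present, never by Nplurals.
    Indices =:= lists:seq(0, Len - 1) andalso Len =:= Nplurals
end.
true = Check(3, [0, 1, 2]).
false = Check(2, [0, 2]).          %% gap: not a dense 0-based prefix
false = Check(9999999, [0, 1]).    %% huge header value, two-element list
```

The last call builds a two-element list regardless of the declared value, which
is the property the anti-DoS note above relies on.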
""".
%% Per PSD-009: index set must be exactly [0, 1, ..., Nplurals-1].
validate_plural_indices(_Msgid, undefined, _Indices) ->
ok;
validate_plural_indices(Msgid, Nplurals, Indices) ->
%% Size the comparison list by the indices PRESENT in the file
%% (`length(Indices)`), never by the untrusted header `Nplurals`. The
%% set is exactly `[0..Nplurals-1]` iff it is a dense 0-based prefix of
%% its own length AND that length equals `Nplurals`; checking both
%% avoids `lists:seq(0, Nplurals - 1)`, which an adversarial
%% `nplurals=9999999` would blow up into a multi-MB allocation.
DensePrefix = lists:seq(0, length(Indices) - 1),
case Indices =:= DensePrefix andalso length(Indices) =:= Nplurals of
true -> ok;
false -> {error, {plural_count_mismatch, Msgid, Nplurals, Indices}}
end.
%% =========================
%% Header parsing
%% =========================
%% Finding #5 (po-header-malformed-content-type-badmatch-crash):
%% `build_header/1` only ever returns `{ok, header_map()}`. The charset is
%% reconciled through `field_charset/1`, the SAME path the prepass uses,
%% so the prepass (`parse/2`) has already short-circuited an unsupported
%% charset with a structured `{error, _}` before we get here. That keeps
%% the single charset gate in the prepass, so untrusted input never
%% reaches the `{ok, Charset} =` match below with an unsupported value.
%% The match is an assertion: a future divergence between the two paths
%% crashes explicitly with a badmatch instead of being silently tolerated.
-spec build_header(binary()) -> {ok, header_map()}.
build_header(<<>>) ->
{ok, empty_header()};
build_header(HeaderText) when is_binary(HeaderText) ->
Fields = parse_header_fields(HeaderText),
PluralForms = proplists:get_value(~"plural-forms", Fields, <<>>),
ContentType = proplists:get_value(~"content-type", Fields, <<>>),
%% The prepass (parse/2) already reconciled the charset via the identical
%% `field_charset/1` and short-circuited an unsupported charset BEFORE
%% do_parse ran, so it is `{ok, _}` here. Asserting the match (rather than
%% re-handling `{error, _}` on an unreachable path) keeps the single charset
%% gate in the prepass; a future divergence crashes explicitly (badmatch).
{ok, Charset} = field_charset(Fields),
{ok, #{
plural_forms => PluralForms,
content_type => ContentType,
charset => Charset,
raw => HeaderText
}}.
%% Header lines have the shape "Key: Value\n". Keys are stored lowercased
%% for case-insensitive lookup.
parse_header_fields(Bin) ->
Lines = binary:split(Bin, ~"\n", [global]),
lists:flatmap(fun parse_header_line/1, Lines).
parse_header_line(<<>>) ->
[];
parse_header_line(Line) ->
case binary:split(Line, ~":") of
[Key, Value] ->
K = string:lowercase(string:trim(Key)),
V = string:trim(Value),
[{K, V}];
_ ->
[]
end.
-spec classify_charset_from_content_type(binary()) ->
{ok, utf8 | latin1 | us_ascii} | {error, {unsupported_charset, binary()}}.
classify_charset_from_content_type(<<>>) ->
{ok, utf8};
classify_charset_from_content_type(ContentType) ->
%% Narrow `chardata() -> binary()` at the boundary so `binary:match/2`
%% is type-checked.
Lower = to_binary(string:lowercase(ContentType)),
case binary:match(Lower, ~"charset=") of
nomatch ->
{ok, utf8};
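{Start, _Len} ->
%% `Start` is an offset into `Lower`, so slice `Lower` as well:
%% `string:lowercase/1` can change the byte length of non-ASCII
%% input, which would make the offset invalid for `ContentType`.
%% `classify_charset/1` lowercases the token anyway, so only the
%% case of an `unsupported_charset` payload is affected.
Rest = binary:part(
Lower,
Start + 8,
byte_size(Lower) - (Start + 8)
),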
Token = extract_charset_token(Rest),
case Token of
undefined -> {ok, utf8};
_ -> classify_charset(Token)
end
end.
%% Per PSD-004: nplurals parsed eagerly for cross-check with msgstr[N]
%% indices. The full Plural-Forms expression is preserved raw for
%% downstream evaluation.
nplurals_from_header(#{plural_forms := <<>>}) ->
undefined;
nplurals_from_header(#{plural_forms := PF}) ->
case binary:match(PF, ~"nplurals") of
nomatch ->
undefined;
{Start, _} ->
Rest = binary:part(PF, Start, byte_size(PF) - Start),
extract_nplurals_value(Rest)
end.
extract_nplurals_value(Bin) ->
case binary:match(Bin, ~"=") of
nomatch ->
undefined;
{EqStart, _} ->
After = binary:part(
Bin,
EqStart + 1,
byte_size(Bin) - (EqStart + 1)
),
collect_digits(After, <<>>)
end.
%% Finding #8 (po-plural-unbounded-binary-to-integer-bignum): cap the
%% digit run by COUNT before `binary_to_integer`. This is a tolerant
%% cross-check of the header's `nplurals=` value (used only to validate
%% plural-form counts downstream), so an over-long run is treated as "no
%% usable nplurals declared" — `undefined`, the same fail-open outcome as
%% a missing field — rather than crashing the parse. The bignum is never
%% materialised, so the O(d^2) cost and the >=~1.3M-digit `system_limit`
%% exception are both avoided.
-doc """
(Internal, maintainer — anti-DoS defense.) Reads the digit run of `nplurals=`
in the header, capping by COUNT (`?MAX_INT_DIGITS`) BEFORE `binary_to_integer`.
A TOLERANT fail-open: an empty or over-long run becomes `undefined`, the same
result as "no `nplurals` declared". This is safe because the value only serves
as a downstream cross-check (`validate_plural_indices/3`); an adversarial header
with thousands of digits is abandoned after at most `?MAX_INT_DIGITS + 1` digits,
and the bignum is never built. Contrast with `parse_msgstr_index/2`, which FAILS
CLOSED (the index is load-bearing).
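
A shell sketch of the count cap, mirroring the clause order of this function.
`Collect` and `Max = 4` are hypothetical stand-ins (the real value of
`?MAX_INT_DIGITS` is not shown here):

```erlang
Max = 4.
Collect = fun
    F(_, Acc) when byte_size(Acc) > Max -> undefined;
    F(<<C, Rest/binary>>, Acc) when C >= $0, C =< $9 -> F(Rest, <<Acc/binary, C>>);
    F(_, <<>>) -> undefined;
    F(_, Acc) -> binary_to_integer(Acc)
end.
2 = Collect(<<"2; plural=(n != 1);">>, <<>>).
1234 = Collect(<<"1234;">>, <<>>).
undefined = Collect(<<"x">>, <<>>).          %% empty digit run
undefined = Collect(<<"123456">>, <<>>).     %% over the cap: no conversion
```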
""".
-spec collect_digits(binary(), binary()) -> undefined | non_neg_integer().
collect_digits(_, Acc) when byte_size(Acc) > ?MAX_INT_DIGITS ->
undefined;
collect_digits(<<C, Rest/binary>>, Acc) when C >= $0, C =< $9 ->
collect_digits(Rest, <<Acc/binary, C>>);
collect_digits(_, <<>>) ->
undefined;
collect_digits(_, Acc) ->
binary_to_integer(Acc).
%% =========================
%% Line classification
%% =========================
%% For the prepass extracting the header charset, we treat all comments
%% uniformly (`comment`) and report malformed keyword lines as `other`
%% instead of raising a syntax error.
classify_raw_line(<<>>) ->
blank;
classify_raw_line(<<"#", _/binary>>) ->
comment;
classify_raw_line(<<"msgctxt", Rest/binary>>) ->
case strip_keyword_space(Rest) of
{ok, Content} -> {msgctxt, Content};
error -> other
end;
classify_raw_line(<<"msgid_plural", Rest/binary>>) ->
case strip_keyword_space(Rest) of
{ok, Content} -> {msgid_plural, Content};
error -> other
end;
classify_raw_line(<<"msgid", Rest/binary>>) ->
case strip_keyword_space(Rest) of
{ok, Content} -> {msgid, Content};
error -> other
end;
classify_raw_line(<<"msgstr", Rest/binary>>) ->
case classify_msgstr(Rest) of
{ok, Content} -> {msgstr, Content};
{ok, Idx, Content} -> {msgstr_n, Idx, Content};
%% Prepass only extracts the header charset; a malformed or
%% over-long msgstr index (finding #8) is irrelevant here and is
%% treated like any other unclassified line.
{error, _} -> other;
error -> other
end;
classify_raw_line(<<$", _/binary>>) ->
continuation;
classify_raw_line(_) ->
other.
%% Full classifier for the main parser (carries context-sensitive info).
classify_line(<<>>, _Cur) ->
blank;
classify_line(<<"#~", _Rest/binary>>, _Cur) ->
%% Per PSD-007: any line starting with #~ is part of an obsolete
%% entry. We mark the entry as obsolete; downstream skips it.
%% Body content is irrelevant — the entire entry is dropped on flush.
obsolete;
classify_line(<<"#,", Rest/binary>>, _Cur) ->
%% Flag line. Look for "fuzzy" anywhere in the flag list (a
%% case-insensitive substring match). Other flags (c-format,
%% no-c-format, etc.) are ignored; they have no effect on the
%% catalog content.
%% Narrow chardata() -> binary() so binary:match/2 is type-checked.
Lower = to_binary(string:lowercase(Rest)),
case binary:match(Lower, ~"fuzzy") of
nomatch -> skip;
_ -> fuzzy_flag
end;
classify_line(<<"#|", _Rest/binary>>, _Cur) ->
%% Previous-msgid (informational, GNU manual "Marking Translations
%% as Fuzzy"). Skip.
skip;
classify_line(<<"#.", _Rest/binary>>, _Cur) ->
skip;
classify_line(<<"#:", _Rest/binary>>, _Cur) ->
skip;
classify_line(<<"#", _Rest/binary>>, _Cur) ->
%% Translator comment.
skip;
classify_line(<<"msgctxt", Rest/binary>>, _Cur) ->
case strip_keyword_space(Rest) of
{ok, Content} -> {msgctxt, Content};
error -> {syntax_error, expected_msgctxt_string}
end;
classify_line(<<"msgid_plural", Rest/binary>>, _Cur) ->
case strip_keyword_space(Rest) of
{ok, Content} -> {msgid_plural, Content};
error -> {syntax_error, expected_msgid_plural_string}
end;
classify_line(<<"msgid", Rest/binary>>, _Cur) ->
case strip_keyword_space(Rest) of
{ok, Content} -> {msgid, Content};
error -> {syntax_error, expected_msgid_string}
end;
classify_line(<<"msgstr", Rest/binary>>, _Cur) ->
case classify_msgstr(Rest) of
{ok, Content} -> {msgstr, Content};
{ok, Idx, Content} -> {msgstr_n, Idx, Content};
%% Finding #8: an over-long `msgstr[<digits>]` index surfaces a
%% structured reason so the parse fails closed with a precise
%% diagnostic instead of crashing on a giant `binary_to_integer`.
{error, Reason} -> {syntax_error, Reason};
error -> {syntax_error, expected_msgstr_string}
end;
classify_line(<<$", _/binary>> = Line, #po_st{last_field = LF}) when LF =/= undefined ->
{continuation, Line};
classify_line(<<$", _/binary>>, _Cur) ->
{syntax_error, unexpected_continuation};
classify_line(Other, _Cur) ->
{syntax_error, {unrecognized_line, Other}}.
strip_keyword_space(<<>>) -> error;
strip_keyword_space(<<$\s, Rest/binary>>) -> strip_keyword_space(Rest);
strip_keyword_space(<<$\t, Rest/binary>>) -> strip_keyword_space(Rest);
strip_keyword_space(<<$", _/binary>> = Bin) -> {ok, Bin};
strip_keyword_space(_) -> error.
classify_msgstr(<<$[, Rest/binary>>) ->
case parse_msgstr_index(Rest, <<>>) of
{ok, Idx, After} ->
case strip_keyword_space(After) of
{ok, Content} -> {ok, Idx, Content};
error -> error
end;
{error, _} = Err ->
Err;
error ->
error
end;
classify_msgstr(Rest) ->
case strip_keyword_space(Rest) of
{ok, Content} -> {ok, Content};
error -> error
end.
%% Finding #8 (po-plural-unbounded-binary-to-integer-bignum): cap the
%% `msgstr[<digits>]` index run by DIGIT COUNT before `binary_to_integer`
%% builds the bignum. An over-long run is surfaced as a structured
%% `{error, {index_too_long, Max}}` (the rejected run is kept OUT of the
%% payload), which the caller turns into a `{syntax_error, _, _}` parse
%% error — never an O(d^2) bignum and never an uncaught `system_limit`.
-doc """
(Internal, maintainer — anti-DoS defense.) Reads the index of `msgstr[<digits>]`,
capping by COUNT (`?MAX_INT_DIGITS`) before `binary_to_integer`.
FAILS CLOSED (unlike `collect_digits/2`): an over-long run becomes
`{error, {index_too_long, Max}}` — with the rejected run DELIBERATELY kept out
of the payload — which the caller converts into `{syntax_error, Line, _}`. The
index is load-bearing (it selects the plural form), so silently ignoring it
would be wrong; better to reject the `.po`. Returns the bare atom `error` (not a
tuple) when the `[` is not followed by one or more digits and a closing `]`.
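
A shell sketch of the three result shapes, mirroring the clause order of this
function. `Index` and `Max = 4` are hypothetical stand-ins (the real value of
`?MAX_INT_DIGITS` is not shown here):

```erlang
Max = 4.
Index = fun
    F(_, Acc) when byte_size(Acc) > Max -> {error, {index_too_long, Max}};
    F(<<$], Rest/binary>>, Acc) when byte_size(Acc) > 0 -> {ok, binary_to_integer(Acc), Rest};
    F(<<C, Rest/binary>>, Acc) when C >= $0, C =< $9 -> F(Rest, <<Acc/binary, C>>);
    F(_, _) -> error
end.
{ok, 2, <<" rest">>} = Index(<<"2] rest">>, <<>>).
error = Index(<<"]">>, <<>>).                             %% no digits
error = Index(<<"2x]">>, <<>>).                           %% non-digit inside
{error, {index_too_long, 4}} = Index(<<"99999]">>, <<>>). %% fails closed
```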
""".
-spec parse_msgstr_index(binary(), binary()) ->
{ok, non_neg_integer(), binary()}
| {error, {index_too_long, pos_integer()}}
| error.
parse_msgstr_index(_, Acc) when byte_size(Acc) > ?MAX_INT_DIGITS ->
{error, {index_too_long, ?MAX_INT_DIGITS}};
parse_msgstr_index(<<$], Rest/binary>>, Acc) when byte_size(Acc) > 0 ->
{ok, binary_to_integer(Acc), Rest};
parse_msgstr_index(<<C, Rest/binary>>, Acc) when C >= $0, C =< $9 ->
parse_msgstr_index(Rest, <<Acc/binary, C>>);
parse_msgstr_index(_, _) ->
error.
trim_leading_ws(<<$\s, Rest/binary>>) -> trim_leading_ws(Rest);
trim_leading_ws(<<$\t, Rest/binary>>) -> trim_leading_ws(Rest);
trim_leading_ws(Bin) -> Bin.
%% =========================
%% Quoted string decoder
%% =========================
%% Decodes a PO-style quoted string. Input must start with " and end
%% with " (trailing whitespace allowed). Escape sequences per the GNU
%% gettext PO format spec
%% (https://www.gnu.org/software/gettext/manual/gettext.html#PO-Files):
%% \n \t \r \" \\ \xHH \OOO \b \f \v \a. The decoder also accepts
%% \/ \? and \' (see `decode_escape/1`).
%%
%% All four call sites (collect_header_msgstr, consume_continuations,
%% handle_string_field, parse_lines continuation branch) gate input on
%% <<$", _/binary>> via strip_keyword_space or a guard pattern, so the
%% leading quote is an enforced precondition. Passing anything else is
%% a contract violation and will crash with function_clause.
%% Arity-1 shim for the header prepass call sites
%% (`collect_header_msgstr/2`, `consume_continuations/2`). The header is
%% ASCII-safe per the GNU spec and is decoded BEFORE the body charset is
%% applied, so utf8 (the already-UTF-8 identity, matching the legacy
%% behaviour for ASCII) is the correct code space there.
-spec decode_quoted_string(binary()) ->
{ok, binary()} | {error, term()}.
decode_quoted_string(Bin) ->
decode_quoted_string(Bin, utf8).
%% Finding #11 — two-phase decode (mirrors the GNU gettext lexer):
%% phase 1 walks the quoted string emitting tagged `chunk()`s (literal
%% UTF-8 text vs. raw escape bytes); phase 2 (`reassemble_field/2`)
%% transcodes contiguous raw runs through the declared charset, so
%% `\xHH`/`\OOO` escape bytes end up as valid UTF-8 (or a structured
%% error) instead of being spliced raw past the UTF-8 gate.
-doc """
(Internal, maintainer.) Decodes ONE quoted PO string, in two phases.
Input invariant: `Bin` MUST start with `"` — all 4 call sites
guarantee this via `strip_keyword_space/1` or a guard. Passing anything else is
a contract violation and crashes with `function_clause`.
Phase 1 (`decode_chars/2`) walks the string emitting tagged `chunk()`s: literal
already-UTF-8 text becomes `{utf8, _}`; a `\\xHH`/`\\OOO` escape becomes `{raw, Byte}`
in the code space of the declared charset. Phase 2 (`reassemble_field/2`)
transcodes the runs of contiguous raw bytes through the charset and interleaves
them with the UTF-8 chunks. Grouping is essential: in UTF-8 a multibyte
codepoint is written as CONSECUTIVE escapes (`\\xC3\\xBF` = U+00FF) and must be
validated as one unit — a lone `\\xFF` becomes `{error, {escape_invalid_utf8, _}}`,
never a leaked invalid byte.
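
The grouping rule can be checked against the stdlib transcoder directly. This
shell sketch passes in the bytes that the escapes would produce; it is an
illustration, not the module's own code path:

```erlang
%% Two consecutive escape bytes validated as ONE unit: U+00FF in UTF-8.
<<195, 191>> = unicode:characters_to_binary(<<16#C3, 16#BF>>, utf8, utf8).
%% A lone 0xFF is never valid UTF-8.
{error, _, _} = unicode:characters_to_binary(<<16#FF>>, utf8, utf8).
%% A lead byte without its continuation byte is reported as incomplete.
{incomplete, _, _} = unicode:characters_to_binary(<<16#C3>>, utf8, utf8).
%% In a latin1 catalog the same lone byte is a codepoint of its own.
<<195, 191>> = unicode:characters_to_binary(<<16#FF>>, latin1, utf8).
```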
""".
-spec decode_quoted_string(binary(), charset()) ->
{ok, binary()} | {error, term()}.
decode_quoted_string(<<$", Rest/binary>>, Charset) ->
case decode_chars(Rest, []) of
{ok, ChunksRev} -> reassemble_field(ChunksRev, Charset);
{error, _} = E -> E
end.
%% Accumulates `[chunk()]` in REVERSE order (newest first), like the rest
%% of the module's accumulators.
-spec decode_chars(binary(), [chunk()]) ->
{ok, [chunk()]} | {error, term()}.
decode_chars(<<$">>, Acc) ->
{ok, Acc};
decode_chars(<<$", Rest/binary>>, Acc) ->
case is_only_trailing_ws(Rest) of
true -> {ok, Acc};
false -> {error, content_after_close_quote}
end;
decode_chars(<<$\\, Rest/binary>>, Acc) ->
case decode_escape(Rest) of
{ok, Chunk, Rest2} -> decode_chars(Rest2, [Chunk | Acc]);
{error, _} = E -> E
end;
decode_chars(<<C/utf8, Rest/binary>>, Acc) ->
%% The `/utf8` pattern only matches a well-formed codepoint, so keep it
%% as a ready-made UTF-8 chunk.
decode_chars(Rest, [{utf8, <<C/utf8>>} | Acc]);
decode_chars(<<>>, _Acc) ->
{error, unterminated_string};
decode_chars(<<_Byte, _/binary>>, _Acc) ->
{error, invalid_utf8}.
%% "Literal" C escapes (\n \t \" ...) are ASCII, so they are trivially
%% valid UTF-8 and become `{utf8, _}` chunks. Only `\xHH`/`\OOO` produce a
%% `{raw, Byte}` chunk interpreted later in the declared charset.
-spec decode_escape(binary()) ->
{ok, chunk(), binary()} | {error, term()}.
decode_escape(<<$n, R/binary>>) ->
{ok, {utf8, <<$\n>>}, R};
decode_escape(<<$t, R/binary>>) ->
{ok, {utf8, <<$\t>>}, R};
decode_escape(<<$r, R/binary>>) ->
{ok, {utf8, <<$\r>>}, R};
decode_escape(<<$", R/binary>>) ->
{ok, {utf8, <<$">>}, R};
decode_escape(<<$\\, R/binary>>) ->
{ok, {utf8, <<$\\>>}, R};
decode_escape(<<$b, R/binary>>) ->
{ok, {utf8, <<$\b>>}, R};
decode_escape(<<$f, R/binary>>) ->
{ok, {utf8, <<$\f>>}, R};
decode_escape(<<$v, R/binary>>) ->
{ok, {utf8, <<$\v>>}, R};
decode_escape(<<$a, R/binary>>) ->
{ok, {utf8, <<7>>}, R};
decode_escape(<<$/, R/binary>>) ->
{ok, {utf8, <<$/>>}, R};
decode_escape(<<$?, R/binary>>) ->
{ok, {utf8, <<$?>>}, R};
decode_escape(<<$', R/binary>>) ->
{ok, {utf8, <<$'>>}, R};
decode_escape(<<$x, R/binary>>) ->
decode_hex_escape(R, <<>>, 0);
decode_escape(<<C, R/binary>>) when C >= $0, C =< $7 ->
decode_octal_escape(R, <<C>>, 1);
decode_escape(<<C, _/binary>>) ->
{error, {unknown_escape, C}};
decode_escape(<<>>) ->
{error, dangling_backslash}.
%% `\xHH` -> {raw, Byte}: the byte is interpreted later in the declared
%% charset (`reassemble_field/2`), not spliced raw.
-spec decode_hex_escape(binary(), binary(), 0..2) ->
{ok, {raw, byte()}, binary()} | {error, term()}.
decode_hex_escape(<<C, R/binary>>, Acc, N) when
N < 2,
((C >= $0 andalso C =< $9) orelse
(C >= $a andalso C =< $f) orelse
(C >= $A andalso C =< $F))
->
decode_hex_escape(R, <<Acc/binary, C>>, N + 1);
decode_hex_escape(R, Acc, _N) when byte_size(Acc) > 0 ->
Byte = binary_to_integer(Acc, 16),
{ok, {raw, Byte}, R};
decode_hex_escape(_, _, _) ->
{error, invalid_hex_escape}.
%% `\OOO` -> {raw, Byte}. In PO a `\OOO` escape is BY DEFINITION a single
%% byte; three octal digits reach 0777 (511), so values > 0xFF are a
%% malformed-escape error rather than a wrap.
-spec decode_octal_escape(binary(), binary(), 1..3) ->
{ok, {raw, byte()}, binary()} | {error, term()}.
decode_octal_escape(<<C, R/binary>>, Acc, N) when
N < 3, C >= $0, C =< $7
->
decode_octal_escape(R, <<Acc/binary, C>>, N + 1);
decode_octal_escape(R, Acc, _N) ->
Int = binary_to_integer(Acc, 8),
case Int =< 16#FF of
true -> {ok, {raw, Int}, R};
false -> {error, {octal_escape_out_of_range, Int}}
end.
%% =========================
%% Phase 2: charset->UTF-8 transcode of escape bytes (finding #11)
%% =========================
%% Takes the reversed chunk list from `decode_chars/2`, groups contiguous
%% raw bytes into runs, transcodes each run through the declared charset,
%% and interleaves with the ready UTF-8 chunks. Grouping is essential: in
%% a UTF-8 catalog a multibyte codepoint is written as CONSECUTIVE escapes
%% (`\xC3\xBF` = U+00FF) and must be validated as one unit.
-spec reassemble_field([chunk()], charset()) ->
{ok, binary()} | {error, escape_error()}.
reassemble_field(ChunksRev, Charset) ->
reassemble(lists:reverse(ChunksRev), Charset, [], []).
%% `RawAcc` collects contiguous raw bytes (reverse order); `Out` collects
%% finished UTF-8 segments (reverse order).
-spec reassemble([chunk()], charset(), [byte()], [binary()]) ->
{ok, binary()} | {error, escape_error()}.
reassemble([{raw, B} | Rest], Charset, RawAcc, Out) ->
reassemble(Rest, Charset, [B | RawAcc], Out);
reassemble([{utf8, Bin} | Rest], Charset, RawAcc, Out) ->
case flush_raw(RawAcc, Charset) of
{ok, Flushed} ->
reassemble(Rest, Charset, [], [Bin, Flushed | Out]);
{error, _} = E ->
E
end;
reassemble([], Charset, RawAcc, Out) ->
case flush_raw(RawAcc, Charset) of
{ok, Flushed} ->
{ok, iolist_to_binary(lists:reverse([Flushed | Out]))};
{error, _} = E ->
E
end.
%% Transcode one run of charset-native raw bytes into UTF-8.
-spec flush_raw([byte()], charset()) ->
{ok, binary()} | {error, escape_error()}.
flush_raw([], _Charset) ->
{ok, <<>>};
flush_raw(RawAccRev, Charset) ->
Bytes = list_to_binary(lists:reverse(RawAccRev)),
transcode_escape_bytes(Bytes, Charset).
-spec transcode_escape_bytes(binary(), charset()) ->
{ok, binary()} | {error, escape_error()}.
transcode_escape_bytes(Bytes, latin1) ->
%% Every byte 0..255 is a valid Latin-1 codepoint; latin1 -> utf8
%% never fails (same contract as `normalize_input/2`).
Out = unicode:characters_to_binary(Bytes, latin1, utf8),
true = is_binary(Out),
{ok, Out};
transcode_escape_bytes(Bytes, us_ascii) ->
%% US-ASCII: a byte >= 0x80 is outside the charset. gettext rejects;
%% we surface a structured error instead of emitting a non-ASCII byte.
case first_non_ascii(Bytes) of
none -> {ok, Bytes};
Bad -> {error, {invalid_escape_charset, us_ascii, Bad}}
end;
transcode_escape_bytes(Bytes, utf8) ->
%% UTF-8 catalog: the raw run MUST itself be valid UTF-8 (e.g.
%% `\xC3\xBF` = U+00FF). A lone `\xFF` -> structured error, parity
%% with msgfmt's "invalid multibyte sequence".
case unicode:characters_to_binary(Bytes, utf8, utf8) of
Out when is_binary(Out) ->
{ok, Out};
{error, _Converted, Rest} ->
{error, {escape_invalid_utf8, Rest}};
{incomplete, _Converted, Rest} ->
{error, {escape_incomplete_utf8, Rest}}
end.
-spec first_non_ascii(binary()) -> none | byte().
first_non_ascii(<<>>) -> none;
first_non_ascii(<<B, _/binary>>) when B > 127 -> B;
first_non_ascii(<<_, R/binary>>) -> first_non_ascii(R).
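%% True when the binary consists solely of whitespace (space, tab, CR, LF),
%% i.e. nothing but trailing whitespace remains; an empty binary qualifies.
is_only_trailing_ws(<<>>) ->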
true;
is_only_trailing_ws(<<C, R/binary>>) when
C =:= $\s;
C =:= $\t;
C =:= $\r;
C =:= $\n
->
is_only_trailing_ws(R);
is_only_trailing_ws(_) ->
false.
%% =========================
%% Dumper (for P1/P2 roundtrip properties)
%% =========================
dump_header(#{raw := <<>>}) ->
%% No raw header text known: emit a minimal one.
Body = ~"Content-Type: text/plain; charset=UTF-8\n",
dump_header_text(Body);
dump_header(#{raw := RawHeader}) ->
dump_header_text(RawHeader);
dump_header(_) ->
%% Tolerate missing keys by emitting a minimal header.
dump_header_text(~"Content-Type: text/plain; charset=UTF-8\n").
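%% Illustrative output of `dump_header_text/1` for the minimal body above,
%% derived by hand from the code below (not captured output):
%%   msgid ""
%%   msgstr ""
%%   "Content-Type: text/plain; charset=UTF-8\n"
%%   <blank line>
%% Each non-empty body line becomes one quoted continuation line ending in
%% a literal `\n` escape; the empty element that `binary:split/3` leaves
%% after the trailing newline is dropped by `encode_header_line(<<>>)`.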
dump_header_text(Body) ->
Lines = binary:split(Body, ~"\n", [global]),
BodyOut = iolist_to_binary([encode_header_line(L) || L <- Lines]),
<<"msgid \"\"\nmsgstr \"\"\n", BodyOut/binary, "\n">>.
encode_header_line(<<>>) ->
<<>>;
encode_header_line(Line) ->
Escaped = escape_string(Line),
<<$", Escaped/binary, "\\n", $", $\n>>.
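%% Illustrative shapes for `dump_entry/1`, worked by hand from the clauses
%% below (the entry values are made up for the example):
%%   dump_entry({singular, undefined, <<"Hello">>, <<"Hi">>})
%%     msgid "Hello"
%%     msgstr "Hi"
%%     <blank line>
%%   dump_entry({plural, <<"menu">>, <<"file">>, <<"files">>,
%%               [{0, <<"a">>}, {1, <<"b">>}]})
%%     msgctxt "menu"
%%     msgid "file"
%%     msgid_plural "files"
%%     msgstr[0] "a"
%%     msgstr[1] "b"
%%     <blank line>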
dump_entry({singular, Ctx, Msgid, Translation}) ->
CtxBin = dump_msgctxt(Ctx),
MsgidBin = dump_field(~"msgid", Msgid),
MsgstrBin = dump_field(~"msgstr", Translation),
<<CtxBin/binary, MsgidBin/binary, MsgstrBin/binary, "\n">>;
dump_entry({plural, Ctx, Msgid, MsgidPlural, Plurals}) ->
CtxBin = dump_msgctxt(Ctx),
MsgidBin = dump_field(~"msgid", Msgid),
%% Finding #14 (dump-drops-msgid-plural-silently): emit the retained
%% `msgid_plural` text. The parsed `entry/0` carries it verbatim, so
%% `parse∘dump` preserves the plural source. When the source had no
%% explicit `msgid_plural` (carried as `undefined`), fall back to the
%% singular `Msgid`: the only sensible stand-in, and the historical
%% behaviour for that degenerate case.
PluralIdSrc =
case MsgidPlural of
undefined -> Msgid;
_ -> MsgidPlural
end,
PluralIdBin = dump_field(~"msgid_plural", PluralIdSrc),
PluralsBin = iolist_to_binary([
dump_plural_form(I, T)
|| {I, T} <- Plurals
]),
<<CtxBin/binary, MsgidBin/binary, PluralIdBin/binary, PluralsBin/binary, "\n">>.
dump_msgctxt(undefined) ->
<<>>;
dump_msgctxt(Ctx) when is_binary(Ctx) ->
dump_field(~"msgctxt", Ctx).
dump_field(Key, Value) ->
Escaped = escape_string(Value),
<<Key/binary, " \"", Escaped/binary, "\"\n">>.
dump_plural_form(Idx, T) ->
IdxBin = integer_to_binary(Idx),
Escaped = escape_string(T),
<<"msgstr[", IdxBin/binary, "] \"", Escaped/binary, "\"\n">>.
-doc """
Escape a raw string for the body of a PO `"..."` value.
Applies the five GNU gettext PO escapes (backslash, double-quote, newline,
tab, carriage return) so the result can be wrapped in `"..."` and parsed back
byte-identically by `parse/1`. This is the exact escaping `dump/1` uses for
every emitted field. It is public API so that the separate `rebar3_erli18n`
plugin package can serialize the PO metadata it owns (notably the `#|`
previous-msgid lines written by `rebar3_erli18n_po_meta`) byte-identically to
`dump/1` across the published `{deps, [erli18n]}` boundary. The alternative,
vendoring a duplicate escaper, would have to stay in lock-step forever.
The five escapes are applied; every other byte passes through unchanged:
```erlang
1> erli18n_po:escape_string(<<"a\"b\nc\td\\e">>).
<<"a\\\"b\\nc\\td\\\\e">>
```
""".
-spec escape_string(binary()) -> binary().
escape_string(Bin) ->
escape_string(Bin, []).
-spec escape_string(binary(), [binary()]) -> binary().
escape_string(<<>>, Acc) ->
bins_to_binary(Acc);
%% Each escaped form is emitted as an explicit two-byte character segment
%% (`<<$\\, $X>>` = a literal backslash followed by `X`) rather than a `"..."`
%% string or a `~"..."` sigil. This is unambiguous for an escape sequence and
%% sidesteps a tooling wart: the equivalent escape-heavy sigils (`~"\\\\"`,
%% `~"\\\""`) are valid Erlang but desync ELP's parser.
escape_string(<<$\\, Rest/binary>>, Acc) ->
escape_string(Rest, [<<$\\, $\\>> | Acc]);
escape_string(<<$", Rest/binary>>, Acc) ->
escape_string(Rest, [<<$\\, $">> | Acc]);
escape_string(<<$\n, Rest/binary>>, Acc) ->
escape_string(Rest, [<<$\\, $n>> | Acc]);
escape_string(<<$\t, Rest/binary>>, Acc) ->
escape_string(Rest, [<<$\\, $t>> | Acc]);
escape_string(<<$\r, Rest/binary>>, Acc) ->
escape_string(Rest, [<<$\\, $r>> | Acc]);
escape_string(<<C/utf8, Rest/binary>>, Acc) ->
escape_string(Rest, [<<C/utf8>> | Acc]);
%% A byte that is not part of a valid UTF-8 sequence (e.g. a lone `0xFF` from a
%% `<<255>>` literal that the extractor pulled from consumer source): pass it
%% through verbatim. This keeps the serializer TOTAL over any `binary()`, as its
%% `-spec binary() -> binary()` promises — without it a non-UTF-8 byte matches no
%% clause and raises `function_clause`. The byte is emitted raw, the same way the
%% PO reader tolerates raw bytes on parse, so `dump/1` never crashes on a catalog
%% value carrying arbitrary bytes.
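%% Illustration (worked by hand): `escape_string(<<$a, 16#FF, $\n>>)` yields
%% `<<$a, 16#FF, $\\, $n>>`; the lone 0xFF survives untouched while the
%% newline is still escaped. This assumes `bins_to_binary/1` concatenates
%% the accumulator in input order, as the doc example implies.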
escape_string(<<Byte, Rest/binary>>, Acc) ->
escape_string(Rest, [<<Byte>> | Acc]).