-module(erli18n_telemetry).
-moduledoc """
erli18n observability surface: a thin wrapper over the `:telemetry` library
that centralizes the event names and shields call sites from the absence of
the optional dependency.
## What it is and what problem it solves
`telemetry` is an **optional** dependency of erli18n (declared via
`optional_applications`, OTP 24+): the lib works with or without it. This
module is the only layer that knows this. It solves three problems for the
rest of the code base:
- **Safe indirection.** Call sites (`erli18n_server`, the lookup hot path)
call `emit/3`/`span/3` without ever testing whether `telemetry` is present.
When the lib is not loaded, both become no-ops — zero crash, zero noise —
instead of scattering `case code:ensure_loaded(...)` everywhere.
- **Name contract.** All erli18n event names live here, exposed as pre-typed
`event_*/0` functions. A rename or audit is a one-file change. The names are
the **public contract** of observability (convention
`[<lib>, <operation>, <phase>]`, in the style of `Phoenix.Logger`).
- **Overhead and security policy.** The high-frequency lookup events
(`miss`/`fuzzy_skip`) are opt-in (flag `emit_lookup_telemetry`, default
`false`) — this minimizes both the cost and the risk of leaking msgid
content in a multi-tenant scenario. The `memory_warning` is rate-limited: at
most one emission per configured window.
## Mental model
Think of two layers, both **lock-free from any process**:
- **Telemetry detection (sticky-positive cache).** The first call performs
`code:ensure_loaded(telemetry)`, which walks the code server. If it loads,
the `true` result is stored in `persistent_term` and stays sticky for the
rest of the VM's lifetime (telemetry does not unload at runtime). If it does
**not** load, the `false` result is **not** cached: that way, if the
consumer brings telemetry up mid-flight (`application:start(telemetry)`),
the next emission already sees it. The price of this choice is, at most, one
`code:ensure_loaded/1` per emission while telemetry is absent
(microseconds), and zero per emission once present.
- **Configuration via `application:get_env/3`.** The flags
(`emit_lookup_telemetry`, `memory_warning_threshold`,
`memory_warning_rate_limit_seconds`) are read on every call — a direct read
in the application controller's ETS (~100 ns). There is no per-process state
and no caching of these flags.
Trusted vs untrusted: the rate-limit `persistent_term` key is **private** to
this module. The functions narrow its value at the boundary; if something
outside reuses the key and writes a non-integer, the code crashes explicitly
instead of operating on garbage. Invalid configuration values (non-boolean,
negative integer) also crash with `{invalid_config, ...}` — a loud, visible
failure, never silent.
## When a dev touches this module
- **Observability consumer** (attaches handlers): use the `event_*/0` names in
`telemetry:attach/4`. Do not call `emit/3`/`span/3` directly — erli18n is
what emits.
- **Core maintainer** (`erli18n_server`, hot path): call `span/3` to
instrument operations with start/stop (load/reload), `emit/3` for pointwise
events, and `lookup_telemetry_enabled/0` to gate the lookup events before
building expensive payloads. The loader calls `memory_warning_check/1`.
## Quickstart (consumer)
```erlang
%% Attach a handler to the catalog-load events:
1> telemetry:attach_many(
.. <<"erli18n-log">>,
.. [erli18n_telemetry:event_catalog_load(),
.. erli18n_telemetry:event_catalog_load() ++ [stop]],
.. fun(Event, Measurements, Meta, _Cfg) ->
.. io:format("~p ~p ~p~n", [Event, Measurements, Meta])
.. end,
.. undefined).
ok
%% Lookup events are opt-in; enable them explicitly:
2> application:set_env(erli18n, emit_lookup_telemetry, true).
ok
3> erli18n_telemetry:lookup_telemetry_enabled().
true
```
## Key functions
- Emission: `emit/3` (pointwise), `span/3` (start/stop/exception).
- Event names: `event_catalog_load/0`, `event_catalog_reload/0`,
`event_catalog_unload/0`, `event_lookup_miss/0`, `event_lookup_fuzzy_skip/0`,
`event_locale_fallback/0`, `event_plural_divergence/0`,
`event_catalog_memory_warning/0`.
- Configuration/gating: `lookup_telemetry_enabled/0`,
`memory_warning_threshold/0`, `memory_warning_rate_limit_seconds/0`,
`memory_warning_check/1`.
## References
- Library: <https://github.com/beam-telemetry/telemetry>
- Hexdocs: <https://hexdocs.pm/telemetry/>
- `span/3`: <https://hexdocs.pm/telemetry/telemetry.html#span-3>
- Naming convention `[<lib>, <operation>, <phase>]`:
<https://hexdocs.pm/phoenix/Phoenix.Logger.html>
- `persistent_term` (lock-free, copy-free across processes):
<https://www.erlang.org/doc/man/persistent_term.html>
""".
%% ============================================================================
%% erli18n_telemetry — thin wrapper over the `:telemetry` library.
%%
%% Responsibilities:
%%
%% * Encapsulate the runtime presence/absence of the `telemetry` module
%% so call sites never have to branch. If `telemetry` is not loaded,
%% `emit/3` and `span/3` are no-ops (zero crash, zero noise).
%%
%% * Centralize all `erli18n` event names (`[erli18n, catalog, load]`,
%% etc.) so a future rename or audit is a one-file change. The event
%% names are the **public contract** of the lib's observability
%% surface (naming convention and catalogue of events).
%%
%% * Provide opt-in/opt-out gating for high-frequency events
%% (lookup miss / fuzzy_skip) via the `emit_lookup_telemetry`
%% application env flag. Always-on events bypass the gate (overhead
%% policy).
%%
%% * Provide a rate-limited memory-warning check used by the loader
%% to emit `[erli18n, catalog, memory_warning]` at most once per
%% configured window (RISK-011 mitigation 2: "once per crossing
%% event, not on every tick").
%%
%% References:
%%
%% * Library: https://github.com/beam-telemetry/telemetry
%% * Hexdocs: https://hexdocs.pm/telemetry/
%% * span/3: https://hexdocs.pm/telemetry/telemetry.html#span-3
%% * Phoenix: https://hexdocs.pm/phoenix/Phoenix.Logger.html — naming
%% convention `[<lib>, <operation>, <phase>]` source.
%%
%% Performance note (`code:ensure_loaded/1` cache):
%%
%% The first call walks the code path to confirm whether `telemetry`
%% is loadable; subsequent calls hit a `persistent_term` entry
%% (~sub-microsecond, lock-free, copy-free across processes — see
%% https://www.erlang.org/doc/man/persistent_term.html). The cache is
%% invalidated only when the result is `false` (in case the consumer
%% starts telemetry mid-flight); positive results are sticky for the
%% VM's lifetime.
%% ============================================================================
%% Emission API.
-export([emit/3, span/3]).
%% Convenience: pre-typed event names.
-export([
event_catalog_load/0,
event_catalog_reload/0,
event_catalog_unload/0,
event_lookup_miss/0,
event_lookup_fuzzy_skip/0,
event_locale_fallback/0,
event_plural_divergence/0,
event_catalog_memory_warning/0
]).
%% Configuration / gating.
-export([
lookup_telemetry_enabled/0,
memory_warning_threshold/0,
memory_warning_rate_limit_seconds/0,
memory_warning_check/1
]).
%% Test-only — exposed so the SUITE can reset the persistent_term cache
%% between cases. Not in the documented API surface.
-export([reset_caches/0]).
-export_type([
event_name/0,
measurements/0,
metadata/0,
span_fun/0,
span_result/0
]).
%% =========================
%% Types
%% =========================
-doc """
Name of a telemetry event: a list of atoms in the format
`[<lib>, <operation>, <phase>]` (e.g. `[erli18n, catalog, load]`). It is the
type returned by all `event_*/0` functions and the one accepted by
`emit/3`/`span/3`. The list contains the atoms of the erli18n vocabulary and
admits a free `atom()` in the tail for extensions (e.g. the `start`/`stop`
suffix that `span/3` appends).
""".
%% Event name shapes.
-type event_name() ::
[
erli18n
| catalog
| lookup
| plural
| locale
| load
| reload
| unload
| miss
| fuzzy_skip
| fallback
| divergence_warning
| memory_warning
| atom()
].
-doc """
Map of an event's numeric **measurements** (e.g. `#{duration => N}`,
`#{ets_bytes => N}`). Structurally it is just a `map()`; the telemetry
convention is that measurements are aggregable values, distinct from
qualitative metadata.
""".
-type measurements() :: map().
-doc """
Map of an event's qualitative **metadata** (e.g. domain, locale,
`domain_locales_sample` sample). Structurally it is just a `map()`; it carries
context, not aggregable values.
""".
-type metadata() :: map().
-doc """
Body of a span: a fun/0 that **must** return `{Result, StopMetadata}`, per the
contract of `telemetry:span/3`. `Result` is propagated back by `span/3`;
`StopMetadata` is merged into the `stop` event's metadata (or discarded on the
no-op path, when telemetry is absent).
""".
%% Span body must return `{Result, StopMetadata}` per
%% https://hexdocs.pm/telemetry/telemetry.html#span-3.
-type span_fun() :: fun(() -> {term(), metadata()}).
-doc "Return value of `span/3`: the `Result` produced by `span_fun/0`.".
-type span_result() :: term().
%% =========================
%% Cache keys (persistent_term)
%% =========================
%% Sticky-true cache for the loaded check.
-define(LOADED_KEY, {?MODULE, telemetry_loaded}).
%% Rate-limit anchor for memory_warning emission.
-define(MEM_WARN_LAST_KEY, {?MODULE, memory_warning_last_emit}).
%% =========================
%% Event names
%% =========================
-doc """
Event prefix of a catalog's **load span** (`ensure_loaded`):
`[erli18n, catalog, load]`. Since it is a span prefix (via `span/3`), the
events actually emitted have the `start`/`stop`/`exception` suffix appended.
```erlang
1> erli18n_telemetry:event_catalog_load().
[erli18n,catalog,load]
```
Siblings: `event_catalog_reload/0`, `event_catalog_unload/0`.
""".
-spec event_catalog_load() -> event_name().
event_catalog_load() ->
[erli18n, catalog, load].
-doc """
Event prefix of a catalog's **atomic reload span**:
`[erli18n, catalog, reload]`. As a span prefix, it receives the
`start`/`stop`/`exception` suffix at runtime.
```erlang
1> erli18n_telemetry:event_catalog_reload().
[erli18n,catalog,reload]
```
Siblings: `event_catalog_load/0`, `event_catalog_unload/0`.
""".
-spec event_catalog_reload() -> event_name().
event_catalog_reload() ->
[erli18n, catalog, reload].
-doc """
Name of the pointwise catalog **unload** event:
`[erli18n, catalog, unload]`. Emitted via `emit/3` (not a span).
```erlang
1> erli18n_telemetry:event_catalog_unload().
[erli18n,catalog,unload]
```
Siblings: `event_catalog_load/0`, `event_catalog_reload/0`.
""".
-spec event_catalog_unload() -> event_name().
event_catalog_unload() ->
[erli18n, catalog, unload].
-doc """
Name of the **lookup miss** event (key not found in the catalog):
`[erli18n, lookup, miss]`. A **high-frequency** event and therefore **opt-in**
— only emitted when `lookup_telemetry_enabled/0` returns `true`. Keeping the
default off also avoids exposing msgid content in a multi-tenant scenario.
```erlang
1> erli18n_telemetry:event_lookup_miss().
[erli18n,lookup,miss]
```
Sibling: `event_lookup_fuzzy_skip/0`. Gate: `lookup_telemetry_enabled/0`.
""".
-spec event_lookup_miss() -> event_name().
event_lookup_miss() ->
[erli18n, lookup, miss].
-doc """
Name of the **fuzzy entry skip** event in lookup (an entry marked
`#, fuzzy` in the `.po`, which gettext ignores): `[erli18n, lookup, fuzzy_skip]`.
A **high-frequency** event, **opt-in** under the same flag as the misses
(`lookup_telemetry_enabled/0`).
```erlang
1> erli18n_telemetry:event_lookup_fuzzy_skip().
[erli18n,lookup,fuzzy_skip]
```
Sibling: `event_lookup_miss/0`. Gate: `lookup_telemetry_enabled/0`.
""".
-spec event_lookup_fuzzy_skip() -> event_name().
event_lookup_fuzzy_skip() ->
[erli18n, lookup, fuzzy_skip].
-doc """
Name of the **locale fallback** event: `[erli18n, locale, fallback]`. Emitted
(Phase 2) when an exact-locale lookup MISSES but the opt-in canonicalization-
aware fallback chain resolves the translation from a less-specific or
canonicalized locale (`pt_BR` → `pt`). The low-frequency, interesting signal
"a non-exact locale served a translation" — distinct from a true
`[erli18n, lookup, miss]` (whole chain missed).
**Opt-in** under the SAME flag as the lookup events
(`lookup_telemetry_enabled/0`): fallback resolution is by construction a
sub-event of a lookup miss, so it shares the switch and the multi-tenant
msgid-exposure policy. Off the exact-hit path entirely.
Measurements: `#{count => 1, chain_depth => non_neg_integer()}` (depth = the
0-based position in the chain of the candidate that hit). Metadata:
`#{domain, requested_locale, resolved_locale, function, context}`.
```erlang
1> erli18n_telemetry:event_locale_fallback().
[erli18n,locale,fallback]
```
Gate: `lookup_telemetry_enabled/0`. Sibling: `event_lookup_miss/0`.
""".
-spec event_locale_fallback() -> event_name().
event_locale_fallback() ->
[erli18n, locale, fallback].
-doc """
Name of the **plural divergence warning** event:
`[erli18n, plural, divergence_warning]`. Emitted at load time when the
`Plural-Forms` rule in the `.po` header diverges from the CLDR rule inlined for
the locale (an informative validation — the `.po` header remains the source of
truth at runtime). Always on (does not go through the lookup flag).
```erlang
1> erli18n_telemetry:event_plural_divergence().
[erli18n,plural,divergence_warning]
```
""".
-spec event_plural_divergence() -> event_name().
event_plural_divergence() ->
[erli18n, plural, divergence_warning].
-doc """
Name of the **memory warning** event: `[erli18n, catalog, memory_warning]`.
Emitted by `memory_warning_check/1` when the catalogs' storage usage crosses
`memory_warning_threshold/0`, **rate-limited** to at most one emission per
`memory_warning_rate_limit_seconds/0`. Always on (does not go through the
lookup flag).
```erlang
1> erli18n_telemetry:event_catalog_memory_warning().
[erli18n,catalog,memory_warning]
```
Emitter: `memory_warning_check/1`.
""".
-spec event_catalog_memory_warning() -> event_name().
event_catalog_memory_warning() ->
[erli18n, catalog, memory_warning].
%% =========================
%% Emission
%% =========================
%% Pointwise emit. Safe no-op when `telemetry` is unavailable. The naked
%% `erlang:apply/3` indirection is intentional: dialyzer treats the call
%% as an unknown remote function when `telemetry` is genuinely absent
%% from the PLT, which matches the runtime story exactly.
-doc """
Emits a **pointwise** telemetry event (no start/stop semantics; for that use
`span/3`).
Parameters:
- `EventName` — the event name, typically one of the `event_*/0` (e.g.
`event_catalog_unload/0`). Must be a list.
- `Measurements` — map of numeric/aggregable measurements. Must be a map.
- `Metadata` — map of qualitative metadata. Must be a map.
Behavior and return: if `telemetry` is loaded (see the sticky detection in the
moduledoc), it delegates to `telemetry:execute/3`; otherwise it is a **safe
no-op**. On both paths it always returns `ok` — the result of
`telemetry:execute/3` is discarded on purpose.
Failure modes: the clause is guarded (`is_list`/`is_map`/`is_map`); calling
with the wrong types results in `function_clause` (caller crash). The
`erlang:apply(telemetry, execute, ...)` indirection is **intentional**: it
makes dialyzer treat the call as an unknown remote function when `telemetry`
is genuinely absent from the PLT, mirroring the runtime story.
```erlang
1> erli18n_telemetry:emit(
.. erli18n_telemetry:event_catalog_unload(),
.. #{count => 1},
.. #{domain => my_domain, locale => <<"fr">>}).
ok
```
The no-op path does not depend on `telemetry` being **loaded in memory**, but
on `telemetry` being **absent from the code path** — detection (see
`telemetry_loaded/0` / moduledoc) uses `code:ensure_loaded(telemetry)`, which
would load the module from the code path if it existed there. In other words:
`code:is_loaded(telemetry) =:= false` does **not** make `emit/3` a no-op (the
module would still be loaded and the event emitted). The no-op only occurs when
the `telemetry` app is not in the release/code path; in that scenario the same
call returns `ok` without emitting anything.
Sibling: `span/3` (events with start/stop).
""".
-spec emit(event_name(), measurements(), metadata()) -> ok.
emit(EventName, Measurements, Metadata) when
is_list(EventName), is_map(Measurements), is_map(Metadata)
->
case telemetry_loaded() of
true ->
_ = erlang:apply(
telemetry,
execute,
[EventName, Measurements, Metadata]
),
ok;
false ->
ok
end.
%% Span emit. Matches the `:telemetry.span/3` contract:
%% * Emits `EventPrefix ++ [start]` with measurements
%% `#{monotonic_time, system_time}` and StartMetadata.
%% * Runs Fun, which must return `{Result, StopMetadata}`.
%% * Emits `EventPrefix ++ [stop]` with measurements
%% `#{monotonic_time, duration}` and (StartMetadata merged with
%% StopMetadata).
%% * On exception, emits `EventPrefix ++ [exception]` instead of stop,
%% with `#{kind, reason, stacktrace}` merged into StartMetadata.
%%
%% Reference: https://hexdocs.pm/telemetry/telemetry.html#span-3.
%%
%% Always-on path (telemetry loaded): we delegate to `:telemetry.span/3`
%% to avoid duplicating the implementation, which keeps measurement
%% semantics byte-equal to what `:telemetry` users expect.
%%
%% No-op path (telemetry absent): we still run Fun (otherwise the lib
%% would behave differently with vs without telemetry — unacceptable).
%% We discard StopMetadata because there's nothing to emit it to.
-doc """
Runs `Fun` instrumented as a telemetry **span**, following the contract of
`telemetry:span/3` (events with start, stop, and exception).
Parameters:
- `EventPrefix` — the event prefix (e.g. `event_catalog_load/0`). Telemetry
appends `start`/`stop`/`exception` to this prefix. Must be a list.
- `StartMetadata` — metadata already available in the `start` event (and
merged into `stop`). Must be a map.
- `Fun` — the span body, a fun/0 that **MUST** return `{Result, StopMetadata}`
(see `span_fun/0`).
Contract semantics (path with telemetry loaded): emits
`EventPrefix ++ [start]` with measurements `#{monotonic_time, system_time}`;
runs `Fun`; emits `EventPrefix ++ [stop]` with `#{monotonic_time, duration}`
and `StartMetadata` merged with `StopMetadata`. If `Fun` raises an exception,
it emits `EventPrefix ++ [exception]` (with `#{kind, reason, stacktrace}` in
the metadata) instead of `stop`, and the exception re-propagates. It delegates
to `telemetry:span/3` to keep the measurements byte-equal to what `:telemetry`
users expect.
No-op path semantics (telemetry absent): it **still runs `Fun`** — otherwise
the lib would behave differently with vs without telemetry, which is
unacceptable — and discards `StopMetadata` (there is nowhere to emit it). No
event is emitted.
Return: on both paths, the `Result` produced by `Fun` (see `span_result/0`).
Failure modes: guarded clause (`is_list`/`is_map`/`is_function(Fun, 0)`); wrong
types => `function_clause`. If `Fun` does not return a `{Result, StopMetadata}`
tuple, both paths crash, but **asymmetrically** with respect to the events
already emitted:
- **No-op path (telemetry absent):** crashes with `badmatch` at
`{Result, _StopMetadata} = Fun()` **before** any emission — no event goes
out (consistent with the no-op never emitting anything).
- **Path with telemetry:** `telemetry:span/3` has already emitted the
`EventPrefix ++ [start]` event **before** inspecting `Fun`'s return, so the
consumer sees an **orphan** `start` (without a matching `stop` or
`exception`) followed by the crash inside the `telemetry` lib itself when
matching the invalid shape. This is exactly the symptom to look for when
debugging `start` events without a `stop`.
```erlang
1> erli18n_telemetry:span(
.. erli18n_telemetry:event_catalog_load(),
.. #{domain => my_domain, locale => <<"fr">>},
.. fun() ->
.. Result = do_load(), %% instrumented work
.. {Result, #{entries => 128}} %% {Result, StopMetadata}
.. end).
Result
```
Sibling: `emit/3` (pointwise events).
""".
-spec span(event_name(), metadata(), span_fun()) -> span_result().
span(EventPrefix, StartMetadata, Fun) when
is_list(EventPrefix), is_map(StartMetadata), is_function(Fun, 0)
->
case telemetry_loaded() of
true ->
erlang:apply(
telemetry,
span,
[EventPrefix, StartMetadata, Fun]
);
false ->
{Result, _StopMetadata} = Fun(),
Result
end.
%% =========================
%% Configuration / gating
%% =========================
%% Opt-in flag for the high-frequency lookup events.
%%
%% `application:get_env/3` lookup is an ETS-direct read in the OTP
%% application controller (~100 ns), comparable to telemetry's own no-op
%% overhead. The flag eliminates the overhead of an attached handler, not
%% the overhead of looking up the flag — that is the theoretical limit of
%% the design.
-doc """
Gate for the high-frequency lookup events (`event_lookup_miss/0` and
`event_lookup_fuzzy_skip/0`). Call sites call this function **before** building
expensive payloads, so that the overhead only exists when the operator opts in.
Reads the app env `emit_lookup_telemetry` (default `false` — opt-in, also for
multi-tenant security reasons). The read is a direct access to the application
controller's ETS (~100 ns); this function does **not** eliminate the overhead
of looking up the flag itself, only that of having handlers attached — it is
the theoretical limit of the design.
Return and failure modes: `true` for `true`, `false` for `false`. Any **other**
configured value is a configuration error and triggers an explicit crash with
`error({invalid_config, {erli18n, emit_lookup_telemetry, Other, expected,
boolean}})` — a loud, visible failure, never a silent "treat as false".
```erlang
1> erli18n_telemetry:lookup_telemetry_enabled().
false
2> application:set_env(erli18n, emit_lookup_telemetry, true).
ok
3> erli18n_telemetry:lookup_telemetry_enabled().
true
4> application:set_env(erli18n, emit_lookup_telemetry, "yes").
ok
5> erli18n_telemetry:lookup_telemetry_enabled().
** exception error: {invalid_config,{erli18n,emit_lookup_telemetry,"yes",expected,boolean}}
```
Siblings (config): `memory_warning_threshold/0`,
`memory_warning_rate_limit_seconds/0`.
""".
-spec lookup_telemetry_enabled() -> boolean().
lookup_telemetry_enabled() ->
case application:get_env(erli18n, emit_lookup_telemetry, false) of
true -> true;
false -> false;
Other -> error({invalid_config, {erli18n, emit_lookup_telemetry, Other, expected, boolean}})
end.
%% Bytes threshold for memory_warning. Default 100 MiB (104857600).
-doc """
Threshold, in **bytes**, of the catalogs' storage usage above which
`event_catalog_memory_warning/0` becomes eligible. Compared against `ets_bytes`
inside `memory_warning_check/1` with a strict `>` (equaling the threshold does
**not** fire).
Reads the app env `memory_warning_threshold` (default `104857600`, 100 MiB).
Return and failure modes: a valid `non_neg_integer()`. Any value that is not an
integer `>= 0` (negative, non-integer) triggers a crash with
`error({invalid_config, {erli18n, memory_warning_threshold, Other, expected,
non_neg_integer}})`.
```erlang
1> erli18n_telemetry:memory_warning_threshold().
104857600
2> application:set_env(erli18n, memory_warning_threshold, 52428800).
ok
3> erli18n_telemetry:memory_warning_threshold().
52428800
4> application:set_env(erli18n, memory_warning_threshold, -1).
ok
5> erli18n_telemetry:memory_warning_threshold().
** exception error: {invalid_config,{erli18n,memory_warning_threshold,-1,expected,non_neg_integer}}
```
Consumer: `memory_warning_check/1`. Sibling: `memory_warning_rate_limit_seconds/0`.
""".
-spec memory_warning_threshold() -> non_neg_integer().
memory_warning_threshold() ->
case application:get_env(erli18n, memory_warning_threshold, 104857600) of
N when is_integer(N), N >= 0 -> N;
Other ->
error(
{invalid_config,
{erli18n, memory_warning_threshold, Other, expected, non_neg_integer}}
)
end.
%% Window (seconds) between successive memory_warning emits.
-doc """
Window, in **seconds**, between successive emissions of
`event_catalog_memory_warning/0`. Even if the threshold is crossed on every
load, `memory_warning_check/1` only re-emits after this window has elapsed
since the last emission (mitigation: "once per crossing event, not on every
tick").
Reads the app env `memory_warning_rate_limit_seconds` (default `60`).
Return and failure modes: a valid `non_neg_integer()`. A value that is not an
integer `>= 0` triggers a crash with `error({invalid_config, {erli18n,
memory_warning_rate_limit_seconds, Other, expected, non_neg_integer}})`. A
value of `0` makes every crossing re-emit (a degenerate window, with no
effective rate limit).
```erlang
1> erli18n_telemetry:memory_warning_rate_limit_seconds().
60
2> application:set_env(erli18n, memory_warning_rate_limit_seconds, 300).
ok
3> erli18n_telemetry:memory_warning_rate_limit_seconds().
300
```
Consumer: `memory_warning_check/1`. Sibling: `memory_warning_threshold/0`.
""".
-spec memory_warning_rate_limit_seconds() -> non_neg_integer().
memory_warning_rate_limit_seconds() ->
case application:get_env(erli18n, memory_warning_rate_limit_seconds, 60) of
N when is_integer(N), N >= 0 -> N;
Other ->
error(
{invalid_config,
{erli18n, memory_warning_rate_limit_seconds, Other, expected, non_neg_integer}}
)
end.
%% Inspect the given memory_info and emit a single memory_warning event
%% if the threshold is crossed and the rate-limit window has elapsed.
%%
%% Returns:
%% * `not_warned` — threshold not crossed.
%% * `rate_limited` — threshold crossed but a warning was already
%% emitted within the rate-limit window.
%% * `warned` — a `[erli18n, catalog, memory_warning]` event
%% was just emitted.
%%
%% Rate-limit storage uses `persistent_term` so the check is lock-free
%% from any process. The cost of storing a single integer is one VM
%% global GC at update time — acceptable because the update only happens
%% on actual emit (rare, by design).
-doc """
Inspects the `MemInfo` memory snapshot and emits **at most one**
`event_catalog_memory_warning/0`, deciding among not-warning, suppressing by
rate-limit, or warning. Called by the loader (`erli18n_server`) at the end of a
successful load.
Parameter:
- `MemInfo` — a snapshot map. The keys read are `ets_bytes` (catalog storage
usage in bytes; the field name is historical — storage is persistent_term; the
trigger; default `0` if absent), `num_catalogs` and `num_keys` (only used in
the measurement when warning; default `0`). Must be a map, otherwise
`function_clause`.
Decision logic:
1. If `ets_bytes` is **not** `>` `memory_warning_threshold/0`, returns
`not_warned` (strict `>` comparison).
2. Otherwise, if the `memory_warning_rate_limit_seconds/0` window has **not**
yet elapsed since the last emission, returns `rate_limited` without emitting.
3. Otherwise, writes the current instant to the anchor, builds the sample and
emits via `emit/3`, returning `warned`.
Side effects: the rate-limit anchor is a **private** key in `persistent_term`
(lock-free from any process), updated **only** on an actual emission.
Rewriting the key via `persistent_term:put/2` may trigger GC work proportional
to the processes that still hold references to the **previous** value of this
key — not an unconditional global full GC of the VM. Here that is cheap (the
previous value is a single timestamp integer, with no long-lived holders) and,
moreover, it only happens on the `warned` path (rare, by design), so the cost
is acceptable. The payload of the warned event has:
- measurements `#{ets_bytes, threshold_bytes, num_catalogs, num_keys}`;
- metadata `#{domain_locales_sample => [...]}`, a sample of up to 10
`{Domain, Locale}` pairs (payload bound in a multi-tenant deployment),
collected by `collect_domain_locales_sample/0`.
Failure modes: if `ets_bytes` or the counters are non-numeric, the `>` or the
construction of the measurements crash. If the `persistent_term` anchor holds a
non-integer (someone reusing the private key — a contract violation), the
boundary crashes with `{invalid_persistent_term, ...}` instead of operating on
garbage.
```erlang
%% Below the default threshold (100 MiB): nothing happens.
1> erli18n_telemetry:memory_warning_check(#{ets_bytes => 1024}).
not_warned
%% Above the threshold: the first call warns...
2> erli18n_telemetry:memory_warning_check(
.. #{ets_bytes => 209715200, num_catalogs => 3, num_keys => 4096}).
warned
%% ...and the next one, within the rate-limit window, is suppressed.
3> erli18n_telemetry:memory_warning_check(#{ets_bytes => 209715200}).
rate_limited
```
Config: `memory_warning_threshold/0`, `memory_warning_rate_limit_seconds/0`.
Event: `event_catalog_memory_warning/0`. In tests, `reset_caches/0` zeroes the
anchor.
""".
-spec memory_warning_check(map()) -> not_warned | rate_limited | warned.
memory_warning_check(MemInfo) when is_map(MemInfo) ->
Threshold = memory_warning_threshold(),
Bytes = maps:get(ets_bytes, MemInfo, 0),
case Bytes > Threshold of
false ->
not_warned;
true ->
Now = erlang:system_time(second),
Window = memory_warning_rate_limit_seconds(),
%% `persistent_term:get/2` returns `term()`. The value under
%% `?MEM_WARN_LAST_KEY` is only ever written by this module
%% with `persistent_term:put(?MEM_WARN_LAST_KEY, Now)` where
%% `Now = erlang:system_time(second) :: integer()`, and the
%% default we pass is the integer `0`. Narrow at the boundary
%% so arithmetic is type-checked; a non-integer would mean
%% someone is reusing our private key — contract violation,
%% crash explicitly.
Last =
case persistent_term:get(?MEM_WARN_LAST_KEY, 0) of
L when is_integer(L) -> L;
Other ->
error(
{invalid_persistent_term,
{?MEM_WARN_LAST_KEY, Other, expected, integer}}
)
end,
case (Now - Last) < Window of
true ->
rate_limited;
false ->
persistent_term:put(?MEM_WARN_LAST_KEY, Now),
Sample = collect_domain_locales_sample(),
emit(
event_catalog_memory_warning(),
#{
ets_bytes => Bytes,
threshold_bytes => Threshold,
num_catalogs => maps:get(num_catalogs, MemInfo, 0),
num_keys => maps:get(num_keys, MemInfo, 0)
},
%% The memory_warning metadata carries
%% `domain_locales_sample`: up to 10
%% `{Domain, Locale}` tuples to bound payload
%% size in multi-tenant deployments.
%% `erli18n_server:loaded_catalogs/0` is a
%% caller-process persistent_term scan — safe to call from
%% any process, including the server itself,
%% because it never re-enters the gen_server.
#{domain_locales_sample => Sample}
),
warned
end
end.
-doc """
Internal. Collects the `domain_locales_sample` sample for the
`memory_warning`: up to 10 `{Domain, Locale}` pairs from the loaded catalogs.
Invariants and safety for the maintainer:
- Guarded by `erlang:function_exported(erli18n_server, loaded_catalogs, 0)`: if
the server is not present (e.g. module not loaded in isolated tests), it
returns `[]` instead of crashing.
- `erli18n_server:loaded_catalogs/0` is a persistent_term scan in the **caller
process** — safe to call from any process, **including the gen_server itself**,
because it never re-enters the `gen_server` (no deadlock risk).
- No ordering: the order is whatever the persistent_term scan returns. The contract is an
observability sample, it does not require determinism, and sorting would only
add cost. The limit of 10 (`lists:sublist/2`) bounds the payload size in a
multi-tenant deployment.
""".
%% Sample up to 10 (Domain, Locale) tuples. Order is whatever persistent_term scan
%% returns; we don't sort because the spec doesn't require determinism
%% and sorting would add overhead at no benefit for an observability
%% sample.
collect_domain_locales_sample() ->
case erlang:function_exported(erli18n_server, loaded_catalogs, 0) of
true ->
Catalogs = erli18n_server:loaded_catalogs(),
Pairs = [{D, L} || {D, L, _N} <- Catalogs],
lists:sublist(Pairs, 10);
false ->
[]
end.
%% =========================
%% Test-only helpers
%% =========================
%% Clear both caches so a test can simulate a fresh VM. Not part of the
%% documented API.
-doc """
Test-only: erases the two `persistent_term` keys of this module — the sticky
"telemetry loaded" cache (`?LOADED_KEY`) and the memory_warning rate-limit
anchor (`?MEM_WARN_LAST_KEY`) — simulating a fresh VM between test cases. It is
not part of the documented API surface (do not rely on it in production). It
always returns `ok`.
Useful for making deterministic the tests of `memory_warning_check/1` (which
switches from `warned` to `rate_limited` depending on the anchor) and those of
telemetry detection.
```erlang
1> erli18n_telemetry:reset_caches().
ok
```
""".
-spec reset_caches() -> ok.
reset_caches() ->
_ = persistent_term:erase(?LOADED_KEY),
_ = persistent_term:erase(?MEM_WARN_LAST_KEY),
ok.
%% =========================
%% Internal
%% =========================
-doc """
Internal. Cached "is telemetry loadable?" detection — the heart of the
no-op-safe contract that `emit/3` and `span/3` consult.
Cache protocol (sticky-positive) for the maintainer:
- **Positive hit:** if `?LOADED_KEY` already holds `true` in `persistent_term`,
returns `true` directly (~sub-microsecond, lock-free).
- **First call / not yet resolved:** performs `code:ensure_loaded(telemetry)`,
which walks the code server. On `{module, telemetry}`, it writes `true` to
the cache (sticky for the VM's lifetime — telemetry does not unload at
runtime) and returns `true`.
- **Absent:** returns `false` **without** caching. This is deliberate: if the
consumer brings telemetry up later (`application:start(telemetry)`), the next
call re-checks and starts to see it. Negative caching would be cheaper in the
absent case, but it would break on-the-fly enablement and contradict the
"safe no-op, never crashes" contract.
Cost: at most one `code:ensure_loaded/1` per emission while telemetry is absent
(microseconds); zero per emission once present. `reset_caches/0` erases
`?LOADED_KEY` to force re-detection in tests.
""".
%% Cached "is telemetry loaded?" check.
%%
%% First call: `code:ensure_loaded/1` walks the code server. On success
%% we cache `true` permanently — telemetry doesn't unload at runtime.
%% On failure (module not found, not loadable) we return `false`
%% WITHOUT caching, so that if the consumer brings telemetry up later
%% (`application:start(telemetry)`) the next call observes it.
%%
%% Trade-off: positive-only caching costs at most one
%% `code:ensure_loaded/1` per emit while telemetry is absent
%% (microseconds), and zero per emit once present. Negative caching
%% would be cheaper in the absent case but would prevent on-the-fly
%% enablement, contradicting the "no-op safe, never crashes" contract.
telemetry_loaded() ->
case persistent_term:get(?LOADED_KEY, undefined) of
true ->
true;
undefined ->
case code:ensure_loaded(telemetry) of
{module, telemetry} ->
persistent_term:put(?LOADED_KEY, true),
true;
_ ->
false
end
end.