defmodule ElixirPhpEmailValidator do
@moduledoc """
A bug-for-bug compatible Elixir port of PHP's
`filter_var($email, FILTER_VALIDATE_EMAIL)`.
This library answers exactly one question, the same way PHP does:

    iex> ElixirPhpEmailValidator.valid?("a@b.c")
    true
    iex> ElixirPhpEmailValidator.valid?("a@b")
    false

It is **not** an RFC 5321/5322 validator, and it is intentionally **not**
"correct" in any abstract sense. Its only goal is to reach the *same*
accept/reject verdict that PHP's `filter_var()` reaches for the *same* input,
including PHP's surprising behaviours: a bare IP like `user@1.2.3.4` is
rejected, but a bracketed literal `user@[1.2.3.4]` is accepted; a bare space
is illegal even inside a quoted local part, but a backslash-escaped space is
fine. See the compatibility guide and the quirks catalog for the full list.
## How it works (why it is faithful)
PHP implements `FILTER_VALIDATE_EMAIL` in C, in
[`ext/filter/logical_filters.c`](https://github.com/php/php-src/blob/PHP-8.5/ext/filter/logical_filters.c),
function `php_filter_validate_email`. The algorithm is:
1. Reject the input if it is longer than **320 bytes** (octets).
2. Match it against a single, anchored PCRE regex carrying the trailing
   pattern modifiers `i` (caseless) and `D` (dollar-end-only). There are two
   regexes: an ASCII one used by default, and a Unicode-aware one (adding
   `\\pL`/`\\pN` to the local part, plus the `u` modifier, i.e.
   PCRE_UTF8 + PCRE_UCP) used only when `FILTER_FLAG_EMAIL_UNICODE` is set.
3. Return the string on match, `false` otherwise.
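The three steps above can be sketched roughly as follows. This is a minimal illustration only: `GateSketch` and its toy pattern are hypothetical stand-ins, not the vendored PHP regex, and the sketch recompiles the pattern on every call for brevity.

```elixir
defmodule GateSketch do
  # Hypothetical stand-in for the vendored PHP pattern: NOT the real regex.
  @toy_pattern "^[a-z0-9.]+@[a-z0-9]+[.][a-z]+$"
  @max_length 320

  # Step 1: length gate in bytes. Step 2: anchored, caseless, dollar-end-only
  # match. Step 3: return the input on match, false otherwise.
  def filter(input) when is_binary(input) do
    with true <- byte_size(input) <= @max_length,
         {:ok, mp} <- :re.compile(@toy_pattern, [:caseless, :dollar_endonly]),
         :match <- :re.run(input, mp, [{:capture, :none}]) do
      input
    else
      _ -> false
    end
  end
end
```

With this toy pattern, `GateSketch.filter("a@b.c")` returns the input string and `GateSketch.filter("a@b")` returns `false`, mirroring PHP's string-or-false convention.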
This module vendors **the exact PHP regex literals**, byte-for-byte, in
`priv/php/regexp1.full` (default) and `priv/php/regexp0.full` (unicode), and
derives both the pattern *and* the `:re` options from them at compile time —
so the vendored literal is the single source of truth. The flag translation
is mechanical:
| PHP pattern modifier      | Erlang `:re` option |
| ------------------------- | ------------------- |
| `i` (caseless)            | `:caseless`         |
| `D` (PCRE_DOLLAR_ENDONLY) | `:dollar_endonly`   |
| `u` (PCRE_UTF8 + UCP)     | `:unicode`, `:ucp`  |
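As a rough illustration of how mechanical this mapping is, the sketch below splits a PHP-style `/pattern/flags` literal and translates each flag letter. `FlagSketch` is a hypothetical module written for this example; the library's own translation lives in `PhpSource` and may differ in detail. Splitting at the last slash and trimming surrounding whitespace are assumptions of the sketch.

```elixir
defmodule FlagSketch do
  # Hypothetical illustration; not the library's PhpSource implementation.
  @known %{?i => [:caseless], ?D => [:dollar_endonly], ?u => [:unicode, :ucp]}

  # Splits "/pattern/flags" at its LAST slash, so slashes inside the
  # pattern itself are preserved.
  def split(literal) do
    "/" <> rest = String.trim(literal)
    [flags | reversed] = rest |> String.split("/") |> Enum.reverse()
    pattern = reversed |> Enum.reverse() |> Enum.join("/")
    {pattern, flags}
  end

  # Translates each flag letter; an unknown letter raises instead of
  # being silently dropped.
  def to_re_opts(flags) do
    flags
    |> String.to_charlist()
    |> Enum.flat_map(fn flag ->
      case Map.fetch(@known, flag) do
        {:ok, opts} -> opts
        :error -> raise ArgumentError, "unvetted PHP regex flag: " <> <<flag::utf8>>
      end
    end)
  end
end
```

Raising on an unknown letter is the design choice that matters here: a new upstream flag then fails loudly rather than producing a pattern that quietly behaves differently from PHP.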
Both PHP (PCRE2) and Erlang/OTP's `:re` are PCRE-family engines, so the same
pattern is expected to produce the same matches. The library matches on the
**raw bytes** of the input, exactly as PHP's `pcre2_match` does. On the
unicode path it treats an input that is not valid UTF-8 as a non-match, which
is what PHP does when `pcre2_match` reports a UTF-8 error.
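A small sketch of the unicode-path behaviour described above, using a hypothetical toy pattern (exactly three characters) rather than PHP's Unicode regex. `Utf8Sketch` is an illustrative name, not part of this library.

```elixir
defmodule Utf8Sketch do
  # Hypothetical toy pattern (exactly three characters), NOT PHP's Unicode regex.
  @toy_pattern "^.{3}$"

  # With :unicode the subject must be valid UTF-8; otherwise :re.run raises
  # :badarg, which this sketch maps to a plain non-match.
  def three_chars?(subject) when is_binary(subject) do
    {:ok, mp} = :re.compile(@toy_pattern, [:unicode])

    try do
      :re.run(subject, mp, [{:capture, :none}]) == :match
    catch
      :error, :badarg -> false
    end
  end
end
```

A three-character string such as `"日本語"` is nine bytes long, which is also why the 320 limit counts bytes rather than characters.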
Faithfulness is *tested*, not just asserted: the test suite generates golden
verdicts from real `php` binaries across a PHP version matrix and asserts
that this module reproduces them exactly. See `mix php.golden` and the
`COMPATIBILITY.md` guide.
## Performance and untrusted input
Validation is a single anchored PCRE match. For ordinary addresses it is
effectively instant, but the vendored pattern (PHP's own) contains nested
quantifiers in the domain sub-pattern. A *deliberately crafted* input, even a
short one well under the 320-byte gate, can therefore drive PCRE into heavy
backtracking: a worst-case string costs on the order of 200 ms inside
`:re.run` before the engine's internal step limit aborts the match. The
**verdict is still correct** (such inputs are rejected, exactly as PHP rejects
them); the cost is purely CPU time. Note that the 320-byte gate bounds input
*length*, not per-call CPU.
Two consequences worth knowing for a hot path that validates **untrusted**
input at scale:
* `:re.run` runs in a NIF that does not yield, so a pathological input pins
a scheduler thread for the duration. If you validate attacker-controlled
strings at high volume, run validation off the request-serving schedulers
(e.g. a `Task` supervised on a separate pool) and/or apply a timeout and
rate-limiting upstream.
* This library intentionally does **not** pass `:re` a `match_limit`. A limit
low enough to cap the worst case can also reject a *legitimate*
many-label address that PHP accepts — i.e. it would break exact parity,
which is this library's entire contract. Bounding CPU is therefore left to
the caller, where it can be done without sacrificing faithfulness.
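One way a caller might bound how long it waits for a verdict is sketched below, assuming a `Task.Supervisor` owned by the caller. `BoundedSketch` is a hypothetical wrapper, not part of this library, and `validator` stands for any 1-arity function returning a boolean (for example `&ElixirPhpEmailValidator.valid?/1`). The timeout limits the caller's wait; it does not by itself guarantee that the work on the scheduler stops at that moment.

```elixir
defmodule BoundedSketch do
  # Hypothetical caller-side wrapper; not part of this library.
  # `validator` is any 1-arity function returning a boolean.
  def run(supervisor, validator, email, timeout_ms) when is_function(validator, 1) do
    task = Task.Supervisor.async_nolink(supervisor, fn -> validator.(email) end)

    # Wait up to timeout_ms, then shut the task down if it has not replied.
    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, verdict} -> {:ok, verdict}
      {:exit, reason} -> {:error, reason}
      nil -> :timeout
    end
  end
end

# Usage with a stand-in validator (any boolean-returning function works).
{:ok, sup} = Task.Supervisor.start_link()
BoundedSketch.run(sup, fn email -> String.contains?(email, "@") end, "a@b.c", 1_000)
```

Because the verdict comes back as `{:ok, boolean}` or `:timeout`, the caller decides how to treat a timed-out input; the validator's own answer is never altered, so parity with PHP is untouched.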
## Provenance / copyright
The vendored regex derives from Michael Rushton's email validator and is
redistributed by PHP under its own terms; Rushton's copyright notice is
preserved in `NOTICE`. `source_info/0` reports the exact php-src ref and
checksums this build tracks.
"""
alias ElixirPhpEmailValidator.PhpSource
# --- Vendored PHP source (single source of truth) ------------------------
#
# The library reads the exact PHP regex *literals* ("/pattern/flags", as they
# appear in php_filter_validate_email) and the provenance manifest, all
# regenerated by `mix php.extract`. Both the pattern and the matching options
# are derived from the literal, so they can never disagree with PHP's flags.
@priv_php Path.join([__DIR__, "..", "priv", "php"])
@regexp1_full_path Path.join(@priv_php, "regexp1.full")
@regexp0_full_path Path.join(@priv_php, "regexp0.full")
@manifest_path Path.join(@priv_php, "MANIFEST.json")
@external_resource @regexp1_full_path
@external_resource @regexp0_full_path
@external_resource @manifest_path
@regexp1_full File.read!(@regexp1_full_path)
@regexp0_full File.read!(@regexp0_full_path)
@manifest File.read!(@manifest_path)
# Pattern + options derived from the vendored literals at compile time. An
# unvetted upstream flag makes PhpSource.flags_to_re_opts/1 raise *here*, so a
# silent divergence becomes a loud compile error.
@regexp1 PhpSource.pattern(@regexp1_full)
@regexp0 PhpSource.pattern(@regexp0_full)
@ascii_opts PhpSource.options(@regexp1_full)
@unicode_opts PhpSource.options(@regexp0_full)
# PHP: `if (Z_STRLEN_P(value) > 320) RETURN_VALIDATION_FAILED;`
@max_length 320
# Provenance, derived at compile time from the vendored bytes/manifest so it
# can never disagree with what is actually compiled into this build.
@regexp1_sha256 Base.encode16(:crypto.hash(:sha256, @regexp1), case: :lower)
@regexp0_sha256 Base.encode16(:crypto.hash(:sha256, @regexp0), case: :lower)
@php_baseline %{
  source_file: PhpSource.manifest_field(@manifest, "file"),
  function: PhpSource.manifest_field(@manifest, "function"),
  php_ref: PhpSource.manifest_field(@manifest, "ref"),
  upstream_url: PhpSource.manifest_field(@manifest, "blob_url"),
  vendored_at: PhpSource.manifest_field(@manifest, "vendored_at"),
  max_length: @max_length
}
@typedoc "Options for `valid?/2` and `validate/2`."
@type opts :: [unicode: boolean()]
@doc """
Returns `true` when PHP's `filter_var(email, FILTER_VALIDATE_EMAIL)` would
accept `email`, and `false` otherwise.
## Options
* `:unicode` (boolean, default `false`) — when `true`, mirrors PHP's
`FILTER_FLAG_EMAIL_UNICODE`, allowing Unicode letters/numbers in the
**local part** (the domain still must be ASCII).
## Input type
Pass a **UTF-8 binary** (an Elixir string). The `@spec` says `binary()`, so
Dialyzer and Elixir's compile-time type checker can flag a non-binary argument
before it ever runs. At runtime the function stays *total* — like PHP's
`filter_var`, which never raises — so any non-binary value simply returns
`false`. This is the one spot where the port deliberately does **not** imitate
PHP: PHP coerces scalars (`filter_var(123, ...)` validates `"123"`), whereas
this library does not coerce. In particular a charlist such as `~c"a@b.c"`
returns `false`; convert it with `to_string/1` first.
## Examples

    iex> ElixirPhpEmailValidator.valid?("first.last@iana.org")
    true
    iex> ElixirPhpEmailValidator.valid?("user@1.2.3.4")
    false
    iex> ElixirPhpEmailValidator.valid?("user@[1.2.3.4]")
    true
    iex> ElixirPhpEmailValidator.valid?("日本語@example.com")
    false
    iex> ElixirPhpEmailValidator.valid?("日本語@example.com", unicode: true)
    true
    iex> ElixirPhpEmailValidator.valid?(~c"a@b.c")
    false

"""
@spec valid?(binary(), opts()) :: boolean()
def valid?(email, opts \\ [])
def valid?(email, opts) when is_binary(email) do
  if byte_size(email) > @max_length do
    false
  else
    mode = if Keyword.get(opts, :unicode, false), do: :unicode, else: :ascii
    mp = compiled(mode)

    try do
      :re.run(email, mp, [{:capture, :none}]) == :match
    catch
      # In unicode mode, an input that is not valid UTF-8 makes :re.run raise
      # `badarg`. PHP's pcre2_match reports a UTF-8 error there and filter_var
      # returns false, so a non-match is the faithful result.
      :error, :badarg -> false
    end
  end
end
def valid?(_other, _opts), do: false
@doc """
Mirrors PHP's return convention: `{:ok, email}` when valid (PHP returns the
string), `:error` when invalid (PHP returns `false`).
Like `valid?/2`, this expects a binary; see its "Input type" section.
## Examples

    iex> ElixirPhpEmailValidator.validate("a@b.c")
    {:ok, "a@b.c"}
    iex> ElixirPhpEmailValidator.validate("a@b")
    :error

"""
@spec validate(binary(), opts()) :: {:ok, binary()} | :error
def validate(email, opts \\ []) do
if valid?(email, opts), do: {:ok, email}, else: :error
end
@doc """
Returns provenance metadata for the vendored PHP regex: which php-src
ref/version this port tracks, the upstream file/function, and the SHA-256
checksums of the two vendored patterns. Read from `priv/php/MANIFEST.json` and
the vendored bytes at compile time, so it always matches this build.
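To cross-check a reported checksum independently, one could hash the same pattern bytes with SHA-256 and compare lowercase hex strings. The helper below is a hypothetical sketch (`ChecksumSketch` is not part of this library); it only shows the encoding, and which bytes to feed it depends on how the pattern is extracted from the vendored literal.

```elixir
defmodule ChecksumSketch do
  # Lowercase hex SHA-256 of a binary: 64 hex characters for any input.
  def sha256_hex(bytes) when is_binary(bytes) do
    Base.encode16(:crypto.hash(:sha256, bytes), case: :lower)
  end
end
```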
## Example

    iex> info = ElixirPhpEmailValidator.source_info()
    iex> info.function
    "php_filter_validate_email"
    iex> byte_size(info.regexp1_sha256)
    64

"""
@spec source_info() :: %{
        source_file: String.t(),
        function: String.t(),
        php_ref: String.t(),
        upstream_url: String.t(),
        vendored_at: String.t(),
        max_length: non_neg_integer(),
        regexp1_sha256: String.t(),
        regexp0_sha256: String.t()
      }
def source_info do
  Map.merge(@php_baseline, %{
    regexp1_sha256: @regexp1_sha256,
    regexp0_sha256: @regexp0_sha256
  })
end
# --- Internals -----------------------------------------------------------
# Lazily compile each pattern once per node and cache in :persistent_term.
# Compiling at *runtime* (rather than baking the compiled program into the
# .beam) guarantees the program matches the PCRE of the OTP actually running.
defp compiled(mode) do
  key = {__MODULE__, :compiled, mode}

  case :persistent_term.get(key, nil) do
    nil ->
      mp = compile!(mode)
      :persistent_term.put(key, mp)
      mp

    mp ->
      mp
  end
end
defp compile!(mode) do
  {pattern, opts} =
    case mode do
      :ascii -> {@regexp1, @ascii_opts}
      :unicode -> {@regexp0, @unicode_opts}
    end

  case :re.compile(pattern, opts) do
    {:ok, mp} ->
      mp

    {:error, reason} ->
      raise "the vendored PHP #{mode} pattern failed to compile under this OTP's PCRE: " <>
              inspect(reason)
  end
end
end