lib/pdf/reader/wordlist.ex

defmodule Pdf.Reader.Wordlist do
  @moduledoc """
  Compile-time dictionaries used by `Pdf.Reader.read/2` to recover word
  boundaries that the PDF producer collapsed (e.g. `iniciode` →
  `inicio` + `de`).

  Lookup is performed against a `MapSet` of lowercase words. Membership
  checks are case-insensitive — the input is downcased before the
  `MapSet` query, so `"De"`, `"DE"`, and `"de"` all match a stored
  `"de"` entry.

  ## Bundled dictionaries

  The `:es` dictionary is built from TWO bundled wordlists merged at
  compile time:

  ### 1. `priv/wordlists/spanish.txt` (50,000 entries, ~428 KB)
  The 50,000 most-frequent Spanish words from the
  [hermitdave/FrequencyWords](https://github.com/hermitdave/FrequencyWords)
  project (MIT License, © Hermit Dave), derived from the
  OpenSubtitles 2018 corpus. Covers conversational, technical, and
  legal Spanish vocabulary. Format: one lowercase word per line (the
  upstream repo also includes per-word frequencies; we strip those
  and keep the words only).

  ### 2. `priv/wordlists/spanish_mx_extras.txt` (~700 entries)
  Mexican tax/legal/government vocabulary curated specifically for
  this project, covering terms that are missing from the
  subtitle-derived 50k list. This includes:

  - SAT (Servicio de Administración Tributaria) terminology:
    `padrón`, `tributario`/`tributarios`, `federativa`, `asimilado`,
    `lineamientos`, `contribuyente`, `recaudación`, `fiscalización`,
    `gravable`, `deducible`, etc.
  - Common labour / employment terms: `asalariado`, `honorario`,
    `arrendamiento`, `prestación`, `salario`/`sueldo` variants.
  - Document/process terms: `constancia`, `cédula`, `expedición`,
    `notarial`, `apoderado`, `domiciliado`, etc.
  - Adverbs and verb conjugations not in the base list:
    `inmediatamente`, `posteriormente`, `denúnciala`, `conferidas`,
    `corresponda`, etc.
  - Common periodicity words: `mensual`, `trimestral`, `bimestral`,
    `quincenal`, etc.

  Released under the same MIT License as the rest of the project.

  ### Filtering

  A small `@es_blacklist` removes colloquial/slang merges that show
  up in subtitle frequency lists but harm partition splitting (e.g.
  `dela` is colloquial for "de la"; if left in the dict the
  whole-token guard would prevent the correct "de" + "la" split).
  Currently blacklisted: `dela`, `pal`.

  ### Final size
  ~50,500 unique entries after merge and blacklist filter. Loaded
  into a `MapSet` at compile time via `@external_resource` so there
  is zero IO cost at runtime.

  ## Custom dictionaries

  Callers can supply any `MapSet.t()` of lowercase strings as the
  dictionary opt — e.g. a Hunspell `.dic` file loaded into a MapSet,
  or a frequency list specific to the user's domain.

  ## Spec references

  Wordlist licensing follows the upstream MIT license — attribution is
  preserved in the project README and LICENSE.md.
  """

  @es_filename "spanish.txt"
  @es_extras_filename "spanish_mx_extras.txt"

  # Colloquial/slang merges that show up in subtitle frequency lists
  # but should NOT be treated as single words during partition. E.g.
  # "dela" is colloquial for "de la"; if we leave it in the dict, the
  # member?(whole_token) guard in `Pdf.Reader.dictionary_split/2`
  # prevents the correct "de"+"la" split. Filter them out before
  # building the MapSet so the partition algorithm can reach the
  # correct decomposition.
  @es_blacklist MapSet.new(~w(dela pal))

  @doc """
  Returns the bundled Spanish wordlist as a `MapSet` of lowercase
  strings.

  **Lazy-loaded** — the wordlist files are only read from `priv/` the
  first time this function is called. Result is cached in
  `:persistent_term` so subsequent calls are O(1). Callers who never
  pass `dictionary: :es` to `Pdf.Reader.read/2` never touch the disk
  and don't pay the ~50 ms one-time load cost.
  """
  @spec spanish() :: MapSet.t(String.t())
  def spanish do
    case :persistent_term.get({__MODULE__, :spanish}, nil) do
      nil ->
        ms = load_spanish()
        :persistent_term.put({__MODULE__, :spanish}, ms)
        ms

      cached ->
        cached
    end
  end

  defp load_spanish do
    priv = :code.priv_dir(:ex_pdf)

    [Path.join([priv, "wordlists", @es_filename]), Path.join([priv, "wordlists", @es_extras_filename])]
    |> Enum.flat_map(&load_lines/1)
    |> Enum.map(&String.downcase/1)
    |> Enum.reject(&MapSet.member?(@es_blacklist, &1))
    |> MapSet.new()
  end

  defp load_lines(path) do
    case File.read(path) do
      {:ok, content} -> String.split(content, "\n", trim: true)
      {:error, _} -> []
    end
  end

  @doc """
  Resolves a dictionary specifier to a `MapSet`.

  - `:es` → bundled Spanish wordlist
  - `%MapSet{}` → returned as-is
  - `nil` → returned as-is (caller should treat as "no dictionary")
  """
  @spec resolve(:es | MapSet.t() | nil) :: MapSet.t() | nil
  def resolve(:es), do: spanish()
  def resolve(%MapSet{} = ms), do: ms
  def resolve(nil), do: nil

  @doc """
  Case-insensitive membership check.
  """
  @spec member?(String.t(), MapSet.t() | nil) :: boolean()
  def member?(_word, nil), do: false
  def member?(word, %MapSet{} = ms), do: MapSet.member?(ms, String.downcase(word))
end