# Text classification — language identification
`Text.Language.Classifier.Fasttext` is a pure-Elixir port of fastText's [`lid.176`](https://fasttext.cc/docs/en/language-identification.html) model — a supervised classifier trained on Wikipedia, Tatoeba, and SETimes data that recognises **176 languages** from short input. The implementation is validated bit-for-bit against the official C++/Python reference for hashing, n-gram extraction, feature assembly, and tree traversal: same input, same prediction, same probabilities (within float-32 rounding).
It runs entirely in the BEAM — no NIFs, no Python sidecar, no model server. The trade-off is that the model file (~126 MB) must be fetched once at install time and lives on disk; this guide walks through the setup and the API.
## One-time setup
The `lid.176.bin` model file is **not** part of the Hex package — every install fetches its own copy. Run once after adding `:text` to your dependencies:
```sh
mix text.download_lid176
```
The file lands at `priv/lid_176/lid.176.bin` inside the project. It's gitignored and not committed.
For production environments that want every external artefact present at boot, use the broader `mix text.download_models` task — same fetch, but it can also pre-download the Bumblebee models used by `Text.Sentiment`, `Text.POS`, `Text.NER`, and `Text.WordCloud.Backends.KeyBERT`.
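In a release pipeline the download step slots in right after dependency fetch, so the artefact is baked into the image rather than fetched at boot. A sketch of that build step (the task names are the ones documented above):

```sh
# e.g. in a Dockerfile or CI build stage, before `mix release`:
mix deps.get
mix text.download_lid176    # just the language-ID model
# or, to pre-fetch every external model the app's features use:
mix text.download_models
```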
## Loading the model
The model is loaded once per VM and reused across every detection call:
```elixir
{:ok, model} =
  Text.Language.Classifier.Fasttext.ModelLoader.load(
    Path.join(:code.priv_dir(:text), "lid_176/lid.176.bin")
  )
```
The result is a `Text.Language.Classifier.Fasttext.Model` struct holding the input matrix, output matrix, dictionary, and Huffman tree (if applicable) as `Nx` tensors. Loading takes a few seconds; a typical pattern is to load at application boot and stash the model in `:persistent_term` or a `GenServer` for the rest of the VM's lifetime.
## Detecting a language
`detect/3` returns a full `Detection` struct:
```elixir
{:ok, det} = Text.Language.Classifier.Fasttext.detect("Bonjour le monde", model)
det.language #=> "fr"
det.script #=> :Latn
det.confidence #=> 0.984
det.alternatives #=> [{"en", 0.0035}, {"it", 0.0024}, {"oc", 0.0009}, {"ca", 0.0006}]
det.text #=> "Bonjour le monde"
```
The struct fields:
| Field | Meaning |
|---|---|
| `:language` | BCP-47 language subtag (`"fr"`, `"zh"`, `"sr"`). |
| `:confidence` | Probability of the top prediction in `[0.0, 1.0]`. |
| `:script` | Unicode script atom derived from the input text (`:Latn`, `:Cyrl`, `:Hans`, `:Hant`, `:Hani`, …). Used downstream to disambiguate multi-script locales. |
| `:alternatives` | List of `{language, probability}` for the next-best predictions. |
| `:text` | The original input, preserved for downstream use. |
Common options:
* `:k` — number of top predictions to return. Default `5`. The first becomes the main `:language`; the rest fill `:alternatives`.
* `:threshold` — drop predictions below this probability. Default `0.0`. Raise it (e.g. `0.5`) to get `{:error, :no_predictions}` for ambiguous inputs you'd rather skip than guess at.
```elixir
case Text.Language.Classifier.Fasttext.detect(unknown_text, model, threshold: 0.5) do
  {:ok, det} -> route_by_language(det.language)
  {:error, :no_predictions} -> ask_user_to_clarify()
end
```
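The `:k` option works the same way; with `k: 3` the top prediction fills `:language` and the rest (at most `k - 1` entries) land in `:alternatives`:

```elixir
{:ok, det} = Text.Language.Classifier.Fasttext.detect("Bonjour le monde", model, k: 3)
det.language             #=> "fr"
length(det.alternatives) # at most 2 (k - 1), fewer if :threshold drops some
```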
## Just the language code
When you only need the answer:
```elixir
{:ok, "es"} = Text.Language.Classifier.Fasttext.classify("Hola, ¿cómo estás?", model)
{:ok, "ru"} = Text.Language.Classifier.Fasttext.classify("Привет, мир!", model)
{:ok, "ja"} = Text.Language.Classifier.Fasttext.classify("こんにちは世界", model)
```
`classify/2` is a thin wrapper around `detect(text, model, k: 1)` that drops everything except the top language code. Useful for routing logic where you only care which bucket to send the text into.
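A sketch of that routing pattern — the module name and queue atoms here are hypothetical, only the `classify/2` call comes from the API above:

```elixir
defmodule MyApp.LanguageRouter do
  # Hypothetical helper: pick a processing queue from the detected language.
  def route(text, model) do
    case Text.Language.Classifier.Fasttext.classify(text, model) do
      {:ok, "en"}   -> :english_queue
      {:ok, _other} -> :translation_queue
      {:error, _}   -> :manual_review_queue
    end
  end
end
```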
## Resolving to a CLDR locale
`detect/3` returns a bare language code; downstream localisation systems usually want a full locale string like `zh-Hans-CN` or `fr-FR`. `to_locale/2` runs the detection through CLDR's likely-subtags algorithm to fill in the missing pieces:
```elixir
{:ok, det} = Text.Language.Classifier.Fasttext.detect("你好世界,这是简体中文。", model)
{:ok, "zh-Hans-CN"} = Text.Language.Classifier.Fasttext.to_locale(det)
{:ok, det} = Text.Language.Classifier.Fasttext.detect("你好世界,這是繁體中文。", model)
{:ok, "zh-Hant-TW"} = Text.Language.Classifier.Fasttext.to_locale(det)
```
When the optional [`localize`](https://hex.pm/packages/localize) dependency is loaded, this calls into CLDR's actual likely-subtags table. Without it, a built-in fallback table covers ~60 of the most common languages. Add `:localize` for production-grade locale resolution:
```elixir
{:localize, "~> 0.23", optional: true}
```
Override the inferred region or script:
```elixir
{:ok, "fr-Latn-CA"} = Text.Language.Classifier.Fasttext.to_locale(det, region: :CA)
```
The region option is typically wired to an `Accept-Language` header or IP geolocation when available; otherwise the CLDR default for the language wins.
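A minimal sketch of that wiring for a Plug-based app. The helper names are hypothetical and the header parsing is deliberately naive (it only reads a region subtag off the first value); a real app would use a proper `Accept-Language` parser:

```elixir
# Naive sketch: pull the region subtag from the first Accept-Language value,
# e.g. "fr-CA,fr;q=0.9" -> :CA. Returns nil when no region is present.
defp region_from_header(conn) do
  with [header | _] <- Plug.Conn.get_req_header(conn, "accept-language"),
       [_, region] <- Regex.run(~r/^[a-zA-Z]{2,3}-([A-Z]{2})/, header) do
    String.to_atom(region)
  else
    _ -> nil
  end
end

defp resolve_locale(det, conn) do
  case region_from_header(conn) do
    nil    -> Text.Language.Classifier.Fasttext.to_locale(det)
    region -> Text.Language.Classifier.Fasttext.to_locale(det, region: region)
  end
end
```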
## Script detection and Hans/Hant
Many languages are written in more than one script (Serbian in Latin or Cyrillic, Punjabi in Gurmukhi or Shahmukhi, Chinese in Simplified or Traditional Han). The fastText model returns a bare language code like `"zh"` — it doesn't distinguish `Hans` from `Hant`. `Text.Language.Classifier.Fasttext.ScriptDetector` runs alongside `detect/3` and contributes the script signal.
For Chinese specifically, `ScriptDetector` runs a second-pass codepoint-frequency analysis against curated lists of distinguishing characters. If the input contains characters present only in Simplified (`国`, `电`, `时`) it returns `:Hans`; if it contains Traditional-only characters (`國`, `電`, `時`) it returns `:Hant`. Inputs containing only shared Han codepoints fall back to `:Hani`, and likely-subtags then resolves to `Hans-CN` (the mainland-China default).
```elixir
{:ok, det} = Text.Language.Classifier.Fasttext.detect("国家电网", model)
det.script #=> :Hans
{:ok, det} = Text.Language.Classifier.Fasttext.detect("國家電網", model)
det.script #=> :Hant
{:ok, det} = Text.Language.Classifier.Fasttext.detect("人之初", model)
det.script #=> :Hani (shared codepoints — could be either)
```
## Confidence calibration
fastText's confidence scores are well-calibrated for *long* inputs (a sentence or more) but inflate aggressively on very short inputs. Common patterns:
* **Short noun phrases** ("Hello world") often produce confidence > 0.95 — usually correct, but sometimes overconfident on names that look multilingual.
* **Mixed-language text** ("Click the button to login") usually classifies as the dominant language with moderate confidence; check `:alternatives` if the result looks suspicious.
* **Code-mixed or transliterated text** ("kaisi ho?" written in Latin script for Hindi) often classifies as the script's default language (`"en"`) rather than the intended one. Consider a higher `:threshold` and a fallback path for ambiguous cases.
For robust routing, look at the gap between top-1 and top-2 confidences in `:alternatives`. A small gap (< 0.1) signals genuine ambiguity even when the top score is high.
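A sketch of that gap check against the `Detection` fields documented above (the 0.1 cutoff is the heuristic from the text, not a library constant):

```elixir
# Treat a detection as ambiguous when the runner-up is within 0.1 of the
# top prediction, regardless of how high the top score is on its own.
defp unambiguous?(%{confidence: top, alternatives: [{_lang, second} | _]}) do
  top - second >= 0.1
end

# No alternatives at all (e.g. k: 1, or everything else below :threshold).
defp unambiguous?(%{alternatives: []}), do: true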
## Performance
The model's input matrix is ~128 MB of `float32` data held in an `Nx` tensor. The inference forward pass (`take + mean + dot`, plus the softmax tail for softmax-loss models) is wrapped in `Nx.Defn` so an EXLA-compiled execution runs the whole pass as a single fused XLA kernel.
**Per-prediction wall time on `lid.176`:**
| Backend | Time |
|---|---|
| `Nx.BinaryBackend` (no `:exla`) | ~600 µs |
| `EXLA.Backend`, no defn fusion | ~200 µs |
| `EXLA.Backend` + fused defn graph (default) | **~100 µs** |
For production throughput add `:exla` to your deps and configure it as both the default backend and the default `defn` compiler:
```elixir
# config/config.exs
config :nx, default_backend: EXLA.Backend
config :nx, :default_defn_options, compiler: EXLA
```
Without EXLA the package still works correctly — `Nx.Defn.Evaluator` runs the same `defn` graph against `Nx.BinaryBackend` — but per-prediction wall time is roughly an order of magnitude higher.
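You can confirm the configuration took effect at runtime — `Nx.default_backend/0` reports the active backend:

```elixir
# In iex, after adding :exla and the config above:
Nx.default_backend()
#=> {EXLA.Backend, []}
```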
## The 176 supported languages
The full set is documented at the [fastText project page](https://fasttext.cc/docs/en/language-identification.html). Coverage includes all 24 official EU languages, every UN official language, the major South and Southeast Asian languages, and a long tail of regional and minority languages. Notable gaps in `lid.176` include Quechua, some indigenous American languages, and constructed languages other than Esperanto and Ido.
Use `model.dictionary.nlabels` (which equals 176) and `model.labels` (a list of every supported label) to enumerate at runtime if you need a UI selector or to validate a user's expected language.
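Assuming `model.labels` holds bare codes like `"fr"` (as the examples above suggest), a membership check is a one-liner:

```elixir
supported = MapSet.new(model.labels)

# Validate a user-declared language before trusting it:
MapSet.member?(supported, "fr")  # true for any of the 176 labels
MapSet.member?(supported, "xx")  # false -> fall back to detection
```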
## Putting it together
A typical production wiring:
```elixir
# At app boot, in your Application.start/2:
def start(_type, _args) do
  {:ok, model} =
    Text.Language.Classifier.Fasttext.ModelLoader.load(
      Path.join(:code.priv_dir(:text), "lid_176/lid.176.bin")
    )

  :persistent_term.put(MyApp.LidModel, model)

  children = [
    # ...your supervised children...
  ]

  Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
end

# At call site:
def detect_language(text) do
  model = :persistent_term.get(MyApp.LidModel)

  case Text.Language.Classifier.Fasttext.detect(text, model, threshold: 0.5) do
    {:ok, det} ->
      {:ok, locale} = Text.Language.Classifier.Fasttext.to_locale(det)
      {:ok, det.language, locale}

    {:error, :no_predictions} ->
      {:error, :ambiguous}
  end
end
```
This pattern loads the model once, keeps it warm in `:persistent_term`, and produces fully-resolved CLDR locales on every call.