# textmetrics

[![Package Version](https://img.shields.io/hexpm/v/textmetrics)](https://hex.pm/packages/textmetrics)
[![Downloads](https://img.shields.io/hexpm/dt/textmetrics)](https://hex.pm/packages/textmetrics)
[![Hex Docs](https://img.shields.io/badge/hex-docs-ffaff3)](https://hexdocs.pm/textmetrics/)
[![CI](https://github.com/nao1215/textmetrics/actions/workflows/ci.yml/badge.svg)](https://github.com/nao1215/textmetrics/actions/workflows/ci.yml)
[![License](https://img.shields.io/github/license/nao1215/textmetrics)](LICENSE)

String comparison and readability metrics for Gleam: edit distances,
similarity scores, longest common subsequence, line-level diff, and
canonical readability scores (Flesch Reading Ease, Flesch-Kincaid Grade,
Gunning Fog, SMOG, ARI, Coleman-Liau) with the word / sentence / syllable
count primitives they consume.

## Features

- Edit distances: Levenshtein, Damerau-Levenshtein (true variant), OSA, Hamming
- Similarity scores in `[0.0, 1.0]`: Jaro, Jaro-Winkler, Sørensen-Dice
- Longest common subsequence (length and a recovered sequence)
- Diff: Myers (1986) and patience, plus POSIX unified-diff rendering
- `did_you_mean` and Jaro-Winkler ranking for spell-correction style search
- Readability: Flesch Reading Ease, Flesch-Kincaid Grade, Gunning Fog,
  SMOG, Automated Readability Index, Coleman-Liau
- Count primitives: words, sentences, syllables, characters, paragraphs,
  polysyllables
- Pure Gleam, runs on Erlang and JavaScript targets
- Operates on Unicode grapheme clusters (UAX #29)

## Install

```sh
gleam add textmetrics
```

## Suggesting a similar command name

When a user mistypes a subcommand or flag, fall back to a Levenshtein-bounded
search to suggest the closest known names. Ties are broken by the order in
which names appear in `candidates` (the second argument), which keeps
suggestions deterministic.

```gleam
import textmetrics/search

pub fn suggest_command(typed: String) -> List(String) {
  let known = ["install", "uninstall", "remove", "update", "help"]
  search.did_you_mean(typed, known, 2)
}

// suggest_command("instal") -> ["install"]
// suggest_command("updat")  -> ["update"]
// suggest_command("xyz")    -> []
```

`search.closest` is the convenience form when only one suggestion is
needed (the typical "Did you mean `X`?" path).

```gleam
import gleam/option.{type Option}
import textmetrics/search

pub fn one_suggestion(typed: String) -> Option(String) {
  let known = ["install", "uninstall", "remove", "update", "help"]
  search.closest(typed, known, 2)
}

// one_suggestion("instal") -> Some("install")
// one_suggestion("xyz")    -> None
```
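
A minimal sketch of turning that result into a user-facing message. The
function name `unknown_command_message` and the message wording are
illustrative and not part of the library; only `search.closest` and its
`Option(String)` result are taken from the example above.

```gleam
import gleam/option.{None, Some}
import textmetrics/search

// Hypothetical helper: formats an error message for an unknown command.
pub fn unknown_command_message(typed: String, known: List(String)) -> String {
  let base = "unknown command \"" <> typed <> "\""
  case search.closest(typed, known, 2) {
    // A close-enough name exists, so offer it.
    Some(name) -> base <> ". Did you mean \"" <> name <> "\"?"
    // Nothing within the bound, so report the error without a suggestion.
    None -> base
  }
}
```

Keeping the "no suggestion" branch explicit avoids printing a misleading
hint when nothing is within the distance bound.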

## Comparing strings: Levenshtein, Damerau-Levenshtein, OSA

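`levenshtein` counts the minimum number of insert / delete / substitute
operations. `damerau_levenshtein` adds adjacent-grapheme transposition and
lets the same substring participate in multiple edits. `osa` (optimal string
alignment) uses the same operations but allows each substring to be edited
at most once, which is what most "Damerau distance" libraries actually
compute. In the example below, `"CA"` → `"ABC"` costs 2 under
Damerau-Levenshtein (transpose to `"AC"`, then insert `B` between the
swapped graphemes) but 3 under OSA, which may not edit the transposed pair
again.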

```gleam
import textmetrics/distance

pub fn distances() -> #(Int, Int, Int) {
  let l = distance.levenshtein("CA", "ABC")
  let dl = distance.damerau_levenshtein("CA", "ABC")
  let o = distance.osa("CA", "ABC")
  #(l, dl, o)
}

// distances() -> #(3, 2, 3)
```

`distance.normalized_levenshtein` rescales Levenshtein distance to a
similarity in `[0.0, 1.0]` (`1.0` means identical). Use it when ranking
by edit distance is preferred over Jaro-Winkler.

```gleam
import textmetrics/distance

pub fn levenshtein_similarity() -> Float {
  // levenshtein = 3, max graphemes = 7 → 1 - 3/7 = 4/7.
  distance.normalized_levenshtein("kitten", "sitting")
}
```
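
As a sketch of that ranking use case, the candidates can be sorted by
descending similarity with the standard library's `list.sort` and
`float.compare`. The helper name `rank_by_levenshtein` is hypothetical;
`distance.normalized_levenshtein` is used exactly as in the example above.

```gleam
import gleam/float
import gleam/list
import textmetrics/distance

// Hypothetical helper: most similar candidate first.
pub fn rank_by_levenshtein(
  query: String,
  candidates: List(String),
) -> List(String) {
  list.sort(candidates, fn(a, b) {
    // Arguments are swapped (b before a) to sort in descending order.
    float.compare(
      distance.normalized_levenshtein(query, b),
      distance.normalized_levenshtein(query, a),
    )
  })
}

// rank_by_levenshtein("kitten", ["sitting", "kitchen", "mitten"])
// -> ["mitten", "kitchen", "sitting"]   (similarities 5/6, 5/7, 4/7)
```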

`hamming` requires equal-length inputs and returns
`Error(LengthMismatch(...))` otherwise, so callers do not have to
pre-validate the input lengths.

```gleam
import textmetrics/distance

pub fn hamming_check(
  a: String,
  b: String,
) -> Result(Int, distance.HammingError) {
  distance.hamming(a, b)
}

// hamming_check("karolin", "kathrin") -> Ok(3)
// hamming_check("ab", "abc")          -> Error(LengthMismatch(left: 2, right: 3))
```
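
A caller typically pattern-matches on the result. The sketch below uses a
hypothetical helper name, `describe_hamming`, and handles every error with a
catch-all branch, because the full set of `HammingError` variants is not
listed here.

```gleam
import gleam/int
import textmetrics/distance

// Hypothetical helper: renders the outcome as text.
pub fn describe_hamming(a: String, b: String) -> String {
  case distance.hamming(a, b) {
    Ok(d) -> "hamming distance: " <> int.to_string(d)
    // Catch-all: match on the specific variant instead if the two
    // lengths are needed in the message.
    Error(_) -> "hamming distance is undefined for these inputs"
  }
}

// describe_hamming("karolin", "kathrin") -> "hamming distance: 3"
```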

## Scoring similarity: Jaro, Jaro-Winkler, Sørensen-Dice

`jaro_winkler` boosts scores for strings that share a common prefix, which
suits typo tolerance in human names. `sorensen_dice` compares the multisets
of grapheme n-grams of the two strings; the third argument is the n-gram
size (`2` for bigrams).

```gleam
import textmetrics/similarity

pub fn jaro_score() -> Float {
  similarity.jaro("MARTHA", "MARHTA")
}

pub fn jaro_winkler_score() -> Float {
  similarity.jaro_winkler("MARTHA", "MARHTA")
}

pub fn dice_score() -> Result(Float, similarity.SorensenDiceError) {
  similarity.sorensen_dice("night", "nacht", 2)
}

// jaro_score()         -> 0.944_444…
// jaro_winkler_score() -> 0.961_111…
// dice_score()         -> Ok(0.25)
```

`jaro_winkler_with` accepts a validated `JaroWinklerConfig` for non-default
prefix scaling. The smart constructor enforces `prefix_scale ∈ [0.0, 0.25]`
and `prefix_max >= 0`.

```gleam
import gleam/result
import textmetrics/similarity

pub fn aggressive_winkler() -> Result(
  Float,
  similarity.JaroWinklerConfigError,
) {
  use cfg <- result.map(similarity.jaro_winkler_config(
    prefix_scale: 0.2,
    prefix_max: 6,
  ))
  similarity.jaro_winkler_with("MARTHA", "MARHTA", cfg)
}
```
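
To see what `prefix_scale` does, the textbook Winkler adjustment can be
written as a standalone function. This is a sketch of the published formula,
`jaro + l * p * (1 - jaro)`, and not the library's internal code;
`winkler_adjust` is a hypothetical name, and the caller is assumed to have
already capped the prefix length.

```gleam
import gleam/int

// jaro: plain Jaro score; prefix_len: shared-prefix length (already
// capped by the caller); prefix_scale: the Winkler scaling factor p.
pub fn winkler_adjust(
  jaro: Float,
  prefix_len: Int,
  prefix_scale: Float,
) -> Float {
  jaro +. int.to_float(prefix_len) *. prefix_scale *. { 1.0 -. jaro }
}

// "MARTHA" / "MARHTA" share the prefix "MAR" (length 3), Jaro is 0.944444:
// winkler_adjust(0.944444, 3, 0.1) is about 0.9611
// winkler_adjust(0.944444, 3, 0.2) is about 0.9778
```

The first value matches the default `jaro_winkler` score shown earlier, which
is consistent with a default scale of `0.1`.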

## Producing an edit script and a unified diff

`diff.myers` returns an `EditScript`: a list of `Equal | Insert | Delete`
steps from which both inputs can be reconstructed. `to_unified` renders that
script in the format `diff -u` produces.

```gleam
import textmetrics/diff

pub fn render_diff() -> String {
  let old = ["the quick brown", "fox jumps over", "the lazy dog"]
  let new = ["the quick brown", "fox leaps over", "the lazy dog"]
  let opts = diff.unified_options(old_name: "a", new_name: "b")
  diff.to_unified(diff.myers(old, new), opts)
}

// render_diff() ->
// --- a
// +++ b
// @@ -1,3 +1,3 @@
//  the quick brown
// -fox jumps over
// +fox leaps over
//  the lazy dog
```

`diff.unified_options` silently strips bytes that would corrupt the
`--- <old_name>` / `+++ <new_name>` header (CR, LF, NUL, TAB). For
caller-supplied paths where the bad bytes should surface as a typed error
instead of being dropped, use `diff.unified_options_checked`. It has the
same constructor shape and returns a
`Result(UnifiedOptions, UnifiedOptionsError)`.
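
A sketch of the checked variant in use. It assumes that "same constructor
shape" means the same `old_name:` / `new_name:` labels, and that
`UnifiedOptionsError` is exported from `textmetrics/diff`; the helper name
`render_checked` is hypothetical.

```gleam
import gleam/result
import textmetrics/diff

// Hypothetical helper: renders a unified diff, or returns the typed
// error when a header name contains a rejected byte.
pub fn render_checked(
  old: List(String),
  new: List(String),
  old_name: String,
  new_name: String,
) -> Result(String, diff.UnifiedOptionsError) {
  use opts <- result.map(diff.unified_options_checked(
    old_name: old_name,
    new_name: new_name,
  ))
  diff.to_unified(diff.myers(old, new), opts)
}
```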

`edit.recover_old` and `edit.recover_new` rebuild the inputs from a script.
The round-trip property is the principal invariant of the `Edit` ADT.

```gleam
import textmetrics/diff
import textmetrics/edit

pub fn round_trip_holds() -> Bool {
  let old = ["a", "b", "c"]
  let new = ["a", "x", "c"]
  let script = diff.myers(old, new)
  edit.recover_old(script) == old && edit.recover_new(script) == new
}

// round_trip_holds() -> True
```

## Longest common subsequence

```gleam
import textmetrics/lcs

pub fn lcs_example() -> #(Int, List(String)) {
  let a = ["A", "B", "C", "B", "D", "A", "B"]
  let b = ["B", "D", "C", "A", "B", "A"]
  #(lcs.length(a, b), lcs.sequence(a, b))
}

// lcs_example() -> #(4, ["B", "C", "B", "A"])  // one valid answer
```

`lcs.length` is uniquely determined by the inputs, whereas `lcs.sequence`
returns one of possibly several valid longest common subsequences. Consumers
should rely on `list.length(lcs.sequence(a, b)) == lcs.length(a, b)` rather
than on the specific subsequence chosen.
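
That invariant can be written down as a predicate, for example for use in a
property test. The helper name `lcs_invariant_holds` is hypothetical; the two
`lcs` functions are used as in the example above.

```gleam
import gleam/list
import textmetrics/lcs

// True when the recovered subsequence has exactly the reported length.
pub fn lcs_invariant_holds(a: List(String), b: List(String)) -> Bool {
  list.length(lcs.sequence(a, b)) == lcs.length(a, b)
}
```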

## Readability scores

Six canonical English-language readability formulas, plus the count
primitives they consume. All scores are `Result(Float, _)` so that
extremely small inputs (the SMOG 30-sentence floor, for example) get
a typed error instead of a non-finite number.

```gleam
import textmetrics/readability

pub fn grade_for(text: String) -> Result(Float, readability.ReadabilityError) {
  readability.flesch_kincaid_grade(text)
}

// grade_for("The quick brown fox jumps over the lazy dog.")
// -> Ok(~2.3)   (roughly a 2nd-grade reading level)
```
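
For reference, the published Flesch-Kincaid Grade formula is
`0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59`. The
sketch below evaluates it from raw counts. It is a standalone illustration,
not the library's implementation, and `fk_grade_from_counts` is a
hypothetical name.

```gleam
import gleam/int

// Flesch-Kincaid Grade from raw counts. The caller should make sure
// `words` and `sentences` are non-zero, since the score is meaningless
// otherwise.
pub fn fk_grade_from_counts(
  words: Int,
  sentences: Int,
  syllables: Int,
) -> Float {
  let w = int.to_float(words)
  let s = int.to_float(sentences)
  let y = int.to_float(syllables)
  0.39 *. { w /. s } +. 11.8 *. { y /. w } -. 15.59
}

// The pangram above, counted as 9 words, 1 sentence, 11 syllables:
// fk_grade_from_counts(9, 1, 11) is about 2.34
```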

`textmetrics/count` exposes the primitives in case you want to roll
your own metric: `count.words`, `count.sentences`,
`count.syllables_in_word`, `count.syllables`, `count.characters`,
`count.paragraphs`, `count.polysyllables`. The syllable counter is an
English-tuned heuristic (silent-`e` rule with the consonant-`le`
exception). Non-English text falls back to one syllable per word, so do
not interpret the resulting grade as meaningful outside English prose.
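
A sketch of a home-grown metric built on those primitives: average sentence
length in words. It assumes `count.words` and `count.sentences` each take a
`String` and return an `Int`, which is not confirmed above; the helper name
`words_per_sentence` is hypothetical.

```gleam
import gleam/int
import textmetrics/count

// Hypothetical metric: mean words per sentence, or Error(Nil) when the
// text contains no sentences.
pub fn words_per_sentence(text: String) -> Result(Float, Nil) {
  case count.sentences(text) {
    0 -> Error(Nil)
    n -> Ok(int.to_float(count.words(text)) /. int.to_float(n))
  }
}

// For "The quick brown fox jumps over the lazy dog." this should be
// Ok(9.0) if the text is counted as 9 words in 1 sentence.
```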

## Unicode policy

All string-typed functions operate on extended grapheme clusters via
`gleam/string.to_graphemes`. Inputs are pre-normalised to Unicode
Normalization Form C (NFC) before comparison, so canonically-equivalent
strings such as `"\u{00C1}"` (precomposed, `U+00C1`) and `"A\u{0301}"`
(decomposed, `U+0041 U+0301`) are treated as equal. The NFC step uses
the platform's built-in normaliser (`unicode:characters_to_nfc_binary/1`
on Erlang, `String.prototype.normalize("NFC")` on JavaScript), so no
extra dependency is required.
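
The policy can be checked with a short sketch: under the NFC step described
above, the precomposed and decomposed spellings of `Á` should be zero edits
apart. The function name `nfc_equivalent_distance` is hypothetical.

```gleam
import textmetrics/distance

pub fn nfc_equivalent_distance() -> Int {
  // Precomposed U+00C1 versus "A" followed by combining acute U+0301.
  distance.levenshtein("\u{00C1}", "A\u{0301}")
}

// Expected under the NFC policy: 0
```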

## Development

```sh
mise install
just deps
just ci
```

`just` recipes source `scripts/lib/mise_bootstrap.sh`, so `mise activate`
is not required in the current shell.

## License

[MIT](LICENSE)