# textmetrics
String comparison and readability metrics for Gleam: edit distances,
similarity scores, longest common subsequence, line-level diff, and
canonical readability scores (Flesch-Kincaid, Gunning Fog, SMOG, ARI,
Coleman-Liau) with the word / sentence / syllable count primitives
they consume.
## Features
- Edit distances: Levenshtein, Damerau-Levenshtein (true variant), OSA, Hamming
- Similarity scores in `[0.0, 1.0]`: Jaro, Jaro-Winkler, Sørensen-Dice
- Longest common subsequence (length and a recovered sequence)
- Diff: Myers (1986) and patience, plus POSIX unified-diff rendering
- `did_you_mean` and Jaro-Winkler ranking for spell-correction style search
- Readability: Flesch Reading Ease, Flesch-Kincaid Grade, Gunning Fog,
SMOG, Automated Readability Index, Coleman-Liau
- Count primitives: words, sentences, syllables, characters, paragraphs,
polysyllables
- Pure Gleam, runs on Erlang and JavaScript targets
- Operates on Unicode grapheme clusters (UAX #29)
## Install
```sh
gleam add textmetrics
```
## Suggesting a similar command name
When a user mistypes a subcommand or flag, fall back to a Levenshtein-bounded
search to suggest the closest known names. Ties are broken by the order in
which the names appear in the candidate list, which keeps suggestions
deterministic.
```gleam
import textmetrics/search

pub fn suggest_command(typed: String) -> List(String) {
  let known = ["install", "uninstall", "remove", "update", "help"]
  search.did_you_mean(typed, known, 2)
}

// suggest_command("instal") -> ["install"]
// suggest_command("updat") -> ["update"]
// suggest_command("xyz") -> []
```
`search.closest` is the convenience form when only one suggestion is
needed (the typical "Did you mean `X`?" path).
```gleam
import gleam/option.{type Option}
import textmetrics/search

pub fn one_suggestion(typed: String) -> Option(String) {
  let known = ["install", "uninstall", "remove", "update", "help"]
  search.closest(typed, known, 2)
}

// one_suggestion("instal") -> Some("install")
// one_suggestion("xyz") -> None
```
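One way to surface that suggestion in a CLI is to fold the `Option` into an error message. The sketch below reuses the `search.closest` call from the example above; `unknown_command_message` and the message wording are hypothetical, not part of textmetrics.

```gleam
import gleam/option.{None, Some}
import textmetrics/search

// Hypothetical helper: builds the error line for an unrecognised command.
pub fn unknown_command_message(typed: String) -> String {
  let known = ["install", "uninstall", "remove", "update", "help"]
  let base = "unknown command: " <> typed
  case search.closest(typed, known, 2) {
    // A candidate within the distance bound exists, so offer it.
    Some(name) -> base <> " (did you mean " <> name <> "?)"
    // Nothing is close enough: report the error without a guess.
    None -> base
  }
}

// unknown_command_message("instal")
//   -> "unknown command: instal (did you mean install?)"
// unknown_command_message("xyz") -> "unknown command: xyz"
```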
## Comparing strings: Levenshtein, Damerau-Levenshtein, OSA
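`levenshtein` counts the minimum number of insert / delete / substitute
operations. `damerau_levenshtein` adds adjacent-grapheme transposition and
lets the same substring participate in multiple edits. `osa` (optimal string
alignment) uses the same operations but allows each substring to be edited at
most once, which is what most "Damerau distance" libraries actually compute.
For `"CA"` and `"ABC"`, the true variant transposes `CA` to `AC` and then
inserts `B` between the two (2 edits); OSA may not edit the transposed pair
again, so it needs 3.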
```gleam
import textmetrics/distance

pub fn distances() -> #(Int, Int, Int) {
  let l = distance.levenshtein("CA", "ABC")
  let dl = distance.damerau_levenshtein("CA", "ABC")
  let o = distance.osa("CA", "ABC")
  #(l, dl, o)
}

// distances() -> #(3, 2, 3)
```
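Because OSA is a restricted form of Damerau-Levenshtein, and both extend Levenshtein with transposition, the three distances are always ordered `damerau_levenshtein <= osa <= levenshtein`. A small sketch of that check, using only the functions shown above (`distances_are_ordered` is a hypothetical name):

```gleam
import textmetrics/distance

// Hypothetical property check, handy in a unit test.
pub fn distances_are_ordered(a: String, b: String) -> Bool {
  let l = distance.levenshtein(a, b)
  let dl = distance.damerau_levenshtein(a, b)
  let o = distance.osa(a, b)
  // Allowing more kinds of edit sequences can only shorten the distance.
  dl <= o && o <= l
}

// distances_are_ordered("CA", "ABC") -> True (2 <= 3 <= 3)
```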
`distance.normalized_levenshtein` rescales Levenshtein distance to a
similarity in `[0.0, 1.0]` (`1.0` means identical). Use it when ranking
by edit distance is preferred over Jaro-Winkler.
```gleam
import textmetrics/distance

pub fn levenshtein_similarity() -> Float {
  // levenshtein = 3, max graphemes = 7 → 1 - 3/7 = 4/7.
  distance.normalized_levenshtein("kitten", "sitting")
}
```
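The comment in the example follows the usual normalisation, `1 - distance / max(length_a, length_b)`. A standalone sketch of that arithmetic is below; `normalize_sketch` is a hypothetical helper, and returning `1.0` for two empty strings is an assumption rather than documented textmetrics behaviour.

```gleam
import gleam/int

// Hypothetical helper, not part of textmetrics: rescales an edit distance
// to a similarity in [0.0, 1.0] given both grapheme counts.
pub fn normalize_sketch(distance: Int, len_a: Int, len_b: Int) -> Float {
  case int.max(len_a, len_b) {
    // Assumption: two empty strings count as identical.
    0 -> 1.0
    longest -> 1.0 -. int.to_float(distance) /. int.to_float(longest)
  }
}

// normalize_sketch(3, 6, 7) -> 0.571_428… (4/7, the "kitten" / "sitting" case)
```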
`hamming` requires equal-length inputs and returns
`Error(LengthMismatch(...))` otherwise, so callers do not have to
pre-validate the input lengths.
```gleam
import textmetrics/distance

pub fn hamming_check(
  a: String,
  b: String,
) -> Result(Int, distance.HammingError) {
  distance.hamming(a, b)
}

// hamming_check("karolin", "kathrin") -> Ok(3)
// hamming_check("ab", "abc") -> Error(LengthMismatch(left: 2, right: 3))
```
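A caller that would rather not propagate the error can branch on the `Result`. The sketch below falls back to Levenshtein when the lengths differ; `hamming_or_levenshtein` is a hypothetical name, and the wildcard `Error(_)` avoids assuming anything about the variants of `HammingError`.

```gleam
import textmetrics/distance

// Hypothetical helper: Hamming where defined, Levenshtein otherwise.
pub fn hamming_or_levenshtein(a: String, b: String) -> Int {
  case distance.hamming(a, b) {
    Ok(d) -> d
    // Hamming is undefined for unequal lengths; use edit distance instead.
    Error(_) -> distance.levenshtein(a, b)
  }
}

// hamming_or_levenshtein("karolin", "kathrin") -> 3
// hamming_or_levenshtein("ab", "abc") -> 1
```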
## Scoring similarity: Jaro, Jaro-Winkler, Sørensen-Dice
`jaro_winkler` boosts scores for strings that share a common prefix, which
suits typo tolerance in human names. `sorensen_dice` compares the multisets
of grapheme n-grams of the two strings; the third argument is `n`.
```gleam
import textmetrics/similarity

pub fn jaro_score() -> Float {
  similarity.jaro("MARTHA", "MARHTA")
}

pub fn jaro_winkler_score() -> Float {
  similarity.jaro_winkler("MARTHA", "MARHTA")
}

pub fn dice_score() -> Result(Float, similarity.SorensenDiceError) {
  similarity.sorensen_dice("night", "nacht", 2)
}

// jaro_score() -> 0.944_444…
// jaro_winkler_score() -> 0.961_111…
// dice_score() -> Ok(0.25)
```
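The Jaro-Winkler figure above can be reproduced by hand from the Jaro score with the standard Winkler adjustment, `jw = jaro + prefix_len * prefix_scale * (1 - jaro)`. The sketch below is plain arithmetic, not the library's implementation; `winkler_boost` is a hypothetical name, the caller supplies the shared-prefix length, and the scale `0.1` is the conventional default, assumed here.

```gleam
import gleam/int

// Hypothetical helper: applies the Winkler prefix bonus to a Jaro score.
pub fn winkler_boost(
  jaro: Float,
  prefix_len: Int,
  prefix_scale: Float,
) -> Float {
  let bonus = int.to_float(prefix_len) *. prefix_scale
  jaro +. bonus *. { 1.0 -. jaro }
}

// "MARTHA" and "MARHTA" share the prefix "MAR" (3 graphemes):
// winkler_boost(0.944_444, 3, 0.1) -> 0.961_11…
```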
`jaro_winkler_with` accepts a validated `JaroWinklerConfig` for non-default
prefix scaling. The smart constructor enforces `prefix_scale ∈ [0.0, 0.25]`
and `prefix_max >= 0`.
```gleam
import gleam/result
import textmetrics/similarity

pub fn aggressive_winkler() -> Result(
  Float,
  similarity.JaroWinklerConfigError,
) {
  use cfg <- result.map(similarity.jaro_winkler_config(
    prefix_scale: 0.2,
    prefix_max: 6,
  ))
  similarity.jaro_winkler_with("MARTHA", "MARHTA", cfg)
}
```
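When the scale comes from user input, one option is to fall back to the default scoring if the smart constructor rejects it. A minimal sketch using only the functions shown above; `winkler_or_default` and the fixed `prefix_max: 4` are hypothetical choices.

```gleam
import textmetrics/similarity

// Hypothetical helper: custom prefix scale when valid, defaults otherwise.
pub fn winkler_or_default(a: String, b: String, scale: Float) -> Float {
  case similarity.jaro_winkler_config(prefix_scale: scale, prefix_max: 4) {
    Ok(cfg) -> similarity.jaro_winkler_with(a, b, cfg)
    // Out-of-range scale (for example 0.5): ignore it and use the defaults.
    Error(_) -> similarity.jaro_winkler(a, b)
  }
}
```

Silently ignoring a bad value is only appropriate when the scale is a tuning hint; for configuration files, returning the `Result` as in `aggressive_winkler` is usually the clearer choice.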
## Producing an edit script and a unified diff
`diff.myers` returns an `EditScript` — a list of `Equal | Insert | Delete`
steps that round-trips both inputs. `to_unified` renders that script in
the format `diff -u` produces.
```gleam
import textmetrics/diff

pub fn render_diff() -> String {
  let old = ["the quick brown", "fox jumps over", "the lazy dog"]
  let new = ["the quick brown", "fox leaps over", "the lazy dog"]
  let opts = diff.unified_options(old_name: "a", new_name: "b")
  diff.to_unified(diff.myers(old, new), opts)
}

// render_diff() ->
// --- a
// +++ b
// @@ -1,3 +1,3 @@
//  the quick brown
// -fox jumps over
// +fox leaps over
//  the lazy dog
```
`diff.unified_options` silently strips bytes that would corrupt the
`--- <old_name>` / `+++ <new_name>` header (CR, LF, NUL, TAB). For
caller-supplied paths where you want the bad bytes to surface as a typed
error instead of being dropped, use `diff.unified_options_checked` — same
constructor shape, returns a `Result(UnifiedOptions, UnifiedOptionsError)`.
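A sketch of the checked path for caller-supplied names follows. It assumes that `unified_options_checked` takes the same `old_name:` / `new_name:` labels as `unified_options` and that the error type is exported as `diff.UnifiedOptionsError`; `render_checked` is a hypothetical name.

```gleam
import gleam/result
import textmetrics/diff

// Hypothetical helper: renders a diff, or returns the typed error when a
// header name contains a byte that would corrupt the output.
pub fn render_checked(
  old_path: String,
  new_path: String,
  old: List(String),
  new: List(String),
) -> Result(String, diff.UnifiedOptionsError) {
  use opts <- result.map(diff.unified_options_checked(
    old_name: old_path,
    new_name: new_path,
  ))
  diff.to_unified(diff.myers(old, new), opts)
}
```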

`edit.recover_old` and `edit.recover_new` rebuild the two inputs from a
script. The round-trip property is the principal invariant of the `Edit` ADT.
```gleam
import textmetrics/diff
import textmetrics/edit

pub fn round_trip_holds() -> Bool {
  let old = ["a", "b", "c"]
  let new = ["a", "x", "c"]
  let script = diff.myers(old, new)
  edit.recover_old(script) == old && edit.recover_new(script) == new
}

// round_trip_holds() -> True
```
## Longest common subsequence
```gleam
import textmetrics/lcs

pub fn lcs_example() -> #(Int, List(String)) {
  let a = ["A", "B", "C", "B", "D", "A", "B"]
  let b = ["B", "D", "C", "A", "B", "A"]
  #(lcs.length(a, b), lcs.sequence(a, b))
}

// lcs_example() -> #(4, ["B", "C", "B", "A"]) // one valid answer
```
`lcs.length` is well-defined; `lcs.sequence` returns one valid longest
common subsequence when several exist. Consumers should rely on
`list.length(lcs.sequence(a, b)) == lcs.length(a, b)` rather than on the
specific subsequence chosen.
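
That invariant can be written as a small property check, for example in a test suite. A sketch using only the two functions above plus `gleam/list`; `lcs_length_agrees` is a hypothetical name.

```gleam
import gleam/list
import textmetrics/lcs

// Hypothetical property check: the recovered sequence has the reported
// length, whichever valid subsequence the library happens to return.
pub fn lcs_length_agrees(a: List(String), b: List(String)) -> Bool {
  list.length(lcs.sequence(a, b)) == lcs.length(a, b)
}
```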
## Readability scores
Six canonical English-language readability formulas, plus the count
primitives they consume. All scoring functions return `Result(Float, _)`,
so inputs that are too short (below the SMOG 30-sentence floor, for
example) produce a typed error instead of a non-finite number.
```gleam
import textmetrics/readability

pub fn grade_for(text: String) -> Result(Float, readability.ReadabilityError) {
  readability.flesch_kincaid_grade(text)
}

// grade_for("The quick brown fox jumps over the lazy dog.")
// -> Ok(~2.3) (mid-2nd-grade reading level)
```
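The `~2.3` figure can be checked against the published Flesch-Kincaid grade formula, `0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59`. The sketch below is plain arithmetic over pre-computed counts, not the library's implementation; `fk_grade_from_counts` is a hypothetical name, and the counts in the trailing comment were tallied by hand.

```gleam
import gleam/int

// Hypothetical helper: Flesch-Kincaid grade from pre-computed counts.
// Gleam's float division yields 0.0 when dividing by zero, so validate
// that `words` and `sentences` are non-zero before trusting the result.
pub fn fk_grade_from_counts(
  words: Int,
  sentences: Int,
  syllables: Int,
) -> Float {
  let w = int.to_float(words)
  let words_per_sentence = w /. int.to_float(sentences)
  let syllables_per_word = int.to_float(syllables) /. w
  0.39 *. words_per_sentence +. 11.8 *. syllables_per_word -. 15.59
}

// The fox sentence has 9 words, 1 sentence, and 11 syllables:
// fk_grade_from_counts(9, 1, 11) -> 2.34…
```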
`textmetrics/count` exposes the primitives in case you want to roll
your own metric — `count.words`, `count.sentences`,
`count.syllables_in_word`, `count.syllables`, `count.characters`,
`count.paragraphs`, `count.polysyllables`. The syllable counter is an
English-tuned heuristic (silent-`e` rule with the consonant-`le`
exception). Non-English text falls back to one syllable per word; do
not interpret the resulting grade as meaningful outside English prose.
## Unicode policy
All string-typed functions operate on extended grapheme clusters via
`gleam/string.to_graphemes`. Inputs are pre-normalised to Unicode
Normalization Form C (NFC) before comparison, so canonically-equivalent
strings such as `"\u{00C1}"` (precomposed, `U+00C1`) and `"A\u{0301}"`
(decomposed, `U+0041 U+0301`) are treated as equal. The NFC step uses
the platform's built-in normaliser (`unicode:characters_to_nfc_binary/1`
on Erlang, `String.prototype.normalize("NFC")` on JavaScript), so no
extra dependency is required.
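
A quick way to see the policy in action is to compare the two spellings of `Á` directly. The sketch below assumes the NFC step applies to `distance.levenshtein` like any other string-typed function, in which case it should return `True`; `canonical_forms_match` is a hypothetical name.

```gleam
import textmetrics/distance

// Hypothetical check: precomposed and decomposed forms of the same grapheme.
pub fn canonical_forms_match() -> Bool {
  let precomposed = "\u{00C1}"
  let decomposed = "A\u{0301}"
  // After NFC normalisation both inputs hold one identical grapheme,
  // so no edits should be needed.
  distance.levenshtein(precomposed, decomposed) == 0
}
```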
## Development
```sh
mise install
just deps
just ci
```
`just` recipes source `scripts/lib/mise_bootstrap.sh`, so `mise activate`
is not required in the current shell.
## License
[MIT](LICENSE)