# Ftfy — fixes text for you

An Elixir port of the Python [ftfy](https://github.com/rspeer/python-ftfy)
library (version 6.3.1). It takes in broken Unicode text and makes it less
broken — most importantly, it detects and fixes *mojibake* (text that was
decoded in the wrong encoding).

```elixir
iex> Ftfy.fix_text("âœ” No problems")
"✔ No problems"

iex> Ftfy.fix_text("Broken text&hellip; it&#x2019;s flubberific!")
"Broken text… it's flubberific!"

iex> Ftfy.fix_text("ＬＯＵＤ　ＮＯＩＳＥＳ")
"LOUD NOISES"

iex> Ftfy.fix_encoding_and_explain("sÃ³")
{"só", [{"encode", "latin-1"}, {"decode", "utf-8"}]}
```

## What it does

`Ftfy.fix_text/2` runs a sequence of fixes, each individually configurable via
`Ftfy.TextFixerConfig`:

- **fix_encoding** — detect mojibake and undo it by re-encoding and re-decoding
  through the right pair of encodings (the heart of ftfy), including the
  sub-fixes `restore_byte_a0`, `replace_lossy_sequences`,
  `decode_inconsistent_utf8`, and `fix_c1_controls`
- **unescape_html** — decode HTML entities (`&amp;`, `&eacute;`, `&rsquo;`, …)
- **remove_terminal_escapes** — strip ANSI color codes
- **fix_latin_ligatures** — `ﬁ` → `fi`
- **fix_character_width** — fullwidth/halfwidth → standard width
- **uncurl_quotes** — curly quotes → straight quotes
- **fix_line_breaks** — CRLF, CR, LS, PS, NEL → `\n`
- **fix_surrogates** — repair UTF-16 surrogate pairs
- **remove_control_chars** — strip useless control characters
- Unicode **normalization** (NFC by default)
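
Several of the fixes listed above are plain character substitutions. As a rough illustration of what `fix_line_breaks` and `uncurl_quotes` describe, here is a minimal sketch that uses only the Elixir standard library. The `SimpleFixesSketch` module and its function names are hypothetical and not part of Ftfy, and the set of quote characters it handles is an assumption that may be narrower than what the library covers.

```elixir
defmodule SimpleFixesSketch do
  # Hypothetical helpers for illustration only; this is not Ftfy's code.

  # Map CRLF, CR, LS (U+2028), PS (U+2029) and NEL (U+0085) to "\n".
  # CRLF is replaced first so that it yields one newline rather than two.
  def fix_line_breaks(text) do
    text
    |> String.replace("\r\n", "\n")
    |> String.replace(["\r", "\u2028", "\u2029", "\u0085"], "\n")
  end

  # Replace the most common curly quotes with their ASCII equivalents.
  def uncurl_quotes(text) do
    text
    |> String.replace(["\u2018", "\u2019"], "'")
    |> String.replace(["\u201C", "\u201D"], "\"")
  end
end

SimpleFixesSketch.fix_line_breaks("one\r\ntwo\rthree\u2028four")
# => "one\ntwo\nthree\nfour"

SimpleFixesSketch.uncurl_quotes("it\u2019s \u201Cfine\u201D")
# => "it's \"fine\""
```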

Other entry points mirror the Python API: `fix_and_explain/2`,
`fix_encoding/2`, `fix_encoding_and_explain/2`, `fix_text_segment/2`,
`apply_plan/2`, `guess_bytes/1`, `fix_file/2`, and `explain_unicode/1`. The
`Ftfy.Fixes`, `Ftfy.Badness`, `Ftfy.Chardata`, `Ftfy.Codecs`, and
`Ftfy.Formatting` modules expose the lower-level building blocks.
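
To make the plan format from the opening example concrete, the sketch below replays a plan of `{"encode", ...}` and `{"decode", ...}` steps by hand. It is not the implementation of `Ftfy.apply_plan/2`: the `PlanSketch` module is hypothetical, it understands only the two steps shown, and it relies on Erlang's `:unicode.characters_to_binary/3` for the Latin-1 encoding step.

```elixir
defmodule PlanSketch do
  # Hypothetical illustration; supports only the two steps used below.

  # Apply each step of the plan in order, threading the text through.
  def replay(text, plan), do: Enum.reduce(plan, text, &step/2)

  # "encode": write every code point (0..255) as a single Latin-1 byte.
  defp step({"encode", "latin-1"}, text) do
    case :unicode.characters_to_binary(text, :utf8, :latin1) do
      bytes when is_binary(bytes) -> bytes
      _ -> raise ArgumentError, "text is not representable in latin-1"
    end
  end

  # "decode": reinterpret those bytes as UTF-8. Elixir strings are already
  # UTF-8 binaries, so decoding here amounts to a validity check.
  defp step({"decode", "utf-8"}, bytes) do
    if String.valid?(bytes) do
      bytes
    else
      raise ArgumentError, "bytes are not valid UTF-8"
    end
  end
end

plan = [{"encode", "latin-1"}, {"decode", "utf-8"}]
PlanSketch.replay("sÃ³", plan)
# => "só"
```

The mojibake `"sÃ³"` is what you get when the UTF-8 bytes of `"só"` are misread as Latin-1; encoding it back to Latin-1 recovers the original bytes, which then decode cleanly as UTF-8.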

## Configuration

Pass a keyword list or a `%Ftfy.TextFixerConfig{}`:

```elixir
Ftfy.fix_text(text, uncurl_quotes: false)
Ftfy.fix_text(text, %Ftfy.TextFixerConfig{normalization: "NFKC"})
```
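
For a sense of what switching the normalization form changes, the sketch below compares NFC and NFKC using Elixir's own `String.normalize/2`. It does not call Ftfy, and it assumes that the `"NFKC"` option corresponds to the standard Unicode normalization form of the same name; the `NormalizationSketch` module is hypothetical.

```elixir
defmodule NormalizationSketch do
  # Hypothetical helper: return both normalization forms side by side.
  def forms(text) do
    %{nfc: String.normalize(text, :nfc), nfkc: String.normalize(text, :nfkc)}
  end
end

# NFC composes combining sequences but keeps compatibility characters.
NormalizationSketch.forms("e\u0301").nfc
# => "é"

# NFKC additionally folds compatibility characters such as ligatures
# and fullwidth letters into their plain equivalents.
NormalizationSketch.forms("ﬁ")
# => %{nfc: "ﬁ", nfkc: "fi"}

NormalizationSketch.forms("ＬＯＵＤ　ＮＯＩＳＥＳ").nfkc
# => "LOUD NOISES"
```

NFKC is lossy by design, so it is worth choosing only when those compatibility distinctions do not matter for your data.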

## Command line

Build the escript and fix text from a file or stdin:

```sh
mix escript.build
echo 'âœ” No problems' | ./ftfy
./ftfy -e latin-1 broken.txt -o fixed.txt
```

## Installation

Add `ftfy` to your dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:ftfy, "~> 0.1.0"}
  ]
end
```

## Notes on the port

- The encoding-detection data tables (HTML entities, the single-byte charmap
  encodings, the fullwidth/halfwidth map, the `wcwidth` width tables) and the
  two large heuristic regexes are generated from the reference implementation by
  `scripts/gen_data.py` into the `Ftfy.Data` module (internal, undocumented).
  The reference package is vendored as a git submodule at `vendor/python-ftfy`
  (pinned to the `v6.3.1` tag); run `git submodule update --init` before
  regenerating.
- `Ftfy.Codecs` reimplements Python's `bad_codecs`: the `sloppy-windows-*` and
  related charmap encodings, and the `utf-8-variants` (CESU-8 / Java modified
  UTF-8) decoder, including incremental decoding.
- The behavioral test corpus is read directly from the pinned
  `vendor/python-ftfy` submodule (`tests/test_cases.json`); the unit tests are
  ported from python-ftfy. All 151 "pass" cases and 10 "known failure" cases
  match the reference. (Running the tests therefore needs the submodule:
  `git submodule update --init`.)
- Two deliberate differences: an Elixir string (a valid UTF-8 binary) cannot
  contain lone UTF-16 surrogate codepoints, so `Ftfy.Fixes.fix_surrogates/1`
  is effectively a no-op on valid strings; and `explain_unicode/1` omits the
  Unicode character *name*, because the BEAM has no names database.
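
The CESU-8 and surrogate notes above can be illustrated with binary pattern matching. The sketch below is not the `Ftfy.Codecs` implementation: `Cesu8Sketch` is a hypothetical module that only recombines a CESU-8 surrogate pair (two three-byte sequences) into one UTF-8 character and passes every other byte through. It does no incremental decoding and does not handle the Java modified UTF-8 encoding of NUL.

```elixir
defmodule Cesu8Sketch do
  import Bitwise

  # Hypothetical illustration only.
  # A high surrogate (U+D800..U+DBFF) encodes as ED A0..AF xx and a low
  # surrogate (U+DC00..U+DFFF) as ED B0..BF xx. Each carries 10 payload bits.
  def decode(
        <<0xED, 0b1010::4, h1::4, 0b10::2, h2::6,
          0xED, 0b1011::4, l1::4, 0b10::2, l2::6, rest::binary>>
      ) do
    high = (h1 <<< 6) + h2
    low = (l1 <<< 6) + l2
    codepoint = 0x10000 + (high <<< 10) + low
    <<codepoint::utf8>> <> decode(rest)
  end

  # Anything else is copied through byte by byte.
  def decode(<<byte, rest::binary>>), do: <<byte>> <> decode(rest)
  def decode(<<>>), do: <<>>
end

# U+1F600 written as the surrogate pair D83D DE00 in CESU-8:
Cesu8Sketch.decode(<<0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80>>)
# => "😀"

# A lone surrogate is not valid UTF-8, so it can never appear in a
# valid Elixir string:
String.valid?(<<0xED, 0xA0, 0xBD>>)
# => false
```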

## License and credits

This library is a port of [ftfy](https://github.com/rspeer/python-ftfy)
("fixes text for you"), created by **Robyn Speer**. ftfy is the result of years
of careful work on the messy reality of broken Unicode, and this Elixir port
exists only because of it — our deepest thanks to Robyn Speer for building and
maintaining the original, and for releasing it under a permissive license.

- Original ftfy: Copyright 2023 Robyn Speer, licensed under the Apache
  License, Version 2.0 — <https://github.com/rspeer/python-ftfy>
- This Elixir port: Copyright 2026 FashionUnited, also licensed under the
  Apache License, Version 2.0.

The data tables and test corpus in this repository are generated from, or
ported directly from, python-ftfy 6.3.1 and remain the work of the original author.
See [`LICENSE`](https://github.com/fuww/ftfy/blob/main/LICENSE) for the full
license text and [`NOTICE`](https://github.com/fuww/ftfy/blob/main/NOTICE) for
the attribution and change notice required by the Apache License.

If you use ftfy in research, please cite the original author's work as
described at <https://github.com/rspeer/python-ftfy>.