# A much improved version is available here: https://github.com/elixir-unicode/unicode
See `Unicode.replace_invalid/3`.

# UniRecover
A library for substituting illegal bytes in Unicode-encoded data with the replacement character (U+FFFD), following the W3C practice recommended by the [Unicode Standard](https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=153).

This library leverages Erlang [Sub Binaries](https://www.erlang.org/doc/efficiency_guide/binaryhandling#sub-binaries) to scale well with large amounts of data. This should suffice for most use-cases, short of those that may necessitate NIF-based solutions.
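To illustrate the general idea behind the sub binary approach, here is a minimal sketch. It is not UniRecover's actual implementation: the module `SubBinarySketch` and its functions are hypothetical, it handles UTF-8 only, and it replaces each offending byte individually rather than applying the substitution rules from the Unicode Standard. The point is only that matching on `rest::binary` walks the input without copying it, so the scan needs to remember nothing but the offsets of bad bytes.

```elixir
defmodule SubBinarySketch do
  # Hypothetical illustration, NOT the UniRecover implementation.
  @replacement "\uFFFD"

  # Returns the input untouched when no illegal bytes are found.
  def replace_invalid(bin) when is_binary(bin) do
    case scan(bin, byte_size(bin), []) do
      [] -> bin
      offsets -> rebuild(bin, offsets)
    end
  end

  # A well-formed code point: skip it. `rest` is a sub binary, i.e. a
  # reference into the original data rather than a copy.
  defp scan(<<_cp::utf8, rest::binary>>, total, acc), do: scan(rest, total, acc)

  # Anything else: record the offset of the offending byte and move on.
  defp scan(<<_byte, rest::binary>>, total, acc) do
    scan(rest, total, [total - byte_size(rest) - 1 | acc])
  end

  defp scan(<<>>, _total, acc), do: Enum.reverse(acc)

  # Stitch the valid slices together around the replacement character.
  # `binary_part/3` also returns sub binaries, so the only full copy is
  # the final `IO.iodata_to_binary/1`.
  defp rebuild(bin, offsets) do
    {parts, last} =
      Enum.reduce(offsets, {[], 0}, fn off, {parts, start} ->
        {[parts, binary_part(bin, start, off - start), @replacement], off + 1}
      end)

    IO.iodata_to_binary([parts, binary_part(bin, last, byte_size(bin) - last)])
  end
end
```

With this sketch, `SubBinarySketch.replace_invalid(<<"foo", 0xFF, "bar">>)` returns `"foo\uFFFDbar"`, and already valid input is returned as-is without being rebuilt.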

## Installation
Add `:uni_recover` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:uni_recover, "~> 0.1.2"}
  ]
end
```

Documentation is available on [HexDocs](https://hexdocs.pm/uni_recover/readme.html) and may also be generated with [ExDoc](https://github.com/elixir-lang/ex_doc).

## Usage
```elixir
# 0b11111111 (0xFF) is a byte that can never appear in valid UTF-8
UniRecover.sub(<<"foo", 0b11111111, "bar">>)
# "foo�bar"

# 216, 0 (0xD800) is a lone high surrogate, which is illegal in UTF-16
<<"foo"::utf16, 216, 0, "bar"::utf16>>
|> UniRecover.sub(:utf16)
|> :unicode.characters_to_binary(:utf16)
# "foo�bar"
```

## Benchmarking
The following benchmark demonstrates how UniRecover leverages sub binaries, allocating memory only for the indexes of illegal bytes. See the benchmarking folder in the repo for details.

```
Name                                  ips        average  deviation         median         99th %
UniRecover, 207KB Input           1842.84      542.64 μs     ±1.44%      539.67 μs      574.71 μs
Simple Rebuild, 207KB Input        172.02     5813.34 μs    ±13.88%     5534.29 μs     8223.92 μs
Naive 3-liner, 207KB Input          56.59    17670.58 μs     ±6.44%    17377.60 μs    19210.26 μs

Comparison: 
UniRecover, 207KB Input           1842.84
Simple Rebuild, 207KB Input        172.02 - 10.71x slower +5270.70 μs
Naive 3-liner, 207KB Input          56.59 - 32.56x slower +17127.94 μs

Memory usage statistics:

Name                           Memory usage
UniRecover, 207KB Input               296 B
Simple Rebuild, 207KB Input       8215208 B - 27754.08x memory usage +8214912 B
Naive 3-liner, 207KB Input       39556040 B - 133635.27x memory usage +39555744 B
```

For reference, the `Simple Rebuild` implementation allocated 39.66x the size of the original JSON input, and the `Naive 3-liner` was worse still at roughly 191x the size of the input.