# A much improved version is available here: https://github.com/elixir-unicode/unicode
See `Unicode.replace_invalid/3`.
# UniRecover
A library for substituting illegal bytes in Unicode-encoded data, following the W3C specification as suggested by the [Unicode Standard](https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=153).
This library leverages Erlang [Sub Binaries](https://www.erlang.org/doc/efficiency_guide/binaryhandling#sub-binaries) to scale well with large amounts of data. This should suffice for most use cases, short of those that necessitate NIF-based solutions.
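To illustrate the general idea, here is a minimal sketch of a sub-binary based scan. It is not UniRecover's actual implementation: the module `SubBinarySketch` and its function names are hypothetical, it handles UTF-8 only, and it emits one U+FFFD per invalid byte instead of following the specification's rules for grouping bytes. Valid runs are referenced with `binary_part/3` rather than rebuilt byte by byte.

```elixir
defmodule SubBinarySketch do
  # Hypothetical module for illustration only; not part of UniRecover.
  @replacement "\uFFFD"

  # Replaces every invalid byte in a UTF-8 binary with U+FFFD.
  def replace_invalid(bin) when is_binary(bin), do: scan(bin, bin, 0, 0, [])

  # rest:  the part of the input not yet scanned
  # orig:  the whole input, kept so valid runs can be sliced out of it
  # start: byte offset where the current valid run begins
  # len:   byte length of the current valid run
  defp scan(<<cp::utf8, rest::binary>>, orig, start, len, acc) do
    # Valid code point: extend the current run, allocate nothing for output.
    scan(rest, orig, start, len + byte_size(<<cp::utf8>>), acc)
  end

  defp scan(<<_invalid, rest::binary>>, orig, start, len, acc) do
    # Invalid byte: close the current run as a slice of the original input.
    acc = [@replacement, binary_part(orig, start, len) | acc]
    scan(rest, orig, start + len + 1, 0, acc)
  end

  defp scan(<<>>, orig, start, len, acc) do
    # The accumulator is built in reverse; assemble the output once at the end.
    [binary_part(orig, start, len) | acc]
    |> Enum.reverse()
    |> IO.iodata_to_binary()
  end
end

result = SubBinarySketch.replace_invalid(<<"foo", 0b11111111, "bar">>)
```

The only per-input bookkeeping in this sketch is the list of slices and replacement markers, which grows with the number of invalid bytes and not with the size of the input.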
## Installation
Add `:uni_recover` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:uni_recover, "~> 0.1.2"}
  ]
end
```
Documentation is available on [HexDocs](https://hexdocs.pm/uni_recover/readme.html) and may also be generated with [ExDoc](https://github.com/elixir-lang/ex_doc).
## Usage
```elixir
# 0b11111111 (0xFF) is an illegal UTF-8 code sequence
UniRecover.sub(<<"foo", 0b11111111, "bar">>)
# "foo�bar"

# 216, 0 (0xD800, a lone high surrogate) is an illegal UTF-16 code sequence
(UniRecover.sub(<<"foo"::utf16, 216, 0, "bar"::utf16>>, :utf16)
 |> :unicode.characters_to_binary(:utf16))
# "foo�bar"
```
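The inputs above can be checked for validity with the standard library alone, which shows why they need repairing in the first place. This is a small sketch that does not call UniRecover; the variable names are illustrative.

```elixir
utf8_input = <<"foo", 0b11111111, "bar">>

# The byte 0xFF never occurs in well-formed UTF-8, so validation fails.
utf8_valid? = String.valid?(utf8_input)

# Erlang's converter stops at the first bad byte and returns an error tuple
# holding the converted prefix and the unconverted remainder.
utf8_result = :unicode.characters_to_binary(utf8_input)

# 216, 0 is the big-endian code unit 0xD800, a high surrogate. The next
# code unit, 0x0062 ("b"), is not a low surrogate, so the pair cannot be
# decoded as a UTF-16 code point.
lone_surrogate_decodes? = match?(<<_::utf16, _::binary>>, <<216, 0, 0, 98>>)
```

Without a substitution step, a caller would have to stop at the error tuple; replacing the offending bytes lets the rest of the data be processed.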
## Benchmarking
The following benchmark demonstrates how UniRecover leverages sub binaries, allocating memory only for the indexes of illegal bytes. See the benchmarking folder in the repo for details.
```
Name                                  ips        average  deviation         median         99th %
UniRecover, 207KB Input           1842.84      542.64 μs     ±1.44%      539.67 μs      574.71 μs
Simple Rebuild, 207KB Input        172.02     5813.34 μs    ±13.88%     5534.29 μs     8223.92 μs
Naive 3-liner, 207KB Input          56.59    17670.58 μs     ±6.44%    17377.60 μs    19210.26 μs

Comparison:
UniRecover, 207KB Input           1842.84
Simple Rebuild, 207KB Input        172.02 - 10.71x slower +5270.70 μs
Naive 3-liner, 207KB Input          56.59 - 32.56x slower +17127.94 μs

Memory usage statistics:

Name                             Memory usage
UniRecover, 207KB Input                 296 B
Simple Rebuild, 207KB Input         8215208 B - 27754.08x memory usage +8214912 B
Naive 3-liner, 207KB Input         39556040 B - 133635.27x memory usage +39555744 B
```
For reference, the `Simple Rebuild` implementation allocated 39.66x the size of the original JSON input, and the `Naive 3-liner` implementation allocated 191x.
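The gap in memory usage comes down to whether slices of the input are shared or copied. The following sketch, which is independent of UniRecover and of the benchmark code, uses `:binary.referenced_byte_size/1` to show the difference; the sizes are arbitrary and chosen to be well above the threshold below which the VM copies small binaries.

```elixir
# A 1000-byte binary, large enough to be stored off-heap.
big = :binary.copy("a", 1000)

# Slicing a large binary yields a sub binary that points into `big`.
slice = binary_part(big, 250, 500)

# Size of the underlying binary that `slice` keeps referencing.
shared = :binary.referenced_byte_size(slice)

# An explicit copy owns its own 500 bytes and no longer references `big`.
copied = :binary.referenced_byte_size(:binary.copy(slice))
```

A slice that shares the original data costs only a small reference, while an approach that rebuilds the output copies every byte it keeps.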