# Hyphen8
## Introduction
Hyphen8 is a pure Elixir implementation of the Knuth-Liang hyphenation algorithm. The algorithm was the subject of Franklin Mark Liang's [1983 Stanford dissertation, _WORD HY-PHEN-A-TION BY COM-PUT-ER_](http://www.tug.org/docs/liang/liang-thesis.pdf), and it remains the standard hyphenation method in TeX.
## Usage
Pass a string to `Hyphen8.start()`:
```command
iex> Hyphen8.start("let's hyphenate containerization orchestration platform")
process #PID<0.226.0>: hyphenating "{#PID<0.232.0>, #Reference<0.1715096087.1034682369.71350>}"
process #PID<0.225.0>: hyphenating "{#PID<0.233.0>, #Reference<0.1715096087.1034682369.71351>}"
```
`Hyphen8` will begin spawning worker processes. Each process prints its PID and the reference it is working on to the screen.
You will then receive the hyphenated string:
```
"let's hy-phen-ate con-tainer-iza-tion or-ches-tra-tion plat-form"
```
The current version will not reconstruct sentence punctuation, newlines, or other meta-characters. Apostrophes are reconstructed but only naively.
To customize the string and word splitting, adjust the regular expressions in `Hyphen8.Engine.parse_words()` and `Hyphen8.Engine.parse_characters()`.
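As a rough illustration of the kind of change involved, the sketch below splits a string into words and a word into letters using regular expressions passed in as arguments. The module `SplitSketch`, its function bodies, and both example patterns are assumptions made for this example. They are not copied from `Hyphen8.Engine`, whose actual expressions may differ.

```elixir
defmodule SplitSketch do
  # Hypothetical stand-in for Hyphen8.Engine.parse_words/1:
  # split on whatever the given pattern treats as a separator.
  def parse_words(text, separator) do
    String.split(text, separator, trim: true)
  end

  # Hypothetical stand-in for Hyphen8.Engine.parse_characters/1:
  # keep only the characters the given pattern matches.
  def parse_characters(word, keep) do
    keep
    |> Regex.scan(word)
    |> List.flatten()
  end
end

SplitSketch.parse_words("let's hyphenate   words", ~r/\s+/)
# => ["let's", "hyphenate", "words"]

SplitSketch.parse_characters("let's", ~r/[a-z]/i)
# => ["l", "e", "t", "s"]
```

Widening or narrowing either pattern changes which characters reach the hyphenation step, so it is worth testing any change against text that contains apostrophes and punctuation.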
## Performance Hacks
To optimize for speed, you can adjust the size of your worker pool. Adjust the value for `:size` in `Hyphen8.Application.poolboy_config()` to resize the worker pool.
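For orientation, a poolboy configuration is a keyword list along the following lines. This is a sketch under assumptions: the module, pool name, and worker module are hypothetical, and only the option keys (`:name`, `:worker_module`, `:size`, `:max_overflow`) are standard poolboy options. Check `Hyphen8.Application.poolboy_config()` for the values the project actually uses.

```elixir
defmodule PoolConfigSketch do
  # Hypothetical stand-in for Hyphen8.Application.poolboy_config/0.
  def poolboy_config do
    [
      # Register the pool locally under a (hypothetical) name.
      name: {:local, :hyphen8_worker},
      # Module implementing the worker (hypothetical; not defined here).
      worker_module: PoolConfigSketch.Worker,
      # Number of permanent workers. One per scheduler is only a starting guess.
      size: System.schedulers_online(),
      # Extra temporary workers allowed when all permanent ones are busy.
      max_overflow: 2
    ]
  end
end
```

Raising `:size` allows more chunks to be processed at once, at the cost of more idle processes when input is short.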
You can also adjust the string-chunking interval. Change the count passed to `Enum.chunk_every()` in `Hyphen8.start()` to set how large a chunk each spawned process computes. A larger chunk size can improve performance on very long strings. Conversely, a chunk size well beyond your average string length can hurt performance, likely because fewer processes end up sharing the work. I suggest using Benchee and experimenting with combinations based on your use case.
Another option is to rewrite `Hyphen8` so that it dynamically determines chunk size based on the given input.
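One possible shape for such a dynamic rule is sketched below: aim for roughly one chunk per worker, but never go below a minimum chunk size. Everything here is an assumption for illustration, including the `ChunkSketch` module, the word-based chunking, and the default minimum of 10. It is not how `Hyphen8.start()` currently works.

```elixir
defmodule ChunkSketch do
  # Hypothetical rule: divide the words evenly across the pool (rounding up),
  # but keep each chunk at least `min_size` words long.
  def chunk_size(word_count, pool_size, min_size \\ 10) do
    max(div(word_count + pool_size - 1, pool_size), min_size)
  end

  # Split the text on whitespace and group the words into chunks
  # sized by the rule above.
  def chunks(text, pool_size) do
    words = String.split(text)
    Enum.chunk_every(words, chunk_size(length(words), pool_size))
  end
end

ChunkSketch.chunk_size(1000, 4)
# => 250

ChunkSketch.chunk_size(12, 4)
# => 10
```

The minimum guards against spawning many processes for a handful of words. A benchmark on representative input is still the only reliable way to pick the constants.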
## History of the Knuth-Liang Algorithm
Liang built a program, Patgen, which developed a large pattern table. He says the following about the resulting algorithm:
> "The new hyphenation algorithm is based on the idea of hyphenating and inhibiting patterns. These are simply strings of letters that, when they match a word, give us information about hyphenation at some point in the pattern. For example, '-tion' and 'c-c' are good hyphenating patterns. An important feature of this method is that a suitable set of patterns can be extracted automatically from the dictionary.
>
> "...The resulting hyphenation algorithm uses about 4500 patterns...These patterns find 89% of the hyphens in a pocket dictionary word list, with essentially no error."
In the [official resource for Patgen](https://www.tug.org/texlive/devsrc/Build/source/texk/web2c/patgen.web), we find a clear description of the algorithm's logic:
> "The patterns consist of strings of letters and digits, where a digit indicates a 'hyphenation value' for some intercharacter position. For example, the pattern `3t2ion` specifies that if the string `tion` occurs in a word, we should assign a hyphenation value of `3` to the position immediately before the `t`, and a value of `2` to the position between the `t` and the `i`.
>
> "...To hyphenate a word, we find all patterns that match within the word and determine the hyphenation values for each intercharacter position. If more than one pattern applies to a given position, we take the maximum of the values specified (i.e., the higher value takes priority). If the resulting hyphenation value is odd, this position is a feasible breakpoint; if the value is even or if no value has been specified, we are not allowed to break at this position."
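
The logic described in that passage can be sketched in a few lines of Elixir. This is a minimal illustration, not Hyphen8's actual code: the `LiangSketch` module and its function names are hypothetical, the pattern list is a small hand-picked set rather than the full table, and matching is done by a naive scan instead of a trie. It also omits refinements such as exception lists and minimum fragment lengths. Words are wrapped in `.` so that patterns can anchor to word boundaries.

```elixir
defmodule LiangSketch do
  @digits ~w(0 1 2 3 4 5 6 7 8 9)

  # Turn a pattern such as "3t2ion" into {["t", "i", "o", "n"], [3, 2, 0, 0, 0]}:
  # the letters, plus one value per intercharacter position (both ends included).
  def parse_pattern(pattern) do
    {letters, values, pending} =
      pattern
      |> String.graphemes()
      |> Enum.reduce({[], [], 0}, fn ch, {letters, values, pending} ->
        if ch in @digits do
          {letters, values, String.to_integer(ch)}
        else
          {[ch | letters], [pending | values], 0}
        end
      end)

    {Enum.reverse(letters), Enum.reverse([pending | values])}
  end

  # Insert "-" at every position whose maximum pattern value is odd.
  def hyphenate(word, patterns) do
    chars = String.graphemes("." <> String.downcase(word) <> ".")
    n = length(chars)

    scores =
      patterns
      |> Enum.map(&parse_pattern/1)
      |> Enum.reduce(List.duplicate(0, n + 1), fn {letters, values}, acc ->
        len = length(letters)

        # Try the pattern at every offset where it could fit.
        Enum.reduce(0..(n - len)//1, acc, fn offset, acc ->
          if Enum.slice(chars, offset, len) == letters do
            merge(acc, values, offset)
          else
            acc
          end
        end)
      end)

    word
    |> String.graphemes()
    |> Enum.with_index()
    |> Enum.map_join(fn {ch, k} ->
      # scores[k + 1] is the gap just before the k-th letter of the word
      # (the + 1 accounts for the leading ".").
      if k > 0 and rem(Enum.at(scores, k + 1), 2) == 1, do: "-" <> ch, else: ch
    end)
  end

  # Keep the maximum of the existing and the pattern's values in the window.
  defp merge(scores, values, offset) do
    {before, rest} = Enum.split(scores, offset)
    {window, tail} = Enum.split(rest, length(values))
    before ++ Enum.zip_with(window, values, &max/2) ++ tail
  end
end

patterns = ~w(hy3ph he2n hena4 hen5at 1na n2at 1tio 2io o2n)
LiangSketch.hyphenate("hyphenation", patterns)
# => "hy-phen-ation"
```

In this run, `hen5at` contributes the odd value 5 before the `a`, which outranks the even value 2 from `n2at`, so a break is allowed there. The even values from `he2n`, `hena4`, `2io`, and `o2n` suppress breaks at their positions.
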
Hyphen8 uses Liang's original tables. There is a lot of room for optimization in this Elixir port, but the basic implementation is there.
There is a large sample output in this repo. [View this hyphenated version of _Moby-Dick_](https://github.com/zuchka/hyphen8/blob/main/moby-dick-hyphenated.txt). The input was [the unabridged, UTF-8 copy of _Moby Dick_ on Project Gutenberg](http://www.gutenberg.org/files/2701/2701-0.txt). This file is Hyphen8's raw output.
## Additional Resources
- [Links page for Patgen-related content](https://www.tug.org/docs/liang/)
## Installation
If [available in Hex](https://hex.pm/docs/publish), the package can be installed
by adding `hyphen8` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:hyphen8, "~> 0.1.6"}
  ]
end
```