README.md

# Hyphen8

## Introduction

Hyphen8 is a pure Elixir implementation of the Knuth-Liang Hyphenation Algorithm. That algorithm formed the basis of Franklin Mark Liang's [1983 Stanford Dissertation, _WORD HY-PHEN-A-TION BY COM-PU-TER_](http://www.tug.org/docs/liang/liang-thesis.pdf). It remains the standard hyphenation method in TeX.

## Usage

Pass a string to `Hyphen8.start()`:

```command
iex> Hyphen8.start("let's hyphenate containerization orchestration platform")

process #PID<0.226.0>: hyphenating "{#PID<0.232.0>, #Reference<0.1715096087.1034682369.71350>}"
process #PID<0.225.0>: hyphenating "{#PID<0.233.0>, #Reference<0.1715096087.1034682369.71351>}"

```

`Hyphen8` will begin spawning processes. Each new one will print its PID and reference number to the screen.

You will then receive the hyphenated string:

```
"let's hy-phen-ate con-tainer-iza-tion or-ches-tra-tion plat-form"
```

The current version will not reconstruct sentence punctuation, newlines, or other meta-characters. Apostrophes are reconstructed but only naively.

To customize the string and word splitting, adjust the regular expressions in `Hyphen8.Engine.parse_words()` and `Hyphen8.Engine.parse_characters()`. 

## Performance Hacks

To optimize for speed, you can adjust the size of your worker pool. Adjust the value for `:size` in `Hyphen8.Application.poolboy_config()` to resize the worker pool. 

You can also adjust your string-chunking interval. Adjust the `String.chunk_every()` function in `Hyphen8.start()` to define how large or small a chunk each spawned process computes. Increasing this number can increase performance on very long strings. Alternately, increasing the chunk size well beyond your average string length can hurt performance. I suggest using Benchee and experimenting with combinations based on your use case.

Another option is to rewrite `Hyphen8` so that it dynamically determines chunk size based on the given input.

## History of the Knuth-Liang Algorithm

Liang built a program, Patgen, which developed a large pattern table. He says the following about the resulting algorithm:

> "The new hyphenation algorithm is based on the idea of hyphenating and inhibiting patterns. These are simply strings of letters that, when they match a word, give us information about hyphenation at some point in the pattern. For example, '-tion' and *c-c' are good hyphenating patterns. An important feature of this method is that a suitable set of patterns can be extracted automatically from the dictionary.

>"...The resulting hyphenation algorithm uses about 4500 patterns...These patterns find 89% of the hyphens in a pocket dictionary word list, with essentially no error."

In the [official resource for Patgen](https://www.tug.org/texlive/devsrc/Build/source/texk/web2c/patgen.web), we find a clear description of the algorithm's logic:

>"The patterns consist of strings of letters and digits, where a digit indicates a `hyphenation value` for some intercharacter position.  For example, the pattern `\.{3t2ion}` specifies that if the string `\.{tion}` occurs in a word, we should assign a hyphenation value of `3` to the position immediately before the `\.{t}`, and a value of `2` to the position between the `\.{t}` and the `\.{i}`.

>"...To hyphenate a word, we find all patterns that match within the word and determine the hyphenation values for each intercharacter position.  If more than one pattern applies to a given position, we take the maximum of the values specified (i.e., the higher value takes priority).  If the resulting hyphenation value is odd, this position is a feasible breakpoint; if the value is even or if no value has been specified, we are not allowed to break at this position.

Hyphen8 uses Liang's original tables. There is a lot of room for optimization in this Elixir port, but the basic implementation is there.

There is a large sample output in this repo. [View this hyphenated version of _Moby-Dick_](https://github.com/zuchka/hyphen8/blob/main/moby-dick-hyphenated.txt). The input was [the unabridged, UTF-8 copy of _Moby Dick_ on Project Gutenberg](http://www.gutenberg.org/files/2701/2701-0.txt). This file is Hyphen8's raw output.

## Additional Resources

- [Links page for Patgen-related content](https://www.tug.org/docs/liang/)

## Installation

If [available in Hex](https://hex.pm/docs/publish), the package can be installed
by adding `hyphen8` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:hyphen8, "~> 0.1.6"}
  ]
end
```