# Hyphen8

## Introduction

Hyphen8 is a pure Elixir implementation of the Knuth-Liang Hyphenation Algorithm. That algorithm formed the basis of Franklin Mark Liang's [1983 Stanford Dissertation, _WORD HY-PHEN-A-TION BY COM-PU-TER_]( It remains the standard hyphenation method in TeX.

## Usage

Pass a string to `Hyphen8.Engine.main()`:

iex> Hyphen8.Engine.main("let's hyphenate containerization orchestration platform")

You will receive the hyphenated string:

"let s hy-phen-ate con-tainer-iza-tion or-ches-tra-tion plat-form"

The current version will not reconstruct punctuation. To customize the string and word splitting, adjust the regular expressions in `Hyphen8.Engine.parse_words()` and `Hyphen8.Engine.parse_characters()`.

## History of the Knuth-Liang Algorithm

Liang built a program, Patgen, which developed a large pattern table. He says the following about the resulting algorithm:

> "The new hyphenation algorithm is based on the idea of hyphenating and inhibiting patterns. These are simply strings of letters that, when they match a word, give us information about hyphenation at some point in the pattern. For example, '-tion' and *c-c' are good hyphenating patterns. An important feature of this method is that a suitable set of patterns can be extracted automatically from the dictionary.

>"...The resulting hyphenation algorithm uses about 4500 patterns...These patterns find 89% of the hyphens in a pocket dictionary word list, with essentially no error."

In the [official resource for Patgen](, we find a clear description of the algorithm's logic:

>"The patterns consist of strings of letters and digits, where a digit indicates a `hyphenation value` for some intercharacter position.  For example, the pattern `\.{3t2ion}` specifies that if the string `\.{tion}` occurs in a word, we should assign a hyphenation value of `3` to the position immediately before the `\.{t}`, and a value of `2` to the position between the `\.{t}` and the `\.{i}`.

>"...To hyphenate a word, we find all patterns that match within the word and determine the hyphenation values for each intercharacter position.  If more than one pattern applies to a given position, we take the maximum of the values specified (i.e., the higher value takes priority).  If the resulting hyphenation value is odd, this position is a feasible breakpoint; if the value is even or if no value has been specified, we are not allowed to break at this position.

Hyphen8 uses Liang's original tables. There is a lot of room for optimization in this Elixir port, but the basic implementation is there.

There is a large sample output in this repo. [View this hyphenated version of _Moby-Dick_]( The input was [the unabridged, UTF-8 copy of _Moby Dick_ on Project Gutenberg]( This file is Hyphen8's raw output.

## Additional Resources

- [Links page for Patgen-related content](

## Installation

If [available in Hex](, the package can be installed
by adding `hyphen8` to your list of dependencies in `mix.exs`:

def deps do
    {:hyphen8, "~> 0.1.0"}