defmodule Hanzi do
@moduledoc """
Han/Chinese character (汉字) utilities and conversion to Pinyin lists.
The main goal of this module is to convert strings containing Han characters into a
`t:Pinyin.pinyin_list/0`. In turn, such a list can be formatted by the functions present in the
`Pinyin` module (e.g. `Pinyin.marked/1` or `Pinyin.numbered/1`).
A string of Han characters can be read with `read/1` or `sigil_h/2`. These functions both return
a list containing strings mixed with `t:Hanzi.t/0` structs. Such a list can be converted into
a `t:Pinyin.pinyin_list/0` through the use of `to_pinyin/2`. Users with more esoteric use cases
can directly modify the `t:Hanzi.t/0` inside the `t:hanzi_list/0`.
## `to_pinyin/2` and converters
Since a given Hanzi may have different valid pronunciations, `to_pinyin/2` accepts a second
argument that determines how a given `t:Hanzi.t/0` is converted into a `t:Pinyin.pinyin_list/0`.
This second argument is called a _converter_. This module includes some standard converters.
Please refer to the documentation of `to_pinyin/2` for more information.
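
The workflow described above can be sketched as follows. This is a minimal illustration that
uses only functions named in this documentation. It assumes both the `Hanzi` and `Pinyin`
modules are available, and it does not show the formatted output.

```elixir
# Sketch only: assumes the `Hanzi` and `Pinyin` modules are compiled and loaded.

# Step 1: read a string into a list of strings and Hanzi structs.
hanzi_list = Hanzi.read("你好")

# Step 2: convert to a Pinyin list. No converter is given here, so the default
# (the most common mainland reading) is used.
pinyin_list = Hanzi.to_pinyin(hanzi_list)

# Step 3: format the result. `Pinyin.numbered/1` could be used instead.
formatted = Pinyin.marked(pinyin_list)
```
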
## Data source
The results of the functions offered by this module are all ultimately derived from the data
contained in the `Unihan_Readings.txt` file of the
[Unicode Han Database](http://www.unicode.org/reports/tr38/tr38-27.html).
See `t:t/0` for details on which fields of that file are extracted.
"""
@enforce_keys [:char, :pron]
defstruct [:char, :pron, pron_tw: nil, alt: []]
@typedoc """
Representation of a single Hanzi (Chinese character).
This struct contains all the information extracted from `Unihan_Readings.txt` for a given
character. It contains the following fields:
| Key | Description | `Unihan_Readings.txt` field |
| --------- | --------------------------------------------------- | --------------------------- |
| `char` | The character itself | |
| `pron` | Most common pronunciation in Pinyin | [kMandarin](http://www.unicode.org/reports/tr38/tr38-27.html#kMandarin) |
| `pron_tw` | Most common pronunciation for Taiwan in Pinyin | [kMandarin](http://www.unicode.org/reports/tr38/tr38-27.html#kMandarin) |
| `alt` | All readings defined by the Hanyu Pinyin Dictionary | [kHanyuPinyin](http://www.unicode.org/reports/tr38/tr38-27.html#kHanyuPinyin) |
## `pron` and `pron_tw`
In some rare cases, the most common reading of a hanzi is different in mainland China and in
Taiwan. If this is the case, the most common mainland reading will be stored under the `pron`
key, while the most common Taiwanese reading will be stored under the `pron_tw` key. When the
readings are the same, `pron` will contain the reading while `pron_tw` will be `nil`. Note
that, at the time of writing, only 38 characters out of the 41226 defined by
`Unihan_Readings.txt` have a different reading for mainland China and Taiwan.
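
As a small sketch of how these two fields behave, the snippet below inspects the character `万`,
which has a separate Taiwanese reading, and `你`, which does not. The variable names are only
illustrative, and the last line shows one possible way to fall back to `pron` when `pron_tw`
is `nil`.

```elixir
# Sketch only: assumes the `Hanzi` module is compiled and loaded.
wan = Hanzi.from_character("万")
wan.pron     # mainland reading: initial "", final "wan", tone 4
wan.pron_tw  # Taiwanese reading: initial "m", final "o", tone 4

ni = Hanzi.from_character("你")
ni.pron_tw   # nil, since both readings are the same

# Fall back to the mainland reading when no separate Taiwanese reading exists.
reading_tw = ni.pron_tw || ni.pron
```
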
## `alt`
Some hanzi have different readings based on their exact use. When this is the case, all the
possible readings of a character will be stored as a list in `alt`.
"""
@type t :: %__MODULE__{
char: String.t(),
pron: Pinyin.t(),
pron_tw: Pinyin.t() | nil,
alt: [Pinyin.t()]
}
@typedoc """
List of Hanzi characters mixed with plain strings.
"""
@type hanzi_list :: [t() | String.t()]
# ------------------- #
# Low-level Utilities #
# ------------------- #
@doc """
Obtain the `t:Hanzi.t/0` struct for a character.
Note that this only works on a single character.
## Examples
iex> Hanzi.from_character("你")
%Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}, pron_tw: nil, alt: []}
iex> Hanzi.from_character("x")
nil
iex> Hanzi.from_character("你好")
nil
"""
@spec from_character(String.t()) :: t() | nil
def from_character(character), do: Hanzi.Map.lookup(character)
@doc """
Check if a _single_ character is a valid Han character.
## Examples
iex> Hanzi.character?("你")
true
iex> Hanzi.character?("x")
false
iex> Hanzi.character?("你好")
false
"""
@spec character?(String.t()) :: boolean()
def character?(character) do
from_character(character) != nil
end
# ---------- #
# converters #
# ---------- #
@doc """
Converter that retrieves the most common pronunciation of a `t:Hanzi.t/0`.
The additional argument specifies if the most common pronunciation for mainland China or Taiwan
is retrieved.
## Examples
iex> Hanzi.common_pronunciation(~h/你/s)
[%Pinyin{initial: "n", final: "i", tone: 3}]
iex> Hanzi.common_pronunciation(~h/你/s, :cn)
[%Pinyin{initial: "n", final: "i", tone: 3}]
iex> Hanzi.common_pronunciation(~h/你/s, :tw)
[%Pinyin{initial: "n", final: "i", tone: 3}]
iex> Hanzi.common_pronunciation(~h/万/s)
[%Pinyin{initial: "", final: "wan", tone: 4}]
iex> Hanzi.common_pronunciation(~h/万/s, :cn)
[%Pinyin{initial: "", final: "wan", tone: 4}]
iex> Hanzi.common_pronunciation(~h/万/s, :tw)
[%Pinyin{initial: "m", final: "o", tone: 4}]
"""
@spec common_pronunciation(t(), :cn | :tw) :: Pinyin.pinyin_list()
def common_pronunciation(hanzi, loc \\ :cn)
def common_pronunciation(hanzi, :cn), do: [hanzi.pron]
def common_pronunciation(hanzi, :tw), do: [hanzi.pron_tw || hanzi.pron]
@doc """
Converter that returns all pronunciations of a character or the most common.
If only a single pronunciation is available, it is returned; otherwise, all possible
pronunciations are returned. When all possible pronunciations are returned, `left`, `mid` and
`right` determine how the alternatives are separated. `left` is positioned before the first
pronunciation in the list, `right` is positioned after the last pronunciation, `mid` is
positioned between all the other pronunciations.
## Examples
iex> Hanzi.all_pronunciations(~h/你/s)
[%Pinyin{initial: "n", final: "i", tone: 3}]
iex> Hanzi.all_pronunciations(~h/㓎/s)
["[ ", %Pinyin{initial: "q", final: "in", tone: 1}, " | ", %Pinyin{initial: "q", final: "in", tone: 4}, " | ", %Pinyin{initial: "q", final: "in", tone: 3}, " ]"]
iex> Hanzi.all_pronunciations(~h/㓎/s, "", "", "")
["", %Pinyin{initial: "q", final: "in", tone: 1}, "", %Pinyin{initial: "q", final: "in", tone: 4}, "", %Pinyin{initial: "q", final: "in", tone: 3}, ""]
"""
@spec all_pronunciations(t(), String.t(), String.t(), String.t()) :: Pinyin.pinyin_list()
def all_pronunciations(hanzi, left \\ "[ ", mid \\ " | ", right \\ " ]")
def all_pronunciations(%Hanzi{pron: p, alt: []}, _, _, _), do: [p]
def all_pronunciations(%Hanzi{pron: _, alt: alt}, left, mid, right) do
[left | Enum.intersperse(alt, mid) ++ [right]]
end
@doc """
Converter that returns a list of all known pronunciations of a character.
This converter is similar to `all_pronunciations/4`, but it does not include separators around
the various pronunciations.
## Examples
iex> Hanzi.list_pronunciations(~h/你/s)
[%Pinyin{initial: "n", final: "i", tone: 3}]
iex> Hanzi.list_pronunciations(~h/㓎/s)
[%Pinyin{initial: "q", final: "in", tone: 1}, %Pinyin{initial: "q", final: "in", tone: 4}, %Pinyin{initial: "q", final: "in", tone: 3}]
"""
@spec list_pronunciations(t()) :: Pinyin.pinyin_list()
def list_pronunciations(%Hanzi{pron: p, alt: []}), do: [p]
def list_pronunciations(%Hanzi{pron: _, alt: alt}), do: alt
# ------------- #
# Hanzi Strings #
# ------------- #
@doc """
Read a string and convert it into a list of strings and `t:Hanzi.t/0` structs.
This function reads a string containing Han characters mixed with normal text. The output of this
function is a list of strings and Hanzi structs.
The input string may contain any character. Any character in the string that is recognised as a
Han character (by `character?/1`) is returned as a `t:Hanzi.t/0` in the returned list. Any
other character in the input is returned unmodified.
## Examples
iex> Hanzi.read("你好")
[%Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}, %Hanzi{char: "好", pron: %Pinyin{initial: "h", final: "ao", tone: 3}, alt: [%Pinyin{initial: "h", final: "ao", tone: 3}, %Pinyin{initial: "h", final: "ao", tone: 4}]}]
iex> Hanzi.read("hello, 你")
["hello, ", %Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}]
"""
@spec read(String.t()) :: hanzi_list()
def read(string) do
string
|> String.graphemes()
|> Stream.map(fn el ->
case from_character(el) do
nil -> el
hanzi -> hanzi
end
end)
# TODO: separate whitespace in future version?
|> Stream.chunk_by(&match?(%Hanzi{}, &1))
|> Stream.flat_map(fn
lst = [el | _] when is_binary(el) -> [Enum.join(lst)]
lst = [%Hanzi{} | _] -> lst
end)
|> Enum.to_list()
end
@doc """
Sigil to create a Hanzi list or struct.
When used without any modifiers, this sigil converts its input into a hanzi list through the use
of `read/1`. When the `s` modifier is used, `from_character/1` is used instead.
## Examples
iex> ~h/hello, 你/
["hello, ", %Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}]
iex> ~h/你/s
%Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}
iex> ~h/你好/s
nil
"""
defmacro sigil_h({:<<>>, _, [char]}, [?s]) when is_binary(char) do
hanzi = from_character(char)
quote do
unquote(Macro.escape(hanzi))
end
end
defmacro sigil_h({:<<>>, _, [string]}, []) when is_binary(string) do
list = read(string)
quote do
unquote(Macro.escape(list))
end
end
@doc """
Verify if a list or string contains only Han characters.
Note that whitespace is not counted as a character.
## Examples
iex> Hanzi.characters?(["你", "好"])
true
iex> Hanzi.characters?(["你", "boo", "好"])
false
iex> Hanzi.characters?("你好")
true
iex> Hanzi.characters?("你 好")
false
"""
@spec characters?(String.t() | [String.t()]) :: boolean()
def characters?(l) when is_list(l), do: Enum.all?(l, &character?/1)
def characters?(str) when is_binary(str) do
str
|> String.graphemes()
|> characters?()
end
@doc """
Convert a Hanzi list to a string of characters.
This function extracts the character of each `t:Hanzi.t/0` in `lst`. Normal strings in the list
are not modified. After converting the Hanzi in the list to characters, the list is joined with
`Enum.join/2`. The `joiner` argument will be passed as the `joiner` to `Enum.join/2`.
## Examples
iex> characters(~h/你好/)
"你好"
iex> characters(~h/你hello/)
"你hello"
iex> characters(~h/你好/, ";")
"你;好"
iex> characters(~h/你hello/, ";")
"你;hello"
"""
@spec characters(hanzi_list(), String.t()) :: String.t()
def characters(lst, joiner \\ "") do
lst
|> Enum.map(fn
%Hanzi{char: c} -> c
str -> str
end)
|> Enum.join(joiner)
end
@doc """
Convert a Hanzi list to a Pinyin list.
Normal strings in the Hanzi list are returned unmodified. Every `Hanzi.t()` is passed as an
argument to `converter`, which returns a `t:Pinyin.pinyin_list/0`. This list is added to the
result.
After calling this function, `Pinyin.marked/1` or `Pinyin.numbered/1` can be used to format the
result.
## Converters
A converter is any function that transforms a `t:Hanzi.t/0` into a `t:Pinyin.pinyin_list/0`.
In the simplest case, such a converter simply returns the most common pronunciation. In more
complicated cases, such a converter returns all the possible pronunciations of a Hanzi,
separated by strings.
The `Hanzi` module includes three converters: `common_pronunciation/2`, `all_pronunciations/4`
and `list_pronunciations/1`.
If no converter is specified, `&common_pronunciation(&1, :cn)` is used.
If you wish to write your own converter, the functions mentioned above and the examples below
should be a good starting point.
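
As a sketch of a custom converter, the anonymous function below returns the most common reading
and appends a marker string when a character has alternative readings. The name `flag_ambiguous`
and the `"(?)"` marker are hypothetical and chosen only for illustration.

```elixir
# Sketch only: assumes the `Hanzi` module is compiled and loaded.
# A converter takes a Hanzi struct and returns a Pinyin list, which may contain strings.
flag_ambiguous = fn
  # No alternative readings: return only the most common reading.
  %Hanzi{pron: p, alt: []} -> [p]
  # Alternative readings exist: return the most common reading followed by a marker.
  %Hanzi{pron: p} -> [p, "(?)"]
end

result = Hanzi.to_pinyin(Hanzi.read("你好"), flag_ambiguous)
```

Since `好` has more than one reading in `alt` while `你` has none, only the reading of `好` is
followed by the marker.
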
## Examples
iex> to_pinyin(~h/你好/)
[%Pinyin{initial: "n", final: "i", tone: 3}, %Pinyin{initial: "h", final: "ao", tone: 3}]
iex> to_pinyin(~h/二万/, &common_pronunciation(&1, :tw))
[%Pinyin{initial: "", final: "er", tone: 4}, %Pinyin{initial: "m", final: "o", tone: 4}]
iex> to_pinyin(~h/你好/, &all_pronunciations/1)
[%Pinyin{initial: "n", final: "i", tone: 3}, "[ ", %Pinyin{initial: "h", final: "ao", tone: 3}, " | ", %Pinyin{initial: "h", final: "ao", tone: 4}, " ]"]
iex> to_pinyin(~h/你好/, &all_pronunciations(&1, "", "", ""))
[%Pinyin{initial: "n", final: "i", tone: 3}, "", %Pinyin{initial: "h", final: "ao", tone: 3}, "", %Pinyin{initial: "h", final: "ao", tone: 4}, ""]
iex> to_pinyin(~h/你好/, fn %Hanzi{pron: p} -> [p] end)
[%Pinyin{initial: "n", final: "i", tone: 3}, %Pinyin{initial: "h", final: "ao", tone: 3}]
"""
@spec to_pinyin(hanzi_list(), (t() -> Pinyin.pinyin_list())) :: Pinyin.pinyin_list()
def to_pinyin(lst, converter \\ &common_pronunciation/1) do
Enum.flat_map(lst, fn
h = %Hanzi{} -> converter.(h)
str -> [str]
end)
end
end
defimpl String.Chars, for: Hanzi do
def to_string(%Hanzi{char: c}), do: c
end
defimpl List.Chars, for: Hanzi do
def to_charlist(%Hanzi{char: c}), do: Kernel.to_charlist(c)
end
defimpl Inspect, for: Hanzi do
import Inspect.Algebra
def inspect(%Hanzi{char: c}, _) do
concat(["#Hanzi<", c, ">"])
end
end