# hnc-csv - CSV Decoder/Encoder
## Decoding
Whole CSV binary documents can be decoded with `decode/1,2`.
`decode/1` assumes default [RFC4180](https://www.ietf.org/rfc/rfc4180.txt)-style
options, that is:
* Fields are separated by commas.
* Fields are optionally enclosed in double quotes.
* Double quotes in enclosed fields are quoted by another double quote.
`decode/2` allows using custom options:
```erlang
#{separator => Separator, % any byte except $\r or $\n (default $,)
  enclosure => Enclosure, % 'undefined' or any byte except $\r or $\n (default $")
  quote => Quote}         % 'undefined', 'enclosure', or any byte except $\r or $\n (default 'enclosure')
```
_Restrictions for option combinations:_
* If `Enclosure` is `undefined` (i.e., no enclosing), `Quote` must be either `enclosure` or `undefined`.
* If `Enclosure` is _not_ `undefined`, `Quote` must also not be `undefined`.
* If `Enclosure` is _not_ `undefined`, it must _not_ be the same as `Separator`.
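
The restrictions above can be sketched as a small validation function. This is only an illustration of the documented rules, not the validation code used by `hnc_csv` itself. The module `decode_opts_check` and its function `valid/1` are hypothetical names, and the sketch assumes a complete options map with all three keys present.

```erlang
-module(decode_opts_check).
-export([valid/1]).

%% Hypothetical helper, not part of hnc_csv: returns true if a complete
%% decode options map satisfies the documented restrictions.
valid(#{separator := Sep, enclosure := Enc, quote := Quote}) ->
    is_field_byte(Sep)
        andalso (Enc =:= undefined orelse is_field_byte(Enc))
        andalso (Quote =:= undefined orelse Quote =:= enclosure
                 orelse is_field_byte(Quote))
        andalso combination_ok(Sep, Enc, Quote).

%% Any byte except $\r and $\n.
is_field_byte(B) ->
    is_integer(B) andalso B >= 0 andalso B =< 255
        andalso B =/= $\r andalso B =/= $\n.

%% No enclosing: Quote may only be 'undefined' or 'enclosure'.
combination_ok(_Sep, undefined, Quote) ->
    Quote =:= undefined orelse Quote =:= enclosure;
%% Enclosing: Quote must be set, and Enclosure must differ from Separator.
combination_ok(Sep, Enc, Quote) ->
    Quote =/= undefined andalso Enc =/= Sep.
```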
Lines are separated by `\r`, `\n` or `\r\n`. Empty lines are ignored by the decoder.
The result of decoding is a list of CSV lines, which are lists of CSV fields,
which are in turn binaries representing the field values on the respective line.
#### Example
Assume the following CSV data:
```text
a,b,c
"d,d","e""e","f
f"
```
In an Erlang binary, this will look like:
```erlang
1> CsvBinary = <<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>.
<<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>
```
Decoded with `decode/1`, this will become:
```erlang
2> hnc_csv:decode(CsvBinary).
[[<<"a">>,<<"b">>,<<"c">>],
[<<"d,d">>,<<"e\"e">>,<<"f\r\nf">>]]
```
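For data that does not use commas, `decode/2` could be called along the following lines. This is a minimal sketch: the module `decode_custom_example` and its functions are hypothetical, and it assumes that `decode/2` takes the data as first and the options map as second argument.

```erlang
-module(decode_custom_example).
-export([opts/0, decode_semicolon/1]).

%% Options for semicolon-separated data; enclosure and quote are spelled
%% out with their documented default values.
opts() ->
    #{separator => $;, enclosure => $", quote => enclosure}.

%% Assumption: decode/2 takes the data first and the options map second.
decode_semicolon(CsvBinary) ->
    hnc_csv:decode(CsvBinary, opts()).
```

Given the rules described above, an input such as `<<"a;b\r\n\"c;c\";d\r\n">>` should decode into two lines with two fields each, the enclosed field keeping its semicolon.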
### Higher Order Functions for Decoding
`hnc_csv` provides the functions `decode_fold/3,4`, `decode_filter/2,3`,
`decode_map/2,3`, `decode_filtermap/2,3` and `decode_foreach/2,3` which
allow decoding and processing decoded lines in one operation, much
like the `lists` functions `foldl/3`, `filter/2`, `map/2`, `filtermap/2`
and `foreach/2`.
In fact, `decode/1,2` is implemented via `decode_fold/3,4`.
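
As an illustration, the first field of every line could be collected with `decode_fold/3` roughly as follows. The module `first_column` and its functions are hypothetical. The sketch assumes that lines are passed to the callback in document order, as with `lists:foldl/3`.

```erlang
-module(first_column).
-export([collect/2, from_csv/1]).

%% Fold callback: prepends the first field of a line to the accumulator.
%% A line without fields leaves the accumulator unchanged.
collect([First | _Rest], Acc) ->
    [First | Acc];
collect([], Acc) ->
    Acc.

%% Decodes Data (a binary or a provider) and returns the first field of
%% every line, in document order.
from_csv(Data) ->
    lists:reverse(hnc_csv:decode_fold(Data, fun collect/2, [])).
```

Keeping the callback as a named function means it can be tried out on already decoded lines with `lists:foldl/3`, without involving the decoder.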
### Providers
The `decode` family of functions accepts either a raw binary or a
`Provider` that delivers chunks of raw binary. A raw binary is
converted into a binary provider for further processing.
A provider is a 0-arity function which, when called, returns either a
tuple where the first element is a chunk of binary data and the second
is a new provider function for the next chunk of data, or the atom
`end_of_data` to indicate that the provider has delivered all data.
Providers can be implemented as stateless or stateful, usually depending
on the characteristics of the underlying data source.
A stateless provider does not change and is not susceptible to external
changes to the state of the underlying data source.
A stateful provider, on the other hand, may change, be susceptible to
changes to the state of the underlying data source, or both. It is recommended
not to use a stateful provider or its underlying data source before, during
or after its use in a decoding function, except for any necessary setup
beforehand and cleanup afterwards.
`hnc_csv` comes with two convenience functions, `get_binary_provider/1,2`
(stateless) and `get_file_provider/1,2` (stateful) which return providers for
binaries or files, respectively.
##### Example
The following is an implementation of a (stateless) custom provider which delivers
data taken from a given list of binaries:
```erlang
-module(example_provider).
-export([get_list_provider/1]).
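
%% Creates the initial provider for the list of binaries L.
get_list_provider(L) ->
    fun() -> list_provider(L) end.

%% Returns the next chunk together with a provider for the rest of the
%% list, or end_of_data when the list is exhausted.
list_provider([]) ->
    end_of_data;
list_provider([Bin|More]) when is_binary(Bin) ->
    {Bin, fun() -> list_provider(More) end}.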
```
* `get_list_provider/1` creates the initial provider, which is a call
to `list_provider/1` wrapped in a 0-arity function.
* `list_provider/1` is the actual implementation of the provider, which
returns either `end_of_data` when the list given as argument is exhausted,
or otherwise a tuple with the head element of the list as first
and a call to itself with the tail of the list wrapped in a 0-arity
function as second element.
This provider can then be used as follows, for example to count the lines
and fields in the CSV data which the provider delivers:
```erlang
1> Provider = example_provider:get_list_provider([<<"a,b">>, <<",c\r">>,
                                                  <<"\nd,">>, <<"e,f">>,
                                                  <<"\r\n">>]).
#Fun<example_provider.0.64990923>
2> hnc_csv:decode_fold(Provider,
                       fun(Line, {LCnt, FCnt}) -> {LCnt+1, FCnt+length(Line)} end,
                       {0, 0}).
{2,6}
```
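For contrast, a stateful provider could look like the following sketch, which reads chunks from an already opened IO device. The module `example_io_provider` and its function `get_io_provider/2` are hypothetical and are not the implementation of `get_file_provider/1,2`. The provider is stateful because every call advances the read position of the device, so calling the same provider function twice yields different chunks. Opening and closing the device is left to the caller as setup and cleanup.

```erlang
-module(example_io_provider).
-export([get_io_provider/2]).

%% Hypothetical helper: returns a provider that reads chunks of up to
%% ChunkSize bytes from an IO device opened in binary mode, for example
%% with file:open(Path, [read, binary]).
get_io_provider(IoDevice, ChunkSize) ->
    fun() -> io_provider(IoDevice, ChunkSize) end.

io_provider(IoDevice, ChunkSize) ->
    case file:read(IoDevice, ChunkSize) of
        {ok, Chunk} ->
            %% The device itself carries the state, so the next provider
            %% is just another read on the same device.
            {Chunk, fun() -> io_provider(IoDevice, ChunkSize) end};
        eof ->
            end_of_data;
        {error, Reason} ->
            error({read_failed, Reason})
    end.
```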
### Advanced Usage
For more complex scenarios than what the built-in functions provide
for, the functions `decode_init/0,1,2`, `decode_next_line/1` and
`decode_flush/1` can be used together to decode and process CSV
documents incrementally.
* `decode_init/0,1,2` creates a decoder state to be used in the
other functions listed above.
* `decode_next_line/1` decodes and returns the next line, together with
an updated state. If the data in the provider backing the state is exhausted,
the atom `end_of_data` is returned instead of a line.
* `decode_flush/1` returns all lines in the given state that have not been read yet.
In fact, `decode_fold/4` is implemented using those functions.
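
One scenario the fold-style functions do not cover is stopping early, for example reading only the first few lines. The following sketch shows such a loop. The module `take_lines` is hypothetical, and the loop is written against a step function passed in as argument so that it does not depend on the exact library API. It assumes the step function returns `{Line, NewState}` or `{end_of_data, NewState}`, which is how the description of `decode_next_line/1` above is read here. It further assumes that `decode_init/1` accepts a binary or a provider.

```erlang
-module(take_lines).
-export([take/3]).

%% Reads at most N lines by repeatedly calling Next on the state.
%% Next must return {Line, NewState} or {end_of_data, NewState}.
%% Returns the lines read together with the final state.
%%
%% Assumed usage with hnc_csv:
%%   take_lines:take(fun hnc_csv:decode_next_line/1,
%%                   hnc_csv:decode_init(CsvBinary), 2)
take(_Next, State, 0) ->
    {[], State};
take(Next, State0, N) when is_integer(N), N > 0 ->
    case Next(State0) of
        {end_of_data, State1} ->
            {[], State1};
        {Line, State1} ->
            {Lines, State2} = take(Next, State1, N - 1),
            {[Line | Lines], State2}
    end.
```

Returning the final state alongside the lines leaves the caller free to continue reading later or to hand the state to `decode_flush/1`.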
## Encoding
CSV documents can be encoded with `encode/1,2`.
`encode/1` assumes default [RFC4180](https://www.ietf.org/rfc/rfc4180.txt)-style
options, that is:
* Fields are separated by commas.
* Fields are optionally enclosed in double quotes.
* Double quotes in enclosed fields are quoted by another double quote.
* Lines are separated by `\r\n`.
`encode/2` allows using custom options:
```erlang
#{separator => Separator,   % any byte except $\r or $\n (default $,)
  enclosure => Enclosure,   % 'undefined' or any byte except $\r or $\n (default $")
  quote => Quote,           % 'undefined', 'enclosure', or any byte except $\r or $\n (default 'enclosure')
  enclose => Enclose,       % 'optional' (default), 'never' or 'always'
  end_of_line => EndOfLine} % <<"\r\n">> (default), <<"\n">> or <<"\r">>
```
_Restrictions for option combinations:_
* If `Enclose` is `never` (i.e., no enclosing), `Enclosure` must be `undefined` and `Quote` must be `undefined` or `enclosure`.
* If `Enclose` is `optional` or `always`, `Enclosure` and `Quote` must _not_ be `undefined`.
* If `Enclosure` is _not_ `undefined`, it must not be the same as `Separator`.
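
A call to `encode/2` with custom options might look like the following sketch, which produces semicolon-separated output with every field enclosed and lines ending in `\n`. The module `encode_custom_example` and its functions are hypothetical, and the sketch assumes that `encode/2` takes the lines as first and the options map as second argument.

```erlang
-module(encode_custom_example).
-export([opts/0, encode_semicolon/1]).

%% Semicolon-separated output, every field enclosed, lines ending in \n.
%% Enclosure and quote are spelled out with their documented defaults.
opts() ->
    #{separator => $;,
      enclosure => $",
      quote => enclosure,
      enclose => always,
      end_of_line => <<"\n">>}.

%% Assumption: encode/2 takes the lines first and the options map second.
encode_semicolon(Lines) ->
    hnc_csv:encode(Lines, opts()).
```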
The input for encoding is a list of CSV lines, each of which is a list of CSV fields,
which are in turn binaries representing the field values.
The result is a single binary containing the encoded CSV document.
#### Example
Assume the following CSV structure:
```erlang
1> Csv = [[<<"a">>,<<"b">>,<<"c">>],
[<<"d,d">>,<<"e\"e">>,<<"f\r\nf">>]].
```
Encoded with `encode/1`, this will become:
```erlang
2> hnc_csv:encode(Csv).
<<"a,b,c\r\n"
"\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>
```
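The enclosing decisions visible in the output above can be reproduced with a small sketch of RFC4180-style optional enclosing. It is not the implementation used by `hnc_csv`. The module `encode_field_sketch` is hypothetical and hard-codes the default separator and enclosure. It assumes that a field is enclosed exactly when it contains a comma, a double quote, `\r` or `\n`.

```erlang
-module(encode_field_sketch).
-export([encode_field/1]).

%% Sketch of RFC4180-style optional enclosing; not hnc_csv's implementation.
%% A field is enclosed only if it contains a comma, a double quote, \r or \n.
encode_field(Field) when is_binary(Field) ->
    case binary:match(Field, [<<",">>, <<"\"">>, <<"\r">>, <<"\n">>]) of
        nomatch ->
            %% Nothing that needs protection: emit the field as is.
            Field;
        _Found ->
            %% Double every double quote, then wrap the field in double quotes.
            Quoted = binary:replace(Field, <<"\"">>, <<"\"\"">>, [global]),
            <<"\"", Quoted/binary, "\"">>
    end.
```

Applied to the fields of the example, `<<"a">>` is left unchanged, while `<<"d,d">>`, `<<"e\"e">>` and `<<"f\r\nf">>` are enclosed, matching the encoded document shown above.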
# Authors
* Maria Scott (Maria-12648430)
* Jan Uhlig (juhlig)