README.md

# boyer_moore

[![Package Version](https://img.shields.io/hexpm/v/boyer_moore)](https://hex.pm/packages/boyer_moore)
[![Hex Docs](https://img.shields.io/badge/hex-docs-ffaff3)](https://hexdocs.pm/boyer_moore/)

An implementation of the Boyer-Moore search algorithm for BitArrays and user-defined data structures.

Currently the algorithm only implements "bad character" rule.

```sh
gleam add boyer_moore
```

```gleam
import boyer_moore
import gleam/io

pub fn main() {
  let needle = <<3, 4, 5>>

  // Compile needle to searcher that can be reused
  let assert Ok(searcher) = boyer_moore.build_byte_searcher(needle)

  // Find needle in given BitArray
  io.debug(boyer_moore.search(<<1, 2, 3, 4, 5, 6>>, searcher))
  // Ok(2)
}
```

## Performance for BitArray searching

A naive version of searching through a BitArray in Gleam could be written like this:

```gleam
fn naive(data: BitArray, pattern: BitArray) {
  do_naive(data, pattern, bit_array.byte_size(pattern), 0)
}

fn do_naive(data: BitArray, pattern: BitArray, pattern_size: Int, index: Int) {
  case data {
    <<>> -> Error(Nil)
    <<maybe_pattern:bytes-size(pattern_size), _rest:bytes>>
      if maybe_pattern == pattern
    -> Ok(index)
    <<_byte:size(8), rest:bytes>> ->
      do_naive(rest, pattern, pattern_size, index + 1)
    _ -> panic
  }
}
```

Summarised, the function iterates the BitArray byte by byte, and each time checks if the currently viewed section of
the BitArray starts with the pattern. This algorithm is what the sections below compare to.

### Erlang

On the Erlang target, the BitArray API of this package delegates to
[`binary:match/3`](https://www.erlang.org/doc/apps/stdlib/binary.html#match/3) and as such is much faster and more
memory efficient than the naive search on all haystack/needle sizes.

### JavaScript

In the JavaScript target, the naive algorithm is actually faster for small inputs, meaning small haystacks and small
needles. The longer the needle, the more opportunities Boyer-Moore has for skipping forward. And the larger the
haystack, the more work the naive algorithm has to do.

Measured on an Apple Silicon M1 Max laptop with Node 20, the tipping point seems to be around the point where
`haystack * needle` (as bytes) is 10000. For example, with a 26 byte needle and a 420 byte haystack, Boyer-Moore was
about 5 % faster than the naive algorithm. See the table below for more details:

| Haystack length | Needle length | Relative performance |
| --------------- | ------------- | -------------------- |
| 55 B            | 6 B           | 57 % (i.e. slower)   |
| 420 B           | 26 B          | 105 %                |
| 1024 B          | 19 B          | 188 %                |
| 30374 B         | 39 B          | 252 %                |

As a summary, for very short values, using the naive algorithm may be faster. On the other hand, in those cases the
absolute time difference will also be small. You may wish to benchmark how this algorithm works for your use case.

## Usage for custom data structures

You can also use this package for searching inside a custom data structure. A suitable data structure offers fast
access to random indices regardless of the amount of items. As an example a Dict would work, but a List would not.

Additionally, the data structure _must_ have an integer index that runs from 0 to the amount of the items minus one,
incrementing by one for each item and containing no holes. A given index must also always point to the same item
(until the data structure is modified). This must stay true both for the needle and the haystack.

To use a custom data structure, you'll need to define an API that the library can use:

```gleam
import boyer_moore

const dict_api = boyer_moore.API(
  get_size: dict.size,
  are_equal: boyer_moore.default_equality,
  unsafe_access: dict_unsafe_get,
)

fn dict_unsafe_get(data: Dict(a, b), index: a) -> b {
  let assert Ok(item) = dict.get(data, index)
  item
}
```

Now you can create a searcher by passing in the API as well:

```gleam
type DNA {
  A
  C
  G
  T
}

let searcher = boyer_moore.build_generic_searcher(
  dict.from_list([
    #(0, A), #(1, A), #(2, G), #(3, T)
  ]),
  dict_api
)

boyer_moore.search(some_data, searcher)
```