README.md

# AnnoyEx

A NIF binding to [Annoy](https://github.com/spotify/annoy), Spotify's C++ library for
approximate nearest neighbors.

It implements all of the methods in the Spotify library and all index types except Hamming.


# Code Examples

## If you have your own vectors.

  ```elixir
  iex(1)> f = 40
  40

  iex(2)> t = AnnoyEx.new(f)
  #Reference<0.3703577626.1733689351.247165>

  iex(3)> Enum.each(0..999,
  ...(3)>   fn i -> AnnoyEx.add_item(t, i, Enum.map(0..f-1, fn _ -> :rand.normal() end))
  ...(3)> end)
  :ok

  iex(4)> AnnoyEx.build(t, 10)
  :ok

  iex(5)> AnnoyEx.save(t, "test.ann")
  :ok

  iex(6)> u = AnnoyEx.new(f, :angular)
  #Reference<0.3703577626.1733689345.248447>

  iex(7)> AnnoyEx.load(u, "test.ann") # super fast, will just mmap the file
  :ok

  iex(8)> AnnoyEx.get_nns_by_item(u, 0, 10) # will find the 10 nearest neighbors
  {[0, 677, 837, 478, 793, 183, 265, 623, 751, 268],
  [0.0, 1.1232969760894775, 1.1271791458129883, 1.1428979635238647,
  1.1504143476486206, 1.1632753610610962, 1.1647002696990967,
  1.1801577806472778, 1.2018792629241943, 1.2058889865875244]}
  ```

## Word Embeddings with Pretrained GloVe vectors

  Go to https://nlp.stanford.edu/projects/glove/ and download a word-embedding file, eg.
  https://nlp.stanford.edu/data/glove.42B.300d.zip

  By its nature creating and building the index can be quite slow but querying afterward
  is fast.  Saving built indexes can help with this.

  ```elixir
     # Build and save the index.
     idx = AnnoyEx.new(300)

     index_to_word =
       File.stream!("glove.42B.300d.txt") |>
       Stream.with_index() |>
       Stream.map(fn {line,item} ->
         fields = String.trim_trailing(line) |> String.split(" ")
         word = hd(fields)
         vec = Enum.map(tl(fields), fn x -> Float.parse(x) |> elem(0) end)
         AnnoyEx.add_item(idx, item, vec)
         {item, word}
       end) |>
       Enum.into(%{})

     AnnoyEx.build(idx,10)

     AnnoyEx.save(idx, "glove.42B.300d.idx")
     File.write!("glove.42B.300d.i2w", :erlang.term_to_binary(index_to_word))
  ```

  The saved data can now be queried for similar words, eg. the 10 closest words to "dog":

  ```
  iex(1)> index_to_word = File.read!("glove.42B.300d.i2w") |> :erlang.binary_to_term
  %{
    1774702 => "bedanya",
    ...
  }

  iex(2)> word_to_index = Map.new(index_to_word, fn {k, v} -> {v, k} end)
  %{
    "timout" => 816588,
    ...
  }

  iex(3)> idx = AnnoyEx.new(300)
  #Reference<0.2696563938.657326081.165701>

  iex(4)> AnnoyEx.load(idx, "glove.42B.300d.idx")
  :ok

  iex(5)> dog_id = word_to_index["dog"]
  828

  iex(6)> {word_ids, distances} = AnnoyEx.get_nns_by_item(idx, dog_id, 10)
  {[828, 1818, 5203, 3394, 1642, 1937, 6798, 16091, 7080, 16440],
   [0.0, 0.5301183462142944, 0.617365300655365, 0.7635669112205505,
    0.8098030686378479, 0.8700820803642273, 0.8896471261978149,
    0.8973260521888733, 0.9196945428848267, 0.9411463737487793]}

  iex(7)> Enum.map(word_ids, fn word_id -> index_to_word[word_id] end)
  ["dog", "dogs", "puppy", "cats", "animal", "horse", "rabbit", "paws", "pig",
   "paw"]
  ```


# Installation

*n.b. This library currently only runs on Linux.*

You will need a C++14 compiler and the Erlang header files required for compiling NIFs.

If [available in Hex](https://hex.pm/docs/publish), the package can be installed
by adding `annoy_ex` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:annoy_ex, "~> 1.0.0"}
  ]
end
```

# Working with source

Before running tests make sure to build the shared library with `make annoy`