# Similarity

Cosine similarity and Simhash implementations for Elixir.

Full documentation can be found at [](

## Cosine Similarity

Cosine similarity is not sensitive to the scale of the vectors. It depends only on their direction, so both calls below compare vectors that point the same way:

    Similarity.cosine([1,2,3], [1,2,3])
    Similarity.cosine([1,2,3], [2,4,6])
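
To make the scale invariance concrete, here is a minimal sketch of the textbook formula `dot(a, b) / (|a| * |b|)`. The `CosineSketch` module is a hypothetical stand-in written for this README, not the package's actual implementation.

```elixir
defmodule CosineSketch do
  # Hypothetical helper, not part of the `similarity` package.
  # cos(a, b) = dot(a, b) / (|a| * |b|)
  def cosine(a, b) when length(a) == length(b) do
    dot =
      a
      |> Enum.zip(b)
      |> Enum.map(fn {x, y} -> x * y end)
      |> Enum.sum()

    dot / (norm(a) * norm(b))
  end

  # Euclidean length of a vector.
  defp norm(v) do
    v |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt()
  end
end

same = CosineSketch.cosine([1, 2, 3], [1, 2, 3])
scaled = CosineSketch.cosine([1, 2, 3], [2, 4, 6])
orthogonal = CosineSketch.cosine([1, 0], [0, 1])

# `same` and `scaled` are both (up to float rounding) 1.0; `orthogonal` is 0.0.
IO.inspect({same, scaled, orthogonal})
```

Doubling every element doubles both the dot product and the norm of that vector, so the ratio is unchanged.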

The `Similarity.Cosine` module takes care of building a struct and streaming similarities.
It handles non-matching attributes, so the elements you add don't all need to have exactly the same attributes:

    s =
    s = s |> Similarity.Cosine.add("a", [{"bananas", 9}, {"hair_color_r", 124}, {"hair_color_g", 8}, {"hair_color_b", 122}])
    s = s |> Similarity.Cosine.add("b", [{"bananas", 19}, {"hair_color_r", 124}, {"hair_color_g", 8}, {"hair_color_b", 122}])
    s = s |> Similarity.Cosine.add("c", [{"bananas", 9}, {"hair_color_r", 124}])

    s |> |> Enum.to_list
    # [
    #   {"a", "b", 1.9967471152702767},
    #   {"a", "c", 1.4142135623730951},
    #   {"b", "c", 1.409736747211141}
    # ]

    s |> Similarity.Cosine.between("a", "b")
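
One way to picture the handling of non-matching attributes is to keep only the attributes both elements share before comparing them. The sketch below assumes that approach; `CommonAttributesSketch` and `aligned/2` are hypothetical names, and the package's internals may differ.

```elixir
defmodule CommonAttributesSketch do
  # Hypothetical helper; not the package's actual code.
  # Returns two value lists covering only the attributes found in both inputs.
  def aligned(attrs_a, attrs_b) do
    map_b = Map.new(attrs_b)

    # Keep the attributes present on both sides, in the order of `attrs_a`.
    pairs =
      for {key, value_a} <- attrs_a, Map.has_key?(map_b, key) do
        {value_a, Map.fetch!(map_b, key)}
      end

    Enum.unzip(pairs)
  end
end

a = [{"bananas", 9}, {"hair_color_r", 124}, {"hair_color_g", 8}, {"hair_color_b", 122}]
c = [{"bananas", 9}, {"hair_color_r", 124}]

# Only "bananas" and "hair_color_r" are shared, so two values remain per side.
{values_a, values_c} = CommonAttributesSketch.aligned(a, c)
IO.inspect({values_a, values_c})
```

The two aligned lists have equal length and could then be passed to a list-based function. The `{"a", "c", 1.4142135623730951}` result above is consistent with this reading: the two shared attributes are identical, and the square root of 2 is about 1.414.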

`Similarity.cosine_srol` returns the cosine similarity between two vectors, multiplied by the square root of the length of the vectors.
(In my experience, this gives a better value when the number of common attributes differs between vectors.)

    a = [1,2,3,4]
    b = [1,2,3]
    c = [1,2,3,4]

    Similarity.cosine_srol(a |> Enum.take(3), b)
    Similarity.cosine_srol(a, c)

Above, the first 3 elements of `a` match `b` exactly, just as all 4 elements of `a` match `c`,
but the `a` & `c` comparison returns a higher value because more elements match.
In real-world scenarios I suggest using this function when the compared vectors aren't all the same length.
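
The scaling can be sketched in a few lines. `SrolSketch` is a hypothetical module that follows the description above (plain cosine similarity times the square root of the vector length); it is not the package's source.

```elixir
defmodule SrolSketch do
  # Hypothetical helper illustrating the "square root of length" scaling.
  def cosine_srol(a, b) when length(a) == length(b) do
    cosine(a, b) * :math.sqrt(length(a))
  end

  # Plain cosine similarity: dot(a, b) / (|a| * |b|).
  defp cosine(a, b) do
    dot = a |> Enum.zip(b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    dot / (norm(a) * norm(b))
  end

  defp norm(v) do
    v |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt()
  end
end

three_matching = SrolSketch.cosine_srol([1, 2, 3], [1, 2, 3])
four_matching = SrolSketch.cosine_srol([1, 2, 3, 4], [1, 2, 3, 4])

# Roughly sqrt(3) versus sqrt(4): the longer perfect match scores higher.
IO.inspect({three_matching, four_matching})
```

Both pairs have a plain cosine similarity of about 1.0, so the difference in score comes entirely from the length factor.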

## Simhash

    left = "pork belly jerky brisket tenderloin shank kevin spare ribs"
    right = "porchetta pork loin. Leberkas ball tip biltong, beef ribs"

    Similarity.simhash(left, right, ngram_size: 3)
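
For readers unfamiliar with the technique, below is a rough sketch of the general simhash idea: hash every character n-gram, let each hash vote on every bit position, and compare the resulting fingerprints by Hamming distance. Everything in it is an assumption made for illustration (the `SimhashSketch` name, the 32-bit width, `:erlang.phash2/2` as the hash, the positional `ngram_size` argument), so its scores will not match those of `Similarity.simhash`.

```elixir
defmodule SimhashSketch do
  import Bitwise

  # Assumed fingerprint width; chosen because :erlang.phash2/2 yields up to 32 bits.
  @bits 32

  # Similarity in 0.0..1.0: the share of fingerprint bits that agree.
  def similarity(left, right, ngram_size \\ 3) do
    distance = hamming(fingerprint(left, ngram_size), fingerprint(right, ngram_size))
    1.0 - distance / @bits
  end

  def fingerprint(text, ngram_size) do
    text
    |> ngrams(ngram_size)
    |> Enum.map(&:erlang.phash2(&1, 4_294_967_296))
    |> Enum.reduce(List.duplicate(0, @bits), &vote/2)
    |> Enum.with_index()
    |> Enum.reduce(0, fn {count, i}, acc ->
      # A bit is set in the fingerprint when more n-grams voted 1 than 0.
      if count > 0, do: acc ||| (1 <<< i), else: acc
    end)
  end

  # Overlapping character n-grams, e.g. "pork" -> ["por", "ork"] for n = 3.
  defp ngrams(text, n) do
    text
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # Each hash adds +1 to positions where its bit is 1 and -1 where it is 0.
  defp vote(hash, counts) do
    counts
    |> Enum.with_index()
    |> Enum.map(fn {count, i} ->
      if ((hash >>> i) &&& 1) == 1, do: count + 1, else: count - 1
    end)
  end

  # Number of differing bits between two fingerprints.
  defp hamming(a, b) do
    bxor(a, b) |> Integer.digits(2) |> Enum.sum()
  end
end

left = "pork belly jerky brisket tenderloin shank kevin spare ribs"
right = "porchetta pork loin. Leberkas ball tip biltong, beef ribs"

IO.inspect(SimhashSketch.similarity(left, right, 3))
```

Because similar texts share most of their n-grams, most bit positions receive the same majority vote, which keeps the Hamming distance between their fingerprints small.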

## Installation

Add `similarity` to your list of dependencies in `mix.exs`:

    def deps do
      [
        {:similarity, "~> 0.2"}
      ]
    end

## Performance

`Similarity.simhash` is about 2x faster than the simhash-ex v1.1.0 package:

    Benchmark suite executing with the following configuration:
    warmup: 2 s
    time: 5 s
    memory time: 0 ns
    parallel: 1
    inputs: none specified
    Estimated total run time: 14 s

    Benchmarking simhash-ex...
    Benchmarking similarity.simhash...

    Name                         ips        average  deviation         median         99th %
    similarity.simhash        3.67 K      272.69 μs     ±6.50%      267.84 μs      353.05 μs
    simhash-ex                1.75 K      572.14 μs    ±12.31%      552.22 μs      781.02 μs

    Comparison:
    similarity.simhash        3.67 K
    simhash-ex                1.75 K - 2.10x slower +299.46 μs