# Hallmark
[Run in Livebook](https://livebook.dev/run?url=https%3A%2F%2Fgithub.com%2Fgeorgeguimaraes%2Fhallmark%2Fblob%2Fmain%2Flivebooks%2Fhallmark_guide.livemd)
Hallucination detection for Elixir, powered by Vectara's [HHEM](https://huggingface.co/vectara/hallucination_evaluation_model) (Hallucination Evaluation Model).
Given a premise and a hypothesis, Hallmark scores how consistent the hypothesis is with the premise on a scale from 0 (hallucinated) to 1 (consistent). Useful for checking whether LLM-generated text is actually grounded in the source material.
HHEM is a fine-tuned FLAN-T5-base (184M params) that runs locally via Bumblebee. No API keys, no external calls after the initial model download.
## Installation
```elixir
def deps do
  [
    {:hallmark, "~> 0.1.0"},
    {:exla, "~> 0.9"}
  ]
end
```
You need an Nx compiler such as EXLA or EMLX. Without one, the pure Elixir evaluator runs each tensor op individually, and a single score call takes 10+ minutes. With EXLA, the same call takes ~170ms.
EXLA works on all platforms (CPU on Mac, CUDA on Linux with a GPU). If you're on Apple Silicon and want Metal GPU acceleration, use [EMLX](https://github.com/elixir-nx/emlx) instead:
```elixir
{:emlx, "~> 0.2"}
```
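Nx projects often also set a default tensor backend in `config/config.exs`, so that tensors created outside compiled functions use the same runtime. This README does not say whether Hallmark needs it, so treat the following as an optional sketch based on general Nx configuration, not a Hallmark requirement:

```elixir
# config/config.exs
import Config

# Optional, general Nx setting (an assumption here, not a documented Hallmark requirement).
config :nx, default_backend: EXLA.Backend
```

With EMLX, the analogous backend module is `EMLX.Backend`.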
## Usage
```elixir
# Load the model (downloads ~440MB on first run, cached after that)
{:ok, model} = Hallmark.load(compiler: EXLA)

# Score a single pair: score(model, premise, hypothesis)
{:ok, score} = Hallmark.score(model, "I am in California", "I am in United States.")
# => {:ok, 0.65}

# Hallucination detected: the hypothesis does not follow from the premise
{:ok, score} = Hallmark.score(model, "The capital of France is Berlin.", "The capital of France is Paris.")
# => {:ok, 0.01}

# Get a label instead of a score
{:ok, :consistent} = Hallmark.evaluate(model, "I am in California", "I am in United States.")
{:ok, :hallucinated} = Hallmark.evaluate(model, "The capital of France is Berlin.", "The capital of France is Paris.")

# Custom threshold (default is 0.5)
premise = "I am in California"
hypothesis = "I am in United States."
{:ok, label} = Hallmark.evaluate(model, premise, hypothesis, threshold: 0.8)

# Batch scoring: a list of {premise, hypothesis} tuples
{:ok, scores} = Hallmark.score_batch(model, [
  {"I am in California", "I am in United States."},
  {"I am in United States", "I am in California."},
  {"Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg."}
])
```
```
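A common follow-up is to separate grounded claims from flagged ones once you have the scores. The sketch below is a hypothetical helper, not part of Hallmark. `GroundingFilter` and `split/3` are illustrative names, the scores are placeholders standing in for a `Hallmark.score_batch/2` result, and it assumes scores come back in input order and that a score equal to the threshold counts as consistent.

```elixir
defmodule GroundingFilter do
  @moduledoc """
  Hypothetical helper (not part of Hallmark): partitions claims by score.
  """

  @default_threshold 0.5

  # Pairs each claim with its score, then splits on the threshold.
  # Assumes `scores` is in the same order as `claims` and that
  # `score >= threshold` means "consistent".
  def split(claims, scores, threshold \\ @default_threshold)
      when length(claims) == length(scores) do
    claims
    |> Enum.zip(scores)
    |> Enum.split_with(fn {_claim, score} -> score >= threshold end)
  end
end

claims = ["I am in United States.", "The capital of France is Paris."]

# Placeholder scores; in practice these would come from Hallmark.score_batch/2.
scores = [0.65, 0.01]

{grounded, flagged} = GroundingFilter.split(claims, scores)
```

Keeping the score next to each claim makes it easy to log or review borderline cases instead of silently dropping them.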
## How it works
HHEM checks logical entailment, not factual accuracy. It answers "does the hypothesis follow from the premise?" rather than "is the hypothesis true?" So a factually correct statement can still be flagged as hallucinated if it doesn't follow from the given premise.
Under the hood, Hallmark loads the T5 encoder from HHEM's fine-tuned weights, runs the input through the encoder, and applies a 2-class classifier head on the pad token embedding. The softmax probability of the "consistent" class is the score.
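The final step, turning the classifier's two logits into a score, can be sketched in plain Elixir. This is an illustration of the softmax step only, not Hallmark's actual code: `SoftmaxSketch` is a hypothetical module, the logits are made up, and treating index 1 as the "consistent" class is an assumption.

```elixir
defmodule SoftmaxSketch do
  # Numerically stable softmax over a list of logits.
  def softmax(logits) do
    max_logit = Enum.max(logits)
    exps = Enum.map(logits, fn x -> :math.exp(x - max_logit) end)
    sum = Enum.sum(exps)
    Enum.map(exps, fn e -> e / sum end)
  end

  # Assumption: the second logit belongs to the "consistent" class.
  def consistency_score([_hallucinated, _consistent] = logits) do
    logits |> softmax() |> Enum.at(1)
  end
end

# Made-up logits for illustration; the result is roughly 0.88.
score = SoftmaxSketch.consistency_score([-1.2, 0.8])
```

With only two classes, the softmax reduces to a sigmoid of the difference between the two logits, which is why the score always falls between 0 and 1.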
## License
MIT