# Magika
An Elixir binding of [Google's Magika](https://securityresearch.google/magika) —
deep-learning file content type detection.
Magika identifies the content type of a file (e.g. `html`, `python`, `pdf`,
`zip`, `png`) from its bytes using a small, fast ONNX model. This library is a
faithful port of the reference Python implementation's `standard_v3_3` model:
it runs the **same ONNX model** (vendored under `priv/`) via
[OnnxRuntime](https://hex.pm/packages/onnxruntime) and reproduces the same
feature extraction, confidence thresholds, and label-resolution logic. Its
output matches the Python tool exactly on the upstream test corpus (77/77
files).
## How it works
1. **Corner cases first.** Empty inputs → `empty`; very small or
whitespace-only inputs are classified as `txt`/`unknown` by a UTF-8 check,
without invoking the model. Directories and symlinks get dedicated labels.
2. **Feature extraction.** For larger inputs, Magika reads up to `block_size`
(4096) bytes from the start and end, strips ASCII whitespace, and takes
`beg_size` (1024) bytes from the front and `end_size` (1024) from the back,
padding with a dedicated padding token (256). This yields a 2048-int vector.
3. **Inference.** The vector is fed to the ONNX model (`int32[batch, 2048] →
float32[batch, 214]`, a softmax over 214 content types). The argmax is the
raw "deep-learning" label and its probability is the score.
4. **Label resolution.** An overwrite map and per-content-type confidence
thresholds turn the raw label into the final output. Low-confidence
predictions are generalized to `txt` (text) or `unknown` (binary).
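
As a rough illustration of step 2, the sketch below builds the 2048-int vector from a binary using the sizes quoted above. It is not the library's actual code: the module name `FeatureSketch` is hypothetical, and it assumes that whitespace is stripped from the leading edge of the first block and the trailing edge of the last block, and that padding goes after the front bytes and before the back bytes.

```elixir
defmodule FeatureSketch do
  # Sizes and padding token as described in the list above.
  @block_size 4096
  @beg_size 1024
  @end_size 1024
  @padding_token 256
  # ASCII whitespace: space, tab, newline, vertical tab, form feed, carriage return.
  @whitespace [?\s, ?\t, ?\n, ?\v, ?\f, ?\r]

  # Returns a list of @beg_size + @end_size integers in 0..256.
  def extract(content) when is_binary(content) do
    size = byte_size(content)
    block = min(size, @block_size)

    # Front: first block, leading whitespace dropped, truncated to @beg_size.
    beg =
      content
      |> binary_part(0, block)
      |> :binary.bin_to_list()
      |> Enum.drop_while(&(&1 in @whitespace))
      |> Enum.take(@beg_size)

    # Back: last block, trailing whitespace dropped, keeping the final @end_size bytes.
    fin =
      content
      |> binary_part(size - block, block)
      |> :binary.bin_to_list()
      |> Enum.reverse()
      |> Enum.drop_while(&(&1 in @whitespace))
      |> Enum.take(@end_size)
      |> Enum.reverse()

    # Assumption: pad the front on the right and the back on the left.
    beg ++ pad(@beg_size - length(beg)) ++ pad(@end_size - length(fin)) ++ fin
  end

  defp pad(count), do: List.duplicate(@padding_token, count)
end
```

Calling `FeatureSketch.extract/1` on any binary returns a list of exactly 2048 integers. In the real pipeline, very small inputs never reach this step because the corner-case checks handle them first.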
## Installation
Add `magika` to your dependencies in `mix.exs`:
```elixir
def deps do
[
{:magika, "~> 0.1.0"}
]
end
```
The ONNX runtime is provided by the
[`onnxruntime`](https://hex.pm/packages/onnxruntime) package (Elixir bindings
for Microsoft ONNX Runtime), which fetches precompiled native binaries for
common platforms, so no manual toolchain setup is required for installation.
## Usage
The model is loaded once and hosted by a supervised `Magika.Server` that starts
automatically with the `:magika` application — so you call the API directly,
without managing or passing around an instance:
```elixir
# Identify raw bytes:
{:ok, result} = Magika.identify("<!DOCTYPE html>\n<html>...</html>")
result.prediction.output.label #=> "html"
result.prediction.output.mime_type #=> "text/html"
result.prediction.output.group #=> "code"
result.prediction.score #=> 0.86...
# Identify a file on disk:
{:ok, result} = Magika.identify_path("/path/to/document.pdf")
result.prediction.output.label #=> "pdf"
# Missing/unreadable files return an {:error, result} with a status:
{:error, result} = Magika.identify_path("/nope")
result.status #=> :file_not_found
# Identify from an open binary device:
{:ok, device} = File.open("photo.png", [:read, :binary])
{:ok, result} = Magika.identify_stream(device)
File.close(device)
result.prediction.output.label #=> "png"
```
Inference runs in the **calling** process: the server owns the instance's
lifecycle and publishes it via `:persistent_term`, so concurrent calls don't
serialize through a single mailbox and the configuration isn't copied per call.
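
The general pattern described here can be sketched as follows. This is a minimal illustration rather than the code of `Magika.Server`: the `InstanceHolder` module, its key, and the map it stores are all hypothetical stand-ins for the loaded model and configuration.

```elixir
defmodule InstanceHolder do
  use GenServer

  # Hypothetical key under which the shared instance is published.
  @key {__MODULE__, :instance}

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Runs entirely in the caller: a persistent_term lookup, no message to the server.
  def fetch!, do: :persistent_term.get(@key)

  @impl true
  def init(opts) do
    # Trap exits so terminate/2 runs on shutdown and can erase the term.
    Process.flag(:trap_exit, true)

    # Stand-in for loading the model once; a real instance would hold the session.
    instance = %{prediction_mode: Keyword.get(opts, :prediction_mode, :high_confidence)}
    :persistent_term.put(@key, instance)
    {:ok, instance}
  end

  @impl true
  def terminate(_reason, _state) do
    :persistent_term.erase(@key)
    :ok
  end
end
```

Because readers only call `:persistent_term.get/1`, any number of processes can fetch the instance concurrently while the server's mailbox stays empty. The trade-off is that updating or erasing a persistent term is expensive, which suits data that is written once at startup.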
### Prediction mode
The prediction mode controls how strict Magika is before trusting the model's
guess. The hosted server uses `:high_confidence` by default; change it in your
application config:
```elixir
# config/config.exs
config :magika, prediction_mode: :best_guess
```
* `:high_confidence` (default): keep the model prediction only when its score
clears the per-content-type threshold; content types without a dedicated
threshold are checked against the generic medium threshold.
* `:medium_confidence`: keep it when the score clears the generic medium
threshold.
* `:best_guess`: always return the raw model prediction.

When the score is too low for the chosen mode, the output is generalized to
`txt` (for textual content types) or `unknown` (for binary ones), and
`result.prediction.overwrite_reason` is set to `:low_confidence`.
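
To make the decision logic concrete, here is a small sketch of how the three modes might be applied. It is an approximation under stated assumptions, not the library's implementation: `ResolutionSketch` is a hypothetical module, the threshold values and the overwrite entry are made up for illustration (the real ones ship in `config.min.json`), and it assumes the threshold is looked up by the raw model label and that content types without their own threshold use the medium one.

```elixir
defmodule ResolutionSketch do
  # Illustrative values only; not taken from the vendored configuration.
  @medium_threshold 0.5
  @thresholds %{"markdown" => 0.75}
  @overwrite_map %{"randomtxt" => "txt"}

  # Returns {output_label, overwrite_reason} for a raw model label and score.
  def resolve(dl_label, score, text?, mode) do
    {label, reason} =
      case Map.fetch(@overwrite_map, dl_label) do
        {:ok, target} -> {target, :overwrite_map}
        :error -> {dl_label, :none}
      end

    if trusted?(mode, dl_label, score) do
      {label, reason}
    else
      # Too uncertain for this mode: generalize to a generic label.
      {if(text?, do: "txt", else: "unknown"), :low_confidence}
    end
  end

  defp trusted?(:best_guess, _label, _score), do: true
  defp trusted?(:medium_confidence, _label, score), do: score >= @medium_threshold

  defp trusted?(:high_confidence, label, score),
    do: score >= Map.get(@thresholds, label, @medium_threshold)
end
```

With these made-up numbers, a `markdown` prediction scoring 0.6 is kept in `:medium_confidence` mode but generalized to `txt` in `:high_confidence` mode, which is the difference the mode setting is meant to control.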
### Standalone instances (advanced)
You normally don't need this. For one-off scripts or tests you can build an
instance directly with `Magika.new/1` and pass it as the first argument,
bypassing the supervised server:
```elixir
magika = Magika.new(prediction_mode: :best_guess)
{:ok, result} = Magika.identify(magika, "<!DOCTYPE html>...")
```
## Result shape
Every `Magika.identify*` function returns `{:ok, %Magika.Result{}}` or, for
filesystem errors, `{:error, %Magika.Result{}}`:
```
%Magika.Result{
status: :ok, # or :file_not_found | :permission_error
path: "/path/to/file" | nil,
prediction: %Magika.Prediction{
output: %Magika.ContentTypeInfo{ # what Magika reports to you
label: "python",
mime_type: "text/x-python",
group: "code",
description: "Python source",
extensions: ["py", ...],
is_text: true
},
dl: %Magika.ContentTypeInfo{...}, # raw model prediction (or `undefined`)
score: 0.9998, # model confidence for `dl`
overwrite_reason: :none # :none | :low_confidence | :overwrite_map
}
}
```
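
One way to consume this shape is to pattern match on the tagged tuple. The sketch below uses a hypothetical `ResultSummary` helper that is not part of the library; it matches on plain map patterns, which also match structs carrying the same fields, and it assumes `score` is a float as shown above.

```elixir
defmodule ResultSummary do
  # Success: report the final label, MIME type, and rounded score.
  def describe({:ok, %{prediction: %{output: output, score: score, overwrite_reason: reason}}}) do
    base = "#{output.label} (#{output.mime_type}), score #{Float.round(score, 2)}"

    case reason do
      :none -> base
      # The output differs from the raw model prediction; say why.
      other -> base <> ", adjusted: #{other}"
    end
  end

  # Filesystem error: the status atom explains what went wrong.
  def describe({:error, %{status: status, path: path}}) do
    "could not read #{inspect(path)}: #{status}"
  end
end
```

Piping the return value of `Magika.identify_path/1` into `ResultSummary.describe/1` would then cover both the success and the error tuple in one place.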
## Model
The vendored model is Magika's `standard_v3_3` (Apache-2.0), copied verbatim
from the [upstream repository](https://github.com/google/magika) along with its
`config.min.json` and `content_types_kb.min.json`.
## License
Apache-2.0, matching upstream Magika. The bundled model and configuration files
are © Google LLC.