<div align="center">
<img src="https://capsule-render.vercel.app/api?type=waving&color=0:76B900,100:1A1A1A&height=200&section=header&text=viva_tensor&fontSize=64&fontColor=fff&animation=twinkling&fontAlignY=35&desc=FP8%20LLM%20inference%20in%20pure%20Gleam%20on%20the%20BEAM&descSize=18&descAlignY=55" width="100%"/>
**[🇧🇷 Português](docs/pt-br/README.md)** · **[🇺🇸 English](docs/en/README.md)** · **[🇨🇳 中文](docs/zh-cn/README.md)**
---
*"Tensors speak Gleam. Kernels burn silicon. The BEAM holds the soul."*
</div>
---
> [!IMPORTANT]
> **viva_tensor IS NOT A WRAPPER.**
> It is a **production-grade FP8 LLM inference engine** written from scratch:
> hand-tuned CUDA kernels, blocked W8A16 GEMV, full-token CUDA Graphs, and a
> public `ModelHandle` API, all driven from Gleam on the BEAM.
>
> On the same hardware it decodes TinyLlama-1.1B at **448 tok/s, versus 352 tok/s for Ollama**.
---
## 🎯 Overview
A tensor library for Gleam on the BEAM. Provides a pure-Gleam tensor API
for portability, an inference API for FP8 / INT4-2:4 / INT8-2:4 sparse
linear layers, and a public LLM `ModelHandle` API for Llama-family
HuggingFace checkpoints.
The library works fully in pure BEAM (slow but portable) and transparently
upgrades to the native CUDA path when the NIF is loaded.
| Property | Value |
| :------------------ | :--------------------------------------------------- |
| **Language** | Pure Gleam (type-safe functional) |
| **Runtime** | BEAM / OTP 27+ |
| **Native backend** | CUDA 12 + CUTLASS + cuSPARSELt (SM89 / Ada) |
| **Tests** | 792 passing |
| **Decode** | `448 tok/s` TinyLlama-1.1B (vs Ollama 352) |
| **Public API** | `viva_tensor.load_model` / `viva_tensor.generate` |
---
## ⚡ Quick Start
```bash
git clone https://github.com/gabrielmaialva33/viva_tensor.git && cd viva_tensor
gleam deps download
# Optional: native CUDA backend (RTX 4090 / Ada SM89)
make cutlass-libs # CUTLASS + cuSPARSELt static archives
make zig # the NIF shared object
gleam test # 792 tests, all pass with NIF loaded
```
### Generate text in 4 lines of Gleam
```gleam
import viva_tensor as t
let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")
let assert Ok(result) = t.generate(model, "Hello", t.default_generate_opts())
result.text
```
<details>
<summary><strong>📋 Prerequisites</strong></summary>
| Tool | Version | Required for |
| :------------------------ | :---------- | :-------------------- |
| Gleam | `>= 1.14` | Build / pure-Gleam |
| Erlang/OTP | `>= 27` | BEAM runtime |
| CUDA toolkit | `>= 12.0` | Native inference path |
| NVIDIA GPU | Ada+ (SM89) | FP8 / Tensor Cores |
| `make` + `zig` + `clang` | recent | NIF build pipeline |
The pure-Gleam path needs only Gleam + Erlang/OTP.
</details>
---
## 🏗️ Architecture
```
┌─────────────────────────────────────────────────────────┐
│                 Gleam application code                  │
│      viva_tensor.load_model / .generate / .Tensor       │
└────────────────────────┬────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────┐
│           Erlang public API (viva_tensor_llm)           │
│   SafeTensors loader · BPE tokenizer · sampling · KV    │
└────────────────────────┬────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────┐
│        NIF dispatch (viva_tensor_zig.so + .erl)         │
│  PackedWeight · EmbeddingTable · KvCache · ModelHandle  │
└─────────┬──────────────────────────────┬────────────────┘
          │                              │
┌─────────▼──────────┐         ┌─────────▼────────────────┐
│ Pure-Gleam tensors │         │ CUDA kernels             │
│ (no GPU needed)    │         │ W8A16 GEMV · FlashAttn   │
│                    │         │ Full-token CUDA Graph    │
│                    │         │ CUTLASS FP8/INT4 sparse  │
└────────────────────┘         └──────────────────────────┘
```
<details>
<summary><strong>Core Modules</strong></summary>
| Module | Description |
| :---------------------------------- | :------------------------------------------------ |
| `viva_tensor` | Public Gleam API: tensors, prepack, linear, LLM |
| `viva_tensor_llm`                   | `load_model` / `generate` over an opaque `ModelHandle` |
| `viva_tensor_zig` | NIF dispatch (Erlang stubs) |
| `viva_tensor_safetensors_ffi` | HF SafeTensors loader, sharded support, BF16/F16 |
| `viva_tensor_tokenizer_ffi` | SentencePiece + byte-level BPE (GPT-2/Llama-3) |
| `zig_src/cuda_block_forward.cu` | RMSNorm, RoPE, GQA flash attn, SiLU, residual |
| `zig_src/nif_forward_block.c` | Decode-step orchestration, CUDA Graph capture |
| `zig_src/cuda_fp8_cutlass.cu` | CUTLASS FP8 dense GEMM |
| `zig_src/nif_prepack_int_sparse.c` | INT4 / INT8 2:4 sparse weight prepack |
</details>
---
## 📊 Performance
> All numbers were measured on an **RTX 4090** (Ada SM89) with an Intel i9-13900K (32 threads @ 5.80 GHz).
> They are reproducible via the [`bench/`](./bench/) harness.
### Text generation: TinyLlama-1.1B-Chat (FP8 W8A16)
| Runtime | Decode speed |
| :------------------------------------------- | ---------------: |
| **viva_tensor (best run)**                   | **`448 tok/s`**  |
| **Ollama local baseline (same model)** | `352 tok/s` |
| `viva_tensor.generate` (warm) | `2.31 ms/token` |
| `viva_tensor.generate` Llama-3.2-1B-Instruct | `2.47 ms/token` |
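The table mixes two units, tokens per second and milliseconds per token. As a rough cross-check, 2.31 ms/token is about 433 tok/s, 2.47 ms/token is about 405 tok/s, and 448 tok/s against 352 tok/s is a ratio of roughly 1.27. The sketch below only reproduces that arithmetic; `tokens_per_second` and `speedup` are illustrative helpers, not part of viva_tensor.

```gleam
import gleam/float
import gleam/io

/// Convert a per-token decode latency (milliseconds) into tokens per second.
/// Illustrative helper, not a viva_tensor function.
pub fn tokens_per_second(ms_per_token: Float) -> Float {
  1000.0 /. ms_per_token
}

/// Ratio between two throughputs measured in the same unit.
pub fn speedup(ours: Float, baseline: Float) -> Float {
  ours /. baseline
}

pub fn main() {
  // Input figures are copied from the table above.
  io.println(
    "TinyLlama (warm): "
    <> float.to_string(tokens_per_second(2.31))
    <> " tok/s",
  )
  io.println(
    "Llama-3.2-1B:     "
    <> float.to_string(tokens_per_second(2.47))
    <> " tok/s",
  )
  io.println(
    "Best run vs Ollama: " <> float.to_string(speedup(448.0, 352.0)) <> "x",
  )
}
```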
### Validated models
| Model                            | Status        | Checkpoint format     | Notes                                          |
| :------------------------------- | :------------ | :-------------------- | :--------------------------------------------- |
| TinyLlama-1.1B-Chat              | ✅ validated  | single safetensors    | byte-identical baseline, `2.31 ms/tok`         |
| Llama-3.2-1B-Instruct (unsloth)  | ✅ validated  | single safetensors    | tied embeddings, byte-level BPE, `2.47 ms/tok` |
| NousResearch/Llama-2-7b-chat-hf  | ✅ validated  | sharded F16 (13.5 GB) | `head_dim=128` dynamic path, `113 ms/tok`      |
| Phi-2                            | ⚠️ partial    | sharded folder        | sharded discovery works; the Phi architecture differs from Llama |
### Quantized GEMM kernels (RTX 4090)
| Kernel | Peak performance | Backend |
| :-------------------------------------- | ------------------: | :--------------- |
| INT8 2:4 sparse (cuSPARSELt) | `1320 TOPS` | cuSPARSELt |
| INT4 2:4 sparse (CUTLASS Sm80) | `1854 TOPS` | CUTLASS |
| FP8 dense (CUTLASS E4M3 W8A8) | `~660 TFLOPS` | CUTLASS |
| FP8 W8A16 blocked GEMV (custom) | decode-optimized | hand-tuned CUDA |
Full methodology + raw numbers in [bench/results/matmul_showdown.md](bench/results/matmul_showdown.md).
---
## 🧬 Design Principles
| Principle | Description |
| :--------------------------- | :------------------------------------------------------------------- |
| **Honest numerics**          | Argmax tokens stay byte-identical to the HF fp32 reference           |
| **Pure-Gleam fallback** | Every API works without CUDA, just slower |
| **Owned device memory** | `PackedWeight`, `EmbeddingTable`, `KvCache` are Erlang resources |
| **Single-token by default** | Decode is `batch=1` first; batched prefill is future work |
| **No magic kernels** | Every `.cu` file is human-written, benchmarked, and committed |
---
## 🛠️ Public API
### High-level: LLM inference
```gleam
import viva_tensor as t

pub fn main() {
  let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")
  let opts =
    t.GenerateOpts(
      max_new_tokens: 50,
      temperature: 0.0, // argmax: deterministic
      top_k: t.TopKInfinity,
      top_p: 1.0,
      seed: 42,
      stop_on_eos: True,
    )
  let assert Ok(result) = t.generate(model, "Hello", opts)
  result.text
}
```
### Reproducible sampling
```gleam
let sampling_opts =
  t.GenerateOpts(
    max_new_tokens: 30,
    temperature: 0.8,
    top_k: t.TopK(40),
    top_p: 0.95,
    seed: 42,
    stop_on_eos: True,
  )
```
The same `seed` yields the same token sequence across machines.
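To illustrate why a fixed seed can give machine-independent output, here is a minimal pure-Gleam sketch of seeded categorical sampling. It is not viva_tensor's sampler: the linear congruential generator, the integer weights, and the names `next_state`, `pick`, and `sample` are all assumptions made for the example. The point is only that when every draw is a pure function of the previous integer state, the same seed replays the same sequence.

```gleam
import gleam/int
import gleam/io
import gleam/list
import gleam/string

/// One step of a linear congruential generator
/// (a = 1103515245, c = 12345, m = 2^31). Integer-only arithmetic,
/// so the stream does not depend on the host's floating-point behaviour.
pub fn next_state(state: Int) -> Int {
  { state * 1_103_515_245 + 12_345 } % 2_147_483_648
}

/// Map one draw onto an index, proportionally to positive integer weights.
pub fn pick(weights: List(Int), draw: Int) -> Int {
  pick_loop(weights, draw % int.sum(weights), 0)
}

fn pick_loop(weights: List(Int), target: Int, index: Int) -> Int {
  case weights {
    [] -> index
    [w, ..rest] ->
      case target < w {
        True -> index
        False -> pick_loop(rest, target - w, index + 1)
      }
  }
}

/// Draw `count` indices, threading the generator state from `seed`.
pub fn sample(weights: List(Int), seed: Int, count: Int) -> List(Int) {
  case count <= 0 {
    True -> []
    False -> {
      let state = next_state(seed)
      // Use the higher bits: the low bits of this kind of LCG cycle quickly.
      [pick(weights, state / 65_536), ..sample(weights, state, count - 1)]
    }
  }
}

pub fn main() {
  let weights = [5, 3, 2]
  let first = sample(weights, 42, 8)
  let second = sample(weights, 42, 8)
  io.println(first |> list.map(int.to_string) |> string.join(", "))
  io.println(case first == second {
    True -> "same seed, same sequence"
    False -> "sequences differ"
  })
}
```

A real sampler would first apply temperature, top-k, and top-p to the logits; the seeded draw at the end follows the same pattern.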
### Low-level: quantized linears
```gleam
let assert Ok(packed) = t.prepack_fp8_weight_blocked(weight, 16)
let assert Ok(output) = t.linear_fp8_w8a16(input, packed, bias)
```
Prepack once, then run the linear forward as many times as needed: the FP8
weight and its scales stay on the device for the lifetime of the `PackedWeight` resource.
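The prepack-once pattern might look like the sketch below. It uses only the two calls shown above and assumes that both return a `Result` with the same error type; `forward_all` is a hypothetical helper, not part of the library, and the block size `16` is copied from the snippet above.

```gleam
import gleam/list
import gleam/result
import viva_tensor as t

/// Hypothetical helper: pack the weight once, then reuse it for every input.
/// Assumes `prepack_fp8_weight_blocked` and `linear_fp8_w8a16` share an
/// error type, so their results can be chained.
pub fn forward_all(weight, inputs, bias) {
  // The quantized weight is prepared a single time.
  use packed <- result.try(t.prepack_fp8_weight_blocked(weight, 16))

  // Every forward pass reuses the same packed weight; stop at the first error.
  list.try_map(inputs, fn(input) { t.linear_fp8_w8a16(input, packed, bias) })
}
```

Calling the prepack function inside the loop instead would repeat the packing work on every call, which is the cost this pattern avoids.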
---
## 🗺️ Roadmap
| Phase                                         | Status |
| :-------------------------------------------- | :----: |
| Pure-Gleam tensors                            |   ✅   |
| CUDA backend (CUTLASS + cuSPARSELt)           |   ✅   |
| FP8 / INT4-2:4 / INT8-2:4 sparse kernels      |   ✅   |
| Public `ModelHandle` API                      |   ✅   |
| Sharded SafeTensors loader                    |   ✅   |
| Byte-level BPE + SentencePiece tokenizers     |   ✅   |
| Weight-tied embeddings                        |   ✅   |
| Full-token CUDA Graph capture                 |   ✅   |
| Reproducible temperature/top-k/top-p sampling |   ✅   |
| FP16 weight dtype (Llama-2-7B)                |   🔄   |
| Batched prefill                               |   ⏳   |
| Speculative decoding                          |   ⏳   |
| Hopper SM90 / Blackwell FP4 / NVFP4           |   ⏳   |
---
## 🤝 Contributing
```bash
git checkout -b feature/your-feature
make cutlass-libs && make zig
gleam test # 792 should pass with NIF loaded
make test-no-nif # 791 should pass without NIF
```
See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
---
## 📚 Documentation
| Language     | Link                                  |
| :----------- | :------------------------------------ |
| 🇧🇷 Português | [docs/pt-br/](docs/pt-br/README.md)   |
| 🇺🇸 English   | [docs/en/](docs/en/README.md)         |
| 🇨🇳 中文      | [docs/zh-cn/](docs/zh-cn/README.md)   |
### Guides
- **[Getting started](docs/en/guides/getting-started.md)**: install, first run.
- **[LLM inference end-to-end](docs/en/guides/inference.md)**: load → tokenize → decode → sample.
- **[FFI architecture](docs/en/guides/ffi-architecture.md)**: Gleam → Erlang → C/CUDA boundaries.
- **[Project structure](docs/en/guides/project-structure.md)**: repo layout.
### API reference
- **[LLM ModelHandle](docs/en/api/llm.md)**: `load_model`, `generate`, tested models.
- **[Inference API](docs/en/api/inference.md)**: prepack + linear FP8 / INT-sparse.
- **[Tensor API](docs/en/api/tensor.md)**: pure-Gleam tensor surface.
### Technical paper
- **[Honest paper](docs/en/paper.md)**: what works, what doesn't, why.
---
## What's new in 2.2.102
- **Public LLM API.** `viva_tensor.load_model(path)` accepts a HuggingFace
Llama-family checkpoint (single file, sharded, or folder) and returns
an opaque `ModelHandle`. `viva_tensor.generate(model, prompt, opts)`
drives deterministic argmax or seeded temperature/top-k/top-p sampling.
- **Fast FP8 W8A16 decode.** Hand-tuned `vt_w8a16_mmv_blocked_k16` GEMV
with `uint4` vectorized loads, full-token CUDA Graph capture with
`cudaGraphExecUpdate`, persistent device-resident KV caches, and a
cuBLASLt plan cache.
- **Multi-model validated.** TinyLlama-1.1B and Llama-3.2-1B-Instruct
  both pass the byte-identical check and run through the same public API.
- **Sharded SafeTensors.** Loads single `.safetensors`, HF
`model.safetensors.index.json`, or any folder containing either.
- **Byte-level BPE.** GPT-2 / Llama-3 byte-encoded vocabularies decode
  back to readable text; SentencePiece (`▁`) still works as before.
- **Tied embeddings.** Detects `tie_word_embeddings` from `config.json`
and reuses `embed_tokens` as `lm_head` when set.
The full evolution from the `63 tok/s` baseline to `448 tok/s`, across 14 rounds
of optimization, is documented in [CHANGELOG.md](./CHANGELOG.md).
---
<div align="center">
**Star if you believe BEAM can do LLM inference ⭐**
*Created by Gabriel Maia · MIT License*
<img src="https://capsule-render.vercel.app/api?type=waving&color=0:1A1A1A,100:76B900&height=100&section=footer" width="100%"/>
</div>