# BumblebeeQuantized
4-bit quantized LLM inference with LoRA adapters for Apple Silicon.
Run 8B-parameter models in ~5 GB of RAM, with LoRA-based fine-tuning support.
## Features
- **4-bit Quantized Inference** - Run quantized models using MLX's fused Metal kernels
- **Runtime LoRA Adapters** - Load and apply fine-tuned adapters at inference time
- **Training Integration** - Train your own LoRA adapters via mlx_lm
- **Apple Silicon Optimized** - Uses unified memory for zero-copy GPU access
## Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- Elixir 1.15+
- Python 3.10+ with mlx_lm (for training only)
## Installation
```elixir
def deps do
  [
    {:bumblebee_quantized, "~> 0.1.0"},
    # REQUIRED: EMLX with quantization ops (not on Hex yet)
    {:emlx, github: "notactuallytreyanastasio/emlx", branch: "feat/quantization-ops"}
  ]
end
```
> **Note**: The EMLX quantization ops are pending upstream merge ([PR #95](https://github.com/elixir-nx/emlx/pull/95)).
> Once merged, you'll only need `{:bumblebee_quantized, "~> 0.1.0"}`.
## Quick Start
```elixir
# Load a quantized model
{:ok, model} = BumblebeeQuantized.load_model("/path/to/Qwen3-8B-MLX-4bit")

# Load a LoRA adapter (optional)
{:ok, adapter} = BumblebeeQuantized.load_adapter("/path/to/adapter")

# Load the tokenizer via Bumblebee
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

# Create a serving and generate text
serving =
  BumblebeeQuantized.Serving.new(model, tokenizer,
    adapter: adapter,
    max_new_tokens: 100,
    temperature: 0.8
  )

Nx.Serving.run(serving, "Write a post about Elixir")
```
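For long-running applications, the serving can also be started under your supervision tree and called by name with `Nx.Serving.batched_run/2` (standard `Nx.Serving` usage, not specific to this library). The sketch below assumes `model`, `tokenizer`, and `adapter` are loaded as in the Quick Start; the `MyApp.TextServing` name and batch options are illustrative.
```elixir
# In your application's supervision tree, e.g. lib/my_app/application.ex.
# The serving is built once at startup and shared across caller processes.
children = [
  {Nx.Serving,
   serving: BumblebeeQuantized.Serving.new(model, tokenizer, adapter: adapter),
   name: MyApp.TextServing,
   batch_timeout: 100}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Later, from any process:
Nx.Serving.batched_run(MyApp.TextServing, "Write a post about Elixir")
```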
## Full Training Workflow
```elixir
# 1. Prepare training data
posts = ["First post...", "Second post..."]  # ...the rest of your corpus

BumblebeeQuantized.Training.prepare_data(posts, "/path/to/data",
  prompt: "Write a post in my style",
  min_length: 160
)

# 2. Train an adapter (calls Python mlx_lm)
{:ok, adapter_path} =
  BumblebeeQuantized.Training.train(
    base_model: "lmstudio-community/Qwen3-8B-MLX-4bit",
    training_data: "/path/to/data",
    output_path: "/path/to/adapter",
    iterations: 25_000,
    rank: 8,
    scale: 20.0
  )

# 3. Load and use the trained adapter
{:ok, model} = BumblebeeQuantized.load_model("/path/to/Qwen3-8B-4bit")
{:ok, adapter} = BumblebeeQuantized.load_adapter(adapter_path)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

serving = BumblebeeQuantized.Serving.new(model, tokenizer, adapter: adapter)
Nx.Serving.run(serving, "Write a post")
```
## Performance
Tested on Apple Silicon:
| Metric | Value |
|--------|-------|
| Model | Qwen3-8B-4bit |
| Memory Usage | ~5GB |
| Model Load Time | 4-6 seconds |
| Single Token Latency | ~7ms (135 tok/s) |
| Generation Throughput | ~21 tok/s |
## Modules
| Module | Description |
|--------|-------------|
| `BumblebeeQuantized.Loader` | Load quantized models from safetensors |
| `BumblebeeQuantized.Adapters` | Load, apply, and train LoRA adapters |
| `BumblebeeQuantized.Serving` | Nx.Serving for text generation |
| `BumblebeeQuantized.Training` | LoRA training workflow |
| `BumblebeeQuantized.Models.Qwen3` | Qwen3 quantized model definition |
## Supported Models
Currently supported:
- Qwen3 (tested with 8B; other sizes should work)
Planned:
- LLaMA 2/3
- Mistral
## How It Works
1. **Quantized Weights**: Models are stored in MLX 4-bit format with weight triplets (packed uint32, scales, biases)
2. **EMLX Backend**: Uses our [EMLX fork](https://github.com/notactuallytreyanastasio/emlx/tree/feat/quantization-ops) with a `quantized_matmul` NIF
3. **Runtime LoRA**: Adapters are applied at inference time: `output = base_output + scale * (x @ A @ B)` (see the sketch below)
4. **Bumblebee Tokenizer**: Uses Bumblebee's tokenizer for text encoding/decoding
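The runtime LoRA step in item 3 amounts to two small dense matmuls added onto the base layer's output. The sketch below illustrates the arithmetic only; `LoraSketch.apply_lora/5` and the tensor names are hypothetical, not part of the library's API.
```elixir
defmodule LoraSketch do
  import Nx.Defn

  # base_out: {batch, out_features} - output of the quantized base layer
  # x:        {batch, in_features}  - the layer's input
  # a:        {in_features, rank}   - LoRA "A" matrix
  # b:        {rank, out_features}  - LoRA "B" matrix
  defn apply_lora(base_out, x, a, b, scale) do
    # delta = x @ A @ B, scaled and added to the frozen base output
    delta =
      x
      |> Nx.dot(a)
      |> Nx.dot(b)

    base_out + scale * delta
  end
end
```
Because the rank is small (8 in the training example above), the extra matmuls add little overhead compared to the quantized base projection.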
## Related Projects
- [bobby_posts](https://github.com/notactuallytreyanastasio/bobby_posts) - The project that spawned this library
- [EMLX Fork](https://github.com/notactuallytreyanastasio/emlx/tree/feat/quantization-ops) - EMLX with quantization ops
- [safetensors_ex](https://github.com/notactuallytreyanastasio/safetensors_ex) - Safetensors parser for Elixir
## License
MIT