# BumblebeeQuantized

4-bit quantized LLM inference with LoRA adapters for Apple Silicon.

Run 8B-parameter models in ~5GB of RAM, with LoRA fine-tuning support.

## Features

- **4-bit Quantized Inference** - Run quantized models using MLX's fused Metal kernels
- **Runtime LoRA Adapters** - Load and apply fine-tuned adapters at inference time
- **Training Integration** - Train your own LoRA adapters via mlx_lm
- **Apple Silicon Optimized** - Uses unified memory for zero-copy GPU access

## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- Elixir 1.15+
- Python 3.10+ with mlx_lm (for training only)

## Installation

```elixir
def deps do
  [
    {:bumblebee_quantized, "~> 0.1.0"},
    # REQUIRED: EMLX with quantization ops (not on Hex yet)
    {:emlx, github: "notactuallytreyanastasio/emlx", branch: "feat/quantization-ops"}
  ]
end
```

> **Note**: The EMLX quantization ops are pending upstream merge ([PR #95](https://github.com/elixir-nx/emlx/pull/95)).
> Once merged, you'll only need `{:bumblebee_quantized, "~> 0.1.0"}`.

## Quick Start

```elixir
# Load a quantized model
{:ok, model} = BumblebeeQuantized.load_model(
  "/path/to/Qwen3-8B-MLX-4bit"
)

# Load a LoRA adapter (optional)
{:ok, adapter} = BumblebeeQuantized.load_adapter("/path/to/adapter")

# Load tokenizer via Bumblebee
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

# Create a serving and generate text
serving = BumblebeeQuantized.Serving.new(model, tokenizer,
  adapter: adapter,
  max_new_tokens: 100,
  temperature: 0.8
)

Nx.Serving.run(serving, "Write a post about Elixir")
```

## Full Training Workflow

```elixir
# 1. Prepare training data
posts = ["First post...", "Second post..."]  # ...plus the rest of your corpus

BumblebeeQuantized.Training.prepare_data(posts, "/path/to/data",
  prompt: "Write a post in my style",
  min_length: 160
)

# 2. Train adapter (calls Python mlx_lm)
{:ok, adapter_path} = BumblebeeQuantized.Training.train(
  base_model: "lmstudio-community/Qwen3-8B-MLX-4bit",
  training_data: "/path/to/data",
  output_path: "/path/to/adapter",
  iterations: 25_000,
  rank: 8,
  scale: 20.0
)

# 3. Load and use
{:ok, model} = BumblebeeQuantized.load_model("/path/to/Qwen3-8B-4bit")
{:ok, adapter} = BumblebeeQuantized.load_adapter(adapter_path)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

serving = BumblebeeQuantized.Serving.new(model, tokenizer, adapter: adapter)
Nx.Serving.run(serving, "Write a post")
```

## Performance

Tested on Apple Silicon:

| Metric | Value |
|--------|-------|
| Model | Qwen3-8B-4bit |
| Memory Usage | ~5GB |
| Model Load Time | 4-6 seconds |
| Single Token Latency | ~7ms (135 tok/s) |
| Generation Throughput | ~21 tok/s |

## Modules

| Module | Description |
|--------|-------------|
| `BumblebeeQuantized.Loader` | Load quantized models from safetensors |
| `BumblebeeQuantized.Adapters` | Load, apply, and train LoRA adapters |
| `BumblebeeQuantized.Serving` | Nx.Serving for text generation |
| `BumblebeeQuantized.Training` | LoRA training workflow |
| `BumblebeeQuantized.Models.Qwen3` | Qwen3 quantized model definition |

## Supported Models

Currently supported:
- Qwen3 (tested with 8B; other sizes should work)

Planned:
- LLaMA 2/3
- Mistral

## How It Works

1. **Quantized Weights**: Models are stored in MLX 4-bit format with weight triplets (packed uint32, scales, biases); see the dequantization sketch after this list

2. **EMLX Backend**: Uses our [EMLX fork](https://github.com/notactuallytreyanastasio/emlx/tree/feat/quantization-ops) with a `quantized_matmul` NIF

3. **Runtime LoRA**: Adapters are applied at inference time: `output = base_output + scale * (x @ A @ B)`; see the Nx sketch after this list

4. **Bumblebee Tokenizer**: Uses Bumblebee's tokenizer for text encoding/decoding
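
For item 1, here is a minimal sketch of how a 4-bit weight triplet can be expanded back to floats with Nx. It is illustrative only, not the library's internals, and it assumes MLX-style affine quantization: eight 4-bit values packed per uint32 (low nibble first) and one scale/bias pair per group of 64 weights, so `w ≈ q * scale + bias`.

```elixir
defmodule DequantSketch do
  import Nx.Defn

  # `packed` is {rows, cols / 8} uint32; `scales` and `biases` are {rows, cols / group_size}.
  defn dequantize(packed, scales, biases, opts \\ []) do
    opts = keyword!(opts, group_size: 64)
    group_size = opts[:group_size]

    rows = Nx.axis_size(packed, 0)
    packed_cols = Nx.axis_size(packed, 1)
    cols = packed_cols * 8

    # Unpack the eight 4-bit nibbles from each uint32 (nibble order is an assumption)
    q =
      packed
      |> Nx.reshape({rows, packed_cols, 1})
      |> Nx.right_shift(Nx.multiply(Nx.iota({1, 1, 8}), 4))
      |> Nx.bitwise_and(0xF)
      |> Nx.reshape({rows, cols})

    # Apply the per-group affine parameters: w ≈ q * scale + bias
    q
    |> Nx.reshape({rows, div(cols, group_size), group_size})
    |> Nx.multiply(Nx.new_axis(scales, -1))
    |> Nx.add(Nx.new_axis(biases, -1))
    |> Nx.reshape({rows, cols})
  end
end
```

And for item 3, the same LoRA update written out in Nx. Again a sketch with illustrative names and shapes, not the library's internal API:

```elixir
defmodule LoraSketch do
  import Nx.Defn

  # output = base_output + scale * (x @ A @ B)
  # `x` is the layer input, `base_output` the quantized projection's output,
  # `a` ({in, rank}) and `b` ({rank, out}) the adapter matrices, `scale` the LoRA scaling factor.
  defn apply_lora(x, base_output, a, b, scale) do
    delta = x |> Nx.dot(a) |> Nx.dot(b)
    Nx.add(base_output, Nx.multiply(scale, delta))
  end
end
```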

## Related Projects

- [bobby_posts](https://github.com/notactuallytreyanastasio/bobby_posts) - The project that spawned this library
- [EMLX Fork](https://github.com/notactuallytreyanastasio/emlx/tree/feat/quantization-ops) - EMLX with quantization ops
- [safetensors_ex](https://github.com/notactuallytreyanastasio/safetensors_ex) - Safetensors parser for Elixir

## License

MIT