
# Troubleshooting

Reference this guide when CLI or SDK calls fail or diverge from expectations. Most fixes involve configuration, backpressure handling, or local environment setup.

## Authentication or config errors

- **Missing API key/base URL**: `Tinkex.Config.new/1` raises or returns validation errors when `api_key`/`base_url` are absent. Set `TINKER_API_KEY` (and optionally `TINKER_BASE_URL`) or pass explicit options.
- **Non-default pool selection**: If you override `:base_url` without starting a matching Finch pool, requests fall back to Finch defaults. Use the same base URL configured in `Tinkex.Application` for production workloads, or provide a custom pool via `config :tinkex, :http_pool, MyPool`.
- **Session SDK version too old**: Some endpoints may reject requests (notably vision/image input) if the reported SDK version is too old. Tinkex reports the official Python Tinker SDK version configured in `mix.exs`; update to the latest Tinkex if you hit this.
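
As a rough pre-flight sketch for the missing-key case above, the environment can be checked before a config is built. `ConfigPreflight` is a hypothetical helper, not part of Tinkex, and the commented usage assumes `Tinkex.Config.new/1` accepts a keyword list with `:api_key` and `:base_url`.

```elixir
defmodule ConfigPreflight do
  @moduledoc "Hypothetical helper for illustration; not part of Tinkex."

  # Builds keyword options from an env map (defaults to the process
  # environment) or reports that the API key is missing.
  def options(env \\ System.get_env()) do
    case Map.get(env, "TINKER_API_KEY") do
      key when is_binary(key) and key != "" ->
        {:ok, maybe_add_base_url([api_key: key], env)}

      _ ->
        {:error, :missing_api_key}
    end
  end

  # TINKER_BASE_URL is optional, so it is only added when set and non-empty.
  defp maybe_add_base_url(opts, env) do
    case Map.get(env, "TINKER_BASE_URL") do
      url when is_binary(url) and url != "" -> opts ++ [base_url: url]
      _ -> opts
    end
  end
end

# Usage (requires the Tinkex dependency; keyword options are an assumption):
# {:ok, opts} = ConfigPreflight.options()
# config = Tinkex.Config.new(opts)
```

Failing early with `{:error, :missing_api_key}` gives a clearer message than a validation error raised deeper in the call stack.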

## Vision and multimodal inputs

- **Asset is not a valid image**: The backend rejected the image bytes. Verify you are sending a real PNG/JPEG (and that `format` matches the file), try a different image, and avoid setting `expected_tokens` unless you know the correct value. The bundled example supports `TINKER_IMAGE_PATH` / `TINKER_IMAGE_EXPECTED_TOKENS`.
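
One way to confirm that a file really is a PNG or JPEG before sending it is to inspect its leading bytes. The sketch below relies only on the published file signatures; `ImageSniff` is a hypothetical helper, and how its result maps onto the `format` value you pass is left for you to compare.

```elixir
defmodule ImageSniff do
  @moduledoc "Hypothetical helper for illustration; not part of Tinkex."

  # PNG files start with the fixed 8-byte signature 89 50 4E 47 0D 0A 1A 0A.
  def detect(<<0x89, "PNG", 0x0D, 0x0A, 0x1A, 0x0A, _rest::binary>>), do: :png
  # JPEG files start with the SOI marker FF D8 followed by another FF marker byte.
  def detect(<<0xFF, 0xD8, 0xFF, _rest::binary>>), do: :jpeg
  def detect(_other), do: :unknown

  # Reads the file and returns {:ok, type} or the {:error, reason} from File.read/1.
  def detect_file(path) do
    with {:ok, bytes} <- File.read(path) do
      {:ok, detect(bytes)}
    end
  end
end

# Usage: compare the detected type with the format you intend to send.
# ImageSniff.detect_file(System.get_env("TINKER_IMAGE_PATH") || "image.png")
```

A result of `:unknown` means the bytes match neither signature. A matching signature does not rule out corruption later in the file.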

## Timeouts, queuing, or 429 responses

- **Long-running training steps**: Increase `:timeout` on `Tinkex.Config` or pass `:await_timeout` to client calls. Training requests are sent sequentially; enqueue fewer simultaneous batches to keep the GenServer responsive.
- **Queue backpressure**: Sampling and training futures emit telemetry `[:tinkex, :queue, :state_change]`. Attach `Tinkex.Telemetry.attach_logger/1` or a custom handler to watch for `:paused_rate_limit` / `:paused_capacity`.
- **HTTP 429**: The RateLimiter stores per-tenant backoff windows. You do not need to retry manually while a backoff is active; subsequent calls will sleep. When testing, lower concurrency or reuse the same `ServiceClient` to share limiter state.
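
If the bundled logger is not enough, a custom handler for the queue event named above might look like the following sketch. It uses the standard `:telemetry.attach/4` call. `QueueWatcher` and the handler id are hypothetical. Because the exact shape of the measurements and metadata is not documented here, the handler logs both in full rather than assuming particular keys.

```elixir
defmodule QueueWatcher do
  @moduledoc "Hypothetical handler for illustration; not part of Tinkex."
  require Logger

  @event [:tinkex, :queue, :state_change]

  # Attaches the handler. Requires the :telemetry application to be available.
  def attach do
    :telemetry.attach("queue-watcher", @event, &__MODULE__.handle_event/4, nil)
  end

  # Telemetry handler callback: logs every state change in full.
  def handle_event(event, measurements, metadata, _config) do
    Logger.warning(describe(event, measurements, metadata))
  end

  # Formats the event without assuming specific measurement or metadata keys.
  def describe(event, measurements, metadata) do
    "#{Enum.join(event, ".")} measurements=#{inspect(measurements)} " <>
      "metadata=#{inspect(metadata)}"
  end
end

# Usage: call once at startup, then look for :paused_rate_limit or
# :paused_capacity in the logged output.
# QueueWatcher.attach()
```

The handler is passed as an external function capture (`&Mod.fun/4`), which the telemetry library recommends over anonymous functions.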

## Tokenizer (NIF) issues

- **Compilation/ABI errors**: Ensure the Rust and C toolchains are installed, then re-run `mix deps.compile tokenizers`.
- **Runtime crashes**: The ETS cache stores NIF handles; verify that the runtime OS/CPU architecture matches the one used to build the dependencies. If you suspect a bad cache entry, restart the BEAM and clear `_build`/`deps`.
- **Unexpected token IDs**: Confirm you are passing fully formatted text (chat templates are not inserted) and the correct model name. For Llama-3 variants, the SDK automatically swaps to `"thinkingmachineslabinc/meta-llama-3-tokenizer"`.
- **Kimi K2 tokenizers**: Kimi uses `tiktoken.model` + `tokenizer_config.json` (via `tiktoken_ex`), not a HuggingFace `tokenizer.json`. Ensure those files can be downloaded from HuggingFace or pass `tiktoken_model_path`/`tokenizer_config_path`.
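
To compare the runtime against the build machine when chasing NIF crashes, a small sketch like this prints the relevant details. It uses only standard Erlang/Elixir calls; `RuntimeInfo` is a hypothetical module name.

```elixir
defmodule RuntimeInfo do
  @moduledoc "Hypothetical helper for illustration; not part of Tinkex."

  # Collects the platform details that matter for natively compiled dependencies.
  def summary do
    %{
      # Target triple the Erlang VM was built for (returned as a charlist).
      architecture: List.to_string(:erlang.system_info(:system_architecture)),
      os_type: :os.type(),
      otp_release: System.otp_release(),
      elixir: System.version()
    }
  end
end

# Run this on both the build host and the runtime host, then compare the output.
IO.inspect(RuntimeInfo.summary(), label: "runtime")
```

If the `architecture` or `os_type` values differ between the two hosts, rebuilding the dependencies on the runtime platform is the likely fix.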

## CLI failures

- **`--output` missing**: `tinkex checkpoint` requires `--output` to write metadata. Provide a path with write permissions.
- **Missing base model**: Both `run` and `checkpoint` expect `--base-model` (or `--model-path` for `run`). Validate the option spelling and casing.
- **Prompt file errors**: `--prompt-file` accepts plain text or a JSON array of token IDs. Confirm the file is readable and valid UTF-8/JSON.
- **EXLA errors**: EXLA is optional and is not started automatically. If you need EXLA-backed Nx operations, run via `mix run` / an OTP release and start `:exla` before calling `Nx.default_backend/1`.
- **Stuck or slow runs**: Pass `--http-timeout` / `--timeout` and monitor telemetry logs. Use `--json` to inspect raw server payloads when diagnosing errors.
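
A prompt file can be checked locally before it is handed to `--prompt-file`. The sketch below is a pre-flight check only, not a description of how the CLI parses the file. `PromptFileCheck` is hypothetical, and it assumes token IDs are written as a flat JSON array of non-negative integers.

```elixir
defmodule PromptFileCheck do
  @moduledoc "Hypothetical helper for illustration; not part of Tinkex."

  # Returns {:ok, :token_ids}, {:ok, :text}, or {:error, reason}.
  def classify(path) do
    case File.read(path) do
      {:error, reason} ->
        {:error, {:unreadable, reason}}

      {:ok, contents} ->
        cond do
          not String.valid?(contents) -> {:error, :invalid_utf8}
          token_array?(contents) -> {:ok, :token_ids}
          true -> {:ok, :text}
        end
    end
  end

  # Matches a flat array such as "[1, 2, 3]" (assumed token ID layout).
  defp token_array?(contents) do
    Regex.match?(~r/\A\s*\[\s*(\d+\s*(,\s*\d+\s*)*)?\]\s*\z/, contents)
  end
end

# Usage:
# PromptFileCheck.classify("prompt.json")
```

A file that begins with `[` but is classified as `:text` is worth a second look, since it may be a malformed token array.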

## Comparing with the Python SDK

- Use the same base model, prompt text, sampling params (temperature, top_p, max_tokens), and seed (if supported) on both clients.
- Request logprobs (`prompt_logprobs` / `topk_prompt_logprobs`) to compare token-level probabilities. Expect similar, not identical, text output.
- If results diverge, verify tokenizer IDs match (`TrainingClient.get_info/1` when available) and that both clients point to the same `base_url`.
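
Because outputs are expected to be similar rather than identical, logprobs are best compared with a tolerance. The sketch below assumes each client yields a flat list of per-token logprobs in which some positions may be `nil`. `LogprobDiff` and the default tolerance are illustrative choices, not SDK values.

```elixir
defmodule LogprobDiff do
  @moduledoc "Hypothetical helper for illustration; not part of Tinkex."

  # Largest absolute per-token difference between two equal-length lists.
  def max_abs_diff(a, b) when length(a) == length(b) do
    diff =
      a
      |> Enum.zip(b)
      # Defensive: skip positions where either side has no value.
      |> Enum.reject(fn {x, y} -> is_nil(x) or is_nil(y) end)
      |> Enum.reduce(0.0, fn {x, y}, acc -> max(acc, abs(x - y)) end)

    {:ok, diff}
  end

  # Differing lengths usually point to a tokenizer or prompt mismatch.
  def max_abs_diff(_a, _b), do: {:error, :length_mismatch}

  # True when the lists have equal length and differ by at most `tolerance`.
  def close?(a, b, tolerance \\ 1.0e-3) do
    match?({:ok, d} when d <= tolerance, max_abs_diff(a, b))
  end
end

# Usage, with lists extracted from each SDK's response:
# LogprobDiff.close?(elixir_logprobs, python_logprobs)
```

A `:length_mismatch` result suggests checking tokenization before looking at numerical differences.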

## Documentation build issues

`mix docs` relies on dev-only deps. Run it in a dev environment (not production releases) and ensure `ex_doc` is installed. If assets are missing, rebuild the escript or fetch deps again with `mix deps.get`.