docs/guides/troubleshooting.md

# Troubleshooting

Reference this guide when CLI or SDK calls fail or diverge from expectations. Most fixes involve configuration, backpressure handling, or local environment setup.

## Authentication or config errors

- **Missing API key/base URL**: `Tinkex.Config.new/1` raises or returns validation errors when `api_key`/`base_url` are absent. Set `TINKER_API_KEY` (and optionally `TINKER_BASE_URL`) or pass explicit options.
- **Non-default pool selection**: If you override `:base_url` without starting a matching Finch pool, requests fall back to Finch defaults. Use the same base URL configured in `Tinkex.Application` for production workloads, or provide a custom pool via `config :tinkex, :http_pool, MyPool`.
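The two fixes above can be sketched together; a minimal example, assuming only the names this guide already mentions (`Tinkex.Config.new/1`, the `TINKER_*` environment variables, and the `:http_pool` application config):

```elixir
# Fail fast if TINKER_API_KEY is unset; TINKER_BASE_URL is optional and
# nil falls through to the SDK default.
config =
  Tinkex.Config.new(
    api_key: System.fetch_env!("TINKER_API_KEY"),
    base_url: System.get_env("TINKER_BASE_URL")
  )
```

When overriding `:base_url`, point the SDK at a Finch pool started for that host (for example in `config/config.exs`):

```elixir
# MyPool is a placeholder for a Finch pool you supervise yourself.
config :tinkex, :http_pool, MyPool
```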

## Timeouts, queuing, or 429 responses

- **Long-running training steps**: Increase `:timeout` on `Tinkex.Config` or pass `:await_timeout` to client calls. Training requests are sent sequentially; enqueue fewer simultaneous batches to keep the GenServer responsive.
- **Queue backpressure**: Sampling and training futures emit telemetry `[:tinkex, :queue, :state_change]`. Attach `Tinkex.Telemetry.attach_logger/1` or a custom handler to watch for `:paused_rate_limit` / `:paused_capacity`.
- **HTTP 429**: The RateLimiter stores per-tenant backoff windows. You do not need to retry manually while a backoff is active; subsequent calls sleep until the window expires. When testing, lower concurrency or reuse the same `ServiceClient` to share limiter state.
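Beyond `Tinkex.Telemetry.attach_logger/1`, you can watch the queue event with a plain `:telemetry` handler. A minimal sketch; the `:state` metadata key is an assumption, so inspect a real payload before relying on it:

```elixir
:telemetry.attach(
  "tinkex-queue-watcher",
  [:tinkex, :queue, :state_change],
  fn _event, _measurements, metadata, _config ->
    # NOTE: metadata shape is assumed here; log the whole map first.
    case metadata[:state] do
      :paused_rate_limit -> IO.puts("queue paused: rate limited")
      :paused_capacity -> IO.puts("queue paused: at capacity")
      other -> IO.puts("queue state: #{inspect(other)}")
    end
  end,
  nil
)
```

Detach with `:telemetry.detach("tinkex-queue-watcher")` when you are done diagnosing.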

## Tokenizer (NIF) issues

- **Compilation/ABI errors**: Ensure Rust and C toolchains are available, then re-run `mix deps.compile tokenizers`.
- **Runtime crashes**: The ETS cache stores NIF handles; verify that the running system matches the OS/CPU architecture the dependencies were built on. If you suspect a bad cache entry, restart the BEAM and clear `_build`/`deps`.
- **Unexpected token IDs**: Confirm you are passing fully formatted text (the SDK does not insert chat templates) and the correct model name. For Llama-3 variants, the SDK automatically swaps to `"thinkingmachineslabinc/meta-llama-3-tokenizer"`.
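A typical clean-rebuild sequence for the tokenizers NIF, using standard Mix commands and default build paths:

```shell
# Confirm both toolchains are on PATH
rustc --version
cc --version

# Clear cached builds, then refetch and recompile the NIF
rm -rf _build deps
mix deps.get
mix deps.compile tokenizers
```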

## CLI failures

- **`--output` missing**: `tinkex checkpoint` requires `--output` to write metadata. Provide a path with write permissions.
- **Missing base model**: Both `run` and `checkpoint` expect `--base-model` (or `--model-path` for `run`). Validate the option spelling and casing.
- **Prompt file errors**: `--prompt-file` accepts plain text or a JSON array of token IDs. Confirm the file is readable and valid UTF-8/JSON.
- **Stuck or slow runs**: Pass `--http-timeout` / `--timeout` and monitor telemetry logs. Use `--json` to inspect raw server payloads when diagnosing errors.
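Putting the flags above together; the flag names come from this guide, while the model name, file paths, and timeout value are placeholders:

```shell
# Diagnose a slow run with raw server payloads and an explicit timeout
tinkex run \
  --base-model my-base-model \
  --prompt-file prompt.txt \
  --http-timeout 120000 \
  --json

# checkpoint refuses to run without --output; point it at a writable path
tinkex checkpoint --base-model my-base-model --output ./checkpoint.json
```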

## Comparing with the Python SDK

- Use the same base model, prompt text, sampling params (temperature, top_p, max_tokens), and seed (if supported) on both clients.
- Request logprobs (`prompt_logprobs` / `topk_prompt_logprobs`) to compare token-level probabilities. Expect similar, not identical, text output.
- If results diverge, verify tokenizer IDs match (`TrainingClient.get_info/1` when available) and that both clients point to the same `base_url`.
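Once you have per-token logprobs from both clients, a crude parity check is to compare them elementwise. This helper is self-contained and hypothetical (not part of the SDK); small drift is expected across implementations:

```elixir
defmodule ParityCheck do
  @doc """
  Returns the largest absolute difference between two aligned lists of
  per-token logprobs. 0.0 for empty input.
  """
  def max_logprob_delta(elixir_logprobs, python_logprobs) do
    elixir_logprobs
    |> Enum.zip(python_logprobs)
    |> Enum.map(fn {a, b} -> abs(a - b) end)
    |> Enum.max(fn -> 0.0 end)
  end
end
```

A large delta on an early token usually points at a tokenizer or prompt-formatting mismatch rather than sampling noise.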

## Documentation build issues

`mix docs` relies on dev-only deps. Run it in a dev environment (not production releases) and ensure `ex_doc` is installed. If assets are missing, rebuild the escript or fetch deps again with `mix deps.get`.
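The standard invocation, forcing the dev environment so `ex_doc` and other dev-only deps are fetched:

```shell
MIX_ENV=dev mix deps.get
MIX_ENV=dev mix docs
```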