<p align="center">
  <img src="assets/llama_cpp_sdk.svg" alt="llama_cpp_sdk logo" width="200" height="200" />
</p>

<p align="center">
  <a href="https://hex.pm/packages/llama_cpp_sdk">
    <img src="https://img.shields.io/hexpm/v/llama_cpp_sdk.svg" alt="Hex version" />
  </a>
  <a href="https://hexdocs.pm/llama_cpp_sdk">
    <img src="https://img.shields.io/badge/hexdocs-llama__cpp__sdk-blue.svg" alt="HexDocs" />
  </a>
  <a href="LICENSE">
    <img src="https://img.shields.io/badge/license-MIT-green.svg" alt="MIT License" />
  </a>
</p>

# LlamaCppSdk

`llama_cpp_sdk` is the first concrete backend package for the self-hosted
inference stack:

```text
external_runtime_transport
  -> self_hosted_inference_core
  -> llama_cpp_sdk
  -> req_llm through published EndpointDescriptor values
```

It owns the `llama-server` specifics that do not belong in the shared kernel:

- boot-spec normalization
- `llama-server` flag rendering
- readiness and health probes
- stop semantics for a spawned service
- backend manifest publication
- OpenAI-compatible endpoint descriptor production

It does not parse OpenAI payloads, token streams, or inference responses.
Those stay northbound in `req_llm` and the calling control plane.

The phase-1 proof fixture also serves `/v1/chat/completions` with both standard
JSON and SSE streaming responses, so northbound clients can exercise the
published endpoint contract end to end.

## Current Release Boundary

The first backend release is intentionally narrow and truthful:

- supported startup kind: `:spawned`
- supported execution surface: `:local_subprocess`
- non-local execution surfaces: rejected during boot-spec normalization
- published protocol: `:openai_chat_completions`
- northbound integration: `self_hosted_inference_core`
- `:ssh_exec` story: documented as a future additive path once remote model-path
  semantics, readiness reachability, and shutdown guarantees are verified
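The boot-spec gate described above can be sketched as a pure validation step. This is an illustrative sketch only, not the package's actual normalization code; the module and function names are hypothetical, but the accepted values match the release boundary listed here.

```elixir
# Illustrative sketch — NOT the package's internal code. It mirrors the
# release boundary above: only :spawned startup on a :local_subprocess
# execution surface passes; everything else is rejected during normalization.
defmodule BootGateSketch do
  @supported_startup_kinds [:spawned]
  @supported_surfaces [:local_subprocess]

  def validate(%{startup_kind: kind, execution_surface: surface}) do
    cond do
      kind not in @supported_startup_kinds ->
        {:error, {:unsupported_startup_kind, kind}}

      surface not in @supported_surfaces ->
        {:error, {:unsupported_execution_surface, surface}}

      true ->
        :ok
    end
  end
end

BootGateSketch.validate(%{startup_kind: :spawned, execution_surface: :local_subprocess})
#=> :ok

BootGateSketch.validate(%{startup_kind: :spawned, execution_surface: :ssh_exec})
#=> {:error, {:unsupported_execution_surface, :ssh_exec}}
```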

## Installation

Add the package to the dependency list in your `mix.exs`:

```elixir
def deps do
  [
    {:llama_cpp_sdk, "~> 0.1.0"}
  ]
end
```

`llama_cpp_sdk` depends on `self_hosted_inference_core`, which in turn depends
on `external_runtime_transport`.

## Quick Start

Resolve a spawned endpoint through the shared kernel:

```elixir
alias LlamaCppSdk
alias SelfHostedInferenceCore.ConsumerManifest

consumer =
  ConsumerManifest.new!(
    consumer: :jido_integration_req_llm,
    accepted_runtime_kinds: [:service],
    accepted_management_modes: [:jido_managed],
    accepted_protocols: [:openai_chat_completions],
    required_capabilities: %{streaming?: true},
    optional_capabilities: %{tool_calling?: :unknown},
    constraints: %{startup_kind: :spawned},
    metadata: %{}
  )

{:ok, resolution} =
  LlamaCppSdk.resolve_endpoint(
    %{
      model: "/models/qwen3-14b-instruct.gguf",
      alias: "qwen3-14b-instruct",
      host: "127.0.0.1",
      port: 8080,
      ctx_size: 8_192,
      gpu_layers: :all,
      threads: 8,
      parallel: 2,
      flash_attn: :auto
    },
    consumer,
    owner_ref: "run-123",
    ttl_ms: 30_000
  )

resolution.endpoint.base_url
resolution.lease.lease_ref
```

The backend normalizes the boot spec, registers itself with
`self_hosted_inference_core`, and publishes an endpoint descriptor once the
service is actually ready.

That published descriptor is the northbound contract used by
`jido_integration`. Callers should:

- send chat-completion requests to `endpoint.base_url <> "/chat/completions"`
- include `endpoint.headers` for bearer auth or other published headers
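As a sketch, a northbound chat-completion call against the published descriptor might look like the following, assuming the `resolution` from the Quick Start above and the `req` HTTP client as a dependency; the payload shape follows the OpenAI chat-completions protocol.

```elixir
# Hedged sketch of a northbound request. Assumes `resolution` was obtained
# as in the Quick Start and that `:req` is available as a dependency.
endpoint = resolution.endpoint

response =
  Req.post!(endpoint.base_url <> "/chat/completions",
    headers: endpoint.headers,
    json: %{
      model: "qwen3-14b-instruct",
      messages: [%{role: "user", content: "Say hello."}]
    }
  )

# The OpenAI-compatible response carries the completion under "choices".
response.body["choices"]
```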

## Supported Boot Fields

The first release supports normalized fields for the installed
`llama-server` CLI surface:

- `binary_path`
- `launcher_args`
- `model`
- `alias`
- `host`
- `port`
- `ctx_size`
- `gpu_layers`
- `threads`
- `threads_batch`
- `parallel`
- `flash_attn`
- `embeddings`
- `api_key`
- `api_key_file`
- `api_prefix`
- `timeout_seconds`
- `threads_http`
- `pooling`
- `environment`
- `extra_args`

See [`guides/boot_spec.md`](guides/boot_spec.md) for the full contract.
When `api_key_file` is provided, `llama_cpp_sdk` reads it to derive the
published authorization header for northbound clients.
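For orientation, a boot spec using a subset of the supported fields might look like this. All values are illustrative examples, not defaults; see the boot-spec guide for the authoritative contract.

```elixir
# Illustrative boot spec — every value here is an example, not a default.
boot_spec = %{
  binary_path: "/usr/local/bin/llama-server",
  model: "/models/qwen3-14b-instruct.gguf",
  alias: "qwen3-14b-instruct",
  host: "127.0.0.1",
  port: 8080,
  ctx_size: 8_192,
  gpu_layers: :all,
  threads: 8,
  parallel: 2,
  flash_attn: :auto,
  api_key_file: "/run/secrets/llama_api_key",
  environment: %{"CUDA_VISIBLE_DEVICES" => "0"},
  extra_args: ["--no-warmup"]
}
```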

## Readiness And Health

Readiness is owned here, above the transport seam:

1. launch the spawned process via `external_runtime_transport`
2. probe TCP reachability on the requested host and port
3. probe HTTP availability on `/health` or `/v1/models`
4. publish the endpoint only after readiness succeeds

Health continues to poll after publication so the shared kernel can expose
`healthy`, `degraded`, or `unavailable` runtime truth.
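The probe sequence above can be sketched with only Erlang/OTP primitives. This is an illustrative sketch, not the package's internal implementation: it checks TCP reachability first, then an HTTP health route, using `:gen_tcp` and `:httpc`.

```elixir
# Illustrative readiness sketch — NOT the package's internal code.
# Step 1: probe TCP reachability on the requested host and port.
# Step 2: probe HTTP availability on a health route.
defmodule ReadinessSketch do
  def ready?(host, port, timeout_ms \\ 1_000) do
    with :ok <- tcp_reachable(host, port, timeout_ms) do
      http_healthy("http://#{host}:#{port}/health")
    end
  end

  def tcp_reachable(host, port, timeout_ms) do
    case :gen_tcp.connect(String.to_charlist(host), port, [:binary], timeout_ms) do
      {:ok, socket} ->
        :gen_tcp.close(socket)
        :ok

      {:error, reason} ->
        {:error, {:tcp_unreachable, reason}}
    end
  end

  def http_healthy(url) do
    # :httpc requires the :inets application to be started.
    :inets.start()

    case :httpc.request(:get, {String.to_charlist(url), []}, [], []) do
      {:ok, {{_version, 200, _phrase}, _headers, _body}} -> :ok
      {:ok, {{_version, status, _phrase}, _headers, _body}} -> {:error, {:unhealthy_status, status}}
      {:error, reason} -> {:error, {:http_unreachable, reason}}
    end
  end
end
```

Publishing the endpoint only after both probes succeed keeps the descriptor honest: a consumer never receives a `base_url` that was not actually reachable at publication time.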

## Examples And Guides

- [`guides/architecture.md`](guides/architecture.md)
- [`guides/readiness_and_health.md`](guides/readiness_and_health.md)
- [`guides/integration_with_self_hosted_inference_core.md`](guides/integration_with_self_hosted_inference_core.md)
- [`examples/README.md`](examples/README.md)

## Development

Run the normal quality checks from the repo root when your environment allows
Mix to create its local coordination socket:

```bash
mix format --check-formatted
mix compile --warnings-as-errors
mix test
MIX_ENV=test mix credo --strict
MIX_ENV=dev mix dialyzer
mix docs
```

## License

This repository is released under the MIT License. See `LICENSE` for the
canonical license text and `CHANGELOG.md` for release history.