docs/guides/checkpoint_management.md

# Checkpoint Management

This guide covers checkpoint and training run management in Tinkex, including listing, inspecting, downloading, and publishing checkpoints.

## Overview

Checkpoints are snapshots of model weights saved during training. Tinkex provides APIs to:

- List and inspect checkpoints and training runs
- Get detailed checkpoint information (base model, LoRA configuration)
- Download checkpoint archives
- Publish/unpublish checkpoints for sharing
- Delete old checkpoints
- Save and load training checkpoints (with optional optimizer state)

All checkpoints are referenced using the **Tinker path format**: `tinker://run-id/weights/checkpoint-id`

## Prerequisites

```elixir
{:ok, _} = Application.ensure_all_started(:tinkex)

config =
  Tinkex.Config.new(
    api_key: System.fetch_env!("TINKER_API_KEY"),
    base_url: System.get_env("TINKER_BASE_URL", "https://tinker.thinkingmachines.dev/services/tinker-prod")
  )

{:ok, service} = Tinkex.ServiceClient.start_link(config: config)
{:ok, rest_client} = Tinkex.ServiceClient.create_rest_client(service)
```

## Saving and Loading Training Checkpoints

Save a named checkpoint during training:

```elixir
{:ok, task} = Tinkex.TrainingClient.save_state(training_client, "checkpoint-001")
{:ok, %Tinkex.Types.SaveWeightsResponse{path: checkpoint_path}} = Task.await(task)
```
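If you save checkpoints at regular intervals, zero-padded names keep them in step order when sorted as strings. The helper below is a minimal sketch: `CheckpointNameSketch` and its `name/2` function are hypothetical and not part of Tinkex, and the `checkpoint-` prefix simply mirrors the example above.

```elixir
defmodule CheckpointNameSketch do
  @moduledoc "Hypothetical naming helper; not part of Tinkex."

  # Zero-pads the step so names sort lexicographically in step order.
  def name(step, width \\ 6) when is_integer(step) and step >= 0 do
    "checkpoint-" <> String.pad_leading(Integer.to_string(step), width, "0")
  end
end
```

The result can be passed as the name argument to `Tinkex.TrainingClient.save_state/2`, for example `CheckpointNameSketch.name(step)`.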

Load weights (without optimizer state) for transfer learning or evaluation:

```elixir
{:ok, task} =
  Tinkex.TrainingClient.load_state(training_client, "tinker://run-id/weights/checkpoint-001")

{:ok, _} = Task.await(task)
```

Resume training with optimizer state preserved:

```elixir
{:ok, task} =
  Tinkex.TrainingClient.load_state_with_optimizer(
    training_client,
    "tinker://run-id/weights/checkpoint-001"
  )

{:ok, _} = Task.await(task)
```

Create a new training client directly from a checkpoint:

```elixir
{:ok, training_client} =
  Tinkex.ServiceClient.create_training_client_from_state(
    service,
    "tinker://run-id/weights/checkpoint-001"
  )
```

To restore optimizer state as well, use:

```elixir
{:ok, training_client} =
  Tinkex.ServiceClient.create_training_client_from_state_with_optimizer(
    service,
    "tinker://run-id/weights/checkpoint-001"
  )
```

## Tinker Path Format

Checkpoints use a structured URI format:

```
tinker://run-id/weights/checkpoint-id
```

**Examples:**
- `tinker://run-abc123/weights/0001`
- `tinker://session-xyz/weights/checkpoint-final`

This format uniquely identifies a checkpoint and is used throughout the API.

### Parsing Tinker Paths

Use `Tinkex.Types.ParsedCheckpointTinkerPath.from_tinker_path/1` to validate a `tinker://` checkpoint path and extract its components. On success it returns `{:ok, %ParsedCheckpointTinkerPath{}}` with the fields `tinker_path`, `training_run_id`, `checkpoint_type` (`"training"` or `"sampler"`), and `checkpoint_id`. Invalid input returns `{:error, %Tinkex.Error{category: :user}}`. The REST and CLI helpers share this parser, so bad paths produce consistent validation errors.
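When you only need the run ID and checkpoint ID as plain strings, for example to move between path-based and ID-based calls, the path shape can be built and taken apart with binary pattern matching. This is an illustrative sketch, not how Tinkex parses paths: `TinkerPathSketch` is a hypothetical module, and it assumes only the `weights` segment shown in this guide.

```elixir
defmodule TinkerPathSketch do
  @moduledoc "Hypothetical helper; not part of Tinkex."

  @prefix "tinker://"

  # Builds a path in the documented shape from a run ID and a checkpoint ID.
  def build(run_id, checkpoint_id) when is_binary(run_id) and is_binary(checkpoint_id) do
    @prefix <> run_id <> "/weights/" <> checkpoint_id
  end

  # Splits a path into {run_id, checkpoint_id}; any other shape is :error.
  def split(@prefix <> rest) do
    case String.split(rest, "/") do
      [run_id, "weights", checkpoint_id] when run_id != "" and checkpoint_id != "" ->
        {:ok, {run_id, checkpoint_id}}

      _other ->
        :error
    end
  end

  def split(_other), do: :error
end
```

For validation with proper error values, prefer `from_tinker_path/1`; the sketch only shows the string layout.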

## Listing Checkpoints

### List All User Checkpoints

Get all checkpoints for the current user with pagination:

```elixir
{:ok, response} = Tinkex.RestClient.list_user_checkpoints(rest_client, limit: 100, offset: 0)

Enum.each(response.checkpoints, fn checkpoint ->
  IO.puts("Path: #{checkpoint.tinker_path}")
  IO.puts("Type: #{checkpoint.checkpoint_type}")
  IO.puts("Size: #{checkpoint.size_bytes} bytes")
  IO.puts("Public: #{checkpoint.public}")
  IO.puts("Created: #{checkpoint.time}")
  IO.puts("")
end)
```

**Options:**
- `:limit` - Maximum number of checkpoints to return (default: 100)
- `:offset` - Offset for pagination (default: 0)

### List Checkpoints for a Training Run

Get all checkpoints associated with a specific training run:

```elixir
{:ok, response} = Tinkex.RestClient.list_checkpoints(rest_client, "run-abc123")

Enum.each(response.checkpoints, fn checkpoint ->
  IO.puts("Checkpoint: #{checkpoint.tinker_path}")
  IO.puts("ID: #{checkpoint.checkpoint_id}")

  if checkpoint.size_bytes do
    size_mb = checkpoint.size_bytes / (1024 * 1024)
    IO.puts("Size: #{Float.round(size_mb, 2)} MB")
  end
end)
```
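The example above guards against a missing `size_bytes` before converting to megabytes. A small formatter can centralize that check and pick a readable unit. This is a sketch under assumptions: `SizeFormatSketch` is a hypothetical module, and it uses 1024-based units to match the arithmetic above.

```elixir
defmodule SizeFormatSketch do
  @moduledoc "Hypothetical size formatter; not part of Tinkex."

  @units ["B", "KB", "MB", "GB", "TB"]

  # A nil size is reported as unknown instead of raising.
  def format(nil), do: "unknown"
  def format(bytes) when is_integer(bytes) and bytes >= 0, do: scale(bytes * 1.0, @units)

  # Divide by 1024 until the value fits the current unit or units run out.
  defp scale(value, [unit]), do: render(value, unit)
  defp scale(value, [unit | _rest]) when value < 1024, do: render(value, unit)
  defp scale(value, [_unit | rest]), do: scale(value / 1024, rest)

  defp render(value, "B"), do: "#{trunc(value)} B"
  defp render(value, unit), do: "#{:erlang.float_to_binary(value, decimals: 2)} #{unit}"
end
```

With it, the loop body reduces to `IO.puts("Size: #{SizeFormatSketch.format(checkpoint.size_bytes)}")`.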

## Training Runs

### List Training Runs

Get all training runs with pagination:

```elixir
{:ok, response} = Tinkex.RestClient.list_training_runs(rest_client, limit: 20, offset: 0)

Enum.each(response.training_runs, fn run ->
  IO.puts("Run ID: #{run.training_run_id}")
  IO.puts("Base Model: #{run.base_model}")
  IO.puts("Is LoRA: #{run.is_lora}")
  IO.puts("LoRA Rank: #{run.lora_rank || "N/A"}")
  IO.puts("Corrupted: #{run.corrupted || false}")
  IO.puts("Last Checkpoint: #{run.last_checkpoint && run.last_checkpoint.tinker_path}")
  IO.puts("Owner: #{run.model_owner}")
  IO.puts("")
end)
```

**Options:**
- `:limit` - Maximum number of runs to return (default: 20)
- `:offset` - Offset for pagination (default: 0)

### Get Training Run Details

Retrieve detailed information about a specific training run:

```elixir
{:ok, run} = Tinkex.RestClient.get_training_run(rest_client, "run-abc123")

IO.puts("Base Model: #{run.base_model}")
IO.puts("Is LoRA: #{run.is_lora}")
IO.puts("LoRA Rank: #{run.lora_rank}")
IO.puts("Last Checkpoint: #{run.last_checkpoint && run.last_checkpoint.tinker_path}")
IO.puts("Last Sampler Checkpoint: #{run.last_sampler_checkpoint && run.last_sampler_checkpoint.tinker_path}")
IO.puts("Last Request Time: #{run.last_request_time}")
```

You can also resolve the run directly from a checkpoint tinker path:

```elixir
{:ok, run} =
  Tinkex.RestClient.get_training_run_by_tinker_path(
    rest_client,
    "tinker://run-abc123/weights/0001"
  )
```

## Checkpoint Information

### Get Checkpoint Metadata

Get detailed information about a checkpoint, including base model and LoRA configuration:

```elixir
{:ok, weights_info} =
  Tinkex.RestClient.get_weights_info_by_tinker_path(
    rest_client,
    "tinker://run-abc123/weights/0001"
  )

IO.puts("Base Model: #{weights_info.base_model}")
IO.puts("Is LoRA: #{weights_info.is_lora}")
IO.puts("LoRA Rank: #{weights_info.lora_rank}")
```

### Validate Checkpoint Compatibility

Check if a checkpoint matches expected configuration:

```elixir
def validate_checkpoint(rest_client, path, expected_rank) do
  case Tinkex.RestClient.get_weights_info_by_tinker_path(rest_client, path) do
    {:ok, %{is_lora: true, lora_rank: ^expected_rank}} ->
      :ok

    {:ok, %{is_lora: true, lora_rank: actual}} ->
      {:error, {:rank_mismatch, expected: expected_rank, actual: actual}}

    {:ok, %{is_lora: false}} ->
      {:error, :not_lora}

    {:error, _} = error ->
      error
  end
end
```

## Downloading Checkpoints

Tinkex streams checkpoint archives directly to disk with `Finch.stream_while/5`, so memory usage stays constant (O(1)) regardless of archive size. This makes it safe to download large checkpoints, from hundreds of megabytes to several gigabytes, without exhausting memory.

### Basic Download

Download and extract a checkpoint archive:

```elixir
{:ok, result} = Tinkex.CheckpointDownload.download(
  rest_client,
  "tinker://run-abc123/weights/0001",
  output_dir: "./models",
  force: false
)

IO.puts("Downloaded to: #{result.destination}")
```

**Key Features:**
- **Streaming downloads** - O(1) memory usage regardless of file size
- **Progress callbacks** - Real-time download progress tracking
- **Automatic extraction** - Downloads and extracts tar archives in one operation
- **Force overwrite** - Optional overwrite of existing checkpoint directories

**Options:**
- `:output_dir` - Parent directory for extraction (default: current directory)
- `:force` - Overwrite existing directory if it exists (default: false)
- `:progress` - Progress callback function (see below)

### Download with Progress Tracking

Monitor download progress with a callback:

```elixir
progress_fn = fn downloaded, total ->
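  # `nil > 0` is true in Elixir's term ordering, so check the type before dividing.
  percent = if is_integer(total) and total > 0, do: Float.round(downloaded / total * 100, 1), else: 0.0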
  IO.write("\rProgress: #{percent}% (#{downloaded} / #{total} bytes)")
end

{:ok, result} = Tinkex.CheckpointDownload.download(
  rest_client,
  "tinker://run-abc123/weights/0001",
  output_dir: "./models",
  force: true,
  progress: progress_fn
)

IO.puts("\n\nDownload complete!")
IO.puts("Extracted to: #{result.destination}")
```
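The formatting logic in a progress callback can be moved into a pure function, which makes it easy to test separately from the download. The sketch below renders a fixed-width text bar. `ProgressSketch` is a hypothetical module, and treating a missing or zero total as "unknown" is an assumption about what the callback may receive, not documented Tinkex behavior.

```elixir
defmodule ProgressSketch do
  @moduledoc "Hypothetical progress formatter; not part of Tinkex."

  # Renders a text bar, or a plain byte count when the total is unknown.
  def line(downloaded, total, width \\ 20)

  def line(downloaded, total, width) when is_integer(total) and total > 0 do
    # Clamp to 1.0 in case more bytes arrive than the reported total.
    ratio = min(downloaded / total, 1.0)
    filled = round(ratio * width)
    bar = String.duplicate("#", filled) <> String.duplicate("-", width - filled)
    "[#{bar}] #{Float.round(ratio * 100, 1)}%"
  end

  def line(downloaded, _total, _width), do: "#{downloaded} bytes"
end
```

A callback built on it could look like `fn downloaded, total -> IO.write("\r" <> ProgressSketch.line(downloaded, total)) end`.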

### Get Archive URL

Get a signed URL for downloading the checkpoint archive directly:

```elixir
{:ok, url_response} =
  Tinkex.RestClient.get_checkpoint_archive_url_by_tinker_path(
    rest_client,
    "tinker://run-abc123/weights/0001"
  )

IO.puts("Download URL: #{url_response.url}")
IO.puts("Expires at: #{inspect(url_response.expires)}")
```

This URL can be used with external download tools or for programmatic access.

If you already have IDs from the training run list, you can call the ID-based helpers instead:

```elixir
{:ok, url_response} =
  Tinkex.RestClient.get_checkpoint_archive_url(rest_client, "run-abc123", "0001")

{:ok, _} = Tinkex.RestClient.delete_checkpoint(rest_client, "run-abc123", "0001")
```

## Using Downloaded Weights

After downloading, checkpoint files are extracted to a local directory:

```elixir
{:ok, result} = Tinkex.CheckpointDownload.download(
  rest_client,
  "tinker://run-abc123/weights/0001",
  output_dir: "./models"
)

# List extracted files
files = File.ls!(result.destination)
IO.puts("Extracted files: #{inspect(files)}")

# Examine file sizes
Enum.each(files, fn file ->
  path = Path.join(result.destination, file)
  stat = File.stat!(path)
  size_mb = stat.size / (1024 * 1024)
  IO.puts("  #{file}: #{Float.round(size_mb, 2)} MB")
end)
```

The checkpoint directory typically contains:
- Model weight files (`.safetensors`, `.bin`, or similar)
- Configuration files (`config.json`)
- Tokenizer files (if applicable)
- LoRA adapter files (for LoRA checkpoints)

## Publishing Checkpoints

### Make a Checkpoint Public

Publish a checkpoint to make it accessible to others:

```elixir
{:ok, _} = Tinkex.RestClient.publish_checkpoint(
  rest_client,
  "tinker://run-abc123/weights/0001"
)

IO.puts("Checkpoint published successfully")
```

### Make a Checkpoint Private

Unpublish a checkpoint to restrict access:

```elixir
{:ok, _} = Tinkex.RestClient.unpublish_checkpoint(
  rest_client,
  "tinker://run-abc123/weights/0001"
)

IO.puts("Checkpoint unpublished successfully")
```

## Deleting Checkpoints

Remove a checkpoint permanently:

```elixir
{:ok, _} = Tinkex.RestClient.delete_checkpoint(
  rest_client,
  "tinker://run-abc123/weights/0001"
)

IO.puts("Checkpoint deleted")
```

**Warning:** Deletion is permanent and cannot be undone. Ensure you have backups if needed.

## Sessions and Checkpoints

### Get Session Information

Sessions group related training runs and samplers:

```elixir
{:ok, session} = Tinkex.RestClient.get_session(rest_client, "session-xyz")

IO.puts("Training Runs: #{inspect(session.training_run_ids)}")
IO.puts("Samplers: #{inspect(session.sampler_ids)}")
```

### List Sessions

Get all sessions with pagination:

```elixir
{:ok, response} = Tinkex.RestClient.list_sessions(rest_client, limit: 20, offset: 0)

Enum.each(response.sessions, fn session ->
  IO.puts("Session ID: #{session.session_id}")
end)
```

## Complete Example: Checkpoint Workflow

Here's a complete workflow for managing checkpoints:

```elixir
# 1. List available training runs
{:ok, runs_response} = Tinkex.RestClient.list_training_runs(rest_client, limit: 10)

case runs_response.training_runs do
  [] ->
    IO.puts("No training runs found")

  [run | _] ->
    IO.puts("Inspecting run: #{run.training_run_id}")
    IO.puts("Base Model: #{run.base_model}")
    IO.puts("Is LoRA: #{run.is_lora}, Rank: #{run.lora_rank}")

    # 2. List checkpoints for this run
    {:ok, ckpt_response} = Tinkex.RestClient.list_checkpoints(rest_client, run.training_run_id)

    case ckpt_response.checkpoints do
      [] ->
        IO.puts("No checkpoints found for this run")

      [checkpoint | _] ->
        IO.puts("\nCheckpoint: #{checkpoint.tinker_path}")

        # 3. Get checkpoint metadata
        {:ok, weights_info} =
          Tinkex.RestClient.get_weights_info_by_tinker_path(
            rest_client,
            checkpoint.tinker_path
          )

        IO.puts("Checkpoint Base Model: #{weights_info.base_model}")
        IO.puts("Checkpoint LoRA Rank: #{weights_info.lora_rank}")

        # 4. Download the checkpoint
        {:ok, download} = Tinkex.CheckpointDownload.download(
          rest_client,
          checkpoint.tinker_path,
          output_dir: "./downloaded_models",
          force: true,
          progress: fn downloaded, total ->
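            # `nil > 0` is true in Elixir's term ordering, so check the type before dividing.
            percent = if is_integer(total) and total > 0, do: Float.round(downloaded / total * 100, 1), else: 0.0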
            IO.write("\rDownloading: #{percent}%")
          end
        )

        IO.puts("\n\nDownloaded to: #{download.destination}")

        # 5. List extracted files
        files = File.ls!(download.destination)
        IO.puts("\nExtracted #{length(files)} file(s):")

        Enum.each(files, fn file ->
          path = Path.join(download.destination, file)
          stat = File.stat!(path)
          size_mb = stat.size / (1024 * 1024)
          IO.puts("  • #{file} (#{Float.round(size_mb, 2)} MB)")
        end)
    end
end
```

## Error Handling

### Common Errors

**Checkpoint Already Downloaded:**
```elixir
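case Tinkex.CheckpointDownload.download(rest_client, path, output_dir: "./models") do
  {:ok, result} ->
    IO.puts("Downloaded to: #{result.destination}")

  {:error, {:exists, existing_path}} ->
    IO.puts("Directory already exists: #{existing_path}")
    IO.puts("Use force: true to overwrite")

  # Catch-all so unrelated failures do not raise a CaseClauseError.
  {:error, other} ->
    IO.puts("Download failed: #{inspect(other)}")
end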
```

**Invalid Tinker Path:**
```elixir
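case Tinkex.CheckpointDownload.download(rest_client, "invalid-path") do
  {:ok, result} ->
    IO.puts("Downloaded to: #{result.destination}")

  {:error, {:invalid_path, message}} ->
    IO.puts("Invalid path: #{message}")
    IO.puts("Path must start with 'tinker://'")

  # Catch-all so unrelated failures do not raise a CaseClauseError.
  {:error, other} ->
    IO.puts("Download failed: #{inspect(other)}")
end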
```

**Checkpoint Not Found:**
```elixir
case Tinkex.RestClient.get_checkpoint_archive_url_by_tinker_path(rest_client, path) do
  {:ok, url_response} ->
    IO.puts("Archive URL: #{url_response.url}")

  {:error, %Tinkex.Error{status: 404}} ->
    IO.puts("Checkpoint not found or no longer exists")

  {:error, %Tinkex.Error{status: 403}} ->
    IO.puts("Access denied to this checkpoint")

  # Catch-all for other statuses and transport errors.
  {:error, other} ->
    IO.puts("Request failed: #{inspect(other)}")
end
```
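The error shapes shown in this section can be collected into a single function that turns them into messages, so each call site needs only one `{:error, reason}` clause. This is a minimal sketch: `DownloadErrorSketch` is a hypothetical module, it covers only the shapes listed above, and it matches on any map or struct with a `status` field instead of naming `Tinkex.Error`, so it runs without the library.

```elixir
defmodule DownloadErrorSketch do
  @moduledoc "Hypothetical error formatter; not part of Tinkex."

  # Tuple errors returned by the download helper in the examples above.
  def describe({:exists, path}),
    do: "Directory already exists: #{path} (pass force: true to overwrite)"

  def describe({:invalid_path, message}), do: "Invalid tinker path: #{message}"

  # HTTP-style errors; a struct with a `status` field also matches these map patterns.
  def describe(%{status: 404}), do: "Checkpoint not found or no longer exists"
  def describe(%{status: 403}), do: "Access denied to this checkpoint"

  # Anything else is reported verbatim.
  def describe(other), do: "Unexpected error: #{inspect(other)}"
end
```

A call site then becomes `{:error, reason} -> IO.puts(DownloadErrorSketch.describe(reason))`.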

## Best Practices

### 1. Check Availability Before Downloading

```elixir
# Verify checkpoint exists before downloading
case Tinkex.RestClient.get_checkpoint_archive_url_by_tinker_path(rest_client, checkpoint_path) do
  {:ok, _url_response} ->
    # Proceed with download
    Tinkex.CheckpointDownload.download(rest_client, checkpoint_path, output_dir: "./models")

  {:error, error} ->
    IO.puts("Checkpoint not available: #{inspect(error)}")
end
```

### 2. Use Pagination for Large Collections

```elixir
def fetch_all_checkpoints(rest_client, limit \\ 100) do
  fetch_page(rest_client, limit, 0, [])
end

defp fetch_page(rest_client, limit, offset, acc) do
  case Tinkex.RestClient.list_user_checkpoints(rest_client, limit: limit, offset: offset) do
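    # An empty page means there is nothing left to fetch.
    {:ok, %{checkpoints: []}} ->
      {:ok, Enum.reverse(acc)}

    {:ok, %{checkpoints: checkpoints}} ->
      # Prepend the page reversed so the final Enum.reverse/1 restores API order.
      new_acc = Enum.reverse(checkpoints, acc)
      fetch_page(rest_client, limit, offset + limit, new_acc)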

    {:error, error} ->
      {:error, error}
  end
end
```
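The recursive helper above loads every page before returning. When you may stop early (for example, after finding one matching checkpoint), a lazy stream fetches pages only as they are consumed. The sketch below is one way to do that with `Stream.resource/3`. `PageStreamSketch` is a hypothetical module, and the page-fetching function is injected so the example runs without Tinkex.

```elixir
defmodule PageStreamSketch do
  @moduledoc "Hypothetical lazy pager; not part of Tinkex."

  # fetch_page is a 2-arity function: (limit, offset) -> {:ok, items} | {:error, reason}.
  def stream(fetch_page, limit \\ 100) when is_function(fetch_page, 2) do
    Stream.resource(
      # Start at offset 0.
      fn -> 0 end,
      fn offset ->
        case fetch_page.(limit, offset) do
          # An empty page ends the stream.
          {:ok, []} -> {:halt, offset}
          {:ok, items} -> {items, offset + limit}
          {:error, reason} -> raise "page fetch failed: #{inspect(reason)}"
        end
      end,
      # Nothing to clean up.
      fn _offset -> :ok end
    )
  end
end
```

To use it with the REST client, pass a function that calls `Tinkex.RestClient.list_user_checkpoints/2` with the given `limit` and `offset` and returns `{:ok, response.checkpoints}`. Note that this sketch raises on a failed page instead of returning an error tuple.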

### 3. Clean Up Old Checkpoints

```elixir
def cleanup_old_checkpoints(rest_client, keep_count \\ 5) do
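  # Only the first page (up to 100 checkpoints) is considered; paginate for larger accounts.
  {:ok, response} = Tinkex.RestClient.list_user_checkpoints(rest_client, limit: 100)

  # Sort newest first. Plain term ordering is only correct if `time` is an ISO 8601 string
  # in a single format and time zone; for DateTime structs use {:desc, DateTime} as the sorter.
  sorted = Enum.sort_by(response.checkpoints, & &1.time, :desc)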

  # Keep the newest ones
  {_keep, delete} = Enum.split(sorted, keep_count)

  # Delete old checkpoints
  Enum.each(delete, fn checkpoint ->
    case Tinkex.RestClient.delete_checkpoint(rest_client, checkpoint.tinker_path) do
      {:ok, _} ->
        IO.puts("Deleted: #{checkpoint.tinker_path}")

      {:error, error} ->
        IO.puts("Failed to delete #{checkpoint.tinker_path}: #{inspect(error)}")
    end
  end)
end
```
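Because deletion cannot be undone, it can help to compute the keep/delete split as a pure step and review it before calling `delete_checkpoint`. The sketch below does only the planning. `CleanupPlanSketch` is a hypothetical module, and it assumes each checkpoint has a `time` field that is either a `DateTime` or an ISO 8601 string with an offset.

```elixir
defmodule CleanupPlanSketch do
  @moduledoc "Hypothetical dry-run planner; not part of Tinkex."

  # Returns {keep, delete}: the `keep_count` newest checkpoints and the rest.
  def plan(checkpoints, keep_count) when is_integer(keep_count) and keep_count >= 0 do
    checkpoints
    |> Enum.sort_by(&to_datetime(&1.time), {:desc, DateTime})
    |> Enum.split(keep_count)
  end

  defp to_datetime(%DateTime{} = datetime), do: datetime

  # Raises a MatchError on strings that are not valid ISO 8601 with an offset.
  defp to_datetime(iso8601) when is_binary(iso8601) do
    {:ok, datetime, _offset} = DateTime.from_iso8601(iso8601)
    datetime
  end
end
```

Printing the `tinker_path` of every entry in the `delete` list gives a dry run; the actual deletion loop can then iterate over that list unchanged.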

### 4. Verify Download Integrity

```elixir
def verify_download(result) do
  # Sanity check only: confirms the extraction directory exists and is not empty.
  case File.ls(result.destination) do
    {:ok, []} -> {:error, :empty_directory}
    {:ok, _files} -> {:ok, :verified}
    {:error, _reason} -> {:error, :directory_not_found}
  end
end
```
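The check above only confirms that extraction produced files. If you copy checkpoints between machines, recording a digest per file lets you detect corruption later. The sketch below computes SHA-256 digests with Erlang's `:crypto` module, reading in fixed-size chunks so large weight files are not loaded into memory. `ChecksumSketch` is a hypothetical module, and this guide does not state that Tinkex publishes reference checksums, so the manifest is only useful for comparing your own copies.

```elixir
defmodule ChecksumSketch do
  @moduledoc "Hypothetical checksum helper; not part of Tinkex."

  @chunk_bytes 65_536

  # Maps each regular file directly under `dir` to its lowercase hex SHA-256 digest.
  def manifest(dir) do
    for name <- Enum.sort(File.ls!(dir)),
        path = Path.join(dir, name),
        File.regular?(path),
        into: %{},
        do: {name, sha256_file(path)}
  end

  # Hashes one file incrementally, reading @chunk_bytes at a time.
  def sha256_file(path) do
    File.open!(path, [:read, :binary], fn device ->
      device
      |> hash_chunks(:crypto.hash_init(:sha256))
      |> :crypto.hash_final()
      |> Base.encode16(case: :lower)
    end)
  end

  defp hash_chunks(device, state) do
    case IO.binread(device, @chunk_bytes) do
      :eof -> state
      {:error, reason} -> raise "read failed: #{inspect(reason)}"
      chunk -> hash_chunks(device, :crypto.hash_update(state, chunk))
    end
  end
end
```

Call `ChecksumSketch.manifest(result.destination)` after a download and store the map; comparing it with a manifest computed later shows which files changed. In a Mix project, `:crypto` must be listed under `extra_applications`.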

## What to Read Next

- API overview: `docs/guides/api_reference.md`
- Training loop guide: `docs/guides/training_loop.md`
- Troubleshooting: `docs/guides/troubleshooting.md`
- Getting started: `docs/guides/getting_started.md`