lib/cmdc_eval.ex

defmodule CMDCEval do
  @moduledoc """
  CMDC Agent evaluation framework (benchmark harness): runs public benchmarks and custom suites.

  ## Core abstractions

  | Concept | Module | Responsibility |
  |---|---|---|
  | Suite | `CMDCEval.Suite` behaviour | A collection of cases (e.g. BFCL v3, tau2-bench, internal) |
  | Case | `CMDCEval.Case` struct | A single evaluation case (id / input / expected) |
  | Run | `CMDCEval.Run` struct | The result of a single evaluation (pass / latency / tokens / cost) |
  | Report | `CMDCEval.Report` | Writes the JSONL report (schema shared with LangSmith / Langfuse) |
  | Runner | `CMDCEval.Runner` | Runs cases concurrently, collects each Run, and emits the Report |

  ## Built-in suites

  - `CMDCEval.Suites.Internal` - validates cmdc's internal scenarios (built-in
    features such as DAG / Steering / HumanApproval / Checkpoint resume,
    complementing the external benchmarks)
  - `CMDCEval.Suites.BFCL` - Berkeley Function Calling Leaderboard v3
    fixtures (fetched from the public upstream repository; see `mix cmdc.eval.fetch_bfcl`)

  ## Quick Start

      # 1. Run the internal suite and write a JSONL report
      $ mix cmdc.eval --suite=internal --model="anthropic:claude-sonnet-4-5" --report=out.jsonl

      # 2. Run BFCL (fetch the fixtures first)
      $ mix cmdc.eval.fetch_bfcl
      $ mix cmdc.eval --suite=bfcl --model="openai:gpt-4o" --report=bfcl.jsonl

      # 3. Call it programmatically
      {:ok, report} = CMDCEval.run(
        suite: CMDCEval.Suites.Internal,
        model: "anthropic:claude-sonnet-4-5",
        report_path: "out.jsonl"
      )

  ## Report JSONL fields (stable schema)

      {
        "suite": "internal",
        "case_id": "steering_basic",
        "model": "anthropic:claude-sonnet-4-5",
        "pass": true,
        "latency_ms": 1234,
        "tokens_in": 567,
        "tokens_out": 89,
        "cost_usd": 0.0034,
        "events_digest": "sha256:abc123...",
        "error": null,
        "timestamp": "2026-05-18T12:34:56Z"
      }

  LangSmith, Langfuse, and Datadog can all consume the same records, which makes cross-benchmark comparison easier.
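
  Because each report line is a standalone JSON object, a report can be
  aggregated with a few lines of Elixir. The sketch below is illustrative
  only: `summarize` is a hypothetical helper (not part of `CMDCEval`), it
  assumes Elixir 1.18+ for the built-in `JSON` module, and it treats a null
  `cost_usd` as zero purely as a defensive assumption.

  ```elixir
  # Hypothetical helper: folds JSONL report lines into simple totals.
  # Assumes Elixir 1.18+ (built-in JSON module); with Jason, swap in
  # Jason.decode!/1.
  summarize = fn lines ->
    lines
    |> Stream.map(&String.trim/1)
    # Skip blank lines, e.g. a trailing newline at the end of the file.
    |> Stream.reject(&(&1 == ""))
    |> Stream.map(&JSON.decode!/1)
    |> Enum.reduce(%{total: 0, pass: 0, cost_usd: 0.0}, fn run, acc ->
      %{
        acc
        | total: acc.total + 1,
          pass: acc.pass + if(run["pass"], do: 1, else: 0),
          # Defensive assumption: treat a missing or null cost as zero.
          cost_usd: acc.cost_usd + (run["cost_usd"] || 0)
      }
    end)
  end
  ```

  For a report written by the Quick Start commands, this could be called as
  `summarize.(File.stream!("out.jsonl"))`.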

  ## v0.1 scope

  - ✅ Suite behaviour + 4 structs (Case / Run / Report / Suite)
  - ✅ `Mix.Tasks.Cmdc.Eval` CLI
  - ✅ Internal suite (5+ scenarios: DAG / Steering / HumanApproval / Checkpoint / Compactor)
  - ✅ BFCL fetch + placeholder suite (a 10-case skeleton that upstream fixtures can fill in)
  - ✅ JSONL report schema (aligned with the 12G Telemetry fields)
  - 🔁 Deferred to v0.2: tau2-bench airline / MemoryAgentBench subset / direct LangSmith sync
  """

  alias CMDCEval.{Report, Runner}

  @typedoc "Keyword options accepted when running evals."
  @type run_opts :: [
          suite: module(),
          model: String.t(),
          report_path: String.t() | nil,
          concurrency: pos_integer(),
          timeout_ms: pos_integer(),
          provider_opts: keyword()
        ]

  @doc """
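  Runs a Suite and returns `{:ok, %Report{}}` or `{:error, reason}`.

  ## Options

  - `:suite` - the Suite module (implements the `CMDCEval.Suite` behaviour); required
  - `:model` - the model string (e.g. `"anthropic:claude-sonnet-4-5"`); required
  - `:report_path` - output path for the JSONL report; when nil, only `%Report{}` is returned and no file is written
  - `:concurrency` - number of cases run concurrently (default 4; can be set higher with the Mock provider)
  - `:timeout_ms` - per-case timeout (default 60_000)
  - `:provider_opts` - options passed through to `CMDC.Provider.stream/4` (e.g. `api_key`)

  ## Examples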

      {:ok, report} = CMDCEval.run(
        suite: CMDCEval.Suites.Internal,
        model: "anthropic:claude-sonnet-4-5",
        report_path: "internal.jsonl",
        concurrency: 4
      )

      report.summary
      # => %{total: 5, pass: 5, fail: 0, total_latency_ms: 12345, ...}
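
  A pass rate can be derived from the summary map. This is a minimal sketch
  that relies only on the `:total` and `:pass` keys shown above; `pass_rate`
  is a hypothetical helper, not part of `CMDCEval`.

  ```elixir
  # Hypothetical helper: fraction of passing cases, guarding against an
  # empty suite so we never divide by zero.
  pass_rate = fn
    %{total: 0} -> 0.0
    %{total: total, pass: pass} -> pass / total
  end

  pass_rate.(%{total: 5, pass: 5, fail: 0})
  # => 1.0
  ```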
  """
  @spec run(run_opts()) :: {:ok, Report.t()} | {:error, term()}
  def run(opts) do
    Runner.run(opts)
  end

  @doc """
  Returns the current cmdc_eval version.
  """
  @spec version() :: String.t()
  def version do
    Application.spec(:cmdc_eval, :vsn) |> to_string()
  end
end