# cmdc_rag_arcana

> Arcana-backed enterprise RAG tools and plugins for CMDC.

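`cmdc_rag_arcana` is a standalone RAG extension package for CMDC. Rather than pulling the Arcana
dependency into `cmdc` core, it connects to enterprise knowledge bases through the standard `CMDC.Tool` / `CMDC.Plugin` boundaries.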

## Capabilities

| Module | Purpose |
|---|---|
| `CMDCRAGArcana.Tool.Search` | `rag_search`: read-only retrieval; returns chunks / citations / scores |
| `CMDCRAGArcana.Tool.Answer` | `rag_answer`: generates cited answers via Arcana |
| `CMDCRAGArcana.Tool.PipelineAnswer` | `rag_pipeline_answer`: runs an Arcana Pipeline according to an enterprise preset |
| `CMDCRAGArcana.Tool.IngestStatus` | `rag_ingest_status`: read-only index status query |
| `CMDCRAGArcana.Tool.GraphStatus` | `rag_graph_status`: read-only query of GraphRAG status and preflight |
| `CMDCRAGArcana.Tool.GraphSearch` | `rag_graph_search`: read-only GraphRAG graph/fusion search |
| `CMDCRAGArcana.Plugin.AccessControl` | Collection ACL; fails closed in `:before_tool` |
| `CMDCRAGArcana.Plugin.CitationAudit` | Citation access events, emitted in `:after_tool` |
| `CMDCRAGArcana.Pipeline.Preset` | Pipeline governance configuration for gate/rewrite/decompose/rerank/ground and related steps |
| `CMDCRAGArcana.Pipeline.TelemetryBridge` | Arcana Pipeline telemetry → CMDC RAG trace event |
| `CMDCRAGArcana.Graph.Profile` | GraphRAG collection opt-in profile |
| `CMDCRAGArcana.Graph.Preflight` | GraphRAG capability and governance checks |
| `CMDCRAGArcana.Graph.Maintenance` | Background wrapper for graph rebuild / entity embedding / community |
| `CMDCRAGArcana.Graph.Evidence` | Evidence contract for entity / relationship / path / community |
| `CMDCRAGArcana.Ingestion` | Import adapter contract callable from Oban workers |
| `CMDCRAGArcana.Ingestion.ParsedDocument` | Contract for the parsed output of OCR / parsers |
| `CMDCRAGArcana.CitationSpan` | Citation location at page / table / bbox / char offset granularity |
| `CMDCRAGArcana.ProgressEvent` | Unified progress event payload for ingestion / reembed / graph |
| `CMDCRAGArcana.Eval.ArcanaAdapter` | Arcana Evaluation → cmdc_eval adapter |
| `CMDCRAGArcana.Eval.TelemetryBridge` | Arcana Evaluation telemetry → CMDC EventBus |
| `CMDCRAGArcana.Eval.Gate` | RAG Eval release gate recipes and threshold checks |
| `CMDCRAGArcana.Eval.GraphRAG` | GraphRAG-specific Eval / Gate metrics |
| `CMDCRAGArcana.Maintenance` | Arcana maintenance wrapper with unified progress telemetry/event |
| `CMDCRAGArcana.Backend` | Behaviour for Arcana calls, to ease testing and replacement |

`rag_search` / `rag_answer` fail closed by default: even if the Agent passes
`collections`, access must be granted through `allowed_collections`, `collection_policies`, or an explicit
`default_allow?: true`.
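The fail-closed rule can be sketched in plain Elixir. This is a minimal illustration, not the source of `CMDCRAGArcana.Plugin.AccessControl`: the module `FailClosedSketch`, its `authorize/2` function, and the error tuples are hypothetical, and `collection_policies` is left out because its shape is not described here.

```elixir
defmodule FailClosedSketch do
  # Hypothetical sketch: returns :ok only when every requested collection is
  # explicitly allowed, or when `default_allow?: true` is set. Everything else
  # is denied, including a request that names no collection at all.
  def authorize(requested, opts \\ []) when is_list(requested) do
    allowed = Keyword.get(opts, :allowed_collections, [])
    default_allow? = Keyword.get(opts, :default_allow?, false)
    denied = Enum.reject(requested, &(&1 in allowed))

    cond do
      default_allow? -> :ok
      requested == [] -> {:error, :no_collections}
      denied == [] -> :ok
      true -> {:error, {:collections_denied, denied}}
    end
  end
end

# Allowed: "policies" is in the allowlist.
FailClosedSketch.authorize(["policies"], allowed_collections: ["policies", "sop"])

# Denied: no allowlist and no default_allow?, so the call fails closed.
FailClosedSketch.authorize(["policies"])
```

The point of the sketch is the default: with no configuration, the answer is a denial rather than an implicit allow.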

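## Explicit Non-Goals

- Agents cannot directly ingest / delete enterprise knowledge base documents.
- Arcana Loop is not exposed by default, so that nesting an Agent inside an Agent does not weaken CMDC trace / cost / approval controls.
- Agents cannot dynamically pass Arcana Pipeline custom modules/functions/prompts; production uses only enterprise preconfigured presets.
- graph rebuild / embed_entities / community summarize are not exposed as Agent Tools.
- No Python RAG runtime / sidecar is introduced; the main RAG search / answer / pipeline / GraphRAG path stays Elixir/Arcana-first.
- No Arcana / pgvector / Nx / Bumblebee dependencies are introduced into `cmdc` core.
- The current version does not include a full Knowledge UI / data flywheel / distillation training.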

## Knowledge Control Plane

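Starting with v0.2, this package provides the seams for an enterprise knowledge base control plane, but it does not own
enterprise Ecto schemas or depend on Oban. Production platforms should maintain the following in their Phoenix app: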

- KnowledgeCollection / KnowledgeDocument / DocumentVersion
- IngestionRun / IndexStatus / SourceMapping
- Tenants, ACLs, approvals, retention periods, sensitivity levels, and active version switching

For the detailed schema draft, Oban worker skeleton, Arcana dashboard boundary, and maintenance
usage, see the [Knowledge Control Plane guide](guides/knowledge_control_plane.md).

## Parser / OCR Artifact

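Arcana's built-in parser is suited to text extraction from txt/md/pdf. Complex OCR, layout analysis, and table extraction
should be handled by an enterprise parsing service, an offline import process, or future Elixir parsing capabilities, each emitting a `ParsedDocument`. Python does not enter the RAG runtime; it is reserved for later distillation, training, and offline model experiments.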

```elixir
%CMDCRAGArcana.Ingestion.ParsedDocument{
  text: "Policy body text...",
  content_type: "application/pdf",
  checksum: "sha256:...",
  source_uri: "kb://policies/approval.pdf",
  pages: [
    %CMDCRAGArcana.Ingestion.ParsedPage{
      page_number: 3,
      text: "High-risk operations require approval",
      section: "Approval Policy",
      bbox: %{x: 10, y: 20, width: 300, height: 80}
    }
  ],
  tables: [
    %CMDCRAGArcana.Ingestion.ParsedTable{
      id: "tbl-approval",
      page_number: 3,
      markdown: "| Risk | Approval |\n| L3 | Manager approval |"
    }
  ]
}
```

`Ingestion.run/2` normalizes this artifact into Arcana ingest text plus document
metadata. A subsequent `Citation` can then emit a `span`:

```json
{
  "source_uri": "kb://policies/approval.pdf",
  "span": {
    "page_number": 3,
    "section": "Approval Policy",
    "table_id": "tbl-approval",
    "bbox": {"x": 10, "y": 20, "width": 300, "height": 80}
  }
}
```
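How a span might be derived from the parsed artifact can be sketched as follows. This is an assumption-laden illustration, not the library's code: `SpanSketch.build/2` is hypothetical, and plain maps stand in for `ParsedPage` / `ParsedTable` so the snippet runs on its own.

```elixir
defmodule SpanSketch do
  # Hypothetical sketch: builds a citation span from a parsed page and an
  # optional table. The table id is attached only when the table sits on the
  # same page as the cited text.
  def build(%{page_number: page_number} = page, table \\ nil) do
    base = Map.take(page, [:page_number, :section, :bbox])

    case table do
      %{id: id, page_number: ^page_number} -> Map.put(base, :table_id, id)
      _ -> base
    end
  end
end

page = %{
  page_number: 3,
  text: "High-risk operations require approval",
  section: "Approval Policy",
  bbox: %{x: 10, y: 20, width: 300, height: 80}
}

table = %{id: "tbl-approval", page_number: 3}

# Yields a map with page_number, section, bbox and table_id.
SpanSketch.build(page, table)
```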

## Installation

```elixir
defp deps do
  [
    {:cmdc, "~> 0.6"},
    {:cmdc_eval, "~> 0.2"},
    {:cmdc_rag_arcana, "~> 0.5"}
  ]
end
```

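Arcana itself requires an Ecto Repo, PostgreSQL + pgvector, and embedder configuration. Production projects should follow the
official Arcana installation process to complete migrations and supervision tree configuration.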

## Agent Integration

```elixir
{:ok, session} =
  CMDC.create_agent(
    model: "anthropic:claude-sonnet-4-5",
    tools: [
      CMDCRAGArcana.Tool.Search,
      CMDCRAGArcana.Tool.Answer,
      CMDCRAGArcana.Tool.PipelineAnswer,
      CMDCRAGArcana.Tool.IngestStatus,
      CMDCRAGArcana.Tool.GraphStatus,
      CMDCRAGArcana.Tool.GraphSearch
    ],
    plugins: [
      {CMDCRAGArcana.Plugin.AccessControl,
       allowed_collections: ["policies", "sop"]},
      CMDCRAGArcana.Plugin.CitationAudit
    ],
    user_data: %{
      tenant_id: "tenant-a",
      user_id: "alice",
      roles: ["ops"],
      cmdc_rag_arcana: %{
        repo: MyApp.Repo,
        llm: "openai:gpt-4o-mini",
        status_backend: MyApp.Knowledge.RAGStatusBackend,
        allowed_collections: ["policies", "sop"],
        graph_profiles: [
          %{id: "contract_graph", mode: :relationship_graph}
        ],
        graph_policies: [
          %{profile_id: "contract_graph", collections: ["contracts"]}
        ],
        pipeline_presets: [
          %{
            id: "policy_strict",
            steps: [
              :gate,
              :rewrite,
              :search,
              %{name: :rerank, opts: %{threshold: 7}},
              %{name: :answer, opts: %{max_corrections: 1}},
              :self_correct,
              %{name: :ground, opts: %{min_score: 0.8}}
            ],
            fail_mode: :needs_review,
            min_grounding_score: 0.8
          }
        ]
      }
    }
  )
```

When calling `rag_search`, the Agent should pass the collections:

```json
{
  "query": "How many approval levels does a high-risk operation require?",
  "collections": ["policies"],
  "top_k": 5,
  "mode": "hybrid"
}
```

The return value is a JSON string containing `results`, `citations`, and `metadata`. `CitationAudit`
additionally emits:

- `:rag_retrieved`
- `:rag_answered`
- `:rag_citation_used`

When calling `rag_ingest_status`, the Agent performs a read-only status query:

```json
{
  "collection": "policies",
  "document_id": "doc-1",
  "version_id": "ver-2026-05"
}
```

The return value includes fields such as `status.status`, `status.graph_status`, `status.stale?`, and `status.chunk_count`. This tool never triggers ingest/delete/rebuild.

When calling `rag_pipeline_answer`, the Agent can only select a preconfigured preset:

```json
{
  "question": "How many approval levels does a high-risk operation require?",
  "preset_id": "policy_strict",
  "collections": ["policies"],
  "risk_level": "l2",
  "use_case": "policy_qa"
}
```
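Selecting a preset strictly by id can be sketched like this. It is a minimal illustration under assumptions: `PresetLookupSketch.resolve/2` and its error tuple are hypothetical, and the preset maps only mirror the `pipeline_presets` shape used in the configuration example.

```elixir
defmodule PresetLookupSketch do
  # Hypothetical sketch: resolves a preset by id from the preconfigured list.
  # Unknown ids are rejected, and nothing from the tool call is merged into
  # the preset's steps.
  def resolve(presets, preset_id) when is_list(presets) and is_binary(preset_id) do
    case Enum.find(presets, &(&1.id == preset_id)) do
      nil -> {:error, {:unknown_preset, preset_id}}
      preset -> {:ok, preset}
    end
  end
end

presets = [
  %{id: "policy_strict", steps: [:gate, :rewrite, :search, :answer], fail_mode: :needs_review}
]

# Known id: returns {:ok, preset}.
PresetLookupSketch.resolve(presets, "policy_strict")

# Unknown id: rejected instead of falling back to some default pipeline.
PresetLookupSketch.resolve(presets, "agent_supplied")
```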

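The return value includes, under `metadata.pipeline_run_summary`, the step plan, collections,
grounding score, citation count, fail mode, and degradation/review status. When there are no citations, the
grounding score is low, or an unauthorized source is hit, the preset can be configured to `:block`, `:search_only`, `:answer_with_warning`, or `:needs_review`.
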
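The degradation decision can be sketched as a small pure function. This is a sketch under stated assumptions, not the package's implementation: `FailModeSketch.decide/2`, the `{:degraded, mode}` return shape, and the field names on the `run` map (`citation_count`, `grounding_score`, `unauthorized_source_count`) are hypothetical.

```elixir
defmodule FailModeSketch do
  # Hypothetical sketch: a run counts as failed when it has no citations, its
  # grounding score is below the preset minimum, or it touched an unauthorized
  # source. A failed run is handed to whatever fail_mode the preset configured.
  def decide(run, preset) do
    min_score = Map.get(preset, :min_grounding_score, 0.0)

    failed? =
      run.citation_count == 0 or
        run.grounding_score < min_score or
        run.unauthorized_source_count > 0

    if failed?, do: {:degraded, preset.fail_mode}, else: :ok
  end
end

preset = %{id: "policy_strict", fail_mode: :needs_review, min_grounding_score: 0.8}

# Grounding score 0.6 is below 0.8, so the run is routed to :needs_review.
FailModeSketch.decide(
  %{citation_count: 2, grounding_score: 0.6, unauthorized_source_count: 0},
  preset
)
```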
GraphRAG must be explicitly enabled through profile/policy:

```elixir
cmdc_rag_arcana: %{
  repo: MyApp.Repo,
  allowed_collections: ["contracts"],
  graph_profiles: [
    %{id: "contract_graph", mode: :relationship_graph}
  ],
  graph_policies: [
    %{profile_id: "contract_graph", collections: ["contracts"]}
  ]
}
```

Agents are only allowed read-only GraphRAG queries:

```json
{
  "query": "What is the relationship between Supplier A and Device B?",
  "collections": ["contracts"],
  "graph_profile_id": "contract_graph"
}
```
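The opt-in check implied by `graph_profiles` / `graph_policies` can be sketched as below. This is an illustration only: `GraphPolicySketch.authorize/3` and its error atoms are hypothetical, and the real policy evaluation may consider more fields than the two shown in the configuration.

```elixir
defmodule GraphPolicySketch do
  # Hypothetical sketch: a graph query is allowed only if the profile exists
  # and some policy binds that profile to every requested collection.
  def authorize(config, profile_id, collections) do
    profiles = Map.get(config, :graph_profiles, [])
    policies = Map.get(config, :graph_policies, [])

    profile? = Enum.any?(profiles, &(&1.id == profile_id))

    covered =
      policies
      |> Enum.filter(&(&1.profile_id == profile_id))
      |> Enum.flat_map(& &1.collections)

    cond do
      not profile? -> {:error, :unknown_graph_profile}
      collections == [] -> {:error, :no_collections}
      Enum.all?(collections, &(&1 in covered)) -> :ok
      true -> {:error, :graph_not_enabled_for_collection}
    end
  end
end

config = %{
  graph_profiles: [%{id: "contract_graph", mode: :relationship_graph}],
  graph_policies: [%{profile_id: "contract_graph", collections: ["contracts"]}]
}

# Allowed: the profile exists and a policy covers "contracts".
GraphPolicySketch.authorize(config, "contract_graph", ["contracts"])

# Denied: no policy enables the graph profile for "policies".
GraphPolicySketch.authorize(config, "contract_graph", ["policies"])
```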

`rag_graph_search` returns `metadata.entity_support`, `metadata.relationship_support`, `metadata.path_support`, and `metadata.community_support`. Graph rebuild, entity embedding, community detection,
and summary can only be invoked by background jobs / release hooks:

```elixir
CMDCRAGArcana.Graph.Maintenance.release_hook(MyApp.Repo,
  collection: "contracts",
  session_id: "release-graph-1"
)
```

Background import and re-embedding progress use a unified payload:

```elixir
%CMDCRAGArcana.ProgressEvent{
  kind: :ingestion,
  event: :progress,
  status: :running,
  tenant_id: "tenant-a",
  collection: "policies",
  document_id: "doc-1",
  version_id: "ver-2026-05",
  current: 10,
  total: 100,
  percent: 10.0
}
```
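If `percent` is derived from `current` and `total` (an assumption; the payload may equally be filled in by the caller), the arithmetic could look like this. `ProgressSketch.percent/2` is a hypothetical helper, not part of the package.

```elixir
defmodule ProgressSketch do
  # Hypothetical sketch: percent complete, rounded to one decimal place.
  # A zero total is reported as 0.0 instead of dividing by zero.
  def percent(_current, 0), do: 0.0
  def percent(current, total) when total > 0, do: Float.round(current * 100 / total, 1)
end

ProgressSketch.percent(10, 100)
# 10.0
```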

## Swapping the Backend in Tests

```elixir
defmodule MyMockRAG do
  @behaviour CMDCRAGArcana.Backend

  def search(_query, _opts), do: {:ok, [%{id: "c1", text: "policy", score: 0.9}]}
  def answer(_question, _opts), do: {:ok, "answer", [%{id: "c1", text: "policy"}]}
end
```

Then configure it in `user_data` or in a direct call:

```elixir
cmdc_rag_arcana: %{backend: MyMockRAG, allowed_collections: ["policies"]}
```

To temporarily relax the collection ACL in development, configure it explicitly:

```elixir
cmdc_rag_arcana: %{backend: MyMockRAG, default_allow?: true}
```

Production environments should use `allowed_collections` or `collection_policies` rather than relying on
`default_allow?: true`.

## RAG Eval and Release Gates

v0.3 reuses Arcana's built-in Evaluation and feeds the results into `cmdc_eval`:

```elixir
alias CMDCRAGArcana.Eval.{ArcanaAdapter, Gate, GraphRAG, TelemetryBridge}

handler_id = TelemetryBridge.attach(session_id: "release-run-1")

{:ok, result} =
  ArcanaAdapter.run(
    repo: MyApp.Repo,
    mode: :hybrid,
    evaluate_answers: true,
    llm: &MyApp.LLM.complete/4,
    target: :ask
  )

Gate.check(result.cmdc_metadata,
  recall_at_5: 0.85,
  faithfulness: 0.8,
  correctness: 0.8,
  unauthorized_source_count: 0
)

GraphRAG.check(%{
  entity_recall: 0.85,
  relationship_recall: 0.8,
  path_support_rate: 0.75,
  community_relevance: 0.7,
  citation_grounding: 0.9,
  unauthorized_entity_count: 0
})

TelemetryBridge.detach(handler_id)
```

`Gate.recipe/2` gives the recommended order before publishing an AgentSpec:
RAG Eval → Tool Calling Eval → Safety Eval. Report fields cover recall, citation,
faithfulness, correctness, unauthorized source, cost, and latency. `GraphRAG`
additionally covers entity recall, relationship recall, path support, community relevance,
graph-enhanced delta, citation grounding, and unauthorized entity/source leakage.
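A threshold gate of this kind can be sketched in a few lines. This is not the implementation of `Gate.check/2`: `GateSketch` is hypothetical, and it assumes that keys ending in `_count` are upper bounds, that all other metrics are lower bounds, and that a missing metric fails the gate.

```elixir
defmodule GateSketch do
  # Hypothetical sketch: compares each metric against its threshold and
  # collects the failures as {key, actual, limit} tuples.
  def check(metrics, thresholds) do
    failures =
      Enum.flat_map(thresholds, fn {key, limit} ->
        value = Map.get(metrics, key)
        if passes?(key, value, limit), do: [], else: [{key, value, limit}]
      end)

    if failures == [], do: :ok, else: {:error, failures}
  end

  # A missing metric never passes, so the gate fails closed.
  defp passes?(_key, nil, _limit), do: false

  defp passes?(key, value, limit) do
    if String.ends_with?(Atom.to_string(key), "_count"),
      do: value <= limit,
      else: value >= limit
  end
end

metrics = %{recall_at_5: 0.9, faithfulness: 0.75, unauthorized_source_count: 0}

# faithfulness 0.75 is below 0.8, so the gate reports that single failure.
GateSketch.check(metrics, recall_at_5: 0.85, faithfulness: 0.8, unauthorized_source_count: 0)
```

Returning the full failure list instead of stopping at the first miss lets a release report show every metric that blocked publication.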

## License

Apache 2.0