我要投稿

用合成数据评测 RAG 系统：一份可直接上手的 DeepEval 实操指南

发布日期：2025-10-16 07:52:30 浏览次数： 1746

作者：Halo咯咯

微信搜一搜，关注“Halo咯咯”

在构建 RAG（Retrieval-Augmented Generation，检索增强生成）系统的过程中，很多人都有这样的困惑：

“模型看起来能回答问题，但到底是不是在胡说八道？” “Retriever 到底找得准不准？” “我该怎么知道系统整体是不是可靠的？”

这些问题的根源在于——我们缺乏系统化的评测方法。尤其在项目早期，还没有真实用户数据时，想要验证 RAG 流程的效果就更加困难。

今天，我们就来深入拆解一个实用方案： 👉 用 DeepEval 生成合成数据，系统性评测你的 RAG Pipeline。

这篇文章会带你一步步上手，包括依赖安装、数据生成、复杂度控制、评测逻辑等全部环节。读完后，你不仅能快速搭建一个自动化评测体系，还能理解为什么「合成数据」是 RAG 测试的关键突破口。

一、为什么要用合成数据评测 RAG？

在真实业务场景中，我们希望 RAG 系统具备三个核心能力：

检索准确（Retriever）：能找到与问题最相关的文档；
生成可靠（LLM）：答案必须“有出处”，不能胡编；
上下文合适（Context）：输入长度、内容密度要恰到好处。

但在系统上线前，我们往往没有足够的真实问题和反馈样本。这就导致很难知道模型的回答是否“扎实落地”。

而 合成数据（Synthetic Data） 正好填补了这个空白。

通过自动生成模拟用户问题 + 理想回答（golden pairs），我们能提前建立一个可重复测试集：

不依赖真实用户；
能针对不同类型问题系统化覆盖；
能反复验证 Retriever 和 Generator 的优化效果。

DeepEval 就是这个过程的核心工具。

二、DeepEval：专为 LLM 评测设计的开源框架

DeepEval 是一个专门用于大模型评测的开源框架，支持包括 RAG 流水线在内的各种场景。它的优势主要体现在三点：

✅ 自动生成合成测试数据：内置 Synthesizer 类，可基于文档生成真实感极强的 QA 对；
✅ 多维度评测指标：从 Grounding（答案是否有出处）、Context Relevance（上下文相关性）到 Faithfulness（事实一致性）；
✅ 可扩展配置：通过 EvolutionConfig 控制生成样本的复杂度与类型。

接下来我们进入实操环节。

三、安装依赖与准备环境

首先，安装所需依赖库。

pip install deepeval chromadb tiktoken pandas

安装完成后，配置你的 OpenAI API Key。 DeepEval 会调用外部模型（如 GPT-4）来生成和评测数据。

前往 OpenAI API 管理页，新建 API Key 并填入你的环境变量中：

export OPENAI_API_KEY="sk-xxxxxxx"

💡 提示：初次使用 OpenAI API 可能需要绑定支付方式并充值约 $5 才能启用。

四、准备源文本：生成“合成问答”的素材

接下来，我们需要准备一份源文本，它将作为合成数据的“语料库”。这份文本应尽量内容多样、语义清晰、事实准确。

例如：

text = """
Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
In contrast, the archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Meanwhile, in the world of physics, superconductors can carry electric current with zero resistance -- a phenomenon
discovered over a century ago but still unlocking new technologies like quantum computers today.
...
"""

将其保存为一个文本文件：

with open("example.txt", "w") as f:
    f.write(text)

💬 技巧：你完全可以换成自己的内容，比如项目知识库、技术文档、内部 FAQ 等，这样生成的评测样本就更贴近业务实际。

五、自动生成合成数据（Synthetic Goldens）

DeepEval 的核心类 Synthesizer 可以直接读取文档并生成高质量的 QA 对。

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer(model="gpt-4.1-nano")

# 从文档中生成合成数据
synthesizer.generate_goldens_from_docs(
    document_paths=["example.txt"],
    include_expected_output=True
)

# 打印部分结果
for golden in synthesizer.synthetic_goldens[:3]:  
    print(golden, "\n")

运行结果示例：

Input: Evaluate the cognitive abilities of corvids in facial recognition tasks.
Expected Output: Crows can recognize human faces and remember them for years, showing advanced memory and problem-solving.
Context: "Crows are among the smartest birds..."

可以看到，每个样本都包含：

用户问题（input）
理想回答（expected output）
语料来源（context）

这些就是我们的 golden pairs —— 可用于后续的模型性能验证。

六、控制样本复杂度：EvolutionConfig 的威力

光生成 QA 对还不够，我们需要控制生成问题的复杂度与多样性，让测试更贴近真实用户提问。

DeepEval 提供了 EvolutionConfig，可以通过「进化策略」来调节生成方式。

from deepeval.synthesizer.config import EvolutionConfig, Evolution

evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/5,
        Evolution.MULTICONTEXT: 1/5,
        Evolution.COMPARATIVE: 1/5,
        Evolution.HYPOTHETICAL: 1/5,
        Evolution.IN_BREADTH: 1/5,
    },
    num_evolutions=3
)

synthesizer = Synthesizer(evolution_config=evolution_config)
synthesizer.generate_goldens_from_docs(["example.txt"])

这样一来，生成的样本不仅仅是简单问答，而会覆盖：

推理类问题（Reasoning）
多上下文问题（MultiContext）
对比类问题（Comparative）
假设场景（Hypothetical）
广域探索问题（InBreadth）

例如：

Q：比较 Voyager 1 的黄金唱片与亚历山大图书馆在人类历史中的意义。A：两者都承载了人类知识与文明的象征，前者跨越宇宙，后者见证文明的起点。

这样的数据能全面测试模型的多层推理与信息整合能力。

七、构建迭代评测循环：RAG 改进闭环

当我们有了高质量的合成数据，就可以进入核心环节——RAG 评测闭环。

典型的流程如下：

Retriever 测试：验证召回文档的相关性；
LLM 评测：检查生成回答是否基于上下文；
指标计算：如 Grounding、Context Relevance、Faithfulness；
结果反馈与优化：调整检索策略或 Prompt；
重新评测：观察指标是否提升。

这就是一个完整的 Iterative RAG Improvement Loop（迭代改进循环）。

它的关键在于：

你不需要等待真实用户来“踩坑”，合成数据已经能让你提前发现系统的薄弱点。

当 Retriever 的召回率提升、LLM 的事实一致性增强后，你的系统上线风险就会显著降低。

实战代码见最后！

八、实战建议与扩展思路

如果你准备在真实项目中落地 DeepEval，可以参考以下建议：

📁 语料选取：优先使用结构化或知识密集型文档，如产品手册、内部FAQ；
⚙️ 模型配置：评测阶段可用轻量模型（如 gpt-4.1-nano），正式验证时切换至完整模型；
📊 结果分析：结合 ChromaDB 等向量库，计算各指标变化；
🔁 自动化集成：将评测脚本嵌入 CI/CD 流程，每次更新 Retriever 或 Prompt 后自动验证。

长期来看，这种方式能让你的 RAG 系统从「主观感受好像行」变为「数据指标确实强」。

九、总结：让 RAG 评测不再是黑箱

RAG 评测的难点在于——系统表现常常“看起来对”，但却难以验证背后的可靠性。 DeepEval 的出现，让这件事变得可量化、可复现、可持续改进。

合成数据的价值不在于替代真实用户，而在于提前建立可控的测试环境。通过 EvolutionConfig 等机制，我们甚至能模拟用户提出各种复杂问题，全面检验系统的推理与检索边界。

一句话总结：

在没有用户数据的阶段，合成数据就是最好的评测基线；在持续优化阶段，DeepEval 就是你的自动化教练。

付实战代码：

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
rag_iterative_eval_full.py
完整示例：迭代评测循环（RAG 改进闭环）
功能：
  - 生成/读取文档
  - 生成合成 goldens（DeepEval / OpenAI / 规则化）
  - 构建检索器（OpenAI embeddings 或 TF-IDF）
  - 使用检索到的上下文调用 LLM 生成答案（OpenAI 或简单拼接回复）
  - 计算 grounding / context_relevance / faithfulness 指标
  - 基于指标自动调整 top_k 与 temperature（形成闭环）
  - 保存与打印每轮结果
作者：jilolo
日期：2025-10
"""

import os
import json
import time
import math
import random
import hashlib
from typing import List, Dict, Any, Tuple
from collections import defaultdict, Counter

# optional imports
try:
    import openai
except Exception:
    openai = None

try:
    import numpy as np
    from numpy.linalg import norm
    NUMPY_AVAILABLE = True
except Exception:
    NUMPY_AVAILABLE = False

try:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    SKLEARN_AVAILABLE = True
except Exception:
    SKLEARN_AVAILABLE = False

try:
    from tqdm import tqdm
    TQDM_AVAILABLE = True
except Exception:
    TQDM_AVAILABLE = False

# -------------------------
# CONFIG
# -------------------------
CONFIG = {
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY", ""),
    "OPENAI_EMBEDDING_MODEL": "text-embedding-3-small",
    "OPENAI_COMPLETION_MODEL": "gpt-4o-mini",  # change to available model
    "DOC_PATH": "example.txt",
    "NUM_GOLDENS": 12,
    "ITERATIONS": 6,
    "INITIAL_TOP_K": 3,
    "MAX_TOP_K": 8,
    "MIN_TOP_K": 1,
    "TEMPERATURE_OPTIONS": [0.0, 0.2, 0.5],
    "SEED": 42,
    "REPORT_FILE": "rag_eval_report.json",
    "SAVE_DIR": "rag_eval_runs",
    "PROMPT_TEMPLATE": (
        "You are a knowledgeable assistant. Use only the provided context snippets to answer the question. "
        "If the information is not present in the context, respond with 'Insufficient information in context.'\n\n"
        "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    ),
    # metric thresholds for increasing/decreasing top_k
    "GROUNDING_GOOD": 0.7,
    "GROUNDING_BAD": 0.45,
    "FAITHFULNESS_GOOD": 0.7,
    "FAITHFULNESS_BAD": 0.45,
    "CONTEXT_RELEVANCE_GOOD": 0.7,
    "CONTEXT_RELEVANCE_BAD": 0.45,
}

random.seed(CONFIG["SEED"])
if openai and CONFIG["OPENAI_API_KEY"]:
    openai.api_key = CONFIG["OPENAI_API_KEY"]

# -------------------------
# Utilities
# -------------------------
def safe_print(*args, **kwargs):
    print(*args, **kwargs)

def ensure_dir(path: str):
    if not os.path.exists(path):
        os.makedirs(path, exist_ok=True)

def sha1_snippet(s: str) -> str:
    return hashlib.sha1(s.encode("utf-8")).hexdigest()[:10]

# -------------------------
# Example document (will write if missing)
# -------------------------
SAMPLE_TEXT = """Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
The archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Superconductors can carry electric current with zero resistance -- a phenomenon discovered over a century ago but still unlocking new technologies like quantum computers today.
The Library of Alexandria was once the largest center of learning, but much of its collection was lost in fires and wars.
Voyager 1 probe, launched in 1977, has left the solar system, carrying a golden record with sounds and images of Earth.
The Amazon rainforest produces roughly 20% of the world's oxygen.
Coral reefs support nearly 25% of all marine life despite covering less than 1% of the ocean floor.
MRI scanners use strong magnetic fields and radio waves to generate detailed images of organs without harmful radiation.
Moore's Law observed that the number of transistors on microchips doubles roughly every two years.
The Mariana Trench is the deepest part of Earth's oceans, reaching nearly 11,000 meters below sea level.
Ancient civilizations like the Sumerians and Egyptians invented mathematical systems thousands of years ago.
"""

def ensure_example_doc(path: str):
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8") as f:
            f.write(SAMPLE_TEXT)
        safe_print(f"[INFO] Wrote sample doc to {path}")

# -------------------------
# Synthetic golden generation (fallback-first approach)
# -------------------------
def simple_rule_based_goldens(doc_path: str, num: int = 12) -> List[Dict[str, str]]:
    """
    Very simple fallback: split document into sentences/paragraphs and craft simple Q/A.
    """
    with open(doc_path, "r", encoding="utf-8") as f:
        txt = f.read()
    paras = [p.strip() for p in txt.split("\n") if p.strip()]
    goldens = []
    for p in paras:
        q = f"What is one key fact from the following sentence: '{p[:120]}...'? "
        a = p
        goldens.append({"input": q, "expected_output": a, "context": p})
        if len(goldens) >= num:
            break
    return goldens

def openai_synthesize_goldens(doc_path: str, num: int = 12, model: str = CONFIG["OPENAI_COMPLETION_MODEL"]) -> List[Dict[str, str]]:
    """
    Try to use OpenAI to synthesize question-answer pairs.
    If OpenAI is not configured or API call fails, fall back to rule-based generation.
    """
    if openai is None or not getattr(openai, "api_key", None):
        safe_print("[WARN] OpenAI key not found - using rule-based goldens")
        return simple_rule_based_goldens(doc_path, num)
    with open(doc_path, "r", encoding="utf-8") as f:
        doc = f.read()

    prompt = (
        f"You are a dataset creator. Given the document below, produce {num} question-answer pairs. "
        f"For each pair, provide 'question', 'answer' (concise and grounded in the doc), and 'context' (the snippet). "
        f"Return a JSON array of objects.\n\nDocument:\n{doc}\n\n"
    )

    try:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You generate QA pairs."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.0,
            max_tokens=1500
        )
        text = resp["choices"][0]["message"]["content"]
        # find JSON in text
        start = text.find("[")
        if start >= 0:
            json_text = text[start:]
            try:
                arr = json.loads(json_text)
                goldens = []
                for item in arr[:num]:
                    q = item.get("question") or item.get("input") or item.get("q") or ""
                    a = item.get("answer") or item.get("expected_output") or ""
                    c = item.get("context") or ""
                    goldens.append({"input": q.strip(), "expected_output": a.strip(), "context": c.strip()})
                safe_print(f"[INFO] OpenAI synthesized {len(goldens)} goldens.")
                return goldens
            except Exception as e:
                safe_print("[WARN] Failed to parse JSON from OpenAI output:", e)
                return simple_rule_based_goldens(doc_path, num)
        else:
            safe_print("[WARN] OpenAI response lacking JSON - using rule-based fallback.")
            return simple_rule_based_goldens(doc_path, num)
    except Exception as e:
        safe_print("[ERROR] OpenAI call failed:", e)
        return simple_rule_based_goldens(doc_path, num)

def generate_goldens(doc_path: str, num: int = 12) -> List[Dict[str, str]]:
    # Attempt DeepEval if installed (not required here); else OpenAI; else rule-based
    # To keep dependencies light in this script we skip DeepEval auto-call.
    return openai_synthesize_goldens(doc_path, num)

# -------------------------
# Retriever: TF-IDF (fallback) and Embedding based (OpenAI)
# -------------------------
class TFIDFRetriever:
    def __init__(self, docs: List[str]):
        if not SKLEARN_AVAILABLE:
            raise RuntimeError("sklearn not available for TF-IDF retriever.")
        self.docs = docs
        self.vectorizer = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
        self.doc_matrix = self.vectorizer.fit_transform(self.docs)

    def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        qv = self.vectorizer.transform([query])
        sims = cosine_similarity(qv, self.doc_matrix)[0]
        idx_scores = list(enumerate(sims))
        idx_scores.sort(key=lambda x: x[1], reverse=True)
        return idx_scores[:top_k]

class OpenAIEmbeddingRetriever:
    def __init__(self, docs: List[str], embedding_model: str = CONFIG["OPENAI_EMBEDDING_MODEL"]):
        self.docs = docs
        self.embedding_model = embedding_model
        self.embeddings = []
        # compute embeddings
        self._build()

    def _embed_text(self, text: str):
        if openai is None or not getattr(openai, "api_key", None):
            # fallback: random vector (deterministic via hash)
            if NUMPY_AVAILABLE:
                h = int(hashlib_sha1_int(text))
                rng = np.random.RandomState(h % (2**32))
                return rng.normal(size=(1536,)).tolist()  # fake dim
            else:
                return [random.random() for _ in range(512)]
        try:
            resp = openai.Embedding.create(model=self.embedding_model, input=text)
            return resp["data"][0]["embedding"]
        except Exception as e:
            safe_print("[WARN] OpenAI embedding failed:", e)
            # fallback deterministic pseudo-random
            if NUMPY_AVAILABLE:
                h = int(hashlib_sha1_int(text))
                rng = np.random.RandomState(h % (2**32))
                return rng.normal(size=(1536,)).tolist()
            else:
                return [random.random() for _ in range(512)]

    def _build(self):
        self.embeddings = [self._embed_text(d) for d in self.docs]

    def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        q_emb = self._embed_text(query)
        # compute cosine similarities
        if NUMPY_AVAILABLE:
            qv = np.array(q_emb, dtype=float)
            sims = []
            for emb in self.embeddings:
                ev = np.array(emb, dtype=float)
                denom = (norm(qv) * norm(ev))
                sim = float(np.dot(qv, ev) / denom) if denom > 0 else 0.0
                sims.append(sim)
            idx_scores = list(enumerate(sims))
            idx_scores.sort(key=lambda x: x[1], reverse=True)
            return idx_scores[:top_k]
        else:
            sims = []
            for emb in self.embeddings:
                sim = sum(a*b for a,b in zip(q_emb, emb)) / (len(q_emb) or 1)
                sims.append(sim)
            idx_scores = list(enumerate(sims))
            idx_scores.sort(key=lambda x: x[1], reverse=True)
            return idx_scores[:top_k]

# helper hashing for fallback embeddings
def hashlib_sha1_int(s: str) -> int:
    return int(hashlib.sha1(s.encode('utf-8')).hexdigest()[:16], 16)

# -------------------------
# Generator (LLM call) with fallback
# -------------------------
def call_openai_chat(question: str, contexts: List[str], temperature: float = 0.0, model: str = CONFIG["OPENAI_COMPLETION_MODEL"]) -> str:
    if openai is None or not getattr(openai, "api_key", None):
        # fallback: naive rule - if any context contains a sentence with overlap words, return that sentence; else "Insufficient"
        combined = " ".join(contexts)
        q_words = set([w.lower() for w in question.split() if len(w) > 3])
        best_sent = None
        best_overlap = 0
        for s in combined.split("."):
            wset = set([w.lower().strip(" ,;:()[]") for w in s.split() if len(w)>3])
            overlap = len(q_words & wset)
            if overlap > best_overlap:
                best_overlap = overlap
                best_sent = s.strip()
        if best_sent and best_overlap >= 1:
            return best_sent + "."
        return"Insufficient information in context."
    # try call
    prompt = CONFIG["PROMPT_TEMPLATE"].format(context="\n\n".join(contexts), question=question)
    try:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a precise assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature,
            max_tokens=512,
        )
        text = resp["choices"][0]["message"]["content"].strip()
        return text
    except Exception as e:
        safe_print("[WARN] OpenAI ChatCompletion failed:", e)
        # fallback naive
        combined = " ".join(contexts)
        q_words = set([w.lower() for w in question.split() if len(w) > 3])
        best_sent = None
        best_overlap = 0
        for s in combined.split("."):
            wset = set([w.lower().strip(" ,;:()[]") for w in s.split() if len(w)>3])
            overlap = len(q_words & wset)
            if overlap > best_overlap:
                best_overlap = overlap
                best_sent = s.strip()
        if best_sent and best_overlap >= 1:
            return best_sent + "."
        return"Insufficient information in context."

# -------------------------
# Metrics implementations
# -------------------------
def compute_context_relevance(retrieved_idxs_scores: List[Tuple[int, float]]) -> float:
    """
    Simple metric: average similarity score (score between 0-1)
    """
    if not retrieved_idxs_scores:
        return 0.0
    scores = [s for _, s in retrieved_idxs_scores]
    # ensure in [0,1]
    clipped = [max(0.0, min(1.0, float(x))) for x in scores]
    return sum(clipped) / len(clipped)

def compute_grounding(answer: str, contexts: List[str]) -> float:
    """
    Heuristic: fraction of answer tokens that have overlap with context tokens.
    Returns 0-1.
    """
    a_words = [w.strip(" ,.;:()[]'\"").lower() for w in answer.split() if len(w) > 2]
    if not a_words:
        return 0.0
    context_text = " ".join(contexts).lower()
    hits = sum(1 for w in a_words if w in context_text)
    return hits / len(a_words)

def compute_faithfulness(answer: str, expected: str) -> float:
    """
    Very simple normalized similarity:
    - overlap ratio of important tokens (set intersection over union)
    """
    a_set = set([w.strip(" ,.;:()[]'\"").lower() for w in answer.split() if len(w)>2])
    e_set = set([w.strip(" ,.;:()[]'\"").lower() for w in expected.split() if len(w)>2])
    if not a_set and not e_set:
        return 1.0
    if not a_set or not e_set:
        return 0.0
    inter = a_set & e_set
    union = a_set | e_set
    return len(inter) / len(union)

# -------------------------
# Single-run RAG evaluation on list of goldens
# -------------------------
def run_rag_eval(
    goldens: List[Dict[str, str]],
    docs: List[str],
    retriever,
    top_k: int,
    temperature: float
) -> Dict[str, Any]:
    """
    Run through goldens, for each:
      - retrieve top_k contexts
      - call generator
      - compute metrics
    Return aggregated metrics and per-sample results
    """
    per_samples = []
    total_grounding = 0.0
    total_context_rel = 0.0
    total_faith = 0.0

    iterator = goldens if not TQDM_AVAILABLE else tqdm(goldens, desc=f"Eval top_k={top_k}, temp={temperature}")

    for g in iterator:
        q = g["input"]
        expected = g.get("expected_output", "")
        # retrieve
        retrieved = retriever.retrieve(q, top_k=top_k)
        contexts = [docs[idx] for idx, _ in retrieved]
        ctx_scores = [score for _, score in retrieved]

        # call generator
        answer = call_openai_chat(q, contexts, temperature=temperature)

        # compute metrics
        context_rel = compute_context_relevance(retrieved)
        grounding = compute_grounding(answer, contexts)
        faith = compute_faithfulness(answer, expected)

        total_context_rel += context_rel
        total_grounding += grounding
        total_faith += faith

        per_samples.append({
            "question": q,
            "expected": expected,
            "answer": answer,
            "retrieved": [{"idx": idx, "score": float(score), "snippet_hash": sha1_snippet(docs[idx])} for idx, score in retrieved],
            "metrics": {"context_relevance": context_rel, "grounding": grounding, "faithfulness": faith}
        })

    n = len(goldens)
    agg = {
        "avg_context_relevance": total_context_rel / n if n else 0.0,
        "avg_grounding": total_grounding / n if n else 0.0,
        "avg_faithfulness": total_faith / n if n else 0.0
    }
    return {"aggregate": agg, "samples": per_samples}

# -------------------------
# Iterative parameter adjustment logic
# -------------------------
def adjust_params(current_top_k: int, metrics: Dict[str, float]) -> int:
    """
    Very simple policy:
      - If grounding low -> increase top_k (more context)
      - If grounding high and context relevance low -> increase top_k
      - If grounding high & context relevance high -> try reduce top_k to optimize
    Bound by min/max.
    """
    g = metrics.get("avg_grounding", 0.0)
    cr = metrics.get("avg_context_relevance", 0.0)
    fa = metrics.get("avg_faithfulness", 0.0)
    new_top_k = current_top_k

    # if grounding is very low, expand context
    if g < CONFIG["GROUNDING_BAD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], current_top_k + 2)
    elif cr < CONFIG["CONTEXT_RELEVANCE_BAD"] and g < CONFIG["GROUNDING_GOOD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], current_top_k + 1)
    elif g > CONFIG["GROUNDING_GOOD"] and cr > CONFIG["CONTEXT_RELEVANCE_GOOD"]:
        # try shrink to save cost
        new_top_k = max(CONFIG["MIN_TOP_K"], current_top_k - 1)
    # small adjustments if faithfulness very low
    if fa < CONFIG["FAITHFULNESS_BAD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], new_top_k + 1)
    # ensure bounds
    new_top_k = max(CONFIG["MIN_TOP_K"], min(CONFIG["MAX_TOP_K"], new_top_k))
    return new_top_k

def pick_temperature(candidate_list: List[float], metrics: Dict[str, float]) -> float:
    """
    Simple heuristic: if faithfulness low, use lower temp (more deterministic).
    If faithfulness high and grounding high, allow slightly higher temp for diversity.
    """
    fa = metrics.get("avg_faithfulness", 0.0)
    g = metrics.get("avg_grounding", 0.0)
    if fa < 0.4 or g < 0.4:
        return min(candidate_list)
    if fa > 0.75 and g > 0.7:
        return max(candidate_list)
    return candidate_list[len(candidate_list)//2]

# -------------------------
# Main pipeline
# -------------------------
def main():
    safe_print("=== RAG Iterative Evaluation Demo ===")
    ensure_example_doc(CONFIG["DOC_PATH"])
    ensure_dir(CONFIG["SAVE_DIR"])

    # load docs and split into chunks (naive paragraph chunking)
    with open(CONFIG["DOC_PATH"], "r", encoding="utf-8") as f:
        doc_text = f.read()
    paragraphs = [p.strip() for p in doc_text.split("\n") if p.strip()]
    # if paragraphs too short, split sentences
    if len(paragraphs) < 5:
        # attempt sentence split
        sents = [s.strip() for s in doc_text.replace("\n", " ").split(".") if s.strip()]
        # group per 1-2 sentences
        paragraphs = []
        i = 0
        while i < len(sents):
            chunk = sents[i]
            if i+1 < len(sents):
                if random.random() < 0.5:
                    chunk = chunk + ". " + sents[i+1]
                    i += 2
                else:
                    i += 1
            else:
                i += 1
            paragraphs.append(chunk + ".")
    docs = paragraphs

    safe_print(f"[INFO] Loaded {len(docs)} document chunks for retrieval.")

    # generate goldens
    goldens = generate_goldens(CONFIG["DOC_PATH"], CONFIG["NUM_GOLDENS"])
    safe_print(f"[INFO] Generated {len(goldens)} goldens for evaluation.")

    # choose retriever: prefer OpenAI embeddings if available, else TF-IDF
    retriever = None
    use_embedding = False
    if openai and getattr(openai, "api_key", None) and NUMPY_AVAILABLE:
        try:
            retriever = OpenAIEmbeddingRetriever(docs)
            use_embedding = True
            safe_print("[INFO] Using OpenAI embedding retriever.")
        except Exception as e:
            safe_print("[WARN] OpenAIEmbeddingRetriever failed, falling back to TF-IDF:", e)
    if retriever is None:
        if SKLEARN_AVAILABLE:
            retriever = TFIDFRetriever(docs)
            safe_print("[INFO] Using TF-IDF retriever.")
        else:
            # fallback: naive substring search retriever
            class NaiveRetriever:
                def __init__(self, docs):
                    self.docs = docs
                def retrieve(self, query, top_k=3):
                    qs = query.lower()
                    scores = []
                    for i, d in enumerate(self.docs):
                        s = sum(1 for w inset(qs.split()) if w in d.lower())
                        scores.append((i, float(s)))
                    scores.sort(key=lambda x: x[1], reverse=True)
                    return scores[:top_k]
            retriever = NaiveRetriever(docs)
            safe_print("[INFO] Using naive substring retriever.")

    # iterative loop
    cur_top_k = CONFIG["INITIAL_TOP_K"]
    cur_temp = CONFIG["TEMPERATURE_OPTIONS"][0]
    history = []
    for itr in range(1, CONFIG["ITERATIONS"] + 1):
        safe_print(f"\n--- Iteration {itr} | top_k={cur_top_k} | temp={cur_temp} ---")
        result = run_rag_eval(goldens, docs, retriever, top_k=cur_top_k, temperature=cur_temp)
        agg = result["aggregate"]
        safe_print(f"[RESULT] avg_context_relevance={agg['avg_context_relevance']:.3f}, avg_grounding={agg['avg_grounding']:.3f}, avg_faithfulness={agg['avg_faithfulness']:.3f}")
        # save per-iteration
        run_record = {
            "iteration": itr,
            "top_k": cur_top_k,
            "temperature": cur_temp,
            "aggregate": agg,
            "timestamp": time.time(),
            "samples_count": len(result["samples"])
        }
        history.append(run_record)
        # adapt params
        new_top_k = adjust_params(cur_top_k, agg)
        new_temp = pick_temperature(CONFIG["TEMPERATURE_OPTIONS"], agg)
        safe_print(f"[ADAPT] next_top_k={new_top_k}, next_temp={new_temp}")
        # if no change and already good metrics, we can stop early
        if new_top_k == cur_top_k and new_temp == cur_temp and agg["avg_grounding"] > 0.8 and agg["avg_faithfulness"] > 0.8:
            safe_print("[INFO] Metrics are good and stable - stopping early.")
            break
        cur_top_k = new_top_k
        cur_temp = new_temp

    # produce final report
    report = {
        "config": CONFIG,
        "docs_count": len(docs),
        "goldens_count": len(goldens),
        "history": history
    }
    report_path = os.path.join(CONFIG["SAVE_DIR"], CONFIG["REPORT_FILE"])
    with open(report_path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2, ensure_ascii=False)
    safe_print(f"\n[FINISH] Saved report to {report_path}")
    safe_print("=== End ===")

if __name__ == "__main__":
    main()

53AI，企业落地大模型首选服务商

产品：场景落地咨询+大模型应用平台+行业解决方案

承诺：免费POC验证，效果达标后再合作。零风险落地应用大模型，已交付160+中大型企业