Evaluating a RAG system with synthetic data: this hands-on DeepEval guide helps you quickly build an automated evaluation setup and answer the nagging question "is the model just making things up?" Key topics: 1. The role and advantages of synthetic data in RAG evaluation. 2. DeepEval's core features and multi-dimensional metrics. 3. A complete walkthrough from environment setup to data generation.
When building a RAG (Retrieval-Augmented Generation) system, many people run into the same doubts:
"The model seems to answer questions, but is it just making things up?" "Is the retriever actually finding the right passages?" "How do I know whether the system as a whole is reliable?"
The root cause is that we lack a systematic evaluation method. This is especially hard early in a project, before any real user data exists, when you want to validate how well the RAG pipeline works.
In this post we take apart a practical solution step by step: 👉 use DeepEval to generate synthetic data and systematically evaluate your RAG pipeline.
The article walks you through every step: installing dependencies, generating data, controlling question complexity, and wiring up the evaluation logic. By the end, you will not only be able to stand up an automated evaluation setup quickly, but also understand why synthetic data is the key breakthrough for RAG testing.
In real business scenarios, we want a RAG system to do three things well: retrieve genuinely relevant context, ground its answers in that context, and stay consistent with the facts.
But before the system goes live, we usually don't have enough real questions and feedback samples, which makes it hard to tell whether the model's answers are actually well grounded.
Synthetic data fills exactly this gap.
By automatically generating simulated user questions together with ideal answers (golden pairs), we can build a repeatable test set ahead of time.
DeepEval is the core tool for this process.
DeepEval is an open-source framework built specifically for evaluating large language models, and it supports RAG pipelines among other scenarios. Its strengths show up in three areas: the Synthesizer class can generate remarkably realistic QA pairs straight from documents; EvolutionConfig controls the complexity and type of the generated samples; and the results plug into multi-dimensional evaluation metrics.
Now let's move on to the hands-on part.
First, install the required dependencies.
pip install deepeval chromadb tiktoken pandas
Once installation finishes, configure your OpenAI API key. DeepEval calls an external model (such as GPT-4) to generate and evaluate data.
Go to the OpenAI API management page, create a new API key, and put it into your environment variables:
export OPENAI_API_KEY="sk-xxxxxxx"
💡 Tip: if you are using the OpenAI API for the first time, you may need to add a payment method and top up about $5 before the API is enabled.
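Before generating anything, it is worth confirming that the key is actually visible to your Python process. A minimal check, assuming the same OPENAI_API_KEY variable exported above:

import os

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the synthesizer.")
print(f"API key loaded (ends with ...{api_key[-4:]})")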
Next, we need a source text to serve as the "corpus" for the synthetic data. It should be as varied in content, clearly worded, and factually accurate as possible.
For example:
text = """
Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
In contrast, the archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Meanwhile, in the world of physics, superconductors can carry electric current with zero resistance -- a phenomenon
discovered over a century ago but still unlocking new technologies like quantum computers today.
...
"""
Save it to a text file:
with open("example.txt", "w") as f:
    f.write(text)
💬 Tip: you can absolutely swap in your own content, such as a project knowledge base, technical documentation, or an internal FAQ; the generated evaluation samples will then be much closer to your actual business.
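For example, a small sketch that concatenates a few of your own files into the corpus file (the file names below are placeholders, not part of this tutorial):

from pathlib import Path

# Hypothetical source files - swap in your own knowledge base or FAQ exports.
sources = ["kb_overview.md", "api_guide.md", "internal_faq.txt"]

corpus = "\n\n".join(Path(p).read_text(encoding="utf-8") for p in sources)
Path("example.txt").write_text(corpus, encoding="utf-8")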
DeepEval's core class Synthesizer can read documents directly and generate high-quality QA pairs.
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer(model="gpt-4.1-nano")
# Generate synthetic data from the documents
synthesizer.generate_goldens_from_docs(
    document_paths=["example.txt"],
    include_expected_output=True
)

# Print a few of the results
for golden in synthesizer.synthetic_goldens[:3]:
    print(golden, "\n")
Sample output:
Input: Evaluate the cognitive abilities of corvids in facial recognition tasks.
Expected Output: Crows can recognize human faces and remember them for years, showing advanced memory and problem-solving.
Context: "Crows are among the smartest birds..."
As you can see, each sample contains an input question, an expected output, and the source context.
These are our golden pairs, which we can use later to validate system performance.
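Since pandas was installed in the first step, here is a hedged sketch for dumping the goldens to CSV so they can be inspected and version-controlled (the attribute names follow the objects printed above and may vary slightly across DeepEval versions):

import pandas as pd

rows = [
    {
        "input": g.input,
        "expected_output": g.expected_output,
        # context is typically a list of snippets; join it into one flat CSV column
        "context": " | ".join(g.context) if isinstance(g.context, list) else g.context,
    }
    for g in synthesizer.synthetic_goldens
]
pd.DataFrame(rows).to_csv("golden_pairs.csv", index=False)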
Generating QA pairs alone is not enough; we also need to control the complexity and diversity of the generated questions so that the tests resemble real user queries.
DeepEval provides EvolutionConfig, which tunes how samples are generated through "evolution" strategies.
from deepeval.synthesizer.config import EvolutionConfig, Evolution
evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/5,
        Evolution.MULTICONTEXT: 1/5,
        Evolution.COMPARATIVE: 1/5,
        Evolution.HYPOTHETICAL: 1/5,
        Evolution.IN_BREADTH: 1/5,
    },
    num_evolutions=3
)

synthesizer = Synthesizer(evolution_config=evolution_config)
synthesizer.generate_goldens_from_docs(["example.txt"])
This way, the generated samples go beyond simple Q&A and cover reasoning questions, multi-context questions, comparative questions, hypothetical questions, and in-breadth questions.
For example:
Q: Compare the significance of Voyager 1's golden record and the Library of Alexandria in human history. A: Both embody humanity's knowledge and civilization; one travels across space, the other marks a starting point of recorded learning.
Data like this puts the model's multi-step reasoning and information-integration abilities to a much fuller test.
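If your users mainly ask analytical questions, you can skew the distribution accordingly. A sketch using the same EvolutionConfig API as above (the weights are only an illustration, not a recommendation):

reasoning_heavy = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 2/5,
        Evolution.COMPARATIVE: 2/5,
        Evolution.MULTICONTEXT: 1/5,
    },
    num_evolutions=2
)
synthesizer = Synthesizer(evolution_config=reasoning_heavy)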
Once we have high-quality synthetic data, we can move into the core part: the RAG evaluation loop.
A typical round looks like this: generate goldens from your documents, run each question through the retriever and the LLM, score the answers on metrics such as context relevance, grounding, and faithfulness, then adjust retrieval and generation parameters (for example top_k and temperature) and run the loop again.
This is a complete Iterative RAG Improvement Loop.
The key point is this:
You don't need to wait for real users to stumble into the failure cases; synthetic data already lets you find the system's weak spots ahead of time.
Once the retriever's recall improves and the LLM's factual consistency strengthens, the risk of putting the system into production drops significantly.
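If you want DeepEval itself to do the scoring instead of hand-rolled heuristics, here is a hedged sketch of the evaluation step using its test-case and metric classes (class names and arguments may differ slightly between DeepEval versions; my_rag_pipeline and my_retriever stand in for your own components):

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# One test case per golden: the question, your pipeline's answer, and the retrieved chunks.
test_cases = [
    LLMTestCase(
        input=g.input,
        actual_output=my_rag_pipeline(g.input),   # placeholder for your RAG pipeline
        expected_output=g.expected_output,
        retrieval_context=my_retriever(g.input),  # placeholder: list of retrieved text chunks
    )
    for g in synthesizer.synthetic_goldens
]

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.7),
]

evaluate(test_cases=test_cases, metrics=metrics)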
The full working code is at the end of this post!
If you plan to adopt DeepEval in a real project, the workflow above is a practical starting point. Over the long run, this approach moves your RAG system from "it subjectively feels like it works" to "the metrics show it works."
The hard part of RAG evaluation is that the system often "looks right" while its underlying reliability is hard to verify. DeepEval makes this quantifiable, reproducible, and continuously improvable.
The value of synthetic data is not to replace real users, but to establish a controlled test environment ahead of time. With mechanisms like EvolutionConfig, we can even simulate the full range of complex questions users might ask and probe the limits of the system's reasoning and retrieval.
In one sentence:
When you have no user data yet, synthetic data is your best evaluation baseline; during continuous optimization, DeepEval is your automated coach.
Appendix: full working code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
rag_iterative_eval_full.py
Complete example: an iterative evaluation loop (a RAG improvement loop).
Features:
- Generate or read documents
- Generate synthetic goldens (DeepEval / OpenAI / rule-based fallback)
- Build a retriever (OpenAI embeddings or TF-IDF)
- Call an LLM with the retrieved context to generate answers (OpenAI, or a simple extractive fallback)
- Compute grounding / context_relevance / faithfulness metrics
- Automatically adjust top_k and temperature based on the metrics (closing the loop)
- Save and print the results of each round
Author: jilolo
Date: 2025-10
"""
import os
import json
import time
import math
import random
import hashlib
from typing import List, Dict, Any, Tuple
from collections import defaultdict, Counter
# optional imports
try:
import openai
except Exception:
openai = None
try:
import numpy as np
from numpy.linalg import norm
NUMPY_AVAILABLE = True
except Exception:
NUMPY_AVAILABLE = False
try:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
SKLEARN_AVAILABLE = True
except Exception:
SKLEARN_AVAILABLE = False
try:
from tqdm import tqdm
TQDM_AVAILABLE = True
except Exception:
TQDM_AVAILABLE = False
# -------------------------
# CONFIG
# -------------------------
CONFIG = {
"OPENAI_API_KEY": os.getenv("OPENAI_API_KEY", ""),
"OPENAI_EMBEDDING_MODEL": "text-embedding-3-small",
"OPENAI_COMPLETION_MODEL": "gpt-4o-mini", # change to available model
"DOC_PATH": "example.txt",
"NUM_GOLDENS": 12,
"ITERATIONS": 6,
"INITIAL_TOP_K": 3,
"MAX_TOP_K": 8,
"MIN_TOP_K": 1,
"TEMPERATURE_OPTIONS": [0.0, 0.2, 0.5],
"SEED": 42,
"REPORT_FILE": "rag_eval_report.json",
"SAVE_DIR": "rag_eval_runs",
"PROMPT_TEMPLATE": (
"You are a knowledgeable assistant. Use only the provided context snippets to answer the question. "
"If the information is not present in the context, respond with 'Insufficient information in context.'\n\n"
"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
),
# metric thresholds for increasing/decreasing top_k
"GROUNDING_GOOD": 0.7,
"GROUNDING_BAD": 0.45,
"FAITHFULNESS_GOOD": 0.7,
"FAITHFULNESS_BAD": 0.45,
"CONTEXT_RELEVANCE_GOOD": 0.7,
"CONTEXT_RELEVANCE_BAD": 0.45,
}
random.seed(CONFIG["SEED"])
if openai and CONFIG["OPENAI_API_KEY"]:
openai.api_key = CONFIG["OPENAI_API_KEY"]
# -------------------------
# Utilities
# -------------------------
def safe_print(*args, **kwargs):
print(*args, **kwargs)
def ensure_dir(path: str):
if not os.path.exists(path):
os.makedirs(path, exist_ok=True)
def sha1_snippet(s: str) -> str:
return hashlib.sha1(s.encode("utf-8")).hexdigest()[:10]
# -------------------------
# Example document (will write if missing)
# -------------------------
SAMPLE_TEXT = """Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
The archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Superconductors can carry electric current with zero resistance -- a phenomenon discovered over a century ago but still unlocking new technologies like quantum computers today.
The Library of Alexandria was once the largest center of learning, but much of its collection was lost in fires and wars.
Voyager 1 probe, launched in 1977, has left the solar system, carrying a golden record with sounds and images of Earth.
The Amazon rainforest produces roughly 20% of the world's oxygen.
Coral reefs support nearly 25% of all marine life despite covering less than 1% of the ocean floor.
MRI scanners use strong magnetic fields and radio waves to generate detailed images of organs without harmful radiation.
Moore's Law observed that the number of transistors on microchips doubles roughly every two years.
The Mariana Trench is the deepest part of Earth's oceans, reaching nearly 11,000 meters below sea level.
Ancient civilizations like the Sumerians and Egyptians invented mathematical systems thousands of years ago.
"""
def ensure_example_doc(path: str):
if not os.path.exists(path):
with open(path, "w", encoding="utf-8") as f:
f.write(SAMPLE_TEXT)
safe_print(f"[INFO] Wrote sample doc to {path}")
# -------------------------
# Synthetic golden generation (fallback-first approach)
# -------------------------
def simple_rule_based_goldens(doc_path: str, num: int = 12) -> List[Dict[str, str]]:
"""
Very simple fallback: split document into sentences/paragraphs and craft simple Q/A.
"""
with open(doc_path, "r", encoding="utf-8") as f:
txt = f.read()
paras = [p.strip() for p in txt.split("\n") if p.strip()]
goldens = []
for p in paras:
q = f"What is one key fact from the following sentence: '{p[:120]}...'? "
a = p
goldens.append({"input": q, "expected_output": a, "context": p})
if len(goldens) >= num:
break
return goldens
def openai_synthesize_goldens(doc_path: str, num: int = 12, model: str = CONFIG["OPENAI_COMPLETION_MODEL"]) -> List[Dict[str, str]]:
"""
Try to use OpenAI to synthesize question-answer pairs.
If OpenAI is not configured or API call fails, fall back to rule-based generation.
"""
if openai is None or not getattr(openai, "api_key", None):
safe_print("[WARN] OpenAI key not found - using rule-based goldens")
return simple_rule_based_goldens(doc_path, num)
with open(doc_path, "r", encoding="utf-8") as f:
doc = f.read()
prompt = (
f"You are a dataset creator. Given the document below, produce {num} question-answer pairs. "
f"For each pair, provide 'question', 'answer' (concise and grounded in the doc), and 'context' (the snippet). "
f"Return a JSON array of objects.\n\nDocument:\n{doc}\n\n"
)
try:
resp = openai.ChatCompletion.create(
model=model,
messages=[
{"role": "system", "content": "You generate QA pairs."},
{"role": "user", "content": prompt}
],
temperature=0.0,
max_tokens=1500
)
text = resp["choices"][0]["message"]["content"]
# find JSON in text
start = text.find("[")
if start >= 0:
json_text = text[start:]
try:
arr = json.loads(json_text)
goldens = []
for item in arr[:num]:
q = item.get("question") or item.get("input") or item.get("q") or ""
a = item.get("answer") or item.get("expected_output") or ""
c = item.get("context") or ""
goldens.append({"input": q.strip(), "expected_output": a.strip(), "context": c.strip()})
safe_print(f"[INFO] OpenAI synthesized {len(goldens)} goldens.")
return goldens
except Exception as e:
safe_print("[WARN] Failed to parse JSON from OpenAI output:", e)
return simple_rule_based_goldens(doc_path, num)
else:
safe_print("[WARN] OpenAI response lacking JSON - using rule-based fallback.")
return simple_rule_based_goldens(doc_path, num)
except Exception as e:
safe_print("[ERROR] OpenAI call failed:", e)
return simple_rule_based_goldens(doc_path, num)
def generate_goldens(doc_path: str, num: int = 12) -> List[Dict[str, str]]:
# Attempt DeepEval if installed (not required here); else OpenAI; else rule-based
# To keep dependencies light in this script we skip DeepEval auto-call.
return openai_synthesize_goldens(doc_path, num)
# -------------------------
# Retriever: TF-IDF (fallback) and Embedding based (OpenAI)
# -------------------------
class TFIDFRetriever:
def __init__(self, docs: List[str]):
if not SKLEARN_AVAILABLE:
raise RuntimeError("sklearn not available for TF-IDF retriever.")
self.docs = docs
self.vectorizer = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
self.doc_matrix = self.vectorizer.fit_transform(self.docs)
def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
qv = self.vectorizer.transform([query])
sims = cosine_similarity(qv, self.doc_matrix)[0]
idx_scores = list(enumerate(sims))
idx_scores.sort(key=lambda x: x[1], reverse=True)
return idx_scores[:top_k]
class OpenAIEmbeddingRetriever:
def __init__(self, docs: List[str], embedding_model: str = CONFIG["OPENAI_EMBEDDING_MODEL"]):
self.docs = docs
self.embedding_model = embedding_model
self.embeddings = []
# compute embeddings
self._build()
def _embed_text(self, text: str):
if openai is None or not getattr(openai, "api_key", None):
# fallback: random vector (deterministic via hash)
if NUMPY_AVAILABLE:
h = int(hashlib_sha1_int(text))
rng = np.random.RandomState(h % (2**32))
return rng.normal(size=(1536,)).tolist() # fake dim
else:
return [random.random() for _ in range(512)]
try:
resp = openai.Embedding.create(model=self.embedding_model, input=text)
return resp["data"][0]["embedding"]
except Exception as e:
safe_print("[WARN] OpenAI embedding failed:", e)
# fallback deterministic pseudo-random
if NUMPY_AVAILABLE:
h = int(hashlib_sha1_int(text))
rng = np.random.RandomState(h % (2**32))
return rng.normal(size=(1536,)).tolist()
else:
return [random.random() for _ in range(512)]
def _build(self):
self.embeddings = [self._embed_text(d) for d in self.docs]
def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
q_emb = self._embed_text(query)
# compute cosine similarities
if NUMPY_AVAILABLE:
qv = np.array(q_emb, dtype=float)
sims = []
for emb in self.embeddings:
ev = np.array(emb, dtype=float)
denom = (norm(qv) * norm(ev))
sim = float(np.dot(qv, ev) / denom) if denom > 0 else 0.0
sims.append(sim)
idx_scores = list(enumerate(sims))
idx_scores.sort(key=lambda x: x[1], reverse=True)
return idx_scores[:top_k]
else:
sims = []
for emb in self.embeddings:
sim = sum(a*b for a,b in zip(q_emb, emb)) / (len(q_emb) or 1)
sims.append(sim)
idx_scores = list(enumerate(sims))
idx_scores.sort(key=lambda x: x[1], reverse=True)
return idx_scores[:top_k]
# helper hashing for fallback embeddings
def hashlib_sha1_int(s: str) -> int:
return int(hashlib.sha1(s.encode('utf-8')).hexdigest()[:16], 16)
# -------------------------
# Generator (LLM call) with fallback
# -------------------------
def call_openai_chat(question: str, contexts: List[str], temperature: float = 0.0, model: str = CONFIG["OPENAI_COMPLETION_MODEL"]) -> str:
if openai is None or not getattr(openai, "api_key", None):
# fallback: naive rule - if any context contains a sentence with overlap words, return that sentence; else "Insufficient"
combined = " ".join(contexts)
q_words = set([w.lower() for w in question.split() if len(w) > 3])
best_sent = None
best_overlap = 0
for s in combined.split("."):
wset = set([w.lower().strip(" ,;:()[]") for w in s.split() if len(w)>3])
overlap = len(q_words & wset)
if overlap > best_overlap:
best_overlap = overlap
best_sent = s.strip()
if best_sent and best_overlap >= 1:
return best_sent + "."
return "Insufficient information in context."
# try call
prompt = CONFIG["PROMPT_TEMPLATE"].format(context="\n\n".join(contexts), question=question)
try:
resp = openai.ChatCompletion.create(
model=model,
messages=[
{"role": "system", "content": "You are a precise assistant."},
{"role": "user", "content": prompt}
],
temperature=temperature,
max_tokens=512,
)
text = resp["choices"][0]["message"]["content"].strip()
return text
except Exception as e:
safe_print("[WARN] OpenAI ChatCompletion failed:", e)
# fallback naive
combined = " ".join(contexts)
q_words = set([w.lower() for w in question.split() if len(w) > 3])
best_sent = None
best_overlap = 0
for s in combined.split("."):
wset = set([w.lower().strip(" ,;:()[]") for w in s.split() if len(w)>3])
overlap = len(q_words & wset)
if overlap > best_overlap:
best_overlap = overlap
best_sent = s.strip()
if best_sent and best_overlap >= 1:
return best_sent + "."
return "Insufficient information in context."
# -------------------------
# Metrics implementations
# -------------------------
def compute_context_relevance(retrieved_idxs_scores: List[Tuple[int, float]]) -> float:
"""
Simple metric: average similarity score (score between 0-1)
"""
if not retrieved_idxs_scores:
return 0.0
scores = [s for _, s in retrieved_idxs_scores]
# ensure in [0,1]
clipped = [max(0.0, min(1.0, float(x))) for x in scores]
return sum(clipped) / len(clipped)
def compute_grounding(answer: str, contexts: List[str]) -> float:
"""
Heuristic: fraction of answer tokens that have overlap with context tokens.
Returns 0-1.
"""
a_words = [w.strip(" ,.;:()[]'\"").lower() for w in answer.split() if len(w) > 2]
if not a_words:
return 0.0
context_text = " ".join(contexts).lower()
hits = sum(1 for w in a_words if w in context_text)
return hits / len(a_words)
def compute_faithfulness(answer: str, expected: str) -> float:
"""
Very simple normalized similarity:
- overlap ratio of important tokens (set intersection over union)
"""
a_set = set([w.strip(" ,.;:()[]'\"").lower() for w in answer.split() if len(w)>2])
e_set = set([w.strip(" ,.;:()[]'\"").lower() for w in expected.split() if len(w)>2])
if not a_set and not e_set:
return 1.0
if not a_set or not e_set:
return 0.0
inter = a_set & e_set
union = a_set | e_set
return len(inter) / len(union)
# -------------------------
# Single-run RAG evaluation on list of goldens
# -------------------------
def run_rag_eval(
goldens: List[Dict[str, str]],
docs: List[str],
retriever,
top_k: int,
temperature: float
) -> Dict[str, Any]:
"""
Run through goldens, for each:
- retrieve top_k contexts
- call generator
- compute metrics
Return aggregated metrics and per-sample results
"""
per_samples = []
total_grounding = 0.0
total_context_rel = 0.0
total_faith = 0.0
iterator = goldens if not TQDM_AVAILABLE else tqdm(goldens, desc=f"Eval top_k={top_k}, temp={temperature}")
for g in iterator:
q = g["input"]
expected = g.get("expected_output", "")
# retrieve
retrieved = retriever.retrieve(q, top_k=top_k)
contexts = [docs[idx] for idx, _ in retrieved]
ctx_scores = [score for _, score in retrieved]
# call generator
answer = call_openai_chat(q, contexts, temperature=temperature)
# compute metrics
context_rel = compute_context_relevance(retrieved)
grounding = compute_grounding(answer, contexts)
faith = compute_faithfulness(answer, expected)
total_context_rel += context_rel
total_grounding += grounding
total_faith += faith
per_samples.append({
"question": q,
"expected": expected,
"answer": answer,
"retrieved": [{"idx": idx, "score": float(score), "snippet_hash": sha1_snippet(docs[idx])} for idx, score in retrieved],
"metrics": {"context_relevance": context_rel, "grounding": grounding, "faithfulness": faith}
})
n = len(goldens)
agg = {
"avg_context_relevance": total_context_rel / n if n else 0.0,
"avg_grounding": total_grounding / n if n else 0.0,
"avg_faithfulness": total_faith / n if n else 0.0
}
return {"aggregate": agg, "samples": per_samples}
# -------------------------
# Iterative parameter adjustment logic
# -------------------------
def adjust_params(current_top_k: int, metrics: Dict[str, float]) -> int:
"""
Very simple policy:
- If grounding low -> increase top_k (more context)
- If grounding high and context relevance low -> increase top_k
- If grounding high & context relevance high -> try reduce top_k to optimize
Bound by min/max.
"""
g = metrics.get("avg_grounding", 0.0)
cr = metrics.get("avg_context_relevance", 0.0)
fa = metrics.get("avg_faithfulness", 0.0)
new_top_k = current_top_k
# if grounding is very low, expand context
if g < CONFIG["GROUNDING_BAD"]:
new_top_k = min(CONFIG["MAX_TOP_K"], current_top_k + 2)
elif cr < CONFIG["CONTEXT_RELEVANCE_BAD"] and g < CONFIG["GROUNDING_GOOD"]:
new_top_k = min(CONFIG["MAX_TOP_K"], current_top_k + 1)
elif g > CONFIG["GROUNDING_GOOD"] and cr > CONFIG["CONTEXT_RELEVANCE_GOOD"]:
# try shrink to save cost
new_top_k = max(CONFIG["MIN_TOP_K"], current_top_k - 1)
# small adjustments if faithfulness very low
if fa < CONFIG["FAITHFULNESS_BAD"]:
new_top_k = min(CONFIG["MAX_TOP_K"], new_top_k + 1)
# ensure bounds
new_top_k = max(CONFIG["MIN_TOP_K"], min(CONFIG["MAX_TOP_K"], new_top_k))
return new_top_k
def pick_temperature(candidate_list: List[float], metrics: Dict[str, float]) -> float:
"""
Simple heuristic: if faithfulness low, use lower temp (more deterministic).
If faithfulness high and grounding high, allow slightly higher temp for diversity.
"""
fa = metrics.get("avg_faithfulness", 0.0)
g = metrics.get("avg_grounding", 0.0)
if fa < 0.4 or g < 0.4:
return min(candidate_list)
if fa > 0.75 and g > 0.7:
return max(candidate_list)
return candidate_list[len(candidate_list)//2]
# -------------------------
# Main pipeline
# -------------------------
def main():
safe_print("=== RAG Iterative Evaluation Demo ===")
ensure_example_doc(CONFIG["DOC_PATH"])
ensure_dir(CONFIG["SAVE_DIR"])
# load docs and split into chunks (naive paragraph chunking)
with open(CONFIG["DOC_PATH"], "r", encoding="utf-8") as f:
doc_text = f.read()
paragraphs = [p.strip() for p in doc_text.split("\n") if p.strip()]
# if paragraphs too short, split sentences
if len(paragraphs) < 5:
# attempt sentence split
sents = [s.strip() for s in doc_text.replace("\n", " ").split(".") if s.strip()]
# group per 1-2 sentences
paragraphs = []
i = 0
while i < len(sents):
chunk = sents[i]
if i+1 < len(sents):
if random.random() < 0.5:
chunk = chunk + ". " + sents[i+1]
i += 2
else:
i += 1
else:
i += 1
paragraphs.append(chunk + ".")
docs = paragraphs
safe_print(f"[INFO] Loaded {len(docs)} document chunks for retrieval.")
# generate goldens
goldens = generate_goldens(CONFIG["DOC_PATH"], CONFIG["NUM_GOLDENS"])
safe_print(f"[INFO] Generated {len(goldens)} goldens for evaluation.")
# choose retriever: prefer OpenAI embeddings if available, else TF-IDF
retriever = None
use_embedding = False
if openai and getattr(openai, "api_key", None) and NUMPY_AVAILABLE:
try:
retriever = OpenAIEmbeddingRetriever(docs)
use_embedding = True
safe_print("[INFO] Using OpenAI embedding retriever.")
except Exception as e:
safe_print("[WARN] OpenAIEmbeddingRetriever failed, falling back to TF-IDF:", e)
if retriever is None:
if SKLEARN_AVAILABLE:
retriever = TFIDFRetriever(docs)
safe_print("[INFO] Using TF-IDF retriever.")
else:
# fallback: naive substring search retriever
class NaiveRetriever:
def __init__(self, docs):
self.docs = docs
def retrieve(self, query, top_k=3):
qs = query.lower()
scores = []
for i, d in enumerate(self.docs):
s = sum(1 for w in set(qs.split()) if w in d.lower())
scores.append((i, float(s)))
scores.sort(key=lambda x: x[1], reverse=True)
return scores[:top_k]
retriever = NaiveRetriever(docs)
safe_print("[INFO] Using naive substring retriever.")
# iterative loop
cur_top_k = CONFIG["INITIAL_TOP_K"]
cur_temp = CONFIG["TEMPERATURE_OPTIONS"][0]
history = []
for itr in range(1, CONFIG["ITERATIONS"] + 1):
safe_print(f"\n--- Iteration {itr} | top_k={cur_top_k} | temp={cur_temp} ---")
result = run_rag_eval(goldens, docs, retriever, top_k=cur_top_k, temperature=cur_temp)
agg = result["aggregate"]
safe_print(f"[RESULT] avg_context_relevance={agg['avg_context_relevance']:.3f}, avg_grounding={agg['avg_grounding']:.3f}, avg_faithfulness={agg['avg_faithfulness']:.3f}")
# save per-iteration
run_record = {
"iteration": itr,
"top_k": cur_top_k,
"temperature": cur_temp,
"aggregate": agg,
"timestamp": time.time(),
"samples_count": len(result["samples"])
}
history.append(run_record)
# adapt params
new_top_k = adjust_params(cur_top_k, agg)
new_temp = pick_temperature(CONFIG["TEMPERATURE_OPTIONS"], agg)
safe_print(f"[ADAPT] next_top_k={new_top_k}, next_temp={new_temp}")
# if no change and already good metrics, we can stop early
if new_top_k == cur_top_k and new_temp == cur_temp and agg["avg_grounding"] > 0.8 and agg["avg_faithfulness"] > 0.8:
safe_print("[INFO] Metrics are good and stable - stopping early.")
break
cur_top_k = new_top_k
cur_temp = new_temp
# produce final report
report = {
"config": {k: v for k, v in CONFIG.items() if k != "OPENAI_API_KEY"},  # avoid writing the API key into the report file
"docs_count": len(docs),
"goldens_count": len(goldens),
"history": history
}
report_path = os.path.join(CONFIG["SAVE_DIR"], CONFIG["REPORT_FILE"])
with open(report_path, "w", encoding="utf-8") as f:
json.dump(report, f, indent=2, ensure_ascii=False)
safe_print(f"\n[FINISH] Saved report to {report_path}")
safe_print("=== End ===")
if __name__ == "__main__":
main()