
The Complete Guide to the RAG Ecosystem: Building and Optimizing Each Component

Published: 2025-08-22 08:46:03 Views: 2007

Author: AI大模型观察站


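Most teams building a production-ready RAG system on their own data go through several rounds of experimentation. They depend on many separate components, each of which needs its own setup, tuning, and careful handling. These components include:

[Figure: A production-ready RAG system]

• Query transformation: rewrite the user's question so that it is better suited to retrieval.
• Intelligent routing: direct each query to the right data source or specialized tool.
• Indexing: build a multi-layered knowledge base.
• Retrieval and re-ranking: filter out noise and prioritize the most relevant context.
• Self-correcting flows: build systems that can grade and improve their own output.
• End-to-end evaluation: measure the performance of the whole pipeline objectively.
• And more.

We will learn and implement each part of the RAG ecosystem with code and visuals, moving from the basics to advanced techniques.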

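All the code (theory + notebooks) is available in my GitHub repository:
https://github.com/PulsarPioneers/rag-ecosystem
Understand and code every important component of a RAG architecture

The guide is organized into several parts. Let's go through them in order.

Understanding the Basic RAG System

Before we get into the fundamentals of RAG, we need to set a few environment variables for tracing and for the LLM API provider we will use.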

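import os

# Configure LangSmith tracing: switch it on, then set the endpoint and API key
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = '<your-api-key>'  # replace with your LangSmith API key

# Set the OpenAI API key
os.environ['OPENAI_API_KEY'] = '<your-api-key>'  # replace with your OpenAI API key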

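You can get an API key by following the official LangSmith documentation; we use it to trace our RAG pipeline throughout this post. For the LLM we will use the OpenAI API, but as you may already know, LangChain supports many other LLM providers.

The core RAG pipeline is the foundation of every advanced system, so it is worth understanding how its core logic works before moving on to the advanced components. If you already know how a RAG system works, feel free to skip this part.

[Figure: A basic RAG system]

The simplest RAG system can be divided into three components:

• Indexing: organize and store data in a structured format so that it can be searched efficiently.
• Retrieval: search for and fetch the data relevant to a query or input.
• Generation: use the retrieved data to produce the final answer or output.

Let's build this simple pipeline from scratch and see how each part works.

Indexing Stage

Before our RAG system can answer any question, it needs a knowledge base to draw information from. For this we will use WebBaseLoader to pull content directly from Lilian Weng's excellent blog post on LLM-powered agents.

[Figure: Indexing stage]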

import bs4
from langchain_community.document_loaders import WebBaseLoader

# Initialize a web document loader with specific parsing instructions
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),  # URL of the blog post to load
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")  # parse only these HTML classes
        )
    ),
)

# Load the filtered page content into Document objects
docs = loader.load()

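The bs_kwargs argument restricts parsing to the relevant HTML classes (post-content, post-title, and post-header), so the data is cleaned from the start.

Now that we have the document, we face our first challenge. Feeding one huge document directly into an LLM is inefficient and, because of context-window limits, often impossible.

That is why chunking is a key step. We need to split the document into smaller, semantically meaningful pieces.

RecursiveCharacterTextSplitter is the recommended tool because it tries to keep paragraphs and sentences intact.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a text splitter that produces chunks of up to 1000 characters with a 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Split the loaded documents into smaller chunks
splits = text_splitter.split_documents(docs)

With chunk_size=1000 each chunk holds at most 1000 characters, and chunk_overlap=200 lets neighbouring chunks share some text, which helps preserve context across chunk boundaries.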
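To make the two parameters concrete, here is a minimal sketch of fixed-window chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that class also looks for separators such as paragraph and sentence breaks); the function name chunk_text and the sample string are made up for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 25-character sample so the windows are easy to inspect
sample = "".join(str(i % 10) for i in range(25))
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)

for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, which is the continuity that chunk_overlap is meant to provide.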

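The text is now split, but it is still just text. To run a similarity search we need to convert these chunks into numerical representations called embeddings. We then store the embeddings in a vector store, a database designed for efficient vector search.

With the Chroma vector store and OpenAIEmbeddings this is straightforward. The single call below handles both embedding and indexing.

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Embed the text chunks and store them in a Chroma vector store for similarity search
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings()  # use OpenAI's embedding model to turn text into vectors
)

With the knowledge base indexed, we can start asking questions.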

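Retrieval

The vector store is our library, and the retriever is our librarian. It takes the user's query, embeds it, and fetches the semantically most similar chunks from the vector store.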
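As a rough illustration of what "most similar" means, the sketch below ranks three chunks by cosine similarity to a query vector. The three-dimensional vectors and chunk labels are invented for the example; a real embedding model produces vectors with hundreds or thousands of dimensions, and a real vector store uses an index rather than a full scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical chunk "embeddings" (made-up numbers, for illustration only)
index = {
    "chunk about task decomposition": [0.9, 0.1, 0.0],
    "chunk about agent memory": [0.1, 0.9, 0.1],
    "chunk about tool use": [0.0, 0.2, 0.9],
}


def retrieve(query_vector: list[float], k: int = 2) -> list[str]:
    """Return the k chunk labels whose vectors are closest to the query vector."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [label for label, _ in ranked[:k]]


top = retrieve([0.8, 0.3, 0.1], k=2)
print(top)
```

The query vector points mostly along the first dimension, so the task decomposition chunk ranks first and the memory chunk second.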

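[Figure: Retrieval stage]

Creating a retriever from the vector store takes one line of code.

# Create a retriever from the vector store
retriever = vectorstore.as_retriever()

Let's test it. We will ask a question and see what the retriever finds.

# Retrieve the documents relevant to a query
docs = retriever.invoke("What is Task Decomposition?")

# Print the content of the first retrieved document
print(docs[0].page_content)

Output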

Task decomposition can be done (1) by LLM with simple prompting ...
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple ...

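As you can see, the retriever pulled the chunk of the blog post that discusses "Task Decomposition" directly. This is exactly the context the LLM needs to form an accurate answer.

Generation

We have the context, but we still need an LLM to read it and produce a human-friendly answer. This is the "generation" step in RAG.

[Figure: Generation step]

First we need a good prompt template, which tells the LLM how to behave. Instead of writing our own, we can pull a ready-made, pre-tuned prompt from LangChain Hub.

from langchain import hub

# Pull a ready-made RAG prompt from LangChain Hub
prompt = hub.pull("rlm/rag-prompt")

# Print the prompt
print(prompt)

Output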

human
You are an assistant for question-answering tasks. Use the following pieces
of retrieved context to answer the question. If you don't know the answer,
just say that you don't know. Use three sentences maximum and keep the
answer concise.

Question: {question} 
Context: {context} 
Answer:

Next we initialize the LLM. We will use gpt-3.5-turbo.

from langchain_openai import ChatOpenAI

# Initialize the LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

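Now for the last step: chaining everything together. With the LangChain Expression Language (LCEL), the | operator pipes the output of one component into the next.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Helper that formats the retrieved documents into one string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Define the complete RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Let's break the chain down:

• {"context": retriever | format_docs, "question": RunnablePassthrough()}: the two entries run in parallel. One sends the user's question to the retriever and then joins the returned documents into a single string with format_docs. The other, RunnablePassthrough, passes the original question through unchanged.
• | prompt: fills the prompt template with the context and the question.
• | llm: sends the formatted prompt to the LLM.
• | StrOutputParser(): reduces the LLM's output to a plain string.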
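The piping idea can be sketched in plain Python. The Step class below is a toy stand-in written for this post, not LangChain's Runnable implementation, and fake_retrieve and fake_llm are placeholder functions; the point is only to show how each stage's output becomes the next stage's input.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports the | operator."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new Step that runs a first and feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.fn(value)


def fake_retrieve(question: str) -> list[str]:
    """Placeholder retriever that always returns the same two chunks."""
    return ["Chunk A", "Chunk B"]


# Stage 1: build the {"context", "question"} dictionary from the raw question
build_inputs = Step(lambda q: {"context": "\n\n".join(fake_retrieve(q)), "question": q})
# Stage 2: fill a prompt template
fill_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
# Stage 3: placeholder "LLM" that only reports the prompt length
fake_llm = Step(lambda prompt: f"ANSWER based on {len(prompt)} prompt characters")

toy_chain = build_inputs | fill_prompt | fake_llm
result = toy_chain.invoke("What is Task Decomposition?")
print(result)
```

Unlike this sketch, LCEL runs the entries of a dictionary step in parallel and adds streaming, batching, and tracing on top of the same composition idea.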

Now we invoke the whole chain.

# Ask a question with the RAG chain
response = rag_chain.invoke("What is Task Decomposition?")
print(response)

Output

Task decomposition is a technique used to break down large tasks
into smaller, more manageable subgoals. This can be achieved by using a
Large Language Model (LLM) with simple prompts, task-specific instructions,
or human inputs. For example, ...

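It works. Our RAG pipeline retrieved the relevant information about "Task Decomposition" and generated a concise, accurate answer. This simple chain is the foundation on which we will build more advanced capabilities.

Advanced Query Transformation

We now understand the basics of a RAG pipeline, but production systems often expose the limits of this basic approach. One of the most common points of failure is the user's query itself.

[Figure: Query transformation]

A query may be too specific, too broad, or phrased in different vocabulary from the source documents, all of which lead to poor retrieval results.

The solution is not to blame the user but to make our system smarter. Query transformation is a set of techniques that rewrite, expand, or decompose the original question in order to improve retrieval accuracy.

Instead of relying on a single query, we will craft several better-informed queries to cover the information more broadly and more accurately.

To test these techniques we will use the same blog post that we indexed in the basic RAG pipeline section, so the results are directly comparable.

Here is a quick recap of how we set up the retriever: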

# Load the blog post
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
blog_docs = loader.load()

# Split the documents into chunks (sizes are measured in tokens here, not characters)
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50
)
splits = text_splitter.split_documents(blog_docs)

# Index the chunks in a Chroma vector store
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=OpenAIEmbeddings())

# Create our retriever
retriever = vectorstore.as_retriever()

With the retriever ready, let's explore the first query transformation technique.

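Multi-Query Generation

A single user query represents only one perspective. Distance-based similarity search can miss documents that use synonyms or discuss related concepts.

The multi-query approach addresses this by using an LLM to generate several different versions of the user's question, so that the search covers multiple angles.

[Figure: Multi-query optimization]

We start by creating a prompt that instructs the LLM to generate these alternative questions.

from langchain.prompts import ChatPromptTemplate

# Prompt for generating multiple queries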
template = """You are an AI language model assistant. Your task is to generate five 
different versions of the given user question to retrieve relevant documents from a vector 
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search. 
Provide these alternative questions separated by newlines. Original question: {question}"""
prompt_perspectives = ChatPromptTemplate.from_template(template)

# Chain that generates the queries
generate_queries = (
    prompt_perspectives
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)

Let's test this chain and see which queries it generates for our question.

question = "What is task decomposition for LLM agents?"
generated_queries_list = generate_queries.invoke({"question": question})

# Print the generated queries
for i, q in enumerate(generated_queries_list):
    print(f"{i+1}. {q}")

Output

1. How can LLM agents break down complex tasks?
2. What is the process of task decomposition in the context of large language model agents?
3. What are the methods for decomposing tasks for LLM-powered agents?
4. Explain the concept of task decomposition as it applies to AI agents using LLMs.
5. In what ways do LLM agents handle task decomposition?

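The LLM rephrased our question with different wording, such as "break down complex tasks", "methods", and "process". We can now retrieve documents for all of these queries and merge the results. A simple way to merge them is to take the unique union of all retrieved documents.

from langchain.load import dumps, loads

def get_unique_union(documents: list[list]):
    """Return the unique union of the retrieved documents."""
    # Flatten the list of lists and serialize each document to a string so it can be deduplicated
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
    unique_docs = list(set(flattened_docs))
    return [loads(doc) for doc in unique_docs]

# Build the retrieval chain
retrieval_chain = generate_queries | retriever.map() | get_unique_union

# Invoke the chain and check how many documents were retrieved
docs = retrieval_chain.invoke({"question": question})
print(f"Total unique documents retrieved: {len(docs)}")

Output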

Total unique documents retrieved: 6

By searching with five different queries we retrieved 6 unique documents in total, which are likely to cover more of the relevant information than a single query would. We can now feed this context into the final RAG chain.

from operator import itemgetter

# Final RAG chain
template = """Answer the following question based on this context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(temperature=0)

final_rag_chain = (
    {"context": retrieval_chain, "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"question": question})

Output

Task decomposition for LLM agents involves breaking down large,
complex tasks into smaller, more manageable sub-goals. This allows
the agent to work through a problem systematically. Methods for
decomposition include using the LLM itself with simple prompts ...

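This answer is more robust because it draws on a broader pool of relevant documents.

RAG-Fusion

Multi-query is a good start, but simply taking the union treats every document equally. What if one document ranks highly for three of the queries while another appears only once, near the bottom of a single result list?

[Figure: RAG-Fusion]

The first document is clearly more important. RAG-Fusion improves on multi-query by not only fetching the documents but also re-ranking them with a technique called Reciprocal Rank Fusion (RRF).

RRF combines the results of several searches. Each document earns 1 / (rank + k) from every result list in which it appears, so documents that rank consistently well across lists accumulate the highest scores and rise to the top.

The code is very similar, but we replace the get_unique_union function with an RRF implementation.

def reciprocal_rank_fusion(results: list[list], k=60):
    """Combine several ranked lists with Reciprocal Rank Fusion."""
    fused_scores = {}

    # Iterate over each ranked list of documents
    for docs in results:
        for rank, doc in enumerate(docs):
            doc_str = dumps(doc)
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Core of RRF: better-ranked documents (lower rank value) get a larger score
            fused_scores[doc_str] += 1 / (rank + k)

    # Sort the documents by fused score, highest first
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    return reranked_results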

The function above re-ranks the documents after the similarity search has fetched them, but we have not wired it into a chain yet. Let's do that now.

# Use a slightly different prompt for RAG-Fusion
template = """You are a helpful assistant that generates multiple search queries based on a single input query. \n
Generate multiple search queries related to: {question} \n
Output (4 queries):"""
prompt_rag_fusion = ChatPromptTemplate.from_template(template)

generate_queries = (
    prompt_rag_fusion
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)

# Build the new retrieval chain with RRF
retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion
docs = retrieval_chain_rag_fusion.invoke({"question": question})

print(f"Total re-ranked documents retrieved: {len(docs)}")

Output

Total re-ranked documents retrieved: 7

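The final chain stays the same, but it now receives context that has been ranked more intelligently. RAG-Fusion is a low-cost, effective way to improve retrieval quality.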
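A small worked example shows how the 1 / (rank + k) scoring plays out. The sketch below applies the same formula to plain string IDs instead of serialized documents; the three result lists and the doc_* names are invented for illustration.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Accumulate 1 / (rank + k) for every document across all ranked lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):  # rank is 0-based, as in the function above
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rank + k)
    return scores


# Hypothetical results for three generated queries
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_e", "doc_a"],
]

scores = rrf_scores(rankings)
ordered = sorted(scores, key=scores.get, reverse=True)

for doc_id in ordered:
    print(f"{doc_id}: {scores[doc_id]:.5f}")
```

doc_b and doc_a appear in all three lists and end up far ahead of documents that appear once. With k=60 the gap between adjacent ranks is small, so how often a document appears matters more than its exact position in any one list.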

Decomposition

Some questions are too complex to answer in one step. For example, "What are the main components of an LLM-powered agent, and how do they interact?" is really two questions in one.

[Figure: Answering recursively]

Decomposition uses an LLM to split a complex query into a set of simpler, self-contained sub-questions. We can then answer each sub-question and synthesize a final answer.

We start with a prompt designed for this purpose.

# Decomposition prompt
template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n
Generate multiple search queries related to: {question} \n
Output (3 queries):"""
prompt_decomposition = ChatPromptTemplate.from_template(template)

# Chain that generates the sub-questions
generate_queries_decomposition = (
    prompt_decomposition
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)

# Generate and print the sub-questions
question = "What are the main components of an LLM-powered autonomous agent system?"
sub_questions = generate_queries_decomposition.invoke({"question": question})
print(sub_questions)

Output

[
 '1. What are the core components ... agent?',
 '2. How is memory implemented in LLM-po ... agents?',
 '3. What role does planning and task decomposition ... LLMs?'
]

The LLM decomposed our complex question. We can now answer the sub-questions one by one and merge the results. An effective approach is to answer each sub-question and then use the resulting question-answer pairs as context for synthesizing a final, comprehensive answer.

# RAG prompt
prompt_rag = hub.pull("rlm/rag-prompt")

# List that collects the answers to the sub-questions
rag_results = []
for sub_question in sub_questions:
    # Retrieve documents for each sub-question
    retrieved_docs = retriever.invoke(sub_question)

    # Answer the sub-question with a standard RAG chain
    answer = (prompt_rag | llm | StrOutputParser()).invoke({"context": retrieved_docs, "question": sub_question})
    rag_results.append(answer)

def format_qa_pairs(questions, answers):
    """Format the question-answer pairs."""
    formatted_string = ""
    for i, (question, answer) in enumerate(zip(questions, answers), start=1):
        formatted_string += f"Question {i}: {question}\nAnswer {i}: {answer}\n\n"
    return formatted_string.strip()

# Format the Q&A pairs into a single context string
context = format_qa_pairs(sub_questions, rag_results)

# Final synthesis prompt
template = """Here is a set of Q+A pairs:

{context}

Use these to synthesize an answer to the original question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"context": context, "question": question})

Output

An LLM-powered autonomous agent system primarily consists of three
core components: planning, memory, and tool use. ...

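By decomposing the question, we built a more detailed and better structured answer than a single retrieval pass would have produced.

Step-Back Prompting

Sometimes the user's query is too specific, while our documents contain the more general background information needed to answer it.

[Figure: Step-Back Prompting]

For example, a user might ask: "Could the members of The Police perform lawful arrests?"

A direct search may fail. The step-back technique uses an LLM to "step back" and form a more general question, such as "What can the members of The Police do?" We then retrieve context for both the specific and the general question, which gives the final answer richer context.

We can teach the LLM this pattern with few-shot examples.

from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

# Few-shot examples that teach the model to generate more general step-back questions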
examples = [
    {
        "input": "Could the members of The Police perform lawful arrests?",
        "output": "what can the members of The Police do?",
    },
    {
        "input": "Jan Sindel's was born in what country?",
        "output": "what is Jan Sindel's personal history?",
    },
]

# Define how each example is formatted in the prompt
example_prompt = ChatPromptTemplate.from_messages([
    ("human", "{input}"),  # user input
    ("ai", "{output}")     # model response
])

# Wrap the few-shot examples in a reusable prompt template
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

# The full prompt combines the system instruction, the few-shot examples, and the user's question
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are an expert at world knowledge. Your task is to step back and paraphrase a question "
     "to a more generic step-back question, which is easier to answer. Here are a few examples:"),
    few_shot_prompt,
    ("user", "{question}"),
])

Now we can define the chain for the step-back approach.

# Define a chain that generates the step-back question with the prompt and an OpenAI model
generate_queries_step_back = prompt | ChatOpenAI(temperature=0) | StrOutputParser()

# Run the chain on a specific question
question = "What is task decomposition for LLM agents?"
step_back_question = generate_queries_step_back.invoke({"question": question})

# Print the original question and the generated step-back question
print(f"Original Question: {question}")
print(f"Step-Back Question: {step_back_question}")

Output

Original Question: What is task decomposition for LLM agents?
Step-Back Question: What are the different approaches to task decomposition 
                    in software engineering?

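This is a useful step-back question. It widens the scope to software engineering in general, which can pull in background documents that are then combined with the specific context about LLM agents. Now we can build a chain that uses both.

from langchain_core.runnables import RunnableLambda

# Prompt for the final response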
response_prompt_template = """You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.

# Normal Context
{normal_context}

# Step-Back Context
{step_back_context}

# Original Question: {question}
# Answer:"""
response_prompt = ChatPromptTemplate.from_template(response_prompt_template)

# Full chain
chain = (
    {
        # Retrieve context with the original question
        "normal_context": RunnableLambda(lambda x: x["question"]) | retriever,
        # Retrieve context with the step-back question
        "step_back_context": generate_queries_step_back | retriever,
        # Pass the original question through
        "question": lambda x: x["question"],
    }
    | response_prompt
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)

chain.invoke({"question": question})

Here is the output when we run the step-back prompting chain.

Step-Back Output

Task decomposition is a fundamental concept in software engineering
where a complex problem is broken down into smaller, more manageable parts.
In the context of LLM agents, this principle is applied to enable them to 
handle large tasks. By decomposing a task into sub-goa ....

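HyDE

This last technique is the cleverest of the set. A core problem in retrieval is that the user's query may use different vocabulary from the documents (the "vocabulary mismatch" problem).

[Figure: HyDE]

HyDE (Hypothetical Document Embeddings) proposes a radical solution: first, have an LLM generate a hypothetical answer. This fake document is not necessarily factually correct, but it is semantically rich and uses the kind of language we expect to find in a real answer.

We then embed this hypothetical document and use its embedding for retrieval. As a result, we find real documents that are semantically close to the ideal answer.

Let's start by creating the prompt that generates the hypothetical document.

# HyDE prompt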
template = """Please write a scientific paper passage to answer the question
Question: {question}
Passage:"""
prompt_hyde = ChatPromptTemplate.from_template(template)

# Chain that generates the hypothetical document
generate_docs_for_retrieval = (
    prompt_hyde
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)

# Generate and print the hypothetical document
hypothetical_document = generate_docs_for_retrieval.invoke({"question": question})
print(hypothetical_document)

Output

Task decomposition in large language model (LLM) agents refers to the
process of breaking down a complex, high-level task ...

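This passage reads like a textbook answer. Now we use its embedding to find real documents.

# Retrieve documents with the HyDE approach
retrieval_chain = generate_docs_for_retrieval | retriever
retrieved_docs = retrieval_chain.invoke({"question": question})

# Generate the final answer from the retrieved context with a standard RAG chain.
# The chain is rebuilt here because final_rag_chain was last defined with the Q+A synthesis prompt.
answer_prompt = ChatPromptTemplate.from_template(
    "Answer the following question based on this context:\n\n{context}\n\nQuestion: {question}\n"
)
final_rag_chain = answer_prompt | llm | StrOutputParser()
final_rag_chain.invoke({"context": retrieved_docs, "question": question})

Output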

Task decomposition for LLM agents involves breaking down a larger task
into smaller, more manageable subgoals. This can be done using techni ...

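By using the hypothetical document as bait, HyDE helps us home in on the most relevant chunks in the knowledge base, which makes it another useful tool in our RAG toolbox.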
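The vocabulary mismatch problem, and why a hypothetical answer helps, can be shown with a deliberately crude sketch. Here word-count vectors stand in for a real embedding model, and the hypothetical answer is written by hand rather than generated by an LLM; the chunk texts and the query are invented. Real dense embeddings capture some synonymy, so the gap is usually less stark than in this toy.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical knowledge-base chunks
chunks = {
    "decomposition": "task decomposition breaks a complex task into smaller subgoals",
    "memory": "long term memory stores information in an external vector store",
}


def score_chunks(text: str) -> dict[str, float]:
    """Similarity of the given text to every chunk."""
    return {name: cosine(embed(text), embed(body)) for name, body in chunks.items()}


query = "how do agents split big jobs"
# Hand-written stand-in for the passage an LLM would generate for this query
hypothetical_answer = "agents use task decomposition to break a complex task into subgoals"

query_scores = score_chunks(query)
hyde_scores = score_chunks(hypothetical_answer)
best_with_hyde = max(hyde_scores, key=hyde_scores.get)

print("query scores:", query_scores)
print("HyDE scores:", hyde_scores)
```

The raw query shares no words with either chunk, so both scores are zero and the search has nothing to go on. The hypothetical answer uses the vocabulary of the source text and matches the decomposition chunk.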

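Routing and Query Construction

Our RAG system is getting smarter, but in real-world scenarios knowledge is rarely stored in a single, unified library.

We usually have several data sources: documentation for different programming languages, internal wikis, public websites, or databases with structured metadata.

[Figure: Routing and query transformation]

Sending every query to every data source is inefficient and can introduce noise or irrelevant results.

Our RAG system therefore has to evolve from a simple librarian into a smart switchboard operator. It must first analyze the incoming query and then either route it to the right destination or construct a more precise, structured query for retrieval. This section covers the techniques that make this possible.

Logical Routing

Routing is a classification problem. Given a user's question, we need to assign it to one of several predefined categories. A traditional ML model could do this, but we can also use the reasoning engine we already have: the LLM itself.

[Figure: Logical routing]

By giving the LLM a clear schema (a set of possible categories), we can let it make the classification decision for us.

We start by defining the "contract" for the LLM's output with a Pydantic model. This schema tells the LLM exactly which destinations a query can have.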

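from typing import Literal
# Recent LangChain releases deprecate langchain_core.pydantic_v1; there, import BaseModel and Field from pydantic instead.
from langchain_core.pydantic_v1 import BaseModel, Field

# Data model for the router's output
class RouteQuery(BaseModel):
    """Route a user query to the most relevant datasource."""

    # The 'datasource' field must be one of the three literal strings below,
    # which restricts the LLM to a fixed set of options.
    datasource: Literal["python_docs", "js_docs", "golang_docs"] = Field(
        ...,  # '...' marks the field as required
        description="Given a user question, choose which datasource would be most relevant for answering it.",
    )

With the schema defined, we can build the router chain. We use a prompt to give the LLM its instructions and then the .with_structured_output() method to make its response conform to our RouteQuery model.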

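# Initialize our LLM
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)

# Wrap the LLM so that it returns an instance of our Pydantic model
structured_llm = llm.with_structured_output(RouteQuery)

# The system prompt carries the core instructions for the LLM's task
system = """You are an expert at routing a user question to the appropriate data source.

Based on the programming language the question is referring to, route it to the relevant data source."""

# The full prompt template combines the system message and the user's question
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)

# Define the complete router chain
router = prompt | structured_llm

Now let's test the router. We pass in a question that is clearly about Python and check the output.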

question = """Why doesn't the following code work:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(["human", "speak in {language}"])
prompt.invoke("french")
"""

# Invoke the router and inspect the result
result = router.invoke({"question": question})

print(result)

Output

datasource='python_docs'

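The output is an instance of our RouteQuery model, and the LLM correctly identified python_docs as the most suitable data source. We can rely on this structured output in code to implement branching logic.

def choose_route(result):
    """Decide the downstream logic based on the router's output."""
    if "python_docs" in result.datasource.lower():
        # In a real application this would be a complete RAG chain over the Python docs
        return "chain for python_docs"
    elif "js_docs" in result.datasource.lower():
        # This would be the chain for the JavaScript docs
        return "chain for js_docs"
    else:
        # This would be the chain for the Go docs
        return "chain for golang_docs"

# The full chain now includes routing and branching logic
full_chain = router | RunnableLambda(choose_route)

# Run the full chain
final_destination = full_chain.invoke({"question": question})

print(final_destination)

Output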

chain for python_docs

Our switchboard routed the Python question correctly. This approach is a solid basis for building multi-source RAG systems.
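As the number of data sources grows, an if/elif ladder becomes harder to maintain. One possible alternative, sketched below, is a dispatch table keyed by the datasource name. RouteDecision is a stand-in for the RouteQuery object returned by the router, and the handler functions are placeholders for real RAG chains; none of these names come from LangChain.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RouteDecision:
    """Stand-in for the RouteQuery instance produced by the router."""
    datasource: str


# Hypothetical downstream handlers; in practice each would be a full RAG chain
ROUTES: dict[str, Callable[[str], str]] = {
    "python_docs": lambda question: f"[python_docs chain] {question}",
    "js_docs": lambda question: f"[js_docs chain] {question}",
    "golang_docs": lambda question: f"[golang_docs chain] {question}",
}


def dispatch(decision: RouteDecision, question: str) -> str:
    """Look up the handler for the chosen datasource and run it."""
    try:
        handler = ROUTES[decision.datasource]
    except KeyError:
        raise ValueError(f"Unknown datasource: {decision.datasource!r}") from None
    return handler(question)


print(dispatch(RouteDecision("python_docs"), "Why doesn't my prompt work?"))
```

Unlike the else branch above, which silently sends anything unrecognized to the Go chain, this version raises an error for an unknown datasource. Which behaviour is preferable depends on the application.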

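Semantic Routing

Logical routing works well when the categories are clearly defined. But what if you want to route by the style or domain of the question? For example, you may want physics questions answered in a rigorous academic tone and math questions answered step by step, in a teaching style. This is where semantic routing comes in.

[Figure: Semantic routing]

Instead of classifying the query, we define several expert prompts.

We then embed the user's query and each prompt template, and use cosine similarity to find the prompt that is semantically closest to the query.

First we define two expert personas.

from langchain_core.prompts import PromptTemplate

# Prompt for the physics expert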
physics_template = """You are a very smart physics professor. \
You are great at answering questions about physics in a concise and easy to understand manner. \
When you don't know the answer to a question you admit that you don't know.

Here is a question:
{query}"""

# Prompt for the math expert
math_template = """You are a very good mathematician. You are great at answering math questions. \
You are so good because you are able to break down hard problems into their component parts, \
answer the component parts, and then put them together to answer the broader question.

Here is a question:
{query}"""

Now we create the routing function that performs the embedding and the similarity comparison.

from langchain.utils.math import cosine_similarity

# Initialize the embedding model
embeddings = OpenAIEmbeddings()

# Store our templates and their embeddings for comparison
prompt_templates = [physics_template, math_template]
prompt_embeddings = embeddings.embed_documents(prompt_templates)

def prompt_router(input):
    """Route the input query to the most similar prompt template."""
    # 1. Embed the incoming user query
    query_embedding = embeddings.embed_query(input["query"])

    # 2. Compute the cosine similarity between the query and every prompt template
    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]

    # 3. Find the index of the most similar prompt
    most_similar_index = similarity.argmax()

    # 4. Select the most similar prompt template
    chosen_prompt = prompt_templates[most_similar_index]

    print(f"DEBUG: Using {'MATH' if most_similar_index == 1 else 'PHYSICS'} template.")

    # 5. Return the selected prompt as a PromptTemplate object
    return PromptTemplate.from_template(chosen_prompt)

With the routing logic in place, we can build the full chain that picks the right expert dynamically.

# Final chain that combines the router and the LLM
chain = (
    {"query": RunnablePassthrough()}
    | RunnableLambda(prompt_router)  # choose the prompt dynamically
    | ChatOpenAI()
    | StrOutputParser()
)

# Ask a physics question
print(chain.invoke("What's a black hole"))

Output

DEBUG: Using PHYSICS template.
A black hole is a region of spacetime where gravity is so strong
that nothing—no particles or even electromagnetic radiation such as
light—can escape from it. The boundary of no escape is called the
event horizon. Although it has a great effect on the fate
and circumstances of an object crossing it, it has no locally
detectable features. In many ways, a black hole acts as an ideal black body,
as it reflects no light.

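The router correctly identified the question as physics-related and used the physics professor prompt, which produced a concise, accurate answer. This technique is well suited to building specialized agents that adapt to the user's needs.

Query Structuring

So far we have focused on retrieving from unstructured text. But much real-world data is semi-structured: it carries valuable metadata such as dates, authors, view counts, or categories. Plain vector search cannot make use of this information.

Query structuring converts a natural-language question into a structured query whose metadata filters allow much more precise retrieval.

To illustrate, let's look at the metadata of a YouTube video transcript.

from langchain_community.document_loaders import YoutubeLoader

# Load a YouTube transcript to inspect its metadata
docs = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=pbAd8O1Lvm4", add_video_info=True
).load()

# Print the metadata of the first document
print(docs[0].metadata)

Output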

{ 
  'source': 'pbAd8O1Lvm4',
  'title': 'Self-reflective RAG with LangGraph: Self-RAG and CRAG',
  'description': 'Unknown',
  'view_count': 11922,
  'thumbnail_url': 'https://i.ytimg.com/vi/pbAd8O1Lvm4/hq720.jpg',
  'publish_date': '2024-02-07 00:00:00',
  'length': 1058,
  'author': 'LangChain'
}

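This document has rich metadata: view_count, publish_date, length. We want users to be able to filter on these fields in natural language. To do that, we define another Pydantic schema, this time for a structured video search query.

import datetime
from typing import Optional

class TutorialSearch(BaseModel):
    """Search over a database of tutorial videos."""

    # Main query for similarity search over the video transcripts
    content_search: str = Field(..., description="Similarity search query applied to video transcripts.")

    # A more succinct query for searching video titles
    title_search: str = Field(..., description="Alternate version of the content search query to apply to video titles.")

    # Optional metadata filters
    min_view_count: Optional[int] = Field(None, description="Minimum view count filter, inclusive.")
    max_view_count: Optional[int] = Field(None, description="Maximum view count filter, exclusive.")
    earliest_publish_date: Optional[datetime.date] = Field(None, description="Earliest publish date filter, inclusive.")
    latest_publish_date: Optional[datetime.date] = Field(None, description="Latest publish date filter, exclusive.")
    min_length_sec: Optional[int] = Field(None, description="Minimum video length in seconds, inclusive.")
    max_length_sec: Optional[int] = Field(None, description="Maximum video length in seconds, exclusive.")

    def pretty_print(self) -> None:
        """Print the fields of the model that have been populated."""
        for field in self.__fields__:
            if getattr(self, field) is not None:
                print(f"{field}: {getattr(self, field)}")

This schema is our target. Now we create a chain that converts a user question into this model.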

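# System prompt for the query analyzer
system = """You are an expert at converting user questions into database queries. \
You have access to a database of tutorial videos about a software library for building LLM-powered applications. \
Given a question, return a database query optimized to retrieve the most relevant results.

If there are acronyms or words you are not familiar with, do not try to rephrase them."""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
structured_llm = llm.with_structured_output(TutorialSearch)

# Final query analyzer chain
query_analyzer = prompt | structured_llm

Let's test it with a few different questions to see what it can do.

# Test 1: a simple query
query_analyzer.invoke({"question": "rag from scratch"}).pretty_print()

Output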

content_search: rag from scratch
title_search: rag from scratch

As expected, it filled in the content and title search fields. Now let's try a more complex query.

# Test 2: a query with a date filter
query_analyzer.invoke(
    {"question": "videos on chat langchain published in 2023"}
).pretty_print()

Output

content_search: chat langchain
title_search: chat langchain 2023
earliest_publish_date: 2023-01-01
latest_publish_date: 2024-01-01

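The LLM interpreted "in 2023" correctly and created a date-range filter. Let's try one more query, this time with a length constraint.

# Test 3: a query with a length filter
query_analyzer.invoke(
    {
        "question": "how to use multi-modal models in an agent, only videos under 5 minutes"
    }
).pretty_print()

Output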

content_search: multi-modal models agent
title_search: multi-modal models agent
max_length_sec: 300

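It converted "under 5 minutes" into max_length_sec: 300. This structured query can now be passed to a vector store that supports metadata filtering, which allows retrieval that is far more precise than plain semantic search.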
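How the structured query is applied depends on the vector store, and each store has its own filter syntax. As a store-agnostic illustration, the sketch below applies the numeric filters to a plain list of metadata dictionaries, following the inclusive-minimum and exclusive-maximum convention from the schema. VideoQuery is a simplified stand-in for TutorialSearch, and the second video record is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoQuery:
    """Simplified stand-in for TutorialSearch (numeric filters only)."""
    min_view_count: Optional[int] = None
    max_view_count: Optional[int] = None
    min_length_sec: Optional[int] = None
    max_length_sec: Optional[int] = None


def matches(metadata: dict, query: VideoQuery) -> bool:
    """True if the metadata passes every filter that is set (min inclusive, max exclusive)."""
    checks = [
        (query.min_view_count, lambda bound: metadata["view_count"] >= bound),
        (query.max_view_count, lambda bound: metadata["view_count"] < bound),
        (query.min_length_sec, lambda bound: metadata["length"] >= bound),
        (query.max_length_sec, lambda bound: metadata["length"] < bound),
    ]
    return all(check(bound) for bound, check in checks if bound is not None)


videos = [
    {"title": "Self-reflective RAG with LangGraph: Self-RAG and CRAG", "view_count": 11922, "length": 1058},
    {"title": "Hypothetical short clip", "view_count": 500, "length": 240},  # made-up record
]

# "only videos under 5 minutes" -> max_length_sec=300
short_videos = [v["title"] for v in videos if matches(v, VideoQuery(max_length_sec=300))]
print(short_videos)
```

In a real system the filter would normally be pushed down into the vector store query so that filtering happens before or alongside the similarity search, not afterwards in Python.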

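Advanced Indexing Strategies

So far our indexing approach has been simple: split documents into chunks and embed them. This works, but it has a fundamental limitation.

Small, focused chunks are good for retrieval accuracy (they contain less noise), but they often lack the broader context the LLM needs to generate a comprehensive answer.

[Figure: Indexing strategies]

Large chunks, in contrast, provide good context but perform poorly in retrieval because their core meaning is diluted.

This is the classic "chunk size" dilemma. How do we get the best of both?

The answer lies in more advanced indexing strategies that separate the representation of a document used for retrieval from the representation used for generation. Let's dig in.

Multi-Representation Indexing

The core idea of multi-representation indexing is simple but powerful: instead of embedding the full document chunk, we create a smaller, more focused representation of each chunk (such as a summary) and embed that.

[Figure: Multi-representation indexing]

At retrieval time we search over these concise summaries. Once we find the best summary, we use its ID to look up and return the full original chunk.

This gives us the precision of searching small, dense summaries together with the rich context of the larger parent document for generation.

First we need some documents to work with. We will take two posts from Lilian Weng's blog.

from langchain_community.document_loaders import WebBaseLoader

# Load two different blog posts to build a more varied knowledge base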
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

print(f"Loaded {len(docs)} documents.")

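Output

Loaded 2 documents.

Next we build a chain that generates summaries of these documents.

import uuid

# Chain that generates the summaries
summary_chain = (
    # Extract page_content from the Document object
    {"doc": lambda x: x.page_content}
    # Feed it into the prompt template
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    # Generate the summary with the LLM
    | ChatOpenAI(model="gpt-3.5-turbo", max_retries=0)
    # Parse the output into a string
    | StrOutputParser()
)

# Run the summaries concurrently with .batch() for efficiency
summaries = summary_chain.batch(docs, {"max_concurrency": 5})

# Inspect the first summary
print(summaries[0])

Output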

The document discusses building autonomous agents powered by Large
Language Models (LLMs). It outlines the key components of such a system, ...

Now for the key part. We need a MultiVectorRetriever, which has two main components:

• A vectorstore that stores the embeddings of the summaries.
• A docstore (a simple key-value store) that holds the original full documents.

from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.documents import Document

# Vector store used to index the summary embeddings
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())

# Storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"  # this key links each summary to its parent document

# The retriever that coordinates the whole process
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# Generate a unique ID for each original document
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Create new Document objects for the summaries, with 'doc_id' in their metadata
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add the summaries to the vector store
retriever.vectorstore.add_documents(summary_docs)

# Add the original documents to the docstore under the same IDs
retriever.docstore.mset(list(zip(doc_ids, docs)))

Our advanced index is now built. Let's test the retrieval process by asking about "Memory in agents" and seeing what happens.

query = "Memory in agents"

# First, see what the vector store finds by searching the summaries
sub_docs = vectorstore.similarity_search(query, k=1)
print("--- Result from searching summaries ---")
print(sub_docs[0].page_content)
print("\n--- Metadata showing the link to the parent document ---")
print(sub_docs[0].metadata)

Output

--- Result from searching summaries ---
The document discusses the concept of building autonomous agents powered by Large Language Models (LLMs) as their core controllers. It covers components such as planning, memory, and tool use, along with case studies and proof-of-concept examples like AutoGPT and GPT-Engineer. Challenges like finite context length, planning difficulties, and reliability of natural language interfaces are also highlighted. The document provides references to related research papers and offers a comprehensive overview of LLM-powered autonomous agents.

--- Metadata showing the link to the parent document ---
{'doc_id': 'cf31524b-fe6a-4b28-a980-f5687c9460ea'}

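As you can see, the search found the summary that mentions "memory". The MultiVectorRetriever now uses the doc_id in this summary's metadata to fetch the full parent document from the docstore automatically.

# Let the full retriever do its job
retrieved_docs = retriever.invoke(query)

# Print the beginning of the full document that was retrieved
print("\n--- The full document retrieved by the MultiVectorRetriever ---")
print(retrieved_docs[0].page_content[0:500])

Output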

"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\ ...."

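This is exactly what we wanted. We searched over concise summaries but got back the full, context-rich document, which resolves the chunk size dilemma.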
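Stripped of the libraries, the mechanism is a two-step lookup: search a small index of summaries, then follow the stored ID to the full document. The sketch below shows only that data flow. The documents and summaries are hand-written placeholders, and word overlap stands in for vector similarity; it is not how MultiVectorRetriever scores results.

```python
import uuid

# Placeholder "full documents" and hand-written stand-ins for LLM summaries
full_docs = [
    "LLM agents combine planning, memory and tool use. " * 20,
    "Human data quality depends on annotator agreement and task design. " * 20,
]
summaries = [
    "Overview of LLM agents: planning, memory, tools.",
    "How to collect high-quality human-labelled data.",
]

# One shared ID links each summary to its parent document
doc_ids = [str(uuid.uuid4()) for _ in full_docs]
summary_index = [{"summary": s, "doc_id": i} for s, i in zip(summaries, doc_ids)]
docstore = dict(zip(doc_ids, full_docs))


def words(text: str) -> set[str]:
    """Lowercase words with trailing punctuation removed."""
    return {w.strip(".,:").lower() for w in text.split()}


def retrieve_parent(query: str) -> str:
    """Search the summaries (word overlap as a crude score), then return the linked full document."""
    best = max(summary_index, key=lambda entry: len(words(query) & words(entry["summary"])))
    return docstore[best["doc_id"]]


parent = retrieve_parent("memory in agents")
print(len(summaries[0]), "characters searched,", len(parent), "characters returned")
```

The search step only ever touches the short summaries, while the generation step receives the long parent text.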

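Hierarchical Indexing (RAPTOR Knowledge Tree)

Theory: RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) takes the idea of multi-representation indexing a step further. Instead of a single layer of summaries, it builds a multi-level tree of summaries. It first clusters small document chunks and then summarizes each cluster.

[Figure: RAPTOR (from the LangChain documentation)]

It then takes those summaries, clusters them, and summarizes the new clusters. The process repeats, producing a hierarchy of knowledge that runs from fine-grained details to high-level concepts. At query time you can search at different levels of the tree, so retrieval can be as specific or as general as the question requires.

This is a more advanced technique. We will not implement the full algorithm here, but you can find an in-depth walkthrough and complete code in the RAPTOR Cookbook. It is among the more sophisticated approaches to structured indexing.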
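To give a feel for the recursion only, here is a structural sketch of the "group, summarize, repeat" loop. It is not RAPTOR: it groups neighbouring items in fixed-size batches instead of clustering embeddings, and the default summarizer just joins truncated text instead of calling an LLM. The function name build_summary_tree and its parameters are made up for this illustration.

```python
from typing import Callable, Optional


def build_summary_tree(
    chunks: list[str],
    group_size: int = 2,
    summarize: Optional[Callable[[list[str]], str]] = None,
) -> list[list[str]]:
    """Return the tree as a list of levels: level 0 holds the raw chunks, the last level one root."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so that each level shrinks")
    if summarize is None:
        # Placeholder "summary": joins the first 20 characters of each member (not a real summary)
        summarize = lambda texts: " / ".join(t[:20] for t in texts)

    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Stand-in for clustering: batch neighbouring items together
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        levels.append([summarize(group) for group in groups])
    return levels


levels = build_summary_tree(["chunk 1", "chunk 2", "chunk 3", "chunk 4", "chunk 5"])
for depth, level in enumerate(levels):
    print(f"level {depth}: {len(level)} node(s)")
```

In the real method every level would be embedded and indexed, so a query can match either a leaf chunk or a higher-level summary.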

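Token-Level Precision (ColBERT)

Theory: A standard embedding model compresses an entire chunk of text into a single vector. Pooling everything into one vector can lose a lot of nuance.

[Figure: Specialized embeddings]

ColBERT (Contextualized Late Interaction over BERT) offers a finer-grained approach. It produces a separate, context-aware embedding for every token in the document.

When you run a query, ColBERT also embeds every token in the query. Then, instead of comparing one document vector with one query vector, it finds, for each query token, the maximum similarity to any document token and sums these maxima into the document's score.

This "late interaction" gives a finer-grained notion of relevance and works well for keyword-style searches.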
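The late-interaction score can be written in a few lines. The sketch below computes it for hand-made two-dimensional token vectors; real ColBERT token embeddings are produced by a BERT encoder and have many more dimensions, and the vectors and document names here are invented.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """For each query token take its best match among the document tokens, then sum the maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Hypothetical token embeddings: the query has two tokens pointing in different directions
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # has a good match for both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]              # matches only the first query token well

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(f"doc_a: {score_a:.2f}, doc_b: {score_b:.2f}")
```

doc_a scores higher because every query token finds a close counterpart in it, whereas doc_b covers only one of the two. A single pooled vector per document would blur exactly this distinction.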

We can use ColBERT easily through the RAGatouille library.

# Install the required library
!pip install -U ragatouille

from ragatouille import RAGPretrainedModel

# Load the pretrained ColBERT model
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

Now we index a Wikipedia page with ColBERT's token-level approach.

import requests

def get_wikipedia_page(title: str):
    """Fetch the plain-text content of a Wikipedia page."""
    # Wikipedia API endpoint and parameters
    URL = "https://en.wikipedia.org/w/api.php"
    params = { "action": "query", "format": "json", "titles": title, "prop": "extracts", "explaintext": True }
    headers = {"User-Agent": "MyRAGApp/1.0"}
    response = requests.get(URL, params=params, headers=headers)
    data = response.json()
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract")

full_document = get_wikipedia_page("Hayao_Miyazaki")

# Index the document with RAGatouille, which handles chunking and token-level embedding internally
RAG.index(
    collection=[full_document],
    index_name="Miyazaki-ColBERT",
    max_document_length=180,
    split_documents=True,
)

Indexing is more involved because it creates an embedding for every token, but RAGatouille handles this for us. Now let's search our new index.

# Search the ColBERT index
results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
print(results)

Output

[{'content': 'In April 1984, ...', 'score': 25.9036, 'rank': 1, ...}, 
 {'content': 'Hayao Miyazaki ...', 'score': 25.5716, 'rank': 2, ...},
 {'content': 'Glen Keane said ...', 'score': 24.8411, 'rank': 3, ...}]

The top result covers the founding of Studio Ghibli. We can also wrap the model as a standard LangChain retriever.

# Convert the RAGatouille model into a LangChain-compatible retriever
colbert_retriever = RAG.as_langchain_retriever(k=3)

# Use it like any other retriever
retrieved_docs = colbert_retriever.invoke("What animation studio did Miyazaki found?")
print(retrieved_docs[0].page_content)

Output

In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.

=== Studio Ghibli ===
==== Early films (1985–1996) ====
In June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli's first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki's designs for the film's setting were inspired by Greek architecture and "European urbanistic templates".

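ColBERT offers a powerful, fine-grained alternative to traditional vector search, and it shows that how we build the library matters as much as how we search it.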
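
To make the token-level idea concrete, here is a minimal sketch of ColBERT-style late-interaction ("MaxSim") scoring in plain Python. The 2-dimensional vectors are invented for illustration, and the helper names `dot` and `maxsim_score` are hypothetical; a real ColBERT model produces one learned, higher-dimensional vector per token, and RAGatouille performs this scoring internally.

```python
# Toy illustration of late-interaction (MaxSim) scoring.
# All vectors are made up; they are not real ColBERT embeddings.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """For each query token, take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]               # two query-token vectors
doc_a = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]   # three document-token vectors
doc_b = [[0.2, 0.1], [0.3, 0.2]]               # two document-token vectors

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)  # doc_a matches both query tokens more closely
```

Because every query token is matched independently, a document only needs one strongly matching token per query token to score well, which is what lets this approach pick up fine-grained matches that a single pooled vector may blur.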

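Advanced Retrieval and Generation

We have built a sophisticated RAG system with intelligent routing and advanced indexing. Now we reach the final stage: retrieval and generation. Here we make sure the context handed to the LLM is of the highest quality, and that the LLM's final answer is relevant, accurate, and grounded in that context.

Retrieval/Generation

Even with the best index, the initial retrieval can still contain noise: marginally relevant documents slip in. And LLMs, powerful as they are, sometimes misread the context or hallucinate.

This section covers advanced techniques that act as the last quality-control layer of our pipeline.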

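Dedicated Reranking

Standard retrieval gives us a ranked list of documents, but that initial ranking is not always ideal. Reranking is a crucial second step: we take the initially retrieved set and reorder it by relevance to the query, using a more sophisticated (and usually more expensive) model.

Dedicated Reranking

This ensures the most relevant documents sit at the very top of the context we give the LLM.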
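
The two-stage pattern can be sketched without any external service. In the sketch below, `overlap_score` is a deliberately crude, hypothetical stand-in for a real reranking model (it just measures word overlap), and `rerank` is an illustrative helper rather than a library API; the point is only the shape of the flow: score each candidate against the query, sort, keep the top few.

```python
def overlap_score(query, text):
    """Hypothetical relevance scorer: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

def rerank(query, candidates, score_fn, top_n=2):
    """Score every candidate, sort by score (highest first), keep the top_n."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# Candidates as a first-stage retriever might return them (order is arbitrary).
candidates = [
    "agents use memory and tools",
    "task decomposition splits a task into subgoals",
    "planning helps with decomposition",
]

top = rerank("task decomposition", candidates, overlap_score)
for score, doc in top:
    print(f"{score:.2f}  {doc}")
```

A production reranker replaces `overlap_score` with a cross-encoder or a hosted rerank endpoint, but the surrounding logic stays the same.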

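We already saw one powerful reranking method in the RAG-Fusion section: Reciprocal Rank Fusion (RRF). It is a good model-free way to combine result lists. For something stronger, we can use a dedicated reranking model such as the one Cohere provides.

Let's first set up a standard retriever. We will use the same blog post as in the earlier examples.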

# Load, split, and index the document
loader = WebBaseLoader(web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",))
blog_docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=300, chunk_overlap=50)
splits = text_splitter.split_documents(blog_docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# First-pass retriever: fetch the top 10 potentially relevant documents
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

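Now we introduce ContextualCompressionRetriever. This special retriever wraps our base retriever and adds a "compression" step. Here, the compressor is the CohereRerank model.

It takes the 10 documents from the base retriever, reranks them, and returns only the most relevant ones.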

# Install the Cohere SDK first: pip install cohere
# and set the COHERE_API_KEY environment variable.
# Note: newer LangChain releases ship CohereRerank in the separate
# langchain-cohere package (from langchain_cohere import CohereRerank).
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Initialize the Cohere Rerank model
compressor = CohereRerank()

# Create the compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

# Test it with our query (invoke() matches how the ColBERT retriever was called above)
question = "What is task decomposition for LLM agents?"
compressed_docs = compression_retriever.invoke(question)

# Print the reranked documents
print("--- Re-ranked and Compressed Documents ---")
for doc in compressed_docs:
    print(f"Relevance Score: {doc.metadata['relevance_score']:.4f}")
    print(f"Content: {doc.page_content[:150]}...\n")

Output

--- Re-ranked and Compressed Documents ---
Relevance Score: 0.9982
Content: Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.", "What are the subgoals for achieving XYZ?", (2) by using task...

Relevance Score: 0.9851
Content: Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into mult...

Relevance Score: 0.9765
Content: LLM-powered autonomous agents have been an exciting concept. They can be used for task decomposition by prompting, using task-specific instructions, or ...

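The output is impressive. The CohereRerank model not only reordered the documents but also assigned each one a relevance_score. We can now be more confident that the context passed to the LLM is of high quality, which leads directly to better, more accurate answers.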

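Self-Correction with AI Agents

What if our RAG system could check its own work before answering? That is the idea behind self-correcting RAG architectures such as CRAG (Corrective RAG) and Self-RAG.

Self-correcting RAG (from the LangChain blog)

These are not simple chains. They are dynamic graphs (usually built with LangGraph) that can reason about the quality of the retrieved information and decide what to do next.

CRAG: if the retrieved documents are irrelevant or ambiguous for the given query, a CRAG system does not pass them straight to the LLM. Instead, it triggers a new, more powerful web search to find better information, corrects the retrieved documents, and then proceeds to generation.
Self-RAG: this approach goes a step further. At each step, it uses the LLM to generate "reflection tokens" that critique the process. It grades the relevance of the retrieved documents and retrieves again if they are not relevant. Once it has good documents, it generates an answer and then grades that answer for factual consistency, making sure it is grounded in the source documents.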
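
The CRAG control flow described above can be sketched as a plain function. Everything here is a simplified assumption for illustration: `corrective_rag` and the four stub callables (`retrieve`, `grade`, `web_search`, `generate`) are hypothetical names, the grader is a keyword check rather than an LLM, and a real implementation would typically be a LangGraph state machine.

```python
def corrective_rag(question, retrieve, grade, web_search, generate, threshold=0.5):
    """Retrieve, grade each document, fall back to web search if nothing passes."""
    docs = retrieve(question)
    relevant = [d for d in docs if grade(question, d) >= threshold]
    used_fallback = False
    if not relevant:
        # No retrieved document is good enough: correct course with a web search.
        relevant = web_search(question)
        used_fallback = True
    return generate(question, relevant), used_fallback

# --- Stub components (all hypothetical, for demonstration only) ---
def retrieve(question):
    return ["Bananas are yellow.", "The sky is blue."]

def grade(question, doc):
    # Stand-in for an LLM relevance grader: 1.0 if the doc mentions the topic.
    return 1.0 if "ghibli" in doc.lower() else 0.0

def web_search(question):
    return ["Studio Ghibli was founded in 1985."]

def generate(question, docs):
    return f"Answer based on {len(docs)} document(s): {docs[0]}"

answer, used_fallback = corrective_rag(
    "Who founded Studio Ghibli?", retrieve, grade, web_search, generate
)
print(used_fallback)
print(answer)
```

Self-RAG adds a second grading step after `generate`, checking the answer against the documents and looping back if it is not grounded.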

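These techniques represent the state of the art for building reliable, production-grade RAG. Implementing them from scratch means building a state machine or graph. A full implementation is involved, but excellent detailed tutorials are available here:
CRAG Notebook
Self-RAG Notebook

These agentic frameworks are key to moving from simple Q&A bots to genuinely capable reasoning engines.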

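The Impact of Long Context

A recurring theme in RAG is the LLM's context window limit. But with the rise of models with huge context windows (128k, 200k, even 1 million tokens), a question arises:

Long context

Do we still need RAG? Can't we just stuff all the documents into the prompt?

The answer is nuanced. Long-context models are very capable, but they are not a cure-all.

Research shows that their performance can degrade when key information is buried in the middle of a very long context (the "lost in the middle" effect, commonly probed with "needle in a haystack" tests).

RAG's strength: RAG excels at finding the needle first and presenting only the needle to the LLM. It is a precision tool.
Long context's strength: long-context models shine on tasks that require synthesizing information from many different parts of a document at once, which RAG may miss.

The future is likely a hybrid: use RAG for an initial, precise retrieval of the most relevant documents, then feed that high-quality, pre-filtered context into a long-context model for the final synthesis.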
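
One way to picture the hybrid approach is as a packing step: take documents in retrieval-rank order and fill the model's context budget with as many as fit. The sketch below is an assumption-laden illustration, not a prescribed method: `pack_context` is a hypothetical helper, and token counts are approximated by whitespace-separated words rather than a real tokenizer.

```python
def pack_context(ranked_docs, token_budget, count_tokens=lambda text: len(text.split())):
    """Greedily keep ranked documents that still fit within the token budget."""
    selected, used = [], 0
    for doc in ranked_docs:
        cost = count_tokens(doc)
        if used + cost > token_budget:
            continue  # too large for the remaining budget; a later, smaller doc may fit
        selected.append(doc)
        used += cost
    return selected, used

# Documents already sorted by relevance (best first); contents are placeholders.
ranked_docs = [
    "alpha beta gamma",             # 3 "tokens"
    "one two three four five six",  # 6 "tokens"
    "x y",                          # 2 "tokens"
]

selected, used = pack_context(ranked_docs, token_budget=5)
print(selected, used)
```

With a long-context model the budget is simply much larger, so retrieval decides what goes in and in what order, while the model handles synthesis across everything that fits.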

For a deeper look at this topic, this slide deck is an excellent resource:
Slides on Long Context: The Impact of Long Context on RAG

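Manual RAG Evaluation

We have built an increasingly sophisticated RAG pipeline, layering advanced retrieval, indexing, and generation techniques. But one key question remains: how do we prove it actually works?

In production, "it seems to work" is not enough. We need objective, repeatable metrics to measure performance, identify weaknesses, and guide improvements.

That is where evaluation comes in. It is the discipline that holds our RAG system accountable. In this part, we explore how to measure the quality of our system quantitatively by building evaluators from scratch.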

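Core Metrics: What Should We Measure?

Before diving into code, let's define what a "good" RAG response looks like. We can break it down into a few core principles:

Faithfulness: does the answer stick strictly to the provided context? A faithful answer does not fabricate information or fall back on the LLM's pretrained knowledge. This is the most important metric for preventing hallucinations.
Correctness: is the answer factually correct when compared with a "ground truth" or reference answer?
Contextual Relevancy: is the retrieved context actually relevant to the user's question? This evaluates the retriever, not the generator.

Let's explore how to measure these, starting with the most transparent approach: building the evaluators ourselves.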

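Building Evaluators from Scratch with LangChain

The best way to understand evaluation is to build it. Using basic LangChain components, we can create custom chains that instruct an LLM to act as an impartial "judge" and score the RAG system's output against criteria we define in a prompt. This gives us maximum control and transparency.

We start with correctness. The goal is a chain that compares generated_answer with the ground_truth answer and returns a score from 0 to 1.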

from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# Use a strong LLM such as gpt-4o as the "judge" for reliable evaluation.
llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)

# Output schema for the score, so the judge returns consistent structured output.
class ResultScore(BaseModel):
    score: float = Field(..., description="The score of the result, ranging from 0 to 1, where 1 is the best possible score.")

# This prompt template tells the LLM exactly how to score the answer's correctness.
correctness_prompt = PromptTemplate(
    input_variables=["question", "ground_truth", "generated_answer"],
    template="""
    Question: {question}
    Ground Truth: {ground_truth}
    Generated Answer: {generated_answer}

    Evaluate the correctness of the generated answer compared to the ground truth.
    Score from 0 to 1, where 1 is perfectly correct and 0 is completely incorrect.
    
    Score:
    """
)

# Build the evaluation chain by piping the prompt into the LLM with structured output.
correctness_chain = correctness_prompt | llm.with_structured_output(ResultScore)

Now we wrap it in a simple function and test it. What if the ground truth is "Paris and Madrid" but our RAG system gives only the partial answer "Paris"?

def evaluate_correctness(question, ground_truth, generated_answer):
    """Helper that runs our custom correctness evaluation chain."""
    result = correctness_chain.invoke({
        "question": question, 
        "ground_truth": ground_truth, 
        "generated_answer": generated_answer
    })
    return result.score

# Test the correctness chain with a partially correct answer.
question = "What is the capital of France and Spain?"
ground_truth = "Paris and Madrid"
generated_answer = "Paris"
score = evaluate_correctness(question, ground_truth, generated_answer)

print(f"Correctness Score: {score}")

Output

Correctness Score: 0.5

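This is exactly the result we want. The judge LLM correctly reasoned that the generated answer is only half right and assigned an appropriate score of 0.5.

Next, we build the evaluator for faithfulness. In RAG this is arguably a more important metric than correctness, because it is our main defense against hallucinations.

Here, the judge LLM must ignore whether the answer is factually correct and care only about whether the answer can be derived from the given context.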

# The faithfulness prompt template includes several examples (few-shot prompting)
# to make the instructions to the judge LLM very clear.
faithfulness_prompt = PromptTemplate(
    input_variables=["question","context", "generated_answer"],
    template="""
    Question: {question}
    Context: {context}
    Generated Answer: {generated_answer}

    Evaluate if the generated answer to the question can be deduced from the context.
    Score of 0 or 1, where 1 is perfectly faithful *AND CAN BE DERIVED FROM THE CONTEXT* and 0 otherwise.
    You don't mind if the answer is correct; all you care about is if the answer can be deduced from the context.
    
    [... several examples from the notebook to guide the LLM ...]

    Example:
    Question: What is 2+2?
    Context: 4.
    Generated Answer: 4.
    In this case, the context states '4', but it does not provide information to deduce the answer to 'What is 2+2?', so the score should be 0.
    """
)

# Build the faithfulness chain with the same structured-output LLM.
faithfulness_chain = faithfulness_prompt | llm.with_structured_output(ResultScore)

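We provide several examples in the prompt to guide the LLM's reasoning, especially for tricky edge cases. Let's test it with a variant of the "2+2" example from the prompt ("3+3"), a classic faithfulness test.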

def evaluate_faithfulness(question, context, generated_answer):
    """Helper that runs our custom faithfulness evaluation chain."""
    result = faithfulness_chain.invoke({
        "question": question, 
        "context": context, 
        "generated_answer": generated_answer
    })
    return result.score

# Test the faithfulness chain. The answer is correct, but is it faithful?
question = "what is 3+3?"
context = "6"
generated_answer = "6"
score = evaluate_faithfulness(question, context, generated_answer)

print(f"Faithfulness Score: {score}")

Output

Faithfulness Score: 0.0

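This shows how powerful and precise a well-defined faithfulness metric can be. Even though the answer 6 is factually correct, it cannot be logically derived from the provided context "6".

The context never says that 3+3 equals 6. Our evaluator correctly flags this as an unfaithful answer, most likely a hallucination in which the LLM used its own pretrained knowledge instead of the provided context.

Building these evaluators from scratch gives deep insight into what we are measuring. It can be time-consuming, though. In the next part, we will see how dedicated evaluation frameworks achieve the same results more efficiently.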
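
A single score is rarely enough; in practice the judge runs over a whole evaluation set. Below is a minimal sketch of such a loop. `run_evaluation` is a hypothetical helper, and `word_overlap_judge` is a crude offline stand-in so the example runs without an API key; it takes the same three arguments as the evaluate_correctness function above, which could be passed in its place.

```python
from statistics import mean

def run_evaluation(samples, judge):
    """Apply a judge function to every sample and report per-sample and mean scores."""
    scores = [judge(**sample) for sample in samples]
    return {"scores": scores, "mean": mean(scores)}

def word_overlap_judge(question, ground_truth, generated_answer):
    """Offline stand-in for an LLM judge: share of ground-truth words in the answer."""
    truth = {w for w in ground_truth.lower().split() if w != "and"}
    answer = set(generated_answer.lower().split())
    return len(truth & answer) / len(truth)

samples = [
    {
        "question": "What is the capital of France and Spain?",
        "ground_truth": "Paris and Madrid",
        "generated_answer": "Paris",
    },
    {
        "question": "What is the capital of Spain?",
        "ground_truth": "Madrid",
        "generated_answer": "Madrid",
    },
]

report = run_evaluation(samples, word_overlap_judge)
print(report)
```

Keeping the judge as an injected function makes it easy to swap between a cheap heuristic during development and an LLM judge for the real run.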

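Evaluating with Frameworks

In the previous part, we built our own evaluation chains from scratch. That is a great way to understand the core principles behind RAG metrics.

For faster, more robust testing, however, dedicated evaluation frameworks are the better choice.

Evaluating with frameworks

These libraries provide prebuilt, well-tuned metrics and handle the complexity of evaluation, so we can focus on analyzing results.

We will explore three popular frameworks: deepeval, grouse, and RAGAS, a powerful framework designed specifically for RAG.

Quick Evaluation with deepeval

deepeval is a powerful open-source framework designed to be simple and intuitive. It offers a well-defined set of metrics that are easy to apply to the output of your RAG pipeline.

The workflow is to create LLMTestCase objects and measure them against metrics such as FaithfulnessMetric and ContextualRelevancyMetric, or against a custom GEval metric for criteria like correctness.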

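# Install deepeval first: pip install deepeval
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create the test cases
test_case_correctness = LLMTestCase(
    input="What is the capital of Spain?",
    expected_output="Madrid is the capital of Spain.",
    actual_output="MadriD."
)

test_case_faithfulness = LLMTestCase(
    input="what is 3+3?",
    actual_output="6",
    retrieval_context=["6"]
)

# GEval is a custom LLM-judged metric: it needs criteria and the fields to compare.
correctness_metric = GEval(
    name="Correctness",
    model="gpt-4o",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

# evaluate() runs every listed metric against every test case.
# Note: each metric requires its own fields on each test case; depending on the
# deepeval version, a missing field may raise an error, in which case evaluate
# each test case with its matching metric in a separate call.
evaluation_results = evaluate(
    test_cases=[test_case_correctness, test_case_faithfulness],
    metrics=[correctness_metric, FaithfulnessMetric()]
)

print(evaluation_results)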

Output

✨ Evaluation Results ✨
-------------------------
Overall Score: 0.50
-------------------------
Metrics summary:
- Correctness: 1.00
- Faithfulness: 0.00
-------------------------

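deepeval's aggregated view gives us an immediate high-level picture of system performance, making it easy to spot areas that need improvement.

grouse: Another Powerful Alternative

grouse is another excellent open-source option. It offers a similar suite of metrics, but stands out by allowing deep customization of the "judge" prompts. This is very useful for fine-tuning evaluation criteria for a specific domain.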

# Install grouse first: pip install grouse-eval
from grouse import EvaluationSample, GroundedQAEvaluator

evaluator = GroundedQAEvaluator()
unfaithful_sample = EvaluationSample(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower is located at Rue Rabelais in Paris.",
    references=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France",
        "Gustave Eiffel died in his apartment at Rue Rabelais in Paris."
    ]
)

result = evaluator.evaluate(eval_samples=[unfaithful_sample]).evaluations[0]
print(f"Grouse Faithfulness Score (0 or 1): {result.faithfulness.faithfulness}")

Output

Grouse Faithfulness Score (0 or 1): 0

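Like deepeval, grouse catches subtle errors effectively and gives us another solid tool for the evaluation toolbox.

Evaluating with RAGAS

While deepeval and grouse are great general-purpose evaluators, RAGAS (Retrieval-Augmented Generation Assessment) is a framework built specifically for evaluating RAG pipelines. It provides a comprehensive suite of metrics that measure every component of the system, from the retriever to the generator.

To use RAGAS, we first need to prepare the evaluation data in a specific format. Each test case needs four key pieces of information:

• question: the user's input query.
• answer: the final answer generated by the RAG system.
• contexts: the list of documents returned by the retriever.
• ground_truth: the correct reference answer.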
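
The four fields can be captured in a small record type that is then turned into the column-oriented dictionary shape used for the dataset. This is only a sketch under assumptions: `RagSample` and `to_columns` are hypothetical names, not part of RAGAS, and the sample content is illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RagSample:
    """One evaluation record with the four fields listed above."""
    question: str
    answer: str
    contexts: List[str]
    ground_truth: str

def to_columns(samples: List[RagSample]) -> Dict[str, list]:
    """Convert a list of records into a dict of equal-length columns."""
    return {
        "question": [s.question for s in samples],
        "answer": [s.answer for s in samples],
        "contexts": [s.contexts for s in samples],
        "ground_truth": [s.ground_truth for s in samples],
    }

samples = [
    RagSample(
        question="What animation studio did Miyazaki found?",
        answer="Miyazaki co-founded Studio Ghibli.",
        contexts=["In June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded Studio Ghibli."],
        ground_truth="Studio Ghibli",
    ),
]

columns = to_columns(samples)
print(list(columns.keys()))
```

Building the columns from records guarantees that the four lists stay aligned, which is easy to get wrong when they are maintained by hand.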

Let's prepare a sample dataset.

# 1. Prepare the evaluation data
questions = [
    "What is the name of the three-headed dog guarding the Sorcerer's Stone?",
    "Who gave Harry Potter his first broomstick?",
    "Which house did the Sorting Hat initially consider for Harry?",
]

# These are the answers generated by our RAG pipeline
generated_answers = [
    "The three-headed dog is named Fluffy.",
    "Professor McGonagall gave Harry his first broomstick, a Nimbus 2000.",
    "The Sorting Hat strongly considered putting Harry in Slytherin.",
]

# The ground truth, or "perfect" answers
ground_truth_answers = [
    "Fluffy",
    "Professor McGonagall",
    "Slytherin",
]

# The contexts our RAG system retrieved for each question
retrieved_documents = [
    ["A massive, three-headed dog was guarding a trapdoor. Hagrid mentioned its name was Fluffy."],
    ["First years are not allowed brooms, but Professor McGonagall, head of Gryffindor, made an exception for Harry."],
    ["The Sorting Hat muttered in Harry's ear, 'You could be great, you know, it's all here in your head, and Slytherin will help you on the way to greatness...'"],
]

Next, we structure this data with the Hugging Face datasets library, which RAGAS integrates with seamlessly.

# Install ragas and datasets first: pip install ragas datasets
from datasets import Dataset

# 2. Structure the data as a Hugging Face Dataset object
data_samples = {
    'question': questions,
    'answer': generated_answers,
    'contexts': retrieved_documents,
    'ground_truth': ground_truth_answers
}

dataset = Dataset.from_dict(data_samples)

Now we can define the metrics and run the evaluation. RAGAS provides several powerful metrics designed specifically for RAG.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    answer_correctness,
)

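# 3. Define the metrics we want to evaluate
metrics = [
    faithfulness,       # How factually consistent is the answer with the context? (guards against hallucination)
    answer_relevancy,   # How relevant is the answer to the question?
    context_recall,     # Did we retrieve all the context needed to answer the question?
    answer_correctness, # How accurate is the answer compared with the ground truth?
]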

# 4. Run the evaluation
result = evaluate(
    dataset=dataset, 
    metrics=metrics
)

# 5. Display the results as a clean table
results_df = result.to_pandas()
print(results_df)

	question	answer	contexts	ground_truth	faithfulness	answer_relevancy	context_recall	answer_correctness
0	What is the name of the three-headed dog...	The three-headed dog is named Fluffy.	[A massive, three-headed dog was guarding...	Fluffy	1.0	0.998	1.0	1.0
1	Who gave Harry Potter his first broomstick?	Professor McGonagall gave Harry his...	[First years are not allowed brooms, but...	Professor McGonagall	1.0	1.0	1.0	0.954
2	Which house did the Sorting Hat initially...	The Sorting Hat strongly considered...	[The Sorting Hat muttered in Harry's ear...	Slytherin	1.0	0.985	1.0	1.0

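RAGAS evaluation table

We can see that our system is highly faithful and retrieves the needed context well (faithfulness and context_recall are both 1.0 on every row). The answers are also highly relevant and correct, with only minor deviations.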
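
To turn a per-row table like this into something actionable, it helps to compute per-metric averages and flag the rows that fall below a chosen bar. The sketch below copies the four metric columns from the table above by hand; `summarize` is a hypothetical helper and the 0.97 threshold is an arbitrary assumption for illustration.

```python
from statistics import mean

# Metric values copied from the three rows of the results table above.
rows = [
    {"faithfulness": 1.0, "answer_relevancy": 0.998, "context_recall": 1.0, "answer_correctness": 1.0},
    {"faithfulness": 1.0, "answer_relevancy": 1.0, "context_recall": 1.0, "answer_correctness": 0.954},
    {"faithfulness": 1.0, "answer_relevancy": 0.985, "context_recall": 1.0, "answer_correctness": 1.0},
]

def summarize(rows, threshold=0.97):
    """Return per-metric averages and the (row index, metric) pairs below threshold."""
    metric_names = list(rows[0].keys())
    averages = {m: round(mean([r[m] for r in rows]), 3) for m in metric_names}
    flagged = [(i, m) for i, r in enumerate(rows) for m in metric_names if r[m] < threshold]
    return averages, flagged

averages, flagged = summarize(rows)
print(averages)
print(flagged)
```

Here the only flagged entry is the answer_correctness score of the second question, which matches the "minor deviations" visible in the table.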

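RAGAS makes running this kind of comprehensive end-to-end evaluation very simple, giving us the data we need to deploy and improve our RAG application with confidence.

Wrapping Up

Let's recap what we have done so far to build a production-ready RAG system.

In Part 1, we built a basic RAG system from scratch, covering the three core components: indexing data, retrieving relevant context, and generating the final answer.
In Part 2, we moved to advanced query transformation, using techniques such as RAG-Fusion, decomposition, and HyDE to rewrite and expand user questions for more accurate retrieval.
In Part 3, we turned the pipeline into an intelligent switchboard, adding routing to direct queries to the right data source and using query structuring to take advantage of powerful metadata filters.
In Part 4, we focused on advanced indexing, exploring strategies such as multi-representation indexing and token-level ColBERT to create a smarter, more efficient knowledge base.
In Part 5, we refined the final output with advanced retrieval techniques such as reranking to prioritize the best context, and introduced agentic self-correction concepts such as CRAG and Self-RAG.
Finally, in Parts 6 and 7, we tackled the critical step of evaluation. We learned how to measure system performance with key metrics such as faithfulness and correctness, both by building evaluators from scratch and by using powerful frameworks such as deepeval, grouse, and RAGAS.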
