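PageIndex rethinks RAG by replacing traditional vector retrieval with tree-based reasoning, so that the model can pinpoint information in long documents the way an expert would. Key points:
1. The core principle of PageIndex: a vectorless, tree-structured index combined with reasoning-based retrieval
2. Three pain points of traditional vector-based RAG and how PageIndex addresses them
3. Practical advantages and outlook in specialized domains such as finance and law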
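PageIndex is a vectorless, reasoning-based information retrieval framework for retrieving knowledge from long, complex documents. Its design mimics how a human expert reads and locates information: the document is structured as a tree, and a large language model navigates that tree by reasoning, which yields explainable, vector-free retrieval over long documents.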
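In short, its core features are: no vectors, a tree-structured index, reasoning-based navigation, and explainable retrieval.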
If you are interested, you can try it out on the PageIndex website.
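Vector-based RAG relies on semantic embeddings and a vector database to identify relevant text chunks.
In the preprocessing stage, the document is first split into smaller chunks. Each chunk is then mapped into a vector space by an embedding model, and the resulting vectors are stored in a vector database such as Chroma or Pinecone.
In the query stage, the user query is embedded with the same embedding model, the vector database is searched for semantically similar chunks, and the system retrieves the top-k results, which are then used to build the model's input context.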
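The two stages above can be sketched in a few lines of Python. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database such as Chroma or Pinecone, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text: str, size: int = 8) -> list:
    """Preprocessing step 1: split the document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(document: str, size: int = 8) -> list:
    """Preprocessing step 2: embed every chunk and store (chunk, vector) pairs."""
    return [(c, embed(c)) for c in chunk(document, size)]


def retrieve(index: list, query: str, k: int = 2) -> list:
    """Query stage: embed the query the same way and return the top-k chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


document = (
    "Revenue grew ten percent in the third quarter "
    "The board approved a new dividend policy today"
)
index = build_index(document)
print(retrieve(index, "dividend policy", k=1))
```

Note that the chunk boundaries are fixed by word count and ignore the document's sections, which is the kind of structure a tree index is meant to preserve.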
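Although this approach is simple and effective for short texts, vector-based RAG runs into several major challenges on long documents.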
PageIndex instead turns a document into a tree of nodes, each with a title, a node ID, a summary, and a start and end index. An excerpt of such a tree:
{
  "structure": [
    {
      "nodes": [
        {
          "title": "Abstract",
          "node_id": "0001",
          "summary": "This text discusses the increasing importance of fine-tuning large language models (LLMs) for human intent alignment, highlighting the need for efficient resource utilization. It contrasts Reinforcement Learning from Human or AI Preferences (RLHF/RLAIF), which is complex and unstable, with Direct Preference Optimization (DPO), a simpler alternative. The work introduces an active learning strategy for DPO, proposing an acquisition function that uses predictive entropy and the certainty of the implicit preference model to improve the efficiency and effectiveness of fine-tuning with pairwise preference data.",
          "end_index": 1,
          "start_index": 1
        },
        {
          "nodes": [
            {
              "title": "3.1. Acquisition functions",
              "node_id": "0005",
              "summary": "### 3.1. Acquisition functions\n\nIn selecting scoring methods (step 8 in 1) we aim for options that are straightforward to implement and do not require modifications to the model architectures or the fine-tuning procedure itself. This allows for a drop in addition to existing implementations. As a result, we propose using the predictive entropy of $p_{\\theta_t}(y|x)$ as well as a measure of certainty under the Bradley-Terry preference model, which leverages the implicit reward model in DPO.\n",
              "end_index": 4,
              "start_index": 3
            }
          ],
          "title": "3 Active Preference Learning",
          "node_id": "0004",
          "summary": "This text introduces Active Preference Learning (APL), a machine learning paradigm for efficiently selecting the most informative data points during training, specifically within a pool-based active learning setting. The APL training procedure involves iteratively sampling prompts, generating pairs of completions using the current model, ranking these pairs with an acquisition function, selecting the highest-ranked pairs for preference labeling by an oracle, and then fine-tuning the model with these labeled preferences. This approach augments the standard DPO fine-tuning loop with an outer data acquisition loop, where the number of acquisition steps is determined by the labeling budget and batch size. A key difference from traditional active learning is the necessity of generating completions for acquired data before scoring, especially if the acquisition function requires them. The text also outlines crucial design considerations, including the selection of acquisition functions, fine-tuning implementation details, the choice of oracle, and experimental settings for sampling parameters. Algorithm 1 provides a detailed step-by-step breakdown of the entire APL procedure.",
          "end_index": 3,
          "start_index": 2
        }
      ]
    }
  ]
}
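To make the tree structure concrete, here is a minimal sketch of how such a tree could be walked in Python. The helper names (`iter_nodes`, `build_node_map`, `outline`) are hypothetical and not part of PageIndex; the sample tree is a trimmed copy of the one above with the summaries left out.

```python
from typing import Iterator

# Trimmed copy of the tree above; summaries are omitted for brevity.
tree = {
    "structure": [
        {
            "nodes": [
                {"title": "Abstract", "node_id": "0001",
                 "start_index": 1, "end_index": 1},
                {
                    "title": "3 Active Preference Learning",
                    "node_id": "0004",
                    "start_index": 2,
                    "end_index": 3,
                    "nodes": [
                        {"title": "3.1. Acquisition functions", "node_id": "0005",
                         "start_index": 3, "end_index": 4},
                    ],
                },
            ]
        }
    ]
}


def iter_nodes(nodes: list, depth: int = 0) -> Iterator[tuple]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    for node in nodes:
        has_id = "node_id" in node
        if has_id:
            yield depth, node
        # Wrapper objects without a node_id do not add a nesting level.
        yield from iter_nodes(node.get("nodes", []), depth + 1 if has_id else depth)


def build_node_map(tree: dict) -> dict:
    """Map node_id -> node so that selected IDs can be looked up directly."""
    return {node["node_id"]: node for _, node in iter_nodes(tree["structure"])}


def outline(tree: dict) -> str:
    """Render a compact table of contents, one line per node."""
    lines = []
    for depth, node in iter_nodes(tree["structure"]):
        span = f'{node["start_index"]}-{node["end_index"]}'
        lines.append(f'{"  " * depth}[{node["node_id"]}] {node["title"]} ({span})')
    return "\n".join(lines)


print(outline(tree))
```

A lookup table keyed by `node_id` is convenient because the retrieval prompts further down ask the model to answer with node IDs rather than with text.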
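PageIndex first determines which documents are relevant to your query. Document retrieval can be done in roughly three ways; the prompt below selects documents based on their descriptions: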
# Literal braces inside an f-string must be doubled ({{ and }}); only {query} is substituted.
prompt = f"""
You are given a list of documents with their IDs, file names, and descriptions. Your task is to select documents that may contain information relevant to answering the user query.
Query: {query}
Documents: [
    {{
        "doc_id": "xxx",
        "doc_name": "xxx",
        "doc_description": "xxx"
    }}
]
Response Format:
{{
    "thinking": "<Your reasoning for document selection>",
    "answer": <Python list of relevant doc_ids>, e.g. ['doc_id1', 'doc_id2']. Return [] if no documents are relevant.
}}
Return only the JSON structure, with no additional output.
"""
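Next, the large language model reasons over the table-of-contents tree to identify the relevant nodes. Once the contents of those nodes have been fetched, the answer is generated iteratively.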
prompt = f"""
You are given a query and the tree structure of a document.
You need to find all nodes that are likely to contain the answer.
Query: {query}
Document tree structure: {PageIndex_Tree}
Reply in the following JSON format:
{{
"thinking": <your reasoning about which nodes are relevant>,
"node_list": [node_id1, node_id2, ...]
}}
"""
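In addition, hybrid tree search is supported: for example, recall is performed at the chunk level, and the recalled chunks are then used to select nodes.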
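One way to picture the chunk-to-node step is to aggregate chunk scores per node. The sketch below assumes that every recalled chunk carries the `node_id` of the node it belongs to, and it uses an illustrative scoring rule (sum of chunk scores divided by the square root of the chunk count plus one) chosen so that nodes with many weak chunks do not automatically beat a node with one strong chunk. Both the rule and the name `rank_nodes` are assumptions for this example, not a statement of what PageIndex implements.

```python
import math
from collections import defaultdict


def rank_nodes(chunk_hits: list, top_n: int = 3) -> list:
    """Turn (node_id, chunk_score) hits into a ranked list of (node_id, score)."""
    per_node = defaultdict(list)
    for node_id, score in chunk_hits:
        per_node[node_id].append(score)
    scored = {
        # Sum rewards nodes with several relevant chunks; the square-root
        # denominator dampens the advantage of simply having many chunks.
        node_id: sum(scores) / math.sqrt(len(scores) + 1)
        for node_id, scores in per_node.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical top-k output of a chunk-level vector search.
chunk_hits = [("0004", 0.9), ("0005", 0.5), ("0005", 0.5), ("0001", 0.2)]
ranked = rank_nodes(chunk_hits)
print([node_id for node_id, _ in ranked])
```

The top-ranked nodes can then be passed to the language model for the same reasoning-based selection used in plain tree search.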