Say goodbye to fragmented retrieval! RSE gives the LLM more coherent context and more accurate, reliable output. Key topics:
1. The fragmentation pain points of traditional RAG retrieval
2. The core principles and advantages of RSE
3. A hands-on walkthrough, from PDF processing to a complete implementation
Are you still using traditional RAG (Retrieval-Augmented Generation) to retrieve a pile of scattered text fragments and then making the LLM piece them together itself? Come on, you are turning the poor model into a jigsaw-puzzle champion! Today we'll look at how Relevant Segment Extraction (RSE) gives your RAG system coherent context, so the model can work at full strength and produce more reliable, better-reasoned answers.
First, a pointed question: have you ever watched an LLM reduced to incoherence by a pile of text chunks with no beginning and no end?
Traditional RAG follows a familiar routine: chunk the document, embed each chunk, retrieve the top-k chunks most similar to the query, and stuff them into the prompt.
It looks reasonable, but there is a hidden trap: relevant content is usually contiguous in the source document, yet what you retrieve is a scatter of pieces. The context breaks, information is lost, and the model has to work much harder to make sense of it.
For example, you ask: "What is explainable AI, and why does it matter?" Traditional RAG may hand back a jumble of disconnected fragments, and the model's reaction is: what is this? Now I have to assemble these scraps into a coherent essay myself. Exhausting!
Enter Relevant Segment Extraction (RSE). Its core idea is simple:
❝ Relevant content tends to be contiguous within a document, so retrieve it as whole segments instead of making the model stitch fragments back together.
The RSE pipeline is as follows: split the document into fixed-size chunks, score each chunk's relevance to the query, subtract a penalty for irrelevant chunks, find the contiguous runs of high-value chunks with a maximum-subarray-style search, and stitch each run back into a complete segment. This way, the model receives continuous context with a clear beginning and end; it is far easier to understand, and the output is naturally more reliable.
First, use PyMuPDF (fitz) to pull all the text out of the PDF:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Extract the full text of a PDF, concatenated page by page."""
    mypdf = fitz.open(pdf_path)
    all_text = ""
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]
        text = page.get_text("text")
        all_text += text
    return all_text
The art of chunking: 800 characters per chunk with no overlap, which makes it straightforward to reconstruct contiguous segments later.
def chunk_text(text, chunk_size=800, overlap=0):
    """Split text into fixed-size character chunks (no overlap by default)."""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
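A quick sanity check of the chunk arithmetic (the function is repeated here so the snippet runs on its own): 2,000 characters at chunk_size=800 with no overlap should yield chunks of 800, 800, and 400 characters.

```python
def chunk_text(text, chunk_size=800, overlap=0):
    """Split text into fixed-size character chunks."""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# 2000 characters -> chunks of 800, 800, and 400
chunks = chunk_text("a" * 2000, chunk_size=800, overlap=0)
print([len(c) for c in chunks])  # → [800, 800, 400]
```

Because the chunks never overlap, contiguous chunk indices map directly back to contiguous character ranges, which is exactly what the segment reconstruction step relies on.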
Use the OpenAI API (or any other embedding model) to turn each chunk into a vector and store it in a hand-rolled SimpleVectorStore:
import numpy as np

class SimpleVectorStore:
    """A minimal in-memory vector store with cosine-similarity search."""

    def __init__(self, dimension=1536):
        self.dimension = dimension
        self.vectors = []
        self.documents = []
        self.metadata = []

    def add_documents(self, documents, vectors=None, metadata=None):
        if vectors is None:
            vectors = [None] * len(documents)
        if metadata is None:
            metadata = [{} for _ in range(len(documents))]
        for doc, vec, meta in zip(documents, vectors, metadata):
            self.documents.append(doc)
            self.vectors.append(vec)
            self.metadata.append(meta)

    def search(self, query_vector, top_k=5):
        query_array = np.array(query_vector)
        similarities = []
        for i, vector in enumerate(self.vectors):
            if vector is not None:
                # Cosine similarity between the query and the stored vector
                similarity = np.dot(query_array, vector) / (
                    np.linalg.norm(query_array) * np.linalg.norm(vector)
                )
                similarities.append((i, similarity))
        similarities.sort(key=lambda x: x[1], reverse=True)
        results = []
        for i, score in similarities[:top_k]:
            results.append({
                "document": self.documents[i],
                "score": float(score),
                "metadata": self.metadata[i],
            })
        return results
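The search method boils down to cosine similarity. Here is the same computation in plain Python on toy 2-D vectors, just to make the scoring concrete:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

query = [1.0, 0.0]
print(cosine(query, [1.0, 0.0]))            # → 1.0 (same direction)
print(cosine(query, [0.0, 1.0]))            # → 0.0 (orthogonal)
print(round(cosine(query, [1.0, 1.0]), 4))  # → 0.7071
```

With high-dimensional embeddings the arithmetic is identical; only the vector length changes.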
Score each chunk's relevance to the query, subtract an irrelevance penalty, and get a per-chunk "chunk value":
def calculate_chunk_values(query, chunks, vector_store, irrelevant_chunk_penalty=0.2):
    """Score every chunk against the query; unscored chunks default to 0."""
    # create_embeddings is the embedding helper (e.g. an OpenAI embeddings call)
    query_embedding = create_embeddings([query])[0]
    num_chunks = len(chunks)
    results = vector_store.search(query_embedding, top_k=num_chunks)
    relevance_scores = {result["metadata"]["chunk_index"]: result["score"]
                        for result in results}
    chunk_values = []
    for i in range(num_chunks):
        score = relevance_scores.get(i, 0.0)
        value = score - irrelevant_chunk_penalty
        chunk_values.append(value)
    return chunk_values
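The effect of the penalty is easy to see on toy scores (the similarity values here are invented for illustration): chunks scoring below 0.2 go negative, so a long run of weak chunks drags down the total of any candidate segment that includes them.

```python
# Hypothetical similarity scores for 5 chunks; 0.2 is the irrelevance penalty
relevance_scores = [0.7, 0.6, 0.1, 0.05, 0.8]
penalty = 0.2
chunk_values = [round(s - penalty, 2) for s in relevance_scores]
print(chunk_values)  # → [0.5, 0.4, -0.1, -0.15, 0.6]
```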
Using the idea behind maximum-subarray search, find the contiguous runs of high-scoring chunks, capping each segment at 20 chunks and the total at 30 chunks:
def find_best_segments(chunk_values, max_segment_length=20, total_max_length=30, min_segment_value=0.2):
    """Greedily pick non-overlapping contiguous segments with maximal summed value."""
    best_segments = []
    segment_scores = []
    total_included_chunks = 0
    while total_included_chunks < total_max_length:
        best_score = min_segment_value
        best_segment = None
        for start in range(len(chunk_values)):
            # Skip starts that fall inside an already-selected segment
            if any(start >= s[0] and start < s[1] for s in best_segments):
                continue
            for length in range(1, min(max_segment_length, len(chunk_values) - start) + 1):
                end = start + length
                # Skip candidates that would overlap a selected segment
                if any(end > s[0] and start < s[1] for s in best_segments):
                    continue
                segment_value = sum(chunk_values[start:end])
                if segment_value > best_score:
                    best_score = segment_value
                    best_segment = (start, end)
        if best_segment:
            best_segments.append(best_segment)
            segment_scores.append(best_score)
            total_included_chunks += best_segment[1] - best_segment[0]
        else:
            break
    # Sort segments by document position, keeping their scores aligned
    paired = sorted(zip(best_segments, segment_scores), key=lambda p: p[0][0])
    best_segments = [seg for seg, _ in paired]
    segment_scores = [score for _, score in paired]
    return best_segments, segment_scores
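Exercising the search on a toy score list (the function is repeated in condensed form so the snippet runs standalone): chunks 0–1 and chunk 4 score well while chunks 2–3 are penalized, so two separate segments come back rather than one long one that would drag in the negative chunks.

```python
def find_best_segments(chunk_values, max_segment_length=20,
                       total_max_length=30, min_segment_value=0.2):
    """Greedily pick non-overlapping contiguous segments with maximal summed value."""
    best_segments, segment_scores = [], []
    total_included_chunks = 0
    while total_included_chunks < total_max_length:
        best_score, best_segment = min_segment_value, None
        for start in range(len(chunk_values)):
            if any(s[0] <= start < s[1] for s in best_segments):
                continue  # start is inside an already-selected segment
            for length in range(1, min(max_segment_length, len(chunk_values) - start) + 1):
                end = start + length
                if any(end > s[0] and start < s[1] for s in best_segments):
                    continue  # candidate overlaps a selected segment
                segment_value = sum(chunk_values[start:end])
                if segment_value > best_score:
                    best_score, best_segment = segment_value, (start, end)
        if best_segment is None:
            break
        best_segments.append(best_segment)
        segment_scores.append(best_score)
        total_included_chunks += best_segment[1] - best_segment[0]
    # Sort by position while keeping scores aligned with their segments
    paired = sorted(zip(best_segments, segment_scores))
    return [s for s, _ in paired], [v for _, v in paired]

chunk_values = [0.5, 0.4, -0.5, -0.5, 0.6]
best_segments, _ = find_best_segments(chunk_values)
print(best_segments)  # → [(0, 2), (4, 5)]
```

Note how the search refuses to bridge the negative chunks 2–3: a single segment (0, 5) would sum to 0.5, worse than the 0.9 of (0, 2) alone.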
Stitch the contiguous chunks back into complete context segments and format them for the LLM:
def reconstruct_segments(chunks, best_segments):
    """Join each run of contiguous chunks back into a single text segment."""
    reconstructed_segments = []
    for start, end in best_segments:
        segment_text = " ".join(chunks[start:end])
        reconstructed_segments.append({
            "text": segment_text,
            "segment_range": (start, end),
        })
    return reconstructed_segments

def format_segments_for_context(segments):
    """Label each segment so the LLM can tell where one ends and the next begins."""
    context = []
    for i, segment in enumerate(segments):
        segment_header = f"SEGMENT {i+1} (Chunks {segment['segment_range'][0]}-{segment['segment_range'][1]-1}):"
        context.append(segment_header)
        context.append(segment['text'])
        context.append("-" * 80)
    return "\n\n".join(context)
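A toy run of reconstruction and formatting (both helpers repeated in condensed form so the snippet is standalone; the chunk texts are invented for illustration):

```python
def reconstruct_segments(chunks, best_segments):
    """Join each run of contiguous chunks into one text segment."""
    return [{"text": " ".join(chunks[start:end]), "segment_range": (start, end)}
            for start, end in best_segments]

def format_segments_for_context(segments):
    """Label each segment with its chunk range for the LLM."""
    context = []
    for i, segment in enumerate(segments):
        start, end = segment["segment_range"]
        context.append(f"SEGMENT {i+1} (Chunks {start}-{end-1}):")
        context.append(segment["text"])
        context.append("-" * 80)
    return "\n\n".join(context)

chunks = ["Explainable AI makes model decisions transparent.",
          "That transparency is what lets users trust the output.",
          "An unrelated paragraph about something else."]
context = format_segments_for_context(reconstruct_segments(chunks, [(0, 2)]))
print(context.splitlines()[0])  # → SEGMENT 1 (Chunks 0-1):
```

Only the selected range (chunks 0–1) reaches the prompt, joined into one continuous passage; the unrelated chunk never appears.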
Finally, hand the formatted context and the question to the LLM to generate the answer:
def generate_response(query, context, model="meta-llama/Llama-3.2-3B-Instruct"):
    """Ask the LLM to answer the query using only the retrieved segments."""
    system_prompt = """You are a helpful assistant that answers questions based on the provided context.
The context consists of document segments that have been retrieved as relevant to the user's query.
Use the information from these segments to provide a comprehensive and accurate answer.
If the context doesn't contain relevant information to answer the question, say so clearly."""
    user_prompt = f"""
Context:
{context}

Question: {query}

Please provide a helpful answer based on the context provided.
"""
    # client is an OpenAI-compatible chat client created elsewhere
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )
    return response.choices[0].message.content
Now for a head-to-head comparison. Same question: "What is explainable AI, and why does it matter?" Even when an LLM is asked to judge the two answers itself, the RSE answer comes out ahead on staying on topic, contextual coherence, and completeness of information.
Going further: the irrelevance penalty, the per-segment length cap, and the minimum segment value are all tunable knobs worth adjusting for your corpus.
RSE is not magic; it is an engineering upgrade that makes retrieval match the way humans actually read.
Stop making your LLM clean up after fragmented retrieval: arm your RAG system with RSE and get more reliable, better-reasoned answers!