Demystifying a RAG Power Tool: How Contextual Retrieval and Hybrid Search Improve Generation Quality

Published: 2024-10-28 07:50:10

Author: AI技术研习社

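Retrieval-augmented generation (RAG) is a technique in natural language processing (NLP) for producing accurate, well-grounded responses. Unlike traditional models that rely only on their internal knowledge, RAG strengthens a model by retrieving relevant information from external documents or databases during generation.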

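This means the model can "look up" relevant data and use it to formulate a more precise response, especially for topics that require up-to-date or specialized information. Contextual retrieval and hybrid search are two techniques for improving RAG systems. Both fall under the broader RAG umbrella, but they represent more sophisticated ways of retrieving relevant information.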

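Contextual retrieval goes beyond simple keyword matching. Rather than only finding documents that contain the exact words in the query, it retrieves information based on what the query means. This requires understanding context and semantics, so the system can fetch documents that are more relevant and more meaningful.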

Hybrid search combines two methods of retrieving information:

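Lexical search (BM25): This traditional method retrieves documents based on exact keyword matches. For example, if you search for "cat on the mat", it ranks highest the documents that contain those exact words, weighting rarer terms more heavily than common ones.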

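Embedding-based search (dense retrieval): This newer method retrieves documents by comparing their semantics. Both the query and the documents are converted into high-dimensional vectors (embeddings), and the system retrieves the documents whose meaning (vector representation) is closest to that of the query.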
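To make "closest in meaning" concrete, here is a minimal sketch of ranking by cosine similarity. It assumes hand-made 3-dimensional vectors standing in for real embeddings; the vectors, the extra "kitten" sentence, and the names `cosine_similarity`, `toy_embeddings`, and `toy_query` are all hypothetical and not part of the original example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real models produce hundreds of dimensions.
toy_embeddings = {
    "The cat sat on the mat.": [0.9, 0.1, 0.0],
    "A kitten rested on the rug.": [0.8, 0.2, 0.1],
    "The sun is shining bright.": [0.0, 0.1, 0.9],
}
toy_query = [0.85, 0.15, 0.05]  # stands in for the embedding of "cat on mat"

# Rank documents from most to least similar to the query vector.
ranked = sorted(
    toy_embeddings,
    key=lambda doc: cosine_similarity(toy_query, toy_embeddings[doc]),
    reverse=True,
)
print(ranked)
```

In this toy setup the "kitten" sentence scores almost as high as the "cat" sentence even though it shares no content keywords with the query, which is the behavior dense retrieval is meant to provide.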

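Combining the two methods lets hybrid search deliver better results. It draws on the precision of keyword-based BM25 and the semantic understanding of dense retrieval, so the system finds the most relevant documents based on both the words used and what they mean.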

The key advantage of combining BM25 with contextual embeddings is that their strengths complement each other:

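BM25: Excels at exact keyword matching, which suits scenarios where specific terms are critical.
Embedding-based retrieval: Understands deeper semantics and captures intent even when the query does not contain the exact keywords.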

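This combination lets a RAG system retrieve documents that both contain the right keywords and match the intent of the query, which in turn improves the quality of the generated content.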

Implementing Hybrid Search: Code Example

In this example, we use the rank_bm25 library to implement lexical search:

from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize
# Note: word_tokenize needs NLTK's tokenizer data to be downloaded beforehand

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog barked at the moon.",
    "The sun is shining bright.",
]

# Tokenize the documents
tokenized_corpus = [word_tokenize(doc.lower()) for doc in documents]

# Initialize BM25
bm25 = BM25Okapi(tokenized_corpus)

# Query
query = "cat on mat"
tokenized_query = word_tokenize(query.lower())

# Retrieve BM25 results
bm25_scores = bm25.get_scores(tokenized_query)
bm25_results = bm25.get_top_n(tokenized_query, documents, n=3)

print("BM25 Results: ", bm25_results)
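To show roughly what a BM25 score is made of, the following is a minimal pure-Python sketch of one common Okapi BM25 variant. It is an illustration under stated assumptions, not a reproduction of rank_bm25's internals: the function name `okapi_bm25_scores`, the smoothed IDF formula, the parameter values `k1=1.5` and `b=0.75`, and the whitespace tokenization are all choices made here for the example, so the numbers may differ from what `get_scores` returns.

```python
import math
from collections import Counter

def okapi_bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF (always positive); an assumption, libraries differ here.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through b.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized by whitespace for simplicity.
toy_docs = [
    "the cat sat on the mat".split(),
    "the dog barked at the moon".split(),
    "the sun is shining bright".split(),
]
toy_scores = okapi_bm25_scores("cat on mat".split(), toy_docs)
print(toy_scores)
```

Only the first document contains any of the query terms, so it is the only one with a non-zero score. This also illustrates the limitation that dense retrieval addresses: a document about a "kitten" would score zero for the query "cat".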

Here we use transformers to create dense embeddings for the documents and the query, then use faiss to find the most similar documents:

from transformers import AutoTokenizer, AutoModel
import torch
import faiss

# Load a pre-trained model and tokenizer for embedding creation
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Embed texts by mean-pooling token embeddings, ignoring padding tokens
def embed_texts(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state
    # The attention mask is 1 for real tokens and 0 for padding
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# Generate embeddings for the documents
doc_embeddings = embed_texts(documents).numpy()

# Generate an embedding for the query
query_embedding = embed_texts([query]).numpy()

# Use FAISS to index and search the documents based on embeddings
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(doc_embeddings)

# Search for the top 3 most similar documents
_, dense_results_idx = index.search(query_embedding, 3)
dense_results = [documents[idx] for idx in dense_results_idx[0]]

print("Dense Retrieval Results: ", dense_results)
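Conceptually, a flat L2 index performs an exhaustive comparison: it measures the squared Euclidean distance from the query to every stored vector and returns the closest ones. The sketch below mimics that idea in pure Python on hypothetical 2-dimensional vectors; the function `l2_search` and the toy data are illustrative names introduced here, not part of FAISS.

```python
def l2_search(query_vec, doc_vecs, k):
    """Exhaustive nearest-neighbor search by squared Euclidean distance."""
    distances = [
        sum((q - d) ** 2 for q, d in zip(query_vec, vec))
        for vec in doc_vecs
    ]
    # Smaller distance means more similar, so sort ascending and keep k.
    order = sorted(range(len(doc_vecs)), key=lambda i: distances[i])[:k]
    return [distances[i] for i in order], order

# Hypothetical 2-dimensional vectors standing in for document embeddings.
toy_vecs = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
toy_dists, toy_ids = l2_search([1.0, 2.0], toy_vecs, k=2)
print(toy_ids, toy_dists)  # prints [1, 0] [1.0, 5.0]
```

Because every vector is compared, this approach is exact but its cost grows linearly with the number of documents, which is fine for a three-document demo and is the reason larger collections typically use approximate indexes.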

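To perform hybrid search, we combine the BM25 and dense retrieval results. The scores from each method are normalized and weighted to produce the best overall ranking: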

import numpy as np

# Normalize BM25 scores to [0, 1] (guard against an all-zero score vector)
bm25_scores = np.array(bm25_scores)
bm25_max = np.max(bm25_scores)
bm25_scores_normalized = bm25_scores / bm25_max if bm25_max > 0 else np.zeros_like(bm25_scores)

# Compute L2 distances and convert them to similarities (closer means higher)
dense_scores = np.linalg.norm(query_embedding - doc_embeddings, axis=1)
dense_scores_normalized = 1 - (dense_scores / np.max(dense_scores))

# Combine the normalized scores (you can adjust the weights as needed)
combined_scores = 0.5 * bm25_scores_normalized + 0.5 * dense_scores_normalized

# Get the top documents based on combined scores
top_idx = combined_scores.argsort()[::-1]
hybrid_results = [documents[i] for i in top_idx[:3]]

print("Hybrid Search Results: ", hybrid_results)
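The normalize-and-weight step can also be wrapped in a small reusable function. The following is a sketch only, assuming min-max normalization (a slight variation on the divide-by-maximum approach above) and a single weight `alpha` for the lexical side; `min_max`, `fuse_scores`, `alpha`, and the toy numbers are hypothetical names and values chosen for illustration.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def fuse_scores(lexical_scores, dense_distances, alpha=0.5):
    """Blend lexical scores (higher is better) with dense distances (lower is better)."""
    lexical = min_max(lexical_scores)
    # Flip distances so that the closest document gets similarity 1.0.
    dense = [1.0 - d for d in min_max(dense_distances)]
    return [alpha * lex + (1 - alpha) * den for lex, den in zip(lexical, dense)]

toy_lexical = [2.0, 0.0, 0.0]    # hypothetical BM25 scores
toy_distances = [0.4, 1.0, 1.2]  # hypothetical L2 distances
fused = fuse_scores(toy_lexical, toy_distances, alpha=0.5)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print(ranking)  # prints [0, 1, 2]
```

Raising `alpha` toward 1.0 favors exact keyword matches, while lowering it toward 0.0 favors semantic closeness; a suitable value depends on the data and is usually found by evaluating on representative queries.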