
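01 RAG Tutorial (Qwen3-Embedding-0.6B + Qwen3-Reranker-0.6B)

Highlights: a step-by-step guide to building a RAG pipeline with Qwen3's newly released embedding and reranker models. The two-stage retrieval design (recall + rerank) balances efficiency and accuracy.

Environment setup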

! pip install --upgrade pymilvus openai requests tqdm sentence-transformers transformers

Requires transformers>=4.51.0
Requires sentence-transformers>=2.7.0

In this example we use OpenAI as the large language model for text generation, so you need to provide your API key as the OPENAI_API_KEY environment variable.

import os

os.environ["OPENAI_API_KEY"] = "sk-************"

Data preparation

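We use the FAQ pages from the Milvus 2.4.x documentation as the private knowledge for our RAG; they are a good data source for a basic RAG pipeline.

Download the zip file and extract the documents into the folder milvus_docs.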

! wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
! unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs

We load all markdown files from the folder milvus_docs/en/faq. For each document, we simply split the content on "# ", which roughly separates the main sections of each markdown file.
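To see what this split produces, here is a toy illustration on a made-up markdown string. The sample text and the `clean_sections` helper are assumptions for illustration and are not part of the tutorial code. Because "# " also matches the tail of deeper headings such as "#### ", each chunk keeps a trailing run of "#" characters, which would explain why the retrieved passages later in this tutorial end with "###".

```python
def split_sections(markdown_text):
    """Split markdown on "# ", the same rule the tutorial applies to each file."""
    return markdown_text.split("# ")


def clean_sections(sections):
    """Hypothetical cleanup: drop empty chunks and trailing '#' residue."""
    cleaned = [s.rstrip("#\n ").strip() for s in sections]
    return [s for s in cleaned if s]


# Made-up sample that mimics an FAQ page with "#### " question headings.
sample = (
    "# FAQ\n\nIntro.\n\n"
    "#### Question one?\n\nAnswer one.\n\n"
    "#### Question two?\n\nAnswer two.\n"
)

sections = split_sections(sample)
for chunk in sections:
    print(repr(chunk))

print(clean_sections(sections))
```

The tutorial inserts the raw chunks as they are; a cleanup step like this is optional and only matters if the empty first chunk or the heading residue turns out to hurt retrieval.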

from glob import glob

text_lines = []

for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()

    text_lines += file_text.split("# ")

Prepare the LLM and embedding model

This example uses Qwen3-Embedding-0.6B for text embedding and Qwen3-Reranker-0.6B to rerank the retrieved results.

from openai import OpenAI
from sentence_transformers import SentenceTransformer
import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

# Initialize OpenAI client for LLM generation
openai_client = OpenAI()

# Load Qwen3-Embedding-0.6B model for text embeddings
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Load Qwen3-Reranker-0.6B model for reranking
reranker_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B", padding_side='left')
reranker_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B").eval()

# Reranker configuration
token_false_id = reranker_tokenizer.convert_tokens_to_ids("no")
token_true_id = reranker_tokenizer.convert_tokens_to_ids("yes")
max_reranker_length = 8192

prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = reranker_tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = reranker_tokenizer.encode(suffix, add_special_tokens=False)

Define a function that generates text embeddings with the Qwen3-Embedding-0.6B model. It is used for both document embeddings and query embeddings.
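The difference between query and document encoding comes down to a prompt prefix. In sentence-transformers, calling `encode` with `prompt_name="query"` looks up a named prompt in the model's configuration and prepends it to the text before tokenization. The sketch below mimics that step with plain strings. The prompt text in `PROMPTS` and the helper `apply_prompt` are assumptions for illustration; the actual prompt string ships with the model's configuration and may differ.

```python
# Assumed prompt table; a real model defines its own in its configuration.
PROMPTS = {
    "query": (
        "Instruct: Given a web search query, retrieve relevant passages "
        "that answer the query\nQuery:"
    ),
}


def apply_prompt(text, prompt_name=None):
    """Return the string that would be tokenized for a given prompt name."""
    if prompt_name is None:
        return text  # documents are encoded as-is
    return PROMPTS[prompt_name] + text


print(apply_prompt("How is data stored in milvus?", prompt_name="query"))
print(apply_prompt("Milvus stores inserted data in object storage."))
```

Encoding the two sides asymmetrically lets short questions and longer passages land close together in the same vector space, which is why only the query path passes a prompt name.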

def emb_text(text, is_query=False):
    """
    Generate text embeddings using Qwen3-Embedding-0.6B model.

    Args:
        text: Input text to embed
        is_query: Whether this is a query (True) or document (False)

    Returns:
        List of embedding values
    """
    if is_query:
        # For queries, use the "query" prompt for better retrieval performance
        embeddings = embedding_model.encode([text], prompt_name="query")
    else:
        # For documents, use default encoding
        embeddings = embedding_model.encode([text])

    return embeddings[0].tolist()

Define the reranking functions to improve retrieval quality. Together they implement a complete reranking pipeline with Qwen3-Reranker, scoring and reordering candidate documents by their relevance to the query. Each function does the following:

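format_instruction(): formats the query, document, and task instruction into the reranker's standard input format
process_inputs(): tokenizes the formatted text and adds the special tokens the model needs to make its judgment
compute_logits(): uses the reranker to compute a relevance score (between 0 and 1) for each query-document pair
rerank_documents(): reranks documents by query relevance and returns them sorted by relevance score in descending order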
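The score that compute_logits() returns can be read as a two-way softmax over the model's logits for the tokens "yes" and "no" at the final position. A minimal pure-Python sketch of that arithmetic follows; the function name `yes_probability` and the logit values are made up for illustration.

```python
import math


def yes_probability(logit_yes, logit_no):
    """Softmax over two logits, returning the probability mass on "yes"."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(logit_yes, logit_no)
    exp_yes = math.exp(logit_yes - m)
    exp_no = math.exp(logit_no - m)
    return exp_yes / (exp_yes + exp_no)


# Made-up logits for three query-document pairs.
pairs = {
    "relevant": (6.0, -2.0),
    "borderline": (0.5, 0.5),
    "irrelevant": (-3.0, 4.0),
}
for name, (l_yes, l_no) in pairs.items():
    print(name, round(yes_probability(l_yes, l_no), 4))
```

This is the same quantity as a sigmoid of the difference between the two logits, so a score close to 1 means the model strongly prefers "yes" over "no" for that pair.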

def format_instruction(instruction, query, doc):
    """Format instruction for reranker input"""
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(
        instruction=instruction, query=query, doc=doc
    )
    return output

def process_inputs(pairs):
    """Process inputs for reranker"""
    inputs = reranker_tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False, max_length=max_reranker_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = reranker_tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_reranker_length)
    for key in inputs:
        inputs[key] = inputs[key].to(reranker_model.device)
    return inputs

@torch.no_grad()
def compute_logits(inputs, **kwargs):
    """Compute relevance scores using reranker"""
    batch_scores = reranker_model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

def rerank_documents(query, documents, task_instruction=None):
    """
    Rerank documents based on query relevance using Qwen3-Reranker

    Args:
        query: Search query
        documents: List of documents to rerank
        task_instruction: Task instruction for reranking

    Returns:
        List of (document, score) tuples sorted by relevance score
    """
    if task_instruction is None:
        task_instruction = 'Given a web search query, retrieve relevant passages that answer the query'

    # Format inputs for reranker
    pairs = [format_instruction(task_instruction, query, doc) for doc in documents]

    # Process inputs and compute scores
    inputs = process_inputs(pairs)
    scores = compute_logits(inputs)

    # Combine documents with scores and sort by score (descending)
    doc_scores = list(zip(documents, scores))
    doc_scores.sort(key=lambda x: x[1], reverse=True)

    return doc_scores

Generate a test embedding and print its dimension and first few elements.

test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])

Example output:

1024
[-0.009923271834850311, -0.030248118564486504, -0.011494234204292297, -0.05980192497372627, -0.0026795873418450356, 0.016578301787376404, -0.04073038697242737, 0.03180320933461189, -0.024417787790298462, 2.1764861230622046e-05]

Load data into Milvus

Create the collection

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")

collection_name = "my_rag_collection"

About the MilvusClient arguments:

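Setting the URI to a local file (for example ./milvus.db) is the most convenient option, because it automatically uses Milvus Lite to store all data in that file.
If you have large-scale data, you can set up a more performant Milvus server on Docker or Kubernetes. In that case, use the server's URI (for example http://localhost:19530) as your URI.
If you want to use Zilliz Cloud (the fully managed cloud service for Milvus), set the URI and token to the Public Endpoint and API key from Zilliz Cloud, respectively.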
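The three options above differ only in the arguments passed to MilvusClient. The sketch below builds those arguments as a plain dictionary so the choice can be driven by configuration. The helper `milvus_client_kwargs` and its mode names are hypothetical, and the default values simply reuse the examples from this tutorial.

```python
def milvus_client_kwargs(mode, uri=None, token=None):
    """Return keyword arguments for MilvusClient for one of three deployment modes."""
    if mode == "lite":
        # Local file: Milvus Lite stores all data in this file.
        return {"uri": uri or "./milvus_demo.db"}
    if mode == "server":
        # Self-hosted Milvus on Docker or Kubernetes.
        return {"uri": uri or "http://localhost:19530"}
    if mode == "zilliz":
        # Zilliz Cloud: uri is the Public Endpoint, token is the API key.
        if not uri or not token:
            raise ValueError("Zilliz Cloud needs both a public endpoint and an API key")
        return {"uri": uri, "token": token}
    raise ValueError(f"unknown mode: {mode}")


kwargs = milvus_client_kwargs("lite")
print(kwargs)

# With pymilvus installed, the client would then be created as:
# from pymilvus import MilvusClient
# milvus_client = MilvusClient(**kwargs)
```

Keeping the connection arguments in one place makes it easier to start with a local file and later point the same notebook at a server or at Zilliz Cloud.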

Check whether the collection already exists, and drop it if it does.

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

Create a new collection with the specified parameters.

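If no field information is specified, Milvus automatically creates a default ID field as the primary key and a vector field to store the vector data. A reserved JSON field stores any fields that are not defined in the schema, along with their values.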

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)

Insert data

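Iterate over the text lines, create the embeddings, and insert the data into Milvus.

The text field used below is a new field that is not defined in the collection schema. It is added automatically and can be used like a normal field (under the hood it is stored in the reserved JSON dynamic field, but you do not need to care about that implementation detail).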
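As a mental model only, the sketch below shows how a row with an extra text key relates to the two default fields. It is not how Milvus is implemented, and `split_row` is a hypothetical helper: it just separates the keys that match the default schema (id, vector) from the ones that would land in the dynamic JSON field.

```python
SCHEMA_FIELDS = ("id", "vector")  # the default primary-key and vector fields


def split_row(row, schema_fields=SCHEMA_FIELDS):
    """Separate schema-defined keys from keys that go to the dynamic field."""
    missing = [name for name in schema_fields if name not in row]
    if missing:
        raise KeyError(f"row is missing schema fields: {missing}")
    defined = {k: v for k, v in row.items() if k in schema_fields}
    dynamic = {k: v for k, v in row.items() if k not in schema_fields}
    return defined, dynamic


# Toy row with a 3-dimensional vector; real rows carry 1024 values.
row = {"id": 0, "vector": [0.1, 0.2, 0.3], "text": "Where does Milvus store data?"}
defined, dynamic = split_row(row)
print(defined)
print(dynamic)
```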

from tqdm import tqdm

data = []

for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)

Example output:

Creating embeddings: 100%|██████████████████████████████████████████████████████████████████████████| 72/72 [00:08<00:00,  8.68it/s]
{'insert_count': 72, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'cost': 0}

Enhancing RAG with reranking

Retrieve data

Let's pose a common question about Milvus.

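question = "How is data stored in milvus?"

Search the collection for the question and retrieve the 10 candidates with the highest semantic similarity, then use the reranker to select the best 3 matches.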
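The two-stage flow (recall a larger candidate set by vector similarity, then rerank and keep a few) can be sketched without any model. In the toy below, the vectors, the document texts, and the `toy_rerank_score` function are all made up. Stage one ranks by inner product, matching the "IP" metric used in this tutorial, and stage two stands in for the reranker with a simple word-overlap score.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def recall(query_vec, docs, limit):
    """Stage 1: rank all documents by inner product and keep the top `limit`."""
    ranked = sorted(docs, key=lambda d: inner_product(query_vec, d["vector"]), reverse=True)
    return ranked[:limit]


def toy_rerank_score(query, text):
    """Stand-in for a reranker: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, top_k):
    """Stage 2: rescore the candidates and keep the best `top_k`."""
    scored = [(d["text"], toy_rerank_score(query, d["text"])) for d in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up corpus with 2-dimensional vectors.
docs = [
    {"text": "how does milvus flush data", "vector": [0.9, 0.1]},
    {"text": "supported vector data types", "vector": [0.7, 0.6]},
    {"text": "where is milvus data stored", "vector": [0.7, 0.3]},
    {"text": "how to install milvus", "vector": [0.1, 0.9]},
]
query, query_vec = "how is data stored in milvus", [1.0, 0.2]

candidates = recall(query_vec, docs, limit=3)
top = rerank(query, candidates, top_k=2)
print([d["text"] for d in candidates])
print(top)
```

The first stage is cheap and runs over the whole collection, while the second stage is more expensive per document and therefore only sees the short candidate list. That split is what lets the pipeline trade a little recall-stage precision for speed and then recover precision in the rerank step.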

# Step 1: Initial retrieval with larger candidate set
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        emb_text(question, is_query=True)
    ],  # Use the `emb_text` function with query prompt to convert the question to an embedding vector
    limit=10,  # Return top 10 candidates for reranking
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

# Step 2: Extract candidate documents for reranking
candidate_docs = [res["entity"]["text"] for res in search_res[0]]

# Step 3: Rerank documents using Qwen3-Reranker
print("Reranking documents...")
reranked_docs = rerank_documents(question, candidate_docs)

# Step 4: Select top 3 reranked documents
top_reranked_docs = reranked_docs[:3]
print(f"Selected top {len(top_reranked_docs)} documents after reranking")

Let's look at the reranked results for this query!

import json

# Display reranked results with reranker scores
reranked_lines_with_scores = [
    (doc, score) for doc, score in top_reranked_docs
]
print("Reranked results:")
print(json.dumps(reranked_lines_with_scores, indent=4))

# Also show original embedding-based results for comparison
print("\n" + "="*80)
print("Original embedding-based results (top 3):")
original_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0][:3]
]
print(json.dumps(original_lines_with_distances, indent=4))

Example output:

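The results show the effect of reranking with Qwen3-Reranker: the top two passages stay the same, but the third embedding-based hit (on vector data types and precision) is replaced by a passage on incremental and historical data, which is closer to the question. The reranker also assigns all three of its selected passages scores close to 1.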

Reranked results(top 3):
[
    [
        " Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###",
        0.9997891783714294
    ],
    [
        "How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###",
        0.9989748001098633
    ],
    [
        "Does the query perform in memory? What are incremental data and historical data?\n\nYes. When a query request comes, Milvus searches both incremental data and historical data by loading them into memory. Incremental data are in the growing segments, which are buffered in memory before they reach the threshold to be persisted in storage engine, while historical data are from the sealed segments that are stored in the object storage. Incremental data and historical data together constitute the whole dataset to search.\n\n###",
        0.9984032511711121
    ]
]
================================================================================
Original embedding-based results(top 3):
[
    [
        " Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###",
        0.8306853175163269
    ],
    [
        "How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###",
        0.7302717566490173
    ],
    [
        "How does Milvus handle vector data types and precision?\n\nMilvus supports Binary, Float32, Float16, and BFloat16 vector types.\n\n- Binary vectors: Store binary data as sequences of 0s and 1s, used in image processing and information retrieval.\n- Float32 vectors: Default storage with a precision of about 7 decimal digits. Even Float64 values are stored with Float32 precision, leading to potential precision loss upon retrieval.\n- Float16 and BFloat16 vectors: Offer reduced precision and memory usage. Float16 is suitable for applications with limited bandwidth and storage, while BFloat16 balances range and efficiency, commonly used in deep learning to reduce computational requirements without significantly impacting accuracy.\n\n###",
        0.7003671526908875
    ]
]

Build the RAG response with a large language model (LLM)

Convert the retrieved documents into a single string.

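# Join the text of the top reranked documents; each item is a (document, score) tuple
context = "\n".join(
    [doc_with_score[0] for doc_with_score in top_reranked_docs]
)

Define the system prompt and the user prompt for the LLM. The user prompt is assembled from the documents retrieved from Milvus.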

SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

Use OpenAI's gpt-4o model to generate a response based on the prompts.

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)

Example output:

In Milvus, data is stored in two main forms: inserted data and metadata. Inserted data, which includes vector data, scalar data, and collection-specific schema, is stored in persistent storage as incremental logs. Milvus supports multiple object storage backends for this purpose, including MinIO, AWS S3, Google Cloud Storage, Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage. Metadata for Milvus is generated by its various modules and stored in etcd.

02 Summary

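As the tutorial and its output show, the embedding and reranker models that the Tongyi Qianwen (Qwen) team released in the Qwen3 series perform well. Used together, they give a RAG system a reasonably complete and practical solution.

In terms of design, the embedding model handles queries and documents differently, which reflects a solid understanding of retrieval tasks; the reranker uses a cross-encoder architecture that captures fine-grained query-document interactions; and the two-stage retrieval design in this tutorial (recall + rerank) balances efficiency and accuracy. In particular, Qwen3-Embedding-0.6B (1024 dimensions) and Qwen3-Reranker-0.6B are both relatively lightweight. They support local deployment and reduce dependence on external APIs, lowering hardware requirements while maintaining performance, which makes them a good fit for small and medium-sized businesses and individual developers.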

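In fact, the Qwen3 series shipping embedding and reranker models is neither an isolated case nor a coincidence; it reflects an industry consensus.

The reason is simple: these two modules determine whether a large model can be turned into a product.

The biggest problems with generative large models are high uncertainty, difficult evaluation, and heavy cost.

Solving these problems, whether through RAG, LLM memory, or agents, depends on one premise: whether semantics can be compressed into vector representations that machines can retrieve and judge efficiently.

Embedding and ranking are currently the best path: clear standards, measurable performance, controllable cost, and easy gradual rollout. Embedding determines whether you can "find it"; ranking determines whether you can "pick the right one". This makes them among the first API modules to succeed in model commercialization: high call frequency (every retrieval needs them), high switching cost (they are tied to the index), and high commercial value (they can serve as underlying infrastructure).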