
A Look at Ranking, Starting from qwen3-reranker

Published: 2025-08-02 00:41:58

Author: 北漂程序员日记


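Hello everyone, it has been a while and I have missed you all. In this ever-changing world of us working stiffs, every day brings a new story, hectic yet full of energy. Thank you for your understanding and support.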

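qwen3-reranker has been open source for a little over a month now, and the paper makes several very interesting points. Today let's talk about how generative models are being tried out in ranking.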

qwen3-reranker

Let's first look at the usage example code:

# Requires transformers>=4.51.0
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(instruction=instruction,query=query, doc=doc)
    return output

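def process_inputs(pairs):
    # Tokenize without padding, reserving room for the prefix and suffix tokens.
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False, max_length=max_length - len(prefix_tokens) - len(suffix_tokens)
    )
    # Wrap every tokenized pair in the fixed chat-template prefix and suffix.
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    # Pad the batch (left padding, per the tokenizer setup) and move tensors to the model's device.
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs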

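@torch.no_grad()
def compute_logits(inputs, **kwargs):
    # Logits at the last position, i.e. the distribution over the next token.
    batch_scores = model(**inputs).logits[:, -1, :]
    # Keep only the logits of the "yes" and "no" tokens.
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    # Normalize over the two tokens and use P("yes") as the relevance score.
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores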

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-4B", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-4B").eval()

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-4B", torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda().eval()

token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 8192

prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
        
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = ["What is the capital of China?",
    "Explain gravity",
]

documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

pairs = [format_instruction(task, query, doc) for query, doc in zip(queries, documents)]

# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)

print("scores: ", scores)

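Judging from the paper and the example above, the prompt clearly asks the model to output "yes" or "no". The code, however, does not take the model's generated output. It reads the probability distribution over those two tokens instead (see the function compute_logits).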
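To make the scoring step concrete, here is a minimal sketch in plain Python with no model involved. It mirrors what compute_logits does: keep only the two logits for "no" and "yes", normalize them with a softmax, and use the "yes" probability as the score. The helper name yes_probability and the logit values are made up for illustration.

```python
import math

def yes_probability(logit_no: float, logit_yes: float) -> float:
    """Softmax over the two candidate logits; return P("yes")."""
    m = max(logit_no, logit_yes)  # subtract the max for numerical stability
    e_no = math.exp(logit_no - m)
    e_yes = math.exp(logit_yes - m)
    return e_yes / (e_no + e_yes)

# Made-up logits, for illustration only.
score = yes_probability(logit_no=-1.0, logit_yes=2.0)

# A two-way softmax is equivalent to a sigmoid of the logit difference.
sigmoid = 1.0 / (1.0 + math.exp(-(2.0 - (-1.0))))
print(round(score, 4), round(sigmoid, 4))
```

Because the score is a function of the logit difference only, it stays continuous even though the prompt asks for a binary answer.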

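Traditional cross-encoders use an encoder-only architecture such as BERT or RoBERTa, and their core design idea is "see everything, judge directly". The input is usually formatted as [CLS] Query [SEP] Document [SEP]. Through bidirectional attention, every token in the query and the document can interact with every other token, so the information is fully fused. A continuous relevance score is then produced from the representation of the [CLS] token.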
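As an illustration of that input layout, the sketch below packs a query and a document into a single sequence with segment ids. It is a simplification: tokens are plain pre-split words rather than real WordPiece ids, and build_cross_encoder_input is a hypothetical helper, not a library function.

```python
def build_cross_encoder_input(query_tokens, doc_tokens):
    """Pack a query/document pair BERT-style and return (tokens, segment_ids)."""
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    # Segment 0 covers [CLS] + query + first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_cross_encoder_input(
    ["capital", "of", "china"], ["beijing", "is", "the", "capital"]
)
print(tokens)
print(segment_ids)
```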

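By contrast, Qwen3-Reranker is built on a decoder-only generative architecture and recasts reranking as a natural language understanding and generation problem. The model receives a structured prompt containing the task instruction, the query, and the document, processes it step by step with causal attention, optionally generates a reasoning trace, and finally outputs a probability distribution over "yes" and "no" to judge relevance.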
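The difference between the two attention patterns can be sketched with toy masks, where entry [i][j] is 1 if position i may attend to position j. The function names are illustrative, and real models build these masks as tensors.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position sees every other position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i sees only positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
```

Under a causal mask only the last position has seen the whole prompt, which is why the usage example reads `logits[:, -1, :]` and pads on the left so that the last position is always a real token.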

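On the optimization objective, cross-encoders directly optimize a regression task on the relevance score, usually with a mean squared error loss or a ranking loss. This lets the model output fine-grained relevance scores that are convenient for precise document ranking. The objective is clear and direct, and training is relatively stable. Qwen3-Reranker instead turns relevance judgment into a conditional text generation task and is trained by maximizing the generation probability of the target token ("yes" or "no"). This design gives up fine-grained scoring, but gains stronger interpretability and reasoning ability: the model can output a detailed reasoning process before generating its final answer, which gives the decision a transparent basis.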
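The objectives mentioned above can be written down in a few lines. This is a sketch under assumptions: the pairwise hinge form and its margin are one common choice of ranking loss, not something the article specifies, and the yes/no loss normalizes over just the two tokens as a simplification (a full language-model loss would normalize over the whole vocabulary).

```python
import math

def mse_loss(pred_score, target_score):
    """Regression objective: squared error against a graded relevance label."""
    return (pred_score - target_score) ** 2

def pairwise_hinge_loss(pos_score, neg_score, margin=1.0):
    """Ranking objective: the relevant doc should outscore the irrelevant one by a margin."""
    return max(0.0, margin - (pos_score - neg_score))

def yes_no_nll(logit_no, logit_yes, label_is_relevant):
    """Generative objective: negative log-likelihood of the target token."""
    m = max(logit_no, logit_yes)
    log_z = m + math.log(math.exp(logit_no - m) + math.exp(logit_yes - m))
    target_logit = logit_yes if label_is_relevant else logit_no
    return log_z - target_logit

print(mse_loss(0.8, 1.0))
print(pairwise_hinge_loss(pos_score=0.5, neg_score=0.4))
print(yes_no_nll(logit_no=-1.0, logit_yes=2.0, label_is_relevant=True))
```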

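From a performance standpoint, cross-encoders have a clear advantage. Their parallel computation and relatively small model size (typically 110M to 340M parameters) make inference faster and less resource-hungry.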

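Now consider the stability of results. A generative reranker makes its judgment from token generation probabilities. Greedy decoding gives it some stability, but its output is fundamentally a probability distribution. When the probabilities of "yes" and "no" are close (for example 0.51 vs 0.49), tiny numerical errors or differences between model versions can flip the decision, which makes boundary cases unstable. This is also why the usage example reads the probability directly rather than taking the model's output. Cross-encoders clearly do not have this problem. Generative models, on the other hand, are also very sensitive to the prompt, especially small ones.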
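A small numeric sketch of the boundary case: with made-up logits that put "yes" at roughly 0.51, a tiny perturbation flips the greedy-decoded label, while the probability used as a score moves by only about 0.02. The function names and numbers are illustrative assumptions.

```python
import math

def yes_probability(logit_no, logit_yes):
    """P("yes") from a two-way softmax, written as a sigmoid of the difference."""
    return 1.0 / (1.0 + math.exp(-(logit_yes - logit_no)))

def hard_label(logit_no, logit_yes):
    """What greedy decoding would emit if only "yes" and "no" were possible."""
    return "yes" if logit_yes > logit_no else "no"

base = (0.0, 0.04)     # "yes" barely ahead
nudged = (0.0, -0.04)  # same pair after a small numerical perturbation

for name, (l_no, l_yes) in [("base", base), ("nudged", nudged)]:
    print(name, hard_label(l_no, l_yes), round(yes_probability(l_no, l_yes), 3))
```

Ranking by the probability keeps such a document near the middle of the list in both cases, whereas filtering on the generated token would keep it in one case and drop it in the other.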

Finally, a comparison on quality: on some complex commonsense reasoning cases, generative rerankers do better than cross-encoder rerankers.

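Interestingly, I started experimenting quite early with generative models (qwen2 at first) for negative-sample filtering: for example, judging whether a given passage is relevant to the query and using a fixed output token to filter out the irrelevant ones. This is of course very time-consuming. There are ways to speed it up, such as judging several at once: give the generative model the query plus N contexts and have it output multiple Y/N answers. In practice, though, I ran into one very serious problem: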

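1. Instability. This approach relies on the model's generated output, and at high volume you rarely have enough RPM (requests per minute) quota to call the model this way, even when each call judges several candidates. If the model judges 5 candidates per call and you need to cover at least the top 10-20, each request adds on average 3-4 extra model calls. At scale this turns into piles of timeouts. And after a timeout, how should that request be handled: retry, or keep everything? That in turn raises a lot of fallback-strategy questions, and the problem gets worse when this is combined with a bge ranking model.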
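The batching and fallback questions above can be sketched as follows. Everything here is hypothetical: judge stands in for a real LLM call that returns a comma-separated string such as "Y,N,Y", and "keep the whole batch on timeout or malformed output" is just one possible fallback policy, chosen for illustration.

```python
import math

def calls_needed(num_candidates, batch_size=5):
    """Number of extra model calls when judging batch_size candidates per call."""
    return math.ceil(num_candidates / batch_size)

def parse_flags(raw_output, expected):
    """Parse 'Y,N,Y' style output; return None if it is malformed."""
    flags = [part.strip().upper() for part in raw_output.split(",")]
    if len(flags) != expected or any(f not in ("Y", "N") for f in flags):
        return None
    return [f == "Y" for f in flags]

def filter_candidates(query, candidates, judge, batch_size=5):
    kept = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        try:
            flags = parse_flags(judge(query, batch), len(batch))
        except TimeoutError:
            flags = None
        if flags is None:
            kept.extend(batch)  # assumed fallback: keep everything in this batch
        else:
            kept.extend(doc for doc, ok in zip(batch, flags) if ok)
    return kept

def fake_judge(query, batch):
    """Stand-in for a model call; times out on any batch containing 'slow'."""
    if "slow" in batch:
        raise TimeoutError("simulated timeout")
    return ",".join("Y" if query in doc else "N" for doc in batch)

docs = ["gravity pulls", "cats", "gravity bends light", "dogs", "birds", "slow"]
print(calls_needed(20), filter_candidates("gravity", docs, fake_judge))
```

With 10 to 20 candidates and 5 per call, calls_needed gives 2 to 4 calls per request, which is in line with the 3-4 extra calls mentioned above.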

At the same time, the notable gain (I was using a 32B-chat model) was a large improvement in judging commonsense-reasoning questions.

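So, drawing on my own practical experience and hoping to start a discussion, here is my summary: in many cases, fine-tuning a bge-rerank style model on your own data is basically enough. In some extreme scenarios, such as where there are special requirements on NDCG@1, you can use a generative model as a filter to drop the irrelevant candidates. I do not recommend using a generative ranking model directly. As a pipeline, it looks like this: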

recall -> cross-encoder rank -> decoder-only model semantic filter (optional)
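A minimal sketch of this pipeline with pluggable stages follows. The function rerank_pipeline and the toy word-overlap stand-ins are hypothetical. In practice recall would be a retriever, cross_encoder_score a bge-rerank style model, and semantic_filter a decoder-only model answering yes/no.

```python
def rerank_pipeline(query, corpus, recall, cross_encoder_score,
                    top_k=10, semantic_filter=None):
    """recall -> cross-encoder rank -> optional decoder-only semantic filter."""
    candidates = recall(query, corpus)
    ranked = sorted(candidates,
                    key=lambda doc: cross_encoder_score(query, doc),
                    reverse=True)[:top_k]
    if semantic_filter is not None:
        # The filter only removes candidates; the cross-encoder order is kept.
        ranked = [doc for doc in ranked if semantic_filter(query, doc)]
    return ranked

# Toy stand-ins based on word overlap, for illustration only.
def toy_recall(query, corpus):
    q = set(query.lower().split())
    return [doc for doc in corpus if q & set(doc.lower().split())]

def toy_score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

corpus = [
    "china has a long history",
    "the capital of china is beijing",
    "gravity attracts bodies",
]
query = "capital of china"
without_filter = rerank_pipeline(query, corpus, toy_recall, toy_score)
with_filter = rerank_pipeline(query, corpus, toy_recall, toy_score,
                              semantic_filter=lambda q, d: "capital" in d)
print(without_filter)
print(with_filter)
```

Keeping the filter as an optional last stage means a timeout or failure in the generative model can fall back to the cross-encoder ranking unchanged.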