```shell
pip install -U "optimum[neural-compressor]" intel-extension-for-transformers
```
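```python
# Calibration dataset used in this example: https://huggingface.co/datasets/allenai/qasper
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer
from transformers import AutoModel, AutoTokenizer


def quantize(model_name: str, output_path: str, calibration_set: "datasets.Dataset"):
    """Statically quantize an embedding model to INT8 with Intel Neural Compressor."""
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Tokenize each calibration text to a fixed length of 512 tokens.
    def preprocess_function(examples):
        return tokenizer(examples["text"], padding="max_length", max_length=512, truncation=True)

    vectorized_ds = calibration_set.map(preprocess_function, num_proc=10)
    # Keep only the tokenized columns (input_ids, attention_mask, ...).
    vectorized_ds = vectorized_ds.remove_columns(["text"])

    # Static post-training quantization: activation ranges are collected on the calibration set.
    quantizer = INCQuantizer.from_pretrained(model)
    quantization_config = PostTrainingQuantConfig(approach="static", backend="ipex", domain="nlp")
    quantizer.quantize(
        quantization_config=quantization_config,
        calibration_dataset=vectorized_ds,
        save_directory=output_path,
        batch_size=1,
    )
    # Save the tokenizer next to the quantized model so both load from output_path.
    tokenizer.save_pretrained(output_path)
```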
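Conceptually, `approach="static"` means activation ranges are measured once on the calibration set. They are then frozen into fixed scale and zero-point parameters. The sketch below illustrates that idea with simple per-tensor min/max calibration and affine INT8 rounding in plain Python. The helpers `calibrate`, `quantize_tensor` and `dequantize_tensor` are hypothetical. Intel Neural Compressor's actual calibration algorithms and per-channel schemes may differ.

```python
def calibrate(samples, qmin=-128, qmax=127):
    """Derive an affine scale/zero-point from the min/max of observed activations (hypothetical helper)."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize_tensor(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to clamped INT8 codes: q = round(x / scale) + zero_point."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize_tensor(codes, scale, zero_point):
    """Map INT8 codes back to approximate floats: x ~ (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in codes]


# Two "batches" of activations stand in for a calibration run.
scale, zp = calibrate([[-1.0, 0.5, 2.0], [0.0, 1.5, -0.5]])
codes = quantize_tensor([-1.0, 0.0, 2.0], scale, zp)
```

Values inside the calibrated range round-trip with an error of at most about `scale / 2`. Values outside it are clamped, which is why the calibration set should resemble the real inputs.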
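```python
import torch
from optimum.intel import IPEXModel
from transformers import AutoTokenizer

model = IPEXModel.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")
tokenizer = AutoTokenizer.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")

# Example inputs (hypothetical); replace with your own queries or passages.
sentences = ["What is static quantization?", "A sample passage to embed."]
# padding/truncation are needed to batch sentences of different lengths into one tensor.
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    # get the [CLS] token
    embeddings = outputs[0][:, 0]
```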
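`outputs[0][:, 0]` takes the last hidden state of the first ([CLS]) token of each sequence. In a RAG retriever these vectors are typically L2-normalized and compared by cosine similarity. The sketch below mimics those steps on plain nested lists that stand in for the model output. `cls_pool`, `l2_normalize` and `rank_passages` are hypothetical helpers, not part of Optimum or Transformers.

```python
import math


def cls_pool(last_hidden_state):
    """Same selection as outputs[0][:, 0]: the first token's vector of every sequence."""
    return [sequence[0] for sequence in last_hidden_state]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def rank_passages(query_emb, passage_embs):
    """Return passage indices sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query_emb)
    scored = []
    for i, passage in enumerate(passage_embs):
        p = l2_normalize(passage)
        scored.append((sum(a * b for a, b in zip(q, p)), i))
    scored.sort(reverse=True)
    return [i for _, i in scored]


# Shape [batch=2][seq_len=2][hidden=2], standing in for outputs[0].
hidden = [[[1.0, 0.0], [9.0, 9.0]], [[0.0, 2.0], [5.0, 5.0]]]
pooled = cls_pool(hidden)
```

With real tensors the same steps would operate on `embeddings` directly. Normalizing once up front lets a plain dot product serve as cosine similarity.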
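As the results above show, quantization substantially improves both the model's latency and its throughput. Hopefully you have picked this up. In the next post we will introduce a related tool that helps manage the RAG workflow efficiently.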
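One way to run a latency/throughput comparison like this yourself is to time each model over the same batches. The harness below does this for any callable, after a few untimed warm-up calls. `benchmark` is a hypothetical helper. In practice you would pass a function that tokenizes a batch and runs either the original FP32 model or the INT8 model.

```python
import statistics
import time


def benchmark(fn, batches, warmup=3):
    """Time fn(batch) per batch; return median latency (s) and throughput (items/s)."""
    if not batches:
        raise ValueError("need at least one batch")
    for batch in batches[:warmup]:
        fn(batch)  # warm-up calls are not timed
    latencies, n_items = [], 0
    start = time.perf_counter()
    for batch in batches:
        t0 = time.perf_counter()
        fn(batch)
        latencies.append(time.perf_counter() - t0)
        n_items += len(batch)
    total = time.perf_counter() - start
    return {
        "median_latency_s": statistics.median(latencies),
        "throughput_items_per_s": n_items / total,
    }
```

Using the median latency keeps occasional slow calls from skewing the result. Throughput is computed over the whole timed loop, so it reflects sustained performance rather than a single call.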