Build an AI Inference Platform in 30 Minutes: An Illustrated Guide to Local Xinference Deployment and Model Testing

Published: 2025-05-15 07:56:17

Author: 超世先锋

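Xinference simplifies running and integrating a wide range of AI models. It lets you deploy open-source LLMs, embedding models, and multimodal models in a local environment and run inference against them.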

I. Environment Setup

1. Conda environment

conda create -n xinference python=3.11
conda activate xinference

2. Install torch

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

3. Environment variables

# Environment variable configuration
export HF_ENDPOINT="https://hf-mirror.com"
export USE_MODELSCOPE_HUB=1
export XINFERENCE_HOME=/home/jovyan/dev/xinference
export XINFERENCE_MODEL_SRC=modelscope
#export XINFERENCE_ENDPOINT=http://0.0.0.0:9997

II. Installation and Deployment

1. Install from source

git clone https://github.com/xorbitsai/inference.git
cd inference
pip install -e .

2. Install with pip

# Full install: enables inference for all supported models

pip install "xinference[all]"

# Required by some models (rerank)

pip install sentence-transformers
# pip install flash_attn

# Required by rerank models

pip install sentencepiece
pip install protobuf

III. Inference Engines

1. Transformers engine (PyTorch)

# Supports almost all of the latest models; the default engine for PyTorch-format models
pip install "xinference[transformers]"

2. vLLM engine

# Supports high concurrency; the vLLM engine delivers higher throughput
pip install "xinference[vllm]"
# FlashInfer is optional but required for specific functionalities such as sliding window attention with Gemma 2.
# For CUDA 12.4 & torch 2.4 to support sliding window attention for gemma 2 and llama 3.1 style rope
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

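# Note: vLLM is selected automatically as the engine when all of the following conditions are met:

- The model format is pytorch, gptq, or awq.
- When the model format is pytorch, the quantization option is none.
- When the model format is awq, the quantization option is Int4.
- When the model format is gptq, the quantization option is Int3, Int4, or Int8.
- The operating system is Linux and at least one CUDA-capable device is available.
- The model_family field (for custom models) or the model_name field (for built-in models) is in vLLM's supported list.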
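The conditions above can be restated as a small predicate. This is only a sketch of the documented rules, not Xinference's internal logic: the function name `vllm_auto_selected` and its parameters are hypothetical, and the check against vLLM's supported list is reduced to a boolean the caller supplies.

```python
# Allowed quantization values per model format, as listed in the conditions above.
ALLOWED_QUANTIZATION = {
    "pytorch": {"none"},
    "awq": {"Int4"},
    "gptq": {"Int3", "Int4", "Int8"},
}


def vllm_auto_selected(model_format, quantization, os_name,
                       cuda_device_count, in_vllm_supported_list):
    """Return True only when every listed condition holds.

    Hypothetical helper that restates the documented rules; it is not
    taken from the Xinference code base.
    """
    allowed = ALLOWED_QUANTIZATION.get(model_format)
    if allowed is None or quantization not in allowed:
        return False  # unsupported format, or wrong quantization for it
    if os_name != "Linux" or cuda_device_count < 1:
        return False  # needs Linux with at least one CUDA device
    return bool(in_vllm_supported_list)


print(vllm_auto_selected("gptq", "Int4", "Linux", 1, True))      # all conditions hold
print(vllm_auto_selected("awq", "Int8", "Linux", 1, True))       # awq requires Int4
print(vllm_auto_selected("pytorch", "none", "Darwin", 0, True))  # not Linux with CUDA
```

When the predicate would be false, another engine has to be chosen explicitly when launching the model.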

3. Llama.cpp engine

Note: since v1.5.0, xllamacpp is the default llama.cpp backend. To use llama-cpp-python instead, set the environment variable USE_XLLAMACPP=0.

pip install xinference

# Installing xllamacpp:

# CPU or Mac Metal:
pip install -U xllamacpp
# CUDA:
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124
# HIP:
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/rocm-6.0.2

# Installing llama-cpp-python on different hardware:

# Apple M-series:
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# NVIDIA GPUs:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# AMD GPUs:
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python

# Notes:

Xinference supports gguf-format models through xllamacpp or llama-cpp-python. xllamacpp is developed by the Xinference team and is planned to become the only llama.cpp backend. Before v1.5.0, llama-cpp-python is the default and xllamacpp is enabled by setting the environment variable USE_XLLAMACPP=1. From v1.5.0, xllamacpp is the default and llama-cpp-python is deprecated. In v1.6.0, llama-cpp-python is to be removed.

4. SGLang engine

pip install "xinference[sglang]"
# For CUDA 12.4 & torch 2.4 to support sliding window attention for gemma 2 and llama 3.1 style rope
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

# Notes:

SGLang provides a high-performance inference runtime built on RadixAttention. By automatically reusing the KV cache across multiple calls, it significantly speeds up the execution of complex LLM programs. It also supports other common inference techniques such as continuous batching and tensor parallelism.

5. MLX engine

pip install "xinference[mlx]"

# Notes:

MLX-lm provides efficient LLM inference on Apple silicon.

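IV. Startup and Access

1. Start the service (note: models are pulled from Hugging Face by default)

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

# Notes

Note 1: By default, xinference-local starts a local worker with the endpoint http://127.0.0.1:9997 (the default port is 9997), which accepts connections from the local machine only. Passing --host 0.0.0.0 makes the service listen on all network interfaces.

Note 2: By default, <HOME>/.xinference is used as the home directory for storing necessary files (logs, model files, and so on). Set the XINFERENCE_HOME environment variable to change it:

XINFERENCE_HOME=/tmp/xinference xinference-local --host 0.0.0.0 --port 9997

Note 3: Models are pulled from Hugging Face by default. Set XINFERENCE_MODEL_SRC=modelscope to pull from ModelScope instead.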
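As a rough illustration of how the defaults in these notes combine with the two environment variables, the sketch below resolves the effective home directory and model source from a mapping of variables. The helper `resolve_xinference_settings` and the label "huggingface" for the default source are assumptions made for this example; Xinference's own configuration code may differ.

```python
import os
from pathlib import Path


def resolve_xinference_settings(env=None):
    """Apply the documented defaults to a mapping of environment variables.

    Hypothetical helper for illustration only; it does not call into
    Xinference.
    """
    env = os.environ if env is None else env
    # Default home directory is <HOME>/.xinference unless XINFERENCE_HOME is set.
    home = env.get("XINFERENCE_HOME") or str(Path.home() / ".xinference")
    # Default model source is Hugging Face unless XINFERENCE_MODEL_SRC is set.
    model_src = env.get("XINFERENCE_MODEL_SRC") or "huggingface"
    return {"home": home, "model_src": model_src}


print(resolve_xinference_settings({}))
print(resolve_xinference_settings({
    "XINFERENCE_HOME": "/tmp/xinference",
    "XINFERENCE_MODEL_SRC": "modelscope",
}))
```

Calling it with no argument reads the current process environment, which is a quick way to confirm what a shell session has exported before starting xinference-local.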

2. Help

xinference-local --help

3. Web UI

http://127.0.0.1:9997/ui

4. API documentation

http://127.0.0.1:9997/docs

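V. Model Testing

# Start xinference-local

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997
# Start in the background (the variable assignment must precede nohup, otherwise nohup tries to run it as a command)
XINFERENCE_MODEL_SRC=modelscope nohup xinference-local --host 0.0.0.0 --port 9997 > xinference-local.log 2>&1 &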

1. Language models

# Select a language model

# Configure the run parameters

# Download log

# Run record

# Test qwen2.5

# Test with curl

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5-instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Who are you?"
        }
    ]
  }'
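The same request can be issued from Python using only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `send_chat` are made up for this example, and `send_chat` assumes an OpenAI-style response body with the reply at choices[0].message.content.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9997"  # endpoint used in this article


def build_chat_request(model, user_message,
                       system_message="You are a helpful assistant."):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )


def send_chat(request, timeout=120):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style body: choices[0].message.content.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("qwen2.5-instruct", "Who are you?")
print(request.get_method(), request.full_url)
# With the server running and the model launched:
# print(send_chat(request))
```

Building the request separately from sending it makes it possible to inspect the payload before the server or the model is available.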

# Test the qwen2.5-vl model (download and launch as for qwen2.5)

# Run log

# Run record

# Test in the web page

2. Embedding models

# Download an embedding model

# Configure the model run parameters

# Download log

# Run record

# Test with curl

curl http://127.0.0.1:9997/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Recommended attractions in Beijing",
    "model": "jina-embeddings-v3"
  }'
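Embeddings are usually compared rather than read directly, so a small sketch of that step may help. It assumes an OpenAI-style response body in which each entry of data carries an index and an embedding, and that the endpoint accepts a list of strings as input; the helper names are hypothetical, and the sample body is hand-made rather than real model output.

```python
import json
import math
import urllib.request


def embed(texts, model="jina-embeddings-v3", base_url="http://127.0.0.1:9997"):
    """POST texts to /v1/embeddings and return one vector per input text."""
    request = urllib.request.Request(
        base_url + "/v1/embeddings",
        data=json.dumps({"model": model, "input": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_vectors(json.load(response))


def extract_vectors(body):
    """Pull vectors out of an OpenAI-style body, ordered by their index field."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Offline check with a hand-made response body (not real model output).
sample = {"data": [{"index": 1, "embedding": [0.0, 1.0]},
                   {"index": 0, "embedding": [1.0, 0.0]}]}
first, second = extract_vectors(sample)
print(cosine_similarity(first, second))
```

With the server running, `embed(["text one", "text two"])` would return the two vectors to pass to `cosine_similarity`.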

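3. Rerank models

# Select a model and click to open it

# Configure the run parameters

# Run log

# Run record

# Test with curl

curl -X 'POST' 'http://127.0.0.1:9997/v1/rerank' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "bge-reranker-v2-m3",
    "query": "A man is eating pasta.",
    "documents": [
        "A man is eating food.",
        "A man is eating a piece of bread.",
        "The girl is carrying a baby.",
        "A man is riding a horse.",
        "A woman is playing violin."
    ]
  }'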
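A rerank response is most useful once the scores are mapped back to the documents. The sketch below does that under one stated assumption: the body contains a results list whose entries have an index into the submitted documents and a relevance_score. The helper names are hypothetical and the scores in the sample are invented for the offline check.

```python
import json
import urllib.request


def rerank(query, documents, model="bge-reranker-v2-m3",
           base_url="http://127.0.0.1:9997"):
    """POST a query and its candidate documents to /v1/rerank; return the parsed body."""
    payload = {"model": model, "query": query, "documents": documents}
    request = urllib.request.Request(
        base_url + "/v1/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def rank_documents(body, documents):
    """Return (score, document) pairs, best match first.

    Assumes body["results"] holds {"index": int, "relevance_score": float} entries.
    """
    pairs = [(item["relevance_score"], documents[item["index"]])
             for item in body["results"]]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


documents = ["A man is eating food.", "A man is riding a horse."]
# Hand-made body for an offline check; the scores are invented, not model output.
sample = {"results": [{"index": 1, "relevance_score": 0.1},
                      {"index": 0, "relevance_score": 0.9}]}
for score, document in rank_documents(sample, documents):
    print(f"{score:.2f}  {document}")
```

With the server running, `rank_documents(rerank(query, documents), documents)` would give the ordered list for a real query.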

4. Image models

# Select a model

# Configure the run parameters

# Backend log

# Run record

# Test in the web UI

# Test prompt
A digital illustration of a movie poster titled [‘Sad Sax: Fury Toad’], [Mad Max] parody poster, featuring [a saxophone-playing toad in a post-apocalyptic desert, with a customized car made of musical instruments], in the background, [a wasteland with other musical vehicle chases], movie title in [a gritty, bold font, dusty and intense color palette].

# Test prompt
A digital illustration of a movie poster titled ['Mulan'], featuring [a fierce warrior woman with long flowing black hair, dressed in traditional Chinese armor with red accents, holding a sword ready for battle]. She is posed in a dynamic action stance against [a backdrop of rugged snow-covered mountains with dark stormy skies]. The movie title ['Mulan'] is written in [bold red calligraphy-style text, prominently displayed at the bottom], along with [a release date in smaller font]. The scene conveys [intensity, bravery, and an epic adventure].

5. Audio models

# Select an audio model

# Download and launch

# Backend log

# Launched successfully

# Test with curl

curl -X 'POST' \
  'http://localhost:9997/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "CosyVoice2-0.5B",
    "input": "hello",
    "voice": "中文女"
  }' -o hello1.mp3
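For scripted text-to-speech, the curl call can be mirrored in Python. This is a sketch under the same assumptions as the command above: the response body is the raw audio, which is written to disk unchanged. `build_speech_request` and `save_speech` are hypothetical names.

```python
import json
import shutil
import urllib.request


def build_speech_request(text, model="CosyVoice2-0.5B", voice="中文女",
                         base_url="http://localhost:9997"):
    """Build (without sending) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        base_url + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )


def save_speech(request, path, timeout=300):
    """Send the request and stream the returned audio bytes into path."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(path, "wb") as out:
            shutil.copyfileobj(response, out)  # copies in chunks, not all at once


request = build_speech_request("hello")
print(request.full_url)
# With the server running and the model launched:
# save_speech(request, "hello1.mp3")
```

Streaming the body with `shutil.copyfileobj` avoids holding a long audio clip in memory.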

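6. Video models

# Select a video model

# Configure the run parameters

# Run log

# Launched successfully

# Test with curl

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/video/generations' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "CogVideoX-5b",
    "prompt": "an apple"
  }' -o apple.json

# The saved response is JSON, not a playable file; base64-decode its b64_json field to obtain the video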
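The decoding step can be sketched in a few lines of Python. The layout of the response is an assumption here: the example expects the b64_json field under data[0], in the style of OpenAI image responses, so adjust the lookup if the actual body differs. `extract_video_bytes` and `save_video` are hypothetical helpers.

```python
import base64
import json


def extract_video_bytes(body):
    """Decode the first generated video from a parsed response body.

    Assumes the body looks like {"data": [{"b64_json": "<base64 string>"}]}.
    """
    return base64.b64decode(body["data"][0]["b64_json"])


def save_video(response_path, video_path):
    """Read the JSON response saved by curl and write the decoded video file."""
    with open(response_path, "r", encoding="utf-8") as source:
        body = json.load(source)
    with open(video_path, "wb") as target:
        target.write(extract_video_bytes(body))


# Offline check with a hand-made body; the payload is placeholder bytes, not a real video.
sample = {"data": [{"b64_json": base64.b64encode(b"placeholder").decode("ascii")}]}
print(extract_video_bytes(sample))
```

To use it, pass the file written by curl's -o option as `response_path` and choose a separate `.mp4` name for `video_path`.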

7. Custom models

# Register the model

# Registered successfully

# Edit the model

# Launch and test

# Run log

# Run record

# Test in the web UI

Note: for models in pytorch format, you can choose vLLM as the engine at launch, which is much faster.

# Test with curl

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "my-Qwen2.5-32B-Instruct-GPTQ-Int4",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Who are you?"
        }
    ]
  }'

VI. Troubleshooting

1. Problem 1:

Fix: install torch

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

2. Problem 2:

Fix: install the missing package

pip install sentence-transformers

3. Problem 3:

Fix: launch the model with the SGLang engine
