我要投稿

单卡4090上一键GRPO微调Qwen3最新模型

发布日期：2025-05-31 05:35:26 浏览次数： 2151

作者：特沃兹道

微信搜一搜，关注“特沃兹道”

模型和数据集下载

因为国内的网络环境，造成我有 connection timeout 恐惧症，所以第一件事就是把该下载的下载好，不要在运行中去动态下载。本文用到的模型和数据集地址：

https://modelscope.cn/models/Qwen/Qwen3-4B-Base
https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini
https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed

下载命令：

modelscope download --model Qwen/Qwen3-4B-Base --revision master  --local_dir  /models/Qwen/Qwen3-4B-Basehuggingface-cli download --resume-download --repo-type dataset  unsloth/OpenMathReasoning-mini   --local-dir unsloth/OpenMathReasoning-minihuggingface-cli download --resume-download --repo-type dataset open-r1/DAPO-Math-17k-Processed --local-dir open-r1/DAPO-Math-17k-Processed

启动容器

# docker run --name unsloth0517 -itd --gpus '"device=4"'   \  -v /data/ai/models:/models \  -v /data/ai/datasets:/datasets  \  -v /data/ai/workspace/unsloth:/workspace \  unsloth:20250517_4cd5_cu121 bash
# docker exec -it unsloth0517 bashroot@ 1855d8235e1a:/home/unsloth# cd /workspace/scripts

其中docker镜像 unsloth:20250517_4cd5_cu121 的构建方法在上一篇：《RTX4090单卡微调Qwen3-32B完整步骤》中有详细描述。

root@1855d8235e1a:/workspace/scripts# python unsloth-grpo-qwen3.py? Unsloth: Will patch your computer to enable 2x faster free finetuning.? Unsloth Zoo will now patch everything to make training faster!Traceback (most recent call last):  File "/workspace/scripts/unsloth-grpo-qwen3.py", line 17, in <module>    model, tokenizer = FastLanguageModel.from_pretrained(                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  File "/opt/conda/lib/python3.11/site-packages/unsloth/models/loader.py", line 138, in from_pretrained    raise ImportError(ImportError: Unsloth: Please install vLLM before enabling `fast_inference`!You can do this in a terminal via `pip install vllm`

因为 unsloth:20250517_4cd5_cu121 这个容器镜像并未包含 vllm，所以会报这个错误。

我们可以基于 unsloth:20250517_4cd5_cu121 镜像再制作一个包含了 vllm 的镜像。为了简单起见，这里直接在容器中安装 vllm：

root@1855d8235e1a:/workspace/scripts# export PIP_INDEX_URL=https://mirrors.aliyun.com/pypi/simple/root@1855d8235e1a:/workspace/scripts# pip install vllm。。。Successfully installed airportsdata-20250224 annotated-types-0.7.0 anyio-4.9.0 astor-0.8.1 blake3-1.0.5 cachetools-5.5.2 cloudpickle-3.1.1 compressed-tensors-0.9.3 cupy-cuda12x-13.4.1 deprecated-1.2.18 depyf-0.18.0 diskcache-5.6.3 einops-0.8.1 email-validator-2.2.0 fastapi-0.115.12 fastapi-cli-0.0.7 fastrlock-0.8.3 gguf-0.16.3 googleapis-common-protos-1.70.0 h11-0.16.0 hf-xet-1.1.2 httpcore-1.0.9 httptools-0.6.4 httpx-0.28.1 importlib_metadata-8.0.0 interegular-0.3.3 jinja2-3.1.6 jiter-0.10.0 lark-1.2.2 llguidance-0.7.22 llvmlite-0.44.0 lm-format-enforcer-0.10.11 mistral_common-1.5.5 msgpack-1.1.0 nest_asyncio-1.6.0 numba-0.61.2 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-cusparselt-cu12-0.6.2 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 openai-1.81.0 opencv-python-headless-4.11.0.86 opentelemetry-api-1.26.0 opentelemetry-exporter-otlp-1.26.0 opentelemetry-exporter-otlp-proto-common-1.26.0 opentelemetry-exporter-otlp-proto-grpc-1.26.0 opentelemetry-exporter-otlp-proto-http-1.26.0 opentelemetry-proto-1.26.0 opentelemetry-sdk-1.26.0 opentelemetry-semantic-conventions-0.47b0 opentelemetry-semantic-conventions-ai-0.4.9 outlines-0.1.11 outlines_core-0.1.26 partial-json-parser-0.2.1.1.post5 pillow-11.2.1 prometheus-fastapi-instrumentator-7.1.0 prometheus_client-0.22.0 py-cpuinfo-9.0.0 pycountry-24.6.1 pydantic-2.11.4 pydantic-core-2.33.2 python-dotenv-1.1.0 python-json-logger-3.3.0 python-multipart-0.0.20 pyzmq-26.4.0 ray-2.46.0 rich-toolkit-0.14.6 scipy-1.15.3 shellingham-1.5.4 sniffio-1.3.1 starlette-0.46.2 tiktoken-0.9.0 torch-2.6.0 torchaudio-2.6.0 torchvision-0.21.0 triton-3.2.0 typer-0.15.4 typing-inspection-0.4.1 uvicorn-0.34.2 uvloop-0.21.0 vllm-0.8.5.post1 watchfiles-1.0.5 websockets-15.0.1 wrapt-1.17.2 xformers-0.0.29.post2 xgrammar-0.1.18WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

启动训练

进入容器后，确保容器中能看到如下目录：

要训练的基础模型目录：/models/Qwen/Qwen3-4B-Base
数据集1：/datasets/unsloth/OpenMathReasoning-mini/data/cot-00000-of-00001.parquet
数据集2：/datasets/open-r1/DAPO-Math-17k-Processed/en/train-00000-of-00001.parquet
工作目录和训练代码文件：/workspace/scripts/unsloth-grpo-qwen3.py

在容器的 /workspace/scripts/ 目录下执行如下代码启动训练：

cat unsloth-grpo-qwen3.py > unsloth-grpo-qwen3.py.log && \  nohup python unsloth-grpo-qwen3.py >> unsloth-grpo-qwen3.py.log 2>&1 &

我们有意将训练代码刷到了训练日志的前面做固定，这样做的好处是，方便代码迭代过程中做问题排查。

其中训练代码 unsloth-grpo-qwen3.py 的最新版本已经针对 24G显存的4090卡做了参数边界优化，并且做了详细注释。代码内容在上一篇文章中：https://mp.weixin.qq.com/s/olblI2gE3HHDSEGnejGBrw 需要的小伙伴可自行取用。本文为了快速跑完测试，对 max_steps 等参数做了限制。下面是训练代码执行中各阶段对应的日志分析:

训练日志

加载模型

? Unsloth: Will patch your computer to enable 2x faster free finetuning.? Unsloth Zoo will now patch everything to make training faster!INFO 05-25 10:56:06 [importing.py:53] Triton module has been replaced with a placeholder.INFO 05-25 10:56:06 [__init__.py:239] Automatically detected platform cuda.===== step1. 加载模型 =======================================================================((====))==  Unsloth 2025.5.6: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.65 GB. Platform: Linux.O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False] "-____-"     Free license: http://github.com/unslothai/unslothUnsloth: Fast downloading is enabled - ignore downloading bars which are red colored!Unsloth: vLLM loading /models/Qwen/Qwen3-4B-Base with actual GPU utilization = 68.76%Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 23.65 GB.Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 224.Unsloth: vLLM's KV Cache can use up to 9.31 GB. Also swap space = 6 GB.INFO 05-25 10:59:17 [config.py:717] This model supports multiple tasks: {'embed', 'generate', 'score', 'reward', 'classify'}. Defaulting to 'generate'.INFO 05-25 10:59:17 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.INFO 05-25 10:59:17 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: 。。。WARNING 05-25 10:59:17 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f19973bbd50>INFO 05-25 10:59:28 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0INFO 05-25 10:59:28 [cuda.py:221] Using Flash Attention backend on V1 engine.WARNING 05-25 10:59:28 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.INFO 05-25 10:59:28 [gpu_model_runner.py:1329] Starting to load model /models/Qwen/Qwen3-4B-Base...Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:02<00:01,  1.20s/it]Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00,  1.60s/it]Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00,  1.52s/it]
INFO 05-25 10:59:33 [loader.py:458] Loading weights took 4.81 secondsINFO 05-25 10:59:33 [punica_selector.py:18] Using PunicaWrapperGPU.INFO 05-25 10:59:33 [gpu_model_runner.py:1347] Model loading took 7.6334 GiB and 5.084545 secondsINFO 05-25 10:59:48 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/f7b249c75c/rank_0_0 for vLLM's torch.compileINFO 05-25 10:59:48 [backends.py:430] Dynamo bytecode transform time: 14.75 sInductor Compilation: 100%|██████████| 6/6 [00:01<00:00,  5.13it/s, triton_poi_fused_add_mul_sub_5]INFO 05-25 10:59:53 [backends.py:136] Cache the graph of shape None for later use。。。Inductor Compilation: 100%|██████████| 5/5 [00:00<00:00, 28.37it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_4]INFO 05-25 11:00:39 [backends.py:148] Compiling a graph for general shape takes 49.26 sINFO 05-25 11:03:14 [monitor.py:33] torch.compile takes 64.01 s in totalINFO 05-25 11:03:18 [kv_cache_utils.py:634] GPU KV cache size: 49,856 tokensINFO 05-25 11:03:18 [kv_cache_utils.py:637] Maximum concurrency for 2,048 tokens per request: 24.34xINFO 05-25 11:04:19 [gpu_model_runner.py:1686] Graph capturing finished in 61 secs, took 3.94 GiBINFO 05-25 11:04:19 [core.py:159] init engine (profile, create kv cache, warmup model) took 286.49 secondsUnsloth 2025.5.6 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.

模型结构

对加载模型，插入 lora 后的结构：

/models/Qwen/Qwen3-4B-Base does not have a padding token! Will use pad_token = <|vision_pad|>.model:PeftModelForCausalLM(  (base_model): LoraModel(    (model): Qwen3ForCausalLM(      (model): Qwen3Model(        (embed_tokens): Embedding(151936, 2560, padding_idx=151654)        (layers): ModuleList(          (0-35): 36 x Qwen3DecoderLayer(            (self_attn): Qwen3Attention(              (q_proj): lora.Linear(                (base_layer): Linear(in_features=2560, out_features=4096, bias=False)                (lora_dropout): ModuleDict(                  (default): Identity()                )                (lora_A): ModuleDict(                  (default): Linear(in_features=2560, out_features=32, bias=False)                )                (lora_B): ModuleDict(                  (default): Linear(in_features=32, out_features=4096, bias=False)                )                (lora_embedding_A): ParameterDict()                (lora_embedding_B): ParameterDict()                (lora_magnitude_vector): ModuleDict()              )              (k_proj): lora.Linear(                (base_layer): Linear(in_features=2560, out_features=1024, bias=False)                (lora_dropout): ModuleDict(                  (default): Identity()                )                (lora_A): ModuleDict(                  (default): Linear(in_features=2560, out_features=32, bias=False)                )                (lora_B): ModuleDict(                  (default): Linear(in_features=32, out_features=1024, bias=False)                )                (lora_embedding_A): ParameterDict()                (lora_embedding_B): ParameterDict()                (lora_magnitude_vector): ModuleDict()              )              (v_proj): lora.Linear(                (base_layer): Linear(in_features=2560, out_features=1024, bias=False)                (lora_dropout): ModuleDict(                  (default): Identity()                )                (lora_A): ModuleDict(                  (default): Linear(in_features=2560, out_features=32, bias=False)                )                (lora_B): ModuleDict(                  (default): Linear(in_features=32, out_features=1024, bias=False)                )                (lora_embedding_A): ParameterDict()                (lora_embedding_B): ParameterDict()                (lora_magnitude_vector): ModuleDict()              )              (o_proj): lora.Linear(                (base_layer): Linear(in_features=4096, out_features=2560, bias=False)                (lora_dropout): ModuleDict(                  (default): Identity()                )                (lora_A): ModuleDict(                  (default): Linear(in_features=4096, out_features=32, bias=False)                )                (lora_B): ModuleDict(                  (default): Linear(in_features=32, out_features=2560, bias=False)                )                (lora_embedding_A): ParameterDict()                (lora_embedding_B): ParameterDict()                (lora_magnitude_vector): ModuleDict()              )              (q_norm): Qwen3RMSNorm((128,), eps=1e-06)              (k_norm): Qwen3RMSNorm((128,), eps=1e-06)              (rotary_emb): LlamaRotaryEmbedding()            )            (mlp): Qwen3MLP(              (gate_proj): lora.Linear(                (base_layer): Linear(in_features=2560, out_features=9728, bias=False)                (lora_dropout): ModuleDict(                  (default): Identity()                )                (lora_A): ModuleDict(                  (default): Linear(in_features=2560, out_features=32, bias=False)                )                (lora_B): ModuleDict(                  (default): Linear(in_features=32, out_features=9728, bias=False)                )                (lora_embedding_A): ParameterDict()                (lora_embedding_B): ParameterDict()                (lora_magnitude_vector): ModuleDict()              )              (up_proj): lora.Linear(                (base_layer): Linear(in_features=2560, out_features=9728, bias=False)                (lora_dropout): ModuleDict(                  (default): Identity()                )                (lora_A): ModuleDict(                  (default): Linear(in_features=2560, out_features=32, bias=False)                )                (lora_B): ModuleDict(                  (default): Linear(in_features=32, out_features=9728, bias=False)                )                (lora_embedding_A): ParameterDict()                (lora_embedding_B): ParameterDict()                (lora_magnitude_vector): ModuleDict()              )              (down_proj): lora.Linear(                (base_layer): Linear(in_features=9728, out_features=2560, bias=False)                (lora_dropout): ModuleDict(                  (default): Identity()                )                (lora_A): ModuleDict(                  (default): Linear(in_features=9728, out_features=32, bias=False)                )                (lora_B): ModuleDict(                  (default): Linear(in_features=32, out_features=2560, bias=False)                )                (lora_embedding_A): ParameterDict()                (lora_embedding_B): ParameterDict()                (lora_magnitude_vector): ModuleDict()              )              (act_fn): SiLU()            )            (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06)            (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06)          )        )        (norm): Qwen3RMSNorm((2560,), eps=1e-06)        (rotary_emb): LlamaRotaryEmbedding()      )      (lm_head): Linear(in_features=2560, out_features=151936, bias=False)    )  ))

GRPO对话模版

===== step2. 准备 GRPO 对话模版 ================================================================ 对话模板输出样例:You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>What is 1+1?<start_working_out>I think it's 2.<end_working_out><SOLUTION>2</SOLUTION><|endoftext|>What is 2+2?<start_working_out>

格式遵循预微调

在 GRPO 训练之前，先用带推理过程的对话数据，对原始模型做一个简单的 SFT 训练，以让模型具备按我们的推理格式进行输出的能力。

先要对原始数据集进行清理，筛选出适合做格式遵循微调的数据来：

===== step3. 格式遵循预微调 ===============================================================----- 清洗后的数据集:      expected_answer  ...                                 generated_solution0                  14  ...  <think>\nOkay, let's see. I need to solve the ...6                  -2  ...  <think>\nOkay, so I need to find the value of ...9                  18  ...  <think>\nOkay, so I need to solve the equation...13                  2  ...  <think>\nOkay, so I need to evaluate the infin...17                 30  ...  <think>\nAlright, so I need to find the larges......               ...  ...                                                ...19243             244  ...  <think>\nOkay, so I need to find the value of ...19245               1  ...  <think>\nOkay, so I have this problem where a ...19247               4  ...  <think>\nOkay, let's tackle this problem step ...19248              18  ...  <think>\nOkay, let's see. I need to find the n...19250          0.8960  ...  <think>\nOkay, so I need to find the probabili...
[7507 rows x 3 columns]

----- 第1条 OpenMathReasoning 数据格式化后应用对话模板的输出:You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>Given $\sqrt{x^2+165}-\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<start_working_out>Okay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.
First, let me write down the equation again to make sure I have it right:
√(x² + 165) - √(x² - 52) = 7.
Okay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:
√(x² + 165) = 7 + √(x² - 52).
Now, if I square both sides, maybe I can get rid of the square roots. Let's do that:
(√(x² + 165))² = (7 + √(x² - 52))².
Simplifying the left side:
x² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².
The right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).
So putting it all together:
x² + 165 = 49 + 14√(x² - 52) + x² - 52.
Hmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:
165 = 49 + 14√(x² - 52) - 52.
Simplify the constants on the right:
49 - 52 is -3, so:
165 = -3 + 14√(x² - 52).
Now, add 3 to both sides to isolate the radical term:
165 + 3 = 14√(x² - 52).
So 168 = 14√(x² - 52).
Divide both sides by 14:
168 / 14 = √(x² - 52).
12 = √(x² - 52).
Now, square both sides again to eliminate the square root:
12² = x² - 52.
144 = x² - 52.
Add 52 to both sides:
144 + 52 = x².
196 = x².
So x = √196 = 14.
But wait, since the problem states that x is positive, we only take the positive root. So x = 14.
But hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.
Let's plug x = 14 back into the original equation:
√(14² + 165) - √(14² - 52) = ?
Calculate each term:
14² is 196.
So first radical: √(196 + 165) = √361 = 19.
Second radical: √(196 - 52) = √144 = 12.
So 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.
Therefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.To solve the equation \(\sqrt{x^2 + 165} - \sqrt{x^2 - 52} = 7\) for positive \(x\), we proceed as follows:
1. Start with the given equation:   \[   \sqrt{x^2 + 165} - \sqrt{x^2 - 52} = 7   \]
2. Isolate one of the square roots by moving \(\sqrt{x^2 - 52}\) to the right side:   \[   \sqrt{x^2 + 165} = 7 + \sqrt{x^2 - 52}   \]
3. Square both sides to eliminate the square root on the left:   \[   (\sqrt{x^2 + 165})^2 = (7 + \sqrt{x^2 - 52})^2   \]   Simplifying both sides, we get:   \[   x^2 + 165 = 49 + 14\sqrt{x^2 - 52} + (x^2 - 52)   \]
4. Combine like terms on the right side:   \[   x^2 + 165 = x^2 - 52 + 49 + 14\sqrt{x^2 - 52}   \]   Simplifying further:   \[   x^2 + 165 = x^2 - 3 + 14\sqrt{x^2 - 52}   \]
5. Subtract \(x^2\) from both sides:   \[   165 = -3 + 14\sqrt{x^2 - 52}   \]
6. Add 3 to both sides to isolate the term with the square root:   \[   168 = 14\sqrt{x^2 - 52}   \]
7. Divide both sides by 14:   \[   12 = \sqrt{x^2 - 52}   \]
8. Square both sides again to eliminate the square root:   \[   12^2 = x^2 - 52   \]   Simplifying:   \[   144 = x^2 - 52   \]
9. Add 52 to both sides to solve for \(x^2\):   \[   196 = x^2   \]
10. Take the positive square root (since \(x\) is positive):    \[    x = \sqrt{196} = 14    \]
11. Verify the solution by substituting \(x = 14\) back into the original equation:    \[    \sqrt{14^2 + 165} - \sqrt{14^2 - 52} = \sqrt{196 + 165} - \sqrt{196 - 52} = \sqrt{361} - \sqrt{144} = 19 - 12 = 7    \]    The solution checks out.
Thus, the only positive solution is:\[\boxed{14}\]<end_working_out><SOLUTION>14</SOLUTION><|endoftext|>num_proc must be <= 58. Reducing num_proc to 58 for dataset of size 58.[2025-05-25 11:05:41] WARNING arrow_dataset.py:3010: num_proc must be <= 58. Reducing num_proc to 58 for dataset of size 58.
dataset.shape:(58, 5)

----- 处理好的预微调数据集:Dataset({    features: ['expected_answer', 'problem', 'generated_solution', 'Messages', 'N', 'text', '__index_level_0__'],    num_rows: 58})Unsloth: Tokenizing ["text"] (num_proc=58): 100%|██████████| 58/58 [00:07<00:00,  7.99 examples/s]

然后开始预微调训练：

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1   \\   /|    Num examples = 58 | Num Epochs = 2 | Total steps = 116O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 1\        /    Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1 "-____-"     Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)100%|██████████| 116/116 [00:47<00:00,  2.46it/s]Unsloth: Will smartly offload gradients to save VRAM!{'loss': 0.7447, 'grad_norm': 0.6478227376937866, 'learning_rate': 0.00016, 'epoch': 0.09}{'loss': 0.6066, 'grad_norm': 0.640754759311676, 'learning_rate': 0.00019279279279279282, 'epoch': 0.17}{'loss': 0.4543, 'grad_norm': 0.6311891674995422, 'learning_rate': 0.0001837837837837838, 'epoch': 0.26}{'loss': 0.4684, 'grad_norm': 0.5015860199928284, 'learning_rate': 0.00017477477477477476, 'epoch': 0.34}{'loss': 0.4063, 'grad_norm': 0.5008582472801208, 'learning_rate': 0.00016576576576576578, 'epoch': 0.43}{'loss': 0.3979, 'grad_norm': 0.5995965600013733, 'learning_rate': 0.00015675675675675676, 'epoch': 0.52}{'loss': 0.4248, 'grad_norm': 0.4734836518764496, 'learning_rate': 0.00014774774774774775, 'epoch': 0.6}{'loss': 0.4197, 'grad_norm': 0.5012277960777283, 'learning_rate': 0.00013873873873873876, 'epoch': 0.69}{'loss': 0.4511, 'grad_norm': 0.548245906829834, 'learning_rate': 0.00012972972972972974, 'epoch': 0.78}{'loss': 0.3974, 'grad_norm': 0.42141056060791016, 'learning_rate': 0.00012072072072072073, 'epoch': 0.86}{'loss': 0.3317, 'grad_norm': 0.4644368886947632, 'learning_rate': 0.0001117117117117117, 'epoch': 0.95}{'loss': 0.3846, 'grad_norm': 0.3927017152309418, 'learning_rate': 0.0001027027027027027, 'epoch': 1.03}{'loss': 0.2501, 'grad_norm': 0.5447007417678833, 'learning_rate': 9.36936936936937e-05, 'epoch': 1.12}{'loss': 0.278, 'grad_norm': 0.4823240339756012, 'learning_rate': 8.468468468468469e-05, 'epoch': 1.21}{'loss': 0.2645, 'grad_norm': 0.5164972543716431, 'learning_rate': 7.567567567567568e-05, 'epoch': 1.29}{'loss': 0.2584, 'grad_norm': 0.5759400725364685, 'learning_rate': 6.666666666666667e-05, 'epoch': 1.38}{'loss': 0.2121, 'grad_norm': 0.5618821978569031, 'learning_rate': 5.765765765765766e-05, 'epoch': 1.47}{'loss': 0.2322, 'grad_norm': 0.5534489154815674, 'learning_rate': 4.8648648648648654e-05, 'epoch': 1.55}{'loss': 0.2256, 'grad_norm': 0.6181885600090027, 'learning_rate': 3.963963963963964e-05, 'epoch': 1.64}{'loss': 0.1841, 'grad_norm': 0.48197486996650696, 'learning_rate': 3.063063063063063e-05, 'epoch': 1.72}{'loss': 0.2789, 'grad_norm': 0.6069267988204956, 'learning_rate': 2.1621621621621624e-05, 'epoch': 1.81}{'loss': 0.2148, 'grad_norm': 0.5475031137466431, 'learning_rate': 1.2612612612612611e-05, 'epoch': 1.9}{'loss': 0.2263, 'grad_norm': 0.6717495918273926, 'learning_rate': 3.603603603603604e-06, 'epoch': 1.98}{'train_runtime': 47.1361, 'train_samples_per_second': 2.461, 'train_steps_per_second': 2.461, 'train_loss': 0.35218508824192246, 'epoch': 2.0}----- 训练后模型的 model.dtype: torch.bfloat16

----- 格式遵循预微调训练完成，检测下模型是否学会了遵循我们自定义的格式 -----You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>Jenifer has 82 cents in pennies and nickels. Her younger brother mistook all her nickels for dimes and counted the total as $1.47. How many pennies does Jenifer have?<start_working_out><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|>ol let's start by setting up the problem. Let's denote the number of pennies as \( p \) and the number of nickels as \( n \). We know that the total value of the coins is 82 cents, so we can write the equation:\[ p + 5n = 82 \]
Next, we need to consider the mistake made by Jenifer's younger brother. He mistook all the nickels for dimes, so he counted the total as $1.47, which is 147 cents. This gives us another equation:\[ p + 10n = 147 \]
Now, we have a system of two equations:1. \( p + 5n = 82 \)2. \( p + 10n = 147 \)
To solve this system, we can subtract the first equation from the second to eliminate \( p \):\[ (p + 10n) - (p + 5n) = 147 - 82 \]\[ 5n = 65 \]\[ n = 13 \]
Now that we have \( n = 13 \), we can substitute this value back into the first equation to find \( p \):\[ p + 5(13) = 82 \]\[ p + 65 = 82 \]\[ p = 17 \]
So, Jenifer has 17 pennies. Let's verify the solution:- The value of 17 pennies is \( 17 \times 1 = 17 \) cents.- The value of 13 nickels is \( 13 \times 5 = 65 \) cents.- The total value is \( 17 + 65 = 82 \) cents, which matches the given total.
Thus, the number of pennies Jenifer has is \(\boxed{17}\).<end_working_out><SOLUTION>17</SOLUTION><|endoftext|>

可以看到，预微调之后的模型输出，满足期望。推理过程放在了我们指定的标签<start_working_out> 和 <end_working_out> 之间，最终答案放在了指定的标签之间。（其中 <start_working_out> 会添加在对话问题中，引导模型输出推理。所以模型没有输出这个开始标签，而是直接输出思考内容，然后以<end_working_out>结束思考）

处理GRPO数据集

===== step4. 加载并处理数据集 ==============================================================----- 数据集 DAPO-Math-17k-Processed:Dataset({    features: ['prompt', 'solution', 'data_source', 'source_prompt', 'ability', 'reward_model', 'extra_info'],    num_rows: 14116})----- 第一条的 prompt: In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.----- 第一条的 solution: 34Map: 100%|██████████| 14116/14116 [00:01<00:00, 11664.38 examples/s]----- 第1条对话格式的内容：{'prompt': [{'content': 'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION> and </SOLUTION>', 'role': 'system'}, {'content': 'In triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.', 'role': 'user'}], 'solution': '34', 'data_source': 'math_dapo', 'source_prompt': [{'content': 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nIn triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.\n\nRemember to put your answer on its own line after "Answer:".', 'role': 'user'}], 'ability': 'MATH', 'reward_model': {'ground_truth': '34', 'style': 'rule-lighteval/MATH_v2'}, 'extra_info': {'index': '9a9b6eb4-a1cb-49d1-8c1e-62eaf2f74079'}, 'answer': '34'}Map: 100%|██████████| 14116/14116 [00:04<00:00, 3005.13 examples/s]You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.<start_working_out>Map: 100%|██████████| 14116/14116 [00:02<00:00, 5114.02 examples/s]You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.<start_working_out>Map: 100%|██████████| 14116/14116 [00:02<00:00, 5114.02 examples/s]==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1   \\   /|    Num examples = 12,709 | Num Epochs = 1 | Total steps = 100O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4 "-____-"     Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)Max Length =  203

定义并测试奖励函数

训练脚本中定义了4个奖励函数，如下是对其中2个的打分样例：

===== step5. 定义并测试奖励函数 ============================================================match_format:re.compile('<end_working_out>.*?<SOLUTION>(.+?)</SOLUTION>[\\s]{0,}(?:<\\|endoftext\\|>)?[\\s]{0,}$', re.MULTILINE|re.DOTALL)
----- 奖励函数 check_answer 评分样例：
Case | Question        | Response                         | Answer   | Extracted  | Score------------------------------------------------------------------------------------------1    | Q: 2+2 = ?      | Let me think!<end_working_out><SOLUTION>4</SOLUTION> | 4        | 4          | 5.02    | Q: Hello?       | <start_working_out>think..<end_working_out><SOLUTION>  yes  </SOLUTION> | yes      |   yes      | 3.53    | Q: Value?       | !<end_working_out><SOLUTION>9.5</SOLUTION> | 10       | 9.5        | 2.04    | Q: Value?       | !<end_working_out><SOLUTION>8.3</SOLUTION> | 10       | 8.3        | 1.55    | Q: Value?       | i!<end_working_out><SOLUTION>5</SOLUTION> | 10       | 5          | -2.56    | Q: Answer?      | i!<end_working_out><SOLUTION>no digit</SOLUTION> | 42       | no digit   | -4.57    | Q: String?      | f!<end_working_out><SOLUTION>oobar</SOLUTION> | baz      | oobar      | -4.5
----- 奖励函数 check_numbers 评分样例：
Case | Question             | Response(s)          | Answer(s)       | Score(s)--------------------------------------------------------------------------------------1    | Q: 2+2=?             | <SOLUTION>4          | 4               | 3.52    | 问：总量？                | <SOLUTION> 1,234.00  | 1234.0          | 3.53    | Q: 输出吧               | <SOLUTION>没有数字       | 0               | -2.54    | Q: 10-3=?            | <SOLUTION>5          | 7               | -1.55    | Q: 1+1=?,2+2=?       | <SOLUTION>2,4        | 2,4             | 3.5,-2.5

进行GRPO训练

===== step6. 训练模型 =====================================================================。。。  5%|▌         | 5/100 [02:19<40:11, 25.38s/it]********************Question:。。。100%|██████████| 100/100 [46:13<00:00, 27.74s/it]can fire at a single
Extracted:None{'loss': 0.0067, 'grad_norm': 0.2612026631832123, 'learning_rate': 2.7777777777777776e-07, 'rewards/match_format_exactly': 2.25, 'rewards/match_format_approximately': 0.375, 'rewards/check_answer': 3.25, 'rewards/check_numbers': 2.0, 'reward': 7.875, 'reward_std': 10.25, 'completion_length': 1379.25, 'kl': 0.1681276112794876, 'epoch': 0.01}{'loss': 0.0052, 'grad_norm': 0.22037602961063385, 'learning_rate': 2.2222222222222224e-07, 'rewards/match_format_exactly': 1.5, 'rewards/match_format_approximately': -0.75, 'rewards/check_answer': 1.5, 'rewards/check_numbers': 0.5, 'reward': 2.75, 'reward_std': 11.83568000793457, 'completion_length': 1665.75, 'kl': 0.1294582337141037, 'epoch': 0.01}{'loss': 0.0043, 'grad_norm': 0.18991202116012573, 'learning_rate': 1.6666666666666668e-07, 'rewards/match_format_exactly': 0.75, 'rewards/match_format_approximately': -1.875, 'rewards/check_answer': -0.25, 'rewards/check_numbers': -1.0, 'reward': -2.375, 'reward_std': 10.25, 'completion_length': 1775.5, 'kl': 0.1086689755320549, 'epoch': 0.01}{'loss': 0.0046, 'grad_norm': 0.025854697450995445, 'learning_rate': 1.1111111111111112e-07, 'rewards/match_format_exactly': 0.0, 'rewards/match_format_approximately': -3.0, 'rewards/check_answer': -2.0, 'rewards/check_numbers': -2.5, 'reward': -7.5, 'reward_std': 0.0, 'completion_length': 1844.0, 'kl': 0.11534835398197174, 'epoch': 0.01}{'loss': 0.0052, 'grad_norm': 0.05398529767990112, 'learning_rate': 5.555555555555556e-08, 'rewards/match_format_exactly': 0.0, 'rewards/match_format_approximately': -3.0, 'rewards/check_answer': -2.0, 'rewards/check_numbers': -2.5, 'reward': -7.5, 'reward_std': 0.0, 'completion_length': 1844.0, 'kl': 0.1298024207353592, 'epoch': 0.01}{'train_runtime': 2773.8914, 'train_samples_per_second': 0.144, 'train_steps_per_second': 0.036, 'train_loss': 0.005772246685810387, 'epoch': 0.01}

这里我们设置了 max_steps = 100 以便尽快完成测试。正式训练时，可以通过设置 epochs 来控制训练轮数，并根据 loss 收敛情况等提前结束训练。

测试训练好的模型

Processed prompts: 100%|██████████| 1/1 [00:11<00:00, 11.75s/it, est. speed input: 0.85 toks/s, output: 87.15 toks/s]----- 基础模型的回答: - AnswersMath and ArithmeticWhat is the sqrt of 101?Wiki User∙ 2010-05-29 22:38:13Best AnswerCopyThe square root of 101 is 10.0498756 approximatelyWiki User∙ 2010-05-29 22:38:13This answer is:?0?0?0What is the square root of 101?It is approx. 10.0498756211.What is the square root of -101?The square root of -101 can be written as the product of the positive square root of 101 and i (where i is an imaginary number). The square root of 101 is approximately 10.04987751.What is the square root of 101 simplified?sqrt(101) is already simplified since 101 is not a perfect square. Also, we cannot simplify it since 101 is a prime number. (In other words, 101 = 1 x 101, so its only factorization is 1 and 101) In decimal form it is: 10.049875621120891586572919348985505109596599416484785647300807046What is the square root of 20200?sqrt (20200) = sqrt (4 x 100 x 505) = sqrt (4) x sqrt (100) x sqrt (505) = 2 x 10 x sqrt (505) = 20 x sqrt (505) = 10 x sqrt (4) x sqrt (505) = 50 sqrt (149)Is 17 a square root?17 is a square root.What is the square root of 3 in 101?sqrt(3 in 101) = sqrt(101) x sqrt(3) = sqrt(101 x 3) = sqrt(303) = 17.4069...approx.What is an irrational number?-101 as a number.Why is one sixth of one third the same as one square root of one hundred sixty nine?sqrt(169) = 13 1/6 of 1/3 = (1/6)(1/3) = 1/(6x3) = 1/18 = (1/13)(1/13) = 1/sqrt(169)What number when squared equals six?If you meant sqrt(6)2 then this = 6 and sqrt(6) = 2.4494... For the number to be a square root you need the 6 to be in the denominator or the square root of 6.What is -20 sqrt of 101?-20 square root of (101) - 20 * sqrt(101) - 20 * sqrt(101) is a real number and cannot be simplified any further.What is the square root of 101.8?9.046921979920897...What is the square root of 7561?As sqrt(7561) = sqrt(169)*sqrt(41) = 13*sqrt(41) ~= 274.84779...What is the sqrt of 0.25?The sqrt of 0.25 is 0.5What is the square root of 169 over 10?It is 1.3How do you find the square root of 51?The square root of 51 is approx. 7.1414 The easiest way to do that is to use a calculator. If you do not have a calculator, I strongly suggest using one, since the sqrt of 51 is an irrational number with an infinite amount of decimal places.Square and cube roots can be calculated the old-fashioned manner by using trial and error. 72 = 49 which is too small; 82 = 64 which is too big, etc. If you need to go that route, you need to know your basicProcessed prompts: 100%|██████████| 1/1 [00:26<00:00, 26.65s/it, est. speed input: 2.25 toks/s, output: 74.61 toks/s]

----- GRPO-LoRA模型的回答:Okay, so I need to find the square root of 101. Hmm, let me think. The square root of a number is the value that, when multiplied by itself, gives the original number. But 101 seems like it's not a perfect square, right? I remember that perfect squares like 100, 121, 144, etc., are numbers that have exact square roots since they're squares of integers. 101 is just 3 less than 100, which is 10 squared. So maybe √101 is close to 10 but not exactly 10. Let me calculate 10 squared first. 10 × 10 is 100. So √101 is a little more than 10. How much more? Well, 10.5 squared is 110.25, which is higher than 101. So it has to be between 10 and 10.5. Maybe 10.05? Let me try that. 10.05 squared is 10.05 × 10.05. Let me compute that. 10 × 10 is 100, 10×0.05 is 0.5, 0.05×10 is 0.5, and 0.05×0.05 is 0.0025. Adding them up: 100 + 0.5 + 0.5 + 0.0025 = 101.0025. That's very close to 101, so √101 is approximately 10.05. But wait, 10.04 squared might be slightly less. Let me check 10.04: 10.04 × 10.04. This is 10.04 × 10 = 100.4, then 0.04 × 10 = 0.4, and 0.04×0.04 = 0.0016. So total: 100.4 + 0.4 = 100.8, plus 0.0016, which is 100.8016. That's less than 101. So √101 is between 10.04 and 10.05. Maybe 10.045? Let me try 10.045: 10.045 squared. Hmm, 10×10=100, 10×0.045=0.45, 0.045×10=0.45, 0.045×0.045=0.002025. Adding: 100 + 0.45 + 0.45 = 100.9, plus 0.002025 is 100.902025, which is still less than 101. So it's closer to 10.04. Maybe 10.042? Let me try 10.042: 10.042 squared. 10×10=100, 10×0.042=0.42, 0.042×10=0.42, 0.042×0.042=0.001764. Adding: 100 + 0.42 + 0.42 = 100.84, plus 0.001764 is 100.841764, which is still below 101. So it's closer to 10.043. Let me try 10.043: 10.043 squared. 10×10=100, 10×0.043=0.43, 0.043×10=0.43, 0.043×0.043=0.001849. Adding: 100 + 0.43 + 0.43 = 100.86, plus 0.001849 is 100.861849, still less than 101. So √101 is approximately 10.043. But I wonder if there's a better way. Maybe using the Newton-Raphson method for square roots? Let me recall that method. Let x be the number we want to find the square root of. Start with an initial guess, say 10. Then the method is x_(new) = (x + N/x)/2, where N is the number, which is 101. So first guess is 10. Then, (10 + 101/10)/2 = (10 + 10.1)/2 = 20.1/2 = 10.05. Next iteration: (10.05 + 101/10.05)/2. 101 divided by 10.05 is approximately 10.049751. So 10.05 + 10.049751 = 20.099751, divided by 2 is 10.049875. So after two iterations, the approximation is approximately 10.049875. That's more accurate than my previous guesses. So the square root of 101 is approximately 10.049875. But the problem didn't specify how precise the answer needs to be, so maybe just the decimal approximation is acceptable. So let me express that in decimal form. 10.049875... Hmm, four decimal places would be 10.0499. But to be precise, I should keep more. So maybe 10.049875? But that's a bit too far. Let me verify with the Newton-Raphson method again. Start with 10. (10 + 101/10)/2 = 10.05. Then (10.05 + 101/10.05)/2. 101/10.05 is 10.04975. So 10.05 + 10.04975 = 20.09975. Divided by 2 is 10.049875. Then next iteration: (10.049875 + 101/10.049875)/2. 101 divided by 10.049875 is approximately 10.049874. So 10.049875 + 10.049874 = 20.099749, divided by 2 is 10.0498745. So after three iterations, it's approximately 10.0498745. So the square root of 101 is approximately 10.049875. Therefore, I think the answer is around 10.05. But for a better approximation, maybe using more iterations or a calculator. But since the problem doesn't specify, I'll go with 10.05. Let me check if 10.05 squared is 101. 10.05 × 10.05. Let me multiply that out. 10×10=100, 10×0.05=0.5, 0.05×10=0.5, 0.05×0.05=0.0025. Adding: 100 + 0.5 + 0.5 = 101, plus 0.0025 is 101.0025. So √101 is slightly less than 10.05. Therefore, the square root is approximately 10.049875. So in decimal form, that's about 10.050. Therefore, the square root of 101 is approximately 10.05.To find the square root of 101, we can use the Newton-Raphson method for approximation. The method starts with an initial guess and iteratively refines it.
1. **Initial Guess**: Start with \( x_0

这里看到回答因为长度问题被截断了。因为本文测试代码在推理时设置了最大生成token数 max_tokens = 1024；这些问题在最新的训练代码中已经解决了。参见：

《单卡4090上一键GRPO微调Qwen3最新模型的训练代码》

合并及保存模型

Lora训练完后，不会更改原来的模型，也不会生成完整的新模型。而是生成一个额外的较小的 Lora 权重，里面就是训练好的内容。在如上测试中，是把这个 Lora 权重作为外挂，和原始权重一起加载做的推理测试。测试完成后，我们需要把外挂lora权重，和原始模型做合并，生成一个新的完整的模型。

===== step8. 合并及保存模型 ===============================================================Unsloth: Merging 4bit and LoRA weights to 16bit...Unsloth: Will use up to 336.32 out of 503.72 RAM for saving.Unsloth: Saving model... This might take 5 minutes ... 17%|█▋        | 6/36 [00:00<00:00, 57.02it/s]We will save to Disk and not RAM now.100%|██████████| 36/36 [00:05<00:00,  6.11it/s]Unsloth: Saving tokenizer... Done.Done.

得到的模型还可以进一步做需要的量化处理。

53AI，企业落地大模型首选服务商

产品：场景落地咨询+大模型应用平台+行业解决方案

承诺：免费场景POC验证，效果验证后签署服务协议。零风险落地应用大模型，已交付160+中大型企业