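Meta's newly open-sourced toolkit generates high-quality LLM fine-tuning datasets from the command line.
Key points:
1. Meta offers an open-source answer to the problem of obtaining data for LLM fine-tuning
2. The workflow from raw data to fine-tuning gold: ingest, create, curate, save
3. Support for multiple file formats and fine-tuning tasks, improving data quality and fine-tuning efficiency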
# SDK command tree
SDK --> SystemCheck[system-check]
SDK[synthetic-data-kit] --> Ingest[ingest]
SDK --> Create[create]
SDK --> Curate[curate]
SDK --> SaveAs[save-as]
Ingest --> PDFFile[PDF File]
Ingest --> HTMLFile[HTML File]
Ingest --> YouTubeURL[File Format]
Create --> CoT[CoT]
Create --> QA[QA Pairs]
Create --> Summary[Summary]
Curate --> Filter[Filter by Quality]
SaveAs --> JSONL[JSONL Format]
SaveAs --> Alpaca[Alpaca Format]
SaveAs --> FT[Fine-Tuning Format]
SaveAs --> ChatML[ChatML Format]
# Install from PyPI
conda create -n synthetic-data python=3.10
conda activate synthetic-data
pip install synthetic-data-kit
# Or clone the repository to get the latest features:
git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .
# Start the vLLM server:
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
# Create the required directory structure:
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}
# Check that the system is ready:
synthetic-data-kit system-check
# Ingest a PDF
synthetic-data-kit ingest research_paper.pdf
# Generate 30 QA pairs with a quality threshold
synthetic-data-kit create data/output/research_paper.txt -n 30 --threshold 8.0
# Curate by quality
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5
# Save in OpenAI fine-tuning format
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft
# Example configuration
vllm:
  api_base: "http://localhost:8000/v1"
  model: "meta-llama/Llama-3.3-70B-Instruct"
generation:
  temperature: 0.7
  chunk_size: 4000
  num_pairs: 25
curate:
  threshold: 7.0
  batch_size: 8
prompts:
  qa_generation: |
    You are creating question-answer pairs for fine-tuning a legal assistant.
    Focus on technical legal concepts, precedents, and statutory interpretation.
    Below is a chunk of text about: {summary}...
    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.
    Return ONLY valid JSON formatted as:
    [
      {
        "question": "Detailed legal question?",
        "answer": "Precise legal answer."
      },
      ...
    ]
    Text:
    ---
    {text}
# Bash script to process multiple files
for file in data/pdf/*.pdf; do
  filename=$(basename "$file" .pdf)
  synthetic-data-kit ingest "$file"
  synthetic-data-kit create "data/output/${filename}.txt" -n 20
  synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5
  synthetic-data-kit save-as "data/cleaned/${filename}_cleaned.json" -f chatml
done
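Before running the loop above on a large folder, it can help to preview which commands and intermediate files each PDF would produce. The following is a minimal dry-run sketch, not part of the toolkit: `plan_pipeline` is a hypothetical helper name, and the sketch assumes the same directory layout and file-naming pattern the loop uses (`data/output/`, `data/generated/`, `data/cleaned/`). It only prints commands and never calls `synthetic-data-kit`.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the four pipeline commands for one PDF instead of
# running them. plan_pipeline is a hypothetical helper, not part of the CLI.
plan_pipeline() {
  local pdf="$1"
  local threshold="${2:-7.5}"   # same curate threshold as the loop above
  local name
  name="$(basename "$pdf" .pdf)"

  printf 'synthetic-data-kit ingest "%s"\n' "$pdf"
  printf 'synthetic-data-kit create "data/output/%s.txt" -n 20\n' "$name"
  printf 'synthetic-data-kit curate "data/generated/%s_qa_pairs.json" -t %s\n' "$name" "$threshold"
  printf 'synthetic-data-kit save-as "data/cleaned/%s_cleaned.json" -f chatml\n' "$name"
}

# Preview the commands for every PDF in data/pdf/ (prints nothing if the
# directory is empty or missing, because the unmatched glob is skipped).
for file in data/pdf/*.pdf; do
  [ -e "$file" ] || continue
  plan_pipeline "$file"
done
```

Passing a second argument, for example `plan_pipeline data/pdf/report.pdf 7.0`, previews a different curate threshold. The printed paths are quoted, so file names containing spaces remain visible as single arguments.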
synthetic-data-kit curate data/generated/report_qa_pairs.json -t 7.0