Install ScrapeGraphAI with pip:
pip install scrapegraphai
Install Playwright, which is used for JavaScript-based scraping:
playwright install
Installing the library in a virtual environment is recommended, to avoid conflicts with other libraries.
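SmartScraperGraph: a single-page scraper that only needs a user prompt and an input source.
SearchGraph: a multi-page scraper that extracts information from the top n results of a search engine.
SpeechGraph: a single-page scraper that extracts information from a website and generates an audio file.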
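SmartScraperGraph with a local model:
Make sure Ollama is installed and download the models with the ollama pull command.
The example code shows how to create a SmartScraperGraph instance and run it to get a list of projects with their descriptions.
SearchGraph with mixed models:
Groq serves as the LLM and Ollama as the embedding model.
The example code shows how to create a SearchGraph instance and run it to get a list of traditional recipes from Chioggia.
SpeechGraph with OpenAI:
Only the OpenAI API key and the model name need to be passed.
The example code shows how to create a SpeechGraph instance and run it to generate an audio file summarizing the projects.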
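The output of SmartScraperGraph is a list of projects with their descriptions.
The output of SearchGraph is a list of recipes.
The output of SpeechGraph is an audio file summarizing the projects on the page.
An OpenAI API key must be set up before use.
The documentation and reference pages are available on the official ScrapeGraphAI pages.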
The reference page for ScrapeGraphAI is available on its official PyPI page.
pip install scrapegraphai
You will also need to install Playwright for JavaScript-based scraping:
playwright install
Note: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries.
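As a minimal sketch of the virtual-environment recommendation, the snippet below uses Python's standard venv module to create an isolated environment and builds the two install commands from the text so that they run against that environment's interpreter. The helper names (env_python, install_commands, create_env, main) and the .venv directory name are hypothetical choices for illustration, not part of ScrapeGraphAI.

```python
import os
import subprocess
import venv
from pathlib import Path


def env_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_commands(env_dir: Path) -> list[list[str]]:
    """Build the install commands from the text, bound to the environment's interpreter."""
    py = str(env_python(env_dir))
    return [
        [py, "-m", "pip", "install", "scrapegraphai"],
        [py, "-m", "playwright", "install"],
    ]


def create_env(env_dir: Path, with_pip: bool = True) -> Path:
    """Create the virtual environment and return its interpreter path."""
    venv.create(env_dir, with_pip=with_pip)
    return env_python(env_dir)


def main() -> None:
    """Create ./.venv (hypothetical location) and install both packages into it."""
    target = Path(".venv")
    create_env(target)
    for cmd in install_commands(target):
        subprocess.run(cmd, check=True)
```

Calling main() performs the actual installation; the other helpers only compute paths and commands, so they can be inspected without touching the network.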
Follow the procedure at the following link to set up your OpenAI API key: link.
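Once the key exists, one common way to keep it out of source code is to read it from an environment variable. The sketch below assumes the key is stored in a variable named OPENAI_API_KEY and that the configuration uses an "llm" section with "api_key" and "model" entries, as in the SpeechGraph example in this article. The function name openai_llm_config and the default model name are illustrative assumptions.

```python
import os


def openai_llm_config(model: str = "gpt-3.5-turbo",
                      env_var: str = "OPENAI_API_KEY") -> dict:
    """Build an 'llm' config section from an API key stored in the environment.

    The model default and the variable name are assumptions; adjust both to your setup.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the scraper."
        )
    return {"llm": {"api_key": api_key, "model": model}}
```

Failing early with a clear message tends to be easier to debug than an authentication error raised in the middle of a scraping run.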
The documentation for ScrapeGraphAI can be found here.
Also check out the Docusaurus documentation.
There are three main scraping pipelines that can be used to extract information from a website (or local file):
SmartScraperGraph: single-page scraper that only needs a user prompt and an input source;
SearchGraph: multi-page scraper that extracts information from the top n search results of a search engine;
SpeechGraph: single-page scraper that extracts information from a website and generates an audio file.
It is possible to use different LLMs through APIs, such as OpenAI, Groq, Azure and Gemini, or local models using Ollama.
Remember to have Ollama installed and download the models using the ollama pull command.
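The prerequisite above can be checked from Python before running a graph. The sketch below looks for the ollama executable on PATH and builds one ollama pull command per model. The function name ollama_pull_plan is hypothetical, and the model names in the usage comment are only examples of what might be pulled.

```python
import shutil
from typing import Callable, Iterable, Optional


def ollama_pull_plan(
    models: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[list[str]]:
    """Return one 'ollama pull <model>' command per model.

    Raises FileNotFoundError when the ollama executable is not on PATH.
    The 'which' parameter exists so the lookup can be replaced in tests.
    """
    exe = which("ollama")
    if exe is None:
        raise FileNotFoundError(
            "ollama executable not found on PATH; install Ollama first"
        )
    return [[exe, "pull", model] for model in models]


# Example usage (model names are assumptions, substitute your own):
# import subprocess
# for cmd in ollama_pull_plan(["mistral", "nomic-embed-text"]):
#     subprocess.run(cmd, check=True)
```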
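from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",  # example model; use one pulled with ollama pull
        "temperature": 0,
        "format": "json",  # Ollama needs the output format stated explicitly
        "base_url": "http://localhost:11434",  # default local Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",  # example embedding model
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",  # example page to scrape
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

The output will be a list of projects with their descriptions like the following: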
{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}

We use Groq for the LLM and Ollama for the embeddings.
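from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",  # example Groq-hosted model
        "api_key": "GROQ_API_KEY",  # placeholder; replace with your Groq API key
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",  # example embedding model
        "base_url": "http://localhost:11434",  # default local Ollama URL
    },
    "max_results": 5,  # example limit on the number of search results
}

search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config
)

result = search_graph.run()
print(result)

The output will be a list of recipes like the following: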
{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}

You just need to pass the OpenAI API key and the model name.
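from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",  # placeholder; replace with your OpenAI API key
        "model": "gpt-3.5-turbo",  # example chat model
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",  # placeholder; replace with your OpenAI API key
        "model": "tts-1",  # example text-to-speech model
        "voice": "alloy"  # example voice
    },
    "output_path": "audio_summary.mp3",  # example output file
}

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",  # example page to scrape
    config=graph_config,
)

result = speech_graph.run()
print(result)

The output will be an audio file with the summary of the projects on the page.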
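The dictionary outputs shown earlier for SmartScraperGraph and SearchGraph can be handled like ordinary Python data. The sketch below assumes the result is a plain dict shaped like those printed outputs, with a "projects" or "recipes" list. The helper names are hypothetical, and the sample data is copied (truncated) from the outputs above rather than produced by a live run.

```python
from typing import Any


def project_lines(result: dict[str, Any]) -> list[str]:
    """Format each project as 'title: description'; a missing key yields an empty list."""
    return [f"{p['title']}: {p['description']}" for p in result.get("projects", [])]


def recipe_names(result: dict[str, Any]) -> list[str]:
    """Collect the recipe names; a missing key yields an empty list."""
    return [r["name"] for r in result.get("recipes", [])]


# Sample data copied from the outputs shown in the article (truncated).
projects_result = {
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real life "
                           "rotary pendulum using RL algorithms",
        },
        {
            "title": "DQN Implementation from scratch",
            "description": "Developed a Deep Q-Network algorithm to train a simple "
                           "and double pendulum",
        },
    ]
}
recipes_result = {
    "recipes": [{"name": "Sarde in Saòre"}, {"name": "Bigoli in salsa"}]
}

for line in project_lines(projects_result):
    print(line)
print(", ".join(recipe_names(recipes_result)))
```

Since an LLM produces the structure, the exact keys can vary with the prompt, so reading them with .get and a default is a cautious choice.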