
When AI Meets Web Scraping: ScrapeGraphAI Combines LLMs with Scraping So That One Sentence Is All You Need

Published: 2024-05-09 03:36:18

Author: AI进修生 (Aitrainee)

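ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to build scraping pipelines for websites, documents, and XML files. Just say what information you want to extract, and ScrapeGraphAI does the rest.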

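In today's data-driven world, web scraping has become an essential tool for collecting information from the internet. Traditional scrapers, however, struggle with the dynamic nature of websites: extraction rules break when a page layout changes, so developers have to keep maintaining and updating them.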

Enter ScrapeGraphAI, a Python library that uses large language models (LLMs) and direct graph logic to build flexible, adaptable web scraping pipelines.

ScrapeGraphAI is an open-source answer to the challenges of a constantly changing web. Here is what sets it apart:

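Direct graph logic: scraping pipelines are assembled dynamically as graphs of nodes, driven by the prompt the user supplies, so that only the requested data is retrieved.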

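Multiple models and APIs: ScrapeGraphAI supports a range of back ends, including OpenAI's GPT models, Groq, Azure, Gemini, and local models served by Ollama (optionally inside Docker), so users can pick the option that best fits their scraping needs.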

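Flexibility and adaptability: traditional scrapers usually rely on fixed patterns or manual configuration to pull data out of a page. Because ScrapeGraphAI is driven by LLMs, it can adapt to changes in a site's structure, reducing the need for constant developer intervention.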

Easy installation: a single pip install command is enough to set up ScrapeGraphAI and start scraping websites, documents, and XML files.
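To make the "direct graph logic" idea more concrete, here is a minimal sketch of a pipeline built from small nodes that pass a shared state along. It is an illustration only and not ScrapeGraphAI's internal implementation: the node names, the `run_graph` helper, and the `fake_llm` stand-in are all hypothetical.

```python
import re
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]


def fetch_node(state: State) -> State:
    # Stand-in for downloading a page: the HTML is already in the state.
    return {**state, "document": state["source"]}


def parse_node(state: State) -> State:
    # Crude tag stripping so the sketch needs no HTML library.
    text = re.sub(r"<[^>]+>", " ", str(state["document"]))
    return {**state, "text": " ".join(text.split())}


def make_answer_node(llm: Callable[[str], str]) -> Node:
    # Builds a node that sends the user prompt plus page text to a model.
    def answer_node(state: State) -> State:
        prompt = f"{state['prompt']}\n\n{state['text']}"
        return {**state, "answer": llm(prompt)}

    return answer_node


def run_graph(nodes: List[Node], state: State) -> State:
    # A linear graph: each node receives the state produced by the previous one.
    for node in nodes:
        state = node(state)
    return state


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return f"received {len(prompt)} characters"


result = run_graph(
    [fetch_node, parse_node, make_answer_node(fake_llm)],
    {"prompt": "List me all the articles", "source": "<h1>Rotary Pendulum RL</h1>"},
)
print(result["text"])
print(result["answer"])
```

Swapping `fake_llm` for a real model call, or inserting extra nodes between fetching and answering, changes the pipeline without touching the other nodes, which is the appeal of the graph approach.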

ScrapeGraphAI: You Only Scrape Once

Quick install

The reference page for scrapegraphai is available on PyPI, the official Python Package Index.

pip install scrapegraphai

You also need to install Playwright for scraping JavaScript-based pages:

playwright install

Note: installing the library in a virtual environment is recommended to avoid conflicts with other libraries.
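If you are not sure whether the current interpreter is running inside a virtual environment, a quick check like the one below can help before you install anything. This is a small sketch with a hypothetical helper name, `in_virtualenv`; it covers environments created with venv or virtualenv and does not detect conda environments.

```python
import sys


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base interpreter.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if in_virtualenv():
    print(f"Virtual environment active: {sys.prefix}")
else:
    print("No virtual environment detected; consider creating one first.")
```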

Demo

Official Streamlit demo:

https://scrapegraph-ai-demo.streamlit.app/

You can also try it directly in the browser with Google Colab; the notebook is linked from the Colab badge in the project README.

To set up your OpenAI API key, follow the steps at this link:

https://scrapegraph-ai.readthedocs.io/en/latest/index.html

Documentation

The ScrapeGraphAI documentation can be found here:

https://scrapegraph-ai.readthedocs.io/en/latest/

Also check out the Docusaurus documentation:

https://scrapegraph-doc.onrender.com/

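Usage

You can use the SmartScraperGraph class to extract information from a website with a prompt.

SmartScraperGraph is a direct graph implementation that uses the most common nodes found in a web scraping pipeline. See the documentation for more information.

Case 1: Extracting information using Ollama

Remember to download the model in Ollama separately!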

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set the Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set the Ollama URL
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # can also be a string containing already-downloaded HTML
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
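Since the model has to be pulled in Ollama separately, it can save time to confirm that the server is reachable and the model is present before running the graph. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models. The helper names `ollama_models` and `has_model` are hypothetical, and the base URL is assumed to match the one in the config above.

```python
import http.client
import json
import urllib.request
from typing import List, Optional


def ollama_models(
    base_url: str = "http://localhost:11434", timeout: float = 2.0
) -> Optional[List[str]]:
    """Return the names of locally available models, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, malformed URL, or a non-JSON reply.
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


def has_model(models: List[str], wanted: str) -> bool:
    # Ollama reports names with a tag, for example "mistral:latest".
    return any(name == wanted or name.startswith(wanted + ":") for name in models)


if __name__ == "__main__":
    models = ollama_models()
    if models is None:
        print("Ollama is not reachable at http://localhost:11434")
    elif not has_model(models, "mistral"):
        print("Model not found locally; try: ollama pull mistral")
    else:
        print("Ollama is up and mistral is available")
```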

Case 2: Extracting information using Docker

Note: before using a local model, remember to create the Docker container!

    docker-compose up -d
    docker exec -it ollama ollama pull stablelm-zephyr

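You can use any model available on Ollama, or your own model, instead of stablelm-zephyr. Whichever model you pull, the "model" entry in the config has to name it; the example below uses ollama/mistral, so either pull mistral or change that entry.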

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        # "model_tokens": 2000, # set the context length arbitrarily
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # can also be a string containing already-downloaded HTML
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Case 3: Extracting information using OpenAI models

from scrapegraphai.graphs import SmartScraperGraph
OPENAI_API_KEY = "YOUR_API_KEY"

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    # can also be a string containing already-downloaded HTML
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
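Hard-coding the key as in the snippet above is fine for a quick test, but in shared code it is safer to read it from the environment. A minimal sketch follows, assuming the key is stored in an environment variable named `OPENAI_API_KEY`; that variable name and the helper names `require_env` and `openai_graph_config` are assumptions for illustration, not part of ScrapeGraphAI.

```python
import os


def require_env(name: str) -> str:
    # Fail early with a clear message instead of sending an empty key.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def openai_graph_config(model: str = "gpt-3.5-turbo") -> dict:
    # Same shape as the graph_config dictionary used in the example above.
    return {
        "llm": {
            "api_key": require_env("OPENAI_API_KEY"),
            "model": model,
        },
    }
```

The returned dictionary can be passed as `config=openai_graph_config()` when constructing the graph.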

Case 4: Extracting information using Groq

import os

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# Read the Groq API key from the environment
groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": groq_key,
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",
    },
    "headless": False
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description and the author.",
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Case 5: Extracting information using Azure

import os

from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from scrapegraphai.graphs import SmartScraperGraph

# Deployment names and the API version are read from environment variables
llm_model_instance = AzureChatOpenAI(
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
)

embedder_model_instance = AzureOpenAIEmbeddings(
    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)

# Pass the pre-built model instances instead of model names
graph_config = {
    "llm": {"model_instance": llm_model_instance},
    "embeddings": {"model_instance": embedder_model_instance}
}

smart_scraper_graph = SmartScraperGraph(
    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
    event_end_date, event_end_time, location, event_mode, event_category,
    third_party_redirect, no_of_days,
    time_in_hours, hosted_or_attending, refreshments_type,
    registration_available, registration_link""",
    source="https://www.hmhco.com/event",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Case 6: Extracting information using Gemini

from scrapegraphai.graphs import SmartScraperGraph
GOOGLE_APIKEY = "YOUR_API_KEY"

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": GOOGLE_APIKEY,
        "model": "gemini-pro",
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the articles",
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

In each of the cases above, the output is a dictionary containing the extracted information, for example:

{
    'titles': [
        'Rotary Pendulum RL'
        ],
    'descriptions': [
        'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
        ]
}
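A dictionary of parallel lists is often easier to work with once it is turned into one record per item. The sketch below does that for the sample output shown above. The key names `titles` and `descriptions` come from that sample only; the actual keys depend on the model and the prompt, so treat the helper `to_records` as a hypothetical starting point.

```python
import json
from typing import Dict, List


def to_records(result: Dict[str, List[str]]) -> List[Dict[str, str]]:
    # Pair each title with the description at the same position.
    titles = result.get("titles", [])
    descriptions = result.get("descriptions", [])
    return [
        {"title": title, "description": description}
        for title, description in zip(titles, descriptions)
    ]


sample = {
    "titles": ["Rotary Pendulum RL"],
    "descriptions": [
        "Open Source project aimed at controlling a real life rotary pendulum using RL algorithms"
    ],
}

records = to_records(sample)
print(json.dumps(records, indent=2, ensure_ascii=False))
```

Note that `zip` stops at the shorter list, so a missing description silently drops the matching title; check the list lengths first if that matters for your data.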