我要投稿

AI操作网页：browser-use和AI大模型互动解析

发布日期：2025-07-21 08:37:09 浏览次数： 1560

作者：阿里云开发者

微信搜一搜，关注“阿里云开发者”

一、browser-use介绍

AI大模型大家最熟悉的就是输出文本，图片，或者视频，在工作与生活中或许已经能够驾轻就熟地使用。现在MCP（Model Context Protocol）的兴起，怎么使用AI大模型代替人工进行操作，进入了大家视野。有一个开源的 AI 驱动的浏览器自动化框架，对 AI 代理友好，能够实现在线任务的自动化流程，并且在开源社区具有强大的影响力（GitHub 星数 63.4k+），它就是browser-use。

那么browser-use具体有多强大，可以通过几个demo感受下：

1.将杂货商品添加到购物车并完成结账。

2.将我最新的LinkedIn关注者添加到Salesforce的潜在客户列表中。

3.请阅读我的简历，寻找机器学习职位，将它们保存到文件中，然后在新标签页中开始申请。如果需要帮助，请随时询问我。

4.在Google文档中为我父亲写一封感谢信，感谢他一直以来的付出，并将文档另存为PDF格式。

5.在网站Hugging Face上查找使用 CC BY-SA 4.0 许可证的模型，按点赞数排序，将前5名保存到文件中。

以上列举的任务，可以在github的README中看到demo演示：https://github.com/browser-use/browser-use，AI大模型都丝滑的完成了输入给它的任务。

问：那么browser-use是怎么做到这一切的？

答：高超地与AI大模型互动的技巧。

本篇侧重于browser-use和AI大模型互动方式进行深度解析，在browser-use这个优秀实践中，探究AI大模型交互过程，有哪些独特的技巧和方式，分析实现过程，提出可以应用在其他与大模型交互场景的Tips，给出一点点对于AI大模型其他应用场景的启发。

二、解析

browsers-use与AI大模型交互的方式非常的高效，离不开它精巧的prompt设计，和一系列不同类型的Message紧密组合的输入。

3.1 完整输入

以一个非常简单的任务为例：“1. 打开网址https://one.console.con.env136.shuguang.com ' 2.输入用户名和密码进行登陆 ”

browser-use会基于上面的任务，不断地拆解成小目标以及和大模型进行交互，下面是与大模型交互的其中一次输入。

[  SystemMessage(    content='You are an AI Agent designed to automate browser tasks. Your goal is to accomplish the ultimate task following the rules.\n\n# Input Format\nTask\nPrevious steps\nCurrent URL\nOpen Tabs\nInteractive Elements\n[index]<type>text</type>\n- index: Numeric identifier for interaction\n- type: HTML element type (button, input, etc.)\n- text: Element description\nExample:\n[33]<button>Submit Form</button>\n\n- Only elements with numeric indexes in [] are interactive\n- elements without [] provide only context\n\n# Response Rules\n1. RESPONSE FORMAT: You must ALWAYS respond with valid JSON in this exact format:\n{"current_state": {"evaluation_previous_goal": "Success|Failed|Unknown - Analyze the current elements and the image to check if the previous goals/actions are successful like intended by the task. Mention if something unexpected happened. Shortly state why/why not",\n"memory": "Description of what has been done and what you need to remember. Be very specific. Count here ALWAYS how many times you have done something and how many remain. E.g. 0 out of 10 websites analyzed. Continue with abc and xyz",\n"next_goal": "What needs to be done with the next immediate action"},\n"action":[{"one_action_name": {// action-specific parameter}}, // ... more actions in sequence]}\n\n2. ACTIONS: You can specify multiple actions in the list to be executed in sequence. But always specify only one action name per item. Use maximum {max_actions} actions per sequence.\nCommon action sequences:\n- Form filling: [{"input_text": {"index": 1, "text": "username"}}, {"input_text": {"index": 2, "text": "password"}}, {"click_element": {"index": 3}}]\n- Navigation and extraction: [{"go_to_url": {"url": "https://example.com"}}, {"extract_content": {"goal": "extract the names"}}]\n- Actions are executed in the given order\n- If the page changes after an action, the sequence is interrupted and you get the new state.\n- Only provide the action sequence until an action which changes the page state significantly.\n- Try to be efficient, e.g. fill forms at once, or chain actions where nothing changes on the page\n- only use multiple actions if it makes sense.\n\n3. ELEMENT INTERACTION:\n- Only use indexes of the interactive elements\n- Elements marked with "[]Non-interactive text" are non-interactive\n\n4. NAVIGATION & ERROR HANDLING:\n- If no suitable elements exist, use other functions to complete the task\n- If stuck, try alternative approaches - like going back to a previous page, new search, new tab etc.\n- Handle popups/cookies by accepting or closing them\n- Use scroll to find elements you are looking for\n- If you want to research something, open a new tab instead of using the current tab\n- If captcha pops up, try to solve it - else try a different approach\n- If the page is not fully loaded, use wait action\n\n5. TASK COMPLETION:\n- Use the done action as the last action as soon as the ultimate task is complete\n- Dont use "done" before you are done with everything the user asked you, except you reach the last step of max_steps. \n- If you reach your last step, use the done action even if the task is not fully finished. Provide all the information you have gathered so far. If the ultimate task is completly finished set success to true. If not everything the user asked for is completed set success in done to false!\n- If you have to do something repeatedly for example the task says for "each", or "for all", or "x times", count always inside "memory" how many times you have done it and how many remain. Don\'t stop until you have completed like the task asked you. Only call done after the last step.\n- Don\'t hallucinate actions\n- Make sure you include everything you found out for the ultimate task in the done text parameter. Do not just say you are done, but include the requested information of the task. \n\n6. VISUAL CONTEXT:\n- When an image is provided, use it to understand the page layout\n- Bounding boxes with labels on their top right corner correspond to element indexes\n\n7. Form filling:\n- If you fill an input field and your action sequence is interrupted, most often something changed e.g. suggestions popped up under the field.\n\n8. Long tasks:\n- Keep track of the status and subresults in the memory. \n\n9. Extraction:\n- If your task is to find information - call extract_content on the specific pages to get and store the information.\nYour responses must be always JSON with the specified format. ',     additional_kwargs={},     response_metadata={}),   HumanMessage(    content='Your ultimate task is: """1. 打开网址https://one.console.con.env136.shuguang.com 2.输入用户名和密码 """. If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.',     additional_kwargs={},     response_metadata={}  ),   HumanMessage(    content='Example output:',     additional_kwargs={}, response_metadata={}  ),   AIMessage(    content='',     additional_kwargs={},     response_metadata={},     tool_calls=[      {'name': 'AgentOutput', 'args': {'current_state': {'evaluation_previous_goal': 'Success - I opend the first page', 'memory': 'Starting with the new task. I have completed 1/10 steps', 'next_goal': 'Click on company a'}, 'action': [{'click_element': {'index': 0}}]}, 'id': '1', 'type': 'tool_call'}]  ),  ToolMessage(    content='Browser started', tool_call_id='1'  ),   HumanMessage(    content='[Your task history memory starts here]',     additional_kwargs={}, response_metadata={}  ),   AIMessage(content='', additional_kwargs={}, response_metadata={}, tool_calls=[{'name': 'AgentOutput', 'args': {'current_state': {'evaluation_previous_goal': 'Unknown - The browser has just started, and the current page is blank. No previous actions have been taken yet.', 'memory': '0 out of 4 steps completed. Need to open the URL https://one.console.con.env136.shuguang.com next.', 'next_goal': 'Navigate to the specified URL'}, 'action': [{'go_to_url': {'url': 'https://one.console.con.env136.shuguang.com'}}]}, 'id': '2', 'type': 'tool_call'}]),   ToolMessage(content='', tool_call_id='2'),   HumanMessage(content='Action result: 🔗  Navigated to https://one.console.con.env136.shuguang.com', additional_kwargs={}, response_metadata={}),   HumanMessage(content=[    {'type': 'text', 'text': "\n[Task history memory ends]\n                          [Current state starts here]\nThe following is one-time information - if you need to remember it write it to memory:\nCurrent url: https://one.console.con.env136.shuguang.com/ascm/login?oauth_callback=https%3A%2F%2Fone.console.con.env136.shuguang.com%2F\nAvailable tabs:\n[TabInfo(page_id=0, url='https://one.console.con.env136.shuguang.com/ascm/login?oauth_callback=https%3A%2F%2Fone.console.con.env136.shuguang.com%2F', title='登录 · 飞天企业版')]\nInteractive elements from top layer of the current page inside the viewport:\n[Start of page]\n飞天企业版\n政企数智创新的同行者\n飞天企业版\n政企数智创新的同行者\n[0]<img />\n[1]<a />\n[2]<img />\n飞天企业版\n[3]<button |\n浏览器证书安装/>\n简体中文\n你好，\n欢迎访问\n飞天企业版\n[4]<form grid/>\n[5]<input 账户名;account/>\n[6]<input password;密码/>\n[7]<span />\n[8]<i />\n[9]<button 忘记密码/>\n[10]<button submit>账号登录/>\n[11]<button 授权登录/>\n© 2009-现在 阿里云计算有限公司。版权所有\n[End of page]\nCurrent step: 2/100Current date and time: 2025-03-28 14:58\n"},     {'type': 'image_url', 'image_url': {'url': '避免过大暂时去掉image的base64格式的信息'}}  ], additional_kwargs={}, response_metadata={})]

输入是一个列表，包含各种类型的Message，这些概念常见于 LangChain、OpenAI API 等工具中，帮助区分不同角色的交互。

browser-use使用的LangChain工具，也同样用这个类型对输入进行封装。

接下来会针对具体的输入内容进行详细拆解。

3.2 系统提示(SystemMessage)

这一次输入的Message列表中，第一个就是SystemMessage。

SystemMessage类型一般是用来定义AI角色和对话规则的指令，内容不随着输入的任务而改变，通常在对话开始时设置，指导 AI 的行为和输出格式。

browser-use采用的给出SystemMessage的方式是通过Markdown的文件书写，输入时转成字符串输入，可以看到清晰的结构：

Tips1：大模型可以识别Markdown的文本格式，使用Markdown的格式，实现清晰的语言架构，增加可读性，适用于固定的SystemMessage。这是一种对于大模型友好的表达方式。

这里browser-use设计SystemMessage的内容遵循着一般的大模型的设计原则：约束 AI 的行为、设定对话规则和控制输出格式。具体内容如下：

3.2.1 指定 AI 以特定身份回答问题

要求 AI 以特定身份回答问题：原文：You are an AI agent designed to automate browser tasks.Your goal is to accomplish the ultimate task following the rules.（你是一个专门设计用于自动化浏览器任务的 AI 智能体，你的目标是遵守下面的规则，完成最终任务）

3.2.2 约定输入的格式（Input Format）

browser-use列举的可能的输入为：

TaskPrevious stepsCurrent URLOpen TabsInteractive Elements

解释其术语以及表现形式，例如Interactive Elements（可交互的页面元素）这个，详细解释其描述格式：

[index]<type>text</type>- index: Numeric identifier for interaction- type: HTML element type(button, input, etc.)- text: Element descriptionExample:[33]<button>Submit Form</button>

Tips2：对于有特殊构造或表述的输入形式，可以在SystemMessage中对特殊结构进行阐述和解释，便于大模型理解，最好存在具体举例。

3.2.3 控制输出格式（Response Rules）

严格限制返回的JSON格式，约束其表达的内容，具体如下：

{  "current_state": {    "evaluation_previous_goal": "Success|Failed|Unknown - Analyze the current elements and the image to check if the previous goals/actions are successful like intended by the task. Mention if something unexpected happened. Shortly state why/why not",    "memory": "Description of what has been done and what you need to remember. Be very specific. Count here ALWAYS how many times you have done something and how many remain. E.g. 0 out of 10 websites analyzed. Continue with abc and xyz",    "next_goal": "What needs to be done with the next immediate action"  },  "action":[{"one_action_name": {// action-specific parameter}}, // ... more actions in sequence]}

这边将大模型的输出部分分成三部分：

1.current_state：当前状态

a.evaluation_previous_goal：将当前任务进行拆解之后输出成小目标

b.memory：保存需要记忆的信息

c.next_goal：评估接下来的目标

2.action：具体的操作步骤，这个是browser-use对浏览器的实际操作步骤。这里实际上是给的选择，后续会动态给予actions列表，在其列表中选择合适的action。

Tips3：对于需要保存之前与大模型沟通记忆的交互方案，可以直接如上面current_state的设计方式，设置一个输出让大模型自己决定需要记忆的部分。

3.3 人类消息(HumanMessage)

HumanMessage一般包含的用户输入的自然语言问题或指令，代表对话中的用户端信息。作用是将用户输入传递给 AI 进行处理。输入可以是文本、图像、音频等（需根据框架支持的格式调整）。

此次输入中HumanMessage包含4个方面的信息。

3.3.1 任务提示

本次任务目标的指示，并且补充“If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.”说明。指示在完成完整任务之后，大模型给出停止的信号。

HumanMessage(    content='Your ultimate task is: """1. 打开网址https://one.console.con.env136.shuguang.com 2.输入用户名和密码 """. If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.',     additional_kwargs={},     response_metadata={}  ),

3.3.2 示例

browser-use对于这个消息的应用中，并不强调要用一个HumanMessage完成所有信息的传递，而是片段式的，例如“3.1完整输入”中，第2条HumanMessage只包含“Example output:”这个信息，指示后续的AIMessage，ToolMessage只是示例。

HumanMessage(    content='Example output:',     additional_kwargs={}, response_metadata={})

3.3.3 历史任务执行过程

指示历史任务执行过程。通过[Your task history memory starts here]的标志开始，[Task history memory ends]标志结束，来告诉大模型中间的部分是历史任务执行过程中所有信息，如下图示例：

Tips4：对于HumanMessage的设计，可以是在单个HumanMessage不连续的，在整个Message列表输入中信息完整即可。可以通过指定的的标志，如[Your task history memory starts here]标志开始，[Task history memory ends]标志结束。

3.3.4 当前任务状态

指示当前任务状态，包括当前Current url，Available tabs, Interactive element, Current step, Current data and time, image

其中Interactive element使用[Start of page]和[End of page]描述当前页面元素信息，实际上依靠这个信息，不使用浏览器页面截图也可以在一定程度上完成任务，因此也支持非多模态的大模型处理该任务。

但是存在图像图片作为输入，会使准确率更高。图片截图如下，将上面的标号直接标注在图片中：

3.4 AI 消息(AIMessage)

AIMessage是AI对HumanMessage的直接响应，代表AI的输出内容。一般返回 AI 的回答或操作结果，可能包含文本、结构化数据（如 JSON）、或对后续步骤的建议。

browser-use这里使用AIMessage方式，是与前面包含“Example output:”信息的HumanMessage进行组合，告诉AI这是示例的AI要输出的方式和内容。

AIMessage(    content='',     additional_kwargs={},     response_metadata={},     tool_calls=[      {'name': 'AgentOutput', 'args': {'current_state': {'evaluation_previous_goal': 'Success - I opend the first page', 'memory': 'Starting with the new task. I have completed 1/10 steps', 'next_goal': 'Click on company a'}, 'action': [{'click_element': {'index': 0}}]}, 'id': '1', 'type': 'tool_call'}])

Tips5：AIMessage可以用于对AI大模型返回格式的信息补充，作为举例使用，使AI大模型更明确需要输出的内容。

3.5 工具消息(ToolMessage)

AI 调用外部工具或 API 时发送的指令或接收的响应，用于增强对话能力。AI 可通过 ToolMessage 指令调用数据库、搜索引擎、计算器等工具。

browser-use使用的经典用法，browser-use通过ToolMessage指令实现控制浏览器的操作，并且通过跟踪 HumanMessage 和 AIMessage 的历史，实现上下文感知的对话。

AIMessage(content='', additional_kwargs={}, response_metadata={}, tool_calls=[{'name': 'AgentOutput', 'args': {'current_state': {'evaluation_previous_goal': 'Unknown - The browser has just started, and the current page is blank. No previous actions have been taken yet.', 'memory': '0 out of 4 steps completed. Need to open the URL https://one.console.con.env136.shuguang.com next.', 'next_goal': 'Navigate to the specified URL'}, 'action': [{'go_to_url': {'url': 'https://one.console.con.env136.shuguang.com'}}]}, 'id': '2', 'type': 'tool_call'}]), ToolMessage(content='', tool_call_id='2')

Tips6：通过ToolMessage实现拓展的功能操作，并且通过保存所有调用记录，可以清晰记忆之前对话的大模型交互模式。

3.6 完整输出

browser-use使用的LangChain库调用AI大模型接口

使用BaseChatModel实例的with_structured_output方法，可以实现给定对象，严格限制输出的类型。

browser-use的源码为：

structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True)

response: dict[str, Any] = await structured_llm.ainvoke(input_messages)

输出会按照self.AgentOutput的类型严格输出：

{  'raw': AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_2de28edbc1d6437cb19bb9','function': {'arguments': '{"current_state": {"evaluation_previous_goal": "Success", "memory": "Navigated to the login page of https://one.console.con.env136.shuguang.com. Need to input username and password next.", "next_goal": "Input username \'admin\' and password \'****\', then click the login button."}, "action": [{"input_text": {"index": 5, "text": "admin"}}, {"input_text": {"index": 6, "text": "****"}}, {"click_element": {"index": 10}}]}', 'name': 'AgentOutput'}, 'type': 'function', 'index': 0}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 134, 'prompt_tokens': 4345, 'total_tokens': 4479, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': None, 'rejected_prediction_tokens': None, 'text_tokens': 134}, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': None, 'text_tokens': 3089, 'image_tokens': 1256}}, 'model_name': 'qwen2.5-vl-72b-instruct', 'system_fingerprint': None, 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-adbc1357-cc00-4f40-bcee-0845d82940c7-0', tool_calls=[{'name': 'AgentOutput', 'args': {'current_state': {'evaluation_previous_goal': 'Success', 'memory': 'Navigated to the login page of https://one.console.con.env136.shuguang.com. Need to input username and password next.', 'next_goal': "Input username 'admin' and password '****', then click the login button."}, 'action': [{'input_text': {'index': 5, 'text': 'admin'}}, {'input_text': {'index': 6, 'text': '****'}}, {'click_element': {'index': 10}}]}, 'id': 'call_2de28edbc1d6437cb19bb9', 'type': 'tool_call'}], usage_metadata={'input_tokens': 4345, 'output_tokens': 134, 'total_tokens': 4479, 'input_token_details': {}, 'output_token_details': {}}),   'parsed':  AgentOutput(    current_state=AgentBrain(evaluation_previous_goal='Success', memory='Navigated to the login page of https://one.console.con.env136.shuguang.com. Need to input username and password next.', next_goal="Input username 'admin' and password '****', then click the login button."),     action=[ActionModel(done=None, search_google=None, go_to_url=None, go_back=None, wait=None, click_element=None, input_text=InputTextAction(index=5, text='admin', xpath=None), switch_tab=None, open_tab=None, extract_content=None, scroll_down=None, scroll_up=None, send_keys=None, scroll_to_text=None, get_dropdown_options=None, select_dropdown_option=None), ActionModel(done=None, search_google=None, go_to_url=None, go_back=None, wait=None, click_element=None, input_text=InputTextAction(index=6, text='****', xpath=None), switch_tab=None, open_tab=None, extract_content=None, scroll_down=None, scroll_up=None, send_keys=None, scroll_to_text=None, get_dropdown_options=None, select_dropdown_option=None), ActionModel(done=None, search_google=None, go_to_url=None, go_back=None, wait=None, click_element=ClickElementAction(index=10, xpath=None), input_text=None, switch_tab=None, open_tab=None, extract_content=None, scroll_down=None, scroll_up=None, send_keys=None, scroll_to_text=None, get_dropdown_options=None, select_dropdown_option=None)]  ),   'parsing_error': None}

Tips7：通过大模型的LangChain库中with_structured_output方法，能够严格约束大模型的输出。

三、总结

browser-use是一个非常优秀的大模型交互实践，在和AI大模型互动过程存在独特的技巧和方式。

可以应用在其他与大模型交互场景的Tips汇总：

1.大模型可以识别Markdown的文本格式，使用Markdown的格式，实现清晰的语言架构，增加可读性，适用于固定的SystemMessage。这是一种对于大模型友好的表达方式。

2.对于有特殊构造或表述的输入形式，可以在SystemMessage中对特殊结构进行阐述和解释，便于大模型理解，最好存在具体举例。

3.对于需要保存之前与大模型沟通记忆的交互方案，可以直接如上面current_state的设计方式，设置一个输出让大模型自己决定需要记忆的部分。

4.对于HumanMessage的设计，可以是在单个HumanMessage不连续的，在整个Message列表输入中信息完整即可。可以通过指定的的标志，如[Your task history memory starts here]标志开始，[Task history memory ends]标志结束。

5.AIMessage可以用于对AI大模型返回格式的信息补充，作为举例使用，使AI大模型更明确需要输出的内容。

6.通过ToolMessage实现拓展的功能操作，并且通过保存所有调用记录，可以清晰记忆之前对话的大模型交互模式。

7.通过大模型的LangChain库中with_structured_output方法，能够严格约束大模型的输出。

53AI，企业落地大模型首选服务商

产品：场景落地咨询+大模型应用平台+行业解决方案

承诺：免费场景POC验证，效果验证后签署服务协议。零风险落地应用大模型，已交付160+中大型企业