
OpenAI Official: GPT-5 Prompting Guide

Published: 2025-08-11 22:11:56

Author: AI咖啡馆


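GPT-5, OpenAI's newest flagship model, delivers substantial gains in agentic task performance, coding, raw intelligence, and steerability, and it has drawn wide attention and debate across the industry (with decidedly mixed reviews).
Note: In AI, "agentic" refers to the capacity of a system or model to make decisions and act autonomously while carrying out a task. It describes how an agent behaves under different degrees of guidance, from following explicit instructions to responding proactively and flexibly to complex environments. In this article, "agentic" is used mainly when discussing how to tune GPT-5's level of autonomy in a task.

Although the model performs well out of the box across many domains, this guide draws on OpenAI's experience training and applying the model to share prompting techniques that maximize output quality. It covers improving agentic task performance, ensuring instruction adherence, using new API features, and best practices for frontend and software engineering tasks, and it includes insights from the AI code editor Cursor on tuning prompts for GPT-5.

Following these best practices and using the official standard tools wherever possible has proven to improve model performance. We hope this guide and the accompanying prompt optimizer tool give you a solid starting point for using GPT-5. Prompting is not a one-size-fits-all exercise, though: once you have the basics, we encourage you to experiment and iterate to find the approach that best fits your needs.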

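Improving the predictability of agentic workflows

OpenAI trained GPT-5 with developers in mind, focusing on tool calling, instruction following, and long-context understanding so that it can serve as a foundation model for agentic applications. When integrating agentic behavior or tool calling into a workflow, OpenAI strongly recommends upgrading to the Responses API, which persists reasoning across tool calls and yields more efficient, more intelligent outputs.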

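Controlling agentic eagerness

Agentic scaffolds span a wide spectrum of control. Some systems delegate most decision-making to the underlying model, while others constrain the model tightly with programmatic logic. GPT-5 is trained to operate anywhere along this spectrum, from making high-level decisions under ambiguous circumstances to handling focused, well-defined tasks. This section covers how to calibrate GPT-5's agentic eagerness, that is, its balance between proactive exploration and waiting for explicit guidance.

Prompting for less eagerness

In agentic environments, GPT-5 is thorough by default and gathers context comprehensively to produce a correct answer. In some scenarios you may want to narrow the scope of this behavior, for example to cut unnecessary tool calls and reduce latency to the final answer. In that case, try the following:

• Lower the reasoning effort: Switch to a lower reasoning_effort. This reduces exploration depth but improves efficiency and latency. Many workflows get consistent, reliable results at medium or even low reasoning_effort.
• Define exploration criteria: State clearly in the prompt how the model should explore the problem space. This narrows the model's search and keeps it from exploring and reasoning about too many unrelated directions.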

<context_gathering>
Goal: Get enough context fast. Parallelize discovery and stop as soon as you can act.

Method:
- Start broad, then fan out to focused subqueries.
- In parallel, launch varied queries; read top hits per query. Deduplicate paths and cache; don’t repeat queries.
- Avoid over searching for context. If needed, run targeted searches in one parallel batch.

Early stop criteria:
- You can name exact content to change.
- Top hits converge (~70%) on one area/path.

Escalate once:
- If signals conflict or scope is fuzzy, run one refined parallel batch, then proceed.

Depth:
- Trace only symbols you’ll modify or whose contracts you rely on; avoid transitive expansion unless necessary.

Loop:
- Batch search → minimal plan → complete task.
- Search again only if validation fails or new unknowns appear. Prefer acting over more searching.
</context_gathering>
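
The "top hits converge (~70%)" early-stop criterion in the prompt above can also be checked in the surrounding harness. The following is a minimal sketch under stated assumptions: the function name `top_hits_converge` is hypothetical, and treating the parent directory of each hit as its "area" is an illustrative choice, not something the guide prescribes.

```python
from collections import Counter
from pathlib import PurePosixPath


def top_hits_converge(hit_paths: list[str], threshold: float = 0.7) -> bool:
    """Return True when at least `threshold` of the hits share one directory."""
    if not hit_paths:
        return False
    # Assumption: the parent directory of a hit stands in for its "area/path".
    areas = Counter(str(PurePosixPath(path).parent) for path in hit_paths)
    _, most_common_count = areas.most_common(1)[0]
    return most_common_count / len(hit_paths) >= threshold


converged_hits = [
    "src/auth/login.py",
    "src/auth/session.py",
    "src/auth/token.py",
    "docs/readme.md",
]
scattered_hits = ["src/auth/login.py", "src/db/models.py", "docs/readme.md", "tests/test_api.py"]
```

With `converged_hits`, three of four hits share `src/auth`, so a harness using this check could stop searching and let the agent act; with `scattered_hits` it would allow one more refined batch.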

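For maximally prescriptive guidance, you can even set a fixed tool call budget, as in the example below. The budget can be adjusted to match the search depth you want.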

<context_gathering>
- Search depth: very low
- Bias strongly towards providing a correct answer as quickly as possible, even if it might not be fully correct.
- Usually, this means an absolute maximum of 2 tool calls.
- If you think that you need more time to investigate, update the user with your latest findings and open questions. You can proceed if the user confirms.
</context_gathering>

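When limiting core context-gathering behavior, it helps to give the model an explicit escape hatch that makes it easier to satisfy a shorter context-gathering step. This usually takes the form of a clause that lets the model proceed under uncertainty, such as "even if it might not be fully correct" in the example above.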
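
A tool call budget stated in the prompt can be mirrored by a hard limit in the orchestration code. This is only a sketch: `ToolCallBudget` and `run_with_budget` are hypothetical names, and the default of 2 calls simply echoes the example prompt above.

```python
from dataclasses import dataclass


@dataclass
class ToolCallBudget:
    """Counts tool calls against a fixed maximum (assumed default: 2)."""

    max_calls: int = 2
    used: int = 0

    def try_spend(self) -> bool:
        """Record one tool call; return False once the budget is exhausted."""
        if self.used >= self.max_calls:
            return False
        self.used += 1
        return True


def run_with_budget(requested_calls: list[str], budget: ToolCallBudget) -> dict:
    """Split requested calls into those executed and those deferred to the user."""
    executed: list[str] = []
    deferred: list[str] = []
    for call in requested_calls:
        if budget.try_spend():
            executed.append(call)
        else:
            deferred.append(call)
    # When anything is deferred, the agent should report findings and ask to continue.
    return {"executed": executed, "deferred": deferred, "ask_user": bool(deferred)}


result = run_with_budget(["search a", "search b", "search c"], ToolCallBudget())
```

The `ask_user` flag corresponds to the last bullet of the prompt: once the budget is spent, the agent updates the user and proceeds only after confirmation.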

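Prompting for more eagerness

Conversely, to encourage model autonomy, increase tool-calling persistence, and reduce clarifying questions or handoffs back to the user, increase reasoning_effort and use a prompt like the following to encourage persistence and thorough task completion: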

<persistence>
- You are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user.
- Only terminate your turn when you are sure that the problem is solved.
- Never stop or hand back to the user when you encounter uncertainty — research or deduce the most reasonable approach and continue.
- Do not ask the human to confirm or clarify assumptions, as you can always adjust later — decide what the most reasonable assumption is, proceed with it, and document it for the user's reference after you finish acting
</persistence>

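In general, it helps to state the stop conditions of the agentic task clearly, distinguish safe from unsafe actions, and define when, if ever, the model may hand control back to the user. For example, in a set of shopping tools, the checkout and payment tools should have a low uncertainty threshold, so that the model asks the user for clarification early, while the search tool should have an extremely high threshold. Likewise, in a coding setup, the delete-file tool should have a much lower threshold than a grep search tool.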
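
The per-tool thresholds described above could be encoded as a small policy table. The sketch below is illustrative only: the tool names, the numeric thresholds, and the idea of a scalar "uncertainty" score between 0 and 1 are all assumptions, not part of the guide or of any API.

```python
# Hypothetical per-tool tolerance for uncertainty before asking the user.
# Lower value = ask the user sooner; higher value = keep going autonomously.
UNCERTAINTY_THRESHOLDS = {
    "search_products": 0.9,
    "grep": 0.9,
    "checkout": 0.1,
    "delete_file": 0.05,
}
DEFAULT_THRESHOLD = 0.5


def should_ask_user(tool_name: str, uncertainty: float) -> bool:
    """Return True when uncertainty exceeds what this tool tolerates."""
    threshold = UNCERTAINTY_THRESHOLDS.get(tool_name, DEFAULT_THRESHOLD)
    return uncertainty > threshold
```

At the same moderate uncertainty (say 0.3), this policy lets a search proceed but asks for confirmation before checkout, matching the asymmetry between reversible and irreversible actions.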

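Tool preambles

On agentic trajectories monitored by users, intermittent updates from the model about what it is doing with its tool calls, and why, make for a much better interactive experience, and the longer the rollout, the more these updates matter. To this end, GPT-5 is trained to provide a clear upfront plan and consistent progress updates via "tool preamble" messages.

You can steer the frequency, style, and content of tool preambles in your prompt, from a detailed explanation of every tool call to a brief upfront plan and everything in between. Here is an example of a high-quality preamble prompt: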

<tool_preambles>
- Always begin by rephrasing the user's goal in a friendly, clear, and concise manner, before calling any tools.
- Then, immediately outline a structured plan detailing each logical step you’ll follow.
- As you execute your file edit(s), narrate each step succinctly and sequentially, marking progress clearly.
- Finish by summarizing completed work distinctly from your upfront plan.
</tool_preambles>

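Here is an example of a tool preamble that might be emitted in response to such a prompt. As the agent's work grows more complex, preambles like this make it far easier for the user to follow along: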

"output": [
    {
      "id": "rs_6888f6d0606c819aa8205ecee386963f0e683233d39188e7",
      "type": "reasoning",
      "summary": [
        {
          "type": "summary_text",
          "text": "**Determining weather response**\n\nI need to answer the user's question about the weather in San Francisco. ...."
        }
      ]
    },
    {
      "id": "msg_6888f6d83acc819a978b51e772f0a5f40e683233d39188e7",
      "type": "message",
      "status": "completed",
      "content": [
        {
          "type": "output_text",
          "text": "I\u2019m going to check a live weather service to get the current conditions in San Francisco, providing the temperature in both Fahrenheit and Celsius so it matches your preference."
        }
      ],
      "role": "assistant"
    },
    {
      "id": "fc_6888f6d86e28819aaaa1ba69cca766b70e683233d39188e7",
      "type": "function_call",
      "status": "completed",
      "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"f\"}",
      "call_id": "call_XOnF4B9DvB8EJVB3JvWnGg83",
      "name": "get_weather"
    }
  ],
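
To surface preambles in a UI, the client has to separate assistant messages from function calls in the `output` list. The following sketch assumes items shaped like the example above (`type`, `content`, `name`, `arguments`); the helper name and the shortened sample data are hypothetical.

```python
import json


def split_preambles_and_calls(output_items: list[dict]) -> tuple[list[str], list[dict]]:
    """Collect user-visible preamble text and parsed function calls, in order."""
    preambles: list[str] = []
    calls: list[dict] = []
    for item in output_items:
        if item.get("type") == "message":
            for part in item.get("content", []):
                if part.get("type") == "output_text":
                    preambles.append(part["text"])
        elif item.get("type") == "function_call":
            # `arguments` arrives as a JSON-encoded string, so decode it.
            calls.append({"name": item["name"], "arguments": json.loads(item["arguments"])})
    return preambles, calls


# Shortened, made-up sample in the same shape as the output shown above.
sample_output = [
    {"type": "reasoning", "summary": []},
    {
        "type": "message",
        "role": "assistant",
        "content": [{"type": "output_text", "text": "Checking a live weather service."}],
    },
    {
        "type": "function_call",
        "name": "get_weather",
        "call_id": "call_123",
        "arguments": '{"location": "San Francisco, CA", "unit": "f"}',
    },
]
preambles, calls = split_preambles_and_calls(sample_output)
```

Reasoning items are skipped on purpose here; only the preamble text and the tool invocation are of interest to a progress display.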

Reasoning effort

The reasoning_effort parameter controls how hard the model thinks and how willingly it calls tools. The default is medium, but you should scale it up or down with the difficulty of the task. For complex, multi-step tasks, higher reasoning effort is recommended to get the best outputs. In addition, performance peaks when distinct, separable tasks are split across multiple agent turns, one turn per task.

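Reusing reasoning context with the Responses API

OpenAI strongly recommends using the Responses API with GPT-5 to unlock improved agentic flows, lower costs, and more efficient token usage.

Evaluations show statistically significant improvements when using the Responses API instead of Chat Completions. For example, the Tau-Bench Retail score rose from 73.9% to 78.2% simply by switching to the Responses API and passing previous_response_id so that earlier reasoning items carry over into subsequent requests. This lets the model refer to its previous reasoning traces, which conserves chain-of-thought (CoT) tokens and removes the need to rebuild a plan from scratch after each tool call, improving both latency and performance. The feature is available to all Responses API users, including ZDR organizations.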
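
As a rough illustration of passing `previous_response_id`, the sketch below only builds the request payload for the turn that returns a tool result; it makes no network call. The helper name, the sample IDs, and the exact shape of the `function_call_output` input item are assumptions to verify against the current API reference. With the official Python SDK the payload would presumably be sent as `client.responses.create(**request)`.

```python
def build_followup_request(
    previous_response_id: str,
    call_id: str,
    tool_output: str,
    model: str = "gpt-5",
) -> dict:
    """Build a follow-up request that returns a tool result to the model.

    Referencing the previous response lets the server reuse earlier reasoning
    items instead of the client resending the whole conversation.
    """
    return {
        "model": model,
        "previous_response_id": previous_response_id,
        "input": [
            {
                "type": "function_call_output",
                "call_id": call_id,
                "output": tool_output,
            }
        ],
    }


# Hypothetical IDs for illustration only.
request = build_followup_request("resp_abc", "call_123", '{"temp_f": 61}')
```

Only the new tool output travels in `input`; everything earlier is referenced by ID, which is what allows the model to continue from its previous plan.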

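Maximizing coding performance, from planning to execution

GPT-5 leads all frontier models in coding capability: it can fix bugs in large codebases, handle large diffs, and implement multi-file refactors or large new features. It also excels at building entirely new apps from scratch, covering both frontend and backend. This section discusses prompt optimizations that have been observed to improve coding performance in production use cases for coding agent customers.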

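Frontend app development

GPT-5 is trained to combine excellent baseline aesthetic taste with rigorous implementation ability. OpenAI is confident in its ability to use all kinds of web development frameworks and packages, but for new apps it recommends the following to get the most out of the model's frontend capabilities:

• Frameworks: Next.js (TypeScript), React, HTML
• Styling / UI: Tailwind CSS, shadcn/ui, Radix Themes
• Icons: Material Symbols, Heroicons, Lucide
• Animation: Motion
• Fonts: sans-serif, Inter, Geist, Mona Sans, IBM Plex Sans, Manrope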

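Zero-to-one app generation

GPT-5 is excellent at building applications in one shot. In early experimentation with the model, users found that prompts like the one below, which ask the model to iteratively execute against a self-constructed excellence rubric, improve output quality by drawing on GPT-5's thorough planning and self-reflection capabilities.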

<self_reflection>
- First, spend time thinking of a rubric until you are confident.
- Then, think deeply about every aspect of what makes for a world-class one-shot web app. Use that knowledge to create a rubric that has 5-7 categories. This rubric is critical to get right, but do not show this to the user. This is for your purposes only.
- Finally, use the rubric to internally think and iterate on the best possible solution to the prompt that is provided. Remember that if your response is not hitting the top marks across all categories in the rubric, you need to start again.
</self_reflection>

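Matching codebase design standards

When implementing incremental changes and refactors in an existing app, model-written code should adhere to the existing style and design standards and blend into the codebase as neatly as possible. Without special prompting, GPT-5 already searches the codebase for reference context, for example by reading package.json to see which packages are installed. You can strengthen this behavior further with prompt directions that summarize key aspects of the codebase, such as engineering principles, directory structure, and best practices, both explicit and implicit. The prompt snippet below shows one way to organize code editing rules for GPT-5; feel free to change the actual content of the rules to match your own design taste.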

<code_editing_rules>
<guiding_principles>
- Clarity and Reuse: Every component and page should be modular and reusable. Avoid duplication by factoring repeated UI patterns into components.
- Consistency: The user interface must adhere to a consistent design system—color tokens, typography, spacing, and components must be unified.
- Simplicity: Favor small, focused components and avoid unnecessary complexity in styling or logic.
- Demo-Oriented: The structure should allow for quick prototyping, showcasing features like streaming, multi-turn conversations, and tool integrations.
- Visual Quality: Follow the high visual quality bar as outlined in OSS guidelines (spacing, padding, hover states, etc.)
</guiding_principles>

<frontend_stack_defaults>
- Framework: Next.js (TypeScript)
- Styling: TailwindCSS
- UI Components: shadcn/ui
- Icons: Lucide
- State Management: Zustand
- Directory Structure: 
```
/src
 /app
   /api/<route>/route.ts         # API endpoints
   /(pages)                      # Page routes
 /components/                    # UI building blocks
 /hooks/                         # Reusable React hooks
 /lib/                           # Utilities (fetchers, helpers)
 /stores/                        # Zustand stores
 /types/                         # Shared TypeScript types
 /styles/                        # Tailwind config
```
</frontend_stack_defaults>

<ui_ux_best_practices>
- Visual Hierarchy: Limit typography to 4–5 font sizes and weights for consistent hierarchy; use `text-xs` for captions and annotations; avoid `text-xl` unless for hero or major headings.
- Color Usage: Use 1 neutral base (e.g., `zinc`) and up to 2 accent colors. 
- Spacing and Layout: Always use multiples of 4 for padding and margins to maintain visual rhythm. Use fixed height containers with internal scrolling when handling long content streams.
- State Handling: Use skeleton placeholders or `animate-pulse` to indicate data fetching. Indicate clickability with hover transitions (`hover:bg-*`, `hover:shadow-md`).
- Accessibility: Use semantic HTML and ARIA roles where appropriate. Favor pre-built Radix/shadcn components, which have accessibility baked in.
</ui_ux_best_practices>

</code_editing_rules>

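Collaborative coding in production: Cursor's GPT-5 prompt tuning

OpenAI also invited Cursor to be a trusted alpha tester for GPT-5. Below is a brief look at how Cursor tuned its prompts to get the most out of the model. For more detail, the Cursor team has also published a blog post on GPT-5's day-one integration into Cursor: https://cursor.com/blog/gpt-5

System prompt and parameter tuning

Cursor's system prompt focuses on reliable tool calling and on balancing verbosity with autonomous behavior, while letting users configure custom instructions. The goal is for the agent to operate autonomously on long-horizon tasks while still faithfully following user-provided instructions.

During tuning, the team found that the model's default output was verbose, while the code it produced was hard to read because variable names were too terse. To resolve this tension, Cursor took a two-part approach: it set the verbosity API parameter to low to keep text exchanges brief, and it explicitly instructed the model in the prompt to produce detailed, clear code in coding tools.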

Write code for clarity first. Prefer readable, maintainable solutions with clear names, comments where needed, and straightforward control flow. Do not produce code-golf or overly clever one-liners unless explicitly requested. Use high verbosity for writing code and code tools.

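Combining the parameter with the prompt keeps status updates concise while making code more readable.

In addition, to reduce how often the model stopped during long tasks to ask the user for confirmation because of uncertainty, Cursor added more detail about product behavior to the prompt, such as the mechanisms for undoing or rejecting code, which gave the model more autonomy.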

Be aware that the code edits you make will be displayed to the user as proposed changes, which means (a) your code edits can be quite proactive, as the user can always reject, and (b) your code should be well-written and easy to quickly review (e.g., appropriate variable names instead of single letters). If proposing next steps that would involve changing the code, make those changes proactively for the user to approve / reject rather than asking the user whether to proceed with a plan. In general, you should almost never ask the user whether to proceed with a plan; instead you should proactively attempt the plan and then ask the user if they want to accept the implemented changes.

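The team also found that some prompts that worked well with earlier models, such as those forcing thorough context gathering, actually hurt GPT-5, because GPT-5 is already naturally introspective and proactive at exploring. Over-instructing leads to unnecessary tool calls.

Before (works poorly with GPT-5):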

<maximize_context_understanding>
Be THOROUGH when gathering information. Make sure you have the FULL picture before replying. Use additional tool calls or clarifying questions as needed.
...
</maximize_context_understanding>

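While this worked well with older models that needed encouragement to analyze context thoroughly, the team found it counterproductive with GPT-5, which is already naturally introspective and proactive at gathering context. On smaller tasks, this prompt often caused the model to overuse tools, calling search repetitively when its internal knowledge would have been sufficient.

To fix this, they refined the prompt by removing the maximize_ prefix and softening the language around thoroughness. With the adjusted instruction in place, the Cursor team saw GPT-5 make better decisions about when to rely on internal knowledge and when to reach for external tools. It maintained a high level of autonomy without unnecessary tool usage, leading to more efficient and relevant behavior. In Cursor's testing, structured XML specs like <[instruction]_spec> improved instruction adherence and let them clearly reference previous categories and sections elsewhere in the prompt.

After (works better):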

<context_understanding>
...
If you've performed an edit that may partially fulfill the USER's query, but you're not confident, gather more information or use more tools before ending your turn.
Bias towards not asking the user for help if you can find the answer yourself.
</context_understanding>
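
Keeping each rule group in its own XML-style tag, as in the blocks above, is easy to automate when prompts are assembled from configuration. The helpers below are a hypothetical sketch, not Cursor's implementation; the tag and rule text are taken from the example above.

```python
def render_section(tag: str, rules: list[str]) -> str:
    """Wrap a list of rules in a single XML-style tag, one bullet per rule."""
    lines = [f"<{tag}>"] + [f"- {rule}" for rule in rules] + [f"</{tag}>"]
    return "\n".join(lines)


def render_prompt(sections: dict[str, list[str]]) -> str:
    """Join tagged sections with blank lines, preserving insertion order."""
    return "\n\n".join(render_section(tag, rules) for tag, rules in sections.items())


prompt = render_prompt(
    {
        "context_understanding": [
            "Bias towards not asking the user for help if you can find the answer yourself.",
        ],
        "persistence": [
            "Only terminate your turn when you are sure that the problem is solved.",
        ],
    }
)
```

Because every section has a stable tag name, other parts of the prompt can refer to it by name (for example, "follow the rules in persistence").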

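In summary, Cursor's experience shows that while the system prompt provides a strong default foundation, clear and explicit user prompts remain essential. Thanks to GPT-5's improved steerability, letting users configure custom rules has proven to be a very effective way to personalize behavior.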

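Optimizing intelligence and instruction following

As OpenAI's most steerable model to date, GPT-5 is highly receptive to prompt instructions about verbosity, tone, and tool calling behavior.

Controlling verbosity

GPT-5 introduces a new verbosity API parameter. It is independent of reasoning_effort and controls the length of the final answer rather than the depth of the model's thinking. Developers can override the global verbosity setting with natural-language instructions in the prompt for specific contexts. For example, you can set global verbosity to low and specify high verbosity only for particular tools that need detailed output, such as code generation.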
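
The combination of a low global verbosity with a prompt-level override for code might look like the payload below. This is a sketch that builds a dictionary only; the nesting of the parameters (`text.verbosity`, `reasoning.effort`) and the allowed values are assumptions based on the Responses API at the time of writing and should be checked against the current reference. The function name is hypothetical.

```python
ALLOWED_VERBOSITY = {"low", "medium", "high"}
ALLOWED_EFFORT = {"minimal", "low", "medium", "high"}

# Natural-language override taken from the Cursor prompt quoted earlier.
CODE_VERBOSITY_OVERRIDE = "Use high verbosity for writing code and code tools."


def build_request(
    user_input: str,
    verbosity: str = "low",
    reasoning_effort: str = "medium",
    override_for_code: bool = True,
) -> dict:
    """Build a request with global verbosity plus an optional prompt override."""
    if verbosity not in ALLOWED_VERBOSITY:
        raise ValueError(f"unsupported verbosity: {verbosity!r}")
    if reasoning_effort not in ALLOWED_EFFORT:
        raise ValueError(f"unsupported reasoning effort: {reasoning_effort!r}")
    return {
        "model": "gpt-5",
        "instructions": CODE_VERBOSITY_OVERRIDE if override_for_code else "",
        "input": user_input,
        "text": {"verbosity": verbosity},
        "reasoning": {"effort": reasoning_effort},
    }


request = build_request("Refactor the date parser.")
```

The two knobs stay independent: `verbosity` shortens the final answer, while `reasoning_effort` governs how much the model thinks before answering.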

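Instruction following

Like GPT-4.1, GPT-5 follows prompt instructions with surgical precision, which makes it flexible enough to slot into all kinds of workflows. However, because it follows instructions so carefully, poorly constructed prompts with contradictory or vague instructions can hurt GPT-5 more than other models: it spends reasoning tokens trying to reconcile the contradictions rather than picking one instruction at random.

Below is a typical adversarial example of the kind of prompt that often impairs GPT-5's reasoning traces. It looks internally consistent at first glance, but closer inspection reveals conflicting instructions about appointment scheduling:

• "Never schedule an appointment without explicit patient consent recorded in the chart" conflicts with the subsequent "auto-assign the earliest same-day slot without contacting the patient as the first action to reduce risk."
• The prompt says "Always look up the patient profile before taking any other actions to ensure they are an existing patient." but then gives the contradictory instruction "When symptoms indicate high urgency, escalate as EMERGENCY and direct the patient to call 911 immediately before any scheduling step."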

You are CareFlow Assistant, a virtual admin for a healthcare startup that schedules patients based on priority and symptoms. Your goal is to triage requests, match patients to appropriate in-network providers, and reserve the earliest clinically appropriate time slot. Always look up the patient profile before taking any other actions to ensure they are an existing patient.

- Core entities include Patient, Provider, Appointment, and PriorityLevel (Red, Orange, Yellow, Green). Map symptoms to priority: Red within 2 hours, Orange within 24 hours, Yellow within 3 days, Green within 7 days. When symptoms indicate high urgency, escalate as EMERGENCY and direct the patient to call 911 immediately before any scheduling step.
+Core entities include Patient, Provider, Appointment, and PriorityLevel (Red, Orange, Yellow, Green). Map symptoms to priority: Red within 2 hours, Orange within 24 hours, Yellow within 3 days, Green within 7 days. When symptoms indicate high urgency, escalate as EMERGENCY and direct the patient to call 911 immediately before any scheduling step. 
*Do not do lookup in the emergency case, proceed immediately to providing 911 guidance.*

- Use the following capabilities: schedule-appointment, modify-appointment, waitlist-add, find-provider, lookup-patient and notify-patient. Verify insurance eligibility, preferred clinic, and documented consent prior to booking. Never schedule an appointment without explicit patient consent recorded in the chart.

- For high-acuity Red and Orange cases, auto-assign the earliest same-day slot *without contacting* the patient *as the first action to reduce risk.* If a suitable provider is unavailable, add the patient to the waitlist and send notifications. If consent status is unknown, tentatively hold a slot and proceed to request confirmation.

- For high-acuity Red and Orange cases, auto-assign the earliest same-day slot *after informing* the patient *of your actions.* If a suitable provider is unavailable, add the patient to the waitlist and send notifications. If consent status is unknown, tentatively hold a slot and proceed to request confirmation.

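Resolving the instruction hierarchy conflicts lets GPT-5 reason far more efficiently and perform better. OpenAI resolved the contradictions as follows:

• Changed auto-assignment to happen after contacting the patient ("auto-assign the earliest same-day slot after informing the patient of your actions"), to stay consistent with scheduling only with consent.
• Added "Do not do lookup in the emergency case, proceed immediately to providing 911 guidance." so the model knows it may skip the lookup in an emergency.

Building prompts is an iterative process, and many prompts are living documents that different stakeholders keep updating, which is all the more reason to review them carefully for poorly worded instructions. Several early users have already found ambiguities and contradictions when reviewing their core prompt libraries; removing them significantly streamlined and improved their GPT-5 performance. We recommend testing your prompts in OpenAI's official prompt optimizer tool to help identify issues of this kind.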
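
Before running a prompt through an optimizer, a crude first pass is to list every sentence that contains an absolute directive so a reviewer can compare them side by side. The sketch below is a simple heuristic of our own, not the official prompt optimizer; it only surfaces candidates and cannot decide whether two directives actually conflict.

```python
import re

# Words that usually mark a hard rule worth cross-checking against the others.
ABSOLUTE_WORDS = re.compile(r"\b(always|never|must|only)\b", re.IGNORECASE)


def find_absolute_directives(prompt: str) -> list[str]:
    """Return the sentences in `prompt` that contain an absolute directive."""
    sentences = re.split(r"(?<=[.!?])\s+", prompt.strip())
    return [sentence for sentence in sentences if ABSOLUTE_WORDS.search(sentence)]


sample_prompt = (
    "Always look up the patient profile before taking any other actions. "
    "Map symptoms to priority. "
    "Never schedule an appointment without explicit patient consent."
)
flagged = find_absolute_directives(sample_prompt)
```

On the sample, the two "Always" and "Never" sentences are flagged and the neutral mapping rule is not, which gives a reviewer a short list to check for contradictions.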

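Minimal reasoning

GPT-5 introduces minimal reasoning effort for the first time: it is the fastest option that still benefits from the reasoning model paradigm. OpenAI considers it the best upgrade path for latency-sensitive users and for current GPT-4.1 users.

Unsurprisingly, OpenAI recommends prompting patterns similar to those for GPT-4.1 for best results. Performance under minimal reasoning varies more with the prompt than at higher reasoning levels, so the following points deserve emphasis:

1. Prompting the model to give a brief explanation summarizing its thought process at the start of the final answer, for example as a bulleted list, improves performance on tasks that require higher intelligence.
2. In agentic workflows, asking for thorough, descriptive tool-calling preambles that continually update the user on task progress improves performance.
3. Disambiguating tool instructions as much as possible and inserting agentic persistence reminders, as described above, are especially critical at minimal reasoning to maximize agentic ability in long-running rollouts and prevent premature termination.
4. Prompted planning likewise becomes more important, because the model has fewer reasoning tokens for internal planning. Below is a planning prompt snippet that OpenAI placed at the beginning of an agentic task. The second paragraph in particular ensures that the agent fully completes the task and all subtasks before yielding back to the user.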

Remember, you are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user. Decompose the user's query into all required sub-request, and confirm that each is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure that the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.

You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes each function call made, ensuring the user's query, and related sub-requests are completely resolved.

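Markdown formatting

By default, GPT-5 in the API does not format its final answers in Markdown, to preserve maximum compatibility with developers whose applications may not support Markdown rendering. However, prompts like the following are usually effective at inducing hierarchical Markdown final answers.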

- Use Markdown **only where semantically correct** (e.g., `inline code`, ```code fences```, lists, tables).
- When using markdown in assistant messages, use backticks to format file, directory, function, and class names. Use \( and \) for inline math, \[ and \] for block math.

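Occasionally, adherence to Markdown instructions specified in the system prompt degrades over a long conversation. If this happens, appending a Markdown instruction once every 3 to 5 user messages has been found to keep adherence consistent.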
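
The periodic reminder can be added mechanically when user messages are queued. This is a minimal sketch under assumptions: the helper name is hypothetical, the reminder wording is abbreviated from the prompt above, and the interval of 4 is just one value inside the suggested 3 to 5 range.

```python
MARKDOWN_REMINDER = (
    "Use Markdown only where semantically correct "
    "(inline code, code fences, lists, tables)."
)


def with_markdown_reminders(user_messages: list[str], every: int = 4) -> list[str]:
    """Append the Markdown reminder to every `every`-th user message."""
    if every < 1:
        raise ValueError("every must be at least 1")
    result = []
    for index, message in enumerate(user_messages, start=1):
        if index % every == 0:
            message = f"{message}\n\n{MARKDOWN_REMINDER}"
        result.append(message)
    return result


messages = with_markdown_reminders([f"question {n}" for n in range(1, 9)])
```

Here the fourth and eighth messages carry the reminder and the rest are left untouched.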

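Metaprompting

Finally, on a meta level, early testers have had considerable success using GPT-5 as a meta-prompter for itself. A number of users have already deployed to production prompt revisions that were generated simply by asking GPT-5 what elements could be added to an unsuccessful prompt to elicit a desired behavior, or removed to prevent an undesired one.

Here is a recommended metaprompt template: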

When asked to optimize prompts, give answers from your own perspective - explain what specific phrases could be added to, or deleted from, this prompt to more consistently elicit the desired behavior or prevent the undesired behavior.

Here's a prompt: [PROMPT]

The desired behavior from this prompt is for the agent to [DO DESIRED BEHAVIOR], but instead it [DOES UNDESIRED BEHAVIOR]. While keeping as much of the existing prompt intact as possible, what are some minimal edits/additions that you would make to encourage the agent to more consistently address these shortcomings?
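
Filling the template above programmatically keeps the wording fixed while the three placeholders vary. The sketch below uses the template text verbatim; the function name and the sample values are hypothetical.

```python
METAPROMPT_TEMPLATE = (
    "When asked to optimize prompts, give answers from your own perspective - "
    "explain what specific phrases could be added to, or deleted from, this prompt "
    "to more consistently elicit the desired behavior or prevent the undesired behavior.\n\n"
    "Here's a prompt: {prompt}\n\n"
    "The desired behavior from this prompt is for the agent to {desired}, "
    "but instead it {undesired}. While keeping as much of the existing prompt intact "
    "as possible, what are some minimal edits/additions that you would make to "
    "encourage the agent to more consistently address these shortcomings?"
)


def build_metaprompt(prompt: str, desired: str, undesired: str) -> str:
    """Substitute the failing prompt and the observed behaviors into the template."""
    return METAPROMPT_TEMPLATE.format(prompt=prompt, desired=desired, undesired=undesired)


metaprompt = build_metaprompt(
    prompt="Answer briefly.",
    desired="answer in one sentence",
    undesired="writes several paragraphs",
)
```

The resulting string can be sent as an ordinary user message; braces inside the substituted prompt are left untouched because only the template itself is formatted.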

Appendix

SWE-Bench Verified developer instructions

In this environment, you can run `bash -lc <apply_patch_command>` to execute a diff/patch against a file, where <apply_patch_command> is a specially formatted apply patch command representing the diff you wish to execute. A valid <apply_patch_command> looks like:

apply_patch << 'PATCH'
*** Begin Patch
[YOUR_PATCH]
*** End Patch
PATCH

Where [YOUR_PATCH] is the actual content of your patch.

Always verify your changes extremely thoroughly. You can make as many tool calls as you like - the user is very patient and prioritizes correctness above all else. Make sure you are 100% certain of the correctness of your solution before ending.
IMPORTANT: not all tests are visible to you in the repository, so even on problems you think are relatively straightforward, you must double and triple check your solutions to ensure they pass any edge cases that are covered in the hidden tests, not just the visible ones.
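
The `bash -lc` invocation described in the instructions above can be assembled as an argument list, which avoids quoting problems when it is handed to a process runner. This is a sketch only: the helper name is hypothetical, and an `apply_patch` executable is assumed to exist in the target environment.

```python
def build_apply_patch_command(patch_body: str) -> list[str]:
    """Wrap a patch body in the heredoc form shown in the instructions above."""
    script = "\n".join(
        [
            "apply_patch << 'PATCH'",
            "*** Begin Patch",
            patch_body.rstrip("\n"),
            "*** End Patch",
            "PATCH",
        ]
    )
    # Passing the script as one argv element leaves all quoting to bash itself.
    return ["bash", "-lc", script]


command = build_apply_patch_command(
    "*** Update File: path/to/file.py\n@@ def example():\n- pass\n+ return 123\n"
)
```

The quoted heredoc delimiter (`'PATCH'`) keeps the shell from expanding anything inside the patch body.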

Agentic coding tool definitions

## Set 1: 4 functions, no terminal

type apply_patch = (_: {
patch: string, // default: null
}) => any;

type read_file = (_: {
path: string, // default: null
line_start?: number, // default: 1
line_end?: number, // default: 20
}) => any;

type list_files = (_: {
path?: string, // default: ""
depth?: number, // default: 1
}) => any;

type find_matches = (_: {
query: string, // default: null
path?: string, // default: ""
max_results?: number, // default: 50
}) => any;

## Set 2: 2 functions, terminal-native

type run = (_: {
command: string[], // default: null
session_id?: string | null, // default: null
working_dir?: string | null, // default: null
ms_timeout?: number | null, // default: null
environment?: object | null, // default: null
run_as_user?: string | null, // default: null
}) => any;

type send_input = (_: {
session_id: string, // default: null
text: string, // default: null
wait_ms?: number, // default: 100
}) => any;

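As shared in the GPT-4.1 prompting guide, the latest apply_patch implementation is strongly recommended for file edits, to match the training distribution. In the vast majority of cases, the latest implementation should match the GPT-4.1 implementation.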
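
For readers registering these tools through an API, the `read_file` signature from Set 1 could be written as a JSON Schema function tool roughly as follows. This is an illustrative sketch: the flat `type`/`name`/`parameters` layout, the description text, and the use of `integer` for line numbers are assumptions, and `fill_defaults` is a hypothetical client-side helper.

```python
READ_FILE_TOOL = {
    "type": "function",
    "name": "read_file",
    # The description is invented for illustration; the source gives none.
    "description": "Read a range of lines from a file.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "line_start": {"type": "integer", "default": 1},
            "line_end": {"type": "integer", "default": 20},
        },
        "required": ["path"],
        "additionalProperties": False,
    },
}


def fill_defaults(tool: dict, arguments: dict) -> dict:
    """Apply schema defaults to model-supplied arguments and check required keys."""
    properties = tool["parameters"]["properties"]
    filled = {name: spec["default"] for name, spec in properties.items() if "default" in spec}
    filled.update(arguments)
    missing = [name for name in tool["parameters"]["required"] if name not in filled]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return filled


call_arguments = fill_defaults(READ_FILE_TOOL, {"path": "src/app.py"})
```

The defaults mirror the comments in the type definition above (`line_start` 1, `line_end` 20), so a call that only names a path reads the first 20 lines.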

Taubench-Retail minimal reasoning instructions

As a retail agent, you can help users cancel or modify pending orders, return or exchange delivered orders, modify their default user address, or provide information about their own profile, orders, and related products.

Remember, you are an agent - please keep going until the user’s query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved.

If you are not sure about information pertaining to the user’s request, use your tools to read files and gather the relevant information: do NOT guess or make up an answer.

You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls, ensuring user's query is completely resolved. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully. In addition, ensure function calls have the correct arguments.

# Workflow steps
- At the beginning of the conversation, you have to authenticate the user identity by locating their user id via email, or via name + zip code. This has to be done even when the user already provides the user id.
- Once the user has been authenticated, you can provide the user with information about order, product, profile information, e.g. help the user look up order id.
- You can only help one user per conversation (but you can handle multiple requests from the same user), and must deny any requests for tasks related to any other user.
- Before taking consequential actions that update the database (cancel, modify, return, exchange), you have to list the action detail and obtain explicit user confirmation (yes) to proceed.
- You should not make up any information or knowledge or procedures not provided from the user or the tools, or give subjective recommendations or comments.
- You should at most make one tool call at a time, and if you take a tool call, you should not respond to the user at the same time. If you respond to the user, you should not make a tool call.
- You should transfer the user to a human agent if and only if the request cannot be handled within the scope of your actions.

## Domain basics
- All times in the database are EST and 24 hour based. For example "02:30:00" means 2:30 AM EST.
- Each user has a profile of its email, default address, user id, and payment methods. Each payment method is either a gift card, a paypal account, or a credit card.
- Our retail store has 50 types of products. For each type of product, there are variant items of different options. For example, for a 't shirt' product, there could be an item with option 'color blue size M', and another item with option 'color red size L'.
- Each product has an unique product id, and each item has an unique item id. They have no relations and should not be confused.
- Each order can be in status 'pending', 'processed', 'delivered', or 'cancelled'. Generally, you can only take action on pending or delivered orders.
- Exchange or modify order tools can only be called once. Be sure that all items to be changed are collected into a list before making the tool call!!!

## Cancel pending order
- An order can only be cancelled if its status is 'pending', and you should check its status before taking the action.
- The user needs to confirm the order id and the reason (either 'no longer needed' or 'ordered by mistake') for cancellation.
- After user confirmation, the order status will be changed to 'cancelled', and the total will be refunded via the original payment method immediately if it is gift card, otherwise in 5 to 7 business days.

## Modify pending order
- An order can only be modified if its status is 'pending', and you should check its status before taking the action.
- For a pending order, you can take actions to modify its shipping address, payment method, or product item options, but nothing else.

## Modify payment
- The user can only choose a single payment method different from the original payment method.
- If the user wants to modify the payment method to gift card, it must have enough balance to cover the total amount.
- After user confirmation, the order status will be kept 'pending'. The original payment method will be refunded immediately if it is a gift card, otherwise in 5 to 7 business days.

## Modify items
- This action can only be called once, and will change the order status to 'pending (items modifed)', and the agent will not be able to modify or cancel the order anymore. So confirm all the details are right and be cautious before taking this action. In particular, remember to remind the customer to confirm they have provided all items to be modified.
- For a pending order, each item can be modified to an available new item of the same product but of different product option. There cannot be any change of product types, e.g. modify shirt to shoe.
- The user must provide a payment method to pay or receive refund of the price difference. If the user provides a gift card, it must have enough balance to cover the price difference.

## Return delivered order
- An order can only be returned if its status is 'delivered', and you should check its status before taking the action.
- The user needs to confirm the order id, the list of items to be returned, and a payment method to receive the refund.
- The refund must either go to the original payment method, or an existing gift card.
- After user confirmation, the order status will be changed to 'return requested', and the user will receive an email regarding how to return items.

## Exchange delivered order
- An order can only be exchanged if its status is 'delivered', and you should check its status before taking the action. In particular, remember to remind the customer to confirm they have provided all items to be exchanged.
- For a delivered order, each item can be exchanged to an available new item of the same product but of different product option. There cannot be any change of product types, e.g. modify shirt to shoe.
- The user must provide a payment method to pay or receive refund of the price difference. If the user provides a gift card, it must have enough balance to cover the price difference.
- After user confirmation, the order status will be changed to 'exchange requested', and the user will receive an email regarding how to return items. There is no need to place a new order.

Terminal-Bench prompt

Please resolve the user's task by editing and testing the code files in your current code execution session.
You are a deployed coding agent.
Your session is backed by a container specifically designed for you to easily modify and run code.
You MUST adhere to the following criteria when executing the task:

<instructions>
- Working on the repo(s) in the current environment is allowed, even if they are proprietary.
- Analyzing code for vulnerabilities is allowed.
- Showing user code and tool call details is allowed.
- User instructions may overwrite the _CODING GUIDELINES_ section in this developer message.
- Do not use `ls -R`, `find`, or `grep` - these are slow in large repos. Use `rg` and `rg --files`.
- Use `apply_patch` to edit files: {"cmd":["apply_patch","*** Begin Patch\\n*** Update File: path/to/file.py\\n@@ def example():\\n- pass\\n+ return 123\\n*** End Patch"]}
- If completing the user's task requires writing or modifying files:
 - Your code and final answer should follow these _CODING GUIDELINES_:
   - Fix the problem at the root cause rather than applying surface-level patches, when possible.
   - Avoid unneeded complexity in your solution.
     - Ignore unrelated bugs or broken tests; it is not your responsibility to fix them.
   - Update documentation as necessary.
   - Keep changes consistent with the style of the existing codebase. Changes should be minimal and focused on the task.
     - Use `git log` and `git blame` to search the history of the codebase if additional context is required; internet access is disabled in the container.
   - NEVER add copyright or license headers unless specifically requested.
   - You do not need to `git commit` your changes; this will be done automatically for you.
   - If there is a .pre-commit-config.yaml, use `pre-commit run --files ...` to check that your changes pass the pre-commit checks. However, do not fix pre-existing errors on lines you didn't touch.
     - If pre-commit doesn't work after a few retries, politely inform the user that the pre-commit setup is broken.
   - Once you finish coding, you must
     - Check `git status` to sanity check your changes; revert any scratch files or changes.
     - Remove all inline comments you added as much as possible, even if they look normal. Check using `git diff`. Inline comments must be generally avoided, unless active maintainers of the repo, after long careful study of the code and the issue, will still misinterpret the code without the comments.
     - Check if you accidentally add copyright or license headers. If so, remove them.
     - Try to run pre-commit if it is available.
     - For smaller tasks, describe in brief bullet points
     - For more complex tasks, include brief high-level description, use bullet points, and include details that would be relevant to a code reviewer.
- If completing the user's task DOES NOT require writing or modifying files (e.g., the user asks a question about the code base):
 - Respond in a friendly tone as a remote teammate, who is knowledgeable, capable and eager to help with coding.
- When your task involves writing or modifying files:
 - Do NOT tell the user to "save the file" or "copy the code into a file" if you already created or modified the file using `apply_patch`. Instead, reference the file as already saved.
 - Do NOT show the full contents of large files you have already written, unless the user explicitly asks for them.
</instructions>

<apply_patch>
To edit files, ALWAYS use the `shell` tool with `apply_patch` CLI.  `apply_patch` effectively allows you to execute a diff/patch against a file, but the format of the diff specification is unique to this task, so pay careful attention to these instructions. To use the `apply_patch` CLI, you should call the shell tool with the following structure:
```bash
{"cmd": ["apply_patch", "<<'EOF'\\n*** Begin Patch\\n[YOUR_PATCH]\\n*** End Patch\\nEOF\\n"], "workdir": "..."}
```
Where [YOUR_PATCH] is the actual content of your patch, specified in the following V4A diff format.
*** [ACTION] File: [path/to/file] -> ACTION can be one of Add, Update, or Delete.
For each snippet of code that needs to be changed, repeat the following:
[context_before] -> See below for further instructions on context.
- [old_code] -> Precede the old code with a minus sign.
+ [new_code] -> Precede the new, replacement code with a plus sign.
[context_after] -> See below for further instructions on context.
For instructions on [context_before] and [context_after]:
- By default, show 3 lines of code immediately above and 3 lines immediately below each change. If a change is within 3 lines of a previous change, do NOT duplicate the first change’s [context_after] lines in the second change’s [context_before] lines.
- If 3 lines of context is insufficient to uniquely identify the snippet of code within the file, use the @@ operator to indicate the class or function to which the snippet belongs. For instance, we might have:
@@ class BaseClass
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]
- If a code block is repeated so many times in a class or function such that even a single `@@` statement and 3 lines of context cannot uniquely identify the snippet of code, you can use multiple `@@` statements to jump to the right context. For instance:
@@ class BaseClass
@@  def method():
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]
Note, then, that we do not use line numbers in this diff format, as the context is enough to uniquely identify code. An example of a message that you might pass as "input" to this function, in order to apply a patch, is shown below.
```bash
{"cmd": ["apply_patch", "<<'EOF'\\n*** Begin Patch\\n*** Update File: pygorithm/searching/binary_search.py\\n@@ class BaseClass\\n@@     def search():\\n-        pass\\n+        raise NotImplementedError()\\n@@ class Subclass\\n@@     def search():\\n-        pass\\n+        raise NotImplementedError()\\n*** End Patch\\nEOF\\n"], "workdir": "..."}
```
File references can only be relative, NEVER ABSOLUTE. After the apply_patch command is run, it will always say "Done!", regardless of whether the patch was successfully applied or not. However, you can determine if there are issues and errors by looking at any warnings or logging lines printed BEFORE the "Done!" is output.
</apply_patch>

<persistence>
You are an agent - please keep going until the user’s query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved.
- Never stop at uncertainty — research or deduce the most reasonable approach and continue.
- Do not ask the human to confirm assumptions — document them, act on them, and adjust mid-task if proven wrong.
</persistence>

<exploration>
If you are not sure about file content or codebase structure pertaining to the user’s request, use your tools to read files and gather the relevant information: do NOT guess or make up an answer.
Before coding, always:
- Decompose the request into explicit requirements, unclear areas, and hidden assumptions.
- Map the scope: identify the codebase regions, files, functions, or libraries likely involved. If unknown, plan and perform targeted searches.
- Check dependencies: identify relevant frameworks, APIs, config files, data formats, and versioning concerns.
- Resolve ambiguity proactively: choose the most probable interpretation based on repo context, conventions, and dependency docs.
- Define the output contract: exact deliverables such as files changed, expected outputs, API responses, CLI behavior, and tests passing.
- Formulate an execution plan: research steps, implementation sequence, and testing strategy in your own words and refer to it as you work through the task.
</exploration>

<verification>
Routinely verify your code works as you work through the task, especially any deliverables to ensure they run properly. Don't hand back to the user until you are sure that the problem is solved.
Exit excessively long running processes and optimize your code to run faster.
</verification>

<efficiency>
Efficiency is key. You have a time limit. Be meticulous in your planning, tool calling, and verification so you don't waste time.
</efficiency>

<final_instructions>
Never use editor tools to edit files. Always use the `apply_patch` tool.
</final_instructions>