Few-shot prompting 教 LLM 调用函数

浏览次数： 1564

背景

LLMs 如何工作

即使是非常高层级的 GPT 模型，包括 ChatGPT、GPT-4、GPT-3.5-turbo，它们都是我们所说的自回归语言模型。这意味着它们是巨大的人工智能模型，它们接受过庞大的数据集的训练，包括互联网、维基百科、公共 GitHub 代码和其他授权材料。它们被称为自回归，因为它们所做的只是综合所有这些信息。它们接受一个 Prompt，或者我们可以称之为上下文。它们查看 Prompt。然后它们基本上只是决定，给定这个 Prompt，给定这个输入，下一个单词应该是什么？它实际上只是在预测下一个单词。

例如，如果给定 GPT 的输入是，“the largest city in the United States is“（美国最大的城市是），那么答案就是 New York City（纽约市）。LLMs 会一个字一个字地思考，LLMs 会返回 “New”、“York”，然后是“City”。同样，在更具对话性的环境中，如果我们问它地球和太阳之间的距离是多少。GPT 已经从互联网上学过这个，它将输出 9400 万英里。它是根据输入逐个单词逐个单词思考的。

在底层，LLMs 真正做的是每次输出单词时，都会查看一堆候选单词并为它们分配概率。例如，在最初的例子中，“美国最大的城市是”，它可能有很多候选城市，New 代表“纽约”（New York），或者“新泽西”（New Jersey），或者其他什么，Los 代表“洛杉矶”（Los Angeles），然后还有其他一些可能的例子。你可以看到，它确实认为“New York City”（纽约市）可能是正确的答案，因为 New 的概率为 95%。在这种情况下，它通常会选择最有可能的结果，所以它会选择 New，然后继续前进。这个单词出现后，我们现在就知道 New 是第一个单词，所以它对下一个单词是什么就有了更多的限制。交互过程可以参考下图：

图片来源：Developing Apps with GPT-4 and ChatGPT

详细内容：? OpenAI 内部工程师揭秘：如何通过 API 将大模型集成到自己的应用程序中

Function calling 理解

LLMs 尽管能力很强且具备非常强大的涌现能力，但是也存在一些局限性，显而易见的问题就是：它无法获取最新的信息数据、只能给出文字的建议但无法直接解决某些问题，比如想问 LLMs ：“今天的天气怎么样？“ ，这种简单的场景都无法做到。

Function calling 功能彻底改变了开发者与 LLMs 交互的方式。这个功能允许开发者描述函数给 LLMs，然后 LLMs 可以智能地决定输出一个包含调用这些函数的参数的 JSON 对象。LLMs 可以根据自身的数据库知识进行回答，还可以额外挂载一个函数库，然后根据用户提问去函数库检索，按照实际需求调用外部函数并获取函数运行结果，再基于函数运行结果进行回答。

简单理解就是：根据业务场景，自动选择对应的功能函数，并格式化输出。

遇到的问题

既然是概率问题，就有可能会出错，在函数调用过程中，有以下两个环节比较容易出错：

选错函数（工具），函数库中有大量的函数，函数功能描述容易混淆。
生成参数不稳定，在之前的文章中介绍过如何让 LLMs 稳定输出的方法（让 LLM 稳定输出 JSON）

Few-shot

在 Prompt 中增加 few-shot 提供了一些高质量的示例，每个示例都包含目标任务的输入和所需输出。当 LLMs 首先看到好的示例时，它可以更好地理解用户的意图和所需答案的标准。因此，few-shot 通常会比 zero-shot 带来更好的效果。然而，它的代价是消耗更多的 token，并且当输入和输出文本很长时可能会达到上下文长度限制。

尝试一些例子。我们首先尝试一个随机标签的例子（意味着将标签 Negative 和 Positive 随机分配给输入）：

Text: (lawrence bounces) all over the stage, dancing, running, sweating, mopping his face and generally displaying the wacky talent that brought him fame in the first place.
Sentiment: positive

Text: despite all evidence to the contrary, this clunker has somehow managed to pose as an actual feature movie, the kind that charges full admission and gets hyped on tv and purports to amuse small children and ostensible adults.
Sentiment: negative

Text: for the first time in years, de niro digs deep emotionally, perhaps because he's been stirred by the powerful work of his co-stars.
Sentiment: positive

Text: i'll bet the video game is a lot more fun than the film.
Sentiment:

输出：

positive

Function Calling 调用流程

在之前的文章中，我们展示了如何使用 LLMs 的函数来“调用”外部函数来丰富模型的功能，例如使用模型不知道的信息查询其他数据源。（推荐优先阅读：OpenAI 内部工程师揭秘：如何通过 API 将大模型集成到自己的应用程序中）

在 LLMs 执行 Function calling 功能时，LLMs 能够充分发挥自身的语义理解能力，解析用户的输入，然后在函数库中自动挑选出最合适函数进行运行，并给出问题的答案，整个过程不需要人工手动干预。

要使函数调用发挥作用，至关重要的是，LLMs 需要知道如何最好地选择正确的函数以及如何为每个函数调用提取正确的参数。

在此之前，我们仅依赖于函数及其参数的描述，效果非常成功了。LLMs 根据函数描述，参数描述以及用户的输入，来决定是不是要调用这个 funciton。

函数规范很重要：
明确的函数名: 选择一个清晰、描述性强的函数名。
参数顺序和命名: 参数应有逻辑顺序，并使用描述性强的名称。
详细的函数描述：要对函数功能和设置的参数变量有明确的说明

描述主要依赖于两个关键因素：函数的说明文档和其代码逻辑，包括输入参数和返回值。

Function Calling 完整调用流程如下：

图片来源：https://zhuanlan.zhihu.com/p/645732735

Function calling 需要经过两次 Chat Completion 模型的调用以及一次本地函数的计算

Few-shot 改善函数调用

一套精心设计的提示词能极大地提升模型输出的稳定性和准确性。即便在 GPT-3.5 或 GPT-4 的接口下，输出仍有可能出现不稳定的情况。为了增加输出稳定性，结合系统角色（System role）和少量样本提示（Few-Shot prompting）是一种有效的策略。这种方法不仅明确地指导了模型的任务，还通过提供少量相关的样本，有助于模型更准确地理解期望的输出格式和内容。这样做更有可能使你得到稳定和准确的结果。

例如，即使有一些特殊的指令，我们的模型也可能会因操作顺序而出错：

llm_with_tools.invoke(
    "Whats 119 times 8 minus 20. Don't do any math yourself, only use tools for math. Respect order of operations"
).tool_calls

输出：

[{'name': 'Multiply',
  'args': {'a': 119, 'b': 8},
  'id': 'call_Dl3FXRVkQCFW4sUNYOe4rFr7'},
 {'name': 'Add',
  'args': {'a': 952, 'b': -20},
  'id': 'call_n03l4hmka7VZTCiP387Wud2C'}]

该模型还不应该尝试添加任何内容，因为从技术上讲它还不知道 119 * 8 的结果。

对于更复杂的工具使用，在提示中添加少量示例非常有用。我们可以通过在提示中添加 AIMessage 和 ToolCall 以及相应的 ToolMessage 来实现这一点。

通过添加带有一些示例的提示，我们可以纠正此行为：

from langchain_core.messages import AIMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

examples = [
    HumanMessage(
        "What's the product of 317253 and 128472 plus four", name="example_user"
    ),
    AIMessage(
        "",
        name="example_assistant",
        tool_calls=[
            {"name": "Multiply", "args": {"x": 317253, "y": 128472}, "id": "1"}
        ],
    ),
    ToolMessage("16505054784", tool_call_id="1"),
    AIMessage(
        "",
        name="example_assistant",
        tool_calls=[{"name": "Add", "args": {"x": 16505054784, "y": 4}, "id": "2"}],
    ),
    ToolMessage("16505054788", tool_call_id="2"),
    AIMessage(
        "The product of 317253 and 128472 plus four is 16505054788",
        name="example_assistant",
    ),
]

system = """You are bad at math but are an expert at using a calculator. 

Use past tool usage as an example of how to correctly use the tools."""
few_shot_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        *examples,
        ("human", "{query}"),
    ]
)

chain = {"query": RunnablePassthrough()} | few_shot_prompt | llm_with_tools
chain.invoke("Whats 119 times 8 minus 20").tool_calls

输出：

[{'name': 'Multiply',
  'args': {'a': 119, 'b': 8},
  'id': 'call_MoSgwzIhPxhclfygkYaKIsGZ'}]

这次我们得到了正确的输出。

LangSmith 查看执行轨迹

通过 LangSmith 查看执行轨迹，主要查看 ChatPromptTemplate (带有 Few-shot)

{
  "output": {
    "messages": [
      {
        "content": "You are bad at math but are an expert at using a calculator. \n\nUse past tool usage as an example of how to correctly use the tools.",
        "additional_kwargs": {},
        "response_metadata": {},
        "type": "system"
      },
      {
        "content": "What's the product of 317253 and 128472 plus four",
        "additional_kwargs": {},
        "response_metadata": {},
        "type": "human",
        "name": "example_user",
        "example": false
      },
      {
        "content": "",
        "additional_kwargs": {},
        "response_metadata": {},
        "type": "ai",
        "name": "example_assistant",
        "example": false,
        "tool_calls": [
          {
            "name": "Multiply",
            "args": {
              "x": 317253,
              "y": 128472
            },
            "id": "1"
          }
        ],
        "invalid_tool_calls": []
      },
      {
        "content": "16505054784",
        "additional_kwargs": {},
        "response_metadata": {},
        "type": "tool",
        "tool_call_id": "1"
      },
      {
        "content": "",
        "additional_kwargs": {},
        "response_metadata": {},
        "type": "ai",
        "name": "example_assistant",
        "example": false,
        "tool_calls": [
          {
            "name": "Add",
            "args": {
              "x": 16505054784,
              "y": 4
            },
            "id": "2"
          }
        ],
        "invalid_tool_calls": []
      },
      {
        "content": "16505054788",
        "additional_kwargs": {},
        "response_metadata": {},
        "type": "tool",
        "tool_call_id": "2"
      },
      {
        "content": "The product of 317253 and 128472 plus four is 16505054788",
        "additional_kwargs": {},
        "response_metadata": {},
        "type": "ai",
        "name": "example_assistant",
        "example": false,
        "tool_calls": [],
        "invalid_tool_calls": []
      },
      {
        "content": "Whats 119 times 8 minus 20",
        "additional_kwargs": {},
        "response_metadata": {},
        "type": "human",
        "example": false
      }
    ]
  }
}

简单文本方式

通过 LangSmith 可以查看到，把 Few-shot 通过 Message Type 拼接到 ChatPromptTemplate 中，另一种方式是：通过简单的文本来教模型，并解释该函数的作用及其参数。如下示例：

message_history = [  
{'role': 'system', 'content': 'From now on you are a assistant that will help answering question. You should absolutely not use any of your pre-existing knowledge, and only use information in this chat history.'},  
{'role': 'system', 'content': '''the function foobar is able to query information. You should use  
it when you do not have knowledge about question that are asked.  
the parameter "abc" should contain the person, place, or event the user is asking about'''},  
]

推荐新闻

RAG系列04：使用ReRank进行重排序

本文介绍了重排序的原理和两种主流的重排序方法：基于重排模型和基于 LLM。文章指出，重排序是对检索到的上下文进行再次筛选的过程，类似于排序过程中的粗排和精排。在检索增强生成中，精排的术语就叫重排序。文章还介绍了使用 Cohere 提供的在线模型、bge-reranker-base 和 bge-reranker-large 等开源模型以及 LLM 实现重排序的方法。最后，文章得出结论：使用重排模型的方法轻量级、开销较小；而使用 LLM 的方法在多个基准测试上表现良好，但成本较高，且只有在使用 ChatGPT 和 GPT-4 时表现良好，如使用其他开源模型，如 FLAN-T5 和 Vicuna-13B 时，其性能就不那么理想。因此，在实际项目中，需要做出特定的权衡。

LangGPT论文：面向大语言模型的自然语言编程框架（中文版）

大语言模型 (Large Language Models, LLMs) 在不同领域都表现出了优异的性能。然而，对于非AI专家来说，制定高质量的提示来引导 LLMs 是目前AI应用领域的一项重要挑战。

第三篇：要真正入门AI，OpenAI的官方Prompt工程指南肯定还不够，您必须了解的强大方法论和框架！！！

自从ChatGPT（全名：Chat Generative Pre-trained Transformer）于2022年11月30日发布以来，一个新兴的行业突然兴起，那就是提示工程（Prompt engineering），可谓如日冲天。从简单的文章扩写，到RAG，ChatGPT展现了前所未有的惊人能力。

（三）12个RAG痛点及其解决方案

痛点9:结构化数据QA 痛点10:从复杂 PDF 中提取数据痛点11:后备模型痛点12:LLM安全

（二）12个RAG痛点及其解决方案

痛点5:格式错误痛点6:不正确的特异性痛点7:不完整痛点8:数据摄取可扩展性