
Anthropic: When AI Builds Itself (Chinese-English bilingual)

Published: 2026-06-05 20:09:00
Author: 東雲研究院


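Recommendation

AI is accelerating its own research and development and may eventually achieve fully autonomous "recursive self-improvement", which would have far-reaching consequences for the world.

Key points:
1. The changing role of AI in development: from human-driven work to autonomous AI participation
2. Possible paths to recursive self-improvement and future scenarios
3. The need for a global coordination mechanism to manage frontier AI development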

杨芳贤
Founder of 53AI / Tencent Cloud Most Valuable Professional (TVP)

THE ANTHROPIC INSTITUTE

当 AI 开始自我构建

When AI builds itself

我们在「递归式自我改进」上的进展,及其深远影响。
Our progress toward recursive self-improvement, and its implications.


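Introduction · Summary

For most of AI's history, humans drove every step of development; Anthropic is now handing a growing share of AI development work to AI itself. Drawing on public benchmarks and internal Anthropic data, the article argues that AI is already markedly accelerating AI research and development: Anthropic engineers now produce roughly 8x as much code per person per quarter as they did a few years ago, and more than 80% of the code merged into Anthropic's codebase is written by Claude. It goes on to discuss possible paths to "recursive self-improvement" (AI autonomously designing and training its successors), three future scenarios, and why the world may need a global coordination mechanism that can credibly slow or pause frontier AI development. The full text follows in Chinese and English.

Source: The Anthropic Institute · anthropic.com/institute/recursive-self-improvement
Translated by Claude Opus 4.8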

在 AI 发展史的绝大部分时间里,开发循环的每一个环节都由人类亲自推动。但在 Anthropic,我们正把越来越多的 AI 开发工作交给 AI 系统自己来完成,而这正在加快我们的工作节奏。

For most of AI's history, humans drove every step in its development cycle. But at Anthropic, we are delegating a growing share of AI development to AI systems themselves, which is speeding up our work.

这一趋势若发展到极致,再加上足够的算力,最终指向的将是一种能够完全自主地设计和开发自身后继者的 AI 系统。这被称为「递归式自我改进」(recursive self-improvement)。我们尚未走到那一步,递归式自我改进也并非不可避免。但它的到来,可能比绝大多数机构所做的准备要早得多。

Taken far enough, and given enough compute, that trend points to an AI system capable of fully autonomously designing and developing its own successor. This is called recursive self-improvement. We are not there yet, and recursive self-improvement is not inevitable. But it could come sooner than most institutions are prepared for.

借助公开基准测试以及此前从未公布的 Anthropic 内部数据,Anthropic Institute 想要说明:AI 已经在加速 AI 系统自身的开发。仅举一例:如今,Anthropic 的工程师平均每个季度交付的代码量,是 2021—2025 年间的 8 倍。

Using public benchmarks and previously unreported data from within Anthropic, The Anthropic Institute is showing that AI is already accelerating the development of AI systems. To take just one example: today, Anthropic engineers on average ship 8x as much code per quarter as they did from 2021-2025.

本文讨论的技术趋势表明,AI 系统在未来几年里将变得强大得多。这些趋势的影响极其深远。能够自我构建的 AI,将是技术史上的一座里程碑——它有望在科学、医疗等众多领域为世界带来巨大的福祉。但与此同时,完全的递归式自我改进也可能加大人类失去对 AI 系统控制权的风险。一旦这些系统能够完全自主地构建其后继者,我们如何保障它们的安全、如何监控它们、又如何塑造它们的行为,都将变得远比从前重要。

The technical trends discussed in this piece suggest that AI systems are going to become much more capable in coming years. These trends have huge implications. AI that can build itself would be a major development in the history of technology—one that could bring enormous good for the world in science, healthcare, and beyond. But full recursive self-improvement also might increase the risks of humans losing control over AI systems. If systems are capable of fully building their own successors, the ways we secure them, monitor them, and shape their behavior all grow much more important.

从手写代码,到「闭合循环」

A TIMELINE


2021–2023

打造第一代 Claude · Building the first Claude

在最初的日子里,Anthropic 的工作和任何一家科技公司没什么两样:人们在笔记本电脑上敲代码、写文档。

In the early days, work at Anthropic looked like work at any other tech company: people writing code and docs on laptops.

2023–2025

聊天机器人 · Chatbots

人们开始用早期的聊天机器人来辅助其中一部分工作,比如生成简短的代码片段,再把输出复制到文本编辑器里。

People used early chatbots to help with parts of the process, like generating short code snippets and copying the output into text editors.

2025–2026

编程智能体 · Coding agents

随着智能体能力增强,它们已经能够自行编写和修改代码,有时甚至能完成整个文件。

As the agents became more capable, they were able to write and edit code on their own, sometimes entire files.

当下 · TODAY

自主智能体 · Autonomous agents

如今,智能体已经可以自己运行代码,并把数小时的工作委托给其他智能体去完成。

Agents can now run code themselves and delegate hours of work to other agents.

20XX?

闭合循环 · Closing the loop

在未来,智能体的能力或许会强大到足以自行构建和训练模型。一旦如此,未来版本的 Claude 就可能由 Claude 自己来持续改进。

In the future, agents could become capable enough to build and train models themselves. If this happens, future versions of Claude could be continuously improved by Claude itself.

来自外部世界的证据

EVIDENCE FROM THE OUTSIDE WORLD


AI 模型的进步速度正在加快。它们能够独立可靠完成的任务时长,大约每四个月就翻一番——而此前的趋势是每七个月翻一番。2024 年 3 月,Claude Opus 3 能完成的软件任务,相当于人类约四分钟的工作量。一年后,Claude Sonnet 3.7 已能处理约一个半小时的任务。再过一年,Claude Opus 4.6 已能胜任长达 12 小时的任务。如果这一趋势延续,需要一名熟练人员花上数天才能完成的任务,今年就可能进入 AI 的能力范围。到 2027 年,AI 系统或许就能胜任人类需要数周才能完成的任务。

The rate at which AI models improve is accelerating. The length of tasks that they can reliably complete on their own has been doubling roughly every four months, faster than the earlier trend of doubling every seven months. In March 2024, Claude Opus 3 could complete software tasks that take humans about four minutes to complete. A year later, Claude Sonnet 3.7 managed tasks that took about an hour and a half. A year after that, Claude Opus 4.6 managed 12-hour tasks. If this trend holds, tasks that take a skilled person days could come into range this year. In 2027, AI systems could be capable of tasks that take a person weeks.
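As a rough check on the doubling claim, the three data points quoted above imply a doubling time that can be computed directly. The sketch below is illustrative only: it assumes the points are exactly 12 months apart and takes the quoted task lengths at face value. The function and variable names are not from the source.

```python
import math


def doubling_time_months(start_minutes: float, end_minutes: float,
                         elapsed_months: float) -> float:
    """Months per doubling implied by growth from start to end."""
    doublings = math.log2(end_minutes / start_minutes)
    return elapsed_months / doublings


# Task lengths quoted in the text, converted to minutes.
# The 12-month spacing is an assumption based on "a year later".
opus_3 = 4             # about four minutes (March 2024)
sonnet_3_7 = 90        # about an hour and a half (a year later)
opus_4_6 = 12 * 60     # 12-hour tasks (a year after that)

first_year = doubling_time_months(opus_3, sonnet_3_7, 12)
second_year = doubling_time_months(sonnet_3_7, opus_4_6, 12)
overall = doubling_time_months(opus_3, opus_4_6, 24)

print(f"first year:  {first_year:.1f} months per doubling")
print(f"second year: {second_year:.1f} months per doubling")
print(f"overall:     {overall:.1f} months per doubling")
```

Under these assumptions the second interval works out to four months per doubling and the first to somewhat less, which is consistent with the "roughly every four months" figure as an approximate summary.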

在编程和科研类基准测试中,也出现了同样的模式。基准测试衡量的是模型在特定领域的表现,当模型成绩接近 100% 时,该测试就被认为「饱和」了。SWE-bench 是一项面向真实软件工程的标准测试:它给模型一个真实的开源代码库和一份真实的缺陷报告,要求模型写出能修复该问题、并通过项目自带测试的代码改动。短短两年间,模型在这项测试上的成绩,已经从个位数低分一路攀升到了饱和。

The same pattern appears on coding and research benchmarks. Benchmarks measure the performance of models in a given domain, and they're "saturated" when models achieve close to 100% performance. SWE-bench is a standard test of real-world software engineering: it hands a model an actual open-source codebase and a real bug report, and asks it to write a code change that fixes the issue and passes the project's own tests. Models have gone from scoring in the low single digits to saturating the benchmark in two years.

CORE-Bench 测试的是模型能否复现已有研究成果——而这是它们开展原创研究的前提。它把一篇已发表论文背后的代码和数据交给 AI 模型,要求模型重新跑通全部流程,并确认能复现论文的结果。2024 年,AI 系统成功复现结果的比例约为 20%;仅仅十五个月后,这项测试就已达到饱和。负责运行「长时任务完成能力」基准测试的 METR 发现,Claude Mythos Preview 能够连续工作「至少」16 小时,已处于「[METR] 在不引入新任务的前提下所能测量的上限」。

CORE-Bench tests whether a model can reproduce existing research, a prerequisite for them to conduct original research. It gives an AI model the code and data behind a published paper, and asks it to rerun everything and confirm it can replicate the paper's results. AI systems went from succeeding at reproducing the results roughly 20% of the time in 2024 to saturating the benchmark fifteen months later. METR, which runs the benchmark measuring how well models can complete long-duration tasks, found that Claude Mythos Preview could work for "at least" 16 hours and was "at the upper end of what [METR] can measure without new tasks."

公开基准测试能在很大程度上说明这些系统的能力。但它们无法揭示 AI 系统在加速 AI 开发本身这件事上所产生的影响。要看清这一点,我们需要来自 Anthropic 这类 AI 公司内部的直接证据。

Public benchmarks say a lot about the capabilities of these systems. But they can't reveal the impact AI systems are having on speeding up AI development itself. For that, we need direct evidence from within AI companies like Anthropic.

来自 Anthropic 内部的证据

EVIDENCE FROM WITHIN ANTHROPIC


打造一个前沿模型,大体需要两类工作。一类是工程:编写代码、搭建基础设施、监督模型训练。另一类是科研:决定要做哪些实验、解读实验返回的结果、并判断接下来该尝试哪些想法。

Building a frontier model takes two broad categories of work. There is engineering: writing the code, standing up the infrastructure, and overseeing the model training. And there is research: deciding what experiments to run, interpreting what comes back, and figuring out which ideas to try next.

无论在工程还是科研中,呈现出的图景都是一致的。在工程方面,你可以把一个定义并不清晰的问题交给 Claude,它能自己想出解决办法;人类负责给定目标,却不再需要提供具体方法。在科研方面,对于一个定义明确的实验,Claude 在执行层面已经能够媲美甚至超越熟练的人类。然而,一旦涉及在工程和科研中运用判断力去选择目标,Claude 与人类之间仍存在巨大的差距。而这,正是今天的 AI 与未来那种能够自主设计自身后继者的系统之间的鸿沟。

Across both engineering and research, the picture is consistent. In engineering, Claude can be handed an underspecified problem and figure out how to solve it; humans supply the goal, but they no longer need to supply the method. In research, Claude can already match or outperform skilled humans at executing a well-specified experiment. However, large performance gaps persist when it comes to Claude exercising judgement in choosing goals in both engineering and research. That's the gap between AI today and a future system that could autonomously design its own successor.

在 Anthropic,员工随着经验积累而承担越来越开放、越来越重要的任务,是再常见不过的事。起初,他们执行的是别人定义好的具体任务,比如「导出按钮坏了,请修一下」。有了经验之后,他们拿到的是一个目标,需要自己设计实现方案,比如「调查为什么网络在高负载下会变慢」。到了最资深的层级,他们要决定的则是哪些问题根本值不值得去做:「团队下个季度应该构建什么?」我们可以借助 Anthropic 的内部数据,来看看 Claude 在处理这几类不同任务上已经走到了哪一步。

It's common for employees at Anthropic to receive more open-ended and important tasks as they gain more experience. Early on, they execute a task someone else specified, like, "The export button isn't working, please fix it." With experience, they're handed a goal and design the approach themselves, such as, "Investigate why the network slows down under heavy load." At the most senior levels, they are deciding which problems are worth working on at all: "What should the team build next quarter?" We can use internal Anthropic data to see how far Claude has come in being able to handle these different kinds of tasks.

Claude 编写了 Anthropic 代码中相当大的一部分。截至 2026 年 5 月,我们合并进 Anthropic 代码库的代码中,有超过 80% 出自 Claude 之手。而在 2025 年 2 月 Claude Code 以研究预览版形式发布之前,这一比例还只是个位数低位。这种转变也体现在每位工程师的产出上。在 Anthropic 成立后的头四年(2021—2024 年),每位工程师每天合并的代码行数一直保持稳定;到了 2025 年,当 Claude 开始运行代码、而不仅仅是给出建议让工程师复制粘贴时,这一数字开始向上攀升。2026 年,随着模型开始在更长的时间跨度内自主工作,这条曲线的斜率再次变陡。这两个拐点在下方的图表中清晰可见。2026 年第二季度,一名典型工程师每天合并的代码量,已是 2024 年的 8 倍。之所以如此,是因为大部分代码都由 Claude 编写,工程师负责的是指挥和审阅,而不再是亲手敲键盘。

Claude writes a significant proportion of Anthropic's code. As of May 2026, more than 80% of the code we merge into Anthropic's codebase was authored by Claude. Before Claude Code launched in research preview in February 2025, this number was in the low single digits. That shift also shows up in the amount of output per engineer. Lines of code merged per engineer per day stayed constant through Anthropic's first four years (2021-2024), then began to climb upward in 2025 when Claude began to run code rather than just suggesting it for an engineer to copy and paste. The slope steepened again in 2026 when models began to work autonomously over longer time horizons. These two inflection points are shown in the chart below. In the second quarter of 2026, the typical engineer was merging 8× as much code per day as they were in 2024. This is because much of the code is written by Claude, with the engineer directing and reviewing, rather than typing it themselves.

Figure: Code contributed per person, by quarter, expressed as a multiple of the pre-2025 average; the value reaches 8.0× in Q2 2026.

需要说明一点:代码行数并不是一个完美的衡量指标,因为它衡量的是数量而非质量。因此,2026 年第二季度「每位工程师每天 8 倍代码量」这一数字,几乎可以肯定夸大了真实的生产力提升。但即便如此,它依然说明事情在加速。在 Anthropic,我们并不会因为某人写了多少行代码而给予奖励;团队成员之所以产出更多代码,仅仅是因为他们在用 AI 系统写更多代码。

A caveat: Lines of code is an imperfect measure, as it measures quantity over quality. So 8× lines of code/engineer/day in the second quarter of 2026 is almost certainly an overstatement of the true productivity gain. Nonetheless, it indicates an acceleration. At Anthropic, we don't reward people for how many lines of code they write; rather, team members are producing more code simply because they're using AI systems to write more code.

代码行数的增长,与人们对生产力大幅提升的主观感受相吻合。在 2026 年 3 月对来自 Anthropic 各科研团队的 130 名员工所做的一项调查中,受访者的中位数估计是:在那些他们无论如何都会做的项目上,借助 Mythos Preview,他们的产出大约是在完全没有任何 AI 模型可用时的 4 倍。我们认为 3 月份真实的提升幅度可能略低于这个数字。但即便如此,我们仍然觉得这一总体说法是可信的,并且与我们的其他观察相一致:相当一部分 Anthropic 技术人员完成核心工作的速度,已经是没有 AI 协助时的好几倍。

The increase in lines of code written lines up with subjective impressions of large productivity increases. In a March 2026 poll of 130 employees from across Anthropic research teams, the median respondent estimated that they produced around 4x as much output with Mythos Preview as they would have without access to any AI models, on the kinds of projects they would have been working on regardless. We expect that the true degree of uplift in March was somewhat lower. Nevertheless, we find the overall claim plausible, and in line with our other observations: a significant fraction of Anthropic technical staff is accomplishing their core work multiple times faster than they could without AI assistance.

我们还看到证据表明,Anthropic 的员工正在用 Claude 去做那些原本根本不会有人去做的工作,比如搭建探索性的工具,或处理那些被长期搁置的「清理工作」。举例来说,2026 年 4 月,Claude 交付了 800 多个修复,把某一类 API 错误的发生率降低到了原来的千分之一。负责监督 Claude 的工程师估计,同样的工作若由人来做,需要四年时间;因为修别人的 bug 既缓慢又费力,而人类很难一次性在脑中装下那么多陌生的上下文。

We also see evidence that people at Anthropic are using Claude to do work that simply wouldn't have happened otherwise, like building exploratory tooling and addressing long-deferred cleanup. For example, in April 2026, Claude shipped over 800 fixes that reduced a class of API errors by a factor of one thousand. The engineer overseeing Claude estimated that a human would have taken four years to complete this work; solving other people's bugs is slow and painstaking, and humans struggle to hold that much unfamiliar context in their head at once.

Claude 写出的代码是「好」代码,而且还在不断变好。所谓「好代码」包含两层含义:一是它能正常运行,二是它的写法能让另一名工程师读懂并在其基础上继续开发。在第一条标准上,证据是明确的。一年以来,Anthropic 员工在任务进行到一半时纠正、改变方向或干脆接管 Claude 的比例一直在稳步下降——即便是在最复杂、最开放的任务上也是如此。这里所说的,是那些没有明确规范、连工程师自己都不确定答案该是什么样子的问题。Claude 在不同难度任务上随时间变化的成功率清楚地说明了这一点,如下图所示。Claude 写出的代码是能用的。

The code that Claude writes is "good" and improving. "Good code" means two things: it works, and it is written in a manner that allows another engineer to understand it and build upon it. On the first criterion, the evidence is clear. The rate at which Anthropic staff correct, redirect, or take over mid-task from Claude has been falling steadily for a year, including on the most complex and open-ended tasks. This means problems with no clear specification, where the engineer isn't sure what the answer looks like. This is evident in Claude's success rate over time on tasks of different difficulties, as shown in the graph below. Claude writes code that works.

Figure: Claude Code session success rate, by task type (four task types); open-ended problems rose from about 26% to 76%.

在最开放的那一类任务上,Claude 的成功率在 2026 年 5 月达到了 76%,半年内提高了 50 个百分点。举一个属于这一难度层级的例子:一次例行升级开始导致数万个训练任务崩溃。一名工程师把 Claude 引向这起正在发生的事故,给它的不过是一些文字说明和集群的访问权限。Claude 逐一排查正在运行的任务、一次只测试一个环境设置,最终定位到那个触发崩溃的、极其隐蔽的单一调试标志,稳定地复现了问题,并确认了修复方案。大约两个小时,Claude 就完成了通常需要两到三天的工作。

On the most open-ended tasks, Claude's success rate reached 76% in May 2026, up 50 percentage points in six months. To give an example of tasks in this difficulty tier, a routine upgrade began crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than some text content and cluster access. Working through the running jobs and testing one environment setting at a time, Claude isolated the single obscure debugging flag that was triggering the crash, reproduced it reliably, and confirmed a fix. In about two hours, Claude delivered what would normally be two to three days of work.
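The "one environment setting at a time" approach described above can be sketched as a simple loop. This is a minimal illustration, not the procedure Claude actually followed: the setting names, the values, and the `crashes` predicate are all hypothetical stand-ins for launching a real job and observing whether it fails.

```python
from typing import Callable, Dict, List

Settings = Dict[str, str]


def isolate_culprits(baseline: Settings, candidate: Settings,
                     crashes: Callable[[Settings], bool]) -> List[str]:
    """Flip one changed setting at a time on top of the known-good baseline
    and report every setting that triggers a crash on its own."""
    culprits = []
    for key, new_value in candidate.items():
        if baseline.get(key) == new_value:
            continue  # unchanged by the upgrade, so not a suspect
        trial = dict(baseline)
        trial[key] = new_value
        if crashes(trial):
            culprits.append(key)
    return culprits


# Hypothetical environments before and after a routine upgrade.
before = {"LOG_LEVEL": "info", "DEBUG_SYNC": "0", "CACHE_DIR": "/tmp/a"}
after = {"LOG_LEVEL": "debug", "DEBUG_SYNC": "1", "CACHE_DIR": "/tmp/b"}


def fake_job_crashes(env: Settings) -> bool:
    """Stand-in for running a job; here only DEBUG_SYNC=1 causes a crash."""
    return env.get("DEBUG_SYNC") == "1"


found = isolate_culprits(before, after, fake_job_crashes)
print(found)
```

Testing settings individually costs one run per changed setting and only finds flags that crash on their own; a crash that needs two settings in combination would require a different search.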

第二条标准,是写出能让另一名工程师读懂并在其基础上继续开发的代码。在这一点上,人类与 AI 之间的差距依然存在,但正在迅速缩小。Anthropic 内部尚未形成完全一致的看法,但许多人认为:在 2025 年底,Claude 写出的代码质量仍逊于 Anthropic 工程师手写的代码,而到了今天已大致持平。我们预计在一年之内,它就会更胜一筹。

The second criterion is writing code that another engineer can understand and build on. Here the gap between humans and AI persists, but is closing fast. There isn't full consensus among staff at Anthropic, but many believe that the Claude-written code was still worse in quality than human-written code at Anthropic in late 2025, and is roughly at parity today. We expect it to be better within the year.

这也改变了 Anthropic 如今审阅自身代码的方式。对代码库提出的每一处改动,现在都会先经过一个自动化的 Claude 审阅器,由它在代码合并之前查找 bug、安全漏洞和其他缺陷。借助这个工具,我们做了一次回溯分析,发现:如果对代码库的每一处改动都进行一次自动化的 Claude 审阅,那么在 claude.ai 上过往那些线上事故背后的 bug 中,大约三分之一本可以在抵达生产环境之前就被拦下。而写出那些代码的工程师,本就是世界上构建这类系统的顶尖高手。如今,Claude 正在捕捉他们所遗漏的错误。

This has changed the way that Anthropic now reviews its own code. Proposed changes to our codebase are now read by an automated Claude reviewer that looks for bugs, security flaws, and other defects before they can merge. Using this tool, we ran a retrospective analysis, and found that an automated Claude review of every change to our codebase would have caught roughly a third of the bugs behind past incidents on claude.ai before they ever reached production. The engineers who wrote that code are among the best in the world at building these systems. Claude is now catching the mistakes that they missed.

在「为达成别人设定的目标而开展实验」这件事上,Claude 表现出色。每次 Anthropic 发布模型时,我们都会做同一项测试:给 Claude 一段用于训练小型 AI 模型的代码,要求它在仍然通过同样正确性检查的前提下,让这段代码跑得尽可能快。由于目标和成功指标都已事先固定,Claude 的任务就是通过改写代码、运行、计时、再重复这一过程来寻找加速的方法。这其实是科研实验循环的一个微缩版本。2025 年 5 月,Claude Opus 4 相对初始代码平均实现了约 3 倍的加速。到 2026 年 4 月,Claude Mythos Preview 已能实现约 52 倍的加速。作为参照,一名熟练的人类研究员需要四到八个小时才能达到 4 倍。在科研流程的这一环节——在一个定义清晰的实验内部优化各个步骤——Claude 在不到一年的时间里,已经从「非常得力」变成了「超越人类」。

Claude is good at running experiments to hit a goal that someone else has set. Every time Anthropic releases a model, we run the same test: we give Claude some code that trains a small AI model, and ask it to make that code run as fast as possible while still passing the same correctness checks. The goal and the success metrics are fixed in advance, so Claude's job is to find speedups by rewriting the code, running it, timing it, and repeating. It's a miniature version of an experimental research loop. In May 2025, Claude Opus 4 averaged a ~3x speedup over the starting code. By April 2026, Claude Mythos Preview was achieving ~52x. For calibration, a skilled human researcher would need four to eight hours to reach 4x. In this part of the research workflow—optimizing steps within a clearly defined experiment—Claude has gone from super helpful to superhuman in under a year.
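The rewrite, run, time, repeat loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the workload (a sum of squares) and all function names are invented for the example and say nothing about Anthropic's actual test harness.

```python
import time
from typing import Callable, Optional, Sequence


def best_runtime(fn: Callable, args: Sequence, repeats: int = 5) -> float:
    """Smallest wall-clock time over several runs, to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def evaluate_candidate(baseline: Callable, candidate: Callable,
                       args: Sequence) -> Optional[float]:
    """Return the candidate's speedup over the baseline, or None if the
    candidate fails the correctness check (same output as the baseline)."""
    if candidate(*args) != baseline(*args):
        return None
    # Guard against a zero reading from a coarse timer.
    return best_runtime(baseline, args) / max(best_runtime(candidate, args), 1e-9)


def slow_sum_of_squares(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum_of_squares(n: int) -> int:
    # Closed form for 0^2 + 1^2 + ... + (n-1)^2.
    return (n - 1) * n * (2 * n - 1) // 6


def wrong_sum_of_squares(n: int) -> int:
    return n * n  # fast but incorrect, so it must be rejected


speedup = evaluate_candidate(slow_sum_of_squares, fast_sum_of_squares, (200_000,))
rejected = evaluate_candidate(slow_sum_of_squares, wrong_sum_of_squares, (200_000,))
print(f"speedup: {speedup:.0f}x, incorrect candidate: {rejected}")
```

Checking correctness before timing matters here: a candidate that is faster but wrong earns no score, which mirrors the requirement that the code must still pass the same correctness checks.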

Claude 在自己提出实验方案这件事上也越来越强。2026 年 4 月,Anthropic 首次展示了 Claude 端到端地完成一个开放式科研项目。我们把一个 AI 安全领域的开放性问题交给由 Claude 驱动的智能体——简单说,就是「一个较弱的模型能否可靠地监督一个更强的模型?」——然后任由它们去解决。这一过程包括提出假设、加以检验、与并行运行的其他智能体分享发现,并不断迭代。这项任务有明确的性能「下限」和「上限」:下限是弱监督者独自工作时能达到的水平;上限是强模型在用正确答案训练后所能达到的水平。两名人类研究员花了大约一周,弥合了这一差距的约 23%;而智能体在累计 800 小时内弥合了 97%,消耗的算力约合 1.8 万美元。这项工作也有一些需要注意的地方:该结果未能干净利落地迁移到生产规模的模型上,而且选定问题和制定评分标准的依然是人类。但在这些前提之内,每一个实验都是智能体自己设计的。设定方向,是人类所扮演的唯一实质性角色。

Claude is getting better at proposing its own experiments. In April 2026, Anthropic published the first demonstration of Claude running an open-ended research project end to end. Claude-powered agents were given an open problem in AI safety—roughly, can a weaker model reliably supervise a stronger one?—and were left to solve it. This involved proposing hypotheses, testing them, sharing findings with parallel agents, and iterating. The task has a clear performance "floor" and "ceiling": the floor is how well the weak supervisor would do on its own; the ceiling is how the strong model does when trained on correct answers. Two human researchers, over about a week, recovered roughly 23% of that gap; the agents recovered 97% over 800 cumulative hours and used roughly $18,000 in compute. There are some caveats to this work; the result didn't transfer cleanly to production-scale models, and humans still chose the problem and created the scoring rubric. But within those bounds, the agents designed every experiment themselves. Direction-setting was the only meaningful role a human played.
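The "share of the gap recovered" figure follows from the floor and ceiling definitions above. A minimal sketch, assuming the usual normalisation of a score between those two bounds; the floor, ceiling, and score values below are hypothetical and chosen only to reproduce the quoted 23% and 97%.

```python
def gap_recovered(score: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling gap closed by a given score.

    floor:   performance of the weak supervisor on its own
    ceiling: performance of the strong model trained on correct answers
    """
    if ceiling <= floor:
        raise ValueError("ceiling must be greater than floor")
    return (score - floor) / (ceiling - floor)


# Hypothetical accuracy values, not taken from the source.
FLOOR, CEILING = 0.60, 0.90

humans = gap_recovered(0.669, FLOOR, CEILING)
agents = gap_recovered(0.891, FLOOR, CEILING)
print(f"humans: {humans:.0%}, agents: {agents:.0%}")
```

A value of 0 means no improvement over the weak supervisor alone, and a value of 1 means the strong model matched what it achieves with ground-truth labels.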

Claude 在引导科研会话走向研究成果这件事上也越来越在行。我们检视了一批真实的 Claude Code 会话(时间介于 2026 年 1 月至 3 月之间),在这些会话中,Anthropic 研究员正与 Claude 一起攻关某个开放式的调查性问题,比如弄清某次训练为何反复崩溃,或某个模型为何在某项基准测试上得分偏低。在每一个案例中,我们都找到了研究员「走弯路」的那一刻:他们选择了某个方向,使会话一度偏离正轨,之后才重新回到正道。随后,我们只把会话偏离之前的工作内容展示给不同的 Claude 模型,并询问它接下来会怎么做。再由另一个能够看到会话最终走向的独立 Claude 来判断:究竟是 AI 还是人类提出的下一步更好。

Claude is getting better at steering research sessions towards research findings. We examined real Claude Code sessions (between January and March 2026) where Anthropic researchers were working with Claude on an open-ended investigative problem, like figuring out why a training run kept crashing, or why a model scored poorly on a benchmark. In each case, we found a moment where the researcher took a detour: they pursued a direction that sent the session sideways before it eventually got back on track. We then showed various Claude models only the work from before the session went off-course and asked each what it would do next. A separate Claude that was able to see how the session eventually turned out then judged whether the AI or the human suggested the better next step.

Figure: Where a researcher went wrong, could Claude have done better? Share of samples in which the model's suggested next step beat the human's; Mythos Preview reaches 64%.

由于我们刻意挑选的是那些已知人类选择仍有改进空间的时刻(n=129),所以这并不是模型与人类判断之间一对一的对等比较。这些时刻为我们提供的,是一组真实而有挑战性的情境:在这些情境中,正确的下一步并不显而易见,而人类的选择恰好可以充当一把有用的标尺,用来比较模型表现随时间的变化。按这一指标衡量,我们 2025 年 11 月最强的模型(Opus 4.5)有 51% 的时候胜过了人类的选择;到 2026 年 4 月(Mythos Preview),这一比例上升到了 64%。科研的日常工作,在很大程度上就是这样一连串「下一步该怎么走」的决策,因此这是一个能反映模型最终能否独立开展一项调查的相关指标。我们将这一结果视为一个早期信号,表明 AI 系统在做出 AI 研究所依赖的那类判断方面正变得越来越好。

Because we deliberately picked moments (n=129) where we know the human's choice had room for improvement, this isn't a like-for-like comparison between model and human judgement. What these moments give us is a set of realistic, challenging situations where the right next step is not obvious, and where the human's choice serves as a useful yardstick to compare model performance over time. On this measure, our best model in November 2025 (Opus 4.5) beat the human choice 51% of the time; in April 2026 (Mythos Preview), this grew to 64%. The day-to-day work of research is largely a chain of these next-step decisions, making this a relevant measure of the model's ability to eventually run an investigation of its own. We view this result as an early signal that AI systems are getting better at making the kinds of judgement calls that AI research depends on.
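With a sample of 129 moments, the quoted win rates carry noticeable sampling uncertainty. The sketch below estimates it with a Wilson score interval. It rests on two assumptions not stated in the source: that each model was scored on all 129 moments, and that the win counts can be reconstructed from the rounded percentages.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate of wins / n."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


N_MOMENTS = 129  # n quoted in the text

# Win counts reconstructed from rounded percentages (assumption).
opus_4_5_wins = round(0.51 * N_MOMENTS)
mythos_wins = round(0.64 * N_MOMENTS)

for name, wins in [("Opus 4.5", opus_4_5_wins), ("Mythos Preview", mythos_wins)]:
    low, high = wilson_interval(wins, N_MOMENTS)
    print(f"{name}: {wins}/{N_MOMENTS}, interval {low:.2f} to {high:.2f}")
```

Under these assumptions each interval extends roughly eight percentage points on either side of the point estimate, so the comparison is best read as the early signal the text describes rather than a precise measurement.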

Anthropic 未来的工作会是什么样子?

THE FUTURE OF WORK AT ANTHROPIC


种种证据表明,在 AI 开发流程的每一个环节,人类所扮演的角色都在不断收窄。一旦人类与 AI 编写的代码在质量上持平,人类就会彻底停止写代码,转而只负责审阅。但如果他们审阅代码的速度赶不上 Claude 生成代码的速度,人类的审阅就会成为 AI 开发的瓶颈。同样地,一旦 Claude 能够运行实验,问题就会转向「这些实验中,哪一个值得去做?」简而言之:如今「动手去做」(即写代码、跑实验、产出结果)几乎不再耗费人类的时间,尽管它仍然要消耗算力。

The evidence suggests that the human role is narrowing at each step in the AI development process. Once human- and AI-authored code quality reach parity, humans will stop writing code entirely, and shift to only reviewing it. But if they can't review code as quickly as Claude can generate it, human review will become the bottleneck to AI development. Similarly, once Claude can run experiments, the question shifts towards "Which of these experiments is worth running?" Put simply: the doing (i.e., writing the code, running the experiment, producing the result) now costs almost nothing in human time, even if it still has costs in compute.

就目前而言,人类仍具相对优势的一个领域,是科研品味与判断力——包括判断哪些问题重要、哪些结果值得信赖,以及某条路径何时已是死胡同。

An area of human comparative advantage, for now, is research taste and judgment, including choosing which problems matter, which results to trust, and when an approach is a dead end.

如果我们错了呢?

WHAT IF WE'RE WRONG?


对上述证据,一个很自然的反驳是:仍然掌握在人类手中的那部分工作——选择该攻克哪些问题——才是最要紧的。没有这份判断力,Claude 只是一个能干的助手,而不是一个能够独自推动 AI 进步的系统。

A natural objection to the evidence presented above is that the work that is still in human hands—choosing which problems to work on—is what matters most. Without that judgment, Claude is a capable assistant, but not a system that could drive AI progress on its own.

今天的训练方法和架构究竟能否解锁这种能力,确实还说不清楚。但 AI 的进步,很少来自「灵光乍现」的瞬间。在 AI 近年的历史中,这样的时刻确实出现过几次,比如 Transformer 架构,或混合专家(mixture-of-experts)模型,但真正颠覆范式的想法,往往相隔数年才出现一次。在这些里程碑之间,绝大多数进步都是渐进式的:我们把某样东西放大规模,看看哪里会出问题,修好它,然后再试一次。而这恰恰是 Claude 如今最擅长的那种工作流程。爱迪生说过,天才是 1% 的灵感加上 99% 的汗水。但我们看到的是,这「汗水」正越来越多地被自动化。越来越清楚的一点是:推动前沿向前的工作中,有很大一部分是可以自动化的;大规模的科研进展,在很大程度上是工具和资源的函数——它们决定了你能多快地跑实验、一次能同时跑多少、以及多快能拿到结果。

It is genuinely unclear whether today's training methods and architectures could unlock that capacity. But AI is rarely advanced by "eureka!" moments. There have been a few of these in AI's recent history, like the Transformer architecture, or mixture-of-experts models, but paradigm-shifting ideas arrive years apart. In between, most progress is incremental: we scale something up, see what breaks, fix it, and try again. That is exactly the kind of workflow Claude now excels at. Edison said that genius is 1% inspiration and 99% perspiration. But we see perspiration becoming increasingly automated. It's becoming clear that much of what advances the frontier is automatable; large-scale research progress is mostly a function of tools and resources, which dictate how fast you can run experiments, how many you can run at once, and how quickly you can get results.

即便我们假设 Claude 永远无法获得出色的科研品味,对我们这些证据的一种保守解读,仍然意味着一种复利式的加速。如果人类把大部分时间花在那占比仅个位数的「设定方向」工作上,而把其余的一切都交给 Claude,那就意味着每一位工程师或研究员所驾驭的工作量,都远超从前。我们所看到的证据表明,Anthropic 的员工既跑得更快,覆盖的面也更广。落到实处,这意味着:在高效 AI 工具问世之前与之后相比,AI 已经让 Anthropic 的步伐快了许多。

Even if we suppose that Claude never achieves good research taste, a conservative reading of our evidence still implies compounding acceleration. If humans spend most of their time on the single-digit fraction of work that is direction-setting, while Claude handles the rest, that means each engineer or researcher is steering far more work than before. The evidence we see suggests that people at Anthropic are both moving faster and covering a broader surface. In practice, this means that AI already makes Anthropic move much faster than it did before the advent of effective AI tools.

一种不那么保守的解读则是:关于 Claude 科研判断力在提升的早期证据——尽管今天还很有限——本身就是一个信号,表明这项能力同样在进步。「科研品味」或许只是又一项 AI 能力:AI 系统在一段时间内做不好,然后逐渐变得擅长。在其他偏定性的技能上,我们已经见过类似的模式,比如 AI 系统能够解释一个笑话为什么好笑、展现出「心智理论」(theory of mind),以及解开语言谜题。

The less conservative reading is that the early evidence on Claude's improving research judgment—narrow as it is today—is an indicator that this capability is improving as well. "Research taste" might be just another AI capability that AI systems fail at for a time, then get good at. We've seen a similar pattern with other qualitative skills, like AI systems being able to explain why a joke is funny, demonstrate theory of mind, and solve linguistic riddles.

几种可能的未来

POSSIBLE FUTURES


接下来会发生什么,取决于两件事:这一趋势是否会延续,以及如果它延续下去,我们会选择怎么做。我们至少可以设想三种未来情景:

What happens next depends on two things: whether the trend continues, and what we choose to do if it does. We can imagine at least three future scenarios:

1. 趋势停滞,但今天的 AI 能力得到广泛扩散

The trend stalls, but today's AI capabilities are widely diffused.

本文呈现了许多指数式的增长轨迹。但这些轨迹最终或许会被证明其实是 S 形曲线。我们可能正在逼近曲线的那个拐弯处——在那里,规模的边际回报开始递减,曲线先是变直,继而趋于平缓。把一名称职的研究员与一名卓越的研究员区分开来的那种判断力,或许是一种无法靠扩大算力、数据等训练投入而获得的能力。果真如此的话,要突破这一瓶颈就需要一个全新的想法,比如一种取代当前所有前沿模型都在使用的 Transformer 架构的新架构思路。

This article features many exponential trajectories. But these trajectories may actually turn out to be S-curves. We may be approaching the bend in the curve, where returns to scale diminish and the line straightens, then flattens. The judgment that separates a competent researcher from a great one might be a capability that cannot come from scaling up training inputs like compute and data. If so, getting past this bottleneck would require a new idea, like an architectural approach that supplants the Transformer architecture that all current frontier models use.
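The difficulty of telling an exponential from an S-curve early on can be shown numerically. In this sketch a logistic curve and an exponential share the same starting value and growth rate; the parameters are arbitrary illustrations, not fitted to any data in the article.

```python
import math


def exponential(t: float, x0: float, rate: float) -> float:
    """Unbounded exponential growth from x0."""
    return x0 * math.exp(rate * t)


def logistic(t: float, x0: float, rate: float, cap: float) -> float:
    """S-curve that starts at x0, grows at the same initial rate,
    and flattens as it approaches cap."""
    return cap / (1 + (cap / x0 - 1) * math.exp(-rate * t))


X0, RATE, CAP = 1.0, 1.0, 1000.0  # arbitrary illustrative parameters

for t in (0, 2, 4, 6, 8, 15):
    e = exponential(t, X0, RATE)
    s = logistic(t, X0, RATE, CAP)
    print(f"t={t:>2}  exponential={e:>12.1f}  logistic={s:>8.1f}  ratio={s / e:.3f}")
```

Early in the run the two curves are nearly indistinguishable, and they diverge only as the logistic nears its cap. That is why a trend that has "not yet bent" cannot by itself rule out an S-curve.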

又或者,制约 AI 进步的真正瓶颈在于供应链,而非模型本身:推进前沿并将其扩散开来,所需的能源和算力可能超过现有的总量。真正的约束,也许是芯片制造的速度、电网的扩容,或互连带宽,而非智能本身。我们同样不能排除某种外生冲击会令一切大幅放缓的可能,比如算力或电力供给突然萎缩——两者中任何一个都会拖慢进展,并抬高各实验室前瞻性投资的成本。再或者,还有某种我们目前根本没有预料到的进步障碍。

Alternatively, the binding constraint to AI progress could be in the supply chain, not the model: advancing and diffusing the frontier may require more energy and compute than presently exist. The pace of chip fabrication, grid expansion, or interconnect bandwidth may be the constraint, rather than intelligence itself. We also cannot rule out an exogenous shock to the AI ecosystem that dramatically slows things, like a sudden diminishment in the supply of compute or electricity, either of which would slow progress and make forward investment by labs more expensive. Or we may not be anticipating some other barrier to progress.

即便模型能力就此冻结在今天的水平,我们也预期世界仍将发生重大变化。Glasswing 项目就是一个早期迹象:在头几周里,Mythos Preview 就在全球最重要的一批系统中发现了一万多个高危和严重级别的软件漏洞——多到足以使网络防御的瓶颈,从「发现漏洞」转移到了「能否足够快地打上补丁」。而今天的模型向更广阔经济领域的扩散,还只是刚刚起步:一家百人公司将越来越能完成原本千人公司的工作量,因为每一名员工都将坐镇于一座由众多智能体构成的金字塔之巅。

Even if model capabilities were frozen at today's level, we would expect major changes to occur in the world. Project Glasswing is one early sign: in its first weeks, Mythos Preview found more than ten thousand high- and critical-severity software vulnerabilities across the world's most important systems—enough that the bottleneck in cyber defense has already shifted from finding vulnerabilities to patching them fast enough. And we are still early in the diffusion of today's models into the wider economy, where a 100-person company can increasingly do the work of a 1,000-person one, because each employee will sit atop a pyramid of agents.

我们把这一情景列出来是为了完整起见,但我们并不认为它很可能成真。我们能够衡量的每一项能力——包括那些感觉更「软」、更难量化的能力,比如代码质量和开放式任务的成功率——迄今为止都遵循着同一条曲线。我们还没有看到这条曲线出现拐弯。在我们考虑的三种未来中,这一种会给各国政府和社会留下最充裕的适应时间。我们更担心的是接下来的两种,它们推进得更快,留给人们准备的余地也小得多。

We include this scenario for completeness, but we don't believe it's likely. Every capability we can measure, including those that feel "squishier," like quality of code and success on open-ended tasks, has so far followed the same curve. We have not yet seen that curve bend. Of the three futures we consider, this one would give governments and societies the most time to adapt. We are more worried about the next two, which would move faster and leave far less room for preparation.

2. AI 实验室持续获得复利式的效率增益

AI labs continue to see compounding efficiency gains.

在这一情景中,AI 开发在很大程度上实现了自动化,但设定研究方向、评判结果的依然是人类。随着时间推移,使用 AI 系统的组织会变得高效得多,因此我们可以预期,组织中每个人的生产力都会被显著放大。百人公司将能够完成一万人、甚至十万人组织的工作量。这会彻底变革知识工作和政府服务,但也可能被用于有害的目的——从对整个人口的威权式监控,到那种为每一个人量身定制操纵手段、并以任何人类团队都无法企及的规模运转的影响力行动。在 Anthropic 这样的公司里,人类的角色将发生转变。人们会与 AI 系统结成搭档,去扩大研究规模、产生新的洞见;并与之携手构建那些必要的系统,用以验证 AI 的输出是否值得信赖。

In this scenario, AI development becomes substantially automated, but humans continue to set research directions and judge results. Organizations that use AI systems would become much more efficient as time goes on, so we could expect to see significant productivity multipliers on each person in such an organization. 100-person companies could do the work of 10,000- or 100,000-person organizations. This would revolutionize knowledge work and government services, but could also be turned to harmful ends, from authoritarian surveillance of whole populations to influence operations that tailor manipulation to each individual and run at a scale no human team could match. The role of humans at companies like Anthropic would shift. People would partner with AI systems to scale up research and generate new insights, and together they would build the systems needed to verify that AI outputs can be trusted.

我们在此列出的证据表明,我们很可能正走向这一情景。但加速一个流程中的某个环节,往往只是把瓶颈转移到了别处:整体节奏,受制于那些尚未提速的环节。在计算机领域,这被称为「阿姆达尔定律」(Amdahl's law),同样的逻辑也适用于组织。Anthropic 已经遭遇了阿姆达尔定律的一个典型表现:随着我们开始在组织内部推送越来越多的代码,人工代码审阅已经成了一个新的瓶颈。

The evidence we've laid out here suggests that we're likely heading into this scenario. But speeding up one part of a process often just shifts the bottleneck elsewhere: overall pace is capped by the parts that haven't sped up. In computing, this is known as Amdahl's law, and the same logic can apply to organizations. Anthropic has already encountered one signature of Amdahl's law: as we've begun to push more code around the organization, human code review has become a new bottleneck.
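Amdahl's law makes the bottleneck argument concrete: if a fraction p of the work is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below applies it to a hypothetical workflow; the 80% and 10× figures are illustrative and are not Anthropic measurements.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only part of a process gets faster.

    accelerated_fraction: share of the original time that is sped up (0 to 1)
    factor: how many times faster that share becomes
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)


# Hypothetical workflow: 80% of the time is code writing that becomes
# 10x faster, while 20% is human review that does not speed up.
overall = amdahl_speedup(0.8, 10)
ceiling = amdahl_speedup(0.8, 1e9)  # writing becomes effectively free

print(f"overall speedup: {overall:.2f}x")
print(f"limit as writing time goes to zero: {ceiling:.2f}x")
```

In this example the unaccelerated 20% caps the overall gain at 5× no matter how fast the rest becomes, which is the sense in which human code review turns into the new bottleneck.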

在工程之外,我们也遇到了同样的摩擦。由于 Anthropic 员工与能力极强的模型协同工作,新的想法、新的倡议、新的工具和模拟实验如雨后春笋般涌现——其数量远远超出我们有能力去推进的范围。一个组织发现并修复这些瓶颈的速度,或许本身就是一项会随时间提升的技能,而它也可能成为任何组织最为重要的一项技能。

We've also encountered this friction outside engineering. There has been an explosion of new ideas, initiatives, tools, and simulations, as a result of Anthropic employees working with highly capable models—far more than we have the capacity to pursue. The rate at which organizations can spot and fix these bottlenecks may be a skill that improves over time, and it may become the most important skill for any organization.

3. AI 系统自身具备完全的递归式自我改进,并开始构建后继者

AI systems themselves become capable of full recursive self-improvement, and begin building their successors.

如果能力提升的技术趋势持续下去,而 AI 系统又能够发展出那种内在于变革性人类智慧的能力,那么 AI 系统自行设计并完善自身,就是一种合理的可能。

If technical trends in advancing capabilities continue, and AI systems are able to develop the capabilities inherent to transformative human ingenuity, then it is plausible that AI systems could design and refine themselves.

在这样的世界里,AI 开发的进展速度将完全由 AI 系统可用的算力(或者发现算法训练、推理中各种效率提升的速度)来决定。人类在这一开发过程中所扮演的角色将大幅缩减,我们的大部分精力很可能会转向对一座由 AI 系统运行、不断扩张的「虚拟实验室」进行监督、确认与验证。我们预期,能够自动化开展 AI 研发的系统,其所掌握的技能将可以迁移到科学的其余领域,使它们得以开始变革其他学科。

In this world, the pace of progress in AI development becomes determined entirely by the availability of compute (or the speed of discovering various efficiencies in algorithmic training or inference) for AI systems. Humans play a substantially diminished role in their development, likely moving most of our effort towards oversight, validation, and verification of an expanding "virtual lab" run by AI systems. We expect that systems capable of automated AI research and development would have skills that would transfer to the rest of science, allowing them to begin to revolutionize other fields.

在这样的未来里,对齐问题将如何得到解决——或者根本无法解决——是我们最没有把握的事情。模型可能被证明足够对齐、又具备足够的科研品味,从而发现并实现我们尚未抵达的全新解决方案。即便对齐不足,它们也可能足够明智,会主动叫停开发。但反过来,今天的模型中那些偶尔出现的对齐偏差,也可能随着模型构建其后继者而层层累积,变得越来越频繁、却越来越难以理解,直到我们失去对它们的控制。我们或许根本无法构建、整合并验证那些工具——而正是借助这些工具,我们才能弄清自己究竟正处在哪一条趋势线上。

How the alignment problem gets solved—or not—in this future is something we are least certain about. Models could prove to be sufficiently aligned and capable enough of research taste that they discover and implement novel solutions that we have not yet reached. They could also be sufficiently wise to halt development if not. Alternatively, the rare occurrences of misalignment present in today's models could compound as the models build their successors, growing more frequent but less understood until we lose control of them. It's possible that we can't build, integrate, and verify the tools that we'd need to understand which trendline we are actually on.

对于这样的世界会是什么样子,我们并没有可靠的直觉,因为我们当前的经济是由人类以及人类制造的工具所驱动的。从本质上说,一个由快速递归式自我改进所驱动的世界,可能会被那个自我改进的模型所主导——因为它的能力会彻底盖过人类,并在更广阔的经济中不断扩散。一旦人类劳动不再具备竞争力,经济会变成什么样子,是很难预测的。

We do not have good intuitions for what this world would look like, because our economy is currently driven by humans and human-built tools. By its nature, a world driven by fast recursive self-improvement could become dominated by the self-improving model as its capabilities fully eclipse those of humans and the model proliferates across the broader economy. It is difficult to predict what the economy looks like if human labor stops being competitive.

即便模型开发变得完全自动化且递归化,我们也无法预测这对大多数人的日常生活意味着什么。阿姆达尔定律在这里同样适用。递归式智能可能会让《充满爱意的机器》(Machines of Loving Grace)一文所勾勒的诸多益处得以实现,在某些领域还会来得很快。我们预期,具身智能(即机器人技术)可能会紧随递归式智能之后到来,并沿着一条类似的路径——回报递增、成本递减。更强大的智能或许能帮助我们更快地在物理世界中建造东西、更高效地开展挽救生命的药物临床试验,以及发展出全新的协作形式。

Even if model development became fully automated and recursive, we can't predict what that would mean for most humans' daily lives. Amdahl's law applies here as well. Recursive intelligence could lead to achieving many of the benefits outlined in Machines of Loving Grace, quickly in some domains. We expect that embodied intelligence (i.e., robotics) might quickly follow recursive intelligence, and follow a similar path of increasing returns at decreasing cost. More powerful intelligence might help us build things in the physical world more quickly, run more productive clinical trials of lifesaving drugs, and develop novel forms of coordination.

但仅仅实现了递归式改进,并不意味着工业生产的方式、社会的组织形态或市场的运作机制会立刻发生改变。再多的智能,也无法提前获知一种药物要经过数十年实际使用才能显现的效果,无法比宪法规定的时间更早地举行选举,也无法在一个周末里把一个陌生人变成多年的老友。对大多数人而言,这一未来在体感上的节奏,仍将由那些瓶颈来决定,哪怕上游的实验室是以算力的速度在运转。当不断加速自我构建的递归式智能,撞上由人类、人际关系和治理所构成的世界,这场碰撞,同样是这一未来中我们无法预测的部分。

But achieving recursive improvement alone does not suggest an immediate change in how industrial production occurs, societies organize, or markets function. More intelligence can't learn what a drug does over decades of use, can't hold elections sooner than a constitution dictates, and can't turn a stranger into an old friend in a weekend. For most people, the felt pace of this future will still be set by the bottlenecks, even if the laboratory upstream runs at the speed of compute. That collision, where recursive intelligence building itself ever faster meets the world of humans, relationships, and governance, is another part of this future we can't predict.

我们应该怎么做?

WHAT SHOULD WE DO?


如果有可能切实放慢这项技术的发展,好让我们有更多时间去应对它那极其深远的影响,我们认为这很可能是一件好事。但如果放慢脚步只是让那些最不谨慎的参与者在技术上追了上来,那它反而可能让所有人都更不安全。在缺乏全球协调机制的情况下,企业和政府将不得不在竞争压力与地缘政治压力之下,就安全问题做出艰难的抉择。

If it were possible to effectively slow the development of this technology to give ourselves more time to deal with its immense implications, we think that would likely be a good thing. But if a slowdown simply lets the least cautious actors catch up technologically, it could leave everyone less safe. Without a global coordination mechanism, companies and governments will have to make difficult decisions about safety while under competitive and geopolitical pressures.

我们相信,让世界拥有「放慢或暂时叫停前沿 AI 开发」这一选项,会是一件好事,因为这能让社会结构与对齐研究跟上技术前进的步伐。Anthropic Institute 将与众多伙伴合作开展研究,并采取行动,帮助构建一次可信的放慢或暂停所需要的各种系统。这些系统将使前沿 AI 开发者能够核实:全球其他各方是否真的已经停下或放慢了脚步,以及某个不良行为者不能借「协调放慢」之名,暗中抢先冲到前面。倘若这样的系统真的存在,我们预期:只要其他处于或接近前沿的开发者也以可核实的方式放慢或暂时叫停,我们也会这样做。

We believe it would be good for the world to have the option to slow or temporarily pause frontier AI development to enable societal structures and alignment research to keep up with the advance of the technology. The Anthropic Institute will conduct research—in collaboration with many others—and take actions to help build the systems that a credible slowdown or pause would require. These systems would enable frontier AI developers to verify that others globally have actually stopped or slowed, and that a bad actor could not use the auspices of a coordinated slowdown to jump ahead in secret. If such systems existed, we expect that we would slow down or temporarily pause, if other developers at or near the frontier also did so in a verifiable manner.

一次有意义的放慢或暂停,将需要分处多个国家、处于或接近前沿、且资源雄厚的多家实验室,同意在同样的条件下停下来。它还要求每一方都能核实其他各方是否真的停了下来。由于 AI 系统的独特属性,这一军备控制难题中「可探测性」(这是比「可核实性」更低的标准)这一环,比在其他技术上要棘手得多。训练运行远比导弹发射井更容易隐藏,其投入要素是通用的,而悄悄「违约」的诱惑又极其巨大——因为谁在别人暂停时继续推进,谁就可能坐收领先优势。一次可信的暂停,还必须明确:是什么触发它、是什么解除它,以及由谁来裁决。

A meaningful slowdown or pause would require multiple well-resourced labs at or near the frontier, in multiple countries, agreeing to stop under the same conditions. It would also require that each can verify that the others have actually stopped. Due to the unique characteristics of AI systems, the detectability (a lower standard than verifiability) element of this arms control problem is much more challenging than with other technologies. Training runs are far easier to conceal than missile silos, their inputs are general-purpose, and the incentive to defect quietly is enormous, because whoever continues while others pause could inherit the lead. A credible pause also has to specify what triggers it, what lifts it, and who adjudicates.

这一切在原则上并非注定不可能——世界已经为其他复杂技术建立过核查机制(例如《中导条约》)——但那些机制花了数十年才同时建立起相应的基础设施与信任。我们没有那么长的时间。相比之下,单独一家实验室的单方面暂停是可以立刻做到的,但所能成就的要少得多:它会改变谁是领跑者,却无法催生出那个当下所缺失的、更广泛的审议过程。

None of this is necessarily impossible in principle—the world has built verification regimes for other complex technologies (e.g., the Intermediate-Range Nuclear Forces Treaty)—but those regimes took decades to build both the infrastructure and the trust. We don't have that long. A unilateral pause by one lab, by contrast, is achievable immediately, but accomplishes much less: it would change who the front-runner is, but it would not create the wider deliberative process that is currently missing.

在接下来的几个月里,我们将组织一系列对话,邀请政策制定者、研究人员、公民社会以及其他 AI 公司,共同来回答本文提出的一些问题,尤其是围绕完全的递归式自我改进,以及如何为协调与审议创造更好的选项。我们会把这些对话的成果公开发表。共同探究这些问题的窗口已经到来,而 AI 公司之外的人们,也理应参与到这场审议中来。

In the coming months, we will organize conversations where policymakers, researchers, civil society, and other AI companies can help answer some of the questions this piece raises, especially around full recursive self-improvement and how to create better options for coordination and deliberation. We'll publish what comes out of it. The window to investigate the questions together is here, and people outside AI companies should be involved in this deliberation.

本文由 Marina Favaro 与 Jack Clark 共同撰写,Santi Ruiz 提供编辑支持。Shan Carter、Romello Goodman 和 Nikki Makagiansar 根据 Brian Calvert 与 Jun Shern Chan 收集的数据制作了可视化图表。Daniel Freeman、Jim Baker、Max Young、Sarah Pollack、Francesco Mosconi、Holden Karnofsky、Andy Jones、Kevin Troy、Anton Korinek、Meg Tong、Andrew Ho、Dan Altman、Drake Thomas、Jack Shen、Sasha de Marigny 与 Avital Balwit 提供了反馈意见。

Marina Favaro and Jack Clark co-authored this piece, with editorial support from Santi Ruiz. Shan Carter, Romello Goodman, and Nikki Makagiansar created the visuals from data collected by Brian Calvert and Jun Shern Chan. Daniel Freeman, Jim Baker, Max Young, Sarah Pollack, Francesco Mosconi, Holden Karnofsky, Andy Jones, Kevin Troy, Anton Korinek, Meg Tong, Andrew Ho, Dan Altman, Drake Thomas, Jack Shen, Sasha de Marigny, and Avital Balwit provided feedback.

脚注 · Footnotes


1. METR 的核心指标,衡量的是 AI 系统在一组任务上能够保持 50% 可靠度的时间跨度;不过在 80% 可靠度下,趋势线看起来是一样的。

METR's key measure tells you the time horizon over which AI systems can be 50% reliable at a basket of tasks, though the trendline looks the same at 80% reliability.

2. 尤其是当基准测试转向更开放的形式和更难的任务(例如奥林匹克级别的数学)时,由于题目与答案集本身存在错误(比如表述含糊的题目和无解的问题),它们往往会在不到 100% 的成绩处就达到饱和。

Especially as they shift toward more open-ended formats and more difficult tasks (e.g., Olympiad-level mathematics), benchmarks often saturate below 100% due to errors in the question and answer sets like ambiguous problem statements and unsolvable questions.

3. Anthropic 的管理层曾公开估计,我们 90% 或以上的代码(包括脚本和实验性代码)出自 Claude 之手。而我们这里所说的「超过 80%」,衡量的是合并到生产环境的代码行中可归因于 Claude 的占比。这一测量在两方面更为保守:一是我们的归因流程本身存在缺口,二是那些未被归因于 Claude 的代码行中,也包含了自动生成的代码以及其他同样并非人类手写的产物。

Anthropic leadership have publicly estimated that 90% or more of our code is written by Claude, including scripts and experimental code. Our >80% figure measures the share of lines merged to production that can be attributed to Claude. This is a more conservative measurement in two ways: our attribution pipeline has gaps, and the lines not attributed to Claude include auto-generated code and other artifacts that were not hand-written by humans either.

4. 代码产量的这种激增,正在给所有人共享的基础设施带来压力。GitHub——全球大部分软件赖以构建的平台——在整个 2025 年大约有 10 亿次代码提交;到 2026 年年中,它每周就有 2.75 亿次提交,照此速度全年约为 140 亿次。该公司的首席运营官表示,他们正「以惊人的力度」扩充容量,仅仅是为了跟上节奏。

This surge in code production is straining the infrastructure everyone shares. GitHub—the platform most of the world's software is built on—saw roughly one billion code commits in all of 2025; by mid-2026 it saw 275 million a week, on pace for roughly 14 billion over the year. The company's COO has said that it is "pushing incredibly hard" on capacity just to keep up.
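The annualized figure in this footnote follows from simple arithmetic, sketched below. The inputs are the rounded numbers quoted above, and the 52-week year is an assumption made for the extrapolation; the variable names are illustrative.

```python
WEEKS_PER_YEAR = 52  # assumption: extrapolate the weekly rate over a full year

commits_2025 = 1_000_000_000           # "roughly one billion" commits in all of 2025
weekly_commits_mid_2026 = 275_000_000  # weekly rate quoted for mid-2026

# Annualize the mid-2026 weekly rate and compare it with the 2025 total.
annualized_2026 = weekly_commits_mid_2026 * WEEKS_PER_YEAR
growth_multiple = annualized_2026 / commits_2025

print(f"Annualized 2026 pace: {annualized_2026 / 1e9:.1f} billion commits")
print(f"Multiple of the 2025 total: {growth_multiple:.1f}x")
```

The result, about 14.3 billion, matches the footnote's "roughly 14 billion", and implies a pace about 14 times the 2025 total if the mid-2026 weekly rate held for the whole year.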

5. 有关这项调查方法的更多细节,可参见《Claude Opus 4.7 系统卡》第 2.3.5 节。

Additional details on the methodology of this survey are discussed in section 2.3.5 of the Claude Opus 4.7 System Card.

6. 许多受访者可能并没有仔细考虑该如何在问题定义中校正各种偏差或微妙之处,而 METR 近期的研究表明,开发者对 AI 生产力提升的估计可能存在高估。

Many respondents may not have thought carefully about how to account for various biases or subtleties in the question definition, and recent research by METR shows that developer estimates of AI productivity uplift can be overestimated.

7. 加速幅度能有多大,在很大程度上取决于初始代码还留有多少改进空间,因此不应将其解读为真实世界中的训练加速。所以,这里真正值得关注的并不是那个绝对的倍数。更具参考价值的,是这套实验设置所使得的对等比较:既包括模型之间的比较(过去一年从约 3 倍到约 52 倍),也包括与熟练人类的比较(在同一任务上,四到八小时内约 4 倍)。

How large the speedup gets depends heavily on how much room for improvement the starting code leaves, and it should not be read as a real-world training speedup. So the absolute multiple is not the figure to anchor on here. What is more informative is the like-for-like comparison that this experimental setup makes possible, both across models (~3x to ~52x over the past year) and against a skilled human (~4x in four to eight hours on the same task).

8. 为了检验评判者是否存在偏向,我们在另一组 127 个时刻上做了同样的测试——在这组时刻中,人类的下一步本就已经很出色(这与原先那组人类方向仍有改进空间的情形相反)。结果,模型的建议只有约 20% 的时候被判定为更优。

As a check on judge bias, we ran the same test on a separate set of 127 moments where the human's next move was already strong (as opposed to the original set, where the human's direction had room for improvement). There, the models' suggestions were judged better only about 20% of the time.
