
Anthropic's Project Fetch Phase Two: Testing Claude on Robotics Tasks

Martin Holloway | Published 37m ago | 4 min read | Based on 3 sources

Anthropic has published results from Project Fetch: Phase Two, a research effort examining whether Claude can meaningfully assist Anthropic employees with sophisticated robotics tasks. The original experiment was conducted in August 2025.

Project Fetch: Phase Two, documented on Anthropic's official research blog on 21 June 2026, extends an earlier phase of the project first detailed in November 2025. The core question is straightforward: can a large language model like Claude serve as a practical collaborator for employees working on physically embodied systems — the kind of work that sits at the intersection of software reasoning and real-world mechanical actuation?

The project's name is not purely metaphorical. A robodog tasked with fetching a beach ball during the experiment failed to complete the retrieval, a detail Anthropic did not bury. It is a candid data point — physical manipulation in unstructured environments remains one of the harder open problems in robotics, and no amount of language model capability straightforwardly resolves the sensor-actuator loop.

What Project Fetch is actually probing is something narrower and arguably more tractable: whether Claude can support the human side of robotics work — helping employees reason through task design, debug system behavior, interpret sensor output, or navigate the procedural complexity of configuring and deploying robotic platforms. This is AI as a force-multiplier for domain experts rather than as a direct replacement for the robot's own perception and control stack.

The framing matters here. Much of the public discourse around AI and robotics conflates two distinct capability questions: what can an LLM do when its outputs are piped into a robot's control system, and what can an LLM do when a human engineer is the intermediary. Project Fetch, at least as scoped in these phases, appears to be testing the latter. That is a more conservative and arguably more immediately useful scope — enterprise AI deployments have repeatedly found that augmenting skilled workers yields faster, more reliable returns than attempting full automation of physically complex workflows.

Anthropic's choice to run this experiment internally, with its own employees as the subjects, follows a pattern the company has used in other capability evaluations. Using internal staff provides a controlled population of motivated, domain-literate users while also generating direct operational signal about where Claude falls short in day-to-day professional use — the kind of friction that polished demos rarely surface.

The robotics domain is a meaningful test environment for precisely this reason. Robotics work is procedurally dense, requires tight feedback between planning and execution, and involves debugging across hardware, firmware, and software layers simultaneously. If Claude can add genuine value in that context, the same underlying capability — maintaining context across complex, multi-step technical workflows — transfers readily into other engineering disciplines.

The beach ball failure is worth sitting with, though not as a sign of systemic inadequacy. Robotic manipulation of deformable, unpredictably rolling objects in open space is a benchmark that still challenges purpose-built systems with dedicated perception pipelines. Expecting a robodog, presumably operating with general-purpose locomotion firmware, to reliably execute that task says as much about the hardware and control constraints as it does about any AI assistance layer above them.

What Phase Two adds to the picture is a more systematic look at where Claude's contributions to these tasks proved durable and where they did not — which, from a research standpoint, is the more instructive result set. The progression from Phase One (November 2025) to Phase Two (published June 2026) suggests an iterative evaluation methodology: define a task environment, run the experiment, publish honestly, refine the scope, repeat.

For engineers working in robotics, automation, or adjacent fields, the practical question is whether Claude-class models are ready to sit alongside them in a real workflow rather than in a demo environment. Project Fetch is one structured attempt to answer that under actual working conditions. The results, including the failures, are a more credible input for that assessment than any benchmark run in isolation.