Vamshi Jandhyala · AI Lab

AgentScope

An interface prototype for a developer-facing harness that makes a multi-agent run legible at a glance, including when it fails.

A clickable prototype that visualises a tree of agent runs, each with its own steps, context window, and cost share, on one screen. Three switchable mock runs demonstrate the design across different tree shapes.

Themes: Interface prototype · Developer tools · Multi-agent · Observability


The problem

Building agents is hard because the developer can’t easily see what’s happening inside a run. Tokens balloon mysteriously, tool calls fire silently, the reasoning chain is opaque, and the only signal back is the final output. When something goes wrong (a wrong answer, a runaway cost, a timeout), the developer has no good way to localise the failure.

It is harder still once agents spawn sub-agents. Modern agent systems (LangGraph, AutoGen, OpenAI Swarm, Claude’s sub-agent tool pattern) routinely produce trees of agents calling other agents. Each child has its own context window, its own steps, and its own failure modes. Now the developer has to debug a hierarchy, not a single transcript, and most existing tracers either flatten the tree into a long JSON list or skip the structure entirely.
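
To make the shape of the problem concrete, here is a minimal TypeScript sketch of the kind of node such a tree produces. Every name in it is illustrative, not drawn from LangGraph, AutoGen, or any other framework named above.

```typescript
// Hypothetical shape of one node in an agent-run tree. Each agent carries
// its own steps, its own context budget, and its own failure state.
interface AgentRun {
  id: string;
  name: string;                    // e.g. "researcher", "payment-gateway"
  children: AgentRun[];            // sub-agents this agent spawned
  steps: Step[];                   // this agent's own LLM/tool steps
  contextTokens: { used: number; budget: number };
  costShare: number;               // fraction of total run cost, 0..1
  status: "running" | "succeeded" | "failed";
  error?: { stepIndex: number; message: string };
}

interface Step {
  kind: "llm" | "tool" | "spawn";  // "spawn" starts a child agent
  label: string;
  tokens: number;
  startedAtMs: number;             // offsets from run start expose parallelism
  endedAtMs: number;
}
```

Flattening a tree of these into one long JSON list is exactly what loses the structure: who spawned whom, and whose steps overlapped in time.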

Today’s options are thin. Provider dashboards show aggregate metrics but not per-run causality. LangSmith, Helicone, Langfuse, and the LLM-provider tracers each pick a slice. The pain is uniform across providers and frameworks: it doesn’t matter whether the agents are built on Cohere, Anthropic, OpenAI, Mistral, or a self-hosted model. The shape of the problem is the same.

Who it’s for

The named persona is the platform engineer at a frontier lab or financial-services firm building an agent system with three to seven sub-agents in production. They have a Slack thread full of users complaining that one query out of a thousand burned £40 in tokens or returned a wrong answer, and a tracing tool that shows them a 4,000-line JSON blob. They need to localise the failure, isolate the cost driver, and ship a fix before the next on-call rotation.

The product question

What should a multi-agent execution harness look like for that engineer? What signals must it surface, and in what hierarchy, so a developer can answer in seconds: which agent ran, who spawned whom, who ran in parallel, where each agent’s context budget went, and which step blew the run? How do you keep that surface legible as the tree grows three, five, or six levels deep without it becoming spaghetti? How do you make a failure localisable in one click rather than fifteen?
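
One way to make "one click" literal is to localise blame at the deepest failed agent rather than at the root where the error finally surfaced. A sketch, reusing the hypothetical AgentRun shape above and assuming a failed child marks its ancestors failed too:

```typescript
// Walk the tree and return the deepest failed agent: the origin of the
// failure, not the ancestors the error merely propagated through.
function localiseFailure(root: AgentRun): AgentRun | null {
  if (root.status !== "failed") return null;
  for (const child of root.children) {
    const deeper = localiseFailure(child);
    if (deeper !== null) return deeper;   // blame the leaf, not the parent
  }
  return root;                            // no failed child: it started here
}
```

In the failed refund-flow run described below, this walk would land directly on the payment-gateway leaf.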

The artifact

This prototype is the answer made clickable.

No model, no backend, no real trace ingestion. The screen runs against three hand-built mock runs you can switch between in the topbar. Each is chosen to test a different shape of the design:

  1. Vector-DB research — a five-agent run with two parallel sub-sub-agents. Tests the basic case: hierarchy, parallelism, normal completion.
  2. Deep research chain — a five-level linear chain of agents, each spawning the next. Tests whether deep nesting stays legible.
  3. Failed run · refund flow — a four-agent customer-support run where the leaf payment-gateway agent fails on an expired card and the failure propagates up through its parent. Tests the harness’s most important job: localising blame.

Switching between them resets the playhead and the selected agent. Same harness, three very different runs.
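
That reset behaviour is a one-liner in spirit. A hypothetical sketch of the topbar’s switch handler, with all names invented for illustration:

```typescript
type MockRunId = "vector-db-research" | "deep-research-chain" | "refund-flow";

interface HarnessState {
  runId: MockRunId;
  playheadMs: number;            // current position on the run timeline
  selectedAgentId: string | null;
}

// Switching runs discards view state so the new tree is read from zero.
function switchRun(state: HarnessState, next: MockRunId): HarnessState {
  if (next === state.runId) return state;
  return { runId: next, playheadMs: 0, selectedAgentId: null };
}
```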

How to look at it

What the prototype shows

A few of the product calls behind the prototype