AgentScope
An interface prototype for a developer-facing harness that makes a multi-agent run legible at a glance, including when it fails.
A clickable prototype that visualises a tree of agent runs, each with their own steps, context window, and cost share, on one screen. Three switchable mock runs demonstrate the design across different shapes.
Themes: Interface prototype · Developer tools · Multi-agent · Observability
The problem
Building agents is hard because the developer can’t easily see what’s happening inside a run. Tokens balloon mysteriously, tool calls fire silently, the reasoning chain is opaque, and the only signal back is the final output. When something goes wrong (a wrong answer, a runaway cost, a timeout), the developer has no good way to localise the failure.
It is harder still once agents spawn sub-agents. Modern agent systems (LangGraph, AutoGen, OpenAI Swarm, Claude’s sub-agent tool pattern) routinely produce trees of agents calling other agents. Each child has its own context window, its own steps, and its own failure modes. Now the developer has to debug a hierarchy, not a single transcript, and most existing tracers either flatten the tree into a long JSON list or skip the structure entirely.
Today’s options are thin. Provider dashboards show aggregate metrics but not per-run causality. LangSmith, Helicone, Langfuse and the LLM-provider tracers each pick a slice. The pain is uniform across providers and frameworks: it doesn’t matter whether the agents are built on Cohere, Anthropic, OpenAI, Mistral, or a self-hosted model. The shape of the problem is the same.
Who it’s for
The named persona is the platform engineer at a frontier lab or financial-services firm building an agent system with three to seven sub-agents in production. They have a Slack thread full of users complaining that one query out of a thousand burned £40 in tokens or returned a wrong answer, and a tracing tool that shows them a 4,000-line JSON blob. They need to localise the failure, isolate the cost driver, and ship a fix before the next on-call rotation.
The product question
What should a multi-agent execution harness look like for that engineer? What signals must it surface, in what hierarchy, so a developer can answer in seconds: which agent ran, who spawned whom, who ran in parallel, where each agent’s context budget went, and which step blew the run? How do you keep that surface legible as the tree grows three, five, or six deep without it becoming spaghetti? How do you make a failure localisable in one click rather than fifteen?
The artifact
This prototype is the answer made clickable.
No model, no backend, no real trace ingestion. The screen runs against three hand-built mock runs you can switch between in the topbar. Each is chosen to test a different shape of the design:
- Vector-DB research — a five-agent run with two parallel sub-sub-agents. Tests the basic case: hierarchy, parallelism, normal completion.
- Deep research chain — a five-level linear chain of agents, each spawning the next. Tests whether deep nesting stays legible.
- Failed run · refund flow — a four-agent customer-support run where the leaf payment-gateway agent fails on an expired card, propagating up through its parent. Tests the harness’s most important job: localising blame.
Switching between them resets the playhead and the selected agent. Same harness, three very different runs.
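Under the hood, each mock run could be nothing more than static data. A minimal sketch, assuming a shape like the following (all field and type names here are hypothetical, not the prototype's actual schema):

```typescript
// Hypothetical shape for a hand-built mock run. One node per agent,
// with a parent pointer giving the spawn hierarchy and a time range
// driving the Gantt bars.
type AgentNode = {
  id: string;
  parent: string | null; // null for the root agent
  start: number;         // ms offset from run start
  end: number;
  failed?: boolean;
};

type MockRun = {
  name: string;
  blurb: string;         // one-line description shown in the dropdown
  agents: AgentNode[];   // root first, children after their parents
};

const failedRefundFlow: MockRun = {
  name: "Failed run · refund flow",
  blurb: "Leaf payment-gateway agent fails on an expired card.",
  agents: [
    { id: "support_lead", parent: null, start: 0, end: 9000 },
    { id: "order_lookup", parent: "support_lead", start: 500, end: 3000 },
    { id: "refund_agent", parent: "support_lead", start: 3000, end: 8500, failed: true },
    { id: "payment_gateway", parent: "refund_agent", start: 4000, end: 8000, failed: true },
  ],
};

// Switching runs resets the playhead and the selected agent.
function selectRun(run: MockRun) {
  return { run, playheadMs: 0, selectedAgent: run.agents[0].id };
}
```

Keeping the runs as plain data like this is what makes three very different shapes cheap to swap behind one harness.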
How to look at it
- Pick a run in the topbar dropdown. Each one has a one-line blurb telling you what it tests.
- Press ▶ PLAY to watch the whole hierarchy unfold in real time, or drag the scrubber.
- Pick any agent in the left-rail Agents picker to scope the trace, context, and step-detail screens to that agent.
- The breadcrumb in the top bar traces your selected agent’s path back to the root. Failed agents and breadcrumb segments render in red.
- Switch between the five tabs:
  - Agent Tree — Gantt-style home view, one bar per agent, indented by depth. Failed agents are red bars.
  - Execution Trace — per-agent steps, with click-through chips for spawned sub-agents and a red error chip on failure steps.
  - Context Window — per-agent budget. Each agent has its own.
  - Cost Breakdown — total tokens, illustrative cost, a treemap of tokens-by-agent-by-step-type, and a sortable per-agent breakdown table.
  - Run Stats — totals plus a tree-indented bar chart of tokens-by-agent.
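The breadcrumb described above is just the selected agent's parent chain, read from the root down. A sketch of how it could be computed (type and function names are assumptions, not the prototype's code):

```typescript
type Agent = { id: string; parent: string | null; failed?: boolean };

// Path from the root to the selected agent, e.g.
// ["support_lead", "refund_agent", "payment_gateway"].
function breadcrumb(agents: Map<string, Agent>, selectedId: string): string[] {
  const path: string[] = [];
  let cur = agents.get(selectedId);
  while (cur) {
    path.unshift(cur.id);                              // prepend, so root ends up first
    cur = cur.parent ? agents.get(cur.parent) : undefined;
  }
  return path;
}
```

Rendering is then a map over the path, colouring any segment whose agent has `failed` set.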
What the prototype shows
- Agent tree as a Gantt. Children indent below their parent. Parallel agents stack at the same time range. Failed agents render in red. A pink playhead line tracks the scrubber.
- A typed step stream. Six step types (`user_input`, `model_think`, `tool_call`, `tool_result`, `model_output`, `spawn_agent`) plus an `error` type for failures. Each has its own colour and glyph used everywhere it appears.
- Per-agent context windows. Each agent has its own 4k context budget, shown as a stacked bar. Spawning a sub-agent costs the parent its spawn-prompt tokens; the sub-agent’s own budget starts fresh.
- Cost breakdown. A proportional treemap of tokens-by-agent, plus a per-agent table with token counts, percentage of run, illustrative cost, and status. The fastest way to spot the sub-agent that burned the budget.
- Failure localisation. When a sub-agent fails, its row turns red in the picker and its bar turns red in the Gantt. The parent agent’s `error` step in the trace links to the failed child. The breadcrumb shows the failure path. One click from “something failed” to “this exact step in this exact agent.”
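The typed step stream and per-agent budgets described above could be modelled roughly like this (a sketch; the field names, and the rule that the parent is charged for spawn-prompt tokens, are assumptions drawn from the description, not the prototype's actual schema):

```typescript
// Six ordinary step types plus a first-class error type.
type StepType =
  | "user_input" | "model_think" | "tool_call"
  | "tool_result" | "model_output" | "spawn_agent" | "error";

type Step = {
  type: StepType;
  agentId: string;
  tokens: number;           // counted against this agent's context budget
  spawnedAgentId?: string;  // set only on spawn_agent steps
  errorCode?: string;       // set only on error steps
};

// Each agent gets its own fresh 4k window; a spawn_agent step charges
// the parent its spawn-prompt tokens, while the child starts at zero.
const CONTEXT_BUDGET = 4096;

function tokensUsed(steps: Step[], agentId: string): number {
  return steps
    .filter(s => s.agentId === agentId)
    .reduce((sum, s) => sum + s.tokens, 0);
}

function budgetRemaining(steps: Step[], agentId: string): number {
  return CONTEXT_BUDGET - tokensUsed(steps, agentId);
}
```

Because every view (trace, context bar, treemap) is derived from the same typed stream, the colours and glyphs per step type stay consistent everywhere.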
A few of the product calls behind the prototype
- Agent Tree as the home tab, not the trace. The first thing a developer needs to see in a multi-agent system is the shape of the run, not the steps of one agent. The trace is what they drill into next.
- Per-agent everything (with breadcrumbs to keep scope honest). The trace, context, and step-detail screens scope to one agent at a time. A breadcrumb in the topbar plus a tab subscript (`· research_lead`, `· RUN`) means the developer always knows what they’re inspecting. Without that discipline, multi-agent UIs blur into confusion.
- Parallel agents render as stacked Gantt bars. This is the single best argument for the Gantt over a process-tree view: two agents running in parallel are two bars stacked over the same horizontal span. Visible parallelism is something a vertical tree can’t show.
- One accent colour per agent, hashed from the name. Stable across reorders. The eye tracks one agent across views without rereading labels.
- A `spawn_agent` step type, not a tool call. Spawning a sub-agent is structurally different from invoking an external tool. Conflating them under `tool_call` loses the one piece of information a multi-agent harness exists to surface: the hierarchy.
- A first-class `error` step type with red propagation. When a leaf fails, the error walks up the parent chain as `error` steps with structured fields (`error_code`, the offending child agent’s id). The harness exists to localise failures, so failures are not styled like other steps; they are unmistakeable.
- Cost as a top-level tab, not a tooltip. For developers paying per token, “where did the budget go” is a load-bearing question. Sankey-style flows would dazzle but lose precision; a proportional treemap-row + per-agent table answers the question directly.
- Three runs to test the design. A demo with one mock run can hide design weaknesses behind a single happy path. Three runs (parallel, deep, failed) prove the design covers the space, not just the easy case.
- Model-agnostic by construction. The harness shows steps, agents, and tokens, not provider-specific structures. Any agent system that emits a typed step stream with spawn relations can drive this surface.
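Two of the product calls above are concrete enough to sketch in code. The stable per-agent accent colour can come from any deterministic string hash (the djb2-style hash and HSL mapping below are one possible choice, not the prototype's actual scheme), and red propagation is a walk up the parent chain injecting `error` steps:

```typescript
// Deterministic hash of the agent name mapped to an HSL hue, so an
// agent keeps one accent colour across views and reorders.
function agentHue(name: string): number {
  let h = 5381;
  for (let i = 0; i < name.length; i++) {
    h = ((h << 5) + h + name.charCodeAt(i)) >>> 0; // hash * 33 + char, kept unsigned
  }
  return h % 360;
}

function agentColor(name: string): string {
  return `hsl(${agentHue(name)}, 70%, 50%)`;
}

// Error propagation: from a failed leaf, add an error step to every
// ancestor, each carrying the id of the child that failed beneath it.
type AgentRec = { id: string; parent: string | null };
type ErrorStep = { type: "error"; agentId: string; errorCode: string; childId: string | null };

function propagateFailure(
  agents: Map<string, AgentRec>,
  failedLeafId: string,
  errorCode: string,
): ErrorStep[] {
  const steps: ErrorStep[] = [];
  let child: string | null = null;
  let cur = agents.get(failedLeafId);
  while (cur) {
    steps.push({ type: "error", agentId: cur.id, errorCode, childId: child });
    child = cur.id;
    cur = cur.parent ? agents.get(cur.parent) : undefined;
  }
  return steps;
}
```

Nothing in either sketch touches a provider-specific structure, which is the model-agnosticism claim in miniature: names, parent pointers, and typed steps are all the surface needs.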