Prototype

ContextScope

AgentScope for agent belief state. An interface prototype that shows the conversation on the left and the agent's changing case state, as a turn-by-turn diff, on the right.

A clickable prototype for a causal debugger of agent state: what the agent believed, why it believed it, what conflicted, and what that enabled next. Where AgentScope makes a run legible, ContextScope makes the agent's belief legible. It is built around the conflict a regulated operation has to handle well: the ledger says the card payment was authorised, and the customer says it was not them.


Themes: Interface prototype · Product design · Agent observability · Governance

Before an agent can be trusted to act, you have to see how it came to believe what it believes. ContextScope is a static prototype of that view, a design provocation rather than a live tool.

It is AgentScope for agent belief state. Where AgentScope makes a run legible (the steps, the context window, the cost), ContextScope makes the agent’s changing case state legible: what it believed, why it believed it, what conflicted, and what that enabled next. One debugs the execution. The other debugs the belief.

The conversation sits on the left. The agent’s context sits on the right, rendered not as a snapshot but as a diff against the previous turn. A full snapshot is too large to read every turn and a transcript is too thin, so the change set between turns is the unit the interface is built on. The point is not another transcript viewer. The point is a causal debugger for agent state.
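To make the change set concrete, here is a minimal sketch of a turn-by-turn diff. It assumes the context can be flattened into a dictionary keyed by field path; the field names and values (`conversation.intent`, `ledger.authorised`, `dispute.reason`) are illustrative and not taken from the prototype.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldChange:
    path: str          # hypothetical field path, e.g. "conversation.intent"
    kind: str          # "added", "updated", "removed" or "unchanged"
    before: Any = None
    after: Any = None


def diff_context(prev: dict[str, Any], curr: dict[str, Any]) -> list[FieldChange]:
    """Classify every field across two flat context snapshots."""
    changes = []
    for path in sorted(prev.keys() | curr.keys()):
        if path not in prev:
            changes.append(FieldChange(path, "added", None, curr[path]))
        elif path not in curr:
            changes.append(FieldChange(path, "removed", prev[path], None))
        elif prev[path] != curr[path]:
            changes.append(FieldChange(path, "updated", prev[path], curr[path]))
        else:
            changes.append(FieldChange(path, "unchanged", prev[path], curr[path]))
    return changes


# Illustrative snapshots before and after the customer disputes the charge.
turn_2 = {"conversation.intent": "payment_query", "ledger.authorised": True}
turn_3 = {
    "conversation.intent": "dispute",
    "ledger.authorised": True,
    "dispute.reason": "unauthorised_transaction",
}
delta = diff_context(turn_2, turn_3)
```

The diff keeps unchanged rows on purpose: in this run the authorisation record does not change value, yet it is the row the conflict hangs on, so dropping it would hide the interesting part.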

The problem

Agents in a regulated operation rarely fail because one message is broken in isolation. They fail because context drifts. A customer says a £420 card payment was not them. The ledger says it was authorised and 3-D Secure verified. The agent has to hold both facts at once, notice that they conflict, stop treating the authorisation as final, move into a dispute workflow, apply provisional credit, and never tell the customer they are wrong. Most debugging surfaces make that sequence surprisingly hard to see.

Existing tracers (LangSmith, Langfuse, Arize, the provider dashboards) are built around the span and the token. They answer what executed and what it cost, and they answer it well. None centres the question a platform owner asks after a bad conversation: what changed in the agent’s context, why, and what did that change cause it to do next? The state is in there, buried in expandable JSON, but it is not the unit the interface is built on. That gap matters more as agents move from answering questions to operating workflows, because the context stops being prompt stuffing and becomes the working state of a case that moves money.

Who it’s for

The persona is the agent-platform owner at a bank, insurer, or card issuer running a customer-service agent in production: thousands of conversations a day, a support team escalating screenshots, and a tracing tool that shows every event but not the shape of the failure. After a complaint that the agent kept insisting a payment was valid when it should have opened a dispute, they need one pass to see whether the agent extracted the right charge, trusted the right evidence, noticed the conflict between ledger and claim, retrieved policy, and created the right action plan.

The product question

What should a debugger look like if the primary object is the changing state, not the transcript? The useful questions are turn-shaped:

Which message changed the agent’s context?
Which fields were added, updated, removed, or left unchanged?
What evidence caused each change?
Was the field model-inferred, tool-derived, retrieved from policy, or written to memory?
Is the context temporary, session-level, or persistent?
Did the new context create a conflict?
What downstream action did the new state enable?

A debugger that cannot answer those is not yet an agent debugger. It is a prettier log viewer.

The artifact

This is the answer made static: no model, no backend, one hand-built run of a disputed card payment. Pick a turn on the left and the right panel shows that turn’s context as a diff against the previous one. The selected message is “That wasn’t me. I never authorised that,” and that one sentence drives the whole transition: intent moves to dispute, a dispute reason is inferred, the authorisation record is retained but now conflicts with the claim, a dispute is queued, a provisional credit is planned, and the governing policy is retrieved. The prototype is deliberately narrow. It makes one thing legible: how context changes when a customer contradicts the system of record.

What the prototype argues

A selected message is a cause, not a row. The chosen turn is treated as the source of a state transition: this message caused six context changes and one conflict. In a debugging session the engineer is hunting causality, not reading for tone, so the interface should make a message point at the state it changed.

Context is typed state, with provenance on every row. Conversation state, extracted entities, memory, tools, and policy are kept distinct, because a charge the customer named is not the same kind of thing as an authorisation flag the ledger returned. And each row carries where it came from, how long it should live, and how confident the system is, so a low-confidence inference never sits next to a verified fact with the same weight.
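One way to read "typed state with provenance" is as a row type like the sketch below. The section, source, and lifetime values are lifted from the text; the class names, the `needs_review` helper, the 0.8 threshold, and the confidence numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Section(Enum):
    CONVERSATION = "conversation"
    ENTITY = "entity"
    MEMORY = "memory"
    TOOL = "tool"
    POLICY = "policy"


class Source(Enum):
    MODEL_INFERRED = "model_inferred"
    TOOL_DERIVED = "tool_derived"
    POLICY_RETRIEVED = "policy_retrieved"
    MEMORY_WRITTEN = "memory_written"


class Lifetime(Enum):
    TEMPORARY = "temporary"
    SESSION = "session"
    PERSISTENT = "persistent"


@dataclass(frozen=True)
class ContextRow:
    section: Section
    key: str
    value: Any
    source: Source       # where the value came from
    lifetime: Lifetime   # how long it should live
    confidence: float    # 0.0 to 1.0, how sure the system is


def needs_review(row: ContextRow, threshold: float = 0.8) -> bool:
    """Flag model inferences below the (assumed) threshold so they never
    render with the same weight as a verified fact."""
    return row.source is Source.MODEL_INFERRED and row.confidence < threshold


authorised = ContextRow(Section.TOOL, "authorised", True,
                        Source.TOOL_DERIVED, Lifetime.SESSION, 1.0)
reason = ContextRow(Section.ENTITY, "dispute_reason", "unauthorised_transaction",
                    Source.MODEL_INFERRED, Lifetime.SESSION, 0.72)
```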

Refusing to remember is a governance event. “Customer reports this payment as unauthorised” is evidence for this case and stays session-only. The adverse inference one step beyond it, “this customer is fraud-prone,” is exactly what a memory-enabled agent will quietly persist and later act on, and exactly what fair-treatment rules say it must not. ContextScope renders that as a first-class blocked row rather than letting it happen off-screen. What an agent declines to store is as much a governance question as what it keeps.
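A hedged sketch of what a first-class blocked row could look like in code. The event names `memory.write_blocked` and `memory.write_committed` come from the event list later in this piece; everything else is hypothetical, including the `adverse_inference` flag, which is assumed to be set by some upstream check that is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryWrite:
    key: str
    value: str
    scope: str               # "session" or "persistent"
    adverse_inference: bool  # assumed to be set upstream; not computed here


@dataclass(frozen=True)
class MemoryEvent:
    type: str    # "memory.write_blocked" or "memory.write_committed"
    key: str
    reason: str


def review_write(write: MemoryWrite) -> MemoryEvent:
    """Return an explicit event for every attempted write, so a refusal
    is recorded instead of happening off-screen."""
    if write.scope == "persistent" and write.adverse_inference:
        return MemoryEvent("memory.write_blocked", write.key,
                           "adverse inference must not persist beyond the case")
    return MemoryEvent("memory.write_committed", write.key,
                       "case evidence within its allowed scope")


evidence = MemoryWrite("customer_claim", "reports payment as unauthorised",
                       scope="session", adverse_inference=False)
inference = MemoryWrite("customer_profile", "fraud-prone",
                        scope="persistent", adverse_inference=True)

events = [review_write(evidence), review_write(inference)]
```

The design choice that matters is that the refusal returns a value. A guard that silently drops the write would enforce the same rule but leave nothing for the debugger to render.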

Conflict is a first-class surface. The contradiction is the heart of the run: the ledger says authorised and 3-D Secure verified, the customer says it was not them. Most observability tools know how to render a failed tool call in red; far fewer can render a semantic clash between two calls that both succeeded. For agents in customer operations, conflicts are not edge cases. They are the work.
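A minimal sketch of a semantic clash between two sources that both succeeded, assuming facts can be normalised to a subject, a claim, and a boolean value. The `Fact` and `Conflict` types and the transaction id `txn_001` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str   # hypothetical transaction id
    claim: str     # what is being asserted, e.g. "authorised"
    value: bool
    source: str    # "ledger", "customer", ...


@dataclass(frozen=True)
class Conflict:
    subject: str
    claim: str
    sources: tuple[str, str]


def find_conflicts(facts: list[Fact]) -> list[Conflict]:
    """Report facts about the same subject and claim that disagree.
    Each later fact is compared with the first one seen for that pair."""
    first_seen: dict[tuple[str, str], Fact] = {}
    conflicts = []
    for fact in facts:
        key = (fact.subject, fact.claim)
        earlier = first_seen.setdefault(key, fact)
        if earlier.value != fact.value:
            conflicts.append(
                Conflict(fact.subject, fact.claim, (earlier.source, fact.source))
            )
    return conflicts


facts = [
    Fact("txn_001", "authorised", True, "ledger"),     # tool call succeeded
    Fact("txn_001", "authorised", False, "customer"),  # extraction succeeded
]
conflicts = find_conflicts(facts)
```

Nothing here failed, which is the point: neither input would be rendered in red by a tracer that only knows about errors.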

Planned actions are distinct from executed tools. txn.lookup has executed, dispute.create is queued, provisional_credit.issue is planned for after the dispute exists. Many agent failures live in the gap between what the model intended and what the system actually did, and a debugger that collapses the two cannot localise them.
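The three statuses can be kept apart with a small model like the one below. The tool names and their statuses are the ones named above; the `Action` type, the `after` field, and both helper functions are assumptions for the sake of the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PLANNED = "planned"
    QUEUED = "queued"
    EXECUTED = "executed"


@dataclass(frozen=True)
class Action:
    tool: str
    status: Status
    after: Optional[str] = None  # tool that must have executed first


def pending(actions: list[Action]) -> list[str]:
    """Actions the agent intends but the system has not yet carried out."""
    return [a.tool for a in actions if a.status is not Status.EXECUTED]


def waiting_on_prerequisite(actions: list[Action]) -> list[str]:
    """Pending actions whose prerequisite has not executed yet."""
    executed = {a.tool for a in actions if a.status is Status.EXECUTED}
    return [
        a.tool for a in actions
        if a.status is not Status.EXECUTED
        and a.after is not None
        and a.after not in executed
    ]


plan = [
    Action("txn.lookup", Status.EXECUTED),
    Action("dispute.create", Status.QUEUED),
    Action("provisional_credit.issue", Status.PLANNED, after="dispute.create"),
]
```

The gap between intent and execution is then a query rather than something inferred from log order: `pending(plan)` lists what was intended, and `waiting_on_prerequisite(plan)` narrows that to what cannot run yet.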

What this isn’t yet

This is not a live debugger. It does not ingest traces from an agent framework, a core-banking workflow, or a customer-service platform, and it does not define the event schema needed to reconstruct context deltas from production traffic. The next useful version would be driven by a typed event stream: message.received, tool.result, context.field_updated, context.conflict_detected, policy.retrieved, action.planned, action.executed, memory.write_blocked, memory.write_committed.
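As a sketch of what such a stream might look like, the example below takes the nine event names from the list above and folds them into a per-turn summary. The `Event` shape, the payload keys, and the `summarise_turns` reducer are assumptions; the prototype does not define a schema.

```python
from dataclasses import dataclass, field
from typing import Any

EVENT_TYPES = frozenset({
    "message.received", "tool.result", "context.field_updated",
    "context.conflict_detected", "policy.retrieved", "action.planned",
    "action.executed", "memory.write_blocked", "memory.write_committed",
})


@dataclass(frozen=True)
class Event:
    turn: int
    type: str
    payload: dict[str, Any] = field(default_factory=dict)  # shape is assumed

    def __post_init__(self) -> None:
        # Reject anything outside the declared vocabulary.
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.type}")


def summarise_turns(events: list[Event]) -> dict[int, dict[str, int]]:
    """Count context changes and conflicts per turn."""
    summary: dict[int, dict[str, int]] = {}
    for event in events:
        turn = summary.setdefault(event.turn, {"changes": 0, "conflicts": 0})
        if event.type == "context.field_updated":
            turn["changes"] += 1
        elif event.type == "context.conflict_detected":
            turn["conflicts"] += 1
    return summary


# An illustrative, abbreviated stream for the disputed-payment turn.
stream = [
    Event(3, "message.received", {"text": "That wasn't me."}),
    Event(3, "context.field_updated", {"path": "conversation.intent"}),
    Event(3, "context.field_updated", {"path": "dispute.reason"}),
    Event(3, "context.conflict_detected", {"between": ["ledger", "customer"]}),
    Event(3, "policy.retrieved"),
    Event(3, "action.planned", {"tool": "provisional_credit.issue"}),
    Event(3, "memory.write_blocked", {"key": "customer_profile"}),
]
summary = summarise_turns(stream)
```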

The hard part is not drawing the panels. It is deciding which state transitions deserve to exist as first-class events. Once that schema exists, the interface almost draws itself.