Vamshi Jandhyala

AI Lab

Case study

Tax Policy Navigator

What it takes to make a generative answer about UK tax trustworthy enough that a citizen could act on it.

A case study in product judgment for grounded AI in a regulated domain. Why refusal has to be a designed surface, why a citation has to reach down to the individual claim, and where a machine judge stops being enough and a human has to take over. Built and evaluated end to end over the full HMRC Employment Income Manual. The hard problems turned out to be product problems, not retrieval problems.

Themes: AI product · Grounded generation · Evaluation · Regulated domains · Trust and safety


The short version

A generative answer about your own taxes is worthless unless you can check it. The model can be fluent, confident, and wrong, and you have no way to tell. So the product question is not “can a language model answer tax questions.” It is “what does the answer have to carry alongside it before a reasonable person would act on it, and what should the system do when it cannot meet that bar.”

I built a working navigator over the full HMRC Employment Income Manual to find out, and evaluated it end to end. The engineering was the easy half. The decisions that mattered were product decisions: when the system should refuse rather than answer, how a user sees which single sentence is actually supported and which is the model reaching, and the point in the evaluation loop where a machine judge has to hand off to a human expert. This is an account of those decisions and of what the evaluation could and could not prove.

The product

A single surface. The user types a tax question in plain English. The system returns either a grounded answer, where every sentence carries a footnote to the exact HMRC paragraph that supports it and the user can open that paragraph in place, or one of five typed refusal cards. There is no free-text “I’m not sure.” Uncertainty is a structured outcome the interface can route, not a sentence the model improvises.

The corpus is HMRC’s Employment Income Manual: 2,798 pages of UK income-tax rules, each carrying explicit cross-references to other pages. The scope is deliberately narrow (employment income for England, Wales and Northern Ireland) because a navigator that claims to cover capital gains, Scottish bands, and VAT, and covers them badly, is worse than one that covers employment income well and says so.

Three bars a regulated answer has to clear at once

These are not retrieval requirements. They are the product’s promises to a user who might rely on the answer.

  1. Every claim traces to a source the user can read. Not the answer as a whole. Each individual assertion, because a five-sentence answer can be four sentences of fact and one sentence of confident invention, and those read identically.
  2. Refusal is a first-class outcome, not a failure. A regulated tool that guesses when it does not know is a liability. The system has to recognise when the corpus genuinely cannot answer and say so in a way the product can act on.
  3. The system has to know the difference between “the corpus disagrees with itself” and “I could not find it.” Those are different situations that need different responses to the user.
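The second and third promises can be sketched as a typed outcome, so that a refusal is a value the interface routes on rather than a sentence the model improvises. This is a minimal illustration, not the navigator's code: the class names are hypothetical, and only three refusal kinds are shown because those are the ones the text above implies. The real system has five.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class RefusalKind(Enum):
    # Hypothetical labels inferred from the prose; the full set of five
    # refusal cards is not reproduced here.
    OUT_OF_SCOPE = "out_of_scope"        # e.g. a question outside employment income
    NOT_FOUND = "not_found"              # "I could not find it"
    CORPUS_CONFLICT = "corpus_conflict"  # "the corpus disagrees with itself"


@dataclass(frozen=True)
class GroundedAnswer:
    sentences: tuple[str, ...]
    paragraph_ids: tuple[str, ...]  # one cited paragraph per sentence


@dataclass(frozen=True)
class Refusal:
    kind: RefusalKind
    detail: str


Outcome = Union[GroundedAnswer, Refusal]


def card_title(outcome: Outcome) -> str:
    """Pick the card the interface should render for an outcome."""
    if isinstance(outcome, GroundedAnswer):
        return "Answer"
    titles = {
        RefusalKind.OUT_OF_SCOPE: "Outside what this tool covers",
        RefusalKind.NOT_FOUND: "No supporting guidance found",
        RefusalKind.CORPUS_CONFLICT: "The guidance conflicts",
    }
    return titles[outcome.kind]
```

Because "conflict" and "not found" are distinct values rather than two phrasings of the same apology, the interface can give each its own card and its own next step.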

Every claim, back to a paragraph

The decision that does the most work is making the citation reach down to the individual claim rather than the answer. The system writes the answer one sentence at a time, each sentence ending in a citation marker, then runs a separate verification pass that checks each cited sentence against the paragraph it points to and labels it supported, partially supported, unsupported, or contradicted. A sentence that cannot be supported does not get quietly dropped. It either pulls the whole answer below the threshold and triggers a refusal, or it surfaces as a claim the user is warned about.
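A minimal sketch of that last step, under stated assumptions: the four labels come from the description above, but the scoring rule (share of fully supported claims) and the `min_supported_share` value are illustrative guesses, not the navigator's actual threshold. The paragraph ids in the usage lines are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    PARTIALLY_SUPPORTED = "partially_supported"
    UNSUPPORTED = "unsupported"
    CONTRADICTED = "contradicted"


@dataclass(frozen=True)
class Claim:
    sentence: str
    paragraph_id: str  # the paragraph this sentence cites
    verdict: Verdict   # label assigned by the verification pass


def decide(claims, min_supported_share=0.8):
    """Return ("refuse", []) or ("answer", claims_to_warn_about).

    Assumed rule: refuse when the share of fully supported claims falls
    below the threshold; otherwise answer and surface every claim that
    is not fully supported, so nothing is silently dropped.
    """
    if not claims:
        return "refuse", []
    supported = sum(c.verdict is Verdict.SUPPORTED for c in claims)
    if supported / len(claims) < min_supported_share:
        return "refuse", []
    warnings = [c for c in claims if c.verdict is not Verdict.SUPPORTED]
    return "answer", warnings


# Usage with placeholder claims.
claims = [Claim("A supported sentence.", "para-A", Verdict.SUPPORTED)] * 9
claims.append(Claim("A reaching sentence.", "para-B", Verdict.UNSUPPORTED))
outcome, warnings = decide(claims)
```

The point of the shape is that an unsupported sentence always changes what the user sees: either the answer is withheld or the sentence is flagged.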

This is the primitive that any grounded product in a regulated domain ends up needing, so I built it as a standalone component the navigator depends on rather than as logic buried inside it. The interesting part is not the code. It is the trade-off it forces you to confront: cite every clause and the answer becomes a wall of footnotes no one reads; cite too loosely and you are back to trusting fluent prose. Where you set that line is a product call about how much friction a user in a high-stakes domain will tolerate, and it is different for tax than it would be for, say, a cooking assistant.

The one piece of engineering worth keeping in the story

Most of the build does not belong in a product case study. One diagnostic does, because it changed how I think about retrieval over regulated text.

The first version of the system refused to answer “what is the basic rate of income tax.” The right page had been retrieved. The model had then, correctly, declined to use it, because the pipeline had thrown away the part of the page that held the actual rates table before the model ever saw it. The fix was not a better model or a cleverer ranker.

Figure: A real slice of HMRC’s cross-reference graph centred on mileage allowance payments, shown as a directed graph of fifteen pages. Every arrow is a link an HMRC author wrote in their own prose. The full corpus has 9,535 of them.

The insight: a regulated corpus has already told you which rules relate to which, in the prose, through its own cross-references. An HMRC author writing about company cars links to the page on fuel benefit because they know the two interact. Following those author-written links at retrieval time is a stronger signal than any general-purpose reranker that has never seen the corpus, because it is the domain expert’s own map of the territory. Teaching the system to walk that map turned a class of failures into successes. The transferable claim is narrow and I want to keep it honest: this works for any corpus whose authors maintain explicit cross-references, which covers most legal, regulatory, and academic text. It does nothing for a pile of unstructured documents.
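Walking that map can be sketched as a bounded breadth-first expansion from the pages the retriever already returned. This is an illustration under assumptions, not the navigator's retrieval code: the function name, the `max_hops` and `max_pages` limits, and the page ids in the usage lines are all hypothetical, and `links` stands in for a mapping from each page to the pages its author linked to.

```python
def expand_with_cross_references(seed_pages, links, max_hops=1, max_pages=10):
    """Add pages reachable through author-written links to a retrieved set.

    seed_pages: page ids returned by the first-stage retriever, best first.
    links: dict mapping a page id to the page ids it cross-references.
    Returns the seeds followed by linked pages in discovery order.
    """
    ordered = list(dict.fromkeys(seed_pages))  # de-duplicate, keep order
    seen = set(ordered)
    frontier = list(ordered)
    for _ in range(max_hops):
        next_frontier = []
        for page in frontier:
            for target in links.get(page, ()):
                if target in seen:
                    continue
                if len(ordered) >= max_pages:
                    return ordered
                seen.add(target)
                ordered.append(target)
                next_frontier.append(target)
        frontier = next_frontier
    return ordered


# Usage with made-up page ids echoing the company-car example above.
links = {
    "company-cars": ["fuel-benefit"],
    "fuel-benefit": ["mileage-allowance"],
}
one_hop = expand_with_cross_references(["company-cars"], links)
two_hops = expand_with_cross_references(["company-cars"], links, max_hops=2)
```

Bounding the hops and the page count matters because a densely linked corpus can otherwise pull a large share of itself into the context for a single question.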

What the evaluation showed, and the caveat that matters more than the numbers

I wrote 50 questions: 40 across the in-scope domain and 10 designed to be refused. Then I scored the system end to end.

At the level of “did the system do the right thing”:

That is 48 of 50 on the right behaviour, with refusals landing only on questions that should refuse.

At the level of individual claims, the system’s own verifier scored every cited sentence across the 38 answered questions, 201 claims in total:

The honest read is that this system is very good at not making things up and at refusing when it should, and only moderately good at grounding every claim in exactly the right paragraph. Two thirds land cleanly, a third are true but lean on synthesis, and that gap, not fabrication, is the real work left.

The caveat that matters. This is the system grading its own faithfulness (whether the cited paragraph supports the claim), not an independent check of whether the answer is correct under UK tax law. That second judgement needs a chartered tax adviser, which is exactly the handoff the next section is about. The numbers above are generated by a script from the committed run, not typed in by hand, so they cannot drift from the data, and every claim is pinned to a named, public HMRC paragraph rather than to the model’s say-so.
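As a rough illustration of what "generated by a script from the committed run" can mean, the sketch below derives verdict shares from a run log. The JSON Lines layout and the `verdict` field name are assumptions, and the four records are toy data, not results from the evaluation.

```python
import json
from collections import Counter


def verdict_shares(jsonl_text):
    """Compute the share of each verdict label in a JSON Lines run log.

    Assumes one JSON object per line with a "verdict" field.
    Returns an empty dict for an empty log.
    """
    counts = Counter(
        json.loads(line)["verdict"]
        for line in jsonl_text.splitlines()
        if line.strip()
    )
    total = sum(counts.values())
    return {verdict: count / total for verdict, count in counts.items()}


# Toy run log for illustration only.
toy_run = "\n".join([
    '{"verdict": "supported"}',
    '{"verdict": "supported"}',
    '{"verdict": "partially_supported"}',
    '{"verdict": "unsupported"}',
])
shares = verdict_shares(toy_run)
```

Publishing figures this way means a reader with the run file can regenerate them, which is the property the caveat relies on.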

What the evaluation could not tell me

The most useful output of the eval was a clear line down the middle of the work, separating what a model can judge from what a human has to.

| Stage of the loop | Who should own it |
| --- | --- |
| Writing the eval questions | Model is fine |
| Authoring the ground-truth answers and citations | Domain expert, non-negotiable |
| Generating the system’s answers | The system |
| First-line scoring of correctness and groundedness | Model, at volume |
| Adjudicating rule conflicts and edge-case interpretation | Domain expert |
| Sign-off on anything a user will act on | Domain expert |

The product skill is not “use an LLM judge” or “use a human.” It is knowing precisely where the handoff falls, because that decision sets your cost, your throughput, and your liability all at once. A team that puts the human in the wrong place either cannot afford to ship or cannot afford to be wrong.

What it would take before a citizen should rely on this

The navigator is a strong demonstration and not a product, and the distance between the two is entirely product work, not engineering.

The transferable lesson

Strong retrieval and tight, claim-level grounding are necessary and nowhere near sufficient. Once you have them, every remaining hard problem is a product problem: when to refuse, how to show a user which words to trust, how to handle the conflicts and life-shaped questions that a clean corpus pretends do not exist, and where to spend a human expert’s scarce attention. The model is the easy part. Deciding what the product owes a person who might act on its answer is the work.