Case study
Tax Policy Navigator
What it takes to make a generative answer about UK tax trustworthy enough that a citizen could act on it.
A case study in product judgment for grounded AI in a regulated domain. Why refusal has to be a designed surface, why a citation has to reach down to the individual claim, and where a machine judge stops being enough and a human has to take over. Built and evaluated end to end over the full HMRC Employment Income Manual. The hard problems turned out to be product problems, not retrieval problems.
Themes: AI product · Grounded generation · Evaluation · Regulated domains · Trust and safety
The short version
A generative answer about your own taxes is worthless unless you can check it. The model can be fluent, confident, and wrong, and you have no way to tell. So the product question is not “can a language model answer tax questions.” It is “what does the answer have to carry alongside it before a reasonable person would act on it, and what should the system do when it cannot meet that bar.”
I built a working navigator over the full HMRC Employment Income Manual to find out, and evaluated it end to end. The engineering was the easy half. The decisions that mattered were product decisions: when the system should refuse rather than answer, how a user sees which single sentence is actually supported and which is the model reaching, and the point in the evaluation loop where a machine judge has to hand off to a human expert. This is an account of those decisions and of what the evaluation could and could not prove.
The product
A single surface. The user types a tax question in plain English. The system returns either a grounded answer, where every sentence carries a footnote to the exact HMRC paragraph that supports it and the user can open that paragraph in place, or one of five typed refusal cards. There is no free-text “I’m not sure.” Uncertainty is a structured outcome the interface can route, not a sentence the model improvises.
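One way to picture "uncertainty as a structured outcome" is a closed result type that the interface switches on. The following is a minimal sketch, not the navigator's actual code. The text says there are five refusal cards but names only some of them, so every identifier here (`RefusalKind`, its five members, `GroundedAnswer`, `render`) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class RefusalKind(Enum):
    # Hypothetical names: the case study states there are five typed
    # refusals but does not enumerate them, so these are guesses.
    OUT_OF_SCOPE = "out_of_scope"
    MALFORMED = "malformed"
    CONFLICT = "conflict"
    NOT_FOUND = "not_found"
    INSUFFICIENT_SUPPORT = "insufficient_support"


@dataclass(frozen=True)
class CitedSentence:
    text: str
    paragraph_id: str  # identifier of the supporting paragraph (placeholder)


@dataclass(frozen=True)
class GroundedAnswer:
    sentences: tuple[CitedSentence, ...]


@dataclass(frozen=True)
class Refusal:
    kind: RefusalKind
    detail: str


# The only two shapes a response can take; there is no free-text fallback.
Outcome = Union[GroundedAnswer, Refusal]


def render(outcome: Outcome) -> str:
    """Route on the outcome type rather than parsing model prose."""
    if isinstance(outcome, Refusal):
        return f"[refusal:{outcome.kind.value}] {outcome.detail}"
    return " ".join(f"{s.text} [{s.paragraph_id}]" for s in outcome.sentences)
```

Because the refusal kind is data rather than wording, the interface can show a different card for each kind without inspecting the model's text.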
The corpus is HMRC’s Employment Income Manual: 2,798 pages of UK income-tax rules, each carrying explicit cross-references to other pages. The scope is deliberately narrow (employment income for England, Wales and Northern Ireland) because a navigator that pretends to cover capital gains, Scottish bands, and VAT badly is worse than one that covers employment income well and says so.
Three bars a regulated answer has to clear at once
These are not retrieval requirements. They are the product’s promises to a user who might rely on the answer.
- Every claim traces to a source the user can read. Not the answer as a whole. Each individual assertion, because a five-sentence answer can be four sentences of fact and one sentence of confident invention, and those read identically.
- Refusal is a first-class outcome, not a failure. A regulated tool that guesses when it does not know is a liability. The system has to recognise when the corpus genuinely cannot answer and say so in a way the product can act on.
- The system has to know the difference between “the corpus disagrees with itself” and “I could not find it.” Those are different situations that need different responses to the user.
Every claim, back to a paragraph
The decision that does the most work is making the citation reach down to the individual claim rather than stopping at the answer as a whole. The system writes the answer one sentence at a time, each sentence ending in a citation marker, then runs a separate verification pass that checks each cited sentence against the paragraph it points to and labels it supported, partially supported, unsupported, or contradicted. A sentence that cannot be supported does not get quietly dropped. It either pulls the whole answer below the threshold and triggers a refusal, or it surfaces as a claim the user is warned about.
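The refuse-or-warn decision described above can be sketched in a few lines. This is an illustration under stated assumptions, not the navigator's verifier. The four labels come from the text; the numeric weights, the `REFUSE_BELOW` threshold, the hard stop on any contradicted claim, and the function names are all assumptions made for the example.

```python
from enum import Enum


class Label(Enum):
    SUPPORTED = "supported"
    PARTIAL = "partially_supported"
    UNSUPPORTED = "unsupported"
    CONTRADICTED = "contradicted"


# Assumed scoring weights and threshold, chosen only for illustration.
WEIGHTS = {
    Label.SUPPORTED: 1.0,
    Label.PARTIAL: 0.5,
    Label.UNSUPPORTED: 0.0,
    Label.CONTRADICTED: 0.0,
}
REFUSE_BELOW = 0.6


def faithfulness(labels):
    """Mean weight over the answer's cited sentences (0.0 if there are none)."""
    if not labels:
        return 0.0
    return sum(WEIGHTS[label] for label in labels) / len(labels)


def decide(labels):
    """Return ("refuse", []) or ("answer", indices of sentences to warn on)."""
    # Assumption: any contradicted claim refuses the whole answer outright.
    if Label.CONTRADICTED in labels or faithfulness(labels) < REFUSE_BELOW:
        return "refuse", []
    # Nothing is silently dropped: weaker sentences stay and carry a warning.
    warn = [i for i, label in enumerate(labels) if label is not Label.SUPPORTED]
    return "answer", warn
```

For example, `decide([Label.SUPPORTED, Label.SUPPORTED, Label.PARTIAL])` keeps the answer and flags the third sentence, while two unsupported sentences out of three fall below the assumed threshold and refuse.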
This is the primitive that any grounded product in a regulated domain ends up needing, so I built it as a standalone component the navigator depends on rather than as logic buried inside it. The interesting part is not the code. It is the trade-off it forces you to confront: cite every clause and the answer becomes a wall of footnotes no one reads; cite too loosely and you are back to trusting fluent prose. Where you set that line is a product call about how much friction a user in a high-stakes domain will tolerate, and it is different for tax than it would be for, say, a cooking assistant.
The one piece of engineering worth keeping in the story
Most of the build does not belong in a product case study. One diagnostic does, because it changed how I think about retrieval over regulated text.
The first version of the system refused to answer “what is the basic rate of income tax.” The right page had been retrieved. The model had then, correctly, declined to use it, because the pipeline had thrown away the part of the page that held the actual rates table before the model ever saw it. The fix was not a better model or a cleverer ranker.
The insight: a regulated corpus has already told you which rules relate to which, in the prose, through its own cross-references. An HMRC author writing about company cars links to the page on fuel benefit because they know the two interact. Following those author-written links at retrieval time is a stronger signal than any general-purpose reranker that has never seen the corpus, because it is the domain expert’s own map of the territory. Teaching the system to walk that map turned a class of failures into successes. The transferable claim is narrow and I want to keep it honest: this works for any corpus whose authors maintain explicit cross-references, which covers most legal, regulatory, and academic text. It does nothing for a pile of unstructured documents.
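Walking the authors' own links can be sketched as a bounded breadth-first expansion of the initially retrieved pages. This is a minimal illustration, assuming the cross-references have already been extracted into a mapping from page ID to linked page IDs. The function name, the `max_hops` and `limit` parameters, and the page IDs in the usage note are hypothetical.

```python
def expand_with_cross_references(seed_pages, links, max_hops=1, limit=10):
    """Add pages reachable via author-written links, keeping seeds first.

    seed_pages: page IDs from the first-pass retriever, in rank order.
    links: dict mapping a page ID to the page IDs it cross-references.
    """
    ordered = list(dict.fromkeys(seed_pages))  # de-duplicate, keep rank order
    seen = set(ordered)
    frontier = list(ordered)
    for _ in range(max_hops):
        next_frontier = []
        for page in frontier:
            for target in links.get(page, ()):
                if target not in seen:
                    seen.add(target)
                    ordered.append(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return ordered[:limit]
```

With `links = {"company-cars": ["fuel-benefit"]}`, a retrieval that surfaced only `"company-cars"` would also hand the model `"fuel-benefit"`. Capping hops and total pages keeps the expansion from swamping the context with loosely related material.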
What the evaluation showed, and the caveat that matters more than the numbers
I wrote 50 questions: 40 across the in-scope domain and 10 designed to be refused. Then I scored the system end to end.
At the level of “did the system do the right thing”:
- 38 questions answered with citations.
- 10 correctly refused (9 out of scope, 1 malformed).
- 1 conflict where two HMRC paragraphs genuinely disagree, which the system flagged and refused to paper over rather than inventing a resolution.
- 1 genuine miss, where the corpus held the answer but retrieval did not surface it. This is the real failure of the run, and I am keeping it in the writeup rather than rounding it away.
That is 48 of 50 answered or refused as intended, plus one conflict correctly flagged and one genuine miss, with refusals landing only on questions that should refuse.
At the level of individual claims, the system’s own verifier scored every cited sentence across the 38 answered questions, 201 claims in total:
- 128 fully supported, 64 percent. The cited paragraph directly states the claim.
- 67 partially supported, 33 percent. The claim is sound but the cited paragraph only partly carries it, usually because the answer pools two paragraphs into a conclusion the corpus never states in one place, or pins a year-specific figure to a paragraph that supports the rule but not the exact number.
- 6 unsupported, 3 percent, and 0 contradicted. The system almost never fabricates. When it cannot ground a claim it leans towards refusing rather than inventing.
- Mean per-question faithfulness: 0.81.
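The percentages in the list above follow directly from the claim counts. The short sketch below recomputes them as a check. It is not the script the writeup refers to, and the variable and function names are illustrative.

```python
# Claim counts as reported for the 38 answered questions.
claims = {
    "supported": 128,
    "partially supported": 67,
    "unsupported": 6,
    "contradicted": 0,
}

total = sum(claims.values())


def share(count, total):
    """Whole-number percentage of the total."""
    return round(100 * count / total)


shares = {label: share(n, total) for label, n in claims.items()}

for label, pct in shares.items():
    print(f"{label}: {claims[label]} of {total} ({pct} percent)")
```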
The honest read is that this system is very good at not making things up and at refusing when it should, and only moderately good at grounding every claim in exactly the right paragraph. Roughly two thirds land cleanly, a third are true but lean on synthesis, and that gap, not fabrication, is the real work left.
The caveat that matters. This is the system grading its own faithfulness (whether the cited paragraph supports the claim), not an independent check of whether the answer is correct under UK tax law. That second judgement needs a chartered tax adviser, which is exactly the handoff the next section is about. The numbers above are generated by a script from the committed run, not typed in by hand, so they cannot drift from the data, and every claim is pinned to a named, public HMRC paragraph rather than to the model’s say-so.
What the evaluation could not tell me
The most useful output of the eval was a clear line down the middle of the work, separating what a model can judge from what a human has to.
| Stage of the loop | Who should own it |
|---|---|
| Writing the eval questions | Model is fine |
| Authoring the ground-truth answers and citations | Domain expert, non-negotiable |
| Generating the system’s answers | The system |
| First-line scoring of correctness and groundedness | Model, at volume |
| Adjudicating rule conflicts and edge-case interpretation | Domain expert |
| Sign-off on anything a user will act on | Domain expert |
The product skill is not “use an LLM judge” or “use a human.” It is knowing precisely where the handoff falls, because that decision sets your cost, your throughput, and your liability all at once. A team that puts the human in the wrong place either cannot afford to ship or cannot afford to be wrong.
What it would take before a citizen should rely on this
The navigator is a strong demonstration and not a product, and the distance between the two is entirely product work, not engineering.
- Per-claim confidence has to reach the user. The verifier already knows which sentence is only partially supported. The interface has to show it, sentence by sentence, or the careful grounding is invisible.
- Conflict needs a resolution flow, not a shrug. “These two rules disagree, here are both” is honest and useless. Walking the user through the disambiguation is a design problem with no generic answer.
- Real questions do not respect the corpus’s categories. People ask life-event questions (“I’m getting divorced and inheriting money and changing jobs”) that span several HMRC manuals at once. The system’s refusal categories are categories of the corpus, not of the user’s life. Bridging the two is the product.
- Multi-turn is essential and absent. A single-shot system discards every verified citation the moment the user asks a follow-up. Real tax questions arrive in cascades.
The transferable lesson
Strong retrieval and tight, claim-level grounding are necessary and nowhere near sufficient. Once you have them, every remaining hard problem is a product problem: when to refuse, how to show a user which words to trust, how to handle the conflicts and life-shaped questions that a clean corpus pretends do not exist, and where to spend a human expert’s scarce attention. The model is the easy part. Deciding what the product owes a person who might act on its answer is the work.