Case study
Tax Policy Navigator
What it takes to make a generative answer about UK tax trustworthy enough that a citizen could act on it.
A case study in product judgment for grounded AI in a regulated domain. Why refusal has to be a designed surface, why a citation has to reach down to the individual claim, and where a machine judge stops being enough and a human has to take over. Built and evaluated end to end over the full HMRC Employment Income Manual. The hard problems turned out to be product problems, not model problems.
Themes AI product · Grounded generation · Evaluation · Regulated domains · Trust and safety
The short version
A generative answer about your own taxes is worthless unless you can check it. A fluent, confident, wrong answer reads exactly like a correct one, and nothing in it tells you which you have. So the real question is what an answer must carry before a reasonable person could act on it, and what the system should do when it cannot meet that bar.
I built a working navigator over the full HMRC Employment Income Manual to find out, and evaluated it end to end. The engineering was the tractable half: real work, but work with known techniques. The decisions that mattered, the ones with no playbook, were product decisions: when the system should refuse rather than answer, how a user sees which single sentence is actually supported and which is the model reaching, and the point in the evaluation loop where a machine judge has to hand off to a human expert. This is an account of those decisions and of what the evaluation could and could not prove.
Try it: a live demonstration runs the full pipeline over the Employment Income Manual on a hosted model API. It is a demonstration, not a product: it can refuse, or answer incompletely, and that behaviour is part of what this case study examines.
The product
A single surface. The user types a tax question in plain English. The system returns either a grounded answer, where every sentence carries a footnote to the exact HMRC paragraph that supports it and the user can open that paragraph in place, or one of a few typed refusal cards. There is no free-text “I’m not sure.” Uncertainty is a structured outcome the interface can route, not a sentence the model improvises.
The corpus is HMRC’s Employment Income Manual: 2,798 pages of UK income-tax rules, each carrying explicit cross-references to other pages. The scope is deliberately narrow (employment income for England, Wales and Northern Ireland) because a navigator that pretends to cover capital gains, Scottish bands, and VAT badly is worse than one that covers employment income well and says so.
Three bars a regulated answer has to clear at once
These are not retrieval requirements. They are the product’s promises to a user who might rely on the answer.
- Every claim traces to a source the user can read. Not the answer as a whole, each individual assertion, because a five-sentence answer can be four sentences of fact and one of confident invention.
- Refusal is a first-class outcome, not a failure. A regulated tool that guesses when it does not know is a liability. The system has to recognise when the corpus genuinely cannot answer and say so in a way the product can act on.
- The system has to know the difference between “the corpus disagrees with itself” and “I could not find it.” Those are different situations that need different responses to the user.
Every claim, back to a paragraph
The decision that does the most work is making the citation reach down to the individual claim rather than the answer. The system writes the answer one sentence at a time, each sentence ending in a marker, then runs a separate verification pass that checks each cited sentence against the paragraph it points to and labels it supported, partially supported, unsupported, or contradicted. A sentence that cannot be supported does not get quietly dropped. It either pulls the whole answer below the threshold and triggers a refusal, or it surfaces as a claim the user is warned about.
This is the primitive that any grounded product in a regulated domain ends up needing, so I built it as a standalone component the navigator depends on rather than as logic buried inside it. The interesting part is not the code. It is the trade-off it forces you to confront: cite every clause and the answer becomes a wall of footnotes no one reads; cite too loosely and you are back to trusting fluent prose. Where you set that line is a product call about how much friction a user in a high-stakes domain will tolerate, and it is different for tax than it would be for, say, a cooking assistant.
The one piece of engineering worth keeping in the story
Most of the build does not belong in a product case study. One diagnostic does, because it changed how I think about retrieval over regulated text.
The first version of the system refused to answer “what is the basic rate of income tax.” The right page had been retrieved. The model had then, correctly, declined to use it, because the pipeline had thrown away the part of the page that held the actual rates table before the model ever saw it. The fix was not a better model or a cleverer ranker. It was to stop discarding the page’s own structure before the model saw it.
That pointed at a broader principle, one that holds between pages as well as inside them: a regulated corpus has already told you which rules relate to which, in the prose, through its own cross-references. An HMRC author writing about company cars links to the page on fuel benefit because they know the two interact. Following those author-written links at retrieval time is a stronger signal than any general-purpose reranker that has never seen the corpus, because it is the domain expert’s own map of the territory. Teaching the system to walk that map turned a class of failures into successes. The transferable claim is narrow, and the honest version is narrower still: this helps wherever a corpus’s authors maintain explicit cross-references, which covers most legal, regulatory, and academic text, but the method is only as complete as that linking. Even HMRC’s manual, with over nine thousand cross-references, has governing pages the authors never linked from the topical pages that need them, and there the walk cannot help: it inherits the corpus’s own blind spots. Author cross-references are a strong signal where they exist, not a guarantee of coverage, and nothing at all for a pile of unstructured documents.
What the evaluation showed, and the caveat that matters more than the numbers
I wrote 50 questions: 40 across the in-scope domain and 10 designed to be refused. Then I scored the system end to end.
At the level of “did the system do the right thing”:
- 38 questions answered with citations.
- 10 correctly refused (9 out of scope, 1 malformed).
- 1 conflict where two HMRC paragraphs genuinely disagree, which the system flagged and refused to paper over rather than inventing a resolution.
- 1 genuine miss: asked whether a scholarship or bursary paid to an apprentice is taxable, the system refused, even though the manual covers scholarship income, because retrieval could not assemble a confidently grounded answer from the pages it found. This is the clearest failure at the question level, and it stays in the writeup rather than rounded away.
That is 48 of 50 on the right behaviour, with refusals landing only on questions that should refuse.
At the level of individual claims, the system’s own verifier scored every cited sentence across the 38 answered questions, 261 claims in total:
- 195 fully supported, 75 percent. The cited paragraph directly states the claim.
- 61 partially supported, 23 percent. The claim is sound but the cited paragraph only partly carries it, usually because the answer pools two paragraphs into a confidence the corpus never states in one place, or pins a year-specific figure to a paragraph that supports the rule but not the exact number.
- 1 unsupported and 4 contradicted, about 2 percent combined. The system rarely invents; the larger model’s verifier now also flags borderline contradictions the smaller one missed, which is what made the threshold for “two HMRC paragraphs disagree” need tuning.
- Mean per-question faithfulness: 0.87.
The honest read is that this system is very good at not making things up and at refusing when it should, and only moderately good at grounding every claim in exactly the right paragraph. Three quarters land cleanly, the rest are true but lean on synthesis, and that gap, not fabrication, is the real work left.
The caveat that matters. This is the system grading its own faithfulness, whether the cited paragraph supports the claim, not an independent check of whether the answer is correct under UK tax law. That second judgement needs a chartered tax adviser, which is exactly the handoff the next section is about. The numbers above are generated by a script from the committed run, not typed in by hand, so they cannot drift from the data, and every claim is pinned to a named, public HMRC paragraph rather than to the model’s say-so. These figures are from a re-run in June 2026 on a larger model.
There is a subtler blind spot in the same place, and it is the one to watch. Faithfulness scores whether each claim is grounded, not whether the answer is complete. A response can cite every sentence correctly and still omit the governing rule that changes the result, and a grounded-ness check cannot catch that, because an omission leaves no unsupported claim behind to flag. A high faithfulness score and a quietly incomplete answer look identical to the metric.
A real example from the demo: ask how a company car is taxed when taken through salary sacrifice. The system answers, correctly, that the salary given up does not reduce the car benefit, and grounds every sentence in the right HMRC pages. What it leaves out is the rule that actually decides the bill, that since April 2017 the taxable amount is the higher of the salary given up and the car’s normal taxable value, for all but the lowest-emission cars, because the page stating that rule was never retrieved. Every cited sentence is true, the verifier scores the answer highly, and a reader who acted on it could understate the tax due. Completeness, not fabrication, is the bar this product has to be measured against.
What the evaluation could not tell me
The most useful output of the eval was a clear line down the middle of the work, separating what a model can judge from what a human has to.
| Stage of the loop | Who should own it |
|---|---|
| Writing the eval questions | Model is fine |
| Authoring the ground-truth answers and citations | Domain expert, non-negotiable |
| Generating the system’s answers | The system |
| First-line scoring of correctness and groundedness | Model, at volume |
| Adjudicating rule conflicts and edge-case interpretation | Domain expert |
| Sign-off on anything a user will act on | Domain expert |
The product skill is knowing precisely where that handoff falls. That single placement sets your cost, your throughput, and your liability at once, and a team that puts the human in the wrong place either cannot afford to ship or cannot afford to be wrong.
What it would take before a citizen should rely on this
The gap between this demonstration and something a citizen could rely on is entirely product work.
- Per-claim confidence has to reach the user. The verifier already knows which sentence is only partially supported. The interface has to show it, sentence by sentence, or the careful grounding is invisible.
- Conflict needs a resolution flow, not a shrug. “These two rules disagree, here are both” is honest and useless. Walking the user through the disambiguation is a design problem with no generic answer.
- Real questions do not respect the corpus’s categories. People ask life-event questions (“I’m getting divorced and inheriting money and changing jobs”) that span several HMRC manuals at once. The system’s refusal categories are categories of the corpus, not of the user’s life. Bridging the two is the product.
- Multi-turn is essential and absent. A single-shot system discards every verified citation the moment the user asks a follow-up. Real tax questions arrive in cascades.
The transferable lesson
Strong retrieval and claim-level grounding are necessary and nowhere near sufficient. Every remaining hard problem is a product problem, and the model and the retrieval, the parts a benchmark can measure, are the tractable ones.
A note on build-versus-buy, since it shaped this project. Managed retrieval platforms get you to a grounded answer quickly, but they hide the levers this work needs: when the system retrieves the wrong page, or a generally-worded governing rule is outranked by a topical one, you have to be able to see that it happened and tune the retrieval to fix it. Over a regulated corpus those levers are not a nice-to-have; they are the distance between a demo and something a professional could trust. Convenience and control sit on a spectrum, and the more regulated the domain, the further towards control you end up.
Every decision here serves one moment, the one where a real person acts on the answer. A demo never reaches that moment. Designing for it, and not for the demo, is the product.