How To Build An AI Evaluation Harness For Internal Copilots

Tools & Technical Tutorials

8 June 2026 | By Ashley Marshall

How To Build An AI Evaluation Harness For Internal Copilots?

Build an AI evaluation harness by turning real copilot tasks into a versioned golden dataset, defining objective and expert-scored criteria, running every prompt, retrieval and model change through the same tests, and reviewing failures before release. For UK teams, include privacy, accuracy, human oversight and security checks so the harness supports both product quality and governance.

Internal copilots fail quietly when teams rely on demo confidence instead of repeatable tests. A golden dataset gives you the evidence to know whether a change has made the assistant better, worse or merely different.

Why Internal Copilots Need Evidence, Not Demo Confidence

Most internal copilot projects start with a convincing demo. Someone asks a policy question, the assistant finds the right paragraph, rewrites it in plain English and everyone sees the potential. The problem is that demos are not operating evidence. They show that a system can work once, under friendly conditions, with a user who already knows what good looks like. They do not show whether the copilot will keep working after the prompt changes, the knowledge base grows, a model version moves, a document is badly formatted or a rushed employee asks an ambiguous question on a Friday afternoon.

The UK context makes that distinction important. DSIT's 2026 AI Adoption Research found that only 16% of UK businesses were currently using at least one AI technology, yet among adopters 85% were using natural language processing and text generation. It also found that 84% of AI-using businesses applied at least some human input or checking to AI outputs or decisions. That is the real world for internal copilots: widespread interest, uneven maturity, and a continuing need for human judgement. A test harness is how you turn that judgement into a repeatable release process rather than a feeling in a meeting.

An evaluation harness is a small engineering and governance system around the copilot. It stores test cases, runs the copilot against those cases, scores results, records failures, compares versions and decides whether a release is good enough. The golden dataset is the carefully selected set of examples that represent the tasks the copilot must handle. For a HR copilot, it might include holiday policy questions, parental leave edge cases and unsafe requests for personal data. For a finance copilot, it might include invoice coding, variance explanations and queries that should be refused because they ask for restricted information.

The misconception is that this is only for large AI teams. It is not. A useful harness can start with 50 to 100 well chosen cases in a spreadsheet, a script that calls the copilot, a rubric for scoring, and a release rule. The point is discipline. You are no longer asking whether the latest build feels smarter. You are asking whether it still answers the 20 critical questions, refuses the 10 unsafe ones, cites the right source, stays inside the allowed systems and saves enough time to justify its cost. That is a management control as much as a technical control.

Source: DSIT AI Adoption Research.

Start With A Golden Dataset That Mirrors Real Work

The golden dataset is not a random pile of prompts. It is a deliberately curated sample of work that the internal copilot must handle reliably. Start with production-like material: anonymised helpdesk tickets, policy questions, CRM notes, sales queries, support transcripts, operational checklists, incident write-ups and common spreadsheet tasks. Then add the edge cases that experienced staff know cause mistakes. These are usually more valuable than generic benchmark questions because they encode the organisation's own processes, terminology, exceptions and risk tolerance.

A strong first version normally has several buckets. Include typical requests, high-value requests, ambiguous requests, outdated-document traps, retrieval failures, sensitive data requests, prompt injection attempts, and cases where the right answer is to ask a clarifying question. For each case, store the user input, allowed sources, expected behaviour, unacceptable behaviour, reference answer where appropriate, tags, owner, risk rating and scoring notes. Version it in Git, a database or a controlled spreadsheet. The format matters less than ownership and change control.

OpenAI's evaluation guidance is useful here because it warns against biased design, generic metrics and "vibe-based" evals. It recommends task-specific tests that reflect real-world distributions, logging everything so logs can become eval cases, and combining metrics with human judgement. Its examples are practical: for Q and A over documents, it suggests using production data, domain expert answers and historical logs, then tracking context recall, context precision and positively rated answers. That pattern maps directly to internal copilots, especially retrieval-augmented systems sitting over SharePoint, Google Drive, Confluence, Notion, Salesforce or a document management system.

Be strict about personal data. The golden dataset should represent sensitive situations without casually copying sensitive records into a test set. Use anonymisation, synthetic variants, redaction and access controls. The ICO's AI guidance is clear that fairness, accuracy, data minimisation, security and accountability still matter when AI processes personal data. The golden dataset should therefore include governance cases as well as quality cases: "should not answer", "should ask for permission", "should route to HR", "should not infer a medical condition", and "should cite the policy rather than inventing a rule". That is how evaluation becomes a practical implementation tool rather than an academic exercise.

Sources: OpenAI evaluation best practices and ICO AI and data protection guidance.

Define Scorecards Before You Change The Prompt

A golden dataset only becomes useful when the scoring is clear. Without a scorecard, teams slide back into preference debates: one reviewer likes concise answers, another likes detail, a third only cares about citations, and the product owner wants fewer escalations. Those preferences are legitimate, but they need to be turned into criteria before the copilot is tested. Otherwise every release decision becomes a negotiation after the result is known.

For most internal copilots, use a mixed scorecard. Keep deterministic checks where possible: did the assistant cite an allowed source, did it include a required field, did it call the correct tool, did it refuse a restricted request, did it stay below a latency or cost threshold. Add expert-scored checks for things that need judgement: usefulness, completeness, tone, clarity, risk awareness and whether the answer would help a competent employee complete the task. Where the copilot uses retrieval, score retrieval separately from generation. A bad answer may be caused by a weak prompt, but it may also be caused by the wrong document chunk being retrieved.

Set thresholds by risk. A low-risk drafting assistant may be allowed to pass with a lower usefulness threshold if the output is always edited by a human. A compliance, finance, HR or customer-impacting copilot needs stricter accuracy, citation and refusal gates. You might require 95% pass on critical refusal cases, no high-severity data leakage failures, 85% context recall on policy questions, and a minimum expert score of 4 out of 5 for answers that reach users. The exact numbers should be owned by the business process owner, not invented in isolation by the AI team.

There is a common counterargument that LLM-as-judge scoring is enough, especially because it is fast and cheap. It can help, but it should not be the sole authority for high-risk internal copilots. OpenAI's GDPval work is a good named example: across a gold set of 220 real-world tasks, expert graders blindly compared model outputs with human deliverables, while an automated grader was described as experimental and not yet reliable enough to replace experts. The same principle applies inside a business. Use model judges for scale, but calibrate them against human experts and review disagreements. Your harness should store the judge prompt, judge model, score explanation and human override so scoring itself can be audited.

Source: OpenAI GDPval evaluation write-up.

Wire The Harness Into Release Gates And Everyday Engineering

The practical implementation is straightforward once the dataset and scorecard exist. Build a runner that takes each case, calls the same copilot interface that users will use, captures the final answer, captures intermediate retrieval results and tool calls where possible, then writes a result record. The record should include application version, prompt version, model name, model settings, retrieval index version, dataset version, run time, cost estimate, scorer version and pass or fail status. Without that metadata, you cannot explain why performance changed.

Run the harness on every meaningful change. That includes prompt edits, system instruction changes, model upgrades, retrieval chunking changes, embedding model changes, new document sources, tool permission changes and guardrail updates. OpenAI's evaluation guidance recommends continuous evaluation that runs on every change, monitors for nondeterminism and grows the eval set over time. For an internal copilot, that means the harness should sit in CI or at least in a pre-release workflow. A pull request that changes a prompt should show the pass rate before and after. A model upgrade should not be accepted because the vendor says it is better. It should be accepted because the copilot is better on your own golden dataset.

A simple stack is enough to start. Store cases as JSONL or CSV. Use a Node, Python or TypeScript runner. Call the copilot through its API rather than bypassing the application layer. Store results in Postgres, BigQuery, DuckDB or even a versioned file for early work. Use a dashboard in Metabase, Grafana, Looker Studio or a lightweight internal page. The key is traceability, not expensive tooling. Tools such as LangSmith, Arize Phoenix, Humanloop, Braintrust, Langfuse, Promptfoo and OpenAI's current evaluation tooling can speed things up, but the underlying discipline is the same.

The harness should also feed operations. When users downvote an answer, edit a generated draft heavily, escalate a response, or report a hallucination, convert a sample into a new candidate eval case. That keeps the golden dataset alive. It also prevents "golden dataset rot", where the test suite reflects last quarter's product, policies and user behaviour rather than today's work. Pair this with AI context management policies so retrieval sources, prompt context and memory rules are governed alongside release tests.

Source: OpenAI evaluation best practices.

Test Security, Privacy And Misuse As First-Class Outcomes

Quality evals are not enough. Internal copilots often sit close to sensitive business systems: HR policies, customer records, support histories, contracts, finance data, sales notes and operational procedures. The harness should therefore include negative tests and governance tests from day one. If the copilot can access tools, retrieve documents or act through APIs, the question is not just whether it gives a good answer. It is whether it stays inside the authority the organisation intended.

NCSC's prompt injection guidance is a useful corrective to a dangerous misconception. It explains that current large language models do not enforce a security boundary between instructions and data inside a prompt, and warns that prompt injection should not be treated as if it were just SQL injection with different wording. That matters for internal copilots because they often read untrusted or semi-trusted material: emails, tickets, CVs, supplier documents, customer attachments, web pages and copied text. A malicious instruction hidden in one of those sources can try to make the copilot ignore policy, reveal data or call a tool it should not call.

Your golden dataset should include these cases. Add documents containing hidden instructions, customer emails asking for restricted data, prompts that request another employee's salary, attempts to override the system prompt, requests to summarise confidential board material for an unauthorised role, and tasks that try to move from read-only advice into action. Score not only the wording of the answer, but also the tool path. Did the copilot retrieve from an allowed source? Did it call a tool after reading untrusted content? Did it ask for approval before a risky action? Did it leak system instructions or confidential context?

Privacy tests are just as important. The ICO points organisations towards fairness, statistical accuracy, transparency, purpose limitation, data minimisation, security, accountability and Article 22 considerations where automated decision-making has legal or similarly significant effects. For a copilot, that means testing whether it minimises personal data in answers, avoids unsupported inferences about people, provides source-grounded explanations and routes sensitive decisions back to humans. If the harness only measures helpfulness, it will reward confident overreach. If it measures governance outcomes, it can stop that overreach before rollout.

Sources: NCSC prompt injection guidance and ICO AI fairness guidance.

Make Evaluation A Living Operating System

The final step is ownership. An evaluation harness is not a one-off build task that gets filed under testing. It is the operating system for improving the copilot safely. Assign owners for the dataset, scoring rubric, technical runner, release thresholds and business sign-off. The owner of a HR copilot's golden dataset should include HR expertise. The owner of a finance copilot's risk thresholds should include finance leadership. The AI team can build the harness, but the business must define what good means.

Set a monthly rhythm. Review new failures, user feedback, high-volume queries, low-confidence answers, document changes and policy updates. Promote selected real-world examples into the golden dataset. Retire duplicates. Add cases when the business process changes. Track pass rate by category rather than only as one headline number. A system that passes 92% overall may still fail 40% of sensitive data cases, which is not acceptable. Equally, a system may be safe but unhelpful, which means users will bypass it or return to shadow AI.

Use the harness for procurement as well as internal releases. If you are choosing between Microsoft 365 Copilot extensions, a custom RAG assistant, a Salesforce Einstein workflow, a ServiceNow AI Agent, an OpenAI-based internal assistant or an open-source model served internally, run the same golden dataset where licensing allows. That turns vendor comparison into evidence. It also helps with cost control because you can test whether a smaller or cheaper model is good enough for a narrow workflow, while reserving stronger models for harder tasks.

DSIT's portfolio of AI assurance techniques is useful language for this board-level conversation because it frames assurance as something used across the AI lifecycle to support trustworthy systems. Your harness is one of those assurance techniques. It gives product teams faster feedback, gives risk owners evidence, gives senior leaders a release gate, and gives users a better chance of receiving a copilot that works under real conditions. Done well, the golden dataset becomes an asset. It captures how your organisation thinks, decides, refuses, escalates and explains. That is much harder for a competitor to copy than a prompt.

For related implementation thinking, see how to scope a first agentic AI workflow. Source: GOV.UK portfolio of AI assurance techniques.

Frequently Asked Questions

What is a golden dataset for an internal copilot?

It is a controlled set of realistic test cases that represent the work the copilot must handle. Each case includes the user request, expected behaviour, allowed sources, scoring criteria and any risk tags.

How many examples should we start with?

A useful first harness can start with 50 to 100 examples if they are well chosen. Prioritise high-value workflows, common questions, sensitive cases, known failure modes and tasks that must never be handled incorrectly.

Can we use synthetic data?

Yes, synthetic data is useful for privacy and edge cases, but it should be grounded in real workflows. Blend anonymised production patterns, expert-written cases and synthetic variants.

Should an LLM judge the outputs?

LLM judges can help with scale, but they should be calibrated against human experts. For higher-risk copilots, human review should remain part of the release process.

What should the harness test besides answer quality?

Test retrieval quality, citations, refusals, privacy handling, data minimisation, prompt injection resistance, tool calls, latency, cost and whether the copilot asks for human approval when needed.

How often should we run evaluations?

Run evaluations before every meaningful release, including prompt changes, model upgrades, retrieval changes, tool permission changes and new document sources. Also run scheduled checks to spot drift.

Do UK GDPR and ICO guidance matter for internal copilots?

Yes. If the copilot processes personal data, UK data protection duties still apply. The harness should include tests for fairness, accuracy, transparency, minimisation, security and appropriate human involvement.

Which tools can help build an evaluation harness?

Teams often use simple scripts with JSONL, CSV and Postgres at first. Specialist tools such as LangSmith, Langfuse, Arize Phoenix, Braintrust, Humanloop, Promptfoo and vendor evaluation tools can help as the process matures.