MEASURE-2.1: Test sets, metrics, and details about the tools used during TEVV are documented.
MEASURE-2.1 requires you to document, in a repeatable and reviewable way, the exact test sets, evaluation metrics, and tools used during TEVV (testing, evaluation, verification, and validation) for each AI system. Operationalize it by standardizing a TEVV documentation package, assigning an owner, and collecting evidence on every significant model, data, metric, or tooling change 1.
Key takeaways:
- You need a “TEVV evidence package” per model/version: test set lineage, metric definitions, and toolchain details 1.
- The hard part is change control: keep documentation current when data, metrics, or evaluation tooling changes.
- Audit readiness depends on traceability from claims (model performance/safety) back to specific datasets, metrics, and tool versions.
MEASURE-2.1 (“test sets, metrics, and details about the tools used during TEVV are documented”) is a documentation control with real operational impact. If your team cannot point to the exact datasets, metric definitions, and evaluation tooling used to produce reported performance or risk results, you cannot defend decisions about release, monitoring thresholds, human review triggers, or customer disclosures. The requirement is simple on paper, but teams fail it in predictable ways: ad hoc test sets, metric names that mean different things to different teams, evaluation notebooks that are not reproducible, and tool versions that drift without anyone noticing.
For a Compliance Officer, CCO, or GRC lead, the fastest path is to treat TEVV documentation like any other controlled record: define the minimum required fields, require completion at key lifecycle gates (pre-release, major change, periodic review), and store the evidence in a system that supports versioning and approvals. Your goal is not to produce a novel. Your goal is to make evaluation outputs defensible and repeatable for internal challenge, customers, and auditors, aligned to the NIST AI RMF Core expectations 2.
Regulatory text
Requirement (excerpt): “Test sets, metrics, and details about the tools used during TEVV are documented.” 1
What the operator must do: Maintain controlled documentation that identifies (1) the test sets used in TEVV, (2) the metrics used and how they are computed/interpreted, and (3) the tools used to run TEVV, with enough detail to reproduce and explain results for the specific model/version and intended use 1.
Plain-English interpretation (what auditors expect you to be able to show)
You must be able to answer, quickly and consistently:
- Which test sets did you use, and where did they come from? Include how they were built, what time period they cover, what populations/edge cases they represent, and how you prevented leakage from training data.
- Which metrics did you report, and what do they mean? Define each metric, the calculation method, thresholds/acceptance criteria (if any), and why that metric is appropriate for the use case and risks.
- Which tools produced the results? Identify evaluation scripts/pipelines, libraries, versions, configuration parameters, and the execution environment. If a third party tool was used, record the product/version and how outputs are validated.
This is a documentation requirement, but it functions as a traceability control. It connects risk decisions to evidence 1.
Who it applies to (entity and operational context)
Applies to: Organizations developing, fine-tuning, integrating, or deploying AI systems where TEVV is performed or relied upon for decisions 1.
Operational contexts where this becomes non-negotiable:
- Model release gates (go/no-go decisions).
- High-impact or regulated use cases (credit, employment, healthcare, public sector, safety-related decisions), where you will need stronger documentation even if the framework is voluntary.
- Third party AI (APIs, embedded models, SaaS features): you still need documentation of the test sets/metrics/tools you used to validate the third party system in your environment, plus what you can obtain contractually from the provider.
What you actually need to do (step-by-step)
1) Define the minimum TEVV documentation schema (a required template)
Create a standard template with required fields. Treat it like a controlled form.
Minimum fields to include (practical baseline):
- System identification: model name, version/hash, intended use, prohibited uses, deployment context.
- Test set register: test set name/ID, owner, source, creation method, inclusion/exclusion criteria, size/coverage description (qualitative), time window, labeling method, known limitations, retention constraints.
- Data lineage controls: where stored, access controls, whether training data overlap checks were performed, and summary outcome.
- Metric register: metric name, definition, calculation method, sampling/unit of analysis, thresholds/targets, rationale, known failure modes.
- TEVV toolchain: evaluation pipeline name/ID, repository link, tool/library versions, configuration files, prompt templates (if applicable), hardware/runtime environment, random seed policy (if applicable).
- Run log summary: date, executor, input artifacts, output artifacts, exceptions, approvals.
- Interpretation: what results mean for release decision and risk posture.
Map this template explicitly to MEASURE-2.1 so it is easy to defend in reviews 1.
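The minimum fields above can be sketched as a structured record. This is a minimal illustration, not a prescribed format: the field names, dataclass shape, and example values (model name, IDs, versions) are all assumptions to adapt to your own template.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class TEVVPackage:
    """Illustrative minimum schema for a TEVV evidence package.
    Field names are assumptions, not prescribed by MEASURE-2.1."""
    model_name: str
    model_version: str              # version tag or artifact hash
    intended_use: str
    test_set_ids: list              # IDs from the test set inventory
    metric_ids: list                # IDs from the metric registry
    tool_versions: dict             # e.g. {"eval-harness": "1.4.2"}
    run_log: dict = field(default_factory=dict)
    approvals: list = field(default_factory=list)

# A hypothetical populated record, serializable into the evidence repository.
pkg = TEVVPackage(
    model_name="fraud-scorer",
    model_version="2.3.0",
    intended_use="transaction fraud triage",
    test_set_ids=["TS-014"],
    metric_ids=["M-FPR-01"],
    tool_versions={"eval-harness": "1.4.2", "python": "3.11"},
)
record = asdict(pkg)  # plain dict, ready to store as JSON/YAML with versioning
```

Keeping the schema in code (or an equivalent controlled form) makes completeness checkable mechanically at each gate, rather than by manual review alone.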
2) Establish ownership and RACI
Assign clear accountability:
- Control owner (GRC or Model Risk): ensures the template exists, is used, and evidence is retained.
- Technical owner (ML lead / evaluation lead): populates test set, metric, and tooling details.
- Reviewer (independent where possible): validates completeness, checks traceability, challenges metric appropriateness.
- Approver (product/risk): ties TEVV evidence to release decisions.
If you cannot get independent review, require at least a second set of eyes from a different function for higher-risk deployments.
3) Put TEVV documentation into the delivery workflow (gates)
Pick natural gates and enforce “no doc, no ship”:
- Pre-release / pre-launch
- Major model update (new base model, fine-tune, architecture change)
- Material data change (new labeling policy, new population)
- Metric change (new definition, new threshold)
- Tooling change (library upgrade, new evaluation harness)
Connect gates to your existing SDLC, MLOps pipeline, or change management process. The outcome you want: documentation gets created as part of work, not after a request from Compliance 1.
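“No doc, no ship” can be enforced with a small check in the release pipeline. A minimal sketch, assuming a dict-shaped evidence package; the required field names are hypothetical and should match whatever template you publish:

```python
# Required TEVV fields a release gate checks for. Names are illustrative;
# align them with your organization's published TEVV template.
REQUIRED_FIELDS = ["test_set_ids", "metric_ids", "tool_versions", "approvals"]

def missing_tevv_fields(package: dict) -> list:
    """Return the required TEVV fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not package.get(f)]

def gate(package: dict) -> bool:
    """True = release may proceed; False = block and report the gaps."""
    return not missing_tevv_fields(package)
```

Wired into CI or the change-ticket workflow, a failing gate produces the list of gaps, which is more actionable than a generic “documentation incomplete” rejection.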
4) Standardize test set management (avoid “mystery datasets”)
Operational controls that make MEASURE-2.1 easier:
- Maintain a test set inventory with unique IDs.
- Require dataset cards (or equivalent) that record purpose, provenance, and limitations.
- Keep immutable snapshots for each TEVV run, or document exactly how the dataset was assembled at run time.
For third party datasets, record licensing constraints and whether they affect retention/sharing of evidence.
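Immutable snapshot references and leakage checks can both lean on content hashing. A minimal sketch using the standard library; the record format (one string per row) is an assumption for illustration:

```python
import hashlib

def file_sha256(path: str) -> str:
    """Checksum usable as the immutable snapshot reference for a test set."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def overlap_check(train_rows, test_rows):
    """Hash each record and report any test rows also present in training
    data -- a simple exact-match leakage/overlap check."""
    digest = lambda row: hashlib.sha256(row.encode()).hexdigest()
    train_hashes = {digest(r) for r in train_rows}
    return [r for r in test_rows if digest(r) in train_hashes]
```

Exact-match hashing only catches verbatim duplicates; near-duplicate or semantic overlap needs stronger tooling, which is worth noting in the “known limitations” field of the test set register.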
5) Standardize metric definitions (avoid “metric drift by naming”)
Make metrics “auditable objects,” not dashboard labels:
- Store metric definitions in a controlled repository (e.g., a metrics registry).
- Record metric computation code and version it.
- Record why a metric was selected (risk linkage). For example: false positive rate may matter more than overall accuracy in certain workflows.
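One way to make a metric an auditable object is to keep the definition, rationale, threshold, and computation code in a single version-controlled registry entry. A sketch with hypothetical IDs and values:

```python
def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN), computed over individual predictions (0 = negative,
    1 = positive). Returns 0.0 when there are no negative cases."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0

# Illustrative registry entry: metric ID, definition, threshold, and the
# computation function live together and are versioned together.
METRIC_REGISTRY = {
    "M-FPR-01": {
        "name": "false_positive_rate",
        "definition": "Share of negative cases incorrectly flagged positive.",
        "unit_of_analysis": "individual prediction",
        "threshold": 0.05,   # acceptance criterion, where one applies
        "rationale": "False alarms drive manual-review cost in this workflow.",
        "compute": false_positive_rate,
    },
}
```

Because the evidence package references the metric by ID, any later change to the definition or threshold is a new registry version, which preserves the meaning of historical results.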
6) Document the toolchain so results are reproducible
Audits often fail on “we can’t reproduce that result.”
- Record evaluation harness name, repo commit hash, container image tag (or equivalent), library versions, and config parameters.
- If TEVV uses notebooks, require a versioned, runnable artifact (or export a report with the exact inputs/versions).
- If TEVV uses third party tools (scanners, bias toolkits, red-teaming platforms), record the product version and the configuration profile used.
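Much of the toolchain record can be populated automatically at run time. A sketch using only the standard library (`importlib.metadata` for package versions, `git rev-parse` for the harness commit); which libraries count as “material” is a judgment your team records, and the function degrades gracefully when git is unavailable:

```python
import importlib.metadata
import platform
import subprocess
import sys
from datetime import datetime, timezone

def capture_toolchain(libraries, config_path=None):
    """Snapshot the environment details needed to reproduce a TEVV run.
    `libraries` lists the evaluation dependencies you consider material."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "libraries": {},
        "config_path": config_path,  # path to the versioned config file
    }
    for lib in libraries:
        try:
            record["libraries"][lib] = importlib.metadata.version(lib)
        except importlib.metadata.PackageNotFoundError:
            record["libraries"][lib] = "NOT INSTALLED"
    try:
        record["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        record["git_commit"] = "UNKNOWN"  # not a git checkout, or git missing
    return record
```

Emitting this record alongside every evaluation output turns “can you reproduce that result?” from an archaeology exercise into a lookup.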
7) Centralize evidence retention and approvals
Store TEVV packages with:
- Version control
- Access control (need-to-know for sensitive datasets)
- Approval workflow
- Retention policy aligned to internal recordkeeping needs
This is a natural place to use Daydream: create a MEASURE-2.1 control record, assign ownership, attach the TEVV template, and schedule recurring evidence collection tied to release/change events 1.
Required evidence and artifacts to retain (audit-ready list)
Maintain a “TEVV evidence package” per model/version that includes:
- Test set documentation
- Test set inventory entry (ID, owner, purpose)
- Provenance notes and any transformation steps
- Snapshot reference (storage location, hash/checksum if used internally)
- Leakage/overlap check summary (method and outcome)
- Metric documentation
- Metric definitions and formulas
- Thresholds/acceptance criteria (where applicable)
- Rationale mapping to risk and intended use
- Known limitations and interpretation notes
- Tooling documentation
- Tool list (internal and third party)
- Version numbers, configs, environment details
- Links to code repositories and run instructions
- Run logs and output reports
- Approvals and decisions
- Reviewer sign-off
- Release decision memo referencing TEVV results
- Exception/waiver documentation if anything was incomplete, with compensating controls
Common exam/audit questions and hangups
Expect questions like:
- “Show me the exact test set used for this model’s launch decision.”
- “How do you know the test set didn’t leak into training?”
- “Where is the formal definition of this metric, and who approved it?”
- “Can you reproduce the TEVV run with the same tools and versions?”
- “What changed in TEVV between version A and version B, and why?”
Hangups that trigger follow-ups:
- Metrics reported without definitions.
- Test sets described only verbally (“a held-out set”).
- Tooling not versioned (“we used Python and a notebook”).
- TEVV results stored as screenshots with no lineage.
Frequent implementation mistakes (and how to avoid them)
| Mistake | Why it fails | Fix |
|---|---|---|
| Treating TEVV docs as a one-time launch artifact | Model and tooling change; docs go stale | Tie documentation updates to change management gates 1 |
| No unique IDs for test sets | You can’t prove which dataset produced which result | Create a test set inventory with immutable snapshots |
| Metric names without computation details | “Accuracy” can mean different things | Maintain a metric registry with formulas and code pointers |
| Tool versions not recorded | Results can’t be reproduced | Record library/tool versions and config for each TEVV run |
| Third party model testing undocumented | You still own outcomes in your environment | Document your validation test sets/metrics/tools and request provider artifacts contractually |
Enforcement context and risk implications
No public enforcement cases were provided in the source catalog for MEASURE-2.1. Treat this as governance and defensibility risk rather than a direct penalty item: weak TEVV documentation increases the chance of inconsistent releases, inability to investigate incidents, and weak responses to customer, regulator, or board questions about how safety/performance claims were validated 1.
Practical 30/60/90-day execution plan
First 30 days (foundation)
- Assign a control owner and technical owner for MEASURE-2.1 1.
- Publish the TEVV documentation template and minimum required fields.
- Stand up a centralized repository for TEVV evidence packages with access controls.
- Identify priority AI systems (production, customer-facing, high-impact workflows) and baseline what TEVV documentation exists today.
Day 31–60 (operationalize)
- Integrate TEVV documentation into release/change gates (ticketing, SDLC, or MLOps workflow).
- Create a test set inventory and a metric registry (simple spreadsheets are acceptable if controlled).
- Pilot the process on one high-priority model and run a mock audit: can you reproduce results from the documented toolchain?
Day 61–90 (scale and harden)
- Expand to all in-scope models and third party AI integrations.
- Add reviewer sign-off and exception handling (waivers with expiry and compensating controls).
- Build recurring evidence collection in Daydream: reminders, control tests, and dashboards for missing TEVV packages mapped to MEASURE-2.1 1.
- Run an internal control test: select a model at random and verify the organization can trace each reported metric back to dataset IDs and tool versions.
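That traceability check can itself be scripted. A minimal sketch, assuming an evidence-package structure where each reported result cites a metric ID and a test set ID (field names are illustrative):

```python
def trace_gaps(evidence_package: dict) -> list:
    """Return human-readable gaps found when tracing reported results back
    to registered test set IDs and recorded tool versions."""
    gaps = []
    if not evidence_package.get("tool_versions"):
        gaps.append("no tool versions recorded")
    registered = set(evidence_package.get("test_set_ids") or [])
    for result in evidence_package.get("results", []):
        if result.get("test_set_id") not in registered:
            gaps.append(
                f"metric {result.get('metric_id')} cites unregistered "
                f"test set {result.get('test_set_id')}"
            )
    return gaps  # empty list = the sampled model passes the control test
```

Running this over a randomly sampled model's package gives the control test a concrete pass/fail output plus the specific gaps to remediate.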
Frequently Asked Questions
Do we need to document TEVV for a third party model API we only call, not train?
Yes. Document the test sets, metrics, and tools you used to validate the third party model in your environment and for your use case 1. Also record what TEVV artifacts you requested or received from the provider.
What counts as “tools” in TEVV documentation?
Include evaluation pipelines, scripts, notebooks, libraries, and any third party testing platforms used to generate results 1. Record versions and configurations that affect outputs.
How detailed do metric definitions need to be?
Detailed enough that another qualified team member can compute the same metric from the same test set and get consistent results 1. Store the formula, unit of analysis, and any thresholds/filters.
We iterate quickly. How do we keep TEVV documentation from slowing releases?
Make the template lightweight and embed it in existing engineering workflows (pull requests, release checklists, MLOps runs). Automate population of tool versions and run logs where possible; reserve manual effort for rationale and limitations.
Can we summarize test sets without retaining the data due to privacy or licensing?
Yes, but document the constraint, keep an immutable reference to how the set was assembled, and retain whatever metadata and run outputs you are permitted to keep. Auditors will focus on traceability and repeatability within those constraints.
What’s the minimum evidence to satisfy MEASURE-2.1 for an internal pilot?
A TEVV package that identifies the test set source/selection, metric definitions, and the exact tools/versions used, plus a stored output report and reviewer sign-off 1. Scope can be smaller, but traceability still matters.
Footnotes
1. NIST AI RMF Core, MEASURE-2.1: “Test sets, metrics, and details about the tools used during TEVV are documented.”
2. NIST AI RMF Core.
Operationalize this requirement
Map requirement text to controls, owners, evidence, and review workflows inside Daydream.
See Daydream