MEASURE-2.1: Test sets, metrics, and details about the tools used during TEVV are documented.
MEASURE-2.1 requires you to document, in a repeatable and reviewable way, the exact test sets, evaluation metrics, and tools used during TEVV (testing, evaluation, verification, and validation) for each AI system. Operationalize it by standardizing a TEVV documentation package, assigning an owner, and collecting evidence on every significant model, data, metric, or tooling change 1.
Key takeaways:
- You need a “TEVV evidence package” per model/version: test set lineage, metric definitions, and toolchain details 1.
- The hard part is change control: keep documentation current when data, metrics, or evaluation tooling changes.
- Audit readiness depends on traceability from claims (model performance/safety) back to specific datasets, metrics, and tool versions.
MEASURE-2.1 (“test sets, metrics, and details about the tools used during TEVV are documented”) is a documentation control with real operational impact. If your team cannot point to the exact datasets, metric definitions, and evaluation tooling used to produce reported performance or risk results, you cannot defend decisions about release, monitoring thresholds, human review triggers, or customer disclosures. The requirement is simple on paper, but teams fail it in predictable ways: ad hoc test sets, metric names that mean different things to different teams, evaluation notebooks that are not reproducible, and tool versions that drift without anyone noticing.
For a Compliance Officer, CCO, or GRC lead, the fastest path is to treat TEVV documentation like any other controlled record: define the minimum required fields, require completion at key lifecycle gates (pre-release, major change, periodic review), and store the evidence in a system that supports versioning and approvals. Your goal is not to produce a novel. Your goal is to make evaluation outputs defensible and repeatable for internal challenge, customers, and auditors, aligned to the NIST AI RMF Core expectations 2.
Regulatory text
Requirement (excerpt): “Test sets, metrics, and details about the tools used during TEVV are documented.” 1
What the operator must do: Maintain controlled documentation that identifies (1) the test sets used in TEVV, (2) the metrics used and how they are computed/interpreted, and (3) the tools used to run TEVV, with enough detail to reproduce and explain results for the specific model/version and intended use 1.
Plain-English interpretation (what auditors expect you to be able to show)
You must be able to answer, quickly and consistently:
- Which test sets did you use, and where did they come from? Include how they were built, what time period they cover, what populations/edge cases they represent, and how you prevented leakage from training data.
- Which metrics did you report, and what do they mean? Define each metric, the calculation method, thresholds/acceptance criteria (if any), and why that metric is appropriate for the use case and risks.
- Which tools produced the results? Identify evaluation scripts/pipelines, libraries, versions, configuration parameters, and the execution environment. If a third party tool was used, record the product/version and how outputs are validated.
This is a documentation requirement, but it functions as a traceability control. It connects risk decisions to evidence 1.
Who it applies to (entity and operational context)
Applies to: Organizations developing, fine-tuning, integrating, or deploying AI systems where TEVV is performed or relied upon for decisions 1.
Operational contexts where this becomes non-negotiable:
- Model release gates (go/no-go decisions).
- High-impact or regulated use cases (credit, employment, healthcare, public sector, safety-related decisions), where you will need stronger documentation even if the framework is voluntary.
- Third party AI (APIs, embedded models, SaaS features): you still need documentation of the test sets/metrics/tools you used to validate the third party system in your environment, plus what you can obtain contractually from the provider.
What you actually need to do (step-by-step)
1) Define the minimum TEVV documentation schema (a required template)
Create a standard template with required fields. Treat it like a controlled form.
Minimum fields to include (practical baseline):
- System identification: model name, version/hash, intended use, prohibited uses, deployment context.
- Test set register: test set name/ID, owner, source, creation method, inclusion/exclusion criteria, size/coverage description (qualitative), time window, labeling method, known limitations, retention constraints.
- Data lineage controls: where stored, access controls, whether training data overlap checks were performed, and summary outcome.
- Metric register: metric name, definition, calculation method, sampling/unit of analysis, thresholds/targets, rationale, known failure modes.
- TEVV toolchain: evaluation pipeline name/ID, repository link, tool/library versions, configuration files, prompt templates (if applicable), hardware/runtime environment, random seed policy (if applicable).
- Run log summary: date, executor, input artifacts, output artifacts, exceptions, approvals.
- Interpretation: what results mean for release decision and risk posture.
Map this template explicitly to MEASURE-2.1 so it is easy to defend in reviews 1.
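The minimum fields above can be sketched as a structured record. This is a minimal illustration, not a prescribed format: the field names, dataclass shape, and example values (model name, IDs, versions) are all assumptions to adapt to your own template.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class TEVVPackage:
    """Illustrative minimum schema for a TEVV evidence package.
    Field names are assumptions, not prescribed by MEASURE-2.1."""
    model_name: str
    model_version: str              # version tag or artifact hash
    intended_use: str
    test_set_ids: list              # IDs from the test set inventory
    metric_ids: list                # IDs from the metric registry
    tool_versions: dict             # e.g. {"eval-harness": "1.4.2"}
    run_log: dict = field(default_factory=dict)
    approvals: list = field(default_factory=list)

# A hypothetical populated record, serializable into the evidence repository.
pkg = TEVVPackage(
    model_name="fraud-scorer",
    model_version="2.3.0",
    intended_use="transaction fraud triage",
    test_set_ids=["TS-014"],
    metric_ids=["M-FPR-01"],
    tool_versions={"eval-harness": "1.4.2", "python": "3.11"},
)
record = asdict(pkg)  # plain dict, ready to store as JSON/YAML with versioning
```

Keeping the schema in code (or an equivalent controlled form) makes completeness checkable mechanically at each gate, rather than by manual review alone.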
2) Establish ownership and RACI
Assign clear accountability:
- Control owner (GRC or Model Risk): ensures the template exists, is used, and evidence is retained.
- Technical owner (ML lead / evaluation lead): populates test set, metric, and tooling details.
- Reviewer (independent where possible): validates completeness, checks traceability, challenges metric appropriateness.
- Approver (product/risk): ties TEVV evidence to release decisions.
If you cannot get independent review, require at least a second set of eyes from a different function for higher-risk deployments.
3) Put TEVV documentation into the delivery workflow (gates)
Pick natural gates and enforce “no doc, no ship”:
- Pre-release / pre-launch
- Major model update (new base model, fine-tune, architecture change)
- Material data change (new labeling policy, new population)
- Metric change (new definition, new threshold)
- Tooling change (library upgrade, new evaluation harness)
Connect gates to your existing SDLC, MLOps pipeline, or change management process. The outcome you want: documentation gets created as part of work, not after a request from Compliance 1.
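“No doc, no ship” can be enforced with a small check in the release pipeline. A minimal sketch, assuming a dict-shaped evidence package; the required field names are hypothetical and should match whatever template you publish:

```python
# Required TEVV fields a release gate checks for. Names are illustrative;
# align them with your organization's published TEVV template.
REQUIRED_FIELDS = ["test_set_ids", "metric_ids", "tool_versions", "approvals"]

def missing_tevv_fields(package: dict) -> list:
    """Return the required TEVV fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not package.get(f)]

def gate(package: dict) -> bool:
    """True = release may proceed; False = block and report the gaps."""
    return not missing_tevv_fields(package)
```

Wired into CI or the change-ticket workflow, a failing gate produces the list of gaps, which is more actionable than a generic “documentation incomplete” rejection.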
4) Standardize test set management (avoid “mystery datasets”)
Operational controls that make MEASURE-2.1 easier:
- Maintain a test set inventory with unique IDs.
- Require dataset cards (or equivalent) that record purpose, provenance, and limitations.
- Keep immutable snapshots for each TEVV run, or document exactly how the dataset was assembled at run time.
For third party datasets, record licensing constraints and whether they affect retention/sharing of evidence.
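Immutable snapshot references and leakage checks can both lean on content hashing. A minimal sketch using the standard library; the record format (one string per row) is an assumption for illustration:

```python
import hashlib

def file_sha256(path: str) -> str:
    """Checksum usable as the immutable snapshot reference for a test set."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def overlap_check(train_rows, test_rows):
    """Hash each record and report any test rows also present in training
    data -- a simple exact-match leakage/overlap check."""
    digest = lambda row: hashlib.sha256(row.encode()).hexdigest()
    train_hashes = {digest(r) for r in train_rows}
    return [r for r in test_rows if digest(r) in train_hashes]
```

Exact-match hashing only catches verbatim duplicates; near-duplicate or semantic overlap needs stronger tooling, which is worth noting in the “known limitations” field of the test set register.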
5) Standardize metric definitions (avoid “metric drift by naming”)
Make metrics “auditable objects,” not dashboard labels:
- Store metric definitions in a controlled repository (e.g., a metrics registry).
- Record metric computation code and version it.
- Record why a metric was selected (risk linkage). For example: false positive rate may matter more than overall accuracy in certain workflows.
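One way to make a metric an auditable object is to keep the definition, rationale, threshold, and computation code in a single version-controlled registry entry. A sketch with hypothetical IDs and values:

```python
def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN), computed over individual predictions (0 = negative,
    1 = positive). Returns 0.0 when there are no negative cases."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0

# Illustrative registry entry: metric ID, definition, threshold, and the
# computation function live together and are versioned together.
METRIC_REGISTRY = {
    "M-FPR-01": {
        "name": "false_positive_rate",
        "definition": "Share of negative cases incorrectly flagged positive.",
        "unit_of_analysis": "individual prediction",
        "threshold": 0.05,   # acceptance criterion, where one applies
        "rationale": "False alarms drive manual-review cost in this workflow.",
        "compute": false_positive_rate,
    },
}
```

Because the evidence package references the metric by ID, any later change to the definition or threshold is a new registry version, which preserves the meaning of historical results.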
6) Document the toolchain so results are reproducible
Audits often fail on “we can’t reproduce that result.”
- Record evaluation harness name, repo commit hash, container image tag (or equivalent), library versions, and config parameters.
- If TEVV uses notebooks, require a versioned, runnable artifact (or export a report with the exact inputs/versions).
- If TEVV uses third party tools (scanners, bias toolkits, red-teaming platforms), record the product version and the configuration profile used.
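Much of the toolchain record can be populated automatically at run time. A sketch using only the standard library (`importlib.metadata` for package versions, `git rev-parse` for the harness commit); which libraries count as “material” is a judgment your team records, and the function degrades gracefully when git is unavailable:

```python
import importlib.metadata
import platform
import subprocess
import sys
from datetime import datetime, timezone

def capture_toolchain(libraries, config_path=None):
    """Snapshot the environment details needed to reproduce a TEVV run.
    `libraries` lists the evaluation dependencies you consider material."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "libraries": {},
        "config_path": config_path,  # path to the versioned config file
    }
    for lib in libraries:
        try:
            record["libraries"][lib] = importlib.metadata.version(lib)
        except importlib.metadata.PackageNotFoundError:
            record["libraries"][lib] = "NOT INSTALLED"
    try:
        record["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        record["git_commit"] = "UNKNOWN"  # not a git checkout, or git missing
    return record
```

Emitting this record alongside every evaluation output turns “can you reproduce that result?” from an archaeology exercise into a lookup.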
7) Centralize evidence retention and approvals
Store TEVV packages with:
- Version control
- Access control (need-to-know for sensitive datasets)
- Approval workflow
- Retention policy aligned to internal recordkeeping needs
This is a natural place to use Daydream: create a MEASURE-2.1 control record, assign ownership, attach the TEVV template, and schedule recurring evidence collection tied to release/change events 1.
Required evidence and artifacts to retain (audit-ready list)
Maintain a “TEVV evidence package” per model/version that includes:
- Test set documentation
- Test set inventory entry (ID, owner, purpose)
- Provenance notes and any transformation steps
- Snapshot reference (storage location, hash/checksum if used internally)
- Leakage/overlap check summary (method and outcome)
- Metric documentation
- Metric definitions and formulas
- Thresholds/acceptance criteria (where applicable)
- Rationale mapping to risk and intended use
- Known limitations and interpretation notes
- Tooling documentation
- Tool list (internal and third party)
- Version numbers, configs, environment details
- Links to code repositories and run instructions
- Run logs and output reports
- Approvals and decisions
- Reviewer sign-off
- Release decision memo referencing TEVV results
- Exception/waiver documentation if anything was incomplete, with compensating controls
Common exam/audit questions and hangups
Expect questions like:
- “Show me the exact test set used for this model’s launch decision.”
- “How do you know the test set didn’t leak into training?”
- “Where is the formal definition of this metric, and who approved it?”
- “Can you reproduce the TEVV run with the same tools and versions?”
- “What changed in TEVV between version A and version B, and why?”
Hangups that trigger follow-ups:
- Metrics reported without definitions.
- Test sets described only verbally (“a held-out set”).
- Tooling not versioned (“we used Python and a notebook”).
- TEVV results stored as screenshots with no lineage.
Frequent implementation mistakes (and how to avoid them)
| Mistake | Why it fails | Fix |
|---|---|---|
| Treating TEVV docs as a one-time launch artifact | Model and tooling change; docs go stale | Tie documentation updates to change management gates 1 |
| No unique IDs for test sets | You can’t prove which dataset produced which result | Create a test set inventory with immutable snapshots |
| Metric names without computation details | “Accuracy” can mean different things | Maintain a metric registry with formulas and code pointers |
| Tool versions not recorded | Results can’t be reproduced | Record library/tool versions and config for each TEVV run |
| Third party model testing undocumented | You still own outcomes in your environment | Document your validation test sets/metrics/tools and request provider artifacts contractually |
Enforcement context and risk implications
No public enforcement cases were provided in the source catalog for MEASURE-2.1. Treat this as governance and defensibility risk rather than a direct penalty item: weak TEVV documentation increases the chance of inconsistent releases, inability to investigate incidents, and weak responses to customer, regulator, or board questions about how safety/performance claims were validated 1.
Practical 30/60/90-day execution plan
First 30 days (foundation)
- Assign a control owner and technical owner for MEASURE-2.1 1.
- Publish the TEVV documentation template and minimum required fields.
- Stand up a centralized repository for TEVV evidence packages with access controls.
- Identify priority AI systems (production, customer-facing, high-impact workflows) and baseline what TEVV documentation exists today.
Day 31–60 (operationalize)
- Integrate TEVV documentation into release/change gates (ticketing, SDLC, or MLOps workflow).
- Create a test set inventory and a metric registry (simple spreadsheets are acceptable if controlled).
- Pilot the process on one high-priority model and run a mock audit: can you reproduce results from the documented toolchain?
Day 61–90 (scale and harden)
- Expand to all in-scope models and third party AI integrations.
- Add reviewer sign-off and exception handling (waivers with expiry and compensating controls).
- Build recurring evidence collection in Daydream: reminders, control tests, and dashboards for missing TEVV packages mapped to MEASURE-2.1 1.
- Run an internal control test: select a model at random and verify the organization can trace each reported metric back to dataset IDs and tool versions.
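That traceability check can itself be scripted. A minimal sketch, assuming an evidence-package structure where each reported result cites a metric ID and a test set ID (field names are illustrative):

```python
def trace_gaps(evidence_package: dict) -> list:
    """Return human-readable gaps found when tracing reported results back
    to registered test set IDs and recorded tool versions."""
    gaps = []
    if not evidence_package.get("tool_versions"):
        gaps.append("no tool versions recorded")
    registered = set(evidence_package.get("test_set_ids") or [])
    for result in evidence_package.get("results", []):
        if result.get("test_set_id") not in registered:
            gaps.append(
                f"metric {result.get('metric_id')} cites unregistered "
                f"test set {result.get('test_set_id')}"
            )
    return gaps  # empty list = the sampled model passes the control test
```

Running this over a randomly sampled model's package gives the control test a concrete pass/fail output plus the specific gaps to remediate.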
Frequently Asked Questions
Do we need to document TEVV for a third party model API we only call, not train?
Yes. Document the test sets, metrics, and tools you used to validate the third party model in your environment and for your use case 1. Also record what TEVV artifacts you requested or received from the provider.
What counts as “tools” in TEVV documentation?
Include evaluation pipelines, scripts, notebooks, libraries, and any third party testing platforms used to generate results 1. Record versions and configurations that affect outputs.
How detailed do metric definitions need to be?
Detailed enough that another qualified team member can compute the same metric from the same test set and get consistent results 1. Store the formula, unit of analysis, and any thresholds/filters.
We iterate quickly. How do we keep TEVV documentation from slowing releases?
Make the template lightweight and embed it in existing engineering workflows (pull requests, release checklists, MLOps runs). Automate population of tool versions and run logs where possible; reserve manual effort for rationale and limitations.
Can we summarize test sets without retaining the data due to privacy or licensing?
Yes, but document the constraint, keep an immutable reference to how the set was assembled, and retain whatever metadata and run outputs you are permitted to keep. Auditors will focus on traceability and repeatability within those constraints.
What’s the minimum evidence to satisfy MEASURE-2.1 for an internal pilot?
A TEVV package that identifies the test set source/selection, metric definitions, and the exact tools/versions used, plus a stored output report and reviewer sign-off 1. Scope can be smaller, but traceability still matters.
Footnotes
1. NIST AI RMF Core, MEASURE-2.1: “Test sets, metrics, and details about the tools used during TEVV are documented.”
2. NIST AI RMF Core.
Operationalize this requirement
Map requirement text to controls, owners, evidence, and review workflows inside Daydream.
See Daydream