MEASURE-2.13: Effectiveness of the employed TEVV metrics and processes in the measure function are evaluated and documented.

MEASURE-2.13 requires you to periodically evaluate whether your testing, evaluation, verification, and validation (TEVV) metrics and measurement processes actually detect and track AI risks, then document the findings and any resulting changes. Operationalize it by assigning an owner, defining effectiveness criteria, running scheduled reviews, recording findings, and closing improvement actions with retained evidence. 1

Key takeaways:

  • Define “effective” TEVV using explicit criteria (coverage, sensitivity, stability, decision usefulness) tied to your AI risk objectives. 1
  • Run a repeatable evaluation cycle (review → gaps → corrective actions → re-test) and document each step for audit readiness. 1
  • Treat TEVV as a controlled process: version metrics, manage changes, and keep traceability from incidents and model changes back to TEVV updates. 1

MEASURE-2.13 sits in the “Measure” function of the NIST AI Risk Management Framework and targets a specific operational failure mode: teams collect TEVV results, but nobody checks whether those metrics and processes are still doing the job as the model, data, users, and threat landscape change. The requirement is narrow and practical: evaluate the effectiveness of the TEVV metrics and processes you employ, and document that evaluation. 1

For a Compliance Officer, CCO, or GRC lead, this is less about inventing new tests and more about governance discipline. You need a defined cadence, clear effectiveness criteria, and records that show you identified gaps and corrected them. This is also one of the easiest places for an examiner (or internal audit) to press: “You say you measure model risk. How do you know your measurements work?” If you cannot answer with artifacts, the program can look cosmetic.

This page gives requirement-level implementation guidance you can hand to model risk, data science, QA, and product teams. It focuses on what to do, what evidence to keep, and where teams typically break the chain between TEVV outputs and decision-making. 2

Regulatory text

Text (excerpt): “Effectiveness of the employed TEVV metrics and processes in the measure function are evaluated and documented.” 1

What the operator must do:
You must (1) evaluate whether your current TEVV metrics and measurement processes are effective for the AI system’s risks and operating context, and (2) document that evaluation. “Effective” is not defined for you, so you must define effectiveness criteria that make sense for your system and risk posture, then show that you check those criteria on a recurring basis and take action when they are not met. 1

Plain-English interpretation

You are required to prove, with documentation, that:

  • The metrics you rely on (fairness metrics, robustness checks, privacy tests, drift monitors, red-team results, human-in-the-loop QA rates, etc.) still reflect real risk.
  • The way you produce those metrics (data pipelines, sampling, labeling QA, test environments, acceptance thresholds, review workflows) is dependable and repeatable.
  • You review and improve TEVV when it stops catching issues or stops supporting risk decisions. 1

A workable mental model: TEVV is a control, and MEASURE-2.13 is the control’s effectiveness testing plus control evidence. 1

Who it applies to

In-scope entities: Organizations developing or deploying AI systems. 1

Operational contexts where this becomes “must-have”:

  • High-impact decisioning (credit, employment, insurance, healthcare triage, benefits eligibility).
  • User-facing generative AI where harmful outputs, data leakage, or prompt injection are credible risks.
  • Safety- or mission-critical environments where model degradation or distribution shift can create real-world harm.
  • Any environment with frequent model/data updates, multiple model versions, or rapid experimentation. 2

Teams you’ll need involved (typical):

  • Model Risk Management / AI governance (program ownership)
  • Data science / ML engineering (metric definition and measurement pipelines)
  • QA / validation (test execution independence, where applicable)
  • Product (risk acceptance decisions)
  • Security / privacy (adversarial testing, data protection validation)
  • Internal audit or compliance testing (challenge function, where established) 1

What you actually need to do (step-by-step)

1) Inventory your “employed TEVV”

Create a single register of the TEVV metrics and processes you currently rely on for each AI system (or model family). Minimum fields:

  • System/model name and version
  • Intended use and key risk objectives
  • TEVV metrics list (what is measured)
  • TEVV processes (how/when measured, tools, environments)
  • Owners (who runs it, who reviews it)
  • Decision points that depend on TEVV (launch, retrain, rollback, access controls) 1

Practical tip: If you cannot list the decision that each metric supports, you will struggle to defend “effectiveness.”
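The register above can be sketched as a simple schema. This is a minimal sketch; the class and field names are illustrative assumptions, not prescribed by the framework, so adapt them to your own inventory tooling:

```python
from dataclasses import dataclass

# Hypothetical sketch of one TEVV register entry; field names are
# illustrative assumptions about your inventory schema.
@dataclass
class TEVVRegisterEntry:
    system_name: str
    model_version: str
    intended_use: str
    risk_objectives: list   # key risk objectives
    metrics: list           # what is measured
    processes: list         # how/when measured, tools, environments
    owners: dict            # role -> named owner
    decision_points: list   # launch, retrain, rollback, access controls

def incomplete_fields(entry: TEVVRegisterEntry) -> list:
    # Flag empty fields: an entry missing owners or decision points
    # will be hard to defend in an effectiveness review.
    return [name for name, value in vars(entry).items() if not value]

entry = TEVVRegisterEntry(
    system_name="credit-scoring",
    model_version="2.4.1",
    intended_use="consumer credit decisioning",
    risk_objectives=["fairness", "robustness"],
    metrics=["demographic parity gap", "drift score"],
    processes=["weekly drift job", "pre-release fairness suite"],
    owners={"runs": "ml-eng", "reviews": "model-risk"},
    decision_points=["release gate", "retrain trigger"],
)
```

An entry whose `decision_points` list comes back empty fails the practical tip above: nothing defends a metric like the decision it supports.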

2) Define effectiveness criteria (make it testable)

Write criteria that let you judge whether TEVV is working. Examples you can adopt and tailor:

  • Coverage: TEVV spans the identified risk scenarios (including misuse and foreseeable failure modes).
  • Sensitivity: TEVV detects meaningful regressions and risk threshold breaches in time to act.
  • Reliability: Results are reproducible; measurement noise and sampling bias are understood.
  • Operational fit: Outputs arrive on time for governance gates and are understandable to approvers.
  • Change resilience: TEVV stays valid across model versioning, data drift, and feature changes. 1

Document these criteria in a short “TEVV Effectiveness Standard” owned by the Measure function.
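One way to keep these criteria testable is to encode each one alongside the evidence that proves it. The criterion keys mirror the list above; the evidence descriptions are assumptions, not a prescribed format:

```python
# Each effectiveness criterion maps to the evidence artifact that proves it.
# Evidence names are illustrative assumptions about your artifact set.
EFFECTIVENESS_CRITERIA = {
    "coverage": "risk-scenario matrix cross-referenced to executed tests",
    "sensitivity": "seeded-regression detection results and detection latency",
    "reliability": "repeat-run variance report and sampling-bias notes",
    "operational_fit": "governance-gate timestamps vs. TEVV delivery times",
    "change_resilience": "re-validation records across model versions",
}

def unevidenced(criteria: dict) -> list:
    # A criterion with no named evidence cannot be checked, so it
    # cannot support an "effective" conclusion.
    return [name for name, evidence in criteria.items() if not evidence.strip()]
```

If `unevidenced` returns anything, the TEVV Effectiveness Standard has a gap before any review even runs.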

3) Establish a recurring evaluation cycle (with triggers)

Create a procedure that evaluates TEVV effectiveness on:

  • A schedule (your program chooses the cadence; document it as policy).
  • Triggers, such as: material model changes, data pipeline changes, new user populations, new incident patterns, or new threat intelligence relevant to the system. 1

Your procedure should require both:

  • A self-assessment by the team running TEVV, and
  • A challenge review by a second-line function (GRC/model risk) or another independent reviewer if your governance model includes it. 1
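The schedule-plus-triggers procedure reduces to a small decision rule. The 180-day cadence and trigger names below are illustrative assumptions; your policy sets both:

```python
from datetime import date, timedelta

# Cadence is a policy choice you document; 180 days is purely illustrative.
REVIEW_INTERVAL = timedelta(days=180)
TRIGGER_EVENTS = {
    "material_model_change", "data_pipeline_change",
    "new_user_population", "new_incident_pattern", "new_threat_intel",
}

def review_due(last_review: date, today: date, recent_events: set) -> bool:
    # A review is due on schedule OR when any defined trigger has fired.
    overdue = (today - last_review) >= REVIEW_INTERVAL
    triggered = bool(recent_events & TRIGGER_EVENTS)
    return overdue or triggered
```

Running this check inside change management means a major release cannot ship without answering the review-due question.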

4) Run the effectiveness review (what to examine)

Use a consistent checklist. At minimum, review:

  • Metric validity: Do the chosen metrics still map to the risk objectives and harm types?
  • Data representativeness: Are test datasets still representative of production?
  • Threshold fitness: Are pass/fail thresholds aligned to current risk appetite and use case?
  • Process integrity: Are test environments controlled, access managed, and results protected from tampering?
  • Defect discovery rate: Are incidents or user complaints being caught by TEVV or discovered “in the wild” first?
  • Remediation loop: Do TEVV findings result in tracked fixes, and does re-testing confirm fixes? 1
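Of the checklist items above, the defect discovery rate is the most quantifiable: what share of incidents did TEVV catch before they surfaced in the wild? A sketch, where the "detected_by" field is an assumption about your incident log:

```python
def tevv_first_detection_rate(incidents: list) -> float:
    # Share of incidents first detected by TEVV rather than in production.
    # The "detected_by" field name is an assumed incident-log convention.
    if not incidents:
        return 1.0  # no incidents in the period: nothing escaped TEVV
    caught = sum(1 for i in incidents if i.get("detected_by") == "tevv")
    return caught / len(incidents)

incidents = [
    {"id": "INC-101", "detected_by": "tevv"},
    {"id": "INC-102", "detected_by": "user_complaint"},
    {"id": "INC-103", "detected_by": "tevv"},
    {"id": "INC-104", "detected_by": "production_alert"},
]
```

A rate that falls across successive reviews is exactly the kind of finding the next step asks you to document.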

5) Document findings and decisions (non-negotiable)

Produce a dated record for each review that includes:

  • What was reviewed (metrics/processes in scope)
  • Evidence examined
  • Effectiveness conclusion (effective / partially effective / ineffective)
  • Gaps, root cause, and risk statement
  • Corrective actions, owners, due dates, and acceptance criteria
  • Any risk acceptances or exceptions and approver identity
  • Next review date or trigger conditions 1
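The record above can be validated mechanically before it is filed. Field and conclusion names here mirror the list but are otherwise illustrative assumptions:

```python
# Minimum fields for a defensible review record (names are illustrative).
REQUIRED_FIELDS = {
    "review_date", "scope", "evidence_examined", "conclusion",
    "gaps", "corrective_actions", "exceptions", "next_review",
}
VALID_CONCLUSIONS = {"effective", "partially effective", "ineffective"}

def record_problems(record: dict) -> list:
    # An empty list means the record meets the minimum documentation bar.
    problems = sorted(f"missing: {f}" for f in REQUIRED_FIELDS - record.keys())
    if record.get("conclusion") not in VALID_CONCLUSIONS:
        problems.append("conclusion must be one of: "
                        + ", ".join(sorted(VALID_CONCLUSIONS)))
    return problems

record = {
    "review_date": "2025-03-31",
    "scope": ["drift monitors", "fairness suite"],
    "evidence_examined": ["Q1 drift reports", "release-gate minutes"],
    "conclusion": "partially effective",
    "gaps": ["drift monitor missed a feature-schema change"],
    "corrective_actions": [{"id": "CA-12", "owner": "ml-eng", "due": "2025-05-15"}],
    "exceptions": [],
    "next_review": "2025-09-30",
}
```

Note that an empty `exceptions` list is still a populated field: recording "none" is part of the evidence.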

6) Close the loop: improve TEVV and prove it worked

For each corrective action, require:

  • Updated metric definitions or process steps (version-controlled)
  • Re-test results showing improved detection or reduced failure recurrence
  • Updated documentation (TEVV register, SOPs, control mappings)
  • A sign-off that closes the action and confirms operational adoption 1
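A closure gate enforces all four requirements before an action can be marked done. This is a sketch with assumed ticket-field names, not a prescribed schema:

```python
def can_close(action: dict) -> bool:
    # All four closure requirements must hold; field names are assumptions
    # about your ticketing schema.
    return all([
        bool(action.get("updated_definitions_version")),  # version-controlled update
        action.get("retest_passed") is True,              # re-test confirms the fix
        action.get("docs_updated") is True,               # register/SOPs updated
        bool(action.get("signed_off_by")),                # named closure sign-off
    ])

action = {
    "id": "CA-12",
    "updated_definitions_version": "tevv-metrics-v1.3",
    "retest_passed": True,
    "docs_updated": True,
    "signed_off_by": "model-risk-lead",
}
```

Actions that cannot pass the gate stay open and escalate through governance rather than silently closing.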

Where Daydream fits: If you already manage third-party risk and control evidence in Daydream, treat TEVV effectiveness reviews like any other recurring control: assign an owner, set evidence requests, track remediation tickets, and maintain an audit-ready timeline of artifacts across model versions.

Required evidence and artifacts to retain

Keep artifacts in a system that supports versioning and retrieval by model/version and date. Typical evidence set:

  • TEVV inventory/register (current and historical versions)
  • TEVV Effectiveness Standard (criteria and methodology)
  • Review schedule and trigger policy
  • Completed effectiveness review reports and checklists
  • Meeting minutes or sign-offs from governance forums
  • Corrective action tracker (tickets, owners, closure evidence)
  • Change logs showing metric/process updates after reviews
  • Samples of raw outputs (test results), plus summaries presented to approvers
  • Evidence of linkage to incidents (post-incident reviews referencing TEVV gaps) 1

Common exam/audit questions and hangups

Expect questions like:

  • “Define TEVV for this system. Which metrics are required to pass a release gate?”
  • “How do you know these metrics are appropriate for the stated risks?”
  • “Show the last effectiveness review and what changed as a result.”
  • “Where do you document exceptions, and who approved them?”
  • “How do you ensure TEVV remains valid after model/data changes?”
  • “Show traceability from incident X to a TEVV update.” 1

Hangup auditors press on: Teams can show model test results but cannot show the meta-layer, the evaluation demonstrating that the tests themselves are still effective.

Frequent implementation mistakes and how to avoid them

  1. Mistake: Treating TEVV outputs as proof of TEVV effectiveness.
    Fix: Separate artifacts: keep TEVV results, and also keep the effectiveness review that assesses whether those results are meaningful and sufficient. 1

  2. Mistake: No explicit effectiveness criteria.
    Fix: Write criteria and map each to evidence. If you cannot define “effective,” you cannot prove it. 1

  3. Mistake: Reviews happen only after an incident.
    Fix: Add a scheduled review plus trigger-based reviews tied to change management. 1

  4. Mistake: Findings don’t drive change.
    Fix: Require corrective actions with closure evidence and re-test results; treat overdue actions as governance escalations. 1

  5. Mistake: No ownership.
    Fix: Assign a control owner for MEASURE-2.13 and named owners for each TEVV process. Put it in your RACI. 1

Enforcement context and risk implications

NIST AI RMF is a framework, not a regulator, so enforcement typically appears indirectly: customers, procurement, internal audit, and sector regulators may expect you to align to it and demonstrate disciplined AI risk management. The core risk of failing MEASURE-2.13 is defensibility. If harm occurs, you may be unable to show that your measurement program was designed to detect issues and that you continuously verified it stayed effective. 2

Practical 30/60/90-day execution plan

First 30 days (stand up the control)

  • Assign a MEASURE-2.13 control owner and define a RACI across model risk, engineering, product, and QA. 1
  • Build the TEVV register for your highest-risk systems first; capture metrics, processes, and decision gates. 1
  • Draft the TEVV Effectiveness Standard: effectiveness criteria, review triggers, required artifacts. 1

Days 31–60 (run the first review cycle)

  • Execute a pilot effectiveness review on one in-scope AI system, using a checklist and documented evidence pack. 1
  • Open corrective actions for gaps; require owners and closure criteria.
  • Add TEVV effectiveness review as a step in model change management (so major releases trigger a review). 1

Days 61–90 (operationalize and scale)

  • Expand reviews to additional systems, prioritizing those with the highest impact and change rate. 1
  • Standardize templates: review report, exception memo, remediation tracker fields, sign-off workflow.
  • Set up recurring evidence collection and reminders in your GRC system (Daydream or equivalent) so you can reproduce the evidence pack on demand. 1

Frequently Asked Questions

What counts as “TEVV metrics and processes” for MEASURE-2.13?

Anything you rely on to test, evaluate, verify, or validate AI behavior and risk controls, plus the workflows and pipelines that produce those results. If a metric influences a launch decision or risk acceptance, it is in scope. 1

How do we define “effectiveness” without a prescribed standard?

Define effectiveness criteria tied to your risk objectives and decision points, then test those criteria in a repeatable review. Document the criteria and the evidence used to reach your conclusion. 1

Does this require independent validation?

The requirement says “evaluated and documented,” not “independently validated.” Many programs add a challenge review (second line or peer review) to improve defensibility and consistency. 1

What’s the minimum documentation an auditor will accept?

A dated review record that states what TEVV was assessed, what evidence was reviewed, the effectiveness conclusion, and what changed as a result. If no changes were needed, document why. 1

How do we handle rapidly changing models where metrics change often?

Put metric definitions and processes under change control with versioning, and tie model releases to trigger-based effectiveness reviews. Keep the history so you can explain what was measured at any point in time. 1

Can we satisfy MEASURE-2.13 with monitoring alone?

Monitoring is part of TEVV, but MEASURE-2.13 asks you to evaluate whether your monitoring metrics and processes are effective. You still need the meta-review, conclusions, and remediation tracking. 1

Footnotes

  1. NIST AI RMF Core

  2. NIST AI RMF Core; Source: NIST AI RMF program page

Operationalize this requirement

Map requirement text to controls, owners, evidence, and review workflows inside Daydream.

See Daydream