MEASURE-2.3: AI system performance or assurance criteria are measured qualitatively or quantitatively and demonstrated for conditions similar to deployment setting(s). Measures are documented.

To meet MEASURE-2.3, you must define AI performance and assurance criteria, measure them with qualitative and/or quantitative methods, and demonstrate that those criteria are met under test conditions that closely match the real deployment environment. You also need complete documentation that shows what you measured, how you measured it, why it is relevant to deployment, and what you decided based on the results. 1

Key takeaways:

  • Define measurable acceptance criteria tied to the actual deployment setting (users, data, workflow, constraints). 1
  • Run evaluations that replicate deployment conditions, not idealized lab scenarios, and record results and decision outcomes. 1
  • Treat documentation as audit evidence: methods, datasets, thresholds, results, exceptions, approvals, and monitoring handoffs. 1

MEASURE-2.3 is an “operator’s requirement” disguised as a measurement requirement: you do not pass by producing a model card with generic metrics. You pass by demonstrating that your system meets defined performance or assurance criteria in conditions similar to where it will actually run, and that you can show your work later. 1

For a Compliance Officer, CCO, or GRC lead, the fastest path is to convert MEASURE-2.3 into a repeatable control: (1) set deployment-relevant criteria, (2) run a deployment-like evaluation, (3) document the measures and outcomes, and (4) make the evidence easy to retrieve. This requirement matters most when the AI output drives decisions with operational, customer, safety, financial, or legal consequences, because “we tested accuracy” is rarely persuasive if the test data, user behavior, and operational constraints differ from production. 1

NIST AI RMF is a framework, not a statute, but it is increasingly used as a benchmark for governance, third-party due diligence, and internal assurance. Your job is to make measurement defensible, repeatable, and tied to how the system is actually used. 2

Regulatory text

Excerpt: “AI system performance or assurance criteria are measured qualitatively or quantitatively and demonstrated for conditions similar to deployment setting(s). Measures are documented.” 1

What the operator must do:

  1. Set criteria (performance and/or assurance) that are meaningful for the system’s intended use and risk. 1
  2. Measure those criteria using qualitative methods (expert review, human-in-the-loop scoring, red teaming narratives) and/or quantitative methods (metrics, rates, thresholds, statistical tests). 1
  3. Demonstrate under deployment-like conditions, meaning your test setup mirrors production as closely as practical: data characteristics, user workflow, latency constraints, integration points, and environment assumptions. 1
  4. Document the measures so an independent reviewer can reproduce what you did and understand what you decided. 1

Plain-English interpretation

MEASURE-2.3 requires “proof, not vibes.” You need measurable criteria (what “good enough” means), evidence those criteria were tested in a production-like context (not a toy dataset or a vendor demo), and records that show results, exceptions, and approvals. 1

A practical way to read “performance or assurance criteria”:

  • Performance: task quality (accuracy, error rates), stability, latency, throughput, robustness, calibration, detection/abstention behavior.
  • Assurance: safety, reliability, privacy controls, security behaviors, explainability/traceability requirements, human oversight effectiveness, fairness considerations where relevant to the use case.

Who it applies to

Entities: Any organization developing, integrating, or deploying AI systems, including when the AI is provided by a third party and you configure it or embed it into a business process. 1

Operational contexts where it shows up in audits:

  • AI used for customer-impacting decisions (eligibility, pricing, claims handling, fraud review, support triage).
  • AI that summarizes, recommends, or generates content used by staff for regulated tasks.
  • AI embedded in security workflows (alert triage, phishing analysis) or safety workflows (industrial monitoring).
  • Third-party AI systems where you must validate fit-for-purpose in your environment, not the provider’s. 1

What you actually need to do (step-by-step)

Step 1: Define deployment setting(s) and “similar conditions”

Write a short Deployment Conditions Profile:

  • Intended users and user skill assumptions
  • Decision workflow and points of human review
  • Input data sources, formats, and known quality issues
  • Production constraints (latency, cost guardrails, uptime expectations)
  • Integrations (case management, CRM, EHR, payment systems, logging)
    Output: one page that becomes the anchor for “similar to deployment.” 1
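One way to keep the profile from drifting into a free-floating document is to capture it as structured data that downstream evaluation tooling can read. The sketch below is illustrative only; the field names and example values are assumptions, not anything MEASURE-2.3 prescribes:

```python
from dataclasses import dataclass

@dataclass
class DeploymentConditionsProfile:
    """One-page anchor for what 'similar to deployment' means.
    All fields are illustrative; adapt to your system and risk profile."""
    system: str                      # system name and version under evaluation
    intended_users: list[str]        # who uses the output, and skill assumptions
    human_review_points: list[str]   # where a person checks the AI before action
    input_sources: list[str]         # production data sources and formats
    known_data_issues: list[str]     # quality problems the test data must reflect
    latency_budget_ms: int           # production constraint the test must honor
    integrations: list[str]          # systems the AI output flows into

# Hypothetical example for a fraud-triage assistant
profile = DeploymentConditionsProfile(
    system="fraud-triage-assistant v3",
    intended_users=["fraud analysts (tier 1)", "team leads (escalation)"],
    human_review_points=["every auto-flag above $500", "all account closures"],
    input_sources=["transaction stream", "case-management notes"],
    known_data_issues=["free-text notes with inconsistent abbreviations"],
    latency_budget_ms=800,
    integrations=["case management", "logging/audit trail"],
)
```

Storing the profile this way lets the evaluation plan assert against it (for example, rejecting test runs that exceed the latency budget) instead of relying on reviewers to cross-check a PDF.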

Step 2: Translate risk into measurable criteria (acceptance criteria)

Create an AI Acceptance Criteria Register with columns:

  • Criterion name (e.g., “false positive tolerance in fraud queue,” “hallucination containment,” “PII leakage resistance”)
  • Measurement type (qualitative/quantitative)
  • Metric definition (exact computation or rubric)
  • Threshold/target and rationale (business + risk)
  • Scope (model version, features, segments, languages)
  • Owner and approver
    This is where GRC makes the requirement real: you are defining what “assured” means before you test. 1
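A register entry can be expressed as a record with exactly the columns above, which makes "was this criterion approved, and what is its threshold?" a lookup rather than an email search. The criterion, threshold, and names below are hypothetical examples:

```python
from dataclasses import dataclass

@dataclass
class AcceptanceCriterion:
    """One row of the AI Acceptance Criteria Register. Field names mirror
    the register columns; values here are illustrative assumptions."""
    name: str
    measurement_type: str    # "quantitative" or "qualitative"
    metric_definition: str   # exact computation, or a rubric reference
    threshold: float         # the target value
    direction: str           # "max" = must not exceed; "min" = must meet or beat
    rationale: str           # business + risk reasoning for the threshold
    scope: str               # model version, segments, languages covered
    owner: str
    approver: str

register = [
    AcceptanceCriterion(
        name="false positive rate in fraud queue",
        measurement_type="quantitative",
        metric_definition="FP / (FP + TN) on a deployment-representative sample",
        threshold=0.02,
        direction="max",
        rationale="review team capacity caps alert volume at ~2% of traffic",
        scope="model v3, card transactions, EN/ES",
        owner="fraud-ops lead",
        approver="model risk committee",
    ),
]
```

The `direction` field matters: auditors will ask whether a number is a floor or a ceiling, and a register that encodes this removes ambiguity from the release decision later.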

Step 3: Design an evaluation that mirrors deployment

Build a Deployment-Representative Evaluation Plan:

  • Data: production-like samples, edge cases, seasonality, and realistic label quality
  • Process: same prompts/templates, same pre-processing, same post-processing, same UI where possible
  • People: include the real user roles for qualitative scoring (ops, analysts, clinicians, support leads)
  • Environment: staging that matches production configurations (rate limits, retrieval sources, tools)
  • Threats/failures: known abuse cases and foreseeable misuse patterns
    If you cannot mirror a condition (e.g., no production data yet), document the gap and how you compensated (synthetic data strategy, pilot constraints, phased rollout gates). 1
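The "deployment-representative sampling" requirement above can be made concrete with stratified sampling: fix a quota per segment so the evaluation set reflects production mix rather than whatever data was convenient. This is a minimal sketch with invented segments and quotas; real plans would stratify on the cohorts named in the Deployment Conditions Profile:

```python
import random

def stratified_sample(records, key, quotas, seed=7):
    """Draw a deployment-representative sample: a fixed quota per segment.
    A fixed seed keeps the draw reproducible for independent review."""
    rng = random.Random(seed)
    by_segment = {}
    for r in records:
        by_segment.setdefault(key(r), []).append(r)
    sample = []
    for segment, quota in quotas.items():
        pool = by_segment.get(segment, [])
        if len(pool) < quota:
            # Per MEASURE-2.3, a condition you cannot mirror must be documented
            raise ValueError(
                f"segment {segment!r} has only {len(pool)} records; "
                "document the gap and the compensating strategy")
        sample.extend(rng.sample(pool, quota))
    return sample

# Hypothetical pool: 50 web cases, 20 phone cases
records = ([{"channel": "web", "id": i} for i in range(50)]
           + [{"channel": "phone", "id": i} for i in range(50, 70)])
sample = stratified_sample(records, key=lambda r: r["channel"],
                           quotas={"web": 10, "phone": 10})
```

Raising on an under-filled segment, rather than silently sampling fewer records, forces the documented-gap conversation the step above calls for.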

Step 4: Execute measurement (quant + qual) and record results

Run the tests and produce:

  • Quantitative results by segment (key cohorts, regions, product lines, languages, input types)
  • Qualitative review outcomes (rubric scores, reviewer notes, disagreement handling, adjudication rules)
  • Failure analysis (top error modes, root causes, mitigations, residual risk)
    Make sure results are tied to a specific model/system version and configuration. “Model v3 in prod-like pipeline” is evidence; “we tested the model” is not. 1
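Tying results to a segment and a version can be as simple as aggregating scored rows and stamping the output with the system configuration. The segments and rows below are invented for illustration:

```python
def segment_metrics(results, segment_key):
    """Aggregate pass/fail evaluation rows into per-segment counts and
    error rates, so variation across cohorts is visible in the results package."""
    totals, errors = {}, {}
    for row in results:
        seg = row[segment_key]
        totals[seg] = totals.get(seg, 0) + 1
        errors[seg] = errors.get(seg, 0) + (0 if row["correct"] else 1)
    return {seg: {"n": totals[seg], "error_rate": errors[seg] / totals[seg]}
            for seg in totals}

# Hypothetical scored rows; the version stamp is what turns this into evidence
run_record = {
    "system_version": "model v3 + prod-like pipeline, config 2024-05",
    "metrics": segment_metrics(
        [{"segment": "EN", "correct": True},
         {"segment": "EN", "correct": True},
         {"segment": "EN", "correct": False},
         {"segment": "ES", "correct": True},
         {"segment": "ES", "correct": False}],
        segment_key="segment"),
}
```

A record like this answers two common audit questions at once: which exact system produced the numbers, and whether performance varies by segment.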

Step 5: Make an explicit release decision and define monitoring handoff

Create a Go/No-Go and Conditions Memo:

  • Criteria met/not met
  • Approved exceptions (who approved, compensating controls, expiration)
  • Deployment constraints (human review required, limited scope, fallback rules)
  • Monitoring triggers tied back to the same criteria (what drift looks like)
    This step prevents a common audit failure: strong testing with no documented decision trail. 1
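The go/no-go check itself is mechanical once the register exists: compare each measured value against its threshold in the stated direction and record anything unmet. The criteria and numbers below are hypothetical:

```python
def release_decision(register, measured):
    """Compare measured values against each register criterion and return
    the decision plus the list of unmet criteria (the exception queue)."""
    unmet = []
    for c in register:
        value = measured[c["name"]]
        ok = (value <= c["threshold"] if c["direction"] == "max"
              else value >= c["threshold"])
        if not ok:
            unmet.append({"criterion": c["name"],
                          "measured": value,
                          "threshold": c["threshold"]})
    return {"decision": "go" if not unmet else "no-go pending exception approval",
            "unmet": unmet}

# Hypothetical register rows and measured results
register = [
    {"name": "false positive rate", "threshold": 0.02, "direction": "max"},
    {"name": "abstention on out-of-scope inputs", "threshold": 0.95,
     "direction": "min"},
]
decision = release_decision(register, {
    "false positive rate": 0.018,                 # meets the ceiling
    "abstention on out-of-scope inputs": 0.91,    # misses the floor
})
```

Note that an unmet criterion does not force "no," it forces the documented exception path: approver, compensating controls, and expiration, exactly the trail auditors ask for.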

Step 6: Operationalize as a recurring control

Assign:

  • Control owner (usually product risk, model risk, or engineering quality)
  • Evidence cadence 3
  • Central repository for evidence and versioning
    If you use Daydream, map MEASURE-2.3 to a named control, owner, procedure, and recurring evidence collection so audits stop being scavenger hunts. 1

Required evidence and artifacts to retain

Retain artifacts in a way that supports reproduction and independent review:

  • Deployment Conditions Profile (what “similar conditions” means)
  • AI Acceptance Criteria Register (definitions, thresholds, owners, approvals)
  • Evaluation Plan (datasets, sampling, rubrics, environments, tools)
  • Dataset documentation: sources, selection logic, labeling approach, known limitations
  • Test execution logs: prompts/configs, code references, run IDs, dates, system versions
  • Results package: metric tables, segment breakdowns, qualitative summaries, error analysis
  • Go/No-Go and Conditions Memo, including exceptions and compensating controls
  • Change records: what changed between versions, why re-testing scope was sufficient
  • Monitoring mapping: which production monitors correspond to which pre-deployment measures 1
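The artifact list above lends itself to a simple completeness gate: a per-release evidence index, checked for gaps before release approval. Artifact keys and the `grc://` link scheme here are illustrative assumptions:

```python
# Artifact types drawn from the retention list above
REQUIRED_ARTIFACTS = [
    "deployment_conditions_profile",
    "acceptance_criteria_register",
    "evaluation_plan",
    "dataset_documentation",
    "test_execution_logs",
    "results_package",
    "go_no_go_memo",
    "change_records",
    "monitoring_mapping",
]

def missing_artifacts(evidence_index):
    """Return artifact types with no stored link: a minimal
    'no release without a complete evidence index' gate."""
    return [a for a in REQUIRED_ARTIFACTS if not evidence_index.get(a)]

# Hypothetical index: every artifact filed except the monitoring mapping
index = {a: f"grc://releases/v3/{a}" for a in REQUIRED_ARTIFACTS}
index["monitoring_mapping"] = None
gaps = missing_artifacts(index)
```

Running a check like this in the change-management pipeline is what turns "retain artifacts" from a policy statement into an enforced control.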

Common exam/audit questions and hangups

Auditors and internal reviewers typically press on:

  • “Show me how your test conditions match production. Where is that documented?” 1
  • “What are your acceptance thresholds, and who approved them?”
  • “How did you select evaluation data? Is it representative and current?”
  • “Do results vary by segment, channel, language, or geography?”
  • “What happened when criteria were not met? Show the decision record.”
  • “How do you ensure a third party’s benchmarks apply to your deployment?” 1

Hangup: teams often have metrics but cannot show governance. MEASURE-2.3 expects both measurement and documentation. 1

Frequent implementation mistakes (and how to avoid them)

  1. Testing on convenience datasets only.
    Fix: require a deployment-representative sampling section in the Evaluation Plan. 1

  2. No explicit thresholds.
    Fix: force “target/threshold + rationale + approver” for each criterion in the register. 1

  3. Model-only testing instead of system testing.
    Fix: test the full pipeline: retrieval, pre/post-processing, UI constraints, and human workflow. 1

  4. Qualitative reviews with no rubric.
    Fix: document scoring definitions, reviewer training, and adjudication rules. 1

  5. Evidence scattered across tools.
    Fix: a single evidence index per release that links to runs, data, approvals, and exceptions. Daydream can serve as the system of record for this mapping and collection. 1

Enforcement context and risk implications

No enforcement sources are provided for this requirement in the supplied catalog, so you should treat MEASURE-2.3 primarily as an assurance benchmark used by customers, regulators, and auditors to evaluate whether your AI governance is defensible. The practical risk is that if an incident occurs (customer harm, discrimination allegation, safety event, major outage), inability to produce deployment-relevant testing records will weaken your position and slow remediation. 1

Practical 30/60/90-day execution plan

First 30 days (stand up the control)

  • Appoint a MEASURE-2.3 control owner and approver path.
  • Create templates: Deployment Conditions Profile, Acceptance Criteria Register, Evaluation Plan, Go/No-Go memo.
  • Pilot on one AI system with active usage and medium risk; build the evidence index for the latest release. 1

Days 31–60 (make it repeatable)

  • Expand to remaining in-scope AI systems; prioritize those with customer impact or regulated workflows.
  • Add segment reporting requirements and qualitative rubric standardization.
  • Implement a “no release without evidence index” gate in change management. 1

Days 61–90 (harden and scale)

  • Tie pre-deployment criteria to production monitoring and incident response triggers.
  • Add third-party intake requirements: providers must supply evaluation artifacts, and you run a deployment-like validation before production use.
  • Centralize evidence and automate recurring collection in Daydream or your GRC system so audits pull from one source of truth. 1

Frequently Asked Questions

Do we need both qualitative and quantitative measures?

MEASURE-2.3 allows qualitative or quantitative measurement, but most real deployments need both because some failure modes are easiest to catch with structured human review. Document why your chosen methods are sufficient for the risks and use case. 1

What counts as “conditions similar to deployment” if we can’t use production data?

Define the deployment conditions you cannot replicate, document why, and show compensating steps such as a restricted pilot, synthetic or masked data design, and tighter release conditions. The key is a written, risk-based argument with traceable evidence. 1

We buy an AI product from a third party. Can we rely on their benchmarks?

You can incorporate third-party testing, but MEASURE-2.3 still expects demonstration in conditions similar to your deployment. Do an integration-level evaluation in your environment and retain both the provider artifacts and your validation results. 1

How do we set thresholds without “making up numbers”?

Start from business impact tolerances (queue capacity, review effort, customer harm scenarios) and translate them into measurable criteria with an approver. Keep the rationale in writing, and revisit after pilots or incidents. 1
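As a worked example of translating capacity into a threshold (all numbers invented): if five analysts can each review 50 alerts a day against 10,000 daily transactions, and you reserve 20% headroom, the tolerable alert rate falls out of the arithmetic rather than being picked from thin air.

```python
def max_alert_rate(daily_volume, analyst_count, reviews_per_analyst_per_day,
                   headroom=0.8):
    """Translate review-team capacity into a maximum tolerable alert rate.
    headroom discounts nominal capacity for training, incidents, and absence."""
    capacity = analyst_count * reviews_per_analyst_per_day * headroom
    return capacity / daily_volume

# 5 analysts * 50 reviews/day * 0.8 headroom = 200 reviews/day
# 200 / 10,000 transactions = 2% maximum alert rate
rate = max_alert_rate(daily_volume=10_000, analyst_count=5,
                      reviews_per_analyst_per_day=50, headroom=0.8)
```

The rationale column of the register then records this derivation, which is exactly the written, approvable reasoning the answer above calls for.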

What’s the minimum documentation that satisfies “measures are documented”?

Keep the metric definitions, test plan, dataset description, system/version details, results, and the release decision record with approvals and exceptions. If a reviewer cannot recreate what you measured and why it mattered, documentation is not sufficient. 1

How do we prevent teams from treating this as a one-time launch task?

Put the requirement into change management: any material change triggers re-evaluation, and periodic reviews confirm performance remains within criteria. Use a control mapping and recurring evidence collection workflow so the process survives staff turnover. 1

Footnotes

  1. NIST AI RMF Core

  2. NIST AI RMF program page

  3. Per release, per material change, and at a periodic interval appropriate to risk.

Operationalize this requirement

Map requirement text to controls, owners, evidence, and review workflows inside Daydream.

See Daydream