MEASURE-1.1: Approaches and metrics for measurement of AI risks enumerated during the map function are selected for implementation starting with the most significant AI risks. The risks or trustworthiness characteristics that will not – or cannot – be measured are properly documented.

MEASURE-1.1 requires you to pick concrete measurement approaches and metrics for the AI risks you already identified during the MAP function, implement them starting with the highest-significance risks, and explicitly document any risks or trustworthiness characteristics you will not (or cannot) measure. Your goal is a defensible, prioritized measurement plan with owners, cadence, thresholds, and retained evidence. 1

Key takeaways:

  • Start with the MAP risk register, then rank risks to decide what gets measured first. 1
  • Define metrics that are operational, repeatable, and tied to decisions (go/no-go, mitigation triggers, monitoring). 1
  • Maintain a “cannot measure / will not measure” log with rationale, compensating controls, and review cadence. 1

MEASURE-1.1 sits at the point where AI risk management stops being a narrative and becomes an operating discipline. You are expected to translate the AI risks identified in MAP into a prioritized set of measurement methods and metrics, then stand those measurements up in the real world, beginning with the most significant risks. 1

For a Compliance Officer, CCO, or GRC lead, the operational challenge is rarely picking “a metric.” The hard part is (1) deciding which risks deserve measurement now versus later, (2) defining metrics that engineering and product teams can actually produce on a recurring basis, and (3) documenting gaps without looking like you ignored the risk. MEASURE-1.1 is also explicit that some trustworthiness characteristics may not be measurable in your context; that is allowed, but only if you document it properly. 1

This page gives requirement-level implementation guidance you can hand to control owners: how to select metrics, how to prioritize, what evidence to retain, what auditors will probe, and how to build a measurement backlog that improves over time without blocking delivery. 1

Regulatory text

Text (excerpt): “Approaches and metrics for measurement of AI risks enumerated during the map function are selected for implementation starting with the most significant AI risks. The risks or trustworthiness characteristics that will not – or cannot – be measured are properly documented.” 1

What the operator must do:

  1. Ingest the MAP outputs (your AI risk inventory/risk register) and treat them as the measurement scope. 1
  2. Select measurement approaches and metrics for those risks, not just high-level principles. 1
  3. Prioritize implementation so the most significant risks get measurable coverage first. 1
  4. Document measurement exclusions (will not measure / cannot measure) with rationale and governance, so omissions are controlled rather than accidental. 1

Plain-English interpretation (what MEASURE-1.1 really demands)

  • You must be able to answer: “Which AI risks are we measuring, how are we measuring them, how often, who owns the metric, and what do we do when results fall outside tolerance?” 1
  • “Most significant” means your highest-impact and/or highest-likelihood AI risks in context (safety, bias, privacy, security, reliability, transparency, etc.), not the easiest items to measure. 1
  • Documentation of “will not/cannot measure” is not a loophole. It is a controlled exception process: what you cannot measure, why, what you do instead, and when you will revisit the decision. 1

Who it applies to (entity and operational context)

Applies to: any organization developing or deploying AI systems that uses NIST AI RMF as its risk management framework. 2

Operational contexts you should assume are in scope:

  • First-party AI (you train/fine-tune models, build features, or manage the ML pipeline).
  • Third-party AI (you buy or integrate models, APIs, scoring engines, copilots, or AI-enabled products from third parties). MEASURE-1.1 still applies; you may need vendor-provided metrics plus your own monitoring.
  • Material decisioning use cases (employment, credit, access, safety, healthcare, public-sector decisions). The requirement does not name sectors, but your “most significant” ranking should reflect where harm is plausible.

Control owners typically involved:

  • Product owner for the AI system (accountable for risk acceptance).
  • ML/DS lead (metric design, evaluation methodology).
  • Security/privacy (attack/abuse monitoring, data controls).
  • Compliance/GRC (governance, evidence, exceptions).
  • Vendor management/TPRM (third-party metric requirements and SLAs for AI components).

What you actually need to do (step-by-step)

Step 1: Establish measurement scope from MAP outputs

Create a single list of AI risks enumerated during MAP and confirm each risk is tied to:

  • a system name/version,
  • a use case,
  • impacted populations,
  • lifecycle phase (design, training, deployment, monitoring). 1

Operator tip: if MAP outputs are scattered across tickets and slide decks, consolidate them into a risk register that can be referenced by audit and by engineers.
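The consolidation in Step 1 can be sketched as plain data with a completeness check. This is an illustrative schema, not one mandated by the AI RMF; the field names and the example system name are assumptions to adapt to your register.

```python
from dataclasses import dataclass

@dataclass
class MapRisk:
    """One consolidated entry from the MAP risk register (hypothetical schema)."""
    risk_id: str
    description: str
    system: str                 # system name/version, e.g. "claims-triage v2.3"
    use_case: str
    impacted_populations: list  # groups who bear the harm
    lifecycle_phase: str        # design | training | deployment | monitoring

def missing_fields(risk: MapRisk) -> list:
    """Return the required-linkage fields that are empty, so incomplete
    entries can be sent back before ranking begins."""
    required = ["system", "use_case", "impacted_populations", "lifecycle_phase"]
    return [f for f in required if not getattr(risk, f)]
```

A register entry that fails `missing_fields` is not ready for Step 2; the check makes "tied to a system, use case, population, and phase" enforceable rather than aspirational.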

Step 2: Rank risks to identify “most significant”

Define your risk ranking method and apply it consistently. Your scoring method can be simple, but it must be explainable. Typical inputs:

  • severity of harm (legal, financial, safety, customer impact),
  • likelihood/expected frequency,
  • detectability and time-to-detect,
  • reversibility (can you remediate outcomes),
  • exposure (users/transactions affected). 1

Deliverable: Top-tier risk list (the risks that get metrics first), plus the rationale for ranking.
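A simple, explainable scoring method over the inputs above might look like the sketch below. The weights, the 1-5 scales, and the example risk IDs are assumptions; what matters is that the formula is written down and applied consistently.

```python
def significance_score(severity, likelihood, detectability, reversibility, exposure):
    """Composite 1-5 significance score. Each input is on a 1-5 scale where
    higher is worse (harder to detect, harder to reverse, more exposure).
    Weights are illustrative and should be approved by your governance body."""
    weights = {"severity": 0.35, "likelihood": 0.25, "detectability": 0.15,
               "reversibility": 0.15, "exposure": 0.10}
    inputs = {"severity": severity, "likelihood": likelihood,
              "detectability": detectability, "reversibility": reversibility,
              "exposure": exposure}
    return round(sum(weights[k] * inputs[k] for k in weights), 2)

# Hypothetical register: rank highest score first; the top tier gets metrics first.
risks = {"R-1 bias": significance_score(5, 3, 4, 4, 4),
         "R-2 drift": significance_score(3, 4, 2, 2, 3)}
top_tier = sorted(risks, key=risks.get, reverse=True)
```

Retaining the weights and inputs alongside the scores is what makes the ranking rationale auditable later.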

Step 3: Select measurement approaches for each top-tier risk

For each prioritized risk, specify the approach you will use. Common measurement approaches include:

  • Pre-deployment evaluation (offline testing, red teaming, model cards/test reports).
  • In-production monitoring (drift, performance, anomaly/abuse signals).
  • Process conformance metrics (coverage of human review, appeal handling, incident response).
  • Third-party attestations and contractual reporting for vendor models/APIs (what the vendor must provide, and what you validate independently). 1

Deliverable: a measurement approach matrix mapping risk → approach → metric(s).
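The approach matrix can live as plain data, which also makes coverage gaps mechanically detectable. Risk IDs, approach names, and metric names below are hypothetical examples.

```python
# risk -> measurement approach -> metric names (all illustrative)
approach_matrix = {
    "R-1 bias": {
        "pre_deployment_evaluation": ["outcome_parity_delta"],
        "in_production_monitoring": ["cohort_approval_rate_gap"],
    },
    "R-2 drift": {
        "in_production_monitoring": ["feature_drift_psi", "weekly_accuracy"],
    },
    "R-3 vendor_api": {
        "third_party_reporting": ["vendor_eval_report"],
        "in_production_monitoring": ["sampled_output_review_rate"],
    },
}

def uncovered(risk_ids, matrix):
    """Top-tier risks with no selected approach: each one must either get a
    metric or go into the Measurement Exceptions Log (Step 6)."""
    return [r for r in risk_ids if not matrix.get(r)]
```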

Step 4: Define metrics that are decision-grade

For each risk, define metrics with these fields:

  • Metric name and definition (exact formula, numerator/denominator where relevant).
  • Data source (logs, labels, customer complaints, vendor reports).
  • Population and segmentation (by region, channel, user cohort, protected class where legally/ethically appropriate and permitted).
  • Cadence (how often computed and reviewed).
  • Thresholds/tolerances (what triggers mitigation, rollback, model update, human review expansion).
  • Owner (who computes, who reviews, who can approve exceptions).
  • Action playbook (what you do when out of tolerance). 1

Examples (illustrative, adapt to your MAP risks):

  • Model performance risk: “Critical-intent classification false negative rate” monitored weekly on a holdout sample plus sampled production outcomes; trigger = escalation to incident process.
  • Bias/fairness risk: “Outcome parity delta across defined cohorts” evaluated pre-release and on scheduled intervals; trigger = mitigation plan and re-approval gate.
  • Security/abuse risk: “Prompt injection success rate in test suite” plus “policy-violating output rate” from monitoring; trigger = hardening changes and usage restrictions.
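The metric-spec fields above translate directly into a record plus a tolerance check, as in this sketch. The field values mirror the first example (false negative rate); the threshold, owner, and playbook wording are assumptions.

```python
from dataclasses import dataclass

@dataclass
class MetricSpec:
    """Decision-grade metric definition (illustrative fields from Step 4)."""
    name: str
    definition: str      # exact formula, in words or symbols
    data_source: str
    cadence: str         # how often computed and reviewed
    threshold: float     # breach when observed value exceeds this
    owner: str
    playbook: str        # action taken on breach

def evaluate(spec: MetricSpec, value: float) -> str:
    """Map an observed value to the action the spec prescribes."""
    if value > spec.threshold:
        return f"BREACH: {spec.playbook} (owner: {spec.owner})"
    return "within tolerance"

fn_rate = MetricSpec(
    name="critical_intent_false_negative_rate",
    definition="missed critical intents / labeled critical intents",
    data_source="holdout sample + sampled production outcomes",
    cadence="weekly",
    threshold=0.05,          # assumed tolerance; set yours via governance
    owner="ml-lead",
    playbook="escalate to incident process",
)
```

Because the playbook lives on the spec, a breach produces a named action and owner, not just a red number on a dashboard.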

Step 5: Implement in order: most significant risks first

Convert your top-tier metrics into work items with clear acceptance criteria:

  • metric pipeline built,
  • dashboard/report produced,
  • review meeting scheduled,
  • response actions documented,
  • evidence retention configured. 1

Governance move that works: tie “go-live approval” to “metric exists + owner assigned + first review completed” for high-significance risks.
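That go-live gate is easy to enforce mechanically. A minimal sketch, assuming the acceptance criteria are tracked as flags on each high-significance risk (key names are hypothetical):

```python
def golive_ready(risk: dict):
    """Return (ready, missing_gates) for a high-significance risk.
    Approval requires: metric exists + owner assigned + first review done."""
    gates = ("metric_exists", "owner_assigned", "first_review_completed")
    missing = [g for g in gates if not risk.get(g)]
    return (len(missing) == 0, missing)
```

Wiring this check into the release workflow turns the governance move from a meeting convention into a blocking control.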

Step 6: Document what you cannot (or will not) measure

Build a Measurement Exceptions Log for:

  • Cannot measure yet (no labels, insufficient ground truth, vendor black box, legal restrictions on collecting attributes).
  • Will not measure (disproportionate burden, measurement introduces new risk, immaterial risk after analysis).

For each exception record:

  • risk/trustworthiness characteristic,
  • why measurement is not feasible or not selected,
  • compensating measures (qualitative review, controls, monitoring proxies),
  • approval (who signed off),
  • revisit trigger (system change, new data availability, vendor contract renewal). 1
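An exceptions-log entry that omits any of the record fields above should be rejected, so omissions stay controlled rather than accidental. A sketch, with hypothetical field names and an example vendor-opacity exception:

```python
from datetime import date

def valid_exception(entry: dict) -> bool:
    """An exception is only acceptable if every required field is populated,
    including approval and a revisit trigger (no permanent exceptions)."""
    required = ("characteristic", "rationale", "compensating_measures",
                "approved_by", "revisit_trigger")
    return all(entry.get(k) for k in required)

entry = {
    "characteristic": "explainability of vendor scoring API",
    "rationale": "vendor black box; no access to model internals",
    "compensating_measures": ["output sampling", "complaint trend review"],
    "approved_by": "model governance board",
    "revisit_trigger": "vendor contract renewal",
    "logged": date.today().isoformat(),
}
```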

Step 7: Operationalize recurring review and continuous improvement

MEASURE-1.1 is not a one-time build. Put in place:

  • periodic review of metric results and thresholds,
  • backlog grooming for expanding metric coverage,
  • change management hooks (model version updates, data pipeline changes, vendor version changes),
  • incident feedback loop (post-incident metric revisions). 1

Required evidence and artifacts to retain

Auditors and internal reviewers will expect artifacts that prove both design and operation of measurement.

Minimum evidence set:

  • AI risk register from MAP with significance ranking and methodology. 1
  • Measurement plan mapping risks → approaches → metrics → owners → cadence. 1
  • Metric definitions (data dictionary, formulas, segmentation rules, known limitations).
  • Dashboards or reports showing metric outputs over time (screenshots or exports with dates).
  • Review minutes (risk committee, model governance board, product risk review) showing metrics were reviewed and actions assigned.
  • Action playbooks for threshold breaches (incident tickets, rollback procedures, mitigation plans).
  • Measurement Exceptions Log with approvals and compensating controls. 1
  • Third-party evidence where relevant (contract clauses requiring reporting, vendor test reports, SLAs, and your validation notes).

If you use Daydream to track controls, map MEASURE-1.1 to a control owner and attach recurring evidence tasks so metric reviews and exception renewals do not rely on memory.

Common exam/audit questions and hangups

Expect questions that probe prioritization, traceability, and “paper vs reality” gaps:

  • “Show me the list of AI risks from MAP and which ones have implemented metrics today.” 1
  • “How did you determine ‘most significant’ and who approved that ranking?” 1
  • “Pick one high-risk metric and show the definition, data source, and the last two review instances.”
  • “What happens when a metric breaches tolerance? Show an example ticket and resolution.”
  • “Which risks are not measured and why? Who accepted that residual risk?” 1
  • “For third-party models, what do you measure yourself versus what the third party reports?”

Hangups that derail audits:

  • metrics exist, but no one reviews them;
  • reviews occur, but no documented decisions;
  • exceptions exist, but no compensating controls or revisit triggers.

Frequent implementation mistakes and how to avoid them

  1. Measuring what is easy, not what is significant
    Fix: lock the top-tier risk list first, then force each top-tier risk to have at least one decision-grade metric or an approved exception. 1

  2. Vague metrics (“monitor fairness”) with no definition
    Fix: require metric specs: formula, population, cadence, thresholds, and owner. If you cannot define it, treat it as “cannot measure yet” and document the gap. 1

  3. No link from metrics to action
    Fix: pair every metric with a response playbook (who gets paged, what changes are permitted, approval path).

  4. Third-party AI treated as out of scope
    Fix: include vendor AI in the risk register; contract for reporting; implement your own monitoring where feasible. Track third-party deliverables as evidence artifacts.

  5. Exceptions become permanent
    Fix: add a required review trigger (system changes, vendor renewals, new dataset availability) and enforce periodic re-approval in governance routines. 1

Enforcement context and risk implications

NIST AI RMF is a framework, not a standalone regulator. Still, MEASURE-1.1 reduces the practical risk that your AI governance program becomes “principles without controls.” If an incident occurs (harm, discrimination allegations, safety issue, security abuse), the first question from executives, customers, and regulators is usually: what did you measure, what did you know, and what did you do about it? MEASURE-1.1 is the backbone for answering that with records, not recollections. 1

A practical 30/60/90-day execution plan

Use phases to avoid fake precision while still driving delivery.

Immediate (stabilize scope and prioritization)

  • Consolidate MAP risks into one risk register tied to systems and versions. 1
  • Define and approve the “most significant” ranking method; record approvers.
  • Identify metric owners across Product, ML, Security, Privacy, and TPRM.
  • Stand up the Measurement Exceptions Log template and approval workflow. 1

Near-term (build and launch measurement for top risks)

  • For each top-tier risk, finalize metric specs and data sources.
  • Implement dashboards/reports and schedule recurring review forums.
  • Pilot threshold breach procedures with a tabletop exercise using a realistic scenario.
  • For third-party AI, add contract requirements for reporting and audit support; document gaps where the third party cannot provide required signals.

Ongoing (expand coverage and harden evidence)

  • Add metrics for the next tier of risks and retire “cannot measure” items where feasible.
  • Tune thresholds and playbooks based on incidents, drift, and user feedback.
  • Audit your evidence trail: can you reproduce last period’s metric values and decisions from retained artifacts?
  • Use a GRC system (including Daydream if it fits your stack) to automate evidence collection tasks and exception renewals.

Frequently Asked Questions

Do we need metrics for every single AI risk in the MAP register?

MEASURE-1.1 expects you to select approaches and metrics starting with the most significant risks, not to measure everything immediately. For anything you do not or cannot measure, keep a properly approved exception with rationale and compensating measures. 1

What qualifies as “cannot be measured” versus “we haven’t built it yet”?

Treat “cannot be measured” as a claim you can defend (missing ground truth, legal constraints, third-party opacity). If it is only a resourcing gap, document it as “not measured yet,” add it to the measurement backlog, and set a revisit trigger. 1

How do we handle MEASURE-1.1 for third-party AI models and APIs?

Put third-party AI risks in the same register and require reporting in contracts where possible. Pair vendor-provided reports with your own monitoring signals (usage logs, complaint trends, output sampling) and document any black-box limitations as exceptions. 1

What evidence is most persuasive to auditors or internal reviewers?

A traceable chain: MAP risk → significance ranking → metric definition → recurring reports → meeting minutes → action tickets. Exceptions are acceptable if they show approval, rationale, and compensating controls. 1

Who should own the metrics, Compliance or Engineering?

Engineering (or ML Ops) should own metric computation and instrumentation, because they control the pipelines. Compliance should own the governance wrapper: required fields, review cadence, evidence retention, and exception approvals. 1

Our teams disagree on the “most significant” risks. What breaks the tie?

Use a documented scoring method and define an escalation path to a model governance or risk committee for final approval. Record dissent and the final decision so prioritization is explainable later. 1

Footnotes

  1. NIST AI RMF Core

  2. NIST AI RMF Core; Source: NIST AI RMF program page


See Daydream