MEASURE-2.6: The AI system is evaluated regularly for safety risks – as identified in the MAP function. The AI system to be deployed is demonstrated to be safe, its residual negative risk does not exceed the risk tolerance, and it can fail safely, particularly if made to operate beyond its knowledge limits.
To meet MEASURE-2.6, you must run recurring safety evaluations tied to your MAP-identified risks, prove residual risk stays within approved tolerance, and implement fail-safe behaviors for out-of-scope conditions, with monitoring and incident response metrics to show the control works in production. 1
Key takeaways:
- Convert MAP outputs into a repeatable safety evaluation plan with defined metrics, thresholds, and owners. 1
- Require a “residual risk vs. tolerance” decision record before deployment and after material changes. 1
- Engineer and test fail-safe modes (especially beyond knowledge limits) and monitor reliability, robustness, and failure response times. 1
MEASURE-2.6 sits where governance becomes operational reality: you cannot rely on a one-time pre-launch review or a generic model card and call it “safe.” The requirement expects regular safety risk evaluation aligned to risks you already identified during the MAP function, a deployment decision that explicitly compares residual negative risk to approved risk tolerance, and proof the system can fail safely, especially when pushed beyond what it knows. 1
For a Compliance Officer, CCO, or GRC lead, the fastest path is to treat MEASURE-2.6 as a control you can audit: define what “regularly” means for your environment, define safety metrics that map to the known hazards, define pass/fail thresholds tied to risk tolerance, and evidence the system’s monitored performance and failure handling after release. 1
This page gives requirement-level implementation guidance you can drop into your GRC program: applicability, step-by-step execution, evidence to retain, common audit questions, and a practical 30/60/90-day plan. It is written to help you operationalize the control across internal build teams and third parties that supply models, APIs, data, or monitoring. 2
Regulatory text
NIST AI RMF MEASURE-2.6 excerpt: “The AI system is evaluated regularly for safety risks – as identified in the map function. The AI system to be deployed is demonstrated to be safe, its residual negative risk does not exceed the risk tolerance, and it can fail safely, particularly if made to operate beyond its knowledge limits. Safety metrics reflect system reliability and robustness, real-time monitoring, and response times for AI system failures.” 1
Operator translation (what you must do):
- Evaluate safety risks on a recurring cadence using the specific safety risks you already identified in MAP (not a generic checklist). 1
- Demonstrate safety before deployment with documented test results, acceptance criteria, and sign-off. 1
- Measure residual negative risk after controls and mitigations, and show it stays within an approved risk tolerance decision. 1
- Design and test fail-safe behavior, especially for “beyond knowledge limits” scenarios (unknown inputs, ambiguous prompts, distribution shift). 1
- Use safety metrics that cover reliability, robustness, monitoring, and response times for failures, and keep evidence those metrics are tracked and acted on. 1
Plain-English interpretation
MEASURE-2.6 requires a closed loop: MAP identifies safety risks; MEASURE proves you’re checking them repeatedly; deployment is allowed only if residual risk is within tolerance; and the system must degrade safely when it fails. 1
“Fail safely” means you plan for errors and unexpected conditions, then you prove the system defaults to safer outcomes (block, defer to human review, reduce capability, or route to an alternate process) rather than producing confident harmful outputs. “Beyond knowledge limits” is the common failure mode for generative and predictive systems: users ask novel questions, data shifts, or the model is used outside the approved context. 1
Who it applies to (entity and operational context)
Applies to any organization developing or deploying AI systems, including where a third party provides the model, an API, components, training data, evaluation tooling, or monitoring. 2
Typical in-scope deployments:
- Customer-facing AI (chat, recommendations, automated decisions) where unsafe behavior can harm customers, violate policy, or create legal exposure.
- Employee-facing AI (assistants, coding copilots, HR screening support) where failures can create security, discrimination, or data leakage risks.
- Safety-adjacent AI (healthcare triage support, industrial monitoring) where “fail safe” must be explicit and tested.
- Third-party model integrations where you still own deployment safety, even if the third party built the model.
What you actually need to do (step-by-step)
1) Convert MAP outputs into a Safety Risk Evaluation Plan
Build a short plan that references the MAP risk register and answers:
- Which safety risks are in scope for this system version and use case? 1
- What tests/measurements will detect each risk (offline tests, red teaming, simulation, canary monitoring)?
- What metrics and thresholds define “safe enough” for deployment?
- Who owns running evaluations and who approves exceptions?
Deliverable: Safety Evaluation Plan mapped to the MAP risk inventory. 1
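The plan can live in a GRC tool or a simple structured record. As an illustrative sketch (the class names, risk ID format like "MAP-01", and field names are assumptions, not a standard schema), the key property to enforce is that every in-scope MAP risk has at least one mapped evaluation:

```python
from dataclasses import dataclass, field

@dataclass
class SafetyCheck:
    """One evaluation tied to a single MAP-identified risk (hypothetical schema)."""
    map_risk_id: str   # ID from the MAP risk register, e.g. "MAP-01" (assumed format)
    test_method: str   # e.g. "red_team", "offline_eval", "canary"
    metric: str        # what is measured
    threshold: float   # pass/fail boundary defining "safe enough"
    owner: str         # who runs the evaluation

@dataclass
class SafetyEvaluationPlan:
    system: str
    version: str
    checks: list = field(default_factory=list)

    def uncovered_risks(self, map_register: set) -> set:
        """MAP risks with no evaluation mapped to them -- a plan gap to close."""
        covered = {c.map_risk_id for c in self.checks}
        return map_register - covered

# Example: two checks against a three-risk MAP register leaves one gap.
plan = SafetyEvaluationPlan("support-assistant", "1.4")
plan.checks.append(SafetyCheck("MAP-01", "red_team", "unsafe_output_rate", 0.01, "ml-safety"))
plan.checks.append(SafetyCheck("MAP-02", "offline_eval", "refusal_accuracy", 0.95, "qa"))
print(plan.uncovered_risks({"MAP-01", "MAP-02", "MAP-03"}))  # {'MAP-03'}
```

Running the gap check per release gives you audit evidence that the plan stays aligned to the MAP inventory rather than drifting into a generic checklist.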
2) Define risk tolerance and residual risk decisioning
You need a concrete governance mechanism that links:
- Risk tolerance statement (organizational or domain-specific)
- Residual risk scoring after mitigations (guardrails, access controls, human review, rate limits, content filters, rollback plans)
- Release decision (approve, approve with constraints, block)
Practical approach:
- Require a Deployment Safety Assessment that includes a “residual risk vs. tolerance” table.
- If residual risk exceeds tolerance, force one of three outcomes: add mitigations, narrow scope, or block deployment. 1
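The gate itself is simple enough to sketch. The following is a minimal illustration, assuming an ordinal 1–3 risk scale and per-risk tolerance values (both assumptions; your program may use a different scale or a single enterprise tolerance):

```python
def release_decision(residual_risks: dict, tolerance: dict) -> str:
    """Compare post-mitigation risk ratings against approved tolerance.
    Ratings are ordinal: 1 = low, 2 = medium, 3 = high (illustrative scale)."""
    breaches = {risk: score for risk, score in residual_risks.items()
                if score > tolerance.get(risk, 0)}
    if not breaches:
        return "approve"
    # Any breach forces one of: add mitigations, narrow scope, or block.
    return f"block: {sorted(breaches)} exceed tolerance"

residual = {"harmful_content": 1, "data_leakage": 3}
tolerance = {"harmful_content": 2, "data_leakage": 2}
print(release_decision(residual, tolerance))  # block: ['data_leakage'] exceed tolerance
```

Whatever form the gate takes, the output of this comparison, plus the approver's sign-off, is the decision record auditors ask for.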
3) Run recurring safety evaluations (“regularly”) and trigger-based re-evaluations
Define “regularly” in your control language as a function of change and impact. You need both:
- Recurring evaluation cadence (routine reassessment even without changes)
- Event-driven evaluation (retest when something material changes)
Trigger examples you can encode in policy:
- Model version change (internal or third party)
- Prompting/guardrail logic change
- Training data refresh or RAG corpus update
- New user group, geography, or use case
- Incident, near miss, or new MAP risk identified 1
Evidence must show the evaluations actually occurred and results were reviewed.
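The "cadence OR trigger" logic above can be encoded directly in policy tooling. A minimal sketch, assuming a 90-day default cadence and the trigger names from the list above (both are illustrative choices, not prescribed values):

```python
from datetime import date, timedelta

# Events that force a retest regardless of cadence (illustrative names).
REEVAL_TRIGGERS = {
    "model_version_change", "guardrail_change", "data_refresh",
    "new_use_case", "incident", "new_map_risk",
}

def evaluation_due(last_eval: date, today: date,
                   cadence_days: int, events: set) -> bool:
    """Due if the recurring cadence has elapsed OR a material change occurred."""
    overdue = today - last_eval >= timedelta(days=cadence_days)
    triggered = bool(events & REEVAL_TRIGGERS)
    return overdue or triggered

# Cadence not yet elapsed, but a third-party model bump still forces a retest.
print(evaluation_due(date(2024, 5, 1), date(2024, 5, 20), 90,
                     {"model_version_change"}))  # True
```

The point of encoding it is evidentiary: each `True` result should create a ticket, and the closed ticket is your proof the evaluation actually ran.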
4) Prove the system can “fail safely,” especially beyond knowledge limits
You need explicit fail states and tests for them. Common fail-safe patterns:
- Refuse and route: reject requests outside scope and route to a human or alternate workflow.
- Degraded mode: reduce capability (e.g., no external actions, no tool use) when confidence drops.
- Safe completion: provide general guidance and require authoritative sources for high-impact topics.
- Stop-the-line controls: kill switch, rollback, and rate limiting for harmful output spikes. 1
Minimum operational requirement: document fail-safe behaviors, test them, and show monitoring detects when the system enters those modes.
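The refuse-and-route and degraded-mode patterns can be expressed as a wrapper around the model call. This is a sketch under stated assumptions: `classify` and `generate` are stand-ins for your own scope classifier and model client, and the 0.7 confidence floor is an arbitrary illustrative value:

```python
def answer_with_failsafe(prompt: str, classify, generate,
                         confidence_floor: float = 0.7) -> dict:
    """Wrap a model call so out-of-scope or low-confidence requests
    degrade safely instead of producing a confident harmful answer."""
    scope, confidence = classify(prompt)
    if scope == "out_of_scope":
        # Refuse and route: no model output, hand off to a human queue.
        return {"mode": "refuse_and_route", "output": None}
    if confidence < confidence_floor:
        # Degraded mode: still answer, but disable tools/external actions.
        return {"mode": "degraded", "output": generate(prompt, tools=False)}
    return {"mode": "normal", "output": generate(prompt, tools=True)}

# Stub components for illustration only.
classify = lambda p: ("out_of_scope", 0.9) if "medical" in p else ("in_scope", 0.9)
generate = lambda p, tools: f"answer (tools={tools})"
print(answer_with_failsafe("medical dosage question", classify, generate))
```

Logging the `mode` field on every response is what lets monitoring show how often the system enters its fail-safe states.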
5) Implement monitoring and response-time metrics for AI failures
MEASURE-2.6 explicitly expects real-time monitoring and metrics reflecting reliability, robustness, and response times for failures. 1
Operationalize this with:
- Defined safety SLOs/SLIs (examples: unsafe output rate, refusal accuracy, tool-action error rate, drift indicators, alert latency)
- On-call/triage workflow for AI safety incidents
- Runbooks: containment, rollback, customer comms, and post-incident reviews
- Integration with your incident management system so you can prove response timing and closure. 1
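Two of the example SLIs above can be computed from ordinary monitoring events. A minimal sketch, assuming an event log with hypothetical `output` and `alert` record shapes (your telemetry schema will differ):

```python
from datetime import datetime

def safety_slis(events: list) -> dict:
    """Compute two illustrative SLIs from a monitoring event log:
    unsafe-output rate and mean alert-to-acknowledge latency in seconds."""
    outputs = [e for e in events if e["type"] == "output"]
    unsafe = [e for e in outputs if e.get("unsafe")]
    alerts = [e for e in events if e["type"] == "alert"]
    latencies = [(e["acked"] - e["raised"]).total_seconds() for e in alerts]
    return {
        "unsafe_output_rate": len(unsafe) / len(outputs) if outputs else 0.0,
        "mean_alert_latency_s": sum(latencies) / len(latencies) if latencies else None,
    }

events = [
    {"type": "output", "unsafe": False},
    {"type": "output", "unsafe": True},
    {"type": "alert",
     "raised": datetime(2024, 5, 1, 9, 0, 0),
     "acked": datetime(2024, 5, 1, 9, 2, 30)},
]
print(safety_slis(events))  # {'unsafe_output_rate': 0.5, 'mean_alert_latency_s': 150.0}
```

Exporting these numbers per monitoring period, alongside the underlying tickets, is the response-time evidence the requirement calls for.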
6) Make it auditable: map to policy, procedure, owner, and recurring evidence
A recurring audit failure is “we do this informally.” Assign:
- Control owner (product, engineering, model risk, or GRC)
- Standard operating procedure (what steps, what tools, where evidence lives)
- Evidence calendar (what gets captured per release and per monitoring period) 1
Daydream (as a GRC workflow) fits here as the system of record for mapping MEASURE-2.6 to owners, schedules, evidence requests, and change-triggered attestations, so you can prove the control runs rather than relying on tribal knowledge.
Required evidence and artifacts to retain (audit-ready checklist)
Retain artifacts that show design + operation:
Governance and decisioning
- MAP risk register entries linked to this system/use case 1
- Risk tolerance statement and approval authority
- Residual risk assessment worksheet and sign-off
- Deployment approval record (including any constraints) 1
Testing and evaluation
- Safety Evaluation Plan mapped to MAP risks 1
- Test protocols (red team scripts, simulation plans, acceptance criteria)
- Results reports with pass/fail outcomes and remediation tickets
- Regression testing evidence after changes
Fail-safe design
- Documented fail-safe modes and trigger conditions (beyond knowledge limits)
- Proof of testing fail-safe behaviors (test cases + results)
- Kill switch / rollback procedure and access control list
Monitoring and incident response
- Monitoring dashboard screenshots or exports (reliability/robustness signals) 1
- Alert definitions and escalation routes
- Incident tickets for AI failures, including timestamps and resolution notes
- Post-incident reviews that update MAP and the test plan
Third party dependencies
- Third party change notifications and version notes (if they supply model/API)
- Contract/SLA clauses requiring incident notification and safety-relevant change notice
- Your internal re-evaluation record after third party changes
Common exam/audit questions and hangups
Expect these, and prepare short, evidence-backed answers:
- “Show me how MAP risks drive what you test for safety.” 1
- “What does ‘regularly’ mean here, and where is it documented?”
- “Who can approve residual risk exceptions, and where are the decisions recorded?”
- “Demonstrate fail-safe behavior for out-of-scope prompts or distribution shift.” 1
- “Show monitoring and response-time evidence for an AI failure.” 1
- “How do you handle third party model updates?”
Hangup to avoid: teams provide a one-time pre-prod test report with no ongoing monitoring evidence.
Frequent implementation mistakes and how to avoid them
- Mistake: Treating safety as model accuracy. Fix: tie metrics to MAP safety risks and harmful outcomes, not only performance benchmarks. 1
- Mistake: No formal residual risk decision. Fix: require an explicit residual risk vs. tolerance table and sign-off at every release gate. 1
- Mistake: “Fail safe” equals “the system errors out.” Fix: define safe degradation paths (refuse, route, degrade, rollback) and test them. 1
- Mistake: Monitoring exists, but no one responds. Fix: connect alerts to incident workflows and retain response records with timestamps. 1
- Mistake: Third party changes slip through. Fix: enforce change notification terms and trigger mandatory re-evaluation when upstream versions change.
Enforcement context and risk implications
NIST AI RMF is a framework, not a regulator. Your exposure usually appears indirectly: supervisory expectations, procurement requirements, customer audits, and downstream regulatory regimes that care about unsafe automated behavior, weak monitoring, and lack of documented controls. The operational risk is concrete: if you cannot show recurring safety evaluation, residual risk acceptance, and safe failure handling, you will struggle to defend deployment decisions after an incident. 2
Practical 30/60/90-day execution plan
First 30 days (stand up the control)
- Name the MEASURE-2.6 control owner and approver chain. 1
- Inventory AI systems in production and in-flight, and link each to MAP risks.
- Draft a Safety Evaluation Plan template and a residual risk acceptance template.
- Define fail-safe modes for the highest-impact system, and document current monitoring and incident paths.
Days 31–60 (run it once end-to-end)
- Execute safety evaluations for the priority system against MAP risks; open remediation tickets for failures. 1
- Run a fail-safe exercise: out-of-scope prompts, distribution shift scenario, tool failure scenario; capture results. 1
- Implement or tighten monitoring: alerts, escalation, and incident runbooks with clear ownership. 1
- Add third party change triggers to your change management intake.
Days 61–90 (make it repeatable and auditable)
- Operationalize release gating: no production promotion without residual risk vs. tolerance sign-off. 1
- Set a recurring evaluation cadence and a trigger-based re-evaluation process in policy/procedure.
- Centralize evidence collection (tickets, dashboards, approvals) in your GRC system; Daydream can manage the evidence calendar and map artifacts to MEASURE-2.6 for audit readiness. 1
Frequently Asked Questions
What counts as “evaluated regularly” for MEASURE-2.6?
“Regularly” should be defined by your policy and matched to system impact and change frequency. Document a recurring cadence plus event-driven re-evaluations tied to model, data, or scope changes. 1
How do I demonstrate residual risk does not exceed risk tolerance?
Create a residual risk assessment that lists key MAP risks, applied mitigations, and the remaining risk rating, then compare it to an approved tolerance threshold. Require sign-off as a release gate and retain the decision record. 1
What is “fail safely” for a generative AI assistant?
Define safe failure behaviors like refusal with an explanation, routing to a human, degraded mode without external actions, and rollback/kill switch for spikes in unsafe output. Test these behaviors explicitly, including beyond-knowledge-limit prompts. 1
We use a third party model API. Do we still own this requirement?
Yes for the deployment context you control. Contract for change notifications and safety-relevant documentation, then run your own evaluations and monitoring to prove safe operation within your use case. 2
What evidence do auditors ask for most often?
They usually want traceability from MAP risks to tests, proof of ongoing monitoring, and a documented residual risk acceptance decision for production. Keep dashboards, incident tickets, test reports, and approval artifacts together. 1
How do we handle “beyond knowledge limits” without over-refusing users?
Narrow scope with clear allowed-use policies, tune guardrails to route edge cases rather than blanket refusal, and measure both unsafe outputs and unnecessary refusals as safety metrics. Update thresholds based on observed production behavior and incident learnings. 1
Footnotes
1. NIST AI RMF Core
2. NIST AI RMF program page
Operationalize this requirement
Map requirement text to controls, owners, evidence, and review workflows inside Daydream.
See Daydream