MANAGE-4.3: Incidents and errors are communicated to relevant AI actors, including affected communities. Processes for tracking, responding to, and recovering from incidents and errors are followed and documented.
To meet MANAGE-4.3, you need an AI-specific incident and error management process that (1) identifies and logs AI incidents/errors, (2) routes timely notifications to the right internal and external parties (including affected communities when appropriate), and (3) documents response, recovery, and lessons learned end to end 1.
Key takeaways:
- Define “AI incident” and “AI error” with triggers that force ticketing, triage, and escalation 1.
- Build a communication matrix that includes impacted users/communities, not only internal stakeholders 1.
- Keep defensible evidence: incident register, notification records, response actions, recovery validation, and post-incident improvements 1.
MANAGE-4.3 sits in the “Manage” function of the NIST AI Risk Management Framework and expects operational discipline: when AI systems fail, drift, behave unexpectedly, or cause harmful outcomes, you don’t handle it ad hoc. You run a defined process, communicate to the right “AI actors,” and keep records that show what happened, what you did, and what changed afterward 1.
For a Compliance Officer, CCO, or GRC lead, the fastest path is to treat this as an extension of your security incident response and operational incident management programs, with AI-specific additions: model behavior failures, data issues, labeling mistakes, safety regressions, unfair outcomes, and human factors (for example, overreliance by staff). The requirement explicitly calls out affected communities, so your process cannot stop at engineering and legal. You need a method to decide when external communications are required, who approves them, how they are delivered, and how you document that you did it 1.
This page gives requirement-level implementation guidance you can hand to control owners and run in an audit.
Regulatory text
Excerpt (MANAGE-4.3): “Incidents and errors are communicated to relevant AI actors, including affected communities. Processes for tracking, responding to, and recovering from incidents and errors are followed and documented.” 1
What the operator must do:
- Communicate AI incidents and errors to the relevant stakeholders across the AI lifecycle, and include external groups impacted by the system when appropriate 1.
- Follow a defined process for tracking, responding to, and recovering from incidents and errors, rather than improvising 1.
- Document the process execution so you can show what happened, who was informed, what decisions were made, and how the system was stabilized 1.
Plain-English interpretation
MANAGE-4.3 requires “no surprises” operations for AI. If an AI system produces a material error or triggers an incident, you must:
- Capture it in a system of record.
- Triage it using clear severity and impact criteria.
- Notify the right people: internal AI actors (product, engineering, risk, legal, support) plus relevant third parties and downstream deployers.
- Notify affected users/communities when the incident or error impacts them and communication is warranted.
- Recover safely (rollback, disable features, patch model/data/prompting, or add guardrails) and confirm recovery.
- Learn and improve (root cause, corrective actions, and control updates).
A practical test: if your AI model harmed customers or produced systematically wrong outputs, can you prove who you told, when you told them, and what you changed afterward?
Who it applies to (entity and operational context)
Applies to: organizations that develop, integrate, deploy, or operate AI systems, including those using third-party AI components 2.
Operational contexts where MANAGE-4.3 becomes “exam-critical”:
- Customer-facing AI (chat, recommendations, pricing, underwriting support, medical triage support).
- Employee decision support (HR screening, fraud review assistance, surveillance/monitoring tools).
- AI embedded in products where failure creates safety, financial loss, discrimination risk, or major customer impact.
- High-dependency third parties (hosted model providers, data labeling firms, model monitoring tools) where incident response requires coordinated communications.
AI actors you should explicitly account for:
- Internal: model owners, product owners, SRE/ops, security, privacy, legal, compliance, customer support, comms/PR, and executive incident commander.
- External: impacted customers/users, impacted communities (for example, groups disproportionately affected), regulators when required by other obligations, and critical third parties or downstream deployers.
What you actually need to do (step-by-step)
Step 1: Define scope, terms, and triggers
Create a short AI Incident & Error Standard that answers:
- What counts as an AI incident (e.g., harmful/unsafe output event; unauthorized model access; model inversion/data leakage; significant bias or fairness regression; safety policy bypass; major monitoring alert).
- What counts as an AI error (e.g., bad training data slice discovered; labeling defect; evaluation harness bug; prompt template causing systematic incorrect output; retrieval corpus contamination).
- What events require a ticket within your incident system and what can stay as “bug/backlog.”
Operational tip: teams fail audits here because “incident” only means cybersecurity. MANAGE-4.3 expects you to capture AI behavior failures, not only breaches 1.
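The incident/error triggers above can be sketched as a simple routing function. This is an illustrative sketch only: the category names and the three routes are assumptions, not terms from the framework, and you would map them onto your own taxonomy.

```python
# Sketch: route a detected AI event to "incident", "error", or "backlog".
# Category names are illustrative, not drawn from the NIST AI RMF.

INCIDENT_CATEGORIES = {
    "harmful_output", "unauthorized_model_access", "data_leakage",
    "fairness_regression", "safety_policy_bypass", "monitoring_alert_major",
}
ERROR_CATEGORIES = {
    "training_data_defect", "labeling_defect", "eval_harness_bug",
    "prompt_template_defect", "retrieval_corpus_contamination",
}

def classify_event(category: str) -> str:
    """Return the ticketing route for an observed AI event."""
    if category in INCIDENT_CATEGORIES:
        return "incident"   # opens an incident ticket and triggers triage
    if category in ERROR_CATEGORIES:
        return "error"      # logged in the register, may escalate on impact
    return "backlog"        # ordinary bug, no incident process required
```

Anything that falls through to "backlog" should still be reviewable later, since errors are sometimes recognized only in hindsight.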
Step 2: Build an AI incident taxonomy and severity model
Define severity criteria that include:
- User/community impact (harm, discrimination risk, financial loss, safety).
- Scope (single user vs systemic).
- Reproducibility and persistence.
- Legal/regulatory exposure and contractual breach risk.
- Data exposure risk (even if not a classic security incident).
Keep the severity model simple enough that on-call responders can apply it consistently.
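One way to keep the model simple is an additive score over the criteria above. The weights, inputs, and level cut-offs below are illustrative assumptions; calibrate them against your own risk appetite.

```python
# Sketch: an additive severity score over the criteria listed above.
# Weights and SEV cut-offs are illustrative assumptions.

def severity(user_harm: int, scope_systemic: bool, persistent: bool,
             legal_exposure: bool, data_exposure: bool) -> str:
    """user_harm: 0 (none) to 3 (safety, discrimination, or financial harm)."""
    score = user_harm
    score += 2 if scope_systemic else 0   # systemic beats single-user
    score += 1 if persistent else 0       # reproducible and ongoing
    score += 2 if legal_exposure else 0   # regulatory or contractual risk
    score += 1 if data_exposure else 0    # even if not a classic breach
    if score >= 6:
        return "SEV1"
    if score >= 3:
        return "SEV2"
    return "SEV3"
```

A scoring rule like this gives on-call responders a consistent answer at 3 a.m., which is the point.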
Step 3: Stand up an incident register (system of record)
Use a tool your org already audits (ticketing/ITSM, security case management, or GRC workflow). Minimum fields:
- Unique ID; date/time opened; reporter; AI system/version; environment.
- Classification (incident vs error); severity; impacted populations/communities.
- Detection source (monitoring, customer complaint, internal test, third party).
- Containment actions; corrective actions; recovery validation.
- Communications log (who/when/how).
- Root cause category and control gaps.
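The minimum field set can be expressed as a record schema. Field names here are illustrative; map them onto whatever schema your ITSM or GRC tool actually uses.

```python
# Sketch: minimum incident-register fields as a dataclass.
# Field names are illustrative, not a prescribed schema.
from dataclasses import dataclass, field

@dataclass
class AIIncidentRecord:
    record_id: str
    opened_at: str                  # ISO-8601 timestamp
    reporter: str
    ai_system: str
    system_version: str
    environment: str                # e.g. "prod", "staging"
    classification: str             # "incident" or "error"
    severity: str                   # e.g. "SEV1".."SEV3"
    impacted_populations: list = field(default_factory=list)
    detection_source: str = ""      # monitoring, complaint, test, third party
    containment_actions: list = field(default_factory=list)
    corrective_actions: list = field(default_factory=list)
    recovery_validated: bool = False
    communications_log: list = field(default_factory=list)  # (who, when, how)
    root_cause_category: str = ""
```

Whatever tool you use, treat empty `communications_log` or `recovery_validated=False` on a closed ticket as an audit finding waiting to happen.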
Step 4: Create a communication and notification matrix (internal + external)
Build a RACI + notification matrix that maps severity levels to:
- Mandatory internal stakeholders (product, model owner, compliance, legal, privacy, security).
- External notifications (customers, downstream deployers, third parties).
- Affected communities: define decision criteria and an approval path.
A workable approach is a decision tree:
- Did the incident/error materially affect outputs for a group of users or a protected/at-risk community?
- Could a reasonable user/community benefit from being informed (to contest, mitigate, or avoid harm)?
- Do contractual commitments, policies, or other regulatory obligations require notice?
Even when you decide not to notify externally, document the rationale and approvers. That documentation is part of “followed and documented” 1.
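The decision tree above can be made executable so that the outcome, including a decision not to notify, always produces a documented record. The function and its inputs are an illustrative sketch of that logic, not a compliance determination.

```python
# Sketch of the external-notification decision tree above. Either outcome
# requires a recorded rationale, since "decided not to notify" must be
# documented too.

def external_notice_decision(materially_affects_group: bool,
                             informing_helps_users: bool,
                             notice_obligated: bool) -> dict:
    notify = materially_affects_group and informing_helps_users
    notify = notify or notice_obligated   # contractual/regulatory duty overrides
    return {
        "notify_externally": notify,
        "rationale_required": True,       # always document the decision
        "approval_required": notify,      # route through the comms approval path
    }
```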
Step 5: Integrate response and recovery runbooks into operations
Create AI-specific runbooks that responders can execute:
- Containment: disable feature flags; restrict model routes; add guardrails; rate limit; switch to fallback model; suspend automation.
- Diagnosis: reproduce prompts/inputs; review logs; check retrieval corpus; examine training data lineage; evaluate on a fixed test set.
- Remediation: patch prompts; adjust filters; retrain; rollback; update dataset; fix evaluation pipeline.
- Recovery validation: confirm monitoring returns to normal; run bias/safety checks; verify user-facing behavior; confirm downstream systems.
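Recovery validation works best as a hard gate on ticket closure. A minimal sketch, with illustrative check names matching the validation bullets above:

```python
# Sketch: a recovery-validation gate. The incident ticket cannot close
# until every check passes. Check names are illustrative.

RECOVERY_CHECKS = (
    "monitoring_baseline_restored",
    "bias_safety_checks_passed",
    "user_facing_behavior_verified",
    "downstream_systems_confirmed",
)

def can_close_ticket(results: dict) -> bool:
    """results maps each check name to True/False; missing checks fail."""
    return all(results.get(check, False) for check in RECOVERY_CHECKS)
```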
Tie runbooks to your standard incident command structure. Make sure someone owns external comms and community-facing language.
Step 6: Post-incident review with compliance outcomes
Require a post-incident review that produces:
- Root cause analysis focused on system and process failures.
- Corrective and preventive actions (CAPA) with owners and due dates.
- Updates to risk assessment, model card/system documentation, monitoring thresholds, and training for operators.
This is where you convert a painful incident into stronger controls, and it produces strong audit evidence.
Step 7: Map the requirement to a control owner and recurring evidence
Assign a single control owner (often AI Governance, Model Risk, or GRC) responsible for:
- Ensuring the process exists and is used.
- Sampling incidents quarterly (or your internal cadence) for evidence completeness.
- Chasing overdue CAPA items.
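The control owner's sweep for overdue CAPA items is simple enough to automate. A sketch, assuming items are exported as dicts with illustrative field names:

```python
# Sketch: flag overdue CAPA items for escalation, as the control owner's
# periodic sweep would. Field names are illustrative.
from datetime import date

def overdue_capa(items: list, today: str) -> list:
    """items: dicts with 'id', 'due' (YYYY-MM-DD), and 'closed' (bool)."""
    cutoff = date.fromisoformat(today)
    return [i["id"] for i in items
            if not i["closed"] and date.fromisoformat(i["due"]) < cutoff]
```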
If you use Daydream, treat MANAGE-4.3 as a mapped control with scheduled evidence requests so teams attach the incident record, comms log, and post-incident review to a single control record. That prevents the common “evidence scattered across Jira, email, Slack, and PR drafts” failure mode.
Required evidence and artifacts to retain
Retain artifacts that prove tracking, communication, response, recovery, and documentation 1:
Core artifacts (minimum set)
- AI Incident & Error Management Policy/Standard.
- Incident taxonomy and severity criteria.
- Incident register export (with required fields populated).
- Incident runbooks (containment, remediation, recovery validation).
- Communication matrix + templates (customer notice, downstream deployer notice, internal escalation, community-facing FAQ).
- Completed incident records (tickets) with timestamps and assignees.
- Communications evidence: copies of notifications, approval trail, distribution list, and dates sent.
- Post-incident review document: root cause, CAPA, decision log, and sign-offs.
- Evidence of control improvements: updated monitoring rules, updated model documentation, training record, change tickets.
Third-party related artifacts (if applicable)
- Contract clauses or playbooks requiring the third party to notify you of AI incidents/errors and support investigation.
- Vendor incident reports and your internal assessment of them.
- Joint comms plan for co-branded products or downstream deployments.
Common exam/audit questions and hangups
Auditors and internal assurance teams tend to press on repeatable proof:
- “Show me your definition of AI incident vs AI error.” Hangup: definitions that only cover security breaches, not model behavior failures.
- “How do you decide when to inform affected communities?” Hangup: no criteria, or criteria exist but are not used consistently.
- “Demonstrate end-to-end handling for a recent event.” Hangup: the ticket exists, but comms approvals and recovery validation are missing.
- “Where is the evidence that the process is followed?” Hangup: policy exists; execution evidence is informal (Slack screenshots) and incomplete.
- “How do third parties fit into your incident workflow?” Hangup: vendor gives an incident write-up, but you don’t log it internally or track CAPA.
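PLACEHOLDER-NEVER-EMITTED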
Frequent implementation mistakes and how to avoid them
| Mistake | Why it fails MANAGE-4.3 | Fix |
|---|---|---|
| Treating AI incidents as only cybersecurity incidents | Misses harms from incorrect, biased, or unsafe outputs 1 | Expand taxonomy to include model behavior, data quality, and evaluation failures |
| No defined path to communicate with impacted communities | Requirement explicitly includes affected communities 1 | Add decision criteria, approvals, and templates for community-facing communications |
| No single system of record | You can’t prove “tracked” and “documented” 1 | Centralize in ITSM/GRC and attach comms + PIR artifacts |
| “Recovery” stops at rollback | Doesn’t show stabilization or validation | Require recovery validation steps and sign-off in the ticket |
| CAPA items never close | Repeats incidents and signals weak governance | Track CAPA like audit issues, with owners and escalation |
Enforcement context and risk implications
NIST AI RMF is a framework, not a regulator. Your exposure typically shows up indirectly: customer claims, contractual disputes, reputational damage, and regulator scrutiny under other authorities if incident communications are misleading or omitted. MANAGE-4.3 reduces that risk by making communications deliberate, consistent, and provable 2.
A practical 30/60/90-day execution plan
First 30 days (foundation you can audit)
- Assign control owner and incident comms owner.
- Publish AI incident/error definitions and severity criteria.
- Create the incident register fields in your ticketing system.
- Draft the communication matrix and approval workflow, including community-impact decision criteria.
- Write one AI incident runbook for your highest-risk AI system.
Days 31–60 (operationalize)
- Train on-call, product, support, and compliance stakeholders on triggers and comms routing.
- Run a tabletop exercise using a realistic AI failure scenario and produce a post-incident review artifact.
- Add third-party notification requirements to vendor management playbooks for critical AI third parties.
- Establish a recurring evidence check: sample recent tickets and validate completeness.
Days 61–90 (make it durable)
- Expand runbooks to all material AI systems.
- Add monitoring-to-ticket automation for key alerts (bias drift, safety regressions, unusual complaint spikes).
- Stand up CAPA tracking and escalation.
- Report summary metrics internally (counts, categories, time to contain) without externalizing numbers unless legal/comms approves.
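The internal summary metrics can be derived directly from the incident register. A sketch, with illustrative record fields:

```python
# Sketch: internal summary metrics from incident records -- counts by
# category and mean time to contain. Field names are illustrative.
from collections import Counter

def summarize(records: list) -> dict:
    """records: dicts with 'category' and 'contain_minutes' (int or None)."""
    counts = Counter(r["category"] for r in records)
    times = [r["contain_minutes"] for r in records
             if r["contain_minutes"] is not None]
    mean_ttc = sum(times) / len(times) if times else None
    return {
        "count": len(records),
        "by_category": dict(counts),
        "mean_time_to_contain_min": mean_ttc,
    }
```

Keep these numbers internal by default; externalizing them is itself a communications decision that goes through legal and comms approval, as noted above.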
Frequently Asked Questions
Do we have to notify “affected communities” for every model bug?
No. You need a documented method to determine when community-facing communication is appropriate, and you must keep the decision record. The requirement is that incidents and errors are communicated to relevant AI actors, including affected communities when they are relevant 1.
What counts as an “AI actor” in practice?
Treat AI actors as anyone who develops, deploys, operates, relies on, or is materially impacted by the AI system. That includes internal owners and external downstream deployers and users when their actions depend on your system’s outputs 1.
Can our existing security incident response process satisfy MANAGE-4.3?
Often partially. You usually need additions for AI-specific incident types, model behavior recovery validation, and a defined approach to community-impact communications 1.
How do we handle incidents caused by a third-party model provider?
Log the event in your own incident register, route communications through your matrix, and track CAPA internally even if the technical fix sits with the third party. Keep the third party’s report and your assessment as evidence 1.
What evidence is most likely to be missing during an audit?
Communication logs (who was notified and when), recovery validation steps, and post-incident corrective actions with closure proof. Policies are rarely the problem; execution records are 1.
We discovered a historical error. Do we need to create an incident record retroactively?
Create a record when you identify a material past error that had user or community impact, document the retrospective analysis, and capture any notifications and remediation. Auditors care that you can show a controlled response once you knew 1.
Footnotes
1. NIST AI RMF Core.
Operationalize this requirement
Map requirement text to controls, owners, evidence, and review workflows inside Daydream.
See Daydream