SI-17: Fail-safe Procedures
SI-17 requires you to define and implement fail-safe procedures that trigger when specified failures occur, so systems default to a secure, controlled state instead of failing open. To operationalize it quickly, identify your critical failure modes, decide the safe state for each, implement automated controls and runbooks, then test and retain evidence. 1
Key takeaways:
- Document failure modes and the “safe state” you expect for each, then map them to concrete technical controls and runbooks.
- Prove operation with test results, monitoring/alert evidence, and change records tied to each fail-safe mechanism.
- Focus on “fail closed” for security boundaries, privileged access, logging pipelines, and cryptographic enforcement points.
The SI-17 (Fail-safe Procedures) requirement is about predictable, secure behavior under stress: when something breaks, what happens next must be defined, implemented, and repeatable. Many programs have resilience work (backups, DR, redundancy) but still fail SI-17 because they cannot show that specific failure conditions trigger specific safe outcomes, and that those outcomes are tested.
As a Compliance Officer, CCO, or GRC lead, your job is to turn SI-17 into an auditable set of decisions: which failures matter, what “fail-safe” means in your environment, who owns each mechanism, and what evidence proves the mechanism actually works. SI-17 is also a boundary control: it forces alignment between engineering intent (how systems behave) and governance intent (what risk is acceptable when dependencies fail).
This page gives requirement-level implementation guidance you can hand to system owners. It avoids theory and focuses on the artifacts assessors ask for: a failure-mode inventory, explicit safe-state definitions, implemented controls, test results, and change management traceability back to the requirement. 1
Regulatory text
Control statement (verbatim): “Implement the indicated fail-safe procedures when the indicated failures occur: {{ insert: param, si-17_prm_1 }}.” 2
What the operator must do with this text
The parameterized portion is where your program specifies (1) the failures you care about and (2) the fail-safe procedures that must occur. Operationally, you must:
- Name the failures (e.g., “loss of identity provider connectivity,” “SIEM ingestion failure,” “key management service unavailable”).
- Define the safe state for each failure (e.g., “deny new sessions,” “block privileged actions,” “queue logs locally and alert”).
- Implement mechanisms that reliably force that safe state.
- Test and evidence that the mechanism triggers correctly and does not degrade into “fail open.” 2
Plain-English interpretation of the requirement
SI-17 means: when a defined failure happens, your system must automatically take a predefined safe action. Safe usually means security-preserving and controlled: deny, restrict, isolate, stop processing sensitive actions, or degrade service in a way that does not bypass policy.
The assessment expectation is not perfection. It is intent + implementation + proof:
- Intent: you made explicit choices about failure handling.
- Implementation: your systems and procedures match those choices.
- Proof: you tested realistic failure scenarios and kept records. 1
Who it applies to (entity and operational context)
SI-17 commonly applies to:
- Federal information systems implementing NIST SP 800-53 controls. 1
- Contractor systems handling federal data where 800-53 is contractually required or flowed down. 2
Operationally, it applies most strongly to systems that:
- Enforce authentication, authorization, or session management.
- Provide security telemetry (logging, SIEM forwarding, EDR).
- Protect cryptographic keys, secrets, and certificates.
- Sit on network/security boundaries (WAF, API gateways, firewalls, ZTNA, PAM).
- Run high-impact business workflows where an error could cause unauthorized disclosure or transaction integrity loss.
What you actually need to do (step-by-step)
1) Assign ownership and define the scope boundary
- Assign a control owner (GRC) and implementation owners (system/platform owners).
- Define the system scope: components, dependencies, and trust boundaries. Keep this aligned with your SSP or system description if you maintain one. 1
Deliverable: SI-17 implementation record with owners, in-scope systems, and dependency list.
2) Build a failure-mode inventory (start with security-critical paths)
Create a short list of “indicated failures” that are realistic and high-risk. Use categories to keep it manageable:
- Identity and access: IdP outage, MFA service unavailable, directory lookup failure.
- Policy engines: authorization service down, ABAC/RBAC policy store unavailable.
- Crypto/secrets: KMS/HSM outage, certificate validation failure, secret retrieval failure.
- Telemetry: log forwarder down, SIEM endpoint unreachable, EDR agent failure.
- Data layer: database read/write errors, replication lag, storage full.
- Network dependencies: DNS failure, time sync failure, service mesh outage.
Tip: Tie each failure mode to a security objective (confidentiality, integrity, availability) so the “safe state” decision is defensible. 1
Deliverable: Failure Mode Register (system, failure condition, trigger signal, expected behavior, owner).
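As a sketch, a register entry can be captured as structured data so it stays diffable and reviewable. The field names below mirror the deliverable; the systems, signals, and owners are hypothetical examples:

```python
from dataclasses import dataclass

@dataclass
class FailureModeEntry:
    """One row of the Failure Mode Register (field names mirror the deliverable)."""
    system: str              # in-scope system or component
    failure_condition: str   # the "indicated failure"
    trigger_signal: str      # how the failure is detected
    expected_behavior: str   # the defined safe state
    owner: str               # accountable implementation owner

# Illustrative entries only; replace with your own systems and owners.
register = [
    FailureModeEntry(
        system="api-gateway",
        failure_condition="authorization service unreachable",
        trigger_signal="policy-decision timeout > 2s",
        expected_behavior="deny requests that require a policy decision",
        owner="platform-team",
    ),
    FailureModeEntry(
        system="log-forwarder",
        failure_condition="SIEM endpoint unreachable",
        trigger_signal="delivery failures over a 5-minute window",
        expected_behavior="buffer locally and page on-call",
        owner="secops-team",
    ),
]
```

Keeping the register in version control also gives you change-history evidence for free.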
3) Define the fail-safe procedure and “safe state” for each failure
For each failure mode, write a one-paragraph requirement:
- Trigger: What condition indicates the failure? (health check, error rate threshold, missing heartbeat, dependency timeout)
- Action: What does the system do immediately?
- Safe state: What is permitted and what is blocked?
- Recovery: What conditions return the system to normal operation?
- Operator procedure: What does on-call do, and where is the runbook?
Common “safe states” that satisfy auditors:
- Fail closed for authentication/authorization: deny requests that require policy decisions when the policy source is unavailable.
- Read-only mode when write integrity cannot be guaranteed.
- Isolate a subsystem or disable integrations when trust cannot be established.
- Queue and alert for telemetry failures, with local buffering and explicit incident creation if loss is possible.
Deliverable: Fail-safe Design Spec (table format works best).
4) Implement the mechanisms (technical + procedural)
Match each fail-safe requirement to a control mechanism:
- Application controls: dependency timeouts, circuit breakers, explicit deny-on-error branches, safe defaults.
- Infrastructure controls: load balancer health checks, autoscaling guards, node taints, Kubernetes pod disruption budgets where relevant.
- Security tooling: WAF/API gateway default-deny, PAM “break-glass” workflows with approvals, certificate pinning/validation failure behaviors.
- Operations: runbooks, paging rules, incident playbooks, and change management gates for altering fail-safe behavior.
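The "explicit deny-on-error" application control above can be sketched as follows. The policy client (`check_policy`) and the timeout value are hypothetical stand-ins; the point is that every error path resolves to deny, never to allow:

```python
def is_authorized(check_policy, subject, action, timeout=2.0):
    """Fail-closed authorization: any failure to obtain a policy
    decision resolves to deny, never to allow."""
    try:
        # check_policy is a stand-in for your policy-engine client;
        # it is expected to raise on timeout or transport failure.
        return check_policy(subject, action, timeout=timeout) is True
    except Exception:
        # Safe state: deny when the policy source is unavailable.
        # In production, also emit an alert so the failure is visible.
        return False
```

Note the `is True` comparison: an unexpected return value (None, an error object) also resolves to deny rather than being truthy by accident.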
Keep the mapping explicit: failure mode → mechanism → owner → evidence location. This is where tools like Daydream help: you can map SI-17 to the owner, implementation procedure, and recurring evidence artifacts so assessments stop being a scavenger hunt. 2
Deliverable: SI-17 Control Implementation Matrix (failure mode mapped to code/config and runbook).
5) Test the fail-safe behavior and record results
Assessors will look for proof that fail-safe actions actually trigger. Build tests that reflect production reality:
- Tabletop: walk through “IdP down” or “KMS unavailable” scenarios and confirm runbook steps.
- Functional tests: simulate dependency failures in staging; confirm expected denies, degraded mode, or isolation.
- Observability checks: confirm alerts fire and are routed to the right on-call group.
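A dependency-failure functional test from the list above can be sketched with plain assertions. The login flow and IdP client here are hypothetical; the pattern is to inject a simulated outage and assert the fail-closed result:

```python
def login(idp_lookup, username):
    """Session issuance that fails closed when the IdP is unreachable."""
    try:
        identity = idp_lookup(username)
    except ConnectionError:
        # Indicated failure: IdP outage. Safe state: deny new sessions.
        return {"session": None, "reason": "idp_unavailable_fail_closed"}
    return {"session": f"token-for-{identity}", "reason": "ok"}

def test_idp_outage_denies_new_sessions():
    def broken_idp(username):
        raise ConnectionError("simulated IdP outage")
    result = login(broken_idp, "alice")
    # Expected result: no session, explicit fail-closed reason.
    assert result["session"] is None
    assert result["reason"] == "idp_unavailable_fail_closed"

test_idp_outage_denies_new_sessions()
```

The test name, inputs, and assertions map directly onto the artifact fields assessors want: scenario, expected result, actual result.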
Test artifacts must show: date, scenario, expected result, actual result, follow-ups, and ticket references.
Deliverable: Fail-safe Test Reports + remediation tickets.
6) Operationalize: monitoring, change control, and periodic review
Fail-safe behavior drifts when teams refactor services or swap providers. Put simple governance around it:
- Monitoring: alerts tied to each failure mode signal.
- Change control: require review when modifying authn/authz flows, KMS integrations, logging pipelines, or boundary devices.
- Review cadence: revisit the failure-mode register when you add a dependency or ship a major architecture change.
Deliverable: Change management checklist item for SI-17-impacting changes.
Required evidence and artifacts to retain
Keep evidence in a form an assessor can follow end-to-end:
- Failure Mode Register (approved by system owner)
- Fail-safe Design Spec (safe state definitions + triggers)
- Implementation proof: configuration snippets, IaC commits, code references, architecture diagrams
- Runbooks and incident procedures: on-call steps, escalation paths
- Test evidence: tabletop notes, staging test logs, screenshots, or test harness output
- Monitoring evidence: alert definitions, routing, sample alert events
- Change records: tickets/PRs that show review and approval of fail-safe changes 1
Common exam/audit questions and hangups
Assessors tend to press on these points:
- “What are the indicated failures?” If you cannot list them, you fail the intent of the parameterized control.
- “Show me what happens when X fails.” Be ready with a live demo in non-production or with recorded test results.
- “Does it fail open anywhere?” “Allow on error” branches for authorization and logging blind spots are frequent findings.
- “Who owns this behavior?” Shared ownership without named accountable owners causes evidence gaps.
- “How do you know it stays that way?” Lack of change control linkage is a common hangup. 1
Frequent implementation mistakes and how to avoid them
Mistake 1: Treating fail-safe as “high availability”
HA reduces downtime; SI-17 is about secure behavior during failure.
Avoid it: Document safe-state behavior even when you have redundancy.
Mistake 2: Logging fails silently
Teams assume “best effort logging” is acceptable, then cannot detect events during outages.
Avoid it: Define what happens when log forwarding fails (buffer, alert, degrade, or block specific actions).
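One way to sketch the "buffer, alert, degrade" behavior for a failed log forwarder. The forwarder and alert callables are hypothetical stand-ins for your shipping and paging integrations:

```python
import collections

class BufferedForwarder:
    """Fail-safe log shipping: on forwarder failure, buffer locally,
    alert once, and alert again when the buffer is about to drop events."""
    def __init__(self, forward, alert, max_buffer=10_000):
        self.forward = forward        # callable that ships one event
        self.alert = alert            # callable that pages on-call
        self.buffer = collections.deque(maxlen=max_buffer)
        self.alerted = False

    def emit(self, event):
        try:
            self.forward(event)
        except Exception:
            if not self.alerted:
                self.alert("log forwarding failed; buffering locally")
                self.alerted = True
            if len(self.buffer) == self.buffer.maxlen:
                self.alert("log buffer full; oldest events will be dropped")
            self.buffer.append(event)
```

The design choice worth documenting: a bounded buffer with an explicit overflow alert turns silent log loss into a detectable incident.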
Mistake 3: “Break-glass” that becomes the normal path
Emergency access paths often bypass controls and are poorly monitored.
Avoid it: Require approvals, strong authentication, session recording where available, and explicit expiration of elevated access.
Mistake 4: Unbounded retries that mask failure
Retry storms can degrade systems and prevent clean failover or safe degradation.
Avoid it: Use timeouts, circuit breakers, and clear error handling that triggers your safe state.
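A bounded-failure circuit breaker can be sketched as follows; the failure threshold and the fallback (your safe-state action) are illustrative:

```python
class CircuitBreaker:
    """Trip to the safe state after a bounded number of failures
    instead of retrying indefinitely."""
    def __init__(self, call, fallback, max_failures=3):
        self.call = call                  # the dependency call
        self.fallback = fallback          # the safe-state action
        self.max_failures = max_failures
        self.failures = 0

    def __call__(self, *args, **kwargs):
        if self.failures >= self.max_failures:
            # Open circuit: skip the dependency, go straight to safe state.
            return self.fallback(*args, **kwargs)
        try:
            result = self.call(*args, **kwargs)
            self.failures = 0             # a healthy call closes the circuit
            return result
        except Exception:
            self.failures += 1
            return self.fallback(*args, **kwargs)
```

A production breaker would add a half-open probe after a cooldown; the sketch shows the core property, that failure handling converges on the defined safe state rather than a retry storm.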
Mistake 5: No evidence trail
You may have solid engineering, but audits are evidence-driven.
Avoid it: Pre-map each fail-safe to recurring artifacts (test results, alert configs, change tickets). Daydream’s control-to-evidence mapping model fits this requirement well because SI-17 often spans multiple teams and systems. 2
Enforcement context and risk implications
No public enforcement cases were provided in the source material for this control, so this page does not cite case law or regulator actions.
Practically, SI-17 failures show up as:
- Unauthorized access when authz services degrade.
- Data integrity issues when systems keep accepting writes under partial failure.
- Undetected incidents when telemetry pipelines break.
- Extended incident duration because operators do not have defined failure procedures. 1
Practical 30/60/90-day execution plan
First 30 days (baseline and decisions)
- Assign SI-17 owners per in-scope system.
- Produce an initial Failure Mode Register for security-critical paths.
- Write safe-state decisions for each failure mode and get system owner sign-off.
- Create an evidence map: where each artifact will live and who updates it.
Days 31–60 (implementation and instrumentation)
- Implement or harden fail-safe mechanisms for the top failure modes.
- Add or refine alerts tied to failure triggers.
- Draft runbooks for each scenario; align paging and escalation.
- Put an SI-17 check into change review for impacted components.
Days 61–90 (testing, audit readiness, and steady state)
- Run structured fail-safe tests in staging; capture results and track fixes.
- Perform one cross-team tabletop exercise that includes GRC, engineering, and operations.
- Finalize the SI-17 control narrative and evidence package for assessment.
- Schedule periodic reviews tied to major releases and dependency changes. 1
Frequently Asked Questions
What counts as a “fail-safe procedure” for SI-17?
A fail-safe procedure is a defined action that moves a system to a safe state when a specified failure occurs, such as denying access when authorization cannot be evaluated. The key is that the trigger and action are explicit and testable. 2
Do we have to fail closed for every failure?
No, but you must define the safe state for each indicated failure and justify it based on risk. For security boundaries (authn/authz, key management), assessors commonly expect fail closed or tightly controlled degradation. 1
How do we handle SI-17 for third-party dependencies like an external IdP or SIEM?
Treat third-party outages as indicated failures and define your safe state when that dependency is unavailable. Keep evidence that your application or gateway enforces that behavior and that your runbook covers escalation to the third party. 1
What evidence is usually missing during audits?
Teams often lack test artifacts showing the fail-safe action actually triggered, plus change records showing the behavior is controlled over time. A mapping from each failure mode to specific evidence locations reduces this gap. 1
Can we satisfy SI-17 with runbooks alone?
Runbooks help, but SI-17 expects implemented procedures when failures occur, which typically includes automated technical enforcement for key security decisions. Use runbooks to cover operator actions, not to replace system behavior. 2
How should we document SI-17 so it is easy to assess?
Use a matrix: failure mode, trigger signal, safe-state behavior, enforcing mechanism, owner, test reference, and evidence links. Daydream is a practical place to maintain that mapping because it ties controls to owners and recurring evidence artifacts. 2
Footnotes
1. NIST SP 800-53 Rev. 5.
2. NIST SP 800-53 Rev. 5 OSCAL JSON.
Operationalize this requirement
Map requirement text to controls, owners, evidence, and review workflows inside Daydream.
See Daydream