SI-13(5): Failover Capability

To meet the SI-13(5) (Failover Capability) requirement, you must design and operate your system so it can fail over to a backup/alternate capability when the primary capability fails, and you must be able to prove the failover works through documented architecture, tested procedures, and retained test evidence. Focus on clearly defined failover triggers, roles, and verification.

Key takeaways:

  • Define what “failover” means for your system (scope, triggers, RTO/RPO-style targets) and document it in system-level requirements.
  • Build failover into architecture and operations (runbooks, monitoring, exercises), not as a one-time DR slide deck.
  • Retain evidence that auditors can replay: diagrams, configurations, runbooks, and test results tied to the control owner and cadence.

SI-13(5) sits in the NIST SP 800-53 System and Information Integrity family, but operationally it is a resiliency requirement: the system needs a planned, repeatable way to continue operating (or recover to an acceptable level of operation) when a component, service, zone, or site fails. For a CCO or GRC lead, the fastest path is to treat SI-13(5) as a control that must be translated into three things you can defend in an assessment: (1) explicit failover requirements (what must fail over, under what conditions, within what internal targets), (2) implemented capability (architecture + configurations + operational readiness), and (3) proven operation (tests, results, and remediation).

This control often fails audits for a simple reason: teams can describe disaster recovery in general, but cannot show system-specific failover design choices, the exact operational triggers, and recent evidence that failover was tested and the system behaved as intended. If you operationalize SI-13(5) as “documented design + practiced execution + retained artifacts,” you can satisfy both engineering and audit expectations with minimal churn.

Regulatory text

NIST excerpt (SI-13(5)): “Provide {{ insert: param, si-13.05_odp.01 }} {{ insert: param, si-13.05_odp.02 }} for the system.” 1

How to read this as an operator: NIST is requiring you to provide failover capability for the system. The excerpt uses parameter placeholders, which in practice means your organization must define the specific failover objectives (for example, what functions/components are in scope and what the organization considers an acceptable failover outcome) and then implement and maintain that capability. 2

Plain-English interpretation (what SI-13(5) expects)

You need a working, documented way for the system to keep delivering required service when a primary component fails. “Failover” can be automatic or manual, but it must be:

  • Defined: what fails over, to where, and under which failure conditions.
  • Implemented: the alternate capability exists and is reachable (infrastructure, identity, data, networking, dependencies).
  • Operable: staff can execute it using current runbooks, with monitoring/alerting that detects failure conditions.
  • Verified: you test it and keep the evidence.

A clean implementation connects SI-13(5) to availability engineering (redundancy, multi-zone, backups, replicas) and to incident response operations (detection, decisioning, communications, change control).

Who it applies to (entity and operational context)

Applies to:

  • Federal information systems and systems assessed against NIST SP 800-53 Rev. 5.
  • Contractor systems handling federal data where NIST 800-53 controls are contractually required (for example, agency ATO, flow-down requirements). 2

Operational contexts where assessors focus:

  • Mission/business-critical applications (identity, payments, customer portals, operational technology interfaces).
  • Systems with external uptime commitments or where downtime creates safety, regulatory, or contractual exposure.
  • Cloud services with shared dependencies (DNS, IdP, KMS, CI/CD, container registries), because “the app is redundant” is not the same as “the service can fail over.”

What you actually need to do (step-by-step)

1) Set system-scoped failover requirements (make the parameters real)

Create a short “Failover Requirements” section in the system’s SSP or control implementation summary:

  • In-scope services/components: user-facing app, API, database, message queue, file storage, IAM/SSO dependency, secrets management, logging pipeline.
  • Failure modes: zone loss, region/site loss, database corruption, dependency outage, misconfiguration.
  • Internal targets: define your organization’s required recovery behavior (example: “service must restore critical transactions before non-critical features”). Keep these as internal objectives if you cannot support strict numeric commitments.
  • Trigger criteria: what monitoring signals or conditions initiate failover, and who has authority to declare it.

Deliverable: a one-page “SI-13(5) Failover Objectives” addendum signed by the system owner and SRE/IT owner.

2) Design the failover architecture (and document it like an assessor will read it)

Produce architecture documentation that answers “fail over from what to what”:

  • Topology: active-active, active-passive, warm standby, cold standby.
  • Data: replication method, backup restore path, consistency constraints, key management continuity.
  • Network and identity: DNS/traffic management approach, firewall rules, IAM roles, certificates.
  • Dependency mapping: third party services required for failover (cloud provider services, managed databases, SaaS status pages, upstream APIs).

Artifacts: current architecture diagram(s), dependency map, and a configuration inventory excerpt (or IaC module references) showing the alternate stack exists.

3) Build operational runbooks that match the architecture

Write runbooks that a responder can follow under pressure:

  • Detection & declaration: alerts, dashboards, on-call escalation, incident commander decision points.
  • Failover steps: exact actions (traffic switch, database promotion, feature flags), including commands/console paths.
  • Rollback: how to return to primary, and what conditions must be met.
  • Communications: internal stakeholders and external notifications if applicable.
  • Change control: emergency changes, approvals, and logging requirements.

A common audit hangup: a DR plan exists, but it does not match the deployed architecture or current tooling.

4) Test the failover capability and capture evidence

Run failover tests that demonstrate the control works for the system, not just the platform:

  • Planned exercise: tabletop plus a technical failover (or a controlled simulation) that executes the runbook.
  • Scope: test at least one representative failure mode for each critical dependency chain.
  • Results: document what happened, what broke, what was corrected, and retest outcomes.

Evidence: test plan, execution log, screenshots/exports from monitoring, incident tickets, post-test report, and remediation tracking.

5) Operationalize as a recurring control (ownership + cadence + proof)

Map SI-13(5) to:

  • Control owner (typically SRE/IT ops leader) and business owner (system owner).
  • Recurring evidence: last test report, last runbook review record, current diagrams, monitoring coverage.
  • Exceptions: if certain components cannot fail over, document compensating controls and accepted risk.

Daydream tip: track SI-13(5) as a requirement with named owners, linked runbooks, and a standing evidence list so you can answer assessments without rebuilding the story from scratch.

Required evidence and artifacts to retain

Use an “auditor replay” mindset. Retain artifacts that let a third party verify design and operation:

  • System failover requirements statement (scope, triggers, approval)
  • Architecture diagrams showing primary/secondary and traffic routing
  • Inventory/configuration evidence that alternate capability exists (IaC references, cluster listings, DR environment manifests)
  • Runbooks (failover, failback, communications) with version history
  • Monitoring/alerting coverage list for failover triggers
  • Test plan(s), test execution records, and post-test report(s)
  • Remediation tickets with closure evidence (and retest results where applicable)
  • Third party dependency register entries tied to failover (for example, IdP, DNS, managed database provider)

Common exam/audit questions and hangups

Assessors tend to ask:

  • “Show me exactly what fails over, and where the alternate capability lives.”
  • “Is failover automatic or manual? Who can initiate it?”
  • “When was the last failover test? Provide the evidence.”
  • “Do your runbooks match current architecture and on-call practice?”
  • “What happens if a third party dependency fails (IdP, DNS, KMS, payment processor)?”

Hangups that stall audits:

  • Architecture diagrams are high-level marketing diagrams, not implementable diagrams.
  • Tests are tabletop-only with no technical execution evidence.
  • Failover exists for compute, but not for data, secrets, identity, or networking.

Frequent implementation mistakes (and how to avoid them)

  1. Failover capability exists only in someone’s head.
    Fix: require runbooks with step-by-step actions and store them in your controlled documentation repo.

  2. “Multi-zone” is claimed, but traffic management is a single point of failure.
    Fix: document and test DNS/traffic steering, certificates, and WAF rules as part of failover.

  3. Data layer is ignored.
    Fix: explicitly document database promotion/restore steps, integrity checks, and how applications reconnect.

  4. Third party dependencies are not included.
    Fix: list critical third party services and define what “failover” looks like when they are down (graceful degradation, queued processing, manual workaround).

  5. No retained evidence.
    Fix: predefine an evidence bundle and attach it to the control record after each exercise.

Enforcement context and risk implications

No public enforcement cases are directly tied to SI-13(5). Practically, the risk is still concrete: inability to fail over turns incidents into extended outages, can breach contractual availability commitments, and can trigger reportable events depending on your regulatory environment and customer agreements. Treat SI-13(5) as an availability-and-integrity control that reduces both operational loss and compliance exposure. 3

A practical 30/60/90-day execution plan

First 30 days (stabilize scope and ownership)

  • Name the SI-13(5) control owner and system owner; document RACI.
  • Define failover objectives and in-scope components for the system.
  • Produce a current-state architecture diagram and dependency map.
  • Inventory runbooks you have; gap-assess against actual architecture.

Days 31–60 (implement missing mechanics and documentation)

  • Close “single point of failure” gaps that block failover (traffic routing, IAM permissions, secrets, database replication path).
  • Write or update failover/failback runbooks; add communications steps.
  • Implement or tune monitoring to create clear failover triggers.
  • Create the evidence checklist and storage location (single folder or GRC record with links).

Days 61–90 (prove it works and make it repeatable)

  • Execute a planned failover test (or controlled simulation) using the runbook.
  • Publish a post-test report with lessons learned and remediation tickets.
  • Retest any critical fixes and attach evidence to the SI-13(5) record.
  • Set a recurring review rhythm: runbook review, architecture review, and exercise scheduling aligned to change velocity.

Frequently Asked Questions

Does SI-13(5) require automatic failover?

The text requires failover capability, not a specific automation level 1. Automatic failover is often easier to defend, but manual failover can meet the requirement if it is defined, tested, and consistently executable.

What systems should I prioritize first for SI-13(5)?

Start with systems that support critical business processes or where downtime would create regulatory or contractual exposure. If you have to sequence, prioritize identity, core data stores, and customer-facing production services.

What’s the minimum evidence set an assessor will accept?

Keep (1) documented failover requirements, (2) architecture showing alternate capability, (3) runbooks, and (4) a recent test record with results and remediation. If any one of those is missing, assessments often stall on “capability not demonstrated.”

How do I handle third party services that can’t “fail over”?

Document the dependency, the outage mode, and your designed response (graceful degradation, queuing, manual procedures, alternate provider where feasible). Capture this in the dependency map and the runbook so your failover story covers the full service chain.

Can a disaster recovery plan satisfy SI-13(5) on its own?

A generic DR plan rarely closes SI-13(5) by itself. You need system-specific failover design and proof of operation through testing and retained artifacts 3.

How should Daydream fit into SI-13(5) operations?

Use Daydream to assign ownership, store linked artifacts (runbooks, diagrams, test reports), and schedule recurring evidence so SI-13(5) stays continuously audit-ready instead of becoming a scramble before an assessment.

Footnotes

  1. NIST SP 800-53 Rev. 5 OSCAL JSON

  2. NIST SP 800-53 Rev. 5 OSCAL JSON; NIST SP 800-53 Rev. 5

  3. NIST SP 800-53 Rev. 5

Operationalize this requirement

Map requirement text to controls, owners, evidence, and review workflows inside Daydream.

See Daydream