SI-13: Predictable Failure Prevention
To meet the SI-13: Predictable Failure Prevention requirement, you must identify the system components most likely to fail, determine their mean time to failure (MTTF) in the environments where they actually run, and use those values to drive preventive maintenance, redundancy, and replacement decisions. Auditors will expect documented MTTF assumptions, a repeatable method, and operational evidence that you act on the results. 1
Key takeaways:
- SI-13 is an engineering-meets-GRC control: quantify failure likelihood (MTTF) per component and environment, then manage to it. 1
- “Specific environments of operation” means your real conditions (cloud region, on-prem site, workload profile), not generic vendor specs. 1
- Evidence is the make-or-break: inventory scope, MTTF determination method, maintenance actions, and review cadence mapped to a control owner. 1
SI-13 sits in the “System and Information Integrity” family, but it functions like a reliability control: prevent predictable outages and integrity failures by understanding when parts of your system wear out or degrade. The requirement text is narrow but operationally demanding: you must determine mean time to failure (MTTF) for defined components, in defined environments, and then manage the system to avoid failures you could have anticipated. 1
For a Compliance Officer, CCO, or GRC lead, the fastest path is to treat SI-13 as a mini-program with three deliverables: (1) a scoped list of components and environments; (2) a defensible MTTF determination approach (data-driven where possible, justified assumptions where not); and (3) an operating rhythm that turns MTTF into tickets, maintenance windows, capacity changes, or replacement plans. If your reliability work lives in SRE/Infrastructure and your audits live in GRC, SI-13 is where those worlds must meet.
This page gives requirement-level implementation guidance you can hand to engineering and also defend in an assessment.
Regulatory text
Control requirement (excerpt): “Determine mean time to failure (MTTF) for the following system components in specific environments of operation: {{ insert: param, si-13_odp.01 }} ; and” 1
What the operator must do with that text
- Select the components in scope (the parameterized list in your implementation). You decide what components are included, but it must be explicit and risk-based.
- Define the environments of operation for those components. “Environment” must reflect where and how the component runs (examples: cloud region, availability zone pattern, on-prem datacenter conditions, workload intensity, storage type).
- Determine MTTF for each component/environment pair using a repeatable method.
- Operationalize the outputs so MTTF is not a spreadsheet artifact: use it to plan maintenance, replacement, redundancy, monitoring thresholds, and change management gates.
NIST leaves the parameter open on purpose. Your job is to define it and prove it is reasonable for your system and mission. 2
Plain-English interpretation (what SI-13 is really asking)
SI-13 requires you to stop treating outages as “bad luck.” For components that fail predictably over time (hardware aging, certificate expiry, storage wear, battery-backed devices, even software components with known lifecycle end dates), you must estimate how long they run before failure in your conditions, then take preventive action before that expected failure window creates downtime, data loss, or integrity issues. 1
In practice, “MTTF” can come from:
- Your own incident history and telemetry (best).
- Manufacturer or cloud provider reliability guidance, adjusted to match your use pattern.
- Engineering judgment documented with assumptions when data is limited.
What auditors want to see is not perfect math; they want a disciplined approach that reduces preventable failures. 1
Who it applies to
Entity scope
- Federal information systems and contractor systems handling federal data where NIST SP 800-53 is the selected control baseline. 3
Operational context (where SI-13 shows up in real programs)
- High-availability platforms (mission systems, shared services, identity services).
- Systems with regulated uptime or continuity expectations.
- Environments where component failures create downstream integrity risk (dropped logs, incomplete transactions, corrupted storage).
- Systems dependent on third parties (cloud infrastructure, managed databases, CDN/DNS providers): SI-13 still applies because the failure is predictable even if you do not own the hardware.
What you actually need to do (step-by-step)
Step 1: Assign ownership and define control boundaries
- Name a control owner (usually Head of SRE/Infrastructure or Platform Engineering) and a GRC owner responsible for evidence quality.
- Define boundary: which production systems and supporting components are included (and what is explicitly excluded with rationale).
- Map recurring evidence artifacts to the control (this is the fastest way to make SI-13 “audit-ready”). 1
Daydream fit: Many teams lose time because SI-13 evidence is scattered across CMDB, ticketing, monitoring, and runbooks. Daydream is useful when you need a single requirement page that ties SI-13 to an owner, a procedure, and a predictable evidence set you can refresh on schedule. 1
Step 2: Build the “SI-13 component list” (your parameter)
Create a table and force clarity. Keep it short at first; expand as you mature.
Recommended component categories to consider (pick what matches your system):
- Compute hosts / hypervisors / node groups
- Storage media (disks, SSD arrays), storage controllers
- Network devices (firewalls, load balancers) if in your boundary
- Managed cloud services with known failure modes (managed DB, queues)
- Identity and cryptographic components with lifecycle cliffs (certificates, HSMs) where failure is predictable (expiry/renewal)
- Backup infrastructure and log pipelines (integrity impact)
For each component: define the “environment of operation” in a way engineering agrees is meaningful (region, AZ pattern, hardware class, tenancy model, traffic profile). 1
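The component/environment table above can be kept as structured data so engineering and GRC work from the same record. A minimal sketch (the class, field names, and example entries are illustrative assumptions, not prescribed by NIST):

```python
from dataclasses import dataclass

# Hypothetical structure for the SI-13 component list; adapt fields to
# your CMDB or register conventions.
@dataclass
class Si13Component:
    component: str    # what is in scope, e.g. "SSD storage array"
    environment: str  # the real operating conditions, not just "production"
    owner: str        # accountable engineering owner
    rationale: str    # why this component is in scope (risk-based)

scope = [
    Si13Component("SSD storage array", "on-prem DC-1, write-heavy OLTP",
                  "storage-team", "wear-out under sustained writes"),
    Si13Component("TLS certificates", "all public endpoints",
                  "platform-team", "predictable expiry cliff"),
]

for c in scope:
    print(f"{c.component} | {c.environment} | {c.owner}")
```

Keeping the rationale field populated is what makes the scoping decision defensible when an assessor asks why a component is in (or out).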
Step 3: Determine MTTF per component in that environment
Use a documented hierarchy so you can defend choices:

1. Empirical MTTF from your telemetry
   - Pull incident/problem records for the component class.
   - Normalize: count failures that required repair/replacement or caused service-impacting degradation.
   - Document the observation window and any data quality gaps.
2. Supplier or platform reliability inputs
   - If you use cloud services or third-party appliances, document the source materials you relied on and how you mapped them to your environment.
   - Keep the “adjustment logic” simple and explicit (example: “workload is sustained high IOPS; use conservative MTTF assumption”).
3. Engineering judgment with assumptions
   - Only when data is scarce.
   - Document assumptions and a plan to replace judgment with measured values over time.
Deliverable: an MTTF register (spreadsheet is fine) with columns: component, environment, MTTF value/range, method/source, assumptions, owner, last reviewed date, next review date, and linked evidence. 1
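The empirical branch of the hierarchy above reduces to a simple calculation: total component operating time across the fleet in the observation window, divided by the number of qualifying failures. A minimal sketch (the function name and the example dataset are hypothetical):

```python
from datetime import date

def empirical_mttf_hours(fleet_size: int,
                         window_start: date,
                         window_end: date,
                         failure_count: int) -> float:
    """MTTF ~ total operating hours across the fleet / number of failures.

    Assumes roughly continuous operation; document any downtime or
    data-quality gaps alongside the result, as Step 3 requires.
    """
    if failure_count == 0:
        raise ValueError("No observed failures: log as an exception "
                         "rather than reporting an infinite MTTF")
    window_hours = (window_end - window_start).days * 24
    return (fleet_size * window_hours) / failure_count

# Example: 200 disks observed for one year, 6 service-impacting failures.
mttf = empirical_mttf_hours(200, date(2023, 1, 1), date(2024, 1, 1), 6)
print(round(mttf))  # prints 292000
```

The observation window and the definition of a "qualifying failure" are exactly the data-quality details an assessor will probe, so record both next to the number.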
Step 4: Turn MTTF into preventive controls (tickets, changes, design)
MTTF that doesn’t drive action will fail an assessment.
Minimum operationalization set:
- Preventive maintenance plan: planned replacement/refresh, patch/upgrade cycles, certificate rotation schedules.
- Redundancy decisions: if MTTF suggests likely failures inside your service tolerance, document redundancy (N+1, multi-AZ, failover) or compensating controls.
- Monitoring thresholds tied to wear-out signals: storage error rates, memory errors, hardware SMART indicators, queue depth saturation.
- Change management gates: require review when proposed changes increase stress on components with lower MTTF.
Make these actions traceable:
- Each action becomes a ticket or change record that references the MTTF register line item.
- Each “completed maintenance” record becomes evidence. 1
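One way to make the MTTF-to-ticket linkage mechanical is to flag register entries whose accumulated runtime is approaching a conservative fraction of MTTF. A hedged sketch (the 0.6 threshold, field names, and example entries are assumptions; tune the threshold to your service tolerance):

```python
def needs_preventive_action(runtime_hours: float,
                            mttf_hours: float,
                            threshold: float = 0.6) -> bool:
    """Flag a component for a preventive-maintenance ticket once it has
    consumed `threshold` of its expected MTTF."""
    return runtime_hours >= threshold * mttf_hours

register = [
    {"id": "SI13-001", "component": "SSD array", "runtime_h": 20000, "mttf_h": 30000},
    {"id": "SI13-002", "component": "Load balancer", "runtime_h": 5000, "mttf_h": 80000},
]

for entry in register:
    if needs_preventive_action(entry["runtime_h"], entry["mttf_h"]):
        # In practice this would open a ticket referencing the register ID,
        # which later becomes the "executed maintenance" evidence.
        print(f"OPEN TICKET: preventive maintenance for {entry['id']} ({entry['component']})")
```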
Step 5: Establish a review and refresh rhythm
You need a repeatable cycle; pick a cadence and stick to it. The requirement does not prescribe frequency, but auditors will ask how you keep MTTF current as environments change. 1
Operational triggers that should force an out-of-cycle review:
- Major architecture change (new region, new storage class)
- Significant incident tied to component failure
- Supplier/platform change that alters reliability characteristics
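The cadence and triggers above can be enforced with a simple staleness check over the register's review dates: anything past due, or hit by one of the triggers, goes on the next review agenda. A sketch with illustrative field names:

```python
from datetime import date

def overdue_reviews(register, today: date):
    """Return register IDs whose scheduled review date has passed."""
    return [e["id"] for e in register if e["next_review"] < today]

register = [
    {"id": "SI13-001", "next_review": date(2024, 3, 1)},
    {"id": "SI13-002", "next_review": date(2024, 9, 1)},
]

print(overdue_reviews(register, today=date(2024, 6, 1)))  # prints ['SI13-001']
```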
Required evidence and artifacts to retain
Keep evidence tight and assessor-friendly:
- SI-13 control narrative (1–2 pages): scope, roles, method, tools, and how MTTF drives action. 1
- Component/environment inventory extract showing what you included (CMDB export, cloud inventory, or manually curated list).
- MTTF register with method and sources per line item.
- Data extracts used to calculate/estimate MTTF (incident exports, monitoring summaries).
- Preventive maintenance plan and a sample of executed records (tickets/changes).
- Meeting notes or review sign-offs from the recurring MTTF review (engineering + GRC).
- Exception log for components where you cannot determine MTTF yet, with compensating controls and a remediation plan.
Common exam/audit questions and hangups
Assessors tend to focus on these:
- “Show me your parameter list.” If you cannot produce the defined component list in scope, SI-13 collapses into an undefined intention. 1
- “How did you define ‘environment of operation’?” Generic “production” is rarely sufficient; they will probe cloud regions, datacenters, and workload types. 1
- “Where did the MTTF number come from?” Expect follow-ups on data quality, time window, and whether you used real incidents.
- “What did you do with it?” Bring executed maintenance tickets, refresh actions, or design changes tied to MTTF.
- “Who reviews this and how often?” If ownership is unclear, the control looks non-operational.
Frequent implementation mistakes (and how to avoid them)
- Mistake: Treating SI-13 as a policy-only control.
  Fix: Require at least one operational artifact per in-scope component class: register entry + action evidence.
- Mistake: Using manufacturer MTTF values without environment context.
  Fix: Add an “environment adjustment” note. If you cannot justify adjustments, narrow the scope to components where you can.
- Mistake: Scoping only obvious hardware and ignoring predictable “time bombs.”
  Fix: Include items like certificates, license renewals, and managed service lifecycle events when their failure mode is predictable and service-impacting.
- Mistake: No linkage to tickets/changes.
  Fix: Add a mandatory reference field in tickets for “SI-13 MTTF register ID” for maintenance and refresh work.
- Mistake: No exception handling.
  Fix: Maintain an exception log with compensating controls (redundancy, monitoring) and a plan to collect data.
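The “mandatory reference field” fix can be checked automatically: report maintenance tickets that do not carry a valid SI-13 register ID. A sketch under the assumption that register IDs follow an `SI13-NNN` pattern (the ticket shapes and IDs are hypothetical):

```python
import re

REGISTER_IDS = {"SI13-001", "SI13-002"}
ID_PATTERN = re.compile(r"SI13-\d{3}")

def missing_register_link(tickets):
    """Return ticket keys whose description lacks a valid MTTF register reference."""
    bad = []
    for t in tickets:
        m = ID_PATTERN.search(t["description"])
        if not m or m.group(0) not in REGISTER_IDS:
            bad.append(t["key"])
    return bad

tickets = [
    {"key": "OPS-101", "description": "Replace disk shelf (SI13-001)"},
    {"key": "OPS-102", "description": "Rotate certs before expiry"},
]

print(missing_register_link(tickets))  # prints ['OPS-102']
```

Running a check like this before each review cycle keeps the ticket-to-register traceability audit-ready rather than reconstructed at assessment time.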
Enforcement context and risk implications
No public enforcement cases were provided in the source materials for SI-13. For risk framing, treat SI-13 as an outage-and-integrity exposure control: predictable failures are the ones regulators and customers least tolerate because they are preventable with basic reliability discipline. 3
Practical 30/60/90-day execution plan
Use phases, not promises. Adjust based on system criticality and staffing.
First 30 days (stand up the control)
- Assign SI-13 control owner and evidence owner.
- Define system boundary and draft the component/environment list.
- Create the first MTTF register with placeholders and methods per component.
- Select two high-impact component classes and calculate/estimate initial MTTF.
- Define how maintenance tickets will reference SI-13 items.
By 60 days (make it operational)
- Expand MTTF register coverage across the agreed component list.
- Run the first MTTF review with engineering and document decisions.
- Create preventive maintenance tickets based on MTTF outputs.
- Add monitoring or alerting for early failure indicators where feasible.
- Write the SI-13 control narrative and store all artifacts in a single evidence location.
By 90 days (make it repeatable and audit-ready)
- Demonstrate closed-loop execution: MTTF inputs → decisions → executed work → updated register.
- Add exception log and compensating controls where MTTF is still maturing.
- Perform an internal control test: sample components, trace from MTTF to completed actions.
- Package evidence for assessors: narrative, register, sample tickets, review records.
Frequently Asked Questions
Does SI-13 require a mathematically precise MTTF calculation?
SI-13 requires you to “determine” MTTF for in-scope components in their environments, and to be able to explain the method and assumptions. Precision matters less than defensibility and evidence that you act on the results. 1
What counts as an “environment of operation” in cloud systems?
Define environment in a way that changes reliability characteristics, such as region, availability zone design, instance/storage class, and workload profile. Document the definition and apply it consistently in the MTTF register. 1
We rely on managed services. How can we determine MTTF?
Use your incident history and service-impacting events as the primary dataset, then supplement with provider reliability documentation where needed. Keep a written rationale that ties the managed service to your specific deployment environment. 1
What if we don’t have enough failure history to estimate MTTF?
Record an initial engineering assumption with clear inputs, log it as an exception if needed, and define what telemetry you will collect to replace assumptions with measured values. Pair that with compensating controls like redundancy and tighter monitoring. 1
How do auditors typically sample SI-13?
They usually pick a few components from your list and ask you to show the MTTF determination, the environment definition, and resulting maintenance actions. Prepare traceability from register line item to tickets/changes and review notes. 1
Where does Daydream help without turning this into a tooling project?
Daydream helps you map SI-13 to an owner, a documented procedure, and a recurring evidence bundle so the control stays current across audits. Use it to centralize the register, decisions, and proof of execution rather than hunting across engineering systems at assessment time. 1
Footnotes
Operationalize this requirement
Map requirement text to controls, owners, evidence, and review workflows inside Daydream.
See Daydream