Problem management

9 min read · Last verified: March 2026 · By Isaac Silverman

ISO/IEC 20000-1 Clause 8.6.3 requires you to run a full problem management lifecycle so recurring or high-impact incidents get permanently fixed, not repeatedly triaged. You must log, categorize, prioritize, investigate, resolve, and review problems, with evidence that root cause and corrective actions reduce incident likelihood and impact ¹.

Key takeaways:

You need a defined lifecycle that starts with problem detection and ends with a documented review and control improvements.
Auditors will look for traceability from incidents → problem records → root cause → corrective actions → verification and closure.
The fastest path to compliance is standardizing intake, prioritization, RCA, and closure criteria, then enforcing records and ownership.

Problem management is the control that keeps your service desk from becoming a “repeat incident factory.” Under ISO/IEC 20000-1:2018 Clause 8.6.3, the expectation is simple but strict: you manage problems throughout their lifecycle to prevent incidents and minimize impact, and you can prove you did it with complete records ¹.

For a Compliance Officer, CCO, or GRC lead, the operational question is: “What does ‘manage throughout the lifecycle’ mean in a way an auditor can verify?” It means you have a consistent mechanism to (1) identify candidates for problem records (trend, severity, repeat failures), (2) assign accountable owners, (3) run structured investigation and root cause analysis (RCA) appropriate to risk, (4) drive corrective actions through change/release where needed, and (5) review outcomes so you prevent recurrence, not just close tickets.

This page gives requirement-level implementation guidance: scope, roles, step-by-step process, evidence to retain, audit traps, common mistakes, and a practical execution plan you can hand to operations.

Regulatory text

Requirement (excerpt): “The organization shall manage problems throughout their lifecycle to prevent incidents from occurring and minimize impact. Problems shall be logged, categorized, prioritized, investigated, resolved, and reviewed.” ¹

Operator interpretation: you need a defined, repeatable process that covers each lifecycle stage named in the clause. Auditors will test two things:

Design: documented workflow, roles, and criteria for logging/categorization/prioritization/investigation/resolution/review.
Operating effectiveness: completed problem records showing those stages happened, with clear linkage to incidents and evidence that fixes were validated.

Plain-English interpretation of the requirement

Problem management is how you eliminate underlying causes of incidents and reduce business impact when issues recur. The standard expects you to:

Create a problem record when patterns, severity, or risk justify deeper investigation.
Triage consistently (category + priority) so the right issues get the right attention.
Investigate to root cause (or document why root cause is not feasible and what mitigation is used instead).
Deliver permanent fixes (often through change management) and confirm outcomes.
Review the problem after resolution to capture learning and strengthen controls ¹.

Who it applies to (entity and operational context)

Applies to: any organization providing or operating IT services in scope for an ISO/IEC 20000-1 service management system ¹.

Operational contexts where it matters most:

High-availability customer-facing services (where recurring incidents harm customers quickly).
Regulated environments (where recurring outages raise operational resilience and governance concerns).
Complex ecosystems with many third parties (SaaS, cloud hosting, MSPs) where problems cross organizational boundaries.

In-scope teams: IT service management, SRE/operations, application/platform engineering, security operations (where security defects manifest as service incidents), and change/release management. GRC owns oversight: policy/process expectations, evidence, and audit readiness.

What you actually need to do (step-by-step)

Use this as your minimum viable, audit-ready lifecycle.

1) Define triggers for logging a problem

Document criteria that turn “an incident” into “a problem to investigate,” such as:

Repeat incidents with the same symptom/service/component.
A major incident where you need permanent prevention.
A suspected design flaw, capacity constraint, or recurring third-party failure.
A workaround that keeps the lights on but creates ongoing risk.

Control point: require incident-to-problem linkage in your tool (ticket relationship field), so you can show the audit trail.
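
The trigger criteria above can be sketched as a small rule check over incident data. This is a minimal illustration, not a prescribed method: the `Incident` fields, the `problem_candidates` function, and the repeat threshold of three are all assumptions made for the example, and your own triggers and ITSM export format will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical fields; map them to whatever your ITSM tool exports.
    incident_id: str
    service: str
    component: str
    symptom: str
    major: bool = False


def problem_candidates(incidents, repeat_threshold=3):
    """Return (reason, key, incident_ids) tuples for incidents that meet a trigger.

    repeat_threshold is an assumed value; set it from your documented criteria.
    """
    candidates = []

    # Trigger: a major incident where permanent prevention is needed.
    for inc in incidents:
        if inc.major:
            key = (inc.service, inc.component, inc.symptom)
            candidates.append(("major_incident", key, [inc.incident_id]))

    # Trigger: repeat incidents with the same service, component, and symptom.
    groups = {}
    for inc in incidents:
        key = (inc.service, inc.component, inc.symptom)
        groups.setdefault(key, []).append(inc.incident_id)
    for key, ids in groups.items():
        if len(ids) >= repeat_threshold:
            candidates.append(("repeat_incident", key, ids))

    return candidates
```

Each returned tuple carries the incident IDs that justified the candidate, so the incident-to-problem linkage can be populated at the moment the problem record is opened.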

2) Log the problem and assign ownership

For each problem record, capture:

Unique ID, opened date, service/component, affected customers/business process.
Linked incident(s) and monitoring alerts.
Initial impact statement and current workaround (if any).
Problem owner accountable for investigation through closure (not a shared queue).

Practical tip: align ownership to the team that can change the underlying system, not the team that only answers the phone.
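
One way to enforce these intake fields is a simple completeness check at logging time. The sketch below is illustrative only: `ProblemRecord`, `intake_gaps`, and the shared-queue names are hypothetical, and treating a record with no linked incident as a gap is an assumption (a proactively raised problem may legitimately have none).

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical queue names that should never count as an accountable owner.
SHARED_QUEUES = {"service-desk", "ops-queue", "unassigned"}


@dataclass
class ProblemRecord:
    problem_id: str
    opened: date
    service: str
    impact_statement: str
    owner: str  # a named individual, not a shared queue
    linked_incidents: list = field(default_factory=list)
    workaround: str = ""  # empty when no workaround exists


def intake_gaps(record):
    """Return a list of intake gaps; an empty list means the record is complete."""
    gaps = []
    if not record.linked_incidents:
        gaps.append("no linked incidents")
    if not record.impact_statement.strip():
        gaps.append("missing impact statement")
    if not record.owner.strip() or record.owner.lower() in SHARED_QUEUES:
        gaps.append("owner is missing or a shared queue")
    return gaps
```

Running a check like this as a mandatory-field rule in the ticketing tool, rather than as an after-the-fact report, keeps incomplete records from entering the backlog in the first place.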

3) Categorize with a taxonomy that matches how you fix things

Use categories that support routing and trend analysis, for example:

Service, application, infrastructure, network, end-user, third-party, data, change-related.

Avoid “miscellaneous.” If you can’t route it, you can’t manage it.

4) Prioritize using impact + urgency (and document the rationale)

Define a simple prioritization method and enforce it:

Impact: customer/business harm, service downtime, safety/security implications, contractual exposure.
Urgency: likelihood of recurrence, how quickly risk will materialize, dependency deadlines.

Evidence expectation: auditors will look for consistency between the problem priority and the linked incidents’ severity, plus documented rationale for outliers.
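
A priority matrix of this kind can be written down as a lookup table. The following is a sketch under stated assumptions: the three-level scale, the P1 to P4 labels, and the specific cell values are illustrative and not taken from the standard. The `prioritize` function is hypothetical and rejects an override that has no rationale, which mirrors the "documented rationale for outliers" expectation.

```python
LEVELS = ("low", "medium", "high")

# Keys are (impact, urgency). The cell values are illustrative only.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("medium", "high"): "P2",
    ("high", "low"): "P3",
    ("medium", "medium"): "P3",
    ("low", "high"): "P3",
    ("medium", "low"): "P4",
    ("low", "medium"): "P4",
    ("low", "low"): "P4",
}


def prioritize(impact, urgency, override=None, rationale=""):
    """Return the assigned priority plus the matrix result and any rationale."""
    if impact not in LEVELS or urgency not in LEVELS:
        raise ValueError("impact and urgency must each be one of: " + ", ".join(LEVELS))
    computed = PRIORITY_MATRIX[(impact, urgency)]
    # An outlier is allowed only when someone writes down why.
    if override is not None and override != computed:
        if not rationale.strip():
            raise ValueError("an override needs a documented rationale")
        return {"priority": override, "computed": computed, "rationale": rationale}
    return {"priority": computed, "computed": computed, "rationale": rationale}
```

Storing both the computed and the assigned priority makes outliers easy to list during an internal review.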

5) Investigate and perform RCA appropriate to risk

Investigation should be structured, time-bound, and evidence-based:

Collect logs, metrics, traces, configs, recent changes, dependency status, third-party incident notices.
Run RCA using a method your teams can execute consistently (e.g., 5 Whys or fault tree). The standard does not require a specific method; it requires that problems are investigated ¹.

If root cause is unknown: document “root cause not confirmed,” what you tried, and what mitigation you implemented. Closure without learning is an audit magnet unless you show a reasoned decision.

6) Define and track corrective actions (often via change management)

Translate findings into actions with owners and due dates:

Code fix, configuration change, capacity increase, monitoring improvement, runbook update, access/control change, third-party escalation, architectural change.

Where changes affect production, route through your change/release controls. Keep cross-references between problem record, change record, and deployment evidence.

7) Resolve, validate, and close with explicit criteria

Closure should require:

Fix implemented (or accepted risk documented).
Validation evidence: testing, monitoring signals, error rate reduction, or successful post-change verification.
Workaround retired or formally documented as long-term.

Closure criteria (write it down): “Resolved” means the underlying cause is addressed or an approved risk decision exists; “Closed” means review completed and documentation updated.
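
The closure criteria above lend themselves to an explicit gate. As a rough sketch, the function below returns the reasons a problem cannot yet be closed. The dictionary keys (`fix_implemented`, `risk_acceptance_ref`, `validation_evidence`, `workaround_active`, `workaround_long_term_ref`, `review_completed`) are hypothetical field names chosen for the example, not fields of any particular tool.

```python
def closure_blockers(problem):
    """Return the reasons a problem record cannot be closed (empty list = closable).

    `problem` is a dict using hypothetical keys; adapt them to your own schema.
    """
    blockers = []
    fixed = problem.get("fix_implemented", False)
    risk_accepted = bool(problem.get("risk_acceptance_ref"))

    # Either the cause is addressed or an approved risk decision exists.
    if not (fixed or risk_accepted):
        blockers.append("no fix implemented and no documented risk acceptance")

    # A fix without validation evidence is not yet proven.
    if fixed and not problem.get("validation_evidence"):
        blockers.append("fix has no validation evidence")

    # A workaround must be retired or formally documented as long-term.
    if problem.get("workaround_active") and not problem.get("workaround_long_term_ref"):
        blockers.append("workaround still active but not documented as long-term")

    # "Closed" requires the post-resolution review to be done.
    if not problem.get("review_completed", False):
        blockers.append("post-resolution review not completed")

    return blockers
```

Returning the full list of blockers, instead of a single pass or fail flag, gives the problem owner a concrete checklist to finish before closure.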

8) Review the problem (post-resolution)

The clause requires problems be reviewed ¹. Make the review lightweight but real:

What failed (process/technology/third-party)?
What detection gaps existed?
What control/process changes prevent recurrence?
What should be added to knowledge base/runbooks?

This is where you prove continuous improvement, not just ticket throughput.

Required evidence and artifacts to retain

Auditors typically expect a complete chain of evidence. Maintain:

Problem management procedure (lifecycle stages, roles, prioritization method, triggers, closure criteria).
Problem records with required fields populated: category, priority, owner, investigation notes, RCA output, corrective actions, approvals where relevant.
Linkage artifacts: incident-to-problem relationships; problem-to-change/release relationships.
Validation evidence: test results, monitoring snapshots, deployment verification, and rollback records if a rollback was performed.
Review outputs: post-problem review notes, lessons learned, knowledge articles, runbook updates.
Trend reporting: recurring problem themes, backlog, aging, and remediation status (even simple exports are fine if consistent).
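
For the trend reporting item, even a short script over a ticket export can produce a consistent report. The sketch below assumes each problem is a dict with hypothetical `id`, `category`, `status`, and `opened` keys, and uses an assumed 30-day aging threshold; neither the schema nor the threshold comes from the standard.

```python
from collections import Counter
from datetime import date


def backlog_report(problems, today, aging_days=30):
    """Summarize open problems: count, top categories, and IDs past the aging threshold."""
    open_problems = [p for p in problems if p["status"] != "closed"]

    # Recurring themes: which categories dominate the open backlog.
    by_category = Counter(p["category"] for p in open_problems)

    # Aging: open problems older than the threshold, sorted for stable output.
    aging_ids = sorted(
        p["id"] for p in open_problems if (today - p["opened"]).days > aging_days
    )

    return {
        "open_count": len(open_problems),
        "top_categories": by_category.most_common(3),
        "aging_ids": aging_ids,
    }


# Example usage with made-up records.
sample = [
    {"id": "PRB-1", "category": "network", "status": "open", "opened": date(2026, 1, 2)},
    {"id": "PRB-2", "category": "network", "status": "open", "opened": date(2026, 2, 20)},
    {"id": "PRB-3", "category": "third-party", "status": "open", "opened": date(2026, 1, 10)},
    {"id": "PRB-4", "category": "network", "status": "closed", "opened": date(2025, 12, 1)},
]
report = backlog_report(sample, today=date(2026, 3, 1))
```

Passing `today` in explicitly, instead of reading the system clock inside the function, makes the report reproducible when an auditor asks how a past figure was derived.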

If you use Daydream to run your compliance program, map these artifacts to a single “Problem management” control record so evidence collection is continuous and audit prep becomes retrieval, not reconstruction.

Common exam/audit questions and hangups

Expect examiners to ask:

“Show me a problem record from a major incident and walk me through the lifecycle steps.”
“How do you decide what becomes a problem? Who can open one?”
“How do you prioritize problems against feature work?”
“Where is root cause documented, and how do you validate corrective actions worked?”
“How do you ensure problems are reviewed before closure?”
“How do you handle problems caused by a third party, and what evidence do you keep?”

Hangups that cause findings:

Problem records closed with vague notes (“fixed,” “restarted service”) and no RCA.
No evidence that corrective actions went through change control when production was affected.
No review step, or “review” is implied but not documented.

Frequent implementation mistakes and how to avoid them

Treating a problem as a long incident.
Fix: require a problem-specific RCA section and corrective action plan, not just incident timelines.
No consistent triggers, so logging is random.
Fix: define clear triggers and authorize incident managers and SRE leads to open a problem record by default for every major incident.
Backlog becomes a graveyard.
Fix: run a weekly triage for new problems and a recurring review of aging high-priority problems. Document every decision to defer, together with the risk acceptance that supports it.
Weak linkage to changes and releases.
Fix: enforce ticket linking as a gate for production changes tied to problem remediation.
Closing without verification.
Fix: add explicit validation requirements (monitoring checks, test evidence, or post-change verification notes) before closure.

Enforcement context and risk implications

No public enforcement cases were identified for this requirement. Operationally, weak problem management increases the likelihood of repeat incidents, prolonged outages, and uncontrolled “fix-forward” behavior that bypasses change controls. In ISO audits, recurring incidents without corresponding problem records, or problem records without RCA and review, commonly drive nonconformities because they contradict the “prevent incidents and minimize impact” intent ¹.

Practical 30/60/90-day execution plan

First 30 days (stand up the minimum viable lifecycle)

Publish a problem management procedure aligned to the clause stages (log, categorize, prioritize, investigate, resolve, review) ¹.
Configure your ITSM tool: problem ticket type, mandatory fields, linkage to incidents and changes, closure states.
Define triggers for problem logging and a simple priority matrix.
Train service desk leads, incident managers, and SRE/ops leads on when and how to open problems.

Days 31–60 (make it operational and auditable)

Start weekly problem triage and backlog grooming with documented decisions.
Standardize RCA templates and require evidence attachments (logs, timelines, test notes).
Implement a closure checklist: corrective action completed, validation captured, review notes recorded.
Run an internal “mini-audit” on a sample of problem records and fix gaps.

Days 61–90 (prove effectiveness and tighten governance)

Add trend reporting: top recurring categories, aging problems, and remediation status for leadership review.
Integrate third-party coordination: require vendor/third-party incident references and escalation notes in problem records when relevant.
Formalize knowledge capture: convert repeated fixes into known error articles/runbooks.
In Daydream, assign owners, map evidence, and set recurring tasks so record quality stays consistent between audits.

Frequently Asked Questions

What’s the difference between an incident and a problem for ISO/IEC 20000-1?

Incident management restores service; problem management removes the underlying cause and prevents recurrence. ISO/IEC 20000-1 explicitly requires managing problems through logging, categorization, prioritization, investigation, resolution, and review ¹.

Do we need root cause analysis for every problem?

The requirement says problems must be investigated and reviewed; it does not mandate a single RCA method ¹. In practice, scale the depth of RCA to risk, and document your rationale when root cause cannot be confirmed.

Can we close a problem with a workaround?

You can, but only if you document that the workaround is the accepted long-term treatment and capture a review that addresses residual risk and prevention steps ¹. Auditors will expect a clear decision record, not an implied acceptance.

How should we handle third-party-caused problems?

Log the problem internally, link related incidents, and document your investigation steps, third-party communications, and corrective actions under your control. The lifecycle still applies even when the fix depends on a third party ¹.

What evidence is most likely to be requested in an audit?

Complete problem records with linkages to incidents and changes, RCA/investigation notes, corrective actions, and documented reviews. Auditors also ask for the written process that defines categorization and prioritization ¹.

We have engineering tickets in Jira. Is that enough?

It can be, if you can show the required lifecycle steps and traceability: logged problem, category/priority, investigation, resolution, and review, plus linkage to incidents and changes ¹. Many teams keep Jira for engineering work but maintain an ITSM “problem wrapper” for audit-grade evidence.

ISO/IEC 20000-1:2018 Information technology — Service management


Authoritative Sources

ISO/IEC 20000-1:2018
