Problem management

Problem Management is the process responsible for managing the lifecycle of all problems. The primary objectives of problem management are to prevent problems and resulting incidents from happening, to eliminate recurring incidents, and to minimize the impact of incidents that cannot be prevented. The Information Technology Infrastructure Library defines a problem as the cause of one or more incidents.

Scope

Problem Management includes the activities required to diagnose the root cause of incidents identified through the Incident Management process, and to determine the resolution to those problems. It is also responsible for ensuring that the resolution is implemented through the appropriate control procedures, especially Change Management and Release Management.

Problem Management will also maintain information about problems and the appropriate workarounds and resolutions, so that the organization is able to reduce the number and impact of incidents over time. In this respect, Problem Management has a strong interface with Knowledge Management, and tools such as the Known Error Database will be used for both. Although Incident Management and Problem Management are separate processes, they are closely related and will typically use the same tools, and may use similar categorization, impact and priority coding systems. This will ensure effective communication when dealing with related incidents and problems.

Value to business

Problem Management works together with Incident Management and Change Management to ensure that IT service availability and quality are increased. When incidents are resolved, information about the resolution is recorded. Over time, this information is used to speed up the resolution time and identify permanent solutions, reducing the number and resolution time of incidents. This results in less downtime and less disruption to business critical systems.

Process activities, methods and techniques

Problem Management consists of two major processes:

Reactive Problem Management, which is generally executed as part of Service Operation
Proactive Problem Management, which is initiated in Service Operation but generally driven as part of Continual Service Improvement (CSI).

Problem detection

Suspicion or detection of a cause of one or more incidents by the Service Desk, resulting in a Problem Record being raised – Service Desk may have resolved the incident but has not determined a definitive cause and suspects that it is likely to recur.
Analysis of an incident by a technical support group which reveals that an underlying problem exists, or is likely to exist.
Automated detection of an infrastructure or application fault, using event/alert tools automatically to raise an incident which may reveal the need for a Problem Record.
A notification from a supplier or contractor that a problem exists that has to be resolved.
Analysis of incidents as part of proactive Problem Management, drawing on sources such as watch bulletins, releases and relevant papers.

Problem logging

All relevant details of the problem must be recorded so that a full historical record exists. The record must be date- and time-stamped to allow suitable control and escalation, and it must cross-reference the incident(s) that initiated the Problem Record. The details to record include:

Service details
Equipment details
Date/time initially logged
Priority and categorization details
Incident description
Details for all diagnostic or attempted recovery actions taken.
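
The details listed above can be sketched as a simple data structure. This is a minimal illustration only: the class name ProblemRecord, its field names and the sample values are hypothetical and do not come from ITIL or from any particular service desk tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class ProblemRecord:
    """Hypothetical problem record; all field names are illustrative."""
    service: str
    equipment: str
    priority: str
    category: str
    description: str
    incident_ids: List[str]  # cross-references to the originating incidents
    # Date/time initially logged, stamped automatically on creation.
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # Time-stamped diagnostic or attempted recovery actions.
    actions: List[Tuple[datetime, str]] = field(default_factory=list)

    def log_action(self, note: str) -> None:
        """Append a time-stamped diagnostic or recovery action."""
        self.actions.append((datetime.now(timezone.utc), note))


# Made-up example values.
record = ProblemRecord(
    service="email",
    equipment="mail-server-01",
    priority="high",
    category="infrastructure",
    description="Mail queue stalls intermittently",
    incident_ids=["INC-1001", "INC-1007"],
)
record.log_action("Restarted mail transfer agent; queue drained")
```

Stamping the record and each action at creation time keeps the history usable for the control and escalation purposes described above.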

Problem Prioritization

Problems must be categorized and prioritized (severity/priority) in the same way as incidents so that they can be traced. Prioritizing a problem means taking into account the impact of the related incidents and the frequency with which they occur, as well as the severity of the problem itself. From an infrastructure point of view, severity can be assessed by asking:

Can the system be recovered, or does it need to be replaced?
How much will it cost?
How many people will be involved to fix the problem?
How long will it take to fix the problem?
How many additional resources will be involved?
What is the cost of not resolving the problem?

Problem investigation and diagnosis

The result of a problem investigation is a root cause diagnosis or a root cause analysis (RCA) report. The appropriate level of resources and skills should be applied to finding a resolution. A number of problem-solving techniques can help to diagnose and resolve problems.

The Configuration Management System (CMS) must be used to help determine the level of impact and to assist in pinpointing the point of failure.
The Known Error Database (KEDB) should be checked to find out whether the problem has occurred in the past; if so, a resolution should already be in place.
In chronological analysis, the events that triggered the problem are reviewed in chronological order to build a timeline. The purpose is to see which event triggered the next, and so on, or to rule out some possible events.
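
Chronological analysis amounts to ordering the recorded events by time. The sketch below assumes events are available as (timestamp, description) pairs; the function name build_timeline, the timestamp format and the sample events are all made up for illustration.

```python
from datetime import datetime

# Made-up events, deliberately out of order, as they might be collected
# from different logs.
events = [
    ("2016-01-29 10:07", "Users report failed logins"),
    ("2016-01-29 09:58", "Disk usage alert on db-01"),
    ("2016-01-29 10:02", "Database connection errors"),
]


def build_timeline(raw_events):
    """Parse the timestamps and return the events in chronological order."""
    parsed = [
        (datetime.strptime(stamp, "%Y-%m-%d %H:%M"), text)
        for stamp, text in raw_events
    ]
    return sorted(parsed, key=lambda event: event[0])


timeline = build_timeline(events)
for when, what in timeline:
    print(when.isoformat(sep=" ", timespec="minutes"), what)
```

Reading the sorted list from top to bottom shows which event preceded which, but ordering alone does not prove causation; it only narrows down the candidates.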

Pain Value Analysis takes a broader view of the impact of an incident or a problem on the business. Rather than analysing the number of incidents or problems of a particular type in a particular time interval, the technique focuses on in-depth analysis of the level of pain that these incidents or problems have caused to the business. A formula to calculate the level of pain should take into account:

the number of people affected
the duration of the downtime caused
the cost to the business
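
The text does not prescribe a specific formula, so the following is only one possible sketch: a weighted sum of the three factors listed above. The weights, the Impact class, the pain_value function and the sample figures are assumptions made for illustration; an organization would choose its own weighting and units.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    people_affected: int
    downtime_minutes: float
    cost: float  # cost to the business, in any consistent currency unit


def pain_value(impact, w_people=1.0, w_downtime=1.0, w_cost=1.0):
    """Weighted sum of the three factors; the weights are hypothetical."""
    return (
        w_people * impact.people_affected
        + w_downtime * impact.downtime_minutes
        + w_cost * impact.cost
    )


# Made-up figures for two kinds of incident.
incidents = {
    "login outage": Impact(people_affected=200, downtime_minutes=30, cost=1500.0),
    "slow reports": Impact(people_affected=15, downtime_minutes=240, cost=300.0),
}

# Rank incident types from most to least painful.
ranked = sorted(incidents, key=lambda name: pain_value(incidents[name]), reverse=True)
```

With equal weights, the shorter login outage ranks above the longer reporting slowdown because it affects more people and costs more, which is the kind of result that a simple count of incidents would not show.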

The Kepner and Tregoe method is used to investigate deeper-rooted problems. They defined the following stages:

defining the problem
describing the problem in terms of identity, location, time (duration) and size (impact)
establishing possible causes
testing the most probable cause
verifying the true cause

Pareto Analysis or Pareto chart is a technique for separating important potential causes from trivial issues. The following steps should be taken:

Form a table listing the causes and their frequency as a percentage
Arrange the rows in the decreasing order of importance of the causes (the most important cause first)
Add a cumulative percentage column to the table
Create a bar chart with the causes, in order of their percentage of total
Draw a horizontal line at 80% on the Y-axis and, where it intersects the cumulative percentage curve, drop a vertical line to the X-axis. The causes to the left of that point are the primary causes (of network failures, in the example below). These should be targeted first.

Network failures
Causes	Percentage of total	Computation	Cumulative %
Network Controller	35	0% + 35%	35
File corruption	26	35% + 26%	61
Server OS	6	61% + 6%	67
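
The table-building steps can be sketched in code using the three rows shown above. The function names pareto_table and primary_causes are hypothetical. Because the table lists only three causes, the cumulative total stops at 67% and never reaches the 80% line, so every listed cause is returned at the default threshold.

```python
def pareto_table(percentages):
    """Return (cause, percentage, cumulative) rows, largest cause first."""
    rows = []
    cumulative = 0
    ordered = sorted(percentages.items(), key=lambda item: item[1], reverse=True)
    for cause, pct in ordered:
        cumulative += pct
        rows.append((cause, pct, cumulative))
    return rows


def primary_causes(rows, threshold=80):
    """Take causes in order until the cumulative percentage reaches threshold."""
    selected = []
    for cause, _pct, cumulative in rows:
        selected.append(cause)
        if cumulative >= threshold:
            break
    return selected


# Percentages taken from the network failures table, in arbitrary order.
network_failures = {"Server OS": 6, "Network Controller": 35, "File corruption": 26}

rows = pareto_table(network_failures)
for cause, pct, cumulative in rows:
    print(f"{cause}\t{pct}\t{cumulative}")
```

Sorting before accumulating is what makes the cumulative column meaningful: the causes that contribute most appear first, so the cut-off selects the smallest set of causes that covers the threshold.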

Known Error Record

After the investigation is complete and a workaround (or even a permanent solution) has been found, a Known Error Record must be raised and placed in the Known Error Database in order to identify and resolve further similar problems. The main purpose is to restore the affected service as soon as possible with a minimal impact on the business.

It is good practice to raise a Known Error Record even earlier in the investigation, for information purposes.

Major Problem Review

A good practice is to have a review for all major problems. The review should examine:

The steps that were taken correctly
The problems encountered during the implementation of the solution
What needs to be improved
How to prevent the recurrence of similar incidents
Any third party, vendor or supplier involved in the implementation

The knowledge gained from the review should be incorporated into a service review with the business customer, to ensure that the customer is aware of the actions taken and of the plans to prevent similar incidents from occurring in future. This helps to improve customer satisfaction and to assure the business that Service Operation is handling major incidents responsibly and actively working to prevent their recurrence.

References

The New Rational Manager - Describes KT Problem Solving and Decision Making (PSDM)
Offord, Paul (2011). RPR: A Problem Diagnosis Method for IT Professionals. Essex, England: Advance Seven Limited. ISBN 978-1-4478-4443-3.

This article is issued from Wikipedia - version of the Friday, January 29, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.