Incident management (ITSM)

For the emergency response, see Incident management.

Incident management (IcM) is an IT service management (ITSM) process area. The primary goal of the incident management process is to restore normal service operation as quickly as possible and to minimize the impact on business operations, thereby maintaining the best possible levels of service quality and availability. 'Normal service operation' is defined here as service operation within the service-level agreement (SLA). Incident management is one process area within the broader ITIL and ISO 20000 environment.

ISO 20000 defines the objective of Incident management (part 1, 8.2) as: To restore agreed service to the business as soon as possible or to respond to service requests.

Incidents that cannot be resolved quickly by the help desk will be assigned to specialist technical support groups. A resolution or work-around should be established as quickly as possible in order to restore the service.

Several software-based services support incident management. OpsGenie and PagerDuty, for example, both provide alerts, schedules, escalation policies and integrations with other monitoring tools to organize and manage incidents.

Definition

ITIL 2011 defines an incident as:

An unplanned interruption to an IT Service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an incident — for example, failure of one disk from a mirror set. The ITIL incident management process ensures that normal service operation is restored as quickly as possible and the business impact is minimized.

ISO 20000 defines an incident (part 1, 2.7) as:

any event which is not part of the standard operation of a service and which causes or may cause an interruption to, or a reduction in, the quality of that service.

Incidents are the result of service failures or interruptions. The cause of an incident may be apparent and may be addressed without the need for further action. Incidents are often assigned priorities (e.g. P1, P2, P3, P4 or High, Medium, Low) based on the impact and urgency of the failure or interruption.
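As an illustration of how a priority can be derived from impact and urgency, the following is a minimal Python sketch. The three-level scale, the additive scoring and the cut-offs between P1 and P4 are assumptions made for this example; neither ITIL nor ISO 20000 prescribes this particular mapping, and organizations typically define their own matrix.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level scale; a lower number means more severe."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map impact and urgency to a priority label (illustrative matrix only)."""
    score = int(impact) + int(urgency)  # ranges from 2 (worst) to 6 (mildest)
    if score <= 2:
        return "P1"
    if score == 3:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"
```

Under these assumptions, an incident with high impact and high urgency is labelled P1, while one with low impact and low urgency is labelled P4.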

Incidents, problems and known errors

Incidents may match existing 'problems' (without a known root cause) or 'known errors' (with a known root cause) that are under the control of problem management and registered in the known-error database (KEDB). Where workarounds have already been developed, referencing them allows the service desk to provide a quick first-line fix. When an incident is not the result of a problem or known error, it may be an isolated occurrence, or it may (once the initial issue has been addressed) require the involvement of the problem management process, possibly resulting in a new problem record being raised.
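The lookup of an incident against the known-error database can be sketched as follows. This is a simplified Python illustration, not a description of any particular ITSM tool: the `KnownError` record, its fields, the keyword-based matching rule and the sample entry are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class KnownError:
    """Hypothetical KEDB entry: symptoms plus a documented workaround."""
    error_id: str
    symptom_keywords: frozenset
    workaround: str


def find_known_error(description: str, kedb: List[KnownError]) -> Optional[KnownError]:
    """Return the first entry whose keywords all occur in the description."""
    words = set(description.lower().split())
    for entry in kedb:
        if entry.symptom_keywords <= words:  # subset test
            return entry
    return None  # no match: treat as an isolated incident for now


# Made-up sample data for illustration only.
EXAMPLE_KEDB = [
    KnownError("KE-1", frozenset({"vpn", "timeout"}), "Restart the VPN client"),
]

match = find_known_error("VPN timeout after login", EXAMPLE_KEDB)
if match is not None:
    print(match.error_id, "-", match.workaround)
```

A real service desk tool would use richer matching (categories, affected configuration items, free-text search); the subset test here only stands in for that step.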

Problem definition

When multiple occurrences of related incidents are observed, a problem record should be created as a result. The management of a problem differs from the process of managing an incident and is typically performed by different staff and controlled by the problem management process. Root cause analysis is part of problem resolution.
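The idea of raising a problem record once related incidents recur can be sketched in Python. The sketch assumes that incidents are "related" when they affect the same configuration item and that two occurrences are enough to propose a problem record; both the grouping rule and the threshold are assumptions for illustration, and the identifiers are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def propose_problem_records(
    incidents: Iterable[Tuple[str, str]], threshold: int = 2
) -> Dict[str, List[str]]:
    """Group (incident_id, configuration_item) pairs by configuration item.

    Returns only the groups with at least `threshold` incidents, i.e. the
    candidates for a new problem record under this sketch's assumptions.
    """
    by_item: Dict[str, List[str]] = defaultdict(list)
    for incident_id, config_item in incidents:
        by_item[config_item].append(incident_id)
    return {item: ids for item, ids in by_item.items() if len(ids) >= threshold}


# Made-up incident log for illustration only.
EXAMPLE_INCIDENTS = [
    ("INC-1", "mail-server"),
    ("INC-2", "mail-server"),
    ("INC-3", "printer-7"),
]

print(propose_problem_records(EXAMPLE_INCIDENTS))
```

With the sample log above, only the mail server, which has two incidents, is proposed for a problem record; the single printer incident is left as an isolated occurrence.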

Change definition

A request for change (RFC) may be raised to modify an IT service in order to resolve a problem. This is covered by the change management process. An incident or a problem may lead to a change.

Incident management processes

The activities within the incident management process include:

Examples

Incidents should be classified as they are recorded. Examples of incidents by classification are:

Bibliography