Incident management
Incident management (IcM) is a term describing the activities of an organization to identify, analyze, and correct hazards to prevent a future recurrence. Within a structured organization, these incidents are normally dealt with by either an Incident Response Team (IRT) or an Incident Management Team (IMT). These teams are often designated beforehand, or during the event, and are placed in control of the organization while the incident is dealt with, in order to restore normal functions.
Similar to an IRT or IMT is an Incident Command System (ICS). Popular with public safety agencies and jurisdictions in the United States, Canada and other countries, it is growing in practice in the private sector as organizations begin to manage emergencies on their own or co-manage them with public safety agencies. It is a command and control mechanism that provides an expandable structure to manage emergency agencies. Although some of the details vary by jurisdiction, ICS normally consists of five primary elements: command, operations, planning, logistics and finance/administration. Several special staff positions, including public affairs, safety, and liaison, report directly to the incident commander (IC) when the emergency warrants establishment of those positions.
An incident is an event that could lead to loss of, or disruption to, an organization's operations, services or functions.[1] If not managed, an incident can escalate into an emergency, crisis or disaster. Incident management is therefore the process of limiting the potential disruption caused by such an event, followed by a return to business as usual.
Without effective incident management, an incident can rapidly disrupt business operations, information security, IT systems, employees, customers and other vital business functions.[2]
Usually as part of the wider management process in private organizations, incident management is followed by post-incident analysis, which determines why the incident happened despite precautions and controls. This analysis is normally overseen by the leaders of the organization, with a view to preventing repetition of the incident through precautionary measures and, often, changes in policy. The findings are then used as feedback to further develop the security policy and/or its practical implementation. In the United States, the National Incident Management System, developed by the Department of Homeland Security, integrates effective practices in emergency management into a comprehensive national framework. This often results in a higher level of contingency planning, exercise and training, as well as an evaluation of the management of the incident.[3]
Computer security incident management
Today, Computer Security Incident Response Teams (CSIRTs) play an important role because of the rise of internet crime, a common example of an incident faced by companies in developed nations across the world. For example, if an organization discovers that an intruder has gained unauthorized access to a computer system, the CSIRT would analyze the situation, determine the breadth of the compromise, and take corrective action. Computer forensics is one task included in this process. Currently, over half of the world’s hacking attempts on transnational corporations (TNCs) take place in North America (57%), and 23% take place in Europe.[4] This makes the CSIRT a highly prominent player in incident management.
Incident management process, as defined by ITIL
An incident can be defined as an unplanned interruption to an IT service or a reduction in the quality of an IT service (the "incident definition as per V3"). Failure of a configuration item that has not yet impacted service is also an incident; an example would be the failure of one disk from a mirror set.
Under the “incident definition as per V2”, an incident is an event which is not part of the standard operation of a service and which causes, or may cause, disruption to or a reduction in the quality of services and customer productivity. The objective of incident management is to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.
The incident manager is a functional role rather than a position of employment, although it may be both, depending on the hiring organization. Incident management provides the external customer with a focal point for leadership and drive during an event by ensuring that commitments are followed up and that information flows adequately.
The objective of incident management during an incident is service restoration as quickly as possible; the objective is not to make a system perfect. If service can be restored more quickly by a temporary workaround than by correcting the underlying root cause of the issue, then that is acceptable. After service restoration, the underlying root causes are corrected by the problem management team through a process called root cause analysis (RCA). An example of service restoration by temporary workaround is the one carried out on Apollo 13.
The primary focus of incident management is to ensure a prompt recovery of the system by supervising and directing the internal or external resources. Prompt system recovery and minimization of any impact on customers have priority over unreasonably long and intensive data collection for the event root cause investigation.
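The restore-first rule described above can be illustrated with a small Python sketch. The `Option` class, its field names and the example timings are hypothetical and chosen only for illustration; they are not part of ITIL. The function picks whichever option restores service soonest and reports whether a follow-up root cause analysis is still needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """A candidate way of restoring service (hypothetical model)."""
    name: str
    minutes_to_restore: float
    fixes_root_cause: bool


def choose_restoration(options):
    """Return the fastest option and whether RCA must follow.

    The fastest option wins even if it is only a workaround; in that
    case the second return value is True, meaning the root cause still
    has to be handed over to problem management.
    """
    if not options:
        raise ValueError("at least one restoration option is required")
    best = min(options, key=lambda o: o.minutes_to_restore)
    return best, not best.fixes_root_cause


options = [
    Option("fail over to spare disk", 10, fixes_root_cause=False),
    Option("replace disk and rebuild mirror", 240, fixes_root_cause=True),
]
best, needs_rca = choose_restoration(options)
```

In this example the workaround is selected because it restores service sooner, and `needs_rca` is true because the underlying fault remains.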
Incidents can be classified into three primary categories: software (applications), hardware, and service requests. (Note that service requests are not always regarded as incidents, but rather requests for change. However, the handling of failures and the handling of service requests are similar and therefore are included in the definition and scope of the process of incident management.)
ITIL separates incident management into six basic components:
- Incident detection and recording
- Classification and initial support
- Investigation and diagnosis
- Resolution and recovery
- Incident closure
- Ownership, monitoring, tracking, and communication (monitoring the progress of the resolution of the incident and keeping those who are affected by the incident up to date with the status)
From the ITIL point of view, the activities of incident management (ICM), as defined by ITIL v3, are:
- Identification - the incident is detected or reported
- Registration - the incident is registered in an ICM System
- Categorization - the incident is categorized by attributes such as priority and SLA
- Prioritization - the incident is prioritized for better utilization of the resources and the Support Staff time
- Diagnosis - the full symptoms of the incident are established
- Escalation - the incident is escalated should the Support Staff need support from other organizational units
- Investigation and diagnosis - if no existing solution from the past can be found, the incident is investigated and the root cause is found
- Resolution and recovery - once the solution is found the incident is resolved
- Incident closure - the registry entry of the incident in the ICM System is closed by providing the end-status of the incident[5]
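The activities listed above can be read as stages that an incident record passes through. The following Python sketch is one possible way to model that sequence, not a prescribed ITIL implementation. The `Stage` and `Incident` names and the transition table are assumptions made for illustration; in particular, treating escalation and investigation as optional steps is an interpretation of the list.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    IDENTIFIED = "identification"
    REGISTERED = "registration"
    CATEGORIZED = "categorization"
    PRIORITIZED = "prioritization"
    DIAGNOSED = "diagnosis"
    ESCALATED = "escalation"
    INVESTIGATED = "investigation and diagnosis"
    RESOLVED = "resolution and recovery"
    CLOSED = "incident closure"


# Allowed transitions (an assumption): escalation and investigation are
# only entered when the earlier stages did not produce a solution.
ALLOWED = {
    Stage.IDENTIFIED: {Stage.REGISTERED},
    Stage.REGISTERED: {Stage.CATEGORIZED},
    Stage.CATEGORIZED: {Stage.PRIORITIZED},
    Stage.PRIORITIZED: {Stage.DIAGNOSED},
    Stage.DIAGNOSED: {Stage.ESCALATED, Stage.INVESTIGATED, Stage.RESOLVED},
    Stage.ESCALATED: {Stage.INVESTIGATED, Stage.RESOLVED},
    Stage.INVESTIGATED: {Stage.RESOLVED},
    Stage.RESOLVED: {Stage.CLOSED},
    Stage.CLOSED: set(),
}


@dataclass
class Incident:
    summary: str
    stage: Stage = Stage.IDENTIFIED
    history: List[Stage] = field(default_factory=list)
    end_status: Optional[str] = None

    def advance(self, new_stage: Stage, end_status: Optional[str] = None) -> None:
        """Move to new_stage, rejecting transitions the table does not allow."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot go from {self.stage.value} to {new_stage.value}"
            )
        # Closure records the end status of the incident.
        if new_stage is Stage.CLOSED and not end_status:
            raise ValueError("closing an incident requires an end status")
        self.history.append(self.stage)
        self.stage = new_stage
        if end_status:
            self.end_status = end_status
```

A record for a known fault might move from diagnosis straight to resolution, while an unfamiliar fault would pass through escalation or investigation first. Either way, the record can only be closed once an end status has been supplied.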
Incident Manager responsibilities
- understand any incident/fault on a basic level (at least) in order to use the appropriate competences (resources)
- drive the restoration team to gather sufficient information to start an analysis
- maintain a general overview of the incident (keeping the focus on the restoration via a workaround)
- understand the functionality of multiple areas (RAN, Core Network, VAS, BSS/OSS)
- obtain guidance on priorities for the teams starting the immediate, urgent and unexpected recovery work[6]
Incident management software systems
Incident management software systems are designed for collecting consistent, time-sensitive, documented incident report data. Many of these products include features to automate the approval process of an incident report or case investigation, and may also collect real-time incident information such as time and date data. Additionally, incident report systems automatically send notifications and assign tasks and escalations to the appropriate individuals depending on the incident type, priority, time, status and custom criteria. Modern products allow administrators to configure the incident report forms as needed, create analysis reports and set access controls on the data, so that the reports can be customized to suit the organization using the system. Some products can also collect images, video, audio and other data. Some incident management software systems are aimed at specific industries, and some are offered as software-based services: OpsGenie, PagerDuty, and Enterprise Alert all provide alerts, schedules, escalation policies and integrations with other monitoring tools to organize and manage incidents.
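As an illustration of how notifications might depend on priority and elapsed time, the following Python sketch models a simple escalation policy. It does not reflect the API of any of the products named above. The policy table, the role names and the time thresholds are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class EscalationStep:
    after: timedelta  # time since the incident was opened
    notify: str       # role to notify (hypothetical role names)


# Hypothetical policy: higher priorities escalate sooner and further.
POLICIES = {
    "high": [
        EscalationStep(timedelta(0), "on-call engineer"),
        EscalationStep(timedelta(minutes=15), "team lead"),
        EscalationStep(timedelta(minutes=60), "incident manager"),
    ],
    "low": [
        EscalationStep(timedelta(0), "service desk"),
        EscalationStep(timedelta(hours=8), "team lead"),
    ],
}


def recipients(priority, opened_at, now, acknowledged=False):
    """Return the roles that should have been notified by `now`.

    An acknowledged incident stops escalating, so nobody further is
    notified. Unknown priorities raise KeyError.
    """
    if acknowledged:
        return []
    elapsed = now - opened_at
    return [step.notify for step in POLICIES[priority] if step.after <= elapsed]


opened = datetime(2024, 1, 1, 12, 0)
notified = recipients("high", opened, opened + timedelta(minutes=20))
```

Twenty minutes after a high-priority incident is opened without acknowledgement, this policy has reached the on-call engineer and the team lead, but not yet the incident manager.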
Human factors
During the root cause analysis, human factors should be assessed. This text will not go into depth on human factors, but will mention a couple of salient areas that can help ensure that after-action root cause analysis comes to an effective conclusion, taking into consideration all aspects of the causes and effects of an accident or incident. James Reason (1995) conducted a study into the understanding of adverse effects from the perspective of human factors. The following summarises some of its major points and explains the reasoning behind human factors playing a proportionate part in any incident. The study found that major incident investigations, such as those into Piper Alpha and the King's Cross Underground fire, made it clear that the causes of the accidents were distributed widely within and outside the organization. There are two types of event. The first is an active failure, an action that has immediate effects and is likely to cause an accident. The second is a latent or delayed action; such events can take years to have an effect and usually combine with triggering events to cause the accident.
- Active failures
These failures are unsafe acts (errors and violations) committed by those at the "sharp end" of the system (the actual operators of machinery, supervisors of tasks). It is the people at the human-system interface whose actions can, and sometimes do, have immediate adverse consequences.
- Latent failures
They are created as the result of decisions taken at the higher echelons of an organisation. Their damaging consequences may lie dormant for a long time, only becoming evident when they combine with local triggering factors (for example, the spring tide, the loading difficulties at Zeebrugge harbour, etc.) to breach the system's defences.
Decisions taken in the higher echelons of an organization can set in motion events that make an accident more likely; planning, scheduling, forecasting, designing, policy making and similar activities can have a slow-burning effect. The actual unsafe act that triggers an accident can be traced back through the organization, exposing the subsequent failures and revealing the accumulation of latent failures within the system as a whole that made the accident more likely and ultimately led to it happening.
To conclude, most incidents are not just about the actual events that happened. If human factors are studied during the investigation period, the actual chain of latent actions will be discovered. Consequently, better improvement actions can be applied, reducing the likelihood of the event happening again.[7]
Physical Incident Management
Incident management should be considered to be much more than just the analysis of perceived threats and hazards to an organization in order to work out the risk of an event occurring and, therefore, the ability of that organization to conduct business-as-usual activities during the incident. As well as being an important part of the risk management process and of business resilience planning, incident management is a real-time physical activity.
The planning that goes into formulating the response to an incident, be it a disaster, emergency, crisis or accident, is done so that effective business resilience can take place and loss or damage to the organization's tangible or non-tangible assets is kept to a minimum. That planning can only be implemented through efficient physical management of the incident: making the best use of the time and resources available, and understanding how to obtain more resources from outside the organization when needed through clear and timely liaison.
The National Fire Protection Association states that incident management can be described as follows: “When an emergency occurs or there is a disruption to the business, organized teams will respond in accordance with established plans. Public emergency services may be called to assist. Contractors may be engaged and other resources may be needed. Inquiries from the news media, the community, employees and their families and local officials may overwhelm telephone lines. How should a business manage all of these activities and resources? Businesses should have an incident management system (IMS). An IMS is ‘the combination of facilities, equipment, personnel, procedures and communications operating within a common organizational structure, designed to aid in the management of resources during incidents’” (National Fire Protection Association (NFPA), 2013).[8][9]
Physical incident management is very much the real-time response, which may last for hours, days or longer. The United Kingdom Cabinet Office has produced the National Recovery Guidance (NRG), which is aimed at local responders as part of the implementation of the Civil Contingencies Act 2004 (CCA), and it describes the response as follows: “Response encompasses the actions taken to deal with the immediate effects of an emergency. In many scenarios, it is likely to be relatively short and to last for a matter of hours or days – rapid implementation of arrangements for collaboration, co-ordination and communication are, therefore, vital. Response encompasses the effort to deal not only with the direct effects of the emergency itself (eg fighting fires, rescuing individuals) but also the indirect effects (eg disruption, media interest)” (NRG, 2007).[10][11]
The International Organization for Standardization (ISO), which is the world's largest developer of international standards, also makes a point in the description of its risk management principles and guidelines document, ISO 31000:2009, that "Using ISO 31000 can help organizations increase the likelihood of achieving objectives, improve the identification of opportunities and threats and effectively allocate and use resources for risk treatment". This again shows the importance of not just good planning but also effective allocation of resources to treat the risk (ISO 31000, 2009).[12]
See also
- D3 Security Management Systems in the United States and Canada
- National Incident Management System in the United States
- Coordinated Regional Incident Management (Netherlands) in the Netherlands
- PPM 2000 Inc. in the United States and Canada
References
- ↑ Glossary of Terms, The Business Continuity Institute Good Practice Guidelines 2010 Global Edition. thebci.org Retrieved on 2015-09-03.
- ↑
- ↑ About the Contingency Planning and Incident Management Division | Homeland Security. Dhs.gov (1999-02-22). Retrieved on 2012-11-17.
- ↑ Hacking Incidents 2009 – Interesting Data – Roger's Security Blog – Site Home – TechNet Blogs. Blogs.technet.com (2010-03-12). Retrieved on 2012-11-17.
- ↑ ITIL v3 - a Pocket Guide. 2009. p. 136.
- ↑ Wearne, Stephen (2006). Project management journal 37 (5): 97–102.
- ↑ O’Callaghan, Katherine Mary, Incident Management: Human Factors and Minimising Mean Time to Restore, Ph.D. Thesis, Australian Catholic University, 2010.
- ↑ United States of America, National Fire Protection Association (NFPA). (2013) [online]. Available from: http://www.nfpa.org/aboutthecodes/AboutTheCodes.asp?DocNum=1600&cookie%5Ftest=1 [Accessed 10 April 2013].
- ↑ Federal Emergency Management Agency (FEMA). (2012) [online]. Available from: http://www.ready.gov/business/implementation/incident [Accessed 10 April 2013].
- ↑ United Kingdom Cabinet Office, National Recovery Guidance. (2007) [online]. Available from: https://www.gov.uk/national-recovery-guidance [Accessed 10 April 2013].
- ↑ United Kingdom Government legislation, Civil Contingencies Act (CCA) 2004. (2012) [online]. Available from: http://www.legislation.gov.uk/ukpga/2004/36/contents [Accessed 10 April 2013].
- ↑ International Organization for Standardization (ISO). (2009) [online]. Available from: http://www.iso.org/iso/home/standards/iso31000.htm [Accessed 13 April 2013].
External links
- National Incident Management System Consortium in the United States
- ISS 24/7 in the United States
- PPM 2000 Inc. in the United States and Canada, serving clients globally
- D3 Security Management Systems in the United States and Canada, serving clients globally
- United Kingdom Government legislation, Civil Contingencies Act (CCA) 2004. (2012)
- Federal Emergency Management Agency (FEMA). (2012)
- Project Management Software in the United States, serving clients globally
Further reading
- Adam Krug (2014-09-16), "Incident Management Software System Case Studies", Case Studies 1 - 34