Grey Area Diagnosis
From Wikipedia, the free encyclopedia
Grey Area Diagnosis is an activity performed during Incident Management. It is the activity via which the cause(s) of an Incident are determined in order to escalate the Incident to the appropriate team(s) for further investigation and resolution.
Advantages:
Advantages of effective Grey Area Diagnosis include:
- Reduced time to achieve RTO (Return to Operation).
- Improved efficiency through reduced duplication of effort.
- Improved documentation.
- Opportunity to improve people (i.e. capability and knowledge), processes and technology through lessons learned.
Prerequisites:
In order to perform Grey Area Diagnosis, the following are required:
- An Incident ticket logged in the appropriate Incident Management System, with the appropriate information being captured.
- Clearly defined areas of responsibility.
- Competency in the technologies the Incident could potentially be related to.
- Tools and/or procedures.
- Clear and concise Documentation.
- Clear lines of communication with the Incident Requestor.
Detection:
Incident Detection typically occurs via one of two methods:
- An Incident being logged by an affected end-user.
- An abnormal system event being detected by Enterprise Management Systems.
Clarification:
If an Incident is logged by an affected end-user, it is important that particular pieces of information are captured at that point.
They include:
- Clear contact details for the Incident Requestor.
- The sequence of events leading up to the Incident.
- Details of any error messages or screenshots.
- The number of users affected.
- The business impact(s) caused by the Incident.
- Whether any workaround is available.
If any of this information is missing, it must be obtained with urgency.
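As an illustration only, the check for missing information could be sketched as below. The `IncidentReport` class and its field names are hypothetical; they simply mirror the list above and do not correspond to any particular Incident Management System.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncidentReport:
    # Hypothetical field names, one per item in the list above.
    requestor_contact: Optional[str] = None
    sequence_of_events: Optional[str] = None
    error_details: Optional[str] = None
    users_affected: Optional[int] = None
    business_impact: Optional[str] = None
    workaround_available: Optional[bool] = None


def missing_information(report: IncidentReport) -> List[str]:
    """Return the names of fields that have not been captured yet."""
    missing = []
    for f in fields(report):
        value = getattr(report, f.name)
        # None means "not captured"; a blank string is treated as missing too.
        # Values such as 0 or False are valid answers and are not flagged.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(f.name)
    return missing
```

Any names returned by `missing_information` are the items to follow up with the Incident Requestor.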
Initial Investigation:
Investigating the Incident makes it possible to diagnose its possible cause(s). When investigating the Incident, it is important to consider:
- What has changed in order to cause the Incident?
- What has become unavailable in order to cause the Incident?
- What system health checks can be performed?
- Which tools and/or procedures might be needed?
- What system components are performing as expected?
- What system components are performing outside of normal parameters?
- Who is responsible for any of the system components which require further attention?
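The last three questions above can be sketched in code, under some loud assumptions: each health check is a hypothetical callable that returns True when the component is within normal parameters, and a check that raises an exception is treated as "health unknown". The component and team names in the usage note are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a check returns True (normal) or False (abnormal);
# if it raises, the component's health is recorded as unknown.
HealthCheck = Callable[[], bool]


def run_health_checks(
    checks: Dict[str, Tuple[str, HealthCheck]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (component, responsible team) pairs by health status."""
    results: Dict[str, List[Tuple[str, str]]] = {
        "normal": [],
        "abnormal": [],
        "unknown": [],
    }
    for component, (owner, check) in checks.items():
        try:
            status = "normal" if check() else "abnormal"
        except Exception:
            # The check itself failed, so nothing can be concluded.
            status = "unknown"
        results[status].append((component, owner))
    return results
```

For example, calling `run_health_checks({"database": ("DBA team", check_db)})` with a hypothetical `check_db` function yields the components needing further attention under the "abnormal" and "unknown" keys, each paired with the team responsible.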
Documentation:
Once Initial Investigation has been completed, it is very important to document:
- Any changes in the environment which may be contributing to the cause of the Incident.
- Any unavailable system components which may be contributing to the cause of the Incident.
- Any health checks performed on system components.
- The results of those health checks.
- Details of any system components whose health is unknown or abnormal.
- The teams the Incident should be escalated to, with appropriate rationale.
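One way to keep such documentation consistent is a small structured record that is rendered into a plain-text note for the Incident ticket. The sketch below is an assumption about how that might look; the `Findings` class, its field names and the output layout are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Findings:
    # Hypothetical structure, one field per item in the list above.
    incident_id: str
    environment_changes: List[str] = field(default_factory=list)
    unavailable_components: List[str] = field(default_factory=list)
    health_checks: Dict[str, str] = field(default_factory=dict)  # check -> result
    unknown_or_abnormal: List[str] = field(default_factory=list)
    escalations: Dict[str, str] = field(default_factory=dict)  # team -> rationale


def _section(title: str, items: List[str]) -> List[str]:
    # An empty section is stated explicitly so readers know it was considered.
    body = ["- " + item for item in items] if items else ["- none recorded"]
    return [title + ":"] + body


def render_findings(f: Findings) -> str:
    """Render the findings as a plain-text note for the Incident ticket."""
    lines = ["Grey Area Diagnosis findings for " + f.incident_id]
    lines += _section("Environment changes", f.environment_changes)
    lines += _section("Unavailable components", f.unavailable_components)
    lines += _section(
        "Health checks and results",
        [f"{check}: {result}" for check, result in f.health_checks.items()],
    )
    lines += _section("Components with unknown or abnormal health", f.unknown_or_abnormal)
    lines += _section(
        "Escalate to",
        [f"{team}: {why}" for team, why in f.escalations.items()],
    )
    return "\n".join(lines)
```

Recording the rationale next to each team keeps the reason for the escalation with the findings when the Incident is handed over.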
Escalation:
Once the Grey Area Diagnosis findings have been documented, the Incident should be escalated to the appropriate internal and/or external team(s).
When escalating to the appropriate team, it is important to:
- Make the results of any findings available to them.
- Discuss the rationale for the escalation, and ensure any points requiring clarification are identified and resolved promptly.
- Obtain agreement from the team that they will accept the escalated Incident.
- If the team declines the escalation without good reason, the Priority of the Incident may make it necessary to escalate to your line manager.
Lessons learned:
After any Incident has been Returned to Operation, it is often useful to identify lessons learned and apply them to improve People, Processes or Technology for the future.
Some examples are:
- Information capture for future Incidents of this type could be improved.
(Suggestion - Feed back suggestions for improved information capture to Service Desk)
- Tools and/or procedures to perform Health Checks could be improved.
(Suggestion - Identify potential improvements to Health Check tools and/or documentation and implement as appropriate)
- Escalation was not possible due to unclear lines of responsibility.
(Suggestion - Agree lines of responsibility with escalation teams, document and publish to appropriate teams)
- Significant duplication of effort occurred because findings were not clearly documented or made available to teams being escalated to.
(Suggestion - Ensure the importance of documenting findings is communicated to the appropriate teams)
- Grey Area Diagnosis activities were delayed due to a lack of competency in the required technologies.
(Suggestion - Identify potential opportunities for training and development)
- Incident Detection was delayed by a lack of appropriate monitoring.
(Suggestion - Identify gaps and implement appropriate changes to resolve them)