WHITE PAPER
EFFECTIVE INCIDENT AND PROBLEM MANAGEMENT IN THE DYNAMIC IT ENTERPRISE By Venkatesh Pandian Rajagopalan
Abstract This paper discusses effective IT services management (ITSM) processes, specifically the incident management processes, that were implemented by an energy utility client and a food processing client of Infosys. It aims to demonstrate how existing ITSM processes can be improved by relaying the best practices and tailored processes used by Infosys in such infrastructure projects. It also explains how we leveraged a proven problem management technique to prevent enterprise-wide outages and improve the mean time to resolve (MTTR) across the IT landscape. The practices outlined here can be applied to any ITSM framework used in IT operations like ITIL, DevOps, COBIT, etc. Introduction One of our energy utility clients was experiencing at least two to four critical incidents (P1) and five to seven urgent incidents (P2) per week. They wanted to address the critical incidents quickly and effectively to ensure reliable infrastructure operations.
Infosys did an in-depth assessment and customization of the existing ITSM processes. Best practices were used across the project, yielding better outcomes. Incident management, problem management and change management were leveraged across the enterprise to improve day-to-day business operations. Enabling innovation across these processes can help clients avert major outages and minimize downtime of enterprise IT systems.
Effective ITSM incident management practices Incident discovery and questions. Once this is done, the incident becomes valuable to solve the issue on prioritization - Best practices will be treated as ‘critical’ and will follow the time and quickly restore services critical incident management process. • The right decisions made during the 1. Ask the right questions These questions are important to discovery stage will help determine the When any major incident is reported to assure program owners that they are, in correct resources, team and time that the incident manager, it must be followed fact, dealing critical incidents and not is needed to swiftly resolve the critical by robust and rapid decision making to provisioning resources for non-critical issue determine if the incident is business- issues. Further, when critical issues do arise, • Business-critical applications, servers critical. At this initial outage stage, some response time is of paramount importance. and business users that may disrupt key questions must be asked by the services to the organization must be incident manager. These are: 2. Plan ahead to save time documented in Operations Handbook. In preparation of critical incidents, it is • How severe is the Outage caused by the The document should also map recommended to have some key aspects Incident? applications to their corresponding integrated with the existing processes to • How many critical applications or servers. In this way, if a specific server improve critical incident MTTR. Enterprises business users are impacted? is down, users will know which can plan ahead through the following applications may be impacted. Such • Are there any business-critical servers steps: documentation must be done on a that are down? • Incident managers must have weekly basis so that all new enterprise • Is this a security incident that could have valuable enterprise operational data systems become part of the ITSM organization-wide impact? readily available. Pre-planning and process These are some of the common questions documenting the Operations flow of In some cases, the incident may appear to be asked during the initial phase. the IT enterprise is an on-going exercise to be a non-critical issue or outage. Here, Although many incidents are often and needs to be done on a constant the incident manager can downgrade reported as ‘critical’, most do not qualify basis. This data can be very helpful to the issue to a low priority incident that is as critical incidents. It is up to the incident understand which direct and indirect either ‘urgent’ or ‘expedited’. This prevents manager, the technical program manager elements are impacted by the incident. the enterprise IT team from diverting their or the application manager to validate • Any important data that is available focus. that the incident in question is actually at hand during the critical incident critical based on responses to the above
External Document © 2020 Infosys Limited External Document © 2020 Infosys Limited Incident analysis and important information that explains the • List of all applications, servers, users, and communication – Best practices problem in brief. Such criteria should groups affected include: • Methodology of incident management 1. Communicate major incidents to communication stakeholders • Date and time of the critical incident • A critical incident number, which in case When an incident is categorized as • Events leading to the crisis of multiple critical tickets, is the first a business-critical one, all relevant • The current status, which may change reported incident stakeholders must be alerted within 15 to as relevant teams work on resolving the • Contact section that contains the 30 minutes. By this time, incident managers issue contacts of the primary incident may already be engaged in conference • Steps taken until now by different teams manager, technical manager, application calls to commence the resolution phase of in their attempt to contain the incident. manager, and other senior managers the Incident. This should be in reverse chronological order. The steps should be listed briefly, These eight sections should be presented To achieve best results, we recommend with minimal technical description, in a table. The format should be saved for using a standard and organization- in a way that is understandable by all usage at any time. approved incident communication stakeholders template. The template should contain
Major Incident Communication Major incident summary Thursday, 01-JUN-2019 Date and time of critical incident 8:00 AM Events leading to the crisis Current status
Steps taken (listed in descending order) Groups, locations and applications affected
Email, pager Methods used to communicate with clients/stakeholders Technical bridge/WEBEX details:
Incident ticket number INC000000000
Emergency change order ticket number
Incident manager Phone number
Contacts Technical manager (from support team) Phone number
IT manager-in-charge Phone number Table 1: A major incident communication template
Once the initial communication of outage It is always best to compose the and they are interested only in the area has been sent within 15-30 minutes of its communication email in a non-technical of impact, high-level remediation steps, occurrence, the person in charge should manner and explain the situation broadly. current status, and estimated time of arrival send periodic updates every hour until Stakeholders are from different units (ETA). the problem is resolved. Depending on the criticality, these timelines can be revised according to the situation. One may have a leadership call for very critical outages, which would be outside of the Incident management call. The suggested methodology can be adapted by the Incident management or Operations management team.
External Document © 2020 Infosys Limited External Document © 2020 Infosys Limited Incident remediation, resolution 2. Closure of Critical Incident-Reaffirm Infosys recommends sending a daily report and closure – Best practices the solution is working, Initiate problem containing information about critical and management and “One last good urgent incidents to the IT enterprise team. While incident-related communication Communication”. All critical incidents resolved during the must be sent at intervals as the relevant week as well as ongoing critical incidents technical teams work on solving the The incident manager or the technical should be included in the report. problem, it is also important to document manager must get validation results from all the steps or actions taken by various different teams and impacted users. He Integrating incident and problem teams trying to resolve the problem. Each should cross-check that all applications, management as a part of ITSM update must be appended in reverse systems and servers are responding strategy chronological order to the ‘steps taken’ properly. It is important to validate the section. In this way, users will get a clear solution across all the relevant areas. In The scope of proactiveness during summary of the direction of problem many instances, the problem reappears incident management may be limited. resolution. after closing the ticket – a situation that However, there is always a wide scope of must be avoided as much as possible. using problem management to achieve 1. Resolving the Critical Incident better results in the IT environment. For The closing communication should -Enable vendors and external parties, as this reason, one of the most important summarize the issue and provide key necessary. Provide New ETA. aspects of the ITSM process is the problem information about the problem. This The incident manager must play a management practice. should include: proactive role in provisioning additional The more proactive the problem teams as required. In many cases, IT teams • Area where the problem was management, the more successful should reach out to product vendors. encountered and applications/systems enterprises are in avoiding repeated critical Incident managers along with the relevant impacted and restored outages to their IT systems. technical team should facilitate the • Applications, systems, servers, and users remediation discussion with any external that validated the solution Associating problem tickets with parties or vendors. The faster this decision • The team(s) responsible for resolving the critical incidents (P1/P2) is made, the better equipped all teams are issue to address the issue. Any critical incident in the IT landscape • Remediation steps followed to resolve (all P1 and some P2 incidents) must have At the outset of issue resolution, it is the issue now and what can be done in a problem ticket after the closure of the recommended to inform stakeholders the future critical incident. In most ITSM tools, this that a solution has been identified and • Problem ticket details. While this is functionality is integrated. In environments provide them with an updated ETA. If optional, it is recommended to provide where it is not integrated, we strictly the root cause of the problem has been this detail along with the names of the recommend integrating incident and identified, it is better to inform users of the owners of the problem ticket problem tickets. Thus, for any P1 incident, a same so that they know what could have Once the email on problem validation problem ticket is automatically generated caused the issue. Sharing such information and closure summary is sent, all channels by the system. promptly will help improve adoption of the of communication related to the critical problem management practice. Further, P2 incidents that may need problem incident management process are closed. the associated team will be careful in the tickets must be determined by the incident Following this, the problem management future to avoid similar problems. manager (or problem manager) along with process begins. the IT technical manager. A problem ticket is needed for a P2 incident in the following cases:
• When the P2 incident is repetitive • When the P2 incident may escalate to a critical or P1 issue now or in the near future • When the P2 incident is a result of a change ticket that was carried out in the recent past
External Document © 2020 Infosys Limited External Document © 2020 Infosys Limited Overview of the problem ticket process The problem management process can be tailored to the unique needs of every enterprise.
Problem Management Process
P1 incident P2 incident - ser restoration P2 incident – Auto-generated ecurring incidents (Automatically creates a (Automatically creates a (Manually creates a problem (Manually creates a problem problem incident) problem incident) incident as re uired) incident as re uired)
Problem logging