OA-OIT Incident Management Process

Appendix G

Version: 1.7 Date: 10/12/2016

Revision History

Version | Date | Description | Author
1.0 | 11/20/2015 | Initial Draft | Geralyn Matscavage
1.1 | 1/11/2016 | Based on Steering Committee feedback, removed definition of Customer, added ITP-SEC24 policy, and added metric for Number of Incidents per service, by day, by month. | Geralyn Matscavage
1.2 | 1/19/2016 | Based on Steering Committee feedback: added Service Provider definition; updated RACI table; added statement to Metrics for review/trending over 13 month time frame | Geralyn Matscavage
1.3 | 2/18/2016 | Based on Steering Committee feedback: clarified Helpdesk vs Service Desk; updated RACI table; updated Major Incident Team under Roles and Responsibilities; updated Metrics table; added Appendix: Incident Management User Surveys | Geralyn Matscavage
1.3.1 | 4/13/2016 | Added to Procedures Step #9 for an End User to use a new Reopen Button in the SN Tool to reopen a resolved incident (approved by OA IM Process Team SMEs) | Geralyn Matscavage
1.4 | 6/22/2016 | Document clean-up and formatting | Jason Salvaggio, Sue Danella, Geralyn Matscavage
1.5 | 7/14/2016 | Added outage notification procedure, updated RACI, added Incident Priority Table, updated process flow to include major incident work flow, document clean-up and formatting | Sue Danella, Jason Salvaggio, Geralyn Matscavage
1.6 | 8/17/2016 | Updated Work Flows and Priority assignments | Jason Salvaggio
1.7 | 10/12/2016 | Style Revision | Stephen DeSante

V1.7 10/12/2016 - 2 - OA-OIT Incident Management Process

Table of Contents

1 Scope

2 Definitions and Acronyms

3 Policies
3.1 Policy Statements
3.2 Supporting IT Policies

4 Roles and Responsibilities
4.1 Incident Management Process Owner
4.2 Incident Manager
4.3 1st Level Support
4.4 2nd Level Support
4.5 3rd Level Support
4.6 Major Incident Manager
4.7 Major Incident Owner
4.8 Major Incident Team
4.9 Applications Analyst
4.10 Technical Analyst
4.11 IT Operator
4.12 RACI

5 Service Model
5.1 Process Flow Overview
5.2 Overview Procedures
5.3 Process Workflow
5.4 Process Procedures
5.5 Notifications

6 Metrics

Appendix A – Incident Management User Surveys

Appendix B – CoPA Incident Report Procedures

1 Scope

This document provides a unified process which shall be used by all Service Owners within the OA-OIT organization. Its purpose is to align the OA-OIT Incident Management Process with the ITIL standards for consistency, improve service delivery and provide continuous improvement.

ITIL Incident Management aims to manage the lifecycle of all Incidents. The primary objective of Incident Management is to return unexpectedly degraded or disrupted services to users as quickly as possible, in order to minimize business impact.

2 Definitions and Acronyms

The following definitions are from ITIL® v3 documentation. ITIL® is a Registered Trade Mark and a Registered Community Trade Mark of the Office of Government Commerce and is registered in the U.S. Patent and Trademark Office.

Term | Acronym | Definition
After Hours | | After hours concerns any time or day outside of business hours.
Business Hours | | Business hours vary by service.
Category | | A named group of things that have something in common. Categories are used to group similar things together. For example, cost types are used to group similar types of cost. Incident categories are used to group similar types of incident, while CI types are used to group similar types of configuration item.
Customer Facing | | IT services that are seen by the customer. These are typically services that support the customer's business units/business processes, directly facilitating the customer's desired outcome(s).
Enterprise Incident | | Incident caused by the failure of an enterprise-wide application having extensive/widespread impact.
Impact | | A measure of the effect of an Incident on Business Processes. Impact is often based on how Service Levels will be affected. Impact and Urgency are used to assign Priority.
Incident | | An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service. Failure of a Configuration Item that has not yet impacted Service is also an Incident. For example, failure of one disk from a mirror set.
Incident Management | IM | The Process responsible for managing the Lifecycle of all Incidents. The primary Objective of Incident Management is to return the IT Service to Users as quickly as possible. It concentrates on restoring unexpectedly degraded or disrupted services to users as quickly as possible, in order to minimize business impact.
Incident Record | | A Record containing the details of an Incident. Each Incident record documents the Lifecycle of a single Incident.
Information Technology | IT | Information Technology.
Information Technology Policy | ITP | Information Technology Policy.
IT Service | | A Service is a means of delivering value to Customers by facilitating outcomes Customers want to achieve without the ownership of specific costs and risks. An IT Service uses Information Technology to support the Customer's business processes.
Key Performance Indicator | KPI | A metric that is used to help manage an IT service, process, plan, project, or other activity. Key performance indicators are used to measure the achievement of critical success factors. Many metrics may be measured but only the most important of these are defined as key performance indicators and used to actively manage and report on the process, IT service, or activity. They ensure efficiency, effectiveness, and cost effectiveness are all managed.
Major Incidents | | Most severe type of incident, which has the highest impact on the business. Typically there is a pre-defined process for dealing with Major Incidents which includes liaising with Senior Management and Business representatives.
Major Incident Manager | | Has overall responsibility for verifying that Service Level Measurements or Objectives are achieved by managing the impact of Major Incidents.
Major Incident Owner | | Has overall responsibility for managing recovery from a Major Incident. Specific responsibilities may include: managing and owning the Major Incident through service recovery and reviewing classification of the Incident as a Major Incident.
Non-Critical Incident | | An "incident" that is documented in the current tool but is out of scope of this process. Tickets having low-urgency classification encompass easily corrected problems (e.g., manually changing the value of a data element to compensate for an operator error), and those annoyances or anomalies that do not result in an outage or severe degradation of an IT service.
Office of Administration – Office for Information Technology | OA-OIT | Office of Administration – Office for Information Technology.
Operating Level Agreement | OLA | An Agreement between an IT Service Provider and another part of the same Organization. An OLA supports the IT Service Provider's delivery of IT Services to Customers. The OLA defines the goods or Services to be provided and the responsibilities of both parties.
Priority | | A category used to identify the relative importance of an Incident. Priority is based on Impact and Urgency, and used to identify required times for actions to be taken.
Service Desk | | The single point of contact between the service provider and the users. A typical service desk manages incidents and service requests, and also handles communication with the users.
Service Level | | A Service Level is a measured and reported achievement against one or more Service Level Targets. The term Service Level is sometimes used informally to mean service level target.
Service Level Agreement | SLA | An SLA is an agreement between an IT Service Provider and a Customer. The SLA describes the IT service, documents service level targets, and specifies the responsibilities of the IT Service Provider and the Customer. A single SLA may cover multiple IT Services or multiple Customers.
Service Level Objective | SLO | Service Level Objective.
Service Owner | | Is accountable for the availability, performance, quality, and cost of one or more services. Deals directly with the Service Customer or Incident, usually in the context of a Service Level Agreement or Operating Level Agreement. Service Owner is responsible for day-to-day operation of the service.
Service Provider | | An organization supplying services to one or more internal customers or external customers. Service provider is often used as an abbreviation for IT service provider.
Service User | | A person who uses one or several IT services on a day-to-day basis. Service Users are distinct from Customers, as some Customers do not use IT services directly.
Stakeholders | | Person who has an interest in an organization, project, IT service, etc. Stakeholders may be interested in the activities, targets, resources or deliverables. Stakeholders may include customers, partners, employees, shareholders, owners, etc. (The Commonwealth role of Business Process Owner (BPO) will also be considered a stakeholder in this process.)
Timescales | | These will be determined by the Service Level Agreements (SLAs) and Operational Level Agreements (OLAs) that have been agreed between IT and the Business. Underpinning Contracts (UPCs) with 3rd party suppliers also need to be factored in.
Urgency | | A measure of how long it will be until an incident, problem or change has a significant impact on the business. For example, a high-impact incident may have low urgency if the impact will not affect the business until the end of the financial year. Impact and urgency are used to assign priority.
User | | Someone who uses the IT service on a day-to-day basis. Users are distinct from customers, as some customers do not use the IT service directly.

3 Policies

This section describes policies that are used in conjunction with the Incident Management process.

3.1 Policy Statements

1. Incidents and their status must be communicated in a timely and effective manner.
2. A good Service Desk function is in place to coordinate communication about incidents with those impacted by them, as well as working to resolve them.
3. Incidents must be resolved within timeframes acceptable to the Commonwealth.
4. Customer satisfaction is to be maintained at all times.

3.2 Supporting IT Policies

1. ITP-SEC24, IT Security Incident Reporting Policy: the security incident reporting and escalation policy enables the enterprise to respond effectively to security incidents by clearly detailing the roles and responsibilities of all the parties involved.
2. Incidents can be identified by technical staff, detected and reported by event monitoring tools, communicated by users through various forms of communication, or reported by third-party suppliers and partners.
3. The purpose of IM is to restore normal service operation as fast as possible and to mitigate the negative impact on business operations.
4. A key aspect of incident management is providing effective communications with affected stakeholders. Crucially, stakeholders must be advised of the outage, the anticipated time until service is restored, and the actual time of restoration of service.
5. Outage Severity (Impact and Urgency): Determining Factors for Impact and Urgency
   a. Priority Levels and Allowable Response Times
      1. Critical – up to 2 hours to respond to customer
      2. High – up to 4 hours to respond to customer
      3. Medium – up to 8 hours to respond to customer
      4. Low – up to 16 hours to respond to customer
   b. Impact
      1. Extensive/Widespread – Multiple Agencies impacted
      2. Significant/Large – Building or Agency impacted
      3. Moderate/Limited – Department or Floor impacted
      4. Minor/Localized – Single / Multiple Users impacted
   c. Urgency
      1. Critical – Multiple Services are unavailable and this is impacting critical functions for the business area/users' work activities
      2. High – No workaround exists and the issue is impacting the business area/users' daily work activities
      3. Medium – A workaround exists but the issue is degrading functionality of the identified business area or users' daily work activities
      4. Low – A workaround exists or the issue is not impacting the business area or users' daily work activities

Priority Matrix (rows: Impact; columns: Urgency)

Impact \ Urgency | 1 - Critical | 2 - High | 3 - Medium | 4 - Low
1 - Extensive/Widespread | Critical | Critical | High | Low
2 - Significant/Large | Critical | High | Medium | Low
3 - Moderate/Limited | High | High | Medium | Low
4 - Minor/Localized | High | Medium | Medium | Low

Any of the shaded areas above could be classified as a Major incident, depending on the number of users and the business impact.
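As an illustration only, the Priority Matrix and the allowable response times from Section 3.2 can be expressed as a simple lookup. The sketch below is not part of the OA-OIT tooling; the names Impact, Urgency, PRIORITY_MATRIX, RESPONSE_HOURS, assign_priority and max_response_hours are hypothetical and chosen for this example.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Impact levels from Section 3.2 (hypothetical names)."""
    EXTENSIVE = 1    # Extensive/Widespread: multiple agencies impacted
    SIGNIFICANT = 2  # Significant/Large: building or agency impacted
    MODERATE = 3     # Moderate/Limited: department or floor impacted
    MINOR = 4        # Minor/Localized: single / multiple users impacted


class Urgency(IntEnum):
    """Urgency levels from Section 3.2 (hypothetical names)."""
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


# One row per Impact level; columns are ordered by Urgency 1..4,
# copied from the Priority Matrix.
PRIORITY_MATRIX = {
    Impact.EXTENSIVE: ("Critical", "Critical", "High", "Low"),
    Impact.SIGNIFICANT: ("Critical", "High", "Medium", "Low"),
    Impact.MODERATE: ("High", "High", "Medium", "Low"),
    Impact.MINOR: ("High", "Medium", "Medium", "Low"),
}

# Allowable time, in hours, to respond to the customer per priority level.
RESPONSE_HOURS = {"Critical": 2, "High": 4, "Medium": 8, "Low": 16}


def assign_priority(impact: Impact, urgency: Urgency) -> str:
    """Look up the priority for an impact/urgency pair."""
    return PRIORITY_MATRIX[impact][urgency - 1]


def max_response_hours(impact: Impact, urgency: Urgency) -> int:
    """Return the allowable response time for the resulting priority."""
    return RESPONSE_HOURS[assign_priority(impact, urgency)]
```

For example, assign_priority(Impact.SIGNIFICANT, Urgency.HIGH) gives "High", which carries a response window of up to 4 hours.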

4 Roles and Responsibilities

The following roles are associated with the Incident Management process. A single individual may have one or more roles they perform based on the activity and situation.

4.1 Incident Management Process Owner

Profile: The Incident Management Process Owner is responsible for ensuring that the process is being performed according to the agreed and documented process and is meeting the aims of the process definition. There will be one Incident Management Process Owner.

Responsibilities:
• Facilitate the process design
• Document the process
• Define appropriate standards to be employed throughout the process
• Define Key Performance Indicators (KPIs) to evaluate the effectiveness and efficiency of the process and design reporting specification
• Ensure that quality reports are produced, distributed and utilized
• Review KPIs and take action required following the analysis with higher management levels
• Address any issues with the running of the process
• Review opportunities for process enhancements and for improving the efficiency and effectiveness of the process
• Ensure that all relevant staff have the required technical and business understanding; and process knowledge, training and understanding and are aware of their role in the process
• Ensure that the process, roles, responsibilities and documentation are regularly reviewed and audited
• Communicate process information or changes as appropriate to ensure awareness

Authority:
• Verify Incident Managers are performing documented processes. Escalate any breaches of Incident Management Process to higher management levels
• Review proposed changes to the Incident Management Process and take to higher management levels for approval

4.2 Incident Manager

Profile: The Incident Manager performs the day-to-day operational and managerial tasks demanded by the process activities.

Responsibilities:
• The Incident Manager is responsible for the effective implementation of the Incident Management process and carries out the corresponding reporting.
• S/He represents the first stage of escalation for Incidents, should these not be resolvable within the agreed Service Levels
• Function as a point of escalation when required
• Interface with the management, ensuring that the process receives the needed staff resources
• Obtain and verify information to be included in the Incident Management from the authorized Service Owners and other information providers
• Obtain End User satisfaction feedback
• Ensure that Incident Management assessment reviews are scheduled, carried out with customers regularly and are documented with agreed actions
• Ensure that improvement initiatives identified in Incident Management reviews are acted upon and progress reports are provided to customers
• Record and manage all complaints and escalate, where necessary, to reach resolution
• Provide measurement, recording, analysis and improvement options and recommendations
• Organize and maintain the regular Incident Management activities with both Customers and Service Owners including:
   • Review outstanding actions from previous reviews
   • Review recent performance and availability
   • Review Incident Management Levels and Targets, as necessary
   • Review associated agreements and OLAs, as necessary
   • Agree on appropriate actions to maintain / improve Incident Management Service Levels
• Initiate any actions required to maintain or improve Incident Management Service Levels
• Identify improvement opportunities to make the Incident Management process more effective and efficient
• Escalate to the Incident Management Process Owner where the process is not fit-for-purpose
• Promote the Incident Management within the organization, through available communication channels and training IM staff in communication skills where needed
• Coordinate and facilitate Incident Management meetings to assess Service Catalog quality and value and to address any actionable gaps.
• Responsible for sending out all Incident Notification Communications to applicable Stakeholders.

Authority:
• To escalate any breaches of Incident Management Process to higher management levels
• To take remedial action as a result of any process non-compliance
• To approve proposed changes to the Incident Management Process
• To organize training for IT employees

4.3 1st Level Support

Profile: The 1st Level Support provides first-line support for incidents when they occur using the incident management process.

Responsibilities:
• The responsibility of 1st Level Support is to register and classify received Incidents and to undertake an immediate effort in order to restore a failed IT service as quickly as possible.
• If no ad-hoc solution can be achieved, 1st Level Support will transfer the Incident to expert technical support groups (2nd Level Support).
• 1st Level Support also processes Service Requests and keeps users informed about their Incidents' status at agreed intervals.

4.4 2nd Level Support

Profile: The 2nd Level Support takes over Incidents which cannot be solved immediately by means of the 1st Level Support.

Responsibilities:
• 2nd Level Support takes over Incidents which cannot be solved immediately with the means of 1st Level Support.
• If necessary, it will request external support, e.g. from software or hardware manufacturers.
• The aim is to restore a failed IT Service as quickly as possible.
• If no solution can be found, 2nd Level Support will escalate the Incident to 3rd Level Support (if it exists); if there is still no resolution, the Incident will be entered into the Known Error Database.

4.5 3rd Level Support

Profile: The 3rd Level Support is typically located at hardware or software manufacturers (third-party suppliers). Its services are requested by 2nd Level Support if required for solving an Incident. The aim is to restore a failed IT Service as quickly as possible.

Responsibilities:
• 3rd Level Support is typically located at hardware or software manufacturers (third-party suppliers).
• Its services are requested by 2nd Level Support if required for solving an Incident. The aim is to restore a failed IT Service as quickly as possible.

4.6 Major Incident Manager

Profile: The Major Incident Manager has overall responsibility for verifying that Service Level Measurements or Objectives are achieved by managing the impact of Major Incidents.

Responsibilities:
• Managing Major Incidents
• Tailoring and maintaining the Major Incident framework
• Participating in the Problem and Change processes to find and alleviate the root cause and to verify that objectives for availability of services are met
• Collecting Major Incident measurement data for reports, as needed
• Performing post Major Incident follow up via Post Incident Review on Major Incidents, as required
• Providing communications to service delivery teams
• Responsible for sending out all Incident Notification Communications to applicable Stakeholders.

Authority:
• Managing Major Incidents
• Overall responsibility for verifying that Service Level Measurements or Objectives are achieved by managing the impact of Major Incidents.

4.7 Major Incident Owner

Profile: The Major Incident Owner has overall responsibility for managing recovery from a Major Incident. The Service Owner provides the individual to be the Major Incident Owner.

Responsibilities:
• Managing and owning the Major Incident through service recovery
• Reviewing classification of the Incident as a Major Incident
• Determining and handling the scope of the Major Incident
• Driving, assessing and handling the recovery plan
• Assembling a team of resolver groups (other levels of support and across platforms as required) if additional support is required within the allowable time
• Confirming that internal notification and escalation activities are executed
• Facilitating conference bridges, as needed
• Handling Incident determination activities
• Confirming that the Incident Analyst (Resolver) contacts the Requester to confirm that the service has been restored to their satisfaction
• Making service restoration/recovery decisions (engaging the service delivery organization as required)
• Reviewing that the progress of the Major Incident recovery and relevant times are documented in the associated Incident Record(s)

Authority:
• To escalate any breaches of Incident Management Process to higher management levels
• To take remedial action as a result of any process non-compliance

4.8 Major Incident Team

Profile: An Incident Team formed to concentrate on the resolution of a Major Incident.

Responsibilities:
• A dynamically established team of IT managers and technical experts, usually under the leadership of the Incident Manager, formed to concentrate on the resolution of a Major Incident.

4.9 Applications Analyst

Profile: The Applications Analyst is an Application Management role which manages applications throughout their lifecycle.

Responsibilities:
• There is typically one Applications Analyst or team of analysts for every key application.
• This role plays an important part in the application-related aspects of designing, testing, operating and improving IT services.
• It is also responsible for developing the skills required to operate the applications required to deliver IT services.

4.10 Technical Analyst

Profile: The Technical Analyst is a Technical Management role which provides technical expertise and support for the management of the IT infrastructure.

Responsibilities:
• There is typically one Technical Analyst or team of analysts for every key technology area.
• This role plays an important part in the technical aspects of designing, testing, operating and improving IT services.
• It is also responsible for developing the skills required to operate the IT infrastructure.

4.11 IT Operator

Profile: The IT Operators are the staff who perform the day-to-day operational activities.

Responsibilities:
• For instance, this role will ensure that all day-to-day operational activities are carried out in a timely and reliable way.

4.12 RACI

This RACI model is used to clarify operational roles, responsibilities, and relationships; define levels of accountability; and coordinate participation in the Incident Management Process.

R – Responsible – Those who perform an activity or make a decision to complete an activity. Responsibilities may be shared.

A – Accountable – The individual who is ultimately accountable for the correct and thorough completion of an activity. There is only one person accountable for each activity.

C – Consulted – Those who need to be consulted or provide input before an activity is performed or a decision is made.

I – Informed – Those who need to be informed as or after an activity is performed or a decision is made. For example, they may receive outputs from an activity or need to be kept up-to-date on progress or completion of an activity.

The RACI model below represents the practitioners within the process. As the Incident Management Process Owner is not a practitioner of the process, this role is not included in the RACI model.

ITIL Role / Sub-Process | Incident Manager | 1st Level Support | 2nd Level Support | 3rd Level Support | Major Incident Team | Service Owner* | Service Provider*
Incident Management Support | AR | I | I | I | I | CI | CI
Incident Logging and Categorization | AR | RI | I | I | I | CI | CI
Immediate Incident Resolution by 1st Level Support | AR | R | I | _ | _ | CI | CI
Incident Resolution by 2nd Level Support | AR | I | R | _ | _ | CI | CI
Incident Resolution by 3rd Level Support | AR | I | I | R | I | CI | CI
Handling of Major Incidents | AR | R | R | R | R | RC | RC
Incident Monitoring and Escalation | AR | R | R | R | I | CI | CI
Incident Closure and Evaluation | AR | R | CI | CI | CI | CI | CI
Pro-Active User Information | AR | RI | I | I | I | CI | CI
Incident Management Reporting | AR | I | I | I | I | I | I
Incident Management Outage Communications | R | I | I | I | I | A | I

*Definitions for Service Owner and Service Provider are listed under the Definitions and Acronyms Section.
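The RACI rule that each activity has exactly one Accountable party can be checked mechanically. The sketch below transcribes the table above into a dictionary and checks it at the role level, which is a simplification: the rule is stated per person, while the table lists roles. The names ROLES, RACI, accountable_roles and find_violations are hypothetical and used only for illustration.

```python
# Column order of the RACI table.
ROLES = (
    "Incident Manager",
    "1st Level Support",
    "2nd Level Support",
    "3rd Level Support",
    "Major Incident Team",
    "Service Owner",
    "Service Provider",
)

# One entry per sub-process; codes are in the same order as ROLES.
# "_" marks a role with no involvement in that sub-process.
RACI = {
    "Incident Management Support": ("AR", "I", "I", "I", "I", "CI", "CI"),
    "Incident Logging and Categorization": ("AR", "RI", "I", "I", "I", "CI", "CI"),
    "Immediate Incident Resolution by 1st Level Support": ("AR", "R", "I", "_", "_", "CI", "CI"),
    "Incident Resolution by 2nd Level Support": ("AR", "I", "R", "_", "_", "CI", "CI"),
    "Incident Resolution by 3rd Level Support": ("AR", "I", "I", "R", "I", "CI", "CI"),
    "Handling of Major Incidents": ("AR", "R", "R", "R", "R", "RC", "RC"),
    "Incident Monitoring and Escalation": ("AR", "R", "R", "R", "I", "CI", "CI"),
    "Incident Closure and Evaluation": ("AR", "R", "CI", "CI", "CI", "CI", "CI"),
    "Pro-Active User Information": ("AR", "RI", "I", "I", "I", "CI", "CI"),
    "Incident Management Reporting": ("AR", "I", "I", "I", "I", "I", "I"),
    "Incident Management Outage Communications": ("R", "I", "I", "I", "I", "A", "I"),
}


def accountable_roles(activity: str) -> list:
    """Return the roles whose RACI code for this activity includes 'A'."""
    return [role for role, code in zip(ROLES, RACI[activity]) if "A" in code]


def find_violations() -> list:
    """Return activities that do not have exactly one Accountable role."""
    return [name for name in RACI if len(accountable_roles(name)) != 1]
```

With the table as transcribed, find_violations() returns an empty list: the Incident Manager is Accountable for every sub-process except Outage Communications, where the Service Owner is.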

5 Service Model

5.1 Process Flow Overview

The Incident Management Process starts with a degradation of an IT Service. IM is reactive in nature. Typical triggers include:
• User calling the Service Desk
• User reporting an issue on the Self-Service Portal
• Event Monitoring Tools
• Proactive Monitoring by Resolver Groups
• Third-Party/Suppliers reporting issues
• Incidents raised from Service Requests

The following Incident Management Process workflow illustrates the steps required for the process.

[Figure: OA-OIT Incident Management Process Flow. Swimlanes: 1st Level Support, 2nd Level Support, 3rd Level Support, and Relationships (Problem Management, Change Management, Knowledge Management, Service Level Management, and Asset & Configuration Management processes). The numbered steps shown in the figure are described in Section 5.2.]

5.2 Overview Procedures

Step Description Owner

1. TRIGGER: an incident is triggered by user User, Event Management calling the service desk or completes a web- based incident-logging screen or automatically via event management tools.

2. Incident Registration: During this step, the 1st Level Support / Service Desk, Self-submittal or event- Automated monitoring tool will register the incident by creating a ticket in the ITSM system. When the ticket is created, a notification will be sent to the end user and will provide the Incident number. During this step, the Service Desk will follow all existing process and documentation requirements. The Incident at this time will be Classified as P1; P2; P3; P4 (See Priority Incident Table). P1 Incident will be considered a Major Incident and will follow Major Incident Steps.

2a Is this a Major Incident? If Yes, go to Major Incident Management Process flow. If No, go to Step #3.

3. Incident Resource Assignment: Once the 1st Level Support / incident has been registered, it will be Automated assigned to a resolver group for investigation, troubleshooting and resolution. During this part of the incident lifecycle need to determine, based on Categorization, what Infrastructure Support Team the ticket should be assigned to. The ticket will be automatically sent to the assignment group.

4. Information Gathering/Troubleshooting: 1st Level Support / When a ticket is assigned, the Service Desk Automated will follow procedures to ensure tickets are dr iven to resolution.

V1.7 10/12/2016 - 20 - OA-OIT Incident Management Process

Step Description Owner

5. Incident Escalation? During an incident’s 1st Level Support lifecycle, there could be additional environmental or operation issues that impact the way in which an incident is attended to. During this step an incident is reviewed to ensure the Technical Teams are making progress on resolving the incident and if it’s determined that the progress is not being met the reviewer will take step to escalate the incident. Escalation points will be identified prior to a new service being introduced into the production environment. If Yes, go to Step #6. If No, go to Step #7.

6. Incident Escalation Handling: During an 1st Level Support / incident’s lifecycle, there could be additional Incident Manager environmental and operational issues that impact the way in which an incident is managed. During this step an incident is reviewed to ensure the Technical Teams are making progress on resolving the incident and if its determined that progress is not being met the reviewer will take steps to escalate the incident. When an Incident is escalated, there are two defined types of escalation paths that are followed: Functional where the support of a higher level specialist is needed to resolve the problem. Hierarchical where a manager with more authority needs to be consulted in order to make decisions that are beyond the responsibilities assigned to the level, example to assign more resources in order to resolve a specific incident. Go to Step #10.

7. Incident Resolved: At this point of the 1st Level Support / process, the Technical Teams have taken the Incident Manager appropriate actions to resolve the reported issue. During this stage the reported issues status will be changed to Resolved.

V1.7 10/12/2016 - 21 - OA-OIT Incident Management Process

Step Description Owner

8. Follow Up with User: This status change 1st Level Support / will initiate a message to the end user Incident Manager identifying that resolution has been provided. The notification will enable the end user to confirm that they agree the issue is resolved. The end user will have 7 days to report to the Technical Teams that the issue is not resolved. If the end user does not respond within 7 days, the incident will be closed. Change Required? Many times during the lifecycle of an incident, issues are identified that require a change to the environment to enable the resolution of the reported incident. These changes are managed in the Change Management Process.

9. End User Validates IM Solution?: At this stage, the end user receives a notification from the ITSM Tool stating that the Support Team believes the issue has been resolved. The end user then needs to validate and confirm that the issue is in fact resolved. If the issue is not resolved, the End User can use the Reopen button, which appears only in the end user's self-service view and only for an incident in the Resolved state. If the End User uses the Reopen button, the Incident is placed back in the Assigned state and the ticket is assigned to the current assignment group, because that group resolved the ticket. If the End User does not respond within the 7-day period, the incident is automatically closed, after which it cannot be reopened. If Yes, go to Step #22. If No, go to Step #4.
Owner: User

10. Incident Resource Assignment: Once the incident has been registered, it is assigned to a resolver group for investigation, troubleshooting and resolution. During this part of the incident lifecycle, the Infrastructure Support Team that should receive the ticket is determined based on Categorization. The ticket is then automatically sent to the assignment group.
Owner: 2nd Level Support

11. Information Gathering/Troubleshooting: When a ticket is assigned, the Service Desk will follow procedures to ensure tickets are driven to resolution. Go to Problem Management for a possible Work Around and/or registration of the Incident as a Problem. See OA-OIT Problem Management Process.
Owner: 2nd Level Support

12. Incident Escalation? During an incident's lifecycle, additional environmental or operational issues can affect how the incident is handled. During this step the incident is reviewed to ensure the Technical Teams are making progress toward resolution; if progress is not being made, the reviewer takes steps to escalate the incident. Escalation points will be identified before a new service is introduced into the production environment. If Yes, go to Step #16. If No, go to Step #13.
Owner: 2nd Level Support

13. Incident Resolved: At this point in the process, the Technical Teams have taken the appropriate actions to resolve the reported issue. During this stage the reported issue's status is changed to Resolved.
Owner: 2nd Level Support

14. Follow Up with User: The change to Resolved status initiates a message to the end user stating that a resolution has been provided. The notification enables the end user to confirm that they agree the issue is resolved. The end user has 7 days to report to the Technical Teams that the issue is not resolved. If the end user does not respond within 7 days, the incident will be closed.
Change Required? During the lifecycle of an incident, issues are often identified that require a change to the environment before the reported incident can be resolved. These changes are managed in the Change Management Process.
Owner: 2nd Level Support

15. End User Validates IM Solution?: At this stage, the end user receives a notification from the ITSM Tool stating that the Support Team believes the issue has been resolved. The end user then needs to validate and confirm that the issue is in fact resolved. If the issue is not resolved, the End User can use the Reopen button, which appears only in the end user's self-service view and only for an incident in the Resolved state. If the End User uses the Reopen button, the Incident is placed back in the Assigned state and the ticket is assigned to the current assignment group, because that group resolved the ticket. If the End User does not respond within the 7-day period, the incident is automatically closed, after which it cannot be reopened. If Yes, go to Step #22. If No, go to Step #11.
Owner: 2nd Level Support

16. Incident Escalation Handling: During an incident's lifecycle, additional environmental and operational issues can affect how the incident is managed. During this step the incident is reviewed to ensure the Technical Teams are making progress toward resolution; if progress is not being made, the reviewer takes steps to escalate the incident. When an Incident is escalated, one of two defined escalation paths is followed:
▪ Functional: the support of a higher-level specialist is needed to resolve the problem.
▪ Hierarchical: a manager with more authority needs to be consulted to make decisions beyond the responsibilities assigned to the current level, for example to assign more resources to resolve a specific incident.
Owner: 3rd Level Support

17. Incident Resource Assignment: Once the incident has been registered, it is assigned to a resolver group for investigation, troubleshooting and resolution. During this part of the incident lifecycle, the Infrastructure Support Team that should receive the ticket is determined based on Categorization. The ticket is then automatically sent to the assignment group.
Owner: 3rd Level Support

18. Information Gathering/Troubleshooting: When a ticket is assigned, the Service Desk will follow procedures to ensure tickets are driven to resolution. Go to Problem Management for a possible Work Around and/or registration of the Incident as a Problem. See OA-OIT Problem Management Process.
Owner: 3rd Level Support

19. Incident Resolved: At this point in the process, the Technical Teams have taken the appropriate actions to resolve the reported issue. During this stage the reported issue's status is changed to Resolved.
Owner: 3rd Level Support

20. Follow Up with User: The change to Resolved status initiates a message to the end user stating that a resolution has been provided. The notification enables the end user to confirm that they agree the issue is resolved. The end user has 7 days to report to the Technical Teams that the issue is not resolved. If the end user does not respond within 7 days, the incident will be closed.
Change Required? During the lifecycle of an incident, issues are often identified that require a change to the environment before the reported incident can be resolved. These changes are managed in the Change Management Process.
Owner: 3rd Level Support

21. End User Validates IM Solution?: At this stage, the end user receives a notification from the ITSM Tool stating that the Support Team believes the issue has been resolved. The end user then needs to validate and confirm that the issue is in fact resolved. If the issue is not resolved, the End User can use the Reopen button, which appears only in the end user's self-service view and only for an incident in the Resolved state. If the End User uses the Reopen button, the Incident is placed back in the Assigned state and the ticket is assigned to the current assignment group, because that group resolved the ticket. If the End User does not respond within the 7-day period, the incident is automatically closed, after which it cannot be reopened. If Yes, go to Step #22. If No, go to Step #18.
Owner: User

22. Incident Closure: When an incident is placed in this status, no additional actions can be performed. If an end user reports that the issue has recurred, a new incident must be created and linked to the Closed incident to provide a historical perspective to the Technical Teams working the newly created incident. Note: Handling of any 3rd Level Support incident could potentially determine that an incident (or incidents) needs to go to Problem Management.
Owner: 1st Level Support / 2nd Level Support / 3rd Level Support

23. IM and SLA Reporting: During this final stage of the process, the Incident is closed after 7 days of being in Resolved, and reporting can commence. The ITSM tool provides data to the Reporting Team to enable reporting for SLA purposes.
Owner: 1st Level Support / ITSM Tool

24. Process Ends
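
The Resolved, Reopen and auto-close rules in the steps above can be sketched as a small state model. This is a minimal illustration, not the ITSM Tool's implementation: the Incident class, its field names and the function names are hypothetical, and treating exactly 7 elapsed days as the point of closure is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Confirmation window taken from the procedure text (7 days in Resolved).
CONFIRMATION_WINDOW = timedelta(days=7)


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    number: str
    state: str = "Assigned"  # "Assigned", "Resolved" or "Closed"
    assignment_group: str = ""
    resolved_at: Optional[datetime] = None


def resolve(incident: Incident, now: datetime) -> None:
    """Incident Resolved step: the support team sets the status to Resolved."""
    incident.state = "Resolved"
    incident.resolved_at = now


def reopen(incident: Incident) -> None:
    """End User presses Reopen; allowed only while the incident is Resolved."""
    if incident.state != "Resolved":
        raise ValueError("Only an incident in the Resolved state can be reopened")
    # The ticket returns to Assigned and stays with the current assignment group.
    incident.state = "Assigned"
    incident.resolved_at = None


def auto_close(incident: Incident, now: datetime) -> bool:
    """Close the incident once it has been Resolved for the full window."""
    if incident.state == "Resolved" and incident.resolved_at is not None:
        if now - incident.resolved_at >= CONFIRMATION_WINDOW:
            incident.state = "Closed"
            return True
    return False
```

A Closed incident rejects reopen(), which mirrors the rule that a recurrence must be logged as a new incident linked to the Closed one.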

5.3 Process Workflow

The following Major Incident Management Process workflow illustrates the steps required for the process.

[Figure: OA-OIT Major Incident Management Process Flow. Swimlane diagram with lanes for Agency POC, Major Incident Manager, Major Incident Owner / Major Incident Team, and Incident Manager (entry point: P1 Incident / Major Incident). Steps shown: 1. Receive Notification of Major Incident and Assign MI Owner; 2. Coordinate Recovery from the Major Incident; 3. Perform Major Incident Notification; 4. Receive Communication about Major Incident; 5. Identify / Develop Recovery Plan; 6. Is this a Disaster?; 7. Declare a Disaster; 8. IT Service Continuity Management; 9. Major Incident Resolved?; 10. Is Problem Management Required?; 11. Problem Management; 12. Notify Distribution that a Major Incident Was Resolved; 13. Receive Notification of Resolution; 14. Update Record, Close Major Incident.]

5.4 Process Procedures

Step Description Owner

1. Receive Notification of Major Incident and Assign MI Owner
Receive notification, typically from:
▪ Incident Manager who determines that the Incident is a Major Incident
▪ Automated tool that sends information directly to a 2nd level Resolver
Proceed in parallel to:
▪ Coordinate Recovery from the Major Incident
▪ Perform Major Incident Notification
▪ Identify / Develop Recovery Plan
Owner: Major Incident Manager

2. Coordinate Recovery from the Major Incident
Manage the Incident determination and Incident recovery activities, which may include the following:
▪ Determine scope of Major Incident
▪ Confirm appropriate staff are working on the Incident to minimize the duration
▪ Assemble service recovery team
Proceed to Is this a Disaster?
Owner: Major Incident Owner / Major Incident Team

3. Perform Major Incident Notification
The following are required during the course of a Major Incident to keep the appropriate support teams and management updated on the status and progress of the Incident:
▪ Verify that the State is notified and kept up to date with the resolution of the Incident
▪ Verify that internal management is notified and kept up to date with the progression of the Incident
▪ Verify that there is escalation for additional resources/visibility at prescribed intervals
▪ Verify that details of the Incident determination/Incident recovery are documented
Proceed to Is this a Disaster?
Owner: Major Incident Manager

4. Receive Communication about Major Incident
Receive ongoing communications related to the Major Incident.
Owner: Agency POC

5. Identify / Develop Recovery Plan
Develop a recovery plan to assist in the recovery from the Major Incident.
Owner: Major Incident Owner / Major Incident Team

6. Is this a Disaster?
▪ If Yes, proceed to Declare a Disaster
▪ If No, proceed to Major Incident Resolved?
Owner: Major Incident Owner / Major Incident Team

7. Declare a Disaster
Declare the Major Incident a disaster.
Owner: Major Incident Owner / Major Incident Team

8. IT Service Continuity Management
Invoke the IT Service Continuity Management process to engage the ITSCM Team for assistance.
Owner: Major Incident Owner / Major Incident Team

9. Major Incident Resolved?
Continually track the status of the Major Incident and determine if the Major Incident has been resolved. (The Major Incident is considered resolved when the viability of the bypass or service restoration has been ascertained and the Incident has been documented, reviewed and recorded in the Incident Record.)
▪ If Yes, proceed to the following parallel paths:
   - Is Problem Management Required?
▪ If No, return to the following parallel paths:
   - Coordinate Recovery from the Major Incident
   - Perform Major Incident Notification
   - Identify / Develop Recovery Plan
Owner: Major Incident Owner / Major Incident Team

10. Is Problem Management Required?
▪ If Yes, proceed in parallel paths to Problem Management and Notify Distribution That Major Incident Was Resolved
▪ If No, proceed to Notify Distribution That Major Incident Was Resolved
Owner: Major Incident Owner / Major Incident Team

11. Problem Management
Invoke the Problem Management process to perform the Major Incident Review, which is intended to prevent the recurrence of related Incidents and to promote the continuous improvement of service delivery. The objectives are:
▪ Clearly identify the root cause of the Problem
▪ Identify process/procedure compliance issues and/or deficiencies
▪ Provide timely information regarding the known error and its resolution
▪ Confirm action items are identified and logged
▪ Propagate knowledge learned to other platforms and teams
▪ Update Knowledge Bases, as needed
Owner: Major Incident Owner / Major Incident Team

12. Notify Distribution that Major Incident Was Resolved
Verify that the State is made aware of the resolution.
Owner: Major Incident Manager

13. Receive Notification of Resolution
Receive final notification that the Major Incident was resolved.
Owner: Agency POC

14. Update Record, Close Major Incident
Update the record with actions taken and contacts made. Close the Major Incident activities, for example by closing down communication bridges and performing communication-related activities. The Handle Major Incident flow is complete.
Owner: Major Incident Owner / Major Incident Team
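
The three decision points in the procedure above (Steps 6, 9 and 10) can be sketched as routing functions that return the step or parallel steps to run next. This is an illustrative sketch only; the function names are hypothetical and the step labels are copied from the table.

```python
from typing import List


def after_disaster_check(is_disaster: bool) -> List[str]:
    """Step 6: route on the 'Is this a Disaster?' decision."""
    if is_disaster:
        return ["7. Declare a Disaster"]
    return ["9. Major Incident Resolved?"]


def after_resolution_check(resolved: bool) -> List[str]:
    """Step 9: continue to Step 10, or return to the parallel recovery paths."""
    if resolved:
        return ["10. Is Problem Management Required?"]
    return [
        "2. Coordinate Recovery from the Major Incident",
        "3. Perform Major Incident Notification",
        "5. Identify / Develop Recovery Plan",
    ]


def after_problem_check(problem_mgmt_required: bool) -> List[str]:
    """Step 10: notification always runs; Problem Management runs in parallel if required."""
    steps = ["12. Notify Distribution that Major Incident Was Resolved"]
    if problem_mgmt_required:
        steps.insert(0, "11. Problem Management")
    return steps
```

Returning a list makes the parallel paths explicit: a single entry means one next step, while several entries are meant to proceed at the same time.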

5.5 Notifications

The Incident process notifications are:

Incident Submission: The Requester who submits the Incident will receive an email notification that the request has been submitted. If a person is identified in the Contact fields, they also receive an email notification that the Incident has been submitted.

Incident Assignment: The Incident Assignment Group receives a notification after the Incident is assigned to their Work Group, in an Assigned status. The Incident Owner group receives a notification when the Incident is assigned to the Work Group. The Incident Assignee receives a notification when the Incident is assigned to them.

Incident Status: Notifications are sent at defined intervals to appropriate stakeholders. The intervals for notifications and stakeholders are defined when a new service is introduced into the production environment.

Incident Resolution: The Requester receives an email notification after the Incident is moved to a resolved status.
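
The recipient rules for the submission, assignment and resolution notifications can be sketched as a lookup table. The event keys, field names and sample values below are hypothetical and are not taken from the ITSM Tool. Incident Status notifications are left out because their intervals and stakeholders are defined per service.

```python
from typing import Dict, List, Optional

# Which incident fields are notified for each event, per the descriptions above.
# The event keys and field names are hypothetical, chosen for this sketch.
RECIPIENT_FIELDS: Dict[str, List[str]] = {
    "submission": ["requester", "contact"],
    "assignment": ["assignment_group", "owner_group", "assignee"],
    "resolution": ["requester"],
}


def recipients(event: str, incident: Dict[str, Optional[str]]) -> List[str]:
    """Return the parties to notify for an event, skipping empty fields."""
    if event not in RECIPIENT_FIELDS:
        raise ValueError(f"Unknown notification event: {event}")
    result: List[str] = []
    for field in RECIPIENT_FIELDS[event]:
        value = incident.get(field)
        if value:  # e.g. the Contact field is optional
            result.append(value)
    return result
```

For example, an incident with no Contact yields only the Requester on submission, while one with a Contact yields both.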

6 Metrics

The following metrics will be measured and reported monthly by the Incident Manager(s) to help manage the process. Metrics will be retained for a 13-month period for review and trending purposes.

Classification | Metric | Description/Calculation | Frequency | KPI
Strategic | Average Time to Restore Service (w/Event Management Configuration) | Based on Business Duration | Monthly |
Management | Average Incident Resolution | Based on Business Duration | Monthly | X
Management | Incident Resolution within SLA by Category | Based on resolution within SLA by category | Monthly | X
Operational | Incidents Assigned More than Once by Category | Based on the subcategory for the Reassignment Count greater than 0 | Monthly |
Operational | Number of Incidents by Category | Based on the subcategory | Monthly |
Operational | Number of Repeated Incidents | Based on Resolve code | Monthly |
Operational | Remotely Resolved Incidents | Based on Resolve code, broken out by specific codes | Monthly |
Operational | First Time Resolution by Category | Based on Resolution rate by category | Monthly |
Operational | Average Work Effort for Resolving Incidents by Category | Based on incident time worked and category | Monthly |
Operational | Number of Incidents per service, by day, by month | Based on number of Incidents per service, by day, by month | Monthly |
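
Two of the operational metrics above can be computed as simple counts. The sketch below is illustrative only: the record layout, field names and sample rows are invented for the example and do not come from the ITSM Tool.

```python
from collections import Counter
from datetime import date
from typing import Any, Dict, List

# Hypothetical incident records; the values are made up for illustration.
SAMPLE: List[Dict[str, Any]] = [
    {"service": "Email", "category": "Software", "opened": date(2016, 9, 30), "reassignment_count": 0},
    {"service": "Email", "category": "Software", "opened": date(2016, 10, 3), "reassignment_count": 2},
    {"service": "Email", "category": "Network", "opened": date(2016, 10, 3), "reassignment_count": 1},
    {"service": "VPN", "category": "Network", "opened": date(2016, 10, 4), "reassignment_count": 0},
]


def reassigned_by_category(incidents: List[Dict[str, Any]]) -> Counter:
    """Incidents Assigned More than Once: Reassignment Count greater than 0."""
    return Counter(i["category"] for i in incidents if i["reassignment_count"] > 0)


def incidents_per_service_by_month(incidents: List[Dict[str, Any]]) -> Counter:
    """Number of Incidents per service, keyed by (service, 'YYYY-MM')."""
    return Counter((i["service"], i["opened"].strftime("%Y-%m")) for i in incidents)
```

Keying the second count by i["opened"] directly instead of the month string would give the per-day breakdown.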

Appendix A – Incident Management User Surveys

Scope

A key measurement of incident management is customer satisfaction. Typically, customer satisfaction is measured using a survey that captures the customer's satisfaction with the timeliness of the resolution, the resolution itself, and the IM service. In the incident management workflow, the affected user of an incident is notified to fill out a survey when the incident is closed. It should be possible to correlate the user's response back to the incident and to the analyst who resolved it.

Agency Quality Assurance (QA) reviewers will receive a notification if a submitted survey's average score is below an established threshold or if comments were added to the survey. Thresholds have been pre-defined to meet the criteria set forth by the receiving agency.
▪ Agency Benefit: Eliminates the need to search for and identify unsatisfactory survey results.

Surveys and their associated incident tickets are connected and viewable together.
▪ Agency Benefit: Eliminates the need to search for and match a survey to an incident ticket.

Once the QA reviewer reviews, closes and saves the survey, the survey is stamped with the date and time of the review and the name of the QA reviewer.
▪ Agency Benefit: Date and time stamp notation improves archived records related directly to an incident ticket and its survey.

The QA reviewer has the option to assign another QA team member the task of conducting a follow up investigation on the survey. If the first QA reviewer wishes to assign the follow up task to another QA team member, a notification is sent to the selected QA team member. Date and time stamps still apply.
▪ Agency Benefit: This enhancement improves communication between team members.

Two reports have been created.
▪ Survey Averages - Provides the average score for each survey or a group of surveys based on date parameters. An average score for one question based on date parameters is also available.
▪ Survey Details - Provides detailed survey results for completed surveys based on date parameters.
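
The QA notification rule described above (average score below a threshold, or comments added) can be sketched as a single check. The function name, the score scale and the threshold value used in the example are assumptions; the actual thresholds are pre-defined per receiving agency.

```python
from statistics import mean
from typing import Sequence


def needs_qa_review(scores: Sequence[float], comments: str, threshold: float) -> bool:
    """Return True when the QA reviewer should be notified about a survey.

    A notification is triggered when the survey's average score is below
    the agency's threshold, or when the respondent added any comments.
    """
    low_score = bool(scores) and mean(scores) < threshold
    has_comments = bool(comments.strip())
    return low_score or has_comments
```

With a hypothetical threshold of 3.0, a survey scored 2, 3, 2 triggers a notification, as does a survey with top scores that includes a comment.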

Appendix B – CoPA Incident Report Procedures

Standard Operating Procedure (SOP)

This Standard Operating Procedure (SOP) document is used to detail the CoPA Incident Report process.

▪ The process is used to provide timely notification and information to CoPA OIT Executive Management and related OIT Managers and entities, and as appropriate, affected agencies, processes, or CoPA-wide notifications, of various IT and Communications incidents and outages occurring in real time.

▪ The document is limited to information related to preparing, sending, and communicating critical information to identified recipients through an incident's life cycle using CoPA Incident Report Procedures.

▪ When there is an incident that is affecting multiple agencies or a high profile application and/or network, it is required that an Incident Report be sent to Management on a regular basis to keep them informed of the situation.

CoPA Incident Report Procedures.docx
