Problem Management - Avoiding Deja Vu

Introduction

• Neil Thomas
– Industry experience in IT & IT support
– ITIL Vendor Product Management
– ITIL Consulting
– Specialised in Service Catalog & CMDB

Introduction

• Fully Accredited ITIL Training
• Fully Accredited SDI Training
• ITIL Consultancy
• eLearning
• Social Media Training & Consultancy
• Industry Webinars (ITSM & SM)
• Industry/Organizational Podcasts
• SDI Partner for Social Media Courses

The Webinar Series

• Service Catalog
• Developing a CMDB
• Incident Management
• Problem Management
• Change Management
• Measuring Service Desk Performance Metrics

Topics today

• Detection: how do we know we have problems?
• 'Known Errors' & 'Work Arounds'
• Incident Management & Problem Management
• Problem Process inputs & outputs
• Root Cause Analysis skills and culture
• Actions to minimize impact of problems
• Avoiding Repeated Incidents
• Minimizing Impact Of Problems
• Key Activities
• Implementation best practices
• Categorizing the incidents & how to identify the problems

[Diagram: service questions and the ITIL processes that address them]
• User Needs Something (Service Requests & Service Catalogue)
• If Something Goes Wrong (Incident Management)
• Managing it (Service Portfolio Management & Financial Management)
• How Quickly do we Support (Service Level Management)
• Ensuring it’s there in the Future (Availability Management & Service Capacity Management & Service Continuity Management)
• What Delivers it (Configuration Management)
• If Something Keeps Going Wrong (Problem Management)
• Delivering Agreed Changes to Business (Release Management)
• Need to Improve or Resolve Problems (Change Management)

What is a Problem?

• A 'Problem' is the unknown cause of one or more incidents, often identified as a result of multiple similar incidents.

Problem Management

• The objective of Problem Management is to minimize the impact on the organization of problems caused by errors in the IT infrastructure, and to initiate actions to prevent recurrence of incidents.

• Problem Management plays an important role in detecting problems and providing solutions to them (work arounds & known errors), and in preventing their recurrence.

Problem Detection

• The Service Desk suspects a common cause behind one or more incidents: there is no definitive cause yet, but recurrence is suspected
• Analysis reveals that an underlying problem exists
• Automated detection of an infrastructure or application fault
• A notification from a supplier that a problem exists
• Analysis of incidents as part of proactive Problem Management

Major Incidents

• A Major Incident is an unplanned or temporary interruption of service with severe negative consequences.

Problem Logging

• User details
• Service details
• Equipment details
• Date/time initially logged
• Priority and categorization details
• Incident description
• Details of all diagnostic or attempted recovery actions taken

Problem Categorization

• Problems must be categorized (severity/priority) in the same way as incidents in order to trace a problem
• Take into account the IMPACT of the incidents and the FREQUENCY of the occurrences
• Take into account the SEVERITY of the problems
– Can the system be recovered, or does it need to be replaced?
– How much will it cost?
– How many people will be involved to fix the problem?
– How long will it take to fix the problem?
– How many additional resources will be involved?

Categorize

Effective categorization of Incidents & Problems has two aspects…

• Classification to determine incident type
• The Configuration Item (CI) that is affected
• Use standardized coding criteria
– Software:Server:Exchange (Technical classification by org. chart - Department:Team:Assignment)
– Server:Software:Exchange (Technical classification by CI categorization)
– Printing:Printer:Cannot Print (Service oriented - Service:Component:Problem)
– E-mail:Outlook:Access (Service oriented)

Configuration Management

• Defines WHAT delivers a SERVICE
• Defines the RELATIONSHIPS & dependencies
• Know WHAT is important and HOW it connects
• Change Process uses IMPACT ANALYSIS

Inputs & Outputs of Problem

Inputs:
• Incident Records
• Details About Incidents
• Known Errors
• Information about CIs
• Other Processes

Process: Problem Management

Outputs:
• Requests for Change
• Management Information
• Work Arounds
• Known Errors
• Updated Problem Records

Problem Process Goals

• ITIL-aligned Problem Management Policies, Processes and Procedures
• Dedicated Problem Manager (person or responsibility)
• Problem classification categories
• Problem trend reports
• Publicized Known Errors
• Problem analysis toolkit
• Root Cause Analysis skills and culture
• Actions to minimize impact of problems

Problem Solving

• A Problem is the cause (typically unknown) of one or more incidents. Activities include:
– Analyze and identify the root cause of one or more incidents
– Validate and publish the workaround for incidents whose cause is known (known error)
– Effect the systematic removal of the root cause via RFCs

Ishikawa Diagrams (Fishbone)

• Causal diagrams that show the causes of a certain event
• Each cause or reason for imperfection is a source of variation
• Causes are usually grouped into major categories to identify these sources of variation. The categories typically include:
– People: Anyone involved with the process
– Methods: How the process is performed and the specific requirements for doing it, such as policies, procedures, rules, regulations and laws
– Machines: Any equipment, computers, tools etc. required to accomplish the job
– Materials: Raw materials, parts, pens, paper, etc. used to produce the final product
– Measurements: Data generated from the process that are used to evaluate its quality
– Environment: The conditions, such as location, time, temperature, and culture in which the process operates

Known Errors

• Problem Management identifies the underlying causal factor
• It might take many incidents to understand the root cause
• Once identified, the causal factor becomes a “known error”

Known Errors & Work Arounds

• A 'Known Error' is a problem for which the root cause is understood and a temporary workaround or a permanent fix has been identified (implementation of the permanent fix may be some time in the future)

• A work around is a solution to a problem where the root cause is not known or fully understood. Work arounds are usually used to minimize the effects of the problem until a permanent solution is offered. When the root cause has been identified, Work Arounds become Known Errors.

Knowledge & Problems

• Problem Management maintains information about:
– Problems
– Workarounds
– Resolutions
• To reduce the number & impact of Incidents over time
• Strong interface with Knowledge Management
– Known Error Database will be used for both
– Incident Management descriptions
– Incident Management categorizations
– Incident Management resolutions

Problem Critical Success Factors

Avoiding Repeated Incidents

Minimizing Impact Of Problems

Problem KPIs

• Examples of Key Process Performance Indicators
• Avoiding Repeated Incidents
– Number of repeat incidents
– Number of existing Problems
– Number of existing Known Errors
• Minimizing Impact Of Problems
– Average time for diagnosis of Problems
– Average time for resolution of Known Errors
– Number of open Problems
– Number of open Known Errors
– Number of repeat Problems
– Number of Major Incident/Problem reviews

Incident Management & Problem

• The goal of the Incident Management process is to restore service as soon as possible
• Check for known errors and use any “workarounds”
• Resolve the Incident with solutions or workarounds

Link to Incident

• Problem Management works with Incident Management to ensure increased IT service availability and quality
• When incidents are resolved, information about the resolution is recorded
• Over time, this information is used to speed up resolution and identify permanent solutions, reducing the number and resolution time of incidents. This results in less downtime and less disruption to business-critical systems.

Incident & Problem Conflict

• Incident Management is concerned with restoring service as quickly as possible • Problem Management is concerned with determining and eliminating root cause (and hence eliminating repeat problems)

• Example:
– From an Incident Management perspective, the best decision may be to reboot a server to restore the service.
– This is not ideal from a Problem Management perspective, as the reboot may destroy any diagnostics and so prevent progress towards identifying the root cause.

I versus P – A solution

• Form a plan of attack for the next occurrence of the problem
• Decide what diagnostics to collect
• Decide how long to allow for diagnostics before service is restored
• Prepare the necessary resources (people, process, and technology) prior to the incident
• Communicate the plan to the stakeholders
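The plan above can be sketched in code as a time-boxed diagnostics run followed by an unconditional restore. This is only an illustrative sketch, not part of the webinar material: the function name `collect_then_restore`, the collector names, and the idea of passing in a clock are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Tuple


def collect_then_restore(
    collectors: List[Tuple[str, Callable[[], str]]],
    restore: Callable[[], None],
    time_budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Run diagnostic collectors in priority order, then restore service.

    The time budget is only checked between collectors, so a single slow
    collector can still overrun it; a real plan would also need per-step
    timeouts. All names here are hypothetical.
    """
    deadline = clock() + time_budget_s
    results: Dict[str, str] = {}
    try:
        for name, collect in collectors:
            if clock() >= deadline:
                # Budget spent: record what was skipped for the Problem record.
                results[name] = "SKIPPED: time budget exhausted"
                continue
            try:
                results[name] = collect()
            except Exception as exc:
                # A failed diagnostic must never block service restoration.
                results[name] = f"FAILED: {exc}"
    finally:
        # Incident Management's goal still wins: service is always restored.
        restore()
    return results
```

In use, the Problem Manager would agree the ordered list of collectors and the time budget with stakeholders in advance, so that at the next occurrence the Service Desk only has to run the plan and attach the results to the Problem record.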

Incident, Problem & Change

• Accurate analysis
• Identification of Configuration Items
• Good Problem analysis that touches ALL Incidents
• Link to Known Errors and Work Arounds

Problem Best Practice 1

1. Define Roles and Responsibilities
– Designated Problem Manager
– Responsibility to identify problems during daily operations as well as through historical reporting that shows recurring incidents
– Service Desk Manager and Problem Manager communication
2. Focus on the Root Cause
– Create a documented process for Root Cause Analysis
– Define what techniques will be used, e.g.
  • Brainstorming
  • Ishikawa diagrams
  • Causal Mapping
  • “Group Think” with representatives from any possible area of breakdown
– Schedule regular incident reviews
– Review all incidents where the root cause was not removed

Problem Best Practice 2

3. Make a “Known Error” Known
• The workaround should be communicated to all end-users who have submitted an incident
• Incidents are placed in a “resolved” status
• The Problem record should be in a “known error” status
• Known error & workaround are published to the knowledgebase
• Continue to open related incidents as reported and link them to the problem record (newly related incidents go to a “resolved” state). This should stop SLA calculations against the incidents, but will not allow full closure until the problem is resolved and closed
• Problem Managers must decide if permanently fixing the root cause is economically viable or if the workaround should become permanent

Problem Best Practice 3

4. Weigh the ROI
• If ROI for repairing a root cause is not achievable in 6 months, consider leaving the workaround in place; otherwise raise a Request for Change (RFC)
• The Problem is linked to the RFC (Close Problem & Incidents)
5. Focus on Root Cause
• Don’t automatically close Problem records when an RFC completes
• Review by the Problem Manager to ensure that any workaround in place is backed out
6. Be Customer-centric
• Focus on customers, not infrastructure
• The tendency is to focus on the most troublesome infrastructure, BUT the goal of effective IT Service Management is to focus on customers

Benefits

• Higher availability of IT services
• Higher productivity of business and IT staff
• Reduced expenditure on workarounds or fixes that do not work
• Reduction in cost of effort in fire-fighting or resolving repeat incidents

Q & A Time…