Monitoring Plan Project Name
Total Page:16
File Type:pdf, Size:1020Kb
MONITORING PLAN
The Monitoring Plan defines the process by which the operational environment will monitor the solution. It describes what will be monitored, what monitoring is looking for, how monitoring will be done, and how the results of monitoring will be reported and used. The paragraphs written in the “Comment” style are for the benefit of the person writing the document and should be removed before the document is finalized.
SEPTEMBER 11, 1998
Revision Chart
This chart contains a history of this document’s revisions. The entries below are provided solely for purposes of illustration. Entries should be deleted until the revision they refer to has actually been created. The document itself should be stored in revision control, and a brief description of each version should be entered in the revision control system. That brief description can be repeated in this section.
Version Primary Author(s) Description of Version Date Completed Draft TBD Initial draft created for distribution and TBD review comments Preliminary TBD Second draft incorporating initial review TBD comments, distributed for final review Final TBD First complete draft, which is placed under TBD change control Revision 1 TBD Revised draft, revised according to the TBD change control process and maintained under change control etc. TBD TBD TBD Monitoring Plan Project Name
PREFACE
The preface contains an introduction to the document. It is optional and can be deleted if desired.
Introduction The Monitoring Plan defines the process by which the operational environment will monitor the solution. It describes what will be monitored, what monitoring is looking for, how monitoring will be done, and how the results of monitoring will be reported and used. Customers use automated procedures to monitor many aspects of their solutions. Automated monitoring is a key best practice that enables identification of failure conditions and potential problems. Monitoring helps to reduce the time needed to recover from failures.
Justification The plan will provide the details of the monitoring process, which will be incorporated into the functional specification. Once incorporated into the functional specification, the monitoring process (manual and automated) will be included in the solution design. Monitoring ensures that operators are made aware that a failure has occurred so they can initiate procedures to restore service. Additionally, some organizations monitor their servers’ performance characteristics to spot usage trends. This proactive best practice allows organizations to identify the conditions that contribute to system failure and take action to prevent those conditions from occurring.
Team Role Primary Program Management is responsible for ensuring that the plan is completed and has acceptable quality, as well as incorporating it into the Master Project Plan and Operations Plan. Release Management will contribute heavily to the content of the plan in its responsibility for designing an effective solution monitoring process.
Team Role Secondary Development will review the plan to ensure that the functional specification and project deliverables are in synch with the monitoring plan. Product Management will review the plan to ensure that external customer needs are met by the monitoring plan. Test and User Experience will review the plan to ensure that what is monitored supports their functional areas of interest.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 2 Monitoring Plan Project Name
CONTENTS
New paragraphs formatted as Heading 1, Heading 2, and Heading 3 will be added to the table automatically. To update this table of contents in Microsoft Word, put the cursor anywhere in the table and press F9. If you want the table to be easy to maintain, do not change it manually.
1. INTRODUCTION...... 4
1.1 MONITORING PLAN SUMMARY...... 4 1.2 MONITORING PLAN OBJECTIVES...... 4 1.3 DEFINITIONS, ACRONYMS, AND ABBREVIATIONS...... 4 1.4 REFERENCES...... 4
2. ANTICIPATING FAILURES...... 5
2.1 RESOURCE THRESHOLD MONITORING...... 5 2.2 PERFORMANCE MONITORING...... 5 2.3 TREND ANALYSIS...... 6 2.4 APPLICATION HEALTH AND PERFORMANCE MONITORING...... 6
3. DETECTING FAILURES (INCIDENTS)...... 7
3.1 ERROR DETECTION...... 7 3.1.1 SNMP...... 7 3.1.2 Event Logs...... 7 3.1.3 Exception Trapping...... 8 3.2 NOTIFICATIONS...... 8
4. DIAGNOSING FAILURES (PROBLEMS)...... 9
5. RESOLVING FAILURES (KNOWN ERRORS)...... 10
6. RECOVERING FROM FAILURES...... 11
7. TOOLS...... 12
8. INDEX...... 13
9. APPENDICES...... 14
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 3 Monitoring Plan Project Name
LIST OF FIGURES
New figures that are given captions using the Caption paragraph style will be added to the table automatically. To update this table of contents in Microsoft Word, put the cursor anywhere in the table and press F9. If you want the table to be easy to maintain, do not change it manually. This section can be deleted if the document contains no figures or if otherwise desired.
Error! No table of figures entries found.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 4 Monitoring Plan Project Name
1. INTRODUCTION
This section should provide an overview of the entire document. No text is necessary between the heading above and the heading below unless otherwise desired.
1.1 Monitoring Plan Summary Provide an overall summary of the contents of this document. Some project participants may need to know only the plan’s highlights, and summarizing creates that user view. It also enables the full reader to know the essence of the document before they examine the details.
1.2 Monitoring Plan Objectives The Objectives section describes the business and technical drivers of the monitoring process and what key objectives are targeted for the monitoring process. Identifying the drivers and monitoring objectives signals to the customer that the project team has carefully considered the situation and solution and created an appropriate monitoring approach.
1.3 Definitions, Acronyms, and Abbreviations Provide definitions or references to all the definitions of the special terms, acronyms and abbreviations used within this document.
1.4 References List all the documents and other materials referenced in this document. This section is like the bibliography in a published book.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 5 Monitoring Plan Project Name
2. ANTICIPATING FAILURES
The Anticipating Failures section should List the solution components that are subject to failure Identify the components that are single points of failure Define known mean time between component failures List the conditions and circumstances that can lead to failures Define the probabilities of failures Define the impacts of failures This information could be documented in matrix such as:
Component Single Point Mean Time Conditions and Probability of Impacts of of Failure Between Circumstances Component Failure (Yes/No) Failures Leading to Failure Failure
Anticipating failures will enable operations either to avoid them or be prepared to deal with them when they occur.
2.1 Resource Threshold Monitoring The Resource Threshold Monitoring section identifies the solution resources that will be monitored, it defines the conditions and circumstances to be monitored for each type of resource, and it defines the thresholds to be used to judge that resources are working properly and are/are not sufficient to support the solution. Resources include hard drives, CPU, memory, and threads.
2.2 Performance Monitoring The Performance Monitoring section defines the monitoring process that gathers and records information about the performance of the total solution and the individual components in the solution. For each type of solution event it includes Counting event occurrences Recording event start and end time Counting the number of failures Counting the number of retries following failures
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 6 Monitoring Plan Project Name
2.3 Trend Analysis The Trend Analysis section defines the analysis that will take place on the data collected during performance monitoring. Trend analysis uses the information gathered and recorded by performance monitoring to predict solution and component performance and health under different conditions and circumstances, such as a larger user set and a changing solution environment.
2.4 Application Health and Performance Monitoring The Application Health and Performance Monitoring section should list and describe each software application in the solution and describe the plan for monitoring each application: Responsibility for monitoring Conditions and circumstances to be monitored Monitoring methods and tools Monitoring thresholds
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 7 Monitoring Plan Project Name
3. DETECTING FAILURES (INCIDENTS)
The Detecting Failures section should describe how the development team, operations, and maintenance will utilize the functional specifications and user acceptance criteria to detect failure incidents. The functional specifications clearly define the success criteria for a solution and for each of its components. User Acceptance Criteria, based on the functional specifications, precisely define user expectations for the correct and effective operation of the solution.
3.1 Error Detection The Error Detection section describes the processes, methods, and tools teams will use to detect and diagnose solution errors. The goal of an error detection strategy should be that the error is detected, resolved and recovered without the knowledge of the user community. Error detection in a Windows environment will enhance a solution’s reliability and availability. Early detection and handling of application and system errors can help avoid a shutdown, or at least allow for an orderly shutdown. It can also increase availability by allowing the solution to continue operating in a degraded state.
3.1.1 SNMP The SNMP protocol captures or traps configuration and status information from a Windows NT server.
3.1.2 Event Logs The Event Logs section describes the logs that will provide a system for capturing and reviewing significant application and system events. Describe the logs operations will maintain and the procedures they will use to record events and time in the logs.
3.1.2.1 Monitoring for Failure The Monitoring for Failure section should describe the processes, methods, and tools teams will use to detect and report solution failures.
3.1.2.2 Monitoring for Success The Monitoring for Success section describes the processes, methods, and tools teams will use to determine the solution is working correctly and is meeting user expectations. Monitoring for success includes the use of monitoring tools and interaction with solution users to gather information about solution successes.
3.1.2.3 Monitoring for Alarms The Monitoring for Alarms section describes how solution alarms will signal that a problem is about to occur or has occurred in a solution. It should identify all solution alarms, indicate how they will signal users and operations, and define what each alarm means.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 8 Monitoring Plan Project Name
3.1.3 Exception Trapping The Exception Trapping section describes a type of monitoring built into a solution that recognizes incidents, indicating a solution has produced a result that is an exception to acceptable results (i.e., the result lies outside the range of acceptability). This section should identify where the development team will build exception traps into the solution that continually monitor solutions or that operations will turn on when they suspect problems within a solution. Exception trapping capabilities allow for reliable programmer and program control over responses to exceptions that occur during the execution of a solution.
3.2 Notifications The Notifications section describes how people will be notified when monitoring and exception trapping has detected solution failures. This should include notification for errors and cases in which user performance expectations have not been met.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 9 Monitoring Plan Project Name
4. DIAGNOSING FAILURES (PROBLEMS)
The Diagnosing Failures section describes the processes, methods, and tools teams will employ to diagnose the problems detected in solutions by monitoring and exception trapping.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 10 Monitoring Plan Project Name
5. RESOLVING FAILURES (KNOWN ERRORS)
The Resolving Failures section describes the procedures teams will use to correct the errors detected and diagnosed in solutions and to improve solutions that do not meet user expectations.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 11 Monitoring Plan Project Name
6. RECOVERING FROM FAILURES
The Recovering from Failures section defines how the solution will be recovered from failure or referenced the Backup and Recovery Plan.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 12 Monitoring Plan Project Name
7. TOOLS
The Tools section lists and describes the tools teams can employ to detect, diagnose, and correct errors and to improve a solution’s performance. The table below is an example of this.
Tool Description Microsoft Systems Integrated inventory, distribution, installation, and remote troubleshooting tools for Management centralized management of hardware and software. Microsoft Systems Management Server Server can be used in medium to large multi-site Windows–based environments to reduce the cost of change and configuration management of Windows based desktop and server computers. Details available at http://www.microsoft.com/backoffice. Microsoft Windows Server administrative tool that enables viewing behavior of processors, Performance memory, cache, threads, and process objects. Each object has an associated set of Monitor (Perfmon) counters that provide information about device usage, queue length, delays, and other data that measures throughput and internal congestion. Details available at http://www.microsoft.com/windows. Microsoft Microsoft Press® kits contain both technical documentation and a CD-ROM with Windows Server useful utilities and accessory programs to help install, configure, and troubleshoot 2003 Resource Kit Microsoft Windows Server. See Details available at http://www.mspress.microsoft.com. Tivoli Management Family of products with a single management framework integrating disparate IBM Software systems management applications. Details available at http://www.tivoli.com.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 13 Monitoring Plan Project Name
8. INDEX
The index is optional according to the IEEE standard. If the document is made available in electronic form, readers can search for terms electronically.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 14 Monitoring Plan Project Name
9. APPENDICES
Include supporting detail that would be too distracting to include in the main body of the document.
04768d3b8f4ca14ed9979bb8b67f69f6.doc (02/08/10) Page 15