Automated IT Service Fault Diagnosis Based on Event Correlation Techniques

Dissertation at the Faculty of Mathematics, Computer Science and Statistics of the Ludwig-Maximilians-Universität München

submitted by Andreas Hanemann

Date of submission: 22 May 2007
Date of oral examination: 19 July 2007

First reviewer: Professor Dr. Heinz-Gerd Hegering, Ludwig-Maximilians-Universität München
Second reviewer: Professor Dr. Gabrijela Dreo Rodosek, Universität der Bundeswehr München

Acknowledgments

This thesis was written as part of my work as a researcher at the Leibniz Supercomputing Center (Leibniz-Rechenzentrum, LRZ) of the Bavarian Academy of Sciences and Humanities, which was funded by the German Research Network (DFN-Verein), and in cooperation with the research group of Prof. Dr. Heinz-Gerd Hegering. Apart from the LRZ, this research group, called the MNM Team (Munich Network Management Team), is located at the University of Munich (LMU), the Munich University of Technology (TUM) and the University of the Federal Armed Forces in Munich.

First, I would like to thank my doctoral advisor Prof. Dr. Heinz-Gerd Hegering for his constant support and helpful advice during the whole preparation time of this thesis. I would also like to express my special gratitude to my second advisor, Prof. Dr. Gabi Dreo Rodosek, for giving me advice on finding an appropriate research topic for the thesis and also for many discussions about the thesis structure and contents.

At the LRZ I would like to thank my supervisors Dr. Victor Apostolescu and Dr. Helmut Reiser for giving me the opportunity to integrate the work on the PhD thesis into the work on our network monitoring project.

The meetings of the MNM Team have been an important opportunity for me to present and discuss the status of my thesis with other researchers. In this context, I would like to thank Timo Baur, Latifa Boursas, Michael Brenner, Dr. Thomas Buchholz, Vitalian Danciu, Nils Otto vor dem gentschen Felde, Dr. Markus Garschhammer, Matthias Hamm, Iris Hochstatter, Wolfgang Hommel, Dr. Bernhard Kempter, Ralf König, Silvia Knittl, Annette Kostelezky, Dr. Michael Krause, Feng Liu, Dr. Harald Rölle, Thomas Schaaf, Michael Schiffers, Georg Treu, and Mark Yampolskiy. In particular, I would like to thank David Schmitz and Martin Sailer (both also members of the MNM Team), who have pursued related research directions, for many valuable discussions.

The outcome of some student work which I have supervised has also been helpful for the thesis preparation. Therefore, I would like to thank Dirk Bernsau, Hans Beyer, Marta Galochino, Patricia Marcu, and Martin Roll for their efforts.

At the LRZ I would like to thank Dr. Eberhard Hahn, Dr. Ulrike Kirchgesser, Klaus Natterer, Gudrun Schöfer, Werner Spirk, and Michael Storz for information about the example services.

Last but not least, I would like to express my gratitude to Karl-Heinz Geisler for improving the language quality of the thesis and to my parents for their support before and during the thesis preparation.

Munich, May 2007

This work was supported in part by the EC IST-EMANICS Network of Excellence (#26854).

Summary

In recent years, a paradigm shift in the area of IT service management could be witnessed. IT management no longer deals only with the network, end systems, or applications, but is increasingly concerned with IT services.
This is caused by the need of organizations to monitor the efficiency of internal IT departments and to be able to subscribe to IT services from external providers. This trend has raised new challenges in the area of IT service management, especially with respect to service level agreements laying down the quality of service to be guaranteed by a service provider. Fault management is also facing new challenges which are related to ensuring compliance with these service level agreements. For example, a high utilization of network links in the infrastructure can imply a delay increase in the delivery of services with respect to agreed time constraints. Such relationships have to be detected and treated in a service-oriented fault diagnosis, which therefore does not deal with faults in a narrow sense, but with service quality degradations.

This thesis aims at providing a concept for service fault diagnosis, which is an important part of IT service fault management. First, the need for further investigation of this issue is motivated, based on an analysis of services offered by a large IT service provider. A generalization of the scenario forms the basis for the specification of requirements, which are used for a review of related research work and commercial products. Even though some solutions for particular challenges have already been provided, a general approach for service fault diagnosis is still missing. To address this issue, a framework is presented in the main part of this thesis which uses an event correlation component as its central part. Event correlation techniques which have been successfully applied to fault management in the area of network and systems management are adapted and extended accordingly. Guidelines for the application of the framework to a given scenario are provided afterwards.
To show their feasibility in a real-world scenario, they are applied to both example services referenced earlier.

Kurzfassung (German abstract, in English translation)

In recent years, a paradigm shift could be observed in the area of IT management. Increasingly, the concern is no longer the pure management of networks, end systems, or applications, but the management of IT services. This is because organizations want to make the performance of internal IT departments more verifiable and to enable the purchase of externally provided IT services from service providers. This results in new requirements for IT management, particularly in connection with service level agreements, which specify the quality of service to be delivered by a service provider. New questions also arise in the area of fault management in connection with these service level agreements. For example, a high utilization of links in the network infrastructure can lead to an increase in the delay of service delivery, which has to be considered with respect to agreed time constraints. Such relationships have to be detected and treated in a service-oriented fault diagnosis, which therefore no longer deals with faults in the narrow sense, but with degradations of service quality.

This thesis presents a concept for diagnosing faults in the delivery of IT services, which forms one part of fault management for IT services. First, the need for further investigation in this area is motivated, based on an analysis of IT services offered in the environment of a large IT service provider. A generalization of the scenario serves as the basis for specifying requirements, which are subsequently used to assess related research work and commercial products. Although some previous work offers solutions for partial aspects of the problem, it becomes clear that a general approach to service fault diagnosis has been missing so far. In the main part of the thesis, a framework is presented for this purpose whose central component is an event correlator. Event correlation techniques that have so far been applied successfully at the network and systems management level are adapted and extended accordingly. Recommendations for adapting the framework to a given service scenario are provided afterwards. To demonstrate their usefulness in a real scenario, they are applied to the two example services presented earlier.

CONTENTS

1 Introduction
  1.1 Research Issue
  1.2 Deficits of Today’s IT Service Fault Management
  1.3 Thesis Outline
2 Requirements
  2.1 Definition of Terms
  2.2 Service Management Scenario at the Leibniz Supercomputing Center
  2.3 Generic Scenario for Service Fault Diagnosis
  2.4 Requirements Derivation
  2.5 Summary
3 Related Work
  3.1 IT Process Management Frameworks
  3.2 Service and Resource Modeling
  3.3 Fault Management Interfaces
  3.4 Fault Management Techniques
  3.5 SLA Management
  3.6 Summary
4 Framework for Service-Oriented Event Correlation
  4.1 Motivation for Service-Oriented Event Correlation
  4.2 Refinement of the Requirements
  4.3 Event Correlation Workflow
  4.4 Event Correlation Framework
  4.5 Hybrid Event Correlation Architecture
  4.6 Information Modeling and Management
  4.7 Assessment Metrics for a Given Scenario
  4.8 Collaboration with Impact Analysis
  4.9 Assessment
