Automated Grid Fault Detection and Repair
Total Page:16
File Type:pdf, Size:1020Kb
Automated Grid Fault Detection and Repair Submitted in fulfilment of the requirements of the degree of Master of Science of Rhodes University Leslie Luyt Grahamstown, South Africa 2012 Abstract With the rise in interest in the field of grid and cloud computing, it is becoming increasingly necessary for the grid to be easily maintainable. This maintenance of the grid and grid services can be made easier by using an automated system to monitor and repair the grid as necessary. We propose a novel system to perform automated monitoring and repair of grid systems. To the best of our knowledge, no such systems exist. The results show that certain faults can be easily detected and repaired. ACM Computing Classification System Classification Thesis classification under the ACM Computing Classification System (1998 version, valid through 2012) [1]: D.1.3[Programming Techniques] Concurrent Programming | Distributed Programming K.4.3[Organizational Impacts] Automation K.6.4[System Management] General-Terms: Reliability, Security, Management Acknowledgements The production of this thesis has taken many months and it would be unfair not to acknowledge all those who have assisted me. I would therefore like to thank all those who have contributed and assisted, in any way, to the completion of this project. There are a number of people I would like to specially note. Firstly, I would like to acknowledge the financial and technical support of Telkom, Tellabs, Stortech, Eastel, Bright Ideas Project 39, and THRIP through the Distributed Multimedia Centre of Excellence in the Department of Computer Science at Rhodes University. Secondly, I would like to acknowledge my parents for all the support, guidance, funding and love they have given me over the years. I know it would not have been possible for me to have had this great opportunity had it not been for them. Thirdly, I would like to extend my thanks to Jonas Bardino of the Minimum Intrusion Grid project whose input was invaluable and without whose assistance this thesis would not have been nearly as productive as it has been. Finally, I would like to thank and acknowledge my supervisor, Dr. Karen Bradshaw. Without her much valued input, support and guidance this thesis would not have been possible. Contents 1 Introduction 13 1.1 Problem Statement . 13 1.2 Research Methodology . 14 1.3 Project Outcomes . 14 1.3.1 Expected End-product . 14 1.3.2 End-product Requirements . 14 1.3.3 End-product Evaluation . 15 1.4 Thesis Organisation . 15 2 Background Knowledge 17 2.1 Introduction . 17 2.2 Programming Language Selection . 17 2.2.1 PHP . 17 2.2.2 Python . 19 2.3 Web Technology Evaluation . 20 2.3.1 mod python . 20 2.3.2 mod wsgi . 21 2 CONTENTS 3 2.3.3 Web Programming Frameworks . 22 2.4 Artificial Intelligence . 22 2.4.1 Neural Networks . 22 2.4.2 Bayesian Networks . 23 2.4.3 Expert Systems . 24 3 Grids 26 3.1 Grid Software . 26 3.1.1 Introduction . 26 3.1.2 Grid Categorisation . 28 3.1.3 Globus Toolkit . 29 3.1.4 MiG . 30 3.2 Grid Security . 31 3.2.1 Introduction . 31 3.2.2 Public Key Security . 31 3.2.3 Secure Socket Layer . 32 3.2.4 Web Vulnerbilities . 33 4 Current State of Grid Management 34 4.1 Introduction . 34 4.2 Grid Management . 35 4.2.1 MiG . 35 4.2.2 Globus . 35 4.3 Server Management . 36 4.3.1 Console Management . 36 4.3.2 GUI Management . 36 4.3.3 Web-based GUI Management . 37 CONTENTS 4 5 Research Design 38 5.1 Introduction . 38 5.2 Design Approach . 39 5.2.1 Fast Response Time . 39 5.2.2 Security . 39 5.2.3 Minimising Intrusion . 40 5.2.4 Maximising Gained Information . 40 5.2.5 Modular . 41 5.2.6 Cross-platform . 41 5.2.7 Open-source . 41 5.3 Proposed Design . 41 5.4 Summary . 43 6 Implementation 44 6.1 Introduction . 44 6.2 Implementation Challenges . 48 6.2.1 Modular Plug And Play Design . 48 6.2.2 Securing The Repair System . 48 6.3 Implemented Metrics . 49 6.3.1 RAM Metric . 49 6.3.2 Grid Jobs Metric . 51 6.4 Expert System . 55 6.5 Metric Security . 60 6.5.1 RAM Metric . 60 6.5.2 Grid Jobs Metric . 61 6.6 Summary . 61 CONTENTS 5 7 RAM Metric Results 62 7.1 Introduction . 62 7.2 Sudden Memory Change . 64 7.2.1 Scenario . 64 7.2.2 Expected Results . 64 7.2.3 Achieved Results . 64 7.2.4 Discussion . 64 7.3 Machine Reboot . 66 7.3.1 Scenario . 66 7.3.2 Expected Results . 66 7.3.3 Achieved Results . 66 7.3.4 Discussion . 67 7.4 Power Outage . 68 7.4.1 Scenario . 68 7.4.2 Expected Results . 68 7.4.3 Achieved Results . 68 7.4.4 Discussion . 69 7.5 Grid Process Failure . 69 7.5.1 Scenario . 69 7.5.2 Expected Results . 69 7.5.3 Achieved Results . 69 7.5.4 Discussion . 70 7.6 Process Unloaded To Swap Space . 71 CONTENTS 6 7.6.1 Scenario . 71 7.6.2 Expected Results . 71 7.6.3 Achieved Results . 72 7.6.4 Discussion . 74 7.7 Summary . 74 8 Grid Jobs Metric Results 75 8.1 Job Event Activity Spike . 76 8.1.1 Scenario . 76 8.1.2 Expected Results . 76 8.1.3 Achieved Results . 77 8.1.4 Discussion . 77 8.2 Job Event Inactivity . 78 8.2.1 Scenario . 78 8.2.2 Expected Results . ..