Towards Data Center Self-Diagnosis Using a Mobile Robot

Towards Data Center Self-Diagnosis Using a Mobile Robot Jonathan Lenchner Canturk Isci Jeffrey O. Kephart IBM Thomas J Watson IBM Thomas J Watson IBM Thomas J Watson Research Center Research Center Research Center Hawthorne, NY 10532 Hawthorne, NY 10532 Hawthorne, NY 10532 [email protected] [email protected] [email protected] Christopher Mansley Jonathan Connell Suzanne McIntosh Dept. of Computer Science IBM Thomas J Watson IBM Thomas J Watson Rutgers University, Research Center Research Center Piscataway, NJ 08854 Hawthorne, NY 10532 Hawthorne, NY 10532 [email protected] [email protected] [email protected] ABSTRACT In previous work [7, 6] we have painted a picture of the We describe an inexpensive robot that serves as a physi- data center as a computing environment with the capac- cal autonomic element, capable of navigating, mapping and ity for self-regulation, populated with elements represent- monitoring data centers with little or no human involve- ing physical components such as power distribution units ment, even ones that it has never seen before. Through a and air conditioners that interact with one another and series of real experiments and simulations, we establish that with software elements such as workload managers. In this the robot is su±ciently accurate, e±cient and robust to be paper, we describe a truly physical autonomic element|a of practical bene¯t in real data center environments. We robot that navigates the data center, mapping its layout demonstrate how the robot's integration with Maximo for and monitoring its temperature and other quantities of in- Energy Optimization, a commercial data center energy man- terest with little to no human assistance. The robot is in- agement product, supports autonomic management at the tegrated into state-of-the art enterprise energy management level of the data center as a whole, particularly self-diagnosis software in such a way that, as thermal anomalies arise in of emerging thermal problems. the data center (such as the occurrence of areas of under- and over-provisioned cooling) the robot can automatically ACM Classi¯cation Keywords: I.2.9 Computing Method- be dispatched to investigate these regions to provide evi- ologies, Arti¯cial Intelligence, Robotics. dence regarding suspected causes of the anomalies and fa- General Terms: Algorithms, Design, Management, Per- cilitate an automatic response. formance, Reliability. There are three main contributions of this work. First, we describe a full-prototype implementation of the autonomous Author Keywords: Autonomous systems, self-managing, data center robot that is deployed and tested in several pro- energy e±ciency, data centers, mobile robots. duction data centers. Second, we describe several trade-o®s and improvements in robot design and data center monitoring, and provide quantitative evaluations for di®erent de- 1. INTRODUCTION sign choices. We evaluate di®erent navigation approaches Over time, data centers around the world are consuming and their trade-o®s with di®erent data center layout con¯g- ever more energy, with those in the US now responsible for urations. We propose and evaluate several novel techniques an estimated 2% of the nation's electricity budget [12, 26]. for improving the robot's monitoring coverage, inlet tem- Recognizing that cooling is a signi¯cant contributor to en- perature sampling ¯delity, tile detection accuracy, and ad- ergy consumption, data center operators are beginning to dressing energy and scanning time considerations. Third, tolerate higher operating temperatures. While this practice we demonstrate the integration of robot operation to an saves substantial amounts of energy, running closer to al- enterprise-level data center management software and the lowable operating temperature limits increases the risk that application of this framework to di®erent management use temperature problems will result in equipment failures that cases. wipe out the ¯nancial bene¯ts of saving energy. Vigilance is The remainder of the paper is organized as follows. After needed, and increasingly that vigilance is being provided by reviewing related work in Section 2, we describe the physical data center energy management software that monitors data and software design principles and architecture in Section 3. center temperatures and alerts operators when troublesome In Sections 4 and 5, we describe several basic and enhanced hot spots develop. algorithms that support vision, navigation, asset classi¯cation, and self-protection, intermingling the algorithms with evidence of their e®ectiveness from simulation and physical experiments that we have conducted in four di®erent data Permission to make digital or hard copies of all or part of this work for centers. Section 6 describes the integration of the robot with personal or classroom use is granted without fee provided that copies are a commercial data center energy management product, and not made or distributed for profit or commercial advantage and that copies presents practical scenarios that illustrate how the robot can bear this notice and the full citation on the first page. To copy otherwise, to improve the self-diagnostic capabilities of the data center. republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICAC’11, June 14–18, 2011, Karlsruhe, Germany. Copyright 2011 ACM 978-1-4503-0607-2/11/06 ...$10.00. 2. RELATED WORK robot's robustness to unpredictability in its environment, Intelligent monitoring and automated, adaptive energy and thus its autonomy, and provide quantitative evaluations and thermal management of data centers has been a highly of the robot's e±ciency and robustness. We also describe the active research area in recent years. This comes as no sur- robot's integration into an enterprise asset management ap- prise in large scale computing as energy and cooling costs plication for data center energy e±ciency management and loom, the environmental concerns increase, and the empha- problem diagnosis. sis on energy regulation for computing systems and facili- ties grows stronger [26, 1]. Patel, Sharma et al. present 3. ROBOT DESIGN some of the early work in data center monitoring and man- Our robot's most fundamental purpose is to eliminate the agement [18, 19, 23] identifying some key ine±ciencies in human labor required to map and monitor both new and cooling and CRAC con¯gurations. Several studies explore known data centers. This translates into the following set techniques for e±cient energy and thermal provisioning in of basic design objectives, all of which must be attained data centers [5, 22, 8, 7]. Some studies employ coordinated completely autonomously: schemes based on control or utility models to collectively manage multiple control knobs [6, 4, 21]. Some recent re- ² Full Coverage. The robot must visit every region of search also incorporates dynamic workload placement as an the data center that is contiguous from an arbitrary additional soft control knob for data center management [14, starting position, with or without prior knowledge of 16, 15]. While there is clearly no shortage of work in data the data center layout. center management, most of the prior approaches assume a perfect world, where all the required vast level of monitor- ² Sensor Map Generation. The robot must generate ing is readily available and all the distributed sensing and a reasonably accurate map of sensor readings. Specif- control knobs are seamlessly integrated. However, the state ically, it must take sensor readings at su±ciently ¯ne of reality is nowhere near such level of integration for con- spatial granularity and associate with each reading an trollability and especially observability. Thus, our mobile accurate spatial location. monitoring and management approach bridges a major crit- ical gap in data center monitoring and management by ¯rst, ² Layout Generation. The robot must generate a sim- providing a simple, tractable and dense monitoring approach ple data center layout. At a minimum, the layout and second, demonstrating a true seamless integration of should distinguish a few basic types of structures or monitoring to data center control. assets, so that a human operator can recognize the Some prior studies also consider bridging the gap between generated layout as reflecting the physical layout of monitoring and management by providing distributed inte- the data center. grated monitoring or using controlled mobile sampling. In ² Robustness. The robot must carry out its functions comparison to these, our technique presents an approach under ordinary operating conditions, working around with minimal intrusion to data center operations, and more unexpected obstacles, coping with network outages, importantly, brings in autonomic monitoring into data cen- and protecting itself from harm. ters with a practically invisible barrier for adoption. In prior work, Tschudi et al. [25] recommend integrated monitoring ² Low cost. The robot should be constructible from schemes for data centers, while this is inarguably of great inexpensive components. value, such high-end integrated monitoring has a high cost of entry, and requires periodic maintenance, veri¯cation and We have been able to meet all of these basic design ob- calibration. Our work reduces this barrier to almost nothing jectives, as well as satisfy some additional desirable crite- by eliminating any requisite in the existing data center de- ria, through judicious physical design, software architecture, sign.

Load more