Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems
USENIX Association Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, USA, May 18–21, 2003.

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2003 by The USENIX Association. All Rights Reserved. For more information about the USENIX Association: Phone: 1 510 528 8649; FAX: 1 510 548 5738; Email: [email protected]; WWW: http://www.usenix.org. Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

Magpie: online modelling and performance-aware systems

Paul Barham, Rebecca Isaacs, Richard Mortier, and Dushyanth Narayanan
Microsoft Research Ltd., Cambridge, UK.

Abstract

Understanding the performance of distributed systems requires correlation of thousands of interactions between numerous components — a task best left to a computer. Today's systems provide voluminous traces from each component but do not synthesise the data into concise models of system performance.

We argue that online performance modelling should be a ubiquitous operating system service and outline several uses including performance debugging, capacity planning, system tuning and anomaly detection. We describe the Magpie modelling service which collates detailed traces from multiple machines in an e-commerce site, extracts request-specific audit trails, and constructs probabilistic models of request behaviour. A feasibility study evaluates the approach using an offline demonstrator. Results show that the approach is promising, but that there are many challenges to building a truly ubiquitous, online modelling infrastructure.

[Diagram: machines A and B, each running Event Tracing that feeds Models; a Query Engine answers Performance Queries.]

Figure 1. Magpie architecture. The diagram shows how requests move through different software components across multiple machines in a distributed system. Magpie synthesizes event traces from each machine into models that can be queried programmatically.

1 Introduction

Computing today is critically dependent on distributed infrastructure. E-mail, file access, and web browsing require the interaction of many machines and software modules. When end-users of such systems experience poor performance it can be extremely difficult to find the cause. Worse, problems are often intermittent or affect only a small subset of users and transactions — the 'it works for me' syndrome.

Aggregate statistics are insufficient to diagnose such problems: the system as a whole might perform quite well, yet individual users see poor performance. Accurate diagnosis requires a detailed audit trail of each request and a model of normal request behaviour. Comparing observed behaviour against the model allows identification of anomalous requests and malfunctioning system components.

We believe that providing such models should be a basic operating system service. Performance traces should be routinely collected by all machines at all times and used to generate system performance models which are then made available for online programmatic query. To this end, we are building Magpie, an online modelling infrastructure. Magpie is based on two key design principles. Black-box instrumentation requires no source code modification to the measured system. End-to-end tracing tracks not just aggregate statistics but each individual request's path through the system.

Fine-grained, low-overhead tracing already exists for Linux [27] and Microsoft's .Net Server [17]; the challenge is to efficiently process this wealth of tracing information for improved system reliability, manageability and performance. Our goal is a system that collects fine-grained traces from all software components; combines these traces across multiple machines; attributes trace events and resource usage to the initiating request; uses machine learning to build a probabilistic model of request behaviour; and compares individual requests against this model to detect anomalies. Figure 1 shows our envisioned high-level design.

In Sections 2 and 3 we describe the many potential uses of online monitoring and modelling. Section 4 describes the current Magpie prototype, which does online monitoring but offline modelling. Sections 5 and 6 briefly describe related work and summarize our position.

2 Scenario: performance debugging

Joe Bloggs logs into his favourite online bookstore, monongahela.com, to buy 'The Art of Surfing' in preparation for an upcoming conference. Frustratingly, he cannot add this book to his shopping cart, though he can access both the book details and his shopping cart. His complaint to customer support is eventually picked up by Sysadmin Sue. However, Sue is perfectly able to order 'The Art of Surfing' from her machine, and finds nothing suspicious in the system's throughput or availability statistics: she can neither replicate the bug nor identify the faulty component.

Consider the same website augmented with the Magpie online performance modelling infrastructure. Magpie maintains detailed logs of resource usage and combines them in real time to provide per-request audit trails. It knows the resource consumption of each request in each of several stages — parsing, generation of dynamic content, and database access — and can determine whether and where Joe's request is out of bounds with respect to the model of a correctly behaving request.

Using Magpie's modelling and visualization tools, Sue observes a cluster of similar requests which would not normally be present in the workload model. On closer examination, she sees that these requests spend a suspicious amount of time accessing the 'Book Prices' table, causing a time-out in the front-end server. This is the problem which is affecting Joe, and the culprit is a misconfigured SQL Server. Sue could not replicate the bug because her requests were redirected by an IP-address-based load balancer to a different, correctly configured replica. Within minutes the offending machine is reconfigured and restarted, and Joe can order his book in good time for the conference.

This is just one performance debugging scenario, but there are many others. File access, browsing and e-mail in an Intranet may rely on Active Directory, authentication and DNS servers: slow response times could be caused by any combination of these components. Diagnosing such problems today requires expert manual intervention using specialised command-line diagnostic tools.

3 Applications

Pervasive, online, end-to-end modelling has many uses apart from distributed performance debugging; here we list some of the most exciting ones. These applications are research goals rather than accomplished facts. Section 4 describes our first step towards these goals in the form of our offline modelling prototype.

Capacity planning. Performance prediction tools such as Indy [11] require workload models that include a detailed breakdown of resource consumption for each transaction type. Manual creation of such models is time consuming and difficult; Magpie's clustering algorithm (Section 4) automatically creates workload models from live system traces.

Tracking workload level shifts. Workloads can change qualitatively due to subtle changes in user behaviour or client software. For example, a new web interface might provide substring matching in addition to keyword search. To reconfigure the system appropriately, we must distinguish level shifts from the usual fluctuations in workload. Magpie could do so by comparing models of current and past workloads. This might also detect some types of denial-of-service attacks.

Detecting component failure. Failure of software or hardware components can degrade end-to-end performance in non-obvious ways. For example, a failed disk in a RAID array will slow down reads that miss in the buffer cache. By tracking each request through each component, Magpie can pinpoint suspiciously behaving components.

Comparison-based diagnosis. Workload models could be compared across site replicas to diagnose performance discrepancies.

'Bayesian Watchdogs'. Given a probabilistic model of normal request behaviour, we can maintain an estimate of the likelihood of any request as it moves through the system. Requests deemed unlikely can then be escorted into a safe sandboxed area.

Monitoring SLAs. The emerging Web Services standards [26] allow multiple sites to cooperate in servicing a single request. For example, an e-commerce site might access third-party services for authentication, payment, and shipping. Magpie performance models could be compared with service level agreements to check compliance. However, this does raise issues of trust and privacy of sensitive performance data.

Kernel performance debugging. Response times on a single machine are governed by complex interactions between hardware, device drivers, the I/O manager, the memory system and inter-process communication [14]. Event Tracing for Windows already traces all these OS components, enabling request tracking at this fine granularity.

End-to-end latency tuning. Scalable self-tuning systems such as SEDA [25] optimise individual compo-

[Figure: a three-tier system — Web Server, HTTP Server, Active Server, SQL Server — with 20% and 50% transition annotations; caption fragment: '... points for this system.']

Each instrumentation point generates a named
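The 'Bayesian watchdog' idea can be made concrete with a small sketch. This is not Magpie's actual algorithm, only a minimal illustration under strong assumptions: each request records its latency in the three stages named in Section 2, each stage's latency is modelled as an independent Gaussian fitted from normal traffic, and a request is flagged when its joint log-likelihood falls below a threshold. All names here (fit_model, is_anomalous, the threshold value) are invented for this sketch.

```python
import math

# Stages named in the performance-debugging scenario (Section 2).
STAGES = ["parse", "dynamic_content", "database"]

def fit_model(training_requests):
    """Estimate per-stage (mean, variance) from per-request stage latencies."""
    model = {}
    for stage in STAGES:
        samples = [req[stage] for req in training_requests]
        mean = sum(samples) / len(samples)
        var = sum((x - mean) ** 2 for x in samples) / len(samples)
        model[stage] = (mean, max(var, 1e-9))  # guard against zero variance
    return model

def log_likelihood(model, request):
    """Joint log-likelihood of a request's stage latencies under the model."""
    ll = 0.0
    for stage in STAGES:
        mean, var = model[stage]
        x = request[stage]
        ll += -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return ll

def is_anomalous(model, request, threshold=-20.0):
    """Watchdog check: flag requests whose likelihood is implausibly low."""
    return log_likelihood(model, request) < threshold
```

A real deployment would update the likelihood incrementally as the request moves through the system, rather than only after completion; the threshold would also need to be calibrated against the observed likelihood distribution of normal traffic.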