Profiling a Warehouse-Scale Computer
Svilen Kanev† (Harvard University), Juan Pablo Darago† (Universidad de Buenos Aires), Kim Hazelwood† (Yahoo Labs), Parthasarathy Ranganathan (Google), Tipp Moseley (Google), Gu-Yeon Wei (Harvard University), David Brooks (Harvard University)

† The work was done when these authors were at Google.

Abstract

With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three-year period, and comprising thousands of different applications.

We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.

1. Introduction

Recent trends show computing migrating to two extremes: software-as-a-service and cloud computing on one end, and more functional mobile devices and sensors ("the internet of things") on the other end. Given that the latter category is often supported by back-end computing in the cloud, designing next-generation cloud and datacenter platforms is among the most important challenges for future computer architectures.

Computing platforms for cloud computing and large internet services are often hosted in large data centers, referred to as warehouse-scale computers (WSCs) [4]. The design challenges for such warehouse-scale computers are quite different from those for traditional servers or hosting services, and emphasize system design for internet-scale services across thousands of computing nodes for performance and cost-efficiency at scale. Patterson and Hennessy, for example, posit that these warehouse-scale computers are a distinctly new class of computer systems that architects must design to [19]: "the datacenter is the computer" [41].

At such scale, understanding performance characteristics becomes critical – even small improvements in performance or utilization can translate into immense cost savings. Despite that, there has been a surprising lack of research on the interactions of live, warehouse-scale applications with the underlying microarchitecture. While studies on isolated datacenter benchmarks [14, 49] or system-level characterizations of WSCs [5, 27] do exist, little is known about the detailed performance characteristics of at-scale deployments.

This paper presents the first (to the best of our knowledge) profiling study of a live production warehouse-scale computer. We present detailed quantitative analysis of microarchitecture events based on a longitudinal study across tens of thousands of server machines over three years, running workloads and services used by billions of users. We highlight important patterns and insights for computer architects, some significantly different from common wisdom for optimizing SPEC-like or open-source scale-out workloads.

Our methodology addresses key challenges to profiling large-scale warehouse computers, including breakdown analysis of microarchitectural stall cycles and temporal analysis of workload footprints, optimized to address variation over the 36+ month period of our data (Section 2). Even though extracting maximal performance from a WSC requires a careful concert of many system components [4], we choose to focus on server processors (which are among the main determinants of both system power and performance [25]) as a necessary first step in understanding WSC performance.

From a software perspective, we show significant diversity in workload behavior, with no single "silver-bullet" application to optimize for and no major intra-application hotspots (Section 3). While we find little hotspot behavior within applications, there are common procedures across applications that constitute a significant fraction of total datacenter cycles. Most of these hotspots are in functions unique to performing computation that transcends a single machine – components that we dub "datacenter tax", such as remote procedure calls, protocol buffer serialization, and compression (Section 4). Such "tax" presents interesting opportunities for microarchitectural optimizations (e.g., in- and out-of-core accelerators) that can be applied to future datacenter-optimized server systems-on-chip (SoCs).

Optimizing tax alone is, however, not sufficient for radical performance gains. By drilling into the reasons for low core utilization (Section 5), we find that the cache and memory systems are notable opportunities for optimizing server processors. Our results demonstrate a significant and growing problem with instruction-cache bottlenecks. Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions (Section 6). The instruction footprints of many key workloads show significant growth rates (≈30% per year), greatly exceeding the current growth of instruction caches, especially at the middle levels of the cache hierarchy.

Perhaps unsurprisingly, data cache misses are the largest fraction of stall cycles, at 50% to 60% (Section 7). Latency is a significantly bigger bottleneck than memory bandwidth, which we find to be heavily overprovisioned for our workloads. A typical datacenter application mix involves access patterns that indicate bursts of computation mixed with bursts of stall time, presenting challenges for traditional designs. This suggests that while wide, out-of-order cores are necessary, they are often used inefficiently. While simultaneous multithreading (SMT) helps both with hiding latency and overlapping stall times (Section 8), relying on current-generation 2-wide SMT is not sufficient to eliminate the bulk of the overheads we observed.

Overall, our study suggests several interesting directions for future microarchitectural exploration: design of more general-purpose cores with additional threads to address

2. Background and methodology

Background: WSC software deployment

We begin with a brief description of the software environment of a modern warehouse-scale computer as a prerequisite to understanding how processors perform under a datacenter software stack. While the idioms described below are based on our experience at Google, they are typical for large-scale distributed systems, and pervasive in other platform-as-a-service clouds.

Datacenters have bred a software architecture of distributed, multi-tiered services, where each individual service exposes a relatively narrow set of APIs. Communication between services happens exclusively through remote procedure calls (RPCs) [17]. Requests and responses are serialized in a common format (at Google, protocol buffers [18]). Latency, especially at the tail end of distributions, is the defining performance metric, and a plethora of techniques aim to reduce it [11].

One of the main benefits of small services with narrow APIs is the relative ease of testability and deployment. This encourages fast release cycles – in fact, many teams inside Google release weekly or even daily. Nearly all of Google's datacenter software is stored in a single shared repository and built by one single build system [16]. Consequently, code sharing is frequent, and binaries are mostly statically linked to avoid dynamic dependency issues. Through these transitive dependences, binaries often reach 100s of MBs in size. Thus, datacenter CPUs are exposed to varied and diverse workloads, with large instruction footprints and shared low-level routines.

Continuous profiling

We collect performance-related data from the many live datacenter workloads using Google-Wide Profiling (GWP) [44]. GWP is based on the premise of low-overhead random sampling, both of machines within the datacenter and of execution time within a machine. It is inspired by systems like DCPI [2].

In short, GWP collectors: (i) randomly select a small fraction of Google's server fleet to profile each day, (ii) trigger profile collection remotely on each machine-under-test for a brief period of time (most often through perf [10]), (iii)
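To make the protocol buffer portion of the "datacenter tax" concrete: the public protobuf wire format encodes each field as a base-128 varint key (field number plus wire type) followed by its payload, so serialization time goes to byte-granularity shifts, masks, and data-dependent branches rather than arithmetic. The sketch below is a minimal illustration of just the varint encoding under that public format, not Google's optimized implementation; the function names are our own.

    def encode_varint(value: int) -> bytes:
        """Base-128 varint: 7 payload bits per byte; the high bit marks continuation."""
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                out.append(byte | 0x80)  # more bytes follow
            else:
                out.append(byte)
                return bytes(out)

    def encode_varint_field(field_number: int, value: int) -> bytes:
        """A varint-typed field: key = (field_number << 3) | wire_type, wire_type 0."""
        return encode_varint((field_number << 3) | 0) + encode_varint(value)

    # Field 1 with value 150 encodes to 08 96 01, the standard wire-format example.
    assert encode_varint_field(1, 150) == bytes([0x08, 0x96, 0x01])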
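GWP itself is internal to Google, but collector steps (i) and (ii) above can be approximated with standard tools. The sketch below, with hypothetical host names and sampling parameters of our choosing, randomly selects a small fraction of a fleet and triggers a brief perf record run on each machine; symbolizing and aggregating the resulting profiles would follow as separate steps.

    import random
    import subprocess

    FLEET = [f"server-{i:04d}" for i in range(10000)]  # hypothetical host names
    SAMPLE_FRACTION = 0.001  # profile roughly 0.1% of the fleet per sweep
    DURATION_SEC = 10        # brief collection window per machine-under-test

    def profile_fleet_once() -> None:
        targets = random.sample(FLEET, int(len(FLEET) * SAMPLE_FRACTION))
        for host in targets:
            # Remotely start a short system-wide, call-graph cycle profile;
            # perf writes its sample file on the target for later collection.
            subprocess.run(
                ["ssh", host,
                 f"perf record -a -g -F 99 -o /tmp/perf.data -- sleep {DURATION_SEC}"],
                check=False,  # one unreachable machine should not abort the sweep
            )

    if __name__ == "__main__":
        profile_fleet_once()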