
Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers

Google-Wide Profiling (GWP), a continuous profiling infrastructure for data centers, provides performance insights for cloud applications. With negligible overhead, GWP provides stable, accurate profiles and a datacenter-scale tool for traditional performance analyses. Furthermore, GWP introduces novel applications of its profiles, such as application-platform affinity measurements and identification of platform-specific, microarchitectural peculiarities.

Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt, Google

As cloud-based computing grows in pervasiveness and scale, understanding datacenter applications' performance and utilization characteristics is critically important, because even minor performance improvements translate into huge cost savings. Traditional performance analysis, which typically requires isolating benchmarks, can be too complicated or even impossible with modern datacenter applications. It's easier and more representative to monitor datacenter applications running on live traffic. However, application owners won't tolerate latency degradations of more than a few percent, so these tools must be nonintrusive and have minimal overhead. As with all profiling tools, observer distortion must be minimized to enable meaningful analysis. (For additional information on related techniques, see the "Profiling: From single systems to data centers" sidebar.)

Sampling-based tools can bring overhead and distortion to acceptable levels, so they're uniquely qualified for performance monitoring in the data center. Traditionally, sampling tools run on a single machine, monitoring specific processes or the system as a whole. Profiling can begin on demand to analyze a performance problem, or it can run continuously.1

Google-Wide Profiling can be theoretically viewed as an extension of the Digital Continuous Profiling Infrastructure (DCPI)1 to data centers. GWP is a continuous profiling infrastructure; it samples across machines in multiple data centers and collects various events—such as stack traces, hardware events, lock contention profiles, heap profiles, and kernel events—allowing cross-correlation with job scheduling data, application-specific data, and other information from the data centers.



Profiling: From single systems to data centers

From gprof1 to Intel VTune (http://software.intel.com/en-us/intel-vtune), profiling has long been standard in the development process. Most profiling tools focus on a single execution of a single program. As computing systems have evolved, understanding the bigger picture across multiple machines has become increasingly important.

Continuous, always-on profiling became more important with Morph and DCPI for Digital UNIX. Morph uses low-overhead, system-wide sampling (approximately 0.3 percent) and binary rewriting to continuously evolve programs to adapt to their host architectures.2 DCPI gathers much more robust profiles, such as precise stall times and causes, and focuses on reporting information to users.3 OProfile, a DCPI-inspired tool, collects and reports data much in the same way for a plethora of architectures, though it offers less sophisticated analysis. GWP uses OProfile (http://oprofile.sourceforge.net) as a profile source.

High-performance computing (HPC) shares many profiling challenges with cloud computing, because profilers must gather representative samples with minimal overhead over thousands of nodes. Cloud computing has additional challenges—vastly disparate applications, workloads, and machine configurations. As a result, different profiling strategies exist for HPC and cloud computing. HPCToolkit uses lightweight sampling to collect call stacks from optimized programs and compares profiles to identify scaling issues as parallelism increases.4 Open|SpeedShop (www.openspeedshop.org) provides similar functionality. Similarly, Upshot5 and Jumpshot6 can analyze traces (such as MPI calls) from parallel programs but aren't suitable for continuous profiling.

References

1. S.L. Graham, P.B. Kessler, and M.K. McKusick, "gprof: A Call Graph Execution Profiler," Proc. Symp. Compiler Construction (CC 82), ACM Press, 1982, pp. 120-126.
2. X. Zhang et al., "System Support for Automatic Profiling and Optimization," Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 15-26.
3. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
4. N.R. Tallent et al., "Diagnosing Performance Bottlenecks in Emerging Petascale Applications," Proc. Conf. High Performance Computing Networking, Storage and Analysis (SC 09), ACM Press, 2009, no. 51.
5. V. Herrarte and E. Lusk, Studying Parallel Program Behavior with Upshot, tech. report, Argonne National Laboratory, 1991.
6. O. Zaki et al., "Toward Scalable Performance Visualization with Jumpshot," Int'l J. High Performance Computing Applications, vol. 13, no. 3, 1999, pp. 277-288.

GWP collects daily profiles from several thousand applications running on thousands of servers, and the compressed profile database grows by several Gbytes every day. Profiling at this scale presents significant challenges that don't exist for a single machine. Verifying that the profiles are correct is important and challenging because the workloads are dynamic. Managing profiling overhead becomes far more important as well, as any unnecessary profiling overhead can cost millions of dollars in additional resources. Finally, making the profile data universally accessible is an additional challenge. GWP is itself a cloud application, with its own scalability and performance issues.

With this volume of data, we can answer typical performance questions about datacenter applications, including the following:

- What are the hottest processes, routines, or code regions?
- How does performance differ across software versions?
- Which locks are most contended?
- Which processes are memory hogs?
- Does a particular memory allocation scheme benefit a particular class of applications?
- What is the cycles per instruction (CPI) for applications across platforms?

Additionally, we can derive higher-level data to answer more complex but interesting questions, such as which compilers were used for applications in the fleet, whether there are more 32-bit or 64-bit applications running, and how much utilization is being lost by suboptimal job scheduling.

Infrastructure

Figure 1 provides an overview of the entire GWP system.

Figure 1. An overview of the Google-Wide Profiling (GWP) infrastructure. The whole system consists of collector, symbolizer, profile database, Web server, and other components.

Collector

GWP samples in two dimensions. At any moment, profiling occurs only on a small subset of all machines in the fleet, and event-based sampling is used at the machine level. Sampling in only one dimension would be unsatisfactory; if event-based profiling were active on every machine all the time, at a normal event-sampling rate, we would be using too many resources across the fleet. Alternatively, if the event-sampling rate is too low, profiles become too sparse to drill down to the individual machine level. For each event type, we choose a sampling rate high enough to provide meaningful machine-level data while still minimizing the distortion caused by the profiling on critical applications. The system has been actively profiling nearly all machines at Google for several years, with only rare complaints of system interference.

A central machine database manages all machines in the fleet and lists every machine's name and basic hardware characteristics. The GWP profile collector periodically gets a list of all machines from that database and selects a random sample of machines from that pool. The collector then remotely activates profiling on the selected machines and retrieves the results. It retrieves different types of sampled profiles sequentially or concurrently, depending on the machine and event type. For example, the collector might gather hardware performance counters for several seconds each, then move on to profiling for lock contention or memory allocation. It takes a few minutes to gather profiles for a specific machine.

For robustness, the GWP collector is a distributed service. This helps improve availability and reduce additional variation from the collector itself. To minimize distortion on the machines and the services running on them, the collector monitors error conditions and ceases profiling if the failure rate reaches a predefined threshold. Aside from the collector, we monitor all other GWP components to ensure an always-on service to users.

On top of the two-dimensional sampling approach, we apply several techniques to further reduce the overhead. First, we measure the event-based profiling overhead on a set of benchmark applications and then conservatively set the maximum rates to ensure the overhead is always less than a few percent. Second, we don't collect whole call stacks for the machine-wide profiles, to avoid the high overhead associated with unwinding (but we collect call stacks for most server profiles at lower sampling frequencies). Finally, we save the profile and metadata in their raw format and perform symbolization on a separate set of machines. As a result, the aggregated profiling overhead is negligible—less than 0.01 percent. At the same time, the derived profiles are still meaningful, as we show in the "Reliability analysis" section.
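The collection loop itself is simple to sketch. The following Python fragment illustrates the two-dimensional sampling described above; the machine-database client, event list, durations, and profile-sink API are hypothetical stand-ins for illustration, not GWP's actual interfaces.

```python
import random
import time

# Hypothetical event types and per-event profiling durations (seconds);
# the real rates are tuned per event to keep overhead under a few percent.
EVENTS = {"cycles": 10, "l2_misses": 10, "lock_contention": 30}
MACHINE_FRACTION = 0.01  # profile only a small random slice of the fleet


def collect_once(machine_db, profile_sink):
    """One round of two-dimensional sampling: a random subset of machines,
    each profiled briefly for each event type."""
    machines = machine_db.list_machines()  # assumed machine-database API
    subset = random.sample(machines,
                           max(1, int(len(machines) * MACHINE_FRACTION)))
    for machine in subset:
        for event, seconds in EVENTS.items():
            # Remotely activate profiling and retrieve the result.
            raw = machine.profile(event=event, duration=seconds)
            # Save raw profiles; symbolization happens later, elsewhere.
            profile_sink.save(machine=machine.name, event=event, data=raw)
        time.sleep(1)  # spread load; the real collector is a distributed service
```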
Second, we don’t collect whole call stacks for The daemons act as gate keepers to control the machine-wide profiles to avoid the high access, enforce sampling rate limits, and col- overhead associated with unwinding (but we lect system variables that must be synchron- collect call stacks for most server profiles at ized with the profiles......



We use OProfile (http://oprofile.sourceforge.net) to collect HPM event profiles. OProfile is a system-wide profiler that uses HPM to generate event-based samples for all running binaries at low overhead. To hide the heterogeneity of events between architectures, we define some generic HPM events on top of the platform-specific events, using an approach similar to PAPI.2 The most commonly used generic events are CPU cycles, retired instructions, L1 and L2 cache misses, and branch mispredictions. We also provide access to some architecture-specific events. Although the aggregated profiles for those events are biased toward specific architectures, they provide useful information for machine-specific scenarios.

In addition to whole-machine profiles, we collect various types of profiles from most applications running on a machine using the Google Performance Tools (http://code.google.com/p/google-perftools). Most applications include a common library that enables process-wide, stacktrace-attributed profiling mechanisms for heap allocation, lock contention, wall time and CPU time, and other performance metrics. The common library includes a simple HTTP server linked with handlers for each type of profiler. A handler accepts requests from remote users, activates profiling (if it's not already active), and then sends the profile data back. The GWP collector learns from the cluster-wide job management system which applications are running on a machine and on which port each can be contacted for remote profiling invocation. A machine can have some programs that lack remote profiling support and that per-process profiling therefore doesn't capture. However, these are few; comparison with system-wide profiles shows that remote per-process profiling captures the vast majority of Google's programs. Most programs' users have found profiling useful or unobtrusive enough to leave it enabled.

Together with profiles, GWP collects other information about the target machine and applications. Some of the extra information is needed to postprocess the collected profiles, such as a unique identifier for each running binary that can be correlated across machines with unstripped versions for offline symbolization. The rest is mainly used to tag the profiles so that we can later correlate them with job, machine, or datacenter attributes.
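A collector-side request to one of these handlers can be sketched in a few lines of Python. The handler paths below follow the pprof convention used by google-perftools (/pprof/profile for CPU time, /pprof/heap for heap allocations); the host, port, and duration are illustrative assumptions, not GWP specifics.

```python
import urllib.request


def fetch_profile(host: str, port: int, kind: str = "profile",
                  seconds: int = 10) -> bytes:
    """Pull a profile from a process's embedded HTTP profiling handler.

    `kind` names the handler, e.g. 'profile' for CPU time (where the
    seconds parameter sets the profiling duration) or 'heap' for heap
    allocations, following the pprof handler convention.
    """
    url = f"http://{host}:{port}/pprof/{kind}?seconds={seconds}"
    with urllib.request.urlopen(url) as response:  # handler activates profiling,
        return response.read()                     # then streams the data back


# Example: grab 10 seconds of CPU profile from a (hypothetical) server task.
# data = fetch_profile("some-task.cluster.example", 8080, "profile", 10)
```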
Most programs’ Profile storage users have found profiling useful or unobtru- Over the past years, GWP has amassed sive enough to leave them enabled. several terabytes of historical performance Together with profiles, GWP collects other data. GFS archives the entire performance information about the target machine and logs and corresponding binaries. To make applications. Some of the extra information the data useful and accessible, we load the is needed to postprocess the collected profiles, samples into a read-only dimensional data- such as a unique identifier for each running base that is distributed across hundreds of binary that can be correlated across machines machines. That service is accessible to all with unstripped versions for offline symboliza- users for ad hoc queries and to systems for tion. The rest are mainly used to tag the automated analyses...... 68 IEEE MICRO [3B2-14] mmi2010040065.3d 30/7/010 19:43 Page 69

The database supports a subset of SQL-like semantics. Although the dimensional database is well suited to queries that aggregate over the large data set, some individual queries can take tens of seconds to complete. Fortunately, most queries are seen frequently, so the profile server uses aggressive caching to hide the database latency.
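The article doesn't show the query language itself, so the following is a hypothetical example in the spirit of those SQL-like semantics; the table, column, and tag names are invented for illustration, not GWP's actual schema.

```python
# A hypothetical profile-database query: aggregate cycle samples by
# function for one application over one week, hottest functions first.
QUERY = """
SELECT function, SUM(samples) AS cycles
FROM   profiles
WHERE  event = 'cycles'
  AND  application = 'websearch'
  AND  date BETWEEN '2010-06-01' AND '2010-06-07'
GROUP BY function
ORDER BY cycles DESC
LIMIT 30
"""


def top_functions(db, query: str = QUERY):
    """Run the aggregation; results like these are what the Web server
    caches aggressively to hide multi-second database latency."""
    return db.execute(query)  # assumed database-client API
```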

User interfaces

For most users, GWP deploys a webserver to provide a user interface on top of the profile database. This makes it easy to access profile data and construct ad hoc queries for the traditional uses of application profiles, with additional freedom to filter, group, and aggregate profiles differently.

Query view. Several visual interfaces retrieve information from the profile database, and all are navigable from a web browser. The primary interface (see Figure 2) displays the result entries, such as functions or executables, that match the desired query parameters. This page supplies links that let users refine the query to more specific data. For example, the user can restrict the query to only report samples for a specific executable collected within a desired time period. Additionally, the user can modify or refine any of the parameters of the current query to create a custom profile view. The GWP homepage has links to display the top results, Google-wide, for each performance metric.

Figure 2. An example query view: an application-level profile (a) and a function-level profile (b).

Call graph view. For most server profile samples, the profilers collect full call stacks with each sample. Call stacks are aggregated to produce complete dynamic call graphs for a given profile. Figure 3 shows an example call graph. Each node displays the function name and its percentage of samples, and the nodes are shaded based on this percentage. The call graph is also displayed through the web browser, via a Graphviz plug-in.

Figure 3. An example dynamic call graph. Function names are intentionally blurred.

Source annotation. The query and call graph views are useful in directing users to specific functions of interest. From there, GWP provides a source annotation view that presents the original source file with a header describing overall profile information about the file and a histogram bar showing the relative hotness of each source file line. Because different versions of each file can exist in source repositories and branches, we retrieve a hash signature from the repository for each source file and aggregate samples on files with identical signatures.



Profile data API. In addition to the webserver, we offer a data-access API to read profiles directly from the database. It's more suitable for automated analyses that must process a large amount of profile data (such as reliability studies) offline. We store both raw profiles and symbolized profiles in ProtocolBuffer formats (http://code.google.com/apis/protocolbuffers). Advanced users can access and reprocess them using their preferred programming language.

Application-specific profiling

Although the default sampling rate is high enough to derive top-level profiles with high confidence, GWP might not collect enough samples for applications that consume relatively few cycles Google-wide. Increasing the overall sampling rate to cover those profiles is too expensive because they're usually sparse. Therefore, we provide an extension to GWP for application-specific profiling in the cloud. The machine pool for application-specific profiling is usually much smaller than GWP's, so we can achieve a high sampling rate on those machines for the specific application. Several application teams at Google use application-specific profiling to continuously monitor their applications running on the fleet.

Application-specific profiling is generic and can target any specific set of machines. For example, we can use it to profile a set of machines deployed with the newest kernel version. We can also limit the profiling duration to a small time period, such as the application's running time. This is useful for batch jobs running on data centers, such as MapReduce jobs, because it facilitates collecting, aggregating, and exploring profiles from hundreds or thousands of their workers.

Reliability analysis

To conduct continuous profiling on datacenter machines serving real traffic, extremely low overhead is paramount, so we sample in both the time and machine dimensions. Sampling introduces variation, so we must measure and understand how sampling affects the profiles' quality. But the nature of datacenter workloads makes this difficult; their behavior is continually changing, and there's no direct way to measure the datacenter applications' profiles' representativeness. Instead, we use two indirect methods to evaluate their soundness. First, we study the stability of the aggregated profiles themselves using several different metrics. Second, we correlate the profiles with performance data from other sources to cross-validate both.

Stability of profiles

We use a single metric, entropy, to measure a given profile's variation. In short, entropy is a measure of the uncertainty associated with a random variable, which in this case is the profile samples. The entropy H of a profile is defined as

H(X) = -\sum_{i=1}^{n} p(x_i) \log(p(x_i))

where n is the total number of entries in the profile and p(x_i) is the fraction of profile samples on entry x_i.5

In general, a high entropy implies a flat profile with many samples. A low entropy usually results from a small number of samples or from most samples being concentrated on a few entries. We're not concerned with entropy itself; because entropy is like a profile's signature, measuring its inherent variation, it should be stable between representative profiles.
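A minimal Python sketch of this computation, representing a profile as a mapping from entry names to sample counts:

```python
import math


def profile_entropy(samples: dict[str, int]) -> float:
    """Entropy H(X) = -sum(p(x_i) * log(p(x_i))) of a profile, where each
    entry maps a name (function, executable, ...) to its sample count."""
    total = sum(samples.values())
    entropy = 0.0
    for count in samples.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy


# A flat profile has high entropy; a concentrated one has low entropy.
flat = profile_entropy({"a": 1, "b": 1, "c": 1, "d": 1})    # ~1.386
spiky = profile_entropy({"a": 97, "b": 1, "c": 1, "d": 1})  # ~0.168
```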


Entropy doesn't account for differences between entry names. For example, a function profile with x percent on foo and y percent on bar has the same entropy as a profile with y percent on foo and x percent on bar. So, when we need to identify changes to the same entries between profiles, we calculate the Manhattan distance of two profiles by adding the absolute percentage differences between the top k entries, defined as

M(X, Y) = \sum_{i=1}^{k} |p_x(x_i) - p_y(x_i)|

where X and Y are two profiles, k is the number of top entries to count, and p_y(x_i) is 0 when x_i is not in Y. Essentially, the Manhattan distance is a simplified version of the relative entropy between two profiles.
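The same profile representation supports a direct sketch of the distance computation; the default k here is arbitrary, chosen only for illustration.

```python
def manhattan_distance(x: dict[str, float], y: dict[str, float],
                       k: int = 50) -> float:
    """M(X, Y) = sum over the top-k entries of X of |p_x(x_i) - p_y(x_i)|,
    where the inputs map entry names to fractions of total samples.
    An entry missing from the other profile contributes its full weight."""
    top = sorted(x, key=x.get, reverse=True)[:k]
    return sum(abs(x[name] - y.get(name, 0.0)) for name in top)


# Same entropy, nonzero distance: the names moved even though the shape didn't.
p1 = {"foo": 0.6, "bar": 0.4}
p2 = {"foo": 0.4, "bar": 0.6}
print(manhattan_distance(p1, p2))  # ~0.4
```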



Profiles' entropy. First, we compare the entropies of application-level profiles, where samples are broken down by individual applications. Figure 2a shows an example of such an application-level profile. Figure 4 shows daily application-level profiles' entropies for a series of dates, together with the total number of profile samples collected for each date. Unless otherwise specified, we used CPU cycles in the study; our conclusions also apply to the other profile types. As the graph shows, the entropy of daily application-level profiles is stable between dates, and it usually falls into a small interval. The correlation between the number of samples and the profile's entropy is loose. Once the number of samples reaches some threshold, more samples don't necessarily lead to a lower entropy, partly because GWP sometimes samples more machines than necessary for daily application-level profiles. This is because users frequently must drill down to specific profiles with additional filters on certain tags, such as application names, which cover only a small fraction of all profiles collected.

Figure 4. The number of samples and the entropy of daily application-level profiles. The primary y-axis (bars) is the total number of profile samples. The secondary y-axis (line) is the entropy of the daily application-level profile.

We can conduct a similar analysis on an application's function-level profile (for example, the application in Figure 2b). The result, shown in Figure 5a, is from an application whose workload is fairly stable when aggregated from many clients. Its entropy is actually more stable. It's interesting to analyze how entropy changes among machines for an application's function-level profiles. Unlike the aggregated profiles across machines, an application's per-machine profiles can vary greatly in terms of the number of samples.

Figure 5. Function-level profiles. The number of samples and the entropy for a single application (a). The correlation between the number of samples and the entropy for all per-machine profiles (b).


Figure 5b plots the relationship between the number of samples per machine and the entropy of function-level profiles. As expected, when the total number of samples is small, the profile's entropy is also small (limited by the maximum possible uncertainty). But once the threshold at the maximum number of samples is reached, the entropy becomes stable. We can observe two clusters in the graph; some entropies are concentrated between 5.5 and 6, and the others fall between 4.5 and 5. The application's two behavioral states explain the two clusters. We've seen various clustering patterns for different applications.

The Manhattan distance between profiles. We use the Manhattan distance to study the variation between profiles while accounting for changes in entry names, where a smaller distance implies less variation. Figure 6a illustrates the Manhattan distance between the daily application-level profiles for a series of dates. The results from the Manhattan distances for both application-level and function-level profiles are similar to the results with entropy.

In Figure 6b, we plot the Manhattan distance for several profile types, leading to two observations:

- In general, memory and thread profiles have smaller distances, and their variations appear less correlated with the other profiles.
- Server CPU time profiles correlate with HPM profiles of cycles and instructions in terms of variations, which could imply that those variations resulted naturally from external causes, such as workload changes.

Figure 6. The Manhattan distance between daily application-level profiles for various profile types (a). The correlation between the number of samples and the Manhattan distance of profiles (b).

To further understand the correlation between the Manhattan distance and the number of samples, we randomly pick a subset of machines from a specific machine set and then compute the Manhattan distance of the selected subset's profile against the whole set's profile. A power-function trend line captures the change in the Manhattan distance over the number of samples, and it roughly approximates a square-root relationship between the distance and the number of samples:

M(X) = C / \sqrt{N(X)}

where N(X) is the total number of samples in a profile and C is a constant that depends on the profile type.

Derived metrics. We can also indirectly evaluate the profiles' stability by computing derived metrics from multiple profiles. For example, we can derive CPI from HPM profiles containing cycles and retired instructions. Figure 7 shows that the derived CPI is stable across dates. Not surprisingly, the daily aggregated profiles' CPI falls into a small interval, between 1.7 and 1.8, for those days.

Figure 7. The correlation between the number of samples and derived cycles per instruction (CPI).

Comparing with other sources

Beyond measuring the profiles' stability across dates, we also cross-validate the profiles with performance and utilization data from other Google sources. One example is the utilization data that the data center's monitoring system collects. Unlike GWP, the monitoring system collects data from all machines in the data center, but at a coarser granularity, such as overall CPU utilization. Its CPU utilization data, in terms of core-seconds, matches the measurement from GWP's CPU cycles profile via the following formula:

CoreSeconds = Cycles \times SamplingRate_{machine} \times SamplingPeriod / CPUFrequency_{average}
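Written out as code, the cross-check conversion looks as follows. The parameter semantics follow the formula above, and the example numbers are invented for illustration; they are not GWP's actual sampling settings.

```python
def core_seconds(cycle_samples: float, machine_sampling_rate: float,
                 sampling_period: float, avg_cpu_hz: float) -> float:
    """Core-seconds implied by a cycles profile, per the formula above.

    cycle_samples: number of cycle samples in the profile
    machine_sampling_rate: scale factor for the fraction of machines profiled
    sampling_period: CPU cycles represented by each sample (event period)
    avg_cpu_hz: average clock rate of the machines, in Hz
    """
    return cycle_samples * machine_sampling_rate * sampling_period / avg_cpu_hz


# Hypothetical numbers: 2e6 samples, scale-up factor of 100 (1 percent of
# machines profiled), one sample per 1e6 cycles, 2.4 GHz average clock.
print(core_seconds(2e6, 100, 1e6, 2.4e9))  # ~83333 core-seconds
```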
We could use a power function’s ing time. Note that the sample must be trendlinetocapturethechangeintheMan- numeric and capable of aggregation. hattan distance over the number of samples. The associated vector contains information The trend line roughly approximates a square such as application name, function name, ......






The associated vector contains information such as application name, function name, platform, compiler version, image name, data center, kernel information, build revision, and builder's name. Assuming the vector contains m elements, we can represent a record GWP collected as a tuple of the sample value, the event type, and the m-dimensional vector.

When aggregating, GWP lets users choose k keys from the m dimensions and groups the samples by those keys. Basically, it filters the samples by imposing one or more restrictions on the remaining (m - k) dimensions and then projects the samples onto the k key dimensions. GWP finally displays the sorted results to users, delivering answers to various performance queries with high confidence. Although not every query makes sense in practice, even a small subset of them is demonstrably informative in identifying performance issues and providing insights into computing resources in the cloud.
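This filter-then-project aggregation is ordinary group-by logic, sketched below; the record layout and tag names are illustrative, not GWP's actual schema.

```python
from collections import defaultdict


# Each record is (sample_value, event_type, dims), where dims is the
# m-dimensional vector of tags (application, function, platform, ...).
def aggregate(records, keys, **filters):
    """Filter on some dimensions, then project the samples onto the chosen
    k key dimensions and sum, returning entries sorted by weight."""
    totals = defaultdict(float)
    for sample, event, dims in records:
        if all(dims.get(d) == v for d, v in filters.items()):
            totals[tuple(dims[k] for k in keys)] += sample
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


records = [
    (120.0, "cycles", {"application": "websearch", "function": "Score", "platform": "P1"}),
    (80.0,  "cycles", {"application": "websearch", "function": "Parse", "platform": "P2"}),
    (50.0,  "cycles", {"application": "gmail",     "function": "Score", "platform": "P1"}),
]
# Group cycles by function, restricted to one application:
print(aggregate(records, keys=["function"], application="websearch"))
# [(('Score',), 120.0), (('Parse',), 80.0)]
```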



Cloud applications' performance

GWP profiles provide performance insights for cloud applications. Users can see how cloud applications are actually consuming machine resources and how the picture evolves over time. For example, Figure 2 shows cycles distributed over top executables and functions, which is useful for many aspects of designing, building, maintaining, and operating data centers. Infrastructure teams can see the big picture of how their software stacks are being used, identify performance-critical components, create representative benchmark suites, and prioritize performance efforts.

At the same time, an application team can use GWP as the first stop for the application's profiles. As an always-on profiling service, GWP collects a representative sample of the application's running instances over time. Application developers are often surprised by their application's profiles when browsing GWP results. For example, Google's speech-recognition team quickly optimized a previously unknown hot function that they couldn't have easily located without the aggregated GWP results. Application teams also use GWP profiles to design, evaluate, and calibrate their load tests.

Finding the hottest shared code. Shared code is remarkably abundant. Profiling each program independently might not identify hot shared code if it's not hot in any single application, but GWP can identify routines that don't account for a significant portion of any single application yet consume the most cycles overall. For example, the GWP profiles revealed that the zlib library (www.zlib.net) accounted for nearly 5 percent of all CPU cycles consumed. That motivated an effort to optimize zlib routines and evaluate compression alternatives. In fact, some users have used GWP numbers to calculate the estimated savings of performance tuning efforts on shared functions. Not surprisingly, given the Google fleet's scale, a single-percent improvement on a core routine could potentially save significant money per year. Unsurprisingly, a new informal metric, "dollar amount per performance change," has become popular among Google engineers. We're considering providing a new metric, "dollar per source line," in annotated source views.

At the same time, some users have used GWP profiles as a source of coverage information, to assess the feasibility of deprecation and to uncover users of library functions that are to be deprecated. Due to the profiles' dynamic nature, this might miss less common clients, but the biggest, most important callers are easy to find.

Evaluating hardware features. The low-level information GWP provides about how CPU cycles (and other machine resources) are spent is also used for early evaluation of new hardware features that datacenter operators might want to introduce. One interesting example has been evaluating whether a special coprocessor to accelerate floating-point computation would be beneficial, by looking at floating point's percentage of all of Google's computation. As another example, GWP profiles can identify the applications running on old hardware configurations and help evaluate whether those configurations should be retired for efficiency.

Optimizing for application affinities

Some applications run better on a particular hardware platform because of sensitivity to architectural details, such as processor microarchitecture or cache size. It's generally very hard or impossible to predict which application will fare best on which platform. Instead, we measure an efficiency metric, CPI, for each application and platform combination. We can then improve job scheduling so that applications are scheduled on the platforms where they do best, subject to availability. The example in Table 1 shows how the total number of cycles needed to run a fixed number of instructions on a fixed machine capacity drops from 500 to 400 using preferential scheduling. Specifically, although the application NumCrunch runs just as well on Platform 1 as on Platform 2, the application MemBench does poorly on Platform 2 because of its smaller L2 cache. Thus, the scheduler should give MemBench preference for Platform 1.

Table 1. Platform affinity example.

Random assignment of instructions (CPI in parentheses):

              Platform 1   Platform 2
NumCrunch     100 (1)      100 (1)
MemBench      100 (1)      100 (2)
Total cycles  200          300

Optimal assignment of instructions (CPI in parentheses):

              Platform 1   Platform 2
NumCrunch     0 (1)        200 (1)
MemBench      200 (1)      0 (2)
Total cycles  200          200

The overall optimization process has several steps. First, we derive cycle and instruction samples from GWP profiles. Then, we compute an improved assignment table by moving instructions away from application-platform combinations with the worst relative efficiency. We use cycle and instruction samples over a fixed period of time, aggregated per job and platform. We then compute CPI and normalize it by clock rate. We can formulate finding the optimal assignment as a linear programming problem. The one unknown is Load_ij, the number of instruction samples of application j on platform i. The constants are:

- CPI_ij, the measured CPI of application j on platform i.
- TotalLoad_j, the total measured number of instruction samples of application j.
- Capacity_i, the total capacity of platform i, measured as the total number of cycle samples for platform i.

The equation is:

Minimize \sum_{i,j} CPI_{ij} \cdot Load_{ij}

subject to

\sum_{i} Load_{ij} = TotalLoad_j for each application j, and

\sum_{j} CPI_{ij} \cdot Load_{ij} \le Capacity_i for each platform i.

We use a simulated annealing solver that approximates the optimal solution in seconds for workloads of around 100 jobs running on thousands of machines of four different platforms over one month. Although application developers have already mapped major applications to their best platforms through manual assignment, we've measured 10 to 15 percent potential improvement in most cases where many jobs run on multiple platforms. Similarly, users can use GWP data to identify how to colocate multiple applications on a single machine to achieve the best throughput.6
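GWP's solver uses simulated annealing at fleet scale; for a problem as small as the Table 1 example, the same linear program can instead be handed to an off-the-shelf LP solver. The sketch below uses scipy and reproduces the 500-to-400 cycle reduction.

```python
from scipy.optimize import linprog

# Variables: Load_ij, ordered [NC@P1, NC@P2, MB@P1, MB@P2] (instruction samples).
cpi = [1.0, 1.0, 1.0, 2.0]  # CPI_ij from Table 1
objective = cpi             # minimize total cycles = sum of CPI_ij * Load_ij

# Each application's instructions must all be placed somewhere.
A_eq = [[1, 1, 0, 0],       # NumCrunch: P1 load + P2 load
        [0, 0, 1, 1]]       # MemBench:  P1 load + P2 load
b_eq = [200, 200]           # TotalLoad_j

# Cycles consumed on each platform cannot exceed its capacity.
A_ub = [[1, 0, 1, 0],       # platform 1: CPI-weighted load
        [0, 1, 0, 2]]       # platform 2
b_ub = [200, 200]           # Capacity_i (cycle samples)

res = linprog(objective, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
print(res.x)    # optimal Load_ij: NumCrunch all on P2, MemBench all on P1
print(res.fun)  # 400.0 total cycles, down from 500 under the random split
```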
Datacenter performance monitoring

GWP users can also issue GWP queries with computing-resource-related keys, such as data center, platform, compiler, or builder, for auditing purposes. For example, when rolling out a new compiler, users can inspect the versions with which running applications were actually compiled. Users can easily measure such transitions from time-based profiles. Similarly, a user can measure how soon a new hardware platform becomes active or how quickly old ones are retired. This also applies to new versions of applications being rolled out. Grouping by data center, GWP displays how the applications, CPU cycles, and machine types are distributed across different locations.

Beyond providing profile snapshots, GWP can monitor the changes between chosen profiles from two queries. The two profiles should be similar in that they must have identical keys and events, but different in one or more other dimensions. For applications, GWP usually focuses on monitoring function profiles. This is useful for two reasons: performance optimization normally starts at the function level, and function samples can reconstruct the entire call graph, showing users how the CPU cycles (and other events) are distributed in the program. A dramatic change to hot functions in the daily profiles can trigger a finer-grained comparison, which eventually points blame at the source code revision number, compiler, or data center.

Feedback-directed optimization

Sampled profiles can also be used for feedback-directed compiler optimization (FDO), as outlined in work by Chen et al.7 GWP collects such sampled profiles and offers a mechanism to extract profiles for a particular binary in a format that the compiler understands. Such a profile will be of higher quality than any profile derived from test inputs because it was derived from runs on live data. Furthermore, as long as developers make no changes in critical program sections, we can use an aged profile to improve a freshly released binary's performance. Many Web companies have release cycles of two weeks or less, so this approach works well in practice.

Similar to load-test calibration, users also use GWP for quality assurance of profiles generated from static benchmarks. Benchmarks can represent some codes well, but not others, and users rely on GWP to identify which codes are ill-suited for FDO compilation.

Finally, users use GWP to estimate the quality of HPM events and microarchitectural features, such as the cache latencies of various incarnations of processors with the same instruction set architecture (ISA). For example, if the same application runs on two platforms, we can compare the HPM counters and identify the relevant differences.

Besides adding more types of performance events to collect, we're now exploring more directions for using GWP profiles. These include not only user-interface enhancements but also advanced data-mining techniques to detect interesting patterns in profiles. It's also interesting to mash up GWP profiles with performance data from other sources to address complex performance problems in datacenter applications. MICRO



References

1. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
2. J. Dongarra et al., "Using PAPI for Hardware Performance Monitoring on Linux Systems," Conf. Linux Clusters: The HPC Revolution, 2001.
3. S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," Proc. 19th ACM Symp. Operating Systems Principles (SOSP 03), ACM Press, 2003, pp. 29-43.
4. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. 6th Conf. Symp. Operating System Design and Implementation (OSDI 04), Usenix Assoc., 2004, pp. 137-150.
5. S. Savari and C. Young, "Comparing and Combining Profiles," Proc. Workshop on Feedback-Directed Optimization (FDO 02), ACM Press, 2000, pp. 50-62.
6. J. Mars and R. Hundt, "Scenario Based Optimization: A Framework for Statically Enabling Online Optimizations," Proc. 2009 Int'l Symp. Code Generation and Optimization (CGO 09), IEEE CS Press, 2009, pp. 169-179.
7. D. Chen, N. Vachharajani, and R. Hundt, "Taming Hardware Event Samples for FDO Compilation," Proc. 8th Ann. IEEE/ACM Int'l Symp. Code Generation and Optimization (CGO 10), ACM Press, 2010, pp. 42-52.

Gang Ren is a senior software engineer at Google, where he's working on datacenter application performance analysis and optimization. His research interests include application performance tuning in data centers and building tools for datacenter-scale performance analysis and optimization. Ren has a PhD in computer science from the University of Illinois at Urbana-Champaign.

Eric Tune is a senior software engineer at Google, where he's working on a system that allocates computational resources and provides isolation between jobs that share machines. Tune has a PhD in computer engineering from the University of California, San Diego.

Tipp Moseley is a software engineer at Google, where he's working on datacenter-scale performance analysis. His research interests include program analysis, profiling, and optimization. Moseley has a PhD in computer science from the University of Colorado, Boulder. He is a member of ACM.

Yixin Shi is a software engineer at Google, where he's working on performance analysis tools and large-volume imagery data processing. His research interests include architectural support for securing program execution, cache design for wide-issue processors, computer architecture simulators, and large-scale data processing. Shi has a PhD in computer engineering from the University of Illinois at Chicago. He is a member of ACM.

Silvius Rus is a senior software engineer at Google, where he's working on datacenter application performance optimization through compiler and library transformations based on memory allocation and reference analysis. Rus has a PhD in computer science from Texas A&M University. He is a member of ACM.

Robert Hundt is a tech lead at Google, where he’s working on compiler optimiza- tion and datacenter performance. Hundt has a Diplom Univ. in computer science from the Technical University in Munich. He is a member of IEEE and SIGPLAN.

Direct questions and comments about this article to Gang Ren, Google, 1600 Amphitheatre Pkwy., Mountain View, CA 94043; [email protected].

