
On the Effectiveness of Simultaneous Multithreading on Network Workloads

Yaoping Ruan 1   Erich Nahum 1   Vivek Pai 2   John Tracey 1

1 IBM T.J. Watson Research Center, Hawthorne, NY 10532
  {yaoping.ruan,nahum,traceyj}@us.ibm.com

2 Princeton University, Princeton, NJ 08540
  [email protected]

Abstract

This paper experimentally investigates the effectiveness of simultaneous multithreading (SMT) for network server workloads. We study how well SMT improves performance on two very different shipping platforms that support SMT: IBM's POWER5 and the Intel Xeon. We use the architectural performance counters available on each to compare their architectural behavior, examining events such as cache misses, stalls, etc. By observing how these events change in response to the introduction of threading, we can determine whether and how SMT stresses the system.

We find that POWER5 makes more effective use of SMT for improving performance than the Xeon. In general, POWER5 achieves a 40-50% increase whereas the Xeon exhibits only a 10-30% gain. Examination using the performance counters reveals that cache size and the memory system are the key requirements for fully exploiting SMT. A secondary but still noticeable component is minimizing pipeline stalls due to branch mispredicts and interrupts. We present suggestions for improving SMT performance for each platform based on our observations.

1 Introduction

Simultaneous multithreading (SMT) is a technique for improving processor resource utilization by allowing two or more hardware threads to execute simultaneously on a single processor core. By duplicating a small set of hardware resources, such as architectural state registers, SMT allows multiple instruction streams to execute in a single pipeline simultaneously. Because the additional hardware support for these extra resources is small, SMT is seen as a relatively inexpensive way to gain instruction throughput. Academic research has demonstrated considerable performance gains using this technique [11, 44]. However, since the actual hardware did not exist at the time of these studies, they could only be conducted using simulation.

This limitation has been changing, however, as SMT-enabled hardware has come to market. Industry acceptance of SMT has been gaining, as shown by recent shipping processors from major manufacturers such as IBM and Intel, who have promoted SMT-enabled processors for use in high performance servers. The availability of this hardware offers the opportunity to measure performance and see how effective SMT is in practice. However, current implementations of SMT have evolved from existing single-threaded processors, and thus competing manufacturers may have made very different choices regarding what resources need to be replicated or modified to support SMT. For example, the Intel Xeon processor with Hyper-Threading is based on the NetBurst architecture, and the IBM POWER5 is derived from POWER4. Thus, studying the performance and observable bottlenecks in different processor families presents us with an excellent opportunity to understand what architectural features are important for supporting SMT.

This paper experimentally investigates the effectiveness of simultaneous multithreading for network server workloads. We study how well SMT improves performance on the two shipping platforms that support SMT: IBM's POWER5 and the Intel Xeon. We focus on server workloads since the processors are targeted for such usage and because most previous studies have examined user-space and/or CPU-bound workloads. We measure the performance of two different Web servers, Apache [2] and Flash [30], running on a standard client-server testbed connected using an isolated switched Gigabit Ethernet network, using the industry-standard SPECweb99 [41] for workload generation.

In particular, we measure how performance changes with the introduction of SMT, thus evaluating how effective each processor is at utilizing this feature. To help understand when, how, and why SMT is effective, we use the architectural performance counters available on each platform to measure and compare architectural behavior. By observing how cycle-intensive architectural events such as cache misses, ITLB misses, pipeline stalls, and branch

mispredictions change in response to adding SMT, we can determine whether and how SMT stresses the system. This paper makes the following contributions:

• We provide the first experimental performance comparison of SMT implementations across multiple platforms using server workloads. While POWER5 achieves 37-60% performance increases with SMT, the Xeon gains only 5-30%. The Xeon also exhibits less benefit from SMT in the dual-processor case than with a single processor. POWER5 improvements are more consistent for both single and dual processors.

• We show that cache provisioning is a crucial factor for exploiting SMT. Performance counter measurements indicate that the Xeon does not significantly benefit from SMT because of increased CPI due to cache contention. In contrast, the cache and memory system contribution to overall CPI is only slightly affected by SMT on POWER5, which has much larger caches.

• Pipeline management is a secondary but significant factor for achieving SMT benefits. CPI is only slightly increased by pipeline stalls and branch mispredicts on the Xeon when SMT is enabled, but rises more significantly on POWER5.

• We provide recommendations for both platforms on how each can better take advantage of SMT.

The remainder of the paper is organized as follows. Section 2 provides some architectural background on hardware multithreading, particularly SMT. Section 3 describes our experimental platform in detail, including server hardware, network connectivity, server software, and workload generation. Section 4 presents the overall performance results of the two platforms. Section 5 delves further, using performance counter measurements to help understand the performance seen in Section 4. Section 6 discusses related work, and Section 7 presents our conclusions and directions for future work.

                         Xeon                                POWER5
  Shared Resources       caches, branch predictors,          caches, branch predictors,
                         decoder logic, execution            decoder logic, execution
                         units, DTLB                         units, TLB
  Duplicated Resources   interrupt controller, status        status registers,
                         registers, renaming logic, ITLB     instruction buffer
  Partitioned Resources  µop queue, reordering buffer,       global completion
                         load/store buffer                   table (GCT)

Table 1: SMT resource allocation in the Intel Xeon and POWER5.

2 Background

In this section, we present a brief overview of the various approaches to multithreading, describe SMT in more detail, and discuss the two SMT-capable processors that we use: the IBM POWER5 and the Intel Xeon processor with Hyper-Threading.

2.1 Hardware Multithreading

Although SMT is a relatively new feature, the concept of multithreading has been around for several years. In these architectures, instructions from different threads are executed at various intervals without context switches. The execution intervals could be interleaved cycle by cycle, known as fine-grained multithreading, or on certain events (such as a cache miss), known as coarse-grained multithreading. An early example of the former was the Denelcor HEP [32], which had 8 hardware contexts. A much later design from the same architect, the Tera MTA [10], used 128 contexts. Recent examples include the IBM RS64-II [6], introduced in 1998, which switches threads to avoid wasting processor time during long-latency events. The Sun Fire T1000 (Niagara) from Sun Microsystems has fine-grained multithreading, which implements thread-select logic to decide which thread is active in a given cycle [31].

2.2 Simultaneous Multithreading

Multithreading technology in general was introduced to increase functional unit utilization. SMT improves this utilization by executing instructions from multiple processes or threads simultaneously in a single pipeline. In SMT-enabled processors, most of the processor resources are shared by the threads. The duplication of components varies from processor to processor, except that the registers for architectural state have to be duplicated to fetch instructions from different processes independently. Because operating systems usually get processor information from these registers, SMT appears exactly as a multiprocessor if no further identification is done.

2.3 Intel Xeon

The Intel Xeon with Hyper-Threading was the earliest SMT-capable processor family shipped with a real implementation. The Xeon is Intel's server-class processor, with dual- or multi-processor support. The Intel Xeon with Hyper-Threading is based on the Intel NetBurst microarchitecture [14], which has a superscalar, out-of-order pipeline with 20 stages for early versions and 31 stages for the latest ones. It has two threads (logical processors) per physical processor, which share most of the resources, such as caches, functional units, the branch prediction unit, etc. Some of the resources are duplicated, such as the architectural state registers, interrupt controllers, etc. Some resources

are partitioned when SMT is enabled, but are completely available to a single thread when only one thread is active [22]. We provide some of the key resource allocations for the Xeon in Table 1.

2.4 IBM POWER5

The POWER5 processor is IBM's latest 64-bit implementation of the PowerPC AS architecture [17]. It is a dual-core processor with multithreading technology. The processor has a superscalar, out-of-order pipeline with 16 stages for fixed-point operations, 18 stages for most load and store operations, and 21 stages for most floating point operations. Each POWER5 core supports two hardware threads. As with the Xeon, threads in POWER5 also share functional units and branch prediction logic. Replicated resources include the state registers, the instruction buffer, etc. The design of the instruction buffer is very different in the two processors: it is replicated in POWER5, while it is partitioned in the Xeon (the µop queue). The reordering unit, which is implemented using a global completion table (GCT) in POWER5, is partitioned in both processors. Some of the hardware resources are also described in Table 1.

When comparing these platforms, we focus on the architectural components enhanced to support SMT. A direct architectural comparison between the Xeon and POWER5 may not be feasible, since the Xeon is a complex instruction set computer (CISC) and POWER5 is a reduced instruction set computer (RISC). Instructions on these two processors are thus very different. However, the Xeon's native instructions are internally translated into RISC-like micro-operations (µops), which are comparable to instructions on POWER5.

3 Experimental Testbed

Our experimental setup consists of an Intel Xeon server and an IBM POWER5 server. Our POWER5 server is a P5 520, which has a dual-core processor running at 1.5 GHz. Our Intel server is an SE7505VB2, which supports dual Xeon processors running at a number of clock speeds. To simplify the comparison with the POWER5, we use dual Xeon processors running at 3.06 GHz with a 1 MB L3 cache. Some of the detailed hardware parameters are provided in Table 2.

                       Xeon                        POWER5
  Clock Speed          3.06 GHz                    1.5 GHz
  Pipeline Stages      30                          16-21
  TC/I-L1              12 K µops, 8-way,           64 KB, 2-way, 128 B line
                       6 µops per line
  D-L1                 8 KB, 4-way, 64 B line      32 KB, 4-way, 128 B line
  L2                   512 KB, 8-way, 128 B line   1.875 MB*, 10-way, 128 B line
  L3                   1 MB, 8-way, 128 B line     36 MB*, 12-way, 256 B line
  Main Memory          2 GB                        2 GB
  Memory Bandwidth     600 MB/s                    1280 MB/s
  Address Translation  128-entry ITLB,             1024-entry I/DTLB,
                       64-entry DTLB               128-entry ERAT,
                                                   64-entry I/DSLB
  Branch Prediction    4K branch                   8-entry branch
                       target buffer               information queue

Table 2: Intel Xeon and POWER5 hardware specifications. Memory bandwidth is measured using lmbench. *: the L2 and L3 caches are shared by the two cores on the POWER5.

There are a few key differences in the memory hierarchy between the two processors. In POWER5, the L3 cache is off-chip, but was moved from the memory side of the fabric to the processor side in order to reduce interchip traffic when an L2 miss hits in the L3. The Xeon only provides an L3 cache on its high-end processors, but it is on-chip. The two processor cores in POWER5 share all of the caches. When only one processor is active, it is able to access the full cache capacity.

In addition to the two server systems, our testbed has 12 client workload generator machines with 1.6 GHz AMD Duron processors. We ensure that the aggregate processing power of the clients is enough to saturate the servers. To provide adequate network bandwidth, the clients are partitioned into four groups of three machines. Each group is connected to the server via a separate Gigabit Ethernet switch, which in turn is connected to one of four Intel e1000 MT server adapters on the server.

In our experiments, we compare five different OS/processor configurations, based on whether SMT is enabled or not and how many physical processors are active. The five configurations are: one processor without SMT with a uniprocessor kernel (UP), one processor without SMT and an SMP kernel (1P), one processor with SMT (2T), two processors without SMT (2P), and two processors with SMT (4T). Our terminology is summarized in Table 3.

  Term Used   Kernel Type   Num Proc.   Threads / Proc.   Total Threads
  UP          UP            1           1                 1
  1P          SMP           1           1                 1
  2T          SMP           1           2                 2
  2P          SMP           2           1                 2
  4T          SMP           2           2                 4

Table 3: Nomenclature used in this paper.

We use the BIOS support and OS boot parameters to enable and disable SMT for the Xeon processor. POWER5 provides a hot-plug interface to the OS in order to bring up and shut down a particular thread without requiring a

reboot. Since our goal is to evaluate SMT performance, we use the multiprocessor-capable (SMP) kernel, which is required to use SMT. While some previous work [36] finds that SMP kernels may reduce some of the benefits of SMT when compared with a uniprocessor kernel, that is not the focus of this work. We use RedHat Enterprise AS 4 (RHEL AS 4) on both of the servers. The kernel of RHEL 4 is a patched version of 2.6.9 which includes optimizations for SMT such as processor affinity and load balancing [3]. The scheduler eliminates the chance of one context being idle while the other has multiple tasks waiting.

3.1 Workloads

We focus on Web servers and their workloads since this is one of the most popular commercial server applications and has not been subject to the same level of study as scientific computing workloads. We use the Apache 2.0 [2] and Flash [30] Web servers. To exploit thread-level parallelism, we configure the Apache server with the worker model, which uses threads. Flash is an event-driven server with aggressive performance optimizations. It has a main process that handles all of the requests, while all disk I/O accesses are handled by a set of helper processes. We run the same number of Flash main processes as the number of hardware contexts. All of the servers are optimized for the best performance by following steps described in the literature. We disable logging on both of the servers to reduce the amount of disk activity.

We use the SPECweb99 benchmark [41], which measures server capacity as the number of simultaneous connections that meet a QoS target. The benchmark consists of 70% static and 30% dynamic requests. The dynamic requests attempt to model commercial Web servers performing ad rotation, customization, etc. The benchmark client software reports the connections meeting the specified latency requirement. The data set involved scales with the number of connections, thus it easily exceeds the physical memory size. In order to avoid any complexity introduced by disk access, we limit the data set size to 500 MB, which fits in our physical memory.

When measuring performance counters, we use Oprofile [29] on the Xeon and pmcount on POWER5. Oprofile support for POWER5 has been recently ported; however, we observe a significant amount of overhead when using Oprofile on POWER5. The overhead on the Intel platform is minimal, usually less than 1%.

4 SMT Performance Results

In this section we evaluate the effectiveness of SMT by examining the performance of our two platforms. Here performance is not only the achieved throughput or number of simultaneous connections in SPECweb99 or any other workload, but also the relative improvement of performance when threading is introduced. We wish to see how effective additional threads are in increasing performance.

4.1 Throughput

We begin by examining the performance of the Xeon platform using SPECweb99, which is shown in Figure 1 for both the Apache and Flash Web servers. The five system configurations are indicated in the legend. As can be seen, performance degrades slightly when the SMP kernel is used instead of the UP kernel in the uniprocessor case. However, performance improves steadily as a second thread is added (2T), then a second processor (2P), and finally two processors each with two threads (4T). Figure 2 shows the performance exhibited by the POWER5 platform for the same experiments. Observe that performance increases in a similar fashion as shown in Figure 1.

4.2 Speedups

While both figures show similar trends, less obvious is the relative increase in performance when SMT is added. This is the focus of Figure 3, which shows the improvement in performance for the two platforms. The left side of the graph shows the improvement going from 1 thread to 2 threads in the uniprocessor case (2T/1P), while the right side shows the improvement in the dual-processor case (4T/2P). Note that the performance improvement for the Xeon is significantly smaller than for POWER5. For example, in the uniprocessor case with Apache, the Xeon gets a respectable 32% improvement, but POWER5 exhibits a 43% gain. The largest discrepancy is in the dual-processor scenario with Flash, where the Xeon gets only a 4% improvement when activating SMT, whereas POWER5 improves by 38%.

4.3 Static Workloads

Figures 4, 5, and 6 show the corresponding performance numbers and improvements for the two platforms using the SPECweb99 static workloads. Note that in these graphs the TUX server is included as well.¹ While the respective numbers are different using this workload, the trends are all the same: enabling SMT increases performance, for both the uniprocessor and dual-processor cases, but POWER5 gets a substantially better improvement than does the Xeon.

Having seen the impact of introducing SMT on performance, the next question is: why is SMT more effective for POWER5 than for the Xeon? We study this issue in depth in the next section.

¹ Due to a problem with dynamic content on TUX on POWER5, we were not able to have SPECweb99 results with TUX available in time for submission. We expect to have this problem resolved shortly.

Figure 1: SPECWeb99 results on the Xeon
Figure 2: SPECWeb99 results on POWER5
Figure 3: SMT speedups with SPECWeb99 (Xeon & POWER5)

Figure 4: SPECWeb99 static results on the Xeon
Figure 5: SPECWeb99 static results on POWER5
Figure 6: SMT speedups with SPECWeb99 static (Xeon & POWER5)

5 Architectural Analysis

In this section we perform an architectural analysis to understand the underlying causes of the performance observed in the previous section. We use the architectural performance counters available on each processor to measure the cycles per instruction (CPI) for each server, workload, configuration, and processor. In addition, we use the counters to compare the behavior of the processors, examining cycle-intensive events such as cache misses, TLB misses, pipeline stalls, branch mispredicts, etc. By observing how the frequency of these events changes in response to the introduction of SMT, we can determine whether and how they are sensitive to threading, and thus how SMT stresses the system.

5.1 Performance Counters

Both the Xeon and POWER5 processors offer a wide range of performance counters that measure architectural events such as cache misses, branch mispredicts, pipeline stalls, etc., in a fashion similar to Digital's (now HP's) DCPI [1]. The Xeon provides about 40 categories and hundreds of events to monitor. POWER5 provides even richer information with its performance counters. It has more than a hundred groups of counters, with each group having 5 events. Two of the events, instructions completed and run cycles, are included in all groups. Since both the Xeon and POWER5 have speculative execution and out-of-order completion, monitoring and interpreting performance counters on these processors to determine the exact contribution to CPI is not straightforward.

The Xeon, for example, does not have a mechanism to measure the exact number of stalled cycles caused by a particular event. POWER5 provides some support to measure the penalty caused by an event, although it is not complete either. These events include: base completion cycles; global completion table (GCT) empty cycles caused by the I-cache miss penalty; the branch misprediction penalty and store stall penalty; and other stall cycles, such as stalls by load/store instructions and stalls by fixed-point or floating-point instructions. We estimate the contribution of an event to CPI based on the frequency of the event as measured by the performance counters multiplied by the cost of the event. Costs are either measured directly via lmbench [26] (e.g., cache memory, main memory, or TLB access times) or calculated based on values taken from the appropriate processor references [14, 16, 17, 22, 23, 28, 38].
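The estimation procedure just described can be written out directly: each event's CPI contribution is its counter value times its cycle cost, divided by instructions retired. A sketch using the Xeon costs from Table 4; the raw event counts here are hypothetical stand-ins for actual counter readings:

```python
# Calculated CPI breakdown per Section 5.1: each event contributes
# (event count * cycle cost) / instructions retired to the CPI.
# Costs are the Xeon column of Table 4; the event counts below are
# hypothetical stand-ins for real performance-counter readings.

XEON_COSTS = {            # cycles per event (Table 4)
    "l2_fetch": 18,
    "l3_fetch": 42,
    "mem_fetch": 362,
    "itlb_miss": 65,
    "dtlb_miss": 65,
    "branch_mispredict": 20,
    "pipeline_flush": 30,
}

def cpi_breakdown(event_counts: dict, instructions: int) -> dict:
    """CPI contribution of each event, plus their total."""
    parts = {event: count * XEON_COSTS[event] / instructions
             for event, count in event_counts.items()}
    parts["total"] = sum(parts.values())
    return parts

counts = {"l2_fetch": 5_000_000, "mem_fetch": 400_000,
          "branch_mispredict": 7_000_000}
breakdown = cpi_breakdown(counts, instructions=100_000_000)
# e.g. breakdown["mem_fetch"] == 400_000 * 362 / 100_000_000 = 1.448
```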

Figure 7: Measured CPI of the Xeon
Figure 8: Measured CPI of POWER5
Figure 9: Calculated CPI breakdown
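The measured CPIs behind these figures follow the per-thread definition given in Section 5.2: run cycles divided by instructions retired on that hardware thread. A small sketch of the metric, and of why an unchanged per-thread CPI would double core throughput; all counter values here are hypothetical:

```python
# Per-thread CPI (Section 5.2): run cycles divided by instructions
# retired on that hardware thread. A core's aggregate instructions per
# cycle (IPC) is the sum of 1/CPI over its active threads, so if
# per-thread CPI were unchanged when SMT is enabled, throughput would
# double. All counter values below are hypothetical.

def per_thread_cpi(run_cycles: int, instructions_retired: int) -> float:
    return run_cycles / instructions_retired

def core_ipc(cpis: list[float]) -> float:
    """Aggregate instructions per cycle across the active threads."""
    return sum(1.0 / c for c in cpis)

single = per_thread_cpi(2_200_000_000, 1_000_000_000)   # CPI 2.2
# If enabling SMT raised each thread's CPI to 3.8 (as observed for
# Apache on the Xeon), aggregate IPC grows far less than 2x:
print(core_ipc([single]))    # one thread: ~0.45 IPC
print(core_ipc([3.8, 3.8]))  # two threads: ~0.53 IPC, well below 2x
```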

5.2 Cycles Per Instruction

In evaluating processor performance, perhaps the most common metric is cycles per instruction (CPI). We measure CPI by dividing the number of run cycles completed by the number of instructions retired. For the Xeon, we use the RISC-like micro-ops (µops) when counting instructions. Our goal is to explain the changes in performance due to SMT via corresponding changes in CPI. Our definition of CPI is thus per-thread CPI rather than per-processor CPI. For example, in an idealized case, CPI would not change when SMT is introduced, which would lead to a doubling of processor throughput.

                          Xeon   POWER5
  L2 fetch                 18      12
  L3 fetch                 42      67
  Main memory fetch       362     250
  ITLB miss                65      28
  ISLB miss               N/A      48
  DTLB miss                65      28
  DSLB miss               N/A      48
  Branch misprediction     20      11
  Pipeline flush           30      20

Table 4: Cost of each latency component in cycles, measured by lmbench or calculated from processor references.

Figure 7 shows the measured CPIs of the Xeon and Figure 8 shows the corresponding CPIs for POWER5. The key feature to observe in these graphs is the increase in CPI when SMT is enabled. For example, on the Xeon, the CPI when running Apache jumps from 2.2 in the single-threaded uniprocessor case (1P) to 3.8 in the dual-threaded case (2T), roughly a 57% increase. Similarly, the Xeon's CPI nearly doubles from 3 to 5.6 when changing from the single-threaded dual-processor scenario (2P) to the dual-threaded multiprocessor case (4T). POWER5, on the other hand, sees only a 20-30% increase in CPI across the same configurations. This shows that POWER5 is making better use of SMT.

The next question, then, is to determine the origins of the CPI increase on each of the platforms. In order to understand the performance of the hardware components as well as their impact on overall performance, we break down CPI into its component events, such as cache misses, TLB misses, branch mispredictions, etc., yielding a calculated CPI as described in Section 5.1. The cycle costs used in these calculations are shown in Table 4. Figure 9 shows the results of this breakdown. As can be seen, the calculated CPIs match well with the observed CPIs in Figure 7 and Figure 8, particularly in terms of exposing the influence of SMT. The increases in CPI per logical processor are shown in Table 5, separated into measured increase and calculated increase. The differences in most cases are narrow, and the inexactness is due to the fact that our calculated CPI comes from performance counter events rather than cycle-accurate simulation. Since the performance counters provide only limited insight into the extent of operation overlap, we cannot precisely calculate CPI. However, our measurements are quite close, especially when viewed relative to the baseline CPI. For example, the largest absolute difference occurs in the 4T/2P case for Flash on the Xeon. Here, the measured increase is 2.52, while the calculated is only 1.91, a difference of 0.61. However, the total CPI for that case is 5.6, so the relative error is less than 11%. Given the huge speed advantage of measurement over cycle-level simulation, a maximum error of 11% is quite tolerable. In the following sections, we separate the components of the CPI increase.

                          Xeon             POWER5
                      2T/1P   4T/2P    2T/1P   4T/2P
  Apache Measured      1.23    1.82     0.70    0.83
  Apache Calculated    1.34    1.40     0.59    0.61
  Flash Measured       1.42    2.52     0.63    0.70
  Flash Calculated     1.71    1.91     0.61    0.62

Table 5: CPI increase per logical processor when SMT enabled.

Figure 10: CPI of Apache L2 references
Figure 11: CPI of Apache L3 references
Figure 12: CPI of Apache main memory references
Figure 13: CPI of Flash L2 references
Figure 14: CPI of Flash L3 references
Figure 15: CPI of Flash main memory references

5.3 Cache & Memory Subsystem

The importance of the cache memory subsystem to overall performance is well-known in computer systems. For SMT-capable processors, memory subsystem performance is even more critical due to the sharing of the caches, memory buses and main memory. Previous work [36] has shown that an L3 cache helps the Xeon realize better SMT benefits. The Xeon processor used in our experiments has a 1 MB L3 cache, which is usually available only on Intel's multi-processor Xeons.

Both of the processors we use have three levels of cache hierarchy, and we configure the same amount of main memory on each. However, the cache subsystems of the two processors differ in several ways. The POWER5 has much larger caches than the Xeon, as shown in Table 2. Though the access latencies of the three levels of caches are comparable in terms of cycles, the main memory latency of POWER5 is about two thirds that of the Xeon. The first-level instruction cache on the Xeon is called the trace cache, which stores decoded instructions


and is equivalent to a traditional L1 instruction cache. The L2 and L3 caches hold both instructions and data on the Xeon and POWER5, but are shared by the two cores on the POWER5, as it has two cores on a single chip.

We measure the number of L2, L3 and main memory accesses, calculating their overall contribution to CPI. Figures 10 through 15 show these contributions for both the Xeon and POWER5. We do not discuss L1 hit cycles here because they take 2 cycles on each of the processors and usually can be overlapped by other activities. We discuss each figure in detail below.

Figure 10 and Figure 13 show the CPI contribution attributed to fetches from the L2 cache, for Apache and Flash. Though both the Xeon and the POWER5 exhibit more L2 accesses when SMT is enabled, the POWER5 is less affected by this pressure. The L2 contribution to CPI on the Xeon increases about 50% in SMT mode, while on POWER5 it increases only 16%. This by itself accounts for a third of the overall increase in CPI that the Xeon exhibits in Figure 7. Cache size is likely a significant factor here, since the L1 instruction cache on POWER5 is 5 times larger than on the Xeon and the L1 data cache is 4 times larger. Whether or not set associativity plays a role here cannot be determined, because the performance counters do not separately track conflict misses.

The CPI contributed by references that hit in the L3 is shown in Figure 11 and Figure 14. The contribution of L3 accesses to CPI is relatively unaffected by the introduction of SMT. The amount of CPI contributed by L3 accesses is also much less than that caused by hits to L2 and main memory. We surmise this is because of the footprint of our workloads: most of the accesses are absorbed by the L2 cache.

Figure 12 and Figure 15 present the contribution to CPI caused by accesses that reach main memory. The results are significantly different between the Xeon and POWER5. Memory references account for about 15-20% of overall CPI on the Xeon, but only about 5% on POWER5. We do see some increasing contribution to CPI as SMT is introduced on the Xeon, but relatively little change on POWER5. Recall that the L3 cache on POWER5 is 36 MB, shared by the two cores, but is only 1 MB on the Xeon. Moreover, it takes about 250 cycles to reach main memory on POWER5, while the latency on the Xeon is more than 360 cycles. We thus conclude that SMT stresses the memory system more on the Xeon than on POWER5, most likely because of the latter's larger caches.

5.4 Branch Prediction Unit

Branches are usually a significant fraction of the instructions used in server applications, and comprise about 15-17% of the instructions used by both Apache and Flash. Since both of the processors use speculative execution, a branch target needs to be predicted in order to continue execution before the branch instruction is evaluated. If a branch is mispredicted, all instructions following the branch need to be flushed from the pipeline, causing a delay. This penalty is about 20 cycles on the Xeon and 11 cycles on POWER5.

On each of the processors, the branch prediction unit is shared by the hardware threads when running with SMT enabled. Branch misprediction rates could thus suffer when SMT is enabled, and in fact we do see this on the Xeon. Apache has about a 7% misprediction rate in single-thread mode and 10-12% in SMT mode. On POWER5, however, both Apache and Flash have about a 10% branch misprediction rate; the difference between single-thread mode and SMT mode is minimal. We measure the number of mispredicted branches and calculate their contribution to the CPI. The results are shown in Figure 16 and Figure 17. Observe that when SMT is enabled, the contribution of branch misprediction to CPI increases nearly 50%, or 0.05. However, this is a relatively small contribution to the overall CPI on the Xeon, which sees an increase of 1.5 in CPI when SMT is utilized.

Figure 16: CPI of Apache branch misprediction
Figure 17: CPI of Flash branch misprediction

The different misprediction behavior in response to SMT comes from the design of the branch prediction unit on the two processors. The branch prediction
Figure 18: CPI of Apache ITLB misses
Figure 19: CPI of Apache DTLB misses
Figure 20: CPI of Apache pipeline clears
Figure 21: CPI of Flash ITLB misses
Figure 22: CPI of Flash DTLB misses
Figure 23: CPI of Flash pipeline clears

unit on the Xeon consists of a branch history table and a branch target buffer of 4K entries. POWER5, however, appears to have only a branch history table, along with a 32-entry target cache that is used only for a subset of instructions. It uses an 8-entry link stack to predict the targets of subroutine returns. The target addresses of most branches are calculated from the instruction’s address plus an offset, as described by the POWER architecture. The POWER5 branch prediction design may be less sensitive to SMT, but it has a higher misprediction rate than does the Xeon. Our finding that the difference between single-thread mode and SMT mode on POWER5 is insignificant is consistent with some of the previous research [23], which exercised computation-intensive workloads and compared the designs of POWER4 and POWER5.

5.5 Address Translation Resources

A translation lookaside buffer (TLB) is commonly used in modern processors to translate virtual addresses to physical addresses. The Xeon has a 128-entry instruction TLB (ITLB) and a 64-entry data TLB (DTLB). POWER5 has a multi-layer address translation mechanism. The address visible to software is called the effective address. In order to access the hardware, the effective address first needs to be translated to a virtual address and then to a real address. The processor also provides a segment lookaside buffer (SLB) in addition to the TLB. Each processor core contains a unified, 1024-entry TLB. There are two 128-entry caches to facilitate the effective-to-real address translation (ERAT), one for instruction addresses and one for data addresses. For most accesses, address translation is satisfied with a hit in the ERAT cache. The SLB and TLB are utilized only if the ERAT access is a miss.

Figure 18 and Figure 21 show the cost to CPI of instruction TLB misses on the Xeon and the accumulated miss penalties of the TLB, SLB, and ERAT on POWER5. Data TLB miss costs on the Xeon and data miss costs of the TLB, SLB, and ERAT on POWER5 are shown in Figure 19 and Figure 22. Note that the cost of ITLB misses increases 50% on the Xeon when SMT is introduced, whereas the cost on POWER5 is unchanged.

5.6 Pipeline Clear

Branch mispredictions and interrupts are common events which may cause instructions in the pipeline to be flushed, also called a pipeline clear. Section 5.4 presented the impact of pipeline clears due to branch mispredicts; here we show the effects of pipeline clears due to other factors, excluding branch mispredicts.

Figure 20 and Figure 23 illustrate the impact of this event. Both processors show a significant increase in CPI due to clears when SMT is enabled. However, the overall contribution of clears to CPI on POWER5 in SMT mode is much larger than on the Xeon. When

in single-threaded mode, POWER5 shows a very small number of clears, whereas the Xeon still has a noticeable amount.

POWER5 has the unique aspect that, when running in SMT mode, pipeline clears can be caused by imbalanced threads or synch instructions. Imbalanced threads occur when one thread occupies too many entries in the global completion table. The synch instruction orders memory operations across multiple processors and may require a long time to complete. A thread executing this instruction is very likely to be waiting for dispatching or decoding, and consequently prevents the other thread from proceeding. A further examination of the data reveals that most of the clears on POWER5 are caused by synchronization. Although the relative increase from single-thread mode to SMT is high, the overall contribution of this event to CPI is small, at most 6-8%. Most clears on the Xeon are caused by interrupts.

5.7 Summary

At a high level, we can summarize our microarchitectural findings by noting that the Xeon’s high clock rate puts heavy demands on its memory system, and it compensates by more aggressive speculation coupled with seemingly more graceful recovery from pipeline and other problems. Still, the effect of SMT is evident on the CPI, which almost doubles due to resource contention. With the Xeon’s high clock rate and relatively distant memory, lower functional unit utilization stems from memory latency bottlenecks, rather than failure to have enough outstanding memory accesses. The Xeon otherwise shows no significant increases in other microarchitectural hazards stemming from the addition of SMT support. While some undesirable microarchitectural events do increase in frequency, their contribution to the CPI increase is minimal.

The POWER5, with a clock rate almost half that of the Xeon, achieves more work per cycle, and uses its well-provisioned cache to compensate for a cycle time that would normally make it uncompetitive on integer-heavy workloads. The use of SMT appears to negatively affect the POWER5 CPI much less than the Xeon’s, largely due to the higher effective bandwidth provided by the L2 and L3 caches. However, some architectural problems are evident with SMT enabled – pipeline flushes become a measurable contributor to CPI, and, coupled with some functional unit contention, are the main limits to SMT’s performance improvements.

6 Related Work

To the best of our knowledge, the first study of the impact of operating systems on SMT performance was performed by Redstone et al. [34], using the SMTSIM simulator [44] ported to the SimOS framework [35]. McDowell et al. [24] used the same simulator and studied memory allocation and synchronization strategies for a search engine application. These studies found that OS behavior had a significant impact that could not be easily neglected in the simulator. In comparison, the OS could be safely ignored in the computation-intensive workloads previously studied, including SPEC CINT and CFP [40, 43, 44], parallel ray-tracing applications [15], SPLASH-2 benchmarks [21], MPEG-2 decompression [9, 37], and other scientific application workloads [19, 39].

The difference between SMT simulations and actual hardware was earlier explored by our previous work [36]. In that paper we examined the performance of various Web server packages across different models of the Xeon processor. The conclusion we drew from that work was that the latency of memory access was a primary cause of low SMT performance. In this work, we have examined the differences in performance between two different processor families, the Xeon and the POWER5. This approach not only provides a broader range of parameters than just processors in a single family, but it also highlights the impact of different architectural and implementation choices.

Other researchers have also been exploring delivered SMT performance, but with a general focus on compute-intensive benchmarks. Tuck and Tullsen [42] evaluated the implementation and effectiveness of SMT in a Pentium 4 processor, particularly in the context of prior published research in SMT. They measured SPEC CPU2000 and other parallel benchmarks, and concluded that this processor implementation matches the promise of the published SMT research. Bulpin and Pratt [7] extended Tuck et al.’s evaluation by comparing an SMT system to a comparable SMP system, but did not investigate SMT on SMP systems as we do in this paper. Chen et al. [9] also only evaluated performance of SMT and SMP individually.

Other work has focused on trying to improve the performance of SMT processors, often stemming from observations of microarchitectural resource conflicts. Several groups have proposed special schedulers for SMT – Snavely et al. [39, 40] use a symbiotic approach, Bulpin and Pratt [8] try to manage resource interference, and Fedorova et al. [13] try to partition caches (although for a CMP instead of an SMT). Raasch and Reinhardt [33] also study the impact of resource partitioning on SMT processors.

Performance analysis using hardware-provided event counters has been an effective approach in previous studies. Bhandarkar et al. [4] and Keeton et al. [18] characterized performance of Pentium Pro systems and studied latency components of the CPI. More recently, Blackburn et al. [5] used some of the Pentium 4 performance counters to study the impact of garbage collection. Given the complexity of the Xeon microarchitecture, interpreting the

performance-monitoring events on these systems is more difficult than with previous Pentium family processors or RISC processors. Moreover, we are unaware of any non-simulation work in this area that provides the breadth of event coverage that we do in this paper.

7 Conclusions and Future Work

In this paper, we have examined both the macroscopic and microarchitectural performance of server workloads on two very different SMT-enabled processors. By comparing behavior across two different processor families, we can get a broader perspective than previous work that examined SMT performance within a single processor family. Our overall results are fairly surprising – although the Xeon and the POWER5 have very different design philosophies, different architectural styles, different target markets, and different implementation focuses, we have shown that their respective performance gains from the use of SMT are largely tied to their memory organizations. The POWER5, with its slower cycle time and more aggressive memory hierarchy, has a much lower effective memory latency, and sees a more predictable and much higher relative performance gain than the Xeon.

All of the other architectural aspects, such as pipeline length, flushes, stalls, TLB misses, etc., vary widely between the two systems, and reflect their different implementation choices, but ultimately, all of these effects contribute less to the CPI than the memory effects. The lack of significant problems arising from microarchitectural choices may be the result of careful implementations in each processor, or it may be due to the large difference between processor and memory speeds today. In any case, even in these areas, we find microarchitectural choices that we believe can be improved to better support SMT on the workloads we consider. To the best of our knowledge, this work is the first to identify some of these opportunities, especially on the POWER5 processor.

In terms of our avenues for future work, we expect that the evolution of the POWER series will provide more opportunities to study some of the issues we have raised – in particular, the POWER5 is now shipping at a much higher clock speed (2.5GHz) than our test machine, so main memory latency may be impacted. Likewise, future generations of the POWER processor are rumored to meet or exceed the clock frequency of our test Xeon processor, so it will be interesting to examine whether its larger L3 cache is able to offset the higher rate of memory accesses one can expect.

In terms of future Xeon processors, the research opportunities are less clear, since the current trend appears to be adopting multi-core processors while dropping SMT support. While this choice is, in all likelihood, a side-effect of using the older and lower-power Pentium-III/Pentium-M architecture as the basis for the multicore chips, it is not clear if SMT will reappear in the Xeon roadmap in the near future. While SMT did provide some performance benefit for the Xeon, and may provide an attractive return on the investment in silicon space, the literature on the NetBurst (P4) architecture mentions that it was designed to support SMT. Adding it to the less-speculative P3 architecture may incur more pronounced penalties for pipeline problems and misspeculation.

References

[1] J.-A. M. Anderson, L. M. Berc, J. Dean, S. Ghemawat, M. R. Henzinger, S.-T. Leung, R. L. Sites, M. T. Vandevoorde, C. A. Waldspurger, and W. E. Weihl. Continuous profiling: Where have all the cycles gone? In Proc. of the 16th ACM Symp. on Operating System Principles, pages 1–14, 1997.

[2] Apache Software Foundation. The Apache Web server. http://www.apache.org/.

[3] P. Benmowski. Hyper-Threading Linux. LinuxWorld, Aug. 2003.

[4] D. Bhandarkar and J. Ding. Performance characterization of the Pentium Pro processor. In Proc. of the 3rd IEEE Symp. on High-Performance Computer Architecture (HPCA ’97), pages 288–298, Feb. 1997.

[5] S. Blackburn, P. Cheng, and K. McKinley. Myths and realities: The performance impact of garbage collection. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, June 2004.

[6] J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM Journal of Research and Development, 44(6):885–898, 2000.

[7] J. Bulpin and I. Pratt. Multiprogramming performance of the Pentium 4 with Hyper-Threading. In Workshop on Duplicating, Deconstructing, and Debunking (WDDD04), June 2004.

[8] J. R. Bulpin and I. A. Pratt. Hyper-threading aware process scheduling heuristics. In USENIX Annual Technical Conference, Anaheim, CA, April 2005.

[9] Y.-K. Chen, E. Debes, R. Lienhart, M. Holliman, and M. Yeung. Evaluating and improving performance of multimedia applications on simultaneous multi-threading. In 9th International Conference on Parallel and Distributed Systems, Dec. 2002.

[10] C. Stork. Exploring the Tera MTA by example.

[11] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen. Simultaneous multithreading: A platform for next-generation processors. IEEE Micro, pages 12–18, Sept. 1997.

[12] F. Eskesen, M. Hack, T. Kimbrel, M. Squillante, R. Eickemeyer, and S. Kunkel. Performance analysis of simultaneous multithreading in a PowerPC-based processor. In Workshop on Duplicating, Deconstructing, and Debunking (WDDD04), June 2004.

[13] A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum. Performance of multithreaded chip multiprocessors and implications for operating system design. In USENIX Annual Technical Conference, Anaheim, CA, April 2005.

[14] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, Feb. 2001.

[15] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa. An elementary processor architecture with simultaneous instruction issuing from multiple threads. In Proc. of the 19th Annual International Symp. on Computer Architecture, pages 136–145. ACM Press, 1992.

[16] Intel Corporation. Intel Pentium 4 and Intel Xeon processor optimization reference manual. http://developer.intel.com/design/pentium4/manuals/index_new.htm.

[17] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM POWER5 chip: A dual-core multithreaded processor. IEEE Micro, 24(2):40–47, March/April 2004.

[18] K. Keeton, D. Patterson, Y. He, R. Raphael, and W. Baker. Performance characterization of a Quad Pentium Pro SMP using OLTP workloads. In Proc. of the 25th Annual International Symp. on Computer Architecture, pages 15–26, 1998.

[19] H. Kwak, B. Lee, A. Hurson, S.-H. Yoon, and W.-J. Hahn. Effects of multithreading on cache performance. IEEE Trans. Computers, 48(2):176–184, 1999.

[20] J. Lo, L. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh. An analysis of database workload performance on simultaneous multithreaded processors. In Proc. of the 25th Annual International Symp. on Computer Architecture, pages 39–50, 1998.

[21] J. Lo, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15(3):322–354, 1997.

[22] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. A. Miller, and M. Upton. Hyper-Threading technology architecture and microarchitecture. Intel Technology Journal, 6(1):4–15, Feb. 2002.

[23] H. M. Mathis, A. Mericas, J. McCalpin, R. Eickemeyer, and S. Kunkel. Characterization of simultaneous multithreading (SMT) efficiency in POWER5. IBM Journal of Research and Development, 49(4/5), July/September 2005.

[24] L. McDowell, S. Eggers, and S. Gribble. Improving server software support for simultaneous multithreaded processors. In PPoPP 2003, pages 37–48, San Diego, CA, June 2003.

[25] C. McNairy and R. Bhatia. Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro, 25(2):10–20, March/April 2005.

[26] L. McVoy and C. Staelin. lmbench: Portable tools for performance analysis. In USENIX Annual Technical Conference, pages 279–294, San Diego, CA, June 1996.

[27] B. Olszewski and O. Herescu. Performance workloads in a hardware multithreaded environment. In Fifth Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), Boston, MA, February 2002.

[28] B. Olszewski, O. Herescu, and H. Hua. Performance workloads characterization on POWER5 with simultaneous multithreading support. In Eighth Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), San Francisco, CA, February 2005.

[29] OProfile. A system profiler for Linux. http://oprofile.sourceforge.net/.

[30] V. Pai, P. Druschel, and W. Zwaenepoel. Flash: An efficient and portable Web server. In USENIX Annual Technical Conference, pages 199–212, Monterey, CA, June 1999.

[31] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded Sparc processor. IEEE Micro, 25(2):21–29, March/April 2005.

[32] R. E. Hiromoto, O. M. Lubeck, and J. Moore. Experiences with the Denelcor HEP. Parallel Computing, pages 197–206, 1984.

[33] S. Raasch and S. Reinhardt. The impact of resource partitioning on SMT processors. In PACT 12, New Orleans, Louisiana, Sept. 2003.

[34] J. Redstone, S. Eggers, and H. Levy. An analysis of operating system behavior on a simultaneous multithreaded architecture. In Proc. of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 245–256, 2000.

[35] M. Rosenblum, S. Herrod, E. Witchel, and A. Gupta. Complete computer system simulation: The SimOS approach. IEEE Parallel and Distributed Technology: Systems and Applications, 3(4):34–43, Winter 1995.

[36] Y. Ruan, V. Pai, E. Nahum, and J. Tracey. Evaluating the impact of simultaneous multithreading on network servers using real hardware. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Banff, Canada, June 2005.

[37] U. Sigmund and T. Ungerer. Memory hierarchy studies of multimedia-enhanced simultaneous multithreaded processors for MPEG-2 video decompression. In Workshop on Multi-Threaded Execution, Architecture and Compilation, January 2000.

[38] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner. POWER5 system microarchitecture. IBM Journal of Research and Development, 49(4/5), July/September 2005.

[39] A. Snavely and D. Tullsen. Symbiotic job scheduling for a simultaneous multithreaded processor. In Proc. of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 234–244. ACM Press, 2000.

[40] A. Snavely, D. Tullsen, and G. Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 66–76, June 2002.

[41] Standard Performance Evaluation Corporation. SPEC Web benchmarks. http://www.spec.org/web99/, http://www.spec.org/web96.

[42] N. Tuck and D. Tullsen. Initial observations of the simultaneous multithreading Pentium 4 processor. In PACT 12, Sept. 2003.

[43] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In Proc. of the 23rd Annual International Symp. on Computer Architecture, pages 191–202, 1996.

[44] D. Tullsen, S. Eggers, and H. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proc. of the 22nd Annual International Symp. on Computer Architecture, 1995.
