CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1

Dimitris Kaseridis and Lizy K. John
Department of Electrical and Computer Engineering
The University of Texas at Austin
{kaseridi, ljohn}@ece.utexas.edu

Abstract

The UltraSPARC T1 (Niagara) from Sun Microsystems is a new multi-threaded processor that combines Chip Multiprocessing (CMP) and Simultaneous Multi-threading (SMT) with an efficient instruction pipeline so as to enable Chip Multithreading (CMT). Its design is based on the decision not to focus on the performance of single or dual threads, but rather to optimize for multithreaded performance in a commercial server environment that tends to feature workloads with large amounts of thread level parallelism (TLP). This paper presents a study of Niagara using SPEC's latest Java server benchmark (SPECjbb2005) and provides some insight into the impact of its architectural characteristics and chip design decisions. According to our study, we found that adding an additional hardware thread per core can achieve up to 75% of the performance of adding an additional core. Moreover, we show that any additional hardware thread beyond two, and up to the limit of four, can achieve approximately 45% of the improvement of a single core.

1. Introduction

For the past several years, microarchitecture designers have been using process scaling and aggressive speculation as the means for improving processor performance. It is a common belief in the research community that the performance growth of microarchitectures will slow substantially due to poor wire scaling and diminishing improvements in clock rates as the semiconductor feature size scales down [1]. For this reason, recent trends in microprocessor architectures have diverged from previous well-studied designs, driven by the increasing demands for performance along with power efficiency. Such approaches move away from single-core architectures and experiment with concepts based on multiple cores and multithreaded designs [2, 3]. Sun's UltraSPARC T1 [4, 5] is such a processor. A processor like Niagara is designed to favor overall throughput by exploiting thread level parallelism.

Niagara implements a combination of chip multiprocessing (CMP) and simultaneous multi-threading (SMT) to form a chip multi-threaded (CMT) organization. Sun's architects selected an eight-core CMP, with each core able to handle four hardware threads, which results in an equivalent 32-way machine virtually presented to the operating system. From this design point, we can analyze its performance and quantify the benefits that additional cores and/or additional hardware threads have on a Java server workload benchmark like SPECjbb2005 [6], which intends to stress the processor and cache implementation along with the system's scalability and memory hierarchy performance.
2. Description of UltraSPARC T1 Architecture

In November 2005, Sun released its latest microprocessor, the UltraSPARC T1, formerly known as "Niagara" [4, 5]. The distinct design principle of the processor is the decision to optimize Niagara for multithreaded performance. This approach promises increased performance by improving throughput, the total amount of work done across multiple threads of execution. This is especially effective in commercial server applications, such as web services and databases, that tend to have workloads with large amounts of thread level parallelism (TLP).

The Niagara architecture, shown in Figure 1, consists of eight simple, in-order cores, or individual execution pipelines, with each one able to handle four active context threads that share the same pipeline, the SPARC pipe. This configuration allows the processor to support up to 32 hardware threads in parallel and to simultaneously execute eight of them per CPU cycle. The striking feature of the processor that allows it to achieve higher levels of throughput is that the hardware can hide the memory and pipeline stalls on a given thread by effectively scheduling the other threads in the pipeline with a zero-cycle switch penalty.

Figure 1. UltraSPARC T1 architecture [5]

In addition to the 32-way multi-threaded architecture, Niagara contains a high-speed, low-latency crossbar that connects and synchronizes the 8 on-chip cores, the L2 cache memory banks and the other shared resources of the CPU. Due to this crossbar, the UltraSPARC T1 processor can be considered a Symmetric Multiprocessor (SMP) system implemented on chip. Furthermore, in order to achieve low memory latency and to provide the required volume of data to all of the 8 cores operating in parallel, Sun included four on-chip double data rate 2 (DDR2) DRAM channels. This memory configuration achieves a maximum bandwidth in excess of 20 GBytes/s and a capacity of up to 128 GBytes.

The above characteristics allow the UltraSPARC T1 processor to offer a new thread-rich environment for developers and promise higher levels of throughput by hiding memory and pipeline stalls and favoring the execution of multiple threads in parallel. Taking everything into consideration, the processor has the potential to scale well with applications that are throughput oriented, particularly web, transaction processing, and Java technology services. The classes of applications that Niagara is expected to scale well on are: multi-threaded applications, multi-process applications, Java technology-based applications and multi-instance applications.

3. Description of Performance Counters and Tools to Extract Data

The Niagara processor includes in its design a set of hardware performance counters [5, 7] that allow the counting of a series of processor events per hardware thread context, including cache misses, pipeline stalls and floating-point operations. Statistics of processor events can be collected in hardware with little or no overhead, making these counters a powerful tool for monitoring an application and analyzing its performance. Niagara and the Solaris 10 operating system provide a set of tools for configuring and accessing these counters. The basic tools that we used were cpustat and cputrack [8]. Table 1 contains the microarchitectural events that these tools allow us to monitor, along with a small description of each one. In addition to these events, the mpstat and vmstat [8] tools of Solaris 10 can provide higher-level statistics for individual hardware threads and virtual memory, respectively.

Table 1. Events of instrumented performance counters

Event Name     Description
Instr_cnt      Number of completed instructions
SB_full        Number of store buffer full cycles
FP_instr_cnt   Number of completed floating-point instructions
IC_miss        Number of instruction cache (L1) misses
DC_miss        Number of data cache (L1) misses for loads
ITLB_miss      Number of instruction TLB miss traps taken
DTLB_miss      Number of data TLB miss traps taken (includes real_translation misses)
L2_imiss       Number of secondary cache (L2) misses due to instruction cache requests
L2_dmiss_ld    Number of secondary cache (L2) misses due to data cache load requests

In addition to the above tools, we used Solaris' psrset and psradm [8] administrator tools for changing the configuration of the T1 processor. These tools allow the user to [...] different cores, respectively. By using all of the aforementioned tools, we were able to monitor the operation of the UltraSPARC processor in all of the different phases of the SPECjbb2005 benchmark execution.
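As a concrete illustration of how the counter tools can be driven, the following is a minimal sketch of counter sampling on Solaris 10. The event-to-counter pairing (pic0/pic1) and the process ID placeholder are assumptions for illustration; the paper does not list its exact invocations.

    # Sample a counter pair on every hardware thread, once per second for 30 seconds.
    # Event names follow Table 1; the pic0/pic1 assignment is assumed, not taken from the paper.
    cpustat -c pic0=DC_miss,pic1=Instr_cnt 1 30

    # Attach to a single running process (e.g., the SPECjbb2005 JVM) instead of the whole system.
    cputrack -c pic0=L2_dmiss_ld,pic1=Instr_cnt -p <jvm_pid>

    # Higher-level per-hardware-thread and virtual-memory statistics, reported every 5 seconds.
    mpstat 5
    vmstat 5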
4. SPECjbb2005 Overview

Enterprise Java workloads constitute an important class of applications that feature the creation of numerous small objects with relatively short lifetimes. These short-lived objects, which most of the time process various kinds of transactions against databases, create significant pressure on the garbage collection subsystem of the Java Virtual Machine (JVM). SPECjbb2005 (Java Server Benchmark) [6] is the latest iteration from SPEC that attempts to model a self-contained, three-tier system that includes client, server and database elements. In this paper we used SPECjbb2005 as the benchmark for studying Niagara.

In SPECjbb2005 [6], each client is represented by an individual thread that sequentially executes a set of operations on a database emulated by a collection of Java objects. The database contains approximately 25 MB of data. Every instance of a database is considered an individual warehouse. During the execution of the benchmark, the overall number of individual warehouses is scaled so as to allow more clients to connect simultaneously to the available warehouses and, therefore, more threads to execute in parallel. The performance measured by SPECjbb covers the CPUs, caches and memory hierarchy of the system without generating any disk I/O or network traffic, since all three tiers of the modeled system are implemented in the same JVM. Furthermore, the reported score is a throughput rate metric that is proportional to the overall number of transactions that the system is able to serve in a specific time interval, and therefore reflects the ability of the server to scale to a larger number of executing threads.

5. Methodology

The UltraSPARC T1 used for this study was configured as shown in Table 2 below:

Table 2. Experimental Parameters

Parameter                   Value
Operating System            SunOS 5.10 Generic_118833-17
CPU Frequency               1000 MHz
Main Memory Size            8 GBytes DDR2 DRAM
JVM version                 Java(TM) 2 build 1.5.0_06-b05
SPECjbb Execution Command   java -Xmx2560m -Xms2560m -Xmn1536m -Xss128k -XX:+UseParallelOldGC -XX:ParallelGCThreads=15 -XX:+AggressiveOpts -XX:LargePageSizeInBytes=256m
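To make the experimental setup more concrete, the sketch below shows one way the configuration and launch could be scripted on Solaris 10, combining the psradm/psrset tools from Section 3 with the JVM options of Table 2. The virtual-CPU numbering, the benchmark jar and class names (jbb.jar, check.jar, spec.jbb.JBBmain) and the property file are assumptions for illustration; only the JVM flags come from Table 2.

    # Assumed T1 numbering: virtual CPUs 0-31, four per core (core c owns CPUs 4c..4c+3).
    # Example: emulate a 4-core x 2-thread configuration by taking the rest offline.
    psradm -f 2 3 6 7 10 11 14 15                                  # leave two hardware threads per core
    psradm -f 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31     # disable cores 4-7

    # Launch SPECjbb2005 with the JVM options of Table 2 (jar, class and property file names assumed).
    java -Xmx2560m -Xms2560m -Xmn1536m -Xss128k \
         -XX:+UseParallelOldGC -XX:ParallelGCThreads=15 \
         -XX:+AggressiveOpts -XX:LargePageSizeInBytes=256m \
         -cp jbb.jar:check.jar spec.jbb.JBBmain -propfile SPECjbb.props

    # Alternative: instead of offlining CPUs, create a processor set and run the JVM inside it.
    #   psrset -c 0 1 4 5 8 9 12 13     (creates a set from the listed CPUs and prints its id)
    #   psrset -e <set_id> java ...     (executes the command within that processor set)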