
The Intel move from ILP into Multi-threading

Miguel Pires

Departamento de Informática, Universidade do Minho, Braga, Portugal
[email protected]

Abstract. Multicore technology reached the consumer market in recent years to face what seem to be the limits of the single-core technological paradigm. With a small increase in chip cost and some engineering effort, the first implementations of multithreading showed relevant improvements by dividing programs into threads and mixing those threads in a single-core processor. Core parallelism (multicore architectures) then applied multithreading at the core level, inside the same chip. This paper presents an introduction to simultaneous multithreading technology and its implementations. It explains Intel's hyperthreading approach and analyses multicore techniques, with special emphasis on the Xeon processor (single and dual core) and its competitors. As single-core technology no longer seems to deliver enough improvement, multithreading at the core level (core parallelism), sometimes combined with multithreading at the logical level (such as hyperthreading) and with software tailored to it, seems to achieve the best results.

1 Introduction

Thread-level parallelism techniques in a single processor, like HT (hyperthreading), tend to minimise both horizontal and vertical waste in the processor pipeline by simulating a logical processor and by mixing various threads at the same time. MC (multicore) processors extend this technique to a multiprocessor at the physical level, inside the same chip. Nowadays some processors combine these techniques at the instruction and thread (logical and processor) levels with powerful thread-oriented software, to maximise a processor's multithreading capabilities and thus achieve better throughput.

Section 2 of this article describes SC (single-core) multithreading, and Section 4 compares the HT and MC performance gains on Intel's microarchitectures and the critical factors in performance. Finally, a comparative analysis between two different approaches to MC architecture, the Intel Xeon and the AMD Opteron, is presented (section 5).

2 Multithreading on Single Core

Superscalar single-threaded architectures optimize the processor pipeline but do not take into account important factors that are external to the pipeline, like cache misses, interrupts and branch mispredictions. A superscalar, out-of-order processor, with multi-level prefetches and the possibility of executing several instructions simultaneously in the same clock cycle, is represented in figure 1. As can be seen, some horizontal and vertical execution "bubbles" can be found, because only one block of instructions runs at a time [1][2][12].

In fine-grain multithreading various threads run simultaneously, but only one thread is executed at each clock cycle. A commuting mechanism selects one thread at a time, avoiding "horizontal bubbles" in the pipeline, but does not give any solution to the vertical waste. SMT (simultaneous multithreading) runs various threads at the same time, trying to fill the pipeline both vertically and horizontally, as much as possible.
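To make the thread-level parallelism idea concrete, the minimal sketch below (my own illustration, not taken from the paper; names are only illustrative and a C++11 compiler is assumed) splits one independent piece of work into two threads, so that an SMT or multicore processor can overlap their execution and fill issue slots that a single-threaded run would leave empty.

// Illustrative sketch (not from the paper): one independent piece of work split
// into two threads, so an SMT or multicore processor can overlap their execution.
#include <cstdint>
#include <iostream>
#include <thread>

// Sums the integers in [begin, end); each call is an independent thread of work.
static void partial_sum(std::uint64_t begin, std::uint64_t end, std::uint64_t* out) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = begin; i < end; ++i) acc += i;
    *out = acc;
}

int main() {
    constexpr std::uint64_t n = 100000000;
    std::uint64_t lo = 0, hi = 0;

    // Two independent threads: the hardware may run them on two logical
    // processors (HT) or on two cores (MC) at the same time.
    std::thread t1(partial_sum, 0, n / 2, &lo);
    std::thread t2(partial_sum, n / 2, n, &hi);
    t1.join();
    t2.join();

    std::cout << "sum = " << (lo + hi) << "\n";
    return 0;
}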

As described in [1], the first known implementation of multithreading technology is called TX2 and dates back to 1959. TX2 used multiple threads to support fast context switching to handle I/O functions. Since then many evolutions of this concept have been developed, but the most significant one was a fine-grained multithreading scheme with interleaved scheduling among threads (CDC 6600). The first simulation of a multithreaded superscalar architecture appeared in 1994, and in 1995 the first realistic simulation assessment was published and coined the term simultaneous multithreading (SMT).

Figure 1 – Vertical and horizontal waste of non-threaded microarchitectures [3]

Intel introduced Hyperthreading technology in 2002, based on SMT techniques, by allowing two simultaneous threads at the same clock cycle in a single processor. The execution can be divided into two threads mixed in the pipeline. Using optimized algorithms, the threads share physical resources such as caches, execution units, branch predictors, control logic and buses. APICs (Advanced Programmable Interrupt Controllers) control the state of each logical processor and are therefore duplicated, as shown in Figure 4. [5]
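From the software side, the two logical processors simply appear as extra hardware threads. The sketch below is my own hedged example (not from [5], and assuming a C++11 compiler): on a single-core HT processor std::thread::hardware_concurrency() would typically report 2, and 4 on the dual-core HT parts discussed later.

// Sketch: counting the logical processors the operating system exposes.
#include <iostream>
#include <thread>

int main() {
    unsigned logical = std::thread::hardware_concurrency();  // logical processors; may be 0 if unknown
    if (logical == 0)
        std::cout << "logical processor count not reported\n";
    else
        std::cout << "logical processors visible to software: " << logical << "\n";
    return 0;
}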

As we can see in figure 2, experiments made in this field [4] have shown that even when comparing an SMT single core against a parallel multiprocessor without SMT, the former performs better (as in figure 3).

Figure 2 – Performance comparison of SMT to superscalars, multithreaded processors and on-chip multiprocessors (instructions/cycle) [4]

SMT is thus an evolutionary technique that minimizes the pipeline's waste at multiple levels (thread and instruction levels), in order to raise substantially the utilization of the processor. As a consequence, the number of instructions per clock cycle rises, which leads to gains in both multiprogramming and parallel workloads.

Figure 3 – Pipeline Multiprocessor Architecture (based in [5])

Multiscalar processors speculatively execute threads using dynamic branch prediction techniques and squash threads if control (branch) or data (memory) speculation is incorrect. Although all of these architectures exploit multiple forms of parallelism, only SMT has the ability to dynamically share execution resources among all threads. In contrast, the others partition resources either in space or time, thereby limiting their flexibility to adapt to available parallelism.

In this way, HT gains depend on how well applications are fitted to take advantage of this technology, like those that explore data-parallel execution, but most of the time this requires some engineering. [6]

Intel reported that HT achieved a 15% to 27% increase in processor resource utilization in well-optimized multimedia applications. [7]

Figure 4 – Dual Core and HyperThreading Intel technology (based in [5])

Figure 5 – Hyperthreading technology performance gains on several popular multithreaded software packages. [8]

In a super-pipelined microarchitecture, events like cache misses, interrupts and branch mispredictions can be costly, so when one of them happens in one thread, HT processors can fill the pipeline with the other thread and thus maximise the number of instructions per cycle.
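A hedged illustration of that latency-hiding effect follows (my own construction, not a benchmark from the paper): one thread is memory-bound (a pointer chase that misses the caches often) while the other is compute-bound; on an HT core the compute-bound thread can use the issue slots left empty while the first thread waits on memory.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// Memory-bound: follow a random permutation through an array larger than the caches.
static void pointer_chase(std::size_t steps, std::size_t* result) {
    const std::size_t n = std::size_t(1) << 22;               // ~4M entries (~32 MB of indices)
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t(0));
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});
    std::size_t pos = 0;
    for (std::size_t i = 0; i < steps; ++i) pos = next[pos];   // frequent cache misses
    *result = pos;
}

// Compute-bound: arithmetic that stays in registers and keeps the execution units busy.
static void arithmetic(std::size_t iters, double* result) {
    double x = 1.0;
    for (std::size_t i = 0; i < iters; ++i) x = x * 1.0000001 + 0.0000001;
    *result = x;
}

int main() {
    std::size_t chased = 0;
    double value = 0.0;
    std::thread memory_thread(pointer_chase, std::size_t(50000000), &chased);
    std::thread compute_thread(arithmetic, std::size_t(200000000), &value);
    memory_thread.join();
    compute_thread.join();
    std::cout << chased << " " << value << "\n";               // keep the results observable
    return 0;
}

Whether the overlap actually pays off depends on the microarchitecture, which is exactly the point made below and in [11].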

Intel reports that, since the logical processors share almost all physical resources and only a few small structures were replicated, the die area cost of the implementation was less than 5% of the total area, and the clock-cycle time is not significantly different from the non-multithreaded one [5].

Figure 6 – Hyperthreading technology performance boost on multitasking workloads. [8]

This "two logical processors" architecture has many engineering implications. HT changed many basic assumptions about single-threaded out-of-order design. Therefore, to introduce multithreading, Intel had to change algorithms and create new ones to prioritize micro-operations (micro-ops) from different logical processors. Options also had to be taken concerning memory sharing by the two threads and pointer manipulation, a subject already complex enough in the x86 architecture. The increased complexity dramatically increases the validation effort. On the platform side, the chipset, BIOS, operating systems and applications were also reviewed and optimised. [5]

HT improves overall performance in multitasking and when applications are already multithreaded. In this way two threads can be mixed in the same pipeline and run at the same time (which means four threads for a dual core).

Although with HT it looks as if there are two logical processors instead of one, the number of microinstructions that can be executed at a single clock cycle (the pipeline width) stays the same. Furthermore, SMT single-core technologies have a major impact on long pipelines (the case of Intel's NetBurst architecture) but can be inefficient in other microarchitectures. [11]

During the period 1985-2000, microprocessor performance improved at an appreciable rate (following Moore's Law) on a single-processor basis. In November 2003 a group of Intel researchers announced the technological limits of miniaturization [16]. Due to these single-core technology constraints, the continuous demand for speed gains, the limits of exploiting instruction-level parallelism (which will not support the same growth) and the advances in technology, manufacturers saw new opportunities in connecting multiple processors together [1].

As the main aim of parallelism is to maximize the use of the processor, all the accelerating techniques in a single core cause more activity and therefore higher temperatures in the processor. The more single-processor technology slows down performance growth, the more attractive the multiprocessor field becomes.

The dissemination of SMT technology and the good performance achieved in parallel computing encouraged manufacturers to think about new opportunities for SMT applied to multiple cores, as shown in [9]. Researchers realized that SMT would lead to higher degrees of parallelism in MP products.

3 Multithreading on Multicore

With significant advances in microelectronics and the growing usage of threaded software, Intel presented the MC (multicore) product line in 2005. [10] Advances in electronics and miniaturization made it possible to have two cores (and their cache memories) in the same chip. It is like having independent processors, but with much faster communication and memory access. Initially in the server market, with the high-end computing Pentium-based Xeon, Intel introduced its MultiProcessor (MP) technology for the first time. In one of the first versions of the multicore Xeon (figure 7), each Xeon core has its own L1 (16 KB) and L2 (1 MB) caches, and 16 MB of L3 memory is shared between the cores. Thanks to the 65 nm technology it was possible to put 1,328 million transistors in a single chip. This Xeon presents a hyper-pipelined architecture with 32 pipeline stages.

If the execution has fewer threads than the maximum allowed by the processor, preference naturally goes to core execution, because of the limitations of HT compared to MC efficiency explained above. In some versions of MC, HT does not exist or can be switched off, because in some computing markets HT is not efficient in the MC approach.

Processors take full advantage of MC when the execution is thread-tuned (naturally or "forced"), but there are some computing markets where work tends to be "naturally" threaded. In such cases, like the server market, it is possible to take advantage of this multithreaded execution as far as microelectronics (and the market) allows. One example of this is Intel's QuadCore, tailored to the server market, whose principle is the same applied to 4 cores (and also HT).

But having the possibility of many cores, can normal systems take advantage of Hyperthreading? To what limit? Many authors think that systems do not yet explore the possibilities of multithreaded execution, because this paradigm was only recently realised and is relatively recent in the software industry, so there is much more that could run this way. [10] Research in this domain suggests that a very high number of threads leads to complicated and inefficient resource sharing, even in powerful processors. In this way, some authors think that the processor should decide the optimal number of threads to process. [13]
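The point about letting the hardware (or the runtime) set the number of threads can be sketched as follows. This is a minimal example of my own, not the mechanism proposed in [13]; the empty worker body stands in for whatever share of the data each thread would process.

#include <iostream>
#include <thread>
#include <vector>

int main() {
    unsigned logical = std::thread::hardware_concurrency();   // e.g. 4 on a dual core with HT
    if (logical == 0) logical = 1;                             // the value may be unavailable
    unsigned workers = logical;                                // spawning more than this tends to cause
                                                               // the resource-sharing conflicts noted above
    std::vector<std::thread> pool;
    for (unsigned id = 0; id < workers; ++id)
        pool.emplace_back([id] { (void)id; /* process this worker's share of the data */ });
    for (auto& t : pool) t.join();

    std::cout << "ran " << workers << " workers on " << logical << " logical processors\n";
    return 0;
}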

The advantage of having two cores in the same chip is the possibility of real processor multithreading, with much faster communication between the cores (one of the problems of parallel computing) and more efficient management.

Figure 7 – View of the Intel Xeon Dual Core chip. [16]

In the first versions, all the Xeon processors combined MC with HT technology, which means that in each core two threads could run at the same time.

4 SMT Performance comparison: SC and MC

As referred in section 2, the gains of multithreading techniques depend on how software takes advantage of the processor's multithreading capabilities. Experiences in this field [10] demonstrate that MC demands specific adjustments in compilers and other software. They suggest that, to take full advantage of the MC innovations, compilers should be "core-tuned" (2 cores, 4 cores, etc.).

Placing the two SMT techniques side by side, a dual core running two threads is more efficient than a single core running the same two threads, because of the resource sharing inside a single core.

It is not always efficient to use several contexts (virtual processors) in each core, because a high number of threads may cause sharing conflicts (depending on the application), but it is certain that multicore processors are faster than uni-core ones when applications (mainly through their compilers) take advantage of multithreading. The MC gain can be up to 30% when parallel execution is at a high level, but in common applications it will be below that value [7]. Figure 8 shows the MC effect at the microinstruction level in various scenarios (cores, threads).

Figure 8 – Normalized execution time of the benchmarks on the SMT multiprocessor. The sequential execution time is used as a reference for the normalization.
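In the spirit of Figure 8, the snippet below runs the same data-parallel loop with 1, 2 and 4 threads and prints the execution time normalized to the sequential run; it is a minimal sketch of my own, not the benchmark behind the figure, and on a dual core most of the gain should appear at the 2-thread point.

#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// The per-thread slice of a simple data-parallel reduction.
static void work(std::uint64_t begin, std::uint64_t end, double* out) {
    double acc = 0.0;
    for (std::uint64_t i = begin; i < end; ++i) acc += 1.0 / static_cast<double>(i + 1);
    *out = acc;
}

// Runs the loop over [0, n) split across `threads` workers and returns elapsed seconds.
static double run(unsigned threads, std::uint64_t n) {
    std::vector<std::thread> pool;
    std::vector<double> parts(threads, 0.0);
    auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back(work, n * t / threads, n * (t + 1) / threads, &parts[t]);
    for (auto& th : pool) th.join();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    const std::uint64_t n = 200000000;
    double base = run(1, n);                                   // sequential reference
    for (unsigned threads : {1u, 2u, 4u})
        std::cout << threads << " thread(s): normalized time "
                  << run(threads, n) / base << "\n";
    return 0;
}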

5 Comparison with competitors

As usual, the best technological examples are introduced first in the high-value market, and the high-performance server market is a good example.

The following comparison gives an overview of the characteristics and performance of two big competitors in multicore processors, the Intel Xeon 7140M at 3.4 GHz and the AMD Opteron 8220 SE at 2.8 GHz [12]. Intel presents a dual-core, hyper-pipelined (31 stages), superscalar architecture with hyperthreading, L1 (16 KB) and L2 (1 MB) caches and a 16 MB shared L3 cache. The AMD Opteron is a dual core with 64 KB of L1 and 1 MB of L2 per core, HyperTransport and AMD virtualization technologies.

Both processors present high performance levels, although with very distinct architectural options. The Xeon's 16 MB L3 cache is a surplus in ERP applications and databases. On the other hand, the Opteron gets better scalability due to the bus between cores and memory and its larger bandwidth.

Due to its long pipeline, the Xeon uses hyperthreading technology to optimize threads in each core. Intel simply placed two Prescott (previous series) cores in the same chip. On the other side, AMD developed a new memory controller between the cores. This means that there is no need to communicate through the chipset, because memories are addressed directly over an exclusive bus named HyperTransport, which means better bandwidth. The communication with the other resources is also made by HyperTransport, so there is no need to share the resources of the super I/O (IDE controller, SATA, AGP, PCI-Express, USB, etc.). HyperTransport is a high-performance, low-latency, full-duplex connection, and it is possible to expand from dual core to quad core applying the same scheme (Fig. 9).

Figure 9 – AMD QuadCore technologies with HyperTransport [15]

6 Conclusions

Simultaneous Multithreading is a methodology that combines instruction-level parallelism with thread-level parallelism. The aim is to increase the gains of conventional superscalar processors, on a single-processor basis or, more recently, on a multiprocessor-in-one-chip basis. Multithreading techniques divide the execution into several independent threads. In single-core SMT technology (HT in the Intel NetBurst architecture), physical resources are just shared in an optimal thread mixing, but the pipeline does not "enlarge", and at the micro-instruction level this may cause some inefficiency. On the other hand, single-core SMT is inefficient in short pipelines like the AMD Opteron's, because there are not many "wait times" in the pipeline.

Lately, incorporating SMT and parallel-computation knowledge and the recent progress of microelectronics, manufacturers moved into the MultiCore concept (many processors in one chip), which allows threads to be distributed over the available cores. Core parallelism is a model that is only at its beginning and can be improved up to the limits of electronic miniaturization.

Replicating the MC technology seems to give good results and can be a key to faster microprocessing architectures. However, the off-chip bandwidth does not seem to increase at the same speed, and this will certainly constrain the number of "useful" cores in a chip. This means that the supply chain of the cores will not be fast enough to send all the data that the cores can process [14].

References

[1] Hennessy, J. L., Patterson, D. A.: Computer Architecture: A Quantitative Approach, Chapter 6, 3rd edition, Elsevier Science, USA, 2003

[2] Lo, J., Emer, J., Levy, H., Stamm, R., Tullsen, D. M.: Converting Thread-Level Parallelism to Instruction-Level Parallelism via SMT, ACM, Vol. 15, No. 3, August 1997

[3] Tullsen, D., Levy, H.: Simultaneous Multithreading: Maximizing On-Chip Parallelism, ACM Transactions on Computer Systems, 1995

[4] Eggers, S., Emer, J., Levy, H., Lo, J., Stamm, R., Tullsen, D.: SMT: A Platform for Next-Generation Processors, IEEE Micro, 1997

[5] Marr, D., Binns, F., Hill, D., Hinton, G., Koufaty, D., Miller, J., Upton, M.: Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal Q1, 2002

[6] Magro, W., Petersen, P., Shah, S.: Hyper-Threading Technology: Impact on Compute-Intensive Workloads, Intel Technology Journal Q1, 2002

[7] Chen, Y., Holliman, M., Debes, E., Zheltov, S., Knyazev, A., Bratanov, S., Belenov, R., Santos, I.: Media Applications on Hyper-Threading Technology, Intel Technology Journal, 2002

[8] Koufaty, D., Marr, D. T.: Hyperthreading Technology in the Netburst Microarchitecture, IEEE Computer Society, 2003

[9] Spracklen, L., Abraham, G.: Chip Multithreading: Opportunities and Challenges, IEEE, 2005

[10] Curtis-Maury, M., Ding, X., Antonopoulos, C., Nikolopoulos, D.: An Evaluation of OpenMP on Current and Emerging Multithreaded/Multicore Processors, DCS – The College of William and Mary, 2002

[11] Hewlett-Packard Development Company: Characterizing x86 processors for industry-standard servers: AMD Opteron and Intel Xeon, Technology Brief, 2nd Edition, 2005

[12] Silva, D., Ferreira, A.: Comparação dos MultiProcessadores Intel Xeon Dual Core e AMD Opteron, IST – DEI, 2006/2007

[13] Curtis-Maury, M., Wang, T., Antonopoulos, C., Nikolopoulos, D.: Integrating Multiple Forms of Multithreaded Execution on SMT Processors, College of William and Mary, 2005

[14] Dua, R., Lokhande, B.: A Comparative Study of SMT and CMP Multiprocessors, Princeton University, ee8365, 2006, http://www.princeton.edu/~jdonald/research/hyperthreading/ (visited January 25, 2007)

[15] Cardoso, B., Rosa, S., Fernandes, T.: Multicore, Unicamp, 2005