REPORT Compaq Chooses SMT for Alpha Simultaneous Multithreading
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Implicitly-Multithreaded Processors
Appears in the Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA) Implicitly-Multithreaded Processors Il Park, Babak Falsafi∗ and T. N. Vijaykumar School of Electrical & Computer Engineering ∗Computer Architecture Laboratory (CALCM) Purdue University Carnegie Mellon University {parki,vijay}@ecn.purdue.edu [email protected] http://www.ece.cmu.edu/~impetus Abstract In this paper, we propose the Implicitly-Multi- Threaded (IMT) processor. IMT executes compiler-speci- This paper proposes the Implicitly-MultiThreaded fied speculative threads from a sequential program on a (IMT) architecture to execute compiler-specified specula- wide-issue SMT pipeline. IMT is based on the fundamental tive threads on to a modified Simultaneous Multithreading observation that Multiscalar’s execution model — i.e., pipeline. IMT reduces hardware complexity by relying on compiler-specified speculative threads [10] — can be the compiler to select suitable thread spawning points and decoupled from the processor organization — i.e., distrib- orchestrate inter-thread register communication. To uted processing cores. Multiscalar [10] employs sophisti- enhance IMT’s effectiveness, this paper proposes three cated specialized hardware, the register ring and address novel microarchitectural mechanisms: (1) resource- and resolution buffer, which are strongly coupled to the distrib- dependence-based fetch policy to fetch and execute suit- uted core organization. In contrast, IMT proposes to map able instructions, (2) context multiplexing to improve utili- speculative threads on to generic SMT. zation and map as many threads to a single context as IMT differs fundamentally from prior proposals, TME allowed by availability of resources, and (3) early thread- and DMT, for speculative threading on SMT. While TME invocation to hide thread start-up overhead by overlapping executes multiple threads only in the uncommon case of one thread’s invocation with other threads’ execution. -
Kaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading ◦ Instructions from multiple threads run simultaneously on superscalar processor ◦ More instruction fetching and register state ◦ Commercialized! DEC Alpha 21464 [Dean Tullsen et al] Intel Hyperthreading (Xeon, P4, Atom, Core i7) Web applications ◦ Web, DB, file server ◦ Throughput over latency ◦ Service many users ◦ Slight delay acceptable 404 not Idea? ◦ More functional units and caches ◦ Not just storage state Instruction fetch ◦ Branch prediction and alignment Similar problems to the trace cache ◦ Large programs: inefficient I$ use Issue and retirement ◦ Complicated OOE logic not scalable Need more wires, hardware, ports Execution ◦ Register file and forwarding logic Power-hungry Replace complicated OOE with more processors ◦ Each with own L1 cache Use communication crossbar for a shared L2 cache ◦ Communication still fast, same chip Size details in paper ◦ 6-SS about the same as 4x2-MP ◦ Simpler CPU overall! SimOS: Simulate hardware env ◦ Can run commercial operating systems on multiple CPUs ◦ IRIX 5.3 tuned for multi-CPU Applications ◦ 4 integer, 4 floating-point, 1 multiprog PARALLELLIZED! Which one is better? ◦ Misses per completed instruction ◦ In general hard to tell what happens Which one is better? ◦ 6-SS isn’t taking advantage! Actual speedup metrics ◦ MP beats the pants off SS some times ◦ Doesn’t perform so much worse other times 6-SS better than 4x2-MP ◦ Non-parallelizable applications ◦ Fine-grained parallel applications 6-SS worse than 4x2-MP -
A Speculative Control Scheme for an Energy-Efficient Banked Register File
IEEE TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 6, JUNE 2005 741 A Speculative Control Scheme for an Energy-Efficient Banked Register File Jessica H. Tseng, Student Member, IEEE, and Krste Asanovicc, Member, IEEE Abstract—Multiported register files are critical components of modern superscalar and simultaneously multithreaded (SMT) processors, but conventional designs consume considerable die area and power as register counts and issue widths grow. Banked multiported register files consisting of multiple interleaved banks of lesser ported cells can be used to reduce area, power, and access time and previous work has shown that such designs can provide sufficient bandwidth for a superscalar machine. These previous banked designs, however, have complex control structures to avoid bank conflicts or to buffer conflicting requests, which add to design complexity and would likely limit cycle time. This paper presents a much simpler and faster control scheme that speculatively issues potentially conflicting instructions, then quickly repairs the pipeline if conflicts occur. We show that, once optimizations to avoid regfile reads are employed, the remaining read accesses observed in detailed simulations are close to randomly distributed and this contributes to the effectiveness of our speculative control scheme. For a four-issue superscalar processor with 64 physical registers, we show that we can reduce area by a factor of three, access time by 25 percent, and energy by 40 percent, while decreasing IPC by less than 5 percent. For an eight-issue SMT processor with 512 physical registers, area is reduced by a factor of seven, access time by 30 percent, and energy by 60 percent, while decreasing IPC by less than 2 percent. -
The Microarchitecture of a Low Power Register File
The Microarchitecture of a Low Power Register File Nam Sung Kim and Trevor Mudge Advanced Computer Architecture Lab The University of Michigan 1301 Beal Ave., Ann Arbor, MI 48109-2122 {kimns, tnm}@eecs.umich.edu ABSTRACT Alpha 21464, the 512-entry 16-read and 8-write (16-r/8-w) ports register file consumed more power and was larger than The access time, energy and area of the register file are often the 64 KB primary caches. To reduce the cycle time impact, it critical to overall performance in wide-issue microprocessors, was implemented as two 8-r/8-w split register files [9], see because these terms grow superlinearly with the number of read Figure 1. Figure 1-(a) shows the 16-r/8-w file implemented and write ports that are required to support wide-issue. This paper directly as a monolithic structure. Figure 1-(b) shows it presents two techniques to reduce the number of ports of a register implemented as the two 8-r/8-w register files. The monolithic file intended for a wide-issue microprocessor without hardly any register file design is slow because each memory cell in the impact on IPC. Our results show that it is possible to replace a register file has to drive a large number of bit-lines. In register file with 16 read and 8 write ports, intended for an eight- contrast, the split register file is fast, but duplicates the issue processor, with a register file with just 8 read and 8 write contents of the register file in two memory arrays, resulting in ports so that the impact on IPC is a few percent. -
PERL – a Register-Less Processor
PERL { A Register-Less Processor A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by P. Suresh to the Department of Computer Science & Engineering Indian Institute of Technology, Kanpur February, 2004 Certificate Certified that the work contained in the thesis entitled \PERL { A Register-Less Processor", by Mr.P. Suresh, has been carried out under my supervision and that this work has not been submitted elsewhere for a degree. (Dr. Rajat Moona) Professor, Department of Computer Science & Engineering, Indian Institute of Technology, Kanpur. February, 2004 ii Synopsis Computer architecture designs are influenced historically by three factors: market (users), software and hardware methods, and technology. Advances in fabrication technology are the most dominant factor among them. The performance of a proces- sor is defined by a judicious blend of processor architecture, efficient compiler tech- nology, and effective VLSI implementation. The choices for each of these strongly depend on the technology available for the others. Significant gains in the perfor- mance of processors are made due to the ever-improving fabrication technology that made it possible to incorporate architectural novelties such as pipelining, multiple instruction issue, on-chip caches, registers, branch prediction, etc. To supplement these architectural novelties, suitable compiler techniques extract performance by instruction scheduling, code and data placement and other optimizations. The performance of a computer system is directly related to the time it takes to execute programs, usually known as execution time. The expression for execution time (T), is expressed as a product of the number of instructions executed (N), the average number of machine cycles needed to execute one instruction (Cycles Per Instruction or CPI), and the clock cycle time (), as given in equation 1. -
Prozessorarchitektur Am Beispiel Des Amdathlon
PROZESSORARCHITEKTUR AM BEISPIEL DES AMD ATHLON AUSGEARBEITET VON ALEXANDER TABAKOFF Betreuender Lehrer: Prof. Wolfgang Schinwald VERÖFFENTLICHT AM 26.2.2001 PROZESSORARCHITEKTUR INHALTSVERZEICHNIS: 1 Historische / allgemeine Einführung 1.1Die Anwendungsbereiche von Prozessoren 1.2Der erste Prozessor 1.3Die Entwicklung bis zum 586 1.4Der AMD Athlon und der Pentium III - Entwicklungsgeschichte 2 Grundlegende Dinge zur Prozessorarchitektur und dem Bau von Prozessoren 2.1Physikalisch 2.1.1Der Aufbau eines Transistors 2.1.2Die Auswirkungen in die Praxis 2.2Logisch 2.3Die Herstellung von Prozessoren und ihre Grenzen 2.4Der Von-Neumann-Rechner 3 Die Prozessorarchitektur des AMD Athlon im Vergleich zu seinen Konkurrenten 3.1Das Design des AMD Athlon 3.2Das Bussytem des AMD Athlon 3.3Die Cachearchitektur des AMD Athlon 3.4Vor- und Nachteile gegenüber anderen Designs 3.5Interview mit Jan Gütter, Public Relations Sprecher von AMD 4 Anhang 4.1Der Grund dieser Arbeit 4.2Glossar 4.3Literaturverzeichnis 4.4Begleitprotokoll 4.5Bildnachweis Inhaltsverzeichnis: - Seite 2 PROZESSORARCHITEKTUR 1 HISTORISCHE / ALLGEMEINE EINFÜHRUNG 1.1Die Anwendungsbereiche von Prozessoren Prozessoren haben heute verschiedenste Anwendungsbereiche. Sie werden in Autos, Set Top Boxen, Spielekonsolen, Handys, Taschenrechnern, PCs usw. verwendet. Dabei macht der Marktanteil der PC Prozessoren nur rund 2%1 aus. Trotz dieser vergleichsweise geringen Produktion genießen PC Prozessoren einen bedeutend höheren Bekanntheitsgrad. Fast jeder kennt PC Prozessoren wie den Intel Pentium -
A Bibliography of Publications in IEEE Micro
A Bibliography of Publications in IEEE Micro Nelson H. F. Beebe University of Utah Department of Mathematics, 110 LCB 155 S 1400 E RM 233 Salt Lake City, UT 84112-0090 USA Tel: +1 801 581 5254 FAX: +1 801 581 4148 E-mail: [email protected], [email protected], [email protected] (Internet) WWW URL: http://www.math.utah.edu/~beebe/ 16 September 2021 Version 2.108 Title word cross-reference -Core [MAT+18]. -Cubes [YW94]. -D [ASX19, BWMS19, DDG+19, Joh19c, PZB+19, ZSS+19]. -nm [ABG+16, KBN16, TKI+14]. #1 [Kah93i]. 0.18-Micron [HBd+99]. 0.9-micron + [Ano02d]. 000-fps [KII09]. 000-Processor $1 [Ano17-58, Ano17-59]. 12 [MAT 18]. 16 + + [ABG+16]. 2 [DTH+95]. 21=2 [Ste00a]. 28 [BSP 17]. 024-Core [JJK 11]. [KBN16]. 3 [ASX19, Alt14e, Ano96o, + AOYS95, BWMS19, CMAS11, DDG+19, 1 [Ano98s, BH15, Bre10, PFC 02a, Ste02a, + + Ste14a]. 1-GHz [Ano98s]. 1-terabits DFG 13, Joh19c, LXB07, LX10, MKT 13, + MAS+07, PMM15, PZB+19, SYW+14, [MIM 97]. 10 [Loc03]. 10-Gigabit SCSR93, VPV12, WLF+08, ZSS+19]. 60 [Gad07, HcF04]. 100 [TKI+14]. < [BMM15]. > [BMM15]. 2 [Kir84a, Pat84, PSW91, YSMH91, ZACM14]. [WHCK18]. 3 [KBW95]. II [BAH+05]. ∆ 100-Mops [PSW91]. 1000 [ES84]. 11- + [Lyl04]. 11/780 [Abr83]. 115 [JBF94]. [MKG 20]. k [Eng00j]. µ + [AT93, Dia95c, TS95]. N [YW94]. x 11FO4 [ASD 05]. 12 [And82a]. [DTB01, Dur96, SS05]. 12-DSP [Dur96]. 1284 [Dia94b]. 1284-1994 [Dia94b]. 13 * [CCD+82]. [KW02]. 1394 [SB00]. 1394-1955 [Dia96d]. 1 2 14 [WD03]. 15 [FD04]. 15-Billion-Dollar [KR19a]. -
Optimizing SIMD Execution in HW/SW Co-Designed Processors
Optimizing SIMD Execution in HW/SW Co-designed Processors Rakesh Kumar Department of Computer Architecture Universitat Politècnica de Catalunya Advisors: Alejandro Martínez Intel Barcelona Research Center Antonio González Intel Barcelona Research Center Universitat Politècnica de Catalunya A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy / Doctor per la UPC ABSTRACT SIMD accelerators are ubiquitous in microprocessors from different computing domains. Their high compute power and hardware simplicity improve overall performance in an energy efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge from the days of their inception. Compilers generate vector code conservatively to ensure correctness. As a result they lose significant vectorization opportunities and fail to extract maximum benefits out of SIMD accelerators. This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to the compile time static vectorization. There are different environments that support runtime profiling and optimization support required for dynamic vectorization, one of most prominent ones being: 1) Dynamic Binary Translators and Optimizers (DBTO) and 2) Hardware/Software (HW/SW) Co-designed Processors. HW/SW co-designed environment provides several advantages over DBTOs like transparent incorporations of new hardware features, binary compatibility, etc. Therefore, we use HW/SW co-designed environment to assess the potential of speculative dynamic vectorization. Furthermore, we analyze vector code generation for wider vector units and find out that even though SIMD accelerators are amenable to scaling from hardware point of view, vector code generation at higher vector length is even more challenging. -
UNIVERSITY of CALIFORNIA, SAN DIEGO Holistic Design for Multi-Core Architectures a Dissertation Submitted in Partial Satisfactio
UNIVERSITY OF CALIFORNIA, SAN DIEGO Holistic Design for Multi-core Architectures A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science (Computer Engineering) by Rakesh Kumar Committee in charge: Professor Dean Tullsen, Chair Professor Brad Calder Professor Fred Chong Professor Rajesh Gupta Dr. Norman P. Jouppi Professor Andrew Kahng 2006 Copyright Rakesh Kumar, 2006 All rights reserved. The dissertation of Rakesh Kumar is approved, and it is acceptable in quality and form for publication on micro- film: Chair University of California, San Diego 2006 iii DEDICATIONS This dissertation is dedicated to friends, family, labmates, and mentors { the ones who taught me, indulged me, loved me, challenged me, and laughed with me, while I was also busy working on my thesis. To Professor Dean Tullsen for teaching me the values of humility, kind- ness,and caring while trying to teach me football and computer architecture. For always encouraging me to do the right thing. For always letting me be myself. For always believing in me. For always challenging me to dream big. For all his wisdom. And for being an adviser in the truest sense, and more. To Professor Brad Calder. For always caring about me. For being an inspiration. For his trust. For making me believe in myself. For his lies about me getting better at system administration and foosball even though I never did. To Dr Partha Ranganathan. For always being there for me when I would get down on myself. And that happened often. For the long discussions on life, work, and happiness. -
Sony's Emotionally Charged Chip
VOLUME 13, NUMBER 5 APRIL 19, 1999 MICROPROCESSOR REPORT THE INSIDERS’ GUIDE TO MICROPROCESSOR HARDWARE Sony’s Emotionally Charged Chip Killer Floating-Point “Emotion Engine” To Power PlayStation 2000 by Keith Diefendorff rate of two million units per month, making it the most suc- cessful single product (in units) Sony has ever built. While Intel and the PC industry stumble around in Although SCE has cornered more than 60% of the search of some need for the processing power they already $6 billion game-console market, it was beginning to feel the have, Sony has been busy trying to figure out how to get more heat from Sega’s Dreamcast (see MPR 6/1/98, p. 8), which has of it—lots more. The company has apparently succeeded: at sold over a million units since its debut last November. With the recent International Solid-State Circuits Conference (see a 200-MHz Hitachi SH-4 and NEC’s PowerVR graphics chip, MPR 4/19/99, p. 20), Sony Computer Entertainment (SCE) Dreamcast delivers 3 to 10 times as many 3D polygons as and Toshiba described a multimedia processor that will be the PlayStation’s 34-MHz MIPS processor (see MPR 7/11/94, heart of the next-generation PlayStation, which—lacking an p. 9). To maintain king-of-the-mountain status, SCE had to official name—we refer to as PlayStation 2000, or PSX2. do something spectacular. And it has: the PSX2 will deliver Called the Emotion Engine (EE), the new chip upsets more than 10 times the polygon throughput of Dreamcast, the traditional notion of a game processor. -
Superscalar Execution Scalar Pipeline and the Flynn Bottleneck Multiple
Remainder of CSE 560: Parallelism • Last unit: pipeline-level parallelism • Execute one instruction in parallel with decode of next • Next: instruction-level parallelism (ILP) CSE 560 • Execute multiple independent instructions fully in parallel Computer Systems Architecture • Today: multiple issue • In a few weeks: dynamic scheduling • Extract much more ILP via out-of-order processing Superscalar • Data-level parallelism (DLP) • Single-instruction, multiple data • Ex: one instruction, four 16-bit adds (using 64-bit registers) • Thread-level parallelism (TLP) • Multiple software threads running on multiple cores 1 6 This Unit: Superscalar Execution Scalar Pipeline and the Flynn Bottleneck App App App • Superscalar scaling issues regfile System software • Multiple fetch and branch prediction I$ D$ • Dependence-checks & stall logic Mem CPU I/O B • Wide bypassing P • Register file & cache bandwidth • So far we have looked at scalar pipelines: • Multiple-issue designs 1 instruction per stage (+ control speculation, bypassing, etc.) – Performance limit (aka “Flynn Bottleneck”) is CPI = IPC = 1 • Superscalar – Limit is never even achieved (hazards) • VLIW and EPIC (Itanium) – Diminishing returns from “super-pipelining” (hazards + overhead) 7 8 Multiple-Issue Pipeline Superscalar Pipeline Diagrams - Ideal regfile scalar 1 2 3 4 5 6 7 8 9101112 lw 0(r1)r2 F D XMW I$ D$ lw 4(r1)r3 F D XMW lw 8(r1)r4 F DXMW B add r14,r15r6 F D XMW P add r12,r13r7 F DXMW add r17,r16r8 F D XMW lw 0(r18)r9 F DXMW • Overcome this limit using multiple issue • Also called -
Mini-Threads: Increasing TLP on Small-Scale SMT Processors
Appears in Proceedings of the Ninth International Symposium on High Performance Computer Architecture (HPCA-9), 2003. Mini-threads: Increasing TLP on Small-Scale SMT Processors Joshua Redstone Susan Eggers Henry Levy University of Washington {redstone,eggers,levy}@cs.washington.edu Abstract 21464, the register file would have been 3 to 4 times the Several manufacturers have recently announced the size of the 64KB instruction cache [23]. In addition to the first simultaneous-multithreaded processors, both as sin- area requirements, the large register file either inflates gle CPUs and as components of multi-CPU chips. All are cycle time or demands additional stages on today’s aggres- small scale, comprising only two to four thread contexts. A sive pipelines; for example, the Alpha 21464 architecture significant impediment to the construction of larger-scale would have required three cycles to access the register file SMTs is the register file size required by a large number of [23]. The additional pipeline stages increase the branch contexts. This paper introduces and evaluates mini- misprediction penalty, increase the complexity of the for- threads, a simple extension to SMT that increases thread- warding logic, and compound pressure on the renaming level parallelism without the commensurate increase in registers (because instructions are in flight longer). Alter- register file size. A mini-threaded SMT CPU adds addi- natively, lengthening the cycle time to avoid the extra tional per-thread state to each hardware context; an appli- pipeline stages directly degrades performance by reducing cation executing in a context can create mini-threads that the rate at which instructions are processed.