
XMT-M: A Scalable Decentralized Processor

Efraim Berkovich, Joseph Nuzman, Manoj Franklin, Bruce Jacob, and Uzi Vishkin

Department of Electrical and Computer Engineering and
University of Maryland Institute for Advanced Computer Studies (UMIACS)
University of Maryland, College Park, MD

Abstract

A defining challenge for research in computer science and engineering has been the ongoing quest for reducing the completion time of a single computation task. Even outside the parallel processing communities, there is little doubt that the key to further progress in this quest is to do parallel processing of some kind. A recently proposed parallel processing framework that spans the entire spectrum from parallel algorithms to architecture to implementation is the explicit multi-threading (XMT) framework. This framework provides: (i) simple and natural parallel algorithms for essentially every general-purpose application, including notoriously difficult irregular integer applications, and (ii) a multi-threaded programming model for these algorithms which allows an independence-of-order semantics: every thread can proceed at its own speed, independent of other concurrent threads. To the extent possible, the XMT framework uses established ideas in parallel processing.

This paper presents XMT-M, a microarchitecture implementation of the XMT model that is possible with current technology. XMT-M offers an engineering design point that addresses four concerns: buildability, programmability, performance, and backward compatibility. The XMT-M hardware is geared to execute multiple threads in parallel on a single chip: relying on very few new gadgets, it can execute parallel threads without busy-waits. Existing code can be run on XMT-M as a single thread without any modifications, thereby providing backward compatibility for commercial acceptance. Simulation-based studies of XMT-M demonstrate considerable improvements in performance relative to the best serial processor, even for small (and therefore practical) input sizes.

Keywords: Fine-grained SPMD, independence of order semantics, instruction-level parallelism (ILP), no-busy-wait finite state machines, parallel algorithms, prefix-sum, and spawn-join.

1 Introduction

The coming years promise to be exciting ones in the area of computer architecture. Continued scaling of submicron technology will give us orders-of-magnitude increases in on-chip hardware resources. Even by conservative estimates, a single chip will have a billion transistors in a few years. Exploiting parallelism in a big way is a natural way to translate this increase in transistor count into completing individual tasks faster.

Parallelism has traditionally been exploited at both coarse-grained and fine-grained levels. Emphasizing the buildable in the short term, traditional techniques targeting coarse-grained parallelism have focused primarily on MPPs (massively parallel processors). Although MPPs provide the strongest available machines for some time-critical applications, they have had very little impact on the mainstream computer market. Most computers today are uniprocessors, and even large servers have only modest numbers of processors. A recent report from the President's Information Technology Advisory Committee (PITAC) [19] has acknowledged the importance and difficulty of achieving scalable application performance on today's parallel machines. According to the report, "there is substantive evidence that current scalable parallel architectures are not well suited for a number of important applications, especially those where the computations are highly irregular or those where huge quantities of data must be transferred from memory to support the calculation."

The commodity microprocessor industry has traditionally looked to fine-grained, or instruction-level, parallelism (ILP) for improving performance, with sophisticated microarchitectural techniques (such as pipelining, branch prediction, out-of-order execution, and superscalar execution) and sophisticated compiler optimizations, but with little help from programmers. Such hardware-centered techniques appear to have scalability problems in the submicron technology era and already appear to be running out of steam. Compiler-centered techniques are also handicapped, primarily due to the artificial dependencies introduced by serial programming.

On analyzing this scenario, it becomes apparent that the huge investment in serial software has forced programmers to hide most of the parallelism present in an application, by expressing the algorithm in a serial form and delegating it to the compiler and the hardware to re-extract a part of that hidden parallelism. The result has been that both hardware complexity and compiler complexity have been increasing monotonically, with a less satisfying improvement in performance. However, we are reaching a point in time when such evolutionary approaches can no longer bear much fruit, because of increasing complexity and fast-approaching physical limits. According to a recent position paper by Dally and Lacy [9]: "over the past decades the increased density of VLSI chips was applied to close the gap between microprocessors and high-end CPUs. Today this gap is fully closed and adding devices to uniprocessors is well beyond the point of diminishing returns."

To get significant increases in computing power, a radically different approach may be needed. One such approach is to "set free" the crippled programmers, so that they are not forced to suppress the parallelism they observe and are instead allowed to explicitly specify the parallelism. Textbooks on parallel computing attest to the many great ideas that the field has developed over the years, although some of the ideas were ahead of the implementation technology and are still waiting to be put to practical use. Culler and Singh, in their recent book on parallel computer architecture [8], mention under the title "Potential Breakthroughs": "A breakthrough may come from architecture if we can somehow design machines in a cost-effective way that makes it much less important for a programmer to worry about data locality and communication; that is, to truly design a machine that can look to the programmer like a PRAM." The recently proposed explicit multi-threading (XMT) framework [29] was influenced by a hope that this can be done.

We view ILP as the main success story of parallelism thus far, as it was adopted in a big way in the commercial world for reducing the completion time of general-purpose applications. XMT aspires to expand the "ILP parallelism bridgehead" with the "ground forces" of algorithm-level parallelism, which is guided by a sound theoretical foundation, by letting programmers express both fine-grained and coarse-grained parallelism in a natural way.¹

¹ Designed for reducing data access, communication, and synchronization costs for current multiprocessors, there has been the parallel programming methodology described in [8]. There has also been a related evolutionary approach that lets programmers express some of the coarse-grained parallelism with the use of heavyweight "forks" (carried out by the operating system) and lightweight "threads" (using library functions) to be run on multiprocessors. However, it has not yet been demonstrated that general-purpose applications could benefit much from these techniques; two concrete but pointed examples are breadth-first search on graphs and searching directed acyclic graphs, and, more generally, irregular integer applications of the kind taught in standard Computer Science algorithms and data-structures courses.

The XMT framework also permits decentralized and scalable processors with reduced hardware complexity. Decentralization is very important because, in the future, wire delays will become the dominant factor in chip performance. By wire delays we mean the on-chip delay of connections between gates. The Semiconductor Industry Association estimates that within a decade only a fraction of a chip will be reachable in a clock cycle [31]. Microarchitectures will have to use decentralization techniques to tolerate long on-chip communication latencies, i.e., localize communication so as to make infrequent use of cross-chip signal propagation.

The objective of this paper is to explore a decentralized microarchitecture implementation for the XMT paradigm. The highlights of the investigated microarchitecture, called XMT-M, are: (i) decentralized processing elements that can execute multiple threads in parallel, (ii) independence of order among the concurrent threads, (iii) a relaxed memory consistency model, and (iv) buildability with current technology.

The rest of this paper is organized as follows. Section 2 provides background material on the XMT framework. Section 3 describes XMT-M, a realizable microarchitecture implementation of the XMT paradigm. Section 4 presents an experimental analysis of XMT-M's performance, conducted with a detailed simulator; in particular, it shows that even for small input sizes the XMT processor's performance is significantly better than that of the best serial processor that has comparable hardware. Section 5 discusses related work and highlights the differences with the XMT approach. Finally, Section 6 presents the major conclusions and directions for future work.

2 Explicit Multi-Threading (XMT)

The XMT framework is grounded in a rather ambitious vision, aimed at on-chip general-purpose parallel computing; that is, presenting a competitive alternative to the state-of-the-art serial processors and their successors in the next decade. Towards that end, the broad XMT framework spans the entire spectrum from algorithms through architecture to implementation. This section provides a brief description of the XMT framework.

The XMT Programming Model

The programming model underlying the XMT framework is parallel in nature, as opposed to the serial model used in most computers. To be specific, XMT uses an arbitrary CRCW (concurrent read, concurrent write) SPMD (single program, multiple data) programming model. SPMD implies concurrent threads that execute the same code on different data; it is a more implementable extension of the classical PRAM model.² The XMT threads can be moderately long, providing some locality of reference. Figure 1 illustrates the XMT programming model. The virtual threads, initiated by the Spawn and terminated by the Join, have the same code. At run-time, different threads may have different lengths, based on the control flow paths taken through them. The arbitrary CRCW aspect of the model dictates that concurrent writes into the same memory location result in an arbitrary one among these writes succeeding; that is, one thread writes into the memory location while the others bypass to their next instruction. This permits each thread to progress at its own speed, from its initiating Spawn to its terminating Join, without ever having to wait for other threads; that is, no thread ever does a busy-wait for another thread. Inter-thread synchronization occurs only at the Joins. We say that the XMT programming model inherits the independence of order semantics (IOS) of the arbitrary CRCW and builds on it. A major advantage of using this SPMD model is that it is an extension of the classical PRAM model, for which a vast body of parallel algorithms is available in the literature.

² It is important to note that traditional multiprocessors exploit coarse-grain parallelism with the use of very long threads, which provide even more locality. However, programmers find it more difficult to reason about coarse-grain parallelism than about fine-grain parallelism. Thus, there is a tradeoff between thread length (which affects locality of reference) and programmability.

[Figure 1: Parallelism profile for the XMT model. Serial execution alternates with parallel execution: a Spawn initiates the threads and a Join terminates them.]

XMT High-Level Language Level

The XMT high-level language level includes SPMD extensions to standard serial high-level languages. The extension includes Spawn and Prefix-sum statements. The prefix-sum statement is used to synchronize between threads. Prefix-sum is similar to the fetch-and-increment used in the NYU Ultracomputer [16], and has the following semantics:

    Prefix-sum(B, R): (i) B = B + R, and (ii) R = initial value of B.

The primitive by itself may not be very interesting, but it happens to be very useful when several threads simultaneously perform a prefix-sum against a common base. In such a situation, the multiple prefix-sum operations can be combined by the hardware to form a single multi-operand prefix-sum operation. Because each prefix-sum is atomic, each thread will receive a different value in its local storage R (due to independence of order semantics, any order is acceptable). The prefix-sum command can be used for implementing functions such as (i) load balancing (parallel implementation of a queue/stack) and (ii) inter-thread synchronization.
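As a minimal illustration, the semantics can be modeled by the following sequential C function. This is a sketch of the meaning of one call only; in XMT-M, the hardware combines many simultaneous calls into one atomic multi-operand operation, which a serial function cannot show.

    /* Sequential model of the atomic Prefix-sum primitive:
     * (i)  B is incremented by R;
     * (ii) R receives the initial value of B. */
    void PS(int *B, int *R)
    {
        int old = *B;    /* remember the initial value of B */
        *B = old + *R;   /* (i)  B = B + R                  */
        *R = old;        /* (ii) R = initial value of B     */
    }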

We shall use a simple example to clarify the XMT programming model and the use of prefix-sum. Suppose we have an array of integers, A, and wish to compact the array by copying all the non-zero values to another array, B, in an arbitrary order. Figure 2 gives XMT high-level language code to do this array compaction.

The XMT Instruction Set Architecture (ISA)

For the most part, the XMT ISA is the same as any standard ISA. Three new instructions are added to support the XMT programming model: a Spawn instruction, a Join instruction, and a Prefix-sum instruction. The Spawn instruction and Join instruction are added to enable transitions back and forth from the serial mode to the parallel mode. The prefix-sum (PS) instruction is introduced as a primitive for coordinating parallel threads, among other uses. It is important to note that existing serial code forms legal single-thread code for an XMT processor; this phenomenon helps to provide object code compatibility for existing code.

    int base = 0;           # a global variable, shared by all threads
    Spawn(n)                # spawn threads with ID 0 through n-1
    {
        int t_id;           # thread ID number: between 0 and n-1
        int e = 1;          # local variable, private to each thread
        if (A[t_id])        # check if array element is non-zero
        {
            PS(base, e);    # perform prefix-sum to get new value for e
            B[e] = A[t_id]; # copy element from array A to B (index e)
        }
    }                       # implied Join
    n = base;

Figure 2: The array compaction problem. The non-zero values in array A are copied to array B in an arbitrary order; the code above is an XMT high-level program that solves it.

Register Model

The XMT ISA specifies two types of registers: global and local. The global registers are visible to all threads of a spawn-join pair and to the serial thread. The local registers are specific to each virtual thread, and are visible only to the relevant thread. To provide compatibility with existing binaries, assemblers, compiler development tools, and operating systems, one can simply divide the register space into two partitions, so that the lower partition refers to the global registers and the upper partition refers to the local registers. For instance, a reference in MIPS assembly language to a register in the lower partition implies a global register access, while a reference to a register in the upper partition implies access to a local register.
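As an illustration only, such a partition can be written down as follows; the split point below is hypothetical, since the concrete register numbers are not specified here.

    /* Hypothetical register-space partition (illustrative; the actual
     * boundary in the XMT ISA is not specified in this text). */
    #define GLOBAL_REG_COUNT 16                        /* assumed split point */
    #define IS_GLOBAL_REG(r) ((r) < GLOBAL_REG_COUNT)  /* lower partition     */
    #define IS_LOCAL_REG(r)  ((r) >= GLOBAL_REG_COUNT) /* upper partition     */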

Memory Model

The XMT framework supports a shared memory model; that is, all concurrent threads see a common shared memory address space.

Memory Consistency Model. The XMT memory model supports a weak consistency model between the concurrent threads of a Spawn-Join pair. This stems from XMT's independence of order semantics: memory reads and writes from concurrent threads are generally not ordered. When an inter-thread ordering is required, a prefix-sum instruction is used. The following example code illustrates this (register numbers are illustrative):

    sw   $t0, A($t1)       # write local value to A[i], based on the thread ID
    psi  $g0, $t2          # use PS to coordinate thread accesses; $g0 is initialized to 0
    beq  $t2, $zero, DONE  # if the PS result is zero, we're done
    xori $t1, $t1, 1       # if i is even, i = i+1; else i = i-1
    lw   $t3, A($t1)       # load A[i] into $t3

The above code can result in the following sequence of events:

    Thread 0                          Thread 1
    Write to A[0]                     Write to A[1]
    PS operation, gets 0              PS operation, gets 1
    PS result is 0, so go to DONE     PS result is 1, so continue
                                      Read A[0]

As per the semantics of the program, values written in one thread before the prefix-sum must be visible to the remaining threads once they execute their prefix-sum. Thus, the other thread is assured of getting the correct value of A[i]. We call this type of prefix-sum a "gatekeeper" prefix-sum. Gatekeeper prefix-sums can be identified at compile time.
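At the XMT high-level-language level, the same gatekeeper pattern can be sketched in the style of Figure 2. In the sketch below, compute() and use() are hypothetical stand-ins, and the pairing of two threads through the XOR of the thread ID mirrors the assembly example above.

    int g = 0;                    # gatekeeper base, shared; initialized to 0
    Spawn(2)                      # two threads, with IDs 0 and 1
    {
        int t_id;                 # thread ID
        int e = 1;
        A[t_id] = compute(t_id);  # write a local value to A[t_id]
        PS(g, e);                 # gatekeeper prefix-sum: e gets 0 or 1
        if (e != 0)               # the thread ordered second is assured
            use(A[t_id ^ 1]);     # of seeing its partner's earlier write
    }                             # implied Join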

A Design Principle Guiding XMT: No-busy-waits Finite State Machines

The XMT framework is based on an interesting design principle, or "design ideal", called no-busy-waits finite state machines (NBW FSMs). According to that ideal, progress is achieved by sharing the work among FSMs; however, a delayed FSM cannot suspend the progress of other FSMs. The latter means that no FSM can force other FSMs to waste cycles by doing busy-waiting. The NBW FSMs ideal guided the design of the XMT high-level programming language and assembly language. The FSMs that are reflected in these languages are virtual. The IOS in the assembly language allows each thread to progress at its own speed, from its initiating Spawn command to its terminating Join command, without having to ever wait for other threads. Synchronization occurs only at the Join command, where all threads must terminate before the execution can proceed as a serial thread. That is, in line with the NBW FSMs ideal, a virtual thread avoids busy-waits by terminating itself ("committing suicide") just before the thread could have run into a busy-wait situation. As we will see in Section 3, at the microarchitecture level we will only be able to alleviate violations of the NBW FSMs principle, but not to completely avoid them.

3 Decentralized XMT Microarchitecture

The XMT framework can be implemented in hardware in a variety of ways. This section details one possible microarchitecture implementation for XMT. This implementation is based on current technology, and is intended to offer architectural insight into XMT, not to stand as the sole microarchitecture choice for the XMT framework.

To begin with, we anticipate that much of the communication will be intra-thread. This suggests a decentralized organization with multiple processing elements, or thread control units (TCUs), to execute concurrent threads. Each TCU has its own fetch unit, decode unit, register file, and dispatch buffer. The TCUs are independent in that they do not perform cross-checking of dependencies while executing the instructions of their threads. In order to maximize the use of certain resources, several TCUs are grouped together as a cluster, as shown in Figure 3. The TCUs in a cluster share a common per-cluster L1 instruction cache, a common set of functional units, and a common per-cluster L1 data cache.

How many TCUs should there be in a cluster? The answer depends on the interconnect wire delays within a cluster, i.e., the wire delays from the shared L1 instruction cache, the functional units, and the L1 data cache to the TCUs. If these data transfers take multiple clock cycles, then the benefits of clustering the TCUs together may become counterproductive.

Figure 4 illustrates the organization of multiple clusters within the XMT-M processor, and shows the inter-cluster communication channels. These channels take multiple cycles for a data transfer. The operations requiring inter-cluster communication are: (i) prefix-sum, (ii) spawn/join, (iii) global register file access, and (iv) memory access.

Prefix-sum Implementation

The prefix-sum mechanism goes through a central facility that can compute results from n threads in O(1) time, not O(n) time. There is a dedicated prefix-sum bus for broadcasting the prefix-sum results. Clusters have a point-to-point connection to the global prefix-sum unit, to which they send their prefix-sum requests for processing.

[Figure 3: An XMT-M cluster. Each XMT-M cluster includes a shared instruction cache, a shared data cache, multiple independent TCUs (each with its own IF/ID stages and dispatch buffer), a number of shared functional units, a register file, and an I/F unit to the memory interface.]

[Figure 4: An XMT-M processor. Independent clusters connect, through per-cluster I/F units and the memory interface, to the level-2 instruction and data caches, and to the global register file, the prefix-sum unit, and the spawn/join unit. Communication paths in this diagram require multiple cycles.]

All prefix-sum requests with a common base that arrive at the central prefix-sum unit in the same time slice are processed simultaneously, and the results are sent out simultaneously. The whole process is pipelined, so that a number of different prefix-sum operations can be in flight at the same time. The base register contents and register identifier are fired out on a shared bus as soon as the first request for a particular base arrives. The I/F (interface) unit in each cluster listens on the shared bus for these broadcasts.

First, we notice that a prefix-sum request that contributes zero to the base is equivalent to a read of the base with no specified ordering. Thus, these requests can be handled locally at the cluster by reading a local copy of the base register. Non-zero prefix-sum requests from the same cluster using the same base can be combined into a single request to the global prefix-sum unit. The global unit groups these cluster requests into batches of the same base, performs a prefix-sum across the batch, and broadcasts the results on the prefix-sum bus. Each cluster listens on the bus and derives its range of values within that cycle's batch. The cluster also updates its local copy of the base register. Each cluster assigns unique values from its prefix-sum range to its local prefix-sum requests. A prefix-sum hardware implementation that avoids any serialization is described in [28]. We also note that, in the case of 1-bit prefix-sum requests, the number of wires necessary for the shared bus is c log₂(n/c) + log₂ c, where c is the number of clusters and n is the number of TCUs.
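As an illustrative software model of these cluster-side steps (the function and variable names are ours, and the real combining is done by hardware within a time slice), the merging of local requests and the later hand-out of unique values can be sketched in C:

    /* Step 1: merge the nonzero PS increments of local TCUs into a single
     * request to the global prefix-sum unit. incr[0..k-1] are the local
     * increments gathered in one time slice. */
    int cluster_total(const int *incr, int k)
    {
        int total = 0;
        for (int i = 0; i < k; i++)
            total += incr[i];
        return total;                 /* single increment sent upstream */
    }

    /* Step 2: after the global unit broadcasts this cluster's range start
     * for the batch, hand out unique values to the local requesters. */
    void assign_results(const int *incr, int k, int range_start, int *result)
    {
        int next = range_start;
        for (int i = 0; i < k; i++) {
            result[i] = next;         /* value returned to requester i */
            next += incr[i];
        }
    }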

Spawn/Join Implementation

The XMT-M processor can be in one of two modes at any given time: serial or parallel. In the serial mode, only the first TCU is active. Execution of a Spawn instruction causes a transition from the serial mode into the parallel mode. The spawning is performed by the Spawn Control Unit (SCU). The SCU does two main things: (i) it activates the TCUs whenever a Spawn is executed, and (ii) it discovers when all virtual threads have been executed, so that the processor can resume serial mode. The SCU broadcasts the Spawn command and the number of threads, n, on a bus that connects to the TCUs. Each TCU, upon receiving the Spawn command, executes the flowchart given in Figure 5.

[Figure 5: Flowchart depicting the activities of a TCU upon receiving a Spawn(n) command: set t_id = TCU_id; if t_id exceeds the number of threads, initiate a Join; otherwise, execute thread t_id, then do a PS to get a new t_id and repeat the test.]

If the number of virtual threads is less than the number of TCUs, t, then all of the threads are initiated at the same time. If the number of virtual threads is more than the number of TCUs, then the first t virtual threads are initiated in the t TCUs. The remaining threads are initiated as and when individual TCUs become free. Notice that it would have been straightforward to spawn the second set of threads (threads with label t and higher) only after all of the TCUs have completed the execution of the first set of threads assigned to them. However, if some TCUs terminate and wait while others continue, a gross violation of the NBW-FSMs ideal occurs. To alleviate this violation, we have devised a hardware-based scheme that makes use of the IOS between threads. As and when a TCU completes the execution of its thread, it sends a prefix-sum request to the SCU. The SCU performs the prefix-sum operation and sends a new t_id to the waiting TCUs. Each of the waiting TCUs makes sure that its t_id is less than n before proceeding to execute the thread; otherwise, the TCU initiates a join operation. This type of thread allocation continues until all virtual threads of a Spawn instruction have been initiated.
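A C sketch of this per-TCU allocation loop (modeled on the flowchart of Figure 5) follows. Here execute_thread(), initiate_join(), and the shared counter next_id are hypothetical names for the hardware actions and SCU state described above, and PS is the C model given earlier.

    /* Behavior of one TCU after a Spawn(n) broadcast (cf. Figure 5).
     * next_id models the SCU-side thread-ID counter, initialized to the
     * number of TCUs, t. */
    extern int next_id;                /* shared counter, held at the SCU */

    void tcu_on_spawn(int TCU_id, int n)
    {
        int t_id = TCU_id;             /* first round: thread ID = TCU ID */
        while (t_id < n) {
            execute_thread(t_id);      /* run virtual thread t_id to completion */
            int e = 1;
            PS(&next_id, &e);          /* atomically grab the next thread ID */
            t_id = e;
        }
        initiate_join();               /* no threads left: terminate; no busy-wait */
    }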

In the parallel mode, we also take advantage of the SPMD-style instruction code in the following way: the code is broadcast on the bus, so that all TCUs can simultaneously get the code to be executed.

Global Register Coordination

It is likely that for global register coordination we can use the same broadcast data bus as the prefix-sum unit. Because that bus's activity depends on the frequency of prefix-sum operations, we can use the extra capacity to broadcast global register values when they are written. Each local register file can keep the values of the global registers for use by the cluster functional units.

Whenever a thread writes to a global register, the new value is written to the local copy of the shared register, and then is sent out on the shared bus when the bus becomes available. To maintain coherence, the processor does not restart serial mode after a join until all register writes have been broadcast. Also, if one thread writes to a shared register and another thread (or threads) needs to read that value, those accesses are prioritized by using a gatekeeper prefix-sum. In such a situation, the thread that writes to the register would issue its prefix-sum request only after its write has gone out on the bus and is therefore visible to all the clusters. In this way, a delayed coherence can be maintained across all the global register copies, and the clusters can safely use their local global register values without fear that the register values are incorrect. Note that such a relaxed consistency model is possible among the register files because the XMT programming model allows it.

Memory System

The XMT framework supports a shared memory model; that is, all concurrent threads see a common shared memory address space. We can think of two alternatives for implementing the top portion of the memory hierarchy for such a system: a shared cache and distributed caches. The shared cache implementation has the advantage of not having to deal with issues such as cache coherency; however, its access time is likely to be higher because of interconnect demands. The distributed cache implementation permits each TCU or cluster to have a local cache, thereby providing faster access to the top portion of the memory hierarchy; however, it has to deal with the problem of maintaining coherency between the multiple caches. Further research is needed to determine which of the two options is best for the XMT framework for different technologies. For instance, if a high miss rate exists in the local caches, then each memory access is likely to be comparable in duration to an access of a shared cache. In that case, it makes sense to avoid the problem of cache coherence in the design and implement only a single shared cache. The emphasis in this paper on a decentralized architecture component and existing technology led us in the direction of local caches.

The memory system we investigate in this paper for XMT-M is as follows. Each cluster has a small level-1 cache. Multiple level-1 caches are connected together by an interconnect. The next level of the memory hierarchy consists of a large shared cache (the level-2 cache), which connects to main memory. A large number of pipelined memory requests can be pending at a time, as in the Tera processor [3]. The idea is to use an overabundance of memory requests at each level of the memory hierarchy to hide memory latency. The interconnect used to tie the caches together can be a crossbar, a shared bus, a ring, etc.

Cache Coherence

When a shared memory model is implemented in a distributed manner, maintaining a consistent view of the memory for all the processing elements is vital. The use of distributed caches necessitates implementing protocols for maintaining cache coherence. Cache coherence protocols come under two broad categories: invalidate-based and update-based. In a write-back write-invalidate coherence scheme, a processor doing a write waits to get access to the shared bus. It then broadcasts the write address. The other caches snoop the bus and invalidate that block if present. The writing processor then has exclusive access to that cache block, and keeps writing to the copy in its local cache. Another processor reading that same block will cause a read miss at its cache and, after getting access to the bus, will send a read request to memory. Because the processor with exclusive access to the block is snooping the bus, it will gain access to the bus, send the updated version of the block, and abort the access to memory. This type of protocol works well when there is not much data sharing.

In the analogous case in an update-based protocol, a processor sends a write request to its local cache, and the request gets broadcast to the local caches of all processors. Upon receiving the update, the local caches update the relevant block if present. Any processor that needs to read a memory location will get the value from its local cache or from the next level of the memory hierarchy. This type of protocol generally results in high bandwidth requirements because of using write-through caches.

[Figure 6: Cache coherence hazards for XMT. Case 1: a (serial) thread X writes to location A, a Spawn occurs, and a parallel thread Y reads from location A. Case 2: a parallel thread X writes to location A, a Join occurs, and a (serial or parallel) thread Y reads from location A. Case 3: a parallel thread X writes to location A, a gatekeeper prefix-sum occurs, and a parallel thread Y reads from location A.]

For the XMT-M memory system, we chose an update-based protocol, because the XMT memory consistency model, by virtue of its independence of order semantics (IOS), permits a relaxed update policy. Therefore, after a TCU sends a write request to its local cache, it can generally continue executing its thread without waiting for the write to be globally performed. However, there are three cases allowed by the programming model where this write-and-continue policy must be modified. These three "coherence hazards" are summarized in Figure 6. The first case is that writes from the serial thread must complete before the system initiates a spawn. The second case is that writes from spawned threads must complete before the system restarts the serial thread. The third case occurs when a prefix-sum instruction, and a branch based on the outcome of that prefix-sum instruction, separate a read from a write. Thus, the writing TCU will stall its spawn, join, or gatekeeper prefix-sum operation until it is sure that its write request has reached all other caches. The programming model guarantees that no thread will attempt to read the updated data before the next spawn, join, or gatekeeper prefix-sum operation executes. Thus, the caches are allowed to become inconsistent with each other for extended periods of time. This protocol may occasionally stall the writing thread; the stall time depends on how long it takes to broadcast a write to all the caches. With a ring-based interconnect, this stall time would be the time it takes for a write request to go around the ring, plus the time it takes for the local ring segment to become available; but even in that case, no busy-waits occur for the remaining threads.
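A C-like sketch of this policy follows, with hypothetical helper names; wait_writes_visible() stands in for the hardware stalling until all of this TCU's pending writes have reached every L1 cache:

    /* Write-and-continue, with stalls only at the three coherence
     * hazards of Figure 6 (hypothetical helper functions). */
    void store_word(unsigned addr, int val)
    {
        write_local_l1(addr, val);          /* update local cache and continue */
        broadcast_update_async(addr, val);  /* update propagates to other L1
                                               caches in the background        */
    }

    void before_spawn(void)          { wait_writes_visible(); }  /* case 1 */
    void before_serial_restart(void) { wait_writes_visible(); }  /* case 2 */
    void before_gatekeeper_ps(void)  { wait_writes_visible(); }  /* case 3 */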

4 Experimental Evaluation

The previous section presented a detailed description of a decentralized XMT microarchitecture. Next, we present a detailed simulation-based performance evaluation of XMT-M.

Experimental Framework

We have developed an XMT-M simulator for evaluating the performance potential of XMT-M, and the XMT framework in general. This simulator uses the SimpleScalar ISA [6], with four new instructions added: a Spawn, a Join, and two Prefix-sums. The register set is also expanded to have both global and local registers. The simulator accepts XMT assembly code and simulates its execution. All important features of the XMT-M system have been incorporated in the simulator.

Default XMT-M Configuration

• Processor configuration: three different configurations, with 2, 8, and 32 clusters respectively.

• Spawn/Join: dedicated PS unit for thread ID generation.

• Prefix-sum (PS): pipelined; a fixed cycle latency applies after a request is received, and the total PS execution time is that latency plus the inter-cluster communication delay.

• Cluster configuration: several TCUs per cluster, sharing a single-cycle ALU, a single-cycle branch unit, and a mult/div unit (multi-cycle multiply and divide; neither is pipelined); a load/store buffer with ports to the L1 d-cache; and a PS I/F slot.

• TCU pipeline: single-issue, in-order pipeline; stalls on branch.

• Cache configuration: multi-word blocks; d-caches are set-associative and i-caches are direct-mapped; d-caches are write-through, with no fetch on write-miss. The on-chip L2 d-cache is much larger than the per-cluster L1 d-caches, and likewise for the L2 i-cache relative to the L1 i-caches. Each L1 d-cache can send and receive memory requests every C cycles, where C is the number of clusters, which roughly corresponds to a ring topology.

• Main memory: fixed multi-cycle access latency; multi-word-per-cycle bandwidth; DRAM bank access conflicts are not modeled.

For these experiments, we model the memory hierarchy without specifying the exact interconnection structure, by setting a fixed latency for every memory request from a level-1 cache to get to the shared cache (level 2) and to the other level-1 caches. To justify this modeling, consider a ring topology for connecting the level-1 caches. Assume that at each node on the ring there is an automaton that will give priority to a request from another node. Each level-1 cache has a buffer to hold requests that originate from its local cluster until these requests can be sent. Therefore, once a request makes it out onto the ring, its maximum latency will depend on the time taken to go around the ring. In the worst case of all nodes sending requests all the time, each node will have to wait for its original request to come back around the ring before it can issue another request. Note that our fixed-latency modeling is therefore a somewhat conservative estimate for the processing of requests from the level-1 caches. For a relatively low number of nodes on the ring, as in the designs we simulated, the ring does not perform too badly when we have a relatively long main-memory access latency. However, for a higher number of nodes and/or shorter memory latencies, a different design would be a better choice.

Default Superscalar Configuration

• Processor configuration: wide-issue superscalar with out-of-order execution; an instruction window and a load/store queue; multiple integer ALUs and multiply units; a bimodal branch predictor.

• Cache configuration: separate i-cache and d-cache, both of which have single-cycle access and multi-word blocks. The d-cache is the same as XMT-M's L2 d-cache; the i-cache has several times as many blocks as XMT-M's L2 i-cache.

Benchmarks and Performance Metrics

For benchmarks, we use a collection of code snippets with varying degrees of inherent parallelism, as described in Figure 7. We are currently restricted to using snippets in this paper because no big application exists yet for the XMT programming paradigm; we believe that performance on these snippets provides reasonable intuition into the performance and scalability of XMT-M. Evaluating XMT-M's performance for entire applications can be done only after rewriting the entire applications in the XMT programming model and then compiling those parallel programs to XMT code.

Figure 7: Benchmarks.

• linkedlist (serial): Traverse a linked list randomly dispersed through memory and find the sum of the list item data values. This application is not one which we know how to parallelize, so it is implemented with a serial algorithm. Input sizes: 50 item list spaced in 200 words, 500 in 2K words, 5K in 20K words, 250K in 1M words.

• stream (embarrassingly parallel): Based on the STREAM benchmark [26]: we sequentially read arrays, perform some short calculations on the values, and write the results to another array. Since each iteration of the loop is independent, parallelization of execution is obvious. In the superscalar domain, one approach for speeding up this code is loop unrolling; we do that for the SimpleScalar version. Input sizes: 50, 500, 5K, and 250K item arrays.

• arrcomp (interacting parallel): Compacting an array: we take a sparse array and rewrite it into a compact form. This application requires keeping a running count of the next available location in the new array. Two variants of arrcomp were simulated: arrcomp_d, which just reads the original array (regular memory access), and arrcomp_i, which uses indirection through another array (irregular memory access). Input sizes: 50, 500, 5K, and 250K item arrays (uncompacted arrays are 1/4 full).

• max (more frequently interacting parallel): Find the maximum value of a list. In the serial case, we read through the list, keeping a running maximum. For the parallel case, we choose a synchronous max-finding scheme: a balanced binary tree is formed, where a node of the tree will have the result of a maximum operation on its two child nodes, and the root of the tree will have the maximum of the list. The algorithm proceeds from leaves to root, synchronizing after every level in the tree. The threads are very short, and there are log(n) spawn-joins. Input sizes: 50, 500, 5K, and 250K item arrays.

• listsort (sorting): Unraveling a linked list of known length which is packed within an array. This is a version of the problem called "list-ranking". This application is useful for managing linked-list free space in OSes [25]. In the serial algorithm, we traverse the list and rewrite it in the proper order. For the XMT version, we use two algorithms: (1) Wyllie's pointer-jumping algorithm [18] for the 50- and 500-sized inputs, and (2) the no-cut coin-tossing algorithm for the 5K- and 250K-sized inputs. The work in [10] presented a discussion of the various list-ranking algorithms on XMT. Input sizes: 50, 500, 5K, and 250K item arrays.

• integersort: A variant of radix-sort. It sorts integers from a range of values by applying bin-sort in iterations for a smaller range. For speed-up evaluations, one would wish to compare integersort with the fastest serial sorting algorithms, and not only serial radix-sort, as we did; however, the literature implies that for some memory architectures radix-sort is fastest [1], while for others other sorting routines are fastest [23]. Input sizes: 50, 500, 5K, and 250K item arrays.

We recognize the importance of eventually carrying out studies with entire applications.

A compiler-like translation from high-level to optimized assembly code is done manually, for lack of an XMT compiler. To have a fair basis for comparison, the serial versions of the applications are also generated by hand, and are optimized by using techniques such as loop unrolling.

We use small input data sizes to illustrate that the XMT-M processor can achieve better performance even for small input sizes. When comparing performance against superscalar processors, the metric used is speedup, obtained by dividing the number of execution cycles taken by the superscalar processor by the number of cycles taken by XMT-M.

XMT-M Performance

In the first set of experiments, we measure the number of cycles taken by different XMT-M configurations to execute the benchmarks with varying-size inputs. The numbers of cycles taken by the XMT-M configurations are compared against those taken by the default centralized wide-issue superscalar processor. Figure 8 presents the results obtained. The figure consists of four diagrams, corresponding to the four different input sizes. In each diagram, the X-axis represents the benchmarks. For each benchmark, three histograms are plotted, one for each XMT-M configuration: a 2-cluster XMT-M, an 8-cluster XMT-M, and a 32-cluster XMT-M. The Y-axis denotes the speedup of the XMT-M configurations over that of the default superscalar configuration. Notice that comparing the IPCs (instructions per cycle) for the two processors would not be meaningful, as they execute different programs.

[Figure 8: XMT-M speedups relative to serial computing. Four diagrams, for input sizes 50, 500, 5K, and 250K, plot the speedup over the superscalar processor for each benchmark (linkedlist, listsort, integersort, max, stream, arrcomp_d, arrcomp_i) under the 2-, 8-, and 32-cluster configurations.]

Let us look at the results of Figure 8 closely. For an input size of 50 (cf. the first diagram in Figure 8), the 8-cluster XMT-M configuration performs better than the other two XMT-M configurations: the 2-cluster XMT-M does not have enough parallel resources to harness inter-thread parallelism, and not much inter-thread parallelism is available with an input size of 50 to compensate for the high latencies of the 32-cluster XMT-M. The 8-cluster configuration performs substantially better than the centralized superscalar processor for three of the benchmarks, and performs slightly worse than the centralized superscalar processor for three of the benchmarks.

When the input size is increased to 500 (cf. the second diagram in Figure 8), the 32-cluster XMT-M, despite its increased cross-chip latency, begins outperforming the 8-cluster XMT-M for most of the benchmarks, because of the increased inter-thread parallelism available. Even the 2-cluster XMT-M configuration outperforms the centralized superscalar processor in all but one of the benchmarks.

When the input size is increased beyond 500, the XMT-M configurations continue to harness more parallelism, as might be expected. It is important to point out that these results have to be analyzed in the proper context: the XMT-M configurations are performing in-order execution and single-instruction issue in each TCU. Thus, the TCUs in an XMT-M processor are not performing functions such as branch prediction, dynamic scheduling, register renaming, memory address disambiguation, etc. In short, the XMT-M TCUs are not exploiting any intra-thread parallelism, except for the overlap obtained in the TCU pipeline. This is very important. First, it suggests that our speedup results are conservative. Second, in the future, as clock cycles continue to decrease, it becomes more and more difficult to perform centralized tasks such as dynamic scheduling and branch prediction within a limited cycle time. We would also like to point out that the XMT framework does not preclude the use of conventional techniques to extract intra-thread parallelism.

Effect of Cross-chip Communication Latency on XMT-M Performance

Our next set of experiments focuses on studying the effect of global cross-chip communication latency on XMT-M performance. Cross-chip communication refers to prefix-sum operations, spawn/join operations, global register accesses, and L2 cache accesses. Three different latencies were modeled for one-way inter-cluster communication: 1, 4, and 16 clock cycles. This corresponds to latencies of 2, 8, and 32 cycles, respectively, for functions that require two-way communication.

Figure 9 gives the results obtained in this study. Again, four diagrams are given, corresponding to the four different input sizes. Each diagram records three histogram bars for each benchmark. Thus, the X-axis represents the benchmarks and, for each benchmark, the three inter-cluster communication latencies. The Y-axis represents the normalized performance; normalization is done with respect to the performance of the unit-latency configuration. That is, the height of a bar represents the value obtained by dividing the execution time of the benchmark for a latency of 1 cycle by the execution time of the benchmark for the corresponding latency.

The results of Figure 9 indicate that, for all input sizes considered, the performance of XMT-M does not change when the inter-cluster communication latency is increased from 1 to 4. Even when this latency is increased to 16 cycles, XMT-M's performance remains the same, except for the two arrcomp benchmarks, for which there is a modest drop. These results indicate that XMT-M is somewhat resilient to increased cross-chip interconnect delays.

5 Related Work

First of all, we should emphasize that we have not invented parallel computing with XMT; we have tried to build on available technologies to the extent possible.

The relaxation in the synchrony of PRAM algorithms is related to the work of [7] and [15] on asynchronous PRAMs. The high-level language we used for XMT builds on Fork and its previous versions, developed at the University of Saarbrücken, Germany (see [20] and [21]). Basic insights concerning the use of a prefix-sum-like primitive go back to the Fetch-and-Add and Fetch-and-Increment primitives (cf. [14, 16]). Insights concerning nested Spawns rely on the work of Guy Blelloch's group [4, 5] and others.

[Figure 9: Effect of cross-chip global communication latency on XMT-M performance. Four diagrams, for input sizes 50, 500, 5K, and 250K, plot performance normalized to the unit-latency configuration for each benchmark under one-way inter-cluster latencies of 1, 4, and 16 cycles.]

The University of Wisconsin Multiscalar project [13] and the University of Washington simultaneous multithreading (SMT) project [27], with their use of multiple program counters, and the computer architecture literature on multithreading (see, for instance, [17]), have also been very useful. However, the way XMT proposes to attack the completion time of a single task, which is so central to XMT, makes XMT drastically different from these approaches; that is, the reliance on PRAM algorithms. Our experience has been that some knowledge of PRAM algorithms is a necessary condition for appreciating how big the difference is. Simultaneous multithreading [27] improves throughput by issuing instructions from several concurrently executing threads to multiple functional units each cycle. However, SMT does not appear to contribute towards designing a machine that looks to the programmer like a PRAM.

The similarity of XMT to the pioneering Tera Multi-Threaded architecture [3] is very limited. Tera focuses on supporting a plurality of threads by constantly switching among threads; it does not issue instructions from more than one thread in the same cycle, and this in turn would limit the relevance of our multi-operand and Spawn instructions for their architecture. Tera's multiprocessor was engineered to hide long latencies to memories for big applications; its design aims at a large number of processors, each running a large number of threads. XMT, on the other hand, is designed to provide competitive performance for even small input sizes, which makes it more practical in general, and for desktop applications in particular. To explain this, we observe that the higher bandwidth and lower latencies which are expected from on-chip designs in the billion-transistor era will allow a parallel processor to become competitive with its serial counterpart for a much smaller input size than for MPP paradigms such as Tera. But why does this observation hold true? The parallelism provided by a parallel algorithm increases as the input size increases; when implemented on an MPP, part of this algorithm parallelism is used for hiding system deficiencies such as latency; when latency is a minor problem, more algorithm parallelism can be applied directly to speedups. So parallel algorithms can become competitive for much smaller inputs.

Another strong argument in favor of XMT is that it is anticipated that explicit parallelism will provide for simpler hardware. Explicit ILP generally means static extraction of ILP, which allows for simpler hardware than that for dynamic (i.e., by hardware) extraction of ILP. Stating simpler hardware as the main motivation, industry has demonstrated its interest in explicit instruction-level parallelism (ILP) by way of heavily investing in it.

Lee and DeVries [24] investigate single-chip vector microprocessors as a way to exploit more parallelism with less hardware and reduced instruction bandwidth. They expose more instruction-level parallelism to the processor core by moving to a more explicitly parallel programming model. Similarly, the IRAM project [22] uses vector processing to increase parallelism; the project also aims to fully exploit the bandwidth possibilities of integrating DRAM onto the microprocessor. Complementing this research, out-of-order, multithreaded, and decoupled vector architectures have been proposed by Espasa, Valero, and Smith [11, 12] as methods to improve the performance of vector processing. The XMT architecture has much in common with the recent vector approaches, as it is solving many of the same problems in much the same way; the primary difference is in the use of an SPMD-style parallel algorithm programming model instead of a vector model.

6 Concluding Remarks

The von Neumann architecture has provided a principled engineering design point for computing systems for over half a century. It has been very robust and resilient, withstanding dramatic changes in all the relevant technologies. Parallel computing has long been considered an antithesis to the von Neumann architecture. However, the main success thus far in using parallelism for reducing the completion time of a single general-purpose task has been accomplished by what we called earlier the "ILP bridgehead", yet another von Neumann architecture. However, technology changes due to the evolving submicron technology are making it increasingly difficult to extract parallelism by conventional ILP techniques. In the past decades, the increased transistor budget of processor chips was applied to close the gap between microprocessors and high-end CPUs. As pointed out in [9], today this gap is fully closed, and using the additional transistor budget for uniprocessors is well beyond the point of diminishing returns. It is becoming increasingly important, therefore, to tap into explicit parallel processing. The XMT approach tries to jump into filling this gap: the no-busy-waits finite-state-machine principle, with its implied independence of order semantics (IOS), is designed to directly address the profound weakness of continued evolutionary development of the von Neumann approach.

The paper [29] introduced the broad XMT framework through a few bridging models, or internal interfaces. This paper contributes a first microarchitecture implementation, a significant step for the overall XMT effort. The research area of XMT, and SPMD in general, as it connects with parallel algorithms, offers an interesting design point: these algorithms map well to hardware support for multiple independent concurrent threads. This hardware support, in turn, maps well to the limitations of future submicron technologies (interconnect delays dominating the performance), which necessitate most of the communication to be localized. The XMT-M implementation highlights an important strength of XMT: it lends itself to decentralized implementation with almost no degradation in performance. An XMT-M processor can comprise numerous simple, independent, identical thread execution units, and the nature of the programming paradigm is such that inter-thread communication is highly structured and regular. This paper shows that such a microarchitecture can withstand high cross-chip communication delays. It can also take advantage of a large number of functional units, as opposed to traditional superscalar designs, which typically cannot make use of more than one or two dozen functional units.

XMT-M integrates several well-understood and widely-used programming primitives that are usually implemented in software; the novelty of the microarchitecture is the integration of these primitives in a single-chip environment, which offers increased communication bandwidth and significantly decreased communication latency compared to more traditional parallel architectures. The integrated primitives are the spawn-join mechanism, which enables parallelism by initiating and terminating the concurrent execution of multiple threads of control, and the prefix-sum operation, which is used to coordinate the threads.

Finally, we note that XMT-M has achieved significant speedups by extracting only inter-thread parallelism, and no intra-thread parallelism, and that it can get additional benefit from extracting intra-thread parallelism with the use of standard ILP techniques.

References

[1] R. C. Agarwal. A Super Scalar Sort Algorithm for RISC Processors. Proc. ACM SIGMOD.
[2] G. S. Almasi and A. Gottlieb. Highly Parallel Computing, Second Edition. Benjamin-Cummings.
[3] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer System. Proc. International Conference on Supercomputing.
[4] G. E. Blelloch. Vector Models for Data-Parallel Computing. MIT Press.
[5] G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. Proc. ACM PPOPP.
[6] D. Burger and T. M. Austin. The SimpleScalar Tool Set. Technical Report, University of Wisconsin-Madison, June.
[7] R. Cole and O. Zajicek. The APRAM: incorporating asynchrony into the PRAM model. Proc. 1st ACM SPAA.
[8] D. E. Culler and J. P. Singh. Parallel Computer Architecture. Morgan Kaufmann.
[9] W. J. Dally and S. Lacy. VLSI Architecture: Past, Present, and Future. Proc. Advanced Research in VLSI Conference.
[10] S. Dascal and U. Vishkin. Experiments with List Ranking on Explicit Multi-Threaded (XMT) Instruction Parallelism. Proc. 3rd Workshop on Algorithms Engineering (WAE), July, London, UK. Downloadable from http://www.umiacs.umd.edu/~vishkin/XMT.
[11] R. Espasa and M. Valero. Multithreaded Vector Architectures. Proc. Third International Symposium on High-Performance Computer Architecture (HPCA).
[12] R. Espasa, M. Valero, and J. E. Smith. Out-of-order Vector Architectures. Proc. Annual International Symposium on Microarchitecture (MICRO).
[13] M. Franklin. The Multiscalar Architecture. Ph.D. thesis, Technical Report, Computer Sciences Department, University of Wisconsin-Madison, December.
[14] E. Freudenthal and A. Gottlieb. Process Coordination with Fetch-and-Increment. Proc. Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV).
[15] P. B. Gibbons. A more practical PRAM algorithm. Proc. 1st ACM SPAA.
[16] A. Gottlieb, B. Lubachevsky, and L. Rudolph. Basic techniques for the efficient coordination of large numbers of cooperating sequential processors. ACM Transactions on Programming Languages and Systems.
[17] R. A. Iannucci, G. R. Gao, R. H. Halstead, and B. Smith, editors. Multithreaded Computer Architecture: A Summary of the State of the Art. Kluwer, Boston, MA.
[18] J. JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, MA.
[19] R. Joy and K. Kennedy. President's Information Technology Advisory Committee (PITAC) Interim Report to the President. National Coordination Office for Computing, Information and Communication, Arlington, VA, August.
[20] C. W. Kessler and H. Seidl. Integrating synchronous and asynchronous paradigms: the Fork parallel programming language. Technical report, Fachbereich Informatik, Universität Trier, Trier, Germany.
[21] C. W. Kessler. Quick reference guides: (i) Fork and (ii) SB-PRAM instruction set / simulator / system software. Universität Trier, FB IV Informatik, Trier, Germany, May.
[22] C. Kozyrakis et al. Scalable Processors in the Billion-Transistor Era: IRAM. IEEE Computer, September.
[23] A. LaMarca and R. E. Ladner. The Influence of Caches on the Performance of Sorting. Proc. Annual ACM-SIAM Symposium on Discrete Algorithms.
[24] C. G. Lee and D. J. DeVries. Initial Results on the Performance and Cost of Vector Microprocessors. Proc. Annual International Symposium on Microarchitecture (MICRO).
[25] A. Silberschatz and P. B. Galvin. Operating System Concepts, Fifth Edition. Addison Wesley Longman.
[26] STREAM: Sustainable Memory Bandwidth in High Performance Computers. The University of Virginia. http://www.cs.virginia.edu/stream.
[27] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Proc. Annual International Symposium on Computer Architecture (ISCA).
[28] U. Vishkin. From Algorithm Parallelism to Instruction-Level Parallelism: An Encode-Decode Chain Using Prefix-Sum. Proc. ACM Symposium on Parallel Algorithms and Architectures (SPAA).
[29] U. Vishkin, S. Dascal, E. Berkovich, and J. Nuzman. Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism. Proc. ACM Symposium on Parallel Algorithms and Architectures (SPAA). See also the XMT home page: http://www.umiacs.umd.edu/~vishkin/XMT.
[30] Personal communication with H. Levy, August.
[31] The National Technology Roadmap for Semiconductors. Semiconductor Industry Association.
