Parallel Computer Systems

Randal E. Bryant
CS 347 S’97, Lecture 27
April 29, 1997

Topics
• Parallel Applications
• Shared vs. Distributed Model
• Concurrency Models
• Single Bus Systems
• Network-Based Systems
• Lessons Learned


Motivation

Limits to Sequential Processing
• Cannot push clock rates beyond technological limits
• Instruction-level parallelism gets diminishing returns
  – 4-way superscalar machines only get an average of 1.5 instructions / cycle
  – Branch prediction, speculative execution, etc. yield diminishing returns

Applications have Insatiable Appetite for Computing
• Modeling of physical systems
• Virtual reality, real-time graphics, video
• Database search, data mining

Many Applications can Exploit Parallelism
• Work on multiple parts of problem simultaneously
• Synchronize to coordinate efforts
• Communicate to share information


Historical Perspective: The Graveyard

• Lots of venture capital and DoD research $$’s
• Too big to enumerate, but some examples …

ILLIAC IV
• Early research machine with overambitious technology

Thinking Machines
• CM-2: 64K single-bit processors with single controller (SIMD)
• CM-5: Tightly coupled network of SPARC processors

Encore Computer
• Shared memory machine using National microprocessors

Kendall Square Research KSR-1
• Shared memory machine using proprietary processor

NCUBE / Intel Hypercube / Intel Paragon
• Connected network of small processors
• Survive only in niche markets


Historical Perspective: Successes

Shared Memory Multiprocessors (SMP’s)
• E.g., SGI Challenge, SUN servers, DEC Alpha Servers
• Good for handling “server” applications
  – Number of loosely coupled (or independent) computing tasks
  – E.g., multiuser system, Web server
  – Share resources such as primary memory

Cray Vector Machines (also Fujitsu, NEC)
• Single instruction can specify operation over entire vector
  – E.g., c[i] = a[i] + b[i] * d, for 0 ≤ i < 128
• Effective for many scientific computing applications

Cray T3D, T3E
• DEC Alpha’s connected by high performance network
• Less versatile than vector machine, but better cost performance


Application Classes

Compute Servers (loosely coupled)
• Number of independent users using single computing facility
• Only synchronization is to mediate use of shared resources
  – Memory, disk sectors, file system

Database Servers
• Users performing transactions on shared database
  – E.g., bank records, flight reservations
• Synchronization required to guarantee consistency
  – Don’t want two people to get last seat on flight

True Parallel Applications (tightly coupled)
• Computationally intensive task exploiting multiple computing agents
• Synchronization required to coordinate efforts


Parallel Application Example

Finite Element Model
• Discrete representation of continuous system
• Spatially: partition into mesh elements
• Temporally: update state every dT time units

Example Computation
  For time from 0 to maxT
    For each mesh element
      Update mesh value

Locality
• Update depends only on values of adjacent elements


Parallel Mapping

Spatial Partitioning
• Divide mesh into regions
• Allocate different regions to different processors

[Figure: mesh divided into a 3 × 3 grid of regions, P1 through P9]

Computation for Each Processor
  For time from 0 to maxT
    Get boundary values from neighbors
    For each mesh element
      Update mesh value
    Send boundary values to neighbors


Complicating Factors

Communication Overhead
• N × N mesh, M processors
• Elements / processor = N^2 / M
  – How much work is required per iteration
• Boundary elements / processor ~ N / sqrt(M)
  – How much communication is required per iteration
• Communication vs. computation load ~ sqrt(M) / N
  – Becomes communication limited as the number of processors increases

Nonuniformities
• Irregular mesh, varying computing / mesh element
• Makes partitioning & load balancing difficult

Synchronization
• Keeping all processors on same iteration
• Determining global properties such as convergence and time step


Shared Memory Model

[Figure: processors P, P, P, …, P all attached to one global memory space]

Conceptual View
• All processors access single memory
  – Physical address space
  – Use virtual address mapping to partition among processes
• If one processor updates location, then all will see it
  – Memory consistency


Bus-Based Realization

[Figure: memory on a shared memory bus; each processor P attaches to the bus through its own cache C]

Memory Bus
• Handles all accesses to shared memory

Caches
• One per processor
• Allows local copies of heavily used data
• Must avoid stale data

Considerations
• Small step up from single processor system
  – Support added to many microprocessor chips
• Does not scale well
  – Bus becomes bottleneck
  – Limited to ~16 processors


Network-Based Realization

[Figure: interconnection network linking nodes, each with a memory M, a cache C, and a processor P]

Memory
• Partitioned among processors

Network
• Transmit messages to perform accesses to remote memories

Caches
• Local copies of heavily used data
• Must avoid stale data
  – Harder than with bus-based system

Considerations
• Scales well
  – 1024 processor systems have been built
• Nonuniform memory access
  – 100’s of cycles for remote access


Memory Consistency

Model
• Independent processes with access to shared variables
• No assumptions about relative timing
  – Which starts first
  – Relative rates

Initially: x = 0, y = 0

Process A
  a1: x = 1
  a2: if (y == 0) …

Process B
  b1: y = 1
  b2: if (x == 0) …

Sequential Consistency
• Each process executes its steps in program order
• Overall effect should match that of some interleaving of the individual process steps


Sequential Consistency Example

Process A
  a1: x = 1
  a2: if (y == 0) …

Process B
  b1: y = 1
  b2: if (x == 0) …

Program order requires a1 before a2 and b1 before b2.

Possible Interleavings
  Order            a2 test   b2 test
  a1 a2 b1 b2      T         F
  a1 b1 a2 b2      F         F
  a1 b1 b2 a2      F         F
  b1 a1 a2 b2      F         F
  b1 a1 b2 a2      F         F
  b1 b2 a1 a2      F         T
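
The possible interleavings of the two processes can also be enumerated mechanically. The following is a minimal Python sketch, assuming the example program as decoded from the slide (x and y initially 0; a1 writes x = 1, a2 tests y == 0, b1 writes y = 1, b2 tests x == 0). The names `interleavings` and `run` are illustrative and do not come from the lecture.

```python
from itertools import combinations

A_STEPS = ("a1", "a2")  # program order of Process A
B_STEPS = ("b1", "b2")  # program order of Process B


def interleavings():
    """Yield every merge of A_STEPS and B_STEPS that keeps each program order."""
    total = len(A_STEPS) + len(B_STEPS)
    # Choose which positions of the merged sequence belong to Process A.
    for a_slots in combinations(range(total), len(A_STEPS)):
        a_iter, b_iter = iter(A_STEPS), iter(B_STEPS)
        yield tuple(next(a_iter) if i in a_slots else next(b_iter)
                    for i in range(total))


def run(order):
    """Execute one interleaving on a single shared memory; return both test results."""
    mem = {"x": 0, "y": 0}
    tests = {}
    for step in order:
        if step == "a1":
            mem["x"] = 1
        elif step == "a2":
            tests["a2"] = mem["y"] == 0
        elif step == "b1":
            mem["y"] = 1
        elif step == "b2":
            tests["b2"] = mem["x"] == 0
    return tests


if __name__ == "__main__":
    for order in interleavings():
        result = run(order)
        print(" ".join(order),
              "a2 =", "T" if result["a2"] else "F",
              "b2 =", "T" if result["b2"] else "F")
```

Under these assumptions the enumeration produces six orderings, and in none of them do both tests succeed. A sequentially consistent memory therefore never lets both processes see the other's variable as still 0.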


Sequential Inconsistency

• Cannot have both tests yield T
  – b2 must precede a1
  – a2 must precede b1
  – Cannot satisfy these plus program order constraints

Real Life Scenario

[Figure: processors Pa and Pb connected by a network; y is local to Pa, x is local to Pb, both initially 0]

• Process A
  – Remote write x
  – Local read y
• Process B
  – Remote write y
  – Local read x
• Could have both reads yield 0


Snoopy Bus-Based Consistency

[Figure: memory on a shared memory bus with one cache per processor; the cache that currently drives the bus is the master, the others snoop]

Caches
• Write-back
  – Minimize bus traffic
• Monitor bus transactions when not master

Cached blocks
• Clean block can have multiple, read-only copies
• To write, must obtain exclusive copy
  – Marked as dirty

Getting copy
• Make bus request
• Memory replies if block clean
• Owning cache replies if dirty


Implementation Details

Block Status
• Maintained by each cache for each of its blocks
• Invalid
  – Entry not valid
• Clean
  – Valid, read-only copy
  – Matches copy in main memory
• Dirty
  – Exclusive, writeable copy
  – Must write back to evict

Bus Operations
• Read
  – Get read-only copy
• Invalidate
  – Invalidate all other copies
  – Make local copy writeable
• Write
  – Write back dirty block
  – To make room for different block


Bus Master Actions

[Figure: state-transition diagram over the block states Invalid, Clean, and Dirty. Each edge is labeled with the processor request (P Read or P Write, with requested block i equal or not equal to current block t), the resulting bus operation (None, Read i, Write t, or Inval. i), the block tag update, and the processor operation (Stall, Read, or Write).]
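
The block states and bus operations above can be exercised with a small simulation. This is a sketch under stated assumptions, not the lecture's state diagram: each cache keeps one status per address (so evictions and the Write bus operation never arise), a dirty block that is snooped on a bus Read supplies its data, updates memory, and becomes clean, and memory that has never been written reads as 0. The class and function names (`Bus`, `Cache`, `run_example`) are hypothetical.

```python
from enum import Enum


class Status(Enum):
    INVALID = "invalid"
    CLEAN = "clean"
    DIRTY = "dirty"


class Bus:
    def __init__(self):
        self.memory = {}  # main memory: address -> value
        self.caches = []  # every cache attached to the bus
        self.log = []     # bus transactions, in order

    def read(self, master, addr):
        """Bus Read: the master obtains a read-only copy."""
        self.log.append(f"{master.name}: Read {addr}")
        for cache in self.caches:
            if cache is not master and cache.status.get(addr) is Status.DIRTY:
                # Assumption: the owning cache replies, memory is updated,
                # and the owner keeps a clean copy.
                self.memory[addr] = cache.data[addr]
                cache.status[addr] = Status.CLEAN
        return self.memory.get(addr, 0)

    def invalidate(self, master, addr):
        """Bus Invalidate: every other copy becomes invalid."""
        self.log.append(f"{master.name}: Invalidate {addr}")
        for cache in self.caches:
            if cache is not master:
                cache.status[addr] = Status.INVALID


class Cache:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.status = {}  # address -> Status (missing means INVALID)
        self.data = {}    # address -> cached value
        bus.caches.append(self)

    def _fill(self, addr):
        # Fetch a read-only copy over the bus if no valid copy is held.
        if self.status.get(addr, Status.INVALID) is Status.INVALID:
            self.data[addr] = self.bus.read(self, addr)
            self.status[addr] = Status.CLEAN

    def load(self, addr):
        self._fill(addr)
        return self.data[addr]

    def store(self, addr, value):
        self._fill(addr)
        if self.status[addr] is Status.CLEAN:
            # Must hold the exclusive copy before writing.
            self.bus.invalidate(self, addr)
            self.status[addr] = Status.DIRTY
        self.data[addr] = value


def run_example():
    """Run the steps a1, b1, a2, b2 in that order; return both tests and the bus log."""
    bus = Bus()
    a, b = Cache("A", bus), Cache("B", bus)
    a.store("x", 1)              # a1: x = 1
    b.store("y", 1)              # b1: y = 1
    a2_test = a.load("y") == 0   # a2: if (y == 0)
    b2_test = b.load("x") == 0   # b2: if (x == 0)
    return a2_test, b2_test, bus.log
```

With the steps ordered a1, b1, a2, b2, both tests come out false, and the logged bus transactions are A: Read x, A: Invalidate x, B: Read y, B: Invalidate y, A: Read y, B: Read x. Because each store is atomic in this sketch, it cannot reproduce orderings in which another processor's bus transaction falls between the Read and the Invalidate of a single write.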


Bus Snoop Actions

[Figure: state-transition diagram over the block states Invalid, Clean, and Dirty. Each edge is labeled with the snooped bus request (B Read or B Inv, with requested block i equal or not equal to current block t), the bus operation, the block tag update, and the cache operation. On a matching B Read the cache supplies the block (Data); requests for a different block (≠) cause no action.]


Example 1

Process A
  a1: x = 1
  a2: if (y == 0) …

Process B
  b1: y = 1
  b2: if (x == 0) …

Bus transactions and resulting process steps, in order:
  A: Read x
  A: Invalidate x
    a1: x = 1
  B: Read y
  A: Read y
    a2: = T
  B: Invalidate y
    b1: y = 1
  B: Read x
    b2: = F


Example 2

Same two processes as Example 1.

Bus transactions and resulting process steps, in order:
  A: Read x
  A: Invalidate x
    a1: x = 1
  B: Read y
  B: Invalidate y
    b1: y = 1
  A: Read y
    a2: = F
  B: Read x
    b2: = F


Livelock Example

Process A
  a1: y = …

Process B (in a while loop)
  b1: t = y
  b2: y = t+1

Bus transactions and resulting process steps, in order:
  A: Read y
  B: Read y
    b1: t = y
  B: Invalidate y
    b2: y = t+1
  A: Read y
  B: Read y
    b1: t = y
  B: Invalidate y
    b2: y = t+1
  A: Read y
  B: Read y
  …

• A never gets a chance to write


Single Bus Machine Example

SGI Challenge Series
• Up to 36 MIPS R4400 processors
• Up to 16 GB main memory

Bus
• 256-bit wide data
• 40-bit wide address
• Data transferred at 1.22 GB / second
• Split transaction
  – Read request & read response are separate bus transactions
  – Can use bus for other things while read outstanding
  – Complicates synchronization

Performance
• 164 processor cycles to handle remote read
  – Assuming no bus contention


Network-Based Cache Coherency

Home-Based Protocol
• Each block has “home”
  – Memory controller tracking its status
• Home maintains
  – Block status
  – Identity of copy holders
    » 1 bit flag / processor

Example: directory at memory controller 4
  Block   Status     Copy Holders
  24      shared     0 1 0 1 0 1 0 1
  25      remote     0 1 0 0 0 0 0 0
  26      uncached   0 0 0 0 0 0 0 0

Block Status Values
• Shared
  – 1 or more remote, read-only copies
• Remote
  – Writeable copy in remote cache
• Uncached
  – No remote copies


Network-Based Consistency

To Obtain Copy of Block
• Processor sends message to the block’s home
• Home retrieves remote copy if status is remote
• Sends copy to requester
• If exclusive copy requested, send invalidate message to all other copy holders

Tricky Details
• Lots of possible sources of deadlock & errors
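
The home's bookkeeping can be sketched in a few lines of Python. This is an illustrative model only: the class `HomeDirectory`, its method names, and the message tuples are hypothetical, and it assumes that after a read request to a block in the remote state the previous owner keeps a read-only copy. It ignores message ordering and in-flight races, which are exactly the tricky details the slide warns about.

```python
class HomeDirectory:
    """Directory kept by one home memory controller (illustrative sketch)."""

    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.status = {}    # block -> "shared" | "remote" (missing means "uncached")
        self.holders = {}   # block -> set of processor ids holding a copy
        self.messages = []  # messages the home would send, as (kind, block, proc)

    def status_of(self, block):
        return self.status.get(block, "uncached")

    def presence_bits(self, block):
        """Copy holders as a string with one bit flag per processor."""
        holders = self.holders.get(block, set())
        return "".join("1" if p in holders else "0" for p in range(self.num_procs))

    def request(self, block, proc, exclusive=False):
        """Handle a copy request from processor `proc` for `block`."""
        holders = self.holders.setdefault(block, set())
        if self.status_of(block) == "remote":
            if holders == {proc}:
                return  # requester already holds the writeable copy
            (owner,) = holders
            # Home retrieves the up-to-date copy from the remote owner.
            self.messages.append(("retrieve", block, owner))
        if exclusive:
            # Invalidate every other copy holder before granting write access.
            for other in sorted(holders - {proc}):
                self.messages.append(("invalidate", block, other))
            holders.clear()
            holders.add(proc)
            self.status[block] = "remote"
        else:
            # Assumption: a previous owner keeps a read-only copy.
            holders.add(proc)
            self.status[block] = "shared"
        self.messages.append(("send_copy", block, proc))


if __name__ == "__main__":
    home = HomeDirectory(num_procs=8)
    for p in (1, 3, 5, 7):
        home.request(24, p)                 # read-only copies of block 24
    home.request(25, 1, exclusive=True)     # writeable copy of block 25
    for block in (24, 25, 26):
        print(block, home.status_of(block), home.presence_bits(block))
```

Run as a script, the sketch rebuilds the three directory entries shown in the example table (blocks 24, 25, and 26). A later exclusive request for block 24 from processor 0 would add four invalidate messages, one per current copy holder, before the copy is sent.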