Computational RAM, a Memory-SIMD Hybrid

Total Pages: 16

File Type: pdf, Size: 1020 KB

Computational RAM: A Memory-SIMD Hybrid

Duncan George Elliott. A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Electrical and Computer Engineering, University of Toronto. © 1998 by Duncan George Elliott.

Abstract

In this thesis, a novel computer architecture called Computational RAM (C•RAM) is proposed and implemented. C•RAM is semiconductor random access memory with processors incorporated into the design, while retaining a memory interface. C•RAM can be used to build an inexpensive massively-parallel computer. Applications that contain the appropriate parallelism will typically run thousands of times faster on C•RAM than on the CPU. This work includes the design and implementation of the architecture as a working chip with 64 processor elements (PEs), a PE design for a 2048-PE 4 Mbit DRAM, and applications.
C•RAM is the first processor-in-memory architecture that is scalable across many generations of DRAM. This scalability is obtained by pitch-matching narrow 1-bit PEs to the memory and restricting communication to 1-dimensional interconnects. The PEs are pitch-matched to memory columns so that they can be connected to the sense amplifiers. The 1-bit-wide datapath is suitable for a narrow, arrayable VLSI implementation, is compatible with memory redundancy, and has the highest performance/cost ratio among hardware arithmetic algorithms. For scalability, the memory arrays and memory-style packaging limit the internal interprocessor communications to 1-dimensional networks. Of these networks, both a broadcast bus network and a left-right nearest-neighbour network are implemented.

C•RAM requires little overhead over the existing memory to exploit much of the internal memory bandwidth. When C•RAM PEs are added to DRAM, more than 25% of the internal memory bandwidth is exploited at a cost of less than 25% in terms of silicon area and power. The memory bandwidth internal to memory chips at the sense amplifiers can be 3000 times the memory bandwidth at the CPU. By placing SIMD PEs adjacent to those sense amplifiers, this internal memory bandwidth can be better utilized. The performance of C•RAM has been demonstrated in a wide range of application areas, and speed-ups of several orders of magnitude compared to a typical workstation have been shown in the fields of signal and image processing, database, and CAD.

Acknowledgments

I am grateful for the support from and technical exchange with MOSAID Technologies and IBM Microelectronics; IC fabrication provided by Northern Telecom through the Canadian Microelectronics Corporation; and support from the Natural Sciences and Engineering Research Council of Canada, Micronet, and the University of Toronto. Professor David Wheeler of the Computer Laboratory, University of Cambridge introduced me to the idea of putting the computing in the memory chips. My supervisors, Professors Michael Stumm and Martin Snelgrove, kept me on track without a single change of topic. Michael Stumm put in more than a reasonable effort in encouraging me and organizing my work. Martin Snelgrove was instrumental in making connections to potential industrial partners and taught me valuable business lessons. I have learned lessons on DRAMs that no textbook could tell from Peter Gillingham, Graham Allan, Randy Torrance, Dick Foss, Iain Scott, David Frank, Howard Kalter, and John Barth. My lab-mates and members of the semiconductor design staff at MOSAID improved the quality and the enjoyment of my learning. Oswin Hall provided the first user feedback on C•RAM and proposed the PE shift register circuit. For strengthening this work by offering their questions and suggestions I also thank: Christian Cojocaru, Dickson Tak Shun Cheung, David Galloway, Robert McKenzie, Wayne Loucks, Tet Yeap, Martin Lefebvre, Jean-Francois Rivest, Sethuraman Panchanathan, Wojciech Ziarko, Thinh Le, Glenn Gulak, David Wortman, Graham Jullien, Safwat Zaky, and Zvonko Vranesic. This dissertation would not have seen the light of day without encouragement from my whole family. I would especially like to thank my wife, Janet Elliott, who was a constant source of enthusiasm, had streams of writing suggestions, and in the end sat me down to write this document.

Contents

1 Introduction
  1.1 Dissertation Organization
2 Prior Art
  2.1 Massively Parallel SIMD Processors
  2.2 Memories
    2.2.1 DRAM Circuits
    2.2.2 DRAM Chip Interfaces
    2.2.3 DRAM IC Process
    2.2.4 Redundancy
    2.2.5 Smart Memories
  2.3 Summary
3 Computing in the Memory: the Abstraction
  3.1 Memory Bandwidth
  3.2 Maintaining Compatibility with Memory
    3.2.1 Using DRAM IC Processes
    3.2.2 Physical Dimensions and Performance
    3.2.3 Coping with Long Wordlines
    3.2.4 Redundancy Considerations
    3.2.5 Packaging Considerations
  3.3 Additional Design Issues
    3.3.1 Asymptotic Advantages of Bit-Serial PEs
    3.3.2 PE Aspect Ratio
  3.4 Basic C•RAM Architecture
  3.5 Advantages of the C•RAM Architecture
    3.5.1 System Cost
    3.5.2 Performance
    3.5.3 Scalability
    3.5.4 Power Consumption
  3.6 Design Space Alternatives
  3.7 Summary
4 Architectural Details
  4.1 Processing Element
    4.1.1 ALU Design
    4.1.2 Conditional Operations
  4.2 Interconnection Network
    4.2.1 Suitability of Interconnection Networks
  4.3 System I/O
  4.4 Summary
5 Implementation
  5.1 Prototype
    5.1.1 Memory
    5.1.2 Processor Element
    5.1.3 Testing and Performance
    5.1.4 Summary
  5.2 High Density Design
    5.2.1 Reference DRAM Architecture
    5.2.2 C•RAM Memory
    5.2.3 PE-Memory Interface
    5.2.4 Processor Element
    5.2.5 Power
    5.2.6 External Interface
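The bit-serial, one-PE-per-column model that the abstract describes can be made concrete with a short simulation. This is a minimal sketch, assuming a vertical (one bit-slice per row) operand layout and a single carry register per PE; it is my own illustration, not code or an instruction set from the thesis.

```python
import numpy as np

# One 1-bit PE per memory column; operands stored "vertically" so that a
# single row activation delivers one bit to every PE in parallel.
N_PE, K = 8, 8                                 # 8 PEs/columns, 8-bit operands
mem = np.zeros((3 * K, N_PE), dtype=np.uint8)  # row blocks: A, B, SUM

a = np.random.randint(0, 2 ** (K - 1), N_PE)
b = np.random.randint(0, 2 ** (K - 1), N_PE)
for i in range(K):                             # load operands, LSB first
    mem[i, :] = (a >> i) & 1                   # rows 0..K-1  hold A
    mem[K + i, :] = (b >> i) & 1               # rows K..2K-1 hold B

carry = np.zeros(N_PE, dtype=np.uint8)         # one carry flip-flop per PE
for i in range(K):                             # K SIMD cycles for a K-bit add
    ai, bi = mem[i, :], mem[K + i, :]          # row reads at the sense amps
    mem[2 * K + i, :] = ai ^ bi ^ carry        # write the sum bit-slice back
    carry = (ai & bi) | (carry & (ai ^ bi))    # full-adder carry, per PE

s = sum(mem[2 * K + i, :].astype(int) << i for i in range(K))
assert np.array_equal(s, a + b)                # all N_PE adds finish together
```

Each loop iteration stands for one broadcast SIMD instruction, which is why an N-PE C•RAM completes N additions in the K cycles that a single bit-serial add takes.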
Recommended publications
  • System Design for a Computational-RAM Logic-In-Memory Parallel-Processing Machine
    System Design for a Computational-RAM Logic-In-Memory Parallel-Processing Machine. Peter M. Nyasulu, B.Sc., M.Eng. A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Ottawa-Carleton Institute for Electrical and Computer Engineering, Department of Electronics, Faculty of Engineering, Carleton University, Ottawa, Ontario, Canada. May, 1999. © Peter M. Nyasulu, 1999. Abstract: Integrating several 1-bit processing elements at the sense amplifiers of a standard RAM improves the performance of massively-parallel applications because of the inherent parallelism and high data bandwidth inside the memory chip.
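Both this thesis and Elliott's restrict inter-PE communication to the 1-dimensional networks mentioned above: a broadcast bus and a left-right nearest-neighbour link. The toy model below is my own construction (the wraparound at the array ends is an assumption), showing how the two primitives combine in a simple 3-point average.

```python
import numpy as np

x = np.arange(8, dtype=float)   # one local value per PE, PEs in a 1-D line

# Left-right nearest-neighbour network: each PE receives its neighbours'
# values in two shift operations (modelled here with wraparound).
from_left = np.roll(x, 1)       # PE i receives the value of PE i-1
from_right = np.roll(x, -1)     # PE i receives the value of PE i+1
smoothed = (from_left + x + from_right) / 3.0

# Broadcast bus: the central controller drives one scalar to every PE,
# which combines it with local data in the same step.
bias = 10.0
print(smoothed + bias)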
  • A Modern Primer on Processing in Memory
    A Modern Primer on Processing in Memory. Onur Mutlu (ETH Zürich, Carnegie Mellon University), Saugata Ghose (Carnegie Mellon University, University of Illinois at Urbana-Champaign), Juan Gómez-Luna (ETH Zürich), Rachata Ausavarungnirun (King Mongkut's University of Technology North Bangkok). SAFARI Research Group. Abstract: Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, the proliferation of different main memory standards and chips specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are evidence of this trend.
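A back-of-envelope calculation makes the data-movement argument concrete. The per-operation energies below are ballpark figures often quoted in the computer-architecture literature, not numbers taken from the primer, so treat the result as an order-of-magnitude sketch only.

```python
# Assumed ballpark energies (roughly 45 nm era estimates; illustrative only).
E_ADD_32 = 0.1e-12      # joules for a 32-bit integer add
E_DRAM_32 = 640e-12     # joules for a 32-bit off-chip DRAM access

n = 1_000_000           # element-wise c = a + b over n elements
e_compute = n * E_ADD_32
e_move = 3 * n * E_DRAM_32          # read a, read b, write c
print(f"compute {e_compute * 1e6:.1f} uJ vs movement {e_move * 1e6:.1f} uJ "
      f"({e_move / e_compute:.0f}x)")   # movement dominates by ~10^4
```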
  • A PARALLEL IMPLEMENTATION OF BACKPROPAGATION NEURAL NETWORK ON MASPAR MP-1, Faramarz Valafar, Purdue University School of Electrical Engineering
    Purdue University, Purdue e-Pubs, ECE Technical Reports, Electrical and Computer Engineering, 3-1-1993. A PARALLEL IMPLEMENTATION OF BACKPROPAGATION NEURAL NETWORK ON MASPAR MP-1. Faramarz Valafar and Okan K. Ersoy, Purdue University School of Electrical Engineering, W. Lafayette, IN 47906. Follow this and additional works at: http://docs.lib.purdue.edu/ecetr Valafar, Faramarz and Ersoy, Okan K., "A PARALLEL IMPLEMENTATION OF BACKPROPAGATION NEURAL NETWORK ON MASPAR MP-1" (1993). ECE Technical Reports. Paper 223. http://docs.lib.purdue.edu/ecetr/223 This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information. TR-EE 93-14, March 1993. (The Purdue University MasPar MP-1 research is supported in part by NSF Parallel Infrastructure Grant #CDA-9015696.) ABSTRACT: One of the major issues in using artificial neural networks is reducing the training and the testing times. Parallel processing is the most efficient approach for this purpose. In this paper, we explore the parallel implementation of the backpropagation algorithm with and without hidden layers [4][5] on the MasPar MP-1. This implementation is based on the SIMD architecture, and uses a backpropagation model which is more exact theoretically than the serial backpropagation model. This results in a smoother convergence to the solution. Most importantly, the processing time is reduced both theoretically and experimentally by the order of 3000, due to architectural and data parallelism of the backpropagation algorithm.
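A minimal sketch of the data-parallel mapping such a SIMD implementation implies: one training sample per (virtual) PE, so the batch dimension stands in for the processor array and weight updates are reductions across PEs. The toy task, network shape, and learning rate are my assumptions; this is not the report's MasPar code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4096, 8))             # 4096 "PEs", one sample each
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0   # toy separable target

W1 = rng.standard_normal((8, 16)) * 0.1        # weights broadcast to all PEs
W2 = rng.standard_normal((16, 1)) * 0.1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(200):
    h = sigmoid(X @ W1)                        # forward pass, PEs in lockstep
    out = sigmoid(h @ W2)
    d_out = (out - y) * out * (1 - out)        # output-layer delta, per PE
    d_h = (d_out @ W2.T) * h * (1 - h)         # hidden-layer delta, per PE
    W2 -= 0.5 * h.T @ d_out / len(X)           # reduction over the PE array
    W1 -= 0.5 * X.T @ d_h / len(X)

print("mean squared error:", float(((out - y) ** 2).mean()))
```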
  • Thinking Machines
    High Performance Computing And Communications Week. KING COMMUNICATIONS GROUP, INC., 627 National Press Building, Washington, D.C. 20045. (202) 638-4260, Fax: (202) 662-9744. Thursday, July 8, 1993. Volume 2, Number 27. Gordon Bell Makes His Case: Get the Feds out of Computer Architecture. By RICHARD McCORMACK. [Pull quote] Bell: "You've got huge forces telling you who's mainline and how you build computers." It's an issue that has been simmering, then smoldering and occasionally flaring up: will the big massively parallel machines costing tens of millions of dollars prove themselves worthy of their promise? Or will these machines, developed with millions of dollars from the taxpayer, be an embarrassing bust? It's a debate that occurs daily (even with spouses in bed at night) but not much outside of the high-performance computing industry's small borders. Literally thousands of people are engaged in trying to make massive parallelism a viable technology. But there are still few objective observers, very little data, and not enough experience with the big machines to prove, or disprove, their true worth. Interestingly, though, one of the biggest names in computing has made up his mind. The MPPs are awful, and the companies selling them, notably Intel and Thinking Machines, but others as well, are bound to fail, says Gordon Bell, whose name is attached to the ... afraid to talk about [the situation] because they know they've conned the world and they have to keep lying to support" their assertions that the technology needs government support, says the ever-quotable Bell. "It's really bad when it turns the scientists into a bunch of liars."
  • The MasPar MP-1 as a Computer Arithmetic Laboratory
    NISTIR 5569. The MasPar MP-1 as a Computer Arithmetic Laboratory. M. A. Anuta, D. W. Lozier and P. R. Turner. January 1995. U.S. DEPARTMENT OF COMMERCE, Technology Administration, National Institute of Standards and Technology, Applied and Computational Mathematics Division, Computing and Applied Mathematics Laboratory, Gaithersburg, MD 20899. Abstract: This paper describes the use of a massively parallel SIMD computer architecture for the simulation of various forms of computer arithmetic. The particular system used is a DEC/MasPar MP-1 with 4096 processors in a square array. This architecture has many advantages for such simulations due largely to the simplicity of the individual processors. Arithmetic operations can be spread across the processor array to simulate a hardware chip. Alternatively they may be performed on individual processors to allow simulation of a massively parallel implementation of the arithmetic. Compromises between these extremes permit speed-area trade-offs to be examined. The paper includes a description of the architecture and its features. It then summarizes some of the arithmetic systems which have been, or are to be, implemented.
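The first of the two mappings the abstract mentions, spreading a single arithmetic operation across the array so that each PE plays one bit of a hardware design, looks roughly like this. The 16-bit width and ripple-carry wiring are my illustrative choices, not the paper's actual simulations.

```python
import numpy as np

K = 16                                    # one PE per bit position
a, b = 40195, 17488
abits = np.array([(a >> i) & 1 for i in range(K)], dtype=np.uint8)
bbits = np.array([(b >> i) & 1 for i in range(K)], dtype=np.uint8)

# Each PE forms its full-adder outputs locally; the carry then ripples
# through the nearest-neighbour network, one PE-to-PE hop per step.
s = np.zeros(K, dtype=np.uint8)
carry = np.uint8(0)
for i in range(K):
    s[i] = abits[i] ^ bbits[i] ^ carry
    carry = (abits[i] & bbits[i]) | (carry & (abits[i] ^ bbits[i]))

assert sum(int(s[i]) << i for i in range(K)) == a + b
```

The alternative mapping, one complete arithmetic unit per PE, is the layout used in the backpropagation sketch above; trading between the two extremes is the speed-area compromise the abstract refers to.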
  • The Race Continues
    SLALOM Update: The Race Continues. John Gustafson, Diane Rover, Stephen Elbert, and Michael Carter. Ames Laboratory DOE, Ames, Iowa. Last November, we introduced in these pages a new kind of computer benchmark: a complete scientific problem that scales to the amount of computing power available, and always runs in the same amount of time… one minute. SLALOM assigns no penalty for novelty in language or architecture, and runs on computers as different as an Alliant, a MasPar, an nCUBE, and a Toshiba notebook PC. Since that time, there have been several developments:
    • The number of computer systems in the list has more than doubled.
    • The algorithms have improved.
    • An annual award for SLALOM performance has been announced.
    • SLALOM is the judge for at least one competitive supercomputer procurement.
    • The massively-parallel contenders are starting to unseat the low-end Cray computers.
    • All but a few major scientific computer manufacturers are represented in our report.
    • Many of the original numbers have improved significantly.
    Most Wanted List: We're still waiting to hear results for a few major players in supercomputing: Thinking Machines, Convex, MEIKO, and Stardent haven't sent anything to us, nor have any of their customers. We'd also very much like numbers for the WaveTracer and Active Memory Technology computers. Our Single-Instruction, Multiple Data (SIMD) version has been improved since the last Supercomputing Review article, so the groups working on those machines might want to check it out as a better starting point (see inset). The only IBM mainframe measurements are nonparallel and nonvector, so we expect big improvements to its performance.
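The fixed-time idea reduces to a simple driver loop: grow the problem until a run exceeds the budget, and report the largest size solved rather than the time for a fixed size. This sketch uses a hypothetical stand-in kernel and a one-second budget; SLALOM itself solves a radiosity problem in one minute.

```python
import time
import numpy as np

BUDGET = 1.0                      # seconds; SLALOM uses one minute

def kernel(n):
    """Stand-in for 'set up and solve an n-patch problem'."""
    a = np.random.rand(n, n) + n * np.eye(n)   # well-conditioned system
    np.linalg.solve(a, np.ones(n))

n, best = 64, None
while True:
    t0 = time.perf_counter()
    kernel(n)
    if time.perf_counter() - t0 > BUDGET:
        break                     # last size inside the budget is the score
    best = n
    n = int(n * 1.25)             # grow the problem, not the time
print("fixed-time score: largest n solved =", best)
```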
  • Porting Industrial Codes and Developing Sparse Linear Solvers on Parallel Computers
    Science and Engineering Research Council. RAL 94-019. "The Science and Engineering Research Council does not accept any responsibility for loss or damage arising from the use of information contained in any of its reports or in any communication about its tests or investigations." Porting Industrial Codes and Developing Sparse Linear Solvers on Parallel Computers. Michel J. Daydé and Iain S. Duff. ABSTRACT: We address the main issues when porting existing codes from serial to parallel computers and when developing portable parallel software on MIMD multiprocessors (shared memory, virtual shared memory, and distributed memory multiprocessors, and networks of computers). We discuss the use of numerical libraries as a way of developing portable and efficient parallel code. We illustrate this by using examples from our experience in porting industrial codes and in designing parallel numerical libraries. We report in some detail on the parallelization of scientific applications coming from Centre National d'Etudes Spatiales and from Aérospatiale, and we illustrate how it is possible to develop portable and efficient numerical software by considering the parallel solution of sparse linear systems of equations. Keywords: industrial codes, sparse matrices, multifrontal, BLAS, PVM, P4, MIMD multiprocessors, networks. AMS(MOS) subject classifications: 65F05, 65F50, 68N99, 68U99, 90C30. Computing Reviews classification: G.4, G.1.3, D.2.7, D.2.10. Computing Reviews General Terms: Algorithms. Text from invited lectures given at the First International Meeting on Vector and Parallel Processing, Porto, Portugal, 29 September - 1 October, 1993. Part of this work was supported by Aérospatiale, Division Avions, and Centre National d'Etudes Spatiales under contracts 11C05770 and 873/CNES/90/0841/00.
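In the spirit of the report's advice to reach portability through numerical libraries, here is a minimal sparse solve expressed behind a library interface. SciPy is an accessible stand-in of my choosing; the authors' work builds on multifrontal solvers and the BLAS, not SciPy.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Assemble a sparse SPD system (1-D Poisson stencil) in compressed form.
n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

x = spla.spsolve(A, b)            # the library call hides the factorization
print("residual norm:", np.linalg.norm(A @ x - b))
```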
  • The Rise and Fall of Thinking Machines: The Brilliant Start-up That Never Grasped the Basics
    Company Profile: The Rise and Fall of Thinking Machines, the brilliant start-up that never grasped the basics. By Gary Taubes. "Some day we will build a thinking machine. It will be a truly intelligent machine. One that can see and hear and speak. A machine that will be proud of us." (From a Thinking Machines brochure.) In 1990, seven years after its founding, Thinking Machines was the market leader in parallel supercomputers, with sales of about $65 million. Not only was the company profitable; it also, in the words of one IBM computer scientist, had cornered the market "on sex appeal in high-performance computing." Several giants in the computer industry were seeking a merger or a partnership with the company. The truth is very different. This is the story of how Thinking Machines got the jump on a hot new market, and then screwed up, big time. Until W. Daniel Hillis came along, computers more or less had been designed along the lines of ENIAC. In that machine a single processor completed instructions one at a time, in sequence. ... simple processors, all of them completing a single instruction at the same time. To get more speed, more processors would be added. Eventually, so the theory went, with enough processors (perhaps billions) and the right software, a massively parallel computer might start acting vaguely human. Whether it would take pride in its creators would remain to be seen. Hillis is what good scientists call a very ...
  • FPGA Implementation of RISC-Based Memory-Centric Processor Architecture
    (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 9, 2019. FPGA Implementation of RISC-based Memory-centric Processor Architecture. Danijela Efnusheva, Computer Science and Engineering Department, Faculty of Electrical Engineering and Information Technologies, Skopje, North Macedonia. Abstract: The development of the microprocessor industry in terms of speed, area, and multi-processing has resulted in increased data traffic between the processor and the memory in a classical processor-centric Von Neumann computing system. In order to alleviate the processor-memory bottleneck, in this paper we are proposing a RISC-based memory-centric processor architecture that provides a stronger merge between the processor and the memory, by adjusting the standard memory hierarchy model. Indeed, we are developing a RISC-based processor that integrates the memory into the same chip die, and thus provides direct access to the on-chip memory, without the use of general-purpose registers (GPRs) and cache memory. From the introduction: the bottleneck is caused by the strict separation of the processing and memory resources in the computer system. In such a processor-centric system the memory is used for storing data and programs, while the processor interprets and executes the program instructions in a sequential manner, repeatedly moving data from the main memory into the processor registers and vice versa [1]. Assuming that there is no final solution for overcoming the processor-memory bottleneck, modern computer systems implement different types of techniques for "mitigating" the occurrence of this problem [10] (e.g. branch prediction algorithms, speculative and re-ordered instruction execution, data and instruction pre-fetching, and multithreading).
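The central idea, instructions that name on-chip memory locations directly as operands instead of staging data through GPRs, can be caricatured with a toy interpreter. The opcode set and instruction format below are my inventions for illustration, not the paper's proposed ISA.

```python
mem = [0] * 16                   # the on-chip memory doubles as register file
mem[1], mem[2] = 7, 5

# (opcode, dst, src1, src2): every operand is a memory address, so there is
# no separate load/store step through general-purpose registers.
program = [
    ("add", 3, 1, 2),            # mem[3] = mem[1] + mem[2]
    ("mul", 4, 3, 3),            # mem[4] = mem[3] * mem[3]
]

for op, d, s1, s2 in program:
    if op == "add":
        mem[d] = mem[s1] + mem[s2]
    elif op == "mul":
        mem[d] = mem[s1] * mem[s2]

print(mem[3], mem[4])            # 12 144
```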
  • Analysis of the MasPar MP-1 Architecture, Bradley C. Kuszmaul
    Analysis of the MasPar MP-1 architecture. Bradley C. Kuszmaul. March 26, 1990. Please don't spread this document around outside of Thinking Machines Corporation. There may be mistakes, which are my fault, and if widely distributed, such mistakes might make Thinking Machines look bad. This analysis is mostly based on a talk given at the MIT VLSI Seminar, March 13, 1990: "VLSI Architecture for a Massively Parallel Computer", Won S. Kim and John Zapisek, MasPar Computer Corporation, about the MasPar MP-1 computer. Won S. Kim is the processor chip implementor. John Zapisek is the router chip implementor. I have tried to be clear about what is conjecture, and what is 'stated fact'. When I say that something is 'stated fact', I mean that Mr. Kim or Mr. Zapisek said it is true. When I say something is 'conjecture', I mean that I have reverse-engineered from the stated facts to hypothesize about their machine. When I do 'analysis', it is, unless otherwise stated, based on stated fact. Unless indicated otherwise, the statements in this document are stated fact. All of the analysis is based on the talk, rather than on other sources of information (i.e., this document is free from the effects of industrial espionage). The processor chip architecture: Each processor chip has 32 4-bit PEs. There are 48 32-bit registers per PE (on-chip memory). There are 16K bytes of DRAM per PE (off-chip; 64K bytes with 4 Mbit DRAM). The PE clock rate is 70 ns (I saw 14 MHz elsewhere). The memory is operated in fast-page mode at 80 ns/byte. The processors are 'clustered' into 16 PEs per cluster.
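Some quick arithmetic on the stated facts above, in the same spirit as the author's analysis. The derived totals are my calculations, and the chip count assumes a full 16,384-PE MP-1, which is an inference rather than a stated fact.

```python
PES_PER_CHIP = 32          # stated fact: 32 4-bit PEs per processor chip
PE_CLOCK_S = 70e-9         # stated fact: 70 ns PE clock
DRAM_S_PER_BYTE = 80e-9    # stated fact: fast-page mode at 80 ns/byte

ops_per_chip = PES_PER_CHIP / PE_CLOCK_S       # one 4-bit op per PE per cycle
print(f"{ops_per_chip / 1e6:.0f} M 4-bit ops/s per chip")

chips = 16384 // PES_PER_CHIP                  # assuming a full 16K-PE machine
print(f"{chips} chips, {chips * ops_per_chip / 1e9:.0f} G 4-bit ops/s total")

bw_chip = PES_PER_CHIP / DRAM_S_PER_BYTE       # one byte per PE every 80 ns
print(f"{bw_chip / 1e6:.0f} MB/s DRAM bandwidth per chip")
```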
  • Eigenvalue Computation
    Research Institute for Advanced Computer Science, NASA Ames Research Center. RIACS Technical Report 93.02, January 1993 (NASA-CR-194289). Submitted to: International Journal of Supercomputer Applications. Efficient, Massively Parallel Eigenvalue Computation. Yan Huo (Department of Electrical Engineering, Princeton University, Princeton NJ 08544) and Robert Schreiber (Research Institute for Advanced Computer Science, MS 230-5, NASA Ames Research Center, Moffett Field, CA 94035-1000). The Research Institute of Advanced Computer Science is operated by Universities Space Research Association, The American City Building, Suite 311, Columbia, MD 21044, (301) 730-2656. Work reported herein was supported by NASA via Contract NAS 2-13721 between NASA and the Universities Space Research Association (USRA); Robert Schreiber's work was supported by the NAS Systems Division and DARPA via Cooperative Agreement NCC 2-387 between NASA and USRA. Abstract: In numerical simulations of disordered electronic systems, one of the most common approaches is to diagonalize random Hamiltonian matrices and to study the eigenvalues and eigenfunctions of a single electron in the presence of a random potential. In this paper, we describe an effort to implement a matrix diagonalization routine for real symmetric dense matrices on massively parallel SIMD computers, the MasPar MP-1 and MP-2 systems. Results of numerical tests and timings are also presented.
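The workflow the abstract describes, in miniature: form a random symmetric "Hamiltonian" and take its full spectrum. NumPy's eigh stands in here for the authors' MasPar diagonalization routine; the matrix size and random ensemble are my choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
H = rng.standard_normal((n, n))
H = (H + H.T) / 2.0                  # symmetrize: dense, real symmetric

evals, evecs = np.linalg.eigh(H)     # full eigendecomposition
print("extreme eigenvalues:", evals[0], evals[-1])
print("eigenvectors orthonormal:", np.allclose(evecs.T @ evecs, np.eye(n)))
```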
  • Manycore Parallel Computing with CUDA
    Manycore Parallel Computing with CUDA. Mark Harris, [email protected]. Future science and engineering breakthroughs hinge on computing: computational geoscience, modeling, medicine, physics, chemistry, biology, finance, and image processing. © NVIDIA Corporation 2008. Faster is not "just faster":
    • 2-3x is "just faster": do a little more, wait a little less; doesn't change how you work.
    • 5-10x is "significant": worth upgrading; worth rewriting (parts of) your application.
    • 100x+ is "fundamentally different": worth considering a new platform; worth re-architecting your application; makes new applications possible; drives down "time to discovery"; creates fundamental changes in science.
    Parallel computing with CUDA enables new science and engineering by drastically reducing time to discovery (engineering design cycles: from days to minutes, weeks to days), and enables new computer science by reinvigorating research in parallel algorithms, programming models, architecture, compilers, and languages. Product lines: GeForce® (entertainment), Tesla™ (high-performance computing), Quadro® (design and creation). Wide developer acceptance and success: 146X for interactive visualization of volumetric white matter connectivity; 36X for ion placement in molecular dynamics simulation; 19X for transcoding an HD video stream to H.264; 17X for simulation in Matlab using a .mex-file CUDA function; 100X for astrophysics N-body simulation; 149X for financial simulation; 47X for GLAME@lab; 20X for ultrasound ...; 24X for highly optimized ...; 30X for Cmatch exact ...