Computational RAM, a Memory-SIMD Hybrid

Total Pages: 16

File Type: pdf, Size: 1020 KB

Computational RAM: A Memory-SIMD Hybrid

Duncan George Elliott. A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Electrical and Computer Engineering, University of Toronto. © 1998 by Duncan George Elliott.

Abstract

In this thesis, a novel computer architecture called Computational RAM (C•RAM) is proposed and implemented. C•RAM is semiconductor random access memory with processors incorporated into the design, while retaining a memory interface. C•RAM can be used to build an inexpensive massively-parallel computer. Applications that contain the appropriate parallelism will typically run thousands of times faster on C•RAM than on the CPU. This work includes the design and implementation of the architecture as a working chip with 64 processor elements (PEs), a PE design for a 2048-PE 4 Mbit DRAM, and applications.
C•RAM is the first processor-in-memory architecture that is scalable across many generations of DRAM. This scalability is obtained by pitch-matching narrow 1-bit PEs to the memory and restricting communication to 1-dimensional interconnects. The PEs are pitch-matched to memory columns so that they can be connected to the sense amplifiers. The 1-bit-wide datapath is suitable for a narrow, arrayable VLSI implementation, is compatible with memory redundancy, and has the highest performance/cost ratio among hardware arithmetic algorithms. For scalability, the memory arrays and memory-style packaging limit the internal interprocessor communications to 1-dimensional networks. Of these networks, both a broadcast bus network and a left-right nearest-neighbour network are implemented.

C•RAM requires little overhead over the existing memory to exploit much of the internal memory bandwidth. When C•RAM PEs are added to DRAM, more than 25% of the internal memory bandwidth is exploited at a cost of less than 25% in terms of silicon area and power. The memory bandwidth internal to memory chips at the sense amplifiers can be 3000 times the memory bandwidth at the CPU. By placing SIMD PEs adjacent to those sense amplifiers, this internal memory bandwidth can be better utilized. The performance of C•RAM has been demonstrated in a wide range of application areas, and speed-ups of several orders of magnitude compared to a typical workstation have been shown in the fields of signal and image processing, database, and CAD.

Acknowledgments

I am grateful for the support from and technical exchange with MOSAID Technologies and IBM Microelectronics; IC fabrication provided by Northern Telecom through the Canadian Microelectronics Corporation; and support from the Natural Sciences and Engineering Research Council of Canada, Micronet, and the University of Toronto. Professor David Wheeler of the Computer Laboratory, University of Cambridge introduced me to the idea of putting the computing in the memory chips. My supervisors, Professors Michael Stumm and Martin Snelgrove, kept me on track without a single change of topic. Michael Stumm put in more than a reasonable effort in encouraging me and organizing my work. Martin Snelgrove was instrumental in making connections to potential industrial partners and taught me valuable business lessons. I have learned lessons on DRAMs that no textbook could tell from Peter Gillingham, Graham Allan, Randy Torrance, Dick Foss, Iain Scott, David Frank, Howard Kalter, and John Barth. My lab-mates and members of the semiconductor design staff at MOSAID improved the quality and the enjoyment of my learning. Oswin Hall provided the first user feedback on C•RAM and proposed the PE shift register circuit. For strengthening this work by offering their questions and suggestions I also thank: Christian Cojocaru, Dickson Tak Shun Cheung, David Galloway, Robert McKenzie, Wayne Loucks, Tet Yeap, Martin Lefebvre, Jean-Francois Rivest, Sethuraman Panchanathan, Wojciech Ziarko, Thinh Le, Glenn Gulak, David Wortman, Graham Jullien, Safwat Zaky, and Zvonko Vranesic. This dissertation would not have seen the light of day without encouragement from my whole family. I would especially like to thank my wife, Janet Elliott, who was a constant source of enthusiasm, had streams of writing suggestions, and in the end sat me down to write this document.

Contents

1 Introduction
  1.1 Dissertation Organization
2 Prior Art
  2.1 Massively Parallel SIMD Processors
  2.2 Memories
    2.2.1 DRAM Circuits
    2.2.2 DRAM Chip Interfaces
    2.2.3 DRAM IC Process
    2.2.4 Redundancy
    2.2.5 Smart Memories
  2.3 Summary
3 Computing in the Memory: the Abstraction
  3.1 Memory Bandwidth
  3.2 Maintaining Compatibility with Memory
    3.2.1 Using DRAM IC Processes
    3.2.2 Physical Dimensions and Performance
    3.2.3 Coping with Long Wordlines
    3.2.4 Redundancy Considerations
    3.2.5 Packaging Considerations
  3.3 Additional Design Issues
    3.3.1 Asymptotic Advantages of Bit-Serial PEs
    3.3.2 PE Aspect Ratio
  3.4 Basic C•RAM Architecture
  3.5 Advantages of the C•RAM Architecture
    3.5.1 System Cost
    3.5.2 Performance
    3.5.3 Scalability
    3.5.4 Power Consumption
  3.6 Design Space Alternatives
  3.7 Summary
4 Architectural Details
  4.1 Processing Element
    4.1.1 ALU Design
    4.1.2 Conditional Operations
  4.2 Interconnection Network
    4.2.1 Suitability of Interconnection Networks
  4.3 System I/O
  4.4 Summary
5 Implementation
  5.1 Prototype
    5.1.1 Memory
    5.1.2 Processor Element
    5.1.3 Testing and Performance
    5.1.4 Summary
  5.2 High Density Design
    5.2.1 Reference DRAM Architecture
    5.2.2 C•RAM Memory
    5.2.3 PE-Memory Interface
    5.2.4 Processor Element
    5.2.5 Power
    5.2.6 External Interface
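The bit-serial, one-PE-per-column model that the abstract describes can be made concrete with a short simulation. This is a minimal sketch, assuming a vertical (one bit-slice per row) operand layout and a single carry register per PE; it is my own illustration, not code or an instruction set from the thesis.

```python
import numpy as np

# One 1-bit PE per memory column; operands stored "vertically" so that a
# single row activation delivers one bit to every PE in parallel.
N_PE, K = 8, 8                                 # 8 PEs/columns, 8-bit operands
mem = np.zeros((3 * K, N_PE), dtype=np.uint8)  # row blocks: A, B, SUM

a = np.random.randint(0, 2 ** (K - 1), N_PE)
b = np.random.randint(0, 2 ** (K - 1), N_PE)
for i in range(K):                             # load operands, LSB first
    mem[i, :] = (a >> i) & 1                   # rows 0..K-1  hold A
    mem[K + i, :] = (b >> i) & 1               # rows K..2K-1 hold B

carry = np.zeros(N_PE, dtype=np.uint8)         # one carry flip-flop per PE
for i in range(K):                             # K SIMD cycles for a K-bit add
    ai, bi = mem[i, :], mem[K + i, :]          # row reads at the sense amps
    mem[2 * K + i, :] = ai ^ bi ^ carry        # write the sum bit-slice back
    carry = (ai & bi) | (carry & (ai ^ bi))    # full-adder carry, per PE

s = sum(mem[2 * K + i, :].astype(int) << i for i in range(K))
assert np.array_equal(s, a + b)                # all N_PE adds finish together
```

Each loop iteration stands for one broadcast SIMD instruction, which is why an N-PE C•RAM completes N additions in the K cycles that a single bit-serial add takes.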
Recommended publications
  • System Design for a Computational-RAM Logic-In-Memory Parallel-Processing Machine
    System Design for a Computational-RAM Logic-In-Memory Parallel-Processing Machine. Peter M. Nyasulu, B.Sc., M.Eng. A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Ottawa-Carleton Institute for Electrical and Computer Engineering, Department of Electronics, Faculty of Engineering, Carleton University, Ottawa, Ontario, Canada. May, 1999. © Peter M. Nyasulu, 1999. Abstract: Integrating several 1-bit processing elements at the sense amplifiers of a standard RAM improves the performance of massively-parallel applications because of the inherent parallelism and high data bandwidth inside the memory chip.
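Both this thesis and Elliott's restrict inter-PE communication to the 1-dimensional networks mentioned above: a broadcast bus and a left-right nearest-neighbour link. The toy model below is my own construction (the wraparound at the array ends is an assumption), showing how the two primitives combine in a simple 3-point average.

```python
import numpy as np

x = np.arange(8, dtype=float)   # one local value per PE, PEs in a 1-D line

# Left-right nearest-neighbour network: each PE receives its neighbours'
# values in two shift operations (modelled here with wraparound).
from_left = np.roll(x, 1)       # PE i receives the value of PE i-1
from_right = np.roll(x, -1)     # PE i receives the value of PE i+1
smoothed = (from_left + x + from_right) / 3.0

# Broadcast bus: the central controller drives one scalar to every PE,
# which combines it with local data in the same step.
bias = 10.0
print(smoothed + bias)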
  • A Modern Primer on Processing in Memory
    A Modern Primer on Processing in Memory. Onur Mutlu (ETH Zürich, Carnegie Mellon University), Saugata Ghose (Carnegie Mellon University, University of Illinois at Urbana-Champaign), Juan Gómez-Luna (ETH Zürich), Rachata Ausavarungnirun (King Mongkut's University of Technology North Bangkok). SAFARI Research Group. Abstract: Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, the proliferation of different main memory standards and chips specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are evidence of this trend.
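A back-of-envelope calculation makes the data-movement argument concrete. The per-operation energies below are ballpark figures often quoted in the computer-architecture literature, not numbers taken from the primer, so treat the result as an order-of-magnitude sketch only.

```python
# Assumed ballpark energies (roughly 45 nm era estimates; illustrative only).
E_ADD_32 = 0.1e-12      # joules for a 32-bit integer add
E_DRAM_32 = 640e-12     # joules for a 32-bit off-chip DRAM access

n = 1_000_000           # element-wise c = a + b over n elements
e_compute = n * E_ADD_32
e_move = 3 * n * E_DRAM_32          # read a, read b, write c
print(f"compute {e_compute * 1e6:.1f} uJ vs movement {e_move * 1e6:.1f} uJ "
      f"({e_move / e_compute:.0f}x)")   # movement dominates by ~10^4
```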
  • A PARALLEL IMPLEMENTATION OF BACKPROPAGATION NEURAL NETWORK ON MASPAR MP-1, Faramarz Valafar, Purdue University School of Electrical Engineering
    Purdue University, Purdue e-Pubs, ECE Technical Reports, Electrical and Computer Engineering, 3-1-1993. A PARALLEL IMPLEMENTATION OF BACKPROPAGATION NEURAL NETWORK ON MASPAR MP-1. Faramarz Valafar and Okan K. Ersoy, Purdue University School of Electrical Engineering, W. Lafayette, IN 47906. Follow this and additional works at: http://docs.lib.purdue.edu/ecetr Valafar, Faramarz and Ersoy, Okan K., "A PARALLEL IMPLEMENTATION OF BACKPROPAGATION NEURAL NETWORK ON MASPAR MP-1" (1993). ECE Technical Reports. Paper 223. http://docs.lib.purdue.edu/ecetr/223 This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information. TR-EE 93-14, March 1993. (The Purdue University MasPar MP-1 research is supported in part by NSF Parallel Infrastructure Grant #CDA-9015696.) ABSTRACT: One of the major issues in using artificial neural networks is reducing the training and the testing times. Parallel processing is the most efficient approach for this purpose. In this paper, we explore the parallel implementation of the backpropagation algorithm with and without hidden layers [4][5] on the MasPar MP-1. This implementation is based on the SIMD architecture, and uses a backpropagation model which is more exact theoretically than the serial backpropagation model. This results in a smoother convergence to the solution. Most importantly, the processing time is reduced both theoretically and experimentally by the order of 3000, due to architectural and data parallelism of the backpropagation algorithm.
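A minimal sketch of the data-parallel mapping such a SIMD implementation implies: one training sample per (virtual) PE, so the batch dimension stands in for the processor array and weight updates are reductions across PEs. The toy task, network shape, and learning rate are my assumptions; this is not the report's MasPar code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4096, 8))             # 4096 "PEs", one sample each
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0   # toy separable target

W1 = rng.standard_normal((8, 16)) * 0.1        # weights broadcast to all PEs
W2 = rng.standard_normal((16, 1)) * 0.1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(200):
    h = sigmoid(X @ W1)                        # forward pass, PEs in lockstep
    out = sigmoid(h @ W2)
    d_out = (out - y) * out * (1 - out)        # output-layer delta, per PE
    d_h = (d_out @ W2.T) * h * (1 - h)         # hidden-layer delta, per PE
    W2 -= 0.5 * h.T @ d_out / len(X)           # reduction over the PE array
    W1 -= 0.5 * X.T @ d_h / len(X)

print("mean squared error:", float(((out - y) ** 2).mean()))
```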
  • Thinking Machines
    High Performance Computing And Communications Week. KING COMMUNICATIONS GROUP, INC., 627 National Press Building, Washington, D.C. 20045. (202) 638-4260, Fax: (202) 662-9744. Thursday, July 8, 1993. Volume 2, Number 27. Gordon Bell Makes His Case: Get the Feds out of Computer Architecture. By RICHARD McCORMACK. [Pull quote] Bell: "You've got huge forces telling you who's mainline and how you build computers." It's an issue that has been simmering, then smoldering and occasionally flaring up: will the big massively parallel machines costing tens of millions of dollars prove themselves worthy of their promise? Or will these machines, developed with millions of dollars from the taxpayer, be an embarrassing bust? It's a debate that occurs daily (even with spouses in bed at night) but not much outside of the high-performance computing industry's small borders. Literally thousands of people are engaged in trying to make massive parallelism a viable technology. But there are still few objective observers, very little data, and not enough experience with the big machines to prove, or disprove, their true worth. Interestingly, though, one of the biggest names in computing has made up his mind. The MPPs are awful, and the companies selling them, notably Intel and Thinking Machines, but others as well, are bound to fail, says Gordon Bell, whose name is attached to the ... afraid to talk about [the situation] because they know they've conned the world and they have to keep lying to support" their assertions that the technology needs government support, says the ever-quotable Bell. "It's really bad when it turns the scientists into a bunch of liars."
  • The MasPar MP-1 as a Computer Arithmetic Laboratory
    NISTIR 5569. The MasPar MP-1 as a Computer Arithmetic Laboratory. M. A. Anuta, D. W. Lozier and P. R. Turner. January 1995. U.S. DEPARTMENT OF COMMERCE, Technology Administration, National Institute of Standards and Technology, Applied and Computational Mathematics Division, Computing and Applied Mathematics Laboratory, Gaithersburg, MD 20899. Abstract: This paper describes the use of a massively parallel SIMD computer architecture for the simulation of various forms of computer arithmetic. The particular system used is a DEC/MasPar MP-1 with 4096 processors in a square array. This architecture has many advantages for such simulations due largely to the simplicity of the individual processors. Arithmetic operations can be spread across the processor array to simulate a hardware chip. Alternatively they may be performed on individual processors to allow simulation of a massively parallel implementation of the arithmetic. Compromises between these extremes permit speed-area trade-offs to be examined. The paper includes a description of the architecture and its features. It then summarizes some of the arithmetic systems which have been, or are to be, implemented.
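The first of the two mappings the abstract mentions, spreading a single arithmetic operation across the array so that each PE plays one bit of a hardware design, looks roughly like this. The 16-bit width and ripple-carry wiring are my illustrative choices, not the paper's actual simulations.

```python
import numpy as np

K = 16                                    # one PE per bit position
a, b = 40195, 17488
abits = np.array([(a >> i) & 1 for i in range(K)], dtype=np.uint8)
bbits = np.array([(b >> i) & 1 for i in range(K)], dtype=np.uint8)

# Each PE forms its full-adder outputs locally; the carry then ripples
# through the nearest-neighbour network, one PE-to-PE hop per step.
s = np.zeros(K, dtype=np.uint8)
carry = np.uint8(0)
for i in range(K):
    s[i] = abits[i] ^ bbits[i] ^ carry
    carry = (abits[i] & bbits[i]) | (carry & (abits[i] ^ bbits[i]))

assert sum(int(s[i]) << i for i in range(K)) == a + b
```

The alternative mapping, one complete arithmetic unit per PE, is the layout used in the backpropagation sketch above; trading between the two extremes is the speed-area compromise the abstract refers to.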
  • The Race Continues
    SLALOM Update: The Race Continues. John Gustafson, Diane Rover, Stephen Elbert, and Michael Carter. Ames Laboratory DOE, Ames, Iowa. Last November, we introduced in these pages a new kind of computer benchmark: a complete scientific problem that scales to the amount of computing power available, and always runs in the same amount of time… one minute. SLALOM assigns no penalty for novelty in language or architecture, and runs on computers as different as an Alliant, a MasPar, an nCUBE, and a Toshiba notebook PC. Since that time, there have been several developments:
    • The number of computer systems in the list has more than doubled.
    • The algorithms have improved.
    • An annual award for SLALOM performance has been announced.
    • SLALOM is the judge for at least one competitive supercomputer procurement.
    • The massively-parallel contenders are starting to unseat the low-end Cray computers.
    • All but a few major scientific computer manufacturers are represented in our report.
    • Many of the original numbers have improved significantly.
    Most Wanted List: We're still waiting to hear results for a few major players in supercomputing: Thinking Machines, Convex, MEIKO, and Stardent haven't sent anything to us, nor have any of their customers. We'd also very much like numbers for the WaveTracer and Active Memory Technology computers. Our Single-Instruction, Multiple Data (SIMD) version has been improved since the last Supercomputing Review article, so the groups working on those machines might want to check it out as a better starting point (see inset). The only IBM mainframe measurements are nonparallel and nonvector, so we expect big improvements to its performance.
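The fixed-time idea reduces to a simple driver loop: grow the problem until a run exceeds the budget, and report the largest size solved rather than the time for a fixed size. This sketch uses a hypothetical stand-in kernel and a one-second budget; SLALOM itself solves a radiosity problem in one minute.

```python
import time
import numpy as np

BUDGET = 1.0                      # seconds; SLALOM uses one minute

def kernel(n):
    """Stand-in for 'set up and solve an n-patch problem'."""
    a = np.random.rand(n, n) + n * np.eye(n)   # well-conditioned system
    np.linalg.solve(a, np.ones(n))

n, best = 64, None
while True:
    t0 = time.perf_counter()
    kernel(n)
    if time.perf_counter() - t0 > BUDGET:
        break                     # last size inside the budget is the score
    best = n
    n = int(n * 1.25)             # grow the problem, not the time
print("fixed-time score: largest n solved =", best)
```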
  • Porting Industrial Codes and Developing Sparse Linear Solvers on Parallel Computers
    Science and Engineering Research Council. RAL 94-019. "The Science and Engineering Research Council does not accept any responsibility for loss or damage arising from the use of information contained in any of its reports or in any communication about its tests or investigations." Porting Industrial Codes and Developing Sparse Linear Solvers on Parallel Computers. Michel J. Daydé and Iain S. Duff. ABSTRACT: We address the main issues when porting existing codes from serial to parallel computers and when developing portable parallel software on MIMD multiprocessors (shared memory, virtual shared memory, and distributed memory multiprocessors, and networks of computers). We discuss the use of numerical libraries as a way of developing portable and efficient parallel code. We illustrate this by using examples from our experience in porting industrial codes and in designing parallel numerical libraries. We report in some detail on the parallelization of scientific applications coming from Centre National d'Etudes Spatiales and from Aérospatiale, and we illustrate how it is possible to develop portable and efficient numerical software by considering the parallel solution of sparse linear systems of equations. Keywords: industrial codes, sparse matrices, multifrontal, BLAS, PVM, P4, MIMD multiprocessors, networks. AMS(MOS) subject classifications: 65F05, 65F50, 68N99, 68U99, 90C30. Computing Reviews classification: G.4, G.1.3, D.2.7, D.2.10. Computing Reviews General Terms: Algorithms. Text from invited lectures given at the First International Meeting on Vector and Parallel Processing, Porto, Portugal, 29 September - 1 October, 1993. Part of this work was supported by Aérospatiale, Division Avions, and Centre National d'Etudes Spatiales under contracts 11C05770 and 873/CNES/90/0841/00.
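In the spirit of the report's advice to reach portability through numerical libraries, here is a minimal sparse solve expressed behind a library interface. SciPy is an accessible stand-in of my choosing; the authors' work builds on multifrontal solvers and the BLAS, not SciPy.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Assemble a sparse SPD system (1-D Poisson stencil) in compressed form.
n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

x = spla.spsolve(A, b)            # the library call hides the factorization
print("residual norm:", np.linalg.norm(A @ x - b))
```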
  • The Rise and Fall of Thinking Machines: The Brilliant Start-up That Never Grasped the Basics
    Company Profile: The Rise and Fall of Thinking Machines, the brilliant start-up that never grasped the basics. By Gary Taubes. "Some day we will build a thinking machine. It will be a truly intelligent machine. One that can see and hear and speak. A machine that will be proud of us." (From a Thinking Machines brochure.) In 1990, seven years after its founding, Thinking Machines was the market leader in parallel supercomputers, with sales of about $65 million. Not only was the company profitable; it also, in the words of one IBM computer scientist, had cornered the market "on sex appeal in high-performance computing." Several giants in the computer industry were seeking a merger or a partnership with the company. The truth is very different. This is the story of how Thinking Machines got the jump on a hot new market, and then screwed up, big time. Until W. Daniel Hillis came along, computers more or less had been designed along the lines of ENIAC. In that machine a single processor completed instructions one at a time, in sequence. ... simple processors, all of them completing a single instruction at the same time. To get more speed, more processors would be added. Eventually, so the theory went, with enough processors (perhaps billions) and the right software, a massively parallel computer might start acting vaguely human. Whether it would take pride in its creators would remain to be seen. Hillis is what good scientists call a very ...
  • FPGA Implementation of RISC-Based Memory-Centric Processor Architecture
    (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 9, 2019. FPGA Implementation of RISC-based Memory-centric Processor Architecture. Danijela Efnusheva, Computer Science and Engineering Department, Faculty of Electrical Engineering and Information Technologies, Skopje, North Macedonia. Abstract: The development of the microprocessor industry in terms of speed, area, and multi-processing has resulted in increased data traffic between the processor and the memory in a classical processor-centric Von Neumann computing system. In order to alleviate the processor-memory bottleneck, in this paper we are proposing a RISC-based memory-centric processor architecture that provides a stronger merge between the processor and the memory, by adjusting the standard memory hierarchy model. Indeed, we are developing a RISC-based processor that integrates the memory into the same chip die, and thus provides direct access to the on-chip memory, without the use of general-purpose registers (GPRs) and cache memory. From the introduction: the bottleneck is caused by the strict separation of the processing and memory resources in the computer system. In such a processor-centric system the memory is used for storing data and programs, while the processor interprets and executes the program instructions in a sequential manner, repeatedly moving data from the main memory into the processor registers and vice versa [1]. Assuming that there is no final solution for overcoming the processor-memory bottleneck, modern computer systems implement different types of techniques for "mitigating" the occurrence of this problem [10] (e.g. branch prediction algorithms, speculative and re-ordered instruction execution, data and instruction pre-fetching, and multithreading).
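The central idea, instructions that name on-chip memory locations directly as operands instead of staging data through GPRs, can be caricatured with a toy interpreter. The opcode set and instruction format below are my inventions for illustration, not the paper's proposed ISA.

```python
mem = [0] * 16                   # the on-chip memory doubles as register file
mem[1], mem[2] = 7, 5

# (opcode, dst, src1, src2): every operand is a memory address, so there is
# no separate load/store step through general-purpose registers.
program = [
    ("add", 3, 1, 2),            # mem[3] = mem[1] + mem[2]
    ("mul", 4, 3, 3),            # mem[4] = mem[3] * mem[3]
]

for op, d, s1, s2 in program:
    if op == "add":
        mem[d] = mem[s1] + mem[s2]
    elif op == "mul":
        mem[d] = mem[s1] * mem[s2]

print(mem[3], mem[4])            # 12 144
```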
  • Analysis of the MasPar MP-1 Architecture, Bradley C. Kuszmaul
    Analysis of the MasPar MP-1 architecture. Bradley C. Kuszmaul. March 26, 1990. Please don't spread this document around outside of Thinking Machines Corporation. There may be mistakes, which are my fault, and if widely distributed, such mistakes might make Thinking Machines look bad. This analysis is mostly based on a talk given at the MIT VLSI Seminar, March 13, 1990: "VLSI Architecture for a Massively Parallel Computer", Won S. Kim and John Zapisek, MasPar Computer Corporation, about the MasPar MP-1 computer. Won S. Kim is the processor chip implementor. John Zapisek is the router chip implementor. I have tried to be clear about what is conjecture, and what is 'stated fact'. When I say that something is 'stated fact', I mean that Mr. Kim or Mr. Zapisek said it is true. When I say something is 'conjecture', I mean that I have reverse-engineered from the stated facts to hypothesize about their machine. When I do 'analysis', it is, unless otherwise stated, based on stated fact. Unless indicated otherwise, the statements in this document are stated fact. All of the analysis is based on the talk, rather than on other sources of information (i.e., this document is free from the effects of industrial espionage). The processor chip architecture: Each processor chip has 32 4-bit PEs. There are 48 32-bit registers per PE (on-chip memory). There are 16K bytes of DRAM per PE (off-chip; 64K bytes with 4 Mbit DRAM). The PE clock rate is 70 ns (I saw 14 MHz elsewhere). The memory is operated in fast-page mode at 80 ns/byte. The processors are 'clustered' into 16 PEs per cluster.
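Some quick arithmetic on the stated facts above, in the same spirit as the author's analysis. The derived totals are my calculations, and the chip count assumes a full 16,384-PE MP-1, which is an inference rather than a stated fact.

```python
PES_PER_CHIP = 32          # stated fact: 32 4-bit PEs per processor chip
PE_CLOCK_S = 70e-9         # stated fact: 70 ns PE clock
DRAM_S_PER_BYTE = 80e-9    # stated fact: fast-page mode at 80 ns/byte

ops_per_chip = PES_PER_CHIP / PE_CLOCK_S       # one 4-bit op per PE per cycle
print(f"{ops_per_chip / 1e6:.0f} M 4-bit ops/s per chip")

chips = 16384 // PES_PER_CHIP                  # assuming a full 16K-PE machine
print(f"{chips} chips, {chips * ops_per_chip / 1e9:.0f} G 4-bit ops/s total")

bw_chip = PES_PER_CHIP / DRAM_S_PER_BYTE       # one byte per PE every 80 ns
print(f"{bw_chip / 1e6:.0f} MB/s DRAM bandwidth per chip")
```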
  • Eigenvalue Computation
    Research Institute for Advanced Computer Science, NASA Ames Research Center. RIACS Technical Report 93.02, January 1993 (NASA-CR-194289). Submitted to: International Journal of Supercomputer Applications. Efficient, Massively Parallel Eigenvalue Computation. Yan Huo (Department of Electrical Engineering, Princeton University, Princeton NJ 08544) and Robert Schreiber (Research Institute for Advanced Computer Science, MS 230-5, NASA Ames Research Center, Moffett Field, CA 94035-1000). The Research Institute of Advanced Computer Science is operated by Universities Space Research Association, The American City Building, Suite 311, Columbia, MD 21044, (301) 730-2656. Work reported herein was supported by NASA via Contract NAS 2-13721 between NASA and the Universities Space Research Association (USRA); Robert Schreiber's work was supported by the NAS Systems Division and DARPA via Cooperative Agreement NCC 2-387 between NASA and USRA. Abstract: In numerical simulations of disordered electronic systems, one of the most common approaches is to diagonalize random Hamiltonian matrices and to study the eigenvalues and eigenfunctions of a single electron in the presence of a random potential. In this paper, we describe an effort to implement a matrix diagonalization routine for real symmetric dense matrices on massively parallel SIMD computers, the MasPar MP-1 and MP-2 systems. Results of numerical tests and timings are also presented.
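The workflow the abstract describes, in miniature: form a random symmetric "Hamiltonian" and take its full spectrum. NumPy's eigh stands in here for the authors' MasPar diagonalization routine; the matrix size and random ensemble are my choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
H = rng.standard_normal((n, n))
H = (H + H.T) / 2.0                  # symmetrize: dense, real symmetric

evals, evecs = np.linalg.eigh(H)     # full eigendecomposition
print("extreme eigenvalues:", evals[0], evals[-1])
print("eigenvectors orthonormal:", np.allclose(evecs.T @ evecs, np.eye(n)))
```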
  • Manycore Parallel Computing with CUDA
    Manycore Parallel Computing with CUDA. Mark Harris, [email protected]. Future science and engineering breakthroughs hinge on computing: computational geoscience, modeling, medicine, physics, chemistry, biology, finance, and image processing. © NVIDIA Corporation 2008. Faster is not "just faster":
    • 2-3x is "just faster": do a little more, wait a little less; doesn't change how you work.
    • 5-10x is "significant": worth upgrading; worth rewriting (parts of) your application.
    • 100x+ is "fundamentally different": worth considering a new platform; worth re-architecting your application; makes new applications possible; drives down "time to discovery"; creates fundamental changes in science.
    Parallel computing with CUDA enables new science and engineering by drastically reducing time to discovery (engineering design cycles: from days to minutes, weeks to days), and enables new computer science by reinvigorating research in parallel algorithms, programming models, architecture, compilers, and languages. Product lines: GeForce® (entertainment), Tesla™ (high-performance computing), Quadro® (design and creation). Wide developer acceptance and success: 146X for interactive visualization of volumetric white matter connectivity; 36X for ion placement in molecular dynamics simulation; 19X for transcoding an HD video stream to H.264; 17X for simulation in Matlab using a .mex-file CUDA function; 100X for astrophysics N-body simulation; 149X for financial simulation; 47X for GLAME@lab; 20X for ultrasound ...; 24X for highly optimized ...; 30X for Cmatch exact ...