Programming for the Intel Xeon Phi


Michael Florian Hava
RISC Software GmbH
RISC Software GmbH – Johannes Kepler University Linz © 2014, 07.04.2014

The Road to Xeon Phi

Intel Pentium (1993 - 1995)
• The Pentium was the first superscalar x86
  – No out-of-order execution!
  – Predates all SIMD extensions
• 1994: P54C
  – 75 – 100 MHz
  – Core design is used in several research projects, including the Xeon Phi architecture!

Tera-Scale Computing (2006-)
• Research project to design a TFLOP CPU
• 2007: Teraflops Research Chip/Polaris
  – 80 cores (96-bit VLIW)
  – 1 TFLOP @ 63 W
• 2009: Single-chip Cloud Computer
  – 48 cores (x86)
  – No cache coherence

Larrabee (2009)
• A fully programmable GPGPU based on x86
  – Software renderer for OpenGL, DirectX, …
• 32 – 48 cores
  – 4-way Hyper-Threading
  – Cache coherence
  – 512-bit vector registers [LRBni]
• Planned product release: 2009-2010

Many Integrated Core (2010-)
• 2010: Knights Ferry (prototype)
  – 32 cores @ 1.2 GHz
  – 4-way Hyper-Threading
  – 512-bit vector registers [???]
  – 0.75 TFLOPS @ single precision
• 2012: Knights Corner [Xeon Phi]
  – 57 – 61 cores @ 1.0 – 1.2 GHz
  – 4-way Hyper-Threading
  – 512-bit vector registers [KNC]
  – 1 TFLOP @ double precision

Xeon Phi
• Calculating peak Xeon Phi FLOPS
  – FLOPS = #cores × GHz × SIMD vector width × fp-ops (the result is in GFLOPS, since the clock is given in GHz)
  – FMA == 2 floating point operations (takes 1 cycle)
• SIMD vector width:
  – Single precision: 16 elements
  – Double precision: 8 elements

    57 × 1.1 × 16 × 2 → 2 TFLOPS (single)
    57 × 1.1 × 8 × 2 → 1 TFLOPS (double)

The Future
• 2015: Knights Landing
  – Up to 72 Airmont cores
  – 4-way Hyper-Threading
  – 512-bit vector registers [AVX-512]
    • Support for existing x86 extensions
  – 3 TFLOPS @ double precision
  – Both as co-processor and standalone CPU

Programming for the Xeon Phi

Supported Technologies
[Diagram: a CPU executable and a native MIC executable, connected via PCIe (offloading). Both sides list the same technologies: Tools, MKL, Fortran (CAF), TBB, OpenMP, C++, Cilk Plus, OpenCL.]

Execution Models
• Native execution
  – Copy cross-compiled executable to Phi
  – Local Linux-based OS
    • Almost completely independent from host system
• Offloading
  – Implicit/Automatic
  – Explicit/Manual
• Message Passing (MPI)
  – Phi as Cluster or Node

Offloading
• Similar to the execution model of GPGPUs
• Technologies
  – OpenCL
  – Intel Offload Extension
  – OpenMP 4.0
• Xeon Phi reserves one core in offload mode

Memory Models
• Explicit/distinct
  – Identical to the GPGPU
  – Memory has to be copied
  – Limited to primitive types
• Implicit/virtual shared
  – Simulates shared memory (SMM)
  – Complex data structures
  – Only available in C/C++

OpenMP 4.0
• Released in July 2013
• Introduces several new concepts
  – Accelerators
  – SIMD
  – Thread teams
• Partially abandons pure shared memory model
  – Accelerators contain local memory
  – Explicit memory & computation offloading

Tagging for Offload
• Types, functions, variables, etc. that should be available on the accelerator have to be tagged

    #pragma omp declare target
    //contains last error of calculate
    extern int last_error;

    struct result_t { float sum, avg, min, max; };

    result_t calculate(const float * array, int count);
    #pragma omp end declare target

Offloading Computations and Data
• Offloading of computation and data is controlled via pragmas

    float * array = new float[N];
    std::generate_n(array, N, rand); //initialize array

    result_t result; //declared before the pragma so that the map clause can name it
    //create device data context and start computation
    #pragma omp target map(to: array[0:N], N) \
                       map(from: result, last_error)
    result = calculate(array, N);

    print(result, last_error);
    delete[] array;

"Explicit" Device Data Context Management

    #pragma omp target data map(to: N)
    { //data device context exists for this scope
        float * array = new float[N];
        std::generate_n(array, N, rand); //initialize array
        #pragma omp target update to(array[0:N])

        result_t result;
        #pragma omp target map(from: result)
        result = calculate(array, N);

        #pragma omp target update from(last_error)
        print(result, last_error);
        delete[] array;
    }

Intel OpenMP Extensions (KMP)

Environment Variables (Xeon Phi)
• Xeon Phi environment uses a custom prefix
  – MIC_ENV_PREFIX=##MIC##
  – Variables with the prefix are copied to the Xeon Phi at offload
• Samples
  – ##MIC##_OMP_NUM_THREADS
  – ##MIC##_KMP_DETERMINISTIC_REDUCTION
  – ##MIC##_KMP_AFFINITY

Deterministic OpenMP Reductions
• OpenMP does not specify the order in which the partial sums should be combined!
  – Results are not reproducible!

    std::vector<float> arr = get_input();
    float sum = 0.0f; //must be initialized before the reduction
    #pragma omp parallel for reduction(+:sum)
    for(std::size_t i = 0; i < arr.size(); ++i)
        sum += arr[i];

• KMP_DETERMINISTIC_REDUCTION=1
  – Ensures reproducible results (from run to run, with the same number of threads)
  – Slight performance impact

Thread Affinity
• Distribution of OpenMP threads is implementation defined
• KMP_AFFINITY allows control of distribution across sockets, CPUs, cores and hyper-threads
  – Compact: use closest possible cores
  – Scatter: distribute evenly among all cores
  – Balanced: distribute evenly among all cores, but keep “close” threads on cores that are as close as possible
    • only available for the Xeon Phi
    • recommended mode

Thread affinity on Multicore CPU

              Core0     Core1     Core2     Core3
              HT0 HT1   HT0 HT1   HT0 HT1   HT0 HT1
    compact   0   1     2   3     -   -     -   -
    scatter   0   -     1   -     2   -     3   -

Thread affinity with Dual-Socket System

              Socket0                                   Socket1
              Core0     Core1     Core2     Core3       Core0     Core1     Core2     Core3
              HT0 HT1   HT0 HT1   HT0 HT1   HT0 HT1     HT0 HT1   HT0 HT1   HT0 HT1   HT0 HT1
    compact   0   1     2   3     4   5     6   7       -   -     -   -     -   -     -   -
    scatter   0   -     4   -     2   -     6   -       1   -     5   -     3   -     7   -

Thread affinity on Xeon Phi (thread IDs in hexadecimal)

              Core0             Core1             Core2             Core3             Core4             Core5
              HT0 HT1 HT2 HT3   HT0 HT1 HT2 HT3   HT0 HT1 HT2 HT3   HT0 HT1 HT2 HT3   HT0 HT1 HT2 HT3   HT0 HT1 HT2 HT3
    compact   0   1   2   3     4   5   6   7     8   9   A   B     -   -   -   -     -   -   -   -     -   -   -   -
    scatter   0   6   -   -     1   7   -   -     2   8   -   -     3   9   -   -     4   A   -   -     5   B   -   -
    balanced  0   1   -   -     2   3   -   -     4   5   -   -     6   7   -   -     8   9   -   -     A   B   -   -

Thank You!
[Photo: Castor (4228 m) and Pollux (4092 m), between the Monte Rosa massif and the Matterhorn, Valais, Switzerland]
www.risc-software.at