A SIMD Approach to Large-Scale Real-Time System Air Traffic Control Using Associative Processor and Consequences for Parallel Computing

Total pages: 16
File type: PDF, size: 1020 KB

A SIMD APPROACH TO LARGE-SCALE REAL-TIME SYSTEM AIR TRAFFIC CONTROL USING ASSOCIATIVE PROCESSOR AND CONSEQUENCES FOR PARALLEL COMPUTING

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy by Man Yuan, December 2012.

Dissertation written by Man Yuan
B.S., Hefei University of Technology, China, 2001
M.S., University of Western Ontario, Canada, 2003
Ph.D., Kent State University, 2012

Approved by
Dr. Johnnie W. Baker, Chair, Doctoral Dissertation Committee
Dr. Lothar Reichel, Member, Doctoral Dissertation Committee
Dr. Mikhail Nesterenko
Dr. Ye Zhao
Dr. Richmond Nettey

Accepted by
Dr. Javed I. Khan, Chair, Department of Computer Science
Dr. Raymond Craig, Dean, College of Arts and Sciences

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
Acknowledgements
1 Introduction
2 Background Information
  2.1 Flynn's Taxonomy and Classification
    2.1.1 Multiple Instruction Stream, Multiple Data Stream (MIMD)
    2.1.2 Single Instruction Stream, Multiple Data Stream (SIMD)
  2.2 Real-Time Systems
  2.3 Air Traffic Control (ATC)
    2.3.1 Task Characteristics
    2.3.2 The Worst-Case Environment of ATC
    2.3.3 ATC Tasks
  2.4 An Associative Processor for ATC
  2.5 OpenMP
    2.5.1 Avoid False Sharing
    2.5.2 Optimize Barrier Use
    2.5.3 Avoid the Ordered Construct
    2.5.4 Avoid Large Critical Regions
    2.5.5 Maximize Parallel Regions
    2.5.6 Avoid Parallel Regions in Inner Loops
    2.5.7 Improve Load Balance
    2.5.8 Using Compiler Features to Improve Performance
3 Survey of Literature
  3.1 Previous Work on ATC
  3.2 Classical MIMD Real-Time Scheduling Theory
  3.3 Other Similar Work
4 AP Solution to ATC
  4.1 Overview of AP Solution
  4.2 Advantages of AP over MIMD for ATC
  4.3 AP Properties
5 Emulating the AP on the ClearSpeed CSX600
  5.1 Overview of ClearSpeed CSX600
  5.2 Emulating the AP on the ClearSpeed CSX600
6 ATC System Design
  6.1 ATC Data Flow
  6.2 Static Scheduling
  6.3 Scaling Up from CSX600 to a Realistic-Size AP
7 AP Solution for ATC Tasks Implemented on CSX600
  7.1 Report Correlation and Tracking
  7.2 Cockpit Display
  7.3 Controller Display Update
  7.4 Automatic Voice Advisory
  7.5 Sporadic Requests
  7.6 Conflict Detection and Resolution (CD&R)
    7.6.1 Conflict Detection
    7.6.2 Conflict Resolution
  7.7 Terrain Avoidance
  7.8 Final Approach (Runways)
  7.9 Comparison of AP and MIMD Solutions for ATC Tasks
  7.10 Implementation of Algorithm Issues
8 Experimental Results
  8.1 Experimental Setup
  8.2 Experiment Results
    8.2.1 Comparison of Performance on STI, MIMD and CSX600
    8.2.2 Comparison of Performance on MIMD, CSX600 and STARAN
    8.2.3 How Close to the AP Is Our CSX600 Emulation of the AP
    8.2.4 Comparison of Predictability on CSX600 and MIMD
    8.2.5 Comparison of Missing Deadlines on CSX600 and MIMD
    8.2.6 Timings for 8 Tasks
    8.2.7 Video and Actual STARAN Demos
9 Observations and Consequences of the Preceding Results
10 Conclusions and Future Work
BIBLIOGRAPHY

LIST OF FIGURES

1 SIMD Model
2 AP Architecture
3 Fork-Join Task Model
4 A Possible AP Architecture for ATC
5 CSX600 Accelerator Board
6 MTAP Architecture of CSX600
7 Static Schedule of ATC Tasks Implemented in AP
8 Track/Report Correlation
9 Search Box Size for Flight Maneuver
10 Conflict Detection
11 Timing of Tracking Task
12 Timing of Terrain Avoidance Task
13 Timing of CD&R Task
14 Comparison of Tracking Task of MTI, CSX600 and STARAN
15 Comparison of Terrain Avoidance Task of MTI, CSX600 and STARAN
16 Comparison of CD&R Task of MTI, CSX600 and STARAN
17 Comparison of Display Processing Task of MTI, CSX600 and STARAN
18 Time of Tracking for Number of Aircraft from 10 to 384
19 Time of CD&R for Number of Aircraft from 10 to 384
20 Comparison of MTI, STARAN and Super-CSX600 with at Most One Aircraft per PE
21 Predictability of Execution Times for Report Correlation and Tracking Task
22 Predictability of Execution Times for Terrain Avoidance Task
23 Predictability of Execution Times for CD&R Task
24 Number of Iterations Missing Deadlines When Scheduling Tasks
25 Overall ATC System Design

LIST OF TABLES

1 Timings of Associative Functions
2 Static Schedule for ATC Tasks
3 Statically Scheduled Solution Time for Worst-Case Environment of ATC
4 Data Structures for Radar Reports
5 Data Structures for Tracks
6 Comparison of Required ATC Operations
7 Performance of One Flight/PE
8 Performance of Eight Tasks for Ten Flights/PE

Acknowledgements

I would like to sincerely thank my advisor, Dr. Johnnie W. Baker, for his constant support and encouragement throughout my Ph.D. study. He has been the advisor for my research work throughout these years and also the mentor for my professional career development. Without his encouragement and guidance, this dissertation would have been impossible.

My special thanks go to the many people who have helped me at various stages of this work: Will Meilander, Frank Drews, Sue Peti, Marcy Curtiss, and others. In particular, Professor Will Meilander has contributed valuable information for my dissertation. He has read through all my publications and provided many useful comments. I would also like to thank my dissertation committee members for their willingness to serve on the committee and for their valuable time.

On a personal note, I would like to express my gratitude to my parents, Zhen Yuan and Xiaohang Xu, for their love and emotional support. Finally, I would like to thank my girlfriend, Sijin Zhang, for her moral support and for the incredible amount of patience she had with me so that I could eventually realize my dream.

CHAPTER 1
Introduction

The Air Traffic Control (ATC) system is a typical real-time system that continuously monitors, examines, and manages airspace conditions for thousands of flights by processing large volumes of data that change dynamically due to reports from sensors, pilots, and controllers, and it gives the best estimate of the position, speed, and heading of every aircraft in the environment at all times. The ATC software consists of multiple real-time tasks that must be completed in time to meet their individual deadlines.
By performing these tasks, the ATC system keeps timely, correct information concerning positions, velocities, contiguous airspace conditions, etc., for all flights under control. Since any lateness or failure of task completion could cause disastrous consequences, the time constraints on tasks are critical [1–3].

In the past, solutions to this problem have been implemented on multicomputer systems, with records for the various aircraft stored in memory accessible to the processors in the system. The dispersed nature of this dynamic ATC system, together with the necessity of maintaining data integrity while providing rapid access to the data by multiple instruction streams (MIS), increases the difficulty and complexity of handling air traffic control. It is difficult for these MIMD implementations to satisfy reasonable predictability standards, which are critical for meeting the strict certification standards needed to ensure safety for critical software components. Massive efforts have been devoted to finding an efficient MIMD solution to the ATC problems since 1963, but none of them has been satisfactory yet [4–7]. To a large degree, this is because the software used by MIMDs in solving real-time problems involves solutions to many difficult problems such as dynamic scheduling, load balancing, etc. [6]. It is currently widely accepted that large real-time problems such as ATC do not have a polynomial-time MIMD solution [8, 9]. Even though many researchers have put extensive effort into producing a reasonably good (e.g., heuristic) solution to the ATC problem, the current ATC system has repeatedly failed to meet the requirements of the ATC automation system. It has frequently been reported that the current ATC system periodically loses radar data without any warning and misses tracking of aircraft in the sky. This problem has even occurred with Air Force One.
[3–5,7] Instead of using the traditional MIMD approach, we implement a prototype of the ATC system
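The key idea behind the associative processor (AP) approach is that every processing element holds one flight record and a query is broadcast to all records at once, so the response time does not grow with the number of aircraft the way a record-by-record MIMD search can. The following NumPy sketch is purely illustrative and not from the dissertation: the field layout, units, and the search-box query (in the spirit of the search box used in track/report correlation) are all assumptions, with the boolean mask standing in for the AP's responder bits.

```python
# Illustrative sketch of an associative (SIMD-style) search over flight
# records. Each row models one PE's record; the comparisons below are
# evaluated across all records in one data-parallel step, the way an
# associative processor raises responder bits.
# All field names, units, and thresholds here are hypothetical.
import numpy as np

# One row per flight; columns: x, y position (nmi) and altitude (ft).
flights = np.array([
    [10.0,  5.0, 31000.0],
    [12.0,  6.0, 31500.0],
    [80.0, 40.0, 10000.0],
    [11.0,  5.5, 30900.0],
])

def flights_near(flights, x, y, box_nmi):
    """Return indices of flights inside a square search box around (x, y).

    Emulates an associative query: the comparisons are applied to every
    record simultaneously, and the boolean "responders" mask plays the
    role of the AP's responder bits.
    """
    dx = np.abs(flights[:, 0] - x)
    dy = np.abs(flights[:, 1] - y)
    responders = (dx <= box_nmi) & (dy <= box_nmi)
    return np.flatnonzero(responders)

print(flights_near(flights, 11.0, 5.5, 2.0))  # -> [0 1 3]
```

On a true AP the mask is computed in hardware in constant time per query, which is what makes the worst-case timing of tasks like correlation and conflict detection predictable; the NumPy version merely mimics that data-parallel shape in software.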