Enabling the Use of Low Power Mobile and Embedded Technologies For

Total Page:16

File Type:pdf, Size:1020Kb

Enabling the Use of Low Power Mobile and Embedded Technologies For Enabling the Use of Embedded and Mobile Technologies for High-Performance Computing Author: Advisor: Nikola Rajovic´ Alex Ramirez A THESIS SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor per la Universitat Polit`ecnicade Catalunya Departament d’Arquitectura de Computadors Barcelona, Spring 2017 To my family . Моjоj породици . Acknowledgements During my PhD studies I received help and support from many people - it would be strange if I did not. Thus I would like to thank professors Mateo Valero and Veljko Milutinovi´cfor recognizing my interest in HPC and introducing me to my advisor, professor Alex Ramirez. I would like to thank him for trusting in me and giving me an opportunity to be a pioneer of HPC on mobile ARM platforms, for guiding and shaping my work last five years, for helping me understand what real priorities are, and for being pushy when it was really needed. In addition, thank you Carlos and Puzo for helping me to have a smooth start of my PhD and for filling the gap when Alex was too busy. Last two years of my PhD I have been working with Alex remotely, and I would like to thank the local crew who helped me: professor Eduard Ayguade for providing me with some very hard-to-get manuscripts and accepting to be my "ponente" at the University; professor Jesus Labarta for thorough dis- cussions about parallel performance issues and know-how lessons on demand; Alex Rico for helping me deliver work for the Mont-Blanc project and for friendly advices every so little; Filippo Mantovani for making sure I could use the prototypes without outages and for filtering-out internal bureaucracy issues. In addition, I would like to acknowledge support from BSC Tools depart- ment, especially Harald Servat and Judit Gimenez, who always managed to find a free slot for me in their busy agenda. Gabriele Carteni and Lluis Vilanova shared with me their Linux Guru experience, which was extremely helpful during deployment of my first proto- type - Tibidabo. Thank you guys! In addition, thank you Dani for bypassing ticketing system and fixing issues related to our internal prototypes as soon as I trigger them. Thanks to prelectura committee and the constructive comments I re- ceived, the quality of my PhD thesis manuscript was significantly improved. Professor Nacho Navarro used to cheer me up many times with his down- to-earth attitude, ideas and advices - I am sad that he cannot see me defend- ing my thesis. v I shared my office with a lot of people, and I would like to thank them for all the good and bad times. I would like to thank all my Serbian friends I lived, used to hang around, played basketball, or shared office with - thank you Браћала, Боки, Бране, Радуловача, Уги, Зоки, Зуки, Влаjко, Мрџи, Павле, Вуjке, Перо, Пузо ... Thanks to Rajovi´cheadquarters and their unconditional support for my ideas, my PhD adventure finally ends with the writing of this Acknowledge- ment. Knowing you are with me helped a lot! Last but not least I would like to thank my darling, my Ivana, for the patience, understanding and endless support. You are the true hero . and thank you Ugljeˇsa,my son, for inspiring me with your smile . Author My graduate work leading to these results was supported by the PRACE project (European Community funding under grants RI-261557 and RI-283493), Mont-Blanc project (European Community’s Seventh Framework Programme [FP7/2007-2013] under grant agreement no 288777), the Spanish Ministry of Science and Technology through Computacion de Altas Prestaciones (CICYT) VI (TIN2012-34557), and the Spanish Government through Programa Severo Ochoa (SEV-2011-0067). Author has been financially supported throughout his studies with a grant from Barcelona Supercomputing Center. vi Abstract In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing (HPC). This transfor- mation has been so effective that the November 2016 TOP500 list is still dominated by x86 architecture. In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smart- phones and tablets, most of which are built with ARM-based Systems on Chips (SoC). This suggests that once mobile SoCs deliver sufficient perfor- mance, mobile SoCs can help reduce the cost of HPC. This thesis addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. Through development of real system prototypes and their performance anal- ysis we assess the feasibility of building an HPC system based on mobile SoCs. Through simulation of the future mobile SoC, we identify the missing features and suggest improvements that would enable the use of future mo- bile SoCs in HPC environment. Thus, we present design guidelines for future generations mobile SoCs, and HPC systems built around them, enabling the new class of cheap supercomputers. vii Contents Front matteri Dedication . iii Acknowledgements . .v Abstract . vii Contents . xi List of figures . xvi List of tables . xviii 1 Introduction1 1.1 Microprocessors in Supercomputing . .1 1.2 Energy Efficiency . .3 1.3 Mobile Processors Evolution . .5 1.3.1 ARM Processors . .6 1.3.2 Embedded GPUs . .8 1.4 Contributions . .8 2 Related Work 11 3 Methodology 17 3.1 Hardware platforms . 17 3.2 Single core, CPU, and node benchmarks . 18 3.2.1 Mont-Blanc benchmarks . 18 3.3 System benchmarks and workloads . 20 3.4 Power measurements . 22 3.5 Simulation methodology . 22 3.6 Tools . 24 3.6.1 Extrae . 24 3.6.2 Paraver . 25 3.6.3 Clustering tools . 25 3.6.4 Basic analysis tool . 25 3.6.5 GA tool . 25 ix CONTENTS 3.7 Reporting . 26 4 ARM Processors Performance Assessment 27 4.1 Floating-Point Support Issue . 27 4.2 Compiler Flags Exploration . 28 4.2.1 Compiler Maturity . 29 4.3 Achieving Peak Floating-Point Performance . 30 4.3.1 Algebra Backend . 31 4.4 Comparison Against a Contemporary x86 Processor . 32 4.4.1 Results . 33 4.4.2 Discussion . 34 5 Tibidabo, The First Mobile HPC Cluster 37 5.1 Architecture . 37 5.2 Software Stack . 39 5.3 Evaluation . 40 5.3.1 Methodology . 40 5.3.2 Cluster Performance . 40 5.3.3 Interconnect . 44 5.4 Comparison Against an X86-Based Cluster . 45 5.4.1 Reference x86 System . 47 5.4.2 Applications . 47 5.4.3 Power Acquisition . 47 5.4.4 Input Configurations . 48 5.4.5 Results . 48 5.5 Projections . 51 5.6 Interconnect requirements . 57 5.6.1 Lessons Learned and Next Steps . 58 5.7 Conclusions . 61 6 Mobile Developer Kits 63 6.1 Evaluation Methodology . 64 6.2 CARMA Kit: a Mobile SoC and a Discrete GPU . 64 6.2.1 Evaluation Results . 65 6.3 Arndale Kit: Improved CPU Core IP and On-Chip GPU . 68 6.3.1 ARM Mali-T604 GPU IP . 68 6.3.2 Evaluation Results . 69 6.4 Putting It All Together . 70 6.4.1 Comparison Against a Contemporary x86 Architecture 70 6.5 Conclusions . 75 x CONTENTS 7 The Mont-Blanc Prototype 77 7.1 Architecture . 77 7.1.1 Compute Node . 78 7.1.2 The Mont-Blanc Blade . 78 7.1.3 The Mont-Blanc System . 79 7.1.4 The Mont-Blanc Software Stack . 80 7.1.5 Power Monitoring Infrastructure . 82 7.1.6 Performance Summary . 83 7.2 Compute Node Evaluation . 84 7.2.1 Core Evaluation . 85 7.2.2 Node Evaluation . 85 7.2.3 Node Power Profiling . 87 7.3 Interconnection Network Tuning and Evaluation . 88 7.4 Overall System Evaluation . 90 7.4.1 Applications Scalability . 90 7.4.2 Comparison With Traditional HPC . 95 7.5 Scalability Projection . 97 7.6 Conclusions . 99 8 Mont-Blanc Next-Generation 101 8.1 Methodology . 101 8.1.1 Description . 102 8.1.2 Benchmarks . 104 8.1.3 Applications . 105 8.1.4 Base and Target Architectures . 105 8.1.5 Validation . 105 8.2 Performance Projections . 106 8.2.1 Mont-Blanc Prototype . 108 8.2.2 NVIDIA Jetson . 109 8.2.3 ARM Juno . 109 8.2.4 NG Node . 110 8.3 Power Projections . 111 8.3.1 Methodology . 111 8.4 Conclusions . 117 9 Conclusions 119 9.1 Future Work . 120 Back matter 121 List of publications . 121 Bibliography . 123 xi List of Figures 1.1 Development of CPU architectures share in supercomputers from TOP500 list. .2 1.2 Vector and commodity processors peak floating-point perfor- mance developmenmt . .3 1.3 Server and mobile processors peak floating-point performance development. .5 3.1 Methodology: power measurement setup. 22 3.2 An example of a Dimemas simulation where each row presents the activity of a single processor: it is either in a computation phase (grey) or in MPI communication (black). 24 4.1 Comparison of different compilers for ARM Cortex-A9 with Dhrystone benchmark. 29 4.2 Comparison of different compilers for ARM Cortex-A9 with LINPACK1000x1000 benchmark. 30 4.3 Exploration of the ARM Cortex-A9 double-precision floating- point pipeline for FADD and FMAC instructions with mi- crobenchmarks. 31 4.4 Performance of HPL on ARM Cortex-A9 for different input matrix and block sizes. 32 4.5 Comparison between Intel Core i7-64M and ARM Cortex-A9 with SPEC CPU2006 benchmark suite. 35 5.1 Tibidabo prototype: physical view of the node card and the node motherboard . ..
Recommended publications
  • DISI - University of Trento
    PhD Dissertation International Doctorate School in Information and Communication Technologies DISI - University of Trento Cyber-Physical Systems: Two Case Studies in Design Methodologies Luca Rizzon Advisor: Prof. Roberto Passerone Universit`adegli Studi di Trento April 2016 Abstract To analyze embedded systems, engineers use tools that can simulate the performance of software components executed on hardware architectures. When the embedded system functionality is strongly correlated to physical quantities, as in the case of Cyber-Physical System (CPS), we need to model physical processes to determine the overall behavior of the system. Unfortunately, embedded systems simulators are not generally suitable to evaluate physical processes, and in the same way physical model simulators hardly capture the functionality of computing systems. In this work, we present a methodology to concurrently explore these aspects using the metroII design framework. The methodology provides guidelines for the implementation of these models in the design environment. To demonstrate the feasibility of the proposed approach, we applied the methodology to two case studies. A case study regards a binaural guidance system developed to be included into a smart rollator for older adults. The second case consists of an energy recovery device which gets energy from the heat dissipated by a high performance processor and power a smart sink able to provide cooling or to serve as a wireless sensing node. Keywords [Cyber-Physical Systems, Design Methodology, Binaural Synthesis, Energy Harvesting] Contents 1 Introduction 1 1.1 CPS Design Challenges . .1 1.2 Structure of the Thesis . .3 2 Design Methodology 5 2.1 State of the Art . .5 2.2 MetroII Design Framework .
    [Show full text]
  • Introducing Slambench, a Performance and Accuracy Benchmarking Methodology for SLAM
    Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM Luigi Nardi1, Bruno Bodin2, M. Zeeshan Zia1, John Mawer3, Andy Nisbet3, Paul H. J. Kelly1, Andrew J. Davison1, Mikel Lujan´ 3, Michael F. P. O’Boyle2, Graham Riley3, Nigel Topham2 and Steve Furber3 Kernels Architectures Correctness Performance Metrics Frame rate Computer bilateralFilter (..) halfSampleRobust (..) Vision renderVolume (..) integrate (..) : Energy : Accuracy Compiler/Runtime Hardware ICL-NUIM Dataset Fig. 1: The SLAMBench framework makes it possible for experts coming from the computer vision, compiler, run-time, and hardware communities to cooperate in a unified way to tackle algorithmic and implementation alternatives. Abstract— Real-time dense computer vision and SLAM offer While classical point feature-based simultaneous locali- great potential for a new level of scene modelling, tracking sation and mapping (SLAM) techniques are now crossing and real environmental interaction for many types of robot, into mainstream products via embedded implementation in but their high computational requirements mean that use on mass market embedded platforms is challenging. Mean- projects like Project Tango [3] and Dyson 360 Eye [1], while, trends in low-cost, low-power processing are towards dense SLAM algorithms with their high computational re- massive parallelism and heterogeneity, making it difficult for quirements are largely at the prototype stage on PC or robotics and vision researchers to implement their algorithms laptop platforms. Meanwhile, there has been a great focus in a performance-portable way. In this paper we introduce in computer vision on developing benchmarks for accuracy SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and comparison, but not on analysing and characterising the validatable experimental research to investigate trade-offs in envelopes for performance and energy consumption.
    [Show full text]
  • Report on Tuned Linux-ARM Kernel and Delivery of Kernel Patches to the Linux Kernel Version 1.0
    D5.4{ Report on Tuned Linux-ARM kernel and delivery of kernel patches to the Linux kernel Version 1.0 Document Information Contract Number 288777 Project Website www.montblanc-project.eu Contractual Deadline M18 Dissemintation Level PU Nature Other Coordinator Alex Ramirez (BSC) Contributors Roxana Rusitoru (ARM) Reviewers Isaac Gelado (BSC) Keywords Linux kernel, HPC, Transparent HugePage Notices: The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 288777 c 2013 Mont-Blanc Consortium Partners. All rights reserved. D5.4 - Report on Tuned Linux-ARM kernel and delivery of kernel patches to the Linux kernel Version 1.0 Change Log Version Description of Change v0.1 Initial draft released to the WP5 contributors v0.2 Revised draft released to the WP5 contributors v1.0 Version to send to the WP leaders and the EC 2 D5.4 - Report on Tuned Linux-ARM kernel and delivery of kernel patches to the Linux kernel Version 1.0 Contents 1 Introduction5 2 Analysis of Transparent HugePages: pandaboard5 2.1 Introduction...................................5 2.2 Hardware setup.................................5 2.2.1 Limitations...............................5 2.3 Benchmarks...................................6 2.3.1 Setup..................................8 2.3.1.1 Compilation flags......................8 2.3.1.2 Processor affinity.......................8 2.3.1.3 ATLAS............................8 2.4 Sample HPC application - bigDFT......................8 2.5 Methodology..................................9 2.6 Results...................................... 12 2.6.1 Observations.............................. 15 2.7 Conclusions................................... 15 2.8 Further work.................................. 15 2.8.1 Kernel profiling............................. 15 2.8.2 In-depth analysis of results.....................
    [Show full text]
  • Proyecto Fin De Grado
    ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA Y SISTEMAS DE TELECOMUNICACIÓN PROYECTO FIN DE GRADO TÍTULO: Despliegue de Liota (Little IoT Agent) en Raspberry Pi AUTOR: Ricardo Amador Pérez TITULACIÓN: Ingeniería Telemática TUTOR (o Director en su caso): Antonio da Silva Fariña DEPARTAMENTO: Departamento de Ingeniería Telemática y Electrónica VºBº Miembros del Tribunal Calificador: PRESIDENTE: David Luengo García VOCAL: Antonio da Silva Fariña SECRETARIO: Ana Belén García Hernando Fecha de lectura: Calificación: El Secretario, Despliegue de Liota (Little IoT Agent) en Raspberry Pi Quizás de todas las líneas que he escrito para este proyecto, estas sean a la vez las más fáciles y las más difíciles de todas. Fáciles porque podría doblar la longitud de este proyecto solo agradeciendo a mis padres la infinita paciencia que han tenido conmigo, el apoyo que me han dado siempre, y el esfuerzo que han hecho para que estas líneas se hagan realidad. Por todo ello y mil cosas más, gracias. Mamá, papá, lo he conseguido. Fáciles porque sin mi tutor Antonio, este proyecto tampoco sería una realidad, no solo por su propia labor de tutor, si no porque literalmente sin su ayuda no se hubiera entregado a tiempo y funcionando. Después de esto Antonio, voy a tener que dejarme ganar algún combate en kenpo como agradecimiento. Fáciles porque, sí melones os toca a vosotros, Alex, Alfonso, Manu, Sama, habéis sido mi apoyo más grande en los momentos más difíciles y oscuros, y mis mejores compañeros en los momentos de felicidad. Amigos de Kulturales, los hermanos Baños por empujarme a mejorar, Pablo por ser un ejemplo a seguir, Chou, por ser de los mejores profesores y amigos que he tenido jamás.
    [Show full text]
  • Que E Un Arduino
    Que E Un Arduino Que E Un Arduino 1 / 3 2 / 3 Arduino es una plataforma de hardware libre, basada en una placa con un microcontrolador y un entorno de desarrollo (software), diseñada .... Arduino es una plataforma de electrónica que pretende facilitar el uso de la electrónica en todo tipo de proyectos y se fundamenta en el código .... Código Abierto (open source y open hardware) significa que está permitida la fabricación o ensamblaje de las placas Arduino y la distribución del software por .... Arduino es una plataforma de electrónica "open-source"o de código abierto cuyos principios son contar con software y hardware fáciles de usar.. Arduino es una plataforma de hardware de fuente abierta basada en una sencilla placa de entradas y salidas simple y un entorno de desarrollo que .... The open-source Arduino Software (IDE) makes it easy to write code and upload it to the board. It runs on Windows, Mac OS X, and Linux. The .... Arduino es una plataforma de desarrollo basada en una placa electrónica de hardware libre que incorpora un microcontrolador re-programable y una serie de .... Y es que el arduino es nada más y nada menos que una placa basada en un microcontrolador, concretamente un ATMEL. Pero, ¿qué es un .... Para que serve um Arduino? Quais as vantagens? Como eu começo a programar? Nesse tutorial vamos apresentar um resumo sobre o que é .... Qué es Arduino ?, Arduino es una tarjeta electrónica digital y además es un lenguaje de programación basado en C++ que es "open-source".. Arduino is an open-source electronics platform based on easy-to-use hardware and software.
    [Show full text]
  • Energy Neutral Wireless Sensing for Server Farms Monitoring
    324 IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 4, NO. 3, SEPTEMBER 2014 Energy Neutral Wireless Sensing for Server Farms Monitoring Maurizio Rossi, Student Member, IEEE, Luca Rizzon, Matteo Fait, Roberto Passerone, Member, IEEE,and Davide Brunelli, Member, IEEE Abstract—Energy harvesting techniques are consolidating as ef- symbiosis with the server, since it relies on the server to guar- fective solutions to power electronic devices with embedded wire- antee its power supply, and acts as a data collector to monitor less capability. We present an energy harvesting system capable to interesting features, including temperature and other quantities sustain sensing and wireless communication using thermoelectric generators as energy scavengers and server’s CPU as heat source. related to thermal and resources wastage. We target data center safety monitoring, where human presence Technology advances enable the development of innova- should be avoided, and the maintenance must be reduced the most. tive architectures for embedded systems towards orthogonal We selected ARM-based CPUs to tune and to demonstrate the pro- directions. From one side, systems characterized by high posed solution since market forecasts envision this architecture as the core of future data centers. Our main goal is to achieve a com- computing performance exhibit lower power consumption pletely sustainable monitoring system powered with heat dissipa- and thermal dissipation compared to previous generation of tion of microprocessor. To this end we present the performance personal computers. In fact, ARM-based devices are consid- characterization of different thermal-electric harvesters. We dis- ered a promising solution for building computing server for cuss the relationship between the temperature and the CPU load cloud service providers and Web 2.0 applications [1], [2].
    [Show full text]
  • D5.5– Intermediate Report on Porting and Tuning of System Software to ARM Architecture Version 1.0
    D5.5– Intermediate Report on Porting and Tuning of System Software to ARM Architecture Version 1.0 Document Information Contract Number 288777 Project Website www.montblanc-project.eu Contractual Deadline M24 Dissemintation Level PU Nature Report Coordinator Alex Ramirez (BSC) Contributors Gabor´ Dozsa´ (ARM), Axel Auweter (BADW-LRZ), Daniele Tafani (BADW-LRZ), Vicenc¸ Beltran (BSC), Harald Servat (BSC), Judit Gimenez (BSC), Ramon Nou (BSC), Petar Radojkovic´ (BSC) Reviewers Chris Adeniji-Jones (ARM) Keywords System software, HPC, Energy-efficiency Notices: The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 288777 c 2011 Mont-Blanc Consortium Partners. All rights reserved. D5.5 - Intermediate Report on Porting and Tuning of System Software to ARM Architecture Version 1.0 Change Log Version Description of Change v0.1 Initial draft released to the WP5 contributors v0.2 Version released to Internal Reviewer v1.0 Final version sent to EU 2 D5.5 - Intermediate Report on Porting and Tuning of System Software to ARM Architecture Version 1.0 Contents Executive Summary 4 1 Introduction 5 2 Parallel Programming Model and Runtime Libraries 6 2.1 Nanos++ runtime system and Mercurium compiler . .6 2.2 Tuning of the OpenMPI library for ARM Architecture . .6 2.2.1 Baseline Performance Results . .8 2.2.2 Optimizations . .9 3 Operating System 11 4 Scientific libraries 12 5 Toolchain 13 5.1 Development Tools . 13 5.2 Performance Monitoring Tools . 13 5.3 Performance Analysis Tools . 13 6 Cluster Management 14 6.1 System Monitoring .
    [Show full text]
  • On First Page
    D4.2 “Final report about the porting of the full-scale scientific applications” Version 1.0 Document Information Contract Number 288777 Project Website www.montblanc-project.eu Contractual Deadline M24 Dissemination Level PU Nature Report Author S. Requena (GENCI) B. Videau (IMAG), D. Brayford (LRZ), P. Lanucara (CINECA), X. Saez (BSC), R. Halver (JSC), D. Broemmel (JSC), S. Mohanty Contributors (JSC), N. Sanna (CINECA), C. Cavazzoni (CINECA), JH. Meincke (JSC), D. Komatitsch (Université de Marseille) and V. Moureau (CORIA) Reviewer Petar Radojkovic (BSC) Keywords Exascale, scientific applications, porting, profiling, optimisation Notices: The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777. 2011 Mont-Blanc Consortium Partners. All rights reserved. D4.2 “Final report about the porting of the full-scale scientific applications” Version 1.0 Change Log Version Description of Change V0.1 Initial draft released to the WP4 contributors V0.2 Version released to internal reviewer V0.3 Comments of the internal reviewer V1.0 Final version sent to EC 2 D4.2 “Final report about the porting of the full-scale scientific applications” Version 1.0 Table of Contents 1 Introduction ..........................................................................................................................5 2 Platforms used by WP4 ..........................................................................................................6 3 Report
    [Show full text]
  • Design, Construction, and Use of a Single Board Computer Beowulf Cluster: Application of the Small- Footprint, Low-Cost, Insignal 5420 Octa Board
    Design, Construction, and Use of a Single Board Computer Beowulf Cluster: Application of the Small- Footprint, Low-Cost, InSignal 5420 Octa Board James J. Cusick, William Miller, Nicholas Laurita, Tasha Pitt {James.Cusick; William.Miller; Nicholas.Lauita; Tasha.Pitt}@wolterskluwer.com Wolters Kluwer, New York, NY here focuses on work done for the New York-based Corporate Abstract—In recent years development in the area of Single Legal Services (CLS) Division which manages five units Board Computing has been advancing rapidly. At Wolters including CT Corporation (CT). The systems operated by CT Kluwer’s Corporate Legal Services Division a prototyping effort include public-facing Web-based applications and internally was undertaken to establish the utility of such devices for used ERP (Enterprise Resource Planning) systems. Major practical and general computing needs. This paper presents the vendors manage network services and hosting for our background of this work, the design and construction of a 64 core 96 GHz cluster, and their possibility of yielding approximately computing environments. The CT IT team is responsible for 400 GFLOPs from a set of small footprint InSignal boards the development and operations of these systems from an end created for just over $2,300. Additionally this paper discusses the customer standpoint. The R&D work reflected here is hoped to software environment on the cluster, the use of a standard benefit the development of these systems in the long run by Beowulf library and its operation, as well as other software reducing costs, improving flexibility, and reducing Time to application uses including Elastic Search and ownCloud.
    [Show full text]
  • Samsung Exynos 5 Board - 10-30-2012 by Vincent - Streamcomputing
    Samsung Exynos 5 board - 10-30-2012 by vincent - StreamComputing - http://streamcomputing.eu Samsung Exynos 5 board by vincent – Tuesday, October 30, 2012 https://streamcomputing.eu/knowledge/sdks/samsung-exynos-5-board/ This Samsung Exynos 5 dev-board, the Arndale Board, contains an ARM Cortex-A15 dual core CPU and an ARM Mali-T604. Currently (November 2012) the MALI T604 is currently the strongest GPU with a theoretical peak-performance of 68 (or 72) GFLOPS. Its TDP is between 2 and 4 Watts (currently unknown). It is unknown what the NEON-capable ARM CPU adds to the total amount of GFLOPS. SDK The SDK can be downloaded here. Developers manual is here. For compilation on Ubuntu g++-arm-linux-gnueabi is needed. Also remove the “-none” in platform.mk. Then compilation will result in a libOpenCL.so. It is unclear if the drivers will be available for other devices with the same chipset (such as the new Chromebook and the Nexus 10). Software available for the board can be found here. Drivers for Arndale+Android (including graphics drivers with OpenCL-support) can be found here. A guide is in the making. Boards The following boards are available: 1. Arndale 5250-A 2. ARMBRIX Zero 3. YICsystem YSE5250 4. YICsystem SMDKC520 5. Nexus 10 Scroll down for more info on these boards. Board 1: Arndale 5250-A The board is fully loaded and can be extended with touch-screen, SSD, Wifi+sound and camera. Below is an image with the soundboard and connectivity-board attached. Working boards using OpenCL on display (click on image to see the Twitter-status): Here are a few characteristics: page 1 / 6 Samsung Exynos 5 board - 10-30-2012 by vincent - StreamComputing - http://streamcomputing.eu Cortex-A15 @ 1.7 GHz dual core 128 bit SIMD NEON 2GB 800MHz DDR3 (32 bit) More information can be found on the wiki and forum.
    [Show full text]
  • Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine Danilo Câmara, Conrado Gouvêa, Julio López, Ricardo Dahab
    Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine Danilo Câmara, Conrado Gouvêa, Julio López, Ricardo Dahab To cite this version: Danilo Câmara, Conrado Gouvêa, Julio López, Ricardo Dahab. Fast Software Polynomial Multipli- cation on ARM Processors Using the NEON Engine. 1st Cross-Domain Conference and Workshop on Availability, Reliability, and Security in Information Systems (CD-ARES), Sep 2013, Regensburg, Germany. pp.137-154. hal-01506572 HAL Id: hal-01506572 https://hal.inria.fr/hal-01506572 Submitted on 12 Apr 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Distributed under a Creative Commons Attribution| 4.0 International License Fast Software Polynomial Multiplication on ARM Processors using the NEON Engine Danilo C^amara,Conrado P. L. Gouv^ea?, Julio L´opez??, and Ricardo Dahab University of Campinas (Unicamp) fdfcamara,conradoplg,jlopez,[email protected] Abstract. Efficient algorithms for binary field operations are required in several cryptographic operations such as digital signatures over binary elliptic curves and encryption. The main performance-critical operation in these fields is the multiplication, since most processors do not support instructions to carry out a polynomial multiplication. In this paper we describe a novel software multiplier for performing a polynomial multi- plication of two 64-bit binary polynomials based on the VMULL instruction included in the NEON engine supported in many ARM processors.
    [Show full text]
  • KVM/ARM: the Design and Implementation of the Linux ARM Hypervisor
    KVM/ARM: The Design and Implementation of the Linux ARM Hypervisor Christoffer Dall and Jason Nieh ~ 1.2 billion ~ 300 million Source: Gartner Market Report, Jan 2014 ARM Servers ARM Network Equipment Virtualization Extensions Key Challenges != Virtualization Extensions VT-x • No PC-standard on ARM Key Contributions • Split-Mode Virtualization • Implemented and evaluated KVM/ARM • Upstreamed implementation to mainline linux Outline • ARM Virtualization Extensions • Comparison to x86 • Split-Mode Virtualization • Results • Experiences ARM Virtualization Extensions • Provides virtualization in 4 key areas: • CPU Virtualization • Memory Virtualization • Interrupt Virtualization • Timer Virtualization ARM Virtualization Extensions CPU Virtualization User Kernel ARM Virtualization Extensions CPU Virtualization User Kernel Hyp ARM Virtualization Extensions CPU Virtualization VM 0 VM 1 User User Kernel Kernel Hypervisor ARM Virtualization Extensions CPU Virtualization • Separate stack • Different control registers • Different page table format • Controls Memory, Interrupt, and Timer virtualization • Can be completely bypassed ARM Virtualization Extensions Memory Virtualization 0 4 GB Virtual User space application Kernel Addresses Stage-1 Page Tables MMU 0 2^40 Intermediate Physical RAM Devices Addresses (IPA) Stage-2 Page Tables MMU 0 2^40 Physical RAM Devices Addresses ARM Virtualization Extensions Interrupt Virtualization • Interrupts are an important part of any system • The mechanism to notify CPU of events • CPUs interact with an interrupt
    [Show full text]