Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing

Gaurav Mitra, Andrew Haigh, Luke Angove, Anish Varghese, Eric McCreath, Alistair P. Rendell
Research School of Computer Science, Australian National University, Canberra, Australia
Mitra et al. (ANU), GTC 2016, San Francisco, April 07, 2016

Overview
  1 Introduction & Background
  2 Power Measurement Environment
  3 Experimental Platforms
  4 Approach
  5 Results & Analysis
  6 Conclusion

Introduction & Background

Use of low-powered SoCs for HPC
  • Nvidia Jetson TK1: ARM + GPU SoC
  • Nvidia Jetson TX1: ARM + GPU SoC
  • TI Keystone II: ARM + DSP SoC
  • Adapteva Parallella: ARM + 64-core NoC
  • TI BeagleBoard: ARM + DSP SoC
  • Terasic DE1: ARM + FPGA SoC
  • Rockchip Firefly: ARM + GPU SoC
  • Freescale Wandboard: ARM + GPU SoC
  • Cubieboard4: ARM + GPU SoC
  http://cs.anu.edu.au/systems

In order for SoC processors to be considered viable exascale building blocks, important factors to explore include:
  • Absolute performance
  • Balancing use of different on-chip devices
  • Understanding the performance-energy trade-off

Contributions
  • Environment for monitoring and collecting high resolution power measurements for SoC systems
  • Understanding the benefits of exploiting both the host CPU and accelerator GPU cores simultaneously for critical HPC kernels
  • Performance and energy comparisons with conventional HPC systems: Intel Xeon CPUs and NVIDIA K20 and K80 GPUs

Power Measurement Environment

Measurement Requirements
  • SoC systems generally consume very low power, on the order of a few watts
  • Subtle differences in energy consumption are triggered by different factors, such as the use of CPU or on-chip GPU cores
  • Changes in DC current supplied to SoC system boards must be reliably measured
  • Current use ranges from µA to a few A, so a very high-precision ammeter must be used to measure subtle changes

Measurement Apparatus
  • µCurrent Gold: high-precision ammeter for measuring low currents (https://www.eevblog.com/projects/ucurrent/)
  • An mbed LPC1768 micro-controller with a 12-bit ADC (0-3.3V), used to measure analog output signals from the µCurrent Gold (https://developer.mbed.org/)
  • The ADC has a resolution of 0.81±0.40mV (3.3V spread over 4096 steps), which corresponds to 0.81mA. This is 9.7±4.8mW at 12V.

[Figure: Power Measurement Environment]

Experimental Platforms

               TK1              TX1              SANDY           HASWELL
  CPU          ARM Cortex-A15   ARM Cortex-A57   Xeon E5-2665    Xeon E5-2670 v3
  CPU Cores    4                4                2×8             2×12
  CPU Freq.    2.3 GHz          2.2 GHz          2.4 GHz         2.3 GHz
  RAM          2GB LPDDR3       3GB LPDDR4       128GB DDR3      128GB DDR3
  GPU          GK20A            GM20B            K20m (GK110)    K80 (GK210)
  GPU Cores    192              256              2496            2496
  GPU Freq.    852 MHz          998 MHz          706 MHz         875 MHz
  GPU RAM      Shared           Shared           5GB             12GB
  CUDA         v6.5             v7.0             v7.0            v7.5

Approach

Evaluation Kernel
  $C = A \times B$
  $\begin{bmatrix} C_1 & C_2 \end{bmatrix} = A \times \begin{bmatrix} B_1 & B_2 \end{bmatrix}$
  CPU: $C_1 = A \times B_1$
  GPU: $C_2 = A \times B_2$
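
The column split in the evaluation kernel can be illustrated with a small sketch. This is not the authors' implementation: it is plain Python with a naive triple-loop product standing in for the tuned CPU and GPU GEMM calls, and the names `matmul`, `split_columns`, `split_gemm` and `cpu_cols` are assumptions made for illustration. It only shows that computing C1 = A × B1 and C2 = A × B2 separately and placing the results side by side reproduces C = A × B.

```python
def matmul(a, b):
    """Naive product of two row-major matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0]) if b else 0
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_columns(b, cpu_cols):
    """Split B into B1 (first cpu_cols columns) and B2 (remaining columns)."""
    b1 = [row[:cpu_cols] for row in b]
    b2 = [row[cpu_cols:] for row in b]
    return b1, b2


def split_gemm(a, b, cpu_cols):
    """Compute C = A x B as [C1 C2], with C1 = A x B1 and C2 = A x B2."""
    b1, b2 = split_columns(b, cpu_cols)
    c1 = matmul(a, b1)  # stand-in for the part assigned to the CPU cores
    c2 = matmul(a, b2)  # stand-in for the part assigned to the GPU
    # Join the two column blocks row by row to rebuild the full C.
    return [r1 + r2 for r1, r2 in zip(c1, c2)]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6, 7], [8, 9, 10]]
    print(split_gemm(a, b, cpu_cols=1))
```

Because the two products touch disjoint column blocks of B and C, they are independent and could run concurrently on different devices; `cpu_cols` plays the role of the split size given to the CPU.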

Approaches
  • Traditional methods: assign all work to GPU or CPU
  • Static partitioning: partition work between GPU and CPU based on a priori information
      Beaumont et al., Matrix Multiplication on Heterogeneous Platforms
      C. Yang et al., Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
      Donfack et al., Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs
  • Dynamic partitioning:
      Papadrakakis et al., A New Era in Scientific Computing: Domain Decomposition Methods in Hybrid CPU-GPU Architectures
  Existing approaches do not consider the use of shared physical memory or the implications for energy efficiency.

Our approach

Static partitioning:
  • Guess a partition based on experimentally measured peak performances of CPU and GPU
  • Use the achieved peaks to refine the partition
  • Repeat until convergence
  • Suitable for repeated calculations of the same size

Dynamic partitioning:
  • CPU and GPU remove chunks of matrix columns from a workqueue
  • Chunk size must be sufficient to occupy CPU and GPU fully
  • On traditional discrete GPU systems, copies have to be carefully scheduled
  • Implemented using OpenMP
  • Two threads, one each for CPU and GPU, taking work off a master queue
  • The GPU thread executes at the expense of doing productive work on the CPU cores

Use of shared memory on SoC systems:
  • The CUDA driver automatically protects CUDA-allocated memory during the kernel execution phase
  • We circumvent this by unprotecting the memory with mprotect() immediately after initiating a kernel execution
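
The static partitioning loop described above (guess, measure, refine, repeat until convergence) might look roughly like the following sketch. The slides do not give the exact update rule, so the proportional split used here is an assumption, as are the names `initial_split`, `refine_split` and the `measure` callback, which is a hypothetical hook that would run one split GEMM and return the achieved CPU and GPU rates. Rates are assumed to be positive.

```python
def initial_split(n_cols, cpu_rate, gpu_rate):
    """Give the CPU a share of columns proportional to its rate (assumed rule)."""
    return round(n_cols * cpu_rate / (cpu_rate + gpu_rate))


def refine_split(n_cols, cpu_rate, gpu_rate, measure, max_iters=20):
    """Refine the CPU column count until it stops changing.

    `measure(cpu_cols, gpu_cols)` is a hypothetical callback that runs the
    split computation once and returns the achieved (cpu_rate, gpu_rate),
    for example in GFLOPS.
    """
    cpu_cols = initial_split(n_cols, cpu_rate, gpu_rate)
    for _ in range(max_iters):
        cpu_rate, gpu_rate = measure(cpu_cols, n_cols - cpu_cols)
        new_cols = initial_split(n_cols, cpu_rate, gpu_rate)
        if new_cols == cpu_cols:  # converged: the split no longer moves
            break
        cpu_cols = new_cols
    return cpu_cols


if __name__ == "__main__":
    # First guess from standalone rates of 14 (CPU) and 12 (GPU) GFLOPS.
    print(initial_split(4096, 14, 12))
```

With standalone rates of 14 and 12 GFLOPS on a 4096-column problem, the proportional first guess is 2206 columns for the CPU; the refinement step then adjusts that guess using rates achieved while both devices run together.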

Results & Analysis

Results: Best split performance

             Platform   Matrix Size   CPU GFLOPS   GPU GFLOPS   CPU Split Cols   Split GFLOPS
  DGEMM      TK1        4096          14           12           2176             26
             TX1        4096          18           9            2608             25
             SANDY      8192          311          836          2128             1099
             HASWELL    16384         804          1124         6912             1870
  SGEMM      TK1        4096          34           205          448              227
             TX1        4096          38           391          128              399
             SANDY      16384         643          2318         3392             2887
             HASWELL    16384         1753         2526         6896             4109

[Figure: Best Split Search - Tegra K1/X1. Panels: DGEMM, SGEMM. Series: TK1 GFLOPS, TX1 GFLOPS, TK1 JOULES, TX1 JOULES. x-axis: Split Size Given to CPU; y-axes: GFLOPS and JOULES.]

[Figure: Best Split Search - Intel + NVIDIA GPUs. Panels: DGEMM, SGEMM. Series: SANDY GFLOPS, HASWELL GFLOPS, SANDY JOULES, HASWELL JOULES. x-axis: Split Size Given to CPU; y-axes: GFLOPS and JOULES.]

[Figure: Performance Scaling - TK1. Panels: DGEMM GFLOPS, SGEMM GFLOPS. Series: CPU, GPU, SPLIT, DYNAMIC, TBALANCE, PEAK (CPU+GPU). x-axis: Matrix Dimension M=N=K, 16 to 4096.]

[Figure: Performance Scaling - TX1. Panels: DGEMM GFLOPS, SGEMM GFLOPS. Series: CPU, GPU, SPLIT, DYNAMIC, TBALANCE, PEAK (CPU+GPU). x-axis: Matrix Dimension M=N=K, 16 to 4096.]

[Figure: Energy Efficiency - TX1 - SGEMM. Series: CPU, GPU, SPLIT, TBALANCE, DYNAMIC. x-axis: Matrix Dimension M=N=K, 128 to 4096; y-axis: Joules/FLOP (SP), log scale. Labelled values: 4.22 · 10^-10 and 3.75 · 10^-11.]

[Figure: Energy Efficiency - Haswell - SGEMM. Series: CPU, GPU, SPLIT, TBALANCE, DYNAMIC. x-axis: Matrix Dimension M=N=K, 512 to 16384; y-axis: Joules/FLOP (SP), log scale. Labelled values: 1.76 · 10^-10 and 8.24 · 10^-11.]

Conclusion
  • The high-accuracy, high-resolution energy measurement system introduced here enables tuning algorithms for optimal energy usage. This would allow libraries like ATLAS to tune for, and produce, both best-performance and best-energy optimized variants.
  • How might a running application use information on energy usage to dynamically change its behaviour?
  • Use of shared physical memory on SoC systems eliminates transfer overhead.
  • In at least one case (TX1 DGEMM), an energy benefit was observed from exploiting both CPU and GPU together.
  • The best energy efficiency observed on SoC systems was 37.5 pJ/FLOP (SGEMM on TX1), while on conventional systems 82.4 pJ/FLOP (SGEMM) was observed on the K80.

Contact: [email protected]
https://www.linkedin.com/in/alistair-rendell-6230b72
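
The energy figures in the conclusion are quoted in pJ/FLOP, while the energy efficiency plots are labelled in Joules/FLOP. As a small worked conversion (the helper names `to_picojoules` and `gflops_per_watt` are assumptions, not taken from the slides), the sketch below relates the two and also expresses each value as GFLOPS per watt, which is simply the reciprocal of Joules/FLOP scaled by 10^9.

```python
def to_picojoules(joules_per_flop):
    """Convert an energy per operation from J/FLOP to pJ/FLOP."""
    return joules_per_flop * 1e12


def gflops_per_watt(joules_per_flop):
    """FLOP/J equals FLOPS/W, so invert J/FLOP and scale to GFLOPS/W."""
    return 1.0 / joules_per_flop / 1e9


if __name__ == "__main__":
    # Best SGEMM values labelled on the TX1 and Haswell (K80) plots.
    for name, value in [("TX1", 3.75e-11), ("K80", 8.24e-11)]:
        print(name, to_picojoules(value), round(gflops_per_watt(value), 2))
```

Under this conversion, 3.75 · 10^-11 J/FLOP is 37.5 pJ/FLOP, or roughly 26.7 GFLOPS/W, and 8.24 · 10^-11 J/FLOP is 82.4 pJ/FLOP, or roughly 12.1 GFLOPS/W.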