Poster/Demo Sessions Slides

Total Pages: 16

File Type: PDF, Size: 1020 KB

Poster/Demo Sessions Slides
From the RISC-V Workshop in Barcelona, 7-10 May 2018

Table of Contents for Poster/Demo Sessions Slides
• Derek Atkins, Slide 3
• Mary Bennett, Slides 4 – 5
• Ekaterina Berezina and Andrey Smolyarov, Slides 6 – 7
• Alex Bradbury, Slide 8
• Luca Carloni and Christian Palmiero, Slides 9 – 10
• Jie Chen, Slides 11 – 13
• Matt Cockrell, Slides 14 – 26
• Alberto Dassatti, Slides 27 – 28
• Christian Fabre, Slides 29 – 30
• Juan Fumero, Slides 31 – 32
• Nicolas Gaude and Hai Yu, Slides 33 – 36
• Chris Jones and Zdenek Prikryl, Slides 37 – 38
• Felix Kaiser, Slides 39 – 40
• Alexander Kamkin and Andrei Tatarnikov, Slides 41 – 42
• Luke Leighton, Slide 43
• Heng Lin, Slides 44 – 45
• Maja Malenko, Slide 46
• Eric Matthews and Lesley Shannon, Slides 47 – 48
• Paulo Matos, Slides 49 – 50
• Lucas Morais, Slides 51 – 52
• Mauro Olivieri, Slides 53 – 54
• Aleksandar Pajkanovic, Slides 55 – 56
• Atish Patra, Slides 57 – 60
• Shubhodeep Roy, Slides 61 – 63
• Boris Shingarov, Slides 64 – 65
• Christoph Schulz, Slides 66 – 67
• Wei Song, Rui Hou and Dan Meng, Slides 68 – 69
• Greg Sullivan, Slides 70 – 71
• Robert Trout, Slide 72
• Vasily Varaksin and Ekaterina Berezina, Slides 73 – 74
• Danny Ybarra, Slides 75 – 82
• Simon Davidmann and Lee Moore, Slides 83 – 87

Fast, Quantum-Resistant Secure Boot Solution for RISC-V MCUs with Off-Chip Program Stores (Derek Atkins, Slide 3)
Demo at the poster session: Mynewt RTOS on a HiFive1 development board, powered by the Walnut Digital Signature Algorithm™.
• Up to 40X faster than ECDSA at 128-bit security
• Small footprint: easily fits in the FE310's 8 KB OTP ROM
• Based on Group Theoretic Cryptography (GTC)
• GTC leverages structured groups, matrices and permutations, and arithmetic over finite fields

Run-time metrics (MCU: SiFive FE310, 32-bit RISC-V; CPU clock: 256 MHz; RTOS: Mynewt 1.3.0):

Method       Security Level   Verification Time
ECDSA P256   128-bit          179 ms
ECDSA P521   256-bit          (not supported)
WalnutDSA    128-bit          6.76 ms
WalnutDSA    256-bit          33.6 ms

What is CGEN and how does it work? (Mary Bennett, Slides 4 – 5)
A CGEN instruction definition captures an instruction's name, assembly syntax, encoding format, and semantics in a single description, for example:

```scheme
(define-insn
  (name "add")
  (comment "register add")
  (attrs base-isas)
  (syntax "add ${rd},${rs1},${rs2}")
  (format + (f-funct7 funct7) rs2 rs1 (f-funct3 funct3) rd (f-opcode opcode))
  (semantics (set DI rd (add DI rs1 rs2)))
)
```

About Syntacore (Ekaterina Berezina and Andrey Smolyarov, Slides 6 – 7)
IP company and founding member of the RISC-V Foundation; develops and licenses state-of-the-art RISC-V cores.
■ Initial line is available and shipping to customers
■ SDKs, samples in silicon, full collateral
■ Core team comes from 10+ years of highly relevant background
■ 2.5+ years of focused RISC-V development
Full service to specialize CPU IP for customer needs:
■ One-stop workload-specific customization for 10x improvements, with tools/compiler support
■ IP hardening at the required library node
■ SoC integration and SW migration
Copyright © 2018 Syntacore. All trademarks, product, and brand names belong to their respective owners.
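Returning briefly to the CGEN example above: the format and semantics fields describe how the instruction is encoded and what it does. The following is a minimal, hypothetical C sketch of that behaviour for the RV32I add instruction; it is not CGEN output (CGEN generates assembler, disassembler, and simulator support from such descriptions), and field positions simply follow the standard R-type layout.

```c
#include <stdint.h>

static uint32_t regs[32];   /* x0..x31; x0 stays zero */

/* Decode an R-type word and apply the "add" semantics: rd = rs1 + rs2. */
static void execute_add(uint32_t insn)
{
    uint32_t opcode = insn & 0x7f;           /* bits  6:0  (f-opcode) */
    uint32_t rd     = (insn >> 7)  & 0x1f;   /* bits 11:7             */
    uint32_t funct3 = (insn >> 12) & 0x07;   /* bits 14:12 (f-funct3) */
    uint32_t rs1    = (insn >> 15) & 0x1f;   /* bits 19:15            */
    uint32_t rs2    = (insn >> 20) & 0x1f;   /* bits 24:20            */
    uint32_t funct7 = (insn >> 25) & 0x7f;   /* bits 31:25 (f-funct7) */

    if (opcode == 0x33 && funct3 == 0x0 && funct7 == 0x00 && rd != 0)
        regs[rd] = regs[rs1] + regs[rs2];    /* (set rd (add rs1 rs2)) */
}
```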
SCR1 overview
Compact MCU core for deeply embedded applications:
■ Open source under the SHL license since May 2017 (an Apache 2.0 derivative with HW-specific terms); unrestricted commercial use allowed
■ RV32I|E[MC] ISA
■ <15 kGates in the basic RV32EC configuration
■ 2- to 4-stage pipeline
■ M-mode only
■ Optional configurable IPIC: 8..32 IRQs
■ Optional integrated Debug Controller, OpenOCD based
■ Verification suite
■ Documentation
■ Best-effort support provided
■ Commercial support available
https://github.com/syntacore/scr1

Performance* per MHz          -O2     -best**
Dhrystone 2.1 (DMIPS/MHz)     1.28    1.72
Coremark 1.0 (per MHz)        2.60    2.78

* Dhrystone 2.1, Coremark 1.0, GCC 7.1, benchmarks run from TCM
** -O3 -funroll-loops -fpeel-loops -fgcse-sm -fgcse-las -flto
Copyright © 2018 Syntacore. All trademarks, product, and brand names belong to their respective owners.

Diving into RISC-V LLVM (Alex Bradbury, Slide 8)
● Why RISC-V LLVM
● Current status
● Approach
● Next steps
● Supporting your own extension
● Get involved
Alex Bradbury, [email protected], @asbradbury, @lowrisc

Securing a RISC-V Core for IoT Applications with Dynamic Information Flow Tracking (DIFT) (Luca Carloni and Christian Palmiero, Slides 9 – 10)
Motivation
• Many IoT devices are prone to attacks due to low-level programming errors (e.g. buffer overflow and format string attacks)
Contributions
• Design and implementation of a low-overhead protection scheme based on DIFT to secure RISC-V cores for IoT applications
• Demonstration with an FPGA-based prototype
How DIFT works (tag initialization at the I/O channels, propagation through the vulnerable program, detection, and a security exception):
• Step 1. Each data item coming from potentially malicious channels is extended with a tag that marks it as spurious
• Step 2. During program execution, the RISC-V core performs tag propagation to keep track of information flows generated by spurious data
• Step 3. By tag checking, the RISC-V core detects if spurious data are used in an unsafe manner and generates a security exception
C. Palmiero, G. Di Guglielmo, L. Lavagno, and L. Carloni

Design, Implementation, and Evaluation of the DIFT-Enhanced RISC-V Processor Core
[Block diagram of the D-RI5CY pipeline: tag-extended PC, register file, and data memory; tag check, tag update, and tag propagation logic alongside the decoder, ALU, MULT/DIV, FPU, and load/store unit across the IF, ID, EX, and WB stages; tag propagation/control CSRs (TPR, TCR); off-chip instruction and data memories behind the instruction and data caches.]
• Design and implementation of a DIFT architecture for an optimized 4-stage, in-order, 32-bit RISC-V core based on the PULPino platform
• The D-RI5CY architecture supports various software-programmable security policies; it was evaluated with policies for memory protection
• The experimental results demonstrate that securing a RISC-V core with DIFT is feasible, does not incur any run-time overhead, and requires negligible resources
C. Palmiero, G. Di Guglielmo, L. Lavagno, and L. Carloni
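As an illustration of Steps 2 and 3 above, here is a minimal, hypothetical C sketch of DIFT-style tag propagation and checking for a few instruction classes. It is not the D-RI5CY implementation, which does this in hardware under software-programmable policy registers; the one-bit tag granularity and all names below are assumptions made for clarity.

```c
#include <stdint.h>
#include <stdbool.h>

/* One spuriousness tag per register and per memory word (assumed granularity). */
static uint32_t regs[32];
static bool     reg_tag[32];
static uint32_t mem[1024];
static bool     mem_tag[1024];

static void security_exception(const char *why) { (void)why; /* trap to a handler */ }

/* Step 2: tag propagation. The result of an ALU op is spurious if any source is. */
static void exec_add(int rd, int rs1, int rs2)
{
    regs[rd]    = regs[rs1] + regs[rs2];
    reg_tag[rd] = reg_tag[rs1] || reg_tag[rs2];
}

/* Loads copy the tag of the addressed memory word into the destination register. */
static void exec_load(int rd, int rs1, int32_t off)
{
    uint32_t idx = ((regs[rs1] + (uint32_t)off) / 4u) & 1023u;
    regs[rd]    = mem[idx];
    reg_tag[rd] = mem_tag[idx];
}

/* Step 3: tag checking. Using a spurious value where the policy forbids it
 * (here: as an indirect jump target) raises a security exception. */
static void exec_jalr(int rs1)
{
    if (reg_tag[rs1])
        security_exception("indirect jump through spurious data");
    /* ...the control transfer itself is omitted in this sketch... */
}
```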
Ultra-Low-Power Cluster I$ Exploration with RI5CY (Jie Chen, Slides 11 – 13)
[Block diagram of the cluster domain (cluster clock and VDD, dual-clock FIFOs, DMA, 16 TCDM banks, eight RI5CY cores with 128-bit L0 prefetch buffers, logarithmic interconnects, and AXI4 cluster, peripheral, and instruction buses) comparing instruction-cache organizations: a single-ported shared I$, a multi-port shared read-only I$ built from dual-port SCM tag/data ways, and private per-core I$.]

Two-Level Icache & Results
[Block diagrams of the multi-port shared instruction cache and the two-level relaxed-timing instruction cache: per-core 128-bit L0 prefetch buffers, private L1 I$ per core with its own cache controller, a shared L1.5 level with a master cache controller, read-only logarithmic interconnects, and a 64-bit AXI4 instruction bus.]

Comparison of throughput, area, and power efficiency among the SP, MP, and two-level I$ types (1.00 marks the best of the three in each row):

Characteristic     SP      MP      Two-Level
Throughput [1]     -5%     1.00    -15%
Area               1.00    +50%    +40%
Power efficiency   -25%    -50%    1.00

[1] For throughput, the value of the two-level I$ is the worst case. All values in the table are averages, except [1]. For more details and data, please see the poster.

REFERENCES
[1] I. Loi, A. Capotondi, D. Rossi, A. Marongiu and L. Benini, "The Quest for Energy-Efficient I$ Design in Ultra-Low-Power Clustered Many-Cores," IEEE Transactions on Multi-Scale Computing Systems, vol. PP, no. 99, pp. 1-1.
[2] L. Benini, E. Flamand, D. Fuin, and D. Melpignano, "P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator," in 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2012, pp. 983-987.
[3] A. Teman, D. Rossi, P. Meinerzhagen, L. Benini, and A. Burg, "Power, Area, and Performance Optimization of Standard Cell Memory Arrays Through Controlled Placement," ACM Trans. Des. Autom. Electron. Syst., vol. 21, no. 4, pp. 59:1-59:25, May 2016. [Online]. Available: http://doi.acm.org/10.1145/2890498

Evaluation of RISC-V for Pixel Visual Core (Matt Cockrell, Slides 14 – 26)
Matt Cockrell ([email protected]), May 9th, 2018

Overview
● Use case
● Core selection
● Integration

Background: Pixel Visual Core
"Pixel Visual Core is the Google-designed Image Processing Unit (IPU)—a fully programmable, domain-specific processor designed from scratch to deliver maximum performance at low power." [1]
Critical point: a dedicated A53 (top left in the magnified image of Pixel Visual Core) aggregates application-layer IPU resource requests and configures the IPU appropriately.
[1] "Pixel Visual Core: image processing and machine learning on Pixel 2", Oct 17, 2017, Ofer Shacham, Google Inc.

Pixel Visual Core today
A dedicated A-class CPU, alongside the main CPU, runs the OS and supports IPU I/O.

Modern SoC - multiple accelerators
A modern mobile SoC contains multiple devices (X, Y, Z processing units), so the main CPU is no longer dedicated only to the IPU ⇒ a local controller is desired. Add a microcontroller as the job scheduling and dispatch unit. Open-source IP is an interesting option for this microcontroller.

Core Selection Considerations
● Level of Effort: how difficult it would be to work with and integrate
● Risk: stability and reliability of support
● License: flexibility of use

Candidate 1: Bottle Rocket (https://github.com/google/bottlerocket)
● Internal project to demonstrate the ability to easily develop a custom RISC-V implementation by leveraging Rocket Chip
● Implements RV32IMC
● Represents evaluating Rocket Chip as an option.
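The slides above argue for a small controller core that accepts IPU work requests and dispatches them to the accelerator. As a purely illustrative, hypothetical C sketch (not Google's firmware; the register map and all names are invented), such a scheduling and dispatch loop might look like this:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical memory-mapped accelerator interface (addresses and bits invented). */
#define ACCEL_BASE   0x40000000u
#define ACCEL_CMD    (*(volatile uint32_t *)(ACCEL_BASE + 0x00u))
#define ACCEL_ARG    (*(volatile uint32_t *)(ACCEL_BASE + 0x04u))
#define ACCEL_STATUS (*(volatile uint32_t *)(ACCEL_BASE + 0x08u))
#define ACCEL_BUSY   0x1u

typedef struct {
    uint32_t cmd;   /* which kernel/usecase to run      */
    uint32_t arg;   /* e.g. address of a job descriptor */
} job_t;

/* Simple ring buffer; a host-facing mailbox interrupt handler would fill it. */
#define QLEN 16u
static job_t queue[QLEN];
static volatile uint32_t head, tail;

static bool pop_job(job_t *out)
{
    if (head == tail)
        return false;                 /* queue empty */
    *out = queue[tail % QLEN];
    tail++;
    return true;
}

void scheduler_loop(void)
{
    job_t job;
    for (;;) {
        if (!pop_job(&job))
            continue;                 /* could WFI here instead of spinning */
        while (ACCEL_STATUS & ACCEL_BUSY)
            ;                         /* wait for the accelerator to go idle */
        ACCEL_ARG = job.arg;          /* program the job parameters */
        ACCEL_CMD = job.cmd;          /* kick off execution */
    }
}
```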
Recommended publications
  • Lecture 26: Domain Specific Architectures, Chapter 07, CAQA 6th Edition
    Lecture 26: Domain Specific Architectures, Chapter 07, CAQA 6th Edition. CSCE 513 Computer Architecture, Department of Computer Science and Engineering, Yonghong Yan ([email protected]), https://passlab.github.io/CSCE513
    Copyright and Acknowledgements: Copyright © 2019, Elsevier Inc. All rights reserved (textbook slides); "Machine Learning for Science" (2018) and "A Superfacility Model for Science" (2017) by Kathy Yelick, https://people.eecs.berkeley.edu/~yelick/talks.html
    CSE 564 Class Contents:
    § Introduction to Computer Architecture (CA)
    § Quantitative Analysis, Trend and Performance of CA (Chapter 1)
    § Instruction Set Principles and Examples (Appendix A)
    § Pipelining and Implementation, RISC-V ISA and Implementation (Appendix C, RISC-V (riscv.org) and UCB RISC-V implementation)
    § Memory System: Technology, Cache Organization and Optimization, Virtual Memory (Appendix B and Chapter 2; midterm covered up to memory technology and cache organization)
    § Instruction Level Parallelism: Dynamic Scheduling, Branch Prediction, Hardware Speculation, Superscalar, VLIW and SMT (Chapter 3)
    § Data Level Parallelism: Vector, SIMD, and GPU (Chapter 4)
    § Thread Level Parallelism (Chapter 5)
    § Domain-Specific Architecture (Chapter 7)
    [Chart: "The Moore's Law Trend", Superscalar/Vector/Parallel: peak performance growth from early machines (CDC 7600, IBM 360/195) through GFlop/s vector and superscalar systems (Cray 1, Cray X-MP, Cray 2), TFlop/s parallel systems (TMC CM-2/CM-5, Cray T3D, ASCI Red, ASCI White Pacific), up to 1 PFlop/s (IBM BG/L); "2X transistors/chip every 1.5 years"; y-axis in floating-point operations per second.]
  • P1360R0: Towards Machine Learning for C++: Study Group 19
    P1360R0: Towards Machine Learning for C++: Study Group 19. Date: 2018-11-26 (post-SAN mailing), 10 AM ET. Project: ISO JTC1/SC22/WG21: Programming Language C++. Audience: SG19, WG21. Authors: Michael Wong (Codeplay), Vincent Reverdy (University of Illinois at Urbana-Champaign, Paris Observatory), Robert Douglas (Epsilon), Emad Barsoum (Microsoft), Sarthak Pati (University of Pennsylvania), Peter Goldsborough (Facebook), Frank Seide (MS). Contributors' emails: [email protected]. Reply to: [email protected]
    Contents: Introduction; Motivation; Scope; Meeting frequency and means; Outreach to ML/AI/Data Science community; Liaison with other groups; Future meetings; Conclusion; Acknowledgements; References.
    Introduction: This paper proposes a WG21 SG for Machine Learning with the goal of making Machine Learning a first-class citizen in ISO C++. It is the collaboration of a number of key industry, academic, and research groups, through several connections in the CPPCON BoF [reference], LLVM 2018 discussions, and the C++ San Diego meeting. The intention is to support such an SG and describe its scope, in terms of potential work resulting in papers submitted for future C++ Standards, or collaboration with other SGs. We will also propose ongoing teleconferences, meeting frequency and locations, as well as outreach to ML data scientists and conferences, and liaison with other Machine Learning groups such as at Khronos and ISO. As of the SAN meeting, this group has been officially created as SG19, and we will begin teleconferences immediately, after the US Thanksgiving and after NIPS.
  • SIMD in Different Processors
    SIMD Introduction: Application, Hardware & Software. Champ Yen (嚴梓鴻), [email protected], http://champyen.blogspot.tw
    Agenda: What & Why SIMD; SIMD in different Processors; SIMD for Software Optimization; What are important in SIMD?; Q & A. Link to these slides: https://goo.gl/Rc8xPE
    What is SIMD (Single Instruction Multiple Data)? A scalar loop processes one lane (one point) per iteration, while the SIMD version processes 8 points simultaneously:

```c
/* scalar: process 1 point per iteration */
for (y = 0; y < height; y++) {
    for (x = 0; x < width; x++) {
        out[x] = a[x] + b[x];
    }
    a += width; b += width; out += width;
}

/* ARM NEON (requires <arm_neon.h>): process 8 points simultaneously */
for (y = 0; y < height; y++) {
    for (x = 0; x < width; x += 8) {
        uint16x8_t va, vb, vout;
        va = vld1q_u16(a + x);
        vb = vld1q_u16(b + x);
        vout = vaddq_u16(va, vb);
        vst1q_u16(out + x, vout);
    }
    a += width; b += width; out += width;
}
```

    Why and how do we use SIMD? Image processing, gaming, scientific computing, deep neural networks.
    SIMD in different processors - CPU: x86 (MMX, SSE, AVX, AVX-512); ARM application profile (v5 DSP Extension, v6 SIMD, v7 NEON, v8 Advanced SIMD (NEON), SVE).
    SIMD in different processors - GPU: SIMD (AMD GCN, ARM Mali); WaveFront (Nvidia, Imagination PowerVR Rogue).
    SIMD in different processors - DSP: Qualcomm Hexagon 600 HVX, Cadence IVP P5, Synopsys EV6x, CEVA XM4.
    SIMD optimization sits at the intersection of SIMD hardware, software design, and the programming model/framework: auto/semi-auto methods, compiler intrinsics, and specific frameworks/infrastructure.
    https://software.intel.com/sites/landingpage/IntrinsicsGuide/
    http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
  • An Analysis and Implementation of the HDR+ Burst Denoising Method
    Published in Image Processing On Line on 2021-05-25. Submitted on 2021-03-01, accepted on 2021-05-05. ISSN 2105-1232, © 2021 IPOL & the authors, CC-BY-NC-SA. This article is available online with supplementary materials, software, datasets and online demo at https://doi.org/10.5201/ipol.2021.336
    An Analysis and Implementation of the HDR+ Burst Denoising Method
    Antoine Monod (1,2), Julie Delon (1), Thomas Veit (2); (1) MAP5, Université de Paris, France ({antoine.monod, julie.delon}@u-paris.fr); (2) GoPro Technology, France ([email protected])
    Abstract: HDR+ is an image processing pipeline presented by Google in 2016. At its core lies a denoising algorithm that uses a burst of raw images to produce a single higher quality image. Since it is designed as a versatile solution for smartphone cameras, it does not necessarily aim for the maximization of standard denoising metrics, but rather for the production of natural, visually pleasing images. In this article, we specifically discuss and analyze the HDR+ burst denoising algorithm architecture and the impact of its various parameters. With this publication, we provide an open source Python implementation of the algorithm, along with an interactive demo.
    Source Code: The Python implementation of HDR+ has been peer-reviewed and accepted by IPOL. The source code, its documentation, and the online demo are available from the web page of this article. Compilation and usage instructions are included in the README.txt file of the archive. The code is also available on GitHub.
    Keywords: burst denoising; computational photography; motion estimation; temporal filtering; high dynamic range; image signal processor
    Introduction: Image noise is a common issue in digital photography, and that issue is even more prominent in smaller sensors used in devices like smartphones or action cameras.
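    The abstract above centres on combining a burst of noisy raw frames into one cleaner image. As a deliberately simplified, hypothetical C sketch (not the HDR+ merge, which operates on aligned tiles in the frequency domain with motion-robust weights), plain temporal averaging of already-aligned frames illustrates the principle:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy temporal filter: average N already-aligned raw frames pixel by pixel.
 * For independent noise, averaging N frames reduces the noise standard
 * deviation by roughly sqrt(N); HDR+ refines this with tile-wise alignment
 * and frequency-domain merging, which is not modeled here. */
void merge_burst_average(const uint16_t *const *frames, size_t n_frames,
                         size_t n_pixels, uint16_t *out)
{
    for (size_t p = 0; p < n_pixels; p++) {
        uint32_t acc = 0;
        for (size_t f = 0; f < n_frames; f++)
            acc += frames[f][p];
        out[p] = (uint16_t)(acc / n_frames);
    }
}
```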
  • Gables: A Roofline Model for Mobile SoCs
    2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). Gables: A Roofline Model for Mobile SoCs. Mark D. Hill (Computer Sciences Department, University of Wisconsin—Madison, [email protected]) and Vijay Janapa Reddi (School of Engineering and Applied Sciences, Harvard University, [email protected]).
    Abstract—Over a billion mobile consumer system-on-chip (SoC) chipsets ship each year. Of these, the mobile consumer market undoubtedly involving smartphones has a significant market share. Most modern smartphones comprise of advanced SoC architectures that are made up of multiple cores, GPS, and many different programmable and fixed-function accelerators connected via a complex hierarchy of interconnects with the goal of running a dozen or more critical software usecases under strict power, thermal and energy constraints. The steadily growing complexity of a modern SoC challenges hardware computer architects on how best to do early stage ideation. Late SoC design typically relies on detailed full-system simulation once the hardware is specified and accelerator software is written or ported.
    From the introduction: "…systems with multiple cores, GPUs, and many accelerators—often called intellectual property (IP) blocks, driven by the need for performance. These cores and IPs interact via rich interconnection networks, caches, coherence, 64-bit address spaces, virtual memory, and virtualization. Therefore, consumer SoCs deserve the architecture community's attention. Consumer SoCs have long thrived on tight integration and extreme heterogeneity, driven by the need for high performance in severely constrained battery and thermal power envelopes, and all-day battery life. A typical mobile SoC includes a camera image signal processor (ISP) for high-frame-…"
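    Gables builds on the classic roofline model, in which attainable throughput is bounded by the smaller of peak compute and memory bandwidth times operational intensity. A minimal C sketch of that single-processor bound (illustrative numbers only; Gables itself extends this to multiple IP blocks sharing bandwidth, which is not modeled here):

```c
#include <stdio.h>

/* Classic single-device roofline: attainable GFLOP/s is capped either by the
 * device's peak compute or by how fast memory can feed it. */
static double roofline_gflops(double peak_gflops,      /* peak compute       */
                              double peak_bw_gbs,      /* memory BW in GB/s  */
                              double intensity_flopb)  /* FLOPs per byte     */
{
    double bw_bound = peak_bw_gbs * intensity_flopb;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void)
{
    /* Illustrative numbers only (not taken from the Gables paper). */
    double peak = 100.0, bw = 25.0;
    for (double i = 0.25; i <= 16.0; i *= 2.0)
        printf("intensity %5.2f FLOP/B -> %6.1f GFLOP/s attainable\n",
               i, roofline_gflops(peak, bw, i));
    return 0;
}
```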
  • Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision
    Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision. Yuhao Zhu (University of Rochester), Anand Samajdar (Georgia Institute of Technology), Matthew Mattina (ARM Research), Paul Whatmough (ARM Research).
    Abstract: Continuous computer vision (CV) tasks increasingly rely on convolutional neural networks (CNN). However, CNNs have massive compute demands that far exceed the performance and energy constraints of mobile devices. In this paper, we propose and develop an algorithm-architecture co-designed system, Euphrates, that simultaneously improves the energy-efficiency and performance of continuous vision tasks. Our key observation is that changes in pixel data between consecutive frames represents visual motion. We first propose an algorithm that leverages this motion information to relax the number of expensive CNN inferences required by continuous vision applications. We co-design a mobile System-on-a-Chip (SoC) architecture to maximize the efficiency of the new algorithm. The key to our architectural augmentation is to co-optimize different SoC IP blocks in the vision pipeline collectively. Specifically, we propose to expose the motion data that is naturally generated by the Image Signal Processor (ISP) …
    [Fig. 1 (hand-crafted detectors such as Haar and HOG versus scaled-down and state-of-the-art CNN detectors such as Tiny YOLO, SSD, YOLO, and Faster R-CNN, plotted against the compute capability at a 1 W power budget): Accuracy and compute requirement (TOPS) comparison between object detection techniques. Accuracies are measured against the widely-used PASCAL VOC 2007 dataset [32], and TOPS is based on the 480p (640×480) resolution common in smartphone cameras.] …requirements measured in Tera Operations Per Second (TOPS) as well as accuracies between different detectors under 60 frames per second (FPS).
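    The key idea in the abstract, using inter-frame motion to skip full CNN inferences, can be sketched in C as below. This is not the Euphrates architecture (which does this with dedicated SoC hardware and ISP-supplied motion vectors); the frame fields, the stub detector, and the fixed re-detection interval are assumptions for illustration.

```c
#include <stdbool.h>

typedef struct { float x, y, w, h; } box_t;
typedef struct {
    const unsigned char *pixels;  /* frame data (unused by this sketch)        */
    float mv_dx, mv_dy;           /* ISP-style motion estimate for the region  */
} frame_t;

/* Stub standing in for an expensive CNN inference call. */
static box_t run_cnn_detector(const frame_t *f)
{
    (void)f;
    box_t b = { 0.0f, 0.0f, 10.0f, 10.0f };
    return b;
}

/* Run the CNN only every `interval` frames; in between, extrapolate the last
 * detection using the motion estimate the imaging pipeline already computes. */
box_t track_object(const frame_t *f, unsigned frame_idx, unsigned interval)
{
    static box_t last;
    static bool have_last = false;

    if (!have_last || (frame_idx % interval) == 0u) {
        last = run_cnn_detector(f);   /* expensive path */
        have_last = true;
    } else {
        last.x += f->mv_dx;           /* cheap path: shift the box by the motion */
        last.y += f->mv_dy;
    }
    return last;
}
```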
  • Concepts Introduced in Chapter 7; Reasons for Adoption of Domain Specific Architectures
    Concepts Introduced in Chapter 7:
    • guidelines for domain specific architectures (DSAs)
    • Google's Tensor Processing Unit
    • Microsoft's Catapult
    • Google's Pixel Visual Core
    • crosscutting issues
    Reasons for Adoption of Domain Specific Architectures: The end of Dennard scaling and the slowing of Moore's Law mean we need to lower the energy per operation to improve performance. If order-of-magnitude performance improvements are to be obtained, we need to increase the number of arithmetic operations per instruction from one to hundreds. Conventional architecture innovations may not be a good match to some domains. Many believe that future computers will consist of standard processors that can run conventional programs along with domain-specific processors that can very efficiently perform only a narrow range of tasks.
    Factors That Can Make a DSA Cost Effective: Find a domain whose demand for applications is for enough chips to justify the nonrecurring engineering (NRE) costs. DSAs are best applied for small compute-intensive kernels of larger systems, where a significant fraction of the computation is done for some applications. This means that future computers may be much more heterogeneous than current homogeneous multicore chips.
    DSA Guidelines: Use dedicated memories to minimize the distance over which data is moved. Multi-level caches are expensive in area and energy for moving data. Many domains have predictable memory access patterns with little data reuse. DSA programmers understand their domain, so data movement can be more efficient with software-controlled memories.
  • CloudLeak: DNN Model Extractions from Commercial MLaaS Platforms
    CloudLeak: DNN Model Extractions from Commercial MLaaS Platforms. Yier Jin, Honggang Yu, and Tsung-Yi Ho; University of Florida, National Tsing Hua University. [email protected], #BHUSA @BLACKHATEVENTS
    Who We Are: Yier Jin, PhD (@jinyier), Associate Professor and Endowed IoT Term Professor, The Warren B. Nelms Institute for the Connected World, Department of Electrical and Computer Engineering, University of Florida. Research interests: hardware and circuit security; Internet of Things (IoT) and cyber-physical system (CPS) design; functional programming and trusted IP cores; autonomous system security and resilience; trusted and resilient high-performance computing.
    Tsung-Yi Ho, PhD (@TsungYiHo1), Professor, Department of Computer Science, National Tsing Hua University; Program Director, AI Innovation Program, Ministry of Science and Technology, Taiwan. Research interests: hardware and circuit security; trustworthy AI; design automation for emerging technologies.
    Honggang Yu (@yuhonggang), Visiting PhD student, The Warren B. Nelms Institute for the Connected World, Department of Electrical and Computer Engineering, University of Florida. Research interests: intersection of security, privacy, and machine learning; embedded systems security; Internet of Things (IoT) security; machine learning and deep learning with applications in VLSI computer-aided design (CAD).
    Outline: Background and motivation (AI interface APIs in the cloud; existing attacks and defenses); adversarial-example-based model stealing (adversarial examples …)
  • COEN-4730 Computer Architecture, Lecture 12: Domain Specific Architectures (DSA), Chapter 7
    COEN-4730 Computer Architecture, Lecture 12: Domain Specific Architectures (DSA), Chapter 7. Cristinel Ababei, Dept. of Electrical and Computer Engineering, Marquette University.
    Introduction: We need a factor of 100 improvement in the number of operations per instruction, which requires domain specific architectures (DSAs). For ASICs, NRE cannot be amortized over large volumes, and FPGAs are less efficient than ASICs.
    • Video: https://www.acm.org/hennessy-patterson-turing-lecture
    • Paper: https://cacm.acm.org/magazines/2019/2/234352-a-new-golden-age-for-computer-architecture/fulltext
    • Slides: https://iscaconf.org/isca2018/docs/HennessyPattersonTuringLectureISCA4June2018.pdf
    Machine Learning Domain; Artificial Intelligence, Machine Learning and Deep Learning.
    Example Domain: Deep Neural Networks. Three of the most important DNNs: 1. Multi-Layer Perceptrons, 2. Convolutional Neural Networks, 3. Recurrent Neural Networks.
    Convolutional Neural Network parameters:
    ◼ DimFM[i-1]: dimension of the (square) input feature map
    ◼ DimFM[i]: dimension of the (square) output feature map
    ◼ DimSten[i]: dimension of the (square) stencil
    ◼ NumFM[i-1]: number of input feature maps
    ◼ NumFM[i]: number of output feature maps
    ◼ Number of neurons: NumFM[i] × DimFM[i]²
    ◼ Number of weights per output feature map: NumFM[i-1] × DimSten[i]²
    ◼ Total number of weights per layer: NumFM[i] × …
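    The last item above is cut off in this extract; the usual convolutional-layer count (number of output feature maps times the weights per output feature map) is used below as an assumption. A small C sketch of these layer-size calculations:

```c
#include <stdio.h>

/* Per-layer sizes for a square convolutional layer, following the slide's
 * notation. The total-weights formula is the usual conv-layer arithmetic
 * (the slide text is truncated at that point). Biases are ignored. */
struct conv_layer { long num_fm_in, num_fm_out, dim_fm_out, dim_sten; };

static long neurons(const struct conv_layer *l)
{
    return l->num_fm_out * l->dim_fm_out * l->dim_fm_out;
}

static long weights_per_output_fm(const struct conv_layer *l)
{
    return l->num_fm_in * l->dim_sten * l->dim_sten;
}

static long total_weights(const struct conv_layer *l)
{
    return l->num_fm_out * weights_per_output_fm(l);
}

int main(void)
{
    /* Illustrative numbers only: 64 -> 128 feature maps, 56x56 output, 3x3 stencil. */
    struct conv_layer l = { 64, 128, 56, 3 };
    printf("neurons: %ld, weights/output FM: %ld, total weights: %ld\n",
           neurons(&l), weights_per_output_fm(&l), total_weights(&l));
    return 0;
}
```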
  • Review & Background
    Spring 2018 :: CSE 502, Review & Background, Nima Honarmand.
    Measuring & Reporting Performance. Performance Metrics:
    • Latency (execution/response time): time to finish one task
    • Throughput (bandwidth): number of tasks finished per unit of time. Throughput can exploit parallelism, latency can't; sometimes complementary, often contradictory.
    • Example: move people from A to B, 10 miles. Car: capacity = 5, speed = 60 miles/hour. Bus: capacity = 60, speed = 20 miles/hour. Latency: car = 10 min, bus = 30 min. Throughput: car = 15 PPH (w/ return trip), bus = 60 PPH. Pick the right metric for your goals.
    Performance Comparison:
    • "Processor A is X times faster than processor B" if Latency(P, A) = Latency(P, B) / X, or Throughput(P, A) = Throughput(P, B) * X
    • "Processor A is X% faster than processor B" if Latency(P, A) = Latency(P, B) / (1 + X/100), or Throughput(P, A) = Throughput(P, B) * (1 + X/100)
    • Car/bus example. Latency? Car is 3 times (200%) faster than bus. Throughput? Bus is 4 times (300%) faster than car.
    Latency/throughput of what program? Very difficult question! Best case: you always run the same set of programs, so just measure their execution time; too idealistic. Use benchmarks: representative programs chosen to measure performance, which (hopefully) predict the performance of the actual workload. Prone to benchmarketing: "The misleading use of unrepresentative benchmark software results in marketing a computer system" (wiktionary.com).
    Types of Benchmarks:
    • Real programs. Example: CAD, text processing, business apps, scientific apps. Need to know program inputs and options (not just code); may not know what programs users will run; require a lot of effort to port.
    • Kernels. Small key pieces (inner loops) of scientific programs where the program spends most of its time. Example: Livermore loops, LINPACK.
    • Toy benchmarks. e.g.
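    As a quick check of the speedup definitions above, here is a tiny C program applying them to the car/bus numbers from the slide (the helper names are just for illustration):

```c
#include <stdio.h>

/* "A is X times faster than B" for a latency metric: X = latency_B / latency_A. */
static double times_faster_latency(double lat_a, double lat_b) { return lat_b / lat_a; }

/* For a throughput metric: X = throughput_A / throughput_B. */
static double times_faster_throughput(double thr_a, double thr_b) { return thr_a / thr_b; }

int main(void)
{
    /* Car vs. bus, from the slide: latency 10 min vs. 30 min,
     * throughput 15 PPH vs. 60 PPH. */
    double x_lat = times_faster_latency(10.0, 30.0);      /* car vs. bus */
    double x_thr = times_faster_throughput(60.0, 15.0);   /* bus vs. car */

    printf("Latency:    car is %.0fx (%.0f%%) faster than bus\n", x_lat, (x_lat - 1) * 100);
    printf("Throughput: bus is %.0fx (%.0f%%) faster than car\n", x_thr, (x_thr - 1) * 100);
    return 0;
}
```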
  • MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect
    MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect. Matheus Cavalcante (ETH Zürich), Samuel Riedel (ETH Zürich), Antonio Pullini (GreenWaves Technologies, Grenoble), Luca Benini (ETH Zürich and Università di Bologna). Contacts: matheusd at iis.ee.ethz.ch, sriedel at iis.ee.ethz.ch, pullinia at iis.ee.ethz.ch, lbenini at iis.ee.ethz.ch
    Abstract—A key challenge in scaling shared-L1 multi-core clusters towards many-core (more than 16 cores) configurations is to ensure low-latency and efficient access to the L1 memory. In this work we demonstrate that it is possible to scale up the shared-L1 architecture: We present MemPool, a 32-bit many-core system with 256 fast RV32IMA "Snitch" cores featuring application-tunable execution units, running at 700 MHz in typical conditions (TT/0.80 V/25 °C). MemPool is easy to program, with all the cores sharing a global view of a large L1 scratchpad memory pool, accessible within at most 5 cycles. In MemPool's physical-aware design, we emphasized the exploration, design, and optimization of the low-latency processor-to-L1-memory interconnect.
    From the introduction: "…domains, from the streaming processors of Graphics Processing Units (GPUs) [8], to the ultra-low-power domain with GreenWaves' GAP8 processor [9], to high-performance embedded signal processing with the Kalray processor clusters [10], to aerospace applications with the Ramon Chips' RC64 system [11]. However, as we will detail in Section II, these clusters only scale to low tens of cores and suffer from long memory access latency due to the topology of their interconnects. In this paper, we set out to design and optimize the first scaled-up many-core system with shared low-latency L1 memory."
  • Huawei Smartphones Without a Future: Chip Supplies and the Android License Halted
    No. 200 / '19, 20 May 2019, MAGAZINE.
    Front page: Sony leaves digital terrestrial TV, goodbye to Cine Sony and Pop (p. 02); Sound United also buys Onkyo and Pioneer (p. 04); At Dolby's headquarters with Panasonic: Vision and Atmos for everyone (p. 16); Windows 10: Microsoft says goodbye to universal apps (p. 20).
    Huawei smartphones without a future: chip supplies and the Android license halted. It's official: after Trump's ban on Huawei in the USA, Google revokes the Android license from the Chinese company. Relations with Intel and Qualcomm have also been cut off (p. 02).
    Also in this issue: OnePlus 7 Pro is official, lights and shadows in our preview; well finished, solid and with a pop-up front camera, the long-awaited OnePlus flagship also has some unconvincing aspects, our analysis (p. 23). DJI Osmo Action: the GoPro rival that looks like a GoPro (p. 07). Apple TV, a half-finished revolution, tested on Samsung TVs: completely redesigned, but in Italy the app is not what Apple had promised; we tried it on the latest Samsung TV models together with AirPlay (p. 14). Renault Zoe: 10 years of the most popular electric car.
    Reviewed in this issue: Samsung QLED Q90R, not "the usual" LCD (p. 25); Sony 75" XG95, cinema effect (p. 28); Google Pixel 3a, our preview (p. 30); Honor 20 Lite, a best buy under the lens (p. 31).
    MARKET: After Trump's move, Google has decided to revoke Huawei's Android license.
    Google pulls Huawei's Android license; relations also cut off by Intel and Qualcomm. Huawei will now have to use the open-source version of the operating system, without the Google apps. By Roberto Pezzali. The news arriving from the States is hard to believe: following Trump's ban on Huawei in the USA, Google…
    Champions League rights: Rai takes Sky to court. As expected, Rai has decided to hand its lawyers Sky's decision not to renew the option for the free-to-air broadcast of the Wednesday Champions League match for the next two seasons.