SGX-Bomb: Locking Down the Processor Via Rowhammer Attack
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Microkernel Mechanisms for Improving the Trustworthiness of Commodity Hardware
Microkernel Mechanisms for Improving the Trustworthiness of Commodity Hardware Yanyan Shen Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy School of Computer Science and Engineering Faculty of Engineering March 2019 Thesis/Dissertation Sheet Surname/Family Name : Shen Given Name/s : Yanyan Abbreviation for degree as give in the University calendar : PhD Faculty : Faculty of Engineering School : School of Computer Science and Engineering Microkernel Mechanisms for Improving the Trustworthiness of Commodity Thesis Title : Hardware Abstract 350 words maximum: (PLEASE TYPE) The thesis presents microkernel-based software-implemented mechanisms for improving the trustworthiness of computer systems based on commercial off-the-shelf (COTS) hardware that can malfunction when the hardware is impacted by transient hardware faults. The hardware anomalies, if undetected, can cause data corruptions, system crashes, and security vulnerabilities, significantly undermining system dependability. Specifically, we adopt the single event upset (SEU) fault model and address transient CPU or memory faults. We take advantage of the functional correctness and isolation guarantee provided by the formally verified seL4 microkernel and hardware redundancy provided by multicore processors, design the redundant co-execution (RCoE) architecture that replicates a whole software system (including the microkernel) onto different CPU cores, and implement two variants, loosely-coupled redundant co-execution (LC-RCoE) and closely-coupled redundant co-execution (CC-RCoE), for the ARM and x86 architectures. RCoE treats each replica of the software system as a state machine and ensures that the replicas start from the same initial state, observe consistent inputs, perform equivalent state transitions, and thus produce consistent outputs during error-free executions. -
SMASH: Synchronized Many-Sided Rowhammer Attacks from Javascript
SMASH: Synchronized Many-sided Rowhammer Attacks from JavaScript Finn de Ridder Pietro Frigo Emanuele Vannacci Herbert Bos ETH Zurich VU Amsterdam VU Amsterdam VU Amsterdam VU Amsterdam Cristiano Giuffrida Kaveh Razavi VU Amsterdam ETH Zurich Abstract has instead been lagging behind. In particular, rather than Despite their in-DRAM Target Row Refresh (TRR) mitiga- addressing the root cause of the Rowhammer bug, DDR4 in- tions, some of the most recent DDR4 modules are still vul- troduced a mitigation known as Target Row Refresh (TRR). nerable to many-sided Rowhammer bit flips. While these TRR, however, has already been shown to be ineffective bit flips are exploitable from native code, triggering them against many-sided Rowhammer attacks from native code, in the browser from JavaScript faces three nontrivial chal- at least in its current in-DRAM form [12]. However, it is lenges. First, given the lack of cache flushing instructions in unclear whether TRR’s failing also re-exposes end users to JavaScript, existing eviction-based Rowhammer attacks are TRR-aware, JavaScript-based Rowhammer attacks from the already slow for the older single- or double-sided variants browser. In fact, existing many-sided Rowhammer attacks and thus not always effective. With many-sided Rowhammer, require frequent cache flushes, large physically-contiguous mounting effective attacks is even more challenging, as it re- regions, and certain access patterns to bypass in-DRAM TRR, quires the eviction of many different aggressor addresses from all challenging in JavaScript. the CPU caches. Second, the most effective many-sided vari- In this paper, we show that under realistic assumptions, it ants, known as n-sided, require large physically-contiguous is indeed possible to bypass TRR directly from JavaScript, memory regions which are not available in JavaScript. -
UNIT 8B a Full Adder
UNIT 8B Computer Organization: Levels of Abstraction 15110 Principles of Computing, 1 Carnegie Mellon University - CORTINA A Full Adder C ABCin Cout S in 0 0 0 A 0 0 1 0 1 0 B 0 1 1 1 0 0 1 0 1 C S out 1 1 0 1 1 1 15110 Principles of Computing, 2 Carnegie Mellon University - CORTINA 1 A Full Adder C ABCin Cout S in 0 0 0 0 0 A 0 0 1 0 1 0 1 0 0 1 B 0 1 1 1 0 1 0 0 0 1 1 0 1 1 0 C S out 1 1 0 1 0 1 1 1 1 1 ⊕ ⊕ S = A B Cin ⊕ ∧ ∨ ∧ Cout = ((A B) C) (A B) 15110 Principles of Computing, 3 Carnegie Mellon University - CORTINA Full Adder (FA) AB 1-bit Cout Full Cin Adder S 15110 Principles of Computing, 4 Carnegie Mellon University - CORTINA 2 Another Full Adder (FA) http://students.cs.tamu.edu/wanglei/csce350/handout/lab6.html AB 1-bit Cout Full Cin Adder S 15110 Principles of Computing, 5 Carnegie Mellon University - CORTINA 8-bit Full Adder A7 B7 A2 B2 A1 B1 A0 B0 1-bit 1-bit 1-bit 1-bit ... Cout Full Full Full Full Cin Adder Adder Adder Adder S7 S2 S1 S0 AB 8 ⁄ ⁄ 8 C 8-bit C out FA in ⁄ 8 S 15110 Principles of Computing, 6 Carnegie Mellon University - CORTINA 3 Multiplexer (MUX) • A multiplexer chooses between a set of inputs. D1 D 2 MUX F D3 D ABF 4 0 0 D1 AB 0 1 D2 1 0 D3 1 1 D4 http://www.cise.ufl.edu/~mssz/CompOrg/CDAintro.html 15110 Principles of Computing, 7 Carnegie Mellon University - CORTINA Arithmetic Logic Unit (ALU) OP 1OP 0 Carry In & OP OP 0 OP 1 F 0 0 A ∧ B 0 1 A ∨ B 1 0 A 1 1 A + B http://cs-alb-pc3.massey.ac.nz/notes/59304/l4.html 15110 Principles of Computing, 8 Carnegie Mellon University - CORTINA 4 Flip Flop • A flip flop is a sequential circuit that is able to maintain (save) a state. -
Drammer: Deterministic Rowhammer Attacks on Mobile Platforms
Drammer: Deterministic Rowhammer Attacks on Mobile Platforms Victor van der Veen Yanick Fratantonio Martina Lindorfer Vrije Universiteit Amsterdam UC Santa Barbara UC Santa Barbara [email protected] [email protected] [email protected] Daniel Gruss Clémentine Maurice Giovanni Vigna Graz University of Technology Graz University of Technology UC Santa Barbara [email protected] [email protected] [email protected] Herbert Bos Kaveh Razavi Cristiano Giuffrida Vrije Universiteit Amsterdam Vrije Universiteit Amsterdam Vrije Universiteit Amsterdam [email protected] [email protected] [email protected] ABSTRACT 1. INTRODUCTION Recent work shows that the Rowhammer hardware bug can The Rowhammer hardware bug allows an attacker to mod- be used to craft powerful attacks and completely subvert a ify memory without accessing it, simply by repeatedly ac- system. However, existing efforts either describe probabilis- cessing, i.e., \hammering", a given physical memory loca- tic (and thus unreliable) attacks or rely on special (and often tion until a bit in an adjacent location flips. Rowhammer unavailable) memory management features to place victim has been used to craft powerful attacks that bypass all cur- objects in vulnerable physical memory locations. Moreover, rent defenses and completely subvert a system [16,32,35,47]. prior work only targets x86 and researchers have openly won- Until now, the proposed exploitation techniques are either dered whether Rowhammer attacks on other architectures, probabilistic [16,35] or rely on special memory management such as ARM, are even possible. features such as memory deduplication [32], MMU paravir- We show that deterministic Rowhammer attacks are feasi- tualization [47], or the pagemap interface [35]. -
With Extreme Scale Computing the Rules Have Changed
With Extreme Scale Computing the Rules Have Changed Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 11/17/15 1 • Overview of High Performance Computing • With Extreme Computing the “rules” for computing have changed 2 3 • Create systems that can apply exaflops of computing power to exabytes of data. • Keep the United States at the forefront of HPC capabilities. • Improve HPC application developer productivity • Make HPC readily available • Establish hardware technology for future HPC systems. 4 11E+09 Eflop/s 362 PFlop/s 100000000100 Pflop/s 10000000 10 Pflop/s 33.9 PFlop/s 1000000 1 Pflop/s SUM 100000100 Tflop/s 166 TFlop/s 1000010 Tflop /s N=1 1 Tflop1000/s 1.17 TFlop/s 100 Gflop/s100 My Laptop 70 Gflop/s N=500 10 59.7 GFlop/s 10 Gflop/s My iPhone 4 Gflop/s 1 1 Gflop/s 0.1 100 Mflop/s 400 MFlop/s 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2015 1 Eflop/s 1E+09 420 PFlop/s 100000000100 Pflop/s 10000000 10 Pflop/s 33.9 PFlop/s 1000000 1 Pflop/s SUM 100000100 Tflop/s 206 TFlop/s 1000010 Tflop /s N=1 1 Tflop1000/s 1.17 TFlop/s 100 Gflop/s100 My Laptop 70 Gflop/s N=500 10 59.7 GFlop/s 10 Gflop/s My iPhone 4 Gflop/s 1 1 Gflop/s 0.1 100 Mflop/s 400 MFlop/s 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2015 1E+10 1 Eflop/s 1E+09 100 Pflop/s 100000000 10 Pflop/s 10000000 1 Pflop/s 1000000 SUM 100 Tflop/s 100000 10 Tflop/s N=1 10000 1 Tflop/s 1000 100 Gflop/s N=500 100 10 Gflop/s 10 1 Gflop/s 1 100 Mflop/s 0.1 1996 2002 2020 2008 2014 1E+10 1 Eflop/s 1E+09 100 Pflop/s 100000000 10 Pflop/s 10000000 1 Pflop/s 1000000 SUM 100 Tflop/s 100000 10 Tflop/s N=1 10000 1 Tflop/s 1000 100 Gflop/s N=500 100 10 Gflop/s 10 1 Gflop/s 1 100 Mflop/s 0.1 1996 2002 2020 2008 2014 • Pflops (> 1015 Flop/s) computing fully established with 81 systems. -
Throwhammer: Rowhammer Attacks Over the Network and Defenses
Throwhammer: Rowhammer Attacks over the Network and Defenses Andrei Tatar Radhesh Krishnan Konoth Elias Athanasopoulos VU Amsterdam VU Amsterdam University of Cyprus Cristiano Giuffrida Herbert Bos Kaveh Razavi VU Amsterdam VU Amsterdam VU Amsterdam Abstract never progressed beyond local privilege escalations or sandbox escapes. The attacker needs the ability to run Increasingly sophisticated Rowhammer exploits allow an code on the victim machine in order to flip bits in sen- attacker that can execute code on a vulnerable system sitive data. Hence, Rowhammer posed little threat from to escalate privileges and compromise browsers, clouds, attackers without code execution on the victim machines. and mobile systems. In all these attacks, the common In this paper, we show that this is no longer true and at- assumption is that attackers first need to obtain code tackers can flip bits by only sending network packets to execution on the victim machine to be able to exploit a victim machine connected to RDMA-enabled networks Rowhammer either by having (unprivileged) code exe- commonly used in clouds and data centers [1, 20, 45, 62]. cution on the victim machine or by luring the victim to a website that employs a malicious JavaScript applica- Rowhammer exploits today Rowhammer allows at- tion. In this paper, we revisit this assumption and show tackers to flip a bit in one physical memory location that an attacker can trigger and exploit Rowhammer bit by aggressively reading (or writing) other locations (i.e., flips directly from a remote machine by only sending hammering). As bit flips occur at the physical level, they network packets. -
Theoretical Peak FLOPS Per Instruction Set on Modern Intel Cpus
Theoretical Peak FLOPS per instruction set on modern Intel CPUs Romain Dolbeau Bull – Center for Excellence in Parallel Programming Email: [email protected] Abstract—It used to be that evaluating the theoretical and potentially multiple threads per core. Vector of peak performance of a CPU in FLOPS (floating point varying sizes. And more sophisticated instructions. operations per seconds) was merely a matter of multiplying Equation2 describes a more realistic view, that we the frequency by the number of floating-point instructions will explain in details in the rest of the paper, first per cycles. Today however, CPUs have features such as vectorization, fused multiply-add, hyper-threading or in general in sectionII and then for the specific “turbo” mode. In this paper, we look into this theoretical cases of Intel CPUs: first a simple one from the peak for recent full-featured Intel CPUs., taking into Nehalem/Westmere era in section III and then the account not only the simple absolute peak, but also the full complexity of the Haswell family in sectionIV. relevant instruction sets and encoding and the frequency A complement to this paper titled “Theoretical Peak scaling behavior of current Intel CPUs. FLOPS per instruction set on less conventional Revision 1.41, 2016/10/04 08:49:16 Index Terms—FLOPS hardware” [1] covers other computing devices. flop 9 I. INTRODUCTION > operation> High performance computing thrives on fast com- > > putations and high memory bandwidth. But before > operations => any code or even benchmark is run, the very first × micro − architecture instruction number to evaluate a system is the theoretical peak > > - how many floating-point operations the system > can theoretically execute in a given time. -
Can Microkernels Mitigate Microarchitectural Attacks?⋆
Can Microkernels Mitigate Microarchitectural Attacks?? Gunnar Grimsdal1, Patrik Lundgren2, Christian Vestlund3, Felipe Boeira1, and Mikael Asplund1[0000−0003−1916−3398] 1 Department of Computer and Information Science, Link¨oping University, Sweden ffelipe.boeira,[email protected] 2 Westermo Network Technologies [email protected] 3 Sectra AB, Link¨oping,Sweden Abstract. Microarchitectural attacks such as Meltdown and Spectre have attracted much attention recently. In this paper we study how effec- tive these attacks are on the Genode microkernel framework using three different kernels, Okl4, Nova, and Linux. We try to answer the question whether the strict process separation provided by Genode combined with security-oriented kernels such as Okl4 and Nova can mitigate microar- chitectural attacks. We evaluate the attack effectiveness by measuring the throughput of data transfer that violates the security properties of the system. Our results show that the underlying side-channel attack Flush+Reload used in both Meltdown and Spectre, is effective on all in- vestigated platforms. We were also able to achieve high throughput using the Spectre attack, but we were not able to show any effective Meltdown attack on Okl4 or Nova. Keywords: Genode, Meltdown, Spectre, Flush+Reload, Okl4, Nova 1 Introduction It used to be the case that general-purpose operating systems were mostly found in desktop computers and servers. However, as IoT devices are becoming in- creasingly more sophisticated, they tend more and more to require a powerful operating system such as Linux, since otherwise all basic services must be im- plemented and maintained by the device developers. At the same time, security has become a prime concern both in IoT and in the cloud domain. -
Misleading Performance Reporting in the Supercomputing Field David H
Misleading Performance Reporting in the Supercomputing Field David H. Bailey RNR Technical Report RNR-92-005 December 1, 1992 Ref: Scientific Programming, vol. 1., no. 2 (Winter 1992), pg. 141–151 Abstract In a previous humorous note, I outlined twelve ways in which performance figures for scientific supercomputers can be distorted. In this paper, the problem of potentially mis- leading performance reporting is discussed in detail. Included are some examples that have appeared in recent published scientific papers. This paper also includes some pro- posed guidelines for reporting performance, the adoption of which would raise the level of professionalism and reduce the level of confusion in the field of supercomputing. The author is with the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center, Moffett Field, CA 94035. 1 1. Introduction Many readers may have read my previous article “Twelve Ways to Fool the Masses When Giving Performance Reports on Parallel Computers” [5]. The attention that this article received frankly has been surprising [11]. Evidently it has struck a responsive chord among many professionals in the field who share my concerns. The following is a very brief summary of the “Twelve Ways”: 1. Quote only 32-bit performance results, not 64-bit results, and compare your 32-bit results with others’ 64-bit results. 2. Present inner kernel performance figures as the performance of the entire application. 3. Quietly employ assembly code and other low-level language constructs, and compare your assembly-coded results with others’ Fortran or C implementations. 4. Scale up the problem size with the number of processors, but don’t clearly disclose this fact. -
Intel Xeon & Dgpu Update
Intel® AI HPC Workshop LRZ April 09, 2021 Morning – Machine Learning 9:00 – 9:30 Introduction and Hardware Acceleration for AI OneAPI 9:30 – 10:00 Toolkits Overview, Intel® AI Analytics toolkit and oneContainer 10:00 -10:30 Break 10:30 - 12:30 Deep dive in Machine Learning tools Quizzes! LRZ Morning Session Survey 2 Afternoon – Deep Learning 13:30 – 14:45 Deep dive in Deep Learning tools 14:45 – 14:50 5 min Break 14:50 – 15:20 OpenVINO 15:20 - 15:45 25 min Break 15:45 – 17:00 Distributed training and Federated Learning Quizzes! LRZ Afternoon Session Survey 3 INTRODUCING 3rd Gen Intel® Xeon® Scalable processors Performance made flexible Up to 40 cores per processor 20% IPC improvement 28 core, ISO Freq, ISO compiler 1.46x average performance increase Geomean of Integer, Floating Point, Stream Triad, LINPACK 8380 vs. 8280 1.74x AI inference increase 8380 vs. 8280 BERT Intel 10nm Process 2.65x average performance increase vs. 5-year-old system 8380 vs. E5-2699v4 Performance varies by use, configuration and other factors. Configurations see appendix [1,3,5,55] 5 AI Performance Gains 3rd Gen Intel® Xeon® Scalable Processors with Intel Deep Learning Boost Machine Learning Deep Learning SciKit Learn & XGBoost 8380 vs 8280 Real Time Inference Batch 8380 vs 8280 Inference 8380 vs 8280 XGBoost XGBoost up to up to Fit Predict Image up up Recognition to to 1.59x 1.66x 1.44x 1.30x Mobilenet-v1 Kmeans Kmeans up to Fit Inference Image up to up up Classification to to 1.52x 1.56x 1.36x 1.44x ResNet-50-v1.5 Linear Regression Linear Regression Fit Inference Language up to up to up up Processing to 1.44x to 1.60x 1.45x BERT-large 1.74x Performance varies by use, configuration and other factors. -
Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Copyright © 2012, Elsevier Inc. All rights reserved. 1 Contents 1. SIMD architecture 2. Vector architectures optimizations: Multiple Lanes, Vector Length Registers, Vector Mask Registers, Memory Banks, Stride, Scatter-Gather, 3. Programming Vector Architectures 4. SIMD extensions for media apps 5. GPUs – Graphical Processing Units 6. Fermi architecture innovations 7. Examples of loop-level parallelism 8. Fallacies Copyright © 2012, Elsevier Inc. All rights reserved. 2 Classes of Computers Classes Flynn’s Taxonomy SISD - Single instruction stream, single data stream SIMD - Single instruction stream, multiple data streams New: SIMT – Single Instruction Multiple Threads (for GPUs) MISD - Multiple instruction streams, single data stream No commercial implementation MIMD - Multiple instruction streams, multiple data streams Tightly-coupled MIMD Loosely-coupled MIMD Copyright © 2012, Elsevier Inc. All rights reserved. 3 Introduction Advantages of SIMD architectures 1. Can exploit significant data-level parallelism for: 1. matrix-oriented scientific computing 2. media-oriented image and sound processors 2. More energy efficient than MIMD 1. Only needs to fetch one instruction per multiple data operations, rather than one instr. per data op. 2. Makes SIMD attractive for personal mobile devices 3. Allows programmers to continue thinking sequentially SIMD/MIMD comparison. Potential speedup for SIMD twice that from MIMID! x86 processors expect two additional cores per chip per year SIMD width to double every four years Copyright © 2012, Elsevier Inc. All rights reserved. 4 Introduction SIMD parallelism SIMD architectures A. Vector architectures B. SIMD extensions for mobile systems and multimedia applications C. -
Summarizing CPU and GPU Design Trends with Product Data
Summarizing CPU and GPU Design Trends with Product Data Yifan Sun, Nicolas Bohm Agostini, Shi Dong, and David Kaeli Northeastern University Email: fyifansun, agostini, shidong, [email protected] Abstract—Moore’s Law and Dennard Scaling have guided the products. Equipped with this data, we answer the following semiconductor industry for the past few decades. Recently, both questions: laws have faced validity challenges as transistor sizes approach • Are Moore’s Law and Dennard Scaling still valid? If so, the practical limits of physics. We are interested in testing the validity of these laws and reflect on the reasons responsible. In what are the factors that keep the laws valid? this work, we collect data of more than 4000 publicly-available • Do GPUs still have computing power advantages over CPU and GPU products. We find that transistor scaling remains CPUs? Is the computing capability gap between CPUs critical in keeping the laws valid. However, architectural solutions and GPUs getting larger? have become increasingly important and will play a larger role • What factors drive performance improvements in GPUs? in the future. We observe that GPUs consistently deliver higher performance than CPUs. GPU performance continues to rise II. METHODOLOGY because of increases in GPU frequency, improvements in the thermal design power (TDP), and growth in die size. But we We have collected data for all CPU and GPU products (to also see the ratio of GPU to CPU performance moving closer to our best knowledge) that have been released by Intel, AMD parity, thanks to new SIMD extensions on CPUs and increased (including the former ATI GPUs)1, and NVIDIA since January CPU core counts.