Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* for Xeon Phi Cluster


Performance Optimization of Deep Learning Frameworks on Modern Intel Architectures
ElMoustapha Ould-Ahmed-Vall, AG Ramesh, Vamsi Sripathi and Karthik Raman
Representing the work of many at Intel

Agenda
• Optimization matters on modern architectures
• Intel's recent Xeon and Xeon Phi products
• Introduction to deep learning
• Optimizing DL frameworks on IA
o Key challenges
o Optimization techniques
o Performance data
o DL scaling

Moore's Law Goes On!
Instead of higher clock speeds, new process generations deliver more cores plus wider SIMD (hierarchical parallelism).

Combined Amdahl's Law for Vector Multicores
Speedup = (1 / (F_serial + (1 - F_serial) / NumCores)) * (1 / (F_scalar + (1 - F_scalar) / VectorLength))
where F_serial is the serial fraction and F_scalar the scalar (non-vectorized) fraction of the code.
Goal: reduce both the serial fraction and the scalar fraction of the code.
Ideal speedup: NumCores * VectorLength (requires zero scalar and zero serial work).

Roofline Model
Most kernels of ML codes are compute bound, i.e. raw FLOPS matter.
Attainable Gflops/s = min(Peak Gflops/s, Stream BW * flops/byte)
(Figure: roofline plot of attainable Gflops/s versus compute intensity in flops/byte, showing the memory-bandwidth slope and the peak-compute ceilings with and without SIMD.)
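As a back-of-the-envelope illustration of these two formulas, the short Python sketch below evaluates the combined Amdahl speedup and the roofline bound. The 68 cores, 16-wide SP SIMD, ~6000 Gflops/s peak, and ~400 GB/s bandwidth echo the deck's rounded KNL figures, but the 1% serial and 5% scalar fractions and the 4 flops/byte intensity are assumptions for illustration only.

```python
def amdahl_vector_multicore(f_serial, f_scalar, num_cores, vector_length):
    """Combined Amdahl's law for vector multicores (formula above)."""
    core_term = 1.0 / (f_serial + (1.0 - f_serial) / num_cores)
    simd_term = 1.0 / (f_scalar + (1.0 - f_scalar) / vector_length)
    return core_term * simd_term

def roofline_gflops(peak_gflops, stream_bw_gbs, flops_per_byte):
    """Attainable Gflops/s = min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, stream_bw_gbs * flops_per_byte)

# Illustrative KNL-like parameters; the serial/scalar fractions are assumed.
print(amdahl_vector_multicore(0.01, 0.05, 68, 16))  # ~372x of the 1088x ideal
print(roofline_gflops(6000.0, 400.0, 4.0))          # bandwidth-bound: 1600.0 Gflops/s
```

Even with only 1% serial and 5% scalar work, the achievable speedup (~372x) falls far short of the 1088x ideal, which is why reducing both fractions is the stated goal.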
Overview of Current Generation of Intel Xeon and Xeon Phi Products

Current Intel Xeon Platforms
From 45nm Nehalem through 32nm Westmere and Sandy Bridge, 22nm Ivy Bridge and Haswell, to 14nm Broadwell, Intel has alternated new microarchitectures ("tock": Nehalem, Sandy Bridge, Haswell) with process shrinks of the same microarchitecture ("tick": Westmere, Ivy Bridge, Broadwell).

Latest released – Broadwell (14nm process)
• Intel's foundation of HPC and ML performance
• Suited for the full scope of workloads
• Industry-leading performance/watt for serial and highly parallel workloads
• Up to 22 cores per socket (Broadwell-EP), with Hyper-Threading technology
Software optimization helps maximize the benefit and adoption of new features.

2nd Generation Intel® Xeon Phi™ Platform (Knights Landing)

Intel® AVX Technology
• SNB/IVB – 256-bit AVX: 16 SP / 8 DP flops per cycle. Adds 256-bit basic FP, 16 registers, NDS (and AVX128), improved blend, MASKMOV, implicit unaligned access.
• HSW/BDW – 256-bit AVX2: 32 SP / 16 DP flops per cycle (FMA). Adds Float16 (IVB 2012), 256-bit FP FMA, 256-bit integer, PERMD, gather.
• SKX & KNL – 512-bit AVX-512: 64 SP / 32 DP flops per cycle (FMA). Adds 512-bit FP/integer, 32 registers, 8 mask registers, embedded rounding, embedded broadcast, scalar/SSE/AVX "promotions", native media additions, HPC additions, transcendental support, gather/scatter.

Overview of Deep Learning and DL Frameworks

Deep Learning – Convolutional Neural Network
Convolution parameters (example):
• Number of outputs/feature maps: 4
• Filter size: 3 x 3
• Stride: 2
• Pad size (for the corner case): 1

Deep Learning: Train Once, Use Many Times
• Step 1: Training (over hours/days/weeks): input data is used to create and train a deep neural network model, e.g. classifying a person with 97% confidence.
• Step 2: Inference (real time): new input from cameras and sensors is scored against the trained model, e.g. an output classification of 90% person, 8% traffic light.

Deep Learning: Why Now?
• Bigger data: images of ~1,000 KB per picture, audio of ~5,000 KB per song, video of ~5,000,000 KB per movie.
• Better hardware: transistor density doubles every 18 months; cost per GB fell from $1,000.00 in 1995 to $0.03 in 2015.
• Smarter algorithms: advances in algorithm innovation, including neural networks, lead to better accuracy in training models.

Intel Caffe – ML Framework Optimized for Xeon and Xeon Phi Products
• Fork of BVLC Caffe by Intel to optimize for IA
• Leverages the Intel MKL Deep Neural Network (DNN) APIs
• Optimized for BDW (AVX2) and KNL (MIC_AVX512)
• https://github.com/intel/caffe
(Figure: a data layer reading LMDB, LevelDB, or HDF5 feeds compute layers such as convolution, ReLU, and pooling, which call into Intel MKL DNN on BDW and KNL.)

TensorFlow™: Open Source ML Framework (Google)
• Computation is a dataflow graph with tensors
• General mathematical computing framework, widely used for deep neural networks, other machine learning algorithms, and HPC applications
• Key computational kernels, with extendable user operations
• Core in C++, front-end wrapper in Python
• Multi-node support using gRPC (Google Remote Procedure Calls)
(Example from Jeff Dean's presentation.)

Optimizing Deep Learning Frameworks

Performance Optimization on Modern Platforms – Hierarchical Parallelism
• Coarse-grained parallelism across nodes: domain decomposition into sub-domains, via multi-level domain decomposition (e.g. across layers) and data decomposition (layer parallelism); scaling demands reduced synchronization events and all-to-all communication, plus good load balancing.
• Fine-grained parallelism within a node:
o Utilize all the cores (OpenMP, MPI, TBB…): reduce synchronization events and serial code, improve load balancing.
o Vectorize/SIMD: unit-strided access per SIMD lane, high vector efficiency, data alignment.
o Efficient memory/cache use: blocking, data reuse, prefetching, careful memory allocation.

Intel Strategy: Optimized Deep Learning Environment
• Intel® Deep Learning SDK: accelerate design, training, and deployment; fuel the development of vertical solutions.
• Drive optimizations across open source deep learning frameworks.
• Intel® Math Kernel Library (Intel® MKL) and Intel® MKL-DNN: maximum performance on Intel architecture.
• Intel® Omni-Path Architecture (Intel® OPA): deliver the best single-node and multi-node performance, for both training and inference.

Example Challenge 1: Data Layout Has Big Impact on Performance
• Data layout impacts performance:
o Sequential access avoids gather/scatter.
o Keeping iterations in the innermost loop ensures high vector utilization.
o Maximize data reuse, e.g. weights in a convolution layer.
• Converting to/from the optimized layout is sometimes less expensive than operating on the unoptimized layout.
(Figure: the same matrix stored in plain row-major order versus a blocked layout that is better optimized for some operations.)
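To make the blocked-layout idea concrete, here is a small NumPy sketch that reorders an NCHW activation tensor into a blocked nChw8c-style layout (channels split into blocks of 8, with the block innermost), similar in spirit to the layouts MKL-DNN uses. The tensor shape and the block size of 8 are illustrative assumptions, not values from the deck.

```python
import numpy as np

def nchw_to_nChw8c(x):
    """Reorder NCHW -> blocked layout (N, C//8, H, W, 8).

    Keeping an 8-channel block as the innermost, unit-stride dimension
    lets a 256-bit SIMD register load 8 contiguous float32 values,
    avoiding gather/scatter across the channel dimension.
    """
    n, c, h, w = x.shape
    assert c % 8 == 0, "channel count must be divisible by the block size"
    return x.reshape(n, c // 8, 8, h, w).transpose(0, 1, 3, 4, 2).copy()

def nChw8c_to_nchw(x):
    """Inverse conversion back to plain NCHW."""
    n, cb, h, w, blk = x.shape
    return x.transpose(0, 1, 4, 2, 3).reshape(n, cb * blk, h, w).copy()

x = np.arange(2 * 16 * 4 * 4, dtype=np.float32).reshape(2, 16, 4, 4)
blocked = nchw_to_nChw8c(x)
assert np.array_equal(nChw8c_to_nchw(blocked), x)  # round-trip is lossless
```

The explicit `.copy()` calls are what make the conversion pay off: they materialize the data contiguously in the new order, which is the cost the next challenge tries to amortize by staying in the optimized layout.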
Example Challenge 2: Minimize Conversions Overhead
• End-to-end optimization can reduce conversions.
• Staying in the optimized layout as long as possible becomes one of the tuning goals.
• Minimize the number of back-and-forth conversions.
• Use graph optimization techniques.
(Figure: native-to-MKL-layout and MKL-layout-to-native conversion nodes surrounding the convolution and max-pool operators in a graph.)

Example Challenge 3: Ensuring Enough Parallelism to Leverage All Cores
• Maximize parallelism to use all cores efficiently.
• Intra operation/layer parallelism within operators (OpenMP), e.g. convolving tiles of one input in parallel.
• Inter operation parallelism across operators, e.g. executing the 1x1, 3x3, and 5x5 convolutions that feed a concat layer in parallel.
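As a concrete handle on these two levels of parallelism, TensorFlow's 1.x API exposes separate intra-op and inter-op thread pools, and Intel's MKL-backed builds additionally honor OpenMP environment variables. The sketch below shows where each knob goes; the thread counts and affinity string are illustrative values for a 68-core KNL, not tuned settings from the deck.

```python
import os

# OpenMP/MKL knobs honored by Intel-optimized builds (illustrative values):
os.environ["OMP_NUM_THREADS"] = "68"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf  # TensorFlow 1.x-style session configuration

config = tf.ConfigProto(
    intra_op_parallelism_threads=68,  # threads inside a single op, e.g. conv tiles
    inter_op_parallelism_threads=4,   # independent ops that may run concurrently,
)                                     # e.g. the 1x1/3x3/5x5 branches before a concat
sess = tf.Session(config=config)
```

Oversubscribing (intra-op threads times concurrent ops exceeding the core count) typically hurts, so the two pools are usually tuned together.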
Example Challenge 4: Optimizing the Data Layer
• The data layer comprises three major ops:
o Read data
o Decode data: e.g. JPEG decode, decompression
o Transform data
• The result of read, decode, and transform is the input to the DNN layers.
• Reduce the number of cores dedicated to feeding the DNN:
o IO optimization: consider compression
o Decode: consider LMDB instead of JPEG
o Resizing/data processing: consider pre-processing
o Then vectorize and parallelize
(Figure: cores C0 and C1 run Boost threads feeding the data layer while cores C2 through Cn-1 run OpenMP compute threads.)

Optimizing Deep Learning Frameworks for Intel® Architecture
• Leverage high-performance compute libraries and tools, e.g. Intel® Math Kernel Library, Intel® Python, Intel® Compiler.
• Data format/shape: pick the right format/shape for maximum performance: blocking, gather/scatter.
• Data layout: minimize the cost of data layout conversions.
• Parallelism: use all cores; eliminate serial sections and load imbalance.
• Other functions/primitives (un-optimized in libraries): optimize via compiler knobs, improve existing implementations.
• Memory allocation: exploit unique characteristics and the ability to reuse buffers.
• Data layer optimizations: parallelization, vectorization, IO.
• Optimize hyperparameters: e.g. batch size for more parallelism; learning rate and optimizer choice to ensure accuracy/convergence.

AlexNet Optimization Progression
(Figure: cumulative speedup on Broadwell and Knights Landing from successive optimizations, growing from the 1.00x baseline through intermediate steps of roughly 2.2x, 4.2x, 7x-13.7x, 18.3x, and 25.5x-27.2x up to 40.71x and 49.07x.)

VGG Optimization Progression
(Figure: cumulative speedup on Broadwell and Knights Landing across the steps baseline, MKL integration, thread optimization, compiler knobs tuning, matrix transpose/data transformations, memory allocations, and conversions optimization, growing from 1.00x to 164.95x-171.20x and up to 273.50x.)

Configuration details
• Intel® Xeon™ processor E5-2699 v4 (22 cores, 2.2 GHz), 128 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2
• Intel® Xeon Phi™ processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM in flat mode), 96 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2
• AlexNet and VGG benchmarks: https://github.com/soumith/convnet-benchmarks

Multi-Node Distributed Training
• Model parallelism: break the model across N nodes; the same data is in all the nodes.
• Data parallelism: break the dataset across N nodes; the same model is in all the nodes. Good for networks with few weights, e.g. GoogLeNet.
• You can use either model or data parallelism, or a hybrid of both (a minimal data-parallel sketch follows the performance disclaimer below).

Data Parallelism Scaling Efficiency: Intel® Xeon Phi™ Processor
(Figure: multi-node scaling efficiency of deep learning image classification training for OverFeat, AlexNet, VGG-A, and GoogLeNet on 1 to 128 Intel® Xeon Phi™ processor 7250 (68 cores, 1.4 GHz, 16 GB) nodes; reported efficiencies at scale include 87% and 62%.)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
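To illustrate the synchronous data-parallel scheme whose scaling is charted above, here is a minimal mpi4py sketch: every node holds the same model, computes gradients on its own shard of the mini-batch, and an allreduce averages the gradients before an identical weight update. The model size, gradient function, and learning rate are hypothetical placeholders, not code or settings from the deck.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nodes = comm.Get_rank(), comm.Get_size()

weights = np.zeros(1024, dtype=np.float32)  # same model replicated on every node
lr = 0.01                                   # hypothetical learning rate

def local_gradient(weights, shard):
    """Placeholder for backprop over this node's shard of the dataset."""
    return np.random.default_rng(rank).standard_normal(weights.shape).astype(np.float32)

for step in range(100):
    shard = None  # each rank would read its 1/nodes-th of the mini-batch here
    grad = local_gradient(weights, shard)
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)  # sum gradients across all nodes
    weights -= lr * (avg / nodes)          # identical update keeps replicas in sync
```

The per-step allreduce traffic is proportional to the number of weights, which is why the slide notes that data parallelism suits networks with few weights, such as GoogLeNet.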