Best Practice Guide Intel Xeon Phi V1.1

Total Page:16

File Type:pdf, Size:1020Kb

Best Practice Guide Intel Xeon Phi V1.1 Best Practice Guide Intel Xeon Phi v1.1 Michaela Barth, KTH Sweden Mikko Byckling, CSC Finland Nevena Ilieva, NCSA Bulgaria Sami Saarinen, CSC Finland Michael Schliephake, KTH Sweden Volker Weinberg (Editor), LRZ Germany <[email protected]> 14-02-2014 1 Best Practice Guide Intel Xeon Phi v1.1 Table of Contents 1. Introduction .............................................................................................................................. 4 2. Intel MIC architecture & system overview ..................................................................................... 4 2.1. The Intel MIC architecture ............................................................................................... 4 2.1.1. Intel Xeon Phi coprocessor architecture overview ....................................................... 4 2.1.2. The cache hierarchy .............................................................................................. 6 2.2. Network configuration & system access .............................................................................. 7 3. Native compilation ................................................................................................................... 11 4. Intel compiler's offload pragmas ................................................................................................. 12 4.1. Simple example ............................................................................................................ 12 4.2. Obtaining informations about the offloading ....................................................................... 13 4.3. Syntax of pragmas ........................................................................................................ 14 4.4. Recommendations ......................................................................................................... 15 4.4.1. Explicit worksharing ........................................................................................... 15 4.4.2. Persistent data on the coprocessor .......................................................................... 15 4.4.3. Optimizing offloaded code ................................................................................... 16 4.5. Intel Cilk Plus parallel extensions .................................................................................... 17 5. OpenMP and hybrid ................................................................................................................. 17 5.1. OpenMP ...................................................................................................................... 17 5.1.1. OpenMP basics .................................................................................................. 17 5.1.2. Threading and affinity ......................................................................................... 17 5.1.3. Loop scheduling ................................................................................................. 18 5.1.4. Scalability improvement ....................................................................................... 18 5.2. Hybrid OpenMP/MPI ..................................................................................................... 19 5.2.1. Programming models ........................................................................................... 19 5.2.2. Threading of the MPI ranks .................................................................................. 19 6. MPI ...................................................................................................................................... 19 6.1. Setting up the MPI environment ...................................................................................... 19 6.2. MPI programming models .............................................................................................. 20 6.2.1. Coprocessor-only model ....................................................................................... 20 6.2.2. Symmetric model ................................................................................................ 20 6.2.3. Host-only model ................................................................................................. 21 6.3. Simplifying launching of MPI jobs ................................................................................... 21 7. Intel MKL (Math Kernel Library) ............................................................................................... 21 7.1. MKL usage modes ........................................................................................................ 22 7.1.1. Automatic Offload (AO) ...................................................................................... 22 7.1.2. Compiler Assisted Offload (CAO) ......................................................................... 23 7.1.3. Native Execution ................................................................................................ 23 7.2. Example code ............................................................................................................... 24 7.3. Intel Math Kernel Library Link Line Advisor ..................................................................... 24 8. TBB: Intel Threading Building Blocks ........................................................................................ 24 8.1. Advantages of TBB ....................................................................................................... 24 8.2. Using TBB natively ....................................................................................................... 25 8.3. Offloading TBB ............................................................................................................ 27 8.4. Examples ..................................................................................................................... 27 8.4.1. Exposing parallelism to the processor ..................................................................... 27 8.4.2. Vectorization and Cache-Locality .......................................................................... 31 8.4.3. Work-stealing versus Work-sharing ........................................................................ 31 9. IPP: The Intel Integrated Performance Primitives .......................................................................... 31 9.1. Overview of IPP ........................................................................................................... 31 9.2. Using IPP .................................................................................................................... 32 9.2.1. Getting Started ................................................................................................... 32 9.2.2. Linking of Applications ....................................................................................... 33 9.3. Multithreading .............................................................................................................. 33 9.4. Links and References ..................................................................................................... 33 10. Further programming models ................................................................................................... 33 2 Best Practice Guide Intel Xeon Phi v1.1 10.1. OpenCL ..................................................................................................................... 34 10.2. OpenACC .................................................................................................................. 34 11. Debugging ............................................................................................................................ 34 11.1. Native debugging with gdb ........................................................................................... 34 11.2. Remote debugging with gdb .......................................................................................... 34 12. Tuning ................................................................................................................................. 35 12.1. Single core optimization ............................................................................................... 35 12.1.1. Memory alignment ............................................................................................ 35 12.1.2. SIMD optimization ............................................................................................ 36 12.2. OpenMP optimization .................................................................................................. 37 12.2.1. OpenMP thread affinity ...................................................................................... 37 12.2.2. Example: Thread affinity .................................................................................... 38 12.2.3. OpenMP thread placement .................................................................................. 39 12.2.4. Multiple parallel regions and barriers .................................................................... 40 12.2.5. Example: Multiple parallel regions and barriers ...................................................... 40 12.2.6. False sharing .................................................................................................... 42 12.2.7. Example: False sharing ...................................................................................... 42 12.2.8. Memory limitations ..........................................................................................
Recommended publications
  • Driving Performance with Intel® Advisor's Flow Graph Analyzer
    Software THE PARALLEL UNIVERSE Driving Code Performance with Intel® Advisor’s Flow Graph Analyzer Modernize your Code with Issue Intel® Parallel Studio XE 30 Enabling FPGAs for Software Developers 2017 CONTENTS Letter from the Editor 3 Meet Intel® Parallel Studio XE 2018 by Henry A. Gabb, Senior Principal Engineer, Intel Corporation Driving Performance with Intel® Advisor’s Flow Graph Analyzer 5 by Vasanth Tovinkere, Architect, Intel® Flow Graph Analyzer; Pablo Reble, Software Engineer; Farshad Akhbari, Perceptual Computing Technical Lead; and Palanivel Guruvareddiar, Perceptual Computing Software Architect; Intel Corporation FEATURE Welcome to the Adult World, OpenMP* 19 by Barbara Chapman, Professor, Stony Brook University, and Director of Computer Science and Mathematics, Brookhaven National Laboratory Enabling FPGAs for Software Developers 25 by Bernhard Friebe, Senior Director of FPGA Software Solutions Marketing, Intel Corporation, and James Reinders, HPC Enthusiast Modernize Your Code for Performance, Portability, and Scalability 37 by Jackson Marusarz, Technical Consulting Engineer, Intel Corporation Dealing with Outliers 45 by Oleg Kremnyov, QA Engineer; Mikhail Averbukh, Software Engineer; and Ivan Kuzmin, Software Engineering Manager; Intel Corporation Tuning for Success with the LatestSIMD Extensions 57 and Intel® Advanced Vector Extensions 512 by Xinmin Tian, Senior Principal Engineer; Hideki Saito, Principal Engineer; Sergey Kozhukhov, Senior Staff Engineer; and Nikolay Panchenko, Staff Engineer; Intel Compiler and Language Lab, Intel Corporation Effectively Using Your Whole Cluster 77 by Rama Kishan Malladi, Technical Marketing Engineer, Intel Corporation Is Your Cluster Healthy? 85 by Brock A. Taylor, HPC Solution Architect, Intel Corporation Optimizing HPC Clusters 89 by Michael Hebenstreit, Data Center Engineer, Intel Corporation For more complete information about compiler optimizations, see our Optimization Notice.
    [Show full text]
  • Intel® Parallel Studio XE 2013 SP1 Product Brief
    EFFICIENTLY PRODUCE FAST, SCALABLE, AND RELIABLE APPLICATIONS Intel® Parallel Studio XE 2013 SP1 Product Brief Top Features Leading Development Tools for Top Performance . Industry-leading application Deliver top application performance while minimizing development, tuning and testing time, and performance that scales as effort. Intel® Parallel Studio XE provides C/C++ and Fortran developers cutting edge performing processor core count and vector compilers and libraries, the right parallel programming models, and complementary and width increase compatible analysis tools. It plugs seamlessly into Microsoft Visual Studio* and the GNU tool chain to keep you productive while preserving your development environment investment. Efficiently scale on tomorrow’s Boost performance for your applications as they run on today’s and tomorrow’s IA-compatible hardware while preserving processors and coprocessors, including Intel® Xeon® processors and Intel® Xeon Phi™ investment in existing code coprocessors. Compatible with leading development environments Intel Parallel Studio XE includes the next generation of software development tools: . Intel® C, C++ and Fortran Compilers – Industry-Leading Compilers “If you need performance for your . Intel® Math Kernel Library (Intel® MKL) and Intel® Integrated Performance Primitives C++ and/or Fortran applications, (Intel® IPP) – Performance Libraries you need to try Intel software . Intel® Threading Building Blocks (Intel® TBB) and Intel® Cilk™ Plus – Parallel Programming development products." Models Dr. Artur Guzik, Senior Engineer, Neubrex. Co., . Intel® Advisor XE – Threading Prototyping Tool Ltd. Intel® VTune™ Amplifier XE – Advanced Threading and Performance Profiler . Intel® Inspector XE – Memory and Threading Debugger "The new VTune Amplifier XE Get more performance with less effort. Optimize performance by using the latest versions of brings even more capability to an broadly used Intel® Software Development Products.
    [Show full text]
  • Intel® Technology Journal | Volume 18, Issue 2, 2014
    Intel® Technology Journal | Volume 18, Issue 2, 2014 Publisher Managing Editor Content Architect Cory Cox Stuart Douglas Paul Cohen Program Manager Technical Editor Technical Illustrators Paul Cohen David Clark MPS Limited Technical and Strategic Reviewers Bruce J Beare Tracey Erway Michael Richmond Judi A Goldstein Kira C Boyko Miao Wei Paul De Carvalho Ryan Tabrah Prakash Iyer Premanand Sakarda Ticky Thakkar Shelby E Hedrick Intel® Technology Journal | 1 Intel® Technology Journal | Volume 18, Issue 2, 2014 Intel Technology Journal Copyright © 2014 Intel Corporation. All rights reserved. ISBN 978-1-934053-63-8, ISSN 1535-864X Intel Technology Journal Volume 18, Issue 2 No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Publisher, Intel Press, Intel Corporation, 2111 NE 25th Avenue, JF3-330, Hillsboro, OR 97124-5961. E-Mail: [email protected]. Th is publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold with the understanding that the publisher is not engaged in professional services. If professional advice or other expert assistance is required, the services of a competent professional person should be sought.
    [Show full text]
  • Intel Compiler
    FZ-Jülich, 26 November 2020 Intel Tuning for Juwels and Jureca Dr. Heinrich Bockhorst - Intel Notices & Disclaimers Intel technologies may require enabled hardware, software or service activation. Learn more at intel.com or from the OEM or retailer. Your costs and results may vary. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804. https://software.intel.com/en-us/articles/optimization-notice Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See backup for configuration details. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
    [Show full text]
  • Intel® Oneapi Programming Guide
    Intel® oneAPI Programming Guide Intel Corporation www.intel.com Notices and Disclaimers Contents Notices and Disclaimers....................................................................... 5 Chapter 1: Introduction Intel oneAPI Programming Overview ............................................................6 oneAPI Toolkit Distribution..........................................................................7 About This Guide.......................................................................................8 Related Documentation ..............................................................................8 Chapter 2: oneAPI Programming Model Data Parallel C++ (DPC++) ........................................................................9 C/C++ or Fortran with OpenMP* Offload Programming Model ........................ 11 Device Selection...................................................................................... 13 Chapter 3: oneAPI Development Environment Setup Use the setvars Script with Windows* ........................................................ 16 Use a Config file for setvars.bat on Windows ..................................... 18 Automate the setvars.bat Script with Microsoft Visual Studio* ............... 21 Use the setvars Script with Linux* or MacOS* ............................................. 24 Use a Config file for setvars.sh on Linux or macOS ............................. 26 Automate the setvars.sh Script with Eclipse* ...................................... 29 Use Modulefiles with Linux* .....................................................................
    [Show full text]
  • Использование Intel® Distribution for Python* INTRODUCTION Python Is #1 Programming Language in Hiring Demand Followed by Java and C++
    Использование Intel® Distribution for Python* INTRODUCTION Python is #1 programming language in hiring demand followed by Java and C++. Python is among the most popular programming languages And the demand is . Especially in research/data science for prototyping growing . But very limited use in production . Big data real time analytics requires deploying models to High Performance Data Analytics (HPDA) environments . Hire a team of Java/C++ programmers … OR . Ease access to HPDA for data scientist . This talk shows how Intel Distribution for Python may ease access to production environment for researcher/data scientist . Complementary to Intel TAP ATK Problem statement: DATA scientist is not a professional programmer Until 2015 customer demand for using Intel tools with scripting & JIT languages was addressed by publishing Knowledge Base articles on Intel Developer Zone “… I was hoping to take “Any KB articles I found advantage of by on your site that related building my NumPy and to actually using the Scipy installations MKL for compiling using the MKL … After something were overly spending over 40 hours technical. I couldn't trying to figure it out figure out what the myself, I gave up.” – heck some of the Researcher, Arizona things were doing or State University* talking about.” – Anonymous Data Scientist* * PSXE 2015 Beta Customer Survey Programming Languages Productivity LANGUAGE VERBOSITY PROGRAMMING COMPLEXITY (LOC/FEATURE) (HOURS) 4 5 4 3 3 2 2 1 1 0 0 Python Java C++ C Python Java C++ C Prechelt* Berkholz** Prechelt* Berkholz** * L.Prechelt, An empirical comparison of seven programming languages, IEEE Computer, 2000, Vol. 33, Issue 10, pp.
    [Show full text]
  • How Intel® Parallel Studio XE 2015 Helps Make Faster Code Faster for HPC
    Intel Software Conference 2015 Игорь Лопатин Руководитель группы по разработке ПО Intel Intel Software and Services Group, Developer Products Division 11 марта, 2015 Intel Software & Services Group Network Infrastructure Cloud API Services Cloud and Data Center Products Client Software Platforms Intel Developer Zone * Other names and brands may be claimed as the property of others. 2 Intel Software & Services – глобальное присутствие ~14,000+ сотрудников, более 35 офисов 3 Intel в России – активный рост Saint Petersburg ~1000 сотрудников в 4 городах Moscow 80% исследования и разработки Nizhniy Novgorod Novosibirsk 4 Intel Software в России Российские разработчики играют ключевую роль • Intel Parallel Studio, Cluster Tools, Compilers, Performance Libraries, Threading Tools, GPAs, Media SDK, и т.д. • Поддержка и разработка ~ 70 продуктов • Около 20 релизов ПО ежемесячно Программы Intel для разработчиков • Intel® Developer Zone www.intel.ru/software • Intel® Software Partner Program www.intel.ru/partner • Intel - INTUIT Academy http://www.intuit.ru/catalog/se/intel/ Центры компетенции в ведущих российских 5 ВУЗах Создавайте инновационные приложения вместе с Intel Intel® Intel® Intel® Media Intel® Integrated Intel® XDK Parallel Studio XE System Studio Server Studio Native Developer Experience Technical Embedded Media Development Native App Web and hybrid Computing Systems/Device Software Development HTML5 App Enterprise, and HPC Development Software Development Software Software Software 6 Optimization Notice Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Technical Computing Intel® Parallel Studio XE Enterprise, and HPC Software Улучшение масштабируемости, устойчивости и производительности приложений Модели Лидирующая параллельного Средства анализа производительно программирован программ сть ия . Продвинутые компиляторы и .
    [Show full text]
  • Intel Tuning for JURECA and JUWELS
    FZJ, May 20th, 2020 Dr. Heinrich Bockhorst, Intel Agenda Introduction Processor Architecture Overview Composer XE – Compiler Intel Python APS – Application Performance Snapshot MPI and ITAC analysis VTune Amplifier XE - analysis Advisor XE - Vectorization Selected Intel® Tools References Refer to software.intel.com/articles/optimization-notice for more information regarding performance & optimization choices in Intel software products. Copyright ©, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. The “Free Lunch” is over, really Processor clock rate growth halted around 2005 Source: © 2014, James Reinders, Intel, used with permission Software must be parallelized to realize all the potential performance Refer to software.intel.com/articles/optimization-notice for more information regarding performance & optimization choices in Intel software products. Copyright ©, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. What platform should I use for code modernization? Juwels The world is going Intel® Xeon® Processor Intel® Xeon Phi™ x100 Intel® Xeon® Processor Intel® Xeon Phi™ x200 parallel – stick E5-2600 v3 Product Product Family E5-2600 v4 Product Product Family with sequential Family formerly formerly codenamed Family codenamed codenamed … code and you will codenamed Knights Broadwell Knights Skylake fall behind. Haswell Corner Landing Cores 18 61 22 72 28 Threads/Core 2 4 2 4 2 Vector Width 256-bit 512-bit 256-bit 512-bit (x2) 512-bit (x2) Peak Memory Bandwidth 68 GB/s 352 GB/s 77 GB/s >500 GB/s 228 GB/s Both Xeon and KNL are suitable platforms; KNL provides higher scale & memory bandwidth.
    [Show full text]
  • Разработка Приложений Для Tizen. Tizen SDK
    Операционная система Tizen для разработчика Ренат Няжеметдинов Старший специалист по новым мобильным платформам, Samsung Russia Содержание • Введение в Tizen • Архитектура Tizen • Разработка приложений для Tizen 2 Введение в Введение в Tizen. О платформе • Операционная система: – открытый исходный код – ядро Linux • Архитектура: x86, ARM • Поддержка: – Linux Foundation – Tizen Association: • OEM: • Операторы: 4 Введение в Tizen. Основные черты Tizen • Web-ориентированная платформа – Высокий уровень поддержки стандартов HTML5 http://html5test.com – WebKit2 – Web-компоненты для интерфейса пользователя – Доступ к функциям устройства через Web API • Нативный API (C/C++) – Glibc, libstdc++, libxml2 – OpenGL® ES, OpenAL и OpenMP® – Высокий уровень совместимости с bada 2.0 API • Встроенный магазин приложений – Бесплатная регистрация – Предустановлен на все устройства Tizen 5 Введение в Tizen. Разновизности Tizen Мобильные Компьютеры Информационно- устройства (Tizen PC) развлекательные системы автомобилей (Tizen IVI) И другие устройства: • Телевизоры • Бытовая техника • Носимая электроника Часы (Gear 2) Камеры (NX300M) • Медицинская техника 6 Введение в Tizen. Profiles • Mobile – мобильные устройства • IVI – развлекательные системы для транспортных средств • PC – desktops, laptops • TV – DTV/STB/IPTV • Wearables – носимые устройства (например, Samsung Gear 2) • … 7 История Tizen 8 История релизов Tizen Mobile Profile Tizen 2.2.1 Tizen 2.2 Minor Update Tizen 2.1 Commercial Ready - Privacy menu in w/ Enhanced UX settings Tizen 2.0 Hybrid Web/Native,
    [Show full text]
  • Intel Xeon-Phi an Overview – MKL
    C-DAC Four Days Technology Workshop ON Hybrid Computing – Coprocessors/Accelerators Power-Aware Computing – Performance of Applications Kernels hyPACK-2013 Mode 3 : Intel Xeon Phi Coprocessors Lecture Topic : Intel Xeon-Phi An Overview – MKL Venue : CMSD, UoHYD ; Date : October 15-18, 2013 C-DAC hyPACK-2013 An Overview of MKL – Intel Xeon-Phi Prog. 1 An Overview of Xeon Phi Coprocessor Lecture Outline Following topics will be discussed Understanding of Xeon –Phi Architectures Programming on Xeon-Phi Architectures –MKL Tuning & Performance – Software Threading C-DAC hyPACK-2013 An Overview of MKL – Intel Xeon-Phi Prog. 2 MIC Architecture, System Overview C-DAC hyPACK-2013 An Overview of MKL – Intel Xeon-Phi Prog. 3 Systems with Accelerators A set (one or more) of very simple execution units that can perform few operations (with respect to standard CPU) with very high efficiency. When combined with full featured CPU (CISC or RISC) can accelerate the “nominal” speed of a system. CPU ACC Single thread perf. throughput CPU ACC. Physical integration CPU : Control Path & Data Path ALU : Arithmetic and logic units CPU & ACC ACC: Accumulator PC : Program counter Architectural integration Micro-Instructions (Register transfer) Source : NVIDIA, AMD, SGI, Intel, IBM Alter, Xilinux References C-DAC hyPACK-2013 An Overview of MKL – Intel Xeon-Phi Prog. 4 Compute modes vision Xeon Centric MIC Centric Xeon Scalar Symmetric Parallel MIC Hosted Co-processing Co-processing Hosted General purpose Codes with Highly-parallel serial and parallel balanced needs codes computing Codes with highly- Highly parallel parallel phases codes with scalar phases Source : References & Intel Xeon-Phi; http://www.intel.com/ C-DAC hyPACK-2013 An Overview of MKL – Intel Xeon-Phi Prog.
    [Show full text]
  • Best Practice Guide Intel Xeon Phi V2.0
    Best Practice Guide Intel Xeon Phi v2.0 Emanouil Atanassov, IICT-BAS, Bulgaria Michaela Barth, KTH, Sweden Mikko Byckling, CSC, Finland Vali Codreanu, SURFsara, Netherlands Nevena Ilieva, NCSA, Bulgaria Tomas Karasek, IT4Innovations, Czech Republic Jorge Rodriguez, BSC, Spain Sami Saarinen, CSC, Finland Ole Widar Saastad, University of Oslo, Norway Michael Schliephake, KTH, Sweden Martin Stachon, IT4Innovations, Czech Republic Janko Strassburg, BSC, Spain Volker Weinberg (Editor), LRZ, Germany 31-01-2017 1 Best Practice Guide Intel Xeon Phi v2.0 Table of Contents 1. Introduction .............................................................................................................................. 5 2. Intel MIC architecture & system overview ..................................................................................... 6 2.1. The Intel MIC architecture ............................................................................................... 6 2.1.1. Intel Xeon Phi coprocessor architecture overview ....................................................... 6 2.1.2. The cache hierarchy .............................................................................................. 8 2.2. Network configuration & system access .............................................................................. 9 3. Programming Models ............................................................................................................... 15 4. Native compilation ..................................................................................................................
    [Show full text]
  • Intel-Xeon-Phi-Coprocessor-Quick-Start
    Intel® Xeon Phi™ Coprocessor DEVELOPER’S QUICK START GUIDE FOR MICROSOFT* WINDOWS HOST White Paper Intel Xeon Phi Coprocessor DEVELOPER’S QUICK START GUIDE FOR MICROSOFT* WINDOWS HOST Version 1.2 Intel® Xeon Phi™ Coprocessor DEVELOPER’S QUICK START GUIDE FOR MICROSOFT* WINDOWS HOST Contents Introduction ........................................................................................................................................................................................................ 4 Goals ............................................................................................................................................................................................................................. 4 This document does: ...................................................................................................................................................................................... 4 This document does not: ............................................................................................................................................................................. 4 Terminology .............................................................................................................................................................................................................. 4 System Configuration ...................................................................................................................................................................................
    [Show full text]