Welcome to the Parallel Universe

Are You Ready to Enter a Parallel Universe: Optimizing Applications for Multicore

A look at parallelization methods made possible by the new Intel® Parallel Studio—designed for Microsoft Visual Studio* C/C++ developers of Windows* applications.

By Levent Akyil

"I know how to make four horses pull a cart—I don't know how to make 1,024 chickens do it."
— Enrico Clementi

The introduction of multicore processors started a new era for both consumers and software developers. While bringing vast opportunities to consumers, the increase in capabilities and processing power of new multicore processors puts new demands on developers, who must create products that efficiently use these processors. For that reason, Intel is committed to providing the software community with tools to preserve the investment it has in software development (Figure 1). Parallel Studio is composed of the following products: Intel® Parallel Advisor (design), Intel® Parallel Composer (code/debug), Intel® Parallel Inspector (verify), and Intel® Parallel Amplifier (tune).

Figure 1: The Intel® Parallel Studio development stages (design, code & debug, and tune)

Intel Parallel Composer speeds up software development by incorporating parallelism with a C/C++ compiler and comprehensive threaded libraries. By supporting a broad array of parallel programming models, a developer can find a match to the coding methods most appropriate for their application. Intel Parallel Inspector is a proactive "bug finder." It's a flexible tool that adds reliability regardless of the choice of parallelism programming models. Unlike traditional debuggers, Intel Parallel Inspector detects hard-to-find threading errors in multithreaded C/C++ Windows* applications and does root-cause analysis for defects such as data races and deadlocks. Intel Parallel Amplifier assists in fine-tuning parallel applications for optimal performance on multicore processors by helping find unexpected serializations that prevent scaling.

Intel Parallel Composer

Intel Parallel Composer enables developers to express parallelism with ease, in addition to taking advantage of multicore architectures. It provides parallel programming extensions, which are intended to quickly introduce parallelism. Intel Parallel Composer integrates and enhances the Microsoft Visual Studio environment with additional capabilities for parallelism at the application level, such as OpenMP 3.0*, lambda functions, auto-vectorization, auto-parallelization, and threaded libraries support. The award-winning Intel® Threading Building Blocks (Intel® TBB) is also a key component of Intel Parallel Composer that offers a portable, easy-to-use, high-performance way to do parallel programming in C/C++.

Some of the key extensions for parallelism Intel Parallel Composer brings are:

➤ Vectorization support: The compiler can automatically vectorize suitable loops, generating SIMD instructions that operate on several data elements at once.

➤ OpenMP 3.0 support: The Intel compiler supports all of the current industry-standard OpenMP directives and compiles parallel programs annotated with OpenMP directives. It also provides Intel-specific extensions to the OpenMP Version 3.0 specification, including runtime library routines and environment variables. Using the /Qopenmp switch enables the compiler to generate multithreaded code based on the OpenMP directives. The code can be executed in parallel on both uniprocessor and multiprocessor systems.

➤ Auto-parallelization feature: The auto-parallelization feature of the Intel compiler automatically translates serial portions of the input program into equivalent multithreaded code. Automatic parallelization determines the loops that are good work-sharing candidates and performs the dataflow analysis to verify correct parallel execution. It then partitions the data for threaded code generation as needed in programming with OpenMP directives. By using /Qparallel, the compiler will try to auto-parallelize the application.

➤ Intel Threading Building Blocks (Intel TBB): Intel TBB is an award-winning runtime-based parallel programming model, consisting of a template-based runtime library to help developers harness the latent performance of multicore processors. Intel TBB allows developers to write scalable applications that take advantage of concurrent collections and parallel algorithms.

➤ Simple concurrent functionality: Four keywords (__taskcomplete, __task, __par, and __critical) are used as statement prefixes to enable a simple mechanism to write parallel programs. The keywords are implemented using OpenMP runtime support. If you need more control over parallelization of your program, use OpenMP features directly. In order to enable this functionality, use the /Qopenmp compiler switch; the compiler driver automatically links in the OpenMP runtime support libraries, and the OpenMP runtime system manages the resulting parallel execution.

Example
// Using concurrent functionality
void sum (int length, int *a, int *b, int *c)
{
    int i;
    __taskcomplete {
        for (i = 0; i < length; i++)
            c[i] = a[i] + b[i];
    }
}
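Because the keywords above are built on the OpenMP runtime, the article's advice to use OpenMP features directly when more control is needed can be illustrated with a minimal sketch (not from the article; the function name sum_openmp is illustrative). Compiled with /Qopenmp, the directive tells the compiler to split the loop iterations across threads:

// Minimal OpenMP sketch: the work-sharing directive parallelizes the loop.
void sum_openmp(int length, int *a, int *b, int *c)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < length; i++)
        c[i] = a[i] + b[i];
}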


➤ Lambda functions: Intel Parallel Composer supports the C++ lambda functions described in the C++0x working papers (www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/). By using the /Qstd=c++0x switch, lambda support can be enabled.
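As a brief illustration (a sketch, not taken from the article; count_negatives is an illustrative name), a lambda can be passed directly to an STL algorithm once /Qstd=c++0x is in effect:

#include <algorithm>
#include <vector>

// Count the negative elements using an unnamed function object (a lambda).
int count_negatives(const std::vector<int> &v)
{
    return static_cast<int>(std::count_if(v.begin(), v.end(),
                                          [](int x) { return x < 0; }));
}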
➤ Intel® Integrated Performance Primitives (Intel® IPP): Now part of Intel Parallel Composer, Intel IPP is an extensive library of multicore-ready, highly optimized software functions for multimedia data processing and communications applications. Intel IPP provides optimized software building blocks to complement the Intel compiler and performance optimization tools. It provides basic low-level functions for creating applications in several domains, including signal processing, audio coding, speech recognition and coding, image processing, video coding, operations on small matrices, and 3-D data processing.

➤ Valarray: Valarray is a C++ standard template library (STL) container class for arrays, consisting of array methods for high-performance computing. The operations are designed to exploit hardware features such as vectorization. In order to take full advantage of valarray, the Intel compiler recognizes valarray as an intrinsic type and replaces such types with Intel IPP library calls.

Example
// Create a valarray of integers
valarray_t::value_type ibuf[10] = {0,1,2,3,4,5,6,7,8,9};
valarray_t vi(ibuf, 10);

// Create a valarray of booleans for a mask
maskarray_t::value_type mbuf[10] = {1,0,1,1,1,0,0,1,1,0};
maskarray_t mask(mbuf, 10);

// Double the values of the masked array
vi[mask] += static_cast<valarray_t>(vi[mask]);

➤ Intel® Parallel Debugger Extension: The Intel Parallel Debugger Extension for Microsoft Visual Studio is a debugging add-on for the Intel® C++ compiler's parallel code development features. It doesn't replace or change the Visual Studio debugging features; it simply extends what is already available with:

• Thread data sharing analysis to detect accesses to identical data elements from different threads
• A smart breakpoint feature to stop program execution on a re-entrant function call
• A serialized execution mode to enable or disable the creation of additional worker threads in OpenMP parallel loops
• A set of OpenMP runtime information views for advanced OpenMP program state analysis
• An SSE (Streaming SIMD Extensions) register view with extensive formatting and editing options for debugging parallel data using the SIMD (Single Instruction, Multiple Data) instruction set

As mentioned above, the Intel Parallel Debugger Extension is useful in identifying thread data sharing problems. It uses source instrumentation to detect such problems. To enable this feature, set /debug:parallel by enabling Enable Parallel Debug Checks under Configuration Properties > C/C++ > Debug. Figure 2 shows the Intel Parallel Debugger Extension breaking the execution of the application upon detecting two threads accessing the same data.

Intel Parallel Inspector

"It had never evidenced itself until that day … This fault was so deeply embedded, it took them weeks of poring through millions of lines of code and data to find it."
—Ralph DiNicola, spokesman for the U.S.-Canadian task force investigating the Northeast 2003 blackout

Finding the cause of errors in multithreaded applications can be a challenging task. Intel Parallel Inspector, an Intel Parallel Studio tool, is a proactive bug finder that helps you detect and perform root-cause analysis on threading and memory errors in multithreaded applications. Intel Parallel Inspector enables C and C++ application developers to:

➤ Locate a large variety of memory and resource problems, including leaks, buffer overrun errors, and pointer problems
➤ Detect and predict thread-related deadlocks, data races, and other synchronization problems
➤ Detect potential security issues in parallel applications
➤ Rapidly sort errors by size, frequency, and type to identify and prioritize critical problems

Intel Parallel Inspector (Figure 3) uses a binary instrumentation technology called Pin to check memory and threading errors. Pin is a dynamic instrumentation system provided by Intel (www.pintool.org), which allows C/C++ code to be injected into the areas of interest in a running executable. The injected code is then used to observe the behavior of the program.

Figure 3: Intel® Parallel Inspector toolbar

Memory Analysis Levels

Intel Parallel Inspector uses Pin in different settings to provide four levels of analysis, each having different configurations and different overhead, as seen in Figure 4. The first three analysis levels are targeted at memory problems occurring on the heap, while the fourth level can also analyze memory problems on the stack. The technologies employed by Intel Parallel Inspector to support all the analysis levels are the Leak Detection (Level 1) and Memory Checking (Levels 2–4) technologies, which use Pin in various ways.

Level 1: The first analysis level helps to find out if the application has any memory leaks. Memory leaks occur when a block of memory is allocated and never released.

Level 2: The second analysis level detects if the application has invalid memory accesses, including uninitialized memory accesses, invalid deallocations, and mismatched allocations/deallocations. Invalid memory accesses occur when a read or write instruction references memory that is logically or physically invalid. At this level, invalid partial memory accesses can also be identified. Invalid partial accesses occur when a read instruction references a block (2 bytes or more) of memory where part of the block is logically invalid.

Level 3: The third analysis level is similar to the second level, except that the call stack depth is increased to 12 from 1, and the enhanced dangling pointer check is enabled. Dangling pointers access data that no longer exist. Intel Parallel Inspector delays a deallocation when it occurs, so that the memory is not available for reallocation (it can't be returned by another allocation request). Thus, any references that follow the deallocation can be guaranteed to be invalid references from dangling pointers. This technique requires additional memory, and the memory used for the delayed deallocation list is limited; therefore Intel Parallel Inspector must eventually start actually deallocating the delayed references.

Level 4: The fourth analysis level tries to find all memory problems by increasing the call stack depth to 32, enabling the enhanced dangling pointer check, including system libraries in the analysis, and analyzing memory problems on the stack. The stack analysis is only enabled at this level.
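As a hypothetical illustration (not taken from the article; the function names are illustrative), the snippets below show the two defect classes these levels target: a heap block that is never released, and a read through a dangling pointer after its block has been freed.

// A memory leak: the buffer is allocated and used, but never released.
void make_leak()
{
    int *buffer = new int[64];
    buffer[0] = 1;
}   // 'buffer' goes out of scope here; the allocation is never freed

// A dangling pointer: the block is freed and then read again.
int read_after_free()
{
    int *p = new int(42);
    delete p;
    return *p;   // invalid read through a dangling pointer
}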

Figure 2: Intel® Parallel Debugger Extension can break the execution upon detecting a data sharing problem

Figure 4: Memory errors analysis levels


Figure 5: Intel® Parallel Inspector analysis result showing memory errors found

Figure 7: Intel® Parallel Inspector analysis result showing data race issues found

As seen in Figure 5, it is possible to filter the results by the severity, problem description, source of the problem, function name, and the module.

Intel Parallel Inspector Threading Errors Analysis Levels

Intel Parallel Inspector also provides four levels of analysis for threading errors (Figure 6).

Level 1: The first level of analysis helps determine if the application has any deadlocks. Deadlocks occur when two or more threads wait for the other to release resources such as a mutex, critical section, or thread handle, but none of the threads releases the resources. In this scenario, no thread can proceed. The call stack depth is set to 1.

Level 2: The second analysis level detects if the application has any data races or deadlocks. Data races are one of the most common threading errors and happen when multiple threads access the same memory location without proper synchronization. The call stack depth is also 1 at this level. The byte-level granularity for this level is 4.

Level 3: Like the previous level, Level 3 tries to find data races and deadlocks, but additionally tries to detect where they occur. The call stack depth is set to 12 for finer analysis. The byte-level granularity for this level is 1.

Level 4: The fourth level of analysis tries to find all threading problems by increasing the call stack depth to 32 and by analyzing the problems on the stack. The stack analysis is only enabled at this level. The byte-level granularity for this level is 1.

Figure 6: Threading errors analysis levels

The main threading errors Intel Parallel Inspector identifies are data races (Figure 7), deadlocks, lock hierarchy violations, and potential privacy infringement.

Data races can occur in various ways. Intel Parallel Inspector will detect write-write, read-write, and write-read race conditions:

➤ A write-write data race condition occurs when two or more threads write to the same memory location
➤ A read-write race condition occurs when one thread reads from a memory location while another thread writes to it concurrently
➤ A write-read race condition occurs when one thread writes to a memory location while a different thread concurrently reads from the same memory location

In all cases, the order of execution will affect the data that is shared.
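To make the write-write case concrete, here is a minimal sketch (not from the article) of the kind of pattern reported as a data race: two threads perform unsynchronized read-modify-write updates on the same variable, so the final value depends on thread scheduling.

#include <iostream>
#include <thread>

static int counter = 0;                 // shared data, no synchronization

void work()
{
    for (int i = 0; i < 100000; ++i)
        ++counter;                      // unsynchronized read-modify-write
}

int main()
{
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    std::cout << counter << std::endl;  // rarely the expected 200000
    return 0;
}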
Intel Parallel Amplifier

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs … We should forget about small efficiencies, say about 97 percent of the time: premature optimization is the root of all evil."
— Donald Knuth (adapted from C. A. R. Hoare)

Multithreaded applications tend to have their own unique sets of problems due to the complexities introduced by parallelism. Converting a serial code base to thread-safe code is not an easy task. It usually has an impact on development time and increases the complexity of the existing serial application. The common multithreading performance issues can be summarized in a nutshell as follows:

➤ Increased complexity (data restructuring, use of synchronization)
➤ Performance (requires optimization and tuning)
➤ Synchronization overhead

In keeping with Knuth's advice, Intel Parallel Amplifier (Figure 8) can help developers identify the bottlenecks in their code and focus on the optimizations with the highest return on investment (ROI). Identifying the performance issues in the target application and eliminating them appropriately is the key to an efficient optimization.

Figure 8: Intel® Parallel Amplifier toolbar

With a single mouse click, Intel Parallel Amplifier can perform three powerful performance analyses: hotspot analysis, concurrency analysis, and locks and waits analysis. Before explaining each analysis type, it is beneficial to explain the metrics used by Intel Parallel Amplifier.

Elapsed Time: The elapsed time is the amount of time the application executes. Reducing the elapsed time for an application running a fixed workload is one of the key metrics. The elapsed time for the application is reported in the summary view.

CPU Time: The CPU time is the amount of time a thread spends executing on a processor. For multiple threads, the CPU time of the threads is aggregated. The total CPU time is the sum of the CPU time of all the threads that run during the analysis.

Wait Time: The wait time is the amount of time that a given thread waited for some event to occur, such as synchronization waits and I/O waits.
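As a simple hypothetical illustration of these metrics (not from the article): if an application finishes a fixed workload in 3 seconds of wall-clock time while four threads each spend 2 seconds executing on a processor, the elapsed time is 3 seconds and the total CPU time is 4 × 2 = 8 seconds; any time those threads spend blocked on locks or I/O during the run is reported as wait time.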


Hotspot Analysis

By using a low-overhead statistical sampling (also known as stack sampling) algorithm, hotspot analysis (Figure 9) helps the developer understand the application flow and identify the sections of code that take a long time to execute (hotspots). During hotspot analysis, Intel Parallel Amplifier profiles the application by sampling at certain intervals using the OS timer. It collects samples of all active instruction addresses with their call sequences upon each sample. It then analyzes and displays these stored sampled instruction pointers (IP), along with the associated call sequences. The statistically collected IP samples with call sequences enable Intel Parallel Amplifier to generate and display a call tree.

Figure 9: Intel® Parallel Amplifier Hotspot analysis results

Concurrency Analysis

Concurrency analysis measures how an application utilizes the available processors on a given system. The concurrency analysis helps developers identify hotspot functions where processor utilization is poor, as seen in Figure 10. During concurrency analysis, Intel Parallel Amplifier collects and provides information on how many threads are active, meaning threads that are either running or queued and are not waiting at a defined waiting or blocking API. The number of running threads corresponds to the concurrency level of an application. By comparing the concurrency level with the number of processors, Intel Parallel Amplifier classifies how the application utilizes the processors in the system.

Figure 10: Intel® Parallel Amplifier concurrency analysis results. Granularity is set to Function-Thread

The time values in the concurrency and locks and waits windows correspond to the following utilization types (Figure 11):

Idle: All threads in the program are waiting—no threads are running. There can be only one node in the Summary tab graph indicating idle utilization.
Poor: Poor utilization. By default, poor utilization is when the number of threads is up to 50% of the target concurrency.
OK: Acceptable (OK) utilization. By default, OK utilization is when the number of threads is between 51% and 85% of the target concurrency.
Ideal: Ideal utilization. By default, ideal utilization is when the number of threads is between 86% and 115% of the target concurrency.

Figure 11: Intel® Parallel Amplifier concurrency analysis results summary view
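As a rough hypothetical illustration of these default thresholds (not from the article): on a quad-core system the target concurrency is 4, so an average of one or two running threads is classified as poor utilization, three as OK, and four as ideal.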
Locks and Waits Analysis

While concurrency analysis helps developers identify where their application is not parallel or not fully utilizing the available processors, locks and waits analysis helps developers identify the cause of the ineffective processor utilization (Figure 12). The most common cause of poor utilization is threads waiting too long on synchronization objects (locks). In most cases no useful work is done while they wait; as a result, performance suffers, resulting in low processor utilization.

During locks and waits analysis, developers can estimate the impact of each synchronization object. The analysis results help them understand how long the application was required to wait on each synchronization object, or in blocking APIs such as sleep and blocking I/O. The synchronization objects analyzed include mutexes (mutual exclusion objects), semaphores, critical sections, and fork-join operations. A synchronization object with the longest waiting time and a high concurrency level is very likely to be a bottleneck for the application.

Figure 12: Locks and waits analysis results

It is also very important to mention that for both Intel Parallel Inspector and Intel Parallel Amplifier, it is possible to drill down all the way to the source code level. For example, by double-clicking on a line item in Figure 13, I can drill down to the source code and observe which synchronization object is causing the problem.

Figure 13: Source code view of a problem in locks and waits analysis
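The following minimal sketch (not from the article; the names worker and results_lock are illustrative) shows the kind of pattern locks and waits analysis is designed to expose: each worker holds a single shared mutex for its entire loop, so the threads run one at a time and most of their time shows up as wait time on that lock.

#include <mutex>
#include <thread>
#include <vector>

std::mutex results_lock;      // single shared lock
long long results = 0;

void worker(const std::vector<int> &data)
{
    // Holding the lock around the whole loop serializes the workers.
    std::lock_guard<std::mutex> guard(results_lock);
    for (int x : data)
        results += x;
}

int main()
{
    std::vector<int> data(1000000, 1);
    std::thread t1(worker, std::cref(data));
    std::thread t2(worker, std::cref(data));
    t1.join();
    t2.join();
    return 0;
}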

Vast Opportunities

Parallel programming is not new. It has been well studied and has been employed in the high-performance computing community for many years, but now, with the expansion of multicore processors, parallel programming is becoming mainstream. This is exactly where Intel Parallel Studio comes into play. Intel Parallel Studio brings vast opportunities and tools that ease developers' transition to the realm of parallel programming and hence significantly reduce the entry barriers to the parallel universe. Welcome to the parallel universe.

Levent Akyil is Staff Software Engineer in the Performance, Analysis, and Threading Lab, Software Development Products, Intel Corporation.
