Letter from the Editor, by parallelism author and expert James Reinders

Are You Ready to Enter a Parallel Universe: Optimizing Applications for Multicore, by software engineer Levent Akyil

Welcome to the Parallel Universe

Contents

Think Parallel or Perish, by James Reinders
James Reinders, Lead Evangelist and a Director with Intel® Software Development Products, sees a future where every software developer needs to be thinking about parallelism first when programming. He first published "Think Parallel or Perish" three years ago. Now he revisits his comments to offer an update on where we have gone and what still lies ahead.

Parallelization Methodology
The four stages of parallel application development addressed by Intel® Parallel Studio.

Writing Parallel Code Safely, by Peter Varhol
Writing multithreaded code to take full advantage of multiple processors and multicore processors is difficult. The new Intel® Parallel Studio should help us bridge that gap.

Are You Ready to Enter a Parallel Universe: Optimizing Applications for Multicore, by Levent Akyil
A look at parallelization methods made possible by the new Intel® Parallel Studio—designed for Microsoft Visual Studio* C/C++ developers of Windows* applications.

8 Rules for Parallel Programming for Multicore, by James Reinders
There are some consistent rules that can help you solve the parallelism challenge and tap into the potential of multicore.

© 2009–2010, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Think Parallel or Perish
By James Reinders

Three years ago, as I thought about my own future in software development, I penned a piece noting that as software developers we all need to personally "Think Parallel" or face a future where others will do so and we will cease to be competitive. I am still confident that software development in 2016 will not be kind to programmers who have not learned to "Think Parallel."

Some of us will write parallel programs; some of us will not. Some will write using new languages to implement in, changing the way we do most everything. Most will not. What we must all do is learn to "Think Parallel." Understanding how parallelism is and will be used on computers, in applications, and in our environment is critical to our future. Grasping how to use the good and avoid the bad in this new world, intuitively, is the challenge ahead of us in "Thinking Parallel."

I still have my note stuck to my cube wall: "Memo to self: The fastest way to unemployment is to not hone my programming skills for exploiting parallelism."

Where are we now?

This year, 2009, is the fifth year in which Intel is shipping multicore processors. Those five years were not the first five years of multicore processors, but they were the five years that drove x86 processors to multicore in servers and desktop and laptop computers. In this time span, we've gone from a few multicore processor offerings to "game over" for single-core processors.

This year, 2009, we will also witness another monumental milestone: this will be the first year that Intel has a production manycore processor. Manycore processors differ from multicore processors by having a more sophisticated system of interconnecting processors, caches, memories, and I/O. This is needed at about 16 or so cores, so manycore processors can be expected to have more cores than multicore processors. The world has never seen a high-volume manycore processor, but over the next few years it will.

This dawning of manycore processors, as profound as it is, will not affect very many programmers in 2009. It will, however, make this new parallel era even more real than multicore processors have. The opportunity for hardware to be enormously parallel is staggering.

The "not parallel" era we are now exiting will appear to be a very primitive time in the history of computers when people look back in a hundred years. The world works in parallel, and it is time for computer programs to do the same. We are leaving behind an era of too little parallelism, to which I say: "It's about time!" Doing one thing at a time is "so yesterday." In less than a decade, a programmer who does not "Think Parallel" first will not be a programmer.

What's Intel doing to help?

When I first wrote about this a few years ago, Intel was already a longtime champion of the OpenMP* standard (www.openmp.org) and had many tools for MPI support. Our performance analysis tools and debuggers were ahead of their time, with top-to-bottom support for parallelism. We also had emerging tools for finding deadlocks and race conditions in parallel programs.

Since then, we've helped update OpenMP to specification version 3.0 and were among the very first to support it in our compilers. We launched our own MPI library, which is more portable and higher performing than prior options. We created Intel® Threading Building Blocks (Intel® TBB), which has become the best and most popular way to program parallelism in C++. And we have gone to beta with our Intel® Parallel Studio project. This takes everything we've done to date and makes it much more intuitive, robust, and exciting. Soon, it will be a full product from Intel.

These new projects are in addition to all the tools we had a few years ago. We are growing and expanding our tools to support customers while they expand to use parallelism. We are generally not forcing obsolescence as we expand to offer parallelism. Our customers appreciate that.

I'm also quite excited about Intel's Ct technology, and we'll be talking more about that as the year goes on. It is one of several more projects we have been experimenting with, and talking with customers about, to guide us on what more we should be doing to help.

We are building tools that help with parallel programming. We are embracing standards, extending current languages, and using library APIs, all with an eye on scalability, correctness, and maintainability. It is still up to us, as software developers, to know what to do with these wonderful tools. Just as before parallelism, a good design comes from the human developer, not the tools. Parallelism will be no different. Hence, we humans need to work on "Think Parallel."

James Reinders
Portland, Oregon
April 2009

James Reinders is Chief Software Evangelist and Director of Software Development Products, Intel Corporation. His articles and books on parallelism include Intel® Threading Building Blocks: Outfitting C++ for Multicore Processor Parallelism. Find out more at www.go-parallel.com.


Parallelization Methodology

The Intel® Software Development Tools address four stages of parallel application development:

1. Code and debug with Intel® Parallel Composer: Use the source code analysis capabilities of Intel Parallel Composer to identify errors in the source code during the compilation phase. To enable source code analysis, use the /Qdiag-enable:sc{[1|2|3]} compiler option. The number specifies the severity level of the diagnostics (1 = all critical errors, 2 = all errors, and 3 = all errors and warnings). The following scenarios represent the various ways in which Intel Parallel Composer supports coding of parallel applications; they are likely to manifest in different applications:

SCENARIO 1: User decides to use OpenMP 3.0* to add parallelism.
SCENARIO 2: User has code with cryptography and wants to parallelize. Add Intel® Integrated Performance Primitives (Intel® IPP) to replace the cryptography code. Add OpenMP for parallelism.
SCENARIO 3: User adds an Intel® Threading Building Blocks (Intel® TBB) pipeline.
SCENARIO 4: User has audio/video code and wants to parallelize. Add Intel® IPP to the pertinent domains. Add Intel TBB to parallelize additional regions.
SCENARIO 5: User has code with Windows* threads. Add Intel TBB with lambda function support to new regions to parallelize.

For the scenarios where OpenMP is used to introduce parallelism, use the Intel® Parallel Debugger Extension for Microsoft Visual Studio* (http://software.intel.com/en-us/articles/preview/parallel-debugger-extension) to debug the OpenMP code. The debugger has a set of features that can be accessed through the Microsoft Visual Studio Debug pull-down menu. It can provide insights into thread data sharing violations, detect re-entrant function calls, and provide access to SSE registers and to OpenMP data structures such as tasks, task wait lists, task spawn trees, barriers, locks, and thread teams. To debug an application with the Intel Parallel Debugger, the application must be built with the Intel® C++ Compiler using the /debug:parallel option.

2. Verify with Intel® Parallel Inspector: Run Intel Parallel Inspector to find threading errors, such as data races and deadlocks, and memory errors that are likely to occur. If you find errors, fix them and rebuild the application with Intel Parallel Composer.

3. Tune with Intel® Parallel Amplifier: Intel Parallel Amplifier enables you to get the maximum performance from the available processor cores. Use its concurrency analysis to find processor underutilization: cases where the code is not fully utilizing processor resources. Implement performance improvements and rebuild the application with Intel Parallel Composer. Use Intel Parallel Inspector to check for possible threading errors, and then verify the effects of the performance improvements with Intel Parallel Amplifier.

4. Check in the code.

5. Go back to step 1 and find a new region to add parallelism to.

The "Parallelizing N-Queens with the Intel® Parallel Composer" tutorial provided with Parallel Composer offers hands-on training on how to apply each of the parallelization techniques discussed in this document to implement a parallel solution to the N-Queens problem, a more general version of the Eight Queens Puzzle (http://en.wikipedia.org/wiki/N-queens). The tutorial is located in the "Samples\NQueens" folder under the Intel Parallel Composer main installation folder.

TECHNICAL SUPPORT: Visit Intel's online Community Support User Forums and Knowledge Base to get all the help you need from our own tools and parallelism experts, and from your fellow developers. Go to www.intel.com/software/support to start your search.
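As a concrete illustration of Scenario 1, here is a minimal sketch (not from the original text) of a loop parallelized with an OpenMP work-sharing directive. Building it with the Intel C++ Compiler's /Qopenmp switch, or any OpenMP-capable compiler, produces multithreaded code:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];
    int i;

    /* Initialize the inputs serially. */
    for (i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* The work-sharing directive divides the iterations among the
       threads in the team; each iteration is independent, so no
       further synchronization is needed. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %.1f, max threads = %d\n", c[N - 1], omp_get_max_threads());
    return 0;
}

Intel Parallel Inspector and Intel Parallel Amplifier can then be run against the resulting binary, as steps 2 and 3 describe.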


Writing Parallel Code Safely
By Peter Varhol

Writing multithreaded code to take full advantage of multiple processors and multicore processors is difficult. The new Intel® Parallel Studio should help us bridge that gap.

With the explosion of multicore processors, the pressure is now on application developers to make effective use of this computing power. Developers are increasingly going to have to identify opportunities for multithreading and independent parallel operation in their applications, and to be able to implement those techniques in their code.

But these are difficult activities; understanding how to protect data and system resources during parallel operations requires technical expertise and attention to detail. The likelihood of errors, such as race conditions or overwritten data, is high. And identifying areas in code where parallel execution is feasible is a challenge in and of itself. It typically requires a deep understanding not only of the code, but also of the flow of application execution as a whole. This is very difficult to achieve with current tools and techniques.

As a result, few developers want to write parallel code that takes advantage of multiple processors or processor cores, or are technically prepared to do so. This kind of coding is considered specialized and niche, even though virtually all modern servers and many workstations and laptops use multicore processors. But multiprocessor systems and multicore processors aren't going away; in fact, the industry is going to see processors with an increasing number of cores over the next several years. It is likely that even desktop systems will soon have 16 or more processor cores. Both commercial and custom software have to make better use of the computing power afforded by these processors.

It's difficult. But tools are emerging that can make a difference in the development of parallelized applications for multicore processors and multiple processors. For example, Intel® Parallel Studio provides essential tools for working with multithreaded code, enabling developers to more easily identify errors in threads and memory, and to identify and analyze bottlenecks. Intel Parallel Studio is available for beta program download at www.intel.com/software/parallelstudio. It consists of Intel® Parallel Composer, Intel® Parallel Inspector, and Intel® Parallel Amplifier. Together, these tools enable developers building multithreaded applications to quickly construct and debug applications that make more efficient use of today's processors.

Intel Parallel Composer is the cornerstone of the package. It consists of a parallel debugger plug-in that simplifies the debugging of parallel code and helps to ensure thread accuracy, along with a C++ compiler to build code. It also includes Intel® Threading Building Blocks (Intel® TBB) and Intel® Integrated Performance Primitives (Intel® IPP), which provide a variety of threaded generic and application-specific functions that enable developers to quickly add parallelism to applications. The parallel libraries and the Intel TBB C++ template library let developers abstract threads to tasks and more easily create multithreaded applications that can execute in parallel.

Intel Parallel Inspector detects threading and memory errors and provides guidance to help ensure application reliability. Intel Parallel Amplifier is a performance profiler that makes it straightforward to quickly find multicore performance bottlenecks without needing to know the processor architecture or assembly code.

Challenge of Parallel Construction

Developers increasingly find that their application code doesn't fully utilize modern multicore processors. In many cases they can achieve high performance on a single core but, without significant effort geared toward making the code operate in parallel, cannot scale to make use of the cores available.

Parallel code is code that is able to execute independently of other code threads. On a single-processor or single-core system, it can offer at least the illusion of speeding up performance, because processor downtime on one thread can be used to run other threads. On multiprocessor and multicore systems, it is essential to take full advantage of the multiple separate execution pipelines.

It's not easy. Most code is written in either a straightforward execution fashion or a top-down fashion. Either way, developers don't have a good opportunity to look at their code from the standpoint of operations that can be parallelized. The common development processes of today simply don't afford them the opportunity to do so. Instead, developers have to view their code within the context of application execution. They have to be able to envision the execution of the application, and the sequence in which the source code will execute once a user is running that application. Operations that can occur independently in this context are candidates for parallel computation.

But that's only the initial step in the process. Once developers have determined which execution paths are able to run in parallel, they then have to make it happen. Threading the code isn't enough; developers have to be able to protect the threads once they launch and execute independently. This involves setting up critical sections of the code, ensuring that nothing related (or nothing else at all, depending on how critical) can execute while that section is running.

However, critical sections tend to lock the resources they are working with, slowing execution and sometimes leading to deadlock. If the resources remain locked so that other threads cannot execute, and those other threads also hold resources that are required by the running thread, no further execution can occur. Because such a deadlock involves different resources from multiple threads, identifying the resources that cause the problem and locking out those resources at the right time is a highly detailed and error-prone activity.

The Race Condition

If multiple threads are running simultaneously and the result is dependent upon which thread finishes first, it is known as a race condition: the race of threads to completion affects the result of the computation. The threads exchange data during execution, and that data may be different depending on when a particular thread is running and how far along it is in executing. Of course, the threads are not supposed to be exchanging data while they are supposedly executing independently. However, the protections provided in the code—usually some form of mutex or critical section—are either absent or not sufficient, allowing data to be exchanged among threads when they should be in their critical sections.

A race condition is among the most insidious and difficult errors to find in programming, in part because it may occur only some of the time. This turns one of the most fundamental tenets of computer execution on its head: that anything a computer does is deterministic. In reality, unless an error occurs all of the time, it is almost impossible to identify and analyze in code. Race conditions are found almost by accident—stumbled upon because developers know that something is wrong yet are not able to localize the issue. Tools that provide developers with an in-depth and detailed look at thread execution may be the only way to identify these and other highly technical and difficult-to-identify issues in multithreaded code.

Memory leaks and similar memory errors in threads are also difficult to identify and diagnose in parallel execution. Memory leaks are fairly common in C and C++ code, as memory is allocated for use but not automatically returned to the free memory list. They are especially difficult to detect because they can occur in any running thread, and typically don't manifest themselves during the short test runs often performed by developers and testers. Rather, lost memory accumulates over time, initially slowing down application execution as the heap becomes larger, and finally causing program crashes as the heap exceeds allocated memory.
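Returning to the deadlock scenario described above: here is a minimal sketch (illustrative only, not from the article) of two threads acquiring the same two locks in opposite order, written with C++ standard library threads:

#include <mutex>
#include <thread>

std::mutex resource_a;   // two shared resources, each guarded by a lock
std::mutex resource_b;

void worker1()
{
    std::lock_guard<std::mutex> hold_a(resource_a);  // holds A...
    std::lock_guard<std::mutex> hold_b(resource_b);  // ...then waits for B
    /* ... use both resources ... */
}

void worker2()
{
    std::lock_guard<std::mutex> hold_b(resource_b);  // holds B...
    std::lock_guard<std::mutex> hold_a(resource_a);  // ...then waits for A
    /* ... use both resources ... */
}

int main()
{
    // If worker1 acquires A while worker2 holds B, each waits forever
    // for the resource the other holds: a deadlock.
    std::thread t1(worker1), t2(worker2);
    t1.join();
    t2.join();
    return 0;
}

The standard cure is to acquire locks in one global order in every thread (or to acquire both atomically, for example with std::lock). That discipline is hard to verify by inspection, which is exactly the kind of problem the tools described below are meant to catch.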
code intended for high-performance They have to be able to envision the Intel Helps Find Parallel Bugs operation on multicore processors can’t easily be done with- execution of the application, and the Intel Parallel Studio’s components enable developers to build out the ability to easily add parallel functions, debug those sequence of how the source code will multithreaded applications, which are better able to take functions, look for deadlock and race conditions, and analyze execute once a user is running that advantage of multiple processors or processor cores, than single performance. application. Operations that can occur threaded ones. This plug-in to Microsoft Visual Studio* provides Intel Parallel Studio provides the software tools for doing all independently in this context are can- memory analysis, thread analysis, debugging, threaded libraries, of this, within the context of Microsoft Visual Studio. By using didates for parallel computation. and performance analysis for these complex programs. Intel Parallel Studio, developers can rapidly build, analyze, and But that’s only the initial step in Intel Parallel Inspector lets developers analyze memory use in debug multithreaded applications for use in demanding environ- the process. Once developers have multithreaded applications. Memory is a resource that is common- ments for parallel execution. determined what execution paths are ly misused in single-threaded applications, as developers allocate However we might write parallel code in the future, Intel able to run in parallel, they then have memory and don’t reclaim that memory after the operations have is now providing tools that might help us do so. It is the only to make it happen. Threading the been completed. Finding and fixing these errors in multithreaded way that we can start using the computing power the last few code isn’t enough; developers have to programs can be a difficult and time-consuming task. years have afforded us. And if we don’t take advantage of that be able to protect the threads once Intel Parallel Inspector can also help developers detect and computing power, computer users are losing out on what Intel is they launch and execute indepen- analyze threading errors such as deadlock, and will also help delivering in terms of new and innovative technology. dently. This involves setting up critical detect race conditions. It does so by observing the behavior of

Intel Helps Find Parallel Bugs

Intel Parallel Studio's components enable developers to build multithreaded applications that are better able to take advantage of multiple processors or processor cores than single-threaded ones. This plug-in to Microsoft Visual Studio* provides memory analysis, thread analysis, debugging, threaded libraries, and performance analysis for these complex programs.

Intel Parallel Inspector lets developers analyze memory use in multithreaded applications. Memory is a resource that is commonly misused in single-threaded applications, as developers allocate memory and don't reclaim it after the operations have been completed. Finding and fixing these errors in multithreaded programs can be a difficult and time-consuming task.

Intel Parallel Inspector can also help developers detect and analyze threading errors such as deadlock, and will also help detect race conditions. It does so by observing the behavior of individual threads and how they use and free memory. For race conditions, it can look at which threads are completing first and how that affects results.

Figure 1. Intel® Parallel Studio's Intel® Parallel Inspector enables developers to identify memory leaks and other memory errors in running multithreaded applications.

Intel Parallel Amplifier is first and foremost a performance profiler—it tells you where your code is spending most of its time: the hotspots, so to speak. But it does more than that. It also looks at where a program or its threads are waiting, and at where concurrency might not be performing satisfactorily. In other words, it identifies idle threads, waiting within the program, and resources that remain idle for inordinate lengths of time.

Figure 2. Intel® Parallel Amplifier provides a way for developers to analyze the performance of their applications from the standpoint of multithreaded execution. It provides a performance profiler, as well as the ability to analyze idle time and bottlenecks.

Any one of these tools by itself isn't sufficient to greatly assist the development of multithreaded code for parallel execution. Writing multithreaded code can occur without existing, debugged libraries for parallel execution. Finding memory leaks is possible without a multithreading memory tool. And analyzing deadlock and race conditions is conceivable without specialized tools for looking at thread execution. But writing, analyzing, and debugging multithreaded code intended for high-performance operation on multicore processors can't easily be done without the ability to easily add parallel functions, debug those functions, look for deadlock and race conditions, and analyze performance.

Intel Parallel Studio provides the software tools for doing all of this, within the context of Microsoft Visual Studio. By using Intel Parallel Studio, developers can rapidly build, analyze, and debug multithreaded applications for use in demanding environments for parallel execution.

However we might write parallel code in the future, Intel is now providing tools that might help us do so. It is the only way that we can start using the computing power the last few years have afforded us. And if we don't take advantage of that computing power, computer users are losing out on what Intel is delivering in terms of new and innovative technology.

Peter Varhol is principal at Technology Strategy Research, LLC, an industry analysis and consulting firm.


Are You Ready to Enter a Parallel Universe: Optimizing Applications for Multicore
By Levent Akyil

A look at parallelization methods made possible by the new Intel® Parallel Studio—designed for Microsoft Visual Studio* C/C++ developers of Windows* applications.

"I know how to make four horses pull a cart—I don't know how to make 1,024 chickens do it." —Enrico Clementi

The introduction of multicore processors started a new era for both consumers and software developers. While bringing vast opportunities to consumers, the increase in the capabilities and processing power of new multicore processors puts new demands on developers, who must create products that efficiently use these processors. For that reason, Intel is committed to providing the software community with tools to preserve the investment it has in software development.

Intel® Parallel Studio addresses four stages of parallel application development (Figure 1) and is composed of the following products: Intel® Parallel Advisor (design), Intel® Parallel Composer (code/debug), Intel® Parallel Inspector (verify), and Intel® Parallel Amplifier (tune).

Figure 1: The four stages of parallel application development: design, code and debug, verify, and tune

Intel Parallel Composer speeds up software development by incorporating parallelism with a C/C++ compiler and comprehensive threaded libraries. By supporting a broad array of parallel programming models, it lets a developer find a match to the coding methods most appropriate for their application. Intel Parallel Inspector is a proactive "bug finder." It's a flexible tool that adds reliability regardless of the choice of parallelism programming models. Unlike traditional debuggers, Intel Parallel Inspector detects hard-to-find threading errors in multithreaded C/C++ Windows* applications and does root-cause analysis for defects such as data races and deadlocks. Intel Parallel Amplifier assists in fine-tuning parallel applications for optimal performance on multicore processors by helping find unexpected serializations that prevent scaling.

Intel Parallel Composer

Intel Parallel Composer enables developers to express parallelism with ease, in addition to taking advantage of multicore architectures. It provides parallel programming extensions intended to quickly introduce parallelism, and it integrates and enhances the Microsoft Visual Studio environment with additional capabilities for parallelism at the application level, such as OpenMP 3.0*, lambda functions, auto-vectorization, auto-parallelization, and threaded libraries support. Some of the key extensions for parallelism Intel Parallel Composer brings are:

• OpenMP 3.0* support: The Intel compiler supports all of the current industry-standard OpenMP directives and compiles parallel programs annotated with OpenMP directives. It also provides Intel-specific extensions to the OpenMP Version 3.0 specification, including runtime library routines and environment variables. The /Qopenmp switch enables the compiler to generate multithreaded code based on the OpenMP directives; the code can be executed in parallel on both uniprocessor and multiprocessor systems.

• Auto-parallelization: The auto-parallelization feature of the Intel compiler automatically translates serial portions of the input program into equivalent multithreaded code. Automatic parallelization determines the loops that are good work-sharing candidates and performs the dataflow analysis to verify correct parallel execution. It then partitions the data for threaded code generation, as needed in programming with OpenMP directives. With /Qparallel, the compiler will try to auto-parallelize the application.

• Vectorization support: The compiler's auto-vectorizer detects loops that can be executed safely with SIMD instructions and generates the vector code automatically.

• Simple concurrent functionality: Four keywords (__taskcomplete, __task, __par, and __critical) are used as C statement prefixes to enable a simple mechanism to write parallel programs. The keywords are implemented using OpenMP runtime support; if you need more control over the parallelization of your program, use OpenMP features directly. To enable this functionality, use the /Qopenmp compiler switch. The compiler driver automatically links in the OpenMP runtime support libraries, and the runtime system manages the actual degree of parallelism.

Example:

void sum (int length, int *a, int *b, int *c)
{
   int i;
   for (i = 0; i < length; i++)
      c[i] = a[i] + b[i];
}

// Using concurrent functionality: run the two halves of the
// arrays as concurrent tasks; __taskcomplete waits for both.
__taskcomplete
{
   __task sum(500, a, b, c);
   __task sum(500, a + 500, b + 500, c + 500);
}

• Intel® Threading Building Blocks (Intel® TBB): Intel TBB is an award-winning runtime-based parallel programming model, consisting of a template-based runtime library to help developers harness the latent performance of multicore processors. Intel TBB allows developers to write scalable applications that take advantage of concurrent collections and parallel algorithms.
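Example (a sketch, not from the original article): the same sum expressed with Intel TBB. Here tbb::parallel_for splits the index range into chunks and runs the chunks on a pool of worker threads; the C++0x lambda syntax shown requires the /Qstd=c++0x switch discussed below:

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

void sum_tbb(int length, int *a, int *b, int *c)
{
    // The TBB scheduler subdivides [0, length) and maps the pieces
    // onto worker threads; each piece runs the lambda body.
    tbb::parallel_for(tbb::blocked_range<int>(0, length),
        [=](const tbb::blocked_range<int> &r) {
            for (int i = r.begin(); i != r.end(); ++i)
                c[i] = a[i] + b[i];
        });
}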


• C++ lambda functions: The Intel compiler supports the lambda functions proposed for the C++0x standard (www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/). By using the /Qstd=c++0x switch, lambda support can be enabled.

• Intel® Parallel Debugger Extension: The Intel Parallel Debugger Extension for Microsoft Visual Studio is a debugging add-on for the Intel® C++ compiler's parallel code development features. It doesn't replace or change the Visual Studio debugging features; it simply extends what is already available with:
  – Thread data sharing analysis to detect accesses to identical data elements from different threads
  – A smart breakpoint feature to stop program execution on a re-entrant function call
  – A serialized execution mode to enable or disable the creation of additional worker threads in OpenMP parallel loops
  – A set of OpenMP runtime information views for advanced OpenMP program state analysis
  – An SSE (Streaming SIMD Extensions) register view with extensive formatting and editing options for debugging parallel data using the SIMD (Single Instruction, Multiple Data) instruction set

As mentioned above, the Intel Parallel Debugger Extension is useful in identifying thread data sharing problems. It uses source instrumentation to detect such problems. To enable this feature, set /debug:parallel by enabling Enable Parallel Debug Checks under Configuration Properties > C/C++ > Debug. Figure 2 shows the Intel Parallel Debugger Extension breaking the execution of an application upon detecting two threads accessing the same data.

Figure 2: Intel® Parallel Debugger Extension can break the execution upon detecting a data sharing problem

• Intel® Integrated Performance Primitives (Intel® IPP): Now part of Intel Parallel Composer, Intel IPP is an extensive library of multicore-ready, highly optimized software functions for multimedia data processing and communications applications. Intel IPP provides optimized software building blocks to complement the Intel compiler and performance optimization tools, and it provides basic low-level functions for creating applications in several domains, including signal processing, audio coding, speech recognition and coding, image processing, video coding, operations on small matrices, and 3-D data processing.

• Valarray: Valarray is a C++ Standard Template Library (STL) container class for arrays, with array methods for high-performance computing. The operations are designed to exploit hardware features such as vectorization. To take full advantage of valarray, the Intel compiler recognizes valarray as an intrinsic type and replaces such types with Intel IPP library calls.

Example:

// Create a valarray of integers
valarray_t::value_type ibuf[10] = {0,1,2,3,4,5,6,7,8,9};
valarray_t vi(ibuf, 10);

// Create a valarray of booleans for a mask
maskarray_t::value_type mbuf[10] = {1,0,1,1,1,0,0,1,1,0};
maskarray_t mask(mbuf, 10);

// Double the values of the masked array
vi[mask] += static_cast<valarray_t>(vi[mask]);

Intel Parallel Inspector

"It had never evidenced itself until that day … This fault was so deeply embedded, it took them weeks of poring through millions of lines of code and data to find it."
—Ralph DiNicola, spokesman for the U.S.-Canadian task force investigating the Northeast 2003 blackout

Finding the cause of errors in multithreaded applications can be a challenging task. Intel Parallel Inspector, an Intel Parallel Studio tool, is a proactive bug finder that helps you detect and perform root-cause analysis on threading and memory errors in multithreaded applications. It enables C and C++ application developers to:

➤ Locate a large variety of memory and resource problems, including leaks, buffer overrun errors, and pointer problems
➤ Detect and predict thread-related deadlocks, data races, and other synchronization problems
➤ Detect potential security issues in parallel applications
➤ Rapidly sort errors by size, frequency, and type to identify and prioritize critical problems

Intel Parallel Inspector (Figure 3) uses a binary instrumentation technology called Pin to check for memory and threading errors. Pin is a dynamic instrumentation system provided by Intel (www.pintool.org) that allows C/C++ code to be injected into the areas of interest in a running executable. The injected code is then used to observe the behavior of the program.

Figure 3: Intel® Parallel Inspector toolbar

Memory Analysis Levels

Intel Parallel Inspector uses Pin in different settings to provide four levels of analysis, each having a different configuration and different overhead, as seen in Figure 4. The first three analysis levels target memory problems occurring on the heap, while the fourth level can also analyze memory problems on the stack. The technologies employed by Intel Parallel Inspector to support all the analysis levels are the Leak Detection (Level 1) and Memory Checking (Levels 2–4) technologies, which use Pin in various ways.

Level 1: The first analysis level helps to find out if the application has any memory leaks. Memory leaks occur when a block of memory is allocated and never released.

Level 2: The second analysis level detects whether the application has invalid memory accesses, including uninitialized memory accesses, invalid deallocations, and mismatched allocations/deallocations. Invalid memory accesses occur when a read or write instruction references memory that is logically or physically invalid. At this level, invalid partial memory accesses can also be identified; these occur when a read instruction references a block (2 bytes or more) of memory of which part is logically invalid.

Level 3: The third analysis level is similar to the second, except that the call stack depth is increased from 1 to 12 and the enhanced dangling pointer check is enabled. Dangling pointers point to data that no longer exists. Intel Parallel Inspector delays each deallocation so that the memory is not available for reallocation (it can't be returned by another allocation request); thus, any references that follow the deallocation are guaranteed to be invalid references from dangling pointers. This technique requires additional memory, and the memory used for the delayed deallocation list is limited, so Intel Parallel Inspector must eventually start actually deallocating the delayed references.

Level 4: The fourth analysis level tries to find all memory problems by increasing the call stack depth to 32, enabling the enhanced dangling pointer check, including system libraries in the analysis, and analyzing memory problems on the stack. The stack analysis is only enabled at this level.

Figure 4: Memory errors analysis levels
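For illustration (this fragment is not from the article), here is a small, deliberately buggy C++ program containing the kinds of defects the four analysis levels are designed to surface:

#include <cstdio>
#include <cstdlib>

int main()
{
    // Level 1: a leak -- this block is allocated and never released.
    int *leaked = (int *)malloc(100 * sizeof(int));
    (void)leaked;

    // Level 2: an uninitialized read -- nothing was ever stored
    // in values[5] before it is used.
    int *values = new int[10];
    printf("uninitialized: %d\n", values[5]);

    // Level 2: mismatched allocation/deallocation -- memory from
    // new[] must be released with delete[], not free().
    free(values);

    // Levels 3-4: a dangling pointer -- 'stale' still refers to
    // memory that has already been released.
    int *stale = new int(42);
    delete stale;
    printf("dangling: %d\n", *stale);

    return 0;
}

Running Level 1 analysis on this program reports the leak; Levels 2 through 4 flag the remaining accesses, with the higher levels supplying deeper call stacks for the diagnosis.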


Figure 5: Intel® Parallel Inspector analysis result showing memory errors found

As seen in Figure 5, it is possible to filter the results by severity, problem description, source of the problem, function name, and module.

Threading Errors Analysis Levels

Intel Parallel Inspector also provides four levels of analysis for threading errors (Figure 6).

Level 1: The first level of analysis helps determine whether the application has any deadlocks. Deadlocks occur when two or more threads each wait for the other to release a resource, such as a mutex, critical section, or thread handle, but none of the threads releases its resources; no thread can proceed. The call stack depth is set to 1.

Level 2: The second analysis level detects whether the application has any data races or deadlocks. Data races are among the most common threading errors and happen when multiple threads access the same memory location without proper synchronization. The call stack depth is also 1 at this level, and the byte-level granularity is 4.

Level 3: Like the previous level, Level 3 tries to find data races and deadlocks, but it additionally tries to detect where they occur. The call stack depth is set to 12 for finer analysis, and the byte-level granularity is 1.

Level 4: The fourth level of analysis tries to find all threading problems by increasing the call stack depth to 32 and analyzing the problems on the stack, which is only enabled at this level. The byte-level granularity is 1.

Figure 6: Threading errors analysis levels

The main threading errors Intel Parallel Inspector identifies are data races (Figure 7), deadlocks, lock hierarchy violations, and potential privacy infringement.

Figure 7: Intel® Parallel Inspector analysis result showing data race issues found

Data races can occur in various ways. Intel Parallel Inspector will detect write-write, read-write, and write-read race conditions:

➤ A write-write data race condition occurs when two or more threads write to the same memory location.
➤ A read-write race condition occurs when one thread reads from a memory location while another thread concurrently writes to it.
➤ A write-read race condition occurs when one thread writes to a memory location while a different thread concurrently reads from it.

In all cases, the order of execution affects the data that is shared.
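The write-write case is easy to reproduce. The OpenMP sketch below (illustrative, not from the article) races on a shared counter, then shows one fix: serializing just the conflicting update with an atomic operation.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int racy = 0, safe = 0;
    int i;

    /* Write-write data race: every thread performs an unsynchronized
       read-modify-write on 'racy', so the result varies run to run. */
    #pragma omp parallel for
    for (i = 0; i < 10000; i++)
        racy += i;

    /* One fix: protect the conflicting update itself. */
    #pragma omp parallel for
    for (i = 0; i < 10000; i++) {
        #pragma omp atomic
        safe += i;
    }

    printf("racy = %d, safe = %d (expected %d)\n", racy, safe, 49995000);
    return 0;
}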

Intel Parallel Amplifier

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs … We should forget about small efficiencies, say about 97 percent of the time: premature optimization is the root of all evil."
—Donald Knuth (adapted from C. A. R. Hoare)

Multithreaded applications tend to have their own unique sets of problems due to the complexities introduced by parallelism. Converting a serial code base to thread-safe code is not an easy task; it usually has an impact on development time and increases the complexity of the existing serial application. The common multithreading performance issues can be summarized in a nutshell as follows:

➤ Increased complexity (data restructuring, use of synchronization)
➤ Performance (requires optimization and tuning)
➤ Synchronization overhead

In keeping with Knuth's advice, Intel Parallel Amplifier (Figure 8) can help developers identify the bottlenecks in their code whose optimization has the greatest return on investment (ROI). Identifying the performance issues in the target application and eliminating them appropriately is the key to an efficient optimization.

Figure 8: Intel® Parallel Amplifier toolbar

With a single mouse click, Intel Parallel Amplifier can perform three powerful performance analyses: hotspot analysis, concurrency analysis, and locks and waits analysis. Before explaining each analysis type, it is useful to explain the metrics used by Intel Parallel Amplifier.

Elapsed time: The elapsed time is the amount of time the application executes. Reducing the elapsed time for an application running a fixed workload is one of the key metrics; the elapsed time is reported in the summary view.

CPU time: The CPU time is the amount of time a thread spends executing on a processor. For multiple threads, the CPU time of the threads is aggregated; the total CPU time is the sum of the CPU time of all the threads that run during the analysis.

Wait time: The wait time is the amount of time that a given thread waited for some event to occur, such as synchronization waits and I/O waits.


Hotspot Analysis

Using a low-overhead statistical sampling (also known as stack sampling) algorithm, hotspot analysis (Figure 9) helps the developer understand the application flow and identify the sections of code that take a long time to execute (hotspots). During hotspot analysis, Intel Parallel Amplifier profiles the application by sampling at certain intervals using the OS timer. It collects samples of all active instruction addresses with their call sequences upon each sample, and then analyzes and displays these stored instruction pointers (IP) along with the associated call sequences. The statistically collected IP samples with call sequences enable Intel Parallel Amplifier to generate and display a call tree.

Figure 9: Intel® Parallel Amplifier hotspot analysis results

Concurrency Analysis

Concurrency analysis measures how an application utilizes the available processors on a given system, and helps developers identify hotspot functions where processor utilization is poor, as seen in Figure 10. During concurrency analysis, Intel Parallel Amplifier collects and provides information on how many threads are active, meaning threads that are either running or queued and are not waiting at a defined waiting or blocking API. The number of running threads corresponds to the concurrency level of the application. By comparing the concurrency level with the number of processors, Intel Parallel Amplifier classifies how the application utilizes the processors in the system.

Figure 10: Intel® Parallel Amplifier concurrency analysis results. Granularity is set to Function-Thread

The time values in the concurrency and locks and waits windows correspond to the following utilization types (Figure 11):

Idle: All threads in the program are waiting; no threads are running. There can be only one node in the Summary tab graph indicating idle utilization.
Poor: By default, poor utilization is when the number of threads is up to 50% of the target concurrency.
OK: By default, acceptable (OK) utilization is when the number of threads is between 51% and 85% of the target concurrency.
Ideal: By default, ideal utilization is when the number of threads is between 86% and 115% of the target concurrency.

Figure 11: Intel® Parallel Amplifier concurrency analysis results summary view

Locks and Waits Analysis

While concurrency analysis helps developers identify where their application is not parallel, or not fully utilizing the available processors, locks and waits analysis helps them identify the cause of the ineffective processor utilization (Figure 12). The most common cause of poor utilization is threads waiting too long on synchronization objects (locks); in most cases no useful work is done while waiting, performance suffers, and processor utilization is low.

During locks and waits analysis, developers can estimate the impact of each synchronization object. The analysis results show how long the application had to wait on each synchronization object, or in blocking APIs such as sleep and blocking I/O. The synchronization objects analyzed include mutexes (mutual exclusion objects), semaphores, critical sections, and fork-join operations. A synchronization object with a long waiting time and a high concurrency level is very likely to be a bottleneck for the application.

Figure 12: Locks and waits analysis results
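As a sketch of the pattern this analysis surfaces (illustrative code, not from the article), the first loop below funnels every update through one critical section, so that lock would show up with a long wait time and a high concurrency level; the second version removes the shared lock from the hot path with an OpenMP reduction:

#include <omp.h>

double shared_total = 0.0;

/* Contended version: every iteration competes for the same lock,
   so threads spend most of their time waiting, not working. */
void accumulate_contended(const double *data, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp critical
        shared_total += data[i];
    }
}

/* Scalable version: each thread accumulates privately, and the
   partial sums are combined once at the end. */
void accumulate_scalable(const double *data, int n)
{
    double total = 0.0;
    int i;
    #pragma omp parallel for reduction(+ : total)
    for (i = 0; i < n; i++)
        total += data[i];
    shared_total = total;
}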
It is also very important to mention that for both Intel Parallel Inspector and Intel Parallel Amplifier, it is possible to drill down all the way to the source code level. For example, by double-clicking on a line item in Figure 13, I can drill down to the source code and observe which synchronization object is causing the problem.

Figure 13: Source code view of a problem in locks and waits analysis

Vast Opportunities

Parallel programming is not new. It has been well studied and has been employed in the high-performance computing community for many years; but now, with the expansion of multicore processors, parallel programming is becoming mainstream. This is exactly where Intel Parallel Studio comes into play. Intel Parallel Studio brings vast opportunities and tools that ease developers' transition to the realm of parallel programming, and hence significantly reduce the entry barriers to the parallel universe. Welcome to the parallel universe.

Levent Akyil is a Staff Software Engineer in the Performance, Analysis, and Threading Lab, Software Development Products, Intel Corporation.

8 Rules for Parallel Programming for Multicore
By James Reinders

There are some consistent rules that can help you solve the parallelism challenge and tap into the potential of multicore.

Programming for multicore processors poses new challenges. Here are eight rules for multicore programming to help you be successful:

1. Think parallel. Approach all problems looking for the parallelism. Understand where parallelism is, and organize your thinking to express it. Decide on the best parallel approach before other design or implementation decisions. Learn to "Think Parallel."

2. Program using abstraction. Focus on writing code to express parallelism, but avoid writing code to manage threads or processor cores. Libraries, OpenMP*, and Intel® Threading Building Blocks (Intel® TBB) are all examples of using abstractions. Do not use raw native threads (Pthreads, Windows* threads, Boost threads, and the like). Threads and MPI are the assembly languages for parallelism: they offer maximum flexibility but require too much time to write, debug, and maintain. Your programming should be at a high enough level that your code is about your problem, not about thread or core management.

3. Program in tasks (chores), not threads (cores). Leave the mapping of tasks to threads or processor cores as a distinctly separate operation in your program, preferably an abstraction you are using that handles thread/core management for you. Create an abundance of tasks in your program, or a task that can be spread across processor cores automatically (such as an OpenMP loop). By creating tasks, you are free to create as many as you can without worrying about oversubscription.

4. Design with the option to turn concurrency off. To make debugging simpler, create programs that can run without concurrency. This way, when debugging, you can run programs first with, then without, concurrency, and see if both runs fail or not. Debugging common issues is simpler when the program is not running concurrently, because the serial case is more familiar and better supported by today's tools. Knowing that something fails only when run concurrently hints at the type of bug you are tracking down. If you ignore this rule and can't force your program to run in only one thread, you'll spend too much time debugging. Because you want the capability to run in a single thread specifically for debugging, it doesn't need to be efficient; you just need to avoid creating parallel programs that require concurrency to work correctly, such as many producer-consumer models. MPI programs often violate this rule, which is part of the reason MPI programs can be problematic to implement and debug.

5. Avoid using locks. Simply say "no" to locks. Locks slow programs, reduce their scalability, and are the source of bugs in parallel programs. Make implicit synchronization the solution for your program. When you still need explicit synchronization, use atomic operations. Use locks only as a last resort. Work hard to design the need for locks completely out of your program.

6. Use tools and libraries designed to help with concurrency. Don't "tough it out" with old tools. Be critical of tool support with regard to how it presents and interacts with parallelism; most tools are not yet ready for parallelism. Look for thread-safe libraries—ideally ones that are designed to utilize parallelism themselves.

7. Use scalable memory allocators. Threaded programs need to use scalable memory allocators. Period. There are a number of solutions, and I'd guess that all of them are better than malloc(). Using scalable memory allocators speeds up applications by eliminating global bottlenecks, reusing memory within threads to better utilize caches, and partitioning properly to avoid cache line sharing. (A short code sketch at the end of this article shows one such allocator in use.)

8. Design to scale through increased workloads. The amount of work your program needs to handle increases over time. Plan for that. Designed with scaling in mind, your program will handle more work as the number of processor cores increases. Every year, we ask our computers to do more and more. Your designs should favor using increases in parallelism to give you advantages in handling bigger workloads in the future.

I wrote these rules with implicit mention of threading everywhere. Only rule no. 7 is specifically related to threading. Threading is not the only way to get value out of multicore: running multiple programs or multiple processes is often used, especially in server applications.

These rules will work well for you to get the most out of multicore. Some will grow in importance over the next 10 years, as the number of processor cores rises and we see an increase in the diversity of the cores themselves. The coming of heterogeneous processors and NUMA, for instance, makes rule no. 3 more and more important. You should understand all eight and take all eight to heart. I look forward to any comments you may have about these rules or parallelism in general.

James Reinders is Chief Software Evangelist and Director of Software Development Products, Intel Corporation. His articles and books on parallelism include Intel® Threading Building Blocks: Outfitting C++ for Multicore Processor Parallelism. Find out more at www.go-parallel.com.
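As a sketch of rule no. 7 (illustrative, not from the article), the typedef below drops Intel TBB's scalable allocator into a standard container, so concurrent growth draws memory from per-thread pools instead of contending on the global heap lock; other scalable allocators can be substituted the same way:

#include <vector>
#include "tbb/scalable_allocator.h"

// A std::vector whose storage comes from the TBB scalable allocator.
typedef std::vector<double, tbb::scalable_allocator<double> > buffer_t;

void fill(buffer_t &buf, int n)
{
    buf.reserve(n);              // the allocation request is served by
    for (int i = 0; i < n; i++)  // the scalable allocator, not malloc()
        buf.push_back(0.5 * i);
}

int main()
{
    buffer_t buf;
    fill(buf, 1000000);          // remains scalable even when many
    return 0;                    // threads allocate at the same time
}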


Optimization Notice

Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options.” Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not. Notice revision #20101101

Evolve your code.

Parallelism breakthrough. Analyze, compile, debug, check, and tune your code for multicore with Intel® Parallel Studio. Designed for today’s serial apps and tomorrow’s parallel innovators. Available now: www.intel.com/software/parallelstudio

© 2009–2010, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.