Boulder Lake
Overview of Intel® Core 2 Architecture and Software Development Tools

Overview of Architecture & Tools
We will discuss:
- What lecture materials are available
- What labs are available
- What target courses could be impacted
- Some high-level discussion of the underlying technology

Objectives
After completing this module, you will:
- Be aware of, and have access to, several hours' worth of multi-core (MC) topics, including Architecture, Compiler Technology, Profiling Technology, OpenMP, and Cache Effects
- Be able to create exercises on how to avoid common threading hazards associated with some MC systems, such as poor cache utilization, false sharing, and threading load imbalance
- Be able to create exercises on how to use selected compiler directives and switches to improve behavior on each core
- Be able to create exercises on how to take advantage of the VTune analyzer to quickly identify load-imbalance issues, poor cache reuse, and false sharing issues

Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects

Why is the Industry moving to Multi-core?
In order to increase performance and reduce power consumption: it is much more efficient to run several cores at a lower frequency than one single core at a much faster frequency.

Power and Frequency
[Figure: power vs. frequency curve for a single-core architecture, roughly 9-359 W over 0.2-3.4 GHz. Dropping frequency gives a large drop in power; the lower frequency allows headroom for a 2nd core.]

Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects

Processor-independent optimizations
- /Od   Disables optimizations
- /O1   Optimizes for binary size and for speed: server code
- /O2   Optimizes for speed (default); vectorization on Intel 64
- /O3   Optimizes for the data cache: loopy floating-point code
- /Zi   Creates symbols for debugging
- /Ob0  Turns off inlining, which can sometimes help the analysis tools do a more thorough job

Auto-vectorization optimizations
- /QaxSSE2       Intel Pentium 4 and compatible Intel processors
- /QaxSSE3       Intel(R) Core(TM) processor family with Streaming SIMD Extensions 3 (SSE3) instruction support
- /QaxSSE3_ATOM  Can generate MOVBE instructions for Intel processors and can optimize for the Intel(R) Atom(TM) processor and Intel(R) Centrino(R) Atom(TM) Processor Technology
- /QaxSSSE3      Intel(R) Core(TM)2 processor family with SSSE3
- /QaxSSE4.1     Intel(R) 45nm Hi-k next-generation Intel Core(TM) microarchitecture with support for SSE4 Vectorizing Compiler and Media Accelerator instructions
- /QaxSSE4.2     Can generate Intel(R) SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel(R) Core(TM) i7 processors; can also generate Intel(R) SSE4 Vectorizing Compiler and Media Accelerator, Intel(R) SSSE3, SSE3, SSE2, and SSE instructions, and can optimize for the Intel(R) Core(TM) processor family

Intel has a long history of providing auto-vectorization switches along with support for new processor instructions, and backward support for older processor instructions is maintained. Developers should keep an eye on new developments in order to leverage the power of the latest processors.
More Advanced optimizations
- /Qipo      Interprocedural optimization performs a static, topological analysis of your application. With /Qipo (-ipo), the analysis spans all of your source files built with /Qipo (-ipo); in other words, code generation in module A can be improved by what is happening in module B. It may also enable other optimizations, such as auto-parallelization and auto-vectorization.
- /Qparallel Enables the auto-parallelizer to generate multi-threaded code for loops that can be safely executed in parallel.
- /Qopenmp   Enables the compiler to generate multi-threaded code based on the OpenMP* directives.

Lab 1 - AutoParallelization
Objective: Use auto-parallelization on a simple code to gain experience with the compiler's auto-parallelization feature.
- Follow the VectorSum activity in the student lab doc
- Try auto-parallel compilation on the lab called VectorSum
- Extra credit: parallelize manually and see whether you can beat the auto-parallel option; see the OpenMP section for constructs to try

Parallel Studio to find where to parallelize
Parallel Studio will be used in several labs to find appropriate locations to add parallelism to the code. Parallel Amplifier specifically is used to find hotspot information: where in your code the application spends most of its time. Parallel Amplifier does not require instrumenting your code in order to find hotspots, but compiling with symbol information (/Zi) is a good idea. Compiling with /Ob0 turns off inlining and sometimes seems to give a more thorough analysis in Parallel Studio.

Parallel Amplifier Hotspots
What does hotspot analysis show? What about drilling down?
The call stack
The call stack shows the callee/caller relationship among the functions in the code. [Screenshot callout: found potential parallelism.]

Lab 2 - Mandelbrot Hotspot Analysis
Objective: Use sampling to find some parallelism in the Mandelbrot application.
- Follow the Mandelbrot activity called Mandelbrot Sampling in the student lab doc
- Identify candidate loops that could be parallelized

Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core (high-level overview of the Intel® Core architecture)
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects

Intel® Core 2 Architecture
A snapshot in time during Penryn, Yorkfield, and Harpertown. Software developers should know the number of cores, the cache line size, and the cache sizes to tackle the Cache Effects materials.
- Mobile platform optimized: 1-4 execution cores, 3/6 MB L2 cache sizes, 64-byte L2 cache line, 64-bit
- Desktop platform optimized: 2-4 execution cores, 2x3/2x6 MB L2 cache sizes, 64-byte L2 cache line, 64-bit
- Server platform optimized: 4 execution cores, 2x6 MB L2 caches, 64-byte L2 cache line, 64-bit, DP/MP support
**Feature names TBD

Memory Hierarchy
- L1 cache: ~1 cycle
- L2 cache: ~1-10 cycles
- Main memory: ~100s of cycles
- Magnetic disk: ~1000s of cycles

Intel® Core™ Microarchitecture - Memory Sub-system
High-level architectural view (A = architectural state, E = execution engine & interrupt, C = 2nd-level cache, B = bus interface): the Intel Core 2 Duo processor has a shared cache, while the Intel Core 2 Quad processor has both shared and separated caches. Both use a 64-byte cache line.

With a separated cache, a modified cache line must be shipped between CPU1 and CPU2 over the front side bus (FSB), costing roughly half of an access to memory.

Advantages of Shared Cache - using Advanced Smart Cache®
Technology
When the L2 is shared, there is no need to ship the cache line across the FSB: both cores access the same copy in the shared L2.

False Sharing
A performance issue in programs where cores write to different memory addresses BUT within the same cache line. Known as ping-ponging: the cache line is shipped back and forth between cores. [Diagram: core 0 writes X[0] = 0, 1, 2 while core 1 writes X[1] = 0, 1; the elements share one cache line.] False sharing is not an issue in a shared cache, but it is an issue in separated caches.

Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects

Super Scalar Execution
Multiple operations are executed within a single core at the same time. Multiple execution units (INT, FP, SIMD) allow SIMD parallelism, and many instructions can be retired in a clock cycle.

History of SSE Instructions
- Intel SSE (1999): 70 instructions; single-precision vectors; streaming operations
- Intel SSE2 (2000): 144 instructions; double-precision vectors; 8/16/32/64/128-bit vector integer
- Intel SSE3 (2004): 13 instructions; complex data
- Intel SSSE3 (2006): 32 instructions; decode
- Intel SSE4.1 (2007): 47 instructions; video accelerators; graphics building blocks; advanced vector instructions
There is a long history of new instructions; most require packing and unpacking of data using packing/unpacking instructions. This history will be continued by Intel SSE4.2 (XML processing, end of 2008). See http://download.intel.com/technology/architecture/new-instructions-paper.pdf

SSE Data Types & Speedup Potential
- SSE: 4x floats
- SSE-2: 2x doubles, 16x bytes, 8x 16-bit shorts, 4x 32-bit integers, 2x 64-bit integers, 1x 128-bit integer
- SSE-3 and SSE-4 operate on the same data types
The potential speedup (in the targeted loop) is roughly the same as the amount of packing, i.e. for floats the speedup is ~4x.
Goal of SSE(x)
Scalar processing (traditional mode): one instruction produces one result. SIMD processing with SSE(2,3,4): one instruction produces multiple results, e.g. X + Y computes x3+y3, x2+y2, x1+y1, x0+y0 in a single operation.
- Uses the full width of the XMM registers
- Many functional units
- A choice of many instructions
- Not all loops can be vectorized
- Can't vectorize most function calls

Lab 3 - IPO-assisted Vectorization
Objective: Explore how inlining a function can dramatically improve performance by allowing vectorization of a loop containing a function call.
- Open the SquareChargeCVectorizationIPO folder and use "nmake all" to build the project from the command line
- To add switches to the make environment, use, for example: nmake all CF="/QxSSE3"

Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects

Cache effects
Cache effects can sometimes impact the speed of an application by