Lecture 6 Instruction Level Parallelism (4) EEC 171 Parallel Architectures John Owens UC Davis Credits • © John Owens / UC Davis 2007–8. • Thanks to many sources for slide material: Computer Organization and Design (Patterson & Hennessy) © 2005, Computer Architecture (Hennessy & Patterson) © 2007, Inside the Machine (Jon Stokes) © 2007, © Dan Connors / University of Colorado 2007, © Wen- Mei Hwu/David Kirk, University of Illinois 2007, © David Patterson / UCB 2003–7, © John Lazzaro / UCB 2006, © Mary Jane Irwin / Penn State 2005, © John Kubiatowicz / UCB 2002, © Krste Asinovic/Arvind / MIT 2002, © Morgan Kaufmann Publishers 1998. Outline
• Trace scheduling & trace caches • Pentium 4 • Intel tips for performance • EPIC & Itanium: Modern VLIW • Transmeta approach—alternate VLIW strategy • Limits to ILP Trace scheduling • Problem: Can’t find enough ILP within single block • Solution: Pretend blocks are bigger! • Problem: But blocks aren’t bigger! Blocks are separated by branches. • Solution 1: Let hardware speculate (branch prediction) • Solution 2: Let software make its best guess based on history (trace scheduling) • … and then implement this in hardware (trace cache, Pentium 4) Example1316 (SW)P. P. CHANG, S. A. MAHLKE AND W.-M. W. HWU • Original loop allows us to increment (a) either r0 or r1: x[r0]=x[r0]+r1 • Profile says incrementing r1 (b) is much more common • Optimize for that case Figure 9. An example of super-block global variaChangble migra tieton: al.(a) oSPErigina l Dec.progra m1991 segment; (b) program segment after global variable migration
5. op ( x ) and op ( y ) are incremented by the same value, i.e. K1 = K2 * 6. There are no branch instructions between op ( x ) and op ( y ). 7. For each operation op ( j ) in which src ( j ) contains dest ( x ), either j = x or all elements of src ( j ) except dest ( x ) are loop invariant. 8. All uses of dest ( x ) can be modified to dest ( y ) in the super-block without incurring time penalty.† The action function of induction variable elimination consists of four steps: 1. op ( x ) is deleted. 2. A subtraction instruction op ( m ), dest ( m ) ! dest ( x ) – dest ( y ), is inserted after the last instruction in the preheader of the super-block loop. 3. For each instruction op ( a ) which uses dest ( x ), let other_src ( a ) denote the
* The restriction of predicate 5( K1 = K2 ) can be removed in some special uses of dest ( x ); however, these special uses are too complex to be discussed in this paper. † For example, if we know that dest ( x ) = dest ( y ) + 5 because of different initial values, then a (branch if not equal) bne(dest ( x ),0) instruction is converted to ab ne(dest ( y ), -5) instruction. For some machines,b ne(dest ( y ), -5) needs to be broken down to a compare instruction plus a branch instruction; then, the optimization may degrade performance. Structure1304 P. P. CH ANofG, S. A . CompilerMAHLKE AND W.-M. W. HWU
1
Figure 1. A block diagram of our prototype C compiler
Table I. Classic code optimizations
Local Global
constant propagation constant propagation copy propagation copy propagation common subexpression elimination common subexpression elimination redundant load elimination redundant load elimination redundant store elimination redundant store elimination constant folding loop unrolling strength reduction loop invariant code removal constant combining loop induction strength reduction operation folding loop induction variable elimination dead code removal dead code removal code reordering global variable migration Bigger example USING PROFILE INFORMATION 1309