Characteristics of AMD Barcelona Floating Point Execution Ben Bales and Richard Barrett Computer Science and Mathematics Division Oak Ridge National Laboratory
Cray User Group Atlanta, GA May 7, 2009
Managed by UT-Battelle for the U. S. Department of Energy Subjects
•Definitions •Barcelona Overview •Branch Predictor + Conditional Moves •Software Pipelining •Vector Operations •How to Take Advantage of Given Optimizations
Managed by UT-Battelle for the U. S. Department of Energy SIMD and Vector
SIMD Vector 1 2 3 4 1 2 3 4
+ + 1 2 3 4 1 2 3 4 = = 1 2 3 4 1 2 3 4
Managed by UT-Battelle for the U. S. Department of Energy Software Pipelining
In-Order Scheduler Out-of-Order Scheduler Lookahead Multiply Multiply Multiply Multiply Multiply Multiply Add Lookahead Add Multiply Multiply Subtract Subtract Add Add Add Add Add Add
Software Pipelining references attempts to manually guarantee the existence of superscalar code within the Lookahead window.
Managed by UT-Battelle for the U. S. Department of Energy Barcelona Overview Three level cache architecture •L1/L2 – High speed computational caches (256 and 128bits per cycle respectively)
SIMD Instruction Set •Limited “Vector” operations •Three SIMD pipelines
This presentation focuses on single precision
Managed by UT-Battelle for the U. S. Department of Energy Branches Four sub-topics •SSE conditional moves •Integer comparators •Branch prediction mechanism •Case study: CORDIC Arctangent
Managed by UT-Battelle for the U. S. Department of Energy SSE Conditional Moves Branches that can be expressed as moves conditional on simple floating point comparisons SSE friendly
if(a < b) { sgn = 1.0; if(a < b) { } else { ans = c + d; sgn = -1.0; } else { } ans = c – d; } ans = c + sgn * d;
Managed by UT-Battelle for the U. S. Department of Energy Integer Comparison Unit As recommended in the AMD 10h optimization guide, floating point comparisons can be performed in integer units (chart taken from guide):
Comparison Against Zero IntegerComparison Replaces if(*((unsigned int *)&f) > 0x8000000U) if(f < 0.0f) if(*((int*)&f) <= 0) if(f <= 0.0f) if(*((int*)&f) > 0) if(f > 0.0f) if(*((unsigned int *)&f) <= 0x8000000U) if(f >= 0.0f)
Managed by UT-Battelle for the U. S. Department of Energy Branch Predictor Dynamic branches predicted based on 2-bit Global History Bimodal Counters
Value: 0 1 2 3 Action: No No Branch Branch Branch Branch Counters updated as branch conditionals evaluated
Managed by UT-Battelle for the U. S. Department of Energy Branch Predictor cont. Heuristic improved by addressing GHBC with Global Branch History
Global Branch History GHBC 0 1 0 3 1 0 1 0
Unfortunately, loops waste half of the available bits!
Managed by UT-Battelle for the U. S. Department of Energy Branch Predictor cont.
… 0 1 0 1 0 1 1 1 1 1 0 1
Dots – Instruction address Red – Loop branch bits Green – Legitimate branch bits
Unrolling loops can help improve the Legitimate/Loop ratio and make the GHBCs work better
Managed by UT-Battelle for the U. S. Department of Energy Branch Benchmarks
Branch Speed Comparison 16
14
12
10 Branch Block Branch 8 Int Branch Int Block Branch 6 SSE Conditional Move (x1) Clock Cycles/Branch Cycles/Branch Clock Clock SSE Conditional Move (x4) 4
2
0 Random Non-Random
Managed by UT-Battelle for the U. S. Department of Energy CORDIC Arctangent •CORDIC algorithms popular for implementing trig functions with limited hardware •Basic algorithm is SSE friendly, but includes a tricky conditional move
Managed by UT-Battelle for the U. S. Department of Energy CORDIC Arctangent (SIMD) for(i = 0; i < N; i++) { if(y < 0.0f) { s = 1.0f; } else { s = -1.0f; }
n_x = x – s * y * powf(2.0f, -(float)i); //powf and atanf are n_y = y + s * x * powf (2.0f, -(float) i); //implemented in n_z = z – s * atanf(powf(2.0f, -(float)i); //lookup tables
x = n_x; y = n_y; z = n_z; }
Managed by UT-Battelle for the U. S. Department of Energy CORDIC Arctangent (Vector) for(i = 0; i < N; i++) { if(y < 0.0f) { s = 1.0f; } else { s = -1.0f; }
n_x = x + (-s) * y * powf(2.0f, -(float)i) ; n_y = y + s * x * powf (2.0f, -(float) i) ; n_z = z + (-s) * 1.0 * atanf(powf(2.0f, -(float)i) ;
x = n_x; y = n_y; z = n_z; }
Managed by UT-Battelle for the U. S. Department of Energy CORDIC Performance
Cordic Contional Performance 25.00
20.00
15.00 Single Element, Parallel Single Element, Vector 10.00 Pipelined, Vector Clock Cycles/Iteration Cycles/Iteration Clock Clock
5.00
0.00 Conditional Moves Floating Point Branches Integer Branches
Managed by UT-Battelle for the U. S. Department of Energy CORDIC Performance Cont.
Cordic Conditional Performance cont. 9.00
8.00
7.00
6.00
5.00 Pipelined, Vector Full SIMD, Parallel 4.00 Full SIMD, Pipelined, Parallel
3.00 Clock Cycles/Iteration Cycles/Iteration Clock Clock
2.00
1.00
0.00
Managed by UT-Battelle for the U. S. Department of Energy Software Pipelining •Out-of-Order scheduling is built around the idea of automatically pipelining code •Out-of-Order scheduler is effective while multiple independent blocks of code appear in its look ahead window •Questions: •How far ahead does the scheduler look? •How effective is “effective”?
Managed by UT-Battelle for the U. S. Department of Energy Out-of-Order Scheduler
Blocked vs. Shuffled Instructions 19
18
17
16
15
14
GFOPS Blocked 13 Shuffled
12
11
10 0 20 40 60 80 100 120 Instruction Block Size
Memory Dependent Tests Blocked Shuffled Gain 15.10GFLOPS 16.00GFLOPS 5.62%
Managed by UT-Battelle for the U. S. Department of Energy Software Pipelining •Scheduler is quite effective as long as instructions are available in its window •Instructions easily made “available” by two techniques: •Unrolling loops •Staging loops
Managed by UT-Battelle for the U. S. Department of Energy Software Pipelining Original
Stage Stage Stage Stage … 1A 2A 1B 2B
Unrolled
Stage Stage Stage Stage … 1A 1B 2A 2B
Staged
Stage Stage Stage Stage … … 1A 1B 2A 2B
Managed by UT-Battelle for the U. S. Department of Energy Software Pipelining
Sine Performance 20.00
18.00
16.00
14.00 Staged, Blocked, Pipelined 12.00 Staged, Shuffled, Pipelined Non-Staged, Blocked, Pipelined 10.00 Non-Staged, Shuffled, Pipelined 8.00 Staged, Non-Pipelined
Clock Cycles/Value Cycles/Value Clock Clock Non-Staged, Non-Pipelined 6.00
4.00
2.00
0.00
Managed by UT-Battelle for the U. S. Department of Energy Vector Optimizations •Some codes pre-shuffle values to make vector operations fit better to SIMD instruction sets •For multiplication on complex numbers we can trade a dependency and a shuffle for a load
Managed by UT-Battelle for the U. S. Department of Energy Vector Optimizations
Effects of Buffering (Barcelona) 1.6
1.55
1.5
1.45
1.4 Regular 1.35 Buffered
1.3 Average Average Average Clock Clock Cycles/Complex Cycles/Complex Op Op 1.25
1.2 0 2000 4000 6000 8000 10000 12000 Buffer Block Size (floats)
Managed by UT-Battelle for the U. S. Department of Energy Vector Optimizations
Effects of Buffering (Athlon)
2.900
2.700
2.500
2.300 Regular Buffered 2.100
Average Average Average Clock Clock Cycles/Complex Cycles/Complex Op Op 1.900
1.700 0 2000 4000 6000 8000 10000 12000 Buffer Block Size (floats)
Managed by UT-Battelle for the U. S. Department of Energy Vector Optimizations
Comparison of Buffering Importance 40
30
20
10 Barcelona Athlon
0 0 2000 4000 6000 8000 10000 12000
-10 Percent Percent Percent GainGainofof BufferedBuffered Implementation Implementation
-20 Buffer Block Size (floats)
Managed by UT-Battelle for the U. S. Department of Energy How to Use This Stuff •All these tests were written in C with the help of the Intel Assembly Intrinsics •Unfortunately, the intrinsics do not work well on PGI or Pathscale compilers (they do work in GNU, Intel, and Microsoft ones) •Intrinsics, while better than assembly, are clunky at best
Managed by UT-Battelle for the U. S. Department of Energy How to Use This Stuff
for(i = 0; i < ITERS; i++) { n_two = _mm_set1_ps(-2.0f); m = (__m128)_mm_shuffle_epi32((__m128i)v1, 0x0); m = _mm_cmpgt_ps(m, zero); n_two = _mm_and_ps(n_two, m); n_two = _mm_add_ps(n_two, p_one); n_v = (__m128)_mm_shuffle_epi32((__m128i)v1, 0xF1); r1 = _mm_load_ps(&lin_lookup[i * 4]); n_v = _mm_mul_ps(n_v, r1); n_v = _mm_mul_ps(n_two, n_v); v1 = _mm_add_ps(v1, n_v); }
Managed by UT-Battelle for the U. S. Department of Energy Conclusions
•Conditionals •Unroll slow conditional loops •Use integer comparisons •Code for conditional moves •Pipeline operations explicitly •Separate staged computations •Don’t worry about vector operations
Managed by UT-Battelle for the U. S. Department of Energy Square Root/Reciprocal Time
Square Root/Reciprocal Time 25
20
15
Barcelona Athlon 10 Clock Cycles/Operation Cycles/Operation ClockClock
5
0 GNU RCP GNU SQRT SSE DIV RCP SSE RCP SSE SQRT
Managed by UT-Battelle for the U. S. Department of Energy