Characteristics of AMD Barcelona Floating Point Execution Ben Bales and Richard Barrett Computer Science and Mathematics Division Oak Ridge National Laboratory

Home , AMD 10h, CORDIC

Cray User Group Atlanta, GA May 7, 2009

Managed by UT-Battelle for the U. S. Department of Energy Subjects

•Definitions •Barcelona Overview •Branch Predictor + Conditional Moves •Software Pipelining •Vector Operations •How to Take Advantage of Given Optimizations

Managed by UT-Battelle for the U. S. Department of Energy SIMD and Vector

SIMD Vector 1 2 3 4 1 2 3 4

+ + 1 2 3 4 1 2 3 4 = = 1 2 3 4 1 2 3 4

Managed by UT-Battelle for the U. S. Department of Energy Software Pipelining

In-Order Scheduler Out-of-Order Scheduler Lookahead Multiply Multiply Multiply Multiply Multiply Multiply Add Lookahead Add Multiply Multiply Subtract Subtract Add Add Add Add Add Add

Software Pipelining references attempts to manually guarantee the existence of superscalar code within the Lookahead window.

Managed by UT-Battelle for the U. S. Department of Energy Barcelona Overview Three level cache architecture •L1/L2 – High speed computational caches (256 and 128bits per cycle respectively)

SIMD Instruction Set •Limited “Vector” operations •Three SIMD pipelines

This presentation focuses on single precision

Managed by UT-Battelle for the U. S. Department of Energy Branches Four sub-topics •SSE conditional moves •Integer comparators •Branch prediction mechanism •Case study: CORDIC Arctangent

Managed by UT-Battelle for the U. S. Department of Energy SSE Conditional Moves Branches that can be expressed as moves conditional on simple floating point comparisons SSE friendly

if(a < b) { sgn = 1.0; if(a < b) { } else { ans = c + d; sgn = -1.0; } else { } ans = c – d; } ans = c + sgn * d;

Managed by UT-Battelle for the U. S. Department of Energy Integer Comparison Unit As recommended in the AMD 10h optimization guide, floating point comparisons can be performed in integer units (chart taken from guide):

Comparison Against Zero IntegerComparison Replaces if(*((unsigned int *)&f) > 0x8000000U) if(f < 0.0f) if(*((int*)&f) <= 0) if(f <= 0.0f) if(*((int*)&f) > 0) if(f > 0.0f) if(*((unsigned int *)&f) <= 0x8000000U) if(f >= 0.0f)

Managed by UT-Battelle for the U. S. Department of Energy Branch Predictor Dynamic branches predicted based on 2-bit Global History Bimodal Counters

Value: 0 1 2 3 Action: No No Branch Branch Branch Branch Counters updated as branch conditionals evaluated

Managed by UT-Battelle for the U. S. Department of Energy Branch Predictor cont. Heuristic improved by addressing GHBC with Global Branch History

Global Branch History GHBC 0 1 0 3 1 0 1 0

Unfortunately, loops waste half of the available bits!

Managed by UT-Battelle for the U. S. Department of Energy Branch Predictor cont.

… 0 1 0 1 0 1 1 1 1 1 0 1

Dots – Instruction address Red – Loop branch bits Green – Legitimate branch bits

Unrolling loops can help improve the Legitimate/Loop ratio and make the GHBCs work better

Managed by UT-Battelle for the U. S. Department of Energy Branch Benchmarks

Branch Speed Comparison 16

10 Branch Block Branch 8 Int Branch Int Block Branch 6 SSE Conditional Move (x1) Clock Cycles/Branch Cycles/Branch Clock Clock SSE Conditional Move (x4) 4

0 Random Non-Random

Managed by UT-Battelle for the U. S. Department of Energy CORDIC Arctangent •CORDIC algorithms popular for implementing trig functions with limited hardware •Basic algorithm is SSE friendly, but includes a tricky conditional move

Managed by UT-Battelle for the U. S. Department of Energy CORDIC Arctangent (SIMD) for(i = 0; i < N; i++) { if(y < 0.0f) { s = 1.0f; } else { s = -1.0f; }

n_x = x – s * y * powf(2.0f, -(float)i); //powf and atanf are n_y = y + s * x * powf (2.0f, -(float) i); //implemented in n_z = z – s * atanf(powf(2.0f, -(float)i); //lookup tables

x = n_x; y = n_y; z = n_z; }

Managed by UT-Battelle for the U. S. Department of Energy CORDIC Arctangent (Vector) for(i = 0; i < N; i++) { if(y < 0.0f) { s = 1.0f; } else { s = -1.0f; }

n_x = x + (-s) * y * powf(2.0f, -(float)i) ; n_y = y + s * x * powf (2.0f, -(float) i) ; n_z = z + (-s) * 1.0 * atanf(powf(2.0f, -(float)i) ;

x = n_x; y = n_y; z = n_z; }

Managed by UT-Battelle for the U. S. Department of Energy CORDIC Performance

Cordic Contional Performance 25.00

20.00

15.00 Single Element, Parallel Single Element, Vector 10.00 Pipelined, Vector Clock Cycles/Iteration Cycles/Iteration Clock Clock

5.00

0.00 Conditional Moves Floating Point Branches Integer Branches

Managed by UT-Battelle for the U. S. Department of Energy CORDIC Performance Cont.

Cordic Conditional Performance cont. 9.00

8.00

7.00

6.00

5.00 Pipelined, Vector Full SIMD, Parallel 4.00 Full SIMD, Pipelined, Parallel

3.00 Clock Cycles/Iteration Cycles/Iteration Clock Clock

2.00

1.00

0.00

Managed by UT-Battelle for the U. S. Department of Energy Software Pipelining •Out-of-Order scheduling is built around the idea of automatically pipelining code •Out-of-Order scheduler is effective while multiple independent blocks of code appear in its look ahead window •Questions: •How far ahead does the scheduler look? •How effective is “effective”?

Managed by UT-Battelle for the U. S. Department of Energy Out-of-Order Scheduler

Blocked vs. Shuffled Instructions 19

GFOPS Blocked 13 Shuffled

10 0 20 40 60 80 100 120 Instruction Block Size

Memory Dependent Tests Blocked Shuffled Gain 15.10GFLOPS 16.00GFLOPS 5.62%

Managed by UT-Battelle for the U. S. Department of Energy Software Pipelining •Scheduler is quite effective as long as instructions are available in its window •Instructions easily made “available” by two techniques: •Unrolling loops •Staging loops

Managed by UT-Battelle for the U. S. Department of Energy Software Pipelining Original

Stage Stage Stage Stage … 1A 2A 1B 2B

Unrolled

Stage Stage Stage Stage … 1A 1B 2A 2B

Staged

Stage Stage Stage Stage … … 1A 1B 2A 2B

Managed by UT-Battelle for the U. S. Department of Energy Software Pipelining

Sine Performance 20.00

18.00

16.00

14.00 Staged, Blocked, Pipelined 12.00 Staged, Shuffled, Pipelined Non-Staged, Blocked, Pipelined 10.00 Non-Staged, Shuffled, Pipelined 8.00 Staged, Non-Pipelined

Clock Cycles/Value Cycles/Value Clock Clock Non-Staged, Non-Pipelined 6.00

4.00

2.00

0.00

Managed by UT-Battelle for the U. S. Department of Energy Vector Optimizations •Some codes pre-shuffle values to make vector operations fit better to SIMD instruction sets •For multiplication on complex numbers we can trade a dependency and a shuffle for a load

Managed by UT-Battelle for the U. S. Department of Energy Vector Optimizations

Effects of Buffering (Barcelona) 1.6

1.55

1.5

1.45

1.4 Regular 1.35 Buffered

1.3 Average Average Average Clock Clock Cycles/Complex Cycles/Complex Op Op 1.25

1.2 0 2000 4000 6000 8000 10000 12000 Buffer Block Size (floats)

Managed by UT-Battelle for the U. S. Department of Energy Vector Optimizations

Effects of Buffering (Athlon)

2.900

2.700

2.500

2.300 Regular Buffered 2.100

Average Average Average Clock Clock Cycles/Complex Cycles/Complex Op Op 1.900

1.700 0 2000 4000 6000 8000 10000 12000 Buffer Block Size (floats)

Managed by UT-Battelle for the U. S. Department of Energy Vector Optimizations

Comparison of Buffering Importance 40

10 Barcelona Athlon

0 0 2000 4000 6000 8000 10000 12000

-10 Percent Percent Percent GainGainofof BufferedBuffered Implementation Implementation

-20 Buffer Block Size (floats)

Managed by UT-Battelle for the U. S. Department of Energy How to Use This Stuff •All these tests were written in C with the help of the Intel Assembly Intrinsics •Unfortunately, the intrinsics do not work well on PGI or Pathscale compilers (they do work in GNU, Intel, and Microsoft ones) •Intrinsics, while better than assembly, are clunky at best

Managed by UT-Battelle for the U. S. Department of Energy How to Use This Stuff

for(i = 0; i < ITERS; i++) { n_two = _mm_set1_ps(-2.0f); m = (__m128)_mm_shuffle_epi32((__m128i)v1, 0x0); m = _mm_cmpgt_ps(m, zero); n_two = _mm_and_ps(n_two, m); n_two = _mm_add_ps(n_two, p_one); n_v = (__m128)_mm_shuffle_epi32((__m128i)v1, 0xF1); r1 = _mm_load_ps(&lin_lookup[i * 4]); n_v = _mm_mul_ps(n_v, r1); n_v = _mm_mul_ps(n_two, n_v); v1 = _mm_add_ps(v1, n_v); }

Managed by UT-Battelle for the U. S. Department of Energy Conclusions

•Conditionals •Unroll slow conditional loops •Use integer comparisons •Code for conditional moves •Pipeline operations explicitly •Separate staged computations •Don’t worry about vector operations

Managed by UT-Battelle for the U. S. Department of Energy Square Root/Reciprocal Time

Square Root/Reciprocal Time 25

Barcelona Athlon 10 Clock Cycles/Operation Cycles/Operation ClockClock

0 GNU RCP GNU SQRT SSE DIV RCP SSE RCP SSE SQRT

Managed by UT-Battelle for the U. S. Department of Energy