Knights Landing™ Hardware

TACC KNL Tutorial
IXPUG Annual Meeting 2016
Presented by: John Cazes and Lars Koesterke

Intel’s Architecture

• Leverages the x86 architecture
• Simpler x86 cores, higher compute throughput per watt
• Supports legacy programming models
  • Fortran, C/C++
  • MPI, OpenMP, pthreads
• Designed for floating-point performance
• Provides high memory bandwidth
• Runs an operating system
• Many-core design rather than multi-core
  • Designed to run hundreds of execution threads in parallel

2nd Generation Intel Xeon Phi: Knights Landing

• Many Integrated Cores (MIC) architecture
• Up to 72 cores (based on Intel Silvermont)
• 4 H/W threads per core
  • Up to 288 threads of execution
• 16 GB MCDRAM* (high bandwidth) on-package
• 1 socket, self-hosted (no more PCI bottleneck!)
• 3+ TF DP peak performance
• 6+ TF SP peak performance
• 400+ GB/s STREAM performance
• Supports Intel Omni-Path Fabric

* Multi-Channel DRAM
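As a rough sanity check on those peak numbers, using figures quoted elsewhere in these slides (68-core parts at 1.4 GHz, 2 VPUs per core, 8 doubles per 512-bit vector, 2 flops per FMA):

    68 cores × 1.4 GHz × 2 VPUs × 8 DP lanes × 2 flops (FMA) ≈ 3.05 TF DP
    (16 SP lanes per vector doubles this to ≈ 6.1 TF SP)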

Knights Corner → Knights Landing

• Co-processor → Self-hosted
• Stripped-down Linux → CentOS 7
• Binary incompatible with other Xeon architectures → Binary compatible with prior (non-Phi) Xeon architectures
• 1.1 GHz processor → 1.4 GHz processor
• 8 GB RAM → Up to 400 GB RAM (including 16 GB MCDRAM)
• 22 nm process → 14 nm process
• 1 512-bit VPU → 2 512-bit VPUs
• No support for out-of-order execution, branch prediction, or fast unaligned memory access → Support for all three

KNL Diagram

• Cores are grouped in pairs (tiles)
• Up to 36 tiles (72 cores)
• 2D mesh interconnect
• 2 DDR memory controllers
  • 6 channels of DDR4
  • Up to 90 GB/s
• 16 GB MCDRAM
  • 8 embedded DRAM controllers
  • Up to 475 GB/s

(Knights Landing: Second-Generation Intel Xeon Phi Product, A. Sodani et al., IEEE Micro, March/April 2016)

KNL Tile

• Each core (based on Intel Silvermont):
  • Local L1 cache
  • 2 512-bit VPUs (almost symmetric)
• 2 cores/tile
• 1 MB shared L2 cache (up to 36 MB L2 per KNL)
• Shared mesh connection

(Knights Landing: Second-Generation Intel Xeon Phi Product, A. Sodani et al., IEEE Micro, March/April 2016)

KNL Core

• 8-way 32 KB instruction cache
• 2 VPUs; only one supports legacy floating-point ops
  • Compile with -xMIC-AVX512 to use both VPUs (only supported by Intel compilers)
• 8-way 32 KB data cache

(Knights Landing: Second-Generation Intel Xeon Phi Product, A. Sodani et al., IEEE Micro, March/April 2016)

KNL ISA

• KNL supports all legacy instructions (x87/MMX, SSE, AVX, AVX2, BMI)
• Introduces AVX-512 extensions, enabled with -xMIC-AVX512:
  • AVX-512F: Foundations (common between Xeon and Xeon Phi)
  • AVX-512CD: Conflict Detection
  • AVX-512PF: Prefetch
  • AVX-512ER: Exponential and Reciprocal

ISA stacks, Haswell vs. KNL:
  Haswell: x87/MMX, SSE, AVX, AVX2, BMI, TSX
  KNL:     x87/MMX, SSE, AVX, AVX2, BMI, AVX-512F, AVX-512CD, AVX-512PF, AVX-512ER

AVX-512 ISA: http://goo.gl/TGQIKE

KNL

• C/C++/Fortran… and Python/Java/…
• “Feels” like a traditional node (not a co-processor!)
• However:
  • Many-core approach
  • Cores relatively slow
  • Intra-node parallelization required
• Binary compatible with previous Xeon, but not the other way around (when compiled with -xMIC-AVX512)

Stampede KNL Upgrade

• Upgrade to TACC’s Stampede cluster
• ~1.5 PF of additional performance
• #117 in the Top 500
  • First KNL system on the list
• 504 68-core KNL nodes
• Intel Omni-Path Fabric network
• Separate cluster that shares filesystems with Stampede
• Funded by the National Science Foundation (NSF) through grant #ACI-1134872

Stampede KNL Upgrade

[System diagram, summarized:]

Stampede’s original components (Sandy Bridge cluster):
• login1 through login4 (Sandy Bridge), running CentOS 6
• Sandy Bridge compute nodes with KNC MIC coprocessors
• Sandy Bridge largemem and GPU compute nodes
• InfiniBand network
• Jobs started with sbatch or idev

Stampede Upgrade (KNL cluster):
• login-knl1 (Haswell), running CentOS 7
• KNL compute nodes
• Omni-Path network
• Jobs started with sbatch or idev

Both clusters are reached from the Internet via ssh and share the $HOME, $WORK, and $SCRATCH filesystems.

Vectorization

Differences from KNC and understanding vectorization reports

Vectorization on KNL

Similarities to KNC:
• Supports 512-bit vectors
  • 16 32-bit floats/integers
  • 8 64-bit doubles
• 32 addressable registers
• Supports masked operations

Differences from KNC:
• 2 VPUs
• Full support for packed 64-bit integer arithmetic
• Supports unaligned loads & stores
• Supports the SSE/2/3/4, AVX, and AVX2 instruction sets
  • Only on 1 of the 2 vector units
• Many other improvements:
  • Improved gather/scatter (see the indirect-access sketch below)
  • Hardware FP divide
  • Hardware FP inverse square root
  • …
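As a small illustration of the gather improvement above, an indirectly indexed C loop (the routine and array names here are illustrative) is the kind of pattern that maps onto AVX-512 gather instructions:

    /* Indirectly indexed loads such as x[idx[i]] are implemented with
       gather instructions when vectorized; KNL's improved gather support
       makes this pattern considerably more profitable than on KNC. */
    void gather_copy(int n, double *y, const double *x, const int *idx)
    {
        for (int i = 0; i < n; i++)
            y[i] = x[idx[i]];
    }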

Vectorization Procedure

• Compile with -xMIC-AVX512 to target KNL

• Add -qopt-report=[234] to get optimization reports
  • 2: brief overview of which loops are and are not vectorized (search for “dependence”)
  • 3: summaries of load and store streams, alignment, and estimated speedup for each loop
  • 4: load and store stream information by array name, and estimated overhead of vectorization
• The primary inhibitor of vectorization is possible aliasing
  • Learn how to use the “restrict” keyword in C (see the sketch below)
  • Vectorization can be forced using a pragma
    • This may give incorrect results if aliasing is actually present!
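A minimal C sketch of the aliasing point above (the routine and array names are illustrative, not taken from swim.f). With restrict, the compiler is told the arrays never overlap, so the loop can be vectorized without being forced; forcing it instead with a directive such as #pragma omp simd (or Intel’s #pragma ivdep) gives wrong results if the arrays do alias.

    /* Illustrative routine: "restrict" promises the compiler that a, b,
       and c never overlap, removing the aliasing barrier to vectorization. */
    void scale_add(int n, double * restrict a,
                   const double * restrict b,
                   const double * restrict c)
    {
        /* Without restrict, vectorization could be forced here with
           "#pragma omp simd" (or Intel's "#pragma ivdep"), but that is
           only safe if the arrays really never alias. */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + 0.5 * c[i];
    }

Compiled with -xMIC-AVX512 and -qopt-report=2, the report should show this loop as vectorized.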

Optimization Reports

• Sample code “swim.f”: 551 lines of Fortran
• Optimization report sizes (default == all “phases”):
  • Level 2: 386 lines
  • Level 3: 696 lines
  • Level 4: 1253 lines
• Most of the length is from the vectorization report.
• The combined report includes other important information, so you probably don’t want to exclude the other “phases” (-qopt-report-phase); an example compile line follows.
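A possible compile line for generating these reports (assuming the Intel Fortran compiler, ifort; adjust the report level and phases as needed):

    ifort -xMIC-AVX512 -qopt-report=3 -qopt-report-phase=vec,loop -c swim.f

By default the remarks are written to swim.optrpt rather than to the terminal.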

Example loop nest from swim.f

!$OMP PARALLEL DO
do j=1,n
   do i=1,m
      cu(i+1,j) = .5d0*(p(i+1,j,mid)+p(i,j,mid))*u(i+1,j,mid)
      cv(i,j+1) = .5d0*(p(i,j+1,mid)+p(i,j,mid))*v(i,j+1,mid)
      z(i+1,j+1) = (fsdx*(v(i+1,j+1,mid)-v(i,j+1,mid))-fsdy*          &
                    (u(i+1,j+1,mid)-u(i+1,j,mid)))/                   &
                    (p(i,j,mid)+p(i+1,j,mid)+p(i+1,j+1,mid)+p(i,j+1,mid))
      h(i,j) = p(i,j,mid)+.25d0*                                      &
               (u(i+1,j,mid)*u(i+1,j,mid)+u(i,j,mid)*u(i,j,mid)       &
               +v(i,j+1,mid)*v(i,j+1,mid)+v(i,j,mid)*v(i,j,mid))
   end do
end do

Details don’t matter – just note that:
• These are 2 nested loops.
• There are a lot of array references on the right-hand sides.
• There are 4 arrays being stored.

Example Level 2 optimization report

LOOP BEGIN at swim.f(318,7)
   remark #15542: loop was not vectorized: inner loop was already vectorized
   [outer loop not vectorized]

   LOOP BEGIN at swim.f(319,11)
      remark #15301: PEEL LOOP WAS VECTORIZED
      [“peel loop” (prolog) reported separately]
   LOOP END

   LOOP BEGIN at swim.f(319,11)
      remark #15300: LOOP WAS VECTORIZED
      remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
      [main loop – only very high-level info here]
   LOOP END

   LOOP BEGIN at swim.f(319,11)
      remark #15301: REMAINDER LOOP WAS VECTORIZED
      [“remainder loop” (epilog) reported separately]
   LOOP END
LOOP END

Example Level 3 optimization report

LOOP BEGIN at swim.f(318,7)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at swim.f(319,11)
      remark #15301: PEEL LOOP WAS VECTORIZED
   LOOP END

   LOOP BEGIN at swim.f(319,11)
      remark #15300: LOOP WAS VECTORIZED
      [lots more stuff added here – see next slide]
      remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   LOOP END

   LOOP BEGIN at swim.f(319,11)
      remark #15301: REMAINDER LOOP WAS VECTORIZED
   LOOP END
LOOP END

Level 3 optimization report extra info

LOOP BEGIN at swim.f(319,11)
   remark #15300: LOOP WAS VECTORIZED

   [memory reference info]
   remark #15448: unmasked aligned unit stride loads: 14
   remark #15449: unmasked aligned unit stride stores: 2
   remark #15450: unmasked unaligned unit stride loads: 9
   remark #15451: unmasked unaligned unit stride stores: 2

   [estimated cycle cost & speedup]
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 98
   remark #15477: vector loop cost: 12.870
   remark #15478: estimated potential speedup: 7.540
   remark #15488: --- end vector loop cost summary ---

   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1

   [compiler cost model based on this assumed trip count]
   remark #25015: Estimate of max trip count of loop=250
LOOP END
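Note that the estimated potential speedup is roughly the scalar loop cost divided by the vector loop cost (98 / 12.87 ≈ 7.6, close to the reported 7.54); with 8 doubles per 512-bit vector, a value near 8 indicates the loop vectorizes efficiently.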

Level 4 optimization report

LOOP BEGIN at swim.f(319,11)
   remark #15300: LOOP WAS VECTORIZED
   [lots more stuff added here – see next slide]
   remark #15448: unmasked aligned unit stride loads: 14
   remark #15449: unmasked aligned unit stride stores: 2
   remark #15450: unmasked unaligned unit stride loads: 9
   remark #15451: unmasked unaligned unit stride stores: 2
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 98
   remark #15477: vector loop cost: 12.870
   remark #15478: estimated potential speedup: 7.540
   remark #15488: --- end vector loop cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=250
LOOP END

Level 4 optimization report extra info

LOOP BEGIN at swim.f(319,11)
   remark #15300: LOOP WAS VECTORIZED

   [alignment status for every array used]
   remark #15389: vectorization support: reference cu has unaligned access   [ swim.f(320,15) ]
   remark #15389: vectorization support: reference p has unaligned access   [ swim.f(320,15) ]
   remark #15388: vectorization support: reference p has aligned access   [ swim.f(320,15) ]
   remark #15389: vectorization support: reference u has unaligned access   [ swim.f(320,15) ]
   remark #15388: vectorization support: reference cv has aligned access   [ swim.f(321,15) ]
   remark #15388: vectorization support: reference p has aligned access   [ swim.f(321,15) ]
   [… lots of similar lines omitted here …]
   remark #15389: vectorization support: reference u has unaligned access   [ swim.f(325,15) ]
   remark #15389: vectorization support: reference u has unaligned access   [ swim.f(325,15) ]
   remark #15388: vectorization support: reference u has aligned access   [ swim.f(325,15) ]
   remark #15388: vectorization support: reference u has aligned access   [ swim.f(325,15) ]
   remark #15388: vectorization support: reference v has aligned access   [ swim.f(325,15) ]
   remark #15388: vectorization support: reference v has aligned access   [ swim.f(325,15) ]
   remark #15388: vectorization support: reference v has aligned access   [ swim.f(325,15) ]
   remark #15388: vectorization support: reference v has aligned access   [ swim.f(325,15) ]
   remark #15381: vectorization support: unaligned access used inside loop body

   [vector length used (typically 8 or 16)]
   remark #15305: vectorization support: vector length 8

   [vectorization overhead will be relatively high in peel and remainder loops, but should be low in the main loop]
   remark #15309: vectorization support: normalized vectorization overhead 0.117

And after all this…

Does it really work?

FLASH

[Figure: FLASH performance in steps/second vs. number of threads (0–300) on Sandy Bridge, Haswell, and KNL; thread counts marked include 68, 136, and 272]

More results: https://goo.gl/ZTBzuE