Intel® Omni-Path Architecture: Overview
August 2017

Legal Notices and Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life-saving, life-sustaining, critical control or safety systems, or in nuclear facility applications.

Intel products may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel may make changes to dates, specifications, product descriptions, and plans referenced in this document at any time, without notice.

This document may contain information on products in the design phase of development. The information herein is subject to change without notice. Do not finalize a design with this information.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

Intel Corporation or its subsidiaries in the United States and other countries may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the presented subject matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by estoppel or otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights.

Wireless connectivity and some features may require you to purchase additional software, services or external hardware.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other names and brands may be claimed as the property of others.

Copyright © 2017 Intel Corporation. All rights reserved.

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2017, Intel Corporation. All rights reserved. Intel, the Intel logo, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Intel® Omni-Path Architecture (OPA) Training Agenda

• Intel® OPA overview
• Omni-Path performance and innovations
• Sales enablement tools
• Online configurator tools
• Where to go for assistance

What is Intel® Omni-Path Architecture? Evolutionary Approach, Revolutionary Features, End-to-End Solution

Building on the industry’s best technologies
• Heavily leverages the existing Aries and Intel® True Scale fabrics as well as the OpenFabrics Alliance host stack
• Significant additions of Intel intellectual property
• Excellent performance at an even better price/performance
• Adds innovative new features and capabilities to improve performance, reliability, and QoS

Robust product offerings and ecosystem
• End-to-end Intel product line
• >100 OEM designs¹
• Strong ecosystem with 80+ Fabric Builders members

1 Source: Intel internal information. Design win count based on OEM and HPC storage vendors who are planning to offer either Intel-branded or custom switch products, along with the total number of OEM platforms that are currently planned to support custom and/or standard Intel® OPA adapters. Design win count as of November 1, 2015 and subject to change without notice based on vendor product plans. *Other names and brands may be claimed as property of others.

Intel® Omni-Path Architecture: Evolutionary Approach, Revolutionary Features, End-to-End Solution

End-to-end product line:
• HFI adapters – single port, x8 and x16: x16 adapter (100 Gb/s), x8 adapter (58 Gb/s), and “-F” processors with integrated HFI
• Edge switches – 1U form factor, 24 and 48 ports: 24-port edge switch and 48-port edge switch
• Director switches – QSFP-based, 192 and 768 ports: 192-port director switch (7U chassis) and 768-port director switch (20U chassis)
• Silicon – OEM custom designs, HFI and switch ASICs: HFI silicon up to 2 ports (50 GB/s total bandwidth); switch silicon up to 48 ports (1200 GB/s total bandwidth)
• Software – open source: host software and fabric manager
• Cables – third-party vendors: passive copper and active optical


Why OPA has been winning in HPC²

HPC Performance – Major Buying Criteria
• In most cases, Intel OPA has equal to or better HPC performance (latency, bandwidth, and applications)

Price/Performance – Next Evaluation/Buying Point¹
• Leadership price/performance at any scale – from entry to large supercomputers
• Build a better cluster – more FLOPS and storage

OPA Features – Key Differentiators
• Builds on key technologies from QLogic and Cray needed to support ever-increasing HPC cluster sizes and workloads
• Features: CPU/fabric integration, HPC optimization, and enhanced fabric functionality

1 Configuration assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of 648-port director switches and 36-port edge switches. Mellanox component pricing from www.kernelsoftware.com, with prices as of April 4, 2017. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com, with prices as of April 4, 2017. Intel® OPA pricing based on pricing from www.kernelsoftware.com as of April 4, 2017. *Other names and brands may be claimed as property of others. 2 See Performance backup slide for configuration information.

June 2017 Top500 Analysis
• Intel® OPA: 22% more total entries than EDR IB*
• Intel® OPA: 30% more top-100 entries than EDR IB* (13 vs. 10)
[Chart: count of Intel® OPA vs. EDR IB* systems in the Top 10, Top 15, Top 50, and Top 100]

• Intel® OPA: 7.6% higher average efficiency (Rmax/Rpeak) at scale than EDR IB* – 81.9% vs. 74.3% average efficiency on Intel® Xeon® processors
• Share of 100Gb FLOPS: 67.1 PF Intel® OPA vs. 38.7 PF EDR IB*
Source: www.top500.org June 2017 list. *Other names and brands may be claimed as the property of others.

Intel® OPA’s Impact across Many Segments

100’s of accounts deployed across many segments: supercomputers (e.g., LRZ), traditional HPC, HPC cloud, artificial intelligence, and enterprise R&D

Intel® Omni-Path Host Fabric Interface 100 Series – Single Port¹

Power (per adapter):
Model    | Copper typical / maximum | Optical (3W QSFP) typical / maximum
x16 HFI  | 7.4W / 11.7W             | 10.6W / 14.9W
x8 HFI   | 6.3W / 8.3W              | 9.5W / 11.5W

• Low-profile PCIe card: 2.71" x 6.6" max., spec compliant; standard and low-profile brackets
• Wolf River (WFR-B) HFI ASIC, PCIe Gen3, single 100 Gb/s Intel® OPA port
• x16 HFI: 100Gb throughput; x8 HFI: ~58Gb throughput (PCIe limited)
• QSFP28 form factor; supports multiple optical transceivers; single link-status LED (green)
• Thermal: passive QSFP port heatsink; standard 55C, 200lfm environment

1Specifications contained in public Product Briefs.

Intel® Omni-Path Edge Switch 100 Series – 24/48 Port¹

Power (per switch):
Model    | Copper typical / maximum | Optical (3W QSFP) typical / maximum
24 ports | 146W / 179W              | 231W / 264W
48 ports | 186W / 238W              | 356W / 408W

• Compact space (1U): 1.7"H x 17.3"W x 16.8"L
• Switching capacity: 4.8/9.6 Tb/s
• Line speed: 100Gb/s link rate
• Standards-based hardware connections: QSFP28
• Redundancy: N+N redundant power supplies (optional); N+1 cooling fans (speed control, customer-changeable forward/reverse airflow)
• Management module (optional)

1Specifications contained in public Product Briefs.

*NEW* Eldorado Forest Edge Switch SKUs – Hot-Swap Overview

Feature – Description
• Port configuration: 48-port QSFP connectors
• Port bandwidth: QSFP ports support OPA at 100Gbps
• Dimensions: system height 1U; system depth 25"
• Power delivery: 100-264VAC input options; hot-swappable; AC & DC redundancy option (2 PSUs)
• Management: in-band, out-of-band, and fabric management same as current
• Cooling: air; front-to-back direction (port-side exhaust); N+1 fan redundancy (6 fans total); hot-swappable (2 trays of 3 fans)

Intel® Omni-Path Director Class Systems 100 Series – 6-Slot/24-Slot Systems¹

Power (per system):
Model   | Copper typical / maximum | Optical (3W QSFP) typical / maximum
6-Slot  | .6kW / 2.3kW             | 2.4kW / 3.0kW
24-Slot | 6.8kW / 8.9kW            | 9.5kW / 11.6kW

• Highly integrated: 7U/20U plus 1U shelf
• Switching capacity: 38.4/153.6 Tb/s

Common features:
• Intel® Omni-Path Fabric Switch Silicon 100 Series (100Gb/s)
• Standards-based hardware connections – QSFP28
• Up to full bisectional bandwidth Fat-Tree internal topology
• Common management card with the Edge Switches
• 32-port QSFP28-based leaf modules
• Air-cooled, front-to-back (cable side) airflow
• Hot-swappable modules: leaf, spine, management, fan, power supply
• Module redundancy: management (N+1), fan (N+1, speed controlled), PSU (DC, AC/DC)
• System power: 180-240VAC

1Specifications contained in public Product Briefs.

Next Up for Intel® OPA: Artificial Intelligence

Intel offers a complete AI portfolio
• From CPUs to software to computer vision to libraries and tools

Intel® OPA advantages – breakthrough performance on scale-out apps:
• Low latency
• High bandwidth
• High message rate
• GPU Direct RDMA support
• Xeon Phi integration

A world-class interconnect solution for shorter time to train.
[Graphic: Intel AI portfolio spanning the cloud, data center, accelerant technologies, and things & devices]

NVMe* over OPA: Ready for Prime Time

OPA + Intel® Optane™ technology:
• High endurance
• Low latency
• High efficiency
• A total NVMe over Fabric solution!

NVMe-over-OPA status:
• Supported in the 10.4.1 release
• Compliant with the NVMe over Fabrics (NVMe-oF) 1.0 specification
• ~1.5M 4k random IOPS; 99% bandwidth efficiency

Only Intel is delivering a total NVMe over Fabric solution!
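For orientation only, a minimal host-side sketch of attaching an NVMe-oF target over an RDMA-capable fabric such as Intel® OPA, assuming the standard Linux nvme-cli tooling; the target address, port, and NQN below are placeholders, not values from this testing:

    # Load the NVMe-oF RDMA host module (assumes an RDMA-enabled IFS/OFED stack is installed)
    modprobe nvme-rdma
    # Discover and connect to a remote NVMe subsystem (placeholder address and NQN)
    nvme discover -t rdma -a 192.168.100.10 -s 4420
    nvme connect  -t rdma -a 192.168.100.10 -s 4420 -n nqn.2017-08.example:optane-target
    # The remote namespace then appears as a local /dev/nvmeXnY block device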

Target and host system configuration: 2 x Intel® Xeon® CPU E5-2699 v3 @ 2.30GHz, Intel® Server Board S2600WT, 128GB DDR4, CentOS 7.3.1611, kernel 4.10.12, IFS 10.4.1, NULL-BLK, FIO 2.19. hfi1 driver options: krcvqs=8 sge_copy_mode=2 wss_threshold=70. *Other names and brands may be claimed as the property of others.

Intel® Omni-Path: Maximizing Support for Heterogeneous Clusters

[Diagram: heterogeneous cluster nodes on a single Intel® OPA fabric – nodes with Intel Xeon processors (Haswell and Broadwell) using PCIe WFR HFI cards (including GPU nodes with GPU memory on the PCI bus), Intel Xeon processors with integrated fabric (Skylake-F), and Intel Xeon Phi™ processors (Knights Landing and Knights Landing-F)]

GPU Direct v3 support in the 10.3 release

Greater flexibility for creating compute islands depending on user requirements

Performance

Latency, Bandwidth, and Message Rate – Intel® MPI Benchmarks, Intel® Omni-Path Architecture (Intel® OPA)

Metric                                                            | Intel® Xeon® E5-2697A v4 CPU¹ (2.6 GHz, 16c) | Intel® Xeon® Platinum 8170 CPU² (2.1 GHz, 26c)
Latency (one-way, 1 switch, 8B) [ns]; PingPong                    | 910   | 940
Bandwidth (1 rank per node, 1 port, uni-dir, 1MB) [GB/s]; Uniband | 12.3  | 12.3
Bandwidth (1 rank per node, 1 port, bi-dir, 1MB) [GB/s]; Biband   | 24.4  | 24.5
Message rate (max ranks per node, uni-dir, 8B) [Mmps]; Uniband    | 112.0 | 152.0 (+35%)
Message rate (max ranks per node, bi-dir, 8B) [Mmps]; Biband      | 131.7 | 211.2 (+60%)
(The Platinum 8170 message-rate gains come with its higher core count.)

See configuration Item # 1
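For context, a sketch of how such Intel® MPI Benchmarks numbers are typically gathered – PingPong, Uniband, and Biband are standard IMB-MPI1 benchmarks; the host names and rank counts below are placeholders, not the exact commands used for this table:

    # Two-node latency/bandwidth: one rank per node
    mpirun -np 2 -ppn 1 -hosts node01,node02 IMB-MPI1 PingPong Uniband Biband
    # Message rate: maximum ranks per node (e.g., 52 per node on the Platinum 8170)
    mpirun -np 104 -ppn 52 -hosts node01,node02 IMB-MPI1 -npmin 104 Uniband Biband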

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Copyright © 2016, Intel Corporation. *Other names and brands may be claimed as the property of others.

Intel® QuickPath Interconnect vs. UltraPath Interconnect

Previous-generation Intel® Xeon® processors were interconnected with the Intel® QuickPath Interconnect (QPI). Intel® Xeon® Scalable Family processors (codename Skylake) are now connected with the Intel® UltraPath Interconnect (UPI).

Example: local MPI processes on dual-socket servers (no QPI or UPI involvement):
[Diagram: Node 1 and Node 2, each with its Intel® OPA HFI attached by a PCIe* link to the local socket; Intel® OPA link between nodes – “local socket” latency]

For more detail on Intel® UPI see: https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview

Intel® QuickPath Interconnect vs. UltraPath Interconnect

For systems with one Intel® OPA adapter installed on a single socket, any data communication involving cores on the other socket needs to traverse the QPI/UPI.

Example: remote MPI processes on dual-socket servers:

[Diagram: Node 1 and Node 2, each dual-socket; MPI ranks on the remote socket cross the QPI/UPI link to reach the socket with the Intel® OPA HFI and its PCIe* link; Intel® OPA link between nodes – “remote socket” latency]

For more detail on Intel® UPI see: https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview

Intel® QuickPath Interconnect vs. UltraPath Interconnect

Benefits of UPI vs. QPI for processes on the remote socket:

                                                      | Intel® Xeon® E5-2697A v4 CPU¹ (QPI) | Intel® Xeon® Platinum 8170 CPU² (UPI)
MPI latency using local socket                        | 1.13 µsec | 1.10 µsec
MPI latency using remote socket                       | 1.38 µsec | 1.26 µsec
Difference – socket-to-socket crossing (x2 servers)   | 0.25 µsec | 0.16 µsec

The difference is the added socket-to-socket (QPI/UPI) latency across the two servers, e.g. 1.38 − 1.13 = 0.25 µsec for QPI vs. 1.26 − 1.10 = 0.16 µsec for UPI – up to 36% faster with UPI.

See configuration item #2 *CPU frequency fixed to 2.1 GHz for both processors For more detail on Intel® UPI see: https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview

Application Performance – Intel® Xeon® Platinum 8170 processors: Intel® Omni-Path Architecture (Intel® OPA) vs. EDR InfiniBand*
Best performance using either Intel® MPI or Open MPI for both fabrics¹

[Chart: relative performance, Intel® OPA vs. EDR IB* (higher is better), across workloads in finite element analysis, electromagnetics, material sciences, computational molecular dynamics, computational geodynamics, computational fluid dynamics, climate & weather, chemistry, physics/bio, and graphics. Most workloads favor Intel® OPA by roughly 1-15%; a handful favor EDR IB* by up to ~7%.]

16 nodes / 52 ranks per node². *See the configuration slides for complete configuration information for select workloads.
1. Not every application was tested with both Intel MPI and Open MPI. See configuration slide for necessary details.
2. VASP benchmark comparison at 8 nodes only. Some workloads use different rank counts. See configuration slide for detail.

Configuration for Application Performance: Intel® Xeon® Platinum 8170 processors (page 1 of 2)

Common configuration for all applications listed below (Unless otherwise specified for the application)

. Internal Intel testing with dual socket Intel® Xeon® Platinum 8170 processor nodes. Unless otherwise specified, one MPI rank per physical CPU core is used. 192 GB 2666 MHz DDR4 memory on first 8 nodes, 64 GB 2666 MHz DDR4 memory on second group of 8 nodes. This is due to current limited memory supply. Red Hat Enterprise Linux* Server release 7.3 (Maipo), 3.10.0-514.16.1.el7.x86_64 kernel. Intel® Turbo Boost and Hyper-Threading Technology enabled. All compute nodes connected to one edge switch device. All data written over NFSv3 with 1GbE to Intel SSDSC2BB48 drive storage. Intel MPI 2017 Update 3, Open MPI 2.1.1 or 1.10.4-hfi as packaged with IFS for Intel OPA.

. Intel OPA: Intel Fabric Suite 10.3.1.0.22. Intel Corporation Device 24f0 – Series 100 HFI. OPA Switch: Series 100 Edge Switch – 48 port. Non-default HFI parameters: krcvqs=4, eager_buffer_size=8388608, max_mtu=10240. Best performance shown using either I_MPI_FABRICS tmi or shm:tmi
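As a reference point only, a minimal sketch of how non-default hfi1 parameters like those above are typically made persistent on a RHEL-style host; the file name is illustrative, and the exact parameter set should follow the IFS documentation for your release:

    # /etc/modprobe.d/hfi1.conf (illustrative file name)
    options hfi1 krcvqs=4 eager_buffer_size=8388608 max_mtu=10240

    # Reload the hfi1 driver (or rebuild the initramfs and reboot) so the options take effect,
    # then select the PSM2/TMI path for Intel MPI as noted above:
    export I_MPI_FABRICS=shm:tmi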

. EDR IB* based on internal testing: MLNX_OFED_LINUX-4.0-2.0.0.1 (OFED-4.0-2.0.0). Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 - 36 Port EDR Infiniband switch. Forward Error Correction is automatically disabled because <=2M copper IB* cables were used in the testing.. Unless otherwise noted, I_MPI_FABRICS=shm:dapl for Intel MPI. For Open MPI, MXM/FCA as packaged with MOFED, flags used: “-x LD_PRELOAD=/opt/mellanox/hcoll/lib/libhcoll.so -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 -x HCOLL_ENABLE_MCAST_ALL=0”.

. LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) Feb 16, 2016 stable version release. Official Git Mirror for LAMMPS (http://lammps.sandia.gov/download.html) ls, rhodo, sw, and water benchmark. Intel MPI 2017 Update 3 with 52 ranks per node and 2 OMP threads per rank. Common parameters: I_MPI_PIN_DOMAIN=core Run detail: Number of time steps=100, warm up time steps=10 (not timed) Number of copies of the simulation box in each dimension: 8x8x4 and problem size: 8x8x4x32k = 8,192k atoms Build parameters: Modules: yes-asphere yes-class2 yes-kspace yes-manybody yes-misc yes-molecule yes-mpiio yes-opt yes-replica yes-rigid yes-user-omp yes-user-intel. Binary to be built: lmp_intel_cpu. . Runtime lammps parameters: -pk intel 0 -sf intel -v n 1
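A hedged sketch of what the LAMMPS launch above might look like with Intel MPI; the host file, node count, and input file are placeholders, while the rank/thread counts and Intel-package flags come from the configuration:

    # 16 nodes x 52 ranks per node, 2 OpenMP threads per rank, ranks pinned per core
    export I_MPI_PIN_DOMAIN=core
    export OMP_NUM_THREADS=2
    mpirun -np 832 -ppn 52 -hostfile ./hosts \
        ./lmp_intel_cpu -pk intel 0 -sf intel -v n 1 -in in.rhodo   # in.rhodo is a placeholder input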

. NAMD version 2.10b2, stmv and apoa1 benchmark. Intel MPI 2017 Update 3 Build detail: CHARM 6.6.1. FFTW 3.3.4. Relevant build flags: ./config Linux-x86_64-icc --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx -- cxx icpc --cc icc --with-fftw3. 52 MPI ranks per node.

. NWCHEM release 6.6. Binary: nwchem_armci-mpi_intel-mpi_mkl with MPI-PR run over MPI-1. Workload: siosi3 and siosi5. http://www.nwchem-sw.org/index.php/Main_Page. Intel OPA: -genv PSM2_SDMA=0. EDR parameters: I_MPI_FABRICS=shm:ofa. 2 ranks per node, 1 rank for computation and 1 rank for communication. -genv CSP_VERBOSE 1 -genv CSP_NG 1 -genv LD_PRELOAD libcasper.so
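An illustrative launch shape for the NWChem/Casper configuration above; the node count and input file name are placeholders, and the environment settings are the ones listed for Intel® OPA:

    # 2 ranks per node: 1 computation rank + 1 Casper communication (ghost) rank per node
    mpirun -np 32 -ppn 2 \
        -genv PSM2_SDMA 0 \
        -genv CSP_VERBOSE 1 -genv CSP_NG 1 \
        -genv LD_PRELOAD libcasper.so \
        ./nwchem_armci-mpi_intel-mpi_mkl siosi3.nw   # siosi3.nw is a placeholder input name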

. BSMBench - An HPC Benchmark for BSM Lattice Physics Version 1.0. 32 ranks per node. Parameters: global size is 64x32x32x32, proc grid is 8x4x4x4. Machine config build file: cluster.cfg

. GROMACS version 2016.2. http://www.prace-ri.eu/UEABS/GROMACS/1.2/GROMACS_TestCaseB.tar.gz lignocellulose-rf benchmark. Build detail: -xCORE-AVX2 (Open MPI) and AVX512 for Intel MPI, -g -static-intel. CC=mpicc CXX=mpicxx -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF - DGMX_HWLOC=OFF -DGMX_SIMD=AVX2_256 (OpenMPI) or AVX512 for Intel MPI. GMX_OPENMP_MAX_THREADS=256. Run detail: gmx_mpi mdrun -s run.tpr -gcom 20 -resethway -noconfout
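For readability, part of the GROMACS build and run detail above rendered as a configure/run sketch (Open MPI / AVX2_256 variant; directory layout and rank count are placeholders):

    # Configure with the flags listed above (Open MPI build)
    CC=mpicc CXX=mpicxx cmake .. \
        -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DGMX_MPI=ON -DGMX_OPENMP=ON \
        -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF \
        -DGMX_HWLOC=OFF -DGMX_SIMD=AVX2_256
    make -j
    # lignocellulose-rf benchmark run (placeholder rank count: 16 nodes x 52 ranks)
    mpirun -np 832 gmx_mpi mdrun -s run.tpr -gcom 20 -resethway -noconfout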

. Spec MPI2007, https://www.spec.org/mpi/. *Intel Internal measurements marked estimates until published. Applications listed with “-Large” or “-Medium” in the name were part of the spec MPI suite. OPA parameters: best of I_MPI_FABRICS shm:tmi and tmi. EDR parameters: best of I_MPI_FABRICS shm:dapl and shm:ofa. Intel Parallel Studio 2017 Update 4, Intel compiler 17.0.4. -O3 -xCORE-AVX2 -no-prec-div. Intel MPI: mpiicc, mpiifort, mpiicpc. Open MPI: mpicc, mpifort, mpicxx. Run detail: mref and lref suites, 3 iterations. 121.pop2: CPORTABILITY=-DSPEC_MPI_CASE_FLAG. 126.lammps: CXXPORTABILITY = - DMPICH_IGNORE_CXX_SEEK. 127.wrf2: CPORTABILITY = -DSPEC_MPI_CASE_FLAG -DSPEC_MPI_LINUX. 129.tera_tf=default=default=default: srcalt=add_rank_support 130.socorro=default=default=default: srcalt=nullify_ptrs FPORTABILITY = -assume nostd_intent_in CPORTABILITY = -DSPEC_EIGHT_BYTE_LONG CPORTABILITY = -DSPEC_SINGLE_UNDERSCORE. Intel® OPA: 32 MPI ranks per node for 115.fds4 benchmark

Configuration for Application Performance: Intel® Xeon® Platinum 8170 processors (page 2 of 2)

. Quantum ESPRESSO is an integrated suite of Open-Source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials. http://www.quantum-espresso.org/ Build detail: MKLROOT=/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl export FC=mpifort export F90=$FC export F77=$FC export MPIF90=$FC export FCFLAGS="-O3 -xCORE-AVX2 -fno-alias -ansi-alias -g -mkl -I$MKLROOT/include/fftw -I${MKLROOT}/include/intel64/ilp64 -I${MKLROOT}/include -qopenmp -static-intel" export FFLAGS=$FCFLAGS export CC=mpicc export CPP="icc -E" export CFLAGS=$FCFLAGS export AR=xiar export BLAS_LIBS="-L$MKLROOT/lib/intel64 -lmkl_blas95_lp64" export LAPACK_LIBS="- L$MKLROOT/lib/intel64_lin -lmkl_blacs_openmpi_lp64" export SCALAPACK_LIBS="-L$MKLROOT/lib/intel64_lin -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" export FFT_LIBS="-L$MKLROOT/intel64" ./configure --enable-openmp --enable-parallel. BLAS_LIBS= -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core ELPA_LIBS_SWITCH = enabled SCALAPACK_LIBS = $(TOPDIR)/ELPA/libelpa.a -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 DFLAGS= -D__INTEL -D__FFTW -D__MPI -D__PARA -D__SCALAPACK -D__ELPA -D__OPENMP $(MANUAL_DFLAGS) AUSURF112 benchmark, all default options

. LS-DYNA, A Program for Nonlinear Dynamic Analysis of Structures in Three Dimensions Version : mpp s R9.1.0 Revision: 113698, single precision (I4R4) OPA parameters: best of I_MPI_FABRICS shm:tmi and tmi EDR parameters: best of I_MPI_FABRICS shm:dapl and shm:ofa. Example pfile: gen { nodump nobeamout dboutonly } dir { global one_global_dir local /tmp/3cars }. 32 MPI ranks per node used for OPA with 3cars benchmark.

. SPECFEM3D_GLOBE simulates the three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM). It is a time-step algorithm which simulates the propagation of earth waves given the initial conditions, mesh coordinates/ details of the earth crust. small_benchmark_run_to_test_more_complex_Earth benchmark, default input settings. specfem3d_globe-7.0.0. Intel Parallel Studio XE 2017 Update 4. FC=mpiifort CC=mpiicc MPIFC=mpiifort FCFLAGS=-g -xCORE_AVX2 CFLAGS=-g -O2 -xCORE_AVX2. run_this_example.sh and run_mesher_solver.sh, NCHUNKS=6, NEX_XI=NEX_ETA=80, NPROC_XI=NPROC_ETA=10. 600 cores used, 52 cores per node.

. VASP v6 beta, May 9, 2017, https://github.com/vasp-dev/vasp-knl.git Intel Parallel Studio XE 2017 Update 4, CPP_OPTIONS= -DMPI -DHOST=\"IFC17_impi\" \ -DCACHE_SIZE=12000 -Davoidalloc - DMPI_BLOCK=8000 -DscaLAPACK -Duse_collective -DnoAugXCmeta -Duse_bse_te -Duse_shmem -Dtbdyn -Dvasp6 -D_OPENMP -DPROFILING -Dshmem_bcast_buffer -Dshmem_rproj - Dmemalign64 -DELPA -DVTUNE_PROFILING ARCH=-xCORE-AVX512 Testing performed on 8 nodes with 192 GB 2666 MHz DDR4 memory

. WRF - Weather Research & Forecasting Model (http://www.wrf-model.org/index.php) version 3.5.1. OPA parameters: best of I_MPI_FABRICS shm:tmi and tmi. EDR parameters: I_MPI_FABRICS shm:dapl and shm:ofa. Intel Parallel Studio XE 2017 Update 4, ifort, icc, mpif90, mpicc. -xCORE_AVX2 -O3 . Net CDF 4.4.1.1 built with icc. Net CDF-fortran version 4.4.4 built with icc.

. OpenFOAM is a free, open source CFD software package developed primarily by [OpenCFD](http://www.openfoam.com) since 2004 and is currently distributed by [ESI-OpenCFD](http://www.openfoam.com) and the [OpenFOAM Foundation](http://openfoam.org). It has a large user base across most areas of engineering and science, from both commercial and academic organisations. OpenFOAM has an extensive range of features to solve anything from complex fluid flows involving chemical reactions, turbulence and heat transfer, to acoustics, solid mechanics and electromagnetics. (http://www.openfoam.com/documentation). Version v1606+ used for Intel MPI and Open MPI with OPA. Version v1612+ used for Open MPI with EDR IB*. Gcc version 4.8.5 for Intel MPI, Icc version 17.0.4 used for Open MPI. All default make options.

Price/Performance

Intel® Omni-Path Fabric’s 48-Radix Chip: It’s more than just a 33% increase in port count over a 36-radix chip

Connecting 768 nodes:
• InfiniBand* EDR (36-port switch chip): five-hop Fat Tree – two (2) 648-port director switches plus (43) 36-port edge switches
• Intel® Omni-Path Architecture (48-port switch chip): three-hop Fat Tree – one (1) 768-port director switch

Metric          | InfiniBand* EDR     | Intel® OPA          | % reduction
Edge switches   | (43) 36-port        | Not required        | 100%
Cables          | 1,542               | 768                 | 50%
Rack space      | 99U (2+ racks)      | 20U (<½ rack)       | 79%
Switch latency¹ | ~680ns (5 hops)     | 300-330ns² (3 hops) | 51-55%

Because a 648-port director cannot reach 768 nodes on its own, the EDR design needs a tier of edge switches in front of the directors, adding two more switch hops per path; the 48-radix Intel® OPA chip allows a single 768-port director, so every path stays at three switch hops.

1. Latency numbers based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches. See www.Mellanox.com for more product information. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. *Other names and brands may be claimed as the property of others.

Are You Leaving Performance on the Table?

[Chart 1: fewer switches required – switch chips required vs. cluster size (up to ~3,500 nodes), InfiniBand* 36-port switch vs. Intel® OPA 48-port switch]
[Chart 2: more servers, same budget¹ – up to 27% more servers with Intel® OPA (952 servers vs. 750 with Mellanox* EDR)]

1 Configuration assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of 648-port director switches and 36-port edge switches. Mellanox component pricing from www.kernelsoftware.com, with prices as of April 4, 2017. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com, with prices as of April 4, 2017. Intel® OPA pricing based on pricing from www.kernelsoftware.com as of April 4, 2017 * Other names and brands may be claimed as property of others.

3-Year TCO Advantage

Includes HW acquisition costs (server and fabric), 24x7 3-year support, and 3-year power and cooling costs

[Charts: per-port comparisons of Mellanox* EDR vs. Intel® OPA for fabric cost¹, fabric power and cooling¹, and warranty and support costs¹ – Intel® OPA up to 58%, 59%, and 88% lower across the three categories]

Intel® OPA can deliver up to 64% lower fabric TCO over 3 years¹

1 Configuration assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of director switches and edge switches. Includes hardware acquisition costs (server and fabric), 24x7 3-year support (Mellanox Gold support), and 3-year power and cooling costs. Mellanox component pricing from www.kernelsoftware.com, with prices as of April 4, 2017. Intel® OPA pricing is also from www.kernelsoftware.com on April 4, 2017. Mellanox power data based on Mellanox CS7500 Director Switch, Mellanox SB7700/SB7790 Edge switch, and Mellanox ConnectX-5 VPI adapter card product briefs posted on www.mellanox.com as of April 4, 2017. Intel OPA power data based on product briefs posted on www.intel.com as of April 4, 2017. Power and cooling costs based on $0.10 per kWh, and assumes server power costs and server cooling cost are equal and additive. * Other names and brands may be claimed as property of others.

Intel® Fabric Builders Current Status

Summary:
• Grew to 85 members in 2016
• OSVs shipping OPA inbox!
• Storage OEMs delivering OPA-based products
• 1st catalog published
• Omni-Path Users Group (OPUG)

Last update December 19, 2016

OPA Features

Intel® Omni-Path Architecture: Performance-Enhanced Architecture

CPU/Fabric Integration
• Lower power and cost, higher density
• Increased bandwidth to each socket
• Improved latency to each socket

Optimized Host Implementation
• High MPI message rate
• Low latency & scalable architecture
• RDMA support
• Complementary storage traffic support

Enhanced Fabric Architecture
• High traffic throughput, low port-to-port latency
• Latency-efficient error detection & correction
• Deterministic latency and QoS features

CPU-Fabric Integration with the Intel® Omni-Path Architecture

Key value vectors: performance, density, cost, power, reliability.

[Roadmap diagram, performance over time:
• Intel® OPA HFI card with Intel® Xeon® processor E5-2600 v3 and E5-2600 v4
• Multi-chip package integration: Intel® Xeon Phi™ processor (Knights Landing) and future Intel® Xeon® processor (Skylake) with Intel® OPA
• Future generations: next-generation Intel® Xeon® and Xeon® Phi™ with tighter Intel® OPA integration, additional improvements, and features]

Intel® OPA Integration Advantages over Time

Discrete adapter → integration over time:
• Xeon & KNL with a discrete OPA 100G HFI (motherboard or PCIe adapter card): external PCIe Gen3; all PCIe lanes available; price/performance; OPA Gen1 (shipping)
• KNL-F with OPA 2x 100G HFI in a multi-chip package (MCP): PCIe internal to the MCP; dual rail / bandwidth; density; price/performance; OPA Gen1 (shipping)
• SKX-F with OPA 100G HFI in a multi-chip package (MCP): PCIe internal to the MCP; dual rail / bandwidth; density; price/performance; OPA Gen1 (2H’17)

Data Path: InfiniBand* HCA vs. Intel® OPA HFI

Intel® OPA first-generation HFI architecture: the L4 transport is implemented with hardware assists plus processor cores that have direct, low-latency, consistent access to the entire connection state – delivering low and consistent latency at fabric scale.

Generic InfiniBand* HCA architecture: the L4 transport runs on the HCA, with connection state loaded at fabric initialization into high-performance DRAM; the memory interface gives low-latency access to only a small subset of the connection state (data is loaded, then sent), and cache misses as the fabric scales require CPU intervention to update connection state.

Intel® Omni-Path Architecture Data Transport: Multi-modal Data Acceleration (MDA)

Onload, offload, and RDMA:
• Onload and offload are approaches for implementing the transport layer
• The choice influences achievable latency, message rate, scalability, and data transfer options
• RDMA is an underlying data transfer protocol, and Intel® OPA absolutely supports it!

Intel® OPA with MDA data transfer overview:
• The Host Fabric Interface (HFI) automatically selects the most efficient resources and data path based on the data type and message size of the transfer
• RDMA transfers are used when they make sense – typically LARGE MESSAGES, where setup time is a relatively small part of the overall transfer time
• Programmed I/O sends, coupled with fast memory-copy receives, are used for SMALLER MESSAGES, for higher efficiency than setting up an RDMA connection

Multi-Modal Data Acceleration (MDA): Optimizing Data Movement through the Fabric

Multi-Modal Data Acceleration automatically chooses the most efficient data transfer path:
• Highly latency-sensitive small messages (MPI / HPC app traffic): programmed I/O on transmit into an eager buffer on the receive side, followed by a short CPU copy into the application receive buffer
• Bandwidth-sensitive large messages (HPC app/bulk and storage traffic over VERBS): send DMA engines with header generation on transmit, and direct placement into the application buffer in memory on receive

There is no single transport method that is optimized for all cases – e.g., even full offload has too much overhead for smaller messages.

Data Movement through the Fabric: Multiple Choices Drive Performance (Mellanox vs. Intel® OPA)

[Diagram comparing data movement paths:
• Mellanox: MPI and storage traffic both enter through VERBS – a single entry path plus accelerators and hardware assists (FCA, DDP, SHARP, MxM)
• Intel® OPA: multiple paths for data-transfer efficiency – MPI / HPC app traffic (small/medium packets; latency- and bandwidth-sensitive) takes a shorter, more efficient code path (~10%), while storage and bulk traffic uses VERBS with storage acceleration, backed by hardware assists such as header generation and SDMA]

QoS Innovations

The HPC-Optimized Solution: Performance Starts Here

Intel® OPA link transfer: application-generated messages are segmented into MPI packets of up to the Maximum Transfer Unit (MTU) size; Intel® OPA adds 8K/10K MTU sizes (256 B to 10 KB). Packets are carried as Flow Control Digits (flits) of 64 data or command bits plus a 1-bit type; 16 flits plus a 14-bit CRC and a 2-bit credit return form a 1056-bit Link Transfer Packet (16 × 65 + 14 + 2 = 1056 bits).

What InfiniBand can’t provide:
• No added latency for error detection, which enables a BER of 3e-29 – virtually eliminating performance-killing end-to-end retries (roughly one escape every 92 trillion hours)
• Traffic Flow Optimization (TFO): data prioritization with 65-bit granularity – high-priority data preempts lower-priority data, significantly reducing latency jitter and run-time variations
• All lanes are not reset for a single lane failure – a single lane failure does not take down the entire link

InfiniBand standard: a serial stream with 64b/66b encoding (64 data bits plus 2 bits of overhead per 66 bits); 256 B to 4 KB MTU (user documentation shows 4K MTU as the largest; other sizes are possible); error detection adds no latency only when FEC is disabled; and once transmission starts, data is sent until the message is complete.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.

Intel® Omni-Path Fabric Link Layer Innovation: Traffic Flow Optimization (TFO)¹ vs. InfiniBand² Virtual Lane (VL) Operation

[Diagram: Intel® OPA with TFO – Packet A is suspended so the flits of high-priority Packet B can be sent to the output, then Packet A resumes to completion; Packet A arrives at the output ISL only slightly before Packets B/C. InfiniBand* VL operation – even with a higher-priority VL, an already-started Packet A can’t be preempted, so Packets B/C must follow the completion of Packet A on the ISL.]

Legend: Packet A (VL0) – low priority, storage traffic; Packet B (VL1) – highest priority, MPI traffic; Packet C (VL0) – low priority, other traffic. VL = Virtual Lane (each lane has a different priority).

1 Internal design documents. 2 InfiniBand™ Architecture Specification Release 1.3.1. *Other names and brands may be claimed as property of others. Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.

Traffic Flow Optimization (TFO): Point-to-Point Latency

[Chart: 8-byte osu_latency (µsec, two switch hops) over 50 iterations for three cases – no contention, 1MB contention with TFO not enabled, and TFO enabled. Setup: two edge switches joined by a single inter-switch link (ISL); latency-sensitive, app-like 8-byte osu_latency traffic shares the ISL with storage-like 1MB osu_bibw traffic.]

TFO reduces average latency with storage contention by up to 78%, with up to 92% lower standard deviation.

Tests performed on Intel® Xeon® Processor E5-2697A v4 dual-socket servers with 2133 MHz DDR4 RAM per node. OSU OMB 4.1.1 osu_latency and osu_bibw with 1 MPI rank per node for each test. osu_bibw and osu_latency tests repeated in an indefinite loop and 50 iterations are sampled. Open MPI 1.10.0 as packaged in IFS 10.0.0.0.697. Intel Corporation Device 24f0 – Series 100 HFI ASIC (B0 silicon). OPA Switch: Series 100 Edge Switch – 48 port (B0 silicon). “Compute” virtual fabric enabled in opafm.xml with 30% bandwidth allocation and 127. SmallPacket, LargePacket, and PreemptLimit set to 256, 4096, and 4096, respectively.
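A hedged sketch of how such a contention test is typically launched (Open MPI syntax per the configuration above; the host names are placeholders, and the real test looped the benchmarks and sampled 50 iterations):

    # Storage-like traffic: 1MB bi-directional bandwidth across the ISL, run in the background
    mpirun -np 2 --host node01,node03 ./osu_bibw &
    # Latency-sensitive traffic: 8-byte point-to-point latency sharing the same ISL
    mpirun -np 2 --host node02,node04 ./osu_latency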

Fabric Protection

The Industry Solution: Forward Error Correction (FEC) – InfiniBand* and Ethernet Standard

[Diagram: FEC in an EDR InfiniBand* 36-radix switch – a serial bit stream arrives from the previous stage, and a fixed number of bits is checked for single- or multi-bit errors before data is passed on to the next stage.]
• FEC enabled: ~60ns latency penalty per switch – the switch must wait for an entire fixed bit stream to arrive before error determination; errors are repaired locally
• FEC disabled: saves the ~60ns delay (lower latency and higher bandwidth), but all errors are propagated, are seen as “end-to-end”, and must be resolved by the servers
• FEC can be enabled or disabled
InfiniBand source: http://www.infinibandta.org/content/pages.php?pg=technology_public_specification and www.mellanox.com

Intel® Omni-Path Fabric Link Level Innovation: Packet Integrity Protection (PIP)

PIP provides a Bit Error Ratio (BER) of 3e-29.
[Diagram: LTPs enter from the fabric and pass a CRC check before data moves on to the next stage. Good LTPs are implicitly acknowledged. When an LTP fails the CRC check, the receiver discards incoming LTPs after the error is detected until the corrected LTP is received, and explicitly requests correction with a retry request (a null LTP carrying the replay request – no overhead). LTPs are then replayed in-line from the port’s replay buffer, starting at the replay start point (the error point) forward.]
• Error detection/correction is always enabled, and the same mechanism is used on every link
• An error can escape the link only once every 92 trillion hours

Ohio State Microbenchmarks – 8B MPI Latency / 1 Switch Hop: Intel® Omni-Path Architecture (Intel® OPA) PSM vs. InfiniBand* EDR MXM

[Charts: 8-byte MPI latency (µsec, lower is better) with 3m and 2m copper cables – EDR IB*/MXM with older switch firmware, EDR IB*/MXM with upgraded switch firmware, and Intel® OPA/PSM. With 3m cables FEC remains enabled on EDR (approx. 1.04 and 1.04 µs vs. approx. 0.92 µs for Intel® OPA); with 2m cables and upgraded firmware FEC is automatically disabled¹ (approx. 1.02 and 0.94 µs vs. approx. 0.91 µs for Intel® OPA).]

Intel® OPA delivers lower latency while always providing error detection and correction – Packet Integrity Protection (PIP) is always enabled. FEC = Forward Error Correction. Last updated: Nov 11, 2016.

Tests performed on Intel® Xeon® Processor E5-2697A v4 dual-socket servers with 2133 MHz DDR4 memory. Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology enabled. IOU Non-posted Prefetch disabled in BIOS. Snoop hold-off timer = 9. Ohio State Micro Benchmarks v. 5.3. RHEL 7.2. Intel® OPA: Intel Corporation Device 24f0 – Series 100 HFI ASIC (B0 silicon). OPA Switch: Series 100 Edge Switch – 48 port (B0 silicon). Open MPI 1.10.2-hfi as packaged in IFS 10.2.0.0.158. EDR based on internal testing: Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 – 36 Port EDR InfiniBand switch. MLNX_OFED_LINUX-3.3-1.0.4. Open MPI 1.10.5a1 as packaged in hpcx-v1.7.405-gcc-MLNX_OFED_LINUX-3.4-1.0.0.0-redhat7.2-x86_64. Older switch firmware: 11.0100.0112. Upgraded switch firmware: 11.1200.0102.
1. SwitchIB-FW-11_1200_0102-release_nodes.pdf: “Removed out-of-the-box FEC, reaching 90ns latency, on Mellanox GA level copper cables equal to or shorter than 2m.”

Dynamic Lane Scaling (DLS): Traffic Protection

4x link: InfiniBand* 100 Gbps / Intel® OPA 100 Gbps. After a lane failure (3 good lanes remaining): InfiniBand* 25 Gbps, Intel® OPA 75 Gbps.

The best recovery case for IB is 1x – InfiniBand* may close the entire link or reinitialize it at 1x, introducing fabric delays or routing issues. Intel® OPA keeps passing data at reduced bandwidth, with link recovery via PIP.

User setting (per fabric) – set the maximum allowable degrade option:
• 4x – any lane failure causes a link reset or takes the link down
• 3x – still operates at degraded bandwidth (75 Gbps)
• 2x – still operates at degraded bandwidth (50 Gbps)
• 1x – still operates at degraded bandwidth (25 Gbps)

Link recovery: PIP is used to recover the link without a reset – an Intel® OPA innovation.

Intel Omni-Path Architecture: Configuration Tool Overview

Intel® OPA Configuration Tool

Data Entry

Configuration ID for later recall https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-configurator.html

Intel® OPA Configuration Tool – Show Design

Data Entry

Intel® OPA Configuration Tool – Design Description

Intel® OPA Configuration Tool – Show Cluster Diagram

Automatically Generated PowerPoint Configuration

Intel® OPA Configuration Tool – Show BOM

Data Entry

Intel® OPA Cluster Bill of Materials

Fill out then select

Ability to export to Excel

Intel® OPA Configuration Tool – Cable Estimator

Sample Floor Plan and Measurement Matrix

Cable Length Estimator and Install Plan

Tool Defaults are Adjustable

Intel Omni-Path Architecture: Resources

Intel Omni-Path Fabric Collateral (PUBLIC)

Intel Website
• Main HPC Fabrics Landing Page
• Intel® OPA Landing Page
• HPC Fabrics Document Library
• Transforming the Economics of HPC Fabrics with Intel® OPA
• Higher Performance, Lower Cost HPC Fabric with Intel® OPA
• Configuring LNet Routers with Intel® Enterprise Edition for Lustre* Software
• 2015 Hot Interconnects Technical Whitepaper

HPC Fabrics Product Briefs
• Intel® OPA Director Class Switches 100 Series
• Intel® OPA Fabric Edge Switches 100 Series
• Intel® OPA Host Fabric Adapters 100 Series

Intel® OPA Standard Product Guides
• All Guides, Release Notes and EULAs etc.

Fabric Builders NEW
• Fabric Builders Website
• Fabric Builders Catalog

Fabric Builders Webinars
• Intel® Omni-Path Architecture Assessment Running Altair HyperWorks* Solvers on HPE Apollo Servers NEW
• ANSYS Fluent® Performance on Intel® Omni-Path Architecture and Optimization on the Latest Intel® Processors NEW

Intel® OPA Tools NEW
• Intel® OPA Configurator Tool (External)

Intel® OPA In the News
• Transforming Next-Generation Sequencing
• AWI Uses New Cray Cluster for Earth Sciences and Bioinformatics
• Intel Omni-Path Architecture Fabric, the Choice of Leading Institutions
• Bridges: Converging HPC and Big Data NEW (PSC)
• Supercomputers Aid in Quantum Materials Research
• Intel Omni-Path Builds Bridges at Pittsburgh Supercomputing Center (Intersect360 research) – podcast and transcript
• MIT Lincoln Laboratory Takes the Mystery Out of Supercomputing
• Megware to Install CooLMUC 3 Supercomputer at LRZ in Germany NEW
• Penguin Computing Adds Omni-Path and Lustre to its HPC Cloud
• Huawei E9000 with OPA Launch at CeBIT'17
• RWTH Aachen University Deploys NEC LX Supercomputer
• BASF Taps HPE to Build Supercomputer for Chemical Research
• Knights Landing Processor with Omni-Path Makes Cloud Debut NEW
• Jülich to Build 5 Petaflop Supercomputing Booster with Dell

“Ask the Experts” Webinar Series
• Best Practices in Deploying Intel® Omni-Path Architecture (May 2016)
• Ask the Architects Archive (20+ Webinars)
• Accelerating High Priority Traffic with Intel(R) Omni-Path Architecture
• Faster Deep Learning with the Intel® Scalable System Framework: Next Generation Processors
• Boosting Deep Learning with the Intel Scalable System Framework

Partner Video
• University of Cambridge
• Cray/Omni-Path
• Direct Data Networks (DDN)
• Penguin Computing
• Altair
• Supermicro
• Intersect360
• Cabot Partners
• RSC
• University of Colorado, Boulder
• Intel OPA Feature Animation

Omni-Path Blogs
• Establishing a Significant Footprint in 100Gb

Top500
• Exceeding Expectations With the Intel Omni Path Architecture
• Intel Omni Path Architecture for the Tri Labs (CTS-1)
