C-DAC HPC Activities & Experience on Accelerators

Goldi Misra, Group Coordinator & Head, HPC Solutions Group, C-DAC

Thematic R&D Areas of C-DAC

• High Performance Computing & Grid Computing • Hardware, Software, Systems, Applications, Research, Technology, Infrastructure

• Multilingual Computing and Heritage Computing • Tools, Fonts, Products, Solutions, Research, Technology Development

• Health Informatics • Hospital Information System, Telemedicine, Decision Support System, Tools, Traditional Knowledge-base and DSS for Medicine

• Software Technologies including FOSS • FOSS, Multimedia, ICT for masses, E-Governance, Geomatics, ICT4D

• Professional Electronics including VLSI & Embedded Systems • Digital Broadband and Wireless Systems, Network Technologies, Power Electronics, Real- Time Systems, Control Electronics, Embedded Systems, VLSI/ASIC Design, Agri Electronics, Strategic Electronics

• Cyber Security & Cyber Forensics • Cyber Security tools, technologies & solution development, Research & Training

Education and Training form an important component of C-DAC activities, cutting across the above Thematic Areas.

Kaleidoscope of C-DAC Products

Spectrum of HPC Activities
• Technologies
• Trainings
• Systems
• Solutions
• National Facilities
• Applications
• PARAM Series of Supercomputers (PARAM Yuva)

Indian Supercomputing Scenario

1990: C-DAC's PARAM 8000, India's first Gigascale supercomputer.

1990–2000: Parallel initiatives: ANUPAM (BARC), PARAM (C-DAC), ANURAG (DRDO), Flosolver (NAL).

2000–2007: Several HPC facilities set up, including those at C-DAC, IISc, BARC, NAL, CMMACS, DRDO and NCMRWF.

2002: C-DAC's PARAM Padma, India's first Terascale supercomputer, launched (Rank 171 in the Top 500 list).

2007: CRL's EKA ranks 4th in the Top 500 list; 9 Terascale systems from India in the Top 500 list.

2008: C-DAC's PARAM Yuva launched (Rank 68 in the Top 500 list).

2010: Only 4 systems from India in the Top 500 list, as against 41 systems from China; India's best system ranked 47.

2012: Government of India takes the initiative for a big leap in supercomputing.

National HPC Facilities @ C-DAC
• NPSF, Pune (1998): PARAM 10000 system
• CTSF, Bangalore (2003): PARAM Padma system
• NPSF, Pune (2009): PARAM Yuva system

HPC Applications: 1988–2011

[Timeline chart spanning '88–'11 across C-DAC's 1st–4th Missions (Garuda, 54 TF); recoverable application areas listed below]

• CFD: launch-vehicle flow simulations; later IC engine modelling
• Electromagnetics
• Weather Forecasting: T80, then T172, RTWS and WRF
• Seismic Modelling: 1D models with pre- and post-stack migration, later 2D & 3D models
• Bioinformatics: protein structure; protein folding from 1 ns to 300 ns; REMD, MEME
• Evolutionary Computing: seismic inversion
• Fracture Mechanics: FRP, smart composite structures
• Structural Engineering

• InClus – HPC Cluster Building Toolkit
• CHReME – HPC Resource Management Engine
• ONAMA – HPC package for academic institutions
• Parallel File System

InClus Integrated Cluster Solution

InClus addresses the technical challenges of building HPC clusters and makes cluster deployment easy.

• Web and desktop based GUI
• Provisioning of operating systems on physical as well as virtual machines: RHEL 5.x, RHEL 6, CentOS 5.x, CentOS 6.x
• Development platform: compilers, debuggers
• Scheduler and resource manager
• Policy-based accounting
• High-availability support
• Remote console; powerful shell support
• Quick setup and control of management-node services: DNS, HTTP, DHCP, TFTP
• User management
• GPU support
• Log monitoring
• Critical error/warning reporting via web interface
• SMS/mail alerts for job status

CHReME: C-DAC's HPC Resource Management Engine

CHReME addresses the challenge of efficient and easy usage and management of HPC system resources.

 CHReME portal is an end-user job submission, management and monitoring tool that works with various schedulers or workload managers such as Torque, OpenPBS, Sun Grid Engine, Moab, IBM LoadLeveler, etc.

 Timely E-mail notification regarding job status; personalized job list and job status information

 Secure credential specific access on web through https

 Allows users to configure their execution environment through compilers and libraries selection, scheduling parameters etc.

 Scientific & research application-specific portals

ONAMA

ONAMA's mission: "Equipping premier academic institutions with top-of-the-class HPC solutions from C-DAC, packaged with open source software and world-class services, enabling them to benefit in terms of service delivery and affordability."

 Onama is an integrated package that opens a new door to future technocrats, giving them a quantum leap in developing a firm understanding of several engineering disciplines through HPC.

 Onama comprises a well-selected set of parallel and serial applications and tools across engineering disciplines such as Computer Science, Mechanical, Electronics and Communication, Electrical, Civil and Chemical engineering. It also includes a number of NVIDIA CUDA-enabled applications in domains such as molecular dynamics and physics.

HPC Trainings
• Parallel, multicore and manycore programming
• System administration & management
• Network security & audits
• Storage management technologies
• Facility operations management and maintenance
• GPU-based programming
• HPC user symposiums
• C-DAC Certified HPC Professional

Indigenous capability in:
• Engineering & managing large supercomputing systems and the national supercomputing facility

• HW & SW skills in designing System Area Network • Chip/PCB/system design skills (HW) • Networking stack & system software (SW) • Prototyping/Validation/Certification/Benchmarking/Training

• HW & SW skills in designing RC accelerators • RC HW having up to 12 million logic gates for computing, with different host interfaces • Porting applications/algorithms as HW circuits to achieve large speed-ups • SW ecosystem design for various operating systems

(contd…)

• Porting and scaling applications on large clusters

• Several collaborative projects in Science & Engineering research

• Increase of HPC user community in the nation

• Publications Activities on Intel Many Integrated Core (MIC) Architecture

Knights Ferry Co-Processor Card
• 1.2 GHz, up to 32 cores, 4 threads/core, 2 GB GDDR5, 300 W, 45nm process
• MIC Platform Software Stack (MPSS) 1.0 and 2.0
• Development tools: Intel Fortran & C++ compilers, Intel MPI and OpenMP, Intel MKL, IPP, TBB, ArBB, Cilk Plus, support for the Eclipse IDE; OpenCL support planned

Expected Specifications of Knights Corner
• More than 50 cores per chip
• 22nm process size
• ~1 TF

Applications ported and evaluated:
• Mathematical Algorithms: Mandelbrot Set (an example of a simple mathematical definition leading to complex behavior)

• Molecular Dynamics: MD_OPENMP

• Oceanography: Tsunami-N2 (Numerical simulation program with the linear theory in deep sea and with the shallow water theory in shallow sea and on land with constant grid length in the whole region)

• Astrophysics: CAMB (Code for Anisotropies in the Microwave Background [CAMB] computes cosmic microwave background spectra given a set of input cosmological parameters)

• Linear scalability observed; results are encouraging
• Based on the familiar x86 architecture
• Runs with standard, existing programming tools and methods
• Minimal porting effort
• No reprogramming needed for native compilation and execution
• Directive-based offloading
• Availability of tools (profilers, debuggers, monitoring, etc.)

We intend to work on the commercial Knights Corner (KNC) product to get an exact idea of performance, scalability, etc.

Application Accelerators Facts

Long-term viability of the technology

• Technology changes at a fast pace
• Many technology providers do not stay in business long enough

Application development methodology and tools

• Differ substantially from conventional multi-core programming
• Applications can be tuned to achieve good performance
• Achieving good performance requires a deep study of the underlying hardware architecture
• Codes are platform-dependent

Emerging Standards

• OpenCL for many-core architectures (GPUs)

• OpenFPGA efforts for reconfigurable computing

C-DAC's Reconfigurable Computing
• 2nd and 3rd generation Reconfigurable Computing (RC) platform
• Uses RC hardware with state-of-the-art FPGAs
• RC hardware has up to 12 million logic gates for computing, with different host interfaces
• Avatars: HW routines/libraries
• Varada: APIs, kernel agent, Linux support
• Eco-friendly HPC solution

Digital Systems Design

C-DAC RC expertise: system design, scientific software, application design, RC hardware, PCB design & assembly, and C-DAC's hardware library (Avatars).

 One of the enabling technologies useful in RC is the field-programmable gate array (FPGA).
 Putting FPGAs on add-on cards or motherboards allows them to serve as compute-intensive co-processors.
 FPGAs can be re-configured over and over again to perform a multitude of operations, enabling application-specific, dynamically "programmable" hardware accelerators.

Smith-Waterman Protein Sequence Search on RC

[Benchmark table, partially recoverable from the slide layout]
Queries NP_597681, XP_001065 and AAN10358 (Query 955) were searched against the protein database in software on 16 HP DL580 G5 nodes (quad-socket, quad-core Xeon 2.93 GHz; 256 cores in total) and on 16 RC cards. Software runtimes ranged from roughly half an hour to several hours per query, against tens of minutes or less on the RC cards, giving per-card speed-ups (in terms of cores) of 101.2, 100.4 and 100.1.

[Chart: clock frequency (MHz, 0–3000) of CPUs vs FPGAs, 1990–2008. Source: Intel, IBM, Xilinx, Altera datasheets]

• Scientific and engineering applications in the areas of fracture mechanics, radio astronomy and bioinformatics, ported on RC, provided significant acceleration compared to purely software-based solutions.
• These speed-ups were further increased many-fold, depending on configuration and application. The bioinformatics sequence-search solution on RC gave results more than 100 times faster.
• C-DAC's own fracture mechanics code, with its double-precision Cholesky factorization and forward/backward substitution steps ported on RC, achieved a 16x speed-up.
• High-speed data acquisition and signal processing solutions designed for Very Long Baseline Interferometry (VLBI) and power-spectrum experimentation in radio astronomy replaced a sizable computing cluster.
• Double-precision matrix multiplication implemented on RC performed better than the standard math library.
• Evolution of reconfigurable logic design alongside the more traditional computing paradigm.

• Development of a more efficient cache-replacement policy for FPGA configurations.

• Reduction in run-time reconfiguration time.

• FPGAs with HPC will act as a solution to the scaling challenges facing microprocessors (power consumption and clock frequencies).

• By mapping compute-intensive algorithms directly onto parallel FPGA hardware, tightly coupled to a conventional CPU through a high-speed I/O bus, complete applications can be accelerated by orders of magnitude over conventional CPU implementations.

• Development of run-time debugging of multi-platform-enabled code.

Tesla Data Center & Workstation GPU Solutions

Tesla M-series GPUs (servers & blades): M2090 | M2075/0 | M2050
Tesla C-series GPUs (workstations): C2075/0 | C2050

Particulars                        M2090    M2075/0   M2050    C2075/0   C2050
Cores                              512      448       448      448       448
Memory                             6 GB     6 GB      3 GB     6 GB      3 GB
Memory bandwidth, GB/s (ECC off)   177.6    150       148.8    148.8     148.8
Peak single precision, Gflops      1331     1030      1030     1030      1030
Peak double precision, Gflops      665      515       515      515       515

Tesla: 2-3x Faster GPU Every 2 Years

[Chart: DP GFLOPS per watt, 2008–2014: T10 (2008), Fermi (2010), Kepler (2012), Maxwell (2014)]

Worldwide GPU Supercomputer Momentum

Tesla 20-series (Fermi): first Tesla GPUs with double precision launched.

Three approaches to accelerating applications:
• Libraries: easiest approach
• Directives: 2x to 10x acceleration
• Programming Languages: maximum performance

NVIDIA GPUDirect

• Accelerated Communication With Network and Storage Devices

• Peer-To-Peer Transfers Between GPUs

• Peer-To-Peer Memory Access

• GPUDirect for Video

PGI CUDA x86: CUDA Now Available for CPUs and GPUs

A single CUDA C/C++ codebase targets both platforms: the NVIDIA C/C++ compiler generates GPU code, and the PGI CUDA x86 compiler generates CPU code.

GPU Computing @ C-DAC

Particulars          Tesla C1060           Tesla C2050           Tesla C2075
Architecture         Tesla 10-series GPU   Tesla 20-series GPU   Tesla 20-series GPU
Compute capability   1.3                   2.0                   2.0
No. of cores         240                   448                   448
GPU memory           4 GB                  3 GB                  6 GB
Memory bandwidth     102 GB/s              150 GB/s              150 GB/s

Bioinformatics:
• GPU-HMMER (protein sequence alignment using profile HMMs)
• MrBayes (Bayesian inference of phylogenetic and evolutionary models)
• CUDA-BLASTP (designed to accelerate NCBI BLASTP for scanning protein sequence databases)
• CUDA-MEME (discovers motifs in groups of related DNA or protein sequences)

Weather:
• WSM5 (WRF Single-Moment 5-class cloud microphysics module)

CONFIDENTIAL

• Strong scalability of certain codes on multiple cards

• Availability of numerous applications explicitly enabled with CUDA

• Significant reprogramming efforts with CUDA for maximum performance

• New and improved developer tools

• Programming Complexity

• Library & Tools Availability

• Power vs. performance

• Flexibility vs. accessibility

• Acceleration

References: www.cdac.in | www.hpcwire.com | www.intel.com | www.nvidia.com

Thank You

[email protected] [email protected]