High Performance Computing ADVANCED SCIENTIFIC COMPUTING

Prof. Dr. – Ing. Morris Riedel, Adjunct Associate Professor, School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland; Research Group Leader, Juelich Supercomputing Centre, Forschungszentrum Juelich, Germany

@MorrisRiedel LECTURE 1

High Performance Computing

September 5, 2019 Room V02-258 Review of Practical Lecture 0.2 – Short Intro to Programming & Scheduling

. Many C/C++ Programs used in HPC today . Multi-user HPC usage needs Scheduling

Many multi-physics application challenges on different scales & levels of granularity drive the need for HPC
[3] LLview

[1] Terrestrial Systems SimLab [2] NEST Web page


Lecture 1 – High Performance Computing 2/ 50 Outline of the Course

1. High Performance Computing
2. Parallel Programming with MPI
3. Parallelization Fundamentals
4. Advanced MPI Techniques
5. Parallel Algorithms & Data Structures
6. Parallel Programming with OpenMP
7. Graphical Processing Units (GPUs)
8. Parallel & Scalable Machine & Deep Learning
9. Debugging & Profiling & Performance Toolsets
10. Hybrid Programming & Patterns
11. Scientific Visualization & Scalable Infrastructures
12. Terrestrial Systems & Climate
13. Systems Biology & Bioinformatics
14. Molecular Systems & Libraries
15. Computational Fluid Dynamics & Finite Elements
16. Epilogue

+ additional practical lectures & webinars for our hands-on assignments in context
(lectures are grouped into practical topics and theoretical / conceptual topics)

Lecture 1 – High Performance Computing 3/ 50 Outline

. High Performance Computing (HPC) Basics . Four basic building blocks of HPC . TOP500 & Performance Benchmarks . Multi-core CPU Processors . Shared Memory & Distributed Memory Architectures . Hybrid Architectures & Programming

. HPC Ecosystem Technologies . HPC System Software Environment – Revisited . System Architectures & Network Topologies . Many-core GPUs & Supercomputing Co-Design . Relationships to Big Data & Machine/Deep Learning . Data Access & Large-scale Infrastructures

Lecture 1 – High Performance Computing 4/ 50 Selected Learning Outcomes

. Students understand …
. Latest developments in parallel processing & high performance computing (HPC)
. How to create and use high-performance clusters
. What are scalable networks & data-intensive workloads
. The importance of domain decomposition
. Complex aspects of parallel programming
. HPC environment tools that support programming or analyze behaviour
. Different abstractions of parallel computing on various levels
. Foundations and approaches of scientific domain-specific applications

. Students are able to …
. Program and use HPC programming paradigms
. Take advantage of innovative scientific computing simulations & technology
. Work with technologies and tools to handle parallelism complexity

Lecture 1 – High Performance Computing 5 / 50 High Performance Computing (HPC) Basics

Lecture 1 – High Performance Computing 6/ 50 What is High Performance Computing?

. Wikipedia redirects from 'HPC' to 'Supercomputer' . Interesting – this already gives us a hint of what it is generally about

. A supercomputer is a computer at the frontline of contemporary processing capacity – particularly speed of calculation

[4] Wikipedia ‘Supercomputer’ Online

. HPC includes work on 'four basic building blocks' in this course
. Theory (numerical laws, physical models, speed-up performance, etc.)
. Technology (multi-core, networks, storage, etc.)
. Architecture (shared-memory, distributed-memory, interconnects, etc.)
. Software (libraries, schedulers, monitoring, applications, etc.)

[5] Introduction to High Performance Computing for Scientists and Engineers

Lecture 1 – High Performance Computing 7/ 50 Understanding High Performance Computing (HPC) – Revisited

. High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques through specific support with dedicated hardware such as high performance CPU/core interconnections.

HPC: network interconnection important

. High Throughput Computing (HTC) is based on commonly available computing resources such as commodity PCs and small clusters that enable the execution of 'farming jobs' without providing a high performance interconnection between the CPU/cores.

HTC: network interconnection less important!

 The complementary Cloud Computing & Big Data – Parallel Machine & Deep Learning Course focuses on High Throughput Computing

Lecture 1 – High Performance Computing 8/ 50 Parallel Computing

. All modern supercomputers depend heavily on parallelism . Parallelism can be achieved with many different approaches

. We speak of parallel computing whenever a number of ‘compute elements’ (e.g. cores) solve a problem in a cooperative way

[5] Introduction to High Performance Computing for Scientists and Engineers

. Often known as 'parallel processing' of some problem space
. Tackle problems in parallel to enable the 'best performance' possible
. Includes not only parallel computing, but also parallel input/output (I/O)
. 'The measure of speed' in High Performance Computing matters
. Common measure for parallel computers established by TOP500 list

. Based on benchmark for ranking the best 500 computers worldwide [6] TOP500 Supercomputing Sites

 Lecture 3 will give in-depth details on parallelization fundamentals & performance term relationships & theoretical considerations

Lecture 1 – High Performance Computing 9/ 50 TOP 500 List (June 2019)

[TOP500 list, June 2019: figure annotations highlight the power challenge and the highest-ranked EU system (EU #1)]

[6] TOP500 Supercomputing Sites

Lecture 1 – High Performance Computing 10 / 50 LINPACK Benchmarks and Alternatives

. TOP500 ranking is based on the LINPACK benchmark

. LINPACK solves a dense system of linear equations of unspecified size [7] LINPACK Benchmark implementation

. LINPACK covers only a single architectural aspect ('critics exist')
. Measures 'peak performance': all involved 'supercomputer elements' operate at maximum performance
. Available through a wide variety of 'open source implementations'
. Success via 'simplicity & ease of use', thus it has been used for over two decades (a minimal sketch of this kind of computation follows below)
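As a rough illustration of the kind of computation LINPACK times (a minimal sketch, not the actual HPL code, which uses blocked, parallel routines on very large matrices), the following C program solves a small dense linear system by Gaussian elimination with partial pivoting; the matrix size and values are made up for illustration.

/* Minimal sketch: solve a dense linear system A*x = b by Gaussian
 * elimination with partial pivoting. Only illustrates the kind of
 * computation LINPACK/HPL times; the real benchmark uses blocked,
 * parallel routines and much larger matrices. */
#include <stdio.h>
#include <math.h>

#define N 3  /* illustrative size; real HPL runs use very large N */

int main(void)
{
    double A[N][N] = {{4, 1, 2}, {1, 3, 0}, {2, 0, 5}};
    double b[N]    = {7, 4, 7};
    double x[N];

    /* forward elimination with partial pivoting */
    for (int k = 0; k < N; k++) {
        int p = k;
        for (int i = k + 1; i < N; i++)
            if (fabs(A[i][k]) > fabs(A[p][k])) p = i;
        for (int j = 0; j < N; j++) { double t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
        double t = b[k]; b[k] = b[p]; b[p] = t;

        for (int i = k + 1; i < N; i++) {
            double f = A[i][k] / A[k][k];
            for (int j = k; j < N; j++) A[i][j] -= f * A[k][j];
            b[i] -= f * b[k];
        }
    }

    /* back substitution */
    for (int i = N - 1; i >= 0; i--) {
        x[i] = b[i];
        for (int j = i + 1; j < N; j++) x[i] -= A[i][j] * x[j];
        x[i] /= A[i][i];
    }

    for (int i = 0; i < N; i++) printf("x[%d] = %g\n", i, x[i]);
    return 0;
}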

. Realistic application benchmark suites might be alternatives
. HPC Challenge benchmarks (includes 7 tests) [8] HPC Challenge Benchmark Suite
. JUBE benchmark suite (based on real applications) [9] JUBE Benchmark Suite

Lecture 1 – High Performance Computing 11 / 50 Multi-core CPU Processors

. Significant advances in CPUs (or microprocessor chips)
. Multi-core architecture with dual, quad, six, or n processing cores
. Processing cores are all on one chip

. Multi-core CPU chip architecture
. Hierarchy of caches (on/off chip)
. L1 cache is private to each core; on-chip
. L2 cache is shared; on-chip
. L3 cache or Dynamic Random Access Memory (DRAM); off-chip
[10] Distributed & Cloud Computing Book

. Clock-rate for single processors increased from 10 MHz (Intel 286) to 4 GHz (Pentium 4) in 30 years
. A further clock-rate increase beyond roughly 5 GHz unfortunately reached a limit due to power limitations / heat
. Multi-core CPU chips therefore have quad, six, or n processing cores on one chip and use cache hierarchies (see the core-count sketch below)
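As a small aside, a minimal C sketch (assuming a Linux/POSIX system) shows how a program can ask the operating system how many processing cores such a multi-core node exposes:

/* Minimal sketch (assumes a Linux/POSIX system): query how many processing
 * cores the operating system exposes on one multi-core node. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);   /* cores currently online */
    printf("This node exposes %ld processing cores\n", cores);
    return 0;
}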

Lecture 1 – High Performance Computing 12 / 50 Dominant Architectures of HPC Systems

. Traditionally two dominant types of architectures
. Shared-Memory Computers: shared-memory parallelization with OpenMP
. Distributed-Memory Computers: distributed-memory parallel programming with the Message Passing Interface (MPI) standard

. Often hierarchical (hybrid) systems of both in practice
. In the last couple of years the community has been dominated by X86-based commodity clusters running the Linux OS on Intel/AMD processors
. More recently, accelerators also play a significant role (e.g., many-core chips)

. More recently
. Both of the above are also considered as 'programming models'
. Emerging computing models are getting relevant for HPC: e.g., quantum devices, neuromorphic devices

Lecture 1 – High Performance Computing 13 / 50 Shared-Memory Computers

. A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space

[5] Introduction to High Performance Computing for Scientists and Engineers

. Two varieties of shared-memory systems: 1. Uniform Memory Access (UMA) 2. Cache-coherent Nonuniform Memory Access (ccNUMA)

. The problem of 'cache coherence' (in UMA/ccNUMA systems)
. Different CPUs use caches and may 'modify the same cached values'
. Consistency between cached data & data in memory must be guaranteed
. 'Cache coherence protocols' ensure a consistent view of memory (a small sketch of coherence effects follows below)
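A minimal C/OpenMP sketch (a hypothetical micro-benchmark, not part of the lecture material) shows cache coherence at work: two threads update counters that happen to share one cache line, so the coherence protocol repeatedly invalidates the other core's cached copy ('false sharing'), while counters padded onto separate cache lines avoid that traffic.

/* Hypothetical micro-benchmark: two OpenMP threads increment counters.
 * In the first run both counters share one cache line, so the cache
 * coherence protocol keeps invalidating the other core's copy ("false
 * sharing"); in the second run each counter is padded onto its own line. */
#include <stdio.h>
#include <omp.h>

#define ITER 100000000L

int main(void)
{
    long shared_line[2] = {0, 0};                            /* same cache line */
    struct { long v; char pad[64]; } padded[2] = {{0}, {0}}; /* one line each   */

    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITER; i++) shared_line[id]++;   /* coherence traffic */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITER; i++) padded[id].v++;      /* no false sharing  */
    }
    double t2 = omp_get_wtime();

    printf("shared line: %.2f s   padded: %.2f s   (checks: %ld %ld)\n",
           t1 - t0, t2 - t1,
           shared_line[0] + shared_line[1], padded[0].v + padded[1].v);
    return 0;
}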

Lecture 1 – High Performance Computing 14 / 50 Shared-Memory with UMA

. UMA systems use ‘flat memory model’: Latencies and bandwidth are the same for all processors and all memory locations. . Also called Symmetric Multiprocessing (SMP) [5] Introduction to High Performance Computing for Scientists and Engineers

. Selected Features
. Socket is a physical package (with multiple cores), typically a replaceable component
. Two dual-core chips (2 cores/socket)
. P = Processor core
. L1D = Level 1 Cache – Data (fastest)
. L2 = Level 2 Cache (fast)
. Memory = main memory (slow)
. Chipset = enforces cache coherence and mediates connections to memory

Lecture 1 – High Performance Computing 15 / 50 Shared-Memory with ccNUMA

. ccNUMA systems logically share memory that is physically distributed (similar to distributed-memory systems)
. Network logic makes the aggregated memory appear as one single address space [5] Introduction to High Performance Computing for Scientists and Engineers

. Selected Features . Eight cores (4 cores/socket); L3 = Level 3 Cache . Memory interface = establishes a coherent link to enable one ‘logical’ single address space of ‘physically distributed memory’

Lecture 1 – High Performance Computing 16 / 50 Programming with Shared Memory using OpenMP

. Shared-memory programming enables immediate access to all data from all processors without explicit communication
. OpenMP is the dominant shared-memory programming standard today (v3)
. OpenMP is a set of compiler directives to 'mark parallel regions' (see the minimal sketch below)

[11] OpenMP API Specification

. Features
. Bindings are defined for the C, C++, and Fortran languages
. Threads TX are 'lightweight processes' that mutually access data

[Figure: threads T1–T5 accessing shared memory]
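A minimal C sketch of such a compiler directive, assuming a compiler with OpenMP support (e.g., compiled with gcc -fopenmp); the array size is illustrative only.

/* Minimal OpenMP sketch: the pragma marks a parallel region in which
 * lightweight threads share the arrays a, b, c without any explicit
 * communication. Compile e.g. with: gcc -fopenmp example.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* work-sharing loop: iterations are distributed across the threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + 2.0 * b[i];

    printf("c[0] = %f, threads available: %d\n", c[0], omp_get_max_threads());
    return 0;
}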

 Lecture 6 will give in-depth details on the shared-memory programming model with OpenMP and using its compiler directives

Lecture 1 – High Performance Computing 17 / 50 Distributed-Memory Computers

. A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly

[5] Introduction to High Performance Computing for Scientists and Engineers

. Features
. Processors communicate via Network Interfaces (NI)
. NI mediates the connection to a Communication network
. This setup is rarely used as pure hardware today; it rather survives as a programming model view

Lecture 1 – High Performance Computing 18 / 50 Programming with Distributed Memory using MPI

. Distributed-memory programming enables explicit message passing as communication between processors
. Message Passing Interface (MPI) is the dominant distributed-memory programming standard today (available in many different versions)
. MPI is a standard defined and developed by the MPI Forum

[12] MPI Standard

. Features
. No remote memory access on distributed-memory systems
. Requires to 'send messages' back and forth between processes PX
. Many free Message Passing Interface (MPI) libraries available
. Programming is tedious & complicated, but the most flexible method (see the minimal sketch below)
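A minimal C sketch of explicit message passing between two processes, assuming an MPI installation with the usual mpicc wrapper and mpirun launcher (exact commands depend on the system); the payload value is illustrative only.

/* Minimal MPI sketch: each process PX has its own memory, so data is
 * exchanged by explicitly sending messages. Compile e.g. with
 * mpicc example.c and run with e.g. mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int value = 42;                      /* illustrative payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("process 0 of %d sent %d\n", size, value);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 of %d received %d\n", size, value);
    }

    MPI_Finalize();
    return 0;
}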

 Lecture 2 & 4 will give in-depth details on the distributed-memory programming model with the Message Passing Interface (MPI)

Lecture 1 – High Performance Computing 19 / 50 MPI Standard – GNU OpenMPI Implementation Example – Revisited

. Message Passing Interface (MPI)
. A standardized and portable message-passing standard
. Designed to support different HPC architectures
. A wide variety of MPI implementations exist
. Standard defines the syntax and semantics of a core of library routines used in C, C++ & Fortran [12] MPI Standard
. OpenMPI Implementation
. Open source license based on the BSD license
. Full MPI (version 3) standards conformance [13] OpenMPI Web page
. Developed & maintained by a consortium of academic, research, & industry partners
. Typically available as modules on HPC systems and used with the mpicc compiler wrapper
. Often built with the GNU compiler set and/or Intel compilers

 Lecture 2 will provide a full introduction and many more examples of the Message Passing Interface (MPI) for parallel programming

Lecture 1 – High Performance Computing 20 / 50 Hierarchical Hybrid Computers

. A hierarchical hybrid parallel computer is neither a purely shared-memory nor a purely distributed-memory type system but a mixture of both . Large-scale ‘hybrid’ parallel computers have shared-memory building blocks interconnected with a fast network today

[5] Introduction to High Performance Computing for Scientists and Engineers

. Features . Shared-memory nodes (here ccNUMA) with local NIs . NI mediates connections to other remote ‘SMP nodes’

Lecture 1 – High Performance Computing 21 / 50 Programming Hybrid Systems & Patterns

. Hybrid systems programming uses MPI as explicit internode communication and OpenMP for parallelization within the node . Parallel Programming is often supported by using ‘patterns’ such as stencil methods in order to apply functions to the domain decomposition

. Experience from HPC Practice
. Most parallel applications still take no notice of the hardware structure
. Use of pure MPI for parallelization remains the dominant programming model
. Historical reason: old supercomputers were all of the distributed-memory type
. Use of accelerators is significantly increasing in practice today
. Challenges with the 'mapping problem'
. Performance of hybrid (as well as pure MPI) codes depends crucially on factors not directly connected to the programming model
. It largely depends on the association of threads and processes to cores
. Patterns (e.g., stencil methods) support the parallel programming (a minimal hybrid sketch follows below)
 Lecture 10 will provide insights into hybrid programming models and introduces selected patterns used in parallel programming
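A minimal C sketch of the hybrid model under these assumptions: MPI distributes a simple one-dimensional domain decomposition across processes, an OpenMP directive parallelizes each process' chunk across its node-local threads, and the 'work' inside the loop is only a stand-in (no real stencil). Compiled e.g. with mpicc -fopenmp.

/* Minimal hybrid sketch: MPI handles explicit communication between the
 * (shared-memory) nodes, OpenMP threads parallelize the work inside each
 * node. Compile e.g. with: mpicc -fopenmp hybrid.c */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 8000000L   /* illustrative global problem size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* simple 1D domain decomposition: each MPI process owns a chunk of indices */
    long chunk = N / size, local_sum = 0, global_sum = 0;
    long lo = rank * chunk, hi = (rank == size - 1) ? N : lo + chunk;

    /* inside the node: OpenMP threads share the work on the local chunk */
    #pragma omp parallel for reduction(+ : local_sum)
    for (long i = lo; i < hi; i++)
        local_sum += i % 7;                  /* stand-in for real work */

    /* explicit internode communication: combine the partial results */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %ld (from %d MPI processes)\n", global_sum, size);

    MPI_Finalize();
    return 0;
}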

Lecture 1 – High Performance Computing 22 / 50 [Video] Juelich – Supercomputer Upgrade

Lecture 1 – High Performance Computing 23 / 50 HPC Ecosystem Technologies

Lecture 1 – High Performance Computing 24 / 50 HPC System Software Environment – Revisited (cf. Practical Lecture 0.2)

. HPC systems and supercomputers typically provide a software environment that supports the processing of parallel and scalable applications

. Operating System
. Former times often a 'proprietary OS', nowadays often a (reduced) 'Linux'

. Scheduling Systems
. Manage concurrent access of users on supercomputers
. Different scheduling algorithms can be used with different 'batch queues'
. Examples: SLURM @ JÖTUNN Cluster, LoadLeveler @ JUQUEEN, etc.
. Scheduling systems enable a method by which user processes are given access to processors

. Monitoring Systems
. Monitor and test the status of the system ('system health checks/heartbeat')
. Enable a view of the usage of the system per node/rack ('system load')
. Examples: LLView, INCA, Ganglia @ JÖTUNN Cluster, etc.
. Monitoring systems offer a comprehensive view of the current status of an HPC system or supercomputer

. Performance Analysis Systems
. Measure the performance of an application and recommend improvements (e.g., Scalasca, Vampir, etc.)

 Lecture 9 will offer more insights into performance analysis systems with debugging, profiling, and HPC performance toolsets

Lecture 1 – High Performance Computing 25 / 50 Scheduling vs. Emerging Interactive Supercomputing Approaches

. JupyterHub is a multi-user version of the Jupyter Notebook, designed for companies, classrooms, and research labs

[20] A. Lintermann & M. Riedel et al., 'Enabling Interactive Supercomputing at JSC – Lessons Learned'
[21] A. Streit & M. Riedel et al., 'UNICORE 6 – Recent and Future Advancements'
[22] Project Jupyter Web page

Lecture 1 – High Performance Computing 26 / 50 Modular Supercomputer JUWELS – Revisited

Lecture 1 – High Performance Computing 27 / 50 HPC System Architectures

. HPC systems are very complex 'machines' with many elements
. CPUs & multi-cores with 'multi-threading' capabilities
. Data access levels & different levels of caches
. Network topologies & various interconnects
. Architecture impacts
. Vendor designs, e.g., IBM BlueGene/Q
. Infrastructure, e.g., cooling & power lines in the computing hall
. HPC faced a significant change in practice with respect to performance increases after years
. Getting more speed for free by waiting for new CPU generations does not work any more
. Multi-core processors emerge that require using those multiple resources efficiently in parallel
. Many-core processors emerge that are used to accelerate certain parts of computing applications

Lecture 1 – High Performance Computing 28 / 50 Example: IBM BlueGene Architecture Evolution

. BlueGene/P

. BlueGene/Q

Lecture 1 – High Performance Computing 29 / 50 Network Topologies

. Large-scale HPC Systems have special network setups . Dedicated I/O nodes, fast interconnects, e.g. Infiniband (IB) . Different network topologies, e.g. tree, 5D Torus network, mesh, etc. (raise challenges in task mappings and communication patterns)

network interconnection important

[5] Introduction to High Performance Computing for Scientists and Engineers Source: IBM

Lecture 1 – High Performance Computing 30 / 50 HPC System Architecture & Continuous Developments

. Increasing number of other 'new' emerging system architectures
. HPC is at the cutting edge of computing and integrates new hardware developments as a continuous activity
. General Purpose Computation on Graphics Processing Units (GPGPUs/GPUs)
. Use of GPUs for computing instead of computer graphics
. Programming models are OpenCL and NVIDIA CUDA
. Getting more and more adopted in many application fields
. Field Programmable Gate Arrays (FPGAs)
. Integrated circuit designed to be configured by a user after shipping
. Enables updates of functionality and reconfigurable 'wired' interconnects
. Processors
. Enable the combination of general-purpose cores with co-processing elements that accelerate dedicated forms of computations
. Artificial Intelligence with methods from machine learning and deep learning influences HPC system architectures today
. Complements the initial focus on compute-intensive with data-intensive application co-design activities

Lecture 1 – High Performance Computing 31 / 50 Many-core GPGPUs

. Use of very many simple cores
. High throughput computing-oriented architecture
. Use massive parallelism by executing a lot of concurrent threads slowly
. Handle an ever-increasing number of multiple instruction threads
. CPUs instead typically execute a single long thread as fast as possible
[10] Distributed & Cloud Computing Book

. Graphics Processing Unit (GPU) is great for data parallelism and task parallelism
. Compared to multi-core CPUs, GPUs consist of a many-core architecture with hundreds to even thousands of very simple cores executing threads rather slowly
. Named General-Purpose Computing on GPUs (GPGPU)
. Different programming models emerge
. Many-core GPUs are used in large clusters and within massively parallel supercomputers today

Lecture 1 – High Performance Computing 32 / 50 GPU Acceleration

. GPU accelerator architecture example (e.g., NVIDIA card)
. GPUs can have 128 cores on one single GPU chip
. Each core can work with eight threads of instructions
. GPU is able to concurrently execute 128 * 8 = 1024 threads
. Interaction between CPU and GPU is via memory interactions and thus a major (bandwidth) bottleneck
[10] Distributed & Cloud Computing Book

. GPU acceleration means that GPUs accelerate computing due to massive parallelism with thousands of threads, compared to only a few threads used by conventional CPUs
. GPUs are designed to compute large numbers of floating point operations in parallel
. E.g., applications that use matrix-vector / matrix-matrix multiplications (e.g., deep learning algorithms); a plain-C sketch of such a data-parallel loop follows below
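A plain, sequential C sketch of such a data-parallel matrix-vector product: every row of the result can be computed independently, which is exactly the kind of loop a CUDA or OpenCL kernel would map onto thousands of GPU threads (one thread per row); the matrix values are illustrative only.

/* Plain-C sketch of a data-parallel matrix-vector product y = A*x.
 * Every row of the result is independent of the others, which is the
 * kind of loop a GPU maps onto thousands of lightweight threads
 * (e.g. one CUDA/OpenCL thread per row); here it runs sequentially. */
#include <stdio.h>

#define ROWS 4
#define COLS 3

int main(void)
{
    double A[ROWS][COLS] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 0, 1}};
    double x[COLS] = {1, 1, 1};
    double y[ROWS];

    /* independent per-row work: no ordering between iterations is required */
    for (int i = 0; i < ROWS; i++) {
        y[i] = 0.0;
        for (int j = 0; j < COLS; j++)
            y[i] += A[i][j] * x[j];
    }

    for (int i = 0; i < ROWS; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}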

 Lecture 10 will introduce the programming of accelerators with different approaches and their key benefits for applications

Lecture 1 – High Performance Computing 33 / 50 NVIDIA Fermi GPU Example

[10] Distributed & Cloud Computing Book

Lecture 1 – High Performance Computing 34 / 50 DEEP Learning takes advantage of Many-Core Technologies

. Innovation via specific layers and architecture types

[26] A. Rosebrock

[25] Neural Network 3D Simulation

 Lecture 8 will provide more details about parallel & scalable machine & deep learning algorithms and how many-core HPC is used

Lecture 1 – High Performance Computing 35 / 50 Deep Learning Application Example – Using High Performance Computing

. Using Convolutional Neural Networks (CNNs) with hyperspectral remote sensing image data [27] J. Lange and M. Riedel et al., IGARSS Conference, 2018

. Find hyperparameters & joint 'new-old' modeling & transfer learning given rare labeled/annotated data in science (e.g., 36,000 vs. 14,197,122 images in ImageNet)

[28] G. Cavallaro, M. Riedel et al., IGARSS 2019

 Lecture 8 will provide more details about parallel & scalable machine & deep learning algorithms and remote sensing applications

Lecture 1 – High Performance Computing 36 / 50 HPC Relationship to ‘Big Data‘ in Machine & Deep Learning

[Figure: model performance / accuracy vs. dataset volume ('Big Data'): traditional learning models & small neural networks (statistical computing with R, SVMs, Random Forests; tools such as MatLab, Octave, Weka, scikit-learn) rely on manual feature engineering and 'small datasets', while medium and large deep learning networks change the ordering and use High Performance Computing & Cloud Computing (e.g., JURECA) for the training time]

[15] www.big-data.tips

Lecture 1 – High Performance Computing 37 / 50 Deep Learning Application Example – Using Cloud Computing

. Performing parallel computing with Apache Spark across different worker nodes

[23] J. Haut, G. Cavallaro and M. Riedel et al., IEEE Transactions on Geoscience and Remote Sensing, 2019

. Using Autoencoder deep neural networks with Cloud computing

[24] Apache Spark Web page

 The complementary Cloud Computing & Big Data – Parallel Machine & Deep Learning Course teaches Apache Spark Approaches

Lecture 1 – High Performance Computing 38 / 50 HPC Relationship to ‘Big Data‘ in Simulation Sciences

[14] F. Berman: Maximising the Potential of Research Data

Lecture 1 – High Performance Computing 39 / 50 Data Access & Challenges

. P = Processor core elements
. Compute: floating point or integer operations
. Arithmetic units (compute operations)
. Registers (feed those units with operands)
. 'Data access' for applications happens on different levels
. Registers: 'accessed w/o any delay'
. L1D = Level 1 Cache – Data (fastest, normal)
. L2 = Level 2 Cache (fast, often)
. L3 = Level 3 Cache (still fast, less often)
. Main memory (slow, but larger in size)
. Storage media like harddisks, tapes, etc. (cheaper, but too slow to be used in direct computing)
. The DRAM gap is the large discrepancy between main memory and cache bandwidths

[5] Introduction to High Performance Computing for Scientists and Engineers

Lecture 1 – High Performance Computing 40 / 50 Big Data Drives Data-Intensive HPC Architecture Designs

. More recently, system architectures are influenced by 'big data'
. CPU speed has surpassed the I/O capabilities of existing HPC resources
. Scalable I/O gets more and more important in application scalability
. Requirements for Hierarchical Storage Management ('Tiers')
. Mass storage devices (tertiary storage) are too slow to enable active processing of 'big data'
. Increases in simulation time/granularity mean TBs are equally important as FLOP/s
. Tapes are cheap but slowly accessible; direct access from compute nodes is needed
. These requirements drive new 'tier-based' designs

[16] A. Szalay et al., ‘GrayWulf: Scalable Clustered Architecture for Data Intensive Computing’

Lecture 1 – High Performance Computing 41 / 50 Application Co-Design of HPC Architectures – Modular Supercomputing Example

. The modular supercomputing architecture (MSA) [17] DEEP Projects Web Page enables a flexible HPC system design co-designed by the need of different application workloads

Lecture 1 – High Performance Computing 42 / 50 New HPC Architectures – Modular Supercomputing Architecture Example

[Figure: Modular Supercomputing Architecture example with modules (CN, DN, BN, NAM, GCE) combining general-purpose CPUs, many-core CPUs, FPGAs, memory (MEM), and NVRAM, mapped to a possible application workload]

[17] DEEP Projects Web Page

Lecture 1 – High Performance Computing 43 / 50 Large-scale Computing Infrastructures

. Large computing systems are often embedded in infrastructures
. Grid computing for distributed data storage and processing via middleware
. The success of Grid computing was renowned when it was mentioned by Prof. Rolf-Dieter Heuer, CERN Director General, in the context of the Higgs Boson discovery: 'Results today only possible due to extraordinary performance of Accelerators – Experiments – Grid computing'
. Other large-scale distributed infrastructures exist
. Partnership for Advanced Computing in Europe (PRACE): EU HPC
. Extreme Science and Engineering Discovery Environment (XSEDE): US HPC

[18] Grid Computing Video

 Lecture 11 will give in-depth details on scalable approaches in large-scale HPC infrastructures and how to use them with middleware

Lecture 1 – High Performance Computing 44 / 50 [Video] PRACE – Introduction to Supercomputing

[19] PRACE – Introduction to Supercomputing

Lecture 1 – High Performance Computing 45 / 50 Lecture Bibliography

Lecture 1 – High Performance Computing 46 / 50 Lecture Bibliography (1)

. [1] Terrestrial Systems Simulation Lab, Online: http://www.hpsc-terrsys.de/hpsc-terrsys/EN/Home/home_node.html . [2] Nest:: The Neural Simulation Technology Initiative, Online: https://www.nest-simulator.org/ . [3] T. Bauer, ‘System Monitoring and Job Reports with LLView‘, Online: https://www.fz-juelich.de/SharedDocs/Downloads/IAS/JSC/EN/slides/supercomputer-ressources-2018-11/12b-sc-llview.pdf?__blob=publicationFile . [4] Wikipedia ‘Supercomputer’, Online: http://en.wikipedia.org/wiki/Supercomputer . [5] Introduction to High Performance Computing for Scientists and Engineers, Georg Hager & Gerhard Wellein, Chapman & Hall/CRC Computational Science, ISBN 143981192X, English, ~330 pages, 2010, Online: http://www.amazon.de/Introduction-Performance-Computing-Scientists-Computational/dp/143981192X . [6] TOP500 Supercomputing Sites, Online: http://www.top500.org/ . [7] LINPACK Benchmark, Online: http://www.netlib.org/benchmark/hpl/ . [8] HPC Challenge Benchmark Suite, Online: http://icl.cs.utk.edu/hpcc/ . [9] JUBE Benchmark Suite, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html . [10] K. Hwang, G. C. Fox, J. J. Dongarra, ‘Distributed and Cloud Computing’, Book, Online: http://store.elsevier.com/product.jsp?locale=en_EU&isbn=9780128002049 . [11] The OpenMP API specification for parallel programming, Online: http://openmp.org/wp/openmp-specifications/

Lecture 1 – High Performance Computing 47 / 50 Lecture Bibliography (2)

. [12] The MPI Standard, Online: http://www.mpi-forum.org/docs/ . [13] OpenMPI Web page, Online: https://www.open-mpi.org/ . [14] Fran Berman, ‘Maximising the Potential of Research Data’ . [15] Big Data Tips – Big Data Mining & Machine Learning, Online: http://www.big-data.tips/ . [16] A. Szalay et al., ‘GrayWulf: Scalable Clustered Architecture for Data Intensive Computing’, Proceedings of the 42nd Hawaii International Conference on System Sciences – 2009 . [17] DEEP Projects Web page, Online: http://www.deep-projects.eu/ . [18] How EMI Contributed to the Higgs Boson Discovery, YouTube Video, Online: http://www.youtube.com/watch?v=FgcoLUys3RY&list=UUz8n-tukF1S7fql19KOAAhw . [19] PRACE – Introduction to Supercomputing, Online: https://www.youtube.com/watch?v=D94FJx9vxFA . [20] A. Lintermann & M. Riedel et al., ‘Enabling Interactive Supercomputing at JSC – Lessons Learned’, ISC 2019, Frankfurt, Germany, Online: https://www.researchgate.net/publication/330621591_Enabling_Interactive_Supercomputing_at_JSC_Lessons_Learned_ISC_High_Performance_2018_Intern ational_Workshops_FrankfurtMain_Germany_June_28_2018_Revised_Selected_Papers . [21] A. Streit & M. Riedel et al., ‘UNICORE 6 – Recent and Future Advancements ’, Online: https://www.researchgate.net/publication/225005053_UNICORE_6_-_recent_and_future_advancements . [22] Project Jupyter Web page, Online: https://jupyter.org/hub

Lecture 1 – High Performance Computing 48 / 50 Lecture Bibliography (3)

. [23] J. Haut, G. Cavallaro and M. Riedel et al., IEEE Transactions on Geoscience and Remote Sensing, 2019, Online: https://www.researchgate.net/publication/335181248_Cloud_Deep_Networks_for_Hyperspectral_Image_Analysis . [24] Apache Spark, Online: https://spark.apache.org/ . [25] YouTube Video, ‘Neural Network 3D Simulation‘, Online: https://www.youtube.com/watch?v=3JQ3hYko51Y . [26] A. Rosebrock, ‘Get off the deep learning bandwagon and get some perspective‘, Online: http://www.pyimagesearch.com/2014/06/09/get-deep-learning-bandwagon-get-perspective/ . [27] J. Lange, G. Cavallaro, M. Goetz, E. Erlingsson, M. Riedel, ‘The Influence of Sampling Methods on Pixel-Wise Hyperspectral Image Classification with 3D Convolutional Neural Networks’, Proceedings of the IGARSS 2018 Conference, Online: https://www.researchgate.net/publication/328991957_The_Influence_of_Sampling_Methods_on_Pixel- Wise_Hyperspectral_Image_Classification_with_3D_Convolutional_Neural_Networks . [28] G. Cavallaro, Y. Bazi, F. Melgani, M. Riedel, ‘Multi-Scale Convolutional SVM Networks for Multi-Class Classification Problems of Remote Sensing Images’, Proceedings of the IGARSS 2019 Conference, to appear

Lecture 1 – High Performance Computing 50 / 50